Bayesian Statistical Methods

CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Recently Published Titles

Theory of Stochastic Objects
Probability, Stochastic Processes and Inference
Athanasios Christou Micheas

Linear Models and the Relevant Distributions and Matrix Algebra
David A. Harville

An Introduction to Generalized Linear Models, Fourth Edition
Annette J. Dobson and Adrian G. Barnett

Graphics for Statistics and Data Analysis with R
Kevin J. Keen

Statistics in Engineering, Second Edition
With Examples in MATLAB and R
Andrew Metcalfe, David A. Green, Tony Greenfield, Mahayaudin Mansor, Andrew Smith,
and Jonathan Tuke

A Computational Approach to Statistical Learning
Taylor Arnold, Michael Kane, and Bryan W. Lewis

Introduction to Probability, Second Edition
Joseph K. Blitzstein and Jessica Hwang

Theory of Spatial Statistics
A Concise Introduction
M.N.M van Lieshout

Bayesian Statistical Methods
Brian J. Reich, Sujit K. Ghosh

Time Series
A Data Analysis Approach Using R
Robert H. Shumway, David S. Stoffer

The Analysis of Time Series
An Introduction, Seventh Edition
Chris Chatfield, Haipeng Xing

Probability and Statistics for Data Science
Math + R + Data
Norman Matloff

Sampling
Design and Analysis, Second Edition
Sharon L. Lohr

Practical Multivariate Analysis, Sixth Edition
Abdelmonem Afifi, Susanne May, Robin A. Donatello, Virginia A. Clark

For more information about this series, please visit: https://www.crcpress.com/go/textsseries


Bayesian Statistical
Methods

Brian J. Reich
Sujit K. Ghosh
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20190313

International Standard Book Number-13: 978-0-815-37864-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To Michelle, Sophie, Swagata, and Sabita
Contents

Preface

1 Basics of Bayesian inference
1.1 Probability background
1.1.1 Univariate distributions
1.1.1.1 Discrete distributions
1.1.1.2 Continuous distributions
1.1.2 Multivariate distributions
1.1.3 Marginal and conditional distributions
1.2 Bayes’ rule
1.2.1 Discrete example of Bayes’ rule
1.2.2 Continuous example of Bayes’ rule
1.3 Introduction to Bayesian inference
1.4 Summarizing the posterior
1.4.1 Point estimation
1.4.2 Univariate posteriors
1.4.3 Multivariate posteriors
1.5 The posterior predictive distribution
1.6 Exercises

2 From prior information to posterior inference
2.1 Conjugate priors
2.1.1 Beta-binomial model for a proportion
2.1.2 Poisson-gamma model for a rate
2.1.3 Normal-normal model for a mean
2.1.4 Normal-inverse gamma model for a variance
2.1.5 Natural conjugate priors
2.1.6 Normal-normal model for a mean vector
2.1.7 Normal-inverse Wishart model for a covariance matrix
2.1.8 Mixtures of conjugate priors
2.2 Improper priors
2.3 Objective priors
2.3.1 Jeffreys’ prior
2.3.2 Reference priors
2.3.3 Maximum entropy priors
2.3.4 Empirical Bayes
2.3.5 Penalized complexity priors
2.4 Exercises

3 Computational approaches
3.1 Deterministic methods
3.1.1 Maximum a posteriori estimation
3.1.2 Numerical integration
3.1.3 Bayesian central limit theorem (CLT)
3.2 Markov chain Monte Carlo (MCMC) methods
3.2.1 Gibbs sampling
3.2.2 Metropolis–Hastings (MH) sampling
3.3 MCMC software options in R
3.4 Diagnosing and improving convergence
3.4.1 Selecting initial values
3.4.2 Convergence diagnostics
3.4.3 Improving convergence
3.4.4 Dealing with large datasets
3.5 Exercises

4 Linear models
4.1 Analysis of normal means
4.1.1 One-sample/paired analysis
4.1.2 Comparison of two normal means
4.2 Linear regression
4.2.1 Jeffreys prior
4.2.2 Gaussian prior
4.2.3 Continuous shrinkage priors
4.2.4 Predictions
4.2.5 Example: Factors that affect a home’s microbiome
4.3 Generalized linear models
4.3.1 Binary data
4.3.2 Count data
4.3.3 Example: Logistic regression for NBA clutch free throws
4.3.4 Example: Beta regression for microbiome data
4.4 Random effects
4.5 Flexible linear models
4.5.1 Nonparametric regression
4.5.2 Heteroskedastic models
4.5.3 Non-Gaussian error models
4.5.4 Linear models with correlated data
4.6 Exercises

5 Model selection and diagnostics
5.1 Cross validation
5.2 Hypothesis testing and Bayes factors
5.3 Stochastic search variable selection
5.4 Bayesian model averaging
5.5 Model selection criteria
5.6 Goodness-of-fit checks
5.7 Exercises

6 Case studies using hierarchical modeling
6.1 Overview of hierarchical modeling
6.2 Case study 1: Species distribution mapping via data fusion
6.3 Case study 2: Tyrannosaurid growth curves
6.4 Case study 3: Marathon analysis with missing data
6.5 Exercises

7 Statistical properties of Bayesian methods
7.1 Decision theory
7.2 Frequentist properties
7.2.1 Bias-variance tradeoff
7.2.2 Asymptotics
7.3 Simulation studies
7.4 Exercises

Appendices
A.1 Probability distributions
A.2 List of conjugate priors
A.3 Derivations
A.4 Computational algorithms
A.5 Software comparison

Bibliography

Index
Preface

Bayesian methods are standard in various fields of science including biology,
engineering, finance and genetics, and thus they are an essential addition to an
analyst’s toolkit. In this book, we cover the material we deem indispensable
for a practicing Bayesian data analyst. The book covers the most common sta-
tistical methods including the t-test, multiple linear regression, mixed models
and generalized linear models from a Bayesian perspective and includes many
examples and code to implement the analyses. To illustrate the flexibility of
the Bayesian approach, the later chapters explore advanced topics such as
nonparametric regression, missing data and hierarchical models. In addition
to these important practical matters, we provide sufficient depth so that the
reader can defend his/her analysis and argue the relative merits of Bayesian
and classical methods.
The book is intended for a one-semester course for advanced
undergraduate and graduate students. At North Carolina State University
(NCSU) this book is used for a course composed of undergraduate statistics
majors, non-statistics graduate students from all over campus (e.g., engineering,
ecology and psychology) and students from the Master of Science in
Statistics program. Statistics PhD students take a separate course that covers
much of the same material but at a more theoretical and technical level. We
hope this book and associated computing examples also serve as a useful re-
source to practicing data analysts. Throughout the book we have included case
studies from several fields to illustrate the flexibility and utility of Bayesian
methods in practice.
It is assumed that the reader is familiar with undergraduate-level calculus,
including limits, integrals and partial derivatives, and some basic linear
algebra. Derivations of some key results are given in the main text when this
helps to communicate ideas, but the vast majority of derivations are relegated
to the Appendix for interested readers. Knowledge of introductory statistical
concepts through multiple regression would also be useful to contextualize the
material, but this background is not assumed and thus not required. Funda-
mental concepts are covered in detail but with references held to a minimum
in favor of clarity; advanced concepts are described concisely at a high level
with references provided for further reading.
The book begins with a review of probability in the first section of Chap-
ter 1. A solid understanding of this material is essential to proceed through
the book, but this section may be skipped for readers with the appropriate
background. The remainder of Chapter 1 through Chapter 5 form the core
of the book. Chapter 1 introduces the central concepts of and motivation
for Bayesian inference. Chapter 2 provides ways to select the prior distribution,
which is the genesis of a Bayesian analysis. For all but the most basic
methods, advanced computing tools are required, and Chapter 3 covers these
methods with the most weight given to Markov chain Monte Carlo which is
used for the remainder of the book. Chapter 4 applies these tools to common
statistical models including multiple linear regression, generalized linear mod-
els and mixed effects models (and more complex regression models in Section
4.5 which may be skipped if needed). After cataloging numerous statistical
methods in Chapter 4, Chapter 5 treats the problem of selecting an appro-
priate model for a given dataset and verifying and validating that the model
fits the data well. Chapter 6 introduces hierarchical modeling as a general
framework to extend Bayesian methods to complex problems, and illustrates
this approach using detailed case studies. The final chapter investigates the
theoretical properties of Bayesian methods, which is important to justify their
use but can be omitted if the course is intended for non-PhD students.
The choice of software is crucial for any modern textbook or statistics
course. We elected to use R as the software platform due to its immense pop-
ularity in the statistics community, and access to online tutorials and assis-
tance. Fortunately, there are now many options within R to conduct a Bayesian
analysis and we compare several including JAGS, BUGS, STAN and NIMBLE. We
selected the package JAGS as the primary package for no particularly strong
reason other than we have found it works well for the courses taught at our
university. In our assessment, JAGS provides the nice combination of ease of
use, speed and flexibility for the size and complexity of problems we consider.
A repository of code and datasets used in the book is available at
https://bayessm.org/.
Throughout the book we use R/JAGS, but favor conceptual discussions over
computational details, and these concepts transcend software. The course webpage
also includes LaTeX/Beamer slides.
We wish to thank our NCSU colleagues Kirkwood Cloud, Qian Guan,
Margaret Johnson, Ryan Martin, Krishna Pacifici and Ana-Maria Staicu for
providing valuable feedback. We also thank the students in Bayesian courses
taught at NCSU for their probing questions that helped shape the material
in the book. Finally, we thank our families and friends for their enduring
support, as exemplified by the original watercolor painting by Swagata Sarkar
that graces the front cover of the book.
1 Basics of Bayesian inference

1.1 Probability background


Understanding basic probability theory is essential to any study of statistics.
Generally speaking, the field of probability assumes a mathematical model for
the process of interest and uses the model to compute the probability of events
(e.g., what is the probability of flipping five straight heads using a fair coin?).
In contrast, the field of statistics uses data to refine the probability model
and test hypotheses related to the underlying process that generated the data
(e.g., given we observe five straight heads, can we conclude the coin is biased?).
Therefore, probability theory is a key ingredient to a statistical analysis, and in
this section we review the most relevant concepts of probability for a Bayesian
analysis.
Before developing probability mathematically, we briefly discuss probabil-
ity from a conceptual perspective. The objective is to compute the probability
of an event, A, denoted Prob(A). For example, we may be interested in the
probability that the random variable X (random variables are generally represented
with capital letters) takes the specific value x (lower-case letter),
denoted Prob(X = x), or the probability that X will fall in the interval [a, b],
denoted Prob(X ∈ [a, b]). There are two leading interpretations of this state-
ment: objective and subjective. An objective interpretation views Prob(A)
as a purely mathematical statement. A frequentist interpretation is that if
we repeated the experiment many times and recorded the sample proportion
of the times A occurred, this proportion would eventually converge to the
number Prob(A) ∈ [0, 1] as the number of samples increases. A subjective in-
terpretation is that Prob(A) represents an individual’s degree of belief, which
is often quantified in terms of the amount the individual would be willing to
wager that A will occur. As we will see, these two conceptual interpretations
of probability parallel the two primary statistical frameworks: frequentist and
Bayesian. However, a Bayesian analysis makes use of both of these concepts.
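
The frequentist interpretation can be illustrated with a small simulation. The sketch below is only an illustration in Python (standard library); the function name `proportion_five_heads` and the seed are assumptions made for this example. It repeats the coin experiment from the start of the section many times and compares the sample proportion of "five straight heads" with the value 0.5^5 = 1/32 implied by a fair-coin probability model.

```python
import random

def proportion_five_heads(n_experiments, seed=0):
    """Fraction of simulated experiments in which five fair flips are all heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # One experiment: five independent flips, each heads with probability 0.5.
        if all(rng.random() < 0.5 for _ in range(5)):
            hits += 1
    return hits / n_experiments

exact = 0.5 ** 5  # probability model: five independent fair flips, 1/32
for n in (100, 10_000, 200_000):
    print(n, proportion_five_heads(n), exact)
```

With more repetitions the printed proportion should settle near 1/32, which is the sense in which the proportion "converges" to Prob(A).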

1.1.1 Univariate distributions


The random variable X’s support S is the smallest set such that X ∈ S with
probability one. For example, if X is the number of successes in n trials then
S = {0, 1, ..., n}. Probability equations differ based on whether S is a countable
set: X is a discrete random variable if S is countable, and X is continuous
otherwise. Discrete random variables can have a finite (rainy days in the year)
or an infinite (number of lightning strikes in a year) number of possible outcomes
as long as the number is countable, e.g., a random count X ∈ S = {0, 1, 2, ...}
has an infinite but countable number of possible outcomes and is thus discrete.
An example of a continuous random variable is the amount of rain on a given
day, which can be any non-negative real number, and so S = [0, ∞).

1.1.1.1 Discrete distributions


For a discrete random variable, the probability mass function (PMF) f (x)
assigns a probability to each element of X’s support, that is,

Prob(X = x) = f (x). (1.1)

A PMF is valid if all probabilities are non-negative, f(x) ≥ 0, and sum to one,
$\sum_{x \in S} f(x) = 1$. The PMF can also be used to compute probabilities of more
complex events by summing over the PMF. For example, the probability that
X is either x1 or x2, i.e., X ∈ {x1, x2}, is

Prob(X = x1 or X = x2) = f(x1) + f(x2). (1.2)

Generally, the probability of the event that X falls in a set S′ ⊂ S is the sum
over elements in S′,

$$ \mathrm{Prob}(X \in S') = \sum_{x \in S'} f(x). \tag{1.3} $$

Using this fact defines the cumulative distribution function (CDF)

$$ F(x) = \mathrm{Prob}(X \le x) = \sum_{c \le x} f(c). \tag{1.4} $$

A PMF is a function from the support of X to the probability of events. It
is often useful to summarize the function using a few interpretable quantities
such as the mean and variance. The expected value or mean value is

$$ E(X) = \sum_{x \in S} x f(x) \tag{1.5} $$

and measures the center of the distribution. The variance measures the spread
around the mean via the expected squared deviation from the center of the
distribution,

$$ \mathrm{Var}(X) = E\{[X - E(X)]^2\} = \sum_{x \in S} [x - E(X)]^2 f(x). \tag{1.6} $$

The variance is often converted to the standard deviation $\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$
to express the variability on the same scale as the random variable.
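
The sums in (1.3) through (1.6) are easy to evaluate directly for a small support. The following Python sketch uses a made-up PMF on S = {0, 1, 2, 3}; the probabilities and the helper names (`is_valid_pmf`, `prob_in`, `cdf`, `mean`, `variance`) are assumptions chosen for illustration only.

```python
import math

# Hypothetical PMF on the support S = {0, 1, 2, 3}; the values are made up.
pmf = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}

def is_valid_pmf(f, tol=1e-12):
    """Non-negative probabilities that sum to one."""
    return all(p >= 0 for p in f.values()) and abs(sum(f.values()) - 1) < tol

def prob_in(f, subset):
    """Prob(X in S') as in (1.3): sum the PMF over the elements of S'."""
    return sum(f[x] for x in subset)

def cdf(f, x):
    """F(x) = Prob(X <= x) as in (1.4)."""
    return sum(p for c, p in f.items() if c <= x)

def mean(f):
    """E(X) as in (1.5)."""
    return sum(x * p for x, p in f.items())

def variance(f):
    """Var(X) as in (1.6): expected squared deviation from the mean."""
    m = mean(f)
    return sum((x - m) ** 2 * p for x, p in f.items())

print(is_valid_pmf(pmf), prob_in(pmf, {1, 3}), cdf(pmf, 1))
print(mean(pmf), variance(pmf), math.sqrt(variance(pmf)))
```

For this PMF the mean is 1.6 and the variance is 0.84, so the standard deviation is about 0.92 on the same scale as X.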
The central concept of statistics is that the PMF and its summaries (such
as the mean) describe the population of interest, and a statistical analysis uses
a sample from the population to estimate these functions of the population.
For example, we might take a sample of size n from the population. Denote the
ith sample value as Xi ∼ f (“∼” means “is distributed as”), and X1 , ..., Xn
as the complete sample. We might then approximate the population mean
E(X) with the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, the probability of an outcome
f (x) with the sample proportion of the n observations that equal x, and the
entire PMF f (x) with a sample histogram. However, even for a large sample,
X̄ will likely not equal E(X), and if we repeat the sampling procedure again
we might get a different X̄ while E(X) does not change. The distribution of
a statistic, i.e., a summary of the sample, such as X̄ across random samples
from the population is called the statistic’s sampling distribution.
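
A sampling distribution can be approximated by simulation. The sketch below (Python, standard library) treats a made-up PMF as the population, draws many samples of size n = 50, and records the sample mean of each; the population values, sample size, number of repetitions and seed are all assumptions for illustration.

```python
import random
import statistics

# Hypothetical population PMF with E(X) = 1.6 and Var(X) = 0.84.
support = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]

def sample_mean(n, rng):
    """Draw X_1, ..., X_n from the population and return the sample mean."""
    draws = rng.choices(support, weights=probs, k=n)
    return sum(draws) / n

rng = random.Random(1)
means = [sample_mean(50, rng) for _ in range(2000)]

# Each sample gives a different sample mean while E(X) = 1.6 stays fixed.
print(means[:3])
# Center and spread of the approximate sampling distribution of the sample mean.
print(statistics.mean(means), statistics.stdev(means))
```

The individual sample means vary from sample to sample, but they cluster around the population mean of 1.6.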
A statistical analysis to infer about the population from a sample often
proceeds under the assumption that the population belongs to a paramet-
ric family of distributions. This is called a parametric statistical analysis. In
this type of analysis, the entire PMF is assumed to be known up to a few
unknown parameters denoted θ = (θ1 , ..., θp ) (or simply θ if there is only
p = 1 parameter). We then denote the PMF as f (x|θ). The vertical bar “|”
is read as “given,” and so f (x|θ) gives the probability that X = x given
the parameters θ. For example, a common parametric model for count data
with S = {0, 1, 2, ...} is the Poisson family. The Poisson PMF with unknown
parameter θ > 0 is

$$ \mathrm{Prob}(X = x \mid \theta) = f(x \mid \theta) = \frac{\exp(-\theta)\,\theta^{x}}{x!}. \tag{1.7} $$
Clearly f(x|θ) > 0 for all x ∈ S and it can be shown that $\sum_{x=0}^{\infty} f(x \mid \theta) = 1$
so this is a valid PMF. As shown in Figure 1.1, changing the parameter θ
changes the PMF and so the Poisson is not a single distribution but rather a
family of related distributions indexed by θ.

FIGURE 1.1
Poisson probability mass function. Plot of the PMF $f(x \mid \theta) = \exp(-\theta)\theta^{x}/x!$
for θ = 2 and θ = 4 (horizontal axis: x from 0 to 12; vertical axis: PMF). The PMF
is connected by lines for visualization, but the probabilities are only defined
for x ∈ {0, 1, 2, ...}.

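
The validity of the Poisson PMF in (1.7) can be checked numerically. This Python sketch truncates the infinite sum at x = 59, which is an assumption that is harmless for the small values of θ used here because the omitted terms are negligible; the function name `poisson_pmf` is chosen for illustration.

```python
import math

def poisson_pmf(x, theta):
    """Poisson PMF from (1.7): exp(-theta) * theta^x / x!."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

for theta in (2, 4):
    xs = range(60)  # truncation of the support {0, 1, 2, ...}
    total = sum(poisson_pmf(x, theta) for x in xs)
    mean = sum(x * poisson_pmf(x, theta) for x in xs)
    # The probabilities sum to (numerically) one for each member of the family.
    print(theta, total, mean, [round(poisson_pmf(x, theta), 3) for x in range(4)])
```

Changing θ changes every probability, which is what it means for the Poisson to be a family of distributions indexed by θ.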
A parametric assumption greatly simplifies the analysis because we only
have to estimate a few parameters and we can compute the probability of any
x in S. Of course, this assumption is only useful if the assumed distribution
provides a reasonable fit to the observed data, and thus a statistician needs
a large catalog of distributions to be able to find an appropriate family for a
given analysis. Appendix A.1 provides a list of parametric distributions, and
we discuss a few discrete distributions below.
Bernoulli: If X is binary, i.e., S = {0, 1}, then X follows a Bernoulli(θ)
distribution. A binary random variable is often used to model the result of a
trial where a success is recorded as a one and a failure as zero. The parameter
θ ∈ [0, 1] is the success probability Prob(X = 1|θ) = θ, and to be a valid PMF
we must have the failure probability Prob(X = 0|θ) = 1 − θ. These two cases
can be written concisely as

$$ f(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}. \tag{1.8} $$

This gives mean

$$ E(X \mid \theta) = \sum_{x=0}^{1} x f(x \mid \theta) = f(1 \mid \theta) = \theta \tag{1.9} $$

and variance Var(X|θ) = θ(1 − θ).


Binomial: The binomial distribution is a generalization of the Bernoulli to
the case of n ≥ 1 independent trials. Specifically, if X1, ..., Xn are the binary
results of the n independent trials each with success probability θ (so that
Xi ∼ Bernoulli(θ) for all i = 1, ..., n) and $X = \sum_{i=1}^{n} X_i$ is the total number of
successes, then X’s support is S = {0, 1, ..., n} and X ∼ Binomial(n, θ). The
PMF is

$$ f(x \mid \theta) = \binom{n}{x} \theta^{x} (1 - \theta)^{n - x}, \tag{1.10} $$

where $\binom{n}{x}$ is the binomial coefficient. This gives mean and variance E(X|θ) =
nθ and Var(X|θ) = nθ(1 − θ). It is certainly reasonable that the expected
number of successes in n trials is n times the success probability for each
trial, and Figure 1.2 shows that the variance is maximized with θ = 0.5 when
the outcome of each trial is the least predictable. Appendix A.1 provides other
distributions with support S = {0, 1, ..., n} including the discrete uniform and
beta-binomial distributions.

FIGURE 1.2
Binomial probability mass function. Plot of the PMF $f(x \mid \theta) = \binom{n}{x}\theta^{x}(1 - \theta)^{n - x}$
for combinations of the number of trials n and the success probability
θ (panels: n = 10 and n = 100; curves: θ = 0.1, 0.5 and 0.9; horizontal axis: x;
vertical axis: PMF). The PMF is connected by lines for visualization, but the
probabilities are only defined for x ∈ {0, 1, ..., n}.

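
The binomial mean and variance formulas can be verified by summing over the PMF in (1.10). The Python sketch below is an illustration only; the helper names `binomial_pmf` and `summarize` are assumptions, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x, n, theta):
    """Binomial PMF from (1.10)."""
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def summarize(n, theta):
    """Mean and variance computed by summing over the support {0, ..., n}."""
    xs = range(n + 1)
    mean = sum(x * binomial_pmf(x, n, theta) for x in xs)
    var = sum((x - mean) ** 2 * binomial_pmf(x, n, theta) for x in xs)
    return mean, var

for theta in (0.1, 0.5, 0.9):
    mean, var = summarize(10, theta)
    # Compare with the closed forms n*theta and n*theta*(1 - theta).
    print(theta, mean, 10 * theta, var, 10 * theta * (1 - theta))
```

With n = 1 the same function reduces to the Bernoulli PMF in (1.8), and among the three values of θ the variance is largest at θ = 0.5.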
Poisson: When the support is the counting numbers S = {0, 1, 2, ...},
a common model is the Poisson PMF defined above (and plotted in Figure
1.1). The Poisson PMF can be motivated as the distribution of the number of
events that occur in a time interval of length T if the events are independent
and equally likely to occur at any time with rate θ/T events per unit of
time. The mean and variance are E(X|θ) = Var(X|θ) = θ. Assuming that
the mean and variance are equal is a strong assumption, and Appendix A.1
provides alternatives with support S = {0, 1, 2, ...} including the geometric
and negative binomial distributions.

1.1.1.2 Continuous distributions


The PMF does not apply for continuous distributions with support S that is
a subinterval of the real line. To see this, assume that X is the daily rainfall
(inches) and can thus be any non-negative real number, S = [0, ∞). What is
the probability that X is exactly π/2 inches? Well, within some small range
around π/2, T = (π/2 − ε, π/2 + ε) with say ε = 0.001, it seems reasonable
to assume that all values in T are equally likely, say Prob(X = x) = q
for all x ∈ T . But since there are an uncountable number of values in T
when we sum the probability over the values in T we get infinity and thus
the probabilities are invalid unless q = 0. Therefore, for continuous random
variables Prob(X = x) = 0 for all x and we must use a more sophisticated
method for assigning probabilities.
Instead of defining the probability of outcomes directly using a PMF, for
continuous random variables we define probabilities indirectly through the
cumulative distribution function (CDF)

F (x) = Prob(X ≤ x). (1.11)

The CDF can be used to compute probabilities for any interval, e.g., in the
rain example Prob(X ∈ T) = Prob(X < π/2 + ε) − Prob(X < π/2 − ε) =
F(π/2 + ε) − F(π/2 − ε), which converges to zero as ε shrinks if F is a continuous
function. Defining the probability of X falling in an interval resolves
the conceptual problems discussed above, because it is easy to imagine the
proportion of days with rainfall in an interval converging to a non-zero value
as the sample size increases.
The probability on a small interval is

$$ \mathrm{Prob}(x - \epsilon < X < x + \epsilon) = F(x + \epsilon) - F(x - \epsilon) \approx 2\epsilon f(x) \tag{1.12} $$

where f (x) is the derivative of F (x) and is called the probability density
function (PDF). If f (x) is the PDF of X, then the probability of X ∈ [a, b] is
the area under the density curve between a and b (Figure 1.3),
$$ \mathrm{Prob}(a \le X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\,dx. \tag{1.13} $$

The distributions of random variables are usually defined via the PDF. To
ensure that the PDF produces valid probability statements we must have
f(x) ≥ 0 for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Because f(x) is not a probability, but
rather a function used to compute probabilities via integration, the PDF can be
greater than one for some x so long as it integrates to one.
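
The relationships in (1.12) and (1.13) can be checked numerically. The Python sketch below uses `statistics.NormalDist` from the standard library as a stand-in continuous distribution; the choice of a normal distribution, the evaluation points and the midpoint-rule helper `integrate` are assumptions for illustration.

```python
from statistics import NormalDist

def integrate(f, a, b, m=10_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))

X = NormalDist(mu=0.0, sigma=1.0)

# (1.12): probability of a small interval is roughly 2 * eps * f(x).
x, eps = 0.7, 0.001
exact = X.cdf(x + eps) - X.cdf(x - eps)
approx = 2 * eps * X.pdf(x)
print(exact, approx)

# (1.13): area under the PDF between a and b equals F(b) - F(a).
area = integrate(X.pdf, -1.0, 2.0)
print(area, X.cdf(2.0) - X.cdf(-1.0))

# A density can exceed one and still integrate to one.
narrow = NormalDist(mu=0.0, sigma=0.1)
print(narrow.pdf(0.0), integrate(narrow.pdf, -1.0, 1.0))
```

The narrow density has height close to 4 at its center, yet the area under the curve is still one.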
FIGURE 1.3
Computing probabilities using a PDF. The curve is a PDF f(x) and the
shaded area is $\mathrm{Prob}(1 < X < 5) = \int_{1}^{5} f(x)\,dx$ (horizontal axis: x from −5 to 10;
vertical axis: f(x)).

The formulas for the mean and variance of a continuous random variable
resemble those for a discrete random variable except that the sum over the
PMF is replaced with an integral over the PDF,

$$ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx $$

$$ \mathrm{Var}(X) = E\{[X - E(X)]^2\} = \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx. $$

Another summary that is defined for both discrete and continuous random
variables (but is much easier to define in the continuous case) is the quantile
function Q(τ ). For τ ∈ [0, 1], Q(τ ) is the solution to

Prob[X ≤ Q(τ )] = F [Q(τ )] = τ. (1.14)

That is, Q(τ) is the value so that the probability of X being no larger than
Q(τ) is τ. The quantile function is the inverse of the distribution function,
$Q(\tau) = F^{-1}(\tau)$, and gives the median Q(0.5) and a 100(1 − α)% equal-tailed
interval [Q(α/2), Q(1 − α/2)] so that Prob[Q(α/2) ≤ X ≤ Q(1 − α/2)] = 1 − α.
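
The quantile function can be computed by solving (1.14) numerically. The Python sketch below is an illustration under assumed values (a normal distribution with mean 10 and standard deviation 2, α = 0.05); the bisection helper `quantile` is written for this example and is compared against the built-in `NormalDist.inv_cdf`.

```python
from statistics import NormalDist

def quantile(cdf, tau, lo, hi, tol=1e-10):
    """Solve F(q) = tau by bisection, assuming F(lo) < tau < F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

X = NormalDist(mu=10.0, sigma=2.0)
alpha = 0.05

median = X.inv_cdf(0.5)
lower, upper = X.inv_cdf(alpha / 2), X.inv_cdf(1 - alpha / 2)
coverage = X.cdf(upper) - X.cdf(lower)  # should equal 1 - alpha

print(median, (lower, upper), coverage)
print(quantile(X.cdf, 1 - alpha / 2, -100.0, 100.0), upper)
```

Bisection only needs the CDF, so the same helper applies to any continuous distribution whose quantile function has no closed form.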
Gaussian: As with discrete data, parametric models are typically assumed
for continuous data and practitioners must be familiar with several parametric
families. The most common parametric family with support S = (−∞, ∞) is
the normal (Gaussian) family. The normal distribution has two parameters,
the mean E(X|θ) = µ and variance Var(X|θ) = σ², and the familiar bell-shaped PDF

$$f(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}, \tag{1.15}$$
where θ = (µ, σ²). The Gaussian distribution is famous because of the central
limit theorem (CLT). The CLT applies to the distribution of the sample mean
$\bar{X}_n = \sum_{i=1}^{n} X_i / n$, where X1 , ..., Xn are independent samples from some
distribution f (x). The CLT says that under fairly general conditions, for large n
the distribution of X̄n is approximately normal even if f (x) is not. Therefore,
the Gaussian distribution is a natural model for data that are defined
as averages, but can be used for other data as well. Appendix A.1 gives other
continuous distributions with S = (−∞, ∞) including the double exponential
and Student-t distributions.

FIGURE 1.4
Plots of the gamma PDF. Plots of the gamma density function
$f(x|\theta) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-bx)$ for several combinations of a and b.
Left panel: b = 1; right panel: b = 5; each panel shows a = 1, a = 5 and a = 10.
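As a rough illustration of the CLT described above, the following simulation sketch draws repeated samples from a skewed Exponential(1) distribution and examines the sample means. The sample size, number of replications, seed and the choice of the exponential distribution are all assumptions made here for illustration.

```python
import random
import statistics

random.seed(1)        # fixed seed so the sketch is reproducible
n, reps = 50, 5000    # illustrative sample size and number of replications

# Each replication: the sample mean of n draws from a skewed Exponential(1)
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

center = statistics.fmean(means)   # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near 1 / sqrt(n)

# For a normal distribution about 95% of values lie within 1.96 standard deviations
share = sum(abs(m - center) < 1.96 * spread for m in means) / reps
```

If the sample means are approximately normal, `share` should be close to 0.95 even though the individual draws are far from Gaussian.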
Gamma: The gamma distribution has S = [0, ∞). The PDF is

$$f(x|\theta) = \begin{cases} \dfrac{b^a}{\Gamma(a)}\, x^{a-1} \exp(-bx) & x \ge 0 \\ 0 & x < 0 \end{cases} \tag{1.16}$$

where Γ is the gamma function and a > 0 and b > 0 are the two parameters
in θ = (a, b). Beware that the gamma PDF is also written with b in the de-
nominator of the exponential function, but we use the parameterization above.
Under the parameterization in (1.16) the mean is a/b and the variance is a/b2 .
As shown in Figure 1.4, a is the shape parameter and b is the rate (inverse scale) parameter. Setting
a = 1 gives the exponential distribution with PDF f (x|θ) = b exp(−bx) which
decays exponentially from the origin and large a gives approximately a normal
distribution. Varying b does not change the shape of the PDF but only affects
its spread. Appendix A.1 gives other continuous distributions with S = [0, ∞)
including the inverse-gamma distribution.
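The stated mean a/b and variance a/b² can be checked numerically. The sketch below assumes the standard rate parameterization of the gamma PDF, with x raised to the power a − 1, and uses illustrative values a = 5 and b = 2 that are not taken from the text.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma PDF with shape a and rate b (b multiplies x in the exponent)."""
    if x < 0:
        return 0.0
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def integrate(f, lower, upper, n=200_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(f(lower + (i + 0.5) * h) for i in range(n))

a, b = 5.0, 2.0   # illustrative parameter values
upper = 40.0      # the probability beyond this point is negligible here

total = integrate(lambda x: gamma_pdf(x, a, b), 0.0, upper)
mean = integrate(lambda x: x * gamma_pdf(x, a, b), 0.0, upper)
var = integrate(lambda x: (x - mean) ** 2 * gamma_pdf(x, a, b), 0.0, upper)
```

The parameterization warning matters in software as well: Python's `random.gammavariate(alpha, beta)`, for example, treats `beta` as a scale, so a rate b corresponds to `beta = 1 / b`.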
FIGURE 1.5
Plots of the beta PDF. Plots of the beta density function
$f(x|\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1}$ for several a and b.
Left panel: (a, b) = (0.5, 0.5), (1, 1) and (10, 10); right panel: (a, b) = (1, 10), (2, 5), (5, 2) and (10, 1).

Beta: The beta distribution has S = [0, 1] and PDF

$$f(x|\theta) = \begin{cases} \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1 - x)^{b-1} & x \in [0, 1] \\ 0 & x < 0 \text{ or } x > 1, \end{cases} \tag{1.17}$$

where a > 0 and b > 0 are the two parameters in θ = (a, b). As shown
in Figure 1.5, the beta distribution is quite flexible and can be left-skewed,
right-skewed, or symmetric. The beta distribution also includes the uniform
distribution f (x) = 1 for x ∈ [0, 1] by setting a = b = 1.

1.1.2 Multivariate distributions


Most statistical analyses involve multiple variables with the objective of study-
ing relationships between variables. To model relationships between variables
we need multivariate extensions of mass and density functions. Let X1 , ..., Xp
be p random variables, Sj be the support of Xj so that Xj ∈ Sj , and
X = (X1 , ..., Xp ) be the random vector. Table 1.1 describes the joint distribution
between p = 2 variables: X1 indicates that the patient has a positive primary
health outcome and X2 indicates that the patient has a side effect. If all variables
are discrete, then the joint PMF is

Prob(X1 = x1 , ..., Xp = xp ) = f (x1 , ..., xp ). (1.18)

To be a valid PMF, f must be non-negative, $f(x_1, ..., x_p) \ge 0$, and sum to one,
$\sum_{x_1 \in S_1} \cdots \sum_{x_p \in S_p} f(x_1, ..., x_p) = 1$.
TABLE 1.1
Hypothetical joint PMF. This PMF f (x1 , x2 ) gives the probabilities that a
patient has a positive (X1 = 1) or negative (X1 = 0) primary health outcome
and the patient having (X2 = 1) or not having (X2 = 0) a negative side effect.

              X2 = 0    X2 = 1    f1 (x1 )
X1 = 0         0.06      0.14      0.20
X1 = 1         0.24      0.56      0.80
f2 (x2 )       0.30      0.70

As in the univariate case, probabilities for continuous random variables are
computed indirectly via the PDF f (x1 , ..., xp ). In the univariate case, probabilities
are computed as the area under the density curve. For p > 1, probabilities
are computed as the volume under the density surface. For example, for p = 2
random variables, the probability of a1 < X1 < b1 and a2 < X2 < b2 is
$$\int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x_1, x_2)\,dx_2\,dx_1. \tag{1.19}$$

This gives the probability of the random vector X = (X1 , X2 ) lying in the
rectangle defined by the endpoints a1 , b1 , a2 , and b2 . In general, the probability
of the random vector X falling in region A is the p-dimensional
integral $\int_A f(x_1, ..., x_p)\,dx_1 \cdots dx_p$. As an example, consider the bivariate
PDF on the unit square with f (x1 , x2 ) = 1 for X with x1 ∈ [0, 1] and
x2 ∈ [0, 1] and f (x1 , x2 ) = 0 otherwise. Then
$\mathrm{Prob}(X_1 < 0.5, X_2 < 0.1) = \int_0^{0.5} \int_0^{0.1} f(x_1, x_2)\,dx_2\,dx_1 = 0.05$.

1.1.3 Marginal and conditional distributions


The marginal and conditional distributions that follow from a multivariate
distribution are key to a Bayesian analysis. To introduce these concepts we
assume discrete random variables, but extensions to the continuous case are
straightforward by replacing sums with integrals. Further, we assume a bivari-
ate PMF with p = 2. Again, extensions to high dimensions are conceptually
straightforward by replacing sums over one or two dimensions with sums over
p − 1 or p dimensions.
The marginal distribution of Xj is simply the distribution of Xj if we
consider only a univariate analysis of Xj and disregard the other variable.
Denote fj (xj ) = Prob(Xj = xj ) as the marginal PMF of Xj . The marginal
distribution is computed by summing over the other variable in the joint PMF
$$f_1(x_1) = \sum_{x_2} f(x_1, x_2) \quad \text{and} \quad f_2(x_2) = \sum_{x_1} f(x_1, x_2). \tag{1.20}$$

These are referred to as the marginal distributions because in a two-way table
such as Table 1.1, the marginal distributions are the row and column totals
of the joint PMF that appear along the table’s margins.
As with any univariate distribution, the marginal distribution can be sum-
marized with its mean and variance,
$$\mu_j = E(X_j) = \sum_{x_j \in S_j} x_j f_j(x_j) = \sum_{x_1} \sum_{x_2} x_j f(x_1, x_2)$$

$$\sigma_j^2 = \mathrm{Var}(X_j) = \sum_{x_j} (x_j - \mu_j)^2 f_j(x_j) = \sum_{x_1} \sum_{x_2} (x_j - \mu_j)^2 f(x_1, x_2).$$

The marginal mean and variance measure the center and spread of the
marginal distributions, respectively, but do not capture the relationship be-
tween the two variables. The most common one-number summary of the joint
relationship is covariance, defined as

$$\sigma_{12} = \mathrm{Cov}(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = \sum_{x_1} \sum_{x_2} (x_1 - \mu_1)(x_2 - \mu_2) f(x_1, x_2). \tag{1.21}$$

The covariance is often hard to interpret because it depends on the scale of
both X1 and X2 , i.e., if we double X1 we double the covariance. Correlation is
a scale-free summary of the joint relationship, $\rho_{12} = \sigma_{12} / (\sigma_1 \sigma_2)$. In vector notation,
the mean of the random vector X is $E(\mathbf{X}) = (\mu_1, \mu_2)^T$, the covariance matrix is

$$\mathrm{Cov}(\mathbf{X}) = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix},$$

and the correlation matrix is

$$\mathrm{Cor}(\mathbf{X}) = \begin{pmatrix} 1 & \rho_{12} \\ \rho_{12} & 1 \end{pmatrix}.$$

Generalizing for p > 2, the mean vector becomes $E(\mathbf{X}) = (\mu_1, ..., \mu_p)^T$ and the
covariance matrix becomes the symmetric p × p matrix with diagonal elements
$\sigma_1^2, ..., \sigma_p^2$ and (i, j) off-diagonal element $\sigma_{ij}$.
While the marginal distributions sum over columns or rows of a two-way
table, the conditional distributions focus only on a single column or row. The
conditional distribution of X1 given that the random variable X2 is fixed at
x2 is denoted f1|2 (x1 |X2 = x2 ) or simply f (x1 |x2 ). Referring to Table 1.1, the
knowledge that X2 = x2 restricts the domain to a single column of the two-way
table. However, the probabilities in a single column do not define a valid
PMF because their sum is less than one. We must rescale these probabilities
to sum to one by dividing by the column total, which we have previously defined
as f2 (x2 ). Therefore, the general expression for the conditional distributions
of X1 |X2 = x2 , and similarly X2 |X1 = x1 , is

$$f_{1|2}(x_1 | X_2 = x_2) = \frac{f(x_1, x_2)}{f_2(x_2)} \quad \text{and} \quad f_{2|1}(x_2 | X_1 = x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}. \tag{1.22}$$
TABLE 1.2
Table of the Atlantic hurricanes that made landfall between 1990
and 2016. The counts are tabulated by their maximum Saffir–Simpson intensity
category and whether they made landfall in the US or elsewhere. The
counts are downloaded from http://www.aoml.noaa.gov/hrd/hurdat/.

(a) Counts

                        Category
             1       2       3       4       5     Total
US          14      13      10       1       1        39
Not US      46      19      20      17       3       105
Total       60      32      30      18       4       144

(b) Sample proportions

                               Category
             1        2        3        4        5       Total
US        0.0972   0.0903   0.0694   0.0069   0.0069   0.2708
Not US    0.3194   0.1319   0.1389   0.1181   0.0208   0.7292
Total     0.4167   0.2222   0.2083   0.1250   0.0278   1.0000

Atlantic hurricanes example: Table 1.2 provides the counts of Atlantic
hurricanes that made landfall between 1990 and 2016 tabulated by
their intensity category (1–5) and whether they hit the US or elsewhere.
Of course, these are only sample proportions and not true probabilities,
but for this example we treat Table 1.2b as the joint PMF of location,
X1 ∈ {US, Not US}, and intensity category, X2 ∈ {1, 2, 3, 4, 5}. The marginal
distribution of X1 is given in the final column and is simply the row sums
of the joint PMF. The marginal probability of a hurricane making landfall
in the US is Prob(X1 = US) = 0.2708, which is the proportion calculated
as if we had never considered the storms’ category. Similarly, the column
sums are the marginal probability of intensity averaging over location, e.g.,
Prob(X2 = 5) = 0.0278.
The conditional distributions tell us about the relationship between the
two variables. For example, the marginal probability of a hurricane reaching
category 5 is 0.0278, but given that the storm hits the US, the conditional
probability is slightly lower, f2|1 (5|US) = Prob(X1 = US, X2 = 5)/Prob(X1 =
US) = 0.0069/0.2708 = 0.0255. By definition, the conditional probabilities
sum to one (up to rounding),

$$f_{2|1}(1|\mathrm{US}) = \frac{0.0972}{0.2708} = 0.3589, \quad f_{2|1}(2|\mathrm{US}) = \frac{0.0903}{0.2708} = 0.3334,$$

$$f_{2|1}(3|\mathrm{US}) = \frac{0.0694}{0.2708} = 0.2562, \quad f_{2|1}(4|\mathrm{US}) = \frac{0.0069}{0.2708} = 0.0255,$$

$$f_{2|1}(5|\mathrm{US}) = \frac{0.0069}{0.2708} = 0.0255.$$
Given that a storm hits the US, the probability of a category 2 or 3 storm
increases, while the probability of a category 4 or 5 storm decreases, and so
there is a relationship between landfall location and intensity.
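The marginal and conditional calculations in this example can be reproduced directly from the counts in Table 1.2a. The following is a minimal sketch. The variable names are illustrative, and because it works from unrounded counts rather than the rounded proportions in Table 1.2b, some values differ from those above in the fourth decimal place (for example, 1/39 ≈ 0.0256 rather than 0.0255).

```python
# Counts from Table 1.2a: rows are landfall location, columns are categories 1-5
counts = {
    "US": [14, 13, 10, 1, 1],
    "Not US": [46, 19, 20, 17, 3],
}
total = sum(sum(row) for row in counts.values())

# Joint PMF f(x1, x2): the sample proportions of Table 1.2b
joint = {loc: [c / total for c in row] for loc, row in counts.items()}

# Marginals: row sums give f1(x1) and column sums give f2(x2), as in (1.20)
f1 = {loc: sum(row) for loc, row in joint.items()}
f2 = [sum(joint[loc][k] for loc in joint) for k in range(5)]

# Conditional PMF of category given US landfall, as in (1.22)
cond_us = [p / f1["US"] for p in joint["US"]]
```

Comparing `cond_us` with `f2` entry by entry shows the pattern described in the text: categories 2 and 3 become more likely given US landfall, and categories 4 and 5 less likely.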

Independence example: Consider the joint PMF of the primary health
outcome (X1 ) and side effect (X2 ) in Table 1.1. In this example, the marginal
probability of a positive primary health outcome is Prob(X1 = 1) = 0.80,
as is the conditional probability f1|2 (1|X2 = 1) = 0.56/0.70 = 0.80 given
the patient has the side effect and the conditional probability f1|2 (1|0) =
0.24/0.30 = 0.80 given the patient does not have the side effect. In other
words, both with and without knowledge of side effect status, the probability
of a positive health outcome is 0.80, and thus side effect is not informative
about the primary health outcome. This is an example of two random variables
that are independent.
Generally, X1 and X2 are independent if and only if the joint PMF (or
PDF for continuous random variables) factors into the product of the marginal
distributions,
f (x1 , x2 ) = f1 (x1 )f2 (x2 ). (1.23)
From this expression it is clear that if X1 and X2 are independent then
f1|2 (x1 |X2 = x2 ) = f (x1 , x2 )/f2 (x2 ) = f1 (x1 ), and thus X2 is not informative
about X1 . A special case of joint independence is if all marginal distributions
are the same, fj (x) = f (x) for all j and x. In this case, we say that X1 and X2
are independent and identically distributed (“iid”), which is denoted $X_j \overset{iid}{\sim} f$.
If variables are not independent then they are said to be dependent.
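The factorization in (1.23) is easy to verify for the joint PMF in Table 1.1. The sketch below, with illustrative variable names, checks every cell against the product of its marginals and also computes the covariance from (1.21), which is zero for independent variables.

```python
# Joint PMF from Table 1.1, indexed as joint[x1][x2]
joint = [[0.06, 0.14],
         [0.24, 0.56]]

f1 = [sum(row) for row in joint]                                  # marginal PMF of X1
f2 = [sum(joint[x1][x2] for x1 in range(2)) for x2 in range(2)]   # marginal PMF of X2

# Independence (1.23): each cell equals the product of its two marginals
independent = all(
    abs(joint[x1][x2] - f1[x1] * f2[x2]) < 1e-12
    for x1 in range(2) for x2 in range(2)
)

# Covariance (1.21)
mu1 = sum(x1 * f1[x1] for x1 in range(2))
mu2 = sum(x2 * f2[x2] for x2 in range(2))
cov = sum((x1 - mu1) * (x2 - mu2) * joint[x1][x2]
          for x1 in range(2) for x2 in range(2))
```

A small tolerance is used in the comparison because the table entries are stored as floating-point numbers.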
Multinomial: Parametric families are also useful for multivariate distribu-
tions. A common parametric family for discrete data is the multinomial, which,
as the name implies, is a generalization of the binomial. Consider the case of n
independent trials where each trial results in one of p possible outcomes (e.g.,
p = 3 and each result is either win, lose or draw). Let Xj ∈ {0, 1, ..., n} be
the number of trials that resulted in outcome j, and X = (X1 , ..., Xp ) be the
vector of counts. If we assume that θj is the probability of outcome j for each
trial, with θ = (θ1 , ..., θp ) and $\sum_{j=1}^{p} \theta_j = 1$, then X|θ ∼ Multinomial(n, θ)
with

$$f(x_1, ..., x_p) = \frac{n!}{x_1! \cdots x_p!}\, \theta_1^{x_1} \cdots \theta_p^{x_p} \tag{1.24}$$

where $n = \sum_{j=1}^{p} x_j$. If there are only p = 2 categories then this would be a
binomial experiment X2 ∼ Binomial(n, θ2 ).
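The multinomial PMF (1.24) is straightforward to evaluate. The sketch below is one possible implementation. The function name and the win/lose/draw probabilities are hypothetical values chosen for illustration. It also checks numerically that the p = 2 case agrees with the binomial PMF.

```python
import math

def multinomial_pmf(x, theta):
    """Multinomial PMF (1.24) for counts x and outcome probabilities theta."""
    n = sum(x)
    coef = math.factorial(n)
    for xj in x:
        coef //= math.factorial(xj)   # builds n! / (x1! * ... * xp!) exactly
    prob = float(coef)
    for xj, tj in zip(x, theta):
        prob *= tj ** xj
    return prob

# Hypothetical win/lose/draw probabilities, for illustration only
theta = [0.5, 0.3, 0.2]
p_example = multinomial_pmf([2, 1, 1], theta)   # 2 wins, 1 loss, 1 draw in n = 4 trials

# With p = 2 categories the PMF reduces to Binomial(n, theta_2)
n, x2, t2 = 10, 3, 0.3
binom = math.comb(n, x2) * t2 ** x2 * (1 - t2) ** (n - x2)
two_cat = multinomial_pmf([n - x2, x2], [1 - t2, t2])
```

The integer division inside the loop is exact because each partial quotient of factorials is itself an integer.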
Multivariate normal: The multivariate normal distribution is a generalization
of the normal distribution to p > 1 random variables. For p = 2, the
bivariate normal has five parameters, $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$: a mean parameter
for each variable E(Xj ) = µj , a variance for each variable $\mathrm{Var}(X_j) = \sigma_j^2 > 0$,
and the correlation Cor(X1 , X2 ) = ρ ∈ [−1, 1]. The density function is

$$f(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{-\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1 - \rho^2)}\right\}, \tag{1.25}$$

where zj = (xj − µj )/σj . As shown in Figure 1.6 the density surface is elliptical
with center determined by µ = (µ1 , µ2 ) and shape determined by the
covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho \\ \sigma_1 \sigma_2 \rho & \sigma_2^2 \end{pmatrix}.$$
A convenient feature of the bivariate normal distribution is that the
marginal and conditional distributions are also normal. The marginal distribution
of Xj is Gaussian with mean µj and variance $\sigma_j^2$. The conditional
distribution, shown in Figure 1.7, is

$$X_2 | X_1 = x_1 \sim \mathrm{Normal}\left(\mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1),\ (1 - \rho^2)\sigma_2^2\right). \tag{1.26}$$

If ρ = 0 then the conditional distribution is the marginal distribution, as
expected. If ρ > 0 (ρ < 0) then the conditional mean increases (decreases)
with x1 . Also, whenever ρ ≠ 0 the conditional variance $(1 - \rho^2)\sigma_2^2$ is less than the marginal
variance $\sigma_2^2$, especially for ρ near −1 or 1, and so conditioning on X1 reduces
uncertainty in X2 when there is strong correlation.
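To make the conditional distribution (1.26) concrete, the sketch below evaluates the conditional mean and variance for the parameter values reported in the caption of Figure 1.7 (µ = (1, 1), σ1 = σ2 = 1, ρ = 0.8) at x1 = −1, 1 and 3. The function name is illustrative.

```python
def conditional_normal(x1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of X2 | X1 = x1 for a bivariate normal, as in (1.26)."""
    mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    var = (1 - rho ** 2) * sigma2 ** 2
    return mean, var

# Parameter values from the caption of Figure 1.7
params = dict(mu1=1.0, mu2=1.0, sigma1=1.0, sigma2=1.0, rho=0.8)
results = {x1: conditional_normal(x1, **params) for x1 in (-1.0, 1.0, 3.0)}
```

The conditional mean moves with x1 while the conditional variance is the same for all three values of x1, since the variance in (1.26) does not depend on x1.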
The multivariate normal PDF for p > 2 is most concisely written using
matrix notation. The multivariate normal PDF for the random vector X with
mean vector µ and covariance matrix Σ is

$$f(\mathbf{X}) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\frac{1}{2} (\mathbf{X} - \mu)^T \Sigma^{-1} (\mathbf{X} - \mu)\right\} \tag{1.27}$$

where $|A|$, $A^T$ and $A^{-1}$ are the determinant, transpose and inverse, respectively,
of the matrix A. From this expression it is clear that the contours of the
log PDF are elliptical. All conditional and marginal distributions are normal,
as are all linear combinations $\sum_{j=1}^{p} w_j X_j$ for any w1 , ..., wp .

1.2 Bayes’ rule


As the name implies, Bayes’ rule (or Bayes’ theorem) is fundamental to
Bayesian statistics. However, this rule is a general result from probability
and follows naturally from the definition of a conditional distribution. Consider
two random variables X1 and X2 with joint PMF (or PDF, as the result
holds for both discrete and continuous data) f (x1 , x2 ). The
definition of a conditional distribution gives f (x1 |x2 ) = f (x1 , x2 )/f (x2 ) and
f (x2 |x1 ) = f (x1 , x2 )/f (x1 ). Combining these two expressions gives Bayes’ rule

$$f(x_2 | x_1) = \frac{f(x_1 | x_2) f(x_2)}{f(x_1)}. \tag{1.28}$$
This result is useful as a means to reverse conditioning from X1 |X2 to X2 |X1 ,
and also indicates the need to define a joint distribution for this inversion to
be valid.
FIGURE 1.6
Plots of the bivariate normal PDF. Panel (a) plots the bivariate normal
PDF for µ = (0, 0), σ1 = 1, σ2 = 1 and ρ = 0. The other panels modify Panel
(a) as follows: (b) has µ = (1, 1) and σ1 = 2, (c) has µ = (1, 1) and ρ = 0.8,
and (d) has µ = (1, 1) and ρ = −0.8. The plots are shaded according to the
PDF with white indicating the PDF near zero and black indicating the areas
with highest PDF; the white dot is the mean vector µ. In each panel the
horizontal axis is X1 and the vertical axis is X2 .

FIGURE 1.7
Plots of the joint and conditional bivariate normal PDF. Panel (a)
plots the bivariate normal PDF for µ = (1, 1), σ1 = σ2 = 1 and ρ = 0.8. The
plot is shaded according to the PDF with white indicating the PDF near zero
and black indicating the areas with highest PDF; the vertical lines represent
x1 = −1, x1 = 1 and x1 = 3. The conditional distributions of X2 |X1 = x1 for
these three values of x1 are plotted in Panel (b).

1.2.1 Discrete example of Bayes’ rule


You have a scratchy throat and so you go to the doctor who administers a
rapid strep throat test. Let Y ∈ {0, 1} be the binary indicator of a positive
test, i.e., Y = 1 if the test is positive for strep and Y = 0 if the test is negative.
The test is not perfect. The false positive rate p ∈ [0, 1] is the probability of
testing positive if you do not have strep, and the false negative rate q ∈ [0, 1]
is the probability of testing negative given that you actually have strep. To
express these probabilities mathematically we must define the true disease
status θ ∈ {0, 1}, where θ = 1 if you are truly infected and θ = 0 otherwise.
This unknown variable we hope to estimate is called a parameter. Given these
error rates and the definition of the model parameter, the data distribution
can be written

Prob(Y = 1|θ = 0) = p and Prob(Y = 0|θ = 1) = q. (1.29)

Generally, the PMF (or PDF) of the observed data given the model parameters
is called the likelihood function.
To formally analyze this problem we must determine which components
should be treated as random variables. Is the test result Y a random variable?
Before the exam, Y is clearly random and (1.29) defines its distribution. This
is aleatoric uncertainty because the results may differ if we repeat the test.
However, after learning the test result, Y is determined and you must
decide how to proceed given the value of Y at hand. In this sense, Y is known
and no longer random at the analysis stage.
Is the true disease status θ a random variable? Certainly θ is not a ran-
dom variable in the sense that it changes from second-to-second or minute-
to-minute, and so it is reasonable to assume that the true disease status is
a fixed quantity for the purpose of this analysis. However, because our test
is imperfect we do not know θ. This is epistemic uncertainty because θ is a
quantity that we could theoretically know, but at the analysis stage we do
not and cannot know θ using only noisy data. Despite our uncertainty about
θ, we have to decide what to do next and so it is useful to quantify our un-
certainty using the language of probability. If the test is reliable and p and
q are both small, then in light of a positive test we might conclude that θ is
more likely to be one than zero. But how much more likely? Twice as likely?
Three times? In Bayesian statistics we quantify uncertainty about fixed but
unknown parameters using probability theory by treating them as random
variables. As (1.28) suggests, for formal inversion of conditional probabilities
we would need to treat both variables as random.
The probabilities in (1.29) supply the distribution of the test result given
disease status, Y |θ. However, we would like to quantify uncertainty in the
disease status given the test results, that is, we require the distribution of
θ|Y . Since this is the uncertainty distribution after collecting the data this is
referred to as the posterior distribution. As discussed above, Bayes’ rule can
be applied to reverse the order of conditioning,

$$\mathrm{Prob}(\theta = 1 | Y = 1) = \frac{\mathrm{Prob}(Y = 1 | \theta = 1)\,\mathrm{Prob}(\theta = 1)}{\mathrm{Prob}(Y = 1)}, \tag{1.30}$$

where the marginal probability Prob(Y = 1) is

$$\sum_{\theta=0}^{1} f(1, \theta) = \mathrm{Prob}(Y = 1 | \theta = 1)\,\mathrm{Prob}(\theta = 1) + \mathrm{Prob}(Y = 1 | \theta = 0)\,\mathrm{Prob}(\theta = 0). \tag{1.31}$$
To apply Bayes’ rule requires specifying the unconditional probability of hav-
ing strep throat, Prob(θ = 1) = π ∈ [0, 1]. Since this is the probability of
infection before we conduct the test, we refer to this as the prior probability.
We can then compute the posterior using Bayes’ rule,

$$\mathrm{Prob}(\theta = 1 | Y = 1) = \frac{(1 - q)\pi}{(1 - q)\pi + p(1 - \pi)}. \tag{1.32}$$

To understand this equation consider a few extreme scenarios. Assuming the
error rates p and q are not zero or one, if π = 1 (π = 0) then the posterior
probability of θ = 1 (θ = 0) is one for any value of Y . That is, if we have
no prior uncertainty then the imperfect data does not update the prior. Con-
versely, if the test is perfect and q = p = 0 then for any prior π ∈ (0, 1) the
posterior probability that θ = Y is one. That is, with perfect data the prior
is irrelevant. Finally, if p = q = 1/2, then the test is a random coin flip and
the posterior is the prior Prob(θ = 1|Y ) = π.

TABLE 1.3
Strep throat data. Number of patients by true disease status and test result
in the rapid strep throat test data taken from Table 1 of [26].

            Truly positive,   Truly positive,   Truly negative,   Truly negative,
            test positive     test negative     test positive     test negative
Children          80                38                23               349
Adults            43                10                14               261
Total            123                48                37               610
For a more realistic scenario we use the data in Table 1.3 taken from [26].
We plug in the sample error rates from these data for p = 37/(37 + 610) =
0.057 and q = 48/(48 + 123) = 0.281. Of course these data represent only
a sample and the sample proportions are not exactly the true error rates,
but for illustration we assume these error rates are correct. Then if we assume
prior probability of disease is π = 0.5, the posterior probabilities are Prob(θ =
1|Y = 0) = 0.230 and Prob(θ = 1|Y = 1) = 0.927. Therefore, beginning with
a prior probability of 0.5, a negative test moves the probability down to 0.230
and a positive test increases the probability to 0.927.
Of course, in reality the way individuals process test results is complicated
and subjective. If you have had strep many times before and you went to
the doctor because your current symptoms resemble previous bouts with the
disease, then perhaps your prior is π = 0.8 and the posterior is Prob(θ =
1|Y = 1) = 0.981. On the other hand, if you went to the doctor only at the
urging of your friend and your prior probability is π = 0.2, then Prob(θ =
1|Y = 1) = 0.759.
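The posterior probabilities quoted in this example follow from (1.30)–(1.32) and can be reproduced with a few lines of code. In the sketch below the function name is illustrative, and the error rates are computed from the totals in Table 1.3 without rounding, so results may differ from the values in the text in the third decimal place.

```python
def posterior_strep(y, p, q, prior):
    """Prob(theta = 1 | Y = y) by Bayes' rule.

    p: false positive rate, q: false negative rate, prior: Prob(theta = 1).
    """
    like1 = (1 - q) if y == 1 else q    # Prob(Y = y | theta = 1)
    like0 = p if y == 1 else (1 - p)    # Prob(Y = y | theta = 0)
    return like1 * prior / (like1 * prior + like0 * (1 - prior))

# Sample error rates from the totals in Table 1.3
p = 37 / (37 + 610)
q = 48 / (48 + 123)

# Posterior after a positive or negative test for the three priors in the text
posteriors = {
    (prior, y): posterior_strep(y, p, q, prior)
    for prior in (0.2, 0.5, 0.8)
    for y in (0, 1)
}
```

Setting p = q = 1/2 in `posterior_strep` returns the prior unchanged, matching the coin-flip scenario discussed above.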
This simple example illustrates a basic Bayesian analysis. The objective
is to compute the posterior distribution of the unknown parameters θ. The
posterior has two ingredients: the likelihood of the data given the parameters
and the prior distribution. Selection of these two distributions is thus largely
the focus of the remainder of this book.

1.2.2 Continuous example of Bayes’ rule


Let θ ∈ [0, 1] be the proportion of the population in a county that has health
insurance. It is known that the proportion varies across counties following a
Beta(a, b) distribution and so the prior is θ ∼ Beta(a, b). We take a sample
of size n = 20 from your county and assume that the number of respondents
with insurance, Y ∈ {0, 1, ..., n}, is distributed as Y |θ ∼ Binomial(n, θ). Joint
probabilities for θ and Y can be computed from

$$\begin{aligned}
f(\theta, y) &= f(y|\theta) f(\theta) \\
&= \left[\binom{n}{y} \theta^{y} (1 - \theta)^{n-y}\right] \left[\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1}\right] \\
&= c\,\theta^{y+a-1} (1 - \theta)^{n-y+b-1}
\end{aligned}$$

where $c = \binom{n}{y} \Gamma(a+b) / [\Gamma(a)\Gamma(b)]$ is a constant that does not depend on θ.

FIGURE 1.8
Joint distribution for the beta-binomial example. Plot of f (θ, y) for the
example with θ ∼ Beta(8, 2) and Y |θ ∼ Binomial(20, θ). The marginal distributions
f (θ) (top) and f (y) (right) are plotted in the margins. The horizontal
line is the Y = 12 line.




Figure 1.8 plots f (θ, y) and the marginal distributions for θ and Y . By the
way we have defined the problem, the marginal distribution of θ, f (θ), is
a Beta(a, b) PDF, which could also be derived by summing f (θ, y) over y.
The marginal distribution of Y plotted on the right of Figure 1.8 is
$f(y) = \int_0^1 f(\theta, y)\,d\theta$. In this case the marginal distribution of Y follows a beta-binomial
distribution, but as we will see this is not needed in the Bayesian analysis.
In this problem we are given the unconditional distribution of the insurance
rate (prior) and the distribution of the sample given the true proportion (likelihood),
and Bayes’ rule gives the (posterior) distribution of the true proportion
given the sample. Say we observe Y = 12. The horizontal line in Figure 1.8
traces over the conditional distribution f (θ|Y = 12). The conditional distribu-
tion is centered around the sample proportion Y /n = 0.60 but has non-trivial
mass from 0.4 to 0.8. More formally, the posterior is

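$$\begin{aligned}
f(\theta|y) &= \frac{f(y|\theta) f(\theta)}{f(y)} \\
&= \left[\frac{c}{f(y)}\right] \theta^{y+a-1} (1 - \theta)^{n-y+b-1} \\
&= C\,\theta^{A-1} (1 - \theta)^{B-1}
\end{aligned} \tag{1.33}$$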

where C = c/f (y), A = y + a, and B = n − y + b.


We note the resemblance between f (θ|y) and the PDF of a Beta(A, B)
distribution. Both include $\theta^{A-1} (1 - \theta)^{B-1}$ but differ in the normalizing constant,
C for f (θ|y) compared to Γ(A + B)/[Γ(A)Γ(B)] for the Beta(A, B) PDF. Since
both f (θ|y) and the Beta(A, B) PDF are proper, they both integrate to one,
and thus

$$\int_0^1 C\,\theta^{A-1} (1 - \theta)^{B-1}\,d\theta = \int_0^1 \frac{\Gamma(A+B)}{\Gamma(A)\Gamma(B)}\, \theta^{A-1} (1 - \theta)^{B-1}\,d\theta = 1 \tag{1.34}$$

and so

$$C \int_0^1 \theta^{A-1} (1 - \theta)^{B-1}\,d\theta = \frac{\Gamma(A+B)}{\Gamma(A)\Gamma(B)} \int_0^1 \theta^{A-1} (1 - \theta)^{B-1}\,d\theta \tag{1.35}$$

and thus C = Γ(A + B)/[Γ(A)Γ(B)]. Therefore, f (θ|y) is in fact the Beta(A, B)
PDF and θ|Y = y ∼ Beta(y + a, n − y + b).
Dealing with the normalizing constant makes posterior calculations quite
tedious. Fortunately this can often be avoided by discarding terms that do
not involve the parameter of interest and comparing the remaining terms
with known distributions. The derivation above can be simplified to (using
“∝” to mean “proportional to”)

$$f(\theta|y) \propto f(y|\theta) f(\theta) \propto \theta^{(y+a)-1} (1 - \theta)^{(n-y+b)-1}$$

and immediately concluding that θ|Y = y ∼ Beta(y + a, n − y + b).
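The conjugacy argument can also be checked numerically. The sketch below normalizes likelihood × prior on a fine grid, without using the closed form, and compares the result with the Beta(y + a, n − y + b) PDF. It uses the values from Figure 1.8 (a = 8, b = 2, n = 20, Y = 12) and assumes the standard beta normalizing constant Γ(a + b)/[Γ(a)Γ(b)]. The function names and grid size are illustrative.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(a, b) PDF with normalizing constant Gamma(a + b) / [Gamma(a) Gamma(b)]."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

def grid_posterior(y, n, a, b, m=20_000):
    """Normalize likelihood * prior on a grid of m midpoints in (0, 1)."""
    grid = [(i + 0.5) / m for i in range(m)]
    unnorm = [math.comb(n, y) * t ** y * (1 - t) ** (n - y) * beta_pdf(t, a, b)
              for t in grid]
    f_y = sum(unnorm) / m   # midpoint-rule approximation of the marginal f(y)
    return grid, [u / f_y for u in unnorm]

a, b, n, y = 8, 2, 20, 12   # values used for Figure 1.8
grid, post = grid_posterior(y, n, a, b)
exact = [beta_pdf(t, y + a, n - y + b) for t in grid]
max_gap = max(abs(u - v) for u, v in zip(post, exact))
```

The largest discrepancy `max_gap` reflects only numerical integration error, which illustrates why the normalizing constant need not be tracked by hand.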


Figure 1.9 plots the posterior distribution for two priors and Y ∈
{0, 5, 10, 15, 20}. The plots illustrate how the posterior combines information
from the prior and the likelihood. In both plots, the peak of the posterior dis-
tribution increases with the observation Y . Comparing the plots shows that
the prior also contributes to the posterior. When we observe Y = 0 successes,
the posterior under the Beta(8,2) prior (left) is pulled from zero to the right
by the prior (thick line). Under the Beta(1,1), i.e., the uniform prior, when
Y = 0 the posterior is concentrated around θ = 0.
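The pull of the prior can be quantified with a simple summary. The sketch below tabulates the posterior mean for both priors and each value of Y, using the standard result that a Beta(A, B) distribution has mean A/(A + B). This formula is not stated in this section and is assumed here, and the names are illustrative.

```python
def posterior_params(y, n, a, b):
    """Beta posterior parameters (A, B) = (y + a, n - y + b)."""
    return y + a, n - y + b

n = 20
priors = {"Beta(8,2)": (8, 2), "Beta(1,1)": (1, 1)}

summary = {}
for name, (a, b) in priors.items():
    for y in (0, 5, 10, 15, 20):
        A, B = posterior_params(y, n, a, b)
        summary[(name, y)] = A / (A + B)   # posterior mean of a Beta(A, B)
```

For Y = 0 the posterior mean is 8/30 under the Beta(8,2) prior but only 1/22 under the uniform prior, which matches the behavior described for the two priors.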
FIGURE 1.9
Posterior distribution for the beta-binomial example. The thick lines
are the beta prior for success probability θ and the thin lines are the posterior
assuming Y |θ ∼ Binomial(20, θ) for various values of Y . Left panel: Beta(8,2)
prior; right panel: Beta(1,1) prior; posteriors are shown for Y = 0, 5, 10, 15 and 20.

1.3 Introduction to Bayesian inference


A parametric statistical analysis models the random process that produced
the data, Y = (Y1 , ..., Yn ), in terms of fixed but unknown parameters θ =
(θ1 , ..., θp ). The PDF (or PMF) of the data given the parameters, f (Y|θ), is
called the likelihood function and links the observed data with the unknown
parameters. Statistical inference is concerned with the inverse problem of using
the likelihood function to estimate θ. Of course, if the data are noisy then
we cannot perfectly estimate θ, and a Bayesian quantifies uncertainty about
the unknown parameters by treating them as random variables. Treating θ
as a random variable requires specifying the prior distribution, π(θ), which
represents our uncertainty about the parameters before we observe the data.
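If we view θ as a random variable, we can apply Bayes’ rule to obtain the
posterior distribution

$$p(\theta | \mathbf{Y}) = \frac{f(\mathbf{Y} | \theta)\,\pi(\theta)}{\int f(\mathbf{Y} | \theta)\,\pi(\theta)\,d\theta} \propto f(\mathbf{Y} | \theta)\,\pi(\theta). \tag{1.36}$$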

The posterior is proportional to the likelihood times the prior, and quantifies
the uncertainty about the parameters that remains after accounting for prior
knowledge and the new information in the observed data.
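For a single parameter, (1.36) can be approximated directly by evaluating likelihood × prior on a grid and normalizing. The following is a minimal sketch under assumptions introduced purely for illustration: hypothetical data with Yi |θ ∼ Normal(θ, 1) and prior θ ∼ Normal(0, 10²). The function `grid_posterior`, the data values and the grid limits are not from the text.

```python
import math

def grid_posterior(log_lik, log_prior, grid):
    """Approximate p(theta | Y) on an equally spaced grid using (1.36)."""
    log_post = [log_lik(t) + log_prior(t) for t in grid]
    shift = max(log_post)                            # guards against underflow in exp
    unnorm = [math.exp(v - shift) for v in log_post]
    step = grid[1] - grid[0]
    norm = sum(unnorm) * step                        # proportional to the marginal m(Y)
    return [u / norm for u in unnorm]

# Hypothetical data and model (illustrative assumptions)
Y = [1.2, 0.7, 2.1, 1.5, 0.9]
log_lik = lambda t: sum(-0.5 * (y - t) ** 2 for y in Y)   # Normal(theta, 1), up to a constant
log_prior = lambda t: -0.5 * (t / 10.0) ** 2              # Normal(0, 10^2), up to a constant

grid = [-5 + 10 * i / 4000 for i in range(4001)]
post = grid_posterior(log_lik, log_prior, grid)
step = grid[1] - grid[0]
post_mean = sum(t * p for t, p in zip(grid, post)) * step
```

Constants that do not involve θ are dropped from `log_lik` and `log_prior` because they cancel in the normalization, which is the practical content of the proportionality in (1.36).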
TABLE 1.4
Notation used throughout the book for distributions involving the parameter
vector θ = (θ1 , ..., θp ) and data vector Y = (Y1 , ..., Yn ).

Prior density of θ:                  π(θ)
Likelihood function of Y given θ:    f (Y|θ)
Marginal density of Y:               m(Y) = ∫ f (Y|θ)π(θ)dθ
Posterior density of θ given Y:      p(θ|Y) = f (Y|θ)π(θ)/m(Y)

Table 1.4 establishes the notation we use throughout for the prior, likelihood
and posterior. We will not adhere to the custom (e.g., Section 1.1) that
random variables are capitalized because in a Bayesian analysis more often
than not it is the parameters that are the random variables, and capital Greek
letters, e.g., Prob(Θ = θ), are unfamiliar to most readers. We will however
follow the custom to use bold to represent vectors and matrices. Also, assume
independence unless otherwise noted. For example, if we say “the priors are
θ1 ∼ Uniform(0, 1) and θ2 ∼ Gamma(1, 1),” you should assume that θ1 and
θ2 have independent priors.
The Bayesian approach provides a logically consistent framework to use
all available information to quantify uncertainty about model parameters.
However, to apply Bayes’ rule requires specifying the prior distribution and the
likelihood function. How do we pick the prior? In many cases prior knowledge
from experience, expert opinion or similar studies is available and can be used
to specify an informative prior. It would be a waste to discard this information.
In other cases, where prior information is unavailable, the prior should be
uninformative to reflect this uncertainty. For instance, in the beta-binomial
example in Section 1.2 we might use a uniform prior that puts equal mass on
all possible parameter values. The choice of prior distribution is subjective,
i.e., driven by the analyst’s past experience and personal preferences. If a
reader does not agree with your prior then they are unlikely to be persuaded
by your analysis. Therefore, the prior, especially an informative prior, should
be carefully justified, and a sensitivity analysis comparing the posteriors under
different priors should be presented.
How do we pick the likelihood? The likelihood function is the same as in a
classical analysis, e.g., a maximum likelihood analysis. The likelihood function
for multiple linear regression is the product of Gaussian PDFs defined by the
model

$$Y_i | \theta \overset{indep}{\sim} \mathrm{Normal}\left(\beta_0 + \sum_{j=1}^{p} X_{ij} \beta_j,\ \sigma^2\right) \tag{1.37}$$

where Xij is the value of the jth covariate for the ith observation and
$\theta = (\beta_0, ..., \beta_p, \sigma^2)$ are the unknown parameters. A thoughtful application
of multiple linear regression must consider many questions, including
• Which covariates to include?
• Are the errors Gaussian? Independent? Do they have equal variance?
• Should we include quadratic or interaction effects?
• Should we consider a transformation of the response (e.g., model log(Yi ))?
• Which observations are outliers? Should we remove them?
• How should we handle the missing observations?
• What p-value threshold should be used to define statistical significance?
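The likelihood defined by (1.37) can be written as a short function. The sketch below computes the log of the product of Gaussian PDFs. The function name, argument layout and the small data set are hypothetical and serve only to show the structure of the calculation.

```python
import math

def regression_log_likelihood(Y, X, beta0, beta, sigma2):
    """Log of the product of Gaussian PDFs defined by model (1.37).

    Y: list of n responses; X: n rows of p covariate values;
    beta0: intercept; beta: list of p slopes; sigma2: error variance.
    """
    total = 0.0
    for yi, xi in zip(Y, X):
        mean_i = beta0 + sum(xij * bj for xij, bj in zip(xi, beta))
        total += -0.5 * math.log(2 * math.pi * sigma2) - (yi - mean_i) ** 2 / (2 * sigma2)
    return total

# Hypothetical data with n = 4 observations and p = 2 covariates
X = [[1.0, 0.5], [2.0, 1.5], [3.0, 0.0], [4.0, 2.0]]
Y = [2.6, 4.3, 4.1, 7.2]

good = regression_log_likelihood(Y, X, beta0=1.0, beta=[1.0, 1.0], sigma2=0.25)
poor = regression_log_likelihood(Y, X, beta0=0.0, beta=[0.0, 0.0], sigma2=0.25)
```

Parameter values that track the data (`good`) give a much larger log likelihood than values that ignore the covariates (`poor`). The same function serves a classical maximum likelihood analysis and a Bayesian analysis, which differ only in what is done with it afterwards.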

As with specifying the prior, these concerns are arguably best resolved using
subjective subject-matter knowledge. For example, while there are statistical
methods to select covariates (Chapter 5), a more reliable strategy is to ask
a subject-matter expert which covariates are the most important to include,
at least as an initial list to be refined in the statistical analysis. As another
example, it is hard to determine (without a natural ordering as in time series
data) whether the observations are independent without consulting someone
familiar with the data collection and the study population. Other decisions
are made based on visual inspections of the data (such as scatter plots and
histograms of the residuals) or ad hoc rules of thumb (thresholds on outliers’ z-scores
or p-values for statistical significance). Therefore, in a typical statistical
analysis there are many subjective choices to be made, and the choice of prior is
far from the most important.
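
Returning to the likelihood in (1.37), the product of Gaussian PDFs is easy to write down as an R function on the log scale. The sketch below uses simulated data and a hypothetical helper name (`log_lik`), both assumptions made for illustration; it checks the hand-written log-likelihood against R's built-in `logLik` for a least-squares fit.

```r
# Log-likelihood of the Gaussian linear regression model in (1.37).
# beta: vector (beta_0, ..., beta_p); sigma2: error variance;
# X: n x p covariate matrix (no intercept column); Y: response vector.
log_lik <- function(beta, sigma2, X, Y) {
  mu <- beta[1] + as.vector(X %*% beta[-1])   # mean of each Y_i
  sum(dnorm(Y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# Simulated data (all values below are illustrative assumptions)
set.seed(1)
n <- 50
X <- cbind(rnorm(n), rnorm(n))
Y <- 1 + 2 * X[, 1] - 0.5 * X[, 2] + rnorm(n, sd = 0.7)

# Evaluate at the least-squares fit; the ML variance estimate divides by n
fit      <- lm(Y ~ X)
beta_hat <- coef(fit)
s2_hat   <- mean(resid(fit)^2)
ll       <- log_lik(beta_hat, s2_hat, X, Y)

# Compare with R's built-in log-likelihood for the fitted model
all.equal(ll, as.numeric(logLik(fit)))
```

The same function is what a Bayesian analysis would combine with the log prior; only the treatment of θ changes.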
Bayesian statistical methods are often criticized as being subjective. Per-
haps an objective analysis that is free from personal preferences or beliefs is an
ideal we should strive for (and this is the aim of objective Bayesian methods,
see Section 2.3), but it is hard to make the case that non-Bayesian methods
are objective, and it can be argued that almost any scientific knowledge and
theories are subjective in nature. In an interesting article by Press and Tanur
(2001), the authors cite many scientific theories (mainly from physics) where
subjectivity played a major role and they concluded “Subjectivity occurs, and
should occur, in the work of scientists; it is not just a factor that plays a minor
role that we need to ignore as a flaw...” and they further added that “Total
objectivity in science is a myth. Good science inevitably involves a mixture of
subjective and objective parts.” The Bayesian inferential framework provides a
logical foundation to accommodate objective and subjective parts involved in
data analysis. Hence, good scientific practice would be to state upfront all
assumptions and then make an effort to validate such assumptions using the
current data or, preferably, future test cases. There is nothing wrong with having a
subjective but reasonably flexible model as long as we can exhibit some form
of sensitivity analysis when the assumptions of the model are mildly violated.
In addition to explicitly acknowledging subjectivity, another important
difference between Bayesian and frequentist (classical) methods is their notion
of uncertainty. While a Bayesian considers only the data at hand, a frequentist
views uncertainty as arising from repeating the process that generated the data
many times. That is, a Bayesian might give a posterior probability that the
population mean µ (a parameter) is positive given the data we have observed,
whereas a frequentist would give a probability that the sample mean Ȳ (a
statistic) exceeds a threshold given a specific value of the parameters if we
repeated the experiment many times (as is done when computing a p-value).
The frequentist view of uncertainty is well-suited for developing procedures
that have desirable error rates when applied broadly. This is reasonable in
many settings. For instance, a regulatory agency might want to advocate
statistical procedures that ensure only a small proportion of the medications
made available to the public have adverse side effects. In some cases, however, it
is hard to see why repeating the sampling is a useful thought experiment. For
example, [14] study the relationship between a region’s climate and the type of
religion that emerged from that region. Assuming the data set consists of the
complete list of known cultures, it is hard to imagine repeating the process
that led to these data as it would require replaying thousands of years of
human history.
Bayesians can and do study the frequentist properties of their methods. This is critical to
build trust in the methods. If a Bayesian weather forecaster gives the posterior
predictive 95% interval every day for a year, but at the end of the year these
intervals included the observed temperature only 25% of the time, then the
forecaster would lose all credibility. It turns out that Bayesian methods often
have desirable frequentist properties, and Chapter 7 examines these properties.
Developing Bayesian methods with good frequentist properties is often
called calibrated Bayes (e.g., [52]). According to Rubin [52, 71]: “The applied
statistician should be Bayesian in principle and calibrated to the real world
in practice - appropriate frequency calculations help to define such a tie...
frequency calculations are useful for making Bayesian statements scientific,
scientific in the sense of capable of being shown wrong by empirical test; here
the technique is the calibration of Bayesian probabilities to the frequencies of
actual events.”
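
A calibration check of this sort can be sketched by simulation. The example below is an illustration under assumed settings (a binomial model with a uniform prior, a "true" θ of 0.4, and n = 100 trials); it looks at intervals for a parameter rather than the predictive intervals of the forecasting example, and estimates how often the Bayesian 95% equal-tailed interval contains the true value over repeated data sets.

```r
# Frequentist coverage of a Bayesian 95% equal-tailed credible interval.
# Model: Y | theta ~ Binomial(n, theta), theta ~ Beta(1, 1).
set.seed(42)
theta_true <- 0.4     # assumed "true" value used to generate data
n          <- 100     # trials per data set
n_rep      <- 10000   # number of simulated data sets

# One posterior interval per simulated data set
Y     <- rbinom(n_rep, n, theta_true)
lower <- qbeta(0.025, Y + 1, n - Y + 1)
upper <- qbeta(0.975, Y + 1, n - Y + 1)

# Proportion of intervals that contain the true value
coverage <- mean(lower < theta_true & theta_true < upper)
coverage
```

A coverage near the nominal 0.95 suggests the procedure is well calibrated in this setting; repeating the exercise for other values of `theta_true` and `n` gives a fuller picture.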

1.4 Summarizing the posterior


The final output of a Bayesian analysis is the posterior distribution of the
model parameters. The posterior contains all the relevant information from
the data and the prior, and thus all statistical inference should be based
on the posterior distribution. However, when there are many parameters,
the posterior distribution is a high-dimensional function that is difficult to
display graphically, and for complicated statistical models the mathematical
form of the posterior may be challenging to work with. In this section, we
discuss some methods to summarize a high-dimensional posterior with low-
dimensional summaries.

1.4.1 Point estimation


One approach to summarizing the posterior is to use a point estimate, i.e.,
a single value that represents the best estimate of the parameters given the
data and (for a Bayesian analysis) the prior. The posterior mean, median and
mode are all sensible choices. Thinking of the Bayesian analysis as a procedure
that can be applied to any dataset, the point estimator is an example of an
estimator, i.e., a function that takes the data as input and returns an estimate
of the parameter of interest. Bayesian estimators such as the posterior mean
can then be seen as competitors to other estimators such as the sample mean
estimator for a population mean or a sample variance for a population vari-
ance, or more generally as a competitor to the maximum likelihood estimator.
We study the properties of these estimators in Chapter 7.
A common point estimator is the maximum a posteriori (MAP) estimator,
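defined as the value that maximizes the posterior (i.e., the posterior mode),
$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log[p(\theta \mid \mathbf{Y})] = \arg\max_{\theta} \left\{ \log[f(\mathbf{Y} \mid \theta)] + \log[\pi(\theta)] \right\}. \qquad (1.38)$$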
The second equality holds because the normalizing constant m(Y) does not
depend on the parameters and thus does not affect the optimization.
If the prior is uninformative, i.e., mostly flat as a function of the param-
eters, then the MAP estimator should be similar to the maximum likelihood
estimator (MLE)
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \log[f(\mathbf{Y} \mid \theta)]. \qquad (1.39)$$
In fact, this relationship is often used to intuitively justify maximum likelihood
estimation. The addition of the log prior log[π(θ)] in (1.38) can be viewed as a
regularization or penalty term to add stability or prior knowledge.
Point estimators are often useful as fast methods to estimate the parameters
for the purpose of making predictions. However, a point estimate alone does
not quantify uncertainty about the parameters. Sections 1.4.2 and 1.4.3 pro-
vide more thorough summaries of the posterior for univariate and multivariate
problems, respectively.
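
The relationship between the MAP estimator and the MLE can be illustrated numerically with the beta-binomial model. In the sketch below the data and the two priors are illustrative assumptions, and `log_post` and `map_est` are hypothetical helper names; the code maximizes the log-likelihood plus log-prior, as in (1.38), using base R's `optimize`.

```r
# Beta-binomial example: Y | theta ~ Binomial(n, theta), theta ~ Beta(a, b)
n <- 100; Y <- 40     # illustrative data

# Log posterior up to an additive constant: log-likelihood + log-prior
log_post <- function(theta, a, b) {
  dbinom(Y, n, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}

# Numerical MAP estimate for a given Beta(a, b) prior
map_est <- function(a, b) {
  optimize(log_post, interval = c(0, 1), a = a, b = b, maximum = TRUE)$maximum
}

mle      <- Y / n            # closed-form MLE for a binomial proportion
map_flat <- map_est(1, 1)    # uniform prior: MAP coincides with the MLE
map_inf  <- map_est(10, 10)  # informative prior pulls the estimate toward 0.5

c(MLE = mle, MAP_flat = map_flat, MAP_informative = map_inf)
```

With the flat prior the log-prior term is constant and the two estimators agree up to numerical tolerance; with the Beta(10, 10) prior the penalty term shrinks the estimate toward the prior mode.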

1.4.2 Univariate posteriors


A univariate posterior (i.e., from a model with p = 1 parameter) is best sum-
marized with a plot because this retains all information about the parameter.
Figure 1.10 shows a hypothetical univariate posterior with PDF centered at
0.8 and most of its mass on θ > 0.4.
Point estimators such as the posterior mean or median summarize the
center of the posterior, and should be accompanied by a posterior variance or
standard deviation to convey uncertainty. The posterior standard deviation
resembles a frequentist standard error in that if the posterior is approximately
Gaussian then the posterior probability that the parameter is within two pos-
terior standard deviation units of the posterior mean is roughly 0.95. However,

[Figure 1.10: two panels showing the posterior density (vertical axis: Posterior) against θ from 0 to 1 (horizontal axis). Left panel: posterior mean and posterior median marked. Right panel: annotated with P(H1|Y) = 0.98.]

FIGURE 1.10
Summaries of a univariate posterior. The plot on the left gives the
posterior mean (solid vertical line), median (dashed vertical line) and 95%
equal-tailed interval (shaded area). The plot on the right shades the posterior
probability of the hypothesis H1 : θ > 0.5.

the standard error is the standard deviation of the estimator (e.g., the sample
mean) if we repeatedly sample different data sets and compute the estima-
tor for each data set. In contrast, the posterior standard deviation quantifies
uncertainty about the parameter given only the single data set under consid-
eration.
The interval E(θ|Y) ± 2SD(θ|Y ) is an example of a credible interval. A
100(1 − α)% credible interval is any interval (l, u) so that Prob(l < θ < u|Y) =
1 − α. There are infinitely many intervals with this coverage, but the easiest
to compute is the equal-tailed interval with l and u set to the α/2 and 1 − α/2
posterior quantiles. An alternative is the highest posterior density (HPD) interval,
which searches over all l and u to minimize the interval width u − l while
maintaining the appropriate posterior coverage. The HPD interval thus has the highest
average posterior density of all intervals of the form (l, u) that have the
nominal posterior probability. As opposed to equal-tailed intervals, the HPD
requires an additional optimization step, but this can be computed using the
R package HDInterval.
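
To make the optimization step concrete, here is a minimal base-R sketch that does not rely on the HDInterval package. It assumes a unimodal posterior, here the Beta(41, 61) posterior from Listing 1.1, and searches over the lower-tail probability p so that every candidate interval keeps posterior coverage 1 − α; the function name `width` is a hypothetical helper.

```r
# Posterior from Listing 1.1: theta | Y ~ Beta(A, B)
A <- 41; B <- 61
alpha <- 0.05

# Width of the interval that leaves probability p in the lower tail
# and alpha - p in the upper tail (posterior coverage is always 1 - alpha)
width <- function(p) qbeta(p + 1 - alpha, A, B) - qbeta(p, A, B)

# Minimize the width over the lower-tail probability p in (0, alpha)
p_opt <- optimize(width, interval = c(0, alpha), tol = 1e-8)$minimum
hpd   <- qbeta(c(p_opt, p_opt + 1 - alpha), A, B)

# Equal-tailed interval for comparison (p = alpha / 2)
eq_tail <- qbeta(c(alpha / 2, 1 - alpha / 2), A, B)

rbind(HPD = hpd, equal_tailed = eq_tail)
```

Because this posterior is nearly symmetric the two intervals are very similar; the difference becomes more visible for skewed posteriors, such as those arising from small n or Y near 0 or n.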
Interpreting a posterior credible interval is fairly straightforward. If (l, u) is
a posterior 95% interval, this means “given my prior and the observed data, I
am 95% sure that θ is between l and u.” In a Bayesian analysis we express our
subjective uncertainty about unknown parameters by treating them as random
variables, and in this subjective sense it is reasonable to assign probabilities to
θ. This is in contrast with frequentist confidence intervals which have a more
nuanced interpretation. A confidence interval is a procedure that defines an
interval for a given data set in a way that ensures the procedure’s intervals
will include the true value 95% of the time when applied to random datasets.
The posterior distribution can also be used for hypothesis testing. Since
the hypotheses are functions of the parameters, we can assign posterior prob-
abilities to each hypothesis. Figure 1.10 (right) plots the posterior probability
of the null hypothesis that θ < 0.5 (white) and the posterior probability of
the alternative hypothesis that θ > 0.5 (shaded). These probabilities summa-
rize the weight of evidence in support of each hypothesis, and can be used to
guide future decisions. Hypothesis testing, and more generally model selection,
is discussed in greater detail in Chapter 5.
Summarizing a univariate posterior using R: We have seen that if
Y |θ ∼ Binomial(n, θ) and θ ∼ Beta(a, b), then θ|Y ∼ Beta(A, B), where
A = Y + a and B = n − Y + b. Listing 1.1 specifies a data set with Y = 40 and
n = 100 and summarizes the posterior using R. Since the posterior is the beta
density, the functions dbeta, pbeta and qbeta can be used to summarize the
posterior. The posterior median and 95% credible set are 0.401 and (0.309,
0.498).
Monte Carlo sampling: Although univariate posterior distributions are
best summarized by a plot, higher dimensional posterior distributions call
for other methods such as Monte Carlo (MC) sampling and so we introduce
this approach here. MC sampling draws S samples from the posterior,
$\theta^{(1)}, ..., \theta^{(S)}$, and uses these samples to approximate the posterior. For example,
the posterior mean is approximated using the mean of the S samples,
$E(\theta \mid \mathbf{Y}) \approx \sum_{s=1}^{S} \theta^{(s)}/S$, the posterior 95% credible set is approximated using
the 0.025 and 0.975 quantiles of the S samples, etc. Listing 1.1 provides an
example of using MC sampling to approximate the posterior mean and 95%
credible set.
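
The same samples can approximate other posterior summaries. The sketch below, which assumes the Beta(41, 61) posterior from Listing 1.1 and an arbitrary random seed, approximates a posterior probability, the posterior standard deviation, and summaries of a transformed parameter (the odds θ/(1 − θ), chosen here only as an example), along with a rough Monte Carlo standard error for the posterior-mean approximation.

```r
set.seed(123)
A <- 41; B <- 61            # posterior from Listing 1.1
S <- 100000
samples <- rbeta(S, A, B)

# Posterior probability P(theta < 0.5 | Y) as a sample proportion
mc_prob <- mean(samples < 0.5)

# Posterior standard deviation from the samples
mc_sd <- sd(samples)

# Monte Carlo standard error of the posterior-mean approximation
mc_se <- mc_sd / sqrt(S)

# Summaries of a transformation, e.g., the odds theta / (1 - theta)
odds    <- samples / (1 - samples)
mc_odds <- c(mean = mean(odds), quantile(odds, c(0.025, 0.975)))

c(prob = mc_prob, exact = pbeta(0.5, A, B), sd = mc_sd, se = mc_se)
mc_odds
```

Summaries of transformed parameters come essentially for free with MC sampling, which is one reason the approach scales well to the multivariate problems discussed next.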

1.4.3 Multivariate posteriors


Unlike the univariate case, a simple plot of the posterior will not suffice, es-
pecially for large p, because plotting in high dimensions is challenging. The
typical remedy for this is to marginalize out other parameters and summarize
univariate marginal distributions with plots, point estimates, and credible sets,
and perhaps plots of a few bivariate marginal distributions (i.e., integrating
over the other p − 2 parameters) of interest.
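Consider the model $Y_i \mid \mu, \sigma \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$ with independent priors µ ∼
Normal(0, 100²) and σ ∼ Uniform(0, 10) (other priors are discussed in Section
4.1.1). The likelihood
$$f(\mathbf{Y} \mid \mu, \sigma) \propto \prod_{i=1}^{n} \frac{1}{\sigma} \exp\left\{-\frac{(Y_i - \mu)^2}{2\sigma^2}\right\} \propto \sigma^{-n} \exp\left\{-\frac{\sum_{i=1}^{n} (Y_i - \mu)^2}{2\sigma^2}\right\} \qquad (1.40)$$
factors as the product of n terms because the observations are assumed to be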
independent. The prior is π(µ, σ) = π(µ)π(σ) because µ and σ have independent
priors and since π(σ) = 1/10 for all σ ∈ [0, 10], the prior becomes
$$\pi(\mu, \sigma) \propto \exp\left(-\frac{\mu^2}{2 \cdot 100^2}\right) \qquad (1.41)$$
for σ ∈ [0, 10] and π(µ, σ) = 0 otherwise. The posterior is proportional to the
likelihood times the prior, and thus
$$p(\mu, \sigma \mid \mathbf{Y}) \propto \sigma^{-n} \exp\left\{-\frac{\sum_{i=1}^{n} (Y_i - \mu)^2}{2\sigma^2}\right\} \exp\left(-\frac{\mu^2}{2 \cdot 100^2}\right) \qquad (1.42)$$
for σ ∈ [0, 10]. Figure 1.11 plots this bivariate posterior assuming there are
n = 5 observations: Y1 = 2.68, Y2 = 1.18, Y3 = −0.97, Y4 = −0.98, Y5 =
−1.03.

Listing 1.1
Summarizing a univariate posterior in R.
# Load the data
> n <- 100
> Y <- 40
> a <- 1
> b <- 1
> A <- Y + a
> B <- n - Y + b

# Define a grid of points for plotting
> theta <- seq(0,1,.001)

# Evaluate the density at these points
> pdf <- dbeta(theta,A,B)

# Plot the posterior density
> plot(theta,pdf,type="l",ylab="Posterior",xlab=expression(theta))

# Posterior mean
> A/(A + B)
[1] 0.4019608

# Posterior median (0.5 quantile)
> qbeta(0.5,A,B)
[1] 0.4013176

# Posterior probability P(theta<0.5|Y)
> pbeta(0.5,A,B)
[1] 0.976978

# Equal-tailed 95% credible interval
> qbeta(c(0.025,0.975),A,B)
[1] 0.3093085 0.4982559

# Monte Carlo approximation
> S <- 100000
> samples <- rbeta(S,A,B)
> mean(samples)
[1] 0.402181
> quantile(samples,c(0.025,0.975))
     2.5%     97.5%
0.3092051 0.4973871

TABLE 1.5
Bivariate posterior distribution. Summaries of the marginal posterior
distributions for the model with $Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$, priors µ ∼ Normal(0, 100²)
and σ ∼ Uniform(0, 10), and n = 5 observations Y1 = 2.68, Y2 = 1.18,
Y3 = −0.97, Y4 = −0.98, Y5 = −1.03.

Parameter   Posterior mean   Posterior SD   95% credible set
µ           0.17             1.31           (-2.49, 2.83)
σ           2.57             1.37           ( 1.10, 6.54)

The two parameters in Figure 1.11 depend on each other. If σ = 1.5
(i.e., the conditional distribution traced by the horizontal line at σ = 1.5 in
Figure 1.11) then the posterior of µ concentrates between -1 and 1, whereas
if σ = 3 the posterior of µ spreads from -3 to 3. It is difficult to describe this
complex bivariate relationship, so we often summarize the univariate marginal
distributions instead. The marginal distributions
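$$p(\mu \mid \mathbf{Y}) = \int_{0}^{10} p(\mu, \sigma \mid \mathbf{Y})\, d\sigma \quad \text{and} \quad p(\sigma \mid \mathbf{Y}) = \int_{-\infty}^{\infty} p(\mu, \sigma \mid \mathbf{Y})\, d\mu \qquad (1.43)$$
are plotted on the top (for µ) and right (for σ) of Figure 1.11; they are the
row and column sums of the joint posterior. By integrating over the other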
parameters, the marginal distribution of a parameter accounts for posterior
uncertainty in the remaining parameters. The marginal distributions are usu-
ally summarized with point and interval estimates as in Table 1.5.
The marginal distributions and their summaries above were computed by
evaluating the joint posterior (1.42) for values of (µ, σ) that form a grid (i.e.,
pixels in Figure 1.11) and then simply summing over columns or rows of the
grid. This is a reasonable approximation for p = 2 variables but quickly be-
comes unfeasible as p increases. Thus, it was only with the advent of more

[Figure 1.11: bivariate posterior of (µ, σ), with µ on the horizontal axis (−4 to 6) and σ on the vertical axis (0 to 10).]

FIGURE 1.11
Bivariate posterior distribution. The bivariate posterior (center) and univariate
marginal posteriors (top for µ and right for σ) for the model with
$Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$, priors µ ∼ Normal(0, 100²) and σ ∼ Uniform(0, 10), and
n = 5 observations Y1 = 2.68, Y2 = 1.18, Y3 = −0.97, Y4 = −0.98, Y5 = −1.03.
efficient computing algorithms in the 1990s that Bayesian statistics became
feasible for even medium-sized applications. These exciting computational de-
velopments are the subject of Chapter 3.
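
The grid calculation described above can be sketched as follows. The grid ranges and resolution are assumptions (the text does not state the ones used for Figure 1.11 and Table 1.5), and `summarize` is a hypothetical helper; with a reasonably wide and fine grid the summaries should land close to those in Table 1.5.

```r
# Data from Table 1.5
Y <- c(2.68, 1.18, -0.97, -0.98, -1.03)
n <- length(Y)

# Grid over (mu, sigma); the ranges and resolution are assumptions
mu    <- seq(-20, 20, length.out = 801)
sigma <- seq(0.01, 10, length.out = 1000)

# Log of the unnormalized posterior (1.42) at every grid point
SS       <- sapply(mu, function(m) sum((Y - m)^2))
log_post <- outer(seq_along(mu), seq_along(sigma), function(i, j) {
  -n * log(sigma[j]) - SS[i] / (2 * sigma[j]^2) - mu[i]^2 / (2 * 100^2)
})

# Normalize so that the grid cells sum to one
post <- exp(log_post - max(log_post))
post <- post / sum(post)

# Marginals: sum over the other parameter
p_mu    <- rowSums(post)
p_sigma <- colSums(post)

# Posterior mean, SD and equal-tailed 95% interval from a discrete marginal
summarize <- function(x, p) {
  m   <- sum(x * p)
  s   <- sqrt(sum((x - m)^2 * p))
  cdf <- cumsum(p)
  c(mean = m, sd = s,
    lower = x[which(cdf >= 0.025)[1]], upper = x[which(cdf >= 0.975)[1]])
}
round(rbind(mu = summarize(mu, p_mu), sigma = summarize(sigma, p_sigma)), 2)
```

Working on the log scale and subtracting the maximum before exponentiating avoids numerical underflow for small σ. The number of grid cells grows exponentially in the number of parameters, which is why this approach does not extend beyond a handful of dimensions.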

1.5 The posterior predictive distribution


Often the objective of a statistical analysis is to build a stochastic model that
can be used to make predictions of future events or impute missing values.
Let Y ∗ be the future observation we would like to predict. Assuming that
the observations are independent given the parameters and that Y ∗ follows
the same model as the observed data, then given θ we have Y ∗ ∼ f (y|θ)
and prediction is straightforward. Unfortunately, we do not know θ exactly,
even after observing Y. A remedy for this is to plug in a value of θ, say, the
posterior mean θ̂ = E(θ|Y), and then sample Y ∗ ∼ f (Y |θ̂). However, this
ignores uncertainty about the unknown parameters. If the posterior variance
of θ is small then its uncertainty is negligible, otherwise a better approach is
needed.
For the sake of prediction, the parameters are not of interest themselves,
but rather they serve as vehicles to transfer information from the data to
the predictive model. We would rather bypass the parameters altogether and
simply use the posterior predictive distribution (PPD)
Y ∗ |Y ∼ f∗ (Y ∗ |Y). (1.44)
The PPD is the distribution of a new outcome given the observed data.
In a parametric model, the PPD naturally accounts for uncertainty in the
model parameters; this is an advantage of the Bayesian framework. The PPD
accounts for parametric uncertainty because it can be written
$$f^*(Y^* \mid \mathbf{Y}) = \int f^*(Y^*, \theta \mid \mathbf{Y})\, d\theta = \int f(Y^* \mid \theta)\, p(\theta \mid \mathbf{Y})\, d\theta, \qquad (1.45)$$
where f is the likelihood density (here we assume that the observations are
independent given the parameters and so f(Y∗|θ) = f(Y∗|θ, Y)), and p is the
posterior density.
To further illustrate how the PPD accounts for parametric uncertainty,
we consider how to draw a sample from the PPD. If we first draw a posterior
sample θ∗ ∼ p(θ|Y) and then a prediction from the likelihood, Y∗|θ∗ ∼
f(Y|θ∗), then Y∗ follows the PPD. A Monte Carlo approximation (Section
3.2) repeats these steps many times to approximate the PPD. Unlike the plug-in
predictor, each predictive draw uses a different value of the parameters and thus
accurately reflects parametric uncertainty. It can be shown that Var(Y ∗ |Y) ≥
Var(Y ∗ |Y, θ) with equality holding only if there is no posterior uncertainty
in the mean of Y ∗ |Y, θ.
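
One way to see where this inequality comes from is the law of total variance, applied conditionally on the observed data. The display below is a sketch of that argument; it relies on the conditional-independence assumption stated above, and the comparison is with the posterior average of the conditional variance.

```latex
\begin{aligned}
\mathrm{Var}(Y^* \mid \mathbf{Y})
  &= \mathrm{E}\{\mathrm{Var}(Y^* \mid \theta, \mathbf{Y}) \mid \mathbf{Y}\}
   + \mathrm{Var}\{\mathrm{E}(Y^* \mid \theta, \mathbf{Y}) \mid \mathbf{Y}\} \\
  &\ge \mathrm{E}\{\mathrm{Var}(Y^* \mid \theta, \mathbf{Y}) \mid \mathbf{Y}\}.
\end{aligned}
```

The second term on the first line is the posterior variance of the conditional mean of Y∗; it is zero exactly when that mean does not vary over the posterior of θ, which is the equality case described above.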

[Figure 1.12: two panels plotting the PMF (vertical axis) against y (horizontal axis) for the plug-in and PPD distributions. Left panel: y = 0, ..., 5. Right panel: y = 0, ..., 20.]

FIGURE 1.12
Posterior predictive distribution for a beta-binomial example. Plots
of the posterior predictive distribution (PPD) from the model Y |θ ∼
Binomial(n, θ) and θ ∼ Beta(1, 1). The “plug-in” PMF is the binomial den-
sity evaluated at the posterior mean θ̂, f (y|θ̂). This is compared with the
full PPD for Y = 1 success in n = 5 trials (left) and Y = 4 successes in
n = 20 trials (right). The PMFs are connected by lines for visualization, but
the probabilities are only defined for y ∈ {0, 1, ..., n}.

As an example, consider the model Y|θ ∼ Binomial(n, θ) and θ ∼
Beta(1, 1). Given the data we have observed (Y and n) we would like to predict
the outcome if we repeat the experiment, Y∗ ∈ {0, 1, ..., n}. The posterior
of θ is θ|Y ∼ Beta(Y + 1, n − Y + 1) and the posterior mean is θ̂ = (Y + 1)/(n + 2).
The solid lines in Figure 1.12 show the plug-in prediction Y∗ ∼ Binomial(n, θ̂)
versus the full PPD (Listing 1.2) that accounts for uncertainty in θ (which is
a Beta-Binomial(n, Y + 1, n − Y + 1) distribution). For both n = 5 and n = 20,
the PPD is considerably wider than the plug-in predictive distribution, as
expected.
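
For this conjugate example the PPD is also available in closed form, which gives a useful check on the Monte Carlo draws in Listing 1.2. The sketch below defines a local helper, `dbetabinom` (a hypothetical name written here for illustration, not a base R function), for the beta-binomial PMF with posterior parameters A = Y + a and B = n − Y + b as in Listing 1.2, and compares it with the plug-in binomial PMF.

```r
# Data and prior as in Listing 1.2
n <- 5; Y <- 1
a <- 1; b <- 1
A <- Y + a
B <- n - Y + b

# Beta-binomial PMF: integral of Binomial(y | n, theta) x Beta(theta | A, B)
dbetabinom <- function(y, n, A, B) {
  choose(n, y) * beta(y + A, n - y + B) / beta(A, B)
}

y       <- 0:n
ppd     <- dbetabinom(y, n, A, B)
plug_in <- dbinom(y, n, A / (A + B))

# Variance of a PMF on y; the PPD should be the more dispersed of the two
pmf_var <- function(p) sum(y^2 * p) - sum(y * p)^2
round(rbind(PPD = ppd, plug_in = plug_in), 2)
c(var_PPD = pmf_var(ppd), var_plug_in = pmf_var(plug_in))
```

The two PMFs share the same mean, but the exact PPD spreads more probability into the tails, in line with the Monte Carlo approximation in Listing 1.2.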

Listing 1.2
Summarizing a posterior predictive distribution (PPD) in R.
> # Load the data
> n <- 5
> Y <- 1
> a <- 1
> b <- 1
> A <- Y + a
> B <- n - Y + b

> # Plug-in estimator
> theta_hat <- A/(A+B)
> y <- 0:5
> PPD <- dbinom(y,n,theta_hat)
> names(PPD) <- y
> round(PPD,2)
   0    1    2    3    4    5
0.19 0.37 0.30 0.12 0.02 0.00

> # Draws from the PPD, Y_star[i]~Binomial(n,theta_star[i])
> S <- 100000
> theta_star <- rbeta(S,A,B)
> Y_star <- rbinom(S,n,theta_star)
> PPD <- table(Y_star)/S
> round(PPD,2)
   0    1    2    3    4    5
0.27 0.30 0.23 0.13 0.05 0.01

1.6 Exercises
1. If X has support X ∈ S = [1, ∞), find the constant c (as a function
of θ) that makes f(x) = c exp(−x/θ) a valid PDF.
2. Assume that X ∼ Uniform(a, b) so the support is S = [a, b] and the
PDF is f (x) = 1/(b − a) for any x ∈ S.
(a) Prove that this is a valid PDF.
(b) Derive the mean and variance of X.
3. Expert knowledge dictates that a parameter must be positive and
that its prior distribution should have the mean 5 and variance 3.
Find a prior distribution that satisfies these constraints.
4. X1 and X2 have joint PMF

x1 x2 Prob(X1 = x1 , X2 = x2 )
0 0 0.15
1 0 0.15
2 0 0.15
0 1 0.15
1 1 0.20
2 1 0.20

(a) Compute the marginal distribution of X1 .


(b) Compute the marginal distribution of X2 .
(c) Compute the conditional distribution of X1 |X2 .
(d) Compute the conditional distribution of X2 |X1 .
(e) Are X1 and X2 independent? Justify your answer.
5. If (X1 , X2 ) is bivariate normal with E(X1 ) = E(X2 ) = 0, Var(X1 ) =
Var(X2 ) = 1, and Cor(X1 , X2 ) = ρ:
(a) Derive the marginal distribution of X1 .
(b) Derive the conditional distribution of X1 |X2 .
6. Assume (X1, X2) have bivariate PDF
$$f(x_1, x_2) = \frac{1}{2\pi}\left(1 + x_1^2 + x_2^2\right)^{-3/2}.$$

(a) Plot the conditional distribution of X1 |X2 = x2 for x2 ∈
{−3, −2, −1, 0, 1, 2, 3} (preferably on the same plot).
(b) Do X1 and X2 appear to be correlated? Justify your answer.
(c) Do X1 and X2 appear to be independent? Justify your answer.
7. According to insurance.com, the 2017 auto theft rate was 135 per
10,000 residents in Raleigh, NC compared to 214 per 10,000 res-
idents in Durham/Chapel Hill. Assuming Raleigh’s population is
twice as large as Durham/Chapel Hill and a car has been stolen
somewhere in the triangle (i.e., one of these two areas), what is the
probability it was stolen in Raleigh?
8. Your daily commute is distributed uniformly between 15 and 20
minutes if there is no convention downtown. However, conventions
are scheduled for roughly 1 in 4 days, and your commute time is
distributed uniformly from 15 to 30 minutes if there is a convention.
Let Y be your commute time this morning.
(a) What is the probability that there was a convention downtown
given Y = 18?
(b) What is the probability that there was a convention downtown
given Y = 28?
9. For this problem pretend we are dealing with a language with a
six-word dictionary

{fun, sun, sit, sat, fan, for}.

An extensive study of literature written in this language reveals that


all words are equally likely except that “for” is α times as likely as
the other words. Further study reveals that:
i. Each keystroke is an error with probability θ.
ii. All letters are equally likely to produce errors.
iii. Given that a letter is typed incorrectly it is equally likely to be
any other letter.
iv. Errors are independent across letters.
For example, the probability of correctly typing “fun” (or any other
word) is (1 − θ)³, the probability of typing “pun” or “fon” when
intending to type “fun” is θ(1 − θ)², and the probability of typing
“foo” or “nnn” when intending to type “fun” is θ²(1 − θ). Use Bayes’
rule to develop a simple spell checker for this language. For each of
the typed words “sun”, “the”, “foo”, give the probability that each
word in the dictionary was the intended word. Perform this for the
parameters below:
(a) α = 2 and θ = 0.1.
(b) α = 50 and θ = 0.1.
(c) α = 2 and θ = 0.95.
Comment on the changes you observe in these three cases.
10. Let X1 ∼ Bernoulli(θ) be the indicator that a tree species occupies
a forest and θ ∈ [0, 1] denote the prior occupancy probability.
The researcher gathers a sample of n trees from the forest and
X2 of them belong to the species of interest. The model for the data is
X2|X1 ∼ Binomial(n, λX1) where λ ∈ [0, 1] is the probability of detecting
the species given it is present. Give expressions in terms of
n, θ and λ for the following joint, marginal and conditional proba-
bilities:
(a) Prob(X1 = X2 = 0).
(b) Prob(X1 = 0).
(c) Prob(X2 = 0).
(d) Prob(X1 = 0|X2 = 0).
(e) Prob(X2 = 0|X1 = 0).
(f) Prob(X1 = 0|X2 = 1).
(g) Prob(X2 = 0|X1 = 1).
(h) Provide intuition for how (d)-(g) change with n, θ and λ.
(i) Assuming θ = 0.5, λ = 0.1, and X2 = 0, how large must n be
before we can conclude with 95% confidence that the species
does not occupy the forest?
11. In a study that uses Bayesian methods to forecast the number of
species that will be discovered in future years, [24] report that the
number of marine bivalve species discovered each year from 2010-2015
was 64, 13, 33, 18, 30 and 20. Denoting Yt as the number of
species discovered in year t and assuming $Y_t \mid \lambda \overset{iid}{\sim} \mathrm{Poisson}(\lambda)$ and
λ ∼ Uniform(0, 100), plot the posterior distribution of λ.
12. Assume that (X, Y ) follow the bivariate normal distribution and
that both X and Y have marginal mean zero and marginal vari-
ance one. We observe six independent and identically distributed
data points: (-3.3, -2.6), (0.1, -0.2), (-1.1, -1.5), (2.7, 1.5), (2.0, 1.9)
and (-0.4, -0.3). Make a scatter plot of the data and, assuming the
correlation parameter ρ has a Uniform(−1, 1) prior, plot the poste-
rior distribution of ρ.
13. The normalized difference vegetation index (NDVI) is commonly
used to classify land cover using remote sensing data. Hypotheti-
cally, say that NDVI follows a Beta(25, 10) distribution for pixels
in a rain forest, and a Beta(10, 15) distribution for pixels in a de-
forested area now used for agriculture. Assuming about 10% of the
rain forest has been deforested, your objective is to build a rule to
classify individual pixels as deforested based on their NDVI.
(a) Plot the PDF of NDVI for forested and deforested pixels, and
the marginal distribution of NDVI averaging over categories.
(b) Give an expression for the probability that a pixel is deforested
given its NDVI value, and plot this probability by NDVI.
(c) You will classify a pixel as deforested if you are at least 90%
sure it is deforested. Following this rule, give the range of NDVI
that will lead to a pixel being classified as deforested.
14. Let n be the unknown number of customers that visit a store on
the day of a sale. The number of customers that make a purchase is
Y |n ∼ Binomial(n, θ) where θ is the known probability of making
a purchase given the customer visited the store. The prior is n ∼
Poisson(5). Assuming θ is known and n is the unknown parameter,
plot the posterior distribution of n for all combinations of Y ∈
{0, 5, 10} and θ ∈ {0.2, 0.5} and comment on the effect of Y and θ
on the posterior.
15. Last spring your lab planted ten seedlings and two survived the
winter. Let θ be the probability that a seedling survives the winter.
(a) Assuming a uniform prior distribution for θ, compute its pos-
terior mean and standard deviation.
(b) Assuming the same prior as in (a), compute and compare the
equal-tailed and highest density 95% posterior credible inter-
vals.
(c) If you plant another 10 seedlings next year, what is the posterior
predictive probability that at least one will survive the winter?
16. X1 and X2 are binary indicators of failure for two parts of a ma-
chine. Independent tests have shown that X1 ∼ Bernoulli(1/2) and
X2 ∼ Bernoulli(1/3). Y1 and Y2 are binary indicators of two system
failures. We know that Y1 = 1 if both X1 = 1 and X2 = 1 and
Y1 = 0 otherwise, and Y2 = 0 if both X1 = 0 and X2 = 0 and
Y2 = 1 otherwise. Compute the following probabilities:
(a) The probability that X1 =1 and X2 = 1 given Y1 = 1.
(b) The probability that X1 =1 and X2 = 1 given Y2 = 1.
(c) The probability that X1 =1 given Y1 = 1.
(d) The probability that X1 =1 given Y2 = 1.
17. The table below has the overall free throw proportion and results
of free throws taken in pressure situations, defined as “clutch”
(https://stats.nba.com/), for ten National Basketball Associa-
tion players (those that received the most votes for the Most Valu-
able Player Award) for the 2016–2017 season. Since the overall pro-
portion is computed using a large sample size, assume it is fixed and
analyze the clutch data for each player separately using Bayesian
methods. Assume a uniform prior throughout this problem.

Player                   Overall proportion   Clutch makes   Clutch attempts
Russell Westbrook        0.845                64             75
James Harden             0.847                72             95
Kawhi Leonard            0.880                55             63
LeBron James             0.674                27             39
Isaiah Thomas            0.909                75             83
Stephen Curry            0.898                24             26
Giannis Antetokounmpo    0.770                28             41
John Wall                0.801                66             82
Anthony Davis            0.802                40             54
Kevin Durant             0.875                13             16

(a) Describe your model for studying the clutch success probability
including the likelihood and prior.
(b) Plot the posteriors of the clutch success probabilities.
(c) Summarize the posteriors in a table.
(d) Do you find evidence that any of the players have a different
clutch percentage than overall percentage?
(e) Are the results sensitive to your prior? That is, do small changes
in the prior lead to substantial changes in the posterior?
18. In the early twentieth century, it was generally agreed that Hamilton
and Madison (ignore Jay for now) wrote 51 and 14 Federalist Pa-
pers, respectively. There was dispute over how to attribute 12 other
papers between these two authors. In the 51 papers attributed to
Hamilton the word “upon” was used 3.24 times per 1,000 words,
compared to 0.23 times per 1,000 words in the 14 papers attributed
to Madison (for historical perspective on this problem, see [58]).
(a) If the word “upon” is used three times in a disputed text of
length 1,000 words and we assume a prior probability of 0.5 for each author,
what is the posterior probability the paper was written by
Hamilton?
(b) Give one assumption you are making in (a) that is likely un-
reasonable. Justify your answer.
(c) In (a), if we changed the number of instances of “upon” to one,
do you expect the posterior probability to increase, decrease or
stay the same? Why?
(d) In (a), if we changed the text length to 10,000 words and number
of instances of “upon” to 30, do you expect the posterior
probability to increase, decrease or stay the same? Why?
(e) Let Y be the observed number of instances of “upon”
in 1,000 words. Compute the posterior probability the paper
was written by Hamilton for each Y ∈ {0, 1, ..., 20}, plot these
posterior probabilities versus Y and give a rule for the number
of instances of “upon” needed before the paper should be
attributed to Hamilton.
2
From prior information to posterior
inference

CONTENTS
2.1 Conjugate priors
    2.1.1 Beta-binomial model for a proportion
    2.1.2 Poisson-gamma model for a rate
    2.1.3 Normal-normal model for a mean
    2.1.4 Normal-inverse gamma model for a variance
    2.1.5 Natural conjugate priors
    2.1.6 Normal-normal model for a mean vector
    2.1.7 Normal-inverse Wishart model for a covariance matrix
    2.1.8 Mixtures of conjugate priors
2.2 Improper priors
2.3 Objective priors
    2.3.1 Jeffreys’ prior
    2.3.2 Reference priors
    2.3.3 Maximum entropy priors
    2.3.4 Empirical Bayes
    2.3.5 Penalized complexity priors
2.4 Exercises

One of the most controversial and yet crucial aspects of a Bayesian model is the
construction of a prior distribution. Often a user is faced with questions
like “Where does the prior distribution come from?” or “What is the true or
correct prior distribution?” There is no concept of the “true, correct
or best” prior distribution, but rather the prior distribution can be viewed as
an initialization of a statistical (in this case a Bayesian) inferential procedure
that gets updated as the data accrue. The choice of a prior distribution is
necessary (as you would need to initiate the inferential machine) but there is
no notion of the ‘optimal’ prior distribution. Choosing a prior distribution is
similar in principle to initializing any other sequential procedure (e.g., iterative
optimization methods like Newton–Raphson, EM, etc.). The choice of such
initialization can be good or bad in the sense of the rate of convergence of
the procedure to its final value, but as long as the procedure is guaranteed to
converge, the choice of prior does not have a permanent impact. As discussed
in Section 7.2, the posterior is guaranteed to converge to the true value under
very general conditions on the prior distribution.
In this chapter we discuss several general approaches for selecting prior
distributions. We begin with conjugate priors. Conjugate priors lead to sim-
ple expressions for the posterior distribution and thus illustrate how prior
information affects the Bayesian analysis. Conjugate priors are useful when
prior information is available, but can also be used when it is not. We con-
clude with objective Bayesian priors that attempt to remove the subjectivity
in prior selection through conventions to be adopted when prior information
is not available.

2.1 Conjugate priors


Conjugate priors are the most convenient choice. A prior and likelihood pair
are conjugate if the resulting posterior is a member of the same family of
distributions as the prior. Therefore, conjugacy refers to a mathematical re-
lationship between two distributions and not to a deeper theme of the appro-
priate way to express prior beliefs. For example, if we select a beta prior then
Figure 1.5 shows that by changing the hyperparameters (i.e., the parameters
that define the prior distribution, in this case a and b) the prior could be
concentrated around a single value or spread equally across the unit interval.
Because both the prior and posterior are members of the same family, the
update from the prior to the posterior affects only the parameters that index
the family. This provides an opportunity to build intuition about Bayesian
learning through simple examples. Conjugate priors are not unique. For ex-
ample, the beta prior is conjugate for both the binomial and negative binomial
likelihood and both a gamma prior and a Bernoulli prior (trivially) are con-
jugate for a Poisson likelihood. Also, conjugate priors are somewhat limited
because not all likelihood functions have a known conjugate prior, and most
conjugacy pairs are for small examples with only a few parameters. These
limitations are abated through hierarchical modeling (Chapter 6) and Gibbs
sampling (Chapter 3) which provide a framework to build rich statistical mod-
els by layering simple conjugacy pairs.
Below we discuss several conjugate priors and mathematical tools needed
to derive the corresponding posteriors. Detailed derivations of many of the
posterior distributions are deferred to Appendix A.3, and Appendix A.2 has
an abridged list of conjugacy results.

2.1.1 Beta-binomial model for a proportion


Returning to the beta-binomial example in Section 1.2, the data Y ∈
{0, 1, ..., n} is the number of successes in n independent trials each with
success probability θ. The likelihood is then Y|θ ∼ Binomial(n, θ). Since only
the terms in the likelihood that involve θ affect the posterior, we focus on its
kernel, i.e., the terms that involve θ. The kernel of the binomial PMF is

\[ f(Y \mid \theta) \propto \theta^{Y} (1 - \theta)^{n - Y}. \tag{2.1} \]

If we view the likelihood as a function of θ, it resembles a beta distribution,
and so we might suspect that a beta distribution is the conjugate prior for
the binomial likelihood. If we select the prior θ ∼ Beta(a, b), then (as shown
in Section 1.2) this combination of likelihood and prior leads to the posterior
distribution
\[ \theta \mid Y \sim \mathrm{Beta}(A, B), \tag{2.2} \]
where the updated parameters are A = Y + a and B = n − Y + b. Since both
the prior and posterior belong to the beta family of distributions, this is an
example of a conjugate prior.
The Beta(A, B) distribution has mean A/(A + B), and so the prior and
posterior means are
\[ E(\theta) = \hat{\theta}_0 = \frac{a}{a + b} \quad \text{and} \quad E(\theta \mid Y) = \hat{\theta}_1 = \frac{Y + a}{n + a + b}. \tag{2.3} \]
The prior and posterior means are both estimators of the population propor-
tion, and thus we denote them as θ̂0 and θ̂1 , respectively. The prior mean θ̂0
is an estimator of θ before observing the data, and this is updated to θ̂1 by
the observed data. A natural estimator of the population proportion θ is the
sample proportion θ̂ = Y /n, which is the number of successes divided by the
number of trials. Comparing the sample proportion to the posterior mean, the
posterior mean adds a to the number of successes in the numerator and a + b
to the number of trials in the denominator. Therefore, we can think of a as
the prior number of successes, a + b as the prior number of trials, and thus b
as the prior number of failures (see Section 2.1.5 for a tie to natural conjugate
priors).
Viewing the hyperparameters as the prior number of successes and failures
provides a means of balancing the information in the prior with the informa-
tion in the data. For example, the Beta(0.5, 0.5) prior in Figure 1.5 has one
prior observation and the uniform Beta(1, 1) prior contributes one prior suc-
cess and one prior failure. If the prior is meant to reflect prior ignorance then
we should select small a and b, and if there is strong prior information that θ
is approximately θ0 , then we should select a and b so that θ0 = a/(a + b) and
a + b is large.
The posterior mean can also be written as

\[ \hat{\theta}_1 = (1 - w_n)\hat{\theta}_0 + w_n\hat{\theta} \tag{2.4} \]

where w_n = n/(n + a + b) is the weight given to the sample proportion and
1 − w_n is the weight given to the prior mean. This confirms the intuition that
for any prior (a and b), if the sample size n is small the posterior mean is
approximately the prior mean, and as the sample size increases the posterior
mean becomes closer to the sample proportion. Also, as a + b → 0, w_n → 1, so
as we make the prior vague with large variance, the posterior mean coincides
with the sample proportion for any sample size n.

FIGURE 2.1
Posterior distributions from the beta-binomial model. Plot of the
posterior of θ from the model Y|θ ∼ Binomial(n, θ) and θ ∼ Beta(1, 1) for various
n and Y. Curves are shown for (Y, n) = (0, 0), (3, 10), (12, 50), (21, 100)
and (39, 200).
Figure 2.1 plots the posterior for various n and Y . This plot is meant to
illustrate a sequential analysis. In most cases the entire data set is analyzed
in a single analysis, but in some cases data are analyzed as they arrive and a
Bayesian analysis provides a framework to analyze data sequentially. In Figure
2.1, before any data are collected θ has a uniform prior. After 10 observations,
there are three successes and the posterior concentrates below 0.5. After an
additional 40 samples, the posterior centers on the sample proportion 12/50.
As data accrue, the posterior converges to θ ≈ 0.2 and the posterior variance
decreases.
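The updates behind Figure 2.1 can be reproduced in a few lines of R. The following is a minimal sketch, assuming the cumulative counts listed in the figure legend and the uniform Beta(1, 1) prior; the variable names and the use of the standard beta standard-deviation formula are illustrative choices, not taken from the text.

```r
# Beta-binomial updating for the cumulative data shown in Figure 2.1
a <- 1; b <- 1                      # Beta(1, 1) prior hyperparameters
Y <- c(0, 3, 12, 21, 39)            # cumulative number of successes
n <- c(0, 10, 50, 100, 200)         # cumulative number of trials

A <- Y + a                          # posterior parameters, equation (2.2)
B <- n - Y + b
post_mean <- A / (A + B)            # posterior mean, equation (2.3)
post_sd   <- sqrt(A * B / ((A + B)^2 * (A + B + 1)))  # standard beta SD

# Weighted-average form of the posterior mean, equation (2.4)
w          <- n / (n + a + b)
prior_mean <- a / (a + b)
samp_prop  <- ifelse(n > 0, Y / n, 0)   # placeholder when n = 0, where w = 0
shrunk     <- (1 - w) * prior_mean + w * samp_prop

print(data.frame(n, Y, post_mean, post_sd, w))
```

The columns `post_mean` and `shrunk` agree, and `post_sd` shrinks as n grows, which is the narrowing of the curves described above.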

2.1.2 Poisson-gamma model for a rate


The Poisson-gamma conjugate pair is useful for estimating the rate of events
per unit of observation effort, denoted θ. For example, an ecologist may survey
N acres of a forest and observe Y ∈ {0, 1, ...} individuals of the species of
interest, or a company may observe Y employee injuries in N person-hours
on the job. For these data, we might assume the model Y |θ ∼ Poisson(N θ)
where N is the known sampling effort and θ > 0 is the unknown event rate. The
likelihood is
\[ f(Y \mid \theta) = \frac{\exp(-N\theta)(N\theta)^{Y}}{Y!} \propto \exp(-N\theta)\,\theta^{Y}. \tag{2.5} \]
The kernel of the likelihood resembles a Gamma(a, b) distribution

\[ \pi(\theta) = \frac{b^{a}}{\Gamma(a)} \exp(-b\theta)\,\theta^{a-1} \propto \exp(-b\theta)\,\theta^{a-1} \tag{2.6} \]

in the sense that θ appears in the PDF raised to a power and in the exponent.
Combining the likelihood and prior gives the posterior

\[ p(\theta \mid Y) \propto [\exp(-N\theta)\,\theta^{Y}] \cdot [\exp(-b\theta)\,\theta^{a-1}] \propto \exp(-B\theta)\,\theta^{A-1}, \tag{2.7} \]

where A = Y + a and B = N + b. The posterior of θ is thus θ|Y ∼
Gamma(A, B), and the gamma prior is conjugate.
A simple estimate of the expected number of events per unit of effort
is the sample event rate θ̂ = Y/N. The mean and variance of a Gamma(A, B)
distribution are A/B and A/B², respectively, and so the posterior mean under
the Poisson-gamma model is
\[ E(\theta \mid Y) = \frac{Y + a}{N + b}. \tag{2.8} \]
Therefore, compared to the sample event rate we add a events to the numerator
and b units of effort to the denominator. As in the beta-binomial example,
the posterior mean can be written as a weighted average of the sample rate
and the prior mean,
\[ E(\theta \mid Y) = (1 - w_N)\frac{a}{b} + w_N\frac{Y}{N}, \tag{2.9} \]
where w_N = N/(N + b) is the weight given to the sample rate Y/N and 1 − w_N
is the weight given to the prior mean a/b. As b → 0, w_N → 1, and so for a
vague prior with large variance the posterior mean coincides with the sample
rate. A common setting for the hyperparameters is a = b = ε for some small
value ε, which gives prior mean 1 and large prior variance 1/ε.
NFL concussions example: Concussions are an increasingly serious concern
in the National Football League (NFL). The NFL has 32 teams and
each team plays 16 regular-season games per year, for a total of N = 32 ·
16/2 = 256 games. According to Frontline/PBS
(http://apps.frontline.org/concussion-watch) there were Y1 = 171 concussions
in 2012, Y2 = 152 concussions in 2013, Y3 = 123 concussions in 2014, and
Y4 = 199 concussions in 2015. Figure 2.2 plots the posterior of the concussion
rate for each year assuming a = b = 0.1. Comparing only 2014 with 2015 there
does appear to be some evidence of an increase in the concussion rate per game;
for 2014 the posterior mean and 95% interval are 0.48 (A/B) and (0.40, 0.57)
(computed using the Gamma(A, B) quantile function), compared to 0.78 and
(0.67, 0.89) for 2015. However, 2015 does not appear to be significantly
different from 2012 or 2013.

FIGURE 2.2
Posterior distributions for the NFL concussion example. Plot of the
posterior of θ from the model Y|θ ∼ Poisson(Nθ) and θ ∼ Gamma(a, b),
where N = 256 is the number of games played in an NFL season, Y is the
number of concussions in a year, and a = b = 0.1 are the hyperparameters.
The plot gives the posterior for years 2012–2015, which had 171, 152,
123, and 199 concussions, respectively.
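The posterior summaries quoted for the NFL example follow directly from the Gamma(A, B) update. The sketch below assumes the counts and hyperparameters given in the text and uses base R's `qgamma` with the shape/rate parameterization; the variable names are illustrative.

```r
# Poisson-gamma posterior for the NFL concussion counts quoted in the text
N <- 256                                   # games per season
Y <- c("2012" = 171, "2013" = 152, "2014" = 123, "2015" = 199)
a <- 0.1; b <- 0.1                         # Gamma(a, b) prior hyperparameters

A <- Y + a                                 # posterior shape
B <- N + b                                 # posterior rate
post_mean <- A / B                         # equation (2.8)
lower <- qgamma(0.025, shape = A, rate = B)
upper <- qgamma(0.975, shape = A, rate = B)

print(round(cbind(post_mean, lower, upper), 2))

# Weight given to the sample rate Y/N in equation (2.9)
w_N <- N / (N + b)
```

Because b = 0.1 is tiny relative to N = 256, `w_N` is close to one and each posterior mean is nearly the raw rate Y/N.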

2.1.3 Normal-normal model for a mean


Gaussian responses play a key role in applied and theoretical statistics. The
t-test, ANalysis Of VAriance (ANOVA), and linear regression all assume Gaussian
responses. We develop these methods in Chapter 4, but here we discuss
conjugate priors for the simpler model Yi|µ, σ² ∼ Normal(µ, σ²), independent
and identically distributed (iid) for i = 1, ..., n. This model has two
parameters, the mean µ and the variance σ². In this subsection we assume that
σ² is fixed and focus on estimating µ; in the following subsection we analyze
σ² given µ, and Chapter 4 derives the joint posterior of both parameters.
Assuming σ² to be fixed, a conjugate prior for the unknown mean µ is
µ ∼ Normal(µ0, σ²/m). The prior variance is proportional to σ² to express
prior uncertainty on the scale of the data. The hyperparameter m > 0 controls
the strength of the prior, with small m giving large prior variance and vice
versa. Appendix A.3 shows that the posterior is

\[ \mu \mid \mathbf{Y} \sim \mathrm{Normal}\left( w\bar{Y} + (1 - w)\mu_0,\ \frac{\sigma^2}{n + m} \right), \tag{2.10} \]

where w = n/(n + m) ∈ [0, 1] is the weight given to the sample mean
Ȳ = Σ_{i=1}^n Yi/n. Letting m → 0 gives w → 1 so the posterior mean coincides
with the sample mean as the prior variance increases.

The posterior standard deviation is σ/√(n + m), which is less than the standard
error of the sample mean, σ/√n. The prior with hyperparameter m
reduces the posterior standard deviation by the same amount as adding an
additional m observations, and this stabilizes a Bayesian analysis. For Gaussian
data analyses such as ANOVA or regression the prior mean µ0 is often set
to zero. In the normal-normal model with µ0 = 0 the posterior mean estimator
is E(µ|Y) = wȲ. More generally, E(µ|Y) = w(Ȳ − µ0) + µ0. This is an example
of a shrinkage estimator because the sample mean is shrunk towards the
prior mean by the shrinkage factor w ∈ [0, 1]. As will be discussed in Chapter
4, shrinkage estimators have advantages, particularly in hard problems such
as regression with many predictors and/or small sample size.
Blood alcohol concentration (BAC) example: The BAC level is the
percentage of your blood that is alcohol. The legal limit for operating
a vehicle is BAC ≤ 0.08 in most US states. Let Y be the measured
BAC and µ be your true BAC. Of course, the BAC test has error, and the
error standard deviation for a sample near the legal limit has been established
in (hypothetical) laboratory tests to be σ = 0.01, so that the likelihood of
the data is Y|µ ∼ Normal(µ, 0.01²). Your BAC is measured to be Y = 0.082,
which is above the legal limit.
Your defense is that you had two drinks, and that the BAC of someone
your size after two drinks has been shown to follow a Normal(0.05, 0.02²)
distribution depending on the person’s metabolism and the timing of the drinks.
Figure 2.3 plots the prior µ ∼ Normal(0.05, 0.02²), i.e., with m = 0.25
(since σ²/m = 0.01²/0.25 = 0.02²). The prior probability that your BAC exceeds
0.08 is 0.067. The posterior distribution with n = 1, Y = 0.082, σ = 0.01 and
m = 0.25 is
\[ \mu \mid Y \sim \mathrm{Normal}(0.0756, 0.0089^2). \tag{2.11} \]
Therefore, Prob(µ > 0.08|Y) = 0.311 and so there is considerable uncertainty
about whether your BAC exceeds the legal limit and in fact the posterior odds
that your BAC is below the legal limit,
\[ \frac{\mathrm{Prob}(\mu \le 0.08 \mid Y)}{\mathrm{Prob}(\mu > 0.08 \mid Y)}, \]
are greater than two.

FIGURE 2.3
Posterior distributions for the blood alcohol content example. Plot
of the prior and posterior PDF (density against true BAC level, %), with the
prior (0.067) and posterior (0.311) probabilities of exceeding the legal limit
of 0.08 shaded.
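The numbers in the BAC example can be checked with base R's `pnorm`. This is a minimal sketch under the stated values; `tau0` (the prior standard deviation) and the other variable names are introduced here for illustration only.

```r
# Normal-normal posterior for the BAC example
sigma <- 0.01                    # measurement error standard deviation
Y     <- 0.082                   # measured BAC
n     <- 1                       # a single measurement
mu0   <- 0.05                    # prior mean
tau0  <- 0.02                    # prior standard deviation (illustrative name)
m     <- sigma^2 / tau0^2        # prior variance is sigma^2 / m, so m = 0.25

w         <- n / (n + m)                 # weight on the sample mean
post_mean <- w * Y + (1 - w) * mu0       # posterior mean, equation (2.10)
post_sd   <- sigma / sqrt(n + m)         # posterior standard deviation

prior_prob <- 1 - pnorm(0.08, mu0, tau0)            # Prob(mu > 0.08)
post_prob  <- 1 - pnorm(0.08, post_mean, post_sd)   # Prob(mu > 0.08 | Y)
post_odds  <- (1 - post_prob) / post_prob           # odds of being under the limit

print(round(c(m = m, post_mean = post_mean, post_sd = post_sd,
              prior_prob = prior_prob, post_prob = post_prob,
              post_odds = post_odds), 4))
```

With w = 0.8 the measurement dominates, but the prior still pulls the estimate from 0.082 down to 0.0756, below the legal limit.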

2.1.4 Normal-inverse gamma model for a variance


Next we turn to estimating a Gaussian variance assuming the mean is fixed.
As before, the sampling density is Yi|σ² ∼ Normal(µ, σ²), iid for i = 1, ..., n.
With the mean fixed, the likelihood is

\[ f(\mathbf{Y} \mid \sigma^2) \propto \prod_{i=1}^{n} \frac{1}{\sigma} \exp\left\{ -\frac{(Y_i - \mu)^2}{2\sigma^2} \right\} \propto (\sigma^2)^{-n/2} \exp\left( -\frac{\mathrm{SSE}}{2\sigma^2} \right) \tag{2.12} \]

where SSE = Σ_{i=1}^n (Yi − µ)². The likelihood has σ² raised to a negative power
and σ² in the denominator of the exponent. Of the distributions in Appendix
A.1 with support [0, ∞), only the inverse gamma PDF has these properties.
Taking the prior σ² ∼ InvGamma(a, b) gives
\[ \pi(\sigma^2) \propto (\sigma^2)^{-(a+1)} \exp\left( -\frac{b}{\sigma^2} \right). \tag{2.13} \]
Combining the likelihood and prior gives the posterior
\[ p(\sigma^2 \mid \mathbf{Y}) \propto f(\mathbf{Y} \mid \sigma^2)\,\pi(\sigma^2) \propto (\sigma^2)^{-(A+1)} \exp\left( -\frac{B}{\sigma^2} \right), \tag{2.14} \]
where A = n/2 + a and B = SSE/2 + b are the updated parameters, and
therefore
\[ \sigma^2 \mid \mathbf{Y} \sim \mathrm{InvGamma}(A, B). \tag{2.15} \]
The posterior mean (if n/2 + a > 1) is
\[ E(\sigma^2 \mid \mathbf{Y}) = \frac{\mathrm{SSE}/2 + b}{n/2 + a - 1} = \frac{\mathrm{SSE} + 2b}{n - 1 + 2a - 1}. \tag{2.16} \]
Therefore, if we take the hyperparameters to be a = 1/2 + m/2 and b = ε/2 for
small m and ε, then the posterior-mean estimator is E(σ²|Y) = (SSE + ε)/(n −
1 + m) and compared to the usual sample variance SSE/(n − 1) the small
values ε and m are added to the numerator and denominator, respectively. In
this sense, the prior adds an additional m degrees of freedom for estimating
the variance, which can stabilize the estimator if n is small.
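As a quick numerical check of the algebra, the sketch below computes the InvGamma(A, B) posterior for a small simulated sample and confirms that the posterior mean matches (SSE + ε)/(n − 1 + m). The simulated data, the seed and the values m = ε = 0.1 are assumptions made purely for illustration.

```r
# Inverse gamma posterior for a normal variance with known mean
set.seed(1)
mu  <- 0                                  # known mean
Y   <- rnorm(8, mean = mu, sd = 2)        # small simulated sample (illustrative)
n   <- length(Y)
SSE <- sum((Y - mu)^2)

m   <- 0.1; eps <- 0.1                    # small hyperparameter values
a   <- 1/2 + m/2
b   <- eps / 2

A <- n/2 + a                              # posterior parameters, equation (2.15)
B <- SSE/2 + b
post_mean <- B / (A - 1)                  # equation (2.16)
shortcut  <- (SSE + eps) / (n - 1 + m)    # simplified form from the text

# 95% interval for sigma^2: since 1/sigma^2 ~ Gamma(A, B), invert gamma quantiles
interval <- 1 / qgamma(c(0.975, 0.025), shape = A, rate = B)

print(c(post_mean = post_mean, shortcut = shortcut,
        lower = interval[1], upper = interval[2]))
```

The interval uses only `qgamma`, so no extra package is needed for the inverse gamma distribution.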
Conjugate prior for a precision: We have introduced the Gaussian
distribution as having two parameters, the mean and the variance. However, it
can also be parameterized in terms of its mean and precision (inverse variance),
τ = 1/σ². In particular, the JAGS package used in this book employs this
parameterization. In this parameterization, Y|µ, τ ∼ Normal(µ, τ) has PDF
\[ f(Y \mid \mu, \tau) = \frac{\tau^{1/2}}{\sqrt{2\pi}} \exp\left[ -\frac{\tau}{2}(Y - \mu)^2 \right]. \tag{2.17} \]
This parameterization makes derivations and computations slightly easier,
especially for the multivariate normal distribution where using a precision
matrix (inverse covariance matrix) avoids some matrix inversions.
Not surprisingly, the conjugate prior for the precision is the gamma family.
If Yi ∼ Normal(µ, τ), iid for i = 1, ..., n, then the likelihood is
\[ f(\mathbf{Y} \mid \tau) \propto \prod_{i=1}^{n} \tau^{1/2} \exp\left[ -\frac{\tau}{2}(Y_i - \mu)^2 \right] \propto \tau^{n/2} \exp\left( -\frac{\tau}{2}\,\mathrm{SSE} \right), \tag{2.18} \]
and the Gamma(a, b) prior is π(τ) ∝ τ^{a−1} exp(−τb). Combining the likelihood
and prior gives

\[ p(\tau \mid \mathbf{Y}) \propto f(\mathbf{Y} \mid \tau)\,\pi(\tau) \propto \tau^{A-1} \exp(-\tau B), \tag{2.19} \]

where A = n/2 + a and B = SSE/2 + b are the updated parameters. Therefore,

\[ \tau \mid \mathbf{Y} \sim \mathrm{Gamma}(A, B). \tag{2.20} \]

The InvGamma(a, b) prior for the variance and the Gamma(a, b) prior for
the precision give the exact same posterior distribution. That is, if we use
the InvGamma(a, b) prior for the variance and then convert this to obtain the
posterior of 1/σ², the results are identical to those obtained by conducting
the analysis with a Gamma(a, b) prior for the precision. Throughout the book
we use the mean-variance parameterization except for cases involving JAGS
code, where we adopt its mean-precision parameterization.
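The equivalence of the two parameterizations is a change of variables, which can be verified numerically. In this sketch `dinvgamma` is a helper defined here (not a base R function), and the values of A and B are arbitrary illustrative choices.

```r
# Inverse gamma density from equation (2.13), including its normalizing constant
dinvgamma <- function(x, a, b) b^a / gamma(a) * x^(-(a + 1)) * exp(-b / x)

A  <- 5.5; B <- 12                   # illustrative posterior parameters
s2 <- seq(0.5, 10, by = 0.5)         # grid of variance values

# Density of sigma^2 = 1/tau implied by tau ~ Gamma(A, B):
# p(sigma^2) = p_tau(1/sigma^2) * |d tau / d sigma^2| = p_tau(1/sigma^2) / sigma^4
implied <- dgamma(1 / s2, shape = A, rate = B) / s2^2

# Largest discrepancy on the grid; it should be at rounding-error level
print(max(abs(implied - dinvgamma(s2, A, B))))
```

The same identity justifies drawing variances as `1 / rgamma(...)` when only a gamma sampler is available.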

2.1.5 Natural conjugate priors


We have thus far catalogued a series of popular sampling distributions and
corresponding choices of the conjugate family of priors. Is there a natural way
of constructing a class of conjugate priors given a specific sampling density
f (y|θ)? It turns out for many of these familiar choices of sampling densities
(e.g., exponential family) we can construct a class of conjugate priors, and
priors constructed in this manner are called natural conjugate priors.
Let Yi ∼ f(y|θ), iid for i = 1, ..., n, and consider a class of priors defined by
\[ \pi(\theta \mid y_1^0, \ldots, y_m^0, m) \propto \prod_{j=1}^{m} f(y_j^0 \mid \theta) \tag{2.21} \]

where the y_j^0 are some arbitrary fixed values in the support of the sampling
distribution for j = 1, ..., m and m ≥ 1 is a fixed integer. The
pseudo-observations y_j^0 can be seen as the hyperparameters of the prior
distribution and such a prior is a proper distribution if there exists m such
that ∫ ∏_{j=1}^m f(y_j^0|θ) dθ < ∞.
To see that the prior defined in (2.21) is indeed conjugate notice that
\[ p(\theta \mid \mathbf{Y}) \propto \pi(\theta \mid y_1^0, \ldots, y_m^0, m) \prod_{i=1}^{n} f(Y_i \mid \theta) \propto \prod_{j=1}^{M} f(y_j^* \mid \theta) \tag{2.22} \]

where M = m + n and y_j^* = y_j^0 for j = 1, ..., m and y_j^* = Y_{j−m} for
j = m + 1, ..., m + n. Thus, the posterior distribution belongs to the same class
of distributions as in (2.21).
Below we revisit some previous examples using this method of creating
natural conjugate priors.
Bernoulli trials (Section 2.1.1): When Y|θ ∼ Bernoulli(θ), we have
f(y|θ) ∝ θ^y (1 − θ)^{1−y} and so the natural conjugate prior with the first s0
pseudo observations equal to y_j^0 = 1 and the remaining m − s0 pseudo
observations set to y_j^0 = 0 gives the prior
π(θ|y_1^0, ..., y_m^0, m) ∝ θ^{s0} (1 − θ)^{m−s0}, that is,
θ ∼ Beta(s0 + 1, m − s0 + 1). This beta prior is restricted to have integer-valued
hyperparameters, but once we see the form we can relax the assumption that
m and s0 are integer valued.
Poisson counts (Section 2.1.2): When Y|θ ∼ Poisson(θ), we have
f(y|θ) ∝ θ^y e^{−θ} and so the natural conjugate prior by using equation (2.21)
is given by π(θ|y_1^0, ..., y_m^0, m) ∝ θ^{s0} e^{−mθ}, where s0 = Σ_{j=1}^m y_j^0,
and this naturally leads to a gamma prior distribution θ ∼ Gamma(s0 + 1, m).
Therefore, we can view the prior as consisting of m pseudo observations with
sample rate s0/m. Again, the restriction that m is an integer can be relaxed
once it is revealed that the prior is from the gamma family.
Normal distribution with fixed variance (Section 2.1.3): Assuming
Y|µ ∼ Normal(µ, σ²) with σ² fixed, we have f(y|µ) ∝ exp{−(y − µ)²/(2σ²)} and
so the natural conjugate prior is
π(µ|y_1^0, ..., y_m^0, m) ∝ exp{−Σ_{j=1}^m (y_j^0 − µ)²/(2σ²)} ∝
exp{−m(µ − ȳ^0)²/(2σ²)}, where ȳ^0 = Σ_{j=1}^m y_j^0/m, and this naturally leads
to the prior µ ∼ Normal(ȳ^0, σ²/m). Therefore, the prior can be viewed as
consisting of m pseudo observations with mean ȳ^0.
This systematic way of obtaining a conjugate prior takes away the mystery
of first guessing the form of the conjugate prior and then verifying its conju-
gacy. The procedure works well when faced with problems that do not have a
familiar likelihood and works even when we have vector-valued parameters.
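The Bernoulli case can be checked numerically: multiplying the likelihoods of the pseudo-observations and normalizing should recover the Beta(s0 + 1, m − s0 + 1) density. The pseudo-observations below and the function name `kernel` are hypothetical choices for this sketch.

```r
# Pseudo-observations: s0 ones followed by m - s0 zeros (illustrative values)
y0 <- c(1, 1, 1, 0, 0, 0, 0, 0)
m  <- length(y0)
s0 <- sum(y0)

# Unnormalized prior from equation (2.21): product of Bernoulli likelihoods
kernel <- function(theta) {
  sapply(theta, function(t) prod(dbinom(y0, size = 1, prob = t)))
}

# Normalize numerically and compare with the Beta(s0 + 1, m - s0 + 1) density
const    <- integrate(kernel, lower = 0, upper = 1)$value
theta    <- seq(0.05, 0.95, by = 0.05)
max_diff <- max(abs(kernel(theta) / const - dbeta(theta, s0 + 1, m - s0 + 1)))
print(max_diff)
```

The same recipe applies to a likelihood without a familiar conjugate family: build the kernel from pseudo-observations first, then identify or normalize the resulting density.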

2.1.6 Normal-normal model for a mean vector


In Bayesian linear regression the parameter of interest is the vector of re-
gression coefficients. This is discussed extensively in Chapter 4, but here we
provide the conjugacy relationship that underlies the regression analysis. Al-
though still a bit cumbersome, linear regression notation is far more concise
using matrices. Say the n-vector Y is multivariate normal

\[ \mathbf{Y} \mid \boldsymbol{\beta} \sim \mathrm{Normal}(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma}). \tag{2.23} \]

The mean of Y is decomposed as the product of the known n × p matrix X and the
unknown p-vector β, and Σ is the n × n covariance matrix. The prior for β is
multivariate normal with mean µ and p × p covariance matrix Ω,

\[ \boldsymbol{\beta} \sim \mathrm{Normal}(\boldsymbol{\mu}, \boldsymbol{\Omega}). \tag{2.24} \]

As shown in Appendix A.3, the posterior of β is multivariate normal

\[ \boldsymbol{\beta} \mid \mathbf{Y} \sim \mathrm{Normal}\left[ \boldsymbol{\Sigma}_{\beta}\left( \mathbf{X}^T \boldsymbol{\Sigma}^{-1} \mathbf{Y} + \boldsymbol{\Omega}^{-1} \boldsymbol{\mu} \right),\ \boldsymbol{\Sigma}_{\beta} \right], \tag{2.25} \]

where Σ_β = (X^T Σ^{-1} X + Ω^{-1})^{-1}. In standard linear regression the errors
are assumed to be independent and identically distributed and thus the
covariance is proportional to the identity matrix, Σ = σ² I_n. In this case, if
the prior is uninformative with Ω^{-1} ≈ 0, then the posterior mean reduces to
the familiar least squares estimator (X^T X)^{-1} X^T Y and the posterior
covariance reduces to the covariance of the sampling distribution of the least
squares estimator, σ² (X^T X)^{-1}.
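The reduction to least squares can be seen numerically. The sketch below simulates a small regression data set (the design, coefficients and seed are assumptions for illustration), evaluates the posterior in equation (2.25) with a nearly flat prior, and compares it with the least squares solution.

```r
# Posterior of regression coefficients under a nearly flat normal prior
set.seed(2)
n <- 40; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # simulated design matrix
beta_true <- c(1, -2, 0.5)
sigma2    <- 1.5
Y <- drop(X %*% beta_true + rnorm(n, sd = sqrt(sigma2)))

Sigma_inv <- diag(n) / sigma2        # Sigma = sigma2 * I_n
mu        <- rep(0, p)               # prior mean
Omega_inv <- diag(p) * 1e-8          # Omega^{-1} close to zero

# Posterior covariance and mean from equation (2.25)
Sigma_beta <- solve(t(X) %*% Sigma_inv %*% X + Omega_inv)
post_mean  <- drop(Sigma_beta %*% (t(X) %*% Sigma_inv %*% Y + Omega_inv %*% mu))

# Least squares estimate and its sampling covariance for comparison
ols     <- drop(solve(t(X) %*% X, t(X) %*% Y))
ols_cov <- sigma2 * solve(t(X) %*% X)

print(cbind(post_mean, ols))
```

Increasing the diagonal of `Omega_inv` (a more informative prior) shrinks `post_mean` away from `ols` and towards the prior mean `mu`.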

2.1.7 Normal-inverse Wishart model for a covariance matrix


Say Y1 , ..., Yn are vectors of length p that are independently distributed as
multivariate normal with known mean vectors µi and p×p unknown covariance
matrix Σ. The conjugate prior for the covariance matrix is the inverse Wishart
prior. The inverse Wishart family’s support is symmetric positive definite
matrices (i.e., covariance matrices), and reduces to the inverse gamma family
if p = 1. The inverse Wishart prior with degrees of freedom ν > p − 1 and
p × p symmetric and positive definite scale matrix R has PDF
\[ \pi(\boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-(\nu + p + 1)/2} \exp\left\{ -\frac{1}{2}\,\mathrm{Trace}(\boldsymbol{\Sigma}^{-1}\mathbf{R}) \right\} \tag{2.26} \]
and mean E(Σ) = R/(ν − p − 1) assuming ν > p + 1. The prior concentration
around R/(ν − p − 1) increases with ν, and therefore small ν, say ν = p − 1 + ε
for small ε > 0, gives the least informative prior. An interesting special case is
when ν = p + 1 and R is a diagonal matrix. This induces a uniform prior on
each off-diagonal element of the correlation matrix corresponding to covariance
matrix Σ.
As shown in Appendix A.3, the posterior is
\[ \boldsymbol{\Sigma} \mid \mathbf{Y} \sim \mathrm{InvWishart}_p\left( n + \nu,\ \sum_{i=1}^{n} (\mathbf{Y}_i - \boldsymbol{\mu}_i)(\mathbf{Y}_i - \boldsymbol{\mu}_i)^T + \mathbf{R} \right). \tag{2.27} \]
The posterior mean is [Σ_{i=1}^n (Yi − µi)(Yi − µi)^T + R]/[n + ν − p − 1]. For
R ≈ 0 and ν ≈ p + 1, the posterior mean is approximately
[Σ_{i=1}^n (Yi − µi)(Yi − µi)^T]/n, which is the sample covariance matrix
assuming the means are known and not replaced by the sample means.
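The closed-form posterior mean can be compared with a Monte Carlo approximation using only base R. This sketch relies on the fact that if Ω ∼ Wishart(n + ν, V) with V the inverse of the inverse Wishart scale matrix, then Σ = Ω^{-1} follows the inverse Wishart posterior in (2.27); the simulated data, seed and dimensions are illustrative assumptions.

```r
# Inverse Wishart posterior for a covariance matrix with known mean vector
set.seed(3)
p <- 3; n <- 50
mu <- rep(0, p)                                    # known mean vector
Sigma_true <- matrix(c(1.0, 0.5, 0.2,
                       0.5, 1.0, 0.5,
                       0.2, 0.5, 1.0), p, p)
Y <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_true)   # rows ~ Normal(mu, Sigma_true)

resid <- sweep(Y, 2, mu, "-")
SS    <- t(resid) %*% resid                        # sum of (Y_i - mu)(Y_i - mu)^T

nu <- p + 1                                        # prior degrees of freedom
R  <- diag(p) / nu                                 # prior scale matrix

# Closed-form posterior mean implied by equation (2.27)
post_mean <- (SS + R) / (n + nu - p - 1)

# Monte Carlo check: draw Omega ~ Wishart(n + nu, solve(SS + R)) and invert
S <- 5000
Omega_draws <- rWishart(S, df = n + nu, Sigma = solve(SS + R))
mc_mean <- Reduce(`+`, lapply(1:S, function(s) solve(Omega_draws[, , s]))) / S

print(round(post_mean, 3))
print(round(mc_mean, 3))
```

With ν = p + 1 the denominator n + ν − p − 1 equals n, so the posterior mean is (SS + R)/n, close to the known-mean sample covariance.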
Marathon example: Figure 2.4 plots the data for several of the top
female runners in the 2016 Boston Marathon. Let Yij be the speed (minutes/mile)
for runner i = 1, ..., n and mile j = 1, ..., p = 26. For this analysis,
we have discarded all runners with missing data (for a missing data analysis
see Section 6.4), leaving n = 59 observations Yi = (Yi1, ..., Yip)^T. We analyze
the covariance of the runners’ data to uncover patterns and possibly strategy.
For simplicity we conduct the analysis conditioning on
µij = Ȳj = Σ_{i=1}^n Yij/n, i.e., the sample mean for mile j. For the prior, we
take ν = p + 1 and R = Ip/ν so that elements of the correlation matrix have
Uniform(−1, 1) priors. The code in Listing 2.1 generates S samples from the
posterior of Σ and uses the samples to approximate the posterior mean. To avoid
storage problems, rather than storing all S samples the code simply retains the
running mean of the samples. However, Monte Carlo sampling produces the full
joint distribution of all elements of Σ. In particular, for each draw from the
posterior, Listing 2.1 converts the simulated covariance matrix to the
corresponding correlation matrix, and computes the posterior mean of the
correlation matrix.

FIGURE 2.4
Covariance analysis of the 2016 Boston Marathon data. Panel (a)
plots the minute/mile for each runner (the winner is in black), Panels (b)
and (c) show the posterior mean of the covariance and correlation matrices,
respectively, and Panel (d) is shaded gray (black) if the 95% (99%) credible
interval for the elements of the precision matrix excludes zero.
Panel titles: (a) Spaghetti plot of the marathon data; (b) Posterior mean
covariance matrix; (c) Posterior mean correlation matrix; (d) Significant
conditional correlations.
The estimated (posterior mean) variance (the diagonal elements of Figure
2.4b) increases with the mile as the top runners separate themselves from the
pack. The correlation between speeds at different miles (Figure 2.4c) is high
for all pairs of miles in the first half (miles 1–13) of the race as the runners
maintain a fairly steady pace. The correlation is low between a runner’s first-
and second-half mile times, and the strongest correlations in the second half
of the race are between subsequent miles.
Normal-Wishart model for a precision matrix: Correlation is clearly
a central concept in statistics, but it should not be confused with causation.
For example, consider the three variables that follow the distribution Z1 ∼
Normal(0, 1), Z2 |Z1 ∼ Normal(Z1 , 1) and Z3 |Z1 , Z2 ∼ Normal(Z2 , 1). In this
toy example, the first variable has a causal effect on the second, and the second
has a causal effect on the third. The shared relationship with Z2 results in a
correlation of 1/√3 ≈ 0.58 between Z1 and Z3. However, if we condition on Z2,
then Z1 and Z3 are independent.
Statistical inference about the precision matrix (inverse covariance matrix)
Ω = Σ^{-1} is a step closer to uncovering causality than inference about the
correlation/covariance matrix. For a normal random variable
Y = (Y1, ..., Yp)^T ∼ Normal(µ, Ω^{-1}), the (j, k) element of the precision
matrix Ω measures the strength of the correlation between Yj and Yk after
accounting for the effects of the other p − 2 elements of Y. We say that
variables j and k are conditionally correlated if and only if the (j, k) element
of Ω is non-zero. Therefore, association tests based on Ω rather than Σ
eliminate spurious correlation (e.g., between Z1 and Z3) induced by lurking
variables (e.g., Z2). Assuming all variables relevant to the problem are
included in the p variables under consideration, then these conditional
correlations have causal interpretations.
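The toy chain Z1 → Z2 → Z3 illustrates the difference between marginal and conditional correlation. The covariance matrix below is derived from the stated model (Z2 = Z1 + e2 and Z3 = Z2 + e3 with independent standard normal errors); the sketch simply inverts it.

```r
# Covariance of (Z1, Z2, Z3): Var(Z1) = 1, Var(Z2) = 2, Var(Z3) = 3,
# Cov(Z1, Z2) = 1, Cov(Z1, Z3) = 1, Cov(Z2, Z3) = 2
Sigma <- matrix(c(1, 1, 1,
                  1, 2, 2,
                  1, 2, 3), 3, 3)

Cor   <- cov2cor(Sigma)     # marginal correlations
Omega <- solve(Sigma)       # precision matrix

print(round(Cor, 3))
print(round(Omega, 3))
```

The marginal correlation between Z1 and Z3 is 1/√3, yet the (1, 3) element of the precision matrix is zero, reflecting their independence given Z2.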
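The Wishart distribution is the conjugate prior for a normal precision
matrix. If Σ ∼ InvWishart(ν, R^{-1}), then Ω = Σ^{-1} ∼ Wishart(ν, R) and has
prior density

\[ \pi(\boldsymbol{\Omega}) \propto |\boldsymbol{\Omega}|^{(\nu - p - 1)/2} \exp\left\{ -\mathrm{Trace}(\boldsymbol{\Omega}\mathbf{R}^{-1})/2 \right\}. \tag{2.28} \]

Given a sample Y1, ..., Yn with Yi ∼ Normal(µi, Ω^{-1}) and conditioning on the
mean vectors µi, the posterior is
\[ \boldsymbol{\Omega} \mid \mathbf{Y} \sim \mathrm{Wishart}\left( n + \nu,\ \left[ \sum_{i=1}^{n} (\mathbf{Y}_i - \boldsymbol{\mu}_i)(\mathbf{Y}_i - \boldsymbol{\mu}_i)^T + \mathbf{R}^{-1} \right]^{-1} \right). \tag{2.29} \]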

Marathon example: Listing 2.1 computes and Figure 2.4d plots the
conditional correlations with credible sets that exclude zero for the Boston
Marathon example. Many of the non-zero correlations in Figure 2.4c do not
have significant conditional correlations. For example, if the times of the first
25 miles are known, then only miles 20 and 25 have strong conditional
correlation with the final mile time. Therefore, despite the many strong
correlations between the final mile and the other 25 miles, for the purpose of
predicting the final mile time given all other mile times, using only miles 20
and 25 may be sufficient.

Listing 2.1
Monte Carlo analysis of the Boston Marathon data.

```r
# Assumes Y is the n x p matrix of mile times (runners in rows, miles in columns)
library(MCMCpack)   # provides rwish()
library(fields)     # provides image.plot()

n <- nrow(Y)
p <- ncol(Y)

# Prior hyperparameters: Sigma ~ InvWishart(nu, R)
nu <- p + 1
R  <- diag(p) / (p + 1)

# Process the data: sum of squares about the sample mean of each mile
Ybar  <- colMeans(Y)
resid <- sweep(Y, 2, Ybar, "-")
SSE   <- t(resid) %*% resid

# Sigma|Y ~ InvWishart(n + nu, SSE + R) by (2.27), so the precision matrix
# Omega = solve(Sigma) is Wishart(n + nu, V) with V = solve(SSE + R)
V <- solve(SSE + R)

# Monte Carlo settings
S <- 10000
Sigma_mn <- Cor_mn <- Omega_pos <- 0

# Monte Carlo sampling, keeping running means rather than all S draws
for (s in 1:S) {
  Omega     <- rwish(n + nu, V)
  Sigma     <- solve(Omega)
  Sigma_mn  <- Sigma_mn + Sigma / S
  Cor_mn    <- Cor_mn + cov2cor(Sigma) / S
  Omega_pos <- Omega_pos + (Omega > 0) / S
}

# Evaluate significance of the precision matrix: gray (black) if the
# 95% (99%) credible interval excludes zero
Omega_sig <- ifelse(Omega_pos < 0.025 | Omega_pos > 0.975, "gray", "white")
Omega_sig <- ifelse(Omega_pos < 0.005 | Omega_pos > 0.995, "black", Omega_sig)

# Plot some of the results
image.plot(1:p, 1:p, Sigma_mn,
           xlab = "Mile", ylab = "Mile",
           main = "Posterior mean covariance matrix")
```

2.1.8 Mixtures of conjugate priors


A limitation of a conjugate prior is that restricting the prior to a parametric
family limits how accurately prior uncertainty can be expressed. For example,
the normal distribution is a conjugate prior for a normal mean, but a normal
prior must be symmetric and unimodal, which may not describe the available
prior information. A mixture of conjugate priors provides a much richer class
of priors. A mixture prior is
\[ \pi(\theta) = \sum_{j=1}^{J} q_j \pi_j(\theta) \tag{2.30} \]

where q1, ..., qJ are the mixture probabilities with qj ≥ 0 and
Σ_{j=1}^J qj = 1, and πj are valid PDFs (or PMFs). The mixture prior can be
interpreted as each component corresponding to the opinion of a different
expert. That is, with probability qj you select the prior from expert j,
denoted πj.
The posterior distribution under a mixture prior is
\[ p(\theta \mid \mathbf{Y}) \propto f(\mathbf{Y} \mid \theta)\,\pi(\theta) \propto \sum_{j=1}^{J} q_j f(\mathbf{Y} \mid \theta)\,\pi_j(\theta) \propto \sum_{j=1}^{J} Q_j\, p_j(\theta \mid \mathbf{Y}), \tag{2.31} \]

where pj(θ|Y) ∝ f(Y|θ)πj(θ) is the posterior under prior πj and the updated
mixture weights Qj are positive and sum to one. The posterior mixture weights
Qj are not the same as the prior mixture weights qj, and unfortunately have a
fairly complicated form. However, since each mixture component is conjugate,
the pj belong to the same family as the πj. Therefore, the posterior is also a
mixture distribution over the same parametric family as the prior, and thus
the prior is conjugate.
ESP example: As discussed in [56], [7] presents the results of several experiments
on extrasensory perception (ESP). In one experiment (Experiment
2 in [7]), subjects were asked to say which of two pictures they preferred before
being shown the pictures. Unknown to the subject, one of the pictures was a
subliminally exposed negative picture, and the subjects avoided this picture
in 2,790 of the 5,400 trials (51.7%). Even more dramatically, participants
who scored high in stimulus seeking avoided the picture in Y = 963 of the
n = 1,800 trials (53.5%).
FIGURE 2.5
Prior and posterior distributions for the ESP example. The data are
binomial with Y = 963 successes in n = 1800 trials. The left panel plots the
Beta(2000,2000) and Beta(1,1) priors and resulting posteriors, and the right
panel plots the mixture prior that equally weights the two beta priors and the
resulting posterior. (Both panels show density against θ for θ between 0.40 and 0.60.)

The data are then Y ∼ Binomial(n, θ), and the objective is to test whether
θ = 0.5 and thus the subjects do not have ESP. A skeptical prior that presumes
the subjects do not have ESP should be concentrated around θ = 0.5, say π1 is
the Beta(2000, 2000) PDF. On the other extreme, if we approach the analysis
with an open mind then we may select an uninformative prior, say π2 is the
uniform Beta(1, 1) PDF. The mixture prior combines these two extremes,

π(θ) = qπ1 (θ) + (1 − q)π2 (θ), (2.32)

as shown in Figure 2.5 (right) for q = 0.5. The mixture prior is simply the
average of the two beta prior PDFs in the left panel. The prior has a peak
at 0.5 corresponding to π1 , but does not go completely to zero as θ departs
from 0.5 because of the addition of the uniform prior π2 . The posterior dis-
tribution under the mixture prior (Figure 2.5, right) is also a mixture of two
beta distributions. The posterior is a weighted average of the two posterior
densities in the left panel. The posterior is bimodal with one mode near 0.51
corresponding to the skeptical prior and a second mode near 0.53 correspond-
ing to the uniform prior. This shape of the density is not possible within the
standard beta family.
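
The mixture posterior in the ESP example can be reproduced numerically. The following is a minimal sketch, not the authors' code: it relies on the standard result that the updated weights satisfy Qj ∝ qj mj(Y), where mj(Y) is the marginal (beta-binomial) likelihood of the data under component j. The variable names (`log_m`, `Q`, `post`) are illustrative assumptions.

```r
# ESP data and the two beta prior components (values taken from the text)
Y <- 963; n <- 1800
a <- c(2000, 1); b <- c(2000, 1)   # Beta(2000,2000) and Beta(1,1)
q <- c(0.5, 0.5)                   # prior mixture weights

# Log marginal likelihood of Y under each component (beta-binomial PMF)
log_m <- lchoose(n, Y) + lbeta(a + Y, b + n - Y) - lbeta(a, b)

# Updated weights Q_j proportional to q_j * m_j(Y), computed on the log scale
log_w <- log(q) + log_m
Q <- exp(log_w - max(log_w))
Q <- Q / sum(Q)

# Mixture posterior density: weighted average of the two beta posteriors
post <- function(theta) {
  Q[1] * dbeta(theta, a[1] + Y, b[1] + n - Y) +
    Q[2] * dbeta(theta, a[2] + Y, b[2] + n - Y)
}

# Evaluate on a grid, e.g. for plotting with plot(theta, post(theta), type = "l")
theta <- seq(0.40, 0.60, length.out = 401)
dens  <- post(theta)
```

Working on the log scale avoids underflow in the beta functions, which take very small values for counts of this size.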

2.2 Improper priors


Section 1.1 introduced the concept of a probability density function (PDF). A
proper PDF is non-negative and integrates to one. If the prior is selected from
a common family of distributions as in Section 2.1 then the prior is ensured
to be proper. In this section we explore the possibility of using an improper
prior, i.e., a prior PDF that is non-negative but does not integrate to one over
the parameter’s support.
As a concrete example, assume that $Y_1, ..., Y_n | \mu, \sigma^2 \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2)$ with
σ taken to be fixed and µ to be estimated. If we truly have no prior information
then it is tempting to fit the model with prior π(µ) = c > 0 for all
possible µ ∈ (−∞, ∞). This can be viewed as approximating either a uniform
prior with bounds tending to infinity or a Gaussian prior with variance tending
to infinity. However, for any positive c, $\int_{-\infty}^{\infty} \pi(\mu)\, d\mu = \infty$ and so the prior
is improper.
Despite the fact that the prior is not a proper PDF, it is still possible to
use the density function

p(µ|Y) ∝ f (Y|µ)π(µ) (2.33)

to summarize uncertainty about µ. That is, we can treat this product of the
likelihood and improper prior as a posterior distribution. Applying this rule
for the normal mean example gives µ|Y ∼ Normal(Ȳ , σ 2 /n), which is a proper
PDF as long as n ≥ 1.
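
As a quick numerical check of this claim, the sketch below normalizes the likelihood (which equals the unnormalized posterior under a flat prior) on a grid and compares it with the Normal(Ȳ, σ²/n) density. The data are simulated purely for illustration, and the grid limits and object names are assumptions, not part of the text.

```r
set.seed(1)
sigma <- 2; n <- 10
Y <- rnorm(n, mean = 1, sd = sigma)   # simulated data, for illustration only

# Grid covering +/- 6 posterior standard deviations around the sample mean
half_width <- 6 * sigma / sqrt(n)
mu_grid <- seq(mean(Y) - half_width, mean(Y) + half_width, length.out = 2001)
step    <- mu_grid[2] - mu_grid[1]

# Under pi(mu) = c the unnormalized posterior is just the likelihood in mu
log_lik <- sapply(mu_grid, function(m) sum(dnorm(Y, m, sigma, log = TRUE)))
unnorm  <- exp(log_lik - max(log_lik))

# Normalize numerically so the grid density integrates to one
post_grid  <- unnorm / (sum(unnorm) * step)
post_exact <- dnorm(mu_grid, mean(Y), sigma / sqrt(n))

max_abs_diff <- max(abs(post_grid - post_exact))
```

The constant c cancels in the normalization, which is why its value never needs to be specified.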
As this example shows, using improper priors can lead to a proper poste-
rior distribution. However, improper priors must be used with caution. It must
be verified that the resulting posterior distribution is proper, i.e., it must be
mathematically proven that the posterior integrates to one over the parame-
ter’s support. This step is not needed if the prior is proper because any proper
prior leads to a proper posterior. Also, from a conceptual perspective, it is dif-
ficult to apply the subjective Bayesian argument that the prior and posterior
distributions represent the analyst’s uncertainty about the parameters when
the prior uncertainty cannot be summarized by a valid PDF. One way to justify
treating inference with an improper prior as Bayesian inference is to view
the posterior under an improper prior as the limit of the posterior for proper
priors (with, say, the prior variance increasing to infinity) [78].
An obvious remedy to these concerns is to replace the improper prior with
a uniform prior with bounds that almost surely include the true value of a pa-
rameter. In fact, a uniform prior was the original prior used by Laplace in his
formulation of Bayesian statistics. While this alleviates mathematical prob-
lems of ensuring a proper posterior and conceptual problems with Bayesian
interpretation, it raises a concern about invariance to parameterization. For
example, let θ ∈ (0, 1) be the probability of an event and r = θ/(1 − θ) > 0 be
the corresponding odds of the event. A Uniform(0,100) prior for the odds r appears
to be uninformative, but in fact the induced prior mean of θ = r/(1 + r)
is 0.95, which is clearly not uninformative for θ.
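
The induced prior mean can be checked by simulation. This is a small sketch with illustrative object names; the closed form used for comparison, E(θ) = 1 − log(101)/100, follows from integrating r/(1 + r) against the Uniform(0,100) density and is not stated in the text.

```r
set.seed(1)

# Draw odds from the Uniform(0, 100) prior and transform to probabilities
r     <- runif(1e6, 0, 100)
theta <- r / (1 + r)

mc_mean    <- mean(theta)            # Monte Carlo estimate of E(theta)
exact_mean <- 1 - log(101) / 100     # closed-form value of the same integral

# Share of prior mass placed on theta > 0.9, i.e. odds above 9
mass_above_0.9 <- mean(theta > 0.9)
```

A histogram of `theta` (for example `hist(theta, breaks = 100)`) shows the mass piling up near one.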

2.3 Objective priors


In an idealized subjective Bayesian analysis the system (physical, biological,
economic, etc.) is understood except for the values of a few parameters, and
studies are conducted sequentially to refine estimates of the unknown param-
eters. In this case, the ability to incorporate the current state of knowledge
as prior information in the analysis of the current study is a strength of the
Bayesian paradigm and should be embraced. However, in many cases there is
no prior information, leaving the analyst to select, say, conjugate uninforma-
tive priors in an ad hoc manner.
In the absence of prior information, selecting the prior can be viewed as
a nuisance to be avoided if possible. Objective priors offer a solution by pro-
viding systematic approaches to formulate the prior. Berger [8] argues that
objectivity is essential in many analyses, such as obtaining approval from reg-
ulatory agencies, where any apparent personal bias can be a huge red flag.
Objective methods also provide useful baselines for comparison for subjective
methods and make application of Bayesian methods more approachable for
non-statisticians [8]. Of course, eliminating the subjectivity of prior selection
is only part of the story; there remains subjectivity in selecting the data to be
analyzed, the inference to be conducted, the likelihood to be used, and indeed,
the form of the objective prior.
In the remainder of this section we discuss several general and objective ap-
proaches to setting priors. Jeffreys’ prior (Section 2.3.1) is arguably the most
common objective Bayes prior and so we describe this method in the most
detail. We provide only brief conceptual discussions for the other approaches
because deriving the exact expressions of the priors even in moderately com-
plex cases often requires tedious calculus.

2.3.1 Jeffreys’ prior


A Jeffreys’ prior (JP) is a method to construct a prior distribution that is
invariant to reparameterization. This is a necessary condition for a method to
be objective. For example, if one statistician places a prior on the standard
deviation (σ) and another places a prior on the variance (σ 2 ) and the two
get different results, then the method is subjective because it depends on the
choice of parameterization.
The univariate JP for θ is
$$\pi(\theta) \propto \sqrt{I(\theta)} \tag{2.34}$$

where I(θ) is the expected Fisher information defined as

$$I(\theta) = -E\left[\frac{d^2 \log f(Y|\theta)}{d\theta^2}\right] \tag{2.35}$$

and the expectation is with respect to Y|θ. The key feature of this construction
is that it is invariant to transformation. That is, say statistician one uses a
JP on θ, denoted π1 (θ), and statistician two places a JP on γ = g(θ), denoted
π2 (γ). Using the change of variables formula, the prior for γ induced by π1 is

$$\pi_3(\gamma) = \pi_1[h(\gamma)] \left|\frac{dh(\gamma)}{d\gamma}\right| \tag{2.36}$$

where h(γ) = θ is the inverse of g (assuming g is invertible). It can be shown
that π2 = π3 , that is, placing a JP on γ is equivalent to placing a JP on θ and
then reparameterizing to γ; this establishes invariance to reparameterization.
Once the likelihood function f has been determined, then the JP is deter-
mined; in this sense the prior is objective because any two analysts working
with the same likelihood will get the same prior and thus the same posterior.
An important point is that the prior does not depend on the data, Y, because
the expectation in I(θ) removes the dependence on the data. Finally, the JP
is often improper and therefore before using JP in practice it must be verified
that the posterior is a valid PDF that integrates to one (Section 2.2).
JP can also be applied in multivariate problems. The multivariate JP for
θ = (θ1 , ..., θp ) is

$$\pi(\theta) \propto \sqrt{|I(\theta)|} \tag{2.37}$$

where I(θ) is the p × p expected Fisher information matrix with (i, j) element

$$-E\left[\frac{\partial^2 \log f(Y|\theta)}{\partial\theta_i\, \partial\theta_j}\right]. \tag{2.38}$$

As with the univariate JP, this prior is invariant to transformation and can
be improper.
Binomial proportion: As a univariate example, consider the binomial
model Y ∼ Binomial(n, θ). The second derivative of the binomial log likelihood
is
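$$\begin{aligned}
\frac{d^2 \log f(Y|\theta)}{d\theta^2} &= \frac{d^2}{d\theta^2}\left[\log\binom{n}{Y} + Y\log(\theta) + (n - Y)\log(1 - \theta)\right] \\
&= \frac{d}{d\theta}\left[\frac{Y}{\theta} - \frac{n - Y}{1 - \theta}\right] \\
&= -\frac{Y}{\theta^2} - \frac{n - Y}{(1 - \theta)^2}.
\end{aligned}$$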

The information is the expected value of the negative second derivative. Under
the binomial model, the expected value of Y is nθ and since the second
derivative is linear in Y the expectation passes into each term as

$$I(\theta) = \frac{E(Y)}{\theta^2} + \frac{n - E(Y)}{(1 - \theta)^2} = \frac{n}{\theta} + \frac{n}{1 - \theta} = \frac{n}{\theta(1 - \theta)}. \tag{2.39}$$

The JP is then
$$\pi(\theta) \propto \sqrt{\frac{n}{\theta(1 - \theta)}} \propto \theta^{1/2 - 1}(1 - \theta)^{1/2 - 1}, \tag{2.40}$$

which is the kernel of the Beta(1/2,1/2) PDF. Therefore, the JP for the bino-
mial proportion is θ ∼ Beta(1/2, 1/2). In Section 2.1 we saw that a beta prior
was conjugate, and the JP provides an objective way to set the hyperparam-
eters in the beta prior.
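
The binomial derivation can be verified numerically. The sketch below computes the expected negative second derivative by summing over all possible values of Y, compares it with n/[θ(1 − θ)], and checks that the square root of the information is a constant multiple of the Beta(1/2, 1/2) density. The function name `fisher_binom` and the chosen values of n and θ are illustrative assumptions.

```r
# Expected Fisher information for Binomial(n, theta) by direct summation
fisher_binom <- function(theta, n) {
  y <- 0:n
  # Negative second derivative of the log likelihood at each possible y
  neg_d2 <- y / theta^2 + (n - y) / (1 - theta)^2
  sum(neg_d2 * dbinom(y, n, theta))
}

n     <- 20
theta <- c(0.1, 0.3, 0.5, 0.9)

numeric_info <- sapply(theta, fisher_binom, n = n)
closed_form  <- n / (theta * (1 - theta))

# Jeffreys' prior kernel divided by the Beta(1/2, 1/2) PDF:
# the ratio should not depend on theta
ratio <- sqrt(closed_form) / dbeta(theta, 0.5, 0.5)
```

Because the ratio is constant in θ, the two functions differ only by a normalizing constant, which is what "kernel of the Beta(1/2,1/2) PDF" means.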
Normal model: As a multivariate example, consider the Gaussian model
$Y_i | \mu, \sigma^2 \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2)$. As shown in Appendix A.3, the bivariate JP is

$$\pi(\mu, \sigma) \propto \frac{1}{(\sigma^2)^{3/2}}. \tag{2.41}$$

This is an improper prior but leads to a proper posterior as long as n ≥ 2.
This prior can be seen as the limit of the product of independent priors µ ∼
Normal(0, c²) and σ² ∼ InvGamma(0.5, b) as c → ∞ and b → 0. The JP is flat
for all values of µ and, similar to the inverse gamma distribution with small
shape parameter, the JP peaks at σ² = 0 and drops sharply away from the
origin.
Multiple linear regression model: The linear regression model is
Y|β, σ² ∼ Normal(Xβ, σ²I_n), where the vector of p regression coefficients
is β = (β1 , ..., βp ) and the error variance is σ². Appendix A.3 shows that the
JP is

$$\pi(\beta, \sigma) \propto \frac{1}{(\sigma^2)^{p/2 + 1}}. \tag{2.42}$$
As in the mean-only model (which is a special case of the regression model),
the JP is improper, completely flat in the mean parameters β, and decays
polynomially in the variance with degree that depends on the number of co-
variates in the model.

2.3.2 Reference priors


Bernardo’s reference prior (RP; [10]) is the prior that maximizes the expected
difference between the prior and posterior. This is certainly an intuitive def-
inition of an uninformative prior as it ensures that the prior does not over-
whelm the data. To act on this conceptual definition requires a measure of
the discrepancy between two density functions, and [10] use (ignoring technical
problems that require studying sequences of priors) the Kullback–Leibler
(KL) divergence between the prior π(θ) and the posterior p(θ|Y), defined as
the expected difference in log densities,

$$KL(p, \pi|Y) = E\{\log[p(\theta|Y)] - \log[\pi(\theta)]\}, \tag{2.43}$$

where the expectation is with respect to the posterior p(θ|Y). The KL divergence
depends on the data and thus cannot be used to derive a prior. To
remove the dependence on the data, the RP is the prior density function π
that maximizes the expected information gain $\int KL(p, \pi|Y)\, m(Y)\, dY$, where
m(Y) is the marginal distribution of the data (Table 1.4).
Finding the RP is a daunting optimization problem because the solution
is a prior distribution and not a scalar or vector. In many cases, the RP differs
from the JP. However, it can be shown that if the posterior is approximately
normal then the RP is approximately the JP. The two are also equivalent
in some non-Gaussian cases, for example, if Y ∼ Binomial(n, θ) the RP is θ ∼
Beta(1/2, 1/2).
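
To make the expected information gain concrete, the sketch below evaluates it for a binomial likelihood with a Beta(a, b) prior, where both the posterior and the marginal distribution of Y are available in closed form. It uses the standard closed-form KL divergence between two beta distributions. The function names and the choice n = 50 are assumptions for illustration, and a finite-n comparison like this does not by itself identify the reference prior, whose formal definition involves a limiting argument.

```r
# KL divergence from Beta(a0, b0) to Beta(a1, b1), i.e. E_{Beta(a1,b1)}[log p1 - log p0]
kl_beta <- function(a1, b1, a0, b0) {
  lbeta(a0, b0) - lbeta(a1, b1) +
    (a1 - a0) * digamma(a1) + (b1 - b0) * digamma(b1) +
    (a0 - a1 + b0 - b1) * digamma(a1 + b1)
}

# Expected information gain: average KL(posterior, prior) over the marginal m(Y)
expected_gain <- function(a, b, n) {
  y <- 0:n
  log_m <- lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b)  # beta-binomial
  sum(exp(log_m) * kl_beta(a + y, b + n - y, a, b))
}

gain_jeffreys <- expected_gain(0.5, 0.5, n = 50)     # Beta(1/2, 1/2)
gain_uniform  <- expected_gain(1, 1, n = 50)         # Beta(1, 1)
gain_skeptic  <- expected_gain(2000, 2000, n = 50)   # strongly informative prior
```

A strongly informative prior such as Beta(2000, 2000) yields a very small gain because the posterior barely moves away from the prior.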

2.3.3 Maximum entropy priors


A maximum entropy (MaxEnt) prior assumes that a few properties of the
prior are known and selects the prior with the most entropy in the class of
priors with these properties. As an example, say that θ has prior support
θ ∈ {0, 1, 2, ...} and the prior mean is assumed to be E(θ) = 1. There are
infinitely many priors for θ that satisfy the constraint that E(θ) = 1, e.g., θ ∼
Poisson(1), θ ∼ Binomial(2, 0.5) or θ ∼ Binomial(5, 0.2). The MaxEnt prior
is the prior that maximizes entropy over the priors with support {0, 1, 2, ...}
and mean one.
In general, let the constraints be E[gk (θ)] = ck for functions gk (mean,
variance, exceedance probabilities, etc.) and constants ck for k = 1, ..., K.
The entropy for a discrete parameter with prior π(θ) and discrete support
S = {θ1 , θ2 , ...} is
$$E\{-\log[\pi(\theta)]\} = -\sum_{i \in S} \log[\pi(\theta_i)]\pi(\theta_i). \tag{2.44}$$

The MaxEnt prior is the value of {π(θ1 ), π(θ2 ), ...} that maximizes
$-\sum_{i \in S} \log[\pi(\theta_i)]\pi(\theta_i)$ subject to the K constraints $\sum_{i \in S} g_k(\theta_i)\pi(\theta_i) = c_k$.
MaxEnt priors can be extended to continuous parameters as well.
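
For the example with support {0, 1, 2, ...} and mean one, the entropies of the candidate priors can be compared directly. The sketch below also includes the geometric distribution with success probability 1/2, which is the known maximum entropy distribution on the non-negative integers for a fixed mean of one; that fact is a standard result and is not derived in the text. The support is truncated at 200 for computation, and the object names are illustrative.

```r
# Entropy of a discrete distribution given its probability vector
entropy <- function(p) {
  p <- p[p > 0]            # terms with zero probability contribute nothing
  -sum(p * log(p))
}

k <- 0:200                 # truncated support; the omitted tail mass is negligible

H <- c(
  geometric = entropy(dgeom(k, prob = 0.5)),            # mean (1 - 0.5)/0.5 = 1
  poisson   = entropy(dpois(k, lambda = 1)),            # mean 1
  binom_2   = entropy(dbinom(k, size = 2, prob = 0.5)), # mean 1
  binom_5   = entropy(dbinom(k, size = 5, prob = 0.2))  # mean 1
)

# Confirm each candidate satisfies the mean constraint E(theta) = 1
means <- c(
  sum(k * dgeom(k, 0.5)), sum(k * dpois(k, 1)),
  sum(k * dbinom(k, 2, 0.5)), sum(k * dbinom(k, 5, 0.2))
)
```

Sorting `H` in decreasing order ranks the candidates by entropy, with the geometric distribution first.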

2.3.4 Empirical Bayes


An empirical Bayesian (EB) analysis uses the data to set the priors. For example,
consider the model $Y_i | \theta_i \overset{iid}{\sim} \text{Binomial}(N, \theta_i)$ for i = 1, ..., n, where each
observation’s success probability has prior $\theta_i | a, b \overset{iid}{\sim} \text{Beta}(a, b)$. The goal of this
analysis is to estimate the individual probabilities θi , and thus there are two types
of parameters: the parameters of interest θ = (θ1 , ..., θn ) and the nuisance
parameters γ = (a, b). Without prior knowledge about γ, a fully Bayesian
analysis requires a prior for the nuisance hyperparameters. In contrast, an EB
analysis inspects the data to select γ, and then performs a Bayesian analysis
of θ as if γ were known all along. For example, in this problem it can be
shown (Appendix A.1) that marginally over θ, Yi |γ ∼ Beta-Binomial(N, a, b).
Therefore, an EB analysis might use the MLE from the beta-binomial model
to estimate γ and plug this estimate into the Bayesian analysis for θ.
More formally, consider the model Y|θ ∼ f (Y|θ) and prior θ|γ ∼ π(θ|γ).
The marginal maximum likelihood estimator for the hyperparameter is

$$\hat{\gamma} = \arg\max_{\gamma} \int f(Y|\theta)\pi(\theta|\gamma)\, d\theta. \tag{2.45}$$

The EB analysis then conducts a Bayesian analysis of θ with γ set to γ̂,
i.e., Y|θ ∼ f (Y|θ) and θ|γ = γ̂ ∼ π(θ|γ̂). Although other forms of EB
are possible (e.g., using γ̂ as the prior mean for γ), this plug-in approach
provides a systematic and objective way to set priors.
The disadvantage of EB is obvious: we are using the data twice and ignor-
ing uncertainty about γ. As a result, an EB analysis of θ will often have smaller
posterior variances and narrower credible intervals than a fully Bayesian anal-
ysis that puts a prior on γ, and thus the statistical properties of the EB
analysis can be questioned. Despite this concern, EB remains a useful tool to
stabilize computing in high-dimensional problems, especially when it can be
argued (e.g., for large data sets) that uncertainty in the nuisance parameters
is negligible.
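
The plug-in approach for the beta-binomial example might look as follows. This is a sketch under stated assumptions rather than a prescribed implementation: the data are simulated, the hyperparameters are optimized on the log scale to keep them positive, and the names (`neg_log_marg`, `gamma_hat`) are hypothetical.

```r
set.seed(42)
n <- 200; N <- 20
theta_true <- rbeta(n, 2, 6)        # simulated success probabilities (illustration only)
Y <- rbinom(n, N, theta_true)

# Negative log marginal likelihood of (a, b) under the beta-binomial model,
# parameterized by log(a) and log(b) so that the optimizer is unconstrained
neg_log_marg <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])
  -sum(lchoose(N, Y) + lbeta(a + Y, b + N - Y) - lbeta(a, b))
}

# Marginal maximum likelihood estimate of gamma = (a, b)
fit       <- optim(c(0, 0), neg_log_marg)
gamma_hat <- exp(fit$par)
a_hat <- gamma_hat[1]; b_hat <- gamma_hat[2]

# Plug-in posterior for each theta_i: Beta(a_hat + Y_i, b_hat + N - Y_i)
post_mean <- (a_hat + Y) / (a_hat + b_hat + N)
lower     <- qbeta(0.025, a_hat + Y, b_hat + N - Y)
upper     <- qbeta(0.975, a_hat + Y, b_hat + N - Y)
```

The intervals treat `gamma_hat` as known, so they are subject to the caveat above about understated uncertainty.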

2.3.5 Penalized complexity priors


The penalized complexity (PC) prior [76] is designed to prevent over-fitting
caused by using a model that is too complex to be supported by the data. For
example, consider the linear regression model

$$\text{Full model}: \quad Y_i | \beta, \sigma^2 \overset{indep}{\sim} \text{Normal}\left(\beta_1 + \sum_{j=2}^{p} X_{ij}\beta_j,\ \sigma^2\right). \tag{2.46}$$

If the number of predictors p is large then the model becomes very complex. A
PC prior might shrink the full model to the base model with β2 = ... = βp = 0,

$$\text{Base model}: \quad Y_i | \beta, \sigma^2 \overset{iid}{\sim} \text{Normal}\left(\beta_1,\ \sigma^2\right). \tag{2.47}$$

Using uninformative priors for the βj in the full model puts almost no weight
on this base model, which could lead to over-fitting if the true model is sparse
with many βj = 0.
The PC prior is defined using the KL distance between the priors for the
full and base models,

$$KL = \int \log\left[\frac{\pi_{full}(\theta)}{\pi_{base}(\theta)}\right] \pi_{full}(\theta)\, d\theta. \tag{2.48}$$
The prior on the distance between models is then 2KL ∼ Gamma(1, λ), i.e.,
an exponential prior with rate λ which peaks at the base model with KL = 0
and decreases in complexity KL. Technically this is not an objective Bayesian
approach because the user must specify the base model and either λ or a prior
for λ. Nonetheless, the PC prior provides a systematic way to set priors for
high-dimensional models.

2.4 Exercises
1. Assume $Y_1, ..., Y_n | \mu \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2)$ where σ² is fixed and the
unknown mean µ has prior µ ∼ Normal(0, σ²/m).
(a) Give a 95% posterior interval for µ.
(b) Select a value of m and argue that for this choice your 95%
posterior credible interval has frequentist coverage 0.95 (that
is, if you draw many samples of size n and compute the 95%
interval following the formula in (a) for each sample, in the
long-run 95% of the intervals will contain the true value of µ).
2. The Major League Baseball player Reggie Jackson is known as
“Mr. October” for his outstanding performances in the World Series
(which takes place in October). Over his long career he played in
2820 regular-season games and hit 563 home runs in these games
(a player can hit 0, 1, 2, ... home runs in a game). He also played
in 27 World Series games and hit 10 home runs in these games.
Assuming uninformative conjugate priors, summarize the posterior
distribution of his home-run rate in the regular season and World
Series. Is there sufficient evidence to claim that he performs better
in the World Series?
3. Assume that Y |θ ∼ Exponential(θ), i.e., Y |θ ∼ Gamma(1, θ). Find
a conjugate prior distribution for θ and derive the resulting posterior
distribution.
4. Assume that Y |θ ∼ NegBinomial(θ, m) (see Appendix A.1) and
θ ∼ Beta(a, b).
(a) Derive the posterior of θ.
(b) Plot the posterior of θ and give its 95% credible interval assum-
ing m = 5, Y = 10, and a = b = 1.
5. Over the past 50 years California has experienced an average of
λ0 = 75 large wildfires per year. For the next 10 years you will record
the number of large fires in California and then fit a Poisson/gamma
model to these data. Let the rate of large fires in this future period,
λ, have prior λ ∼ Gamma(a, b). Select a and b so that the prior
is uninformative with prior variance around 100 and gives prior
probability approximately Prob(λ > λ0 ) = 0.5 so that the prior
places equal probability on both hypotheses in the test for a change
in the rate.
6. An assembly line relies on accurate measurements from an image-recognition
algorithm at the first stage of the process. It is known
that the algorithm is unbiased, so assume that measurements follow
a normal distribution with mean zero, $Y_i | \sigma^2 \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$. Some
errors are permissible, but if σ exceeds the threshold c then the algorithm
must be replaced. You make n = 20 measurements and
observe $\sum_{i=1}^{n} Y_i = -2$ and $\sum_{i=1}^{n} Y_i^2 = 15$ and conduct a Bayesian
analysis with InvGamma(a, b) prior. Compute the posterior probability
that σ > c for:
(a) c=1 and a = b = 0.1
(b) c=1 and a = b = 1.0
(c) c=2 and a = b = 0.1
(d) c=2 and a = b = 1.0
For each c, compute the ratio of probabilities for the two priors (i.e.
a = b = 0.1 and a = b = 1.0). Which, if any, of the results are
sensitive to the prior?
7. In simple logistic regression, the probability of a success is regressed
onto a single covariate X as

$$p(X) = \frac{\exp(\alpha + X\beta)}{1 + \exp(\alpha + X\beta)}. \tag{2.49}$$

Assuming the covariate is distributed as X ∼ Normal(0, 1) and the
parameters have priors α ∼ Normal(0, c²) and β ∼ Normal(0, c²),
find a prior standard deviation c so that the induced prior on the
success probability p(X) is roughly Uniform(0,1). That is, for a
given c generate S (say S = 1,000,000) samples $(X^{(s)}, \alpha^{(s)}, \beta^{(s)})$
for s = 1, ..., S, compute the S success probabilities, and make a
histogram of the S probabilities. Repeat this for several c until the
histogram is approximately uniform. Report the final value of c and
the resulting histogram. Would you call this an uninformative prior
for the regression coefficients α and β?
8. The Mayo Clinic conducted a study of n = 50 patients followed
for one year on a new medication and found that 30 patients expe-
rienced no adverse side effects (ASE), 12 experienced one ASE, 6
experienced two ASEs and 2 experienced ten ASEs.
(a) Derive the posterior of λ for the Poisson/gamma model
$Y_1, ..., Y_n | \lambda \overset{iid}{\sim} \text{Poisson}(\lambda)$ and λ ∼ Gamma(a, b).
(b) Use the Poisson/gamma model with a = b = 0.01 to study the
rate of adverse events. Plot the posterior and give the posterior
mean and 95% credible interval.
(c) Repeat this analysis with Gamma(0.1,0.1) and Gamma(1,1)
priors and discuss sensitivity to the prior.
(d) Plot the data versus the Poisson(λ̂) PMF, where λ̂ is the pos-
terior mean of λ from part (b). Does the Poisson likelihood fit
well?
(e) The current medication is thought to have around one adverse
side effect per year. What is the posterior probability that this
new medication has a higher side effect rate than the previous
medication? Are the results sensitive to the prior?
9. A smoker is trying to quit and their goal is to abstain for one week
because studies have shown that this is the most difficult period.
Denote Y as the number of days before the smoker relapses, and θ
as the probability that they relapse on any given day of the study.
(a) Argue that a negative binomial distribution (Appendix A.1)
for Y |θ is reasonable. What are the most important assump-
tions being made and are they valid for this smoking cessation
attempt?
(b) Assuming a negative binomial likelihood and Uniform(0, 1)
prior for θ, what is the prior probability the smoker will be
smoke-free for at least 7 days?
(c) Plot the posterior distribution of θ assuming Y = 5.
(d) Given this attempt resulted in Y = 5 and that θ is the same in
the second attempt, what is the posterior predictive probability
the second attempt will result in at least 7 smoke-free days?
10. You are designing a very small experiment to determine the sensi-
tivity of a new security alarm system. You will simulate five robbery
attempts and record the number of these attempts that trigger the
alarm. Because the dataset will be small you ask two experts for
their opinion. One expects the alarm probability to be 0.95 with
standard deviation 0.05, the other expects 0.80 with standard de-
viation 0.20.
(a) Translate these two priors into beta PDFs, plot the two beta
PDFs and the corresponding mixture of experts prior with
equal weight given to each expert.
(b) Now you conduct the experiment and the alarm is triggered in
every simulated robbery. Plot the posterior of the alarm probability
under a uniform prior, each expert’s prior, and the mixture
of experts prior.
11. Let θ be the binomial success probability and γ = θ/(1 − θ) be
the corresponding odds of a success. Use Monte Carlo simulation
(you can discard samples with γ > 100) to explore the effects of
parameterization on prior choice in this context.
(a) If θ ∼ Uniform(0, 1), what is the induced prior on γ?
(b) If θ ∼ Beta(0.5, 0.5), what is the induced prior on γ?
(c) If γ ∼ Uniform(0, 100), what is the induced prior on θ?
(d) If γ ∼ Gamma(1, 1), what is the induced prior on θ?
(e) Would you say any of these priors are simultaneously uninfor-
mative for both θ and γ?
12. Say Y |µ ∼ Normal(µ, 1) and µ has improper prior π(µ) = 1 for all
µ. Prove that the posterior distribution of µ is a proper PDF.
13. Suppose that the likelihood f (Y|θ) ≥ c(Y) for all θ. Show that an
improper prior on θ will yield an improper posterior.
14. Consider a sample of size n from the mixture of two normal densities
given by f (y|θ) = 0.5φ(y) + 0.5φ(y − θ) where φ(z) denotes the
standard normal density function. Show that any improper prior
for θ leads to an improper posterior.
15. Say Y |λ ∼ Poisson(λ).
(a) Derive and plot the Jeffreys’ prior for λ.
(b) Is this prior proper?
(c) Derive the posterior and give conditions on Y to ensure it is
proper.
16. Say Y |λ ∼ Gamma(1, λ).
(a) Derive and plot the Jeffreys’ prior for λ.
(b) Is this prior proper?
(c) Derive the posterior and give conditions on Y to ensure it is
proper.
17. Assume the model Y |λ ∼ Poisson(λ) and we observe Y = 10. Con-
duct an objective Bayesian analysis of these data.
18. The data in the table below are the result of a survey of commuters
in 10 counties likely to be affected by a proposed addition of a high
occupancy vehicle (HOV) lane.

County  Approve  Disapprove    County  Approve  Disapprove
1       12       50            6       15       8
2       90       150           7       67       56
3       80       63            8       22       19
4       5        10            9       56       63
5       63       63            10      33       19


(a) Analyze the data in each county separately using the Jeffreys’
prior distribution and report the posterior 95% credible set for
each county.
(b) Let p̂i be the sample proportion of commuters in county i that
approve of the HOV lane (e.g., p̂1 = 12/(12 + 50) = 0.194).
Select a and b so that the mean and variance of the Beta(a, b)
distribution match the mean and variance of the sample pro-
portions p̂1 , ..., p̂10 .
(c) Conduct an empirical Bayesian analysis by computing the 95%
posterior credible sets that result from analyzing each county
separately using the Beta(a, b) prior you computed in (b).
(d) How do the results from (a) and (c) differ? What are the ad-
vantages and disadvantages of these two analyses?
3
Computational approaches

CONTENTS
3.1 Deterministic methods
3.1.1 Maximum a posteriori estimation
3.1.2 Numerical integration
3.1.3 Bayesian central limit theorem (CLT)
3.2 Markov chain Monte Carlo (MCMC) methods
3.2.1 Gibbs sampling
3.2.2 Metropolis–Hastings (MH) sampling
3.3 MCMC software options in R
3.4 Diagnosing and improving convergence
3.4.1 Selecting initial values
3.4.2 Convergence diagnostics
3.4.3 Improving convergence
3.4.4 Dealing with large datasets
3.5 Exercises

Computing is a key ingredient to any modern statistical application and a
Bayesian analysis is no different. As of the late 1980s, the application of
Bayesian methods was limited to small problems where conjugacy (Section
2.1) led to a simple posterior or the number of parameters was low enough to
permit numerical integration (e.g., Section 3.1.2). However, with the advent
of Markov Chain Monte Carlo (MCMC) methods (which were developed in
the 1950s through 1970s) the popularity of Bayesian methods exploded as it
became possible to fit realistic models to complex data sets. This chapter is
dedicated to the core Bayesian computational tools that led to this resurgence.
Bayesian methods are often associated with heavy computing. This is not
an inherent property of the Bayesian paradigm. Bayesian computing is de-
signed to estimate the posterior distribution, which is analogous to frequentist
computing estimating the sampling distribution. Both problems are challeng-
ing, but frequentists often make use of large-sample normal approximations
to simplify the problem. There is nothing to prohibit a Bayesian from mak-
ing similar approximations, and this is the topic of Section 3.1.3. However,
these approximations are not common because MCMC produces much richer
output, and the accuracy of the approximation is not limited by the sample
size of the data, but rather the MCMC approximation can be made arbitrarily
precise even for non-Gaussian posteriors by increasing the computational
effort.
Bayesian computation boils down to summarizing a high-dimensional pos-
terior distribution (Section 1.4). In this chapter we cover methods that span
the spectrum from summaries that are fast to compute (e.g., MAP estima-
tion) but produce only a point estimate to slower methods (e.g., MCMC)
that permit full uncertainty quantification. Just as a user of maximum likeli-
hood analysis need not be an expert on the theory and practice of iteratively
reweighted least squares optimization, there is software (Section 3.3) avail-
able to implement most Bayesian analyses without the user possessing a deep
understanding of the algorithms being used under the hood. However, at a
minimum, the user should have sufficient understanding of the underlying
algorithms to properly inspect the output to determine if it is sensible. In
Sections 3.1 and 3.2, we outline the fundamental algorithms used in Bayesian
computing and work a few small examples for illustration. However, for the
remainder of the book, we will use the JAGS package introduced in Section 3.3
for computation to avoid writing MCMC code.

3.1 Deterministic methods


3.1.1 Maximum a posteriori estimation
Most of the computational methods discussed in this chapter attempt to sum-
marize the entire posterior distribution. This allows for full quantification of
uncertainty about the parameters and leads to inferences such as the posterior
probability that a hypothesis is true. However, summarizing the full posterior
distribution can be difficult and in some settings unnecessary. For example, in
machine-learning, statistical estimation is used like a filtering process where
the sole purpose is the prediction of future values using estimates from the
current data, and accounting for uncertainty and assessing statistical significance
are not high priorities. In this case, a single point estimate may be a
sufficient summary of the posterior.
The MAP (Section 1.4.1) point estimator is

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log[p(\theta|Y)] = \arg\max_{\theta} \{\log[f(Y|\theta)] + \log[\pi(\theta)]\}. \tag{3.1}$$

The MAP estimator requires optimization of the posterior, unlike the pos-
terior mean which requires integration. Typically optimization is faster than
integration, especially in high-dimensions, and therefore MAP estimation can
be applied even for hard problems.
The MAP solution can be found using calculus or numerical optimization
[62]. The simplest optimization algorithm is gradient ascent (or gradient de-
scent to minimize a function), which is an iterative algorithm that begins with
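an initial value $\boldsymbol{\theta}^{(0)}$ and updates the parameters at iteration $t$ using the rule

$$\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} + \gamma \nabla\left(\boldsymbol{\theta}^{(t-1)}\right),$$

where the step size $\gamma$ is a tuning parameter, the gradient vector of the log
posterior is $\nabla(\boldsymbol{\theta}) = [\nabla_1(\boldsymbol{\theta}), ..., \nabla_p(\boldsymbol{\theta})]^T$ for

$$\nabla_j(\boldsymbol{\theta}) = \frac{\partial}{\partial\theta_j}\left\{\log[f(\mathbf{Y}|\boldsymbol{\theta})] + \log[\pi(\boldsymbol{\theta})]\right\}.$$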

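This step is repeated until convergence, i.e., $\boldsymbol{\theta}^{(t)} \approx \boldsymbol{\theta}^{(t-1)}$. Gradient ascent
is most effective when the log posterior is concave, so that there is a single mode.
Otherwise it may stop at a local maximum, and it typically requires several
starting values to find the global maximum.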
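The update rule above can be sketched in a few lines of R for the two-parameter Gaussian model used in Listing 3.1. This is only an illustration under stated assumptions: the function names grad_log_post and gradient_ascent, the step size 0.05 and the stopping tolerance are our own choices, and the sketch does not enforce the bounds 0 < σ < 10 (the iterates stay well inside them for these data).

```r
Y <- c(2.68, 1.18, -0.97, -0.98, -1.03)

# Gradient of the log posterior for theta = (mu, sigma).
# The Unif(0,10) prior on sigma is flat, so it adds nothing inside (0, 10).
grad_log_post <- function(theta, Y) {
  mu    <- theta[1]
  sigma <- theta[2]
  c(sum(Y - mu) / sigma^2 - mu / 100^2,               # d/d mu
    -length(Y) / sigma + sum((Y - mu)^2) / sigma^3)   # d/d sigma
}

# Gradient ascent: theta(t) = theta(t-1) + step * gradient(theta(t-1))
gradient_ascent <- function(theta, Y, step = 0.05, tol = 1e-8, max_iter = 10000) {
  for (t in 1:max_iter) {
    theta_new <- theta + step * grad_log_post(theta, Y)
    if (max(abs(theta_new - theta)) < tol) break      # theta(t) ~ theta(t-1)
    theta <- theta_new
  }
  theta_new
}

MAP_ga <- gradient_ascent(c(mean(Y), sd(Y)), Y)
MAP_ga
```

With a prior this vague, the result should be close to the maximum likelihood values, i.e., the sample mean for µ and the square root of the average squared deviation for σ.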
There is a rich literature on optimization in statistical computing. Some
other useful methods include Newton’s method, the expectation-maximization
(EM) algorithm, the majorize-minimization (MM) algorithm, genetic algorithms,
etc. R has many optimization routines including the general-purpose
algorithm optim, as illustrated in Listing 3.1 and Figure 3.1 for the model
$Y_i|\mu,\sigma \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2)$ with priors $\mu \sim \text{Normal}(0, 100^2)$ and $\sigma \sim \text{Unif}(0, 10)$
and data Y = (2.68, 1.18, −0.97, −0.98, −1.03).

3.1.2 Numerical integration


Many posterior summaries of interest can be written as p-variate integrals over
the posterior, including the posterior means, covariances and probabilities of
hypotheses. For example, for θ1 , the marginal posterior mean, variance and
probability that θ1 exceeds constant c are
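$$\begin{aligned}
\text{E}(\theta_1|\mathbf{Y}) &= \int \theta_1\, p(\boldsymbol{\theta}|\mathbf{Y})\,d\boldsymbol{\theta} \qquad (3.2)\\
\text{Var}(\theta_1|\mathbf{Y}) &= \int [\theta_1 - \text{E}(\theta_1|\mathbf{Y})]^2\, p(\boldsymbol{\theta}|\mathbf{Y})\,d\boldsymbol{\theta}\\
\text{Prob}(\theta_1 > c|\mathbf{Y}) &= \int_c^{\infty}\int \cdots \int p(\boldsymbol{\theta}|\mathbf{Y})\,d\theta_1\, d\theta_2 \cdots d\theta_p.
\end{aligned}$$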

If these summaries are sufficient to describe the posterior, then numerical


integration can be used for Bayesian computing.
All of the summaries in (3.2) can be written as E[g(θ)] for some function
g, e.g., Prob(θ1 > c|Y) uses g(θ) = I(θ1 > c) where I(θ1 > c) = 1 if θ1 > c
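and zero otherwise. Assume that a grid of points $\boldsymbol{\theta}^*_1, ..., \boldsymbol{\theta}^*_m$ covers the range
of $\boldsymbol{\theta}$ with non-zero posterior density. Then

$$\text{E}[g(\boldsymbol{\theta})] = \int g(\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathbf{Y})\,d\boldsymbol{\theta} \approx \sum_{j=1}^{m} g(\boldsymbol{\theta}^*_j)\,W_j \tag{3.3}$$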

where $W_j$ is the weight given to grid point $j$. To approximate the posterior
mean of $g(\boldsymbol{\theta})$ the weights should be related to the posterior PDF

Listing 3.1
Numerical optimization and integration to summarize the posterior.

library(cubature)
Y <- c(2.68, 1.18, -0.97, -0.98, -1.03) # Data

# Evaluate the density on the grid for plotting
m     <- 50
mu    <- seq(-4,6,length=m)
sigma <- seq(0,10,length=m)
theta <- as.matrix(expand.grid(mu,sigma))
D     <- dnorm(theta[,1],0,100)*dunif(theta[,2],0,10) # Prior
for(i in 1:length(Y)){                                # Likelihood
  D <- D * dnorm(Y[i],theta[,1],theta[,2])
}
W     <- matrix(D/sum(D),m,m)

# MAP estimation
neg_log_post <- function(theta,Y){
  log_like  <- sum(dnorm(Y,theta[1],theta[2],log=TRUE))
  log_prior <- dnorm(theta[1],0,100,log=TRUE)+
               dunif(theta[2],0,10,log=TRUE)
  return(-log_like-log_prior)}

inits <- c(mean(Y),sd(Y))
MAP   <- optim(inits,neg_log_post,Y=Y,
               method = "L-BFGS-B", # Since the prior is bounded
               lower = c(-Inf,0), upper = c(Inf,10))$par

# Compute the posterior mean
post <- function(theta,Y){
  like  <- prod(dnorm(Y,theta[1],theta[2]))
  prior <- dnorm(theta[1],0,100)*dunif(theta[2],0,10)
  return(like*prior)}

g0 <- function(theta,Y){post(theta,Y)}
g1 <- function(theta,Y){theta[1]*post(theta,Y)}
g2 <- function(theta,Y){theta[2]*post(theta,Y)}
m0 <- adaptIntegrate(g0,c(-5,0.01),c(5,5),Y=Y)$integral # constant m(Y)
m1 <- adaptIntegrate(g1,c(-5,0.01),c(5,5),Y=Y)$integral
m2 <- adaptIntegrate(g2,c(-5,0.01),c(5,5),Y=Y)$integral
pm <- c(m1,m2)/m0

# Make the plot
image(mu,sigma,W,col=gray.colors(10,1,0),
      xlab=expression(mu),ylab=expression(sigma))
points(theta,cex=0.1,pch=19)
points(pm[1],pm[2],pch=19,cex=1.5)
points(MAP[1],MAP[2],col="white",cex=1.5,pch=19)
box()
[Figure 3.1: grid of points with µ on the horizontal axis (−4 to 6) and σ on the vertical axis (0 to 10).]

FIGURE 3.1
Numerical integration and optimization. The small points are the grid
points $\boldsymbol{\theta}^*_j = (\mu^*_j, \sigma^*_j)$ and the color of the shading is the posterior density for
the model $Y_i|\mu,\sigma \sim \text{Normal}(\mu, \sigma^2)$ with Y = (2.68, 1.18, −0.97, −0.98, −1.03)
and priors $\mu \sim \text{Normal}(0, 100^2)$ and $\sigma \sim \text{Uniform}(0, 10)$. The large black
dot is the approximate posterior mean found using numerical integration and
the large white dot is the approximate posterior mode found using numerical
optimization.

$D_j = f(\mathbf{Y}|\boldsymbol{\theta}^*_j)\pi(\boldsymbol{\theta}^*_j)$. The normalizing constant m(Y) cannot be computed
and so the posterior is normalized at the grid points as $W_j = D_j / \sum_{l=1}^{m} D_l$.
This provides a very simple approximation to the posterior of θ. Of course,
more precise numerical integration methods can be used, including those in the R
function adaptIntegrate as illustrated in Listing 3.1 and Figure 3.1. However,
we do not focus on these methods because they do not scale well with the
number of parameters, p. For example, if we use a grid of 20 points for each of
the p = 10 parameters then there are $m = 20^{10}$ points in the expanded grid,
which is over ten trillion points!
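To make the weighted-sum approximation concrete, the sketch below rebuilds the grid weights for the Gaussian example and uses them to approximate two posterior means and one posterior probability. It is a minimal illustration rather than part of Listing 3.1: the names grid, post_mean_mu, post_mean_sigma and prob_mu_pos are ours, and starting the σ grid at 0.01 instead of 0 is an assumption made to avoid a zero standard deviation.

```r
Y <- c(2.68, 1.18, -0.97, -0.98, -1.03)
m <- 50
grid <- expand.grid(mu    = seq(-4, 6, length.out = m),
                    sigma = seq(0.01, 10, length.out = m))

# Unnormalized posterior D_j = f(Y | theta_j) * pi(theta_j) at each grid point
D <- dnorm(grid$mu, 0, 100) * dunif(grid$sigma, 0, 10)
for (y in Y) D <- D * dnorm(y, grid$mu, grid$sigma)

# Normalized weights W_j = D_j / sum_l D_l
W <- D / sum(D)

# E[g(theta)] is approximated by sum_j g(theta_j) * W_j
post_mean_mu    <- sum(grid$mu * W)          # g(theta) = mu
post_mean_sigma <- sum(grid$sigma * W)       # g(theta) = sigma
prob_mu_pos     <- sum((grid$mu > 0) * W)    # g(theta) = I(mu > 0)

# Size of the expanded grid with 20 points for each of p = 10 parameters
n_grid_points <- 20^10
```

The same weights serve every summary; only the function g changes, which is what makes the grid approach attractive when p is small.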

3.1.3 Bayesian central limit theorem (CLT)


The central limit theorem states that the sampling distribution of the sample
mean for a large sample is approximately normal even for non-Gaussian data.
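This can be extended to the sampling distribution of the MLE. Many frequentist
standard error and p-value computations rely on the approximation

$$\hat{\boldsymbol{\theta}}_{MLE} \sim \text{Normal}(\boldsymbol{\theta}_0, \hat{\Sigma}_{MLE}), \tag{3.4}$$

where $\boldsymbol{\theta}_0$ is the true value and the $p \times p$ covariance matrix $\hat{\Sigma}_{MLE} = (-H)^{-1}$
and the $(j, k)$ element of the Hessian matrix $H$ is

$$\left.\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \log[f(\mathbf{Y}|\boldsymbol{\theta})]\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{MLE}}. \tag{3.5}$$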
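To define the analogous Bayesian approximation, suppose that $Y_i|\boldsymbol{\theta} \overset{indep}{\sim} f(Y_i|\boldsymbol{\theta})$
for $i = 1, ..., n$ and $\boldsymbol{\theta} \sim \pi(\boldsymbol{\theta})$. Under general conditions (see Section
7.2.2), for large n the posterior is approximately

$$\boldsymbol{\theta}|\mathbf{Y} \sim \text{Normal}\left(\hat{\boldsymbol{\theta}}_{MAP}, \hat{\Sigma}_{MAP}\right) \tag{3.6}$$

where $\hat{\Sigma}_{MAP} = (-H)^{-1}$ is defined the same way as $\hat{\Sigma}_{MLE}$ except that the
Hessian $H$ is

$$\left.\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \log[p(\boldsymbol{\theta}|\mathbf{Y})]\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{MAP}}, \tag{3.7}$$

which will be similar to $\hat{\Sigma}_{MLE}$ for large samples. Of course, this is not appropriate
if the parameters are discrete, but the Bayesian CLT can still be used
in some cases where the observations are dependent.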
For example, consider the beta-binomial model Y|θ ∼ Binomial(n, θ) with
Jeffreys prior θ ∼ Beta(1/2, 1/2). The MAP estimate is $\hat{\theta}_{MAP} = A/(A + B)$
and the approximate posterior variance is

$$\left[\frac{A}{\hat{\theta}_{MAP}^2} + \frac{B}{(1 - \hat{\theta}_{MAP})^2}\right]^{-1},$$
[Figure 3.2: three panels titled "Y=3, n=10", "Y=9, n=30" and "Y=30, n=100", each plotting the posterior against θ from 0 to 1, with legend entries Exact, CLT and MAP.]
FIGURE 3.2
Illustration of the Bayesian CLT. The exact Beta(Y + 1/2, n − Y + 1/2)
posterior versus the Gaussian approximation for the model Y|θ ∼ Binomial(n, θ)
with prior θ ∼ Beta(1/2, 1/2) for various Y and n.

where A = Y − 0.5 and B = n − Y − 0.5. Of course, there is no need
for an approximation in this simple case because the exact posterior is
θ|Y ∼ Beta(Y + 1/2, n − Y + 1/2), but Figure 3.2 illustrates that the Gaussian
approximation works well for large n.
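As a rough check of this example, the following sketch computes the closed-form MAP estimate and approximate variance for Y = 3 and n = 10 and compares them with a numerical mode and Hessian from optim. The object names (theta_map, var_clt, fit) and the evaluation grid are assumptions made for illustration, and the numerical Hessian is only accurate up to finite-difference error.

```r
Y <- 3; n <- 10
A <- Y - 0.5
B <- n - Y - 0.5

# Closed-form MAP estimate and Bayesian CLT variance
theta_map <- A / (A + B)
var_clt   <- 1 / (A / theta_map^2 + B / (1 - theta_map)^2)

# Numerical version: minimize the negative log posterior and invert its Hessian
neg_log_post <- function(theta) -dbeta(theta, Y + 0.5, n - Y + 0.5, log = TRUE)
fit <- optim(0.5, neg_log_post, method = "L-BFGS-B",
             lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)

# Exact Beta posterior density versus the Gaussian approximation
theta <- seq(0.001, 0.999, length.out = 999)
exact <- dbeta(theta, Y + 0.5, n - Y + 0.5)
clt   <- dnorm(theta, theta_map, sqrt(var_clt))

plot(theta, exact, type = "l", xlab = expression(theta), ylab = "Posterior")
lines(theta, clt, lty = 2)
abline(v = theta_map, lty = 3)
```

Changing Y and n to 30 and 100 should bring the two curves much closer together, in line with the large-sample claim.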
As discussed in Section 3.1.1, if the prior is uninformative then the MAP
and MLE estimates are similar, and these approximations suggest that the
entire posterior will be similar. This is the first glimpse at a recurring theme:
Bayesian and maximum likelihood methods give similar results for large sam-
ple sizes irrespective of the prior.
Advantages of the Bayesian CLT approximation are that it replaces in-
tegration (Section 3.1.2) with differentiation, and is thus easier to compute,
especially for large p, and the method is deterministic rather than stochastic.
A disadvantage is that the approximation may not be accurate for small or
even moderate sample sizes. However, more elaborate approximation methods
have been proposed that combine numerical integration with Gaussian
approximations, such as the integrated nested Laplace approximation (INLA)
of [73] (Appendix A.4). Alternatively, in variational Bayesian inference [46]
the user postulates a parametric (potentially non-Gaussian) posterior distri-
bution and then solves for the parameters in the postulated posterior that
give the best approximation to the true posterior.

3.2 Markov chain Monte Carlo (MCMC) methods


Monte Carlo (MC) methods are appealing to statisticians because they mirror
the fundamental statistical concept of using a sample to make inferences about
a population. A canonical problem in statistics is to estimate a summary of the
population, such as the population mean or standard deviation. In most cases
(i.e., not a census) we cannot observe the entire population to compute the
mean or standard deviation directly, so instead we take samples and use them
to make inference about the population. For example, we use the sample mean
to approximate the population mean and, assuming the sample is sufficiently
large, the law of large numbers guarantees the approximation is reliable.
MC sampling from the posterior works the same way. In this analogy, the
population of interest is the posterior distribution. We would like to summarize
the posterior using its mean and variance, but in most cases these posterior
summaries cannot be computed directly, and so we take a sample from the
posterior and use the MC sample mean to approximate the posterior mean and
the MC sample variance to approximate the posterior variance. Assuming the
MC sample is sufficiently large, then the approximation is reliable. This holds
for virtually any posterior distribution and any summary of the posterior,
even those that are high-dimensional integrals.
Assume that we have generated S posterior samples, $\boldsymbol{\theta}^{(1)}, ..., \boldsymbol{\theta}^{(S)} \sim p(\boldsymbol{\theta}|\mathbf{Y})$,
where sample s is $\boldsymbol{\theta}^{(s)} = (\theta_1^{(s)}, ..., \theta_p^{(s)})$. These draws can be used
to approximate any posterior summary discussed in Section 1.4. For a univariate
model with p = 1, we could approximate the posterior density with
a histogram of the S samples, the posterior mean with the MC sample mean
$\text{E}(\theta_1|\mathbf{Y}) \approx \sum_{s=1}^{S} \theta_1^{(s)}/S$, the posterior variance with the MC sample variance
of the S samples, the probability that $\theta_1 > 0$ with the MC sample proportion
with $\theta_1^{(s)} > 0$, etc.
To illustrate convergence of MC sampling, consider the Poisson-gamma
model Y |θ ∼ Poisson(N θ) with prior θ ∼ Gamma(a, b). The posterior is
then θ|Y ∼ Gamma(Y + a, N + b). Figure 3.3 assumes N = 10, Y = 8 and
a = b = 0.1 and plots the approximate posterior mean and probability that
θ > 0.5 as a function of the number of MC iterations. The sample mean and
proportion are noisy for a small number of MC samples, but as the number
of samples increases they converge to the true posterior mean and probability
that θ > 0.5.
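The convergence just described is easy to reproduce. The sketch below draws S = 5000 samples from the Gamma(Y + a, N + b) posterior and tracks the running sample mean and the running proportion exceeding 0.5; the seed and the object names are our own choices, and rgamma is called with its shape and rate arguments.

```r
set.seed(1)
N <- 10; Y <- 8; a <- 0.1; b <- 0.1
S <- 5000

# MC samples from the posterior theta | Y ~ Gamma(Y + a, N + b)
theta <- rgamma(S, Y + a, N + b)

# Running MC approximations after 1, 2, ..., S samples
running_mean <- cumsum(theta) / (1:S)
running_prob <- cumsum(theta > 0.5) / (1:S)

# True values for comparison
true_mean <- (Y + a) / (N + b)
true_prob <- 1 - pgamma(0.5, Y + a, N + b)

par(mfrow = c(1, 2))
plot(running_mean, type = "l", xlab = "Iteration, S", ylab = "Sample mean")
abline(h = true_mean, lty = 2)
plot(running_prob, type = "l", xlab = "Iteration, S", ylab = "Sample probability")
abline(h = true_prob, lty = 2)
```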
For multivariate models with p > 1 parameters, the samples $\theta_j^{(1)}, ..., \theta_j^{(S)}$
follow the marginal posterior distribution of $\theta_j$, $p(\theta_j|\mathbf{Y})$. Critically, we do
not need to analytically integrate $p(\theta_j|\mathbf{Y}) = \int p(\boldsymbol{\theta}|\mathbf{Y})\,d\theta_1 \cdots d\theta_{j-1}\,d\theta_{j+1} \cdots d\theta_p$.
Because each sample consists of a random draw of all parameters, MC
sampling automatically produces samples from the marginal distribution of $\theta_j$
accounting for uncertainty in the other parameters.
[Figure 3.3: two panels plotting the sample mean and the sample probability against Iteration, S, from 0 to 5000.]

FIGURE 3.3
Convergence of a Monte Carlo approximation. Assuming $\theta^{(s)} \overset{iid}{\sim} \text{Gamma}(8.1, 10.1)$,
the plot gives the sample mean $\sum_{s=1}^{S} \theta^{(s)}/S$ and
sample proportion $\sum_{s=1}^{S} I(\theta^{(s)} > 0.5)/S$ by the number of samples, S; the
horizontal lines are the true values.

In addition to the marginal distribution, MC sampling can be used to approximate
the posterior of parameters defined as transformations of the original
parameters. For example, we can approximate the posterior of $\gamma = g(\theta_1)$
for a function g using the MC samples $\gamma^{(1)}, ..., \gamma^{(S)}$ where $\gamma^{(s)} = g(\theta_1^{(s)})$, that
is, we simply transform each sample and these transformed samples approximate
the posterior distribution of γ. The posterior of the transformed parameter
is summarized like any other, e.g., $\text{E}(\gamma|\mathbf{Y}) \approx \sum_{s=1}^{S} \gamma^{(s)}/S$. As another
example, say we want to test if $\Delta = \theta_1 - \theta_2 > 0$. The posterior of ∆ is approximated
by the S samples $\Delta^{(s)} = \theta_1^{(s)} - \theta_2^{(s)}$, and thus Prob(∆ > 0|Y) is approximated
using the sample proportion of the S samples for which $\Delta^{(s)} > 0$.
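A short sketch shows how little work the transformations require once samples are available. The two posteriors used here, θ1|Y ∼ Gamma(8.1, 10.1) and θ2|Y ∼ Gamma(5.1, 10.1), taken to be independent, are hypothetical and chosen only for illustration, as are the choice g = log and the object names.

```r
set.seed(2)
S <- 10000

# Hypothetical posterior samples (assumed independent for this illustration)
theta1 <- rgamma(S, 8.1, 10.1)
theta2 <- rgamma(S, 5.1, 10.1)

# Transformation gamma = g(theta1) with g = log: transform each sample
gamma_s <- log(theta1)
post_mean_gamma <- mean(gamma_s)

# Difference Delta = theta1 - theta2 and Prob(Delta > 0 | Y)
Delta <- theta1 - theta2
prob_Delta_pos <- mean(Delta > 0)
ci_Delta <- quantile(Delta, c(0.025, 0.975))
```

No new derivation is needed for γ or ∆; the same vectors of draws are reused for every summary.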
Once we have posterior samples, summarizing the posterior or even complicated
functions of the posterior is straightforward and this is one of the appeals of
MC sampling. However, generating valid samples from the posterior distribution
is not always straightforward. We will focus on two sampling algorithms:
Gibbs sampling and Metropolis–Hastings sampling. Gibbs sampling is pre-
ferred when it is possible to directly sample from the conditional distribution
of each parameter. Metropolis–Hastings is a generalization to more compli-
cated problems. We also briefly mention other more advanced algorithms in
Appendix A.4 including delayed rejection and adaptive Metropolis (DRAM),
the Metropolis-adjusted Langevin algorithm, Hamiltonian MC (HMC) and
slice sampling.

3.2.1 Gibbs sampling


As a motivating example, consider the Gaussian model
$$Y_1, ..., Y_n \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2) \tag{3.8}$$

where the priors are $\mu \sim \text{Normal}(\gamma, \tau^2)$ and $\sigma^2 \sim \text{InvGamma}(a, b)$. To study
the posterior of $\boldsymbol{\theta} = (\mu, \sigma^2)$ we would like to make S draws from the joint
posterior distribution p(θ|Y). Section 2.1.3 provides a means to sample from
the posterior of µ with σ 2 assumed known since µ|σ 2 , Y follows a Gaussian
distribution, and Section 2.1.4 provides a means to sample from the posterior
of σ 2 with µ assumed known since σ 2 |µ, Y follows an inverse gamma distri-
bution. Gibbs sampling generates draws from the desired joint posterior of θ
using only these univariate conditional distributions.
Gibbs sampling (proposed by [34]) begins with initial values for both pa-
rameters and proceeds by alternating between sampling µ|σ 2 , Y and then
sampling σ 2 |µ, Y until S samples have been collected (for an extensive dis-
cussion about selecting S, see Section 3.4). For example, we might begin by
setting µ to the sample mean of Y and σ 2 to the sample variance of Y (for
an extensive study of initialization, see Section 3.4). Implementing successive
steps requires deriving the full conditional posterior distributions, i.e., the
distribution of one parameter conditioned on the data and all other parameters.
Following Sections 2.1.3 and 2.1.4 (with slight modification to the prior
distributions), the full conditional posterior distributions are

$$\begin{aligned}
\mu|\sigma^2, \mathbf{Y} &\sim \text{Normal}\left(\frac{\sum_{i=1}^{n} Y_i/\sigma^2 + \gamma/\tau^2}{n/\sigma^2 + 1/\tau^2},\ \frac{1}{n/\sigma^2 + 1/\tau^2}\right)\\
\sigma^2|\mu, \mathbf{Y} &\sim \text{InvGamma}\left(n/2 + a,\ \sum_{i=1}^{n} (Y_i - \mu)^2/2 + b\right).
\end{aligned}$$

This results in S posterior samples $\boldsymbol{\theta}^{(1)}, ..., \boldsymbol{\theta}^{(S)}$ to summarize the posterior,
where $\boldsymbol{\theta}^{(s)} = (\mu^{(s)}, \sigma^{2(s)})$. Listing 3.2 provides R code to perform these steps
using the same data as in Section 1.1.2. Figure 3.4 (bottom row) plots the
draws from the joint posterior and marginal density for µ, which both closely
resemble Figure 1.11.
Algorithm 1 provides a general recipe for Gibbs sampling. The algorithm
begins with an initial value for the parameter vector and each iteration consists
of a sequence of updates from each parameter’s full conditional distribution.
Because each step cycles through the parameters and updates them given the
current values of all other parameters, the samples are not independent. All
sampling is performed conditionally on the previous iteration’s value, and so
the samples form a specific type of stochastic process called a Markov chain,
hence the name Markov chain Monte Carlo (MCMC) sampling. There are
MC samplers that attempt to sample independent draws from the posterior
(e.g., rejection sampling and approximate Bayesian computing (ABC) [53]),
but MCMC is preferred for high-dimensional problems.
The beauty of this algorithm is that it reduces the difficult problem of
sampling from a multivariate distribution down to a sequence of simpler univariate
problems. This assumes that the full conditional distributions are easy
to sample from, but as we will see in the following examples, even for high-

Listing 3.2
Gibbs sampling for the Gaussian model with unknown mean and variance.

# Load the data
Y <- c(2.68,1.18,-0.97,-0.98,-1.03)
n <- length(Y)

# Create an empty matrix for the MCMC samples
S <- 25000
samples <- matrix(NA,S,2)
colnames(samples) <- c("mu","sigma")

# Initial values
mu   <- mean(Y)
sig2 <- var(Y)

# priors: mu ~ N(gamma,tau), sig2 ~ InvG(a,b)
gamma <- 0
tau   <- 100^2
a     <- 0.1
b     <- 0.1

# Gibbs sampling
for(s in 1:S){
  P  <- n/sig2 + 1/tau
  M  <- sum(Y)/sig2 + gamma/tau
  mu <- rnorm(1,M/P,1/sqrt(P))

  A    <- n/2 + a
  B    <- sum((Y-mu)^2)/2 + b
  sig2 <- 1/rgamma(1,A,B)

  samples[s,]<-c(mu,sqrt(sig2))
}

# Plot the joint posterior and marginal of mu
plot(samples,xlab=expression(mu),ylab=expression(sigma))
hist(samples[,1],xlab=expression(mu))

# Posterior mean, median and credible intervals
apply(samples,2,mean)
apply(samples,2,quantile,c(0.025,0.500,0.975))
[Figure 3.4: top row, trace plots of µ and σ against Iteration (0 to 25000); bottom row, the joint posterior samples and a histogram (Frequency) of the samples of µ.]

FIGURE 3.4
Summary of posterior samples from MCMC for the Gaussian model
with unknown mean and variance. The first row gives trace plots of the
samples for µ and σ, the second row shows the samples from the joint posterior
of (µ, σ) and the marginal posterior of µ.

Algorithm 1 Gibbs Sampling
1: Initialize $\boldsymbol{\theta}^{(0)} = (\theta_1^{(0)}, ..., \theta_p^{(0)})$
2: for s = 1, ..., S do
3:   for j = 1, ..., p do
4:     sample $\theta_j^{(s)} \sim p_j(\theta_j \mid \theta_1^{(s)}, ..., \theta_{j-1}^{(s)}, \theta_{j+1}^{(s-1)}, ..., \theta_p^{(s-1)}, \mathbf{Y})$
5:   end for
6: end for

dimensional problems with large p, the full conditional distributions often


follow familiar conjugacy pairs (Section 2.1) which are conducive to sampling.
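The recipe in Algorithm 1 can be written once and reused. The sketch below is one possible generic implementation, not the code used elsewhere in this chapter: the function gibbs and its argument full_conditionals (a list with one sampling function per parameter, each taking the current parameter vector) are hypothetical names introduced here. It is then applied to the Gaussian model with the priors of Listing 3.2.

```r
# Generic Gibbs sampler: update each parameter from its full conditional,
# always conditioning on the most recent values of the other parameters.
gibbs <- function(init, full_conditionals, S) {
  p       <- length(init)
  samples <- matrix(NA, S, p)
  theta   <- init
  for (s in 1:S) {
    for (j in 1:p) {
      theta[j] <- full_conditionals[[j]](theta)
    }
    samples[s, ] <- theta
  }
  samples
}

# Full conditionals for theta = (mu, sig2) in the Gaussian model
Y <- c(2.68, 1.18, -0.97, -0.98, -1.03)
n <- length(Y)
draw_mu <- function(theta) {
  P <- n / theta[2] + 1 / 100^2             # prior: mu ~ Normal(0, 100^2)
  rnorm(1, (sum(Y) / theta[2]) / P, 1 / sqrt(P))
}
draw_sig2 <- function(theta) {              # prior: sig2 ~ InvGamma(0.1, 0.1)
  1 / rgamma(1, n / 2 + 0.1, sum((Y - theta[1])^2) / 2 + 0.1)
}

set.seed(3)
out <- gibbs(c(mean(Y), var(Y)), list(draw_mu, draw_sig2), S = 5000)
colnames(out) <- c("mu", "sig2")
apply(out, 2, quantile, c(0.025, 0.500, 0.975))
```

Separating the loop from the full conditionals keeps the model-specific work in two short functions, at the cost of some speed compared with the hand-written loop.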
Before continuing with other examples, we pause and discuss the theo-
retical properties of this algorithm. Why does it work? That is, why does
repeated sampling from full conditional distributions lead to samples from
the joint posterior distribution? Surely we cannot trust that the initial value
is a sample from the posterior because it is selected subjectively by the user.
It would also be dangerous to trust that the first random sample follows the
posterior distribution because it depends on the initial values, which might be
far from the posterior. Appendix A.3 provides an argument using stochastic
process theory that (1) under general conditions, the samples produced by the
algorithm converge to the posterior distribution for any choice of initial values
and (2) once a sample is drawn from the posterior, all subsequent samples are
also samples from the posterior.
The theoretical convergence arguments dictate how Gibbs sampling is used
in practice. The user follows the chain until it has converged, and discards all
previous samples from this burn-in period. Convergence is often assessed by
visual inspection of the trace plots (i.e., a plot of the samples by iteration
number), but formal measures are available (Section 3.4). All remaining
samples are used to summarize the posterior. It is important to remember
that unlike optimization algorithms, we do not expect Gibbs sampling
to converge to a single optimal value. Rather, we hope that after burn-in the
algorithm produces samples from the posterior distribution. That is, we hope
the samples generated by the algorithm converge in distribution to the poste-
rior, and do not get stuck in one place. Figure 3.5 provides idealized output
where convergence clearly occurs around iteration 1000 for both parameters
and the remaining samples consistently draw from the same distribution after
this point and thus the trace plot resembles a “bar code” or “caterpillar.”
Bivariate normal example: We will use Gibbs sampling to make draws
from posterior distributions, but in fact it can be used to sample from any
distribution (so long as we can sample from the full conditional distributions).
For example, say θ = (U, V ) is bivariate normal with mean zero and variance
1 for both parameters and correlation ρ between the parameters. This is not
a posterior distribution because we are not conditioning on data, but in this
toy example we are using Gibbs sampling to draw from this bivariate normal
distribution to get a feel for the algorithm and how it converges to the target
distribution. As given in (1.26) (see Figure 1.7), the full conditional distribu-
tions are U |V ∼ Normal(ρV, 1 − ρ2 ) and V |U ∼ Normal(ρU, 1 − ρ2 ). Given
initial values (U (0) , V (0) ), the first three iterations are:
1a Draw $U^{(1)}|V^{(0)} \sim \text{Normal}(\rho V^{(0)}, 1 - \rho^2)$
1b Draw $V^{(1)}|U^{(1)} \sim \text{Normal}(\rho U^{(1)}, 1 - \rho^2)$
2a Draw $U^{(2)}|V^{(1)} \sim \text{Normal}(\rho V^{(1)}, 1 - \rho^2)$
2b Draw $V^{(2)}|U^{(2)} \sim \text{Normal}(\rho U^{(2)}, 1 - \rho^2)$
3a Draw $U^{(3)}|V^{(2)} \sim \text{Normal}(\rho V^{(2)}, 1 - \rho^2)$
3b Draw $V^{(3)}|U^{(3)} \sim \text{Normal}(\rho U^{(3)}, 1 - \rho^2)$.
The algorithm continues until S draws have been made.

[Figure 3.5: two panels showing trace plots of θ1 and θ2 against Iteration (0 to 5000).]

FIGURE 3.5
Gibbs sampling trace plots. These trace plots of the iteration number s
by the posterior samples $\theta_j^{(s)}$ for j = 1, 2 show convergence at iteration 1000,
which is denoted with a vertical line.

It can be shown that the distribution after s iterations is

$$U^{(s)} \sim \text{Normal}\left(\rho^{2s-1} V^{(0)},\ 1 - \rho^{4s-2}\right).$$

The true marginal distribution of U is Normal(0, 1), and since $\rho^{2s-1} \approx 0$ and
$\rho^{4s-2} \approx 0$ for large s, for any initial value the MCMC posterior is very close but
never exactly equal to the true posterior. Convergence is immediate if ρ = 0
and slow if |ρ| ≈ 1, illustrating how cross-correlation between parameters can
hinder MCMC convergence (e.g., Figure 3.8 and the discussion of blocked
Gibbs sampling below). Nonetheless, for any ρ and large S the approximation
is surely sufficient and can be made arbitrarily precise by increasing S.
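The stated distribution of U after s iterations can be checked by simulation. The sketch below runs many independent Gibbs chains from the same initial value and compares the mean and variance of U at iteration s = 5 with the formula; the values ρ = 0.9, V(0) = 3, the number of chains and the seed are arbitrary choices made for this illustration.

```r
set.seed(4)
rho    <- 0.9
V0     <- 3
S      <- 5        # number of Gibbs iterations per chain
chains <- 20000    # independent chains, all started at V(0) = V0

sd_c <- sqrt(1 - rho^2)   # conditional standard deviation
V <- rep(V0, chains)
for (s in 1:S) {
  U <- rnorm(chains, rho * V, sd_c)   # step "a": draw U | V
  V <- rnorm(chains, rho * U, sd_c)   # step "b": draw V | U
}

# Empirical moments of U after S iterations versus the formula
mean_theory <- rho^(2 * S - 1) * V0
var_theory  <- 1 - rho^(4 * S - 2)
rbind(simulated = c(mean(U), var(U)),
      theory    = c(mean_theory, var_theory))
```

With ρ = 0.9 the mean of U is still far from zero after five iterations, whereas rerunning with a small ρ should show almost immediate agreement with the Normal(0, 1) target.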
Constructing a Gibbs sampler for NFL concussions data: In Sec-
tion 2.1 we analyzed the NFL concussion data. The data consist of the number
of concussions in the years 2012–2015, Y1 = 171, Y2 = 152, Y3 = 123, and
Y4 = 199, respectively. In Section 2.1 we analyzed the years independently,
but here we analyze all years simultaneously with the model
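$$Y_i|\lambda_i \overset{indep}{\sim} \text{Poisson}(N\lambda_i), \qquad \lambda_i|\gamma \overset{indep}{\sim} \text{Gamma}(1, \gamma), \qquad \gamma \sim \text{Gamma}(a, b) \tag{3.9}$$

where N = 256 is the number of games per season. The model has five unknown
parameters $\boldsymbol{\theta} = (\lambda_1, ..., \lambda_4, \gamma)$ and posterior

$$p(\lambda_1, ..., \lambda_4, \gamma|\mathbf{Y}) \propto \left[\prod_{i=1}^{4} f(Y_i|\lambda_i)\pi(\lambda_i|\gamma)\right]\pi(\gamma), \tag{3.10}$$
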

where f is the Poisson PMF and π is the gamma PDF.


The Gibbs sampler requires the full conditional posterior for each of the
five parameters. We first compute the full conditional of λ1 given all the other
parameters. For this computation, we only need to consider terms that depend
on λ1 , and all other terms can be absorbed in the normalizing constant. For
λ1 this gives
$$p(\lambda_1|\lambda_2, \lambda_3, \lambda_4, \gamma, \mathbf{Y}) \propto f(Y_1|\lambda_1)\pi(\lambda_1|\gamma). \tag{3.11}$$
This is exactly the form of the posterior for the Poisson-gamma conjugacy pair
studied in Section 2.1, and so the full conditional of λ1 is Gamma(Y1 +1, N +γ);
the full conditionals for the other λj are similar.
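The terms in the likelihood $f(Y_i|\lambda_i)$ do not depend on γ, and so the full
conditional is

$$p(\gamma|\lambda_1, \lambda_2, \lambda_3, \lambda_4, \mathbf{Y}) \propto \left[\prod_{i=1}^{4} \pi(\lambda_i|\gamma)\right]\pi(\gamma). \tag{3.12}$$

The Gamma(1, γ) prior has density $\pi(\lambda_i|\gamma) = \gamma\exp(-\gamma\lambda_i)$, which is kept
in full when viewed as a function of γ. Therefore, the full conditional reduces to

$$\begin{aligned}
p(\gamma|\lambda_1, \lambda_2, \lambda_3, \lambda_4, \mathbf{Y}) &\propto \left[\prod_{i=1}^{4} \gamma\exp(-\gamma\lambda_i)\right]\gamma^{a-1}\exp(-\gamma b)\\
&\propto \gamma^{4+a-1}\exp\left[-\gamma\left(\sum_{i=1}^{4}\lambda_i + b\right)\right],
\end{aligned}$$

and thus the update for γ is $\text{Gamma}(4 + a, \sum_{i=1}^{4}\lambda_i + b)$. Note that the update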
does not depend on the data, Y. However, the data are informative about γ
indirectly because they largely determine the posterior of the λj which in turn
influence the posterior of γ.

Listing 3.3 provides the code that generates the results in Figure 3.6.
Convergence for γ (right panel of Figure 3.6) is immediate (as it is for the $\lambda_i$, not
shown), and so we do not need a burn-in period and all samples are used to
summarize the posterior. The posterior distribution from this analysis of all
years (left panel of Figure 3.6) is actually quite similar to the one-year-at-a-time
analysis in Figure 2.2.
Constructing a blocked Gibbs sampler for a T. rex growth chart:
[25] studies growth charts for several tyrannosaurid dinosaur species, including
Tyrannosaurus rex (T. rex). There are n = 7 T. rex observations with weights
(kg) 29.9, 1761, 1807, 2984, 3230, 5040, and 5654 and corresponding ages

Listing 3.3
Gibbs sampling for the NFL concussions data.

# Load data
Y <- c(171, 152, 123, 199)
n <- 4
N <- 256

# Create an empty matrix for the MCMC samples
S <- 25000
samples <- matrix(NA,S,5)
colnames(samples) <- c("lam1","lam2","lam3","lam4","gamma")

# Initial values (the observed rates, which are positive)
lambda <- Y/N
gamma  <- 1/mean(lambda)

# priors: lambda[i]|gamma ~ Gamma(1,gamma), gamma ~ Gamma(a,b)
a <- 0.1
b <- 0.1

# Gibbs sampling
for(s in 1:S){
  for(i in 1:n){
    lambda[i] <- rgamma(1,Y[i]+1,N+gamma)
  }
  gamma <- rgamma(1,n+a,sum(lambda)+b)
  samples[s,] <- c(lambda,gamma)
}

boxplot(samples[,1:4],outline=FALSE,
        ylab=expression(lambda),names=2012:2015)
plot(samples[,5],type="l",xlab="Iteration",
     ylab=expression(gamma))

# Posterior mean, median and credible interval
apply(samples,2,mean)
apply(samples,2,quantile,c(0.025,0.500,0.975))
[Figure 3.6: left panel, λ by year (2012 to 2015); right panel, γ against Iteration (0 to 25000).]

FIGURE 3.6
MCMC analysis of the NFL concussions data. The left panel plots the
posterior distribution of the concussion rate λi for each year, and the right
panel is a trace plot of the hyperparameter γ.

(years) 2, 15, 14, 16, 18, 22, and 28, plotted in Figure 3.7 (top left). Non-linear
models are likely more appropriate for growth curve data (Section 6.3),
but we fit a linear model for illustration. We assume that $Y_i = \beta_1 + x_i\beta_2 + \varepsilon_i$
where $Y_i$ is the weight of dinosaur i, $x_i$ is the age, $\beta_1$ and $\beta_2$ are the
regression intercept and slope, respectively, and $\varepsilon_i \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$. We select
uninformative priors $\beta_j \overset{indep}{\sim} \text{Normal}(0, \tau)$ and $\sigma^2 \sim \text{InvGamma}(a, b)$ with
$\tau = 10{,}000^2$ and a = b = 0.1. Figure 3.7 (top right) plots the posterior distribution
of the three parameters $\boldsymbol{\theta} = (\beta_1, \beta_2, \sigma)$. The error standard deviation σ is not
strongly dependent on the other parameters, but the regression coefficients
$\boldsymbol{\beta} = (\beta_1, \beta_2)$ have correlation $\text{Cor}(\beta_1, \beta_2) = -0.91$.
Gibbs sampling has trouble when parameters have strong posterior dependence.
To see this, consider the hypothetical example plotted in Figure 3.8.
The two parameters β = (β1, β2) follow a bivariate normal posterior with
means equal to zero, variances equal to one, and correlation equal to −0.98.
The initial values are $\boldsymbol{\beta}^{(0)} = (0, -3)$. The first step is
to update β1 with β2 = −3. This corresponds to a draw along the bottom row of
Figure 3.8, and so the sample is β1 ≈ 3. The next update is from the
conditional distribution of β2 given β1 ≈ 3. Since the two variables are
negatively correlated, with β1 ≈ 3 we must have β2 ≈ −3, and so β2 is only
slightly changed from its initial value. These small updates continue for the
next four iterations and β stays in the lower-right quadrant. This toy example
illustrates that with strong posterior dependence between parameters, the
one-at-a-time Gibbs sampler traverses the parameter space slowly, leading to
poor convergence.
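The slow movement described above can be reproduced numerically. The following is a minimal sketch, not code from the text: it runs a one-at-a-time Gibbs sampler for the hypothetical bivariate normal posterior (means 0, variances 1, correlation −0.98), using the standard result that each full conditional is Normal(ρ × other coordinate, 1 − ρ²), and compares it with independent joint draws. The names `gibbs`, `block`, and `lag1` are illustrative choices.

```r
set.seed(1)
rho <- -0.98
S <- 2000
cond_sd <- sqrt(1 - rho^2)  # sd of one coordinate given the other

# One-at-a-time Gibbs sampling, started at beta = (0, -3)
gibbs <- matrix(NA, S, 2)
beta <- c(0, -3)
for (s in 1:S) {
  beta[1] <- rnorm(1, rho * beta[2], cond_sd)  # beta1 | beta2
  beta[2] <- rnorm(1, rho * beta[1], cond_sd)  # beta2 | beta1
  gibbs[s, ] <- beta
}

# Joint draws of (beta1, beta2) for comparison
R <- chol(matrix(c(1, rho, rho, 1), 2, 2))     # t(R) %*% R = covariance
block <- matrix(rnorm(2 * S), S, 2) %*% R

# Lag-1 autocorrelation of the beta1 draws under each scheme
lag1 <- function(x) cor(x[-1], x[-length(x)])
ac_gibbs <- lag1(gibbs[, 1])
ac_block <- lag1(block[, 1])
c(gibbs = ac_gibbs, block = ac_block)
```

Under these assumptions the Gibbs draws of β1 should be highly autocorrelated (the theoretical lag-1 value for this sampler is ρ²), whereas the joint draws are essentially uncorrelated.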

[Figure 3.7 graphic: top left, Body Mass (kg) against Age (years); bottom row,
trace plots of β1 and β2 against Iteration (0 to 10000).]
FIGURE 3.7
Analysis of the T. rex growth chart. The data are plotted in the top left
panel and samples from the joint posterior of the three model parameters are
plotted in the top right. The second row gives trace plots of β1 and β2 .

[Figure 3.8 graphic: β2 (vertical axis) against β1 (horizontal axis), both
from −3 to 3, with the points labeled β(0), β(1), β(2), β(3), ...]

FIGURE 3.8
Gibbs sampling for correlated parameters. The background color represents
the posterior PDF (black is high, white is low) of a hypothetical bivariate
posterior for β = (β1, β2), and dots represent the starting value
$\boldsymbol{\beta}^{(0)}$ and five hypothetical Gibbs sampling updates
$\boldsymbol{\beta}^{(1)}, ..., \boldsymbol{\beta}^{(5)}$.

One way to improve convergence is to update dependent parameters in
blocks. Returning to the T. rex example, since β1 and β2 are correlated with
each other but not with σ² (Figure 3.7), we might set blocks θ1 = β and
θ2 = σ² and apply Algorithm 1 by alternating between sampling from θ1 | θ2, Y
and θ2 | θ1, Y. Since β1 and β2 are drawn jointly from their full conditional
distribution, β will not get stuck in a corner as it might in one-at-a-time
sampling, and thus convergence is improved.
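To derive the full conditional distributions of the blocks and the MCMC code
we use matrix notation. Let the data vector be
$\mathbf{Y} = (Y_1, ..., Y_n)^T$ and $\mathbf{X}$ be the n × 2 covariate matrix
with the ith row equal to $(1, x_i)$. The linear regression model can be written
$\mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2 \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$.
The full conditional distributions (derived in Appendix A.3 and discussed in
Section 2.1.6) are

$$\boldsymbol{\beta} \mid \sigma^2, \mathbf{Y} \sim \text{Normal}\left(\mathbf{P}^{-1}\mathbf{W}, \mathbf{P}^{-1}\right)$$

$$\sigma^2 \mid \boldsymbol{\beta}, \mathbf{Y} \sim \text{InvGamma}\left(n/2 + a,\; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})/2 + b\right)$$

where $\mathbf{P} = \mathbf{X}^T\mathbf{X}/\sigma^2 + \mathbf{I}/\tau$ and
$\mathbf{W} = \mathbf{X}^T\mathbf{Y}/\sigma^2$. Code to implement this model
is given in Listing 3.4. Figure 3.7 (bottom row) shows excellent convergence.
If there were more than one covariate, then X would have additional columns
and β would have additional elements, but the full conditionals and Gibbs
sampling steps would be unchanged.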

Listing 3.4
Gibbs sampling for linear regression applied to the T-rex data.
library(mvtnorm)   # provides rmvnorm

# Load T-Rex data

mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age <- c(2, 15, 14, 16, 18, 22, 28)
n <- length(age)
X <- cbind(1,age)
Y <- mass

# Create an empty matrix for the MCMC samples

S <- 10000
samples <- matrix(NA,S,3)
colnames(samples) <- c("Beta1","Beta2","Sigma")

# Initial values

beta <- lm(mass~age)$coef
sig2 <- var(lm(mass~age)$residuals)

# priors: beta ~ N(0,tau I_2), sigma^2 ~ InvG(a,b)

tau <- 10000^2
a <- 0.1
b <- 0.1

# Blocked Gibbs sampling

V <- diag(2)/tau
tXX <- t(X)%*%X
tXY <- t(X)%*%Y

for(s in 1:S){
  # Update beta jointly from its bivariate normal full conditional
  P <- tXX/sig2 + V
  W <- tXY/sig2
  beta <- rmvnorm(1,solve(P)%*%W,solve(P))
  beta <- as.vector(beta)

  # Update sigma^2 from its inverse gamma full conditional
  A <- n/2 + a
  B <- sum((Y-X%*%beta)^2)/2 + b
  sig2 <- 1/rgamma(1,A,B)

  samples[s,] <- c(beta,sqrt(sig2))
}

pairs(samples)

3.2.2 Metropolis–Hastings (MH) sampling


Each step in Gibbs sampling requires taking a sample from the full condi-
tional distribution of one parameter (or block of parameters) conditional on
all other parameters. In the examples in the previous section, all priors were
conditionally conjugate, and so we were able to show that the full condition-
als were members of familiar families of distributions and so sampling was
straightforward. However, it is not always possible to specify a conditionally
conjugate prior.
For example, returning to the NFL concussions data (Figure 3.9), say we
select the model

$$Y_i \mid \boldsymbol{\beta} \sim \text{Poisson}[N \exp(\beta_1 + \beta_2 i)] \qquad (3.13)$$

for year i = 1, ..., 4 so that the log concussion rate changes linearly in time.
The likelihood is then

$$f(\mathbf{Y} \mid \boldsymbol{\beta}) \propto \prod_{i=1}^{4} \exp[-N \exp(\beta_1 + \beta_2 i)] \exp[Y_i(\beta_1 + \beta_2 i)]. \qquad (3.14)$$

The parameters appear in an exponential within another exponential, and
there is no well-known family of distributions for β with this feature.
Therefore, the posterior does not belong to a known family of distributions,
and it is not clear how to sample from it directly.
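To make the proportionality in (3.14) concrete, the sketch below evaluates the log of the kernel in (3.14) and the full Poisson log-likelihood at two arbitrary values of β. The function names `log_kernel` and `log_lik`, the index vector `yr`, and the two test points are illustrative assumptions, not part of the original analysis. If (3.14) is right, the two functions should differ only by a constant that does not involve β.

```r
Y <- c(171, 152, 123, 199)
N <- 256
yr <- 1:4  # year index i

# Log of the kernel in (3.14): only the terms that involve beta
log_kernel <- function(beta) {
  eta <- beta[1] + beta[2] * yr
  sum(-N * exp(eta) + Y * eta)
}

# Full Poisson log-likelihood, including the terms dropped in (3.14)
log_lik <- function(beta) {
  sum(dpois(Y, N * exp(beta[1] + beta[2] * yr), log = TRUE))
}

# The difference should be the same at any two parameter values
b1 <- c(-0.5, 0.00)
b2 <- c(-0.3, 0.05)
const1 <- log_lik(b1) - log_kernel(b1)
const2 <- log_lik(b2) - log_kernel(b2)
c(const1, const2)
```

The common difference is the sum of Y_i log(N) − log(Y_i!), which is why these terms can be ignored whenever only ratios of the likelihood at different β are needed.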
MH sampling [41] replaces draws from the exact full conditional distribution
with a draw from a candidate distribution followed by an accept/reject
step. That is, the Gibbs update of β1 is a sample
$\beta_1^{(s)} \mid \beta_2^{(s-1)}, \mathbf{Y}$ from the full conditional,
whereas an MH update makes a candidate draw
$\beta_1^* \sim q(\beta_1 \mid \beta_1^{(s-1)})$ that is conditioned on the
current value of β1 (and potentially β2 and/or Y). Of course, the candidate
cannot be blindly accepted because the candidate distribution may not be
related to the posterior distribution. To correct for this, the candidate is
accepted with probability min{1, R} where R is the ratio

$$R = \frac{p(\beta_1^* \mid \beta_2^{(s-1)}, \mathbf{Y})}{p(\beta_1^{(s-1)} \mid \beta_2^{(s-1)}, \mathbf{Y})} \cdot \frac{q(\beta_1^{(s-1)} \mid \beta_1^*)}{q(\beta_1^* \mid \beta_1^{(s-1)})}. \qquad (3.15)$$

Equivalently, the accept/reject step generates U ∼ Uniform(0, 1) and accepts
the candidate if U < R. Algorithm 2 formally describes the MH sampler,
which can be justified following the steps outlined in Appendix A.3 for Gibbs
sampling.
The acceptance ratio R depends on the ratio of the posteriors of the can-
didate and current values. For updating parameter θj , terms in the likelihood
or prior that do not include θj cancel and can thus be ignored. Crucially, this
includes the often intractable normalizing constant m(Y). Other terms can
cancel as well, for example, the entire likelihood for the example in Listing 3.3
did not depend on one parameter (γ) and would cancel in the posterior ratio
for this parameter.

[Figure 3.9 graphic: top left, Fitted values by Year (2012 to 2015); top right,
β2 against Iteration (0 to 200); bottom row, trace plots of β1 and β2 against
Iteration (0 to 25000).]
FIGURE 3.9
Poisson regression for the NFL concussions data. The first panel plots
the posterior distribution of the mean values N exp(β1 + β2 i) for each year
(boxplots) and the observed number of concussions, Yi (points); the remaining
three panels are trace plots of the regression coefficients β1 and β2 .

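Algorithm 2 Metropolis–Hastings Sampling

 1: Initialize $\boldsymbol{\theta}^{(0)} = (\theta_1^{(0)}, ..., \theta_p^{(0)})$
 2: for s = 1, ..., S do
 3:   for j = 1, ..., p do
 4:     sample $\theta_j^* \sim q_j(\theta_j \mid \theta_j^{(s-1)})$
 5:     set $\boldsymbol{\theta}^* = (\theta_1^{(s)}, ..., \theta_{j-1}^{(s)}, \theta_j^*, \theta_{j+1}^{(s-1)}, ..., \theta_p^{(s-1)})$
 6:     set $R = \dfrac{f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\,\pi(\boldsymbol{\theta}^*)}{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(s-1)})\,\pi(\boldsymbol{\theta}^{(s-1)})} \cdot \dfrac{q_j(\theta_j^{(s-1)} \mid \theta_j^*)}{q_j(\theta_j^* \mid \theta_j^{(s-1)})}$
 7:     sample U ∼ Uniform(0, 1)
 8:     if U < R then
 9:       set $\theta_j^{(s)} = \theta_j^*$
10:     else
11:       set $\theta_j^{(s)} = \theta_j^{(s-1)}$
12:     end if
13:   end for
14: end for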

The MH sampler can be applied more generally than Gibbs sampling, but
at the cost of having to select and tune a candidate distribution for each
parameter (or block of parameters). A common choice is a random-walk Gaussian
candidate distribution

$$\theta_j^* \mid \theta_j^{(s-1)} \sim \text{Normal}(\theta_j^{(s-1)}, c_j^2) \qquad (3.16)$$

where cj is the candidate standard deviation. This is called a random-walk
proposal distribution because it simply adds Gaussian jitter to the current
state of the chain. A Gaussian candidate distribution can be used for any
continuous parameter, even for those without a Gaussian prior. This is not
always ideal. For example, a Gaussian candidate for a variance parameter
may propose negative values, which are automatically rejected because
the prior density, and thus the acceptance probability, is zero. Alternatively,
a parameter with bounded support (e.g., σ > 0) can be transformed to have
support over the whole real line (θ = log(σ)) and a Gaussian can be used in
this transformed space. A Gaussian candidate is not appropriate for a discrete
parameter, however, because a Gaussian draw would almost surely not be in the
support of the discrete parameter’s prior.
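As a minimal sketch of the log-transform idea, the code below runs a random-walk Metropolis sampler on θ = log(σ). The target, σ ∼ Gamma(3, 2), is a stand-in chosen only because its mean (1.5) is known, and the candidate standard deviation of 1 is likewise an assumption. One detail not spelled out in the text: the density of θ picks up the Jacobian term exp(θ) from the change of variables, which is the `+ theta` term in `log_target_theta`.

```r
set.seed(2)

# Stand-in target for illustration: sigma ~ Gamma(shape = 3, rate = 2)
log_target_sigma <- function(sigma) dgamma(sigma, 3, 2, log = TRUE)

# Log density of theta = log(sigma): target density plus log-Jacobian
log_target_theta <- function(theta) log_target_sigma(exp(theta)) + theta

S <- 20000
theta <- 0            # initial value, i.e. sigma = 1
draws <- numeric(S)
for (s in 1:S) {
  can <- rnorm(1, theta, 1)  # Gaussian jitter on the log scale
  logR <- log_target_theta(can) - log_target_theta(theta)
  if (log(runif(1)) < logR) theta <- can
  draws[s] <- exp(theta)     # back-transform; always positive
}
mean(draws)
```

Every candidate maps back to a positive σ, so no proposals are wasted on values outside the support.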
The standard deviation of the candidate distribution, cj, is a tuning
parameter. A rule of thumb is to tune the algorithm so that it accepts 30-50% of
the candidates. It is hard to know which value of cj will lead to this acceptance
rate before starting the sampling, and so it is common to adapt cj based on
batches of samples during the burn-in. For example, if fewer than 30% of the
last 100 candidates were accepted then you might decrease cj by 20%, and if more
than 50% of the last 100 candidates were accepted then you might increase cj
by 20%. However, once the burn-in has completed, you must fix the candidate
distribution unless you consider more advanced methods (Appendix A.4).
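The batch-adaptation rule above can be sketched as follows. This is an illustration under stated assumptions rather than a recommended implementation: the target is a standard normal stand-in, the starting value `can_sd <- 50` is deliberately far too large, and the burn-in length and batch size are arbitrary choices.

```r
set.seed(3)
log_post <- function(theta) dnorm(theta, 0, 1, log = TRUE)  # stand-in target

burn <- 5000    # burn-in iterations, during which can_sd is adapted
S <- 10000      # retained iterations, with can_sd held fixed
batch <- 100    # adapt after every 100 candidates
can_sd <- 50    # deliberately poor starting value
theta <- 0
acc <- 0        # acceptances in the current burn-in batch
post_acc <- 0   # acceptances after burn-in
keep <- numeric(S)

for (s in 1:(burn + S)) {
  can <- rnorm(1, theta, can_sd)
  accept <- log(runif(1)) < log_post(can) - log_post(theta)
  if (accept) theta <- can
  if (s <= burn) {
    acc <- acc + accept
    if (s %% batch == 0) {
      rate <- acc / batch
      if (rate < 0.3) can_sd <- can_sd * 0.8  # too few accepted: smaller steps
      if (rate > 0.5) can_sd <- can_sd * 1.2  # too many accepted: larger steps
      acc <- 0
    }
  } else {
    post_acc <- post_acc + accept
    keep[s - burn] <- theta
  }
}
post_rate <- post_acc / S
c(can_sd = can_sd, post_rate = post_rate)
```

Only the draws in `keep`, generated with the candidate distribution held fixed, would be used for posterior summaries.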
Benefits of the random-walk candidate distribution are that it does not
require knowledge about the form of the posterior, and that the acceptance
ratio simplifies. If the candidate PDF q is Gaussian,

$$q_j(\theta_j^{(s-1)} \mid \theta_j^*) = q_j(\theta_j^* \mid \theta_j^{(s-1)}) = \frac{1}{\sqrt{2\pi}\,c_j} \exp\left(-\frac{(\theta_j^{(s-1)} - \theta_j^*)^2}{2c_j^2}\right), \qquad (3.17)$$

and so the ratio of candidate distributions in the MH acceptance probability
cancels. This is called a symmetric candidate distribution, and the MH
algorithm reduces to the Metropolis algorithm [55] in Algorithm 3 if the
candidate distribution is symmetric.

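Algorithm 3 Metropolis Sampling

 1: Initialize $\boldsymbol{\theta}^{(0)} = (\theta_1^{(0)}, ..., \theta_p^{(0)})$
 2: for s = 1, ..., S do
 3:   for j = 1, ..., p do
 4:     sample $\theta_j^* \sim q_j(\theta_j \mid \theta_j^{(s-1)})$ for symmetric $q_j$
 5:     set $\boldsymbol{\theta}^* = (\theta_1^{(s)}, ..., \theta_{j-1}^{(s)}, \theta_j^*, \theta_{j+1}^{(s-1)}, ..., \theta_p^{(s-1)})$
 6:     set $R = \dfrac{f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\,\pi(\boldsymbol{\theta}^*)}{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(s-1)})\,\pi(\boldsymbol{\theta}^{(s-1)})}$
 7:     sample U ∼ Uniform(0, 1)
 8:     if U < R then
 9:       set $\theta_j^{(s)} = \theta_j^*$
10:     else
11:       set $\theta_j^{(s)} = \theta_j^{(s-1)}$
12:     end if
13:   end for
14: end for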

Listing 3.5 gives R code for the Poisson regression model of NFL concus-
sions. The Gaussian candidate distributions were set to have standard de-
viation 0.1. Since the parameters appear in the exponential function in the
mean count, adding normal jitter with standard deviation 0.1 is similar to a
10% change in the mean counts, which leads to acceptance probability of 0.42
for β1 and 0.18 for β2 . Further tuning might reduce the candidate standard
deviation for β2 to increase the acceptance probability, but the trace plots in
Figure 3.9 show good convergence. The top right panel of Figure 3.9 zooms in
on the first few samples of the Metropolis algorithm and shows how β2 stays
constant for several iterations before jumping to a new value when a candidate
is accepted.
For the accept/reject step in Lines 40–41 of Listing 3.5 it is important
to perform computation on the log scale. For larger problems, the ratio of

Listing 3.5
Metropolis sampling for the NFL concussions data.
 1 # Load data
 2
 3 Y <- c(171, 152, 123, 199)
 4 t <- 1:4
 5 n <- 4
 6 N <- 256
 7
 8 # Create an empty matrix for the MCMC samples
 9
10 S <- 25000
11 samples <- matrix(NA,S,2)
12 colnames(samples) <- c("beta1","beta2")
13 fitted <- matrix(NA,S,4)
14
15 # Initial values
16
17 beta <- c(log(mean(Y/N)),0)
18
19 # priors: beta[j] ~ N(0,tau^2)
20
21 tau <- 10
22
23 # Prep for Metropolis sampling
24
25 log_post <- function(Y,N,t,beta,tau){
26   mn <- N*exp(beta[1]+beta[2]*t)
27   like <- sum(dpois(Y,mn,log=TRUE))
28   prior <- sum(dnorm(beta,0,tau,log=TRUE))
29   post <- like + prior
30   return(post)}
31
32 can_sd <- rep(0.1,2)
33
34 # Metropolis sampling
35
36 for(s in 1:S){
37   for(j in 1:2){
38     can <- beta
39     can[j] <- rnorm(1,beta[j],can_sd[j])
40     logR <- log_post(Y,N,t,can,tau)-log_post(Y,N,t,beta,tau)
41     if(log(runif(1))<logR){
42       beta <- can
43     }
44   }
45   samples[s,] <- beta
46   fitted[s,] <- N*exp(beta[1]+beta[2]*t)
47 }
48
49 boxplot(fitted,outline=FALSE,ylim=range(Y),
50         xlab="Year",ylab="Fitted values",names=2012:2015)
51 points(Y,pch=19)

posteriors is numerically unstable because a small value in the denominator
can cause the ratio to be numerically infinite. Therefore, the ratio is replaced
by a difference on the log scale. The original step is to accept the candidate if
$U < f(\boldsymbol{\theta}^* \mid \mathbf{Y})/f(\boldsymbol{\theta}^{(s-1)} \mid \mathbf{Y})$.
Taking the log of each side gives the equivalent inequality
$\log(U) < \log[f(\boldsymbol{\theta}^* \mid \mathbf{Y})] - \log[f(\boldsymbol{\theta}^{(s-1)} \mid \mathbf{Y})]$.
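A small numerical illustration of why the log scale matters is given below. The two log-posterior values are hypothetical, chosen to be the kind of large negative numbers that arise when a likelihood multiplies many small terms.

```r
# Hypothetical log-posterior values for the candidate and the current state
log_post_can <- -1000
log_post_cur <- -1001

# On the original scale both posteriors underflow to zero in double
# precision, so the ratio is 0/0
ratio_raw <- exp(log_post_can) / exp(log_post_cur)

# On the log scale the comparison is well defined
log_ratio <- log_post_can - log_post_cur

is.nan(ratio_raw)
log_ratio
```

Here the ratio on the original scale cannot be evaluated at all, while the log-scale difference shows that the candidate has the higher posterior density and would be accepted.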
Random-walk Gaussian distributions can also be used for block updates.
As with Gibbs, updating dependent parameters simultaneously can
improve convergence. Say the p parameters are partitioned into q blocks,
$\boldsymbol{\theta} = (\boldsymbol{\theta}_1, ..., \boldsymbol{\theta}_q)$,
and the candidate distribution for block j is
$\boldsymbol{\theta}_j^* \sim \text{Normal}(\boldsymbol{\theta}_j^{(s-1)}, c_j \mathbf{V}_j)$.
As before, the scalar cj must be tuned to give a reasonable acceptance
probability. The matrix Vj must also be tuned, which is
difficult because it involves the variance of each parameter and the correlation
between each pair of parameters, and all of these elements contribute to the
single acceptance probability. A reasonable choice is to set Vj to the
posterior covariance of θj, but unfortunately this is unknown. A common remedy
is to use a short burn-in period and set Vj to the sample covariance of the
burn-in samples, and then tune the scalar cj to give reasonable acceptance.
The proposal distribution can also be adapted based on prior MCMC samples
(Appendix A.4).
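The pilot-run remedy can be sketched for the Poisson regression model of the NFL concussions data, treating (β1, β2) as a single block. This is an illustration under assumptions that are not from the text: the helper `block_metropolis`, the pilot proposal covariance `diag(2) * 0.05^2`, the pilot length, and the scaling `2.4^2 / 2` (a commonly quoted heuristic for a two-dimensional Gaussian random walk) are all illustrative choices that would normally be tuned.

```r
set.seed(4)
Y <- c(171, 152, 123, 199)
N <- 256
yr <- 1:4

# Log posterior with beta[j] ~ Normal(0, tau^2) priors
log_post <- function(beta, tau = 10) {
  sum(dpois(Y, N * exp(beta[1] + beta[2] * yr), log = TRUE)) +
    sum(dnorm(beta, 0, tau, log = TRUE))
}

# Block random-walk Metropolis: both coefficients are proposed jointly
# from Normal(beta, c_j * V)
block_metropolis <- function(S, beta, V, c_j) {
  L <- t(chol(c_j * V))  # lower-triangular factor of the proposal covariance
  out <- matrix(NA, S, 2)
  acc <- 0
  for (s in 1:S) {
    can <- beta + as.vector(L %*% rnorm(2))
    if (log(runif(1)) < log_post(can) - log_post(beta)) {
      beta <- can
      acc <- acc + 1
    }
    out[s, ] <- beta
  }
  list(samples = out, acc_rate = acc / S)
}

# Short pilot run with a crude proposal, then reuse its sample covariance
beta0 <- c(log(mean(Y / N)), 0)
pilot <- block_metropolis(2000, beta0, diag(2) * 0.05^2, 1)
V <- cov(pilot$samples)
fit <- block_metropolis(10000, pilot$samples[2000, ], V, 2.4^2 / 2)

fit$acc_rate
cor(fit$samples)[1, 2]
```

Because the proposal inherits the strong negative correlation between β1 and β2 from the pilot samples, candidates tend to move along the ridge of the posterior rather than across it.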
A drawback of the random-walk candidate distribution is that it may
be suboptimal if it does not closely approximate the posterior. Using
additional information can improve convergence. For example, a Bayesian CLT
(Section 3.1.3) approximation
$\boldsymbol{\theta} \mid \mathbf{Y} \approx \text{Normal}(\hat{\boldsymbol{\theta}}_{MAP}, \hat{\boldsymbol{\Sigma}}_{MAP})$
could be used to tune the sampler to match the candidate to the posterior.
Taking this to the extreme, if we can tune the candidate to be exactly the full
conditional distribution, then the candidate distribution is proportional to
the posterior,
$q(\theta_j^* \mid \theta_j^{(s-1)}) \propto f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\pi(\boldsymbol{\theta}^*)$,
and the ratio in Line 6 of Algorithm 2 reduces
to 1. This algorithm then cycles through the parameters, samples each from
their full conditional distribution, and keeps all draws. This is the Gibbs
sampling algorithm in Algorithm 1! This shows that Gibbs sampling is a special
case of MH with careful selection of the candidate distributions. It also opens
the door for mixing Gibbs and MH updates in the same algorithm because the
algorithm can be framed as an MH sampler with some well-tuned candidates
that lead to Gibbs steps.
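The cancellation that makes Gibbs a special case of MH can be written out in one line. The normalizing term $m_j$ below is notation introduced here for illustration; it depends only on the parameters other than $\theta_j$, which are the same in the candidate and current vectors when parameter $j$ is being updated.

```latex
% Full conditional used as the candidate distribution:
%   q_j(\theta_j^* | \theta_j^{(s-1)}) = f(Y | \theta^*) \pi(\theta^*) / m_j ,
% where m_j = \int f(Y | \theta) \pi(\theta) d\theta_j does not depend on \theta_j.
\begin{align*}
R &= \frac{f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\,\pi(\boldsymbol{\theta}^*)}
          {f(\mathbf{Y} \mid \boldsymbol{\theta}^{(s-1)})\,\pi(\boldsymbol{\theta}^{(s-1)})}
     \cdot
     \frac{q_j(\theta_j^{(s-1)} \mid \theta_j^*)}
          {q_j(\theta_j^* \mid \theta_j^{(s-1)})} \\
  &= \frac{f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\,\pi(\boldsymbol{\theta}^*)}
          {f(\mathbf{Y} \mid \boldsymbol{\theta}^{(s-1)})\,\pi(\boldsymbol{\theta}^{(s-1)})}
     \cdot
     \frac{f(\mathbf{Y} \mid \boldsymbol{\theta}^{(s-1)})\,\pi(\boldsymbol{\theta}^{(s-1)}) / m_j}
          {f(\mathbf{Y} \mid \boldsymbol{\theta}^*)\,\pi(\boldsymbol{\theta}^*) / m_j}
   = 1 .
\end{align*}
```

Since R = 1 and U < 1 with probability one, every candidate is accepted, which is exactly a Gibbs update.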
Metropolis-within-Gibbs example: To illustrate this Metropolis-within-Gibbs
sampling algorithm, we use logistic regression, which is an extension
of linear regression for binary data. For binary data, E(Yi) = Prob(Yi =
1), and thus we cannot directly model the mean as a linear combination of
the covariates Xi = (1, Xi2, ..., Xip) because this linear combination might not
be between zero and one. The linear regression model is thus modified to
$Y_i \mid \boldsymbol{\beta} \sim \text{Bernoulli}(p_i)$, where
$p_i = 1/[1 + \exp(-\eta_i)]$ for $\eta_i = \beta_1 + \sum_{j=2}^{p} \beta_j X_{ij}$,
which ensures that the mean response is between zero and one. We select an
uninformative prior for the intercept $\beta_1 \sim \text{Normal}(0, 10^2)$, and for the slopes

[Figure 3.10 graphic: βj (vertical axis, −2 to 2) against Index, j (horizontal
axis).]
FIGURE 3.10
Logistic regression analysis of simulated data. The boxplots are the
posterior distribution of the βj and the points are the true values used to
simulate the data.

with j = 2, ..., p the prior is

$$\beta_j \mid \sigma^2 \sim \text{Normal}(0, \sigma^2) \quad \text{and} \quad \sigma^2 \sim \text{InvGamma}(a, b). \qquad (3.18)$$

The regression coefficients βj appear in the likelihood inside a non-linear
function, and so their normal prior is not conditionally conjugate. However, the
prior variance σ² has a conditionally conjugate inverse gamma prior and thus
an inverse gamma full conditional distribution. Therefore a Metropolis-within-
Gibbs algorithm cycles through the βj and updates them using Metropolis
steps with Gaussian candidate distributions, and applies a Gibbs step to sample
σ² from its inverse gamma full conditional distribution.
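For reference, the inverse gamma full conditional of σ² can be derived in a few lines. The sketch assumes the InvGamma(a, b) density is proportional to $x^{-a-1}\exp(-b/x)$, and uses the fact that the likelihood does not involve σ², so only the p − 1 slope priors and the prior on σ² contribute.

```latex
\begin{align*}
p(\sigma^2 \mid \boldsymbol{\beta}, \mathbf{Y})
  &\propto \left[\prod_{j=2}^{p} (\sigma^2)^{-1/2}
      \exp\left(-\frac{\beta_j^2}{2\sigma^2}\right)\right]
      (\sigma^2)^{-a-1} \exp\left(-\frac{b}{\sigma^2}\right) \\
  &= (\sigma^2)^{-\left[(p-1)/2 + a\right] - 1}
      \exp\left(-\frac{\sum_{j=2}^{p}\beta_j^2/2 + b}{\sigma^2}\right),
\end{align*}
% which is the kernel of
\[
\sigma^2 \mid \boldsymbol{\beta}, \mathbf{Y} \sim
  \text{InvGamma}\left(\frac{p-1}{2} + a,\;
  \frac{1}{2}\sum_{j=2}^{p}\beta_j^2 + b\right).
\]
```

Equivalently, 1/σ² follows a gamma distribution with the same two parameters, which is a convenient way to draw the sample in R.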
This model has p + 1 parameters and the algorithm has different types
of updates and p tuning parameters. Due to its complexity, it should be in-
terrogated before being used for a real data analysis. A common method for
validating code is a simulation study. In a simulation study, we fix the values
of the parameters and generate data using these values. The synthetic data
are then analyzed and the posterior distributions are compared with the true
values. Unlike a real data analysis, since the true values are known we can
verify that the algorithm is able to recover the true values.
Listing 3.6 provides code to simulate the data, carry out the Metropolis-
within-Gibbs algorithm, and summarize the results. The results are plotted
in Figure 3.10. The true values of βj used to generate the data fall within
the bulk of the posterior distribution for all but one or two coefficients, as
expected. Therefore, it seems the algorithm is working well.
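The visual check against the true values can be complemented by a numerical one. The helper `covers` below is hypothetical (it does not appear in the text); it reports, for each parameter, whether the equal-tailed posterior interval contains the true value. The stand-in draws are simulated only so that the sketch runs on its own.

```r
# Hypothetical helper: does each posterior interval contain the true value?
covers <- function(samples, truth, level = 0.95) {
  alpha <- (1 - level) / 2
  ci <- apply(samples, 2, quantile, c(alpha, 1 - alpha))
  ci[1, ] <= truth & truth <= ci[2, ]
}

# Stand-in "posterior draws" centered at known true values
set.seed(5)
truth <- c(-1, 0, 2)
draws <- sapply(truth, function(m) rnorm(5000, m, 0.3))
covers(draws, truth)
```

With the objects created by a simulation study such as Listing 3.6, the call would be `covers(samples[,1:p], beta_true)`. Since 95% intervals are expected to miss about one true value in twenty, a small number of misses among 20 coefficients is not by itself a sign of a coding error.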

Listing 3.6
Metropolis-within-Gibbs sampling for simulated logistic regression data.
# Simulate data
n <- 100
p <- 20
X <- cbind(1,matrix(rnorm(n*(p-1)),n,p-1))
beta_true <- rnorm(p,0,.5)
prob <- 1/(1+exp(-X%*%beta_true))
Y <- rbinom(n,1,prob)

# Function to compute the log posterior
log_post_beta <- function(Y,X,beta,sigma){
  prob <- 1/(1+exp(-X%*%beta))
  like <- sum(dbinom(Y,1,prob,log=TRUE))
  prior <- dnorm(beta[1],0,10,log=TRUE) +         # Intercept
           sum(dnorm(beta[-1],0,sigma,log=TRUE))  # Slopes
  return(like+prior)}

# Create empty matrix for the MCMC samples
S <- 10000
samples <- matrix(NA,S,p+1)

# Initial values and priors
beta <- rep(0,p)
sigma <- 1
a <- 0.1
b <- 0.1
can_sd <- 0.1

for(s in 1:S){

  # Metropolis for beta
  for(j in 1:p){
    can <- beta
    can[j] <- rnorm(1,beta[j],can_sd)
    logR <- log_post_beta(Y,X,can,sigma)-
            log_post_beta(Y,X,beta,sigma)
    if(log(runif(1))<logR){
      beta <- can
    }
  }

  # Gibbs for sigma
  sigma <- 1/sqrt(rgamma(1,(p-1)/2+a,sum(beta[-1]^2)/2+b))

  samples[s,] <- c(beta,sigma)

}

boxplot(samples[,1:p],outline=FALSE,
        xlab="Index, j",ylab=expression(beta[j]))
points(beta_true,pch=19,cex=1.25)

3.3 MCMC software options in R


Writing MCMC code step-by-step is great practice and the only way to truly
understand the algorithms, and tailoring the code to a specific model can im-
prove speed and stability for complex models. However, coding basic MCMC
samplers becomes repetitive. For Gibbs sampling, you must derive and encode
the full conditional distribution of each parameter, but there are only a few
dozen known conjugacy pairs (e.g., Appendix A.2), and so writing a Gibbs
sampler is simply a matter of looking up the correct full conditional distribu-
tion and entering this into R. If the parameter’s full conditional distribution
does not appear in the table of conjugacy pairs, you might use a Metropolis
update with Gaussian random walk candidate distribution. Each Metropolis
update consists of a Gaussian proposal followed by computing the acceptance
ratio, which requires only tuning the candidate standard deviation and insert-
ing the specific likelihood and prior chosen for the model into the acceptance
ratio. But the tuning step is not model specific, and most models are built
using only a few dozen distributions for the likelihood and prior (Appendix
A.1), and so writing MH code becomes simply a matter of coding the correct
distributions.
Fortunately, this process has been automated! There are several general-
purpose MCMC packages that can be called from R, including Just Another
Gibbs Sampler (JAGS), OpenBUGS ([79]), STAN ([15]), and NIMBLE ([22]), as well
as libraries in Python including pymc ([64]) and the SAS procedure PROC MCMC
([17]). These packages are not specific to a single model; rather, they take a
script specifying the likelihood and prior as input and use this information
to construct an MCMC sampler for the specified model. Appendix A.5 provides
code and a comparison of JAGS, OpenBUGS, STAN and NIMBLE on example
datasets. The formats of these packages are similar and so with a solid
understanding of MCMC sampling to interpret the results, switching from one to
another is fairly straightforward.
For the remainder of the book we will use JAGS to carry out MCMC sam-
pling. Compared to other packages, JAGS is relatively easy to code and suffi-
ciently fast for the size and complexity of models considered here. JAGS is a
very general package that can be used to fit all the models discussed in this
book and many more. Of course, for a specific application there may be more
focused and thus more efficient code. For example, the BLR package in R surely
performs better than JAGS for Bayesian linear regression. However, to avoid
having to learn dozens of packages for different models we simply use JAGS
throughout.
JAGS must first be downloaded from http://mcmc-jags.sourceforge.net/
and installed on your computer. The R package rjags must also be downloaded
and installed to communicate with JAGS from R. An MCMC analysis using
JAGS has five steps:

(1) Specify the model as a character string in R
(2) Upload the data and compile the model using the function jags.model
(3) Perform burn-in sampling using the function update
(4) Generate posterior samples using the function coda.samples
(5) Summarize the results using the functions summary and plot.
Listing 3.7 contains the R code for simple linear regression applied to the
T. rex growth chart data, and the output of the plot command in the last
line is given in Figure 3.11. A few notes on this code:
• All of this code is run in R. You must install JAGS on your computer, but you
never have to open JAGS because the rjags library passes data and results
back and forth between R and JAGS.
• The model specification format resembles R code, but not all R commands are
applicable in JAGS and some commands with the same syntax have different
properties. For example, dnorm in JAGS specifies the normal distribution via
its mean and precision (inverse variance) unlike the R command by the same
name that uses mean and standard deviation. For a list of JAGS commands
and their meaning see the user manual in [66].
• The symbol “∼” indicates that a variable follows the distribution given to
the right of the symbol, and deterministic operations are denoted by the left
arrow “<-”.
• In the model definition, the prior is placed on the precision tau, but each
sample is converted to the standard deviation sigma <- 1/sqrt(tau) and
samples of sigma are returned. That is, this line tells JAGS to compute
$\sigma^{(s)} = 1/\sqrt{\tau^{(s)}}$ at each iteration and return the samples $\sigma^{(s)}$.
• The object model contains the data, the code to update each parameter,
and the current state of each parameter in each chain.
• The initial values can be random samples, as is done here, or specific
values (e.g., beta1 = 0). If the inits argument in jags.model is omitted
JAGS will set initial values automatically.
• The update function modifies the state of each parameter in each chain, but
does not store intermediate samples.
• The coda.samples function does retain all MCMC samples, but only for
those parameters listed in variable.names. The function outputs the sam-
ples into the object samples which are in the format used by the coda
package that can be used to study convergence.
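Because the precision parameterization is easy to get wrong, a small conversion helper can be useful when translating a prior stated on the standard deviation or variance scale into JAGS syntax. The two function names below are illustrative and are not part of rjags or JAGS.

```r
# Illustrative helpers: convert between a standard deviation and the
# precision (inverse variance) that dnorm expects inside a JAGS model
sd_to_precision <- function(sd) 1 / sd^2
precision_to_sd <- function(prec) 1 / sqrt(prec)

# Standard deviation implied by a JAGS prior written as dnorm(0, 0.001)
sd_implied <- precision_to_sd(0.001)

# Precision needed for a normal prior with variance 10000^2
prec_needed <- sd_to_precision(10000)

c(sd_implied = sd_implied, prec_needed = prec_needed)
```

For example, a precision of 0.001 corresponds to a prior standard deviation of about 31.6, while a prior variance of 10000² corresponds to a precision of 1e-08.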

Listing 3.7
JAGS code for linear regression applied to the T-rex data.
library(rjags)
# Load T-Rex data
mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age <- c(2, 15, 14, 16, 18, 22, 28)
n <- length(age)
data <- list(mass=mass,age=age,n=n)

# (1) Define the model as a string
model_string <- textConnection("model{
  # Likelihood (dnorm uses a precision, not variance)
  for(i in 1:n){
    mass[i] ~ dnorm(beta1 + beta2*age[i],tau)
  }
  # Priors
  tau ~ dgamma(0.1, 0.1)
  sigma <- 1/sqrt(tau)
  beta1 ~ dnorm(0, 0.001)
  beta2 ~ dnorm(0, 0.001)
}")

# (2) Load the data, specify initial values and compile the MCMC code
inits <- list(beta1=rnorm(1),beta2=rnorm(1),tau=rgamma(1,1))
model <- jags.model(model_string,data = data, inits=inits, n.chains=2)

# (3) Burn-in for 10000 samples
update(model, 10000, progress.bar="none")

# (4) Generate 20000 post-burn-in samples
params <- c("beta1","beta2","sigma")
samples <- coda.samples(model,
                        variable.names=params,
                        n.iter=20000, progress.bar="none")

# (5) Summarize the output
summary(samples)

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean       SD  Naive SE  Time-series SE
beta1    2.512    31.61    0.1580          0.1580
beta2   52.763    39.21    0.1961          0.3727
sigma 2792.738  1177.88    5.8894          9.7678

2. Quantiles for each variable:

         2.5%      25%       50%      75%    97.5%
beta1  -59.53   -18.87     2.601    23.57    64.61
beta2  -21.36    25.71    51.531    78.34   134.17
sigma 1083.16  1997.85  2601.864  3361.14  5622.69

plot(samples)

• JAGS has built-in plot and summary functions that are used in Listing 3.7,
but you can make your own plots by extracting the S × p matrix of samples
for chain c, samples[[c]]. All the samples can be put in one 2S × p matrix

samps_matrix <- rbind(samples[[1]],samples[[2]])

and, for example, the posterior quantiles could be computed as

apply(samps_matrix,2,quantile,
c(0.025,0.250,0.500,0.750,0.975))

• The summary function combines the samples across the chains and gives the
sample mean, standard deviation, and quantiles of the posterior samples.
• The posterior could be summarized using the posterior mean and 95%
intervals of the marginal posterior for each parameter given by the summary
function (i.e., the “Mean” and “2.5%” and “97.5%” entries).
Listing 3.8 provides a second example of fitting the Poisson-gamma model
to the NFL concussions data. The steps and code are very similar to the
previous example, except that instead of using the built-in summary and plot
functions, Listing 3.8 extracts the samples from the rjags object samples
and combines them into the 2S × 4 matrix of samples samps. This allows for
more flexibility in the posterior summaries used to illustrate the results. For
example, Listing 3.8 computes the 90% posterior intervals.
In the remainder of the book we will use JAGS for all MCMC coding. We
will often show the model statement and summaries of the output, but we will
not report the steps of loading the data, generating the samples, etc., because
these blocks of code are virtually identical for all models.

3.4 Diagnosing and improving convergence


3.4.1 Selecting initial values
Theoretically, Gibbs and Metropolis–Hastings sampling should converge for
any initial values, but in practice the choice of starting values is important.
There are two schools of thought on initialization: select initial values close to
the posterior mode and run one long chain or run several parallel chains with
intentionally diffuse starting values to verify that the algorithm has converged.
Selecting good starting values and running one long chain has the advan-
tage of shortening or eliminating the burn-in period. A common method to
select initial values is the MLE or MAP estimates computed using numerical
optimization, which is often easier to compute than MCMC sampling. A dis-
advantage of this approach is that it is hard to rule out the possibility that

[Figure 3.11 graphic: trace plots (left column, iterations 10000 to 30000) and
density plots (right column, N = 20000) for beta1, beta2, and sigma.]

FIGURE 3.11
JAGS output for the T. rex analysis. This is the output of the plot
function in JAGS for the linear regression of the T. rex growth chart data.
The two lines with different shades of gray in the plots in the first columns are
the samples from the two chains. The density plots on the right summarize
the marginal distribution of each parameter combining samples across chains.

Listing 3.8
JAGS code for the NFL concussions data.
library(rjags)
# Load the NFL concussions data
Y <- c(171, 152, 123, 199)
n <- 4
N <- 256

# (1) Define the model as a string
model_string <- textConnection("model{
  # Likelihood
  for(i in 1:n){
    Y[i] ~ dpois(N*lambda[i])
  }
  # Priors
  for(i in 1:n){
    lambda[i] ~ dgamma(1,gamma)
  }
  gamma ~ dgamma(a, b)
}")

# (2) Load the data and compile the MCMC code
inits <- list(lambda=rgamma(n,1,1),gamma=1)
data <- list(Y=Y,N=N,n=n,a=0.1,b=0.1)
model <- jags.model(model_string,data = data, inits=inits,
                    n.chains=2)

# (3) Burn-in for 10000 samples
update(model, 10000, progress.bar="none")

# (4) Generate 20000 post-burn-in samples
params <- c("lambda")
samples <- coda.samples(model,
                        variable.names=params,
                        n.iter=20000, progress.bar="none")

# (5) Compute 90% credible intervals
samps <- rbind(samples[[1]],samples[[2]]) # 2S x 4 matrix of samples
apply(samps,2,quantile,c(0.05,0.95))

# The printed output below carries the row labels 2.5% and 97.5%, which
# correspond to the quantiles c(0.025,0.975) rather than c(0.05,0.95)
      lambda[1] lambda[2] lambda[3] lambda[4]
2.5%  0.5722272 0.5035036 0.4005522 0.6717104
97.5% 0.7704071 0.6925751 0.5685348 0.8878783

[Figure 3.12 graphic: trace plots of Sample against Iteration (0 to 5000) for
Chain 1, Chain 2, and Chain 3.]

FIGURE 3.12
Convergence of three parallel chains. The three trace plots represent
the samples for one parameter from three MCMC chains run in parallel. The
chains converge around iteration 1,000.

the chain is stuck in a local mode and is completely missing the bulk of the
posterior distribution.
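As a sketch of the first strategy, the code below finds the MAP estimate for the Poisson regression model of the NFL concussions data with R's general-purpose optimizer `optim`, so that it can be used as a starting value. The log posterior follows the model of Section 3.2.2; the choice of the BFGS method and of the rough starting point are assumptions made for illustration.

```r
Y <- c(171, 152, 123, 199)
N <- 256
yr <- 1:4

# Log posterior with beta[j] ~ Normal(0, tau^2) priors
log_post <- function(beta, tau = 10) {
  sum(dpois(Y, N * exp(beta[1] + beta[2] * yr), log = TRUE)) +
    sum(dnorm(beta, 0, tau, log = TRUE))
}

# optim() minimizes, so pass the negative log posterior
start <- c(log(mean(Y / N)), 0)
fit <- optim(start, function(b) -log_post(b), method = "BFGS")

beta_init <- fit$par  # MAP estimate, usable as the MCMC initial value
beta_init
```

The optimizer only locates a mode near its own starting point, so this approach inherits the caveat noted above: it cannot by itself rule out other modes.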
On the other hand, starting multiple (typically 2–5) chains with diffuse
starting values requires more burn-in samples, but if all chains give the same
results then this provides evidence that the algorithm has properly converged.
Figure 3.12 shows idealized convergence of three chains around iteration 500.
MCMC is not easily parallelizable because of its sequential nature, but running
several independent chains is one way to exploit parallel computing to improve
Bayesian computation.
Both single and parallel chains have their relative merits, but given the
progress in parallel computing environments it is preferable to run at least
two chains. Care should be taken up front to ensure that the starting val-
ues of each chain are sufficiently spread across the posterior distribution so
that convergence of the two chains necessarily represents strong evidence of
convergence.
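
The idea of starting several chains from dispersed values can be sketched in base R without JAGS. The following is a minimal illustration, assuming a toy standard normal "posterior", a random-walk Metropolis sampler, and hand-picked starting values; the names `log_post`, `run_chain`, and `starts` are hypothetical and not taken from the listings in this chapter.

```r
# Toy target (assumption): a standard normal "posterior"
log_post <- function(theta) dnorm(theta, 0, 1, log = TRUE)

# Random-walk Metropolis chain started at theta0
run_chain <- function(theta0, n_iter, step = 1) {
  theta <- numeric(n_iter)
  cur   <- theta0
  for (s in 1:n_iter) {
    cand <- rnorm(1, cur, step)                        # propose a candidate
    if (log(runif(1)) < log_post(cand) - log_post(cur)) cur <- cand
    theta[s] <- cur                                    # store the current state
  }
  theta
}

set.seed(1)
starts <- c(-20, 0, 20)                                # deliberately dispersed
chains <- sapply(starts, run_chain, n_iter = 5000)     # 5000 x 3 matrix

# Discard burn-in and compare the chains' sample means
keep        <- 1001:5000
chain_means <- colMeans(chains[keep, ])
print(round(chain_means, 2))

# Trace plot of the early iterations, one line per chain
matplot(chains[1:500, ], type = "l", lty = 1,
        xlab = "Iteration", ylab = "Sample")
```

If the three post-burn-in means are close to one another despite the very different starting values, this supports (but does not prove) convergence.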

3.4.2 Convergence diagnostics


Verifying that the MCMC chains have converged and run long enough to
sufficiently explore the posterior is often done informally by visual inspection
of trace plots as in Figure 3.12. However, many formal diagnostics have been
proposed to diagnose non-convergence. In this chapter, we focus on a few key
diagnostics that are produced by JAGS via the coda package in R.

[Figure 3.13: trace plots (iterations 2000 to 7000) and density estimates for mu[1] and mu[2] (left), and autocorrelation functions (lags 0 to 30) for mu[1] and mu[2] (right).]

FIGURE 3.13
Convergence diagnostics for a toy example with poor convergence.
The left panel gives the trace plot for each parameter and each chain (distin-
guished by grayscale) and the right panel gives the autocorrelation functions
for the first chain.

Throughout this section we use two toy examples for illustration:

Poor convergence model: $Y \mid \mu \sim \text{Poisson}[\exp(\mu_1 + \mu_2)]$

Good convergence model: $Y_1 \mid \mu \sim \text{Poisson}[\exp(\mu_1)]$ and $Y_2 \mid \mu \sim \text{Poisson}[\exp(\mu_2)]$.

In both models the priors are $\mu_j \overset{iid}{\sim} \text{Normal}(0, 1000)$. In the first model the
two parameters are unidentified and this leads to poor convergence, as shown
in Listing 3.9 and Figure 3.13. In the second model the two parameters are
related to separate observations and so both parameters are identified leading
to good convergence as shown in Listing 3.10 and Figure 3.14. The first model
is ridiculous and would never be fit to data, but provides a simple illustration
of poor convergence. Of course, not all convergence problems are related to
unidentified parameters, but this is a common source of problems.
The first diagnostic is to plot the trace plots and verify that the chains have
all reached the same distribution and have mixed properly. Figure 3.14 (top
left) provides a good example where the chains for both parameters look like
a bar code, while Figure 3.13 (top left) is a problematic case where the chains
are mixing slowly and likely need many more iterations to provide a good
approximation to the posterior. In these small examples it is possible to plot
all chains for all parameters, but for more complex models it may be necessary
to inspect the chains for only a representative subset of the parameters.
Listing 3.9
Toy example with poor convergence.

library(rjags)

# Define the model as a string
model_string <- textConnection("model{
   Y ~ dpois(exp(mu[1]+mu[2]))
   mu[1] ~ dnorm(0,0.001)
   mu[2] ~ dnorm(0,0.001)
 }")

# Generate MCMC samples
inits <- list(mu=rnorm(2,0,5))
data  <- list(Y=1)
model <- jags.model(model_string,data = data,
                    inits=inits, n.chains=3)

update(model, 1000, progress.bar="none")
samples <- coda.samples(model,
             variable.names=c("mu"),
             n.iter=5000, progress.bar="none")

# Apply convergence diagnostics

# Plots
plot(samples)
autocorr.plot(samples)

# Statistics
autocorr(samples[[1]],lag=1)
# , , mu[1]
#            mu[1]      mu[2]
# Lag 1  0.9948544 -0.9926385
# , , mu[2]
#            mu[1]      mu[2]
# Lag 1 -0.9960286  0.9947489

effectiveSize(samples)
#    mu[1]    mu[2]
# 22.90147 22.71505

gelman.diag(samples)
#       Point est. Upper C.I.
# mu[1]       1.62       2.88
# mu[2]       1.62       2.88
#
# Multivariate psrf
# 1.48

geweke.diag(samples[[1]])
#   mu[1]   mu[2]
# -0.6555  0.6424

Autocorrelation of the chains provides a numerical measure of how fast
the chains are mixing. Ideally, the iterations would be independent of each
other, but the Markovian nature of the sampler induces dependence between
successive iterations. The lag-l autocorrelation of the chain for parameter θj
is defined as
\[ \rho_j(l) = \text{Cor}\left(\theta_j^{(s)}, \theta_j^{(s-l)}\right), \tag{3.19} \]
and the function ρj (l) is called the autocorrelation function. The right panels
of Figures 3.13 and 3.14 plot the sample autocorrelation functions for the two
examples. Ideally we find that ρj (l) ≈ 0 for all l > 0, but some correlation
is expected. For example, the lag-1 autocorrelation for µ1 in Figure 3.14 is
around 0.4, but the chain converges nicely.
Rather than reporting the entire autocorrelation function for each param-
eter, the lag-1 autocorrelation is typically sufficient. Another common one-
number summary of the entire function is the effective sample size (ESS).
Recall that the sample mean of the MCMC samples is only an estimate of
the true posterior mean, and we can quantify the uncertainty in this estimate
with the standard error of the sample mean. If the samples are independent,
then the standard error of the sample mean is $sd_j/\sqrt{S}$, where $sd_j$ is the sample
standard deviation of the draws for θj and S is the number of samples.
This is labeled “naive SE” in the JAGS summary output as in Listing 3.7.
However, this underestimates the uncertainty in the posterior-mean estimate
if the samples have autocorrelation. It can be shown that the actual standard
error accounting for autocorrelation is $sd_j/\sqrt{ESS_j}$, where
\[ ESS_j = \frac{S}{1 + 2\sum_{l=1}^{\infty} \rho_j(l)} \le S. \tag{3.20} \]
The standard error $sd_j/\sqrt{ESS_j}$ is labeled “Time-series SE” in the JAGS summary
output. One way to determine if the number of samples is sufficient is
to increase S until this standard error is acceptably low for all parameters.
Another method is to increase S until the effective sample size is acceptably
high for all parameters, say ESSj > 1000.
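
Equation (3.20) can be approximated directly from the sample autocorrelation function. The sketch below is only an illustration: it truncates the infinite sum at the first lag whose sample autocorrelation falls below a small cutoff, which is an assumption made here for simplicity (coda's `effectiveSize` uses a spectral estimate instead, so its numbers will differ). The function name `ess` and the simulated AR(1) chain are hypothetical.

```r
# Effective sample size via (3.20), with the sum truncated at the first lag
# where the sample autocorrelation drops below `cutoff` (a simplifying assumption)
ess <- function(x, max_lag = 200, cutoff = 0.01) {
  S   <- length(x)
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
  stop_at <- which(rho < cutoff)[1]
  if (!is.na(stop_at)) rho <- rho[seq_len(stop_at - 1)]
  S / (1 + 2 * sum(rho))
}

set.seed(1)
S         <- 20000
iid_draws <- rnorm(S)                                       # independent draws
ar1_draws <- as.numeric(arima.sim(list(ar = 0.9), n = S))   # strongly autocorrelated

ess_iid <- ess(iid_draws)
ess_ar1 <- ess(ar1_draws)
print(c(iid = ess_iid, ar1 = ess_ar1))

# Naive versus autocorrelation-adjusted standard error of the sample mean
naive_se <- sd(ar1_draws) / sqrt(S)
ts_se    <- sd(ar1_draws) / sqrt(ess_ar1)
print(c(naive = naive_se, time_series = ts_se))
```

For an AR(1) chain with lag-1 autocorrelation 0.9, the sum in (3.20) is 0.9/(1 − 0.9) = 9, so the effective sample size is roughly S/19 and the adjusted standard error is several times the naive one.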
Geweke’s diagnostic ([36]) is used to detect non-convergence. It tests for
convergence using a two-sample t-test to compare the mean of the chain be-
tween batches at the beginning versus the end of the sampler. The default
in coda is to test that the mean is the same for the first 10% of the sam-
ples and the last 50% of the samples. Let θ̄jb and sejb be the sample mean
and standard error, respectively, for θj in batch b = 1, 2. The standard errors
are computed accounting for autocorrelation (as in (3.20)) in the chains. The
Geweke statistic is then
\[ Z = \frac{\bar{\theta}_{j1} - \bar{\theta}_{j2}}{\sqrt{se_{j1}^2 + se_{j2}^2}}. \tag{3.21} \]
Under the null hypothesis that the means are the same for batches (and as-
suming each batch has large ESS), Z follows a standard normal distribution
and so |Z| > 2 is cause for concern. In the two examples in Listings 3.9 and
3.10, |Z| is less than one and so this statistic does not detect non-convergence.
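
The statistic in (3.21) can be sketched in base R as follows. This is a rough illustration under stated assumptions: the batch standard errors are computed with a truncated-autocorrelation effective sample size (a hypothetical helper `ess`), whereas coda's `geweke.diag` uses a spectral density estimate, so the two will not agree exactly. The simulated "stationary" and "drifting" chains are made up for the example.

```r
# Helper (assumption): ESS from (3.20) with the sum truncated at the first
# lag whose sample autocorrelation drops below `cutoff`
ess <- function(x, max_lag = 200, cutoff = 0.01) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]
  stop_at <- which(rho < cutoff)[1]
  if (!is.na(stop_at)) rho <- rho[seq_len(stop_at - 1)]
  length(x) / (1 + 2 * sum(rho))
}

# Geweke-style Z: compare the first 10% of the chain with the last 50%
geweke_z <- function(x, frac1 = 0.1, frac2 = 0.5) {
  S   <- length(x)
  b1  <- x[1:floor(frac1 * S)]
  b2  <- x[(S - floor(frac2 * S) + 1):S]
  se1 <- sd(b1) / sqrt(ess(b1))          # autocorrelation-adjusted SEs
  se2 <- sd(b2) / sqrt(ess(b2))
  (mean(b1) - mean(b2)) / sqrt(se1^2 + se2^2)
}

set.seed(2)
stationary <- as.numeric(arima.sim(list(ar = 0.5), n = 5000))
drifting   <- stationary + seq(5, 0, length.out = 5000)   # mean decays over time

z_stat  <- geweke_z(stationary)
z_drift <- geweke_z(drifting)
print(c(stationary = z_stat, drifting = z_drift))
```

A chain whose mean is still drifting should give a large |Z|, while a stationary chain should usually give |Z| below 2.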

Listing 3.10
Toy example with good convergence.

library(rjags)

# Define the model as a string
model_string <- textConnection("model{
   Y1 ~ dpois(exp(mu[1]))
   Y2 ~ dpois(exp(mu[2]))
   mu[1] ~ dnorm(0,0.001)
   mu[2] ~ dnorm(0,0.001)
 }")

# Generate MCMC samples
inits <- list(mu=rnorm(2,0,5))
data  <- list(Y1=1,Y2=10)
model <- jags.model(model_string,data = data,
                    inits=inits, n.chains=3)

update(model, 1000, progress.bar="none")
samples <- coda.samples(model,
             variable.names=c("mu"),
             n.iter=5000, progress.bar="none")

# Apply convergence diagnostics

# Plots
plot(samples)
autocorr.plot(samples)

# Statistics
autocorr(samples[[1]],lag=1)
# , , mu[1]
#             mu[1]      mu[2]
# Lag 1    0.359733 0.02112005
# , , mu[2]
#             mu[1]      mu[2]
# Lag 1 0.002213494  0.2776712

effectiveSize(samples)
#    mu[1]    mu[2]
# 6494.326 8227.748

gelman.diag(samples)
#       Point est. Upper C.I.
# mu[1]          1          1
# mu[2]          1          1
#
# Multivariate psrf
# 1

geweke.diag(samples[[1]])
#   mu[1]   mu[2]
# -0.5217 -0.2353

[Figure 3.14: trace plots (iterations 2000 to 7000) and density estimates for mu[1] and mu[2] (left), and autocorrelation functions (lags 0 to 30) for mu[1] and mu[2] (right).]

FIGURE 3.14
Convergence diagnostics for a toy example with good convergence.
The left panel gives the trace plot for each parameter and each chain (distin-
guished by grayscale) and the right panel gives the autocorrelation functions
for the first chain.

Geweke’s statistic uses only one chain. A multi-chain extension is the
Gelman–Rubin statistic ([32]). The Gelman–Rubin statistic measures agreement
in the means between C chains. It is essentially an ANOVA test of
whether the chains have the same mean, but the statistic is scaled so that
1.0 indicates perfect convergence and values above 1.1 are questionable. The
statistic is
\[ R_j = \frac{\frac{S-1}{S} W + \left(\frac{1}{S} + \frac{1}{SC}\right) B}{W} \tag{3.22} \]
where B is S times the variance of the C MCMC sample means for θj and
W is the average of the C MCMC sample variances for θj . The coda package
plots the Gelman–Rubin statistic as a function of the iteration, and so when
the statistic reaches one this indicates good mixing among the parallel chains
and that the chains have possibly reached their stationary distribution. The
Gelman–Rubin statistics clearly distinguish between the examples in Listing
3.9 and 3.10; Rj = 1.62 in the poor-convergence case and Rj = 1.00 for the
good-convergence case.
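
Equation (3.22) is simple to compute by hand from an S × C matrix of draws. The sketch below follows (3.22) as written, with B and W defined as in the text; it is not a reimplementation of coda's `gelman.diag`, which applies further adjustments, so values will differ slightly. The function name `gelman_rubin` and the simulated chains are hypothetical.

```r
# Gelman-Rubin statistic following (3.22); `chains` is an S x C matrix
# with one column per chain for a single parameter
gelman_rubin <- function(chains) {
  S <- nrow(chains)
  C <- ncol(chains)
  B <- S * var(colMeans(chains))          # S times the variance of the chain means
  W <- mean(apply(chains, 2, var))        # average within-chain variance
  ((S - 1) / S * W + (1 / S + 1 / (S * C)) * B) / W
}

set.seed(3)
S    <- 5000
good <- matrix(rnorm(3 * S), S, 3)                 # three chains, same distribution
bad  <- sweep(good, 2, c(-2, 0, 2), "+")           # three chains stuck at different means

r_good <- gelman_rubin(good)
r_bad  <- gelman_rubin(bad)
print(c(good = r_good, bad = r_bad))
```

When the chains share a common distribution the statistic is close to one; when the chain means disagree, B is large relative to W and the statistic rises well above 1.1.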

3.4.3 Improving convergence


Armed with these diagnostic tools, if the user honestly inspects the sample
output, poor convergence can usually be detected. This leaves the more chal-
lenging problem of improving convergence! There is no single step that always
resolves convergence problems, but the list below offers some suggestions:

1. Increase the number of iterations: Theoretically, MCMC algorithms
should converge and cover the posterior distribution as S
goes to infinity. Of course, time limits the number of samples that
can be generated, but in some cases increasing S can be faster than
searching for other improvements. Nonetheless, this is not the most
satisfying solution to poor convergence, specifically when the chains
move slowly due to high autocorrelation.
2. Tune the MH candidate distribution: The acceptance prob-
ability can be tuned during burn-in to be around 30-50%, but as
the chain evolves the acceptance probability can change. If groups
of parameters have strong cross-correlation then updating them in
blocks can dramatically improve convergence. Block steps must be
carefully tuned however so the candidate correlation matrix approx-
imates the posterior correlation matrix.
3. Improve initial values: Often a sampler works well once it finds
the posterior distribution, but the posterior PDF may be too flat
away from the mode for the sampler to find its center. One way to
improve initial values is to use the maximum likelihood estimates,
$\theta^{(0)} = \hat{\theta}_{MLE}$. Another option is to use simulated annealing during
burn-in. In simulated annealing, the MH acceptance ratio R is raised to
the power $T_s \in (0, 1]$, i.e., $R^{T_s}$, where the temperature $T_s$ increases
to one throughout the burn-in, e.g., $T_s = s/B$ for $s < B$ and $T_s = 1$
for $s \ge B$, where B is the number of burn-in samples. The intuition
behind this modification is that by raising the acceptance ratio to
a power the algorithm is more likely to make big jumps during the
burn-in and will slowly settle down in the heart of the posterior as
the burn-in period ends.
4. Use a more advanced algorithm: Appendix A.4 provides several
advanced algorithms that can be used for a particularly vexing
problem. For example, Hamiltonian Monte Carlo (HMC) is a
Metropolis sampler that uses the gradient of the posterior to
intelligently propose candidates, and adaptive Metropolis
allows the proposal distribution to evolve across iterations.
5. Simplify the model: Poor convergence often occurs when models
are too complex to be fit with the data at hand. Overly complex
models often have unidentified parameters that cannot be estimated.
For example, in a model with $Y_i \overset{iid}{\sim} \text{Normal}(\theta_1 + \theta_2, \sigma^2)$,
the two mean parameters are not identified, meaning that there
are infinitely many combinations of the parameters that give the
same likelihood (e.g., θ = (−10, 10) or θ = (10, −10) give the same
mean). Of course, no one would fit such a blatantly unidentified
model, but in complex models with dozens of parameters, unidentifiability
can be hard to detect. Convergence can be improved by simplifying
the model by eliminating covariates, removing non-linear
terms, reducing the covariance structure to independence, etc.
6. Pick more informative priors: Selecting more informative priors
has a similar effect to simplifying the model. For example, even in
the silly model $Y_i \overset{iid}{\sim} \text{Normal}(\theta_1 + \theta_2, \sigma^2)$, the MCMC algorithm
would likely converge quickly with priors θ1 ∼ Normal(−3, 1) and
θ2 ∼ Normal(3, 1). Even a weakly informative prior can stabilize
the posterior and improve convergence. As an extreme example, an
empirical Bayesian prior (Chapter 2) fixes nuisance parameters at
their MAP estimates, which can dramatically improve convergence
at the expense of suppressing some uncertainty.
7. Run a simulation study: Determining whether a chain has con-
verged can be frustrating for a real data analysis where the true
values of the parameters are unknown. Simulating data from the
model and then fitting the MCMC algorithm (as in Listing 3.6)
for different sample sizes and parameter values can shed light on
the properties of the algorithm and build trust that the sampler is
producing reliable output.
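
The simulated annealing idea from item 3 of the list above can be sketched as a small modification of a random-walk Metropolis sampler. This is an illustration only: the bimodal toy target, the step size, and the function names (`log_post`, `temperature`, `annealed_metropolis`) are assumptions made for the example, and the schedule is the linear one, Ts = s/B during burn-in and Ts = 1 afterwards.

```r
# Toy bimodal target (assumption): 5% of the mass near -6 and 95% near 6,
# evaluated on the log scale with a log-sum-exp for numerical stability
log_post <- function(theta) {
  a <- log(0.05) + dnorm(theta, -6, 1, log = TRUE)
  b <- log(0.95) + dnorm(theta,  6, 1, log = TRUE)
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Temperature schedule: rises linearly to one over the B burn-in iterations
temperature <- function(s, B) min(s / B, 1)

annealed_metropolis <- function(theta0, n_iter, B, step = 1) {
  theta <- numeric(n_iter)
  cur   <- theta0
  for (s in 1:n_iter) {
    cand  <- rnorm(1, cur, step)
    log_R <- log_post(cand) - log_post(cur)       # log acceptance ratio
    # Accept with probability min(1, R^Ts); small Ts allows big early moves
    if (log(runif(1)) < temperature(s, B) * log_R) cur <- cand
    theta[s] <- cur
  }
  theta
}

set.seed(4)
draws     <- annealed_metropolis(theta0 = -6, n_iter = 6000, B = 2000)
post_burn <- draws[-(1:2000)]                     # discard the annealed burn-in
print(mean(post_burn > 0))                        # share of draws near the main mode
```

Only the post-burn-in draws, generated with Ts = 1, target the posterior; the annealed burn-in draws should be discarded.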

3.4.4 Dealing with large datasets


Big data poses computational challenges to all statistical methods, but its
effects are acutely felt by MCMC because MCMC requires thousands of passes
through the data. Fortunately, recent years have seen a surge in Bayesian
computing methods to handle massive datasets. Likely the next few years will
see further developments in this area, but below we outline some of the most
useful current approaches.
1. MAP estimation: MCMC is the gold standard for Bayesian com-
puting because it returns estimates of the entire joint distribution
of all of the parameters. However, if the scope of the analysis is
limited to prediction, then it is much faster to simply compute the
MAP estimate and ignore uncertainty in parameter estimation.
2. Gaussian approximation: The Bayesian CLT states that the pos-
terior is approximately normal if the sample size is large. Therefore,
running a long MCMC chain for, say, a logistic regression with
n = 1,000,000 observations, p = 50 covariates and flat priors is
unjustifiable; it is much faster and equally accurate to approximate
the posterior as Gaussian centered on the MAP estimate with co-
variance determined by the information matrix. This computation
can often be carried out using MLE software, and thus the posterior
will be very similar to the approximate sampling distribution of the
MLE, but the interpretation of uncertainty remains Bayesian.

3. Variational Bayesian computing: Variational Bayesian approximations
can be even faster to compute than invoking the Bayesian
CLT if you target only the marginal distribution of each parameter.
That is, if you assume the posterior (not prior) is independent
across parameters and use the best approximation to the true posterior
that has this form, the posterior can be approximated without
computing the joint distribution. This of course raises questions
about properly accounting for uncertainty, but does lead to impressive
computational savings.
4. Parallel computing: MCMC is inherently sequential, i.e., you gen-
erally cannot update the second parameter until after updating the
first. In special cases, steps of the MCMC routine can be performed
simultaneously in parallel on different cores of a CPU or GPU clus-
ter. For example, if parameters are (conditionally) independent then
they can be updated in parallel. Alternatively, if the data are inde-
pendent and the likelihood factors as the product of n terms, then
likelihood calculations within each MCMC step can be done in par-
allel, e.g., n1 terms computed on core 1, n2 terms computed on core
2, etc. While parallelization can theoretically have huge benefits,
the overhead associated with passing data across cores can dampen
the benefits of using multiple cores unless parallelization is done
carefully.
5. Divide and conquer: Divide and conquer methods provide an ap-
proximation to the posterior that is embarrassingly parallelizable.
Say we partition the data into T batches Y1 , ..., YT that are in-
dependent given the model parameters. Then the posterior can be
written
\[ p(\theta \mid \mathbf{Y}) \propto f(\mathbf{Y} \mid \theta)\,\pi(\theta) = \prod_{t=1}^{T} \left[ f(\mathbf{Y}_t \mid \theta)\,\pi(\theta)^{1/T} \right]. \tag{3.23} \]
This decomposition suggests fitting T separate Bayesian analyses
(in parallel) with analysis t using data $\mathbf{Y}_t$ and prior $\pi(\theta)^{1/T}$ and
then combining the results. Each of the T analyses uses the original
prior π(θ) raised to the 1/T power to spread the prior information
across batches, e.g., if $\theta \sim \text{Normal}(0, \tau^2)$ then the prior to the 1/T
power is $\text{Normal}(0, T\tau^2)$. Scott et al. [75] discuss several ways to
combine the T posterior distributions. The simplest method is to
assume the posteriors are approximately Gaussian and thus analysis
t returns $\theta \mid \mathbf{Y}_t \approx \text{Normal}(\mathbf{M}_t, \mathbf{V}_t)$. In this case, the posteriors
combine as
\[ p(\theta \mid \mathbf{Y}) \approx \text{Normal}\left( \mathbf{V} \sum_{t=1}^{T} \mathbf{V}_t^{-1} \mathbf{M}_t,\; \mathbf{V} \right) \tag{3.24} \]
where $\mathbf{V}^{-1} = \sum_{t=1}^{T} \mathbf{V}_t^{-1}$.
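
The combination rule (3.24) can be checked in a case where every posterior is exactly Gaussian. The sketch below assumes a normal mean model with known standard deviation, Yi ∼ Normal(θ, σ²) with prior θ ∼ Normal(0, τ²); the data, the number of batches, and the helper `post_normal` are hypothetical choices for illustration. Each batch uses the prior raised to the 1/T power, i.e., Normal(0, Tτ²).

```r
set.seed(5)
n         <- 1000
T_batches <- 4                     # number of batches (T in the text)
sigma     <- 2                     # known data standard deviation (assumption)
tau       <- 10                    # prior standard deviation (assumption)
Y         <- rnorm(n, 3, sigma)
batch     <- rep(1:T_batches, length.out = n)

# Conjugate posterior mean M and variance V for a Normal(0, prior_var) prior
post_normal <- function(y, prior_var) {
  V <- 1 / (length(y) / sigma^2 + 1 / prior_var)
  M <- V * sum(y) / sigma^2
  c(M = M, V = V)
}

# Full-data posterior with the original prior
full <- post_normal(Y, tau^2)

# Batch posteriors, each with the prior variance inflated by T
parts <- sapply(split(Y, batch), post_normal, prior_var = T_batches * tau^2)

# Combine as in (3.24)
V_comb <- 1 / sum(1 / parts["V", ])
M_comb <- V_comb * sum(parts["M", ] / parts["V", ])

print(rbind(full = full, combined = c(M_comb, V_comb)))
```

In this conjugate setting the combined mean and variance reproduce the full-data posterior exactly; for non-Gaussian posteriors (3.24) is only an approximation.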

3.5 Exercises
1. Give an advantage and a disadvantage of the following methods:
(a) Maximum a posteriori estimation
(b) Numerical integration
(c) Bayesian central limit theorem
(d) Gibbs sampling
(e) Metropolis–Hastings sampling
2. Assume that $Y_i \mid \mu \overset{indep}{\sim} \text{Normal}(\mu, \sigma_i^2)$ for i ∈ {1, ..., n}, with σi
known and improper prior distribution π(µ) = 1 for all µ.
(a) Give a formula for the MAP estimator for µ.
(b) We observe n = 3, Y1 = 12, Y2 = 10, Y3 = 22, σ1 = σ2 = 3 and
σ3 = 10, compute the MAP estimate of µ.
(c) Use numerical integration to compute the posterior mean of µ.
(d) Plot the posterior distribution of µ and indicate the MAP and
the posterior mean estimates on the plot.
3. Assume $Y_i \mid \lambda \overset{indep}{\sim} \text{Poisson}(N_i \lambda)$ for i = 1, ..., n.
(a) Identify a conjugate prior for λ and derive the posterior that
follows from this prior.
(b) Using the prior λ ∼ Uniform(0, 20), derive the MAP estimator
of λ.
(c) Using the prior λ ∼ Uniform(0, 20), plot the posterior on a grid
of λ assuming n = 2, N1 = 50, N2 = 100, Y1 = 12, and Y2 = 25
and show that the MAP estimate is indeed the maximizer.
(d) Use the Bayesian CLT to approximate the posterior of λ under
the setting of (c), plot the approximate posterior, and compare
the plot with the plot from (c).
4. Consider the model $Y_i \mid \sigma_i^2 \overset{indep}{\sim} \text{Normal}(0, \sigma_i^2)$ for i = 1, ..., n where
$\sigma_i^2 \mid b \sim \text{InvGamma}(a, b)$ and b ∼ Gamma(1, 1).
(a) Derive the full conditional posterior distributions for $\sigma_1^2$ and b.
(b) Write pseudocode for Gibbs sampling, i.e., describe in detail
each step of the Gibbs sampling algorithm.
(c) Write your own Gibbs sampling code (not in JAGS) and plot the
marginal posterior density for each parameter. Assume n = 10,
a = 10 and Yi = i for i = 1, ..., 10.
(d) Repeat the analysis with a = 1 and comment on the conver-
gence of the MCMC chain.

(e) Implement this model in (c) using JAGS and compare the results
with the results in (c).
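5. Consider the model $Y_i \mid \mu, \sigma^2 \sim \text{Normal}(\mu, \sigma^2)$ for i = 1, ..., n and
$Y_i \mid \mu, \delta, \sigma^2 \sim \text{Normal}(\mu + \delta, \sigma^2)$ for i = n + 1, ..., n + m, where
$\mu, \delta \sim \text{Normal}(0, 100^2)$ and $\sigma^2 \sim \text{InvGamma}(0.01, 0.01)$.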
(a) Give an example of a real experiment for which this would be
an appropriate model.
(b) Derive the full conditional posterior distributions for µ, δ, and
$\sigma^2$.
(c) Simulate a dataset from this model with n = m = 50, µ = 10,
δ = 1, and σ = 2. Write your own Gibbs sampling code (not in
JAGS) to fit the model above to the simulated data and plot the
marginal posterior density for each parameter. Are you able to
recover the true values reasonably well?
(d) Implement this model using JAGS and compare the results with
the results in (c).
6. Fit the following model to the NBA free throw data in the table in
Exercise 17 in Chapter 1:

Yi |θi ∼ Binomial(ni , θi ) and θi |m ∼ Beta[exp(m)qi , exp(m)(1−qi )],

where Yi is the number of made clutch shots for player i = 1, ..., 10,
ni is the number of attempted clutch shots, qi ∈ (0, 1) is the overall
proportion, and m ∼ Normal(0, 10).
(a) Explain why this is a reasonable prior for θi .
(b) Explain the role of m in the prior.
(c) Derive the full conditional posterior for θ1 .
(d) Write your own MCMC algorithm to compute a table of poste-
rior means and 95% credible intervals for all 11 model param-
eters (θ1 , ..., θ10 , m). Turn in commented code.
(e) Fit the same model in JAGS. Turn in commented code, and
comment on whether the two algorithms returned the same
results.
(f) What are the advantages and disadvantages of writing your own
code as opposed to using JAGS in this problem and in general?
7. Open and plot the galaxies data in R using the code below,

> library(MASS)
> data(galaxies)
> ?galaxies
> Y <- galaxies
> n <- length(Y)
> hist(Y,breaks=25)

Model the observations Y1 , ..., Y82 using the Student-t distribution
with location µ, scale σ and degrees of freedom k. Assume prior distributions
$\mu \sim \text{Normal}(0, 10000^2)$, $1/\sigma^2 = \tau \sim \text{Gamma}(0.01, 0.01)$
and k ∼ Uniform(1, 30).
(a) Give reasonable initial values for each of the three parameters.
(b) Fit the model using JAGS. Report trace plots of each parameter
and discuss convergence.
(c) Graphically compare the t distribution with parameters set to
their posterior mean with the observed data. Does the model
fit the data well?
8. Download the galaxy data

> library(MASS)
> data(galaxies)
> Y <- galaxies
Suppose $Y_i \mid \theta \overset{iid}{\sim} \text{Laplace}(\mu, \sigma)$ for i = 1, . . . , n where θ = (µ, σ).
(a) Assuming the prior σ ∼ Uniform(0, 100000) and the improper prior
π(µ) = 1 for all µ ∈ (−∞, ∞), plot the joint posterior distribution
of θ and the marginal posterior distributions of µ and
σ.
(b) Compute the posterior mean of θ from your analysis in (a) and
plot the Laplace PDF with these values against the observed
data. Does the model fit the data well?
(c) Plot the posterior predictive distribution (PPD) for a new ob-
servation Y ∗ |θ ∼ Laplace(µ, σ). How do the mean and variance
of the PPD compare to the mean and variance of the “plug-in”
distribution from (b)?
9. In Section 2.4 we compared Reggie Jackson’s home run rate in the
regular season and World Series. He hit 563 home runs in 2820
regular-season games and 10 home runs in 27 World Series games
(a player can hit 0, 1, 2, ... home runs in a game). Assuming Uni-
form(0,10) priors for both home run rates, use JAGS to summarize
the posterior distribution of (i) his home run rate in the regular
season, (ii) his home run rate in the World Series, and (iii) the ra-
tio of these rates. Provide trace plots for all three parameters and
discuss convergence of the MCMC sampler including appropriate
convergence diagnostics.
10. As discussed in Section 1.6, [24] report that the number of marine
bivalve species discovered each year from 2010-2015 was 64, 13, 33,
18, 30 and 20. Denote Yt as the number of species discovered in year
2009 + t (so that Y1 = 64 is the count for 2010). Use JAGS to fit the
model
\[ Y_t \mid \alpha, \beta \overset{indep}{\sim} \text{Poisson}(\lambda_t) \]
where $\lambda_t = \exp(\alpha + \beta t)$ and $\alpha, \beta \overset{indep}{\sim} \text{Normal}(0, 10^2)$. Summarize
the posterior of α and β and verify that the MCMC sampler
has converged. Does this analysis provide evidence that the rate of
discovery is changing over time?
11. Write your own Metropolis sampler (i.e., not JAGS) for the previous
problem. Turn in commented code, report your candidate distribution
for each parameter and the corresponding acceptance ratio, and
use trace plots for each parameter to show the chains have con-
verged.
12. A clinical trial assigned 100 patients each to placebo, the low dose
of a new drug, and the high dose of the new drug (for a total of 300
patients); the data are given in the table below.

Treatment    Positive outcome    Negative outcome
Placebo             52                  48
Low dose            60                  40
High dose           54                  46

Conduct a Bayesian analysis using JAGS with uniform priors on
the probability of a patient having a positive outcome under each
treatment.
(a) Report the posterior mean and 95% interval for the probability
of a positive outcome for each treatment group.
(b) Compute the posterior probability that the low dose is the best
of the three treatment options.
13. Consider the normal mixture model
\[ Y_i \mid \theta \overset{iid}{\sim} f(y \mid \theta) = \frac{1}{2}\left[\phi(y - \theta) + \phi(y)\right], \]
for i = 1, ..., n where θ ∈ R and $\phi(z) = \exp\{-z^2/2\}/\sqrt{2\pi}$ denotes
the density of a standard normal distribution. We have shown (Section
2.4, problem 14) that an improper prior cannot be used for this
likelihood. In this problem we explore the use of vague but proper
priors. Analyze data simulated in R as:

> set.seed(27695)
> theta_true <- 4
> n <- 30
> B <- rbinom(n,1,0.5)
> Y <- rnorm(n,B*theta_true,1)

(a) Argue that the R code above generates samples from f (y|θ).
(b) Plot the simulated data, and plot f (y|θ) for y ∈ [−3, 10] sepa-
rately for θ = {2, 4, 6}.
(c) Assuming prior $\theta \sim \text{Normal}(0, 10^2)$, obtain the MAP estimate
of θ and its asymptotic posterior standard deviation using the
following R code:
> library(stats4)
> nlp <- function(theta,Y){
> like <- 0.5*dnorm(Y,0,1) +
> 0.5*dnorm(Y,theta,1)
> prior <- dnorm(theta,0,10)
> neg_log_post <- -sum(log(like)) - log(prior)
> return(neg_log_post)}
>
> map_est <- mle(nlp,start=list(theta=1),
> fixed=list(Y=Y))
> sd <- sqrt(vcov(map_est))
(d) Suppose the prior distribution is $\theta \sim \text{Normal}(0, 10^k)$. Plot the posterior
density of θ for k ∈ {0, 1, 2, 3} and compare these posterior
densities with the asymptotic normal distribution in (c) (by
overlaying all five densities on the same plot).
(e) Use JAGS to fit this model via its mixture representation
$Y_i \mid B_i, \theta \sim \text{Normal}(B_i \theta, 1)$, where $B_i \overset{iid}{\sim} \text{Bernoulli}(0.5)$ and
$\theta \sim \text{Normal}(0, 10^2)$. Compare the posterior distribution of θ
with the results from part (d).
14. Let Y1 , . . . , Yn be a random sample from a shifted exponential dis-
tribution with density
\[ f(y \mid \alpha, \beta) = \begin{cases} \beta \exp\{-\beta(y - \alpha)\} & y \ge \alpha \\ 0 & y < \alpha \end{cases} \]

where α > 0 and β > 0 are the parameters of interest. Assume prior
distributions α ∼ Uniform(0, c) and β ∼ Gamma(a, b).
(a) Plot f (y|α, β) for α = 2 and β = 3 and y ∈ [0, 5], give interpre-
tations of α and β, and describe a real experiment where this
model might be appropriate.
(b) Give the full conditional distributions of α and β and determine
if they are members of a common family of distributions.
(c) Write pseudo code for an MCMC sampler including initial val-
ues, full details of each sampling step, and how you plan to
assess convergence.

15. Using the data (mass and age) provided in Listing 3.7, fit the fol-
lowing non-linear regression model:

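\[ \text{mass}_i \sim \text{Normal}(\mu_i, \sigma^2) \quad \text{where} \quad \mu_i = \theta_1 + \theta_2\, \text{age}_i^{\theta_3}, \]
with priors $\theta_1 \sim \text{Normal}(0, 100^2)$, $\theta_2 \sim \text{Uniform}(0, 20000)$,
$\theta_3 \sim \text{Normal}(0, 1)$, and $\sigma^2 \sim \text{InvGamma}(0.01, 0.01)$.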
(a) Fit the model in JAGS and plot the data (age versus mass)
along with the posterior mean of µi to verify the model fits
reasonably well.
(b) Conduct a thorough investigation of convergence.
(c) Give three steps you might take to improve convergence.
16. Assume the (overly complicated) model Y |n, p ∼ Binomial(n, p)
with prior distributions n ∼ Poisson(λ) and p ∼ Beta(a, b). The
observed data is Y = 10.
(a) Describe why convergence may be slow for this model.
(b) Fit the model in JAGS with λ = 10 and a = b = 1. Check
convergence and summarize the posterior of n, p and θ = np.
(c) Repeat the analysis in (b) except with a = b = 10. Comment
on the effect of the prior distribution of p on convergence for
all the three parameters.
17. Consider the (overly complicated) model

\[ Y_i \mid \mu_i, \sigma_i^2 \sim \text{Normal}(\mu_i, \sigma_i^2) \]
where $\mu_i \sim \text{Normal}(0, \theta_1^{-1})$, $\sigma_i^2 \sim \text{InvGamma}(\theta_2, \theta_3)$ and
$\theta_j \overset{iid}{\sim} \text{Gamma}(\epsilon, \epsilon)$ for j = 1, 2, 3. Assume the data are Yi = i for
i = 1, ..., n.
(a) Describe why convergence may be slow for this model.
(b) Fit the model in JAGS with all four combinations of n ∈ {5, 25}
and ε ∈ {0.1, 10} and report the effective sample size for θ1 , θ2 ,
and θ3 for all four fits.
(c) Comment on the effect of n and ε on convergence.
4
Linear models

CONTENTS
4.1 Analysis of normal means
  4.1.1 One-sample/paired analysis
  4.1.2 Comparison of two normal means
4.2 Linear regression
  4.2.1 Jeffreys prior
  4.2.2 Gaussian prior
  4.2.3 Continuous shrinkage priors
  4.2.4 Predictions
  4.2.5 Example: Factors that affect a home’s microbiome
4.3 Generalized linear models
  4.3.1 Binary data
  4.3.2 Count data
  4.3.3 Example: Logistic regression for NBA clutch free throws
  4.3.4 Example: Beta regression for microbiome data
4.4 Random effects
4.5 Flexible linear models
  4.5.1 Nonparametric regression
  4.5.2 Heteroskedastic models
  4.5.3 Non-Gaussian error models
  4.5.4 Linear models with correlated data
4.6 Exercises

Linear models form the foundation for much of statistical modeling and in-
tuition. In this chapter, we introduce many common statistical models and
implement them in the Bayesian framework. We focus primarily on Bayesian
aspects of these analyses including selecting priors, computation and compar-
isons with classical methods. The chapter begins with analyses of the mean of
a normal population (Section 4.1.1) and comparison of the means of two nor-
mal populations (Section 4.1.2), which are analogous to the classic one-sample
and two-sample t-tests. Section 4.2 introduces the more general Bayesian mul-
tiple linear regression model including priors that are appropriate for high-
dimensional problems. Multiple regression is extended to non-Gaussian data
in Section 4.3 via generalized linear models and correlated data in Section 4.4
via linear mixed models.


The Bayesian linear regression model makes strong assumptions including
that the mean is a linear combination of the covariates, and that the observations
are Gaussian and independent. Estimates are robust to small departures
from these assumptions, but nonetheless these assumptions should be carefully
evaluated (Section 5.6). Section 4.5 concludes with several extensions to the
basic linear regression model to illustrate the flexibility of Bayesian modeling.

4.1 Analysis of normal means


4.1.1 One-sample/paired analysis
In a one-sample study, the n observations are modeled as
$Y_i \mid \mu, \sigma^2 \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$ and the objective is to
determine if $\mu = 0$. This model is often applied to experiments where each unit i actually has
two measurements taken under different conditions and $Y_i$ is the difference between the
measurements. For example, if student i’s math scores before and after a tutorial session are
$Z_{0i}$ and $Z_{1i}$, respectively, then testing whether the (population) mean of
$Y_i = Z_{1i} - Z_{0i}$ is zero is a way to evaluate the effectiveness of the tutorial sessions.
A Bayesian analysis specifies priors for $\mu$ and $\sigma^2$ and then summarizes the
marginal posterior of $\mu$ accounting for uncertainty in $\sigma^2$. In this chapter we
use the Jeffreys’ prior, but conjugate normal/inverse gamma priors can also
be used as in Section 3.2. In most cases it is sufficient to plot the posterior
density $p(\mu \mid \mathbf{Y})$ and report the posterior mean and 95% interval. We will also
address the problem using a formal hypothesis test. In this chapter we will
compute the posterior probabilities of the one-sided hypotheses $H_0: \mu \le 0$
and $H_1: \mu > 0$; in Chapter 5 we test the point hypothesis $H_0: \mu = 0$ versus
$H_1: \mu \ne 0$.
Conditional distribution with the variance fixed: Conditional on $\sigma^2$,
$\mu$ has Jeffreys prior $\pi(\mu) \propto 1$. The posterior is
$\mu \mid \mathbf{Y} \sim \mathrm{Normal}(\bar{Y}, \sigma^2/n)$ and thus the $100(1-\alpha)\%$
posterior credible interval for $\mu$ is
$$\bar{Y} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \tag{4.1}$$
where $z_\tau$ is the $\tau$ quantile of the standard normal distribution (i.e., the normal
distribution with mean zero and variance one). This exactly matches the classic
$100(1-\alpha)\%$ confidence interval. While the credible interval and confidence interval are
numerically identical in this case, they are interpreted differently. The Bayesian credible
interval quantifies uncertainty about $\mu$ given this dataset, $\mathbf{Y}$, whereas the
confidence interval quantifies the anticipated variation in $\bar{Y}$ if we were to repeat the
experiment.
For the test of the null hypothesis $H_0: \mu \le 0$ versus the alternative hypothesis
$H_1: \mu > 0$, the posterior probability of the null hypothesis is
$$\mathrm{Prob}(H_0 \mid \mathbf{Y}) = \mathrm{Prob}(\mu \le 0 \mid \mathbf{Y}) = \Phi(-Z), \tag{4.2}$$
where $\Phi$ is the standard normal cumulative distribution function and
$Z = \sqrt{n}\,\bar{Y}/\sigma$. This posterior probability matches exactly the p-value of the
frequentist one-sided z-test. By definition, $\Phi(z_\tau) = \tau$, and so the decision rule to
reject $H_0$ in favor of $H_1$ if the posterior probability of $H_0$ is less than $\alpha$ is
equivalent to rejecting $H_0$ if $-Z < z_\alpha$, or equivalently if $Z > z_{1-\alpha}$
($-z_\alpha = z_{1-\alpha}$ due to the symmetry of the standard normal PDF). Therefore, this
decision rule is identical to the classic one-sided z-test at significance level $\alpha$.
However, unlike the classical test, we can quantify our uncertainty using the
posterior probability that the hypothesis H0 (or H1 ) is true since we have
computed Prob(H0 |Y).
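As a small numerical check of (4.1) and (4.2), the R sketch below simulates hypothetical data with a known standard deviation and confirms that the posterior probability of H0 equals the p-value of the classic one-sided z-test. The sample size, true mean, and value of sigma are illustrative assumptions and do not come from the text.

```r
# Hypothetical data: n, the true mean and the known sigma are illustrative assumptions
set.seed(42)
n     <- 30
sigma <- 2
Y     <- rnorm(n, mean = 0.8, sd = sigma)

Ybar <- mean(Y)
Z    <- sqrt(n) * Ybar / sigma

# Posterior probability of H0: mu <= 0 under mu | Y ~ Normal(Ybar, sigma^2 / n)
post_prob_H0 <- pnorm(0, mean = Ybar, sd = sigma / sqrt(n))

# Classic one-sided z-test p-value: upper tail probability of Z under a standard normal
p_value <- pnorm(Z, lower.tail = FALSE)

# 95% credible interval from (4.1)
cred_int <- Ybar + qnorm(c(0.025, 0.975)) * sigma / sqrt(n)
```

The two probabilities agree because both reduce to Φ(−Z); only their interpretation differs.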
Unknown variance: As shown in Section 2.3, the Jeffreys’ prior for $(\mu, \sigma^2)$ is
$$\pi(\mu, \sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{3/2}. \tag{4.3}$$
Appendix A.3 shows that the marginal posterior of $\mu$ integrating over $\sigma^2$ is
$$\mu \mid \mathbf{Y} \sim t_n\left(\bar{Y}, \hat{\sigma}^2/n\right), \tag{4.4}$$
where $\hat{\sigma}^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2/n$, i.e., the posterior is Student’s t
distribution with location $\bar{Y}$, variance parameter $\hat{\sigma}^2/n$ and n degrees of
freedom. Posterior inference such as credible sets or the posterior probability that $\mu$ is
positive follows from the quantiles of Student’s t distribution. The credible set is slightly
different from the frequentist confidence interval because the classic t-test has $n - 1$
degrees of freedom, whereas the posterior has n degrees of freedom; this is the effect of the
prior on $\sigma^2$.
In classical statistics, when $\sigma^2$ is unknown the Z-test based on the normal
distribution is replaced with the t-test based on Student’s t distribution.
Similarly, the posterior distribution of the mean changes from a normal distribution
given $\sigma^2$ to Student’s t distribution when uncertainty about the
variance is considered. Figure 4.1 compares the Gaussian and t density functions.
The density functions are virtually identical for n = 25, but for n = 5
the t distribution has heavier tails than the Gaussian distribution; this is the
effect of accounting for uncertainty in $\sigma^2$.
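The unknown-variance results can be sketched in a few lines of R. The data below are simulated and purely hypothetical; the code computes the credible interval implied by (4.4), the posterior probability that µ is positive, and the classic t-based confidence interval for comparison.

```r
# Hypothetical data; the sample size and generating values are illustrative assumptions
set.seed(1)
n <- 10
Y <- rnorm(n, mean = 1, sd = 2)

Ybar       <- mean(Y)
sigma2_hat <- mean((Y - Ybar)^2)      # divides by n, as in the definition below (4.4)
post_scale <- sqrt(sigma2_hat / n)

# 95% credible interval from the t posterior with n degrees of freedom
cred_int <- Ybar + post_scale * qt(c(0.025, 0.975), df = n)

# Posterior probability that mu > 0
prob_pos <- 1 - pt((0 - Ybar) / post_scale, df = n)

# Classic 95% confidence interval: n - 1 degrees of freedom and the
# sample standard deviation with divisor n - 1
conf_int <- Ybar + sd(Y) / sqrt(n) * qt(c(0.025, 0.975), df = n - 1)
```

With these definitions the credible interval is always slightly narrower than the confidence interval, since both the scale and the t quantile are smaller.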

4.1.2 Comparison of two normal means


The two-sample test compares the mean in two groups. For example, an ex-
periment might take the blood pressure of a random sample of n1 mail carriers
that walk their route and n2 mail carriers that drive their route to determine if
these groups have different mean blood pressure. Let
$Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$ for $i = 1, ..., n_1$ and
$Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu + \delta, \sigma^2)$ for
$i = n_1 + 1, ..., n_1 + n_2 = n$, so that

FIGURE 4.1
Comparison of the Gaussian and Student’s t distributions.
[Figure not reproduced: two panels, n = 5 and n = 25, each plotting the Gaussian and Student’s t densities against µ.]
Plotted are the Gaussian PDF with mean $\bar{Y}$ and standard deviation $\sigma/\sqrt{n}$ compared
to the PDF of Student’s t distribution with location $\bar{Y}$, scale $\hat{\sigma}/\sqrt{n}$, and n
degrees of freedom. The plots assume $\bar{Y} = 10$, $\sigma = \hat{\sigma} = 2$, and $n \in \{5, 25\}$.

$\delta$ is the difference in means and the parameter of interest. Denote the sample
mean of the $n_j$ observations in group $j = 1, 2$ as $\bar{Y}_j$ and the group-specific
variance estimators as $s_1^2 = \sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2/n_1$ and
$s_2^2 = \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2/n_2$.
Conditional distribution with the variance fixed: Conditional on
the variance and flat prior $\pi(\mu, \delta) \propto 1$ it can be shown that the posterior of
the difference in means is
$$\delta \mid \mathbf{Y} \sim \mathrm{Normal}\left(\bar{Y}_2 - \bar{Y}_1, \sigma^2\left[\frac{1}{n_1} + \frac{1}{n_2}\right]\right). \tag{4.5}$$
As in the one-sample case, posterior intervals and probabilities of hypotheses
can be computed using the quantiles of the normal distribution. Also, as
in the one-sample case, the credible set and rejection rule match the classic
confidence interval and one-sided z-test numerically but have different
interpretations.
Unknown variance: The Jeffreys’ prior for $(\mu, \delta, \sigma^2)$ is (Section 2.3)
$$\pi(\mu, \delta, \sigma^2) \propto \frac{1}{(\sigma^2)^2}. \tag{4.6}$$
Appendix A.3 shows (as a special case of multiple linear regression, see Section
4.2) the marginal posterior distribution of $\delta$ integrating over both $\mu$ and $\sigma^2$ is
$$\delta \mid \mathbf{Y} \sim t_n\left(\bar{Y}_2 - \bar{Y}_1, \hat{\sigma}^2\left[\frac{1}{n_1} + \frac{1}{n_2}\right]\right), \tag{4.7}$$

Listing 4.1
R code for comparing two normal means with the Jeffreys’ prior.
# Y1 is the n1-vector of data for group 1
# Y2 is the n2-vector of data for group 2

# Statistics from group 1
Ybar1 <- mean(Y1)
s21 <- mean((Y1-Ybar1)^2)
n1 <- length(Y1)

# Statistics from group 2
Ybar2 <- mean(Y2)
s22 <- mean((Y2-Ybar2)^2)
n2 <- length(Y2)

# Posterior of the difference assuming equal variance
delta_hat <- Ybar2-Ybar1
s2 <- (n1*s21 + n2*s22)/(n1+n2)
scale <- sqrt(s2)*sqrt(1/n1+1/n2)
df <- n1+n2
cred_int <- delta_hat + scale*qt(c(0.025,0.975),df=df)

# Posterior of delta assuming unequal variance using MC sampling
mu1 <- Ybar1 + sqrt(s21/n1)*rt(1000000,df=n1)
mu2 <- Ybar2 + sqrt(s22/n2)*rt(1000000,df=n2)
delta <- mu2-mu1

hist(delta,main="Posterior distribution of the difference in means")
quantile(delta,c(0.025,0.975)) # 95% credible set

where $\hat{\sigma}^2 = (n_1 s_1^2 + n_2 s_2^2)/n$ is a pooled variance estimator. As with the
one-sample model, the difference between the posterior for the known-variance
versus unknown-variance cases is that an estimate of $\sigma^2$ is inserted in the posterior
and the Gaussian distribution is replaced with Student’s t distribution
with n degrees of freedom. In the Bayesian analysis we did not “plug in” an
estimate of $\sigma^2$; rather, by accounting for its uncertainty and marginalizing
over $\mu$ and $\sigma^2$, the posterior for $\delta$ happens to have a natural estimator of
$\sigma^2$ in its scale. Listing 4.1 implements this method.
Unequal variance: If the assumption that the variance is the same for
both groups is violated, then the two-sample model can be extended as
$Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu_1, \sigma_1^2)$ for $i = 1, ..., n_1$ and
$Y_i \overset{iid}{\sim} \mathrm{Normal}(\mu_2, \sigma_2^2)$ for $i = n_1 + 1, ..., n_1 + n_2$.
Since no parameters are shared across the two groups, we can apply the
one-sample model separately for each group to obtain
$$\mu_j \mid \mathbf{Y} \overset{indep}{\sim} t_{n_j}\left(\bar{Y}_j, s_j^2/n_j\right) \tag{4.8}$$
for $j = 1, 2$, and the posterior of the difference in means $\delta = \mu_2 - \mu_1$ follows.
The posterior of $\delta$ is the difference between two Student t random variables,
which does not in general have a simple form. However, the posterior can
be approximated with arbitrary precision using Monte Carlo sampling as in
Listing 4.1.
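The same Monte Carlo draws can also be used for hypothesis-style summaries. The sketch below uses simulated, hypothetical blood pressure data and a helper function `post_draws` (a name introduced here for illustration) to approximate the posterior probability that δ > 0 under the unequal-variance model, along with the Monte Carlo standard error of that approximation.

```r
# Hypothetical data for two groups; all values are illustrative assumptions
set.seed(2)
Y1 <- rnorm(15, mean = 120, sd = 10)
Y2 <- rnorm(20, mean = 126, sd = 15)

# Draws from the one-sample posterior (4.8) for a single group
post_draws <- function(Y, S) {
  n  <- length(Y)
  s2 <- mean((Y - mean(Y))^2)
  mean(Y) + sqrt(s2 / n) * rt(S, df = n)
}

S     <- 100000
delta <- post_draws(Y2, S) - post_draws(Y1, S)

prob_pos <- mean(delta > 0)                       # approximates Prob(delta > 0 | Y)
mc_se    <- sqrt(prob_pos * (1 - prob_pos) / S)   # Monte Carlo standard error
```

Increasing S shrinks the Monte Carlo error at the usual 1/sqrt(S) rate, which is the sense in which the approximation can be made arbitrarily precise.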

4.2 Linear regression


The multiple linear regression model with response (also called the dependent
variable or outcome) Yi and covariates (also called independent variables,
predictors or inputs) Xi1 , ..., Xip is
$$Y_i = \sum_{j=1}^p X_{ij}\beta_j + \varepsilon_i, \tag{4.9}$$
where $\boldsymbol{\beta} = (\beta_1, ..., \beta_p)^T$ are the regression coefficients and the errors
(also called the residuals) are $\varepsilon_i \overset{iid}{\sim} \mathrm{Normal}(0, \sigma^2)$.
We assume throughout that $X_{i1} = 1$ for all i so that $\beta_1$ is the intercept (i.e., the mean
if all other covariates are zero). This model includes as special cases the one-sample mean model
in Section 4.1.1 (with $p = 1$ and $\beta_1 = \mu$) and the two-sample mean model in
Section 4.1.2 (with $p = 2$, $X_{i2}$ equal to one if observation i is from the second
group and zero otherwise, $\beta_1 = \mu$, and $\beta_2 = \delta$).
The coefficient $\beta_j$ for $j > 1$ is the slope associated with the jth covariate.
For the remainder of this subsection we will assume that all $p - 1$ covariates
(excluding the intercept term) have been standardized to have mean zero and
variance one so the prior can be specified without considering the scales of the
covariates. That is, if the original covariate j, $\tilde{X}_{ij}$, had sample mean $\bar{X}_j$ and
standard deviation $\hat{s}_j$ then we set $X_{ij} = (\tilde{X}_{ij} - \bar{X}_j)/\hat{s}_j$.
After standardization, the slope $\beta_j$ is interpreted as the change in the mean response
corresponding to an increase of one standard deviation unit ($\hat{s}_j$) in the original covariate.
Similarly, $\beta_j/\hat{s}_j$ is the expected increase in the mean response associated with
an increase of one in the original covariate. The model actually has $p + 1$ parameters
(p regression coefficients and variance $\sigma^2$) so we temporarily use p as
the number of regression parameters and not the total number of parameters
in the model.
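The standardization step and the interpretation of the rescaled slope can be illustrated with a short R sketch. The covariate names and simulated values are hypothetical, and the sketch assumes the standard deviation is computed with divisor n − 1, which is what R's `scale` and `sd` use; the text does not specify the divisor.

```r
# Hypothetical raw covariates; names and values are illustrative assumptions
set.seed(3)
n     <- 50
X_raw <- cbind(age = rnorm(n, 40, 12), income = rnorm(n, 50000, 15000))
Y     <- 2 + 0.05 * X_raw[, "age"] - 0.00002 * X_raw[, "income"] + rnorm(n)

# Standardize each covariate: subtract the sample mean, divide by the sample SD
X_std <- scale(X_raw)
s_hat <- attr(X_std, "scaled:scale")     # the sample standard deviations
X     <- cbind(Intercept = 1, X_std)     # design matrix with intercept column

# Least squares fits on the standardized and original scales
slope_std <- coef(lm(Y ~ X_std))[-1]     # change in mean response per SD
slope_raw <- coef(lm(Y ~ X_raw))[-1]     # change in mean response per original unit

# Dividing the standardized slope by s_hat recovers the original-scale slope
slope_back <- slope_std / s_hat
```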
The likelihood function for the linear model with n observations is the
product of n Gaussian PDFs with means $\sum_{j=1}^p X_{ij}\beta_j$ and variance $\sigma^2$,
$$f(\mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left(Y_i - \sum_{j=1}^p X_{ij}\beta_j\right)^2\right\}. \tag{4.10}$$
Since maximizing the likelihood is equivalent to minimizing the negative log-likelihood,
the maximum likelihood estimator for $\boldsymbol{\beta}$ can be written
$$\hat{\boldsymbol{\beta}}_{LS} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n \left(Y_i - \sum_{j=1}^p X_{ij}\beta_j\right)^2. \tag{4.11}$$
Therefore, assuming Gaussian errors the maximum likelihood estimator is also
the least squares estimator.
The model is conveniently written in matrix notation. Let
$\mathbf{Y} = (Y_1, ..., Y_n)^T$ be the response vector of length n, and the design matrix
$\mathbf{X}$ be the $n \times p$ matrix with the first column equal to the vector of ones for the
intercept and column j having elements $X_{1j}, ..., X_{nj}$. The linear regression
model is then
$$\mathbf{Y} \sim \mathrm{Normal}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n) \tag{4.12}$$
where $\mathbf{I}_n$ is the $n \times n$ identity matrix with ones on the diagonal (so all
responses have variance $\sigma^2$) and zeros off the diagonal (so the responses are
uncorrelated). The usual least squares estimator in matrix notation is
$$\hat{\boldsymbol{\beta}}_{LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{4.13}$$

Note that the least squares estimator is unique only if $\mathbf{X}^T\mathbf{X}$ is full rank
(i.e., $p \le n$ and none of $\mathbf{X}$’s columns are redundant) and the estimator is poorly
defined if $\mathbf{X}^T\mathbf{X}$ is not full rank. Assuming $\mathbf{X}$ is full rank, the
sampling distribution used to construct frequentist confidence intervals and p-values is
$\hat{\boldsymbol{\beta}}_{LS} \sim \mathrm{Normal}\left[\boldsymbol{\beta}_0, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right]$,
where $\boldsymbol{\beta}_0$ is the true value.
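The matrix form of the least squares estimator in (4.13) is easy to verify numerically. The sketch below simulates a hypothetical design matrix and response, checks the full-rank condition, and confirms that the matrix formula reproduces the coefficients returned by R's `lm`.

```r
# Hypothetical simulated regression; dimensions and coefficients are illustrative assumptions
set.seed(4)
n <- 100
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column is the intercept
beta_true <- c(1, 2, -1)
Y <- as.vector(X %*% beta_true + rnorm(n))

# The estimator is unique only if X has full column rank
full_rank <- qr(X)$rank == p

# Least squares estimator (4.13); solving the normal equations avoids an explicit inverse
XtX     <- crossprod(X)                   # t(X) %*% X
beta_LS <- solve(XtX, crossprod(X, Y))    # (X'X)^{-1} X'Y

# Same fit using lm, suppressing lm's own intercept because X already contains one
beta_lm <- coef(lm(Y ~ X - 1))
```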

4.2.1 Jeffreys prior


Conditional distribution with the variance fixed: Conditioned on $\sigma^2$,
the Jeffreys prior is $\pi(\boldsymbol{\beta}) \propto 1$. This improper prior only leads to a proper
posterior if $\mathbf{X}^T\mathbf{X}$ has full rank, which is the same condition needed for the
least squares estimator to be unique. Assuming this condition is met, the
posterior is
$$\boldsymbol{\beta} \mid \mathbf{Y}, \sigma^2 \sim \mathrm{Normal}\left[\hat{\boldsymbol{\beta}}_{LS}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right]. \tag{4.14}$$
The posterior mean is the least squares solution and the posterior covariance
matrix is the covariance matrix of the sampling distribution of the least
squares estimator. Therefore, the posterior credible intervals from this model
will numerically match the confidence intervals from a least squares analysis
with known error variance.
Unknown variance: With $\sigma^2$ unknown, the Jeffreys’ prior is
$\pi(\boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-p/2-1}$ (Section 2.3). Assuming
$\mathbf{X}^T\mathbf{X}$ has full rank, Appendix A.3 shows that
$$\boldsymbol{\beta} \mid \mathbf{Y} \sim t_n\left[\hat{\boldsymbol{\beta}}_{LS}, \hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}\right], \tag{4.15}$$

Listing 4.2
R code for Bayesian linear regression under the Jeffreys’ prior.
# This code assumes:
# Y is the n-vector of observations
# X is the n x p matrix of covariates
# The first column of X is all ones for the intercept

# Compute posterior mean and 95% interval
beta_mean <- solve(t(X)%*%X)%*%t(X)%*%Y
sigma2 <- mean((Y-X%*%beta_mean)^2)
beta_cov <- sigma2*solve(t(X)%*%X)
beta_scale <- sqrt(diag(beta_cov))
df <- length(Y)
beta_025 <- beta_mean + beta_scale*qt(0.025,df=df)
beta_975 <- beta_mean + beta_scale*qt(0.975,df=df)

# Package the output
out <- cbind(beta_mean,beta_025,beta_975)
rownames(out) <- colnames(X)
colnames(out) <- c("Mean","Q 0.025","Q 0.975")

where $\hat{\sigma}^2 = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{LS})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{LS})/n$.
That is, the marginal posterior of $\boldsymbol{\beta}$ follows the p-dimensional Student’s t
distribution with location $\hat{\boldsymbol{\beta}}_{LS}$, covariance matrix
$\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}$, and n degrees of freedom.
A property of the multivariate t distribution is that the marginal distribution
of each element is univariate t so that
$$\beta_j \mid \mathbf{Y} \sim t_n(\hat{\beta}_j, s_j^2) \tag{4.16}$$
where $\hat{\beta}_j$ is the jth element of $\hat{\boldsymbol{\beta}}_{LS}$ and $s_j^2$ is the jth
diagonal element of $\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}$. Therefore, both the joint
distribution of all the regression coefficients and marginal distribution of each regression
coefficient belong to a known family of distributions, and can thus be computed without MCMC
sampling. Listing 4.2 provides R code to compute the posterior mean and 95%
intervals.

4.2.2 Gaussian prior


For high-dimensional cases with more predictors than observations the im-
proper Jeffreys’ prior does not lead to a valid posterior distribution and
thus a proper prior is required. Even in less extreme cases with moder-
ate p, a proper prior can stabilize the posterior and give better results
than the improper prior. The conjugate prior (conditioned on $\sigma^2$) for $\boldsymbol{\beta}$ is
$\boldsymbol{\beta} \mid \sigma^2 \sim \mathrm{Normal}(\boldsymbol{\mu}, \sigma^2\boldsymbol{\Omega})$.
We include $\sigma^2$ in the prior variance to account
for the scale of the response. Typically the prior is centered around zero,

Listing 4.3
JAGS code for multiple linear regression with Gaussian priors.
# Likelihood
for(i in 1:n){
  Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001) #X[i,1]=1 for the intercept
for(j in 2:p){
  beta[j] ~ dnorm(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)

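and so from here on we set $\boldsymbol{\mu} = \mathbf{0}$. This prior combined with the likelihood
$\mathbf{Y} \sim \mathrm{Normal}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$ gives posterior
$$\boldsymbol{\beta} \mid \mathbf{Y}, \sigma^2 \sim \mathrm{Normal}\left[(\mathbf{X}^T\mathbf{X} + \boldsymbol{\Omega}^{-1})^{-1}\mathbf{X}^T\mathbf{Y}, \sigma^2(\mathbf{X}^T\mathbf{X} + \boldsymbol{\Omega}^{-1})^{-1}\right], \tag{4.17}$$
which is a proper distribution as long as the prior is proper ($\boldsymbol{\Omega}$ is positive
definite) even if $p > n$ or the covariates are perfectly collinear.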
There are several choices for the prior covariance matrix $\boldsymbol{\Omega}$. The most
common is the diagonal matrix $\boldsymbol{\Omega} = \tau^2\mathbf{I}_p$, which is equivalent to the prior
$\beta_j \mid \sigma^2 \overset{iid}{\sim} \mathrm{Normal}(0, \sigma^2\tau^2)$. Under this prior the MAP
estimator is the ridge regression estimator [44]
$$\hat{\boldsymbol{\beta}}_R = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n \left(Y_i - \sum_{j=1}^p X_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^p \beta_j^2, \tag{4.18}$$
where $\lambda = 1/\tau^2$. Ridge regression is often used to stabilize least squares problems
when the number of predictors is large and/or the covariates are collinear.
In ridge regression, the tuning parameter $\lambda$ can be selected based on cross-validation.
In a fully Bayesian analysis, $\tau^2$ can either be fixed to a large value
to give an uninformative prior, or given a conjugate inverse gamma prior as in
Listing 4.3 to allow the data to determine how much to shrink the coefficients
towards zero. If $\tau^2$ is given a prior, then the intercept term $\beta_1$ should be given
a different variance because it plays a different role than the other regression
coefficients.
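The link between the Gaussian prior and ridge regression can be checked numerically. The sketch below uses simulated, hypothetical data and a fixed value of τ² (an illustrative assumption), computes the posterior mean from (4.17) with Ω = τ²I, and verifies that the gradient of the ridge objective in (4.18) vanishes there. It follows (4.18) as written, so the penalty is applied to every coefficient, including the intercept.

```r
# Hypothetical simulated data; dimensions, coefficients and tau2 are illustrative assumptions
set.seed(5)
n <- 40
p <- 4
X <- cbind(1, scale(matrix(rnorm(n * (p - 1)), n, p - 1)))
Y <- as.vector(X %*% c(1, 0.5, 0, -0.5) + rnorm(n))

tau2   <- 0.5
lambda <- 1 / tau2

# Posterior mean from (4.17) with Omega = tau2 * I, so Omega^{-1} = lambda * I
beta_post <- as.vector(solve(crossprod(X) + lambda * diag(p), crossprod(X, Y)))

# Ridge objective from (4.18) and its gradient with respect to beta
ridge_obj  <- function(b) sum((Y - X %*% b)^2) + lambda * sum(b^2)
ridge_grad <- function(b) as.vector(-2 * crossprod(X, Y - X %*% b) + 2 * lambda * b)

# The gradient is numerically zero at the posterior mean, so it minimizes (4.18)
grad_at_post <- ridge_grad(beta_post)

# For comparison, the unpenalized least squares estimator
beta_LS <- as.vector(solve(crossprod(X), crossprod(X, Y)))
```

Because the objective is strictly convex, a zero gradient identifies the unique minimizer; the ridge solution also has a smaller squared norm than the least squares solution.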
The form of the posterior simplifies under Zellner’s g-prior [83]
$\boldsymbol{\beta} \mid \sigma^2 \sim \mathrm{Normal}\left[\mathbf{0}, \frac{\sigma^2}{g}(\mathbf{X}^T\mathbf{X})^{-1}\right]$
for $g > 0$. The conditional posterior is then
$$\boldsymbol{\beta} \mid \mathbf{Y}, \sigma^2 \sim \mathrm{Normal}\left[c\hat{\boldsymbol{\beta}}_{LS}, c\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right], \tag{4.19}$$
where $c = 1/(g + 1) \in (0, 1)$ is the shrinkage factor. This speeds computation
because $\hat{\boldsymbol{\beta}}_{LS}$ and $(\mathbf{X}^T\mathbf{X})^{-1}$ can be computed once outside
the MCMC sampler. The shrinkage factor c determines how strongly the posterior mean and
covariance are shrunk towards zero. A common choice is $g = 1/n$ and thus
$c = n/(n + 1)$. Since Fisher’s information matrix for the Gaussian distribution
is the inverse covariance matrix, the prior contributes 1/n of the information
contributed by the likelihood, and so this prior is called the unit information prior [49].
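The shrinkage interpretation of the g-prior can be verified directly. In the sketch below (simulated, hypothetical data), the general posterior mean from (4.17) with Ω⁻¹ = g XᵀX is compared with the shrunken least squares estimator in (4.19) under the unit information choice g = 1/n.

```r
# Hypothetical simulated data; dimensions and coefficients are illustrative assumptions
set.seed(6)
n <- 50
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- as.vector(X %*% c(1, 2, -1) + rnorm(n))

XtX     <- crossprod(X)
beta_LS <- as.vector(solve(XtX, crossprod(X, Y)))

g        <- 1 / n          # unit information prior
c_shrink <- 1 / (g + 1)    # shrinkage factor, equal to n / (n + 1)

# Posterior mean from (4.17) with prior precision Omega^{-1} = g * X'X
beta_post <- as.vector(solve(XtX + g * XtX, crossprod(X, Y)))

# Shrunken least squares estimator from (4.19)
beta_shrunk <- c_shrink * beta_LS
```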

4.2.3 Continuous shrinkage priors


In regressions with many predictors it is often assumed that most of the p
predictors have little effect on the response. The Gaussian prior can reflect
this prior belief with a small prior variance, but a small prior variance has the
negative side effect of shrinking the posterior mean of the important regression
coefficients toward zero and thus inducing bias. An alternative is to use double
exponential priors for the βj that have (Figure 4.2) a peak at zero to reflect
the prior belief that most predictors have no effect on the response, but a
heavy tail to reflect the prior belief that there are a few predictors with strong
effects. Listing 4.4 provides JAGS code to implement this model (although the
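R package BLR [21] is likely faster).
Assuming the model $Y_i \sim \mathrm{Normal}(\sum_{j=1}^p X_{ij}\beta_j, \sigma^2)$ with double
exponential priors $\beta_j \overset{iid}{\sim} \mathrm{DE}(\lambda/\sigma^2)$ and thus prior PDF
$\pi(\beta_j) \propto \exp\left(-\frac{\lambda}{2\sigma^2}|\beta_j|\right)$, then
the maximum a posteriori (MAP) estimate is (since maximizing the posterior
is equivalent to minimizing twice the negative log posterior)
$$\hat{\boldsymbol{\beta}}_{LASSO} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n \left(Y_i - \sum_{j=1}^p X_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^p |\beta_j|. \tag{4.20}$$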

This is the famous LASSO [81] penalized regression estimator and thus the
double exponential prior is often called the Bayesian LASSO prior. An attrac-
tive feature of this estimator is that some of the estimates may have β̂j set
exactly to zero, and this then performs variable selection simultaneously with
estimation. In other words, the LASSO encodes the prior belief that some of
the covariates are unimportant.
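To see how the LASSO objective can produce estimates that are exactly zero, consider the simplest case of a single covariate scaled so that its squared values sum to one. In that special case the minimizer of (4.20) is the soft-thresholded value of z = Σ xᵢYᵢ at λ/2; this closed form is a standard result stated here as an assumption rather than something taken from the text, and the sketch checks it numerically on hypothetical simulated data.

```r
# Hypothetical single-covariate example; all numbers are illustrative assumptions
set.seed(7)
n <- 25
x <- rnorm(n)
x <- x / sqrt(sum(x^2))          # scale so that sum(x^2) = 1
lambda <- 1

# Objective in (4.20) for a single coefficient b
lasso_obj <- function(b, Y) sum((Y - x * b)^2) + lambda * abs(b)

# Closed-form minimizer when sum(x^2) = 1: soft-threshold z = sum(x * Y) at lambda / 2
soft     <- function(z, t) sign(z) * max(abs(z) - t, 0)
lasso_1d <- function(Y) soft(sum(x * Y), lambda / 2)

Y_strong <- 3.0 * x + rnorm(n, sd = 0.05)   # strong signal
Y_weak   <- 0.2 * x + rnorm(n, sd = 0.05)   # weak signal

b_strong <- lasso_1d(Y_strong)   # nonzero, but pulled toward zero by lambda / 2
b_weak   <- lasso_1d(Y_weak)     # set exactly to zero

# Numerical minimization of the same objective as a check on the closed form
num_strong <- optimize(lasso_obj, c(-10, 10), Y = Y_strong, tol = 1e-10)$minimum
num_weak   <- optimize(lasso_obj, c(-10, 10), Y = Y_weak,   tol = 1e-10)$minimum
```

The absolute-value penalty has a kink at zero, so a weak enough signal cannot move the minimizer away from zero; a squared penalty as in ridge regression would only shrink the estimate.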
The double exponential prior is just one example of a shrinkage prior
with peak at zero and heavy tails. The horseshoe prior [16] is
$\beta_j \mid \lambda_j \sim \mathrm{Normal}(0, \lambda_0^2\lambda_j^2)$ where $\lambda_0$ is a
global prior standard deviation common to all regression coefficients and $\lambda_j$ is a local
prior standard deviation specific to $\beta_j$. The local standard deviations are given
half-Cauchy priors (i.e., Student’s t priors with one degree of
freedom restricted to be positive). This global-local prior is designed to shrink
null coefficients towards zero by having small variance while the true signals
have uninformative priors with large variance. The Dirichlet–Laplace prior
[11] gives even more shrinkage towards zero by supplementing the Bayesian
LASSO with local shrinkage parameters and a Dirichlet prior on the shrinkage
parameters. The R2D2 prior [84] is another global-local shrinkage prior for

FIGURE 4.2
Comparison of the Gaussian and double exponential prior distributions.
[Figure not reproduced: density curves labeled BLASSO and Gaussian plotted over the range −4 to 4.]
Plotted are the standard normal PDF and the double exponential PDF
with parameters set to give mean zero and variance one.

the regression coefficients that is constructed so that the model’s coefficient
of determination (i.e., R-squared) has a beta prior. The R2D2 prior has the
most mass at zero and heaviest tail among these priors. For an alternative
approach that places positive prior probability on the regression coefficients
being exactly zero see the spike-and-slab prior in Section 5.3.

4.2.4 Predictions
One use of linear regression is to make a prediction for a new set of covariates,
$\mathbf{X}^{pred} = (X_1^{pred}, ..., X_p^{pred})$. Given the model parameters, the distribution of

Listing 4.4
JAGS code for the Bayesian LASSO.
# Likelihood
for(i in 1:n){
  Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001)
for(j in 2:p){
  beta[j] ~ ddexp(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)

Listing 4.5
JAGS code for linear regression predictions.
# Likelihood
for(i in 1:n){
  Y[i] ~ dnorm(inprod(X[i,],beta[]),taue)
}
# Priors
beta[1] ~ dnorm(0,0.001)
for(j in 2:p){
  beta[j] ~ dnorm(0,taub*taue)
}
taue ~ dgamma(0.1, 0.1)
taub ~ dgamma(0.1, 0.1)

# Predictions
for(i in 1:n_pred){
  Y_pred[i] ~ dnorm(inprod(X_pred[i,],beta[]),taue)
}
# User must pass JAGS the covariates X_pred and integer n_pred
# JAGS returns PPD samples of Y_pred

the new response is
$Y^{pred} \mid \boldsymbol{\beta}, \sigma^2 \sim \mathrm{Normal}(\sum_{j=1}^p X_j^{pred}\beta_j, \sigma^2)$.
To properly account for parametric uncertainty, we should use the posterior predictive
distribution (Section 1.5) that averages over the uncertainty in $\boldsymbol{\beta}$ and $\sigma^2$.
MCMC provides a means to sample from the PPD by making a sample from
the predictive distribution for each of the $s = 1, ..., S$ MCMC samples of the parameters,
$Y^{(s)} \mid \boldsymbol{\beta}^{(s)}, \sigma^{2(s)} \sim \mathrm{Normal}(\sum_{j=1}^p X_j^{pred}\beta_j^{(s)}, \sigma^{2(s)})$,
and using the S draws $Y^{(1)}, ..., Y^{(S)}$ to approximate the PPD. The PPD is then summarized
the same way as other posterior distributions, such as by the posterior mean
and 95% interval. Similar approaches can be used to analyze missing data as in
Section 6.4.
Listing 4.5 gives JAGS code to make linear regression predictions. The
matrix of predictors $\mathbf{X}^{pred}$ must be passed to JAGS and JAGS will return the
predictions $\mathbf{Y}^{pred}$. Making predictions with JAGS can slow the sampler and
consume memory, and so it is often better to first perform MCMC sampling
for the parameters using JAGS and then make predictions in R as in Listing
4.6.

4.2.5 Example: Factors that affect a home’s microbiome


We use the data from [5], downloaded from
http://figshare.com/articles/1000homes/1270900. The data are dust samples from the ledges
above doorways from n = 1,059 homes (after removing samples with missing
data; for missing data methods see Section 6.4) in the continental US. Bioinformatics
processing detects the presence or absence of 763 species (technically

Listing 4.6
R code to use JAGS MCMC samples for linear regression predictions.
# INPUTS
# beta_samples := S x p matrix of MCMC samples (from JAGS)
# taue_samples := S x 1 matrix of MCMC samples (from JAGS)
# X_pred := n_pred x p matrix of prediction covariates

S <- nrow(beta_samples)
n_pred <- nrow(X_pred)
Y_pred <- matrix(NA,S,n_pred)
sigma <- 1/sqrt(taue_samples)   # convert precision to standard deviation

for(s in 1:S){
  # One PPD draw per MCMC sample: mean X_pred %*% beta plus Gaussian noise
  Y_pred[s,] <- X_pred%*%beta_samples[s,]+rnorm(n_pred,0,sigma[s])
}

# OUTPUT
# Y_pred := S x n_pred matrix of PPD samples

operational taxonomic units) of fungi. The response is the log of the number
of fungi species present in the sample, which is a measure of species richness.
The objective is to determine which factors influence a home’s species rich-
ness. For each home, eight covariates are included in this example: longitude,
latitude, annual mean temperature, annual mean precipitation, net primary
productivity (NPP), elevation, the binary indicator that the house is a single-
family home, and the number of bedrooms in the home. These covariates are
all centered and scaled to have mean zero and variance one.
We apply the Gaussian model in Listing 4.3 first with
$\beta_j \overset{iid}{\sim} \mathrm{Normal}(0, 100^2)$ (“Flat prior”) and then with
$\beta_j \mid \sigma^2, \tau^2 \overset{iid}{\sim} \mathrm{Normal}(0, \sigma^2\tau^2)$
with $\tau^2 \sim \mathrm{InvGamma}(0.1, 0.1)$ (“Gaussian shrinkage prior”), and the Bayesian
LASSO prior in Listing 4.4. For each of the three models we ran two MCMC
chains with 10,000 samples in the burn-in and 20,000 samples after burn-in.
Trace plots (not shown) showed excellent convergence and the effective sample
size exceeded 1,000 for all parameters and all models.
The results are fairly similar for the three priors (Figure 4.3). In all three
models, temperature, NPP, elevation, and single-family home are the most
important predictors, with the most richness estimated to occur in single-
family homes with low temperature, NPP, and elevation. In all three models,
the sample with largest fitted value (i.e., the posterior mean of Xβ) is a single-
family home with three bedrooms in Montpelier, VT, and the sample with the
smallest fitted value is a multiple-family home with two bedrooms in Tempe,
AZ.
Although the results are not that sensitive to the prior in this analysis,
there are some notable differences. For example, compared to the posterior
under the flat prior, the posterior of the slope for latitude (top right panel in

FIGURE 4.3
Regression analysis of the richness of a home’s microbiome.
[Figure not reproduced: a 3 × 3 grid of panels titled Sample locations, Longitude, Latitude, Temperature, Precipitation, NPP, Elevation, One-family house, and Number of bedrooms; the coefficient panels plot posterior density against β.]
The first panel shows the sample locations and the remaining panels plot the
posterior distributions of the regression coefficients, $\beta_j$. The three models are
distinguished by their priors for $\beta_j$: the flat prior is $\beta_j \sim \mathrm{Normal}(0, 100^2)$
(solid line), the Gaussian shrinkage prior is
$\beta_j \overset{iid}{\sim} \mathrm{Normal}(0, \sigma^2\tau^2)$ with
$\tau^2 \sim \mathrm{InvGamma}(0.1, 0.1)$ (dashed line), and the Bayesian LASSO is
$\beta_j \overset{iid}{\sim} \mathrm{DE}(0, \sigma^2\tau^2)$ with
$\tau^2 \sim \mathrm{InvGamma}(0.1, 0.1)$ (dotted line).
Figure 4.3) is shrunk towards zero by the Gaussian shrinkage model, and the
posterior density concentrates even more around the origin for the Bayesian
LASSO prior. However, it is not clear from this plot which of these three fits
is preferred; model comparison is discussed in Chapter 5.

4.3 Generalized linear models


Multiple linear regression assumes that the response variable is Gaussian and
thus the mean response can be any real number. Many analyses do not conform
to this assumption. For example, in Section 4.3.1 we analyze binary responses
with support {0, 1} and in Section 4.3.2 we analyze count data with support
{0, 1, 2, ...}. Clearly these data are not Gaussian and their mean cannot be
any real number because the mean must be between zero and one for binary
data and positive for count data. The generalized linear model (GLM) extends
linear regression concepts to handle these non-Gaussian outcomes. There is a
deep theory of GLMs (exponential families, canonical links, etc; see [54]), but
here we focus only on casting the GLM in the Bayesian framework through a
few examples.
The basic steps to selecting a GLM are (1) determine the support of the re-
sponse and select an appropriate parametric family and (2) link the covariates
to the parameters that define this family. As an example, consider the Gaus-
sian linear regression model in Section 4.2. If the support of the response is
(−∞, ∞), a natural parametric family is the Gaussian distribution. Of course,
there are other families with this support and the fit of the Gaussian family
to the data should be verified empirically. Once the Gaussian family has been
selected, the covariates must be linked to one or both of the parameters, the
mean or the variance. Let the linear combination of the covariates be
$$\eta_i = \sum_{j=1}^p X_{ij}\beta_j. \tag{4.21}$$

The linear predictor ηi can take any value in (−∞, ∞) depending on Xij .
Therefore, to link the covariates with the mean we can simply set E(Yi ) = ηi
as in standard linear regression. To complete the standard model we elect not
to link the covariates with the variance, and simply set V(Yi ) = σ 2 for all i.
The function that links the linear predictor with a parameter is called the
link function. Say that the parameter in the likelihood for the response is θi
(e.g., E(Yi ) = θi or V(Yi ) = θi ), then the link function g is

g(θi ) = ηi . (4.22)

The link function must be an invertible function that is well-defined for all
permissible values of the parameter. For example, in the Gaussian case the

Listing 4.7
Model statements for several GLMs in JAGS.

# (a) Logistic regression
for(i in 1:n){
  Y[i] ~ dbern(q[i])
  logit(q[i]) <- inprod(X[i,],beta[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}

# (b) Probit regression
for(i in 1:n){
  Y[i] ~ dbern(q[i])
  probit(q[i]) <- inprod(X[i,],beta[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}

# (c) Poisson regression
for(i in 1:n){
  Y[i] ~ dpois(lambda[i])
  log(lambda[i]) <- inprod(X[i,],beta[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}

# (d) Negative binomial regression
for(i in 1:n){
  Y[i] ~ dnegbin(q[i],m)
  q[i] <- m/(m + lambda[i])
  log(lambda[i]) <- inprod(X[i,],beta[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}
m ~ dgamma(0.1,0.1)

# (e) Zero-inflated Poisson
# theta[i] is the probability that Z[i] = 1; it is named theta so that it
# does not clash with p, the number of covariates used as the loop bound
for(i in 1:n){
  Y[i] ~ dpois(q[i])
  q[i] <- Z[i]*lambda[i]
  Z[i] ~ dbern(theta[i])
  log(lambda[i]) <- inprod(X[i,],beta[])
  logit(theta[i]) <- inprod(X[i,],alpha[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}
for(j in 1:p){alpha[j] ~ dnorm(0,0.01)}

# (f) Beta regression
for(i in 1:n){
  Y[i] ~ dbeta(r*q[i],r*(1-q[i]))
  logit(q[i]) <- inprod(X[i,],beta[])
}
for(j in 1:p){beta[j] ~ dnorm(0,0.01)}
r ~ dgamma(0.1,0.1)

link function for the mean is the identity function g(x) = x so that the mean
can be any real number. To link the covariates to the variance we must ensure
that the variance is positive, and so natural-log function g(x) = log(x) is more
appropriate. Link functions are not unique, and must be selected by the user.
For example, the link function for the mean could be replaced by g(x) = x3
and the link function for the variance could be replaced with g(x) = log10 (x).
Bayesian fitting of a GLM requires selecting the prior and computing the
posterior distribution. The priors for the regression coefficients discussed for
Gaussian data in Section 4.2 can be applied for GLMs. The posterior
distributions for GLMs are usually too complicated to derive in closed form,
so we cannot prove that, say, a particular prior distribution leads to a
Student-t posterior distribution with interpretable mean and covariance.
However, much of the intuition developed with Gaussian linear models carries
over to GLMs.
With non-Gaussian responses the full conditional distributions for the βj
are usually not conjugate and Metropolis sampling must be used. Maximum
likelihood estimates can be used as initial values and the corresponding
standard errors can suggest appropriate candidate distributions. In the examples
in this chapter we use JAGS to carry out the MCMC sampling. R also has
dedicated functions for Bayesian GLMs, such as the MCMClogit function in the
MCMCpack package for logistic regression (Section 4.3.1), that are likely faster
than JAGS. With flat priors and a large sample it is even more efficient to
invoke the Bayesian central limit theorem (Section 3.1.3) and approximate the
posterior as Gaussian (e.g.,
using the glm function in R) and avoid MCMC altogether.
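
As a rough illustration of the Metropolis sampling mentioned above, the following Python sketch runs a random-walk Metropolis sampler for a logistic regression with Normal(0, 100) priors. The simulated data, the step size of 0.5, the one-coefficient-at-a-time update, and all function names are assumptions made for this example; it is not the algorithm JAGS uses internally.

```python
import math
import random

def log_posterior(beta, X, Y, prior_var=100.0):
    """Log posterior (up to a constant) for Bernoulli data with a logit link."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_var)  # Normal(0, prior_var) priors
    for x_row, y in zip(X, Y):
        eta = sum(x * b for x, b in zip(x_row, beta))
        # Bernoulli log likelihood y*eta - log(1 + exp(eta)), written stably
        lp += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def metropolis(X, Y, n_iter=4000, step=0.5, seed=1):
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, Y)
    draws, accepted = [], 0
    for _ in range(n_iter):
        j = rng.randrange(len(beta))       # update one coefficient per iteration
        cand = list(beta)
        cand[j] += rng.gauss(0.0, step)    # symmetric random-walk candidate
        cand_lp = log_posterior(cand, X, Y)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, cand_lp - current)):
            beta, current, accepted = cand, cand_lp, accepted + 1
        draws.append(list(beta))
    return draws, accepted / n_iter

# Simulated data with a hypothetical intercept of -0.5 and slope of 1.0
data_rng = random.Random(0)
X = [[1.0, data_rng.gauss(0.0, 1.0)] for _ in range(200)]
Y = [1 if data_rng.random() < 1.0 / (1.0 + math.exp(0.5 - x[1])) else 0 for x in X]

draws, acc_rate = metropolis(X, Y)
kept = draws[1000:]                        # discard burn-in
post_mean_slope = sum(d[1] for d in kept) / len(kept)
print(round(acc_rate, 2), round(post_mean_slope, 2))
```

In practice the step size would be tuned, for example using the maximum likelihood standard errors as suggested above.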

4.3.1 Binary data


Binary outcomes Yi ∈ {0, 1} occur frequently when the result of an experiment
is recorded only as an indicator of a success or failure. For example, in Section
1.2 the response was the binary indicator that a test for HIV was positive. Bi-
nary variables must follow a Bernoulli distribution. The Bernoulli distribution
has one parameter, the success probability Prob(Yi = 1) = qi ∈ [0, 1]. Since
the parameter is a probability, the link function must map probabilities in
(0, 1) to the whole real line (−∞, ∞). Below we discuss two such link
functions: logistic and probit.
Logistic regression: The logistic link function is

\[ g(q) = \mathrm{logit}(q) = \log\left(\frac{q}{1-q}\right). \tag{4.23} \]

This link function converts the event probability q first to the odds of the
event, q/(1 − q) > 0, and then to the log odds, which can be any real number.
The logistic regression model is written

\[ Y_i \overset{indep}{\sim} \mathrm{Bernoulli}(q_i) \quad\text{and}\quad \mathrm{logit}(q_i) = \eta_i = \sum_{j=1}^{J} X_{ij}\beta_j. \tag{4.24} \]

The inverse logistic function is g^{-1}(x) = exp(x)/[1 + exp(x)] and so the
model can also be expressed as Yi ∼ Bernoulli(exp(ηi)/[1 + exp(ηi)]),
independently over i. JAGS code for this model is in Listing 4.7a.
Because the log odds of the event that Yi = 1 are linear in the covariates,
βj is interpreted as the increase in the log odds corresponding to an increase of
one in Xj with all other covariates held fixed. Similarly, with all other covariates
held fixed, increasing Xj by one multiplies the odds by exp(βj ). Therefore if
βj = 2.3, increasing Xj by one multiplies the odds by roughly ten and if βj = −2.3,
increasing Xj by one divides the odds by roughly ten. This interpretation is convenient
for communicating the results and specifying priors. For example, if a change
of one in the covariate is deemed a large change, then a standard normal prior
may have sufficient spread to represent an uninformative prior.
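
The odds interpretation can be checked numerically. The short Python sketch below uses a hypothetical baseline linear predictor of −1 and the value βj = 2.3 from the text; the function names are illustrative only.

```python
import math

def inv_logit(eta):
    """Inverse logistic link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(q):
    return q / (1.0 - q)

beta_j = 2.3
eta_before = -1.0                       # hypothetical baseline linear predictor
q_before = inv_logit(eta_before)
q_after = inv_logit(eta_before + beta_j)  # X_j increased by one

# The ratio of odds equals exp(beta_j) regardless of the baseline
odds_ratio = odds(q_after) / odds(q_before)
print(odds_ratio, math.exp(beta_j))
```

The ratio does not depend on the baseline value, which is why a single coefficient summarizes the effect on the odds scale.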
Probit regression: There are many possible link functions from (0, 1)
to (−∞, ∞). In fact, the quantile function (inverse CDF) of any continuous
random variable with support (−∞, ∞) would suffice. The link function in
logistic regression is the quantile function of the logistic distribution. In probit
regression, the link function is the Gaussian quantile function,

\[ Y_i \overset{indep}{\sim} \mathrm{Bernoulli}(q_i) \quad\text{and}\quad q_i = \Phi(\eta_i), \tag{4.25} \]

where Φ is the standard normal CDF (Listing 4.7b).


Unfortunately, the regression coefficients βj in probit regression do not
have a nice interpretation such as their log-odds interpretation in logistic
regression. However, probit regression is useful because it leads to Gibbs
sampling and can be used to model dependence between binary variables. Probit
regression is equivalent to specifying independent latent variables
Zi ∼ Normal(ηi , σ²) and assuming that observing Yi = 1 indicates that Zi
exceeds the threshold z. For example, Zi might represent a patient's blood
pressure and Yi is the corresponding binary indicator that the patient has high
blood pressure (defined as exceeding z). In this example, the probability that
the patient has high blood pressure is

\begin{align*}
\mathrm{Prob}(Y_i = 1) &= \mathrm{Prob}(Z_i > z) \\
&= \mathrm{Prob}[(Z_i - \eta_i)/\sigma > (z - \eta_i)/\sigma] \\
&= 1 - \Phi[(z - \eta_i)/\sigma] \\
&= \Phi[(\eta_i - z)/\sigma] \\
&= \Phi\left[\left(\sum_{j=1}^{J} X_{ij}\beta_j - z\right)/\sigma\right].
\end{align*}

Since we never observe the latent variables Zi we cannot estimate the
threshold z or the variance σ². The threshold z is coupled with the intercept
because adding the same constant to both the threshold and the intercept
does not affect the event probabilities qi . Therefore, the threshold is
typically set to z = 0. Similarly, multiplying both σ and the coefficients βj
by the same positive constant does not affect the event probabilities, and so
the variance is typically set to σ² = 1. This gives the usual probit regression
model qi = Φ(ηi ).
In this formulation, the regression coefficients β have conjugate full condi-
tional distributions conditioned on the latent Zi . This leads to Gibbs sampling
if the Zi are imputed at each MCMC iteration [3]. Also, dependence between
binary outcomes Yi can be induced by a multivariate normal model for the
latent variables Zi .
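
The reason the threshold and variance are fixed at z = 0 and σ² = 1 can be seen numerically. In the Python sketch below, the intercept, slope, covariate value, shift and scale are all hypothetical; the point is only that the event probability Φ[(ηi − z)/σ] is unchanged by the two transformations.

```python
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal CDF

def prob_event(eta, z=0.0, sigma=1.0):
    """Prob(Z > z) when Z ~ Normal(eta, sigma^2)."""
    return Phi((eta - z) / sigma)

intercept, slope, x = 0.3, -0.8, 1.5     # hypothetical values
eta = intercept + slope * x
base = prob_event(eta)

# Add the same constant to the threshold and the intercept
shift = 2.0
shifted = prob_event((intercept + shift) + slope * x, z=shift)

# Multiply sigma and all coefficients by the same positive constant
scale = 3.0
scaled = prob_event(scale * eta, sigma=scale)

print(base, shifted, scaled)
```

All three probabilities agree, so the data cannot distinguish these parameter settings.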

4.3.2 Count data


Random variables with support Yi ∈ {0, 1, 2, ...} often arise as the number of
events that occur in a time interval or spatial region. For example, in Section
2.1 we analyze the number of NFL concussions by season. In this chapter,
we will focus on modeling the mean as a function of the covariates. The link
between the linear predictor and the mean must ensure that E(Yi ) = λi ≥ 0.
A natural link function is the log link,

\[ \log(\lambda_i) = \sum_{j=1}^{p} X_{ij}\beta_j. \tag{4.26} \]

In this model, the mean is multiplied by exp(βj ) if Xj increases by one with
all other covariates remaining fixed. Unlike binary data, specifying the mean
does not completely determine the likelihood for count data. We discuss below
two families of distributions for the likelihood function: Poisson and negative
binomial.
Poisson regression: The Poisson regression model is

\[ Y_i \mid \lambda_i \overset{indep}{\sim} \mathrm{Poisson}(\lambda_i) \quad\text{where}\quad \lambda_i = \exp\left(\sum_{j=1}^{p} X_{ij}\beta_j\right). \tag{4.27} \]

As with logistic regression, the regression coefficients β1 , ..., βp can be given
the priors discussed in Section 4.2, and Metropolis sampling can be used to
explore the posterior (Listing 4.7c). A critical assumption of the Poisson model
is that the mean and variance are equal, that is,

E(Yi ) = V(Yi ) = λi . (4.28)

The distribution of a count is over-dispersed (under-dispersed) if its variance
is greater than (less than) its mean. If over-dispersion is present, then the
Poisson model is inappropriate and another model should be considered so
that the posterior accurately reflects all sources of variability.
Negative binomial regression: One approach to accommodating
over-dispersion is to incorporate iid gamma random variables
ei ∼ Gamma(m, m) with mean 1 and variance 1/m. Then

\[ Y_i \mid e_i, \lambda_i, m \overset{indep}{\sim} \mathrm{Poisson}(\lambda_i e_i). \tag{4.29} \]

Marginally over ei , Yi follows the negative binomial distribution,

Yi |λi , m ∼ NegBinomial(qi , m) (4.30)

with probability qi = m/(λi + m) and size m. The size m need not be an
integer, but if it is and we envision a sequence of independent Bernoulli trials
each with success probability qi , then Yi can be interpreted as the number
of failures that occur before the mth success. More important for handling
over-dispersion, E(Yi ) = λi and V(Yi ) = λi + λi²/m > λi . The parameter
m controls over-dispersion. If m is large, then ei ≈ 1 and the model reduces
to the Poisson regression model with V(Yi ) ≈ E(Yi ); if m is close to zero, then
ei has large variance and thus V(Yi ) is much larger than E(Yi ). The
over-dispersion parameter
can be given a gamma prior, as in Listing 4.7d.
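
The moment formulas for the negative binomial model can be verified by summing over its PMF. The Python sketch below assumes the parameterization in (4.30), with success probability q = m/(λ + m) and size m; the values λ = 4 and m = 2.5 are hypothetical.

```python
import math

def negbin_pmf(y, q, m):
    """P(Y = y) for the number of failures before the m-th success (m may be non-integer)."""
    log_p = (math.lgamma(y + m) - math.lgamma(m) - math.lgamma(y + 1)
             + m * math.log(q) + y * math.log(1.0 - q))
    return math.exp(log_p)

lam, m = 4.0, 2.5          # hypothetical mean and size
q = m / (lam + m)

ys = range(500)            # the tail beyond 500 is numerically negligible here
mean = sum(y * negbin_pmf(y, q, m) for y in ys)
var = sum((y - mean) ** 2 * negbin_pmf(y, q, m) for y in ys)

print(mean, lam)                  # mean matches lambda
print(var, lam + lam ** 2 / m)    # variance matches lambda + lambda^2/m
```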
Zero-inflated Poisson: Another common deviation from the Poisson dis-
tribution is an excess of zeros. For example, say that Yi is the number of fish
caught by visitor i to a state park. It may be that most of the Yi are zero
because only a small proportion (say p) of the visitors fished, but the distri-
bution of Yi for those that did fish is Poisson with mean λ. The probability
of an observation being zero is then 1 − p for the non-fishers plus p times the
Poisson probability at zero for the fishers. The PMF corresponding to this
scenario is

\[ f(y \mid p, \lambda) = \begin{cases} (1-p) + p\, f_P(0 \mid \lambda) & \text{if } y = 0 \\ p\, f_P(y \mid \lambda) & \text{if } y > 0 \end{cases} \tag{4.31} \]

where fP is the Poisson PMF. This is equivalent to the two-stage model

Yi |Zi , λ ∼ Poisson(Zi λ) and Zi ∼ Bernoulli(p)

where Zi is the latent indicator that visitor i fished and thus the mean of
Yi is zero for non-fishers with Zi = 0. In this scenario, we do not observe
the Zi , but the two-stage model is equivalent to (4.31) and uses only
standard distributions. As a result, the model can be coded in JAGS with
covariates included in the mass at zero and the Poisson rate as in Listing 4.7e.
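
The equivalence between the PMF in (4.31) and the two-stage model can be confirmed by summing over the latent indicator. The Python sketch below uses hypothetical values p = 0.3 and λ = 2, and treats a Poisson distribution with mean zero as a point mass at zero.

```python
import math

def poisson_pmf(y, lam):
    """Poisson PMF; a mean of zero is treated as a point mass at y = 0."""
    if lam == 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y, p, lam):
    """Zero-inflated Poisson PMF as written in (4.31)."""
    return (1.0 - p) * (1.0 if y == 0 else 0.0) + p * poisson_pmf(y, lam)

def two_stage_pmf(y, p, lam):
    """Marginal PMF of Y | Z ~ Poisson(Z * lam) with Z ~ Bernoulli(p)."""
    return (1.0 - p) * poisson_pmf(y, 0.0) + p * poisson_pmf(y, lam)

p, lam = 0.3, 2.0          # hypothetical fishing probability and Poisson mean
max_diff = max(abs(zip_pmf(y, p, lam) - two_stage_pmf(y, p, lam)) for y in range(50))
total = sum(zip_pmf(y, p, lam) for y in range(100))
print(max_diff, total, zip_pmf(0, p, lam))
```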

4.3.3 Example: Logistic regression for NBA clutch free throws

The table in Problem 17 of Section 1.6 gives the overall free-throw propor-
tion (qi ∈ [0, 1]), the number of made clutch shots (Yi ), and the number of
attempted clutch shots (ni ) for players i = 1, ..., 10. Figure 4.4 shows that
most players have similar success rates in clutch and non-clutch situations,
but some players appear to have less success in pressure situations. We fit
logistic regression models to formally explore this relationship.
[Figure 4.4. Left panel: clutch percentage versus overall percentage, with
points labeled by player initials. Right panel: posterior density of β for the
Model 1 intercept, the Model 1 slope, and the Model 2 intercept.]

FIGURE 4.4
Logistic regression analysis of NBA free throws. The first panel shows
the overall percentage of made free throws versus the percentage for clutch
shots only for each player (denoted by the player’s initials). The solid line
is the x = y line and the dashed line is the fitted value from Model 2. The
second panel is the posterior density for the slope and intercept from Model
1 and the intercept from Model 2.

The support of the response is Yi ∈ {0, 1, ..., ni } and so we select a
binomial likelihood Yi |pi ∼ Binomial(ni , pi ) where pi is the probability of player
i making a clutch shot. The number of shots ni is known, and so we link the
covariate to the success probability pi . The two models are
1. logit(pi ) = β1 + β2 Xi
2. logit(pi ) = β1 + Xi
where Xi = logit(qi ) is the log odds of making a regular free throw. If β1 = 0
and β2 = 1 in Model 1 or β1 = 0 in Model 2, then the clutch performance
equals the overall performance. To determine if this is the case, we fit the
model using JAGS with two chains with burn-in 10,000 and 20,000 additional
samples and with uninformative Normal(0, 100) priors for all parameters. The
model specification for Model 1 is
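# Model 1: binomial likelihood with a logit link
for(i in 1:10){
  Y[i] ~ dbinom(p[i],n[i])
  logit(p[i]) <- beta[1] + beta[2]*X[i]
}
# dnorm takes a precision, so 0.01 corresponds to a variance of 100
beta[1] ~ dnorm(0,0.01)
beta[2] ~ dnorm(0,0.01)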

TABLE 4.1
Beta regression of microbiome data. Posterior median and 95% intervals
for the regression coefficients βj and concentration parameter r.

                      Median    95% Interval
Intercept             -1.01     (-1.05, -0.96)
Longitude             -0.21     (-0.28, -0.14)
Latitude              -0.08     (-0.22,  0.05)
Temperature           -0.15     (-0.30, -0.01)
Precipitation          0.03     (-0.04,  0.11)
NPP                   -0.04     (-0.10,  0.02)
Elevation             -0.02     (-0.09,  0.05)
Single-family home     0.07     ( 0.02,  0.13)
Number of bedrooms    -0.08     (-0.13, -0.03)
r                      7.97     ( 7.34,  8.66)

The results are plotted in Figure 4.4. For Model 1, the slope is centered
squarely on one and the intercept is centered slightly below zero, but both
parameters have considerable uncertainty and so the model with pi = qi
remains plausible. However, the intercept in Model 2 is negative with posterior
probability 0.96, so there is some evidence that even the best players in the
NBA underperform in pressure situations. The fitted curve (dashed line) in
Figure 4.4 (left panel) is 1/[1+exp(−β̄1 −Xi )], where β̄1 is the posterior mean,
and shows that the clutch performance can be several percentage points lower
than the overall percentage.
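
The fitted curve for Model 2 in Section 4.3.3 can be sketched in a few lines of Python. The intercept value of −0.2 below is hypothetical and is not the posterior mean from the analysis; the sketch only shows how a negative intercept on the logit scale translates into a lower clutch percentage.

```python
import math

def logit(q):
    return math.log(q / (1.0 - q))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def clutch_prob(q_overall, beta1):
    """Model 2 fitted value: 1 / [1 + exp(-beta1 - X)] with X = logit(q_overall)."""
    return inv_logit(beta1 + logit(q_overall))

beta1_bar = -0.2   # hypothetical intercept, for illustration only
for q in (0.70, 0.80, 0.90):
    print(q, round(clutch_prob(q, beta1_bar), 3))
```

With an intercept of zero the function returns the overall proportion unchanged, which corresponds to the solid x = y line in Figure 4.4.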

4.3.4 Example: Beta regression for microbiome data


In Section 4.2.5 we regressed the diversity of a sample’s microbiome onto
features of the home. Diversity was measured as the log of the number of the
L = 763 species present in the sample. However, this measure fails to account
for the relative abundance of the species. Let Ail ≥ 0 be the abundance of
species l in sample i. Another measure of the diversity is the proportion of
the total abundance attributed to the most abundant species in sample i,

\[ Y_i = \frac{\max\{A_{i1}, ..., A_{iL}\}}{\sum_{l=1}^{L} A_{il}} \in [0, 1]. \tag{4.32} \]

This measure is plotted in Figure 4.5 (top left).


Since Yi is between 0 and 1, Gaussian linear regression is inappropriate.
One option is to transform the response from support [0, 1] to (−∞, ∞), e.g.,
Yi∗ = logit(Yi ), and model the transformed data Yi∗ using a Gaussian linear
model. Another option is to model the data directly using a non-Gaussian
model. A natural model for a continuous variable with support [0, 1] is the
beta distribution. Since the mean must also be in [0, 1], a logistic link can be
used to relate the covariates to the mean response. Therefore, we fit the beta
regression model

\[ Y_i \mid \boldsymbol{\beta}, r \sim \mathrm{Beta}[r q_i, r(1 - q_i)] \quad\text{and}\quad \mathrm{logit}(q_i) = \mathbf{X}_i^T \boldsymbol{\beta}. \tag{4.33} \]

Yi |β, r has mean qi and variance qi (1 − qi )/(r + 1), and so r > 0 determines
the concentration of the beta distribution around the mean qi .
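
The mean and variance stated above follow from the standard moments of a Beta(a, b) distribution with a = rq and b = r(1 − q). The Python sketch below checks this numerically; as an illustration it plugs in the posterior medians of the intercept and r from Table 4.1 with all standardized covariates set to zero, which is an assumption made only for this example.

```python
import math

def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

# Illustrative plug-in values: posterior medians of the intercept and r
q = 1.0 / (1.0 + math.exp(1.01))   # inverse logit of -1.01
r = 7.97

mean, var = beta_moments(r * q, r * (1.0 - q))
print(mean, q)                         # mean equals q
print(var, q * (1.0 - q) / (r + 1.0))  # variance equals q(1-q)/(r+1)
```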
Using priors βj ∼ Normal(0, 100) and r ∼ Gamma(0.1, 0.1), we fit this
model in JAGS using the code in Listing 4.7f (although sampling would likely
be faster using the betareg package in R). Before fitting the model, the co-
variates are standardized to have mean zero and variance one. Convergence
was excellent for all parameters (the bottom left panel of Figure 4.5 shows the
trace plot for r). The posterior distributions in Table 4.1 indicate that there
is more diversity (smaller Yi ) on average in homes in warm regions in the east,
and in multiple-family homes with many bedrooms.

4.4 Random effects


The standard linear regression model assumes the same regression model ap-
plies to all observations. This assumption is tenuous if data are collected in
groups. For example, Figure 4.6 plots the jaw bone density measurements of
n = 20 children measured over the course of m = 4 visits. These 20 children
represent only a random sample from a much larger population, but we can
use these samples to make an inference about the larger population. If we let
Yij be the jth measurement for patient i, then (ignoring age as in the top right
panel of Figure 4.6) the one-way random effects model is

\[ Y_{ij} \mid \alpha_i \overset{indep}{\sim} \mathrm{Normal}(\alpha_i, \sigma^2) \quad\text{where}\quad \alpha_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \tau^2). \tag{4.34} \]

The random effect αi is the true mean for patient i, and the observations for
patient i vary around αi with variance σ 2 . The αi are called random effects
because if we repeated the experiment with a new sample of 20 children the
αi would change. In this model, the population of patient-specific means is
assumed to follow a normal distribution with mean µ and variance τ 2 . The
overall mean µ is a fixed effect because if we repeated the experiment with
a new sample of 20 children from the same population it would not change.
A linear model with both fixed and random effects is called a linear mixed
model.
A Bayesian analysis of a random effects model requires priors for the
population parameters. For example, Listing 4.8 provides JAGS code with
conjugate priors µ ∼ Normal(0, 100²) and τ² ∼ InvGamma(0.1, 0.1). The same
algorithms (e.g., Gibbs sampling) can be used for random effects models as
for the other models we have considered. In fact, computationally there is
no need to distinguish between fixed and random effects (which can lead to
confusion, [42]). Conceptually, however, fixed and random effects are distinct
because fixed effects describe the population and a random effect describes an
individual from the population. A common source of confusion is that fixed
effects (such as µ in Listing 4.8) are treated as random variables in a Bayesian
analysis. However, as with all parameters, the prior and posterior
distributions for fixed effects reflect subjective uncertainty about the true
values of these fixed but unknown parameters.

Listing 4.8
The one-way random effects model in JAGS.
# Likelihood
for(i in 1:n){for(j in 1:m){
  Y[i,j] ~ dnorm(alpha[i],sig2_inv)
}}

# Random effects
for(i in 1:n){alpha[i] ~ dnorm(mu,tau2_inv)}

# Priors (dnorm takes a precision; gamma priors on the precisions
# correspond to inverse gamma priors on the variances)
mu ~ dnorm(0,0.0001)
sig2_inv ~ dgamma(0.1,0.1)
tau2_inv ~ dgamma(0.1,0.1)

[Figure 4.5. Top left: histogram (frequency) of the proportion for the most
abundant OTU. Top right: proportion for the most abundant OTU versus
longitude. Bottom left: trace plot of r by iteration. Bottom right: fitted
densities versus y for Summerville, SC; Greensburg, PA; and Junction City, CA.]

FIGURE 4.5
Beta regression for microbiome data. The top left panel shows the
histogram of the observed proportions of abundance allocated to the most
abundant OTU, and the top right panel plots this variable against the
sample's longitude. The second row gives the trace plots (the two chains are
different shades of gray) of the concentration parameter, r, and the fitted
Beta[r̂q̂, r̂(1 − q̂)] density for three samples evaluated at the posterior mean
for all parameters.

[Figure 4.6. Top left: bone density versus age. Top right: bone density by
patient. Bottom left: posterior density of the proportion of variance explained
by the random effect. Bottom right: posterior density of µ for the random
effects and IID models.]

FIGURE 4.6
One-way random effect analysis of the jaw data. The dots in the top
left panel show bone density at the four visits for each patient (connected by
lines), the top right panel compares the observations (dots) and the posterior
distribution of the subject random effect αi (boxplot), the bottom left panel
plots the posterior of the variance ratio τ²/(τ² + σ²), and the final panel
compares the posterior density of the mean µ from the random effects model
versus the independence model in which the Yij are iid Normal(µ, σ²).

Unlike analyses such as linear regression where the main focus is on the
mean and prediction of new observations, in random effects models the vari-
ance components (e.g., σ 2 and τ 2 ) are often the main focus of the analysis.
Therefore it is important to scrutinize the priors used for these parameters.
The inverse gamma prior is conjugate for the variance parameters which often
leads to simple Gibbs updates. However, as shown in Figure 4.7, the inverse
gamma prior with small shape parameter for a variance induces a prior for
the standard deviation with prior PDF equal to zero at the origin. This ex-
cludes the possibility that there is no random effect variance, which may not
be suitable in many problems.
As an alternative, [29] endorses a half-Cauchy (HC) prior for the standard
deviation. The HC distribution is the Student-t distribution with one degree
of freedom restricted to be positive and has a flat PDF at the origin (Figure
4.7), which is usually a more accurate expression of prior belief. Listing 4.9
gives JAGS code for this prior. In this code the HC distribution is assigned
directly to the standard deviations which breaks the conjugacy relationships
for the variance components (this is easily handled by JAGS); [29] shows that
conjugacy can be restored using a two-stage model. This code assumes the HC
scale parameter is fixed at one. Because the Cauchy prior has a very heavy
tail, this gives a 0.99 prior quantile of approximately 63.7. Despite this wide prior range,
the scale of the HC prior should be adjusted to the scale of the data.
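
The heavy tail of the half-Cauchy prior can be quantified from its CDF, F(x) = (2/π) arctan(x/scale), which inverts in closed form. The Python sketch below computes the 0.99 quantile for a scale of one; the function name is illustrative.

```python
import math

def half_cauchy_quantile(u, scale=1.0):
    """Inverse of the half-Cauchy CDF F(x) = (2/pi) * atan(x / scale)."""
    return scale * math.tan(math.pi * u / 2.0)

median = half_cauchy_quantile(0.5)    # equals the scale parameter
q99 = half_cauchy_quantile(0.99)      # roughly 63.7 when the scale is one
print(median, q99)
```

Because the quantiles are proportional to the scale, rescaling the prior to match the scale of the data rescales its entire range.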
Listing 4.9
The one-way random effects model with half-Cauchy priors.
# Likelihood
for(i in 1:n){for(j in 1:m){
  Y[i,j] ~ dnorm(alpha[i],sig2_inv)
}}

# Random effects
for(i in 1:n){alpha[i] ~ dnorm(mu,tau2_inv)}

# Priors
mu ~ dnorm(0,0.0001)
sig2_inv <- pow(sigma1,-2)
tau2_inv <- pow(sigma2,-2)
sigma1 ~ dt(0, 1, 1)T(0,) # Half-Cauchy priors with
sigma2 ~ dt(0, 1, 1)T(0,) # location 0 and scale 1

[Figure 4.7. Two panels plotting prior density versus σ for the half-Cauchy
prior and the priors induced by InvGamma priors with a = 1, a = 0.5, and
a = 0.1. The left panel covers σ from 0 to 0.3 and the right panel covers σ
from 0 to 3.]

FIGURE 4.7
Priors for a standard deviation. The half-Cauchy prior for σ and the prior
induced for σ by inverse gamma priors on σ² with different shape parameters.
All priors are scaled to have median equal to 1. The two panels differ only by
the range of σ being plotted.

Random effects induce correlation between observations from the same
group. In the one-way random effects model, the covariance marginally over
the random effect αi is

\[ \mathrm{Cov}(Y_{ij}, Y_{uv}) = \begin{cases} \tau^2 + \sigma^2 & i = u \text{ and } j = v \\ \tau^2 & i = u \text{ and } j \neq v \\ 0 & i \neq u. \end{cases} \tag{4.35} \]

The variance of each observation is τ² + σ² and includes both the variance
in the random effect (τ²) and variability around the mean (σ²). Since two
observations from the same patient share a common random effect, they have
covariance τ² and thus correlation τ²/(τ² + σ²). Observations from different
patients have no common source of variability and are thus uncorrelated.
Returning to the bone density data, Figure 4.6 plots the posterior approxi-
mated with two chains each with 30,000 samples and the first 10,000 discarded
as burn-in assuming the model in Listing 4.8. The posterior distribution of
the proportion of the variance attributed to the random effect τ 2 /(τ 2 + σ 2 )
ranges from 0.1 to 0.5 (bottom left panel). Accounting for this correlation
between observations from the same patient affects the posterior of the fixed
effect, µ (bottom right). The posterior variance of µ is larger for the random
effects model compared to the posterior of µ for the independence model in
which the Yij are iid Normal(µ, σ²). This is expected because there is less
information about the mean in 4 repeated measurements for 20 patients than
80 measurements from 80 different patients.
To test for prior sensitivity, we refit the model using a half-Cauchy prior
in Listing 4.9. In this case the prior for the variance components had little
effect on the results. The 95% posterior credible sets for σ and τ are (1.24, 1.78)
and (1.75, 3.51) respectively using inverse gamma priors compared to (1.24,
1.77) and (1.72, 3.45) for the half-Cauchy priors.
Random slopes model: The one-way random effect model that ignores
age is naive because bone density clearly increases with age (top left panel of
Figure 4.6). Adding a time trend to the model accounts for this,

\[ Y_{ij} \mid \alpha_i \overset{indep}{\sim} \mathrm{Normal}(\alpha_i + X_j\beta, \sigma^2) \quad\text{where}\quad \alpha_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \tau^2), \tag{4.36} \]

Xj is the child's age at visit j, and β is the fixed age trend.
This model assumes that each child has a different intercept but the same
slope. However, Figure 4.6 indicates that the rate of increase over time varies
by patient. Therefore we could model the patient-specific slopes as random
effects

\[ Y_{ij} \mid \boldsymbol{\alpha}_i \overset{indep}{\sim} \mathrm{Normal}(\alpha_{i1} + \alpha_{i2} X_j, \sigma^2) \quad\text{where}\quad \boldsymbol{\alpha}_i \overset{iid}{\sim} \mathrm{Normal}(\boldsymbol{\beta}, \Omega), \tag{4.37} \]

and the random intercept and slope for patient i is αi = (αi1 , αi2 )T . The mean
vector β includes the population mean intercept and slope, and is thus a fixed
effect. The 2 × 2 population covariance matrix Ω determines the variation
of the random effects over the population. To complete the Bayesian model
we specify priors σ² ∼ InvGamma(0.1, 0.1), β ∼ Normal(0, 100² I2 ), and
Ω ∼ InvWishart(2.1, I2 /2.1). The inverse Wishart prior for the covariance
matrix has prior mean I2 , the 2 × 2 identity matrix (see Appendix A.1). JAGS
code for this random-slopes model is in Listing 4.10.

Listing 4.10
Random slopes model in JAGS.
# Likelihood (tau is the error precision, 1/sigma^2)
for(i in 1:n){for(j in 1:m){
  Y[i,j] ~ dnorm(alpha[i,1]+alpha[i,2]*age[j],tau)
}}

# Random effects
for(i in 1:n){
  alpha[i,1:2] ~ dmnorm(beta[1:2],Omega_inv[1:2,1:2])
}

# Priors
tau ~ dgamma(0.1,0.1)
for(j in 1:2){beta[j] ~ dnorm(0,0.0001)}
Omega_inv[1:2,1:2] ~ dwish(R[,],2.1)

R[1,1]<-1/2.1
R[1,2]<-0
R[2,1]<-0
R[2,2]<-1/2.1

The posterior mean of the population covariance matrix is

\[ E(\Omega \mid \mathbf{Y}) = \begin{pmatrix} 91.78 & -10.14 \\ -10.14 & 1.23 \end{pmatrix} \tag{4.38} \]

and the posterior 95% interval for the correlation
Cor(αi1 , αi2 ) = Ω12 /√(Ω11 Ω22 ) is (-0.98, -0.89). Therefore there is a strong
negative dependence between the intercept and slope, indicating that bone
density increases
rapidly for children with low bone density at age 8, and vice versa.
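
As a quick plausibility check, the correlation implied by the posterior mean covariance matrix in (4.38) can be computed directly. This plug-in value is not the same as the posterior mean or median of the correlation (which would be computed from the MCMC samples of Ω), but it should be of similar size; the Python sketch below makes that comparison.

```python
import math

# Posterior mean of the 2 x 2 covariance matrix from (4.38)
omega = [[91.78, -10.14],
         [-10.14, 1.23]]

# Correlation implied by plugging the posterior mean matrix into Cor = O12 / sqrt(O11 * O22)
corr = omega[0][1] / math.sqrt(omega[0][0] * omega[1][1])
print(round(corr, 3))
```

The plug-in value falls inside the reported 95% interval of (-0.98, -0.89).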
Figure 4.8 plots the posterior distribution of the fitted values αi1 + αi2 X
for X between 8 and 10 years for three patients. For each patient and each age,
we compute the 95% interval using the quantiles of the S posterior samples
αi1^(s) + αi2^(s) X. We also plot the posterior predictive distribution (PPD) for
the measured bone density at age 10. The PPD is approximated by sampling
Yi*^(s) ∼ Normal(αi1^(s) + 10 αi2^(s) , σ²^(s) ) at each iteration and then computing
the quantiles of the S predictions. The PPD accounts for both uncertainty in
the patient’s random effect αi and measurement error with variance σ 2 . The
intervals in Figure 4.8 suggest that uncertainty in the random effects is the
dominant source of variation.
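
The PPD approximation described above can be sketched in Python. The posterior samples below are synthetic stand-ins drawn from arbitrary distributions (they are not output from the jaw data analysis and ignore the posterior correlation between intercept and slope); in a real analysis they would be replaced by the S MCMC draws of αi1, αi2 and σ².

```python
import random

def interval(draws, level=0.95):
    """Equal-tailed interval from the empirical quantiles of the draws."""
    s = sorted(draws)
    tail = (1.0 - level) / 2.0
    lo = s[int(round(tail * (len(s) - 1)))]
    hi = s[int(round((1.0 - tail) * (len(s) - 1)))]
    return lo, hi

rng = random.Random(42)
S = 4000
# Synthetic stand-ins for posterior samples of one patient's parameters
alpha1 = [rng.gauss(33.0, 1.0) for _ in range(S)]
alpha2 = [rng.gauss(2.0, 0.1) for _ in range(S)]
sigma2 = [rng.gammavariate(20.0, 0.1) for _ in range(S)]   # mean 2

age = 10.0
# Fitted value at each iteration: uncertainty in the random effects only
fitted = [a1 + a2 * age for a1, a2 in zip(alpha1, alpha2)]
# PPD draw at each iteration: adds measurement error with variance sigma2
ppd = [rng.gauss(f, s2 ** 0.5) for f, s2 in zip(fitted, sigma2)]

fit_lo, fit_hi = interval(fitted)
ppd_lo, ppd_hi = interval(ppd)
print((fit_lo, fit_hi), (ppd_lo, ppd_hi))
```

The PPD interval is wider than the interval for the fitted value because it adds measurement error on top of parameter uncertainty.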
[Figure 4.8. Bone density versus age (8 to 10 years) for Patient 1, Patient 2,
and Patient 3.]

FIGURE 4.8
Mixed effects analysis of the jaw data. The observed bone density
(points) for three subjects versus the posterior median (solid lines) and 95%
intervals (dashed lines) of the fitted value αi1 + αi2 X for X ranging from 8 to
10 years, and 95% credible intervals (vertical lines at Age=10) of the posterior
predictive distributions for the measured response at age X = 10.

Marginal models: Inducing correlation by conditioning on random effects
is equivalent to a marginal model (Section 4.5.4) that does not include random
effects but directly specifies correlation between observations in the same
group. For example, the one-way random effects model

\[ Y_{ij} \mid \alpha_i \overset{indep}{\sim} \mathrm{Normal}(\alpha_i, \sigma^2) \quad\text{where}\quad \alpha_i \overset{iid}{\sim} \mathrm{Normal}(\mu, \tau^2) \tag{4.39} \]

is equivalent to the marginal model

Yi ∼ Normal(µ, Σ), (4.40)

where Yi = (Yi1 , ..., Yim )T is the data vector for group i, µ = (µ, ..., µ)T is the
mean vector, and Σ is the covariance matrix with τ 2 + σ 2 on the diagonals
and τ 2 elsewhere. An advantage of the marginal approach is that we no longer
have to estimate the random effects αi ; a disadvantage is that for large data
sets and complex correlation structure the mean vector and especially the
covariance matrix can be large which slows computation.
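
The equivalence of the hierarchical and marginal forms can be checked by simulation. The Python sketch below builds the covariance matrix Σ described above and compares it with the empirical covariance of data generated from the one-way random effects model; the values of m, µ, τ² and σ² are hypothetical.

```python
import random

def marginal_cov(m, tau2, sig2):
    """Covariance matrix with tau2 + sig2 on the diagonal and tau2 elsewhere."""
    return [[tau2 + (sig2 if j == k else 0.0) for k in range(m)] for j in range(m)]

def simulate_groups(n_groups, m, mu, tau2, sig2, seed=0):
    """Draw data from the hierarchical form: alpha_i first, then Y_ij given alpha_i."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_groups):
        alpha = rng.gauss(mu, tau2 ** 0.5)
        data.append([rng.gauss(alpha, sig2 ** 0.5) for _ in range(m)])
    return data

def sample_cov(data, j, k):
    """Empirical covariance between columns j and k."""
    n = len(data)
    mean_j = sum(row[j] for row in data) / n
    mean_k = sum(row[k] for row in data) / n
    return sum((row[j] - mean_j) * (row[k] - mean_k) for row in data) / (n - 1)

m, mu, tau2, sig2 = 4, 50.0, 4.0, 2.0     # hypothetical values
Sigma = marginal_cov(m, tau2, sig2)
Y = simulate_groups(20000, m, mu, tau2, sig2)

emp_var = sample_cov(Y, 0, 0)   # should be near tau2 + sig2
emp_cov = sample_cov(Y, 0, 1)   # should be near tau2
print(Sigma[0][0], emp_var, Sigma[0][1], emp_cov)
```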
In the hierarchical representation, the elements of Yi are independent and
identically distributed given αi ; in the marginal model the m observations
from group i are no longer independent, but they remain exchangeable, i.e.,
their distribution is invariant to permuting their order. The concept of ex-
changeability plays a fundamental role in constructing hierarchical models.
The representation theorem by Bruno de Finetti states that any infinite se-
quence of exchangeable variables can be written as independent and identically
distributed conditioned on some latent distribution. Therefore, this important
type of dependent data can be modeled using a simpler hierarchical model.

4.5 Flexible linear models


As discussed in Section 4.2, the multiple linear regression model of response
Yi onto covariates Xi1 , ..., Xip is

\[ Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i, \tag{4.41} \]

where β = (β1 , ..., βp )T are the regression coefficients and the errors εi are
iid Normal(0, σ²). This model makes four key assumptions:
(1) Linearity: The mean of Yi |Xi is linear in Xi
(2) Equal variance: The residual variance (σ 2 ) is the same for all i
(3) Normality: The errors εi are Gaussian
(4) Independence: The errors εi are independent
In real analyses, most if not all of these assumptions will be violated to some
extent. Minor violations will not invalidate statistical inference, but glaring
model misspecifications should be addressed. This chapter provides Bayesian
remedies to model misspecification (each subsection addresses one of the four
assumptions above). A strength of the Bayesian paradigm is that these models
can be fit by simply adding a few lines of JAGS code and do not require
fundamentally new theory or algorithms.

4.5.1 Nonparametric regression


The linearity assumption can be relaxed by using a general expression of the
regression of Yi onto Xi ,

\[ Y_i = g(\mathbf{X}_i) + \varepsilon_i = g(X_{i1}, ..., X_{ip}) + \varepsilon_i \tag{4.42} \]

where g is the mean function and the εi are iid Normal(0, σ²). Parametric regression
specifies the mean function g as a parametric function of a finite number of
parameters, e.g., in linear regression g(Xi ) = Xi β. A linear mean function is
often a sufficient and interpretable first-order approximation, but more com-
plex relationships between variables can be fit using a more flexible model.
For example, consider the mcycle data (available in the MASS R package) plotted in Figure
4.9. The predictor, Xi , is the time since a motorcycle makes impact (scaled
to the unit interval) and the response, Yi , is the acceleration of a monitor on
the head of a crash test dummy. Clearly a linear model will not fit these data
well: the mean is flat for the first quarter of the experiment, then dramatically
dips, rebounds, and then plateaus until the end of the experiment.
A fully nonparametric model allows for any continuous function $g$. A model
this flexible requires infinitely many parameters. We will focus on
semiparametric models that specify the mean function in terms of a finite number
of parameters in a way that increasing the number of parameters can approximate
any function $g$. For example, if there is only $p = 1$ covariate ($X$) we
could fit a $J$th order polynomial function
$$g(X) = \sum_{j=0}^{J} X^j\beta_j. \tag{4.43}$$
This model has $J + 1$ parameters and by increasing $J$ the polynomial function
can approximate any continuous $g$.

FIGURE 4.9
Nonparametric regression for the motorcycle data. Panel (a) plots the
time since impact (scaled to be between 0 and 1) and the acceleration (g) along
with the posterior median and 95% interval for the mean function from the
homoskedastic fit; Panel (b) shows the $J = 10$ spline basis functions $B_j(X)$;
Panels (c) and (d) show the posterior median and 95% intervals for the mean
and variance functions from the heteroskedastic model.
[Figure: four panels plotted against time. (a) Fitted mean trend, homoskedastic
model; (b) Spline basis functions; (c) Fitted mean trend, heteroskedastic model;
(d) Variance, heteroskedastic model.]

There are many Bayesian semiparametric/nonparametric regression models,
including Gaussian process regression [67], Bayesian additive regression
trees [18], neural networks [61] and regression splines [19]. The simplest
approach is arguably regression splines. In spline regression we construct
nonlinear functions of the original covariates, and use these constructed
covariates as the predictors in multiple linear regression. Denote the $J$
constructed covariates as $B_1(X), ..., B_J(X)$. In polynomial regression
$B_j(X) = X^j$, but there are many other choices. For example, Figure 4.9b plots
$J = 10$ B-spline basis functions. These functions are appealing because they are
smooth and local, i.e., non-zero only for some values of $X$. The model is simply
the multiple linear regression model (Section 4.2) with the $B_1(X), ..., B_J(X)$
as the covariates,
$$Y_i \sim \text{Normal}[g(X_i), \sigma^2] \quad\text{and}\quad g(X_i) = \beta_0 + \sum_{j=1}^{J} B_j(X_i)\beta_j.$$

Note that each basis function in Figure 4.9b has Bj (0) = 0, and so an in-
tercept (β0 ) is required. By increasing J, any smooth mean function can be
approximated as a linear combination of the B-spline basis functions.
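The basis expansion can be sketched in a few lines of R. The example below is only an illustration under stated assumptions: the data are simulated, the value of $\tau^2$ is fixed by hand rather than given a prior, and the intercept is given the same Normal(0, $\tau^2\sigma^2$) prior as the other coefficients. Under those assumptions the posterior mean of the coefficients given $\tau^2$ has the closed form $(X^TX + I/\tau^2)^{-1}X^TY$, which is what the code computes.

```r
library(splines)

set.seed(1)
n <- 200
x <- sort(runif(n))                        # simulated covariate on the unit interval
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)  # simulated response (illustration only)

J <- 10
B <- bs(x, df = J)                         # n x J matrix of cubic B-spline basis functions
Xmat <- cbind(1, B)                        # intercept column plus the J basis functions

# With Y ~ Normal(X beta, sigma^2 I) and beta_j ~ Normal(0, tau^2 sigma^2),
# the posterior mean of beta given tau^2 is (X'X + I/tau^2)^{-1} X'Y.
tau2 <- 100                                # hypothetical fixed value, not estimated
beta_hat <- solve(crossprod(Xmat) + diag(J + 1) / tau2, crossprod(Xmat, y))
g_hat <- drop(Xmat %*% beta_hat)           # fitted mean function at the sample points
```

A full analysis would instead place priors on $\sigma^2$ and $\tau^2$ and sample all parameters with MCMC.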
Motorcycle example: To fit the mean curve to the data plotted in Figure
4.9a, we use $J = 10$ B-spline basis functions and priors $\beta_j \sim
\text{Normal}(0, \tau^2\sigma^2)$ and $\sigma^2, \tau^2 \sim \text{InvGamma}(0.1, 0.1)$.
The model is fit using MCMC with the code in Listing 4.3. In this code, the basis
functions have been computed using the bs function (splines package) in R and
passed to JAGS as $x_{ij} = B_j(X_i)$. For each
iteration, we compute g(Xi ) for all i = 1, ..., n as a function of that iteration’s
posterior sample of β. This produces the entire posterior distribution of the
mean function g for all n sample points (and any other X we desire), and
Figure 4.9a plots the posterior median and 95% interval of g at each sample
point. The fitted model accurately captures the main trend including the valley
around X = 0.4.
We selected J = 10 basis functions because this degree of model complexity
visually seemed to fit the data well. Choosing smaller J would give a smoother
estimate of g and choosing larger J would give a rougher estimate. Clearly a
more rigorous approach to selecting the number of basis functions is needed,
and this is discussed in Chapter 5.

Listing 4.11
Model statement for heteroskedastic Gaussian regression.
  # Likelihood: the mean and the log variance are both linear in the covariates
  for(i in 1:n){
    Y[i]    ~ dnorm(mu[i],prec[i])   # dnorm in JAGS takes a precision
    mu[i]   <- inprod(x[i,],beta[])
    prec[i] <- 1/sig2[i]
    sig2[i] <- exp(inprod(x[i,],alpha[]))
  }
  # Priors
  for(j in 1:p){beta[j] ~ dnorm(0,taub)}
  for(j in 1:p){alpha[j] ~ dnorm(0,taua)}
  taub ~ dgamma(0.1,0.1)
  taua ~ dgamma(0.1,0.1)

4.5.2 Heteroskedastic models


The standard linear regression analysis assumes a homoskedastic variance
$\text{V}(\varepsilon_i) = \sigma^2$ for all $i$. A more flexible heteroskedastic
model allows the covariates to affect both the mean and the variance. A natural
approach is to build a linear model for the variance as a function of the
covariates so that $\text{V}(\varepsilon_i) = \sigma^2(X_i)$. Since the variance is
positive, we must transform the linear predictor to be positive before linking to
the variance. For example,
$$\log[\sigma^2(X_i)] = \sum_{j=1}^{p} X_{ij}\alpha_j, \tag{4.44}$$
or equivalently $\sigma^2(X_i) = \exp(\sum_{j=1}^{p} X_{ij}\alpha_j)$. The parameter
$\alpha_j$ determines the effect of the covariate $j$ on the variance and must be
estimated. JAGS code for this model is in Listing 4.11.
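To make the log-linear variance model concrete, the following R sketch evaluates the log likelihood implied by (4.44). The function name `loglik_het`, the simulated data and the parameter values are all hypothetical choices for illustration; this is not the code used to produce the results in the text.

```r
# Log likelihood of the heteroskedastic regression model:
# Y_i ~ Normal(X_i beta, sigma^2(X_i)) with log sigma^2(X_i) = X_i alpha
loglik_het <- function(beta, alpha, X, Y) {
  mu   <- drop(X %*% beta)          # mean for each observation
  sig2 <- exp(drop(X %*% alpha))    # variance is positive by construction
  sum(dnorm(Y, mean = mu, sd = sqrt(sig2), log = TRUE))
}

set.seed(2)
n <- 100
X <- cbind(1, runif(n))             # intercept and one simulated covariate
beta  <- c(0, 1)                    # illustrative mean coefficients
alpha <- c(-2, 3)                   # illustrative log-variance coefficients
Y <- rnorm(n, drop(X %*% beta), sqrt(exp(drop(X %*% alpha))))

ll_het <- loglik_het(beta, alpha, X, Y)
```

Setting all but the first element of `alpha` to zero recovers the homoskedastic model with $\sigma^2 = \exp(\alpha_1)$.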
Motorcycle example: The variance of the observations about the mean
trend in Figure 4.9a clearly depends on $X$, with small variance at the
beginning of the experiment and large variance in the middle. To capture this
heteroskedasticity in a flexible way, we model the log variance using the same
$J$ B-spline basis functions used for the mean,
$$g(X) = \beta_0 + \sum_{j=1}^{J} B_j(X)\beta_j \quad\text{and}\quad \log[\sigma^2(X)] = \alpha_0 + \sum_{j=1}^{J} B_j(X)\alpha_j, \tag{4.45}$$
where $\beta_j \sim \text{Normal}(0, \sigma_b^2)$ and $\alpha_j \sim \text{Normal}(0,
\sigma_a^2)$. The hyperparameters have uninformative priors $\sigma_a^2, \sigma_b^2
\sim \text{InvGamma}(0.1, 0.1)$.
The pointwise 95% intervals of $\sigma^2(X)$ in Figure 4.9d suggest that the
variance is indeed small at the beginning of the experiment and increases with $X$.
Comparing the posterior distributions of the mean trend for the homoskedastic
(Figure 4.9a) and heteroskedastic (Figure 4.9c) models, the posterior means
are similar but the heteroskedastic model produces more realistic 95% intervals
with widths that vary according to the pattern of the error variance. Therefore,
it appears that properly quantifying uncertainty about the mean trend requires
a realistic model for the error variance.

4.5.3 Non-Gaussian error models


Most Bayesian regression models assume Gaussian errors, but more flexible
methods are easily constructed. For example, to accommodate heavy tails the
errors could be modeled using Student-t or double-exponential (Laplace)
distributions. To further allow for asymmetry, generalizations such as the skew-t
and asymmetric Laplace distributions could be used. Listing 4.12a provides
code for regression with Student-t errors.
In most analyses, a suitable parametric distribution can be found. However,
this process is subjective and difficult to automate. Just as a mixture of
conjugate priors (Section 2.1.8) can be used to approximate virtually any prior
distribution, the mixture of normals distribution can be used to approximate
virtually any residual distribution. The mixture of normals density for
$\varepsilon$ is
$$f(\varepsilon) = \sum_{k=1}^{K} \pi_k\phi(\varepsilon; \theta_k, \tau_1^2), \tag{4.46}$$
where $K$ is the number of mixture components, $\pi_k \in (0, 1)$ is the
probability on mixture component $k$, and $\phi(\varepsilon; \theta, \tau^2)$ is the
Gaussian PDF with mean $\theta$ and variance $\tau^2$. This model is equivalent to
the clustering model where
$$Y_i|g_i \sim \text{Normal}\left(\sum_{j=1}^{p} X_{ij}\beta_j + \theta_{g_i}, \tau_1^2\right) \tag{4.47}$$
and $g_i \in \{1, ..., K\}$ is the cluster label for observation $i$ with
$\text{Prob}(g_i = k) = \pi_k$. By letting the number of mixture components increase
to infinity, any distribution can be approximated, and by selecting priors
$\theta_k \overset{iid}{\sim} \text{Normal}(0, \tau_2^2)$, $\tau_1^2, \tau_2^2 \sim
\text{InvGamma}$, and $(\pi_1, ..., \pi_K) \sim \text{Dirichlet}$ all full conditional
distributions for all parameters are conjugate, permitting Gibbs sampling (Listing
4.12b).
For a fixed number of mixture components (K) the mixture-of-normals
model is a semiparametric estimator of the density f (ε). There is a rich liter-
ature on nonparametric Bayesian density estimation [37]. The most common
model is the Dirichlet process mixture model that has infinitely many mixture
components and a particular model for the mixture probabilities.
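As a small numerical check of the mixture-of-normals density in (4.46), the R sketch below evaluates the density for a hypothetical three-component mixture and confirms that it integrates to one. The function name `dmixnorm` and the weights, means and standard deviation are illustrative assumptions, not values estimated from any data in the text.

```r
# Mixture-of-normals density: sum_k pi_k * phi(eps; theta_k, tau1^2)
dmixnorm <- function(eps, pi, theta, tau1) {
  # rows index the evaluation points, columns index the mixture components
  dens <- outer(eps, theta, function(e, th) dnorm(e, mean = th, sd = tau1))
  drop(dens %*% pi)
}

pi_k  <- c(0.5, 0.3, 0.2)   # mixture probabilities (sum to one)
theta <- c(-2, 0, 3)        # component means
tau1  <- 0.7                # common component standard deviation

area <- integrate(dmixnorm, -Inf, Inf,
                  pi = pi_k, theta = theta, tau1 = tau1)$value
```

Because the components share a variance but have different means, the resulting density can be skewed or multimodal even though each component is Gaussian.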

Listing 4.12
Model statement for Gaussian regression with non-normal errors.
  # (a) Regression with student-t errors
  for(i in 1:n){
    Y[i]  ~ dt(mu[i],tau,df)
    mu[i] <- inprod(X[i,],beta[])
  }
  # taub is not defined in this listing; supply it as data or give it a prior
  for(j in 1:p){beta[j] ~ dnorm(0,taub)}
  tau ~ dgamma(0.1,0.1)
  df  ~ dgamma(0.1,0.1)

  # (b) Regression with mixture-of-normals errors
  for(i in 1:n){
    Y[i]  ~ dnorm(mu[i]+theta[g[i]],tau1)
    mu[i] <- inprod(X[i,],beta[])
    g[i]  ~ dcat(pi[])                 # cluster label for observation i
  }
  for(k in 1:K){theta[k] ~ dnorm(0,tau2)}
  for(j in 1:p){beta[j] ~ dnorm(0,tau3)}
  tau1 ~ dgamma(0.1,0.1)
  tau2 ~ dgamma(0.1,0.1)
  tau3 ~ dgamma(0.1,0.1)
  pi[1:K] ~ ddirch(alpha[1:K])         # alpha is passed to JAGS as data

4.5.4 Linear models with correlated data

For data with a natural ordering, such as spatial or temporal data, modeling
the correlation between observations is important to obtain valid inference
and make accurate predictions. Generally, the Bayesian linear model with
correlated errors is
$$Y \sim \text{Normal}(X\beta, \Sigma). \tag{4.48}$$
A Bayesian analysis of correlated data hinges on correctly specifying the corre-
lation structure to capture say spatial or temporal correlation. Given the cor-
relation structure and priors for the correlation parameters, standard Bayesian
computational tools can be used to summarize the posterior. Correlation pa-
rameters usually will not have conjugate priors and so Metropolis–Hastings
sampling is used. An advantage of the Bayesian approach for correlated data
is that using MCMC sampling we can account for uncertainty in the corre-
lation parameters for prediction or inference on other parameters, whereas
maximum likelihood analysis often uses plug-in estimates of the correlation
parameters and thus underestimates uncertainty.
Gun control example: The data for this analysis come from Kalesan
et al. (2016) [47]. The response variable, $Y_i$, is the log firearm-related death
rate per 10,000 people in 2010 in state i (excluding Alaska and Hawaii). This
is regressed onto five potential confounders: log 2009 firearm death rate per
10,000 people; firearm ownership rate quartile; unemployment rate quartile;
non-firearm homicide rate quartile; and firearm export rate quartile. The co-
variate of interest is the number of gun control laws in effect in the state. This
gives p = 6 covariates.
We first fit the usual Bayesian linear regression model
$$Y_i = \beta_0 + \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i \tag{4.49}$$
with independent errors $\varepsilon_i \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$
and uninformative priors. The
posterior density of the regression coefficient corresponding to the number of
gun laws is plotted in Figure 4.10. The posterior probability that the coefficient
is negative is 0.96, suggesting a negative relationship between the number of
gun laws and the firearm-related death rate.
The assumption of independent residuals is questionable because neighbor-
ing states may be correlated. Spatial correlation may stem from guns being
brought across state borders or from missing covariates (e.g., attitudes about
and use of guns) that vary spatially. Research has shown that accounting
for residual dependence can have a dramatic effect on regression coefficient
estimates [43].
FIGURE 4.10
Effect of gun-control legislation on firearm-related death rate.
Posterior distribution of the coefficient associated with the number of gun-control
laws in a state from the spatial and non-spatial model of the states'
firearm-related death rate.
[Figure: posterior density (vertical axis) of the coefficient Beta (horizontal
axis) for the non-spatial and spatial models.]

We decompose the residual covariance $\text{Cov}[(\varepsilon_1, ...,
\varepsilon_n)^T] = \Sigma$ as
$$\Sigma = \tau^2 S + \sigma^2 I_n, \tag{4.50}$$
where $\tau^2 S$ is the spatial covariance and $\sigma^2 I_n$ is the non-spatial
covariance. There are many spatial correlation models (e.g., [4]) that allow the
correlation between two states to decay with the distance between the states. For
example, a common model is to assume the correlation between states decays
exponentially with the distance between them. However, quantifying the
distance between irregularly shaped states is challenging, and so we model spatial
dependence using adjacencies. Let $A_{ij} = 1$ if states $i$ and $j$ share a border
and $A_{ij} = 0$ if $i = j$ or the states are not neighbors. The spatial covariance
follows the conditionally autoregressive model $S = (M - \rho A)^{-1}$, where $A$ is
the adjacency matrix with $(i, j)$ element $A_{ij}$ and $M$ is the diagonal matrix
with the $i$th diagonal element equal to the number of states that neighbor
state $i$. The parameter $\rho \in (0, 1)$ is not the correlation between adjacent
sites, but determines the strength of spatial dependence with $\rho = 0$
corresponding to independence.
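The adjacency-based covariance can be illustrated on a toy map. The R sketch below assumes four hypothetical regions arranged in a line (1-2-3-4) and arbitrary values of $\rho$, $\tau^2$ and $\sigma^2$; it builds the conditionally autoregressive matrix $S = (M - \rho A)^{-1}$ and the residual covariance $\Sigma = \tau^2 S + \sigma^2 I_n$.

```r
# Hypothetical adjacency for four regions in a line: 1-2, 2-3, 3-4 share borders
A <- matrix(0, 4, 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)                          # symmetric, with zeros on the diagonal
M <- diag(rowSums(A))                  # number of neighbors on the diagonal

rho    <- 0.5                          # illustrative values only
tau2   <- 2
sigma2 <- 1

S     <- solve(M - rho * A)            # conditionally autoregressive covariance
Sigma <- tau2 * S + sigma2 * diag(4)   # spatial plus non-spatial covariance
spatial_cor <- cov2cor(S)              # implied correlations between regions
```

Inspecting `spatial_cor` for different values of `rho` shows why $\rho$ is described as a dependence parameter rather than as the correlation between neighbors.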
The posterior mean (standard deviation) of the spatial dependence pa-
rameter ρ is 0.38 (0.25), and so the residual spatial dependence in these data
is not strong. However, the posterior of the regression coefficient of interest
in Figure 4.10 is noticeably wider for the spatial model than the non-spatial
model. The posterior probability that the coefficient is negative drops from
0.96 for the non-spatial model to 0.93 for the spatial model. Therefore, while
accounting for residual dependence did not qualitatively change the results,
this example illustrates that the chosen model for the residuals can affect the
posterior of the regression coefficients.
Jaw bone density example: A possible correlation structure for the
longitudinal data in Figure 4.6 (top left) is to assume that correlation decays
with the time between visits. A first-order autoregressive correlation structure
is $\text{Cor}(Y_{ij}, Y_{ik}) = \rho^{|j-k|}$. Denoting the vector of $m$
observations for patient $i$ as $Y_i = (Y_{i1}, ..., Y_{im})^T$, the $m \times m$
covariance matrix $\Sigma$ has $(j, k)$ element equal to $\sigma^2\rho^{|j-k|}$. The
random slope model for the mean in matrix notation is
$\text{E}(Y_i|\alpha_i) = X\alpha_i$, where $X$ is the $m \times 2$ matrix with the
first column equal to the vector of ones for the intercept and the second column
equal to the ages $X_1, ..., X_m$ for the slope. The likelihood is then
$$Y_i|\alpha_i \overset{indep}{\sim} \text{Normal}(X\alpha_i, \Sigma) \tag{4.51}$$
with random effects distribution $\alpha_i \overset{iid}{\sim} \text{Normal}(\beta,
\Omega)$. The correlation parameter is given prior $\rho \sim \text{Uniform}(0, 1)$
and all other priors are the same as in the other fits. JAGS code is given in
Listing 4.13.

Listing 4.13
Random slopes model with autoregressive dependence in JAGS.
  # Likelihood
  for(i in 1:n){
    Y[i,1:m] ~ dmnorm(mn[i,1:m],SigmaInv[1:m,1:m])
    for(j in 1:m){mn[i,j] <- alpha[i,1]+alpha[i,2]*age[j]}
  }
  # dmnorm takes a precision matrix, so invert the autoregressive covariance
  SigmaInv[1:m,1:m] <- inverse(Sigma[1:m,1:m])
  for(j in 1:m){for(k in 1:m){
    Sigma[j,k] <- pow(rho,abs(k-j))/tau   # 1/tau plays the role of sigma^2
  }}

  # Random effects (Omega enters dmnorm as a precision matrix)
  for(i in 1:n){alpha[i,1:2] ~ dmnorm(beta[1:2],Omega[1:2,1:2])}

  # Priors
  tau ~ dgamma(0.1,0.1)
  for(j in 1:2){beta[j] ~ dnorm(0,0.0001)}
  rho ~ dunif(0,1)
  Omega[1:2,1:2] ~ dwish(R[,],2.1)

  R[1,1]<-1/2.1
  R[1,2]<-0
  R[2,1]<-0
  R[2,2]<-1/2.1
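The first-order autoregressive covariance can also be built directly in R, which is a convenient way to inspect how quickly the correlation decays. The helper name `ar1_cov` and the values of $m$, $\rho$ and $\sigma^2$ below are hypothetical choices for illustration.

```r
# m x m covariance matrix with (j, k) element sigma2 * rho^|j - k|
ar1_cov <- function(m, rho, sigma2) {
  lag <- abs(outer(1:m, 1:m, "-"))   # matrix of |j - k|
  sigma2 * rho^lag
}

Sigma <- ar1_cov(m = 4, rho = 0.85, sigma2 = 2)
Sigma_inv <- solve(Sigma)            # precision matrix, as required by dmnorm in JAGS
```

The inverse of this matrix is tridiagonal, reflecting the fact that in a first-order autoregressive model each visit depends only on its immediate neighbors in time once those are conditioned on.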
The posterior median of the correlation parameter ρ is 0.85 and the poste-
rior 95% interval is (0.46, 0.96) so there is evidence of correlation that cannot
be explained by the patient-specific linear trend. Including autoregressive cor-
relation has only a modest effect on the posterior distribution of the fixed
effects: the 95% posterior intervals for the random effect model with indepen-
dent errors are (29.9, 38.3) for β1 and (1.33, 2.38) for β2 compared to (30.0,
37.3) for β1 and (1.45, 2.31) for β2 for the autoregressive model.
This model with random intercept, random slope and autoregressive
correlation structure is now very complex. There are almost as many parameters
as observations and multiple explanations of dependence (random effects and
residual correlation). Perhaps in this case all of these terms are necessary and
can be estimated from this relatively small data set, but a simpler yet adequate
model is preferred for computational purposes and because simpler models are
easier to explain and defend. Model comparisons and tests of model adequacy
are the topics of Chapter 5.

4.6 Exercises
1. A clinical trial gave six subjects a placebo and six subjects a new
weight loss medication. The response variable is the change in
weight (pounds) from baseline (so -2.0 means the subject lost 2
pounds). The data for the 12 subjects are:

Placebo   Treatment
   2.0      -3.5
  -3.1      -1.6
  -1.0      -4.6
   0.2      -0.9
   0.3      -5.1
   0.4       0.1

Conduct a Bayesian analysis to compare the means of these two
groups. Would you say the treatment is effective? Is your conclusion
sensitive to the prior?
2. Load the classic Boston Housing Data in R:

> library(MASS)
> data(Boston)
> ?Boston

The response variable is medv, the median value of owner-occupied
homes (in $1,000s), and the other 13 variables are covariates that
describe the neighborhood.
(a) Fit a Bayesian linear regression model with uninformative
Gaussian priors for the regression coefficients. Verify the
MCMC sampler has converged, and summarize the posterior
distribution of all regression coefficients.
(b) Perform a classic least squares analysis (e.g., using the lm func-
tion in R). Compare the results numerically and conceptually
with the Bayesian results.

(c) Refit the Bayesian model with double exponential priors for the
regression coefficients, and discuss how the results differ from
the analysis with uninformative priors.
(d) Fit a Bayesian linear regression model in (a) using only the first
500 observations and compute the posterior predictive distribu-
tion for the final 6 observations. Plot the posterior predictive
distribution versus the actual value for these 6 observations and
comment on whether the predictions are reasonable.
3. Download the 2016 Presidential Election data from the book's website.
Perform Bayesian linear regression with the response variable
for county $i$ being the percentage of the vote for the Republican
candidate in 2016 minus the percentage in 2012, and with all variables
in the object X as covariates.
(a) Fit a Bayesian linear regression model with uninformative
Gaussian priors for the regression coefficients and summarize
the posterior distribution of all regression coefficients.
(b) Compute the residuals Ri = Yi − Xi β̂ where β̂ is the posterior
mean of the regression coefficients. Are the residuals Gaussian?
Which counties have the largest and smallest residuals, and
what might this say about these counties?
(c) Include a random effect for the state, that is, for a county in
state $l = 1, ..., 50$,
$$Y_i|\alpha_l \sim \text{Normal}(X_i\beta + \alpha_l, \sigma^2)$$
where $\alpha_l \overset{iid}{\sim} \text{Normal}(0, \tau^2)$ and $\tau^2$ has an
uninformative prior.
Why might adding random effects be necessary? How does
adding random effects affect the posterior of the regression co-
efficients? Which states have the highest and lowest posterior
mean random effect, and what might this imply about these
states?
4. Download the US gun control data from the book’s website. These
data are taken from the cross-sectional study in [47]. For state i, let
Yi be the number of homicides and Ni be the population.
(a) Fit the model Yi |β ∼ Poisson(Ni λi ) where log(λi ) = Xi β. Use
uninformative priors and p = 7 covariates in Xi : the intercept,
the five confounders Zi , and the total number of gun laws in
state i. Provide justification that the MCMC sampler has con-
verged and sufficiently explored the posterior distribution and
summarize the posterior of β.
(b) Fit a negative binomial regression model and compare with the
results from Poisson regression.

(c) For the Poisson model in (a), compute the posterior predictive
distribution for each state with the number of gun laws set
to zero. Repeat this with the number of gun laws set to 25
(the maximum number). According to these calculations, how
would the number of deaths nationwide be affected by these
policy changes? Do you trust these projections?
5. Download the titanic dataset from R,

library("titanic")
dat <- titanic_train
?titanic_train

Let $Y_i = 1$ if passenger $i$ survived and $Y_i = 0$ otherwise. Perform
a Bayesian logistic regression of the survival probability onto the
passenger’s age, gender (dummy variable) and class (two dummy
variables). Summarize the effect of each covariate.
6. The T. rex growth chart data plotted in Figure 3.7 has n = 7
observations with weights (kg) 29.9, 1761, 1807, 2984, 3230, 5040,
and 5654 and corresponding ages (years) 2, 15, 14, 16, 18, 22, and
28. Since weight must be positive, the gamma family of distribu-
tions is a reasonable model for these data. Describe a model with
gamma likelihood and log mean that increases linearly with age.
Approximate the posterior using MCMC, summarize the posterior
distribution of all model parameters, and plot the data versus the
fitted mean curve.
7. Consider the one-way random effects model $Y_{ij}|\alpha_i, \sigma^2 \sim
\text{Normal}(\alpha_i, \sigma^2)$ and $\alpha_i \sim \text{Normal}(0, \tau^2)$ for
$i = 1, ..., n$ and $j = 1, ..., m$. Assuming conjugate priors $\sigma^2, \tau^2 \sim
\text{InvGamma}(a, b)$, derive the full conditional distributions of $\alpha_1$,
$\sigma^2$, and $\tau^2$ and outline
(but do not code) an MCMC algorithm to sample from the poste-
rior.
8. Load the Gambia data in R:

> library(geoR)
> data(gambia)
> ?gambia

The response variable $Y_i$ is the binary indicator that child $i$ tested
positive for malaria (pos) and the remaining seven variables are
covariates.
(a) Fit the logistic regression model
$$\text{logit}[\text{Prob}(Y_i = 1)] = \sum_{j=1}^{p} X_{ij}\beta_j$$
with uninformative priors for the $\beta_j$. Verify that the MCMC
sampler has converged and summarize the effects of the covariates.
(b) In this dataset, the 2,035 children reside in $L = 65$ unique
locations (defined by the x and y coordinates in the dataset).
Let $s_i \in \{1, ..., L\}$ be the label of the location for observation
$i$. Fit the random effects logistic regression model
$$\text{logit}[\text{Prob}(Y_i = 1)] = \sum_{j=1}^{p} X_{ij}\beta_j + \alpha_{s_i} \quad\text{where}\quad \alpha_l \overset{iid}{\sim} \text{Normal}(0, \tau^2)$$
and the $\beta_j$ and $\tau^2$ have uninformative priors. Verify that the
MCMC sampler has converged; explain why random effects
might be needed here; discuss and explain any differences in the
posteriors of the regression coefficients that occur when random
effects are added to the model; plot the posterior means of the
αl by their spatial locations and suggest how this map might
be useful to malaria researchers.
9. Download the babynames data in R and compute the log odds of a
baby being named “Sophia” each year after 1950:

library(babynames)
dat <- babynames
dat <- dat[dat$name=="Sophia" &
dat$sex=="F" &
dat$year>1950,]
yr <- dat$year
p <- dat$prop
t <- dat$year - 1950
Y <- log(p/(1-p))

Let $Y_t$ denote the sample log-odds in year $t + 1950$. Fit the following
time series (auto-regressive order 1) model to these data:
$$Y_t = \mu_t + \rho(Y_{t-1} - \mu_{t-1}) + \varepsilon_t$$
where $\mu_t = \alpha + \beta t$ and $\varepsilon_t \overset{iid}{\sim}
\text{Normal}(0, \sigma^2)$. The priors are $\alpha, \beta \sim \text{Normal}(0,
100^2)$, $\rho \sim \text{Uniform}(-1, 1)$, and $\sigma^2 \sim
\text{InvGamma}(0.1, 0.1)$.
(a) Give an interpretation of each of the four model parameters: α,
β, ρ, and σ 2 .
(b) Fit the model using JAGS for t > 1, verify convergence, and
report the posterior mean and 95% interval for each parameter.
(c) Plot the posterior predictive distribution for Yt in the year 2020.

10. Open and plot the galaxies data in R using the code below,

> library(MASS)
> data(galaxies)
> ?galaxies
> Y <- galaxies
> hist(Y,breaks=25)

Model the observations $Y_1, ..., Y_{82}$ using a mixture of $K = 3$ normal
distributions. For each of the $S$ MCMC iterations evaluate the density
function on the grid $y \in \{5000, 5100, ..., 40000\}$ (351 points in the
grid), giving an S × 351 matrix of posterior samples. Plot the pos-
terior median and 95% credible set of the density function at each
of the 351 grid values. Does this mixture model fit the data well?
5 Model selection and diagnostics

CONTENTS
5.1 Cross validation
5.2 Hypothesis testing and Bayes factors
5.3 Stochastic search variable selection
5.4 Bayesian model averaging
5.5 Model selection criteria
5.6 Goodness-of-fit checks
5.7 Exercises

A statistical model is a mathematical representation of the system that includes
errors and biases in the observation process, and therefore lays bare the
assumptions being made in the analysis. Of course, no statistical model is
absolutely correct; in reality, functional relationships are not linear, errors are
not exactly Gaussian, residuals are not independent, etc. Nonetheless, slight
deviations in fit are a small price to pay for a simple and interpretable rep-
resentation of reality that allows us to probe for important relationships, test
scientific hypotheses and make predictions. On the other hand, if a model’s
assumptions are blatantly violated then the results of the analysis cannot be
taken seriously. Therefore, selecting an appropriate model that is as simple as
possible while fitting the data reasonably well is a key step in any parametric
statistical analysis.
In this chapter we discuss Bayesian model selection and goodness-of-fit
measures. For model selection, we assume there is a finite collection of
candidate models denoted as $\mathcal{M}_1, ..., \mathcal{M}_M$. For example, we
might compare Gaussian and Student-t models,
$$\mathcal{M}_1: Y_i \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2) \quad\text{versus}\quad \mathcal{M}_2: Y_i \overset{iid}{\sim} t_\nu(\mu, \sigma^2) \tag{5.1}$$
or whether or not to include a covariate
$$\mathcal{M}_1: Y_i \overset{indep}{\sim} \text{Normal}(\beta_1, \sigma^2) \quad\text{versus}\quad \mathcal{M}_2: Y_i \overset{indep}{\sim} \text{Normal}(\beta_1 + X_i\beta_2, \sigma^2). \tag{5.2}$$
The methods we discuss are general and can be used for other model-selection
tasks, including to select random effects structure, the link function, etc.
Sections 5.1–5.5 introduce techniques for comparing and selecting
statistical models. Section 5.1 begins with cross validation, which is a flexible
and intuitive way to compare methods. For a more formal Bayesian treatment of
model selection and assessment of model uncertainty, Section 5.2 introduces
the Bayes factor and Section 5.3 provides computational tools to approximate
Bayes factors when many models are under consideration. An attractive fea-
ture of Bayes factors is that rather than selecting a single model, uncertainty
about the model is captured using posterior probabilities. Section 5.4 uses
these posterior probabilities to make predictions that appropriately average
over model uncertainty. While Bayes factors are appealing, they typically re-
quire extensive derivation or computation and are sensitive to the prior and so
Section 5.5 provides less formal but more broadly applicable model selection
criteria. In virtually any analysis none of the M models will be “right,” and
we are simply searching for the one that fits the “best.” Therefore, Section
5.6 provides goodness-of-fit tools to verify that a selected model captures the
important features of a dataset.

5.1 Cross validation


Arguably the most intuitive way to compare models is based on out-of-sample
prediction performance. Ideally an independent validation set is available and
used to evaluate performance. For example, for data streaming in over time
one might train the M models on data available at the time of the analysis
and use data collected after the analysis to measure predictive performance.
Often this is not feasible as data are not collected sequentially, and so inter-
nal cross validation (CV) is used instead. In K-fold CV, each observation is
randomly assigned to one of the K folds, with gi ∈ {1, ..., K} denoting the
group assignment for observation i. Denote the subset of the data in fold k
as $\mathbf{Y}_k = \{Y_i; g_i = k\}$ and the data in all other folds as $\mathbf{Y}_{(k)} = \{Y_i; g_i \neq k\}$.
The model is then fit $K$ times, with the $k$th model fit using $\mathbf{Y}_{(k)}$ to train the
model and make predictions for Yk . In this way, a prediction is made for each
observation using an analysis that excludes the observation, approximating
out of sample prediction performance.
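The fold bookkeeping can be sketched in R as follows. This is a minimal illustration on simulated data, and it uses an ordinary least squares fit from `lm` purely as a stand-in for the Bayesian model that would be refit in each fold; the variable names are hypothetical.

```r
set.seed(3)
n <- 100
K <- 5
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)        # simulated data for illustration

g <- sample(rep(1:K, length.out = n))    # random fold label g_i for each observation
y_hat <- rep(NA, n)

for (k in 1:K) {
  test <- g == k                         # fold k is held out
  fit  <- lm(y ~ x, data = dat[!test, ]) # stand-in for the k-th model fit
  y_hat[test] <- predict(fit, newdata = dat[test, ])
}

cv_mse <- mean((y_hat - dat$y)^2)        # out-of-sample mean squared error
```

Every observation receives exactly one prediction, and that prediction comes from a fit that did not use the observation.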
Bayesian prediction is based on the posterior predictive distribution (PPD)
of the test set observations $\mathbf{Y}_k$ given the training data
$\mathbf{Y}_{(k)}$ (Section 1.5). An advantage of this method of prediction is that
it naturally averages over uncertainty in the model parameters. Generating samples
from the PPD averaging over parameter uncertainty is straightforward using MCMC.
Let $Y_i^{(s)}$ be the prediction made at MCMC iteration $s$ for test set
observation $Y_i$, then $Y_i^{(1)}, ..., Y_i^{(S)}$ can be used to approximate the
PPD. For example, we might compute the Monte Carlo sample mean
$\hat{Y}_i = \sum_{s=1}^{S} Y_i^{(s)}/S$, median $\tilde{Y}_i$, and
$\tau$-quantile $q_i(\tau)$ to approximate the posterior predictive mean, median and
quantiles respectively.
Many metrics are available to summarize the accuracy of the predictions
from each model [38]. The most common measures of point prediction are
bias, mean squared error and mean absolute deviation,
$$\text{BIAS} = \frac{1}{n}\sum_{i=1}^{n} (\hat{Y}_i - Y_i)$$
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$
$$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |\tilde{Y}_i - Y_i|,$$
with MSE being more sensitive to large errors than MAD. Performance of
credible intervals can be summarized using empirical coverage and average
width of prediction intervals
$$\text{COV} = \frac{1}{n}\sum_{i=1}^{n} I[q_i(\alpha/2) \le Y_i \le q_i(1 - \alpha/2)]$$
$$\text{WIDTH} = \frac{1}{n}\sum_{i=1}^{n} [q_i(1 - \alpha/2) - q_i(\alpha/2)],$$
where $I(A) = 1$ if the statement $A$ is true and zero otherwise. A measure of
fit to the entire distribution is the log score,
$$\text{LS} = \frac{1}{n}\sum_{i=1}^{n} \log[f(Y_i|\hat{\theta}_i)] \tag{5.3}$$

where $\hat{\theta}_i$ is the parameter estimate based on the training folds that
exclude observation $i$. The log score is the average log likelihood of the test set
observations given the parameter estimates from the training data. Based on these
measures, we might discard models with COV far below the nominal $1 - \alpha$ level
and from the remaining models choose the one with small MSE, MAD and WIDTH
and large LS. It is essential that the evaluation is based on out-of-sample
predictions rather than within-sample fit. Overly complicated models (e.g., a
linear model with too many predictors) may replicate the data used to fit the
model but be too unstable to predict well under new conditions.
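Given a matrix of posterior predictive draws, the summary measures above reduce to a few column-wise operations. The R sketch below assumes a hypothetical $S \times n$ matrix `ppd` whose column $i$ holds the draws $Y_i^{(1)}, ..., Y_i^{(S)}$; here both the held-out data and the draws are simulated from a standard normal distribution only so that the example runs on its own.

```r
set.seed(4)
n <- 50
S <- 2000
alpha <- 0.05

y   <- rnorm(n)                          # held-out observations (simulated)
ppd <- matrix(rnorm(S * n), S, n)        # stand-in for MCMC draws from each PPD

y_hat <- colMeans(ppd)                   # posterior predictive means
y_med <- apply(ppd, 2, median)           # posterior predictive medians
lo    <- apply(ppd, 2, quantile, probs = alpha / 2)
hi    <- apply(ppd, 2, quantile, probs = 1 - alpha / 2)

metrics <- c(BIAS  = mean(y_hat - y),
             MSE   = mean((y_hat - y)^2),
             MAD   = mean(abs(y_med - y)),
             COV   = mean(lo <= y & y <= hi),
             WIDTH = mean(hi - lo))
```

In a real comparison `ppd` would be assembled fold by fold so that every column is based on a fit that excluded the corresponding observation.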
Cross validation can be motivated by information theory. Suppose the
iid
“true” data generating model has PDF f0 so that in reality Yi ∼ f0 . Of
course, we cannot know the true model and so we choose between M models
with PDFs f1 , ..., fM . Our objective is to select the model that is in some
sense the closest to the true model. A reasonable measure of the difference
between the true and postulated model is the Kullback–Leibler divergence
  
fj (Y )
KL(f0 , fj ) = E log = E {log[fj (Y )]} − E {log[f0 (Y )]} ,
f0 (Y )
166 Bayesian Statistical Methods

where the expectation is with respect to the true model Y ∼ f0. The term
E{log[f0(Y)]} is the same for all j = 1, ..., M and therefore ranking models
based on KL(f0, fj) is equivalent to ranking models based on their log score
LSj = E{log[fj(Y)]}. Since the data are generated from f0, the cross
validation log score in (5.3) is a Monte Carlo estimate of the true log score
LSj, and therefore ranking models based on their cross validation log score
is an attempt to rank them based on similarity to the true data-generating
model.
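The connection can be checked numerically in a toy setting. The sketch below assumes a true model f0 = Normal(0, 1) and two hypothetical candidates, and uses Monte Carlo draws from f0 to estimate each log score. The gap E{log[f0(Y)]} − E{log[fj(Y)]} is nonnegative and its magnitude is the Kullback–Leibler divergence, which for two normals with equal variance is half the squared difference in means.

```r
# Assumed setup: true model f0 = Normal(0, 1); two hypothetical candidates
set.seed(2)
y <- rnorm(1e5)                                        # draws from f0

ls0 <- mean(dnorm(y, log = TRUE))                      # E{log f0(Y)}, common to all models
ls1 <- mean(dnorm(y, mean = 0.1, sd = 1, log = TRUE))  # candidate 1: Normal(0.1, 1)
ls2 <- mean(dnorm(y, mean = 1.0, sd = 1, log = TRUE))  # candidate 2: Normal(1, 1)

gap1 <- ls0 - ls1                                      # near 0.1^2 / 2 = 0.005
gap2 <- ls0 - ls2                                      # near 1^2 / 2 = 0.5
print(c(LS1 = ls1, LS2 = ls2, gap1 = gap1, gap2 = gap2))
```

The candidate with the larger log score (here candidate 1) is also the one with the smaller gap to the true model.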

5.2 Hypothesis testing and Bayes factors


Bayes factors [48] provide a formal summary of the evidence that the data
support one model over another. For now, say there are only M = 2 models
under consideration. Although not necessary for the computation of Bayes
factors, most model comparison problems can be framed so that both models
are nested in a common model and distinguished by the model parameters.
For example, the full model might be $Y_i|\mu \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$, with model M1
defined by µ ≤ 0 and model M2 defined by µ > 0. In a Bayesian analysis,
the unknown parameters are treated as random variables. If the statistical
models are stated as functions of the parameters, then the models are also
random variables. For example, the posterior probability Prob(µ ≤ 0|Y) is the
posterior probability of model M1 and Prob(µ > 0|Y) = Prob(M2 |Y). These
posterior probabilities are the most intuitive summaries of model uncertainty.
Posterior model probabilities incorporate information from both the data
and the prior. Bayes factors remove the effect of the prior and quantify the
data’s support of the models. The Bayes factor (BF) of Model 2 relative to
Model 1 is the ratio of posterior odds to prior odds,

$$BF = \frac{\mathrm{Prob}(M_2|\mathbf{Y})/\mathrm{Prob}(M_1|\mathbf{Y})}{\mathrm{Prob}(M_2)/\mathrm{Prob}(M_1)}. \tag{5.4}$$

If the prior probabilities of the two models are equal, then the BF is simply the
posterior odds Prob(M2 |Y)/Prob(M1 |Y); on the other hand, if the data are
not at all informative about the models and the prior and posterior odds are
the same, then BF = 1 regardless of the prior.
Selecting between two competing models is often referred to as hypothe-
sis testing. In hypothesis testing one of the models is referred to as the null
model or null hypothesis and the other is the alternative model/hypothesis.
Hypothesis tests are usually designed to be conservative so that the null model
is rejected in favor of the alternative only if the data strongly support the
alternative. If we define Model 1 as the null hypothesis and Model 2 as the
alternative hypothesis, then a rule of thumb [48] is that BF > 10 provides
strong evidence for the alternative hypothesis (Model 2) compared to the null
hypothesis (Model 1) and BF > 100 is decisive evidence.
A common mistake in frequentist hypothesis testing is to state the
probability that the null hypothesis is true; from a frequentist perspective, the
parameters and thus the hypotheses are fixed quantities and not random
variables, and therefore it is not sensible to assign them probabilities. From the
Bayesian perspective, however, giving the posterior probability of each
model/hypothesis is a legitimate summary of uncertainty.
Computing BFs for problems with many parameters can be challenging. In
general, model selection can be framed as treating the model as an unknown
random variable M ∈ {M1 , M2 } with prior probability Prob(M = Mj ) = qj .
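Conditioned on model j, i.e., M = Mj, the remainder of the Bayesian model
is
$$\mathbf{Y} \sim f(\mathbf{Y}|\boldsymbol{\theta}; M_j) \quad \mathrm{and} \quad \boldsymbol{\theta} \sim \pi(\boldsymbol{\theta}|M_j). \tag{5.5}$$
Therefore, the two models can have different likelihood and prior functions.
The BF requires the marginal posterior of the model integrating over
uncertainty in the parameters,
$$\mathrm{Prob}(M = M_j|\mathbf{Y}) = \int p(\boldsymbol{\theta}, M = M_j|\mathbf{Y})\,d\boldsymbol{\theta}. \tag{5.6}$$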

Unfortunately, marginalizing over the parameters (as in m in Table 1.4)
is rarely possible in closed form. Also, the marginal distribution of M is not
defined for improper priors. Therefore, BFs cannot be used with improper priors.
Even with proper priors, BFs can be very sensitive to the choice of
hyperparameters, as shown by the examples below.
MCMC provides a means of approximating BFs in special cases. The BF
for nested models defined by intervals of the parameters is straightforward
to compute using MCMC. For example, in the model above with
$Y_i|\mu \overset{iid}{\sim} \mathrm{Normal}(\mu, \sigma^2)$ and $\mu \sim \mathrm{Normal}(0, 10^2)$, and models M1 defined by µ ≤ 0
and M2 defined by µ > 0, the posterior probability of Model 1 (2) can be
approximated by generating samples from the posterior of the full model and
recording the proportion of the samples for which µ is negative (positive).
Section 5.3 provides a more general computational strategy for computing
model probabilities.
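A minimal R sketch of this calculation follows. It assumes simulated data and a known error standard deviation, and it draws directly from the conjugate normal posterior as a stand-in for MCMC output; with real MCMC samples of µ the last few lines would be unchanged. All object names are hypothetical.

```r
set.seed(3)
sigma <- 1                                   # assumed known error SD
y <- rnorm(20, mean = 0.3, sd = sigma)       # simulated data (hypothetical)

# Conjugate posterior for mu under the prior mu ~ Normal(0, 10^2)
post_prec <- length(y) / sigma^2 + 1 / 10^2
post_mean <- (sum(y) / sigma^2) / post_prec
mu <- rnorm(1e5, post_mean, sqrt(1 / post_prec))   # stand-in for MCMC samples

p1 <- mean(mu <= 0)                          # approximates Prob(M1 | Y)
p2 <- mean(mu > 0)                           # approximates Prob(M2 | Y)
prior_odds <- 1                              # the prior is symmetric about zero
BF <- (p2 / p1) / prior_odds                 # BF of Model 2 relative to Model 1
print(c(ProbM1 = p1, ProbM2 = p2, BF = BF))
```

If no posterior samples fall in one of the intervals the estimated odds are zero or infinite, which signals that more samples (or an analytic calculation) are needed.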
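A final note on Bayes factors is that they resemble the likelihood ratio
statistic, which is commonly used in frequentist hypothesis testing. Applying
Bayes' rule to the posterior model probabilities in (5.4), the prior model
probabilities cancel and the Bayes factor is
$$BF = \frac{\int f(\mathbf{Y}|\boldsymbol{\theta}; M_2)\pi(\boldsymbol{\theta}|M_2)\,d\boldsymbol{\theta}}{\int f(\mathbf{Y}|\boldsymbol{\theta}; M_1)\pi(\boldsymbol{\theta}|M_1)\,d\boldsymbol{\theta}}, \tag{5.7}$$
i.e., the ratio of the marginal distributions of the data under the two models. This
resembles the likelihood ratio from classical hypothesis testing
$$LR = \frac{f(\mathbf{Y}|\hat{\boldsymbol{\theta}}_2)}{f(\mathbf{Y}|\hat{\boldsymbol{\theta}}_1)}, \tag{5.8}$$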
where θ̂j is the MLE under model j = 1, 2. Therefore, both measures compare
models based on the ratio of their likelihoods, but the BF integrates over
prior uncertainty in the parameters whereas the likelihood ratio statistic
plugs in point estimates.
Beta-binomial example: To build intuition about BFs, we begin with
the univariate binomial model and test whether the success probability equals
0.5. Let Y |θ ∼ Binomial(n, θ) and consider two models for θ:

$$M_1: \theta = 0.5 \quad \mathrm{versus} \quad M_2: \theta \neq 0.5 \ \mathrm{and}\ \theta \sim \mathrm{Beta}(a, b). \tag{5.9}$$

The first model has no unknown parameters, and the second model’s parame-
ter θ can be integrated out giving the beta-binomial model for Y (see Appendix
A.1). Therefore, these hypotheses about θ correspond to two different models
for the data

$$M_1: Y \sim \mathrm{Binomial}(n, 0.5) \quad \mathrm{versus} \quad M_2: Y \sim \mathrm{BetaBinom}(n, a, b). \tag{5.10}$$

Assuming priors Prob(Mj) = qj and observed data Y = y, the BF of Model
2 relative to Model 1 is
$$BF(y) = \frac{f_{BB}(y; n, a, b)}{f_B(y; n, 0.5)} \tag{5.11}$$
where fBB and fB are the beta-binomial and binomial PMFs, respectively.
Figure 5.1 plots BF (y) for n = 20 and different hyperparameters a and b.
Assuming the uniform prior (a = b = 1), the BF exceeds 10 for y < 5 or y > 15
successes and exceeds 100 for y < 3 or y > 17 successes; Jeffreys’ prior
(a = b = 0.5) gives similar results, with BF > 10 for y < 4 or y > 16 and
BF > 100 for y < 3 or y > 17. These scenarios give strong and decisive evidence,
respectively, against the null hypothesis that θ = 0.5 in favor of the
alternative that θ ≠ 0.5.
The BF is less than 10 for all possible y for the strong prior centered on 0.5
(a = b = 50). Under this prior the two hypotheses are similar and it is difficult
to distinguish between them. Finally, under the strong prior that θ is close to
one (a = 50 and b = 1), the BF exceeds 10 only for y > 16 successes; in this
case y near zero is even less likely under the alternative than under the null
that θ = 0.5.
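The Bayes factor in (5.11) is simple to evaluate directly. The sketch below writes the beta-binomial PMF with base R's `beta()` and `choose()` functions; the helper names `dbetabinom` and `bf_betabin` are hypothetical and not part of base R.

```r
# Beta-binomial PMF: choose(n, y) * B(y + a, n - y + b) / B(a, b)
dbetabinom <- function(y, n, a, b) {
  choose(n, y) * beta(y + a, n - y + b) / beta(a, b)
}

# Bayes factor (5.11) of M2 (beta-binomial) relative to M1 (theta = 0.5)
bf_betabin <- function(y, n, a, b) dbetabinom(y, n, a, b) / dbinom(y, n, 0.5)

n <- 20
y <- 0:n
BF <- cbind(uniform  = bf_betabin(y, n, 1, 1),
            jeffreys = bf_betabin(y, n, 0.5, 0.5),
            strong05 = bf_betabin(y, n, 50, 50),
            strong1  = bf_betabin(y, n, 50, 1))
rownames(BF) <- y
print(signif(BF, 3))

# Values of y giving BF > 10 under the uniform prior
print(y[BF[, "uniform"] > 10])
```

Printing the full table makes it easy to read off, for each prior, which outcomes cross the thresholds of 10 and 100.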
Normal-mean example: Say there is a single observation Y |µ ∼
Normal(µ, 1) and the objective is to test whether µ = 0. The competing
hypotheses are

$$M_1: \mu = 0 \quad \mathrm{versus} \quad M_2: \mu \neq 0 \ \mathrm{and}\ \mu \sim \mathrm{Normal}(0, \tau^2). \tag{5.12}$$

Given that we observe Y = y, it can be shown that the BF of M2 relative to
M1 is
$$BF(y) = (1 + \tau^2)^{-1/2} \exp\left(\frac{y^2}{2}\,\frac{\tau^2}{1 + \tau^2}\right). \tag{5.13}$$
FIGURE 5.1
Bayes factor for the beta-binomial model. (left) Beta(a, b) prior PDF
for several combinations of a and b and (right) observed data Y versus the
Bayes factor comparing the beta-binomial model Y|θ ∼ Binomial(n, θ) and
θ ∼ Beta(a, b) versus the null model Y ∼ Binomial(n, 0.5) for n = 20 and
several combinations of a and b. [Panels: prior PDF versus θ; Bayes factor
(log scale) versus Y; curves for (a, b) = (1, 1), (0.5, 0.5), (50, 50) and (50, 1).]

For fixed τ, BF(y) increases to infinity as y² increases, as expected, because
data far from zero contradict M1: µ = 0. However, for any fixed y, BF(y)
converges to zero as the prior variance τ² increases. Therefore, even if the
observation is 10 standard deviation units above zero, the BF favors the null
model with µ = 0 if the prior variance is sufficiently large. This is an example
of Lindley’s paradox [51] where Bayesian tests can perform poorly depending
on the prior. For the normal means problem, this odd result occurs because
with large τ these data are unlikely under both models. For example, the
probability of |Y | ∈ [9, 11] is very small under the null hypothesis that µ = 0,
but under the alternative hypothesis this probability converges to zero as τ
increases because the marginal distribution of Y (averaging over µ) becomes
increasingly diffuse.
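The behavior of (5.13) in τ can be seen with a few lines of R. This is only a sketch with hypothetical function names; `bf_check` recomputes the same quantity as the ratio of the marginal density of Y under M2, Normal(0, 1 + τ²), to its density under M1, Normal(0, 1).

```r
# Bayes factor (5.13) of M2: mu ~ Normal(0, tau^2) relative to M1: mu = 0
bf_normal <- function(y, tau) {
  (1 + tau^2)^(-1/2) * exp(0.5 * y^2 * tau^2 / (1 + tau^2))
}

# Same quantity written as a ratio of marginal densities of Y
bf_check <- function(y, tau) dnorm(y, 0, sqrt(1 + tau^2)) / dnorm(y, 0, 1)

tau <- 10^(0:6)
print(cbind(tau = tau, BF = bf_normal(3, tau)))   # y = 3 held fixed
```

For a fixed observation the BF need not decrease monotonically for small τ, but it eventually falls below one and tends to zero as τ grows.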
BFs are not defined with improper priors, and this example suggests that
they can have strange properties for uninformative priors. It has been argued
[9] that the standard Cauchy prior µ ∼ t1(0, 1) is preferred to the
large-variance normal prior because this prior is diffuse without having a variance
parameter to tune. In general, prior selection has more impact for hypothesis
testing than estimation, making it imperative to report Bayes factors for
multiple priors to illuminate this sensitivity.
Lindley’s paradox is more pronounced for tests of point null hypotheses
such as M1 : µ = 0 than for one-sided tests such as

M1 : µ ≤ 0 versus M2 : µ > 0.

To compute the BF for these hypotheses we simply fit the Bayesian model
Y|µ ∼ Normal(µ, 1) and µ ∼ Normal(0, τ²) and compute Prob(M2|Y) =
Prob(µ > 0|Y) using the results in Section 2.1.3. This test is stable as the
prior variance increases because the BF converges to Φ(y)/[1 − Φ(y)] > 0,
where Φ is the standard normal CDF. Since 1 − Φ(y) is the p-value for the
classical one-sided z-test, the BF test with large variance is equivalent to the
frequentist test.
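The limit can be verified numerically. The sketch below uses the standard normal-normal posterior µ|y ∼ Normal(yτ²/(1 + τ²), τ²/(1 + τ²)) and prior odds of one, which hold because the prior is symmetric about zero; the function name `bf_one_sided` is hypothetical.

```r
# One-sided Bayes factor of M2: mu > 0 relative to M1: mu <= 0
bf_one_sided <- function(y, tau) {
  s2 <- tau^2 / (1 + tau^2)       # posterior variance; posterior mean is y * s2
  p2 <- pnorm(y * s2 / sqrt(s2))  # Prob(mu > 0 | y)
  p2 / (1 - p2)                   # posterior odds; prior odds equal 1
}

y <- 1.5
print(sapply(10^(0:4), function(tau) bf_one_sided(y, tau)))
print(pnorm(y) / (1 - pnorm(y)))  # limiting value as tau grows
```

Unlike the point-null case, the values settle quickly as τ increases rather than drifting toward zero.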
Tests based on credible sets are another remedy to Lindley’s paradox. That
is, we simply fit a Bayesian model Y |µ ∼ Normal(µ, 1) and µ ∼ Normal(0, τ 2 )
as in Section 4.1 and reject the null hypothesis that µ = 0 if the posterior
credible set for µ excludes zero. As the prior variance increases, this rule has
the same frequentist operating characteristics as the classic two-sided z-test (or
t-test with unknown error variance). Therefore, this approach is not sensitive
to the prior and has appealing frequentist properties including controlling
Type I error.

5.3 Stochastic search variable selection


In Section 5.2, Bayes factors were used to summarize the data’s support for
M = 2 competing models. In many applications, the number of models under
consideration is large and enumerating all models and computing each
posterior probability is infeasible. For example, in linear regression with p
predictors there are M = 2^p potential models formed by including subsets of
the predictors; with p = 30 this is over a billion potential models. A classical
way to avoid searching over all combinations of covariates is to employ a
systematic search such as forward, backward or stepwise selection. In this
section we discuss a stochastic alternative that randomly visits models according
to their posterior probability.
Stochastic search variable selection (SSVS; [35]) approximates model prob-
abilities using MCMC. SSVS introduces dummy variables to encode the model
and then computes posterior probabilities of the dummy variables to
approximate the posterior model probabilities. For example, consider the linear
regression model
$$Y_i|\boldsymbol{\beta}, \sigma^2 \overset{indep}{\sim} \mathrm{Normal}\left(\sum_{j=1}^{p} X_{ij}\beta_j, \sigma^2\right). \tag{5.14}$$

The M = 2^p models formed by subsets of the predictors can be encoded by
γ = (γ1, ..., γp) where γj = 1 if covariate j is included in the model and
γj = 0 if covariate j is excluded from the model, so that γ = (1, 0, 1, 0, 0, ...)
corresponds to the model E(Yi) = Xi1β1 + Xi3β3.
FIGURE 5.2
Spike and slab prior. PDF (left) and CDF (right) of β under the spike and
slab prior β = γδ where γ ∼ Bernoulli(0.5) and δ ∼ Normal(0, 1).

A common prior over models is to fix γ1 = 1 for the intercept and
$\gamma_j|q \overset{iid}{\sim} \mathrm{Bernoulli}(q)$ for j > 1, where q is the prior inclusion
probability. Given the model γ, the regression coefficients that are included in the
model can have independent normal priors (other priors are possible, [70])
βj|γj = 1 ∼ Normal(0, σ²τ²). The inclusion probability and regression
coefficient variance can be fixed or have priors such as q ∼ Beta(a, b) and
τ 2 ∼ InvGamma(, ). As with Bayes factors (Section 5.2), the posterior for
this model can be sensitive to the prior, and multiple priors should be com-
pared to understand sensitivity.
The prior for βj induced by this model is plotted in Figure 5.2. The prior
is a mixture of two components: a peak at βj = 0 corresponding to samples
that exclude covariate j (γj = βj = 0) and a Gaussian curve corresponding
to samples that include covariate j (γj = 1 so βj = δj). Because of this
distinctive shape, this prior is often called the spike-and-slab prior.
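To see the two components, one can simulate from the prior. The sketch below assumes the setting of Figure 5.2, with γ ∼ Bernoulli(0.5) and δ ∼ Normal(0, 1); the object names are hypothetical.

```r
set.seed(4)
S <- 1e5
gamma_j <- rbinom(S, 1, 0.5)      # inclusion indicator
delta_j <- rnorm(S, 0, 1)         # draw from the slab
beta_j  <- gamma_j * delta_j      # spike-and-slab draw

print(mean(beta_j == 0))          # mass of the spike at zero, close to 0.5

# Prior CDF: a jump of 0.5 at zero plus half of the normal CDF
prior_cdf <- function(b) 0.5 * (b >= 0) + 0.5 * pnorm(b)
print(c(empirical = mean(beta_j <= 1), exact = prior_cdf(1)))
```

The jump in the CDF at zero is the spike, and the smooth part is the slab.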
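All M models can simultaneously be written as the supermodel
$$Y_i|\boldsymbol{\beta}, \sigma^2 \overset{indep}{\sim} \mathrm{Normal}\left(\sum_{j=1}^{p} X_{ij}\beta_j, \sigma^2\right)$$
$$\beta_j = \gamma_j\delta_j$$
$$\gamma_j \sim \mathrm{Bernoulli}(q)$$
$$\delta_j \sim \mathrm{Normal}(0, \tau^2\sigma^2).$$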

MCMC samples from this model include different subsets of the covariates.
Posterior samples with γj = 0 have βj = 0 and thus covariate j is excluded from
the model. This supermodel can be fit a single time and give approximations
for all M posterior model probabilities. Since the search over models is done
within MCMC, this is a stochastic search as opposed to systematic searches
such as forward or backward regression. An advantage of stochastic search
is that models with low probability are rarely sampled and so more of the
computing time is spent on high-probability models.
With many possible models, no single model is likely to emerge as having
high probability. Therefore, with large M extremely long chains can be needed
to give accurate estimates of model probabilities. A more stable summary of
the model space is given by the marginal inclusion probabilities MIPj =
E(γj|Y) = Prob(βj ≠ 0|Y). MIPj is the probability that covariate j is included
in the model, averaging over uncertainty in the subset of the other variables
that are included in the model, and can be used to compute the Bayes factor
for the models that do and do not include covariate j, BFj = [MIPj/(1 −
MIPj)]/[q/(1 − q)]. In fact, [6] show that if a single model is required for
prediction, the model that includes covariates with MIPj greater than 0.5 is
preferred to the highest probability model.
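Given MCMC draws of the indicators γ, these summaries are one-liners. The sketch below uses simulated independent indicators as a hypothetical stand-in for SSVS output (real draws would generally be dependent across covariates) and assumes a fixed prior inclusion probability q = 0.5.

```r
set.seed(5)
# Hypothetical stand-in for MCMC output: S draws of (gamma_1, ..., gamma_p)
S <- 5000
incl_prob <- c(0.95, 0.60, 0.10)
gam <- sapply(incl_prob, function(pr) rbinom(S, 1, pr))  # S x p matrix of 0/1
colnames(gam) <- paste0("x", 1:3)

q   <- 0.5                                   # prior inclusion probability
MIP <- colMeans(gam)                         # marginal inclusion probabilities
BFj <- (MIP / (1 - MIP)) / (q / (1 - q))     # BF for including covariate j

# Posterior model probabilities: proportion of draws visiting each gamma
model_id   <- apply(gam, 1, paste, collapse = "")
model_prob <- sort(table(model_id) / S, decreasing = TRUE)

# Covariates with MIP above 0.5
median_model <- names(MIP)[MIP > 0.5]
print(round(MIP, 2))
print(round(BFj, 2))
print(head(model_prob, 3))
print(median_model)
```

With many covariates the table of visited models becomes long and sparse, which is why the MIPs are often the more stable summary.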
Childhood malaria example: Diggle et al. [23] analyze data from n =
1,332 children from the Gambia. The binary response Yi is the indicator that
child i tested positive for malaria. We use five covariates in Xij:
• Age: Age of the child, in days
• Net use: Indicator variable denoting whether (1) or not (0) the child regularly
sleeps under a bed-net
• Treated: Indicator variable denoting whether (1) or not (0) the bed-net is
treated (coded 0 if netuse=0)
• Green: Satellite-derived measure of the greenness of vegetation in the im-
mediate vicinity of the village (arbitrary units)
• PCH: Indicator variable denoting the presence (1) or absence (0) of a health
center in the village
All five covariates are standardized to have mean zero and variance one. We
use the logit regression model
$$\mathrm{logit}[\mathrm{Prob}(Y_i = 1)] = \alpha + \sum_{j=1}^{p} X_{ij}\beta_j. \tag{5.15}$$
The spike-and-slab prior for βj is βj = γjδj where γj ∼ Bernoulli(0.5) and
δj ∼ Normal(0, τ²). Listing 5.1 gives JAGS code to fit this model.
Table 5.1 gives the model probabilities, i.e., the proportion of the MCMC
samples with γ corresponding to each model. Only three models have poste-
rior probability greater than 0.01. All three models include age, net use and
greenness, and differ based on whether they include bed-net treatment, the
health center indicator, or both. The marginal inclusion probabilities M IPj
in Table 5.2 exceed 0.5 for all covariates and therefore the best single model
for prediction likely includes all covariates [6]. The posterior density for the
regression coefficient corresponding to age (Figure 5.3) is bell-shaped because
it escapes the prior spike at zero; however, the posterior for proximity to a
health center retains considerable mass at zero and thus the spike-and-slab
shape.

Listing 5.1
JAGS code for SSVS.
    # Likelihood: logistic regression with five covariates
    for(i in 1:n){
      Y[i] ~ dbern(pi[i])
      logit(pi[i]) <- alpha + X[i,1]*beta[1] +
                      X[i,2]*beta[2] + X[i,3]*beta[3] +
                      X[i,4]*beta[4] + X[i,5]*beta[5]
    }
    # Spike-and-slab prior: beta[j] is zero when gamma[j] = 0
    for(j in 1:5){
      beta[j]  <- gamma[j]*delta[j]
      gamma[j] ~ dbern(0.5)
      delta[j] ~ dnorm(0,tau)   # tau is a precision in JAGS
    }
    alpha ~ dnorm(0,0.01)
    tau   ~ dgamma(0.1,0.1)

TABLE 5.1
Posterior model probabilities for the Gambia analysis. All other
models have posterior probability less than 0.01.

Covariates                                         Probability
Age, Net use, Greenness, Treated                   0.42
Age, Net use, Greenness, Treated, Health center    0.37
Age, Net use, Greenness, Health center             0.20

TABLE 5.2
Marginal posteriors for the Gambia analysis. Posterior inclusion
probabilities (i.e., Prob(βj ≠ 0|Y)) and posterior median and 90% intervals for the
βj.

Covariate        Inclusion Prob    Median    95% Interval
Age              1.00               0.26     (0.19, 0.34)
Net use          1.00              -0.25     (-0.34, -0.17)
Greenness        1.00               0.29     (0.21, 0.37)
Treated          0.79              -0.13     (-0.24, 0.00)
Health center    0.56              -0.05     (-0.19, 0.00)

FIGURE 5.3
Posterior distribution for the SSVS analysis. Posterior distribution for
the regression coefficients βj for age and proximity to a health center.
[Panels: Age; Health center. Axes: βj versus posterior density.]
High-dimensional regression example: The data for this example are
from [50] and can be downloaded from http://www.ncbi.nlm.nih.gov/geo
(accession number GSE3330). In this study, n = 60 mice (31 female) were
sampled and the physiological phenotype stearoyl-CoA desaturase 1 (SCD1) is
taken as the response to be regressed onto the expression levels of 22,575
genes. Following [12] and [84], we use only the p = 1,000 genes with highest
pairwise correlation with the response as predictors in the model. Even after
this simplification, this leaves a high-dimensional problem with p > n.
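We use the linear regression model with SSVS prior
$$Y_i \sim \mathrm{Normal}\left(\alpha + \sum_{j=1}^{p} X_{ij}\beta_j, \sigma_e^2\right) \tag{5.16}$$
$$\beta_j = \gamma_j\delta_j$$
$$\gamma_j \sim \mathrm{Bernoulli}(q)$$
$$\delta_j \sim \mathrm{Normal}(0, \sigma_e^2\sigma_b^2).$$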

Because p is large, it is possible to learn about the hyperparameters q
and σb². We select priors q ∼ Beta(1, 1) and σe², σb² ∼ InvGamma(0.1, 0.1)
(Prior 1) and present the marginal inclusion probabilities Prob(βj ≠ 0|Y).
In high-dimensional problems, sensitivity to the prior is especially
concerning. Therefore, we also refit with priors q ∼ Beta(1, 2) (Prior 2) and σb² ∼
InvGamma(0.5, 0.5) (Prior 3). The models are fit using Gibbs sampling with
200,000 iterations, a burn-in of 20,000 iterations discarded and the remaining
samples thinned by 20. This MCMC code requires 1–2 hours on an ordinary
PC.

FIGURE 5.4
High-dimensional regression example. Marginal inclusion probabilities
(MIPj = Prob(βj ≠ 0|Y)) under three priors: (1) q ∼ Beta(1, 1) and σb² ∼
InvGamma(0.1, 0.1), (2) q ∼ Beta(1, 2) and σb² ∼ InvGamma(0.1, 0.1), and
(3) q ∼ Beta(1, 1) and σb² ∼ InvGamma(0.5, 0.5). [Panels: inclusion
probabilities under Prior 2 versus Prior 1; inclusion probabilities under
Prior 3 versus Prior 1.]
Only two genes have inclusion probability greater than 0.5 (Figure 5.4).
The ordering of the marginal inclusion probabilities is fairly robust to the
prior, but the absolute value of the marginal inclusion probabilities varies
with the prior. The posteriorP median (95% interval) of the number of variables
p
included in the model pin = j=1 γj are 33 (12, 535) for Prior 1, 28 (12, 392)
for Prior 2 and 20 (11, 64) for Prior 3. In this analysis, the results are more
sensitive to the prior for the variance than the inclusion probability.

5.4 Bayesian model averaging


A main advantage of the Bayesian approach is the ability to properly han-
dle uncertainty in model parameters when making statistical inference and
predictions. In Section 1.5 (Figure 1.12) we explored the effect of accounting
for parameter uncertainty in prediction. The posterior predictive distribution
(PPD) for observation Y ∗ given the observed data Y averages over uncertainty
in the parameters according to their posterior distribution,
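$$p(Y^*|\mathbf{Y}) = \int p(Y^*, \boldsymbol{\theta}|\mathbf{Y})\,d\boldsymbol{\theta} = \int p(Y^*|\boldsymbol{\theta}, \mathbf{Y})\,p(\boldsymbol{\theta}|\mathbf{Y})\,d\boldsymbol{\theta}. \tag{5.17}$$
In addition to parametric uncertainty, with many potential models no single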
model is likely to emerge as the only viable option, and it is important to ac-
count for model uncertainty in prediction. Denoting the posterior probability
of model m as Prob(M = Mm |Y) = wm (Y), the PPD of Y ∗ averaging over
posterior model uncertainty is
$$p(Y^*|\mathbf{Y}) = \sum_{m=1}^{M} w_m(\mathbf{Y})\,p(Y^*|\mathbf{Y}, M_m) \tag{5.18}$$

where p(Y ∗ |Y, Mm ) is the PPD from model Mm . Making inference that
accounts for model uncertainty using posterior model probabilities is called
Bayesian model averaging (BMA; [45]).
SSVS (Section 5.3) via MCMC provides a convenient way to perform
model-averaged predictions. Draws from the SSVS model include dummy indi-
cators for different models as unknown parameters, and thus naturally average
over models according to their posterior probability. Therefore, if a sample of
Y∗ is made at each MCMC iteration, then these samples follow the Bayesian
model averaged PPD, as desired.
In addition to prediction, BMA can be used for inference on parameters
common to all models. For example, if parameter βj has posterior
distribution p(βj|Y, Mm) under model Mm, then the BMA posterior is
$p(\beta_j|\mathbf{Y}) = \sum_{m=1}^{M} w_m(\mathbf{Y})\,p(\beta_j|\mathbf{Y}, M_m)$. However, BMA results for parameters
can change considerably across models and so it is not always clear how to
interpret an average over models. As an extreme case, in a regression analysis
with collinearity the sign of βj might change depending on the other covariates
that are included, and so it is not obvious that results should be combined
across models.
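The mixture form of the model-averaged PPD in (5.18) can be mimicked in a few lines. In this sketch the posterior model probabilities and the per-model predictive distributions are made-up values chosen only for illustration; in an SSVS analysis the model indicator and the draw of Y∗ would both come from the same MCMC iteration.

```r
set.seed(6)
w <- c(0.7, 0.3)                 # hypothetical posterior model probabilities
S <- 1e5
m <- sample(1:2, S, replace = TRUE, prob = w)   # model visited at each iteration

# Hypothetical per-model PPDs: Normal(1, 1) under M1, Normal(3, 2^2) under M2
ystar <- ifelse(m == 1, rnorm(S, 1, 1), rnorm(S, 3, 2))

print(mean(ystar))                        # close to 0.7 * 1 + 0.3 * 3 = 1.6
print(quantile(ystar, c(0.025, 0.975)))   # model-averaged 95% prediction interval
```

The resulting interval reflects both within-model predictive spread and disagreement between the two models.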

5.5 Model selection criteria


Cross validation and Bayes factors are both difficult to compute for large
datasets and/or complicated models; cross validation requires fitting each
model K times and Bayes factors require difficult integration. Model-fit cri-
teria provide a useful alternative. The criteria considered in this chapter are
defined via the deviance (twice the negative log likelihood) of the data given
the parameters,
$$D(\mathbf{Y}|\boldsymbol{\theta}) = -2\log[f(\mathbf{Y}|\boldsymbol{\theta})]. \tag{5.19}$$
In this chapter, we compare only models with the same deviance function but
different models/priors for the parameters so that the deviance is a
comparable measure of fit across models. For some models, e.g., the random effect
models in Section 4.4 which have equivalent representations conditional on
and marginal over the random effects, the definition of the likelihood and
thus the deviance is not unique, but for most models for independent data the
definition of the deviance is clear.
The Bayesian information criterion (BIC) is one such criterion, defined as
$$BIC = D(\mathbf{Y}|\hat{\boldsymbol{\theta}}_{MLE}) + \log(n)\,p \tag{5.20}$$
where θ̂MLE is the maximum likelihood estimate, n is the sample size and
p is the number of parameters in the model. BIC is split into two terms:
D(Y|θ̂MLE) is small for models that fit the data well and log(n)p is small
for simple low-dimensional models. Since simple models with good fit are
desirable, models with smaller BIC are preferred.
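As a small check of (5.20), the sketch below computes BIC by hand for two linear models fit to simulated data and compares the result to R's built-in `BIC()` function. The helper name `bic` and the simulated data are hypothetical; for `lm` fits the parameter count reported by `logLik()` includes the error variance.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)           # x2 is unrelated to the response

# BIC = deviance at the MLE + log(n) * number of parameters
bic <- function(fit) {
  ll <- logLik(fit)                   # log likelihood at the MLE
  -2 * as.numeric(ll) + log(nobs(fit)) * attr(ll, "df")
}

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)
print(c(small = bic(fit1), large = bic(fit2)))
print(c(BIC(fit1), BIC(fit2)))        # built-in values for comparison
```

Adding the unrelated covariate can only lower the deviance, so whether the larger model is preferred depends on whether that decrease exceeds the log(n) penalty.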
BIC was originally motivated as an approximation to the Bayes factor.
However, BIC has undesirable features from the Bayesian perspective. First,
estimating θ using the MLE does not use prior information and is not
available from MCMC output. Second, quantifying model complexity with the
number of parameters p does not account for informative priors. For example,
if two models have the same number of parameters but one has very strongly
informative priors and the other does not, then the model with informative
priors should be regarded as simpler because it has fewer effective degrees of
freedom.
The deviance information criterion (DIC; [77]) resolves these issues. From
a Bayesian perspective, the deviance D(Y|θ) is a random variable since it is a
function of the random variables θ; therefore D(Y|θ) has a posterior
distribution that can be summarized using its posterior mean D̄. Model complexity
is summarized by the effective number of parameters,
$$p_D = \bar{D} - \hat{D}, \tag{5.21}$$
where D̂ = D(Y|θ̂) and θ̂ is the posterior mean of θ. As shown by example
below, pD is typically not an integer because prior shrinkage can result in
partial degrees of freedom. Both D̄ and θ̂ can be approximated using MCMC
output by computing D(Y|θ) and θ at each iteration and taking the mean
over iterations. Therefore, DIC is straightforward to compute given MCMC
output.
The criterion is then
$$DIC = \bar{D} + p_D = D(\mathbf{Y}|\hat{\boldsymbol{\theta}}) + 2p_D. \tag{5.22}$$
This resembles the Akaike information criterion (AIC) [2], AIC =
D(Y|θ̂MLE) + 2p, but with the posterior mean and effective degrees of
freedom replacing the MLE and the total number of parameters. As with AIC
and BIC, the actual value of the criterion is hard to interpret, but it can be
used to rank models, with models having small DIC being preferred (a loose
rule of thumb is that a difference of 5 is substantial and 10 is definitive, but
it is difficult to establish statistical significance using DIC). The intuition is
that models with small DIC are simple (small pD) and fit well (small D̄).
The effective number of parameters pD is generally less than the number of
parameters p if the priors are strong, as desired. Unfortunately pD can be
outside [0, p] in pathological cases, typically where the posterior mean of θ is
not a good summary of the posterior as is the case in mixture models with
multimodal priors and posteriors.
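The steps for computing DIC from posterior samples can be sketched as follows. The example assumes a normal model with a single unknown mean, a known error standard deviation of one, and a Normal(0, 10²) prior; posterior draws are taken directly from the conjugate posterior as a stand-in for MCMC output. The function name `deviance_fn` is hypothetical.

```r
set.seed(8)
y <- rnorm(50, mean = 2, sd = 1)      # simulated data, error SD known to be 1

# Posterior draws of mu under mu ~ Normal(0, 10^2); stand-in for MCMC output
post_prec <- length(y) + 1 / 10^2
mu <- rnorm(2e4, sum(y) / post_prec, sqrt(1 / post_prec))

deviance_fn <- function(m) -2 * sum(dnorm(y, m, 1, log = TRUE))

D    <- sapply(mu, deviance_fn)       # deviance at each iteration
Dbar <- mean(D)                       # posterior mean of the deviance
Dhat <- deviance_fn(mean(mu))         # deviance at the posterior mean of mu
pD   <- Dbar - Dhat                   # effective number of parameters
DIC  <- Dbar + pD
print(c(Dbar = Dbar, Dhat = Dhat, pD = pD, DIC = DIC))
```

With one parameter and a weak prior, pD should be close to one.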
The motivation of pD as a measure of model size is complex, but
intuition can be built using a few examples. In multiple linear regression with
Y|β ∼ Normal(Xβ, σ²In) and Zellner’s prior β ∼ Normal(0, cσ²(XᵀX)⁻¹),
the effective number of parameters is $p_D = \frac{c}{c+1}\,p$. In this case, the effective
number of parameters increases from zero with a tight prior (c = 0) to p with an
uninformative prior (c = ∞). As a second example, consider the one-way random
effects model (Section 4.4)
$$Y_{ij}|\mu_j \sim \mathrm{Normal}(\mu_j, \sigma^2) \quad \mathrm{and} \quad \mu_j \sim \mathrm{Normal}(0, c\sigma^2), \tag{5.23}$$
for i = 1, ..., n replications within each of the j = 1, ..., p groups and fixed
variance components σ² and c. The effective number of parameters is
$p_D = \frac{c}{c + 1/n}\,p$, which also increases from 0 to p with the prior variance c.
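The Zellner's prior result can be checked by simulation. The sketch below assumes a known error variance, simulated data and hypothetical object names; it draws β from its normal posterior, which has mean c/(c + 1) times the least squares estimate and covariance c/(c + 1) σ²(XᵀX)⁻¹, and compares the Monte Carlo estimate of pD with c/(c + 1) p.

```r
set.seed(9)
n <- 60; p <- 3; sigma2 <- 1; c0 <- 4          # c0 plays the role of c
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, -1, 0.5)) + rnorm(n)

XtX_inv   <- solve(crossprod(X))
shrink    <- c0 / (c0 + 1)
post_mean <- shrink * drop(XtX_inv %*% crossprod(X, y))
post_cov  <- shrink * sigma2 * XtX_inv

# Draw beta from its normal posterior using a Cholesky factor
S <- 2e4
R <- chol(post_cov)                             # post_cov = t(R) %*% R
beta_draws <- sweep(matrix(rnorm(S * p), S, p) %*% R, 2, post_mean, "+")

dev <- function(b) -2 * sum(dnorm(y, drop(X %*% b), sqrt(sigma2), log = TRUE))
pD  <- mean(apply(beta_draws, 1, dev)) - dev(colMeans(beta_draws))
print(c(simulated = pD, formula = shrink * p))
```

Here c = 4 and p = 3, so the formula gives 2.4 effective parameters.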
The Watanabe–Akaike (also known as the widely applicable) information
criterion (WAIC; [30]) is an alternative to DIC. WAIC is proposed as an
approximation to n-fold (i.e., leave-one-out) cross validation. Rather than the
posterior mean of the deviance, WAIC computes the posterior mean of the
likelihood and the posterior variance of the log likelihood. The criterion is
$$WAIC = -2\sum_{i=1}^{n} \log(\bar{f}_i) + 2p_W \tag{5.24}$$
where fit for observation i is measured by the posterior mean of the likelihood,
$$\bar{f}_i = E[f(Y_i|\boldsymbol{\theta})|\mathbf{Y}] \tag{5.25}$$
and model complexity is measured by
$$p_W = \sum_{i=1}^{n} \mathrm{Var}[\log(f(Y_i|\boldsymbol{\theta}))|\mathbf{Y}], \tag{5.26}$$
defined as the sum of the posterior variances of the log likelihood functions.
Therefore, as with BIC and DIC, models with small WAIC are preferred because
they are simple and fit well.
Selecting a random effect model for the Gambia data: As in Section
5.3, the binary response Yi indicates that child i tested positive for malaria and
we consider covariates for age, bed-net use, bed-net treatment, greenness and
health center. In this analysis we also consider child i’s village vi ∈ {1, ..., 65}
to account for dependence in the malaria status of children from the same
village (Figure 5.5 plots the location and number of children sampled from
each village). We use the random effects logistic regression model
$$\mathrm{logit}[\mathrm{Prob}(Y_i = 1)] = \alpha + \sum_{j=1}^{p} X_{ij}\beta_j + \theta_{v_i}, \tag{5.27}$$


FIGURE 5.5
Gambia data. The location and number of children sampled from each
village. [Map legend: villages with <25, 25−35 and >35 children sampled.]

where θv is the random effect for village v. We compare three models for the
village random effects via DIC and W AIC:
1. No random effects: θv = 0
2. Gaussian random effects: θv ∼ Normal(0, τ 2 )
3. Double-exponential random effects: θv ∼ DE(0, τ 2 )
In all models, the priors are α, βj ∼ Normal(0, 100) and τ 2 ∼
InvGamma(0.1, 0.1).
Code for Model 3 is given in Listing 5.2. DIC is computed using the
dic.samples function in JAGS, although this unfortunately requires extra
MCMC sampling. There is no analogous function for W AIC in JAGS and
so it must be computed outside of JAGS. In Listing 5.2 the extra line in the
likelihood like[i] instructs JAGS to return posterior samples of the likelihood
function f (Yi |θ) which in this model is the binomial PMF (dbin in JAGS).
After MCMC sampling, the posterior mean of like[i] is computed as the
approximation to f¯i and the posterior variance of log(like[i]) is computed
to approximate pW .
The W AIC and DIC results are in Table 5.3. Both measures show strong
support for including village random effects, but cannot distinguish between
Gaussian and double-exponential random-effect distributions. Since the Gaus-
sian model is more familiar, this is probably the preferred model for these
data.
Listing 5.2
JAGS code to compute WAIC and DIC for the random effects model.
    library(rjags)

    # Model 3: logistic regression with double-exponential village effects
    mod <- textConnection("model{
      for(i in 1:n){
        Y[i] ~ dbern(pi[i])
        logit(pi[i]) <- beta[1] + X[i,1]*beta[2]
                                + X[i,2]*beta[3] + X[i,3]*beta[4]
                                + X[i,4]*beta[5] + X[i,5]*beta[6]
                                + theta[village[i]]
        like[i] <- dbin(Y[i],pi[i],1) # For WAIC computation
      }
      for(j in 1:6){beta[j] ~ dnorm(0,0.01)}
      for(j in 1:65){theta[j] ~ ddexp(0,tau)}
      tau ~ dgamma(0.1,0.1)
    }")

    data  <- list(Y=Y,X=X,n=n,village=village)
    model <- jags.model(mod,data = data, n.chains=2,quiet=TRUE)
    update(model, 10000, progress.bar="none")       # Burn-in
    samps <- coda.samples(model, variable.names=c("like"),
                          n.iter=50000, progress.bar="none")

    # Compute DIC (requires extra MCMC sampling)
    DIC <- dic.samples(model,n.iter=50000,progress.bar="none")

    # Compute WAIC
    like <- rbind(samps[[1]],samps[[2]]) # Combine the two chains
    fbar <- colMeans(like)               # Posterior mean of the likelihood
    Pw   <- sum(apply(log(like),2,var))  # Sum of posterior variances of the log likelihood
    WAIC <- -2*sum(log(fbar))+2*Pw

TABLE 5.3
Model selection criteria for the Gambia data. DIC (pD) and WAIC
(pW) for the three random effects models.

Random-effects model    DIC (pD)       WAIC (pW)
None                    2526 (6.0)     2525 (6.0)
Gaussian                2333 (55.1)    2333 (53.4)
Double-exponential      2334 (57.1)    2333 (54.3)

2016 Presidential Election example: The data for this analysis come
from Tony McGovern’s very useful data repository
(https://github.com/tonmcg/County_Level_Election_Results_12-16). The response
variable, Yi, is the percent increase in Republican (GOP) support from 2012 to
2016, i.e.,
$$100\left(\frac{\%\ \mathrm{in\ 2016}}{\%\ \mathrm{in\ 2012}} - 1\right), \tag{5.28}$$
in county i = 1, ..., n (Figure 5.6a). The election data are matched with p = 10
county-level census variables (Xij ) obtained from Kaggle via Ben Hamner2 :
1. Population, percent change – April 1, 2010 to July 1, 2014
2. Persons 65 years and over, percent, 2014
3. Black or African American alone, percent, 2014
4. Hispanic or Latino, percent, 2014
5. High school graduate or higher, percent of persons age 25+, 2009–2013
6. Bachelor’s degree or higher, percent of persons age 25+, 2009–2013
7. Homeownership rate, 2009–2013
8. Median value of owner-occupied housing units, 2009–2013
9. Median household income, 2009–2013
10. Persons below poverty level, percent, 2009–2013.
All covariates are centered and scaled (e.g., Figure 5.6b). The objective is to
determine the factors that are associated with an increase in GOP support.
Also, following the adage that “all politics is local,” we explore the possibility
that the factors related to GOP support vary by state.
For a county in state s, we assume the linear model
$$Y_i = \beta_{0s} + \sum_{j=1}^{p} X_{ij}\beta_{js} + \varepsilon_i, \tag{5.29}$$
where $\beta_{js}$ is the effect of covariate j in state s and
$\varepsilon_i \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$. We
compare three models for the $\beta_{js}$:
1. Constant slopes: $\beta_{js} \equiv \beta_j$ for all states s
2. Varying slopes, uninformative prior: $\beta_{js} \overset{iid}{\sim} \text{Normal}(0, 10^2)$
3. Varying slopes, informative prior: $\beta_{js} \overset{indep}{\sim} \text{Normal}(\mu_j, \sigma_j^2)$
In all models, the prior for the error variance is $\sigma^2 \sim \text{InvGamma}(0.1, 0.1)$. In
the first model the slopes have uninformative priors $\beta_j \sim \text{Normal}(0, 10^2)$.
In the final model, the means $\mu_j$ and variances $\sigma_j^2$ are given priors
$\mu_j \sim \text{Normal}(0, 10^2)$ and $\sigma_j^2 \sim \text{InvGamma}(0.1, 0.1)$ and are estimated from the data
so that information is pooled across states via the prior. The three models
are compared using DIC and WAIC, computed as in Listing 5.3 for the constant
slopes model.
We first study the results from Model 1 with the same slopes in all states.
Table 5.4 shows that all covariates other than home-ownership rate are
associated with the election results. GOP support tended to increase in counties
2 https://www.kaggle.com/benhamner/2016-us-election

[Figure 5.6: two county-level maps. Panel (a): Percent increase in GOP support
(legend bins span −44.4 to 61.7; NA where data are missing). Panel (b):
Bachelor's degree or higher, percent of persons age 25+, 2009−2013
(standardized; legend bins span −1.8763 to 6.1896).]

FIGURE 5.6
2016 Presidential Election data. Panel (a) plots the percentage change in
Republican (GOP) support from 2012 to 2016 (Yi) and Panel (b) plots the
percent of persons age 25+ in each county with a bachelor's degree or higher
(standardized to have mean zero and variance one; Xi7).

Listing 5.3
JAGS code to compute DIC and WAIC for the constant slopes model.
library(rjags)

model_string <- "model{
  for(i in 1:n){
    Y[i] ~ dnorm(Xb[i],taue)
    Xb[i] <- inprod(X[i,],beta[])        # Deterministic linear predictor
    like[i] <- dnorm(Y[i],Xb[i],taue)    # For WAIC
  }
  for(j in 1:p){beta[j] ~ dnorm(0,0.01)}
  taue ~ dgamma(0.1,0.1)                 # Error precision
}"

# Put the data and model statement in JAGS format
dat <- list(Y=Y,n=n,X=X,p=p)
init <- list(beta=rep(0,p),taue=1)
model <- jags.model(textConnection(model_string),n.chains=2,
                    inits=init,data = dat)

# Burn-in samples
update(model, 10000, progress.bar="none")

# Compute DIC
dic <- dic.samples(model,n.iter=50000)

# Compute WAIC
samps <- coda.samples(model, variable.names=c("like"),
                      n.iter=50000)
like <- rbind(samps[[1]],samps[[2]]) # Combine the two chains
fbar <- colMeans(like)
Pw <- sum(apply(log(like),2,var))
WAIC <- -2*sum(log(fbar)) + 2*Pw

TABLE 5.4
2016 Presidential Election multiple regression analysis. Posterior
mean (95% interval) for the slopes βj for the model with the same slopes
in each state.

Covariate                    Median    95% interval
Population change            -1.14     (-1.46, -0.81)
Percent over 65               0.93     (0.54, 1.32)
Percent African American     -1.56     (-1.89, -1.23)
Percent Hispanic             -2.06     (-2.40, -1.72)
Percent HS graduate           1.75     (1.25, 2.26)
Percent bachelor's degree    -6.19     (-6.71, -5.67)
Home-ownership rate           0.01     (-0.39, 0.41)
Median home value            -1.52     (-1.99, -1.06)
Median income                 1.88     (1.13, 2.61)
Percent below poverty         1.48     (0.91, 2.04)

with decreasing population, high proportions of seniors and high school
graduates, low proportions of African Americans and Hispanics, high income but
low home value, and high poverty rate.
The DIC (D̄, pD) for the three models are 21312 (21300, 12) for Model
1 with constant slopes, 18939 (18483, 455) for Model 2 with varying slopes
and uninformative priors, and 18842 (18604, 238) for Model 3 with varying
slopes and informative priors. The first model is not rich enough to capture the
important trends in the data and thus has high D̄ and DIC. The second model
has the best fit to the observed data (smallest D̄), but is too complicated
(large pD). The final model balances model complexity and fit and has the
smallest DIC. WAIC gives similar results, with WAIC (pW) equal to 21335
(20), 18971 (406) and 18909 (259) for Models 1–3, respectively.
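The trade-off between fit and complexity can be read off directly from the reported numbers, since DIC is the sum of the posterior mean deviance D̄ and the effective number of parameters pD. The short R sketch below simply redoes that arithmetic; the vector names are our own labels and are not part of the analysis code.

```r
# Reported posterior mean deviance (Dbar) and effective number of
# parameters (pD) for Models 1-3, copied from the text.
Dbar <- c(constant = 21300, varying_uninf = 18483, varying_inf = 18604)
pD   <- c(constant = 12,    varying_uninf = 455,   varying_inf = 238)

# DIC adds the complexity penalty pD to the fit term Dbar.
DIC <- Dbar + pD
print(DIC)

# Smallest Dbar identifies the best raw fit; smallest DIC the preferred model.
best_fit <- names(which.min(Dbar))
best_dic <- names(which.min(DIC))
cat("Best fit (Dbar):", best_fit, "\n")
cat("Preferred (DIC):", best_dic, "\n")
```

The sum for Model 2 comes out one unit below the reported DIC, presumably because the reported components are rounded.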
Inspection of the posterior of the variances σj2 from Model 3 shows
that the covariate effect that varies the most across states is the proportion
of the county with a Bachelor's degree. Figure 5.7 maps the posterior mean
and standard deviation of the state-level slopes, β7s. The association between
the proportion of the population with a Bachelor's degree and change in GOP
support is the strongest (most negative) in New England and the Midwest. The
posterior standard deviation is the smallest in Colorado and Texas, possibly
because these states have high variation in the covariate across counties.
Simulation study to evaluate the performance of DIC and WAIC:
In these examples DIC and WAIC gave similar results. However, in practice
there will be cases where they differ and the user will have to choose
between them and defend their choice. Also, when applied to real data as above
the “correct” model is unknown and so we cannot say for certain that either
method selected the right model. One way to build trust in these criteria (or
any other statistical method) is to evaluate their performance for simulated

[Figure 5.7: two state-level maps. Panel (a): Effect of college graduates,
posterior mean (legend bins span −8.73 to −3.22). Panel (b): Effect of college
graduates, posterior SD (legend bins span 0.548 to 1.66).]

FIGURE 5.7
Results of the 2016 Presidential Election analysis. Posterior mean and
standard deviation of the effect of the bachelor-degree rate on GOP support
(β7s) for the model with a different slope in each state and informative prior.

data where the correct model is known (see Section 7.3 for more discussion of
simulation studies).
The data for this simulation experiment are generated to mimic the Gambia
data. The response Yi is binary and generated from the random effects
logistic regression model

$$\text{logit}[\text{Prob}(Y_i = 1)] = \alpha + X_i\beta + \theta_{v_i},$$

where $v_i$ is the village index of observation i and $\theta_v \sim \text{Normal}(0, \sigma^2)$ is the
random effect for village v. Data are generated with n = 100 observations, ten
villages each with ten observations, α = 0, β = 1 and $X_i \overset{iid}{\sim} \text{Normal}(0, 1)$. We
vary the random effect variance $\sigma^2$ to determine how large it must be before
the model selection criteria consistently favor the random effects model.
For each value of σ we generate N = 100 datasets. For each simulated data
set we fit two models: (1) the model without random effects (i.e., θv = 0) and
(2) the full model with random effects and $\theta_v \sim \text{Normal}(0, \sigma^2)$. The priors
are α, β ∼ Normal(0, 100) and $\sigma^2 \sim \text{InvGamma}(0.1, 0.1)$. For each model
and each simulated data set we generate two MCMC chains of 10,000 iterations
each (after a burn-in of 1,000 iterations) and compute both DIC and WAIC. Say
DICmj and WAICmj are the criteria for model m ∈ {1, 2} for simulated
dataset j ∈ {1, ..., N}. Figure 5.8 plots the N samples of DIC2j − DIC1j
(left) and WAIC2j − WAIC1j (right). The random effects model is selected
if these differences are negative, and so the proportion of samples that are
negative is the estimated probability of selecting the random effects model.
The percentage of the datasets for which the random effects model is selected
is given above each boxplot in Figure 5.8.
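The data-generating step and the selection-rate summary of this design can be sketched in a few lines of base R. This is only an illustration of the setup described above: the function names `sim_data` and `selection_rate` are our own, the model fitting (which would use MCMC as in Listing 5.2) is omitted, and the vector of differences is made up purely to show the calculation.

```r
set.seed(1)

# Simulate one dataset from the random effects logistic model:
# logit P(Y_i = 1) = alpha + X_i * beta + theta_{v_i}
sim_data <- function(sigma, n_village = 10, n_per = 10, alpha = 0, beta = 1) {
  village <- rep(seq_len(n_village), each = n_per)  # village index v_i
  X       <- rnorm(n_village * n_per)               # X_i ~ Normal(0, 1)
  theta   <- rnorm(n_village, 0, sigma)             # village random effects
  p       <- plogis(alpha + X * beta + theta[village])
  Y       <- rbinom(length(p), 1, p)
  data.frame(Y = Y, X = X, village = village)
}

dat <- sim_data(sigma = 0.5)

# Given criterion differences (model 2 minus model 1) across simulated
# datasets, the estimated selection probability is the share below zero.
selection_rate <- function(diff) mean(diff < 0)

# Made-up differences, for illustration only (not results from the study).
toy_diff <- c(-3.2, 1.1, -0.4, 2.5, -6.0)
selection_rate(toy_diff)
```

In the full study, `toy_diff` would be replaced by the N = 100 differences in DIC or WAIC obtained by fitting both models to each simulated dataset.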
When data are generated with σ = 0, the correct model is the simple
model without random effects. In this case, both criteria are on average
larger for the overly complex random effects model and select the random
effects model for fewer than 20% of the simulated datasets (17% for DIC and
14% for WAIC). For data generated with σ > 0 the random effects model is
correct, and both methods select it with probability that increases with σ,
from roughly 30% at σ = 0.25 to 86–90% at σ = 1. Therefore, both methods
usually select the correct model for data generated to mimic the Gambia data
once the random effects are large enough. For this particular setting DIC
returns the correct model with slightly higher probability than WAIC when
σ > 0, but of course we cannot generalize this result to other settings.

5.6 Goodness-of-fit checks


Thus far we have discussed methods to choose from a prespecified set of mod-
els. This model selection step is crucial but cannot guarantee that the selected
model actually fits the data well. For example, if all models considered are

[Figure 5.8: two panels of boxplots. Vertical axes: Difference in DIC (left)
and Difference in WAIC (right); horizontal axes: σ. Percentages printed above
the boxplots:

σ                   0    0.25   0.5   0.75    1
Difference in DIC   17    33    51     75    90
Difference in WAIC  14    29    43     73    86]

FIGURE 5.8
Simulation to evaluate selection criteria. Boxplots of the difference
in DIC (left) and WAIC (right) comparing random effects logistic
regression with simple logistic regression for N = 100 simulated datasets
generated with random effect standard deviation σ. Each boxplot represents
the distribution of the difference over N datasets simulated with σ ∈
{0.00, 0.25, 0.50, 0.75, 1.00} and the numbers above the boxplots are the
percentage of the N datasets for which the difference was negative and thus the
criteria favored the random effects model.

Gaussian but the data are not, then even the best fitting model is
inappropriate. Therefore, in addition to comparing models, diagnostics should be
performed to determine if the models capture the important features of the
data.
Standard diagnostic tools are equally important to a Bayesian and
non-Bayesian analysis. For example, in a linear regression, normality (e.g., residual
qq-plot), linearity (e.g., added variable plots), influential points (e.g., Cook’s
D), etc., should be scrutinized. Many of these classic tools are based on
least-squares residuals and therefore are not purely Bayesian, but they remain
valuable informal goodness-of-fit measures.
Another way to critique a model is out-of-sample prediction performance.
Say the data are split into a training set $\mathbf{Y}$ and a test set $\mathbf{Y}^* = (Y_1^*, ..., Y_m^*)$.
Section 1.5 discusses the posterior predictive distribution (PPD) of test set
observation i given the training data (and averaging over uncertainty in model
parameters), $f_i^*(y|\mathbf{Y})$. Comparing the test-set data to the PPD is a way to
verify the model fits well. The PPD evaluated at the observed test observation,
$$CPO_i = f_i^*(Y_i^*|\mathbf{Y}), \tag{5.30}$$
is called the conditional predictive ordinate (CPO) [27, 65]. Test set
observations with small CPO do not fit the model well and are potentially outliers.
A more interpretable diagnostic is to check that roughly 95% of the
test-set observations fall in the 95% posterior prediction intervals. For continuous

data, the probability integral transform (PIT) statistic [20] provides a measure
of fit for the entire predictive distribution rather than just the 95% intervals.
The PIT is the posterior predictive probability below the test set value, $Y_i^*$,
$$PIT_i = \int_{-\infty}^{Y_i^*} f_i^*(y|\mathbf{Y})\,dy. \tag{5.31}$$
This integral looks daunting, but can be easily approximated as the
proportion of the MCMC samples from the PPD that are below the test-set value.
Typically $PIT_i$ is computed for each test-set observation and these statistics
are plotted in a histogram. If the model fits well, then the PIT statistics should
follow a Uniform(0,1) distribution and the PIT histogram should be flat.
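Both the interval-coverage check and the PIT approximation reduce to simple summaries of posterior predictive draws. The R sketch below illustrates this with stand-in inputs: `Ystar` and `ppd` are simulated here (an assumption made so that the code runs on its own), whereas in a real analysis `Ystar` would be the held-out observations and `ppd[s, i]` the s-th MCMC draw from the PPD of test observation i.

```r
set.seed(2)
m <- 200    # number of test-set observations
S <- 2000   # number of posterior predictive draws per observation

# Stand-in inputs (assumptions): a perfectly calibrated Normal(0, 1) case.
Ystar <- rnorm(m)
ppd   <- matrix(rnorm(S * m), nrow = S, ncol = m)

# PIT_i: proportion of PPD draws that fall below the test-set value.
PIT <- sapply(seq_len(m), function(i) mean(ppd[, i] < Ystar[i]))

# Coverage of the 95% posterior prediction intervals.
lower    <- apply(ppd, 2, quantile, probs = 0.025)
upper    <- apply(ppd, 2, quantile, probs = 0.975)
coverage <- mean(Ystar >= lower & Ystar <= upper)

# Histogram counts in ten equal bins; roughly equal counts suggest a good fit.
# Drop plot = FALSE to draw the PIT histogram instead.
pit_counts <- hist(PIT, breaks = seq(0, 1, by = 0.1), plot = FALSE)$counts
pit_counts
coverage
```

Because the stand-in draws come from the same distribution as the test data, the coverage should be close to 0.95 and the bin counts roughly equal.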
An important consideration when interpreting diagnostic measures is that
even a model that appears to be perfectly calibrated (say with uniform
PIT statistics) is not necessarily the true model (if there is such a thing).
For example, say the data are generated as Y |X ∼ Normal(X, 1) with
X ∼ Normal(0, 1), then the model Y ∼ Normal(0, 2) will fit the data
perfectly well, but is clearly inferior to a model that includes X as a predictor.
Therefore, both model selection and goodness-of-fit testing are important.
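This example is easy to check by simulation. The R sketch below evaluates PIT statistics for both models using the known parameter values rather than a fitted posterior (a simplification we make to keep the code short), and then compares the two models by their average log predictive density; the object names are our own.

```r
set.seed(3)
n <- 5000
X <- rnorm(n)            # X ~ Normal(0, 1)
Y <- rnorm(n, X, 1)      # Y | X ~ Normal(X, 1)

# PIT under the marginal model Y ~ Normal(0, 2) (variance 2) and under
# the conditional model Y | X ~ Normal(X, 1); both use known parameters.
pit_marginal    <- pnorm(Y, 0, sqrt(2))
pit_conditional <- pnorm(Y, X, 1)

# Both PIT samples look Uniform(0,1): means near 1/2, variances near 1/12.
c(mean(pit_marginal), var(pit_marginal))
c(mean(pit_conditional), var(pit_conditional))

# Yet the model that uses X predicts better (higher log predictive density).
lpd_marginal    <- mean(dnorm(Y, 0, sqrt(2), log = TRUE))
lpd_conditional <- mean(dnorm(Y, X, 1, log = TRUE))
c(marginal = lpd_marginal, conditional = lpd_conditional)
```

Calibration alone therefore cannot distinguish the two models, while a predictive comparison can.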
Posterior predictive checks: Rather than focusing on predicting
individual test set observations, posterior predictive checks (e.g., [33])
evaluate fit using summaries of the dataset. Let $\tilde{\theta}$ be a posterior sample of the
model parameters and $\tilde{\mathbf{Y}}$ be a replicate dataset drawn from the model given
$\tilde{\theta}$. To facilitate comparisons, the replicate dataset should have the same
dimensions as the observed data. For example, in the linear regression model
$\mathbf{Y} \sim \text{Normal}(\mathbf{X}\beta, \sigma^2\mathbf{I})$, the parameters are $\tilde{\theta} = (\tilde{\beta}, \tilde{\sigma}^2)$ and we would sample

$$\tilde{\mathbf{Y}}|\tilde{\theta} \sim \text{Normal}(\mathbf{X}\tilde{\beta}, \tilde{\sigma}^2\mathbf{I}) \tag{5.32}$$

using the same covariates X as in the original dataset.


Assume that MCMC produces S posterior samples $\theta^{(1)}, ..., \theta^{(S)}$ and we
generate S replicate datasets $\tilde{\mathbf{Y}}_1, ..., \tilde{\mathbf{Y}}_S$ where $\tilde{\mathbf{Y}}_s|\theta^{(s)} \sim p(y|\theta^{(s)})$. If the
model fits well, then $\mathbf{Y}$ and the $\tilde{\mathbf{Y}}_s$ should follow the same distribution, and
thus comparing $\mathbf{Y}$ to the distribution of $\tilde{\mathbf{Y}}_s$ provides an evaluation of whether
the proposed data-generating model is valid. Summarizing fit for an entire
multivariate distribution such as the distribution of $\mathbf{Y}$ is challenging, and so
we restrict comparisons to one-number summaries of the data set, $D(\mathbf{Y})$. For
example, $D(\mathbf{Y})$ could be the mean or maximum value of the dataset, or any
other measure of interest. A visual goodness-of-fit check plots the predictive
distribution of the summary statistics, $D(\tilde{\mathbf{Y}}_1), ..., D(\tilde{\mathbf{Y}}_S)$, as a histogram and
compares the observed measure $D(\mathbf{Y})$ to this distribution; if the observed
value falls far from the center of the distribution then this is evidence the
model does not fit well.
Selecting the summary measure D is clearly important. The most effective
summary measures are those that verify modeling assumptions. For example,
when fitting the model $Y \sim \text{Normal}(\mu, \sigma^2)$, the mean and variance of the data
should fall in the predictive distribution of the mean and variance because
there are parameters in the model for these summaries. Therefore, selecting
D to be the sample mean or variance is not informative. However, the normal
model assumes the distribution is symmetric and there are no parameters in
the model to capture asymmetry. Therefore, taking D to be the skewness
provides a useful verification that this modeling assumption holds.
The Bayesian p-value is a more formal summary of the posterior predictive
check. The Bayesian p-value is the probability under repeated sampling from
the fitted model of observing a summary statistic at least as large as the
observed statistic. This probability can be approximated using MCMC output
as the proportion of the S draws D(Ỹ1 ), ..., D(ỸS ) that are greater than
D(Y). A Bayesian p-value near zero or one indicates that in at least one
aspect the model does not fit well.
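The following R sketch illustrates the calculation for the Normal(µ, σ²) model with two summaries, the mean and the skewness. To keep it self-contained we make two assumptions that are not part of the text: the data are simulated from a skewed (exponential) distribution, and posterior draws are obtained in closed form under the prior p(µ, σ²) ∝ 1/σ² instead of by MCMC. The helper `skewness` is our own moment-based definition.

```r
set.seed(4)
n <- 100
Y <- rexp(n)   # deliberately skewed data (an assumption for illustration)

skewness <- function(y) mean((y - mean(y))^3) / sd(y)^3

# Posterior draws for Normal(mu, sigma^2) under the prior proportional
# to 1/sigma^2 (chosen here only because it has a closed form).
S      <- 5000
sigma2 <- (n - 1) * var(Y) / rchisq(S, df = n - 1)
mu     <- rnorm(S, mean(Y), sqrt(sigma2 / n))

# One replicate dataset per posterior draw (columns of an n x S matrix).
Y_rep <- sapply(seq_len(S), function(s) rnorm(n, mu[s], sqrt(sigma2[s])))

# Bayesian p-values: share of replicate statistics exceeding the observed one.
p_mean <- mean(colMeans(Y_rep) > mean(Y))
p_skew <- mean(apply(Y_rep, 2, skewness) > skewness(Y))
c(mean = p_mean, skewness = p_skew)
```

The p-value for the mean sits near one half because the model has a parameter for the mean, whereas the p-value for the skewness is near zero and flags the asymmetry that the normal model cannot capture.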
The Bayesian p-value resembles the p-value from classical hypothesis
testing in that both quantify the probability under repeated sampling of observing
a statistic at least as large as the observed statistic. An important difference
is that the classical p-value assumes repeated sampling under the null
hypothesis, whereas the Bayesian p-value assumes repeated sampling from the fitted
Bayesian model. Another important difference is that for the Bayesian p-value,
probabilities near either zero or one provide evidence against the fitted model.
Gun control example: Kalesan et al. (2016) [47] study the relationship
between state gun laws and firearm-related death rates. The response variable,
Yi, is the number of firearm-related deaths in 2010 in state i. The analysis
includes five potential confounders (Zij): 2009 firearm death rate per 10,000
people; firearm ownership rate quartile; unemployment rate quartile;
non-firearm homicide rate quartile and firearm export rate quartile. The covariates
of interest in the study are the status of gun laws in the state. Let $X_{il}$ indicate
that state i has law l. In this example, we simply use the number of laws
$X_i = \sum_l X_{il}$ as the covariate. Setting aside correlation versus causation issues,
the objective of the analysis is to determine if there is a relationship between
the number of gun laws in the state and its firearm-related death rate. Our
objective in this section is to illustrate the use of posterior predictive checks
to verify that the model fits well.
We check the fit of two models. The first is the usual Poisson regression
model
$$Y_i \sim \text{Poisson}(\lambda_i) \quad \text{where} \quad \lambda_i = N_i \exp\left(\alpha + \sum_{j=1}^{5} Z_{ij}\beta_j + X_i\beta_6\right) \tag{5.33}$$
and Ni is the state’s population. A concern with the Poisson model is that
because the mean equals the variance it may not be flexible enough to capture
large counts. Therefore, we also consider the negative binomial model with
mean λi and over-dispersion parameter m
$$Y_i \sim \text{NB}\left(\frac{m}{\lambda_i + m}, m\right). \tag{5.34}$$
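To see how the parameterization in (5.34) relaxes the Poisson mean-variance constraint, the R sketch below simulates from both distributions with a common mean. It relies on the fact that `rnbinom(size = m, prob = m / (lambda + m))` in R has mean λ and variance λ + λ²/m; the values of `lambda` and `m` are illustrative choices of ours, not estimates from the gun control data.

```r
set.seed(5)
lambda <- 50       # mean count (illustrative value)
m      <- 5        # over-dispersion parameter (illustrative value)
n_sim  <- 200000

y_pois <- rpois(n_sim, lambda)
# R's rnbinom(size, prob) matches NB(m / (lambda + m), m) in (5.34).
y_nb   <- rnbinom(n_sim, size = m, prob = m / (lambda + m))

rbind(
  poisson   = c(mean = mean(y_pois), var = var(y_pois)),
  negbin    = c(mean = mean(y_nb),   var = var(y_nb)),
  nb_theory = c(mean = lambda,       var = lambda + lambda^2 / m)
)
```

Both samples share the same mean, but the negative binomial variance is much larger, and it shrinks toward the Poisson variance as m grows.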

Listing 5.4
JAGS code for the over-dispersed Poisson (negative-binomial) regression model.
model{
  # Likelihood
  for(i in 1:n){
    Y[i] ~ dnegbin(q[i],m)
    q[i] <- m/(m+N[i]*lambda[i])   # Mean count is N[i]*lambda[i]
    log(lambda[i]) <- alpha + inprod(Z[i,],beta[1:5]) + X[i]*beta[6]
  }

  # Priors
  for(j in 1:6){
    beta[j] ~ dnorm(0,0.1)
  }
  alpha ~ dnorm(0,0.1)
  m ~ dgamma(0.1,0.1)

  # Posterior predictive checks
  for(i in 1:n){
    Y2[i] ~ dnegbin(q[i],m)        # Replicate count for state i
    rate[i] <- Y2[i]/N[i]          # Replicate rate for state i
  }

  D[1] <- min(Y2[])
  D[2] <- max(Y2[])
  D[3] <- max(Y2[])-min(Y2[])
  D[4] <- min(rate[])
  D[5] <- max(rate[])
  D[6] <- max(rate[])-min(rate[])
}

The priors for the fixed effects are α, βj ∼ Normal(0, 10) and for the
negative-binomial model m ∼ Gamma(0.1, 0.1). Code to fit the negative-binomial
model is given in Listing 5.4.
We interrogate the models using posterior predictive checks with the
following six test statistics:
1. Minimum count: D1 (Y) = min{Y1 , ..., Yn }
2. Maximum count: D2 (Y) = max{Y1 , ..., Yn }
3. Range of counts: D3 (Y) = max{Y1 , ..., Yn } − min{Y1 , ..., Yn }
4. Minimum rate: D4 (Y) = min{Y1 /N1 , ..., Yn /Nn }
5. Maximum rate: D5 (Y) = max{Y1 /N1 , ..., Yn /Nn }
6. Range of rates: D6 (Y) = max{Y1 /N1 , ..., Yn /Nn } − min{Y1 /N1 , ..., Yn /Nn }
We use test statistics related to the range of the counts and rates because
[Figure 5.9: six panels of posterior density curves, one per test statistic D.
Bayesian p-values shown in the panel legends:

Test statistic     Poisson    NB
Minimum count      0.93       0.79
Maximum count      0.02       0.34
Range of counts    0.02       0.33
Minimum rate       0.99       0.92
Maximum rate       0.07       0.55
Range of rates     0.01       0.39]

FIGURE 5.9
Bayesian p-values for the gun control example. The density curves
show the posterior predictive distribution of the six test statistics for the
Poisson and negative-binomial (NB) models, and the vertical lines are the
test statistics for the observed data. The Bayesian p-value is given in the
legend’s parentheses.

the main concern here is properly accounting for large counts. Figure 5.9
plots the PPD for both models and all six test statistics. The PPD from the
Poisson model does not capture the largest counts or the range of counts
observed in the dataset. For the Poisson model, the Bayesian p-value is close
to zero for all four test statistics involving either the maximum or the range,
and close to one for the minimum rate. As expected, the
negative-binomial model gives much wider prediction intervals and thus less
extreme Bayesian p-values. Therefore, while the Poisson model is not
appropriate for this analysis, the negative-binomial model appears to be adequate
based on these (limited) tests. However, in both analyses the coefficient
associated with the number of gun laws (β6) is negative with high probability
(95% interval (-0.017, -0.011) for the Poisson model and (-0.026, -0.008) for
the negative-binomial model), and so the conclusion that there is a negative
association between the number of gun laws in a state and its firearm-related
mortality is robust to this modeling assumption.

5.7 Exercises
1. Download the airquality dataset in R. Compare the following two
models using 5-fold cross validation:
M1 : $\text{ozone}_i \sim \text{Normal}(\beta_1 + \beta_2\,\text{solar.R}_i, \sigma^2)$
M2 : $\text{ozone}_i \sim \text{Normal}(\beta_1 + \beta_2\,\text{solar.R}_i + \beta_3\,\text{temp}_i + \beta_4\,\text{wind}_i, \sigma^2)$.
Specify and justify the priors you select for both models.
2. Fit model M2 to the airquality data from the previous problem,
and use posterior predictive checks to verify that the model fits
well. If you find model misspecification, suggest (but do not fit)
alternatives.
3. Assume that Y |λ ∼ Poisson(N λ) and λ ∼ Gamma(0.1, b). The
null hypothesis is that λ ≤ 1 and the alternative is that λ > 1.
Select b so that the prior probability of the null hypothesis is 0.5.
Using this prior, compute the posterior probability of the alternative
hypothesis and the Bayes factor for the alternative relative to the
null hypothesis for (a) N = 10 and Y = 12; (b) N = 20 and Y = 24;
(c) N = 50 and Y = 60; (d) N = 100 and Y = 120. For which N
and Y is there definitive evidence in favor of the alternative?
4. Use the “Mr. October” data (Section 2.4) Y1 = 563, N1 = 2820,
Y2 = 10, and N2 = 27. Compare the two models:
M1 : Y1 |λ1 ∼ Poisson(N1 λ1 ) and Y2 |λ2 ∼ Poisson(N2 λ2 )
M2 : Y1 |λ0 ∼ Poisson(N1 λ0 ) and Y2 |λ0 ∼ Poisson(N2 λ0 ).
using Bayes factors, DIC and WAIC. Assume the Uniform(0,c)
prior for all λj and compare the results for c = 1 and c = 10.
5. Use DIC and WAIC to compare logistic and probit links for the
gambia data in the R package geoR using the five covariates in
Listing 5.2 and no random effects.
6. Fit a logistic regression model to the gambia data in the previous
question and use posterior predictive checks to verify the model fits
well. If you find model misspecification, suggest (but do not fit)
alternatives.
7. For the NBA free throw data in Section 1.6, assume that for player
i, Yi |pi ∼ Binomial(ni , pi ) where Yi is the number of clutch makes,
ni is the number of clutch attempts, and pi is the clutch make
probability. Compute the posterior probabilities of the models:
M1 : logit(pi ) = β1 + logit(qi )
M2 : logit(pi ) = β1 + β2 logit(qi ).

where qi is the overall free throw percentage. Specify the priors you
use for the models’ parameters and discuss whether the results are
sensitive to the prior.
8. Fit model M2 to the NBA data in the previous question and use
posterior predictive checks to verify the model fits well. If you find
model misspecification, suggest (but do not fit) alternatives.
9. Download the Boston Housing Data in R from the Boston dataset.
The response is medv, the median value of owner-occupied homes,
and the other 13 variables are covariates that describe the neighbor-
hood. Use stochastic search variable selection (SSVS) to compute
the most likely subset of the 13 covariates to include in the model
and the marginal probability that each variable is included in the
model. Clearly describe the model you fit including all prior distri-
butions.
10. Download the WWWusage dataset in R. Using data from times t =
5, ..., 100 as outcomes (earlier times may be used as covariates), fit
the autoregressive model
Yt |Yt−1 , ..., Y1 ∼ Normal(β0 + β1 Yt−1 + ... + βL Yt−L , σ 2 )
where Yt is the WWW usage at time t. Compare the models with
L = 1, 2, 3, 4 and select the best time lag L.
11. Using the WWWusage dataset in the previous problem, fit the model
with L = 2 and use posterior predictive checks to verify that the
model fits well. If you find model misspecification, suggest (but do
not fit) alternatives.
12. Open and plot the galaxies data in R using the code below,

> library(MASS)
> data(galaxies)
> ?galaxies
> Y <- galaxies
> n <- length(Y)
> hist(Y,breaks=25)

Model the observations Y1 , ..., Y82 using the Student-t distribution
with location µ, scale σ and degrees of freedom k. Assume prior
distributions $\mu \sim \text{Normal}(0, 10000^2)$, $1/\sigma^2 = \tau \sim \text{Gamma}(0.01, 0.01)$
and k ∼ Uniform(1, 30).
(a) Use posterior predictive checks to evaluate whether the t
distribution captures the mean, variance, skewness and kurtosis of
the data.
(b) Repeat this with k fixed at 30 (so that the model is essentially
Gaussian).
6
Case studies using hierarchical modeling

CONTENTS
6.1 Overview of hierarchical modeling
6.2 Case study 1: Species distribution mapping via data fusion
6.3 Case study 2: Tyrannosaurid growth curves
6.4 Case study 3: Marathon analysis with missing data
6.5 Exercises

6.1 Overview of hierarchical modeling


Thus far we have introduced Bayesian ideas in the context of standard
statistical models. However, one of the primary benefits of Bayesian methodology
is flexibility to handle non-standard cases with irregularities such as missing
values, censored data, variables measured with error, multiple data sources
each with distinct biases and errors, subpopulations with different properties,
etc. Incorporating all of these features in an analysis may seem daunting, but
can often be accomplished by breaking the large model into manageable layers
and combining the layers in a hierarchical model. Hierarchical modeling (also
known as multilevel modeling) is thus an essential model-building tool.
To see how building a model for complex data can be simplified by thinking
hierarchically, say our objective is to specify the joint distribution of three
variables X, Y and Z. Directly fitting a multivariate joint distribution can be
challenging, especially if the three variables have different supports. However,
any trivariate distribution can be written

f (x, y, z) = f (x)f (y|x)f (z|x, y). (6.1)

By ordering the three variables and specifying the univariate marginal
distribution of X and then the univariate conditional distributions for Y |X and
Z|X, Y , the multivariate problem is reduced to three univariate problems.
Because the variables are ordered and each conditional distribution depends
only on the previous variables in the ordering, the resulting joint distribution
is guaranteed to be valid. Also, since any multivariate distribution can be
decomposed this way, there is no loss of flexibility by taking this approach.
The trivariate model is represented as a directed acyclic graph (DAG) in


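[Figure 6.1: two DAGs on the nodes X, Y and Z. Panel (a) has edges X → Y,
X → Z and Y → Z; Panel (b) has edges X → Y and Y → Z.]

FIGURE 6.1
Directed acyclic graphs (DAGs). Panel (a) shows the DAG for the model
f (X, Y, Z) = f (X)f (Y |X)f (Z|X, Y ) and Panel (b) shows the DAG for the
model f (X, Y, Z) = f (X)f (Y |X)f (Z|Y ).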

Figure 6.1a. A DAG (also called a Bayesian network) represents the model
as a graph with each observation and parameter as a node (i.e., the points
that define the graph) and edges (i.e., connections between the nodes) to
denote conditional dependence. To define a valid stochastic model the graph
must be directed and acyclic. A directed graph associates each edge with
a direction; an arrow from X to Y indicates that the hierarchical model is
defined by modeling the conditional distribution of Y as a function of X.
The absence of an arrow from X to Z in Figure 6.1b conveys the choice
that conditioned on Y , Z does not depend on X, i.e., f (z|x, y) = f (z|y); in
contrast, Figure 6.1a is the DAG for the model with conditional dependence
between X and Z given Y . The graph must also be acyclic, meaning that it is
impossible to follow directed edges from a node through the graph and return
to the original node. These two conditions rule out building models such as
p(x, y, z) = p(x|y, z)p(y|z)p(z|y), which may not be a valid joint distribution.
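Sampling from a model specified by a DAG follows the same ordering: each variable is drawn given its parents. The R sketch below does this for the two graphs in Figure 6.1 with three variables that have different supports (binary, count and continuous). The particular distributions and coefficient values are illustrative assumptions of ours, chosen only to make the structure visible.

```r
set.seed(6)
n <- 10000

# DAG (a): X -> Y, X -> Z, Y -> Z. Each variable is drawn given its parents.
X  <- rbinom(n, 1, 0.3)                     # f(x): binary
Y  <- rpois(n, exp(0.5 + X))                # f(y | x): count
Za <- rnorm(n, 1 + 2 * X + 0.5 * Y, 1)      # f(z | x, y): continuous

# DAG (b): X -> Y -> Z. Z depends on X only through Y.
Zb <- rnorm(n, 1 + 0.5 * Y, 1)              # f(z | y)

# Regressing Z on both X and Y recovers the conditional structure:
# the X coefficient is near 2 under (a) and near 0 under (b).
coef_a <- coef(lm(Za ~ X + Y))
coef_b <- coef(lm(Zb ~ X + Y))
round(rbind(a = coef_a, b = coef_b), 2)
```

Under graph (b), X and Z are still marginally dependent through Y, but the dependence disappears once Y is accounted for, which is what the missing arrow encodes.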
Hierarchical models can take many forms, but a general way to build a
model is through a data layer, a process layer and a prior layer. Model
building should begin with the process layer that contains the underlying scientific
processes of interest and the unknown parameters. Building this layer is
ideally done in consultation with domain experts. Once this layer is defined the
statistical objectives can be articulated, for example, to estimate a particular
parameter or test a specific hypothesis. Ideally these objectives dictate the
data to be collected for the analysis. The data layer relates (via the likelihood
function) the data to the process and encodes bias and error in the data
collection procedure, which requires knowledge of how the data were collected.
Finally the prior layer quantifies uncertainty about the model parameters at
the onset of the analysis.
Building a model hierarchically is convenient, but not fundamentally
different from the models we have considered previously. In fact, we have already
encountered many hierarchical models such as the random effects models in
Section 4.4. This means that the computational methods to sample from the
posterior described in Chapter 3 apply to hierarchical models, as do the
graphical and numerical methods used to summarize the posterior.
The hierarchical model for disease progression below is an example of a
model built this way. Although we do not carry out an analysis, this
example illustrates the model building process. Let St and It be the number of
susceptible and infected individuals in a population, respectively, at time t.
Scientific understanding of the disease is used to model disease propagation.
In consultation with an epidemiologist, we might select the simple Reed–Frost
model [1]:

Process layer:
$$I_{t+1} \sim \text{Binomial}\left(S_t,\; 1 - (1 - q)^{I_t}\right)$$
$$S_{t+1} = S_t - I_{t+1}$$

where it is assumed that all infected individuals are removed from the
population before the next time step and q is the probability of a non-infected
person coming into contact with and contracting the disease from an infected
individual. The epidemiological process-layer model expresses the disease
dynamics up to a few unknown parameters. To estimate these parameters, the
number of cases at time t, denoted Yt, is collected. The data layer models
our ability to measure the process It. For example, after discussing the data
collection procedure with domain experts, we might assume there are no false
positives (uninfected people counted as infected) but potentially false
negatives (uncounted infected individuals) and thus

Data layer:
$$Y_t | I_t \sim \text{Binomial}(I_t, p) \tag{6.2}$$

where p is the probability of detecting an infected individual. The Bayesian
model is completed using priors

Prior layer:
$$I_1 \sim \text{Poisson}(\lambda_1), \quad S_1 \sim \text{Poisson}(\lambda_2), \quad p, q \sim \text{Beta}(a, b).$$

Figure 6.2 plots the DAG corresponding to this model.
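Because each layer is a simple conditional distribution, the whole model can be simulated forward from the prior layer to the data layer. The R sketch below does this once; the function name `sim_reed_frost`, the number of time steps and all hyperparameter values are illustrative assumptions of ours, and for simplicity p and q are given different Beta priors here.

```r
set.seed(7)

# Simulate the process and data layers for given parameter values.
sim_reed_frost <- function(T_max, S1, I1, q, p) {
  S <- I <- integer(T_max)
  S[1] <- S1
  I[1] <- I1
  for (t in seq_len(T_max - 1)) {
    # Process layer: a susceptible escapes infection w.p. (1 - q)^I_t.
    I[t + 1] <- rbinom(1, S[t], 1 - (1 - q)^I[t])
    S[t + 1] <- S[t] - I[t + 1]
  }
  # Data layer: each infected individual is detected with probability p.
  Y <- rbinom(T_max, I, p)
  data.frame(t = seq_len(T_max), S = S, I = I, Y = Y)
}

# Prior layer draws (hyperparameter values are illustrative assumptions).
S1 <- rpois(1, 200)
I1 <- rpois(1, 3)
q  <- rbeta(1, 1, 99)
p  <- rbeta(1, 8, 2)

epi <- sim_reed_frost(T_max = 15, S1 = S1, I1 = I1, q = q, p = p)
epi
```

Repeating the simulation for many prior draws gives a prior predictive check of whether the chosen hyperparameters produce plausible epidemic curves.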


Another general idea for building a hierarchical model is to split the data
into homogeneous groups and then pool information across the groups via the
prior distribution. The random slopes model for the jaw bone density data
(plotted in Figure 4.6) discussed in Section 4.4 is an example of a hierarchical
model built this way. The model is

Data layer:    $Y_{ij}|\beta_i \sim \text{Normal}(X_j\beta_i, \sigma^2)$
Process layer: $\beta_i|\mu, \Sigma \sim \text{Normal}(\mu, \Sigma)$
Prior layer:   $\mu \sim \text{Normal}(0, c^2 I_2)$, $\Sigma \sim \text{InvWishart}(\nu, \Omega)$

for i = 1, ..., n = 20 and j = 1, ..., m = 4. Each of the 20 patients gets its
own simple linear regression model, and these regressions are combined in the
process layer that specifies the distribution of the regression coefficients across
individuals in the patient population.

Y2            Y3            Y4            …
                  Data layer
I2, S2        I3, S3        I4, S4        …
                  Process layer
I1, S1, p, q
                  Prior layer
λ1, λ2, a, b

FIGURE 6.2
Directed acyclic graph for the Reed–Frost infectious disease model.

This model is visualized as a DAG in Figure 6.3. The DAG shows how
information moves through the hierarchical model. For example, how does the
data from patient 1 help us predict the bone density for patient 2’s next visit?
To traverse the DAG from Y1 to Y2 requires going through the population
parameters µ and Σ. That is, Y1 informs the model about β 1 which shapes
the random effects distribution that enters the model for β 2 and thus Y2 . If we
only had data from patient 2, then we would likely resort to an uninformative
prior for β 2 and with only a few observations for patient 2 the posterior would
be unstable. However, the hierarchical model allows us to borrow strength
across patients to stabilize the results.
MCMC is a natural choice for fitting hierarchical models; just as hierar-
chical models build complexity by layering simple conditional distributions,
MCMC samples from the complex posterior distribution by sequentially up-
dating parameters from simple full conditional distributions. In fact, display-
ing the hierarchical model as a DAG not only helps understand the model
but it also aids in coding the MCMC sampler because the full conditional
distribution for a parameter depends only on terms with an arrow to or from
the parameter’s node in the DAG. For example, from Figure 6.3 it is clear
that the full conditional distribution for β 2 depends only on the data-layer
terms for Y21 , ..., Y2m |β 2 and the process-layer term for β 2 |µ, Σ. If we view
the model only through these terms then it is immediately clear that the full
conditional distribution of β 2 is exactly the full conditional distribution of the
regression coefficients in standard Bayesian linear regression (Section 4.2).
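
As a sketch of what this conjugate update looks like, write X for the m × 2 design matrix with rows X_j and Y_2 = (Y_21, ..., Y_2m)^T. This matrix notation is an assumption made here for illustration; the expression is the standard normal-normal result rather than a formula quoted from Section 4.2.

```latex
\beta_2 \mid \text{rest} \sim \text{Normal}\left( V\left( \sigma^{-2} X^T Y_2 + \Sigma^{-1}\mu \right),\ V \right),
\qquad
V = \left( \sigma^{-2} X^T X + \Sigma^{-1} \right)^{-1}
```

Only the data-layer term (through X and Y_2) and the process-layer term (through µ and Σ) appear, matching the arrows into and out of the β 2 node.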
Y11, …, Y14      Y21, …, Y24      …      Yn1, …, Yn4
                  Data layer
β1               β2               …      βn
                  Process layer
µ, Σ
                  Prior layer
c, ν, Ω

FIGURE 6.3
Directed acyclic graph. Visual representation of the random slopes model
$Y_{ij} \mid \beta_i \sim \text{Normal}(X_j\beta_i, \sigma^2)$ for i = 1, ..., n and j = 1, ..., 4 with random effect
distribution $\beta_i \mid \mu, \Sigma \sim \text{Normal}(\mu, \Sigma)$ and priors $\mu \sim \text{Normal}(0, c^2 I_2)$ and
$\Sigma \sim \text{InvWishart}(\nu, \Omega)$.

When to stop adding layers? The hierarchical models in this section
have three levels: data, process and prior. However, the values that define
the prior layer will likely not be known exactly, and it is tempting to add a fourth
(and a fifth, etc.) layer to explain this uncertainty. A general rule of thumb
is to stop adding layers when there is no replication to estimate parameters.
For example, referring to the DAG in Figure 6.3, it is reasonable to add a
layer to estimate the random effect mean µ and covariance Σ because there
are repeated random effects β 1 , ..., β n that can be leveraged to estimate these
parameters. However, adding an additional layer to estimate prior mean Ω of
the random effect covariance Σ would not be reasonable because there is only
one covariance Σ in the model, and even if we knew Σ exactly we would not
be able to estimate its distribution from a single sample.
The remainder of this chapter is formatted as a sequence of case studies in
hierarchical modeling. The three case studies each pose different challenges:

1. Species distribution mapping via data fusion: Combining information
from multiple data streams while accounting for their bias and uncertainty
2. Tyrannosaurid growth curves: Pooling information across sub-populations
(species) and quantifying uncertainty in non-linear models with a small
number of observations
3. Marathon analysis with missing data: Accounting for missing
data when performing statistical inference and prediction
In these analyses we demonstrate the flexibility of hierarchical modeling, and
also illustrate complete Bayesian analyses including model and prior specification,
model comparisons, and presentation of the results.

6.2 Case study 1: Species distribution mapping via data fusion

The data for this case study come from [63]. The objective is to map the
spatial distribution of a small songbird, the brown-headed nuthatch (BHNU;
Sitta pusilla), in the southeastern US. There are two data sources, each with
different strengths. The first data source is the Breeding Bird Survey (BBS).
BBS is a network of hundreds of routes surveyed by thousands of volunteers
that has been active since 1966 [74]. Data are collected systematically by
trained volunteers and sites are visited annually to monitor for changes. How-
ever, even with this immense sampling effort, there are spatial and temporal
gaps in BBS coverage. An emerging line of research is to supplement system-
atic survey data with massive citizen science data such as the Cornell Lab of
Ornithology’s eBird database [80], which consists of millions of data points
from thousands of citizen scientists each year. These data are not collected by
trained birders, but have far greater spatial and temporal coverage.
For this analysis, the southeast US is partitioned into n = 741 cells of 0.25 × 0.25
degrees latitude/longitude, and we analyze data from 2012. Let N1i be the number of
BBS sampling occasions in cell i and Y1i ∈ {0, 1, ..., N1i } be the number of
occasions with a sighting of the BHNU. Many cells do not have a BBS route
and thus N1i = Y1i = 0. Similarly, for cell i define N2i as the number of
hours logged by eBird citizen scientists and Y2i as the number of BHNU eBird
sightings. Figure 6.4 maps the data. The BBS sampling effort is fairly uniform,
whereas the eBird effort is more concentrated in populated areas. Both maps
show more BHNU sightings in Alabama, Georgia and the Carolinas.
The true process of interest in cell i is the abundance λi ≥ 0, defined as the
expected number of birds present in the region during one unit of surveying
effort. For a cell that is not inhabited by the BHNU, λi = 0. The data layer
relates the process to the observed data, and requires careful consideration of
the strengths of the two data sources. First we make the assumption that the
BBS and eBird datasets are independent given λi . This seems reasonable as
most eBird users are not following BBS updates. The BBS data are gathered
by expert birders and thus we assume that there are no false positives or
negatives (although more flexible models are available and likely preferable, see
[63] for example). If the number of birds present during a survey is distributed
as Poisson(λi ), then the probability that there is at least one bird present is
1 − exp(−λi ), and so we model the BBS data as Y1i |λi ∼ Binomial[N1i , 1 −
exp(−λi )].

[Figure 6.4: six maps of the southeastern US (longitude −95 to −75, latitude 32 to 40): (a) Sampling occasions, BBS; (b) Sightings, BBS; (c) Sqrt effort, eBird; (d) Sqrt sightings, eBird; (e) Posterior mean abundance; (f) Probability that occupancy exceeds 0.01.]

FIGURE 6.4
2012 Brown-headed nuthatch data. Panels (a) and (b) plot the number of
BBS sampling occasions (N1i ) and number of BBS sightings (Y1i ); Panels (c)
and (d) plot the square root of eBird effort ($\sqrt{N_{2i}}$) and the square root of the
number of eBird sightings ($\sqrt{Y_{2i}}$). Panel (e) plots the posterior mean abundance,
λi , and Panel (f) plots the posterior probability that the occupancy
probability exceeds 0.01, i.e., Prob[1 − exp(−λi ) > 0.01|Y].

For the eBird data, we do allow for false positives and false negatives. The
mean is $N_{2i}\tilde{\lambda}_i$ with rate $\tilde{\lambda}_i = \theta_1\lambda_i + \theta_2$, where θ1 > 0 controls the difference
between observation rates by BBS observers and eBird observers and θ2 > 0 is
the eBird false positive rate, so that if the cell is truly uninhabited and λi = 0,
then E(Y2i ) = N2i θ2 . To allow for over-dispersion we fit the model

$$Y_{2i} \mid \lambda_i, \theta \sim \text{NegBinomial}(q_i, m) \qquad (6.3)$$

with probability $q_i = m/(N_{2i}\tilde{\lambda}_i + m)$ and size m, so that $E(Y_{2i}) = N_{2i}\tilde{\lambda}_i$.
Combining these two contributions to the likelihood, the data layer is

$$\begin{aligned}
\text{Data layer:}\quad & Y_{1i} \mid \lambda_i, \theta_1, \theta_2 \sim \text{Binomial}[N_{1i},\ 1 - \exp(-\lambda_i)] \\
& Y_{2i} \mid \lambda_i, \theta_1, \theta_2 \sim \text{NegBinomial}(q_i, m).
\end{aligned}$$
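
The link between the abundance and the two likelihood contributions can be checked numerically. The sketch below is an illustration only: it assumes the negative binomial is parameterized as in JAGS's `dnegbin(q, m)`, whose mean is m(1 − q)/q, and the function names and the values of `lam` and `N2` are hypothetical (θ1 and m are set to the posterior medians reported in Table 6.1).

```python
import math

def bbs_detection_prob(lam):
    """P(at least one bird is present) when the count is Poisson(lam)."""
    return 1.0 - math.exp(-lam)

def ebird_negbin_params(lam, N2, theta1, theta2, m):
    """Return (q, mean, variance) of the eBird count under NegBinomial(q, m)."""
    mu = N2 * (theta1 * lam + theta2)   # intended mean: effort times eBird rate
    q = m / (m + mu)                    # success probability
    mean = m * (1.0 - q) / q            # negative binomial mean, equals mu
    var = m * (1.0 - q) / q ** 2        # equals mu + mu^2 / m (over-dispersion)
    return q, mean, var

# Hypothetical cell: abundance 0.2 and 10 hours of eBird effort
q, mean, var = ebird_negbin_params(lam=0.2, N2=10, theta1=11.5, theta2=0.0, m=0.45)
print(bbs_detection_prob(0.2), q, mean, var)
```

The variance exceeds the mean by mean²/m, so small values of m correspond to heavy over-dispersion relative to a Poisson model.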

The latent process of interest is the abundance in each cell, λi . It is difficult
to model all the λi without prior knowledge because some cells do not have
BBS data. In the absence of other prior knowledge, we may simply assume that
nearby cells have similar abundance. This is a reasonable assumption if the
underlying factors that drive abundance (climate, habitat, etc.) vary spatially,
and it allows us to pool information locally to estimate abundance. Many spatial
models can be used to model abundance [28], but here we use spline regression
as in Section 4.5.1. The log abundance (we use a log transformation to ensure
λi is non-negative) is a smooth function of the grid cell's spatial location,
si = (si1 , si2 ). Since this is a two-dimensional function we use spline basis
expansions in both the longitude and latitude,

$$\text{Process layer:}\quad \log(\lambda_i) = \sum_{j=1}^{J}\sum_{k=1}^{K} B_j(s_{i1})\,D_k(s_{i2})\,\beta_{jk} \qquad (6.4)$$

where Bj are B-spline basis functions of longitude, Dk are B-spline basis
functions of latitude and $\beta_{jk} \overset{iid}{\sim} \text{Normal}(\beta_0, \sigma^2)$ (here we use different notation
for the basis function in the latitude and longitude directions because they
have a different number of basis functions and thus a different form). Some
of the products Bj (si1 )Dk (si2 ) are near zero for all si in the domain and are
discarded. Since the spatial domain spans a wider range of longitudes than
latitudes, we take K = 2L and J = L for a total of $p = 2L^2$ terms of the
form Xl = Bj (si1 )Dk (si2 ) and select L using DIC. To complete the Bayesian
hierarchical model, we specify uninformative priors

$$\text{Prior layer:}\quad \theta_1, \theta_2, m, \sigma^{-2} \sim \text{Gamma}(0.1, 0.1) \quad\text{and}\quad \beta_0 \sim \text{Normal}(0, 100). \qquad (6.5)$$

JAGS code to implement this model is given in Listing 6.1.

Listing 6.1
Spatial data fusion model for BHNU abundance.
  model{
    # Data layer
    for(i in 1:n){
      Y1[i] ~ dbin(phi[i],N1[i])     # BBS sightings in N1[i] occasions
      phi[i] <- 1-exp(-lam[i])       # Prob. at least one bird is present

      Y2[i] ~ dnegbin(q[i],m)        # eBird sightings (over-dispersed)
      q[i] <- m/(m+N2[i]*(theta1*lam[i]+theta2))
    }

    # Process layer
    for(j in 1:p){beta[j]~dnorm(beta0,tau)}   # tau is the precision 1/sigma^2
    for(i in 1:n){
      log(lam[i]) <- inprod(X[i,],beta[])     # X holds the spline basis products
    }

    # Prior layer
    theta1 ~ dgamma(0.1,0.1)
    theta2 ~ dgamma(0.1,0.1)
    m ~ dgamma(0.1,0.1)
    tau ~ dgamma(0.1,0.1)
    beta0 ~ dnorm(0,0.01)            # precision 0.01, i.e., variance 100 as in (6.5)
  }

Two chains are run, each with a burn-in of 10,000 iterations and 50,000
post-burn-in samples. The samples are thinned by 5 leaving 20,000 samples to
approximate the posterior. The DIC (pD ) is 3107 (30) for L = 4, 3056 (58)
for L = 6, 3015 (89) for L = 8, 2999 (127) for L = 10, 3014 (177) for L = 12
and 3009 (209) for L = 14, and so we proceed with L = 10. With L = 10, the
effective sample size is greater than 1,000 for all βjk , indicating the sampler
has mixed well and sufficiently explored the posterior.
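
The choice of L amounts to picking the fit with the smallest DIC. A short Python sketch using only the DIC and pD values reported above (the variable names are ours):

```python
# (DIC, pD) for each number of basis functions L, as reported in the text
dic_by_L = {4: (3107, 30), 6: (3056, 58), 8: (3015, 89),
            10: (2999, 127), 12: (3014, 177), 14: (3009, 209)}

# Smaller DIC is better; pD grows with L because more basis terms are added
best_L = min(dic_by_L, key=lambda L: dic_by_L[L][0])
print(best_L)  # 10
```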
Table 6.1 presents the posterior distributions of the hyperparameters. Of
note, the eBird false positive rate, θ2 , is estimated to be near zero, and thus
the eBird data appears to be a reliable source of information. The posterior
mean of λi and the posterior probability that cells are occupied (i.e., at least
one individual is present) are mapped in Figures 6.4e and 6.4f, respectively.
As expected, the estimated abundance is the largest in Georgia and the Car-
olinas, but the occupancy probability is also high farther west in Louisiana
and Arkansas. The occupancy probabilities would be lower in these western
states if the eBird data were excluded.

TABLE 6.1
Posteriors for the BHNU analysis. Posterior medians and 95% intervals
for the final fit with L = 10.

Parameter                        Median    95% Interval
Scaling factor, θ1               11.5      (9.2, 14.4)
False positive rate, θ2          0.00      (0.00, 0.00)
Over-dispersion parameter, m     0.45      (0.37, 0.55)
Mean abundance parameter, β0     -5.81     (-6.69, -4.86)
Spline standard deviation, σ     5.58      (4.52, 7.03)

6.3 Case study 2: Tyrannosaurid growth curves

We analyze the data from 20 fossils to estimate the growth curves of four tyrannosaurid
species: Albertosaurus, Daspletosaurus, Gorgosaurus and Tyrannosaurus.
The data are taken from Table 1 of [25] and plotted in Figure 6.5.
The objective is to establish the growth curve, i.e., expected body mass by
age, for each species. The data exhibit non-linear relationships between age
and mass and there are commonalities between species. We therefore pursue
a non-linear hierarchical model.
The original analysis of these data used non-linear least squares (fitted
curves shown in the left panel of Figure 6.5). Quantifying uncertainty in this
fit is challenging. The sampling distribution of the estimator does not have a
closed form due to the non-linear mean structure, and with only a handful of
observations to estimate roughly the same number of parameters, large-sample
normal approximations are not valid and resampling techniques such as the
bootstrap may have insufficient data to approximate the sampling distribu-
tion. As shown below, a Bayesian analysis powered by MCMC fully quantifies
posterior uncertainty.
Let Yij and Xij be the body mass and age, respectively, of sample i from
species j = 1, ..., 4. We model the data as

$$Y_{ij} = f_j(X_{ij})\,\varepsilon_{ij}, \qquad (6.6)$$

where fj is the true growth curve for species j and $\varepsilon_{ij} > 0$ is multiplicative
error with mean one. We use multiplicative error rather than additive error
because variation in the population likely increases with mass/age. Assuming
the errors are log-normal with $\log(\varepsilon_{ij}) \sim \text{Normal}(-\sigma_j^2/2, \sigma_j^2)$, then $E(\varepsilon_{ij}) = 1$
as required and the model becomes

$$\log(Y_{ij}) \sim \text{Normal}\left(\log[f_j(X_{ij})] - \sigma_j^2/2,\ \sigma_j^2\right), \qquad (6.7)$$

so that E(Yij ) = fj (Xij ), where $\sigma_j^2$ controls the error variance for species j.
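
The mean-one property follows from the standard log-normal mean formula, E[exp(Z)] = exp(m + v/2) for Z ∼ Normal(m, v). As a brief check (a standard result, stated here for convenience rather than taken from the text):

```latex
E(\varepsilon_{ij}) = \exp\left(-\frac{\sigma_j^2}{2} + \frac{\sigma_j^2}{2}\right) = 1,
\qquad
E(Y_{ij}) = f_j(X_{ij})\,E(\varepsilon_{ij}) = f_j(X_{ij}).
```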
[Figure 6.5: two scatter plots with a legend for Albertosaurus, Daspletosaurus, Gorgosaurus and Tyrannosaurus. Left: Body Mass (kg) versus Age (years). Right: Log Body Mass (kg) versus Log Age (Years).]

FIGURE 6.5
Tyrannosaurid growth curve data. Panel (a) gives scatter plots of the
estimated age and body mass (kg) of 20 samples of four tyrannosaurid species;
Panel (b) plots the same data after a log transformation of both variables. The
curves plotted in Panel (a) are the fitted logistic curves from [25] and the lines
in Panel (b) are least squares fits.

The data in Figure 6.5 (left) clearly exhibit nonlinearity. However, after
taking a log transformation of both mass and age their relationship is fairly
linear (Figure 6.5, right). Therefore, one model we consider is the log-linear
model

$$\log[f_j(X)] = a_j + b_j \log(X) \qquad (6.8)$$

where aj and bj are the intercept and slope, respectively, for species j. On
the original scale, the corresponding growth curve is $f_j(X) = \exp(a_j)X^{b_j}$. If,
as expected, bj is positive, then the growth curve increases indefinitely, which
may not be realistic. Therefore, we compare the log-linear model with the
logistic growth curve

$$f_j(X) = a_j + b_j\,\frac{\exp[d_j(x - c_j)]}{1 + \exp[d_j(x - c_j)]}, \qquad (6.9)$$

where x = log(X). This model has four parameters:

1. aj is the expected mass at age 0
2. bj is the expected lifetime gain in mass
3. exp(cj ) is the age at which the species reaches half its expected gain
4. dj > 0 determines the rate of increase with age.
The form of the curve is increasing (assuming bj > 0) and plateaus at aj + bj
rather than continuing to increase with age. This is the same function fit by
[25], except that we transform to log age.
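
The two candidate growth curves can be written as short functions to compare their behavior. This is a Python sketch for illustration: the function names and the parameter values are hypothetical and are not estimates from the fossil data.

```python
import math

def loglinear_curve(X, a, b):
    """Log-linear growth curve (6.8) on the original scale: exp(a) * X^b."""
    return math.exp(a) * X ** b

def logistic_curve(X, a, b, c, d):
    """Logistic growth curve (6.9), evaluated at log age x = log(X)."""
    x = math.log(X)
    # exp(z) / (1 + exp(z)) written in the equivalent form 1 / (1 + exp(-z))
    return a + b / (1.0 + math.exp(-d * (x - c)))

# Hypothetical parameters: 30 kg at age 0, 5000 kg lifetime gain,
# half of the gain reached at age 15 (so c = log(15)), rate d = 5
a, b, c, d = 30.0, 5000.0, math.log(15.0), 5.0
half_gain = logistic_curve(15.0, a, b, c, d)    # a + b/2
plateau = logistic_curve(1.0e6, a, b, c, d)     # approaches a + b
print(half_gain, plateau, loglinear_curve(2.0, 0.0, 3.0))
```

The logistic curve levels off near a + b, whereas the log-linear curve keeps increasing with age whenever b is positive.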
In addition to comparing these two forms of growth curves, we also
compare two priors. The first prior (“unpooled”) fits each species separately
using uninformative priors. For the log-linear model the priors are
aj , bj ∼ Normal(0, 10) and $\sigma_j^2 \sim \text{InvGamma}(0.1, 0.1)$ and for the logistic
model the priors are log(aj ), log(bj ), cj , log(dj ) ∼ Normal(0, 10) and
$\sigma_j^2 \sim \text{InvGamma}(0.1, 0.1)$. The normal prior is replaced with the log-normal prior
for aj , bj and dj to ensure these parameters are positive and thus fj (X) is
positive and increasing for all X. The second prior (“pooled”) is a Bayesian
hierarchical model that borrows information across the four species. In the
pooled analysis we assume the variance is the same for all species, $\sigma_j^2 = \sigma^2$, and
has uninformative prior $\sigma^2 \sim \text{InvGamma}(0.1, 0.1)$. For the log-linear model
priors for the intercepts are $a_j \sim \text{Normal}(\mu_a, \sigma_a^2)$, where µa ∼ Normal(0, 10)
and $\sigma_a^2 \sim \text{InvGamma}(0.1, 0.1)$. The same hierarchical model is applied to the
log(aj ), log(bj ), cj and log(dj ) in the logistic model. The JAGS code for this
model is given in Listing 6.2.

Listing 6.2
JAGS code for hierarchical growth curve modeling.
  # n is the total number of observations for all species
  # N is the number of species (here 4)
  # x[i] is the log age of individual i
  # y[i] is the log mass of individual i
  # sp[i] is the species number (1, 2, 3, or 4) of individual i
  model{
    # Data layer
    for(i in 1:n){
      y[i] ~ dnorm(muY[i],taue)
      # Log of the logistic curve minus sigma^2/2, so that E(Y) = f as in (6.7)
      muY[i] <- log(a[sp[i]] + b[sp[i]]/(1+exp(-part[i]))) - 0.5/taue
      # d[] enters as a scale here, so 1/d[j] plays the role of d_j in (6.9)
      part[i] <- (x[i]-c[sp[i]])/d[sp[i]]
    }

    # Process layer
    for(j in 1:N){
      a[j] <- exp(alpha[j,1])
      b[j] <- exp(alpha[j,2])
      c[j] <- alpha[j,3]
      d[j] <- exp(alpha[j,4])

      for(k in 1:4){alpha[j,k] ~ dnorm(mu[k],tau[k])}
    }

    # Prior layer
    for(k in 1:4){
      mu[k] ~ dnorm(0,0.1)
      tau[k] ~ dgamma(0.1,0.1)
    }
    taue ~ dgamma(0.1,0.1)
  }

This hierarchical model treats the parameters across the four species as
random effects, and learning about the random effects distribution (i.e., µa and
$\sigma_a^2$) stabilizes the posterior by providing additional information via the priors.
It is debatable whether these parameters are truly random effects, i.e., whether
there is an infinite distribution of exchangeable species from which these four
were randomly selected for the study. However, analyzing the data from these
four species using a random effects model clearly improves the results (as
shown below) by pooling information across species to reduce uncertainty.

[Figure 6.6: four panels (Albertosaurus, Daspletosaurus, Gorgosaurus, Tyrannosaurus) plotting Body Mass (kg) versus Age (years); legend: Data, Post mean, 95% interval.]

FIGURE 6.6
Fitted log-linear growth curves – unpooled. Observations (points) versus
the posterior mean (solid lines) and 95% intervals (dashed lines) of the
tyrannosaurid growth curves for the unpooled log-linear model.

We fit the model with log-linear and logistic growth curves, each separate
by species (unpooled) and using a hierarchical model (pooled). DIC (pD ) for
the four fits are: 29 (25) for log-linear unpooled, -3 (9) for log-linear pooled,
64 (41) for logistic unpooled and -2 (12) for logistic pooled. The pooled mod-
els reduce model complexity (as measured by pD ) and this leads to smaller
(better) DIC. DIC for the log-linear and logistic growth curves are similar.
Figures 6.6–6.9 plot the posterior mean and pointwise 95% credible interval
for fj for each model and each species (the interval estimates are for fj and not
Yij , so they should not include 95% of the observations). The posterior means
of the four methods are fairly similar and all fit the data well. The main
difference between the fits is that by borrowing information across species,
the pooled analyses have narrower credible sets. Visually, the log-linear fits
in Figures 6.6 and 6.7 appear to sufficiently model the growth curves. However, given
that the logistic curve fits nearly as well and possesses the intuitive property
of plateauing at an advanced age, this model is arguably preferable when
considering the entire life course.

[Figure 6.7: four panels (Albertosaurus, Daspletosaurus, Gorgosaurus, Tyrannosaurus) plotting Body Mass (kg) versus Age (years); legend: Data, Post mean, 95% interval.]

FIGURE 6.7
Fitted log-linear growth curves – pooled. Observations (points) versus
the posterior mean (solid lines) and 95% intervals (dashed lines) of the
tyrannosaurid growth curves for the pooled (hierarchical) log-linear model.

[Figure 6.8: four panels (Albertosaurus, Daspletosaurus, Gorgosaurus, Tyrannosaurus) plotting Body Mass (kg) versus Age (years); legend: Data, Post mean, 95% interval.]

FIGURE 6.8
Fitted logistic growth curves – unpooled. Observations
(points) versus the posterior mean (solid lines) and 95% intervals (dashed
lines) of the tyrannosaurid growth curves for the unpooled logistic model.

[Figure 6.9: four panels (Albertosaurus, Daspletosaurus, Gorgosaurus, Tyrannosaurus) plotting Body Mass (kg) versus Age (years); legend: Data, Post mean, 95% interval.]

FIGURE 6.9
Fitted logistic growth curves – pooled. Observations (points)
versus the posterior mean (solid lines) and 95% intervals (dashed lines) of the
tyrannosaurid growth curves for the pooled (hierarchical) logistic model.

6.4 Case study 3: Marathon analysis with missing data

Missing data are a complicating factor in many analyses. As an example,
consider the 2016 Boston Marathon data from Section 2.1.7. In this analysis,
we build a linear regression model to predict the speed (minutes per mile) of
the final mile (mile 26) of the top female runners as a function of their speeds
in the first 25 miles. Let Yi be the speed of mile 26 for runner i = 1, ..., n = 149
and Xij be the speed for runner i in mile j = 1, ..., p = 25. Approximately 8%
of the speed measurements are missing: the number of missing observations
ranges from 0 to 19 across runners and the percentage of missing observations
ranges from 0% to 55% (miles 6 and 7) across miles (Figure 6.10a).
The easiest resolution to the missing-data problem is to simply discard
observations with missing data and proceed with a complete-case linear re-
gression. Discarding observations with at least one missing Xij would reduce
the sample size from 149 runners to 58. Given that most of the missing values
are for miles 6 and 7 and these covariates are likely not important predic-
tors of the response, it would be wasteful to discard all of these observations
from the analysis. Another simple approach is to impute the missing Xij using
the sample mean of the other runners' speeds at mile j or a
linear regression using the other covariates in the model. A drawback of these
single-imputation approaches is that they do not account for uncertainty in
the missing observations, and thus the resulting posterior inference is ques-
tionable. Multiple imputation techniques are also available [72] (and are often
motivated using Bayesian ideas).
A Bayesian analysis using a hierarchical model is a natural way to han-
dle missing data. The Bayesian approach handles missing data in the same
way as unknown parameters; we represent our uncertainty about them by
treating them as random variables in a hierarchical Bayesian model. As with
an unknown parameter, posterior inference on an unknown missing covariate
requires assigning it a prior distribution. A fairly general model is
Yi |Xi ∼ Normal(Xi β, σ 2 ) where Xi ∼ Normal(µ, Σ) (6.10)
where µ is the mean vector of length p and Σ is the p × p covariance matrix of
the covariates. In the absence of subject-matter prior information, the hyper-
parameters µ and Σ are given priors and estimated as part of the Bayesian
analysis, resembling a random-effects model. In this way, the complete cases
inform the model about the distribution of the covariates across observations
(via µ and Σ), and this information is used to impute the missing values.
Of course, this approach requires a reasonable model for the covariate dis-
tribution. This is challenging when p is large and/or the covariates are non-
Gaussian. For example, if the covariates are a mix of continuous and binary
variables, sufficiently capturing their joint distribution is difficult. A simple
approach is to model the p covariates with independent priors. This is inef-
ficient but at least supplies a reasonable approximation in many situations.
More sophisticated modeling, such as the methods outlined in Section 4.5,
may improve this aspect of the analysis. A crucial (and often unverifiable)
assumption is that there are no systematic biases in the missing data, i.e., they
are missing completely at random. For the present example, a hypothetical
source of systematic bias is that missing times are caused by runners moving
too fast to record their speeds. If this is the case, it would be impossible to observe
this bias because the data are missing, and the model-based imputations
would underestimate the missing speeds leading to questionable inference.

[Figure 6.10: four panels. (a) Missing data pattern (Runner versus Mile); (b) Imputed X for two runners, Runner 3 and Runner 149 (Speed (minute/mile) versus Mile); (c) Posterior of beta, full data set (Posterior versus Mile); (d) Posterior of beta, complete cases (Posterior versus Mile).]

FIGURE 6.10
Missing data analysis of the 2016 Boston Marathon data. Panel (a)
shows the missing (black) and non-missing (white) Xij by runner (i) and mile
(j). Panel (b) plots the observed (points) standardized covariates (Xij ) and
the posterior distributions (boxplots) of the missing covariates for two runners.
Panels (c) and (d) plot the posterior distribution of each regression coefficient,
βj , for the missing-data model and complete-case analysis, respectively.

Assuming that the covariate model is correct, a hierarchical Bayesian anal-
ysis appropriately accounts for uncertainty in the missing values. A Gibbs
sampler would update each parameter (β, σ, etc.) and then cycle through
the missing observations (Xij ) and update them from their full conditional
distributions, treating them exactly the same as the model parameters. There-
fore, each sample of the regression coefficient β is updated using a complete
set of covariates, but the imputed covariates vary across iterations following
their posterior distribution. The missing values are thus effectively treated as
nuisance parameters, and the analysis produces both the posterior predictive
distribution of the missing values and, more importantly, the posterior distribution
of the regression coefficients marginally over the uncertainty in the
missing observations, as desired.
Marathon example: In Section 2.1.7, we modelled the covariance of the
covariates (Σ) using an inverse Wishart model. This model did not assume
any structure among the miles, but the posterior mean of the covariance ma-
trix revealed that subsequent miles are highly correlated. Therefore, here we
model the standardized (to have mean zero and variance one) covariates using
the first-order autoregressive time series model $X_{i1} \sim \text{Normal}(0, \sigma_1^2)$ and
$X_{i,j+1} \mid X_{ij} \sim \text{Normal}(\rho X_{ij}, \sigma_2^2)$ for j ≥ 1. JAGS code for this model is given in
Listing 6.3. All hyperparameters have uninformative priors.
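
To get a feel for the covariate model, one runner's 25 standardized speeds can be simulated from the autoregressive prior. This Python sketch is illustrative only: the function name and the parameter values are assumptions, chosen so that the marginal variance stays at one (σ1 = 1 and σ2² = 1 − ρ²), and it ignores the regression on Yi entirely.

```python
import math
import random

def simulate_ar1_covariates(p, rho, sigma1, sigma2, seed=0):
    """Draw X_1, ..., X_p with X_1 ~ N(0, sigma1^2), X_{j+1} | X_j ~ N(rho X_j, sigma2^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigma1)]
    for _ in range(p - 1):
        # random.gauss takes a standard deviation, not a precision as in JAGS
        x.append(rng.gauss(rho * x[-1], sigma2))
    return x

rho = 0.9  # hypothetical correlation between adjacent miles
x = simulate_ar1_covariates(p=25, rho=rho, sigma1=1.0,
                            sigma2=math.sqrt(1.0 - rho ** 2))
print([round(v, 2) for v in x])
```

With ρ close to one, neighboring miles have similar simulated speeds, which is the structure the model uses to impute a missing mile from its neighbors.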
Figure 6.10b plots the observed covariates (dots) and the posterior dis-
tribution of the missing covariates (boxplots) for two representative runners.
Because of the time series model for the missing covariates, the posterior
distributions of the missing Xij are close to the speeds for the adjacent miles
for both runners. The posterior distributions of the regression coefficients, βj , are plotted
in Figure 6.10c. Only miles 24 and 25 appear to be useful predictors of the
speed in the final mile. The posterior variances in this missing-data analysis
are much smaller than the posterior variances in the complete-case analysis
with n = 58 in Figure 6.10d, illustrating the benefit of the missing data model.

Listing 6.3
JAGS model statement for the missing data analysis of the marathon data.
  model{
    # Likelihood
    for(i in 1:n){
      Y[i] ~ dnorm(alpha + inprod(X[i,],beta[]),taue)
    }

    # Missing-data model (entries of X supplied as NA are sampled by JAGS)
    for(i in 1:n){
      X[i,1] ~ dnorm(0,tau1)
      for(j in 2:p){
        X[i,j] ~ dnorm(rho*X[i,j-1],tau2)
      }
    }

    # Priors
    alpha ~ dnorm(0,0.01)
    for(j in 1:p){
      beta[j] ~ dnorm(0,0.01)
    }
    taue ~ dgamma(0.1, 0.1)
    tau1 ~ dgamma(0.1, 0.1)
    tau2 ~ dgamma(0.1, 0.1)
    rho ~ dnorm(0, 0.01)
  }

6.5 Exercises
1. Since full conditional distributions are used in many MCMC algorithms,
it is tempting to specify the model via its conditional
distributions. For example, consider the conditional distributions:

Y |X ∼ Normal(aX, 1) and X|Y ∼ Normal(bY, 1).

(a) Select values of a and b so that these full conditional distributions
are incompatible, i.e., there is no valid joint distribution
for X and Y that gives these full conditional distributions. Argue,
but do not formally prove, your assertion that the full
conditional distributions are incompatible.
(b) Explain why building a model that produces a valid DAG will
always lead to a valid joint distribution.
2. Draw a DAG for the model in Listing 6.3, derive the full conditional
posterior distribution for X2,2 (i.e., X[2,2], the speed for runner 2
at mile 2), and explain how this would be used in a Gibbs sampler.
3. In this problem we will conduct a meta analysis, i.e., an analysis
that combines the results of several studies. The data are from the
rmeta package in R:

> library(rmeta)
> data(cochrane)
> cochrane
name ev.trt n.trt ev.ctrl n.ctrl
1 Auckland 36 532 60 538
2 Block 1 69 5 61
3 Doran 4 81 11 63
4 Gamsu 14 131 20 137
5 Morrison 3 67 7 59
6 Papageorgiou 1 71 7 75
7 Tauesch 8 56 10 71

The data are from seven randomized trials that evaluate the effect
of corticosteroid therapy on neonatal death. For trial i ∈ {1, ..., 7}
denote Yi0 as the number of events in the Ni0 control-group patients
and Yi1 as the number of events in the Ni1 treatment-group patients.
indep
(a) Fit the model Yij |θj ∼ Binomial(Nij , θj ) with θ0 , θ1 ∼
Uniform(0, 1). Can we conclude that the treatment reduces the
event rate?
indep
(b) Fit the model Yij |θij ∼ Binomial(Nij , θij ) with logit(θij ) =
iid
αij and αi = (αi0 , αi1 )T ∼ Normal(µ, Σ), µ ∼
Normal(0, 102 I2 ), and Σ ∼ InvWishart(3, I2 ). Summarize the
evidence that the treatment reduces the death rate.
(c) Draw a DAG for these two models.
(d) Discuss the advantages and disadvantages of both models.

(e) Which model is preferred for these data?


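4. Download the marathon data of Section 6.4 from the course webpage.
Let $Y_{ij}$ be the speed of runner i in mile j. Fit the hierarchical
model $Y_{i1} \sim \text{Normal}(\mu_i, \sigma_0^2)$ and

$$Y_{ij}|Y_{i,j-1} \sim \text{Normal}\big(\mu_i + \rho_i (Y_{i,j-1} - \mu_i), \sigma_i^2\big),$$

where $\mu_i \overset{iid}{\sim} \text{Normal}(\theta_1, \theta_2)$, $\rho_i \overset{iid}{\sim} \text{Normal}(\theta_3, \theta_4)$, and
$\sigma_i^2 \overset{iid}{\sim} \text{InvGamma}(\theta_5, \theta_6)$.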
(a) Draw a DAG for this model and give an interpretation for each
parameter in the model.
(b) Select uninformative prior distributions for θ1 , ..., θ6 .
(c) Fit the model in JAGS using three chains each with 25,000 iter-
ations and thoroughly assess MCMC convergence for the θj .
(d) Are the data informative about the θj ? That is, are the pos-
terior distributions more concentrated than the prior distribu-
tions?
(e) In light of (c) and (d), are there any simplifications you might
consider and if so, how would you compare the full and simpli-
fied models?
5. Download the Gambia data

> library(geoR)
> data(gambia)
> ?gambia

The data consist of 2,035 children that live in 65 villages. For village
v ∈ {1, ..., 65}, denote nv as the number of children in the sample,
Yv as the number of children that tested positive for malaria, and
pv as the true probability of testing positive for malaria. We use
the spatial model αv = logit(pv), where $\alpha = (\alpha_1, ..., \alpha_{65})^T$ follows
a multivariate normal distribution with mean E(αv) = µ, variance
$V(\alpha_v) = \sigma^2$, and correlation $\text{Cor}(\alpha_u, \alpha_v) = \exp(-d_{uv}/\rho)$, where
$d_{uv}$ is the distance between villages u and v. For priors assume
$\mu \sim \text{Normal}(0, 10^2)$, $\sigma^2 \sim \text{InvGamma}(0.1, 0.1)$, and $\rho \sim \text{Uniform}(0, d^*)$
where $d^*$ is the maximum distance between villages.
(a) Specify the data layer, process layer and prior layer for this
hierarchical model.
(b) Fit the model using JAGS and assess convergence.
(c) Summarize the data and results using five maps: the sample
size nv , the sample proportion Yv /nv , the posterior means of
the pv , the posterior standard deviations of the pv , and the
posterior probabilities that pv exceeds 0.5.
7
Statistical properties of Bayesian methods

CONTENTS
7.1 Decision theory
7.2 Frequentist properties
    7.2.1 Bias-variance tradeoff
    7.2.2 Asymptotics
7.3 Simulation studies
7.4 Exercises

In this chapter we briefly discuss some of the most important concepts of
statistical theory as they relate to Bayesian methods. This book is primarily
dedicated to the practical application of Bayesian statistical methods and thus
this chapter can be skipped on first read. However, even an applied statistician
should be familiar with statistical theory at least at a high level to plan and
defend their work.
As an example, consider the problem of estimating a normal mean, θ. An
estimator is a function that takes the data as input and returns an estimate
of the parameter of interest. An estimator of the parameter θ is denoted
θ̂(Y). A natural estimator of a normal mean is the sample mean,
$\hat\theta(\mathbf{Y}) = \sum_{i=1}^n Y_i/n$. However, there are other estimators including the
sample median or the trimmed mean. In Section 2.1.3 we discussed Bayesian
estimators such as the posterior mean
$\hat\theta(\mathbf{Y}) = E(\theta|\mathbf{Y}) = \sum_{i=1}^n Y_i/(n + m)$. How can we justify the claim
that the Bayesian estimator is a better estimator than the sample mean? In
what sense is an estimator optimal? Questions such as these motivate the
development of statistical theory.
In addition to investigating the properties of specific estimators and deter-
mining the settings for which they are preferred, much of statistical theory is
dedicated to developing general procedures for deriving estimators with good
properties. General estimation procedures include maximum likelihood esti-
mation, the method of moments, estimating equations, and of course, Bayesian
methods. Comparisons of these statistical frameworks are often made using
frequentist criteria. As we will see, even if the objective is to generate a
procedure with good frequentist properties, Bayesian methods often perform well.
In Section 7.1 we focus on Bayesian methods and optimally summarizing
the posterior distribution. In Sections 7.2 and 7.3 we study the frequentist
properties of Bayesian methods using mathematical and computational tools,
respectively.

7.1 Decision theory


Before studying the frequentist properties of Bayesian estimators, we discuss
how to optimally summarize the posterior distribution. An appealing feature
of a Bayesian analysis is that it produces the full posterior distribution of each
parameter. However, in many cases a point estimate (i.e., a single-number es-
timator) is desired. Common one-number summaries of the posterior used as
point estimators are the posterior mean, median or mode. Rather than arbi-
trarily choosing one of these posterior summaries as the estimator, Bayesian
decision theory provides a formal way to select the optimal Bayesian point
estimator.
Defining the optimal point estimator requires defining the measure of
estimation accuracy that is to be optimized. If θ is the true value of the parameter
and θ̂(Y) is the estimator, then the loss function is defined as l(θ, θ̂(Y)). For
example, we might choose squared error loss, $l(\theta, \hat\theta(\mathbf{Y})) = [\theta - \hat\theta(\mathbf{Y})]^2$, or
absolute loss $l(\theta, \hat\theta(\mathbf{Y})) = |\theta - \hat\theta(\mathbf{Y})|$.
The loss function depends on the true value of the parameter, and there-
fore cannot be evaluated in a real data analysis. From the Bayesian perspec-
tive, given the data, uncertainty about the fixed but unknown parameter θ is
quantified by its posterior distribution. In this view, θ, and thus l(θ, θ̂(Y)),
are random variables. The Bayesian risk is the expected (with respect to the
posterior distribution of θ) loss, R(θ̂(Y)) = E[l(θ, θ̂(Y))|Y]. The Bayes rule
is the estimator θ̂(Y) that minimizes Bayesian risk.
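For example, assume squared error loss and define the posterior mean as
θ̄ = E(θ|Y). The Bayesian risk is

$$\begin{aligned}
R(\hat\theta(\mathbf{Y})) &= E[l(\theta, \hat\theta(\mathbf{Y}))|\mathbf{Y}] \\
&= E\{[\theta - \hat\theta(\mathbf{Y})]^2|\mathbf{Y}\} \\
&= E\{[(\theta - \bar\theta) - (\hat\theta(\mathbf{Y}) - \bar\theta)]^2|\mathbf{Y}\} \\
&= E\{(\theta - \bar\theta)^2|\mathbf{Y}\} - 2E\{(\theta - \bar\theta)(\hat\theta(\mathbf{Y}) - \bar\theta)|\mathbf{Y}\} + E\{[\hat\theta(\mathbf{Y}) - \bar\theta]^2|\mathbf{Y}\} \\
&= V(\theta|\mathbf{Y}) - 2(\hat\theta(\mathbf{Y}) - \bar\theta)E(\theta - \bar\theta|\mathbf{Y}) + [\hat\theta(\mathbf{Y}) - \bar\theta]^2 \\
&= V(\theta|\mathbf{Y}) + [\hat\theta(\mathbf{Y}) - \bar\theta]^2.
\end{aligned}$$

The cross term vanishes in the last step because E(θ − θ̄|Y) = E(θ|Y) − θ̄ = 0
by definition. The estimator θ̂(Y) does not affect the posterior variance
V(θ|Y), and so under squared error loss, the posterior mean θ̂(Y) = θ̄
minimizes Bayesian risk and is the Bayes rule estimator. This justifies the
use of the posterior mean to summarize the
posterior in cases where squared error loss is reasonable.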
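
As a quick numerical check of this result, the following R sketch approximates the Bayesian risk under squared error loss by Monte Carlo and minimizes it numerically. The Gamma(3, 2) "posterior", the number of draws and the names `risk` and `best` are assumptions made purely for illustration; any set of posterior draws could be used in their place.

```r
set.seed(1)

# Hypothetical posterior draws (assumption): theta | Y ~ Gamma(shape = 3, rate = 2)
theta <- rgamma(1e5, shape = 3, rate = 2)

# Monte Carlo approximation of the Bayesian risk E[(theta - est)^2 | Y]
risk <- function(est) mean((theta - est)^2)

# Numerically minimize the risk over the range of the draws
best <- optimize(risk, interval = range(theta), tol = 1e-8)$minimum

# The minimizer should agree with the posterior mean of the draws
post_mean <- mean(theta)
print(c(minimizer = best, posterior_mean = post_mean))

# Decomposition from the derivation: risk = posterior variance + (est - mean)^2
est <- 2.5
post_var <- mean((theta - post_mean)^2)
print(c(risk = risk(est), decomposition = post_var + (est - post_mean)^2))
```

The two printed pairs should agree up to numerical tolerance, which mirrors the final line of the derivation: the risk exceeds the posterior variance by exactly the squared distance between the estimate and the posterior mean.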

Different loss functions give different Bayes rule estimators. Absolute loss
gives the posterior median as the Bayes rule estimator. More generally, if
over-estimation and under-estimation are weighted differently, a useful loss
function is the asymmetric check-loss function

$$l(\theta, \hat\theta(\mathbf{Y})) = \begin{cases} (1 - \tau)(\hat\theta(\mathbf{Y}) - \theta) & \text{if } \hat\theta(\mathbf{Y}) > \theta \\ \tau(\theta - \hat\theta(\mathbf{Y})) & \text{if } \hat\theta(\mathbf{Y}) < \theta \end{cases} \tag{7.1}$$

for τ ∈ (0, 1). In this loss function, if τ is close to zero, then over-estimation
(top row) is penalized more than under-estimation (bottom row). The Bayes
rule for this loss function is the τth posterior quantile. If θ is a discrete random
variable, then the MAP estimator $\hat\theta(\mathbf{Y}) = \arg\max_\theta f(\theta|\mathbf{Y})$ is the Bayes rule
under zero-one loss $l(\theta, \hat\theta(\mathbf{Y})) = I[\theta \neq \hat\theta(\mathbf{Y})]$.
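
The claim that the check loss in (7.1) leads to the τth posterior quantile can be checked numerically. The sketch below is illustrative only: the Normal(1, 2²) posterior draws, the choice τ = 0.2 and the helper names `check_loss` and `risk` are assumptions, not part of the text.

```r
set.seed(2)

# Hypothetical posterior draws (assumption): theta | Y ~ Normal(1, 2^2)
theta <- rnorm(1e5, mean = 1, sd = 2)
tau <- 0.2

# Check loss (7.1): over-estimation costs (1 - tau), under-estimation costs tau
check_loss <- function(theta, est, tau) {
  ifelse(est > theta, (1 - tau) * (est - theta), tau * (theta - est))
}

# Monte Carlo approximation of the Bayesian risk for a candidate estimate
risk <- function(est) mean(check_loss(theta, est, tau))

# Minimize the risk numerically and compare with the empirical tau-quantile
best <- optimize(risk, interval = range(theta), tol = 1e-8)$minimum
post_quantile <- unname(quantile(theta, tau))
print(c(minimizer = best, quantile = post_quantile))
```

Setting `tau <- 0.5` makes the loss proportional to absolute loss, in which case the minimizer should match the posterior median.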
Decision theory can also formalize Bayesian hypothesis testing. Let H ∈
{0, 1} denote the true state, with H = 0 if the null hypothesis is true and
H = 1 if the alternative hypothesis is true. A Bayesian analysis produces
posterior probabilities Prob(H = h|Y) for h ∈ {0, 1} (see Section 5.2). The
decision to be made is which hypothesis to select. Let d(Y) = 0 if we select
the null hypothesis and d(Y) = 1 if we select the alternative hypothesis.
To determine the Bayes rule for d(Y) we must specify the loss incurred if
we select the wrong hypothesis. If we assign a Type I error the loss λ1 and a
Type II error the loss λ2, the loss function can be written

$$l(H, d(\mathbf{Y})) = \begin{cases} 0 & \text{if } H = d(\mathbf{Y}) \\ \lambda_1 & \text{if } H = 0 \text{ and } d(\mathbf{Y}) = 1 \\ \lambda_2 & \text{if } H = 1 \text{ and } d(\mathbf{Y}) = 0. \end{cases} \tag{7.2}$$

The Bayesian risk is

$$R(d(\mathbf{Y})) = \begin{cases} \lambda_1 \text{Prob}(H = 0|\mathbf{Y}) & \text{if } d(\mathbf{Y}) = 1 \\ \lambda_2 \text{Prob}(H = 1|\mathbf{Y}) & \text{if } d(\mathbf{Y}) = 0. \end{cases} \tag{7.3}$$

Therefore, the Bayes rule is to reject the null hypothesis and conclude that
the alternative is true, i.e., select d(Y) = 1, if the posterior probability of the
alternative hypothesis satisfies

$$\text{Prob}(H = 1|\mathbf{Y}) > \frac{\lambda_1}{\lambda_1 + \lambda_2}. \tag{7.4}$$

Note that, unlike classical hypothesis testing, if we swap the roles of the
hypotheses the decision rule remains the same. Often the loss of a Type I error
is assumed to be larger than that of a Type II error, e.g., λ1 = 10λ2, so that the
null hypothesis is rejected only if the posterior probability of the alternative
hypothesis is near one.
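
The decision rule in (7.3) and (7.4) is simple enough to write as a small function. The sketch below is a minimal illustration; the function name `bayes_test` and the posterior probability 0.85 are hypothetical choices, not values from the text.

```r
# Bayes rule for testing: compare the posterior expected loss of each decision.
# p1 = Prob(H = 1 | Y); lambda1, lambda2 = losses for Type I and Type II errors.
bayes_test <- function(p1, lambda1, lambda2) {
  risk_reject <- lambda1 * (1 - p1)  # risk of d(Y) = 1, from (7.3)
  risk_accept <- lambda2 * p1        # risk of d(Y) = 0, from (7.3)
  list(decision  = as.integer(risk_reject < risk_accept),
       threshold = lambda1 / (lambda1 + lambda2))  # cutoff in (7.4)
}

# Equal losses: the cutoff is 0.5, so a posterior probability of 0.85 rejects H0
print(bayes_test(0.85, lambda1 = 1, lambda2 = 1))

# A Type I error ten times as costly: the cutoff is 10/11, so H0 is retained
print(bayes_test(0.85, lambda1 = 10, lambda2 = 1))
```

Comparing the two risks directly and comparing the posterior probability with the threshold are the same rule; the function returns both so the equivalence is easy to inspect.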
These decision-theory results hold for prediction as well. For example, if
predictions are evaluated using squared-error prediction loss, then the mean
of the posterior predictive distribution is the Bayes rule. Also, if the response
is binary then the Bayes rule for classification is to predict $Y^{pred} = 1$ if the
posterior predictive probability $\text{Prob}(Y^{pred} = 1|\mathbf{Y}) > \lambda_1/(\lambda_1 + \lambda_2)$, where
λ1 is the loss for a false positive and λ2 is the loss for a false negative. In
this case, it might be reasonable to set λ1 = λ2 and predict $Y^{pred} = 1$ if
$\text{Prob}(Y^{pred} = 1|\mathbf{Y}) > 0.5$.

7.2 Frequentist properties


Statistical procedures proposed for general use are commonly compared and
evaluated using frequentist criteria. A frequentist evaluation
of an estimator focuses on its sampling distribution, i.e., the distribution of
θ̂(Y) over repeated samples of the data Y. The distribution of the data of
course depends on the true value of θ, denoted in this discussion of frequentist
properties as θ0. For example, if $Y_i \overset{iid}{\sim} \text{Normal}(\theta_0, \sigma^2)$, then the sample mean
$\hat\theta(\mathbf{Y}) = \sum_{i=1}^n Y_i/n$ is an estimator of θ0 and its sampling distribution is
$\hat\theta(\mathbf{Y}) \sim \text{Normal}(\theta_0, \sigma^2/n)$.
There are many ways to compare the sampling distributions of two esti-
mators. For example, we could study the bias

Bias[θ̂(Y)] = E[θ̂(Y)] − θ0 . (7.5)

If the bias is positive, this means that on average the estimator over-estimates
the parameter and vice versa. With all else equal, an unbiased estimator, i.e.,
one with bias equal to zero, is preferred. However, the bias only evaluates the
center of the sampling distribution and ignores its spread. Therefore we also
study the variance of the sampling distribution V[θ̂(Y)]. For two estimators that are
unbiased we prefer low variance because this means the sampling distribution
is concentrated around the true parameter; on the other hand, small variance
can be undesirable for biased estimators (Figure 7.1). The most common
method of comparison is mean squared error

$$\text{MSE}[\hat\theta(\mathbf{Y})] = E\{[\hat\theta(\mathbf{Y}) - \theta_0]^2\} = \text{Bias}[\hat\theta(\mathbf{Y})]^2 + \text{Var}[\hat\theta(\mathbf{Y})], \tag{7.6}$$

which combines bias and variance into a single measure.


Evaluating estimators using bias, variance and MSE is complicated by the
fact that these summaries depend on the true parameter value, θ0, and the
sample size, n. For example, it may be that an estimator performs well if θ0
is close to zero and the sample is small, but performs poorly in other settings.
Therefore it is important to evaluate a procedure over a range of settings. To deal
with the sample size, we often consider asymptotic arguments with n → ∞. A
method is asymptotically unbiased if its bias converges to zero as the sample
size goes to infinity. An estimator is said to be consistent if it converges in
probability to the true value, i.e., the probability of the estimator being within
ε (for any ε > 0) of the true value increases to one as the sample size increases
to infinity. An asymptotically unbiased estimator is consistent if its variance
decreases to zero with the sample size.

FIGURE 7.1
Hypothetical sampling distributions. Of the four hypothetical sampling
distributions, for estimators θ̂1(Y), ..., θ̂4(Y), the first two are unbiased and
the last two are biased, and the first and third have small variance while the
second and fourth have large variance. The true value θ0 is denoted with the
vertical line at θ = 0.5. The mean
squared errors of the four estimators are 0.01, 0.25, 0.26 and 0.50, respectively.

Frequentist evaluation of statistical methods extends to interval estimation
and testing. For Bayesian credible intervals we evaluate their frequentist cov-
erage probability, i.e., the probability that they include the true value when
applied repeatedly to many datasets. Similarly, for a Bayesian testing proce-
dure we compute the probability of making Type I and Type II errors when
the test is applied to many random datasets.
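
To make the idea of frequentist coverage concrete, the following R sketch computes the exact coverage of an equal-tailed 95% credible interval in a binomial model by summing over all possible datasets. The binomial likelihood, the Beta(1, 1) prior, the sample size n = 50 and the helper name `coverage` are assumptions chosen for illustration and do not come from the text.

```r
# Exact frequentist coverage of the equal-tailed credible interval for theta
# when Y | theta ~ Binomial(n, theta) and theta ~ Beta(a, b)  (assumed model).
coverage <- function(n, theta0, a = 1, b = 1, level = 0.95) {
  y <- 0:n                                   # every possible dataset
  alpha <- 1 - level
  # The posterior is Beta(y + a, n - y + b); take its equal-tailed interval
  lower <- qbeta(alpha / 2, y + a, n - y + b)
  upper <- qbeta(1 - alpha / 2, y + a, n - y + b)
  covered <- (lower <= theta0) & (theta0 <= upper)
  # Weight each dataset by its sampling probability under the true value
  sum(dbinom(y, n, theta0) * covered)
}

theta_grid <- c(0.1, 0.3, 0.5)
cover <- sapply(theta_grid, function(t0) coverage(n = 50, theta0 = t0))
print(round(cbind(theta0 = theta_grid, coverage = cover), 3))
```

Because the data are discrete, the coverage is not exactly 0.95 and varies with the true value; for models without a closed-form posterior the same quantity would typically be approximated by simulation.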

7.2.1 Bias-variance tradeoff


For difficult problems such as analyses with many parameters or regression
with correlated predictors, adding prior information can stabilize the statis-
tical analysis by reducing the variance of the estimator. On the other hand,
if the prior information is erroneous then this can lead to bias. The balance
between these two consequences of incorporating prior information can be for-
malized by studying the bias-variance tradeoff. Recall that the mean squared
error is the sum of the variance and the squared bias,

$$\text{MSE}[\hat\theta(\mathbf{Y})] = \text{Bias}[\hat\theta(\mathbf{Y})]^2 + \text{Var}[\hat\theta(\mathbf{Y})]. \tag{7.7}$$



TABLE 7.1
Bias-variance tradeoff for a normal-mean analysis. Assuming
$Y_1, ..., Y_n \overset{iid}{\sim} \text{Normal}(\theta, \sigma^2)$, this table gives the bias, variance and mean
squared error (MSE) of the estimators θ̂1(Y) = Ȳ and θ̂2(Y) = cȲ, where
c = n/(n + m) ∈ [0, 1] and θ0 is the true value.

Estimator   Bias         Variance   MSE
θ̂1(Y)       0            σ²/n       σ²/n
θ̂2(Y)       (c − 1)θ0    c²σ²/n     (c − 1)²θ0² + c²σ²/n

Therefore it is possible for a biased estimator to have smaller MSE than an
unbiased estimator if the reduction in variance is large enough to offset the
squared bias.
The optimal estimator as given by the bias-variance tradeoff often depends
on the true value of the parameter and the sample size. As a concrete example,
consider the simple case with $Y_1, ..., Y_n \overset{iid}{\sim} \text{Normal}(\theta, \sigma^2)$. For mathematical
simplicity we assume σ is fixed. We compare two estimators:

(1) $\hat\theta_1(\mathbf{Y}) = \frac{1}{n}\sum_{i=1}^n Y_i = \bar{Y}$
(2) $\hat\theta_2(\mathbf{Y}) = \frac{1}{n+m}\sum_{i=1}^n Y_i = c\bar{Y}$

where c = n/(n + m) ∈ [0, 1]. The first estimator is the usual sample mean
(i.e., the MLE or posterior mean under Jeffreys prior) and the second is the
posterior mean assuming prior $\theta \sim \text{Normal}(0, \sigma^2/m)$.
Table 7.1 gives the bias, variance and MSE of the two estimators. The
sample mean is unbiased but always has larger variance than the Bayesian
estimator. The relative MSE is

$$\text{RMSE} = \frac{\text{MSE}[\hat\theta_2(\mathbf{Y})]}{\text{MSE}[\hat\theta_1(\mathbf{Y})]} = \frac{nm^2(\theta_0/\sigma)^2 + n^2}{(n + m)^2}. \tag{7.8}$$

The Bayesian procedure is preferred when this ratio is less than one. When
θ0 = 0, the prior mean, the ratio is less than one for all n and m. That
is, when the prior mean is the true value, the Bayesian estimator is preferred
over the sample mean. The Bayesian estimator can be preferred even if the
prior mean is not exactly the true value. As long as the true value is close to
zero,

$$|\theta_0| < \sigma\sqrt{\frac{1}{n} + \frac{2}{m}} = B(n, m), \tag{7.9}$$

the Bayesian estimator is preferred. The bound B(n, m) decreases as
the sample size increases (toward its limit $\sigma\sqrt{2/m}$) and so in this case the
advantage of the Bayesian approach is greatest for small datasets. As n → ∞
or m → 0 the RMSE converges to
one, and thus for large sample sizes or uninformative priors, the two estimators
perform similarly.
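
The relative MSE in (7.8) and the bound in (7.9) can be evaluated directly. The R sketch below is illustrative: the function names `rel_mse`, `bound`, `mse1` and `mse2` and the values n = 20, m = 5 are assumptions, while the formulas are those of Table 7.1 and equations (7.8) and (7.9).

```r
# MSE of the two estimators, taken from Table 7.1
mse1 <- function(n, sigma = 1) sigma^2 / n
mse2 <- function(theta0, n, m, sigma = 1) {
  c_shrink <- n / (n + m)
  (c_shrink - 1)^2 * theta0^2 + c_shrink^2 * sigma^2 / n
}

# Relative MSE in (7.8) and the bound B(n, m) in (7.9)
rel_mse <- function(theta0, n, m, sigma = 1) {
  (n * m^2 * (theta0 / sigma)^2 + n^2) / (n + m)^2
}
bound <- function(n, m, sigma = 1) sigma * sqrt(1 / n + 2 / m)

n <- 20; m <- 5
theta_grid <- c(0, 0.25, 0.5, bound(n, m), 1)

# The closed form (7.8) should match the ratio of the Table 7.1 MSEs,
# and the ratio should equal one exactly at theta0 = B(n, m)
print(cbind(theta0  = theta_grid,
            formula = rel_mse(theta_grid, n, m),
            ratio   = mse2(theta_grid, n, m) / mse1(n)))

# The ratio approaches one as n grows with theta0 and m held fixed
print(rel_mse(0.5, n = c(10, 100, 1000, 10000), m = 5))
```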

7.2.2 Asymptotics
In Section 3.1.3 we discussed the Bayesian central limit theorem, which states
that under general conditions the posterior converges to a normal distribution
as the sample size increases. It can also be shown that, under these conditions
and assuming the prior includes the true value, the sampling distribution
of the posterior mean $\hat{\boldsymbol\theta}_B = E(\boldsymbol\theta|\mathbf{Y})$ converges (as n → ∞) to the sampling
distribution of the maximum likelihood estimator,

$$\hat{\boldsymbol\theta}_B \sim \text{Normal}\big(\boldsymbol\theta_0, \hat\Sigma_{MLE}\big), \tag{7.10}$$

where $\boldsymbol\theta_0$ is the true value and $\hat\Sigma_{MLE}$ is defined in Section 3.1.3. Therefore,
the posterior mean is asymptotically unbiased. Further, it follows from classic
results for maximum likelihood estimators that its variance decreases to zero as
the sample size increases and thus the posterior mean is a consistent estimator
for θ in essentially the same conditions as the maximum likelihood estimator.
An asymptotic property that is uniquely Bayesian is posterior consistency.
For a given dataset Y, the posterior probability that θ is within ε of the true
value is

$$\text{Prob}(\|\theta - \theta_0\| < \epsilon \,|\, \mathbf{Y}). \tag{7.11}$$

A Bayesian procedure is said to possess posterior consistency if, for any ε > 0,
this probability is assured to converge to one as the sample size increases
(Figure 7.2). Appendix A.3 provides a proof that the posterior distribution is
consistent for parameters with discrete support under very general conditions on the
likelihood and prior, and posterior consistency has been established for most
finite-dimensional problems and many Bayesian nonparametric methods. For
a more thorough discussion of posterior consistency see [37].
Both the Bayesian CLT and the posterior consistency result in Appendix A.3
hold for any prior as long as the prior does not change with the sample size and
has positive mass/density around the true value. This confirms the argument
that for large datasets, any reasonable prior distribution should lead to the
same statistical inference and that this posterior will converge to the true value.
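
Posterior consistency can be illustrated with a few lines of R. The sketch below assumes normal data with known variance one and a flat prior on the mean, so that the posterior given the first n observations is Normal(Ȳn, 1/n); the seed, the grid of sample sizes and the helper name `post_prob` are illustrative assumptions.

```r
set.seed(3)

# Simulated data (assumption): Y_i ~ Normal(theta0, 1) with theta0 = 1
theta0 <- 1
N <- 2500
Y <- rnorm(N, mean = theta0, sd = 1)

# With a flat prior and variance fixed at one, theta | Y_1..Y_n ~ Normal(ybar_n, 1/n).
# post_prob returns Prob(|theta - theta0| < eps | Y_1, ..., Y_n) as in (7.11).
post_prob <- function(n, eps) {
  ybar <- mean(Y[1:n])
  post_sd <- 1 / sqrt(n)
  pnorm(theta0 + eps, ybar, post_sd) - pnorm(theta0 - eps, ybar, post_sd)
}

n_grid <- c(100, 250, 500, 1000, 2500)
eps_grid <- c(0.01, 0.05, 0.1)

# Rows index the sample size n, columns index eps
Pn <- sapply(eps_grid, function(eps) sapply(n_grid, post_prob, eps = eps))
dimnames(Pn) <- list(paste0("n=", n_grid), paste0("eps=", eps_grid))
print(round(Pn, 3))
```

For the larger values of ε the probabilities should approach one quickly, while for small ε a much larger sample is needed, which is the qualitative behavior that posterior consistency describes.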

7.3 Simulation studies


For simple models such as the normal mean example in Section 7.2.1 it is
possible to derive the frequentist properties of an estimator using algebra.
However, for more complicated models, a purely mathematical study is
impossible. Especially in these complicated settings, it is important to understand
the operating characteristics of a statistical procedure, and a simulation study
is a general way to carry out this evaluation.

FIGURE 7.2
Illustration of posterior consistency. The data are generated as
$Y_1, ..., Y_N \overset{iid}{\sim} \text{Normal}(\theta_0, 1)$ with N = 2,500 and θ0 = 1, and we fit the
Bayesian model with flat (improper) prior for the mean and variance fixed
at one using the first n observations Yn = (Y1, ..., Yn). The left panel plots
the posterior f(θ|Yn) by n (n = 100, 250, 500, 1000 and 2500), and the right
panel plots the corresponding posterior probability
$P_n = \text{Prob}(|\theta - \theta_0| < \epsilon \,|\, \mathbf{Y}_n)$ against n for ε = 0.01, 0.05 and 0.1.

Just as MCMC is used to approximate complicated posterior distributions,
simulation studies use Monte Carlo sampling to approximate complicated
sampling distributions. To approximate, say, the mean squared error of
an estimator we generate S independent and identically distributed datasets
given the true parameter value θ0 and sample size n, and then compute the
estimator for each simulated dataset. If we denote the S simulated datasets as
Y1, ..., YS, then θ̂(Y1), ..., θ̂(YS) are S draws from the sampling distribution
of the estimator θ̂(Y). The mean squared error is then approximated as

$$\text{MSE}[\hat\theta(\mathbf{Y})] \approx \frac{1}{S}\sum_{s=1}^S [\hat\theta(\mathbf{Y}_s) - \theta_0]^2. \tag{7.12}$$

Other summaries of the sampling distribution such as bias and coverage are
computed similarly. Of course, this is only a Monte Carlo estimate of the
true MSE and a different simulation experiment will give a different estimate.
Therefore, the approximation should be accompanied by a standard error,
$s_{MSE}/\sqrt{S}$, where $s_{MSE}$ is the sample standard deviation of the S squared
errors, $[\hat\theta(\mathbf{Y}_1) - \theta_0]^2, ..., [\hat\theta(\mathbf{Y}_S) - \theta_0]^2$.
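
The Monte Carlo approximation in (7.12) and its standard error can be sketched in a few lines of R. The example below revisits the shrinkage estimator of Section 7.2.1 so that the simulation can be compared with the analytic MSE from Table 7.1; the settings θ0 = 0.5, σ = 1, n = 20, m = 5 and S = 10,000 are arbitrary illustrative choices.

```r
set.seed(4)

# Simulation settings (illustrative assumptions)
theta0 <- 0.5   # true mean
sigma  <- 1     # true standard deviation
n      <- 20    # sample size
m      <- 5     # prior sample size in the estimator sum(Y) / (n + m)
S      <- 10000 # number of simulated datasets

# Draw S datasets and compute the estimator for each one
est <- replicate(S, {
  Y <- rnorm(n, theta0, sigma)
  sum(Y) / (n + m)
})

# Monte Carlo estimate of the MSE (7.12) and its standard error
sq_err  <- (est - theta0)^2
mse_hat <- mean(sq_err)
mse_se  <- sd(sq_err) / sqrt(S)

# Analytic MSE from Table 7.1 for comparison
c_shrink <- n / (n + m)
mse_true <- (c_shrink - 1)^2 * theta0^2 + c_shrink^2 * sigma^2 / n

print(c(estimate = mse_hat, std_error = mse_se, analytic = mse_true))
```

The Monte Carlo estimate should fall within a few standard errors of the analytic value; if the standard error is too large for the comparison of interest, S should be increased.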
A typical simulation study will generate data from a few values of θ0 and

n to understand when the method performs well and when it does not. Note
that if θ̂(Y) is a Bayesian estimator, say the posterior mean, then computing
each θ̂(Ys ) may require MCMC and thus Bayesian simulation experiments
may need many applications of MCMC and can be time consuming (running
the S chains in parallel is obviously helpful). For a thorough description of
simulation studies, see [13] (Chapter 9).
As an example, we conduct a simulation study to compare ordinary least
squares (OLS) regression with Bayesian LASSO regression (BLR; Section
4.2). The sampling distribution of the OLS estimator is known (multivariate
student-t) but the sampling distribution of the posterior mean under the
BLR model is quite complicated and difficult to study without simulation.
We generate $X_{ij} \overset{iid}{\sim} \text{Normal}(0, 1)$ and
$Y_i|\mathbf{X}_i \sim \text{Normal}(\sum_{j=1}^p X_{ij}\beta_{j0}, \sigma_0^2)$. The
data are generated with true values σ0 = 1 and the first p0 elements of β0
equal to zero and the final p1 elements equal to one, so that p = p0 + p1 = 20.
We generate S = 100 datasets each from six combinations of n, p0 and p1 using
the R code in Listing 7.1. Each dataset is analyzed using least squares (the
lm function in R) and Bayesian LASSO (the BLR function in R with default
values). For dataset s = 1, ..., S, let $\hat\beta_{js}$ be the estimate of βj (either the least
squares solution for OLS or the posterior mean for BLR) and $v_{js}$ its estimated
variance (either the squared standard error for OLS or the posterior variance
for BLR). Methods are compared using

$$\begin{aligned}
\text{Bias} &= \frac{1}{Sp}\sum_{s=1}^S\sum_{j=1}^p (\hat\beta_{js} - \beta_{j0}) \\
\text{Variance} &= \frac{1}{Sp}\sum_{s=1}^S\sum_{j=1}^p v_{js} \\
\text{MSE} &= \frac{1}{Sp}\sum_{s=1}^S\sum_{j=1}^p (\hat\beta_{js} - \beta_{j0})^2 \\
\text{Coverage} &= \frac{1}{Sp}\sum_{s=1}^S\sum_{j=1}^p I\big(|\hat\beta_{js} - \beta_{j0}| < 2\sqrt{v_{js}}\big).
\end{aligned}$$

For coverage, we approximate the posterior distribution as Gaussian to avoid
saving all posterior samples. Results are averaged across covariates and
standard errors are omitted to make the presentation concise. It may also be
interesting to study these coefficients separately, in particular, to evaluate
performance separately for null and active covariates.
Table 7.2 shows the results. For the small sample size (n = 40), the
Bayesian method gives a large reduction in variance and thus mean squared
error in the first two cases with more null covariates than active covariates.
In these cases the prior information that many of the coefficients are near
zero is valid and improves the stability of the algorithm. However, when this
prior information is wrong in case three and all covariates are active the BLR

Listing 7.1
R simulation study code.
# Set up the simulation
library(BLR)
n      <- 25    # Sample size
p_null <- 15    # Number of null covariates
p_act  <- 5     # Number of active covariates
nsims  <- 100   # Number of simulated datasets
sigma  <- 1     # True value of sigma
beta   <- c(rep(0,p_null),rep(1,p_act)) # True beta

# Define matrices to store the results (rows = datasets, columns = covariates)
p    <- p_null + p_act
EST1 <- VAR1 <- matrix(0,nsims,p)  # OLS estimates and variances
EST2 <- VAR2 <- matrix(0,nsims,p)  # BLR estimates and variances

# Start the simulation
for(sim in 1:nsims){
  set.seed(sim*1234)
  # Generate a dataset
  X <- matrix(rnorm(n*p),n,p)
  Y <- X%*%beta+rnorm(n,0,sigma)

  # Fit ordinary least squares (drop the intercept row)
  ols        <- summary(lm(Y~X))$coef[-1,]
  EST1[sim,] <- ols[,1]     # Least squares estimates
  VAR1[sim,] <- ols[,2]^2   # Squared standard errors

  # Fit the Bayesian LASSO
  blr        <- BLR(y=Y,XL=X)
  EST2[sim,] <- blr$bL        # Posterior means
  VAR2[sim,] <- blr$SD.bL^2   # Posterior variances
}

# Compute the results for OLS
E    <- sweep(EST1,2,beta,"-")  # Estimation errors
MSE  <- mean(E^2)
BIAS <- mean(E)
VAR  <- mean(VAR1)
COV  <- mean(abs(E/sqrt(VAR1))<2)

# Compute the results for BLR and append them
E    <- sweep(EST2,2,beta,"-")
MSE  <- c(MSE,mean(E^2))
BIAS <- c(BIAS,mean(E))
VAR  <- c(VAR,mean(VAR2))
COV  <- c(COV,mean(abs(E/sqrt(VAR2))<2))

out <- cbind(BIAS,VAR,MSE,COV)
rownames(out) <- c("OLS","BLR")



TABLE 7.2
Simulation study results. The simulation study compares ordinary least
squares (“OLS”) with Bayesian LASSO regression (“BLR”) for estimating
regression coefficients in terms of bias, variance, mean squared error (“MSE”)
and coverage of 95% intervals (all metrics are averaged over covariates and
datasets). The simulations vary based on the sample size (n), the number of
null covariates (p0) and the number of active covariates (p1). All values are
multiplied by 100.

                  Bias            Variance         MSE            Coverage
  n   p0   p1     OLS     BLR     OLS    BLR     OLS    BLR     OLS     BLR
 40   20    0   -1.65   -0.08    5.59   0.19    5.40   0.03    94.7   100.0
 40   15    5    0.63   -3.14    5.38   3.47    5.71   3.45    93.8    96.0
 40    0   20   -0.88  -11.71    5.59   7.09    5.40   9.47    93.7    91.6
100   20    0   -0.43   -0.04    1.28   0.09    1.17   0.02    95.8   100.0
100   15    5    0.44   -0.56    1.22   1.02    1.27   0.98    94.5    95.5
100    0   20    0.11   -1.27    1.33   1.36    1.22   1.26    96.0    95.6

method has larger bias and thus MSE than OLS and the empirical coverage
of the Bayesian credible sets dips below the nominal level. For the large
sample size cases (n = 100), these same trends are apparent but the differences
between methods are smaller, as expected.

7.4 Exercises
1. Assume Y |µ ∼ Normal(µ, 2) and µ ∼ Normal(0, 2) (i.e., n = 1).
The objective is to test the null hypothesis H0 : µ ≤ 0 versus
the alternative hypothesis that H1 : µ > 0. We will reject H0 if
Prob(H1 |Y ) > c.
(a) Compute the posterior of µ.
(b) What is the optimal value of c if Type I and Type II errors
have the same costs?
(c) What is the optimal value of c if a Type I error costs ten times
more than a Type II error?
(d) Compute the Type I error rate of the test as a function of c.
(e) How would you pick c to control Type I error at 0.05?
2. Given data from a small pilot study, your current posterior probability
that the new drug your company has developed is more effective
than the current treatment is θ ∈ [0, 1]. Your company is considering
running a large clinical trial to confirm that your drug is indeed
preferred. If you run the trial it will cost $X. If in fact your drug is
better, then the probability that you will confirm this in the trial
is 80%; if in fact your drug is not better there is still a 5% chance
the trial will conclude it is better. If the trial suggests your drug is
preferred, you will make $cX. For which values of θ and c would
you initiate the trial?
3. Assume Y|θ ∼ Binomial(n, θ). Consider two estimators of θ: the
sample proportion θ̂1(Y) = Y/n and the posterior mean under the
Jeffreys prior, θ̂2(Y) = (Y + 1/2)/(n + 1).
(a) Compute the bias, variance and MSE for each estimator and
comment on the bias-variance tradeoff.
(b) In terms of n and the true value θ0 , when is θ̂2 (Y ) preferred?
4. Assume $Y_1, ..., Y_n|\sigma^2 \overset{iid}{\sim} \text{Normal}(0, \sigma^2)$ and $\sigma^2 \sim \text{InvGamma}(a, b)$.
(a) Give an expression for two estimators of σ²: the posterior mean
and the posterior mode.
(b) Plot the posterior density function and give the two estimators
for a sample with n = 10, $\sum_{i=1}^n Y_i^2 = 200$ and
a = b = 0.001.
(c) Argue that both estimators are consistent (i.e., their MSE goes
to zero as n increases) for any a and b.
5. Assume Y |θ ∼ Binomial(n, θ) and θ ∼ Beta(1/2, 1/2). Use a sim-
ulation study to compute the empirical coverage of the equal-
tailed 95% credible set for n ∈ {1, 5, 10, 25} and true value θ0 ∈
{0.05, 0.10, ..., 0.50}. Comment on the frequentist properties of the
Bayesian credible set.
6. A paper reports the results of a logistic regression analysis using
maximum likelihood estimation. They estimate β1 to be β̂1 = 2.1,
with the 95% confidence interval (0.42, 3.17). Further, the p-value
for the test that H0 : β1 = 0 versus H1 : β1 ≠ 0 is 0.01. Conduct
a Bayesian analysis using only this information and asymptotic ar-
guments.
7. You are designing a study to estimate the success probability of a
new marketing strategy. When the data have been collected, you
will analyze them using the model Y |θ ∼ Binomial(n, θ) and θ ∼
Uniform(0, 1). Before collecting any data, you suspect (based on
past studies) that θ ≈ 0.4. Which value of n should be used to
ensure the posterior standard deviation will be approximately 0.01?
8. Suppose $Y_1, ..., Y_n \overset{iid}{\sim} f(y|\eta)$ where f is the canonical exponential
family

$$f(y|\eta) \propto \exp\{y\eta - \psi(\eta)\}$$

for some known function ψ. We are interested in estimating the
parameter θ = E(Yi|η) = dψ(η)/dη.
(a) Show that the Gaussian density with variance fixed at one and
the Poisson density can be written in this form.
(b) Obtain the maximum likelihood estimator of θ. Is it unbiased
for θ?
(c) Find a class of conjugate priors for η.
(d) Obtain the Bayes estimator of θ using a conjugate prior and
squared error loss.
(e) Is there any (proper) conjugate prior for which the Bayes esti-
mator that you obtained in (d) is unbiased? Justify your answer.
Appendices

A.1: Probability distributions


Univariate discrete
In the following plots the probability mass functions for several combinations
of parameters are denoted with points; lines connect the points for visualiza-
tion, but the probability is non-zero only at the points.

Discrete uniform
Notation: X ∼ DiscreteUniform(a, b)
Support: X ∈ {a, a + 1, ..., b}
Parameters: a, b ∈ {..., −1, 0, 1, ...} with a < b
PMF: $1/(b - a + 1)$
Mean: (a + b)/2
Variance: $[(b - a + 1)^2 - 1]/12$
Notes: The discrete uniform can be applied to any finite set. For example,
we could say that X is distributed uniformly over the set
{1/10, 2/10, ..., 10/10}.
[Plot: probability versus x for a = 1, b = 4 and a = 2, b = 8.]

Binomial
Notation: X ∼ Binomial(n, θ)
Support: X ∈ {0, 1, ..., n}
Parameters: n ∈ {1, 2, ...}, θ ∈ [0, 1]
PMF: $\binom{n}{x}\theta^x(1 - \theta)^{n-x}$
Mean: nθ
Variance: nθ(1 − θ)
Notes: If X is the number of successes in n independent trials each with
success probability θ, then X ∼ Binomial(n, θ); if n = 1 then
X ∼ Bernoulli(θ).
[Plot: probability versus x for n = 5, θ = 0.5; n = 10, θ = 0.5; and n = 10, θ = 0.1.]


Beta-binomial
Notation: X ∼ BetaBinomial(n, a, b)
Support: X ∈ {0, 1, ..., n}
Parameters: n ∈ {1, 2, ...}, a, b > 0
PMF: $\frac{\Gamma(n+1)\Gamma(x+a)\Gamma(n-x+b)\Gamma(a+b)}{\Gamma(x+1)\Gamma(n-x+1)\Gamma(n+a+b)\Gamma(a)\Gamma(b)}$
Mean: na/(a + b)
Variance: $nab(a + b + n)/[(a + b)^2(a + b + 1)]$
Notes: If X|θ ∼ Binomial(n, θ) and θ ∼ Beta(a, b), then
X ∼ BetaBinomial(n, a, b). If
a = b = 1 then X ∼ DiscreteUniform(0, n).
[Plot: probability versus x with n = 25 for a = 1, b = 2; a = 5, b = 10; and a = 10, b = 20.]
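
Base R has no built-in beta-binomial density, so the PMF above can be coded directly. The following sketch is illustrative (the function name `dbetabinom` and the values n = 25, a = 5, b = 10 are assumptions); it works on the log scale with `lgamma` to avoid overflow and checks the PMF against the mean and variance formulas listed above.

```r
# Beta-binomial PMF, computed on the log scale for numerical stability
dbetabinom <- function(x, n, a, b) {
  exp(lgamma(n + 1) + lgamma(x + a) + lgamma(n - x + b) + lgamma(a + b) -
      lgamma(x + 1) - lgamma(n - x + 1) - lgamma(n + a + b) -
      lgamma(a) - lgamma(b))
}

n <- 25; a <- 5; b <- 10
x <- 0:n
pmf <- dbetabinom(x, n, a, b)

# The PMF should sum to one over the support
total <- sum(pmf)

# Mean and variance computed from the PMF versus the closed-form expressions
mean_pmf <- sum(x * pmf)
var_pmf  <- sum(x^2 * pmf) - mean_pmf^2
mean_formula <- n * a / (a + b)
var_formula  <- n * a * b * (a + b + n) / ((a + b)^2 * (a + b + 1))
print(c(total = total, mean_pmf = mean_pmf, mean_formula = mean_formula,
        var_pmf = var_pmf, var_formula = var_formula))

# With a = b = 1 every value in {0, ..., n} has probability 1/(n + 1)
print(range(dbetabinom(x, n, 1, 1)))
```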

Negative Binomial
Notation: X ∼ NegBinomial(θ, m)
Support: X ∈ {0, 1, 2, ...}
Parameters: m > 0, θ ∈ [0, 1]
PMF: $\binom{x+m-1}{x}\theta^m(1 - \theta)^x$
Mean: m(1 − θ)/θ
Variance: $m(1 - \theta)/\theta^2$
Notes: In a sequence of independent trials each with success probability θ,
if X is the number of failures that occur before the mth success
(assuming m is an integer), then X ∼ NegBinomial(θ, m); if m = 1 then X ∼
Geometric(θ). (The distribution can also be defined with m as the number of
failures, but we use the JAGS parameterization.)
[Plot: probability versus x for m = 5, θ = 0.5; m = 10, θ = 0.5; and m = 2, θ = 0.05.]

Poisson
Notation: X ∼ Poisson(θ)
Support: X ∈ {0, 1, 2, ...}
Parameters: θ > 0
PMF: $\frac{\theta^x \exp(-\theta)}{x!}$
Mean: θ
Variance: θ
Notes: If events occur independently and uniformly over time (space) with
the expected number of events in a given time interval (region) equal to θ,
then the number of events that occur in the interval (region) follows a
Poisson(θ) distribution.
[Plot: probability versus x for θ = 2, θ = 10 and θ = 20.]
Appendices 233

Multivariate discrete
Multinomial
Notation: $X = (X_1, ..., X_p) \sim \text{Multinomial}(n, \theta)$
Support: $X_j \in \{0, 1, ..., n\}$ with $\sum_{j=1}^{p} X_j = n$
Parameters: $\theta = (\theta_1, ..., \theta_p)$ with $\theta_j \in [0, 1]$ and $\sum_{j=1}^{p}\theta_j = 1$
PMF: $\dfrac{n!}{\prod_{j=1}^{p} x_j!}\prod_{j=1}^{p}\theta_j^{x_j}$
Mean: $E(X_j) = n\theta_j$
Variance: $V(X_j) = n\theta_j(1-\theta_j)$
Covariance: $\text{Cov}(X_j, X_k) = -n\theta_j\theta_k$
Marginal distributions: $X_j \sim \text{Binomial}(n, \theta_j)$
Notes: If n independent trials each have p possible outcomes with the probability of outcome j being $\theta_j$ and $X_j$ is the number of the trials that result in outcome j, then $X = (X_1, ..., X_p) \sim \text{Multinomial}(n, \theta)$.

Univariate continuous
Uniform

Notation: X ∼ Uniform(a, b)
Support: X ∈ [a, b]
Parameters: −∞ < a < b < ∞
PDF: $\dfrac{1}{b-a}$
Mean: (a + b)/2
Variance: $(b-a)^{2}/12$
Notes: If $X_1, X_2 \overset{iid}{\sim} \text{Uniform}(0, 1)$ then $\sqrt{-2\log(X_1)}\cos(2\pi X_2) \sim \text{Normal}(0, 1)$; if X ∼ Uniform(0, 1) and F is a continuous CDF, then $F^{-1}(X)$ has CDF F.
[Figure: PDF (density versus x) for a = 0, b = 1 and a = 0.5, b = 2.5.]
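
The two facts in the notes (the Box-Muller transform and inverse-CDF sampling) can be illustrated with a short sketch using only Python's standard library. The function names are hypothetical, and the exponential distribution is chosen here only because its inverse CDF has a simple closed form.

```python
import math
import random


def box_muller(rng):
    """One Normal(0, 1) draw built from two independent Uniform(0, 1) draws."""
    x1 = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    x2 = rng.random()
    return math.sqrt(-2.0 * math.log(x1)) * math.cos(2.0 * math.pi * x2)


def exponential_inverse_cdf(u, rate):
    """F^{-1}(u) for Exponential(rate), where F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - u) / rate


if __name__ == "__main__":
    rng = random.Random(0)
    z = [box_muller(rng) for _ in range(10000)]
    e = [exponential_inverse_cdf(rng.random(), 2.0) for _ in range(10000)]
    print("normal sample mean:", sum(z) / len(z))
    print("exponential sample mean:", sum(e) / len(e), "(theory: 0.5)")
```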

Beta

Notation: X ∼ Beta(a, b)
Support: X ∈ [0, 1]
Parameters: a > 0, b > 0
PDF: $\dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}$
Mean: $\dfrac{a}{a+b}$
Variance: $\dfrac{ab}{(a+b)^{2}(a+b+1)}$
Notes: If a = b = 1 then X ∼ Uniform(0, 1); 1 − X ∼ Beta(b, a).
[Figure: PDF (density versus x) for a = 1, b = 1; a = 1, b = 5; a = 20, b = 20; and a = 0.5, b = 0.5.]

Gamma

Notation: X ∼ Gamma(a, b)
Support: X ∈ [0, ∞)
Parameters: shape a > 0, rate b > 0
PDF: $\dfrac{b^{a}}{\Gamma(a)}x^{a-1}\exp(-bx)$
Mean: a/b
Variance: $a/b^{2}$
Notes: cX ∼ Gamma(a, b/c); if a = 1 then X ∼ Exponential(b); if a = ν/2 and b = 1/2 then X ∼ Chi-squared(ν); 1/X ∼ InvGamma(a, b).
[Figure: PDF (density versus x) for a = 0.5, b = 0.5; a = 1, b = 1; a = 2, b = 2; and a = 10, b = 10.]

Inverse gamma

Notation: X ∼ InvGamma(a, b)
Support: X ∈ [0, ∞)
Parameters: shape a > 0, scale b > 0
PDF: $\dfrac{b^{a}}{\Gamma(a)}x^{-a-1}\exp(-b/x)$
Mean: $\dfrac{b}{a-1}$ (if a > 1)
Variance: $\dfrac{b^{2}}{(a-1)^{2}(a-2)}$ (if a > 2)
Notes: cX ∼ InvGamma(a, cb); 1/X ∼ Gamma(a, b).
[Figure: PDF (density versus x) for a = 0.5, b = 0.5; a = 1, b = 1; a = 2, b = 2; and a = 10, b = 10.]

Normal/Gaussian

Notation: $X \sim \text{Normal}(\mu, \sigma^{2})$
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: $\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}\right]$
Mean: µ
Variance: $\sigma^{2}$
Notes: $c + dX \sim \text{Normal}(c + d\mu, d^{2}\sigma^{2})$; if µ = 0 and $\sigma^{2} = 1$ then X follows the standard normal distribution.
[Figure: PDF (density versus x) for µ = 0, σ = 1; µ = −2, σ = 1; and µ = 0, σ = 2.]

Student’s t
Notation: $X \sim t_{\nu}(\mu, \sigma^{2})$
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0, degrees of freedom ν > 0
PDF: $\dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma}\left[1+\dfrac{(x-\mu)^{2}}{\nu\sigma^{2}}\right]^{-(\nu+1)/2}$
Mean: µ (if ν > 1)
Variance: $\sigma^{2}\dfrac{\nu}{\nu-2}$ (if ν > 2)
Notes: $c + dX \sim t_{\nu}(c + d\mu, d^{2}\sigma^{2})$; if µ = 0 and $\sigma^{2} = 1$ then X follows the standard t distribution; if Z ∼ Normal(0, 1) independent of W ∼ Gamma(ν/2, 1/2) then $\mu + \sigma Z/\sqrt{W/\nu} \sim t_{\nu}(\mu, \sigma^{2})$; if ν = 1 then X follows the Cauchy distribution; X is approximately $\text{Normal}(\mu, \sigma^{2})$ for large ν.
[Figure: PDF (density versus x) for µ = 0, σ = 1, ν = 2; µ = 0, σ = 1, ν = 5; and µ = 2, σ = 2, ν = 5.]

Laplace/Double exponential

Notation: X ∼ DE(µ, σ)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: $\dfrac{1}{2\sigma}\exp\left(-\dfrac{|x-\mu|}{\sigma}\right)$
Mean: µ
Variance: $2\sigma^{2}$
Notes: c + dX ∼ DE(c + dµ, dσ).
[Figure: PDF (density versus x) for µ = 0, σ = 1; µ = 0, σ = 2; and µ = 2, σ = 1.]

Logistic
Notation: X ∼ Logistic(µ, σ)
Support: X ∈ (−∞, ∞)
Parameters: location µ ∈ (−∞, ∞), scale σ > 0
PDF: $\dfrac{1}{\sigma}\,\dfrac{\exp[-(x-\mu)/\sigma]}{\{1+\exp[-(x-\mu)/\sigma]\}^{2}}$
Mean: µ
Variance: $\pi^{2}\sigma^{2}/3$
Notes: c + dX ∼ Logistic(c + dµ, dσ); if U ∼ Uniform(0, 1) then µ + σ logit(U) ∼ Logistic(µ, σ).
[Figure: PDF (density versus x) for µ = 0, σ = 1; µ = 0, σ = 2; and µ = 2, σ = 1.]

Multivariate continuous
Multivariate normal
Notation: X = (X1 , ..., Xp )T ∼ Normal(µ, Σ)
Support: Xj ∈ (−∞, ∞)
Parameters: mean vector µ = (µ1 , ..., µp ) with µj ∈ (−∞, ∞) and p × p
positive definite covariance matrix Σ
PDF: $(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left[-\tfrac{1}{2}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right]$
Mean: E(Xj ) = µj
Variance: V(Xj ) = σj2 where σj2 is the (j, j) element of Σ
Covariance: Cov(Xj , Xk ) = σjk where σjk is the (j, k) element of Σ
Marginal distributions: Xj ∼ Normal(µj , σj2 )
Notes: For q-vector a and q × p matrix b, a + bX ∼ Normal(a + bµ, bΣbT ).

Multivariate t
Notation: X = (X1 , ..., Xp )T ∼ tν (µ, Σ)
Support: Xj ∈ (−∞, ∞)
Parameters: location µ = (µ1 , ..., µp ) with µj ∈ (−∞, ∞), p × p positive
definite matrix Σ and degrees of freedom ν > 0
PDF: $\dfrac{\Gamma(\nu/2+p/2)}{\Gamma(\nu/2)(\nu\pi)^{p/2}}|\Sigma|^{-1/2}\left[1+\dfrac{1}{\nu}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right]^{-(\nu+p)/2}$
Mean: $E(X_j) = \mu_j$ (if ν > 1)
Variance: $V(X_j) = \dfrac{\nu}{\nu-2}\sigma_j^{2}$ where $\sigma_j^{2}$ is the (j, j) element of Σ (if ν > 2)
Covariance: $\text{Cov}(X_j, X_k) = \dfrac{\nu}{\nu-2}\sigma_{jk}$ where $\sigma_{jk}$ is the (j, k) element of Σ (if ν > 2)
Marginal distributions: $X_j \sim t_{\nu}(\mu_j, \sigma_j^{2})$
Notes: For q-vector a and q × p matrix b, $a + bX \sim t_{\nu}(a + b\mu, b\Sigma b^{T})$; X is approximately Normal(µ, Σ) for large ν; if X|W ∼ Normal(µ, Σ/W) and W ∼ Gamma(ν/2, ν/2), then $X \sim t_{\nu}(\mu, \Sigma)$.

Dirichlet
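Notation: $X = (X_1, ..., X_p) \sim \text{Dirichlet}(\theta)$
Support: $X_j \in [0, 1]$ with $\sum_{j=1}^{p} X_j = 1$
Parameters: $\theta = (\theta_1, ..., \theta_p)$ with $\theta_j > 0$
PDF: $\dfrac{\Gamma\left(\sum_{j=1}^{p}\theta_j\right)}{\prod_{j=1}^{p}\Gamma(\theta_j)}\prod_{j=1}^{p}x_j^{\theta_j-1}$
Mean: $E(X_j) = \theta_j/\left(\sum_{k=1}^{p}\theta_k\right)$
Variance: $V(X_j) = \dfrac{\theta_j\left(\sum_{k\neq j}\theta_k\right)}{\left(\sum_{k=1}^{p}\theta_k\right)^{2}\left(1+\sum_{k=1}^{p}\theta_k\right)}$
Covariance: $\text{Cov}(X_j, X_k) = \dfrac{-\theta_j\theta_k}{\left(\sum_{l=1}^{p}\theta_l\right)^{2}\left(1+\sum_{l=1}^{p}\theta_l\right)}$
Marginal distributions: $X_j \sim \text{Beta}\left(\theta_j, \sum_{k\neq j}\theta_k\right)$
Notes: If $W_j \overset{indep}{\sim} \text{Gamma}(\theta_j, b)$ and $X_j = W_j/\left(\sum_{k=1}^{p}W_k\right)$ then $X = (X_1, ..., X_p) \sim \text{Dirichlet}(\theta)$.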
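
The gamma representation in the Dirichlet notes suggests a simple sampler. The sketch below is one way to write it with Python's standard library; `sample_dirichlet` is a hypothetical helper name. Note that `random.gammavariate` takes a scale as its second argument, so the rate b used in this appendix is passed as 1/b.

```python
import random


def sample_dirichlet(theta, rng, b=1.0):
    """Draw X ~ Dirichlet(theta) by normalizing independent Gamma(theta_j, b) draws.

    The common rate b cancels in the normalization, so any b > 0 works.
    """
    # gammavariate(alpha, beta) uses beta as a SCALE, hence 1/b for rate b
    w = [rng.gammavariate(t, 1.0 / b) for t in theta]
    total = sum(w)
    return [wj / total for wj in w]


if __name__ == "__main__":
    rng = random.Random(1)
    theta = [2.0, 3.0, 5.0]
    draws = [sample_dirichlet(theta, rng) for _ in range(5000)]
    means = [sum(d[j] for d in draws) / len(draws) for j in range(len(theta))]
    print("sample means:", means)
    print("theoretical means:", [t / sum(theta) for t in theta])
```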

Wishart
Notation: X ∼ Wishart(ν, Ω)
Support: X = {Xjk } is a p × p symmetric positive definite matrix
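Parameters: degrees of freedom ν > p − 1 and p × p symmetric positive definite matrix $\Omega = \{\Omega_{jk}\}$
PDF: $\dfrac{1}{2^{p\nu/2}|\Omega|^{\nu/2}\Gamma_p(\nu/2)}|X|^{(\nu-p-1)/2}\exp\left[-\text{trace}(\Omega^{-1}X)/2\right]$
Mean: $E(X_{jk}) = \nu\Omega_{jk}$
Variance: $V(X_{jk}) = \nu(\Omega_{jk}^{2} + \Omega_{jj}\Omega_{kk})$
Marginal distributions: $X_{jj} \sim \text{Gamma}(\nu/2, 1/(2\Omega_{jj}))$
Notes: If ν is an integer and $Z_1, ..., Z_{\nu} \overset{iid}{\sim} \text{Normal}(0, \Omega)$, then $\sum_{i=1}^{\nu} Z_i Z_i^{T} \sim \text{Wishart}(\nu, \Omega)$.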

Inverse Wishart
Notation: X ∼ InvWishart(ν, Ω)
Support: X = {Xjk } is a p × p symmetric positive definite matrix
Parameters: degrees of freedom ν > p − 1 and p × p symmetric positive definite matrix $\Omega = \{\Omega_{jk}\}$
PDF: $\dfrac{|\Omega|^{\nu/2}}{2^{p\nu/2}\Gamma_p(\nu/2)}|X|^{-(\nu+p+1)/2}\exp\left[-\text{trace}(\Omega X^{-1})/2\right]$
Mean: $E(X_{jk}) = \dfrac{1}{\nu-p-1}\Omega_{jk}$ (for ν > p + 1)
Variance: $V(X_{jk}) = \dfrac{(\nu-p+1)\Omega_{jk}^{2}+(\nu-p-1)\Omega_{jj}\Omega_{kk}}{(\nu-p)(\nu-p-1)^{2}(\nu-p-3)}$ (for ν > p + 3)
Marginal distributions: $X_{jj} \sim \text{InvGamma}((\nu-p+1)/2, \Omega_{jj}/2)$
Notes: If $Y \sim \text{Wishart}(\nu, \Omega^{-1})$ then $Y^{-1} \sim \text{InvWishart}(\nu, \Omega)$; if ν = p + 1 and Ω is a diagonal matrix then the correlation $X_{jk}/\sqrt{X_{jj}X_{kk}} \sim \text{Uniform}(-1, 1)$.

A.2: List of conjugacy pairs


Below is a partial list of conjugacy pairs. In these derivations, all parameters
not assigned a prior are assumed to be fixed.

1. Binomial proportion
Likelihood: Y |θ ∼ Binomial(n, θ)
Prior: θ ∼ Beta(a, b)
Posterior: θ|Y ∼ Beta(a + Y, b + n − Y )
2. Negative-binomial proportion
Likelihood: Y |θ ∼ NegBinomial(θ, m)
Prior: θ ∼ Beta(a, b)
Posterior: θ|Y ∼ Beta(a + m, b + Y )
3. Multinomial probabilities
Likelihood: Y = (Y1 , ..., Yp )|θ ∼ Multinomial(n, θ)
Prior: θ ∼ Dirichlet(α) with α = (α1 , ..., αp )
Posterior: θ|Y ∼ Dirichlet(α + Y)
4. Poisson rate
Likelihood: $Y_1, ..., Y_n \mid \lambda \overset{indep}{\sim} \text{Poisson}(N_i\lambda)$ with $N_i$ fixed
Prior: λ ∼ Gamma(a, b)
Posterior: $\lambda \mid Y \sim \text{Gamma}\left(a + \sum_{i=1}^{n}Y_i,\; b + \sum_{i=1}^{n}N_i\right)$
5. Mean of a normal distribution
Likelihood: $Y_1, ..., Y_n \mid \mu \overset{iid}{\sim} \text{Normal}(\mu, \sigma^{2})$
Prior: $\mu \sim \text{Normal}(\theta, \sigma^{2}/m)$
Posterior: $\mu \mid Y \sim \text{Normal}\left(\dfrac{n\bar{Y}+m\theta}{n+m},\; \dfrac{\sigma^{2}}{n+m}\right)$ for $\bar{Y} = \sum_{i=1}^{n}Y_i/n$
6. Variance of a normal distribution
Likelihood: $Y_1, ..., Y_n \mid \sigma^{2} \overset{indep}{\sim} \text{Normal}(\mu_i, \sigma^{2})$
Prior: $\sigma^{2} \sim \text{InvGamma}(a, b)$
Posterior: $\sigma^{2} \mid Y \sim \text{InvGamma}\left(a + n/2,\; b + \sum_{i=1}^{n}(Y_i-\mu_i)^{2}/2\right)$
7. Precision of a normal distribution
Likelihood: $Y_1, ..., Y_n \mid \tau^{2} \overset{indep}{\sim} \text{Normal}(\mu_i, 1/\tau^{2})$
Prior: $\tau^{2} \sim \text{Gamma}(a, b)$
Posterior: $\tau^{2} \mid Y \sim \text{Gamma}\left(a + n/2,\; b + \sum_{i=1}^{n}(Y_i-\mu_i)^{2}/2\right)$
8. Mean vector of a multivariate normal distribution
Likelihood: $Y_1, ..., Y_n \mid \mu \overset{indep}{\sim} \text{Normal}(X_i\mu, \Sigma_i)$
Prior: µ ∼ Normal(θ, Ω)
Posterior: µ|Y ∼ Normal(VM, V) with $V = \left(\sum_{i=1}^{n}X_i^{T}\Sigma_i^{-1}X_i + \Omega^{-1}\right)^{-1}$ and $M = \sum_{i=1}^{n}X_i^{T}\Sigma_i^{-1}Y_i + \Omega^{-1}\theta$
Special case: If $X_i = I$ and $\Sigma_i = \Sigma$ for all i, then $V = (n\Sigma^{-1} + \Omega^{-1})^{-1}$ and $M = n\Sigma^{-1}\bar{Y} + \Omega^{-1}\theta$
9. Covariance matrix of a multivariate normal distribution
Likelihood: $Y_1, ..., Y_n \mid \Sigma \overset{indep}{\sim} \text{Normal}(\mu_i, \Sigma)$
Prior: Σ ∼ InvWishart(ν, R)
Posterior: Σ|Y ∼ InvWishart(n + ν, S + R), where $S = \sum_{i=1}^{n}(Y_i-\mu_i)(Y_i-\mu_i)^{T}$
10. Precision matrix of a multivariate normal distribution
Likelihood: $Y_1, ..., Y_n \mid \Omega \overset{indep}{\sim} \text{Normal}(\mu_i, \Omega^{-1})$
Prior: Ω ∼ Wishart(ν, R)
Posterior: $\Omega \mid Y \sim \text{Wishart}\left(n + \nu,\; (S + R^{-1})^{-1}\right)$, where $S = \sum_{i=1}^{n}(Y_i-\mu_i)(Y_i-\mu_i)^{T}$
11. Rate parameter of a gamma distribution
Likelihood: $Y_1, ..., Y_n \mid b \overset{indep}{\sim} \text{Gamma}(a_i, w_i b)$
Prior: b ∼ Gamma(u, v)
Posterior: $b \mid Y \sim \text{Gamma}\left(\sum_{i=1}^{n}a_i + u,\; \sum_{i=1}^{n}w_iY_i + v\right)$
12. Arbitrary parameter with discrete prior
Likelihood: $Y_1, ..., Y_n \mid \theta \overset{indep}{\sim} f_i(Y_i \mid \theta)$
Prior: $\text{Prob}(\theta = \theta_k) = \pi_k$ for $\theta \in \{\theta_1, ..., \theta_m\}$
Posterior: $\text{Prob}(\theta = \theta_k \mid Y) = L_k/\left[\sum_{j=1}^{m}L_j\right]$ where $L_k = \pi_k\prod_{i=1}^{n}f_i(Y_i \mid \theta_k)$
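
Several of the conjugacy pairs above reduce to one-line parameter updates. As a minimal sketch (the function names are illustrative, and only pairs 1, 4 and 5 are shown), the updates can be coded directly from the list:

```python
def beta_binomial_update(a, b, y, n):
    """Pair 1: Beta(a, b) prior with Binomial(n, theta) data y."""
    return a + y, b + n - y


def gamma_poisson_update(a, b, ys, Ns):
    """Pair 4: Gamma(a, b) prior with Y_i ~ Poisson(N_i * lambda)."""
    return a + sum(ys), b + sum(Ns)


def normal_mean_update(theta, m, sigma2, ys):
    """Pair 5: Normal(theta, sigma2/m) prior on the mean, known variance sigma2.

    Returns the posterior mean and posterior variance.
    """
    n = len(ys)
    ybar = sum(ys) / n
    return (n * ybar + m * theta) / (n + m), sigma2 / (n + m)


if __name__ == "__main__":
    # Uniform prior on a proportion, 7 successes in 10 trials
    print(beta_binomial_update(1, 1, y=7, n=10))
    # Two counts with exposures 1 and 2
    print(gamma_poisson_update(2.0, 1.0, ys=[3, 5], Ns=[1.0, 2.0]))
    # Three observations, prior worth m = 1 observation
    print(normal_mean_update(theta=0.0, m=1, sigma2=4.0, ys=[1.0, 2.0, 3.0]))
```

In each case the prior parameters act like pseudo-data that are simply added to the observed sufficient statistics.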

A.3: Derivations
Normal-normal model for a mean
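Say $Y_i \mid \mu \overset{iid}{\sim} \text{Normal}(\mu, \sigma^{2})$ for i = 1, ..., n with $\sigma^{2}$ known and prior $\mu \sim \text{Normal}(\theta, \sigma^{2}/m)$. Since $Y_1, ..., Y_n$ are independent, the likelihood factors as
$$f(\mathbf{Y}\mid\mu) = \prod_{i=1}^{n} f(Y_i\mid\mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(Y_i-\mu)^{2}}{2\sigma^{2}}\right].$$
Discarding constants that do not depend on µ and expressing the product of exponentials as the exponential of the sum, the likelihood is
$$f(\mathbf{Y}\mid\mu) \propto \exp\left[-\sum_{i=1}^{n}\frac{(Y_i-\mu)^{2}}{2\sigma^{2}}\right] \propto \exp\left[-\frac{1}{2}\left(-2\frac{n\bar{Y}}{\sigma^{2}}\mu + \frac{n}{\sigma^{2}}\mu^{2}\right)\right]$$
where $\bar{Y} = \sum_{i=1}^{n}Y_i/n$. The last step comes from multiplying out the quadratic terms, collecting them as a function of their power of µ, and discarding terms without a µ. Similarly, the prior can be written
$$\pi(\mu) \propto \exp\left[-\frac{m(\mu-\theta)^{2}}{2\sigma^{2}}\right] \propto \exp\left[-\frac{1}{2}\left(-2\frac{m\theta}{\sigma^{2}}\mu + \frac{m}{\sigma^{2}}\mu^{2}\right)\right].$$
Because both the likelihood and prior are quadratic in µ they can be combined as
$$\begin{aligned}
p(\mu\mid\mathbf{Y}) &\propto f(\mathbf{Y}\mid\mu)\pi(\mu)\\
&\propto \exp\left[-\frac{1}{2}\left(-2\frac{n\bar{Y}+m\theta}{\sigma^{2}}\mu + \frac{n+m}{\sigma^{2}}\mu^{2}\right)\right]\\
&\propto \exp\left[-\frac{1}{2}\left(-2M\mu + \frac{1}{V}\mu^{2}\right)\right],
\end{aligned}$$
where $M = (n\bar{Y}+m\theta)/\sigma^{2}$ and $V = \sigma^{2}/(n+m)$. The exponent of the posterior is quadratic in µ, and we have seen that a Gaussian PDF is quadratic in the exponent. Therefore, we rearrange the terms in the posterior to reveal its Gaussian PDF form. Completing the square in the exponent (and discarding and/or adding terms that do not depend on µ) gives
$$p(\mu\mid\mathbf{Y}) \propto \exp\left[-\frac{1}{2}\left(-2M\mu + \frac{1}{V}\mu^{2}\right)\right] \propto \exp\left[-\frac{(\mu-VM)^{2}}{2V}\right].$$
Therefore, the posterior is µ|Y ∼ N(VM, V). Plugging in the above expressions for M and V gives
$$\mu\mid\mathbf{Y} \sim N\left(w\bar{Y} + (1-w)\theta,\; \frac{\sigma^{2}}{n+m}\right)$$
where w = n/(n + m).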
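
The closed-form result above can be checked numerically. The following sketch, with hypothetical function names and an arbitrary made-up data vector, evaluates the unnormalized posterior f(Y|µ)π(µ) on a fine grid and compares its mean and variance with wȲ + (1 − w)θ and σ²/(n + m).

```python
import math


def closed_form_moments(ys, sigma2, theta, m):
    """Posterior mean w*ybar + (1-w)*theta and variance sigma2/(n+m)."""
    n = len(ys)
    ybar = sum(ys) / n
    w = n / (n + m)
    return w * ybar + (1.0 - w) * theta, sigma2 / (n + m)


def grid_posterior_moments(ys, sigma2, theta, m, lo=-10.0, hi=10.0, steps=20001):
    """Mean and variance of p(mu | Y) by brute-force evaluation on a grid."""
    h = (hi - lo) / (steps - 1)
    grid = [lo + i * h for i in range(steps)]

    def log_post(mu):
        loglik = -sum((y - mu) ** 2 for y in ys) / (2.0 * sigma2)
        logprior = -m * (mu - theta) ** 2 / (2.0 * sigma2)
        return loglik + logprior

    logs = [log_post(mu) for mu in grid]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    mean = sum(mu * wt for mu, wt in zip(grid, weights)) / total
    var = sum((mu - mean) ** 2 * wt for mu, wt in zip(grid, weights)) / total
    return mean, var


if __name__ == "__main__":
    ys = [1.2, 0.7, 2.1, 1.6]  # illustrative data only
    print(closed_form_moments(ys, sigma2=1.5, theta=0.0, m=2))
    print(grid_posterior_moments(ys, sigma2=1.5, theta=0.0, m=2))
```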

Normal-normal model for a mean vector


The model is

Y|β ∼ Normal(Xβ, Σ) and β ∼ Normal(µ, Ω).

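As with the normal-normal model in Section 7.4, we proceed by expressing the exponent of the posterior as a quadratic form in β and then comparing this expression to a multivariate normal to determine the posterior. Using precision matrices $U = \Sigma^{-1}$ and $V = \Omega^{-1}$, the posterior is
$$\begin{aligned}
p(\beta\mid\mathbf{Y}) &\propto f(\mathbf{Y}\mid\beta)\pi(\beta)\\
&\propto \exp\left[-\frac{1}{2}(\mathbf{Y}-X\beta)^{T}U(\mathbf{Y}-X\beta) - \frac{1}{2}(\beta-\mu)^{T}V(\beta-\mu)\right]\\
&\propto \exp\left\{-\frac{1}{2}\left[-2(\mathbf{Y}^{T}UX + \mu^{T}V)\beta + \beta^{T}(X^{T}UX + V)\beta\right]\right\}\\
&\propto \exp\left\{-\frac{1}{2}\left[-2W^{T}\beta + \beta^{T}P\beta\right]\right\}
\end{aligned}$$
where $W = X^{T}U\mathbf{Y} + V\mu$ and $P = X^{T}UX + V$. If β|Y ∼ Normal(M, S) for some mean vector M and covariance matrix S, then its PDF can be written
$$\begin{aligned}
p(\beta\mid\mathbf{Y}) &\propto \exp\left[-\frac{1}{2}(\beta-M)^{T}S^{-1}(\beta-M)\right]\\
&\propto \exp\left\{-\frac{1}{2}\left[-2M^{T}S^{-1}\beta + \beta^{T}S^{-1}\beta\right]\right\}.
\end{aligned}$$
To reconcile these two expressions of the posterior we must have posterior covariance $S = P^{-1}$ and $S^{-1}M = W$, and thus $M = SW = P^{-1}W$. Inserting the expressions for W and P and replacing precision matrices with covariance matrices gives the posterior
$$\beta\mid\mathbf{Y} \sim \text{Normal}\left[\Sigma_{\beta}(X^{T}\Sigma^{-1}\mathbf{Y} + \Omega^{-1}\mu),\; \Sigma_{\beta}\right],$$
where $\Sigma_{\beta} = (X^{T}\Sigma^{-1}X + \Omega^{-1})^{-1}$.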

Normal-inverse Wishart model for a covariance matrix


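The model for the p-vectors $Y_1, ..., Y_n$ given the p × p covariance matrix Σ is
$$Y_i \overset{indep}{\sim} \text{Normal}(\mu_i, \Sigma) \quad\text{and}\quad \Sigma \sim \text{InvWishart}_p(\nu, R).$$
Using the facts that for arbitrary matrices A, B and C, Trace(A + B) = Trace(A) + Trace(B) and Trace(ABC) = Trace(BCA), the likelihood can be written
$$\begin{aligned}
f(\mathbf{Y}\mid\Sigma) &\propto \prod_{i=1}^{n}|\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(Y_i-\mu_i)^{T}\Sigma^{-1}(Y_i-\mu_i)\right]\\
&\propto |\Sigma|^{-n/2}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(Y_i-\mu_i)^{T}\Sigma^{-1}(Y_i-\mu_i)\right]\\
&\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\text{Trace}\left[(Y_i-\mu_i)^{T}\Sigma^{-1}(Y_i-\mu_i)\right]\right\}\\
&\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\text{Trace}\left[\Sigma^{-1}(Y_i-\mu_i)(Y_i-\mu_i)^{T}\right]\right\}\\
&\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\text{Trace}\left[\Sigma^{-1}\sum_{i=1}^{n}(Y_i-\mu_i)(Y_i-\mu_i)^{T}\right]\right\}\\
&\propto |\Sigma|^{-n/2}\exp\left[-\frac{1}{2}\text{Trace}(\Sigma^{-1}W)\right]
\end{aligned}$$
where $W = \sum_{i=1}^{n}(Y_i-\mu_i)(Y_i-\mu_i)^{T}$. The inverse Wishart prior is
$$\pi(\Sigma) \propto |\Sigma|^{-(\nu+p+1)/2}\exp\left[-\frac{1}{2}\text{Trace}(\Sigma^{-1}R)\right].$$
Combining the likelihood and prior, the posterior is
$$p(\Sigma\mid\mathbf{Y}) \propto f(\mathbf{Y}\mid\Sigma)\pi(\Sigma) \propto |\Sigma|^{-(n+\nu+p+1)/2}\exp\left\{-\frac{1}{2}\text{Trace}\left[\Sigma^{-1}(W+R)\right]\right\}.$$
Therefore, $\Sigma\mid\mathbf{Y} \sim \text{InvWishart}_p\left(n+\nu,\; \sum_{i=1}^{n}(Y_i-\mu_i)(Y_i-\mu_i)^{T} + R\right)$.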

Jeffreys’ prior for a normal model


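The Gaussian model is $Y_i \overset{iid}{\sim} \text{Normal}(\mu, \sigma^{2})$. Denote $\tau = \sigma^{2}$. The log likelihood is
$$\log f(\mathbf{Y}\mid\mu,\tau) = -\frac{n}{2}\log(\tau) - \frac{1}{2\tau}\sum_{i=1}^{n}(Y_i-\mu)^{2}.$$
The information matrix depends on both second derivatives and the cross derivative. The second derivatives are
$$\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\mu^{2}} = \frac{\partial}{\partial\mu}\,\frac{1}{\tau}\sum_{i=1}^{n}(Y_i-\mu) = -\frac{n}{\tau}$$
and
$$\begin{aligned}
\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\tau^{2}} &= \frac{\partial}{\partial\tau}\left[\frac{-n}{2\tau} + \frac{1}{2\tau^{2}}\sum_{i=1}^{n}(Y_i-\mu)^{2}\right]\\
&= \frac{n}{2\tau^{2}} - \frac{1}{\tau^{3}}\sum_{i=1}^{n}(Y_i-\mu)^{2}.
\end{aligned}$$
The cross derivative is
$$\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\mu\,\partial\tau} = \frac{\partial}{\partial\tau}\,\frac{1}{\tau}\sum_{i=1}^{n}(Y_i-\mu) = -\frac{1}{\tau^{2}}\sum_{i=1}^{n}(Y_i-\mu).$$
Since $E(Y_i) = \mu$ and $E(Y_i-\mu)^{2} = \tau$, the elements of the information matrix are
$$\begin{aligned}
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\mu^{2}}\right] &= \frac{n}{\tau}\\
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\tau^{2}}\right] &= -\frac{n}{2\tau^{2}} + \frac{n\tau}{\tau^{3}} = \frac{n}{2\tau^{2}}\\
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\mu,\tau)}{\partial\mu\,\partial\tau}\right] &= 0.
\end{aligned}$$
The determinant of the 2 × 2 information matrix is thus
$$|I(\mu,\tau)| = \left(\frac{n}{\tau}\right)\left(\frac{n}{2\tau^{2}}\right) - 0^{2} = \frac{n^{2}}{2\tau^{3}},$$
and the JP is
$$\pi(\mu,\sigma^{2}) \propto \sqrt{\frac{n^{2}}{2\tau^{3}}} \propto \frac{1}{(\sigma^{2})^{3/2}}.$$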

Jeffreys’ prior for multiple linear regression


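Assume $\mathbf{Y}\mid\beta,\sigma^{2} \sim \text{Normal}(X\beta, \sigma^{2}I_n)$ and denote $\tau = \sigma^{2}$. The log likelihood is
$$\log f(\mathbf{Y}\mid\beta,\tau) = -\frac{n}{2}\log(\tau) - \frac{1}{2\tau}(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta).$$
The second derivative with respect to τ is
$$\begin{aligned}
\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\tau^{2}} &= \frac{\partial}{\partial\tau}\left[\frac{-n}{2\tau} + \frac{1}{2\tau^{2}}(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)\right]\\
&= \frac{n}{2\tau^{2}} - \frac{1}{\tau^{3}}(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta).
\end{aligned}$$
Taking derivatives with respect to β requires using matrix calculus identities including the formula for the derivative of a quadratic form,
$$\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\beta^{2}} = \frac{\partial}{\partial\beta}\,\frac{1}{\tau}X^{T}(\mathbf{Y}-X\beta) = -\frac{1}{\tau}X^{T}X.$$
The cross derivative is
$$\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\beta\,\partial\tau} = \frac{\partial}{\partial\tau}\,\frac{1}{\tau}X^{T}(\mathbf{Y}-X\beta) = -\frac{1}{\tau^{2}}X^{T}(\mathbf{Y}-X\beta).$$
Since $E(\mathbf{Y}) = X\beta$ and $E\left[(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)\right] = n\tau$, the elements of the information matrix are
$$\begin{aligned}
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\beta^{2}}\right] &= \frac{1}{\tau}X^{T}X\\
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\tau^{2}}\right] &= -\frac{n}{2\tau^{2}} + \frac{n\tau}{\tau^{3}} = \frac{n}{2\tau^{2}}\\
-E\left[\frac{\partial^{2}\log f(\mathbf{Y}\mid\beta,\tau)}{\partial\beta\,\partial\tau}\right] &= 0.
\end{aligned}$$
The determinant of the (p + 1) × (p + 1) block-diagonal information matrix is thus
$$|I(\beta,\tau)| = \left|\frac{1}{\tau}X^{T}X\right|\frac{n}{2\tau^{2}} = \frac{n}{2\tau^{p+2}}|X^{T}X|,$$
and the JP is
$$\pi(\beta,\sigma^{2}) \propto \sqrt{\frac{n}{2\tau^{p+2}}|X^{T}X|} \propto \frac{1}{(\sigma^{2})^{p/2+1}}.$$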

Convergence of the Gibbs sampler


Here we provide: (1) a proof that the Gibbs sampler generates posterior samples after convergence and (2) a discussion of the theory of Markov processes that ensures that the Gibbs sampler converges to the posterior distribution.
Part (1): The proof of (1) is equivalent to showing that the posterior distribution is the stationary distribution of this Markov chain. That is, if we make a draw from the posterior distribution and then iterate the Gibbs sampler forward one iteration from this starting point, the next iteration also follows the posterior distribution. To make the derivations tractable, we restrict the proof to the bivariate case with p = 2 and thus $\theta = (\theta_1, \theta_2)$, and denote the posterior density as $p(\theta_1,\theta_2) = p(\theta\mid\mathbf{Y})$, the full conditional density as $p(\theta_1\mid\theta_2) = p(\theta_1\mid\theta_2,\mathbf{Y})$, and the marginal density as $p(\theta_1) = \int p(\theta_1,\theta_2)\,d\theta_2 = \int p(\theta_1\mid\theta_2)p(\theta_2)\,d\theta_2$. Assume we have reached convergence and so one draw in the chain is a realization from the posterior distribution, say $\theta^{*} = (\theta_1^{*},\theta_2^{*}) \sim p(\theta_1,\theta_2)$. We would like to show that the subsequent sample also follows the posterior distribution. By recursion, this shows that once the algorithm has converged, all samples follow the posterior.
The next sample, $(\theta_1', \theta_2')$, drawn from Gibbs sampling has density
$$q(\theta_1',\theta_2'\mid\theta_1^{*},\theta_2^{*}) = p(\theta_1'\mid\theta_2^{*})\,p(\theta_2'\mid\theta_1'),$$
where the two elements of the product represent the updates of the two parameters from their full conditional distributions given the current value of the parameters in the chain. We want to show that the marginal distribution of $(\theta_1',\theta_2')$ integrating over $(\theta_1^{*},\theta_2^{*})$ follows the posterior. The marginal distribution is
$$g(\theta_1',\theta_2') = \int\!\!\int q(\theta_1',\theta_2'\mid\theta_1^{*},\theta_2^{*})\,p(\theta_1^{*},\theta_2^{*})\,d\theta_1^{*}\,d\theta_2^{*}.$$
The integral reduces to
$$\begin{aligned}
g(\theta_1',\theta_2') &= \int\!\!\int p(\theta_1'\mid\theta_2^{*})\,p(\theta_2'\mid\theta_1')\,p(\theta_1^{*}\mid\theta_2^{*})\,p(\theta_2^{*})\,d\theta_2^{*}\,d\theta_1^{*}\\
&= p(\theta_2'\mid\theta_1')\int p(\theta_1'\mid\theta_2^{*})\,p(\theta_2^{*})\left[\int p(\theta_1^{*}\mid\theta_2^{*})\,d\theta_1^{*}\right]d\theta_2^{*}\\
&= p(\theta_2'\mid\theta_1')\int p(\theta_1'\mid\theta_2^{*})\,p(\theta_2^{*})\,d\theta_2^{*}\\
&= p(\theta_2'\mid\theta_1')\int p(\theta_1',\theta_2^{*})\,d\theta_2^{*}\\
&= p(\theta_2'\mid\theta_1')\,p(\theta_1') = p(\theta_1',\theta_2'),
\end{aligned}$$
as desired. The proof for p > 2 is similar but involves higher-order integration.
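
The stationarity argument in Part (1) can be mirrored on a small discrete example, where the integrals become sums. The sketch below is an illustration under the assumption of a finite parameter space (the function name and the joint table are made up): it applies one Gibbs sweep to a joint probability table and returns the marginal g, which should reproduce the table itself.

```python
def gibbs_one_sweep_marginal(p):
    """Return g(t1', t2') = sum over (t1*, t2*) of q(t1', t2' | t1*, t2*) p(t1*, t2*).

    `p` is a joint probability table indexed as p[theta1][theta2].
    """
    k1, k2 = len(p), len(p[0])
    p1 = [sum(p[i][j] for j in range(k2)) for i in range(k1)]  # marginal of theta1
    p2 = [sum(p[i][j] for i in range(k1)) for j in range(k2)]  # marginal of theta2
    g = [[0.0] * k2 for _ in range(k1)]
    for i_star in range(k1):
        for j_star in range(k2):
            for i_new in range(k1):
                cond1 = p[i_new][j_star] / p2[j_star]        # p(theta1' | theta2*)
                for j_new in range(k2):
                    cond2 = p[i_new][j_new] / p1[i_new]      # p(theta2' | theta1')
                    g[i_new][j_new] += cond1 * cond2 * p[i_star][j_star]
    return g


if __name__ == "__main__":
    joint = [[0.10, 0.20, 0.05],
             [0.30, 0.15, 0.20]]
    for row_p, row_g in zip(joint, gibbs_one_sweep_marginal(joint)):
        print(row_p, row_g)
```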
Part (2): Part (1) shows (for a special case) that the stationary distribution of the Gibbs sampler is the posterior distribution. The proof that the Gibbs sampler converges to its stationary (posterior) distribution draws heavily from Markov chain theory. Given that the posterior distribution is the stationary distribution, [82] proves that a Gibbs sampler converges to the posterior distribution if the chain is aperiodic and p-irreducible. A chain is aperiodic if, for any partition of the posterior domain of θ, say $\{A_1, ..., A_m\}$, in which each subset has positive posterior probability, the probability of the chain transitioning from $A_i$ to $A_j$ is positive for any i and j. A chain is p-irreducible if, for any initial value $\theta^{(0)}$ in the support of the posterior distribution and any set A with positive posterior probability, i.e., Prob(θ ∈ A) > 0, there exists an s so that there is a positive probability that the chain will visit A at iteration s. Proving convergence then requires showing that the Gibbs sampler is aperiodic and p-irreducible, which is discussed, e.g., in [82] and [69]. A sufficient condition is that for any set A with positive posterior probability and any initial value $\theta^{(0)}$ in the support of the posterior distribution, the probability under the Gibbs sampler that $\theta^{(1)} \in A$ is positive. This condition is met in all but exotic cases where the support of a full conditional distribution depends on the values of other parameters, in which case convergence should be studied carefully.

Marginal distribution of a normal mean under Jeffreys’ prior


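Assume $Y_i \overset{iid}{\sim} \text{Normal}(\mu, \sigma^{2})$ and Jeffreys' prior $\pi(\mu,\sigma^{2}) \propto (\sigma^{2})^{-3/2}$. Denoting $\tau = \sigma^{2}$, the joint posterior is
$$\begin{aligned}
p(\mu,\tau\mid\mathbf{Y}) &\propto \left\{\tau^{-n/2}\exp\left[-\frac{\sum_{i=1}^{n}(Y_i-\mu)^{2}}{2\tau}\right]\right\}\tau^{-3/2}\\
&\propto \tau^{-(n+1)/2-1}\exp\left[-\frac{\sum_{i=1}^{n}(Y_i-\mu)^{2}}{2\tau}\right]\\
&\propto \tau^{-A-1}\exp\left(-\frac{B}{\tau}\right),
\end{aligned}$$
where A = (n + 1)/2 and $B = \sum_{i=1}^{n}(Y_i-\mu)^{2}/2$. As a function of τ, the joint distribution resembles an InvGamma(A, B) PDF. Integrating over τ gives
$$\begin{aligned}
p(\mu\mid\mathbf{Y}) &\propto \int p(\mu,\tau\mid\mathbf{Y})\,d\tau\\
&\propto \int \tau^{-A-1}\exp(-B/\tau)\,d\tau\\
&\propto \frac{\Gamma(A)}{B^{A}}\int\frac{B^{A}}{\Gamma(A)}\tau^{-A-1}\exp(-B/\tau)\,d\tau\\
&\propto \frac{\Gamma(A)}{B^{A}}\\
&\propto \left[\sum_{i=1}^{n}(Y_i-\mu)^{2}\right]^{-(n+1)/2}.
\end{aligned}$$
The marginal PDF is a quadratic function of µ raised to the power −(n + 1)/2, suggesting that the posterior is a t distribution with n degrees of freedom. Completing the square gives
$$\begin{aligned}
\sum_{i=1}^{n}(Y_i-\mu)^{2} &= \sum_{i=1}^{n}Y_i^{2} - 2\sum_{i=1}^{n}Y_i\mu + n\mu^{2}\\
&= n\left[\sum_{i=1}^{n}Y_i^{2}/n - 2\bar{Y}\mu + \mu^{2}\right]\\
&= n\left[\sum_{i=1}^{n}Y_i^{2}/n - \bar{Y}^{2} + \bar{Y}^{2} - 2\bar{Y}\mu + \mu^{2}\right]\\
&= n\left[\sum_{i=1}^{n}Y_i^{2}/n - \bar{Y}^{2} + (\mu-\bar{Y})^{2}\right]\\
&= n\left[\hat{\sigma}^{2} + (\mu-\bar{Y})^{2}\right],
\end{aligned}$$
since $\hat{\sigma}^{2} = \sum_{i=1}^{n}(Y_i-\bar{Y})^{2}/n = \sum_{i=1}^{n}Y_i^{2}/n - \bar{Y}^{2}$. Inserting this expression back into the marginal posterior gives
$$\begin{aligned}
p(\mu\mid\mathbf{Y}) &\propto \left[\sum_{i=1}^{n}(Y_i-\mu)^{2}\right]^{-(n+1)/2}\\
&\propto \left[\hat{\sigma}^{2} + (\mu-\bar{Y})^{2}\right]^{-(n+1)/2}\\
&\propto \left[1 + \frac{1}{n}\left(\frac{\mu-\bar{Y}}{\hat{\sigma}/\sqrt{n}}\right)^{2}\right]^{-(n+1)/2}.
\end{aligned}$$
This is Student's t distribution with location parameter $\bar{Y}$, scale parameter $\hat{\sigma}/\sqrt{n}$, and n degrees of freedom.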
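
A short numerical check of the two key steps in this derivation may be useful. The sketch below (function names and data values are illustrative assumptions) verifies the completed-square identity and confirms that the marginal kernel and the Student's t kernel differ only by a factor that does not depend on µ.

```python
import math


def sum_of_squares(ys, mu):
    """Left-hand side: sum_i (Y_i - mu)^2."""
    return sum((y - mu) ** 2 for y in ys)


def completed_square(ys, mu):
    """Right-hand side: n * [sigma_hat^2 + (mu - ybar)^2]."""
    n = len(ys)
    ybar = sum(ys) / n
    sigma2_hat = sum((y - ybar) ** 2 for y in ys) / n
    return n * (sigma2_hat + (mu - ybar) ** 2)


def marginal_kernel(ys, mu):
    """Unnormalized p(mu | Y): [sum_i (Y_i - mu)^2]^{-(n+1)/2}."""
    return sum_of_squares(ys, mu) ** (-(len(ys) + 1) / 2.0)


def t_kernel(ys, mu):
    """Unnormalized t_n density with location ybar and scale sigma_hat/sqrt(n)."""
    n = len(ys)
    ybar = sum(ys) / n
    sigma_hat = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    z = (mu - ybar) / (sigma_hat / math.sqrt(n))
    return (1.0 + z * z / n) ** (-(n + 1) / 2.0)


if __name__ == "__main__":
    ys = [1.0, 2.5, 0.3, 4.1]  # illustrative data only
    for mu in (0.0, 1.5, 3.0):
        print(mu, sum_of_squares(ys, mu), completed_square(ys, mu),
              marginal_kernel(ys, mu) / t_kernel(ys, mu))
```

The last printed column should be the same for every µ, which is what proportionality means here.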

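Marginal posterior of the regression coefficients under Jeffreys' prior

Assume $\mathbf{Y}\mid\beta,\sigma^{2} \sim \text{Normal}(X\beta, \sigma^{2}I_n)$ and Jeffreys' prior $\pi(\beta,\sigma^{2}) \propto (\sigma^{2})^{-p/2-1}$. Denoting $\tau = \sigma^{2}$, the joint posterior is
$$\begin{aligned}
p(\beta,\tau\mid\mathbf{Y}) &\propto \left\{\tau^{-n/2}\exp\left[-\frac{1}{2\tau}(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)\right]\right\}\tau^{-p/2-1}\\
&\propto \tau^{-A-1}\exp\left(-\frac{B}{\tau}\right),
\end{aligned}$$
where A = (n + p)/2 and $B = (\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)/2$. Marginalizing over $\sigma^{2}$ gives
$$\begin{aligned}
p(\beta\mid\mathbf{Y}) &= \int p(\beta,\tau\mid\mathbf{Y})\,d\tau\\
&\propto \frac{\Gamma(A)}{B^{A}}\int\frac{B^{A}}{\Gamma(A)}\tau^{-A-1}\exp\left(-\frac{B}{\tau}\right)d\tau\\
&\propto B^{-A}\\
&\propto \left[(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)\right]^{-(n+p)/2}.
\end{aligned}$$
The quadratic form is factored as
$$\begin{aligned}
(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta) &= \mathbf{Y}^{T}\mathbf{Y} - 2\mathbf{Y}^{T}X\beta + \beta^{T}W\beta\\
&= \mathbf{Y}^{T}\mathbf{Y} - 2\hat{\beta}^{T}W\beta + \beta^{T}W\beta\\
&= \mathbf{Y}^{T}\mathbf{Y} - \hat{\beta}^{T}W\hat{\beta} + \hat{\beta}^{T}W\hat{\beta} - 2\hat{\beta}^{T}W\beta + \beta^{T}W\beta\\
&= n\hat{\sigma}^{2} + (\beta-\hat{\beta})^{T}W(\beta-\hat{\beta})
\end{aligned}$$
where $W = X^{T}X$, $\hat{\beta} = W^{-1}X^{T}\mathbf{Y}$ is the usual least squares estimator, and $n\hat{\sigma}^{2} = (\mathbf{Y}-X\hat{\beta})^{T}(\mathbf{Y}-X\hat{\beta}) = \mathbf{Y}^{T}\mathbf{Y} - \hat{\beta}^{T}W\hat{\beta}$. Therefore,
$$\begin{aligned}
p(\beta\mid\mathbf{Y}) &\propto \left[(\mathbf{Y}-X\beta)^{T}(\mathbf{Y}-X\beta)\right]^{-(n+p)/2}\\
&\propto \left[n\hat{\sigma}^{2} + (\beta-\hat{\beta})^{T}W(\beta-\hat{\beta})\right]^{-(n+p)/2}\\
&\propto \left[1 + \frac{1}{n\hat{\sigma}^{2}}(\beta-\hat{\beta})^{T}W(\beta-\hat{\beta})\right]^{-(n+p)/2}.
\end{aligned}$$
The marginal posterior of β is thus the p-dimensional t-distribution with location vector $\hat{\beta}$, scale matrix $\hat{\sigma}^{2}(X^{T}X)^{-1}$, and n degrees of freedom.
The two-sample t-test in Section 4.1.2 is a special case. If we parameterize the means as µ for the first group and µ + δ for the second group then X's first column has all ones and its second column is $n_1$ zeros followed by $n_2$ ones, and $\beta = (\mu, \delta)^{T}$. This gives the least squares estimator as $\hat{\beta} = (\bar{Y}_1, \bar{Y}_2 - \bar{Y}_1)^{T}$ and $\hat{\sigma}^{2} = (n_1\hat{\sigma}_1^{2} + n_2\hat{\sigma}_2^{2})/(n_1 + n_2)$ and
$$(X^{T}X)^{-1} = \begin{pmatrix} n_1+n_2 & n_2\\ n_2 & n_2\end{pmatrix}^{-1} = \frac{1}{n_1n_2}\begin{pmatrix} n_2 & -n_2\\ -n_2 & n_1+n_2\end{pmatrix} = \begin{pmatrix} \frac{1}{n_1} & -\frac{1}{n_1}\\ -\frac{1}{n_1} & \frac{1}{n_1}+\frac{1}{n_2}\end{pmatrix}.$$
This gives the components of the joint posterior distribution of β. Since the marginal distributions of a multivariate t are univariate t, we have $\delta\mid\mathbf{Y} \sim t_n\left[\bar{Y}_2 - \bar{Y}_1,\; \hat{\sigma}^{2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\right]$.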

Proof of posterior consistency


Here we prove posterior consistency in the general but simple case with independent data and a parameter with discrete support. Assume that:
(A1) $Y_i \overset{iid}{\sim} f(y\mid\theta)$ for i = 1, ..., n
(A2) The support is discrete, $\theta \in \{\theta_1, \theta_2, ...\} = S$
(A3) The true value $\theta_0 \in S$ has positive prior probability, $\pi(\theta_0) > 0$
(A4) The Kullback-Leibler divergence
$$KL(\theta) = E_{Y\mid\theta_0}\left\{\log\left[\frac{f(Y\mid\theta_0)}{f(Y\mid\theta)}\right]\right\}$$
satisfies KL(θ) > 0 for all $\theta \neq \theta_0$.

Assumption (A4) ensures that the parameter is identifiable by asserting that on average the log likelihood is higher for the true value than for any other value.

Theorem 1 Assuming (A1)-(A4), $\text{Prob}(\theta = \theta_0\mid Y_1, ..., Y_n) \rightarrow 1$ as n → ∞.

Proof 1 For any θ ∈ S,
$$\log\left[\frac{p(\theta\mid\mathbf{Y})}{p(\theta_0\mid\mathbf{Y})}\right] = \log\left[\frac{\pi(\theta)}{\pi(\theta_0)}\right] + \sum_{i=1}^{n}\log\left[\frac{f(Y_i\mid\theta)}{f(Y_i\mid\theta_0)}\right].$$
By the law of large numbers, $\frac{1}{n}\sum_{i=1}^{n}\log\left[\frac{f(Y_i\mid\theta)}{f(Y_i\mid\theta_0)}\right] \rightarrow -KL(\theta)$, and thus for large n
$$\log\left[\frac{p(\theta\mid\mathbf{Y})}{p(\theta_0\mid\mathbf{Y})}\right] \approx \log\left[\frac{\pi(\theta)}{\pi(\theta_0)}\right] - n\,KL(\theta).$$
Therefore, as n → ∞, $p(\theta\mid\mathbf{Y})/p(\theta_0\mid\mathbf{Y}) \rightarrow 0$ for any $\theta \neq \theta_0$, and $\text{Prob}(\theta = \theta_0\mid\mathbf{Y})$ converges to one.

This proof can be generalized to continuous parameters by discretizing the support and making additional assumptions about the smoothness of the likelihood and prior density functions.
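
To see the theorem at work, the sketch below simulates Bernoulli data and computes the posterior over a three-point support, which is also an instance of conjugacy pair 12 (arbitrary parameter with discrete prior). The support, prior, sample size and function name are all illustrative assumptions; the computation is done on the log scale to avoid underflow.

```python
import math
import random


def discrete_posterior(ys, support, prior):
    """Posterior probabilities for Bernoulli(theta) data under a discrete prior.

    Each log L_k = log(pi_k) + sum_i log f(Y_i | theta_k); the posterior is
    L_k / sum_j L_j, computed after subtracting the largest log L_k.
    """
    s, n = sum(ys), len(ys)
    log_l = [math.log(pk) + s * math.log(t) + (n - s) * math.log(1.0 - t)
             for t, pk in zip(support, prior)]
    top = max(log_l)
    weights = [math.exp(v - top) for v in log_l]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(1)
    support, prior, truth = [0.2, 0.5, 0.8], [1 / 3, 1 / 3, 1 / 3], 0.5
    ys = [1 if rng.random() < truth else 0 for _ in range(2000)]
    for n in (0, 5, 50, 500, 2000):
        print(n, discrete_posterior(ys[:n], support, prior))
```

As n grows, the printed probability attached to the true value 0.5 should approach one, in line with Theorem 1.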

A.4: Computational algorithms


Integrated nested Laplace approximation (INLA)
INLA [73] is a deterministic approximation to the marginal posterior of each parameter that combines many of the ideas discussed in Section 3.1. The method is most fitting in the special but common case where the parameter vector θ = (α, β) can be divided into a low-dimensional α and a high-dimensional β whose posterior is approximately Gaussian. For example, in a random effects model (Section 4.4) α might include the variance components and β might include all of the Gaussian random effects.
Invoking the Bayesian CLT (i.e., Laplace approximation) in Section 3.1.3, assume that the conditional posterior of β conditioned on α is approximately
$$\beta\mid\alpha,\mathbf{Y} \sim \text{Normal}\left(\mu(\alpha), \Sigma(\alpha)\right)$$
and denote the corresponding density function as φ(β; µ(α), Σ(α)). We first use this approximation for the marginal distribution of the low-dimensional parameter α. Since p(α, β|Y) = p(β|α, Y)p(α|Y), the marginal posterior of α can be written
$$p(\alpha\mid\mathbf{Y}) = \frac{p(\alpha,\beta\mid\mathbf{Y})}{p(\beta\mid\alpha,\mathbf{Y})}.$$
Expanding around the MAP estimate β = µ(α) and using the Laplace approximation for the denominator gives, up to a normalizing constant, the approximation
$$p(\alpha\mid\mathbf{Y}) \approx \left.\frac{f(\mathbf{Y}\mid\alpha,\beta)\,\pi(\alpha,\beta)}{\phi(\beta;\mu(\alpha),\Sigma(\alpha))}\right|_{\beta=\mu(\alpha)}.$$
This low-dimensional distribution can be evaluated using the methods in Section 3.1, e.g., grid approximations or numerical integration.
The Laplace approximation can also be used to approximate the marginal distribution of each element of β. Let $\beta_{-i}$ be the elements of β excluding $\beta_i$. Following arguments similar to the approximation of the posterior of α,
$$p(\beta_i\mid\alpha,\mathbf{Y}) \propto \frac{f(\mathbf{Y}\mid\alpha,\beta)\,\pi(\alpha,\beta)}{p(\beta_{-i}\mid\beta_i,\alpha,\mathbf{Y})}.$$
This can be approximated using a Laplace approximation for $p(\beta_{-i}\mid\beta_i,\alpha,\mathbf{Y})$ around its posterior mode ([73] also consider faster approximations). Finally, obtaining $p(\beta_i\mid\mathbf{Y})$ requires numerical integration over α, and therefore the Laplace approximation is nested within numerical integration.

Metropolis–adjusted Langevin algorithm


Metropolis–Hastings sampling (Section 3.2.2) is a flexible algorithm but depends on finding a reasonable candidate distribution. A Gaussian random walk distribution for the candidate $\theta^{*} = (\theta_1^{*}, ..., \theta_p^{*})$ given the current value at the onset of iteration s, $\theta^{(s-1)}$, is
$$\theta^{*}\mid\theta^{(s-1)} \sim \text{Normal}(\theta^{(s-1)}, c^{2}I_p),$$
where c > 0 is a tuning parameter. This candidate is easy to code, very general and surprisingly effective. However, convergence can be improved by tailoring the candidate distribution to the problem at hand. We saw in Section 3.2 that if the candidate distribution is taken to be the full conditional distribution, Metropolis–Hastings sampling becomes Gibbs sampling. While Gibbs sampling is free from tuning parameters and is thus easier to implement, it requires derivation of full conditional distributions, which can be tedious and is not always possible.
The Metropolis-adjusted Langevin (MALA) algorithm [68] balances the strengths of random-walk Metropolis and Gibbs sampling. Rather than simply centering the candidate distribution on the current value, MALA uses the gradient of the posterior to push the candidate distribution towards the center of the distribution. This requires computing the gradient of the posterior and thus the algorithm is more complex than a random walk, but the gradient is typically easier to derive and more generally available than the full conditional distribution required for Gibbs sampling.
Define the gradient vector of the log posterior as $\nabla(\theta) = [\nabla_1(\theta), ..., \nabla_p(\theta)]^{T}$ where
$$\nabla_j(\theta) = \frac{\partial}{\partial\theta_j}\left\{\log[f(\mathbf{Y}\mid\theta)] + \log[\pi(\theta)]\right\}$$
is the partial derivative with respect to the jth parameter. The candidate is
$$\theta^{*}\mid\theta^{(s-1)} \sim \text{Normal}\left(\theta^{(s-1)} + \frac{c^{2}}{2}\nabla(\theta^{(s-1)}),\; c^{2}I_p\right).$$
Unlike the random-walk candidate distribution, the MALA candidate distribution is asymmetric and requires including the candidate distribution in the acceptance ratio,
$$R = \frac{f(\mathbf{Y}\mid\theta^{*})\pi(\theta^{*})}{f(\mathbf{Y}\mid\theta^{(s-1)})\pi(\theta^{(s-1)})}\cdot\frac{\exp\left\{-\frac{1}{2c^{2}}\sum_{j=1}^{p}\left[\theta_j^{(s-1)} - \theta_j^{*} - \frac{c^{2}}{2}\nabla_j(\theta^{*})\right]^{2}\right\}}{\exp\left\{-\frac{1}{2c^{2}}\sum_{j=1}^{p}\left[\theta_j^{*} - \theta_j^{(s-1)} - \frac{c^{2}}{2}\nabla_j(\theta^{(s-1)})\right]^{2}\right\}}.$$
As with the standard Metropolis algorithm, the tuning parameter c should be adjusted to give a reasonable acceptance probability. Roberts and Rosenthal [68] argue that 0.574 is the optimal acceptance probability, but they claim that acceptance probabilities between 0.4 and 0.8 work well. In this section we have assumed that the candidate standard deviation c is the same for all p parameters and that the candidates are independent across parameters. Convergence can often be improved by adapting the candidate covariance to resemble the posterior covariance. Finally, we note that since MALA is simply a special type of MH sampling it can be used within a larger Gibbs sampling algorithm just like MH sampling steps.
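
The MALA update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production sampler: the function names are hypothetical, the target is assumed to be a standard normal posterior so that the gradient is available in closed form, and the acceptance step is carried out on the log scale.

```python
import math
import random


def std_normal_log_post(theta):
    """Log posterior (up to a constant) of the illustrative target."""
    return -0.5 * sum(v * v for v in theta)


def std_normal_grad(theta):
    """Gradient of the illustrative log posterior."""
    return [-v for v in theta]


def mala(log_post, grad, theta0, c, n_iter, rng):
    """Metropolis-adjusted Langevin sampler with step size c."""

    def log_q(to, frm, g_frm):
        # log candidate density (up to a constant) of moving from `frm` to `to`
        return -sum((t - f - 0.5 * c * c * g) ** 2
                    for t, f, g in zip(to, frm, g_frm)) / (2.0 * c * c)

    theta = list(theta0)
    lp, g = log_post(theta), grad(theta)
    samples, accepted = [], 0
    for _ in range(n_iter):
        # candidate: current value shifted along the gradient, plus Gaussian noise
        cand = [t + 0.5 * c * c * gj + c * rng.gauss(0.0, 1.0)
                for t, gj in zip(theta, g)]
        lp_c, g_c = log_post(cand), grad(cand)
        log_r = lp_c - lp + log_q(theta, cand, g_c) - log_q(cand, theta, g)
        if math.log(1.0 - rng.random()) < log_r:
            theta, lp, g = cand, lp_c, g_c
            accepted += 1
        samples.append(list(theta))
    return samples, accepted / n_iter


if __name__ == "__main__":
    rng = random.Random(42)
    draws, rate = mala(std_normal_log_post, std_normal_grad, [0.0, 0.0], 1.0, 5000, rng)
    print("acceptance rate:", rate)
    print("mean of first coordinate:", sum(d[0] for d in draws) / len(draws))
```

In practice c would be tuned until the acceptance rate falls in the range recommended in the text.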

Hamiltonian Monte Carlo (HMC)


MALA improves on random-walk Metropolis sampling by fitting the candidate
distribution to the posterior by incorporating the gradient of the log poste-
rior. However, for highly irregular posterior distributions (e.g., a U-shaped or
donut-shaped posterior), one step along the gradient may not be sufficient to
traverse the posterior. Hybrid Monte Carlo (HMC; also called Hamiltonian
Monte Carlo) sampling [60] generalizes MALA to take multiple random steps
guided by the gradient. The simple version in Algorithm 4 has two tuning
parameters: the step size c and the number of steps L. If L = 1 then this
algorithm reduces to MALA with c as the candidate standard deviation. Mo-
tivation, extensions and tuning of this algorithm are beyond the scope of this
text but form the basis for the software STAN [15] which is compared with
other MCMC software in Appendix A.5.
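
Algorithm 4 translates almost line for line into code. The sketch below is an illustrative Python version under stated assumptions: the function names are hypothetical, the target is a standard normal posterior chosen so the gradient is trivial, and the acceptance ratio R is evaluated on the log scale.

```python
import math
import random


def std_normal_log_post(theta):
    """Log posterior (up to a constant) of the illustrative target."""
    return -0.5 * sum(v * v for v in theta)


def std_normal_grad(theta):
    """Gradient of the illustrative log posterior."""
    return [-v for v in theta]


def hmc(log_post, grad, theta0, c, n_steps, n_iter, rng):
    """Hamiltonian Monte Carlo with step size c and n_steps gradient-guided steps."""
    theta = list(theta0)
    samples, accepted = [], 0
    for _s in range(n_iter):
        z = [rng.gauss(0.0, 1.0) for _j in theta]            # fresh auxiliary variable
        cand = list(theta)
        g = grad(cand)
        z_star = [zj + 0.5 * c * gj for zj, gj in zip(z, g)]  # initial half step
        for _l in range(n_steps):
            cand = [t + c * zj for t, zj in zip(cand, z_star)]
            g = grad(cand)
            z_star = [zj + c * gj for zj, gj in zip(z_star, g)]
        z_star = [zj - 0.5 * c * gj for zj, gj in zip(z_star, g)]  # undo extra half step
        log_r = (log_post(cand) - log_post(theta)
                 - 0.5 * sum(v * v for v in z_star)
                 + 0.5 * sum(v * v for v in z))
        if math.log(1.0 - rng.random()) < log_r:
            theta = cand
            accepted += 1
        samples.append(list(theta))
    return samples, accepted / n_iter


if __name__ == "__main__":
    rng = random.Random(7)
    draws, rate = hmc(std_normal_log_post, std_normal_grad, [0.0, 0.0], 0.25, 8, 2000, rng)
    print("acceptance rate:", rate)
    print("mean of first coordinate:", sum(d[0] for d in draws) / len(draws))
```

The step size 0.25 and the 8 steps are arbitrary choices for this toy target; real applications need tuning, which is what software such as STAN automates.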

Algorithm 4 Hamiltonian MCMC

Initialize $\theta^{(0)} = (\theta_1^{(0)}, ..., \theta_p^{(0)})$
for s = 1, ..., S do
    sample $z \sim \text{Normal}(0, I_p)$
    set $\theta^{*} = \theta^{(s-1)}$
    set $z^{*} = z + c\nabla(\theta^{*})/2$
    for l = 1, ..., L do
        set $\theta^{*} = \theta^{*} + cz^{*}$
        set $z^{*} = z^{*} + c\nabla(\theta^{*})$
    end for
    set $z^{*} = z^{*} - c\nabla(\theta^{*})/2$
    set $R = \dfrac{f(\mathbf{Y}\mid\theta^{*})\,\pi(\theta^{*})}{f(\mathbf{Y}\mid\theta^{(s-1)})\,\pi(\theta^{(s-1)})}\cdot\dfrac{\exp(-z^{*T}z^{*}/2)}{\exp(-z^{T}z/2)}$
    sample U ∼ Uniform(0, 1)
    if U < R then
        $\theta^{(s)} = \theta^{*}$
    else
        $\theta^{(s)} = \theta^{(s-1)}$
    end if
end for

Delayed rejection and adaptive Metropolis

Delayed rejection and adaptive Metropolis (DRAM, [39]) is a combination of two ideas: delayed rejection Metropolis [57] and adaptive Metropolis [40]. Adaptive Metropolis allows the covariance of the candidate distribution to evolve across iterations. The intuition is that if the posterior is irregularly shaped, then a different proposal distribution is needed depending on the current state of the chain. Assuming a Gaussian random-walk candidate distribution, $\theta^{*}\mid\theta^{(s-1)} \sim \text{Normal}\left(\theta^{(s-1)}, V^{(s-1)}\right)$, the user sets an initial p × p covariance matrix $V^{(0)}$ that is then adapted as
$$V^{(s)} = c\left[\hat{V}^{(s)} + \delta I\right]$$
where $\hat{V}^{(s)}$ is the sample covariance of the previous samples $\theta^{(1)}, ..., \theta^{(s-1)}$, δ > 0 is a small constant to avoid singularities and $c = 2.4^{2}/p$ [31].
Delayed rejection Metropolis replaces the standard single proposal in Metropolis–Hastings sampling with multiple proposals considered sequentially. The first stage is a usual Metropolis–Hastings step with candidate $\theta^{*}\mid\theta^{(s-1)} \sim q\left(\theta^{*}\mid\theta^{(s-1)}\right)$ and acceptance probability
$$R\left(\theta^{*},\theta^{(s-1)}\right) = \min\left\{1,\; \frac{p(\theta^{*}\mid\mathbf{Y})\,q\left(\theta^{(s-1)}\mid\theta^{*}\right)}{p\left(\theta^{(s-1)}\mid\mathbf{Y}\right)q\left(\theta^{*}\mid\theta^{(s-1)}\right)}\right\}.$$
If the first candidate is rejected, a second candidate is proposed as $\theta'\mid\theta^{*},\theta^{(s-1)} \sim Q\left(\theta'\mid\theta^{*},\theta^{(s-1)}\right)$ and accepted with probability
$$\min\left\{1,\; \frac{p(\theta'\mid\mathbf{Y})\,q(\theta'\mid\theta^{*})\,Q\left(\theta'\mid\theta^{*},\theta^{(s-1)}\right)\left[1 - R(\theta',\theta^{*})\right]}{p\left(\theta^{(s-1)}\mid\mathbf{Y}\right)q\left(\theta^{(s-1)}\mid\theta^{*}\right)Q\left(\theta^{(s-1)}\mid\theta^{*},\theta'\right)\left[1 - R\left(\theta^{(s-1)},\theta^{*}\right)\right]}\right\}.$$
The notation becomes cumbersome but this can be iterated beyond two candidates. DRAM combines these two ideas by using adaptive Metropolis to tune the Gaussian candidate distributions used for q and Q.

Slice sampling
Slice sampling [59] is a clever way to apply Gibbs sampling when the full
conditional distributions do not belong to known parametric families of dis-
tributions. Slice sampling introduces an auxiliary variable (i.e., a variable that
is not an actual parameter) U > 0 and draws samples from the joint distribu-
tion
p∗ (θ, U ) = I[0 < U < p(θ|Y)].
By construction, under p∗ the marginal distribution of θ is
Z
I[0 < U < p(θ|Y)]dU = p(θ|Y),

and therefore if samples of (θ, U ) are drawn from p∗ , then the samples of θ
follow the posterior distribution. Also, Gibbs sampling can be used to draw
samples from p∗ since the full conditional distributions are both uniform
1. U |θ, Y ∼ Uniform on [0, p(θ|Y)]
2. θ|U, Y ∼ Uniform on P (U ) = {θ; p(θ|Y) > U }
Therefore, slice sampling works by drawing from the joint distribution of
(θ, U ), discarding the samples of U and retaining the samples from θ. The
most challenging step is to make a draw from P (U ) (see the figure below). For
some posteriors P (U ) has a simple form and samples can be drawn directly.
In other cases, θ can be drawn from a uniform distribution with a domain
that includes P (U ) until a sample falls in P (U ).
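
The two Gibbs steps above can be sketched in R as follows. This is only an illustration under assumptions that are not in the text: a Beta(3, 7) density stands in for the posterior p(θ|Y), the function name slice_sample is hypothetical, and the draw from P(U) uses the rejection approach just described (uniform proposals on [0, 1] until one falls in P(U)).

```r
set.seed(1)

# Assumed example target standing in for p(theta | Y); its support is [0, 1]
post <- function(theta) dbeta(theta, 3, 7)

# Hypothetical helper: slice sampler for a density supported on [0, 1]
slice_sample <- function(post, n_iter, theta_init = 0.5) {
  theta   <- numeric(n_iter)
  current <- theta_init
  for (s in 1:n_iter) {
    # Step 1: U | theta, Y ~ Uniform on [0, p(theta | Y)]
    U <- runif(1, 0, post(current))
    # Step 2: theta | U, Y ~ Uniform on P(U), drawn by proposing from
    # Uniform(0, 1) until the proposal lands in P(U)
    repeat {
      cand <- runif(1, 0, 1)
      if (post(cand) > U) break
    }
    current  <- cand
    theta[s] <- current
  }
  theta
}

draws <- slice_sample(post, n_iter = 5000)
mean(draws)   # should be close to the Beta(3, 7) mean of 0.3
```

The samples of U are never stored, which mirrors the description above: only the θ draws are retained. The rejection step is simple but can be slow when P(U) is a small part of the bounding interval.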
[Figure: Illustration of slice sampling. The curve is the posterior density p(θ|Y),
the horizontal line represents the auxiliary variable U (i.e., the “slice”), and
the bold interval is P(U) = {θ; p(θ|Y) > U}. The horizontal axis runs from 0.0
to 1.0 and the vertical axis from 0.0 to 2.5.]

A.5: Software comparison


There are now many software packages to implement the MCMC algorithms
discussed in Chapter 3. Here we compare four packages: JAGS, OpenBUGS, STAN
and NIMBLE. The packages are all fairly similar to use and so rather than stick-
ing with one package for all analyses, we recommend becoming familiar with
multiple packages as different packages will be more effective for some appli-
cations than others. As these examples show, JAGS and OpenBUGS are slightly
easier to code and sufficiently fast for the low to medium complexity analyses
considered in this book, but for more complex models or larger datasets it is
useful to be familiar with other packages such as STAN.

Example 1: Simple linear regression


Listings 7.10–7.13 give R code to fit a simple linear regression model using
the four packages under consideration. The code is nearly identical for JAGS,
OpenBUGS and NIMBLE and slightly more complex for STAN. For this simple
example, JAGS has the fastest computation time (see the table below) followed
by OpenBUGS, then STAN and then NIMBLE, but this is mostly due to overhead
time setting up the more intricate updating schemes in NIMBLE and STAN. The
effective sample size for the slope is the highest for STAN and similar for the
other three software packages.
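
One detail to keep in mind when reading the listings side by side: in the BUGS-style model code used by JAGS, OpenBUGS and NIMBLE the second argument of dnorm is a precision, whereas normal in STAN takes a standard deviation. The sketch below converts between the two; the helper names prec_to_sd and sd_to_prec are hypothetical and are not part of any of the four packages.

```r
# Hypothetical helpers for moving between the two parameterizations
prec_to_sd <- function(tau)   1 / sqrt(tau)    # precision -> standard deviation
sd_to_prec <- function(sigma) 1 / sigma^2      # standard deviation -> precision

# SD implied by the dnorm(0, 0.0000001) priors in the BUGS-style listings
prec_to_sd(0.0000001)   # about 3162

# Precision implied by the normal(0, 1000000) priors in the STAN listing
sd_to_prec(1000000)     # 1e-12
```

Both choices are very diffuse relative to the scale of the data, so the priors in the listings are vague in either parameterization even though they are not numerically identical.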

MCMC software for simple linear regression. Effective sample size
(ESS) and run time (seconds) for two chains each with 30,000 MCMC
iterations and a burn-in of 10,000.

Package     ESS for β1   ESS for β2   Run time (s)
JAGS              4038         4098           0.09
OpenBUGS          1600         1600           3.75
STAN             10391        10360           32.0
NIMBLE            3984         4006           48.0

Example 2: Random slopes model


For a more challenging example, Listings 7.14–7.17 give R code to run a ran-
dom slopes regression model using the jaw data introduced in Section 4.4 and
plotted in Figure 4.6. The table below gives the effective sample size for the
means of the random effects distributions (the variances have higher effec-
tive sample sizes for all models). As in the simple regression example, JAGS is
the fastest package and STAN has the highest effective sample size. Weighing
these two factors, in this case STAN is the better option, but all packages are

Listing 7.10
JAGS code for simple linear regression for the paleo data.

# Paleo data
mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age  <- c(2, 15, 14, 16, 18, 22, 28)
n    <- length(age)

# Fit in JAGS
#install.packages("rjags")
library(rjags)

# Model definition; dnorm in JAGS takes a mean and a precision
model_string <- textConnection("model{
  for(i in 1:n){
    mass[i] ~ dnorm(beta1 + beta2*age[i],tau)
  }
  tau   ~ dgamma(0.01, 0.01)
  beta1 ~ dnorm(0, 0.0000001)
  beta2 ~ dnorm(0, 0.0000001)
}")

data  <- list(mass=mass,age=age,n=n)
inits <- list(beta1=rnorm(1),beta2=rnorm(1),tau=10)
model <- jags.model(model_string, data = data,
                    inits=inits, n.chains=2)

# Burn-in of 10,000 iterations, then 20,000 retained iterations per chain
update(model, 10000)
samples <- coda.samples(model, n.iter=20000,
                        variable.names=c("beta1","beta2"))
summary(samples)

Listing 7.11
OpenBUGS code for simple linear regression for the paleo data.

# Paleo data
mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age  <- c(2, 15, 14, 16, 18, 22, 28)
n    <- length(age)

#install.packages("R2OpenBUGS")
library(R2OpenBUGS)

# The model is written as an R function in BUGS syntax
model_string <- function() {
  for(i in 1:n){
    mass[i] ~ dnorm(mn[i],tau)
    mn[i]  <- beta1 + beta2*age[i]
  }
  tau   ~ dgamma(0.01, 0.01)
  beta1 ~ dnorm(0, 0.0000001)
  beta2 ~ dnorm(0, 0.0000001)
}

data  <- list(mass=mass,age=age,n=n)
# inits is a function so that each of the two chains gets its own initial values
inits <- function(){list(beta1=rnorm(1),beta2=rnorm(1),tau=10)}
fit   <- bugs(model.file=model_string,
              data=data,inits=inits,
              parameters.to.save=c("beta1","beta2"),
              n.iter=30000,n.burnin=10000,n.chains=2)
fit

Listing 7.12
STAN code for simple linear regression for the paleo data.

# Paleo data
mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age  <- c(2, 15, 14, 16, 18, 22, 28)
n    <- length(age)

#install.packages("rstan")
library(rstan)

# Model definition; normal() in STAN takes a mean and a standard deviation
stan_model <- "

  data {
    int<lower=0> n;
    vector[n] mass;
    vector[n] age;
  }

  parameters {
    real beta1;
    real beta2;
    real<lower=0> sigma;
  }

  model {
    vector[n] mu;
    beta1 ~ normal(0,1000000);
    beta2 ~ normal(0,1000000);
    sigma ~ cauchy(0.0,1000);
    mu = beta1 + beta2*age;
    mass ~ normal(mu,sigma);
  }
"

data     <- list(n=n,age=age,mass=mass)
# iter counts the warmup iterations, so 20,000 draws per chain are retained
fit_stan <- stan(model_code = stan_model,
                 data = data, chains=2, warmup = 10000, iter = 30000)
fit_stan

Listing 7.13
NIMBLE code for simple linear regression for the paleo data.

# Paleo data
mass <- c(29.9, 1761, 1807, 2984, 3230, 5040, 5654)
age  <- c(2, 15, 14, 16, 18, 22, 28)
n    <- length(age)

#install.packages("nimble")
library(nimble)
library(coda)   # provides effectiveSize() for the coda output below

model_string <- nimbleCode({
  for(i in 1:n){
    mass[i] ~ dnorm(mn[i],tau)
    mn[i]  <- beta1 + beta2*age[i]
  }
  tau   ~ dgamma(0.01, 0.01)
  beta1 ~ dnorm(0, 0.0000001)
  beta2 ~ dnorm(0, 0.0000001)
})

# NIMBLE separates constants (sample size, covariates) from data (responses)
consts  <- list(n=n,age=age)
data    <- list(mass=mass)
inits   <- function(){list(beta1=rnorm(1),beta2=rnorm(1),tau=10)}
samples <- nimbleMCMC(model_string, data = data, inits = inits,
                      constants=consts,
                      monitors = c("beta1", "beta2"),
                      samplesAsCodaMCMC=TRUE,WAIC=FALSE,
                      niter = 30000, nburnin = 10000, nchains = 2)
plot(samples)
effectiveSize(samples)

Listing 7.14
JAGS code for the random slopes model for the jaw data.

# Assumes the jaw data are already loaded: Y (n x m matrix), age (length m), n and m
library(rjags)

model_string <- textConnection("model{
  # Likelihood
  for(i in 1:n){for(j in 1:m){
    Y[i,j] ~ dnorm(alpha1[i]+alpha2[i]*age[j],tau3)
  }}

  # Random effects
  for(i in 1:n){
    alpha1[i] ~ dnorm(mu1,tau1)
    alpha2[i] ~ dnorm(mu2,tau2)
  }

  # Priors
  mu1  ~ dnorm(0,0.0001)
  mu2  ~ dnorm(0,0.0001)
  tau1 ~ dgamma(0.1,0.1)
  tau2 ~ dgamma(0.1,0.1)
  tau3 ~ dgamma(0.1,0.1)
}")

data   <- list(Y=Y,age=age,n=n,m=m)
params <- c("mu1","mu2","tau1","tau2","tau3")
model  <- jags.model(model_string,data = data,
                     n.chains=2,quiet=TRUE)
# Burn-in of 10,000 iterations, then 90,000 retained iterations per chain
update(model, 10000, progress.bar="none")
samples <- coda.samples(model, variable.names=params,
                        n.iter=90000, progress.bar="none")
summary(samples)

sufficient, especially if thinning were implemented for JAGS, OpenBUGS and
NIMBLE.

MCMC software for the random slopes model. Effective sample size
(ESS) and run time (seconds) for two chains each with 100,000 MCMC
iterations and a burn-in of 10,000.

Package     ESS for µ1   ESS for µ2   Run time (s)
JAGS               293          335           1.83
OpenBUGS          1300         1200           10.2
STAN            180000       180000            424
NIMBLE             283          311           26.5
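
One way to weigh run time against effective sample size is to compute the effective samples produced per second. The R sketch below does this using only the numbers reported in the two tables of this section (the ESS for the first parameter in each table). The data frame and column names are hypothetical, and ESS per second is just one possible summary, not a criterion defined in the text.

```r
# ESS and run times copied from the two tables in this section
ex1 <- data.frame(package = c("JAGS", "OpenBUGS", "STAN", "NIMBLE"),
                  ess     = c(4038, 1600, 10391, 3984),   # ESS for beta1
                  seconds = c(0.09, 3.75, 32.0, 48.0))
ex2 <- data.frame(package = c("JAGS", "OpenBUGS", "STAN", "NIMBLE"),
                  ess     = c(293, 1300, 180000, 283),    # ESS for mu1
                  seconds = c(1.83, 10.2, 424, 26.5))

# Effective samples generated per second of run time
ex1$ess_per_sec <- ex1$ess / ex1$seconds
ex2$ess_per_sec <- ex2$ess / ex2$seconds

# Packages ordered from most to least efficient by this measure
ex1[order(-ex1$ess_per_sec), ]
ex2[order(-ex2$ess_per_sec), ]
```

On these reported numbers, JAGS has the highest ESS per second for the simple linear regression and STAN has the highest for the random slopes model.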

Listing 7.15
OpenBUGS code for the random slopes model for the jaw data.

# Assumes the jaw data are already loaded: Y (n x m matrix), age (length m), n and m
library(R2OpenBUGS)

model_string <- function(){
  # Likelihood
  for(i in 1:n){for(j in 1:m){
    Y[i,j]  ~ dnorm(mn[i,j],tau3)
    mn[i,j] <- alpha1[i]+alpha2[i]*age[j]
  }}

  # Random effects
  for(i in 1:n){
    alpha1[i] ~ dnorm(mu1,tau1)
    alpha2[i] ~ dnorm(mu2,tau2)
  }

  # Priors
  mu1  ~ dnorm(0,0.0001)
  mu2  ~ dnorm(0,0.0001)
  tau1 ~ dgamma(0.1,0.1)
  tau2 ~ dgamma(0.1,0.1)
  tau3 ~ dgamma(0.1,0.1)
}

data   <- list(Y=Y,age=age,n=n,m=m)
params <- c("mu1","mu2","tau1","tau2","tau3")
inits  <- function(){list(mu1=0,mu2=0,tau1=.1,tau2=.2,tau3=.2)}
fit    <- bugs(model.file=model_string,
               data=data,inits=inits,
               parameters.to.save=params,DIC=FALSE,
               n.iter=90000,n.chains=2,n.burnin=10000)
fit$summary

Listing 7.16
STAN code for the random slopes model for the jaw data.

# Assumes the jaw data are already loaded: Y (n x m matrix), age (length m), n and m
library(rstan)

stan_model <- "

  data {
    int<lower=0> n;
    int<lower=0> m;
    vector[m] age;
    matrix[n,m] Y;
  }

  parameters {
    vector[n] alpha1;
    vector[n] alpha2;
    real mu1;
    real mu2;
    real<lower=0> sigma3;
    real<lower=0> sigma2;
    real<lower=0> sigma1;
  }

  model {
    real mu;
    // Random effects centered on mu1 and mu2, matching the
    // JAGS, OpenBUGS and NIMBLE versions of the model
    alpha1 ~ normal(mu1,sigma1);
    alpha2 ~ normal(mu2,sigma2);
    sigma1 ~ cauchy(0.0,1000);
    sigma2 ~ cauchy(0.0,1000);
    sigma3 ~ cauchy(0.0,1000);
    mu1 ~ normal(0,1000);
    mu2 ~ normal(0,1000);

    for(i in 1:n){for(j in 1:m){
      mu = alpha1[i] + alpha2[i]*age[j];
      Y[i,j] ~ normal(mu,sigma3);
    }}
  }
"

data     <- list(Y=Y,age=age,n=n,m=m)
fit_stan <- stan(model_code = stan_model,
                 data = data,
                 chains=2, warmup = 10000, iter = 100000)
summary(fit_stan)$summary

Listing 7.17
NIMBLE code for the random slopes model for the jaw data.

# Assumes the jaw data are already loaded: Y (n x m matrix), age (length m), n and m
library(nimble)

model_string <- nimbleCode({
  # Likelihood
  for(i in 1:n){for(j in 1:m){
    Y[i,j]  ~ dnorm(mn[i,j],tau3)
    mn[i,j] <- alpha1[i]+alpha2[i]*age[j]
  }}

  # Random effects
  for(i in 1:n){
    alpha1[i] ~ dnorm(mu1,tau1)
    alpha2[i] ~ dnorm(mu2,tau2)
  }

  # Priors
  mu1  ~ dnorm(0,0.0001)
  mu2  ~ dnorm(0,0.0001)
  tau1 ~ dgamma(0.1,0.1)
  tau2 ~ dgamma(0.1,0.1)
  tau3 ~ dgamma(0.1,0.1)
})

params <- c("mu1","mu2","tau1","tau2","tau3")
# NIMBLE separates constants (dimensions, covariates) from data (responses)
consts <- list(n=n,m=m,age=age)
data   <- list(Y=Y)
inits  <- function(){
  list(mu1=rnorm(1),mu2=rnorm(1),tau1=10,tau2=10,tau3=10)
}
samples <- nimbleMCMC(model_string, data = data, inits = inits,
                      constants=consts,
                      monitors = params,
                      samplesAsCodaMCMC=TRUE,WAIC=FALSE,
                      niter = 100000, nburnin = 10000, nchains = 2)
plot(samples)
Bibliography

[1] Helen Abbey. An examination of the Reed-Frost theory of epidemics.
Human Biology, 24(3):201, 1952.
[2] Hirotugu Akaike. Information theory and an extension of the maximum
likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213.
Springer, 1998.
[3] James H Albert and Siddhartha Chib. Bayesian analysis of binary and
polychotomous response data. Journal of the American Statistical Asso-
ciation, 88(422):669–679, 1993.
[4] Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical
Modeling and Analysis for Spatial Data. CRC Press, 2014.
[5] Albert Barberán, Robert R Dunn, Brian J Reich, Krishna Pacifici,
Eric B Laber, Holly L Menninger, James M Morton, Jessica B Henley,
Jonathan W Leff, Shelly L Miller, and Noah Fierer. The ecology of mi-
croscopic life in household dust. In Proceedings of the Royal Society B,
volume 282, page 20151139. The Royal Society, 2015.
[6] Maria Maddalena Barbieri and James O Berger. Optimal predictive
model selection. The Annals of Statistics, 32(3):870–897, 2004.
[7] Daryl J Bem. Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and
Social Psychology, 100(3):407, 2011.

[8] James Berger. The case for objective Bayesian analysis. Bayesian Anal-
ysis, 1(3):385–402, 2006.
[9] James O Berger, Luis R Pericchi, JK Ghosh, Tapas Samanta, Fulvio
De Santis, JO Berger, and LR Pericchi. Objective Bayesian methods for
model selection: Introduction and comparison. Institute of Mathematical
Statistics Lecture Notes - Monograph Series, pages 135–207, 2001.
[10] Jose M Bernardo. Reference posterior distributions for Bayesian infer-
ence. Journal of the Royal Statistical Society: Series B (Methodological),
pages 113–147, 1979.


[11] Anirban Bhattacharya, Debdeep Pati, Natesh S Pillai, and David B Dun-
son. Dirichlet–Laplace priors for optimal shrinkage. Journal of the Amer-
ican Statistical Association, 110(512):1479–1490, 2015.
[12] Howard D Bondell and Brian J Reich. Consistent high-dimensional
Bayesian variable selection via penalized credible regions. Journal of
the American Statistical Association, 107(500):1610–1624, 2012.

[13] Dennis D Boos and Leonard A Stefanski. Essential Statistical Inference:
Theory and Methods, volume 120. Springer Science & Business Media,
2013.
[14] Carlos A Botero, Beth Gardner, Kathryn R Kirby, Joseph Bulbulia,
Michael C Gavin, and Russell D Gray. The ecology of religious beliefs.
Proceedings of the National Academy of Sciences, 111(47):16784–16789,
2014.
[15] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben
Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Pe-
ter Li, and Allen Riddell. STAN: A probabilistic programming language.
Journal of Statistical Software, 20(2):1–37, 2016.
[16] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe
estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
[17] Fang Chen. Bayesian modeling using the MCMC procedure. In Proceed-
ings of the SAS Global Forum 2008 Conference, Cary NC: SAS Institute
Inc, 2009.
[18] Hugh A Chipman, Edward I George, and Robert E McCulloch. BART:
Bayesian additive regression trees. The Annals of Applied Statistics,
4(1):266–298, 2010.
[19] Ciprian M Crainiceanu, David Ruppert, and Matthew P Wand. Bayesian
analysis for penalized spline regression using WinBUGS. Journal of Sta-
tistical Software, 14(1):1–24, 2005.
[20] A Philip Dawid. Present position and potential developments: Some
personal views: Statistical theory: The prequential approach. Journal of
the Royal Statistical Society: Series A (General), pages 278–292, 1984.

[21] Gustavo de los Campos and Paulino Perez Rodriguez. BLR: Bayesian
Linear Regression, 2014. R package version 1.4.
[22] Perry de Valpine, Daniel Turek, Christopher J Paciorek, Clifford
Anderson-Bergman, Duncan Temple Lang, and Rastislav Bodik. Pro-
gramming with models: Writing statistical algorithms for general model
structures with NIMBLE. Journal of Computational and Graphical
Statistics, 26(2):403–413, 2017.

[23] Peter Diggle, Rana Moyeed, Barry Rowlingson, and Madeleine Thom-
son. Childhood malaria in the Gambia: A case-study in model-based
geostatistics. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 51(4):493–506, 2002.
[24] Stewart M Edie, Peter D Smits, and David Jablonski. Probabilistic mod-
els of species discovery and biodiversity comparisons. Proceedings of the
National Academy of Sciences, 114(14):3666–3671, 2017.

[25] Gregory M Erickson, Peter J Makovicky, Philip J Currie, Mark A
Norell, Scott A Yerby, and Christopher A Brochu. Gigantism and
comparative life-history parameters of tyrannosaurid dinosaurs. Nature,
430(7001):772–775, 2004.
[26] Kevin R. Forward, David Haldane, Duncan Webster, Carolyn Mills,
Cherly Brine, and Diane Aylward. A comparison between the Strep A
Rapid Test Device and conventional culture for the diagnosis of strepto-
coccal pharyngitis. Canadian Journal of Infectious Diseases and Medical
Microbiology, 17:221–223, 2004.
[27] Seymour Geisser. Discussion on sampling and Bayes inference in scientific
modeling and robustness (by GEP Box). Journal of the Royal Statistical
Society: Series A (General), 143:416–417, 1980.
[28] Alan E Gelfand, Peter Diggle, Peter Guttorp, and Montserrat Fuentes.
Handbook of Spatial Statistics. CRC press, 2010.
[29] Andrew Gelman et al. Prior distributions for variance parameters in hi-
erarchical models (comment on article by Browne and Draper). Bayesian
Analysis, 1(3):515–534, 2006.
[30] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predic-
tive information criteria for Bayesian models. Statistics and Computing,
24(6):997–1016, 2014.
[31] Andrew Gelman, Gareth O Roberts, and Walter R Gilks. Efficient
Metropolis jumping rules. Bayesian Statistics, 5(599-608):42, 1996.
[32] Andrew Gelman and Donald B Rubin. Inference from iterative simulation
using multiple sequences. Statistical Science, pages 457–472, 1992.

[33] Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice
of Bayesian statistics. British Journal of Mathematical and Statistical
Psychology, 66(1):8–38, 2013.
[34] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distribu-
tions, and the Bayesian restoration of images. In Readings in Computer
Vision, pages 564–584. Elsevier, 1987.

[35] Edward I George and Robert E McCulloch. Variable selection via Gibbs
sampling. Journal of the American Statistical Association, 88(423):881–
889, 1993.
[36] John Geweke. Evaluating the accuracy of sampling-based approaches to
the calculation of posterior moments, volume 196. Federal Reserve Bank
of Minneapolis, Research Department, Minneapolis, MN, USA, 1991.

[37] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparamet-
ric Bayesian Inference, volume 44. Cambridge University Press, 2017.
[38] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules,
prediction, and estimation. Journal of the American Statistical Associa-
tion, 102(477):359–378, 2007.
[39] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman.
DRAM: Efficient adaptive MCMC. Statistics and Computing, 16(4):339–
354, 2006.
[40] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive
Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
[41] Wilfred K Hastings. Monte Carlo sampling methods using Markov chains
and their applications. Biometrika, 57(1):97–109, 1970.
[42] James S Hodges. Richly Parameterized Linear Models: Additive, Time
Series, and Spatial Models using Random Effects. Chapman & Hall/CRC,
2016.
[43] James S Hodges and Brian J Reich. Adding spatially-correlated errors can
mess up the fixed effect you love. The American Statistician, 64(4):325–
334, 2010.
[44] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased esti-
mation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[45] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volin-
sky. Bayesian model averaging: A tutorial. Statistical Science, pages
382–401, 1999.
[46] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and
Lawrence K Saul. An introduction to variational methods for graphi-
cal models. Machine Learning, 37(2):183–233, 1999.
[47] Bindu Kalesan, Matthew E Mobily, Olivia Keiser, Jeffrey A Fagan, and
Sandro Galea. Firearm legislation and firearm mortality in the USA:
A cross-sectional, state-level study. The Lancet, 387(10030):1847–1855,
2016.

[48] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the
American Statistical Association, 90(430):773–795, 1995.
[49] Robert E Kass and Larry Wasserman. A reference Bayesian test for
nested hypotheses and its relationship to the Schwarz criterion. Journal
of the American Statistical Association, 90(431):928–934, 1995.
[50] Hong Lan, Meng Chen, Jessica B Flowers, Brian S Yandell, Donnie S
Stapleton, Christine M Mata, Eric Ton-Keen Mui, Matthew T Flowers,
Kathryn L Schueler, Kenneth F Manly, et al. Combined expression trait
correlations and expression quantitative trait locus mapping. PLoS Ge-
netics, 2(1):e6, 2006.
[51] Dennis V Lindley. A statistical paradox. Biometrika, 44(1/2):187–192,
1957.
[52] Roderick Little. Calibrated Bayes, for statistics in general, and missing
data in particular. Statistical Science, 26(2):162–174, 2011.
[53] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ry-
der. Approximate Bayesian computational methods. Statistics and Com-
puting, 22(6):1167–1180, 2012.
[54] Peter McCullagh and John Nelder. Generalized Linear Models, Second
Edition. Boca Raton: Chapman & Hall/CRC, 1989.
[55] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth,
Augusta H Teller, and Edward Teller. Equation of state calculations by
fast computing machines. The Journal of Chemical Physics, 21(6):1087–
1092, 1953.
[56] Greg Miller. ESP paper rekindles discussion about statistics. Science,
331(6015):272–273, 2011.
[57] Antonietta Mira. On Metropolis-Hastings algorithms with delayed rejec-
tion. Metron, 59(3-4):231–241, 2001.
[58] Frederick Mosteller and David L Wallace. Inference in an authorship
problem: A comparative study of discrimination methods applied to the
authorship of the disputed Federalist Papers. Journal of the American
Statistical Association, 58(302):275–309, 1963.
[59] Radford M Neal. Slice sampling. Annals of Statistics, pages 705–741,
2003.
[60] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of
Markov Chain Monte Carlo, 2(11):2, 2011.
[61] Radford M Neal. Bayesian Learning for Neural Networks, volume 118.
Springer Science & Business Media, 2012.

[62] Jorge Nocedal and Stephen J Wright. Sequential Quadratic Programming.
Springer, 2006.

[63] Krishna Pacifici, Brian J Reich, David AW Miller, Beth Gardner, Glenn
Stauffer, Susheela Singh, Alexa McKerrow, and Jaime A Collazo. Inte-
grating multiple data sources in species distribution modeling: A frame-
work for data fusion. Ecology, 98(3):840–850, 2017.

[64] Anand Patil, David Huard, and Christopher J Fonnesbeck. PyMC:
Bayesian stochastic modelling in Python. Journal of Statistical Software,
35(4):1, 2010.
[65] LI Pettit. The conditional predictive ordinate for the normal distribution.
Journal of the Royal Statistical Society: Series B (Methodological), pages
175–184, 1990.
[66] Martyn Plummer. JAGS Version 4.0.0 user manual. See
https://sourceforge.net/projects/mcmc-jags/files/Manuals/4.x, 2015.
[67] Carl Edward Rasmussen. Gaussian processes in machine learning. In
Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.
[68] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete
approximations to Langevin diffusions. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
[69] Gareth O Roberts and Adrian FM Smith. Simple conditions for the
convergence of the Gibbs sampler and Metropolis-Hastings algorithms.
Stochastic Processes and Their Applications, 49(2):207–216, 1994.
[70] Veronika Ročková and Edward I George. The spike-and-slab LASSO.
Journal of the American Statistical Association, 113(521):431–444, 2018.
[71] Donald B Rubin. Bayesianly justifiable and relevant frequency calcula-
tions for the applied statistician. The Annals of Statistics, pages 1151–
1172, 1984.
[72] Donald B Rubin. Multiple imputation after 18+ years. Journal of the
American Statistical Association, 91(434):473–489, 1996.
[73] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian
inference for latent Gaussian models by using integrated nested Laplace
approximations. Journal of the Royal Statistical Society: Series B (Sta-
tistical Methodology), 71(2):319–392, 2009.
[74] John R Sauer, James E Hines, and Jane E Fallon. The North American
Breeding Bird Survey, Results and Analysis 1966–2005. Version 6.2.2006.
USGS Patuxent Wildlife Research Center, Laurel, Maryland, USA, 2005.

[75] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A
Chipman, Edward I George, and Robert E McCulloch. Bayes and big
data: The consensus Monte Carlo algorithm. International Journal of
Management Science and Engineering Management, 11(2):78–88, 2016.
[76] Daniel Simpson, Håvard Rue, Andrea Riebler, Thiago G Martins, and
Sigrunn H Sørbye. Penalising model component complexity: A principled,
practical approach to constructing priors. Statistical Science, 32(1):1–28,
2017.
[77] David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van
Der Linde. Bayesian measures of model complexity and fit. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–
639, 2002.
[78] Mervyn Stone. Necessary and sufficient condition for convergence in prob-
ability to invariant posterior distributions. The Annals of Mathematical
Statistics, pages 1349–1353, 1970.
[79] Sibylle Sturtz, Uwe Ligges, and Andrew Gelman. R2OpenBUGS: A package
for running OpenBUGS from R. URL
http://cran.rproject.org/web/packages/R2OpenBUGS/vignettes/R2OpenBUGS.pdf, 2010.
[80] Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney,
Daniel Fink, and Steve Kelling. eBird: A citizen-based bird observation
network in the biological sciences. Biological Conservation, 142(10):2282–
2292, 2009.
[81] Robert Tibshirani. Regression shrinkage and selection via the LASSO.
Journal of the Royal Statistical Society: Series B (Methodological), pages
267–288, 1996.
[82] Luke Tierney. Markov chains for exploring posterior distributions. Annals
of Statistics, pages 1701–1728, 1994.

[83] Arnold Zellner. On assessing prior distributions and Bayesian regression
analysis with g-prior distributions. Bayesian Inference and Decision
Techniques, 1986.
[84] Yan Zhang, Brian J Reich, and Howard D Bondell. High dimen-
sional linear regression via the R2-D2 shrinkage prior. arXiv preprint
arXiv:1609.00046, 2016.
