Bayesian Networks
With Examples in R, Second Edition
Marco Scutari and Jean-Baptiste Denis
Time Series
Modeling, Computation, and Inference, Second Edition
Raquel Prado, Marco A. R. Ferreira and Mike West
Sampling
Design and Analysis, Third Edition
Sharon L. Lohr
Bayes Rules!
An Introduction to Applied Bayesian Modeling
Alicia A. Johnson, Miles Q. Ott and Mine Dogucu
Alicia A. Johnson
Miles Q. Ott
Mine Dogucu
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2022 Alicia A. Johnson, Miles Q. Ott and Mine Dogucu
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of
their use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write
and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted,
reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means,
now known or hereafter invented, including photocopying, microfilming, and recording, or in
any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please
contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and
are used only for identification and explanation without intent to infringe.
Foreword
Preface
I Bayesian Foundations
2 Bayes' Rule
2.1 Building a Bayesian model for events
2.1.1 Prior probability model
2.1.2 Conditional probability & likelihood
2.1.3 Normalizing constants
2.1.4 Posterior probability model via Bayes' Rule!
2.1.5 Posterior simulation
2.2 Example: Pop vs soda vs coke
2.3 Building a Bayesian model for random variables
2.3.1 Prior probability model
2.3.2 The Binomial data model
2.3.3 The Binomial likelihood function
2.3.4 Normalizing constant
2.3.5 Posterior probability model
2.3.6 Posterior shortcut
2.3.7 Posterior simulation
2.4 Chapter summary
2.5 Exercises
2.5.1 Building up to Bayes' Rule
2.5.2 Practice Bayes' Rule for events
2.5.3 Practice Bayes' Rule for random variables
2.5.4 Simulation exercises
5 Conjugate Families
5.1 Revisiting choice of prior
5.2 Gamma-Poisson conjugate family
5.2.1 The Poisson data model
5.2.2 Potential priors
5.2.3 Gamma prior
5.2.4 Gamma-Poisson conjugacy
5.3 Normal-Normal conjugate family
5.3.1 The Normal data model
5.3.2 Normal prior
5.3.3 Normal-Normal conjugacy
5.3.4 Optional: Proving Normal-Normal conjugacy
5.4 Why no simulation in this chapter?
5.5 Critiques of conjugate family models
5.6 Chapter summary
5.7 Exercises
5.7.1 Practice: Gamma-Poisson
5.7.2 Practice: Normal-Normal
5.7.3 General practice exercises
13 Logistic Regression
13.1 Pause: Odds & probability
13.2 Building the logistic regression model
13.2.1 Specifying the data model
13.2.2 Specifying the priors
13.3 Simulating the posterior
13.4 Prediction & classification
13.5 Model evaluation
13.6 Extending the model
13.7 Chapter summary
13.8 Exercises
13.8.1 Conceptual exercises
13.8.2 Applied exercises
13.8.3 Open-ended exercises
14 Naive Bayes Classification
14.1 Classifying one penguin
14.1.1 One categorical predictor
14.1.2 One quantitative predictor
14.1.3 Two predictors
14.2 Implementing & evaluating naive Bayes classification
14.3 Naive Bayes vs logistic regression
14.4 Chapter summary
14.5 Exercises
14.5.1 Conceptual exercises
14.5.2 Applied exercises
14.5.3 Open-ended exercises
Index
Foreword
Even after decades of thinking about it, Bayes' Rule never ceases to amaze
me. How can one simple formula have such a wide variety of applications?
You will encounter a vibrant sample of such applications in this book,
ranging from weather prediction to LGBTQ+ anti-discrimination laws, and
from who calls soda “pop” (or calls pop “soda”) to how to classify penguin
species. Most importantly, careful study of this book will empower you to
conduct thoughtful Bayesian analyses for the data and applications you
care about.
Statistics and data science focus on using data to learn about the world and
make predictions. The Bayesian approach gives a principled, powerful tool
for obtaining probabilities and predictions about our unknown quantities
of interest, given what we do know (the data). It gives easy-to-interpret
results that directly quantify our uncertainties. Unfortunately, it is rarely
taught in depth at the undergraduate level, perhaps out of concern that
there would be too many scary-looking integrals to do or too much cryptic
code to write.
Bayes Rules! shows that the Bayesian approach is in fact accessible to
students and self-learners with basic statistics knowledge, even if they are
not adept at calculus derivations or coding up fancy algorithms from
scratch. The book achieves this with many reader-friendly features, such
as clear explanations through words and pictures, quizzes to test your
understanding, and the bayesrules R package that contains datasets and
functions that facilitate trying out Bayesian methods.
Better yet, the accessibility is achieved through good pedagogy, not
through giving a watered down, over-simplified look at the subject. For
example, models called hierarchical models and an R package called rstan
are introduced, with highly instructive examples showing how to apply
these to interesting applications. Hierarchical models and rstan are among
the state-of-the-art techniques used in modern Bayesian data analysis.
The Peter Parker principle from the Spider-Man comics says, “With great
power comes great responsibility.” Likewise, the great power of statistics
and data science comes with the great responsibility to consider the
benefits and risks to society, the privacy rights of the participants in a
study, the biases in a dataset and whether a proposed algorithm amplifies
those biases, and other ethical issues. Bayes Rules! emphasizes fairness
and ethics rather than ignoring these crucial issues.
Given that you read Bayes Rules! (actively – make sure to try the self-
quizzes and practice with some exercises!), the probability is high that you
will strengthen your statistical problem-solving skills while experiencing
the joy of Bayesian thinking.
Audience
Bayes Rules! brings the power of Bayes to advanced undergraduate
students and comparably trained practitioners. Accordingly, the book is
neither written at the graduate level nor is it meant to be a first
introduction to the field of statistics. At minimum, the book assumes that
readers are familiar with the content covered in a typical undergraduate-
level introductory statistics course. Readers will also, ideally, have some
experience with undergraduate-level probability, calculus, and the R
statistical software. But wait! Please don't go away if you don't check off
all of these boxes. We provide all R code and enough probability review so
that readers without this background will still be able to follow along so
long as they are eager to pick up these tools on the fly. Further, though
certain calculus concepts are important in Bayesian analysis (thus the
book), calculus derivations are not. The latter are limited to the simple
model settings in early chapters and easily skippable.
Getting set up
Once you're ready to dive into Bayes Rules!, take the following steps to
get set up. First, download the most recent versions of the following
software:
R (https://www.r-project.org/)
RStudio (https://rstudio.com/products/rstudio/download/)
Next, install the following packages within RStudio:
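A minimal sketch of this setup step; the package list below is an assumption based on the tools referenced throughout the book:

```r
# One-time setup: install the packages referenced throughout the book.
# This particular list is an assumption; consult the book's website for the
# authors' definitive list.
install.packages(c("bayesrules", "tidyverse", "janitor",
                   "rstanarm", "bayesplot", "tidybayes", "broom.mixed"))
```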
Contact us
As you read and interact with Bayes Rules!, please feel free to drop us a
note through the form posted on our website: https://bayes-rules.github.io/posts/contact/. We're especially curious to
know if: you have any ideas for increasing the accessibility and inclusion
of this book; you notice any errors or typos; and there's anything you'd like
to see in future versions. When we revise the text, we will occasionally
check in with this form to help guide that process.
Acknowledgments
First and foremost, we'd like to thank the students we've worked with at
Augsburg University, Carleton College, Denison University, Macalester
College, Smith College, and the University of California, Irvine. Their
feedback, insights, and example inspired a high bar for the type of book
we wanted to put out into the world. Beyond our students, we received
valuable feedback from numerous colleagues in the Statistics and Data
Science community. Special thanks to: James Albert, Virgilio Gómez
Rubio, Amy Herring, David Hitchcock, Nick Horton, Yue Jiang, and Paul
Roback. And to our supportive editor, David Grubbs. He certainly made
this first adventure in book publishing more chill and enjoyable than we
expected.
Beyond those above, Alicia would especially like to thank the following:
John Kim, Martha Skold, the Johnsons, and Minneapolis friends for
their support, even on the dreaded “writing brain” days.
Galin Jones for his inviting enthusiasm about statistical computation.
The STAT 454 students at Macalester College for their inspiring
curiosity and the STAT 454 teaching assistants for supporting their
peers in learning about Bayes (Zuofu Huang, Sebastian Coll, Connie
Zhang).
Colleagues in the Department of Mathematics, Statistics, and
Computer Science at Macalester College – you are a humane, fun, and
reflective crew.
Miles would especially like to thank the following:
Francesca Giardine, Sarah Glidden, and Elaona Lemoto for testing out
half-formed exercises, gently pointing out errors, and giving excellent
suggestions. It was an honor to get to work with you at this early stage
of your statistics careers.
The Smith College SDS 320 students from spring 2019 and the SDS
390 students from fall 2020 (especially Audrey Bretin, Dianne
Caravela, Marlene Jackson, and Hannah Snell who provided helpful
feedback on Chapter 8).
His colleagues from Smith College SDS.
The WSDS conference for being the starting point for many
friendships and collaborations, including this book.
His family, Bea Capistrant, Ethan Suniewick, Malkah Bird, Henry
Schneiderman, Christopher Tradowsky, Ross Elfline, Jon Knapp, and
Alex Callendar.
Mine would especially like to thank the following:
Morteza Khakshoor, family, and friends who are far only in distance,
for their love, support, and understanding.
The late Binnaz Melin for being supportive of her career from a young
age.
Students of STATS 115 at UC Irvine who always are the best part of
teaching Bayesian statistics.
Her colleagues in the Department of Statistics at UC Irvine.
All those in the statistics and R community who are supportive of, and
kind towards, others, especially newcomers.
Finally, we would all like to thank each other for recognizing the
importance of humor, empathy, and gratitude to effective collaboration.
Deciding to tackle this project before we really knew one another was a
pretty great gamble. It's been fun!
License
This work is licensed under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International License.
About the Authors
DOI: 10.1201/9780429288340-1
Everybody changes their mind. You likely even changed your mind in the
last minute. Prior to ever opening it, you no doubt had some
preconceptions about this book, whether formed from its title, its online
reviews, or a conversation with a friend who has read it. And then you saw
that the rst chapter opened with a quote from Beyoncé, an unusual choice
for a statistics book. Perhaps this made you think “This book is going to be
even more fun than I realized!” Perhaps it served to do the opposite. No
matter. The point we want to make is that we agree with Beyoncé –
changing is simply part of life.
We continuously update our knowledge about the world as we accumulate
lived experiences, or collect data. As children, it takes a few spills to
understand that liquid doesn't stay in a glass. Or a couple attempts at
conversation to understand that, unlike in cartoons, real dogs can't talk.
Other knowledge is longer in the making. For example, suppose there's a
new Italian restaurant in your town. It has a 5-star online rating and you
love Italian food! Thus, prior to ever stepping foot in the restaurant, you
anticipate that it will be quite delicious. On your first visit, you collect
some edible data: your pasta dish arrives a soggy mess. Weighing the
stellar online rating against your own terrible meal (which might have just
been a fluke), you update your knowledge: this is a 3-star not 5-star
restaurant. Willing to give the restaurant another chance, you make a
second trip. On this visit, you're pleased with your Alfredo and increase
the restaurant's rating to 4 stars. You continue to visit the restaurant,
collecting edible data and updating your knowledge each time.
Figure 1.1 captures the natural Bayesian knowledge-building process of
acknowledging your preconceptions, using data to update your knowledge,
and repeating. We can apply this same Bayesian process to rigorous
research inquiries. If you're a political scientist, yours might be a study of
demographic factors in voting patterns. If you're an environmental
scientist, yours might be an analysis of the human role in climate change.
You don't walk into such an inquiry without context – you carry a degree
of incoming or prior information based on previous research and
experience. Naturally, it's in light of this information that you interpret
new data, weighing both in developing your updated or posterior
information. You continue to refine this information as you gather new
evidence (Figure 1.2).
Goals
Learn to think like a Bayesian.
Explore the foundations of a Bayesian data analysis and how they
contrast with the frequentist alternative.
Learn a little bit about the history of the Bayesian philosophy.
Next, tally up your quiz score using the scoring system below.1 Totals
from 4–5 indicate that your current thinking is fairly frequentist, whereas
totals from 9–12 indicate alignment with the Bayesian philosophy. In
between these extremes, totals from 6–8 indicate that you see strengths in
both philosophies. Your current inclinations might be more frequentist
than Bayesian or vice versa. These inclinations might change throughout
your reading of this book. They might not. For now, we merely wish to
highlight the key differences between the Bayesian and frequentist
philosophies.
_________________________
1 1 . a = 1 point, b = 3 points, c = 2 points; 2. a = 1 point, b = 3 points, c = 1 point; 3. a = 3
points, b = 1 point; 4. a = 3 points, b = 1 point.
Interpreting probability
In the Bayesian philosophy, a probability measures the relative
plausibility of an event.
The frequentist philosophy is so named for its interpretation of
probability as the long-run relative frequency of a repeatable
event.
Thus, in the coin flip example, a Bayesian would conclude that Heads and
Tails are equally likely. In contrast, a frequentist would conclude that if we
flip the coin over and over and over, roughly 1/2 of these flips will be
Heads. Let's try applying these same ideas to the question 2 setting in
which a pollster declares that candidate A has a 0.9 probability of winning
the upcoming election. This routine election calculation illustrates cracks
within the frequentist interpretation. Since the election is a one-time
event, the long-run relative frequency concept of observing the election
over and over simply doesn't apply. A very strict frequentist interpretation
might even conclude the pollster is just wrong. Since the candidate will
either win or lose, their win probability must be either 1 or 0. A less
extreme frequentist interpretation, though a bit awkward, is more
reasonable: in long-run hypothetical repetitions of the election, i.e.,
elections with similar circumstances, candidate A would win roughly 90%
of the time.
The election example is not rare. It's often the case that an event of
interest is unrepeatable. Whether or not a politician wins an election,
whether or not it rains tomorrow, and whether or not humans will live on
Mars are all one-time events. Whereas the frequentist interpretation of
probability can be awkward in these one-time settings, the more flexible
Bayesian interpretation provides a path by which to express the
uncertainty of these events. For example, a Bayesian would interpret “a
0.9 probability of winning” to mean that, based on election models, the
relative plausibility of winning is high – the candidate is 9 times more
likely to win than to lose.
FIGURE 1.4: Bayesian analyses balance our prior experiences with new
data. Depending upon the setting, the prior is given more weight than the
data (left), the prior and data are given equal weight (middle), or the prior
is given less weight than the data (right).
_________________________
2 If you have experience with frequentist statistics, you might be skeptical that these methods
would produce such a silly conclusion. Yet in the frequentist null hypothesis significance testing
framework, the hypothesis being tested in both Zuofu's and Kavya's settings is that their success
rate exceeds 50%. Since their “10 out of 10” data is the same, the corresponding p-values (≈ 0.001) and resulting hypothesis test conclusions are also the same.
Allowing the posterior to balance out the prior and data is critical to the
Bayesian knowledge-building process. When we have little data, our
posterior can draw upon the power in our prior knowledge. As we collect
more data, the prior loses its influence. Whether in science, policy-
making, or life, this is how people tend to think (El-Gamal and Grether,
1995) and how progress is made. As they collect more and more data, two
scientists will come to agreement on the human role in climate change, no
matter their prior training and experience.3 As they read more and more
pages of this book, two readers will come to agreement on the power of
Bayesian statistics. This logical and heartening idea is illustrated by
Figure 1.5.
Asking questions
A Bayesian hypothesis test seeks to answer: In light of the
observed data, what's the chance that the hypothesis is correct?
A frequentist hypothesis test seeks to answer: If in fact the
hypothesis is incorrect, what's the chance I'd have observed this, or
even more extreme, data?
_________________________
3 There is an extreme exception to this rule. If someone assigns 0 prior weight to a given
scenario, then no amount of data will change their mind. We explore this specific situation in
Chapter 4.
_________________________
6 http://pi.math.cornell.edu/~numb3rs/lipa/Episodes/
7 https://priceonomics.com/the-time-everyone-corrected-the-worlds-
smartest
8 https://bayesian.org/chapters/australasian-chapter; or brazil; or chile; or
east-asia; or india; or south-africa
Motivating question
How can we incorporate Bayesian thinking into a formal model of
some variable of interest, Y?
Unit 1 develops the foundation upon which to build our Bayesian analyses.
You will explore the heart of every Bayesian model, Bayes' Rule, and put
Bayes' Rule into action to build a few introductory but fundamental
Bayesian models. These unique models are tied together in the broader
conjugate family. Further, they are tailored toward variables Y of differing
structures, and thus apply our Bayesian thinking in a wide variety of
scenarios.
To begin, the Beta-Binomial model can help us determine the probability
that it rains tomorrow in Australia using data on binary categorical
variable Y, whether or not it rains for each of 1000 sampled days (Figure
1.7 (a)). The Gamma-Poisson model can help us explore the rate of bald
eagle sightings in Ontario, Canada using data on variable Y, the counts of
eagles seen in each of 37 one-week observation periods (Figure 1.7 (b)).
Finally, the Normal-Normal model can provide insight into the average 3
p.m. temperature in Australia using data on the bell-shaped variable Y,
temperatures on a sample of study days (Figure 1.7 (c)).9
FIGURE 1.7: (a) Binomial output for the rain status of 1000 sampled
days in Australia; (b) Poisson counts of bald eagles observed in 37 one-
week observation periods; (c) Normally distributed 3 p.m. temperatures
(in degrees Celsius) on 200 days in Australia.
_________________________
9 These plots use the weather_perth, bald_eagles, and weather_australia data in
the bayesrules package. The weather datasets are subsets of the weatherAUS data in the
rattle package. The eagles data was made available by Birds Canada (2018) and distributed
by R for Data Science (2018).
Motivating questions
When our Bayesian models of Y become too complicated to
mathematically specify, how can we approximate them? And once
we've either specified or approximated a model, how can we make
meaning of and draw formal conclusions from it?
Motivating question
We're not always interested in the lone behavior of variable Y. Rather,
we might want to understand the relationship between Y and a set of
p potential predictor variables (X1, X2, …, Xp). How do we build a Bayesian model of this relationship?
Unit 3 is where things really keep staying fun. Prior to Unit 3, our
motivating research questions all focus on a single variable Y. For
example, in the Normal-Normal scenario we were interested in exploring
Y, 3 p.m. temperatures in Australia. Yet once we have a grip on this
response variable Y, we often have follow-up questions: Can we model
and predict 3 p.m. temperatures based on 9 a.m. temperatures (X1) and
precise location (X2)? To this end, in Unit 3 we will survey Bayesian
modeling tools that conventionally fall into two categories:
Let's connect these terms with our three examples from Section 1.3.1.
First, in the Australian temperature example, our sample data indicates
that temperatures tend to be warmer in Wollongong and that the warmer it
is at 9 a.m., the warmer it tends to be at 3 p.m. (Figure 1.8 (c)). Since the 3
p.m. temperature response variable is quantitative, modeling this
relationship is a regression task. In fact, we can generalize the Unit 1
Normal-Normal model for the behavior in Y alone to build a Normal
regression model of the relationship between Y and predictors X1 and X2.
Similarly, we can extend our Unit 1 Gamma-Poisson analysis of the
quantitative bird counts (Y) into a Poisson regression model that describes
how these counts have increased over time (X1) (Figure 1.8 (b)).
FIGURE 1.8: (a) Tomorrow's rain vs today's humidity; (b) the number of
bald eagles over time; (c) 3 p.m. vs 9 a.m. temperatures in two different
Australian cities.
Motivating question
Help! What if the structure of our data violates the assumptions of
independence behind the Unit 3 regression and classification models?
Specifically, suppose we have multiple observations per random
“group” in our dataset. How do we tweak our Bayesian models to not
only acknowledge, but harness, this structure?
The regression and classification models in Unit 3 operate under the
assumption of independence. That is, they assume that our data on the
response and predictor variables (Y, X1, X2, …, Xp) is a random sample
– the observed values for any one subject in the sample are independent of
those for any other subject. The structure of independent data is
represented by the data table below:
observation    y    x
1              …    …
2              …    …
3              …    …

In contrast, grouped data, with multiple observations per random group, has the following structure:

group    y    x
A        …    …
A        …    …
B        …    …
B        …    …
B        …    …
1.5 Exercises
In these rst exercises, we hope that you make and learn from some
mistakes as you incorporate the ideas you learned in this chapter into your
way of thinking. Ultimately, we hope that you attain a greater
understanding of these ideas than you would have had if you had never
made a mistake at all.
Exercise 1.1 (Bayesian Chocolate Milk). In the fourth episode of the sixth
season of the television show Parks and Recreation, Deputy Director of the
Pawnee Parks and Rec department, Leslie Knope, is being subjected to an
inquiry by Pawnee City Council member Jeremy Jamm due to an
inappropriate tweet from the official Parks and Rec Twitter account. The
following exchange between Jamm and Knope is an example of Bayesian
thinking:
JJ: “When this sick depraved tweet first came to light, you said ‘the
account was probably hacked by some bored teenager’. Now you're saying
it is an unfortunate mistake. Why do you keep flip-flopping?”
LK: “Well because I learned new information. When I was four, I thought
chocolate milk came from brown cows. And then I ‘flip-flopped’ when I
found that there was something called chocolate syrup.”
JJ: “I don't think I'm out of line when I say this scandal makes Benghazi
look like Whitewater.”
Exercise 1.3 (When was the last time you changed your mind?). Think of
a recent situation in which you changed your mind. As with the Italian
restaurant example (Figure 1.1), make a diagram that includes your prior
information, your new data that helped you change your mind, and your
posterior conclusion.
Exercise 1.4 (When was the last time you changed someone else's mind?).
Think of a recent situation in which you had a conversation in which you
changed someone else's mind. As with the Italian restaurant example
(Figure 1.1), make a diagram that includes the prior information, the new
data that helped you change their mind, and the posterior conclusion.
Exercise 1.5 (Changing views on Bayes). When one of the book authors
started their master's degree in biostatistics, they had never used Bayesian
statistics before, and thus felt neutral about the topic. In their first
semester, they used Bayes to learn about diagnostic tests for different
diseases, saw how important Bayes was, and became very interested in the
topic. In their second semester, their mathematical statistics course
included a Bayesian exercise involving ant eggs which both disgusted
them and felt unnecessarily difficult – they became disinterested in
Bayesian statistics. In the first semester of their Biostatistics doctoral
program, they took a required Bayes class with an excellent professor, and
became exceptionally interested in the topic. Draw a Bayesian knowledge-
building diagram that represents the author's evolving opinion about
Bayesian statistics.
_________________________
10 https://twitter.com/frenchpressplz/status/1266424143207034880
Exercise 1.6 (Applying for an internship). There are several data scientist
openings at a much-ballyhooed company. Having read the job description,
you know for a fact that you are qualified for the position: this is your
data. Your goal is to ascertain whether you will actually be offered a
position: this is your hypothesis.
DOI: 10.1201/9780429288340-2
The Collins Dictionary named “fake news” the 2017 term of the year. And for
good reason. Fake, misleading, and biased news has proliferated along with online
news and social media platforms which allow users to post articles with little
quality control. It's then increasingly important to help readers flag articles as
“real” or “fake.” In Chapter 2 you'll explore how the Bayesian philosophy from
Chapter 1 can help us make this distinction. To this end, we'll examine a sample of
150 articles which were posted on Facebook and fact checked by five BuzzFeed
journalists (Shu et al., 2017). Information about each article is stored in the
fake_news dataset in the bayesrules package. To learn more about this dataset,
type ?fake_news in your console.
Warning
The fake_news data contains the full text for actual news articles, both real
and fake. As such, some of these articles contain disturbing language or
topics. Though we believe it's important to provide our original resources,
not metadata, you do not need to read the articles in order to do the analysis
ahead.
The table below, constructed using the tabyl() function in the janitor package
(Firke, 2021), illustrates that 40% of the articles in this particular collection are
fake and 60% are real:
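A minimal sketch of this tabulation, assuming the article status is stored in a column named type:

```r
# Tabulate the proportion of fake vs real articles in the fake_news data
library(bayesrules)
library(dplyr)
library(janitor)
data(fake_news)

fake_news %>% 
  tabyl(type)
```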
Using this information alone, we could build a very simple news filter which uses
the following rule: since most articles are real, we should read and believe all
articles. This filter would certainly solve the problem of mistakenly disregarding
real articles, but at the cost of reading lots of fake news. It also only takes into
account the overall rates of, not the typical features of, real and fake news. For
example, suppose that the most recent article posted to a social media platform is
titled: “The president has a funny secret!” Some features of this title probably set
off some red flags. For example, the usage of an exclamation point might seem
like an odd choice for a real news article. Our data backs up this instinct – in our
article collection, 26.67% (16 of 60) of fake news titles but only 2.22% (2 of 90)
of real news titles use an exclamation point:
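A minimal sketch of this breakdown, assuming the exclamation point indicator is stored in a column named title_has_excl:

```r
# Break down exclamation point usage within fake vs real articles,
# converting counts to column proportions
fake_news %>% 
  tabyl(title_has_excl, type) %>% 
  adorn_percentages("col")
```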
Put your own Bayesian thinking to use in a quick self-quiz of your current
intuition about whether the most recent article is fake.
Quiz Yourself!
What best describes your updated, posterior understanding about the article?
a. The chance that this article is fake drops from 40% to 20%. The
exclamation point in the title might simply reflect the author's
enthusiasm.
b. The chance that this article is fake jumps from 40% to roughly 90%.
Though exclamation points are more common among fake articles,
let's not forget that only 40% of articles are fake.
c. The chance that this article is fake jumps from 40% to roughly 98%.
Given that so few real articles use exclamation points, this article is
most certainly fake.
The correct answer is given in the footnote below.1 But if your intuition was
incorrect, don't fret. By the end of Chapter 2, you will have learned how to support
Bayesian thinking with rigorous Bayesian calculations using Bayes' Rule, the
aptly named foundation of Bayesian statistics. And heads up: of any other chapter
in this book, Chapter 2 introduces the most Bayesian concepts, notation, and
vocabulary. No matter your level of previous probability experience, you'll want
to take this chapter slowly. Further, our treatment focuses on the probability tools
that are necessary to Bayesian analyses. For a broader probability introduction, we
recommend that the interested reader visit Chapters 1 through 3 and Section 7.1 of
Blitzstein and Hwang (2019).
Goals
Explore foundational probability tools such as marginal, conditional, and
joint probability models and the Binomial model.
Conduct your rst formal Bayesian analysis! You will construct your rst
prior and data models and, from these, construct your rst posterior
models via Bayes' Rule.
Practice your Bayesian grammar. Imagnie how difficult it would be to
reed this bok if the authers didnt spellcheck or use proper grammar and!
punctuation. In this spirit, you'll practice the formal notation and
terminology central to Bayesian grammar.
Simulate Bayesian models. Simulation is integral to building intuition for
and supporting Bayesian analyses. You'll conduct your first simulation,
using the R statistical software, in this chapter.
P(B) = 0.40 and P(Bᶜ) = 0.60.

As a collection, P(B) and P(Bᶜ) specify the simple prior model of fake news in
Table 2.1. As a valid probability model must: (1) it accounts for all possible
events (all articles must be fake or real); (2) it assigns prior probabilities to each
event; and (3) these probabilities sum to one.
_________________________
2 We can't cite any rigorous research article here, but imagine what orchestras would sound like if this
weren't true.
Conversely, the certainty of an event might decrease in light of new data. For
example, if you're a fastidious hand washer, then you're less likely to get the flu:
P(flu | wash hands) < P(flu).
The order of conditioning is also important. Since they measure two different
phenomena, it's typically the case that P (A|B) ≠ P (B|A). For instance, roughly
100% of puppies are adorable. Thus, if the next object you pass on the street is a
puppy, P (adorable | puppy) = 1. However, the reverse is not true. Not every
adorable object is a puppy, thus P (puppy | adorable) < 1.
Finally, information about B doesn't always change our understanding of A. For
example, suppose your friend has a yellow pair of shoes and a blue pair of shoes,
thus four shoes in total. They choose a shoe at random and don't show it to you.
Without actually seeing the shoe, there's a 0.5 probability that it goes on the right
foot: P(right foot) = 2/4. And even if they tell you that they happened to get one
of the two yellow shoes, there's still a 0.5 probability that it goes on the right foot:
P(right foot | yellow) = 1/2. That is, information about the shoe's color tells us
nothing about which foot it fits – shoe color and foot are independent.
Independent events
Two events A and B are independent if and only if the occurrence of B
doesn't tell us anything about the occurrence of A:
P (A|B) = P (A).
Let's reexamine our fake news example with these conditional concepts in place.
The conditional probabilities we derived above, P(A|B) = 0.2667 and
P(A|Bᶜ) = 0.0222, indicate that a whopping 26.67% of fake articles versus a
mere 2.22% of real articles use exclamation points. Since exclamation point usage
is so much more likely among fake news than real news, this data provides some
evidence that the article is fake. We should congratulate ourselves on this
observation – we've evaluated the exclamation point data by flipping the
conditional probabilities P(A|B) and P(A|Bᶜ) on their heads. For example, on
its face, the conditional probability P(A|B) measures the uncertainty in event A
given we know event B occurs. However, we find ourselves in the opposite
situation. We know that the incoming article used exclamation points, A. What we
don't know is whether or not the article is fake, B or Bᶜ. Thus, in this case, we
compared P(A|B) and P(A|Bᶜ) to ascertain the relative likelihoods of observing
data A under different scenarios of the uncertain article status. To help distinguish
this application of conditional probability calculations from that when A is
uncertain and B is known, we'll utilize the following likelihood function notation
L(⋅|A):

L(B|A) = P(A|B) and L(Bᶜ|A) = P(A|Bᶜ).
We present a general definition below, but be patient with yourself here. The
distinction is subtle, especially since people use the terms “likelihood” and
“probability” interchangeably in casual conversation.
Probability vs likelihood
When B is known, the conditional probability function P(⋅|B) allows us to
compare the probabilities of an unknown event, A or Aᶜ, occurring with B:
P(A|B) vs P(Aᶜ|B). When A is known, the likelihood function L(⋅|A) = P(A|⋅)
allows us to compare the relative compatibility of data A with events B or Bᶜ:
L(B|A) vs L(Bᶜ|A).
Table 2.2 summarizes the information that we've amassed thus far, including the
prior probabilities and likelihoods associated with the new article being fake or
real, B or Bc. Notice that the prior probabilities add up to 1 but the likelihoods do
not. Again, the likelihood function is not a probability function, but rather
provides a framework to compare the relative compatibility of our exclamation
point data with B and Bc. Thus, whereas the prior evidence suggested the article is
most likely real (P(B) < P(Bᶜ)), the data is more consistent with the article
being fake (L(B|A) > L(Bᶜ|A)).
B Bc Total
A
Ac
Total 0.4 0.6 1
First, focus on the B column which splits fake articles into two groups: (1) those
that are fake and use exclamation points, denoted A ∩ B; and (2) those that are
fake and don't use exclamation points, denoted Aᶜ ∩ B.4 To determine the
probabilities of these joint events, note that 40% of articles are fake and 26.67%
of fake articles use exclamation points, P (B) = 0.4 and P (A|B) = 0.2667. It
follows that across all articles, 26.67% of 40%, or 10.67%, are fake with
exclamation points. That is, the joint probability of observing both A and B is

P(A ∩ B) = P(A|B)P(B) = 0.2667 ⋅ 0.4 = 0.1067.
_________________________
3 This term is mysterious now, but will make sense by the end of this chapter.
4 We read “∩” as “and” or the “intersection” of two events.
It follows that 73.33% of 40%, or 29.33%, of all articles are fake without
exclamation points:
P(Aᶜ ∩ B) = P(Aᶜ|B)P(B) = 0.7333 ⋅ 0.4 = 0.2933.
In summary, the total probability of observing a fake article is the sum of its
parts:
P(B) = P(A ∩ B) + P(Aᶜ ∩ B) = 0.1067 + 0.2933 = 0.4.
We can similarly break down real articles into those that do and those that don't
use exclamation points. Across all articles, only 1.33% (2.22% of 60%) are real
and use exclamation points whereas 58.67% (97.78% of 60%) are real without
exclamation points:
P(A ∩ Bᶜ) = P(A|Bᶜ)P(Bᶜ) = 0.0222 ⋅ 0.6 = 0.0133
P(Aᶜ ∩ Bᶜ) = P(Aᶜ|Bᶜ)P(Bᶜ) = 0.9778 ⋅ 0.6 = 0.5867.

Thus, the total probability of observing a real article is the sum of these two parts:

P(Bᶜ) = P(A ∩ Bᶜ) + P(Aᶜ ∩ Bᶜ) = 0.0133 + 0.5867 = 0.6.
In general, for events A and B, the joint probability of A ∩ B is

P(A ∩ B) = P(A|B)P(B).   (2.1)

When A and B are independent, this simplifies to P(A ∩ B) = P(A)P(B). Further, dividing both sides of (2.1) by P(B), assuming P(B) ≠ 0, reveals the definition of the conditional probability of A given B:

P(A|B) = P(A ∩ B) / P(B).   (2.2)

Table 2.3 summarizes our new understanding of the joint behavior of our two article variables. The fact that the grand total of this table is one confirms that our calculations are reasonable. Table 2.3 also provides the point of comparison we sought: 12% of all news articles use exclamation points, P(A) = 0.12. So that we needn't always build similar marginal probabilities from scratch, let's consider the theory behind this calculation. As usual, we can start by recognizing the two ways that an article can use exclamation points: if it is fake (A ∩ B) and if it is not fake (A ∩ Bᶜ). Thus, the total probability of observing A is the combined probability of its parts:

P(A) = P(A ∩ B) + P(A ∩ Bᶜ).   (2.3)

By (2.1), we can compute the two pieces of this puzzle using the information we have about exclamation point usage among fake and real news, P(A|B) and P(A|Bᶜ), weighted by the prior probabilities of fake and real news, P(B) and P(Bᶜ):

P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).

TABLE 2.3: The joint probability model of exclamation point usage (A vs Aᶜ) and article status (B vs Bᶜ).

         B        Bᶜ       Total
A        0.1067   0.0133   0.12
Aᶜ       0.2933   0.5867   0.88
Total    0.4      0.6      1

Finally, plugging in, we can confirm that roughly 12% of all articles use exclamation points: P(A) = 0.2667 ⋅ 0.4 + 0.0222 ⋅ 0.6 = 0.12. The formula we've built to calculate P(A) here is a special case of the aptly named Law of Total Probability (LTP).

Putting these pieces together, Bayes' Rule expresses the posterior probability of B given the observed data A as

P(B|A) = P(A ∩ B) / P(A) = P(B)L(B|A) / P(A),   (2.4)

where, by the Law of Total Probability,

P(A) = P(B)L(B|A) + P(Bᶜ)L(Bᶜ|A).   (2.5)
More generally,

posterior = (prior ⋅ likelihood) / normalizing constant.
To convince ourselves that Bayes' Rule works, let's directly apply it to our news
analysis. Into (2.4), we can plug the prior information that 40% of articles are
fake, the 26.67% likelihood that a fake article would use exclamation points, and
the 12% marginal probability of observing exclamation points across all articles.
The resulting posterior probability that the incoming article is fake is roughly
0.889, just as we calculated from Table 2.3:
P(B|A) = P(B)L(B|A) / P(A) = (0.4 ⋅ 0.2667) / 0.12 = 0.889.
Table 2.4 summarizes our news analysis journey, from the prior to the posterior
model. We started with a prior understanding that there's only a 40% chance that
the incoming article would be fake. Yet upon observing the use of an exclamation
point in the title “The president has a funny secret!”, a feature that's more
common to fake news, our posterior understanding evolved quite a bit – the
chance that the article is fake jumped to 88.9%.
To simulate the articles that might be posted to your social media, we can use the
sample_n() function in the dplyr package (Wickham et al., 2021) to randomly
sample rows from the article data frame. In doing so, we must specify the
sample size and that the sample should be taken with replacement (replace =
TRUE). Sampling with replacement ensures that we start with a fresh set of
possibilities for each article – any article can either be fake or real. Finally, we set
weight = prior to specify that there's a 60% chance an article is real and a
40% chance it's fake. To try this out, run the following code multiple times, each
time simulating three articles.
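A minimal sketch of this simulation step, in which the article data frame and prior vector defined below are assumptions consistent with the description:

```r
library(dplyr)

# A data frame of the two possible article types and their prior probabilities
article <- data.frame(type = c("real", "fake"))
prior <- c(0.6, 0.4)

# Simulate three articles, sampling rows with replacement and weighting by the prior
sample_n(article, size = 3, weight = prior, replace = TRUE)
```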
Notice that you can get different results every time you run this code. That's
because simulation, like articles, is random. Specifically, behind the R curtain is a
random number generator (RNG) that's in charge of producing random samples.
Every time we ask for a new sample, the RNG “starts” at a new place: the random
seed. Starting at different seeds can thus produce different samples. This is a great
thing in general – random samples should be random. However, within a single
analysis, we want to be able to reproduce our random simulation results, i.e., we
don't want the fine points of our results to change every time we re-run our code.
To achieve this reproducibility, we can specify or set the seed by applying the
set.seed() function to a positive integer (here 84735). Run the below code a
few times and notice that the results are always the same – the first two articles
are fake and the third is real:5
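A minimal sketch of the seeded simulation, assuming the article and prior objects from above:

```r
# Set the random number generator seed so results are reproducible
set.seed(84735)
sample_n(article, size = 3, weight = prior, replace = TRUE)
```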
Warning
We'll use set.seed() throughout the book so that readers can reproduce
and follow our work. But it's important to remember that these results are
still random. Reflecting the potential error and variability in simulation,
different seeds would typically give different numerical results though
similar conclusions.
_________________________
5 If you get different random samples than those printed here, it likely means that you are using a
different version of R.
Now that we understand how to simulate a few articles, let's dream bigger:
simulate 10,000 articles and store the results in article_sim.
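A minimal sketch of this larger simulation, again assuming the article and prior objects from above:

```r
# Simulate 10,000 articles and store them in article_sim
set.seed(84735)
article_sim <- sample_n(article, size = 10000, weight = prior, replace = TRUE)
```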
The composition of the 10,000 simulated articles is summarized by the bar plot
below, constructed using the ggplot() function in the ggplot2 package
(Wickham, 2016):
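A minimal sketch of such a plot, assuming the simulated types are stored in the type column of article_sim:

```r
# Bar plot of the simulated fake vs real article counts
library(ggplot2)
ggplot(article_sim, aes(x = type)) + 
  geom_bar()
```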
The table below provides a more thorough summary. Reflecting the model from
which these 10,000 articles were generated, roughly (but not exactly) 40% are
fake:
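A minimal sketch of one way to build this summary:

```r
# Tabulate the simulated article types as counts and proportions
article_sim %>% 
  count(type) %>% 
  mutate(proportion = n / sum(n))
```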
Next, let's simulate the exclamation point usage among these 10,000 articles. The
data_model variable specifies that there's a 26.67% chance that any fake article
and a 2.22% chance that any real article uses exclamation points:
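A minimal sketch of how this variable might be defined:

```r
# Attach to each simulated article its probability of using exclamation points
article_sim <- article_sim %>% 
  mutate(data_model = case_when(type == "fake" ~ 0.2667,
                                type == "real" ~ 0.0222))
```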
From this data_model, we can simulate whether each article includes an
exclamation point. This syntax is a bit more complicated. First, the group_by()
statement specifies that the exclamation point simulation is to be performed
separately for each of the 10,000 articles. Second, we use sample() to simulate
the exclamation point data, no or yes, based on the data_model and store the
results as usage. Note that sample() is similar to sample_n() but samples
values from vectors instead of rows from data frames.
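A minimal sketch of this article-by-article simulation, consistent with the description above:

```r
# Possible exclamation point outcomes
data <- c("no", "yes")

# Simulate usage separately for each article, with probability data_model of "yes"
set.seed(84735)
article_sim <- article_sim %>% 
  group_by(1:n()) %>% 
  mutate(usage = sample(data, size = 1,
                        prob = c(1 - data_model, data_model)))
```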
The article_sim data frame now contains 10,000 simulated articles with
different features, summarized in the table below. The patterns here reflect the
underlying likelihoods that roughly 28% (1070 / 4031) of fake articles and 2%
(136 / 5969) of real articles use exclamation points.
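A minimal sketch of one way to build this summary:

```r
# Count exclamation point usage within fake vs real articles
article_sim %>% 
  ungroup() %>% 
  count(type, usage)
```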
Figure 2.3 provides a visual summary of these article characteristics. Whereas the
left plot reflects the relative breakdown of exclamation point usage among real
and fake news, the right plot frames this information within the normalizing
context that only roughly 12% (1206 / 10000) of all articles use exclamation
points.
FIGURE 2.2: A bar plot of the fake vs real status of 10,000 simulated articles.
FIGURE 2.3: Bar plots of exclamation point usage, both within fake vs real news
and overall.
Our 10,000 simulated articles now reflect the prior model of fake news, as well as
the likelihood of exclamation point usage among fake vs real news. In turn, we
can use them to approximate the posterior probability that the latest article is
fake. To this end, we can filter out the simulated articles that match our data (i.e.,
those that use exclamation points) and examine the percentage of articles that are
fake:
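A minimal sketch of this filtering step:

```r
# Keep only the simulated articles that use exclamation points, then
# approximate the posterior probability that such an article is fake
article_sim %>% 
  ungroup() %>% 
  filter(usage == "yes") %>% 
  count(type) %>% 
  mutate(proportion = n / sum(n))
```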
Among the 1206 simulated articles that use exclamation points, roughly 88.7%
are fake. This approximation is quite close to the actual posterior probability of
0.889. Of course, our posterior assessment of this article would change if we had
seen different data, i.e., if the title didn't have exclamation points. Figure 2.4
reveals a simple rule: If an article uses exclamation points, it's most likely fake.
Otherwise, it's most likely real (and we should read it!). NOTE: The same rule
does not apply to this real book in which the liberal use of exclamation points
simply conveys our enthusiasm!
FIGURE 2.4: Bar plots of real vs fake news, broken down by exclamation point
usage.
P (S) = 0.38.
But then, you see the person point to a fizzy cola drink and say “please pass my
pop.” Though the country is united in its love of fizzy drinks, it's divided in what
they're called, with common regional terms including “pop,” “soda,” and “coke.”
This data, i.e., the person's use of “pop,” provides further information about where
they might live. To evaluate this data, we can examine the pop_vs_soda dataset
in the bayesrules package (Dogucu et al., 2021) which includes 374250 responses
to a volunteer survey conducted at popvssoda.com. To learn more about this
dataset, type ?pop_vs_soda in your console. Though the survey participants
aren't directly representative of the regional populations (Table 2.5), we can use
their responses to approximate the likelihood of people using the word pop in
each region:
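A minimal sketch of this likelihood approximation, assuming the survey responses include a pop indicator and a region column:

```r
# Approximate the likelihood of saying "pop" within each region
library(bayesrules)
library(dplyr)
library(janitor)
data(pop_vs_soda)

pop_vs_soda %>% 
  tabyl(pop, region) %>% 
  adorn_percentages("col")
```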
Letting A denote the event that a person uses the word “pop,” we'll thus assume
the following regional likelihoods:
L(M|A) = 0.6447, L(N|A) = 0.2734, L(S|A) = 0.0792, L(W|A) = 0.2943.
_________________________
6 https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
7 https://www.census.gov/popclock/data_tables.php?component=growth
For example, 64.47% of people in the Midwest but only 7.92% of people in the
South use the term “pop.” Comparatively then, the “pop” data is most likely if the
interviewee lives in the Midwest and least likely if they live in the South, with the
West and Northeast being in between these two extremes:
L(M|A) > L(W|A) > L(N|A) > L(S|A).
Weighing the prior information about regional populations with the data that the
interviewee used the word “pop,” what are we to think now? For example,
considering the fact that 38% of people live in the South but that “pop” is
relatively rare to that region, what's the posterior probability that the interviewee
lives in the South? Per Bayes' Rule (2.4), we can calculate this probability by
P(S|A) = P(S)L(S|A) / P(A).   (2.6)
We already have two of the three necessary pieces of the puzzle, the prior
probability P (S) and likelihood L(S|A). Consider the third, the marginal
probability that a person uses the term “pop” across the entire U.S., P (A). By
extending the Law of Total Probability (2.5), we can calculate P (A) by combining
the likelihoods of using “pop” in each region, while accounting for the regional
populations. Accordingly, there's a 28.26% chance that a person in the U.S. uses
the word “pop”:
P(A) = L(M|A)P(M) + L(N|A)P(N) + L(S|A)P(S) + L(W|A)P(W) ≈ 0.2826.
Then plugging into (2.6), there's a roughly 10.65% posterior chance that the
interviewee lives in the South:
P(S|A) = (0.38 ⋅ 0.0792) / 0.2826 ≈ 0.1065.
The first thing you might notice is that this model greatly simplifies reality.9
Though Kasparov's win probability π can technically be any number from zero to
one, this prior assumes that π has a discrete set of possibilities: Kasparov's win
probability is either 20%, 50%, or 80%. Next, examine the probability mass
function (pmf) f(⋅) which specifies the prior probability of each possible π value.
This pmf reflects the prior understanding that Kasparov learned from the 1996
match-up, and so will most likely improve in 1997. Specifically, this pmf places a
65% chance on Kasparov's win probability jumping to π = 0.8 and only a 10%
chance on his win probability dropping to π = 0.2, i.e., f (π = 0.8) = 0.65 and
f (π = 0.2) = 0.10.
f(y) = P(Y = y), where ∑ f(y) = 1 across all possible values y.
_________________________
8 Greek letters are conventionally used to denote our primary quantitative variables of interest.
9 As we keep progressing with Bayes, we'll get the chance to make our models more nuanced and
realistic.
f(y|π) = P(Y = y|π), where ∑ f(y|π) = 1 across all possible values y.

In the Binomial model, we write Y|π ~ Bin(n, π) with conditional pmf

f(y|π) = (n choose y) π^y (1 − π)^(n−y), where (n choose y) = n! / (y!(n−y)!).   (2.7)

In our chess setting, with n = 6 attempts,

Y|π ~ Bin(6, π)

and

f(y|π) = (6 choose y) π^y (1 − π)^(6−y)  for y ∈ {0, 1, 2, 3, 4, 5, 6}.   (2.8)

_________________________
10 Capital letters toward the end of the alphabet (e.g., X, Y, Z) are conventionally used to denote
random variables related to our data.

This pmf summarizes the conditional probability of observing any number of wins
Y = y for any given win probability π. For example, if Kasparov's underlying
chance of beating Deep Blue were π = 0.8, then there's a roughly 26% chance
he'd win all six games:

f(y = 6|π = 0.8) = (6 choose 6) 0.8^6 (1 − 0.8)^(6−6) = 1 ⋅ 0.8^6 ⋅ 1 ≈ 0.26,

but only a minuscule chance that he'd win zero games:

f(y = 0|π = 0.8) = (6 choose 0) 0.8^0 (1 − 0.8)^(6−0) = 1 ⋅ 1 ⋅ 0.2^6 ≈ 0.000064.
Figure 2.5 plots the conditional pmfs f(y|π), and thus the random outcomes of Y,
under each possible value of Kasparov's win probability π. These plots con rm
our intuition that Kasparov's victories Y would tend to be low if Kasparov's win
probability π were low (far left) and high if π were high (far right).
FIGURE 2.5: The pmf of a Bin(6, π) model is plotted for each possible value of
π ∈ {0.2, 0.5, 0.8}. The masses marked by the black lines correspond to the
eventual observed data, Y = 1 win.
Just as the likelihood in our fake news example was obtained by flipping a
conditional probability on its head, the formula for the likelihood function follows
from evaluating the conditional pmf f (y|π) in (2.8) at the observed data Y = 1.
For π ∈ {0.2, 0.5, 0.8},
L(π|y = 1) = f(y = 1|π) = (6 choose 1) π^1 (1 − π)^(6−1) = 6π(1 − π)^5.

Table 2.8 summarizes the likelihood function evaluated at each possible value of
π. For example, there's a low 0.0015 likelihood of Kasparov winning just one
game if he were the superior player, i.e., π = 0.8:

L(π = 0.8|y = 1) = 6 ⋅ 0.8 ⋅ (1 − 0.8)^5 ≈ 0.0015.
There are some not-to-miss details here. First, though it is equivalent in formula
to the conditional pmf of Y, f (y = 1|π), we use the L(π|y = 1) notation to
reiterate that the likelihood is a function of the unknown win probability π given
the observed Y = 1 win data. In fact, the resulting likelihood formula depends
only upon π. Further, the likelihood function does not sum to one across π, and
thus is not a probability model. (Mental gymnastics!) Rather, it provides a
mechanism by which to compare the compatibility of the observed data Y = 1
with different π.
Putting this all together, the likelihood function summarized in Figure 2.6 and
Table 2.8 illustrates that Kasparov's one game win is most consistent with him
being the weaker player and least consistent with him being the better player:
L(π = 0.2|y = 1) > L(π = 0.5|y = 1) > L(π = 0.8|y = 1). In fact, it's nearly
impossible that Kasparov would have only won one game if his win probability
against Deep Blue were as high as π = 0.8: L(π = 0.8|y = 1) ≈ 0.
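As a quick sanity check, these likelihood values can be reproduced with R's built-in Binomial pmf:

```r
# L(pi | y = 1) = f(y = 1 | pi) for pi in {0.2, 0.5, 0.8}
dbinom(1, size = 6, prob = c(0.2, 0.5, 0.8))
```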
When π is known, the conditional pmf f(⋅|π) allows us to compare the probabilities of observing different possible data values, y1 or y2, under that π: f(y1|π) vs f(y2|π). When the data Y = y is known, the likelihood function L(⋅|y) = f(y|⋅) allows us to compare the relative likelihood of observing y under different possible values of π.
Thus, L(⋅|y) provides the tool we need to evaluate the relative compatibility
of data Y = y with various π values.
The normalizing constant here is the total probability of observing Y = 1 win across all possible win probabilities π,

f(y = 1) = ∑ f(π)L(π|y = 1), with the sum taken over π ∈ {0.2, 0.5, 0.8},

or, expanding the summation Σ and plugging in the prior probabilities and
likelihoods from Tables 2.7 and 2.8:

f(y = 1) = L(π = 0.2|y = 1)f(π = 0.2) + L(π = 0.5|y = 1)f(π = 0.5) + L(π = 0.8|y = 1)f(π = 0.8) ≈ 0.0637.   (2.9)
Thus, across all possible π, there's only a roughly 6% chance that Kasparov would
have won only one game. It would, of course, be great if this all clicked. But if it
doesn't, don't let this calculation discourage you from moving forward. We'll learn
a magical shortcut in Section 2.3.6 that allows us to bypass this calculation.
FIGURE 2.7: The prior (left), likelihood (middle), and posterior (right) models
of π. The y-axis scales are omitted for ease of comparison.
The posterior model plotted in Figure 2.7 is specified by the posterior pmf
f (π|y = 1).
Conceptually, f (π|y = 1) is the posterior probability of some win probability π
given that Kasparov only won one of six games against Deep Blue. Thus, defining
the posterior f (π|y = 1) isn't much different than it was in our previous examples.
Just as you might hope, Bayes' Rule still holds:
f(π|y = 1) = f(π)L(π|y = 1) / f(y = 1), that is, posterior = (prior ⋅ likelihood) / normalizing constant.   (2.10)

All that remains is a little “plug-and-chug”: the prior f(π) is defined by Table 2.7, the likelihood L(π|y = 1) by Table 2.8, and the normalizing constant f(y = 1) ≈ 0.0637 by (2.9). Plugging these in:

f(π = 0.2|y = 1) = (0.10 ⋅ 0.3932) / 0.0637 ≈ 0.617
f(π = 0.5|y = 1) = (0.25 ⋅ 0.0938) / 0.0637 ≈ 0.368
f(π = 0.8|y = 1) = (0.65 ⋅ 0.0015) / 0.0637 ≈ 0.015   (2.11)

This posterior probability model is summarized in Table 2.9 along with the prior probability model for comparison. These details confirm the trends in and intuition behind Figure 2.7. Mainly, though we were fairly confident that Kasparov's performance would have improved from 1996 to 1997, after winning only one game, the chances of Kasparov being the dominant player (π = 0.8) dropped from 0.65 to 0.015. In fact, the scenario with the greatest posterior support is that Kasparov is the weaker player, with a win probability of only 0.2.

TABLE 2.9: The prior and posterior models of π.

π             0.2     0.5     0.8     Total
f(π)          0.10    0.25    0.65    1
f(π|y = 1)    0.617   0.368   0.015   1
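As a quick sanity check, the posterior arithmetic above can be reproduced in a few lines of R (a minimal sketch):

```r
# Posterior of pi by hand: prior times likelihood, normalized to sum to one
prior      <- c(0.10, 0.25, 0.65)
likelihood <- dbinom(1, size = 6, prob = c(0.2, 0.5, 0.8))
posterior  <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)
```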
We close this section by generalizing the tools we built for the chess analysis. For a prior model f(π), likelihood function L(π|y), and normalizing constant f(y) = ∑ f(π)L(π|y) (with the sum taken over all π), the posterior model is

f(π|y) = f(π)L(π|y) / f(y), i.e., posterior = (prior ⋅ likelihood) / normalizing constant.   (2.12)

2.3.6 Posterior shortcut
We now make good on our promise that, moving forward, we needn't continue
calculating the normalizing constant. To begin, notice in (2.11) that
f(y = 1) = 0.0637 appears in the denominator of f(π|y = 1) for each
π ∈ {0.2, 0.5, 0.8}. This explains the term normalizing constant – its only purpose
is to normalize the posterior probabilities so that they sum to one. Even if we drop
this constant and calculate only the unnormalized products f(π)L(π|y = 1),
Figure 2.8 demonstrates that they preserve the proportional relationships of the
normalized posterior probabilities.
FIGURE 2.8: The normalized posterior pmf of π (left) and the unnormalized
posterior pmf of π (right) with different y-axis scales.
That is, we can always recover the posterior by dividing each unnormalized product by their sum:

f(π|y) = f(π)L(π|y) / f(y) = f(π)L(π|y) / ∑ f(π)L(π|y), where the sum in the denominator is taken over all π.   (2.13)
We state the general form of this proportionality result below and will get plenty
of practice with this concept in the coming chapters.
Proportionality

Since f(y) is merely a normalizing constant which does not depend on π, the
posterior pmf f(π|y) is proportional to the product of f(π) and L(π|y):

f(π|y) = f(π)L(π|y) / f(y) ∝ f(π)L(π|y).

That is,

posterior ∝ prior ⋅ likelihood.
The significance of this proportionality is that all the information we need to
build the posterior model is held in the prior and likelihood.
Next, simulate 10,000 possible outcomes of π from the prior model and store the
results in the chess_sim data frame.
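A minimal sketch of this simulation step, in which the chess data frame and prior vector defined below are assumptions consistent with the text:

```r
library(dplyr)

# Possible win probabilities and their prior weights
chess <- data.frame(pi = c(0.2, 0.5, 0.8))
prior <- c(0.10, 0.25, 0.65)

# Simulate 10,000 pi values from the prior model
set.seed(84735)
chess_sim <- sample_n(chess, size = 10000, weight = prior, replace = TRUE)
```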
From each of the 10,000 prior plausible values pi, we can simulate six games and
record Kasparov's number of wins, y. Since the dependence of y on pi follows a
Binomial model, we can directly simulate y using the rbinom() function with
size = 6 and prob = pi.
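A minimal sketch of this step:

```r
# For each simulated pi, simulate Kasparov's number of wins in six games
chess_sim <- chess_sim %>% 
  mutate(y = rbinom(n(), size = 6, prob = pi))
```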
The combined 10,000 simulated pi values closely approximate the prior model
f (π) (Table 2.7):
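A minimal sketch of one way to check this:

```r
# Compare the simulated pi proportions to the prior model
chess_sim %>% 
  count(pi) %>% 
  mutate(proportion = n / sum(n))
```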
FIGURE 2.9: A bar plot of simulated win outcomes y under each possible win
probability π.
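A minimal sketch of code that could produce a plot like Figure 2.9:

```r
# Simulated win counts y, faceted by the underlying win probability pi
library(ggplot2)
ggplot(chess_sim, aes(x = y)) + 
  geom_bar() + 
  facet_wrap(~ pi)
```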
Finally, let's focus on the simulated outcomes that match the observed data that
Kasparov won one game. Among these simulations, the majority (60.4%)
correspond to the scenario in which Kasparov's win probability π was 0.2 and very
few (1.8%) correspond to the scenario in which π was 0.8. These observations
very closely approximate the posterior model of π which we formally built above
(Table 2.9).
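A minimal sketch of this final filtering step:

```r
# Keep only the simulations in which Kasparov won exactly one game,
# then approximate the posterior model of pi
chess_sim %>% 
  filter(y == 1) %>% 
  count(pi) %>% 
  mutate(proportion = n / sum(n))
```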
FIGURE 2.10: A bar plot of 10,000 simulated π values which approximates the
posterior model.
2.4 Chapter summary
In Chapter 2, you learned Bayes' Rule and that Bayes Rules! Every Bayesian
analysis consists of four common steps.
1. Construct a prior model for your variable of interest, π. The prior model
specifies two important pieces of information: the possible values of π and
the relative prior plausibility of each.
2. Summarize the dependence of data Y on π via a conditional pmf f(y|π).
3. Upon observing data Y = y, define the likelihood function
L(π|y) = f(y|π) which encodes the relative likelihood of observing data
Y = y under different values of π.
4. Build the posterior model of π via Bayes' Rule which balances the prior
and likelihood:
posterior = (prior ⋅ likelihood) / normalizing constant ∝ prior ⋅ likelihood.

More technically,

f(π|y) = f(π)L(π|y) / f(y) ∝ f(π)L(π|y).
2.5 Exercises
2.5.1 Building up to Bayes' Rule
Exercise 2.1 (Comparing the prior and posterior). For each scenario below, you're
given a pair of events, A and B. Explain what you believe to be the relationship
between the posterior and prior probabilities of B: P (B|A) > P (B) or
P (B|A) < P (B).
a) 73% of people that drive 10 miles per hour above the speed limit get a
speeding ticket.
b) 20% of residents drive 10 miles per hour above the speed limit.
c) 15% of residents have used R.
d) 91% of statistics students at the local college have used R.
e) 38% of residents are Minnesotans that like the music of Prince.
f) 95% of the Minnesotan residents like the music of Prince.
Exercise 2.3 (Binomial practice). For each variable Y below, determine whether Y
is Binomial. If yes, use notation to specify this model and its parameters. If not,
explain why the Binomial model is not appropriate for Y.
Exercise 2.4 (Vampires?). Edward is trying to prove to Bella that vampires exist.
Bella thinks there is a 0.05 probability that vampires exist. She also believes that
the probability that someone can sparkle like a diamond if vampires exist is 0.7,
and the probability that someone can sparkle like a diamond if vampires don't
exist is 0.03. Edward then goes into a meadow and shows Bella that he can sparkle
like a diamond. Given that Edward sparkled like a diamond, what is the
probability that vampires exist?
Exercise 2.5 (Sick trees). A local arboretum contains a variety of tree species,
including elms, maples, and others. Unfortunately, 18% of all trees in the
arboretum are infected with mold. Among the infected trees, 15% are elms, 80%
are maples, and 5% are other species. Among the uninfected trees, 20% are elms,
10% are maples, and 70% are other species. In monitoring the spread of mold, an
arboretum employee randomly selects a tree to test.
a) What's the prior probability that the selected tree has mold?
b) The tree happens to be a maple. What's the probability that the employee
would have selected a maple?
c) What's the posterior probability that the selected maple tree has mold?
d) Compare the prior and posterior probability of the tree having mold. How
did your understanding change in light of the fact that the tree is a maple?
Exercise 2.6 (Restaurant ratings). The probability that Sandra will like a
restaurant is 0.7. Among the restaurants that she likes, 20% have ve stars on
Yelp, 50% have four stars, and 30% have fewer than four stars. What other
information do we need if we want to find the posterior probability that Sandra
likes a restaurant given that it has fewer than four stars on Yelp?
Exercise 2.7 (Dating app). Matt is on a dating app looking for love. Matt swipes
right on 8% of the profiles he views. Of the people that Matt swipes right on, 40%
are men, 30% are women, 20% are non-binary, and 10% identify in another way.
Of the people that Matt does not swipe right on, 45% are men, 40% are women,
10% are non-binary, and 5% identify in some other way.
a) What's the probability that a randomly chosen person on this dating app is
non-binary?
b) Given that Matt is looking at the profile of someone who is non-binary,
what's the posterior probability that he swipes right?
Exercise 2.8 (Flight delays). For a certain airline, 30% of the flights depart in the
morning, 30% depart in the afternoon, and 40% depart in the evening.
Frustratingly, 15% of all flights are delayed. Of the delayed flights, 40% are
morning flights, 50% are afternoon flights, and 10% are evening flights. Alicia and
Mine are taking separate flights to attend a conference.
a) Mine is on a morning flight. What's the probability that her flight will be
delayed?
b) Alicia's flight is not delayed. What's the probability that she's on a
morning flight?
Exercise 2.9 (Good mood, bad mood). Your roommate has two moods, good or
bad. In general, they're in a good mood 40% of the time. Yet you've noticed that
their moods are related to how many text messages they receive the day before. If
they're in a good mood today, there's a 5% chance they had 0 texts, an 84% chance
they had between 1 and 45 texts, and an 11% chance they had more than 45 texts
yesterday. If they're in a bad mood today, there's a 13% chance they had 0 texts, an
86% chance they had between 1 and 45 texts, and a 1% chance they had more than
45 texts yesterday.
b) Today's a new day. Without knowing anything about the previous day's
text messages, what's the probability that your roommate is in a good
mood? What part of the Bayes' Rule equation is this: the prior, likelihood,
normalizing constant, or posterior?
c) You surreptitiously took a peek at your roommate's phone (we are
attempting to withhold judgment of this dastardly maneuver) and see that
your roommate received 50 text messages yesterday. How likely are they
to have received this many texts if they're in a good mood today? What
part of the Bayes' Rule equation is this?
d) What is the posterior probability that your roommate is in a good mood
given that they received 50 text messages yesterday?
Exercise 2.10 (LGBTQ students: rural and urban). A recent study of 415,000
Californian public middle school and high school students found that 8.5% live in
rural areas and 91.5% in urban areas.11 Further, 10% of students in rural areas and
10.5% of students in urban areas identified as Lesbian, Gay, Bisexual,
Transgender, or Queer (LGBTQ). Consider one student from the study.
Exercise 2.11 (Internship). Muhammad applies for six equally competitive data
science internships. He has the following prior model for his chances of getting
into any given internship, π:
b) Muhammad got some pretty amazing news. He was offered four of the six
internships! How likely would this be if π = 0.3?
c) Construct the posterior model of π in light of Muhammad's internship
news.
_________________________
11 https://williamsinstitute.law.ucla.edu/wp-content/uploads/LGBTQ-Youth-
in-CA-Public-Schools.pdf
Exercise 2.12 (Making mugs). Miles is learning how to make a mug in his
ceramics class. A difficult part of the process is creating or “pulling” the handle.
His prior model of π, the probability that one of his handles will actually be good
enough for a mug, is below:
a) Miles has enough clay for 7 handles. Let Y be the number of handles that
will be good enough for a mug. Specify the model for the dependence of
Y on π and the corresponding pmf, f (y|π).
b) Miles pulls 7 handles and only 1 of them is good enough for a mug. What
is the posterior pmf of π, f (π|y = 1)?
c) Compare the posterior model to the prior model of π. How would you
characterize the differences between them?
d) Miles' instructor Kris had a different prior for his ability to pull a handle
(below). Find Kris's posterior f (π|y = 1) and compare it to Miles'.
Exercise 2.14 (Late bus). Li Qiang takes the 8:30am bus to work every morning.
If the bus is late, Li Qiang will be late to work. To learn about the probability that
her bus will be late (π), Li Qiang first surveys 20 other commuters: 3 think π is
0.15, 3 think π is 0.25, 8 think π is 0.5, 3 think π is 0.75, and 3 think π is 0.85.
Exercise 2.15 (Cuckoo birds). Cuckoo birds are brood parasites, meaning that
they lay their eggs in the nests of other birds (hosts), so that the host birds will
raise the cuckoo bird hatchlings. Lisa is an ornithologist studying the success rate,
π, of cuckoo bird hatchlings that survive at least one week. She is taking over the
project from a previous researcher who speculated in their notes the following
prior model for π:
a) If the previous researcher had been more sure that a hatchling would
survive, how would the prior model be different?
b) If the previous researcher had been less sure that a hatchling would
survive, how would the prior model be different?
c) Lisa collects some data. Among the 15 hatchlings she studied, 10 survived
for at least one week. What is the posterior model for π?
d) Lisa needs to explain the posterior model for π in a research paper for
ornithologists, and can't assume they understand Bayesian statistics.
Briefly summarize the posterior model in context.
Exercise 2.16 (Fake art). An article in The Daily Beast reports differing opinions
on the proportion (π) of museum artworks that are fake or forged.12
a) After reading the article, define your own prior model for π and provide
evidence from the article to justify your choice.
b) Compare your prior to that below. What's similar? Different?
c) Suppose you randomly choose 10 artworks. Assuming the prior from part
b, what is the minimum number of artworks that would need to be forged
for f (π = 0.6|Y = y) > 0.4?
2.5.4 Simulation exercises
Exercise 2.17 (Sick trees redux). Repeat Exercise 2.5 utilizing simulation to
approximate the posterior probability that a randomly selected maple tree has
mold. Specifically, simulate data for 10,000 trees and remember to set your
random number seed.
_________________________
12 https://www.thedailybeast.com/are-over-half-the-works-on-the-art-
market-really-fakes
Exercise 2.19 (Cuckoo birds redux). Repeat Exercise 2.15 utilizing simulation to
approximate the posterior model of π.
Exercise 2.20 (Cat image recognition). Whether you like it or not, cats have taken
over the internet.13 Joining the craze, Zainab has written an algorithm to detect cat
images. It correctly identifies 80% of cat images as cats, but falsely identifies 50%
of non-cat images as cats. Zainab tests her algorithm with a new set of images, 8%
of which are cats. What's the probability that an image is actually a cat if the
algorithm identifies it as a cat? Answer this question by simulating data for 10,000
images.
Exercise 2.21 (Medical tests). A medical test is designed to detect a disease that
about 3% of the population has. For 93% of those who have the disease, the test
yields a positive result. In addition, the test falsely yields a positive result for 7%
of those without the disease. What is the probability that a person has the disease
given that they have tested positive? Answer this question by simulating data for
10,000 people.
_________________________
13 https://www.nytimes.com/2015/08/07/arts/design/how-cats-took-over-the-
internet-at-the-museum-of-the-moving-image.html
3
The Beta-Binomial Bayesian Model
DOI: 10.1201/9780429288340-3
Every four years, Americans go to the polls to cast their vote for President
of the United States. Consider the following scenario. “Michelle” has
decided to run for president and you're her campaign manager for the state
of Minnesota. As such, you've conducted 30 different polls throughout the
election season. Though Michelle's support has hovered around 45%, she
polled at around 35% in the dreariest days and around 55% in the best
days on the campaign trail (Figure 3.1 (left)).
Elections are dynamic, thus Michelle's support is always in flux. Yet these
past polls provide prior information about π, the proportion of
Minnesotans that currently support Michelle. In fact, we can reorganize
this information into a formal prior probability model of π. We worked a
similar example in Section 2.3, in which context π was Kasparov's
probability of beating Deep Blue at chess. In that case, we greatly over-
simplified reality to fit within the framework of introductory Bayesian
models. Mainly, we assumed that π could only be 0.2, 0.5, or 0.8, the
corresponding chances of which were defined by a discrete probability
model. However, in the reality of Michelle's election support and
Kasparov's chess skill, π can be any value between 0 and 1. We can reflect
this reality and conduct a more nuanced Bayesian analysis by constructing
a continuous prior probability model of π. A reasonable prior is
represented by the curve in Figure 3.1 (right). We'll examine continuous
models in detail in Section 3.1. For now, simply notice that this curve
preserves the overall information and variability in the past polls –
Michelle's support π can be anywhere between 0 and 1, but is most likely
around 0.45.
Incorporating this more nuanced, continuous view of Michelle's support π
will require some new tools. BUT the spirit of the Bayesian analysis will
remain the same. No matter if our parameter π is continuous or discrete,
the posterior model of π will combine insights from the prior and data.
Directly ahead, you will dig into the details and build Michelle's election
model. You'll then generalize this work to the fundamental Beta-Binomial
Bayesian model. The power of the Beta-Binomial lies in its broad
applications. Michelle's election support π isn't the only variable of
interest that lives on [0,1]. You might also imagine Bayesian analyses in
which we're interested in modeling the proportion of people that use
public transit, the proportion of trains that are delayed, the proportion of
people that prefer cats to dogs, and so on. The Beta-Binomial model
provides the tools we need to study the proportion of interest, π, in each of
these settings.
Goals
Utilize and tune continuous priors. You will learn how to interpret
and tune a continuous Beta prior model to reflect your prior
information about π.
Interpret and communicate features of prior and posterior models
using properties such as mean, mode, and variance.
Construct the fundamental Beta-Binomial model for proportion π.
Getting started
To prepare for this chapter, note that we'll be using three Greek letters
throughout our analysis: π = pi, α = alpha, and β = beta. Further, load
the packages below:
library(bayesrules)
library(tidyverse)
NOTE: Don't fret if integrals are new to you. You will not need to perform
any integration to proceed with this book.
π ~ Beta(α, β).

The Beta model is specified by continuous pdf

f(π) = [Γ(α+β) / (Γ(α)Γ(β))] π^(α−1) (1 − π)^(β−1)   for π ∈ [0, 1]    (3.1)

where Γ(z) = ∫₀^∞ x^(z−1) e^(−x) dx and Γ(z + 1) = zΓ(z). Fun fact: when z is a
positive integer, Γ(z) simplifies to Γ(z) = (z − 1)!.
Hyperparameter
A hyperparameter is a parameter used in a prior model.
This model is best understood by playing around. Figure 3.2 plots the Beta
pdf f (π) under a variety of shape hyperparameters, α and β. Check out the
various shapes the Beta pdf can take. This flexibility means that we can
tune the Beta to reflect our prior understanding of π by tweaking α and β.
For example, notice that when we set α = β = 1 (middle left plot), the
Beta model is flat from 0 to 1. In this setting, the Beta model is equivalent
to perhaps a more familiar model, the standard Uniform.
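If you'd like to play along in R, the sketch below assumes the plot_beta() helper in the bayesrules package, which plots the Beta(α, β) pdf for the α and β you supply; the particular values shown are just examples.

library(bayesrules)

# Compare a flat Beta(1, 1) prior to more opinionated choices
plot_beta(alpha = 1, beta = 1)    # equivalent to Unif(0, 1)
plot_beta(alpha = 5, beta = 5)    # symmetric around 0.5
plot_beta(alpha = 5, beta = 11)   # places more plausibility on pi < 0.5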
FIGURE 3.2: Beta(α, β) pdfs f (π) under a variety of shape
hyperparameters α and β (black curve). The mean and mode are
represented by a blue solid line and dashed line, respectively.
π~Unif (0, 1)
with pdf f (π) = 1 for π ∈ [0, 1]. The Unif(0,1) model is a special
case of Beta(α, β) when α = β = 1.
Take a minute to see if you can identify some other patterns in how shape
hyperparameters α and β reflect the typical values of π as well as the
variability in π.1
Quiz Yourself!
The typical values of a Beta(α, β) random variable π can be measured by its mean and mode:

E(π) = α / (α + β)
Mode(π) = (α − 1) / (α + β − 2)   when α, β > 1.    (3.2)
Figure 3.2 also reveals patterns in the variability of π. For example, with
values that tend to be closer to the mean of 0.5, the variability in π is
smaller for the Beta(20,20) model than for the Beta(5,5) model. We can
measure the variability of a Beta(α, β) random variable π by variance
Var(π) = αβ / [(α + β)² (α + β + 1)].    (3.3)
The formulas above don't magically pop out of nowhere. They are
obtained by applying general de nitions of mean, mode, and variance to
the Beta pdf (3.1). We provide these de nitions below, but you can skip
them without consequence.
E(π) = ∫ π ⋅ f (π)dπ.
The mode of π captures the most plausible value of π, i.e., the value
of π for which the pdf is maximized:
2 2
Var(π) = E((π − E(π)) ) = ∫ (π − E(π)) ⋅ f (π)dπ.
_________________________
2 The mode when either α ≤ 1 or β ≤ 1 is evident from a plot of the pdf.
Specifically, for our prior understanding that Michelle's support is most likely
around 45%, we want a Beta prior with mean E(π) = α/(α + β) ≈ 0.45, i.e., with
shape hyperparameters α ≈ (9/11)β. One such choice is

π ~ Beta(45, 55)

with prior pdf f(π) following from plugging 45 and 55 into (3.1),

f(π) = [Γ(100) / (Γ(45)Γ(55))] π^44 (1 − π)^54   for π ∈ [0, 1].    (3.4)
Y | π ~ Bin(50, π)

with pmf

f(y|π) = P(Y = y | π) = (50 choose y) π^y (1 − π)^(50−y).    (3.7)
Recall that the likelihood function is defined by turning the Binomial pmf
on its head. Treating Y = 30 as observed data and π as unknown, matching
the reality of our situation, the Binomial likelihood function of π follows
from plugging y = 30 into the Binomial pmf (3.7):

L(π | y = 30) = (50 choose 30) π^30 (1 − π)^20   for π ∈ [0, 1].    (3.8)
For example, matching what we see in Figure 3.5, the chance that Y = 30
of 50 polled voters would support Michelle is 0.115 if her underlying
support were π = 0.6:
L(π = 0.6 | y = 30) = (50 choose 30) 0.6^30 0.4^20 ≈ 0.115.

In contrast, this chance would be only 0.042 if her underlying support were π = 0.5:

L(π = 0.5 | y = 30) = (50 choose 30) 0.5^30 0.5^20 ≈ 0.042.
It's also important to remember here that L(π|y = 30) is a function of π
that provides insight into the relative compatibility of the observed polling
data Y = 30 with different π ∈ [0, 1]. The fact that L(π|y = 30) is
maximized when π = 0.6 suggests that the 60% support for Michelle
among polled voters is most likely when her underlying support is also at
60%. This makes sense! The further that a hypothetical π value is from
0.6, the less likely we would be to observe our poll result – L(π|y = 30)
effectively drops to 0 for π values under 0.3 and above 0.9. Thus, it's
extremely unlikely that we would've observed a 60% support rate in the
new poll if, in fact, Michelle's underlying support were as low as 30% or
as high as 90%.
π ~Beta(45, 55).
These pieces of the puzzle are shown together in Figure 3.6 where, only
for the purposes of visual comparison to the prior, the likelihood function
is scaled to integrate to 1.3 The prior and data, as captured by the
likelihood, don't completely agree. Constructed from old polls, the prior is
a bit more pessimistic about Michelle's election support than the data
obtained from the latest poll. Yet both insights are valuable to our
analysis. Just as much as we shouldn't ignore the new poll in favor of the
old, we also shouldn't throw out our bank of prior information in favor of
the newest thing (also great life advice). Thinking like Bayesians, we can
construct a posterior model of π which combines the information from the
prior with that from the data.
FIGURE 3.6: The prior model of π along with the (scaled) likelihood
function of π given the new poll results in which Y = 30 of n = 50 polled
Minnesotans support Michelle.
Quiz Yourself!
Which plot reflects the correct posterior model of Michelle's election
support π?
Plot (b) is the only plot in which the posterior model of π strikes a balance
between the relative pessimism of the prior and optimism of the data. You
can reproduce this correct posterior using the
plot_beta_binomial() function in the bayesrules package, plugging
in the prior hyperparameters (α = 45, β = 55) and data (y = 30 of n = 50
polled voters support Michelle):
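A minimal sketch of that call:

plot_beta_binomial(alpha = 45, beta = 55, y = 30, n = 50)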
_________________________
3 The scaled likelihood function is calculated by L(π|y) / ∫₀¹ L(π|y) dπ.
FIGURE 3.7: The prior pdf, scaled likelihood function, and posterior pdf
of Michelle's election support π.
In its balancing act, the posterior here is slightly “closer” to the prior than
to the likelihood. (We'll gain intuition for why this is the case in Chapter
4.) The posterior being centered at π = 0.5 suggests that Michelle's
support is equally likely to be above or below the 50% threshold required
to win Minnesota. Further, combining information from the prior and data,
the range of posterior plausible values has narrowed: we can be fairly
certain that Michelle's support is somewhere between 35% and 65%.
You might also recognize something new: like the prior, the posterior
model of π is continuous and lives on [0,1]. That is, like the prior, the
posterior appears to be a Beta(α, β) model where the shape parameters
have been updated to combine information from the prior and data. This is
indeed the case. Conditioned on the observed poll results (Y = 30), the
posterior model of Michelle's election support is Beta(75, 75):
π | (Y = 30) ~ Beta(75, 75).    (3.9)
Before backing up this claim with some math, let's examine the evolution
in your understanding of Michelle's election support π. The
summarize_beta_binomial() function in the bayesrules package
summarizes the typical values and variability in the prior and posterior
models of π. These calculations follow directly from applying the prior
and posterior Beta parameters into (3.2) and (3.3):
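A sketch of that summary call, assuming summarize_beta_binomial() takes the same alpha, beta, y, and n arguments as plot_beta_binomial():

summarize_beta_binomial(alpha = 45, beta = 55, y = 30, n = 50)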
f(π | y = 30) = f(π)L(π | y = 30) / f(y = 30)
             = [Γ(100) / (Γ(45)Γ(55))] π^44 (1 − π)^54 ⋅ (50 choose 30) π^30 (1 − π)^20 / f(y = 30)
             = [Γ(100) / (Γ(45)Γ(55)) ⋅ (50 choose 30) / f(y = 30)] ⋅ π^74 (1 − π)^74
             ∝ π^74 (1 − π)^74.
In the third line of our calculation, we combined the constants and the
elements that depend upon π into two different pieces. In the final line, we
made a big simplification: we dropped all constants that don't depend upon
π. We don't need these. Rather, it's the dependence of f (π|y = 30) on π
that we care about:
f(π | y = 30) = c π^74 (1 − π)^74 ∝ π^74 (1 − π)^74.

We can always recover the normalizing constant c if we need it, since the posterior pdf must integrate to 1 across π ∈ [0, 1]:

1 = ∫₀¹ f(π | y = 30) dπ = ∫₀¹ c ⋅ π^74 (1 − π)^74 dπ   ⇒   c = 1 / ∫₀¹ π^74 (1 − π)^74 dπ.
Quiz Yourself!
For each scenario below, identify the correct Beta posterior model of
π ∈ [0, 1] from its unnormalized pdf.
a. f(π|y) ∝ π^(3−1) (1 − π)^(12−1)
b. f(π|y) ∝ π^11 (1 − π)^2
c. f(π|y) ∝ 1
Quiz Yourself!
Identify the kernels of each pdf below.

1. f(π|y) = y e^(−πy)   for π > 0
a. y
b. e^(−π)
c. y e^(−π)
d. e^(−πy)

2. f(π|y) = [2^y π^(y−1) e^(−2π)] / (y − 1)!   for π > 0
a. π^(y−1) e^(−2π)
b. 2^y / (y − 1)!
c. e^(−2π)
d. π^(y−1)
Y | π ~ Bin(n, π)
π ~ Beta(α, β).
This general model has vast applications, applying to any setting having a
parameter of interest π that lives on [0,1] with any tuning of a Beta prior
and any data Y which is the number of “successes” in n fixed, independent
trials, each having probability of success π. For example, π might be a
coin's tendency toward Heads and data Y records the number of Heads
observed in a series of n coin flips. Or π might be the proportion of adults
that use social media and we learn about π by sampling n adults and
recording the number Y that use social media. No matter the setting, upon
observing Y = y successes in n trials, the posterior of π can be described
by a Beta model which reveals the in uence of the prior (through α and β)
and data (through y and n):
_________________________
4 Answer: a. Beta(3,12); b. Beta(12,3); c. Beta(1,1) or, equivalently, Unif(0,1)
5 Answers: 1. d; 2. a; 3. π²
π | (Y = y) ~ Beta(α + y, β + n − y)    (3.10)

with posterior summary measures

E(π | Y = y) = (α + y) / (α + β + n)
Mode(π | Y = y) = (α + y − 1) / (α + β + n − 2)
Var(π | Y = y) = (α + y)(β + n − y) / [(α + β + n)² (α + β + n + 1)].    (3.11)
In fact, the Beta prior is a conjugate prior for the Binomial data
model. Our work below will highlight that conjugacy simplifies the
construction of the posterior, and thus can be a desirable property in
Bayesian modeling.

Conjugate prior

We say that f(π) is a conjugate prior for L(π|y) if the posterior,
f(π|y) ∝ f(π)L(π|y), is from the same model family as the prior.

To verify this conjugacy, recall the Beta prior pdf and Binomial likelihood function:

f(π) = [Γ(α+β) / (Γ(α)Γ(β))] π^(α−1) (1 − π)^(β−1)   and   L(π|y) = (n choose y) π^y (1 − π)^(n−y).    (3.12)
Putting these two pieces together, the posterior pdf follows from Bayes'
Rule:
f(π|y) ∝ f(π)L(π|y)
       = [Γ(α+β) / (Γ(α)Γ(β))] π^(α−1) (1 − π)^(β−1) ⋅ (n choose y) π^y (1 − π)^(n−y)
       ∝ π^((α+y)−1) (1 − π)^((β+n−y)−1).
Thus, we've verified our claim that the posterior model of π given an
observed Y = y successes in n trials is Beta(α + y, β + n − y).
The resulting 10,000 pairs of π and y values are shown in Figure 3.8. In
general, the greater Michelle's support, the better her poll results tend to
be. Further, the highlighted pairs illustrate that the eventual observed poll
result, Y = 30 of 50 polled voters supported Michelle, would most likely
arise if her underlying support π were somewhere in the range from 0.4 to
0.6.
FIGURE 3.8: A scatterplot of 10,000 simulated pairs of Michelle's
support π and polling outcome y.
When we zoom in closer on just those pairs that match our Y = 30 poll
results, the behavior across the remaining set of π values well
approximates the Beta(75,75) posterior model of π:
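Though the simulation code isn't reproduced here, a minimal sketch of it, under the Beta(45, 55) prior and Bin(50, π) data model above (the object names and seed are our own choices):

set.seed(84735)

# Simulate 10,000 (pi, y) pairs from the prior and data model
michelle_sim <- data.frame(pi = rbeta(10000, 45, 55)) %>%
  mutate(y = rbinom(10000, size = 50, prob = pi))

# Keep only the pairs that match the observed poll, Y = 30 of 50
michelle_posterior <- michelle_sim %>%
  filter(y == 30)

# The remaining pi values approximate the Beta(75, 75) posterior
ggplot(michelle_posterior, aes(x = pi)) +
  geom_density()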
In other words, study participants were given the task of testing another
participant (who was in truth a trained actor) on their ability to memorize
facts. If the actor didn't remember a fact, the participant was ordered to
administer a shock on the actor and to increase the shock level with every
subsequent failure. Unbeknownst to the participant, the shocks were fake
and the actor was only pretending to register pain from the shock.
Y |π ~Bin(40, π)
π ~Beta(1, 10).
Before moving ahead with our analysis, let's examine the psychologist's
prior model.
Quiz Yourself!
What does the Beta(1,10) prior model in Figure 3.10 reveal about the
psychologist's prior understanding of π?
a) They don't have an informed opinion.
b) They're fairly certain that a large proportion of people will do
what authority tells them.
c) They're fairly certain that only a small proportion of people
will do what authority tells them.
Quiz Yourself!
In the end, 26 of the 40 study participants inflicted what they
understood to be the maximum shock. In light of this data, what's the
psychologist's posterior model of π:
This posterior is summarized and plotted below, contrasted with the prior
pdf and scaled likelihood function. Note that the psychologist's
understanding evolved quite a bit from their prior to their posterior.
Though they started out with an understanding that fewer than 25% of
people would inflict the most severe shock, given the strong
counterevidence in the study data, they now understand this figure to be
somewhere between 30% and 70%.
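A sketch of how this contrast could be reproduced with the bayesrules helpers used above, plugging in the Beta(1, 10) prior and the observed Y = 26 of n = 40:

plot_beta_binomial(alpha = 1, beta = 10, y = 26, n = 40)
summarize_beta_binomial(alpha = 1, beta = 10, y = 26, n = 40)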
FIGURE 3.11: The Beta prior pdf, scaled Binomial likelihood function,
and Beta posterior pdf for π, the proportion of subjects that would follow
the given instructions.
In Chapter 3, you built the foundational Beta-Binomial model for π, an
unknown proportion that can take any value between 0 and 1:

Y | π ~ Bin(n, π)
π ~ Beta(α, β)
⇒ π | (Y = y) ~ Beta(α + y, β + n − y).
This model reflects the four pieces common to every Bayesian analysis:

1. Prior model. The Beta prior model for π can be tuned to reflect the
relative prior plausibility of each π ∈ [0, 1]:

f(π) = [Γ(α+β) / (Γ(α)Γ(β))] π^(α−1) (1 − π)^(β−1)   for π ∈ [0, 1].

2. Data model. The dependence of data Y on π is summarized by the Binomial
model, Y | π ~ Bin(n, π).

3. Likelihood function. Upon observing Y = y successes in n trials, the
likelihood function of π is

L(π|y) = (n choose y) π^y (1 − π)^(n−y).

4. Posterior model. The posterior model of π balances the prior and likelihood:

f(π|y) ∝ f(π)L(π|y) ∝ π^((α+y)−1) (1 − π)^((β+n−y)−1).
3.8 Exercises
3.8.1 Practice: Beta prior models
Exercise 3.1. (Tune your Beta prior: Take I). In each situation below, tune
a Beta(α, β) model that accurately reflects the given prior information. In
many cases, there's no single “right” answer, but rather multiple
“reasonable” answers.
a) Your friend applied to a job and tells you: “I think I have a 40%
chance of getting the job, but I'm pretty unsure.” When pressed
further, they put their chances between 20% and 60%.
b) A scientist has created a new test for a rare disease. They expect
that the test is accurate 80% of the time with a variance of 0.05.
c) Your Aunt Jo is a successful mushroom hunter. She boasts: “I
expect to find enough mushrooms to feed myself and my co-
workers at the auto-repair shop 90% of the time, but if I had to
give you a likely range it would be between 85% and 100% of the
time.”
d) Sal (who is a touch hyperbolic) just interviewed for a job, and
doesn't know how to describe their chances of getting an offer.
They say, “I couldn't read my interviewer's expression! I either
really impressed them and they are absolutely going to hire me, or
I made a terrible impression and they are burning my resumé as
we speak.”
Exercise 3.2. (Tune your Beta prior: Take II). As in Exercise 3.1, tune an
appropriate Beta(α, β) prior model for each situation below.
a) Your friend tells you “I think that I have a 80% chance of getting a
full night of sleep tonight, and I am pretty certain.” When pressed
further, they put their chances between 70% and 90%.
b) A scientist has created a new test for a rare disease. They expect
that it's accurate 90% of the time with a variance of 0.08.
c) Max loves to play the video game Animal Crossing. They tell you:
“The probability that I play Animal Crossing in the morning is
somewhere between 75% and 95%, but most likely around 85%.”
d) The bakery in Easthampton, Massachusetts often runs out of
croissants on Sundays. Ben guesses that by 10 a.m., there is a 30%
chance they have run out, but is pretty unsure about that guess.
Exercise 3.3. (It's OK to admit you don't know). You want to specify a
Beta prior for a situation in which you have no idea about some parameter
π. You think π is equally likely to be anywhere between 0 and 1.
Exercise 3.4. (Which Beta? Take I). Six Beta pdfs are plotted below.
Match each to one of the following models: Beta(0.5,0.5), Beta(1,1),
Beta(2,2), Beta(6,6), Beta(6,2), Beta(0.5,6).
Exercise 3.5. (Which Beta? Take II). Six Beta pdfs are plotted below.
Match each to one of the following models: Beta(1,0.3), Beta(2,1),
Beta(3,3), Beta(6,3), Beta(4,2), Beta(5,6).
Exercise 3.6. (Beta properties). Examine the properties of the Beta models
in Exercise 3.4.
a) Which Beta model has the smallest mean? The biggest? Provide
visual evidence and calculate the corresponding means.
b) Which Beta model has the smallest mode? The biggest? Provide
visual evidence and calculate the corresponding modes.
c) Which Beta model has the smallest standard deviation? The
biggest? Provide visual evidence and calculate the corresponding
standard deviations.
E(π) = ∫ πf (π)dπ
Var(π) = E[(π − E(π))²] = E(π²) − [E(π)]².
a) Specify and plot a Beta model that reflects the staff's prior ideas
about π.
b) Among 50 surveyed students, 15 are regular bike riders. What is
the posterior model for π?
c) What is the mean, mode, and standard deviation of the posterior
model?
d) Does the posterior model more closely reflect the prior
information or the data? Explain your reasoning.
a) Identify and plot a Beta model that reflects Bayard's prior ideas
about π.
b) Bayard wants to update his prior, so he randomly selects 90 US
LGBT adults and 30 of them are married to a same-sex partner.
What is the posterior model for π?
c) Calculate the posterior mean, mode, and standard deviation of π.
d) Does the posterior model more closely reflect the prior
information or the data? Explain your reasoning.
a) Identify and plot a Beta model that reflects Sylvia's prior ideas
about π.
b) Sylvia wants to update her prior, so she randomly selects 200 US
adults and 80 of them are aware that they know someone who is
transgender. Specify and plot the posterior model for π.
c) What is the mean, mode, and standard deviation of the posterior
model?
d) Describe how the prior and posterior Beta models compare.
_________________________
6 https://news.gallup.com/poll/212702/lgbt-adults-married-sex-
spouse.aspx?
utm_source=alert&utm_medium=email&utm_content=morelink&utm_campaign
=syndication
7 https://www.pewforum.org/2016/09/28/5-vast-majority-of-americans-
know-someone-who-is-gay-fewer-know-someone-who-is-transgender/
Exercise 3.16. (Plotting the Beta-Binomial: Take I). Below is output from
the plot_beta_binomial() function.
a) Describe and compare both the prior model and likelihood
function in words.
b) Describe the posterior model in words. Does it more closely agree
with the data (as reflected by the likelihood function) or the prior?
c) Provide the specific plot_beta_binomial() code you would
use to produce a similar plot.
Exercise 3.17. (Plotting the Beta-Binomial: Take II). Repeat Exercise 3.16
for the plot_beta_binomial() output below.
DOI: 10.1201/9780429288340-4
In Alison Bechdel's 1985 comic strip The Rule, a character states that they
only see a movie if it satisfies the following three rules (Bechdel, 1986):
the movie has to have at least two women in it; these two women talk to
each other; and they talk about something besides a man.
These criteria constitute the Bechdel test for the representation of women
in film. Thinking of movies you've watched, what percentage of all recent
movies do you think pass the Bechdel test? Is it closer to 10%, 50%, 80%,
or 100%?
Let π, a random value between 0 and 1, denote the unknown proportion of
recent movies that pass the Bechdel test. Three friends – the feminist, the
clueless, and the optimist – have some prior ideas about π. Reflecting upon
movies that he has seen in the past, the feminist understands that the
majority lack strong women characters. The clueless doesn't really recall
the movies they've seen, and so are unsure whether passing the Bechdel
test is common or uncommon. Lastly, the optimist thinks that the Bechdel
test is a really low bar for the representation of women in film, and thus
assumes almost all movies pass the test. All of this to say that three
friends have three different prior models of π. No problem! We saw in
Chapter 3 that a Beta prior model for π can be tuned to match one's prior
understanding (Figure 3.2). Check your intuition for Beta prior tuning in
the quiz below.1
Quiz Yourself!
Match each Beta prior in Figure 4.1 to the corresponding analyst: the
feminist, the clueless, and the optimist.
FIGURE 4.1: Three prior models for the proportion of films that pass the
Bechdel test.
Placing the greatest prior plausibility on values of π that are less than 0.5,
the Beta(5,11) prior reflects the feminist's understanding that the majority
of movies fail the Bechdel test. In contrast, the Beta(14,1) places greater
prior plausibility on values of π near 1, and thus matches the optimist's
prior understanding. This leaves the Beta(1,1) or Unif(0,1) prior which, by
placing equal plausibility on all values of π between 0 and 1, matches the
clueless's figurative shoulder shrug – the only thing they know is that π is a
proportion, and thus is somewhere between 0 and 1.
_________________________
1 Answer : Beta(1,1) = clueless prior. Beta(5,11) = feminist prior. Beta(14,1) = optimist prior.
The three analysts agree to review a sample of n recent movies and record
Y, the number that pass the Bechdel test. Recognizing Y as the number of
“successes” in a fixed number of independent trials, they specify the
dependence of Y on π using a Binomial model. Thus, each analyst has a
unique Beta-Binomial model of π with differing prior hyperparameters α
and β:
Y |π ~Bin(n, π)
π ~Beta(α, β).
(4.1)
If you're thinking “Can everyone have their own prior?! Is this always
going to be so subjective?!”, you are asking the right questions! And the
questions don't end there. To what extent might their different priors lead
the analysts to three different posterior conclusions about the Bechdel
test? How might this depend upon the sample size and outcomes of the
movie data they collect? To what extent will the analysts' posterior
understandings evolve as they collect more and more data? Will they ever
come to agreement about the representation of women in film?! We will
examine these fundamental questions throughout Chapter 4, continuing to
build our capacity to think like Bayesians.
Goals
Explore the balanced influence of the prior and data on the
posterior. You will see how our choice of prior model, the features
of our data, and the delicate balance between them can impact the
posterior model.
Perform sequential Bayesian analysis. You will explore one of the
coolest features of Bayesian analysis: how a posterior model
evolves as it's updated with new data.
4.1 Different priors, different posteriors
Reexamine Figure 4.1 which summarizes the prior models of π, the
proportion of recent movies that pass the Bechdel test, tuned by the
clueless, the feminist, and the optimist. Not only do the differing prior
means reflect disagreement about whether π is closer to 0 or 1, the
differing levels of prior variability reflect the fact that the analysts have
different degrees of certainty in their prior information. Loosely speaking,
the more certain the prior information, the smaller the prior variability.
The more vague the prior information, the greater the prior variability. The
priors of the optimist and the clueless represent these two extremes. With
a Beta(14,1) prior which exhibits the smallest variability, the optimist is
the most certain in their prior understanding of π (specifically, that almost
all movies pass the Bechdel test). We refer to such priors as informative.
Informative prior
An informative prior reflects specific information about the unknown
variable with high certainty, i.e., low variability.
With the largest prior variability, the clueless is the least certain about π.
In fact, their Beta(1,1) prior assigns equal prior plausibility to each value
of π between 0 and 1. This type of “shoulder shrug” prior model has an
official name: it's a vague prior.
Vague prior
A vague or diffuse prior reflects little specific information about the
unknown variable. A flat prior, which assigns equal prior plausibility
to all possible values of the variable, is a special case.
The next natural question to ask is: how will their different priors
influence the posterior conclusions of the feminist, the clueless, and the
optimist? To answer this question, we need some data. Our analysts decide
to review a random sample of n = 20 recent movies using data collected
for the FiveThirtyEight article on the Bechdel test.2 The bayesrules
package includes a partial version of this dataset, named bechdel. A
complete version is provided by the fivethirtyeight R package
(Kim et al., 2020). Along with the title and year of each movie in this
dataset, the binary variable records whether the film passed or failed the
Bechdel test:
_________________________
2 https://fivethirtyeight.com/features/the-dollar-and-cents-case-
against-hollywoods-exclusion-of-women/
Among the 20 movies in this sample, only 9 (45%) passed the test:
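A minimal sketch of drawing such a sample in R, assuming the bechdel data from bayesrules and that its pass/fail outcome is stored in a column named binary; the seed and object name are our own choices:

set.seed(84735)

# Take a random sample of 20 movies and tally the Bechdel outcome
bechdel_20 <- bechdel %>%
  sample_n(20)

bechdel_20 %>%
  count(binary)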
Before going through any formal math, perform the following gut check of
how you expect each analyst to react to this data. Answers are discussed
below.
Quiz Yourself!
The gure below displays our three analysts' unique priors along with
the common scaled likelihood function which reflects the Y = 9 of
n = 20 (45%) sampled movies that passed the Bechdel test. Whose
posterior do you anticipate will look the most like the scaled
likelihood? That is, whose posterior understanding of the Bechdel test
pass rate will most agree with the observed 45% rate in the observed
data? Whose do you anticipate will look the least like the scaled
likelihood?
Were your instincts right? Recall that the optimist started with the most
insistently optimistic prior about π – their prior model had a high mean
with low variability. It's not very surprising then that their posterior model
isn't as in sync with the data as the other analysts' posteriors. The dismal
data in which only 45% of the 20 sampled movies passed the test wasn't
enough to convince them that there's a problem in Hollywood – they still
think that values of π above 0.5 are the most plausible. At the opposite
extreme is the clueless who started with a flat, vague prior model of π.
Absent any prior information, their posterior model directly re ects the
insights gained from the observed movie data. In fact, their posterior is
indistinguishable from the scaled likelihood function.
Warning
As a reminder, likelihood functions are not pdfs, and thus typically
don't integrate to 1. As such, the clueless's actual (unscaled)
likelihood is not equivalent to their posterior pdf. We're merely
scaling the likelihood function here for simplifying the visual
comparisons between the prior vs data evidence about π.
Quiz Yourself!
The three analysts' common prior and unique Binomial likelihood
functions (3.12), reflecting their different data, are displayed below.
Whose posterior do you anticipate will be most in sync with their
data, as visualized by the scaled likelihood? Whose posterior do you
anticipate will be the least in sync with their data?
FIGURE 4.3: Posterior models of π, constructed from the same prior but
different data, are plotted for each analyst.
And with a little rearranging, we can isolate the influence of the prior and
observed data on the posterior mean. The second step in this
rearrangement might seem odd, but notice that we're just multiplying both
fractions by 1 (e.g., n/n).
E(π | Y = y) = α / (α + β + n) + y / (α + β + n)
             = [α / (α + β + n)] ⋅ [(α + β) / (α + β)] + [y / (α + β + n)] ⋅ (n / n)
             = [(α + β) / (α + β + n)] ⋅ [α / (α + β)] + [n / (α + β + n)] ⋅ (y / n)
             = [(α + β) / (α + β + n)] ⋅ E(π) + [n / (α + β + n)] ⋅ (y / n).
We've now split the posterior mean into two pieces: a piece which depends
upon the prior mean E(π) (3.2) and a piece which depends upon the
observed success rate in our sample trials, y/n. In fact, the posterior mean
is a weighted average of the prior mean and sample success rate, their
distinct weights summing to 1:
(α + β) / (α + β + n) + n / (α + β + n) = 1.
For example, consider the posterior means for Morteza and Ursula, the
settings for which are summarized in Table 4.2. With a shared Beta(14,1)
prior for π, Morteza and Ursula share a prior mean of E(π) = 14/15. Yet
their data differs. Morteza observed Y = 6 of n = 13 films pass the
Bechdel test, and thus has a posterior mean of

E(π | Y = 6) = [(14 + 1) / (14 + 1 + 13)] ⋅ E(π) + [13 / (14 + 1 + 13)] ⋅ (y / n)
             = 0.5357 ⋅ (14 / 15) + 0.4643 ⋅ (6 / 13)
             = 0.7143.

In contrast, Ursula observed Y = 46 of n = 99 films pass the Bechdel test, and thus has a posterior mean of

E(π | Y = 46) = [(14 + 1) / (14 + 1 + 99)] ⋅ E(π) + [99 / (14 + 1 + 99)] ⋅ (y / n)
              = 0.1316 ⋅ (14 / 15) + 0.8684 ⋅ (46 / 99)
              = 0.5263.
Again, though Morteza and Ursula have a common prior mean for π and
observed similar Bechdel pass rates of roughly 46%, their posterior means
differ due to their differing sample sizes n. Since Morteza observed only
n = 13 films, his posterior mean put slightly more weight on the prior
mean than on the observed Bechdel pass rate in his sample: 0.5357 vs
0.4643. In contrast, since Ursula observed a relatively large number of
n = 99 films, her posterior mean put much less weight on the prior mean
than on the observed Bechdel pass rate in her sample: 0.1316 vs 0.8684.
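A quick R sketch of this weighted-average formula, checking Morteza's and Ursula's posterior means (the helper function name is ours):

# Posterior mean as a weighted average of prior mean and sample success rate
posterior_mean <- function(alpha, beta, y, n) {
  prior_weight <- (alpha + beta) / (alpha + beta + n)
  data_weight  <- n / (alpha + beta + n)
  prior_weight * alpha / (alpha + beta) + data_weight * y / n
}

posterior_mean(alpha = 14, beta = 1, y = 6, n = 13)   # roughly 0.7143
posterior_mean(alpha = 14, beta = 1, y = 46, n = 99)  # roughly 0.5263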
The implications of these results are mathemagical. In general, consider
what happens to the posterior mean as we collect more and more data. As
sample size n increases, the weight (hence influence) of the Beta(α, β)
prior model approaches 0,

(α + β) / (α + β + n) → 0   as n → ∞,

while the weight of the data approaches 1,

n / (α + β + n) → 1   as n → ∞.

Thus, the more data we have, the more the posterior mean will drift toward
the trends exhibited in the data as opposed to the prior: as n → ∞,

E(π | Y = y) = [(α + β) / (α + β + n)] ⋅ E(π) + [n / (α + β + n)] ⋅ (y / n) → y / n.
The rate at which this drift occurs depends upon whether the prior tuning
(i.e., α and β) is informative or vague. Thus, these mathematical results
support the observations we made about the posterior's balance between
the prior and data in Figure 4.4. And that's not all! In the exercises, you
will show that we can write the posterior mode as the weighted average of
the prior mode and observed sample success rate:
Mode(π | Y = y) = [(α + β − 2) / (α + β + n − 2)] ⋅ Mode(π) + [n / (α + β + n − 2)] ⋅ (y / n).
On day one of the study, Y = 1 of n = 10 participants
delivered the most severe shock. Thus, by the end of day one, the
psychologist's understanding of π had already evolved. It follows from
(4.1) that3 the posterior model of π was Beta(2, 19).
Day two was much busier and the results grimmer: among n = 20
participants, Y = 17 delivered the most severe shock. Thus, by the end of
day two, the psychologist's understanding of π had again evolved – π was
likely larger than they had expected.
_________________________
3 The posterior parameters are calculated by α + y = 1 + 1 and β + n − y = 10 + 10 − 1 .
Quiz Yourself!
What was the psychologist's posterior of π at the end of day two?
a) Beta(19,22)
b) Beta(18,13)
If your answer is “a,” you are correct! On day two, the psychologist didn't
simply forget what happened on day one and start afresh with the original
Beta(1,10) prior. Rather, what they had learned by the end of day one,
expressed by the Beta(2,19) posterior, provided a prior starting point on
day two. Thus, by (4.1), the posterior model of π at the end of day two is
Beta(19,22).4 On day three, Y = 8 of n = 10 participants delivered the
most severe shock, and thus their model of π evolved from a Beta(19,22)
prior to a Beta(27,24) posterior.5 The complete evolution from the
psychologist's original Beta(1,10) prior to their Beta(27,24) posterior at
the end of the three-day study is summarized in Table 4.3. Figure 4.5
displays this evolution in pictures, including the psychologist's big leap
from day one to day two upon observing so many study participants
deliver the most severe shock (17 of 20).
TABLE 4.3: A sequential Bayesian
analysis of Milgram's data.
Day Data Model
0 NA Beta(1,10)
1 Y = 1 of n = 10 Beta(2,19)
2 Y = 17 of n = 20 Beta(19,22)
3 Y = 8 of n = 10 Beta(27,24)
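A small R sketch of these sequential updates, reproducing the posterior parameters in Table 4.3 (the object names are ours):

# Each day's posterior Beta parameters become the next day's prior parameters
milgram_days <- data.frame(y = c(1, 17, 8), n = c(10, 20, 10))
alpha <- 1
beta  <- 10   # the day-zero Beta(1, 10) prior

for (day in seq_len(nrow(milgram_days))) {
  alpha <- alpha + milgram_days$y[day]
  beta  <- beta + milgram_days$n[day] - milgram_days$y[day]
  cat("Day", day, "posterior: Beta(", alpha, ",", beta, ")\n")
}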
Altogether across the three days, Y = 26 of n = 40 participants
delivered the most severe shock. In Section 3.6, we evaluated this data all
at once, not incrementally. In doing so, we jumped straight from the
psychologist's original Beta(1,10) prior model to the Beta(27,24) posterior
model of π. That is, whether we evaluate the data incrementally or all in
one go, we'll end up at the same place.
To prove the data order invariance property, let's first specify the structure
of the posterior pdf f(θ|y1, y2) which evolves by sequentially observing data
y1 followed by y2. In step one, we construct the posterior
pdf from our original prior pdf, f(θ), and the likelihood function of θ
given the first data point y1, L(θ|y1):

f(θ|y1) = (prior ⋅ likelihood) / (normalizing constant) = f(θ)L(θ|y1) / f(y1).

In step two, we update our model in light of observing new data y2. In
doing so, don't forget that we start from the prior model specified by
f(θ|y1). Thus,

f(θ|y2, y1) = f(θ|y1)L(θ|y2) / f(y2)
            = [f(θ)L(θ|y1) / f(y1)] ⋅ L(θ|y2) / f(y2)
            = f(θ)L(θ|y1)L(θ|y2) / [f(y1)f(y2)].

Similarly, observing the data in the opposite order, y2 and then y1, would
produce the equivalent posterior pdf

f(θ|y1, y2) = f(θ)L(θ|y2)L(θ|y1) / [f(y2)f(y1)].

Finally, not only does the order of the data not influence the ultimate
posterior model of θ, it doesn't matter whether we observe the data all at
once or sequentially. To this end, suppose we start with the original f(θ)
prior and observe data (y1, y2) together, not sequentially. Further, assume
that these data points are independent, and thus f(y1, y2|θ) = f(y1|θ)f(y2|θ)
and f(y1, y2) = f(y1)f(y2).
Then the posterior pdf resulting from this “data dump” is equivalent to
the posterior obtained by observing the data sequentially:

f(θ|y1, y2) = f(θ)L(θ|y1, y2) / f(y1, y2)
            = f(θ)f(y1, y2|θ) / [f(y1)f(y2)]
            = f(θ)L(θ|y1)L(θ|y2) / [f(y1)f(y2)].
4.6 Don't be stubborn
Chapter 4 has highlighted some of the most compelling aspects of the
Bayesian philosophy – it provides the framework and flexibility for our
understanding to evolve over time. One of the only ways to lose this
Bayesian benefit is by starting with an extremely stubborn prior model. A
model so stubborn that it assigns a prior probability of zero to certain
parameter values. Consider an example within the Milgram study setting
where π is the proportion of people that will obey authority even if it
means bringing harm to others. Suppose that a certain researcher has a
stubborn belief in the good of humanity, insisting that π is equally likely to
be anywhere between 0 and 0.25, and surely doesn't exceed 0.25. They
express this prior understanding through a Uniform model on 0 to 0.25,
FIGURE 4.7: The stubborn researcher's prior and likelihood, with three
potential corresponding posterior models.
As odd as it might seem, the posterior model in plot (c) corresponds to the
stubborn researcher's updated understanding of π in light of the observed
data. A posterior model is defined on the same values for which the prior
model is defined. That is, the support of the posterior model is inherited
from the support of the prior model. Since the psychologist's prior model
assigns zero probability to any value of π past 0.25, their posterior model
must also assign zero probability to any value in that range.
Mathematically, the posterior pdf f (π|y = 8) = 0 for any π ∉ [0, 0.25]
and, for any π ∈ [0, 0.25],
f(π|y = 8) ∝ f(π)L(π|y = 8)
           = 4 ⋅ (10 choose 8) π^8 (1 − π)^2
           ∝ π^8 (1 − π)^2.
Quiz Yourself!
For each statement below, indicate whether the statement is true or
false. Provide your reasoning.
Prior influence
The less vague and more informative the prior, i.e., the greater our
prior certainty, the more influence the prior has over the posterior.
Data influence
The more data we have, the more influence the data has over the
posterior. Thus, if they have ample data, two researchers with different
priors will have similar posteriors.
4.9 Exercises
4.9.1 Review exercises
Exercise 4.1 (Match the prior to the description). Five different prior
models for π are listed below. Label each with one of these descriptors:
somewhat favoring π < 0.5, strongly favoring π < 0.5, centering π on 0.5,
somewhat favoring π > 0.5, strongly favoring π > 0.5.
a) Beta(1.8,1.8)
b) Beta(3,2)
c) Beta(1,10)
d) Beta(1,3)
e) Beta(17,2)
Exercise 4.2 (Match the plot to the code). Which arguments to the
plot_beta_binomial() function generated the plot below?
a) alpha = 2, beta = 2, y = 8, n = 11
b) alpha = 2, beta = 2, y = 3, n = 11
c) alpha = 3, beta = 8, y = 2, n = 6
d) alpha = 3, beta = 8, y = 4, n = 6
e) alpha = 3, beta = 8, y = 2, n = 4
f) alpha = 8, beta = 3, y = 2, n = 4
Exercise 4.3 (Choice of prior: gingko tree leaf drop). A ginkgo tree can
grow into a majestic monument to the wonders of the natural world. One
of the most notable things about ginkgo trees is that they shed all of their
leaves at the same time, usually after the first frost. Randi thinks that the
ginkgo tree in her local arboretum will drop all of its leaves next Monday.
She asks 5 of her friends what they think about the probability (π) that this
will happen. Identify some reasonable Beta priors to convey each of these
beliefs.
coworker prior
Kimya Beta(1, 2)
Fernando Beta(0.5, 1)
Ciara Beta(3, 10)
Taylor Beta(2, 0.1)
Exercise 4.4 (Choice of prior). Visualize and summarize (in words) each
coworker's prior understanding of Chad's chances to satisfy his ice cream
craving.
Exercise 4.5 (Simulating the posterior). Chad peruses the shop's website.
On 3 of the past 7 days, they were still open at 2 p.m.. Complete the
following for each of Chad's coworkers:
Exercise 4.6 (Identifying the posterior). Complete the following for each
of Chad's coworkers:
Exercise 4.7 (What dominates the posterior?). In each situation below you
will be given a Beta prior for π and some Binomial trial data. For each
scenario, identify which of the following is true: the prior has more
influence on the posterior, the data has more influence on the posterior, or
the posterior is an equal compromise between the data and the prior.
Exercise 4.9 (Different data: more or less sure). Let π denote the
proportion of people that prefer dogs to cats. Suppose you express your
prior understanding of π by a Beta(7, 2) model.
Exercise 4.10 (What was the data?). In each situation below we give you a
Beta prior and a Beta posterior. Further, we tell you that the data is
Binomial, but we don't tell you the observed number of trials n or
successes y in those trials. For each situation, identify n and y, and then
utilize plot_beta_binomial() to sketch the prior pdf, scaled
likelihood function, and posterior pdf.
a) Y = 10 in n = 13 trials
b) Y = 0 in n = 1 trial
d) Y = 20 in n = 120 trials
Exercise 4.13 (Bayesian bummer). Bayesian methods are great! But, like
anything, we can screw it up. Suppose a politician specifies their prior
understanding about their approval rating, π, by: π~Unif (0.5, 1) with pdf
f (π) = 2 when 0.5 ≤ π < 1, and f (π) = 0 when 0 < π < 0.5.
Exercise 4.15 (One at a time). Let π be the probability of success for some
event of interest. You place a Beta(2, 3) prior on π, and are really
impatient. Sequentially update your posterior for π with each new
observation below.
Exercise 4.16 (Five at a time). Let π be the probability of success for some
event of interest. You place a Beta(2, 3) prior on π, and are impatient, but
you have been working on that aspect of your personality. So you
sequentially update your posterior model of π after every ve (!) new
observations. For each set of ve new observations, report the updated
posterior model for π.
a) John has a at Beta(1, 1) prior and analyzes movies from the year
1980.
b) The next day, John analyzes movies from the year 1990, while
building off their analysis from the previous day.
c) The third day, John analyzes movies from the year 2000, while
again building off of their analyses from the previous two days.
d) Jenna also starts her analysis with a Beta(1, 1) prior, but analyzes
movies from 1980, 1990, 2000 all on day one.
5
Conjugate Families
DOI: 10.1201/9780429288340-5
In the novel Anna Karenina, Tolstoy wrote “Happy families are all alike;
every unhappy family is unhappy in its own way.” In this chapter we will
learn about conjugate families, which are all alike in the sense that they
make the authors very happy. Read on to learn why.
Goals
Practice building Bayesian models. You will build Bayesian
models by practicing how to recognize kernels and make use of
proportionality.
Familiarize yourself with conjugacy. You will learn about what
makes a prior conjugate and why this is a helpful property. In
brief, conjugate priors make it easier to build posterior models.
Conjugate priors spark joy!
Getting started
To prepare for this chapter, note that we'll be using some new Greek
letters throughout our analysis: λ = lambda, μ = mu or “mew”, σ =
sigma, τ = tau, and θ = theta. Further, load the packages below.
library(bayesrules)
library(tidyverse)
5.1 Revisiting choice of prior
How do we choose a prior? In Chapters 3 and 4 we used the flexibility of
the Beta model to reflect our prior understanding of a proportion
parameter π ∈ [0, 1]. There are other criteria to consider when choosing a
prior model:
Computational ease
Especially if we don't have access to computing power, it is helpful if
the posterior model is easy to build.
Interpretability
We've seen that posterior models are a compromise between the data
and the prior model. A posterior model is interpretable, and thus more
useful, when you can look at its formulation and identify the
contribution of the data relative to that of the prior.
Conjugate prior
Let the prior model for parameter θ have pdf f (θ) and the model of
data Y conditioned on θ have likelihood function L(θ|y). If the
resulting posterior model with pdf f (θ|y) ∝ f (θ)L(θ|y) is of the
same model family as the prior, then we say this is a conjugate prior.
To emphasize the utility (and fun!) of conjugate priors, it can be helpful to
consider a non-conjugate prior. Let parameter π be a proportion between 0
and 1 and suppose we plan to collect data Y where, conditional on π,
Y |π~Bin(n, π). Instead of our conjugate Beta(α, β) prior for π, let's try
f(π) = e − e^π   for π ∈ [0, 1].    (5.1)

Though not a Beta pdf, f(π) is indeed a valid pdf since f(π) is non-
negative on the support of π and the area under the pdf is 1, i.e.,
∫₀¹ f(π) dπ = 1.

Suppose we then observe Y = 10 successes in n = 50 trials. The corresponding Binomial likelihood function of π is

L(π | y = 10) = (50 choose 10) π^10 (1 − π)^40   for π ∈ [0, 1].
Recall from Chapter 3 that, when we put the prior and the likelihood
together, we are already on our path to finding the posterior model with
pdf

f(π|y = 10) ∝ f(π)L(π|y = 10) = (e − e^π) ⋅ (50 choose 10) π^10 (1 − π)^40 ∝ (e − e^π) π^10 (1 − π)^40.

Normalizing this kernel produces the exact posterior pdf

f(π|y = 10) = (e − e^π) π^10 (1 − π)^40 / ∫₀¹ (e − e^π) π^10 (1 − π)^40 dπ   for π ∈ [0, 1].

Notice here that our non-Beta prior didn't produce a neat and clean answer
for the exact posterior model (fully specified and not up to a
proportionality constant). We cannot squeeze this posterior pdf kernel into
a Beta box or any other familiar model for that matter. That is, we cannot
rewrite (e − e^π) π^10 (1 − π)^40 so that it shares the same structure as a Beta
kernel, π^(■−1) (1 − π)^(■−1). This is where we really start to feel the pain
of not having a conjugate prior.
This leaves us with the question: could we use the conjugate Beta prior
and still capture the broader information of the messy non-conjugate prior
(5.1)? If so, then we solve the problems of messy calculations and
indecipherable posterior models.
Quiz Yourself!
Which Beta model would most closely approximate the non-
conjugate prior for π (Figure 5.1)?
a. Beta(3,1)
b. Beta(1,3)
c. Beta(2,1)
d. Beta(1,2)
Last year, one of this book's authors got fed up with the number of fraud
risk phone calls they were receiving. They set out with a goal of modeling
rate λ, the typical number of fraud risk calls received per day. Prior to
collecting any data, the author's guess was that this rate was most likely
around 5 calls per day, but could also reasonably range between 2 and 7
calls per day. To learn more, they planned to record the number of fraud
risk phone calls on each of n sampled days, (Y1, Y2, …, Yn).

Y | λ ~ Pois(λ)

with pmf

f(y|λ) = λ^y e^(−λ) / y!   for y ∈ {0, 1, 2, …}.    (5.3)
Figure 5.3 illustrates the Poisson pmf (5.3) under different rate parameters
λ. In general, as the rate of events λ increases, the typical number of events
increases, the variability increases, and the skew decreases. For example,
when events occur at a rate of λ = 1, the model is heavily skewed toward
observing a small number of events – we're most likely to observe 0 or 1
events, and rarely more than 3. In contrast, when events occur at a higher
rate of λ = 5, the model is roughly symmetric and more variable – we're
most likely to observe 4 or 5 events, though have a reasonable chance of
observing anywhere between 1 and 10 events.
each of the n days in our data collection period. We assume that the daily
number of calls might differ from day to day and can be independently
modeled by the Poisson. Thus, on each day i,
Yi | λ ~ Pois(λ), independently across days i, with pmf

f(yi|λ) = λ^(yi) e^(−λ) / yi!   for yi ∈ {0, 1, 2, …}.
Yet in weighing the evidence of the phone call data, we won't want to
analyze each individual day. Rather, we'll need to process the collective or
joint information in our n data points. This information is captured by the
joint probability mass function.
Connecting concepts
n
→
i
f (y λ) = ∏ f (y i λ) = f (y 1 λ) ⋅ f (y 2 λ) ⋅ ⋯ ⋅ f (y n λ).
i=1
The joint pmf for our fraud risk call sample follows by applying the
general definition (5.5) to our Poisson pmfs. Letting the number of calls on
each day i be yᵢ ∈ {0, 1, 2, …},

f(y | λ) = ∏ᵢ₌₁ⁿ f(yᵢ | λ) = ∏ᵢ₌₁ⁿ (λ^(yᵢ) e^(−λ)) / yᵢ! = (λ^(y₁) e^(−λ)) / y₁! ⋅ (λ^(y₂) e^(−λ)) / y₂! ⋅ ⋯ ⋅ (λ^(yₙ) e^(−λ)) / yₙ!.

This looks like a mess, but it can be simplified. In this simplification, it's
important to recognize that we have n unique data points yᵢ, not n copies of
the same data point y. Thus, we need to pay careful attention to the i
subscripts. It follows that

f(y | λ) = ([λ^(y₁) λ^(y₂) ⋯ λ^(yₙ)] [e^(−λ) e^(−λ) ⋯ e^(−λ)]) / (y₁! y₂! ⋯ yₙ!) = (λ^(Σ yᵢ) e^(−nλ)) / ∏ᵢ₌₁ⁿ yᵢ!,    (5.6)
where we've simplified the products in the final line by appealing to the
properties below.

Simplifying products

Let (x, y, a, b) be a set of constants. Then we can utilize the following
facts when simplifying products involving exponents:

x^a x^b = x^(a+b)   and   x^a y^a = (xy)^a.    (5.7)
Once we observe actual sample data, we can flip this joint pmf on its head
to define the likelihood function of λ. The Poisson likelihood function is
equivalent in formula to the joint pmf f(y | λ), yet is a function of λ which
helps us assess the compatibility of different possible λ values with our
observed collection of sample data y:

L(λ | y) = (λ^(Σ yᵢ) e^(−nλ)) / ∏ᵢ₌₁ⁿ yᵢ! ∝ λ^(Σ yᵢ) e^(−nλ)   for λ > 0.    (5.8)
The constant ∏ᵢ₌₁ⁿ yᵢ! can be cumbersome to calculate when n is large, and what we really care about in the likelihood is λ. And
when we express the likelihood up to a proportionality constant, note that
the sum of the data points (Σ yᵢ) and the number of data points (n) is all
the information that is required from the data. We don't need to know the
value of each individual data point yᵢ. Taking this for a spin with real data
points later in our analysis will provide some clarity.
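As a quick sanity check of this point, here is a small sketch of our own (not from the bayesrules package) comparing the proportional likelihood for two hypothetical samples that share the same total Σ yᵢ and sample size n:

# Two hypothetical samples of n = 4 days with the same total of 11 calls
y1 <- c(6, 2, 2, 1)
y2 <- c(3, 3, 3, 2)

# Proportional Poisson likelihood: lambda^sum(y) * exp(-n*lambda)
prop_lik <- function(lambda, y) lambda^sum(y) * exp(-length(y) * lambda)

# Identical for every lambda we check, since both samples share sum(y) = 11 and n = 4
lambda_grid <- seq(1, 8, by = 0.5)
all.equal(prop_lik(lambda_grid, y1), prop_lik(lambda_grid, y2))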
5.2.2 Potential priors
The Poisson data model provides one of two key pieces for our Bayesian
analysis of λ, the daily rate of fraud risk calls. The other key piece is a
prior model for λ. Our original guess was that this rate is most likely
around 5 calls per day, but could also reasonably range between 2 and 7
calls per day. In order to tune a prior to match these ideas about λ, we first
have to identify a reasonable probability model structure.
Remember here that λ is a positive and continuous rate, meaning that λ
does not have to be a whole number. Accordingly, a reasonable prior
probability model will also have continuous and positive support, i.e., be
defined on λ > 0. There are several named and studied probability models
with this property, including the F, Weibull, and Gamma. We don't dig into
all of these in this book. Rather, to make the λ posterior model
construction more straightforward and convenient, we'll focus on
identifying a conjugate prior model.
Quiz Yourself!
Suppose we have a random sample of Poisson random variables
(Y₁, Y₂, …, Yₙ) with likelihood function L(λ | y) ∝ λ^(Σ yᵢ) e^(−nλ) for
λ > 0. What do you think would provide a convenient conjugate prior
model for λ? Why?

a. A "Gamma" model with pdf f(λ) ∝ λ^(s−1) e^(−rλ)
The answer is a. The Gamma model will provide a conjugate prior for λ
when our data has a Poisson model. You might have guessed this from the
section title (clever). You might also have guessed this from the shared
features of the Poisson likelihood function (5.8) and the Gamma
pdf f(λ). Both are proportional to

λ^(■−1) e^(−■λ)

with differing ■. In fact, we'll prove that combining the prior and likelihood L(λ | y)
produces a posterior pdf with this same structure. That is, the
posterior will be of the same Gamma model family as the prior. First, let's
learn more about the Gamma model. A Gamma model for λ,

λ ~ Gamma(s, r),

is specified by pdf

f(λ) = (r^s / Γ(s)) λ^(s−1) e^(−rλ)   for λ > 0    (5.9)

with

E(λ) = s/r
Mode(λ) = (s−1)/r   for s ≥ 1
Var(λ) = s/r².    (5.10)
The Exponential model is a special case of the Gamma with shape
s = 1, Gamma(1, r):
λ~Exp(r).
Notice that the Gamma model depends upon two hyperparameters, r and s.
Assess your understanding of how these hyperparameters impact the
Gamma model properties in the following quiz.1
Quiz Yourself!
Figure 5.4 illustrates how different shape and rate hyperparameters
impact the Gamma pdf (5.9). Based on these plots:
_________________________
1 1:a , 2:b, 3: Gamma(20,20)
FIGURE 5.4: Gamma models with different hyperparameters. The
dashed and solid vertical lines represent the modes and means,
respectively.
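If you'd like to explore these hyperparameters yourself, the bayesrules package includes a plot_gamma() function; a few illustrative calls (the parameter choices here are ours) are sketched below:

library(bayesrules)

# Explore how shape s and rate r tune the Gamma model
plot_gamma(shape = 1, rate = 1)     # right skewed, mode at 0
plot_gamma(shape = 20, rate = 20)   # concentrated around 1
plot_gamma(shape = 10, rate = 2)    # centered around 5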
Now that we have some intuition for how the Gamma(s,r) model works,
we can tune it to reflect our prior information about the daily rate of fraud
risk phone calls λ. Recall our earlier assumption that λ is about 5, and most
likely somewhere between 2 and 7. Our Gamma(s,r) prior should have
similar patterns. For example, we want to pick s and r for which λ tends to
be around 5,

E(λ) = s/r ≈ 5.
λ ~ Gamma(10, 2)

FIGURE 5.5: The pdf of a Gamma(10,2) prior for λ, the daily rate of
fraud risk calls.

with prior pdf f(λ) following from plugging s = 10 and r = 2 into (5.9):

f(λ) = (2^10 / Γ(10)) λ^(10−1) e^(−2λ)   for λ > 0.
Yᵢ | λ ~ Pois(λ), independently for each i
λ ~ Gamma(s, r).

Upon observing data y = (y₁, y₂, …, yₙ), the posterior model of λ is

λ | y ~ Gamma(s + Σ yᵢ, r + n).    (5.11)

Let's prove this result. In general, recall that the posterior pdf of λ is
proportional to the product of the prior pdf and likelihood function defined
by (5.9) and (5.8), respectively:

f(λ | y) ∝ f(λ) L(λ | y) = (r^s / Γ(s)) λ^(s−1) e^(−rλ) ⋅ (λ^(Σ yᵢ) e^(−nλ)) / ∏ᵢ₌₁ⁿ yᵢ!   for λ > 0.

Dropping the constants that don't depend upon λ,

f(λ | y) ∝ λ^(s−1) e^(−rλ) ⋅ λ^(Σ yᵢ) e^(−nλ) = λ^(s + Σ yᵢ − 1) e^(−(r+n)λ),

where the final line follows by combining like terms. What we're left with
here is the kernel of the posterior pdf. This particular kernel corresponds
to the pdf of a Gamma model (5.9), with shape parameter s + Σ yᵢ and rate parameter r + n. That is, λ | y ~ Gamma(s + Σ yᵢ, r + n), as claimed.

Let's apply this result to our fraud risk calls. There we have a
Gamma(10,2) prior for λ, the daily rate of calls. Further, on four separate
days in the second week of August, we received
y = (y₁, y₂, y₃, y₄) = (6, 2, 2, 1) such calls. Thus, we have a sample of
n = 4 data points with a total of Σ yᵢ = 6 + 2 + 2 + 1 = 11 fraud risk calls and an average of 2.75
calls per day. Plugging this data into (5.8), the resulting Poisson likelihood function of λ is

L(λ | y) = (λ^11 e^(−4λ)) / (6!×2!×2!×1!) ∝ λ^11 e^(−4λ)   for λ > 0.

We visualize a portion of L(λ | y) below: we're most likely to
observe a sample with an average daily phone call rate of ȳ = 2.75 when
the underlying rate λ is also 2.75.

It then follows from (5.11) that the posterior model of λ is a Gamma with an updated shape parameter of 21 (s + Σ yᵢ = 10 + 11) and rate parameter of 6 (r + n = 2 + 4):

λ | y ~ Gamma(21, 6).
We can visualize the prior pdf, scaled likelihood function, and posterior
pdf for λ all in a single plot with the plot_gamma_poisson()
function in the bayesrules package. How magical. For this function to
work, we must specify a few things: the prior shape and rate
hyperparameters as well as the information from our data, the observed
total number of phone calls sum_y (Σ yᵢ) and the sample size n:
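Assuming the argument names correspond to the prior shape and rate and the data summaries just described, the call looks roughly like this:

library(bayesrules)

# Prior: Gamma(10, 2); data: 11 total calls across n = 4 days
plot_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)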
FIGURE 5.7: The Gamma-Poisson model of λ, the daily rate of fraud risk
calls.
Our posterior notion about the daily rate of fraud calls is, of course, a
compromise between our vague prior and the observed phone call data.
Since our prior notion was quite variable in comparison to the strength in
our sample data, the posterior model of λ is more in sync with the data.
Specifically, utilizing the properties of the Gamma(10,2) prior and
Gamma(21,6) posterior as defined by (5.10), notice that our posterior
understanding of the typical daily rate of phone calls dropped from 5 to
3.5 per day:
E(λ) = 10/2 = 5   and   E(λ | y) = 21/6 = 3.5.
Though a compromise between the prior mean and data mean, this
posterior mean is closer to the data mean of y = 2.75 calls per day.
Hot tip
The posterior mean will always be between the prior mean and the
data mean. If your posterior mean falls outside that range, it indicates
that you made an error and should retrace some steps.
Further, with the additional information about λ from the data, the
variability in our understanding of λ drops by more than half, from a
standard deviation of 1.581 to 0.764 calls per day:

SD(λ) = √(10/2²) = 1.581   and   SD(λ | y) = √(21/6²) = 0.764.
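We can also confirm these prior and posterior summaries numerically with summarize_gamma_poisson() from the bayesrules package; a sketch of the call, using the same arguments as above:

library(bayesrules)

# Numerical comparison of the Gamma(10, 2) prior and Gamma(21, 6) posterior
summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)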
We now have two conjugate families in our toolkit: the Beta-Binomial and
the Gamma-Poisson. But many more conjugate families exist! It's
impossible to cover them all, but there is a third conjugate family that's
especially helpful to know: the Normal-Normal. Consider a data story. As
scientists learn more about brain health, the dangers of concussions (hence
of activities in which participants sustain repeated concussions) are
gaining greater attention (Bachynski, 2019). Among all people who have a
history of concussions, we are interested in μ, the average volume (in
cubic centimeters) of a specific part of the brain: the hippocampus.
Though we don't have prior information about this group in particular,
Wikipedia tells us that among the general population of human adults,
both halves of the hippocampus have a volume between 3.0 and 3.5 cubic
centimeters.2 Thus, the total hippocampal volume of both sides of the
brain is between 6 and 7 cm3. Using this as a starting point, we'll assume
that the mean hippocampal volume among people with a history of
concussions, μ, is also somewhere between 6 and 7 cm3, with an average
of 6.5. We'll balance this prior understanding with data on the
hippocampal volumes of n = 25 subjects, (Y₁, Y₂, …, Yₙ), using the Normal model,

Y ~ N(μ, σ²),

with

E(Y) = Mode(Y) = μ,  Var(Y) = σ²,  and  SD(Y) = σ.    (5.13)
Figure 5.8 illustrates the Normal model under a variety of mean and
standard deviation parameter values, μ and σ. No matter the parameters,
the Normal model is bell-shaped and symmetric around μ – thus as μ gets
larger, the model shifts to the right along with it. Further, σ controls the
variability of the Normal model – as σ gets larger, the model becomes
more spread out. Finally, though a Normal variable Y can technically
range from −∞ to ∞, the Normal model assigns negligible plausibility to
Y values that are more than 3 standard deviations σ from the mean μ. To
play around some more, you can plot Normal models using the
plot_normal() function from the bayesrules package.
FIGURE 5.8: Normal pdfs with varying mean and standard deviation
parameters.
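To build your own intuition, a few plot_normal() calls (the mean and sd choices here are ours) sketch out patterns like those in Figure 5.8:

library(bayesrules)

# Normal models with different means and standard deviations
plot_normal(mean = 2, sd = 0.5)
plot_normal(mean = 2, sd = 2)
plot_normal(mean = 6.5, sd = 0.4)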
Warning
Reasonable doesn't mean perfect. Though we'll later see that our
hippocampal volume data does exhibit Normal behavior, the Normal
model technically assumes that each subject's hippocampal volume
can range from −∞ to ∞. However, we're not too worried about this
incorrect assumption here. Per our earlier discussion of Figure 5.8,
the Normal model will put negligible weight on unreasonable values
of hippocampal volume. In general, not letting perfect be the enemy
of good will be a theme throughout this book (mainly because there is
no perfect).
f(y | μ) = ∏ᵢ₌₁ⁿ f(yᵢ | μ) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp[−(yᵢ − μ)² / (2σ²)].
Once we observe our sample data y, we can flip the joint pdf on its head to obtain the Normal likelihood function of μ:

L(μ | y) = f(y | μ).

Remembering that we're assuming σ is a known constant,³ we can simplify
the likelihood up to a proportionality constant by dropping the terms that
don't depend upon μ. Then for μ ∈ (−∞, ∞),

L(μ | y) ∝ ∏ᵢ₌₁ⁿ exp[−(yᵢ − μ)² / (2σ²)] = exp[−Σᵢ₌₁ⁿ (yᵢ − μ)² / (2σ²)].

Through a bit more rearranging (which we encourage you to verify if, like
us, you enjoy algebra), we can make this even easier to digest by using the
sample mean ȳ and sample size n to summarize our data values:

L(μ | y) ∝ exp[−(ȳ − μ)² / (2σ²/n)]   for μ ∈ (−∞, ∞).    (5.14)

Don't forget the whole point of this exercise! Specifying a model for the
data along with its corresponding likelihood function provides the tools
we'll need to assess the compatibility of our data y with different values of
μ (once we actually collect that data).

With the likelihood in place, let's formalize a prior model for μ, the mean
hippocampal volume among people that have a history of concussions. By
the properties of the Yᵢ | μ ~ N(μ, σ²) data model, the Normal mean μ can technically take any value on the real line. A Normal prior for μ, with mean hyperparameter θ and standard deviation hyperparameter τ, shares this support:

μ ~ N(θ, τ²).    (5.15)

_________________________
3 You'll have a chance to relax this silly-ish assumption that we know σ but don't know μ in
later chapters.
Not only does the Normal prior assumption that μ ∈ (−∞, ∞) match the
same assumption of the Normal data model, we'll prove below that this is
a conjugate prior. You might anticipate this result from the fact that the
likelihood function L(μ | y) (5.14) and prior pdf f(μ) (5.15) are both
proportional to

exp[−(μ − ■)² / (2■²)]

with different ■.
Using our understanding of a Normal model, we can now tune the prior
hyperparameters θ and τ to reflect our prior understanding and uncertainty
about the average hippocampal volume among people that have a history
of concussions, μ. Based on our rigorous Wikipedia research that
hippocampal volumes tend to be between 6 and 7 cm³, we'll set the
Normal prior mean θ to the midpoint, 6.5. Further, we'll set the Normal
prior standard deviation to τ = 0.4. In other words, by (5.13), we think
there's a 95% chance that μ is somewhere between 5.7 and 7.3 cm³
(6.5 ± 2*0.4). This range is wider, and hence more conservative, than the 6-to-7 cm³ range cited for the general population.
FIGURE 5.9: A Normal prior model for μ, with mean 6.5 and standard
deviation 0.4.
Yᵢ | μ ~ N(μ, σ²), independently for each i, with known σ
μ ~ N(θ, τ²).

Upon observing data y = (y₁, y₂, …, yₙ) with mean ȳ, the posterior model of μ is

μ | y ~ N( θ σ²/(nτ² + σ²) + ȳ nτ²/(nτ² + σ²),  τ²σ²/(nτ² + σ²) ).    (5.16)

The posterior mean is a weighted average of the prior mean θ and the data mean ȳ, where the weights depend on the prior
variability τ and variability in the data σ. Both are impacted by sample size
n. First, as n increases, the posterior mean places less weight on the prior
mean and more weight on sample mean ȳ:

σ²/(nτ² + σ²) → 0   and   nτ²/(nτ² + σ²) → 1.

Similarly, as n increases, the posterior variance τ²σ²/(nτ² + σ²) → 0.
That is, the more and more data we have, our posterior certainty about μ increases.

Let's apply and examine this result in our analysis of μ, the average
hippocampal volume among people that have a history of concussions.
We've already built our prior model of μ, μ ~ N(6.5, 0.4²). Next, consider
some data. The football data in bayesrules, a subset of the
FootballBrain data in the Lock5Data package (Lock et al., 2016),
includes results for a cross-sectional study of hippocampal volumes
among 75 subjects (Singh et al., 2014): 25 collegiate football players with
a history of concussions (fb_concuss), 25 collegiate football players
that do not have a history of concussions (fb_no_concuss), and 25
control subjects. For our analysis, we'll focus on the n = 25 subjects with
a history of concussions (fb_concuss):
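A sketch of this data import and filtering step follows; it assumes the football data stores the study group in a group variable and hippocampal volume in a volume variable:

library(bayesrules)
library(dplyr)

# Load the data and keep the 25 subjects with a history of concussions
data(football)
football_concuss <- football %>%
  filter(group == "fb_concuss")

# Sample size and mean hippocampal volume
football_concuss %>%
  summarize(n = n(), mean_volume = mean(volume))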
These subjects have an average hippocampal volume of ȳ = 5.735 cm³.
Plugging this information from the data (n = 25, ȳ = 5.735, and σ = 0.5)
into (5.14) defines the Normal likelihood function of μ:

L(μ | y) ∝ exp[−(5.735 − μ)² / (2 ⋅ 0.5²/25)]   for μ ∈ (−∞, ∞).
We plot this likelihood function using plot_normal_likelihood(),
providing our observed volume data and data standard deviation σ = 0.5
(Figure 5.11). This likelihood illustrates the compatibility of our observed
hippocampal data with different μ values. To this end, the hippocampal
patterns observed in our data would most likely have arisen if the mean
hippocampal volume across all people with a history of concussions, μ,
were between 5.3 and 6.1 cm3. Further, we're most likely to have observed
a mean volume of y = 5.735 among our 25 sample subjects if the
underlying population mean μ were also 5.735.
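Assuming the filtered football_concuss data from above, the likelihood plot can be produced with a call along these lines:

# Likelihood of mu given the 25 observed volumes, assuming a known sigma = 0.5
plot_normal_likelihood(y = football_concuss$volume, sigma = 0.5)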
We now have all necessary pieces to plug into (5.16), and hence to specify
the posterior model of μ:

our Normal prior model of μ had mean θ = 6.5 and standard deviation
τ = 0.4;
our n = 25 sample subjects had a mean hippocampal volume of ȳ = 5.735, with an assumed known standard deviation of σ = 0.5.

Thus,

μ | y ~ N( 6.5 ⋅ 0.5²/(25⋅0.4² + 0.5²) + 5.735 ⋅ 25⋅0.4²/(25⋅0.4² + 0.5²),  0.4²⋅0.5²/(25⋅0.4² + 0.5²) ).

Further simplified,

μ | y ~ N(5.78, 0.097²)

where the posterior mean places roughly 94% of its weight on the data
mean (ȳ = 5.735) and only 6% of its weight on the prior mean
(E(μ) = 6.5):

E(μ | y) = 6.5 ⋅ 0.0588 + 5.735 ⋅ 0.9412 = 5.78.
Bringing all of these pieces together, we plot and summarize our Normal-
Normal analysis of μ using plot_normal_normal() and
summarize_normal_normal() in the bayesrules package. Though a
compromise between the prior and data, our posterior understanding of μ
is more heavily influenced by the latter. In light of our data, we are much
more certain about the mean hippocampal volume among people with a
history of concussions, and believe that this figure is somewhere in the
range from 5.586 to 5.974 cm³ (5.78 ± 2*0.097).
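Sketches of these calls, plugging in our prior hyperparameters and data summaries, look roughly like this:

library(bayesrules)

# Prior N(6.5, 0.4^2); data: n = 25, y-bar = 5.735, known sigma = 0.5
plot_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5, y_bar = 5.735, n = 25)
summarize_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5, y_bar = 5.735, n = 25)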
Next, we can expand the squares in the exponents and sweep under the rug
of proportionality both the θ² in the numerator of the first exponent and
the ȳ² in the numerator of the second exponent:

f(μ | y) ∝ exp[ ((−μ² + 2μθ)σ² + (−μ² + 2μȳ)nτ²) / (2τ²σ²) ].

Combining like terms and then dividing both parts of the exponent by (nτ² + σ²),

f(μ | y) ∝ exp[ (−μ²(nτ² + σ²) + 2μ(θσ² + ȳnτ²)) / (2τ²σ²) ]
         ∝ exp[ (−μ² + 2μ(θσ² + ȳnτ²)/(nτ² + σ²)) / (2τ²σ²/(nτ² + σ²)) ].

This may seem like too much to deal with, but if you look closely, you can
see that we can bring back some constants which do not depend upon μ to
complete the square in the numerator:

f(μ | y) ∝ exp[ −(μ − (θσ² + ȳnτ²)/(nτ² + σ²))² / (2τ²σ²/(nτ² + σ²)) ].

This may still seem messy, but once we complete the square, we actually
have the kernel of a Normal pdf for μ, exp[−(μ − ■)² / (2■²)]. By identifying the
missing pieces ■, we can thus conclude that

μ | y ~ N( (θσ² + ȳnτ²)/(nτ² + σ²),  τ²σ²/(nτ² + σ²) ).

There is something reassuring about the reality check that a simulation can provide. Yet we
are at a crossroads. The Gamma-Poisson and Normal-Normal models
we've studied here are tough to simulate using the techniques we've
learned thus far. Letting θ represent some parameter of interest, recall the
steps we've used for past simulations:
1. Simulate, say, 10000 values of θ from the prior model.
2. Simulate a set of sample data Y from each simulated θ value.
3. Filter out only those of the 10000 simulated sets of (θ, Y ) for
which the simulated Y data matches the data we actually observed.
4. Use the remaining θ values to approximate the posterior of θ.
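To see why this strategy breaks down here, consider a minimal sketch of our own for a Normal-Normal setting with a single continuous data point (the same toy model that appears in the exercises below): because simulated continuous data essentially never exactly matches the observed value, Step 3 filters out everything.

# Y | mu ~ N(mu, 1^2), mu ~ N(0, 1^2), observed data point Y = 1.1
set.seed(84735)
sim <- data.frame(mu = rnorm(10000, mean = 0, sd = 1))
sim$y <- rnorm(10000, mean = sim$mu, sd = 1)

# Step 3 fails: no simulated continuous y value exactly equals 1.1
sum(sim$y == 1.1)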
5.7 Exercises
5.7.1 Practice: Gamma-Poisson
Exercise 5.1 (Tuning a Gamma prior). For each situation below, tune an
appropriate Gamma(s,r) prior model for λ.
a)
b)
c)
d)
(y₁, y₂, y₃) = (3, 7, 19)
y₁ = 12
Exercise 5.5 (Text messages). Let random variable λ represent the rate of
text messages people receive in an hour. At first, you believe that the
typical number of messages per hour is 5 with a standard deviation of 0.25
messages.
Exercise 5.6 (Text messages with data). Continuing with Exercise 5.5, you
collect data from six friends. They received 7, 3, 8, 9, 10, 12 text messages
in the previous hour.
Exercise 5.7 (World Cup). Let λ be the average number of goals scored in
a Women's World Cup game. We'll analyze λ by the following Gamma-
Poisson model where data Yi is the observed number of goals scored in a
sample of World Cup games:
Yᵢ | λ ~ Pois(λ), independently for each i
λ ~ Gamma(1, 0.25)
a)
b)
c)
d)
(y₁, y₂, y₃) = (−4.3, 0.7, −19.4)
Exercise 5.8 (Normal likelihood functions). In each situation below, we
Exercise 5.9 (Investing in stock). You just bought stock in FancyTech. Let
random variable μ be the average dollar amount that your FancyTech stock
goes up or down in a one-day period. At first, you believe that μ is 7.2
dollars with a standard deviation of 2.6 dollars.
Exercise 5.10 (Investing in stock with data). Continuing with Exercise 5.9,
it's reasonable to assume that the daily changes in FancyTech stock value
are Normally distributed around an unknown mean of μ with a known
standard deviation of σ = 2 dollars. On a random sample of 4 days, you
observe changes in stock value of -0.7, 1.2, 4.5, and -4 dollars.
a) Prof. Abebe conducts the final exam and observes that his 32
students scored an average of 86 points. Calculate the posterior
mean and variance of μ using the data from Prof. Abebe's class.
b) Prof. Morales conducts the final exam and observes that her 32
students scored an average of 82 points. Calculate the posterior
mean and variance of μ using the data from Prof. Morales' class.
c) Next, use Prof. Abebe and Prof. Morales' combined exams to
calculate the posterior mean and variance of μ.
a) Tune and plot a Normal prior for μ that reflects your friend's
understanding.
b) The weather_perth data in the bayesrules package includes
1000 daily observations of 3 p.m. temperatures in Perth
(temp3pm). Plot this data and discuss whether it's reasonable to
assume a Normal model for the temperature data.
c) Identify the posterior model of μ and verify your answer using
summarize_normal_normal().
d) Plot the prior pdf, likelihood function, and posterior pdf of μ.
Describe the evolution in your understanding of μ from the prior
to the posterior.
a) Your friend Alex has read Chapter 4 of this book, but not Chapter
5. Explain to Alex why it's difficult to simulate a Normal-Normal
posterior using the simulation methods we have learned thus far.
b) To prove your point, try (and fail) to simulate the posterior of μ
for the following model upon observing a single data point
Y = 1.1:
Y | μ ~ N(μ, 1²)
μ ~ N(0, 1²)
d) f(θ) ∝ e^(−θ²) for θ ∈ (−∞, ∞)

Exercise 5.16 (Which model: Back for more!). Below are kernels for
Normal, Poisson, Gamma, Beta, and Binomial models. Identify the
appropriate model with specific parameter values.

a) f(θ) ∝ e^(−2θ) θ^15 for θ > 0
b) f(θ) ∝ e^(−(θ−12)²/…)
Y |θ ~Geometric(θ)
θ ~Beta(α, β)
DOI: 10.1201/9780429288340-6
Welcome to Unit 2!
Unit 2 serves as a critical bridge to applying the fundamental
concepts from Unit 1 in the more sophisticated model settings of Unit
3 and beyond. In Unit 1, we learned to think like Bayesians and to
build some fundamental Bayesian models in this spirit. Further, by
cranking these models through Bayes' Rule, we were able to
mathematically specify the corresponding posteriors. Those days are
over. Though merely hypothetical for now, some day (starting in
Chapter 9) the models we'll be interested in analyzing will get too
complicated to mathematically specify. Never fear – data analysts are
not known to throw up their hands in the face of the unknown. When
we can't know or specify something, we approximate it. In Unit 2
we'll explore Markov chain Monte Carlo simulation techniques for
approximating otherwise out-of-reach posterior models.
No matter whether we're able to specify or must approximate a
posterior model, we must then be able to understand and apply the
results. To this end, we learned how to describe our posterior
understanding using model features such as central tendency and
variability in Unit 1. Yet in practice, we typically want to perform a
deeper posterior analysis. This process of asking “what does it all
mean?!” revolves around three major elements that we'll explore in
Unit 2: posterior estimation, hypothesis testing, and prediction.
Learning requires the occasional leap. You've already taken a few. From
Chapter 2 to Chapter 3, you took the leap from using simple discrete priors
to using continuous Beta priors for a proportion π. From Chapter 3 to
Chapter 5, you took the leap from engineering the Beta-Binomial model to
a family of Bayesian models that can be applied in a wider variety of
settings. With each leap, your Bayesian toolkit became more flexible and
powerful, but at the cost of the underlying math becoming a bit more
complicated. As you continue to generalize your Bayesian methods in
more sophisticated settings, this complexity will continue to grow.
Consider Michelle's run for president. In Chapter 3 you built a model of
Michelle's support in Minnesota based on polling data in that state. You
could continue to refine your analysis of Michelle's chances of becoming
president. To begin, you could model Michelle's support in each of the
fifty states and Washington, D.C. Better yet, this model might incorporate
data on past state-level voting trends and demographics. The trade-off is
that increasing your model's flexibility also makes it more complicated.
Whereas your Minnesota-only model depended upon only one parameter
π, Michelle's level of support in that state, the new model depends upon
dozens of parameters. Here, let θ = (θ₁, θ₂, …, θₖ) denote a generic set of
k parameters upon which a model depends. Specifying the posterior of θ then requires evaluating a k-dimensional integral in the normalizing constant,

∫∫⋯∫ f(θ) L(θ | y) dθₖ ⋯ dθ₂ dθ₁.

When this integral, and hence the posterior, is
prohibitively difficult to specify, we're not out of luck. We must simply
change our strategy: instead of specifying the posterior, we can
approximate the posterior via simulation. We'll explore two simulation
techniques: grid approximation and Markov chain Monte Carlo (MCMC).
When done well, both techniques produce a sample of N θ values,

{θ^(1), θ^(2), …, θ^(N)},

with properties that reflect those of the posterior model for θ. In Chapter 6,
we'll explore these simulation techniques in the familiar Beta-Binomial
and Gamma-Poisson model contexts. Though these models don't require
simulation (we can and did specify their posteriors in Unit 1), exploring
simulation in these familiar settings will help us build intuition for the
process and give us peace of mind that it actually works when we
eventually do need it.

Goals

Implement and examine the limitations of using grid approximation to simulate a posterior model.

But first, load some packages that we'll be utilizing throughout the
remainder of this chapter (and book). Among these, rstan is quite unique,
thus be sure to revisit the Preface for directions on installing this package.
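Though the exact list isn't shown here, the setup likely resembles the following (package names as referenced throughout the chapter):

library(bayesrules)
library(tidyverse)
library(rstan)
library(bayesplot)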
6.1 Grid approximation
Imagine there's an image that you can't view in its entirety – you only
observe snippets along a grid that sweeps from left to right across the
image. The finer the grid, the clearer the image. And if the grid is fine
enough, the result is an excellent approximation of the complete image:
This is the big picture idea behind Bayesian grid approximation, in which
case the target “image” is posterior pdf f (θ|y). We needn't observe f (θ|y)
at every possible θ to get a sense of its structure. Rather, we can evaluate
f(θ|y) at a finite, discrete grid of possible θ values. Subsequently, we can
take random samples from this discretized pdf to approximate the full
posterior pdf f (θ|y). We formalize these ideas here and apply them below.
Grid approximation
Grid approximation produces a sample of N independent θ values,

{θ^(1), θ^(2), …, θ^(N)},

from a discretized approximation of posterior pdf f(θ|y).
Y |π ~Bin(10, π)
π ~Beta(2, 2).
(6.1)
We're now going to ask you to forget that we were able to specify this
posterior. Instead, we'll try to approximate the posterior using grid
approximation. As Step 1, we need to split the continuum of possible π
values on 0 to 1 into a finite grid. We'll start with a coarse grid of only 6 π
values, π ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}:
In Step 2 we use dbeta() and dbinom(), respectively, to evaluate the
Beta(2, 2) prior pdf and Bin(10, π) likelihood function with Y = 9 at
each π in pi_grid:
You might anticipate what will happen when we use this approximation to
simulate samples from the posterior in Step 4. Each sample draw has only
6 possible outcomes and is highly likely to be 0.6 or 0.8. Let's try it: use
sample_n() to take a sample of size = 10000 values from the 6-
length grid_data, with replacement, and using the discretized
posterior probabilities as sample weights.
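Pulling Steps 1 through 4 together, a sketch of this 6-value grid approximation (the object names here are ours) looks like this:

library(dplyr)

# Step 1: Define a coarse grid of 6 pi values
grid_data <- data.frame(pi_grid = seq(from = 0, to = 1, length = 6))

# Step 2: Evaluate the Beta(2,2) prior and Bin(10, pi) likelihood (with Y = 9) at each grid value
grid_data <- grid_data %>%
  mutate(prior = dbeta(pi_grid, 2, 2),
         likelihood = dbinom(9, size = 10, prob = pi_grid))

# Step 3: Approximate the posterior, normalized to sum to 1 across the grid
grid_data <- grid_data %>%
  mutate(unnormalized = likelihood * prior,
         posterior = unnormalized / sum(unnormalized))

# Step 4: Sample 10,000 pi values from the discretized posterior
set.seed(84735)
post_sample <- sample_n(grid_data, size = 10000, weight = posterior, replace = TRUE)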
As expected, most of our 10,000 sample values of π were 0.6 or 0.8, few
were 0.4, and none were below 0.4 or above 0.8:
Remember the rainbow image and how we got a more complete picture by
viewing snippets along a finer grid? Similarly, instead of chopping up the
0-to-1 continuum of possible π values into a grid of only 6 values, let's try
a more reasonable grid of 101 values: π ∈ {0, 0.01, 0.02, …, 0.99, 1}. The
first 3 grid approximation steps using this refined grid are performed
below:
Yᵢ | λ ~ Pois(λ), independently for each i
λ ~ Gamma(3, 1).    (6.2)
Quiz Yourself!
Fill in the code below to construct a grid approximation of the
Gamma-Poisson posterior corresponding to (6.2). In doing so, use a
grid of 501 λ values between 0 and 15.
Check out the complete code below. Much of this is the same as it was for
the Beta-Binomial model. There are two key differences. First, we use
dgamma() and dpois() instead of dbeta() and dbinom() to
evaluate the prior pdf and likelihood function of λ. Second, since we have
a sample of two data points (Y₁, Y₂) = (2, 8), the Poisson likelihood is the product of the Poisson pmfs evaluated at each observed count.
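A sketch of the corresponding Gamma-Poisson grid approximation, under the assumptions just described (object names are ours), is below:

library(dplyr)

# Step 1: Define a grid of 501 lambda values between 0 and 15
grid_data <- data.frame(lambda_grid = seq(from = 0, to = 15, length = 501))

# Step 2: Evaluate the Gamma(3,1) prior and the joint Poisson likelihood of (Y1, Y2) = (2, 8)
grid_data <- grid_data %>%
  mutate(prior = dgamma(lambda_grid, 3, 1),
         likelihood = dpois(2, lambda_grid) * dpois(8, lambda_grid))

# Step 3: Approximate the posterior
grid_data <- grid_data %>%
  mutate(unnormalized = likelihood * prior,
         posterior = unnormalized / sum(unnormalized))

# Step 4: Sample from the discretized posterior
set.seed(84735)
post_sample <- sample_n(grid_data, size = 10000, weight = posterior, replace = TRUE)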
6.1.3 Limitations
Some say that all good things must come to an end. Though we don't agree
with this saying in general, it happens to be true in the case of grid
approximation. Limitations in the grid approximation method quickly
present themselves as our models get more complicated. For example, by
the end of Unit 4 we'll be working with models that have lots of model
parameters, θ = (θ₁, θ₂, …, θₖ). In such settings, grid approximation breaks down.
When we chop both the x- and y-axes into grids, there are bigger gaps in
the image approximation. To achieve a more refined approximation, we
need a finer grid than when we only chopped the x-axis into a grid.
Analogously, when using grid approximation to simulate multivariate
posteriors, we need to divide the multidimensional sample space of
θ = (θ₁, θ₂, …, θₖ) into a very, very fine grid in order to prevent big gaps
in our approximation. In practice, this might not be feasible. When
evaluated on finer and finer grids, the grid approximation method becomes
computationally expensive. You can't merely start the simulation, get up
for a cup of coffee, come back, and poof the simulation is done. You might
have to start the simulation and go off to a month-long meditation retreat
(to practice the patience you'll need for grid approximation). MCMC
methods provide a more flexible alternative.
constructing this chain, θ^(2) is drawn from some model that depends upon
θ^(1), θ^(3) is drawn from some model that depends upon θ^(2), θ^(4) is drawn
from some model that depends upon θ^(3), and so on and so on the chain
goes. In general, the next chain value θ^(i+1) is drawn from a model that depends upon the current value θ^(i), with pdf

f(θ^(i+1) | θ^(i), y).

There are a couple of things to note about this dependence among chain
values. First, by the Markov property, θ^(i+1) depends upon the preceding
chain values only through the most recent value θ^(i):

f(θ^(i+1) | θ^(1), θ^(2), …, θ^(i), y) = f(θ^(i+1) | θ^(i), y).

For example, since θ^(i) depends on θ^(i−1), it's also
true that θ^(i+1) depends on θ^(i−1). Yet once we know the current value θ^(i), the earlier value θ^(i−1) is
of no consequence to θ^(i+1). Second, each chain value is drawn from a
different model, and none of these models are the target posterior. That is,
the pdf from which a Markov chain value is simulated is not equivalent to
the posterior pdf:

f(θ^(i+1) | θ^(i), y) ≠ f(θ^(i+1) | y).
We'll get familiar with the key elements of an rstan simulation. Starting in Chapter 9, you will utilize
the complementary rstanarm package, which provides shortcuts for
simulating a broad framework of Bayesian applied regression models
(arm).
Warning
Since stan() has to do the double duty of identifying an appropriate
MCMC algorithm for simulating the given model, and then applying
this algorithm to our data, the simulation will be quite slow for each
new model.
Note that stan() requires two types of arguments. First, we must specify
the model information by:
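Though the details of those arguments aren't reproduced here, a sketch of the two steps for the Beta-Binomial model (6.1) with observed Y = 9 might look as follows; the exact formatting of the Stan code is ours:

library(rstan)

# STEP 1: DEFINE the Beta-Binomial model (6.1) in Stan syntax
bb_model <- "
  data {
    int<lower = 0, upper = 10> Y;
  }
  parameters {
    real<lower = 0, upper = 1> pi;
  }
  model {
    Y ~ binomial(10, pi);
    pi ~ beta(2, 2);
  }
"

# STEP 2: SIMULATE the posterior with 4 parallel chains, keeping 5,000 draws per chain
bb_sim <- stan(model_code = bb_model, data = list(Y = 9),
               chains = 4, iter = 5000 * 2, seed = 84735)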
Burn-in
If you've ever made a batch of pancakes or crêpes, you know that the
first pancake is always the worst – the pan isn't yet at the perfect
temperature, you haven't yet figured out how much batter to use, and
you need more time to practice your flipping technique. MCMC
chains are similar. Without direct knowledge of the posterior it's
trying to simulate, the Markov chain might start out sampling
unreasonable values of our parameter of interest, say π. Eventually
though, it learns and starts producing values that mimic a random
sample from the posterior. And just as we might need to toss out the
first pancake, we might want to toss the Markov chain values
produced during this learning period – keeping them in our sample
might lead to a poor posterior approximation. As such, "burn-in" is
the practice of discarding the first portion of Markov chain values.
The first four π values for each of the four parallel chains are extracted and
shown here:
It's important to remember that these Markov chain values are NOT a
random sample from the posterior and are NOT independent. Rather, each
of the four parallel chains forms a dependent 5,000-length Markov chain
of π values, (π^(1), π^(2), …, π^(5000)). For example, within chain:1, the second value π^(2)
depends upon π^(1). In this case the chain moves from 0.9403 to 0.9301.
Similarly, the chain moves from 0.9301 to 0.9012, from 0.9012 to 0.9224,
and so on. Thus, the chain traverses the sample space or range of posterior
plausible π values. A Markov chain trace plot illustrates this traversal,
plotting the π value (y-axis) in each iteration (x-axis). We use the
mcmc_trace() function in the bayesplot package (Gabry et al., 2019) to
construct the trace plots of all four Markov chains:
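Assuming the simulation object is named bb_sim as above, the call is along these lines:

library(bayesplot)

# Trace plots of the four parallel Markov chains of pi
mcmc_trace(bb_sim, pars = "pi", size = 0.1)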
Figure 6.9 zooms in on the trace plot of chain 1. In the first 20 iterations
(left), the chain largely explores values between 0.57 and 0.94. After 200
iterations (right), the Markov chain has started to explore new territory,
traversing a slightly wider range of values between 0.49 and 0.96. Both
trace plots also exhibit evidence of the slight dependence among the
Markov chain values, places where the chain tends to float up for multiple
iterations and then down for multiple iterations.
FIGURE 6.9: A trace plot of the first 20 iterations (left) and 200
iterations (right) of the first Beta-Binomial Markov chain.
Marking the sequence of the chain values, the trace plots in Figure 6.9
illuminate the Markov chains' longitudinal behavior. We also want to
examine the distribution of the values these chains visit along their
journey, ignoring the order of these visits. The histogram and density plot
in Figure 6.10 provide a snapshot of this distribution for the combined
20,000 chain values, 5,000 from each of the four separate chains. Notice
the important punchline here: the distribution of the Markov chain values
is an excellent approximation of the target Beta(11, 3) posterior model of
π (superimposed in black). That's a relief – that was the whole point.
Warning
Like some other plotting functions in the bayesplot package, the
mcmc_hist() and mcmc_dens() functions don't automatically
include axis labels and scales. As we're new to these plots, we add
labels and scales here using yaxis_text(TRUE) and ylab(). As
we become more and more comfortable with these plots, we'll fall
back on the defaults.
FIGURE 6.10: A histogram (left) and density plot (right) of the
combined 20,000 Markov chain π values from the 4 parallel chains. The
target pdf is superimposed in black.
Quiz Yourself!
Fill in the code below to construct an MCMC approximation of the
Gamma-Poisson posterior corresponding to (6.2). In doing so, run
four parallel chains for 10,000 iterations each (resulting in a sample
size of 5,000 per chain).
In defining the model in step 1, take note of the three aspects upon which it
depends.
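One possible answer to the quiz, sketched under the same conventions as the Beta-Binomial simulation above (object names are ours):

library(rstan)

# STEP 1: DEFINE the Gamma-Poisson model (6.2) with two observed counts
gp_model <- "
  data {
    int<lower = 0> Y[2];
  }
  parameters {
    real<lower = 0> lambda;
  }
  model {
    Y ~ poisson(lambda);
    lambda ~ gamma(3, 1);
  }
"

# STEP 2: SIMULATE the posterior given (Y1, Y2) = (2, 8)
gp_sim <- stan(model_code = gp_model, data = list(Y = c(2, 8)),
               chains = 4, iter = 5000 * 2, seed = 84735)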
Answering these questions is both an art and science. There are no one-
size-fits-all magic formulas that provide definitive answers here. Rather,
it's through experience that you get a feel for what "good" Markov chains
look like and what you can do to fix a "bad" Markov chain. In this section,
we'll focus on a couple visual diagnostic tools that will get us started:
trace plots and parallel chains. These can be followed up with and
supplemented by a few numerical diagnostics: effective sample size,
autocorrelation, and R-hat (R̂). Utilizing these diagnostics should be done
holistically. Since no single visual or numerical diagnostic is one-size-fits-
all, they provide a fuller picture of Markov chain quality when considered
together. Further, other excellent diagnostics exist. We focus here on those
that are common and easy to implement in the software packages we'll be
using.
First, consider the trace plots in Figure 6.12 (left). The downward trend in
Chain A indicates that it has not yet stabilized after 5,000 iterations – it
has not yet “found” or does not yet know how to explore the range of
posterior plausible π values. The downward trend also hints at strong
correlation among the chain values – they don't look like independent
noise. All of this to say that Chain A is mixing slowly. This is bad. Though
Markov chains are inherently dependent, the more they behave like fast
mixing (noisy) independent samples, the smaller the error in the resulting
posterior approximation (roughly speaking). Chain B exhibits a different
problem. As evidenced by the two completely flat lines in the trace plot, it
tends to get stuck when it visits smaller values of π.
The density plots in Figure 6.12 (right) confirm that both of these goofy-
looking chains result in a serious issue: they produce poor approximations
of the Beta(11,3) posterior (superimposed in black), and thus misleading
posterior conclusions. Consider Chain A. Since it's mixing so slowly, it
has only explored π values in the rough range from 0.6 to 0.9 in its first
5,000 iterations. As a result, its posterior approximation overestimates the
plausibility of π values in this range while completely underestimating the
plausibility of values outside this range. Next, consider Chain B. In getting
stuck, Chain B over-samples some values in the left tail of the posterior π
values. This phenomenon produces the erroneous spikes in the posterior
approximation.
In practice, we run rstan simulations when we can't specify, and thus
want to approximate the posterior. This means that we won't have the
privilege of being able to compare our simulation results to the “real”
posterior. This is why diagnostics are so important. If we see bad trace
plots like those in Figure 6.12, there are some immediate steps we can
take:
1. Check the model. Are the assumed prior and data models
appropriate?
2. Run the chain for more iterations. Some undesirable short-term
chain trends might iron out in the long term.
We'll get practice with the more nuanced Step 1 throughout the book. Step
2 is easy, though it requires extra computation time.
FIGURE 6.13: Density plot of the four parallel Markov chains for π.
The trace plots and corresponding density plots of the short Markov chains
are shown below. Though the chains' trace plots exhibit similar random
behavior, their corresponding density plots differ, hence they produce
discrepant posterior approximations. In the face of such instability and
confusion about which of these four approximations is the most accurate,
it would be a mistake to stop our simulation after only 100 iterations.
FIGURE 6.14: Trace plots and density plots of the four short parallel
Markov chains for π, each of length 50.
the better, yet it's typically true that the accuracy of a Markov chain
approximation is only as good as that of a smaller independent
sample. That is, it's typically true that N_eff < N, thus the effective
sample size ratio, N_eff / N, is less than 1.
This chain of dependencies also means that each chain value depends in
some degree on all previous chain values. For example, since π^(i) depends on π^(i−1), which in turn depends on π^(i−2), π^(i) also depends on
π^(i−2). Yet this dependence, or autocorrelation, fades. It's like Tobler's first
law of geography: everything is related to everything else, but near things
are more related than distant things. Thus, it's typically the case that a
chain value π^(i) is more strongly related to the previous value (π^(i−1)) than
to values further back in the chain.
Autocorrelation
Lag 1 autocorrelation measures the correlation between pairs of
Markov chain values that are one "step" apart (e.g., π^(i) and π^(i−1)).
Lag 2 autocorrelation measures the correlation between pairs of
Markov chain values that are two "steps" apart (e.g., π^(i) and π^(i−2)).
And so on.
Let's apply these concepts to our bb_sim analysis. Check out the trace
plot and autocorrelation plot of our simulation results in Figure 6.15. (For
simplicity, we show the results for only one of our four parallel chains.)
FIGURE 6.15: A trace plot (left) and autocorrelation plot (right) for a
single Markov chain from the bb_sim analysis.
Again, notice that there are no obvious patterns in the trace plot. This
provides one visual clue that, though the chain values are inherently
dependent, this dependence is relatively weak and limited to small lags or
values that are just a few steps apart. This observation is supported by the
autocorrelation plot which marks the autocorrelation (y-axis) at lags 0
through 20 (x-axis). The lag 0 autocorrelation is naturally 1 – it measures
the correlation between a Markov chain value and itself. From there, the
lag 1 autocorrelation is roughly 0.5, indicating moderate correlation
among chain values that are only 1 step apart. But then the autocorrelation
quickly drops off and is effectively 0 by lag 5. That is, there's very little
correlation between Markov chain values that are more than a few steps
apart. This is all good news. It's more confirmation that our Markov chain
is mixing quickly, i.e., quickly moving around the range of posterior
plausible π values, and thus at least mimicking an independent sample.
Presuming you've never seen an autocorrelation plot before, we imagine
that it's not very obvious that the plot in Figure 6.15 is a “good” one. For
contrast, consider the results for an unhealthy Markov chain (Figure 6.16).
The trace plot exhibits strong trends, and hence autocorrelation, in the
Markov chain values. This observation is echoed and further formalized
by the autocorrelation plot. The slow decrease in the autocorrelation curve
indicates that the dependence between chain values does not quickly fade
away. In fact, there's a roughly 0.9 correlation between Markov chain
values that are a full 20 steps apart! Since its chain values are so strongly
tied to the previous values, this chain is slow mixing – it would take a long
time for it to adequately explore the full range of the posterior. Thus, just
as with the slow mixing Chain A in Figure 6.12, we should be wary about
using this chain to approximate the posterior. Let's tie these ideas together.
FIGURE 6.16: A trace plot (left) and autocorrelation plot (right) for a
slow mixing Markov chain of π.
values: {π^(10), π^(20), π^(30), …, π^(5000)}. By discarding the draws in
between, we remove the strong correlations at low lags. For example, π^(20)
is less correlated with the previous value in the thinned chain (π^(10)) than
it was with the previous value in the original chain (π^(19)).
FIGURE 6.17: A trace plot (left) and autocorrelation plot (right) for a
single Markov chain from the bb_sim analysis, thinned to every tenth
value.
We similarly thin our slow mixing chain down to every tenth value (Figure
6.18). The resulting chain still exhibits slow mixing trends in the trace
plot, but the autocorrelation drops more quickly than the pre-thinned
chain. This is good, but is it worth losing 90% of our original sample
values? We're not so sure.
FIGURE 6.18: A trace plot (left) and autocorrelation plot (right) for a
slow mixing Markov chain of π, thinned to every tenth value.
Warning
There is a careful line to walk when deciding whether or not to thin a
Markov chain. The benefits of reduced autocorrelation don't
necessarily outweigh the loss of precious chain values. That is, 5000
Markov chain values with stronger autocorrelation might produce a
better posterior approximation than 500 chain values with weaker
autocorrelation. The effectiveness of thinning also depends in part on
the algorithm used to construct the Markov chain. For example, the
rstan and rstanarm packages used throughout this book employ an
efficient Hamiltonian Monte Carlo algorithm. As such, in the current
stan() help file, the package authors advise against thinning unless
your simulation hogs up too much memory on your machine.
Quiz Yourself!
Figure 6.19 provides simulation results for bb_sim (top row) along
with a bad hypothetical alternative (bottom row). Based on the
patterns in these plots, what do you think is a marker of a “good”
Markov chain simulation?
a. The variability in π values within any individual chain is less
than the variability in π values across all chains combined.
b. The variability in π values within any individual chain is
comparable to the variability in π values across all chains
combined.
_________________________
1 Answer : b
To answer this quiz, let's dig into Figure 6.19. Based on what we learned in
Section 6.3.2, we can see that bb_sim is superior to the alternative – its
parallel chains exhibit similar features and produce similar posterior
approximations. In particular, the variability in π values is nearly identical
within each chain (top middle plot). As a consequence, the variability in π
values across all chains combined (top right plot) is similar to that of the
individual chains. In contrast, notice that the four parallel chains in the
alternative simulation produce conflicting posterior approximations
(bottom middle plot), and hence an unstable and poor posterior
approximation when we combine these chains (bottom right plot). As a
consequence, the range and variability in π values across all chains
combined are much larger than the range and variability in π values within
any individual chain.
Bringing this analysis together, we've intuited the importance of the
relationship between the variability in values across all chains combined
and within the individual parallel chains. Speci cally:
R-hat
Consider a Markov chain simulation of parameter θ which utilizes
four parallel chains. Let Varcombined denote the variability in θ across
all four chains combined and Varwithin denote the typical variability
within any individual chain. The R-hat metric calculates the ratio
between these two sources of variability:

R-hat ≈ √(Var_combined / Var_within).
To calculate the R-hat ratio for our simulation, we can apply the rhat()
function from the bayesplot package:
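For our Beta-Binomial simulation, the call is roughly:

library(bayesplot)

# R-hat compares the variability across chains to that within chains; values near 1 are good
rhat(bb_sim, pars = "pi")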
Reflecting our observation that the variability across and within our four
parallel chains is comparable, bb_sim has an R-hat value that's
effectively equal to 1. In contrast, the bad hypothetical simulation
exhibited in Figure 6.19 has an R-hat value of 5.35. That is, the variance
across all chain values combined is more than 5 times the typical variance
within each chain. This well exceeds the 1.05 red flag marker, providing
ample evidence that the hypothetical parallel chains do not produce
consistent posterior approximations, thus the simulation is unstable.
6.4 Chapter summary
As our Bayesian models get more sophisticated, their posteriors will
become too difficult, if not impossible, to specify. In Chapter 6, you
learned two simulation techniques for approximating the posterior in such
scenarios: grid approximation and Markov chain Monte Carlo. Both
techniques produce a sample of N θ values,

{θ^(1), θ^(2), …, θ^(N)}.

Though neither sample is drawn directly from the posterior pdf f(θ|y), this sample will mimic the posterior model so
long as the chain length N is big enough.
Finally, you learned some MCMC diagnostics for checking the resulting
simulation quality. In short, we can visually examine trace plots and
density plots of multiple parallel chains for stability and mixing in our
simulation:
Exercise 6.4 (MCMC simulation: thank you for being a friend). Your
friend missed class this week and they are allergic to reading textbooks (a
common affliction). Since you are a true friend, you decide to help them
out and answer their following questions:
π.
b) Repeat part a using a grid of 201 equally spaced values between 0
and 1.
Exercise 6.11 (MCMC with RStan: Step 1). Use the given information to
define the Bayesian model structure using the correct RStan syntax. You
don't need to run the code, just provide the syntax.
Exercise 6.12 (MCMC with RStan: Steps 1 and 2). Use the given
information to (1) define the Bayesian model structure, and (2) simulate
the posterior using the correct RStan syntax. You don't need to run the
code, just provide the syntax.
a) Y | π ~ Bin(20, π) and π ~ Beta(1, 1) with Y = 12.
b) Y | λ ~ Pois(λ) and λ ~ Gamma(4, 2) with Y = 3.
c)
Exercise 6.14 (MCMC with RStan: once more with feeling). Repeat
Exercise 6.13 for the Beta-Binomial model with Y |π~Bin(n, π) and
π~Beta(4, 3), where you observe Y = 4 successes in n = 12 independent
trials.
Exercise 6.16 (MCMC with RStan: Gamma-Poisson again). Repeat
exercise 6.15 using a λ~Gamma(5, 5) prior model.
DOI: 10.1201/9780429288340-7
Y | μ ~ N(μ, 0.75²)
μ ~ N(0, 1²).    (7.1)

The corresponding likelihood function L(μ|y) and prior pdf f(μ) for
y ∈ (−∞, ∞) and μ ∈ (−∞, ∞) are:

L(μ | y) = (1/√(2π⋅0.75²)) exp[−(y − μ)² / (2⋅0.75²)]   and   f(μ) = (1/√(2π)) exp[−μ²/2].    (7.2)
Think of the simulation as a tour among the posterior plausible values of μ and yourself as the tour manager. The trace plot (left) illustrates the
tour route or sequence of tour stops, μ^(i). The histogram (right) illustrates the
relative amount of time you spent in each μ region throughout the tour.

FIGURE 7.1: A trace plot (left) and histogram (right) of a 5,000 iteration
MCMC simulation of the N(4, 0.6²) posterior. The posterior pdf is
superimposed in blue.
As tour manager, it's your job to ensure that the density of tour stops in each μ
region is proportional to its posterior plausibility. That is, the chain should
spend more time touring values of μ between 2 and 6, where the Normal
posterior pdf is greatest, and less time visiting μ values less than 2 or greater
than 6, where the posterior drops off. This consideration is crucial to producing
a collection of tour stops that accurately approximate the posterior, as does the
tour here (Figure 7.1 right).
As tour manager, you can automate the tour route using the Metropolis-
Hastings algorithm. This algorithm iterates through a two-step process.
Assuming the Markov chain is at location μ^(i) = μ at iteration or "tour stop" i:

Step 1: Propose a random location, μ′, for the next tour stop.
Step 2: Decide whether to go to the proposed location (μ^(i+1) = μ′) or to
stay at the current location for another iteration (μ^(i+1) = μ).

This might seem easy. For example, if there were no constraints on your tour
plan, you could simply draw proposed tour stops μ′ from the N(4, 0.6²)
posterior pdf f(μ′ | y = 6.25) (Step 1) and then go there (Step 2). This special,
unconstrained algorithm unfolds as follows:

Step 1: Propose a location.
Draw a location μ′ from the posterior model with pdf f(μ | y).
Step 2: Go there.
The result is a nice independent sample from the posterior which, in turn,
produces an accurate posterior approximation:
FIGURE 7.2: A histogram of a Monte Carlo sample from the posterior pdf of
μ. The actual pdf is superimposed in blue.
This unnormalized pdf is drawn in Figure 7.3. Importantly, though it's not
properly scaled to integrate to 1, this unnormalized pdf preserves the shape,
central tendency, and variability of the actual posterior.
In Step 1 of the Metropolis-Hastings algorithm, we instead use a proposal model that depends only on the current
tour location. Conditioned on this current location, we propose the next location
by taking a random draw μ′ from the Uniform model which is centered at the
current location μ and ranges from μ − w to μ + w:

μ′ | μ ~ Unif(μ − w, μ + w)

with pdf

q(μ′ | μ) = 1/(2w)   for μ′ ∈ [μ − w, μ + w].
The Uniform pdf plotted in Figure 7.4 illustrates that, using this method,
proposals μ′ are equally likely to be any value between μ − w and μ + w.
FIGURE 7.4: The Uniform proposal model.
Figure 7.5 illustrates this idea in a specific scenario. Suppose we're utilizing a
Uniform half-width of w = 1 and that the Markov chain tour is at location
μ = 3. Conditioned on this current location, we'll then propose the next stop by
taking a random draw from Unif(2, 4). In Figure 7.5, this proposal model centered at
the current chain location of 3 (black curve) is drawn against the unnormalized
The whole idea behind Step 1 might seem goofy. How can proposals drawn
from a Uniform model produce a decent approximation of the Normal posterior
model?!? Well, they're only proposals. As with any other proposal in life, they
can thankfully be rejected or accepted. Mainly, if a proposed location μ′ is
“bad,” we can reject it. When we do, the chain sticks around at its current
location μ for at least another iteration.
Step 2 of the Metropolis-Hastings algorithm provides a formal process for
deciding whether to accept or reject a proposal. Let's first check in with our
intuition about how this process should work. Revisiting Figure 7.5, suppose
that our random Unif(2, 4) draw proposes that the chain move from its current
location of 3 to 3.8. Does this proposal seem desirable to you? Well, sure.
Notice that the (unnormalized) posterior plausibility of 3.8 is greater than that
of 3. Thus, we want our Markov chain tour to spend more time exploring values
of μ around 3.8 than around 3. Accepting the proposal gives us the chance to do
so. In contrast, if our random Unif(2, 4) draw proposed that the chain move
from 3 to 2.1, a location with very low posterior plausibility, we might be more
hesitant. Consider three possible rules for automating Step 2 in the following
quiz.1
Quiz Yourself!
Suppose we start our Metropolis-Hastings Markov chain tour at location
μ^(1) = 3 and utilize a Uniform proposal model in Step 1 of the algorithm.
FIGURE 7.6: Trace plots corresponding to three different strategies for step 2
of the Metropolis-Hastings algorithm.
The quiz above presented you with three poor options for determining whether
to accept or reject proposed tour stops in Step 2 of the Metropolis-Hastings
algorithm. Rule 1 presents one extreme: never accept a proposal. This is a
terrible idea. It results in the Markov chain remaining at the same location at
every iteration (Tour 2), which would certainly produce a silly posterior
approximation. Rule 2 presents the opposite extreme: always accept a proposal.
This results in a Markov chain which is not at all discerning in where it travels
(Tour 3), completely ignoring the information we have from the unnormalized
posterior model regarding the plausibility of a proposal. For example, Tour 3
spends the majority of its time exploring posterior values μ above 6, which we
know from Figure 7.3 to be implausible.
_________________________
1 Answers : Rule 1 produces Tour 2, Rule 2 produces Tour 3, and Rule 3 produces Tour 1.
Rule 3 might seem like a reasonable balance between the two extremes: it
neither always rejects nor always accepts proposals. However, it's still
problematic. Since this rule only accepts a proposed stop if its posterior
plausibility is greater than that at the current location, it ends up producing a
Markov chain similar to that of Tour 1 above. Though this chain floats toward
values near μ = 4, where the (unnormalized) posterior pdf is greatest, it then
gets stuck there forever.
Putting all of this together, we're closer to understanding how to make the
Metropolis-Hastings algorithm work. Upon proposing a next tour stop (Step 1),
the process for rejecting or accepting this proposal (Step 2) must embrace the
idea that the chain should spend more time exploring areas of high posterior
plausibility but shouldn't get stuck there forever:
Step 1: Propose a location, μ′, for the next tour stop by taking a draw from a
proposal model.
Step 2: Decide whether to go to the proposed location (μ^(i+1) = μ′) or to
stay at the current location for another iteration (μ^(i+1) = μ).

Metropolis-Hastings algorithm

Conditioned on data y, let parameter μ have posterior pdf
f(μ|y) ∝ f(μ)L(μ|y). A Metropolis-Hastings Markov chain for f(μ|y)
evolves by drawing a proposal μ′ from a proposal model with pdf q(μ′ | μ) and
accepting that proposal with probability

α = min{1, (f(μ′)L(μ′|y) / (f(μ)L(μ|y))) ⋅ (q(μ | μ′) / q(μ′ | μ))}.    (7.3)
Step 1. In fact, this is a bit lazy. Though our Normal-Normal posterior model is
defined for μ ∈ (−∞, ∞), the Uniform proposal model lives on a truncated
neighborhood around the current chain location. However, utilizing a Uniform
proposal model simplifies the Metropolis-Hastings algorithm by the fact that
it's symmetric. This symmetry exhibits itself in the plot of the Uniform pdf
(Figure 7.4), as well as numerically – the conditional pdf of μ′ given μ is
equivalent to that of μ given μ′:

q(μ′ | μ) = q(μ | μ′) = 1/(2w) when μ and μ′ are within w units of each other, and 0 otherwise.
This symmetry means that the chance of proposing a chain move from μ to μ′ is
the same as proposing a move from μ′ to μ. For example, the Uniform model
with a half-width of 1 is equally likely to propose a move from μ = 3 to
μ′ = 3.8 as a move from μ′ = 3.8 to μ = 3. We refer to this special case of the
Metropolis-Hastings as the Metropolis algorithm.
Metropolis algorithm
The Metropolis algorithm is a special case of the Metropolis-Hastings in
which the proposal model is symmetric. That is, the chance of proposing a
move to μ′ from μ is equal to that of proposing a move to μ from μ′:
q(μ′|μ) = q(μ|μ′). Thus, the acceptance probability (7.3) simplifies to

α = min{1, f(μ′)L(μ′|y) / (f(μ)L(μ|y))}    (7.4)

or, equivalently, in terms of the posterior pdfs,

α = min{1, f(μ′|y) / f(μ|y)}.    (7.5)
This rewrite emphasizes that, though we can't calculate the posterior pdfs of μ′
and μ, f(μ′|y) and f(μ|y), their ratio is equivalent to that of the unnormalized
posterior pdfs, which we can calculate. There are two scenarios. Scenario 1: if the
(unnormalized) posterior plausibility of the proposal μ′ is at least as great as that of μ,
then α = 1 and we accept the proposal. Scenario 2: if the (unnormalized) posterior
plausibility of μ′ is less than that of μ, then

α = f(μ′|y) / f(μ|y) < 1.

That is, the probability of accepting the proposal increases with the
plausibility of μ′ relative to μ.
Scenario 1 is straightforward. We'll always jump at the chance to move our tour
to a more plausible posterior region. To wrap our minds around Scenario 2, a
little R simulation is helpful. For example, suppose our Markov tour is
currently at location “3”:
To make the final determination, we set up a weighted coin which accepts the
proposal with probability α (0.824) and rejects the proposal with probability
1 − α (0.176). In a random flip of this coin using the sample() function, we
accept the proposal, meaning that the next_stop on the tour is 2.933:
This is merely one of countless possible outcomes for a single iteration of the
Metropolis-Hastings algorithm for our Normal posterior. To streamline this
process, we'll write our own R function, one_mh_iteration(), which
implements a single Metropolis-Hastings iteration starting from any given
current tour stop and utilizing a Uniform proposal model with any given
half-width w. If you are new to writing functions, we encourage you to focus on
the structure over the details of this code. Some things to pick up on:
If we use a seed of 83, the proposed next tour stop is 2.018, which has a low
corresponding acceptance probability of 0.017:
This makes sense. We see from Figure 7.3 that the posterior plausibility of
2.018 is much lower than that of our current location of 3. Though we do want
to explore such extreme values, we don't want to do so often. In fact, we see
that upon the flip of our coin, the proposal was rejected and the tour again visits location 3 on its next_stop. As a final example, we can confirm that when
the posterior plausibility of the proposed next stop (here 3.978) is greater than
that of our current location, the acceptance probability is 1 and the proposal is
automatically accepted:
To construct an entire Metropolis-Hastings tour, we simply repeat this process over and over and over. And by “we,” we mean the computer. To this end, the mh_tour() function below constructs a Metropolis-Hastings tour of any given length N utilizing a Uniform proposal model with any given half-width w:
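One possible version is sketched below; it relies on the one_mh_iteration() sketch above and starts the chain at location 3 (an assumption):

mh_tour <- function(N, w){
  # 1. Start the chain at location 3
  current <- 3

  # 2. Initialize the tour
  mu <- rep(0, N)

  # 3. Simulate N Markov chain stops
  for(i in 1:N){
    # Simulate one iteration and record the next stop
    sim <- one_mh_iteration(w = w, current = current)
    mu[i] <- sim$next_stop

    # Reset the current location
    current <- sim$next_stop
  }

  # 4. Return the chain locations
  data.frame(iteration = c(1:N), mu)
}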
Again, this code is a big leap if you're new to for loops and functions. Let's
focus on the general structure indicated by the # comment blocks. In one call to
this function:
To see this function in action, use mh_tour() to simulate a Markov chain tour
of length N = 5000 utilizing a Uniform proposal model with half-width
w = 1:
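A sketch of that simulation and of the corresponding plots is below; the 84735 seed and the ggplot2 calls are assumptions, not the book's exact code:

set.seed(84735)
mh_simulation_1 <- mh_tour(N = 5000, w = 1)

# Trace plot and histogram of the chain values
library(ggplot2)
ggplot(mh_simulation_1, aes(x = iteration, y = mu)) +
  geom_line()
ggplot(mh_simulation_1, aes(x = mu)) +
  geom_histogram(color = "white")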
A trace plot and histogram of the tour results are shown below. Notably, this tour produces a remarkably accurate approximation of the N(4, 0.6²) posterior.
You might need to step out of the weeds we've waded into in order to reflect upon the mathemagic here. Through a rigorous and formal process, we utilized
dependent draws from a Uniform model to approximate a Normal model. And
it worked!
FIGURE 7.7: A trace plot (left) and histogram (right) of 5000 Metropolis
chain values of μ. The target posterior pdf is superimposed in blue.
FIGURE 7.8: Trace plots (top row) and histograms (bottom row) for three
different Metropolis-Hastings tours, where each tour utilizes a different
proposal model. The shared target posterior pdf is superimposed in blue.
The answers are below.2 The main punchline is that the Metropolis-Hastings
algorithm can work – we've seen as much – but we have to tune it. In our
example, this means that we have to pick an appropriate half-width w for the
Uniform proposal model. The tours in the quiz above illustrate the Goldilocks
challenge this presents: we don't want w to be too small or too large, but just
right.3 Tour 2 illustrates what can go wrong when w is too small (here w = 0.01
). You can reproduce these results with the following code:
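A sketch of that code, using the 84735 seed as an assumption:

# Rerun the tour with a very small half-width (Tour 2)
set.seed(84735)
mh_tour_2 <- mh_tour(N = 5000, w = 0.01)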
_________________________
2 Answers : Tour 1 uses w = 100, Tour 2 uses w = 0.01, Tour 3 uses w = 1.
3 This technical term was inspired by the “Goldilocks and the three bears” fairy tale in which, for
some reason, a child (Goldilocks) taste tests three different bears' porridge while trespassing in the
bears' house. The first bowl of porridge is too hot, the second is too cold, and the third is just right.
In this case, the Uniform proposal model places a tight neighborhood around the tour's current location – the chain can only move within 0.01 units at a time. Proposals will therefore tend to be very close to the current location, and thus have similar posterior plausibility and a high probability of being accepted. Specifically, when a proposal μ′ ≈ μ, it's typically the case that

α = min{ 1, f(μ′)L(μ′|y) / (f(μ)L(μ|y)) } ≈ min{1, 1} = 1.
The result is a Markov chain that's almost always moving, but takes such
excruciatingly small steps that it will take an excruciatingly long time to
explore the entire posterior plausible region of μ values. Tour 1 illustrates the
other extreme in which w is too large (here w = 100):
For extra practice, let's implement the Metropolis-Hastings algorithm for a Beta-Binomial model in which we observe Y = 1 success in 2 trials:4

Y|π ~ Bin(2, π)
π ~ Beta(2, 3)
⇒ π|(Y = 1) ~ Beta(3, 4).
Again, pretend we were only able to define the posterior pdf up to some missing normalizing constant, f(π|y = 1) ∝ f(π)L(π|y = 1), where f(π) is the Beta(2,3) prior pdf and L(π|y = 1) is the Bin(2, π) likelihood function. Our goal then is to construct a Metropolis-Hastings tour of the posterior, {π^(1), π^(2), …, π^(N)}, utilizing a two-step iterative process: (1) at the current tour stop π, take a random draw π′ from a proposal model; and then (2) decide whether or not to move there. In step 1, the proposed tour stops would ideally be restricted to be between 0 and 1, just like π itself. Since the Uniform proposal model we used above might propose values outside this range, we'll instead tune and utilize a Beta(a, b) proposal model. Further, we'll utilize the same Beta(a, b) proposal model at each step in the chain. As such, our proposal strategy does not depend on the current tour stop (though whether or not we accept a proposal still will). This special case of the Metropolis-Hastings is referred to as the independence sampling algorithm.
We then accept the proposal π′ with probability

α = min{ 1, [f(π′)L(π′|y) / (f(π)L(π|y))] ⋅ [q(π) / q(π′)] }.   (7.6)

We can rewrite α for the independence sampler to emphasize its dependence on the relative posterior plausibility of proposal π′ versus current location π:

α = min{ 1, [f(π′)L(π′|y)/f(y)] / [f(π)L(π|y)/f(y)] ⋅ [q(π) / q(π′)] } = min{ 1, [f(π′|y) / f(π|y)] ⋅ [q(π) / q(π′)] }.   (7.7)

_________________________
4 By Chapter 3, the Beta posterior has parameters 3 (α + y = 2 + 1) and 4 (β + n − y = 3 + 2 − 1).
Like the acceptance probability for the Metropolis algorithm (7.5), (7.7) includes the posterior pdf ratio f(π′|y)/f(π|y), which rewards proposals in regions of high posterior plausibility. Unlike (7.5), it also includes the proposal pdf ratio q(π)/q(π′). This ratio puts a penalty on common proposal values, ensuring that our tour doesn't float toward these values simply because we keep proposing them.
The one_iteration() function below implements a single iteration of this
independence sampling algorithm, starting from any current value π and
utilizing a Beta(a, b) proposal model for any given a and b. In the calculation
of acceptance probability α (alpha), notice that we utilize dbeta() to
evaluate the prior and proposal pdfs as well as dbinom() to evaluate the
Binomial likelihood function with data Y = 1, n = 2, and unknown probability
π:
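A sketch of one such implementation is below, following acceptance probability (7.6) for the Beta(2,3) prior and Bin(2, π) likelihood with observed Y = 1:

one_iteration <- function(a, b, current){
  # STEP 1: Propose the next chain location from the Beta(a, b) proposal model
  proposal <- rbeta(1, a, b)

  # STEP 2: Decide whether or not to go there, per (7.6)
  proposal_plaus <- dbeta(proposal, 2, 3) * dbinom(1, size = 2, prob = proposal)
  proposal_q     <- dbeta(proposal, a, b)
  current_plaus  <- dbeta(current, 2, 3) * dbinom(1, size = 2, prob = current)
  current_q      <- dbeta(current, a, b)
  alpha <- min(1, proposal_plaus / current_plaus * current_q / proposal_q)
  next_stop <- sample(c(proposal, current), size = 1, prob = c(alpha, 1 - alpha))

  data.frame(proposal, alpha, next_stop)
}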
Subsequently, we write a betabin_tour() function which constructs an N-
length Markov chain tour for any Beta(a,b) proposal model, utilizing
one_iteration() to determine each stop:
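One possible version, starting the chain at 0.5 (an assumption):

betabin_tour <- function(N, a, b){
  # Start the chain at 0.5
  current <- 0.5
  pi <- rep(0, N)

  # Simulate N iterations of the independence sampler
  for(i in 1:N){
    sim <- one_iteration(a = a, b = b, current = current)
    pi[i] <- sim$next_stop
    current <- sim$next_stop
  }

  data.frame(iteration = c(1:N), pi)
}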
FIGURE 7.9: A trace plot (left) and histogram (right) for an independence
sample of 5000 π values. The Beta(3,4) target posterior pdf is superimposed in
blue.
Why does the Metropolis-Hastings algorithm produce a tour with the right long-run behavior? First define the probabilities that the chain moves between any pair of values μ and μ′:

P(μ → μ′) = P(μ^(i+1) = μ′ | μ^(i) = μ)   and   P(μ′ → μ) = P(μ^(i+1) = μ | μ^(i) = μ′).

The chain's tendency to spend more time in regions of high posterior plausibility, without getting stuck there forever, follows from the fact that these transition probabilities preserve the relative posterior plausibility of any μ′ and μ pair. That is, the relative probabilities of moving between these two values must satisfy

P(μ → μ′) / P(μ′ → μ) = f(μ′|y) / f(μ|y).

This is indeed the case for any Metropolis-Hastings algorithm. And we can check it by examining P(μ → μ′) and P(μ′ → μ). Consider the former. For the tour to move from μ to μ′, two things must happen: we need to propose μ′ and then accept the proposal. The associated probabilities of these two events are described by q(μ′|μ) and α (7.3), respectively. It follows that the chain moves from μ to μ′ with probability

P(μ → μ′) = q(μ′|μ) ⋅ min{ 1, [f(μ′)L(μ′|y) / (f(μ)L(μ|y))] ⋅ [q(μ|μ′) / q(μ′|μ)] }

and, by the same reasoning, from μ′ to μ with probability

P(μ′ → μ) = q(μ|μ′) ⋅ min{ 1, [f(μ)L(μ|y) / (f(μ′)L(μ′|y))] ⋅ [q(μ′|μ) / q(μ|μ′)] }.

Two scenarios arise, depending on which of these acceptance probabilities equals 1.

1. Scenario 1: The move from μ to μ′ is automatically accepted. In this case P(μ → μ′) simplifies to q(μ′|μ) while P(μ′ → μ) = q(μ′|μ) ⋅ f(μ|y) / f(μ′|y), and thus

P(μ → μ′) / P(μ′ → μ) = f(μ′|y) / f(μ|y).

2. Scenario 2: The reverse move, from μ′ to μ, is automatically accepted (with a symmetric proposal, this is the case in which μ′ is less plausible than μ). In this case P(μ′ → μ) simplifies to q(μ|μ′) while P(μ → μ′) = q(μ|μ′) ⋅ f(μ′|y) / f(μ|y), and again

P(μ → μ′) / P(μ′ → μ) = f(μ′|y) / f(μ|y).
even when it reaches its own limits of utility, the Metropolis-Hastings serves as the foundation for a more flexible set of MCMC tools, including the adaptive Metropolis-Hastings, Gibbs, and Hamiltonian Monte Carlo (HMC) algorithms.
Among these, HMC is the algorithm utilized by the rstan and rstanarm
packages.
As we noted at the top of Chapter 7, studying these alternative algorithms
would require a book itself. From here on out, we'll rely on rstan with a fresh confidence in what's going on under the hood. If you're curious to learn a little
more, McElreath (2019) provides an excellent video introduction to the HMC
algorithm and how it compares to the Metropolis-Hastings. For a deeper dive,
Brooks et al. (2011) provides a comprehensive overview of the broader MCMC
landscape.
7.8 Chapter summary
In Chapter 7, you built a strong conceptual understanding of the foundational
Metropolis-Hastings MCMC algorithm. You also implemented this algorithm
to study the familiar Normal-Normal and Beta-Binomial models. Whether in
these relatively simple one-parameter model settings, or in more complicated
model settings, the Metropolis-Hastings algorithm produces an approximate
sample from the posterior by iterating between two steps:
1. Propose a new chain location by drawing from a proposal pdf which is,
perhaps, dependent upon the current location.
2. Determine whether to accept the proposal. Simply put, whether or not
we accept a proposal depends on how favorable its posterior plausibility
is relative to the posterior plausibility of the current location.
7.9 Exercises
Chapter 7 continued to shift focus towards more computational methods.
Change can be uncomfortable, but it also spurs growth. As you work through
these exercises, know that if you are struggling a bit, that just means you are
engaging with some new concepts and you are putting in the work to grow as a
Bayesian. Our advice: verbalize what you do and don't understand. Don't rush
yourself. Take a break and come back to exercises that you feel stuck on. Work
with a buddy. Ask for help, and help others when you can.
a) Draw a trace plot for a tour where the Uniform proposal model uses a
very small w.
b) Why is it problematic if w is too small, and hence defines the neighborhood around the current chain value too narrowly?
c) Draw a trace plot for a tour where the Uniform proposal model uses a
very large w.
d) Why is it problematic if w is too large, and hence defines the neighborhood too widely?
e) Draw a trace plot for a tour where the Uniform proposal model uses a w that is neither too small nor too large.
f) Describe how you would go about finding an appropriate half-width w for a Uniform proposal model.
c) It is a special case of the Metropolis-Hastings algorithm.
d) It is a special case of the Metropolis algorithm.
Exercise 7.6 (Proposing a new location). In each situation below, complete Step 1 of the Metropolis-Hastings algorithm. That is, starting from the given current chain value λ^(i) = λ and with set.seed(84735), use the given proposal model to draw a λ′ proposal value for the next chain value λ^(i+1).
a) λ = 4.6, λ′|λ ~ N(λ, 2²)
b) λ = 8.9, λ′|λ ~ Unif(λ − 2, λ + 2)
c) λ = 7.7, λ′|λ ~ Unif(λ − 3, λ + 3)
Exercise 7.7 (Calculate the acceptance probability: Part I). Suppose that a Markov chain is currently at λ^(i) = 2 and that a proposal λ′ has been drawn for λ^(i+1). For each pair of unnormalized posterior pdf f(λ)L(λ|y) and proposal pdf q(λ′|λ), calculate the acceptance probability α used in Step 2 of the Metropolis-Hastings algorithm.
Exercise 7.8 (Calculate the acceptance probability: Part II). Suppose that a Markov chain is currently at λ^(i) = 1.8 and that the proposal for λ^(i+1) is λ′ = 1.6. For each pair of unnormalized posterior pdf f(λ)L(λ|y) and proposal pdf q(λ′|λ), calculate the acceptance probability α used in Step 2 of the Metropolis-Hastings algorithm.
e) For which of these scenarios is there a 100% acceptance probability?
Explain why we'd certainly want to accept λ′ in these scenarios.
Exercise 7.9 (One iteration with a Uniform proposal model). For the Normal-Normal model with prior μ ~ N(0, 1²), the function one_mh_iteration() from the text utilizes a Uniform proposal model, μ′|μ ~ Unif(μ − w, μ + w), with half-width w = 1. Starting from a current value of μ = 3 and using set.seed(1), run one_mh_iteration(w = 1, current = 3) and comment on the returned proposal, alpha, and next_stop values.
a) 50 iterations, w = 50
b) 50 iterations, w = 0.01
c) 1000 iterations, w = 50
d) 1000 iterations, w = 0.01
e) Contrast the trace plots in parts a and b. Explain why changing w has
this effect.
f) Consider the results in parts c and d. Is the w value as important when
the number of iterations is much larger? Explain.
Exercise 7.11 (Changing the proposal model). For this exercise, modify one_mh_iteration() to create a new function, one_mh_iteration_normal(), which utilizes a symmetric Normal proposal model centered at the current chain value μ with standard deviation s: μ′|μ ~ N(μ, s²). Then run one_mh_iteration_normal(s = 3, current = 3).
Exercise 7.12 (Metropolis-Hastings tour with Normal proposals). Upon completing the previous exercise, modify mh_tour() to create a new function which utilizes a symmetric Normal proposal model, centered at the current chain value μ with standard deviation s. Use it to construct tours with the following settings:
a) 20 iterations, s = 0.01
b) 20 iterations, s = 10
a) new_mh_iteration(w = 1, current = 3, m = 0, s = 10)
b) new_mh_iteration(w = 1, current = 3, m = 20, s = 1)
c) new_mh_iteration(w = 0.1, current = 3, m = 20, s = 1)
d) new_mh_iteration(w = 0.1, current = 3, m = -15, s = 10)
Exercise 7.14 (A Gamma-Poisson model). Consider a Gamma-Poisson model
in which rate λ has a Gamma(1,0.1) prior and you observe one Poisson data
point, Y = 4. In this exercise, you will simulate the posterior model of λ using
an independence sampler.
DOI: 10.1201/9780429288340-8
Imagine you find yourself standing at the Museum of Modern Art (MoMA) in New York City,
captivated by the artwork in front of you. While understanding that “modern” art doesn't
necessarily mean “new” art, a question still bubbles up: what are the chances that this modern
artist is Gen X or even younger, i.e., born in 1965 or later? In this chapter, we'll perform a
Bayesian analysis with the goal of answering this question. To this end, let π denote the
proportion of artists represented in major U.S. modern art museums that are Gen X or younger.
The Beta(4,6) prior model for π (Figure 8.1) reflects our own very vague prior assumption that
major modern art museums disproportionately display artists born before 1965, i.e., π most
likely falls below 0.5. After all, “modern art” dates back to the 1880s and it can take a while to
attain such high recognition in the art world.
To learn more about π, we'll examine n = 100 artists sampled from the MoMA collection.
This moma_sample dataset in the bayesrules package is a subset of data made available by
MoMA itself (MuseumofModernArt, 2020).
Recognizing that the dependence of Y on π follows a Binomial model, our analysis follows the Beta-Binomial framework. Thus, our updated posterior model of π in light of the observed art data follows from (3.10):

Y|π ~ Bin(100, π)
π ~ Beta(4, 6)
⇒ π|(Y = 14) ~ Beta(18, 92)

with posterior pdf

f(π | y = 14) = [Γ(110) / (Γ(18)Γ(92))] π^(18−1) (1 − π)^(92−1)   for π ∈ [0, 1].
The evolution in our understanding of π is exhibited in Figure 8.1. Whereas we started out
with a vague understanding that under half of displayed artists are Gen X, the data has swayed
us toward some certainty that this figure likely falls below 25%.
FIGURE 8.1: Our Bayesian model of π, the proportion of modern art museum artists that are
Gen X or younger.
After celebrating our success in constructing the posterior, please recognize that there's a lot
of work ahead. We must be able to utilize this posterior to perform a rigorous posterior
analysis of π. There are three common tasks in posterior analysis: estimation, hypothesis
testing, and prediction. For example, what's our estimate of π? Does our model support the
claim that fewer than 20% of museum artists are Gen X or younger? If we sample 20 more
museum artists, how many do we predict will be Gen X or younger?
Goals
Establish the theoretical foundations for the three posterior analysis tasks: estimation,
hypothesis testing, and prediction.
Explore how Markov chain simulations can be used to approximate posterior features,
and hence be utilized in posterior analysis.
(8.1)
8.1 Posterior estimation
Reexamine the Beta(18, 92) posterior model for π, the proportion of modern art museum
artists that are Gen X or younger (Figure 8.1). In a Bayesian analysis, we can think of this
entire posterior model as an estimate of π. After all, this model of posterior plausible values
provides a complete picture of the central tendency and uncertainty in π. Yet in specifying and
communicating our posterior understanding, it's also useful to compute simple posterior
summaries of π. Check in with your gut on how we might approach this task.
Quiz Yourself!
What best describes your posterior estimate of π?
a) Roughly 16% of museum artists are Gen X or younger.
b) It's most likely the case that roughly 16% of museum artists are Gen X or
younger, but that figure could plausibly be anywhere between 9% and 26%.
If you responded with answer b, your thinking is Bayesian in spirit. To see why, consider
Figure 8.2, which illustrates our Beta(18, 92) posterior for π (left) alongside a different
analyst's Beta(4, 16) posterior (right). This analyst started with the same Beta(4, 6) prior but
only observed 10 artists, 0 of which were Gen X or younger. Though their different data
resulted in a different posterior, the central tendency is similar to ours. Thus, the other
analyst's best guess of π agrees with ours: roughly 16-17% of represented artists are Gen X or
younger. However, reporting only this shared “best guess” would make our two posteriors
seem misleadingly similar. In fact, whereas we're quite confident that the representation of younger artists is between 10% and 24%, the other analyst is only willing to put that figure
somewhere in the much wider range from 6% to 40%. Their relative uncertainty makes sense
– they only collected 10 artworks whereas we collected 100.
FIGURE 8.2: Our Beta(18, 92) posterior model for π (left) is shown alongside an alternative
Beta(4, 16) posterior model (right). The shaded regions represent the corresponding 95%
posterior credible intervals for π.
The punchline here is that posterior estimates should reflect both the central tendency and variability in π. The posterior mean and mode of π provide quick summaries of the central tendency alone. These features for our Beta(18, 92) posterior follow from the general Beta properties (3.2) and match our above observation that Gen X representation is most likely around 16%:

E(π | Y = 14) = 18 / (18 + 92) ≈ 0.164
Mode(π | Y = 14) = (18 − 1) / (18 + 92 − 2) ≈ 0.157.
Better yet, to capture both the central tendency and variability in π, we can report a range of posterior plausible π values. This range is called a posterior credible interval (CI) for π. For example, we noticed earlier that the proportion of museum artists that are Gen X or younger is most likely between 10% and 24%. This range captures the more plausible values of π while eliminating the more extreme and unlikely scenarios (Figure 8.2). In fact, 0.1 and 0.24 are the 2.5th and 97.5th posterior percentiles (i.e., 0.025th and 0.975th posterior quantiles), (π_0.025, π_0.975), and thus mark the middle 95% of posterior plausible π values. We can confirm these Beta(18,92) posterior quantile calculations using qbeta():
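A sketch of that calculation:

# 2.5th and 97.5th quantiles of the Beta(18, 92) posterior
qbeta(c(0.025, 0.975), 18, 92)
# roughly 0.10 and 0.24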
The resulting 95% credible interval for π, (0.1, 0.24), is represented by the shaded region in Figure 8.2 (left). Whereas the area under the entire posterior pdf is 1, the area of this shaded region, and hence the fraction of π values that fall into this region, is 0.95. This reveals an intuitive interpretation of the CI. There's a 95% posterior probability that somewhere between 10% and 24% of museum artists are Gen X or younger:

P(0.1 < π < 0.24 | Y = 14) = ∫_{0.1}^{0.24} f(π | y = 14) dπ = 0.95.
Please stop for a moment. Does this interpretation feel natural and intuitive? Perhaps even a bit anticlimactic? If so, we're happy you feel that way – it means you're thinking like a Bayesian.
In Section 8.5 we'll come back to just how special this result is.
In constructing the CI above, we used a “middle 95%” approach. This isn't our only option.
The first tweak we could make is to the 95% credible level (Figure 8.3). For example, a middle
50% CI, ranging from the 25th to the 75th percentile, would draw our focus to a smaller range
of some of the more plausible π values. There's a 50% posterior probability that somewhere
between 14% and 19% of museum artists are Gen X or younger:
FIGURE 8.3: 50%, 95%, and 99% posterior credible intervals for π.
In the other direction, a wider middle 99% CI would range from the 0.5th to the 99.5th
percentile, and thus kick out only the extreme 1%. As such, a 99% CI would provide us with a
fuller picture of plausible, though in some cases very unlikely, π values:
Though a 95% level is a common choice among practitioners, it is somewhat arbitrary and
simply ingrained through decades of tradition. There's no one “right” credible level.
Throughout this book, we'll sometimes use 50% or 80% or 95% levels, depending upon the
context of the analysis. Each provides a different snapshot of our posterior understanding.
Consider a second possible tweak to our construction of the CI: it's not necessary to report the
middle 95% of posterior plausible values. In fact, the middle 95% approach can eliminate
some of the more plausible values from the CI. A close inspection of the 50% and 95%
credible intervals in Figure 8.3 reveals evidence of this possibility in the ever-so-slightly
lopsided nature of the shaded region in our ever-so-slightly non-symmetric posterior. In the
95% CI, values included in the upper end of the CI are less plausible than lower values of π
below 0.1 that were left out of the CI. If this lopsidedness were more extreme, we should
consider forming a 95% CI for π using not the middle, but the 95% of posterior values with
the highest posterior density. You can explore this idea in the exercises, though we won't lose
sleep over it here. Mainly, this method will only produce meaningfully different results than
the middle 95% approach in extreme cases, when the posterior is extremely skewed.
1. The majority of the posterior pdf in Figure 8.3 falls below 0.2.
2. The 95% credible interval for π, (0.1, 0.24), is mostly below 0.2.
These observations are a great start. Yet we can be even more precise. To evaluate exactly how
plausible it is that π < 0.2, we can calculate the posterior probability of this scenario,
P (π < 0.2|Y = 14). This posterior probability is represented by the shaded area under the
posterior pdf in Figure 8.4 and, mathematically, is calculated by integrating the posterior pdf
on the range from 0 to 0.2:

P(π < 0.2 | Y = 14) = ∫_{0}^{0.2} f(π | y = 14) dπ.
FIGURE 8.4: The Beta(18,92) posterior probability that π is below 0.20 is represented by the
shaded region under the posterior pdf.
We'll bypass the integration and obtain this Beta(18,92) posterior probability using pbeta()
below. The result reveals strong evidence in favor of our claim: there's a roughly 84.9%
posterior chance that Gen Xers account for fewer than 20% of modern art museum artists.
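A sketch of that pbeta() calculation:

# Posterior probability that pi < 0.20 under the Beta(18, 92) posterior
pbeta(0.20, 18, 92)
# roughly 0.849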
This analysis of our claim is refreshingly straightforward. We simply calculated the posterior
probability of the scenario of interest. Though not always necessary, practitioners often
formalize this procedure into a hypothesis testing framework. For example, we can frame our
analysis as two competing hypotheses: the null hypothesis H0 contends that at least 20% of museum artists are Gen X or younger (the status quo here) whereas the alternative hypothesis Ha (our claim) contends that this figure is below 20%. In mathematical notation:

H0 : π ≥ 0.2
Ha : π < 0.2
Note that Ha claims that π lies on one side of 0.2 (π < 0.2) as opposed to just being different
than 0.2 (π ≠ 0.2). Thus, we call this a one-sided hypothesis test.
We've already calculated the posterior probability of the alternative hypothesis to be P(Ha | Y = 14) = 0.849. Thus, the posterior probability of the null hypothesis is P(H0 | Y = 14) = 0.151. Putting these together, the posterior odds that π < 0.2 are roughly 5.62. That is, our posterior assessment is that π is nearly 6 times more likely to be below 0.2 than to be above 0.2:

posterior odds = P(Ha | Y = 14) / P(H0 | Y = 14) ≈ 5.62.
Of course, these posterior odds represent our updated understanding of π upon observing the
survey data, Y = 14 of n = 100 sampled artists were Gen X or younger. Prior to sampling
these artists, we had a much higher assessment of Gen X representation at major art museums
(Figure 8.5).
FIGURE 8.5: The posterior probability that π is below 0.2 (right) is contrasted against the
prior probability of this scenario (left).
Specifically, the prior probability that π < 0.2, calculated by the area under the Beta(4,6) prior pdf f(π) that falls below 0.2, was only 0.0856:

P(Ha) = ∫_{0}^{0.2} f(π) dπ ≈ 0.0856.

Thus, the prior probability of the null hypothesis is P(H0) = 0.914. It follows that the prior odds of Gen X representation being below 0.2 were roughly only 1 in 10:

prior odds = P(Ha) / P(H0) ≈ 0.093.
The Bayes Factor (BF) compares the posterior odds to the prior odds, and hence provides
insight into just how much our understanding about Gen X representation evolved upon
observing our sample data:
Bayes Factor = posterior odds / prior odds.

In our example, the Bayes Factor is roughly 60. Thus, upon observing the artwork data, the posterior odds of our hypothesis about Gen Xers are roughly 60 times higher than the prior odds. Or, our confidence in this hypothesis jumped quite a bit.
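These odds and the Bayes Factor can be checked with a few lines of R; the sketch below uses pbeta() for both the Beta(18,92) posterior and the Beta(4,6) prior:

# Posterior probability and odds of Ha: pi < 0.2
post_prob <- pbeta(0.20, 18, 92)
post_odds <- post_prob / (1 - post_prob)     # roughly 5.62

# Prior probability and odds of Ha under the Beta(4, 6) prior
prior_prob <- pbeta(0.20, 4, 6)
prior_odds <- prior_prob / (1 - prior_prob)  # roughly 0.093

# Bayes Factor
post_odds / prior_odds                       # roughly 60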
We summarize the Bayes Factor below, including some guiding principles for its
interpretation.
Bayes Factor
In a hypothesis test of two competing hypotheses, Ha vs H0, the Bayes Factor is an odds ratio for Ha:

Bayes Factor = posterior odds / prior odds = [P(Ha | Y) / P(H0 | Y)] / [P(Ha) / P(H0)].

As a ratio, it's meaningful to compare the Bayes Factor (BF) to 1. To this end, consider three possible scenarios: BF = 1, in which case the plausibility of Ha didn't change in light of the data; BF > 1, in which case the plausibility of Ha increased; and BF < 1, in which case the plausibility of Ha decreased.
Bringing it all together, the posterior probability (0.85) and Bayes Factor (60) establish fairly
convincing evidence in favor of the claim that fewer than 20% of artists at major modern art
museums are Gen X or younger. Did you wince in reading that sentence? The term “fairly
convincing” might seem a little wishy-washy. In the past, you might have learned specific cut-offs that distinguish between “statistically significant” and “not statistically significant”
results, or allow you to “reject” or “fail to reject” a hypothesis. However, this practice
provides false comfort. Reality is not so clear-cut. For this reason, across the frequentist and
Bayesian spectrum, the broader statistics community advocates against making rigid
conclusions using universal rules and for a more nuanced practice which takes into account the
context and potential implications of each individual hypothesis test. Thus, there is no magic,
one-size-fits-all cut-off for what Bayes Factor or posterior probability evidence is big enough to filter claims into “true” or “false” categories. In fact, what we have is more powerful than a
binary decision – we have a holistic measure of our level of uncertainty about the claim. This
level of uncertainty can inform our next steps. In our art example, do we have ample evidence
for our claim? We're convinced.
H0 : π = 0.3
Ha : π ≠ 0.3
When we try to hit this two-sided hypothesis test with the same hammer we used for the one-
sided hypothesis test, we quickly run into a problem. Since π is continuous, the prior and
posterior probabilities that π is exactly 0.3 (i.e., that H0 is true) are both zero. For example, the
posterior probability that π = 0.3 is calculated by the area of the line under the posterior pdf at 0.3. As is true for any line, this area is 0:

P(π = 0.3 | Y = 14) = ∫_{0.3}^{0.3} f(π | y = 14) dπ = 0.

Thus, the posterior odds, prior odds, and consequently the Bayes Factor are all undefined:

posterior odds = P(Ha | Y = 14) / P(H0 | Y = 14) = 1/0 = nooooo!
No problem. There's not one recipe for success. To that end, try the following quiz.1
Quiz Yourself!
Recall that the 95% posterior credible interval for π is (0.1, 0.24). Does this CI provide
ample evidence that π differs from 0.3?
If you answered “yes,” then you intuited a reasonable approach to two-sided hypothesis
testing. The hypothesized value of π (here 0.3) is “substantially” outside the posterior credible
interval, thus we have ample evidence in favor of Ha. The fact that 0.3 is so far above the
range of plausible π values makes us pretty confident that the proportion of museum artists that are Gen X or younger is not 0.3. Yet what's “substantial” or clear in one context might be different than what's “substantial” in another. With that in mind, it is best practice to define
“substantial” ahead of time, before seeing any data. For example, in the context of artist
representation, we might consider any proportion outside the 0.05 window around 0.3 to be
meaningfully different from 0.3. This essentially adds a little buffer into our hypotheses, π is
either around 0.3 (between 0.25 and 0.35) or it's not:
H0 : π ∈ (0.25, 0.35)
Ha : π ∉ (0.25, 0.35)
With this defined buffer in place, we can more rigorously claim belief in Ha since the entire
hypothesized range for π, (0.25, 0.35), lies above its 95% credible interval. Note also that
since H0 no longer includes a singular hypothesized value of π, its corresponding posterior and
prior probabilities are no longer 0. Thus, just as we did in the one-sided hypothesis testing
setting, we could (but won't here) supplement our above posterior credible interval analysis
with posterior probability and Bayes Factor calculations.
_________________________
1 Answer : yes
Quiz Yourself!
Suppose we get our hands on data for 20 more artworks displayed at the museum. Based
on the posterior understanding of π that we've developed throughout this chapter, what
number would you predict are done by artists that are Gen X or younger?
Your knee-jerk reaction to this quiz might be: “I got this one. It's 3!” This is a very reasonable
place to start. After all, our best posterior guess was that roughly 16% of museum artists are
Gen X or younger and 16% of 20 new artists is roughly 3. However, this calculation ignores
two sources of potential variability in our prediction:
Let's specify these concepts with some math. First, let Y = y be the (yet unknown) number
′ ′
of the 20 new artworks that are done by Gen X or younger artists, where y′ can be any number
of artists in {0, 1, …, 20}. Conditioned on π, the randomness or sampling variability in Y′ can
be modeled by Y |π~Bin(20, π) with pdf
′
20 ′
20−y
′
′ ′ ′ y
f (y π) = P (Y = y π) = ( )π (1 − π) .
′
y
(8.2)
Thus, the random outcome of Y′ depends upon π, which too can vary – π might be any value
between 0 and 1. To this end, the Beta(18, 92) posterior model of π given the original data (
Y = 14) describes the potential posterior variability in π, i.e., which values of π are more plausible than others. To capture both sources of variability, we can weight the chance of observing Y′ = y′ Gen Xers for a given π by the posterior plausibility of that π value:

f(y′ | π) f(π | y = 14).   (8.3)
Figure 8.6 illustrates this idea, plotting the weighted behavior of Y′ (8.3) for just three possible
values of π: the 2.5th posterior percentile (0.1), posterior mode (0.16), and 97.5th posterior
percentile (0.24). Naturally, we see that the greater π is, the greater Y′ tends to be: when π = 0.1 the most likely value of Y′ is 2, whereas when π = 0.24, the most likely value of Y′ is 5. Also notice that since π values as low as 0.1 or as high as 0.24 are not very plausible, the
values of Y′ that might be generated under these scenarios are given less weight (i.e., the sticks
are much shorter) than those that are generated under π = 0.16, the most plausible π value.
FIGURE 8.6: Possible Y′ outcomes are plotted for π ∈ {0.10, 0.16, 0.24} and weighted by
the corresponding posterior plausibility of π.
Putting this all together, the posterior predictive model of Y′, the number of the 20 new artists that are Gen X, takes into account both the sampling variability in Y′ and posterior variability in π. Specifically, the posterior predictive pmf calculates the overall chance of observing Y′ = y′ by averaging (8.3) across all possible π from 0 to 1:

f(y′ | y = 14) = P(Y′ = y′ | Y = y) = ∫_{0}^{1} f(y′ | π) f(π | y = 14) dπ.

More generally,

f(y′ | y) = ∫ f(y′ | π) f(π | y) dπ.   (8.4)

In words, the overall chance of observing Y′ = y′ weights the chance of observing this outcome under a given π by the corresponding posterior plausibility of that π.
An exact formula for the pmf of Y′ follows from some calculus (which we don't show here but is fun and we encourage you to try if you have calculus experience):

f(y′ | y = 14) = (20 choose y′) ⋅ [Γ(110) / (Γ(18)Γ(92))] ⋅ [Γ(18 + y′) Γ(112 − y′) / Γ(130)]   for y′ ∈ {0, 1, …, 20}.   (8.5)
Though this formula is unlike any we've ever seen (e.g., it's not Binomial or Poisson or anything else we've learned), it still specifies what values of Y′ we might observe and the probability of each. For example, plugging y′ = 3 into this formula, there's a 0.2217 posterior predictive probability that 3 of the next 20 artworks are done by Gen X or younger artists:

f(y′ = 3 | y = 14) = (20 choose 3) ⋅ [Γ(110) / (Γ(18)Γ(92))] ⋅ [Γ(18 + 3) Γ(112 − 3) / Γ(130)] = 0.2217.
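This figure can be checked in R; the sketch below evaluates (8.5) on the log scale for numerical stability:

# Posterior predictive probability of y' = 3, per (8.5)
y_prime <- 3
exp(lchoose(20, y_prime) +
    lgamma(110) - lgamma(18) - lgamma(92) +
    lgamma(18 + y_prime) + lgamma(112 - y_prime) - lgamma(130))
# roughly 0.2217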
The pmf formula also reflects the influence of our Beta(18,92) posterior model for π (through
parameters 18 and 92), and hence the original prior and data, on our posterior understanding of
Y′. That is, like any posterior operation, our posterior predictions balance information from
both the prior and data.
For a look at the bigger picture, the posterior predictive pmf f(y′ | y = 14) is plotted in Figure
8.7. Though it looks similar to the model of Y′ when we assume that π equals the posterior
mode of 0.16 (Figure 8.6 middle), it puts relatively more weight on the smaller and larger
values of Y′ that we might expect when π deviates from 0.16. All in all, though the number of
the next 20 artworks that will be done by Gen X or younger artists is most likely 3, it could
plausibly be anywhere between, say, 0 and 10.
FIGURE 8.7: The posterior predictive model of Y′, the number of the next 20 artworks that
are done by Gen X or younger artists.
Finally, after building and examining the posterior predictive model of Y′, the number of the next 20 artists that will be Gen X, we might have some follow-up questions. For example, what's the posterior probability that at least 5 of the 20 artists are Gen X, P(Y′ ≥ 5 | Y = 14)? How many of the next 20 artists do we expect to be Gen X, E(Y′ | Y = 14)? We can answer these questions, it's just a bit tedious. Since the posterior predictive model for Y′ isn't familiar, we can't calculate posterior features using pre-built formulas or R functions like we did for the Beta posterior model of π. Instead, we have to calculate these features from scratch. For example, we can calculate the posterior probability that at least 5 of the 20 artists are Gen X by adding up the pmf (8.5) evaluated at each of the 16 y′ values in this range. The result of this large sum, the details of which would fill a whole page, is 0.233:
P(Y′ ≥ 5 | y = 14) = ∑_{y′=5}^{20} f(y′ | y = 14)
                   = f(y′ = 5 | y = 14) + f(y′ = 6 | y = 14) + ⋯ + f(y′ = 20 | y = 14)
                   = 0.233.

Similarly, though this isn't a calculation we've had to do yet (and won't do again), the expected number of the next 20 artists that will be Gen X can be obtained by the posterior weighted average of possible Y′ values. That is, we can add up each Y′ value from 0 to 20 weighted by their posterior probabilities f(y′ | y = 14). The result of this large sum indicates that we should expect roughly 3 of the 20 artists to be Gen X:

E(Y′ | y = 14) = ∑_{y′=0}^{20} y′ f(y′ | y = 14)
              = 0 ⋅ f(y′ = 0 | y = 14) + 1 ⋅ f(y′ = 1 | y = 14) + ⋯ + 20 ⋅ f(y′ = 20 | y = 14)
              = 3.273.

But we don't want to get too distracted by these types of calculations. In this book, we'll never need to do something like this again. Starting in Chapter 9, our models will be complicated enough so that even tedious formulas like these will be unattainable and we'll need to rely on simulation to approximate posterior features.
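If you'd like to verify these figures without filling a page, a short R sketch that tabulates (8.5) will do it:

# Tabulate the posterior predictive pmf (8.5) for y' = 0, 1, ..., 20
y_prime <- 0:20
pmf <- exp(lchoose(20, y_prime) +
           lgamma(110) - lgamma(18) - lgamma(92) +
           lgamma(18 + y_prime) + lgamma(112 - y_prime) - lgamma(130))

sum(pmf[y_prime >= 5])   # P(Y' >= 5 | y = 14), roughly 0.233
sum(y_prime * pmf)       # E(Y' | y = 14), roughly 3.273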
Check out the numerical and visual diagnostics in Figure 8.8. First, the randomness in the
trace plots (left), the agreement in the density plots of the four parallel chains (middle), and an
Rhat value of effectively 1 suggest that our simulation is extremely stable. Further, our
dependent chains are behaving “enough” like an independent sample. The autocorrelation,
shown at right for just one chain, drops off quickly and the effective sample size ratio is
satisfyingly high – our 20,000 Markov chain values are as effective as 7600 independent
samples (0.38 ⋅ 20000).
FIGURE 8.8: MCMC simulation results for the posterior model of π, the proportion of
museum artists that are Gen X or younger, are exhibited by trace and density plots for the four
parallel chains (left and middle) and an autocorrelation plot for a single chain (right).
8.4.2 Posterior estimation & hypothesis testing
We can now use the combined 20,000 Markov chain values, with confidence, to approximate the Beta(18, 92) posterior model of π. Indeed, Figure 8.9 confirms that the complete MCMC
approximation (right) closely mimics the actual posterior (left).
FIGURE 8.9: The actual Beta(18, 92) posterior pdf of π (left) alongside an MCMC
approximation (right).
As such, we can approximate any feature of the Beta(18, 92) posterior model by the
corresponding feature of the Markov chain. For example, we can approximate the posterior
mean by the mean of the MCMC sample values, or approximate the 2.5th posterior percentile
by the 2.5th percentile of the MCMC sample values. To this end, the tidy() function in the
broom.mixed package (Bolker and Robinson, 2021) provides some handy statistics for the
combined 20,000 Markov chain values stored in art_sim:
And the mcmc_areas() function in the bayesplot package provides a visual complement
(Figure 8.10):
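Both calls are sketched below, assuming the rstan results are stored in a stanfit object named art_sim with a parameter named "pi":

library(broom.mixed)
library(bayesplot)

# Posterior summary statistics with a 95% credible interval
tidy(art_sim, conf.int = TRUE, conf.level = 0.95)

# Shaded middle 95% and median line for pi
mcmc_areas(art_sim, pars = "pi", prob = 0.95)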
FIGURE 8.10: The median and middle 95% of the approximate posterior model of π.
In the tidy() summary, conf.low and conf.high report the 2.5th and 97.5th percentiles
of the Markov chain values, 0.101 and 0.239, respectively. These form an approximate middle
95% credible interval for π which is represented by the shaded region in the mcmc_areas()
plot. Further, the estimate reports that the median of our 20,000 Markov chain values, and
thus our approximation of the actual posterior median, is 0.162. This median is represented by
the vertical line in the mcmc_areas() plot. Like the mean and mode, the median provides
another measure of a “typical” posterior π value. It corresponds to the 50th posterior
percentile – 50% of posterior π values are above the median and 50% are below. Yet unlike the
mean and mode, there doesn't exist a one-size-fits-all formula for a Beta(α, β) median. This
exposes even more beauty about MCMC simulation: even when a formula is elusive, we can
estimate a posterior quantity by the corresponding feature of our observed Markov chain
sample values.
Though a nice first stop, the tidy() function doesn't always provide every summary statistic
of interest. For example, it doesn't report the mean or mode of our Markov chain sample
values. No problem. We can calculate summary statistics directly from the Markov chain
values. The first step is to convert an array of the four parallel chains into a single data frame
of the combined chains:
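A sketch of this conversion, again assuming a stanfit object art_sim with parameter "pi":

library(rstan)

# Combine the post-warmup draws from all four chains into one data frame
art_chains_df <- data.frame(pi = rstan::extract(art_sim, pars = "pi")[["pi"]])
dim(art_chains_df)   # 20000 rows, one column of pi values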
With the chains in data frame form, we can proceed as usual, using our dplyr tools to
transform and summarize. For example, we can directly calculate the sample mean, median,
mode, and quantiles of the combined Markov chain values. The median and quantile values
are precisely those reported by tidy() above, and thus eliminate any mystery about that
function!
We can also use the raw chain values to tackle the next task in our posterior analysis – testing
the claim that fewer than 20% of major museum artists are Gen X. To this end, we can
approximate the posterior probability of this scenario, P (π < 0.20|Y = 14), by the proportion
of Markov chain π values that fall below 0.20. By this approximation, there's an 84.6% chance
that Gen X artist representation is under 0.20:
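A sketch of the approximation, using the art_chains_df data frame built above:

library(dplyr)

# Proportion of chain values below 0.20 approximates P(pi < 0.20 | Y = 14)
art_chains_df %>%
  summarize(prob_under_0.20 = mean(pi < 0.20))
# roughly 0.846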
Soak it in and remember the point. We've used our MCMC simulation to approximate the
posterior model of π along with its features of interest. For comparison, Table 8.1 presents the
Beta(18,92) posterior features we calculated in Section 8.1 alongside their corresponding
MCMC approximations. The punchline is this: MCMC worked. The approximations are quite
accurate. Let this bring you peace of mind as you move through the next chapters – though the
models therein will be too complicated to specify, we can be confident in our MCMC
approximations of these models (so long as the diagnostics check out!).
Posterior variability in π
The collection of 20,000 Markov chain π values provides an approximate sense for the
variability and range in plausible π values.
To capture both sources of variability in posterior predictions Y′, we can use rbinom() to
simulate one Bin(20, π) outcome Y′ from each of the 20,000 π chain values. The first three results reflect a general trend: smaller values of π will tend to produce smaller values of Y′.
This makes sense. The lower the underlying representation of Gen X artists in the museum, the
fewer Gen X artists we should expect to see in our next sample of 20 artworks.
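A sketch of this simulation, using rbinom() with a hypothetical seed:

set.seed(1)   # hypothetical seed
art_pred <- art_chains_df %>%
  mutate(y_predict = rbinom(n(), size = 20, prob = pi))

# Peek at the first three predictions alongside their pi values
head(art_pred, 3)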
The resulting collection of 20,000 predictions closely approximates the true posterior
predictive distribution (Figure 8.7). It's most likely that 3 of the 20 artists will be Gen X or
younger, though this figure might reasonably range between 0 and, say, 10:
FIGURE 8.11: A histogram of 20,000 simulated posterior predictions of the number among
the next 20 artists that will be Gen X or younger.
We can also utilize the posterior predictive sample to approximate features of the actual
posterior predictive model that were burdensome to specify mathematically. For example, we
can approximate the posterior mean prediction, E(Y′ | Y = 14), and the more comprehensive posterior prediction interval for Y′. To this end, we expect roughly 3 of the next 20 artists to be Gen X or younger, but there's an 80% chance that this figure is somewhere between 1 and 6
artists:
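A sketch of these approximations from the simulated predictions:

# Approximate the posterior mean prediction and a middle 80% prediction interval
art_pred %>%
  summarize(mean_prediction = mean(y_predict),
            lower_80 = quantile(y_predict, 0.10),
            upper_80 = quantile(y_predict, 0.90))
# roughly 3 artists on average, with an 80% interval of about (1, 6)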
8.5 Bayesian benefits
In Chapter 1, we highlighted how Bayesian analyses compare to frequentist analyses. Now that
we've worked through some concrete examples, let's revisit some of those ideas. As you've
likely experienced, often the toughest part of a Bayesian analysis is building or simulating the
posterior model. Once we have that piece in place, it's fairly straightforward to utilize this
posterior for estimation, hypothesis testing, and prediction. In contrast, building up the
formulas to perform the analogous frequentist calculations is often less intuitive.
We can also bask in the ease with which Bayesian results can be interpreted. In general, a
Bayesian analysis assesses the uncertainty regarding an unknown parameter π in light of
observed data Y. For example, consider the artist study. In light of observing that Y = 14 of
100 sampled artists were Gen X or younger, we determined that there was an 84.9% posterior
chance that Gen X representation at the entire museum, π, falls below 0.20:
P (π < 0.20 | Y = 14) = 0.849.
This calculation doesn't make sense in a frequentist analysis. Flipping the script, a frequentist
analysis assesses the uncertainty of the observed data Y in light of assumed values of π. For
example, the frequentist counterpart to the Bayesian posterior probability above is the p-value,
the formula for which we won't dive into here:
P (Y ≤ 14 | π = 0.20) = 0.08.
The opposite order of the conditioning in this probability, Y given π instead of π given Y,
leads to a different calculation and interpretation than the Bayesian probability: if π were only
0.20, then there's only an 8% chance we'd have observed a sample in which at most Y = 14 of
100 artists were Gen X. It's not our writing here that's awkward, it's the p-value. Though it
does provide us with some interesting information, the question it answers is a little less
natural for the human brain: since we actually observed the data but don't know π, it can be a
mind bender to interpret a calculation that assumes the opposite. Mainly, when testing
hypotheses, it's more natural to ask “how probable is my hypothesis?” (what the Bayesian
probability answers) than “how probable is my data if my hypothesis weren't true?” (what the
frequentist probability answers). Given how frequently p-values are misinterpreted, and hence
misused, they're increasingly being de-emphasized across the entire frequentist and Bayesian
spectrum.
8.6 Chapter summary
In Chapter 8, you learned how to turn a posterior model into answers. That is, you utilized
posterior models, exact or approximate, to perform three posterior analysis tasks for an unknown parameter π: estimation, hypothesis testing, and prediction.
8.7 Exercises
8.7.1 Conceptual exercises
Exercise 8.1 (Posterior analysis). What are the three common tasks in a posterior analysis?
a) In estimating some parameter λ, what are some drawbacks to only reporting the
central tendency of the λ posterior model?
b) The 95% credible interval for λ is (1,3.4). How would you interpret this?
Exercise 8.3 (Hypothesis testing?). In each situation below, indicate whether the issue at hand
could be addressed using a hypothesis test.
a) Your friend Trichelle claims that more than 40% of dogs at the dog park do not have a
dog license.
b) Your professor is interested in learning about the proportion of students at a large
university who have heard of Bayesian statistics.
c) An environmental justice advocate wants to know if more than 60% of voters in their
state support a new regulation.
d) Sarah is studying Ptolemy's Syntaxis Mathematica text and wants to investigate the
number of times that Ptolemy uses a certain mode of argument per page of text. Based
on Ptolemy's other writings she thinks it will be about 3 times per page. Rather than
reading all 13 volumes of Syntaxis Mathematica, Sarah takes a random sample of 90
pages.
Exercise 8.4 (Bayes Factor). Answer questions about Bayes Factors from your friend Enrique
who has a lot of frequentist statistics experience, but is new to Bayes.
Exercise 8.7 (Credible intervals: Part II). For each situation, find the appropriate credible interval using the “middle” approach.
Exercise 8.8 (Credible intervals: highest posterior density). There's more than one approach to
constructing a 95% credible interval. The “middle 95%” approach reports the range of the
middle 95% of the posterior density, from the 2.5th to the 97.5th percentile. The “highest
posterior density” approach reports the 95% of posterior values with the highest posterior
densities.
a) Let λ|y~Gamma(1, 5). Construct the 95% highest posterior density credible interval
for λ. Represent this interval on a sketch of the posterior pdf. Hint: The sketch itself
will help you identify the appropriate CI.
b) Repeat part a using the middle 95% approach.
c) Compare the two intervals from parts a and b. Are they the same? If not, how do they
differ and which is more appropriate here?
d) Let μ|y ~ N(−13, 2²). Construct the 95% highest posterior density credible interval for μ.
e) Repeat part d using the middle 95% approach.
Exercise 8.9 (Hypothesis tests: Part I). For parameter π, suppose you have a Beta(1,0.8) prior
model and a Beta(4,3) posterior. You wish to test the null hypothesis that π ≤ 0.4 versus the
alternative that π > 0.4.
a) What is the posterior probability for the alternative hypothesis?
b) Calculate and interpret the posterior odds.
c) Calculate and interpret the prior odds.
d) Calculate and interpret the Bayes Factor.
e) Putting this together, explain your conclusion about these hypotheses to someone who is unfamiliar with Bayesian statistics.
Exercise 8.10 (Hypothesis tests: Part II). Repeat Exercise 8.9 for the following scenario. For parameter μ, suppose you have a N(10, 10²) prior model, a N(5, 3²) posterior, and you wish to test H0: μ ≥ 5.2 versus Ha: μ < 5.2.
a) Identify the posterior pdf of π given the observed data Y = y, f(π|y). NOTE: This will depend upon (y, n, α, β, π).
b) Suppose we conduct n′ new trials (where n′ might differ from our original number of trials n) and let Y′ = y′ be the observed number of successes in these new trials. Identify the conditional pmf of Y′ given π, f(y′|π). NOTE: This will depend upon (y′, n′, π).
c) Identify the posterior predictive pmf of Y′, f(y′|y). NOTE: This pmf can be found using (8.4).
d) As with the example in Section 8.3, suppose your posterior model of π is based on a prior model with α = 4 and β = 6 and an observed y = 14 successes in n = 100 original trials. We plan to conduct n′ = 20 new trials. Specify the posterior predictive pmf of Y′, the number of successes we might observe in these 20 trials. NOTE: This should match (8.5).
e) Continuing part d, suppose instead we plan to conduct n′ = 4 new trials. Specify and sketch the posterior predictive pmf of Y′, the number of successes we might observe in these 4 trials.
c) Identify the posterior predictive pmf of Y′, f(y′|y). NOTE: This will depend upon (y, y′, s, r).
Exercise 8.15 (Climate change: hypothesis testing). Continuing the analysis from Exercise
8.14, suppose you wish to test a researcher's claim that more than 10% of people believe in
climate change: H0: π ≤ 0.1 versus Ha: π > 0.1.
a) What decision might you make about these hypotheses utilizing the credible interval
from the previous exercise?
b) Calculate and interpret the posterior probability of Ha.
c) Calculate and interpret the Bayes Factor for your hypothesis test.
d) Putting this together, explain your conclusion about π.
Exercise 8.16 (Climate change with MCMC: simulation). In the next exercises, you'll repeat
and build upon your climate change analysis using MCMC simulation.
a) Simulate the posterior model of π, the proportion of U.S. adults that do not believe in
climate change, with rstan using 4 chains and 10000 iterations per chain.
b) Produce and discuss trace plots, overlaid density plots, and autocorrelation plots for
the four chains.
c) Report the effective sample size ratio and R-hat values for your simulation, explaining
what these values mean in context.
Exercise 8.17 (Climate change with MCMC: estimation and hypothesis testing).
a) Suppose you were to survey 100 more adults. Use your MCMC simulation to
approximate the posterior predictive model of Y′, the number that don't believe in
climate change. Construct a histogram visualization of this model.
b) Summarize your observations of the posterior predictive model of Y′.
c) Approximate the probability that at least 20 of the 100 people don't believe in climate
change.
Exercise 8.19 (Penguins: estimation). Let μ denote the typical flipper length (in mm) among the Adelie penguin species. To learn about μ, we'll utilize flipper measurements (Y1, Y2, …, Yn) on a sample of Adelie penguins.
Exercise 8.20 (Penguins: hypothesis testing). Let's continue our analysis of μ, the typical flipper length (in mm) among the Adelie penguin species.
a) You hypothesize that the average Adelie flipper length is somewhere between 200mm
and 220mm. State this as a formal hypothesis test (using H0, Ha, and μ notation).
NOTE: This is a two-sided hypothesis test!
b) What decision might you make about these hypotheses utilizing the credible interval
from the previous exercise?
c) Calculate and interpret the posterior probability that your hypothesis is true.
d) Putting this together, explain your conclusion about μ.
Exercise 8.21 (Loons: estimation). The loon is a species of bird common to the Ontario region
of Canada. Let λ denote the typical number of loons observed by a birdwatcher across a 100-
hour observation period. To learn about λ, we'll utilize bird counts (Y1, Y2, …, Yn) collected
in n different outings.
Exercise 8.22 (Loons: hypothesis testing). Let's continue our analysis of λ, the typical rate of
loon sightings in a 100-hour observation period.
a) You hypothesize that birdwatchers should anticipate a rate of less than 1 loon per
observation period. State this as a formal hypothesis test (using H0, Ha, and λ
notation).
b) What decision might you make about these hypotheses utilizing the credible interval
from the previous exercise?
c) Calculate and interpret the posterior probability that your hypothesis is true.
d) Putting this together, explain your conclusion about λ.
Exercise 8.23 (Loons with MCMC: simulation). In the next exercises, you'll repeat your loon
analysis using MCMC simulation.
a) Simulate the posterior model of λ, the typical rate of loon sightings per observation
period, with rstan using 4 chains and 10000 iterations per chain.
b) Perform some MCMC diagnostics to confirm that your simulation has stabilized.
c) Utilize your MCMC simulation to approximate a (middle) 95% posterior credible
interval for λ. Do so using the tidy() shortcut function as well as a direct
calculation from your chain values.
d) Utilize your MCMC simulation to approximate the posterior probability that λ < 1.
e) How close are the approximations in parts c and d to the actual corresponding
posterior values you calculated in Exercises 8.21 and 8.22?
a) Use your MCMC simulation to approximate the posterior predictive model of Y′, the
number of loons that a birdwatcher will spy in their next observation period. Construct
a histogram visualization of this model.
b) Summarize your observations of the posterior predictive model of Y′.
c) Approximate the probability that the birdwatcher observes 0 loons in their next
observation period.
Unit III
Bayesian Regression &
Classification
9
Simple Normal Regression
DOI: 10.1201/9780429288340-9
Welcome to Unit 3!
Our work in Unit 1 (learning how to think like Bayesians and build simple
Bayesian models) and Unit 2 (exploring how to simulate and analyze
these models), sets us up to expand our Bayesian toolkit to more
sophisticated models in Unit 3. Thus, far, our models have focused on the
study of a single data variable Y. For example, in Chapter 4 we studied Y,
whether or not films pass the Bechdel test. Yet once we have a grip on the
variability in Y, we often have follow-up questions: can the passing /
failing of the Bechdel test be explained by a film's budget, genre, release
date, etc.?
In general, we often want to model the relationship between some
response variable Y and predictors (X1, X2, …, Xp). This is the shared
goal of the remaining chapters, which will survey a broad set of Bayesian
modeling tools that we conventionally break down into two tasks:
Regression tasks are those that analyze and predict quantitative
response variables (e.g., Y = hippocampal volume).
Classification tasks are those that analyze categorical response
variables with the goal of predicting or classifying the response
category (e.g., classify Y, whether a news article is real or fake).
We'll survey a few Bayesian regression techniques in Chapters 9 through
12: Normal, Poisson, and Negative Binomial regression. We'll also survey
two Bayesian classification techniques in Chapters 13 and 14: logistic regression and naive Bayesian classification. Though we can't hope to introduce you to every regression and classification tool you'll ever need, the five we've chosen here are generalizable to a broader set of
applications. At the outset of this exploration, we encourage you to focus
on the Bayesian modeling principles. By focusing on principles over a
perceived set of rules (which don't exist), you'll empower yourself to
extend and apply what you learn here beyond the scope of this book.
In Chapter 9 we'll start with the foundational Normal regression model for a
quantitative response variable Y. Consider the following data story. Capital
Bikeshare is a bike sharing service in the Washington, D.C. area. To best serve
its registered members, the company must understand the demand for its
service. To help them out, we can analyze the number of rides taken on a
random sample of n days, (Y1, Y2, …, Yn). Though Yi is a count variable, a reasonable starting point is the Normal-Normal model of Chapter 5 (9.1):

Yi | μ, σ ~ N(μ, σ²)
μ ~ N(θ, τ²).
(9.1)
Yet we can greatly extend the power of this model by tweaking its
assumptions. First, you might have scratched your head at the assumption that
we don't know the typical ridership μ but do know the variability in ridership
from day to day, σ. You'd be right. This assumption typically breaks down
outside textbook examples. No problem. We can generalize the Normal-
Normal model (9.1) to accommodate the reality that σ is a second unknown
parameter by including a corresponding prior model:
Yi | μ, σ ~ N(μ, σ²), independently for each i
μ ~ N(θ, τ²)
σ ~ …
(9.2)
Goals
Upon building a Bayesian simple linear regression model of response
variable Y versus predictor X, you will:
interpret appropriate prior models for the regression parameters;
simulate the posterior model of the regression parameters; and
utilize simulation results to build a posterior understanding of the
relationship between Y and X and to build posterior predictive models
of Y.
To get started, load the following packages, which we'll utilize throughout the
chapter:
{(Y1, X1), (Y2, X2), …, (Yn, Xn)}
where Yi is the number of rides and Xi is the high temperature (in degrees
Fahrenheit) on day i. We can check this assumption once we have some data,
but our experience suggests there's a positive linear relationship between
ridership and temperature – the warmer it is, the more likely people are to hop
on their bikes. For example, we might see data like that in the two scenarios in
Figure 9.2 where each dot reflects the ridership and temperature on a unique
day. Thus, instead of focusing on the global mean ridership across all days
combined (μ), we can refine our analysis to the local mean ridership on day i,
μi, specific to the temperature on that day. Assuming the relationship between
ridership and temperature is linear, we can write μi as
μi = β0 + β1 Xi
frigid temperature is far outside the norm for D.C., we shouldn't put stock
into this interpretation. Rather, we can think of β0 as providing a baseline
for where our model “lives” along the y-axis.
Temperature coefficient β1 indicates the typical change in ridership for every one degree increase in temperature.
For example, the model lines in Figure 9.2 both have intercept β0 = −2000
and slope β1 = 100. The intercept just tells us that if we extended the line all
the way down to 0 degrees Fahrenheit, it would cross the y-axis at −2000. The
slope is more meaningful, indicating that for every degree increase in
temperature, we'd expect 100 more riders.
FIGURE 9.2: Two simulated scenarios for the relationship between ridership
and temperature, utilizing σ = 2000 (left) and σ = 200 (right). In both cases,
the model line is defined by β0 + β1 x = −2000 + 100x.
We can plunk this assumption of a linear relationship between Yi and Xi right
into our Bayesian model, by replacing the global mean μ in the Normal data
model, Yi | μ, σ ~ N(μ, σ²), with the temperature-specific local mean μi:

Yi | β0, β1, σ ~ N(μi, σ²) with μi = β0 + β1 Xi, independently for each day i.     (9.3)
The right scenario in Figure 9.2 exhibits a relatively small value of σ = 200. The observed data here deviates very
little from the mean model line – we can expect the observed ridership on a
given day to differ by only 200 rides from the mean ridership on days of the
same temperature. This tightness around the mean model line indicates that
temperature is a strong predictor of ridership. The opposite is true in the left
plot which exhibits a larger σ = 2000. There is quite a bit of variability in
ridership among days of the same temperature, reflecting a weaker relationship
between these variables. In summary, the formal assumptions encoded by data
model (9.3) are included below.
Quiz Yourself!
Identify the regression parameters upon which the data model (9.3)
depends.
In the data model (9.3), there are two data variables (Y and X) and three
unknown regression parameters that encode the relationship between these
variables: β0, β1, and σ. We must specify prior models for each. There are
countless approaches to this task. We won't and can't survey them all. Rather,
throughout this book we'll utilize the default framework of the prior models
used by the rstanarm package. Working within this framework will allow us to
survey a broad range of modeling tools in Units 3 and 4. Once you're
comfortable with the general modeling concepts therein and are ready to
customize, you can take that leap. In doing so, we recommend Gabry and
Goodrich (2020b), which provides an overview of all possible prior structures
in rstanarm.
The first assumption we'll make is that our prior models of β0, β1, and σ are
independent. That is, we'll assume that our prior understanding of where the
model “lives” (β0) has nothing to do with our prior understanding of the rate at
which ridership increases with temperature (β1). Similarly, we'll assume that
our prior understanding of σ, and hence the strength of the relationship, is
unrelated to both β0 and β1. Though in practice we might have some prior
notion about the combination of these parameters, the assumption of
independence greatly simplifies the model. It's also consistent with the
rstanarm framework.
In specifying the structure of the independent priors, we must consider (as
usual) the values that these parameters might take. To this end, the intercept
and slope regression parameters, β0 and β1, can technically take any values in
the real line. That is, a model line can cross anywhere along the y-axis and the
slope of a line can be any positive or negative value (or even 0). Thus, it's
reasonable to utilize Normal prior models for β0 and β1, which also live on the
entire real line. Specifically,

β0 ~ N(m0, s0²)
β1 ~ N(m1, s1²)     (9.4)
where we can tune the m0, s0, m1, s1 hyperparameters to match our prior
understanding of β0 and β1. Similarly, since the standard deviation parameter σ
must be positive, it's reasonable to utilize an Exponential model which is also
restricted to positive values:
σ ~ Exp(l),     (9.5)

under which E(σ) = 1/l and SD(σ) = 1/l.
Step back and reflect upon how we got here. We built a regression model of Y,
not all at once, but by starting with and building upon the simple Normal-
Normal model one step at a time. In contrast, our human instinct often draws
us into starting with the most complicated model we can think of. When this
instinct strikes, resist it and remember this: complicated is not necessarily
sophisticated. Complicated models are often wrong and difficult to apply.
Take note of the values that each of these parameters might take.
Accordingly, identify appropriate prior models for these parameters.
illustrated in Figure 9.3. Processing the prior information about the model
baseline in this way is more intuitive. In fact, it's this centered information that
we'll supply when using rstanarm to simulate our regression model. With this,
we can capture prior assumption 1 with a Normal model for β0c which is
centered at 5000 rides with a standard deviation of 1000 rides, and thus largely
falls between 3000 and 7000 rides: β0c ~ N(5000, 1000²). This prior is drawn in
Figure 9.4.
FIGURE 9.4: Prior models for the parameters in the regression analysis of
bike ridership, (β0c, β1, σ).
Plugging our tuned priors into (9.6), the Bayesian regression model of
ridership (Y) by temperature (X) is specified as follows:
Since our model utilizes independent priors, we separately processed our prior
information on β0, β1, and σ above. Yet we want to make sure that, when
combined, these priors actually reflect our current understanding of the
relationship between ridership and temperature. To this end, Figure 9.5
presents various scenarios simulated from our prior models.3 The 200 prior
model lines, β0 + β1 X (left), do indeed capture our prior understanding that
ridership tends to increase with temperature. Further, the four datasets (right)
simulated from these models should be consistent with ridership data we'd
actually expect to see in practice. That's indeed the case here. The rate of
increase in ridership with temperature, the baseline ridership, and the
variability in ridership are consistent with our prior assumptions.
FIGURE 9.5: Simulated scenarios under the prior models of β0, β1, and σ. At
left are 200 prior plausible model lines, β0 + β1 X. At right are 4 prior
plausible datasets.
_________________________
3 You'll learn to construct these plots in Section 9.7. For now, we'll focus on the concepts.
We can now combine the information from this data with that from the prior to
build a posterior model for parameters (β0, β1, σ). Our inclination here might be to
do so mathematically. By the assumed independence of our priors, the joint prior
pdf of the parameters is

f(β0, β1, σ) = f(β0) f(β1) f(σ).

Further, the likelihood function of the parameters given the independent observed data
y⃗ = (y1, y2, …, yn) is defined by the joint pdf of y⃗ which, in turn, is the product of the marginal
pdfs defined by the Normal data structure (9.3):

L(β0, β1, σ | y⃗) = f(y⃗ | β0, β1, σ) = ∏_{i=1}^{n} f(yi | β0, β1, σ).

Bayes' Rule then builds the posterior pdf from these pieces:

f(β0, β1, σ | y⃗) = (prior ⋅ likelihood) / (∫∫∫ prior ⋅ likelihood).

_________________________
4 The bikes data was extracted from a larger dataset to match our pedagogical goals, and thus
should be used for illustration purposes only. Type ?bikes in the console for a detailed codebook.
Specifying this posterior pdf in full would require us to specify and combine all 3 + n
pdfs involved. If you went through the tedious work of plugging in the formulas for these 3 + n
pdfs in the products above, you wouldn't discover a familiar structure. Thus, if
you really wanted to specify the posterior pdf, you'd need to calculate the
normalizing constant. But you might not get far – the constant which
guarantees that f(β0, β1, σ | y⃗) integrates to 1 across all possible sets of (β0, β1, σ)
is a triple integral of our complicated prior-times-likelihood product,
f(β0) f(β1) f(σ) ⋅ [∏_{i=1}^{n} f(yi | β0, β1, σ)]:

∫∫∫ f(β0) f(β1) f(σ) ⋅ [∏_{i=1}^{n} f(yi | β0, β1, σ)] dβ0 dβ1 dσ.
Let's not. Instead, we can utilize Markov chain Monte Carlo simulation to approximate this posterior, here via the stan_glm() function in the rstanarm package:
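A sketch of such a call; the intercept prior matches the N(5000, 1000²) prior stated above, while the slope and σ hyperparameters below are illustrative stand-ins rather than the book's tuned values:

bike_model <- stan_glm(
  rides ~ temp_feel, data = bikes,
  family = gaussian,
  prior_intercept = normal(5000, 1000),   # centered intercept prior from the text
  prior = normal(100, 40),                # stand-in slope prior (assumption)
  prior_aux = exponential(0.0008),        # stand-in sigma prior (assumption)
  chains = 4, iter = 5000*2, seed = 84735)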
The syntax above is common to the other rstanarm models we'll see in this
book and looks more intimidating than it is. In general, stan_glm() requires
three types of information:
Data information
The first three stan_glm() arguments specify the structure of our data:
we want to model ridership by temperature (rides ~ temp_feel) using
data = bikes and assuming a Normal data model, aka family =
gaussian.
Prior information
The prior_intercept, prior, and prior_aux arguments specify
the priors of β0c, β1, and σ, respectively. These match the priors defined by
(9.7).
Markov chain information
The remaining arguments specify the structure of our MCMC simulation:
the number of Markov chains to run, the length or number of iterations
of each chain, and the random number seed to use.
After tossing out the first half of Markov chain values from the learning or
burn-in phase, the stan_glm() simulation produces four parallel chains of
length 5000 for each model parameter: {β0^(1), β0^(2), …, β0^(5000)}, and
similarly for β1 and σ.
We come to similar conclusions from the trace and density plots (Figure 9.7).
FIGURE 9.7: Trace and density plots for the bike model posterior simulation.
data: The data on variables Y and X, rides and temperature, will be vectors
of length n.
parameters: Our two regression coefficients beta0 and beta1 (β0 and
β1) can both be any real number whereas the standard deviation parameter
sigma (σ) must be non-negative.
model: The data model of Y is normal with mean beta0 + beta1 *
X and standard deviation sigma. Further, with the exception of beta0, the
priors are similar to those in our stan_glm() syntax. Using rstan, we
must directly express our prior understanding of the intercept β0, not the
centered intercept β0c. In this case, we can extend our prior understanding
that there are typically 5000 riders on a 70-degree day, to there being -2000
hypothetical riders on a 0-degree day (Figure 9.3).
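The program itself does not appear here; a sketch of what the three blocks just described might look like, written as an R string for rstan (the prior hyperparameters are illustrative stand-ins):

stan_bike_model <- "
  data {
    int<lower = 0> n;
    vector[n] Y;
    vector[n] X;
  }
  parameters {
    real beta0;
    real beta1;
    real<lower = 0> sigma;
  }
  model {
    Y ~ normal(beta0 + beta1 * X, sigma);
    beta0 ~ normal(-2000, 1000);   // prior on the un-centered intercept (assumption)
    beta1 ~ normal(100, 40);       // stand-in slope prior (assumption)
    sigma ~ exponential(0.0008);   // stand-in sigma prior (assumption)
  }
"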
In step 2, we simulate the posterior model of (β0, β1, σ) using the stan()
function. The only difference here from the models we simulated in Chapter 6
is that we have more pieces of data: data on sample size n, response variable Y,
and predictor X:
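A sketch of this step, assuming the model string defined above:

bike_stan_sim <- stan(
  model_code = stan_bike_model,
  data = list(n = nrow(bikes), Y = bikes$rides, X = bikes$temp_feel),
  chains = 4, iter = 5000*2, seed = 84735)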
object and rstanarm bike_model object slightly differ. And now that we've
made the connection between rstan and rstanarm here, moving forward we'll
focus on the rstanarm shortcuts and their output. Should you wish to learn
more about rstan, the Stan development team provides an excellent resource
(Stan development team, 2019).
relationship is
−2194.24 + 82.16X.
(9.8)
That is, for every one degree increase in temperature, we expect ridership to
increase by roughly 82 rides. There is, of course, posterior uncertainty in this
relationship. For example, the 80% posterior credible interval for β1, (75.6,
88.8), indicates that this slope could range anywhere between 76 and 89. To
combine this uncertainty in β1 with that in β0 for a better overall picture of our
model, notice that the Markov chain simulations provide 20,000 posterior
plausible pairs of β0 and β1 values:
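One way to examine these pairs is to store the Markov chain as a data frame (a sketch; the object name bike_model_df is used again below):

bike_model_df <- as.data.frame(bike_model)
dim(bike_model_df)       # 20,000 rows, one per posterior plausible parameter set
head(bike_model_df, 2)   # columns: (Intercept), temp_feel, sigma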
These pairs provide 20,000 alternative scenarios for the typical relationship
between ridership and temperature, β0 + β1 X, and thus capture our overall
uncertainty about this relationship. For example, the first pair indicates the
plausibility that β0 + β1 X = −2657 + 88.2 X. The second pair has a higher
intercept and a smaller slope. Below we plot just 50 of these 20,000 posterior
plausible mean models, β0^(i) + β1^(i) X. This is a multi-step process:
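One minimal sketch of these steps (not necessarily the book's own), overlaying the first 50 parameter sets in bike_model_df as lines on the observed data:

first_50 <- head(bike_model_df, 50)
ggplot(bikes, aes(x = temp_feel, y = rides)) +
  geom_point(size = 0.5) +
  geom_abline(data = first_50,
              aes(intercept = `(Intercept)`, slope = temp_feel),
              alpha = 0.25)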
Comparing the posterior plausible models in Figure 9.8 to the prior plausible
models in Figure 9.5 reveals the evolution in our understanding of ridership.
First, the increase in ridership with temperature appears to be less steep than
we had anticipated. Further, the posterior plausible models are far less
variable, indicating that we're far more confident about the relationship
between ridership and temperature upon observing some data. Once you've
reflected on the results above, quiz yourself.
Quiz Yourself!
Do we have ample posterior evidence that there's a positive association
between ridership and temperature, i.e., that β1 > 0? Explain.
The answer to the quiz is yes. We can support this answer with three types of
evidence.
Visual evidence
In our visual examination of 50 posterior plausible scenarios for the
relationship between ridership and temperature (Figure 9.8), all exhibited
positive associations. A line exhibiting no relationship (β1 = 0) would be a
clear outlier among these posterior plausible lines.
Finally, let's examine the posterior results for σ, the degree to which ridership
varies on days of the same temperature. Above we estimated that σ has a
posterior median of 1281 and an 80% credible interval (1231, 1336). Thus, on
average, we can expect the observed ridership on a given day to fall 1281 rides
from the average ridership on days of the same temperature. Figure 9.9 adds
some context, presenting four simulated sets of ridership data under four
posterior plausible values of σ. At least visually, these plots exhibit similarly
moderate relationships, indicating relative posterior certainty about the
strength of the relationship between ridership and temperature. The syntax here
is quite similar to that used for plotting the plausible regression lines
in Figure 9.8. The main difference is that we've replaced the mean model lines,
β0 + β1 X, with full ridership outcomes simulated from the Normal data model.
FIGURE 9.9: Four datasets simulated from the posterior models of β0, β1, and
σ.
Quiz Yourself!
Suppose a weather report indicates that tomorrow will be a 75-degree day
in D.C. What's your posterior guess of the number of riders that Capital
Bikeshare should anticipate?
Your natural first crack at this question might be to plug the 75-degree
temperature into the posterior median model (9.8). Thus, we expect that there
will be 3968 riders tomorrow:
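That is, plugging X = 75 into the posterior median model (9.8):

−2194.24 + 82.16 · 75 = −2194.24 + 6162 = 3967.76 ≈ 3968 rides.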
BUT, recall from Section 8.4.3 that this singular prediction ignores two
potential sources of variability: the posterior variability in the model line
defined by β0 and β1, as well as that in σ, the degree to which observations
might deviate from the model lines.
The posterior predictive model of a new data point Ynew accounts for both
sources of variability. Specifically, the posterior predictive pdf captures the
overall chance of observing Ynew = ynew by weighting the chance of this
outcome under any given parameter set, f(ynew | β0, β1, σ), by the
posterior plausibility of these parameters, f(β0, β1, σ | y⃗). Mathematically
speaking:

f(ynew | y⃗) = ∫∫∫ f(ynew | β0, β1, σ) f(β0, β1, σ | y⃗) dβ0 dβ1 dσ.
Now, we don't actually have a nice, tidy formula for the posterior pdf of our
regression parameters, f(β0, β1, σ | y⃗), and thus can't get a nice tidy formula for
the posterior predictive pdf f(ynew | y⃗). What we do have is 20,000 sets of
parameters in the Markov chain, (β0^(i), β1^(i), σ^(i)). We can then approximate the
posterior predictive model of Ynew by simulating a prediction from the Normal
data model evaluated at each parameter set:

Y_new^(i) | β0, β1, σ ~ N(μ^(i), (σ^(i))²)  with  μ^(i) = β0^(i) + β1^(i) · 75.

Thus, each of the 20,000 parameter sets in our Markov chain (left) produces a
unique prediction (right):

    β0^(1)       β1^(1)       σ^(1)             Y_new^(1)
    β0^(2)       β1^(2)       σ^(2)       →     Y_new^(2)
      ⋮            ⋮            ⋮                   ⋮
    β0^(20000)   β1^(20000)   σ^(20000)          Y_new^(20000)

The resulting collection {Y_new^(1), Y_new^(2), …, Y_new^(20000)} approximates the
posterior predictive model of Ynew, the ridership tomorrow. To build some intuition,
consider the first parameter set, (β0^(1), β1^(1), σ^(1)), under which the typical
relationship between ridership and temperature is

μ = β0^(1) + β1^(1) X = −2657 + 88.16 X,

and thus the average ridership on a 75-degree day is −2657 + 88.16 · 75 ≈ 3955 rides.
To capture the sampling variability around this average, i.e., the fact that not
all 75-degree days have the same ridership, we can simulate our first official
prediction Y_new^(1) by taking a random draw from the Normal model specified by
this first parameter set:

Y_new | β0, β1, σ ~ N(3955, 1323²).
Now let's do this 19,999 more times. That is, let's follow the same two-step
process to simulate a prediction of ridership from each of the 20,000 sets of
regression parameters i in bike_model_df: (1) calculate the average
ridership on 75-degree days, μ^(i) = β0^(i) + β1^(i) · 75; then (2) sample from the
Normal model centered at this average with standard deviation σ^(i).
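A sketch of these two steps, using the bike_model_df data frame of posterior parameter sets:

set.seed(84735)
predict_75 <- bike_model_df %>%
  mutate(mu = `(Intercept)` + temp_feel * 75,
         y_new = rnorm(20000, mean = mu, sd = sigma))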
The first 3 sets of average ridership (mu) and predicted ridership on a specific
day (y_new) are shown here along with the first 3 posterior plausible
parameter sets from which they were generated ((Intercept),
temp_feel, sigma):
Whereas the collection of 20,000 mu values approximates the posterior model
for the typical ridership on 75-degree days, μ = β0 + β1 · 75, the 20,000
y_new values approximate the posterior predictive model of ridership on a
specific, individual 75-degree day.
In the plots of these two posterior models (Figure 9.10), you'll immediately pick
up the fact that, though they're centered at roughly the same value, the
posterior model for mu is much narrower than the posterior predictive model for y_new.
Specifically, the 95% credible interval for the typical number of rides on a 75-
degree day, μ, ranges from 3843 to 4095. In contrast, the 95% posterior
prediction interval for the number of rides tomorrow has a much wider range
from 1500 to 6482.
FIGURE 9.10: The posterior model of μ, the typical ridership on a 75-degree
day (left), and the posterior predictive model of the ridership tomorrow, a
specific 75-degree day (right).
These two 95% intervals are represented on a scatterplot of the observed data
(Figure 9.11), clarifying that the posterior model for μ merely captures the
uncertainty in the average ridership on all X = 75-degree days. Since there is
so little uncertainty about this average, this interval visually appears like a wee
dot! In contrast, the posterior predictive model for the number of rides
tomorrow (a specific day) accounts for not only the average ridership on a 75-
degree day, but the individual variability from this average. The punchline?
There's more accuracy in anticipating the average behavior across multiple
data points than the unique behavior of a single data point.
FIGURE 9.11: 95% posterior credible intervals (blue) for the average
ridership on 75-degree days (left) and the predicted ridership for tomorrow, an
individual 75-degree day (right).
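For reference, rstanarm can run this same prediction simulation for us; a sketch:

set.seed(84735)
shortcut_prediction <- posterior_predict(bike_model,
                                         newdata = data.frame(temp_feel = 75))
# 20,000 predictions of ridership on a 75-degree day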
After each data collection phase, we can re-simulate the posterior model by
plugging in the accumulated data (phase_1, phase_2, or phase_3):
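A sketch of one such re-simulation, assuming phase_1 (and similarly phase_2, phase_3) is a data frame of the accumulated rides and temp_feel data; update() re-runs the same stan_glm() specification on the new data:

phase_1_model <- update(bike_model, data = phase_1)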
Figure 9.12 displays the posterior models for the temperature coefficient β1
after each phase of data collection, and thus the evolution in our understanding
of the relationship between ridership and temperature. What started in Phase 1
as a vague understanding that there might be no relationship between ridership
and temperature (β1 values near 0 are plausible), evolved into clear
understanding by Phase 3 that ridership tends to increase by roughly 80 rides
per one degree increase in temperature.
FIGURE 9.13: Approximate posterior models for the temperature coefficient
β1 after three phases of data collection.
Figure 9.13 provides more insight into this evolution, displaying the
accumulated data and 100 posterior plausible models at each phase in our
analysis. Having observed only 30 data points, the Phase 1 posterior models
are all over the map. Further, since these 30 data points happened to land on
cold days in the winter, our Phase 1 information did not yet reveal that
ridership tends to increase on warmer days. Over Phases 2 and 3, we not only
gathered more data, but data which allows us to examine the ridership across
the full spectrum of temperatures in Washington, D.C. By the end of Phase 3,
we have great posterior certainty of the positive association between these two
variables. Again, this kind of evolution in our understanding is how learning,
science, progress happen. Knowledge is built up, piece by piece, over time.
This syntax specifies the following priors: β0c ~ N(5000, 2.5²), β1 ~ N(0, 2.5²),
and σ ~ Exp(1). With a twist. Consider the priors for β0c and β1. Assuming we
have a weak prior understanding of these parameters, and hence their scales,
we're not really sure whether a standard deviation of 2.5 is relatively small or
relatively large. Thus, we're not really sure if these priors are more specific
than we want them to be. This is why we also set autoscale = TRUE. By
doing so, stan_glm() adjusts or scales our default priors to optimize the
study of parameters which have different scales.5 These adjusted priors are
reported by the prior_summary() function and match our reported model
formulation (9.9):
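A sketch of simulating the model under these defaults and inspecting the adjusted priors (the object name is an assumption):

bike_model_default <- stan_glm(
  rides ~ temp_feel, data = bikes, family = gaussian,
  prior_intercept = normal(5000, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
prior_summary(bike_model_default)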
_________________________
5 If you have some experience with Bayesian modeling, you might be wondering about whether or
not we should be standardizing predictor X. The rstanarm manual recommends against this, noting that
the same ends are achieved through the default scaling of the prior models (Gabry and Goodrich,
2020c).
simulated data points even include negative ridership values! Further, the
simulated datasets reflect our uncertainty about whether the relationship is
strong (with σ near 0) or weak (with large σ). Yet, by utilizing weakly
informative priors instead of totally vague priors, our prior uncertainty is still
in the right ballpark. Our priors focus on ridership being in the thousands
(reasonable), not in the millions or billions (unreasonable for a city of
Washington D.C.'s size).
FIGURE 9.15: Simulated scenarios under the default prior models of β0, β1,
and σ. At left are 200 prior plausible model lines, β0 + β1 x. At right are 4
prior plausible datasets.
9.10 Exercises
9.10.1 Conceptual exercises
Exercise 9.1 (Normal regression priors). For the Normal regression model
(9.6) with Yi | β0, β1, σ ~ N(μi, σ²) where μi = β0 + β1 Xi, we utilized Normal
prior models for β0 and β1, and an Exponential prior for σ.
Exercise 9.2 (Identify the variable). Identify the response variable (Y) and
predictor variable (X) in each given relationship of interest.
a) We want to use a person's arm length to understand their height.
b) We want to predict a person's carbon footprint (in annual CO2
emissions) with the distance between their home and work.
c) We want to understand how a child's vocabulary level might increase
with age.
d) We want to use information about a person's sleep habits to predict
their reaction time.
Exercise 9.3 (Interpreting coefficients). In each situation below, suppose that
the typical relationship between the given response variable Y and predictor X
can be described by β0 + β1 X. Interpret the meaning of β0 and β1 and indicate
whether you anticipate the association between Y and X to be positive or negative.
Exercise 9.4 (Deviation from the average). Consider the Normal regression
model (9.6). Explain in one or two sentences, in a way that one of your non-
stats friends could understand, how σ is related to the strength of the
relationship between a response variable Y and predictor X.
Exercise 9.5 (Bayesian model building: Part I). A researcher wants to use a
person's age (in years) to predict their annual orange juice consumption (in
gallons). Here you'll build up a relevant Bayesian regression model, step by
step.
Exercise 9.6 (Bayesian model building: Part II). Repeat the above exercise for
the following scenario. A researcher wishes to predict tomorrow's high
temperature by today's high temperature.
Exercise 9.7 (Posterior simulation T/F). Mark each statement about posterior
regression simulation as True or False.
Exercise 9.8 (Posterior simulation). For each situation, specify the appropriate
stan_glm() syntax for simulating the Normal regression model using 4
chains, each of length 10000. (You won't actually run any code.)
Exercise 9.9 (How humid is too humid: model building). Throughout this
chapter, we explored how bike ridership fluctuates with temperature. But what
about humidity? In the next exercises, you will explore the Normal regression
model of rides (Y) by humidity (X) using the bikes dataset. Based on
past bikeshare analyses, suppose we have the following prior understanding of
this relationship:
On an average humidity day, there are typically around 5000 riders, though
this average could be somewhere between 1000 and 9000.
Ridership tends to decrease as humidity increases. Specifically, for every
one percentage point increase in humidity level, ridership tends to decrease
by 10 rides, though this average decrease could be anywhere between 0 and
20.
Ridership is only weakly related to humidity. At any given humidity,
ridership will tend to vary with a large standard deviation of 2000 rides.
Exercise 9.10 (How humid is too humid: data). With the priors in place, let's
examine the data.
Exercise 9.11 (How humid is too humid: posterior simulation). We can now
simulate our posterior model of the relationship between ridership and
humidity, a balance between our prior understanding and the data.
Exercise 9.12 (How humid is too humid: posterior interpretation). Finally, let's
dig deeper into our posterior understanding of the relationship between
ridership and humidity.
Exercise 9.14 (On your own: Part I). Temperature and humidity aren't the only
possible weather factors in ridership. Let's explore the relationship between
ridership (Y) and windspeed (X).
Exercise 9.15 (On your own: Part II). In this open-ended exercise, conduct a
posterior analysis of the relationship between ridership (Y) and windspeed
(X). This should include a discussion of your posterior understanding of this
relationship along with supporting evidence.
Exercise 9.17 (Penguins: data). With the priors in place, let's examine the data.
a) Plot and discuss the observed relationship between
flipper_length_mm and bill_length_mm among the 344
sampled penguins.
b) Does simple Normal regression seem to be a reasonable approach to
modeling this relationship? Explain.
DOI: 10.1201/9780429288340-10
Imagine that we, the authors, invite you over for dinner. It took us hours to
forage mushrooms and cook them up into something delicious. Putting
niceties aside, you might have some questions for us: Do you know what
you're doing? Are these mushrooms safe to eat? After dinner, we offer to
drive you home in a new car that we just built. Before obliging, it would
be wise to check: Is this car safe? How did it perform in crash tests? Just
as one should never eat a foraged mushroom or get in a new car without
questioning their safety, one should never apply a model without first
evaluating its quality. No matter whether we're talking about frequentist or
Bayesian models, “simple” or “big” models, there are three critical
questions to ask. Examining these questions is the goal of Chapter 10.
Goals
The first question in evaluating this or any other Bayesian model is context
specific and gets at the underlying ethical implications: Is the model fair?
We must always ask this question, even when the consideration is
uneventful, as it is in our bike example. Let's break it down into a series of
smaller questions:
Using each of the 20,000 parameter sets from our Markov chain
simulation from Chapter 9, we can predict 500 days of ridership data from
the 500 days of observed temperature data. The end result is 20,000 unique
sets of predicted ridership data, each of size 500, here represented by the
rows of the right matrix:
Specifically, for each parameter set j ∈ {1, 2, …, 20000}, we predict the
ridership on day i ∈ {1, 2, …, 500} by drawing from the Normal data
model evaluated at the observed temperature Xi on day i:

Y_i^(j) | β0, β1, σ ~ N(μ_i^(j), (σ^(j))²)  with  μ_i^(j) = β0^(j) + β1^(j) Xi.

From the observed temp_feel on each of the 500 days in the bikes
data, we simulate a ridership outcome from the Normal data model tuned
to this first parameter set. Check out the original rides alongside the
simulated_rides for the first two data points:
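A sketch of this simulation, using the first row of the bike_model_df data frame of posterior parameter sets:

first_set <- head(bike_model_df, 1)
set.seed(84735)
one_simulation <- bikes %>%
  mutate(mu = first_set$`(Intercept)` + first_set$temp_feel * temp_feel,
         simulated_rides = rnorm(n(), mean = mu, sd = first_set$sigma)) %>%
  select(temp_feel, rides, simulated_rides)
head(one_simulation, 2)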
Of course, on any given day, the simulated ridership is off (very off in the
case of the first day in the dataset). The question is whether, on the whole,
the simulated data is similar to the observed data. To this end, Figure 10.2
compares the density plots of the 500 days of simulated ridership (light
blue) and observed ridership (dark blue):
FIGURE 10.2: One posterior simulated dataset of ridership (light blue)
along with the actual observed ridership data (dark blue).
Though the simulated data doesn't exactly replicate all original ridership
features, it does capture the big things such as the general center and
spread. And before further picking apart this plot, recall that we generated
the simulated data here from merely one of 20,000 posterior plausible
parameter sets. The rstanarm and bayesplot packages make it easy to
repeat the data simulation process for each parameter set. The
pp_check() function plots 50 of these 20,000 simulated datasets
(labeled yrep) against a plot of the original ridership data (labeled y). (It
would be computationally and logically excessive to examine all 20,000
sets.)
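A sketch of this check:

pp_check(bike_model, nreps = 50) +
  xlab("rides")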
Naturally, how well any particular adult performs in one trial tells us
something about how well they might perform in another. Thus, within any
given subject, the observed Y values are dependent. You will learn how to
incorporate and address such dependent, grouped data using hierarchical
Bayesian models in Unit 4.
Correlated data can also pop up when modeling changes in Y over time,
space, or time and space. For example, we might study historical changes
in temperatures Y in different parts of the world. In doing so, it would be
unreasonable to assume that the temperatures in one location are
independent of those in neighboring locations, or that the temperatures in
one month don't tell us about the next. Though applying our Normal
Bayesian regression model to study these spatiotemporal dynamics might
produce misleading results, there do exist Bayesian models that are tailor-
made for this task: time series models, spatial regression models, and
their combination, spatiotemporal models. Though beyond the scope of
this book, we encourage the interested reader to check out Blangiardo and
Cameletti (2015).
Next, let's consider violations of assumptions 2 and 3, which often go hand
in hand. Figure 10.4 provides an example. The relationship between Y and
X is nonlinear (violating assumption 2) and the variability in Y increases
with X (violating assumption 3).
Even if we hadn't seen this raw data, the pp_check() (right) would
confirm that a Normal regression model of this relationship is wrong – the
posterior simulations of Y exhibit higher central tendency and variability
than the observed Y values. There are a few common approaches to
addressing such violations of assumptions 2 and 3:
Assume a different data structure. Not all data and relationships are
Normal. In Chapters 12 and 13 we will explore models in which the
data structure of a regression model is better described by a Poisson,
Negative Binomial, or Binomial than by a Normal.
Make a transformation. When the data model structure isn't the issue
we might do the following:
– Transform Y. For some function g(⋅), assume

g(Yi) | β0, β1, σ ~ N(μi, σ²)  with  μi = β0 + β1 Xi.

– Transform X. For some function h(⋅), assume

Yi | β0, β1, σ ~ N(μi, σ²)  with  μi = β0 + β1 h(Xi).
Figure 10.5 confirms that this transformation addresses the violations of
both assumptions 2 and 3: the relationship between log(Y) and X is linear
and the variability in Y is consistent across the range of X values. The
ideal pp_check() at right further confirms that this transformation turns
our model from wrong to good. Better yet, when transformed off the log
scale, we can still use this model to learn about the relationship between Y
and X.
10.3
It's also important to consider the relative, not just absolute, distance
between the observed value and its posterior predictive mean.
Quiz Yourself!
Figure 10.7 compares the posterior predictive model of Y, the
ridership on October 22, 2012, from our Bayesian model to that of an
alternative Bayesian model. In both cases, the posterior predictive
mean is 3967. Which model produces the better posterior predictive
model of the 6228 rides we actually observed on October 22?
a. Our Bayesian model
b. The alternative Bayesian model
c. The quality of these models' predictions is equivalent
FIGURE 10.7: Posterior predictive models of the ridership on October
22, 2012 are shown for our Bayesian regression model (left) and an
alternative model (right), both having a mean of 3967. The actual 6228
rides observed on that day is represented by the black line.
Great – we now understand the posterior predictive accuracy for one case
in our dataset. We can take these same approaches to evaluate the accuracy
for all 500 cases in our bikes data. As discussed in Section 10.2, at each
set of model parameters in the Markov chain, we can predict the 500
ridership values Y from the corresponding temperature data X:
The result is represented by the 20000 × 500 matrix of posterior
predictions (right matrix). Whereas the 20,000 rows of this matrix provide
20,000 simulated sets of ridership data which provide insight into the
validity of our model assumptions (Section 10.2), each of the 500 columns
provides 20,000 posterior predictions of ridership for a unique day in the
bikes data. That is, each column provides an approximate posterior
predictive model for the corresponding day. We can obtain these sets of
20,000 predictions per day by applying posterior_predict() to the
full bikes dataset:
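A sketch:

set.seed(84735)
predictions <- posterior_predict(bike_model, newdata = bikes)
dim(predictions)   # 20,000 rows (posterior draws) by 500 columns (days)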
MAE = median_i |Yi − Yi′|.

mae_scaled
The scaled median absolute error measures the typical number of
standard deviations that the observed Yi fall from their posterior
predictive means Yi′:

MAE_scaled = median_i ( |Yi − Yi′| / sd_i ).
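These summaries can be computed with the prediction_summary() function in the bayesrules package; a sketch, assuming it accepts the model and data as named below:

set.seed(84735)
prediction_summary(model = bike_model, data = bikes)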
Among all 500 days in the dataset, we see that the observed ridership is
typically 990 rides, or 0.77 standard deviations, from the respective
posterior predictive mean. Further, only 43.8% of test observations fall
within their respective 50% prediction interval whereas 96.8% fall within
their 95% prediction interval. This is compatible with what we saw in the
ppc_intervals() plot above: almost all dark blue dots are within the
span of the corresponding 95% predictive bars and fewer are within the
50% bars (naturally). So what can we conclude in light of these
observations: Does our Bayesian model produce accurate predictions? The
answer to this question is context dependent and somewhat subjective. For
example, knowing whether a typical prediction error of 990 rides is
reasonable would require a conversation with Capital Bikeshare. As is a
theme in this book, there's not a yes or no answer.
10.3.2 Cross-validation
The posterior prediction summaries in Section 10.3.1 can provide valuable
insight into our Bayesian model's predictive accuracy. They can also be
flawed. Consider an analogy. Suppose you want to open a new taco stand.
You build all of your recipes around Reem, your friend who prefers that
every meal include anchovies. You test your latest “anchov-ladas” dish on
her and it's a hit. Does this imply that this dish will enjoy broad success
among the general public? Probably not! Not everybody shares Reem's
particular tastes.1 Similarly, a model is optimized to capture the features
in the data with which it's trained or built. Thus, evaluating a model's
predictive accuracy on this same data, as we did above, can produce overly
optimistic assessments. Luckily, we don't have to go out and collect new
data with which to evaluate our model. Rather, only for the purposes of
model evaluation, we can split our existing bikes data into different
pieces that play distinct “training” and “testing” roles in our analysis. The
basic idea is this:
_________________________
1 We assume here that readers agree that not everybody likes anchovies.
Since it trains and tests our model using different portions of the bikes
data, this procedure would provide a more honest or conservative estimate
of how well our model generalizes beyond our particular bike sample, i.e.,
how well it predicts future ridership. But there's a catch. Performing just
one round of training and testing can produce an unstable estimate of
posterior predictive accuracy – it's based on only one random split of our
bikes data and uses only 50 data points for testing. A different random
split might paint a different picture. The k -fold cross-validation
algorithm, outlined below, provides a more stable approach by repeating
the training / testing process multiple times and averaging the results.
1. Create folds. Let k be some integer from 2 to n. Split the data into k
non-overlapping folds, or subsets, of roughly equal size.
2. Train and test. Train the model using the first k − 1 data folds.
Test this model on the kth data fold.
Measure the prediction quality (e.g., by MAE).
3. Repeat. Repeat step 2 k − 1 more times, each time leaving out
a different fold for testing.
4. Calculate cross-validation estimates. Steps 2 and 3 produce k
different training models and k corresponding measures of
prediction quality. Average these k measures to obtain a single
cross-validation estimate of prediction quality.
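A sketch of this procedure, assuming bayesrules' prediction_summary_cv() with model, data, and k arguments:

set.seed(84735)
cv_procedure <- prediction_summary_cv(model = bike_model, data = bikes, k = 10)
cv_procedure$cv   # cross-validated averages across the 10 folds (assumption)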
Having split our data into distinct training and testing roles, these cross-
validated summaries provide a fairer assessment of how well our Bayesian
model will predict the outcomes of new cases, not just those on which it's
trained. For a point of comparison, recall our posterior predictive
assessment based on using the full bikes dataset for both training and
testing:
In light of the original and cross-validated posterior predictive summaries
above, take the following quiz.2
Quiz Yourself!
If we were to apply our model to predict ridership tomorrow, we
should expect that our prediction will be off by:
a. 1029 rides
b. 990 rides
Remember Reem and the anchovies? Remember how we thought she'd like
your anchov-lada recipe better than a new customer would? The same is
true here. Our original posterior model was optimized to our full bikes
dataset. Thus, evaluating its posterior predictive accuracy on this same
dataset seems to have produced an overly rosy picture – the stated typical
prediction error (990 rides) is smaller than when we apply our model to
predict the outcomes of new data (1029 rides). In the spirit of “better safe
than sorry,” it's thus wise to supplement any measures of model quality
with their cross-validated counterparts.
_________________________
2 Answer : a
To this end, recall the posterior predictive pdf of a yet unobserved data point Ynew,

f(ynew | y⃗) = ∫∫∫ f(ynew | β0, β1, σ) f(β0, β1, σ | y⃗) dβ0 dβ1 dσ,

where ynew denotes a possible value of Ynew and y⃗ = (y1, y2, …, yn) denotes the observed data.
FIGURE 10.10: Two hypothetical posterior predictive pdfs for Ynew, the
yet unobserved ridership on a new day. The eventual observed value of
Ynew, ynew, is represented by a dashed vertical line.
Also notice that the posterior predictive pdf is relatively higher at ynew in
Scenario 1 than in Scenario 2, providing more evidence in favor of
Scenario 1. In general, the greater the posterior predictive pdf evaluated at
ynew, f(ynew | y⃗), the more accurate the posterior prediction of Ynew.
Similarly, the greater the logged pdf at ynew, log(f(ynew | y⃗)), the more
accurate the posterior prediction. With this, we present you with a final
numerical assessment of posterior predictive accuracy.

ELPD
The expected log-predictive density (ELPD) measures the average log posterior
predictive pdf, log(f(ynew | y⃗)), across all possible new data points ynew.
The higher the ELPD, the more accurate the posterior predictions.
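A sketch of estimating the ELPD with the loo() method for rstanarm models:

model_elpd <- loo(bike_model)
model_elpd$estimates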
From the loo() output, we learn that the ELPD for our Bayesian Normal
regression model of bike ridership is -4289. But what does this mean?!?
Per our earlier warning, this final approach to evaluating posterior
predictive accuracy is most useful as a relative measure for comparing
models, not as an absolute one.
10.6 Exercises
10.6.1 Conceptual exercises
Exercise 10.1 (The Big Three). When evaluating a model, what are the big
three questions to ask yourself?
Exercise 10.2 (Model fairness). Give an example of a model that will not
be fair for each of the reasons below. Your examples don't have to be from
real life, but try to keep them in the realm of plausibility.
a) How the data was collected.
b) The purpose of the data collection.
c) Impact of analysis on society.
d) Bias baked into the analysis.
a) How would you respond if your colleague were to tell you “I'm
just a neutral observer, so there's no bias in my data analysis”?
b) Your colleague now admits that they are not personally neutral,
but they say “my model is neutral.” How do you respond to your
colleague now?
c) Give an example of when your personal experience or perspective
has informed a data analysis.
Exercise 10.5 (That famous quote). George Box famously said: “All
models are wrong, but some are useful.” Write an explanation of what this
quote means so that one of your non-statistical friends can understand.
… a Bayesian linear regression model with μi = β0 + β1 Xi, based on response data (Y1, Y2, …, Yn).
Exercise 10.6 (Assumptions). Provide 3 assumptions of the Normal Bayesian linear regression model.
variable Y has values y = (20, 17, 4, 11, 9). Based on this data, we built a
Bayesian linear regression model of Y vs X.
a) In our first simulated parameter set, (β0^(1), β1^(1), σ^(1)) = (−1.8, 2.1, 0.8).
Explain how you would use these values to simulate a posterior prediction of Y for a given X value.
Exercise 10.10 (Posterior predictive checks).
Exercise 10.11 (Cross-validation and tacos). Recall this example from the
chapter: Suppose you want to open a new taco stand. You build all of your
recipes around Reem, your friend who prefers that anchovies be a part of
every meal. You test your latest “anchov-ladas” dish on her and it's a hit.
a) What are the four steps for the k-fold cross-validation algorithm?
b) What problems can occur when you use the same exact data to
train and test a model?
c) What questions do you have about k-fold cross-validation?
Exercise 10.14 (Coffee ratings: model it). In this exercise you will build a
Bayesian Normal regression model of a coffee bean's rating (Y) by its
aroma grade (X) with μ = β0 + β1 X. In doing so, assume that our only
prior understanding is that the average cup of coffee has a 75-point rating,
though this might be anywhere between 55 and 95. Beyond that, utilize
weakly informative priors.
Exercise 10.19 (Coffee ratings now with aftertaste). Aroma isn't the only
possible predictor of a coffee bean's rating. What if, instead, we were to
predict rating by a bean's aftertaste? In exploring this relationship,
continue to utilize the same prior models.
Exercise 10.20 (Open-ended: more weather). In this exercise you will use
the weather_perth data in the bayesrules package to explore the
Normal regression model of the maximum daily temperature (maxtemp)
by the minimum daily temperature (mintemp) in Perth, Australia. You
can either tune or utilize weakly informative priors.
Exercise 10.21 (Open-ended: more bikes). In this exercise you will use the
bikes data in the bayesrules package to explore the Normal
regression model of rides by humidity. You can either tune or utilize
weakly informative priors.
DOI: 10.1201/9780429288340-11
Let's begin our analysis with the familiar: a simple Normal regression
model of temp3pm with one quantitative predictor, the morning
temperature temp9am, both measured in degrees Celsius. As you might
expect, there's a positive association between these two variables – the
warmer it is in the morning, the warmer it tends to be in the afternoon:
FIGURE 11.1: A scatterplot of 3 p.m. versus 9 a.m. temperatures, in
degrees Celsius, collected in two Australian cities.
distinguish it from other predictors used later in the chapter. Then the
Bayesian Normal regression model of Y by X1 is represented by (11.1):
Since 0-degree mornings are rare in Australia, it's difficult to state our
prior understanding of the typical afternoon temperature on such a rare
day, β0. Instead, we'll express our prior understanding through the Normal
prior model on the centered intercept β0c.
We simulate the model posterior below and encourage you to follow this
up with a check of the prior model specifications and some MCMC
diagnostics (which all look good!):
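A sketch of this simulation, assuming the weather data frame is named weather_WU and relying on weakly informative default priors (the book's tuned priors may differ):

weather_model_1 <- stan_glm(
  temp3pm ~ temp9am, data = weather_WU,
  family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735)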
The simulation results provide ner insight into the association between
afternoon and morning temperatures. Per the 80% credible interval for β1,
there's an 80% posterior probability that for every one degree increase in
temp9am, the average increase in temp3pm is somewhere between 0.98
and 1.1 degrees. Further, per the 80% credible interval for standard
deviation σ, this relationship is fairly strong – observed afternoon
temperatures tend to fall somewhere between only 3.87 and 4.41 degrees
from what we'd expect based on the corresponding morning temperature.
But is this a “good” model? We'll more carefully address this question in
Section 11.5. For now, we'll leave you with a quick pp_check(), which
illustrates that we can do better (Figure 11.2). Though the 50 sets of
afternoon temperature data simulated from the weather_model_1
posterior (light blue) tend to capture the general center and spread of the
afternoon temperatures we actually observed (dark blue), none capture the
bimodality in these temperatures. That is, none reflect the fact that there's
a batch of temperatures around 20 degrees and another batch around 35
degrees.
Goals
Extend the Normal linear regression model of a quantitative
response variable Y to settings in which we have:
– a categorical predictor X;
– multiple predictors (X1, X2, …, Xp); or
– some combination of these.
and 0 otherwise:

Xi2 = 1 for Wollongong, and 0 otherwise (i.e., Uluru).

The mean model β0 + β1 Xi2 thus operates under only two scenarios. Scenario 1: For Uluru, Xi2 = 0 and the typical 3
p.m. temperature simplifies to β0 + β1 · 0 = β0. Scenario 2: For Wollongong, Xi2 = 1 and the typical 3 p.m.
temperature is β0 + β1 · 1 = β0 + β1. Accordingly, β0 is the typical 3 p.m. temperature in
Uluru (X2 = 0), and β1 is the difference in typical 3 p.m. temperature between Wollongong (X2 = 1) and Uluru (X2 = 0).
We can now interpret our prior models in (11.2) with this in mind. First,
the Normal prior model on the centered intercept β0c reflects a prior
understanding that the average afternoon temperature in Uluru is
somewhere between 15 and 35 degrees. Further, the weakly informative
Normal prior for β1 is centered around 0, reflecting a default, conservative
prior assumption that the average 3 p.m. temperature in Wollongong might
be greater (β1 > 0), less (β1 < 0), or even no different (β1 = 0) from that
in Uluru. Finally, the weakly informative prior for σ expresses our lack of
understanding about the degree to which 3 p.m. temperatures vary at either
location.
Trace plots and autocorrelation plots (omitted here for space), as well as
density plots (Figure 11.4) suggest that our posterior simulation has
sufficiently stabilized:
FIGURE 11.4: Four parallel Markov chain approximations of the
weather_model_2 posterior models.
These density plots and the below numerical posterior summaries for β0
(Intercept), β1 (locationWollongong), and σ (sigma) reflect
our posterior understanding of 3 p.m. temperatures in Wollongong and
Uluru. Consider the message of the posterior median values for β0 and β1:
the typical 3 p.m. temperature is around 29.7 degrees in Uluru and,
comparatively, around 10.3 degrees lower in Wollongong. Combined then,
we can say that the typical 3 p.m. temperature in Wollongong is around
19.4 degrees (29.7 - 10.3). For context, Figure 11.5 frames these posterior
median estimates of 3 p.m. temperatures in Uluru and Wollongong among
the observed data.
FIGURE 11.5: Density plots of afternoon temperatures in Uluru and
Wollongong, with posterior median estimates of temperature (dashed
lines).
0
,β
(2)
0
,…,β
(20000)
0
} . As for β0 + β1 , we can approximate the
posterior for this function of model parameters by the corresponding
function of Markov chains. Speci cally, we can approximate the β + β 0 1
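A sketch of this chain arithmetic, using the parameter names reported above:

weather_model_2_df <- as.data.frame(weather_model_2) %>%
  mutate(uluru_mean      = `(Intercept)`,
         wollongong_mean = `(Intercept)` + locationWollongong)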
Interpreting the 9 a.m. temperature and location coefficients in this new
model, β1 and β2, requires some care. We can't simply interpret β1 and β2
as we did in our first two models, (11.1) and (11.2), when using either
predictor alone. Rather, the meaning of our predictor coefficients changes
depending upon the other predictors in the model. Let's again consider two
scenarios. Scenario 1: In Uluru, Xi2 = 0 and the relationship between 3
p.m. and 9 a.m. temperature simplifies to the following formula for a line:

β0 + β1 Xi1 + β2 · 0 = β0 + β1 Xi1.

Scenario 2: In Wollongong, Xi2 = 1 and the relationship between 3 p.m.
and 9 a.m. temperature becomes

β0 + β1 Xi1 + β2 · 1 = (β0 + β2) + β1 Xi1.

Thus, the 9 a.m. temperature coefficient β1 is the common slope of the Uluru and
Wollongong model lines.
From these priors, we then simulate and plot 100 different sets of 3 p.m.
temperature data (Figure 11.9). First, notice from the left plot that our
combined priors produce sets of 3 p.m. temperatures that are centered
around 25 degrees (per the β0c prior), yet tend to range widely, from
roughly -75 to 125 degrees. This wide range is the result of our weakly
informative priors and, though it spans unrealistic temperatures, it's at
least in the right ballpark. After all, had we utilized vague instead of
weakly informative priors, our prior simulated temperatures would span
an even wider range, say on the order of millions of degrees Celsius.
Second, the plot at right displays our prior assumptions about the
relationship between 3 p.m. and 9 a.m. temperature at each location. Per
the prior model for slope β1 being centered at 0, these model lines reflect a
conservative assumption that 3 p.m. temperatures might be positively or
negatively associated with 9 a.m. temperatures in both locations. Further,
per the prior model for the Wollongong coefficient β2 being centered at 0,
the lack of a distinction among the model lines in the two locations
reflects a conservative assumption that the typical 3 p.m. temperature in
Wollongong might be hotter, cooler, or no different than in Uluru on days
with similar 9 a.m. temperatures. In short, these prior assumptions reflect
that when it comes to Australian weather, we're just not sure what's up.
FIGURE 11.9: 100 datasets were simulated from the prior models. For
each, we display a density plot of the 3 p.m. temperatures alone (left) and
the relationship in 3 p.m. versus 9 a.m. temperatures by location (right).
Think about this dynamic another way: the relationship between 3 p.m.
temperature and 9 a.m. humidity varies by location. Equivalently, the
relationship between 3 p.m. temperature and location varies by 9 a.m.
humidity level. More technically, we say that the location and humidity
predictors interact.
Interaction
Two predictors, X1 and X2, interact if the association between X1 and
Y varies depending upon the level of X2.
To reflect the fact that the relationship between temperature and humidity is
modified by location, we can incorporate a new predictor: the interaction
term. This new predictor is simply the product of X2 and X3:
μ = β0 + β1 X2 + β2 X3 + β3 X2 X3 .
Thus, the complete structure for our multivariable Bayesian linear
regression model with an interaction term is as follows, where the weakly
informative priors on the non-intercept parameters are auto-scaled by
stan_glm() below:
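A sketch of this model, again assuming the weather_WU data and default weakly informative priors, where the colon term adds the humidity-by-location interaction:

interaction_model <- stan_glm(
  temp3pm ~ location + humidity9am + location:humidity9am,
  data = weather_WU, family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735)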
For days in Uluru, where X2 = 0, this structure simplifies to μ = β0 + β2 X3.
Context. In the context of your analysis, does it make sense that the
relationship between Y and one predictor X1 varies depending upon the
value of another predictor X2?
Visualizations. As with our example here, interactions might reveal
themselves when visualizing the relationships between Y, X1, and X2.
Hypothesis tests. Suppose we do include an interaction term in our
model, μ = β0 + β1 X1 + β2 X2 + β3 X1 X2. If there's significant
posterior evidence that β3 differs from 0, then we have reason to keep the
interaction term. Otherwise, it's typically a good idea to get rid of it.
Let's practice the rst two ideas with some informal examples. The
bike_users data is from the same source as the bikes data in Chapter
9. Like the bikes data, it includes information about daily Capital
Bikeshare ridership. Yet bike_users contains data on both registered,
paying bikeshare members (who tend to ride more often and use the bikes
for commuting) and casual riders (who tend to just ride the bikes every so
often):
We'll use this data as a whole and to examine patterns among casual riders
alone and registered riders alone:
To begin, let Yc denote the number of casual riders and Yr denote the
number of registered riders on a given day. As with model (11.4), consider
an analysis of ridership in which we have two predictors, one quantitative
and one categorical: temperature (X1) and weekend status (X2). The
observed relationships of Yc and Yr with X1 and X2 are shown below.
Syntax is included for the former and is similar for the latter.
Quiz Yourself!
In their relationship with ridership, user type and weather category do not
appear to interact, at least not significantly. Among both casual and
registered riders, ridership tends to decrease as weather worsens. Further,
the degree of these decreases from one weather category to the next are
certainly not equal, but they are similar. In contrast, in their relationship
with ridership, user type and weekend status do appear to interact – the
relationship between ridership and weekend status varies by user status.
Whereas casual ridership is greater on weekends than on weekdays,
registered ridership is greater on weekdays. Again, we might have
anticipated this interaction given that casual and registered riders tend to
use the bikeshare service to different ends.
The examples above have focused on examining interactions through
visualizations and context. It remains to determine whether these
interactions are actually significant or meaningful, thus whether we should
include them in the corresponding models. To this end, we could simulate
each model and formally test the significance of the interaction coefficient.
In our opinion, after building up the necessary intuition, this formality is
the easiest step.
μ = β0 + β1 X1 + β2 X2 + ⋯ + βp Xp ,
We can now pick through the posterior simulation results for our seven
model parameters, here simplified to their corresponding 95% credible
intervals:
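A sketch of obtaining these intervals for the model named weather_model_4 below:

posterior_interval(weather_model_4, prob = 0.95)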
These intervals provide insight into some big picture questions. When
controlling for the other model predictors, which predictors are
significantly associated with temp3pm? Are these associations positive or
negative? Try to answer these big picture questions for yourself in the quiz
below. Though there's some gray area, one set of reasonable answers is in
the footnotes.2
Quiz Yourself!
When controlling
for the other predictors included in
weather_model_4, which predictors…
1. have a significant positive association with temp3pm?
2. have a significant negative association with temp3pm?
3. are not significantly associated with temp3pm?
_________________________
2 1 . temp9am; 2. location, humidity9am; 3. windspeed9am, pressure9am
Let's see how you did. To begin, the 95% posterior credible interval for the
temp9am coefficient β1, (0.73, 0.87), is the only one that lies entirely
above 0. This provides us with hearty evidence that, even when controlling
for the four other predictors in the model, there's a significant positive
association between morning and afternoon temperatures. In contrast, the
95% posterior credible intervals for the locationWollongong and
humidity9am coef cients lie entirely below 0, suggesting that both
factors are negatively associated with temp3pm. For example, when
controlling for the other model factors, there's a 95% chance that the
typical temperature in Wollongong is between 5.67 and 7.2 degrees lower
than in Uluru. The windspeed9am and pressure9am coefficients are
the only ones to have 95% credible intervals which straddle 0. Though
both intervals lie mostly below 0, suggesting afternoon temperature is
negatively associated with morning windspeed and atmospheric pressure
when controlling for the other model predictors, the waffling evidence
invites some skepticism and follow-up questions.
Quiz Yourself!
Utilize Table 11.2 to compare our four models.
Don't panic. We don't expect you to interpret the polynomial terms or their
coef cients. Rather, we want you to take in the fact that each polynomial
term, a transformation of X, is a separate predictor. Thus, model 1 has 1
predictor, model 2 has 2, and model 3 has 12. In obtaining posterior
estimates of these three models, we know two things from past
discussions. First, our two different data samples will produce different
estimates (they're using different information!). Second, models 1, 2, and
3 vary in quality – one is better than the others. Before examining the
nuances, take a quick quiz.4
Quiz Yourself!
_________________________
4 Answer : 1 = model 3; 2 = model 1
Let's connect what we see in this figure with some technical concepts and
terminology. Starting at one extreme, model 1 assumes a linear
relationship between temperature and day of year. Samples 1 and 2
produce similar estimates of this linear relationship. This stability is
reassuring – no matter what data we happen to have, our posterior
understanding of model 1 will be similar. However, model 1 turns out to
be overly simple and rigid. It systematically underestimates temperatures
on summer days and overestimates temperatures on winter days. Putting
this together, we say that model 1 has low variance from sample to
sample, but high bias.
To correct this bias, we can incorporate more flexibility into our model
with the inclusion of more predictors, here polynomial terms. Yet we can
get carried away. Consider the extreme case of model 3, which assumes a
12th-order polynomial relationship between temperature and day of year.
Within both samples, model 3 does seem to be better than model 1 at
following the trends in the relationship, and hence is less biased. However,
this decrease in bias comes at a cost. Since model 3 is structured to pick
up tiny, sample-specific details in the relationship, samples 1 and 2
produce quite different estimates of model 3, or two very distinct wiggly
model lines. In this case, we say that model 3 has low bias but high
variance from sample to sample. Utilizing this highly variable model
would have two serious consequences. First, the results would be unstable
– different data might produce very different model estimates, and hence
conclusions about the relationship between temperature and day of year.
Second, the results would be overfit to our sample data – the tiny, local
trends in our sample likely don't extend to the general daily weather
patterns in Wollongong. As a result, this model wouldn't do a good job of
predicting temperatures for future days.
FIGURE 11.21: Sample 1 (top row) and sample 2 (bottom row) are used
to model temperature by day of year under three assumptions: the
relationship is linear (left), quadratic (center), or a 12th order polynomial
(right).
Bringing this all together, in assuming a quadratic structure for the
relationship between temperature and day of year, model 2 provides some
good middle ground. It is neither too biased (simple) nor too variable
(overfit). That is, it strikes a nice balance in the bias-variance trade-off.
Bias-variance trade-off
A model is said to have high bias if it tends to be "far" from the
observed relationship in the data; and high variance if estimates of
the model significantly differ depending upon what data is used. In
model building, there are trade-offs between bias and variability:
Overly simple models with few or no predictors tend to have high
bias but low variability (high stability).
Overly complicated models with lots of predictors tend to have
low bias but high variability (low stability).
The goal is to build a model which strikes a good balance, enjoying
relatively low bias and low variance.
Check out the results in Table 11.3. By both measures, raw and cross-
validated, model_1 tends to have the greatest prediction errors. This
suggests that model_1 is overly simple, or possibly biased, in
comparison to the other two models. At the other extreme, notice that the
cross-validated prediction error for model_3 (2.12 degrees) is roughly
double the raw prediction error (1.06 degrees). Thus, model_3 is roughly
twice as bad at predicting temperatures for days we haven't yet observed as
for days in the sample_1 data. This suggests that
model_3 is overfit, or overly optimized, to the sample_1 data we used
to build it. As such, the discrepancy in its raw and cross-validated
prediction errors tips us off that model_3 has low bias but high variance
– a different sample of data might lead to very different posterior results.
Between the extremes of model_1 and model_3, model_2 presents the
best option with a relatively low raw prediction error and the lowest cross-
validated prediction error.
11.7 Exercises
11.7.1 Conceptual exercises
Exercise 11.6 (Improving your model: shoe size). Let's say you model a
child's shoe size (Y) by two predictors: the child's age in years (X1) and an
indicator of whether the child knows how to swim (X2).
Exercise 11.8 (Is our model good / better?). What techniques have you
learned in this chapter to assess and compare your models? Give a brief
explanation for each technique.
model  formula
1      body_mass_g ~ flipper_length_mm
2      body_mass_g ~ species
3      body_mass_g ~ flipper_length_mm + species
4      body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm
a) Simulate these four models using the stan_glm() function.
b) Produce and compare the pp_check() plots for the four models.
c) Use 10-fold cross-validation to assess and compare the posterior
predictive quality of the four models using the
prediction_summary_cv() function. NOTE: We can only predict
body mass for penguins that have complete information on our
model predictors. Yet two penguins have NA values for multiple
of these predictors. To remove these two penguins, we select()
our columns of interest before removing penguins with NA values.
This way, we don't throw out penguins just because they're
missing information on variables we don't care about:
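For reference, a minimal sketch of that pre-processing, assuming the penguins_bayes data from the bayesrules package and dplyr (the object name penguins_complete is ours, not the book's):

library(bayesrules)
library(dplyr)

data(penguins_bayes)

# Keep only the variables used across the four models, then drop the
# penguins with NA values in any of them. This way we don't discard
# penguins that are only missing variables we never use.
penguins_complete <- penguins_bayes %>%
  select(body_mass_g, flipper_length_mm, species,
         bill_length_mm, bill_depth_mm) %>%
  na.omit()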
DOI: 10.1201/9780429288340-12
Step back from the details of the previous few chapters and recall the big goal:
to build regression models of quantitative response variables Y. We've only
shared one regression tool with you so far, the Bayesian Normal regression
model. The name of this "Normal" regression tool reflects its broad
applicability. But (luckily!), not every model is "Normal." We'll expand upon
our regression tools in the context of the following data story.
As of this book's writing, the Equality Act sits in the United States Senate
awaiting consideration. If passed, this act or bill would ensure basic LGBTQ+
rights at the national level by prohibiting discrimination in education,
employment, housing, and more. As is, each of the 50 individual states has
their own set of unique anti-discrimination laws, spanning issues from anti-
bullying to health care coverage. Our goal is to better understand how the
number of laws in a state relates to its unique demographic features and
political climate. For the former, we'll narrow our focus to the percentage of a
state's residents that reside in an urban area. For the latter, we'll utilize
historical voting patterns in presidential elections, noting whether a state has
consistently voted for the Democratic candidate, consistently voted for the
"GOP" Republican candidate,1 or is a swing state that has flip-flopped back and
forth. Throughout our analysis, please recognize that the number of laws is not
a perfect proxy for the quality of a state's laws – it merely provides a starting
point in understanding how laws vary from state to state.
For each state i ∈ {1, 2, …, 50}, let Y_i denote the number of anti-
discrimination laws and predictor X_{i1} denote the percentage of the state's
residents that live in urban areas. Further, our historical political climate
predictor variable is categorical with three levels: Democrat, GOP, or swing.
This is our first time working with a three-level variable, so let's set this up
right. Recall from Chapter 11 that one level of a categorical predictor, here
Democrat, serves as a baseline or reference level for our model. The other
levels, GOP and swing, enter our model as indicators. Thus,

$$X_{i2} = \begin{cases} 1 & \text{GOP} \\ 0 & \text{otherwise} \end{cases}
\quad \text{and} \quad
X_{i3} = \begin{cases} 1 & \text{swing} \\ 0 & \text{otherwise.} \end{cases}$$
Since it's the only technique we've explored thus far, our first approach to
understanding the relationship between our quantitative response variable Y
and our predictors X might be to build a regression model with a Normal data
structure:

$$Y_i \mid \beta_0, \beta_1, \beta_2, \beta_3, \sigma \overset{ind}{\sim} N(\mu_i, \sigma^2)
\quad \text{with} \quad
\mu_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3}.$$
Other than an understanding that a state that's "typical" with respect to its
urban population and historical voting patterns has around 7 laws, we have
very little prior knowledge about this relationship. Thus, we'll set a N(7, 1.5²)
prior for the centered intercept β₀c, but utilize weakly informative priors for
the remaining model parameters.
Next, let's consider some data. Each year, the Human Rights Campaign
releases a "State Equality Index" which monitors the number of LGBTQ+
rights laws in each state. Among other state features, the equality_index
dataset in the bayesrules package includes data from the 2019 index
compiled by Sarah Warbelow, Courtnay Avant, and Colin Kutney (2019). To
obtain a detailed codebook, type ?equality in the console.
The histogram below indicates that the number of laws ranges from as low as 1
to as high as 155, yet the majority of states have fewer than ten laws:
The state with 155 laws happens to be California. As a clear outlier, we'll
remove this state from our analysis:
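A minimal sketch of this step, assuming the equality_index data has been loaded from bayesrules and that California is the lone state with 155 laws (the object name equality is our choice):

library(bayesrules)
library(dplyr)

data(equality_index)

# Drop the clear outlier: the one state (California) with 155 laws
equality <- equality_index %>%
  filter(laws < 155)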
Next, in a scatterplot of the number of state laws versus its
percent_urban population and historical voting patterns, notice that
historically dem states and states with greater urban populations tend to have
more LGBTQ+ anti-discrimination laws in place:
Using stan_glm(), we combine this data with our weak prior understanding
to simulate the posterior Normal regression model of laws by
percent_urban and historical voting trends. In a quick posterior
predictive check of this equality_normal_sim model, we compare a
histogram of the observed anti-discrimination laws to five posterior simulated
datasets (Figure 12.3). (A histogram is more appropriate than a density plot
here since our response variable is a non-negative count.) The results aren't
good – the posterior predictions from this model do not match the features of
the observed data. You might not be surprised. The observed number of anti-
discrimination laws per state are right skewed (not Normal!). In contrast, the
datasets simulated from the posterior Normal regression model are roughly
symmetric. Adding insult to injury, these simulated datasets assume that it is
quite common for states to have a negative number of laws (not possible!).
FIGURE 12.3: A posterior predictive check of the Normal regression model
of anti-discrimination laws. A histogram of the observed laws (y) is plotted
alongside five posterior simulated datasets (yrep).
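For reference, a sketch of how the simulation and posterior predictive check described above might be coded; the prior, chain, and seed settings here are illustrative assumptions, not necessarily the authors' exact call:

library(rstanarm)

# Simulate the posterior Normal regression model of laws by
# urban percentage and historical voting trends
equality_normal_sim <- stan_glm(
  laws ~ percent_urban + historical,
  data = equality, family = gaussian,
  prior_intercept = normal(7, 1.5),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 10000, seed = 84735)

# Compare the observed laws to 5 posterior simulated datasets
pp_check(equality_normal_sim, plotfun = "hist", nreps = 5)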
This sad result reveals the limits of our lone tool. Luckily, probability models
come in all shapes, and we don't have to force something to be Normal when
it's not.
Goals
You will extend the Normal regression model of a quantitative response
variable Y to settings in which Y is a count variable whose dependence on
predictors X is better represented by the Poisson or Negative Binomial,
not Normal, models.
12.1 Building the Poisson regression model
12.1.1 Specifying the data model
Recall from Chapter 5 that the Poisson model is appropriate for modeling
discrete counts of events (here anti-discrimination laws) that happen in a fixed
interval of space or time (here states) and that, theoretically, have no upper
bound. The Poisson is especially handy in cases like ours in which counts are
right-skewed, and thus can't reasonably be approximated by a Normal model.
Moving forward, let's assume a Poisson data model for the number of
LGBTQ+ anti-discrimination laws in each state i (Y_i), where the rate of anti-
discrimination laws λ_i depends upon demographic and voting trends (X_{i1},
X_{i2}, and X_{i3}):

$$Y_i \mid \lambda_i \overset{ind}{\sim} \text{Pois}(\lambda_i)
\quad \text{where} \quad
E(Y_i \mid \lambda_i) = \lambda_i.$$
Quiz Yourself!
Figure 12.4 highlights a flaw in assuming that the expected number of
laws in a state, λi, is a linear combination of percent_urban and
historical. What is it?
At the risk of projecting, interpreting the logged number of laws isn't so easy.
Instead, we can always transform the model relationship off the log scale by
appealing to the fact that if log(λ) = a, then λ = e^a for the natural constant e:

$$Y_i \mid \beta_0, \beta_1, \beta_2, \beta_3 \overset{ind}{\sim} \text{Pois}(\lambda_i)
\quad \text{with} \quad
\lambda_i = e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3}}.$$
Figure 12.5 presents a prior plausible outcome for this model on both the
log(λ) and λ scales. In both cases there are three curves, one per historical
voting category. On the log(λ) scale, these curves are linear. Yet when
transformed to the λ or (unlogged) laws scale, these curves are nonlinear and
restricted to be at or above 0. This was the point – we want our model to
preserve the fact that a state can't have a negative number of laws.
FIGURE 12.5: An example relationship between the logged number of laws
(left) or number of laws (right) in a state with its urban percentage and
historical voting trends.
_________________________
2 By convention, “log” refers to the natural log function “ln” throughout this book.
The model curves on both the log(λ) and λ scales are defined by the
(β₀, β₁, β₂, β₃) parameters. When describing the linear model of the logged
number of laws in a state, these parameters take on the usual meanings related
to intercepts and slopes. Yet (β₀, β₁, β₂, β₃) take on new meanings when
translated to the unlogged λ scale:

$$\log(\lambda) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
\quad \Longleftrightarrow \quad
\lambda = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}.
\tag{12.1}$$
Interpreting β₀
Let's apply these concepts to the hypothetical models of the logged and
unlogged number of state laws in Figure 12.5. In this figure, the curves are
defined by:
(12.2)
Now, since there are no states in which the urban population is close to 0, it
doesn't make sense to put too much emphasis on these interpretations of β0.
Rather, we can just understand β0 as providing a baseline for our model, on
both the log(λ) and λ scales.
Next, consider the urban percentage coefficient β₁ = 0.03. On the linear
log(λ) scale, we can still interpret this value as the shared slope of the lines in
Figure 12.5 (left). Specifically, no matter a state's historical voting trends, we
expect the logged number of laws in states to increase by 0.03 for every extra
percentage point in urban population. We can interpret the relationship
between state laws and urban percentage more meaningfully on the unlogged
λ scale by examining

$$e^{\beta_1} = e^{0.03} = 1.03.$$
Finally, consider the GOP coefficient β₂ = −1.1. Recall from Chapter 11 that,
on the linear log(λ) scale, β₂ captures the vertical distance between the GOP
and Democrat model lines in Figure 12.5 (left). Thus, at any urban percentage,
we'd expect the logged number of laws to be 1.1 lower in historically GOP
states than Democrat states. Though we also see a difference between the GOP
and Democrat curves on the nonlinear λ scale (Figure 12.5 right), the difference
isn't constant – the gap in the number of laws widens as urban percentage
increases. In this case, instead of representing a constant difference between
two lines, e^{β₂} measures the percentage or multiplicative difference in the GOP
versus Democrat curve. Thus, at any urban percentage level, we'd expect a
historically GOP state to have 1/3 as many anti-discrimination laws as a
historically Democrat state:

$$e^{\beta_2} = e^{-1.1} = 0.333.$$
We conclude this section with the formal assumptions encoded by the Poisson
data model.
$$Y_i \mid \beta_0, \beta_1, \ldots, \beta_p \sim \text{Pois}(\lambda_i)
\quad \text{where} \quad
\log(\lambda_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}.$$
FIGURE 12.6: Two simulated datasets. The data on the left satisfies the
Poisson regression assumption that, at any given X, the variability in Y is
roughly on par with the average Y value. The data on the right does not,
exhibiting consistently low variability in Y across all X values.
Since each of these coefficients can take on any value on the real line, it's
again reasonable to utilize Normal priors.
Further, as was the case for our Normal regression model, we'll assume these
priors (i.e., our prior understanding of the model coefficients) are independent.
The complete representation of our Poisson regression model of Y_i is as
follows:
First, consider the prior for the centered intercept β₀c. Recall our prior
understanding that a typical state has around 7 laws, and hence that the logged
number of laws in a typical state is around log(7) ≈ 2.
Further, the range of this Normal prior indicates our relative uncertainty about
this baseline. Though the logged average number of laws is most likely around
2, we think it could range from roughly 1 to 3 (2 ± 2*0.5). Or, more
intuitively, we think that the average number of laws in typical states might be
somewhere between 3 and 20:
$$(e^{1}, e^{3}) \approx (3, 20).$$
Beyond this baseline, we again used weakly informative default priors for
(β₁, β₂, β₃), tuned by stan_glm() below. Being centered at zero with
relatively large standard deviation on the scale of our variables, these priors
reflect a general uncertainty about whether and how the number of anti-
discrimination laws is associated with a state's urban population and voting
trends.
To examine whether these combined priors accurately reflect our uncertain
understanding of state laws, we'll simulate 20,000 draws from the prior models
of (β₀, β₁, β₂, β₃). To this end, we can run the same stan_glm() function
that we use to simulate the posterior with two new arguments: prior_PD =
TRUE specifies that we wish to simulate the prior, and family = poisson
indicates that we're using a Poisson data model (not Normal or gaussian).
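A sketch of that call; beyond prior_PD = TRUE and family = poisson, the settings shown (including the N(2, 0.5²) intercept prior suggested by the discussion below and the object name equality_model_prior) are our assumptions:

# Simulate 20,000 prior draws of (beta_0, beta_1, beta_2, beta_3):
# 4 chains x 5,000 retained iterations each
equality_model_prior <- stan_glm(
  laws ~ percent_urban + historical,
  data = equality, family = poisson,
  prior_intercept = normal(2, 0.5),
  prior = normal(0, 2.5, autoscale = TRUE),
  chains = 4, iter = 5000 * 2, seed = 84735,
  prior_PD = TRUE)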
The wide spread of prior plausible model curves here certainly reflects our
prior uncertainty! These are all over the map,
indicating that the number of laws might increase or decrease with urban
population and might or might not differ by historical voting trends. We don't
really know.
MCMC trace, density, and autocorrelation plots confirm that our simulation
has stabilized:
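For instance, a sketch of these diagnostics, assuming the posterior Poisson simulation is stored as equality_model (the object name is ours):

library(bayesplot)

# Markov chain diagnostics: trace, density, and autocorrelation plots
mcmc_trace(equality_model)
mcmc_dens_overlay(equality_model)
mcmc_acf(equality_model)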
And before we get too far into our analysis of these simulation results, a quick
posterior predictive check confirms that we're now on the right track (Figure
12.8). First, histograms of just five posterior simulations of state law data
exhibit similar skew, range, and trends as the observed law data. Second,
though density plots aren't the best display of count data, they allow us to more
directly compare a broader range of 50 posterior simulated datasets to the
actual observed law data. These simulations aren't perfect, but they do
reasonably capture the features of the observed law data.
FIGURE 12.8: A posterior predictive check of the Poisson regression model
of anti-discrimination laws compares the observed laws (y) to five posterior
simulated datasets (yrep) via histograms (left) and to 50 posterior simulated
datasets via density plots (right).
Unlike the prior plausible models in Figure 12.7, which were all over the
place, the messages are clear. At any urban population level, historically dem
states tend to have the most anti-discrimination laws and gop states the
fewest. Further, the number of laws in a state tend to increase with urban
percentage. To dig into the details, we can examine the posterior models for
the regression parameters β0 (Intercept), β1 (percent_urban), β2
(historicalgop), and β3 (historicalswing):
(12.4)
Consider the percent_urban coefficient β1, which has a posterior median
of roughly 0.0164. Then, when controlling for historical voting trends, we
expect the logged number of anti-discrimination laws in states to increase by
0.0164 for every extra percentage point in urban population. More
meaningfully (on the unlogged scale), if the urban population in one state is 1
percentage point greater than another state, we'd expect it to have 1.0165 times
the number of, or 1.65% more, anti-discrimination laws:

$$e^{0.0164} = 1.0165.$$

Or, if the urban population in one state is 25 percentage points greater than
another state, we'd expect it to have roughly one and a half times the number
of, or 51% more, anti-discrimination laws ($e^{25 \cdot 0.0164} = 1.5068$). Take a quick
quiz to similarly interpret the β3 coefficient for historically swing states.3
Quiz Yourself!
The posterior median of β3 is roughly −0.61 and, correspondingly, e^{−0.61} ≈ 0.54. How can we interpret these values?
The key here is remembering that the categorical swing state indicator
provides a comparison to the dem state reference level. Then, when controlling
for a state's urban population, we'd expect the logged number of anti-
discrimination laws to be 0.61 lower in a swing state than in a dem leaning
state. Equivalently, on the unlogged scale, swing states tend to have 54
percent as many anti-discrimination laws as dem leaning states (e −0.61
= 0.54
).
In closing out our posterior interpretation, notice that the 80% posterior
credible intervals for (β₁, β₂, β₃) in the above tidy() summary provide
evidence that each coefficient significantly differs from 0. For example, there's
an 80% posterior chance that the percent_urban coefficient β1 is between
0.0119 and 0.021. Thus, when controlling for a state's historical political
leanings, there's a significant positive association between the number of anti-
discrimination laws in a state and its urban population. Further, when
controlling for a state's percent_urban makeup, the number of anti-
discrimination laws in gop leaning and swing states tend to be significantly
below that of dem leaning states – the 80% credible intervals for β2 and β3
both fall below 0. These conclusions are consistent with the posterior plausible
models in Figure 12.9.
From each posterior plausible parameter set $(\beta_0^{(i)}, \beta_1^{(i)}, \beta_2^{(i)}, \beta_3^{(i)})$, we can
calculate a plausible rate $\lambda^{(i)}$ for a given state and then simulate a posterior
predictive outcome $Y_{\text{new}}^{(i)}$ from $\text{Pois}(\lambda^{(i)})$, using rpois().
FIGURE 12.12: 50% and 95% posterior credible intervals (blue lines) for the
number of anti-discrimination laws in a state. The actual number of laws are
represented by the dark blue dots.
Because we really don't have any prior understanding of this relationship, we'll
utilize weakly informative priors throughout our analysis. Moving on, let's
load and process the pulse_of_the_nation data from the bayesrules
package. In doing so, we'll remove some outliers, focusing on people that read
fewer than 100 books:
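One way this wrangling might look (a sketch; the object name pulse is ours):

library(bayesrules)
library(dplyr)

data(pulse_of_the_nation)

# Focus on respondents that read fewer than 100 books per year
pulse <- pulse_of_the_nation %>%
  filter(books < 100)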
Figure 12.13 reveals some basic patterns in readership. First, though most
people read fewer than 11 books per year, there is a lot of variability in reading
patterns from person to person. Further, though there appears to be a very weak
relationship between book readership and one's age, readership appears to be
slightly higher among people that would prefer wisdom over happiness (makes
sense to us!).
Given the skewed, count structure of our books variable Y, the Poisson
regression tool is a reasonable first approach to modeling readership by a
person's age and their prioritization of wisdom versus happiness:
BUT the results are definitely not great. Check out the pp_check() in Figure
12.14.
Counter to the observed book readership, which is right skewed and tends to be
below 11 books per year, the Poisson posterior simulations of readership are
symmetric around 11 books per year. Simply put, the Poisson regression model is
wrong. Why? Well, recall that Poisson regression preserves the Poisson
property of equal mean and variance. That is, it assumes that among subjects
of similar age and perspectives on wisdom versus happiness (X1 and X2), the
typical number of books read is roughly equivalent to the variability in books
read. Yet the pp_check() highlights that, counter to this assumption, we
actually observe high variability in book readership relative to a low average
readership. We can confirm this observation with some numerical summaries.
First, this discrepancy between the mean and variance in readership holds across
all subjects in our survey. On average, people read 10.9 books per year, but the
variance in book readership was a whopping 198 books²:
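A quick numerical check, sketched with dplyr's summarize():

# Overall mean and variance in annual book readership
pulse %>%
  summarize(mean_books = mean(books), var_books = var(books))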
When we cut() the age range into three groups, we see that this is also true
among subjects in the same general age bracket and with the same take on
wisdom versus happiness, i.e., among subjects with similar X1 and X2 values.
For example, among respondents in the 45 to 72 year age bracket that prefer
wisdom to happiness, the average readership was 12.5 books, a relatively small
number in comparison to the variance of 270 books².
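And a sketch of the grouped comparison, cutting age into three brackets; the wisdom-versus-happiness variable name (wise_unwise) and the automatic cut() breaks are our assumptions:

# Mean and variance in readership within combinations of age bracket
# and wisdom-vs-happiness preference
pulse %>%
  mutate(age_bracket = cut(age, breaks = 3)) %>%
  group_by(age_bracket, wise_unwise) %>%
  summarize(mean_books = mean(books), var_books = var(books))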
Overdispersion
A random variable Y is overdispersed if the observed variability in Y
exceeds the variability expected by the assumed probability model of Y.
When our count response variable Y is too overdispersed to squeeze into the
Poisson regression assumptions, we have some options. Two common options,
which produce similar results, are to (1) include an overdispersion parameter
in the Poisson data model or (2) utilize a non-Poisson data model. Since it fits
nicely into the modeling framework we've established, we'll focus on the latter
approach. To this end, the Negative Binomial probability model is a useful
alternative to the Poisson when Y is overdispersed. Like the Poisson model, the
Negative Binomial is suitable for count data Y ∈ {0, 1, 2, …}. Yet unlike the
Poisson, the Negative Binomial does not make the restrictive assumption that
E(Y) = Var(Y).
(12.5)
To make the switch to the more flexible Negative Binomial regression model
of readership, we can simply swap out a Poisson data model for a Negative
Binomial data model. In doing so, we also pick up the extra reciprocal
dispersion parameter, r > 0, for which the Exponential provides a reasonable
prior structure. Our Negative Binomial regression model, along with the
weakly informative priors scaled by stan_glm() and obtained by
prior_summary(), follows:
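A sketch of this simulation using rstanarm's neg_binomial_2 family; the predictor names, prior tuning, and object name are our assumptions, consistent with the surrounding discussion:

# Negative Binomial regression of books by age and
# wisdom-vs-happiness preference
books_negbin_sim <- stan_glm(
  books ~ age + wise_unwise,
  data = pulse, family = neg_binomial_2,
  prior_intercept = normal(2, 0.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),  # reciprocal dispersion r
  chains = 4, iter = 10000, seed = 84735)

# Inspect the weakly informative priors stan_glm() actually used
prior_summary(books_negbin_sim)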
The results are fantastic. By incorporating more flexible assumptions about the
variability in book readership, the posterior simulation of book readership very
closely matches the observed behavior in our survey data (Figure 12.16). That
is, the Negative Binomial regression model is not wrong (or, at least is much
less wrong than the Poisson model).
FIGURE 12.16: A posterior predictive check for the Negative Binomial
regression model of readership.
With this peace of mind, we can continue just as we would with a Poisson
analysis. Mainly, since it utilizes a log transform, the interpretation of the
Negative Binomial regression coef cients follows the same framework as in
the Poisson setting. Consider some posterior punchlines, supported by the
tidy() summaries below:
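For example, a sketch of pulling 80% posterior credible intervals with broom.mixed, assuming the simulation object from the sketch above:

library(broom.mixed)

# 80% posterior credible intervals for the regression coefficients
tidy(books_negbin_sim, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)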
When controlling for a person's prioritization of wisdom versus happiness,
there's no significant association between age and book readership – 0 is
squarely in the 80% posterior credible interval for the age coefficient β1.
When controlling for a person's age, people that prefer wisdom over
happiness tend to read more than those that prefer happiness over wisdom –
the 80% posterior credible interval for β2 is comfortably above 0.
Assuming they're the same age, we'd expect a person that prefers wisdom
to read 1.3 times as many, or 30% more, books as somebody that prefers
happiness (e^{0.266} = 1.3).
12.7 Generalized linear models: Building on the theme
Though the Normal, Poisson, and Negative Binomial data structures are
common among Bayesian regression models, we're not limited to just these
options. We can also use stan_glm() to fit models with Binomial, Gamma,
inverse Normal, and other data structures. All of these options belong to a
larger class of generalized linear models (GLMs).
We needn't march through every single GLM to consider them part of our
toolbox. We can build a GLM with any of the above data structures by drawing
upon the principles we've developed throughout Unit 3. First, it's important to
note the structure in our response variable Y. Is Y discrete or continuous?
Symmetric or skewed? What range of values can Y take? These questions can
help us identify an appropriate data structure. Second, let E(Y | …) denote the
average Y value as defined by its data structure. For all GLMs, the dependence
of E(Y | …) on a linear combination of predictors (X₁, X₂, …, X_p) is
expressed by

$$g(E(Y \mid \ldots)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

where the appropriate link function g(⋅) depends upon the data structure. For
example, in Normal regression, the data is modeled by

$$Y_i \mid \beta_0, \beta_1, \ldots, \beta_p, \sigma \sim N(\mu_i, \sigma^2)
\quad \text{with} \quad
g(\mu_i) = \mu_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}.$$

Thus, Normal regression utilizes an identity link function since g(μ_i) is equal
to μ_i itself. In Poisson regression, the count data is modeled by

$$Y_i \mid \beta_0, \beta_1, \ldots, \beta_p \sim \text{Pois}(\lambda_i)
\quad \text{with} \quad
g(\lambda_i) := \log(\lambda_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}.$$

Thus, Poisson regression utilizes a log link function since g(λ_i) = log(λ_i).
The same is true for Negative Binomial regression. We'll dig into one more
GLM, logistic regression, in the next chapter. We hope that our survey of these
four specific tools (Normal, Poisson, Negative Binomial, and logistic
regression) empowers you to implement other GLMs in your own Bayesian
practice.
12.8 Chapter summary
Let response variable Y ∈ {0, 1, 2, …} be a discrete count of events that occur
in a fixed interval of time or space. In this context, using Normal regression to
model Y by predictors (X₁, X₂, …, X_p) is often inappropriate – it assumes
that Y is symmetric and can be negative. The Poisson regression model offers
a promising alternative:
One major constraint of Poisson regression is its assumption that, at any set of
predictor values, the typical value of Y and variability in Y are equivalent.
Thus, when Y is overdispersed, i.e., its variability exceeds that assumed by the
Poisson model, we might instead utilize the more flexible Negative Binomial
regression model:
12.9 Exercises
12.9.1 Conceptual exercises
Exercise 12.1 (Warming up).
a) Give a new example (i.e., not the same as from the chapter) in which
we would want to use a Poisson, instead of Normal, regression model.
b) The Poisson regression model uses a log link function, while the
Normal regression model uses an identity link function. Explain in one
or two sentences what a link function is.
c) Explain why the log link function is used in Poisson regression.
d) List the four assumptions for a Poisson regression model.
Exercise 12.3 (Why use a Negative Binomial). You and your friend Nico are in
a two-person Bayes Rules! book club. How lovely! Nico has read only part of
this chapter, and now they know about Poisson regression, but not Negative
Binomial regression. Be a good friend and answer their questions.
log(λ) = β₀ + β₁X₁ + β₂X₂.
a) Interpret e^{β₀} in context.
b) Interpret e^{β₁} in context.
c) Interpret e^{β₂} in context.
d) Provide the model equation for the expected number of “Likes” for a
tweet in one hour, when the person who wrote the tweet has 300
followers, and the tweet does not use an emoji.
(a) In the bald eagle analysis, why might a Poisson regression approach
be more appropriate than a Normal regression approach?
(b) Simulate the posterior of the Poisson regression model of Y versus
X1 and X2. Check the prior_summary().
(c) Use careful notation to write out the complete Bayesian structure of
the Poisson regression model of Y by X1 and X2.
(d) Complete a pp_check() for the Poisson model. Use this to explain
whether the model is “good” and, if not, what assumptions it makes
that are inappropriate for the bald eagle analysis.
Exercise 12.8 (Eagles: an even better model). The Poisson regression model of
bald eagle counts (Y) by year (X1) and observation hours (X2), was pretty
good. Let's see if a Negative Binomial approach is even better.
Exercise 12.9 (Eagles: model evaluation). Finally, let's evaluate the quality of
our Negative Binomial bald eagle model.
DOI: 10.1201/9780429288340-13
$$Y = \begin{cases} 1 & \text{if rain tomorrow} \\ 0 & \text{otherwise.} \end{cases}$$
Though there are various potential predictors of rain, we'll consider just
three:
X₁ = today's humidity at 9 a.m. (percent)
X₂ = today's humidity at 3 p.m. (percent)
X₃ = whether or not it rains today
Goals
Build a Bayesian logistic regression model of a binary categorical
variable Y by predictors X = (X₁, X₂, …, X_p).
(13.1)
For example, if the probability of rain tomorrow is π = 2/3, then the
probability it doesn't rain is 1/3 and the odds of rain are 2:

$$\text{odds of rain} = \frac{2/3}{1 - 2/3} = 2.$$

That is, it's twice as likely to rain as to not rain. If the probability of rain
tomorrow is π = 1/3, then the probability it doesn't rain is 2/3 and the
odds of rain are 1/2:

$$\text{odds of rain} = \frac{1/3}{1 - 1/3} = \frac{1}{2}.$$

That is, it's half as likely to rain as to not rain tomorrow. Finally, if the
chances of rain or no rain tomorrow are 50-50, then the odds of rain are 1:

$$\text{odds of rain} = \frac{1/2}{1 - 1/2} = 1.$$
That is, it's equally likely to rain or not rain tomorrow. These scenarios
illuminate the general principles by which to interpret the odds of an
event.
Interpreting odds
Let an event of interest have probability π ∈ [0, 1] and corresponding
odds π/(1 − π) ∈ [0, ∞). Across this spectrum, comparing the odds
to 1 provides perspective on an event's uncertainty:
The odds of an event are less than 1 if and only if the event's
chances are less than 50-50, i.e., π < 0.5.
The odds of an event are equal to 1 if and only if the event's
chances are 50-50, i.e., π = 0.5.
The odds of an event are greater than 1 if and only if the event's
chances are greater than 50-50, i.e., π > 0.5.
(13.2)
Thus, if we learn that the odds of rain tomorrow are 4 to 1, then there's an
80% chance of rain:
$$\pi = \frac{4}{1 + 4} = 0.8.$$
Neither the Normal nor Poisson regression models are appropriate for this task.
So what is?
Quiz Yourself!
What's an appropriate model structure for data Yi?
a. Bernoulli (or, equivalently, Binomial with 1 trial)
b. Gamma
c. Beta
$$g(\pi_i) = \beta_0 + \beta_1 X_{i1}.$$

Keeping in mind the principles that went into building the Poisson
regression model, take a quick quiz to reflect upon what g(π_i) might be
appropriate.1
Quiz Yourself!
Let π_i and odds_i = π_i / (1 − π_i) denote the probability and odds of rain
tomorrow. Which of the following relationships to the predictor X_{i1} is
most appropriate?
a. π_i = β₀ + β₁X_{i1}
b. odds_i = β₀ + β₁X_{i1}
c. log(π_i) = β₀ + β₁X_{i1}
d. log(odds_i) = β₀ + β₁X_{i1}
Our goal here is to write the Bernoulli mean π_i, or a function of this mean
g(π_i), as a linear function of predictor X_{i1}, β₀ + β₁X_{i1}. Among the
options presented in the quiz above, the first three would be mistakes.
Whereas the line defined by β₀ + β₁X_{i1} can span the entire real line, π_i is
restricted to values between 0 and 1 and odds_i to non-negative values. Only
the log(odds_i), like β₀ + β₁X_{i1}, spans the entire real line. Thus, the most
reasonable option is

$$Y_i \mid \beta_0, \beta_1 \overset{ind}{\sim} \text{Bern}(\pi_i)
\quad \text{with} \quad
\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{i1}.
\tag{13.3}$$
That is, we assume that the log(odds of rain) is linearly related to 9 a.m.
humidity. To work on scales that are (much) easier to interpret, we can
rewrite this relationship in terms of odds and probability, the former
following from properties of the log function and the latter following from
(13.2):
$$\frac{\pi_i}{1-\pi_i} = e^{\beta_0 + \beta_1 X_{i1}}
\quad \text{and} \quad
\pi_i = \frac{e^{\beta_0 + \beta_1 X_{i1}}}{1 + e^{\beta_0 + \beta_1 X_{i1}}}.
\tag{13.4}$$
_________________________
1 Answer : d
In examining (13.3) and Figure 13.1, notice that parameters β0 and β1 take
on the usual intercept and slope meanings when describing the linear
relationship between 9 a.m. humidity (X_{i1}) and the log(odds of rain). Yet,
as in the Poisson setting, they take on new meanings when translated to the
unlogged odds and probability scales. In general, for a logistic regression
model with predictors (X₁, …, X_p),

$$\log(\text{odds}) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$
Interpreting β₀
When (X₁, X₂, …, X_p) are all 0, β₀ is the log odds of the event of interest,
and e^{β₀} is the corresponding odds.
Interpreting β₁
Let odds_x represent the odds of the event of interest when X₁ = x and
odds_{x+1} the odds when X₁ = x + 1, holding all other predictors constant.
Since the log(odds) are linear in X₁, e^{β₁} = odds_{x+1} / odds_x: when X₁
increases by one unit, the odds of the event are multiplied by e^{β₁}.
For example, the prior plausible relationship plotted in Figure 13.1 on the
log(odds), odds, and probability scales assumes that

$$\log\left(\frac{\pi}{1-\pi}\right) = -4 + 0.1 X_{i1}.$$

Thus, on a hypothetical day with 0 percent humidity, the log(odds of rain)
would be −4. Or, on more meaningful scales, rain would be very unlikely
if preceded by a day with 0 humidity:

$$\text{odds of rain} = e^{-4} = 0.0183
\quad \text{and} \quad
\text{probability of rain} = \frac{e^{-4}}{1 + e^{-4}} = 0.0180.$$
Next, consider the humidity coefficient β₁ = 0.1. On the linear log(odds)
scale, this is simply the slope: for every one percentage point increase in
humidity level, the logged odds of rain increase by 0.1. Huh? This is easier
to make sense of on the nonlinear odds scale where the increase is
multiplicative. For every one percentage point increase in humidity level,
the odds of rain increase by 11%: e^{0.1} = 1.11. Since the probability of
rain is a nonlinear function of humidity, there's no single slope
interpretation on the probability scale – the impact of a one-point humidity
increase on the probability of rain depends upon the current humidity level.
Consider our prior tunings. Starting with the centered intercept β₀c, recall
our prior understanding that Perth is a relatively dry place; we encode this
with a N(−1.4, 0.7²) prior, centering the log(odds of rain) on a typical day
around −1.4.
The range of this Normal prior indicates our vague understanding that the
log(odds of rain) might also range from roughly −2.8 to 0 (−1.4 ± 2*0.7).
More meaningfully, we think that the odds of rain on an average day could
be somewhere between 0.06 and 1:

$$(e^{-2.8}, e^{0}) \approx (0.06, 1)$$

and thus that the probability of rain on an average day could be somewhere
between 0.057 and 0.50 (a pretty wide range in the context of rain):

$$\left(\frac{0.06}{1 + 0.06}, \frac{1}{1 + 1}\right) \approx (0.057, 0.50).$$
Next, our prior model on the humidity coefficient β1 reflects our vague
sense that the chance of rain increases when preceded by a day with high
humidity, but we're foggy on the rate of this increase and are open to the
possibility that it's nonexistent. Specifically, on the log(odds) scale, we
assume that slope β1 ranges somewhere between 0 and 0.14, and is most
likely around 0.07. Or, on the odds scale, the odds of rain might increase
anywhere from 0% to 15% for every extra percentage point in humidity
level:

$$(e^{0}, e^{0.14}) = (1, 1.15).$$
We plot just 100 of these prior plausible relationships below (Figure 13.2
left). These adequately reflect our prior understanding that the probability
of rain increases with humidity, as well as our prior uncertainty around the
rate of this increase. Beyond this relationship between rain and humidity,
we also want to confirm that our priors reflect our understanding of the
overall chance of rain in Perth. To this end, we simulate 100 datasets from
the prior model. From each dataset (labeled by .draw) we utilize
group_by() with summarize() to record the proportion of predicted
outcomes Y (.prediction) that are 1, i.e., the proportion of days on
which it rained. Figure 13.2 (right) displays a histogram of these 100
simulated rain rates from our 100 prior simulated datasets. The percent of
days on which it rained ranged from as low as roughly 5% in one dataset
to as high as roughly 50% in another. This does adequately match our prior
understanding and uncertainty about rain in Perth. In contrast, if our prior
predictions tended to be centered around high values, we'd question our
prior tuning since we don't believe that Perth is a rainy place.
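A sketch of this prior simulation check, assuming a prior-only model rain_prior (simulated with prior_PD = TRUE), the weather_perth data noted in the footnote below, and tidybayes's add_predicted_draws():

library(tidybayes)
library(dplyr)
library(ggplot2)

# From each of 100 prior simulated datasets (.draw), record the
# proportion of days on which it rained (.prediction == 1)
weather_perth %>%
  add_predicted_draws(rain_prior, ndraws = 100) %>%
  group_by(.draw) %>%
  summarize(proportion_rain = mean(.prediction == 1)) %>%
  ggplot(aes(x = proportion_rain)) +
  geom_histogram(color = "white")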
_________________________
2 weather_perth is a wrangled subset of the weatherAUS dataset in the rattle package.
FIGURE 13.2: 100 datasets were simulated from the prior models. For
each, we plot the relationship between the probability of rain and the
previous day's humidity level (left) and the observed proportion of days on
which it rained (right).
FIGURE 13.3: A jitter plot of rain outcomes versus humidity level (left).
A scatterplot of the proportion of days that see rain by humidity bracket
(right).
Not that you would forget (!), but we should check diagnostic plots for the
stability of our simulation results before proceeding:
For every one percentage point increase in today's 9 a.m. humidity, there's
an 80% posterior chance that the log(odds of rain) increases by somewhere
between 0.0415 and 0.0549. This rate of increase is less than our 0.07 prior
mean for β1 – the chance of rain does significantly increase with humidity,
just not to the degree we had anticipated. More meaningfully, for every
one percentage point increase in today's 9 a.m. humidity, the odds of rain
increase by somewhere between 4.2% and 5.6%:

$$(e^{0.0415}, e^{0.0549}) = (1.042, 1.056).$$

Equivalently, for every fifteen percentage point increase in today's 9 a.m.
humidity, the odds of rain roughly double:

$$(e^{15 \cdot 0.0415}, e^{15 \cdot 0.0549}) = (1.86, 2.28).$$
Now suppose you're in Perth today and experienced 99% humidity at 9 a.m.
Yuk. To predict whether it will rain tomorrow, we can approximate the
posterior predictive model for the binary outcome Y, whether or not it
rains, where

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \cdot 99.
\tag{13.6}$$
To really connect with the prediction concept, we can also simulate the
posterior predictive model from scratch. From each of the 20,000 pairs of
posterior plausible pairs of β0 (Intercept) and β1 (humidity9am) in
our Markov chain simulation, we calculate the log(odds of rain) (13.6). We
then transform the log(odds) to obtain the odds and probability of rain.
Finally, from each of the 20,000 probability values π, we simulate a
Bernoulli outcome of rain, Y ~Bern(π), using rbinom() with size =
1:
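A sketch of that from-scratch simulation, assuming the posterior simulation of rain by 9 a.m. humidity is stored as rain_model_1:

library(dplyr)

# Extract the 20,000 posterior plausible (beta_0, beta_1) pairs
rain_chains <- as.data.frame(rain_model_1) %>%
  rename(beta_0 = `(Intercept)`, beta_1 = humidity9am)

# For each pair: log(odds), odds, and probability of rain at 99%
# humidity, plus a simulated Bernoulli rain outcome
rain_predictions <- rain_chains %>%
  mutate(log_odds = beta_0 + beta_1 * 99,
         odds = exp(log_odds),
         prob = odds / (1 + odds),
         y_new = rbinom(n(), size = 1, prob = prob))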
For example, our first log(odds) and probability values are calculated by
plugging β₀ = −4.244 and β₁ = 0.0446 into (13.6) and transforming:

$$\pi = \frac{e^{-4.244 + 0.0446 \cdot 99}}{1 + e^{-4.244 + 0.0446 \cdot 99}} = 0.54.$$
Quiz Yourself!
Suppose it's 99% humidity at 9 a.m. today. Based on the posterior
predictive model above, what binary classification would you make?
Will it rain or not? Should we or shouldn't we carry an umbrella
tomorrow?
Questions of classification are somewhat subjective, thus there's more than
one reasonable answer to this quiz. One reasonable answer is this: yes, you
should carry an umbrella. Among our 20,000 posterior predictions of Y,
10804 (or 54.02%) called for rain. Thus, since rain was more likely than
no rain in the posterior predictive model of Y, it's reasonable to classify Y
as "1" (rain).
In making this classification, we used the following classification rule:
if the posterior probability of rain is at least 0.5, classify Y as 1; otherwise
classify Y as 0.
Though it's a natural choice, we needn't use a 50% cut-off. We can utilize
any cut-off between 0% and 100%. In general, given the posterior
probability p that Y = 1 and a classification cut-off c ∈ [0, 1], we can turn
our posterior predictions into a binary classification of Y using the
following rule:
If p ≥ c, then classify Y as 1.
If p < c, then classify Y as 0.
The first two questions have quick answers. We believe this weather
analysis to be fair and innocuous in terms of its potential impact on
society and individuals. To answer question (2), we can perform a
posterior predictive check to confirm that data simulated from our
posterior logistic regression model has features similar to the original
data, and thus that the assumptions behind our Bayesian logistic regression
model (13.5) are reasonable. Since the data outcomes we're simulating are
binary, we must take a slightly different approach to this pp_check()
than we did for Normal and Poisson regression. From each of 100
posterior simulated datasets, we record the proportion of outcomes Y that
are 1, i.e., the proportion of days on which it rained, using the
proportion_rain() function. A histogram of these simulated rain
rates confirms that they are indeed consistent with the original data
(Figure 13.6). Most of our posterior simulated datasets saw rain on
roughly 18% of days, close to the observed rain incidence in the weather
data, yet some saw rain on as few as 12% of the days or as many as 24% of
the days.
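A sketch of this check; proportion_rain() is a summary function we define ourselves and pass to pp_check() as a custom stat:

# Proportion of binary outcomes that equal 1 (i.e., rainy days)
proportion_rain <- function(x) {
  mean(x == 1)
}

# Compare this statistic between the observed data and
# 100 posterior simulated datasets
pp_check(rain_model_1, nreps = 100,
         plotfun = "stat", stat = "proportion_rain")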
The results for the first three days in our sample are shown below. Based
on its 9 a.m. humidity level, only 12% of the 20,000 predictions called for
rain on the first day (rain_prob = 0.122). Similarly, the simulated
probabilities of rain for the second and third days are also amply below
our 50% cut-off. As such, we predicted "no rain" for the first three sample
days (as shown in rain_class_1). For the first two days, these
classifications were correct – it didn't rain tomorrow. For the third day,
this classification was incorrect – it did rain tomorrow.
Notice that our classification rule, in conjunction with our Bayesian
model, correctly classified 817 of the 1000 total test cases (803 + 14).
Thus, the overall classification accuracy rate is 81.7% (817 / 1000). At
face value, this seems pretty good! But look closer. Our model is much
better at anticipating when it won't rain than when it will. Among the 814
days on which it doesn't rain, we correctly classify 803, or 98.65%. This
figure is referred to as the true negative rate or specificity of our Bayesian
model. In stark contrast, among the 186 days on which it does rain, we
correctly classify only 14, or 7.53%. This figure is referred to as the true
positive rate or sensitivity of our Bayesian model. We can confirm these
figures using the shortcut classification_summary() function in
bayesrules:
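A sketch of that shortcut, assuming rain_model_1 and the weather_perth data with the default 0.5 cut-off:

library(bayesrules)

# Confusion matrix and accuracy rates under a 0.5 classification cut-off
classification_summary(model = rain_model_1, data = weather_perth,
                       cutoff = 0.5)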
Ŷ = 0 Ŷ = 1
Y = 0 a b
Y = 1 c d
Now, if you fall into the reasonable group of people that don't like walking
around in wet clothes all day, the fact that our model is so bad at
predicting rain is terrible news. BUT we can do better. Recall from Section
13.4 that we can adjust the classification cut-off to better suit the goals of
our analysis. In our case, we can increase our model's sensitivity by
decreasing the cut-off from 0.5 to, say, 0.2. That is, we'll classify a test
case as rain if there's even a 20% chance of rain:
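In code, this is just a change of cut-off (same sketch as above):

# Re-evaluate classification accuracy with a 0.2 cut-off
classification_summary(model = rain_model_1, data = weather_perth,
                       cutoff = 0.2)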
Success! By making it easier to classify rain, the sensitivity jumped from
7.53% to 63.98% (119 of 186). We're much less likely to be walking
around with wet clothes. Yet this improvement is not without
consequences. In lowering the cut-off, we make it more difficult to predict
when it won't rain. As a result, the true negative rate dropped from 98.65%
to 71.25% (580 of 814) and we'll carry around an umbrella more often
than we need to.
Finally, to hedge against the possibility that the above model assessments
are biased by training and testing rain_model_1 using the same data,
we can supplement these measures with cross-validated estimates of
classification accuracy. The fact that these are so similar to our measures
above suggests that the model is not overfit to our sample data – it does
just as well at predicting rain on new days:3
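A sketch of the cross-validated version; the 10-fold choice is our assumption:

# Cross-validated classification accuracy under the 0.2 cut-off
classification_summary_cv(model = rain_model_1, data = weather_perth,
                          cutoff = 0.2, k = 10)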
In closing this section, was it worth it? That is, do the benefits of better
classifying rain outweigh the consequences of mistakenly classifying no
rain as rain? We don't know. This is a subjective question. As an analyst,
you can continue to tweak the classification rule until the corresponding
correct classification rates best match your goals.
Our original prior understanding was that the chance of rain increases with
each individual predictor in this model. Yet we have less prior certainty
about how one predictor is related to rain when controlling for the other
predictors – we're not meteorologists or anything. For example, if we
know today's 3pm humidity level, it could very well be the case that
today's 9am humidity doesn't add any additional information about
whether or not it will rain tomorrow (i.e., β₁ = 0). With this, we'll
maintain our original N(−1.4, 0.7²) prior for the centered intercept β₀c
and utilize weakly informative priors for the remaining coefficients.
Among the posterior results for this expanded model, when controlling for
today's morning and afternoon humidity levels, we expect that the odds of
rain tomorrow more than triple if it rains today.
In contrast, notice that the β1 (humidity9am) posterior straddles 0 with
an 80% posterior credible interval which ranges from -0.0163 to 0.0025.
Let's start with what this observation doesn't mean. It does not mean that
humidity9am isn't a significant predictor of tomorrow's rain. We saw in
rain_model_1 that it is. Rather, humidity9am isn't a significant
predictor of tomorrow's rain when controlling for afternoon humidity and
whether or not it rains today. Put another way, if we already know today's
humidity3pm and rain status, then knowing humidity9am doesn't
significantly improve our understanding of whether or not it rains
tomorrow. This shift in understanding about humidity9am from
rain_model_1 to rain_model_2 might not be much of a surprise –
humidity9am is strongly associated with humidity3pm and
raintoday, thus the information it holds about raintomorrow is
somewhat redundant in rain_model_2.
Finally, which is the better model of tomorrow's rain, rain_model_1 or
rain_model_2? Using a classification cut-off of 0.2, let's compare the
cross-validated estimates of classification accuracy for rain_model_2
to those for rain_model_1:
Quiz Yourself!
Which of rain_model_1 and rain_model_2 produces the most
accurate posterior classifications of tomorrow's rain? Which model
do you prefer?
Here, the estimated ELPD for rain_model_1 is more than two standard
errors below, and hence worse than, that of rain_model_2
(−80.2 ± 2 · 13.5). Thus, ELPD provides even more evidence in favor of
rain_model_2.
13.7 Chapter summary
Let response variable Y ∈ {0, 1} be a binary categorical variable. Thus,
modeling the relationship between Y and a set of predictors
(X₁, X₂, …, X_p) requires a classification modeling approach. To this end,
the logistic regression model assumes that Y_i is Bernoulli with probability
π_i, where the log(odds) are a linear combination of the predictors and
where we can transform the model from the log(odds) scale to the more
meaningful odds and probability scales:

$$\frac{\pi_i}{1-\pi_i} = e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}}
\quad \text{and} \quad
\pi_i = \frac{e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}}}{1 + e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}}}.$$
13.8 Exercises
13.8.1 Conceptual exercises
Exercise 13.1 (Normal vs logistic). For each scenario, identify whether
Normal or logistic regression is the appropriate tool for modeling Y by X.
(a) Y = whether or not a person bikes to work, X = the distance from
the person's home to work
(b) Y = the number of minutes it takes a person to commute to work,
X = the distance from the person's home to work
(c) Y = the number of minutes it takes a person to commute to work,
X = whether or not the person takes public transit to work
Exercise 13.2 (What are the odds?). Calculate and interpret the odds for
each event of interest below.
(a) The odds that your team will win the basketball game are 20 to 1.
(b) The odds of rain tomorrow are 0.5.
(c) The log(odds) of a certain candidate winning the election are 1.
(d) The log(odds) that a person likes pineapple pizza are -2.
(a) Express the posterior median model on the odds and probability
scales.
(b) Interpret the age coefficient on the odds scale.
(c) Calculate the posterior median probability that a 60-year-old
believes in climate change.
(d) Repeat part c for a 20-year-old.
y 0 1
FALSE (0) 50 300
TRUE (1) 30 620
_________________________
4 This is a random sample of data collected by Antonio et al. (2019) and distributed by the
Hotels TidyTuesday project (R for Data Science, 2020c).
DOI: 10.1201/9780429288340-14
There exist multiple penguin species throughout Antarctica, including the Adelie,
Chinstrap, and Gentoo. When encountering one of these penguins on an Antarctic trip, we
might classify its species

$$Y = \begin{cases} A & \text{Adelie} \\ C & \text{Chinstrap} \\ G & \text{Gentoo} \end{cases}$$

by examining various physical characteristics, such as whether the penguin weighs more
than the average 4200g,

$$X_1 = \begin{cases} 1 & \text{above-average weight} \\ 0 & \text{below-average weight,} \end{cases}$$

along with its bill length X₂ (mm) and flipper length X₃ (mm).
The penguins_bayes data, originally made available by Gorman et al. (2014) and
distributed by Horst et al. (2020), contains the above species and feature information for a
sample of 344 Antarctic penguins:
Among these penguins, 152 are Adelies, 68 are Chinstraps, and 124 are Gentoos. We'll
assume throughout that the proportional breakdown of these species in our dataset reflects
the species breakdown in the wild. That is, our prior assumption about any new penguin is
that it's most likely an Adelie and least likely a Chinstrap:
Before proceeding with our analysis of this data, take a quick quiz.
Quiz Yourself!
Explain why neither the Normal nor logistic regression model would be appropriate
for classifying a penguin's species Y from its physical characteristics X.
Sounds fantastic. But as you might expect, these benefits don't come without a cost. We'll
observe throughout this chapter that naive Bayes classification is called "naive" for a
reason.
Goals
Explore the inner workings of naive Bayes classification.
Implement naive Bayes classification in R.
Develop strategies to evaluate the quality of naive Bayes classifications.
Quiz Yourself!
Based on the plot below, for which species is a below-average weight most likely?
FIGURE 14.1: The proportion of each penguin species that's above average weight.
Notice in Figure 14.1 that Chinstrap penguins are the most likely to be below average
weight. Yet before declaring this penguin to be a Chinstrap, we should consider that this
is the rarest species to encounter in the wild. That is, we have to think like Bayesians by
combining the information from our data with our prior information about species
membership to construct a posterior model for our penguin's species. The naive Bayes
classification approach to this task is nothing more than a direct appeal to the tried-and-
true Bayes' Rule from Chapter 2. In general, to calculate the posterior probability that our
penguin is species Y = y in light of its weight status X₁ = x₁, we can plug into

$$f(y \mid x_1) = \frac{\text{prior} \cdot \text{likelihood}}{\text{normalizing constant}}
= \frac{f(y)L(y \mid x_1)}{f(x_1)}
\tag{14.1}$$
where, by the Law of Total Probability,

$$f(x_1) = \sum_{\text{all } y'} f(y')L(y' \mid x_1)
= f(y'=A)L(y'=A \mid x_1) + f(y'=C)L(y'=C \mid x_1) + f(y'=G)L(y'=G \mid x_1).
\tag{14.2}$$

In fact, we can directly calculate the posterior model of our penguin's species from this
table. For example, notice that among the 193 penguins that are below average weight,
126 are Adelies. Thus, there's a roughly 65% posterior chance that this penguin is an
Adelie:

$$f(y = A \mid x_1 = 0) = \frac{126}{193} \approx 0.6528.$$

Let's confirm this result by plugging information from our tabyl above into Bayes' Rule
(14.1). This tedious step is not to annoy, but to practice for generalizations we'll have to
make in more complicated settings. First, our prior information about species
membership indicates that Adelies are the most common and Chinstraps the least:

$$f(y = A) = \frac{151}{342}, \quad f(y = C) = \frac{68}{342}, \quad f(y = G) = \frac{123}{342}.
\tag{14.3}$$

Further, the likelihoods demonstrate that below-average weight is most common among
Chinstrap penguins. For example, 89.71% of Chinstraps but only 4.88% of Gentoos are
below average weight:

$$L(y = A \mid x_1 = 0) = \frac{126}{151} \approx 0.8344, \quad
L(y = C \mid x_1 = 0) = \frac{61}{68} \approx 0.8971, \quad
L(y = G \mid x_1 = 0) = \frac{6}{123} \approx 0.0488.$$

Plugging these priors and likelihoods into (14.2), the total probability of observing a
below-average weight penguin across all species is

$$f(x_1 = 0) = \frac{151}{342} \cdot \frac{126}{151} + \frac{68}{342} \cdot \frac{61}{68} + \frac{123}{342} \cdot \frac{6}{123} = \frac{193}{342}.$$

Finally, by Bayes' Rule we can confirm that there's a 65% posterior chance that this
penguin is an Adelie:

$$f(y = A \mid x_1 = 0) = \frac{f(y=A)L(y=A \mid x_1=0)}{f(x_1=0)}
= \frac{(151/342) \cdot (126/151)}{193/342} \approx 0.6528.$$
All in all, the posterior probability that this penguin is an Adelie is more than double that
of the other two species. Thus, our naive Bayes classification, based on our prior
information and the penguin's below-average weight alone, is that this penguin is an
Adelie. Though below-average weight is relatively less common among Adelies than
among Chinstraps, the final classification was pushed over the edge by the fact that
Adelies are far more common.
Quiz Yourself!
Based on Figure 14.2, among which species is a 50mm-long bill the most common?
FIGURE 14.2: Density plots of the bill lengths (mm) observed among three penguin
species.
Notice from the plot that a 50mm-long bill would be extremely long for an Adelie
penguin. Though the distinction in bill lengths is less dramatic among the Chinstrap and
Gentoo, Chinstrap bills tend to be a tad longer. In particular, a 50mm-long bill is fairly
average for Chinstraps, though slightly long for Gentoos. Thus, again, our data points to
our penguin being a Chinstrap. And again, we must weigh this data against the fact that
Chinstraps are the rarest of these three species. To this end, we can appeal to Bayes' Rule
to make a naive Bayes classification of Y from information that bill length X₂ = x₂:

$$f(y \mid x_2) = \frac{f(y)L(y \mid x_2)}{f(x_2)}
= \frac{f(y)L(y \mid x_2)}{\sum_{\text{all } y'} f(y')L(y' \mid x_2)}.
\tag{14.4}$$

Our question for you is this: What is the likelihood that we would observe a 50mm-long
bill if the penguin is an Adelie, L(y = A | x₂ = 50)? Unlike the penguin's categorical
weight status (X₁), its bill length is quantitative. Thus, we can't simply calculate
L(y = A | x₂ = 50) from a table of species vs bill_length_mm. Further, we
haven't assumed a model for data X₂ from which to define the likelihood function
L(y | x₂).

This is where one "naive" part of naive Bayes classification comes into play. The naive
Bayes method typically assumes that any quantitative predictor, here X₂, is continuous
and conditionally Normal. That is, within each species, bill lengths X₂ are Normally
distributed with possibly different means μ and standard deviations σ:

$$X_2 \mid (Y = A) \sim N(\mu_A, \sigma_A^2), \quad
X_2 \mid (Y = C) \sim N(\mu_C, \sigma_C^2), \quad
X_2 \mid (Y = G) \sim N(\mu_G, \sigma_G^2).$$

_________________________
2 Answer: Chinstrap
Though this is a naive blanket assumption to make for all quantitative predictors, it turns
out to be reasonable for bill length X2. Reexamining Figure 14.2, notice that within each
species, bill lengths are roughly bell shaped around different means and with slightly
different standard deviations. Thus, we can tune the Normal model for each species by
setting its μ and σ parameters to the observed sample means and standard deviations in
bill lengths within that species. For example, we'll tune the Normal model of bill lengths
for the Adelie species to have mean and standard deviation μ = 38.8 mm andA
σA
= 2.66 mm:
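These tuning values can be pulled straight from the data; a sketch using dplyr (bill_stats is our own object name):

library(dplyr)

# Sample mean and standard deviation of bill length within each species
bill_stats <- penguins_bayes %>%
  group_by(species) %>%
  summarize(mean = mean(bill_length_mm, na.rm = TRUE),
            sd = sd(bill_length_mm, na.rm = TRUE))
bill_stats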
Plotting the tuned Normal models for each species confirms that this naive Bayes
assumption isn't perfect – it's a bit more idealistic than the density plots of the raw data in
Figure 14.2. But it's fine enough to continue.
FIGURE 14.3: Normal pdfs tuned to match the observed mean and standard deviation in
bill lengths (mm) among three penguin species.
Recall that this Normality assumption provides the mechanism we need to evaluate the
likelihood of observing a 50mm-long bill among each of the three species,
L(y | x₂ = 50). Connecting back to Figure 14.3, these likelihoods correspond to the
Normal density curve heights at a bill length of 50 mm. Thus, observing a 50mm-long
bill is slightly more likely among Chinstrap than Gentoo, and highly unlikely among
Adelie penguins. More specifically, we can calculate the likelihoods using dnorm().
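Continuing the sketch above, these likelihoods are just Normal density heights at a 50mm bill, evaluated with each species' sample mean and standard deviation:

# L(y | x2 = 50): the Normal density height at a 50mm bill
# for each species (bill_stats computed in the earlier sketch)
bill_stats %>%
  mutate(likelihood = dnorm(50, mean = mean, sd = sd))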
We now have everything we need to plug into Bayes' Rule (14.4). Weighting the
likelihoods by the prior probabilities of each species (à la (14.2)), the marginal pdf of
observing a penguin with a 50mm-long bill is

$$f(x_2 = 50) = \frac{151}{342} \cdot 0.0000212 + \frac{68}{342} \cdot 0.112 + \frac{123}{342} \cdot 0.09317 = 0.05579.$$

It follows that our naive Bayes classification, based on our prior information and the
penguin's bill length alone, is that this penguin is a Gentoo – it has the highest posterior
probability. Though a 50mm-long bill is relatively less common among Gentoo than
among Chinstrap, the final classification was pushed over the edge by the fact that Gentoo
are much more common in the wild.
Now suppose that, in addition to its 50mm-long bill, our penguin has a flipper
length of X₃ = 195 mm. Either one of these measurements alone might lead to a
misclassification. Just as it's tough to distinguish between the Chinstrap and Gentoo
penguins based on their bill lengths alone, it's tough to distinguish between the Chinstrap
and Adelie penguins based on their flipper lengths alone (Figure 14.4).
FIGURE 14.4: Density plots of the bill lengths (mm) and flipper lengths (mm) among
our three penguin species.
BUT the species are fairly distinguishable when we combine the information about bill
and flipper lengths. Our penguin with a 50mm-long bill and 195mm-long flipper,
represented at the intersection of the dashed lines in Figure 14.5, now lies squarely among
the Chinstrap observations:
FIGURE 14.5: A scatterplot of the bill lengths (mm) versus flipper lengths (mm) among
our three penguin species.
Let's use naive Bayes classi cation to balance this data with our prior information on
species membership. To calculate the posterior probability that the penguin is species Y = y, we can adjust Bayes' Rule to accommodate our two predictors, X_2 = x_2 and X_3 = x_3:

$$f(y \mid x_2, x_3) = \frac{f(y)\,L(y \mid x_2, x_3)}{\sum_{\text{all } y'} f(y')\,L(y' \mid x_2, x_3)} \quad (14.5)$$
This presents yet another new twist: How can we calculate the likelihood function that incorporates two variables, L(y | x_2, x_3)? This is where yet another "naive" assumption creeps in. Naive Bayes classification assumes that predictors are conditionally independent, thus

$$L(y \mid x_2, x_3) = f(x_2 \mid y)\,f(x_3 \mid y).$$
In words, within each species, we assume that the length of a penguin's bill has no
relationship to the length of its ipper. Mathematically and computationally, this
assumption makes the naive Bayes algorithm ef cient and manageable. However, it might
also make it wrong. Revisit Figure 14.5. Within each species, ipper length and bill
length appear to be positively correlated, not independent. Yet, we'll naively roll with the
imperfect independence assumption, and thus the possibility that our classi cation
accuracy might be weakened.
Combined then, the multivariable naive Bayes model assumes that our two predictors are
Normal and conditionally independent. We already tuned this Normal model for bill
length X2 in Section 14.1.2. Similarly, we can tune the species-speci c Normal models
for ipper length X3 to match the corresponding sample means and standard deviations:
Accordingly, we can evaluate the likelihood of observing a 195-mm flipper length for each of the three species, L(y | x_3 = 195) = f(x_3 = 195 | y):
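A sketch of this tuning and likelihood evaluation, again assuming a penguins data frame with a flipper_length_mm column:

library(dplyr)

# Tune a Normal model of flipper length for each species and evaluate
# the likelihood of a 195mm flipper under each
penguins %>%
  group_by(species) %>%
  summarize(mean = mean(flipper_length_mm, na.rm = TRUE),
            sd = sd(flipper_length_mm, na.rm = TRUE)) %>%
  mutate(likelihood = dnorm(195, mean = mean, sd = sd))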
For each species, we now have the likelihood of observing a bill length of X_2 = 50 mm, the likelihood of observing a flipper length of X_3 = 195 mm, and the prior probability (14.3). Combined, the likelihoods of observing a 50mm-long bill and a 195mm-long flipper for each species Y = y, weighted by the prior probability of the species, are:
$$f(y' = A)\,L(y' = A \mid x_2 = 50, x_3 = 195) = \frac{151}{342} \cdot 0.0000212 \cdot 0.04554$$
$$f(y' = C)\,L(y' = C \mid x_2 = 50, x_3 = 195) = \frac{68}{342} \cdot 0.112 \cdot 0.05541$$
$$f(y' = G)\,L(y' = G \mid x_2 = 50, x_3 = 195) = \frac{123}{342} \cdot 0.09317 \cdot 0.0001934$$
with a sum of

$$\sum_{\text{all } y'} f(y')\,L(y' \mid x_2 = 50, x_3 = 195) \approx 0.001241.$$
And plugging into Bayes' Rule (14.5), the posterior probability that the penguin is an Adelie is

$$f(y = A \mid x_2 = 50, x_3 = 195) = \frac{\frac{151}{342} \cdot 0.0000212 \cdot 0.04554}{0.001241} \approx 0.0003.$$
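To check this arithmetic, we can plug the prior probabilities and the likelihoods reported above directly into (14.5):

# Priors and likelihoods reported above (A = Adelie, C = Chinstrap, G = Gentoo)
priors       <- c(A = 151, C = 68, G = 123) / 342
like_bill    <- c(A = 0.0000212, C = 0.112,   G = 0.09317)    # L(y | x2 = 50)
like_flipper <- c(A = 0.04554,   C = 0.05541, G = 0.0001934)  # L(y | x3 = 195)

# Naive Bayes posterior probabilities via (14.5)
numerators <- priors * like_bill * like_flipper
numerators / sum(numerators)   # roughly 0.0003 for Adelie, nearly 1 for Chinstrap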
In conclusion, our penguin is almost certainly a Chinstrap. Though we didn't come to this
conclusion using any physical characteristic alone, together they paint a pretty clear
picture.
Let Y denote a categorical response variable with two or more categories and (X_1, X_2, …, X_p) be a set of p possible predictors of Y. Then the posterior probability that a new case with observed predictors (X_1, X_2, …, X_p) = (x_1, x_2, …, x_p) belongs to class Y = y is

$$f(y \mid x_1, x_2, \ldots, x_p) = \frac{f(y)\,L(y \mid x_1, x_2, \ldots, x_p)}{\sum_{\text{all } y'} f(y')\,L(y' \mid x_1, x_2, \ldots, x_p)}.$$

Naive Bayes classification makes the following naive assumptions about the likelihood function of Y: the predictors are conditionally independent, so that

$$L(y \mid x_1, x_2, \ldots, x_p) = \prod_{i=1}^{p} f(x_i \mid y),$$

where f(x_i | y) = P(X_i = x_i | Y = y) for a categorical predictor X_i and, for a quantitative predictor, X_i | (Y = y) ~ N(μ_iy, σ_iy²) within each category Y = y.
14.2
That was nice, but we needn't do all of this work by hand. To implement naive Bayes classification in R, we'll use the naiveBayes() function in the e1071 package (Meyer et al., 2021). As with stan_glm(), we feed naiveBayes() the data and a formula indicating which data variables to use in the analysis. Yet, since naive Bayes calculates prior probabilities directly from the data and implementation doesn't require MCMC simulation, we don't have to worry about providing information regarding prior models or Markov chains. Below we build two naive Bayes classification algorithms, one using bill_length_mm alone and one that also incorporates flipper_length_mm:
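A minimal sketch of these two calls, assuming the sample data are stored in a data frame named penguins:

library(e1071)

# Naive Bayes model of species by bill length alone
naive_model_1 <- naiveBayes(species ~ bill_length_mm, data = penguins)

# Naive Bayes model of species by bill length and flipper length
naive_model_2 <- naiveBayes(species ~ bill_length_mm + flipper_length_mm,
                            data = penguins)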
Let's apply both of these to classify our_penguin that we studied throughout Section
14.1:
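For example, defining our_penguin by its two measurements and calling predict() on each model, where type = "raw" returns the posterior classification probabilities:

# Our penguin: 50mm bill and 195mm flipper
our_penguin <- data.frame(bill_length_mm = 50, flipper_length_mm = 195)

# Posterior classification probabilities under each model
predict(naive_model_1, newdata = our_penguin, type = "raw")
predict(naive_model_2, newdata = our_penguin, type = "raw")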
And just as we concluded in Section 14.1.3, if we take into account both its bill length and
ipper length, our best guess is that this penguin is a Chinstrap:
We can similarly apply our naive Bayes models to classify any number of penguins. As
with logistic regression, we'll take two common approaches to evaluating the accuracy of
these classi cations:
1. construct confusion matrices which compare the observed species of our sample
penguins to their naive Bayes species classi cations; and
2. for a better sense of how well our naive Bayes models classify new penguins,
calculate cross-validated estimates of classi cation accuracy.
rst approach, we classify each of the penguins using both
To begin with the
naive_model_1 and naive_model_2 and store these in penguins as class_1
and class_2:
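A sketch of this step:

library(dplyr)

# Store each model's classification of every sample penguin
penguins <- penguins %>%
  mutate(class_1 = predict(naive_model_1, newdata = penguins),
         class_2 = predict(naive_model_2, newdata = penguins))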
The classi cation results are shown below for four randomly sampled penguins,
contrasted against the actual species of these penguins. For the last two penguins, the
two models produce the same classi cations (Adelie) and these classi cations are correct.
For the rst two penguins, the two models lead to different classi cations. In both cases,
naive_model_2 is correct.
Quiz Yourself!
With these prompts in mind, let's examine the two confusion matrices. One quick
observation is that naive_model_2 does a better job across the board. Not only are its
classi cation accuracy rates for each of the Adelie, Chinstrap, and Gentoo species higher
than in naive_model_1, its overall accuracy rate is also higher. The
naive_model_2 correctly classi es 327 (146 + 59 + 122) of the 344 total penguins
(95%), whereas naive_model_1 correctly classi es only 261 (76%). Where
naive_model_2 enjoys the greatest improvement over naive_model_1 is in the
classi cation of Chinstrap penguins. In naive_model_1, only 9% of Chinstraps are
correctly classi ed, with a whopping 85% being misclassi ed as Gentoo. At 87%, the
classi cation accuracy rate for Chinstraps is much higher in naive_model_2.
Finally, for due diligence, we can utilize 10-fold cross-validation to evaluate and compare
how well our naive Bayes classi cation models classify new penguins, not just those in
our sample. We do so using the naive_classification_summary_cv() function
in the bayesrules package:
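A sketch of these calls, assuming naive_classification_summary_cv() takes the model, the data, the name of the response variable, and the number of folds:

library(bayesrules)
set.seed(84735)

# 10-fold cross-validated classification summaries for each model
cv_model_1 <- naive_classification_summary_cv(
  model = naive_model_1, data = penguins, y = "species", k = 10)
cv_model_2 <- naive_classification_summary_cv(
  model = naive_model_2, data = penguins, y = "species", k = 10)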
The cv_model_2$folds object contains the classi cation accuracy rates for each of
the 10 folds whereas cv_model_2$cv averages the results across all 10 folds:
The accuracy rates in this cross-validated confusion matrix are comparable to those in the
non-cross-validated confusion matrix above. This implies that our naive Bayes model
appears to perform nearly as well on new penguins as it does on the original penguin
sample that we used to build this model.
Unlike logistic regression, naive Bayes lacks regression coefficients β_i. Thus, though naive Bayes can turn
information about predictors X into classi cations of Y, it does so without much
illumination of the relationships among these variables.
Whether naive Bayes or logistic regression is the right tool for a binary classi cation job
depends upon the situation. In general, if the rigid naive Bayes assumptions are
inappropriate or if you care about the speci c connections between Y and X (i.e., you
don't simply want a set of classi cations), you should use logistic regression. Otherwise,
naive Bayes might be just the thing. Better yet, don't choose! Try out and learn from both
tools.
14.4 Chapter summary
Naive Bayes classification is a handy tool for classifying categorical response variables Y with two or more categories. Letting (X_1, X_2, …, X_p) be a set of p possible predictors of Y, naive Bayes calculates the posterior probability of each category membership via Bayes' Rule:

$$f(y \mid x_1, x_2, \ldots, x_p) = \frac{f(y)\,L(y \mid x_1, x_2, \ldots, x_p)}{\sum_{\text{all } y'} f(y')\,L(y' \mid x_1, x_2, \ldots, x_p)}.$$

In doing so, it makes some very naive assumptions about the data model from which we define the likelihood L(y | x_1, x_2, …, x_p): the predictors X_i are conditionally independent, and the values of quantitative predictors X_i vary Normally within each category Y = y.
When appropriate, these simplifying assumptions make the naive Bayes model
computationally ef cient and straightforward to apply. Yet when these simplifying
assumptions are violated (which is common), the naive Bayes model can produce
misleading classi cations.
14.5 Exercises
14.5.1 Conceptual exercises
Exercise 14.1 (Naive). Why is naive Bayes classi cation called naive?
Exercise 14.2 (Which model?). For each scenario below, indicate whether you could
classify Y by X using logistic regression, naive Bayes classi cation, or both.
Exercise 14.3 (Pros and cons). Every modeling technique has some pros and cons.
Exercise 14.4 (Fake news: exclamation points). The fake_news data in the bayesrules
package contains information about 150 news articles, some real news and some fake
news. In the next exercises, our goal will be to develop a model that helps us classify an
article's type, real or fake, using a variety of predictors. To begin, let's consider whether
an article's title has an exclamation point (title_has_excl).
(a) Construct and discuss a visualization of the relationship between article type
and title_has_excl.
(b) Suppose a new article is posted online and its title does not have an exclamation
point. Utilize naive Bayes classi cation to calculate the posterior probability
that the article is real. Do so from scratch, without using naiveBayes() with
predict().
(c) Check your work to part b using naiveBayes() with predict().
Exercise 14.5 (Fake news: title length). Consider another possible predictor of article
type: the number of words in the title.
(a) Construct and discuss a visualization of the relationship between article type
and the number of title_words.
(b) In using naive Bayes classi cation to classify an article's type based on its
title_words, we assume that the number of title_words are
conditionally Normal. Do you think this is a fair assumption in this analysis?
(c) Suppose a new article is posted online and its title has 15 words. Utilize naive
Bayes classi cation to calculate the posterior probability that the article is real.
Do so from scratch, without using naiveBayes() with predict().
(d) Check your work to part c using naiveBayes() with predict().
Exercise 14.6 (Fake news: title length and negative sentiment). Of course, we can use
more than one feature to help us classify whether an article is real or fake. Here, let's
consider both an article's title length (title_words) and the percent of words in the
article that have a negative sentiment.
(a) Construct and discuss a visualization of the relationship between article type
and negative sentiment.
(b) Construct a visualization of the relationship of article type with both
title_words and negative.
(c) Suppose a new article is posted online – it has a 15-word title and 6% of its
words have negative associations. Utilize naive Bayes classi cation to calculate
the posterior probability that the article is real. Do so from scratch, without
using naiveBayes() with predict().
(d) Check your work to part c using naiveBayes() with predict().
Exercise 14.7 (Fake news: three predictors). Suppose a new article is posted online – it
has a 15-word title, 6% of its words have negative associations, and its title doesn't have
an exclamation point. Based on these three features, utilize naive Bayes classi cation to
calculate the posterior probability that the article is real. Do so using naiveBayes()
with predict().
Exercise 14.8 (Fake news: let's pick a model). We've now tried four different naive Bayes
classi cation models of article type. In this exercise you'll evaluate and compare the
performance of these four models.
model           formula
news_model_1    type ~ title_has_excl
news_model_2    type ~ title_words
news_model_3    type ~ title_words + negative
news_model_4    type ~ title_words + negative + title_has_excl
a) Construct a cross-validated confusion matrix for news_model_1.
b) Interpret each percentage in the confusion matrix.
c) Similarly, construct cross-validated confusion matrices for the other three
models.
d) If our goal is to best detect when an article is fake, which of the four models
should we use?
Exercise 14.9 (Logistic vs naive). Naive Bayes isn't the only approach we can take to
classifying real vs fake news. Since article type is binary, we could also approach this
task using Bayesian logistic regression.
DOI: 10.1201/9780429288340-15
Welcome to Unit 4!
Unit 4 is all about hierarchies. Used in the sentence “my workplace is
so hierarchical,” this word might have negative connotations. In
contrast, “my Bayesian model is so hierarchical” often connotes a
good thing! Hierarchical models greatly expand the exibility of our
modeling toolbox by accommodating hierarchical, or grouped data.
For example, our data might consist of:
a sampled group of schools and data y on multiple individual
students within each school; or
a sampled group of labs and data y from multiple individual
experiments within each lab; or
a sampled group of people on whom we make multiple individual
observations of information y over time.
Ignoring this type of underlying grouping structure violates the
assumption of independent data behind our Unit 3 models and, in
turn, can produce misleading conclusions. In Unit 4, we'll explore
techniques that empower us to build this hierarchical structure into
our models:
Chapter 15: Why you should be excited about hierarchical
modeling
Chapter 16: (Normal) hierarchical models without predictors
Chapter 17: (Normal) hierarchical models with predictors
Chapter 18: Non-Normal hierarchical models
Chapter 19: Adding more layers
Before we do, we want to point out that we're using “hierarchical” as
a blanket term for group-structured models. Across the literature,
these are referred to as multilevel, mixed effects, or random effects
models. These terms can serve to confuse, especially since some
people use them interchangeably and others use them to make minor
distinctions in group-structured models. It's kind of like “pop” vs
“soda” vs “cola.” We'll avoid that confusion here but point it out so
that you're able to make that connection outside this book.
This data, a subset of the Cherry data in the mdsr package (Baumer et
al., 2021), contains the net running times (in minutes) for 36 participants
in the annual 10-mile Cherry Blossom race held in Washington, D.C. Each
runner is in their 50s or 60s and has entered the race in multiple years. The
plot below illustrates the degree to which some runners are faster than
others, as well as the variability in each runner's times from year to year.
FIGURE 15.1: Boxplots of net running times (in minutes) for 36 runners
that entered the Cherry Blossom race in multiple years.
Our goal is to better understand the relationship between running time and
age for runners in this age group. What we'll learn is that our current
Bayesian modeling toolbox has some limitations.
Goals
Explore the limitations of our current Bayesian modeling toolbox
under two extremes, complete pooling and no pooling.
Examine the bene ts of the partial pooling provided by
hierarchical Bayesian models.
Focus on the big ideas and leave the details to subsequent
chapters.
FIGURE 15.2: A scatterplot of net running time versus age for every race
result.
And we can simulate this complete pooled model as usual, here using
weakly informative priors:
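A sketch of this simulation with stan_glm(), assuming the running data contain net race times (net) and age, and using rstanarm's autoscaled weakly informative priors:

library(rstanarm)

complete_pooled_model <- stan_glm(
  net ~ age,
  data = running, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)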
The posterior results suggest a weak relationship between running time and age. The posterior median model suggests that running times tend to increase by a mere 0.27 minutes for each year in age. And with an 80% posterior credible interval for β_1 which straddles 0 (-0.3, 0.84), this relationship is not significant:
OK. We didn't want to be the first to say it, but this seems a bit strange.
Our own experience does not support the idea that we'll continue to run at
the same speed. In fact, check out Figure 15.3. If we examine the
relationship between running time and age for each of our 36 runners (gray
lines), almost all have gotten slower over time and at a more rapid rate
than suggested by the posterior median model (blue line).
FIGURE 15.3: Observed trends in running time versus age for the 36
subjects (gray) along with the posterior median model (blue).
In Figure 15.4, we zoom in on the details for just three runners. Though
the complete pooled model (blue line) does okay in some cases (e.g.,
runner 22), it's far from universal. The general speed, and changes in speed
over time, vary quite a bit from runner to runner. For example, runner 20
tends to run slower than average and, with age, is slowing down at a more
rapid rate. Runner 1 has consistently run faster than the average.
FIGURE 15.4: Scatterplots of running time versus age for 3 subjects,
along with the posterior median model (blue).
The complete pooled model lumps all of our runners together into one population or one "pool." In doing so, it makes two assumptions: (1) the observations are independent across runners and races; and (2) one universal model of running time versus age is appropriate for all runners.
15.2 No pooling
Having failed with our complete pooled model, let's swing to the other
extreme. Instead of lumping everybody together into one pool and
ignoring any information about runners, the no pooling approach considers
each of our m = 36 runners separately. This framework is represented by
the diagram in Figure 15.6 where y_ij denotes the ith observation on runner j.
This framework also means that the no pooling approach builds a separate model for each runner. Specifically, let (Y_ij, X_ij) denote the observed run times and age for runner j in their ith race. Then the data structure for the Normal linear regression model of run time vs age for runner j is:

$$Y_{ij} \mid \beta_{0j}, \beta_{1j}, \sigma \sim N(\mu_{ij}, \sigma^2) \quad \text{with} \quad \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij}. \quad (15.2)$$

This model allows each runner j to have a unique intercept β_0j and age coefficient β_1j. Or, in the context of running, the no pooled models reflect the fact that some people tend to be faster than others (hence the different β_0j) and that changes in speed over time aren't the same for everyone (hence the different β_1j).
This seems great at first glance. The runner-specific models pick up on the runner-specific trends. However, there are two significant drawbacks to the no pooling approach. First, suppose that you planned to run in the Cherry Blossom race in each of the next few years. Based on the no pooling results for our three example cases, what do you anticipate your running times to be? Are you stumped? If not, you should be. The no pooling approach can't help you answer this question. Since they're tailored to the 36 individuals in our sample, the resulting 36 models don't reliably extend beyond these individuals. To consider a second wrinkle, take a quick quiz.1
Quiz Yourself!
Reexamine runner 1 in Figure 15.7. If they were to race a sixth time
at age 62, 5 years after their most recent data point, what would you
expect their net running time to be?
a) Below 75 minutes
b) Between 75 and 85 minutes
c) Above 85 minutes
If you were utilizing the no pooled model to answer this question, your
answer would be a. Runner 1's model indicates that they're getting faster
with age and should have a running time under 75 minutes by the time
they turn 62. Yet this no pooled conclusion exists in a vacuum, only taking
into account data on runner 1. From the other 35 runners, we've observed
that most people tend to get slower over time. It would be unfortunate to
completely ignore this information, especially since we have a mere five-race sample size for runner 1 (hence aren't in the position to disregard the
extra data!). A more reasonable prediction might be option b: though they
might not maintain such a steep downward trajectory, runner 1 will likely
remain a fast runner with a race time between 75 and 85 minutes. Again,
this would be the reasonable conclusion, not the conclusion we'd make if
using our no pooled models alone. Though we've explored the no pooling
drawbacks in the speci c context of the Cherry Blossom race, they are true
in general.
_________________________
1 Answer : There's no de nitive right answer here, but we think that b is the most reasonable.
1. Within-group variability
The degree of the variability among multiple observations within
each group can be interesting on its own. For example, we can
examine how consistent an individual's running times are from
year to year.
2. Between-group variability
Hierarchical data also allows us to examine the variability from
group to group. For example, we can examine the degree to which
running patterns vary from individual to individual.
Finally, consider the modeling results for you, a new runner. Notice that
there's no blue line in your panel. Since you weren't in the original
running sample, we can't use the no pooled model to predict your
running trend. In contrast, since they don't ignore the broader population
from which our runners were sampled, the complete pooled and
hierarchical models can both be used to predict your trend. Yet the results
are different! Since the complete pooled model wasn't able to detect the
fact that runners in their 50s tend to get slower over time, your posterior
median model (black) has a slope near zero. Though this is a nice thought,
the hierarchical model likely offers a more realistic prediction. Based on
the data across all 36 runners, you'll probably get a bit slower. Further, the
anticipated rate at which you will slow down is comparable to the average
rate across our 36 runners.
15.5 Chapter summary
In Chapter 15 we motivated the importance of incorporating another
technique into our Bayesian toolbox: hierarchical models. Focusing on big
ideas over details, we explored three approaches to studying group
structured data:
Complete pooled models lump all data points together, assuming they
are independent and that a universal model is appropriate for all
groups. This can produce misleading conclusions about the relationship
of interest and the signi cance of this relationship. Simply put, the
vibe of complete pooled models is “no room for individuality.”
No pooled models build a separate model for each group, assuming
that one group's model doesn't tell us anything about another's. This
approach underutilizes the data and cannot be generalized to groups
outside our sample. Simply put, the vibe of no pooled models is “every
group for themselves.”
Partial pooled or hierarchical models provide a middle ground. They
acknowledge that, though each individual group might have its own
model, one group can provide valuable information about another. That
is, “let's learn from one another while celebrating our individuality.”
15.6 Exercises
15.6.1 Conceptual exercises
Exercise 15.1 (Three pooling types: explain to your friend). In one to two
sentences each, explain the following concepts to your friend Hakeem,
who has taken an intro statistics course but otherwise is new to pooled
data.
a) Complete pooling
b) No pooling
c) Partial pooling
Exercise 15.2 (Three pooling types: redux). Hakeem now understands the
three pooling approaches thanks to your excellent explanations in the
previous exercise. Now he has some follow-up questions! Answer these in
your own words and using your own examples.
Exercise 15.5 (Complete pooling: Part II). In the context of the sleep
study, what two incorrect assumptions does the complete pooled model
make and why are these inappropriate in the sleep study analysis?
Exercise 15.6 (No pooling: Part I). Suppose instead that we (incorrectly)
took a no pooling approach in our sleep study analysis.
Exercise 15.7 (No pooling: Part II). In the context of the sleep study, what
are the two main drawbacks to analyzing the relationship between reaction
time and sleep deprivation using a no pooling approach?
DOI: 10.1201/9780429288340-16
In Chapter 16 we'll build our rst hierarchical models upon the foundations
established in Chapter 15. We'll start simply, by assuming that we have some
response variable Y, but no predictors X. Consider the following data story. The
Spotify music streaming platform provides listeners with access to more than
50 million songs. Of course, some songs are more popular than others. Let Y be
a song's Spotify popularity rating on a 0-100 scale. In general, the more recent
plays that a song has on the platform, the higher its popularity rating. Thus, the
popularity rating doesn't necessarily measure a song's overall quality, long-term
popularity, or popularity beyond the Spotify audience. And though it will be
tough to resist asking which song features X can help us predict ratings, we'll
focus on understanding Y alone. Speci cally, we'd like to better understand the
following:
Other than a vague sense that the average popularity rating is around 50, we
don't have any strong prior understanding about these dynamics, and thus will
be utilizing weakly informative priors throughout our Spotify analysis. With
that, let's dig into the data. The spotify data in the bayesrules package is a
subset of data analyzed by Kaylin Pavlik (2019) and distributed through the
#TidyTuesday project (R for Data Science, 2020d). It includes information on
350 songs on Spotify.1
Here we select only the few variables that are important to our analysis, and
reorder the artist levels according to their mean song popularity using
fct_reorder() in the forcats package (Wickham, 2021):
_________________________
1 To focus on the new methodology, we analyze a tiny fraction of the nearly 33,000 songs analyzed
by Pavlik. We COULD but won't analyze all 33,000 songs.
You're encouraged to view the resulting spotify data in full. But even just a
few snippets reveal a hierarchical or grouped data structure:
Speci cally, our total sample of 350 songs is comprised of multiple songs for
each of 44 artists who were, in turn, sampled from the population of all artists
that have songs on Spotify (Figure 16.1).
Complete pooling
Ignore artists and lump all songs together (Figure 15.5).
No pooling
Separately analyze each artist and assume that one artist's data doesn't
contain valuable information about another artist (Figure 15.6).
Partial pooling (via hierarchical models)
Acknowledge the grouping structure, so that even though artists differ in
popularity, they might share valuable information about each other and
about the broader population of artists (Figure 16.1).
As you might suspect, the rst two approaches oversimplify the analysis. The
good thing about trying them out is that it will remind us why we should care
about hierarchical models in the rst place.
Goals
Build a hierarchical model of variable Y with no predictors X.
Simulate and analyze this hierarchical model using rstanarm.
Utilize hierarchical models for predicting Y.
To focus on the big picture, we keep this rst exploration of hierarchical
models to the Normal setting, a special case of the broader generalized
linear hierarchical models we'll study in Chapter 18.
Warning
As our Bayesian modeling toolkit expands, so must our R tools and syntax.
Be patient with yourself here – as you see more examples throughout this
unit, the patterns in the syntax will become more familiar.
Summing the number of songs n_j across all artists produces our total sample size n:

$$n = \sum_{j=1}^{44} n_j = n_1 + n_2 + \cdots + n_{44} = 350.$$
Beyond the artists are the songs themselves. To distinguish between songs, we
will let Y_ij (or Y_{i,j}) represent the ith song for artist j where i ∈ {1, 2, …, n_j} and j ∈ {1, 2, …, 44}. Thus, we can think of our sample of 350 total songs as
the collection of our smaller samples on each of 44 artists:
Though these ratings appear somewhat left skewed, we'll make one more
simpli cation in our complete pooling analysis by assuming Normality.
Speci cally, we'll utilize the following Normal-Normal complete pooled model
with a prior for μ that re ects our weak understanding that the average
popularity rating is around 50 and, independently, a weakly informative prior
for σ. These prior speci cations can be con rmed using prior_summary().
Though we saw similar models in Chapters 5 and 9, it's important to review the
meaning of parameters μ and σ before shaking things up. Since μ and σ are
shared by every song Y_ij, we can think of them as global parameters which do not vary by artist j. Across all songs across all artists in the population:
To simulate the corresponding posterior, note that the complete pooled model
(16.1) is a simple Normal regression model in disguise. Speci cally, if we
substitute notation β0 for global mean μ, (16.1) is an intercept-only regression
model with no predictors X. With this in mind, we can simulate the complete
pooled model using stan_glm() with the formula popularity ~ 1 where the 1 specifies an "intercept-only" term:
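A sketch of this call, using weakly informative priors centered at a popularity of 50:

library(rstanarm)

spotify_complete_pooled <- stan_glm(
  popularity ~ 1,
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)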
The posterior summaries below suggest that Spotify songs have a typical
popularity rating, μ, around 58.39 points with a relatively large standard
deviation from song to song, σ, around 20.67 points.
Quiz Yourself!
Suppose the following three artists each release a new song on Spotify:
Mia X, the artist with the lowest mean popularity in our sample (13).
Beyoncé, an artist with nearly the highest mean popularity in our
sample (70).
Mohsen Beats, a musical group that we didn't observe.
Using the complete pooled model, what would be the approximate
posterior predictive mean of each song's popularity?
Your intuition might have reasonably suggested that the predicted popularity
for these three artists should be different – our data suggests that Beyoncé's
song will be more popular than Mia X's. Yet, recall that by lumping all songs
together, our complete pooled model ignores any artist-speci c information. As
a result, the posterior predicted popularity of any new song will be the same for
every artist, those that are in our sample (Beyoncé and Mia X) and those that
are not (Mohsen Beats). Further, the shared posterior predictive mean
popularity will be roughly equivalent to the posterior expected value of the
global mean, E(μ|y) ≈ 58.39. Plotting the posterior predictive popularity
models (light blue) alongside the observed mean popularity levels (dark dots)
for all 44 sampled artists illustrates the unfortunate consequences of this
oversimpli cation – it treats all artists the same even though we know they're
not.
_________________________
2 Answer : The posterior predictive median will be around 58.39 for each artist.
16.2 No pooled model
The complete pooled model ignored the grouped structure of our Spotify data –
we have 44 unique artists in our sample and multiple songs per each. Next,
consider a no pooled model which swings the opposite direction, separately
analyzing the popularity rating of each individual artist. Figure 16.4 displays
density plots of song popularity for each artist.
FIGURE 16.4: Density plots of the variability in popularity from song to song,
by artist.
The punchline in this mess of lines is this: the typical popularity and variability
in popularity differ by artist. Whereas some artists' songs tend to have
popularity levels below 25, others tend to have popularity levels higher than 75.
Further, for some artists, the popularity from song to song is very consistent, or
less variable. For others, song popularity falls all over the map – some of their
songs are wildly popular and others not so much.
Instead of assuming a shared global mean popularity level, the no pooled model recognizes that artist popularity differs by incorporating group-specific parameters μ_j which vary by artist. Specifically, within each artist j, we assume that the popularity of songs i is Normally distributed around some mean μ_j with standard deviation σ:

$$Y_{ij} \mid \mu_j, \sigma \sim N(\mu_j, \sigma^2) \quad (16.2)$$
Thus
μj = mean song popularity for artist j; and
σ = the standard deviation in popularity from song to song within each
artist.
Figure 16.5 illustrates these parameters and the underlying model assumptions
for just ve of our 44 sampled artists. First, our Normal data model (16.2)
assumes that the mean popularity μj varies between different artists j, i.e., some
artists' songs tend to be more popular than other artists' songs. In contrast, it
assumes that the variability in popularity from song to song, σ, is the same for
each artist, and hence its lack of a j subscript. Though Figure 16.4 might bring
the assumptions of Normality and equal variability into question, with such
small sample sizes per artist, we can't put too much stock into the potential
noise. The current simplifying assumptions are a reasonable place to start.
The notation of the no pooled data structure (16.2) also clari es some features
of our no pooling approach:
parameters μj and μk. Thus, each artist gets a unique mean and one artist's
mean doesn't tell us anything about another's.
Whereas our model does not pool information about the artists' mean popularity levels μ_j, by assuming that the artists share the σ standard deviation parameter, it does pool information from the artists to learn about σ. We could also assign unique standard deviation parameters to each artist, σ_j, but the shared standard deviation assumption simplifies things a great deal. In this spirit of simplicity, we'll continue to refer to (16.3) as a no pooled model, though "mostly no pooled" would be more accurate.
Finally, to complete the no pooled model, we utilize weakly informative
Normal priors for the mean parameters μj (each centered at an average
popularity rating of 50) and an Exponential prior for the standard deviation
parameter σ. Since the 44 μj priors are uniquely tuned for the corresponding
artist j, we do not list them here:
Quiz Yourself!
Suppose the following three artists each release a new song on Spotify:
a. Mia X, the artist with the lowest mean popularity in our sample
(13).
b. Beyoncé, an artist with nearly the highest mean popularity in our
sample (70).
c. Mohsen Beats, a musical group that we didn't observe.
Using the no pooled model, what will be the approximate posterior
predictive mean of each song's popularity?
Since each artist gets a unique mean popularity parameter μj which is modeled
using their unique song data, they each get a unique posterior predictive model
for their next song's popularity. Having utilized weakly informative priors, it
thus makes sense that Mia X's posterior predictive model should be centered
near her observed sample mean of 13, whereas Beyoncé's should be near 70. In
fact, Figure 16.6 con rms that the posterior predictive model for the popularity
of each artist's next song is centered around the mean popularity of their
observed sample songs.
_________________________
3 To make the connection to the no pooled model, let predictor X_ij be 1 for artist j and 0 otherwise. Then μ_1 X_{i1} + μ_2 X_{i2} + ⋯ + μ_44 X_{i,44} = μ_j, the mean popularity for artist j.
4 Answer : a) Roughly 13; b) Roughly 70; c) NA. The no pooled model can't be used to predict the
popularity of a Mohsen Beats song, since Mohsen Beats does not have any songs included in this
sample.
FIGURE 16.6: Posterior predictive intervals for artist song popularity, as
calculated from a no pooled model.
If we didn't pause and re ect, this result might seem good. The no pooled
model acknowledges that some artists tend to be more popular than others.
However, there are two critical drawbacks. First, the no pooled model ignores
data on one artist when learning about the typical popularity of another. This is
especially problematic when it comes to artists for whom we have few data
points. For example, our low posterior predictions for Mia X's next song were
based on a measly 4 songs. The other artists' data suggests that these low
ratings might just be a tough break – her next song might be more popular!
Similarly, our high posterior predictions for Lil Skies' next song were based on
only 3 songs. In light of the other artists' data, we might wonder whether this
was beginner's luck that will be tough to maintain.
Second, the no pooled model cannot be generalized to artists outside our
sample. Remember Mohsen Beats? Since the no pooled model is tailor-made to
the artists in our sample, and Mohsen Beats isn't part of this sample, it cannot
be used to predict the popularity of Mohsen Beats' next song. That is, just as the
no pooled model assumes that no artist in our sample contains valuable
information about the typical popularity of another artist in our sample, it
assumes that our sampled artists tell us nothing about the typical popularity of
artists that didn't make it into our sample. This also speaks to why we cannot
account for the grouped structure of our data by including artist as a
categorical predictor X. On top of violating the assumption of independence,
doing so would allow us to learn about only the 44 artists in our sample.
Layer 1 deals with the smallest unit in our hierarchical data: individual songs within each artist. As with the no pooled model (16.2), this first layer is group or artist specific, and thus acknowledges that song popularity Y_ij depends in part on the artist. Specifically, within each artist j, we assume that the popularity of songs i is Normally distributed around some mean μ_j with standard deviation σ_y:

$$Y_{ij} \mid \mu_j, \sigma_y \sim N(\mu_j, \sigma_y^2)$$
Thus, the meanings of the Layer 1 parameters are the same as they were for the no pooled model: μ_j is the mean song popularity for artist j and σ_y is the standard deviation in popularity from song to song within each artist. Layer 2 then assumes that the artist-specific mean popularity levels μ_j vary Normally around some global mean μ with standard deviation σ_μ:

$$\mu_j \mid \mu, \sigma_\mu \sim N(\mu, \sigma_\mu^2).$$
A density plot of the mean popularity levels for our 44 artists indicates that this
prior assumption is reasonable. The artists' observed sample means do appear
to be roughly Normally distributed around some global mean:
FIGURE 16.7: A density plot of the variability in mean song popularity from
artist to artist.
Notation alert
There's a difference between μj and μ. When a parameter has a subscript
j, it refers to a feature of group j. When a parameter doesn't have a
subscript j, it's the global counterpart, i.e., the same feature across all
groups.
Subscripts signal the group or layer of interest. For example, σy refers
to the standard deviation of Y values within each group, whereas σμ
refers to the standard deviation of means μj from group to group.
Putting this all together, our nal hierarchical model brings together the models
of how individual song popularity Y varies within artists (Layer 1) with the
ij
model of how mean popularity levels μj vary between artists (Layer 2) with our
prior understanding of the entire population of artists (Layer 3). The weakly
informative priors are speci ed here and con rmed below. Note that we again
assume a baseline song popularity rating of 50.
$$\begin{array}{lcll}
Y_{ij} \mid \mu_j, \sigma_y & \sim & N(\mu_j, \sigma_y^2) & \text{model of individual songs within artist } j \\
\mu_j \mid \mu, \sigma_\mu & \stackrel{\text{ind}}{\sim} & N(\mu, \sigma_\mu^2) & \text{model of variability between artists} \\
\mu & \sim & N(50, 52^2) & \text{prior models on global parameters} \\
\sigma_y & \sim & \text{Exp}(0.048) & \\
\sigma_\mu & \sim & \text{Exp}(1) &
\end{array} \quad (16.5)$$
This type of model has a special name: one-way analysis of variance. We'll
explore the motivation behind this moniker in Section 16.3.3.
$$\begin{array}{lcll}
Y_{ij} \mid \mu_j, \sigma_y & \sim & N(\mu_j, \sigma_y^2) & \text{model of individual observations within group } j \\
\mu_j \mid \mu, \sigma_\mu & \stackrel{\text{ind}}{\sim} & N(\mu, \sigma_\mu^2) & \text{model of how parameters vary between groups} \\
\mu & \sim & N(m, s^2) & \text{prior models on global parameters} \\
\sigma_y & \sim & \text{Exp}(l_y) & \\
\sigma_\mu & \sim & \text{Exp}(l_\mu) &
\end{array} \quad (16.6)$$
tweaks or adjustments b_j to μ,

$$\mu_j = \mu + b_j$$

where these tweaks are Normal deviations from 0 with standard deviation σ_μ:

$$b_j \sim N(0, \sigma_\mu^2).$$

For example, suppose that the average popularity of all songs on Spotify is μ = 55, whereas the average popularity for some artist j is μ_j = 65. Then this artist's average popularity is 10 above the global average. That is, b_j = 10 and

$$\mu_j = \mu + b_j = 55 + 10 = 65.$$
Again, these two model formulations, (16.5) and (16.7), are equivalent. They're
just two different ways to think about the hierarchical structure of our data.
We'll use both throughout this book.
16.3.3 Within- vs between-group variability
Let's take in the model we've just built. The models we studied prior to Unit 4
allow us to explore one source of variability, that in the individual observations
Y across an entire population. In contrast, the aptly named one-way “analysis of
variance” models of hierarchical data enable us to decompose the variance in Y
into two sources: (1) the variance of individual observations Y within any given
group, σ_y²; and (2) the variance of features between groups, σ_μ², i.e., from group to group. Combined, these two sources account for the total variability in Y:

$$\text{Var}(Y_{ij}) = \sigma_y^2 + \sigma_\mu^2. \quad (16.8)$$

Further, we can compare the relative size of these two sources:

$$\frac{\sigma_y^2}{\sigma_\mu^2 + \sigma_y^2} = \text{proportion of } \text{Var}(Y_{ij}) \text{ that can be explained by differences within groups}$$
$$\frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_y^2} = \text{proportion of } \text{Var}(Y_{ij}) \text{ that can be explained by differences between groups} \quad (16.9)$$
Or in the context of our Spotify analysis, we can quantify how much of the total variability in popularity across all songs and artists, Var(Y_ij), can be explained by differences between artists versus differences among songs within each artist.
Figure 16.8 illustrates three scenarios for the comparison of these two sources
of variability. At one extreme is the scenario in which the variability between groups is much greater than that within groups, σ_μ > σ_y (plot a). In this case, the groups are quite distinct. At the other extreme, the variability within groups is much greater than that between groups, σ_y > σ_μ (plot c). In this case, there's very little distinction between the Y values for the two groups. As such, the majority of the variability in Y can be explained by differences between the observations within each group, not between the groups themselves. With 44
artists, it's tough to visually assess from the 44 density plots in Figure 16.4
where our Spotify data falls along this spectrum. However, our posterior
analysis in Section 16.4 will reveal that our example is somewhere in the
middle. The differences from song to song within any one artist are similar in
scale to the differences between the artists themselves (à la plot b).
FIGURE 16.8: Three scenarios in which the within-group correlation among the Y values is 0.8 (a), 0.5 (b), and 0.2 (c).
Plot (a) illuminates another feature of grouped data: the multiple observations
within any given group are correlated. The popularity of one Beyoncé song is
related to the popularity of other Beyoncé songs. The popularity of one Mia X
song is related to the popularity of other Mia X songs. In general, the one-way
ANOVA model (16.6) assumes that, though observations on one group are
independent of those on another, the within-group correlation for two
observations i and k on the same group j is
$$\text{Cor}(Y_{ij}, Y_{kj}) = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_y^2}. \quad (16.10)$$
Thus, the more unique the groups, with relatively large σμ, the greater the
correlation within each group. For example, the within-group correlation in plot
(a) of Figure 16.8 is 0.8, a gure close-ish to 1. In contrast, the more similar the
groups, with relatively small σμ, the smaller the correlation within each group.
For example, the within-group correlation in plot (c) is 0.2, a gure close-ish to
0. We'll return to these ideas once we simulate our model posterior, and thus
can put some numbers to these metrics for our Spotify data.
16.4 Posterior analysis
16.4.1 Posterior simulation
Next step: posterior! Notice that our hierarchical Spotify model (16.5) has a
total of 47 parameters: 44 artist-speci c parameters (μj) and 3 global
parameters (μ, σy, σμ). Exploring the posterior models of these whopping 47
parameters requires MCMC simulation. To this end, the stan_glmer()
function operates very much like stan_glm(), with two small tweaks.
To indicate that the artist variable defines the group structure of our data, as opposed to it being a predictor of popularity, the appropriate formula here is popularity ~ (1 | artist).
The prior for σμ is speci ed by prior_covariance. For this particular
model, with only one set of artist-speci c parameters μj, this is equivalent to
an Exp(1) prior. (You will learn more about prior_covariance in
Chapter 17.)
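Putting these two tweaks together, the simulation call might look as follows; the specific prior arguments shown here are autoscaled, weakly informative defaults and are an assumption of this sketch:

library(rstanarm)

spotify_hierarchical <- stan_glmer(
  popularity ~ (1 | artist),
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)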
MCMC diagnostics for all 47 parameters con rm that our simulation has
stabilized:
Further, a quick posterior predictive check con rms that, though we can always
do better, our hierarchical model (16.5) isn't too wrong when it comes to
capturing variability in song popularity. A set of posterior simulated datasets of
song popularity are consistent with the general features of the original
popularity data.
FIGURE 16.9: 100 posterior simulated datasets of song popularity (light blue)
along with the actual observed popularity data (dark blue).
With this reassurance, let's dig into the posterior results. To begin, our
spotify_hierarchical simulation contains Markov chains of length
20,000 for each of our 47 parameters. To get a sense for how this information is
stored, check out the labels of the rst and last few parameters:
Due to the sheer number and type of parameters here, summarizing our
posterior understanding of all of these parameters will take more care than it
did for non-hierarchical models. We'll take things one step at a time, putting in
place the syntactical details we'll need for all other hierarchical models in Unit
4, i.e., this won't be a waste of your time!
μ = (Intercept)
σ_y = sigma
σ_μ² = Sigma[artist:(Intercept),(Intercept)]5
To call up the posterior medians for σy and σμ, we can specify effects =
"ran_pars", i.e., parameters related to randomness or variability:
_________________________
5 This is not a typo. The default output gives us information about the standard deviation within artists (σ_y) but the variance between artists (σ_μ²).
The posterior median of σ_y (sd_Observation.Residual) suggests that,
within any given artist, popularity ratings tend to vary by 14 points from song
to song. The between standard deviation σμ (sd_(Intercept).artist)
tends to be slightly higher at around 15.1. Thus, the mean popularity rating
tends to vary by 15.1 points from artist to artist. By (16.10), these two sources
of variability suggest that the popularity levels among multiple songs by the
same artist tend to have a moderate correlation near 0.54:
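Per (16.10), plugging in the posterior medians reported above:

# Within-artist correlation: sigma_mu^2 / (sigma_mu^2 + sigma_y^2)
15.1^2 / (15.1^2 + 14^2)   # roughly 0.54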
μj = μ + bj .
Thus, bj describes the difference between artist j's mean popularity and the
global mean popularity. It's these tweaks that are simulated by
stan_glmer() – each bj has a corresponding Markov chain labeled
b[(Intercept) artist:j]. We can obtain a tidy() posterior summary
of all bj terms using the argument effects = "ran_vals" (i.e., random
artist-speci c values):
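A sketch of this call:

library(broom.mixed)

# 80% posterior credible intervals for each artist's tweak b_j
tidy(spotify_hierarchical, effects = "ran_vals",
     conf.int = TRUE, conf.level = 0.80)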
Consider the 80% credible interval for Camilo's bj tweak: there's an 80%
chance that Camilo's mean popularity rating is between 19.4 and 32.4 above
that of the average artist. Similarly, there's an 80% chance that Mia X's mean
popularity rating is between 23.4 and 40.7 below that of the average artist. We
can also combine our MCMC simulations for the global mean μ and artist
tweaks bj to directly simulate posterior models for the artist-speci c means μj.
For artist “j”:
μ_j = μ + b_j = (Intercept) + b[(Intercept) artist:j].
The tidybayes package provides some tools for this task. We'll take this one
step at a time here, but combine these steps in later analyses. To begin, we use
spread_draws() to extract the (Intercept) and b[(Intercept)
artist:j] values for each artist in each iteration. We then sum these two
terms to de ne mu_j. This produces an 880000-row data frame which contains
20000 MCMC samples of μj for each of the 44 artists j:
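A sketch of these steps with tidybayes; the object name artist_chains is ours, not fixed by any package:

library(tidybayes)
library(dplyr)

# Extract the global intercept and artist-specific tweaks b_j, then sum them
artist_chains <- spotify_hierarchical %>%
  spread_draws(`(Intercept)`, b[, artist]) %>%
  mutate(mu_j = `(Intercept)` + b)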
For example, at the rst set of chain values, Alok's mean popularity (67.4) is
16.3 points above the average (51.1). Next, mean_qi() produces posterior
summaries for each artist's mean popularity μj, including the posterior mean
and an 80% credible interval:
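Continuing the sketch above:

# Posterior mean and 80% credible interval of mu_j for each artist
artist_summary <- artist_chains %>%
  select(artist, mu_j) %>%
  mean_qi(mu_j, .width = 0.80)

artist_summary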
For example, with 80% posterior certainty, we can say that Beyoncé's mean
popularity rating μj is between 65.6 and 72.7. Plotting the 80% posterior
credible intervals for all 44 artists illustrates the variability in our posterior
understanding of their mean popularity levels μj:
FIGURE 16.10: 80% posterior credible intervals for each artist's mean song
popularity.
Not only do the μj posteriors vary in location, i.e., we expect some artists to be
more popular than others, they vary in scale – some artists' 80% posterior
credible intervals are much wider than others. Pause to think about why in the
following quiz.6
Quiz Yourself!
At the more popular end of the spectrum, Lil Skies and Frank
Ocean have similar posterior mean popularity levels. However, the 80%
credible interval for Lil Skies is much wider than that for Frank Ocean.
Explain why.
_________________________
6 Answer : Lil Skies has a much smaller sample size.
To sleuth out exactly why Lil Skies and Frank Ocean have similar posterior
means but drastically different intervals, it's important to remember that not all
is equal in our hierarchical data. Whereas our posterior understanding of Frank
Ocean is based on 40 songs, the most of any artist in the dataset, we have only 3
songs for Lil Skies. Then we naturally have greater posterior certainty about
Frank Ocean's popularity, and hence narrower intervals.
Next, consider posterior prediction. To approximate the posterior predictive model for the popularity of Frank Ocean's next song, Y_new,j, we can simulate a draw from the Layer 1 model at each of the 20,000 parameter sets in our Markov chain, {μ_j^(i), σ_y^(i)}:

$$Y_{\text{new},j}^{(i)} \mid \mu_j, \sigma_y \sim N\!\left(\mu_j^{(i)}, \big(\sigma_y^{(i)}\big)^2\right).$$

The resulting prediction set {Y_new,j^(1), Y_new,j^(2), …, Y_new,j^(20000)} and corresponding posterior predictive model will reflect two sources of variability, and hence uncertainty, in the popularity of Ocean's next song: the song-to-song variability within Ocean's repertoire and our posterior uncertainty about his underlying mean popularity μ_j.
To construct this set of predictions, we rst obtain the Markov chain for μj
(mu_ocean) by summing the global mean μ (Intercept) and the artist
adjustment (b[(Intercept) artist:Frank_Ocean]). We then
simulate a prediction (y_ocean) from the Layer 1 model at each MCMC
parameter set:
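A sketch of this by-hand simulation; the exact label of the artist level, assumed here to be "artist:Frank_Ocean", can be checked in the extracted chains:

library(tidybayes)
library(dplyr)
set.seed(84735)

ocean_chains <- spotify_hierarchical %>%
  spread_draws(`(Intercept)`, b[, artist], sigma) %>%
  filter(artist == "artist:Frank_Ocean") %>%   # assumed level label
  mutate(mu_ocean = `(Intercept)` + b,
         y_ocean  = rnorm(n(), mean = mu_ocean, sd = sigma))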
As these results confirm, we're much more certain about Ocean's underlying mean song popularity than in the popularity of any single Ocean song.
What about an artist we haven't observed, like Mohsen Beats? Posterior prediction now requires an extra step.
Step 1: Simulate a potential mean popularity level μ_mohsen for Mohsen Beats by drawing from the Layer 2 model evaluated at each MCMC parameter set {μ^(i), σ_μ^(i)}:

$$\mu_{\text{mohsen}}^{(i)} \mid \mu, \sigma_\mu \sim N\!\left(\mu^{(i)}, \big(\sigma_\mu^{(i)}\big)^2\right).$$

Step 2: Simulate the popularity of Mohsen Beats' next song, Y_new,mohsen, by drawing from the Layer 1 model evaluated at each simulated mean and MCMC parameter set:

$$Y_{\text{new,mohsen}}^{(i)} \mid \mu_{\text{mohsen}}, \sigma_y \sim N\!\left(\mu_{\text{mohsen}}^{(i)}, \big(\sigma_y^{(i)}\big)^2\right).$$
The additional step in our Mohsen Beats posterior prediction process reflects a third source of variability that arises when predicting song popularity for a new group: beyond the posterior variability in the global parameters and the song-to-song variability within an artist, we're also uncertain about where Mohsen Beats' own mean popularity level falls among artists. Finally, below we replicate these "by hand" simulations for Frank Ocean and Mohsen Beats using the posterior_predict() shortcut function.
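A sketch of this shortcut:

library(bayesplot)
set.seed(84735)

# Posterior predictive draws for one observed artist and one new artist;
# column 1 corresponds to Frank Ocean, column 2 to Mohsen Beats
predictions <- posterior_predict(
  spotify_hierarchical,
  newdata = data.frame(artist = c("Frank Ocean", "Mohsen Beats")))

mcmc_areas(predictions, prob = 0.8)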
Plots of the resulting posterior predictive models illustrate two key features of our understanding about our two artists (Figure 16.11). First, we anticipate that Ocean's next song will have a higher popularity rating. This makes sense –
since Ocean's observed sample songs were popular, we'd expect his next song to
be more popular than that of an artist for whom we have no information.
Second, we have greater posterior certainty about Ocean's next song. This again
makes sense – since we actually have some data for Ocean (and don't for Mohsen Beats), we should be more confident in our ability to predict his next
song's popularity. We'll continue our hierarchical posterior prediction
exploration in the next section, paying special attention to how the results
compare with those from no pooled and complete pooled models.
FIGURE 16.11: Posterior predictive models for the popularity of the next
songs by Mohsen Beats and Frank Ocean.
Quiz Yourself!
What might “shrinkage” mean in this Spotify example and why might it
occur?
We swung the other direction in Section 16.2, using a no pooled model which
separately analyzed each artist (16.3). As such, when using weakly informative
priors, the no pooled posterior predictive means were roughly equivalent to the
sample mean popularity levels for each artist (Figure 16.6). For any artist j, this
sample mean is calculated by averaging the popularity levels y_ij of each song i:

$$\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}.$$
Figure 16.12 contrasts the hierarchical model posterior mean predictions with
the complete pooled model predictions (dashed horizontal line) and no pooled
model predictions (dark blue dots). In general, our hierarchical posterior
understanding of artists strikes a balance between these two extremes – the
hierarchical predictions are pulled or shrunk toward the global trends of the
complete pooled model and away from the local trends of the no pooled model.
Hence the term shrinkage.
Shrinkage
Shrinkage refers to the phenomenon in which the group-speci c local
trends in a hierarchical model are pulled or shrunk toward the global
trends.
In fact, the hierarchical posterior mean prediction for artist j is roughly a weighted average of the global sample mean and artist j's own sample mean:

$$\frac{\sigma_y^2}{\sigma_y^2 + n_j \sigma_\mu^2}\,\bar{y}_{\text{global}} + \frac{n_j \sigma_\mu^2}{\sigma_y^2 + n_j \sigma_\mu^2}\,\bar{y}_j.$$
In posterior predictions for artist j, the weights given to the global and local
means depend upon how much data we have on artist j (nj) as well as the
comparison of the within-group and between-group variability in song
popularity (σy and σμ). These weights highlight a couple of scenarios in which
individualism fades, i.e., our hierarchical posterior predictions shrink away
from the group-specific means ȳ_j and toward the global mean ȳ_global:
We can see these dynamics at play in Figure 16.12 – some artists shrunk more
toward the global mean popularity levels than others. The artists that shrunk the
most are those with smaller sample sizes nj and popularity levels at the
extremes of the spectrum. Consider two of the most popular artists: Camila
Cabello and Lil Skies. Though Cabello's observed mean popularity is
slightly lower than that of Lil Skies, it's based on a much bigger sample
size:
As such, Lil Skies' posterior predictive mean popularity shrunk closer to the
global mean – the data on the other artists in our sample suggests that Lil
Skies might have beginner's luck, and thus that their next song will likely be
below their current three-song average. This makes sense and is a compelling
feature of hierarchical models. Striking a balance between complete and no
pooling, hierarchical models allow us to:
1. generalize the observations on our sampled groups to the broader
population; while
2. borrowing strength or information from all sampled groups when
learning about any individual sampled group.
Quiz Yourself!
Consider the three modeling approaches in this chapter: no pooled,
complete pooled, and hierarchical models.
Let's apply this thinking to the categorical artist variable. Our sample data
included multiple observations for a mere random sample of 44 among
thousands of artists on Spotify. Hence, treating artist as a predictor (as in
the no pooled model) would limit our understanding to only these artists. In
contrast, treating it as a grouping variable (as in the hierarchical model) allows
us to not only learn about the 44 artists in our data, but the broader population
of artists from which they were sampled.
Check out another example. Reconsider the bikes data from Chapter 9 which,
on each of 500 days, records the number of rides taken by Capital Bikeshare
members and whether the day fell on a weekend:
Quiz Yourself!
In a model of rides, is the weekend variable a potential grouping
variable or predictor?
There are only two possible categories for the weekend variable: each day
either falls on a weekend (TRUE) or it doesn't (FALSE). Our dataset covers
both instances:
Since the observed weekend values cover all categories of interest, it's not a
grouping variable. Rather, it's a potential predictor that can help us explore
whether ridership is different on weekends vs weekdays.
Consider yet another example. To address disparities in vocabulary levels
among children from households with different income levels, the Abdul Latif
Jameel Poverty Action Lab (J-PAL) evaluated the effectiveness of a digital
vocabulary learning program, the Big Word Club (Kalil et al., 2020). To do so,
they enrolled 818 students across 47 different schools in a vocabulary study.
Data on the students' vocabulary levels after completing this program
(score_a2) was obtained through the Inter-university Consortium for
Political and Social Research (ICPSR) and stored in the big_word_club
dataset in the bayesrules package. We'll keep only the students for whom we
have a vocabulary score:
Quiz Yourself!
In a model of score_a2, is the school_id variable a potential
grouping variable or predictor?
$$\begin{array}{lcll}
Y_{ij} \mid \mu_j, \sigma_y & \sim & N(\mu_j, \sigma_y^2) & \text{model of individual observations within group } j \\
\mu_j \mid \mu, \sigma_\mu & \stackrel{\text{ind}}{\sim} & N(\mu, \sigma_\mu^2) & \text{model of how parameters vary between groups} \\
\mu & \sim & N(m, s^2) & \text{prior models on global parameters} \\
\sigma_y & \sim & \text{Exp}(l_y) & \\
\sigma_\mu & \sim & \text{Exp}(l_\mu). &
\end{array} \quad (16.12)$$
to learn about any one group, this model borrows strength from the
information on other groups, thereby shrinking group-speci c local
phenomena toward the global trends; and
this model is less variable than the no pooling approach and less biased than
the complete pooling approach.
16.9 Exercises
16.9.1 Conceptual exercises
Exercise 16.1 (Shrinkage). The plot below illustrates the distribution of critic
ratings for 7 coffee shops. Suppose we were to model coffee shop ratings using
three different approaches: complete pooled, no pooled, and hierarchical. For
each model, sketch what the posterior mean ratings for the 7 coffee shops might
look like on the plot below.
Exercise 16.3 (Speed typing: interpret the coef cients). Alicia loves typing. To
share the appreciation, she invites four friends to each take 20 speed-typing
tests. Let Y_ij be the time it takes friend j to complete test i.
Exercise 16.4 (Speed typing: sketch the data). In the spirit of Figure 16.8, sketch what density plots of your four friends' typing speed outcomes Y_ij might look like under each scenario below.
a. The overall results of the 20 timed tests are remarkably similar among
the four friends.
b. Each person is quite consistent in their typing times, but there are big
differences from person to person – some tend to type much faster than
others.
c. Within the subjects, there doesn't appear to be much correlation in
typing time from test to test.
Exercise 16.5 (Speed typing: connecting to the model). For each scenario in the
above exercise, indicate the corresponding relationship between σy and σμ:
σ_y < σ_μ, σ_y ≈ σ_μ, or σ_y > σ_μ.
Exercise 16.6 (Big words: getting to know the data). Recall from Section 16.7
the Abdul Latif Jameel Poverty Action Lab (J-PAL) study into the effectiveness
of a digital vocabulary learning program, the Big Word Club (BWC) (Kalil et
al., 2020). In our analysis of this program, we'll utilize weakly informative
priors with a baseline understanding that the average student saw 0 change in
their vocabulary scores throughout the program. We'll balance these priors by
the big_word_club data in the bayesrules package. For each student
participant, big_word_club includes a school_id and the percentage
change in vocabulary scores over the course of the study period
(score_pct_change). We keep only the students that participated in the
BWC program (treat == 1), and thus eliminate the control group.
Exercise 16.9 (Big words: global parameters). In this exercise, we'll explore the
global parameters of our BWC model: (μ, σ_y, σ_μ).
Exercise 16.10 (Big words: focusing on schools). Next, let's dig into the
school-specific means, μ_j.
a) Construct and discuss a plot of the 80% posterior credible intervals for
the average percent change in vocabulary score for all schools in the
study, μj.
b) Construct and interpret the 80% posterior credible interval for μ10.
c) Is there ample evidence that, on average, vocabulary scores at School
10 improved by more than 5% throughout the duration of the
vocabulary program? Provide posterior evidence.
Exercise 16.11 (Big words: predicting vocab levels). Suppose we continue the
vocabulary study at each of Schools 6 and 17 (which participated in the current
study) and Bayes Prep, a school which is new to the study. In this exercise
you'll make predictions about Y_new,j, the vocabulary performance of a student
that's new to the study, for each of these three schools j.
c) Using posterior_predict() this time, simulate posterior
predictive models of Y_new,j for each of School 6, School 17, and Bayes
Prep. Illustrate your simulation results using mcmc_areas() and
discuss your findings.
d) Finally, construct, plot, and discuss the 80% posterior prediction
intervals for all schools in the original study.
Exercise 16.13 (Open exercise: voices). The voices data in the bayesrules
package contains the results of an experiment in which subjects participated in
role-playing dialog under various conditions. Of interest to the researchers was
the subjects' voice pitches (in Hz). In this open-ended exercise, complete a
hierarchical analysis of voice pitch, without any predictors.
17
(Normal) Hierarchical Models with Predictors
DOI: 10.1201/9780429288340-17
In Chapter 15 we convinced you that there's a need for hierarchical models. In Chapter 16
we built our first hierarchical model – a Normal hierarchical model of Y with no predictors
X. Here we'll take the next natural step by building a Normal hierarchical regression model
of Y with predictors X. Going full circle, let's return to the Cherry Blossom 10 mile running
race analysis that was featured in Chapter 15. Our goal is to better understand variability in
running times: To what extent do some people run faster than others? How are running
times associated with age, and to what extent does this differ from person to person?
In answering these questions, we'll utilize the cherry_blossom_sample data from the
bayesrules package, shortened to running here. This data records multiple net running
times (in minutes) for each of 36 different runners in their 50s and 60s that entered the 10-
mile race in multiple years.
But it turns out that the running data is missing some net race times. Since functions
such as prediction_summary(), add_fitted_draws(), and
add_predicted_draws() require complete information on each race, we'll omit the
rows with incomplete observations. In doing so, it's important to use na.omit() after
selecting our variables of interest so that we don't throw out observations that have
complete information on these variables just because they have incomplete information on
variables we don't care about.
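To make this concrete, here is a minimal sketch of that data preparation, assuming the cherry_blossom_sample data from bayesrules and the runner, age, and net variables used throughout this chapter (the exact selection is illustrative):

library(bayesrules)
library(dplyr)
data(cherry_blossom_sample)
running <- cherry_blossom_sample %>%
  select(runner, age, net) %>%   # keep only the variables of interest
  na.omit()                      # then drop rows with missing values on these variables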
With multiple observations per runner, this data is hierarchical or grouped. To acknowledge
this grouping structure, let Y_ij denote the net running time and X_ij the age for runner j in
their ith race. Recall from Chapter 15 that there are two flawed approaches which simply
ignore the data's grouped structure: a complete pooling approach ignores the fact that we
have multiple dependent observations on each runner, and a no pooling approach assumes
that data on one runner can't tell us anything about another runner (thus that our data also
can't tell us anything about the general running population). Instead, our analysis of the
relationship between running times and age will combine two Bayesian modeling
paradigms we've developed throughout the book:
regression models of Y by predictors X when our data does not have a group structure
(Chapters 9 through 14); and
hierarchical models of Y which account for group-structured data but do not use
predictors X (Chapter 16).
Goals
Build hierarchical regression models of response variable Y by predictors X.
Evaluate and compare hierarchical and non-hierarchical models.
Use hierarchical models for posterior prediction.
We'll begin by ignoring the data's grouped structure. This isn't the right approach, but it provides a good point of
comparison and a building block to a better model. To this end, the complete pooled
regression model of Y_ij from Chapter 15 assumes an age-specific linear relationship

μ_i = β_0 + β_1 X_ij

with weakly informative priors:
This model depends upon three global parameters: intercept β_0, age coefficient β_1, and
variability from the regression model σ. Our posterior simulation results for these
parameters from Section 15.1, stored in complete_pooled_model, are summarized
below.
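The summary table itself isn't reproduced here. As a hedged sketch, such a summary might be obtained with broom.mixed's tidy() (the 80% level mirrors the credible interval reported below; arguments are illustrative):

library(broom.mixed)
tidy(complete_pooled_model, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)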
The vibe of this complete pooled model is captured by Figure 17.1: it lumps together all
race results and describes the relationship between running time and age by a common
global model. Lumped together in this way, a scatterplot of net running times versus age
exhibits a weak relationship with a posterior median model of 75.2 + 0.268 age.
FIGURE 17.1: A scatterplot of net running times versus age along with the posterior
median model from the complete pooled model.
This posterior median estimate of the age coefficient β_1 suggests that running times tend to
increase by a mere 0.27 minutes for each year in age. Further, with an 80% posterior
credible interval for β1 which straddles 0, (-0.3, 0.84), our complete pooled regression
model suggests there's not a significant relationship between running time and age. Our
intuition (and personal experience) says this is wrong – as adults age they tend to slow
down. This intuition is correct.
17.2 Hierarchical model with varying intercepts
17.2.1 Model building
Chapter 15 revealed that it would indeed be a mistake to stop our runner analysis with the
complete pooled regression model (17.1). Thus, our next goal is to incorporate the data's
grouped structure while maintaining our age predictor X_ij. Essentially, we want to combine
the regression principles from the complete pooled regression model with the grouping
principles from the simple Normal hierarchical model without a predictor from Chapter 16:
Y_ij | μ_j, σ_y ~ N(μ_j, σ_y²)   model of running times WITHIN runner j
μ_j | μ, σ_μ ~ N(μ, σ_μ²)   model of how typical running times vary BETWEEN runners
(17.2)
This hierarchical regression model will build from (17.2) and unfold in three similar layers.
The first layer of (17.2) assumed that, within each runner j, net running times vary Normally
around a runner-specific mean μ_j, independent of their age X_ij:

Y_ij | μ_j, σ_y ~ N(μ_j, σ_y²).
To incorporate information about age into our understanding of running times within
runners, we can replace the μ_j with runner-specific means μ_ij, which depend upon the
runner's age in their ith race, X_ij. There's more than one approach, but we'll start with the
following:

μ_ij = β_0j + β_1 X_ij.
Thus, the first layer of our hierarchical model describes the relationship between net times
and age within each runner j by:

Y_ij | β_0j, β_1, σ_y ~ N(μ_ij, σ_y²) where μ_ij = β_0j + β_1 X_ij.   (17.3)
For each runner j, (17.3) assumes that their running times are Normally distributed around
an age- and runner-specific mean, β_0j + β_1 X_ij, with standard deviation σ_y. This model
depends upon three parameters: β_0j, β_1, and σ_y. Paying special attention to subscripts, only
the intercept β_0j varies by runner:
β_0j = intercept of the regression model for runner j, i.e., runner j's baseline speed;
β_1 = global age coefficient, i.e., the typical change in a runner's net time per one-year
increase in age; and
σ_y = within-group variability around the mean regression model β_0j + β_1 X_ij, and hence
a measure of the strength of the relationship between an individual runner's time and
their age.
Putting this together, the first layer of our hierarchical model (17.3) assumes that
relationships between running time and age randomly vary from runner to runner, having
different intercepts β_0j but a shared age coefficient β_1. Figure 17.2 illustrates these
assumptions and helps us translate them within our context: though some runners tend to be
faster or slower than others, runners experience roughly the same increase in running times
as they age.
FIGURE 17.2: Runner-specific models of running time by age under (17.3). The black dots represent the runner-specific intercepts β_0j, which vary from runner to runner.
Quiz Yourself!
Which of our current model parameters, (β_0j, β_1, σ_y), will we model in the next layer
of our hierarchical model?
Since it's the only regression feature that we're assuming can vary from runner to runner,
the next layer will model variability in the intercept parameters β_0j. It's important to
recognize here that our 36 sample runners are drawn from the same broader population of
runners. Thus, instead of taking a no pooling approach, which assigns each β_0j a unique
prior, and hence assumes that one runner j can't tell us about another, these intercepts
should share a prior. To this end we'll assume that the runner-specific intercept parameters,
and hence baseline running speeds, vary normally around some mean β_0 with standard
deviation σ_0:

β_0j | β_0, σ_0 ~ N(β_0, σ_0²).   (17.4)

Figure 17.2 adds context to this layer of the hierarchical model, which depends upon two
new parameters:
β_0 = the global average intercept across all runners, i.e., the average runner's baseline
speed; and
σ_0 = between-group variability in intercepts β_0j, i.e., the extent to which baseline speeds
vary from runner to runner.
We've now completed the first two layers of our hierarchical regression model which reflect
the relationships of running time and age, within and between runners. As with any
Bayesian model, the last step is to specify our priors.
Quiz Yourself!
For which model parameters must we specify priors in the
hierarchical regression model?
Building from (17.3) and (17.4), the final layer of our hierarchical regression model must
specify priors for the global parameters: β_0, β_1, σ_y, and σ_0. It's these global parameters that
describe the entire population of runners, not just those in our sample. As usual, we'll
utilize independent Normal priors for regression coefficients β_0 and β_1, where our prior
understanding of baseline β_0 is expressed through the centered intercept β_0c. Further, we'll
utilize independent Exponential priors for standard deviation terms σ_y and σ_0. The final
hierarchical regression model thus combines information about the relationship between
running times Y_ij and age X_ij within runners (17.3), with information about how baseline
speeds β_0j vary between runners (17.4), with our prior understanding of the broader global
population of runners. As embodied by Figure 17.2, this particular model is often referred
to as a hierarchical random intercepts model:
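Assembling layers (17.3) and (17.4) with the priors just described, the random intercepts model has the following structure, written here as a sketch in generic notation (the specific prior tunings are developed in the next section):

\begin{aligned}
Y_{ij} \mid \beta_{0j}, \beta_1, \sigma_y &\sim N(\mu_{ij}, \sigma_y^2) \quad \text{with } \mu_{ij} = \beta_{0j} + \beta_1 X_{ij} \\
\beta_{0j} \mid \beta_0, \sigma_0 &\sim N(\beta_0, \sigma_0^2) \\
\beta_{0c} \sim N(m_0, s_0^2), \quad \beta_1 &\sim N(m_1, s_1^2), \quad \sigma_y \sim \text{Exp}(l_y), \quad \sigma_0 \sim \text{Exp}(l_0)
\end{aligned}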
The complete set of model assumptions is summarized below.
Structure of the data: Observations on one group j, Y_ij, are independent of those on another group k, Y_ik. However, different data points within the same
group j are correlated, and within each group the mean running time μ_ij is a linear
function of predictor X_ij.
Equivalently, we can re-express each runner-specific intercept as a tweak to the global intercept,

β_0j = β_0 + b_0j,

where these tweaks b_0j are normal deviations from 0 with standard deviation σ_0:

b_0j ~ N(0, σ_0²).
Consider an example. Suppose that some runner j has a baseline running speed of β_0j = 24
minutes, whereas the average baseline speed across all runners is β_0 = 19 minutes. Thus, at
any age, runner j tends to run 5 minutes slower than average. That is, b_0j = 5 and

β_0j = β_0 + b_0j = 19 + 5 = 24.
In general, then, we can reframe Layers 1 and 2 of our hierarchical model (17.5) as follows:
The typical runner in this age group runs somewhere between an 8-minute mile and a
12-minute mile during a 10-mile race, and thus has a net time somewhere between 80
and 120 minutes for the entire race. As such we'll set the prior model for the centered
global intercept to β_0c ~ N(100, 10²). (This centered intercept is much easier to think
about than the raw intercept β_0, the typical net time for a 0-year-old runner!)
We're pretty sure that the typical runner's net time in the 10-mile race will, on average,
increase over time. We're not very sure about the rate of this increase, but think it's
likely between 0.5 and 4.5 minutes per year. Thus, we'll set our prior for the global age
coefficient to β_1 ~ N(2.5, 1²).
Beyond the typical net time for the typical runner, we do not have a clear prior
understanding of the variability between runners (σ0), nor of the degree to which a
runner's net times might fluctuate from their regression trend (σ_y). Thus, we'll utilize
weakly informative priors on these standard deviation parameters.
Our final tuning of the hierarchical random intercepts model follows, where the priors on σ_y
and σ0 are assigned by the stan_glmer() simulation below:
To get a sense for the combined meaning of our prior models, we simulate 20,000 prior
parameter sets using stan_glmer() with the following special arguments:
We specify the model of net times by age by the formula net ~ age + (1 |
runner). This essentially combines a non-hierarchical regression formula (net ~ age)
with that for a hierarchical model with no predictor (net ~ (1 | runner)).
We specify prior_PD = TRUE to indicate that we wish to simulate parameters from
the prior, not posterior, models.
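As a hedged sketch of this prior simulation (the Normal prior tunings follow the discussion above, the σ_y and σ_0 priors are left to stan_glmer()'s weakly informative defaults, and the chains, iterations, and seed are illustrative choices that yield 20,000 draws):

library(rstanarm)
running_model_1_prior <- stan_glmer(
  net ~ age + (1 | runner),
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1),
  prior_PD = TRUE,
  chains = 4, iter = 5000*2, seed = 84735)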
The simulation results describe 20,000 prior plausible scenarios for the relationship
between running time and age, within and between our 36 runners. Though we encourage
you to plot many more on your own, we show just 4 prior plausible scenarios of what the
mean regression models, β_0j + β_1 X, might look like for our 36 runners (Figure 17.3, left).
The variety across these prior scenarios reflects our general uncertainty about running.
Though each scenario is consistent with our sense that runners likely slow down over time,
the rate of increase ranges quite a bit. Further, in examining the distances between the
runner-specific regression lines, some scenarios reflect a plausibility that there's little
difference between runners, whereas others suggest that some runners might be much faster
than others.
Finally, we also simulate 100 datasets of race outcomes from the prior model, across a
variety of runners and ages. The 100 density plots in Figure 17.3 (right) reflect the
distribution of the net times in these simulated datasets. There is, again, quite a range in
these simulations. Though some span ridiculous outcomes (e.g., negative net running
times), the variety in the simulations and the general set of values they cover adequately
reflect our prior understanding and uncertainty. For example, since a 25- to 30-minute mile
is a good walking (not running) pace, the upper values near 250-300 minutes for the entire
10-mile race seem reasonable.
FIGURE 17.3: Simulated scenarios under the prior models of the hierarchical regression
model (17.7). At left are 4 prior plausible sets of 36 runner-specific relationships between
running time and age, β_0j + β_1 X. At right are density plots of 100 prior plausible sets of
net running time data.
Combining our prior understanding with this data, we take a syntactical shortcut to simulate
the posterior random intercepts model (17.5) of net times by age: we update() the
running_model_1_prior with prior_PD = FALSE. We encourage you to follow
this up with a check of the prior tunings as well as some Markov chain diagnostics:
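A minimal sketch of that shortcut, along with a few standard follow-up checks (prior_summary() from rstanarm, the rest via bayesplot; details are illustrative):

running_model_1 <- update(running_model_1_prior, prior_PD = FALSE)

library(bayesplot)
prior_summary(running_model_1)   # confirm the prior tunings
mcmc_trace(running_model_1)      # trace plots of the chains
rhat(running_model_1)            # R-hat values near 1 suggest stability
neff_ratio(running_model_1)      # effective sample size ratios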
We'll keep our focus on the big themes here, first those related to the relationship between
running time and age for the typical runner, and then those related to the variability from
this average.
β_0 + β_1 X.
Posterior summaries for β0 and β1, which are fixed across runners, are shown below.
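A hedged sketch of how these summaries might be obtained (illustrative):

tidy(running_model_1, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)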
Accordingly, there's an 80% chance that the typical runner tends to slow down somewhere
between 1.02 and 1.58 minutes per year. The fact that this range is entirely and comfortably
above 0 provides significant evidence that the typical runner tends to slow down with age.
This assertion is visually supported by the 200 posterior plausible global model lines below,
superimposed with their posterior median, all of which exhibit positive associations
between running time and age. In plotting these model lines, note that we use
add_fitted_draws() with re_formula = NA to specify that we are interested in
the global, not group-specific, model of running times:
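A hedged sketch of that plot (tidybayes and ggplot2; the choice of 200 draws follows the text, other details are illustrative):

library(tidybayes)
library(ggplot2)
running %>%
  add_fitted_draws(running_model_1, n = 200, re_formula = NA) %>%
  ggplot(aes(x = age, y = net)) +
  geom_line(aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = running, size = 0.5)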
FIGURE 17.5: 200 posterior plausible global model lines, β0 + β1 X , for the relationship
between running time and age.
Don't let the details distract you from the important punchline here! By incorporating the
group structure of our data, our hierarchical random intercepts model has what the
complete pooled model (17.1) lacked: the power to detect a significant relationship between
running time and age. Our discussion in Chapter 15 revealed why this happens in our
running analysis: pooling all runners' data together masks the fact that most individual
runners slow down with age.
β_0j + β_1 X_ij = (β_0 + b_0j) + β_1 X_ij.
We'll do so by combining what we learned about the global age parameter β_1 above with
information on the runner-specific intercept terms β_0j. The latter will require the
specialized syntax we built up in Chapter 16, and thus some patience. First, the
b[(Intercept) runner:j] chains correspond to the difference in the runner-specific
and global intercepts, b_0j. Thus, we obtain MCMC chains for each β_0j = β_0 + b_0j by adding
the b[(Intercept) runner:j] chains to the global (Intercept) chain.
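A minimal sketch of that calculation with tidybayes (the b[(Intercept) runner:j] naming follows rstanarm conventions; the derived column name is illustrative):

runner_chains_1 <- running_model_1 %>%
  spread_draws(`(Intercept)`, b[,runner]) %>%
  mutate(runner_intercept = `(Intercept)` + b)   # beta_0j = beta_0 + b_0j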
These observations are echoed in the plots below, which display 100 posterior plausible
models of net time by age for runners 4 and 5:
FIGURE 17.6: 100 posterior plausible models of running time by age, β_0j + β_1 X, for
subjects j ∈ {4, 5}.
We can similarly explore the models for all 36 runners, β_0j + β_1 X_ij. For a quick
comparison, the runner-specific posterior median models are plotted below and
superimposed with the posterior median global model, β_0 + β_1 X_ij. This drives home the
point that the global model represents the relationship between running time and age for the
most average runner. The individual runner models vary around this global average, some
with faster baseline speeds (β_0j < β_0) and some with slower baseline speeds (β_0j > β_0).
FIGURE 17.7: The posterior median models for our 36 runners j as calculated from the
hierarchical random intercepts model (gray), with the posterior median global model
(blue).
FIGURE 17.8: Simulated output for the relationship between response variable Y and
predictor X when σ_y < σ_0 (a) and σ_y > σ_0 (b).
Posterior tidy() summaries for our variance parameters suggest that the running analysis
is more like scenario (a) than scenario (b). For a given runner j, we estimate that their
observed running time at any age will deviate from their mean regression model by roughly
5.25 minutes (σy). By the authors' assessment (none of us professional runners!), this
deviation is rather small in the context of a long 10-mile race, suggesting a rather strong
relationship between running times and age within runners. In contrast, we expect that
baseline speeds vary by roughly 13.3 minutes from runner to runner (σ0).
Comparatively then, the posterior results suggest that σ_y < σ_0 – there's greater variability
in the models between runners than variability from the model within runners. Think about
this another way. As with the simple hierarchical model in Section 16.3.3, we can
decompose the total variability in race times across all runners and races into that explained
by the variability between runners and that explained by the variability within each runner
(16.8):
Var(Y_ij) = σ_0² + σ_y².
Thus, proportionally (16.9), differences between runners account for roughly 86.62% (the
majority!) of the total variability in racing times, with fluctuations among individual races
within runners explaining the other 13.38%:
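As a rough check of these proportions using the rounded posterior medians reported above (the percentages in the text use unrounded estimates):

sd_0 <- 13.3                  # between-runner standard deviation
sd_y <- 5.25                  # within-runner standard deviation
sd_0^2 / (sd_0^2 + sd_y^2)    # roughly 0.87: share of variability between runners
sd_y^2 / (sd_0^2 + sd_y^2)    # roughly 0.13: share of variability within runners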
A snapshot of the observed trends for all 36 runners provides a more complete picture of
just how much the change in net time with age might vary by runner:
FIGURE 17.10: Observed trends in running time versus age for the 36 subjects (gray)
along with the posterior median model (blue).
Quiz Yourself!
How can we modify the random intercepts model (17.5) to recognize that the rate at
which running time changes with age might vary from runner to runner?
In addition to the runner-specific intercepts β_0j, we can allow the age coefficients to vary by
runner, replacing the global β_1 with runner-specific coefficients β_1j. Thus, the model of the
relationship between running time and age within each runner j becomes:

Y_ij | β_0j, β_1j, σ_y ~ N(μ_ij, σ_y²) where μ_ij = β_0j + β_1j X_ij.   (17.8)

As before, we assume that the runner-specific intercepts β_0j vary normally around a global
intercept β_0 with standard deviation σ_0, and that the runner-specific age coefficients β_1j
vary normally around a global age coefficient β_1 with standard deviation σ_1:

β_0j | β_0, σ_0 ~ N(β_0, σ_0²) and β_1j | β_1, σ_1 ~ N(β_1, σ_1²).   (17.9)

But these priors aren't yet complete – β_0j and β_1j work together to describe the model for
runner j, and thus are correlated. Let ρ ∈ [−1, 1] represent the correlation between β_0j and
β_1j. To capture this correlation, we model β_0j and β_1j by the joint Normal model

(β_0j, β_1j) | β_0, β_1, σ_0, σ_1 ~ N((β_0, β_1), Σ)   (17.10)

where (β_0, β_1) is the joint mean and Σ is the 2x2 covariance matrix which encodes the
variability and correlation among β_0j and β_1j:

Σ = ( σ_0²       ρσ_0σ_1
      ρσ_0σ_1    σ_1²   ).   (17.11)

Though this notation looks overwhelming, it simply indicates that β_0j and β_1j are both
marginally Normal (17.9) and have correlation ρ. The correlation ρ between the runner-
specific intercepts and slopes, β_0j and β_1j, isn't just a tedious mathematical detail. It's an
interesting feature of the hierarchical model! Figure 17.11 provides some insight. Plot (a)
illustrates the scenario in which there's a strong negative correlation between β_0j and β_1j –
models that start out lower (with small β_0j) tend to increase at a more rapid rate (with
higher β_1j). In plot (c) there's a strong positive correlation between β_0j and β_1j – models
that start out higher (with larger β_0j) also tend to increase at a more rapid rate (with higher
β_1j). In between these two extremes, plot (b) illustrates the scenario in which there's no
correlation between β_0j and β_1j.
FIGURE 17.11: Simulated output for the relationship between response variable Y and
predictor X when ρ < 0 (a), ρ = 0 (b), and ρ > 0 (c).
Consider the implications of this correlation in the context of our running analysis.1
Quiz Yourself!
1. What would it mean for β_0j and β_1j to be negatively
correlated?
a. Runners that start out slower (i.e., with a higher baseline), also tend to slow
down at a more rapid rate.
b. The rate at which runners slow down over time isn't associated with how fast
they start out.
c. Runners that start out faster (i.e., with a lower baseline), tend to slow down at a
more rapid rate.
2. Similarly, what would it mean for β_0j and β_1j to be positively correlated?
The completed hierarchical model (17.12) pulls together (17.8) and (17.10) with priors for
the global parameters. For reasons you might imagine, this is often referred to as a
hierarchical random intercepts and slopes model:
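Written out as a sketch, the random intercepts and slopes model combines (17.8), (17.10), the tuned Normal priors from the earlier analysis, and weakly informative priors on the remaining pieces:

\begin{aligned}
Y_{ij} \mid \beta_{0j}, \beta_{1j}, \sigma_y &\sim N(\mu_{ij}, \sigma_y^2) \quad \text{with } \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij} \\
(\beta_{0j}, \beta_{1j}) \mid \beta_0, \beta_1, \sigma_0, \sigma_1 &\sim N\big((\beta_0, \beta_1), \Sigma\big) \\
\beta_{0c} \sim N(100, 10^2), \quad \beta_1 &\sim N(2.5, 1^2), \quad \sigma_y \sim \text{Exp}(l_y), \quad \Sigma \sim \text{(decomposition of covariance)}
\end{aligned}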
Equivalently, we can re-express the random intercepts and slopes as random tweaks to the
global intercept and slope, μ_ij = (β_0 + b_0j) + (β_1 + b_1j) X_ij, with

(b_0j, b_1j) | σ_0, σ_1 ~ N((0, 0), Σ).
When the group-specific age coefficients do not differ from group to group, these two
models are equivalent.
Most of the pieces in this model are familiar. For global parameters β_0 and β_1 we use the
tuned Normal priors from (17.7). For σ_y we use a weakly informative prior. Yet there is one
big new piece. We need a joint prior model to express our understanding of how the
combined σ_0, σ_1, and ρ parameters define covariance matrix Σ (17.11). In rstanarm, this
joint prior is specified through a decomposition of covariance which separately expresses our
understanding of three pieces:
1. the correlation between the group-specific intercepts and slopes, ρ (Figure 17.11);
2. the combined degree to which the intercepts and slopes vary by group, σ_0² + σ_1²; and²
3. the relative proportion of this variability between groups that's due to differing
intercepts vs differing slopes,

π_0 = σ_0² / (σ_0² + σ_1²) vs π_1 = σ_1² / (σ_0² + σ_1²).
Figure 17.12 provides some context on this third piece, displaying a few scenarios for the
relationship between π_0 and π_1. In general, π_0 and π_1 always sum to 1, and thus have a push-
and-pull relationship. For example, when π_0 ≈ 1 and π_1 ≈ 0, nearly all of the variability
between group models is explained by differing intercepts (plot a); when π_0 ≈ 0 and π_1 ≈ 1,
nearly all of the variability between group models is explained by differences in slopes (plot c).
In between these extremes, when π_0 and π_1 are both approximately 0.5, roughly half of the
variability between groups can be explained by differing intercepts and the other half by
differing slopes (plot b).
FIGURE 17.12: Simulated output for the relationship between response variable Y and
predictor X when π_0 = 1 and π_1 = 0 (a), π_0 = π_1 = 0.5 (b), and π_0 = 0 and π_1 = 1 (c).
In our analysis, we'll utilize the weakly informative default setting for the hierarchical
random intercepts and slopes model: decov(reg = 1, conc = 1, shape = 1,
scale = 1) in rstanarm notation. This makes the following prior assumptions regarding
the three pieces above:
We'll utilize these default assumptions for the covariance prior in this book. Beyond the
defaults, specifying and tuning the decomposition of covariance prior requires the
acquisition of two new probability models. We present more optional detail in the next
section and refer the curious reader to Gabry and Goodrich (2020a) for a more
mathematical treatment that scales up to models beyond those considered here.
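For reference, a hedged sketch of how the random intercepts and slopes model might be simulated: the (age | runner) formula term (rather than (1 | runner)) requests runner-specific intercepts and age coefficients, and decov() specifies the default covariance prior described above (other settings are illustrative):

running_model_2 <- stan_glmer(
  net ~ age + (age | runner),
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)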
_________________________
2 Technically, this is the combined variability in group-specific intercepts and slopes when assuming they are
uncorrelated.
The decomposition of covariance prior re-expresses Σ through three separate pieces by
which it's defined. These pieces are numbered in accordance with their corresponding
interpretations above:

R = ( 1  ρ
      ρ  1 )   (1)

τ = √(σ_0² + σ_1²)   (2)

π = (π_0, π_1) = ( σ_0² / (σ_0² + σ_1²) , σ_1² / (σ_0² + σ_1²) )   (3)

We can decompose Σ into a product which depends on R, τ, and π. If you know some linear
algebra, you can confirm this result, though the fact that we can rewrite Σ is what's
important here:

Σ = ( σ_0²      ρσ_0σ_1      = ( σ_0  0 ) ( 1  ρ ) ( σ_0  0 )  = diag(σ_0, σ_1) R diag(σ_0, σ_1)
      ρσ_0σ_1   σ_1²    )      ( 0   σ_1 ) ( ρ  1 ) ( 0   σ_1 )

where

(σ_0, σ_1) = τ √π.

And since we can rewrite Σ using R, τ, and π, we can also express our prior understanding of
Σ by our combined prior understanding of these three pieces. This joint prior, which we
simply expressed above by Σ ~ (decomposition of covariance), places independent priors on
each piece:

R ~ LKJ(η)
τ ~ Gamma(s, r)
π ~ Dirichlet(2, δ)
Let's begin with the “LKJ” prior model on the correlation matrix R with regularization
hyperparameter η > 0. In our model (17.12), R depends only on the correlation ρ between
the group-specific intercepts β_0j and slopes β_1j. Thus, the LKJ prior model simplifies to a
univariate prior model on ρ alone.
Figure 17.13 displays the LKJ pdf under a variety of regularization parameters η,
illustrating the important comparison of η to 1:
Setting η < 1 indicates a prior understanding that the group-specific intercepts and
slopes are most likely strongly correlated, though we're not sure if this correlation is
negative or positive.
When η = 1, the LKJ model is uniform from -1 to 1, indicating that the correlation
between the intercepts and slopes is equally likely to be anywhere in this range – we're
not really sure.
Setting η > 1 indicates a prior understanding that the group-specific intercepts and
slopes are most likely weakly correlated (ρ ≈ 0). The greater η is, the tighter and tighter
the LKJ hugs values of ρ near 0.
FIGURE 17.13: The LKJ pdf under three regularization parameters.
Next, for the total standard deviation in the intercepts and slopes, τ = √(σ_0² + σ_1²), we
utilize the “usual” Gamma prior (or its Exponential special case). Finally, consider the prior
for the π_0 and π_1 parameters. Recall that π_0 and π_1 measure the relative proportion of the
variability between groups that's due to differing intercepts vs differing slopes,
respectively. Thus, π_0 and π_1 are both restricted to values between 0 and 1 and must sum to
1:
π_0 + π_1 = σ_0² / (σ_0² + σ_1²) + σ_1² / (σ_0² + σ_1²) = 1.
These properties are preserved by a symmetric Dirichlet prior model with concentration
hyperparameter δ > 0 and pdf

f(π_0, π_1) = [Γ(2δ) / (Γ(δ)Γ(δ))] (π_0 π_1)^(δ−1)   for π_0, π_1 ∈ [0, 1] and π_0 + π_1 = 1.
In fact, in the special case when we have only two group-specific parameters, β_0j and β_1j,
the symmetric Dirichlet model for (π_0, π_1) is equivalently expressed by:

π_0 ~ Beta(δ, δ) and π_1 = 1 − π_0.
Figure 17.14 displays the marginal symmetric Dirichlet pdf, i.e., Beta pdf, for π0 under a
variety of concentration parameters δ, illustrating the important comparison of δ to 1:
Setting δ < 1 places more prior weight on proportions π_0 near 0 or 1. This indicates a
prior understanding that relatively little (π_0 ≈ 0) or relatively much (π_0 ≈ 1) of the
variability between groups is explained by differing intercepts.
FIGURE 17.14: The marginal symmetric Dirichlet pdf under three concentration
parameters.
When our models have both group-specific intercepts and slopes, we'll use the following
default decomposition of covariance priors, which indicate general uncertainty about the
correlation between group-specific intercepts and slopes, the overall variability in group-
specific models, and the relative degree to which this variability is explained by differing
intercepts vs differing slopes:

R ~ LKJ(1)
τ ~ Gamma(1, 1)
π ~ Dirichlet(2, 1)
Our posterior simulation of this model produces chains for 36 runner-specific intercepts β_0j, 36 runner-
specific age coefficients β_1j, and 6 global parameters (β_0, β_1, σ_y, σ_0, σ_1, ρ). Let's examine
these piece by piece, starting with the global model of the relationship between running
time and age,

β_0 + β_1 X.
The results here for the random intercepts and slopes model (17.12) are quite similar to
those for the random intercepts model (17.5): the posterior median model is 18.5 + 1.32
age.
Since the global mean model β_0 + β_1 X captures the relationship between running time and
age for the average runner, we shouldn't be surprised that our two hierarchical models
produced similar assessments. Where these two models start to differ is in their
assessments of the runner-specific relationships. Obtaining the MCMC chains for the
runner-specific intercepts and slopes gets quite technical. We encourage you to pick through
the code below, line by line. Here are some important details to pick up on:
spread_draws() uses b[term, runner] to grab the chains for all runner-
specific parameters. As usual now, these chains correspond to b_0j and b_1j, the
differences between the runner-specific vs global intercepts and age coefficients.
pivot_wider() creates separate columns for each of the b_0j and b_1j chains, which can
then be combined with the global intercept and age chains.
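The original code chunk isn't reproduced here; the following is a hedged sketch of what it might look like (the names_glue pattern and derived column names are illustrative):

library(tidyr)
runner_chains_2 <- running_model_2 %>%
  spread_draws(`(Intercept)`, age, b[term, runner]) %>%
  pivot_wider(names_from = term, values_from = b,
              names_glue = "b_{term}") %>%
  mutate(runner_intercept = `(Intercept)` + `b_(Intercept)`,   # beta_0j
         runner_age = age + b_age)                             # beta_1j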
From these chains, we can obtain the posterior medians for each runner-specific intercept
and age coefficient. Since we're only obtaining posterior medians here, we use
summarize() in combination with group_by() instead of using the median_qi()
function:
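A minimal sketch of that summary step (column names carried over from the previous chunk; illustrative):

runner_summaries_2 <- runner_chains_2 %>%
  group_by(runner) %>%
  summarize(runner_intercept = median(runner_intercept),
            runner_age = median(runner_age))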
Figure 17.15 plots the posterior median models for all 36 runners.
FIGURE 17.15: The posterior median models for the 36 runners, as calculated from the
hierarchical random intercepts and slopes model.
Hmph. Are you surprised? We were slightly surprised. The slopes do differ, but not as
drastically as we expected. But then we remembered – shrinkage! Consider sample runners
1 and 10. Their posteriors suggest that, on average, runner 10's running time increases by
just 1.06 minutes per year, whereas runner 1's increases by 1.75 minutes per year:
FIGURE 17.16: For runners 1 and 10, the posterior median relationships between running
time and age from the hierarchical random intercepts and slopes model (dashed) are
contrasted by the observed no pooled models (blue) and the complete pooled model (black).
Figure 17.16 contrasts these posterior median models for runners 1 and 10 (dashed lines) by
the complete pooled posterior models (black) and no pooled posterior models (blue). As
usual, the hierarchical models strike a balance between these two extremes. Like the no
pooled models, the hierarchical models do vary between the two runners. Yet the difference
is not as stark. The hierarchical models are drawn away from the no pooled models and
toward the complete pooled models. Though this shrinkage is subtle for runner 10, the
association between running time and age switches from negative to positive for runner 1.
This is to be expected. Unlike the no pooled approach, which models runner-specific
relationships using only runner-specific data, our hierarchical model assumes that one
runner's behavior can tell us about another's. Further, we have very few data points on each
runner – at most 7 races. With so few observations, the other runners' information has
ample influence on our posterior understanding for any one individual (as it should). In the
case of runner 1, the other 35 runners' data is enough to make us think that this runner, too,
will eventually slow down.
Posterior summaries of the remaining global parameters reveal a few more insights:
The standard deviation σ_1 in the age coefficients β_1j is likely around 0.251 minutes per
year. On the scale of a 10-mile race, this indicates very little variability between the
runners when it comes to the rate at which running times change with age.
Per the output for σ_y, an individual runner's net times tend to deviate from their own
mean model by roughly 5.17 minutes.
There's a weak negative correlation of roughly -0.0955 between the runner-specific β_0j
and β_1j parameters. Thus, it seems that, ever so slightly, runners that start off faster tend
to slow down at a more rapid rate.
So which one should we use? To answer this question, we can compare our three models
using the framework of Chapters 10 and 11, and asking these questions: (1) How fair is
each model? (2) How wrong is each model? (3) How accurate are each model's posterior
predictions? Consider question (1). The context and data collection procedure is the same
for each model. Since the data has been anonymized and runners are aware that race results
will be public, we think this data collection process is fair. Further, though the models
produce slightly different conclusions about the relationship between running time and age
(e.g., the hierarchical models conclude this relationship is significant), none of these
conclusions seem poised to have a negative impact on society or individuals. Thus, our
three models are equally fair.
Next, consider question (2). Posterior predictive checks suggest that the complete pooled
model comparatively underestimates the variability in running times – datasets of running
time simulated from the complete pooled posterior tend to exhibit a slightly narrower range
than the running times we actually observed. Thus, the complete pooled model is more
wrong than the hierarchical models.
FIGURE 17.18: Posterior predictive checks of the complete pooled model (left), random
intercepts model (middle), and random intercepts and slopes model (right).
In fact, we know that the complete pooled model is wrong. By ignoring the data's grouped
structure, it incorrectly assumes that each race observation is independent of the others.
Depending upon the trade-offs, we might live with this wrong but simplifying assumption
in some analyses. Yet at least two signs point to this being a mistake for our running
analysis.
1. The complete pooled model isn't powerful enough to detect the significant
relationship between running time and age.
2. Not only have we seen visual evidence that some runners tend to be significantly
faster or slower than others, the posterior prediction summaries in Section 17.2.4
suggest that there's significant variability between runners (σ_0).
In light of this discussion, let's drop the complete pooled model from consideration. In
choosing between running_model_1 and running_model_2, consider question (3):
what's the predictive accuracy of these models? Recall some approaches to answering this
question from Chapter 11: posterior prediction summaries and ELPD. To begin, we use the
prediction_summary() function from the bayesrules package to compare how well
these two models predict the running outcomes of the 36 runners that were part of our
sample.
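A hedged sketch of that comparison using prediction_summary() from bayesrules (the seed is illustrative):

set.seed(84735)
prediction_summary(model = running_model_1, data = running)
prediction_summary(model = running_model_2, data = running)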
Finally, consider one last comparison of our two hierarchical models: the cross-validated
expected log-predictive densities (ELPD). The estimated ELPD for running_model_1 is
lower (worse) than, though within two standard errors of, the running_model_2 ELPD.
Hence, by this metric, there is not a signi cant difference in the posterior predictive
accuracy of our two hierarchical models.
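A minimal sketch of how such an ELPD comparison might be run with the loo package (illustrative):

elpd_1 <- loo(running_model_1)
elpd_2 <- loo(running_model_2)
loo_compare(elpd_1, elpd_2)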
After reflecting upon our model evaluation, we're ready to make a final determination: we
choose running_model_1. The choice of running_model_1 over the
complete_pooled_model was pretty clear: the latter was wrong and didn't have the
power to detect a relationship between running time and age. The choice of
running_model_1 over running_model_2 comes down to this: the complexity
introduced by the additional random age coefficients in running_model_2 produced
little apparent change or benefit. Thus, the additional complexity simply isn't worth it (at
least not to us).
17.5 Posterior prediction
Finally, let's use our preferred model, running_model_1, to make some posterior
predictions. Suppose we want to predict the running time that three different runners will
achieve when they're 61 years old: runner 1, runner 10, and Miles. Though Miles' running
prowess is a mystery, we observed runners 1 and 10 in our sample. Should their trends
continue, we expect that runner 10's time will be slower than that of runner 1 when they're
both 61:
FIGURE 17.19: The observed net running times by age for runners 1 and 10.
In general, let Y_new,j denote a new observation on an observed runner j, specifically runner
j's running time at age 61. As in Chapter 16, we can approximate the posterior predictive
model for Y_new,j by simulating a prediction from the first layer of (17.3), that which
describes the variability in race times Y_ij, evaluated at each of the 20,000 posterior
parameter sets:

Y_new,j^(i) | β_0j^(i), β_1^(i), σ_y^(i) ~ N(μ^(i), (σ_y^(i))²) where μ^(i) = β_0j^(i) + β_1^(i) · 61.
The resulting posterior predictive model will reflect two sources of uncertainty in runner j's
race time: the within-group sampling variability σ_y (we can't perfectly predict runner j's
time from their mean model); and posterior variability in β_0j, β_1, and σ_y (the parameters
defining runner j's relationship between running time and age are unknown and random).
Since we don't have any data on the baseline speed for our new runner, Miles, there's a third
source of uncertainty in predicting his race time: between-group sampling variability σ_0
(baseline speeds vary from runner to runner). Though we recommend doing these
simulations “by hand” to connect with the concepts of posterior prediction (as we did in
Chapter 16), we'll use the posterior_predict() shortcut function to simulate the
posterior predictive models for our three runners:
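A hedged sketch of that shortcut (the runner labels and the new "Miles" level are illustrative; rstanarm simulates fresh runner-specific parameters for group levels it hasn't seen):

set.seed(84735)
predict_61 <- posterior_predict(
  running_model_1,
  newdata = data.frame(runner = c("1", "10", "Miles"),
                       age = c(61, 61, 61)))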
These posterior predictive models are plotted in Figure 17.20. As anticipated from their
previous trends, our posterior expectation is that runner 10 will have a slower time than
runner 1 when they're 61 years old. Our posterior predictive model for Miles' net time is
somewhere in between these two extremes. The posterior median prediction is just under
100 minutes, similar to what we'd get if we plugged an age of 61 into the global posterior
median model for the average runner:
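As a rough check, plugging an age of 61 into the similar posterior median global model reported earlier, 18.5 + 1.32 age, gives 18.5 + 1.32 · 61 ≈ 99 minutes, consistent with a posterior median prediction just under 100 minutes for a brand-new runner.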
That is, without any information about Miles, our default assumption is that he's an average
runner. Our uncertainty in this assumption is re ected by the relatively wide posterior
predictive model. Naturally, having observed data on runners 1 and 10, we're more certain
about how fast they will be when they're 61. But Miles is a wild card – he could be really
fast or really slow!
FIGURE 17.20: Posterior predictive models for the net running times at age 61 for sample
runners 1 and 10, as well as Miles, a runner that wasn't in our original sample.
17.6 Details: Longitudinal data
The running data on net times by age is longitudinal. We observe each runner over
time, where this time (or aging) is of primary interest. Though our hierarchical models of
this relationship account for the correlation in running times within any runner, they make a
simplifying assumption about this correlation: it's the same across all ages. In contrast, you
might imagine that observations at one age are more strongly correlated with observations
at similar ages. For example, a runner's net time at age 60 is likely more strongly correlated
with their net time at age 59 than at age 50. It's beyond the scope of this book, but we can
adjust the structure of our hierarchical models to account for a longitudinal correlation
structure. We encourage the interested reader to check out the bayeslongitudinal R package
(Carreño and Cuervo, 2017) and the foundational paper by Laird and Ware (1982).
Figure 17.21 illustrates both some global and artist-specific patterns in the relationships
among these variables:
FIGURE 17.21: The relationship of danceability with genre (left) and valence (middle) for
individual songs. The relationship of danceability with valence by artist (right).
When pooling all songs together, notice that rock songs tend to be the least danceable and
rap songs the most (by a slight margin). Further, danceability seems to have a weak but
positive association with valence. Makes sense. Sad songs tend not to inspire dance. Taking
this all with a grain of salt, recall that the global models might mask what's actually going
on. To this end, the artist-specific models in the final plot paint a more detailed picture. The
two key themes to pick up on here are that (1) some artists' songs tend to be more danceable
than others, and (2) the association between danceability and valence might differ among
artists, though it's typically positive.
To model these relationships, let's define some notation. For song i within artist j, let Y_ij
denote danceability and X_ij1 denote valence. Next, note that there are 6 different genres,
edm being the baseline or reference. Thus, we must define 5 additional predictors, one for
each non-edm genre. Specifically, let (X_ij2, X_ij3, …, X_ij6) be indicators of whether a song
falls in the latin, pop, r & b, rap, and rock genres, respectively. For example,

X_ij2 = 1 if the song is in the latin genre, and 0 otherwise.
Thus, for an edm song, all genre indicators are 0. We'll consider two possible models of
danceability by valence and genre. The first layer of Model 1 assumes that

Y_ij | (β_0j, β_1, β_2, …, β_6, σ_y) ~ N(μ_ij, σ_y²) with μ_ij = β_0j + β_1 X_ij1 + β_2 X_ij2 + ⋯ + β_6 X_ij6.
The global coefficients (β_1, β_2, …, β_6) reflect an assumption that the relationships between
danceability, valence, and genre are similar for each artist. Yet the artist-specific intercepts
β_0j assume that, when holding constant a song's valence and genre, some artists' songs tend
to be more danceable than other artists' songs.
The first layer of Model 2 incorporates additional artist-specific valence coefficients β_1j,
assuming

Y_ij | (β_0j, β_1j, β_2, …, β_6, σ_y) ~ N(μ_ij, σ_y²) with μ_ij = β_0j + β_1j X_ij1 + β_2 X_ij2 + ⋯ + β_6 X_ij6.
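A hedged sketch of how these two models might be simulated, assuming the spotify data in the bayesrules package with columns danceability, valence, genre, and artist (the column names are assumptions here; priors are left at stan_glmer()'s weakly informative defaults):

spotify_model_1 <- stan_glmer(
  danceability ~ valence + genre + (1 | artist),
  data = spotify, family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735)

spotify_model_2 <- stan_glmer(
  danceability ~ valence + genre + (valence | artist),
  data = spotify, family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735)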
Posterior predictive checks of these two models are similar, both models producing
posterior simulated datasets of song danceability that are consistent with the main features
in the original song data.
FIGURE 17.22: Posterior predictive checks of Spotify Model 1 (left) and Model 2 (right).
Yet upon a comparison of their ELPDs, we think it's best to go forward with Model 1. The
quality of the two models does not significantly differ, and Model 1 is substantially simpler.
Without more data per artist, it's difficult to know if the artist-specific valence coefficients
are insignificant due to the fact that the relationship between danceability and valence
doesn't vary by artist, or if we simply don't have enough data per artist to determine that it
does.
Digging into Model 1, first consider posterior summaries for the global model parameters
(β_0, β_1, …, β_6). For the average artist in any genre, we'd expect danceability to increase by
between 2.16 and 3 points for every 10-point increase on the valence scale – a statistically
significant but fairly marginal bump. Among genres, it appears that when controlling for
valence, only rock is significantly less danceable than edm. Its 80% credible interval is
the only one to lie entirely above or below 0, suggesting that for the average artist, the
typical danceability of a rock song is between 1.42 and 12 points lower than that of an
edm song with the same valence.
In interpreting these summaries, keep in mind that the genre coefficients directly compare
each genre to edm alone and not, say, rock to r & b. In contrast, mcmc_areas() offers
a useful visual comparison of all genre posteriors. Other than the rock coefficient, 0 is a
fairly posterior plausible value for the other genre coefficients, reaffirming that these genres
aren't significantly more or less danceable than edm. There's also quite a bit of overlap in
the posteriors. As such, though there's evidence that some of these genres are more
danceable than others (e.g., rap vs r & b), the difference isn't substantial.
FIGURE 17.23: Posterior models for the genre-related coefficients in Spotify Model 1.
Finally, consider some posterior results for two artists in our sample: Missy Elliott and
Camilo. The below tidy() summary compares the typical danceability levels of these
artists to the average artist, b_0j = β_0j − β_0. When controlling for valence and genre,
Elliott's songs tend to be significantly more danceable than the average artist's, whereas
Camilo's tend to be less danceable. For example, there's an 80% posterior chance that
Elliott's typical song danceability is between 2.78 and 14.8 points higher than average.
To predict the danceability of their next songs, our hierarchical regression model takes into
consideration the artists' typical danceability levels as well as the song's valence and genre.
Suppose that both artists' next songs have a valence score of 80, but true to their genres,
Elliott's is a rap song and Camilo's is in the Latin genre. Figure 17.24 plots both artists'
posterior predictive models along with that of Mohsen Beats, a rock artist that wasn't in
our sample but also releases a song with a valence level of 80. As we'd expect, the
danceability of Elliott's song is likely to be the highest among these three. Further, though
Camilo's typical danceability is lower than average, we expect Mohsen Beats's song to be
even less danceable since it's of the least danceable genre.
FIGURE 17.24: Posterior predictive models for the danceability of the next songs by Missy
Elliott, Mohsen Beats, and Camilo.
17.8 Chapter summary
In Chapter 17, let (Y_ij, X_ij) denote the ith set of observed data on response variable Y and p different predictors X
within group j. Then a Normal hierarchical regression model of Y vs X consists of three layers: it
combines information about the relationship between Y and X within groups, with
information about how these relationships vary between groups, with our prior
understanding of the broader global population. Letting β_j denote a set of group-specific
parameters and (β, σ_y, σ) a set of global parameters:

Y_ij | β_j, σ_y ~ N(μ_ij, σ_y²)   regression model within group j
β_j | β, σ ~ N(β, σ²)   variability in regression parameters between groups
β, σ_y, σ ~ …   prior models on global parameters
Where we have some choices to make is in the definition of the regression mean μ_ij. In the
simplest case of a random intercepts model, we assume that groups might have unique
baselines β_0j, yet share a common relationship between Y and each X:

μ_ij = β_0j + β_1 X_ij1 + ⋯ + β_p X_ijp.

In the most complicated case of a random intercepts and slopes model, we assume that
groups have unique baselines and unique relationships between Y and each X:

μ_ij = β_0j + β_1j X_ij1 + ⋯ + β_pj X_ijp.

In between, we might assume that some predictors need group-specific coefficients and
others don't, for example:

μ_ij = β_0j + β_1j X_ij1 + β_2 X_ij2 + ⋯ + β_p X_ijp.
17.9 Exercises
17.9.1 Conceptual exercises
Exercise 17.1 (Translating assumptions into model notation). To test the relationship
between reaction times and sleep deprivation, researchers enlisted 3 people in a 10-day
study. Let Y_ij denote the reaction time (in ms) to a given stimulus and X_ij the number of
days of sleep deprivation for the ith observation on subject j. For each set of assumptions
below, use mathematical notation to represent an appropriate Bayesian hierarchical model
of Y_ij vs X_ij.
a. Not only do some people tend to react more quickly than others, sleep deprivation
might impact some people's reaction times more than others.
b. Though some people tend to react more quickly than others, the impact of sleep
deprivation on reaction time is the same for all.
c. Nobody has inherently faster reaction times, though sleep deprivation might impact
some people's reaction times more than others.
Exercise 17.2 (Sketch the assumption: Part 1). Continuing with the sleep study, suppose we
model the relationship between reaction time Y_ij and days of sleep deprivation X_ij using
a random intercepts model.
b) Explain what σ_y > σ_0 would mean in the context of the sleep study.
Exercise 17.3 (Sketch the assumption: Part 2). Suppose instead that we model the
relationship between reaction time Y_ij and days of sleep deprivation X_ij using the random
intercepts and slopes model.
Exercise 17.4 (Making meaning of models). To study the relationship between weight and
height among pug puppies, you collect data on 10 different litters, each containing 4 to 6
puppies born to the same mother. Let Y_ij and X_ij denote the weight and height,
respectively, of puppy i in litter j.
a) Write out formal model notation for model 1, a random intercepts model of Y_ij vs
X_ij.
b) Write out formal model notation for model 2, a random intercepts and slopes
model of Y_ij vs X_ij.
c) Summarize the key differences in the assumptions behind models 1 and 2. Root this
discussion in the puppy context.
Exercise 17.5 (Translating models to code). Suppose we had weight and height data for
the puppy study. Write out appropriate stan_glmer() model code for models 1 and 2
from Exercise 17.4.
Exercise 17.7 (Sleep: simulating the model). Continuing with the sleep analysis, let's
simulate and dig into the hierarchical posteriors.
Exercise 17.8 (Sleep: group-speci c inference). Next, let's dig into what Model 2 indicates
about the individuals that participated in the sleep study.
a) Use your posterior simulation to identify the person for whom reaction time
changes the least with sleep deprivation. Write out their posterior median
regression model.
b) Repeat part a, this time for the person for whom reaction time changes the most
with sleep deprivation.
c) Use your posterior simulation to identify the person that has the slowest baseline
reaction time. Write out their posterior median regression model.
d) Repeat part c, this time for the person that has the fastest baseline reaction time.
e) Simulate, plot, and discuss the posterior predictive model of reaction time after 5
days of sleep deprivation for two subjects: you and Subject 308. You're encouraged
to try this from scratch before relying on the posterior_predict() shortcut.
a) Evaluate the two models of reaction time: Are they wrong? Are they fair? How
accurate are their posterior predictions?
b) Which of the two models do you prefer and what does this indicate about the
relationship between reaction time and sleep deprivation? Justify your answer with
posterior evidence.
Exercise 17.10 (Voices: setting up the model). Does one's voice pitch change depending on
attitude? To address this question, Winter and Grawunder (2012) conducted a study in
which each subject participated in various role-playing dialogs. These dialogs spanned
different contexts (e.g., asking for a favor) and were approached with different attitudes
(polite vs informal). In the next exercises you'll explore a hierarchical regression analysis
of Y_ij, the average voice pitch in subject j's ith dialog session (measured in Hz), by X_ij,
whether or not the dialog was polite (vs informal). Beyond a baseline understanding that the
typical voice pitch is around 200 Hz, you should utilize weakly informative priors.
a) Using formal notation, write out an appropriate hierarchical regression model of
Y_ij by X_ij. In doing so, assume that baseline voice pitch differs from subject to subject, but that
the impact of attitude on voice pitch is similar among all subjects.
b) Compare and contrast the meanings of model parameters β_0j and β_0 in the context
of this voice pitch study.
c) Compare and contrast the meanings of model parameters σy and σ0 in the context of
this voice pitch study.
Exercise 17.11 (Voices: check out some data). To balance our weakly informative priors for
the model of pitch by attitude, check out some data.
a) Load the voices data from the bayesrules package. How many study subjects are
included in this sample? In how many dialogs did each subject participate?
b) Construct and discuss a plot which illustrates the relationship between voice pitch
and attitude both within and between subjects.
Exercise 17.12 (Voices: simulating the model). Continuing with the voice pitch analysis, in
this exercise you will simulate and dig into the hierarchical posterior of your model
parameters.
a) Simulate the hierarchical posterior model of voice pitch by attitude. Construct and
discuss trace plots, density plots, autocorrelation plots, and a pp_check() of the
chain output.
b) Construct and interpret a 95% credible interval for β0.
c) Construct and interpret a 95% credible interval for β1.
d) Is there ample evidence that, for the average subject, voice pitch differs depending
on attitude (polite vs informal)? Explain.
Exercise 17.13 (Voices: focusing on the individual). Continuing with the voice pitch
analysis, in this exercise you will focus on speci c subjects.
a) Report the global posterior median model of the relationship between voice pitch
and attitude.
b) Report and contrast the posterior median models for two subjects in our data: A and
F.
c) Using posterior_predict(), simulate posterior predictive models of voice
pitch in a new polite dialog for three different subjects: A, F, and you. Illustrate
your simulation results using mcmc_areas() and discuss your findings.
Exercise 17.15 (Sleep: different priors). In our earlier sleep analysis, we utilized weakly
informative priors. Pretending that you haven't already seen the data, specify a model of
Reaction time by Days of sleep deprivation using priors that you tune yourself. Use
prior simulation to illustrate your prior understanding.
18
Non-Normal Hierarchical Regression &
Classi cation
DOI: 10.1201/9780429288340-18
A master chef becomes a master chef by mastering the basic elements of cooking, from
flavor to texture to smell. When cooking then, they can combine these elements without
relying on rigid step-by-step cookbook directions. Similarly, in building statistical
models, Bayesian or frequentist, there is no rule book to follow. Rather, it's important to
familiarize ourselves with some basic modeling building blocks and develop the ability
to use these in different combinations to suit the task at hand. With this, in Chapter 18
you will practice cooking up new models from the ingredients you already have. To
focus on the new concepts in this chapter, we'll utilize weakly informative priors
throughout. Please review Chapters 12 and 13 for a refresher on tuning prior models in
the Poisson and logistic regression settings, respectively. The same ideas apply here.
Goals
Expand our generalized hierarchical regression model toolkit by combining
hierarchical regression techniques (Chapter 17) with
Poisson and Negative Binomial regression models for count response variables
Y (Chapter 12) and logistic regression models for binary categorical response
variables Y (Chapter 13).
18.1 Hierarchical logistic regression
Whether for the thrill of thin air, a challenge, or the outdoors, mountain climbers set out
to summit great heights in the majestic Nepali Himalaya. Success is not guaranteed –
poor weather, faulty equipment, injury, or simply bad luck, mean that not all climbers
reach their destination. This raises some questions. What's the probability that a
mountain climber makes it to the top? What factors might contribute to a higher
success rate? Beyond a vague sense that the typical climber might have a 50/50 chance
at success, we'll balance our weakly informative prior understanding of these questions
with the climbers_sub data in the bayesrules package, a mere subset of data made
available by The Himalayan Database (2020) and distributed through the #tidytuesday
project (R for Data Science, 2020b):
This dataset includes the outcomes for 2076 climbers, dating back to 1978. Among
them, only 38.87% successfully summited their peak:
Quiz Yourself!
As you might imagine given its placement in this book, the climbers data has an
underlying grouping structure. Identify which of the following variables encodes
that grouping structure: expedition_id, member_id, season,
expedition_role, or oxygen_used.
Since member_id is essentially a row (or climber) identifier and we only have one
observation per climber, this is not a grouping variable. Further, though season,
expedition_role, and oxygen_used each have categorical levels which we
observe more than once, these are potential predictors of success, not grouping
variables.1 This leaves expedition_id – this is a grouping variable. The
climbers data spans 200 different expeditions:
_________________________
1 For example, the observed season categories (Autumn, Spring, Summer, Winter) are a fixed and
complete set of options, not a random sample of categories from a broader population of seasons.
Each expedition consists of multiple climbers. For example, our first three expeditions
set out with 5, 6, and 12 climbers, respectively:
It would be a mistake to ignore this grouping structure and otherwise assume that our
individual climber outcomes are independent. Since each expedition works as a team,
the success or failure of one climber in that expedition depends in part on the success or
failure of another. Further, all members of an expedition start out with the same
destination, with the same leaders, and under the same weather conditions, and thus are
subject to the same external factors of success. Beyond it being the right thing to do
then, accounting for the data's grouping structure can also illuminate the degree to
which these factors introduce variability in the success rates between expeditions. To
this end, notice that more than 75 of our 200 expeditions had a 0% success rate – i.e., no
climber in these expeditions successfully summited their peak. In contrast, nearly 20
expeditions had a 100% success rate. In between these extremes, there's quite a bit of
variability in expedition success rates.
FIGURE 18.1: A histogram of the success rates for the 200 climbing expeditions.
There are several potential predictors of climber success in our dataset. We'll consider only two: a climber's age and whether they received supplemental oxygen in order to breathe more easily at high elevation. As such, define:

Y_ij = 1 if climber i in expedition j successfully summits their peak, and 0 if not,

along with predictors X_ij1, the climber's age, and X_ij2, an indicator of whether they used supplemental oxygen.
By calculating the proportion of success at each age and oxygen use combination, we
get a sense of how these factors are related to climber success (albeit a wobbly sense
given the small sample sizes of some combinations). In short, it appears that climber
success decreases with age and drastically increases with the use of oxygen:
FIGURE 18.2: A scatterplot of the success rate among climbers by age and oxygen
use.
In building a Bayesian model of this relationship, first recognize that the Bernoulli model is reasonable for our binary response variable Y. Letting π_ij be the probability that climber i in expedition j successfully summits,

Y_ij | π_ij ~ Bern(π_ij).
Way back in Chapter 13, we explored a complete pooling approach to expanding this simple model into a logistic regression model of Y by a set of predictors X:

log(π_ij / (1 − π_ij)) = β0 + β1 X_ij1 + β2 X_ij2.
This is a great start, BUT it doesn't account for the grouping structure of our data. Instead, consider the following hierarchical alternative with independent, weakly informative priors tuned below by stan_glmer() and with a prior model for β0 expressed through the centered intercept β_0c. After all, it makes more sense to think about the baseline success rate among the typical climber, β_0c, than among 0-year-old climbers that don't use oxygen, β0. To this end, we started our analysis with a weak understanding that the typical climber has a 0.5 probability of success, or a log(odds of success) = 0.
Specifically, our random intercepts logistic regression model (18.1) is:

Y_ij | π_ij ~ Bern(π_ij) with log(π_ij / (1 − π_ij)) = β_0j + β1 X_ij1 + β2 X_ij2
β_0j | β0, σ0 ~ N(β0, σ0²), independently across expeditions
β_0c, β1, β2, σ0 ~ some priors.   (18.1)

Equivalently, we can reframe this random intercepts logistic regression model by expressing the expedition-specific intercepts as tweaks to the global intercept,

log(π_ij / (1 − π_ij)) = (β0 + b_0j) + β1 X_ij1 + β2 X_ij2

where the tweaks b_0j | σ0 ~ N(0, σ0²), independently across expeditions. These tweaks capture the difference between each expedition's baseline success rate and the global baseline, measured by the log(odds of success), for each expedition j. They acknowledge that some expeditions are inherently more successful than others.
The expedition-specific intercepts β_0j are assumed to be Normally distributed around some global intercept β0 with standard deviation σ0. Thus, β0 describes the typical baseline success rate across all expeditions, and σ0 captures the between-group variability in success rates from expedition to expedition. β1 describes the global relationship between success and age when controlling for oxygen use. Similarly, β2 describes the global relationship between success and oxygen use when controlling for age.
Putting this all together, our random intercepts logistic regression model (18.1) makes the simplifying (but we think reasonable) assumption that expeditions might have unique intercepts β_0j but share common regression parameters β1 and β2. In plain language, though the underlying success rates might differ from expedition to expedition, being younger or using oxygen aren't more beneficial in one expedition than in another.
To simulate the model posterior, the stan_glmer() code below combines the best of
two worlds: family = binomial specifies that ours is a logistic regression model
(à la Chapter 13) and the (1 | expedition_id) term in the model formula
incorporates our hierarchical grouping structure (à la Chapter 17):
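A sketch of that simulation is below. The prior arguments simply write out weakly informative, autoscaled priors; the chains, iter, and seed settings are illustrative rather than prescriptive.

library(rstanarm)

# Hierarchical (random intercepts) logistic regression of success by age and
# oxygen use, with expedition_id as the grouping variable
climb_model <- stan_glmer(
  success ~ age + oxygen_used + (1 | expedition_id),
  data = climbers, family = binomial,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)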
You're encouraged to follow this simulation with a confirmation of the prior specifications and some MCMC diagnostics:
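For example, a sketch of those follow-up checks (plot settings are illustrative):

library(bayesplot)

# Confirm the tuned prior specifications
prior_summary(climb_model)

# MCMC diagnostics: trace plots, effective sample size ratios, R-hat
mcmc_trace(climb_model, size = 0.1)
neff_ratio(climb_model)
rhat(climb_model)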
Whereas these diagnostics confirm that our MCMC simulation is on the right track, a
posterior predictive check indicates that our model is on the right track. From each of
100 posterior simulated datasets, we record the proportion of climbers that were
successful using the success_rate() function. These success rates range from
roughly 37% to 41%, in a tight window around the actual observed 38.9% success rate
in the climbers data.
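A sketch of this check is below; it assumes success is coded as 1 / 0 (or TRUE / FALSE) and passes a custom statistic through pp_check():

# Proportion of successes in a vector of binary outcomes
success_rate <- function(x) { mean(x == 1) }

# Posterior predictive check: compare the observed success rate to the success
# rates across 100 posterior simulated datasets
pp_check(climb_model, nreps = 100, plotfun = "stat", stat = "success_rate") +
  xlab("success rate")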
To begin, notice that the 80% posterior credible interval for the age coefficient β1 is comfortably below 0. Thus, we have significant posterior evidence that, when controlling for whether or not a climber uses oxygen, the likelihood of success decreases with age. More specifically, translating the information in β1 from the log(odds) to the odds scale, there's an 80% chance that the odds of successfully summiting drop somewhere between 3.5% and 5.8% for every extra year in age: (e^(−0.0594), e^(−0.0358)) = (0.942, 0.965). Similarly, the 80% posterior credible interval for the oxygen_usedTRUE coefficient β2 provides significant posterior evidence that, when controlling for age, the use of oxygen dramatically increases a climber's likelihood of summiting the peak. There's an 80% chance that the use of oxygen could correspond to anywhere between a 182-fold increase and a 617-fold increase in the odds of success: (e^(5.2), e^(6.43)) = (182, 617). Oxygen please!
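For reference, here's one way such 80% intervals might be computed; the exact column names in the tidy() output depend on the broom.mixed version:

library(broom.mixed)

# 80% posterior credible intervals for the global regression parameters
tidy(climb_model, effects = "fixed", conf.int = TRUE, conf.level = 0.80)

# Translate the age and oxygen coefficients from the log(odds) to the odds scale
exp(c(-0.0594, -0.0358))   # age
exp(c(5.2, 6.43))          # oxygen_usedTRUE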
Combining our observations on β1 and β2, the posterior median model of the relationship between climbers' log(odds of success) and their age (X1) and oxygen use (X2) is

log(π / (1 − π)) = −1.42 − 0.0474 X1 + 5.79 X2.
This posterior median model merely represents the center among a range of posterior
plausible relationships between success, age, and oxygen use. To get a sense for this
range, Figure 18.4 plots 100 posterior plausible alternative models. Both with oxygen
and without, the probability of success decreases with age. Further, at any given age, the
probability of success is drastically higher when climbers use oxygen. However, our
posterior certainty in these trends varies quite a bit by age. We have much less certainty
about the success rate for older climbers on oxygen than for younger climbers on
oxygen, for whom the success rate is uniformly high. Similarly, but less drastically, we
have less certainty about the success rate for younger climbers who don't use oxygen
than for older climbers who don't use oxygen, for whom the success rate is uniformly
low.
FIGURE 18.4: 100 posterior plausible models for the probability of climbing success
by age and oxygen use.
For each climber, the probability of success is approximated by the observed proportion
of success among their 20,000 posterior predictions. Since these probabilities
incorporate uncertainty in the baseline success rate of the new expedition, they are more
moderate than the global trends observed in Figure 18.4:
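A sketch of how these predictions might be simulated is below. The four hypothetical climbers match those discussed next (climber 4 is assumed to be a 60-year-old who plans to use oxygen), and the "new expedition" label signals a group outside our observed expeditions:

# Four hypothetical climbers on a brand new expedition
new_climbers <- data.frame(
  age = c(20, 20, 60, 60),
  oxygen_used = c(FALSE, TRUE, FALSE, TRUE),
  expedition_id = rep("new expedition", 4)
)

# 20,000 binary posterior predictions of success per climber
set.seed(84735)
binary_predictions <- posterior_predict(climb_model, newdata = new_climbers)

# Approximate each climber's probability of success by the observed proportion
# of successes among their posterior predictions
colMeans(binary_predictions)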
These posterior predictions provide more insight into the connections between age,
oxygen, and success. For example, our posterior prediction is that climber 1, who is 20
years old and does not plan to use oxygen, has a 27.88% chance of summiting the peak.
This probability is naturally lower than for climber 2, who is also 20 but does plan to
use oxygen. It's also higher than the posterior prediction of success for climber 3, who
also doesn't plan to use oxygen but is 60 years old. Overall, the posterior prediction of
success is highest for climber 2, who is younger and plans to use oxygen, and lowest for
climber 3, who is older and doesn't plan to use oxygen.
In Chapter 13 we discussed the option of turning such posterior probability predictions
into posterior classifications of binary outcomes: yes or no, do we anticipate that the
climber will succeed or not? If we used a simple 0.5 posterior probability cut-off to
make this determination, we would recommend that climbers 1 and 3 not join the
expedition (at least, not without oxygen) and give climbers 2 and 4 the go ahead. Yet in
this particular context, we should probably leave it up to individual climbers to
interpret their own results and make their own yes-or-no decisions about whether to
continue on their expedition. For example, a 65.16% chance of success might be worth
the hassle and risk to some but not to others.
Overall, under this classification rule, our model successfully predicts the outcomes for
91.71% of our climbers. This is pretty fantastic given that we're only utilizing
information on the climbers' ages and oxygen use, among many possible other
considerations (e.g., destination, season, etc.). Yet given the consequences of
misclassification in this particular context (e.g., risk of injury), we should prioritize specificity, our ability to anticipate when a climber might not succeed. To this end, our
model correctly predicted only 92.51% of the climbing failures. To increase this rate,
we can change the probability cut-off in our classi cation rule.
Quiz Yourself!
What cut-off can we utilize to achieve a specificity of at least 95% while also maintaining the highest possible sensitivity?
In general, to increase specificity, we can increase the probability cut-off, thereby making it more difficult to predict “success.” After some trial and error, it seems that cut-offs of roughly 0.65 or higher will achieve a desired 95% specificity level. This switch to 0.65 naturally decreases the sensitivity of our posterior classifications, from
90.46% to 81.54%, and thus our ability to detect when a climber will be successful. We
think the added caution is worth it.
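That trial and error can be carried out with the classification_summary() function in the bayesrules package; a minimal sketch, assuming the same climbers data used to fit the model:

# Posterior classification summaries under the default 0.5 cut-off and the
# stricter 0.65 cut-off (sensitivity drops while specificity rises)
set.seed(84735)
classification_summary(model = climb_model, data = climbers, cutoff = 0.5)
classification_summary(model = climb_model, data = climbers, cutoff = 0.65)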
by visitor rating X_ij1 and room type, where an entire private unit is the reference level and we have indicator variables for the two other room types:
Figure 18.5 displays the trends in the number of reviews as well as their relationship
with a listing's rating and room type. In examining the variability in Y_ij alone, note that the majority of listings have fewer than 20 reviews, though there's a long right skew.
Further, the volume of reviews tends to increase with ratings and privacy levels.
FIGURE 18.5: Plots of the number of reviews received across AirBnB listings (left) as
well as the relationship of the number of reviews with a listing's rating (middle) and
room type (right).
We can further break down these dynamics within each neighborhood. We show just
three here to conserve precious space: Albany Park (a residential neighborhood in
northern Chicago), East Garfield Park (a residential neighborhood in central Chicago),
and The Loop (a commercial district and tourist destination). In general, notice that
Albany Park listings tend to have fewer reviews, no matter their rating or room type.
FIGURE 18.6: A scatterplot of an AirBnB listing's number of reviews by its rating and
room type, for three neighborhoods.
In building a regression model for the number of reviews, the first step is to consider reasonable probability models for data Y_ij. Since the Y_ij values are non-negative skewed counts, a Poisson model is a good starting point. Specifically, letting λ_ij denote the rate, or expected number, of reviews for listing i in neighborhood j, Y_ij | λ_ij ~ Pois(λ_ij).
The hierarchical Poisson regression model below builds this out to incorporate (1) the rating and room type predictors (X_ij1, X_ij2, X_ij3) and (2) the airbnb data's grouped structure. Beyond a general understanding that the typical listing has around 20 reviews (hence log(20) ≈ 3 logged reviews), this model utilizes independent, weakly informative priors tuned by stan_glmer():
Taking a closer look, this model assumes that neighborhoods might have unique intercepts β_0j but share common regression parameters (β1, β2, β3). In plain language: though some neighborhoods might be more popular AirBnB destinations than others (hence their listings tend to have more reviews), the relationship of reviews with rating and room type is the same for each neighborhood. For instance, ratings aren't more influential to reviews in one neighborhood than in another. This assumption greatly simplifies our analysis while still accounting for the grouping structure in the data. To simulate the posterior, we specify our family = poisson data structure and incorporate the neighborhood-level grouping structure through (1 | neighborhood):
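A sketch of this simulation follows; the model name and the prior, chains, iter, and seed settings are illustrative, with the intercept prior centered at log(20) ≈ 3:

# Load the AirBnB data
data(airbnb)

# Hierarchical Poisson regression of review counts, grouped by neighborhood
reviews_model_poisson <- stan_glmer(
  reviews ~ rating + room_type + (1 | neighborhood),
  data = airbnb, family = poisson,
  prior_intercept = normal(3, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)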
Figure 18.7 indicates that our hierarchical Poisson regression model significantly underestimates the variability in reviews from listing to listing, while overestimating the typical number of reviews. We've been here before! Recall from Chapter 12 that an underlying Poisson regression assumption is that, at any set of predictor values, the average number of reviews is equal to the variance in reviews:

E(Y_ij) = Var(Y_ij) = λ_ij.

The pp_check() calls this assumption into question. To address the apparent overdispersion in the Y_ij values, we swap out the Poisson model in (18.2) for the more flexible Negative Binomial model.
Equivalently, we can express the random intercepts as tweaks to the global intercept,

log(μ_ij) = (β0 + b_0j) + β1 X_ij1 + β2 X_ij2 + β3 X_ij3

where b_0j | σ0 ~ N(0, σ0²), independently across neighborhoods. To simulate the posterior of the hierarchical Negative Binomial regression model, we can swap out family = poisson for family = neg_binomial_2:
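That swap might look like the following sketch (again with illustrative settings); a quick pp_check() then re-examines the fit:

# Hierarchical Negative Binomial regression of review counts
reviews_model_negbin <- stan_glmer(
  reviews ~ rating + room_type + (1 | neighborhood),
  data = airbnb, family = neg_binomial_2,
  prior_intercept = normal(3, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

# Posterior predictive check of the Negative Binomial model
pp_check(reviews_model_negbin) + xlab("reviews")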
Though not perfect, the Negative Binomial model does a much better job of capturing the behavior in reviews from listing to listing. The posterior model of β1 reflects a significant and substantive positive association between reviews and rating. When controlling for room type, there's an 80% chance that the volume of reviews increases somewhere between 1.17 and 1.45 times, or 17 and 45 percent, for every extra point in rating: (e^(0.154), e^(0.371)) = (1.17, 1.45). In contrast, the posterior model of β3 illustrates that shared rooms are negatively associated with reviews. When controlling for ratings, there's an 80% chance that the volume of reviews for shared room listings is somewhere between 52 and 76 percent as high as for listings that are entirely private: (e^(−0.659), e^(−0.275)) = (0.52, 0.76).
where β_0j, the baseline review rate, varies from neighborhood to neighborhood. We'll again focus on just three neighborhoods: Albany Park, East Garfield Park, and The Loop. The posterior summaries below evaluate the differences between these neighborhoods' baselines and the global intercept, b_0j = β_0j − β0:
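One way to obtain such summaries is sketched below, using broom.mixed's tidy() method; the string matching of neighborhood labels is an assumption about how the levels are stored and may need adjusting:

library(broom.mixed)

# Posterior summaries of the neighborhood-level tweaks b_0j
neighborhood_tweaks <- tidy(
  reviews_model_negbin, effects = "ran_vals",
  conf.int = TRUE, conf.level = 0.80
)

# Peek at the three neighborhoods of interest
neighborhood_tweaks %>%
  filter(str_detect(level, "Albany|Garfield|Loop"))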
Note that AirBnB listings in Albany Park have atypically few reviews, those in East Garfield Park have atypically large numbers of reviews, and those in The Loop do not significantly differ from the average. Though not dramatic, these differences from
neighborhood to neighborhood play out in posterior predictions. For example, below we
simulate posterior predictive models of the number of reviews for three listings that
each have a 5 rating and each offer privacy, yet they're in three different neighborhoods.
Given the differing review baselines in these neighborhoods, we anticipate that the
Albany Park listing will have fewer reviews than the East Garfield Park listing, though
the predictive ranges are quite wide:
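These simulations might be set up as in the sketch below; the room_type and neighborhood labels are assumptions about how the factor levels are stored in airbnb:

library(bayesplot)

# Three hypothetical listings: 5-star rating, entirely private, in three
# different neighborhoods
new_listings <- data.frame(
  rating = rep(5, 3),
  room_type = rep("Entire home/apt", 3),
  neighborhood = c("Albany Park", "East Garfield Park", "The Loop")
)

# Posterior predictive distributions of the number of reviews for each listing
set.seed(84735)
predicted_reviews <- posterior_predict(reviews_model_negbin, newdata = new_listings)

# One option for comparing the three predictive distributions
mcmc_areas(predicted_reviews)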
FIGURE 18.9: Posterior predictive models for the number of reviews for AirBnB
listings in three different neighborhoods.
cases of little variability in the β_kj between groups and (2) the link function g(⋅). For example, in the hierarchical logistic regression setting,

Y_ij | … ~ Bern(π_ij) with g(π_ij) = log(π_ij / (1 − π_ij)).
18.4 Exercises
18.4.1 Applied & conceptual exercises
Exercise 18.1 (We know how to do a lot of stuff). For each model scenario, specify an
appropriate structure for the data model, note whether the model is hierarchical, and if
so, identify the grouping variable. Though you do not need to simulate the models, be
sure to justify your selections using the data provided. To learn more about these
datasets, type ?name_of_dataset into the console.
Exercise 18.2 (Book banning: setting up the model). People have both failed and
succeeded at getting books banned, and hence ideas censored, from public libraries and
education. In the following exercises, you'll explore whether certain book
characteristics can help predict whether or not a book challenge is successful. To do so,
you'll balance weakly informative priors with the book_banning data in the
bayesrules package. This data, collected by Fast and Hegland (2011) and presented by
Legler and Roback (2021), includes features and outcomes for 931 book challenges
made in the US between 2000 and 2010. Let Y_ij denote the outcome of the ith book challenge in state j, i.e., whether or not the book was removed. You'll consider three potential predictors of outcome: whether the reasons for the challenge include concern about violent material (X_ij1), antifamily material (X_ij2), or the use of offensive language (X_ij3).
a) In your book banning analysis, you'll use the state in which the book
challenge was made as a grouping variable. Explain why it's reasonable (and
important) to assume that the book banning outcomes within any given state are
dependent.
b) Write out an appropriate hierarchical regression model of Y by
ij
(X ij1,X ij2,X ij3) using formal notation. Assume each state has its own
intercept, but that the states share the same predictor coefficients.
c) Dig into the book_banning data. What state had the most book challenges?
The least?
d) Which state has the greatest book removal rate? The smallest?
e) Visualize and discuss the relationships between the book challenge outcome
and the three predictors.
Exercise 18.3 (Book banning: simulating the model). Next, let's simulate and dig into
the posterior model of the book banning parameters.
a) How accurate are your model's posterior predictions of whether a book will be
banned? Provide evidence.
b) Interpret the posterior medians of b_0j = β_0j − β0 for two states j: Kansas (KS)
variable         meaning
total_minutes    the total number of minutes played throughout the season
games_played     the number of games played throughout the season
starter          whether or not the player started in more than half of the games that they played
avg_points       the average number of points scored per game
team             team name
Exercise 18.8 (More basketball!). Utilize your final chosen model (Poisson, Normal, or
Negative Binomial) to explore the relationship between the total number of minutes
played by a player and their average points per game in more depth.
(a) Summarize your key findings. Some things to consider along the way: Can you interpret every model parameter (both global and team-specific)? Can you summarize the key trends? Which trends are significant? How good is your
model?
(b) Predict the total number of minutes that a player will get throughout a season
if they play in 30 games, they start each game, and they score an average of 15
points per game.
Exercise 18.9 (Open exercise: basketball analysis with multiple predictors). In this
open-ended exercise, complete an analysis of the number of games started by WNBA
players using multiple predictors of your choosing.
Exercise 18.10 (Open exercise: more climbing). In Chapter 18, you analyzed the
relationship of a climber's success with their age and oxygen use. In this open-ended
exercise, continue your climbing analysis by considering other possible predictors.
These might include any combination of personal attributes (age, oxygen_used,
injured), time attributes (year, season), or attributes of the climb itself
(highpoint_metres, height_metres).
19
Adding More Layers
DOI: 10.1201/9780429288340-19
Throughout this book, we've laid the foundations for Bayesian thinking and modeling.
But in the broader scheme of things, we've just scratched the surface. This last chapter
marks the end of this book, not the end of the Bayesian modeling toolkit. There's so
much more we wish we could share, but one book can't cover it all. (Perhaps a sequel?!
Bayes Rules 2! The Bayesianing or Bayes Rules 2! Happy Bayes Are Here Again!)
Hopefully Bayes Rules! has piqued your curiosity and you feel equipped to further your
Bayesian explorations. We conclude here by nudging our hierarchical models one step
further, to address two questions.
Goals
We've utilized individual-level predictors to better understand the trends among
individuals within groups. How can we also utilize group-level predictors to
better understand the trends among the groups themselves?
What happens when we have more than one grouping variable?
We'll explore these questions through two case studies. To focus on the new concepts in
this chapter, we'll also utilize weakly informative priors throughout. For a more
expansive treatment, we recommend Legler and Roback (2021) or Gelman and Hill
(2006). Though these resources utilize a frequentist framework, if you've read this far,
you have the skills to consider their work through a Bayesian lens.
19.1 Group-level predictors
In Chapter 18 we explored how the number of reviews varies from AirBnB listing to
listing. We might also ask: what makes some AirBnB listings more expensive than
others? We have a weak prior understanding here that the typical listing costs around
$100 per night. Beyond this baseline, we'll supplement a weakly informative prior
understanding of the AirBnB market by the airbnb dataset in the bayesrules package.
Recall that airbnb contains information on 1561 listings across 43 Chicago
neighborhoods, and hence multiple listings per neighborhood:
The observed listing prices, ranging from $10 to $1000 per night, are highly skewed.
Thus, to facilitate our eventual modeling of what makes some listings more expensive
than others, we'll work with the symmetric logged prices. (Trust us for now and we'll
provide further justification below!)
FIGURE 19.1: Histograms of AirBnB listing prices in dollars (left) and log(dollars)
(right).
FIGURE 19.2: Plots of the log(price) of AirBnB listings by the number of bedrooms
(left), user rating (middle), and room type (right).
Further, though we're not interested in the particular Chicago neighborhoods in the
airbnb data (rather we want to use this data to learn about the broader market), we
shouldn't simply ignore them either. In fact, boxplots of the listing prices in each
neighborhood hint at correlation within neighborhoods. As is true with real estate in
general, AirBnB listings tend to be less expensive in some neighborhoods (e.g., 13 and
41) and more expensive in others (e.g., 21 and 36):
To this end, we can build a hierarchical model of AirBnB prices by a listing's number of bedrooms, rating, and room type while accounting for the neighborhood grouping structure. For listing i in neighborhood j, let Y_ij denote the price, X_ij1 the number of bedrooms, X_ij2 the rating, and (X_ij3, X_ij4) indicators of the two other room types. Given the symmetric, continuous nature of the logged prices, we'll implement the following Normal hierarchical regression model of log(Y_ij) by (X_ij1, X_ij2, X_ij3, X_ij4):
Notice that (19.1) allows for random neighborhood-specific intercepts yet assumes shared predictor coefficients. That is, we assume that the baseline listing prices might vary from neighborhood to neighborhood, but that the listing features have the same association with price in each neighborhood. Further, beyond our vague understanding that the typical listing has a nightly price of $100 (hence a log(price) of roughly 4.6), our weakly informative priors are tuned by stan_glmer(). The corresponding posterior is simulated below with a prior_summary() which confirms the prior specifications in (19.1).
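A sketch of this simulation, with the intercept prior centered at log(100) ≈ 4.6 and assuming the predictor columns are named bedrooms, rating, and room_type:

# Hierarchical Normal regression of logged prices, grouped by neighborhood
airbnb_model_1 <- stan_glmer(
  log(price) ~ bedrooms + rating + room_type + (1 | neighborhood),
  data = airbnb, family = gaussian,
  prior_intercept = normal(4.6, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

# Confirm the prior specifications in (19.1)
prior_summary(airbnb_model_1)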
The pp_check() in Figure 19.4 (left) reassures us that ours is a reasonable model – 100 posterior simulated datasets of logged listing prices have features similar to the original logged listing prices. It's not because we're brilliant, but because we tried other things first. Our first approach was to model price, instead of logged price. Yet a pp_check() confirmed that this original model didn't capture the skewed nature in prices (Figure 19.4 right).
FIGURE 19.4: Posterior predictive checks of the Normal models of logged AirBnB
listing prices (left) and raw, unlogged listing prices (right).
Putting this together, airbnb includes individual-level predictors (e.g., rating) and group-level predictors (e.g., walkability). The latter are ignored by our current model (19.1). Consider the neighborhood-specific intercepts β_0j. The original model (19.1) uses the same prior for each β_0j, assuming that the baseline logged prices in neighborhoods are Normally distributed around some mean logged price β0 with standard deviation σ0:

β_0j | β0, σ0 ~ N(β0, σ0²), independently across neighborhoods.   (19.2)

By lumping them together in this way, (19.2) assumes that we have the same prior information about the baseline price in each neighborhood. This is fine in cases where we truly don't have any information to distinguish between groups, here neighborhoods.
Yet our airbnb analysis doesn't fall into this category. Figure 19.5 plots the average
logged AirBnB price in each neighborhood by its walkability. The results indicate that
neighborhoods with greater walkability tend to have higher AirBnB prices. (The same
goes for transit access, yet we'll limit our focus to walkability.)
FIGURE 19.5: A scatterplot of the average logged AirBnB listing price by the
walkability score for each neighborhood.
What's more, the average logged price by neighborhood appears to be linearly associated with walkability. Incorporating this observation into our model of β_0j, (19.2), isn't much different than incorporating individual-level predictors X into the first layer of the hierarchical model. First, let Uj denote the walkability of neighborhood j. Notice a couple of things about this notation: Uj is indexed by neighborhood j alone, not by listing i, and its value is shared by every listing within neighborhood j. Next, we can replace the global trend in β_0j, β0, with a neighborhood-specific linear trend μ_0j informed by walkability Uj:

μ_0j = γ0 + γ1 Uj.   (19.3)
This switch introduces two new model parameters which describe the linear trend
between the baseline listing price and walkability of a neighborhood:
intercept γ0 technically measures the average logged price we'd expect for
neighborhoods with 0 walkability (though no such neighborhood exists); and
slope γ1 measures the expected change in a neighborhood's typical logged price with
each extra point in walkability score.
Our final AirBnB model thus expands upon (19.1) by incorporating the group-level regression model (19.3) along with prior models for the new group-level regression parameters. Given the large number of model parameters, we do not write out the independent and weakly informative priors here. These can be obtained using prior_summary() below.

log(Y_ij) | β_0j, β1, …, β4, σy ~ N(μ_j, σy²) with μ_j = β_0j + β1 X_ij1 + ⋯ + β4 X_ij4
β_0j | γ0, γ1, σ0 ~ N(μ_0j, σ0²), independently, with μ_0j = γ0 + γ1 Uj
β1, …, β4, γ0, γ1, σ0, σy ~ some priors   (19.4)
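A sketch of simulating this expanded model simply adds the neighborhood-level walkability predictor to the formula; here we assume the score is stored in a column named walk_score:

# Hierarchical Normal regression with the group-level walkability predictor
airbnb_model_2 <- stan_glmer(
  log(price) ~ walk_score + bedrooms + rating + room_type + (1 | neighborhood),
  data = airbnb, family = gaussian,
  chains = 4, iter = 5000*2, seed = 84735
)

# Inspect the weakly informative priors tuned by stan_glmer()
prior_summary(airbnb_model_2)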
The new notation here might make this model appear more complicated or different than it is. First, consider what this model implies about the expected price of an AirBnB listing. For neighborhoods j that were included in our airbnb data, the expected log price of a listing i is defined by

β_0j + β1 X_ij1 + β2 X_ij2 + β3 X_ij3 + β4 X_ij4,

just as in (19.1), no matter the neighborhood's walkability. Beyond the included neighborhoods, the expected log price for an AirBnB listing is defined by replacing β_0j with its walkability-dependent mean γ0 + γ1 Uj:

(γ0 + γ1 Uj) + β1 X_ij1 + β2 X_ij2 + β3 X_ij3 + β4 X_ij4.

The parentheses here emphasize the structure of (19.4): the baseline price, γ0 + γ1 Uj, now depends upon the neighborhood's walkability.
Finally, consider the within-group and between-group variability parameters, σy and σ0. Since the first layers of both models utilize the same regression structure of price within neighborhoods, σy has the same meaning in our original model (19.1) and new model (19.4): σy measures the unexplained variability in listing prices within any neighborhood, given the listings' bedrooms, rating, and room_type. Yet, by altering our model of how the typical logged prices vary between neighborhoods, the meaning of σ0 has changed: in (19.1), σ0 reflects the variability in baseline prices β_0j from neighborhood to neighborhood; in (19.4), σ0 reflects the unexplained variability in baseline prices β_0j from neighborhood to neighborhood, beyond that explained by walkability.
In general, let Uj denote a group-level predictor, the values of which are shared by every individual in group j. The underlying structure of a hierarchical model of Y_ij which includes both individual- and group-level predictors mirrors (19.4): the first layer models Y_ij by the individual-level predictor X_ij with group-specific intercepts β_0j, the next layer models the β_0j by the group-level predictor Uj, and

β1, γ0, γ1, … ~ some priors.   (19.5)

The first layer of (19.5) reflects the relationship between individual Y_ij and X_ij values, with intercepts β_0j that vary by group j. The next layer reflects how our understanding of these group-specific baselines is informed by the group-level predictor Uj. Pulling these two layers together, the expected relationship between Y_ij and X_ij is

(γ0 + γ1 Uj) + β1 X_ij.
model 1: 3.21 + 0.265 X_ij1 + 0.22 X_ij2 − 0.538 X_ij3 − 1.06 X_ij4
model 2: (1.92 + 0.0166 Uj) + 0.265 X_ij1 + 0.221 X_ij2 − 0.537 X_ij3 − 1.06 X_ij4.

With the exception of the intercept terms, the posterior median models are nearly indistinguishable. This makes sense. Including the group-level walkability predictor in airbnb_model_2 essentially replaces the original global intercept β0 in airbnb_model_1 with γ0 + γ1 Uj, without tweaking the individual-level X coefficients.
In a similar spirit, let's obtain and compare posterior summaries for the standard
deviation parameters, σy (sd_Observation.Residual) and σ0
(sd_(Intercept).neighborhood):
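For example (a sketch; the term labels in the output depend on the broom.mixed version):

library(broom.mixed)

# Posterior medians of sigma_y (sd_Observation.Residual) and
# sigma_0 (sd_(Intercept).neighborhood) for each model
tidy(airbnb_model_1, effects = "ran_pars")
tidy(airbnb_model_2, effects = "ran_pars")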
The posterior medians of the within-group variability parameter σy are nearly
indistinguishable for our two models: 0.365 vs 0.366. This suggests that incorporating
the neighborhood-level walkability predictor didn't improve our understanding of the
variability in individual listing prices within neighborhoods, i.e., why some listings are
more expensive than others in the same neighborhood. Makes sense! Since all listings
within a neighborhood share the same walkability value Uj, including this information in
airbnb_model_2 doesn't help us distinguish between listings in the same
neighborhood.
In contrast, the posterior median of the between-group variability parameter σ0 is notably smaller in airbnb_model_2 than in airbnb_model_1: 0.202 vs 0.279. Recall that σ0 reflects our uncertainty about neighborhood baseline prices β_0j. Thus, the drop in σ0 suggests that walkability explains at least some of why some neighborhoods tend to have more expensive listings than others. This, too, makes sense!
Our two different models, (19.1) and (19.4), formulate different baseline prices β_0j for these two neighborhoods, Pullman and Edgewater. Letting b_0j denote a neighborhood j adjustment:

model 1: β_0j = β0 + b_0j
model 2: β_0j = γ0 + γ1 Uj + b_0j.   (19.6)
To calculate the posterior median intercepts β_0j for both neighborhoods in both models, we can utilize the posterior median values of (β0, γ0, γ1), (3.21, 1.92, 0.0166), from the posterior summaries above. Further, we can obtain the neighborhood adjustments b_0j for both models from the corresponding posterior summaries of the neighborhood-level tweaks:
There are some cool and intuitive things to notice in this table:
Looking beyond Pullman and Edgewater, Figure 19.6 plots the pairs of airbnb_model_1 intercepts (open circles) and airbnb_model_2 intercepts (closed circles) for all 43 sample neighborhoods. Like the observed average log(prices) in these neighborhoods (Figure 19.5), the airbnb_model_2 intercepts are positively associated with walkability. The posterior median model of this association is captured by γ0 + γ1 Uj ≈ 1.92 + 0.0166 Uj.
FIGURE 19.6: For each of the 43 neighborhoods in airbnb, the posterior median neighborhood-level intercepts from airbnb_model_1 (open circles) and airbnb_model_2 (closed circles) are plotted versus neighborhood walkability. Vertical lines connect the neighborhood intercept pairs. The sloped line represents the airbnb_model_2 posterior median model of log(price) by walkability, γ0 + γ1 Uj ≈ 1.92 + 0.0166 Uj.
The comparison between the two models' intercepts is also notable here. As our
numerical calculations above con rm, Edgewater's intercepts are quite similar in the two
models. (It's tough to even visually distinguish between them!) Since its
airbnb_model_1 intercept was already so close to the price vs walkability trend,
incorporating the walkability predictor in airbnb_model_2 didn't do much to change
our mind about Edgewater. In contrast, Pullman's airbnb_model_1 intercept implied
a much higher baseline price than we would expect for a neighborhood with such low
walkability. Upon incorporating walkability, airbnb_model_2 thus pulled Pullman's
intercept down, closer to the trend.
We've been here before. Hierarchical models pool information across all groups,
allowing what we learn about some groups to improve our understanding of others. As
evidenced by the airbnb_model_2 neighborhood intercepts that are pulled toward
the trend with walkability, this pooling is intensified by incorporating a group-level
predictor and is especially pronounced for neighborhoods that either (1) have
airbnb_model_1 intercepts that fall far from the trend or (2) have small sample
sizes. For example, Pullman falls into both categories. Not only is its
airbnb_model_1 intercept quite far above the trend with walkability, our airbnb
data included only 5 listings in Pullman (contrasted by 35 in Edgewater):
In this case, the pooled information from the other neighborhoods regarding the
relationship between prices and walkability has a lot of sway in our posterior
understanding about Pullman.
But that's not all! If you look more closely, you'll notice another grouping factor in the
data: the mountain peak being summited. For example, our dataset includes 27 different
expeditions with a total of 210 different climbers that set out to summit the Ama
Dablam peak:
Altogether, the climbers dataset includes information about 2076 individual climbers,
grouped together in 200 expeditions, to 46 different peaks:
Further, these groupings are nested: the data consists of climbers within expeditions and
expeditions within peaks. That is, a given climber does not set out on every expedition
nor does a given expedition set out to summit every peak. Figure 19.7 captures a
simplified version of this nested structure in pictures, assuming only 2 climbers within
each of 6 expeditions and 2 expeditions within each of 3 peaks.
FIGURE 19.7: In the nested group structure, climbers (C) are nested within expeditions
(Exp), which are nested within peaks.
Now, we don't really care about the particular 46 peaks represented in the climbers
dataset. These are just a sample from a vast world of mountain climbing. Thus, to
incorporate it into our analysis of climber success, we'll include peak name as a
grouping variable, not a predictor. This second grouping variable, in addition to
expedition group, requires a new subscript. Let Y_ijk denote whether or not climber i, setting out with expedition j to summit peak k, is successful:

Y_ijk = 1 (yes) or 0 (no).

Further, let X_ijk1 and X_ijk2 denote the climber's age and whether they received oxygen, respectively. In models where expedition j or peak k or both are ignored, we'll drop the relevant subscripts.
Given the binary nature of response variable Y, we can utilize hierarchical logistic regression for its analysis. Consider two approaches to this task. Like our approach in Chapter 18, Model 1 assumes that baseline success rates vary by expedition j, and thus incorporates expedition-specific intercepts β_0j. In past chapters, we learned that we can equivalently express these intercepts as tweaks to a global intercept, β_0j = β0 + b_0j. Accordingly, we'll specify Model 1 as follows, where the expedition-specific intercepts β0 + b_0j are assumed to be Normally distributed around β0 with standard deviation σb. Further, the weakly informative priors are tuned by stan_glmer() below, where we again utilize a baseline prior assumption that the typical climber has a 0.5 probability, or 0 log(odds), of success:
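A sketch of the Model 1 simulation (settings illustrative):

# Model 1: success by age and oxygen use, with expedition-specific intercepts
climb_model_1 <- stan_glmer(
  success ~ age + oxygen_used + (1 | expedition_id),
  data = climbers_sub, family = binomial,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)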
Next, let's acknowledge our second grouping factor. Model 2 assumes that baseline success rates vary by expedition j AND peak k, thereby incorporating expedition- and peak-specific intercepts β_0jk. Following our approach to Model 1, we obtain these β_0jk by adjusting the global intercept β0 with both expedition-specific tweaks b_0j and peak-specific tweaks p_0k:

log(π_ijk / (1 − π_ijk)) = β_0jk + β1 X_ijk1 + β2 X_ijk2
                         = (β0 + b_0j + p_0k) + β1 X_ijk1 + β2 X_ijk2.   (19.9)
Thus, for expedition j and peak k, we've decomposed the intercept β_0jk into three pieces:
β0 = the global baseline success rate across all climbers, expeditions, and peaks;
b_0j = an adjustment to β0 for climbers in expedition j; and
p_0k = an adjustment to β0 for expeditions that try to summit peak k.
The complete Model 2 specification follows, where the independent weakly informative priors are specified by stan_glmer() below:
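A sketch of the Model 2 simulation is below; it adds a second grouping term to the formula, assuming the peak identifier column is named peak_name (it may instead be peak_id, depending on the version of climbers_sub):

# Model 2: intercepts tweaked by both expedition and peak
climb_model_2 <- stan_glmer(
  success ~ age + oxygen_used + (1 | expedition_id) + (1 | peak_name),
  data = climbers_sub, family = binomial,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE),
  prior_covariance = decov(regularization = 1, concentration = 1,
                           shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)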
Note that the two between-group variance parameters are interpreted as follows: σb measures the variability in success rates from expedition to expedition within any given peak, whereas σp measures the variability in success rates from peak to peak. This is quite a philosophical leap! We'll put some specificity into the details by simulating the model posteriors in the next section.
Both models lead to similar conclusions about the expected relationship between climber success with age and use of oxygen: aging doesn't help, but oxygen does. For example, by the posterior median estimates of β1 and β2 from climb_model_2, the odds of success are roughly cut in half for every extra 15 years in age (e^(15·(−0.0475)) = 0.49) and increase nearly 500-fold with the use of oxygen (e^(6.19) ≈ 490).
The variability in success rates from peak to peak, σp, is smaller than that from
expedition to expedition within any given peak, σb. This suggests that there are
greater differences between expeditions on the same peak than between the peaks
themselves.
The posterior median of σb drops from climb_model_1 to climb_model_2.
This makes sense for two reasons. First, some of the expedition-related variability in
climb_model_1 is being redistributed and attributed to peaks in
climb_model_2. Second, σb measures the variability in success across all
expeditions in climb_model_1, but the variability across expeditions within the
same peak in climb_model_2 – naturally, the outcomes of expeditions on the
same peak are more consistent than the outcomes of expeditions across all peaks.
Further, for each of the 200 sample expeditions and 46 sample peaks, group_levels_2 provides a tidy posterior summary of the associated b_0j and p_0k adjustments.
For example, expeditions to the Ama Dablam peak have a higher than average success rate, with a positive peak tweak of p_0k = 2.92. In contrast, expeditions to Annapurna I have a lower than average success rate, with a negative peak tweak of p_0k = −2.04. Further, among the various expeditions that tried to summit Ama Dablam, both AMAD03107 and AMAD03327 had higher than average success rates, and thus positive expedition tweaks b_0j (0.00575 and 3.32, respectively).
We can combine this global, peak-specific, and expedition-specific information to model the success rates for three different groups of climbers. In cases where the group's expedition or destination peak falls outside the observed groups in our climbers data, the corresponding tweak is set to 0 – i.e., in the face of the unknown, we assume average behavior for the new expedition or peak:
Group a climbers join expedition AMAD03327 to Ama Dablam, and thus have positive expedition and peak tweaks, b_0j = 3.32 and p_0k = 2.92;
Group b climbers join a new expedition to Ama Dablam, and thus have a neutral expedition tweak and a positive peak tweak, b_0j = 0 and p_0k = 2.92; and
Group c climbers join a new expedition to Mount Pants Le Pants, a peak not included in our climbers data, and thus have neutral expedition and peak tweaks, b_0j = p_0k = 0.
Plugging these expedition and peak tweaks, along with the posterior medians for (β0, β1, β2), into (19.9) reveals the posterior median models of success for the three groups of climbers:

Group a: log(π / (1 − π)) = (−1.55 + 3.32 + 2.92) − 0.0475 X1 + 6.19 X2
Group b: log(π / (1 − π)) = (−1.55 + 0 + 2.92) − 0.0475 X1 + 6.19 X2
Group c: log(π / (1 − π)) = (−1.55 + 0 + 0) − 0.0475 X1 + 6.19 X2
Exercise 19.1 (Individual- vs group-level predictors: Part I). In the Chapter 18 exercises,
you utilized the book_banning data to model whether or not a book was removed
while accounting for the grouping structure in state, i.e., there being multiple book
challenges per state. Indicate whether each variable below is a potential book-level or
state-level predictor of removed. Support your claim with evidence.
a) language
b) political_value_index
c) hs_grad_rate
d) antifamily
Exercise 19.2 (Individual- vs group-level predictors: Part II). In Chapter 19, you utilized
the climbers_sub data to model whether or not a mountain climber had success,
while accounting for the grouping structure in expedition_id and peak_id.
Indicate whether each variable below is a potential climber-level, expedition-level, or
peak-level predictor of success. Support your claim with evidence.
a) height_metres
b) age
c) count
d) expedition_role
e) first_ascent_year
Exercise 19.3 (Two groups: Part I). To study the occurrence of widget defects,
researchers enlisted 3 different workers at each of 4 different factories into a study. Each
worker produced 5 widgets and researchers recorded the number of defects in each
widget.
Exercise 19.4 (Two groups: Part II). Continuing with the widget study, let Y_ijk be the number of defects for the ith widget made by worker j at factory k. Suppose the following is a reasonable model of Y_ijk:
a) Explain the meaning of the β0 term in this context.
b) Explain the meaning of the b_0j and f_0k terms in this context.
(2, 10, 1). Compare and contrast these values in the context of widget
manufacturing.
Exercise 19.6 (Spotify: two models). In this exercise you will compare two Normal
hierarchical regression models of popularity by valence. For simplicity, utilize
random intercepts but not random slopes throughout.
Exercise 19.7 (Spotify: digging in). Let's dig into your model that accounts for both
grouping variables in the spotify data.
a) Write out the posterior median model of the relationship between popularity
and valence for songs in the following groups:
Albums by artists not included in the spotify_small dataset
A new album by Kendrick Lamar
Kendrick Lamar's “good kid, m.A.A.d city” album
(album_id 748dZDqSZy6aPXKcI9H80u)
b) Compare the posterior median models from part a. What do they tell us about
the relevant artists and albums?
c) Which of the 6 sample artists gets the highest “bump” or tweak in their baseline
popularity?
d) Which sample album gets the highest “bump” or tweak in its baseline
popularity? And which artist made this album?
Exercise 19.8 (Spotify: understanding variability). Your Spotify model has three
variance parameters. Construct, interpret, and compare posterior summaries of these
three parameters. For example, what do they tell you about the music industry: is there
more variability in the popularity from song to song within the same album, from album
to album within the same artist, or from artist to artist?
a) Reflecting on your work above, what school features are associated with greater vocabulary improvement among its students?
b) Reflecting on your work above, what student features are associated with greater
vocabulary improvement?
19.4 Goodbye!
Goodbye, dear readers. We hope that after working through this book, you feel
empowered to go forth and do some Bayes things.
Bibliography
Antonio, N., de Almeida, A., and Nunes, L. (2019). Hotel booking demand
datasets. Data in Brief, 22:41–49.
Bachynski, K. (2019). No Game for Boys to Play: The History of Youth
Football and the Origins of a Public Health Crisis. UNC Press Books.
Baumer, B. S., Garcia, R. L., Kim, A. Y., Kinnaird, K. M., and Ott, M. Q.
(2020). Integrating data science ethics into an undergraduate major.
arXiv preprint arXiv:2001.07649.
Baumer, B. S., Horton, N., and Kaplan, D. (2021). mdsr: Complement to
Modern Data Science with R. R package version 0.2.4.
Bechdel, A. (1986). Dykes to Watch Out For. Firebrand Books.
Belenky, G., Wesensten, N. J., Thorne, D. R., Thomas, M. L., Sing, H. C.,
Redmond, D. P., Russo, M. B., and Balkin, T. J. (2003). Patterns of
performance degradation and restoration during sleep restriction and
subsequent recovery: A sleep dose-response study. Journal of Sleep
Research, 12:1–12.
Benjamin, R. (2019). Race After Technology: Abolitionist Tools for the
New Jim Code. John Wiley & Sons.
Berger, J. O. (1984). The Likelihood Principle (lecture notes-monograph
series). Institute of Mathematical Statistics.
Birds Canada (2018). https://www.birdscanada.org/.
Blackwell, D. (1969). Basic Statistics. McGraw Hill.
Blangiardo, M. and Cameletti, M. (2015). Spatial and Spatio-Temporal
Bayesian models with R - INLA. Wiley.
Blitzstein, J. and Hwang, J. (2019). Introduction to Probability. Chapman
& Hall/CRC Texts in Statistical Science, second edition.
Bolker, B. and Robinson, D. (2021). broom.mixed: Tidying Methods for
Mixed Models. R package version 0.2.7.
Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of
Markov Chain Monte Carlo. CRC Press.
Cards Against Humanity (2017). Pulse of the Nation. https://thepulseofthenation.com/.
Carreño, E. J. C. and Cuervo, E. C. (2017). bayeslongitudinal: Adjust
Longitudinal Regression Models Using Bayesian Methodology. R
package version 0.1.0.
Dastin, J. (2018). Amazon scraps secret AI recruiting tool that showed bias
against women. https://www.reuters.com/article/us-
amazon-com-jobs-automation-insight/amazon-scraps-
secret-ai-recruiting-tool-that-showed-bias-
against-women-idUSKCN1MK08G.
D'Ignazio, C. and Klein, L. F. (2020). Data Feminism. MIT Press.
Dogucu, M., Johnson, A., and Ott, M. (2021). bayesrules: Datasets and
Supplemental Functions from the Bayes Rules! Book. R package
version 0.0.2.
Dua, D. and Graff, C. (2017). UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml.
Eckhardt, R. (1987). Stan Ulam, John Von Neumann, and the Monte Carlo
method. Los Alamos Science Special Issue.
El-Gamal, M. A. and Grether, D. M. (1995). Are people Bayesian?
Uncovering behavioral strategies. Journal of the American Statistical
Association, 90(432):1137–1145.
Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile,
Police, and Punish the Poor. St. Martin's Press.
Fanaee-T, H. and Gama, J. (2014). Event labeling combining ensemble
detectors and background knowledge. Progress in Artificial Intelligence,
2:113–127.
Fast, S. and Hegland, T. (2011). Book challenges: A statistical
examination. Project for Statistics 316-Advanced Statistical Modeling,
St. Olaf College.
Firke, S. (2021). janitor: Simple Tools for Examining and Cleaning Dirty
Data. R package version 2.1.0.
Gabry, J. and Goodrich, B. (2020a). Estimating generalized linear models
with group-speci c terms with rstanarm. https://mc-stan.org/
rstanarm/articles/glmer.html.
Gabry, J. and Goodrich, B. (2020b). Prior distributions for rstanarm
models. https://mc-stan.org/rstanarm/articles/
priors.html.
Gabry, J. and Goodrich, B. (2020c). rstanarm: Bayesian Applied
Regression Modeling via Stan. R package version 2.21.1.
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A.
(2019). Visualization in Bayesian workflow. J. R. Stat. Soc. A, 182:389–
402.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H.,
Daumé III, H., and Crawford, K. (2018). Datasheets for datasets. arXiv
preprint arXiv:1803.09010.
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press.
Goodman, S. (2011). A dirty dozen: twelve p-value misconceptions.
Seminars in Hematology, 45:135–140.
Gorman, K. B., Williams, T. D., and Fraser, W. R. (2014). Ecological
sexual dimorphism and environmental variability within a community
of Antarctic penguins (Genus Pygoscelis). PLoS ONE, 9(3)(e90081).
Guo, J., Gabry, J., Goodrich, B., and Weber, S. (2020). rstan: R Interface to
Stan. R package version 2.21.2.
Hadavas, C. (2020). How automation bias encourages the use of flawed
algorithms. https://slate.com/technology/2020/03/ice-
lawsuit-hijacked-algorithm.html.
Harmon, A. (2019). As cameras track Detroit's residents, a debate ensues
over racial bias. New York Times.
Horst, A., Hill, A., and Gorman, K. (2020). palmerpenguins: Palmer
Archipelago (Antarctica) Penguin Data. R package version 0.1.0.
Kalil, A., Mayer, S., and Oreopoulos, P. (2020). Closing the word gap with
Big Word Club: Evaluating the impact of a tech-based early childhood
vocabulary program. Ann Arbor, MI: Inter-university Consortium for
Political and Social Research.
Kay, M. (2021). tidybayes: Tidy Data and Geoms for Bayesian Models. R
package version 3.0.1.
Kim, A. Y., Ismay, C., and Chunn, J. (2020). fivethirtyeight: Data and
Code Behind the Stories and Interactives at FiveThirtyEight. R package
version 0.6.1.
Laird, N. and Ware, J. (1982). Random-effects models for longitudinal
data. Biometrics, 38 4:963–974.
Legler, J. and Roback, P. (2021). Beyond Multiple Linear Regression:
Applied Generalized Linear Models and Multilevel Models in R.
Chapman and Hall/CRC.
Lock, R. H., Lock, P. F., Morgan, K. L., Lock, E. F., and Lock, D. F. (2016).
Statistics: Unlocking the Power of Data. John Wiley & Sons.
Lum, K., Price, M., Guberek, T., and Ball, P. (2010). Measuring elusive
populations with Bayesian model averaging for multiple systems
estimation: A case study on lethal violations in Casanare, 1998-2007.
Statistics, Politics and Policy, 1(1).
Mbuvha, R. and Marwala, T. (2020). Bayesian inference of COVID-19
spreading rates in South Africa. PLOS ONE, 15(8):1–16.
McElreath, R. (2019). Statistical Rethinking winter 2019 lecture 12.
https://www.youtube.com/watch?v=hRJtKCIDTwc.
McGrayne, S. (2012). The theory that would not die - How Bayes' Rule
cracked the Enigma code, hunted down Russian submarines and
emerged triumphant from two centuries of controversy. Yale University
Press.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F.
(2021). e1071: Misc Functions of the Department of Statistics,
Probability Theory Group (Formerly: E1071), TU Wien. R package
version 1.7-9.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal
and Social Psychology, 67:371–378.
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson,
B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model cards for model
reporting. Proceedings of the Conference on Fairness, Accountability,
and Transparency.
Modrak, M. (2019). Divergent transitions – a primer.
https://discourse.mc-stan.org/t/divergent-
transitions-a-primer/17099.
MuseumofModernArt (2020). MoMA – collection. GitHub repository.
Noble, S. U. (2018). Algorithms of Oppression: How Search Engines
Reinforce Racism. NYU Press.
Pavlik, K. (2019). Understanding classifying genres using Spotify audio
features. https://www.kaylinpavlik.com/classifying-
songs-genres/.
R for Data Science (2018). Christmas bird counts.
https://github.com/rfordatascience/tidytuesday/
tree/master/data/2019/2019-06-18.
R for Data Science (2020a). Coffee ratings. https://github.com/
rfordatascience/tidytuesday/blob/master/data/
2020/2020-07-07.
R for Data Science (2020b). Himalayan climbing expeditions.
https://github.com/rfordatascience/tidytuesday/
tree/master/data/2020/2020-09-22.
R for Data Science (2020c). Hotels. https://github.com/
rfordatascience/tidytuesday/blob/master/data/
2020/2020-02-11.
R for Data Science (2020d). Spotify songs. https://github.com/
rfordatascience/tidytuesday/blob/master/data/
2020/2020-01-21/readme.md.
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson,
B., Smith-Loud, J., Theron, D., and Barnes, P. (2020). Closing the AI
accountability gap: Defining an end-to-end framework for internal
algorithmic auditing. In Proceedings of the 2020 Conference on
Fairness, Accountability, and Transparency, pages 33–44.
Roberts, S. (2020). How to think like an epidemiologist. New York Times.
Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. (2017). Fake news
detection on social media: A data mining perspective. ACM SIGKDD
Explorations Newsletter, 19(1):22–36.
Singh, R., Meier, T. B., Kuplicki, R., Savitz, J., Mukai, I., Cavanagh, L.,
Allen, T., Teague, T. K., Nerio, C., Polanski, D., et al. (2014).
Relationship of collegiate football experience and concussion with
hippocampal volume and cognitive outcomes. Journal of the American
Medical Association, 311(18):1883–1888.
Stan development team (2019). Stan user's guide. https://mc-
stan.org/docs/2_25/stan-users-guide/index.html.
The Himalayan Database (2020).
https://www.himalayandatabase.com/.
Trinh, L. and Ameri, P. (2016). AirBnB price determinants: A multilevel
modeling approach. Project for Statistics 316-Advanced Statistical
Modeling, St. Olaf College.
Vats, D. and Knudson, C. (2018). Revisiting the Gelman-Rubin diagnostic.
arXiv preprint arXiv:1812.09384.
Vehtari, A. (2019). Cross-validation for hierarchical models.
https://avehtari.github.io/modelselection/
rats_kcv.html.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C.
(2021). Rank-normalization, folding, and localization: An improved R̂
for assessing convergence of MCMC. Bayesian Analysis, 16:667–718.
Warbelow, S., Avant, C., and Kutney, C. (2019). 2019 State Equality Index.
Human Rights Campaign Foundation.
Wasserstein, R. L. (2016). The ASA's statement on p-values: Context,
process, and purpose. The American Statistician, 70:129–133.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York.
Wickham, H. (2021). forcats: Tools for Working with Categorical
Variables (Factors). R package version 0.5.1.
Wickham, H., François, R., Henry, L., and Müller, K. (2021). dplyr: A
Grammar of Data Manipulation. R package version 1.0.6.
Williams, G. J. (2011). Data Mining with Rattle and R: The Art of
Excavating Data for Knowledge Discovery. Use R! Springer.
Winter, B. and Grawunder, S. (2012). The phonetic pro le of Korean
formal and informal speech registers. Journal of Phonetics, 40:808–815.
Index
84735, 26
bald_eagles, 327
baseline level, 271, 303
basketball, 481
Bayes factor, 189–191
Bayes' rule, 13, 24, 25, 37, 38, 359
Bayesian knowledge-building process, 4
Bayesian learning, 86
bechdel, 77
Bechdel test, 75
Bernoulli model, 331
Beta posterior, 57, 58, 62, 68
Beta prior, 50, 54, 57, 61, 62, 68
Beta-Binomial model, 97, 109
Beta-Binomial rstan simulation, 139
Beta-Binomial simulation, 63
between-group variability, 383, 397, 398, 409, 425, 434, 450, 467, 491,
494, 501
Beyoncé, 3, 100, 392, 395, 401
bias-variance trade-off, 294, 297, 298, 411, 413
big_word_club, 415, 418, 508
bikes, 219, 246
Binomial model, 33, 40, 55
book_banning, 481
categorical predictor, 270
categorical response variable, 329
centered intercept, 268
cherry_blossom_sample, 375, 421
classification, 211, 329, 340, 366
classification cut-off, 341, 344
classification rule, 341
classification_summary(), 343, 472
classification_summary_cv(), 345
climate change, 372
climbers_sub, 464, 497
coffee_ratings, 263
coffee_ratings_small, 461
complement, 20
complete pooled model, 377, 383, 384, 389, 390, 422, 467
complete pooled model drawbacks, 380
complete pooled regression model, 422
concentration parameter, 441
conditional probability, 20, 21, 23, 34, 36
conditional probability function, 22
conditional probability mass function, 33, 34
conditional probability model, 33
conditionally independent, 363, 365, 369
confusion matrix, 343, 344, 366–368, 472
conjugate family, 11, 97, 98, 113, 118
conjugate prior, 62, 68, 98, 118
controlling for covariates, 278
correlation, 379, 404, 437, 487
covariance matrix, 437, 438, 440
cross-validation, 255, 256, 289, 291, 292, 294, 318, 345, 366, 368
curse of dimensionality, 137, 157
kernel, 60, 99
na.omit(), 421
naive Bayes classification, 355–357, 359, 361–363, 365, 366, 368, 369
naive_classification_summary_cv(), 368
naiveBayes(), 365
neff_ratio(), 149, 196
Negative Binomial model, 321, 322, 477
Negative Binomial regression model, 319, 322, 323
neighborhood of proposals, 162, 170
nested data, 499
no pooled model, 380, 381, 383, 384, 389, 393, 394
no pooled model drawbacks, 382
non-conjugate prior, 98
Normal hierarchical model, 387, 397
Normal hierarchical regression assumptions, 426
Normal hierarchical regression model, 488
Normal model, 109, 110
Normal regression assumptions, 214
Normal regression model, 211, 268
Normal-Normal, 109, 113, 117, 159, 212
Normal-Normal complete pooled model, 391
normalizing constant, 22, 36–39
odds, 330
one-sided hypothesis test, 187, 188
one-way analysis of variance (ANOVA), 399, 416
outlier, 304
overall accuracy, 367
overall classification accuracy, 343, 344
overdispersion, 319, 321, 322, 476
overfitting, 294, 296–298, 318
p-value, 9, 201
parallel chains, 147
partial pooled model, 385
partial pooling, 383, 389
pbeta(), 188
penguins_bayes, 206, 300, 355
pivot_wider(), 444
plot_beta(), 54
plot_beta_binomial(), 58, 83, 184
plot_gamma(), 105
plot_gamma_poisson(), 108, 109
plot_normal(), 110
plot_normal_likelihood(), 114
plot_normal_normal(), 115
plot_poisson_likelihood(), 107
Poisson model, 100, 101, 306, 475, 476
Poisson rate parameter, 307
Poisson regression assumptions, 310
Poisson regression coefficients, 308
Poisson regression model, 310
polynomial model, 295
pop, 30
posterior, 4
posterior classification, 339, 470, 471
posterior classification accuracy, 472
posterior credible interval, 185–187, 191, 195, 198, 201
posterior mean, 108, 185
posterior median, 198
posterior median relationship, 223
posterior mode, 185, 193
posterior odds, 188–191
posterior percentiles, 186
posterior plausibility, 193
posterior prediction, 192, 226, 339, 407, 409, 450, 470
posterior prediction error, 251
posterior prediction interval, 228, 253
posterior predictive accuracy, 289, 290
posterior predictive check, 246, 248, 289, 312, 447, 468, 476
posterior predictive mean, 251
posterior predictive model, 193, 195, 227, 228, 251, 315
posterior probability, 24, 39, 195
posterior quantiles, 186
posterior simulation, 25
posterior summary statistics, 223
posterior variability, 192, 408, 409, 450
posterior_predict(), 229, 253, 279, 316, 339, 395, 407, 410, 450,
471
pp_check(), 248–250, 261, 269, 317, 320, 341, 476, 488
ppc_intervals(), 254–256, 395
predict(), 366
prediction_summary(), 254, 255, 258, 318, 448
prediction_summary_cv(), 257, 448, 449
predictive accuracy, 448
prior, 4
prior covariance, 401
prior model, 30
prior odds, 189–191
prior probability, 20
prior probability model, 20
prior simulation, 218, 234, 276, 311, 346, 428
prior_summary(), 311, 322, 488, 491
probability, 330
probability cut-off, 472
probability mass function, 32
probability model, 20
proportionality, 98, 103, 106, 111
pulse_of_the_nation, 205, 319, 351, 372
qbeta(), 186
quantitative predictor, 270
quantitative response variable, 211, 303
unstable, 296
update(), 234, 277, 312, 430
vaccine safety, 372
vague prior, 77, 79, 90
voices, 420, 461
weakly informative prior, 234, 264, 268, 277, 311, 488
weather_perth, 122, 335
weather_WU, 267
weighted average, 113
within- vs between-group variability, 400
within-group correlation, 401
within-group variability, 383, 397, 407, 409, 424, 434, 450, 491, 494