Bayes and MCMC for Undergraduates
Jeff Witmer
To cite this article: Jeff Witmer (2017) Bayes and MCMC for Undergraduates, The American
Statistician, 71:3, 259-264, DOI: 10.1080/00031305.2017.1305289
THE AMERICAN STATISTICIAN, VOL. 71, NO. 3, 259–264: Teacher's Corner
https://doi.org/10.1080/00031305.2017.1305289
CONTACT Jeff Witmer jeff.witmer@oberlin.edu Department of Mathematics, Oberlin College, Oberlin, OH.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/TAS.
[1] This does not include another five articles in which only Bayes’ theorem was used.
Computation.” We use the Jags program via R to implement Markov chain Monte Carlo and find posterior distributions in a variety of settings. I am using a book (Kruschke 2015) that requires no mathematics beyond some exposure to first-semester calculus, and even that is optional.[2]

Students in my course do not become programmers; they just execute code that someone else has written. Moreover, the professor does not need to be adept at creating code. Some familiarity with R is certainly helpful, but it is quite easy to use the scripts that come with the textbook.[3] Indeed, I have had a few students complain that I do not lead them down the path of writing code to handle arbitrary situations; instead, I (mostly) have them make simple edits to existing code. To such students I plead “guilty as charged,” as there is only so much that I want to take on in a first course on Bayesian methods. We cover the Bayesian equivalents to t-tests, comparisons of proportions, regression, and ANOVA; in short, we cover the topics that students see from a frequentist perspective if they take my traditional STAT 101 course.

Students in my Bayesian course do not actually need to know calculus, although it helps to have some idea of what integration is. Likewise, they do not need to have taken a previous statistics course, although it may help (or hurt?) to have seen frequentist ideas of P-values and confidence intervals as points of comparison when we discuss Bayes factors and highest density intervals. Many of my students have previously taken both calculus and statistics, but some have taken only one of those two (but at least one of the two).

[2] Other books (Bolstad, Gill, Albert, Hoff, et al.) are available that also do not presume much sophistication on the part of the audience. There are a number of books that are written at a higher level (Christensen et al., Gelman et al., McElreath) that would challenge most undergraduates at the introductory level, but that might be quite suitable for students with strong backgrounds.
[3] Moreover, installing the BayesianFirstAid package in R and then changing t.test() to bayes.t.test(), e.g., is easy for an R user who wants to use MCMC but does not want to work with rjags scripts.

3. A Single Proportion

Early in my course I introduce model building with a simple construction of the degree of uncertainty about a Bernoulli parameter, θ. We look at this in three ways. (1) We use a Beta(a,b) as the prior distribution on θ and get a posterior that is Beta(a + z, b + n − z), where z is the number of successes in n Bernoulli trials. I do not prove this result about the posterior, but I do present the simple mathematics that shows the posterior mean as a weighted average of the prior mean and the data mean: (a + z)/(a + b + n) = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · [z/n]. (2) We use a discrete prior for θ that lives on grid points between 0 and 1. (3) We use MCMC. I show these three approaches in parallel, noting that a discrete prior with many grid points (2) gives a good approximation to the theoretical result (1) and MCMC (3) gives a good approximation to both of these.

As an example, I ask the question “What percentage of students on our campus are vegetarians?” and then present data from a sample of 136 students, 19 of whom are vegetarians. Appendix A shows output from a program that takes a Beta prior plus data and produces a Beta posterior, along with MCMC output from a second program.[4]

[4] The R code for examples shown in this paper is available at github.com/jawitmer/BayesTASArticle.

Bayesian methods, in contrast to their frequentist counterparts, incorporate information that was available prior to the collection of the data in a clear and consistent way.[5]

[5] Regarding current statistical practice, about half of the JASA papers that use Bayesian methods use informative priors and half do not.

To explore the role of the prior, I present pre-posterior predictive analysis with the vegetarian data and ask the question “Is this prior sensible?” To choose a prior for the vegetarian parameter I specified a prior mode—I chose 0.20 as a guess of the campus-wide vegetarian proportion—and a weight—I chose 20 in answer to the question “How many observations is my prior worth?” I then entered 0.20 and 20 into a program and saw that a Beta(4.6, 15.4) matches those values. From here I generated a random binomial of size 25 for each of 2000 points drawn from that Beta(4.6, 15.4) prior. A histogram of the binomial counts was consistent with my expectations, so I was comfortable with the Beta(4.6, 15.4) prior. That is, the histogram of possible binomial counts changed the question “Is this a sensible prior?” into the question “Do I expect data that follow this kind of a distribution?”

Although I discuss and use informative priors with my students, for much of the semester we avoid the objectivity versus subjectivity issue by using noninformative or mildly informative priors.[6]

[6] The R package BayesianFirstAid incorporates noninformative priors for Bayesian analyses that mimic standard methods; for example, the prop.test() command becomes bayes.prop.test() and uniform priors are used on parameters.

4. Two Proportions and Hierarchical Structure

MCMC makes it easy to use a prior that has a hierarchical structure. This is often a natural choice and one that tends to soften the link between subjective prior belief and the posterior distribution. Appendix B presents in more detail the opening example on free throw shooting among guards and centers, using a fairly simple hierarchical prior that has a parameter for each player, but that also allows for guards to be systematically different from centers. The flexibility of this model is a feature that is easily incorporated into Bayesian analyses.

5. Means

The Bayesian paradigm fosters flexibility aside from the use of hierarchical models. In particular, when using MCMC it is not necessary to stipulate that the error term is Gaussian. A traditional t-test assumes a Gaussian likelihood and uses a t-distribution for the test statistic. In contrast, the command bayes.t.test(y, mu = 30) in the R package BayesianFirstAid accepts a vector of data y and runs MCMC for a single mean using a t likelihood, with mu = 30 as a null hypothesis comparison point. Allowing the data to come from a long-tailed distribution can render immaterial the often vexing question
“What should I do if I think that one of the observations is an outlier?”

We rarely expect an effect to be exactly zero. Thus, Bayesian reasoning focuses on parameter estimation, rather than hypothesis testing. Appendix C shows a comparison of two means using either a web applet or an easily edited R script. The output gives the estimated difference in the means, along with a 95% credible interval for that difference.

6. Regression

The usual presentation of regression in an introductory course imposes the condition that the error term has a normal distribution. Outliers are then considered and points are either included or excluded from the analysis. But just as with means, in a regression model fit using MCMC it is easy to use a t distribution for the error term, with the degrees of freedom being a parameter estimated from the data. Appendix D details how this can be done in the context of a well-known regression example.

7. Other Topics

I spend some time on one-way and two-way Bayesian Analysis of Variance and I show an example or two of logistic regression; other professors would make other choices. Now that Stan is available I mention Stan and Hamiltonian Monte Carlo, but I do not expect students to learn how to use Stan.

I also spend time “looking under the hood,” as it were, regarding MCMC. I do not expect students to become adept at coding in Jags (or BUGS or Stan), but I do want them to use code and to understand what MCMC is and how it works. To that end, I go beyond what the textbook shows (i.e., what I do is completely optional, and I would skip this if I were not teaching strong students) and spend about three weeks on a unit in which I introduce Markov chains, with the canonical example of rain/shine weather. We raise the transition matrix to a large power to show the limiting distribution, before talking about a setting with many states and a long chain. Then, I introduce more Markov chain ideas: reducibility, reversibility, the ergodic property, etc. I follow that with a “proof” that the Metropolis algorithm works. I put proof in quotes here because I do a lot of hand-waving leading up to the final step, in which I show that Metropolis uses a reversible transition process, which completes the “proof.” Finally, I say that Gibbs sampling is of the same spirit as Metropolis, but can be more efficient, etc. In summary, I try to make the MCMC programs that we use be not a black box, but perhaps a gray box instead. But what I require of students is that they manipulate these programs (R scripts) in order to analyze data.

Bayesian methods have been around for a while, but we have not had easy-to-use software and textbooks that make Bayes accessible to undergraduates. That has recently changed, so that today a Bayesian course that teaches MCMC is available to a variety of students, as I hope I have demonstrated by discussing a course that I offer. The world of statistics users has moved into using Bayesian methods widely, and to good result. It is time for statistics educators to join in.

Appendix A

Veggie %. Figure 1 shows the result of running a program called BernBetaExample.R after editing a few lines of the code. One needs to input the two Beta parameters plus the data, where “number of flips” is the number of observations in the sample and “number of heads” is, in our example, the number of vegetarians.

Prior = c(4.6, 15.4) # Specify Prior as vector with the two shape parameters.
N = 136 # The total number of flips.
z = 19 # The number of heads.

[Figure 1: prior Beta(4.6, 15.4) with mode 0.2 and 95% HDI (0.065, 0.411); Bernoulli likelihood with maximum at 0.14; posterior Beta with mode 0.147 and 95% HDI (0.097, 0.208).]
Figure 1. Prior, likelihood, and posterior distribution of the percentage of vegetarians on campus.

Figure 2 shows the result of using MCMC, running a program called Jags-Ydich-Xnom1subj-MbernBeta-Veggie.R. The rather long program name says “I am going to use Jags as my MCMC engine, the response variable Y is dichotomous, the predictor X is a single subject nominal variable, the model is a Bernoulli likelihood with a Beta prior, and I’m analyzing the vegetarian data so I’m adding ‘Veggie’ at the end.” The program displays the posterior mode of 0.149 rather than the posterior mean of 0.151, but the user has the option of asking for the mean. The only thing the user needs to specify is the data,[7] by inputting the numbers 19 and 136 in the following line:

myData = data.frame(y = c(rep(0, 136 - 19), rep(1, 19)))

[Figure 2: MCMC posterior for θ, with mode = 0.149, based on z = 19 and N = 136.]

Figure 3 is a histogram of 2000 binomial counts from a pre-posterior predictive analysis in the vegetarian setting.

[Figure 3: histogram of the simulated counts z (roughly 0 to 15), with frequencies up to about 250.]
Figure 3. Pre-posterior predictive analysis of the vegetarian question when the prior is Beta(4.6, 15.4).

[7] A uniform prior is used by default. I used a Beta(4.6, 15.4) prior when creating Figure 2, which had no material effect on the results when compared to using a Beta(1, 1) prior.

Appendix B

Free Throw Shooting and Hierarchical Structure. I collected data on the centers and guards for the teams that made it to the “Elite Eight” in the 2015 NCAA basketball tournament. The free throw shooting performance of guards ranged from a low of 55.3% to a high of 88.8% success, compared to centers who ranged from a low of 47.9% to a high of 77.7%. A naïve analysis might pool together all hits and misses for each position and compare the aggregate guard percentage (78.6%) to the aggregate center percentage (61.3%), as if all guards were interchangeable and all centers were interchangeable. A conservative analysis might treat all players as separate, ignoring the fact that guards are similar to one another and centers are similar to one another.

It is easy to take a middle path of fitting a hierarchical model and conducting a Bayesian analysis, which my students see during week six of the semester. Denote each player’s ability with a parameter θ_player, let the θs for the centers come from a Beta(a1, b1) distribution and the θs for the guards come from a Beta(a2, b2) distribution, and let the parameters of the two Betas come from hyperpriors that describe prior belief about typical free throw shooting success. For example, using a Beta(30,10) for the mode, ω (where ω = (a − 1)/(a + b − 2)), and a diffuse gamma distribution on the concentration, κ (where κ = a + b), says that we expect basketball players to make about 75% of their free throws, without specifying that guards are better than centers. The posterior distributions tell us that there is a 92% chance that in general guards are better free throw shooters than centers. Beyond that, a hierarchical model allows us to pool information among centers and among guards, which leads to the following comparison. Willie Cauley-Stein, a center, made 79 of 128 free throws, for a 61.7% success rate. Quentin Snider, a guard, made 21 of 38 free throws, for a 55.3% success rate. The small number of attempts by Snider combined with the fact that he is a guard suggests that in the long run he will do quite a bit better than 55.3%, and we find that there is a 78% posterior probability that Snider is more skilled than Cauley-Stein at shooting free throws.[8]

Figure 4 shows the marginal posteriors for Snider and for Cauley-Stein, along with the posterior for the difference between Snider and Cauley-Stein (in the top right). The plus sign shows the difference in sample proportions, but the posterior is not centered there; instead, the hierarchical model’s pooling of information shifts the posterior.

[8] During the 2015–2016 season, Snider made a higher percentage of his free throws than Cauley-Stein did. Given the Bayesian analysis above, this Snider advantage comes as no surprise.

Appendix C

Comparing Two Populations. Myocardial blood flow (MBF) was measured for two groups of subjects after five minutes of bicycle exercise (Namdar et al. 2006). One group was given normal air to breathe (“normoxia”), while the other group was given a gas mixture to breathe with reduced oxygen (“hypoxia”) to simulate high altitude. The data (ml/min/g) are

Normoxia: 3.45, 3.09, 3.09, 2.65, 2.49, 2.33, 2.28, 2.24, 2.17, 1.34
Hypoxia: 6.37, 5.69, 5.58, 5.27, 5.11, 4.88, 4.68, 3.50

We want to know what these data indicate about the difference between training at normal altitude (normoxia) and training at high altitude (hypoxia).

One can go to the website http://sumsar.net/best_online/ and enter the data and hit the “Click to start!” button. Within a few seconds results appear that are based on a model of independent t distributions for the data. Diffuse (non-informative) priors are used for the population means, the degrees of freedom, and the population standard deviations. The posterior distribution of μ1 − μ2 is graphed and the mean is seen to be −2.65. The 95% Highest Density Interval is (−3.53, −1.73) and the probability that μ1 < μ2 is nearly 100%. Figure 5 is a screenshot.

My students would analyze these data using an R script that accompanies the textbook. The student can easily choose a t likelihood rather than a normal likelihood, as this amounts to simply changing “y[i] ∼ dnorm(…)” to “y[i] ∼ dt(…)” within the code. But it is not even necessary to make that change, as the textbook programs include a script that has a t density pre-selected. Figure 6 is a screenshot of output from running the script with these data.

To get this output, I took an existing R script (downloaded from the textbook website) and edited three lines, stating the name of the data file, the name of the response variable, and the name of the predictor (group) variable:

myDataFrame = read.csv(file = "ExerciseHypoxia.csv")
yName = "MBF"
xName = "Oxygen"

All I needed to do after that was to run the program. I might have changed the default comparison point from zero (“Are the two means equal?”) to something else (e.g., “Is the difference in means at least 3?”), and I might have changed the default, vague, prior to an informative prior, but those are options, not requirements.

Appendix D

Regression. A well-known dataset contains the winning (i.e., gold medal) men’s long jump distance in the Olympics and the year. Bob Beamon’s phenomenal jump in 1968 of 8.9 m broke the previous world record by 55 cm and results in the 1968 data point being an outlier. Under a frequentist analysis, the 95% confidence interval for the slope of the regression line for predicting jump distance (Y) from year (X) is (1.12, 1.69) if all of the data are used but is (1.14, 1.60) if the 1968 data point is deleted. The residual standard error when all data are used is 0.238 but this changes to 0.191 if the 1968 data point is removed. During week 11 of the semester, I cover Bayesian regression. Rather than delete the 1968 point, we conduct a Bayesian analysis that replaces the usual condition of normally distributed errors with the condition that Y|X has a t distribution on ν degrees of freedom, with ν as a random variable. This leads to fitting a Bayesian regression model with four parameters: the slope and intercept of the line, the standard deviation of the error term, and the degrees of freedom.

To fit this model, I have my students use an R script called Jags-Ymet-Xmet-Mrobust-Example.R, the name of which indicates that both the response Y and the predictor X are continuous (metric) and the model is robust (i.e., a t rather than a normal likelihood). They need to edit the script to specify where the data are to be found and what the names are of the variables. This is done with the following lines, the fourth of which rescales the predictor variable to make the output easier to read:

library(Stat2Data)
data(LongJumpOlympics)
myData = LongJumpOlympics
myData$Year = (myData$Year - 1900)/100
xName = "Year"; yName = "Gold"

Using the noninformative priors that are built into the script produces a 95% HDI on the slope of (1.13, 1.67).
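The pre-posterior predictive check described in Section 3 (draw parameter values from the prior, then simulate data from each draw) is easy to reproduce outside of Jags. The sketch below uses Python/NumPy rather than the article's R; the Beta(4.6, 15.4) prior, the 2000 draws, and the binomial size of 25 come from the text, while the random seed is my own choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior chosen in the article: mode 0.20, "worth" 20 observations -> Beta(4.6, 15.4)
a, b = 4.6, 15.4

# Draw 2000 candidate values of theta from the prior, then simulate a
# binomial count of size 25 for each draw.
theta = rng.beta(a, b, size=2000)
counts = rng.binomial(n=25, p=theta)

# If a histogram of these counts looks like data we would expect to see,
# the prior passes the "Is this a sensible prior?" check.
print(counts.mean())  # close to 25 * a/(a+b) = 5.75
```

If the simulated counts looked implausible (say, routinely above 15 vegetarians out of 25), that would signal that the prior, not the data, is out of line.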
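Section 7's goal of turning MCMC from a black box into a gray box can also be served by a toy sampler that students can read in full. The following sketch is mine, not from the article or its scripts, and uses Python rather than R/Jags; it runs random-walk Metropolis on the vegetarian posterior, where the conjugate answer Beta(23.6, 132.4) is known exactly and serves as a check:

```python
import numpy as np

rng = np.random.default_rng(7)

# Vegetarian data from the article: z = 19 successes in N = 136 trials,
# with a Beta(4.6, 15.4) prior. The conjugate posterior is Beta(23.6, 132.4),
# so MCMC is unnecessary here -- which is exactly why it makes a good check.
a, b, z, N = 4.6, 15.4, 19, 136

def log_post(theta):
    """Log of the unnormalized Beta-Bernoulli posterior density."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (a + z - 1) * np.log(theta) + (b + N - z - 1) * np.log(1 - theta)

# Random-walk Metropolis: propose a nearby theta, accept with
# probability min(1, posterior ratio), otherwise stay put.
theta, chain = 0.5, []
for _ in range(20000):
    prop = theta + rng.normal(0, 0.05)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

samples = np.array(chain[2000:])  # drop burn-in
print(samples.mean())  # near the exact posterior mean 23.6/156, about 0.151
```

The proposal standard deviation of 0.05 and the burn-in length are arbitrary tuning choices; comparing the chain's mean to the exact value 0.151 is the kind of sanity check the course encourages.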
[Figure 5: posterior of the difference of means, with mode = −2.64, 95% HDI (−3.5, −1.72), and essentially 100% of the posterior below 0.]
Figure 5. Posterior distribution of the difference in means between hypoxia and normoxia groups.

References

Biswas, S., Liu, D., Lee, J., and Berry, D. (2009), “Bayesian Clinical Trials at the University of Texas M. D. Anderson Cancer Center,” Clinical Trials, 6, 205–216.
Bolstad, W. (2007), Introduction to Bayesian Statistics (2nd ed.), Hoboken, NJ: Wiley.
Christensen, R., Johnson, W., Branscum, A., and Hanson, T. E. (2011), Bayesian Ideas and Data Analysis, Boca Raton, FL: CRC Press.
Cobb, G. (2015), “Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up,” The American Statistician, 69, 266–282.
Dowman, M., Savova, V., Griffiths, T. L., Kording, K. P., Tenenbaum, J. B., and Purver, M. (2008), “A Probabilistic Model of Meetings That Combines Words and Discourse Features,” IEEE Transactions on Audio, Speech, and Language Processing, 16, 1238–1248.
Gill, J. (2008), Bayesian Methods: A Social and Behavioral Sciences Approach, Boca Raton, FL: Chapman & Hall.
Hoff, P. (2009), A First Course in Bayesian Statistical Methods, New York: Springer.
Kruschke, J. (2015), Doing Bayesian Data Analysis, Waltham, MA: Elsevier.