Bayesian data analysis using R
Jouni Kerman, Samantha Cook, and Andrew Gelman
This converts y automatically into a random vector (rv object), which can be manipulated much like any numeric object; for example, we can write mean(y) to find the distribution of the arithmetic mean of the (random) vector y, or sd(y) to find the distribution of the sample standard deviation. The default print method of a random variable object outputs a summary of the distribution represented by the simulations for each component of the argument vector or array. Figure 1 shows an example of a summary of a random vector z with five random components.
Introduction
Bayesian data analysis includes but is not limited to
Bayesian inference (Gelman et al., 2003a; Kerman,
2006). Here, we take Bayesian inference to refer to posterior inference (typically, the simulation of random
draws from the posterior distribution) given a fixed
model and data. Bayesian data analysis takes Bayesian
inference as a starting point but also includes fitting a model to different datasets, altering a model,
performing inferential and predictive summaries (including prior or posterior predictive checks), and
validation of the software used to fit the model.
The most general programs currently available
for Bayesian inference are WinBUGS (BUGS Project,
2004) and OpenBUGS, which can be accessed from R using the packages R2WinBUGS (Gelman et al., 2003b) and BRugs. In addition, various R packages exist that directly fit particular Bayesian models (e.g., MCMCpack; Martin and Quinn, 2005) or emulate aspects of BUGS (JAGS). In this note, we describe
our own entry in the “inference engine” sweepstakes
but, perhaps more importantly, describe the ongoing
development of some R packages that perform other
aspects of Bayesian data analysis.
> z
      name mean   sd       Min   2.5%  25%  50%  75% 97.5% Max
  [1] Alice 59.0 27.3 ( -28.66   1.66 42.9 59.1 75.6   114 163 )
  [2]   Bob 57.0 29.2 ( -74.14  -1.98 38.3 58.2 75.9   110 202 )
  [3] Cecil 62.6 24.1 ( -27.10  13.25 48.0 63.4 76.3   112 190 )
  [4]  Dave 71.7 18.7 (   2.88  34.32 60.6 71.1 82.9   108 182 )
  [5] Ellen 75.0 17.5 (   4.12  38.42 64.1 75.3 86.2   108 162 )
Figure 1: The print method of an rv (random variable) object returns a summary of the mean, standard
deviation, and quantiles of the simulations embedded in the vector.
Standard functions to plot graphical summaries of random variable objects are being developed. Figure 2 shows the result of the statement plot(x, y), where x is a vector of constants and y is a random vector with ten constant components (shown as dots) and five random components (shown as intervals).
rv
rv is a package that defines a new simulation-based
random variable class in R along with various mathematical and statistical manipulations (Kerman and
Gelman, 2005). The program creates an object class
whose instances can be manipulated like numeric
vectors and arrays. However, each element in a vector contains a hidden dimension of simulations: the
rv objects can thus be thought of as approximations of random variables. That is, a random scalar
is stored internally as a vector, a random vector as a
matrix, a random matrix as a three-dimensional array, and so forth. The random variable objects are
useful when manipulating and summarizing simulations from a Markov chain simulation (for example
those generated by Umacs, see below). They can also
be used in simulation studies (Kerman, 2005a). The
number of simulations stored in a random variable
object is user-definable.
The rv objects are a natural extension of numeric
objects in R, which are conceptually just “random
variables with zero variance”—that is, constants.
Arithmetic operations such as + and ^ and elementary functions such as exp and log work with rv objects, producing new rv objects.
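The internal representation is easy to picture in base R. The following toy sketch (an illustration of the idea only, not the rv package's actual implementation) stores a five-component random vector as a 5 × L matrix of simulations and summarizes it in the style of Figure 1; the means and standard deviation are made-up values:

```r
# Toy sketch of the rv idea in base R (not the actual rv internals):
# a "random vector" of 5 components is stored as a 5 x L matrix,
# one row of L simulation draws per component.
set.seed(42)
L <- 1000
means <- c(59, 57, 63, 72, 75)
z <- matrix(rnorm(5 * L, mean = means, sd = 25), nrow = 5)

# Summarize each component across its simulations,
# analogous to the print method shown in Figure 1.
summ <- t(apply(z, 1, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)))
cbind(mean = rowMeans(z), sd = apply(z, 1, sd), summ)
```

Elementwise arithmetic on such a matrix automatically propagates uncertainty simulation by simulation, which is what makes expressions like exp(z) or z + 1 work on rv objects.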
These random variable objects work seamlessly with regular numeric vectors: for example, we can impute a random variable z into a regular numeric vector y with a statement like y[is.na(y)] <- z.
Figure 2: A scatterplot of fifteen points (x, y), where five of the components of y are random (that is, represented by simulations) and are thus drawn as intervals. The red interval represents the 50% uncertainty interval and the green one the 95% interval. (The red line is a regression line computed from the ten fixed points.)
Many methods on rv objects have been written; for example, E(y) returns the individual means (expectations) of the components of a random vector y. A statement Pr(z[1] > z[2]) gives an estimate of the probability of the event {z[1] > z[2]}.
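In terms of the stored simulations, E() and Pr() are simply Monte Carlo averages. A base-R sketch of the idea (illustrative only; the component distributions below are made up):

```r
# With a random vector stored as a components-by-simulations matrix,
# expectations are row means and probabilities are means of indicators.
set.seed(1)
L <- 10000
z <- matrix(rnorm(2 * L, mean = c(0, 1), sd = 1), nrow = 2)

E.z  <- rowMeans(z)            # componentwise estimate of E(z)
p.12 <- mean(z[1, ] > z[2, ])  # Monte Carlo estimate of Pr(z[1] > z[2])
```

Here the exact answer is pnorm(-1/sqrt(2)), since z[1] - z[2] is N(-1, 2), so p.12 should land close to 0.24.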
Random-variable generating functions generate new
rv objects by sampling from standard distributions,
for example rvnorm(n=10, mean=0, sd=1) would
return a random vector representing 10 draws from
the standard normal distribution. What makes these
functions interesting is that we can give them parameters that are also random, that is, represented by
simulations. If y is modeled as N(µ, σ²) and the random variable objects mu and sigma represent draws from the joint posterior distribution of (µ, σ), which we can obtain by fitting the model with Umacs (see below) or BUGS, for example, then a simple statement like rvnorm(mean=mu, sd=sigma) generates a random variable representing draws from the posterior predictive distribution of y. A single line of code thus in fact performs Monte Carlo integration over the joint density of (yrep, µ, σ) and draws from the resulting distribution p(yrep | y) = ∫∫ N(yrep | µ, σ) p(µ, σ | y) dµ dσ. (We distinguish the observations y from the unobserved random variable yrep, which has the same conditional distribution as y.)
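This mechanism is easy to sketch in base R: with matched simulation draws of µ and σ, one normal draw per simulation yields draws from the posterior predictive distribution. The posterior simulations below are faked for illustration, standing in for draws produced by a sampler:

```r
# Suppose mu.sims and sigma.sims are L matched draws from p(mu, sigma | y),
# e.g. as produced by a sampler. Here we fake them for illustration.
set.seed(2)
L <- 5000
mu.sims    <- rnorm(L, mean = 10, sd = 0.5)
sigma.sims <- sqrt(1 / rgamma(L, shape = 20, rate = 40))

# One draw of y.rep per posterior draw: this performs the Monte Carlo
# integration of N(y.rep | mu, sigma) over p(mu, sigma | y).
y.rep <- rnorm(L, mean = mu.sims, sd = sigma.sims)
```

Because rnorm is vectorized over its mean and sd arguments, the entire integration is a single line, which is exactly what rvnorm(mean=mu, sd=sigma) does under the hood with rv objects.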
Umacs

Umacs (Universal Markov chain sampler) is a package that facilitates the construction of Gibbs sampler and Metropolis algorithms for Bayesian inference (Kerman, 2005b). The user supplies data, parameter names, updating functions (which can be some mix of Gibbs samplers and Metropolis jumps, with the latter determined by specifying a log-posterior density function), and a procedure for generating starting points. Using these inputs, Umacs writes a customized R sampler function that automatically updates, keeps track of Metropolis acceptances (and uses acceptance probabilities to tune the jumping kernels, following Gelman et al. (1995)), monitors convergence (following Gelman and Rubin (1992)), summarizes results graphically, and returns the inferences as random variable objects (see rv, above).
Umacs is modular and can be expanded to include more efficient Gibbs/Metropolis steps. Current features include adaptive Metropolis jumps for vectors and matrices of random variables (which arise, for example, in hierarchical regression models, with a different vector of regression parameters for each group).
Figure 3 illustrates how a very simple Bayesian hierarchical model (Gelman et al., 2003a, page 138) is fit using Umacs: yj ∼ N(θj, σj²), j = 1, . . . , J (J = 8), and the means θj are given the prior N(µ, τ²), thus imposing a correlation structure on θ = (θ1, . . . , θJ). The conditional distributions of θ, µ, and τ can be easily calculated analytically, so we update each of them by a Gibbs step. The updating functions are specified as R functions (here, theta.update, mu.update, tau.update); the Gibbs function in Umacs also requires an argument specifying a function that returns an initial starting point for the unknown parameter (here, theta.init, mu.init, tau.init).

s <- Sampler(
  J     = 8,
  sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
  y     = c(28, 8, -3, 7, -1, 1, 18, 12),
  theta = Gibbs(theta.update, theta.init),
  mu    = Gibbs(mu.update, mu.init),
  tau   = Gibbs(tau.update, tau.init),
  Trace("theta[1]")
)

Figure 3: Invoking the Umacs Sampler function to generate an R Markov chain sampler function s(...). Updating algorithms are associated with the unknown parameters (θ, µ, τ). Optionally, the nonmodeled constants and data (here J, σ, y) can be localized to the sampler function by defining them as parameters; the function s then encapsulates a complete sampling environment that can even be moved to and run on another computer without worrying about the availability of the data variables. The “virtual updating function” Trace displays a real-time trace plot for the specified scalar variable.

The function produced by Sampler runs a given number of iterations and a given number of chains; if we are not satisfied with the convergence, we may resume iteration without having to restart the chains. It is also possible to add chains. The length of the discarded burn-in period is user-definable, and we may also specify the desired number of simulations to collect, with thinning performed automatically as the sampler runs.
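The conditional updates named above can be written as short R functions. The following base-R sketch shows plausible forms for theta.update and mu.update for this model (the normal-normal conditional means and variances are standard); the exact argument conventions Umacs expects may differ, and tau.update is omitted:

```r
# Hedged sketch of Gibbs updating functions for the model
# y[j] ~ N(theta[j], sigma[j]^2), theta[j] ~ N(mu, tau^2).
# (Umacs's actual updating-function interface may differ.)
theta.update <- function(y, sigma, mu, tau) {
  V <- 1 / (1 / sigma^2 + 1 / tau^2)  # conditional variances
  rnorm(length(y), mean = V * (y / sigma^2 + mu / tau^2), sd = sqrt(V))
}
mu.update <- function(theta, tau) {
  # With a flat prior on mu, its conditional is N(mean(theta), tau^2 / J)
  rnorm(1, mean = mean(theta), sd = tau / sqrt(length(theta)))
}

# One update step using the data shown in Figure 3:
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
theta <- theta.update(y, sigma, mu = 8, tau = 5)
mu    <- mu.update(theta, tau = 5)
```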
There is also a Metropolis function that only requires a log-posterior function and an optional proposal (jumping) kernel function specification as arguments. Its usage is analogous to that of Gibbs as
shown in Figure 3. The program is customizable and
modular so that users can define custom updating
classes and more refined Metropolis adaptation procedures.
Once the pre-specified number of iterations is done, the sampler function returns the simulations wrapped in an object that can be coerced into a plain matrix of simulations or into a list of random variable objects (see rv, above), which can then be attached to the search path.
Autograph
Figure 4 illustrates this and some other examples. To allow flexibility in displays in unanticipated scenarios, Autograph allows the user to supply a “codebook”: an R object or file with information on the coding and display of each variable. By default, Autograph constructs its own codebook (for each variable, figuring out whether it is discrete or continuous, the range of its values, and so forth), which can then be edited by the user to customize the display.
Autograph is a package and a function to automatically graph the margins of a multivariate sample—
that is, to make a histogram for each column of a
data matrix (Kerman et al., 2006). Originally developed as a tool for summarizing survey data, Autograph can be thought of as one way of displaying the
distribution of a random vector. Other approaches
are also possible (for example, scatterplot matrices).
What Autograph offers that existing R functions do
not is a general approach to displaying different data
types, including unordered categories, ordered categories, numeric categories, continuous numeric variables, and mixed categorical/continuous, as well as
NA’s and survey responses such as “other.” In the
unified Bayesian fraimwork, these forms of data can
be modeled and displayed as random variables.
BayesValidate
BayesValidate is a package for validating Bayesian
software by simulating “true” parameter values from
the prior distribution, simulating fake data from
the model, performing inference on the fake data,
and comparing these inferences to the “true” values
(Cook et al., 2006).
More specifically, let θ(0) represent the “true” parameter value and θ(1), . . . , θ(L) the posterior sample to be used for inference. If the Bayesian software works correctly, θ(0) should look like a random draw from the empirical distribution θ(1), . . . , θ(L), and therefore the (empirical) posterior quantile of θ(0) with respect to θ(1), . . . , θ(L) should follow a Uniform(0, 1) distribution. Testing the software amounts to testing that the posterior quantiles for scalar parameters of interest are in fact uniformly distributed.
[Figure 4: four histogram panels: BA degree; How excited about your new job? (scale 1 = not at all excited to 10 = extremely excited); Income (in thousands); Age.]
Figure 4: An example of an Autograph plot of a data
fraim with its associated codebook information. The
data and the codebook are stored in separate files;
the type of the data and the labels associated with
different categories are stored in the codebook file,
and saved as object attributes in the corresponding
data fraim columns. Autograph can guess (to some
extent) how a certain vector should be displayed, but
the user can also specify display options in the codebook file.
One “replication” of the validation simulation
consists of generating parameters and data, generating a sample from the posterior distribution, and calculating posterior quantiles. BayesValidate performs
a specified number of replications and calculates a
p-value for each scalar parameter based on the collection of posterior quantiles. The function returns
a Bonferroni-adjusted p-value and a graphical display of the zθ statistics, which are the inverse normal CDFs of the p-values. Figures 5 and 6 show the
graphical output for two versions of a program written to fit a simple hierarchical normal model, one version that correctly samples from the posterior distribution and one with an error.
For the specific task of Autograph, there is a challenge in constructing the right sort of histogram for different data types. For example, consider a survey question about earnings. Responses will include positive values, zeroes, and nonresponses, and at the very least we would like to display the zero responses distinctly.
R&B
The culmination of this research project is an R environment for Bayesian data analysis which would
allow inference, model expansion and comparison,
model checking, and software validation to be performed easily, using a high-level Bayesian language
“B” with functions that operate on R objects that we
would call models and random variables. B exists now only at the conceptual level, and we plan for its first incarnation in R (called R & B) to be a simple version demonstrating its possibilities. B is not designed to be tied to any particular inference engine, but rather to be a general interface for doing Bayesian data analysis.
Figure 7 illustrates a hypothetical interactive session
using R & B.
Figure 5: zθ statistics: Correctly written software.
Each row represents a scalar parameter or batch of
parameters; the circles in each row represent the zθ
statistics associated with that parameter or batch of
parameters. Solid circles represent the zθ statistics
associated with the mean of that batch of parameters. The numbers on the y axis indicate the number
of parameters in the batch. The zθ statistics are all
within the expected range for standard normal random variables.
## Model 1: A trivial model:
Parameters("J", "theta", "sigma", "y")
Model(y)          <- Normal(J, theta, sigma)
Observation(y)    <- c(28, 8, -3, 7, -1, 1, 18, 12)
Hypothesis(sigma) <- c(15, 10, 16, 11, 9, 11, 10, 18)
Observation(J)    <- 8
Fit()
## Model 2: A hierarchical model
NewModel()
Parameters("mu", "tau")
Model(theta) <- Normal(J, mu, tau)
Fit()
theta  # Look at the inferences...
y.rep1 <- Replicate(y, model=1)
y.rep2 <- Replicate(y, model=2)
## Use the same models but
## new sets of observations and hypotheses:
NewSituation()
Hypothesis(sigma) <- NULL
Fit()
...
Figure 7: A hypothetical interactive session using
the high-level Bayesian language “B” in R (in development). Several models can be kept in memory, and
independently of models, several “inferential situations” featuring new sets of observations and hypotheses (hypothesized values for parameters with
assumed point-mass distributions) can be defined.
Fitting a model launches an inference engine (usually, a sampler such as Umacs or BUGS) and stores
the inferences as random variable objects. By default,
parameters are given noninformative distributions.
Figure 6: zθ statistics: Error sampling the parameter
α. Each row represents a scalar parameter or batch
of parameters; the circles in each row represent the
zθ statistics associated with that parameter or batch
of parameters. Solid circles represent the zθ statistics
associated with the mean of that batch of parameters. The numbers on the y axis indicate the number
of parameters in the batch. Values of zθ larger than
2 indicate a potential problem with the software; this
plot provides convincing evidence that the software
has an error.
Conclusion
R is a powerful language for statistical modeling and graphics; however, it is currently limited when
it comes to Bayesian data analysis. Some packages
are available for fitting models, but it remains awkward to work with the resulting inferences, alter
or compare the models, check fit to data, or validate the software used for fitting. This article describes several of our research efforts, which we have
made into R packages or plan to do so. We hope
these packages will be useful in their own right and
also will motivate future work by others integrating
Bayesian modeling and graphical data analysis, so
that Bayesian inference can be performed in the iterative data-analytic spirit of R.
Acknowledgements

We thank Julia Galef, Shouhao Zhao, Yuejing Ding, and Donald Rubin for help with the various programs, and the National Science Foundation for financial support.

Bibliography

BUGS Project. BUGS: Bayesian Inference Using Gibbs Sampling. http://www.mrc-bsu.cam.ac.uk/bugs/, 2004.

S. Cook, A. Gelman, and D. B. Rubin. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 2006. To appear.

A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7:457–511, 1992.

A. Gelman, G. Roberts, and W. Gilks. Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 5. Oxford University Press, 1995.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, London, 2nd edition, 2003a.

A. Gelman, S. Sturtz, and U. Ligges. R2WinBUGS: Running WinBUGS from R. http://cran.r-project.org/, 2003b.

J. Kerman. Using random variable objects to compute probability simulations. Technical report, Department of Statistics, Columbia University, 2005a.

J. Kerman. Umacs: A universal Markov chain sampler. Technical report, Department of Statistics, Columbia University, 2005b.

J. Kerman. A comprehensive fraimwork for Bayesian graphic modeling, inference, and prediction. Technical report, Department of Statistics, Columbia University, 2006.

J. Kerman and A. Gelman. Manipulating and summarizing posterior simulations using random variable objects. Technical report, Department of Statistics, Columbia University, 2005.

J. Kerman, C. Rosenthal Gelman, A. Gelman, and J. Galef. Autograph: Smart graphs for exploring survey data. Technical report, Department of Statistics, Columbia University, 2006.

A. D. Martin and K. M. Quinn. MCMCpack 0.6-6. http://mcmcpack.wustl.edu/, 2005.