Bayesian Inference
Statisticat, LLC
Abstract
The Bayesian interpretation of probability is one of two broad categories of interpre-
tations. Bayesian inference updates knowledge about unknowns, parameters, with infor-
mation from data. The LaplacesDemon package is a complete environment for Bayesian
inference within R, and this vignette provides an introduction to the topic. This arti-
cle introduces Bayes’ theorem, model-based Bayesian inference, components of Bayesian
inference, prior distributions, hierarchical Bayes, conjugacy, likelihood, numerical approx-
imation, prediction, Bayes factors, model fit, posterior predictive checks, and ends by
comparing advantages and disadvantages of Bayesian inference.
This article is an introduction to Bayesian inference for users of the LaplacesDemon package
(Statisticat LLC. 2015) in R (R Core Team 2014), often referred to as LD. LaplacesDemonCpp
is an extension package that uses C++. A formal introduction to LaplacesDemon is provided
in an accompanying vignette entitled “LaplacesDemon Tutorial”. Merriam-Webster defines
‘Bayesian’ as follows
Bayesian : being, relating to, or involving statistical methods that assign proba-
bilities or distributions to events (as rain tomorrow) or parameters (as a population
mean) based on experience or best guesses before experimentation and data col-
lection and that apply Bayes’ theorem to revise the probabilities and distributions
after obtaining experimental data.
In statistical inference, there are two broad categories of interpretations of probability: Bayesian
inference and frequentist inference. These views often differ from each other on the fundamen-
tal nature of probability. Frequentist inference loosely defines probability as the limit of an
event’s relative frequency in a large number of trials, and only in the context of experiments
that are random and well-defined. Bayesian inference, on the other hand, is able to assign
probabilities to any statement, even when a random process is not involved. In Bayesian
inference, probability is a way to represent an individual’s degree of belief in a statement, given the evidence.
Within Bayesian inference, there are also different interpretations of probability, and different
approaches based on those interpretations. The most popular interpretations and approaches
are objective Bayesian inference (Berger 2006) and subjective Bayesian inference (Anscombe
and Aumann 1963; Goldstein 2006). Objective Bayesian inference is often associated with
Bayes and Price (1763), Laplace (1814), and Jeffreys (1961). Subjective Bayesian inference is
often associated with Ramsey (1926), De Finetti (1931), and Savage (1954). The first major
event to bring about the rebirth of Bayesian inference was De Finetti (1937). Differences in
the interpretation of probability are best explored outside of this article [1].
This article is intended as an approachable introduction to Bayesian inference, or as a handy
summary for experienced Bayesians. It is assumed that the reader has at least an elemen-
tary understanding of statistics, and this article focuses on applied, rather than theoretical,
material. Equations and statistical notation are included, but the material is presented so that the reader does not need, for example, an intricate understanding of how to solve integrals, though the basic concept of integration should be understood. Please be aware that it is difficult to summarize Bayesian inference in such a short article; for a more thorough and formal introduction, consider Robert (2007).
1. Bayes’ Theorem
Bayes’ theorem shows the relation between two conditional probabilities that are the reverse
of each other. This theorem is named after Reverend Thomas Bayes (1701-1761), and is also
referred to as Bayes’ law or Bayes’ rule (Bayes and Price 1763) [2]. Bayes’ theorem expresses
the conditional probability, or ‘posterior probability’, of an event A after B is observed in
terms of the ‘prior probability’ of A, prior probability of B, and the conditional probability
of B given A. Bayes’ theorem is valid in all common interpretations of probability. The two
(related) examples below should be sufficient to introduce Bayes’ theorem.
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)   (1)
For example, suppose one asks the question: what is the probability of going to Hell, conditional on consorting (or given that a person consorts) with Laplace’s Demon [3]? By replacing A with Hell and B with Consort, the question becomes
Pr(Hell|Consort) = Pr(Consort|Hell) Pr(Hell) / Pr(Consort)
[1] If these terms are new to the reader, then please do not focus too much on the words ‘objective’ and ‘subjective’, since there is a lot of debate over them. For what it’s worth, Statisticat, LLC, the provider of this R package entitled LaplacesDemon, favors the ‘subjective’ interpretation.
[2] Stigler (1983) suggests the earliest discoverer of Bayes’ theorem was Nicholas Saunderson (1682-1739), a blind mathematician/optician, who at age 29 became the Lucasian Professor of Mathematics at Cambridge. This position was previously held by Isaac Newton.
[3] This example is, of course, intended with humor.
Note that a common fallacy is to assume that Pr(A|B) = Pr(B|A), which is called the
conditional probability fallacy.
For multiple events A1, ..., An, Bayes’ theorem may be extended as
Pr(Ai|B) = Pr(B|Ai) Pr(Ai) / [Pr(B|A1) Pr(A1) + ... + Pr(B|An) Pr(An)]
Let’s examine our burning question, by replacing Ai with Hell or Heaven, and replacing B
with Consort
• Pr(A1 ) = Pr(Hell)
• Pr(A2 ) = Pr(Heaven)
• Pr(B) = Pr(Consort)
• Pr(B|A1 ) = Pr(Consort|Hell)
• Pr(B|A2 ) = Pr(Consort|Heaven)
Laplace’s Demon was conjured and asked for some data. He was glad to oblige.
Now, Bayes’ theorem is applied to the data. Four pieces are worked out as follows
• Pr(Consort|Hell) = 0.666
• Pr(Consort|Heaven) = 0.714
• Pr(Hell) = 0.75
• Pr(Heaven) = 0.25
Finally, the desired conditional probability Pr(Hell|Consort) is calculated using Bayes’ theo-
rem
• Pr(Hell|Consort) = 0.666(0.75) / [0.666(0.75) + 0.714(0.25)]
• Pr(Hell|Consort) = 0.737
The probability of going to Hell, given that a person consorts with Laplace’s Demon, is 73.7%, which is less than the prevalence of 75% in the population. According to these findings,
consorting with Laplace’s Demon does not increase the probability of going to Hell. With
that in mind, please continue. . .
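To make the arithmetic concrete, the calculation may be verified in R. This is only a sketch of the worked example above; the probabilities are the hypothetical values supplied by Laplace’s Demon, not estimates from real data.

pr_hell <- 0.75                          # Pr(Hell)
pr_heaven <- 0.25                        # Pr(Heaven)
pr_consort_hell <- 0.666                 # Pr(Consort|Hell)
pr_consort_heaven <- 0.714               # Pr(Consort|Heaven)
pr_consort <- pr_consort_hell * pr_hell + pr_consort_heaven * pr_heaven
round(pr_consort_hell * pr_hell / pr_consort, 3)   # Pr(Hell|Consort) = 0.737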
2. Model-Based Bayesian Inference
Here is Bayes’ theorem, equation 1, again
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
Replacing B with observations y, A with parameter set Θ, and probabilities Pr with densities
p (or sometimes π or function f ), results in the following
p(Θ|y) = p(y|Θ) p(Θ) / p(y)
where p(y) will be discussed below, p(Θ) is the set of prior distributions of parameter set
Θ before y is observed, p(y|Θ) is the likelihood of y under a model, and p(Θ|y) is the joint
posterior distribution, sometimes called the full posterior distribution, of parameter set Θ
that expresses uncertainty about parameter set Θ after taking both the prior and data into
account. Since there are usually multiple parameters, Θ represents a set of j parameters, and
may be considered hereafter in this article as
Θ = θ1 , ..., θj
The denominator
p(y) = ∫ p(y|Θ) p(Θ) dΘ
defines the “marginal likelihood” of y, or the “prior predictive distribution” of y, and may be
set to an unknown constant c. The prior predictive distribution [4] indicates what y should
look like, given the model, before y has been observed. Only the set of prior probabilities and
the model’s likelihood function are used for the marginal likelihood of y. The presence of the
marginal likelihood of y normalizes the joint posterior distribution, p(Θ|y), ensuring it is a
proper distribution and integrates to one.
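Because the marginal likelihood is an integral of the likelihood over the prior, it may be approximated by averaging the likelihood across draws from the prior. The following R sketch illustrates this for a hypothetical toy model, y ~ N(θ, 1) with prior θ ~ N(0, 1); the data are invented for illustration.

set.seed(1)
y <- c(-0.2, 0.4, 1.1)                   # hypothetical data
theta <- rnorm(1e4, 0, 1)                # draws from the prior p(theta)
lik <- sapply(theta, function(t) prod(dnorm(y, t, 1)))   # p(y|theta)
mean(lik)                                # crude Monte Carlo estimate of p(y)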
By replacing p(y) with c, which is short for a ‘constant of proportionality’, the model-based
formulation of Bayes’ theorem becomes
p(Θ|y) = p(y|Θ) p(Θ) / c
[4] The predictive distribution was introduced by Jeffreys (1961).
By removing c from the equation, the relationship changes from ‘equals’ (=) to ‘proportional to’ (∝) [5], giving the unnormalized joint posterior distribution
p(Θ|y) ∝ p(y|Θ)p(Θ)   (2)
The parameter set may be partitioned as
Θ = {Φ, Λ}
where Φ is the sub-vector of interest, and Λ is the complementary sub-vector of Θ, often
referred to as a vector of nuisance parameters. In a Bayesian framework, the presence of
nuisance parameters does not pose any formal, theoretical problems. A nuisance parameter
is a parameter that exists in the joint posterior distribution of a model, though it is not a
parameter of interest. The marginal posterior distribution of Φ, the parameter of interest, can simply be written as
p(Φ|y) = ∫ p(Φ, Λ|y) dΛ
In model-based Bayesian inference, Bayes’ theorem is used to estimate the unnormalized joint
posterior distribution, and finally the user can assess and make inferences from the marginal
posterior distributions.
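In practice, when the joint posterior is approximated with samples (as with the MCMC methods discussed later), marginalization requires no additional integration: the draws of Φ alone are a sample from p(Φ|y). A minimal R sketch, with a hypothetical matrix of posterior samples:

post <- cbind(phi = rnorm(1000, 2, 0.5), lambda = rnorm(1000, 0, 1))  # hypothetical draws
phi_marginal <- post[, "phi"]            # marginal posterior sample of phi
quantile(phi_marginal, c(0.025, 0.5, 0.975))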
3. Components of Bayesian Inference
The components [6] of Bayesian inference are
1. p(Θ) is the set of prior distributions for parameter set Θ, and uses probability as a
means of quantifying uncertainty about Θ before taking the data into account.
2. p(y|Θ) is the likelihood or likelihood function, in which all variables are related in a full
probability model.
3. p(Θ|y) is the joint posterior distribution that expresses uncertainty about parameter
set Θ after taking both the prior and the data into account. If parameter set Θ is
partitioned into a single parameter of interest φ and the remaining parameters are
considered nuisance parameters, then p(φ|y) is the marginal posterior distribution.
4. Prior Distributions
[5] For those unfamiliar with ∝, this symbol simply means that two quantities are proportional if they vary in such a way that one is a constant multiplier of the other. This is due to the constant of proportionality c in the equation. Here, this can be treated as ‘equal to’.
[6] In Bayesian decision theory, an additional component exists (Robert 2007, p. 53), the loss function, L(Θ, ∆).
In Bayesian inference, a prior probability distribution, often called simply the prior, of an
uncertain parameter θ or latent variable is a probability distribution that expresses uncertainty
about θ before the data are taken into account. The parameters of a prior distribution are
called hyperparameters, to distinguish them from the parameters (Θ) of the model.
When applying Bayes’ theorem, the prior is multiplied by the likelihood function and then
normalized to estimate the posterior probability distribution, which is the conditional distri-
bution of Θ given the data. Moreover, the prior distribution affects the posterior distribution.
Prior probability distributions have traditionally belonged to one of two categories: informa-
tive priors and uninformative priors. Here, four categories of priors are presented according
to information and the goal in the use of the prior. The four categories are informative,
weakly informative, least informative, and uninformative.
Weakly Informative Priors
One goal of a weakly informative prior (WIP) is regularization and stabilization, avoiding results that contradict knowledge or problems such as an algorithmic failure to explore the state-space. Another goal is for
WIPs to use less prior information than is actually available. A WIP should provide some of
the benefit of prior information while avoiding some of the risk from using information that
doesn’t exist. WIPs are the most common priors in practice, and are favored by subjective
Bayesians.
Selecting a WIP can be tricky. WIP distributions should change with the sample size, because a model should have enough prior information to be estimated sensibly, but the prior information must also be weak enough to let the model learn from the data.
Following is an example of a WIP in practice. It is popular, for good reasons, to center and
scale all continuous predictors (Gelman 2008). Although centering and scaling predictors is
not discussed here, it should be obvious that the potential range of the posterior distribution
of θ for a centered and scaled predictor should be small. A popular WIP for a centered and
scaled predictor may be
θ ∼ N (0, 10000)
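As a sketch of what this prior implies in code, the log-density of θ under the WIP above may be evaluated in R; note that N(0, 10000) has variance 10000, so the standard deviation passed to dnorm() is 100. The value of θ is hypothetical.

theta <- 1.5                             # hypothetical parameter value
dnorm(theta, mean = 0, sd = sqrt(10000), log = TRUE)   # log p(theta)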
Vague Priors
A vague prior, also called a diffuse prior [10], is difficult to define, after considering WIPs.
The first formal move from vague to weakly informative priors is Lambert, Sutton, Burton,
Abrams, and Jones (2005). After conjugate priors were introduced (Raiffa and Schlaifer
1961), most applied Bayesian modeling has used vague priors, parameterized to approximate
the concept of uninformative priors (better considered as least informative priors, see section
4.3). For more information on conjugate priors, see section 6.
Typically, a vague prior is a conjugate prior with a large scale parameter. However, vague
priors can pose problems when the sample size is small. Most problems with vague priors and
small sample size are associated with scale, rather than location, parameters. The problem
[10] Some sources refer to diffuse priors as flat priors.
can be particularly acute in random-effects models, and the term random-effects is used rather loosely here to imply exchangeable [11], hierarchical, and multilevel structures. A vague prior
is defined here as usually being a conjugate prior that is intended to approximate an uninfor-
mative prior (or actually, a least informative prior), and without the goals of regularization
and stabilization.
Flat Priors
The flat prior was historically the first attempt at an uninformative prior. The unbounded,
uniform distribution, often called a flat prior, is
θ ∼ U(−∞, ∞)
Hierarchical Prior
A hierarchical prior is a prior in which the parameters of the prior distribution are estimated
from data via hyperpriors, rather than with subjective information (Gelman 2008). Parame-
ters of hyperprior distributions are called hyperparameters. Subjective Bayesians prefer the
hierarchical prior as the least informative prior (LIP), and the hyperparameters are usually specified as WIPs. Hier-
archical priors are presented later in more detail in the section entitled ‘Hierarchical Bayes’.
[11] For more information on exchangeability, see https://web.archive.org/web/20150418134644/http://www.bayesian-inference.com/exchangeability.
Jeffreys Prior
Jeffreys prior, also called Jeffreys rule, was introduced in an attempt to establish a least
informative prior that is invariant to transformations (Jeffreys 1961). Jeffreys prior works
well for a single parameter, but in multi-parameter situations inappropriate aspects may accumulate across dimensions to detrimental effect.
MAXENT
A MAXENT prior, proposed by Jaynes (1968), is a prior probability distribution that is
selected among other candidate distributions as the prior of choice when it has the maximum
entropy (MAXENT) in the considered set, given constraints on the candidate set. More
entropy is associated with less information, and the least informative prior is preferred as a
MAXENT prior. The principle of minimum cross-entropy generalizes MAXENT priors from
mere selection to updating the prior given constraints while seeking the maximum possible entropy.
Reference Priors
Introduced by Bernardo (1979), reference priors do not express personal beliefs. Instead,
reference priors allow the data to dominate the prior and posterior (Berger, Bernardo, and
Dongchu 2009). Reference priors are estimated by maximizing the expected intrinsic discrep-
ancy between the posterior distribution and prior distribution. This maximizes the expected
posterior information about θ when the prior density is p(θ). In some sense, p(θ) is the ‘least informative’ prior about θ (Bernardo 2005). Reference priors are often the objective prior
of choice in multivariate problems, since other rules (e.g., Jeffreys rule) may result in priors
with problematic behavior. When reference priors are used, the analysis is called reference
analysis, and the posterior is called the reference posterior.
Subjective Bayesian criticisms of reference priors are that the concepts of regularization and
stabilization are not taken into account, results that contradict knowledge are not prevented,
a numerical approximation algorithm may become stuck in low-probability or flat regions,
and it may not be desirable to let the data speak fully.
A prior distribution is improper [12] when it does not integrate to one, that is, when
∫ p(θ) dθ = ∞
Although improper prior distributions can be used, it is good practice to avoid them.
[12] Improper priors were introduced in Jeffreys (1961).
5. Hierarchical Bayes
Prior distributions may be estimated within the model via hyperprior distributions, which are
usually vague and nearly flat. Parameters of hyperprior distributions are called hyperparam-
eters. Using hyperprior distributions to estimate prior distributions is known as hierarchical
Bayes. In theory, this process could continue further, using hyper-hyperprior distributions
to estimate the hyperprior distributions. Estimating priors through hyperpriors, and from
the data, is a method to elicit the optimal prior distributions. One of many natural uses for
hierarchical Bayes is multilevel modeling.
Recall that the unnormalized joint posterior distribution (equation 2) is proportional to the
likelihood times the prior distribution
p(Θ|y) ∝ p(y|Θ)p(Θ)
When a hierarchical model is specified, the joint posterior becomes
p(Θ, Φ|y) ∝ p(y|Θ)p(Θ|Φ)p(Φ)
where Φ is the set of hyperparameters with hyperprior distributions p(Φ). By reading the equation from right to left, it begins with hyperpriors Φ, which are used conditionally to estimate priors p(Θ|Φ), which in turn are used, as per usual, to estimate the likelihood p(y|Θ), and finally the posterior is p(Θ, Φ|y).
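Read as code, the hierarchical factorization is simply a sum of log-densities. The following R sketch assumes a hypothetical model with group means θ_j given a common hyperparameter µ; the priors, hyperprior, and data are invented for illustration.

log_posterior <- function(theta, mu, y, group) {
  sum(dnorm(y, theta[group], 1, log = TRUE)) +   # log p(y|Theta)
    sum(dnorm(theta, mu, 1, log = TRUE)) +       # log p(Theta|Phi)
    dnorm(mu, 0, 10, log = TRUE)                 # log p(Phi)
}
y <- c(1.2, 0.8, 2.1, 1.9); group <- c(1, 1, 2, 2)
log_posterior(theta = c(1, 2), mu = 0, y = y, group = group)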
6. Conjugacy
When the posterior distribution p(Θ|y) is in the same family as the prior probability distri-
bution p(Θ), the prior and posterior are then called conjugate distributions, and the prior is
called a conjugate prior for the likelihood [13]. For example, the Gaussian family is conjugate
to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood
[13] The conjugate prior approach was introduced in Raiffa and Schlaifer (1961).
function is Gaussian, then choosing a Gaussian prior for the mean will ensure that the poste-
rior distribution is also Gaussian. All probability distributions in the exponential family have
conjugate priors. See Robert (2007) for a catalog.
Although the gamma distribution is the conjugate prior distribution for the precision of a
normal distribution (Spiegelhalter, Thomas, Best, and Lunn 2003),
τ ∼ G(0.001, 0.001),
better properties for scale parameters are yielded with the non-conjugate, proper, half-Cauchy distribution, with a general recommendation of scale=25 for a weakly informative scale parameter (Gelman 2006),
σ ∼ HC(25)
τ = σ^{-2}
Another option for a weakly informative scale prior is the uniform distribution, such as
σ ∼ U(0, 100)
τ = σ^{-2}
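As an illustration of conjugacy, the gamma-precision update has a closed form: with y_i ~ N(µ, τ^{-1}), known µ, and prior τ ~ G(a, b) in shape-rate form, the posterior is τ|y ~ G(a + n/2, b + Σ(y_i − µ)²/2). A sketch in R with hypothetical data:

a <- 0.001; b <- 0.001; mu <- 0          # prior and known mean
y <- c(-0.5, 0.3, 1.2, 0.7)              # hypothetical data
n <- length(y)
a_post <- a + n / 2                      # posterior shape
b_post <- b + sum((y - mu)^2) / 2        # posterior rate
a_post / b_post                          # posterior mean of the precision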
7. Likelihood
In order to complete the definition of a Bayesian model, both the prior distributions and the
likelihood must be approximated or fully specified. The likelihood, likelihood function, or
p(y|Θ), contains the available information provided by the sample. The likelihood is
p(y|Θ) = ∏_{i=1}^{n} p(y_i|Θ)
The data y affect the posterior distribution p(Θ|y) only through the likelihood function
p(y|Θ). In this way, Bayesian inference obeys the likelihood principle, which states that for
a given sample of data, any two probability models p(y|Θ) that have the same likelihood
function yield the same inference for Θ. For more information on the likelihood principle, see
section 7.2.
Laplace developed Bayesian inference more thoroughly than Bayes, so some “Bayesians” refer to Bayesian inference as Laplacian inference.
The term “inverse probability” appears in an 1837 paper of Augustus De Morgan (De Morgan
1837), in reference to Laplace’s method of probability (Laplace 1774, 1812), though the term
“inverse probability” does not occur in these works. Bayes’ theorem has been referred to as
“the principle of inverse probability”.
Terminology has changed, so that today, Bayesian probability (rather than inverse probability)
refers to assigning a probability distribution to an unobservable variable. The “distribution”
of an unobserved variable given data is the likelihood function (which is not a distribution),
and the distribution of an unobserved variable, given both data and a prior distribution, is
the posterior distribution. The term “Bayesian”, which displaced “inverse probability”, was in
fact introduced by Ronald A. Fisher as a derogatory term.
In modern terms, given a probability distribution p(y|θ) for an observable quantity y con-
ditional on an unobserved variable θ, the “inverse probability” is the posterior distribution
p(θ|y), which depends both on the likelihood function (the inversion of the probability distri-
bution) and a prior distribution. The distribution p(y|θ) itself is called the direct probability.
However, p(y|θ) is also called the likelihood function, which can be confusing, seeming to pit
the definitions of probability and likelihood against each other. A quick introduction to the
likelihood principle follows, and finally all of the information on likelihood comes together in
the section entitled “Likelihood Function of a Parameterized Model”.
Since Bayesian inference accepts the necessity to use prior probabilities, there is no longer a need to separate the likelihood function
from the prior. Thus, the likelihood principle is accepted ‘conditional’ on the assumption that
a likelihood function has been specified, but it is denied that specifying a likelihood function
is necessary. Nonetheless, the likelihood principle is seen as a useful Bayesian weapon to
combat frequentism.
Following are some interesting quotes from prominent statisticians:
“Using Bayes’ rule with a chosen probability model means that the data y affect
posterior inference ’only’ through the function p(y|θ), which, when regarded as a
function of θ, for fixed y, is called the ‘likelihood function’. In this way Bayesian
inference obeys what is sometimes called the ‘likelihood principle’, which states
that for a given sample of data, any two probability models p(y|θ) that have the
same likelihood function yield the same inference for θ” (Gelman et al. 2004, p. 9).
“The likelihood principle is reasonable, but only within the framework of the model
or family of models adopted for a particular analysis” (Gelman et al. 2004, p. 9).
yi ∼ N (µi , σ 2 )
µi = β1 + β2 Xi,1 + β3 Xi,2
f(y) = (1/√(2πσ²)) exp[−(1/(2σ²))(y_i − µ_i)²]; y ∈ (−∞, ∞)
By considering a conditional distribution, the record-level likelihood [16] in Bayesian notation is
p(y_i|Θ) = (1/√(2πσ²)) exp[−(1/(2σ²))(y_i − µ_i)²]; y ∈ (−∞, ∞)
[16] Note that L(θ|y) is not the same as the probability that those parameters are the right ones, given the observed sample.
In both theory and practice, and in both frequentist and Bayesian inference, the log-likelihood
is used instead of the likelihood, on both the record- and model-level. The model-level product
of record-level likelihoods can exceed the range of a number that can be stored by a computer, a problem that worsens with sample size. By estimating a record-level log-likelihood, rather
than likelihood, the model-level log-likelihood is the sum of the record-level log-likelihoods,
rather than a product of the record-level likelihoods.
log[p(y|θ)] = Σ_{i=1}^{n} log[p(y_i|θ)]
rather than
p(y|θ) = ∏_{i=1}^{n} p(y_i|θ)
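The numerical motivation is easy to demonstrate in R with a hypothetical sample: the product of record-level densities underflows to zero for moderate n, while the sum of log-densities remains stable.

set.seed(1)
y <- rnorm(2000)                         # hypothetical sample
prod(dnorm(y, 0, 1))                     # underflows to 0
sum(dnorm(y, 0, 1, log = TRUE))          # finite model-level log-likelihood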
As a function of θ, the unnormalized joint posterior distribution is the product of the likelihood function and the prior distributions. To continue with the example of Bayesian linear regression, here is the unnormalized joint posterior distribution
p(β, σ²|y) ∝ p(y|β, σ²)p(β1)p(β2)p(β3)p(σ²)
More usually, the logarithm of the unnormalized joint posterior distribution is used, which is the sum of the log-likelihood and prior distributions. Here is the logarithm of the unnormalized joint posterior distribution for this example
log[p(β, σ²|y)] ∝ log[p(y|β, σ²)] + log[p(β1)] + log[p(β2)] + log[p(β3)] + log[p(σ²)]
The logarithm of the unnormalized joint posterior distribution is maximized with numerical
approximation.
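A sketch of this log-posterior in R, for the linear regression example, assuming the WIPs discussed earlier, β_k ~ N(0, 10000) and σ ~ U(0, 100); the design matrix, data, and parameter values are hypothetical.

log_post <- function(beta, sigma, y, X) {
  mu <- X %*% beta
  sum(dnorm(y, mu, sigma, log = TRUE)) +            # log-likelihood
    sum(dnorm(beta, 0, sqrt(10000), log = TRUE)) +  # log-priors for beta
    dunif(sigma, 0, 100, log = TRUE)                # log-prior for sigma
}
set.seed(1)
X <- cbind(1, rnorm(10), rnorm(10))      # intercept and two predictors
y <- as.vector(X %*% c(1, 2, -1) + rnorm(10))
log_post(beta = c(1, 2, -1), sigma = 1, y = y, X = X)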
8. Numerical Approximation
The technical problem of evaluating quantities required for Bayesian inference typically re-
duces to the calculation of a ratio of two integrals (Bernardo and Smith 2000, p. 339). In all
cases, the technical key to the implementation of the formal solution given by Bayes’ theorem
is the ability to perform a number of integrations (Bernardo and Smith 2000, p. 340). Except
in certain rather stylized problems, the required integrations will not be feasible analytically
and, thus, efficient approximation strategies are required.
There are too many different types of numerical approximation algorithms in Bayesian infer-
ence to cover in any detail in this article. An incomplete list of broad categories of Bayesian
numerical approximation may include Approximate Bayesian Computation (ABC), Impor-
tance Sampling, Iterative Quadrature, Laplace Approximation, Markov chain Monte Carlo
(MCMC), and Variational Bayes (VB). For more information on algorithms in LaplacesDe-
mon, see the accompanying vignette entitled “LaplacesDemon Tutorial”.
Approximate Bayesian Computation (ABC), also called likelihood-free estimation, is a family
of numerical approximation techniques in Bayesian inference. ABC is especially useful when
evaluation of the likelihood, p(y|Θ), is computationally prohibitive, or when suitable likeli-
hoods are unavailable. As such, ABC algorithms estimate likelihood-free approximations.
ABC is usually faster than a similar likelihood-based numerical approximation technique,
because the likelihood is not evaluated directly, but replaced with an approximation that is
usually easier to calculate. The approximation of a likelihood is usually estimated with a
measure of distance between the observed sample, y, and its replicate given the model, yrep ,
or with summary statistics of the observed and replicated samples.
Importance Sampling is a method of estimating a distribution with samples from a different
distribution, called the importance distribution. Importance weights are assigned to each
sample. The main difficulty with importance sampling is in the selection of the importance
distribution. Importance sampling is the basis of a wide variety of algorithms, some of which
involve the combination of importance sampling and Markov chain Monte Carlo. There are
also many variations of importance sampling, including adaptive importance sampling, and
parametric and nonparametric self-normalized importance sampling. Population Monte Carlo
(PMC) is based on adaptive importance sampling.
Iterative quadrature is a traditional approach to evaluating integrals. Multidimensional
quadrature, often called cubature, performs well, but is limited usually to ten or fewer param-
eters. Componentwise quadrature may be applied to any model regardless of dimension, but
estimates only variance, rather than covariance. Bayesian quadrature typically uses adaptive
Gauss-Hermite quadrature, which assumes the marginal posterior distributions are normally distributed. Under this assumption, the conditional mean and conditional variance of each
distribution is adapted each iteration according to the evaluation of samples determined by
quadrature rules.
Laplace Approximation dates back to Laplace (1774, 1814), and is used to approximate the
posterior moments of integrals. Specifically, the posterior mode is estimated for each param-
eter, assumed to be unimodal and Gaussian. As a Gaussian distribution, the posterior mean
is the same as the posterior mode, and the variance is estimated. Laplace Approximation is
a family of deterministic algorithms that usually converge faster than MCMC, and just a lit-
tle slower than Maximum Likelihood Estimation (MLE) (Azevedo-Filho and Shachter 1994).
Laplace Approximation shares many limitations of MLE, including asymptotic estimation
with respect to sample size.
MCMC algorithms originated in statistical physics and are now used in Bayesian inference to
sample from probability distributions by constructing Markov chains. In Bayesian inference,
the target distribution of each Markov chain is usually a marginal posterior distribution, such
as each parameter θ. Each Markov chain begins with an initial value and the algorithm iter-
ates, attempting to maximize the logarithm of the unnormalized joint posterior distribution
and eventually arriving at each target distribution. Each iteration is considered a state. A
Markov chain is a random process with a finite state-space and the Markov property, meaning
that the next state depends only on the current state, not on the past. The quality of the
marginal samples usually improves with the number of iterations.
A Monte Carlo method is an algorithm that relies on repeated pseudo-random sampling for
computation, and is therefore stochastic (as opposed to deterministic). Monte Carlo methods
are often used for simulation. The union of Markov chains and Monte Carlo methods is
called MCMC. The revival of Bayesian inference since the 1980s is due to MCMC algorithms
and increased computing power. The most prevalent MCMC algorithms may be the simplest:
random-walk Metropolis and Gibbs sampling. There are a large number of MCMC algorithms,
and further details on MCMC are best explored outside of this article.
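For illustration only, here is a bare-bones random-walk Metropolis sampler in R for a generic log-posterior function; real applications would rely on the algorithms provided by LaplacesDemon. The target below is a hypothetical standard normal.

metropolis <- function(log_post, init, n_iter = 5000, step = 0.5) {
  draws <- matrix(NA_real_, n_iter, length(init))
  current <- init
  lp_current <- log_post(current)
  for (t in seq_len(n_iter)) {
    proposal <- current + rnorm(length(current), 0, step)   # random walk
    lp_proposal <- log_post(proposal)
    if (log(runif(1)) < lp_proposal - lp_current) {         # accept/reject
      current <- proposal
      lp_current <- lp_proposal
    }
    draws[t, ] <- current
  }
  draws
}
set.seed(1)
samples <- metropolis(function(th) sum(dnorm(th, log = TRUE)), init = 0)
c(mean(samples), sd(samples))            # approximately 0 and 1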
VB is a family of algorithms within variational inference. VB algorithms are deterministic optimization algorithms that approximate the posterior with a distribution. Each marginal posterior distribution is estimated with an approximating distribution. VB usually converges faster than
MCMC. VB shares many limitations of MLE, including asymptotic estimation with respect
to sample size.
9. Prediction
The “posterior predictive distribution” is either the replication of y given the model (usually
represented as yrep ), or the prediction of a new and unobserved y (usually represented as ynew
or y′ ), given the model. This is the likelihood of the replicated or predicted data, averaged
over the posterior distribution p(Θ|y)
p(y^rep|y) = ∫ p(y^rep|Θ) p(Θ|y) dΘ
or
p(y^new|y) = ∫ p(y^new|Θ) p(Θ|y) dΘ
If y has missing values, then the missing ys can be estimated with the posterior predictive distribution as ynew from within the model. For the linear regression example, the integral
for prediction is
p(y^new|y) = ∫ p(y^new|β, σ²) p(β, σ²|y) dβ dσ²
ynew ∼ N (µ, σ 2 )
where µ = Xβ, and µ is the conditional mean, while σ 2 is the residual variance.
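In R, the posterior predictive distribution for the regression example can be simulated by pushing each posterior draw of (β, σ²) through the sampling model; the posterior draws and the new design row below are hypothetical stand-ins for actual MCMC output.

set.seed(1)
T_draws <- 1000
post_beta <- cbind(rnorm(T_draws, 1, 0.1), rnorm(T_draws, 2, 0.1))  # hypothetical draws
post_sigma <- sqrt(1 / rgamma(T_draws, 5, 5))                       # hypothetical draws
x_new <- c(1, 0.3)                       # new record (with intercept)
mu_new <- as.vector(post_beta %*% x_new) # conditional mean per draw
y_new <- rnorm(T_draws, mu_new, post_sigma)   # draws from p(y_new|y)
quantile(y_new, c(0.025, 0.5, 0.975))    # predictive interval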
10. Bayes Factors
A Bayes factor can be difficult to calculate, because it is necessary that the marginal density of y under each model is proper. However, Bayes factors are easy to approximate with the Laplace-Metropolis Estimator (Kass and Raftery 1995; Lewis and Raftery 1997).
Hypothesis testing with Bayes factors is more robust than frequentist hypothesis testing, since the Bayesian form avoids model selection bias, evaluates evidence in favor of the null hypothesis, includes model uncertainty, and allows non-nested models to be compared (though of course the models must have the same dependent variable). Also, frequentist significance tests become biased in favor of rejecting the null hypothesis with sufficiently large sample size.
The Bayes factor for comparing two models may be approximated as the ratio of the marginal
likelihood of the data in model 1 and model 2. Formally, the Bayes factor in this case is
B = p(y|M1) / p(y|M2) = [∫ p(y|Θ1, M1) p(Θ1|M1) dΘ1] / [∫ p(y|Θ2, M2) p(Θ2|M2) dΘ2]
For example, when B = 2, the data favor M1 over M2 with 2:1 odds.
In a non-hierarchical model, the marginal likelihood may easily be approximated with the Laplace-Metropolis Estimator for model m as
p(y|m) ≈ (2π)^{d/2} |Σ|^{1/2} p(y|Θ̂, m) p(Θ̂|m)
where d is the number of parameters, Θ̂ is the posterior mode, and Σ is the inverse of the negative of the Hessian matrix of second derivatives.
of second derivatives. Lewis and Raftery (1997) introduce the Laplace-Metropolis method of
approximating the marginal likelihood in MCMC, though it naturally works with Laplace
Approximation as well. For a hierarchical model that involves both fixed and random effects,
the Compound Laplace-Metropolis Estimator must be used.
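A sketch of the Laplace approximation to the log marginal likelihood in R, using the posterior mode and Hessian returned by optim(). The toy log-posterior below is an already-normalized bivariate standard normal, so the true log marginal likelihood is 0; this is an illustration, not the LaplacesDemon implementation.

log_post <- function(th) sum(dnorm(th, 0, 1, log = TRUE))   # toy example
fit <- optim(c(0.1, -0.2), function(th) -log_post(th), hessian = TRUE)
d <- length(fit$par)
Sigma <- solve(fit$hessian)              # inverse of the negative Hessian
(d / 2) * log(2 * pi) +
  0.5 * as.numeric(determinant(Sigma, logarithm = TRUE)$modulus) +
  log_post(fit$par)                      # approximately 0 here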
Gelman finds Bayes factors generally to be irrelevant, because they compute the relative
probabilities of the models conditional on one of them being true. Gelman prefers approaches
that measure the distance of the data to each of the approximate models (Gelman et al.
2004, p. 180). However, Kass and Raftery (1995) explain that “the logarithm of the marginal
probability of the data may also be viewed as a predictive score. This is of interest, because
it leads to an interpretation of the Bayes factor that does not depend on viewing one of the
models as ‘true”’.
Three of many possible alternatives are to use
1. pseudo Bayes factors (PsBF) based on a ratio of pseudo marginal likelihoods (PsMLs)
2. the Deviance Information Criterion (DIC)
3. the Widely Applicable Information Criterion (WAIC)
11. Model Fit
DIC is the most popular method of assessing model fit and comparing models, though Bayes
factors are better, when appropriate, because they take more into account. WAIC is a newer
criterion.
The deviance is defined as
D(y, Θ) = −2 log[p(y|Θ)]
Just as with the likelihood, p(y|Θ), or log-likelihood, the deviance exists at both the record-
and model-level. Due to the development of BUGS software (Gilks, Thomas, and Spiegelhalter
1994), deviance is defined differently in Bayesian inference than frequentist inference. In
frequentist inference, deviance is -2 times the log-likelihood ratio of a reduced model compared
to a full model, whereas in Bayesian inference, deviance is simply -2 times the log-likelihood.
In Bayesian inference, the lowest expected deviance has the highest posterior probability
(Gelman et al. 2004, p. 181).
It is possible to have a negative deviance. Deviance is derived from the likelihood, which is
derived from probability density functions (PDF). Evaluated at a certain point in parameter
space, a PDF can have a density larger than 1 due to a small standard deviation or lack of
variation. Likelihoods greater than 1 lead to negative deviance, and are appropriate.
On its own, the deviance is an insufficient model fit statistic, because it does not take model
complexity into account. The effect of model fitting, pD, is used as the ‘effective number of
parameters’ of a Bayesian model. The pD is the difference between the posterior mean of the model-level deviance and the deviance evaluated at the posterior mean of the parameters.
A related way to measure model complexity is as half the posterior variance of the model-level
deviance, known as pV (Gelman et al. 2004, p. 182)
pV = var(D)/2
The effect of model fitting, pD or pV, can be thought of as the number of ‘unconstrained’
parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints
or prior information; 0 if it is fully constrained or if all the information about the parameter
comes from the prior distribution; or an intermediate value if both the data and the prior are
informative (Gelman et al. 2004, p. 182). Therefore, by including prior information, Bayesian
inference is more efficient in terms of the effective number of parameters than frequentist
inference. Hierarchical, mixed effects, or multilevel models are even more efficient regarding
the effective number of parameters.
Model complexity, pD or pV, should be positive. Although pV must be positive since it is
related to variance, it is possible for pD to be negative, which indicates one or more problems:
a non-concave log-likelihood, a conflict between the prior and the data, or a posterior mean that is a poor estimator (such as with a bimodal posterior).
The sum of both the mean model-level deviance and the model complexity (pD or pV) is
the Deviance Information Criterion (DIC), a model fit statistic that is also an estimate of
the expected loss, with deviance as a loss function (Spiegelhalter, Best, and Carlin 1998;
Spiegelhalter, Best, Carlin, and van der Linde 2002). DIC is
DIC = D̄ + pV
DIC may be compared across different models and even different methods, as long as the
dependent variable does not change between models, making DIC the most flexible model fit
statistic. DIC is a hierarchical modeling generalization of the Akaike Information Criterion
(AIC) and Bayesian Information Criterion (BIC). Like AIC and BIC, it is an asymptotic
approximation as the sample size becomes large. DIC is valid only when the joint posterior
distribution is approximately multivariate normal. Models should be preferred with smaller
DIC. Since DIC increases with model complexity (pD or pV), simpler models are preferred.
It is difficult to say what would constitute an important difference in DIC. Very roughly,
differences of more than 10 might rule out the model with the higher DIC, differences between
5 and 10 are substantial, but if the difference in DIC is, say, less than 5, and the models make
very different inferences, then it could be misleading just to report the model with the lowest
DIC.
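Given a vector of model-level deviance values, one per posterior sample, D̄, pV, and DIC are one-liners in R; the deviance draws below are hypothetical stand-ins for MCMC output.

set.seed(1)
D <- rnorm(1000, mean = 200, sd = 3)     # hypothetical deviance draws
Dbar <- mean(D)                          # posterior mean deviance
pV <- var(D) / 2                         # model complexity
c(Dbar = Dbar, pV = pV, DIC = Dbar + pV)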
The Widely Applicable Information Criterion (WAIC) is an information criterion that is
more fully Bayesian than DIC. WAIC is more difficult to calculate because the record-level
log-likelihood is required over numerous samples. However, when available, the result more
closely resembles leave-one-out cross-validation (LOO-CV).
The Bayesian Predictive Information Criterion (BPIC) was introduced as a criterion of model
fit when the goal is to pick a model with the best out-of-sample predictive power (Ando 2007).
BPIC is a variation of DIC in which the effective number of parameters is 2pD (or 2pV). BPIC allows ynew to be compared with yholdout, and has many other extensions, such as with Bayesian Model Averaging (BMA).
12. Posterior Predictive Checks
Gelman recommends at the most basic level to compare yrep to y, looking for any systematic
differences, which could indicate potential failings of the model (Gelman et al. 2004, p. 159).
It is often first recommended to compare graphical plots, such as the distribution of y and
yrep . There are many posterior predictive checks that are not included in this article, but an
introduction to a selection of them appears below.
12.2. Chi-Square
Gelman et al. (2004, p. 175) suggest an omnibus test such as the following χ2
χ²_i = (y_i − T^{-1} Σ_{t=1}^{T} y^rep_{i,t})² / var(y^rep_{i,1:T}),
over records i = 1, . . . , N and posterior samples t = 1, . . . , T . The sum of χ2i over records
i = 1, . . . , N is an overall goodness of fit measure on the data set. Larger values of χ2i indicate
a worse fit for each record.
An alternative χ2 test is
p(χ²rep_{i,1:T} > χ²obs_{i,1:T})
where a worse fit is indicated as p approaches zero or one, and it is common to consider
records with a poor fit to be outside the 95% probability interval. To continue
χ²rep_{i,1:T} = [y^rep_{i,1:T} − E(y^rep_i)]² / E(y^rep_i)
and χ²obs_{i,1:T} is computed analogously with the observed y_i in place of y^rep_{i,1:T}.
[19] Numerous problems with frequentist p-values, confidence intervals, point estimates, and hypothesis testing are worth exploring, but not detailed in this article.
Newer forms of χ2 tests have been proposed in the literature, and are best explored elsewhere.
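The omnibus χ² check is straightforward in R given an N x T matrix of replicates, with rows as records and columns as posterior samples; y and yrep below are hypothetical.

set.seed(1)
N <- 5; T_draws <- 1000
y <- rnorm(N, 10, 2)                     # hypothetical observed data
yrep <- matrix(rnorm(N * T_draws, 10, 2), N, T_draws)   # hypothetical replicates
chi2 <- (y - rowMeans(yrep))^2 / apply(yrep, 1, var)    # per-record fit
sum(chi2)                                # overall goodness of fit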
12.3. Conditional Predictive Ordinate
The Conditional Predictive Ordinate (CPO) of record i is the likelihood of y_i given the model fitted to y_{[i]}, that is, CPO_i = p(y_i|y_{[i]}), where y_i is each instance of an observed y, and y_{[i]} is y without the current observation i. The CPO is easy to calculate with MCMC or PMC numerical approximation. By considering the inverse likelihood across T iterations, the CPO for each individual i is
CPO_i = 1 / (T^{-1} Σ_{t=1}^{T} p(y_i|Θ_t)^{-1})
The CPO is a handy posterior predictive check because it may be used to identify outliers, in-
fluential observations, and for hypothesis testing across different non-nested models. However,
it may be difficult to calculate with latent mixtures.
The CPO expresses the posterior probability of observing the value (or set of values) of yi
when the model is fitted to all data except yi , with a larger value implying a better fit of the
model to yi , and very low CPO values suggest that yi is an outlier and an influential obser-
vation. A Monte Carlo estimate of the CPO is obtained without actually omitting yi from
the estimation, and is provided by the harmonic mean of the likelihood for yi . Specifically,
the CP Oi is the inverse of the posterior mean of the inverse likelihood of yi .
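A sketch of the harmonic-mean CPO estimate in R, assuming a hypothetical N x T matrix 'lik' holding p(y_i|Θ_t) in cell [i, t]:

set.seed(1)
lik <- matrix(runif(5 * 1000, 0.1, 0.9), 5, 1000)   # hypothetical likelihoods
CPO <- 1 / rowMeans(1 / lik)             # harmonic mean over the T samples
ICPO <- 1 / CPO                          # large ICPO values flag outliers
round(cbind(CPO, ICPO), 3)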
The CPO is connected with the frequentist studentized residual test for outlier detection.
Data with large studentized residuals have small CPOs and will be detected as outliers.
An advantage of the CPO is that observations with high leverage will have small CPOs,
independently of whether or not they are outliers. The Bayesian CPO is able to detect both
outliers and influential points, whereas the frequentist studentized residual is unable to detect
high-leverage outliers.
Inverse-CPOs (ICPOs) larger than 40 can be considered as possible outliers, and higher than
70 as extreme values (Ntzoufras 2009, p. 376). Congdon recommends scaling CPOs by divid-
ing each by its individual maximum (after the posterior mean) and considering observations
with scaled CPOs under 0.01 to be outliers (Congdon 2005). The range in scaled CPOs is
useful as an indicator of a good-fitting model.
The sum of the logged CPOs can be an estimator for the logarithm of the marginal likelihood [20], sometimes called the log pseudo marginal likelihood (LPsML). A ratio of PsMLs is
[20] Exercise extreme caution when approximating the marginal likelihood from CPOs, or use a method with better repute, such as the Laplace-Metropolis Estimator or importance sampling.
a surrogate for the Bayes factor, sometimes known as the pseudo Bayes factor (PsBF). In
this way, non-nested models may be compared with a hypothesis test to determine the better
model, if one exists, based on leave-one-out cross-validation.
12.4. Predictive Quantiles
The predictive quantile (PQ) of each record i is
PQ_i = p(y^rep_i > y_i)
and is somewhat similar to the Bayesian p-value. The percentage of yi s that are not outliers
is called the ‘Predictive Concordance’. Gelfand (1996) suggests the goal is to attempt to
achieve 95% predictive concordance. In the case of, say 80% predictive concordance, the
discrepancy between the model and data is undesirable because the model does not fit the
data well and many outliers have resulted. On the other hand, if the predictive concordance
is too high, say 100%, then overfitting may have occurred, and it may be worth considering
a more parsimonious model. Kernel density plots of each y^rep_i distribution are useful in this case with the actual y_i included as a vertical bar to show its position.
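Both the predictive quantiles and the predictive concordance are simple summaries of the replicate matrix; a sketch in R with hypothetical y and yrep:

set.seed(1)
N <- 100; T_draws <- 1000
y <- rnorm(N)
yrep <- matrix(rnorm(N * T_draws), N, T_draws)      # hypothetical replicates
PQ <- rowMeans(yrep > y)                 # PQ_i = p(yrep_i > y_i)
mean(PQ >= 0.025 & PQ <= 0.975)          # predictive concordance, goal ~0.95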
12.5. L-criterion
Laud and Ibrahim (1995) introduced the L-criterion as one of three posterior predictive checks
for model, variable, and transformation selection. The L-criterion is a posterior predictive
check that is widely applicable and easy to apply. It is a sum of two components: one involves
the predictive variance and the other includes the accuracy of the means of the predictive
distribution. The L-criterion measures model performance with a combination of how close
its predictions are to the observed data and variability of the predictions. Better models have
smaller values of L. L is measured in the same units as the response variable, and measures
how close the data vector y is to the predictive distribution. In addition to the value of L,
there is a value for SL , which is the calibration number of L, and is useful in determining how
much of a decrease is necessary between models to be noteworthy. The L-criterion is
L = Σ_{i=1}^{N} √[ var(y^rep_{i,1:T}) + (y_i − T^{-1} Σ_{t=1}^{T} y^rep_{i,t})² ],
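The L-criterion is likewise a one-line summary of the replicate matrix; this R sketch uses the same hypothetical yrep as above and does not compute the calibration number SL.

set.seed(1)
N <- 100; T_draws <- 1000
y <- rnorm(N)
yrep <- matrix(rnorm(N * T_draws), N, T_draws)      # hypothetical replicates
sum(sqrt(apply(yrep, 1, var) + (y - rowMeans(yrep))^2))   # L, smaller is better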
13. Advantages and Disadvantages of Bayesian Inference
Following are some advantages of Bayesian inference over frequentist inference:
• Bayesian inference can avoid problems with model identification by manipulating prior
distributions (usually in complex models). Frequentist inference with any numerical
approximation algorithm does not have prior distributions, and can become stuck in
regions of flat density, causing problems with model identification.
• Bayesian inference considers the data to be fixed (which it is), and parameters to be
random because they are unknowns. Frequentist inference considers the unknown pa-
rameters to be fixed, and the data to be random, estimating not based on the data
at hand, but the data at hand plus hypothetical repeated sampling in the future with
similar data. “The Bayesian approach delivers the answer to the right question in the
sense that Bayesian inference provides answers conditional on the observed data and
not based on the distribution of estimators or test statistics over imaginary samples not
observed” (Rossi et al. 2005, p. 4).
• Bayesian inference estimates a full probability model. Frequentist inference does not.
There is no frequentist probability distribution associated with parameters or hypothe-
ses.
• Bayesian inference has an axiomatic foundation (Cox 1946) that is uncontested by fre-
quentists. Therefore, Bayesian inference is coherent to a frequentist, but frequentist
inference is incoherent to a Bayesian.
• Bayesian inference has a decision theoretic foundation (Bernardo and Smith 2000;
Robert 2007). The purpose of most of statistical inference is to facilitate decision-
making (Robert 2007, p. 51). The optimal decision is the Bayesian decision.
• Bayesian inference includes uncertainty in the probability model, yielding more real-
istic predictions. Frequentist inference does not include uncertainty of the parameter
estimates, yielding less realistic predictions.
• Bayesian inference may use DIC to compare models with different methods includ-
ing hierarchical models, where frequentist model fit statistics cannot compare different
methods or hierarchical models.
• Bayesian inference obeys the likelihood principle. Frequentist inference, including Max-
imum Likelihood Estimation (MLE) and the General Method of Moments (GMM) or
Generalized Estimating Equations (GEE), violates the likelihood principle. “The likeli-
hood principle, by itself, is not sufficient to build a method of inference but should be regarded as a minimum requirement of any viable form of inference” (Robert 2007).
There are also advantages of frequentist inference over Bayesian inference:
• Frequentist models are perceived to handle large data sets, while Bayesian models via
MCMC have traditionally been restricted to small sample sizes, and Laplace Approxi-
mation is similar to the frequentist method in that it is known to be able to handle large
data sets. This reputation is no longer true for MCMC. Samplers in LaplacesDemon do
not usually loop through records and can handle large data sets. But most importantly,
algorithms now exist (and are available here) that enable fast Bayesian inference with
big data.
• Frequentist models are usually much easier to prepare because many things do not need
to be specified, such as prior distributions, initial values for numerical approximation,
and usually the likelihood function. Most frequentist methods have been standardized
to “procedures” where less knowledge and programming are required, and in many cases
the user can just click on a few things and not really know what they are doing. Garbage
in, garbage out.
• Frequentist models optimized with MLE have much shorter run-times than Bayesian
models via MCMC or PMC. This is not a difference between frequentist and Bayesian
methods, but is due to optimization vs. sampling algorithms. MCMC has a longer run-
time, whether it is Bayesian or frequentist. Laplace Approximation uses optimization
algorithms, and yields run-times similar to frequentist MLE. If frequentist MLE and
Bayesian Laplace Approximation seem to have very different run-times, it is probably
due to differences between a method-specific algorithm vs. a general-purpose algorithm.
References
Ando T (2007). “Bayesian Predictive Information Criterion for the Evaluation of Hierarchical
Bayesian and Empirical Bayes Models.” Biometrika, 94(2), 443–458.
Anscombe F, Aumann R (1963). “A Definition of Subjective Probability.” The Annals of
Mathematical Statistics, 34(1), 199–205.
Azevedo-Filho A, Shachter R (1994). “Laplace’s Method Approximations for Probabilistic
Inference in Belief Networks with Continuous Variables.” In R Mantaras, D Poole (eds.),
Uncertainty in Artificial Intelligence, pp. 28–36. Morgan Kauffman, San Francisco, CA.
Barnard G (1949). “Statistical Inference.” Journal of the Royal Statistical Society, B 11,
115–149.
Bayarri M, DeGroot M (1987). “Bayesian Analysis of Selection Models.” The Statistician,
36, 136–146.
Bayes T, Price R (1763). “An Essay Towards Solving a Problem in the Doctrine of Chances.
By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, MA.
and F.R.S.” Philosophical Transactions of the Royal Society of London, 53, 370–418.
Berger J (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis, 1(3), 385–
402.
Berger J, Bernardo J, Dongchu S (2009). “The Formal Definition of Reference Priors.” Annals
of Statistics, 37(2), 905–938.
Bernardo J (1979). “Reference Posterior Distributions for Bayesian Inference (with discus-
sion).” Journal of the Royal Statistical Society, B 41, 113–147.
Bernardo J (2005). “Reference Analysis.” In D Dey, C Rao (eds.), Handbook of Statistics 25,
pp. 17–90. Elsevier, Amsterdam.
Bernardo J, Smith A (2000). Bayesian Theory. John Wiley & Sons, West Sussex, England.
Congdon P (2005). Bayesian Models for Categorical Data. John Wiley & Sons, West Sussex,
England.
De Finetti B (1937). “La Prevision: ses lois logiques, ses sources subjectives.” Annales de l’Institut Henri Poincare. English translation in H.E. Kyburg and H.E. Smokler (eds), (1964), “Foresight: Its Logical Laws, Its Subjective Sources”, Studies in Subjective Probability, New York: Wiley.
Fisher R (1921). “On the ‘Probable Error’ of a Coefficient of Correlation Deduced From a
Small Sample.” Metron, 1(4), 3–32.
Gelfand A, Ghosh S (1998). “Model Choice: A Minimum Posterior Predictive Loss Approach.”
Biometrika, 85, 1–11.
Gelman A, Carlin J, Stern H, Rubin D (2004). Bayesian Data Analysis. 2nd edition. Chapman
& Hall, Boca Raton, FL.
Gelman A, Meng X, Stern H (1996). “Posterior Predictive Assessment of Model Fitness via
Realized Discrepancies.” Statistica Sinica, 6, 773–807.
Gilks W, Thomas A, Spiegelhalter D (1994). “A Language and Program for Complex Bayesian
Modelling.” The Statistician, 43(1), 169–177.
Ibrahim J, Chen M (2000). “Power Prior Distributions for Regression Models.” Statistical
Science, 15, 46–60.
Jaynes E (1968). “Prior Probabilities.” IEEE Transactions on Systems Science and Cyber-
netics, 4(3), 227–241.
Jeffreys H (1961). Theory of Probability. Third edition. Oxford University Press, Oxford,
England.
Kass R, Raftery A (1995). “Bayes Factors.” Journal of the American Statistical Association,
90(430), 773–795.
Laplace P (1774). “Memoire sur la Probabilite des Causes par les Evenements.” l’Academie
Royale des Sciences, 6, 621–656. English translation by S.M. Stigler in 1986 as “Memoir
on the Probability of the Causes of Events” in Statistical Science, 1(3), 359–378.
Laplace P (1812). Theorie Analytique des Probabilites. Courcier, Paris. Reprinted as “Oeuvres
Completes de Laplace”, 7, 1878–1912. Paris: Gauthier-Villars.
Laplace P (1814). “Essai Philosophique sur les Probabilites.” English translation in Truscott,
F.W. and Emory, F.L. (2007) from (1902) as “A Philosophical Essay on Probabilities”.
ISBN 1602063281, translated from the French 6th ed. (1840).
Laud P, Ibrahim J (1995). “Predictive Model Selection.” Journal of the Royal Statistical
Society, B 57, 247–262.
Lewis S, Raftery A (1997). “Estimating Bayes’ Factors via Posterior Simulation with the
Laplace-Metropolis Estimator.” Journal of the American Statistical Association, 92, 648–
655.
Ntzoufras I (2009). Bayesian Modeling Using WinBUGS. John Wiley & Sons, West Sussex,
England.
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
Robert C (2007). The Bayesian Choice. 2nd edition. Springer, Paris, France.
Rossi P, Allenby G, McCulloch R (2005). Bayesian Statistics and Marketing. John Wiley &
Sons, West Sussex, England.
Savage L (1954). The Foundations of Statistics. John Wiley & Sons, West Sussex, England.
Spiegelhalter D, Best N, Carlin B, van der Linde A (2002). “Bayesian Measures of Model
Complexity and Fit (with Discussion).” Journal of the Royal Statistical Society, B 64,
583–639.
Spiegelhalter D, Thomas A, Best N, Lunn D (2003). WinBUGS User Manual, Version 1.4.
MRC Biostatistics Unit, Institute of Public Health and Department of Epidemiology and
Public Health, Imperial College School of Medicine, UK.
Stigler S (1983). “Who Discovered Bayes’s Theorem?” The American Statistician, 37(4),
290–296.
Affiliation:
Statisticat, LLC
Farmington, CT
E-mail: is defunct
URL: https://web.archive.org/web/20141224051720/http://www.bayesian-inference.
com/index