Bayesian Data Analysis - Reading Instructions 2: Chapter 2 - Outline
Aki Vehtari
Chapter 2 – outline
Outline of the chapter 2
• 2.1 Binomial model (e.g. biased coin flipping)
• 2.2 Posterior as compromise between data and prior information
• 2.3 Posterior summaries
• 2.4 Informative prior distributions (skip exponential families and sufficient statistics)
• 2.5 Gaussian model with known variance
• 2.6 Other single parameter models
- in this course the normal distribution with known mean but unknown variance is the most
important
- glance through Poisson and exponential
• 2.7 glance through this example, which illustrates the benefits of prior information; no need to read all
the details (it’s quite a long example)
• 2.8 Noninformative and weakly informative priors
Laplace’s approach for approximating integrals is discussed in more detail in Chapter 4.
R and Python demos (https://github.com/avehtari/BDA_R_demos and
https://github.com/avehtari/BDA_py_demos)
• demo2_1: Binomial model and Beta posterior.
• demo2_2: Comparison of posterior distributions with different parameter values for the Beta prior
distribution.
• demo2_3: Use samples to plot histogram with quantiles, and the same for a transformed variable.
• demo2_4: Grid sampling using inverse-cdf method.
Note that Γ(n) = (n − 1)!. The integral has a form which is called the incomplete Beta function. Bayes and
Laplace had difficulties in computing this, but nowadays there are several series and continued fraction
expressions. Furthermore, the normalisation term is usually computed by computing log(Γ(·)) directly
without explicitly computing Γ(·). Bayes was able to solve the integral given small n and y. In the case of large
n and y, Laplace used a Gaussian approximation of the posterior (more in Chapter 4). In this specific case,
R’s pbeta gives the same result as Laplace’s result with at least 3 digit accuracy.
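The remark about computing log(Γ(·)) directly can be illustrated with a minimal Python sketch. It assumes a uniform Beta(1, 1) prior, so that the posterior is Beta(y + 1, n − y + 1). The function names and the counts y and n are made up for this illustration and are not taken from the demos.

```python
import math

def log_beta_norm(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma only."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta, for 0 < theta < 1."""
    return ((a - 1) * math.log(theta)
            + (b - 1) * math.log1p(-theta)
            - log_beta_norm(a, b))

# Illustrative counts (an assumption of this sketch): y successes in n trials.
y, n = 4000, 10000
a, b = y + 1, n - y + 1          # posterior Beta(y + 1, n - y + 1)
log_density_at_mode = log_beta_pdf(y / n, a, b)
```

With arguments this large, math.gamma(a) raises OverflowError, whereas the computation on the log scale stays finite.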
Numerical accuracy
Laplace calculated
p(θ ≥ 0.5 | y, n, M) ≈ 1.15 × 10^{−42}.
Correspondingly Laplace could have calculated
Predictive distribution
Often the predictive distribution is more interesting than the posterior distribution. The posterior
distribution describes the uncertainty in the parameters (like the proportion of red chips in the bag), but the
predictive distribution also describes the uncertainty about the future event (like which color is picked
next). This difference is important, for example, if we want to know what could happen if some treatment is
given to a patient.
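The difference between the two distributions can be sketched by simulation for the chip example. The counts y and n, the uniform prior, and the variable names are assumptions made for this illustration only.

```python
import random

random.seed(1)

# Hypothetical data: y red chips in n picks, uniform prior, so the
# posterior for the proportion theta is Beta(y + 1, n - y + 1).
y, n = 7, 10
S = 20000  # number of posterior draws

theta_draws = [random.betavariate(y + 1, n - y + 1) for _ in range(S)]
# One predictive draw per posterior draw: is the next chip red?
next_is_red = [random.random() < theta for theta in theta_draws]

posterior_mean = sum(theta_draws) / S   # summary of the parameter
pred_prob_red = sum(next_is_red) / S    # probability of the future event
```

The posterior draws are proportions between 0 and 1, while the predictive draws are colors of the next chip. In this model the predictive probability of red equals the posterior mean, (y + 1)/(n + 2).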
In the case of a Gaussian distribution with known variance σ^2 the model is
y ∼ N(θ, σ^2),
Chapter 3
Outline of the chapter 3
• 3.1 Marginalisation
• 3.2 Normal distribution with a noninformative prior (very important)
• 3.3 Normal distribution with a conjugate prior (very important)
• 3.4 Multinomial model (can be skipped)
• 3.5 Multivariate normal with known variance (needed later)
• 3.6 Multivariate normal with unknown variance (glance through)
• 3.7 Bioassay example (very important, related to one of the exercises)
• 3.8 Summary (summary)
The normal model is used a lot as a building block of the models in the later chapters, so it is important to
learn it now. The bioassay example is a good example for illustrating many important concepts, and it is used
in several exercises over the course.
Demos (https://github.com/avehtari/BDA_R_demos and https://github.com/avehtari/BDA_py_demos)
• demo3_1: visualise joint density and marginal densities of posterior of normal distribution with
unknown mean and variance
• demo3_2: visualise factored sampling and corresponding marginal and conditional density
• demo3_3: visualise marginal distribution of mu as a mixture of normals
• demo3_4: visualise sampling from the posterior predictive distribution
• demo3_5: visualise Newcomb’s data
• demo3_6: visualise posterior in bioassay example
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others. See also the additional comments
below.
• marginal distribution/density
• conditional distribution/density
• joint distribution/density
• nuisance parameter
• mixture
• normal distribution with a noninformative prior
• normal distribution with a conjugate prior
• sample variance
• sufficient statistics
• µ, σ^2, ȳ, s^2
• a simple normal integral
• Inv-χ^2
• factored density
• t_{n−1}
• degrees of freedom
• posterior predictive distribution
• to draw
• N-Inv-χ^2
• variance matrix Σ
• nonconjugate model
• generalized linear model
• exchangeable
• binomial model
• logistic transformation
• density at a grid
Bioassay
The bioassay example is an example of a very common statistical inference task that is typical in, for
example, medicine, pharmacology, health care, cognitive science, genomics, and industrial processes.
The example is from Racine et al (1986) (see the reference at the end of BDA3). A Swiss company
classifies chemicals into different toxicity categories defined by authorities (like the EU). Toxicity
classification is based on the lethal dose 50% (LD50), which tells what amount of a chemical kills 50% of the
subjects. The smaller the LD50, the more lethal the chemical is. The original paper mentions "1983 Swiss poison
Regulation" which defines the following categories for chemicals orally given to rats (mg/ml)
Class  LD50 (mg/ml)
1      <5
2      5-50
3      50-500
4      500-2000
5      2000-5000
To reduce the number of rats needed in the experiments, the company started to use Bayesian methods.
The paper mentions that in those days using just 20 rats to define the classification was very few. The book
gives LD50 in log(g/ml). When the result from demo3_6 is transformed to the scale mg/ml, we see that the
mean LD50 is about 900 and p(500 < LD50 < 2000) ≈ 0.99. Thus, the tested chemical can be classified
as category 4 toxic.
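The unit conversion and the classification can be sketched as follows. This is a minimal illustration and not code from demo3_6. It assumes that the log is the natural logarithm (consistent with a mean LD50 of about 900 mg/ml) and that each class is a half-open interval, which the table does not specify. All function names are hypothetical.

```python
import math

# Upper limits (mg/ml) of the toxicity classes in the table above.
# Treating each class as [lower, upper) is an assumption of this sketch.
CLASS_UPPER_LIMITS = [(1, 5.0), (2, 50.0), (3, 500.0), (4, 2000.0), (5, 5000.0)]

def ld50_log_g_to_mg(log_ld50_g_per_ml):
    """Convert LD50 from log(g/ml), natural log assumed, to mg/ml."""
    return 1000.0 * math.exp(log_ld50_g_per_ml)

def toxicity_class(ld50_mg_per_ml):
    """Return the class number, or None if LD50 is 5000 mg/ml or more."""
    for cls, upper in CLASS_UPPER_LIMITS:
        if ld50_mg_per_ml < upper:
            return cls
    return None

def class_probabilities(log_ld50_draws):
    """Share of posterior draws of log LD50 that fall in each class."""
    counts = {}
    for x in log_ld50_draws:
        cls = toxicity_class(ld50_log_g_to_mg(x))
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: c / len(log_ld50_draws) for cls, c in counts.items()}
```

Given posterior draws of LD50 on the log(g/ml) scale, class_probabilities(draws) gives an estimate of quantities such as p(500 < LD50 < 2000).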
Note that chemical testing is moving away from using rats and other animals to using, for example,
human cells grown in chips, tissue models and human blood cells. The human-cell based approaches are
also more accurate in predicting the effect on humans.
The logit transformation can be justified information-theoretically when a binomial likelihood is used.
The example code in demo3_6 can be helpful in exercises related to the bioassay example.
Chapter 4
Outline of the chapter 4
Normal approximation is often used as part of posterior computation (more about this in Ch 13,
which is not a part of the course).
Demos
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• sample size
• asymptotic theory
• normal approximation
• quadratic function
• Taylor series expansion
• observed information
• positive definite
• why log σ?
• Jacobian of the transformation
• point estimates and standard errors
• lower-dimensional approximations
• large-sample theory
• asymptotic normality
• consistency
• underidentified
• nonidentified
• number of parameters increasing with sample size
• aliasing
• unbounded likelihood
• improper posterior
• edge of parameter space
• tails of distribution
Normal approximation
Other Gaussian posterior approximations are discussed in Chapter 13. For example, variational and
expectation propagation methods improve the approximation by global fitting instead of just the curvature
at the mode. The Gaussian approximation at the mode is often also called the Laplace method, as Laplace
used it first.
Several researchers have provided partial proofs that the posterior converges towards a Gaussian distribution.
In the mid 20th century Le Cam was the first to provide a strict proof.
Observed information
When n → ∞, the posterior distribution approaches a Gaussian distribution. As the log density of the
Gaussian is a quadratic function, the higher derivatives of the log posterior approach zero. The curvature
at the mode describes the information only in the case of asymptotic normality. In the case of the Gaussian
distribution, the curvature also describes the width of the Gaussian. The information matrix is a precision
matrix, whose inverse is a covariance matrix.
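As a one-dimensional illustration, the sketch below computes the observed information of a Beta(a, b) posterior at its mode by a finite difference and uses its inverse as the variance of the normal approximation. The choice of a Beta posterior, the values of a and b, and the function names are assumptions for this example.

```python
import math

def log_post(theta, a, b):
    """Unnormalised log density of a Beta(a, b) posterior."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def normal_approx(a, b, h=1e-4):
    """Mode and standard deviation of the normal approximation at the mode."""
    mode = (a - 1) / (a + b - 2)
    # Observed information: minus the second derivative of the log
    # posterior at the mode, estimated with a central finite difference.
    second_deriv = (log_post(mode + h, a, b) - 2 * log_post(mode, a, b)
                    + log_post(mode - h, a, b)) / h ** 2
    info = -second_deriv                 # precision of the Gaussian
    return mode, math.sqrt(1.0 / info)   # inverse precision is the variance

mode, sd = normal_approx(a=438, b=544)
```

In several dimensions the same idea applies with the matrix of second derivatives: the negative Hessian at the mode is the precision matrix and its inverse is the covariance matrix.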
Aliasing
In Finnish: valetoisto.
Aliasing is a special case of under-identifiability, where the likelihood repeats in separate points of the
parameter space. That is, the likelihood gets exactly the same values and has the same shape, although possibly
mirrored or otherwise projected. For example, the following mixture model
p(y | µ_1, µ_2, σ_1^2, σ_2^2, λ) = λ N(y | µ_1, σ_1^2) + (1 − λ) N(y | µ_2, σ_2^2)
has two Gaussians with their own means and variances. With probability λ the observation comes from
N(µ_1, σ_1^2) and with probability 1 − λ from N(µ_2, σ_2^2). This kind of model could be used, for example, for
Newcomb’s data, so that the other Gaussian component would model faulty measurements. The model does
not state which of the components, 1 or 2, would model good measurements and which would model the
faulty measurements. Thus it is possible to interchange the values of (µ_1, µ_2) and (σ_1^2, σ_2^2) and replace λ with
(1 − λ) to get an equivalent model. The posterior distribution then has two modes which are mirror images of
each other. When n → ∞ the modes get narrower, but the posterior does not converge to a single point.
If we can integrate over the whole posterior, the aliasing is not a problem. However, aliasing makes
approximate inference more difficult.
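The invariance of the likelihood under interchanging the component labels can be checked numerically. The data values and parameter values below are made up for this sketch, and the function names are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def mixture_log_lik(ys, mu1, mu2, sigma2_1, sigma2_2, lam):
    """Log likelihood of the two-component Gaussian mixture."""
    return sum(
        math.log(lam * normal_pdf(y, mu1, sigma2_1)
                 + (1 - lam) * normal_pdf(y, mu2, sigma2_2))
        for y in ys
    )

ys = [24.8, 25.1, 26.0, 27.3, 28.4, -2.0, 16.0]   # made-up measurements
ll = mixture_log_lik(ys, mu1=26.0, mu2=10.0,
                     sigma2_1=4.0, sigma2_2=100.0, lam=0.9)
# Interchange the component labels and replace lam with 1 - lam.
ll_swapped = mixture_log_lik(ys, mu1=10.0, mu2=26.0,
                             sigma2_1=100.0, sigma2_2=4.0, lam=0.1)
```

The two log likelihood values agree up to floating point rounding, so the data cannot distinguish between the two labelings.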
Transformation of variables
See p. 21 for the explanation of how to derive densities for transformed variables. This explains, for example,
why the uniform prior p(log(σ^2)) ∝ 1 for log(σ^2) corresponds to the prior p(σ^2) ∝ σ^{−2} for σ^2.
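This correspondence can be checked by simulation. Since the uniform prior on log(σ^2) is improper, the sketch below truncates it to a finite interval so that it can be sampled. The interval bounds and all names are assumptions for this illustration.

```python
import math
import random

random.seed(2)

# Truncation to [lo, hi] is an assumption made only so that the improper
# prior can be sampled.
lo, hi = 1.0, 100.0   # bounds for sigma^2
S = 50000

log_sigma2 = [random.uniform(math.log(lo), math.log(hi)) for _ in range(S)]
sigma2 = [math.exp(u) for u in log_sigma2]

def prob_below(c):
    """P(sigma^2 < c) when p(sigma^2) is proportional to 1/sigma^2 on [lo, hi]."""
    return (math.log(c) - math.log(lo)) / (math.log(hi) - math.log(lo))

empirical = sum(v < 10.0 for v in sigma2) / S
analytic = prob_below(10.0)   # normalised integral of 1/sigma^2 over [lo, 10]
```

Half of the draws of σ^2 fall below 10 even though that range covers less than a tenth of the interval, which is what a density proportional to σ^{−2} implies.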
On differentiation
Here’s a reminder of how to differentiate with respect to g(θ). For example,
(d / d log σ) σ^{−2} = −2σ^{−2}
is easily solved by setting z = log σ to get
(d / dz) exp(z)^{−2} = −2 exp(z)^{−3} exp(z) = −2 exp(z)^{−2} = −2σ^{−2}.
Chapter 5
Outline of the chapter 5
The hierarchical models in the chapter are simple to keep computation simple. More advanced
computational tools are presented in Chapters 10-12 (part of the course) and 13 (not part of the course).
Demos
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• population distribution
• hyperparameter
• overfit
• hierarchical model
• exchangeability
• invariant to permutations
• independent and identically distributed
• ignorance
• the mixture of independent identical distributions
• de Finetti’s theorem
• partially exchangeable
• conditionally exchangeable
• conditional independence
• hyperprior
• different posterior predictive distributions
• the conditional probability formula
Computation
Examples in Sections 5.3 and 5.4 continue computation with factorization and grid, but there is no need to
go deep into the computational details as we can use MCMC and Stan instead. Hierarchical model exercises
are done with Stan.
Chapter 6
Outline of the chapter 6
• 6.1 The place of model checking in applied Bayesian statistics
• 6.2 Do the inferences from the model make sense?
• 6.3 Posterior predictive checking (p-values can be skipped)
• 6.4 Graphical posterior predictive checks
• 6.5 Model checking for the educational testing example
Demos
• demo6_1: Posterior predictive checking - light speed
• demo6_2: Posterior predictive checking - sequential dependence
• demo6_3: Posterior predictive checking - poor test statistic
• demo6_4: Posterior predictive checking - marginal predictive p-value
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• model checking
• sensitivity analysis
• external validation
• posterior predictive checking
• joint posterior predictive distribution
• marginal (posterior) predictive distribution
• self-consistency check
• replicated data
• y^rep, ỹ, x̃
• test quantities
• discrepancy measure
• tail-area probabilities
• classical p-value
• posterior predictive p-values
• multiple comparisons
• marginal predictive checks
• cross-validation predictive distributions
• conditional predictive ordinate
Additional reading
The following article also has some useful discussion and examples about prior and posterior predictive
checking.
• Jonah Gabry, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman (2018).
Visualization in Bayesian workflow. Journal of the Royal Statistical Society Series A, 182(2):389-402.
https://doi.org/10.1111/rssa.12378.
• Michael Betancourt’s workflow case study with prior and posterior predictive checking
Chapter 7
Outline of the chapter 7
• Aki Vehtari, Andrew Gelman and Jonah Gabry (2017). Practical Bayesian model evaluation
using leave-one-out cross-validation and WAIC. In Statistics and Computing, 27(5):1413-1432,
doi:10.1007/s11222-016-9696-4. arXiv preprint arXiv:1507.04544.
In Sections 7.2 and 7.3 of BDA, for historical reasons a multiplier −2 is used. After the book
was published, we have concluded that it causes too much confusion and recommend not multiplying by −2.
The above paper does not use −2 anymore.
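The two scales can be compared with a small sketch that computes the log pointwise predictive density (lppd) from a matrix of log likelihood values and then the same quantity multiplied by −2. The log likelihood values are made up, and the function names are hypothetical and not from any package.

```python
import math

def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def lppd(log_lik):
    """Log pointwise predictive density.

    log_lik[s][i] is the log likelihood of observation i under draw s.
    """
    n_obs = len(log_lik[0])
    return sum(log_mean_exp([row[i] for row in log_lik]) for i in range(n_obs))

# Made-up log likelihood values: 3 posterior draws, 2 observations.
log_lik = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
value = lppd(log_lik)          # log score scale (no multiplier)
deviance_scale = -2 * value    # the older convention with the multiplier -2
```

On the log score scale larger values indicate better predictive fit, while on the −2 scale smaller values do, which is one source of the confusion.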
See also extra material at https://avehtari.github.io/modelselection/
Find all the terms and symbols listed below. When reading the chapter and the above mentioned
article, write down questions related to things unclear for you or things you think might be unclear for
others.
• predictive accuracy/fit/error
• external validation
• cross-validation
• information criteria
• overfitting
• measures of predictive accuracy
• point prediction
• scoring function
• mean squared error
• probabilistic prediction
• scoring rule
• logarithmic score
• log-predictive density
• out-of-sample predictive fit
• elpd, elppd, lppd
• deviance
• within-sample predictive accuracy
• adjusted within-sample predictive accuracy
• AIC, DIC, WAIC (less important)
• effective number of parameters
• singular model
• BIC (less important)
• leave-one-out cross-validation
• evaluating predictive error comparisons
• bias induced by model selection
• Bayes factors
• continuous model expansion
• sensitivity analysis
Additional reading
More theoretical details can be found in
• Aki Vehtari and Janne Ojanen (2012). A survey of Bayesian predictive methods for model
assessment, selection and comparison. In Statistics Surveys, 6:142-228.
http://dx.doi.org/10.1214/12-SS102
• Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model
selection. Statistics and Computing, 27(3):711-735. doi:10.1007/s11222-016-9649-y.
http://link.springer.com/article/10.1007/s11222-016-9649-y
Chapter 8
In the earlier chapters it was assumed that the data collection is ignorable. Chapter 8 explains when data
collection can be ignorable and when we also need to model the data collection. We don’t have time to go
through Chapter 8 in the BDA course at Aalto, but it is highly recommended that you read it at the end of
or after the course. The most important parts are 8.1, 8.5, pp 220–222 of 8.6, and 8.8, and you can get back to
the other sections later.
Outline of the chapter 8 (* denotes the most important parts)
• observed data
• complete data
• missing data
• stability assumption
• data model
• inclusion model
• complete data likelihood
• observed data likelihood
• finite-population and superpopulation inference
• ignorability
• ignorable designs
• propensity score
• sample surveys
• random sampling of a finite population
• stratified sampling
• cluster sampling
• designed experiments
• complete randomization
• randomized blocks and latin squares
• sequential designs
• randomization given covariates
• observational studies
• censoring
• truncation
• missing completely at random
Gelman: “All contexts where the model is fit to data that are not necessarily representative of the
population that is the target of study. The key idea is to include in the Bayesian model an inclusion
variable with a probability distribution that represents the process by which data become observed.”
Chapter 9
Outline of the chapter 9
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• decision analysis
• steps of Bayesian decision analysis 1–4 (p. 238)
• decision
• outcome
• utility function
• expected utility
• decision tree
• summarizing inference
• model selection
• individual decision problem
• institutional decision problem
Simpler examples
The lectures have simpler examples and also discuss some challenges in selecting utilities or costs.
Chapter 10
Outline of the chapter 10
Sections 10.1-10.4 give an overview of different computational methods. Some of them have already been
used in the book.
Section 10.5 is very important and related to the exercises.
Demos
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• unnormalized density
• target distribution
• log density
• overflow and underflow
• numerical integration
• quadrature
• simulation methods
• Monte Carlo
• stochastic methods
• deterministic methods
• distributional approximations
• crude estimation
• direct simulation
• grid sampling
• rejection sampling
• importance sampling
• importance ratios/weights
Draws and sample
A group of draws is a sample. A sample can consist of one draw, and thus some people use the word
sample both for a single item and for the group. For clarity, we prefer separate words for a single item (draw)
and for the group (sample).
• Don’t show digits which are just random noise. You can use Monte Carlo standard error estimates
to check how many digits are likely to stay the same if the sampling were continued.
• Show meaningful digits given the posterior uncertainty. You can compare the posterior standard error
or posterior intervals to the mean value. The posterior interval length can also be used to determine how
many digits to show for the interval endpoints.
• Example: The mean and 90% central posterior interval for temperature increase °C/century (see the
slides for the example) based on posterior draws:
• When reporting many numbers in a table, for aesthetic reasons it may sometimes be better to show
one extra or one too few digits for some numbers compared to the ideal.
• Often it’s better to plot the whole posterior density in addition to any summaries, as summaries
always lose some information content.
• For your reports: Don’t be lazy and settle for the default number of digits in R or Python. Think for
each reported value how many digits are sensible.
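The advice on digits can be sketched in code. The draws below are simulated stand-ins for posterior draws and are assumed independent, so that the Monte Carlo standard error of the mean is sd / sqrt(S); for MCMC draws an effective sample size estimate would replace S. The helper digits_supported is a rough rule of thumb invented for this sketch, not a rule from the book.

```python
import math
import random
import statistics

random.seed(3)

# Stand-in for posterior draws (independent draws assumed).
draws = [random.gauss(2.0, 0.5) for _ in range(4000)]

mean = statistics.mean(draws)
sd = statistics.stdev(draws)
mcse = sd / math.sqrt(len(draws))   # Monte Carlo standard error of the mean

def digits_supported(error):
    """Number of decimals whose unit is at least as large as `error`."""
    return max(0, math.floor(-math.log10(error)))

# Digits unlikely to change if sampling were continued.
stable = round(mean, digits_supported(mcse))
# Digits worth showing given the posterior uncertainty.
meaningful = round(mean, digits_supported(sd))
```

Here the Monte Carlo error supports about two decimals, but given a posterior standard deviation of about 0.5 even the first decimal carries little information about the quantity itself.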
Quadrature
Sometimes ‘quadrature’ is used to refer generically to any numerical integration method (including Monte
Carlo), sometimes it is used to refer just to deterministic numerical integration methods.
Rejection sampling
Rejection sampling is mostly used as a part of fast methods for univariate sampling. For example, sampling
from the normal distribution is often done using the Ziggurat method, which uses a proposal distribution
resembling stairs.
Rejection sampling is also commonly used for truncated distributions, in which case all draws from
the truncated part are rejected.
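The truncated case can be sketched as follows, using the untruncated normal as the proposal distribution. The truncation point and the function name are chosen only for this illustration.

```python
import random

random.seed(4)

def truncated_normal_draw(mu, sigma, lower):
    """Draw from N(mu, sigma^2) truncated to values >= lower by rejection."""
    while True:
        x = random.gauss(mu, sigma)   # proposal: the untruncated normal
        if x >= lower:                # accept only draws in the kept region
            return x

draws = [truncated_normal_draw(0.0, 1.0, lower=0.5) for _ in range(5000)]
```

The acceptance probability equals the probability mass of the kept region, so this simple approach becomes slow when the truncation point is far in the tail.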
Importance sampling
The popularity of importance sampling is increasing. It is used, for example, as part of other methods such as
particle filters and pseudo marginal likelihood approaches, and to improve distributional approximations
(including variational inference in machine learning).
Importance sampling is useful in importance sampling leave-one-out cross-validation. Cross-validation
is discussed in Chapter 7 and importance sampling leave-one-out cross-validation is discussed in the
article
• Aki Vehtari, Andrew Gelman and Jonah Gabry (2017). Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. In Statistics and Computing, 27(5):1413–1432. arXiv
preprint arXiv:1507.04544 <http://arxiv.org/abs/1507.04544>
After the book was published, we have developed Pareto smoothed importance sampling, which is
more stable than plain importance sampling and has a very useful Pareto-k diagnostic to check the reliability.
• Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry (2019). Pareto
smoothed importance sampling. arXiv preprint arXiv:1507.02646.
<http://arxiv.org/abs/1507.02646>
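For reference, plain self-normalised importance sampling (without Pareto smoothing) can be sketched as below. The target, the proposal, and the estimated quantity are assumptions chosen so that the correct answer, E[θ^2] = 1, is known.

```python
import math
import random

random.seed(5)

def log_target(theta):
    """Unnormalised log density of the target (standard normal here)."""
    return -0.5 * theta ** 2

def log_proposal(theta, scale=2.0):
    """Log density of the N(0, scale^2) proposal, up to a constant."""
    return -0.5 * (theta / scale) ** 2

S = 20000
thetas = [random.gauss(0.0, 2.0) for _ in range(S)]
log_w = [log_target(t) - log_proposal(t) for t in thetas]

# Subtract the maximum before exponentiating to avoid overflow.
max_log_w = max(log_w)
w = [math.exp(lw - max_log_w) for lw in log_w]

# Self-normalised estimate of E[theta^2]; constants in the weights cancel.
estimate = sum(wi * t ** 2 for wi, t in zip(w, thetas)) / sum(w)
# A common effective sample size estimate for importance sampling.
ess = sum(w) ** 2 / sum(wi ** 2 for wi in w)
```

The proposal is wider than the target, which keeps the importance ratios bounded. With a proposal narrower than the target, a few very large weights can dominate the estimate.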
Buffon’s needles
Computer simulation of Buffon’s needle dropping method for estimating the value of π:
https://mste.illinois.edu/activity/buffon/.
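The same experiment can be sketched in a few lines of Python. The function name and default arguments are choices made for this illustration. Note that drawing the angle uses math.pi, so this is a demonstration of the Monte Carlo idea rather than a genuine way to compute π.

```python
import math
import random

random.seed(6)

def buffon_pi(n_drops, needle_len=1.0, line_gap=1.0):
    """Estimate pi by dropping needles on a floor with parallel lines.

    Requires needle_len <= line_gap. A needle crosses a line with
    probability 2 * needle_len / (pi * line_gap).
    """
    hits = 0
    for _ in range(n_drops):
        centre = random.uniform(0.0, line_gap / 2)   # distance to nearest line
        angle = random.uniform(0.0, math.pi / 2)     # acute angle to the lines
        if centre <= (needle_len / 2) * math.sin(angle):
            hits += 1
    return 2 * needle_len * n_drops / (line_gap * hits)

pi_estimate = buffon_pi(200000)
```

Even with 200000 drops only about two decimals are reliable, which illustrates the slow convergence of crude Monte Carlo estimates.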
Chapter 11
Outline of the chapter 11
Demos
Find all the terms and symbols listed below. When reading the chapter, write down questions related
to things unclear for you or things you think might be unclear for others.
• Markov chain
• Markov chain Monte Carlo
• random walk
• starting point
• transition distribution
• jumping / proposal distribution
• to converge, convergence, assessing convergence
• stationary distribution, stationarity
• effective number of simulations
• Gibbs sampler
• Metropolis sampling / algorithm
• Metropolis-Hastings algorithm
• acceptance / rejection rule
• acceptance / rejection rate
• within-sequence correlation, serial correlation
• warm-up / burn-in
• to thin, thinned
• overdispersed starting points
• mixing
• to diagnose convergence
• between- and within-sequence variances
• potential scale reduction, R̂
• the variance of the average of a correlated sequence
• autocorrelation
• variogram
• n_eff
Animations
Nice animations with discussion http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/
Metropolis algorithm
There is a lot of freedom in the selection of the proposal distribution in the Metropolis algorithm. There are some
restrictions, but we don’t go into the mathematical details in this course.
Don’t confuse rejection in rejection sampling with rejection in the Metropolis algorithm. In rejection sampling,
the rejected samples are thrown away. In the Metropolis algorithm the rejected proposals are thrown away,
but time moves on and the previous sample x^(t) is also the sample x^(t+1).
When a proposal is rejected, the previous sample is repeated in the chain. The repeated samples have to be
included, and they are valid samples from the distribution. For basic Metropolis, it can be shown that the optimal
rejection rate is 55–77%, so that even in the optimal case quite many of the samples are repeated samples.
However, a high number of rejections is acceptable, as then the accepted proposals are on average further away
from the previous point. It is better to jump further away 23–45% of the time than to jump really close more often.
Methods for estimating the effective sample size are useful for measuring how effective a given chain is.
If the starting point is selected at or near the mode, less time is needed to reach the area of essential mass,
but the samples in the beginning of the chain are still not representative of the true distribution unless the
starting point was somehow sampled directly from the target distribution.
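A minimal random walk Metropolis sketch shows how a rejected proposal leads to the previous draw being repeated. The standard normal target, the proposal scale 2.4 and the function names are assumptions chosen for this illustration.

```python
import math
import random

random.seed(7)

def log_target(x):
    """Unnormalised log density of the target (standard normal here)."""
    return -0.5 * x * x

def metropolis(n_iter, x0=0.0, proposal_sd=2.4):
    """Random walk Metropolis; returns the chain and the rejection rate."""
    chain = [x0]
    n_rejected = 0
    for _ in range(n_iter):
        current = chain[-1]
        proposal = random.gauss(current, proposal_sd)   # symmetric proposal
        log_ratio = log_target(proposal) - log_target(current)
        if random.random() < math.exp(min(0.0, log_ratio)):
            chain.append(proposal)
        else:
            # The rejected proposal is discarded, but the chain still
            # advances: the previous draw is repeated.
            chain.append(current)
            n_rejected += 1
    return chain, n_rejected / n_iter

chain, rejection_rate = metropolis(20000)
```

With this proposal scale more than half of the proposals are rejected, so the chain contains many repeated values, and all of them are kept when computing posterior summaries.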
R̂ and n_eff
There are many versions of R̂ and n_eff. Beware that some software packages compute R̂ using old inferior
approaches.
R̂ and the approach for estimating the effective number of samples n_eff were updated in BDA3, and
a slightly updated version of this is described in the Stan 2.18+ user guide. Since then we have developed an even
better R̂ and ESS (effective sample size, or S_eff; the change from n to S is due to improved consistency in
the notation) in
• Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner (2019).
Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of
MCMC. arXiv preprint arXiv:1903.08008 <http://arxiv.org/abs/1903.08008>.
The new R̂, ESS, and Monte Carlo error estimates are available in the RStan monitor function in R, in
the posterior package in R, and in the ArviZ package in Python.
Due to randomness in chains, R̂ may get values slightly below 1.
The Brief Guide to Stan’s Warnings https://mc-stan.org/misc/warnings.html provides a summary of the available
convergence diagnostics in Stan and how to interpret them.
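To make the between- and within-sequence variances concrete, the sketch below implements the basic split-R̂ in the form given in BDA3. It is not the newer rank-normalized version from the article above, and in practice the implementations in the packages mentioned above should be used. The function name and the simulated chains are assumptions for this illustration.

```python
import random
import statistics

def split_rhat(chains):
    """Basic split-R-hat for one scalar quantity.

    chains: list of equal-length lists of draws, warm-up already removed.
    """
    half = len(chains[0]) // 2
    # Split each chain into halves to also detect non-stationarity
    # within a chain.
    parts = []
    for c in chains:
        parts.append(c[:half])
        parts.append(c[half:2 * half])
    n = half
    means = [statistics.mean(p) for p in parts]
    B = n * statistics.variance(means)                          # between
    W = statistics.mean([statistics.variance(p) for p in parts])  # within
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

random.seed(8)
# Four simulated chains from the same distribution: R-hat close to 1.
good = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Two of the four chains are stuck in a different place: R-hat well above 1.
bad = [[random.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 3, 3)]
rhat_good = split_rhat(good)
rhat_bad = split_rhat(bad)
```

Because B and W are themselves estimates from finite chains, the value for well-mixed chains fluctuates around 1 and can fall slightly below it.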
Chapter 12
Outline of the chapter 12
• See rstan_demo.Rmd, pystan_demo.py, pystan_demo.ipynb for demos of how to use Stan from R/Python
and several model examples
MCMC animations
These don’t include the specific version of dynamic HMC in Stan, but are useful illustrations anyway.
• Radford Neal (2011). MCMC using Hamiltonian dynamics. In Brooks et al (ed), Handbook of
Markov Chain Monte Carlo, Chapman & Hall / CRC Press. Preprint
https://arxiv.org/pdf/1206.1901.pdf.
Stan uses a variant of dynamic Hamiltonian Monte Carlo (using adaptive number of steps in the
dynamic simulation), which has been further developed since BDA3 was published. The first dynamic
HMC variant was
• Matthew D. Hoffman, Andrew Gelman (2014). The No-U-Turn Sampler: Adaptively Setting Path
Lengths in Hamiltonian Monte Carlo. JMLR, 15:1593–1623
http://jmlr.org/papers/v15/hoffman14a.html.
The No-U-Turn Sampler gave the name NUTS which you can see often associated with Stan, but the
current dynamic HMC variant implemented in Stan has some further developments described (mostly)
in
• Michael Betancourt (2018). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv
preprint arXiv:1701.02434 https://arxiv.org/abs/1701.02434.
Instead of reading all above, you can also watch a nice introduction video
• Scalable Bayesian Inference with Hamiltonian Monte Carlo by Michael Betancourt
https://www.youtube.com/watch?v=jUSZboSq1zg
• Divergence diagnostic checks whether the discretized dynamic simulation has problems due to fast
varying density. See more in a case study
http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html.
• BFMI checks whether momentum resampling in HMC is sufficiently efficient. See more in
https://arxiv.org/abs/1604.00695
Part IV, Chapters 14–18 discuss the basics of linear and generalized linear models with several examples. The
parts discussing computation can be useful for providing additional insight into these models or sometimes for
actual computation, but it’s likely that most readers will use some probabilistic programming framework
for computation. Regression and Other Stories (ROS) by Gelman, Hill and Vehtari discusses linear and
generalized linear models from the modeling perspective more thoroughly.
18.1 Notation
• Missing completely at random (MCAR)
- missingness does not depend on missing values or other observed values (including covariates)
• Missing at random (MAR)
- missingness does not depend on missing values but may depend on other observed values
(including covariates)
• Missing not at random (MNAR)
- missingness depends on missing values
18.2 Multiple imputation
1. make a model predicting missing data
2. sample repeatedly from the missing data model to generate multiple imputed data sets
3. make usual inference for each imputed data set
4. combine results
• discussion of computation is partially outdated
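The four steps can be sketched for a toy case. The data are made up, the values are assumed to be missing completely at random, and the missing data model is a normal model with a noninformative prior, as in Chapter 3. Step 4 uses the usual combining rules for multiple imputation (Rubin’s rules). This is an illustration under these assumptions, not the computation described in the book.

```python
import math
import random
import statistics

random.seed(9)

# Made-up data: None marks a missing value.
y = [4.1, 5.3, None, 6.0, 4.8, None, 5.5, 5.1, 4.4, None, 5.9, 5.0]
observed = [v for v in y if v is not None]
n_obs = len(observed)
ybar, s2 = statistics.mean(observed), statistics.variance(observed)

def impute_once():
    """Steps 1-2: draw (mu, sigma^2) from the posterior of the normal
    model, then draw the missing values given those parameters."""
    chi2 = random.gammavariate((n_obs - 1) / 2, 2.0)   # chi^2, n_obs - 1 df
    sigma2 = (n_obs - 1) * s2 / chi2                   # scaled Inv-chi^2 draw
    mu = random.gauss(ybar, math.sqrt(sigma2 / n_obs))
    return [v if v is not None else random.gauss(mu, math.sqrt(sigma2))
            for v in y]

m = 50   # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    completed = impute_once()
    # Step 3: usual complete-data inference, here the mean and its
    # squared standard error.
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / len(completed))

# Step 4: combine the results.
pooled_mean = statistics.mean(estimates)
within = statistics.mean(variances)
between = statistics.variance(estimates)
total_var = within + (1 + 1 / m) * between
```

The between-imputation variance carries the extra uncertainty due to the missing values, which a single imputed data set would hide.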
18.3 Missing data in the multivariate normal and t models
• a special continuous data case computation, which can still be useful as fast starting point
18.4 Example: multiple imputation for a series of polls
• an example
18.5 Missing values with counted data
• discussion of computation for count data (i.e. computation in 18.3 is not applicable)
18.6 Example: an opinion poll in Slovenia
• another example