Leonhard Held
Daniel Sabanés Bové
Likelihood
and Bayesian
Inference
With Applications in Biology and
Medicine
Second Edition
Statistics for Biology and Health
Series Editors
Mitchell Gail, Division of Cancer Epidemiology and Genetics, National Cancer
Institute, Rockville, MD, USA
Jonathan M. Samet, Department of Epidemiology, School of Public Health, Johns
Hopkins University, Baltimore, MD, USA
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute
University of Zurich
Zürich, Switzerland

Daniel Sabanés Bové
Google
Zürich, Switzerland
This Springer imprint is published by the registered company Springer-Verlag GmbH, DE part of
Springer Nature
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany
To Our Families:
Ulrike, Valentina, Richard and Lorenz,
Carrie and Ben
Preface
fibach for their support in the first edition. Special thanks go to Manuela Ott, who
was instrumental in writing the second edition of the book. Last but not least, we are
grateful to Eva Hiripi from Springer-Verlag Heidelberg for her continuing support
and the Editors of Statistics for Biology and Health for welcoming us in their series.
Zürich, Switzerland Leonhard Held, Daniel Sabanés Bové
March 2019
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Inference for a Proportion . . . . . . . . . . . . . . . . . 2
1.1.2 Comparison of Proportions . . . . . . . . . . . . . . . . 2
1.1.3 The Capture–Recapture Method . . . . . . . . . . . . . . 4
1.1.4 Hardy–Weinberg Equilibrium . . . . . . . . . . . . . . . 4
1.1.5 Estimation of Diagnostic Tests Characteristics . . . . . . 5
1.1.6 Quantifying Disease Risk from Cancer Registry Data . . 6
1.1.7 Predicting Blood Alcohol Concentration . . . . . . . . . 8
1.1.8 Analysis of Survival Times . . . . . . . . . . . . . . . . 8
1.2 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Contents and Notation of the Book . . . . . . . . . . . . . . . . 11
1.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Likelihood and Log-Likelihood Function . . . . . . . . . . . . . 13
2.1.1 Maximum Likelihood Estimate . . . . . . . . . . . . . . 14
2.1.2 Relative Likelihood . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Invariance of the Likelihood . . . . . . . . . . . . . . . . 23
2.1.4 Generalised Likelihood . . . . . . . . . . . . . . . . . . 26
2.2 Score Function and Fisher Information . . . . . . . . . . . . . . 27
2.3 Numerical Computation of the Maximum Likelihood Estimate . 31
2.3.1 Numerical Optimisation . . . . . . . . . . . . . . . . . . 31
2.3.2 The EM Algorithm . . . . . . . . . . . . . . . . . . . . 33
2.4 Quadratic Approximation of the Log-Likelihood Function . . . . 37
2.5 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 Minimal Sufficiency . . . . . . . . . . . . . . . . . . . . 45
2.5.2 The Likelihood Principle . . . . . . . . . . . . . . . . . 47
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Elements of Frequentist Inference . . . . . . . . . . . . . . . . . . . 51
3.1 Unbiasedness and Consistency . . . . . . . . . . . . . . . . . . 51
3.2 Standard Error and Confidence Interval . . . . . . . . . . . . . . 55
3.2.1 Standard Error . . . . . . . . . . . . . . . . . . . . . . . 56
9 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
9.1 Plug-in Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.2 Likelihood Prediction . . . . . . . . . . . . . . . . . . . . . . . 290
9.2.1 Predictive Likelihood . . . . . . . . . . . . . . . . . . . 291
9.2.2 Bootstrap Prediction . . . . . . . . . . . . . . . . . . . . 293
9.3 Bayesian Prediction . . . . . . . . . . . . . . . . . . . . . . . . 297
9.3.1 Posterior Predictive Distribution . . . . . . . . . . . . . . 297
9.3.2 Computation of the Posterior Predictive Distribution . . . 301
9.3.3 Model Averaging . . . . . . . . . . . . . . . . . . . . . 303
9.4 Assessment of Predictions . . . . . . . . . . . . . . . . . . . . . 304
9.4.1 Discrimination and Calibration . . . . . . . . . . . . . . 304
9.4.2 Scoring Rules . . . . . . . . . . . . . . . . . . . . . . . 309
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
9.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
10 Markov Models for Time Series Analysis . . . . . . . . . . . . . . . 315
10.1 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . 316
10.2 Observation-Driven Models for Categorical Data . . . . . . . . . 316
10.2.1 Maximum Likelihood Inference . . . . . . . . . . . . . . 317
10.2.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 319
10.2.3 Inclusion of Covariates . . . . . . . . . . . . . . . . . . 320
10.3 Observation-Driven Models for Continuous Data . . . . . . . . . 321
10.3.1 The First-Order Autoregressive Model . . . . . . . . . . 321
10.3.2 Maximum Likelihood Inference . . . . . . . . . . . . . . 322
10.3.3 Inclusion of Covariates . . . . . . . . . . . . . . . . . . 325
10.3.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 326
10.4 Parameter-Driven Models . . . . . . . . . . . . . . . . . . . . . 328
10.4.1 The Likelihood Function . . . . . . . . . . . . . . . . . 328
10.4.2 The Posterior Distribution . . . . . . . . . . . . . . . . . 329
10.5 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 331
10.5.1 The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . 332
10.5.2 Bayesian Inference for Hidden Markov Models . . . . . . 335
10.6 State Space Models . . . . . . . . . . . . . . . . . . . . . . . . 337
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
10.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Appendix A Probabilities, Random Variables and Distributions . . . . 343
A.1 Events and Probabilities . . . . . . . . . . . . . . . . . . . . . . 344
A.1.1 Conditional Probabilities and Independence . . . . . . . 344
A.1.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . 345
A.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 345
A.2.1 Discrete Random Variables . . . . . . . . . . . . . . . . 345
A.2.2 Continuous Random Variables . . . . . . . . . . . . . . 346
A.2.3 The Change-of-Variables Formula . . . . . . . . . . . . 347
A.2.4 Multivariate Normal Distributions . . . . . . . . . . . . . 349
A.3 Expectation, Variance and Covariance . . . . . . . . . . . . . . . 350
1 Introduction
Statistics is a discipline with different branches. This book describes two central ap-
proaches to statistical inference, likelihood inference and Bayesian inference. Both
concepts have in common that they use statistical models depending on unknown
parameters to be estimated from the data. Moreover, both are constructive, i.e. pro-
vide precise procedures for obtaining the required results. A central role is played
by the likelihood function, which is determined by the choice of a statistical model.
While a likelihood approach bases inference only on the likelihood, the Bayesian
approach combines the likelihood with prior information. Hybrid approaches also
exist.
What do we want to learn from data using statistical inference? We can distin-
guish three major goals. Of central importance is to estimate the unknown parame-
ters of a statistical model. This is the so-called estimation problem. However, how
do we know that the chosen model is correct? We may have a number of statistical
models and want to identify the one that describes the data best. This is the so-called
model selection problem. And finally, we may want to predict future observations
based on the observed ones. This is the prediction problem.
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 1
https://doi.org/10.1007/978-3-662-60792-3_1,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
1.1 Examples
Several examples from biology and health will be considered throughout this book,
many of them more than once viewed from different perspectives or tackled with
different techniques. We will now give a brief overview.
the two groups. Different measures are now employed to compare the two groups,
among which the risk difference π1 − π2 and the risk ratio π1 /π2 are the most
common ones. The odds ratio
ω1/ω2 = {π1/(1 − π1)} / {π2/(1 − π2)},
the ratio of the odds ω1 and ω2, is also often used. Note that if the risk in the two
groups is equal, i.e. π1 = π2, then the risk difference is zero, while both the risk ratio
and the odds ratio are equal to one. Statistical methods can now be employed to investigate if
the simpler model with one parameter π = π1 = π2 can be preferred over the more
complex one with different risk parameters π1 and π2 . Such questions may also be
of interest if more than two groups are considered.
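These comparison measures can be computed directly from the two group risks. The following sketch uses Python (the book's own examples use R) with hypothetical risks π1 = 0.10 and π2 = 0.05, chosen purely for illustration:

```python
# Risk difference, risk ratio and odds ratio for two groups with risks pi1, pi2.
# The risks below are hypothetical illustration values, not data from the book.
def risk_measures(pi1, pi2):
    rd = pi1 - pi2                   # risk difference
    rr = pi1 / pi2                   # risk ratio
    odds1 = pi1 / (1 - pi1)          # odds in group 1
    odds2 = pi2 / (1 - pi2)          # odds in group 2
    return rd, rr, odds1 / odds2     # last entry: odds ratio

rd, rr, odds_ratio = risk_measures(0.10, 0.05)
print(rd, rr, odds_ratio)
```

With equal risks π1 = π2 the function returns a risk difference of 0 and both ratios equal to 1, in line with the text.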
A controlled clinical trial compares the effect of a certain treatment with a con-
trol group, where typically either a standard treatment or a placebo treatment is
provided. Several randomised controlled clinical trials have investigated the use of
diuretics in pregnancy to prevent preeclampsia. Preeclampsia is a medical condition
characterised by high blood pressure and significant amounts of protein in the urine
of a pregnant woman. It is a very dangerous complication of a pregnancy and may
affect both the mother and fetus. In each trial women were randomly assigned to one
of the two treatment groups. Randomisation is used to exclude possible subjective
influence from the examiner and to ensure equal distribution of relevant risk factors
in the two groups.
The results of nine such studies are reported in Table 1.1. For each trial the ob-
served proportions xi /ni in the treatment and placebo control group (i = 1, 2) are
given, as well as the corresponding empirical odds ratio
{x1/(n1 − x1)} / {x2/(n2 − x2)}.
One can see substantial variation of the empirical odds ratios reported in Table 1.1.
This raises the question if this variation is only statistical in nature or if there is
evidence for additional heterogeneity between the studies. In the latter case the true
treatment effect differs from trial to trial due to different inclusion criteria, different
underlying populations, or other reasons. Such questions are addressed in a meta-
analysis, a combined analysis of results from different studies.
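For a single trial the empirical odds ratio is computed directly from the four counts. A Python sketch with made-up event counts (Table 1.1 itself is not reproduced here, so these numbers are hypothetical):

```python
# Empirical odds ratio of one trial: x1 events among n1 treated patients,
# x2 events among n2 control patients. Counts below are hypothetical.
def empirical_odds_ratio(x1, n1, x2, n2):
    return (x1 / (n1 - x1)) / (x2 / (n2 - x2))

print(empirical_odds_ratio(x1=14, n1=131, x2=24, n2=136))
```

An empirical odds ratio below 1 would indicate fewer events under treatment; variation of such estimates across trials is exactly what a meta-analysis examines.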
Fig. 1.1 The de Finetti diagram, named after the Italian statistician Bruno de Finetti (1906–1985),
displays the expected relative genotype frequencies Pr(AA) = π1 , Pr(aa) = π3 and Pr(Aa) = π2
in a bi-allelic, diploid population as the length of the perpendiculars a, b and c from the inner
point F to the sides of an equilateral triangle. The ratio of the distance from the vertex aa to the point Q to the side
length from aa to AA is the relative allele frequency υ of A. Hardy–Weinberg equilibrium is represented
by all points on the parabola 2υ(1 − υ). For example, the point G represents such a population
with υ = 0.5, whereas population F has substantially less heterozygous Aa than expected under
Hardy–Weinberg equilibrium
M and N. Most people in the Eskimo population are MM, while other populations
tend to possess the opposite genotype NN. In the sample from Iceland, the fre-
quencies of the underlying genotypes MM, MN and NN turned out to be x1 = 233,
x2 = 385 and x3 = 129. If we assume that the population is in Hardy–Weinberg
equilibrium, then the statistical task is to estimate the unknown allele frequency υ
from these data. Statistical methods can also address the question if the equilibrium
assumption is supported by the data or not. This is a model selection problem, which
can be addressed with a significance test or other techniques.
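Under the equilibrium assumption the allele frequency can be estimated by allele counting: each MM individual contributes two M alleles and each MN individual one, which yields the standard maximum likelihood estimate (2x1 + x2)/(2n) in this multinomial model. A Python sketch with the Icelandic counts from the text:

```python
# Allele-frequency estimate under Hardy-Weinberg equilibrium from genotype
# counts x1 (MM), x2 (MN), x3 (NN); data as given in the text.
x1, x2, x3 = 233, 385, 129
n = x1 + x2 + x3
v_hat = (2 * x1 + x2) / (2 * n)   # allele-counting ML estimate of the frequency of M
print(n, round(v_hat, 4))
```

The fitted genotype frequencies v̂², 2v̂(1 − v̂) and (1 − v̂)² can then be compared with the observed relative frequencies to judge the equilibrium assumption.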
Table 1.2 Distribution of the number of positive test results among six consecutive screening tests
of 196 colon cancer cases

Number k of positive tests   0    1    2    3    4    5    6
Frequency Zk                 ?   37   22   25   29   34   49
Here Pr(A | B) denotes the conditional probability of an event A, given the infor-
mation B. The first line thus reads “the sensitivity is the conditional probability of
a positive test, given the fact that the subject is diseased”; see Appendix A.1.1 for
more details on conditional probabilities. Thus, high values for the sensitivity and
specificity mean that classification of diseased and non-diseased individuals is cor-
rect with high probability. The sensitivity is also known as the true positive fraction
whereas specificity is called the true negative fraction.
Screening examinations are particularly useful if the disease considered can be
treated better in an earlier stage than in a later stage. For example, a diagnostic
study in Australia involved 38 000 individuals who were screened repeatedly for the
presence of colon cancer on six consecutive days with a simple diagnostic
test. 3000 individuals had at least one positive test result, which was subsequently
verified with a coloscopy. 196 cancer cases were eventually identified, and Table 1.2
reports the frequency of positive test results among those. Note that the number Z0
of cancer patients that have never been positively tested is unavailable by design.
The closely related false positive fraction is 1 − specificity, and the false negative
fraction is 1 − sensitivity.
Cancer registries collect incidence and mortality data on different cancer locations.
For example, data on the incidence of lip cancer in Scotland have been collected
between 1975 and 1980. The raw counts of cancer cases in 56 administrative dis-
tricts of Scotland will vary a lot due to heterogeneity in the underlying popu-
lation counts. Other possible reasons for variation include different age distribu-
tions or heterogeneity in underlying risk factors for lip cancer in the different dis-
tricts.
A common approach to adjust for age heterogeneity is to calculate the expected
number of cases using age standardisation. The standardised incidence ratio (SIR)
of observed to expected number of cases is then often used to visually display ge-
ographical variation in disease risk. If the SIR is equal to 1, then the observed in-
cidence is as expected. Figure 1.2 maps the corresponding SIRs for lip cancer in
Scotland.
However, SIRs are unreliable indicators of disease incidence, in particular if the
disease is rare. For example, a small district may have zero observed cases just by
chance such that the SIR will be exactly zero. In Fig. 1.3, which plots the SIRs
versus the number of expected cases, we can identify two such districts. More gen-
erally, the statistical variation of the SIRs will depend on the population counts, so
more extreme SIRs will tend to occur in less populated areas, even if the underlying
disease risk does not vary from district to district. Indeed, we can see from Fig. 1.3
that the variation of the SIRs increases with decreasing number of expected cases.
Statistical methods can be employed to obtain more reliable estimates of disease
risk. In addition, we can also investigate the question if there is evidence for hetero-
geneity of the underlying disease risk at all. If this is the case, then another question
is whether the variation in disease risk is spatially structured or not.
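The SIR itself is simply the ratio of observed to expected counts. This Python sketch uses invented district data (not the Scottish lip cancer values) to show why small districts give unstable SIRs:

```python
# Standardised incidence ratios for a few hypothetical districts.
observed = [9, 0, 3, 28]          # observed case counts
expected = [4.5, 1.8, 3.0, 14.0]  # age-standardised expected counts
sir = [x / e for x, e in zip(observed, expected)]
print(sir)
# The second district has SIR exactly 0 only because no case happened to occur;
# with an expected count of 1.8, a zero count is quite plausible under a Poisson model.
```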
In many countries it is not allowed to drive a car with a blood alcohol concentra-
tion (BAC) above a certain threshold. For example, in Switzerland this threshold
is 0.5 mg/g = 0.5 ‰. However, usually only a measurement of the breath alco-
hol concentration (BrAC) is taken from a suspicious driver in the first instance. It
is therefore important to accurately predict the BAC measurement from the BrAC
measurement. Usually this is done by multiplication of the BrAC measurement with
a transformation factor TF. Ideally this transformation should be accompanied with
a prediction interval to acknowledge the uncertainty of the BAC prediction.
In Switzerland, currently TF0 = 2000 is used in practice. As some experts con-
sider this too low, a study was conducted at the Forensic Institute of the University
of Zurich in the period 2003–2004. For n = 185 volunteers, both BrAC and BAC
were measured after consuming various amounts of alcoholic beverages of personal
choice. Mean and standard deviation of the ratio TF = BAC/BrAC are shown in
Table 1.3. One of the central questions of the study was if the currently used factor
of TF0 = 2000 needs to be adjusted. Moreover, it is of interest if the empirical dif-
ference between male and female volunteers provides evidence of a true difference
between genders.
Table 1.4 Survival times of 94 patients under Azathioprine treatment in days. Censored observa-
tions are marked with a plus sign
8+ 9 38 96 144 167 177 191+ 193 201
207 251 287+ 335+ 379+ 421 425 464 498+ 500
574+ 582+ 586 616 630 636 647 651+ 688 743
754 769+ 797 799+ 804 828+ 904+ 932+ 947 962+
974 1113+ 1219 1247 1260 1268 1292+ 1408 1436+ 1499
1500 1522 1552 1554 1555+ 1626+ 1649+ 1942 1975 1982+
1998+ 2024+ 2058+ 2063+ 2101+ 2114+ 2148 2209 2254+ 2338+
2384+ 2387+ 2415+ 2426 2436+ 2470 2495+ 2500 2522 2529+
2744+ 2857 2929 3024 3056+ 3247+ 3299+ 3414+ 3456+ 3703+
3906+ 3912+ 4108+ 4253+
with censored survival time actually died of PBC. Possible reasons for censoring
include drop-out of the study, e.g. due to moving away, or death by some other
cause, e.g. due to a car accident. Figure 1.4 illustrates this type of data.
The formulation of a suitable probabilistic model plays a central role in the statisti-
cal analysis of data. The terminology statistical model is also common. A statistical
model will describe the probability distribution of the data as a function of an un-
known parameter. If there is more than one unknown parameter, i.e. the unknown
parameters form a parameter vector, then the model is a multiparameter model. In
this book we will concentrate on parametric models, where the number of parame-
ters is fixed, i.e. does not depend on the sample size. In contrast, in a non-parametric
model the number of parameters grows with the sample size and may even be infi-
nite.
Appropriate formulation of a statistical model is based on careful considerations
on the origin and properties of the data at hand. Certain approximations may often
be useful in order to simplify the model formulation. Often the observations are
assumed to be a random sample, i.e. independent realisations from a known distri-
bution. See Appendix A.5 for a comprehensive list of commonly used probability
distributions.
For example, estimation of a proportion is often based on a random sample of
size n drawn without replacement from some population with N individuals. The
appropriate statistical model for the number of observations in the sample with the
property of interest is the hypergeometric distribution. However, the hypergeomet-
ric distribution can be approximated by a binomial one, a statistical model for the
number of observations with some property of interest in a random sample with
replacement. The difference between these two models is negligible if n is much
smaller than N , and then the binomial model is typically preferred.
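The closeness of the two models for n much smaller than N can be checked numerically; both probability mass functions require only binomial coefficients. The population and sample sizes below are hypothetical illustration values (a Python sketch; the book's examples use R):

```python
from math import comb

def dhyper(x, M, N, n):
    # P(X = x) when drawing n without replacement from N items, M of interest
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def dbinom(x, n, p):
    # P(X = x) when drawing n with replacement, success probability p
    return comb(n, x) * p**x * (1 - p)**(n - x)

M, N, n = 300, 10_000, 20   # n much smaller than N
for x in range(4):
    print(x, dhyper(x, M, N, n), dbinom(x, n, M / N))
```

The two columns of probabilities agree to about three decimal places here, which is why the simpler binomial model is typically preferred in this regime.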
Capture–recapture methods are also based on a random sample of size n without
replacement, but now N is the unknown parameter of interest, so it is unclear if n
is much smaller than N . Hence, the hypergeometric distribution is the appropriate
statistical model, which has the additional advantage that the quantity of interest is
an explicit parameter contained in that model.
The validity of a statistical model can be checked with statistical methods. For
example, we will discuss methods to investigate if the underlying population of a
random sample of genotypes is in Hardy–Weinberg equilibrium. Another example
is the statistical analysis of continuous data, where the normal distribution is a pop-
ular statistical model. The distribution of survival times, for example, is typically
skewed, and hence other distributions such as the gamma or the Weibull distribution
are used.
For the analysis of count data, as for example the number of lip cancer cases in
the administrative districts of Scotland from Example 1.1.6, a suitable distribution
has to be chosen. A popular choice is the Poisson distribution, which is suitable if
the mean and variance of the counts are approximately equal. However, in many
cases there is overdispersion, i.e. the variance is larger than the mean. Then the
Poisson-gamma distribution, a generalisation of the Poisson distribution, is a suit-
able choice.
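The overdispersion of the Poisson-gamma model can be seen from its first two moments: if X given λ is Poisson with mean eλ and λ follows a gamma distribution with parameters a and b, the laws of total expectation and variance give E(X) = ea/b and Var(X) = E(X)(1 + e/b) > E(X). A small Python sketch (the parameter values are arbitrary illustrations):

```python
# Marginal moments of the Poisson-gamma model X | lam ~ Poisson(e * lam)
# with lam ~ Gamma(a, b), via the laws of total expectation and variance.
def poisson_gamma_moments(e, a, b):
    mean = e * a / b
    var = mean * (1 + e / b)   # always exceeds the mean: overdispersion
    return mean, var

mean, var = poisson_gamma_moments(e=10.0, a=2.0, b=1.0)
print(mean, var)
```

As b grows with a/b held fixed, the gamma mixing distribution degenerates to a point mass and the plain Poisson model, with variance equal to the mean, is recovered.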
Statistical models can become considerably more complex if necessary. For ex-
ample, the statistical analysis of survival times needs to take into account that some
of the observations are censored, so an additional model (or some simplifying as-
sumption) for the censoring mechanism is typically needed. The formulation of a
suitable statistical model for the data obtained in the diagnostic study described in
Example 1.1.5 also requires careful thought since the study design does not deliver
direct information on the number of patients with solely negative test results.
1.3 Contents and Notation of the Book
Chapter 2 introduces the central concept of a likelihood function and the maxi-
mum likelihood estimate. Basic elements of frequentist inference are summarised
in Chap. 3. Frequentist inference based on the likelihood, as described in Chaps. 4
and 5, enables us to construct confidence intervals and significance tests for parame-
ters of interest. Bayesian inference combines the likelihood with a prior distribution
and is conceptually different from the frequentist approach. Chapter 6 describes the
central aspects of this approach. Chapter 7 gives an introduction to model selec-
tion from both a likelihood and a Bayesian perspective, while Chap. 8 discusses the
use of modern numerical methods for Bayesian inference and Bayesian model se-
lection. In Chap. 9 we give an introduction to the construction and the assessment
of probabilistic predictions. Finally, Chap. 10 describes methodology for time se-
ries analysis. Every chapter ends with exercises and some references to additional
literature.
Modern statistical inference is unthinkable without the use of a computer. Nu-
merous numerical techniques for optimisation and integration are employed to solve
statistical problems. This book emphasises the role of the computer and gives many
examples with explicit R code. Appendix C is devoted to the background of these
numerical techniques. Modern statistical inference is also unthinkable without a
solid background in mathematics, in particular probability, which is covered in Ap-
pendix A. A collection of the most common probability distributions and their prop-
erties is also given. Appendix B describes some central results from matrix algebra
and calculus which are used in this book.
We finally describe some notational issues. Mathematical results are given in
italic font and are often followed by a proof of the result, which ends with an open
square (□). A filled square (■) denotes the end of an example. Definitions end with
a diamond (♦). Vectorial parameters θ are reproduced in boldface to distinguish
them from scalar parameters θ . Similarly, independent univariate random variables
Xi from a certain distribution contribute to a random sample X1:n = (X1 , . . . , Xn ),
whereas dependent univariate random variables X1 , . . . , Xn form a general sample
X = (X1 , . . . , Xn ). A random sample of n independent multivariate random vari-
ables Xi = (Xi1 , . . . , Xik ) is denoted as X1:n = (X1 , . . . , Xn ). On page 389 we
give a concise overview of the notation used in this book.
1.4 References
alcohol concentration is described in Iten (2009) and Iten and Wüst (2009). Kirk-
wood and Sterne (2003) report data on the clinical study on the treatment of primary
biliary cirrhosis with Azathioprine. Jones et al. (2014) is a recent book on statistical
computing, which provides much of the background necessary to follow our numer-
ical examples using R. For a solid but accessible treatment of probability theory, we
recommend Grimmett and Stirzaker (2001, Chaps. 1–7).
2 Likelihood
The term likelihood was introduced by Sir Ronald A. Fisher (1890–1962). The
likelihood function forms the basis of likelihood-based statistical inference.
The function f (x; θ) describes the distribution of the random variable X for
fixed parameter θ . The goal of statistical inference is to infer θ from the observed
datum X = x. Playing a central role in this task is the likelihood function (or simply
likelihood)
L(θ ; x) = f (x; θ ), θ ∈ Θ,
viewed as a function of θ for fixed x. We will often write L(θ ) for the likelihood if
it is clear which observed datum x the likelihood refers to.
Definition 2.1 (Likelihood function) The likelihood function L(θ ) is the probability
mass or density function of the observed data x, viewed as a function of the unknown
parameter θ .
For discrete data, the likelihood function is the probability of the observed data
viewed as a function of the unknown parameter θ . This definition is not directly
transferable to continuous observations, where the probability of every exactly mea-
sured observed datum is strictly speaking zero. However, in reality continuous mea-
surements are always rounded to a certain degree, and the probability of the ob-
served datum x can therefore be written as Pr(x − ε/2 ≤ X ≤ x + ε/2) for some small
rounding interval width ε > 0. Here X denotes the underlying true continuous mea-
surement.
The above probability can be re-written as

Pr(x − ε/2 ≤ X ≤ x + ε/2) = ∫_{x−ε/2}^{x+ε/2} f (y; θ) dy ≈ ε · f (x; θ),
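The quality of the approximation ε · f (x; θ) can be checked numerically, here for a standard normal density using only the Python standard library (x and ε are arbitrary illustration values):

```python
from statistics import NormalDist

# Probability of a small rounding interval around x versus eps * f(x)
# for a standard normal measurement model.
X = NormalDist(mu=0.0, sigma=1.0)
x, eps = 1.3, 0.01
exact = X.cdf(x + eps / 2) - X.cdf(x - eps / 2)
approx = eps * X.pdf(x)
print(exact, approx)   # the two values agree to several decimal places
```

Since ε does not depend on θ, this multiplicative constant can be dropped, which motivates defining the likelihood of continuous data through the density f (x; θ).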
Plausible values of θ should have a relatively high likelihood. The most plausible
value with maximum value of L(θ ) is the maximum likelihood estimate.
Definition 2.3 (Likelihood kernel) The likelihood kernel is obtained from a like-
lihood function by removing all multiplicative constants. We will use the symbol
L(θ ) both for likelihood functions and kernels.
Multiplicative constants in L(θ ) turn to additive constants in l(θ ), which again can
often be ignored. A log-likelihood function without additive constants is called log-
likelihood kernel. We will use the symbol l(θ ) both for log-likelihood functions and
kernels.
The uniqueness of the MLE is not guaranteed, and in certain examples there may
exist at least two parameter values θ̂1 ≠ θ̂2 with L(θ̂1) = L(θ̂2) = max_{θ∈Θ} L(θ).
In other situations, the MLE may not exist at all. The following example illustrates
Fig. 2.1 Likelihood function for π in a binomial model. The MLEs are marked with a vertical
line
that application of the capture–recapture method can result both in non-unique and
non-existing MLEs.
X ∼ HypGeom(n, N, M)
θ = N can only take integer values and is not continuous, although the figure sug-
gests the opposite.
It is possible to show (cf. Exercise 3) that the likelihood function is maximised at
N̂ML = ⌊M · n/x⌋, where ⌊y⌋ denotes the largest integer not greater than y. For ex-
ample, for M = 26, n = 63 and x = 5 (cf. Fig. 2.2), we obtain N̂ML = ⌊26 · 63/5⌋ =
⌊327.6⌋ = 327.
However, sometimes the MLE is not unique, and the likelihood function attains
the same value at N̂ML − 1. For example, for M = 13, n = 10 and x = 5, we have
N̂ML = ⌊13 · 10/5⌋ = 26, but N̂ML = 25 also attains exactly the same value of L(N).
This can easily be verified empirically using the R-function dhyper, cf. Table A.1.
M <- 13
n <- 10
x <- 5
ml <- c(25, 26)
(dhyper(x = x, m = M, n = ml - M, k = n))
[1] 0.311832 0.311832
On the other hand, the MLE will not exist for x = 0 because the likelihood function
L(N ) is then monotonically increasing.
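The same check can be sketched in Python: the hypergeometric likelihood is evaluated at ⌊M · n/x⌋ and its neighbours (only the standard library is used; the book's own code is R).

```python
from math import comb, floor

# Hypergeometric likelihood of the population size N in the capture-recapture
# model: x marked animals among n recaptured, M marked in total.
def likelihood(N, M, n, x):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

M, n, x = 13, 10, 5
N_hat = floor(M * n / x)
print(N_hat, likelihood(N_hat, M, n, x), likelihood(N_hat - 1, M, n, x))
```

Both N = 26 and N = 25 attain the maximal value 0.311832, reproducing the non-uniqueness seen with dhyper, while L(27) is already smaller.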
Definition 2.4 (Random sample) Data x1:n = (x1 , . . . , xn ) are realisations of a ran-
dom sample X1:n = (X1 , . . . , Xn ) of size n if the random variables X1 , . . . , Xn
are independent and identically distributed from some distribution with probabil-
ity mass or density function f (x; θ). The number n of observations is called the
sample size. This may be denoted as Xi ∼iid f (x; θ), i = 1, . . . , n.
Example 2.3 (Analysis of survival times) Let X1:n denote a random sample from
an exponential distribution Exp(λ). Then
L(λ) = ∏_{i=1}^{n} λ exp(−λxi) = λ^n exp(−λ ∑_{i=1}^{n} xi).
Setting the derivative to zero, we easily obtain the MLE λ̂ML = 1/x̄, where x̄ =
∑_{i=1}^{n} xi /n is the mean observed survival time. If our interest is instead in the
theoretical mean μ = 1/λ of the exponential distribution, then the likelihood function
takes the form
L(μ) = μ^{−n} exp(−(1/μ) ∑_{i=1}^{n} xi), μ ∈ R+,
Fig. 2.3 Likelihood function for λ (left) and μ (right) assuming independent and exponentially
distributed PBC-survival times. Only uncensored observations are taken into account
is ignored. In Example 2.8 we will therefore also take into account the censored
observations.
The likelihood functions for the rate parameter λ and the mean survival time
μ = 1/λ are shown in Fig. 2.3. Note that the actual values of the likelihood functions
are identical, only the scale of the x-axis is transformed. This illustrates that the
likelihood function and in particular the MLE are invariant with respect to one-to-
one transformations of the parameter θ , see Sect. 2.1.3 for more details. It also shows
that a likelihood function cannot be interpreted as a density function of a random
variable. Indeed, assume that L(λ) was an (unnormalised) density function; then the
density of μ = 1/λ would be not equal to L(1/μ) because this change of variables
would also involve the derivative of the inverse transformation, cf. Eq. (A.11) in
Appendix A.2.3.
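The invariance of the likelihood is easy to verify numerically. The sketch below uses a small artificial sample of uncensored survival times (not the PBC data) and evaluates both log-likelihoods at their respective MLEs.

```r
## artificial uncensored survival times (not the PBC data)
x <- c(0.5, 1.2, 2.3, 0.8, 3.1)
n <- length(x)

## log-likelihoods in the rate lambda and in the mean mu = 1/lambda
loglik.lambda <- function(lambda) n * log(lambda) - lambda * sum(x)
loglik.mu     <- function(mu)     -n * log(mu) - sum(x) / mu

lambda.hat <- 1 / mean(x)  # MLE of the rate
mu.hat     <- mean(x)      # MLE of the mean, by invariance of the MLE

## identical values at the MLEs: the likelihood is invariant under
## the one-to-one transformation mu = 1/lambda
c(loglik.lambda(lambda.hat), loglik.mu(mu.hat))
```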
The assumption of exponentially distributed survival times may be unrealistic,
and a more flexible statistical model may be warranted. Both the gamma and the
Weibull distributions include the exponential distribution as a special case. The
Weibull distribution Wb(μ, α) is described in Appendix A.5.2 and depends on two
parameters μ and α, which both are required to be positive. A random sample X1:n
from a Weibull distribution has the density
$$ f(x_{1:n}; \mu, \alpha) = \prod_{i=1}^{n} f(x_i; \mu, \alpha) = \prod_{i=1}^{n} \frac{\alpha}{\mu} \left(\frac{x_i}{\mu}\right)^{\alpha - 1} \exp\left\{-\left(\frac{x_i}{\mu}\right)^{\alpha}\right\}. $$
Fig. 2.4 Flexible modelling of survival times is achieved by a Weibull or gamma model. The
corresponding likelihood functions are displayed here. The vertical line at α = 1 corresponds to
the exponential model in both cases
For the gamma model G(α, β) the likelihood function is

$$ L(\alpha, \beta) = \prod_{i=1}^{n} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x_i^{\alpha - 1} \exp(-\beta x_i) = \frac{\beta^{n\alpha}}{\Gamma(\alpha)^{n}} \left(\prod_{i=1}^{n} x_i\right)^{\alpha - 1} \exp\biggl(-\beta \sum_{i=1}^{n} x_i\biggr). $$
Example 2.4 (Poisson model) Consider Example 1.1.6 and denote the observed and
expected number of cancer cases in the n = 56 regions of Scotland with xi and ei ,
respectively, i = 1, . . . , n. The simplest model for such registry data assumes that
the underlying relative risk λ is the same in all regions and that the observed counts
xi ’s constitute independent realisations from Poisson distributions with means ei λ.
The random variables Xi hence belong to the same distributional family but are
not identically distributed since the mean parameter ei λ varies from observation to
observation.
The log-likelihood kernel of the relative risk λ turns out to be

$$ l(\lambda) = \sum_{i=1}^{n} x_i \log \lambda - \lambda \sum_{i=1}^{n} e_i, $$
It is often useful to consider the likelihood (or log-likelihood) function relative to its
value at the MLE.
Example 2.5 (Inference for a proportion) The different likelihood functions for a binomial model (cf. Example 2.1) with sample size n = 10 and observation x = 2 are displayed in Fig. 2.6. Note that the change from an ordinary to a relative likelihood
changes the scaling of the y-axis, but the shape of the likelihood function remains
the same. This is also true for the log-likelihood function.
It is important to consider the entire likelihood function as the carrier of the in-
formation regarding θ provided by the data. This is far more informative than to
consider only the MLE and to disregard the likelihood function itself. Using the val-
ues of the relative likelihood function gives us a method to derive a set of parameter
values (usually an interval), which are supported by the data. For example, the fol-
lowing categorisation based on thresholding the relative likelihood function using
the cutpoints 1/3, 1/10, 1/100 and 1/1000 has been proposed:

1 ≥ L̃(θ) > 1/3:         θ very plausible,
1/3 ≥ L̃(θ) > 1/10:      θ plausible,
1/10 ≥ L̃(θ) > 1/100:    θ less plausible,
1/100 ≥ L̃(θ) > 1/1000:  θ barely plausible,
1/1000 ≥ L̃(θ) ≥ 0:      θ not plausible.
However, such a pure likelihood approach to inference has the disadvantage that the
scale and the thresholds are somewhat arbitrarily chosen. Indeed, the likelihood on
its own does not allow us to quantify the support for a certain set of parameter values
2.1 Likelihood and Log-Likelihood Function 23
Suppose we parametrise the distribution of X not with respect to θ, but with respect to a one-to-one transformation φ = h(θ). The likelihood functions Lφ(φ) for φ and L(θ) for θ are then related via Lφ{h(θ)} = L(θ), i.e. Lφ(φ) = L{h⁻¹(φ)}. The actual value of the likelihood will not be changed by this transformation, i.e. the likelihood is invariant with respect to one-to-one parameter transformations. We therefore have
φ̂ML = h(θ̂ML )
for the MLEs φ̂ML and θ̂ML. This invariance is an important property of the maximum likelihood estimate.
Example 2.6 (Binomial model) Let X ∼ Bin(n, π), so that π̂ML = x/n. Now consider the corresponding odds parameter ω = π/(1 − π). The MLE of ω is

$$ \hat\omega_{ML} = \frac{\hat\pi_{ML}}{1 - \hat\pi_{ML}} = \frac{x/n}{1 - x/n} = \frac{x}{n - x}. $$

We also have

$$ \omega = h(\pi) = \frac{\pi}{1 - \pi} \iff \pi = h^{-1}(\omega) = \frac{\omega}{1 + \omega} \quad\text{and}\quad 1 - \pi = \frac{1}{1 + \omega}, $$

and therefore the log-likelihood of ω is lω(ω) = x log ω − n log(1 + ω) with score function Sω(ω) = x/ω − n/(1 + ω), so the root ω̂ML must fulfil x(1 + ω̂ML) = n ω̂ML. We easily obtain

$$ \hat\omega_{ML} = \frac{x}{n - x}. $$
$$ l(\pi) = \sum_{i=1}^{3} x_i \log(\pi_i). $$
The log-likelihood kernel for the allele frequency υ is therefore (2x1 + x2 ) log(υ) +
(x2 + 2x3 ) log(1 − υ), which can be identified as a binomial log-likelihood kernel
for the success probability υ with 2x1 + x2 successes and x2 + 2x3 failures.
The MLE of υ is therefore
2x1 + x2 2x1 + x2 x1 + x2 /2
υ̂ML = = = ,
2x1 + 2x2 + 2x3 2n n
which is exactly the proportion of A alleles in the sample. For the data above, we
obtain υ̂ML ≈ 0.570. The MLEs of π1, π2 and π3 (assuming Hardy–Weinberg equilibrium) are therefore

$$ \hat\pi_1 = \hat\upsilon_{ML}^{2} \approx 0.324, \quad \hat\pi_2 = 2\hat\upsilon_{ML}(1 - \hat\upsilon_{ML}) \approx 0.490 \quad\text{and}\quad \hat\pi_3 = (1 - \hat\upsilon_{ML})^{2} \approx 0.185. $$
In the last example, the transformation to which the MLE is invariant is not really
a one-to-one transformation. A more detailed view of the situation is the following:
We have the more general multinomial model with two parameters π1 , π2 (π3 is
determined by these) and the simpler Hardy–Weinberg model with one parame-
ter υ. We can restrict the multinomial model to the Hardy–Weinberg model, which
is hence a special case of the multinomial model. If we obtain an MLE for υ, we can
hence calculate the resulting MLEs for π1, π2 and also π3. However, in the other direction, i.e. by first calculating the unrestricted MLE π̂ML in the multinomial model,
we could not calculate a corresponding MLE υ̂ML in the simpler Hardy–Weinberg
model.
$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)^{\delta_i} \bigl\{1 - F(x_i; \theta)\bigr\}^{1 - \delta_i}. \tag{2.2} $$
$$ \hat\theta_{ML} = \arg\max_{\theta \in \Theta} l(\theta) = \arg\max_{\theta \in \Theta} \tilde{l}(\theta). $$
However, the log-likelihood function l(θ ) has much larger importance, besides sim-
plifying the computation of the MLE. Especially, its first and second derivatives are
important and have their own names, which are introduced in the following. For
simplicity, we assume that θ is a scalar.
Definition 2.6 (Score function) The first derivative of the log-likelihood function

$$ S(\theta) = \frac{dl(\theta)}{d\theta} $$

is called the score function.
Definition 2.7 (Fisher information) The negative second derivative of the log-likelihood function

$$ I(\theta) = -\frac{d^2 l(\theta)}{d\theta^2} = -\frac{dS(\theta)}{d\theta} $$
is called the Fisher information. The value of the Fisher information at the MLE
θ̂ML , i.e. I (θ̂ML ), is the observed Fisher information.
Note that the MLE θ̂ML is a function of the observed data, which explains the
terminology “observed” Fisher information for I (θ̂ML ).
Example 2.9 (Normal model) Suppose we have realisations x1:n of a random sample from a normal distribution N(μ, σ²) with unknown mean μ and known variance σ². The log-likelihood kernel and score function are then

$$ l(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \quad\text{and}\quad S(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu), $$
respectively. The solution of the score equation S(μ) = 0 is the MLE μ̂ML = x̄.
Taking another derivative gives the Fisher information
$$ I(\mu) = \frac{n}{\sigma^2}, $$
which does not depend on μ and so is equal to the observed Fisher information
I (μ̂ML ), no matter what the actual value of μ̂ML is.
Suppose we switch the roles of the two parameters and treat μ as known and σ 2
as unknown. We now obtain
$$ \hat\sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \quad\text{and}\quad I(\sigma^2) = \frac{1}{\sigma^6} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2\sigma^4}. $$
The Fisher information of σ 2 now really depends on its argument σ 2 . The observed
Fisher information turns out to be
$$ I\bigl(\hat\sigma^2_{ML}\bigr) = \frac{n}{2\hat\sigma^4_{ML}}. $$
The score function

$$ S(\pi) = \frac{dl(\pi)}{d\pi} = \frac{x}{\pi} - \frac{n - x}{1 - \pi} $$

has already been derived in Example 2.1. Taking the derivative of S(π) gives the Fisher information
$$ I(\pi) = -\frac{d^2 l(\pi)}{d\pi^2} = -\frac{dS(\pi)}{d\pi} = \frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2} = n \left\{ \frac{x/n}{\pi^2} + \frac{(n - x)/n}{(1 - \pi)^2} \right\}. $$
Plugging in the MLE π̂ML = x/n, we finally obtain the observed Fisher information
$$ I(\hat\pi_{ML}) = \frac{n}{\hat\pi_{ML}(1 - \hat\pi_{ML})}. $$
This result is plausible if we take a frequentist point of view and consider the MLE
as a random variable. Then
$$ \operatorname{Var}(\hat\pi_{ML}) = \operatorname{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2} \operatorname{Var}(X) = \frac{1}{n^2}\, n\pi(1 - \pi) = \frac{\pi(1 - \pi)}{n}, $$
so the variance of π̂ML has the same form as the inverse observed Fisher information;
the only difference is that the MLE π̂ML is replaced by the true (and unknown)
parameter π . The inverse observed Fisher information is hence an estimate of the
variance of the MLE.
How does the observed Fisher information change if we reparametrise our statis-
tical model? Here is the answer to this question.
Result 2.1 (Observed Fisher information after reparametrisation) Let Iθ (θ̂ML ) de-
note the observed Fisher information of a scalar parameter θ and suppose that
φ = h(θ ) is a one-to-one transformation of θ . The observed Fisher information
Iφ (φ̂ML ) of φ is then
$$ I_{\phi}(\hat\phi_{ML}) = I_{\theta}(\hat\theta_{ML}) \left\{\frac{dh^{-1}(\hat\phi_{ML})}{d\phi}\right\}^{2} = I_{\theta}(\hat\theta_{ML}) \left\{\frac{dh(\hat\theta_{ML})}{d\theta}\right\}^{-2}. \tag{2.3} $$
The second derivative of lφ (φ) can be computed using the product and chain rules:
$$ \begin{aligned} I_{\phi}(\phi) = -\frac{dS_{\phi}(\phi)}{d\phi} &= -\frac{d}{d\phi}\left\{ S_{\theta}(\theta) \cdot \frac{dh^{-1}(\phi)}{d\phi} \right\} \\ &= -\frac{dS_{\theta}(\theta)}{d\phi} \cdot \frac{dh^{-1}(\phi)}{d\phi} - S_{\theta}(\theta) \cdot \frac{d^{2}h^{-1}(\phi)}{d\phi^{2}} \\ &= -\frac{dS_{\theta}(\theta)}{d\theta} \cdot \left\{\frac{dh^{-1}(\phi)}{d\phi}\right\}^{2} - S_{\theta}(\theta) \cdot \frac{d^{2}h^{-1}(\phi)}{d\phi^{2}} \\ &= I_{\theta}(\theta) \left\{\frac{dh^{-1}(\phi)}{d\phi}\right\}^{2} - S_{\theta}(\theta) \cdot \frac{d^{2}h^{-1}(\phi)}{d\phi^{2}}. \end{aligned} $$
Evaluating Iφ (φ) at the MLE φ = φ̂ML (so θ = θ̂ML ) leads to the first equation in (2.3)
(note that Sθ (θ̂ML ) = 0). The second equation follows with
$$ \frac{dh^{-1}(\phi)}{d\phi} = \left\{\frac{dh(\theta)}{d\theta}\right\}^{-1} \quad\text{for}\quad \frac{dh(\theta)}{d\theta} \neq 0. \tag{2.5} $$
Example 2.11 (Binomial model) In Example 2.6 we saw that the MLE of the odds
ω = π/(1 − π) is ω̂ML = x/(n − x). What is the corresponding observed Fisher
information? First, we compute the derivative of h(π) = π/(1 − π), which is

$$ \frac{dh(\pi)}{d\pi} = \frac{1}{(1 - \pi)^2}. $$

Applying (2.3) with Iπ(π̂ML) = n/{π̂ML(1 − π̂ML)} gives

$$ I_{\omega}(\hat\omega_{ML}) = \frac{n}{\hat\pi_{ML}(1 - \hat\pi_{ML})} (1 - \hat\pi_{ML})^{4} = \frac{n(1 - \hat\pi_{ML})^{3}}{\hat\pi_{ML}} = \frac{(n - x)^{3}}{nx}. $$

Note that Iπ(π̂ML) does not change if we redefine successes as failures and vice versa, but this is not the case for Iω(ω̂ML).
Explicit formulas for the MLE and the observed Fisher information can typically
only be derived in simple models. In more complex models, numerical techniques
have to be applied to compute maximum and curvature of the log-likelihood func-
tion. We first describe the application of general purpose optimisation algorithms
to this setting and will discuss the Expectation-Maximisation (EM) algorithm in
Sect. 2.3.2.
Iterating the Newton–Raphson update

$$ \theta^{(t+1)} = \theta^{(t)} + \frac{S(\theta^{(t)})}{I(\theta^{(t)})} $$
gives after convergence (i.e. θ (t+1) = θ (t) ) the MLE θ̂ML . As a by-product, the ob-
served Fisher information I (θ̂ML ) can also be extracted.
To apply the Newton–Raphson algorithm in R, the function optim can conve-
niently be used, see Appendix C.1.3 for details. We need to pass the log-likelihood
function as an argument to optim. Explicitly passing the score function into optim
typically accelerates convergence. If the derivative is not available, it can sometimes
be computed symbolically using the R function deriv. Generally no derivatives
need to be passed to optim because it can approximate them numerically. Particu-
larly useful is the option hessian = TRUE, in which case optim will also return
the negative observed Fisher information.
Example 2.12 (Screening for colon cancer) The goal of Example 1.1.5 is to esti-
mate the false negative fraction of a screening test, which consists of six consecutive
medical examinations. Let π denote the probability of a positive test result for the ith diseased individual, and denote by Xi the number of positive test results among
the six examinations. We start by assuming that individual test results are indepen-
dent and that π does not vary from patient to patient (two rather unrealistic assump-
tions, as we will see later), so that Xi is binomially distributed: Xi ∼ Bin(N = 6, π).
However, due to the study design, we will not observe a patient with Xi = 0 positive
tests. We therefore need to use the truncated binomial distribution as the appropriate
statistical model. The corresponding log-likelihood can be derived by considering
$$ \Pr(X_i = k \mid X_i > 0) = \frac{\Pr(X_i = k)}{\Pr(X_i > 0)}, \quad k = 1, \ldots, 6, \tag{2.6} $$

which leads to

$$ l(\pi) = \sum_{k=1}^{N} Z_k \bigl\{ k \log(\pi) + (N - k)\log(1 - \pi) \bigr\} - n \log\bigl\{1 - (1 - \pi)^{N}\bigr\}. \tag{2.7} $$

Here Zk denotes the number of patients with k positive test results, and $n = \sum_{k=1}^{N} Z_k = 196$ is the total number of diseased patients with at least one positive test result.
Computation of the MLE is now most conveniently done with numerical techniques. To do so, we write an R function log.likelihood, which returns the log-likelihood kernel of the unknown probability (pi) for a given vector of counts (data), and maximise it with the optim function.
## Truncated binomial log-likelihood function
## pi:   the parameter, the probability of a positive test result
## data: vector with counts Z_1, ..., Z_N
log.likelihood <- function(pi, data)
{
    n <- sum(data)
    k <- length(data)
    vec <- seq_len(k)
    sum(data * (vec * log(pi) + (k - vec) * log(1 - pi))) -
        n * log(1 - (1 - pi)^k)
}
data <- c(37, 22, 25, 29, 34, 49)
eps <- 1e-10
result <- optim(0.5, log.likelihood, data = data,
                method = "L-BFGS-B", lower = eps, upper = 1 - eps,
                control = list(fnscale = -1), hessian = TRUE)
(ml <- result$par)
[1] 0.6240838
The MLE turns out to be π̂ML = 0.6241, and the observed Fisher information
I (π̂ML ) = 4885.3 is computed via
(observed.fisher <- -result$hessian)
         [,1]
[1,] 4885.251
2.3 Numerical Computation of the Maximum Likelihood Estimate 33
Invariance of the MLE immediately gives the MLE of the false negative fraction
ξ = Pr(Xi = 0) via
ξ̂ML = (1 − π̂ML )N = 0.0028.
A naive estimate of the number of undetected cancer cases Z0 can be obtained
by solving Ẑ0 /(196 + Ẑ0 ) = ξ̂ML for Ẑ0 :
$$ \hat{Z}_0 = 196 \cdot \frac{\hat\xi_{ML}}{1 - \hat\xi_{ML}} = 0.55. \tag{2.8} $$
It turns out that this estimate can be justified as a maximum likelihood estimate. To
see this, note that the probability to detect a cancer case in one particular application
of the six-stage screening test is 1 − ξ . The number of samples until the first cancer
case is detected therefore follows a geometric distribution with success probability
1 − ξ , cf. Table A.1 in Appendix A.5.1. Thus, if n is the observed number of detected
cancer cases, the total number of cancer cases Z0 + n follows a negative binomial
distribution with parameters n and 1 − ξ (again cf. Appendix A.5.1):
$$ Z_0 + n \sim \operatorname{NBin}(n, 1 - \xi), \tag{2.9} $$
Example 2.13 (Screening for colon cancer) We reconsider Example 1.1.5, where
the number Zk of 196 cancer patients with k ≥ 1 positive test results among six
cancer colon screening tests has been recorded. Due to the design of the study, we
have no information on the number Z0 of patients with solely negative test results
(cf. Table 1.2). Numerical techniques allow us to fit a truncated binomial distribution
to the observed data Z1 , . . . , Z6 , cf. Example 2.12.
However, the EM algorithm could also be used to compute the MLEs. The idea
is that an explicit and simple formula for the MLE of π would be available if the
number Z0 was known as well:
$$ \hat\pi = \frac{\sum_{k=0}^{6} k \cdot Z_k}{6 \cdot \sum_{k=0}^{6} Z_k}. \tag{2.10} $$
Indeed, in this case we are back in the untruncated binomial case with $\sum_{k=0}^{6} k \cdot Z_k$ positive tests among $6 \cdot \sum_{k=0}^{6} Z_k$ tests. However, Z0 is unknown, but if π and hence ξ = (1 − π)⁶ are known, Z0 can be estimated by the expectation of a negative binomial distribution (cf. Eq. (2.9) at the end of Example 2.12):

$$ \hat{Z}_0 = \operatorname{E}(Z_0) = n \cdot \frac{\xi}{1 - \xi}, \tag{2.11} $$
where n = 196 and ξ = (1−π)6 . The EM algorithm now iteratively computes (2.10)
and (2.11) and replaces the terms Z0 and π on the right-hand sides with their current
estimates Ẑ0 and π̂ , respectively. Implementation of the algorithm is shown in the
following R-code:
## data set
fulldata <- c(NA, 37, 22, 25, 29, 34, 49)
k <- 0:6
n <- sum(fulldata[-1])
## impute start value for Z0 (first element)
## and initialise some different old value
fulldata[1] <- 10
Z0old <- 9
## the EM algorithm
while (abs(Z0old - fulldata[1]) >= 1e-7)
{
    Z0old <- fulldata[1]
    pi <- sum(fulldata * k) / sum(fulldata) / 6
    xi <- (1 - pi)^6
    fulldata[1] <- n * xi / (1 - xi)
}
This method quickly converges, as illustrated in Table 2.1, with starting value
Z0 = 10. Note that the estimate π̂ from Table 2.1 is numerically equal to the MLE
(cf. Example 2.12) already after 5 iterations. As a by-product, we also obtain the
estimate Ẑ0 = 0.55.
Note that the log-likelihood functions cannot be written in the simple form l(θ ) as
they are based on different data: l(θ ; x, u) is the complete data log-likelihood, while
l(θ ; x) is the observed data log-likelihood. Now u is unobserved, so we replace it
in (2.12) by the random variable U :
and consider the expectation of this equation with respect to f (u | x; θ′) (it will soon become clear why we distinguish θ and θ′):
Note that the last term does not change, as it does not depend on U. So if

$$ Q(\theta; \theta') \ge Q(\theta'; \theta'), \tag{2.14} $$

then also

$$ l(\theta) \ge l(\theta'), \tag{2.15} $$

where the last inequality follows from the information inequality (cf. Appendix A.3.8).
This leads to the EM algorithm with starting value θ′:
1. Expectation (E-step): Compute Q(θ; θ′).
2. Maximisation (M-step): Maximise Q(θ; θ′) with respect to θ to obtain the update θ′′.
3. Now iterate Steps 1 and 2 (i.e. set θ′ = θ′′ and apply Step 1) until the values of θ′ converge. A possible stopping criterion is |θ′′ − θ′| < ε for some small ε > 0.
Equation (2.15) implies that every iteration of the EM algorithm increases the log-likelihood. This follows from the fact that (through maximisation) Q(θ′′; θ′) is at least as large as Q(θ; θ′) for all θ, so in particular (2.14) holds with θ = θ′′, and therefore l(θ′′) ≥ l(θ′). In contrast, the Newton–Raphson algorithm is not guaranteed to increase the log-likelihood in every iteration.
Example 2.14 (Example 2.13 continued) The joint probability mass function of ob-
served data X = (Z1 , . . . , Z6 ) and unobserved data U = Z0 is multinomially dis-
tributed (cf. Appendix A.5.3) with size parameter equal to n + Z0 and probability
vector p with entries
$$ p_k = \binom{6}{k} \pi^{k} (1 - \pi)^{6 - k}, $$
k = 0, . . . , 6, which we denote by
(Z0 , Z1 , . . . , Z6 ) ∼ M7 (n + Z0 , p).
The complete data log-likelihood is therefore

$$ l(\pi) = \sum_{k=0}^{6} Z_k \log(p_k). \tag{2.16} $$

The complete data log-likelihood kernel

$$ l(\pi) = \sum_{k=0}^{6} Z_k \bigl\{ (6 - k)\log(1 - \pi) + k \log(\pi) \bigr\}, $$

easily obtained from (2.16), leads to the complete data score function
$$ S(\pi) = \frac{\sum_{k=0}^{6} k \cdot Z_k}{\pi} - \frac{\sum_{k=0}^{6} (6 - k) Z_k}{1 - \pi}. $$
One can also show that the EM algorithm always converges to a local or global
maximum, or at least to a saddlepoint of the log-likelihood. However, the conver-
gence can be quite slow; typically, more iterations are required than for the Newton–
Raphson algorithm. Another disadvantage is that the algorithm does not automati-
cally give the observed Fisher information. Of course, this can be calculated separately after the algorithm has converged.
2.4 Quadratic Approximation of the Log-Likelihood Function 37
A second-order Taylor expansion of the log-likelihood around the MLE gives the quadratic approximation

$$ l(\theta) \approx l(\hat\theta_{ML}) + \frac{dl(\hat\theta_{ML})}{d\theta}(\theta - \hat\theta_{ML}) + \frac{1}{2} \frac{d^2 l(\hat\theta_{ML})}{d\theta^2}(\theta - \hat\theta_{ML})^2 = l(\hat\theta_{ML}) + S(\hat\theta_{ML})(\theta - \hat\theta_{ML}) - \frac{1}{2}\, I(\hat\theta_{ML})(\theta - \hat\theta_{ML})^2. $$
$$ \tilde{l}(\lambda) \approx -\frac{1}{2} \frac{x}{\hat\lambda_{ML}^{2}} (\lambda - \hat\lambda_{ML})^{2}. $$
Example 2.16 (Normal model) Let X1:n denote a random sample from a normal
distribution N(μ, σ 2 ) with unknown mean μ and known variance σ 2 . We know
from Example 2.9 that
$$ l(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 = -\frac{1}{2\sigma^2} \biggl\{ \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \biggr\}, $$

so that

$$ l(\hat\mu_{ML}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad\text{and hence}\quad \tilde{l}(\mu) = l(\mu) - l(\hat\mu_{ML}) = -\frac{n}{2\sigma^2} (\bar{x} - \mu)^2, $$

i.e. the relative log-likelihood is exactly quadratic in μ.
Under certain regularity conditions, which we will not discuss here, it can be
shown that a quadratic approximation of the log-likelihood improves with increasing
sample size. The following example illustrates this phenomenon in the binomial
model.
Example 2.17 (Binomial model) Figure 2.8 displays the relative log-likelihood of
the success probability π in a binomial model with sample size n = 10, 50, 200,
1000. The observed datum x has been fixed at x = 8, 40, 160, 800 such that the MLE
of π is π̂ML = 0.8 in all four cases. We see that the quadratic approximation of the
relative log-likelihood improves with increasing sample size n. The two functions
are nearly indistinguishable for n = 1000.
Example 2.18 (Uniform model) Let X1:n denote a random sample from a contin-
uous uniform distribution U(0, θ ) with unknown upper limit θ ∈ R+ . The density
function of the uniform distribution is

Fig. 2.8 Quadratic approximation (dashed line) of the relative log-likelihood (solid line) of the success probability π in a binomial model

$$ f(x; \theta) = \frac{1}{\theta}\, \mathrm{I}_{[0,\theta]}(x) $$

with indicator function I_A(x) equal to one if x ∈ A and zero otherwise. The likelihood function of θ is

$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = \begin{cases} \theta^{-n} & \text{for } \theta \geq \max_i(x_i), \\ 0 & \text{otherwise}, \end{cases} $$
Fig. 2.9 Likelihood and log-likelihood function for a random sample of different size n from a
uniform distribution with unknown upper limit θ . Quadratic approximation of the log-likelihood is
impossible even for large n
are

$$ S(\hat\theta_{ML}) = \frac{dl(\hat\theta_{ML})}{d\theta} = -\frac{n}{\hat\theta_{ML}} \neq 0 \quad\text{and}\quad -I(\hat\theta_{ML}) = \frac{d^2 l(\hat\theta_{ML})}{d\theta^2} = \frac{n}{\hat\theta_{ML}^{2}} > 0, $$
so the log-likelihood l(θ ) is not concave but convex, with negative (!) observed
Fisher information, cf. Fig. 2.9b. It is obvious from Fig. 2.9b that a quadratic ap-
proximation to l(θ ) will remain poor even if the sample size n increases. The reason
for the irregular behaviour of the likelihood function is that the support of the uni-
form distribution depends on the unknown parameter θ .
2.5 Sufficiency
Definition 2.8 (Statistic) Let x1:n denote the realisation of a random sample X1:n
from a distribution with probability mass or density function f (x; θ ). Any function
T = h(X1:n ) of X1:n with realisation t = h(x1:n ) is called a statistic.
A statistic T = h(X1:n) is called sufficient for θ if the conditional distribution of X1:n given T = t, i.e. f (x1:n | T = t), does not depend on θ.
Example 2.19 (Poisson model) Let x1:n denote the realisation of a random sample
X1:n from a Po(λ) distribution with unknown rate parameter λ. The statistic T =
X1 + · · · + Xn is sufficient for λ since the conditional distribution of X1:n | T = t is
multinomial with parameters not depending on λ. Indeed, first note that f (t | X1:n =
x1:n ) = 1 if t = x1 + · · · + xn and 0 elsewhere. We also know from Appendix A.5.1
that T ∼ Po(nλ), and therefore we have
$$ f(x_{1:n} \mid t) = \frac{f(t \mid x_{1:n})\, f(x_{1:n})}{f(t)} = \frac{f(x_{1:n})}{f(t)} = \frac{\prod_{i=1}^{n} \frac{\lambda^{x_i}}{x_i!} \exp(-\lambda)}{\frac{(n\lambda)^{t}}{t!} \exp(-n\lambda)} = \frac{t!}{\prod_{i=1}^{n} x_i!} \left(\frac{1}{n}\right)^{t}, $$
A sufficient statistic T contains all relevant information from the sample X1:n
with respect to θ . To show that a certain statistic is sufficient, the following result is
helpful.
Result 2.2 (Factorisation theorem) Let f (x1:n; θ) denote the probability mass or density function of the random sample X1:n. A statistic T = h(X1:n) with realisation t = h(x1:n) is sufficient for θ if and only if there exist functions g1(t; θ) and g2(x1:n) such that for all possible realisations x1:n and all possible parameter values θ ∈ Θ,

$$ f(x_{1:n}; \theta) = g_1(t; \theta) \cdot g_2(x_{1:n}). \tag{2.18} $$

Note that g1(t; θ) depends on the argument x1:n only through t = h(x1:n), but also depends on θ. The second term g2(x1:n) must not depend on θ.
A proof of this result can be found in Casella and Berger (2001, p. 276). As a
function of θ , we can easily identify g1 (t; θ ) as the likelihood kernel, cf. Defini-
tion 2.3. The second term g2 (x1:n ) is the corresponding multiplicative constant.
Example 2.20 (Poisson model) We already know from Example 2.19 that T =
h(X1:n ) = X1 + · · · + Xn is sufficient for λ, so the factorisation (2.18) must hold.
This is indeed the case:
$$ f(x_{1:n}; \lambda) = \prod_{i=1}^{n} f(x_i; \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i}}{x_i!} \exp(-\lambda) = \underbrace{\lambda^{t} \exp(-n\lambda)}_{g_1(t;\,\lambda)} \cdot \underbrace{\prod_{i=1}^{n} \frac{1}{x_i!}}_{g_2(x_{1:n})}. $$
Example 2.21 (Normal model) Let x1:n denote a realisation of a random sample
from a normal distribution N(μ, σ 2 ) with known variance σ 2 . We now show that
the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ is sufficient for μ. First, note that

$$ f(x_{1:n}; \mu) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} \cdot \frac{(x_i - \mu)^2}{\sigma^2}\right\} = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2} \cdot \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{\sigma^2}\right\}. $$

Now

$$ \sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2, $$

so the density factorises into a term that depends on μ only through x̄ and a term not involving μ.
Result 2.2 now ensures that the sample mean X̄ is sufficient for μ. Note that, for
example, also $n\bar{X} = \sum_{i=1}^{n} X_i$ is sufficient for μ.
Suppose now that also σ 2 is unknown, i.e. θ = (μ, σ 2 ), and assume that n ≥ 2.
It is easy to show that now T = (X̄, S²), where

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$

denotes the sample variance, is sufficient for θ = (μ, σ²).
Example 2.22 (Blood alcohol concentration) If we are prepared to assume that the
transformation factor is normally distributed, knowledge of n, x̄ and s 2 (or s) in
each group (cf. Table 1.3) is sufficient to formulate the likelihood function. It is not
necessary to know the actual observations.
The quantity

$$ \Lambda_{x_{1:n}}(\theta_1, \theta_2) = \frac{L(\theta_1; x_{1:n})}{L(\theta_2; x_{1:n})} $$

is the likelihood ratio of one parameter value θ1 relative to another parameter value θ2 with respect to the realisation x1:n of a random sample X1:n.
Note that likelihood ratios between any two parameter values θ1 and θ2 can be calculated from the relative likelihood function L̃(θ; x1:n). Note also that

$$ \Lambda_{x_{1:n}}(\theta, \hat\theta_{ML}) = \tilde{L}(\theta; x_{1:n}) $$

because L̃(θ̂ML; x1:n) = 1, so the relative likelihood function can be recovered from the likelihood ratio.
Result 2.3 A statistic T = h(X1:n) is sufficient for θ if and only if for any pair x1:n and x̃1:n such that h(x1:n) = h(x̃1:n),

$$ \Lambda_{x_{1:n}}(\theta_1, \theta_2) = \Lambda_{\tilde{x}_{1:n}}(\theta_1, \theta_2) \tag{2.19} $$

for all θ1, θ2 ∈ Θ.
Proof We show the equivalence of the factorisation (2.18) and Eq. (2.19). Suppose that (2.18) holds. Then

$$ \Lambda_{x_{1:n}}(\theta_1, \theta_2) = \frac{g_1(h(x_{1:n}); \theta_1)\, g_2(x_{1:n})}{g_1(h(x_{1:n}); \theta_2)\, g_2(x_{1:n})} = \frac{g_1(h(x_{1:n}); \theta_1)}{g_1(h(x_{1:n}); \theta_2)}, $$

so if h(x1:n) = h(x̃1:n), we have Λx1:n(θ1, θ2) = Λx̃1:n(θ1, θ2) for all θ1 and θ2.
Conversely, if (2.19) holds, one can construct functions g1 and g2 such that (2.18) holds.
Result 2.4 Let L(θ ; x1:n ) denote the likelihood function with respect to a realisa-
tion x1:n of a random sample X1:n . Let L(θ ; t) denote the likelihood with respect to
the realisation t = h(x1:n ) of a sufficient statistic T = h(X1:n ) for θ . For all possible
realisations x1:n , the ratio
$$ \frac{L(\theta; x_{1:n})}{L(\theta; t)} $$
will then not depend on θ , i.e. the two likelihood functions are (up to a proportion-
ality constant) identical.
Proof To show Result 2.4, first note that f (t | x1:n) = 1 if t = h(x1:n) and 0 otherwise, so f (x1:n, t) = f (x1:n) f (t | x1:n) = f (x1:n) if t = h(x1:n). For t = h(x1:n), the likelihood function can therefore be written as

$$ L(\theta; x_{1:n}) = f(x_{1:n}; \theta) = f(x_{1:n}, t; \theta) = f(x_{1:n} \mid t)\, f(t; \theta) \propto L(\theta; t), $$

since f (x1:n | t) does not depend on θ by the sufficiency of T.
Example 2.23 (Binomial model) Let X1:n denote a random sample from a
Bernoulli distribution B(π) with unknown parameter π ∈ (0, 1). The likelihood
function based on the realisation x1:n equals
$$ L(\pi; x_{1:n}) = f(x_{1:n}; \pi) = \prod_{i=1}^{n} \pi^{x_i} (1 - \pi)^{1 - x_i} = \pi^{t} (1 - \pi)^{n - t}, $$

where $t = \sum_{i=1}^{n} x_i$. Obviously, $T = h(X_{1:n}) = \sum_{i=1}^{n} X_i$ is a sufficient statistic for π.
Now T follows the binomial distribution Bin(n, π), so the likelihood function with
respect to its realisation t is
$$ L(\pi; t) = \binom{n}{t} \pi^{t} (1 - \pi)^{n - t}. $$

As Result 2.4 states, the likelihood functions with respect to x1:n and t are identical up to the multiplicative constant $\binom{n}{t}$.
Example 2.23 has shown that regarding the information about the proportion π ,
the whole random sample X1:n can be compressed into the total number of successes $T = \sum_{i=1}^{n} X_i$ without any loss of information. This will be important in Chap. 4,
where we consider asymptotic properties of ML estimation, i.e. properties of certain
statistics for sample size n → ∞. Then we can consider a single binomial random
variable X ∼ Bin(n, π) because it implicitly contains the whole information of n
independent Bernoulli random variables with respect to π . We can also approximate
the binomial distribution Bin(n, π) by a Poisson distribution Po(nπ) when π is
small compared to n. Therefore, we can consider a single Poisson random variable
Po(eλ) and assume that the asymptotic properties of derived statistics are a good
approximation of their finite sample properties. We will often use this Poisson model
parametrisation with expected number of cases e = n · p and relative risk λ = π/p,
using a reference probability p, see e.g. Example 2.4.
We have seen in the previous section that sufficient statistics are not unique. In
particular, the original sample X1:n is always sufficient due to Result 2.2:
The concept of minimal sufficiency ensures that a sufficient statistic cannot be re-
duced further.
The following result describes the relationship between two minimal sufficient
statistics.
Result 2.5 If T and T̃ are minimal sufficient statistics, then there exists a one-to-
one function g such that T̃ = g(T ) and T = g −1 (T̃ ).
Result 2.6 A necessary and sufficient criterion for a statistic T = h(X1:n) to be minimal sufficient is that h(x1:n) = h(x̃1:n) if and only if

$$ \Lambda_{x_{1:n}}(\theta_1, \theta_2) = \Lambda_{\tilde{x}_{1:n}}(\theta_1, \theta_2) $$

for all θ1, θ2.
This implies that the likelihood function contains the whole information of the data
with respect to θ . Any further reduction will result in information loss.
Example 2.24 (Normal model) Let x1:n denote a realisation of a random sample
from a normal distribution N(μ, σ 2 ) with known variance σ 2 . The mean h(X1:n ) =
X̄ is minimal sufficient for μ, whereas T̃ (X1:n ) = (X̄, S 2 ) is sufficient but not min-
imal sufficient for μ.
Are there general principles for inferring information from data? In the previous
section we have seen that sufficient statistics contain the complete information of
a sample with respect to an unknown parameter. It is thus natural to state the suffi-
ciency principle:
Sufficiency principle
Consider a random sample X1:n from a distribution with probability mass or
density function f (x; θ ) and unknown parameter θ . Assume that T = h(X1:n )
is a sufficient statistic for θ . If h(x1:n ) = h(x̃1:n ) for two realisations of X1:n ,
then inference for θ should be the same whether x1:n or x̃1:n has been ob-
served.
Likelihood principle
If realisations x1:n and x̃1:n from a random sample X1:n with probability
mass or density function f (x; θ ) have proportional likelihood functions, i.e.
L(θ ; x1:n ) ∝ L(θ ; x̃1:n ) for all θ , then inference for θ should be the same,
whether x1:n or x̃1:n is observed.
This principle is also called the weak likelihood principle to distinguish it from the
strong likelihood principle:
2.6 Exercises
1. Examine the likelihood function in the following examples.
(a) In a study of a fungus that infects wheat, 250 wheat seeds are dissemi-
nated after contaminating them with the fungus. The research question is
how large the probability θ is that an infected seed can germinate. Due to
technical problems, the exact number of germinated seeds cannot be eval-
uated, but we know only that less than 25 seeds have germinated. Write
down the likelihood function for θ based on the information available
from the experiment.
(b) Let X1:n be a random sample from an N(θ, 1) distribution. However, only
the largest value of the sample, Y = max(X1 , . . . , Xn ), is known. Show
that the density of Y is
$$ f(y) = n \bigl\{\Phi(y - \theta)\bigr\}^{n-1} \varphi(y - \theta), \quad y \in \mathbb{R}, $$
where Φ(·) is the distribution function, and ϕ(·) is the density function of
the standard normal distribution N(0, 1). Derive the distribution function
of Y and the likelihood function L(θ ).
(c) Let X1:3 denote a random sample of size n = 3 from a Cauchy C(θ, 1)
distribution, cf. Appendix A.5.2. Here θ ∈ R denotes the location param-
eter of the Cauchy distribution with density
$$ f(x) = \frac{1}{\pi} \cdot \frac{1}{1 + (x - \theta)^2}, \quad x \in \mathbb{R}. $$
Derive the likelihood function for θ .
$$ l(\alpha) = -\frac{1}{2} \sum_{i=1}^{n} (x_i - \alpha x_{i-1})^2. $$
(b) Derive the score equation for α, compute α̂ML and verify that it is really
the maximum of l(α).
(c) Create a plot of l(α) and compute α̂ML for the following sample:
(x0 , . . . , x6 ) = (−0.560, −0.510, 1.304, 0.722, 0.490, 1.960, 1.441).
3. Show that in Example 2.2 the likelihood function L(N ) is maximised at N̂ = ⌊M · n/x⌋, where ⌊x⌋ is the largest integer not greater than x. To this end, analyse the monotonic behaviour of the ratio L(N )/L(N − 1). In which cases is the MLE not unique? Give a numeric example.
4. Derive the MLE of π for an observation x from a geometric Geom(π) distri-
bution. What is the MLE of π based on a realisation x1:n of a random sample
from this distribution?
5. A sample of 197 animals has been analysed regarding a specific phenotype. The
number of animals with phenotypes AB, Ab, aB and ab, respectively, turned out
to be
x = (x1 , x2 , x3 , x4 ) = (125, 18, 20, 34) .
A genetic model now assumes that the counts are realisations of a multinomi-
ally distributed multivariate random variable X ∼ M4 (n, π) with n = 197 and
probabilities π1 = (2 + φ)/4, π2 = π3 = (1 − φ)/4 and π4 = φ/4 (Rao 1973,
p. 368).
(a) What is the parameter space of φ? See Table A.3 in Appendix A for
details on the multinomial distribution and the parameter space of π .
(b) Show that the likelihood kernel function for φ, based on the observa-
tion x, has the form
L(φ) = (2 + φ)^{m1} (1 − φ)^{m2} φ^{m3}
and derive expressions for m1 , m2 and m3 depending on x.
2.7 References
Contents
3.1 Unbiasedness and Consistency
3.2 Standard Error and Confidence Interval
3.2.1 Standard Error
3.2.2 Confidence Interval
3.2.3 Pivots
3.2.4 The Delta Method
3.2.5 The Bootstrap
3.3 Significance Tests and P-Values
3.4 Exercises
3.5 References
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 51
https://doi.org/10.1007/978-3-662-60792-3_3,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
isation x1:n of X1:n . The MLE θ̂ML is one particular estimate of θ , but here we will
consider any possible estimate θ̂ . To investigate and compare frequentist properties
of different estimates we first define the notion of an estimator based on a random
sample.
Consider a statistic
Tn = h(X1:n),
based on a random sample X1:n from a distribution with probability mass or density function f (x; θ), where θ is an unknown scalar parameter. If the random variable
Tn is computed to make inference about θ , then it is called an estimator. We may
simply write T rather than Tn if the sample size n is not important. The particular
value t = h(x1:n ) that an estimator takes for a realisation x1:n of the random sample
X1:n is called an estimate.
What is a good estimator? At first glance it seems reasonable that a useful esti-
mator is “on average” equal to the true value θ . This leads to the notion of unbiased
estimators.
An estimator Tn of θ is called unbiased if
E(Tn) = θ
for all θ ∈ Θ and for all n ∈ N. Otherwise, the estimator Tn is called biased.
Example 3.1 (Sample variance) Let X1:n denote a random sample from a distribu-
tion with unknown mean μ and variance σ 2 > 0. It is easy to show that the sample
mean X̄ = n⁻¹ Σ_{i=1}^n Xᵢ has the following properties:
E(X̄) = μ and Var(X̄) = σ²/n.
So μ̂ = X̄ is an unbiased estimator of the mean μ. The sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (Xᵢ − X̄)² (3.1)
is unbiased for the true variance σ 2 . To see this, first note that
Σ_{i=1}^n (Xᵢ − X̄)² = Σ_{i=1}^n {(Xᵢ − μ) − (X̄ − μ)}²
= Σ_{i=1}^n (Xᵢ − μ)² − 2 Σ_{i=1}^n (Xᵢ − μ)(X̄ − μ) + n(X̄ − μ)²
= Σ_{i=1}^n (Xᵢ − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)²
= Σ_{i=1}^n (Xᵢ − μ)² − n(X̄ − μ)²,
so
E{(n − 1)S²} = E{Σ_{i=1}^n (Xᵢ − X̄)²} = Σ_{i=1}^n E{(Xᵢ − μ)²} − n E{(X̄ − μ)²} = nσ² − n · σ²/n = (n − 1)σ²,
which shows that E(S²) = σ².
and unknown parameter π ∈ (0, 1), cf. Table A.1. We assume that the parameter
space is an open interval: the limits π = 0 and π = 1 are not included. This is a common assumption since f (x; π = 0) is not a proper distribution and X is deterministic for π = 1, i.e. only the realisation x = 1 can occur.
Now there is only one unbiased estimator of π because the requirement
E{T (X)} = Σ_{x=1}^∞ T (x) π(1 − π)^{x−1} = π (3.2)
must hold for all π ∈ (0, 1). Dividing (3.2) by π shows that Σ_{x=1}^∞ T (x)(1 − π)^{x−1} = 1 for all π ∈ (0, 1), and the two power series are equal only if all their coefficients are identical. This forces
T (x) = 1 for x = 1 and T (x) = 0 for x > 1. (3.3)
So (3.3) is the only unbiased estimator of π but appears to be not very sensible, as
it can only take two distinct values on the border of the parameter space. In contrast,
the ML estimator π̂ML = 1/X (cf. Exercise 4 in Chap. 2) seems more sensible but
is biased. Indeed, we have E(X) = 1/π, and since h(x) = 1/x is strictly convex on the positive real line and X is not degenerate, Jensen's inequality gives E(π̂ML) = E(1/X) > 1/E(X) = π.
The underlying idea for defining consistency is that the estimator should be able
to identify the true parameter value when the sample size increases. Appendix A.4.1
lists important relationships between the different modes of convergence. These
properties translate to the relationships between the corresponding notions of consis-
tency. In particular, a mean square consistent estimate is also consistent. Application
of the continuous mapping theorem (see Appendix A.4.2) shows that any continu-
ous function h of a consistent estimator for θ is consistent for h(θ ). One can also
establish that a mean square consistent estimator is asymptotically unbiased, while
this is in general not true for a consistent estimator.
Consistency in mean square implies that
E{(Tn − θ)²} (3.4)
goes to zero as n → ∞. The quantity (3.4) is called the mean squared error (MSE).
The mean squared error is of particular importance because the following decom-
position holds:
MSE = Var(Tn) + {E(Tn) − θ}². (3.5)
The quantity E(Tn ) − θ is called the bias of Tn . The mean squared error is therefore
the sum of the variance of an estimator and its squared bias. Note that, of course,
the bias of an asymptotically unbiased estimator goes to zero for n → ∞. If the
variance of the estimator also goes to zero as n → ∞, then the estimator is, due to
the decomposition (3.5), consistent in mean square and also consistent.
Example 3.3 (Sample variance) As we have seen in Example 3.1, the sample vari-
ance S 2 is unbiased for σ 2 . An alternative estimator of σ 2 is
S̃² = (1/n) Σ_{i=1}^n (Xᵢ − X̄)²,
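A small simulation can illustrate how the two variance estimators behave under the decomposition (3.5). The following Python sketch is our own illustration (the book's computations use R); the normal model, sample size and true variance are invented for the example:

```python
import random
from statistics import mean

random.seed(42)
n, sigma2, runs = 10, 4.0, 20000   # invented: N(0, 4) samples of size 10

def variances(sample):
    """Return the unbiased S^2 and the variant S~^2 dividing by n."""
    xbar = mean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)
    return ss / (len(sample) - 1), ss / len(sample)

s2, s2t = zip(*(variances([random.gauss(0, 2) for _ in range(n)])
                for _ in range(runs)))

for name, est in [("S^2", s2), ("S~^2", s2t)]:
    m = mean(est)
    bias = m - sigma2
    var = mean((e - m) ** 2 for e in est)
    mse = mean((e - sigma2) ** 2 for e in est)
    # decomposition (3.5): MSE = Var + bias^2
    print(name, round(bias, 3), round(var + bias ** 2, 3), round(mse, 3))
```

In such runs the empirical bias of S² is close to zero while S̃² is biased downwards, yet S̃² can still have the smaller MSE because of its smaller variance.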
An estimator T will be equal to the true parameter θ only in rare cases. However, it
will often be close to θ in a certain sense if the estimator is useful. The standard er-
ror quantifies how much an estimator varies in hypothetical independent repetitions
of the sampling experiment. If the estimator is unbiased or at least asymptotically
unbiased, then the standard error is a consistent estimator of the standard deviation
of T . To keep notation simple we will use the term standard error for both the
estimator and its realisation.
Starting with a definition of the standard error, we then introduce confidence intervals, i.e. intervals which cover plausible values of the
unknown parameter. Confidence intervals are interval estimators and consist of a
lower and an upper limit. This is in contrast to the real-valued (point) estimator. We
will use the term confidence interval for both the estimator and its realisation.
Definition 3.5 (Standard error) Let X1:n denote a random sample, and Tn = h(X1:n )
an estimator of an unknown parameter θ . Suppose V is a consistent estimator
of Var(Tn). Then the standard error of Tn is
se(T) = √V.
The continuous mapping theorem (see Appendix A.4.2) guarantees that the standard error √V is a consistent estimator of the standard deviation √Var(T).
Example 3.4 (Sample variance) Let X1:n denote a random sample from a distribu-
tion with unknown mean μ and variance σ 2 . Now μ̂ = X̄ is an unbiased estimator
of μ, and S 2 is a consistent (and even unbiased) estimator of σ 2 . We further have
Var(X̄) = σ 2 /n, so V = S 2 /n is a consistent estimator of Var(X̄), and we obtain
the following standard error of μ̂ = X̄:
se(X̄) = S/√n.
Using Example 3.3, we see that S̃/√n is also a standard error of X̄, which illustrates that an estimator can have different standard errors.
Definition 3.6 (Confidence interval) For fixed γ ∈ (0, 1), a γ · 100 % confidence interval for θ is defined by two statistics Tl = hl(X1:n) and Tu = hu(X1:n), based on the random sample X1:n, such that
Pr(Tl ≤ θ ≤ Tu ) = γ
for all θ ∈ Θ. The statistics Tl and Tu are the limits of the confidence interval,
and we assume Tl ≤ Tu throughout. The confidence level γ is also called coverage
probability.
The limits of a confidence interval are functions of the random sample X1:n and
therefore also random. In contrast, the unknown parameter θ is fixed. If we imagine
identical repetitions of the underlying statistical sampling experiment, then a γ ·
100 % confidence interval will cover the unknown parameter θ in γ · 100 % of all
cases.
Confidence interval
For repeated random samples from a distribution with unknown parameter θ ,
a γ · 100 % confidence interval will cover θ in γ · 100 % of all cases.
However, we are not allowed to say that the unknown parameter θ is within a
γ · 100 % confidence interval with probability γ since θ is not a random variable.
Such a Bayesian interpretation is possible exclusively for credible intervals, see
Definition 6.3.
In the following we will concentrate on the commonly used two-sided confi-
dence intervals. One-sided confidence intervals can be obtained using Tl = −∞ or
Tu = ∞. However, such intervals are rarely used in practice.
Example 3.5 (Normal model) Let X1:n denote a random sample from a normal
distribution N(μ, σ 2 ) with known variance σ 2 . The interval [Tl , Tu ] with limits
Tl = X̄ − q · σ/√n and
Tu = X̄ + q · σ/√n
defines a γ · 100 % confidence interval for μ, where q = z(1+γ )/2 denotes the (1 +
γ )/2 quantile of the standard normal distribution. To prove this, we have to show
that [Tl , Tu ] has coverage probability γ for all μ. Now
X̄ ∼ N(μ, σ²/n),
so
Z = √n (X̄ − μ)/σ (3.6)
has a standard normal distribution. Due to the symmetry of the standard normal distribution around zero, we have z(1−γ)/2 = −q, and therefore
γ = Pr(−q ≤ Z ≤ q). (3.7)
Plugging the definition (3.6) into (3.7) and rearranging so that the unknown param-
eter μ is in the centre of the inequalities, we finally obtain the coverage probability
Pr(Tl ≤ μ ≤ Tu ) = γ .
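For illustration, the interval of Example 3.5 can be computed with Python's standard library (`statistics.NormalDist` supplies the quantile z(1+γ)/2); the data values below are invented and the function name is ours, not from the book:

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_ci(data, sigma, gamma=0.95):
    """Interval X̄ ± z_{(1+γ)/2} · σ/√n for μ, with σ known."""
    q = NormalDist().inv_cdf((1 + gamma) / 2)   # z_{(1+γ)/2}
    xbar, n = mean(data), len(data)
    half = q * sigma / sqrt(n)
    return xbar - half, xbar + half

lo, hi = normal_ci([4.2, 5.1, 4.8, 5.6, 4.9], sigma=1.0)  # invented data
```

The interval is symmetric around the sample mean, with half-width z(1+γ)/2 · σ/√n.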
As for estimators, the question arises how we can characterise a “good” confi-
dence interval. We will discuss this topic at a later stage in Sect. 4.6. As a caution-
ary note, the following example illustrates that properly defined confidence intervals
may not be sensible at all from a likelihood perspective.
Tl = min(X1 , X2 ) and
Tu = max(X1 , X2 )
3.2.3 Pivots
Definition 3.7 (Pivot) A pivot is a function of the data X (viewed as random) and
the true parameter θ , with distribution not depending on θ . The distribution of a
pivot is called the pivotal distribution. An approximate pivot is a function of the data and θ whose asymptotic distribution does not depend on the true parameter θ.
Example 3.7 (Exponential model) Let X1:n denote a random sample from an ex-
ponential distribution Exp(λ), i.e. F (x) = Pr(Xi ≤ x) = 1 − exp(−λx), where the
rate parameter λ is unknown. The distribution function of λXi is therefore
Pr(λXᵢ ≤ x) = Pr(Xᵢ ≤ x/λ) = 1 − exp(−x),
so λXi ∼ Exp(1) = G(1, 1) no matter what the actual value of λ is. The sum
Σ_{i=1}^n λXᵢ = λnX̄ (3.8)
then follows a gamma G(n, 1) distribution (cf. Appendix A.5.2) and is therefore a pivot for λ.
For illustration, consider n = 47 non-censored survival times from Sect. 1.1.8
and assume that they follow an exponential distribution with unknown parameter λ.
The total survival time in this sample is nx̄ = 53 146 days. The 2.5 % and 97.5 %
quantiles of the G(47, 1) distribution are q0.025 = 34.53 and q0.975 = 61.36, respec-
tively, so
Pr(34.53 ≤ λnX̄ ≤ 61.36) = 0.95.
A rearrangement gives
Pr(34.53/(nX̄) ≤ λ ≤ 61.36/(nX̄)) = 0.95,
and we obtain a 95 % confidence interval for the rate λ with limits 6.5 · 10−4 and
1.15 · 10−3 events per day. The inverse values finally give a 95 % confidence interval
for the expected survival time E(Xi ) = 1/λ from 866.2 to 1539.0 days.
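The arithmetic of this example is easily reproduced. In the sketch below the gamma quantiles are taken from the text (a routine such as R's `qgamma(c(0.025, 0.975), shape = 47)` would reproduce them); only standard-library Python is used:

```python
# Pivot-based CI for the exponential rate: λ·n·X̄ ~ G(n, 1).
n, total = 47, 53146.0           # number of observations and n·x̄ in days
q_lo, q_hi = 34.53, 61.36        # 2.5% and 97.5% quantiles of G(47, 1)
lam_lo, lam_hi = q_lo / total, q_hi / total
print(lam_lo, lam_hi)            # ≈ 6.5e-4 and 1.15e-3 events per day
print(1 / lam_hi, 1 / lam_lo)    # ≈ 866 and 1539 days mean survival
```

Note how the interval for the mean survival time 1/λ is obtained simply by inverting and swapping the limits for λ.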
Example 3.8 (Normal model) Let X1:n denote a random sample from a normal
distribution N(μ, σ 2 ), where both mean μ and variance σ 2 are unknown. Suppose
that the parameter of interest is μ while σ 2 is a nuisance parameter. The t statistic
T = √n (X̄ − μ)/S ∼ t(n − 1)
is a pivot for μ since the distribution of T is independent of μ and T does not depend
on σ 2 . It is well known that T follows a standard t distribution (cf. Appendix A.5.2)
with n − 1 degrees of freedom, thus we have
γ = Pr{t(1−γ)/2(n − 1) ≤ T ≤ t(1+γ)/2(n − 1)},
where tα (k) denotes the α quantile of the standard t distribution with k degrees of
freedom. Using the property tα (n − 1) = −t1−α (n − 1), which follows from the
symmetry of the standard-t distribution around zero, the interval with limits
X̄ ± (S/√n) · t(1+γ)/2(n − 1) (3.9)
is a γ · 100 % confidence interval for μ.
If instead σ 2 is the parameter of interest, then
(n − 1)S²/σ² ∼ χ²(n − 1)
is a pivot for σ 2 with a chi-squared distribution with n − 1 degrees of freedom as
pivotal distribution. Using this pivot, we can easily construct confidence intervals
for σ 2 :
γ = Pr{χ²(1−γ)/2(n − 1) ≤ (n − 1)S²/σ² ≤ χ²(1+γ)/2(n − 1)}
= Pr{1/χ²(1−γ)/2(n − 1) ≥ σ²/((n − 1)S²) ≥ 1/χ²(1+γ)/2(n − 1)}
= Pr{(n − 1)S²/χ²(1+γ)/2(n − 1) ≤ σ² ≤ (n − 1)S²/χ²(1−γ)/2(n − 1)}, (3.10)
where χ²α(k) denotes the α quantile of the chi-squared distribution with k degrees of freedom.
Example 3.9 (Blood alcohol concentration) We want to illustrate the method with
the study on blood alcohol concentration, cf. Sect. 1.1.7. Assuming that the trans-
formation factors X1:n follow a normal distribution N(μ, σ 2 ), we obtain the 95 %
confidence interval (3.9) for μ from 2414.7 to 2483.7. The point estimate is the
arithmetic mean x̄ = 2449.2. Furthermore, computing (3.10) and taking the square
root of both limits yields the 95 % confidence interval from 215.8 to 264.8 for the
standard deviation σ . Note that the bounds are not symmetric around the point esti-
mate s = 237.8.
Result 3.1 (z-statistic) Let X1:n denote a random sample from some distribution
with probability mass or density function f (x; θ ). Let Tn denote a consistent esti-
mator of the parameter θ with standard error se(Tn ). Then the z-statistic
Z(θ) = (Tn − θ)/se(Tn) (3.11)
is an approximate pivot, which under regularity conditions follows an asymptotic standard normal distribution, so
Tn ± z(1+γ)/2 · se(Tn) (3.12)
are the limits of an approximate γ · 100 % confidence interval for θ. It is often called
the Wald confidence interval.
The name of this approximate pivot derives from the quantiles of the standard
normal distribution, which are often (also in this book) denoted by the symbol z.
We now show why the z-statistic does indeed have an approximate standard normal
distribution:
By Slutsky’s theorem (cf. Appendix A.4.2) it follows that we can replace the un-
known standard deviation of the estimator with the standard error, i.e.
{√Var(Tn)/se(Tn)} · {(Tn − θ)/√Var(Tn)} = (Tn − θ)/se(Tn) ∼ N(0, 1) asymptotically.
Example 3.10 (Analysis of survival times) The standard deviation of the survival
times is s = 874.4 days, so we obtain with (3.12) the limits of an approximate 95 %
confidence interval for the mean survival time E(Xi ):
x̄ ± 1.96 · 874.4/√47 = 1130.8 ± 250.0, i.e. 880.8 and 1380.8 days.
Here we have used the 97.5 % quantile z0.975 ≈ 1.96 of the standard normal dis-
tribution in the calculation. This number is worth remembering because it appears
very often in formulas for approximate 95 % confidence intervals.
This confidence interval is different from the one derived under the assumption
of an exponential distribution, compare Example 3.7. This can be explained by the
fact that the empirical standard deviation of the observed survival times s = 874.4 is
smaller than the empirical mean x̄ = 1130.8. By contrast, under the assumption of
an exponential distribution, the standard deviation must equal the mean. Therefore,
the above confidence interval is narrower.
Example 3.11 (Inference for a proportion) We now consider the problem sketched
in Sect. 1.1.1 and aim to construct a confidence interval for the unknown success
probability π of a binomial sample X ∼ Bin(n, π), where X denotes the number of
successes, and n is the (known) number of trials. A commonly used estimator of π
is π̂ = X/n with variance
Var(π̂) = π(1 − π)/n,
cf. Example 2.10. Because π̂ is consistent for π,
se(π̂) = √{π̂(1 − π̂)/n}
is a valid standard error, and the Wald confidence interval (3.13) for π has the limits π̂ ± z(1+γ)/2 · se(π̂).
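A Wald interval for a proportion can be sketched in a few lines of Python (the function name is ours; `statistics.NormalDist` supplies the quantile z(1+γ)/2):

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, gamma=0.95):
    """Wald interval: π̂ ± z_{(1+γ)/2} · sqrt(π̂(1−π̂)/n)."""
    p = x / n                                   # MLE of π
    se = sqrt(p * (1 - p) / n)                  # standard error of π̂
    q = NormalDist().inv_cdf((1 + gamma) / 2)
    return p - q * se, p + q * se
```

For x = 2 successes in n = 100 trials this gives roughly (−0.007, 0.047); a negative lower limit lies outside the parameter space, the problem taken up in Example 3.12.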
It is possible that there exists more than one approximate pivot for a particular
parameter θ . For finite sample size, we typically have different confidence intervals,
and we need criteria to compare their properties. One particularly useful criterion
is the actual coverage probability of a confidence interval with nominal coverage
probability γ . Complications will arise as the actual coverage probability may de-
pend on the unknown parameter θ . Another criterion is the width of the confidence
interval because smaller intervals are to be preferred for a given coverage probabil-
ity. In Example 4.22 we will describe a case study where we empirically compare
various confidence intervals for proportions.
3.2.4 The Delta Method
Suppose θ̂ is a consistent estimator of θ with standard error se(θ̂). The delta method (cf. Appendix A.4.5) allows us to compute a standard error of h(θ̂), where h(θ) is some transformation of θ with dh(θ)/dθ ≠ 0.
Assuming that dh(θ)/dθ ≠ 0 at the true value θ, the delta method allows us to compute a Wald confidence interval for h(θ) with limits
h(θ̂) ± z(1+γ)/2 · |dh(θ̂)/dθ| · se(θ̂).
For a strictly increasing transformation h,
Tl ≤ h(θ) ≤ Tu
is equivalent to
h⁻¹(Tl) ≤ θ ≤ h⁻¹(Tu).
Hence, the coverage probability of the back-transformed confidence interval
[h⁻¹(Tl), h⁻¹(Tu)] for θ will be the same as for the confidence interval [Tl, Tu]
for h(θ). For the Wald confidence intervals, the limits of the back-transformed confidence interval will in general not equal the limits of a Wald confidence interval computed directly for θ.
Example 3.12 (Inference for a proportion) Let X ∼ Bin(n, π) and suppose that n = 100, x = 2, i.e. π̂ = 0.02 with se(π̂) = √{π̂(1 − π̂)/n} = 0.014. In Example 3.11 we derived the 95 % Wald confidence interval (3.13) for π with limits 0.02 ± 1.96 · 0.014, i.e. −0.007 and 0.047, so the lower limit is negative and thus outside the parameter space (0, 1).
Alternatively, one can first apply the logit function
φ = h(π) = logit(π) = log{π/(1 − π)},
which transforms a probability π ∈ (0, 1) to the log odds φ ∈ R. We now need to
compute the MLE of φ and its standard error. Invariance of the MLE gives
φ̂ML = logit(π̂ML) = log{π̂ML/(1 − π̂ML)} = log{x/(n − x)}.
The standard error of φ̂ML can be computed with the delta method:
se(φ̂ML) = se(π̂ML) · dh(π̂ML)/dπ,
where
dh(π)/dπ = (1 − π)/π · (1 − π + π)/(1 − π)² = 1/{π(1 − π)},
so we obtain
se(φ̂ML) = √{π̂ML(1 − π̂ML)/n} · 1/{π̂ML(1 − π̂ML)}
= √[1/{n · π̂ML(1 − π̂ML)}] = √{n/(x(n − x))}
= √{1/x + 1/(n − x)}.
For the above data, we obtain se(φ̂ML) ≈ 0.714, and the 95 % Wald confidence interval for φ = logit(π) has limits −3.89 ± 1.96 · 0.714, i.e. −5.29 and −2.49. Back-transformation with the inverse logit function h⁻¹(φ) = exp(φ)/{1 + exp(φ)} gives the 95 % confidence interval from 0.005 to 0.076 for π, which lies entirely within the parameter space (0, 1).
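The logit-scale interval and its back-transformation can be sketched in standard-library Python (the helper name is ours, not from the book):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def logit_wald_ci(x, n, gamma=0.95):
    """Wald interval on the logit scale, back-transformed to (0, 1)."""
    phi = log(x / (n - x))              # MLE of the log odds
    se = sqrt(1 / x + 1 / (n - x))      # delta-method standard error
    q = NormalDist().inv_cdf((1 + gamma) / 2)
    lo, hi = phi - q * se, phi + q * se
    def inv_logit(f):                   # h^{-1}(φ) = exp(φ)/(1 + exp(φ))
        return exp(f) / (1 + exp(f))
    return inv_logit(lo), inv_logit(hi)

print(logit_wald_ci(2, 100))   # ≈ (0.005, 0.076), inside (0, 1)
```

Because the inverse logit maps R into (0, 1), the back-transformed interval can never leave the parameter space, in contrast to the direct Wald interval.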
3.2.5 The Bootstrap
Finally, we describe a Monte Carlo technique which can also be used for constructing confidence intervals. Instead of computing confidence intervals analytically, the help of a computer is needed here. Details on Monte Carlo methods for Bayesian computations are given in Chap. 8.
We are interested in a model parameter θ and would like to estimate it with the corresponding statistic θ̂ computed from the realisation x1:n of a random sample. Standard
errors and confidence intervals allow us to account for the sampling variability of
the underlying random variables Xᵢ ∼ f (x; θ) (i.i.d.), i = 1, . . . , n. These can often be
calculated from x1:n by the use of mathematical derivations, as discussed previously
in this section.
Of course, it would be easier if many samples x_{1:n}^{(1)}, . . . , x_{1:n}^{(B)} from the population were available. Then we could directly estimate the distribution of the parameter estimate θ̂(X1:n) implied by the distribution of the random sample X1:n. The idea of the bootstrap, which was invented by Bradley Efron (born 1938) in 1979, is simple: we use the given sample x1:n to obtain an estimate f̂(x) of the unknown probability mass or density function f (x). Then we draw bootstrap samples x_{1:n}^{(1)}, . . . , x_{1:n}^{(B)} by sampling from f̂(x) instead of from f (x), each bootstrap sample having the same size n as the original one. We can then directly estimate quantities of the distribution of θ̂(X1:n) from the bootstrap estimates θ̂(x_{1:n}^{(1)}), . . . , θ̂(x_{1:n}^{(B)}). For example, we can compute the bootstrap standard error of θ̂
by estimating the standard deviation of the bootstrap samples. Analogously, we can
compute the 2.5 % and 97.5 % quantiles of the bootstrap samples to obtain a 95 %
bootstrap confidence interval for θ . With this approach, we can directly estimate the
uncertainty attached to our estimate θ̂ = θ̂ (x1:n ) in the original data set x1:n .
The most straightforward estimate of f (x) is the empirical distribution fˆn (x),
which puts weight 1/n on each realisation x1 , . . . , xn of the sample. Random sam-
pling from fˆn (x) reduces to drawing data points with replacement from the original
sample x1:n . The name of the bootstrap method traces back to this simple trick,
which is “pulling oneself up by one’s bootstraps”. This procedure does not make
Fig. 3.1 Histogram of bootstrap means and coefficients of variation for the transformation factor. The means of the bootstrap samples μ̂(x_{1:n}^{(b)}) and φ̂(x_{1:n}^{(b)}) are marked by continuous vertical lines, and the estimates μ̂ and φ̂ in the original sample are marked by dashed vertical lines
any parametric assumptions about f (x) and is therefore known as the nonparametric bootstrap. Note that there are finitely many different bootstrap samples (actually n^n ordered samples if there are no duplicate observations) that can be drawn
from fˆn (x). Since the nonparametric bootstrap distribution of θ̂ (X1:n ) is discrete,
we could in principle avoid the sampling at all and work with the theoretical dis-
tribution instead to obtain uncertainty estimates. However, this is only feasible for
very small sample sizes, say n < 10. Otherwise, we have to proceed with Monte
Carlo sampling from the nonparametric bootstrap distribution, which reduces the
statistical accuracy only marginally in comparison with the sampling error of X1:n
when enough (e.g. B > 10 000) bootstrap samples are used.
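The resampling trick described above takes only a few lines in standard-library Python; the function name and the "transformation factor" data below are invented for illustration (the book's own analyses use R):

```python
import random
from statistics import mean

def bootstrap_percentile_ci(data, estimator, B=10000, gamma=0.95, seed=1):
    """Nonparametric bootstrap: resample the data with replacement,
    recompute the estimator, and take empirical quantiles."""
    rng = random.Random(seed)
    boot = sorted(estimator(rng.choices(data, k=len(data)))
                  for _ in range(B))
    lo = boot[int((1 - gamma) / 2 * B)]
    hi = boot[int((1 + gamma) / 2 * B) - 1]
    return lo, hi

random.seed(0)
data = [random.gauss(2449, 238) for _ in range(50)]   # invented data
print(bootstrap_percentile_ci(data, mean))
```

`random.choices` draws with replacement, which is exactly sampling from the empirical distribution f̂ₙ(x); replacing `mean` by any other statistic gives a bootstrap interval for that statistic.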
terval. Of course, this comes at the cost of more extensive calculations and additional
Monte Carlo error through stochastic simulation.
The great advantage of the bootstrap is that it can easily be applied to more com-
plicated statistics. If analytical methods for the computation of confidence intervals
are missing or rely on strong assumptions, we can always resort to bootstrap confi-
dence intervals.
Example 3.14 (Blood alcohol concentration) Besides the mean transformation fac-
tor μ (see Example 3.13), we may also be interested in a bootstrap confidence inter-
val for the coefficient of variation φ = σ/μ. The corresponding sample coefficient of
variation φ̂ = s/x̄ can be computed for every bootstrap sample; the corresponding
histogram is shown in Fig. 3.1b. Based on the quantiles of the bootstrap distribu-
tion, we obtain the 95 % bootstrap confidence interval from 0.086 to 0.108, with
point estimate of 0.097. In the same manner, a 95 % bootstrap confidence inter-
val can be constructed e.g. for the median transformation factor, with the result
[2411, 2479].
We can also estimate the bias of an estimator using the bootstrap. The bias of θ̂
was defined in Sect. 3.1 as
E_f{θ̂(X1:n)} − θ,
where we use the subscript f to emphasise that the expectation is with respect to
the distribution f (x). Since the population parameter θ is linked to the distribution
f (x), plugging in fˆ(x) for f (x) yields the bootstrap bias estimate
E_f̂{θ̂(X1:n)} − θ̂ ≈ (1/B) Σ_{b=1}^B θ̂(x_{1:n}^{(b)}) − θ̂,
Subtracting the estimated bias from θ̂ gives the bias-corrected estimate
θ̃ = θ̂ − [E_f̂{θ̂(X1:n)} − θ̂] = 2θ̂ − E_f̂{θ̂(X1:n)} ≈ 2θ̂ − (1/B) Σ_{b=1}^B θ̂(x_{1:n}^{(b)}).
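The bias-corrected estimate θ̃ = 2θ̂ minus the bootstrap mean can be sketched as follows; the coefficient-of-variation estimator mirrors Example 3.14, but the function names and data are our own illustration:

```python
import random
from statistics import mean

def bootstrap_bias_corrected(data, estimator, B=5000, seed=1):
    """theta~ = 2 * theta_hat - mean of the bootstrap estimates."""
    rng = random.Random(seed)
    theta = estimator(data)
    boot_mean = mean(estimator(rng.choices(data, k=len(data)))
                     for _ in range(B))
    return 2 * theta - boot_mean   # theta_hat minus the estimated bias

def cv(xs):
    """Sample coefficient of variation s / x̄ (cf. Example 3.14)."""
    m = mean(xs)
    s = (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    return s / m
```

For a downward-biased statistic such as the sample coefficient of variation, the correction shifts the estimate slightly upwards, as in Example 3.15.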
The limits of the bias-corrected and accelerated (BCa) bootstrap confidence interval are
Ĝ_B^{-1}{Φ(q_{(1−γ)/2})} and Ĝ_B^{-1}{Φ(q_{(1+γ)/2})}, (3.15)
where Ĝ_B^{-1}(α) is the α quantile of the bootstrap estimates, Φ is the standard normal distribution function, and
q_α = c + (c + z_α)/{1 − a(c + z_α)}.
Note that if the bias constant c = 0 and the acceleration constant a = 0, then q_α = z_α, where z_α = Φ^{-1}(α) is the α quantile of the standard normal distribution, and hence (3.15) reduces to the simple percentile interval with limits Ĝ_B^{-1}{(1 − γ)/2} and Ĝ_B^{-1}{(1 + γ)/2}. In general, these two constants can be estimated as
ĉ = Φ^{-1}{Ĝ_B(θ̂)}
and
â = Σ_{i=1}^n {θ̂* − θ̂(x_{−i})}³ / (6 [Σ_{i=1}^n {θ̂* − θ̂(x_{−i})}²]^{3/2}),
where θ̂(x_{−i}) is the estimate of θ obtained from the original sample x1:n excluding observation xᵢ, and θ̂* = (1/n) Σ_{i=1}^n θ̂(x_{−i}) is their average.
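Putting (3.15) and the two estimated constants together, a BCa interval can be sketched in standard-library Python. The function name is ours and this is only an illustrative implementation under the formulas above; a real analysis would rather use an established package (e.g. R's boot):

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, estimator, B=2000, gamma=0.95, rng=None):
    """BCa bootstrap confidence interval, following (3.15)."""
    rng = rng or random.Random(1)
    n = len(data)
    theta = estimator(data)
    boot = sorted(estimator(rng.choices(data, k=n)) for _ in range(B))
    nd = NormalDist()
    # bias constant c = Phi^{-1}(G_B(theta_hat))
    prop = sum(t < theta for t in boot) / B
    c = nd.inv_cdf(prop)
    # acceleration constant a from the jackknife estimates
    jack = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - t) ** 3 for t in jack)
    den = 6 * sum((jbar - t) ** 2 for t in jack) ** 1.5
    a = num / den if den else 0.0
    def limit(alpha):
        z = nd.inv_cdf(alpha)
        q = c + (c + z) / (1 - a * (c + z))
        idx = min(B - 1, max(0, int(nd.cdf(q) * B)))
        return boot[idx]
    return limit((1 - gamma) / 2), limit((1 + gamma) / 2)
```

With c = a = 0 the function returns the simple percentile interval; otherwise the two constants shift which bootstrap quantiles are read off, exactly as in Example 3.15.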
Example 3.15 (Blood alcohol concentration) We see in Fig. 3.1 that while, for the
mean parameter μ (see Example 3.13), the average of the bootstrap estimates is very
close to the original estimate μ̂ (bias estimate divided by standard error: −0.006),
for the coefficient of variation φ (see Example 3.14), the relative size of the bias
estimate is larger (bias estimate divided by standard error: −0.070). Because the
bias estimates are negative, the estimates are expected to be too small, and the bias
correction shifts them upwards: μ̂ = 2449.176 is corrected to μ̃ = 2449.278, and
φ̂ = 0.09708 is corrected to φ̃ = 0.09747.
The 95 % BCa confidence intervals are computed from the estimated constants
â = 0.002 and ĉ = 0.018 for the mean estimator μ̂ and â = 0.048 and ĉ = 0.087
for the coefficient of variation estimator φ̂. Both constants are further away from
zero for φ̂, indicating a greater need to move from the simple percentile method to
the BCa method. For μ, we obtain the 95 % BCa confidence interval [2416, 2484],
which is practically identical to the percentile interval computed before. For φ, we
obtain [0.088, 0.110], which has a notably larger upper bound than the percentile
interval. Indeed, it is not the 2.5 % and 97.5 % quantiles of the bootstrap estimates that are used here, but the 5.1 % and 99.1 % quantiles.
P -Value
The P -value is the probability, under the assumption of the null hypothesis
H0 , of obtaining a result equal to or more extreme than what was actually
observed.
Example 3.17 (Fisher’s exact test) Let θ denote the odds ratio and suppose we
want to test the null hypothesis H0 : θ = 1 against the alternative H1 : θ > 1. Let
θ̂ML = θ̂ML(x) denote the observed odds ratio, assumed to be larger than 1. The one-sided P-value for the alternative H1 : θ > 1 is then
Pr{θ̂ML(X) ≥ θ̂ML(x) | θ = 1},
where θ̂ML(X) denotes the MLE of θ viewed as a function of the data X.
For illustration, consider the data from the clinical study in Table 1.1 labelled
as “Tervila”. Table 3.1 summarises the data in a 2 × 2 table. The observed odds
ratio is 6 · 101/(2 · 102) ≈ 2.97. We will show in the following that if we fix both
margins of the 2 × 2 table and if we assume that the true odds ratio equals 1, i.e. θ = 1,
the distribution of each entry of the table follows a hypergeometric distribution with
all parameters determined by the margins. This result can be used to calculate a P -
value for H0 : θ = 1. Note that it is sufficient to consider one entry of the table, for
example x1 ; the values of the other entries directly follow from the fixed margins.
Using the notation given in Table 3.1, we assume that X ∼ Bin(m, πx) and likewise Y ∼ Bin(n, πy), independent of X. Now let Z = X + Y. Then our interest is in
Pr(X = x | Z = z) = Pr(X = x) · Pr(Z = z | X = x) / Pr(Z = z), (3.16)
Under θ = 1 this conditional distribution is hypergeometric,
Pr(X = x | Z = z) = C(m, x) C(n, z − x) / Σ_{s=0}^z C(m, s) C(n, z − s),
so the one-sided P-value can be calculated as the sum of the hypergeometric probabilities to observe x = 6, 7 or 8 entries:
C(108, 6) C(103, 2)/C(211, 8) + C(108, 7) C(103, 1)/C(211, 8) + C(108, 8) C(103, 0)/C(211, 8)
= 0.118 + 0.034 + 0.004 = 0.156.
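This sum of hypergeometric probabilities is easy to reproduce with Python's `math.comb` (the function name is ours):

```python
from math import comb

def fisher_one_sided(x, m, n, z):
    """P(X >= x) for the hypergeometric distribution of X | Z = z."""
    denom = comb(m + n, z)   # equals sum over s of C(m, s) C(n, z - s)
    return sum(comb(m, s) * comb(n, z - s) for s in range(x, z + 1)) / denom

# Tervila data of Table 3.1: margins m = 108, n = 103, z = 8; observed x = 6
p = fisher_one_sided(6, 108, 103, 8)
print(round(p, 3))   # 0.156
```

Because `comb` works with exact integers, the only rounding occurs in the final division.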
Example 3.18 (Analysis of survival times) Suppose that survival times are exponen-
tially distributed with rate λ as in Example 3.7 and we wish to test the null hypothe-
sis H0 : λ0 = 1/1000, i.e. that the mean survival time θ is 1/λ0 = 1000 days. Using
the pivot (3.8) with λ = λ0 = 1/1000, we obtain the test statistic T = nX̄/1000 with
realisation
t = nx̄/1000 = 53.146.
Under the null hypothesis, this test statistic follows a G(n = 47, 1) distribution, so the
one-sided P -value (using the alternative H1 : λ < 1/1000) can be easily calculated
using the function pgamma in R:
t
[1] 53.146
n
[1] 47
pgamma(t, shape = n, rate = 1, lower.tail = FALSE)
[1] 0.1818647
The one-sided P -value turns out to be 0.18, so under the assumption of exponen-
tially distributed survival times, there is no evidence against the null hypothesis of a
mean survival time equal to 1000 days.
With the test statistic
Z(θ0) = (Tn − θ0)/se(Tn) (3.17)
from (3.11), we can test the null hypothesis that the mean survival time is θ0 = 1000 days, but now without assuming exponentially distributed survival times, similarly as with the confidence interval in Example 3.10. Here Tn denotes a consistent estimator of the parameter θ with standard error se(Tn).
The realisation of the test statistic (3.17) turns out to be
z = (1130.8 − 1000)/(874.4/√47) = 1.03
for the PBC data. The one-sided P -value can now be calculated using the standard
normal distribution function as Φ{−|z|} and turns out to be 0.15. The P -value is
fairly similar to the one based on the exponential model and provides no evidence
against the null hypothesis. A two-sided P -value can be easily obtained as twice the
one-sided one.
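The calculation can be reproduced with `statistics.NormalDist` (the function name is ours):

```python
from math import sqrt
from statistics import NormalDist

def z_test_pvalue(xbar, s, n, theta0):
    """z-statistic (3.17) and one-sided P-value Φ(−|z|)."""
    z = (xbar - theta0) / (s / sqrt(n))
    return z, NormalDist().cdf(-abs(z))

z, p = z_test_pvalue(1130.8, 874.4, 47, 1000)
print(round(z, 2), round(p, 2))   # 1.03 0.15
```

Doubling the returned P-value gives the corresponding two-sided value.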
Note that we have used the sample standard deviation s = 874.4 in the calculation
of the denominator of (3.17), based on the formula
s² = (1/(n − 1)) Σ_{i=1}^n (xᵢ − x̄)², (3.18)
where x̄ is the empirical mean survival time. The P -value is calculated assuming
that the null hypothesis is true, so it can be argued that x̄ in (3.18) should be replaced
by the null hypothesis value 1000. In this case, n − 1 can be replaced by n to ensure
that, under the null hypothesis, s 2 is unbiased for σ 2 . This leads to a slightly smaller
value of the test statistic and consequently to the slightly larger P -value 0.16.
In practice the null hypothesis value is often the null value θ0 = 0. For example,
we might want to test the null hypothesis that the risk difference is zero. Similarly,
if we are interested to test the null hypothesis that the odds ratio is one, then this
corresponds to a log odds ratio of zero. For such null hypotheses, the test statistic
(3.17) takes a particularly simple form as the estimate Tn of θ divided by its standard
error:
Z = Tn/se(Tn).
A realisation of Z is called the Z-value.
It is important to realise that the P -value is a conditional probability under the
assumption that the null hypothesis is true. The P -value is not the probability of
the null hypothesis given the data, which is a common misinterpretation. The poste-
rior probability of a certain statistical model is a Bayesian concept (see Sect. 7.2.1),
which makes sense if prior probabilities are assigned to the null hypothesis and its
counterpart, the alternative hypothesis. However, from a frequentist perspective a
null hypothesis can only be true or false. As a consequence, the P -value is com-
monly viewed as an informal measure of the evidence against a null hypothesis.
Note also that a large P -value cannot be viewed as evidence for the null hypothe-
sis; a large P -value represents absence of evidence against the null hypothesis, and
“absence of evidence is not evidence of absence”.
The Neyman–Pearson approach to statistical inference rejects the P -value as an
informal measure of the evidence against the null hypothesis. Instead, this approach
postulates that there are only two possible “decisions” that can be reached after
having observed data: either “rejecting” or “not rejecting” the null hypothesis. This
theory then introduces the probability of the Type-I error α, defined as the condi-
tional probability of rejecting the null hypothesis although it is true in a series of
hypothetical repetitions of the study considered. It can now be easily shown that
the resulting hypothesis test will have a Type-I error probability equal to α, if the
null hypothesis is rejected whenever the P -value is smaller than α. Note that this
construction requires the Type-I error probability α to be specified before the study
is conducted. Indeed, in a clinical study the probability of the Type-I error (usually
5 %) will be fixed already in the study protocol. However, in observational studies
the P -value is commonly misinterpreted as a post-hoc Type-I error probability. For
example, suppose that a P -value of 0.029 has been observed. This misconception
would suggest that the probability of rejecting the null hypothesis although it is true
in a series of hypothetical repetitions of the study is 0.029. This interpretation of
the P -value is not correct, as it mixes a truly frequentist (unconditional) concept
(the probability of the Type-I error) with the P -value, a measure of the evidence
of the observed data against the null hypothesis, i.e. an (at least partly) conditional
concept.
In this book we will mostly use significance rather than hypothesis tests and
interpret P -values as a continuous measure of the evidence against the null hypoth-
esis, see Fig. 3.2. However, we need to emphasise the duality of hypothesis
tests and confidence intervals. Indeed, the result of a two-sided hypothesis test of
the null hypothesis H0 : θ = θ0 at Type-I error probability α can be read off from
the corresponding (1 − α) · 100 % confidence interval for θ : If and only if θ0 is
within the confidence interval, then the Neyman–Pearson test would not reject the
null hypothesis.
3.4 Exercises
1. Sketch why the MLE
   N̂ML = M · n / x
   in the capture–recapture experiment (cf. Example 2.2) cannot be unbiased.
   Show that the alternative estimator
   N̂ = (M + 1) · (n + 1) / (x + 1) − 1
   is unbiased if N ≤ M + n.
2. Let X1:n be a random sample from a distribution with mean μ and variance σ² > 0. Show that
   E(X̄) = μ  and  Var(X̄) = σ²/n.
3. Let X1:n be a random sample from a normal distribution with mean μ and variance σ² > 0. Show that the estimator
   σ̂ = √((n − 1)/2) · Γ((n − 1)/2) / Γ(n/2) · S
   is unbiased for σ, where S is the square root of the sample variance S² in (3.1).
4. Show that the sample variance S² can be written as
   S² = 1/{2n(n − 1)} · Σᵢ,ⱼ₌₁ⁿ (Xi − Xj)².
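The pairwise-difference identity in Exercise 4 is easy to check numerically; a quick sketch with arbitrary simulated data (not part of the exercise's solution):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
n = len(x)

s2_usual = x.var(ddof=1)  # (n-1)^{-1} * sum of (x_i - mean)^2

# Pairwise form: S^2 = sum_{i,j} (x_i - x_j)^2 / (2 n (n-1))
diff = x[:, None] - x[None, :]
s2_pairwise = (diff ** 2).sum() / (2 * n * (n - 1))

print(np.isclose(s2_usual, s2_pairwise))  # True
```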
has coverage γ .
7. Consider a population with mean μ and variance σ 2 . Let X1 , . . . , X5 be inde-
pendent draws from this population. Consider the following estimators for μ:
T1 = (X1 + X2 + X3 + X4 + X5)/5,
T2 = (X1 + X2 + X3)/3,
T3 = (X1 + X2 + X3 + X4)/8 + X5/2,
T4 = X1 + X2 and
T5 = X1.
Z = (X̄ − μ)/(S/√n),
where S² = (n − 1)⁻¹ Σᵢ₌₁ⁿ (Xi − X̄)². Using the result from above,
show that the distribution of Z does not depend on μ.
(d) For n = 10 and α ∈ {1, 2, 5, 10}, simulate 100 000 samples from Z and
compare the resulting 2.5 % and 97.5 % quantiles with those from
the asymptotic standard normal distribution. Is Z a good approximate
pivot?
(e) Show that X̄/μ ∼ G(nα, nα). If α was known, how could you use this
quantity to derive a confidence interval for μ?
(f) Suppose α is unknown; how could you derive a confidence interval
for μ?
10. All beds in a hospital are numbered consecutively from 1 to N > 1. In one
room a doctor sees n ≤ N beds, which are a random subset of all beds, with
(ordered) numbers X1 < · · · < Xn . The doctor now wants to estimate the total
number of beds N in the hospital.
(a) Show that the joint probability mass function of X = (X1 , . . . , Xn ) is
f(x; N) = (N choose n)⁻¹ · I{n,...,N}(xn).
3.5 References
The methods discussed in this chapter can be found in many books on statistical
inference, for example in Lehmann and Casella (1998), Casella and Berger (2001) or
Young and Smith (2005). The section on the bootstrap has only touched the surface
of a wealth of so-called resampling methods for frequentist statistical inference.
More details can be found e.g. in Davison and Hinkley (1997) and Chihara and
Hesterberg (2019).
4 Frequentist Properties of the Likelihood
Contents
4.1 The Expected Fisher Information and the Score Statistic . . . . . . . . . . . . . . 80
4.1.1 The Expected Fisher Information . . . . . . . . . . . . . . . . . . . . . . 81
4.1.2 Properties of the Expected Fisher Information . . . . . . . . . . . . . . . . 84
4.1.3 The Score Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1.4 The Score Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1.5 Score Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 The Distribution of the ML Estimator and the Wald Statistic . . . . . . . . . . . . 94
4.2.1 Cramér–Rao Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2.2 Consistency of the ML Estimator . . . . . . . . . . . . . . . . . . . . . . 96
4.2.3 The Distribution of the ML Estimator . . . . . . . . . . . . . . . . . . . . 97
4.2.4 The Wald Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3 Variance-Stabilising Transformations . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4 The Likelihood Ratio Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.1 The Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.4.2 Likelihood Ratio Confidence Intervals . . . . . . . . . . . . . . . . . . . . 106
4.5 The p ∗ Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.6 A Comparison of Likelihood-Based Confidence Intervals . . . . . . . . . . . . . . 113
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
In Chap. 2 we have considered the likelihood and related quantities such as the
log-likelihood, the score function, the MLE and the (observed) Fisher information
for a fixed observation X = x from a distribution with probability mass or density
function f (x; θ ). For example, in a binomial model with known sample size n and
unknown probability π we have
score function S(π; x) = x/π − (n − x)/(1 − π), MLE π̂ML = x/n,
and Fisher information I(π; x) = x/π² + (n − x)/(1 − π)².

L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health,
https://doi.org/10.1007/978-3-662-60792-3_4,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Now we take a different point of view and apply the concepts of frequentist inference
as outlined in Chap. 3. To this end, we consider S(π), π̂ML and I (π) as random
variables, with distribution derived from the random variable X ∼ Bin(n, π). The
above equations now read
S(π; X) = X/π − (n − X)/(1 − π),
π̂ML(X) = X/n, and I(π; X) = X/π² + (n − X)/(1 − π)²,
where X ∼ Bin(n, π) is an identical replication of the experiment underlying our
statistical model. The parameter π is now fixed and denotes the true (unknown)
parameter value. To ease notation, we will often not explicitly state the dependence
of the random variables S(π), π̂ML and I (π) on the random variable X.
The results we will describe in the following sections are valid under a standard
set of regularity conditions, often called Fisher regularity conditions.
This chapter will introduce three important test statistics based on the likelihood:
the score statistic, the Wald statistic and the likelihood ratio statistic. Many of the
results derived are asymptotic, i.e. are valid only for a random sample X1:n with
relatively large sample size n. A case study on different confidence intervals for
proportions completes this chapter.
In this section we will derive frequentist properties of the score function and the
Fisher information. We will introduce the score statistic, which is useful to derive
likelihood-based significance tests and confidence intervals.
Definition 4.2 (Expected Fisher information) The expectation of the Fisher infor-
mation I (θ ; X), viewed as a function of the data X with distribution f (x; θ ), is the
expected Fisher information J (θ ).
We will usually assume that the expected Fisher information J (θ ) is positive and
bounded, i.e. 0 < J (θ ) < ∞.
Example 4.1 (Binomial model) In the binomial model from above, the Fisher information is
I(π; x) = x/π² + (n − x)/(1 − π)²,
so the expected Fisher information is
J(π) = E{I(π; X)}
     = E(X)/π² + {n − E(X)}/(1 − π)²
     = nπ/π² + (n − nπ)/(1 − π)²
     = n/π + n/(1 − π)
     = n/{π(1 − π)}.
Note that the only difference to the observed Fisher information I (π̂ML ; x) derived
in Example 2.10 is the replacement of the MLE π̂ML with the true value π .
The expected Fisher information can also be described as the variance of the
score function. Before showing this general result, we first study a specific example.
Result 4.1 (Expectation and variance of the score function) Under the Fisher reg-
ularity conditions, which ensure that the order of differentiation and integration can
be changed, we have:
E{S(θ; X)} = 0,
Var{S(θ; X)} = J(θ).
Proof To prove Result 4.1, we assume without loss of generality that L(θ ) =
f (x; θ ), i.e. all multiplicative constants in f (x; θ ) are included in the likelihood
function. First, note that for continuous X we have
E{S(θ; X)} = ∫ S(θ; x) f(x; θ) dx
           = ∫ {d/dθ log L(θ)} f(x; θ) dx
           = ∫ {(dL(θ)/dθ)/L(θ)} f(x; θ) dx, with the chain rule,
           = ∫ dL(θ)/dθ dx, due to L(θ) = f(x; θ),
           = d/dθ ∫ L(θ) dx, under the above regularity condition,
           = d/dθ 1 = 0.
So we have E{S(θ ; X)} = 0, and therefore Var{S(θ ; X)} = E{S(θ ; X)2 }. It is there-
fore sufficient to show that J (θ ) = E{S(θ ; X)2 } holds:
J(θ) = E{−d²/dθ² log L(θ)}
     = E[−d/dθ {(dL(θ)/dθ)/L(θ)}], with the chain rule,
     = E[−{(d²L(θ)/dθ²) L(θ) − (dL(θ)/dθ)²}/L(θ)²], with the quotient rule,
     = −E{(d²L(θ)/dθ²)/L(θ)} + E{(dL(θ)/dθ)²/L(θ)²}
     = −∫ {(d²L(θ)/dθ²)/L(θ)} f(x; θ) dx + ∫ {(dL(θ)/dθ)/L(θ)}² f(x; θ) dx
     = −d²/dθ² ∫ L(θ) dx + ∫ {d/dθ log L(θ)}² f(x; θ) dx, using the above regularity condition.
The first integral equals 1, so its second derivative vanishes, while the second integrand contains S(θ; x)², hence
J(θ) = E{S(θ; X)²}.
For discrete X the result is still valid with integration replaced by summation.
Result 4.1 says that the score function at the true parameter value θ is on average
zero. This suggests that the MLE, which has a score function value of zero, is on
average equal to the true value. However, this is in general not true. It is true, though,
asymptotically, as we will see in Result 4.9.
By Result 4.1, the variance of the score function equals the expected Fisher information. How can we interpret this result? The expected Fisher information is the average negative curvature of the log-likelihood at the true value. If the log-likelihood function is steep, with a lot of curvature, then there is much information with respect to θ, and the score function at the true value θ will vary a lot. Conversely, if the log-likelihood function is flat, then the score function will not vary much at the true value θ, and there is not much information with respect to θ.
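A short simulation can illustrate both parts of Result 4.1 in the binomial model, where J(π) = n/{π(1 − π)}; this is a sketch with arbitrary n and π, not a worked example from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
n, pi = 50, 0.3                      # binomial model with true parameter pi
x = rng.binomial(n, pi, size=200_000)

score = x / pi - (n - x) / (1 - pi)  # S(pi; X) = X/pi - (n - X)/(1 - pi)
J = n / (pi * (1 - pi))              # expected Fisher information J(pi)

print(round(score.mean(), 2))        # close to 0
print(round(score.var() / J, 3))     # close to 1
```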
Result 4.2 (Expected Fisher information from a random sample) Let X1:n denote
a random sample from a distribution with probability mass or density function
f (x; θ ). Let J (θ ) denote the expected unit Fisher information, i.e. the expected
Fisher information from one observation Xi from f (x; θ ). The expected Fisher in-
formation J1:n (θ ) from the whole random sample X1:n is then
J1:n (θ ) = n · J (θ ).
This property follows directly from the additivity of the log-likelihood function
for random samples, see Eq. (2.1).
Example 4.3 (Normal model) Consider a random sample X1:n from a normal dis-
tribution N(μ, σ 2 ), where our interest lies in the expectation μ, and we assume that
the variance σ 2 is known. We know from Example 2.9 that the unit Fisher informa-
tion is
I(μ; xi) = 1/σ².
Now I (μ; xi ) does not depend on xi and thus equals the expected unit Fisher infor-
mation J (μ), and the expected Fisher information from the whole random sample
X1:n is J1:n (μ) = nJ (μ) = nI (μ) = n/σ 2 .
Suppose now that we are interested in the expected Fisher information of the
unknown variance σ 2 and treat the mean μ as known. We know from Example 2.9
that the unit Fisher information of σ 2 is
1 1
I σ 2 ; xi = 6 (xi − μ)2 − 4 .
σ 2σ
Using E{(Xi − μ)2 } = Var(Xi ) = σ 2 , we easily obtain the expected unit Fisher
information of σ 2 :
J(σ²) = 1/σ⁴ − 1/(2σ⁴) = 1/(2σ⁴).
The expected Fisher information from the whole random sample X1:n is therefore
J1:n (σ 2 ) = nJ (σ 2 ) = n/(2σ 4 ).
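The expected unit Fisher information J(σ²) = 1/(2σ⁴) can be checked by simulation; a sketch with arbitrary μ and σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=500_000)

# Unit Fisher information I(sigma^2; x_i) = (x_i - mu)^2 / sigma^6 - 1/(2 sigma^4)
unit_info = (x - mu) ** 2 / sigma2 ** 3 - 1 / (2 * sigma2 ** 2)

# Its mean estimates J(sigma^2) = 1/(2 sigma^4) = 0.125 for sigma^2 = 2
print(round(unit_info.mean(), 4))
```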
The following result establishes that the property described in Result 2.1 for the
observed Fisher information also holds for the expected Fisher information.
calculated as follows:
Jφ(φ) = Jθ(θ) {dh⁻¹(φ)/dφ}² = Jθ(θ) {dh(θ)/dθ}⁻².
As in Result 2.1, the score function of the transformed parameter is
Sφ(φ; X) = Sθ(θ; X) · dh⁻¹(φ)/dφ,
and with Result 4.1 we have
Example 4.4 (Binomial model) The expected Fisher information of the success
probability π in a binomial experiment is
Jπ(π) = n/{π(1 − π)},
compare Example 4.1. The expected Fisher information of the log odds φ = h(π) =
log{π/(1 − π)} therefore is
Jφ(φ) = Jπ(π) {dh(π)/dπ}⁻² = n π(1 − π)
due to
dh(π)/dπ = 1/{π(1 − π)}.
This corresponds to the observed Fisher information, as Jφ (φ̂ML ) = Iφ (φ̂ML ), cf.
Example 2.11.
Example 4.5 (Normal model) In Example 4.3 we showed that the expected Fisher information of the variance σ² is n/(2σ⁴). Applying Result 4.3 to θ = σ² and the transformation σ = h(σ²) = √σ² gives the expected Fisher information of the standard deviation σ as J(σ) = n/(2σ⁴) · (2σ)² = 2n/σ².
Definition 4.3 (Location and scale parameters) Let X denote a random variable with probability mass or density function fX(x) = f(x; θ), depending on a scalar parameter θ. If the probability mass or density function fY(y) of Y = X + c, where c ∈ R is a constant, has the form f(y; θ + c), then θ is called a location parameter. If the probability mass or density function fY(y) of Y = cX, where c ∈ R⁺ is a positive constant, has the form f(y; c · θ), then θ is called a scale parameter.
For a scale parameter θ, the expected Fisher information fulfils J(θ) ∝ θ⁻² (Result 4.4).
Example 4.7 (Normal model) Consider a random sample X1:n from a normal dis-
tribution N(μ, σ 2 ). The expected Fisher information of the location parameter μ is
J (μ) = n/σ 2 and indeed independent of μ (cf. Example 4.3). In Example 4.5 we
have also shown that J (σ ) = 2n/σ 2 , which is in line with Result 4.4, which states
that the expected Fisher information of the scale parameter σ must be proportional
to σ −2 .
Result 4.1 has established formulas for the expectation and variance of the score
function. We can also make an asymptotic statement about the whole distribution of
the score function for a random sample:
Result 4.5 (Score statistic) Consider a random sample X1:n from a distribution
with probability mass or density function f (x; θ ). Under the Fisher regularity con-
ditions, the following holds:
S(θ; X1:n) / √J1:n(θ)  a∼  N(0, 1).  (4.2)
This result identifies the score statistic (4.2) as an approximate pivot for θ, with the standard normal pivotal distribution, cf. Definition 3.7. We note that the symbol a∼ is to be understood as convergence in distribution as n → ∞, cf. Appendix A.4.1.
The alternative formulation
S(θ; X1:n)  a∼  N(0, J1:n(θ))
is mathematically less precise but makes it more explicit that the asymptotic variance of the score function S(θ; X1:n) is equal to the expected Fisher information J1:n(θ), a result which is commonly used in practice. However, Eq. (4.2), with a limiting distribution that does not depend on n, is the more rigorous statement.
This follows from the central limit theorem applied to the iid summands Yi = S(θ; Xi), which by Result 4.1 have mean 0 and variance J(θ):
(1/√n) S(θ; X1:n) = (1/√n) Σᵢ₌₁ⁿ Yi →D N(0, J(θ)),  so  S(θ; X1:n)/√J1:n(θ) a∼ N(0, 1).
The next result shows that we can replace the expected Fisher information J1:n (θ )
with the ordinary Fisher information I (θ, X1:n ) in Eq. (4.2).
Result 4.6 We can replace in Result 4.5 the expected Fisher information J1:n (θ ) by
the ordinary Fisher information I (θ ; X1:n ), i.e.
S(θ; X1:n) / √I(θ; X1:n)  a∼  N(0, 1).  (4.3)
Proof Due to
I(θ; Xi) = −d²/dθ² log f(Xi; θ)  and  E{I(θ; Xi)} = J(θ),
the law of large numbers gives
(1/n) I(θ; X1:n) = (1/n) Σᵢ₌₁ⁿ I(θ; Xi) →P J(θ), and therefore
I(θ; X1:n)/{n J(θ)} = I(θ; X1:n)/J1:n(θ) →P 1.
The continuous mapping theorem (cf. Appendix A.4.2) ensures that also the inverse
square root converges to 1, i.e.
√J1:n(θ) / √I(θ; X1:n) →P 1.
By Slutsky’s theorem (cf. Appendix A.4.2) and Eq. (4.2) we finally obtain
{√J1:n(θ)/√I(θ; X1:n)} · {S(θ; X1:n)/√J1:n(θ)} →D 1 · N(0, 1),
There are two further variants of the score statistic. We can replace the true parameter value θ in the denominator √J1:n(θ) of Eq. (4.2) with θ̂ML = θ̂ML(X1:n). Likewise, we can replace θ in the denominator √I(θ; X1:n) of Eq. (4.3) with θ̂ML = θ̂ML(X1:n). This will be justified later on page 98. In applications this requires the calculation of the MLE θ̂ML.
The score statistic in Eq. (4.2) forms the basis of the corresponding significance
test, which we describe now.
The score statistic can be used to construct significance tests and confidence inter-
vals. The significance test based on the score statistic is called score test. In the
following we still assume that the data has arisen through an appropriate random
sample of sufficiently large sample size n.
Suppose we are interested in the null hypothesis H0: θ = θ0 with two-sided alternative H1: θ ≠ θ0. Both S(θ0; x1:n)/√J1:n(θ0) and S(θ0; x1:n)/√I(θ0; x1:n) can now be used to calculate a P-value, as illustrated in the following example.
Example 4.8 (Scottish lip cancer) Consider Sect. 1.1.6, where we have observed
(xi ) and expected (ei ) counts of lip cancer in i = 1, . . . , 56 regions of Scotland.
We assume that xi is a realisation from a Poisson distribution Po(ei λi ) with known
offset ei > 0 and unknown rate parameter λi .
In contrast to Example 2.4, we assume here that the rate parameters λi differ
between regions. Note that calculation of the expected counts ei has been done
under the constraint that the sum of the expected equals the sum of the observed
counts in Scotland. The value λ0 = 1 is hence a natural reference value for the rate
parameters because it corresponds to the overall relative risk Σᵢ xᵢ / Σᵢ eᵢ = 1.
Consider a specific region i and omit the subscript i for ease of notation. Note
that the number of observed lip cancer cases x in a specific region can be viewed
as the sum of a very large number of binary observations (lip cancer yes/no)
based on the whole population in that region. So although we just have one Pois-
son observation, the usual asymptotics still apply, if the population size is large
enough.
We want to test the null hypothesis H0 : λ = λ0 . It is easy to show that
S(λ; x) = x/λ − e,  I(λ; x) = x/λ²  and  J(λ) = E{I(λ; X)} = e/λ.
The observed test statistic of the score test using the expected Fisher information in
(4.2) is therefore
T1(x) = S(λ0; x)/√J(λ0) = (x − eλ0)/√(eλ0).
Using the ordinary Fisher information instead, we obtain the observed test statistic
from (4.3)
T2(x) = S(λ0; x)/√I(λ0; x) = (x − eλ0)/√x.
Note that
T1(x)/T2(x) = √(x/(eλ0)),
so if x > eλ0 , both T1 and T2 are positive, and T1 will have a larger value than T2 .
Conversely, if x < eλ0 , both T1 and T2 are negative, and T2 will have a larger abso-
lute value than T1 . Therefore, T1 ≥ T2 always holds.
We now consider all 56 regions separately in order to test the null hypothesis
H0 : λi = λ0 = 1 for each region i. Figure 4.1 plots the values of the two test statis-
tics T1 and T2 against each other.
Note that the test statistic T2 is infinite for the two observations with xi = 0. The alternative test statistic T1 still gives sensible values here, namely −1.33 and −2.04. Of course, it has to be noted that the assumption of an approximate normal distribution is questionable for small xi. This can also be seen from the discrepancies between T1 and T2, two test statistics which should be asymptotically equivalent.
It is interesting that 24 of the 56 regions have absolute values of T1 larger than
the critical value 1.96. This corresponds to 43 %, considerably more than the 5 %
to be expected under the null hypothesis. This can also be seen in a histogram of
the corresponding two-sided P -values, shown in Fig. 4.2a. We observe far more
small P -values (< 0.1) than we would expect under the null hypothesis, where the
P -values are (asymptotically) uniformly distributed.
This suggests that there is heterogeneity in relative lip cancer risk between the
different regions despite the somewhat questionable asymptotic regime. Very simi-
lar results can be obtained with the test statistic T2 , excluding the two regions with
zero observed counts, cf. Fig. 4.2b.
A remarkable feature of the score test is that it does not require calculation of the
MLE θ̂ML . This can make the application of the score test simpler than alternative
methods, which often require knowledge of the MLE.
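Both observed score test statistics are simple to compute; a sketch with hypothetical counts for one region (not the Scottish data):

```python
import numpy as np
from scipy.stats import norm

def score_tests(x, e, lam0=1.0):
    """Score test statistics for H0: lambda = lam0 in the Poisson model
    X ~ Po(e * lambda): T1 uses the expected, T2 the observed Fisher
    information (T2 is infinite for x = 0)."""
    t1 = (x - e * lam0) / np.sqrt(e * lam0)
    t2 = (x - e * lam0) / np.sqrt(x) if x > 0 else np.inf * np.sign(x - e * lam0)
    return t1, t2

# Hypothetical observed and expected counts:
x, e = 9, 4.4
t1, t2 = score_tests(x, e)
p1 = 2 * norm.sf(abs(t1))   # two-sided P-value based on T1
print(round(t1, 2), round(t2, 2), round(p1, 4))
```

For x = 0 the function returns −∞ for T2, mirroring the behaviour noted for the two regions with zero observed counts.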
Fig. 4.2 Histograms of the P -values based on the two score test statistics T1 and T2 for a unity
relative risk of lip cancer in the 56 regions of Scotland
The approximate pivots (4.2) and (4.3) can also be used to compute approximate
score confidence intervals for θ . Consider the score test statistic (4.2); then the du-
ality of hypothesis tests and confidence intervals implies that all values θ0 fulfilling
z(1−γ)/2 ≤ S(θ0; x1:n)/√J1:n(θ0) ≤ z(1+γ)/2
form a γ · 100 % score confidence interval for θ . Due to z(1−γ )/2 = −z(1+γ )/2 , this
condition can also be written as
S(θ0; x1:n)² / J1:n(θ0) ≤ z(1+γ)/2².
The same holds for the score test statistic (4.3), which uses the observed Fisher
information. In summary we have the two approximate γ · 100 % score confidence
intervals
{θ : S(θ; x1:n)²/J1:n(θ) ≤ q²}  and  {θ : S(θ; x1:n)²/I(θ; x1:n) ≤ q²},  (4.4)
where q = z(1+γ )/2 denotes the (1 + γ )/2 quantile of the standard normal distri-
bution. Computation is in general not straightforward, but explicit formulas can be
derived in some special cases, as illustrated by the following examples.
Example 4.9 (Scottish lip cancer) We now want to compute a score confidence
interval for the relative risk λ in one specific region.
λ1/2 = (x + q²/2)/e ± (q/(2e)) · √(4x + q²).
Note that the interval is boundary-respecting since the lower limit is always non-
negative and equal to zero for x = 0. The interval limits are symmetric around (x +
q 2 /2)/e, but not around the MLE λ̂ML = x/e. Nevertheless, it is easy to show that
the MLE is always inside the confidence interval for any confidence level γ . For the
data on lip cancer in Scotland, we obtain the 95 % score confidence intervals for the
relative risk in each region, as displayed in Fig. 4.3.
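The closed-form limits λ1/2 can be implemented directly; a sketch with hypothetical counts (not one of the Scottish regions):

```python
import numpy as np
from scipy.stats import norm

def poisson_score_ci(x, e, level=0.95):
    """Score confidence interval for the relative risk lambda in the
    Poisson model X ~ Po(e * lambda), using the closed-form limits
    lambda_{1/2} = (x + q^2/2)/e -/+ q/(2e) * sqrt(4x + q^2)."""
    q = norm.ppf((1 + level) / 2)
    centre = (x + q ** 2 / 2) / e
    half = q / (2 * e) * np.sqrt(4 * x + q ** 2)
    return centre - half, centre + half

# Hypothetical region with x = 16 observed and e = 4.4 expected cases:
lo, hi = poisson_score_ci(16, 4.4)
print(round(lo, 2), round(hi, 2))
```

As noted in the text, the lower limit is exactly zero for x = 0, so the interval is boundary-respecting.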
Recall now the Wilson confidence interval for the success probability π in the binomial model X ∼ Bin(n, π); here π̂ML = x/n.
We will now show that this confidence interval is a score confidence interval based
on (4.4) using the expected Fisher information. First note that
S(π; x) = x/π − (n − x)/(1 − π) = (x − nπ)/{π(1 − π)},  (4.6)
so the limits of this approximate γ · 100 % confidence interval for π are the solutions
of the equation
{π(1 − π)/n} · {(x − nπ)/(π(1 − π))}² = (x − nπ)² / {n π(1 − π)} = q²,
An interesting property of the Wilson confidence interval is that the limits are
always within the unit interval, i.e. the Wilson interval is boundary-respecting. For
example, if there are x = 0 successes in n trials, the Wilson confidence interval has
limits 0 and q 2 /(q 2 + n); for x = n the limits are n/(q 2 + n) and 1. The Wald
interval does not have this property; see Example 4.22 for a thorough comparison
of different confidence intervals for proportions.
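A minimal implementation of the Wilson interval, written here in the usual centre ± half-width form (a sketch, not the book's code):

```python
import numpy as np
from scipy.stats import norm

def wilson_ci(x, n, level=0.95):
    """Wilson (score) confidence interval for a binomial proportion,
    obtained by solving (x - n*pi)^2 / (n*pi*(1-pi)) = q^2 for pi."""
    q = norm.ppf((1 + level) / 2)
    p_hat = x / n
    centre = (x + q ** 2 / 2) / (n + q ** 2)
    half = q * np.sqrt(n) / (n + q ** 2) * np.sqrt(p_hat * (1 - p_hat) + q ** 2 / (4 * n))
    return centre - half, centre + half

# Boundary behaviour: for x = 0 the limits are 0 and q^2/(q^2 + n).
lo, hi = wilson_ci(0, 20)
q = norm.ppf(0.975)
print(round(lo, 6), round(hi, 4), round(q ** 2 / (q ** 2 + 20), 4))
```

For x = n the upper limit is exactly 1 and the lower limit is n/(q² + n), matching the boundary-respecting property described above.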
The underlying reason for this property of the Wilson confidence interval is that score confidence intervals based on the expected Fisher information J(θ) are invariant with respect to one-to-one transformations of the parameter θ. For example, suppose we parametrised the binomial likelihood in terms of the log odds φ = logit(π) instead of the success probability π. The limits of the score confidence interval for φ are then simply the logit-transformed limits of the original Wilson confidence interval for π. This is also true in general.
Result 4.7 (Invariance of score confidence intervals) Let φ = h(θ ) denote a one-
to-one transformation of θ and use subscripts to differentiate between the score
function and expected Fisher information of the old (θ ) and new parameter (φ).
Then we have
{h(θ) : Sθ(θ; x)²/Jθ(θ) ≤ q²} = {φ : Sφ(φ; x)²/Jφ(φ) ≤ q²}.
Proof This property follows immediately from Result 4.3 and Eq. (2.4):
Sφ(φ; x)²/Jφ(φ) = [Sθ(θ; x)² {dh⁻¹(φ)/dφ}²] / [Jθ(θ) {dh⁻¹(φ)/dφ}²] = Sθ(θ; x)²/Jθ(θ).
We can therefore transform the limits of a score interval for θ and do not need to
re-calculate the score function of φ. This is an important and attractive property of
the score confidence interval based on the expected Fisher information. As we will
see in the next section, the more commonly used Wald confidence interval does not
have this property.
Result 4.8 (Cramér–Rao inequality) Let T = h(X) denote an unbiased estimator of g(θ) based on some data X from a distribution with probability mass or density function f(x; θ), i.e. E(T) = E{h(X)} = g(θ) for all θ. Let J(θ) denote the expected Fisher information of θ with respect to X. Under the Fisher regularity conditions, we then have the following property:
Var(T) ≥ g′(θ)² / J(θ).  (4.7)
In particular, if g(θ ) = θ , we have
Var(T) ≥ 1 / J(θ).  (4.8)
The right-hand sides of (4.7) and (4.8) are called Cramér–Rao lower bounds.
Proof To prove (4.7), consider two arbitrary random variables S and T . Then the
squared correlation between S and T is
ρ(S, T)² = Cov(S, T)² / {Var(S) · Var(T)} ≤ 1,
compare Appendix A.3.6. Suppose now that S = S(θ ; X), i.e. S is the score func-
tion. We know that Var(S) = J (θ ), so we have
Var(T) ≥ Cov(S, T)² / J(θ).
For discrete X the result is still valid with integration replaced by summation.
Efficient estimator
An estimator that is asymptotically unbiased and asymptotically attains the
Cramér–Rao lower bound is called efficient.
We will now consider the ML estimator and show that it is consistent if the Fisher
regularity conditions hold. In particular, we need to assume that the support of the
statistical model f (x; θ ) does not depend on θ , that L(θ ) is continuous in θ and that
θ is (as always in this section) a scalar.
Result 4.9 (Consistency of the ML estimator) Let X1:n denote a random sample
from f (x; θ0 ) where θ0 denotes the true parameter value. For n → ∞, there is a
consistent sequence of MLEs θ̂n (defined as the local maximum of the likelihood
function).
Proof We have to show that for any ε > 0 (for n → ∞), there is a (possibly local)
maximum θ̂ in the interval (θ0 − ε, θ0 + ε). Now L(θ ) is assumed to be continuous,
so it is sufficient to show that the probability of
L(θ0) > L(θ0 − ε)  (4.9)
and
L(θ0) > L(θ0 + ε)  (4.10)
converges to one. For Eq. (4.9), the law of large numbers gives that (1/n) log{L(θ0)/L(θ0 − ε)} converges to E[log{f(X; θ0)/f(X; θ0 − ε)}],
where X has density f (x; θ0 ). Application of the information inequality (cf. Ap-
pendix A.3.8) gives with assumption (4.1):
E[ log{ f(X; θ0) / f(X; θ0 − ε) } ] > 0.
For Eq. (4.10), we can argue similarly.
Note that in this proof the MLE is defined as a local maximiser of the likeli-
hood, so the uniqueness of the MLE is not shown. A proof with the MLE defined as
the global maximiser requires additional assumptions, and the proof becomes more
involved.
The following result, which establishes the asymptotic normality of the MLE, is one
of the most important results of likelihood theory.
Result 4.10 Let X1:n denote a random sample from f (x; θ0 ) and suppose that
θ̂ML = θ̂ML (X) is consistent for θ0 . Assuming that the Fisher regularity conditions
hold, we then have:
√n (θ̂ML − θ0) →D N(0, J(θ0)⁻¹),
where J (θ0 ) denotes the expected unit Fisher information of one observation Xi
from f (x; θ0 ). The expected Fisher information of the full random sample is there-
fore J1:n (θ0 ) = n · J (θ0 ), and we have
√J1:n(θ0) (θ̂ML − θ0)  a∼  N(0, 1).  (4.11)
Proof To show Result 4.10, consider a first-order Taylor expansion of the score
function around θ0 ,
As in the proof of Result 4.6,
I(θ0; X1:n)/n →P J(θ0).
The continuous mapping theorem (cf. Appendix A.4.2) gives
{I(θ0; X1:n)/n}⁻¹ →P J(θ0)⁻¹,
We can now replace the expected Fisher information J1:n (θ0 ) in (4.11) with the
ordinary Fisher information I (θ0 ; X1:n ), just as for the score statistic. Similarly, we
can evaluate both the expected and the ordinary Fisher information not at the true
parameter value θ0 , but at the MLE θ̂ML . In total, we have the following three variants
of Result 4.10:
θ̂ML  a∼  N(θ0, J1:n(θ̂ML)⁻¹),
θ̂ML  a∼  N(θ0, I(θ0; X1:n)⁻¹),
θ̂ML  a∼  N(θ0, I(θ̂ML; X1:n)⁻¹).
This illustrates that we can use both 1/√I(θ̂ML; x1:n) and 1/√J1:n(θ̂ML) as a standard error of the MLE θ̂ML. This is an important result, as it justifies the usage of a standard error based on I(θ̂ML; x1:n), i.e. the negative curvature of the log-likelihood, evaluated at the MLE θ̂ML.
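The quality of this normal approximation can be checked by simulating the standardised MLE in the binomial model; a sketch with arbitrary n and π:

```python
import numpy as np

rng = np.random.default_rng(7)
n, pi0 = 100, 0.3
x = rng.binomial(n, pi0, size=100_000)

pi_hat = x / n
# Standard error from the expected Fisher information at the MLE:
# J(pi_hat) = n / (pi_hat * (1 - pi_hat)), so se = sqrt(pi_hat*(1-pi_hat)/n).
se = np.sqrt(pi_hat * (1 - pi_hat) / n)

z = (pi_hat - pi0) / se     # should be approximately N(0, 1)
print(round(z.mean(), 2), round(z.std(), 2))
```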
To test the null hypothesis H0 : θ = θ0 , we can use one of the two test statistics
√I(θ̂ML) (θ̂ML − θ0)  a∼  N(0, 1)  (4.12)
and
√J(θ̂ML) (θ̂ML − θ0)  a∼  N(0, 1),  (4.13)
which are both asymptotically standard normally distributed under the null hypothesis H0. These two statistics are therefore approximate pivots for θ and are usually denoted as Wald statistics. The corresponding statistical test is called the Wald test and was developed by Abraham Wald (1902–1950) in 1939.
Example 4.12 (Scottish lip cancer) Consider again a specific region of Scotland,
where we want to test whether the lip cancer risk is higher than on average. To test
the corresponding null hypothesis H0 : λ = λ0 with the Wald test, we know from
Example 4.8 that λ̂ML = x/e and I(λ; x) = x/λ². The observed Wald statistic (4.12) is then √I(λ̂ML; x) (λ̂ML − λ0) = (x − eλ0)/√x, which coincides with the score test statistic T2.
Using the duality of hypothesis tests and confidence intervals, Wald confidence
intervals can be computed by inverting the Wald test based on the test statistics
(4.12) or (4.13), respectively. Using (4.12), all values θ0 that fulfil
|√I(θ̂ML; x1:n) (θ̂ML − θ0)| ≤ z(1+γ)/2
would not lead to a rejection of the null hypothesis H0: θ = θ0 when the significance level is 1 − γ. These values form the γ · 100 % Wald confidence interval for θ with limits
θ̂ML ± z(1+γ)/2 · I(θ̂ML; x1:n)^{−1/2}.
se(υ̂ML ) ≈ 0.013 ,
As discussed in Sect. 3.2.4, Wald confidence intervals are not invariant to nonlin-
ear transformations, and the choice of transformation may be guided by the require-
ment that the confidence interval respects the boundaries of the parameter space. We
will now describe another approach for determining a suitable parameter transfor-
mation.
Example 4.14 (Scottish lip cancer) In Example 4.9 we computed a score confi-
dence interval for the relative lip cancer risk λ in a specific region of Scotland, cf.
Sect. 1.1.6. The used data is the realisation x from a Poisson distribution Po(eλ)
with known offset e. The expected Fisher information of λ is Jλ(λ) = e/λ ∝ λ⁻¹,
cf. Example 4.8. Therefore,
∫^λ Jλ(u)^{1/2} du ∝ ∫^λ u^{−1/2} du = 2√λ.
We can ignore the constant factor 2, so we can identify the square-root transformation φ = h(λ) = √λ as the variance-stabilising transformation of the relative rate λ of a Poisson distribution. The MLE of λ is λ̂ML = x/e, so using the invariance of the MLE, we immediately obtain φ̂ML = √(x/e).
We know from Result 4.11 that the expected Fisher information Jφ (φ) does not
depend on φ, but what is its value? Using Result 4.3, we can easily compute
Jφ(φ) = Jλ(λ) {dh(λ)/dλ}⁻² = (e/λ) · {(1/2) λ^{−1/2}}⁻² = 4e,
and back-transform those to the 95 % confidence interval [1.800, 6.07] for λ. Note
that this interval is no longer symmetric around λ̂ML = 3.62 but slightly shifted to
the right.
Figure 4.4 displays the variance-stabilised Wald confidence intervals for the
56 regions of Scotland. The intervals look similar to the score intervals shown in
Fig. 4.3.
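The variance-stabilised interval from this example can be sketched as follows; the counts are hypothetical (not the book's region), and clamping a negative lower φ-limit at zero is our addition:

```python
import numpy as np
from scipy.stats import norm

def poisson_vst_ci(x, e, level=0.95):
    """Wald interval on the variance-stabilised scale phi = sqrt(lambda)
    for X ~ Po(e * lambda): J(phi) = 4e gives se(phi_hat) = 1/(2 sqrt(e));
    the limits are then squared to return to the lambda scale."""
    q = norm.ppf((1 + level) / 2)
    phi_hat = np.sqrt(x / e)
    half = q / (2 * np.sqrt(e))
    # Clamp a negative lower phi-limit at zero (can occur for small x):
    return max(phi_hat - half, 0.0) ** 2, (phi_hat + half) ** 2

lo, hi = poisson_vst_ci(16, 4.4)   # hypothetical counts
print(round(lo, 2), round(hi, 2))
```

Like the score interval, the resulting interval is not symmetric around λ̂ML = x/e but shifted to the right.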
Example 4.15 (Inference for a proportion) We now want to derive the variance-
stabilising transformation in the binomial model. The ML estimator of the success
probability π is π̂ML = X̄ = X/n with expected Fisher information
J(π) = n/{π(1 − π)} ∝ {π(1 − π)}⁻¹,
so we have to compute ∫^π {u(1 − u)}^{−1/2} du. Using the substitution
u = sin(p)²  ⇐⇒  p = arcsin(√u),
with du = 2 sin(p) cos(p) dp, we obtain
∫^π {u(1 − u)}^{−1/2} du = ∫^{arcsin(√π)} 2 sin(p) cos(p) / |sin(p) cos(p)| dp
                        ∝ ∫^{arcsin(√π)} 2 dp
                        = 2 arcsin(√π).
√
So h(π) = arcsin( π ) is the variance-stabilising transformation with approximate
variance 1/(4n), as can be shown easily. As intended, the approximate variance does
not depend on π .
Suppose n = 100 and x = 2, i.e. π̂_ML = 0.02 and h(π̂_ML) = arcsin(√π̂_ML) = 0.142. The approximate 95 % confidence interval for φ = h(π) = arcsin(√π) now has the limits
\[
0.142 \pm 1.96 \cdot 1/\sqrt{400} = 0.044 \text{ and } 0.240.
\]
Back-transformation using h−1 (φ) = sin(φ)2 finally gives the 95 % confidence
interval [0.0019, 0.0565] for π . It should be noted that application of the back-
transformation requires the limits of the confidence interval for φ to lie in the in-
terval [0, π/2 ≈ 1.5708] (here π denotes the circle constant). In extreme cases, for
example for x = 0, this may not be the case.
cf. Appendix A.5.3 for properties of the multivariate normal distribution. In total, there are five unknown parameters, namely the means μ1 and μ2, the variances σ1² and σ2², and the correlation ρ ∈ (−1, 1). The sample correlation
\[
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}
\]
is the MLE of ρ.
It can be shown (with the central limit theorem and the delta method) that r has an asymptotic normal distribution with mean ρ and variance V(ρ) = (1 − ρ²)²/n (compare also Exercise 2 in Chap. 5). The asymptotic variance depends on ρ, so we would like to find a transformation that removes this dependence.
Using (4.14) with the Fisher information replaced by the inverse variance, we
obtain
\[
\begin{aligned}
h(\rho) &= \int^{\rho} V(u)^{-1/2}\, du \propto \int^{\rho} \frac{1}{1-u^2}\, du \\
&= \int^{\rho} \frac{1}{2(1+u)}\, du + \int^{\rho} \frac{1}{2(1-u)}\, du \\
&= \frac{1}{2}\log(1+\rho) - \frac{1}{2}\log(1-\rho) = \frac{1}{2}\log\left(\frac{1+\rho}{1-\rho}\right).
\end{aligned}
\]
This is Fisher's z-transformation ζ = h(ρ) = tanh^{−1}(ρ). We can apply the inverse transformation
\[
\rho = \tanh(\zeta) = \frac{\exp(2\zeta) - 1}{\exp(2\zeta) + 1}
\]
4.4 The Likelihood Ratio Statistic 105
to obtain a confidence interval for the correlation ρ. Note that this transformation
ensures that the limits of the confidence interval for ρ are inside the parameter space
(−1, 1) of ρ.
Example 4.17 (Blood alcohol concentration) We want to illustrate the method with
the study on blood alcohol concentration, cf. Sect. 1.1.7. We denote the BrAC mea-
surement from the ith proband as xi and the BAC value as yi , i = 1, . . . , n = 185,
and assume that they are realisations from a bivariate normal random sample.
We obtain the means x̄ = 1927.5 and ȳ = 0.7832 and the sample correlation r = 0.9725. If we use the asymptotic normality of r, we obtain the standard error √V(r) = 0.003994 for r and the 95 % Wald confidence interval [0.9646, 0.9803] for ρ.
The estimated transformed correlation is z = tanh^{−1}(r) = 2.1357, with 95 % Wald confidence interval limits z ± 1.96/√n = 1.9916 and 2.2798 for ζ.
the correlation scale gives the 95 % confidence interval [0.9634, 0.9793] for ρ. This
is similar to the original Wald confidence interval above. The reason is that the
sample size n is quite large in this example, so that the asymptotics are effective
for both scales. If we had just n = 10 observations giving the same estimate, we
would get an upper bound for the original Wald confidence interval that is larger
than 1.
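A quick numerical check of this example (a Python sketch; the analysis in the book is done in R), with r = 0.9725 and n = 185 as above:

```python
import math

def fisher_z_ci(r, n, q=1.96):
    """Wald CI for rho via Fisher's z-transformation zeta = atanh(rho),
    with asymptotic variance 1/n as in the delta-method derivation above."""
    z = math.atanh(r)        # transformed estimate
    half = q / math.sqrt(n)  # half-width on the zeta scale
    # back-transform with tanh, which keeps the limits inside (-1, 1)
    return math.tanh(z - half), math.tanh(z + half)

lo, hi = fisher_z_ci(0.9725, 185)
print(round(lo, 4), round(hi, 4))  # close to [0.9634, 0.9793] as in the text
```

With n = 10 the untransformed Wald upper limit r + 1.96(1 − r²)/√10 ≈ 1.006 indeed exceeds 1, while the back-transformed limit stays below 1.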
Using a quadratic approximation of the log-likelihood around the MLE, we have
\[
l(\theta) \approx l(\hat\theta_{ML}) - \frac{1}{2}\, I(\hat\theta_{ML})(\theta - \hat\theta_{ML})^2,
\]
so
\[
2\log\frac{L(\hat\theta_{ML})}{L(\theta)} = 2\{l(\hat\theta_{ML}) - l(\theta)\} \approx I(\hat\theta_{ML})(\theta - \hat\theta_{ML})^2.
\]
The left-hand side
\[
W = 2\log\frac{L(\hat\theta_{ML})}{L(\theta)} = -2\tilde{l}(\theta) \qquad (4.15)
\]
is called the likelihood ratio statistic. Here W = W (X1:n ) is a function of the ran-
dom sample X1:n because the likelihood L(θ ; X1:n ) and the ML estimator θ̂ML =
θ̂ML (X1:n ) depend on the data.
If θ denotes the true parameter value, then (4.12) implies that
\[
I(\hat\theta_{ML})(\hat\theta_{ML} - \theta)^2 \overset{a}{\sim} \chi^2(1),
\]
due to the fact that the squared standard normal random variable has a chi-squared
distribution with one degree of freedom, cf. Appendix A.5.2. So the likelihood ratio
statistic (4.15) follows a chi-squared distribution with one degree of freedom,
\[
W \overset{a}{\sim} \chi^2(1),
\]
and is an approximate pivot. It can be used both for significance testing and the
calculation of confidence intervals.
The likelihood ratio test to test the null hypothesis H0 : θ = θ0 is based on the
likelihood ratio statistic (4.15) with θ replaced by θ0 . An equivalent formulation is
based on the signed likelihood ratio statistic
\[
\operatorname{sign}(\hat\theta_{ML} - \theta_0)\cdot\sqrt{W}, \qquad (4.16)
\]
which is asymptotically standard normal under H0.
Example 4.18 (Scottish lip cancer) We now want to compute the likelihood ratio
test statistic for one specific region in Scotland with the null hypothesis H0 : λ = λ0 .
The log-likelihood equals
\[
l(\lambda) = x\log(\lambda) - e\lambda,
\]
so
\[
W(x) = \begin{cases} 2\left[x\{\log(x) - \log(e\lambda_0) - 1\} + e\lambda_0\right] & \text{for } x > 0, \\ 2e\lambda_0 & \text{for } x = 0, \end{cases}
\]
and we easily obtain the signed likelihood ratio statistic (4.16), cf. Fig. 4.5. If posi-
tive, the signed likelihood ratio statistic T4 is more conservative than the score test
in this example. If negative, it is less conservative due to T4 ≤ T1 . Figure 4.6 gives
a histogram of the corresponding P -values. The pattern is very similar to the one
based on the score statistic shown in Fig. 4.2. Note that 25 out of 56 regions have a
P -value smaller than 5 %, so many more than we would expect under the assump-
tion that the null hypothesis holds in all regions.
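For a single region the statistic is easy to evaluate. The following Python sketch (the book works in R) computes W, the signed statistic and a two-sided P-value; λ0 = 1 is a hypothetical null value, while x = 11 and e = 3.04 are the illustrative values also used for Fig. 4.8 below:

```python
import math

def poisson_lr(x, e, lam0):
    """Likelihood ratio statistic W and its signed root for H0: lambda = lam0,
    based on l(lambda) = x*log(lambda) - e*lambda."""
    if x > 0:
        w = 2 * (x * (math.log(x) - math.log(e * lam0) - 1) + e * lam0)
    else:
        w = 2 * e * lam0
    t = math.copysign(math.sqrt(w), x / e - lam0)  # signed statistic (4.16)
    p = math.erfc(abs(t) / math.sqrt(2))           # two-sided P-value, 2{1 - Phi(|t|)}
    return w, t, p

w, t, p = poisson_lr(11, 3.04, 1)
print(round(w, 3), round(t, 3))
```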
Note that the central quantity for computing a likelihood ratio confidence interval is the relative log-likelihood l̃(θ), cf. Definition 2.5. We are now in a position to calibrate the relative log-likelihood if θ is a scalar, as shown in Table 4.1. Of course, this also
induces a calibration of the relative likelihood L̃(θ ). For example, all parameter
values with relative likelihood larger than 0.147 will be within the 95 % likelihood
ratio confidence interval.
Computation of the limits of a likelihood ratio confidence interval requires in
general numerical methods such as bisection (see Appendix C.1.2) to find the roots
of the equation
\[
\tilde{l}(\theta) = -\frac{1}{2}\chi^2_\gamma(1).
\]
This is now illustrated in the Poisson model.
This is now illustrated in the Poisson model.
Example 4.19 (Scottish lip cancer) For comparison with Examples 4.9 and 4.14, we
numerically compute 95 % likelihood ratio confidence intervals for λi in each of the
regions i = 1, . . . , 56 in Scotland. We use the R-function uniroot (cf. Appendix C
for more details):
Note that we are careful in the R-code when xi = 0 because then the mode of the
likelihood is at zero, which is the left boundary of the parameter space R+ 0 . There-
fore, we only compute the upper bound numerically in that case. Moreover, we do
not search for the lower bound very close to zero, but only a small ε > 0 away from
that (here it is ε ≈ 1.5 · 10−8 ). Figure 4.7 displays the resulting intervals. They look
very similar to the ones based on the score statistic in Fig. 4.3.
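As a cross-check outside R, the same root-finding can be sketched with a hand-rolled bisection (cf. Appendix C.1.2) in Python; x = 11 and e = 3.04 are the illustrative values used for Fig. 4.8 below:

```python
import math

def lr_ci_poisson(x, e, chi2=3.841459, eps=1.5e-8):
    """95 % likelihood ratio interval for lambda in the Po(e*lambda) model."""
    if x == 0:
        # mode at the boundary lambda = 0: the upper limit solves e*lambda = chi2/2
        return 0.0, chi2 / (2 * e)
    lam_hat = x / e

    def f(lam):
        # relative log-likelihood plus chi2/2; its roots are the interval limits
        return x * math.log(lam / lam_hat) - e * (lam - lam_hat) + chi2 / 2

    def bisect(a, b):
        # simple bisection, assuming f(a) and f(b) have opposite signs
        for _ in range(100):
            m = (a + b) / 2
            if f(a) * f(m) <= 0:
                b = m
            else:
                a = m
        return (a + b) / 2

    # search the lower limit only a small eps away from zero, as in the text
    return bisect(eps, lam_hat), bisect(lam_hat, 10 * lam_hat + 10)

lo, hi = lr_ci_poisson(11, 3.04)
print(round(lo, 3), round(hi, 3))
```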
which corresponds to
\[
\left\{\theta : \sqrt{I(\hat\theta_{ML})}\,|\theta - \hat\theta_{ML}| \le z_{(1+\gamma)/2}\right\},
\]
i.e. the Wald confidence interval. This is because (z(1+γ )/2 )2 = χγ2 (1) due to the
relation between the standard normal and the chi-squared distribution. Figure 4.8
illustrates this for X ∼ Po(eλ) with known offset e = 3.04 and observation x =
11. This comparison suggests that likelihood ratio confidence intervals are more
accurate than Wald confidence intervals, as they avoid the quadratic approximation
of the log-likelihood. Various theoretical results support this intuitive finding. In
particular, likelihood ratio intervals are invariant to one-to-one transformations, as
we have shown (implicitly) already in Sect. 2.1.3.
Likelihood ratio confidence intervals can also be computed in cases where the
quadratic approximation of the log-likelihood fails, as in the following example.
Example 4.20 (Uniform model) In Example 2.18 we considered the uniform model
U(0, θ ) with unknown upper bound θ as a counter-example, where quadratic ap-
proximation of the log-likelihood is not possible. However, we still can derive a
likelihood ratio confidence interval. Because the likelihood from the realisation x1:n
of a random sample is
\[
L(\theta) = I_{[0,\theta]}\left\{\max_i(x_i)\right\}\,\theta^{-n}
\]
and thus zero for values θ < max_i(x_i) = θ̂_ML, we only need to compute the upper limit of the interval. The relative likelihood is (in the range θ ≥ θ̂_ML)
\[
\tilde{L}(\theta) = \frac{L(\theta)}{L(\hat\theta_{ML})} = \left(\frac{\hat\theta_{ML}}{\theta}\right)^n,
\]
maxi (xi )
θ= 1
exp − 12 χγ2 (1) n
maxi (xi )
θ= 1
(1 − γ ) n
to achieve a confidence level of γ. Since 1 − γ < exp{−½χ²_γ(1)} (see Table 4.1), this upper bound is larger than the one obtained from the above likelihood calculation, and the likelihood ratio confidence interval will have lower than nominal coverage.
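This coverage deficit is easy to verify by Monte Carlo simulation; here is a Python sketch with assumed settings n = 10, θ = 1, γ = 0.95 and 20 000 replications:

```python
import math
import random

def coverage_uniform(n=10, theta=1.0, gamma=0.95, nsim=20000, seed=1):
    """Monte Carlo coverage of the likelihood ratio interval
    [max(x), max(x)/exp(-chi2/2)^(1/n)] versus the exact interval
    [max(x), max(x)/(1-gamma)^(1/n)] in the U(0, theta) model."""
    chi2 = 3.841459  # chi^2_{0.95}(1)
    random.seed(seed)
    lr_factor = (1 / math.exp(-chi2 / 2)) ** (1 / n)
    exact_factor = (1 / (1 - gamma)) ** (1 / n)
    hit_lr = hit_exact = 0
    for _ in range(nsim):
        m = max(random.uniform(0, theta) for _ in range(n))
        hit_lr += (m * lr_factor >= theta)      # interval [m, m*factor] covers theta?
        hit_exact += (m * exact_factor >= theta)
    return hit_lr / nsim, hit_exact / nsim

cov_lr, cov_exact = coverage_uniform()
print(cov_lr, cov_exact)
```

Up to simulation error, the exact interval attains the nominal 95 %, while the likelihood ratio interval covers in only about 1 − exp{−½χ²_γ(1)} ≈ 85 % of the replications.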
Plugging the left-hand side of (4.20) into the exponential function in (4.19), we
obtain the following (approximate) formula for the density of the ML estimator θ̂ML :
\[
f(\hat\theta_{ML}) \approx p^*(\hat\theta_{ML}) = \sqrt{\frac{I(\hat\theta_{ML})}{2\pi}}\cdot\frac{L(\theta)}{L(\hat\theta_{ML})}.
\]
Example 4.21 (Normal model) Let X1:n denote a random sample from a normal
distribution with unknown mean μ and known variance σ 2 . The ML estimator of μ
is μ̂_ML = x̄ = \(\sum_{i=1}^n x_i/n\) with observed Fisher information I(μ̂_ML) = n/σ². Now X̄ is sufficient for μ, so the likelihood function of μ is (up to a multiplicative constant)
\[
L(\mu) = \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right\}.
\]
Plugging this into the p* formula gives
\[
f(\hat\mu_{ML}) \approx \sqrt{\frac{n}{2\pi\sigma^2}}\exp\left\{-\frac{n}{2\sigma^2}(\hat\mu_{ML}-\mu)^2\right\},
\]
the density of a normal distribution with mean μ and variance σ²/n. So here the p* formula gives the density of the exact distribution of the ML estimator. However, this result could also have been obtained using the Wald statistic.
Suppose now that μ is known but σ² is unknown. Then σ̂²_ML = \(\sum_{i=1}^n (x_i-\mu)^2/n\). Applying the p* formula to σ̂²_ML yields an approximate density which can be identified as the kernel of a G(n/2, n/(2σ²)) distribution, cf. Appendix A.5.2. After suitable normalisation, the p* formula hence gives σ̂²_ML ∼ G(n/2, n/(2σ²)). Interestingly, this is the exact distribution of σ̂²_ML, since we know from Example 3.8 that the pivot nσ̂²_ML/σ² has a χ²(n) distribution, i.e. a G(n/2, 1/2) distribution, so σ̂²_ML ∼ G(n/2, n/(2σ²)). Note that this distribution has mean σ² and variance (2σ⁴)/n, which matches the mean and variance of the ordinary normal approximation.
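The stated moments can be checked by simulation; a Python sketch with assumed values μ = 0, σ² = 4 and n = 20:

```python
import random

def sim_sigma2_ml(mu=0.0, sigma2=4.0, n=20, nsim=20000, seed=2):
    """Simulate the MLE of sigma^2 with known mean mu and compare its empirical
    mean and variance with sigma^2 and 2*sigma^4/n, the G(n/2, n/(2 sigma^2)) moments."""
    random.seed(seed)
    sd = sigma2 ** 0.5
    ests = []
    for _ in range(nsim):
        xs = [random.gauss(mu, sd) for _ in range(n)]
        ests.append(sum((x - mu) ** 2 for x in xs) / n)  # MLE with known mu
    mean = sum(ests) / nsim
    var = sum((e - mean) ** 2 for e in ests) / nsim
    return mean, var

m, v = sim_sigma2_ml()
print(round(m, 2), round(v, 2))  # should be near sigma^2 = 4 and 2*16/20 = 1.6
```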
\[
2\log\frac{L(\hat\theta_{ML})}{L(\theta)} \overset{a}{\sim} \chi^2(1). \qquad (4.23)
\]
The asymptotic pivotal distribution in (4.21) and (4.22) still holds if we replace
J1:n (θ ) by J1:n (θ̂ML ), I (θ ; X1:n ) or I (θ̂ML ; X1:n ). Regarding the choice of the Fisher
information, it is typically recommended to use I (θ̂ML ; X1:n ). However, note that in
exponential families, I (θ̂ML ; X1:n ) = J1:n (θ̂ML ), cf. Exercise 8 in Chap. 3.
The large number of possible statistics, which are all asymptotically equivalent but give different results in finite samples, is confusing. Which of the different pivots
should we use in practice? The score statistic is applied only in special cases such
as in the case of the Wilson confidence interval for a proportion (cf. Example 4.10).
More commonly used are Wald confidence intervals with limits
\[
\hat\theta_{ML} \pm z_{(1+\gamma)/2}\cdot\operatorname{se}(\hat\theta_{ML}).
\]
For a binomial proportion, the different likelihood-based intervals take the following forms:
1. With the standard error se(π̂_ML) = \(\sqrt{\hat\pi_{ML}(1-\hat\pi_{ML})/n}\), we can easily compute the limits of the γ · 100 % Wald confidence interval:
\[
\hat\pi_{ML} \pm q\cdot\operatorname{se}(\hat\pi_{ML}),
\]
where q = z(1+γ )/2 denotes the (1 + γ )/2 quantile of the standard normal dis-
tribution. However, the standard error will be zero for x = 0 and x = n, so in
these cases we calculate the standard error based on x = 0.5 or x = n − 0.5
successes, respectively, and use it to calculate a one-sided confidence interval
4.6 A Comparison of Likelihood-Based Confidence Intervals 115
for π with lower limit 0 and upper limit 1, respectively. If, for the other cases,
the limits of the Wald confidence interval lie outside the unit interval, they are
rounded to 0 and 1, respectively.
2. The Wald confidence interval for φ = logit(π) avoids this problem using the
logit transformation. First note that (due to invariance of the MLE)
\[
\hat\phi_{ML} = \operatorname{logit}(\hat\pi_{ML}) = \log\left(\frac{\hat\pi_{ML}}{1-\hat\pi_{ML}}\right) = \log\left(\frac{x}{n-x}\right).
\]
The standard error of φ̂ML can be easily computed with the delta method
(see Example 3.12):
\[
\operatorname{se}(\hat\phi_{ML}) = \sqrt{\frac{1}{x} + \frac{1}{n-x}}.
\]
The limits of the Wald confidence interval for φ = logit(π),
\[
\hat\phi_{ML} \pm q\cdot\operatorname{se}(\hat\phi_{ML}),
\]
are finally back-transformed to the π-scale with the inverse logit function. The midpoint of the Wilson confidence interval can be written as
\[
\frac{x + q^2/2}{n + q^2},
\]
Table 4.2 Comparison of different confidence intervals for a binomial probability at level 95 %
for various values of sample size n and number of successes x
(a) Different Wald confidence intervals
n x (1) Wald for π (2) Wald for logit(π) (3) Wald for arcsin(√π)
10 0 0.000 to 0.113 0.000 to 0.364 0.000 to 0.066
10 1 0.000 to 0.286 0.014 to 0.467 0.000 to 0.349
10 5 0.190 to 0.810 0.225 to 0.775 0.210 to 0.790
100 0 0.000 to 0.012 0.000 to 0.049 0.000 to 0.007
100 10 0.041 to 0.159 0.055 to 0.176 0.049 to 0.166
100 50 0.402 to 0.598 0.403 to 0.597 0.403 to 0.597
the relative proportion in the sample after the addition of q 2 /2 successes and
q 2 /2 non-successes. For illustration, if γ = 0.95, then q 2 /2 = 1.962 /2, so we
will add slightly less than two successes and non-successes.
5. Numerical methods (cf. Appendix C) are required to compute the limits of the
γ · 100 % likelihood ratio confidence interval
\[
\left\{\theta : \tilde{l}(\theta) \ge -\chi^2_\gamma(1)/2\right\}.
\]
if x ∉ {0, n}, where bα(α, β) denotes the α quantile of the beta distribution with parameters α and β, see Appendix A.5.2. If x = 0, we set the lower limit to zero, and if x = n, we set the upper limit to one.
Table 4.2 gives the limits of the different 95 % confidence intervals for selected
values of the sample size n and the number of successes x. There are substantial dif-
ferences for small x and n, but the intervals become more similar for larger sample
sizes and are quite close for n = 100 and x = 50.
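For instance, the first two columns of Table 4.2 can be reproduced with a few lines of Python (the book's own computations are in R); the truncation and the back-transformation follow the descriptions of intervals (1) and (2) above:

```python
import math

Q = 1.959964  # z_{0.975}

def wald_ci(x, n):
    """Wald interval (1) for pi, truncated to [0, 1]; at x in {0, n} the
    text's adjustment (x = 0.5 or n - 0.5) would be used instead."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - Q * se), min(1.0, p + Q * se)

def logit_wald_ci(x, n):
    """Wald interval (2) for logit(pi), back-transformed with the inverse logit."""
    phi = math.log(x / (n - x))
    se = math.sqrt(1 / x + 1 / (n - x))
    inv = lambda t: 1 / (1 + math.exp(-t))
    return inv(phi - Q * se), inv(phi + Q * se)

print([round(v, 3) for v in wald_ci(1, 10)])        # Table 4.2: 0.000 to 0.286
print([round(v, 3) for v in logit_wald_ci(1, 10)])  # Table 4.2: 0.014 to 0.467
```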
What are the actual coverage probabilities of the different confidence intervals?
For example, for n = 50, there are 51 different confidence intervals CIγ (x) depend-
ing on the number of successes x ∈ T = {0, 1, . . . , 50}. We can now compute the
coverage probability
\[
\Pr\left\{\pi \in CI_\gamma(X)\right\} = \sum_{x\in T} f(x;\pi,n)\, I_{CI_\gamma(x)}(\pi)
\]
for every true parameter value π based on the binomial probability mass function
f (x; π, n).
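This sum is straightforward to evaluate. A Python sketch for the simple Wald interval with n = 50, illustrating the undercoverage discussed next:

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def wald_covers(x, n, p, q=1.959964):
    """Does the Wald interval for x successes out of n contain p?
    (At x = 0 or x = n the interval degenerates to the point x/n here.)"""
    ph = x / n
    se = math.sqrt(ph * (1 - ph) / n)
    return ph - q * se <= p <= ph + q * se

def coverage(n, p):
    """Actual coverage probability: binomial pmf summed over all x
    whose interval contains the true p."""
    return sum(binom_pmf(x, n, p) for x in range(n + 1) if wald_covers(x, n, p))

print(round(coverage(50, 0.5), 4))  # noticeably below the nominal 0.95
```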
Ideally, Pr{π ∈ CIγ (X)} should be equal to the nominal level γ for every sample
size n and every true parameter value π . However, the binomial distribution is dis-
crete, so all confidence intervals will only approximately have the nominal coverage
level. Figure 4.9 illustrates that the true coverage of the various confidence intervals
actually differs a lot. Shown are the actual coverage probabilities and smoothed
ones, which give a better impression of the locally averaged coverage probability.
Figure 4.9 shows that the Wald confidence interval has nearly always smaller coverage than the nominal confidence level, sometimes considerably smaller. The variance-stabilised Wald confidence interval for arcsin(√π) behaves somewhat better. The Wald confidence interval for logit(π) tends to have larger coverage than
the nominal level. The best locally averaged coverage is achieved by the Wilson
confidence interval, followed by the likelihood ratio confidence interval, which be-
haves similarly for medium values of π and has slightly lower coverage than the
Wilson confidence interval. Of particular interest is the behaviour of the “exact” Clopper–Pearson interval: its coverage is always larger than the nominal level, so this confidence interval appears to be anything but “exact”, at least in terms of coverage. Only in rare applications may such a conservative confidence interval be warranted. However, the Clopper–Pearson interval is widely used in practice, presumably due to the misleading specification “exact”.
Figure 4.10 displays the widths of the confidence intervals, which is an alterna-
tive quality criterion: If several confidence intervals attain the same nominal level,
then the one with smaller width should be preferred. It is good to see from Fig. 4.10,
which displays the widths for values of x in the range between 0 and 25, that the
conservative Clopper–Pearson confidence interval has the largest width for x ≥ 4.
The Wilson confidence interval has the smallest width for x > 10, while for x ≤ 10, the Wald intervals for π and arcsin(√π) and the likelihood ratio confidence interval have the smallest width. The Wald confidence interval for logit(π) has a quite large width for x ≤ 10.
4.7 Exercises
(d) Use the score statistic (4.2) to obtain a P -value. Why do we not need to
consider parameter transformations when using this statistic?
(e) Use the exact null distribution from your model to obtain a P -value.
What are the advantages and disadvantages of this procedure in general?
4. Suppose X1:n is a random sample from an Exp(λ) distribution.
(a) Derive the score function of λ and solve the score equation to get λ̂ML .
(b) Calculate the observed Fisher information, the standard error of λ̂ML and
a 95 % Wald confidence interval for λ.
(c) Derive the expected Fisher information J (λ) and the variance-stabilising
transformation φ = h(λ) of λ.
(d) Compute the MLE of φ and derive a 95 % confidence interval for λ by
back-transforming the limits of the 95 % Wald confidence interval for φ.
Compare with the result from 4(b).
(e) Derive the Cramér–Rao lower bound for the variance of unbiased esti-
mators of λ.
(f) Compute the expectation of λ̂ML and use this result to construct an unbi-
ased estimator of λ. Compute its variance and compare it to the Cramér–
Rao lower bound.
5. An alternative parametrisation of the exponential distribution is
\[
f_X(x) = \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right) I_{\mathbb{R}^+}(x), \qquad \theta > 0.
\]
Let X1:n denote a random sample from this density. We want to test the null hypothesis H0: θ = θ0 against the alternative hypothesis H1: θ ≠ θ0.
(a) Calculate both variants T1 and T2 of the score test statistic.
(b) A sample of size n = 100 gave x̄ = 0.26142. Quantify the evidence
against H0 : θ0 = 0.25 using a suitable significance test.
6. In a study assessing the sensitivity π of a low-budget diagnostic test for
asthma, each of n asthma patients is tested repeatedly until the first positive
test result is obtained. Let Xi be the number of the first positive test for pa-
tient i. All patients and individual tests are independent, and the sensitivity π
is equal for all patients and tests.
(a) Derive the probability mass function f (x; π) of Xi .
(b) Write down the log-likelihood function for the random sample X1:n and
compute the MLE π̂ML .
(c) Derive the standard error se(π̂ML ) of the MLE.
(d) Give a general formula for an approximate 95 % confidence interval
for π . What could be the problem of this interval?
(e) Now we consider the parametrisation φ = logit(π) = log{π/(1 − π)}. Derive the corresponding MLE φ̂_ML, its standard error and the associated approximate 95 % confidence interval. What is the advantage of this interval?
(f) n = 9 patients underwent the trial, and the observed numbers were x = (3, 5, 2, 6, 9, 1, 2, 2, 3). Calculate the MLEs π̂_ML and φ̂_ML and the confidence intervals from 6(d) and 6(e), and compare them by transforming the latter back to the π-scale.
(g) Produce a plot of the relative log-likelihood function l̃(π) and two approximations in the range π ∈ (0.01, 0.5): the first approximation is based on the direct quadratic approximation l̃_π(π) ≈ q_π(π), and the second is based on the quadratic approximation l̃_φ(φ) ≈ q_φ(φ), i.e. q_φ{logit(π)} values are plotted. Comment on the result.
7. A simple model for the drug concentration in plasma over time after a single
intravenous injection is c(t) = θ2 exp(−θ1 t) with θ1 , θ2 > 0. For simplicity,
we assume here that θ2 = 1.
(a) Assume that n probands had their concentrations c_i, i = 1, …, n, measured at the same single time point t, and assume that the model c_i ~iid N(c(t), σ²) is appropriate for the data. Calculate the MLE of θ1.
(b) Calculate the asymptotic variance of the MLE.
(c) In pharmacokinetic studies one is often interested in the area under the concentration curve, α = \(\int_0^\infty \exp(-\theta_1 t)\,dt\). Calculate the MLE for α and its variance estimate using the delta method.
(d) We now would like to determine the optimal time point for measuring
the concentrations ci . Minimise the asymptotic variance of the MLE
with respect to t, when θ1 is assumed to be known, to obtain an optimal
time point topt .
8. Assume the gamma model G(α, α/μ) for the random sample X1:n with mean
E(Xi ) = μ > 0 and shape parameter α > 0.
(a) First assume that α is known. Derive the MLE μ̂ML and the observed
Fisher information I (μ̂ML ).
(b) Use the p ∗ formula to derive an asymptotic density of μ̂ML depending on
the true parameter μ. Show that the kernel of this approximate density
is exact in this case, i.e. it equals the kernel of the exact density known
from Exercise 9 from Chap. 4.
(c) Stirling’s approximation of the gamma function is
\[
\Gamma(x) \approx \sqrt{\frac{2\pi}{x}}\cdot\frac{x^x}{\exp(x)}. \qquad (4.24)
\]
(e) Show, by rewriting the score equation, that the MLE α̂_ML fulfils
\[
-n\psi(\hat\alpha_{ML}) + n\log(\hat\alpha_{ML}) + n = -\sum_{i=1}^n \log(x_i) + \frac{1}{\mu}\sum_{i=1}^n x_i + n\log(\mu). \qquad (4.25)
\]
Hence, show that the log-likelihood kernel can be written as
\[
l(\alpha) = n\left\{\alpha\log(\alpha) - \alpha - \log\Gamma(\alpha) + \alpha\psi(\hat\alpha_{ML}) - \alpha\log(\hat\alpha_{ML})\right\}.
\]
4.8 References
The methods discussed in this chapter are found in many books on likelihood in-
ference, for example in Pawitan (2001) or Davison (2003), see also Millar (2011).
Further details on Fisher regularity conditions can be found in Lehmann and Casella
(1998, Chap. 6). Different confidence intervals for a binomial proportion are thor-
oughly discussed in Brown et al. (2001) and Connor and Imrey (2005), see also
Newcombe (2013). The smoothed coverage probabilities have been computed us-
ing a specific kernel function as described in Bayarri and Berger (2004).
5 Likelihood Inference in Multiparameter Models
Contents
5.1 Score Vector and Fisher Information Matrix . . . . . . . . . . . . . . . . . . . . . 124
5.2 Standard Error and Wald Confidence Interval . . . . . . . . . . . . . . . . . . . . 128
5.3 Profile Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4 Frequentist Properties of the Multiparameter Likelihood . . . . . . . . . . . . . . 142
5.4.1 The Score Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4.2 The Wald Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4.3 The Multivariate Delta Method . . . . . . . . . . . . . . . . . . . . . . . . 145
5.4.4 The Likelihood Ratio Statistic . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 The Generalised Likelihood Ratio Statistic . . . . . . . . . . . . . . . . . . . . . 147
5.6 Conditional Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Example 5.1 (Normal model) The normal model has two unknown parameters, the
expectation μ and the variance σ 2 . Suppose that both are unknown and let X1:n
denote a random sample from an N(μ, σ 2 ) distribution, so the unknown parameter
vector is θ = (μ, σ²)ᵀ. The log-likelihood
\[
l(\theta) = l(\mu, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\left\{(n-1)s^2 + n(\bar{x}-\mu)^2\right\}
\]
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 123
https://doi.org/10.1007/978-3-662-60792-3_5,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
is hence a function of μ and σ 2 . Note that the data x1:n enter the likelihood function
through the sufficient statistics x̄ and s 2 , compare Example 2.21.
For illustration, consider the study on alcohol concentration from Sect. 1.1.7.
Assume that there is no difference between genders, so that we can look at the
overall n = 185 volunteers with empirical mean and variance of the transformation
factors given in Table 1.3. The resulting relative likelihood is shown in Fig. 5.1.
\[
l(\theta) = l(\pi) = l(\pi_1, \ldots, \pi_k) = \sum_{i=1}^k x_i\log(\pi_i)
\]
The score vector is
\[
S(\theta) = \frac{\partial}{\partial\theta} l(\theta) = \left(\frac{\partial}{\partial\theta_1} l(\theta), \ldots, \frac{\partial}{\partial\theta_p} l(\theta)\right)^\top.
\]
The MLE θ̂_ML = (θ̂1, …, θ̂p)ᵀ is now a vector and is usually obtained by solving the score equations S(θ) = 0. The Fisher information matrix contains the negative second partial derivatives
\[
-\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} l(\theta), \qquad 1 \le i, j \le p.
\]
Example 5.3 (Normal model) The partial derivatives of the log-likelihood are
\[
\frac{\partial l(\theta)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) \quad\text{and}\quad
\frac{\partial l(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2,
\]
and setting them to zero yields
\[
\hat\mu_{ML} = \bar{x} \quad\text{and}\quad \hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2.
\]
Note that the MLE of σ² is biased, as discussed previously in Example 3.1. The Fisher information I(θ), the matrix containing the negative second partial derivatives, turns out to be
\[
I(\theta) = -\begin{pmatrix} \frac{\partial^2 l(\theta)}{\partial\mu^2} & \frac{\partial^2 l(\theta)}{\partial\mu\,\partial\sigma^2} \\[4pt] \frac{\partial^2 l(\theta)}{\partial\sigma^2\,\partial\mu} & \frac{\partial^2 l(\theta)}{\partial(\sigma^2)^2} \end{pmatrix}
= \frac{1}{\sigma^2}\begin{pmatrix} n & \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) \\[4pt] \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) & \frac{1}{\sigma^4}\sum_{i=1}^n (x_i-\mu)^2 - \frac{n}{2\sigma^2} \end{pmatrix}. \qquad (5.1)
\]
The observed Fisher information, obtained by replacing μ and σ² with μ̂_ML and σ̂²_ML, respectively, turns out to be
\[
I(\hat\theta_{ML}) = \begin{pmatrix} \frac{n}{\hat\sigma^2_{ML}} & 0 \\[4pt] 0 & \frac{n}{2\hat\sigma^4_{ML}} \end{pmatrix}.
\]
Maximisation of the multinomial log-likelihood
\[
l(\pi) = \sum_{i=1}^k x_i\log(\pi_i)
\]
under the constraint g(π) = \(\sum_{i=1}^k \pi_i - 1 = 0\) can be done using the Lagrange method (cf. Appendix B.2.5). The equation to solve is
\[
\frac{\partial}{\partial\pi} l(\pi) = \lambda\cdot\frac{\partial}{\partial\pi} g(\pi),
\]
The ith component of this equation is x_i/π_i = λ, i.e. x_i = λπ_i. Summation over i gives
\[
n = \sum_{i=1}^k x_i = \sum_{i=1}^k \lambda\pi_i = \lambda\sum_{i=1}^k \pi_i = \lambda,
\]
and therefore
\[
\hat\pi_{ML} = (\hat\pi_1, \ldots, \hat\pi_k)^\top,
\]
where π̂_i = x_i/n. So the MLEs of the multinomial probabilities are just the corresponding relative frequencies.
An alternative approach explicitly replaces π_k by 1 − \(\sum_{i=1}^{k-1}\pi_i\) in the probability mass function of the multinomial distribution. The corresponding log-likelihood of the trimmed parameter vector π̃ = (π_1, …, π_{k−1})ᵀ is then
\[
l(\tilde\pi) = \sum_{i=1}^{k-1} x_i\log(\pi_i) + x_k\log\left(1-\sum_{i=1}^{k-1}\pi_i\right).
\]
This defines a set of score equations, which will eventually lead to the same MLEs as before.
The Fisher information matrix of π̃ has dimension (k − 1) × (k − 1) and is
\[
I(\tilde\pi) = -\frac{\partial}{\partial\tilde\pi} S(\tilde\pi)^\top = \operatorname{diag}\left(\frac{x_i}{\pi_i^2}\right)_{i=1}^{k-1} + \frac{x_k}{\left(1-\sum_{i=1}^{k-1}\pi_i\right)^2}\,\mathbf{1}\mathbf{1}^\top, \qquad (5.2)
\]
where 1 is the vector of ones of length k − 1, so that the matrix 11ᵀ contains only ones. If we replace π̃ in (5.2) by π̃̂_ML, we obtain the observed Fisher information matrix
\[
I(\hat{\tilde\pi}_{ML}) = n\left\{\operatorname{diag}(\hat{\tilde\pi}_{ML})^{-1} + \left(1-\sum_{i=1}^{k-1}\hat\pi_i\right)^{-1}\mathbf{1}\mathbf{1}^\top\right\}.
\]
Applying a useful formula for matrix inversion (cf. Appendix B.1.4), we obtain the
inverse observed Fisher information matrix
\[
\tilde{l}(\theta) \approx -\frac{1}{2}(\theta-\hat\theta_{ML})^\top I(\hat\theta_{ML})(\theta-\hat\theta_{ML}) \qquad (5.4)
\]
of the relative log-likelihood, which only requires the MLE θ̂ ML and the observed
Fisher information matrix I (θ̂ ML ). It is instructive to compare this approximation
with the corresponding one for a scalar parameter:
\[
\frac{\hat\theta_i - \theta_i}{\operatorname{se}(\hat\theta_i)} \overset{a}{\sim} \mathrm{N}(0,1), \qquad i = 1, \ldots, p,
\]
where the standard error se(θ̂i ) is defined as the square root of the ith diagonal entry
of the inverse observed Fisher information matrix.
This result can be used to calculate the limits of a γ · 100 % Wald confidence interval for θi:
\[
\hat\theta_i \pm z_{(1+\gamma)/2}\,\operatorname{se}(\hat\theta_i). \qquad (5.5)
\]
A justification of this procedure is given in Result 5.1; see also Sect. 5.4.2.
Example 5.5 (Normal model) In Example 5.3 we have derived the observed Fisher
information matrix in the normal model. This matrix is diagonal, so its inverse can
easily be obtained by inverting the diagonal elements:
\[
I(\hat\theta_{ML})^{-1} = \begin{pmatrix} \frac{\hat\sigma^2_{ML}}{n} & 0 \\[4pt] 0 & \frac{2\hat\sigma^4_{ML}}{n} \end{pmatrix}.
\]
The standard errors are therefore se(μ̂_ML) = σ̂_ML/√n and se(σ̂²_ML) = σ̂²_ML √(2/n), and we obtain the γ · 100 % Wald confidence intervals with limits
\[
\bar{x} \pm z_{(1+\gamma)/2}\,\frac{\hat\sigma_{ML}}{\sqrt{n}} \quad\text{and}\quad \hat\sigma^2_{ML} \pm z_{(1+\gamma)/2}\,\sqrt{\frac{2}{n}}\,\hat\sigma^2_{ML}
\]
for μ and σ², respectively. Note that these intervals are only asymptotically valid. Indeed, we know from Example 3.8 that the interval with limits
\[
\bar{x} \pm t_{(1+\gamma)/2}(n-1)\cdot\frac{s}{\sqrt{n}},
\]
where s² = \(\sum_{i=1}^n (x_i-\bar{x})^2/(n-1)\), forms an exact γ · 100 % confidence interval for μ, which is wider because t_{(1+γ)/2}(n − 1) > z_{(1+γ)/2} and s² > σ̂²_ML. However,
cf. Example 2.7. The standard errors of the MLEs in the multinomial model can easily be obtained from (5.3):
\[
\operatorname{se}(\hat\pi_i) = \sqrt{\frac{\hat\pi_i(1-\hat\pi_i)}{n}},
\]
which is exactly the same formula as in the binomial model. In our example we
obtain se(π̂1 ) = 0.017, se(π̂2 ) = 0.018 and se(π̂3 ) = 0.014. Note that the differ-
ence between the MLEs under Hardy–Weinberg equilibrium and the MLEs in the
trinomial model is less than two standard errors for all three parameters. This sug-
gests that the Hardy–Weinberg model fits these data quite well, although this in-
terpretation ignores the uncertainty of the estimates in the Hardy–Weinberg model.
We will describe more rigorous approaches to decide between these two models in
Sect. 7.1.
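The computation is a one-liner per category; here is a Python sketch with hypothetical trinomial counts (the actual data are introduced in Example 2.7):

```python
import math

def multinomial_mle_se(counts):
    """MLEs pi_i = x_i/n and standard errors sqrt(pi_i (1 - pi_i)/n),
    i.e. the binomial formula applied to each category."""
    n = sum(counts)
    mles = [x / n for x in counts]
    ses = [math.sqrt(p * (1 - p) / n) for p in mles]
    return mles, ses

mles, ses = multinomial_mle_se([233, 385, 129])  # hypothetical counts
print([round(s, 3) for s in ses])
```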
Definition 5.4 (Profile likelihood) Let L(θ, η) be the joint likelihood function of the parameter of interest θ and the nuisance parameter η. Let η̂_ML(θ) denote the MLE of η with respect to L(θ, η) for fixed θ. Then
\[
L_p(\theta) = L\{\theta, \hat\eta_{ML}(\theta)\}
\]
is called the profile likelihood of θ. The value of the profile likelihood at a particular parameter value θ is thus obtained by maximising the joint likelihood L(θ, η) with respect to the nuisance parameter η.
We often need numerical techniques to maximise the joint likelihood with respect to the nuisance parameter η. To avoid this, it is tempting to just plug in the MLE of the nuisance parameter η into the joint likelihood L(θ, η), which leads to the following definition.
Definition 5.5 (Estimated likelihood) Let (θ̂_ML, η̂_ML) denote the MLE of (θ, η) based on the joint likelihood L(θ, η). The function
\[
L_e(\theta) = L(\theta, \hat\eta_{ML})
\]
is called the estimated likelihood of θ.
The estimated likelihood ignores the uncertainty of the MLE η̂ML : it assumes
that the true value of the nuisance parameter η is equal to its estimate η̂ML . As a
consequence, the estimated likelihood Le (θ ) will in general not correctly quantify
the uncertainty with respect to θ .
Example 5.7 (Normal model) Let X1:n denote a random sample from an N(μ, σ 2 )
distribution with both parameters unknown. The MLE of σ 2 for fixed μ is
\[
\hat\sigma^2_{ML}(\mu) = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2.
\]
where σ̂²_ML = \(\sum_{i=1}^n (x_i-\bar{x})^2/n\).
Fig. 5.3 Comparison of estimated and profile likelihood in the normal model
Both the profile and the estimated relative likelihood for the alcohol concentration data set from Sect. 1.1.7 are shown in Fig. 5.3b. Here the estimated likelihood is only slightly narrower than the profile likelihood. The difference is easier to see in Fig. 5.3a, which marks the points of the joint relative likelihood that define the profile and estimated likelihood. The horizontal line σ² = σ̂²_ML corresponds to the
Now μ̂ML (σ 2 ) = x̄ does not depend on σ 2 , so the profile likelihood Lp (σ 2 ) and the
estimated likelihood Le (σ 2 ) are identical in this case.
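The fact that the profile likelihood always lies above the estimated likelihood (both attaining their maximum at the joint MLE) can be verified numerically; a Python sketch with simulated data:

```python
import math
import random

def profile_vs_estimated(xs, mu_grid):
    """Profile and estimated relative log-likelihoods of mu in the normal model:
    the profile plugs in sigma^2_hat(mu) = sum((x - mu)^2)/n, the estimated
    version the overall MLE sigma^2_hat."""
    n = len(xs)
    xbar = sum(xs) / n
    s2_ml = sum((x - xbar) ** 2 for x in xs) / n

    def loglik(mu, s2):
        return -n / 2 * math.log(s2) - sum((x - mu) ** 2 for x in xs) / (2 * s2)

    lmax = loglik(xbar, s2_ml)  # joint maximum, shared by both versions
    lp = [loglik(mu, sum((x - mu) ** 2 for x in xs) / n) - lmax for mu in mu_grid]
    le = [loglik(mu, s2_ml) - lmax for mu in mu_grid]
    return lp, le

random.seed(3)
xs = [random.gauss(0, 1) for _ in range(30)]
grid = [i / 10 - 1 for i in range(21)]
lp, le = profile_vs_estimated(xs, grid)
print(all(a >= b - 1e-12 for a, b in zip(lp, le)))  # True: profile dominates
```

Since the profile maximises over σ² at every μ, it can never fall below the estimated curve, which is why the estimated likelihood looks (slightly) narrower.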
Proof To prove this result, recall that the profile log-likelihood l_p(θ) = l{θ, η̂_ML(θ)} is defined through the conditional MLE η̂_ML(θ) for fixed θ, which is found by solving ∂l(θ, η)/∂η = 0. Therefore, we know that
\[
\frac{\partial}{\partial\eta} l\{\theta, \hat\eta_{ML}(\theta)\} = 0 \qquad (5.7)
\]
for all θ . Hence, the derivative of this function with respect to θ is also zero:
" #
∂ ∂
l θ , η̂ML (θ ) = 0.
∂θ ∂η
Note that this derivative is an r × q matrix. Also note the order: first, derivatives of the log-likelihood with respect to η are taken, then the conditional MLE η̂_ML(θ) is plugged in for η, and afterwards this function is differentiated with respect to θ. Since we have the convention that the function to be differentiated gives column values, we use the transpose symbol in the inner derivative here, which means that we transpose the resulting row vector after taking derivatives. Using the multivariate chain rule for differentiation, cf. Appendix B.2.2, we can rewrite this matrix as
differentiation, cf. Appendix B.2.2, we can rewrite this matrix as
" #
∂ ∂ ∂
l θ , η̂ML (θ ) = h g(θ )
∂θ ∂η ∂θ
with h(γ ) = ∂
∂η
l(γ ) and
g(θ ) = θ , η̂ML (θ )
∂ ∂
= h g(θ ) · g(θ )
∂γ ∂θ
∂ ∂ Iq
= h g(θ ) h g(θ ) ∂
∂θ ∂η ∂θ η̂ ML (θ )
134 5 Likelihood Inference in Multiparameter Models
∂ ∂ ∂
= h g(θ ) + h g(θ ) · η̂ (θ )
∂θ ∂η ∂θ ML
∂ ∂ ∂ ∂ ∂
=
l θ , η̂ML (θ ) +
l θ, η̂ML (θ ) · η̂ML (θ ) .
∂θ ∂η ∂η ∂η ∂θ
r×q r×r r×q
Note that in the first two matrices of the last line, the log-likelihood is differentiated twice each, and afterwards the conditional MLE η̂_ML(θ) is plugged in. From above we know that \(\frac{\partial}{\partial\theta} h\{g(\theta)\}^\top = 0\), so we can solve the equation for \(\frac{\partial}{\partial\theta}\hat\eta_{ML}(\theta)\) and obtain
\[
\frac{\partial}{\partial\theta}\hat\eta_{ML}(\theta) = -\left[\frac{\partial}{\partial\eta}\left\{\frac{\partial}{\partial\eta} l(\theta, \hat\eta_{ML}(\theta))\right\}^\top\right]^{-1} \frac{\partial}{\partial\theta}\left\{\frac{\partial}{\partial\eta} l(\theta, \hat\eta_{ML}(\theta))\right\}^\top. \qquad (5.8)
\]
We can also use the multivariate chain rule to differentiate the profile log-likelihood lp(θ). In order to be consistent with the notation in Appendix B.2.2, we use here a 1 × q row vector for this derivative, instead of the otherwise used convention that the score vector is a column vector:

$$
\begin{aligned}
\frac{\partial l_p(\theta)}{\partial \theta^\top}
&= \frac{\partial}{\partial \theta^\top}\, l\{g(\theta)\}
 = \frac{\partial l\{g(\theta)\}}{\partial \gamma^\top} \cdot \frac{\partial g(\theta)}{\partial \theta^\top} \\
&= \frac{\partial l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta^\top}
 + \frac{\partial l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \eta^\top} \cdot \frac{\partial \hat\eta_{ML}(\theta)}{\partial \theta^\top}
 = \frac{\partial l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta^\top},
\end{aligned}
$$
using (5.7) for the last line. Differentiating this again, we obtain the curvature of the profile log-likelihood:

$$
\frac{\partial^2 l_p(\theta)}{\partial \theta\, \partial \theta^\top}
= \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta\, \partial \theta^\top}
+ \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta\, \partial \eta^\top} \cdot \frac{\partial \hat\eta_{ML}(\theta)}{\partial \theta^\top}. \qquad (5.9)
$$

Plugging (5.8) into (5.9) gives

$$
\frac{\partial^2 l_p(\theta)}{\partial \theta\, \partial \theta^\top}
= \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta\, \partial \theta^\top}
- \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \theta\, \partial \eta^\top}
  \left[ \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \eta\, \partial \eta^\top} \right]^{-1}
  \frac{\partial^2 l\{\theta, \hat\eta_{ML}(\theta)\}}{\partial \eta\, \partial \theta^\top}.
$$
5.3 Profile Likelihood 135
Note that this holds for all θ. However, we are mainly interested in the negative curvature at the MLE θ̂ML, i.e.

$$
I_p(\hat\theta_{ML}) = - \frac{\partial^2 l_p(\hat\theta_{ML})}{\partial \theta\, \partial \theta^\top}
= I_{11} - I_{12} (I_{22})^{-1} I_{21},
$$

where I11, I12, I21 and I22 denote the blocks of the observed Fisher information matrix I(θ̂ML, η̂ML) corresponding to θ and η. Application of the formula for the inversion of block matrices (cf. Appendix B.1.3) gives

$$
\left(I^{11}\right)^{-1} = I_{11} - I_{12} (I_{22})^{-1} I_{21}, \qquad (5.10)
$$

where I^{11} denotes the corresponding block of the inverse observed Fisher information matrix. Note that with (5.10) and the symmetry of the Fisher information matrix we have

$$
\left(I^{11}\right)^{-1} = I_{11} - \underbrace{I_{12} (I_{22})^{-1} I_{12}^\top}_{\ge 0},
$$

so

$$
\left(I^{11}\right)^{-1} \le I_{11},
$$
where this inequality denotes that the difference of the two matrices is positive semi-definite (cf. Appendix B.1.2). In particular, this implies that the diagonal elements of the left-hand side matrix are smaller than (or equal to) the corresponding diagonal elements of the right-hand side matrix. Now the Fisher information of the estimated likelihood at its maximum is I11. This inequality therefore shows that the observed Fisher information of the profile likelihood, (I^{11})^{-1}, is smaller than or equal to the observed Fisher information of the estimated likelihood, I11. This once again illustrates that the estimated likelihood ignores the uncertainty in the estimation of the nuisance parameter η. If I12 = I21⊤ = 0, then θ and η are orthogonal. With Eq. (5.10), we then have

$$
\left(I^{11}\right)^{-1} = I_{11},
$$

i.e. the observed Fisher information of profile and estimated likelihood are identical.
As we have seen in Example 5.3, this is the case in the normal model.
With the above result, we can approximate the relative profile log-likelihood l̃p(θi) of a scalar parameter θi by a quadratic function:

$$
\tilde l_p(\theta_i) \approx - \frac{1}{2} \left(I^{ii}\right)^{-1} (\theta_i - \hat\theta_i)^2,
$$

where I^{ii} denotes the ith diagonal entry of I(θ̂ML)^{-1}. This result corresponds to our definition of Wald confidence intervals in Sect. 5.2.
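This identity can be checked numerically on a toy quadratic log-likelihood, for which all approximations are exact. The following Python sketch uses a made-up 2 × 2 information matrix; the curvature of the profile log-likelihood is computed by a central second difference and compared with I11 − I12 (I22)^{-1} I21 and with (I^{11})^{-1}:

```python
import numpy as np

# made-up observed Fisher information of a two-parameter model with a
# quadratic log-likelihood, so the identities above hold exactly
I = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def loglik(theta, eta):
    v = np.array([theta, eta])
    return -0.5 * v @ I @ v

def profile_loglik(theta):
    eta_hat = -I[1, 0] / I[1, 1] * theta     # conditional MLE of the nuisance
    return loglik(theta, eta_hat)

# negative curvature of the profile log-likelihood at the MLE (here 0),
# via a central second difference
h = 1e-3
Ip = -(profile_loglik(h) - 2*profile_loglik(0.0) + profile_loglik(-h)) / h**2

block = I[0, 0] - I[0, 1] / I[1, 1] * I[1, 0]   # I_11 - I_12 I_22^{-1} I_21
inv11 = 1 / np.linalg.inv(I)[0, 0]              # (I^{11})^{-1}
```

All three numbers agree (here 2.5), illustrating Eq. (5.10).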
The following result considers the special case of a difference between two pa-
rameters coming from independent data sets.
136 5 Likelihood Inference in Multiparameter Models
Result 5.2 (Standard error of a difference) Suppose that the data can be split in two independent parts (denoted by 0 and 1) and the corresponding likelihood functions are parametrised by α and β, respectively. The standard error of δ̂ML = β̂ML − α̂ML then has the form

$$
\operatorname{se}(\hat\delta_{ML}) = \sqrt{\operatorname{se}(\hat\alpha_{ML})^2 + \operatorname{se}(\hat\beta_{ML})^2},
$$

which is due to the independence of the two data sets.

Proof Now β = α + δ, so the joint log-likelihood of α and δ is

$$
l(\alpha, \delta) = l_0(\alpha) + l_1(\alpha + \delta),
$$

where l0 and l1 denote the log-likelihoods of the two parts. The observed Fisher information matrix of (α, δ) therefore has entries I11 = I0(α̂ML) + I1(β̂ML) and I12 = I21 = I22 = I1(β̂ML), where I0 and I1 denote the observed Fisher information of the two parts. With the formula for the inverse of a 2 × 2 matrix we obtain

$$
\frac{1}{I_p(\hat\delta_{ML})} = I^{22}
= \frac{I_{11}}{I_{11} I_{22} - I_{12} I_{21}}
= \frac{I_{11}}{I_{11} I_{22} - I_{22} I_{22}}
= \frac{I_{11}}{(I_{11} - I_{22})\, I_{22}}
= \frac{1}{I_0(\hat\alpha_{ML})} + \frac{1}{I_1(\hat\beta_{ML})},
$$
which gives the result. See Exercise 3 for an alternative derivation of this formula.
$$
\theta = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}
$$

or the log odds ratio ψ = log(θ) is often used. The equality π1 = π2, which is often used as a null hypothesis, corresponds to H0: θ = 1 and H0: ψ = 0, respectively. Now

$$
\psi = \log(\theta) = \log\left(\frac{\pi_1}{1 - \pi_1}\right) - \log\left(\frac{\pi_2}{1 - \pi_2}\right),
$$

so with π̂i = xi/ni, i = 1, 2, and invariance of the MLE, we immediately obtain

$$
\hat\psi_{ML} = \log\left(\frac{x_1}{n_1 - x_1}\right) - \log\left(\frac{x_2}{n_2 - x_2}\right)
= \log\left\{\frac{x_1 / (n_1 - x_1)}{x_2 / (n_2 - x_2)}\right\}
$$

as the MLE of ψ.
For illustration, consider Table 3.1, summarising the result from the clinical study in Table 1.1 labelled as "Tervila". Here we obtain

$$
\hat\psi_{ML} = \log\left\{\frac{6/102}{2/101}\right\} \approx 1.089.
$$
Now define the log odds φi = log{πi/(1 − πi)} for group i = 1, 2. We know from Example 2.11 that the observed Fisher information of the MLE φ̂i of φi is

$$
I_i(\hat\phi_i) = \frac{x_i (n_i - x_i)}{n_i}.
$$
Now reparametrise the model in terms of ψ and the nuisance parameter η = log{π2/(1 − π2)}, the log odds in group 2. Note that we could have also chosen a different nuisance parameter, and for example let η be the log odds in group 1 instead. Invariance of the likelihood ensures that the profile likelihood of ψ would remain unchanged. The reparametrisation implies

$$
\pi_1 = \frac{\exp(\eta + \psi)}{1 + \exp(\eta + \psi)} \qquad\text{and}\qquad \pi_2 = \frac{\exp(\eta)}{1 + \exp(\eta)},
$$

and we obtain the joint likelihood

$$
\begin{aligned}
L(\psi, \eta) &= \exp(\eta + \psi)^{x_1} \left\{1 + \exp(\eta + \psi)\right\}^{-n_1} \cdot \exp(\eta)^{x_2} \left\{1 + \exp(\eta)\right\}^{-n_2} \\
&= \exp\{\eta (x_1 + x_2)\} \exp(\psi x_1) \left\{1 + \exp(\eta + \psi)\right\}^{-n_1} \left\{1 + \exp(\eta)\right\}^{-n_2},
\end{aligned}
$$

from which we can directly calculate the profile likelihood of ψ.
Fig. 5.4 Log-likelihood functions for the odds ratio ψ in the binomial model with the data from
Table 3.1
Calculation of the profile likelihood Lp (ψ) = maxη L(ψ, η) can be done numer-
ically, and we obtain the 95 % profile likelihood confidence interval [−0.41, 3.02].
Compared with the Wald confidence interval, the profile likelihood confidence in-
terval is slightly shifted to the right but still covers the null hypothesis value ψ = 0,
compare Fig. 5.4b.
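The profile likelihood computation for this example can be sketched in Python (the analyses in the book are done in R). The counts x1 = 6, n1 − x1 = 102, x2 = 2, n2 − x2 = 101 are taken from the "Tervila" example above:

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq

x1, n1, x2, n2 = 6, 108, 2, 103   # counts implied by 6/102 and 2/101 above

def loglik(psi, eta):
    # joint log-likelihood of the log odds ratio psi and nuisance eta
    return (eta*(x1 + x2) + psi*x1
            - n1*np.log1p(np.exp(eta + psi))
            - n2*np.log1p(np.exp(eta)))

def lp(psi):
    # profile log-likelihood: maximise over the nuisance parameter eta
    res = minimize_scalar(lambda e: -loglik(psi, e),
                          bounds=(-10, 10), method="bounded")
    return -res.fun

psi_hat = np.log((x1/(n1 - x1)) / (x2/(n2 - x2)))   # MLE of psi
cut = lp(psi_hat) - 3.841/2                         # chi2_{0.95}(1)/2 drop
lower = brentq(lambda p: lp(p) - cut, psi_hat - 5, psi_hat)
upper = brentq(lambda p: lp(p) - cut, psi_hat, psi_hat + 5)
```

With these data the routine reproduces ψ̂ML ≈ 1.089 and a 95 % profile likelihood confidence interval close to [−0.41, 3.02].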
We obtain the MLEs of the two parameters of the Weibull distribution Wb(μ, α) as μ̂ML = 3078.966 and α̂ML = 0.976 with observed Fisher information

$$
I(\hat\mu_{ML}, \hat\alpha_{ML}) = \begin{pmatrix} 4.72 \cdot 10^{-6} & 6.52 \cdot 10^{-3} \\ 6.52 \cdot 10^{-3} & 7.61 \cdot 10^{1} \end{pmatrix},
$$
$$
\hat\alpha_{ML} = \frac{\hat\mu_{ML}}{\hat\phi_{ML}} = 0.968 \qquad\text{and}\qquad \hat\beta_{ML} = \frac{1}{\hat\phi_{ML}} = 3.1 \cdot 10^{-4}.
$$
In Example 5.13 we will apply the multivariate delta method to compute the stan-
dard errors of α̂ML and β̂ML without directly maximising the log-likelihood of α
and β.
Example 5.10 (Screening for colon cancer) A more general model than the binomial model from Example 2.12 is based on the beta-binomial distribution, which allows for heterogeneity of the success probability π across individuals. Assume that the numbers of positive test results X1:n, n = 196, are a random sample from a beta-binomial BeB(N, α, β) distribution (cf. Appendix A.5.1). The probability of k positive test results among N = 6 tests is therefore

$$
\Pr(X_i = k) = \binom{N}{k} \frac{B(\alpha + k, \beta + N - k)}{B(\alpha, \beta)}
$$

for k = 0, …, N and α, β > 0, where B(x, y) denotes the beta function (cf. Appendix B.2.1).

As before, we need to truncate this distribution using Eq. (2.6) to obtain the log-likelihood

$$
l(\alpha, \beta) = \sum_{k=1}^{N} Z_k \log\left\{\frac{B(\alpha + k, \beta + N - k)}{B(\alpha, \beta)}\right\}
- n \log\left\{1 - \frac{B(\alpha, \beta + N)}{B(\alpha, \beta)}\right\},
$$

where Z_k denotes the number of individuals with k positive test results.
However, because the parameters α and β are difficult to interpret, the reparametrisation

$$
\mu = \frac{\alpha}{\alpha + \beta} \qquad\text{and}\qquad \rho = \frac{1}{\alpha + \beta + 1}
$$

is often used instead. Here Nμ is the mean of Xi, and ρ is the correlation between two binary observations from the same individual. This can be easily seen from the definition of the beta-binomial distribution as the marginal distribution of Xi when Xi | π ∼ Bin(N, π) and π ∼ Be(α, β). Therefore, Xi is a sum of dependent binary random variables Xij ∈ {0, 1}, j = 1, …, N:

$$
X_i = X_{i1} + \cdots + X_{iN}.
$$
The law of iterated expectations (cf. Appendix A.3.4) can be used to calculate the covariance and correlation between Xij and Xik for j ≠ k (cf. Appendix A.3.6). Note that the conditional independence of Xij and Xik given π is exploited:

$$
\begin{aligned}
\operatorname{E}(X_{ij}) &= \operatorname{E}\{\operatorname{E}(X_{ij} \mid \pi)\} = \operatorname{E}(\pi) = \frac{\alpha}{\alpha + \beta}, \\
\operatorname{Var}(X_{ij}) &= \operatorname{E}(X_{ij}^2) - \operatorname{E}(X_{ij})^2 = \operatorname{E}(X_{ij}) - \operatorname{E}(X_{ij})^2
= \operatorname{E}(X_{ij})\{1 - \operatorname{E}(X_{ij})\} = \frac{\alpha\beta}{(\alpha + \beta)^2}, \\
\operatorname{E}(X_{ij} X_{ik}) &= \operatorname{E}\{\operatorname{E}(X_{ij} X_{ik} \mid \pi)\}
= \operatorname{E}\{\operatorname{E}(X_{ij} \mid \pi)\operatorname{E}(X_{ik} \mid \pi)\} = \operatorname{E}(\pi^2) \\
&= \operatorname{Var}(\pi) + \operatorname{E}(\pi)^2
= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \left(\frac{\alpha}{\alpha + \beta}\right)^2, \\
\operatorname{Cov}(X_{ij}, X_{ik}) &= \operatorname{E}(X_{ij} X_{ik}) - \operatorname{E}(X_{ij})\operatorname{E}(X_{ik})
= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \quad\text{and} \\
\rho = \operatorname{Corr}(X_{ij}, X_{ik}) &= \frac{\operatorname{Cov}(X_{ij}, X_{ik})}{\sqrt{\operatorname{Var}(X_{ij})\operatorname{Var}(X_{ik})}}
= \frac{1}{\alpha + \beta + 1}.
\end{aligned}
$$
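The interpretation of ρ = 1/(α + β + 1) as the correlation of two binary observations from the same individual can be checked by Monte Carlo simulation; the values α = 2 and β = 3 below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N, M = 2.0, 3.0, 6, 200_000      # illustrative parameters, M individuals

pi = rng.beta(a, b, size=M)            # individual success probabilities
X = rng.binomial(1, pi[:, None], size=(M, 2))   # two binary obs per individual

rho_hat = np.corrcoef(X[:, 0], X[:, 1])[0, 1]   # empirical correlation
rho_theory = 1 / (a + b + 1)                    # = 1/6 here
```

The empirical correlation matches 1/(α + β + 1) up to Monte Carlo error.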
Of particular interest is the false negative fraction

$$
\xi = \Pr(X_i = 0) = \frac{B(\alpha, \beta + N)}{B(\alpha, \beta)},
$$

i.e. the probability that an individual has no positive test results.
As in Chap. 4, we now consider the score vector and the Fisher information matrix as random, i.e. both the score vector S(θ; X) and the Fisher information matrix I(θ; X) are functions of the data X. We define the expected Fisher information matrix, in short the expected Fisher information, J(θ) as

$$
J(\theta) = \operatorname{E}\{I(\theta; X)\}.
$$
5.4 Frequentist Properties of the Multiparameter Likelihood 143
Fig. 5.6 Comparison of (relative) profile and estimated likelihood of the false negative fraction ξ
in the beta-binomial model
Applying the Sherman–Morrison formula (cf. Appendix B.1.4), we can easily compute the inverse of the expected Fisher information.
In complete analogy to the case of a scalar parameter, we now define three ap-
proximate pivots: the score, the Wald and the likelihood ratio statistic. The important
properties of these statistics are sketched in the following.
As in Sect. 4.1.1, one can show that under the Fisher regularity conditions (suitably generalised from Definition 4.1 to the multiparameter case) the expectation of each element of the score vector is zero, and the covariance matrix of the score vector is equal to the expected Fisher information matrix:

$$
\operatorname{E}\{S(\theta; X)\} = 0, \qquad \operatorname{Cov}\{S(\theta; X)\} = J(\theta).
$$
A direct consequence is

$$
J_{1:n}(\theta)^{-\frac{1}{2}}\, S(\theta; X_{1:n}) \overset{a}{\sim} N_p(0, I_p),
$$

where Ip denotes the p × p identity matrix, and A^{−1/2} is the Cholesky square root of A^{−1} (cf. Appendix B.1.2), i.e.

$$
A^{-\frac{1}{2}} \left(A^{-\frac{1}{2}}\right)^\top = A^{-1}.
$$
We can replace, as in Sect. 4.1.3, the expected Fisher information matrix J1:n(θ) by the observed Fisher information matrix I(θ; X1:n). We can also replace the true value θ by the MLE θ̂ML = θ̂ML(X1:n), which gives

$$
I(\hat\theta_{ML}; X_{1:n})^{\frac{1}{2}}\, (\hat\theta_{ML} - \theta) \overset{a}{\sim} N_p(0, I_p). \qquad (5.13)
$$
A direct consequence of (5.14) is that the asymptotic distribution of the ith component of θ̂ML is

$$
\hat\theta_i \overset{a}{\sim} N(\theta_i, I^{ii}),
$$

where I^{ii} denotes the ith diagonal element of the inverse observed Fisher information:

$$
I^{ii} = \left[I(\hat\theta_{ML})^{-1}\right]_{ii}.
$$
In addition to Result 5.1, this property provides an alternative justification of our
definition of Wald confidence intervals given in Sect. 5.2.
There is also a multidimensional extension of the Cramér–Rao inequality from Sect. 4.2.1, which ensures that θ̂i is asymptotically efficient, i.e. has the smallest asymptotic variance among all asymptotically unbiased estimators.
Here D(θ) denotes the q × p Jacobian of g(θ). The square roots of the diagonal elements of the q × q matrix D(θ̂ML) I(θ̂ML)^{-1} D(θ̂ML)⊤ are the standard errors of the q components of g(θ̂ML).
Example 5.13 (Analysis of survival times) We revisit the gamma model from Example 5.9 for the PBC survival times shown in Table 1.4, see also Example 2.3. The unknown parameter vector is θ = (μ, φ)⊤, and we are interested in the transformation

$$
g(\theta) = (\alpha, \beta)^\top = \left(\frac{\mu}{\phi}, \frac{1}{\phi}\right)^\top.
$$

Now

$$
D(\theta) = \begin{pmatrix} \dfrac{1}{\phi} & -\dfrac{\mu}{\phi^2} \\[6pt] 0 & -\dfrac{1}{\phi^2} \end{pmatrix},
$$
and D(θ̂ML) I(θ̂ML)^{-1} D(θ̂ML)⊤ is a consistent estimate of the asymptotic covariance matrix of g(θ̂ML). The square roots of the diagonal elements are the corresponding standard errors se(α̂ML) = 0.1593 and se(β̂ML) = 8.91 · 10^{-5}. Thus, the 95 % Wald confidence interval for α has limits α̂ML ± 1.96 · se(α̂ML), i.e. 0.968 ± 1.96 · 0.1593 ≈ [0.656, 1.280].
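The multivariate delta method of Example 5.13 can be sketched numerically. The inverse observed Fisher information below is made up for illustration (the actual PBC values are not reproduced here); a Monte Carlo propagation of normal draws through g(θ) = (μ/φ, 1/φ)⊤ serves as a check:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical MLE and inverse observed Fisher information for theta = (mu, phi)
theta_hat = np.array([3079.0, 3181.0])     # illustrative values only
inv_info = np.array([[900.0,  30.0],
                     [ 30.0, 400.0]])      # made-up covariance of the MLE

# Jacobian D(theta) of g(theta) = (mu/phi, 1/phi)
mu, phi = theta_hat
D = np.array([[1/phi, -mu/phi**2],
              [0.0,   -1/phi**2]])

cov_delta = D @ inv_info @ D.T             # delta-method covariance of (alpha, beta)

# Monte Carlo check: propagate normal draws of theta through g
draws = rng.multivariate_normal(theta_hat, inv_info, size=200_000)
g = np.column_stack([draws[:, 0]/draws[:, 1], 1/draws[:, 1]])
cov_mc = np.cov(g, rowvar=False)
```

The delta-method covariance agrees closely with the simulated one, because the relative uncertainty of θ̂ is small here.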
The likelihood ratio statistic can be defined in the multivariate case with parameter vector θ just as for a scalar parameter θ:

$$
W = 2 \log\left\{\frac{L(\hat\theta_{ML})}{L(\theta)}\right\} = -2\, \tilde l(\theta).
$$

Combining Eqs. (5.13) and (5.4) shows that the number of degrees of freedom of the asymptotic χ² distribution is equal to the dimension p of θ:

$$
W \overset{a}{\sim} \chi^2(p).
$$
In analogy to (4.17) and (4.18), respectively, we can now construct γ · 100 % likelihood confidence regions for a multidimensional parameter θ:

$$
\left\{\theta : \tilde l(\theta) \ge -\frac{1}{2}\chi^2_\gamma(p)\right\}
= \left\{\theta : \tilde L(\theta) \ge \exp\left(-\frac{1}{2}\chi^2_\gamma(p)\right)\right\}. \qquad (5.15)
$$
Example 5.14 (Analysis of survival times) Figure 5.7 shows a contour plot of the relative likelihood for the parameter vector θ = (α, μ)⊤ in a Weibull model for the survival times, compare Example 2.3. The 95 % confidence region is defined as all values of θ with relative likelihood larger than 0.05 and is shaded in the figure.
5.5 The Generalised Likelihood Ratio Statistic 147
The general result (5.15) illustrates that a frequentist calibration of the relative likelihood depends on the dimension of the unknown parameter vector. Table 5.1 lists the corresponding thresholds exp{−½ χ²γ(p)} for various values of the confidence level γ and the dimension p of the parameter vector.
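The thresholds in Table 5.1 are easily recomputed, e.g. in Python with SciPy (the book's computations use R):

```python
import numpy as np
from scipy.stats import chi2

# thresholds exp{-(1/2) chi2_gamma(p)} calibrating the relative likelihood
gammas = [0.90, 0.95, 0.99]
thresholds = {g: {p: float(np.exp(-0.5 * chi2.ppf(g, df=p))) for p in range(1, 5)}
              for g in gammas}
```

For γ = 0.95 and p = 1 this gives the familiar threshold ≈ 0.147; the thresholds shrink as γ or p grows.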
Let θ denote the q-dimensional parameter vector of interest, and η the r-dimensional nuisance parameter vector, so the total parameter space has dimension p = q + r. The joint likelihood L(θ, η) can be used to derive the profile likelihood of θ:

$$
L_p(\theta) = \max_{\eta} L(\theta, \eta).
$$

We will sketch in the following that the profile likelihood Lp(θ) can be treated as an ordinary likelihood. In particular, we will show in Result 5.4 that

$$
W = -2\, \tilde l_p(\theta) = -2 \log \tilde L_p(\theta)
= 2 \log\left\{\frac{L_p(\hat\theta_{ML})}{L_p(\theta)}\right\} \overset{a}{\sim} \chi^2(q),
$$
where W is called the generalised likelihood ratio statistic. Here θ denotes the true parameter of interest and θ̂ML its MLE with respect to Lp(θ). This is identical to the first q components of the MLE with respect to the joint likelihood L(θ, η), i.e.

$$
L_p(\hat\theta_{ML}) = \max_{\theta}\left\{\max_{\eta} L(\theta, \eta)\right\}
= \max_{\theta, \eta} L(\theta, \eta) = L(\hat\theta_{ML}, \hat\eta_{ML}).
$$
The particular form of the generalised likelihood ratio statistic W can be viewed in the context of a significance test of the null hypothesis H0: θ = θ0. Since maximisation under H0 fixes θ at θ0, we have max_{H0} L(θ, η) = max_η L(θ0, η) = Lp(θ0), so we can write W as

$$
W = 2 \log\left\{\frac{\max L(\theta, \eta)}{\max_{H_0} L(\theta, \eta)}\right\}
= 2 \log\left\{\frac{L_p(\hat\theta_{ML})}{L_p(\theta_0)}\right\}.
$$
Example 5.15 (Two-sample t test) Let X1:n1 denote a random sample from an N(μ1, σ²) distribution and X(n1+1):(n1+n2) a random sample from an N(μ2, σ²) distribution with μ1, μ2 and σ² all unknown. Assume that the two random samples are independent of each other. Consider the null hypothesis H0: μ = μ1 = μ2. Using the reparametrisation μ2 = μ1 + c, the null hypothesis can be rewritten as H0: c = 0, which shows that H0 fixes one parameter of the unrestricted model at a particular value (c = 0). Therefore, we are in the framework described above.

The likelihood is

$$
L(\mu_1, \mu_2, \sigma^2)
= (2\pi)^{-\frac{n}{2}} \left(\sigma^2\right)^{-\frac{n}{2}}
\exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^{n_1}(x_i - \mu_1)^2 + \sum_{i=n_1+1}^{n}(x_i - \mu_2)^2\right\}\right].
$$
Under H0, the MLEs are

$$
\hat\sigma_0^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2,
$$

where n = n1 + n2, and

$$
\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar x.
$$
In the unrestricted model, the MLEs are

$$
\hat\mu_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i, \qquad
\hat\mu_2 = \frac{1}{n_2}\sum_{i=n_1+1}^{n} x_i \qquad\text{and}\qquad
\hat\sigma^2 = \frac{1}{n}\left\{\sum_{i=1}^{n_1}(x_i - \hat\mu_1)^2 + \sum_{i=n_1+1}^{n}(x_i - \hat\mu_2)^2\right\},
$$

and we obtain

$$
\max L(\mu_1, \mu_2, \sigma^2) = L(\hat\mu_1, \hat\mu_2, \hat\sigma^2)
= (2\pi)^{-\frac{n}{2}} \left(\hat\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{n}{2}\right).
$$
The generalised likelihood ratio statistic is therefore

$$
W = 2 \log\left\{\frac{\max L(\mu_1, \mu_2, \sigma^2)}{\max_{H_0} L(\mu_1, \mu_2, \sigma^2)}\right\}
= 2 \log\left\{\left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{-\frac{n}{2}}\right\}
= n \log\left(\frac{\hat\sigma_0^2}{\hat\sigma^2}\right).
$$
Now

$$
\begin{aligned}
n\hat\sigma_0^2 &= \sum_{i=1}^{n}(x_i - \hat\mu)^2 \\
&= \sum_{i=1}^{n_1}(x_i - \hat\mu_1 + \hat\mu_1 - \hat\mu)^2 + \sum_{i=n_1+1}^{n}(x_i - \hat\mu_2 + \hat\mu_2 - \hat\mu)^2 \\
&= \sum_{i=1}^{n_1}(x_i - \hat\mu_1)^2 + n_1(\hat\mu_1 - \hat\mu)^2 + \sum_{i=n_1+1}^{n}(x_i - \hat\mu_2)^2 + n_2(\hat\mu_2 - \hat\mu)^2 \\
&= n\hat\sigma^2 + \frac{n_1 n_2}{n}(\hat\mu_1 - \hat\mu_2)^2,
\end{aligned}
$$
where the last line follows from μ̂ = (n1 μ̂1 + n2 μ̂2)/n and the combination of quadratic forms in Appendix B.1.5. So we can rewrite W as

$$
W = n \log\left\{1 + \frac{(\hat\mu_1 - \hat\mu_2)^2}{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) n\hat\sigma^2}\right\}
= n \log\left(1 + \frac{1}{n-2}\, t^2\right),
$$

where

$$
t = \frac{\hat\mu_1 - \hat\mu_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{n}{n-2}\,\hat\sigma^2}}
$$
is the test statistic of the two-sample t test. Note that W is a monotone function of t² and hence of |t|. Under H0, the exact distribution of t is Student's t distribution with n1 + n2 − 2 degrees of freedom, i.e. t ∼ t(n1 + n2 − 2). This also induces an exact distribution for W under H0.
Asymptotically (n → ∞), the t distribution converges to the standard normal distribution, so t² asymptotically has a χ²(1) distribution. Moreover, for large n, we have

$$
W = n \log\left(1 + \frac{t^2}{n-2}\right)
\approx \log\left\{\left(1 + \frac{t^2}{n}\right)^{n}\right\}
\approx \log\left\{\exp(t^2)\right\} = t^2,
$$

which illustrates that W and t² have the same asymptotic distribution. We would also have obtained the asymptotic χ²(1) distribution of W from a more general result, which will be described below.
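The exact relation W = n log{1 + t²/(n − 2)} can be verified on simulated data; the sample sizes and means below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 40, 60
x = rng.normal(0.3, 1.0, size=n1)      # sample 1
y = rng.normal(0.0, 1.0, size=n2)      # sample 2
n = n1 + n2

mu1, mu2 = x.mean(), y.mean()
mu0 = (n1*mu1 + n2*mu2) / n            # pooled mean under H0
s2  = (((x - mu1)**2).sum() + ((y - mu2)**2).sum()) / n   # unrestricted ML variance
s20 = (((x - mu0)**2).sum() + ((y - mu0)**2).sum()) / n   # restricted ML variance

W = n * np.log(s20 / s2)               # generalised likelihood ratio statistic
t = (mu1 - mu2) / np.sqrt((1/n1 + 1/n2) * n * s2 / (n - 2))  # two-sample t statistic
```

Since the restricted variance estimate is never smaller than the unrestricted one, W is always non-negative, and it is an exact monotone transformation of t².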
Result 5.4 (Distribution of the generalised likelihood ratio statistic) Under regularity conditions and assuming that H0: θ = θ0 holds, we have that, as n → ∞,

$$
W = 2 \log\left\{\frac{\max L(\theta, \eta)}{\max_{H_0} L(\theta, \eta)}\right\} \xrightarrow{D} \chi^2(q).
$$

Proof To prove Result 5.4, let (θ̂ML, η̂ML) denote the unrestricted MLE and η̂0 = η̂ML(θ0) the restricted MLE of η under the null hypothesis H0: θ = θ0. Under H0, the true parameter is denoted as (θ0, η)⊤. We know that, under H0,

$$
\begin{pmatrix} \hat\theta_{ML} \\ \hat\eta_{ML} \end{pmatrix} \overset{a}{\sim}
N_p\left(\begin{pmatrix} \theta_0 \\ \eta \end{pmatrix}, I(\hat\theta_{ML}, \hat\eta_{ML})^{-1}\right), \qquad (5.16)
$$

where we partition the inverse observed Fisher information matrix as in Result 5.1:

$$
I(\hat\theta_{ML}, \hat\eta_{ML})^{-1} = \begin{pmatrix} I^{11} & I^{12} \\ I^{21} & I^{22} \end{pmatrix}.
$$
From Sect. 5.3 we have the quadratic approximation

$$
\tilde l_p(\theta) \approx -\frac{1}{2}(\theta - \hat\theta_{ML})^\top \left(I^{11}\right)^{-1} (\theta - \hat\theta_{ML})
$$

for the relative profile log-likelihood. The generalised likelihood ratio statistic W = −2 l̃p(θ0) can therefore be approximated as

$$
W \approx (\hat\theta_{ML} - \theta_0)^\top \left(I^{11}\right)^{-1} (\hat\theta_{ML} - \theta_0).
$$

But we have from (5.16) that the marginal distribution of the MLE for θ is θ̂ML ∼a Nq(θ0, I^{11}). Hence, we finally obtain

$$
W \overset{a}{\sim} \chi^2(q).
$$
Example 5.17 (Goodness-of-fit) We will now use the generalised likelihood ratio statistic to derive a well-known goodness-of-fit statistic. Consider the MN blood group data from Sect. 1.1.4 and suppose that we want to investigate if there is evidence that the underlying population is not in Hardy–Weinberg equilibrium. The restricted model now assumes that the population is in Hardy–Weinberg equilibrium. From Example 2.7 we know that υ̂ML ≈ 0.570, with log-likelihood value

$$
l(\hat\upsilon_{ML}) = x_1 \log\left(\hat\upsilon_{ML}^2\right)
+ x_2 \log\left\{2\hat\upsilon_{ML}(1 - \hat\upsilon_{ML})\right\}
+ x_3 \log\left\{(1 - \hat\upsilon_{ML})^2\right\} = -754.17.
$$
In the unrestricted multinomial model, the MLE is π̂ML = (x1/n, x2/n, x3/n)⊤ with log-likelihood value

$$
l(\hat\pi_{ML}) = \sum_{i=1}^{3} x_i \log(\hat\pi_i) = -753.19.
$$

The generalised likelihood ratio statistic is therefore W = 2{l(π̂ML) − l(υ̂ML)} = 2 · (−753.19 + 754.17) = 1.96.
In this example we have derived a special case of Wilks' G² statistic. The observed frequencies x1, …, xk are compared with the fitted ("expected") frequencies ei in a restricted model with, say, r parameters. The G² statistic is then

$$
G^2 = 2 \sum_{i=1}^{k} x_i \log\left(\frac{x_i}{e_i}\right),
$$

using the convention that 0 log(0) = 0. This is also known as the deviance and is an alternative to the χ²-statistic

$$
\chi^2 = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i}.
$$
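Both statistics can be computed for the Hardy–Weinberg example above. The MN counts are not repeated in this section; the vector x = (233, 385, 129) used below is an assumption, chosen because it reproduces the quoted log-likelihood values −754.17 and −753.19:

```python
import numpy as np

# assumed MN blood group counts (MM, MN, NN); see lead-in note
x = np.array([233, 385, 129])
n = x.sum()

v = (2*x[0] + x[1]) / (2*n)                     # MLE of upsilon under HWE
p0 = np.array([v**2, 2*v*(1 - v), (1 - v)**2])  # fitted HWE probabilities
e = n * p0                                      # expected frequencies

G2   = 2 * np.sum(x * np.log(x / e))            # deviance (Wilks' G^2)
chi2 = np.sum((x - e)**2 / e)                   # Pearson chi^2 statistic
```

Both G² and χ² come out close to 1.96 here, and G² coincides exactly with the generalised likelihood ratio statistic W of Example 5.17.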
Example 5.18 (Screening for colon cancer) We now apply the χ 2 and G2 statistic
to the data on colon cancer screening from Sect. 1.1.5 to investigate the plausi-
bility of the underlying binomial and beta-binomial models. The computations are
straightforward, so we briefly discuss only the derivation of the degrees of free-
dom. The data are given in k = 6 categories, so the saturated multinomial model
has five degrees of freedom. The truncated binomial model has one parameter, and
the truncated beta-binomial model has two. The corresponding goodness-of-fit tests
therefore have 5 − 1 = 4 and 5 − 2 = 3 degrees of freedom, respectively.
There is strong evidence against the truncated binomial model from Exam-
ple 2.12 (χ 2 = 332.8, G2 = 185.1 at four degrees of freedom, with both P -values
<0.0001). This confirms our initial concerns at the end of Example 2.12. The more
flexible truncated beta-binomial model from Example 5.10 gives a much better fit
with χ 2 = 2.12 and G2 = 2.19 at three degrees of freedom (P -values 0.55 and 0.53,
respectively).
5.6 Conditional Likelihood 153
Example 5.19 (Poisson model) Suppose we want to compare disease counts x and y in two areas with known expected numbers of cases ex and ey. The common statistical model is to assume that the corresponding random variables X and Y are independent Poisson with means ex λx and ey λy, respectively. The parameter of interest is then the rate ratio θ = λx/λy, and we can consider λy as a nuisance parameter if we write ex λx as ex θ λy. Intuitively, the sum Z = X + Y carries only little information about the rate ratio, so we consider the conditional distribution of X | Z = z, which is known to be a binomial Bin(z, π) distribution (cf. Appendix A.5.1) with "number of trials" equal to z = x + y and "success probability" equal to

$$
\pi = \frac{e_x \theta \lambda_y}{e_x \theta \lambda_y + e_y \lambda_y} = \frac{e_x \theta}{e_x \theta + e_y},
$$

which does not depend on the nuisance parameter λy. The corresponding odds π/(1 − π) are (ex θ)/ey, so the conditional log-likelihood of θ has the same form as the log-likelihood of the odds in a binomial model, which we derived in Example 2.6:

$$
l_c(\theta) = x \log\left(\frac{e_x \theta}{e_y}\right) - (x + y) \log\left(1 + \frac{e_x \theta}{e_y}\right). \qquad (5.17)
$$
We can easily derive the MLE of θ by equating the empirical odds x/y with π/(1 − π) = (ex θ)/ey:

$$
\hat\theta_{ML} = \frac{x/e_x}{y/e_y}.
$$
Conditional likelihood confidence intervals can be obtained using the conditional
log-likelihood (5.17).
Now we consider profile likelihood as an alternative approach to eliminate the nuisance parameter λy. First, we need to derive the MLE of λy for fixed θ. The joint log-likelihood kernel of θ and λy given the data x and y is

$$
l(\theta, \lambda_y) = x \log(\theta) + (x + y)\log(\lambda_y) - \lambda_y(e_x \theta + e_y),
$$

which is maximised for fixed θ by λ̂y(θ) = (x + y)/(ex θ + ey). The profile log-likelihood is therefore

$$
\begin{aligned}
l_p(\theta) &= l\{\theta, \hat\lambda_y(\theta)\}
= x \log(\theta) + (x + y)\log\left(\frac{x + y}{e_x \theta + e_y}\right) - (x + y) \\
&= x \log(\theta) - (x + y)\log(e_x \theta + e_y) + (x + y)\log(x + y) - (x + y).
\end{aligned}
$$

The last two terms do not depend on θ and so can be ignored. We are also at liberty to add the constant

$$
x \log(e_x) + y \log(e_y),
$$

which gives, after some rearrangement,

$$
l_p(\theta) = x \log\left(\frac{e_x \theta}{e_y}\right) - (x + y)\log\left(1 + \frac{e_x \theta}{e_y}\right),
$$

which is identical to the conditional log-likelihood (5.17).
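That the profile log-likelihood agrees with the conditional log-likelihood (5.17) up to an additive constant can be checked numerically; the counts and expected cases below are hypothetical:

```python
import numpy as np

# hypothetical disease counts and expected numbers of cases
x, y = 17, 28
ex, ey = 27.7, 18.6

theta = np.linspace(0.1, 3.0, 50)

# conditional log-likelihood (5.17)
lc = x*np.log(ex*theta/ey) - (x + y)*np.log1p(ex*theta/ey)
# profile log-likelihood before adding the constant x*log(ex) + y*log(ey)
lp = x*np.log(theta) - (x + y)*np.log(ex*theta + ey)

diff = lc - lp   # should be constant in theta
```

The difference is constant over the whole grid and equals exactly x log(ex) + y log(ey).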
Conditional likelihood can also be used for the comparison of two binomial sam-
ples.
5.7 Exercises
1. In a cohort study on the incidence of ischaemic heart disease (IHD), 337 male
probands were enrolled. Each man was categorised as non-exposed (group 1,
daily energy consumption ≥2750 kcal) or exposed (group 2, daily energy con-
sumption <2750 kcal) to summarise his average level of physical activity. For
each group, the number of person years (Y1 = 2768.9 and Y2 = 1857.5), and
the number of IHD cases (D1 = 17 and D2 = 28) was registered thereafter.
We assume that Di | λi ∼ind Po(λi Yi), i = 1, 2, where λi > 0 is the group-specific incidence rate.
(a) For each group, derive the MLE λ̂i and a corresponding 95 % Wald
confidence interval for log(λi ) with subsequent back-transformation to
the λi -scale.
(b) In order to analyse whether λ1 = λ2, we reparametrise the model with λ = λ1 and θ = λ2/λ1. Show that the joint log-likelihood kernel of λ and θ has the following form:

$$
l(\lambda, \theta) = D \log(\lambda) + D_2 \log(\theta) - \lambda(Y_1 + \theta Y_2),
$$

where D = D1 + D2.
(c) Compute the MLE (λ̂, θ̂ ), the observed Fisher information matrix
I (λ̂, θ̂ ) and derive expressions for both profile log-likelihood functions
lp (λ) = l{λ, θ̂ (λ)} and lp (θ ) = l{λ̂(θ ), θ }.
(d) Plot both functions lp(λ) and lp(θ), and also create a contour plot of the relative log-likelihood l̃(λ, θ) using the R-function contour. Add the points {λ, θ̂(λ)} and {λ̂(θ), θ} to the contour plot, analogously to Fig. 5.3a.
(e) Compute a 95 % Wald confidence interval for log(θ ) based on the pro-
file log-likelihood. What can you say about the P -value for the null
hypothesis λ1 = λ2 ?
2. Let Z1:n be a random sample from a bivariate normal distribution N2(μ, Σ) with mean vector μ = 0 and covariance matrix

$$
\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.
$$

$$
\frac{1 - \mathrm{NPV}}{\mathrm{NPV}} = \mathrm{LR}^{-} \cdot \omega.
$$
Consider the test statistics

$$
D_n = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}}
\qquad\text{and}\qquad
W_n = 2 \sum_{i=1}^{k} n_i \log\left(\frac{n_i}{n p_{i0}}\right).
$$

Show that Wn − Dn →P 0 as n → ∞.
10. In a psychological experiment the forgetfulness of probands is tested with
the recognition of syllable triples. The proband has ten seconds to memorise
the triple, afterwards it is covered. After a waiting time of t seconds, it is
checked whether the proband remembers the triple. For each waiting time t,
the experiment is repeated n times.
Let y = (y1, …, ym) be the relative frequencies of correctly remembered syllable triples for the waiting times of t = 1, …, m seconds. The power model now assumes that

$$
\pi(t) = \theta_1 t^{-\theta_2}
$$

is the probability to correctly remember a syllable triple after the waiting time t ≥ 1.
(a) Derive an expression for the log-likelihood l(θ) where θ = (θ1, θ2).
(b) Create a contour plot of the log-likelihood in the parameter range [0.8, 1] × [0.3, 0.6] with n = 100 and
11. Often the exponential model is used instead of the power model (described in Exercise 10), assuming

$$
\pi(t) = \theta_1 \exp(-\theta_2 t).
$$
x = (225, 171, 198, 189, 189, 135, 162, 136, 117, 162).
where σ̂0² and σ̂k² are the ML variance estimates for the kth group with and without the H0 assumption, respectively. What is the approximate distribution of W under H0?
(d) Consider the special case of K = 2 groups of equal size n1 = n2 = n. Show that W is large when the ratio

$$
R = \frac{\sum_{i=1}^{n}\left(x_i^{(1)} - \bar x_1\right)^2}{\sum_{i=1}^{n}\left(x_i^{(2)} - \bar x_2\right)^2}
$$

is either very large or very small.
Consider the test statistic

$$
T = \sum_{k=1}^{K}(n_k - 1)\log\left(s_0^2 / s_k^2\right)
$$

and the correction factor

$$
C = 1 + \frac{1}{3(K - 1)}\left\{\sum_{k=1}^{K}\frac{1}{n_k - 1} - \frac{1}{\sum_{k=1}^{K}(n_k - 1)}\right\}.
$$

Write two R-functions that take the vectors of the group sizes (n1, …, nK) and the sample variances (s1², …, sK²) and return the values of T and T/C, respectively.
                                 History of control
                                 Exposed    Unexposed
History of case    Exposed          a           b
                   Unexposed        c           d
(c) Derive the standard error se(ψ̂ML) of ψ̂ML and the Wald test statistic for H0: ψ = 1. Compare your result with the Wald test statistic for H0: log(ψ) = 0.
(d) Finally, compute the score test statistic for H0 : ψ = 1 based on the
expected Fisher information of the conditional likelihood.
17. Let Yi ∼ind Bin(1, πi), i = 1, …, n, be the binary response variables in a logistic regression model, where the probabilities πi = F(xi⊤β) are parametrised via the inverse logit function

$$
F(x) = \frac{\exp(x)}{1 + \exp(x)}.
$$

The log-likelihood, score vector and Fisher information matrix are

$$
l(\beta) = \sum_{i=1}^{n} y_i \log(\pi_i) + (1 - y_i)\log(1 - \pi_i),
$$

$$
S(\beta) = \sum_{i=1}^{n} (y_i - \pi_i)\, x_i = X^\top (y - \pi) \qquad\text{and}
$$

$$
I(\beta) = \sum_{i=1}^{n} \pi_i (1 - \pi_i)\, x_i x_i^\top = X^\top W X,
$$

where X denotes the design matrix and W = diag{π1(1 − π1), …, πn(1 − πn)}.
until the new estimate β (t+1) and the old one β (t) are almost identical
and β̂ ML = β (t+1) . Start with β (1) = 0.
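The iteration described here can be sketched as follows (in Python rather than R, and with simulated data instead of the amlxray data; the function name logistic_mle is ours):

```python
import numpy as np

def logistic_mle(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration for the logistic regression MLE (a sketch)."""
    beta = np.zeros(X.shape[1])                # start with beta^(1) = 0
    for _ in range(max_iter):
        pi = 1/(1 + np.exp(-X @ beta))         # fitted probabilities
        score = X.T @ (y - pi)                 # S(beta) = X'(y - pi)
        W = pi*(1 - pi)
        info = X.T @ (X * W[:, None])          # I(beta) = X'WX
        step = np.linalg.solve(info, score)    # Newton step I(beta)^{-1} S(beta)
        beta = beta + step
        if np.max(np.abs(step)) < tol:         # stop when updates are negligible
            break
    return beta

# simulated data with intercept and one covariate
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1/(1 + np.exp(-X @ beta_true)))
beta_hat = logistic_mle(X, y)
```

At convergence the score vector vanishes, which is the defining property of the MLE.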
(e) Consider the data set amlxray on the connection between X-ray usage
and acute myeloid leukaemia in childhood, which is available in the R-
package faraway. Here yi = 1 if the disease was diagnosed for the ith
child and yi = 0 otherwise (disease). We include an intercept in the
regression model, i.e. we set x1 = 1. We want to analyse the association
(h) Compute two P -values quantifying the evidence against H0 , one based
on the squared Wald statistic (5.18) and the other based on the gener-
alised likelihood ratio statistic.
(i) Since the data is actually from a matched case-control study, where
pairs of one case and one control have been matched (by age, race and
county of residence; the variable ID denotes the matched pairs), it is
more appropriate to apply conditional logistic regression. Compute the
corresponding MLEs and 95 % confidence intervals with the R-function
clogit from the package survival and compare the results.
18. In clinical dose-finding studies, the relationship between the dose d ≥ 0 of the medication and the average response μ(d) in a population is to be inferred. Considering a continuously measured response y, a simple model for the individual measurements assumes yij ∼ind N(μ(dij; θ), σ²), i = 1, …, K, j = 1, …, ni. Here ni is the number of patients in the ith dose group with dose di (the placebo group has d = 0). The Emax model has the functional form

$$
\mu(d; \theta) = \theta_1 + \theta_2 \frac{d}{d + \theta_3}.
$$
(a) Plot the function μ(d; θ) for different choices of the parameters
θ1 , θ2 , θ3 > 0. Give reasons for the interpretation of θ1 as the mean
placebo response, θ2 as the maximum treatment effect, and θ3 as the
dose giving 50 % of the maximum treatment effect.
(b) Compute the expected Fisher information for the parameter vector θ. Using this result, implement an R function that calculates the approximate covariance matrix of the MLE θ̂ML for a given set of doses d1, …, dK, a total sample size N = Σ_{i=1}^K ni, allocation weights wi = ni/N and given error variance σ².
(c) Assume that θ1 = 0, θ2 = 1, θ3 = 0.5 and σ 2 = 1. Calculate the approx-
imate covariance matrix, first, for K = 5 doses 0, 1, 2, 3, 4 and, second,
for doses 0, 0.5, 1, 2, 4, both times with balanced allocations wi = 1/5
and total sample size N = 100. Compare the approximate standard de-
viations of the MLEs of the parameters between the two designs, also
compare the determinants of the two calculated matrices.
(d) Using the second design, determine the required total sample size N
so that the standard deviation for estimation of θ2 is 0.35 (so that the
half-length of a 95 % confidence interval is about 0.7).
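A sketch of such a function, in Python rather than R, under the standard expected-information formula for normal nonlinear regression, J(θ) = (N/σ²) Σ_i wi g(di) g(di)⊤ with gradient g(d) = ∂μ(d; θ)/∂θ = (1, d/(d + θ3), −θ2 d/(d + θ3)²)⊤:

```python
import numpy as np

def emax_cov(doses, weights, theta, sigma2, N):
    """Approximate covariance of the Emax MLE from the expected Fisher
    information (a sketch under the normal nonlinear regression model)."""
    t1, t2, t3 = theta
    J = np.zeros((3, 3))
    for d, w in zip(doses, weights):
        # gradient of mu(d; theta) with respect to (theta1, theta2, theta3)
        g = np.array([1.0, d/(d + t3), -t2*d/(d + t3)**2])
        J += N * w / sigma2 * np.outer(g, g)
    return np.linalg.inv(J)

theta = (0.0, 1.0, 0.5)   # parameter values from part (c)
cov1 = emax_cov([0, 1, 2, 3, 4],   [0.2]*5, theta, 1.0, 100)
cov2 = emax_cov([0, 0.5, 1, 2, 4], [0.2]*5, theta, 1.0, 100)
```

Since J(θ) is linear in N, doubling the total sample size halves the approximate covariance matrix, which is what part (d) exploits.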
5.8 References
Contents
6.1 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.2 Posterior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.3 Choice of the Prior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.3.1 Conjugate Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . 179
6.3.2 Improper Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3.3 Jeffreys’ Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.4 Properties of Bayesian Point and Interval Estimates . . . . . . . . . . . . . . . . . 192
6.4.1 Loss Function and Bayes Estimates . . . . . . . . . . . . . . . . . . . . . 192
6.4.2 Compatible and Invariant Bayes Estimates . . . . . . . . . . . . . . . . . . 195
6.5 Bayesian Inference in Multiparameter Models . . . . . . . . . . . . . . . . . . . . 196
6.5.1 Conjugate Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . 196
6.5.2 Jeffreys’ and Reference Prior Distributions . . . . . . . . . . . . . . . . . 198
6.5.3 Elimination of Nuisance Parameters . . . . . . . . . . . . . . . . . . . . . 200
6.5.4 Compatibility of Uni- and Multivariate Point Estimates . . . . . . . . . . . 204
6.6 Some Results from Bayesian Asymptotics . . . . . . . . . . . . . . . . . . . . . . 204
6.6.1 Discrete Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.6.2 Continuous Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.7 Empirical Bayes Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Frequentist inference treats the data X as random. Point and interval estimates of
the parameter θ are viewed as functions of the data X, in order to obtain frequentist
properties of the estimates. The parameter θ is unknown but treated as fixed, not as
random.
Things are just the other way round in the Bayesian approach to statistical infer-
ence, named after Thomas Bayes (1702–1761). The unknown parameter θ is now
a random variable with appropriate prior distribution f (θ ). The posterior distri-
bution f (θ | x), computed with Bayes’ theorem, summarises the information about
θ after observing the data X = x. Note that Bayesian inference conditions on the
observation X = x, in contrast to frequentist inference.
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 167
https://doi.org/10.1007/978-3-662-60792-3_6,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
168 6 Bayesian Inference
For the time being, we assume in the following that θ is a scalar. For simplicity,
we will always speak of a (prior or posterior) density function, even when it is in
fact a probability mass function of a discrete parameter.
We start this chapter with a brief review of Bayes’ theorem for simple events, as
outlined in Appendix A.1.2.
For any two events A and B with 0 < Pr(A) < 1 and Pr(B) > 0, we have

$$
\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)}. \qquad (6.1)
$$
Example 6.1 (Diagnostic test) Suppose a simple diagnostic test for a specific disease, which produces either a positive or a negative test result, is known to have 90 % sensitivity. This means that if the person being tested has the disease (D+), the probability of a positive test result (T+) is 90 %: Pr(T+ | D+) = 0.9. Now assume that the test also has 90 % specificity, and write D− if the person being tested is free of the disease. Similarly, let T− denote a negative test result. The 90 % specificity then translates to Pr(T− | D−) = 0.9.
We can use Bayes’ formula to compute the conditional probability of disease
given a positive test result:
Pr(T + | D+) Pr(D+)
Pr(D+ | T +) = . (6.2)
Pr(T +)
Thus, we can calculate Pr(T +) if we know the sensitivity Pr(T + | D+), the preva-
lence Pr(D+), and Pr(T + | D−) = 1 − Pr(T − | D−), i.e. 1 minus the specificity.
In the above example,
and hence
0.9 · 0.01
Pr(D+ | T +) = ≈ 0.083,
0.108
i.e. the positive predictive value is 8.3 %. So if the test was positive, then the disease
risk increases from 1.0 % to 8.3 %.
It is up to the reader to write down an equivalent formula to (6.2) to compute
the negative predictive value Pr(D− | T −), which turns out to be approximately
99.89 %. So if the test was negative, then the disease risk decreases from 1.0 % to
Pr(D+ | T −) = 1 − Pr(D− | T −) = 1 − 0.9989 = 0.0011, i.e. 0.11 %. The disease
risk changes in the expected direction depending on the test result.
Example 6.1 exemplifies the process of Bayesian updating: We update the prior
risk Pr(D+) in the light of a positive test result T + to obtain the posterior risk
Pr(D+ | T +), the conditional probability of disease given a positive test result, also
known as the positive predictive value.
However, Eq. (6.2), with the denominator Pr(T +) replaced by (6.3), is somewhat
complex and not particularly intuitive. A simpler version of Bayes’ theorem can be
obtained if we switch from probabilities to odds:

Pr(D+ | T +)/Pr(D− | T +) = {Pr(T + | D+)/Pr(T + | D−)} · Pr(D+)/Pr(D−). (6.4)

Here Pr(D+)/Pr(D−) are the prior odds, Pr(D+ | T +)/Pr(D− | T +) are the posterior odds, and Pr(T + | D+)/Pr(T + | D−) is the so-called likelihood ratio for a
positive test result. Bayesian updating is thus just one simple mathematical opera-
tion: Multiply the prior odds with the likelihood ratio to obtain the posterior odds.
This also explains the formula for the positive predictive value given in Exercise 5
of Chap. 5 on page 156.
Example 6.2 (Diagnostic test) In Example 6.1 the prior odds are 1/99 and the like-
lihood ratio (for a positive test result) is 0.9/0.1 = 9. The posterior odds (given a
positive test result) are therefore 9 · 1/99 = 1/11 ≈ 0.09, which of course corre-
sponds to the posterior probability of 8.3 % calculated above. So the prior odds of 1
to 99 change to posterior odds of 1 to 11 in the light of a positive test result. If the test
result was negative, then the prior odds need to be multiplied with the likelihood ra-
tio for a negative test result, which is Pr(T − | D+)/Pr(T − | D−) = 0.1/0.9 = 1/9.
This leads to posterior odds of 1/9 · 1/99 = 1/891. We leave it up to the reader
to check that this corresponds to a negative predictive value of approximately
99.89 %.
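The odds-scale updating in Example 6.2 is easy to check numerically. The following sketch reproduces the positive predictive value from the prior odds and the likelihood ratio, using the values stated in the example:

```r
## Bayesian updating on the odds scale: posterior odds equal
## prior odds times the likelihood ratio of a positive test
prior_odds <- 0.01 / 0.99        # prevalence of 1 %
lr_pos <- 0.9 / (1 - 0.9)        # sensitivity / (1 - specificity)
post_odds <- lr_pos * prior_odds # = 1/11
post_odds / (1 + post_odds)      # positive predictive value, approx. 0.083
```

The final line converts the posterior odds back to a probability, recovering the 8.3 % computed with the full Bayes formula.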
Formula (6.2) is specified for a positive test result T + and a diseased person
D+ but is equally valid if we replace a positive test result T + by a negative one,
i.e. T −, or a diseased person D+ by a non-diseased one, i.e. D−, or both. Thus,
a more general description of Bayes’ theorem is given by
f (D = d | T = t) = f (T = t | D = d) · f (D = d) / f (T = t), (6.5)
where D and T are binary random variables taking values d and t, respectively.
In the diagnostic setting outlined above, d and t can be either + (“positive”) or −
(“negative”). Note that we have switched notation from Pr(·) to f (·) to emphasise
that (6.5) relates to general probability mass functions of the random variables D
and T , and not only to probabilities of the events D+ and T +, say.
A more compact version of (6.5) is
f (d | t) = f (t | d) · f (d) / f (t), (6.6)
cf. Eq. (A.8). Note that this equation also holds if the random variables D or T have
more than two possible values. The formula is also correct if it involves continuous
random variables, in which case f (·) denotes a density function, see Eq. (A.10).
This simple formula forms the basis for the whole rest of this chapter. Applied to
an unknown parameter θ with prior density f (θ ) and observed data x, Bayes’ theorem reads

f (θ | x) = f (x | θ )f (θ ) / ∫ f (x | θ )f (θ ) dθ. (6.7)

The term f (x | θ ) in (6.7) is simply the likelihood function L(θ ) previously denoted by f (x; θ ). Since θ is now random, we explicitly condition on a specific value
θ and write L(θ ) = f (x | θ ). The denominator in (6.7) can be written as
∫ f (x | θ )f (θ ) dθ = ∫ f (x, θ ) dθ = f (x),
which emphasises that it does not depend on θ . Note that Eq. (6.7) is then equivalent
to Eq. (6.6). The quantity f (x) is known as the marginal likelihood and is important
for Bayesian model selection, cf. Sect. 7.2.
The density of the posterior distribution is therefore proportional to the product of
the likelihood and the density of the prior distribution, with proportionality constant
1/f (x). This is usually denoted as
f (θ | x) ∝ f (x | θ )f (θ ) or f (θ | x) ∝ L(θ )f (θ ), (6.8)
where “∝” stands for “is proportional to” and implies that 1/∫ L(θ )f (θ ) dθ is the
normalising constant to ensure that ∫ f (θ | x) dθ = 1, such that f (θ | x) is a valid
density function.
Posterior distribution
The density function of the posterior distribution can be obtained through
multiplication of the likelihood function and the density function of the prior
distribution with subsequent normalisation.
Definition 6.2 (Bayesian point estimates) The posterior mean E(θ | x) is the expec-
tation of the posterior distribution:
E(θ | x) = ∫ θf (θ | x) dθ.
The posterior median Med(θ | x) is the median of the posterior distribution, i.e. any
number a that satisfies
∫_{−∞}^{a} f (θ | x) dθ = 0.5 and ∫_{a}^{∞} f (θ | x) dθ = 0.5. (6.9)

The posterior mode Mod(θ | x) is the value at which the posterior density f (θ | x) attains its maximum.
Implicitly we often assume that the posterior mean is finite, in which case it is
also unique. However, the posterior mode and the posterior median are not neces-
sarily unique. Indeed, a posterior distribution can have several modes and is then
called multimodal. As an example where the median is not unique, consider a con-
tinuous real-valued parameter, where the (posterior) density is zero in a central in-
terval. If 50 % of the probability mass are distributed to the left and right of this
centre, then any value a in the central interval fulfils (6.9), so is a posterior me-
dian.
Bayesian interval estimates are also derived from the posterior distribution. To
distinguish them from confidence intervals, which have a different interpretation,
they are called credible intervals. Here is the definition for a scalar parameter θ :
Definition 6.3 (Credible interval) For fixed γ ∈ (0, 1), a γ · 100 % credible interval
is defined through two real numbers tl and tu that fulfil

∫_{tl}^{tu} f (θ | x) dθ = γ. (6.10)
The quantity γ is called the credibility level of the credible interval [tl , tu ].
Table 6.1 Summary characteristics of the posterior distribution of π under a uniform prior distri-
bution in the binomial model. The 95 % credible interval based on the 2.5 % and 97.5 % quantiles
should be compared with the 95 % confidence intervals from Table 4.2
Observation Posterior characteristics
n x Mean Mode Median 2.5 % quantile 97.5 % quantile
10 0 0.08 0.00 0.06 0.00 0.28
10 1 0.17 0.10 0.15 0.02 0.41
10 5 0.50 0.50 0.50 0.23 0.77
100 0 0.01 0.00 0.01 0.00 0.04
100 10 0.11 0.10 0.11 0.06 0.17
100 50 0.50 0.50 0.50 0.40 0.60
estimates seem to come fairly close to the MLEs x/n for increasing sample size n.
This empirical result will be discussed in more detail in Sect. 6.6.2.
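The entries of Table 6.1 can be reproduced directly from the Be(x + 1, n − x + 1) posterior that arises under the uniform prior; a minimal sketch:

```r
## posterior summaries for a binomial proportion under a uniform prior:
## the posterior is Be(x + 1, n - x + 1)
post_summary <- function(x, n) {
  a <- x + 1
  b <- n - x + 1
  c(mean   = a / (a + b),
    mode   = x / n,            # (a - 1)/(a + b - 2)
    median = qbeta(0.5, a, b),
    q025   = qbeta(0.025, a, b),
    q975   = qbeta(0.975, a, b))
}
round(post_summary(x = 10, n = 100), 2)
```

For n = 100 and x = 10 this reproduces the corresponding row of Table 6.1 up to rounding.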
Example 6.4 (Diagnostic test) We now revisit the diagnostic testing problem dis-
cussed in Example 6.1 under the more realistic scenario that the disease prevalence
π = Pr(D+) is not known but only estimated from a prevalence study. We will
describe how the Bayesian approach can be used to assess the uncertainty of the
positive and negative predictive values in this case.
For example, suppose that there was x = 1 diseased individual in a prevalence
study with n = 100 participants. The Be(0.5, 5) prior distribution for π expresses
our initial prior beliefs that the prevalence is below 33 % with approximately 95 %
probability and below 5 % with approximately 50 % probability. The posterior dis-
tribution of π is then Be(α̃, β̃) with updated parameters α̃ = 0.5 + x = 1.5 and
β̃ = 5 + n − x = 104. The posterior mean, median and mode are 0.014, 0.011 and
0.005, respectively. The difference between the point estimates indicates that the
posterior distribution is quite skewed. The equal-tailed 95 % credible interval is
[0.001, 0.044].
In the posterior odds (6.4), the prevalence π enters through the prior odds π/(1 − π):

ω = Pr(D+ | T +)/Pr(D− | T +) = {Pr(T + | D+)/Pr(T + | D−)} · π/(1 − π). (6.11)

Therefore, the posterior odds ω are easily obtained from π and we can transform ω
into the corresponding probability Pr(D+ | T +) as follows:

θ = Pr(D+ | T +) = ω/(1 + ω) = (1 + ω^{−1})^{−1}. (6.12)
Suppose we want to replace the fixed prevalence π = 1/100 with the posterior dis-
tribution π ∼ Be(α̃, β̃) to appreciate the uncertainty involved in the estimation of π .
Note that (6.12) is a monotone function of π . Therefore, any quantile of the poste-
rior distribution of π can be transformed to the θ -scale. Based on the corresponding
quantiles of π stated above, we easily obtain the posterior median 0.09 and the 95 %
equal-tailed credible interval [0.01, 0.29] for θ .
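Because (6.12) is monotone in π, these summaries can be sketched in R by transforming the Be(1.5, 104) posterior quantiles of the prevalence directly, with sensitivity and specificity fixed at 90 % as in Example 6.1:

```r
## transform posterior quantiles of the prevalence to the scale of the
## positive predictive value (a monotone transformation)
sens <- 0.9
spec <- 0.9
ppv <- function(p) sens * p / (sens * p + (1 - spec) * (1 - p))
q_prev <- qbeta(c(0.025, 0.5, 0.975), 1.5, 104)
round(ppv(q_prev), 2)  # 2.5 %, 50 % and 97.5 % posterior quantiles of theta
```

The middle value reproduces the posterior median 0.09, and the outer values the equal-tailed 95 % credible interval [0.01, 0.29].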
It is also possible to analytically compute the implied distribution of θ =
Pr(D+ | T +). In Exercise 2 it is shown that the density of θ is
f (θ ) = c · θ^{−2} · fF (c(1/θ − 1); 2β̃, 2α̃), (6.13)

where

c = α̃ Pr(T + | D+) / [β̃{1 − Pr(T − | D−)}],
and fF (x; 2β̃, 2α̃) is the density of the F distribution with parameters 2β̃ and 2α̃, cf.
Appendix A.5.2. Note that c does not depend on θ . One can analogously proceed
for the negative predictive value Pr(D− | T −), see Exercise 2. The posterior mean
and the posterior mode of θ can now be computed using numerical integration and
optimisation, respectively. The posterior mean turns out to be 0.109 while the pos-
terior mode is 0.049. The positive predictive value 8.3 % obtained in Example 6.1
with fixed prevalence π = 0.01 lies in between.
For the negative predictive value, we obtain the posterior mean 0.9984 and the
posterior mode 0.9995. This is to be compared with the negative predictive value
99.89 % for fixed prevalence π , cf. again Example 6.1. We further obtain the poste-
rior median 0.9987 and the 95 % equal-tailed credible interval [0.9949, 0.9999].
It is also simple to generate a random sample from Pr(D+ | T +) and
Pr(D− | T −) using samples from the beta distribution of π . The following R-code
illustrates this.
## prior parameters
a <- 0.5
b <- 5
## data
x <- 1
n <- 100
## posterior parameters
apost <- a + x
bpost <- b + n - x
## sample size
nsample <- 10^5
## prevalence values sampled from the Be(1.5, 104) posterior
prev <- rbeta(nsample, apost, bpost)
## set values for sensitivity and specificity
sens <- 0.9
spec <- 0.9
## compute resulting positive and negative predictive value samples
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv <- spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))
Histograms of these samples are shown in Fig. 6.2. The histograms are in per-
fect agreement with the true density functions. Such samples form the basis of
Monte Carlo techniques used for Bayesian inference, as discussed in more detail
in Chap. 8.
In Example 6.3 the posterior mode of the binomial proportion equals the MLE
when a uniform prior is used. This result holds in general:
Fig. 6.2 Densities and sample histograms of the positive and negative predictive value if the
prevalence follows the Be(1.5, 104) distribution, and the sensitivity and specificity are both 90 %
Result 6.1 Under a uniform prior, the posterior mode equals the MLE.
Proof This can easily be seen from the fact that if the prior on θ is uniform, the
density f (θ ) does not depend on θ . From Eq. (6.8) it follows that
f (θ | x) ∝ L(θ ).
Hence, the mode of the posterior distribution must equal the value that maximises
the likelihood function, which is the MLE.
There are infinitely many γ · 100 % credible intervals for fixed γ , at least if the
parameter θ is continuous. In the previous example with γ = 0.95 we cut off 2.5 %
probability mass of each tail of the posterior distribution to obtain an equal-tailed
credible interval. Alternatively, we could, for example, cut off 5 % probability mass
on the left side of the posterior distribution and fix the right limit of the credible
interval at the upper bound of the parameter space (which was 1 in Example 6.4).
Under some regularity conditions, the following additional requirement ensures the
uniqueness of credible intervals.
Definition 6.4 (Highest posterior density interval) Let γ ∈ (0, 1) be a fixed credi-
bility level. A γ · 100 % credible interval I = [tl , tu ] is called a highest posterior
density interval (HPD interval) if
f (θ | x) ≥ f (θ̃ | x) for all θ ∈ I and all θ̃ ∈ Θ \ I.
An HPD interval contains all those parameter values that have higher posterior
density than all parameter values not contained in the interval. Under some addi-
tional regularity conditions, the posterior density ordinates at the limits of an HPD
interval are equal, i.e.

f (tl | x) = f (tu | x). (6.14)
There are counterexamples where this property of HPD intervals is not fulfilled. For
example, suppose that the posterior distribution is exponential. The lower limit tl of
an HPD interval at any level will then be zero because the density of the exponential
distribution is monotonically decreasing. However, as we will see in Sect. 6.6.2,
the posterior distribution of a continuous parameter is asymptotically normal, i.e.
unimodal, if the sample size increases. Equation (6.14) will therefore typically hold
for larger sample sizes.
Numerical computation of HPD intervals typically requires iterative algorithms.
Figure 6.3 compares an HPD interval with an equal-tailed credible interval.
For discrete parameter spaces Θ, the definition of credible intervals has to be
modified since it might not be possible to find any interval with exact credibility
level γ as required by Eq. (6.10). A γ · 100 % credible interval I = [tl , tu ] for θ is
then defined through

Σ_{θ ∈ I ∩ Θ} f (θ | x) ≥ γ. (6.15)
Similarly, the posterior median Med(θ | x) will typically not be unique if the param-
eter is discrete. One can add an additional arbitrary requirement to make it unique,
for example define Med(θ | x) as the smallest possible value with at least 50 % pos-
terior probability mass below it.
Example 6.5 (Capture–recapture) Consider again the capture–recapture experiment of Sect. 1.1.3, with a discrete uniform prior for the population size N and the hypergeometric likelihood f (x | N ) already derived in Example 2.2. The support P of the posterior distribution is the intersection of the support T of the prior distribution and the parameter values allowed
in the likelihood. As max(n, M + n − x) ≥ M + n − x ≥ M, it is P = {max(n, M + n − x), . . . , Nmax}.
Since the prior probability function f (N ) does not depend on N , the posterior prob-
ability function is proportional to the likelihood:
f (N | x) = f (x | N ) / Σ_{N′ ∈ P} f (x | N′).
Figure 6.4a gives the posterior distribution for M = 26, n = 63 and x = 5. Three
different point estimates and a 95 % HPD interval are shown. Note that the posterior
mode equals the MLE N̂ML = 327 (cf. Example 2.2).
To avoid the specification of the fairly arbitrary upper limit Nmax , a useful alter-
native to the uniform prior is a geometric distribution N ∼ Geom(π), truncated to
N ≥ M, i.e.
f (N ) ∝ π(1 − π)^{N−1} for N = M, M + 1, . . . ,
cf. Fig. 6.4b. Under this prior, the point estimates and the limits of the 95 % credible
interval are slightly shifted to the left. The posterior distribution is slightly more
concentrated with a shorter range of the 95 % HPD interval.
Fig. 6.4 Capture–recapture experiment: M = 26 fish have been marked, and x = 5 of them have
been caught in a sample of n = 63. The chosen prior probability function is shown in the upper
panels. The posterior probability functions can be seen in the lower panels and are to be compared
with the likelihood function in Fig. 2.2
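The posterior under the discrete uniform prior can be computed by normalising the hypergeometric likelihood over a grid of values for N; a minimal sketch, where the upper limit Nmax = 1500 is an assumed value chosen for illustration:

```r
## capture-recapture: posterior f(N | x) proportional to the
## hypergeometric likelihood under a discrete uniform prior
M <- 26; n <- 63; x <- 5
Nmax <- 1500
N <- max(n, M + n - x):Nmax
lik <- dhyper(x, m = M, n = N - M, k = n)  # P(x marked in a sample of n)
post <- lik / sum(lik)
N[which.max(post)]  # posterior mode, equal to the MLE 327
```

As expected under the uniform prior, the posterior mode coincides with the maximum likelihood estimate N̂ML = 327.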
The family G = {all distributions} is trivially conjugate with respect to any like-
lihood function. In practice one tries to find smaller sets G that are specific to the
likelihood Lx (θ ).
Table 6.2 Summary of conjugate prior distributions for different likelihood functions
Likelihood Conjugate prior distribution Posterior distribution
X | π ∼ Bin(n, π) π ∼ Be(α, β) π | x ∼ Be(α + x, β + n − x)
X | π ∼ Geom(π) π ∼ Be(α, β) π | x ∼ Be(α + 1, β + x − 1)
X | λ ∼ Po(e · λ) λ ∼ G(α, β) λ | x ∼ G(α + x, β + e)
X | λ ∼ Exp(λ) λ ∼ G(α, β) λ | x ∼ G(α + 1, β + x)
X | μ ∼ N(μ, σ 2 known) μ ∼ N(ν, τ 2 ) see Eq. (6.16)
X | σ² ∼ N(μ known, σ²) σ² ∼ IG(α, β) σ² | x ∼ IG(α + 1/2, β + (x − μ)²/2)
Example 6.6 (Binomial model) Let X | π ∼ Bin(n, π). The family of beta distri-
butions, π ∼ Be(α, β), is conjugate with respect to L(π) since the posterior
distribution is again a beta distribution: π | x ∼ Be(α + x, β + n − x), cf. Exam-
ple 6.3.
Example 6.7 (Hardy–Weinberg equilibrium) Consider again the Hardy–Weinberg equilibrium model with allele frequency υ ∈ (0, 1) (cf. Example 2.7). It is easy to see that a beta prior distribution Be(α, β) for υ results in a beta posterior distribution.
Example 6.8 (Normal model) Let X denote a sample from a normal N(μ, σ 2 ) dis-
tribution with unknown mean μ and known variance σ 2 . The corresponding likeli-
hood function is
L(μ) ∝ exp{−(x − μ)²/(2σ²)}.
Combined with a normal prior distribution with mean ν and variance τ 2 for the
unknown mean μ, i.e.
f (μ) ∝ exp{−(μ − ν)²/(2τ²)},
the posterior density of μ is given by
f (μ | x) ∝ L(μ) · f (μ)

∝ exp[−(1/2){(x − μ)²/σ² + (μ − ν)²/τ²}]

∝ exp[−(1/2)(1/σ² + 1/τ²){μ − (1/σ² + 1/τ²)^{−1}(x/σ² + ν/τ²)}²],
see (B.5) for justification of the last rearrangement. So the posterior distribution is
also normal:
μ | x ∼ N((1/σ² + 1/τ²)^{−1}(x/σ² + ν/τ²), (1/σ² + 1/τ²)^{−1}). (6.16)
As in the binomial model, the posterior mean is a weighted average of the prior
mean ν and the data x with weights proportional to 1/τ 2 and 1/σ 2 , respectively.
Equation (6.16) simplifies if one uses precisions, i.e. inverse variances, rather
than the variances themselves. Indeed, let κ = 1/σ 2 and δ = 1/τ 2 ; then
μ | x ∼ N((κx + δν)/(κ + δ), (κ + δ)^{−1}).
Therefore, a Bayesian analysis of normal observations often uses precision parame-
ters rather than variance parameters.
The above result can be easily extended to a random sample X1:n from an
N(μ, σ 2 ) distribution with known variance σ 2 . For simplicity, we work with preci-
sions rather than variances and use the fact that x̄ is sufficient for μ, so the likelihood
function of μ is
L(μ) ∝ exp{−n(μ − x̄)²/(2σ²)},
cf. Result 2.4. Combined with the prior μ ∼ N(ν, τ 2 ), we easily obtain
μ | x1:n ∼ N((nκ x̄ + δν)/(nκ + δ), (nκ + δ)^{−1}).
The corresponding formula with variances rather than precisions reads
μ | x1:n ∼ N((n/σ² + 1/τ²)^{−1}(nx̄/σ² + ν/τ²), (n/σ² + 1/τ²)^{−1}).
Note that the posterior mean is a weighted average of the prior mean ν and the MLE
x̄, with weights proportional to δ = 1/τ 2 and nκ = n/σ 2 , respectively. A larger
sample size n thus leads to a higher weight of the MLE and to a decreasing poste-
rior variance (nκ + δ)−1 . This behaviour of the posterior distribution is intuitively
reasonable. Furthermore, we can interpret n0 = δ/κ as a prior sample size to obtain
the relative prior sample size n0 /(n0 + n), cf. Example 6.3.
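The precision-weighted update for a normal mean can be sketched as a small function; all input values below are illustrative:

```r
## posterior for a normal mean with known variance, Eq. (6.16) extended
## to n observations: a precision-weighted average of prior mean and MLE
post_normal <- function(xbar, n, sigma2, nu, tau2) {
  prec <- n / sigma2 + 1 / tau2          # posterior precision
  c(mean = (n * xbar / sigma2 + nu / tau2) / prec,
    var  = 1 / prec)
}
post_normal(xbar = 2, n = 10, sigma2 = 4, nu = 0, tau2 = 1)
```

The posterior mean lies between the prior mean ν and the sample mean x̄, and the posterior variance shrinks as n grows.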
Example 6.10 (Comparison of proportions) We know from Example 5.8 that the
MLE ψ̂ML = log{(a · d)/(b · c)} of the log odds ratio is approximately normal with
mean equal to the true log odds ratio ψ and variance σ 2 = a −1 + b−1 + c−1 + d −1 .
For illustration, consider the data from the Tervila study from Table 1.1, sum-
marised in Table 3.1. Here we obtain ψ̂ML = 1.09 and σ 2 = 0.69. The MLE of the
odds ratio is θ̂ML = exp(ψ̂ML ) = 2.97 with 95 % Wald confidence interval [0.59, 15.07].
A Bayesian analysis selects a suitable prior that represents realistic prior beliefs
about the quantity of interest. A semi-Bayes analysis now uses the MLE ψ̂ML as
the data, instead of the underlying two binomial samples. This has the advantage
that the new likelihood depends directly on the parameter of interest ψ, with no
additional nuisance parameter. A suitable prior distribution is now placed directly
on ψ , rather than working with a multivariate prior for the success probabilities π1
and π2 of the two underlying binomial experiments. The MLE ψ̂ML is thus regarded
as the observed data, with likelihood depending on ψ, assuming that σ 2 is fixed at
its estimate.
For example, we might use a normal prior for the log odds ratio ψ with mean
zero and variance τ 2 = 1. Note that the prior mean (or median) of zero for the log
odds ratio corresponds to a prior median of one for the odds ratio, expressing prior
indifference regarding positive or negative treatment effects. This particular prior
implies that the odds ratio is a priori between 1/7 and 7 with approximately 95 %
probability. These numbers arise from the fact that exp(z0.975 ) ≈ 7 where z0.975 ≈
1.96 is the 97.5 % quantile of the standard normal distribution, the selected prior
for the log odds ratio. Using Eq. (6.16), the posterior variance of the log odds ratio
is (1/σ 2 + 1/τ 2 )−1 = 0.41 and the posterior mean is (1/σ 2 + 1/τ 2 )−1 · ψ̂ML /σ 2 =
0.65. Due to the normality of the posterior distribution, the posterior median and
mode are identical to the posterior mean.
The posterior median estimate of the odds ratio is therefore Med(θ | ψ̂ML ) =
exp(0.65) ≈ 1.91 with equal-tailed 95 % credible interval [0.55, 6.66]. Note that
both the posterior median and the upper limit of the credible interval are consid-
erably smaller than the MLE and the upper limit of the confidence interval, re-
spectively, whereas the lower limit has barely changed. It is also straightforward to
calculate the posterior mean and the posterior mode of the odds ratio θ = exp(ψ)
based on properties of the log-normal distribution, cf. Appendix A.5.2. One obtains
E(θ | ψ̂ML ) = 2.34 and Mod(θ | ψ̂ML ) = 1.27.
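The numbers in Example 6.10 can be sketched in a few lines of R; using the rounded inputs ψ̂ML = 1.09 and σ² = 0.69, the results agree with those in the text up to rounding:

```r
## semi-Bayes analysis of the log odds ratio with a N(0, 1) prior
psi_hat <- 1.09   # MLE of the log odds ratio
s2 <- 0.69        # estimated variance of the MLE
nu <- 0; tau2 <- 1
post_var  <- 1 / (1 / s2 + 1 / tau2)
post_mean <- post_var * (psi_hat / s2 + nu / tau2)
## summaries of the log-normal posterior of the odds ratio theta = exp(psi)
c(median = exp(post_mean),
  mean   = exp(post_mean + post_var / 2),
  mode   = exp(post_mean - post_var))
```

The median, mean and mode formulas are the standard log-normal identities exp(μ), exp(μ + σ²/2) and exp(μ − σ²), cf. Appendix A.5.2.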
Example 6.11 (Comparison of proportions) It is easy to see that the likelihood anal-
ysis at the beginning of Example 6.10 has a Bayesian interpretation if one uses a nor-
mal prior for the log odds ratio ψ with zero mean and very large (infinite) variance.
Indeed, if one lets the prior variance τ 2 in (6.16) go to infinity, then the posterior
distribution is simply
ψ | ψ̂ML ∼ N(ψ̂ML , σ²).
Note that the limit τ 2 → ∞ induces an improper locally uniform density function
f (ψ) ∝ 1 on the real line. Nevertheless, the posterior distribution is proper. Now all
three Bayesian point estimates of the log odds ratio ψ are equal to the MLE ψ̂ML .
Furthermore, the equal-tailed credible intervals are numerically identical to Wald
confidence intervals for any choice of the credibility/confidence level γ .
However, from a medical perspective it can be argued that this is not a realistic
prior since it places the same prior weight on odds ratios between 0.5 and 2 as on
odds ratios between 11 000 and 44 000, say. The reason is that these two intervals
have the same width log(4) on the log scale and the prior density for the log odds
ratio is constant across the whole real line. Such huge effect sizes are unrealistic and
rarely encountered in clinical or epidemiological research.
Definition 6.6 (Improper prior distribution) A prior distribution with density func-
tion f (θ ) ≥ 0 is called improper if
∫_Θ f (θ ) dθ = ∞ (continuous case) or Σ_{θ∈Θ} f (θ ) = ∞ (discrete case).
Example 6.12 (Haldane’s prior) The conjugate prior π ∼ Be(α, β) in the binomial
model X ∼ Bin(n, π) has the density

f (π) ∝ π^{α−1}(1 − π)^{β−1}
and is proper for α > 0 and β > 0. In the limiting case α = β = 0, one obtains an
improper prior distribution with density
f (π) ∝ π^{−1}(1 − π)^{−1}, known as Haldane’s prior.
In some situations it may be useful to choose a prior distribution that does not convey
much information about the parameter because of weak or missing prior knowledge.
A first naive choice is a (locally) uniform prior fθ (θ ) ∝ 1, in which case the poste-
rior is proportional to the likelihood function. Note that a locally uniform prior will
be improper if the parameter space is not bounded.
However, there are problems associated with this approach. Suppose that φ =
h(θ ) is a one-to-one differentiable transformation of θ , which has a (locally) uniform
prior with density fθ (θ ) ∝ 1. Using the change-of-variables formula (A.11), we obtain

fφ (φ) = fθ (h^{−1}(φ)) · |dh^{−1}(φ)/dφ| ∝ |dh^{−1}(φ)/dφ|.
Note that this term is not necessarily constant. Indeed, fφ (φ) will be independent
of φ only if h is linear. If h is nonlinear, the prior density fφ (φ) will depend on φ
and will thus not be (locally) uniform. However, if we had chosen a parametrisation
with φ from the start, we would have chosen a (locally) uniform prior fφ (φ) ∝ 1.
This lack of invariance under reparameterisation is an unappealing feature of the
(locally) uniform prior distribution. Note that we implicitly assume that we can
apply the change-of-variables formula to improper densities in the same way as to
proper densities.
Example 6.13 (Binomial model) Let X ∼ Bin(n, π) with a uniform prior for π ,
i.e. fπ (π) = 1 for π ∈ (0, 1). Consider now the logit transformation φ = h(π) =
log{π/(1 − π)} ∈ R. Applying the change-of-variables formula, we obtain
fφ (φ) = exp(φ)/{1 + exp(φ)}²,
i.e. the log odds φ follow a priori a standard logistic distribution (cf. Ap-
pendix A.5.2).
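This induced logistic distribution is easily checked by simulation; a sketch:

```r
## a uniform prior on pi induces a standard logistic prior on the
## log odds phi = log(pi / (1 - pi))
set.seed(1)
pi_sample  <- runif(1e5)
phi_sample <- log(pi_sample / (1 - pi_sample))
## compare empirical and theoretical distribution functions at a few points
cbind(empirical = ecdf(phi_sample)(c(-2, 0, 2)),
      logistic  = plogis(c(-2, 0, 2)))
```

The empirical distribution function of the transformed samples matches the standard logistic distribution function closely.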
On the other hand, if we select an improper locally uniform prior for φ, i.e.
fφ (φ) ∝ 1, then
fπ (π) ∝ π^{−1}(1 − π)^{−1}, i.e. Haldane’s prior from Example 6.12.
Definition 6.7 (Jeffreys’ prior) Let X be a random variable with likelihood function
f (x | θ ) where θ is an unknown scalar parameter. Jeffreys’ prior is defined as
f (θ ) ∝ √J (θ ), (6.17)
Jeffreys’ prior is proportional to the square root of the expected Fisher informa-
tion, which may give an improper prior distribution. At first sight, it is surprising
that this choice is supposed to be invariant under reparametrisation, but we will see
in the following result that this is indeed the case.
Result 6.2 (Invariance of Jeffreys’ prior) Jeffreys’ prior is invariant under a one-
to-one reparametrisation φ = h(θ ): if

fθ (θ ) ∝ √Jθ (θ ),

then also fφ (φ) ∝ √Jφ (φ).
The rationale behind Jeffreys’ prior rests on the expected Fisher information Jθ (θ ) of the unknown parameter θ . If the Fisher information does not depend on the true param-
eter θ , then the (average) information provided by the data is the same whatever
the particular value of θ is. In this case it seems reasonable to select a (possibly
improper) locally uniform distribution for θ as a default prior.
However, if Jθ (θ ) depends on θ , then it seems natural to first apply the variance-
stabilising transformation φ = h(θ ) (cf. Sect. 4.3) to remove this dependence. Then
Jφ (φ) does not depend on φ, and we therefore select a locally uniform prior for φ,
i.e. fφ (φ) ∝ 1. Through a change of variables and with Eq. (4.14) it follows that
fθ (θ ) ∝ fφ (φ) · √Jθ (θ ) ∝ √Jθ (θ ),
i.e. the argument leads directly to Jeffreys’ prior. To put it the other way round, the
derivative of a variance-stabilising transformation h(θ ) equals Jeffreys’ prior for the
original parameter θ . For example, the variance-stabilising transformation for the
mean λ of Poisson distributed data is h(λ) = √λ (cf. Example 4.14), so Jeffreys’
prior is given by

fλ (λ) ∝ d√λ/dλ = (1/2) λ^{−1/2} ∝ λ^{−1/2}.
This somewhat surprising result gives an interesting connection between likelihood
and Bayesian inference.
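For the Poisson case, the G(1/2, 0) prior combined with counts x1, . . . , xn yields a G(Σxi + 1/2, n) posterior, whose mean x̄ + 1/(2n) is close to the MLE x̄; a sketch with made-up counts:

```r
## Jeffreys' prior for a Poisson rate is G(1/2, 0); with observed counts
## x the posterior is G(sum(x) + 1/2, n), with mean xbar + 1/(2n)
x <- c(2, 0, 3, 1, 4)   # made-up counts for illustration
n <- length(x)
post_mean <- (sum(x) + 0.5) / n
c(post_mean = post_mean, mle = mean(x))
```

The small offset 1/(2n) vanishes as the sample size grows, matching the entry in Table 6.4.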
Jeffreys’ rule gives commonly accepted default priors for scalar parameters.
Quite often one obtains improper priors, which can be identified as limiting cases of
proper conjugate priors. Here are two examples.
Example 6.14 (Normal model) Let X1:n denote a random sample from a N(μ, σ 2 )
distribution with unknown mean μ and known variance σ 2 . From Example 4.3 we
know that J (μ) = n/σ 2 does not depend on μ, so Jeffreys’ prior for μ is locally uni-
form on the whole real line R, i.e. f (μ) ∝ 1, and is an improper prior distribution.
This is the limiting prior distribution if we assume the conjugate prior μ ∼ N(ν, τ 2 )
and let τ 2 → ∞. The resulting posterior distribution is (cf. Example 6.11)
μ | x1:n ∼ N(x̄, σ²/n).
The posterior mean, median and mode are therefore all equal to the MLE x̄.
Suppose now that the mean μ is known but the variance σ 2 is unknown. Then
J (σ 2 ) = n/(2σ 4 ), compare again Example 4.3, so Jeffreys’ prior is
f (σ²) ∝ 1/σ²,
which can be identified as the limiting case IG(0, 0) of the inverse gamma family (cf. Appendix A.5.2). The resulting posterior density is proportional to (σ²)^{−(n/2+1)} exp{−Σ_{i=1}^{n}(xi − μ)²/(2σ²)}, i.e. an inverse gamma distribution with parameters n/2 and Σ_{i=1}^{n}(xi − μ)²/2:

σ² | x1:n ∼ IG(n/2, Σ_{i=1}^{n}(xi − μ)²/2).
The posterior mean is

E(σ² | x1:n) = Σ_{i=1}^{n}(xi − μ)² / (n − 2),

and the posterior mode is

Mod(σ² | x1:n) = Σ_{i=1}^{n}(xi − μ)² / (n + 2).

Note that the MLE σ̂²ML = Σ_{i=1}^{n}(xi − μ)²/n (compare Example 2.9) lies between
these two estimates and will have a very similar numerical value, provided that the
sample size n is not very small.
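The ordering of the three estimates is easy to verify numerically; a sketch with simulated data:

```r
## posterior for sigma^2 (known mean) under Jeffreys' prior:
## sigma^2 | x ~ IG(n/2, sum((x - mu)^2)/2)
set.seed(42)
mu <- 0; n <- 30
x  <- rnorm(n, mean = mu, sd = 2)
ss <- sum((x - mu)^2)
c(post_mode = ss / (n + 2),  # smallest of the three
  mle       = ss / n,        # in between
  post_mean = ss / (n - 2))  # largest of the three
```

For n = 30 the three values already differ by only a few percent.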
Table 6.3 lists Jeffreys’ prior distributions for further likelihood functions. Only
in the binomial case Jeffreys’ prior, the Be(0.5, 0.5) distribution, turns out to be
proper. All other priors in Table 6.3 are improper distributions. They can be viewed
as limiting cases of the corresponding conjugate proper prior distributions, and this
is how they are listed. Bayesian point estimates using Jeffreys’ prior are often very
close or even identical to the MLE. Table 6.4 illustrates this for the posterior mean.
It is also interesting to compare Bayesian credible intervals using Jeffreys’ prior
with the corresponding frequentist confidence intervals. Quite surprisingly, it turns
out that the frequentist properties of such Bayesian procedures may be as good
or even better than those of their truly frequentist counterparts, as the following
example illustrates.
Table 6.3 Jeffreys’ prior for several likelihood functions. All the prior distributions are improper,
except for the binomial likelihood
Likelihood Jeffreys’ prior Density of Jeffreys’ prior
Bin(n, π) π ∼ Be(1/2, 1/2) f (π) ∝ {π(1 − π)}^{−1/2}
Geom(π) π ∼ Be(0, 1/2) f (π) ∝ π^{−1}(1 − π)^{−1/2}
Po(λ) λ ∼ G(1/2, 0) f (λ) ∝ λ^{−1/2}
Exp(λ) λ ∼ G(0, 0) f (λ) ∝ λ^{−1}
N(μ, σ² known) μ ∼ N(0, ∞) f (μ) ∝ 1
N(μ known, σ²) σ² ∼ IG(0, 0) f (σ²) ∝ σ^{−2}
Table 6.4 Comparison of MLEs and the posterior mean using Jeffreys’ prior
Likelihood θ̂ML Posterior mean using Jeffreys’ prior
Bin(n, π) x̄ {n/(n + 1)}(x̄ + 1/(2n))
Geom(π) 1/x̄ 1/(x̄ + 1/(2n))
Po(λ) x̄ x̄ + 1/(2n)
Exp(λ) 1/x̄ 1/x̄
N(μ, σ² known) x̄ x̄
N(μ known, σ²) (1/n) Σ_{i=1}^{n}(xi − μ)² {1/(n − 2)} Σ_{i=1}^{n}(xi − μ)²
Table 6.5 A comparison of different credible and confidence intervals for X ∼ Bin(n, π) and
π ∼ Be(1/2, 1/2). If n = 100 and x = 50, the four approaches yield nearly identical intervals
Observation Interval at 95 % level of type
n x Equal-tailed HPD Wald Likelihood
10 0 0.000 to 0.217 0.000 to 0.171 0.000 to 0.000 0.000 to 0.175
10 1 0.011 to 0.381 0.000 to 0.331 −0.086 to 0.286 0.006 to 0.372
10 5 0.224 to 0.776 0.224 to 0.776 0.190 to 0.810 0.218 to 0.782
100 0 0.000 to 0.025 0.000 to 0.019 0.000 to 0.000 0.000 to 0.019
100 10 0.053 to 0.170 0.048 to 0.164 0.041 to 0.159 0.051 to 0.169
100 50 0.403 to 0.597 0.403 to 0.597 0.402 to 0.598 0.403 to 0.597
compare Table 6.6. But how good are the frequentist properties of Bayesian credi-
ble intervals based on Jeffreys’ prior? To address this question, we have computed
the actual coverage probabilities for n = 50, just as we did in Example 4.22 for
confidence intervals. Figure 6.7 displays the actual coverage of the HPD and the
equal-tailed credible interval. They behave similarly to the likelihood confidence
interval in Fig. 4.9e with slightly lower coverage of the HPD interval as a result
of the smaller interval width. It is to be noted, though, that the coverage is actu-
ally comparable if not even better than that from the likelihood-based confidence
intervals discussed in Example 4.22.
Fig. 6.7 Actual (grey) and locally smoothed (black) coverage probabilities of 95 % credible inter-
vals based on Jeffreys’ prior for n = 50. For X = 0 or X = n, the limits of the HPD interval do not
have equal posterior density ordinates. In these two cases the HPD interval is [0, b0.95 (0.5, 50.5)]
or [b0.05 (50.5, 0.5), 1], respectively, where bα (α, β) denotes the α quantile of the Be(α, β) distri-
bution
We know from Result 6.2 that one advantage of Jeffreys’ prior is that it is in-
variant under one-to-one transformations of the original parameter θ . For example,
one might be interested in the standard deviation σ or the precision κ = 1/σ 2 of the
normal distribution rather than the variance σ 2 .
Example 6.16 (Normal model) Jeffreys’ prior for the variance σ 2 of a normal dis-
tribution N(μ, σ 2 ) is
f (σ²) ∝ 1/σ².
6.3 Choice of the Prior Distribution 191
We note that Jeffreys’ priors for the variance σ 2 , the standard deviation σ and the
precision κ of a normal distribution are all proportional to the respective recipro-
cal parameter. Another change of variables shows that the corresponding priors for
log(σ 2 ), log(σ ) and log(κ) are hence all locally uniform on the real line R.
The derivative is
  dh(ρ)/dρ = 1/(1 − ρ²),
so Jeffreys' prior for ρ is
  f(ρ) ∝ 1/(1 − ρ²),
again an improper distribution. This prior distribution gives more weight to extreme values than to values close to zero. For example, f(±0.9)/f(0) = 5.3, f(±0.99)/f(0) = 50.3 and f(±0.999)/f(0) is even 500.3, whereas this ratio would always equal unity for the uniform prior distribution.
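These ratios are easy to verify directly from the prior kernel; a quick Python check:

```python
def prior_kernel(rho):
    """Kernel of Jeffreys' prior for the correlation: f(rho) proportional to 1/(1 - rho^2)."""
    return 1.0 / (1.0 - rho ** 2)

for rho in (0.9, 0.99, 0.999):
    print(rho, round(prior_kernel(rho) / prior_kernel(0.0), 1))
# prints the ratios 5.3, 50.3 and 500.3, matching the text
```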
To simplify the notation, we denote in this section a point estimate of θ with a rather
than with θ̂ .
Definition 6.8 (Loss function) A loss function l(a, θ ) ∈ R quantifies the loss en-
countered when estimating the true parameter θ by a.
Definition 6.9 (Bayes estimate) A Bayes estimate of θ with respect to a loss func-
tion l(a, θ ) minimises the expected loss with respect to the posterior distribution
f (θ | x), i.e. it minimises
  E{ l(a, θ) | x } = ∫Θ l(a, θ) f(θ | x) dθ.
It turns out that the commonly used Bayesian point estimates can be viewed as
Bayes estimates with respect to one of the loss functions described above.
Result 6.3 The posterior mean is the Bayes estimate with respect to quadratic loss.
The posterior median is the Bayes estimate with respect to linear loss. The posterior
mode is the Bayes estimate with respect to zero–one loss, as ε → 0.
Proof We first derive the posterior mean E(θ | x) as the Bayes estimate with respect
to quadratic loss. The expected quadratic loss is
  E{ l(a, θ) | x } = ∫ l(a, θ) f(θ | x) dθ = ∫ (a − θ)² f(θ | x) dθ.
6.4 Properties of Bayesian Point and Interval Estimates 193
Its derivative with respect to a is 2 ∫ (a − θ) f(θ | x) dθ; setting this to zero, it immediately follows that a = ∫ θ f(θ | x) dθ = E(θ | x).
Consider now the expected linear loss
  E{ l(a, θ) | x } = ∫ l(a, θ) f(θ | x) dθ = ∫ |a − θ| f(θ | x) dθ
                   = ∫_{θ≤a} (a − θ) f(θ | x) dθ + ∫_{θ>a} (θ − a) f(θ | x) dθ.
The derivative with respect to a can be calculated using Leibniz’s integral rule (cf.
Appendix B.2.4):
  ∂/∂a E{ l(a, θ) | x }
    = ∂/∂a ∫_{−∞}^{a} (a − θ) f(θ | x) dθ + ∂/∂a ∫_{a}^{∞} (θ − a) f(θ | x) dθ
    = ∫_{−∞}^{a} f(θ | x) dθ − { a − (−∞) } f(−∞ | x) · 0 + (a − a) f(a | x) · 1
      − ∫_{a}^{∞} f(θ | x) dθ − (a − a) f(a | x) · 1 + (∞ − a) f(∞ | x) · 0
    = ∫_{−∞}^{a} f(θ | x) dθ − ∫_{a}^{∞} f(θ | x) dθ.
Setting this equal to zero yields the posterior median a = Med(θ | x) as the solution
for the estimate.
Finally, the expected zero–one loss is
  E{ l(a, θ) | x } = ∫ lε(a, θ) f(θ | x) dθ
                   = ∫_{−∞}^{a−ε} f(θ | x) dθ + ∫_{a+ε}^{+∞} f(θ | x) dθ
                   = 1 − ∫_{a−ε}^{a+ε} f(θ | x) dθ.
This will be minimised if the integral ∫_{a−ε}^{a+ε} f(θ | x) dθ is maximised. For small ε the integral is approximately 2ε f(a | x), which is maximised through the posterior mode a = Mod(θ | x).
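Result 6.3 can be checked numerically by discretising a posterior and minimising each expected loss by brute force. A Python sketch using the Be(11, 4) posterior from Example 6.3 (the grid resolution is an arbitrary choice):

```python
# Discretised Be(11, 4) posterior on a fine grid.
grid = [i / 1000 for i in range(1, 1000)]
kernel = [t ** 10 * (1 - t) ** 3 for t in grid]   # Be(11, 4) kernel
total = sum(kernel)
post = [k / total for k in kernel]                # normalised probabilities

def expected_loss(a, loss):
    return sum(loss(a, t) * p for t, p in zip(grid, post))

def quadratic(a, t):
    return (a - t) ** 2

def absolute(a, t):
    return abs(a - t)

# Brute-force Bayes estimates: minimise each expected loss over the grid.
quad_min = min(grid, key=lambda a: expected_loss(a, quadratic))
abs_min = min(grid, key=lambda a: expected_loss(a, absolute))
post_mean = sum(t * p for t, p in zip(grid, post))  # exact value is 11/15

print(round(quad_min, 3), round(post_mean, 3))  # both close to 0.733
print(round(abs_min, 3))                        # close to the median 0.744
```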
The question arises if credible intervals can also be optimal with respect to certain
loss functions. For simplicity, we assume again that the unknown parameter θ ∈ Θ is
a scalar with associated posterior density f (θ | x). First, we introduce the notion of
a credible region, a straightforward generalisation of a credible interval. Similarly,
a highest posterior density region (HPD region) can be defined.
Result 6.4 Let f(θ | x) denote the posterior density function of θ and let, for fixed γ ∈ (0, 1),
  A = { C : Pr(θ ∈ C | x) = γ }
denote the set of all γ · 100 % credible regions for θ. Consider now the loss function
  l(C, θ) = |C| − IC(θ),
where |C| denotes the size of the region C and IC(θ) its indicator function. Then C ∈ A is optimal with respect to l(C, θ) if and only if, for all θ1 ∈ C and θ2 ∉ C,
  f(θ1 | x) ≥ f(θ2 | x).    (6.19)
Proof To prove Result 6.4, consider some set C ∈ A with expected loss
  ∫Θ l(C, θ) f(θ | x) dθ = |C| − ∫Θ IC(θ) f(θ | x) dθ = |C| − Pr(θ ∈ C | x) = |C| − γ.
For fixed γ, the γ · 100 % credible region C with smallest size |C| will therefore minimise the expected loss. Now let C ∈ A be a region with property (6.19), and let D ∈ A be some other element in A. We need to show that |C| ≤ |D|. Let A ∪̇ B denote the disjoint union of A and B, i.e. A ∪̇ B = A ∪ B and A ∩ B = ∅. Then
  C = C ∩ Θ = C ∩ (D ∪̇ Dᶜ) = (C ∩ D) ∪̇ (C ∩ Dᶜ)
and analogously D = (C ∩ D) ∪̇ (Cᶜ ∩ D).
Since both C and D are γ · 100 % credible regions, we have
  Pr(θ ∈ C | x) = Pr(θ ∈ D | x) = γ,
and hence
  Pr(θ ∈ C ∩ Dᶜ | x) = Pr(θ ∈ Cᶜ ∩ D | x).    (6.20)
In total we obtain
  inf_{C∩Dᶜ} f(θ | x) · |C ∩ Dᶜ| ≤ ∫_{C∩Dᶜ} f(θ | x) dθ
                                 = ∫_{Cᶜ∩D} f(θ | x) dθ          with (6.20)
                                 ≤ sup_{Cᶜ∩D} f(θ | x) · |Cᶜ ∩ D|
                                 ≤ inf_{C∩Dᶜ} f(θ | x) · |Cᶜ ∩ D|   with (6.19).
Therefore, |C ∩ D c | ≤ |C c ∩ D| and hence |C| ≤ |D|. The proof in the other direc-
tion is similar.
In practice point and interval estimates are often given jointly. The question arises
if all Bayesian point and interval estimates are compatible. A minimal requirement
appears to be that the point estimate is always within the credible interval. This is
fulfilled only by some combinations. The posterior mode, for example, will always
be within any HPD interval, and the posterior median is always within any equal-
tailed credible interval. In contrast, in extreme cases the posterior mean may lie within neither the HPD nor the equal-tailed credible interval.
It is also interesting to study the behaviour of the different point and interval
estimates under a one-to-one transformation φ = h(θ ) of the parameter θ . It turns
out that the posterior mode and the posterior mean are in general not invariant, for
example, E{h(θ ) | x} = h{E(θ | x)} does not hold in general. In fact, E{h(θ ) | x} <
h{E(θ | x)} if h is strictly concave; if h is strictly convex, then the inequality sign
is in the other direction, cf. Appendix A.3.7. However, all characteristics based on
quantiles of the posterior distribution, such as the posterior median and equal-tailed
credible intervals, are invariant under (continuous) one-to-one transformations.
Example 6.18 (Inference for a proportion) The posterior median of the posterior
distribution π | x ∼ Be(11, 4) derived in Example 6.3 and shown in Fig. 6.1 is
Med(π | x) = 0.744. The posterior median of the odds ω = π/(1 − π) is therefore
0.744/(1 − 0.744) = 2.905.
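The invariance of quantiles under monotone transformations is easy to illustrate by simulation; `random.betavariate` is available in the Python standard library:

```python
import random
import statistics

# Monte Carlo check: the posterior median of the odds omega = pi/(1 - pi)
# equals the transformed posterior median of pi | x ~ Be(11, 4).
random.seed(1)
samples = [random.betavariate(11, 4) for _ in range(200_000)]

med_pi = statistics.median(samples)
med_odds = statistics.median(p / (1 - p) for p in samples)

print(round(med_pi, 3))    # close to 0.744
print(round(med_odds, 3))  # close to 0.744/(1 - 0.744), i.e. about 2.9
```

The same check fails for the posterior mean, since E{h(θ) | x} ≠ h{E(θ | x)} in general.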
Conjugate prior distributions are available also for multivariate likelihood functions.
Here are a few examples.
Example 6.19 (Normal model) Let L(μ) denote the likelihood function of an ob-
servation x from a multivariate normal distribution Np (μ, Σ). The unknown mean
vector μ is a vector of dimension p, while the covariance matrix Σ is assumed
to be known. The Np (ν, T ) distribution is conjugate to L(μ) since the posterior
distribution of μ is again p-variate normal:
  μ | x ∼ Np( (Σ⁻¹ + T⁻¹)⁻¹ (Σ⁻¹ x + T⁻¹ ν), (Σ⁻¹ + T⁻¹)⁻¹ ).
This result can easily be derived using Eq. (B.4). It can also be generalised to a
random sample X 1:n from an Np (μ, Σ) distribution:
  μ | x1:n ∼ Np( (nΣ⁻¹ + T⁻¹)⁻¹ (nΣ⁻¹ x̄ + T⁻¹ ν), (nΣ⁻¹ + T⁻¹)⁻¹ ),
where x̄ = (1/n) ∑_{i=1}^n xi denotes the mean of the realisations x1:n. This generalises the
conjugate analysis for the univariate normal distribution, cf. Example 6.8.
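A sketch of this update in Python, restricted (as a simplifying assumption) to diagonal Σ and T so that all matrix inverses are elementwise:

```python
def posterior_normal(xbar, n, sigma_diag, nu, t_diag):
    """Posterior of mu given a sample mean xbar of n observations, for
    diagonal data covariance Sigma and diagonal prior covariance T."""
    prec = [n / s + 1 / t for s, t in zip(sigma_diag, t_diag)]
    mean = [((n / s) * xb + (1 / t) * v) / q
            for s, t, xb, v, q in zip(sigma_diag, t_diag, xbar, nu, prec)]
    var = [1 / q for q in prec]
    return mean, var

# p = 2: data mean (1, 2) from n = 25 observations, prior mean (0, 0)
mean, var = posterior_normal([1.0, 2.0], 25, [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
print(mean)   # shrunk slightly from (1, 2) towards the prior mean (0, 0)
print(var)
```

For general covariance matrices the same formula applies with proper matrix inverses.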
6.5 Bayesian Inference in Multiparameter Models 197
The Dirichlet distribution Dp(α1, …, αp), with density
  f(π) ∝ ∏_{i=1}^p πi^{αi − 1},    (6.21)
where πi > 0 (i = 1, …, p) and ∑_{i=1}^p πi = 1, is conjugate to the multinomial likelihood. Indeed, the likelihood function of a multinomial observation x is L(π) = ∏_{i=1}^p πi^{xi}, so the posterior distribution of π can easily be derived using (6.21):
  π | x ∼ Dp(α1 + x1, …, αp + xp),    (6.22)
  f(π) ∝ √|J(π)|, where the determinant of the expected Fisher information is
  |J(π)| = n^{p−1} / ∏_{i=1}^p πi,
so Jeffreys’ prior is
p
− 12
f (π ) = πi .
i=1
This is the kernel of a Dirichlet distribution π ∼ Dp (1/2, . . . , 1/2) with all p param-
eters equal to 1/2. As in the binomial case, this distribution is proper. The posterior
is, using Eq. (6.22),
π | x ∼ Dp (1/2 + x1 , . . . , 1/2 + xp ).
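The Dirichlet update (6.22) is a one-liner; a Python sketch using Jeffreys' prior Dp(1/2, …, 1/2) by default:

```python
def dirichlet_posterior(x, alpha=None):
    """Posterior parameters and means for multinomial counts x under a
    D_p(alpha_1, ..., alpha_p) prior; the default is Jeffreys' prior."""
    if alpha is None:
        alpha = [0.5] * len(x)
    post = [a + xi for a, xi in zip(alpha, x)]
    total = sum(post)
    return post, [p / total for p in post]

post, means = dirichlet_posterior([10, 5, 35])
print(post)                           # [10.5, 5.5, 35.5]
print([round(m, 3) for m in means])   # posterior means, summing to 1
```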
Example 6.23 (Normal model) Let X1:n denote a random sample from a N(μ, σ²) distribution with unknown parameter vector θ = (μ, σ²)′. We know from Example 5.11 that
  J(θ) = ( n/σ²     0
           0        n/(2σ⁴) ),
so Jeffreys’ prior is
!
n2
f (θ ) ∝ J (θ ) = ∝ σ −3 .
2σ 6
So μ and σ 2 are a priori independent with a locally uniform prior density for μ and
prior density proportional to (σ 2 )−3/2 for σ 2 .
This result is in conflict with Jeffreys’ prior if the parameter μ is known. In-
deed, it is always possible to factorise the prior f (μ, σ 2 ) ∝ (σ 2 )−3/2 in the form
f (μ)f (σ 2 | μ). Due to prior independence of μ and σ 2 , the conditional prior
f (σ 2 | μ) must therefore be equal to the marginal prior, which is proportional to
(σ 2 )−3/2 . But the conditional prior for σ 2 | μ should also equal Jeffreys’ prior for
known μ, which is, however, proportional to (σ 2 )−1 .
nuisance parameters, which are the means and variances of the two marginal normal
distributions. The following example discusses the reference prior for the univariate
normal distribution when both parameters are unknown.
Example 6.24 (Normal model) An alternative to Jeffreys’ prior from Example 6.23
is the reference prior
f (θ ) ∝ σ −2
for θ = (μ, σ 2 ) . Formally, this can be obtained through multiplication of Jeffreys’
prior for μ (with known σ 2 ) with Jeffreys’ prior for σ 2 (with μ known). In this
special case the reference prior remains the same, whether we treat μ or σ 2 as
parameter of interest and the other one as nuisance parameter.
Using the precision κ = σ −2 rather than the variance, the reference prior is
f (θ ) ∝ κ −1 .
The resulting posterior is then the normal-gamma distribution
  θ | x1:n ∼ NG( x̄, n, (n − 1)/2, ½ ∑_{i=1}^n (xi − x̄)² ).    (6.26)
Suppose we are only interested in the first component θ of the parameter vector
(θ , η) . Elimination of the nuisance parameter η is straightforward: all we need to
do is to integrate the joint posterior density f (θ , η | x) with respect to η. This gives
us the marginal posterior density of the parameter of interest.
More specifically, let
  f(θ, η | x) = f(x | θ, η) · f(θ, η) / f(x)
denote the joint posterior density of θ and η. The marginal posterior of θ is then
  f(θ | x) = ∫ f(θ, η | x) dη.    (6.27)
Fig. 6.9 Marginal distributions of a joint normal-gamma distribution NG(ν, λ, α, β) with param-
eters ν = 0, λ = 0.5 and α = 2, β = 1.2
For illustration, Fig. 6.9 displays the marginal densities of μ and κ for the joint
normal-gamma density shown in Fig. 6.8.
Example 6.26 (Normal model) Let X1:n denote a random sample from an
N(μ, κ −1 ) distribution with known precision κ = 1/σ 2 . We know from Exam-
ple 6.14 that Jeffreys’ prior for μ leads to the posterior μ | x1:n ∼ N(x̄, σ 2 /n).
However, if the mean μ is known and Jeffreys' prior f(σ²) ∝ 1/σ² is used for the variance, then
  σ² | x1:n ∼ IG( n/2, ½ ∑_{i=1}^n (xi − μ)² )  or, equivalently,
  κ | x1:n ∼ G( n/2, ½ ∑_{i=1}^n (xi − μ)² ).
If instead μ is unknown and the reference prior f(θ) ∝ σ⁻² is used, the marginal posterior of the precision is
  κ | x1:n ∼ G( (n − 1)/2, ½ ∑_{i=1}^n (xi − x̄)² )
as a direct consequence of (6.26). The first parameter is half of n − 1 rather than
n, so in complete analogy to the frequentist analysis, one degree of freedom is lost
if μ is treated as unknown. The marginal posterior of σ 2 = 1/κ is
  σ² | x1:n ∼ IG( (n − 1)/2, ½ ∑_{i=1}^n (xi − x̄)² )    (6.29)
with mean E(σ² | x1:n) = ∑_{i=1}^n (xi − x̄)²/(n − 3) and mode Mod(σ² | x1:n) = ∑_{i=1}^n (xi − x̄)²/(n + 1). The unbiased estimate S² = ∑_{i=1}^n (xi − x̄)²/(n − 1) lies between the posterior mean and the posterior mode, and this is also true for the MLE σ̂²ML = ∑_{i=1}^n (xi − x̄)²/n. See Fig. 6.10b for an illustration.
This can be shown straightforwardly by applying Example 6.25 to (6.26). Due to the
symmetry of the t distribution around x̄, the posterior mean, mode and median are
all identical to x̄ (if n > 2), and any γ · 100 % HPD interval will also be equal-tailed.
It is interesting that the frequentist approach discussed in Example 3.8 gives numerically exactly the same result: the distribution of the pivot
  √n (X̄ − μ) / √{ (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² } = (X̄ − μ) / √{ (1/{(n−1)n}) ∑_{i=1}^n (Xi − X̄)² } ∼ t(n − 1) = t(0, 1, n − 1)
is due to
  Y ∼ t(μ, σ², α)  ⇒  (Y − μ)/σ ∼ t(0, 1, α)
Fig. 6.10 Marginal posterior distributions for the mean and variance of the transformation factors
in the alcohol concentration data set
  (μ − x̄) / √{ (1/{(n−1)n}) ∑_{i=1}^n (xi − x̄)² }  |  x1:n ∼ t(0, 1, n − 1).
Since the standard t distribution is symmetric around the origin, this shows that the
limits of the equal-tailed γ · 100 % credible interval for μ will be equal to the limits
of the corresponding frequentist γ · 100 % confidence interval.
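This numerical agreement can be checked by simulating from the normal-gamma posterior (6.26): draw κ from its gamma marginal and then μ | κ from its conditional normal distribution. A Python sketch with illustrative data (the heights from Exercise 3 below):

```python
import random

random.seed(2)
data = [183, 173, 181, 170, 176, 180, 187, 176, 171, 190,
        184, 173, 176, 179, 181, 186]                  # illustrative data
n = len(data)
xbar = sum(data) / n
s = sum((x - xbar) ** 2 for x in data)

draws = []
for _ in range(100_000):
    # kappa ~ G((n-1)/2, s/2); gammavariate's second argument is the scale
    kappa = random.gammavariate((n - 1) / 2, 2 / s)
    # mu | kappa ~ N(xbar, 1/(n * kappa))
    draws.append(random.gauss(xbar, (n * kappa) ** -0.5))

draws.sort()
lower, upper = draws[2_500], draws[97_500]             # equal-tailed 95 % limits
print(round(lower, 2), round(xbar, 2), round(upper, 2))
```

The empirical limits approximate x̄ ± t0.975(n − 1) √{s/((n−1)n)}, i.e. the classical t interval.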
Two of the three commonly used point estimates for scalar parameters can also be
used if the posterior is multivariate: the posterior mean and the posterior mode. The
expectation of a multivariate random variable is defined as the vector of the expec-
tations of all its scalar components, which implies that the vector of the marginal
means is always equal to the joint mean, i.e.
  E(θ | x) = ( E(θ1 | x), …, E(θp | x) )′.
However, this is not necessarily the case for the posterior mode: marginal posterior modes may not equal the components of the joint mode, i.e. in general
  Mod(θi | x) ≠ { Mod(θ | x) }i.
Proof In order to show posterior consistency, we first consider n as fixed. The posterior probability of θi is
  f(θi | x1:n) = f(x1:n | θi) f(θi) / f(x1:n)
              = pi f(x1:n | θi)/f(x1:n | θ0) / ∑_j pj f(x1:n | θj)/f(x1:n | θ0)
              = pi ∏_{k=1}^n f(xk | θi)/f(xk | θ0) / ∑_j pj ∏_{k=1}^n f(xk | θj)/f(xk | θ0)
              = exp{ log(pi) + Si^(n) } / ∑_j exp{ log(pj) + Sj^(n) },
where Sj^(n) = ∑_{k=1}^n log{ f(xk | θj)/f(xk | θ0) } for each j.
Now let
  D(θ0 ∥ θi) = ∫ f(x | θ0) log{ f(x | θ0)/f(x | θi) } dx = E_{X | θ0}[ log{ f(X | θ0)/f(X | θi) } ]
denote the Kullback–Leibler discrepancy between f(x | θ0) and f(x | θi), which is strictly positive for θi ≠ θ0. By the law of large numbers, Sj^(n)/n converges to −D(θ0 ∥ θj) < 0 for j ≠ 0, while S0^(n) = 0 for all n, and therefore
  lim_{n→∞} Sj^(n) = 0 for j = 0  and  −∞ for j ≠ 0,
so that
  lim_{n→∞} f(θi | x1:n) = 1 for i = 0  and  0 for i ≠ 0.
The posterior probability of the true value θ0 hence converges to 1 as n → ∞.
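The convergence illustrated in Fig. 6.11 can be reproduced with a few lines of Python; posterior weights are normalised via the usual log-sum-exp trick (the true value, grid and sample size below are arbitrary choices):

```python
import math
import random

random.seed(3)
theta_grid = [0.05 + 0.1 * i for i in range(10)]   # {0.05, 0.15, ..., 0.95}
theta0 = 0.45                                      # true value, in the grid
n = 2000
k = sum(1 for _ in range(n) if random.random() < theta0)

# Log posterior (uniform prior) and log-sum-exp normalisation.
log_post = [k * math.log(t) + (n - k) * math.log(1 - t) for t in theta_grid]
m = max(log_post)
weights = [math.exp(lp - m) for lp in log_post]
post = [w / sum(weights) for w in weights]

best = theta_grid[post.index(max(post))]
print(best, round(max(post), 4))   # posterior mass piles up on theta = 0.45
```

Rerunning with smaller n shows the posterior still spread over neighbouring grid values.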
Fig. 6.11 Inference for the proportion θ after simulation from a binomial distribution Bin(n, θ)
with n = 10, 100, 1000 and using a discrete uniform distribution on Θ = {0.05, 0.15, . . . , 0.95}:
The posterior distribution (top) converges to the value θ ∈ Θ with smallest Kullback–Leibler dis-
crepancy (bottom) to the true model
In this section we will sketch that an unknown continuous parameter is—under suit-
able regularity conditions—asymptotically normally distributed. We consider the
general case of an unknown parameter vector θ .
Let X1:n denote a random sample from a distribution with probability mass or
density function f (x | θ ). The posterior density is then
  f(θ | x1:n) ∝ f(θ) f(x1:n | θ) = exp{ log f(θ) + log f(x1:n | θ) },
where (1) denotes the log-prior term log f(θ), (2) the log-likelihood term log f(x1:n | θ), and f(x1:n | θ) = ∏_{i=1}^n f(xi | θ). A quadratic approximation using a Taylor ex-
pansion of the terms (1) and (2) around their maxima m0 (the prior mode) and the
MLE θ̂ n = θ̂ ML (x1:n ), respectively, gives
  log f(θ) ≈ log f(m0) − ½ (θ − m0)′ I0 (θ − m0)  and
6.6 Some Results from Bayesian Asymptotics 207
  log f(x1:n | θ) ≈ log f(x1:n | θ̂n) − ½ (θ − θ̂n)′ I(θ̂n) (θ − θ̂n).
Here I 0 denotes the negative curvature of log{f (θ )} at the mode m0 , and I (θ̂ n ) =
I (θ̂ n ; x1:n ) is the observed Fisher information matrix. Under regularity conditions,
it follows that the posterior density is asymptotically proportional to
  exp( −½ [ (θ − m0)′ I0 (θ − m0) + (θ − θ̂n)′ I(θ̂n) (θ − θ̂n) ] )
    ∝ exp( −½ (θ − mn)′ In (θ − mn) ),
where
  In = I0 + I(θ̂n)  and  mn = In⁻¹ ( I0 m0 + I(θ̂n) θ̂n ),
so that θ | x1:n ∼a N( mn, In⁻¹ ).
This result gives a Bayesian interpretation of the MLE as asymptotic posterior mode
(or mean). In addition, for large sample size, the limits of a Wald confidence interval
for any component of θ will become numerically identical to the limits of an HPD
(or equal-tailed) credible interval if confidence and credibility levels are identical.
There are several similar statements regarding asymptotic normality of the posterior distribution. The first two use the MLE θ̂n and its associated Fisher information:
1. θ | x1:n ∼a N( θ̂n, I(θ̂n)⁻¹ ), based on the observed Fisher information I(θ̂n).
2. θ | x1:n ∼a N( θ̂n, J(θ̂n)⁻¹ ), based on the expected Fisher information J(θ̂n) evaluated at the MLE.
3. If the posterior mode Mod(θ | x1:n) and the negative curvature Cn of the log posterior density at the mode are available, e.g. by numerical techniques, then also
  θ | x1:n ∼a N( Mod(θ | x1:n), Cn⁻¹ ).
4. Often the posterior mean E(θ | x1:n ) and the posterior covariance Cov(θ | x1:n )
can be computed analytically or can at least be approximated with Monte Carlo
techniques. The following approximation may then be useful:
  θ | x1:n ∼a N( E(θ | x1:n), Cov(θ | x1:n) ).
Example 6.29 (Binomial model) Consider the binomial model X | π ∼ Bin(n, π).
We know that the likelihood corresponds to a random sample of size n from a
Bernoulli distribution with parameter π .
Given the observation X = x, the MLE is π̂ML = x/n, and we know from Exam-
ples 2.10 and 4.1 that here
  I(π̂ML) = J(π̂ML) = n / { π̂ML (1 − π̂ML) }.
Thus, the above approximations 1 and 2 are identical here. Under a conjugate beta
prior, π ∼ Be(α, β), the posterior equals
π | x ∼ Be(α + x, β + n − x),
with known mean, mode and variance. So we can compute approximation 4. We can
as well compute the negative curvature at the mode:
  − d²/dπ² log{ π^{α+x−1} (1 − π)^{β+n−x−1} / B(α + x, β + n − x) }, evaluated at π = (α + x − 1)/(α + β + n − 2),
  = (α + β + n − 2)³ / { (α + x − 1)(β + n − x − 1) }.
Hence, we can also investigate the approximation 3 described above. All three ap-
proximations are compared in Fig. 6.12.
Fig. 6.12 Normal approximation of the posterior f (π | x) based on simulated data from a
Bin(n, π = 0.1) distribution with a Be(1/2, 1/2) prior for π and increasing sample size n. Ap-
proximation 2 (identical to approximation 1 in this case) uses the MLE and the inverse Fisher
information at the MLE, approximation 3 uses the posterior mode and the inverse negative curva-
ture at the mode, approximation 4 uses the mean and variance of the posterior
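Example 6.29 can be sketched in Python: all three approximations have closed-form means and variances under the beta posterior, with the curvature formula as derived above:

```python
def approximations(x, n, a=0.5, b=0.5):
    """Means and variances of the three normal approximations to the
    posterior of pi under a Be(a, b) prior (Example 6.29)."""
    # 1/2: MLE with inverse (observed = expected) Fisher information
    p_ml = x / n
    var_ml = p_ml * (1 - p_ml) / n
    # 3: posterior mode with inverse negative curvature at the mode
    mode = (a + x - 1) / (a + b + n - 2)
    curv = (a + b + n - 2) ** 3 / ((a + x - 1) * (b + n - x - 1))
    # 4: exact posterior mean and variance of Be(a + x, b + n - x)
    a1, b1 = a + x, b + n - x
    mean = a1 / (a1 + b1)
    var = a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1))
    return (p_ml, var_ml), (mode, 1 / curv), (mean, var)

for n in (10, 100, 1000):
    print(n, approximations(round(0.1 * n), n))   # the three agree as n grows
```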
Strictly speaking, this is not a fully Bayesian approach, but it can be shown that
empirical Bayes estimates have attractive theoretical properties. Empirical Bayes
techniques are often used in various applications.
Example 6.30 (Scottish lip cancer) Consider Example 1.1.6, where we have anal-
ysed the incidence of lip cancer in n = 56 regions of Scotland. For each region
i = 1, . . . , n, the observed number of lip cancer cases xi is available as well as
the expected number ei under the assumption of a constant disease risk. We now
present a commonly used empirical Bayes procedure to estimate the disease risk in
each area while borrowing strength from the other areas.
Assume that x1 , . . . , xn are independent realisations from Po(ei λi ) distributions
with known expected counts ei > 0 and unknown region-specific parameters λi .
A suitable prior for the λi s is a gamma distribution, λi ∼ G(α, β), due to the conju-
gacy of the gamma distribution to the Poisson likelihood. The posteriors turn out to
be
λi | xi ∼ G(α + xi , β + ei ) (6.31)
with posterior means E(λi | xi ) = (α + xi )/(β + ei ), compare Table 6.2. If α and β
are fixed in advance, the posterior of λi does not depend on the data xj and ej from
other regions j = i.
An alternative approach is to assume that the relative risk is the same for all
regions. Then the posterior of λ is
  λ | x1:n ∼ G( α + ∑_{i=1}^n xi, β + ∑_{i=1}^n ei ).
Alternatively, α and β can be estimated from the data: integrating out the λi's gives a marginal likelihood for (α, β), a product of Poisson-gamma (negative binomial) terms, which can be maximised numerically with respect to α and β. One obtains MLEs α̂ML and β̂ML of α and β, which are plugged into formula (6.31). The resulting posterior mean estimates
  E(λi | xi) = (α̂ML + xi)/(β̂ML + ei)
are called empirical Bayes estimates of λi . Here we obtain α̂ML = 1.876 and β̂ML =
1.317. Figure 6.13 displays the empirical Bayes estimates and the corresponding
equal-tailed 95 % credible intervals. We can see that the MLEs xi /ei are shrunk
towards the prior mean, i.e. the empirical Bayes estimates lie between these two
extremes. This phenomenon is called shrinkage.
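A runnable sketch of this empirical Bayes procedure in Python, with simulated counts in place of the Scotland data and a crude grid search in place of a proper optimiser (both are illustrative assumptions):

```python
import math
import random

def rpois(mu):
    """Knuth's Poisson sampler; adequate for the moderate means used here."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def marginal_loglik(a, b, x, e):
    """Log marginal likelihood: a sum of Poisson-gamma (negative binomial) terms."""
    return sum(math.lgamma(a + xi) - math.lgamma(a) - math.lgamma(xi + 1)
               + a * math.log(b) + xi * math.log(ei)
               - (a + xi) * math.log(b + ei)
               for xi, ei in zip(x, e))

random.seed(4)
e = [random.uniform(1, 20) for _ in range(56)]         # expected counts
lam = [random.gammavariate(2.0, 1 / 1.5) for _ in e]   # true alpha = 2, beta = 1.5
x = [rpois(ei * li) for ei, li in zip(e, lam)]         # observed counts

grid = [0.1 * i for i in range(1, 60)]
a_hat, b_hat = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: marginal_loglik(ab[0], ab[1], x, e))

eb = [(a_hat + xi) / (b_hat + ei) for xi, ei in zip(x, e)]  # shrunk estimates
mle = [xi / ei for xi, ei in zip(x, e)]
print(round(a_hat, 1), round(b_hat, 1), round(a_hat / b_hat, 2))
```

Each empirical Bayes estimate lies between the region-specific MLE xi/ei and the estimated prior mean α̂/β̂, which is exactly the shrinkage phenomenon described above.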
It is illustrative to consider the partial derivative of the log-likelihood l(α, β) with
respect to β and to set it to zero. This score equation will hold for the MLE β̂ML , i.e.
  (1/n) ∑_{i=1}^n (α̂ML + xi)/(β̂ML + ei) = α̂ML/β̂ML.
6.7 Empirical Bayes Methods 211
So the average of the empirical Bayes estimates equals the MLE α̂ML /β̂ML of the
prior mean α/β. For the Scotland lip cancer data, this prior mean equals 1.424.
where σi2 = ai−1 + bi−1 + ci−1 + di−1 . All studies are based on relatively large sample
sizes, so the above normal approximation is likely to be fairly accurate. Figure 6.14
shows the study-specific log odds ratio estimates ψ̂i with corresponding Wald confi-
dence intervals. Under a locally uniform reference prior for ψ, its posterior is normal
with mean
  ψ̂ = ∑_{i=1}^9 wi ψ̂i / ∑_{i=1}^9 wi    (6.32)
and variance 1/∑_{i=1}^9 wi, where wi = 1/σi² is the precision of the ith study-specific
estimate. This is the so-called fixed effect model for meta-analysis, which gives an
overall treatment effect estimate as the weighted average of the study-specific treat-
ment effects with weights proportional to the inverse squared standard errors. Based
on this approach, we obtain the overall estimate ψ̂ = −0.40 with 95 % credible interval for ψ from −0.57 to −0.22, values very similar to those obtained before.
We now allow the underlying true treatment effects to vary from study to study,
i.e.
  ψ̂i | ψi ∼ N(ψi, σi²)  and  ψi ∼ N(ν, τ²),
and ν, the average treatment effect across all studies, is now of primary interest. The
study-specific effects ψi are allowed to vary randomly around ν, therefore such a
model is called a random effects model. If ν and τ 2 are known, we can easily derive
the posterior distribution
  ψi | ψ̂i ∼ N(ν̃i, σ̃i²)    (6.33)
of each study effect ψi , compare Example 6.8. Here σ̃i2 = 1/(1/σi2 + 1/τ 2 ) and
ν̃i = σ̃i2 (ψ̂i /σi2 + ν/τ 2 ). So the posterior mean is a weighted average of the study-
specific log odds ratio estimate ψ̂i and the prior mean ν with weights proportional
to 1/σi2 and 1/τ 2 , respectively.
  lp(τ²) = −½ ∑_{i=1}^n [ log(σi² + τ²) + { ψ̂i − ν̂ML(τ²) }² / (σi² + τ²) ].
Empirical Bayes estimates of the individual study effects ψi are finally obtained by plugging the MLEs ν̂ML and τ̂²ML into (6.33) in place of the fixed values ν and τ². For the preeclampsia data, we obtain ν̂ML = −0.52 and τ̂²ML = 0.24. Note that the MLE ν̂ML = −0.52 in the model with random effects is smaller than under a fixed effect model (ψ̂ML = −0.37). Figure 6.15 displays 95 % empirical Bayes credible
intervals for the individual study effects. Five of them lie below zero, so for these
studies, we can identify a positive treatment effect. Note, however, that the intervals
tend to be too small, as they do not take into account the uncertainty in the estima-
tion of ν and τ 2 . Also displayed is a 95 % confidence interval based on the profile
likelihood of the mean study effect ν. Note that this is substantially wider than the
corresponding one for the fixed effect ψ under a homogeneity assumption.
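The random effects computation can be sketched end to end: maximise the profile log-likelihood lp(τ²) on a grid, plug ν̂ML and τ̂²ML into (6.33), and shrink each study estimate. The numbers below are invented for illustration; they are not the preeclampsia data:

```python
import math

psi_hat = [-1.4, -0.9, -0.1, 0.4, -1.2, -0.6, -0.3, 0.6, -1.6]   # invented
s2 = [0.05, 0.08, 0.06, 0.10, 0.07, 0.05, 0.09, 0.12, 0.06]      # invented

def nu_ml(tau2):
    """Profile MLE of nu for fixed tau2: the precision-weighted mean."""
    w = [1 / (s + tau2) for s in s2]
    return sum(wi * p for wi, p in zip(w, psi_hat)) / sum(w)

def profile_loglik(tau2):
    nu = nu_ml(tau2)
    return -0.5 * sum(math.log(s + tau2) + (p - nu) ** 2 / (s + tau2)
                      for p, s in zip(psi_hat, s2))

tau2_hat = max((0.001 * i for i in range(1, 2001)), key=profile_loglik)
nu_hat = nu_ml(tau2_hat)

# Empirical Bayes posterior means, cf. (6.33): precision-weighted average of
# each study estimate and the estimated overall mean nu_hat.
post_mean = [(p / s + nu_hat / tau2_hat) / (1 / s + 1 / tau2_hat)
             for p, s in zip(psi_hat, s2)]
print(round(tau2_hat, 3), round(nu_hat, 3))
```

As the text warns, intervals built from these plug-in estimates ignore the uncertainty in ν̂ML and τ̂²ML and hence tend to be too narrow.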
6.8 Exercises
1. In 1995, O.J. Simpson, a retired American football player and actor, was ac-
cused of the murder of his ex-wife Nicole Simpson and her friend Ronald
Goldman. His lawyer, Alan M. Dershowitz, stated on TV that only one-tenth
of 1 % of men who abuse their wives go on to murder them. He wanted his au-
dience to interpret this to mean that the evidence of abuse by Simpson would
only suggest a 1 in 1000 chance of being guilty of murdering her.
However, Merz and Caulkins (1995) and Good (1995) argue that a dif-
ferent probability needs to be considered: the probability that the husband is
guilty of murdering his wife given both that he abused his wife and his wife
was murdered. Both compute this probability using Bayes theorem but in two
different ways. Define the following events:
A: “The woman was abused by her husband.”
M: “The woman was murdered by somebody.”
G: “The husband is guilty of murdering his wife.”
(a) Merz and Caulkins (1995) write the desired probability in terms of the
corresponding odds as
  Pr(G | A, M)/Pr(Gᶜ | A, M) = Pr(A | G, M)/Pr(A | Gᶜ, M) · Pr(G | M)/Pr(Gᶜ | M).    (6.34)
They use the fact that, of the 4936 women who were murdered in 1992,
about 1430 were killed by their husband. In a newspaper article, Der-
showitz stated that “It is, of course, true that, among the small number
of men who do kill their present or former mates, a considerable num-
ber did first assault them.” Merz and Caulkins (1995) interpret “a con-
siderable number” to be 1/2. Finally, they assume that the probability
of a wife being abused by her husband, given that she was murdered
by somebody else, is the same as the probability of a randomly chosen
woman being abused, namely 0.05.
Calculate the odds (6.34) based on this information. What is the cor-
responding probability of O.J. Simpson being guilty, given that he has
abused his wife and she has been murdered?
(b) Good (1995) uses the alternative representation
  Pr(G | A, M)/Pr(Gᶜ | A, M) = Pr(M | G, A)/Pr(M | Gᶜ, A) · Pr(G | A)/Pr(Gᶜ | A).    (6.35)
Calculate the odds (6.35) based on this information. What is the cor-
responding probability of O.J. Simpson being guilty, given that he has
abused his wife and she has been murdered?
(c) Good (1996) revised this calculation, noting that only about a quarter of murder victims are female, so Pr(M | Gᶜ, A) reduces to 1/20 000. He also corrected Pr(G | A) to 1/2000, when he realised that
Dershowitz’s estimate was an annual and not a lifetime risk. Calculate
the probability of O.J. Simpson being guilty based on this updated in-
formation.
2. Consider Example 6.4. Here we will derive the implied distribution of θ =
Pr(D+ | T +) if the prevalence is π ∼ Be(α̃, β̃).
(a) Deduce with the help of Appendix A.5.2 that
  γ = (α̃/β̃) · (1 − π)/π
follows an F distribution with 2β̃ and 2α̃ degrees of freedom.
(b) Show that
  θ = g(γ) = (1 + γ/c)⁻¹,
where
  c = α̃ Pr(T + | D+) / [ β̃ { 1 − Pr(T − | D−) } ].
(c) Show that
  dg(γ)/dγ = − 1 / { c (1 + γ/c)² }
and that g(γ ) is a strictly monotonically decreasing function of γ .
(d) Use the change-of-variables formula (A.11) to derive the density of θ in
(6.13).
(e) Analogously proceed with the negative predictive value τ = Pr(D−|T −)
to show that the density of τ is
  f(τ) = d · τ⁻² · fF( d(1/τ − 1); 2α̃, 2β̃ ),
where
  d = β̃ Pr(T − | D−) / [ α̃ { 1 − Pr(T + | D+) } ],
and fF (x; 2α̃, 2β̃) is the density of the F distribution with parameters
2α̃ and 2β̃.
3. Suppose that the heights of male students are normally distributed with mean
180 and unknown variance σ 2 . We believe that σ 2 is in the range [22, 41]
with approximately 95 % probability. Thus, we assign an inverse-gamma dis-
tribution IG(38, 1110) as prior distribution for σ 2 .
(a) Verify with R that the parameters of the inverse-gamma distribution lead
to a prior probability of approximately 95 % that σ 2 ∈ [22, 41].
(b) Derive and plot the posterior density of σ² corresponding to the following data: 183, 173, 181, 170, 176, 180, 187, 176, 171, 190, 184, 173, 176, 179, 181, 186.
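Part (a) can also be checked without R; a Python sketch that integrates the inverse-gamma density numerically (trapezoidal rule, step count arbitrary):

```python
import math

def dinvgamma(x, a, b):
    """Density of the inverse-gamma distribution IG(a, b)."""
    return math.exp(a * math.log(b) - math.lgamma(a)
                    - (a + 1) * math.log(x) - b / x)

def trapezoid(f, lo, hi, m=10_000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / m
    return h * (f(lo) / 2 + sum(f(lo + i * h) for i in range(1, m)) + f(hi) / 2)

p = trapezoid(lambda x: dinvgamma(x, 38, 1110), 22, 41)
print(round(p, 3))   # close to the intended 95 %
```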
with α, β > 0.
(a) Derive the posterior density f (π | x). Which distribution is this and
what are its parameters?
(b) Define conjugacy and explain why, or why not, the beta prior is conju-
gate with respect to the negative binomial likelihood.
(c) Show that the expected Fisher information is proportional to π −2 (1 −
π)−1 and derive therefrom Jeffreys’ prior and the resulting posterior
distribution.
9. Let X1:n denote a random sample from a uniform distribution on the interval
[0, θ ] with unknown upper limit θ . Suppose we select a Pareto distribution
Par(α, β) with parameters α > 0 and β > 0 as a prior distribution for θ , cf.
Table A.2 in Sect. A.5.2.
(a) Show that T (X1:n ) = max{X1 , . . . , Xn } is sufficient for θ .
(b) Derive the posterior distribution of θ and identify the distribution type.
(c) Determine posterior mode Mod(θ | x1:n ), posterior mean E(θ | x1:n ), and
the general form of the 95 % HPD interval for θ .
10. We continue Exercise 1 in Chap. 5, so we assume that the number of IHD
ind
cases is Di | λi ∼ Po(λi Yi ), i = 1, 2, where λi > 0 is the group-specific inci-
dence rate. We use independent Jeffreys’ priors for the rates λ1 and λ2 .
(a) Derive the posterior distribution of λ1 and λ2 . Plot these in R for com-
parison.
(b) Derive the posterior distribution of the relative risk θ = λ2 /λ1 as fol-
lows:
(i) Derive the posterior distributions of τ1 = λ1 Y1 and τ2 = λ2 Y2 .
(ii) An appropriate multivariate transformation of τ = (τ1 , τ2 ) to
work with is g(τ ) = η = (η1 , η2 ) with η1 = τ2 /τ1 and η2 =
τ2 +τ1 to obtain the joint density fη (η) = fτ {g −1 (η)}|(g −1 ) (η)|,
cf. Appendix A.2.3.
(iii) Since η1 = τ2 /τ1 is the parameter of interest, integrate η2 out of
fη (η) and show that the marginal density is
(b) Show that for n > 1, the posterior probability mass function is
  f(N | xn) = (n − 1)/xn · (xn choose n) · (N choose n)⁻¹  for N ≥ xn.
(b) Now derive the empirical Bayes estimates π̂i . Compare them with the
corresponding MLEs.
6.9 References
Contents
7.1 Likelihood-Based Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.1.1 Akaike’s Information Criterion . . . . . . . . . . . . . . . . . . . . . . . . 224
7.1.2 Cross Validation and AIC . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.1.3 Bayesian Information Criterion . . . . . . . . . . . . . . . . . . . . . . . 230
7.2 Bayesian Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
7.2.1 Marginal Likelihood and Bayes Factor . . . . . . . . . . . . . . . . . . . . 232
7.2.2 Marginal Likelihood and BIC . . . . . . . . . . . . . . . . . . . . . . . . 236
7.2.3 Deviance Information Criterion . . . . . . . . . . . . . . . . . . . . . . . 239
7.2.4 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
7.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 221
https://doi.org/10.1007/978-3-662-60792-3_7,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
222 7 Model Selection
Suppose M1 is the simpler model and M2 the more complex. We can now apply
the generalised LR statistic
  W = 2 log{ maxM2 L(θ) / maxM1 L(θ) } = 2 { maxM2 l(θ) − maxM1 l(θ) }
to compare the two models. Here the log-likelihood l(θ) is understood as the log-
likelihood in the more general model M2 . Model M1 can be written as a restriction
of model M2 , so the maximised log-likelihood in model M1 is identical to l(θ ) max-
imised under the restriction corresponding to M1 , which we denote as maxM1 l(θ ).
Under the assumption that model M1 is correct, W is asymptotically χ 2 distributed.
The associated degrees of freedom are given by the difference of the number of un-
known parameters of the two models considered. For example, if the more complex
model M2 has one parameter more than the simpler model M1 , a P -value can be
computed based on the upper tail of the χ 2 distribution with one degree of freedom,
evaluated at the observed value w of W : Pr(X ≥ w) where X ∼ χ 2 (1). For example,
if w = 3.84, then the P -value is 0.05.
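Since a χ²(1) variable is the square of a standard normal one, this P-value needs nothing beyond the normal distribution function; a Python check:

```python
from statistics import NormalDist

def pvalue_chi2_1(w):
    """P-value Pr(X >= w) for X ~ chi^2(1), via X = Z^2 with Z standard normal."""
    return 2 * (1 - NormalDist().cdf(w ** 0.5))

print(round(pvalue_chi2_1(3.84), 3))   # 0.05
print(round(pvalue_chi2_1(0.04), 2))   # 0.84, as in Example 7.1 below
```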
Example 7.1 (Analysis of survival times) Coming back to the introductory ex-
ample, we want to determine an adequate model for the survival data of the PBC
patients, see Sect. 1.1.8. In particular, we need to take into account that the observa-
tions are right censored.
In the exponential model (M1 ) of Example 2.8, it is possible to determine the
MLE analytically. Substituting this estimator into the log-likelihood function gives
l(λ̂ML ) = −424.0243.
We can use the code from Example 5.9 to determine the maximal log-likelihood
in the Weibull model M2 .
start <- c(1000, 1)
resultWeibull <- optim(start, weibullLik, log = TRUE,
                       control = list(fnscale = -1), hessian = TRUE)
logLikWeibull <- resultWeibull$value
logLikWeibull
[1] -424.0043
The LR statistic is thus W21 = 2{ −424.0043 − (−424.0243) } = 0.04, with corresponding P-value 0.84 computed from the χ² distribution function with one degree of freedom. So we have no evidence against the simpler exponential model.
In Example 2.3 a gamma model for the uncensored data has been described.
Analogously to the approach above, we can determine the maximal log-likelihood
for the gamma model M3 in the parametrisation using μ and φ:
7 Model Selection 223
The resulting value, −424.0047, is almost the same as for the Weibull model. The gamma model is also a generalisation of the exponential model, which is obtained when μ = φ. Hence, we
can test this null hypothesis with the generalised LR statistic
W31 = 2 { max_{M3} l(θ) − max_{M1} l(θ) } = 2 { −424.0047 − (−424.0243) } = 0.0393
and obtain the same P -value 0.84 as before, with the same conclusion that there is
no evidence against the exponential model.
The LR statistic is a valuable method for model selection and is frequently used.
However, for several reasons, it is not appropriate as a general method for model
selection. First, only nested models can be compared; the comparison of non-nested models is not possible. For example, the LR statistic cannot be used to choose between the Weibull and the gamma model in Example 7.1, since neither of the two models can be written as a special case of the other. Second, the procedure is based on a significance test and is therefore asymmetric in nature: it can only provide evidence against, but not for, the simpler model. Finally, the LR test cannot be generalised to a comparison of more than two models; it can only be used for pairwise model comparisons.
The form of the LR statistic indicates that the maximal value of the likelihood function under the respective model, max_{Mi} L(θ), is the central quantity for model
selection. However, a more complex model, in which the simpler model is nested,
will always increase the likelihood at the corresponding MLE. It is therefore not a
sensible strategy to choose the model with the largest maximal likelihood because
one would always select the most complex model. Ultimately, the dimensions of the
different models also have to be taken into account.
was later interpreted statistically under the name Ockham’s razor. John Ponce of
Cork, for example, wrote in 1639:
“Variation must be taken as random until there is positive evidence to the contrary; and
new parameters in laws, when they are suggested, must be tested one at a time unless there
is specific reason to the contrary.”
We will see that all approaches to model selection that are described in this chap-
ter, i.e. both likelihood methods (Sect. 7.1) and Bayesian methods (Sect. 7.2), do
indeed penalise model complexity somehow. As a more fundamental question we
have to discuss if there even is a “true” model or if all models are more or less bad
descriptions of the available data. This aspect is reflected in Sect. 7.2.4, where not
one single model is chosen, but where results from different models are combined.
The value of the likelihood function at the MLE θ̂ ML describes the quality of the
model fit. This value will in the following be combined with measures of model
complexity resulting in different model selection criteria, which allow the compar-
ison of non-nested models. The measure of model complexity will in some form
depend on the dimension p of the parameter vector θ . The model with the best
value of the criterion is then chosen as the best model. Note that it is now crucial to
include all multiplicative constants in the likelihood since otherwise the comparison
of two models based on a likelihood criterion is meaningless.
The Akaike information criterion, AIC = −2 l(θ̂ML) + 2p, penalises the maximised log-likelihood with the number of parameters p. The criterion is negatively oriented, i.e. the model with minimal AIC is selected. Therefore,
a difference of 2q is sufficient for a model with q additional parameters to be pre-
ferred. For example, for q = 1, a difference of 2 is sufficient, which corresponds
to a P -value of 0.16 for the comparison of two nested models using the LR test.
For q = 2, the corresponding P -value is 0.14. Table 7.1 shows that the correspond-
ing P -values decrease with increasing q. This indicates that model selection based
on AIC is not equivalent to or compatible with model selection based on the LR
statistic.
7.1 Likelihood-Based Model Selection 225
Table 7.1 P-value thresholds based on the LR statistic with q degrees of freedom, which correspond to model selection based on AIC with penalty 2q

q        1     2     3     4     5     6     7     8
P-value  0.16  0.14  0.11  0.092 0.075 0.062 0.051 0.042
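The thresholds in Table 7.1 can be reproduced numerically. The sketch below (in Python) implements the upper tail probability of the χ² distribution for integer degrees of freedom via the standard recurrence Q(w; k+2) = Q(w; k) + (w/2)^{k/2} e^{−w/2} / Γ(k/2 + 1), starting from the closed forms for one and two degrees of freedom:

```python
import math

def chi2_sf(w, df):
    """Upper tail probability Pr(X >= w) for X ~ chi-squared(df), integer df."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(w / 2)), 1  # closed form for 1 df
    else:
        q, k = math.exp(-w / 2), 2             # closed form for 2 df
    while k < df:
        # recurrence: Q(w; k + 2) = Q(w; k) + (w/2)^(k/2) e^(-w/2) / Gamma(k/2 + 1)
        q += (w / 2) ** (k / 2) * math.exp(-w / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

for q in range(1, 9):
    # the thresholds of Table 7.1: LR statistic w = 2q with q degrees of freedom
    print(q, round(chi2_sf(2 * q, q), 3))
```

The printed values match Table 7.1 up to rounding.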
Example 7.2 (Analysis of survival times) For comparing the three survival models
with AIC, we take the maximised log-likelihood values from Example 7.1 and the
number of parameters (2 for the gamma and Weibull models and 1 for the expo-
nential model), and obtain the AIC values 850.0486, 852.0087 and 852.0094 for
M1 , M2 and M3 , respectively. The conclusion is the same as before: the exponential
model M1 is preferred because it has the smallest AIC. The Weibull model M2 and
the gamma model M3 have nearly the same AIC values.
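The AIC values follow directly from the maximised log-likelihoods and parameter counts; a quick check (here in Python, with the log-likelihood values from Example 7.1):

```python
# AIC = -2 * maximised log-likelihood + 2 * number of parameters
models = {
    "exponential (M1)": (-424.0243, 1),
    "Weibull (M2)":     (-424.0043, 2),
    "gamma (M3)":       (-424.0047, 2),
}
aic = {name: -2 * ll + 2 * p for name, (ll, p) in models.items()}
for name, value in aic.items():
    print(name, round(value, 2))
# The exponential model attains the smallest value, about 850.05,
# versus about 852.01 for the two-parameter models.
```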
Since g is fixed but unknown, minimisation of (7.2) reduces to maximising the mean
log-likelihood Eg {log f (X; θ)}, which also depends on the unknown distribution g.
For a fixed model, let θ 0 denote the resulting optimal (“least false”) parameter value.
If the model is actually correct, then g(x) and f (x; θ 0 ) are identical. Consider now
the MLE θ̂ML, which maximises the log-likelihood l(θ; x1:n) = Σ_{i=1}^n log f(xi; θ). Due
to the law of large numbers (Appendix A.4.3), we have
(1/n) l(θ; X1:n) → Eg{ log f(X; θ) } as n → ∞,
and thus θ̂ ML → θ 0 as n → ∞. This perspective justifies the use of the MLE even
in settings when model misspecification is suspected. Moreover, the MLE can now
be viewed as the parameter estimate which approximately minimises the Kullback–Leibler discrepancy to the truth: when the true distribution g is replaced with the empirical distribution ĝn of the data, we have E_{ĝn}{ log f(X; θ) } = (1/n) l(θ; x1:n).
because the first term in the Kullback–Leibler discrepancy (7.2) does not depend on
the model. The inner expectation in (7.3) computes the mean log-likelihood M with
respect to the data Y , while the outer expectation is taken with respect to the inde-
pendent random sample X1:n from g. Replacing g with the empirical distribution ĝn
in (7.3), we obtain the estimate
K̂ = E_{ĝn}{ (1/n) l(θ̂ML(X1:n); x1:n) } = (1/n) l(θ̂ML(x1:n); x1:n).
It is intuitively clear that this estimate will be biased because we have used the same
data x1:n = (x1 , . . . , xn ) twice, both for estimating the mean log-likelihood and for
estimating its expectation. The AIC thus includes a bias correction for this estimate:
We will now show that
E(K̂ − K) ≈ p/n, (7.4)
where p is the number of parameters in the model, so that the bias-corrected esti-
mate is
K̂ − p/n = (1/n){ l(θ̂ML) − p } = −(1/(2n)) AIC. (7.5)
Hence, choosing the model that minimises the AIC is approximately the same as
minimising the expected Kullback–Leibler discrepancy to the truth.
To prove Eq. (7.4), we use a Taylor expansion of log f(x; θ̂ML) around θ0. The
difference between the estimate K̂(X1:n ) and the mean log-likelihood M(X1:n ) can
now be approximated by
K̂ − M ≈ (1/n) Σ_{i=1}^n Zi + (θ̂ML − θ0)^⊤ G1 (θ̂ML − θ0), (7.6)
where G1 = Eg{ I1(θ0; X) } is the expected unit information. Since the random variables Zi = log f(Xi; θ0) − Eg{ log f(X; θ0) } have zero mean, the expectation of (7.6) reduces to the expectation of the quadratic form (θ̂ML − θ0)^⊤ G1 (θ̂ML − θ0).
Analogously to the case in Sect. 4.2.3 where the model is correctly specified, there
is the following more general result for the distribution of the ML estimator in case
of model misspecification:
√n (θ̂ML − θ0) →D G1^{-1} U,

where U = S(θ0; X1:n)/√n ∼ Np(0, H1) asymptotically, due to the central limit theorem (cf. Appendix A.4.4), because Eg{S1(θ0; X)} = 0 and H1 = Covg{S1(θ0; X)}. Hence, we
have
(θ̂ML − θ0)^⊤ G1 (θ̂ML − θ0) →D (1/n) U^⊤ G1^{-1} G1 G1^{-1} U = (1/n) U^⊤ G1^{-1} U. (7.7)
Using the approximate normal distribution of U and a result from Appendix A.2.4,
we obtain
Eg{ U^⊤ G1^{-1} U } ≈ tr(G1^{-1} H1) = p*, (7.8)
where p* can be interpreted as a generalised parameter dimension. Note that G1 = H1 = J1(θ0) is the expected unit Fisher information in the case where the model is correctly specified. Then we have exactly p* = tr(I_p) = p, the number of parameters in the model. This directly leads to (7.4). However, even in the case of misspecification, the approximation p* ≈ p used by the AIC is justifiable. More sophisticated estimates of (7.8) are only rarely used in practice and may have larger variance.
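For the simple normal-location model f(x; θ) = N(θ, 1), the unit score is S1(θ; x) = x − θ and the unit information equals 1, so G1 = 1 and H1 = Varg(X); the generalised dimension p* = tr(G1^{-1} H1) then reduces to the variance of the data-generating distribution. A small numeric sketch (in Python, with hypothetical simulated data, not from the book):

```python
import random

random.seed(2)

# Misspecified setting: we fit N(theta, 1), but the data-generating
# distribution is N(0, 4), i.e. has variance 4 instead of 1.
n = 5000
data = [random.gauss(0, 2) for _ in range(n)]

theta_hat = sum(data) / n  # MLE of the location parameter

# For N(theta, 1): unit score S_1(theta; x) = x - theta, unit information G_1 = 1.
G1 = 1.0
H1 = sum((x - theta_hat) ** 2 for x in data) / n  # empirical score variance

p_star = H1 / G1  # generalised dimension tr(G1^{-1} H1)
print(p_star)     # close to 4, whereas the nominal dimension is p = 1
```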
Recall that the plain value l(θ̂ ML ) cannot be used as a model selection criterion since
the value of the maximal log-likelihood automatically increases for a more complex
model. The underlying problem is that the available data x1:n are used twice: on the
one hand, to calculate the estimate θ̂ ML (x1:n ) and, on the other hand, to calculate
the model selection criterion, the log-likelihood l(θ ; x1:n ). A better approach is to
divide the data into a training and a validation sample. The training sample is used
to estimate the parameters, and the validation sample is used to evaluate the model
selection criterion. The splitting into training and validation parts is then repeated such that each observation is contained exactly once in a validation sample, and the average value of the model criterion is calculated. Instead of increasing automatically for more complex models, this average will decrease once the model is overfitted. This quite general method is known as cross-validation.
AIC can approximately be interpreted as a cross validation criterion, as explained
in the following. Suppose that in each cross validation iteration only one observation
(xi ) is left out to create the validation sample and the remaining observations (x−i )
constitute the training sample. In general, this is called leave-one-out cross valida-
tion. Then it turns out that the resulting cross-validated average log-likelihood
K̂CV = (1/n) Σ_{i=1}^n l( θ̂ML(x−i); xi ) (7.9)
has expectation

Eg(K̂CV) = (1/n) Σ_{i=1}^n Eg[ Eg{ log f(Y; θ̂ML(X1:(n−1))) } ] ≈ (1/n) · n K = K
for large sample sizes n. That means that both the scaled AIC (7.5) and the cross-
validated average log-likelihood (7.9) are approximately unbiased estimates of K
and are hence equivalent model selection criteria for large sample sizes.
Example 7.3 (Analysis of survival times) We can compute the cross-validated av-
erage log-likelihood values (7.9) for the three survival models by a simple adapta-
tion of the R-functions to compute the log-likelihood: Each function obtains an ad-
ditional subset argument, which gives the indices of the data points that are used
to compute the log-likelihood. Looping over all observations i = 1, . . . , n, for each
observation xi , the MLE is computed with the remaining observations x−i , and the
log-likelihood is evaluated at this estimate for the observation xi . As an example,
we show R-code for the gamma model:
## Copy function gammaLik to gammaLik2, add an argument "subset" and
## replace "pbcTreat" by "pbcTreat[subset, , drop = FALSE]" in the
## function.
cvLogliksGamma <- numeric(nrow(pbcTreat))
for (i in seq_len(nrow(pbcTreat)))
{
    ## compute the MLE from the training sample:
    subsetMle <- optim(c(1000, 1),
                       gammaLik2,
                       log = TRUE,
                       subset = setdiff(x = seq_len(nrow(pbcTreat)),
                                        y = i),
                       control = list(fnscale = -1),
                       hessian = TRUE)
    stopifnot(subsetMle$convergence == 0)
    subsetMle <- subsetMle$par
    ## compute the log-likelihood for the validation observation:
    cvLogliksGamma[i] <- gammaLik2(subsetMle,
                                   log = TRUE,
                                   subset = i)
}
meanCvLoglikGamma <- mean(cvLogliksGamma)
Averaging these values for each model as in the last R-code line, we obtain −4.5219,
−4.5330 and −4.5331 for the models M1 , M2 and M3 , respectively. On the other
hand, if we scale the AIC values from Example 7.2 by dividing with −2n, we obtain
−4.5215, −4.5320 and −4.5320. These values are very close to the cross-validated
average log-likelihood values.
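The near-agreement between the cross-validated average log-likelihood and the scaled AIC can also be checked in a self-contained toy setting. The sketch below (in Python) uses a normal model with known variance 1 and unknown mean, with hypothetical simulated data (not the PBC data):

```python
import math
import random

random.seed(3)
n = 200
data = [random.gauss(0, 1) for _ in range(n)]

def log_density(x, mu):
    # log of the N(mu, 1) density
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

# Leave-one-out cross-validated average log-likelihood, cf. (7.9):
total = sum(data)
cv_terms = []
for x in data:
    mu_loo = (total - x) / (n - 1)   # MLE from the training sample x_{-i}
    cv_terms.append(log_density(x, mu_loo))
k_cv = sum(cv_terms) / n

# Scaled AIC: l(mu_hat)/n - p/n with p = 1 parameter:
mu_hat = total / n
aic_scaled = sum(log_density(x, mu_hat) for x in data) / n - 1 / n

print(abs(k_cv - aic_scaled))  # a very small difference
```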
To see why, write li(θ) = log f(xi; θ) for the log-likelihood contribution of observation xi and θ̂i = θ̂ML(x−i) for the leave-one-out estimate, and expand around θ̂ML:

n K̂CV = Σ_{i=1}^n li(θ̂i)
      = Σ_{i=1}^n { li(θ̂ML) + (∂/∂θ) li(θ*i)^⊤ (θ̂i − θ̂ML) }
      = l(θ̂ML) + Σ_{i=1}^n Si(θ*i)^⊤ (θ̂i − θ̂ML), (7.10)
where θ ∗i lies somewhere on the line between θ̂ ML and θ̂ i (see Appendix B.2.3). We
also use a Taylor expansion around θ̂ ML to rewrite the score vector as
S(θ̂i) = (∂/∂θ) l(θ̂i)
      = (∂/∂θ) l(θ̂ML) + (∂²/(∂θ ∂θ^⊤)) l(θ*i) (θ̂i − θ̂ML)
      ≈ 0 − n G1 (θ̂i − θ̂ML). (7.11)
Solving for the difference gives

θ̂i − θ̂ML ≈ −(1/n) G1^{-1} Si(θ̂i).
If we plug this into (7.10), we obtain
n K̂CV ≈ l(θ̂ML) − (1/n) Σ_{i=1}^n Si(θ*i)^⊤ G1^{-1} Si(θ̂i)
      = l(θ̂ML) − (1/n) Σ_{i=1}^n tr{ G1^{-1} Si(θ̂i) Si(θ*i)^⊤ } (7.12)
      ≈ l(θ̂ML) − tr(G1^{-1} H1) (7.13)
      ≈ l(θ̂ML) − p = −(1/2) AIC, (7.14)
where we used properties of the trace operation (see Appendix B.1.1) in (7.12). In (7.13) we replaced the observed score “covariance” matrix Si(θ̂i) Si(θ*i)^⊤ with the expected one, H1, arguing again with the convergence of θ*i and θ̂i to θ0. This approximation is the same as in the AIC derivation, and we again replaced the trace by the dimension p in (7.14) to arrive at the scaled AIC.
The Bayesian information criterion, BIC = −2 l(θ̂ML) + p log(n), is frequently used, where n denotes the size of the sample. Half of the negative BIC is also known as the Schwarz criterion. It has the same orientation as AIC, such that models with smaller BIC are preferred. It generally (i.e. if log(n) ≥ 2, that is n ≥ 8) penalises model complexity more distinctly than AIC. A derivation of BIC is outlined in Sect. 7.2.2.
Example 7.5 (Analysis of survival times) We can also use BIC to compare the
three survival models. Since we have n = 94 observations, we get p log(n) = 4.54p
as the penalty term instead of 2p for the AIC, which we used in Example 7.2. We
obtain the BIC values 852.5919, 857.0953 and 857.0960 for models M1 , M2 and
M3 , respectively. Now the difference between the more complex models M2 , M3
and the simpler model M1 is larger due to the larger penalty factor. However, the
conclusion remains unchanged.
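The BIC values can be verified in the same way as the AIC values, using the maximised log-likelihoods from Example 7.1 (again in Python):

```python
import math

n = 94  # number of observations

# BIC = -2 * maximised log-likelihood + p * log(n)
models = {
    "exponential (M1)": (-424.0243, 1),
    "Weibull (M2)":     (-424.0043, 2),
    "gamma (M3)":       (-424.0047, 2),
}
bic = {name: -2 * ll + p * math.log(n) for name, (ll, p) in models.items()}
for name, value in bic.items():
    print(name, round(value, 2))
# The exponential model again has the smallest value, about 852.59,
# versus about 857.10 for the two-parameter models.
```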
BIC is very similar to AIC in that it is composed of the model fit measured
by the maximised log-likelihood l(θ̂ ML ) and a penalty term incorporating the model
complexity measured by the number of parameters p. However, the penalty is higher
in BIC, where p is multiplied by log(n) instead of the factor 2 in the AIC. The
question is now how this affects the statistical properties of AIC and BIC.
As in the derivation of the AIC, let us assume that none of the models
M1 , . . . , MK under study corresponds to the true model g(x) that generated the
data. That means that we cannot pick a model Mk such that f (x; θ , Mk ) = g(x) for
some parameter value θ . Therefore, we would like to pick a model that is closest
to the truth in terms of the Kullback–Leibler discrepancy. If we use AIC or BIC to
guide this decision, the probability that we pick the model reaching the minimum
Kullback–Leibler discrepancy goes to one with increasing sample size n.
However, there might be multiple models that are closest to the truth. In that case,
we would like to pick the model that is most parsimonious, i.e. has the smallest number of parameters p among the closest models. If we use BIC for model selection, the probability that we pick the closest and most parsimonious model also goes to one with increasing sample size n. In contrast, the AIC does not guarantee this kind of model selection consistency: it may select models that are too complex. This overfitting tendency of the AIC is due to the fact that its penalty term does not depend on the sample size n. For BIC, the complexity penalty increases with growing n, which guards the criterion against overfitting.
The posterior odds Pr(M1 | x)/Pr(M2 | x) can hence be written as the product
of the so-called Bayes factor BF12 = f (x | M1 )/f (x | M2 ) and the prior odds
Pr(M1 )/ Pr(M2 ):
Pr(M1 | x) / Pr(M2 | x) = { f(x | M1) / f(x | M2) } · { Pr(M1) / Pr(M2) }.
The Bayes factor can therefore be interpreted as the ratio of the posterior odds of M1 to the prior odds of M1: the data x have increased the probability of M1 if BF12 > 1, and decreased it if BF12 < 1.
The Bayes factor is identical to the likelihood ratio if the models M1 and M2 are
completely specified, i.e. do not contain unknown parameters. Otherwise, the prior
predictive distribution
f(x | Mi) = ∫ f(x | θi, Mi) · f(θi | Mi) dθi, i = 1, 2, (7.15)
has to be evaluated at the observed data x, where θ i is the unknown parameter vector
in model Mi . This value is called the marginal likelihood of the model Mi .
Marginal likelihood
The marginal likelihood f (x | M) of a model M is the value of the prior pre-
dictive distribution at the observed data x.
For discrete data x, the marginal likelihood can therefore be interpreted as the prob-
ability of the data for a given model Mi . The marginal likelihood f (x | Mi ) can be
derived by integration with respect to the prior distribution from the ordinary like-
lihood f (x | θ i , Mi ). Note that the prior distribution f (θ i | Mi ) cannot be improper
since otherwise f (x | Mi ) would be indeterminate.
The term Bayes factor was coined by Irving John Good (1916–2009). For numerical reasons, it is often the logarithm of the marginal likelihood or of the Bayes factor that is calculated. For the interpretation of Bayes factors it is common
7.2 Bayesian Model Selection 233
to use equidistant thresholds on the logarithmic scale, which form the basis of the
categories shown in Fig. 7.1.
Bayes factor
The Bayes factor BF12 is the ratio of marginal likelihoods of two models M1
and M2 .
The integration in (7.15) can be avoided in conjugate families. The reason for this is that f(x) = ∫ f(x | θ) f(θ) dθ (we leave out the additional conditioning on Mi for readability) appears as the denominator in the posterior distribution
f(θ | x) = f(x | θ) f(θ) / f(x),
cf. Example 6.7. The marginal likelihood f(x) is hence easily calculated through

f(x) = f(x | υ) f(υ) / f(υ | x),

which holds for any value υ of the parameter θ.
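This identity is also a convenient numerical check. A sketch (in Python) for a hypothetical beta–binomial example: with a uniform Be(1, 1) prior, the marginal likelihood of x successes in n binomial trials is 1/(n + 1), and the identity reproduces this value for every choice of υ:

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def beta_pdf(u, a, b):
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * u ** (a - 1) * (1 - u) ** (b - 1))

x, n = 7, 10  # hypothetical data: 7 successes in 10 trials

def marginal(u):
    # f(x) = f(x | u) f(u) / f(u | x), for any value u in (0, 1)
    prior = beta_pdf(u, 1, 1)             # uniform Be(1, 1) prior
    post = beta_pdf(u, 1 + x, 1 + n - x)  # conjugate posterior
    return binom_pmf(x, n, u) * prior / post

# The same value results for every u; analytically it is 1 / (n + 1):
print(marginal(0.3), marginal(0.7), 1 / (n + 1))
```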
Example 7.7 (Blood alcohol concentration) In Example 6.9 we have conducted in-
ference for the overall mean μ of the observed blood alcohol transformation factors,
where the corresponding variance σ 2 was assumed to be known. Now we would like
to compare both genders and calculate the posterior probability that the mean trans-
formation factor for women (μ1 ) is different than the one for men (μ2 ). To this end,
we compare two models, the first one having a single mean for the population and
the second one having gender-specific means.
Initially, we suppose that all observations, regardless of the subgroup they belong
to, are independent realisations of a normal distribution with unknown expected
value μ. We denote this model of no differences between subgroups as M1 . For
simplicity, we assume that the estimated standard deviation of the underlying normal distribution is known, i.e. κ^{−1/2} = 237.8; but conceptually the procedure is the same in the case of unknown variance, see Exercise 3. As a prior for μ, we choose the conjugate normal distribution with expected value ν = 2000 and standard deviation δ^{−1/2} = 200. This means that we expect a priori with 95 % probability an average transformation factor between 1600 and 2400. The marginal likelihood in this model
can be explicitly calculated as
f(x1:n | M1) = { κ/(2π) }^{n/2} { δ/(nκ + δ) }^{1/2} exp[ −(κ/2) { Σ_{i=1}^n (xi − x̄)² + nδ/(nκ + δ) · (x̄ − ν)² } ]. (7.18)
For the available data, we obtain a log marginal likelihood value of log f (x | M1 ) =
−1279.14.
We would like to compare the model M1 with a model allowing different ex-
pected values in the different subgroups. In this model M2 , the data are partitioned
into the two gender groups, and in both groups a conjugate prior N(ν, δ −1 ) for μ1
and μ2 is used. The marginal likelihood for each group is then calculated anal-
ogously to above, where of course only the data belonging to each subgroup are
included in the calculation. If we suppose furthermore that μ1 and μ2 are a pri-
ori independent, then the marginal likelihood of the model M2 is given as the
product of the marginal likelihood values within both subgroups. For the available
data, we obtain the value log f (x | M2 ) = −1276.11, i.e. a higher value than for
model M1 . The Bayes factor of the model M2 compared to the model M1 is hence
BF21 = exp{−1276.11 − (−1279.14)} = 20.6, strongly indicating that the model
M2 better describes the data. In other words, there is strong evidence for a differ-
ence in the average transformation factor between genders. Assigning both models
the same prior probability Pr(M1 ) = Pr(M2 ) = 0.5, we obtain the posterior proba-
bilities Pr(M1 | x) = 0.0463 and Pr(M2 | x) = 0.9537.
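The step from log marginal likelihoods to posterior model probabilities is a one-liner; a quick check (in Python) with the values from the example and equal prior probabilities:

```python
import math

log_m1 = -1279.14  # log marginal likelihood of M1
log_m2 = -1276.11  # log marginal likelihood of M2

bf21 = math.exp(log_m2 - log_m1)
post_m2 = bf21 / (1 + bf21)  # Pr(M2 | x) under Pr(M1) = Pr(M2) = 0.5
post_m1 = 1 - post_m2
print(round(bf21, 1), round(post_m1, 4), round(post_m2, 4))
# Close to the values 20.6, 0.0463 and 0.9537 reported in the text
# (small differences stem from rounding the log marginal likelihoods).
```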
In this example, the posterior probabilities are ordered according to the number
of unknown parameters in the model, i.e. higher flexibility in the model is rewarded
with higher posterior probability. However, we emphasise that Bayesian model se-
lection does not automatically favour the more complex model, see Example 7.8.
This aspect will be discussed in more detail in Sect. 7.2.2.
Bayesian model selection strongly depends on the prior distribution for the
model parameters. While for Bayesian estimation of parameters the influence of
the prior distribution is asymptotically negligible for large samples, this is not true
for Bayesian model selection. We already saw that the definition of the marginal
likelihood does not allow improper priors. Of course, one could use proper but very
vague (i.e. with large variance) priors that are still integrable. However, such an approach should be discouraged, since it can be shown that, with ever-increasing prior variance, the posterior probability of the simplest model will always converge to 1, regardless of the information in the data. This phenomenon is known as Lindley's paradox and prohibits the use of vague prior distributions in Bayesian model selection.
Example 7.8 (Blood alcohol concentration) If we repeat the analysis from Example 7.7 using a prior variance of δ^{−1} = 10^{10}, then the simpler model M1 has the posterior probability 0.8360. Using δ^{−1} = 10^{100}, the posterior probability of M1 is effectively equal to 1.
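Lindley's paradox can be reproduced in a small sketch using the marginal likelihood formula (7.18): as the prior variance δ^{−1} grows, the extra prior factor penalises the two-mean model twice, and the one-mean model eventually receives almost all posterior probability. The data summaries below are hypothetical (not the alcohol data), with two groups that clearly differ in mean:

```python
import math

def log_marg(n, xbar, ss, nu, delta, kappa):
    # Log marginal likelihood of the normal model with known precision kappa
    # and conjugate N(nu, 1/delta) prior for the mean, cf. Eq. (7.18);
    # ss is the sum of squares around xbar.
    return (n / 2 * math.log(kappa / (2 * math.pi))
            + 0.5 * math.log(delta / (n * kappa + delta))
            - kappa / 2 * (ss + n * delta / (n * kappa + delta) * (xbar - nu) ** 2))

kappa, nu = 1.0, 0.0
# hypothetical group summaries: sizes, means, within-group sums of squares
n1, xbar1, ss1 = 20, -1.0, 18.0
n2, xbar2, ss2 = 20, 1.0, 22.0
n, xbar = n1 + n2, 0.0
ss = ss1 + ss2 + n1 * (xbar1 - xbar) ** 2 + n2 * (xbar2 - xbar) ** 2

def post_prob_M1(delta):
    # posterior probability of the one-mean model under equal prior odds
    lm1 = log_marg(n, xbar, ss, nu, delta, kappa)
    lm2 = (log_marg(n1, xbar1, ss1, nu, delta, kappa)
           + log_marg(n2, xbar2, ss2, nu, delta, kappa))
    bf21 = math.exp(lm2 - lm1)
    return 1 / (1 + bf21)

print(post_prob_M1(1.0))    # moderate prior variance: M2 is strongly favoured
print(post_prob_M1(1e-30))  # extremely vague prior: M1 gets almost all probability
```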
where k(θ ) = −{log f (x1:n | θ ) + log f (θ )}/n and −nk(θ ) is the non-normalised
log-posterior density. To this representation we can apply the Laplace approxima-
tion (see Appendix C.2.2). The minimum θ̃ of k(θ ) is a maximum of −nk(θ) and
hence equal to the posterior mode. Denoting the Hessian of k at the point θ̃ by K
(see Appendix B.2.2), we obtain the following approximation for the log-marginal
likelihood:
" p #
2π 2 1
log f (x1:n ) ≈ log |K|− 2 exp −nk(θ̃ )
n
p p 1
= log(2π) − log(n) − log|K| + log f (x1:n | θ̃ ) + log f (θ̃ ).
2 2 2
The terms (p/2) log(2π) and log f(θ̃) can be neglected if the sample size n is large. Further, one can show that the determinant |K| of the p × p Hessian is bounded from above by a constant and is hence also negligible. The contribution of the prior to the posterior is also small for large n, so we can replace θ̃ with the MLE θ̂ML (see Sect. 6.6.2). Combining these approximations, we obtain the Bayesian information criterion of Sect. 7.1.3:

log f(x1:n) ≈ log f(x1:n | θ̂ML) − (p/2) log(n) = −(1/2) BIC.
The error of this approximation is of order O(1) under certain regularity conditions (see Appendix B.2.6), so it is constant for increasing sample size. However, the values of the log marginal likelihood and of the BIC increase with larger sample size, so the relative error decreases with increasing sample size n.
Hence, in the same way as the prior becomes less important compared to the likeli-
hood in its contribution to the posterior, so does the BIC approximation improve for
increasing sample sizes. Therefore, selecting the model with smallest BIC is indeed
asymptotically equivalent to selecting the maximum a posteriori (MAP) model with
the largest posterior model probability.
We note that exp(−BIC/2) can be interpreted as an approximate marginal likeli-
hood, from which posterior probabilities can be calculated. Moreover, half the dif-
ference of the BIC values of two different models is an approximation of the log
Bayes factor. The Bayes factor, as a ratio of two marginal likelihoods, incorporates
the parameter priors, which even for large sample sizes have a non-negligible in-
fluence on the marginal likelihood. However, it is important to recall that in the
calculation of BIC these priors do not enter. Posterior probabilities derived from
BIC values will therefore in general only be rough approximations of those derived
from full Bayesian approaches. In the following example the agreement is rather
good, though.
We noted above that the accuracy of the BIC approximation to the log marginal likelihood is of order O(1) in general. For a specific choice of parameter prior, the accuracy is higher: using the so-called unit information prior, we have −2 log f(x) − BIC = O(n^{−1/2}), so that the approximation error gets smaller for increasing sample size. The unit information prior
θ ∼ Np( θ0, J1(θ0)^{-1} ),
contains as much information as one unit of the data, quantified by the expected unit
Fisher information J 1 (θ 0 ) at the prior mean θ 0 . This approach can be extended to
the case where the models to be compared have a common nuisance parameter.
Example 7.10 (Normal model) Assume the normal model N(μ, σ 2 ) with unknown
mean μ and known variance σ 2 for the random sample X1:n , as we did in Exam-
ple 6.8. From Example 2.9 we know that the expected unit Fisher information is
J1(μ) = 1/σ². Hence, the unit information prior is

μ ∼ N(μ0, σ²)
and thus has the same variance as the likelihood, to which it is also conjugate. The
log marginal likelihood can easily be derived from Eq. (7.18):
log f(x1:n) = (n/2) log{ κ/(2π) } − (1/2) log(n + 1) − (κ/2) { Σ_{i=1}^n (xi − x̄)² + n/(n + 1) · (x̄ − μ0)² },

where κ = 1/σ² denotes the known precision.
Now, as n → ∞, we have (n + 1)/n → 1, and thus the log term goes to zero.
Moreover, κn/(n + 1) goes to κ. Furthermore, we need the assumption that μ̂ML
converges to μ0 at an appropriate rate, which holds if μ0 is the true mean parameter
(this assumption can also be relaxed for the alternative model). Then (x̄ − μ0 )2 will
also be small for large enough sample size n.
The deviance information criterion DIC = −2 l(θ̄) + 2 pD combines the fit at the posterior mean with a complexity penalty, where θ̄ = E(θ | y) is the posterior mean of the parameter vector, and pD is an estimate of the effective number of parameters in the model. Note that this resembles very much the AIC definition in (7.1). pD is defined as the posterior expected deviance
pD = E{ D(θ, θ̄) | y } = ∫ D(θ, θ̄) f(θ | y) dθ,
where

D(θ, θ̄) = 2{ l(θ̄) − l(θ) }
is the deviance of θ versus the point estimate θ̄. While analytic computation of DIC is rarely possible, it can easily be approximated if parameter samples θ^{(1)}, …, θ^{(B)} from the posterior are available. Then θ̄ is approximated by the average of the samples, θ̄ ≈ B^{-1} Σ_{b=1}^B θ^{(b)}, and pD is approximated by the average of the sampled deviances, pD ≈ B^{-1} Σ_{b=1}^B D(θ^{(b)}, θ̄). See Sect. 8.3 for more details.
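This Monte Carlo approximation can be sketched (in Python) for a hypothetical beta–binomial example, where posterior samples are directly available; all numbers here are made up for illustration:

```python
import math
import random

random.seed(1)

# hypothetical data: x successes in n Bernoulli trials
x, n = 30, 100

def loglik(p):
    # binomial log-likelihood kernel (the constant cancels in the deviance)
    return x * math.log(p) + (n - x) * math.log(1 - p)

# Under a uniform Be(1, 1) prior the posterior is Be(1 + x, 1 + n - x).
B = 20000
samples = [random.betavariate(1 + x, 1 + n - x) for _ in range(B)]

p_bar = sum(samples) / B  # Monte Carlo estimate of the posterior mean
deviances = [2 * (loglik(p_bar) - loglik(p)) for p in samples]
p_D = sum(deviances) / B  # effective number of parameters, close to 1 here
DIC = -2 * loglik(p_bar) + 2 * p_D
print(p_D, DIC)
```

For this one-parameter model with little prior information, pD comes out close to the nominal dimension 1, in line with the discussion below.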
The proximity of DIC to AIC can be established as follows. Using a second-order
Taylor expansion of D(θ , θ̄ ) in θ around θ̄ as e.g. in (5.4) in Sect. 5.1, we obtain the
approximation
D(θ, θ̄) ≈ (θ − θ̄)^⊤ I(θ̄)(θ − θ̄).

Using the asymptotic normality of the posterior from Sect. 6.6.2, we have θ | y ∼ Np(θ̄, I(θ̄)^{-1}) approximately for large sample sizes. Therefore, we have

(θ − θ̄)^⊤ I(θ̄)(θ − θ̄) →D U^⊤ G1^{-1} U.
This is analogous to (7.7) in the AIC derivation. From the AIC derivation we know that the expected value of the right-hand side is p* = tr(G1^{-1} H1). Hence, pD ≈ p* and DIC ≈ AIC for large sample sizes under the regularity conditions in Sect. 6.6.
derived in Example 2.7. We choose a uniform prior for υ, i.e. υ ∼ Be(1, 1), which results in the posterior υ | x ∼ Be(1 + 2x1 + x2, 1 + x2 + 2x3).
In Example 6.20 we have seen that the Dirichlet prior is conjugate to the trinomial likelihood, which has log-likelihood

l(π) = x1 log(π1) + x2 log(π2) + x3 log(π3),
cf. Example 5.2. Again we choose a uniform prior on the probability simplex, i.e.
π ∼ D3 (1, 1, 1). This results in the posterior
π | x ∼ D3 (1 + x1 , 1 + x2 , 1 + x3 ).
In Example 8.8 we will compute the DIC values for these two models using
samples from the posterior distributions. Note that the posterior expectations are
analytically available because we are working with conjugate priors. Moreover, the
same log-likelihood constants must be used in both models: If we directly used the
log-likelihood kernel from above for the Hardy–Weinberg model, then we would
miss the constant x2 log(2), which is incorporated in the trinomial log-likelihood.
Then the DIC values would no longer be on the same scale.
The resulting values, 1510.33 for the Hardy–Weinberg model and 1510.37 for
the trinomial model, are very close to the AIC values determined in Example 7.4.
The reason is that the uniform prior we have used carries only little information compared to the information from the data. Therefore, the posterior expectations of the
parameters are almost identical to the MLEs. Also, the estimated numbers of pa-
rameters (pD = 0.99 and 1.99 for the two models) are almost identical to the true
numbers of parameters (1 and 2) for both models.
Pr(Mk | x) = Pr(x | Mk) Pr(Mk) / Σ_{j=1}^K Pr(x | Mj) Pr(Mj).
Note that the “parameter” θ , which is in fact the model here, is discrete, so that we
do not have to work with an ε > 0 for defining the zero–one loss function.
The implicit use of the zero–one loss function for deciding on the optimal model
choice may be inappropriate if the models are close in their description of the data
generating process. More application-specific loss functions could be constructed,
which might lead to another model being chosen as the Bayes estimate. We do not
proceed further here with this discussion, but describe an alternative approach for
using the posterior model probabilities. Especially in situations where the posterior model probabilities are similar in size, so that no clear “winner” model arises, choosing the MAP model and discarding the other models appears inappropriate. Do we really need to select a single model at all? What is the
ultimate goal of the statistical analysis? Often an unknown quantity, say λ = h(θ ),
is the object of interest, and uncertainty statements about λ are the ultimate goal
of the statistical analysis. In that case, we can simply calculate the marginal pos-
terior distribution of λ, which is a discrete mixture of K model-specific posterior
distributions with weights given by the posterior model probabilities:
f(λ | x) = Σ_{k=1}^K f(λ | x, Mk) · Pr(Mk | x).
This is a Bayesian model average. It fully takes into account the model uncertainty
in the estimation of λ.
We can easily compute the model-averaged posterior expectation and variance
of λ, using the law of iterated expectations and the law of total variance from Ap-
pendix A.3.4, respectively. First, we have
E(λ | x) = E{ E(λ | x, M) } = Σ_{k=1}^K E(λ | x, Mk) Pr(Mk | x),
where the outer expectation is with respect to the distribution of M given x, i.e. the
posterior model distribution. Second, by the law of total variance, we have

Var(λ | x) = E{ Var(λ | x, M) } + Var{ E(λ | x, M) },

where again the outer moments refer to the posterior model distribution.
So we can easily compute the model-averaged central moments for the quantity
λ from the model-specific expectations E(λ | x, Mk ) and variances Var(λ | x, Mk ),
using the weights Pr(Mk | x) from the model average.
Example 7.12 (Blood alcohol concentration) In Example 7.7 the model M2 was determined as the MAP model. The primary interest in this example, though, is
the average transformation factor μi in the two gender groups i = 1, 2.

Table 7.3 Estimated mean transformation factors in the alcohol concentration data. The models are M1: no difference between genders and M2: difference between genders

Gender   Transformation factor within model
         M1        M2        Model-averaged
Female   2445.8    2305.5    2311.9
Male     2445.8    2473.1    2471.9

For example, in the model M2 the posterior mean transformation factor is μ̂1 = 2305.5 for
females. The estimates of the average transformation factors for both models are
given in Table 7.3. Additionally, the model-averaged estimates are given as well.
They are quite similar to the estimates obtained under the MAP-model M2 because
of its large posterior probability. The influence of the model M1 is negligible be-
cause it has a small posterior probability.
The perspective we considered so far, which assumes that the true model is con-
tained in our chosen model space M1 , . . . , MK , is known as "M-closed". It is a rather
unrealistic perspective because the real world is almost always much more complex
than our simple statistical models. While often the most important and prevailing
features can be captured quite well with these models and we can proceed as if the
truth was included in the model space, sometimes this may not be adequate. For ex-
ample, we might have a more complicated model, say Mt , in which we believe,
but need to restrict ourselves for some reason to simpler models M1 , . . . , MK . This
is the “M-completed” perspective, under which we might evaluate all simple mod-
els in the light of our belief model Mt . The computations are more intricate because
the expected losses of the possible model choices must be evaluated with respect to
the model Mt . Usually, this does not allow for closed-form solutions for the Bayes
estimates. The third case is the “M-open” perspective, which does not assume any
belief model at all. This cautious view leads to cross-validation as the only way to
evaluate the models.
While Bayesian model averaging arises naturally from integrating out the model
from the posterior distribution, there are also proposals for frequentist model aver-
aging. The weights for these frequentist model averages are usually derived from
information criteria as for example AIC or BIC. For BIC, the rationale is due to the
fact that BIC is an approximation of twice the log marginal likelihood of a model,
see Sect. 7.2.2. Hence, we have
f (x | Mk ) ≈ exp(−BICk /2),
where BICk denotes the BIC for the model Mk . Assuming a flat prior on the model
space, i.e. Pr(Mk ) = 1/K, the weight wk for the model Mk is its resulting approxi-
mate posterior model probability
wk = exp(−BICk/2) / ∑_{j=1}^K exp(−BICj/2).
There are proposals to replace BIC with AIC for calculating the weights wk . Other
information criteria can be used as well in principle. However, the analogy to
Bayesian model averaging is lost, but frequentist properties of the resulting esti-
mators can still be studied.
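As a small sketch with made-up BIC values, these weights can be computed in a numerically stable way by subtracting the smallest BIC before exponentiating, which leaves the weights unchanged but avoids underflow:

```r
## hypothetical BIC values for K = 3 candidate models
bic <- c(1510.3, 1512.9, 1518.4)
## shifting by the minimum does not change the weights,
## but prevents underflow of exp(-BIC/2) for large BIC values
delta <- bic - min(bic)
w <- exp(-delta / 2) / sum(exp(-delta / 2))
round(w, 3)
```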
7.3 Exercises
is often used to assess the fit of a regression model. Here SS = ∑_{i=1}^n (yi −
μ̂i)² is the residual sum of squares, and σ̂²ML is the MLE of the variance
σ². How does AIC relate to Cp ?
(c) Now assume that σ² is unknown as well. Show that AIC is given by

AIC = n log(σ̂²ML) + 2p + n + 2.
3. Repeat the analysis of Example 7.7 with unknown variance κ −1 using the con-
jugate normal-gamma distribution (see Example 6.21) as a prior distribution
for κ and μ.
(a) First, calculate the marginal likelihood of the model by using the rear-
rangement of Bayes’ theorem in (7.16).
(b) Next, calculate explicitly the posterior probabilities of the two (a priori
equally probable) models M1 and M2 using an NG(2000, 5, 1, 50 000)
distribution as a prior for κ and μ.
(c) Evaluate the behaviour of the posterior probabilities depending on vary-
ing parameters of the prior normal-gamma distribution.
4. Let X1:n be a random sample from a normal distribution with expected value μ
and known variance κ −1 , for which we want to compare two models. In the first
model (M1 ) the parameter μ is fixed to μ = μ0 . In the second model (M2 ) we
suppose that the parameter μ is unknown with prior distribution μ ∼ N(ν, δ −1 ),
where ν and δ are fixed.
(a) Determine analytically the Bayes factor BF12 of model M1 compared to
model M2 .
(b) As an example, calculate the Bayes factor for the centred alcohol con-
centration data using μ0 = 0, ν = 0 and δ = 1/100.
244 7 Model Selection
(c) Show that the Bayes factor tends to ∞ as δ → 0 irrespective of the data
and the sample size n.
5. In order to compare the models
M0 : X ∼ N(0, σ²) and
M1 : X ∼ N(μ, σ²)

M0 : p ∼ U(0, 1) and
M1 : p ∼ Be(θ, 1),
where 0 < θ < 1. This scenario aims to reflect the distribution of a two-sided
P -value p under the null hypothesis (M0 ) and some alternative hypothesis
(M1 ), where smaller P -values are more likely (Sellke et al. 2001). This is cap-
tured by the decreasing density of the Be(θ, 1) for 0 < θ < 1. Note that the data
are now represented by the P -value.
(a) Show that the Bayes factor for M0 versus M1 is

BF(p) = { ∫₀¹ θ p^(θ−1) f(θ) dθ }^(−1)
Pr{f(X) ≤ f(xo)},

(xo − ν)²/(σ² + τ²).
7.4 References
Davison (2003, Sect. 4.7) gives a good overview of different aspects of model selec-
tion. A more detailed exposition can be found in Claeskens and Hjort (2008) and Burn-
ham and Anderson (2002). The original references for AIC and BIC are Akaike
(1974) and Schwarz (1978), respectively, and the unit information prior interpre-
tation of BIC is described in Kass and Wasserman (1995). Stone (1977) described
the relationship between cross-validation and AIC. A nice overview article on Bayesian
model selection is Kass and Raftery (1995). Early contributions on minimum Bayes
factors are Edwards et al. (1963) and Berger and Sellke (1987). Different perspec-
tives on the notion of a “true” model are discussed in Bernardo and Smith (2000,
Sect. 6.1.2). DIC has been proposed by Spiegelhalter et al. (2002), and frequentist
model averaging with AIC weights by Buckland et al. (1997).
8  Numerical Methods for Bayesian Inference
Contents
8.1 Standard Numerical Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
8.2 Laplace Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
8.3 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
8.3.1 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
8.3.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
8.3.3 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8.4 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
8.5 Numerical Calculation of the Marginal Likelihood . . . . . . . . . . . . . . . . . 279
8.5.1 Calculation Through Numerical Integration . . . . . . . . . . . . . . . . . 279
8.5.2 Monte Carlo Estimation of the Marginal Likelihood . . . . . . . . . . . . . 280
8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
8.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 247
https://doi.org/10.1007/978-3-662-60792-3_8,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
It is often convenient to first derive the posterior mode Mod(θ | x) based on max-
imisation of the unnormalised posterior density L(θ ) · f (θ ) or, equivalently but nu-
merically preferable, the unnormalised log-posterior density l(θ ) + log f (θ ). It is
advisable (especially for application of the integration routines below) to allow vec-
tors of parameter values as input for this function. We can easily achieve this with
the Vectorize function in R:
log.unnorm.posterior <- function(theta, data)
  log.likelihood(theta, data) + log.prior(theta)
log.unnorm.posterior <- Vectorize(log.unnorm.posterior, "theta")

result.opt <- optimize(log.unnorm.posterior, maximum = TRUE, data = ...,
                       lower = ..., upper = ...)
post.mode <- result.opt$maximum
ordinate <- result.opt$objective
The ordinate at the posterior mode can be used to scale the unnormalised log-
posterior such that it attains zero at the posterior mode and is negative everywhere
else. This typically stabilises the numerical integration necessary to obtain the nor-
malised posterior:
unnorm.posterior <- function(theta, data)
  exp(log.unnorm.posterior(theta, data) - ordinate)
norm.const <- integrate(unnorm.posterior, data = ...,
                        lower = ..., upper = ...)$value
norm.posterior <- function(theta, data)
  unnorm.posterior(theta, data) / norm.const
Note that the limits lower and upper must span the whole support of the pos-
terior. The posterior mean can of course also be obtained via numerical integration:
post.mean <- integrate(function(theta) theta * norm.posterior(theta, data = ...),
                       lower = ..., upper = ...)$value
The posterior distribution function and its inverse, the quantile function, can be
implemented in the following scheme:
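The listing for this scheme is not reproduced in this extract. As an illustration of the idea with a known normalised posterior (we use a Be(4.5, 1.5) density, a choice of ours, so the result can be checked against qbeta), the cdf is obtained by integrating the density, and the quantile function by root finding on the cdf:

```r
## toy normalised posterior: a Be(4.5, 1.5) density
norm.posterior <- function(theta) dbeta(theta, 4.5, 1.5)

## posterior cdf via numerical integration
post.cdf <- function(x)
  integrate(norm.posterior, lower = 0, upper = x)$value

## quantile function via root finding, staying away from the boundaries
eps <- 1e-10
post.quantile <- function(q)
  uniroot(function(x) post.cdf(x) - q,
          lower = eps, upper = 1 - eps)$root

post.quantile(0.5)  ## matches qbeta(0.5, 4.5, 1.5) up to numerical error
```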
The quantile function can then be used to compute the posterior median and
equal-tailed credible intervals, e.g.:
post.median <- post.quantile(0.5, data = ...)
post.ci <- function(gamma, data)
  c(post.quantile((1 - gamma) / 2, data),
    post.quantile((1 + gamma) / 2, data))
post.95ci <- post.ci(0.95, data = ...)
For the calculation of HPD intervals, see Example 8.9 for an illustration.
The following two examples illustrate the application in the context of the colon
cancer screening example, see Sect. 1.1.5.
Example 8.1 (Screening for colon cancer) It is not obvious if there is a conjugate
prior distribution for the truncated binomial likelihood from Example 2.12. The log-
likelihood was given in (2.7), and the data are N = 6, n = 196, Z1 = 37, Z2 = 22,
Z3 = 25, Z4 = 29, Z5 = 34 and Z6 = 49. An R implementation of the log-likelihood
function is taken from Example 2.12:
## Truncated binomial log-likelihood function
## pi:   the parameter, the probability of a positive test result
## data: vector with counts Z_1, ..., Z_N
log.likelihood <- function(pi, data)
{
  n <- sum(data)
  k <- length(data)
  vec <- seq_len(k)
  result <-
    sum(data * (vec * log(pi) + (k - vec) * log(1 - pi))) -
    n * log(1 - (1 - pi)^k)
  return(result)
}
log.likelihood <- Vectorize(log.likelihood, "pi")
If we choose the beta prior Be(0.5, 0.5) for the sensitivity π and combine it with
the likelihood, we first obtain the unnormalised log posterior:
log.prior <- function(pi)
  dbeta(pi, 0.5, 0.5, log = TRUE)
log.unnorm.posterior <- function(pi, data)
  log.likelihood(pi, data) + log.prior(pi)
Following further our general recipe from above, we can calculate the posterior
mean and median of π :
## posterior mean calculation as in the pseudo code:
post.mean <- integrate(function(pi) pi * norm.posterior(pi, data = counts),
                       lower = 0, upper = 1)$value
post.mean
[1] 0.6239458
## likewise for the cdf and quantile functions, and hence the median:
post.cdf <- function(x, data)
  integrate(norm.posterior, data = data,
            lower = 0, upper = x)$value
## numerical problems occur if we go exactly to the boundaries here,
## therefore go away some small epsilon:
eps <- 1e-10
post.quantile <- function(q, data)
  uniroot(function(x) post.cdf(x, data) - q,
          lower = 0 + eps, upper = 1 - eps)$root
post.median <- post.quantile(0.5, data = counts)
post.median
[1] 0.6240066
We find that the posterior distribution is nearly symmetric around the posterior
mean 0.6239 and median 0.6240, respectively. The posterior mode 0.6242 is very
close to the MLE derived in Example 2.12. To obtain the equal-tailed 95 % credible
interval, we code:
post.ci <- function(gamma, data)
  c(post.quantile((1 - gamma) / 2, data),
    post.quantile((1 + gamma) / 2, data))
post.95ci <- post.ci(0.95, data = counts)
post.95ci
[1] 0.5956939 0.6517494
Med(ξ | x) = {1 − Med(π | x)}^N = (1 − 0.6240)^6 = 0.0028.
We will now consider an example with two unknown parameters. Here numerical
integration is necessary to derive the marginal posterior distributions. Since the R
code follows the same scheme as in the example above, we will only show some
parts of it.
Example 8.2 (Screening for colon cancer) Assuming two independent Be(1, 1)
priors, i.e. uniform prior distributions for the parameters μ and ρ of the beta-
binomial likelihood in Example 5.10, the joint posterior density is shown in Fig. 8.2.
Computation is of course based on Bayes' theorem

f(θ | x) = f(x | θ) f(θ) / f(x)

with θ = (μ, ρ)⊤. The denominator f(x) can be calculated using two-dimensional
numerical integration over Θ = [0, 1] × [0, 1] (for example, using the package
cubature in R, cf. Appendix C.2.1):
f(x) = ∫_Θ f(x, θ) dθ = ∫_Θ f(x | θ) f(θ) dθ.     (8.1)

The marginal posterior densities

f(μ | x) = ∫₀¹ f(μ, ρ | x) dρ   and   f(ρ | x) = ∫₀¹ f(μ, ρ | x) dμ
can be calculated with a second numerical integration and are shown in Fig. 8.3. For
example, the marginal posterior density of ρ can be calculated at grid points (which
are stored in the vector rgrid) with the following code:
posterior.r <- numeric(length(rgrid))
for (j in seq_along(rgrid))
{
  posterior.r[j] <- integrate(posterior.norm.mu, myr = rgrid[j],
                              norm = norm, counts = data,
                              lower = 0, upper = 1,
                              rel.tol = 1e-6)[["value"]]
}
Let θ̂ and θ̂g denote the locations of the minima of k(θ ) and kg (θ ), respectively, i.e.
the values where the terms −nk(θ ) and −nkg (θ ) are maximal. Further, let
κ̂ = d²k(θ̂)/dθ²   and   κ̂g = d²kg(θ̂g)/dθ²
denote the curvatures of k(θ ) and kg (θ ), respectively, at the corresponding minima.
Separate application of the Laplace approximation (see Appendix C.2.2) to both the
numerator and denominator gives
E{g(θ) | x1:n} ≈ √(κ̂/κ̂g) · exp[−n{kg(θ̂g) − k(θ̂)}].     (8.6)
Although the approximation error of the Laplace approximation for the two integrals
in (8.4) is of order O(n⁻¹), the leading terms in the two errors cancel in (8.6). As a
result, the error of the Laplace approximation of the posterior mean is only of order
O(n⁻²). Similar results can be obtained for the posterior variance.
We note that the Laplace approximation (8.6) is not invariant to changes in the
parametrisation chosen for the likelihood and prior density. For example, suppose
we reparametrise θ one-to-one to φ = h(θ), i.e. θ = h⁻¹(φ). Applying the substitution
rule for integration to both numerator and denominator of (8.3), we obtain

E{g(θ) | x1:n} = ∫ g{h⁻¹(φ)} f{x1:n | h⁻¹(φ)} f(φ) dφ / ∫ f{x1:n | h⁻¹(φ)} f(φ) dφ,
where f (φ) is the appropriate, Jacobian adjusted, prior density of φ, cf. Ap-
pendix A.2.3. If h(θ ) is well selected, the Laplace approximation may become more
accurate. A possible candidate for h(θ ) is the variance-stabilising transformation, if
available. See Example 8.3 for further illustration.
We note that a formula similar to (8.6) can be obtained for an unknown parameter
vector θ by replacing κ̂ and κ̂g with the determinants of the corresponding Hessian
matrices (cf. Appendix B.2.2). As before, g must be a positive real-valued function.
with derivatives

κ̂ = d²k(π̂)/dπ²
  = −(1/n) · (1 − n)π̂(1 − π̂)/{π̂(1 − π̂)}²
  = {(n − 1)/n} · {π̂(1 − π̂)}⁻¹
  = (n − 1)³/{n(x − 0.5)(n − x − 0.5)},
and similarly we obtain κ̂g = n²/{(x + 0.5)(n − x − 0.5)}. Collecting all these re-
sults, we obtain the Laplace approximation Ê1(π | x) of the posterior mean E(π | x):

Ê1(π | x)
= √[(n − 1)³(x + 0.5)(n − x − 0.5)/{n³(x − 0.5)(n − x − 0.5)}] · (π̂g/π̂)^(x−0.5) · {(1 − π̂g)/(1 − π̂)}^(n−x−0.5) · π̂g
= √[(n − 1)³(x + 0.5)/{n³(x − 0.5)}] · {(x + 0.5)/n} · [(n − 1)(x + 0.5)/{n(x − 0.5)}]^(x−0.5) · {(n − 1)/n}^(n−x−0.5)
= (x + 0.5)^(x+1) (n − 1)^(n+0.5) / {(x − 0.5)^x · n^(n+3/2)}.
Table 8.1 Comparison of two Laplace approximations Ê1 (π | x) and Ê2 (π | x) with the true pos-
terior mean E(π | x) in a binomial experiment with Jeffreys’ prior. The corresponding relative error
of the approximation is printed in brackets
Observation Posterior mean
x/n n E(π | x) Ê1 (π | x) Ê2 (π | x)
0.6 5 0.5833 0.5630 (−0.0349) 0.5797 (−0.0062)
0.6 20 0.5952 0.5940 (−0.0021) 0.5950 (−0.0005)
0.6 100 0.5990 0.5990 (−0.0001) 0.5990 (−0.0000)
0.8 5 0.7500 0.7208 (−0.0389) 0.7464 (−0.0048)
0.8 20 0.7857 0.7838 (−0.0024) 0.7854 (−0.0004)
0.8 100 0.7970 0.7970 (−0.0001) 0.7970 (−0.0000)
1 5 0.9167 0.8793 (−0.0408) 0.9129 (−0.0041)
1 20 0.9762 0.9737 (−0.0025) 0.9759 (−0.0003)
1 100 0.9950 0.9949 (−0.0001) 0.9950 (−0.0000)
Ê2(π | x) = (x + 1)^(x+1) n^(n+0.5) / {x^x (n + 1)^(n+3/2)}.     (8.9)
Table 8.1 compares the true posterior mean with both Laplace approximations for
different values of n and x. Also shown is the relative error

{Ê(π | x) − E(π | x)} / E(π | x).
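The first row of Table 8.1 can be reproduced directly from the closed-form expressions above (the helper names are ours; the true posterior mean follows from the Be(x + 0.5, n − x + 0.5) posterior under Jeffreys' prior):

```r
## true posterior mean under Jeffreys' prior: Be(x + 0.5, n - x + 0.5)
postMean <- function(x, n) (x + 0.5) / (n + 1)

## first Laplace approximation, derived above
laplace1 <- function(x, n)
  (x + 0.5)^(x + 1) * (n - 1)^(n + 0.5) /
    ((x - 0.5)^x * n^(n + 3/2))

## second Laplace approximation (8.9)
laplace2 <- function(x, n)
  (x + 1)^(x + 1) * n^(n + 0.5) /
    (x^x * (n + 1)^(n + 3/2))

## row x/n = 0.6, n = 5 of Table 8.1:
round(c(postMean(3, 5), laplace1(3, 5), laplace2(3, 5)), 4)
## [1] 0.5833 0.5630 0.5797
```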
In the previous example we could obtain analytical results for mode and curva-
ture of the integrands of both integrals in (8.4). If this is not possible, then numerical
maximisation is required. The same techniques as for maximising likelihood func-
tions can be applied, so in R we will use the functions optimize or optim. The
following example illustrates this approach.
Example 8.4 (Screening for colon cancer) In order to approximate the posterior
expectation of μ and ρ in Example 8.2, we first calculate the mode and curvature of
the unnormalised log-posterior density function log.unnorm.posterior, which
represents −nk(θ ) in (8.4):
optimObj <- optim(c(0.5, 0.5), log.unnorm.posterior, counts = data,
                  control = list(fnscale = -1), hessian = TRUE)
(mode <- optimObj$par)
[1] 0.4749234 0.4816687
curvature <- optimObj$hessian
(logDetCurvature <- as.numeric(determinant(curvature)$modulus))
[1] 11.96775
The variable mode now contains the mode θ̂ = (0.475, 0.482)⊤, and the variable
curvature contains the corresponding Hessian matrix K̂ with log determinant
log{|K̂|} = 11.968, saved under the name logDetCurvature. Note that we save
the original log value of the determinant to avoid loss of numerical precision below.
The R-function determinant also returns the sign of the determinant in the list
element sign, which we know to be positive in this case.
First, we are interested in the posterior expectation of μ, so we set g(θ ) = μ.
After suitable definition of −nkg (θ ) we proceed as above:
log.mu.times.unnorm.posterior <- function(theta, counts)
  log(theta[1]) + log.unnorm.posterior(theta, counts)
muOptimObj <- optim(c(0.5, 0.5), log.mu.times.unnorm.posterior,
                    counts = data, control = list(fnscale = -1),
                    hessian = TRUE)
(muMode <- muOptimObj$par)
[1] 0.4841198 0.4739346
muCurvature <- muOptimObj$hessian
(muLogDetCurvature <- as.numeric(determinant(muCurvature)$modulus))
[1] 12.09874
The mode of this function is θ̂g = (0.484, 0.474)⊤ with log determinant of the Hessian
matrix equal to log{|K̂g|} = 12.099. We now have all quantities necessary to
calculate the multiparameter version of (8.6). We implement it on the log-scale to
avoid loss of numerical precision.
log E{g(θ) | x} ≈ ½ log|K̂| − ½ log|K̂g| − n{kg(θ̂g) − k(θ̂)}.
logPosteriorExpectationMu <-
  1/2 * logDetCurvature - 1/2 * muLogDetCurvature +
  muOptimObj$value - optimObj$value
(posteriorExpectationMu <- exp(logPosteriorExpectationMu))
[1] 0.4491689
The posterior mean E(μ | x) ≈ 0.449 is hence smaller than the posterior mode
Mod(μ | x) = 0.475, which is also reflected in the skewness of the marginal pos-
terior f (μ | x) shown in Fig. 8.3a.
To compute the posterior mean of the second parameter ρ, we can proceed analo-
gously and obtain the approximation E(ρ | x) ≈ 0.505, slightly larger than the corre-
sponding mode and again in accordance with the skewness of the marginal posterior
density shown in Fig. 8.3b.
Assume first that it is easy to generate independent samples θ (1) , . . . , θ (M) from the
posterior distribution f (θ | x) of interest. A Monte Carlo estimate of the posterior
mean
E(θ | x) = ∫ θ f(θ | x) dθ     (8.10)
is then given by

Ê(θ | x) = (1/M) ∑_{m=1}^M θ^(m).
The law of large numbers (cf. Appendix A.4.3) ensures that this estimate is
simulation-consistent, i.e. the estimate converges to the true posterior mean as
M → ∞.
This approach is called Monte Carlo integration and avoids the analytical integration
in (8.10). More generally, for any suitable function g,

Ê{g(θ) | x} = (1/M) ∑_{m=1}^M g(θ^(m))     (8.11)
is a simulation-consistent estimate of E{g(θ) | x}, again easily shown with the law
of large numbers. For example, with g(θ) = θ², we obtain an estimate of E(θ² | x),
which can be used to estimate the posterior variance: V̂ar(θ | x) = Ê(θ² | x) −
Ê(θ | x)². Another example is the indicator function g(θ) = I_(−∞, θ0](θ) to estimate
Pr(θ ≤ θ0 | x).
The estimate Ê{g(θ) | x} is unbiased and has the variance

Var[Ê{g(θ) | x}] = (1/M) ∫ {g(θ) − E(g(θ) | x)}² f(θ | x) dθ.

The associated Monte Carlo standard error is

se[Ê{g(θ) | x}] = (1/√M) · √[ ∑_{m=1}^M {g(θ^(m)) − Ê(g(θ) | x)}² / (M − 1) ].
We note that independent samples are not necessary for simulation consistency,
which may still hold if the samples θ (1) , . . . , θ (M) are dependent. However, the ac-
curacy of a Monte Carlo estimate of E{g(θ ) | x} will be reduced if the transformed
samples g(θ (1) ), . . . , g(θ (M) ) are positively correlated.
Example 8.5 (Binomial model) Suppose that we have obtained a Be(4.5, 1.5)
posterior distribution for the success probability π in a binomial model. Assume
that we want to estimate the posterior expectation and the posterior probability
Pr(π < 0.5 | x) by Monte Carlo sampling. Both quantities can be written in the gen-
eral form E{g(π) | x} with g(π) = π and g(π) = I[0,0.5) (π), respectively. Here we
generate M = 10 000 independent samples, which typically gives sufficient accu-
racy for most quantities of interest.
M <- 10000
theta <- rbeta(M, 4.5, 1.5)
(Etheta <- mean(theta))
[1] 0.748382
(se.Etheta <- sqrt(var(theta) / M))
[1] 0.001649922
(Ptheta <- mean(theta < 0.5))
[1] 0.0875
(se.Ptheta <- sqrt(var(theta < 0.5) / M))
[1] 0.002825805
We obtain the Monte Carlo estimates Ê(π | x) = 0.7484 and P̂r(π < 0.5 | x) =
0.0875 with Monte Carlo standard errors se{Ê(π | x)} = 0.0016 and se{P̂r(π <
0.5 | x)} = 0.0028, respectively. The true values E(π | x) = 4.5/(1.5 + 4.5) = 0.75
and Pr(π < 0.5 | x) ≈ 0.087713 (calculated using the R call pbeta(0.5,4.5,1.5))
are both less than two Monte Carlo standard errors away from their respective esti-
mates.
Example 8.6 (Diagnostic test) We reconsider the problem of calculating the positive
predictive value of a diagnostic test under the scenario that the disease prevalence
π = Pr(D+) is estimated from a prevalence study. We now extend the approach
from Example 6.4 and assume that also the sensitivity and specificity of the diag-
nostic test, previously both fixed at 90 %, are actually derived from a diagnostic
study. Suppose that this study reported that 36 of 40 people who had the disease
also had positive tests, but only 4 of 40 people who did not have the disease had
positive tests. Again using Jeffreys’ prior, we obtain independent Be(36.5, 4.5) pos-
teriors both for sensitivity and specificity. A Monte Carlo approach allows us to in-
tegrate this additional uncertainty in the estimation of the positive predictive value
θ = Pr(D+ | T +). The following R-code illustrates this:
M <- 10000
## prev: samples from the Beta posterior distribution of the prevalence
prev <- rbeta(M, 1.5, 104)
## first use fixed values for sensitivity and specificity
sens <- 0.9
spec <- 0.9
## and calculate positive predictive value (PPV)
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
## now assume distributions for sensitivity and specificity
sens <- rbeta(M, 36.5, 4.5)
The samples are visualised as boxplots in Fig. 8.4. The boxplots illustrate that the
point estimate of the positive predictive value barely changes after incorporating
the uncertainty with respect to sensitivity and specificity. However, the variance
of the samples increases, so the uncertainty about this point estimate increases. In
particular, larger positive predictive values become more likely.
Example 8.7 (Blood alcohol concentration) In Example 7.7 we have already tried
to answer the question whether there exists a difference in mean transformation
factors between men and women. There we computed the Bayes factor between
a model with only one population mean and a model with gender-specific means.
Here we take a different route: We are going to compute the posterior distribution
of the difference of the gender-specific means using Monte Carlo methodology.
If we specify independent non-informative prior distributions f(μi, σi²) ∝ σi⁻²
(i = 1, 2) in the two groups, we obtain from Example 6.26 the marginal posterior
distributions of μi by plugging the sufficient statistics of the respective data subsets
into (6.30) as

μ1 | x1 ∼ t(2318.5, (38.32)², 32) and     (8.12)
μ2 | x2 ∼ t(2477.5, (18.86)², 151).     (8.13)
A rigorous frequentist analysis of this model with unequal variances in the two
groups is impossible due to the famous Behrens–Fisher problem. However, if we
assume equal variances in the two groups, we can apply the two-sample t test as we
did in Example 5.16. There is an analogous Bayesian approach for this easier prob-
lem, which is an extension of the one-sample problem described in Example 6.24 to
two samples. In this case the posterior distribution can be computed analytically.
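With the marginals (8.12) and (8.13), the Monte Carlo computation of the difference μ2 − μ1 is straightforward. The following sketch (our own code, not the book's listing) samples from the two location-scale t distributions and summarises the difference:

```r
set.seed(1)
M <- 10000
## samples from the marginal posteriors (8.12) and (8.13):
## location + scale * standard t variate
mu1 <- 2318.5 + 38.32 * rt(M, df = 32)
mu2 <- 2477.5 + 18.86 * rt(M, df = 151)
delta <- mu2 - mu1
## posterior mean and 95 % equal-tailed credible interval of the difference
mean(delta)
quantile(delta, probs = c(0.025, 0.975))
```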
  apply(pi, 1,
        FUN = function(onePi) sum(x * log(onePi)))
}
hwLoglik <- function(upsilon)
{
  pi <- cbind(upsilon * upsilon,
              2 * upsilon * (1 - upsilon),
              (1 - upsilon) * (1 - upsilon))
  triLoglik(pi)
}
## sample from the Hardy-Weinberg posterior:
aPost <- 1 + 2 * x[1] + x[2]
bPost <- 1 + x[2] + 2 * x[3]
upsilonSamples <- rbeta(n = 10000, aPost, bPost)
## calculate DIC:
upsilonBar <- aPost / (aPost + bPost)
upsilonBar - mean(upsilonSamples) ## check: OK
[1] 0.0001126755
hwDf <- mean(2 * (hwLoglik(upsilonBar) - hwLoglik(upsilonSamples)))
hwDf
[1] 0.990882
hwDic <- -2 * hwLoglik(upsilonBar) + 2 * hwDf
hwDic
[1] 1510.329
## sample from the Dirichlet posterior:
alphaPost <- 1 + x
piSamples <- rdirichlet(n = 10000, alphaPost)
head(piSamples)
          [,1]      [,2]      [,3]
[1,] 0.2889879 0.5354279 0.1755842
[2,] 0.2926704 0.5642561 0.1430735
[3,] 0.3133803 0.5250923 0.1615275
[4,] 0.2967536 0.5212148 0.1820316
[5,] 0.3427231 0.4894122 0.1678647
[6,] 0.3196069 0.5078508 0.1725423
## calculate DIC:
piBar <- alphaPost / sum(alphaPost)
piBar - colMeans(piSamples) ## check: OK
[1]  8.369696e-05 -2.725740e-04  1.888771e-04
triDf <- mean(2 * (triLoglik(piBar) - triLoglik(piSamples)))
triDf
[1] 1.988951
triDic <- -2 * triLoglik(piBar) + 2 * triDf
triDic
[1] 1510.369
Note that we checked whether the Monte Carlo estimates of the posterior expecta-
tions are close to the analytically computed values: the differences were very small
for both models, which indicates that the sample size (B = 10 000) is large enough
to obtain good accuracy for the DIC estimates.
Example 8.9 (Binomial model) In Example 8.5 we estimated the posterior expec-
tations of (transformations of) the proportion π in the binomial model using Monte
Carlo methodology. Now we would like to estimate the 95 % HPD interval for π by
Monte Carlo and compare it to the 95 % equal-tailed credible interval:
So the 95 % HPD interval is [0.4356, 0.9982], which is shifted to the right compared
with the 95 % equal-tailed credible interval [0.3647, 0.9763].
Alternatively, root finding methods (cf. Appendix C.1.2) can be used to com-
pute the HPD interval numerically. To this end, the following R-code first de-
fines a function outerdens, which computes the probability of all π that ful-
fil fBe (π; α, β) < h, i.e. the probability of all π that would not be contained
in the corresponding HPD interval {π : fBe (π; α, β) > h} (cf. Fig. 6.3). Then in
the second function betaHpd, the value hopt is determined that gives an outer
probability of 5 %, so that the credibility level of the inner π values is 95 %.
The boundaries of the HPD interval are then obtained as the values π fulfilling
fBe (π; α, β) = hopt .
  } else {
    uniroot(function(x) { dbeta(x, alpha, beta) - h },
            interval = c(0, mode - eps))$root
  }
  ## return everything
  return(c(prob = prob,
           lower = lower,
           upper = upper))
}
betaHpd <- function(alpha,
                    beta,
                    level = 0.95)
{
  ## compute the mode of this beta distribution,
  ## but go epsilon away from 0 or 1
  eps <- 1e-15
  mode <- max(min((alpha - 1) / (alpha + beta - 2),
                  1 - eps), 0 + eps)
  ## determine h_opt:
  result <- uniroot(function(h) { outerdens(h, alpha, beta)["prob"] -
                                    (1 - level) },
                    ## search in the interval (eps, f(mode) - eps)
                    interval = c(eps,
                                 dbeta(mode, alpha, beta) - eps))
  height <- result$root
So the result is [0.4360, 0.9983], which is very close to the Monte Carlo estimate
obtained above. The equal-tailed credible interval is [0.3714, 0.9775], which can be
calculated with the R-function qbeta. This is slightly shifted to the right compared
with the Monte Carlo result above.
Note that both credible intervals are larger than the corresponding ones in
Fig. 6.3, where the uncertainty in the sensitivity and specificity estimates was not
incorporated.
However, perhaps only samples from some other (and for the moment arbitrary)
density f ∗ (θ ) can be produced, but not from f (θ | x). We can re-write Eq. (8.14) to
obtain
E{g(θ) | x} = ∫ g(θ) {f(θ | x)/f*(θ)} f*(θ) dθ,
so based on the sample θ (1) , . . . , θ (M) from f ∗ (θ ), a suitable Monte Carlo estimate
of (8.14) is
Ê{g(θ) | x} = (1/∑_{m=1}^M wm) · ∑_{m=1}^M wm g(θ^(m)),     (8.15)

with importance weights wm = f(θ^(m) | x)/f*(θ^(m)).
Example 8.10 (Binomial model) Let us reconsider Monte Carlo estimation of the
posterior mean E(π | x) and the posterior probability Pr(π < 0.5 | x) from Exam-
ple 8.5. We now want to use importance sampling based on independent samples
from a standard uniform distribution. Computation of (8.15) is done with the fol-
lowing R-code:
u <- runif(M)
w <- dbeta(u, 4.5, 1.5)
(sum(w))
[1] 10151.34
(Etheta.u <- sum(u * w) / sum(w))
[1] 0.7494834
(se.Etheta.u <- sqrt(sum((u - Etheta.u)^2 * w^2)) / sum(w))
[1] 0.001780926
(Ptheta.u <- sum((u < 0.5) * w) / sum(w))
[1] 0.08824196
(se.Ptheta.u <- sqrt(sum(((u < 0.5) - Ptheta.u)^2 * w^2)) / sum(w))
[1] 0.00211711
First, note that the sum of the weights is 10151.3, so as expected quite close to
the number of samples M = 10 000. The importance sampling estimates also appear
to be quite accurate, with standard errors similar to ordinary Monte Carlo. Note
that the standard error for the mean estimate is slightly larger than with ordinary
Monte Carlo, while the standard error for the Pr(π < 0.5 | x) estimate is slightly
smaller.
In the previous example, we noted a decrease in the standard error for estimat-
ing the probability Pr(π < 0.5 | x) when importance sampling from the uniform
distribution is used. This illustrates the possibility of improving the Monte Carlo pre-
cision by choosing a suitable importance sampling distribution f*(θ). If the density
f*(θ) is approximately proportional to |g(θ)|f(θ | x), then the importance sam-
pling estimate (8.15) can be shown to have minimal variance. Note that, when
g(θ) > 0, the proportionality constant for the normalisation of this optimal f*(θ)
is 1/∫ g(θ)f(θ | x) dθ, the reciprocal of the quantity we want to estimate. Unless
we can simulate from g(θ)f(θ | x) without knowledge of the normalising constant,
this result therefore seems of limited practical value. For more details, we point the
interested reader to the relevant literature listed at the end of this chapter.
The rejection sampling algorithm to simulate X from fX (x) hence proceeds as fol-
lows:
1. Generate independent random variables Z from fZ (z) and U ∼ U(0, 1).
Then the distribution function of the accepted samples is

$$\Pr\bigl\{Z \le x \mid a \cdot U \cdot f_Z(Z) \le f_X(Z)\bigr\}
= \frac{\Pr\{Z \le x,\; a \cdot U \cdot f_Z(Z) \le f_X(Z)\}}{\Pr\{a \cdot U \cdot f_Z(Z) \le f_X(Z)\}}
= \frac{\int_{-\infty}^{x} \Pr\{a \cdot U \cdot f_Z(Z) \le f_X(Z) \mid Z = z\}\, f_Z(z)\, dz}{\int_{-\infty}^{+\infty} \Pr\{a \cdot U \cdot f_Z(Z) \le f_X(Z) \mid Z = z\}\, f_Z(z)\, dz}. \qquad (8.17)$$

Now

$$\Pr\bigl\{a \cdot U \cdot f_Z(Z) \le f_X(Z) \mid Z = z\bigr\}
= \Pr\left\{U \le \frac{f_X(z)}{a \cdot f_Z(z)}\right\}
= \frac{f_X(z)}{a \cdot f_Z(z)},$$

so the numerator of (8.17) equals $\frac{1}{a}\int_{-\infty}^{x} f_X(z)\,dz$ and the denominator equals $1/a$. The accepted samples therefore have the desired distribution function $F_X(x)$, and each single trial is accepted with probability $1/a$.
The single trials are independent, so the number of trials up to the first success
is geometrically distributed with parameter 1/a. The expected number of trials up
to the first success is therefore a; if a is large, the algorithm is hence not very
efficient. The constant a should thus be chosen as small as possible while fulfilling
condition (8.16), so typically a = maxz∈R fX (z)/fZ (z), as in the following example.
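The algorithm can be sketched for the Be(4.5, 1.5) posterior from Example 8.5, with a standard uniform proposal (so fZ(z) = 1 and a is the beta density at its mode); this is our own illustrative code, not the book's:

```r
## Rejection sampling from Be(4.5, 1.5) with a U(0, 1) proposal.
## Smallest valid envelope constant: a = max_z dbeta(z, 4.5, 1.5),
## attained at the beta mode (alpha - 1)/(alpha + beta - 2) = 0.875.
set.seed(1)
M <- 10000
a <- dbeta((4.5 - 1) / (4.5 + 1.5 - 2), 4.5, 1.5)
samples <- numeric(0)
while (length(samples) < M) {
    z <- runif(M)                    # proposals from fZ
    u <- runif(M)                    # uniform acceptance variables
    samples <- c(samples, z[a * u <= dbeta(z, 4.5, 1.5)])
}
samples <- samples[1:M]
mean(samples)        # estimate of E(pi | x) = 0.75
mean(samples < 0.5)  # estimate of Pr(pi < 0.5 | x)
1 / a                # theoretical acceptance probability per trial
```

The empirical acceptance frequency should match 1/a, consistent with the geometric-trials argument above.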
The estimates of the true values 0.75 and 0.0877, respectively, also appear to be
quite good using rejection sampling. This is not surprising, as the resulting samples
have the same distributional properties as if we had directly used the R-function
rbeta as in Example 8.5. Hence, the standard errors also have a similar size.
We note that care must be taken when an improper prior distribution is used,
because this may lead to an improper posterior distribution. Impropriety implies
that there does not exist a joint density to which the full-conditional distributions
correspond. However, the Gibbs sampling output might still look as if the approach
was working.
The efficiency of the Metropolis–Hastings algorithm depends crucially on the
acceptance rate, i.e. the relative frequency of acceptance (typically assessed after
convergence of the Markov chain). However, an acceptance rate close to one is
not always good. For example, for random walk proposals, an overly large acceptance
rate implies that the proposal density is concentrated too closely around the current value, so the
algorithm needs many small steps to explore the target distribution sufficiently. On
the other hand, if the acceptance rate of a random walk proposal is too small, large
moves are often proposed but rarely accepted. In some cases, the algorithm may
even get stuck at a specific value, and subsequent proposals will get rejected for a
large number of iterations. For random walk proposals, acceptance rates between
30 and 50 % are typically recommended, which can be easily achieved through
appropriate choice of the variance of the proposal distribution. Things are different
for independence proposals, where a high acceptance rate is desired, which means
that the proposal density is close to the target density.
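The effect of the proposal standard deviation on the acceptance rate can be illustrated with a small self-contained random walk sampler for a standard normal target (a hypothetical example of ours, not code from the book):

```r
## Random walk Metropolis with N(0, sd_prop^2) increments; the target is
## standard normal, so the log acceptance ratio needs only the log target.
rw_metropolis <- function(M, sd_prop, logTarget = function(x) -x^2 / 2) {
    x <- numeric(M)
    acc <- 0
    for (m in 2:M) {
        xstar <- x[m - 1] + rnorm(1, sd = sd_prop)
        if (log(runif(1)) <= logTarget(xstar) - logTarget(x[m - 1])) {
            x[m] <- xstar
            acc <- acc + 1
        } else x[m] <- x[m - 1]
    }
    list(samples = x, accRate = acc / (M - 1))
}
set.seed(1)
r1 <- rw_metropolis(10000, sd_prop = 0.1)$accRate  # close to 1: steps too small
r2 <- rw_metropolis(10000, sd_prop = 2.5)$accRate  # roughly in the 30-50 % range
c(r1, r2)
```

Tuning sd_prop until the empirical acceptance rate lands in the recommended range is exactly the kind of adjustment described above.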
Example 8.12 (Screening for colon cancer) We now want to apply Gibbs sam-
pling to the problem described in Sect. 1.1.5 and also to compare the likelihood
approaches in Example 2.12 and Sect. 2.3.2. For the probability π , we choose a
Be(0.5, 0.5) prior, so the full conditional of π is

$$\pi \mid Z \sim \operatorname{Be}\left(0.5 + \sum_{k=0}^{6} k \cdot Z_k,\;\; 0.5 + \sum_{k=0}^{6} (6 - k)\, Z_k\right). \qquad (8.19)$$
Fig. 8.6 Paths of the Gibbs sampling Markov chain (without burn-in of first 100 iterations). Both
plots suggest the convergence of the Markov chain
Note that the R function rnbinom generates a random number from a differently
parametrised negative binomial distribution, where the number of non-successes
and not the number of trials up to the first success is counted; the latter one is used
in the definition of Z0 .
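A small illustration of the two parametrisations (our own snippet): R's rnbinom counts failures before the size-th success, so the number of trials is obtained by adding size.

```r
## rnbinom() returns the number of non-successes (failures) before the
## size-th success; the number of trials includes the successes themselves.
set.seed(1)
failures <- rnbinom(5, size = 1, prob = 0.3)  # geometric number of failures
trials <- failures + 1                        # trials up to and incl. success
cbind(failures, trials)
```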
In practice one has to decide when the simulated Markov chain has reached its
target distribution. A common approach is to ignore the first few iterations, the so-
called burn-in phase, so the remaining random numbers can be regarded as samples
from the target distribution. In the above code, the first nburnin samples are ignored
as burn-in. One should still inspect the trace plots of the parameter samples to ensure
that the samples at least “visually” converge. Figure 8.6 suggests that this is the case
here. We can now compute summary statistics of posterior characteristics of interest,
e.g. estimates of quantiles and expectations.
summary(pisamples)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.5683  0.6142  0.6240  0.6240  0.6337  0.6776
summary(Z0samples)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  0.5638  1.0000  5.0000
Application of Gibbs sampling requires that random numbers from the full con-
ditional distributions can be generated easily. However, this may not always be the
case, so alternative approaches need to be investigated. One idea is to use rejection
sampling or other techniques to obtain samples from the full conditional distribu-
tion indirectly. However, it is typically easier to use a simple Metropolis–Hastings
proposal to update the corresponding component of θ . The latter approach is some-
times called Metropolis-within-Gibbs. We illustrate the approach with the following
example.
Fig. 8.8 10 000 independent samples from the prior f (υ, δ) with α = β = 1, i.e. a marginal uni-
form prior for υ
ensures that the individual probabilities are all within the unit interval. The corre-
sponding likelihood function of υ and δ is of course only a reparametrisation of a
multinomial likelihood. This reparametrisation is useful to investigate the presence
of Hardy–Weinberg equilibrium, represented by the null hypothesis H0 : δ = 0.
A Bayesian analysis starts by choosing appropriate prior distributions for υ and δ.
Due to (8.21), these distributions cannot be independent, and we therefore factorise
the prior in the form
f (υ, δ) = f (υ)f (δ | υ) (8.22)
and select the Be(α, β) distribution for the marginal prior of υ and the uniform
distribution on the interval (8.21) for the conditional prior of δ given υ.
Figure 8.8a illustrates for α = β = 1 the dependence between the two parame-
ters using 10 000 simulations from the prior by first simulating υ ∗ from f (υ) and
subsequently δ ∗ from f (δ | υ ∗ ). Via (8.22), this gives samples from the joint distri-
bution f (υ, δ) as well as from the marginal distribution f (δ). Figure 8.8b shows
that the marginal prior of δ is not at all uniform and not even symmetric around
zero. The prior probability Pr(δ > 0) is approximately 0.75, estimated based on the
samples.
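A sketch of this prior simulation follows. Note that the interval (8.21) is not restated here, so the reparametrisation and bounds below are our assumption of the standard Hardy–Weinberg disequilibrium parametrisation, p_MM = υ² + δ, p_MN = 2υ(1 − υ) − 2δ, p_NN = (1 − υ)² + δ, which is a valid probability vector for δ in (max(−υ², −(1 − υ)²), υ(1 − υ)):

```r
## Simulate (v, d) from the factorised prior f(v) f(d | v) and estimate
## the marginal prior probability Pr(d > 0).
set.seed(1)
M <- 10000
lower <- function(v) max(-v^2, -(1 - v)^2)  # assumed lower bound of (8.21)
upper <- function(v) v * (1 - v)            # assumed upper bound of (8.21)
v <- rbeta(M, 1, 1)                         # alpha = beta = 1
d <- sapply(v, function(vi) runif(1, lower(vi), upper(vi)))
mean(d > 0)                                 # approximately 0.75
```

Under these bounds one can show Pr(δ > 0 | υ) = 1 − min(υ, 1 − υ), whose average over a uniform υ is exactly 0.75, matching the value reported in the text.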
Turning to posterior inference based on the MN blood group frequencies in Ice-
land as described in Sect. 1.1.4, note that the full conditional distributions of υ and
δ are not of closed form and not easy to sample from. We therefore use Metropolis–
Hastings proposals for both conditional distributions. For υ, it may be useful to use
274 8 Numerical Methods for Bayesian Inference
## proposal for d:
first <- max(d - scale, lower(v))
second <- min(d + scale, upper(v))
dstar <- runif(1, min = first, max = second)
## compute the log posterior ratio
logPostRatio <-
    ## from the likelihood
    dmultinom(x, prob = myprob(v, dstar), log = TRUE) -
    dmultinom(x, prob = myprob(v, d), log = TRUE) +
    ## from the prior on d given v
    dunif(x = dstar, min = lower(v), max = upper(v), log = TRUE) -
    dunif(x = d, min = lower(v), max = upper(v), log = TRUE)
## compute the log proposal ratio
logPropRatio <- dunif(d, first, second, log = TRUE) -
    dunif(dstar, first, second, log = TRUE)
## hence we obtain the log acceptance probability
logAcc <- logPostRatio + logPropRatio
## decide acceptance
if (log(runif(1)) <= logAcc)
{
    d <- dstar
    ## count acceptances
    if (i > 0)
    {
        dyes <- dyes + 1
    }
}
## if burnin was passed, save the samples
if (i > 0)
{
    vsamples[i] <- v
    dsamples[i] <- d
}
}

Fig. 8.9 Estimated joint posterior distribution of υ and δ and marginal posterior distribution of δ based on 10 000 MCMC samples
The empirical acceptance rates turn out to be vyes/niter = 97.6 % and
dyes/niter = 45.6 % for υ and δ, respectively. Figures 8.9a and 8.9b show the
empirical posterior distribution of υ and δ and also the marginal posterior of δ.
The estimated posterior means are Ê(υ | x) = 0.5697 and Ê(δ | x) = −0.0122. The
posterior probability Pr(δ > 0 | x) is estimated as 0.0911, so substantially smaller
than a priori, but not decisively close to zero. This suggests that the assumption of
Hardy–Weinberg equilibrium for these data may not be completely unreasonable.
Similar results are obtained using a likelihood analysis, as we will show
now. The MLEs (with standard errors in brackets) υ̂ML = 0.5696 (0.0125) and
δ̂ML = −0.0125 (0.0089) are easily obtained. If we interpret these estimates from
a Bayesian point of view using the second normal approximation described in
Sect. 6.6.2, we obtain the posterior probability
$$\Pr(\delta > 0 \mid x) = 1 - \Pr\left(\frac{\delta - \hat\delta_{\mathrm{ML}}}{\operatorname{se}(\hat\delta_{\mathrm{ML}})} \le \frac{-\hat\delta_{\mathrm{ML}}}{\operatorname{se}(\hat\delta_{\mathrm{ML}})}\right) \approx 1 - \Phi\left(\frac{-\hat\delta_{\mathrm{ML}}}{\operatorname{se}(\hat\delta_{\mathrm{ML}})}\right) = 0.0803,$$
where Φ(·) denotes the standard normal distribution function. This is again close
to the fully Bayes estimate. We will revisit this example in Example 8.16, where
we will apply explicit Bayesian model selection in order to decide between the
Hardy–Weinberg and the multinomial model.

Fig. 8.10 Estimated posterior distribution of ξ and Z0 based on 10 000 MCMC samples
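As a quick check, the normal-approximation probability can be reproduced in R from the estimates reported above (the numbers are taken from the text):

```r
## normal approximation to Pr(delta > 0 | x) using the reported MLE and
## standard error
deltaHat <- -0.0125
se <- 0.0089
1 - pnorm(-deltaHat / se)  # approximately 0.08
```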
Example 8.14 (Screening for colon cancer) In Example 8.2 we have already nu-
merically derived the posterior distribution of the parameters μ and ρ in the trun-
cated beta-binomial model, cf. Fig. 8.2. To estimate the posterior distribution of
the false negative fraction ξ , we now use random numbers from the joint posterior
f(θ | x) of θ = (μ, ρ)⊤, generated using a bivariate Metropolis sampler. Samples
ξ^(m) = ξ(θ^(m)) from the posterior distribution of the transformation ξ can then easily
be obtained. Those can in turn be used to compute Z_0^(m) = 196 · ξ^(m)/(1 − ξ^(m)),
which are samples from the posterior of Z_0.
We choose a normal proposal h(θ ∗ | θ (m) ) with mean equal to the current value
and covariance matrix proportional to the negative inverse curvature of the log-
posterior at the posterior mode (cf. Sect. 6.6.2). The corresponding proportionality
constant is chosen such that the acceptance rates are between 30 and 50 %.
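One Metropolis step of this kind can be sketched as follows (a minimal sketch of ours, not the book's code), assuming a function logpost for the unnormalised log-posterior and a covariance matrix Sigma proportional to the negative inverse curvature at the mode:

```r
## One bivariate Metropolis step with a normal random walk proposal.
## Since the proposal is symmetric, the acceptance ratio involves only
## the log-posterior difference.
metropolis_step <- function(theta, logpost, Sigma) {
    ## chol() returns upper-triangular R with t(R) %*% R = Sigma,
    ## so t(R) %*% z has covariance Sigma for z ~ N(0, I)
    thetaStar <- theta + drop(t(chol(Sigma)) %*% rnorm(length(theta)))
    if (log(runif(1)) <= logpost(thetaStar) - logpost(theta)) thetaStar else theta
}
```

Scaling Sigma up or down moves the acceptance rate into the 30–50 % range recommended above.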
Figure 8.10a displays a kernel density estimate of the posterior of the false neg-
ative fraction based on the last 9000 random numbers (1000 burn-in samples have
been disregarded). The posterior distribution is quite skewed with estimated mean
0.28 much larger than the empirical median 0.26 or mode 0.20. The mode has
been estimated as the mean of the limits of the empirical 1 % HPD interval. The
empirical 95 % HPD interval is [0.10, 0.51]. Compared with the MLE ξ̂ML = 0.24
from Example 5.10, both mean and median are larger, and the uncertainty is smaller
than the one based on the profile likelihood interval [0.11, 0.55]. Figure 8.10b displays the corresponding posterior distribution of Z0.
Example 8.15 (Scottish lip cancer) We revisit the data introduced in Sect. 1.1.6
on the incidence of lip cancer in the n = 56 geographical regions of Scotland. Let
xi and ei , i = 1, . . . , n, denote the observed and expected cases, respectively. We
assume that the xi are independent conditional realisations from a Poisson distribu-
tion with mean ei λi , where λi denotes the unknown relative risk in region i. We now
specify a prior on the log relative risks ηi = log(λi ) that takes into account spatial
structure and thus allows for spatial dependence. More specifically, we use a Gaus-
sian Markov random field, which is most easily specified through the conditional
distribution of ηi given all other {ηj }j =i . A common choice is to assume that
$$\eta_i \mid \{\eta_j\}_{j \ne i}, \sigma^2 \sim \mathrm{N}\left(\bar\eta_i,\; \frac{\sigma^2}{n_i}\right),$$

where $\bar\eta_i = n_i^{-1} \sum_{j \sim i} \eta_j$ denotes the mean of the $n_i$ spatially neighbouring regions
of region $i$, and $\sigma^2$ is an unknown variance parameter.
To simulate from the posterior distribution, the obvious choice is a Gibbs sam-
pler that iteratively updates the n + 1 unknown parameters λ1 , . . . , λn , σ 2 . Due to
conditional conjugacy, we use an inverse gamma prior for σ 2 , i.e. σ 2 ∼ IG(α, β)
a priori, so the full conditional distribution of σ 2 is again inverse gamma,
$$\sigma^2 \mid \eta \sim \operatorname{IG}\left(\alpha + \frac{n-1}{2},\; \beta + \frac{1}{2} \sum_{i \sim j} (\eta_i - \eta_j)^2\right),$$
where the sum in the second (scale) parameter goes over all pairs of neighbouring
regions i ∼ j . Slightly more involved is simulation from the full conditional dis-
tribution of λi , i = 1, . . . , n, which is not of a known form. This problem can be
circumvented by using a Metropolis–Hastings step with (conditional) independence
proposal
$$\lambda_i \sim \operatorname{G}\left(x_i + \frac{\tilde\mu^2}{\tilde\sigma^2},\; e_i + \frac{\tilde\mu}{\tilde\sigma^2}\right). \qquad (8.23)$$
This choice is motivated by the fact that the conditional prior of λi | {λj}j≠i, by
definition a log-normal distribution, can be well approximated through a gamma
distribution G(μ̃2 /σ̃ 2 , μ̃/σ̃ 2 ) with matching moments. The two parameters μ̃ and
σ̃ 2 are expectation and variance of that log-normal distribution and are given by
$$\tilde\mu = \exp\left(\bar\eta_i + \frac{\sigma^2}{2 n_i}\right) \quad\text{and}\quad \tilde\sigma^2 = \left\{\exp\left(\sigma^2 / n_i\right) - 1\right\} \exp\left(2 \bar\eta_i + \sigma^2 / n_i\right),$$
cf. Appendix A.5.2. The gamma distribution is conjugate to the Poisson likelihood
and can be analytically combined to obtain the proposal distribution (8.23) as an
approximation to the full conditional of λi .
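A minimal sketch of these two update steps (function and argument names are ours, not the book's):

```r
## Gibbs update for sigma^2: an IG(a, b) draw is the reciprocal of a
## Gamma(a, rate = b) draw.
update_sigma2 <- function(eta, pairs, alpha, beta) {
    ## pairs: two-column matrix listing each neighbour pair (i, j) once
    ss <- sum((eta[pairs[, 1]] - eta[pairs[, 2]])^2)
    1 / rgamma(1, shape = alpha + (length(eta) - 1) / 2, rate = beta + ss / 2)
}

## Gamma independence proposal (8.23) for lambda_i, obtained by
## moment-matching the log-normal conditional prior.
propose_lambda <- function(i, x, e, etaBar, n_i, sigma2) {
    muT <- exp(etaBar + sigma2 / (2 * n_i))
    s2T <- (exp(sigma2 / n_i) - 1) * exp(2 * etaBar + sigma2 / n_i)
    rgamma(1, shape = x[i] + muT^2 / s2T, rate = e[i] + muT / s2T)
}
```

The proposed value would then be accepted or rejected with the usual Metropolis–Hastings ratio, since the gamma proposal only approximates the full conditional.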
A simulation of length 100 000 with burn-in of 10 000 gave an average accep-
tance rate of 94 %. For the individual λi s, the acceptance rate was never below 67 %.
We used the prior parameters α = 1 and β = 0.01. Figure 8.11 displays the corre-
sponding posterior mean estimates E(λi | x1:n ). Compared with the MLEs shown
in Fig. 1.2, obtained from a model without spatial dependence, a much smoother
picture can be observed. The empirical Bayes estimates displayed in Fig. 6.13 are
also less variable than the MLEs but do not take the spatial structure of the data into
account.
8.5 Numerical Calculation of the Marginal Likelihood

The calculation of the marginal likelihood f(x) in the non-conjugate case is the
greatest challenge in the implementation of Bayesian statistics.
In case the necessary integration in (8.1) is not feasible analytically, a natural ap-
proach is to try numerical methods of integration.
Example 8.17 (Screening for colon cancer) Using likelihood inference, the trun-
cated beta-binomial model has been clearly preferred over the truncated binomial
model on the basis of a χ 2 goodness-of-fit test (see Example 5.18). For a Bayesian
model selection procedure between the simpler model M2 of Example 8.1 and the
more flexible model M1 of Example 8.2, we calculate the Bayes factor
$$\mathrm{BF}_{12} = \frac{f(x \mid M_1)}{f(x \mid M_2)},$$
where the numerator and denominator were already used in the calculation of
the posterior parameter densities. The necessary integration is again not feasible
analytically, and the R-function integrate has been applied instead. We obtain
BF12 = 1.62 · 1039 and therefore again overwhelming evidence for the truncated
beta-binomial model compared to the simpler binomial one.
In more complex models, numerical methods and the Laplace approximation are
computationally costly and/or imply non-negligible inaccuracies. An obvious idea
then is to adapt (MC)MC methods to not only estimate posterior densities but also
marginal likelihoods. Certainly, this involves higher costs and/or lower accuracies
as well. We illustrate this fact with three possible procedures.
First, observe that the equation

$$f(x) = \int f(x \mid \theta)\, f(\theta)\, d\theta$$

allows a direct application of Monte Carlo integration (see Sect. 8.3) using randomly
drawn numbers θ^(1), …, θ^(M) from the prior distribution f(θ). The resulting Monte
Carlo estimator is given as

$$\hat f(x) = \frac{1}{M} \sum_{m=1}^{M} f\bigl(x \mid \theta^{(m)}\bigr), \qquad (8.24)$$
the arithmetic average of the likelihood values of the random numbers drawn from
the prior distribution. We apply this simple approach to Example 8.12.
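Before turning to the screening data, the estimator (8.24) can be sanity-checked in a toy conjugate model where the marginal likelihood is available in closed form (our own illustration, not from the book):

```r
## X ~ Bin(n, pi) with pi ~ U(0, 1) has marginal likelihood
## f(x) = 1 / (n + 1) for every x, so (8.24) should recover this value.
set.seed(1)
n <- 10; x <- 7; M <- 100000
piPrior <- runif(M)                      # draws from the prior
fhat <- mean(dbinom(x, n, piPrior))      # arithmetic mean of likelihood values
c(fhat, 1 / (n + 1))                     # the two values should be close
```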
Example 8.18 (Screening for colon cancer) The parameter θ = π in the considered
binomial model is a scalar, and the observed data x are given by Z = (Z_1, …, Z_N)^⊤.
Since Z_0 is not observed, the likelihood function is given in this case by a binomial
distribution truncated to positive values (see Example 2.12):

$$f(Z \mid \pi) = \prod_{k=1}^{N} \left\{\binom{N}{k} \pi^{k} (1-\pi)^{N-k}\right\}^{Z_k} \Big/ \left\{1 - (1-\pi)^{N}\right\}^{n}, \qquad (8.25)$$
with N = 6 tests per patient and a total number of n = 196 positively tested pa-
tients. We used Jeffreys’ prior Be(0.5, 0.5) as a prior distribution for π and now
draw M = 30 000 random numbers from it and then calculate their average like-
lihood value. In order to assess the variability of the estimator (8.24) we show in
Fig. 8.12a estimations for five different simulations as functions of the sample size
m = 1, . . . , M. For M = 30 000 the estimates range between exp(−440.520) and
exp(−440.437).
The variance of the estimator (8.24) is typically quite high if the likelihood func-
tion is distinctly more concentrated than the prior distribution. This means that the
likelihood function will take values close to zero at most of the random numbers
drawn from the prior distribution but very high values for very few of those random
numbers. This is the case in Example 8.18, where the prior has very different mass
centres than the likelihood function, see Fig. 8.12b. Therefore, it is advisable to use
random numbers drawn from the posterior distribution instead of the prior distribution.
For a direct application of importance sampling, the posterior distribution f (θ | x)
has to be available including the proportionality constant, which is identical to the
inverse of the marginal likelihood. But this is precisely the quantity we are intending
to calculate, such that importance sampling cannot be applied directly.
Fig. 8.12 Application of the estimator (8.24) in the binomial model of the colon cancer screening
study
The trick here is to instead apply importance sampling to estimate the integral

$$\int f(\theta)\, d\theta = \int \frac{f(\theta)}{f(\theta \mid x)}\, f(\theta \mid x)\, d\theta,$$
known to be equal to one. Drawing θ (1) , . . . , θ (M) from the posterior distribution
f (θ | x), we obtain the following “estimate” of the above integral:
$$1 \approx \frac{1}{M} \sum_{m=1}^{M} \frac{f(\theta^{(m)})}{f(\theta^{(m)} \mid x)} = \frac{1}{M} \sum_{m=1}^{M} \frac{f(x)}{f(x \mid \theta^{(m)})}.$$

Solving this for f(x) gives the estimator

$$\hat f(x) = \left\{\frac{1}{M} \sum_{m=1}^{M} \frac{1}{f(x \mid \theta^{(m)})}\right\}^{-1}, \qquad (8.26)$$

the harmonic average of the likelihood values at the random numbers drawn from the
posterior distribution.
posterior distribution. Unfortunately, the statistical performance of this estimator is
again quite poor. Since the variance of the inverse likelihood is in most cases infinite,
the estimator is unstable in many applications. However, it is at least simulation-
consistent due to the law of large numbers, i.e. fˆ(x) converges to f (x) as M → ∞.
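The same toy conjugate model as before (our illustration again) shows the harmonic mean estimator at work; the samples are now drawn from the exact posterior:

```r
## X ~ Bin(10, pi), uniform prior, x = 7: the posterior is Be(x+1, n-x+1)
## and the true marginal likelihood is 1 / (n + 1). As noted in the text,
## the variance of the inverse likelihood can be infinite, so this
## estimator is unstable even in simple models.
set.seed(1)
n <- 10; x <- 7; M <- 100000
piPost <- rbeta(M, x + 1, n - x + 1)
fhatHM <- 1 / mean(1 / dbinom(x, n, piPost))
c(fhatHM, 1 / (n + 1))
```

Rerunning with different seeds illustrates the instability: occasional extreme posterior draws can shift the estimate noticeably, even though it is simulation-consistent.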
Example 8.19 (Screening for colon cancer) We now take a different route than in
Example 8.18 to compute the marginal likelihood for the colon cancer screening
model. Using the likelihood (8.25) and applying (8.26), we obtain the estimator
$$\hat f(Z) = M \left[\sum_{m=1}^{M} \left\{1 - \left(1 - \pi^{(m)}\right)^{N}\right\}^{n} \prod_{k=1}^{N} \left\{\binom{N}{k} \left(\pi^{(m)}\right)^{k} \left(1 - \pi^{(m)}\right)^{N-k}\right\}^{-Z_k}\right]^{-1}. \qquad (8.27)$$
Figure 8.13 shows this estimator of the marginal likelihood as a function of the sam-
ple size M, where we drew M random numbers from the (approximate) posterior
distribution using a Gibbs sampler as described in Example 8.12.
A third possibility is based on the identity

$$f(x) = \frac{f(x \mid \theta)\, f(\theta)}{f(\theta \mid x)}, \qquad (8.28)$$
which is a simple rearrangement of Bayes’ theorem and holds for all θ ∈ Θ. It will
mostly be evaluated at parameter values θ in the centre of the posterior distribution,
e.g. the posterior mean or the posterior mode since in these regions a higher accu-
racy of the density estimation can be expected for a fixed number of random draws
from the posterior. The denominator f (θ | x) in the above identity is typically only
known up to proportionality. Indeed, the corresponding normalising constant f (x)
is actually the quantity we intend to calculate. The representation
$$f(\theta \mid x) = \int f(\theta \mid x, z)\, f(z \mid x)\, dz \qquad (8.29)$$

suggests the Monte Carlo estimate

$$\hat f(\theta \mid x) = \frac{1}{M} \sum_{m=1}^{M} f\bigl(\theta \mid x, z^{(m)}\bigr),$$

based on samples z^(m) from f(z | x), of the required posterior (8.29). Note that the full conditional density f(θ | x, z)
used in the Gibbs sampler is known, including its normalisation constant.
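Identity (8.28) can be verified exactly in the conjugate toy setting used earlier, where the posterior density is known in closed form (illustration of ours):

```r
## For X ~ Bin(n, pi) with a uniform prior, the posterior is
## Be(x + 1, n - x + 1), so (8.28) must return the known marginal
## likelihood 1 / (n + 1) at any evaluation point piStar.
n <- 10; x <- 7
piStar <- 0.5  # any value in (0, 1); central values are numerically safest
fx <- dbinom(x, n, piStar) * dunif(piStar) / dbeta(piStar, x + 1, n - x + 1)
fx             # equals 1 / (n + 1)
```

In non-conjugate models the denominator is not available in closed form, which is exactly why the Monte Carlo estimate of the posterior density is substituted.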
Example 8.20 (Screening for colon cancer) Again, we intend to illustrate this
method using the colon cancer screening example. Here θ corresponds to the un-
known probability π , the observed data x is Z, and the unobserved data z is Z0 .
The value of the density f (π | x) at the estimated posterior mean Ê(π | Z) = 0.624
can be estimated directly by

$$\frac{1}{M} \sum_{m=1}^{M} f\bigl(0.624 \mid Z, Z_0^{(m)}\bigr) = 27.889,$$
where we use M = 100 000 simulations of Z0 | Z using Gibbs sampling, and (8.19)
gives the density f (π | x, z). Substitution into Eq. (8.28) leads to the Monte Carlo
estimate of the log marginal likelihood, which is actually quite close to the first estimate log{f̂(x)} = −440.475 from Example 8.18.
8.6 Exercises
1. Let X ∼ Po(eλ) with known e, and assume the prior λ ∼ G(α, β).
(a) Compute the posterior expectation of λ.
(b) Compute the Laplace approximation of this posterior expectation.
(c) For α = 0.5 and β = 0, compare the Laplace approximation with the
exact value, given the observations x = 11 and e = 3.04, or x = 110 and
e = 30.4. Also compute the relative error of the Laplace approximation.
(d) Now consider θ = log(λ). First, derive the posterior density function
using the change-of-variables formula (A.11). Second, compute the
Laplace approximation of the posterior expectation of λ = exp(θ ) and
compare again with the exact value you have obtained by numerical in-
tegration using the R-function integrate.
2. In Example 8.3, derive the Laplace approximation (8.9) for the posterior ex-
pectation of π using the variance-stabilising transformation.
3. For estimating the odds ratio θ from Example 5.8, we will now use Bayesian
inference. We assume independent Be(0.5, 0.5) distributions as priors for the
probabilities π1 and π2 .
(a) Compute the posterior distributions of π1 and π2 for the data given
in Table 3.1. Simulate samples from these posterior distributions and
transform them into samples from the posterior distributions of θ and
ψ = log(θ ). Use the samples to compute Monte Carlo estimates of the
posterior expectations, medians, equal-tailed credible intervals and HPD
intervals for θ and ψ . Compare with the results from likelihood infer-
ence in Example 5.8.
(b) Try to compute the posterior densities of θ and ψ analytically. Use the
density functions to numerically compute the posterior expectations and
HPD intervals. Compare with the Monte Carlo estimates from 3(a).
4. In this exercise we will estimate a Bayesian hierarchical model with MCMC
methods. Consider Example 6.31, where we had the following model:
$$\hat\psi_i \mid \psi_i \sim \mathrm{N}\bigl(\psi_i, \sigma_i^2\bigr), \qquad \psi_i \mid \nu, \tau^2 \sim \mathrm{N}\bigl(\nu, \tau^2\bigr),$$
where we assume that the empirical log odds ratios ψ̂i and corresponding vari-
ances σi2 := 1/ai + 1/bi + 1/ci + 1/di are known for all studies i = 1, . . . , n.
Instead of empirical Bayes estimation of the hyper-parameters ν and τ 2 , we
here proceed in a fully Bayesian way by assuming hyper-priors for them. We
choose ν ∼ N(0, 10) and τ 2 ∼ IG(1, 1).
(a) Derive the full conditional distributions of the unknown parameters
ψ1 , . . . , ψn , ν and τ 2 .
(b) Implement a Gibbs sampler to simulate from the corresponding poste-
rior distributions.
(c) For the data given in Table 1.1, compute 95 % credible intervals for
ψ1 , . . . , ψn and ν. Produce a plot similar to Fig. 6.15 and compare with
the results from the empirical Bayes estimation.
5. Let Xi , i = 1, . . . , n, denote a random sample from a Po(λ) distribution with
gamma prior λ ∼ G(α, β) for the mean λ.
(a) Derive closed forms of E(λ | x1:n ) and Var(λ | x1:n ) by computing the
posterior distribution of λ | x1:n .
(b) Approximate E(λ | x1:n ) and Var(λ | x1:n ) by exploiting the asymptotic
normality of the posterior (cf. Sect. 6.6.2).
(c) Consider now the log mean θ = log(λ). Use the change-of-variables
formula (A.11) to compute the posterior density f (θ | x1:n ).
(d) Let α = 1, β = 1 and assume that x̄ = 9.9 has been obtained for n = 10
observations from the model. Compute approximate values of E(θ | x1:n )
and Var(θ | x1:n ) via:
(i) the asymptotic normality of the posterior,
(ii) numerical integration (cf. Appendix C.2.1) and
(iii) Monte Carlo integration.
6. Consider the genetic linkage model from Exercise 5 in Chap. 2. Here we
assume a uniform prior on the proportion φ, i.e. φ ∼ U(0, 1). We would like
to compute the posterior mean E(φ | x).
(a) Construct a rejection sampling algorithm to simulate from f (φ | x) us-
ing the prior density as the proposal density.
(b) Estimate the posterior mean of φ by Monte Carlo integration using M =
10 000 samples from f (φ | x). Calculate also the Monte Carlo standard
error.
(c) In 6(b) we obtained samples of the posterior distribution assuming a
uniform prior on φ. Suppose we now assume a Be(0.5, 0.5) prior in-
stead of the previous U(0, 1) = Be(1, 1). Use the importance sampling
weights to estimate the posterior mean and Monte Carlo standard error
under the new prior based on the old samples from 6(b).
7. As in Exercise 6, we consider the genetic linkage model from Exercise 5 in
Chap. 2. Now, we would like to sample from the posterior distribution of φ us-
ing MCMC. Using the Metropolis–Hastings algorithm, an arbitrary proposal
distribution can be used, and the algorithm will always converge to the target
distribution. However, the time until convergence and the degree of depen-
dence between the samples depends on the chosen proposal distribution.
(a) To sample from the posterior distribution, construct an MCMC sampler
based on the following normal independence proposal (cf. approxima-
tion 3 in Sect. 6.6.2):
$$\varphi^{*} \sim \mathrm{N}\bigl(\operatorname{Mod}(\varphi \mid x),\; F^{2} \cdot C^{-1}\bigr),$$

where Mod(φ | x) denotes the posterior mode, C the negative curvature of the log-posterior at the mode, and F a factor scaling the proposal variance.
(b) Construct a second MCMC sampler based on a random walk proposal centred at the current state φ^(m) of the Markov chain, with step size controlled by a constant d.
(c) Generate M = 10 000 samples from algorithm 7(a), setting F = 1 and
F = 10, and from algorithm 7(b) with d = 0.1 and d = 0.2. To check
the convergence of the Markov chain:
(i) plot the generated samples to visually check the traces,
(ii) plot the autocorrelation function using the R-function acf,
8.7 References
Contents
9.1 Plug-in Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.2 Likelihood Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.2.1 Predictive Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.2.2 Bootstrap Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9.3 Bayesian Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.3.1 Posterior Predictive Distribution . . . . . . . . . . . . . . . . . . . . . . . 297
9.3.2 Computation of the Posterior Predictive Distribution . . . . . . . . . . . . 301
9.3.3 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
9.4 Assessment of Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
9.4.1 Discrimination and Calibration . . . . . . . . . . . . . . . . . . . . . . . . 304
9.4.2 Scoring Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
9.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
A common problem in statistics concerns the prediction of future data. Consider the
following scenario: Let x1:n denote the realisation of a random sample X1:n from
a distribution with density (or probability mass) function f (x; θ ) with unknown
parameter θ . Our goal is to predict a future independent observation Y = Xn+1 , also
from f (x; θ ). Obviously, the observed data x1:n should be taken into account in this
task.
In contrast to previous chapters, which were concerned with inference for unknown
and inherently unobservable parameters, we now want to infer a quantity
that we will be able to observe. We are interested in predictive inference.
We may want to derive a point prediction Ŷ , but also a predictive distribution
with predictive density f (y). The point prediction is then just a function of the
predictive distribution, for example its mean. A 95 % prediction interval can be
obtained based, for example, on the 2.5 % and 97.5 % quantiles of the predictive
distribution.
For simplicity, in the following we will call f (y) a density function even if Y
may also be a discrete random variable. The methods are described for scalar Y
and θ but can easily be generalised to vectorial Y or θ .
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 289
https://doi.org/10.1007/978-3-662-60792-3_9,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
290 9 Prediction
where θ̂ML = θ̂ML (x1:n ) has been calculated based on the realisation x1:n of the ran-
dom sample X1:n . However, by replacing the true, unknown parameter θ with the
MLE θ̂ML , the uncertainty in estimating θ is ignored. The plug-in prediction there-
fore produces prediction intervals that are often too narrow. Because of its simplic-
ity, this method is nevertheless commonly used.
$$L(\theta, y) = f(x_{1:n}, y; \theta).$$

Now assume that X1:n and Y are independent and distributed according to a
density function f(x; θ). Then we have

$$f(x_{1:n}, y; \theta) = f(y; \theta) \prod_{i=1}^{n} f(x_i; \theta).$$
Here the estimate θ̂ (y) is the MLE of θ based on the extended data x1:n and y.
Example 9.2 (Normal distribution) Let X1:n denote a random sample from an
N(μ, σ 2 ) distribution, from which a future observation Y has to be predicted. The
variance σ² is assumed known, the expectation μ unknown. For fixed y, the extended likelihood L(μ, y) is maximised by

$$\hat\mu(y) = \frac{n \bar x + y}{n + 1}.$$
Substituting μ̂(y) into L(μ, y) leads to the predictive likelihood
" #
1 nx̄ + y 2 nx̄ + y 2
Lp (y) = L μ̂(y), y = exp − 2 n x̄ − + y−
2σ n+1 n+1
" 2 2 #
1 y − x̄ y − x̄
= exp − 2 n + n2
2σ n+1 n+1
1 n + n2
= exp − 2 (y − x̄)2
2σ (n + 1)2
1 n
= exp − 2 (y − x̄)2 ,
2σ n + 1
which can be identified as the kernel of a normal density with expectation x̄ and
variance (1 + 1/n)σ 2 . Hence, the likelihood prediction
$$Y \sim \mathrm{N}\bigl(\bar x,\; \sigma^2 (1 + 1/n)\bigr) \qquad (9.2)$$
differs from the plug-in prediction Y ∼ N(x̄, σ 2 ) only through a larger variance. The
likelihood prediction therefore provides somewhat larger prediction intervals since
the uncertainty in the MLE μ̂ML = x̄ has been accounted for.
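For illustration (with made-up numbers, not data from the book), the two 95 % prediction intervals differ only through the factor √(1 + 1/n):

```r
## 95 % plug-in versus likelihood prediction intervals in the normal
## model with known sigma
sigma <- 1; n <- 10; xbar <- 5
(plugin  <- xbar + c(-1, 1) * qnorm(0.975) * sigma)
(likpred <- xbar + c(-1, 1) * qnorm(0.975) * sigma * sqrt(1 + 1 / n))
```

The likelihood interval is wider by the factor √(1 + 1/10) ≈ 1.049, reflecting the estimation uncertainty in x̄.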
cf. Table 1.3. The point prediction is of course equal to 2449.2. Note that
√(1 + 1/185) ≈ 1.003, so very close to one. The 95 % plug-in prediction interval
will therefore be essentially the same.
It is interesting that the lower limit of the above prediction interval for TF is
close to the currently used transformation factor of TF0 = 2000 in Switzerland. So
the factor TF0 = 2000 can be justified as the approximate lower limit of a 95 %
prediction interval, but not as a useful point prediction.
Example 9.4 (Poisson model) We aim to apply the predictive likelihood approach
in Example 9.1. The extended likelihood of λ and y is (after elimination of multiplicative constants) given by

$$L(\lambda, y) = \frac{e_y^{\,y}}{y!}\, \lambda^{x+y} \exp\bigl\{-(e_x + e_y)\lambda\bigr\}.$$
For fixed y, L(λ, y) is maximised by
$$\hat\lambda(y) = \frac{x + y}{e_x + e_y}.$$
Substituting this into L(λ, y) leads to the predictive likelihood
$$L_p(y) = \frac{e_y^{\,y}}{y!} \left(\frac{x + y}{e_x + e_y}\right)^{x+y} \exp\bigl\{-(x + y)\bigr\}.$$
This function can be normalised numerically if the required infinite summation of
Lp (y) is truncated at a sufficiently large upper bound ymax , for example, at ymax =
1000. We then obtain the predictive probability mass function fp (y), from which
the expectation and the variance can be derived. For x = 11 and ex = ey = 3.04,
these are given as 11.498 and 22.998, respectively. The mean and the variance of the
plug-in prediction are both 11, cf. Example 9.1. While the mean of the predictive likelihood
prediction is fairly close to the mean of the plug-in prediction, the variance
of the likelihood prediction is more than twice as large as the variance of the plug-
in prediction. The prediction interval [5, 17] has probability 0.95 for the plug-in
prediction. In contrast, the interval has only probability 0.84 for the likelihood pre-
diction.
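The numerical normalisation just described can be sketched in R; the truncation point ymax = 1000 and the values x = 11, ex = ey = 3.04 are those used in the text, and the computation is done on the log scale for numerical stability:

```r
x <- 11; ex <- 3.04; ey <- 3.04
y <- 0:1000                                 # truncation at ymax = 1000
## log of the predictive likelihood Lp(y), up to an additive constant
logLp <- y * log(ey) - lfactorial(y) +
         (x + y) * log((x + y) / (ex + ey)) - (x + y)
fp <- exp(logLp - max(logLp))
fp <- fp / sum(fp)                          # normalised predictive pmf
m <- sum(y * fp)                            # predictive mean, approx. 11.5
v <- sum((y - m)^2 * fp)                    # predictive variance, approx. 23
```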
Another approach to incorporate the uncertainty of the estimate θ̂ML into the pre-
diction procedure is as follows: Let f (θ̂ML ; θ ) denote the density function of the
ML estimator θ̂ML depending on a random sample X1:n from f (x; θ ) with true but
unknown parameter θ . The idea is now to replace the too optimistic predictive dis-
tribution (9.1) with the density
    g(y; θ) = ∫ f(y; θ̂ML) f(θ̂ML; θ) dθ̂ML    (9.3)
294 9 Prediction
such that θ̂ML has been eliminated from f (y; θ̂ML ) through integration. If the ML
estimator has a discrete distribution, the integral has to be replaced by a sum. The
idea to eliminate the unknown parameter by integration from the likelihood function
can also be motivated through a Bayesian argumentation, see Sect. 9.3.
A shortcoming of the distribution (9.3) is that g(y; θ ) still depends on the true but
unknown parameter θ . A possible remedy to this is to replace θ in g(y; θ ) with the
MLE θ̂ML = θ̂ML (x1:n ) based on the observed realisation x1:n of the random sample
X1:n . This leads to the so-called bootstrap predictive distribution
f (y) = g y; θ̂ML (x1:n ) . (9.4)
The reasons for calling this prediction approach the bootstrap prediction will be
discussed after the following two examples.
Example 9.5 (Normal model) As in Example 9.2, let X1:n denote a random sam-
ple from an N(μ, σ 2 ) distribution with θ = μ unknown and σ 2 known. The ML
estimator μ̂ML = X̄ has the distribution N(μ, σ 2 /n). Note that this is true not only
asymptotically but in this case also for finite samples. Now we have
g(y; μ) = ∫ (2πσ²)^(−1/2) exp{ −(1/2)(y − μ̂ML)²/σ² }
            × (2πσ²/n)^(−1/2) exp{ −(1/2)(μ̂ML − μ)²/(σ²/n) } dμ̂ML

        = C ∫ exp[ −(1/2){ (μ̂ML − y)²/σ² + (μ̂ML − μ)²/(σ²/n) } ] dμ̂ML,

where C = √n/(2πσ²). Setting τ² = σ²/(1 + n) and

    c = τ²(y/σ² + nμ/σ²) = (y + nμ)/(1 + n),

completing the square (cf. Appendix B.1.5) and integrating over μ̂ML shows that g(y; μ) is the density of the N(μ, (1 + 1/n)σ²) distribution. Replacing μ with the MLE x̄ finally gives the bootstrap predictive distribution Y ∼ N(x̄, (1 + 1/n)σ²), which coincides with the likelihood prediction (9.2).
Example 9.6 (Poisson model) Let X ∼ Po(ex λ). Then the MLE is λ̂ML = x/ex. Since X follows the Poisson distribution with parameter ex λ, the probability mass function of λ̂ML follows directly from the Poisson probability mass function: Pr(λ̂ML = x/ex) = Pr(X = x) for x = 0, 1, 2, . . .
Fig. 9.1 Comparison of the probability mass functions of plug-in, likelihood and bootstrap pre-
dictions in the Poisson example (the bars are centred around their x-value)
In our case (ex = ey = 3.04 and λ̂ML = 11/3.04) we therefore have E(Y ) = 11 and
Var(Y ) = 22. The empirical estimates above differ slightly from this due to addi-
tional Monte Carlo error.
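The bootstrap predictive distribution can be simulated along these lines; this is a sketch, with the number of replicates B chosen arbitrarily:

```r
set.seed(1)
x <- 11; ex <- 3.04; ey <- 3.04
lambda.hat <- x / ex                        # MLE from the observed count
B <- 100000
x.star <- rpois(B, ex * lambda.hat)         # parametric bootstrap replicates of X
lambda.star <- x.star / ex                  # re-estimated rates
y.boot <- rpois(B, ey * lambda.star)        # draws from the bootstrap predictive
c(mean(y.boot), var(y.boot))                # close to 11 and 22
```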
Note that the last line follows from the assumption of conditional independence of
X1:n and Y , given θ . Thus, to compute f (y | x1:n ), we simply need to integrate the
product of the likelihood f (y | θ ) and the posterior density f (θ | x1:n ) with respect
to θ . The result f (y | x1:n ) is called the posterior predictive distribution in contrast
to the prior predictive distribution
    f(y) = ∫ f(y | θ) f(θ) dθ.    (9.9)
Compared to Eq. (9.8), the posterior f (θ | x1:n ) has been replaced by the prior f (θ )
to obtain the prior predictive distribution, which plays a central role in Bayesian
model choice, cf. Sect. 7.2.1.
Example 9.7 (Normal model) Let X1:n denote a random sample from an N(μ, σ 2 )
distribution with unknown mean μ and known variance σ 2 . Our goal is to predict
a future independent observation Y = Xn+1 from the same distribution. Using Jeffreys' prior f(μ) ∝ 1 for μ, we know that the posterior is (cf. Example 6.14)

    μ | x1:n ∼ N(x̄, σ²/n).
The posterior predictive distribution of Y | x1:n has therefore the density
f(y | x1:n) = ∫ f(y | μ) f(μ | x1:n) dμ

            ∝ ∫ exp[ −(1/2){ (μ − y)²/σ² + n(μ − x̄)²/σ² } ] dμ.

Using Appendix B.1.5, we have

    (μ − y)²/σ² + n(μ − x̄)²/σ² = C(μ − c)² + (y − x̄)²/{(1 + 1/n)σ²},

where C = (n + 1)/σ² and c = (y + nx̄)/(n + 1). Note that the first term in this sum is a quadratic form in μ, whereas the second term does not depend on μ. It then follows that

    f(y | x1:n) ∝ exp{ −(1/2)(y − x̄)²/((1 + 1/n)σ²) } ∫ exp{ −(C/2)(μ − c)² } dμ

                ∝ exp{ −(1/2)(y − x̄)²/((1 + 1/n)σ²) },

since the remaining integral equals √(2π/C) = √(2π) σ/√(n + 1) and does not depend on y.
9.3 Bayesian Prediction 299
This is the kernel of the normal density with mean x̄ and variance σ²(1 + 1/n), so the posterior predictive distribution is

    Y | x1:n ∼ N(x̄, σ²(1 + 1/n)).    (9.10)
Note that the plug-in prediction is also normal with mean x̄, but has the smaller variance σ². The likelihood and the bootstrap predictions, in contrast, give exactly the same result as (9.10), cf. Examples 9.2 and 9.5.
We note that if the variance is also unknown, application of a reference prior f(μ, σ²) ∝ σ⁻² leads to a t distribution as posterior predictive distribution:

    Y | x1:n ∼ t( x̄, (1 + 1/n) · Σ_{i=1}^{n} (xi − x̄)²/(n − 1), n − 1 ),

see Exercise 2.
Example 9.8 (Blood alcohol concentration) Suppose that instead of Jeffreys’ prior
as in Example 9.7, we now use a normal prior μ ∼ N(ν, δ −1 ) for the mean μ of a
normal random sample with known variance σ². The posterior predictive distribution is then

    Y | x1:n ∼ N( E(μ | x1:n), σ²(δσ² + n + 1)/(δσ² + n) ).    (9.11)
Here E(μ | x1:n) denotes the posterior mean of μ as derived in Example 6.8. Note that the posterior predictive variance in (9.11) is always larger than σ² and reduces to σ²(1 + 1/n) for δ = 0 as it should, cf. Eq. (9.10).
We now compute the posterior predictive distribution for the transformation fac-
tor both separately for females and males and overall, using a μ ∼ N(ν, δ −1 ) prior
and with known standard deviation σ = 237.8. The mean and standard deviation of
the posterior predictive normal distribution (9.11) are shown in Table 9.1.
Assume that X1:n is a random sample from the B(π) distribution. Then

    X = Σ_{i=1}^{n} Xi ∼ Bin(n, π).

Under a uniform prior π ∼ Be(1, 1), the posterior is π | x ∼ Be(x + 1, n − x + 1), so the posterior predictive probability that a further observation Y = Xn+1 is a success equals the posterior mean:

    Pr(Y = 1 | x) = (x + 1)/(n + 2),

and therefore

    Pr(Y = 0 | x) = (n − x + 1)/(n + 2).
For example, suppose that the sun has risen in the past x = n = 1 000 000 days, say.
The probability that it will not rise tomorrow (without additional information) is
    Pr(Y = 0 | x) = 1/1 000 002 ≈ 10⁻⁶.
Example 9.10 (Sequential analysis of binary outcomes from a clinical trial) Sup-
pose that a clinical trial is conducted where the primary outcome is the success rate
of a novel therapy. The therapy comes with certain side effects, so it is generally
agreed upon that a success probability below 0.5 cannot be justified for individual
therapy. However, there is some optimism that the success rate π is around 0.75,
and this is reflected in a Be(6, 2) prior for π .
The outcomes Y from patients enrolled enter sequentially in the order shown in
Table 9.2, with success coded as 1 if the patient had a successful therapy and 0
otherwise. The predictive probability Pr(Y = 1 | data so far), the probability that the
therapy is successful for a future patient given the data so far, may now be used
as an early stopping criterion: if it falls below 0.5, then the trial must be stopped.
Fortunately, this is not the case for the data shown in Table 9.2. This scenario is of
course somewhat simplistic but contains important features of a sequential analysis
of clinical trials using predictive probabilities.
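The stopping criterion can be sketched in R. The outcome sequence below is hypothetical, since Table 9.2 is not reproduced here; with a Be(6, 2) prior and s successes among the first k patients, Pr(Y = 1 | data so far) = (6 + s)/(8 + k):

```r
a <- 6; b <- 2                                    # Be(6, 2) prior for pi
y <- c(1, 0, 1, 1, 0, 1, 1, 1)                    # hypothetical outcomes, not Table 9.2
pred <- (a + cumsum(y)) / (a + b + seq_along(y))  # Pr(Y = 1 | data so far)
round(pred, 3)
any(pred < 0.5)                                   # TRUE would trigger early stopping
```

For this hypothetical sequence, the predictive probability never falls below 0.5, so the trial would continue.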
In Sect. 7.2.1 we noted that the prior predictive distribution (9.9) is simply the de-
nominator in Bayes’ theorem; it can therefore be calculated if likelihood, prior and
posterior are known:
    f(y) = f(y | θ) f(θ) / f(θ | y),
which holds for any θ . A similar formula also holds for the posterior predictive
distribution:
    f(y | x) = f(y | x, θ) f(θ | x) / f(θ | x, y)
             = f(y | θ) f(θ | x) / f(θ | x, y),
where the last equation follows from conditional independence of X and Y given θ .
If f (θ ) is conjugate with respect to f (x | θ ), then f (θ | x) and f (θ | x, y) belong
to the same family of distributions considered. The posterior predictive distribution
can then be derived without explicit integration. We will illustrate the procedure in
the following example.
Example 9.11 (Poisson model) Let X ∼ Po(ex λ), Y ∼ Po(ey λ) and λ ∼ G(α, β) a
priori, with X and Y being conditionally independent. From Example 6.30 we know that λ | x ∼ G(α̃, β̃), where α̃ = α + x and β̃ = β + ex, and consequently

    λ | x, y ∼ G(α̃ + y, β̃ + ey).

Hence

    f(y | x) = f(y | λ) f(λ | x) / f(λ | x, y)

             = { (ey λ)^y/y! · exp(−ey λ) } · { β̃^α̃/Γ(α̃) · λ^(α̃−1) exp(−β̃λ) }
               / { (β̃ + ey)^(α̃+y)/Γ(α̃ + y) · λ^(α̃+y−1) exp(−(β̃ + ey)λ) }

             = { β̃^α̃/Γ(α̃) } · { ey^y/y! } · Γ(α̃ + y) · (β̃ + ey)^(−(α̃+y)),
i.e. the Poisson-gamma distribution with parameters α̃, β̃ and ey , compare Ap-
pendix A.5.1.
An equivalent way to obtain the posterior predictive distribution is to first iden-
tify the prior predictive distribution f (y) as the Poisson-gamma distribution with
parameters α, β and ex . The posterior predictive distribution f (y | x) can then be
obtained by replacing the prior parameters α and β by the posterior parameters α̃
and β̃, respectively, and ex by ey , compare (9.9) with (9.8).
The mean and variance of this posterior predictive distribution are

    E(Y | x) = ey (α + x)/(β + ex)    and
    Var(Y | x) = ey (α + x)/(β + ex) · { 1 + ey/(β + ex) }.

Under Jeffreys' prior, i.e. α = 1/2 and β = 0 (cf. Table 6.3), these formulas simplify to

    E(Y | x) = (ey/ex)(x + 1/2)    and
    Var(Y | x) = (ey/ex)(1 + ey/ex)(x + 1/2).
In our original example with x = 11, ex = ey = 3.04 we obtain E(Y | x) = 11.5 and
Var(Y | x) = 23. Table 9.3 contains the mean and variance of all predictive distri-
butions considered for the Poisson model example. The results of likelihood and
Bayesian approaches to prediction are very close, whereas the bootstrap prediction
leads to somewhat smaller values of the predictive mean and variance. Also, the pre-
dictive mean of all prediction methods is close to the mean of the plug-in predictive
distribution, but the variance is nearly twice as large. For x = 0, the Bayesian ap-
proach gives E(Y | x) = 0.5 and Var(Y | x) = 1, again in sharp contrast to the plug-in
predictive distribution with mean and variance equal to zero.
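These moments are easy to reproduce numerically. The sampling check below uses the standard negative binomial representation of the Poisson-gamma distribution (size α + x, success probability (β + ex)/(β + ex + ey)); this identity is assumed here rather than taken from the text:

```r
x <- 11; ex <- 3.04; ey <- 3.04
alpha <- 0.5; beta <- 0                     # Jeffreys' prior G(1/2, 0)
m <- ey * (alpha + x) / (beta + ex)         # posterior predictive mean: 11.5
v <- m * (1 + ey / (beta + ex))             # posterior predictive variance: 23
set.seed(1)
y.sim <- rnbinom(100000, size = alpha + x,
                 prob = (beta + ex) / (beta + ex + ey))
c(mean(y.sim), var(y.sim))                  # close to m and v
```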
In Sect. 7.2.4 Bayesian model averaging has been introduced, which allows one
to combine estimates from different models with respect to their posterior proba-
bilities. Analogously, it is possible to combine different predictions from different
models: Let M1 , . . . , MK denote the models considered and f (y | x, Mk ) the pos-
terior predictive density in model Mk , then the model-averaged posterior predictive
density is
    f(y | x) = Σ_{k=1}^{K} f(y | x, Mk) · Pr(Mk | x),    (9.12)
where the posterior model probabilities Pr(Mk | x) appear as weights, similar to Sect. 7.2.4. Prediction based on model averaging therefore takes model uncertainty into account.
    E(y | x) = Σ_{k=1}^{2} E(y | x, Mk) · Pr(Mk | x)
is equal to the model-averaged mean transformation factor shown in Table 7.3. Note
that the posterior predictive distribution (9.12) is now a mixture of two normal dis-
tributions, with mixture weights equal to Pr(M1 | x) and Pr(M2 | x). The mean and
standard deviation of the two normal components can be read off from Table 9.1.
Predictions based on model averaging are more than a technical gadget using
elementary probability. Frequently, one has to assume that a statistical model does
not reflect the truth entirely, cf. the discussion at the end of Sect. 7.2.4. Even if there is a correct model, it is not certain that it has been considered as a candidate model. Under uncertainty about which model fits the underlying data, predictions based on model averaging will generally provide better results than predictions based on a single model. The mean

    E(y | x) = Σ_{k=1}^{K} E(y | x, Mk) · Pr(Mk | x)

of the posterior predictive distribution (9.12) minimises the expectation of the posterior predictive squared error loss l(a, y) = (a − y)², i.e.

    E(y | x) = arg min_a ∫ l(a, y) f(y | x) dy.
The proof is similar to the proof that the posterior mean minimises the expected
squared error loss, see Sect. 6.4.1 for comparison.
In order to judge the performance of a statistical model, the actual observations are
compared with the corresponding predictive distributions. At least two aspects are
of importance in this comparison: discrimination and calibration. Discrimination
describes how well the model is able to predict different observations with different
predictions. Point predictions are central to discrimination, whereas the uncertainty
of the prediction may not be considered at all. Calibration, however, takes into ac-
count the entire predictive distribution in the sense of a statistical agreement with
the actual observations. For example, there should be on average only five out of
a hundred observations outside a 95 % prediction interval if the prediction is cali-
brated.
Any further discussion of these concepts heavily depends on whether the predic-
tive distribution is discrete or continuous. To begin with, we will consider (multiple)
binary predictive distributions f (yi ) (indexed with i), which are completely charac-
terised by the corresponding prediction probabilities Pr(Yi = 1) = πi . Subsequently,
we will discuss the univariate continuous case.
For a binary variable Yi ∈ {0, 1}, perfect agreement between the prediction probabilities πi ∈ [0, 1] and the actually observed realisations yi ∈ {0, 1} is not achievable. The discrimination of a binary prediction describes in this case the capacity to separate observations with yi = 1 from those with yi = 0 through the prediction probabilities.
9.4 Assessment of Predictions 305
By far the most frequently used measure of discrimination for binary predictions
is the so-called area under the curve (AUC), also called c-index, which is usually
defined as the area under the ROC curve, where ROC stands for “receiver operating
characteristic”, a term commonly used in signal detection theory. In general, AUC
is a number between zero and one, where only values above 0.5 reflect a certain
quality of classification. AUC can also be defined in a different way:
Definition 9.3 (AUC) AUC is the probability that a randomly chosen event i, which actually occurred (yi = 1), has a larger prediction probability than another randomly chosen event j, which actually did not occur (yj = 0):

    AUC = Pr(πi > πj).

In the case that different events may have the same probabilities, this definition has to be extended to

    AUC = Pr(πi > πj) + (1/2) Pr(πi = πj).
Note that AUC does not necessarily have to be defined through the probabilities πi, since one could alternatively use any strictly monotone transformation of them such as, for example, logit(πi). AUC can be estimated by the Wilcoxon rank sum statistic.
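The Wilcoxon-based estimate can be sketched as follows; the prediction probabilities and outcomes are simulated purely for illustration:

```r
set.seed(1)
pi.hat <- runif(200)                        # hypothetical prediction probabilities
y <- rbinom(200, size = 1, prob = pi.hat)   # simulated binary outcomes
r <- rank(pi.hat)                           # midranks would also handle ties
n1 <- sum(y == 1); n0 <- sum(y == 0)
## Wilcoxon/Mann-Whitney estimate of AUC = Pr(pi_i > pi_j) + 0.5 Pr(pi_i = pi_j)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc                                         # clearly above 0.5 here
```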
To empirically assess the calibration of binary predictions, the probabilities have to be grouped. In practice, identical or at least very close prediction probabilities are combined into J groups with representative probabilities π1, . . . , πJ. The groups should be of roughly the same size, and observations with the same prediction probability should not be in different groups. For each πj, let nj denote the number of predictions in the j-th group and ȳj the relative frequency of the predicted event in that group. The total number of predictions is denoted by N = Σ_{j=1}^{J} nj. Calibration can then be assessed by comparing πj with ȳj through

    SC = (1/N) Σ_{j=1}^{J} nj (πj − ȳj)²,

and discrimination through the resolution

    MR = (1/N) Σ_{j=1}^{J} nj (ȳj − ȳ)²,

where ȳ = N⁻¹ Σ_{i=1}^{N} yi denotes the overall prevalence.
MR is a measure that should be as large as possible. Since it does not take into account the order of the predictions, it is much less commonly used than AUC. In fact, a prediction which always classifies incorrectly will have the largest possible value of MR. Limited to predictions with AUC ≥ 0.5, it is nevertheless a sensible measure, and it will be revisited in Sect. 9.4.2.
The values of MR are, however, identical for the expert, oracle and inverted oracle predictions, since MR assesses only discrimination but not the direction of classification. The values of the Brier score BS in the last column are discussed in Example 9.16.
Next, we will discuss continuous predictions with density f (y) and correspond-
ing distribution function F (y). In this case the following quantity is often used as a
calibration measure:
Definition 9.6 (PIT) The probability integral transform (PIT) is the value of the
predictive distribution function F (y) = Pr(Y ≤ y) at the actually observed value yo :
PIT(yo ) = F (yo ).
Example 9.15 (Normal model) Suppose X and Yo are independent, normally dis-
tributed random variables, both with unknown expectation μ and known vari-
ance σ 2 = 1. We want to predict Yo after observing X = x. The plug-in predic-
tion is in this case Y | X = x ∼ N(x, 1), whereas the Bayesian prediction with
Jeffreys’ prior for μ (just as the likelihood or the bootstrap prediction) predicts
Y | X = x ∼ N(x, 2), cf. Example 9.7.
Based on M = 10 000 independent realisations of X and Yo (with true μ = 0),
PIT values for the plug-in and the Bayesian prediction have been computed as fol-
lows and are shown in Fig. 9.2:
set.seed(1)
M <- 10000
x <- rnorm(M)
y <- rnorm(M)
pit.plugin <- pnorm(y, mean = x, sd = 1)
pit.bayes <- pnorm(y, mean = x, sd = sqrt(2))
The plug-in prediction is obviously not well calibrated since the corresponding
PIT histogram in Fig. 9.2 does not resemble a uniform distribution but has a bowl
shape, which is typical for predictions with too small variances. The Bayesian pre-
diction Y | X = x ∼ N(x, 2) seems to be well calibrated, which can be confirmed
Fig. 9.2 PIT histograms of the plug-in and the Bayesian prediction of Yo ∼ N(μ, 1) based on an
independent realisation of X ∼ N(μ, 1)
    PIT(yo) = Pr(Y ≤ yo)
            = Pr( (Y − x)/√2 ≤ (yo − x)/√2 )
            = Φ( (yo − x)/√2 ),
where Φ(·) denotes the distribution function of the standard normal distribution.
Now x is a realisation of X ∼ N(μ, 1), and yo is a realisation of Yo ∼ N(μ, 1). We have Z = (Yo − X)/√2 ∼ N(0, 1) and therefore Φ(Z) ∼ U(0, 1). Viewed as a function of the random variables X and Yo, we thus have

    PIT(Yo) = Φ( (Yo − X)/√2 ) ∼ U(0, 1).
Note that, by definition, the oracle prediction Y ∼ N (0, 1), which is equal to the
true data-generating distribution of Yo , also has uniform PIT values.
As for binary events, perfect calibration is not sufficient for a good prediction. In
the above example both the Bayesian and oracle predictions are well calibrated, but
common sense suggests that the oracle prediction is better. Indeed, another aspect
of a good continuous prediction is sharpness, i.e. how concentrated a predictive dis-
tribution is. The oracle has smaller variance (or higher sharpness) than the Bayesian
prediction, so should be preferred.
For the assessment of the accuracy of continuous predictions, the focus is often
on the point prediction Ŷ , which is in most cases the expectation E(Y ) of the pre-
dictive distribution. A very frequently used criterion is the squared prediction error
SPE = (Ŷ − yo )2 .
Definition 9.7 (Scoring rules) A scoring rule S(f (y), yo ) assigns a real number to
the probability mass or density function f (y) of a predictive distribution and the
actually observed value yo .
Scoring rules are typically negatively oriented, i.e. smaller values of S(f (y), yo )
reflect a better prediction. It is reasonable to only use proper scoring rules, defined
as follows.
Definition 9.8 (Proper scoring rule) A scoring rule S(f (y), yo ) is called proper if
the expected score E{S(f (y), Yo )} with respect to the true data-generating distri-
bution Yo ∼ fo is minimised if the predictive distribution f is equal to the data-
generating distribution fo . If the minimum is unique, then the scoring rule is called
strictly proper.
Definition 9.9 (Scoring rules for binary predictions) Let Y ∼ B(π) be the predictive distribution for a binary event, i.e.

    f(y) = π for y = 1 and f(y) = 1 − π for y = 0.
The Brier score BS, the absolute score AS and the logarithmic score LS are defined as

    BS(f(y), yo) = (yo − π)²,    (9.13)
    AS(f(y), yo) = |yo − π| and    (9.14)
    LS(f(y), yo) = − log f(yo),    (9.15)

respectively.
Result 9.1 (Brier score) The Brier score (9.13) is strictly proper.
Proof To show Result 9.1, let B(πo) be the true distribution of Yo. Then the expected Brier score is given by

    E{BS(f(y), Yo)} = E{(Yo − π)²}
                    = E(Yo²) − 2π E(Yo) + π²
                    = E(Yo) − 2ππo + π²
                    = πo − 2ππo + π²,

where E(Yo²) = E(Yo) = πo since Yo is binary. Hence, we have

    d E{BS(f(y), Yo)}/dπ = −2πo + 2π,

from which we derive the root π = πo. Inspecting the second derivative

    d² E{BS(f(y), Yo)}/dπ² = 2 > 0

shows that the minimum is unique. Using the true success probability πo as predictive probability hence gives the minimal expected score.
In the following we decompose the mean Brier score BS, averaged over a series of binary predictions, into two terms measuring calibration and discrimination, respectively. To do so, we consider a series of N binary predictions with predictive probabilities π1, . . . , πN. The corresponding observed events are denoted by y1, . . . , yN, and ȳ denotes the overall prevalence of the observed binary events.
We first note that the Brier score of the prevalence prediction is ȳ(1 − ȳ) and may be used as an upper bound for useful predictions. In Murphy's decomposition the predictions are first grouped as in Sect. 9.4.1, leading to the decomposition

    BS = (1/N) Σ_{i=1}^{N} (yi − πi)² = ȳ(1 − ȳ) − MR + SC    (9.16)
of the mean Brier score. This shows explicitly that the Brier score assesses both
discrimination and calibration, through MR and SC, respectively.
Example 9.16 (Prediction of soccer matches) In the last column of Table 9.4 the
Brier score of each prediction is given. The Brier score orders the four predictions
by combining discrimination and calibration in a sensible way: the oracle prediction
is the best, followed by the expert and the prevalence prediction, and the inverted
oracle prediction is the worst.
Note that the overall prevalence of matches won by the home team is 50 %, so
the upper bound on the Brier score is 0.25, and this is of course the Brier score
of the prevalence prediction. It is easily confirmed that Murphy’s decomposition is
fulfilled for all predictions considered.
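Murphy's decomposition can be confirmed numerically along the same lines; the grouped prediction probabilities below are arbitrary illustrative values, and MR and SC are computed as the resolution and calibration terms of the decomposition:

```r
set.seed(1)
pi.g <- c(0.1, 0.3, 0.5, 0.7, 0.9)          # representative group probabilities
n.g <- rep(40, 5)                           # group sizes
g <- rep(seq_along(pi.g), n.g)              # group index of each prediction
pi.i <- pi.g[g]
y <- rbinom(length(g), size = 1, prob = pi.i)
N <- length(y); ybar <- mean(y)
ybar.g <- tapply(y, g, mean)                # observed relative frequencies per group
BS <- mean((y - pi.i)^2)                    # mean Brier score
MR <- sum(n.g * (ybar.g - ybar)^2) / N      # resolution
SC <- sum(n.g * (pi.g - ybar.g)^2) / N      # calibration
all.equal(BS, ybar * (1 - ybar) - MR + SC)  # decomposition (9.16) holds exactly
```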
Result 9.2 (Absolute score) The absolute score (9.14) is not proper.
Proof To show this result, let B(πo) be the true distribution of Yo. Then the expected absolute score is

    E{AS(f(y), Yo)} = E|Yo − π|
                    = (1 − π)πo + π(1 − πo)
                    = π(1 − 2πo) + πo.

This is linear in π and is minimised by π = 0 if πo < 1/2 and by π = 1 if πo > 1/2, rather than by π = πo, so the absolute score is not proper. For πo = 1/2, the expected score is 1/2 and hence independent of π.
Result 9.3 (Logarithmic score) The logarithmic score (9.15) is strictly proper.
Proof To show the propriety of the logarithmic score, let again B(πo) be the true distribution of Yo. The expected logarithmic score is given by

    E{LS(f(y), Yo)} = −E{log f(Yo)}
                    = −πo log(π) − (1 − πo) log(1 − π).

Hence, we have

    d E{LS(f(y), Yo)}/dπ = −πo/π + (1 − πo)/(1 − π),

from which we derive the root π = πo. Through inspection of the second derivative we conclude that this minimum is unique.
The logarithmic score is strictly proper not only for binary events but for ar-
bitrary probability mass or density functions of Y . For example, if the predictive
distribution is normal, i.e. Y ∼ N(μ, σ 2 ), then the logarithmic score is
    LS(f(y), yo) = (1/2){ log(2π) + log σ² + (yo − μ)²/σ² }.
As an alternative, one can use the continuous ranked probability score (CRPS),
which is defined as
    CRPS(f(y), yo) = ∫ { F(t) − I[yo,∞)(t) }² dt
and which is also strictly proper for arbitrary predictions with distribution function
F (y). The CRPS is closely related to the Brier score (9.13). For fixed t, we have
    I[yo,∞)(t) = 1 for yo ≤ t and 0 for yo > t,
with corresponding success probability Pr(Y ≤ t) = F (t). So {F (t) − I[yo ,∞) (t)}2
is a Brier score, and the CRPS is the integral of the Brier score over all possible
thresholds t.
It is possible to show that the CRPS can be written as
    CRPS(f(y), yo) = E|Y1 − yo| − (1/2) E|Y1 − Y2|,
where Y1 and Y2 are independent random variables with density function f (y) or
distribution function F (y). This representation allows a simplification of the for-
mula for the CRPS for certain predictive distributions. For example, for a normally
distributed prediction Y ∼ N(μ, σ 2 ), the CRPS is
    CRPS(f(y), yo) = σ[ ỹo{2Φ(ỹo) − 1} + 2ϕ(ỹo) − 1/√π ],    (9.17)
where ỹo = (yo − μ)/σ , while ϕ(·) and Φ(·) denote the density and distribution
functions, respectively, of the standard normal distribution. A derivation of this re-
sult is discussed in Exercise 6.
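Formula (9.17) can be checked against the representation via E|Y1 − yo| − ½ E|Y1 − Y2| by Monte Carlo; μ = 1, σ = 2 and yo = 0.5 are arbitrary illustrative values:

```r
crps.norm <- function(yo, mu, sigma) {      # closed form (9.17)
  z <- (yo - mu) / sigma
  sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}
set.seed(1)
y1 <- rnorm(100000, mean = 1, sd = 2)       # two independent samples from the
y2 <- rnorm(100000, mean = 1, sd = 2)       # predictive distribution N(1, 4)
mc <- mean(abs(y1 - 0.5)) - 0.5 * mean(abs(y1 - y2))
c(crps.norm(0.5, 1, 2), mc)                 # the two values agree closely
```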
Example 9.17 (Normal model) For the predictions of Example 9.15, we calculated
the logarithmic score and the CRPS for each of the M = 10 000 independent reali-
sations of X and Yo and averaged these values afterwards. Table 9.5 shows that the
plug-in prediction performs worse for both scores and that the Bayesian prediction
is always better. This is due to the lack of calibration of the plug-in prediction. The
oracle prediction performs best. Table 9.5 also gives the averaged squared prediction
error, which is the same for the plug-in and Bayesian predictions since both provide
the same point predictions. Differences in the predictive variances are not taken into
account. Again, the oracle prediction performs best.
Table 9.5 Comparison of the different predictions in the normal model example with respect
to the logarithmic score (LS), the continuous ranked probability score (CRPS) and the squared
prediction error (SPE)
Prediction LS CRPS SPE
Plug-in 1.9172 0.9196 1.9966
Bayesian 1.7647 0.7950 1.9966
Oracle 1.4097 0.5546 0.9816
9.5 Exercises
(a) provide an expression for the likelihood L(π) for this study and
(b) specify a conjugate prior distribution f (π) for π and choose appropriate
values for its parameters. Using these parameters, derive the posterior
distribution f (π | n, y).
(c) A sixth physician wants to participate in the study with n6 = 5 patients.
Determine the posterior predictive distribution for y6 (the number of pa-
tients out of the five for which the medication will have a positive effect).
(d) Calculate the likelihood prediction as well.
2. Let X1:n be a random sample from an N(μ, σ 2 ) distribution, from which a
further observation Y = Xn+1 is to be predicted. Both the expectation μ and
the variance σ 2 are unknown.
(a) Start by determining the plug-in predictive distribution.
(b) Calculate the likelihood and bootstrap predictive distributions.
(c) Derive the Bayesian predictive distribution under the assumption of the
reference prior f (μ, σ 2 ) ∝ σ −2 .
3. Derive Eq. (9.11).
4. Prove Murphy’s decomposition (9.16) of the Brier score.
5. Investigate whether the scoring rule

       S(f(y), yo) = −f(yo)

   is a proper scoring rule.
9.6 References
10 Markov Models for Time Series Analysis

Contents
10.1 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
10.2 Observation-Driven Models for Categorical Data . . . . . . . . . . . . . . . . . 316
10.2.1 Maximum Likelihood Inference . . . . . . . . . . . . . . . . . . . . . . 317
10.2.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
10.2.3 Inclusion of Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.3 Observation-Driven Models for Continuous Data . . . . . . . . . . . . . . . . . 321
10.3.1 The First-Order Autoregressive Model . . . . . . . . . . . . . . . . . . 321
10.3.2 Maximum Likelihood Inference . . . . . . . . . . . . . . . . . . . . . . 322
10.3.3 Inclusion of Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.3.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
10.4 Parameter-Driven Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.4.1 The Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.4.2 The Posterior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 329
10.5 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
10.5.1 The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
10.5.2 Bayesian Inference for Hidden Markov Models . . . . . . . . . . . . . . 335
10.6 State Space Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
10.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
The statistical analysis of time series is concerned with data which consists of time-
ordered sequences of measurements. Such a sequence is usually assumed to be
equally spaced, in which case the distance between two successive observations
is always constant, for example one day. Otherwise a time series is called unequally
spaced.
To introduce some notation, let x = (x1 , . . . , xn ) denote a time series of observa-
tions xt made at times t = 1, . . . , n. If the series is not equally spaced, then x(ti ) is
the more appropriate notation for the observation made at time ti , i = 1, . . . , n.
The purpose of this chapter is to illustrate how likelihood and Bayesian inference
can be employed in the statistical analysis of time series. We do not aim to provide
a comprehensive overview of statistical models for time series. Instead we try to
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 315
https://doi.org/10.1007/978-3-662-60792-3_10,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
316 10 Markov Models for Time Series Analysis
discuss a number of important approaches for time series analysis which are used
heavily in various biomedical applications, restricting our attention to models for
equally-spaced time series.
We distinguish between observation-driven and parameter-driven models for
time series. Both classes aim to take into account dependence between successive
observations of a time series. An observation-driven model relates the distribution
of the response variable xt directly to the p last observations xt−1 , . . . , xt−p . Un-
known parameters in an observation-driven model are typically global, i.e. do not
depend on time. Here likelihood inference is the inferential method of choice.
In contrast, a parameter-driven model assumes that the observations are condi-
tionally independent given some latent unobserved process, which is typically al-
lowed to change over time. The time-dependent latent process is unknown as well
as additional parameters determining the dynamics of this process. Here empirical
Bayes and fully Bayes approaches to inference provide useful alternatives to a pure
likelihood analysis.
Definition 10.1 (Markov property) The process X satisfies the Markov property if
Pr(Xt = i | X1 = x1 , X2 = x2 , . . . , Xt−1 = xt−1 ) = Pr(Xt = i | Xt−1 = xt−1 )
(10.1)
for all t ≥ 2 and all states i, x1 , . . . , xt−1 ∈ S. We then call X a (first-order) Markov
chain.
The Markov property implies that, conditional on all observations in the past, the
distribution of Xt depends only on the last observation xt−1 . If X were a second-
order Markov chain, the conditional distribution of Xt would only depend on xt−1
and xt−2 . This can be easily generalised to k-th order Markov chains. We can also
consider a continuous Markov chain with real-valued state space S, in which case
the Markov property (10.1) is formulated in terms of conditional density functions:
f (xt | X1 = x1 , X2 = x2 , . . . , Xt−1 = xt−1 ) = f (xt | Xt−1 = xt−1 ).
One important feature of Markov chains is that under regularity conditions there exists a stationary distribution π (with entries πi = Pr(Xt = i)) with the defining feature π⊤ = π⊤P. This formula can be nicely interpreted as follows: if Xt−1 has some distribution π, then the distribution of Xt is π⊤P. If this distribution is identical to π (the distribution of Xt−1), then the distributions of Xt+1, Xt+2, etc. will also be π, so π is the stationary distribution of the Markov chain. The stationary distribution can be computed via

    π⊤ = 1⊤(I − P + J)⁻¹,    (10.2)

where 1 is a column vector of ones, I is the identity matrix and J is a matrix of only ones, both I and J being of dimension K × K. If there are only K = 2 states, the stationary distribution can be computed simply via

    π = ( p21/(p12 + p21), p12/(p12 + p21) )⊤.
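Formula (10.2) is easy to verify numerically; the transition matrix below is a hypothetical example:

```r
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)  # hypothetical transition matrix
K <- nrow(P)
one <- rep(1, K)
## stationary distribution via formula (10.2)
pi.stat <- as.vector(one %*% solve(diag(K) - P + matrix(1, K, K)))
pi.stat                                     # (0.75, 0.25) for this P
as.vector(pi.stat %*% P)                    # stationarity: equals pi.stat again
```

For K = 2 the result agrees with the explicit formula c(P[2, 1], P[1, 2]) / (P[1, 2] + P[2, 1]).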
The conditional likelihood of P given the first observation x1 is Lc(P) = Π_{i,j} pij^{nij}, where nij denotes the observed number of transitions from state i to state j in x. The corresponding log-likelihood

    lc(P) = Σ_{i,j} nij log(pij)
can now be maximised under the restriction that all rows of P must sum to one. Using the Lagrange method (see Appendix B.2.5), this is equivalent to maximising the function
l*_c(P) = l_c(P) − ∑_{i=1}^{K} λ_i ( ∑_{j=1}^{K} p_ij − 1 ) = ∑_{i,j} n_ij log(p_ij) − ∑_{i=1}^{K} λ_i ( ∑_{j=1}^{K} p_ij − 1 ).
Setting the partial derivatives ∂l*_c(P)/∂p_ij = n_ij/p_ij − λ_i to zero gives p_ij = n_ij/λ_i. The constraint ∑_j p_ij = 1 then implies

1 = ∑_j n_ij / λ_i,

so

λ_i = ∑_j n_ij = n_i,
where ni denotes the number of observations of (x1 , . . . , xn−1 ) (ignoring the n-th
observation) in state i. We can now re-write the ML estimates as
p̂_ij = n_ij / n_i.
This is an intuitive result: The ML estimates of the transition probabilities p_ij are simply the empirical proportions of observed transitions from i to j among all transitions from i to any state in {1, . . . , K} (including a “stay” at state i). In analogy to Example 4.22, the corresponding standard errors are equal to √( p̂_ij (1 − p̂_ij) / n_i ).
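In code, the conditional ML estimates and their standard errors follow directly from the transition counts. This Python sketch uses a short made-up sequence, not the REM data of the examples below:

```python
# Conditional ML estimation of a K-state transition matrix: count the
# transitions n_ij, estimate p_ij by n_ij / n_i, and attach the standard
# errors sqrt(p_ij (1 - p_ij) / n_i). The sequence x is illustrative.
from math import sqrt
from collections import Counter

x = [1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1]
K = 2
n = Counter(zip(x[:-1], x[1:]))  # n[(i, j)] = number of transitions i -> j
n_i = {i: sum(n[(i, j)] for j in range(1, K + 1)) for i in range(1, K + 1)}

p_hat = {(i, j): n[(i, j)] / n_i[i]
         for i in range(1, K + 1) for j in range(1, K + 1)}
se = {(i, j): sqrt(p_hat[(i, j)] * (1 - p_hat[(i, j)]) / n_i[i])
      for (i, j) in p_hat}

for i in range(1, K + 1):
    print([round(p_hat[(i, j)], 3) for j in range(1, K + 1)])
```

Each estimated row sums to one by construction, mirroring the constrained maximisation above.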
Incorporation of the likelihood of the first observation x1 complicates ML estima-
tion. If we assume that X1 is a realisation from some arbitrary initial distribution γ ,
then there is only one observation to estimate γ . A useful estimate of γ can rarely be
obtained from one observation, so γ is typically assumed to be completely known.
Alternatively one may assume that γ = π , i.e. the initial distribution of X1 equals
the stationary distribution π of the Markov chain. The initial distribution is then a
function of the transition matrix P, so again only P needs to be estimated. However,
the additional term log π_{x₁} has to be added to the conditional log likelihood l_c(P), where π_i denotes the i-th component of the stationary distribution π = π(P). Numerical
techniques are now typically needed to compute the full ML estimates. Those will
in general (and in particular if n is large) not differ much from the conditional ML
estimates, which serve as suitable starting values for optimisation.
10.2 Observation-Driven Models for Categorical Data 319
Example 10.1 (REM data) Here we present an analysis of a binary time series of
length n = 120 minutes representing an infant’s sleep pattern. The outcome variable
xt reports if an infant was judged to be in rapid eye movement (REM) sleep (xt = 2)
during minute t, xt = 1 otherwise. The data are shown in Fig. 10.1. Conditional and
full ML estimates (based on the assumption that the initial distribution equals the
stationary distribution) of the diagonal entries of the 2 × 2 transition matrix P are
given in Table 10.1.
10.2.2 Prediction
The Markov property implies that the k-step forecast distribution depends on the observed time series only through the last observation:

Pr(Xn+k | x1, . . . , xn) = Pr(Xn+k | xn).
Consider first the one-step forecast distribution of Xn+1 | xn , i.e. k = 1. If the tran-
sition matrix P is known, then the distribution of Xn+1 | xn = i is given by the i-th
row of P. This corresponds to the plug-in prediction discussed in Chap. 9 as the
estimated transition matrix P̂ is assumed to equal the true transition matrix. Incorporating the uncertainty in the estimation of P requires more effort, but the difference between the two resulting predictions will typically be small.
Similarly, the k-step plug-in forecast distribution can be calculated with the so-called Chapman–Kolmogorov equations:

P_k = ∏_{i=1}^{k} P.

Here P_k is the matrix with entries Pr(Xn+k | Xn), and the right-hand side refers to the matrix product, e.g.

P_2 = ∏_{i=1}^{2} P = P · P.
The i-th row of Pk is the k-step forecast distribution of Xn+k | xn = i.
Example 10.2 (REM data) The estimated transition matrix (using the full likeli-
hood) is
P̂ = [ 0.775  0.225 ; 0.266  0.734 ].    (10.3)
The last observation of the observed sequence was x120 = 2, so the plug-in one-step forecast distribution is the second row of P̂, i.e. Pr(X121 = 1 | x120 = 2) = 0.266 and
Pr(X121 = 2 | x120 = 2) = 0.734. The estimated transition matrix P̂2 of the 2-step
(plug-in) forecast distribution is
P̂_2 = P̂ · P̂ = [ 0.660  0.340 ; 0.401  0.599 ]
with the second row relevant in our case due to x120 = 2, e.g. Pr(X122 = 1 | x120 = 2) = 0.401.
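The matrix powers and their convergence can be checked numerically. A Python sketch (the tiny discrepancy from the stationary values quoted in the text comes from working with the rounded entries of P̂):

```python
# Two-step forecast matrix P^2 and convergence of P^k to a matrix whose
# rows all equal the stationary distribution, for the estimated
# transition matrix (10.3).
P = [[0.775, 0.225],
     [0.266, 0.734]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P2 = matmul(P, P)
print([[round(v, 3) for v in row] for row in P2])
# [[0.66, 0.34], [0.401, 0.599]]

Pk = P
for _ in range(50):  # P^51: effectively the limiting matrix
    Pk = matmul(Pk, P)
print([round(v, 3) for v in Pk[0]])  # close to the stationary distribution
```

After roughly fifty steps both rows of P^k agree to machine precision, so the last observed state no longer matters.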
For k → ∞, the k-step forecast distribution will converge to the stationary distribution π = (0.541, 0.459)⊤, regardless of the last observed value:

lim_{k→∞} P_k = [ π⊤ ; π⊤ ; … ; π⊤ ],

i.e. a matrix with all rows equal to π⊤.
Figure 10.2 illustrates that both Pr(X120+k = 1 | x120 = 1) and Pr(X120+k = 1 |
x120 = 2) converge to π1 = 0.541.
We now describe selected Markov models for time series with continuous measure-
ments.
Xt | Xt−1 = xt−1 ∼cid N(α xt−1, σ²),  t = 2, . . . , n,

where “∼cid” stands for “is conditionally independently distributed as”. An equivalent unconditional description of this model is

Xt = α Xt−1 + εt,    (10.4)

where the error terms εt are assumed to be independent mean-zero normal random variables with variance σ². It can be easily shown that this model is stationary if
|α| < 1. Stationarity means that the marginal mean μ and variance τ 2 of Xt are
constant, i.e. do not depend on time. Taking expectation and variance on both sides
of (10.4) it can be shown that the marginal distribution of Xt has mean zero and
variance τ 2 = σ 2 /(1 − α 2 ):
E(Xt) = α · E(Xt−1)  and  Var(Xt) = α² · Var(Xt−1) + σ²,

and equating E(Xt) = E(Xt−1) = μ and Var(Xt) = Var(Xt−1) = τ² gives μ = α μ and τ² = α² τ² + σ², so μ = 0 and τ² = σ²/(1 − α²).
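A small simulation illustrates these stationary moments; the parameter values below are arbitrary:

```python
# Simulate a long AR(1) path X_t = alpha * X_{t-1} + eps_t and compare
# the empirical mean and variance with the stationary values 0 and
# sigma^2 / (1 - alpha^2). Parameters are illustrative.
import random

random.seed(1)
alpha, sigma = 0.8, 1.0
n = 200_000

x = 0.0
xs = []
for _ in range(n):
    x = alpha * x + random.gauss(0.0, sigma)
    xs.append(x)

mean = sum(xs) / n
var = sum((v - mean) ** 2 for v in xs) / n
print(round(mean, 2), round(var, 2))  # near 0 and 1/(1 - 0.64) = 2.78
```

The autocorrelation of the path slows down Monte Carlo convergence, so a fairly long series is needed for a close match.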
A more general model is
Xt | Xt−1 = xt−1 ∼cid N(μ + α(xt−1 − μ), σ²),  t = 2, . . . , n,    (10.5)
Again there are two options for ML estimation of θ = (μ, α, σ 2 ) . The first ap-
proach uses the likelihood conditional on the first observation x1 . Since the distri-
bution of Xt depends only on the observation xt−1 from the previous time point, the
joint density of (X2 , . . . , Xn ), conditional on X1 = x1 has the form
f(x2, . . . , xn | x1) = ∏_{t=2}^{n} f(xt | xt−1),

where f(xt | xt−1) is the density of a N(μ + α(xt−1 − μ), σ²) distribution, see (10.5).
The corresponding log likelihood can therefore easily be derived (up to additive constants) as

l_c(θ) = −((n − 1)/2) log(σ²) − 1/(2σ²) ∑_{t=2}^{n} ( xt − μ − α(xt−1 − μ) )².    (10.6)

For fixed α, maximisation of (10.6) with respect to μ yields

μ̂(α) = 1/(n − 1) { ∑_{t=2}^{n−1} xt + (xn − α x1)/(1 − α) }.    (10.7)
Note that this is essentially just the average of the observed time series, only the
end-values x1 and xn are treated slightly differently.
We can now plug (10.7) into (10.6) to obtain the profile log-likelihood (cf. Sect. 5.3) of α and σ². Maximisation of this profile log-likelihood with respect to α does not depend on σ² and can be based on minimising the residual sum of squares

∑_{t=2}^{n} ( xt − μ̂(α) − α( xt−1 − μ̂(α) ) )²,
from which we numerically obtain α̂ML and subsequently μ̂ML = μ̂(α̂ML ). This
defines residuals rt = xt − μ̂ML − α̂ML (xt−1 − μ̂ML ), from which the ML estimate
of σ 2 can be derived as
σ̂²_ML = 1/(n − 1) ∑_{t=2}^{n} r_t².
We note that the partial derivative of (10.6) with respect to α is

∂l_c(θ)/∂α = 1/σ² ∑_{t=2}^{n} ( xt − μ − α(xt−1 − μ) )(xt−1 − μ)

           = 1/σ² { ∑_{t=2}^{n} (xt − μ)(xt−1 − μ) − α ∑_{t=2}^{n} (xt−1 − μ)² },

so, for known μ, setting this derivative to zero gives

α̂(μ) = ∑_{t=2}^{n} (xt − μ)(xt−1 − μ) / ∑_{t=2}^{n} (xt−1 − μ)².
This is essentially the classical estimate of the first-order autocorrelation, only the
term (xn − μ)2 is missing in the sum of the denominator.
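The whole conditional ML procedure can be sketched in a few lines. Here a crude grid search replaces proper numerical optimisation, and the data are simulated rather than the beaver series analysed below:

```python
# Conditional ML for the AR(1) model (10.5) via the profile likelihood:
# for each alpha, mu_hat(alpha) from (10.7) is plugged in, the residual
# sum of squares is minimised over alpha, and sigma^2 is then estimated
# from the residuals. True parameter values are illustrative.
import random

random.seed(2)
true_mu, true_alpha, true_sigma = 37.2, 0.8, 0.1
x = [true_mu]
for _ in range(499):
    x.append(true_mu + true_alpha * (x[-1] - true_mu)
             + random.gauss(0, true_sigma))
n = len(x)

def mu_hat(alpha):
    # equation (10.7)
    return (sum(x[1:n - 1]) + (x[-1] - alpha * x[0]) / (1 - alpha)) / (n - 1)

def rss(alpha):
    m = mu_hat(alpha)
    return sum((x[t] - m - alpha * (x[t - 1] - m)) ** 2 for t in range(1, n))

# crude grid search over alpha in (-1, 1)
grid = [i / 1000 for i in range(-999, 1000)]
alpha_ml = min(grid, key=rss)
mu_ml = mu_hat(alpha_ml)
sigma2_ml = rss(alpha_ml) / (n - 1)
print(round(alpha_ml, 2), round(mu_ml, 1), round(sigma2_ml, 3))
```

With n = 500 observations the estimates land close to the true values; in practice a proper optimiser (or R's arima(), as used below) would replace the grid.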
A full likelihood approach takes also the observation x1 into account, assuming
that it is a realisation from the stationary distribution N(μ, σ 2 /(1 − α 2 )). Then the
term
(1/2) log(1 − α²) − (1/2) log(σ²) − (1 − α²)(x1 − μ)²/(2σ²)
has to be added to the conditional log likelihood (10.6) and numerical maximisation
is necessary to derive the ML estimates.
Table 10.2 Conditional and full ML estimates with standard errors from separate analyses of
beaver body temperature inside and outside the retreat
Activity Likelihood μ̂ se(μ̂) α̂ se(α̂) σ̂ 2 se(σ̂ 2 )
inside conditional 37.238 0.119 0.834 0.080 0.009 0.002
inside full 37.073 0.212 0.942 0.062 0.011 0.002
outside conditional 37.908 0.083 0.797 0.078 0.017 0.003
outside full 37.916 0.074 0.787 0.075 0.017 0.003
From Table 10.2 we can see that the estimates of α and σ 2 from the two parts
of the time series are fairly similar whereas the level μ appears to be somewhat
different. In Example 10.4 we will present an analysis of the whole time series with
a level shift (represented by a binary covariate) and a common AR(1) process for
the residual time series.
We note that there is much more efficient and stable software in R to fit this model.
Indeed, the function arima() using the argument order=c(1,0,0) will produce
the same ML estimates as our own implementation above using the full likelihood.
Higher-order autoregressive models (AR(p)) as well as so-called moving average
models (MA(q)) and combinations of both (ARMA(p, q)) can be fitted using the
Box–Jenkins modelling framework for time series. The stationarity assumption can
be avoided using so-called integrated ARMA(p, q) models, short ARIMA models.
Finally, seasonality can be included leading to so-called SARIMA models.
A useful extension of model (10.5) allows covariates to be included. The AR(1) model
will then be
Xt | Xt−1 = xt−1 ∼cid N( μ + z⊤t β + α( xt−1 − μ − z⊤t−1 β ), σ² ),  t = 2, . . . , n,
Example 10.4 (Beaver body temperature) We now analyse the complete time series
on body temperature of the beaver using a binary indicator for activity outside the
retreat as covariate zt with the R function arima(). For reference, we also include a
model without the covariate (model2) and a model without autoregression, thus not
allowing for residual correlation (model3).
library(MASS)
attach(beav2)
model1 <- arima(temp, order = c(1, 0, 0), xreg = activ)
model2 <- arima(temp, order = c(1, 0, 0))
model3 <- arima(temp, order = c(0, 0, 0), xreg = activ)
Table 10.3 gives the parameter estimates of the three different models. The full
model estimates the mean temperature during activity to be 0.61 °C (SE: 0.14 °C)
higher than without activity. The model fits the time series considerably better than
an AR(1) model without this covariate. Not allowing for residual correlation also gives a considerably worse model fit, as can be seen from the AIC values in Table 10.3. In addition, the activity estimate is somewhat larger (0.81 °C) while the
associated standard error (SE: 0.04 °C) is very small due to ignoring substantial
residual correlation.
Table 10.3 ML estimates and standard errors of parameters describing beaver body temperature
with activity as binary covariate
Model Covariate Autoregression μ̂ se(μ̂) α̂ se(α̂) β̂ se(β̂) AIC
1 yes yes 37.19 0.12 0.87 0.07 0.61 0.14 −125.55
2 no yes 37.49 0.35 0.97 0.02 −110.10
3 yes no 37.10 0.03 0.81 0.04 −21.47
10.3.4 Prediction

The one-step predictive distribution Xn+1 | Xn = xn of the AR(1) model (10.5) is normal with mean μ + α(xn − μ) and variance σ². Iterating this result gives mean and variance of the k-step predictive distribution as

E(Xn+k | Xn = xn) = μ + α^k (xn − μ)  and  Var(Xn+k | Xn = xn) = σ² ∑_{j=1}^{k} α^{2(j−1)}.

Note that for k → ∞ the predictive distribution of Xn+k | Xn = xn will (for |α| < 1) converge to the stationary distribution with mean μ and variance σ²/(1 − α²), due to α^k → 0 for k → ∞ and

lim_{k→∞} ∑_{j=1}^{k} α^{2(j−1)} = lim_{k→∞} ∑_{j=0}^{k−1} (α²)^j = 1/(1 − α²).
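These moments are easy to evaluate directly. The numbers below are illustrative and only roughly mimic the fitted beaver model:

```python
# k-step predictive mean and variance of an AR(1) process:
# E(X_{n+k} | x_n)   = mu + alpha^k (x_n - mu)
# Var(X_{n+k} | x_n) = sigma^2 * sum_{j=1}^k alpha^(2(j-1)),
# converging to mu and sigma^2 / (1 - alpha^2). Values are illustrative.
mu, alpha, sigma2 = 37.9, 0.79, 0.017
x_n = 38.1

def pred_moments(k):
    mean = mu + alpha ** k * (x_n - mu)
    var = sigma2 * sum(alpha ** (2 * (j - 1)) for j in range(1, k + 1))
    return mean, var

for k in (1, 6, 24):
    m, v = pred_moments(k)
    print(k, round(m, 3), round(v, 3))

print(round(sigma2 / (1 - alpha ** 2), 3))  # limiting variance
```

The point predictions shrink geometrically towards μ, which matches the behaviour of the beaver predictions in Fig. 10.4 below.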
Example 10.5 (Beaver body temperature) Figure 10.4 shows predictions of beaver
temperature with pointwise 95 % prediction intervals for the next four hours, as-
suming that the beaver remains outside the retreat. The predictions are based on the
AR(1) model fitted in Example 10.4 with the activity covariate. Note that the point
predictions quickly converge to the estimated stationary mean μ̂ + β̂ outside the
retreat.
## predict the next four hours in 10 min intervals
n.ahead <- 6 * 4
p <- predict(model1, newxreg = rep(1, n.ahead), n.ahead = n.ahead)
pred <- p$pred
pred.se <- p$se
round(pred, 3)
Time Series :
Start = 101
End = 124
Frequency = 1
[1] 38.037 38.007 37.982 37.960 37.940 37.923 37.908 37.895
[9] 37.884 37.874 37.865 37.858 37.851 37.846 37.841 37.836
[17] 37.832 37.829 37.826 37.823 37.821 37.819 37.818 37.816
round(pred.se, 3)
Time Series :
Start = 101
End = 124
Frequency = 1
[1] 0.123 0.164 0.189 0.206 0.218 0.227 0.233 0.238 0.242
[10] 0.244 0.246 0.248 0.249 0.250 0.251 0.251 0.252 0.252
[19] 0.252 0.252 0.252 0.253 0.253 0.253
The output distribution f (yt | xt ), the initial distribution f (x1 ) and the transition
distribution f (xt | xt−1 ) may depend on unknown parameters θ to be estimated
from the data. ML estimation will be based on the likelihood function
L(θ) = f(y; θ) = ∫ f(y, x; θ) dx,

which can alternatively be factorised into one-step predictive contributions:

L(θ) = f(y1; θ) · ∏_{t=2}^{n} f(yt | y≤(t−1); θ),    (10.9)
where y≤t = (y1 , . . . , yt ). In this form the likelihood is easier to calculate. In the
following we suppress the dependence on θ to simplify notation.
The first term on the right-hand side of (10.9) can be computed as
f(y1) = ∫ f(y1 | x1) f(x1) dx1.
The first term f (yt | xt ) in (10.10) is known and the second term f (xt | y≤(t−1) ) can
be computed recursively via the forward pass algorithm: Suppose f (xt−1 | y≤(t−2) )
is already available. First compute
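The forward pass is easiest to see in code. The following Python sketch (using the noisy-channel settings of Example 10.6 below, with a uniform initial distribution as an assumption) propagates the predictive state probabilities f(xt | y≤(t−1)) and accumulates the log likelihood of (10.9):

```python
# Forward pass for a discrete two-state hidden Markov model: alternate
# between the filtering density f(x_t | y_{<=t}) and the one-step
# prediction f(x_{t+1} | y_{<=t}), summing the log likelihood
# contributions f(y_t | y_{<=t-1}). Settings follow Example 10.6.
from math import log

eps = 0.2
emit = {1: {1: 1 - eps, 2: eps}, 2: {1: eps, 2: 1 - eps}}  # f(y | x)
P = {1: {1: 0.75, 2: 0.25}, 2: {1: 0.25, 2: 0.75}}         # f(x_t | x_{t-1})
init = {1: 0.5, 2: 0.5}                                    # f(x_1), assumed

def log_likelihood(y):
    pred = dict(init)  # f(x_t | y_{<= t-1}), starting with f(x_1)
    ll = 0.0
    for obs in y:
        # likelihood contribution f(y_t | y_{<= t-1}):
        c = sum(emit[x][obs] * pred[x] for x in (1, 2))
        ll += log(c)
        # filtering density f(x_t | y_{<= t}):
        filt = {x: emit[x][obs] * pred[x] / c for x in (1, 2)}
        # next one-step prediction f(x_{t+1} | y_{<= t}):
        pred = {x: sum(P[xp][x] * filt[xp] for xp in (1, 2))
                for x in (1, 2)}
    return ll

y = [2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2]
print(round(log_likelihood(y), 3))
```

Because every step normalises by the one-step predictive density, the recursion is numerically stable even for long series.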
Consider now θ as fixed. Of central interest is often the posterior distribution of the
latent states, i.e.
f(x | y) ∝ f(y | x) f(x)

= ∏_{t=1}^{n} f(yt | xt) · f(x1) · ∏_{t=2}^{n} f(xt | xt−1)

= f(y1 | x1) · f(x1) · ∏_{t=2}^{n} { f(yt | xt) · f(xt | xt−1) }.    (10.12)
A crucial property for most of the following algorithms is that x | y inherits the
Markov property from x, though its transition probabilities are a function of y and
therefore time-dependent. In fact, it is possible to show that
f(x | y) = f(x1 | y) · ∏_{t=2}^{n} f(xt | xt−1, y≥t)    (10.13)

= f(xn | y) · ∏_{t=n−1}^{1} f(xt | xt+1, y≤t).    (10.14)

To show (10.13), write

f(x | y) = f(x1 | y) · ∏_{t=2}^{n} f(xt | x≤(t−1), y)

= f(x1 | y) · ∏_{t=2}^{n} f(xt | xt−1, y)

= f(x1 | y) · ∏_{t=2}^{n} f(xt | xt−1, y≥t),
where both lines follow from the Markov property of (x, y). Equation (10.14) can
be shown with similar arguments, using the fact that every Markov chain retains the
Markov property with time reversed.
Equation (10.14) can be used to simulate from f (x | y) using the conditional dis-
tribution method by first sampling xn from f (xn | y) and subsequently sampling xt
from f (xt | xt+1 , y≤t ), t = n − 1, . . . , 1. The required distributions can be calculated
as follows. First note that
Example 10.6 (Noisy binary channel) This example is taken from unpublished lec-
ture notes by Julian Besag. The following time series of binary observations repre-
sents the observed data:
y = (2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2).
The support of yt and the state space of xt is S = {1, 2}, the length of the time series
is n = 20. Also assume that the transition matrix P of the underlying hidden Markov
chain X is known:
P = [ p11 = 0.75  1 − p11 = 0.25 ; 1 − p22 = 0.25  p22 = 0.75 ].    (10.15)
The hyperparameters P and θ are thus assumed to be known, so the goal of statistical
inference reduces to the restoration of the sequence x given the observations y.
The common approach is to find the sequence x̂MAP which maximises the posterior
distribution f(x | y), as given in (10.12). However, there are 2^n different possible sequences x, which makes direct evaluation of f(x | y) for all x computationally infeasible if n is large.
A more efficient algorithm to find the posterior mode is the Viterbi (1967) algo-
rithm, a recursive algorithm which finds the MAP estimate in O(K 2 · n) steps. The
algorithm proceeds as follows. First note that, due to (10.12) the unnormalised log
posterior can be written in the form
G(x) = g1(x1) + ∑_{t=2}^{n} gt(xt, xt−1)
with
Let x ∗ denote the MAP estimate and suppose the t-th position of x ∗ is xt∗ = k. Now
define
for k ∈ {1, . . . , K}. Note that Gt,k (x1 , . . . , xt−1 ) is a function of the states before
time t, while Ht,k(xt+1, . . . , xn) is a function of the states after time t, so maximisation of G(x) (assuming xt∗ = k) can be done by separately maximising Gt,k and
Ht,k . Note also that
holds.
The Viterbi algorithm is then based on the following recursion:
1. Compute G∗1,i = g1 (x1 = i) for i = 1, . . . , K.
10.5 Hidden Markov Models 333
2. Compute, for t = 2, . . . , n and i = 1, . . . , K,

G∗t,i = max_l { G∗t−1,l + gt(i, l) } = G∗t−1,l∗ + gt(i, l∗),

where l∗ denotes the maximising previous state.
It can be shown that the sequence x ∗ = {x1∗ , x2∗ , . . . , xn∗ } where x1∗ = arg maxi G∗1,i
and xt∗ = arg maxi G∗t,i , t = 2, . . . , n, is the MAP estimate x̂MAP .
Example 10.7 (Noisy binary channel) Table 10.4 gives several estimates of the
underlying sequence x. It is interesting to note that the posterior mode is not unique: there are two estimates of x with identical posterior probability 0.0304. The Viterbi
algorithm will give one of them, depending on the selection of the maximum in the
case of ties. Note that the observed sequence y has a considerably smaller posterior
probability of 0.0027.
auxiliary <- function(y, m, theta) {
  n <- length(y)
  probs <- matrix(NA, ncol = m, nrow = n)
  for (i in 1:m) probs[, i] <- theta[i, y]
  return(probs)
}

viterbi <- function(y, K, theta, P, delta = NULL) {
  ## default initial distribution: stationary distribution of P
  if (is.null(delta))
    delta <- solve(t(diag(K) - P + 1), rep(1, K))
  n <- length(y)
  probs <- auxiliary(y, K, theta)
  xi <- matrix(0, n, K)
  foo <- delta * probs[1, ]
  xi[1, ] <- foo / sum(foo)
  for (i in 2:n) {
    foo <- apply(xi[i - 1, ] * P, 2, max) * probs[i, ]
    xi[i, ] <- foo / sum(foo)
  }
  ## backtracking of the optimal state sequence
  map <- numeric(n)
  map[n] <- which.max(xi[n, ])
  for (i in (n - 1):1) {
    map[i] <- which.max(P[, map[i + 1]] * xi[i, ])
  }
  return(map)
}
noisy.channel <- c(2, 2, 2, 1, 2, 2, 1, 1, 1, 1,
                   1, 2, 1, 1, 1, 2, 1, 2, 2, 2)
P <- matrix(c(0.75, 0.25, 0.25, 0.75), ncol = 2, byrow = TRUE)
eps <- 0.2
theta <- matrix(c(1 - eps, eps, eps, 1 - eps), ncol = 2, byrow = TRUE)
(map <- viterbi(noisy.channel, K = 2, theta, P))
[1] 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2
Table 10.4 also gives the marginal posterior mode x̂MPM , where each element xt ,
t = 1, . . . , 20, has marginal posterior probability Pr(xt | y) ≥ 0.5. This has been
computed using the algorithm outlined at the end of Sect. 10.4.2. Note that x̂MPM and x̂MAP do not necessarily coincide, a phenomenon which was discussed in more generality in Sect. 6.5.4 in Chap. 6.
Figure 10.5 displays the marginal posterior probabilities Pr(xt = 2 | y) together
with the observed sequence y. One can see that the MPM estimate does not coincide
with the data at positions t = 4 and 12.
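The marginal posterior probabilities behind Fig. 10.5 can be reproduced with the forward-backward recursions. A Python sketch (assuming a uniform initial distribution for x1, which here coincides with the stationary distribution of (10.15)):

```python
# Forward-backward computation of Pr(x_t | y) for the noisy binary
# channel with known transition matrix (10.15) and eps = 0.2, followed
# by the marginal posterior mode (MPM) estimate.
eps = 0.2
emit = {1: {1: 1 - eps, 2: eps}, 2: {1: eps, 2: 1 - eps}}
P = {1: {1: 0.75, 2: 0.25}, 2: {1: 0.25, 2: 0.75}}
y = [2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2]
n = len(y)

# forward: alpha_t(x) proportional to f(x_t, y_{<=t}), normalised each step
alpha = [{x: 0.5 * emit[x][y[0]] for x in (1, 2)}]
for t in range(1, n):
    a = {x: emit[x][y[t]] * sum(P[xp][x] * alpha[-1][xp] for xp in (1, 2))
         for x in (1, 2)}
    s = a[1] + a[2]
    alpha.append({x: a[x] / s for x in (1, 2)})

# backward: beta_t(x) proportional to f(y_{>t} | x_t)
beta = [{1: 1.0, 2: 1.0}]
for t in range(n - 2, -1, -1):
    b = {x: sum(P[x][xn_] * emit[xn_][y[t + 1]] * beta[0][xn_]
                for xn_ in (1, 2))
         for x in (1, 2)}
    s = b[1] + b[2]
    beta.insert(0, {x: b[x] / s for x in (1, 2)})

post = []
for t in range(n):
    p = {x: alpha[t][x] * beta[t][x] for x in (1, 2)}
    post.append(p[2] / (p[1] + p[2]))  # Pr(x_t = 2 | y)

mpm = [2 if p >= 0.5 else 1 for p in post]
print(mpm)
```

The resulting MPM sequence smooths away the two isolated flips in y, in line with the positions t = 4 and 12 mentioned in the text.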
Example 10.8 (Noisy binary channel) Estimation of both transition matrix and mis-
classification probabilities θ = (θ1 , θ2 ) gives the parameter estimates
P̂ML = [ 0.825  0.175 ; 0.143  0.857 ]
and θ̂ML = 0.213 with virtually no difference in the value of the log likelihood func-
tion.
We have recomputed the MPM estimate based on the estimates of the simpler
three-parameter model. The result is shown in Fig. 10.6. The MPM estimate differs
only at position t = 16 from the one shown in Fig. 10.5, which has been computed
with fixed transition matrix (10.15) and misclassification probabilities (10.16) and
(10.17).
Example 10.9 (REM data) As a second example, we fit a hidden Markov model
to the REM data. Estimation of both transition matrix and the unknown parameter
vector θ = (θ1 , θ2 ) gives the parameter estimates
P̂ML = [ 0.947  0.053 ; 0.028  0.972 ]
and θ̂ ML = (0, 0.235) . Note that the ML estimate of θ1 is on the boundary of the
parameter space. The alternative model with the restriction θ = θ1 = θ2 gives
P̂ML = [ 0.952  0.048 ; 0.042  0.958 ]
A fully Bayesian approach treats the hyperparameters as unknown random quantities and explores the joint posterior distribution

f(x, θ | y) ∝ f(y | x, θ) · f(x | θ) · f(θ)
with Markov chain Monte Carlo (MCMC) techniques (cf. Sect. 8.4). Specifi-
cally, we select suitable starting values x (1) and θ (1) for x and θ and sample it-
eratively x (s) from f (x | θ (s−1) , y) and θ (s) from f (θ | x (s) , y) in turn for s =
2, . . . , S.
Sampling from f (x | θ , y) can be efficiently done using the forward filtering
backward sampling algorithm described at the end of Sect. 10.4.2. The parameter
vector θ consists of the transition matrix P and additional unknown parameters in
the observation model. Regarding P it is convenient to work with conjugate Dirichlet
priors for each row of P, which reduce to two independent beta priors for the case
of K = 2 states. If the observation model is Bernoulli, then independent beta priors
are also convenient for the misclassification probabilities θr .
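The conjugate updates within one Gibbs iteration can be sketched as follows. The state sequence x here is a made-up stand-in for a draw from the forward-filtering backward-sampling step, and the Be(9, 1) and Be(1, 9) prior parameters follow Example 10.10:

```python
# One conjugate updating step for the HMM hyperparameters inside a Gibbs
# sampler: given a sampled state sequence x, the diagonal transition
# probabilities get beta updates from the transition counts, and the
# misclassification probabilities get beta updates from the mismatch
# counts. x and y are illustrative stand-ins.
import random

random.seed(3)
x = [1] * 7 + [2] * 5 + [1] * 8  # stand-in for a sampled state sequence
y = [1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1]

# transition counts from x
n = {(i, j): sum(1 for a, b in zip(x[:-1], x[1:]) if (a, b) == (i, j))
     for i in (1, 2) for j in (1, 2)}

# Be(9, 1) priors for the diagonal entries p11 and p22:
p11 = random.betavariate(9 + n[(1, 1)], 1 + n[(1, 2)])
p22 = random.betavariate(9 + n[(2, 2)], 1 + n[(2, 1)])

# Be(1, 9) priors for the misclassification probabilities theta_r:
mis = {r: sum(1 for xt, yt in zip(x, y) if xt == r and yt != r)
       for r in (1, 2)}
tot = {r: sum(1 for xt in x if xt == r) for r in (1, 2)}
theta = {r: random.betavariate(1 + mis[r], 9 + tot[r] - mis[r])
         for r in (1, 2)}

print(round(p11, 2), round(p22, 2), theta)
```

For K > 2 states the beta updates generalise to Dirichlet updates of each row of P, exactly as described above.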
Example 10.10 (REM data) We reconsider the REM data using a fully Bayesian
approach. We use independent Be(9, 1) priors for p11 and p22 as well as inde-
pendent Be(1, 9) priors for θ1 and θ2 such that the expected prior probability of
a jump from one state to the other and the expected misclassification probability
are both equal to 10 %. We collected 1000 samples after a burn-in of 100 itera-
tions.
Figure 10.8 gives histograms of the posterior distribution of the parameters p11 ,
p22 , θ1 and θ2 . Note that the mode of θ2 is clearly larger than zero while the poste-
rior distribution of θ1 seems to peak at zero. This corresponds to the ML estimates
reported earlier.
Finally Fig. 10.9 shows the posterior probabilities Pr(xt = 2 | y) based on the
empirical posterior samples of x. Note that these posterior probabilities fully incor-
porate posterior uncertainty with respect to the hyperparameters, in contrast to the
analysis conditional on ML estimates.
10.6 State Space Models 337
Fig. 10.8 Bayesian analysis of REM sleep data. Shown are histograms of the posterior distribution
of p11 , p22 , θ1 and θ2 based on 1000 samples from the MCMC run
State space models are parameter-driven models, where the latent vector x follows
an autoregressive Gaussian process and the mean of the distribution of yt | xt de-
pends in some way on xt . For example, if yt is a continuous normal measurement
then xt is often simply the mean of yt . If yt is binary, then a logit model with
πt = E(yt | xt ) = 1/(1 + exp(−xt )) could be used.
The latent Gaussian process can also be multivariate, in which case the distri-
bution of yt | xt will have a form as in a generalised linear model. However, for
simplicity we will assume in the following that xt is scalar, i.e.
Xt | Xt−1 = xt−1 ∼ N( μ + α(xt−1 − μ), σ² ),    (10.18)
where α is treated as unknown. The case α = 1 (a simple random walk) is also often used in practice; the (unidentifiable) parameter μ is then omitted. The formulation is
completed with an initial distribution for X1 , e.g. X1 ∼ N(ν, τ 2 ). If |α| < 1, this
distribution can be chosen as the marginal distribution N(μ, σ 2 /(1 − α 2 )) of the
autoregressive process.
The structure of a state space model is very similar to a hidden Markov model
with conditionally independent observations yt | xt and a Markov model for x. The
fundamental difference between the two frameworks is that the distribution of xt in
a hidden Markov model is discrete with a finite set of K possible states (with K
typically small), whereas the distribution of xt in a state space model is continuous.
Both formulations also have hyperparameters which determine the process x. For
a hidden Markov model the process is driven by the transition matrix P whereas
the latent process (10.18) is determined by the autoregressive parameter α and the
variance σ 2 .
Empirical Bayes inference in state space models proceeds in two steps. First,
unknown hyperparameters θ are estimated based on the likelihood (10.9). Second,
characteristics of the posterior distribution of x conditional on the ML estimate
θ̂ ML are computed using the methods described in Sect. 10.4.2. Conceptually the
inferential process is identical to inference in hidden Markov models.
Fully Bayesian inference can be performed with MCMC, but recently the inte-
grated nested Laplace approximations (INLA) method has been proposed as a suit-
able alternative. INLA is an extension of the Laplace approximation described in
Sect. 8.2. We will show in the following how INLA can be used to fit state space
models to the REM data.
Example 10.11 (REM data) As in Example 10.2 we use a binary observation model for the response yt, but now assume that the latent logit-transformed response probability xt = logit(πt) follows an AR(1) process with unknown autoregressive parameter α.
library(INLA)
rem.data <- c(1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1,
              1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2,
              2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2,
              2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
              2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2)
len <- length(rem.data)
time <- c(1:len)
mydata <- list(y = rem.data - 1, t = time)
formula1 <- y ~ f(t, model = "ar1",
                  hyper = list(rho = list(initial = 5,
                                          prior = "normal",
                                          param = c(0, 0.15)),
                               prec = list(prior = "loggamma",
                                           param = c(1, 0.005))))
model1 <- inla(formula1, family = "binomial", data = mydata,
               control.predictor = list(compute = TRUE))
npred <- 10
mydata.pred <- list(y = c(rem.data - 1, rep(NA, npred)),
                    t = c(time, (len + 1):(len + npred)))
model1.pred <- inla(formula1, family = "binomial", data = mydata.pred,
                    control.predictor = list(compute = TRUE),
                    Ntrials = rep(1, len + npred))
10.7 Exercises
10.8 References
Contents
A.1 Events and Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
A.1.1 Conditional Probabilities and Independence . . . . . . . . . . . . . . . . 344
A.1.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
A.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
A.2.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 345
A.2.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . 346
A.2.3 The Change of Variables Formula . . . . . . . . . . . . . . . . . . . . . . 347
A.2.4 Multivariate Normal Distributions . . . . . . . . . . . . . . . . . . . . . 349
A.3 Expectation, Variance and Covariance . . . . . . . . . . . . . . . . . . . . . . . . 350
A.3.1 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
A.3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
A.3.3 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
A.3.4 Conditional Expectation and Variance . . . . . . . . . . . . . . . . . . . 351
A.3.5 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
A.3.6 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
A.3.7 Jensen’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
A.3.8 Kullback–Leibler Discrepancy and Information Inequality . . . . . . . . . 355
A.4 Convergence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 355
A.4.1 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
A.4.2 Continuous Mapping and Slutsky’s Theorem . . . . . . . . . . . . . . . . 356
A.4.3 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
A.4.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
A.4.5 Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
A.5 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
A.5.1 Univariate Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . 359
A.5.2 Univariate Continuous Distributions . . . . . . . . . . . . . . . . . . . . 361
A.5.3 Multivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 361
This appendix gives important definitions and results from probability theory
in a compact and sometimes slightly simplifying way. The reader is referred to
Grimmett and Stirzaker (2001) for a comprehensive and more rigorous introduc-
tion to probability theory. We start with the notion of probability and then move
on to random variables and expectations. Important limit theorems are described in
Appendix A.4.
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 343
https://doi.org/10.1007/978-3-662-60792-3,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
344 A Probabilities, Random Variables and Distributions
where Pr(A, B) is the probability that both A and B occur. This definition is only
sensible if the occurrence of B is possible, i.e. if Pr(B) > 0. Rearranging this equa-
tion gives Pr(A, B) = Pr(A | B) Pr(B), but Pr(A, B) = Pr(B | A) Pr(A) must obvi-
ously also hold. Equating and rearranging these two formulas gives Bayes’ theorem:
Pr(A | B) = Pr(B | A) Pr(A) / Pr(B),    (A.2)
If A and B are independent, then

Pr(A | B) = Pr(A)  and  Pr(B | A) = Pr(B).
Conditional probabilities behave like ordinary probabilities if the conditional
event is fixed, so Pr(A | B) + Pr(Ac | B) = 1, for example. It then follows that
Pr(B) = Pr(B | A) Pr(A) + Pr(B | Ac) Pr(Ac).    (A.3)
A.2 Random Variables 345
Let A and B denote two events A, B with 0 < Pr(A) < 1 and Pr(B) > 0. Then
Pr(A | B) = Pr(B | A) · Pr(A) / Pr(B)

= Pr(B | A) · Pr(A) / { Pr(B | A) · Pr(A) + Pr(B | Ac) · Pr(Ac) }.    (A.5)
for each j = 1, . . . , n.
For only n = 2 events Pr(A) and Pr(Ac ) in the denominator of (A.6), i.e.
Eq. (A.5), a formulation using odds ω = π/(1 − π) rather than probabilities π pro-
vides a simpler version without the uncomfortable sum in the denominator of (A.5):
Pr(A | B) / Pr(Ac | B) = Pr(A) / Pr(Ac) · Pr(B | A) / Pr(B | Ac),

i.e. the posterior odds equal the prior odds times the likelihood ratio.
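A numerical illustration, in Python, with a hypothetical diagnostic test (A the event of disease, B a positive test result):

```python
# Odds form of Bayes' theorem: posterior odds = prior odds * likelihood
# ratio, checked against the direct computation via (A.5). The numbers
# describe a hypothetical diagnostic test.
pr_A = 0.01   # prevalence Pr(A)
sens = 0.9    # sensitivity Pr(B | A)
spec = 0.95   # specificity Pr(B^c | A^c)

prior_odds = pr_A / (1 - pr_A)
likelihood_ratio = sens / (1 - spec)
posterior_odds = prior_odds * likelihood_ratio

# same result via (A.5):
pr_B = sens * pr_A + (1 - spec) * (1 - pr_A)
pr_A_given_B = sens * pr_A / pr_B

print(round(posterior_odds / (1 + posterior_odds), 4))  # 0.1538
print(round(pr_A_given_B, 4))                           # 0.1538
```

Converting the posterior odds ω back to a probability via ω/(1 + ω) recovers Pr(A | B) exactly, without ever forming the sum in the denominator of (A.5).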
Of particular interest is often the joint bivariate distribution of two discrete random
variables X and Y with probability mass function f (x, y). The conditional proba-
bility mass function f (x | y) is then defined via (A.1):
f(x | y) = f(x, y) / f(y).    (A.7)
f(x | y) = f(y | x) f(x) / f(y),    (A.8)
where the sum is over the support of X. Combining (A.7) and (A.9) shows that the
argument x can be removed from the joint density f (x, y) via summation to obtain
the marginal probability mass function f (y) of Y :
f(y) = ∑_x f(x, y),
If X and Y are independent and g and h are arbitrary real-valued functions, then
g(X) and h(Y ) are also independent.
A continuous random variable X is usually defined with its density function f (x),
a non-negative real-valued function. The density function is the derivative of the distribution function F(x) = Pr(X ≤ x), so that ∫_{−∞}^{x} f(u) du = F(x) and ∫_{−∞}^{+∞} f(x) dx = 1. In the following we assume that this derivative exists. We then
have

Pr(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx
for any real numbers a and b, which determines the distribution of X.
Under suitable regularity conditions, the results derived in Appendix A.2.1 for
probability mass functions also hold for density functions with replacement of sum-
mation by integrals. Suppose that f (x, y) is the joint density function of two random
variables X and Y , i.e.
Pr(X ≤ x, Y ≤ y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv.
When the transformation g(·) is one-to-one and differentiable, there is a formula for
the probability density function fY (y) of Y = g(X) directly in terms of the proba-
bility density function fX (x) of a continuous random variable X. This is known as
the change-of-variables formula:
fY(y) = fX( g⁻¹(y) ) · | d g⁻¹(y) / dy |

= fX(x) · | d g(x) / dx |⁻¹,    (A.11)
where x = g⁻¹(y) and the absolute values of the determinants of the Jacobian matrices are meant. As an example, we will derive the density function of the multivariate normal distribution from a location and scale transformation of univariate standard normal random variables. Consider X = (X1, . . . , Xp)⊤ with Xi ∼iid N(0, 1),
i = 1, . . . , p. So the joint density function is

fX(x) = ∏_{i=1}^{p} (2π)^{−1/2} exp( −x_i²/2 ) = (2π)^{−p/2} exp( −x⊤x/2 ),
which is of course the density function of the Np(0, I) distribution. Let the location
and scale transformation be defined by y = g(x) = Ax +μ, where A is an invertible
p × p matrix, and μ is a p-dimensional vector of real numbers. Then the inverse
transformation is given by x = g −1 (y) = A−1 (y − μ) with Jacobian (g −1 ) (y) =
A−1 . Therefore, the density function of the multivariate continuous random variable
Y = g(X) is
fY(y) = fX( g⁻¹(y) ) · | det (g⁻¹)′(y) |

= (2π)^{−p/2} exp( −(1/2) g⁻¹(y)⊤ g⁻¹(y) ) · |A⁻¹|
A.2 Random Variables 349
−p/2 −1 1
−1
= (2π) |A| exp − (y − μ) AA (y − μ) , (A.13)
2
where we have used |A^{−1}| = |A|^{−1} and (A^{−1})'A^{−1} = (AA')^{−1} in the last step.
If we define Σ = AA' and compare (A.13) with the density function of the multivariate normal distribution N_p(μ, Σ) from Appendix A.5.3, we find that the corresponding kernels match. The normalising constant can also be matched by noting that
|Σ|^{−1/2} = |AA'|^{−1/2} = (|A| · |A'|)^{−1/2} = |A|^{−1}.
Suppose now that the random vector (X', Y')' has a multivariate normal distribution with mean μ = (μ_X', μ_Y')' and covariance matrix

Σ = ( Σ_XX  Σ_XY
      Σ_YX  Σ_YY ),

where μ and Σ are partitioned according to the dimensions of X and Y. Then the marginal distributions of X and Y are

X ∼ N(μ_X, Σ_XX),
Y ∼ N(μ_Y, Σ_YY),

and the conditional distribution of Y given X = x is again normal with

μ_{Y|X=x} = μ_Y + Σ_YX Σ_XX^{−1} (x − μ_X) and
Σ_{Y|X=x} = Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY.
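These formulas are easy to evaluate numerically. A minimal sketch in R for a bivariate case (all numbers below are purely illustrative):

```r
# Conditional moments of Y given X = x in a bivariate normal model
mu    <- c(1, 2)                      # (mu_X, mu_Y)
Sigma <- matrix(c(4.0, 1.2,
                  1.2, 1.0), nrow = 2)
x <- 0.5                              # observed value of X

mu.cond  <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (x - mu[1])
var.cond <- Sigma[2, 2] - Sigma[2, 1] / Sigma[1, 1] * Sigma[1, 2]
c(mu.cond, var.cond)                  # 1.85 and 0.64
```

Note that the conditional variance 0.64 does not depend on the observed value x.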
A.3.1 Expectation
For a continuous random variable X with density function f (x), the expectation or
mean value of X is the real number
E(X) = ∫ x f(x) dx.   (A.14)
Note that the expectation of a random variable does not necessarily exist; we then
say that X has an infinite expectation. If the integral exists, then X has a finite
expectation. For a discrete random variable X with probability mass function f (x),
the integral in (A.14) and (A.15) has to be replaced with a sum over the support
of X.
For any real numbers a and b, we have
E(a · X + b) = a · E(X) + b.
For two random variables X and Y,

E(X + Y) = E(X) + E(Y),

i.e. the expectation of a sum of random variables equals the sum of the expectations of the random variables. If X and Y are independent, then

E(X · Y) = E(X) · E(Y).
A.3.2 Variance
The variance of a random variable X is defined as

Var(X) = E[{X − E(X)}²] = E(X²) − E(X)²,

and can equivalently be written as

Var(X) = (1/2) E{(X₁ − X₂)²},

where X₁ and X₂ are independent copies of X, i.e. X, X₁ and X₂ are independent and identically distributed. The square root √Var(X) of the variance of X is called the standard deviation.
For any real numbers a and b, we have
Var(a · X + b) = a 2 · Var(X).
A.3.3 Moments
The kth moment of a random variable X is defined as m_k = E(X^k), and the kth central moment as c_k = E{(X − m₁)^k}.
The expectation E(X) of a random variable X is therefore its first moment, whereas
the variance Var(X) is the second central moment. Those two quantities are there-
fore often referred to as “the first two moments”. The third and fourth moments of a
random variable quantify its skewness and kurtosis, respectively. If the kth moment
of a random variable exists, then all lower moments also exist.
Equation (A.19), E(Y) = E{E(Y | X)}, is also known as the law of iterated expectations. The law of total variance provides a useful decomposition of the variance of Y:

Var(Y) = E{Var(Y | X)} + Var{E(Y | X)}.   (A.20)
These two results are particularly useful if the first two moments of Y | X = x and
X are known. Calculation of expectation and variance of Y via (A.19) and (A.20) is
then often simpler than directly based on the marginal distribution of Y .
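The two decompositions can be checked by simulation. A small sketch in R, assuming the hierarchical model X ∼ Gamma(a, b) and Y | X = x ∼ Po(x), for which E(Y) = a/b and Var(Y) = E(X) + Var(X) = a/b + a/b²:

```r
# Monte Carlo check of the laws of iterated expectation and total variance
set.seed(1)
a <- 3; b <- 2
x <- rgamma(1e6, shape = a, rate = b)   # X ~ Gamma(a, b)
y <- rpois(1e6, lambda = x)             # Y | X = x ~ Po(x)
c(mean(y), a / b)                       # both close to 1.5
c(var(y), a / b + a / b^2)              # both close to 2.25
```

Here the marginal moments of Y follow directly from (A.19) and (A.20) without knowing the marginal (negative binomial) distribution of Y.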
A.3.5 Covariance
Let (X, Y ) denote a bivariate random variable with joint probability mass or den-
sity function fX,Y (x, y). The covariance of X and Y is defined as
Cov(X, Y) = E[{X − E(X)}{Y − E(Y)}]
          = E(XY) − E(X) E(Y),

where E(XY) = ∫∫ xy f_{X,Y}(x, y) dx dy, see (A.16). Note that Cov(X, X) = Var(X)
and Cov(X, Y ) = 0 if X and Y are independent.
For any real numbers a, b, c and d, we have

Cov(a · X + b, c · Y + d) = a · c · Cov(X, Y).
The covariance matrix Cov(X) of a p-dimensional random variable X = (X₁, …, X_p)' is defined as E[{X − E(X)}{X − E(X)}'] and has entry Cov(X_i, X_j) in the ith row and jth column. In particular, the diagonal of Cov(X) contains the variances of the components of X.
If X is a p-dimensional random variable and A is a q × p matrix, we have
Cov(A · X) = A · Cov(X) · A .
In particular, for the bivariate random variable (X, Y ) and matrix A = (1, 1), we
have
Var(X + Y ) = Var(X) + Var(Y ) + 2 · Cov(X, Y ).
If X and Y are independent, then Cov(X, Y) = 0 and hence

Var(X + Y) = Var(X) + Var(Y).
A.3.6 Correlation
The correlation of X and Y is defined as

Corr(X, Y) = Cov(X, Y) / √{Var(X) Var(Y)}.

It always satisfies

−1 ≤ Corr(X, Y) ≤ 1,   (A.21)

which can be shown with the Cauchy–Schwarz inequality (after Augustin Louis Cauchy, 1789–1857, and Hermann Amandus Schwarz, 1843–1921). This inequality states that for two random variables X and Y with finite second moments E(X²) and E(Y²),

E(X · Y)² ≤ E(X²) E(Y²).   (A.22)
Applying (A.22) to the random variables X − E(X) and Y − E(Y), one obtains

E[{X − E(X)}{Y − E(Y)}]² ≤ E[{X − E(X)}²] · E[{Y − E(Y)}²],

from which

Corr(X, Y)² = Cov(X, Y)² / {Var(X) Var(Y)} ≤ 1,
i.e. (A.21), easily follows. If Y = a · X + b for some a > 0 and b, then Corr(X, Y ) =
+1. If a < 0, then Corr(X, Y ) = −1.
Let Σ denote the covariance matrix of a p-dimensional random variable X =
(X1 , . . . , Xp ) . The correlation matrix R of X can be obtained via
R = SΣS,

where S denotes the diagonal matrix with entries 1/√Var(X_i), i = 1, …, p, i.e. the inverse standard deviations of the components of X. A correlation matrix R has the entry Corr(X_i, X_j) in the ith row and jth column. In particular, the diagonal elements are all one.
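In R, this standardisation of a covariance matrix is provided by the base function cov2cor(); a small sketch with an illustrative 2 × 2 covariance matrix:

```r
# Converting a covariance matrix into the corresponding correlation matrix
Sigma <- matrix(c(4.0, 1.2,
                  1.2, 1.0), nrow = 2)
S <- diag(1 / sqrt(diag(Sigma)))   # inverse standard deviations on the diagonal
R <- S %*% Sigma %*% S
R[1, 2]                            # Corr(X1, X2) = 1.2 / (2 * 1) = 0.6
all.equal(R, cov2cor(Sigma))       # TRUE
```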
Let X denote a random variable with finite expectation E(X), and g(x) a convex function (if the second derivative g''(x) exists, this is equivalent to g''(x) ≥ 0 for all x ∈ R). Then Jensen's inequality states that

E{g(X)} ≥ g{E(X)}.

If g(x) is even strictly convex (g''(x) > 0 for all real x) and X is not a constant, i.e. not degenerate, then

E{g(X)} > g{E(X)}.

For (strictly) concave functions g(x) (if the second derivative g''(x) exists, this is equivalent to the fact that for all x ∈ R, g''(x) ≤ 0 and g''(x) < 0, respectively), the analogous results

E{g(X)} ≤ g{E(X)}

and

E{g(X)} < g{E(X)}

hold.
Let f_X(x) and f_Y(y) denote two density or probability mass functions, respectively, of random variables X and Y. The quantity

D(f_X || f_Y) = E[log{f_X(X)/f_Y(X)}] = E{log f_X(X)} − E{log f_Y(X)}

is called the Kullback–Leibler discrepancy from f_X to f_Y (after Solomon Kullback, 1907–1994, and Richard Leibler, 1914–2003) and effectively quantifies the "distance" between f_X and f_Y. However, note that in general

D(f_X || f_Y) ≠ D(f_Y || f_X).
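For two normal densities the Kullback–Leibler discrepancy is available in closed form (a standard result, not derived in the text), which makes both the definition and the asymmetry easy to check numerically with integrate():

```r
# D(f_X || f_Y) for X ~ N(mu1, s1^2) and Y ~ N(mu2, s2^2):
#   log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2) - 1/2
mu1 <- 0; s1 <- 1
mu2 <- 1; s2 <- 2
kl.exact <- log(s2 / s1) + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2) - 1 / 2

# numerical check: expectation of the log density ratio under f_X
integrand <- function(x) {
  dnorm(x, mu1, s1) *
    (dnorm(x, mu1, s1, log = TRUE) - dnorm(x, mu2, s2, log = TRUE))
}
kl.num <- integrate(integrand, -Inf, Inf)$value

# the reversed discrepancy differs, illustrating the asymmetry
kl.rev <- log(s1 / s2) + (s2^2 + (mu2 - mu1)^2) / (2 * s1^2) - 1 / 2
c(kl.exact, kl.num, kl.rev)
```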
3. X_n converges in distribution to X, written as X_n →^D X, if

Pr(X_n ≤ x) → Pr(X ≤ x) as n → ∞

for all points x ∈ R at which the distribution function F_X(x) = Pr(X ≤ x) is continuous.
The following relationships between the different modes of convergence can be established:

X_n →^r X  ⟹  X_n →^P X  for any r ≥ 1,
X_n →^P X  ⟹  X_n →^D X,
X_n →^D c  ⟹  X_n →^P c,

where c ∈ R is a constant.
(1/√(nσ²)) (Σ_{i=1}^{n} X_i − nμ) ∼^a N(0, 1),
a
where ∼ stands for “is asymptotically distributed as”.
If X₁, X₂, … denotes a sequence of independent and identically distributed p-dimensional random variables with mean μ = E(X_i) and finite, positive definite covariance matrix Σ = Cov(X_i), then

(1/√n) (Σ_{i=1}^{n} X_i − nμ) →^D Z

as n → ∞, where Z ∼ N_p(0, Σ).
Somewhat simplifying, the delta method states that

g(Z) ∼^a N(g(ν), g'(ν)² · τ²)

if Z ∼^a N(ν, τ²).
Now consider T_n = (1/n)(X₁ + ··· + X_n), where the p-dimensional random variables X_i are independent and identically distributed with finite expectation μ and
covariance matrix Σ. Suppose that g : Rp → Rq (q ≤ p) is a mapping continu-
ously differentiable in a neighbourhood of μ with q × p Jacobian matrix D (cf.
Appendix B.2.2) of full rank q. Then
√n {g(T_n) − g(μ)} ∼^a N_q(0, DΣD')

as n → ∞.
Somewhat simplifying, the multivariate delta method states that if Z ∼^a N_p(ν, T), then

g(Z) ∼^a N_q(g(ν), DT D').
In this section we summarise the most important properties of the probability distri-
butions used in this book. A random variable is denoted by X, and its probability or
density function is denoted by f (x). The probability or density function is defined
for values in the support T of each distribution and is always zero outside of T . For
each distribution, the mean E(X), variance Var(X) and mode Mod(X) are listed, if
appropriate.
In the first row we list the name of the distribution, an abbreviation and the core
of the corresponding R-function (e.g. norm), indicating the parametrisation imple-
mented in R. Depending on the first letter, these functions can be conveniently used
as follows:
r stands for random and generates independent random numbers or vectors from
the distribution considered. For example, rnorm(n, mean = 0, sd = 1)
generates n random numbers from the standard normal distribution.
d stands for density and returns the probability and density function, respectively.
For example, dnorm(x) gives the density of the standard normal distribution.
p stands for probability and gives the distribution function F (x) = Pr(X ≤ x) of
X. For example, if X is standard normal, then pnorm(0) returns 0.5, while
pnorm(1.96) is 0.975002 ≈ 0.975.
q stands for quantile and gives the quantile function. For example, qnorm(0.975)
is 1.959964 ≈ 1.96.
For some distributions, not all four options may be available. The first argument
of each function is not listed since it depends on the particular function used. It is
either the number n of random variables generated, a value x in the domain T of the
random variable or a probability p ∈ [0, 1]. The arguments x and p can be vectors,
as well as some parameter values. The option log = TRUE is useful to compute
the log of the density, distribution or quantile function. For example, multiplication
of very small numbers, which may cause numerical problems, can be replaced by
addition of the log numbers and subsequent application of the exponential function
exp() to the obtained sum.
With the option lower.tail = FALSE, available in p- and q-type functions,
the upper tail of the distribution function Pr(X > x) and the upper quantile z with
Pr(X > z) = p, respectively, are returned. Further details can be found in the docu-
mentation to each function, e.g. by typing ?rnorm.
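The four function types can be tried out directly for the standard normal distribution:

```r
# r, d, p and q functions for the standard normal distribution
set.seed(42)
x <- rnorm(3)                    # r: three random numbers
dnorm(0)                         # d: density at 0, equals 1/sqrt(2*pi)
pnorm(1.96)                      # p: distribution function, approx 0.975
qnorm(0.975)                     # q: quantile function, approx 1.96
pnorm(1.96, lower.tail = FALSE)  # upper tail Pr(X > 1.96), approx 0.025
```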
Table A.1 gives some elementary facts about the most important univariate discrete
distributions used in this book. The function sample can be applied in various set-
tings, for example to simulate discrete random variables with finite support or for
resampling. Functions for the beta-binomial distribution (except for the quantile
function) are available in the package VGAM. The density and random number gen-
erator functions of the noncentral hypergeometric distribution are available in the
package MCMCpack.
E(X) = n · α/(α + β)
Var(X) = n · αβ(α + β + n) / {(α + β)²(α + β + 1)}
The beta function B(x, y) is described in Appendix B.2.1. The BeB(n, 1, 1) distribution is a
discrete uniform distribution with support T and f (x) = (n + 1)−1 . For n = 1, the beta-binomial
distribution BeB(1, α, β) reduces to the Bernoulli distribution B(π) with success probability
π = α/(α + β).
Table A.2 gives some elementary facts about the most important univariate con-
tinuous distributions used in this book. The density and random number generator
functions of the inverse gamma distribution are available in the package MCMCpack.
The distribution and quantile function (as well as random numbers) can be cal-
culated with the corresponding functions of the gamma distribution. Functions re-
lating to the general t distribution are available in the package sn. The functions
_t(. . . , df = α) available by default in R cover the standard t distribution. The log-
normal, folded normal, Gumbel and the Pareto distributions are available in the
package VGAM. The gamma–gamma distribution is currently not available.
Table A.3 gives details about the most important multivariate probability distribu-
tions used in this book. Multivariate random variables X are always given in bold
face. Note that there is no distribution or quantile function available in R for the
E(X) = d   Var(X) = 2d
The gamma function Γ(x) is described in Appendix B.2.1. If X_i ∼ N(0, 1), i = 1, …, n, are independent, then Σ_{i=1}^{n} X_i² ∼ χ²(n).
Normal: N(μ, σ²)   _norm(…, mean = μ, sd = σ)
μ ∈ R, σ² > 0   T = R
f(x) = (2πσ²)^{−1/2} exp{−(x − μ)²/(2σ²)}   Mod(X) = μ
E(X) = μ   Var(X) = σ²
X is standard normal if μ = 0 and σ² = 1, i.e. f(x) = ϕ(x) = (2π)^{−1/2} exp(−x²/2). If X is standard normal, then σX + μ ∼ N(μ, σ²).
Log-normal: LN(μ, σ²)   VGAM::_lnorm(…, meanlog = μ, sdlog = σ)
μ ∈ R, σ² > 0   T = R⁺
f(x) = (σx)^{−1} ϕ{(log(x) − μ)/σ}   Mod(X) = exp(μ − σ²)
E(X) = exp(μ + σ²/2)   Var(X) = {exp(σ²) − 1} exp(2μ + σ²)
If X is normal, i.e. X ∼ N(μ, σ²), then exp(X) ∼ LN(μ, σ²).
Folded normal: FN(μ, σ²)   VGAM::_foldnorm(…, mean = μ, sd = σ)
μ ∈ R, σ² > 0   T = R⁺
f(x) = σ^{−1} {ϕ((x − μ)/σ) + ϕ((x + μ)/σ)}   Mod(X) = 0 if |μ| ≤ σ, Mod(X) ∈ (0, |μ|) if |μ| > σ
E(X) = 2σϕ(μ/σ) + μ{2Φ(μ/σ) − 1}   Var(X) = σ² + μ² − E(X)²
If X is normal, i.e. X ∼ N(μ, σ²), then |X| ∼ FN(μ, σ²). The mode can be calculated numerically if |μ| > σ. If μ = 0, one obtains the half normal distribution with E(X) = σ√(2/π) and Var(X) = σ²(1 − 2/π).
multinomial, Dirichlet, Wishart and inverse Wishart distributions. The density func-
tion and random variable generator functions for Dirichlet, Wishart and inverse
Wishart are available in the package MCMCpack. The package mvtnorm offers the
density, distribution, and quantile functions of the multivariate normal distribution,
as well as a random number generator function. The multinomial-Dirichlet and the
normal-gamma distributions are currently not available in R.
E(X) = α(e_k'α)^{−1}
Cov(X) = (1 + e_k'α)^{−1} · [diag{E(X)} − E(X) E(X)'], where e_k = (1, …, 1)'
Mod(X) = ((α₁ − 1)/d, …, (α_k − 1)/d)', where d = Σ_{i=1}^{k} α_i − k, if α_i > 1 for all i
The gamma function Γ(x) is described in Appendix B.2.1. The beta distribution is a special case of the Dirichlet distribution: if X ∼ Be(α, β), then (X, 1 − X)' ∼ D₂((α, β)').
Multinomial-Dirichlet: MD_k(n, α)
α = (α₁, …, α_k)', n ∈ N   x = (x₁, …, x_k)'
α_i > 0   x_i ∈ N₀, Σ_{j=1}^{k} x_j = n
f(x) = C · ∏_{j=1}^{k} Γ(α_j*) / Γ(Σ_{j=1}^{k} α_j*)   with C = n! Γ(Σ_{j=1}^{k} α_j) / {∏_{j=1}^{k} Γ(α_j) · ∏_{j=1}^{k} x_j!}
α_j* = α_j + x_j   π = α(e_k'α)^{−1}
E(X_i) = nπ_i   Var(X_i) = (Σ_{j=1}^{k} α_j*) / (1 + Σ_{j=1}^{k} α_j) · nπ_i(1 − π_i)
E(X) = nπ   Cov(X) = (Σ_{j=1}^{k} α_j*) / (1 + Σ_{j=1}^{k} α_j) · n{diag(π) − ππ'}
The gamma function Γ(x) is described in Appendix B.2.1. The beta-binomial distribution is a special case of the multinomial-Dirichlet distribution: if X ∼ BeB(n, α, β), then (X, n − X)' ∼ MD₂(n, (α, β)').
B Some Results from Matrix Algebra and Calculus

Contents
B.1 Some Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
B.1.1 Trace, Determinant and Inverse . . . . . . . . . . . . . . . . . . . . . . . 367
B.1.2 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 369
B.1.3 Inversion of Block Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 370
B.1.4 Sherman–Morrison Formula . . . . . . . . . . . . . . . . . . . . . . . . . 371
B.1.5 Combining Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . 371
B.2 Some Results from Mathematical Calculus . . . . . . . . . . . . . . . . . . . . . . 371
B.2.1 The Gamma and Beta Functions . . . . . . . . . . . . . . . . . . . . . . . 371
B.2.2 Multivariate Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
B.2.3 Taylor Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
B.2.4 Leibniz Integral Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
B.2.5 Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
B.2.6 Landau Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
This appendix contains several results and definitions from linear algebra and
mathematical calculus that are important in statistics.
Trace tr(A), determinant |A| and inverse A^{−1} are all operations on square matrices

A = ( a₁₁ a₁₂ ··· a₁ₙ
      a₂₁ a₂₂ ··· a₂ₙ
       ⋮   ⋮   ⋱  ⋮
      aₙ₁ aₙ₂ ··· aₙₙ ) = (a_ij)_{1≤i,j≤n} ∈ R^{n×n}.
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 367
https://doi.org/10.1007/978-3-662-60792-3,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
While the first two produce a scalar number, the last one produces again a square
matrix. In this section we will review the most important properties of these opera-
tions.
The trace operation is the simplest of the three mentioned in this section. It computes the sum of the diagonal elements of the matrix:

tr(A) = Σ_{i=1}^{n} a_ii.
From this one can easily derive some important properties of the trace operation, for example invariance under transposition, tr(A) = tr(A'), invariance under cyclic permutation of a matrix product, tr(AB) = tr(BA), and additivity, tr(A + B) = tr(A) + tr(B).
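These properties are easy to verify numerically, e.g. with two arbitrary 3 × 3 matrices (sum(diag(.)) computes the trace in R):

```r
# Numerical check of the trace properties
A <- matrix(1:9, nrow = 3)
B <- matrix(9:1, nrow = 3)
sum(diag(A)) == sum(diag(t(A)))                   # tr(A) = tr(A')
sum(diag(A %*% B)) == sum(diag(B %*% A))          # tr(AB) = tr(BA)
sum(diag(A + B)) == sum(diag(A)) + sum(diag(B))   # additivity
```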
A non-zero determinant indicates that the corresponding system of linear equa-
tions has a unique solution. For example, the 2 × 2 matrix
A = ( a  b
      c  d )   (B.1)

has the determinant

|A| = ad − bc.
For a 3 × 3 matrix A = (a_ij)_{1≤i,j≤3}, one analogously obtains

|A| = a₁₁a₂₂a₃₃ + a₁₂a₂₃a₃₁ + a₁₃a₂₁a₃₂ − a₃₁a₂₂a₁₃ − a₃₂a₂₃a₁₁ − a₃₃a₂₁a₁₂.
For the calculation of the determinant of higher-dimensional matrices, similar
schemes exist. Here, it is crucial to observe that the determinant of a triangular
matrix is the product of its diagonal elements and that the determinant of the prod-
uct of two matrices is the product of the individual determinants. Hence, using some
form of Gaussian elimination/triangularisation, the determinant of any square matrix
can be calculated. The command det(A) in R provides the determinant; however,
for very high dimensions, a log transformation may be necessary to avoid numer-
ical problems as illustrated in the following example. To calculate the likelihood
of a vector y drawn from a multivariate normal distribution Nn (μ, Σ), one would
naively use
set.seed(15)
n <- 500
mu <- rep(1, n)
Sigma <- 0.5^abs(outer(1:n, 1:n, "-"))
y <- rnorm(n, mu, sd = 0.1)
1/sqrt(det(2*pi*Sigma)) *
    exp(-1/2 * t(y - mu) %*% solve(Sigma) %*% (y - mu))
     [,1]
[1,]    0
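One way to avoid this overflow, sketched here, is to work entirely on the log scale: determinant(Sigma, logarithm = TRUE) returns log |Σ| directly, and the log-likelihood is assembled additively before (optionally) applying exp():

```r
# Log-scale evaluation of the multivariate normal log-likelihood:
# log|2 pi Sigma| = n * log(2 * pi) + log|Sigma|
set.seed(15)
n <- 500
mu <- rep(1, n)
Sigma <- 0.5^abs(outer(1:n, 1:n, "-"))
y <- rnorm(n, mu, sd = 0.1)

logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
loglik <- as.numeric(-0.5 * (n * log(2 * pi) + logdet +
                             t(y - mu) %*% solve(Sigma) %*% (y - mu)))
loglik    # finite, unlike the naive computation
```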
The inverse A^{−1} of a square matrix A is defined by the property

AA^{−1} = A^{−1}A = I,

where I denotes the identity matrix. The inverse exists if and only if the determinant of A is non-zero. For the 2 × 2 matrix (B.1), the inverse is

A^{−1} = (1/|A|) (  d  −b
                   −c   a ).
Using the inverse, it is easy to find solutions of systems of linear equations of the
form Ax = b since simply x = A−1 b. In R, this close relationship is reflected by
the fact that the command solve(A) returns the inverse A−1 and the command
solve(A,b) returns the solution of Ax = b. See also the example in the next
section.
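A small illustration of both uses of solve() (the matrix and right-hand side are arbitrary):

```r
# solve() inverts a matrix and solves linear systems
A <- matrix(c(2, 1,
              1, 3), nrow = 2)
b <- c(1, 2)
Ainv <- solve(A)      # the inverse A^{-1}
x    <- solve(A, b)   # solution of A x = b, preferable numerically
x                     # (0.2, 0.6)
```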
For every symmetric and positive definite matrix A, there exists a unique upper triangular matrix G with positive diagonal entries such that

G'G = A.

G is called the Cholesky square root of the matrix A.
The Cholesky decomposition is a numerical method to determine the Cholesky
root G. The command chol(A) in R provides the matrix G.
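For example, for a small symmetric and positive definite matrix:

```r
# chol() returns the upper triangular Cholesky factor G with t(G) %*% G = A
A <- matrix(c(4, 2,
              2, 3), nrow = 2)
G <- chol(A)        # upper triangular
t(G) %*% G          # reproduces A
```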
Here, the function system.time returns (several different) CPU times of the current
R process. Usually, the last of them is of interest because it displays the elapsed wall-
clock time.
where A₁₁, A₁₂ have the same numbers of rows, and A₁₁, A₂₁ the same numbers of columns. If A, A₁₁ and A₂₂ are square and invertible, then the inverse A^{−1} satisfies

A^{−1} = (  B^{−1}                −B^{−1} A₁₂ A₂₂^{−1}
           −A₂₂^{−1} A₂₁ B^{−1}    A₂₂^{−1} + A₂₂^{−1} A₂₁ B^{−1} A₁₂ A₂₂^{−1} ),

where B = A₁₁ − A₁₂ A₂₂^{−1} A₂₁, or alternatively

A^{−1} = ( A₁₁^{−1} + A₁₁^{−1} A₁₂ C^{−1} A₂₁ A₁₁^{−1}   −A₁₁^{−1} A₁₂ C^{−1}
           −C^{−1} A₂₁ A₁₁^{−1}                            C^{−1} ),

where C = A₂₂ − A₂₁ A₁₁^{−1} A₁₂.
A(x − a)² + B(x − b)² = C(x − c)² + (AB/C)(a − b)²   (B.5)

with C = A + B and c = (Aa + Bb)/C.
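The identity (B.5) is easily verified numerically at arbitrary points (all numbers below are illustrative):

```r
# Numerical check of the quadratic-form identity (B.5)
A <- 2; B <- 3; a <- 1; b <- -2
C  <- A + B
c0 <- (A * a + B * b) / C
lhs <- function(x) A * (x - a)^2 + B * (x - b)^2
rhs <- function(x) C * (x - c0)^2 + A * B / C * (a - b)^2
x <- c(-1, 0, 0.5, 2)
max(abs(lhs(x) - rhs(x)))   # numerically zero
```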
In this section we describe the gamma and beta functions, which appear in many density formulas, and define multivariate derivatives, which in this book are mostly used for Taylor approximations, also explained here. Moreover, we include two results on integration and optimisation and explain the Landau notation used for asymptotic statements.
The gamma function is defined as Γ(x) = ∫₀^∞ t^{x−1} exp(−t) dt for x > 0 and can be extended to all x ∈ R except zero and the negative integers. It is implemented in R in the function gamma(). The function lgamma() returns log{Γ(x)}. The gamma function is said to interpolate the factorial x! because Γ(x + 1) = x! for non-negative integers x.
The beta function is related to the gamma function as follows:
B(x, y) = Γ(x)Γ(y) / Γ(x + y).
It is implemented in R in the function beta(). The function lbeta() returns
log{B(x, y)}.
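A quick numerical check of the relation, and of the factorial property of the gamma function:

```r
# beta() versus the gamma-function representation
x <- 2.5; y <- 4
all.equal(beta(x, y), gamma(x) * gamma(y) / gamma(x + y))  # TRUE
gamma(5)                                                   # 4! = 24
```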
f: R^m → R, x ↦ f(x),
The entry in the ith row and jth column of this r × p Jacobian matrix is the partial derivative ∂k_i(x)/∂x_j of the ith component k_i(x) of k(x) with respect to the jth coordinate x_j.
f(x) = Σ_{k=0}^{n} f^{(k)}(a)/k! · (x − a)^k + f^{(n+1)}(ξ)/(n + 1)! · (x − a)^{n+1},

where ξ is between a and x, and f^{(k)} denotes the kth derivative of f. Moreover, we have that

f(x) = Σ_{k=0}^{n} f^{(k)}(a)/k! · (x − a)^k + o(|x − a|^n) as x → a.
where k ∈ N₀^m, |k| = k₁ + ··· + k_m, k! = ∏_{i=1}^{m} k_i!, (x − a)^k = ∏_{i=1}^{m} (x_i − a_i)^{k_i}, and

D^k f(x) = ∂^{|k|} f(x) / (∂x₁^{k₁} ··· ∂x_m^{k_m}).
where f'(a) = (∂f(a)/∂x₁, …, ∂f(a)/∂x_m)' ∈ R^m is the gradient, and f''(a) = (∂²f(a)/(∂x_i ∂x_j))_{1≤i,j≤m} ∈ R^{m×m} is the Hessian (see Appendix B.2.2).
∂f(x₀)/∂x = λ · ∂g(x₀)/∂x.
A similar result holds for multiple constraints: let g_i, i = 1, …, k (k < m), be continuously differentiable functions from M to R with g_i(x₀) = 0 for all i, and linearly independent gradients ∂g_i(x₀)/∂x, i = 1, …, k. Assume that x₀ is a local maximum or minimum of f restricted to g₁^{−1}(0) ∩ ··· ∩ g_k^{−1}(0). Then there exist Lagrange multipliers λ₁, …, λ_k ∈ R such that

∂f(x₀)/∂x = Σ_{i=1}^{k} λ_i · ∂g_i(x₀)/∂x.
f(x) = o(g(x)) as x → ∞,
if for every ε > 0, there exists a real number δ(ε) > a such that the inequality
|f (x)| ≤ ε|g(x)| is true for x > δ(ε). If g(x) is non-zero for all x larger than a
certain value, this is equivalent to
lim_{x→∞} f(x)/g(x) = 0.
Hence, the asymptotic growth of the function f is slower than that of g, and
f (x) vanishes for large values of x compared to g(x).
The same notation can be defined for limits x → x₀ with x₀ ≥ a:

f(x) = o(g(x)) as x → x₀

means that for any ε > 0, there exists a real number δ(ε) > 0 such that the inequality |f(x)| ≤ ε|g(x)| is true for all x > a with |x − x₀| < δ(ε). If g does not vanish on its domain, this is again equivalent to

lim_{x→x₀, x>a} f(x)/g(x) = 0.
f(x) = O(g(x)) as x → ∞,

if there exist constants Q > 0 and R > a such that the inequality |f(x)| ≤ Q|g(x)| is true for all x > R. Again, an equivalent condition is

lim sup_{x→∞} |f(x)/g(x)| < ∞

if g(x) ≠ 0 for all x larger than a certain value. The function f is then, as x → ∞, at most of the same asymptotic order as the function g, i.e. it grows at most as fast as g.
Analogously to little-o, the notation big-O for limits x → x0 can also be
defined:
f(x) = O(g(x)) as x → x₀

denotes that there exist constants Q, δ > 0 such that the inequality |f(x)| ≤ Q|g(x)| is true for all x > a with |x − x₀| < δ. If g does not vanish on its domain, this is again equivalent to

lim sup_{x→x₀, x>a} |f(x)/g(x)| < ∞.
C Some Numerical Techniques

Contents
C.1 Optimisation and Root Finding Algorithms . . . . . . . . . . . . . . . . . . . . . . 377
C.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
C.1.2 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
C.1.3 Newton–Raphson Method . . . . . . . . . . . . . . . . . . . . . . . . . . 380
C.1.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
C.2 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
C.2.1 Newton–Cotes Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
C.2.2 Laplace Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Numerical methods for optimisation and for finding roots of functions are very fre-
quently used in likelihood inference. This section provides an overview of the com-
monly used approaches.
C.1.1 Motivation
root may then be used to decide whether it is a maximum of g(θ), in which case we have g''(θ*) < 0. For a minimum, we have g''(θ*) > 0, and for a saddle point, g''(θ*) = 0. Note that a univariate search for a root of g'(θ) corresponds to a univariate optimisation problem: it is equivalent to the minimisation of |g'(θ)|.
One example where numerical optimisation methods are often necessary is solving the score equation S(θ) = 0 to find the maximum likelihood estimate θ* = θ̂_ML as the root. Frequently, it is impossible to solve this equation analytically, as in the following application.
Example C.1  Let X ∼ Bin(N, π). Suppose, however, that observations are only available conditional on X > 0, i.e. we observe Y = X | {X > 0}, which follows a truncated binomial distribution. Consequently, we have the following probability mass function for the observations k = 1, 2, …, N:

Pr(X = k | X > 0) = Pr(X = k) / Pr(X > 0) = Pr(X = k) / {1 − Pr(X = 0)}
                  = \binom{N}{k} π^k (1 − π)^{N−k} / {1 − (1 − π)^N}.
The log-likelihood kernel for π is

l(π) = k log(π) + (N − k) log(1 − π) − log{1 − (1 − π)^N},

and setting the corresponding score function S(π) = dl(π)/dπ to zero leads, after multiplication with π(1 − π){1 − (1 − π)^N}, to the polynomial equation

−k + k(1 − π)^N + π · N = 0.

Since finding closed forms for the roots of such a polynomial equation of degree N is only possible up to N = 3, numerical methods are necessary to solve the score equation for larger N.
Suppose that we have g (a0 ) · g (b0 ) < 0 for two points a0 , b0 ∈ Θ. The intermediate
value theorem (e.g. Clarke 1971, p. 284) then guarantees that there exists at least
one root θ ∗ ∈ [a0 , b0 ] of g (θ ). The bisection method searches for a root θ ∗ with the
following iterative algorithm:
1. Compute the midpoint θ^(t) = (a_{t−1} + b_{t−1})/2 of the current interval.
2. If g'(a_{t−1}) · g'(θ^(t)) < 0, set [a_t, b_t] = [a_{t−1}, θ^(t)]; otherwise, set [a_t, b_t] = [θ^(t), b_{t−1}].
3. Iterate until the approximation is accurate enough, e.g. until the relative change satisfies

|θ^(t) − θ^(t−1)| / |θ^(t−1)| < ε

for some small ε > 0.
The bisection method is illustrated in Fig. C.1 using Example C.1 with y = 4 and
n = 6.
There exist several optimised bracketing methods. One of them is Brent's method, which was proposed in 1973 by Richard Peirce Brent (1946–) and combines the bisection method with linear and quadratic interpolation of the inverse function. It results in a faster convergence rate for sufficiently smooth functions. If only linear interpolation is used, then Brent's method is equivalent to the secant method described in Appendix C.1.4.
Brent’s method is implemented in the R-function uniroot(f, interval,
tol, ...), which searches in a given interval [a0 , b0 ] (interval = c(a0 , b0 )) for
a root of the function f. The convergence criterion can be controlled with the option
tol. The result of a call is a list with four elements containing the approximated root (root), the value of the function at the root (f.root), the number of performed iterations (iter) and the estimated deviation from the true root (estim.prec). The function f may take more than one argument, but the value θ needs to be the first.
Further arguments, which apply for all values of θ , can be passed to uniroot as
additional named parameters. This is formally hinted at through the “. . . ” argument
at the end of the list of arguments of uniroot.
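As an illustration, uniroot() can be applied to the polynomial form of the score equation from Example C.1 with k = 4 and N = 6 (the values used for Fig. C.1). The extra arguments k and N are passed through the "…" mechanism, and the search interval excludes the trivial root at π = 0:

```r
# Score equation of the truncated binomial example in polynomial form
h <- function(pi, k, N) -k + k * (1 - pi)^N + pi * N
res <- uniroot(h, interval = c(0.1, 0.9), k = 4, N = 6, tol = 1e-8)
res$root     # approx 0.6657, slightly below the untruncated MLE 4/6
res$f.root   # essentially zero
```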
A faster root finding method for sufficiently smooth functions g(θ) is the Newton–Raphson algorithm, which is named after the Englishmen Isaac Newton (1643–1727) and Joseph Raphson (1648–1715). Suppose that g'(θ) is differentiable with root θ* and g''(θ*) ≠ 0, i.e. θ* is either a local minimum or maximum of g(θ).
In every iteration t of the Newton–Raphson algorithm, the derivative g'(θ) is approximated using a linear Taylor expansion around the current approximation θ^(t) of the root θ*:

g'(θ) ≈ g̃'(θ) = g'(θ^(t)) + g''(θ^(t))(θ − θ^(t)).

The function g'(θ) is hence approximated using the tangent line to g'(θ) at θ^(t). The idea is then to approximate the root of g'(θ) by the root of g̃'(θ):

g̃'(θ) = 0 ⟺ θ = θ^(t) − g'(θ^(t)) / g''(θ^(t)).
The iterative procedure is thus defined as follows (see Fig. C.2 for an illustration):
1. Start with a value θ^(0) for which the second derivative is non-zero, g''(θ^(0)) ≠ 0, i.e. the function g(θ) is curved at θ^(0).
Fig. C.3  Convergence depends on the starting value as well: two searches for a root of f(x) = arctan(x). The dashed lines correspond to the tangent lines and the vertical lines between their roots and the function f(x). While for x^(0) = 1.35 in (a) convergence to the true root θ* = 0 is obtained, (b) illustrates that for x^(0) = 1.4 the algorithm diverges.
2. Compute the next value

θ^(t+1) = θ^(t) − g'(θ^(t)) / g''(θ^(t)).   (C.1)
3. If convergence has not yet been reached, go back to step 2; otherwise, θ^(t+1) is the final approximation of the root θ* of g'(θ).
The convergence of the Newton–Raphson algorithm depends on the form of the function g(θ) and the starting value θ^(0) (see Fig. C.3 for the importance of the latter). If, however, g'(θ) is twice differentiable, convex and has a root, the algorithm converges irrespective of the starting point.
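A minimal Newton–Raphson implementation in R, applied to the polynomial form of the score equation from Example C.1 (a sketch, not the book's code; the iteration is written for a general function f with derivative fprime):

```r
# Newton-Raphson iteration for a root of f, using the derivative fprime
newton <- function(f, fprime, x0, eps = 1e-10, maxit = 100) {
  x <- x0
  for (t in seq_len(maxit)) {
    xnew <- x - f(x) / fprime(x)                  # cf. update (C.1)
    if (abs(xnew - x) < eps * (abs(x) + eps)) return(xnew)
    x <- xnew
  }
  x
}
k <- 4; N <- 6
h      <- function(pi) -k + k * (1 - pi)^N + pi * N
hprime <- function(pi) -k * N * (1 - pi)^(N - 1) + N
pi.hat <- newton(h, hprime, x0 = 0.5)
pi.hat   # approx 0.6657
```

Starting from x0 = 0.5, only a handful of iterations are needed, illustrating the fast (quadratic) convergence near the root.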
Another perspective on the Newton–Raphson method is the following. An approximation of g through a second-order Taylor expansion around θ^(t) is given by

g(θ) ≈ g̃(θ) = g(θ^(t)) + g'(θ^(t))(θ − θ^(t)) + (1/2) g''(θ^(t))(θ − θ^(t))².

Minimising this quadratic approximation through

g̃'(θ) = 0 ⟺ g'(θ^(t)) + g''(θ^(t))(θ − θ^(t)) = 0
          ⟺ θ = θ^(t) − g'(θ^(t)) / g''(θ^(t))

again results in the update (C.1).
However, there are better alternatives in a univariate setting, for example the golden section search, for which no derivative is required. In every iteration of this method a fixed fraction (3 − √5)/2 ≈ 0.382, i.e. the complement of the golden section ratio to 1, of the current search interval is discarded. The R-function
optimize(f, interval, maximum = FALSE, tol, ...) extends this method
by iterating it with quadratic interpolation of f if the search interval is already small.
By default it searches for a minimum in the initial interval interval, which can be
changed using the option maximum = TRUE.
If the optimisation corresponds to a maximum likelihood problem, it is also possible to replace the observed Fisher information −l''(θ; x) = I(θ; x) with the expected Fisher information J(θ) = E{I(θ; X)}. This method is known as Fisher scoring. For some models, Fisher scoring is equivalent to Newton–Raphson. Otherwise, the expected Fisher information frequently has (depending on the model) a simpler form than the observed Fisher information, so that Fisher scoring can be advantageous.
The Newton–Raphson algorithm is easily generalised to multivariate real-valued
functions g(x) taking arguments x ∈ Rn (see Appendix B.2.2 for the definitions of
the multivariate derivatives). Using the n × n Hessian g''(θ^(t)) and the n × 1 gradient g'(θ^(t)), the update is defined as follows:

θ^(t+1) = θ^(t) − g''(θ^(t))^{−1} · g'(θ^(t)).
implies that -fn is minimised, i.e. fn is maximised. This and the option hessian
= TRUE are necessary to maximise a log-likelihood and to obtain the numerically
obtained curvature at the estimate. The function optim returns a list containing the
optimal function value (value), its corresponding x-value (par) and, if indicated,
the curvature (hessian) as well as the important convergence message: the algo-
rithm has obtained convergence if and only if the convergence code is 0. A more
recent implementation and extension of optim(...) is the function optimr(...)
in the package optimr. In the latter function, additional and multiple optimisation
methods can be specified.
θ^(t+1) = θ^(t) − (θ^(t) − θ^(t−1)) / {g'(θ^(t)) − g'(θ^(t−1))} · g'(θ^(t)).
Note that this method requires the specification of two starting points θ (0) and θ (1) .
The convergence is slower than for the Newton–Raphson method but faster than for
the bisection method. The secant method is illustrated in Fig. C.4 for Example C.1
with y = 4 and n = 6.
C.2 Integration

Consider the calculation of a definite integral

I = ∫_a^b f(x) dx

of a univariate function f(x). Unfortunately, there are only few functions f(x) for which the primitive F(x) = ∫^x f(u) du is available in closed form such that we could calculate

I = F(b) − F(a)

directly. Otherwise, we need to apply numerical methods to obtain an approximation for I.
For example, already the familiar density function of the standard normal distribution,

ϕ(x) = (1/√(2π)) exp(−x²/2),

does not have a primitive in closed form. Therefore, the distribution function Φ(x) of the standard normal distribution can only be specified in the general integral form

Φ(x) = ∫_{−∞}^{x} ϕ(u) du.
The Newton–Cotes formulas, which are named after Newton and Roger Cotes (1682–1716), are based on the piecewise integration of f(x):

I = ∫_a^b f(x) dx = Σ_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} f(x) dx   (C.3)

over the decomposition of the interval [a, b] into n pieces using the knots x₀ = a < x₁ < ··· < x_{n−1} < x_n = b. Each summand T_i = ∫_{x_i}^{x_{i+1}} f(x) dx in (C.3) is then approximated as follows: the function f is evaluated at the m + 1 equally-spaced interpolation points x_{i0} = x_i < x_{i1} < ··· < x_{i,m−1} < x_{i,m} = x_{i+1}. The resulting m + 1 points (x_{ij}, f(x_{ij})) can be interpolated using a polynomial p_i of degree m satisfying p_i(x_{ij}) = f(x_{ij}) for j = 0, …, m. Therefore, we obtain an approximation of
Fig. C.5 Illustration of the trapezoidal rule for f(x) = cos(x) sin(x) + 1, a = 0.5, b = 2.5 and
n = 2, with decomposition into the two parts [0.5, 1.5] and [1.5, 2.5]. The solid grey areas T_0 and T_1
are approximated by the corresponding areas of the hatched trapezoids. The function f in this case
can be integrated analytically, resulting in I = ¼{cos(2a) − cos(2b)} + (b − a). Substituting the
bounds leads to I = 2.0642. The approximation I ≈ 1 · {½ f(0.5) + f(1.5) + ½ f(2.5)} = 2.0412
is hence inaccurate (relative error of (2.0642 − 2.0412)/2.0642 = 0.0111)
the function f(x) in the interval [x_i, x_{i+1}], which we can integrate analytically:

    T_i ≈ ∫_{x_i}^{x_{i+1}} p_i(x) dx = Σ_{j=0}^{m} w_{ij} f(x_{ij}),    (C.4)

where the w_{ij} weight the function values at the interpolation points x_{ij} and
are available in closed form.
Choosing, for example, the bounds x_{i0} = x_i and x_{i1} = x_{i+1} of each interval as
interpolation points implies interpolation polynomials of degree 1, i.e. a straight
line through the end points. Each summand T_i is then approximated by the area of
a trapezoid (see Fig. C.5),

    T_i ≈ ½(x_{i+1} − x_i) · f(x_i) + ½(x_{i+1} − x_i) · f(x_{i+1}),

and the weights are in this case given by w_{i0} = w_{i1} = ½(x_{i+1} − x_i). The Newton–
Cotes formula for m = 1 is in view of this also called the trapezoidal rule. Substitution
into (C.3) leads to

    I ≈ Σ_{i=0}^{n−1} ½(x_{i+1} − x_i){f(x_i) + f(x_{i+1})} = h{½ f(x_0) + Σ_{i=1}^{n−1} f(x_i) + ½ f(x_n)},    (C.5)
where the last identity is only valid if the decomposition of the interval [a, b] is
equally-spaced with xi+1 − xi = h.
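A minimal Python sketch of the equally-spaced rule (C.5), checked against the example of Fig. C.5 (illustrative code, not from the book):

```python
import math

def trapezoidal(f, a, b, n):
    """Equally-spaced trapezoidal rule (C.5): h * (f(x0)/2 + f(x1) + ... + f(xn)/2)."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

f = lambda x: math.cos(x) * math.sin(x) + 1.0
exact = 0.25 * (math.cos(1.0) - math.cos(5.0)) + 2.0  # analytic value, 2.0642 to 4 d.p.

coarse = trapezoidal(f, 0.5, 2.5, 2)    # the two-piece approximation of Fig. C.5
fine = trapezoidal(f, 0.5, 2.5, 200)    # smaller h gives a much better result
```

The error of the trapezoidal rule shrinks quadratically in h, which is why refining from n = 2 to n = 200 improves the approximation so markedly.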
Intuitively, it is clear that a higher degree m results in a locally better approximation
of f(x). In particular, it allows the exact integration of polynomials up to degree m.
From (C.4) the question arises whether the 2(m + 1) degrees of freedom (from
m + 1 weights and m + 1 interpolation points) can be fully exploited in order to
exactly integrate polynomials up to degree 2(m + 1) − 1 = 2m + 1. Indeed, there exist
sophisticated methods, based on Gaussian quadrature, which choose weights and
interpolation points cleverly and achieve exactly this. Another important extension is
the adaptive choice of knots, with unequally spaced knots over [a, b]. Here, only a few
knots are chosen initially. After evaluating f(x) at intermediate points, new knots are
assigned to areas where the approximation of the integral changes strongly when knots
are introduced, or where the function has large absolute value. Hence, the density of
knots is higher in difficult areas of the integration interval, paralleling the Monte
Carlo integration of Sect. 8.3.
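The adaptive idea can be sketched with a classical adaptive Simpson scheme, a simplified stand-in for the more sophisticated routines used in practice, here applied to the standard normal density, which has no closed-form primitive:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b):
    """Basic Simpson rule on one interval."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))

def adaptive_simpson(f, a, b, tol=1e-10):
    """Recursively refine only where the local error estimate is too large."""
    m = 0.5 * (a + b)
    whole = simpson(f, a, b)
    left, right = simpson(f, a, m), simpson(f, m, b)
    if abs(left + right - whole) < 15.0 * tol:
        # Accept, with a Richardson-type correction of the remaining error.
        return left + right + (left + right - whole) / 15.0
    return adaptive_simpson(f, a, m, tol / 2.0) + adaptive_simpson(f, m, b, tol / 2.0)

# Phi(1), up to the negligible tail mass below -8:
approx = adaptive_simpson(phi, -8.0, 1.0)
```

The recursion automatically places many knots where φ varies strongly (around the mode) and few in the flat tail, which is exactly the behaviour described above.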
The R function integrate(f, lower, upper, rel.tol, abs.tol, ...)
implements such an adaptive method for the integration of f on the interval between
lower and upper. Improper integrals with boundaries -Inf or Inf can also be
calculated (by mapping the interval to [0, 1] through substitution and then applying
the algorithm for bounded intervals). The function f must be vectorised, i.e. it must
accept a vector as its first argument and return a vector of the same length. Hence,
the following will fail, because the function returns the scalar 2 for any input:

f <- function(x) 2
try(integrate(f, 0, 1))

A vectorised version such as f <- function(x) rep(2, length(x)) works as expected.
The desired accuracy of integrate is specified through the options rel.tol and
abs.tol (by default typically set to 1.22 · 10⁻⁴). The returned object is a list
containing, among other things, the approximated value of the integral (value), an
estimate of the absolute error (abs.error) and the convergence message (message),
which reads OK in case of convergence. However, one has to keep in mind that, for
example, singularities in the interior of the integration interval could be missed. It
is then advisable to calculate the integral piece by piece, so that the singularities
lie on the boundaries of the integration pieces and can hence be handled appropriately
by the integrate function.
Multidimensional integrals of multivariate real-valued functions f(x) over a
multidimensional rectangle A ⊂ Rⁿ,

    I = ∫_A f(x) dx = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} · · · ∫_{a_n}^{b_n} f(x) dx_n · · · dx_2 dx_1,

can in principle be computed by iterated application of the univariate methods above.
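A minimal sketch of this nesting idea (illustrative, not the book's code): apply a univariate rule to the inner integral for each fixed value of the outer variable, here for ∫₀¹ ∫₀¹ (x + y) dy dx = 1 using the trapezoidal rule.

```python
def trapezoidal(f, a, b, n):
    """Equally-spaced univariate trapezoidal rule."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

def double_integral(f, ax, bx, ay, by, n=50):
    """Nested univariate rules: outer integral over x of the inner integral over y."""
    inner = lambda x: trapezoidal(lambda y: f(x, y), ay, by, n)
    return trapezoidal(inner, ax, bx, n)

value = double_integral(lambda x, y: x + y, 0.0, 1.0, 0.0, 1.0)
```

The cost of such nesting grows exponentially with the dimension, which is one motivation for the Laplace approximation discussed next.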
The Laplace approximation applies to integrals of the form

    I_n = ∫_{−∞}^{+∞} exp{−n k(u)} du,    (C.6)

where k(u) is a convex and twice differentiable function with minimum at u = ũ.
Such integrals appear, for example, when calculating characteristics of posterior
distributions. For u = ũ, we thus have dk(ũ)/du = 0 and κ = d²k(ũ)/du² > 0. A
second-order Taylor expansion of k(u) around ũ gives k(u) ≈ k(ũ) + ½κ(u − ũ)²,
so (C.6) can be approximately written as

    I_n ≈ exp{−n k(ũ)} ∫_{−∞}^{+∞} exp{−½ n κ (u − ũ)²} du = exp{−n k(ũ)} · √(2π/(nκ)),

where the remaining integrand is the kernel of the N(ũ, (nκ)⁻¹) density.
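As a numerical check (with an illustrative choice of k, not from the book), take k(u) = cosh(u) − 1, which has ũ = 0 and κ = 1; the Laplace approximation exp{−n k(ũ)}√(2π/(nκ)) can be compared with a brute-force numerical value of the integral:

```python
import math

def k(u):
    """Convex function with minimum at u~ = 0 and curvature kappa = 1 there."""
    return math.cosh(u) - 1.0

def riemann(f, a, b, m=200000):
    """Simple midpoint rule, accurate enough for this smooth integrand."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))

n = 50
laplace = math.exp(-n * k(0.0)) * math.sqrt(2.0 * math.pi / (n * 1.0))
# The integrand is numerically zero outside [-5, 5] for n = 50:
numeric = riemann(lambda u: math.exp(-n * k(u)), -5.0, 5.0)
rel_err = abs(laplace - numeric) / numeric
```

The relative error is of order 1/n, consistent with the general error behaviour of the Laplace approximation.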
In the multivariate case we consider the integral

    I_n = ∫_{R^p} exp{−n k(u)} du,

which is analogously approximated by

    I_n ≈ exp{−n k(ũ)} · (2π/n)^{p/2} · |K|^{−1/2},

where K denotes the p × p Hessian of k(u) at ũ, and |K| is the determinant of K.
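The multivariate formula can be checked on a separable example (an illustrative choice, not from the book): for k(u) = {cosh(u₁) − 1} + {cosh(u₂) − 1} with p = 2, we have ũ = 0, k(ũ) = 0 and K equal to the identity, so the approximation is (2π/n)^{p/2}, while the exact integral factorises into the square of a univariate integral:

```python
import math

def k1(u):
    """Univariate component; the bivariate k(u) = k1(u1) + k1(u2) separates."""
    return math.cosh(u) - 1.0

def riemann(f, a, b, m=200000):
    """Simple midpoint rule for the smooth univariate factor."""
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))

n, p = 50, 2
# Separable integrand: the 2-d integral is the square of the 1-d integral.
one_dim = riemann(lambda u: math.exp(-n * k1(u)), -5.0, 5.0)
numeric = one_dim ** 2
# Here u~ = 0, k(u~) = 0 and K is the 2x2 identity, so |K| = 1:
laplace = (2.0 * math.pi / n) ** (p / 2.0)
rel_err = abs(laplace - numeric) / numeric
```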
Notation
A event or set
|A| cardinality of a set A
x∈A x is an element of A
x ∉ A x is not an element of A
Ac complement of A
A∩B joint event: A and B
A∪B union event: A and/or B
A ∪̇ B disjoint union: either A or B
A⊂B A is a subset of B
Pr(A) probability of A
Pr(A | B) conditional probability of A given B
X random variable
X multivariate random variable
X1:n , X 1:n random sample
X=x event that X equals realisation x
fX (x) density (or probability mass) function of X
fX,Y (x, y) joint density function of X and Y
fY | X (y | x) conditional density of Y given X = x
FX (x) distribution function of X
FX,Y (x, y) joint distribution function of X and Y
FY | X (y | x) conditional distribution function of Y given X = x
T support of a random variable or vector
E(X) expectation of X
Var(X) variance of X
mk kth moment of X
ck kth central moment of X
Cov(X) covariance matrix of X
Mod(X) mode of X
Med(X) median of X
Cov(X, Y ) covariance of X and Y
Corr(X, Y ) correlation of X and Y
D(fX fY ) Kullback–Leibler discrepancy from fX to fY
E(Y | X = x) conditional expectation of Y given X = x
Var(Y | X = x) conditional variance of Y given X = x
L. Held, D. Sabanés Bové, Applied Statistical Inference, Statistics for Biology and Health, 389
https://doi.org/10.1007/978-3-662-60792-3,
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Xn −r→ X convergence in rth mean
Xn −D→ X convergence in distribution
Xn −P→ X convergence in probability
X ∼ F X distributed as F
Xn ∼a F Xn asymptotically distributed as F
Xi ∼iid F Xi independent and identically distributed as F
Xi ∼ind Fi Xi independent with distribution Fi
A ∈ Ra×b a × b matrix with entries aij ∈ R
a ∈ Rk vector with k entries ai ∈ R
dim(a) dimension of a vector a
|A| determinant of A
A⊤ transpose of A
tr(A) trace of A
A−1 inverse of A
diag(a) diagonal matrix with a on diagonal
diag{ai }ki=1 diagonal matrix with a1 , . . . , ak on diagonal
I identity matrix
1, 0 ones and zeroes vectors
IA (x) indicator function of a set A
⌈x⌉ least integer not less than x
⌊x⌋ integer part of x
|x| absolute value of x
log(x) natural logarithm function
exp(x) exponential function
logit(x) logit function log{x/(1 − x)}
sign(x) sign function with value 1 for x > 0, 0 for x = 0 and
−1 for x < 0
ϕ(x) standard normal density function
Φ(x) standard normal distribution function
x! factorial of non-negative integer x
(n choose x) binomial coefficient n!/{x!(n − x)!} (n ≥ x)
Γ(x) Gamma function
B(x, y) Beta function
f′(x), d/dx f(x) first derivative of f(x)
f″(x), d²/dx² f(x) second derivative of f(x)
f′(x), ∂/∂xi f(x) gradient, (which contains) partial first derivatives of f(x)
f″(x), ∂²/(∂xi ∂xj) f(x) Hessian, (which contains) partial second derivatives of f(x)
arg maxx∈A f (x) argument of the maximum of f (x) from A
R set of all real numbers
R+ set of all positive real numbers
R+0 set of all positive real numbers and zero