Bayesian Methods for Statistical Analysis
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise,
without the prior permission of the publisher.
Abstract
Acknowledgements
Preface
Overview
Acknowledgements
‘Bayesian Methods for Statistical Analysis’ derives from the lecture notes
for a four-day course titled ‘Bayesian Methods’, which was presented to
staff of the Australian Bureau of Statistics, at ABS House in Canberra, in
2013. Lectures of three hours each were held in the mornings of 11, 18
and 25 November and 9 December, and three-hour tutorials were held in
the mornings of 14, 20 and 27 November and 11 December.
Of the 30-odd participants, some of whom attended via video link from
regional ABS offices, special thanks go to Anura Amarasinghe, Rachel
Barker, Geoffrey Brent, Joseph Chien, Alexander Hanysz, Sebastien
Lucie, Peter Radisich and Anthony Russo, who asked insightful questions,
pointed out errors, and contributed to an improved second edition of the
lecture notes. Thanks also to Siu-Ming Tam, First Assistant Statistician
of the Methodology and Data Management Division at ABS, for useful
comments, and for inviting the author to present the course in the first
place, after having read Puza (1995). Last but not least, special thanks go
to Kylie Johnson for her excellent work as the course administrator.
Preface
The software packages which feature in this book are R and WinBUGS.
This book is in the form of an Adobe PDF file saved from Microsoft Word
2013 documents, with the equations as MathType 6.9 objects. The figures
in the book were created using Microsoft Paint, the Snipping Tool in
Windows, WinBUGS and R. In the few instances where color is used, this
is only for additional clarity. Thus, the book can be printed in black and
white with no loss of essential information.
Overview
CHAPTER 1
Bayesian Basics Part 1
1.1 Introduction
Bayesian methods is a term which may be used to refer to any
mathematical tools that are useful and relevant in some way to Bayesian
inference, an approach to statistics based on the work of Thomas Bayes
(1701–1761). Bayes was an English mathematician and Presbyterian
minister who is best known for having formulated a basic version of the
well-known Bayes’ Theorem.
Figure 1.1 (page 3) shows part of the Wikipedia article for Thomas
Bayes. Bayes’ ideas were later developed and generalised by many
others, most notably the French mathematician Pierre-Simon Laplace
(1749–1827) and the British astronomer Harold Jeffreys (1891–1989).
How to generate this sample presents another problem, but one which
can typically be solved easily via Markov chain Monte Carlo (MCMC)
methods. Both MC and MCMC methods will feature in later chapters of
the course.
More generally, we may consider any event B such that P(B) > 0 and k > 1 events A1, ..., Ak which form a partition of any superset of B (such as the entire sample space S). Then, for any i = 1, ..., k, it is true that
P(Ai | B) = P(AiB) / P(B),
where P(B) = Σ_{j=1}^{k} P(AjB) and P(AjB) = P(Aj) P(B | Aj).
The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.
A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease).
What is the probability that the person actually has the disease?
Let A be the event that the person has the disease, and let B be the event
that they test positive for the disease. Then:
P(A) = 0.01 (the prior probability of the person having the disease)
P(B | A) = 0.9 (the true positive rate, also called the sensitivity of the test)
P(B′ | A′) = 0.9 (the true negative rate, also called the specificity of the test).
Discussion
It may seem the posterior probability that the person has the disease (1/12) is rather low, considering the high accuracy of the test (namely P(B | A) = P(B′ | A′) = 0.9).
On the other hand, it may be noted that the posterior probability of the person having the disease is actually very high relative to the prior probability of them having the disease (P(A) = 0.01). The positive test result has greatly increased the person’s chance of having the disease (increased it by more than 700%, since 0.01 + 7.333 × 0.01 = 0.08333).
We find that
P(A | B) = P(A)P(B | A) / [P(A)P(B | A) + P(A′)P(B | A′)] = pq / [pq + (1 − p)(1 − q)].
Figure 1.3 shows the posterior probability of the person having the disease (P(A | B)) as a function of p with q fixed at 0.9 and 0.95, respectively (subplot (a)), and as a function of q with p fixed at 0.01 and 0.05, respectively (subplot (b)). In each case, the answer (1/12) is represented as a dot corresponding to p = 0.01 and q = 0.9.
pvec=seq(0,1,0.01); Pveca=PAgBfun(p=pvec,q=0.9)
Pveca2=PAgBfun(p=pvec,q=0.95)
qvec=seq(0,1,0.01); Pvecb=PAgBfun(p=0.01,q=qvec)
Pvecb2=PAgBfun(p=0.05,q=qvec)
X11(w=8,h=7); par(mfrow=c(2,1));
plot(pvec,Pveca,type="l",xlab="p=P(A)",ylab="P(A|B)",lwd=2)
points(0.01,1/12,pch=16,cex=1.5); text(0.05,0.8,"(a)",cex=1.5)
lines(pvec,Pveca2,lty=2,lwd=2)
legend(0.7,0.5,c("q = 0.9","q = 0.95"),lty=c(1,2),lwd=c(2,2))
plot(qvec,Pvecb,type="l",xlab="q=P(B|A)=P(B'|A')",ylab="P(A|B)",lwd=2)
points(0.9,1/12,pch=16,cex=1.5); text(0.05,0.8,"(b)",cex=1.5)
lines(qvec,Pvecb2,lty=2,lwd=2)
legend(0.2,0.8,c("p = 0.01","p = 0.05"),lty=c(1,2),lwd=c(2,2))
# Technical note: The graph here was copied from R as ‘bitmap’ and then
# pasted into a Word document, which was then saved as a PDF. If the graph
# is copied from R as ‘metafile’, it appears correct in the Word document,
# but becomes corrupted in the PDF, with axis legends slightly off-centre.
# So, all graphs in this book created in R were copied into Word as ‘bitmap’.
In a particular population:
10% of persons have Type 1 blood,
and of these, 2% have a particular disease;
30% of persons have Type 2 blood,
and of these, 4% have the disease;
60% of persons have Type 3 blood,
and of these, 3% have the disease.
A person is randomly selected from the population and found to have the
disease.
Hence: P(C | D) = P(CD)/P(D) = 0.018/0.032 = 9/16 = 56.25%
(where C is the event that the person has Type 3 blood and D is the event that they have the disease).
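A quick check of this calculation in R (a sketch using the figures given in the exercise):
priorv = c(0.10, 0.30, 0.60)     # P(Type 1), P(Type 2), P(Type 3)
diseasev = c(0.02, 0.04, 0.03)   # P(disease | blood type)
jointv = priorv*diseasev         # 0.002 0.012 0.018
sum(jointv)                      # P(D) = 0.032
jointv[3]/sum(jointv)            # P(Type 3 | D) = 0.5625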
Then we calculate:
π 0 = P( E0 ) = the prior probability of the null hypothesis
π 1 = P( E1 ) = the prior probability of the alternative hypothesis
PRO = π 0 / π 1 = the prior odds in favour of the null hypothesis
p0 = P( E0 | D ) = the posterior probability of the null hypothesis
p1 = P( E1 | D ) = the posterior probability of the alternative hypothesis
POO = p0 / p1 = the posterior odds in favour of the null hypothesis.
The Bayes factor is then defined as BF = POO/PRO. Since POO = PRO × [P(D | E0)/P(D | E1)] by Bayes’ rule, the Bayes factor may also be interpreted as the ratio of the likelihood of the data given the null hypothesis to the likelihood of the data given the alternative hypothesis.
Note 2: The idea of a Bayes factor extends to situations where the null
and alternative hypotheses are statistical models rather than events. This
idea may be taken up later.
The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.
A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease).
Calculate the Bayes factor for testing that the person has the disease
versus that they do not have the disease.
This means the positive test result has multiplied the odds of the person
having the disease relative to not having it by a factor of 9 or 900%.
Another way to say this is that those odds have increased by 800%.
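A sketch of these odds calculations in R:
PA = 0.01; qv = 0.9                       # prior probability of disease; test accuracy
priorodds = PA/(1 - PA)                   # 1/99
postodds = (PA*qv)/((1 - PA)*(1 - qv))    # 1/11
postodds/priorodds                        # Bayes factor = 9
(postodds - priorodds)/priorodds          # 8, i.e. an 800% increase in the odds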
In the first few examples below, we will focus on the simplest case where both y and θ are scalar and discrete.
Consider six loaded dice with the following properties. Die A has
probability 0.1 of coming up 6, each of Dice B and C has probability 0.2
of coming up 6, and each of Dice D, E and F has probability 0.3 of
coming up 6.
A die is chosen randomly from the six dice and rolled twice. On both
occasions, 6 comes up.
Let y be the number of times that 6 comes up on the two rolls of the
chosen die, and let θ be the probability of 6 coming up on a single roll
of that die. Then the Bayesian model is:
(y | θ) ~ Bin(2, θ)
f(θ) = 1/6 for θ = 0.1, 2/6 for θ = 0.2, and 3/6 for θ = 0.3.
So f(y) = Σ_θ f(θ) f(y | θ) = (1/6)(0.1)² + (2/6)(0.2)² + (3/6)(0.3)² = 0.06.
So f(θ | y) = f(θ) f(y | θ)/f(y)
= (1/6)(0.1)²/0.06 = 0.02778 for θ = 0.1
= (2/6)(0.2)²/0.06 = 0.22222 for θ = 0.2
= (3/6)(0.3)²/0.06 = 0.75 for θ = 0.3.
Note: This result means that if the chosen die were to be tossed again a
large number of times (say 10,000) then there is a 75% chance that 6
would come up about 30% of the time, a 22.2% chance that 6 would
come up about 20% of the time, and a 2.8% chance that 6 would come
up about 10% of the time.
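A sketch of the same calculation in R:
thetav = c(0.1, 0.2, 0.3)            # probability of a 6 for the three kinds of dice
priorv = c(1, 2, 3)/6                # prior f(theta)
likev = dbinom(2, 2, thetav)         # f(y|theta) for y = 2 sixes in 2 rolls
sum(priorv*likev)                    # f(y) = 0.06
priorv*likev/sum(priorv*likev)       # posterior: 0.02778 0.22222 0.75000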
This method is to multiply the prior density (or the kernel of that density) by the likelihood function and try to identify the resulting function of θ as the density of a well-known or common distribution.
A die is chosen randomly from the six dice and rolled twice. On both
occasions, 6 comes up.
With y denoting the number of times 6 comes up, the Bayesian model may be written:
f(y | θ) = C(2, y) θ^y (1 − θ)^(2−y), y = 0, 1, 2
f(θ) = 10θ/6, θ = 0.1, 0.2, 0.3.
Note: 10θ/6 = 1/6, 2/6 and 3/6 for θ = 0.1, 0.2 and 0.3, respectively.
Hence f(θ | y) ∝ f(θ) f(y | θ)
= (10θ/6) × C(2, y) θ^y (1 − θ)^(2−y)
∝ θ × θ² since y = 2.
Thus f(θ | y) ∝ θ³ = 0.1³ = 1/1000 for θ = 0.1, 0.2³ = 8/1000 for θ = 0.2, and 0.3³ = 27/1000 for θ = 0.3; that is, f(θ | y) ∝ 1, 8 and 27 for θ = 0.1, 0.2 and 0.3, respectively.
Now, 1 + 8 + 27 = 36, and so
f(θ | y) = 1/36 = 0.02778 for θ = 0.1
= 8/36 = 0.22222 for θ = 0.2
= 27/36 = 0.75 for θ = 0.3,
which is the same result as obtained earlier in Exercise 1.4.
You are visiting a town with buses whose licence plates show their numbers consecutively from 1 up to however many there are. In your mind the number of buses could be anything from one to five, with all possibilities equally likely. Whilst touring the town you first happen to see Bus 3.
Assuming that at any point in time you are equally likely to see any of the buses in the town, how likely is it that the town has at least four buses?
Let θ be the number of buses in the town and let y be the number of the bus that you happen to first see. Then an appropriate Bayesian model is:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5 (prior).
Having seen Bus 3 (y = 3), the posterior is f(θ | y) ∝ f(θ) f(y | θ) ∝ 1/θ for θ = 3, 4, 5, which normalises to f(θ = 3 | y) = 20/47, f(θ = 4 | y) = 15/47 and f(θ = 5 | y) = 12/47.
So the posterior probability that the town has at least four buses is
P(θ ≥ 4 | y) = Σ_{θ ≥ 4} f(θ | y) = f(θ = 4 | y) + f(θ = 5 | y)
= 1 − f(θ = 3 | y) = 1 − 20/47 = 27/47 = 0.5745.
Discussion
So, under this alternative prior, the probability of there being at least four buses in the town (given that you have seen Bus 3) works out as
P(θ ≥ 4 | y) = 1 − P(θ = 3 | y) = 1 − 1/(9c) = 0.7187.
In each of nine indistinguishable boxes there are nine balls, the ith box
having i red balls and 9 − i white balls (i = 1,…,9).
One box is selected randomly from the nine, and then three balls are
chosen randomly from the selected box (without replacement and
without looking at the remaining balls in the box).
Exactly two of the three chosen balls are red. Find the probability that
the selected box has at least four red balls remaining in it.
In our case,
f(θ | y) ∝ θ!(9 − θ)! / [(θ − 2)!(9 − θ − (3 − 2))!], θ = 2, ..., 9 − (3 − 2),
or more simply,
f(θ | y) ∝ θ(θ − 1)(9 − θ), θ = 2, ..., 8.
Thus f(θ | y) ∝ k(θ), where k(θ) = 14, 36, 60, 80, 90, 84 and 56 for θ = 2, 3, 4, 5, 6, 7 and 8, respectively, and where
c ≡ Σ_{θ=2}^{8} k(θ) = 14 + 36 + … + 56 = 420.
So f(θ | y) = k(θ)/c
= 14/420 = 0.03333 for θ = 2
= 36/420 = 0.08571 for θ = 3
= 60/420 = 0.14286 for θ = 4
= 80/420 = 0.19048 for θ = 5
= 90/420 = 0.21429 for θ = 6
= 84/420 = 0.20000 for θ = 7
= 56/420 = 0.13333 for θ = 8.
The probability that the selected box has at least four red balls remaining
is the posterior probability that θ (the number of red balls initially in the
box) is at least 6 (since two red balls have already been taken out of the
box). So the required probability is
P(θ ≥ 6 | y) = (90 + 84 + 56)/420 = 23/42 = 0.5476.
23/42 # 0.5476
1-0.45238 # 0.5476 (alternative calculation of the required probability)
sum((kv/c)[tv>=6]) # 0.5476
# (yet another calculation of the required probability)
These three posteriors and the prior are illustrated in Figure 1.5.
X11(w=8,h=5); par(mfrow=c(1,1));
plot(c(0,1),c(0,3),type="n",xlab="theta",ylab="density")
lines(c(0,1),c(1,1),lty=1,lwd=3); tv=seq(0,1,0.01)  # prior: U(0,1) = Beta(1,1) density
lines(tv,3*(1-tv)^2,lty=2,lwd=3)     # Beta(1,3) posterior density
lines(tv,3*2*tv*(1-tv),lty=3,lwd=3)  # Beta(2,2) posterior density
lines(tv,3*tv^2,lty=4,lwd=3)         # Beta(3,1) posterior density
In contrast, for the ‘buses’ example further above (Exercise 1.6), which
involves the model:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5,
the quantity of interest θ represents the number of buses in a population
of buses, which of course is finite.
Noting that 0 < y < θ < 1, we see that the posterior density is
f(θ | y) = f(θ) f(y | θ)/f(y) = [1 × (1/θ)] / ∫_y^1 1 × (1/θ) dθ
= (1/θ)/(log 1 − log y) = −1/(θ log y), y < θ < 1.
1.10 Conjugacy
When the prior and posterior distributions are members of the same class
of distributions, we say that they form a conjugate pair, or that the prior
is conjugate. For example, consider the binomial-beta model:
(y | θ) ~ Binomial(n, θ)
θ ~ Beta(α, β) (prior)
⇒ (θ | y) ~ Beta(α + y, β + n − y) (posterior).
Since both prior and posterior are beta, the prior is conjugate.
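A quick numerical check of this conjugacy in R (a sketch; the values of n, y, α and β below are arbitrary illustrative choices, not taken from an exercise in the book):
alp = 2; bet = 3; n = 10; y = 4                         # hypothetical values
thetav = seq(0.001, 0.999, 0.001)
post = dbeta(thetav, alp, bet)*dbinom(y, n, thetav)     # prior x likelihood (unnormalised)
post = post/sum(post*0.001)                             # normalise numerically on the grid
max(abs(post - dbeta(thetav, alp+y, bet+n-y)))          # close to 0: posterior is Beta(alp+y, bet+n-y)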
Since both prior and posterior are gamma, the prior is conjugate.
or lim_{θ→m} f(θ | x) = sup_θ f(θ | x), or the set of all such values.
• The posterior median of θ is
Median(θ | y) = any value m of θ such that
P(θ ≤ m | y) ≥ 1/2 and P(θ ≥ m | y) ≥ 1/2,
or the set of all such values.
Note 1: In some cases, the posterior mean does not exist or it is equal to
infinity or minus infinity.
Note 2: Typically, the posterior mode and posterior median are unique.
The above definitions are given for completeness.
Figure 1.6 illustrates the idea of the HPDR. In the very common situation where θ is scalar, continuous and has a posterior density which is unimodal with no local modes (i.e. has the form of a single ‘mound’), the 1 − α HPDR takes on the form of a single interval defined by two points at which the posterior density has the same value. When the HPDR is a single interval, it is the shortest possible single interval over which the area under the posterior density is 1 − α.
Figure 1.7 illustrates the idea of the CPDR. One drawback of the CPDR
is that it is only defined for a scalar parameter. Another drawback is that
some values inside the CPDR may be less likely a posteriori than some
values outside it (which is not the case with the HPDR). For example, in
Figure 1.7, a value just below the upper bound of the 80% CPDR has a
smaller posterior density than a value just below the lower bound of that
CPDR. However, CPDRs are typically easier to calculate than HPDRs.
Other variations are possible (of the form [a,b) and (a,b]); but when the parameter of interest is continuous these definitions are all equivalent. Yet another definition of the 1 − α CPDR is any of the CPDRs as defined above but with all a posteriori impossible values of θ excluded.
We have a bent coin, for which θ, the probability of heads coming up, is unknown. Our prior beliefs regarding θ may be described by a standard uniform distribution. Thus no value of θ is deemed more or less likely than any other. The coin is tossed five times and heads comes up on every toss.
Find the posterior mean, mode and median of θ. Also find the 80% HPDR and CPDR for θ.
The posterior cdf is F(θ | y) = ∫_0^θ 6t⁵ dt = θ⁶, 0 < θ < 1.
Therefore:
E(θ | y) = 6/(6 + 1) = 6/7 = 0.8571
Mode(θ | y) = (6 − 1)/[(6 − 1) + (1 − 1)] = 1
Median(θ | y) = solution in θ of F(θ | y) = 1/2, i.e. θ⁶ = 0.5,
= (0.5)^(1/6) = 0.8909.
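The quantities plotted in the code below (postmean, postmode, postmedian, hpdr, cpdr) come from a part of the book not shown here; a sketch consistent with the Beta(6,1) posterior derived above is:
postmean = 6/7; postmode = 1
postmedian = qbeta(0.5, 6, 1)        # 0.8909 = 0.5^(1/6)
cpdr = qbeta(c(0.1, 0.9), 6, 1)      # 80% CPDR: (0.6813, 0.9826)
hpdr = c(0.2^(1/6), 1)               # 80% HPDR: (0.7647, 1), since the density 6*theta^5 is increasing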
points(cpdr,rep(0.4,2),pch=16); lines(cpdr,rep(0.4,2),lty=2,lwd=2)
abline(v=c(postmean,postmode,postmedian),lty=3)
abline(v=c(0,hpdr,cpdr),lty=3); abline(h=c(0,6),lty=3)
legend(0.2,5.8,c("posterior mean","posterior mode",
"posterior median"),pch=c(1,2,4))
legend(0.2,2.8,c("80% CPDR","80% HPDR"),lty=c(2,3),lwd=c(2,2))
Find the 90% HPDR and 90% CPDR for θ . Also find the 50% HPDR
and 50% CPDR for θ . For each region, calculate the associated exact
coverage probability.
The smallest set S such that P(θ ∈ S | y) ≥ 0.4 is {2} or {3}. With the additional requirement that f(θ₁ | y) ≥ f(θ₂ | y) whenever θ₁ ∈ S and θ₂ ∉ S, we see that S = {3} (only). That is, the 40% HPDR is the singleton set {3}.
Note: In the context where we toss a bent coin five times and get heads
every time (and the prior on the probability of heads is standard
uniform), the quantity ψ may be interpreted as the probability of the
next two tosses both coming up heads, or equivalently, as the proportion
of times heads will come up twice if the coin is repeatedly tossed in
groups of two tosses a hypothetically infinite number of times.
It follows that the posterior mean of ψ is
ψ̂ = E(ψ | y) = ∫_0^1 ψ (3ψ²) dψ = 3/4 = 0.75,
or
ψ̂ = E(θ² | y) = V(θ | y) + {E(θ | y)}²
= (6 × 1)/[(6 + 1)²(6 + 1 + 1)] + [6/(6 + 1)]² = 0.75.
Thus θ̂ = (1 − k)A + kB,
where A = α/(α + β), B = y/n, and k = n/(α + β + n).
(a) n = 5, y = 4, α = 2, β = 6
(b) n = 20, y = 16, α = 2, β = 6.
In both cases, the prior mean is the same (A = 2/(2 + 6) = 0.25), as is the
MLE (B = 4/5 = 16/20 = 0.8). However, due to n being larger in case (b)
(i.e. there being more direct data), case (b) leads to a larger credibility
factor (0.714 compared to 0.385) and hence a posterior mean closer to
the MLE (0.643 compared to 0.462).
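A sketch in R reproducing the two cases:
credest = function(n, y, alp, bet){
  A = alp/(alp + bet); B = y/n; k = n/(alp + bet + n)   # prior mean, MLE, credibility factor
  c(k = k, postmean = (1 - k)*A + k*B) }
credest(n=5, y=4, alp=2, bet=6)      # k = 0.385, posterior mean = 0.462
credest(n=20, y=16, alp=2, bet=6)    # k = 0.714, posterior mean = 0.643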
Note: Each likelihood function in Figure 1.9 has been normalised so that the area underneath it is exactly 1. This means that in each case (a) and (b), the likelihood function L(θ) as shown is identical to the posterior density which would be implied by the standard uniform prior, i.e. under f(θ) = f_U(0,1)(θ) = f_Beta(1,1)(θ). Thus, L(θ) = f_Beta(1+y, 1+n−y)(θ).
X11(w=8,h=7); par(mfrow=c(2,1))
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3),
cex=rep(1.5,3),lwd=2); text(0,2.5,"(a)",cex=1.5)
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.4615385
n/(alp+bet+n) # 0.3846154
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3),
cex=rep(1.5,3),lwd=2); text(0,4.5,"(b)",cex=1.5)
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.6428571
n/(alp+bet+n) # 0.7142857
Now, the prior mode of θ is Mode(θ) = (α − 1)/[(α − 1) + (β − 1)] = (α − 1)/(α + β − 2).
So we write
Mode(θ | y) = (α − 1)/(α + β − 2 + n) + y/(α + β − 2 + n)
= [(α + β − 2)/(α + β − 2 + n)] × (α − 1)/(α + β − 2) + [n/(α + β − 2 + n)] × (y/n).
Find the posterior distribution of μ given data in the form of the vector y = (y₁, ..., yₙ).
f(μ | y) ∝ exp{ −(1/2)[ (1/σ₀²)(μ² − 2μμ₀ + μ₀²) + (1/σ²)(Σᵢ yᵢ² − 2nȳμ + nμ²) ] },   (1.1)
where ȳ = (y₁ + ... + yₙ)/n is the sample mean.
It remains to find the normal mean and variance parameters, μ* and σ*². (These must be functions of the known quantities n, ȳ, σ, μ₀ and σ₀.)
Writing the exponent in (1.1) in the form −(1/2)(aμ² − 2bμ + c), we have σ*² = 1/a and μ* = b/a.
μ* = b/a = [μ₀/σ₀² + nȳ/σ²] / [1/σ₀² + n/σ²] = (σ²μ₀ + nσ₀²ȳ)/(σ² + nσ₀²).   (1.4)
Note 3: Since both prior and posterior are normal, the prior is
conjugate.
Note 4: The posterior mean, mode and median of μ are the same and equal to μ*. The 1 − α CPDR and 1 − α HPDR for μ are the same and equal to (μ* ± z_{α/2} σ*).
That is, if we know only the sample mean ȳ, the posterior distribution of μ is the same as if we know y, i.e. all n sample values. Knowing the individual yᵢ values makes no difference to the inference.
That is, if the prior information is very ‘precise’ or ‘definite’, the data has little influence on the posterior. So the posterior is approximately equal to the prior; i.e. f(μ | y) ≈ f(μ), or equivalently, (μ | y) ~ μ approximately. In this case the posterior mean, mode and median of μ are approximately equal to μ₀. Also, the 1 − α CPDR and 1 − α HPDR for μ are approximately equal to (μ₀ ± z_{α/2} σ₀).
So, in this case, just as when σ₀ is large, the prior distribution has very little influence on the posterior, and the ensuing inference is almost the same as that implied by the classical approach.
Create a graph which shows these estimates as well as the prior density,
prior mean, likelihood, MLE and posterior density.
Here:
n = 3
ȳ = (8.4 + 10.1 + 9.4)/3 = 9.3
k = 1/[1 + (1²/3)/(1/2)²] = 3/7 = 0.4285714
μ* = (1 − 3/7) × 5 + (3/7) × 9.3 = 6.8428571
σ*² = (3/7) × (1²/3) = 1/7 = 0.1428571.
Figure 1.10 shows the various densities and estimates here, as well as the
normalised likelihood. Note that the likelihood function as shown is also
the posterior density if the prior is taken to be uniform over the whole
real line, i.e. µ ~ U (−∞, ∞) .
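A sketch in R reproducing these values and the 95% CPDR marked in Figure 1.10 (the N(5, (1/2)²) prior and known standard deviation σ = 1 are those used in the working above):
y = c(8.4, 10.1, 9.4); n = length(y); ybar = mean(y)  # ybar = 9.3
sig = 1; mu0 = 5; sig0 = 1/2                          # known sd and prior parameters
k = n/(n + sig^2/sig0^2)                              # 3/7 = 0.4285714
mus = (1 - k)*mu0 + k*ybar                            # 6.8428571
sigs2 = k*sig^2/n                                     # 1/7 = 0.1428571
cpdr = mus + c(-1, 1)*qnorm(0.975)*sqrt(sigs2)        # 95% CPDR: (6.102, 7.584)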
Discussion
Note that the posteriors in Figures 1.12 and 1.13 have the same mean but
different variances.
plot(c(0,11),c(-0.1,1.3),type="n",xlab="",ylab="density/likelihood")
lines(muv,prior,lty=1,lwd=2); lines(muv,like,lty=2,lwd=2)
lines(muv,post,lty=3,lwd=2)
points(c(mu0,ybar,mus),c(0,0,0),pch=c(1,2,4),cex=rep(1.5,3),lwd=2)
points(cpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,1.3,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(0,0.7,c("Prior mean","Sample mean (MLE)","Posterior mean",
"95% CPDR bounds"), pch=c(1,2,4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
text(10.8,-0.075,"m", vfont=c("serif symbol","italic"), cex=1.5)
f(λ | y) ∝ f(λ) f(y | λ)
∝ λ^(α−1) e^(−βλ) × λ^(n/2) exp{ −(λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² }
= λ^(a−1) e^(−bλ) for some a and b.
We see that
(λ | y) ~ G(a, b),
where: a = α + n/2
b = β + (n/2) sμ²
sμ² = (1/n) Σᵢ₌₁ⁿ (yᵢ − μ)².
So the 1 − A CPDR for u is ( χ²_{1−A/2}(2α + n), χ²_{A/2}(2α + n) ).
So the 1 − A CPDR for λ = u/(2β + nsμ²) is
( χ²_{1−A/2}(2α + n)/(2β + nsμ²), χ²_{A/2}(2α + n)/(2β + nsμ²) ).
So
(y₁ − μ)²/σ², ..., (yₙ − μ)²/σ² ~ iid χ²(1).
So
Σᵢ₌₁ⁿ [(yᵢ − μ)/σ]² = nsμ²/σ² ~ χ²(n).
So
1 − A = P( χ²_{1−A/2}(n) < nsμ²/σ² < χ²_{A/2}(n) )
= P( nsμ²/χ²_{A/2}(n) < σ² < nsμ²/χ²_{1−A/2}(n) ).
Observe that Eλ = ε/ε = 1 for all ε, and Vλ = ε/ε² = 1/ε → ∞ as ε → 0.
The 1 − A CPDR for σ² = 1/λ is ( (2β + nsμ²)/χ²_{A/2}(2α + n), (2β + nsμ²)/χ²_{1−A/2}(2α + n) ).
(a) Calculate the posterior mean, mode and median of the model precision λ. Also calculate the 95% CPDR for λ. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(b) Calculate the posterior mean, mode and median of the model variance σ² = 1/λ. Also calculate the 95% CPDR for σ². Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(c) Calculate the posterior mean, mode and median of the model standard deviation σ. Also calculate the 95% CPDR for σ. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(d) Examine each of the point estimates in (a), (b) and (c) and determine
which ones, if any, can be easily expressed in the form of a credibility
estimate.
So:
• the posterior mean of λ is E(λ | y) = a/b = 0.8547
• the posterior mode is Mode(λ | y) = (a − 1)/b = 0.6648
• the posterior median is the 0.5 quantile of the G(a,b) distribution and works out as Median(λ | y) = 0.7923 (as obtained using the qgamma() function in R; see below)
• the 95% CPDR for λ is (0.2564, 1.8065) (where the bounds are the 0.025 and 0.975 quantiles of the G(a,b) distribution).
Also:
• the prior mean is Eλ = α/β = 1.5
• the prior mode is Mode(λ) = (α − 1)/β = 1
• the prior median is Median(λ) = 1.3370
• the MLE of λ is λ̂ = 1/sμ² = 0.4594.
Figure 1.14 shows the various densities and estimates here, as well as the
normalised likelihood function.
f(σ²) = [β^α/Γ(α)] [(σ²)^(−1)]^(α−1) e^(−β(σ²)^(−1)) (σ²)^(−2)
= [β^α/Γ(α)] (σ²)^(−(α+1)) e^(−β/σ²), σ² > 0.   (1.6)
Figure 1.15 shows the various densities and estimates here, as well as the
normalised likelihood function.
We find that:
• the prior mean of σ is
E(σ) = E(λ^(−1/2)) = ∫_0^∞ λ^(−1/2) [β^α λ^(α−1) e^(−βλ)/Γ(α)] dλ
= [β^α Γ(α − 1/2) / (Γ(α) β^(α−1/2))] ∫_0^∞ [β^(α−1/2) λ^((α−1/2)−1) e^(−βλ)/Γ(α − 1/2)] dλ
= β^(1/2) Γ(α − 1/2)/Γ(α) = 0.9400
• the prior mode of σ is Mode(σ) = √(2β/(2α + 1)) = 0.7559
(obtained by setting the derivative of the logarithm of (1.7) to zero, where that derivative is derived as follows:
l(σ) = log f(σ) = −(2α + 1) log σ − β/σ² + constant
l′(σ) = −(2α + 1)/σ + 2β/σ³, which set to 0 yields σ² = 2β/(2α + 1))
• the prior median of σ is Median(σ) = √Median(σ²) = 0.8648
• the MLE of σ is σ̂ = √(sμ²) = 1.4754 (which is biased).
By analogy with the above, f(σ | y) = [2b^a/Γ(a)] σ^(−(2a+1)) e^(−b/σ²), σ > 0.
So we find that:
• the posterior mean of σ is E(σ | y) = b^(1/2) Γ(a − 1/2)/Γ(a) = 1.1836
• the posterior mode is Mode(σ | y) = √(2b/(2a + 1)) = 1.0262
Figure 1.16 shows the various densities and estimates here, as well as the
normalised likelihood function.
Likewise,
Mode(σ² | y) = b/(a + 1) = [β + (n/2)sμ²] / [(α + n/2) + 1] = (2β + nsμ²)/(2α + n + 2)
= [n/(n + 2α + 2)] sμ² + [(2α + 2)/(n + 2α + 2)] × [2β/(2α + 2)],
where 2β/(2α + 2) = Mode(σ²), so that
Mode(σ² | y) = [n/(n + 2α + 2)] sμ² + [1 − n/(n + 2α + 2)] Mode(σ²).
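The quantities referenced in the code below (alp, bet, a, b, the data and the various point estimates) are defined in a part of the book not shown here; the following sketch uses assumed values consistent with the printed output, so that the code can be run:
alp = 3; bet = 2                               # assumed prior: lambda ~ G(3, 2), so prior mean 1.5, mode 1
y = c(8.4, 10.1, 9.4); n = length(y); mu = 8   # assumed data and known mean
sigmu2 = mean((y - mu)^2)                      # 2.1767, so the MLE of lambda is 0.4594
a = alp + n/2; b = bet + n*sigmu2/2            # posterior: (lambda | y) ~ G(a, b)
lampriormean = alp/bet; lampriormode = (alp - 1)/bet; lampriormedian = qgamma(0.5, alp, bet)
lamlikemode = 1/sigmu2                         # MLE of lambda
lampostmean = a/b; lampostmode = (a - 1)/b; lampostmedian = qgamma(0.5, a, b)
lamcpdr = qgamma(c(0.025, 0.975), a, b)        # 95% CPDR for lambda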
c(lampriormean,lamlikemode,lampriormode,lampriormedian,
lampostmode,lampostmedian, lampostmean,lamcpdr)
# 1.5000 0.4594 1.0000 1.3370 0.6648 0.7923 0.8547 0.2564 1.8065
lamv=seq(0,5,0.01); prior=dgamma(lamv,alp,bet)
post=dgamma(lamv,a,b); like=dgamma(lamv,a-alp+1,b-bet+0)
X11(w=8,h=4); par(mfrow=c(1,1))
plot(c(0,5),c(0,1.9),type="n",
main="Inference on the model precision parameter",
xlab="lambda",ylab="density/likelihood")
lines(lamv,prior,lty=1,lwd=2); lines(lamv,like,lty=2,lwd=2);
lines(lamv,post,lty=3,lwd=2)
points(c(lampriormean,lampriormode, lampriormedian,
lamlikemode,lampostmode,lampostmedian,lampostmean),
rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(lamcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,1.9,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(3,1.9,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(3,1,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
sig2v=seq(0.01,10,0.01); prior=dgamma(1/sig2v,alp,bet)/sig2v^2
post=dgamma(1/sig2v,a,b)/sig2v^2;
like=dgamma(1/sig2v,a-alp-1,b-bet+0)/sig2v^2
plot(c(0,10),c(0,1.2),type="n",
main="Inference on the model variance parameter",
xlab="sigma^2 = 1/lambda",ylab="density/likelihood")
lines(sig2v,prior,lty=1,lwd=2); lines(sig2v,like,lty=2,lwd=2)
lines(sig2v,post,lty=3,lwd=2)
legend(1.8,1.2,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(7,1.2,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(6,0.65,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
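Similarly, the remaining quantities used below are not shown in this excerpt; they follow from the λ-scale results by the monotone transformation σ² = 1/λ, and a consistent sketch is:
sig2priormedian = 1/lampriormedian   # median of sigma^2 = 1/median of lambda
sig2postmedian = 1/lampostmedian
sig2cpdr = rev(1/lamcpdr)            # 95% CPDR for sigma^2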
sigpriormean=sqrt(bet)*gamma(alp-1/2)/gamma(alp);
siglikemode=sqrt(sigmu2); sigpriormode=sqrt(2*bet/(2*alp+1))
sigpostmean= sqrt(b)*gamma(a-1/2)/gamma(a)
sigpostmode= sqrt(2*b/(2*a+1)); sigpostmedian=sqrt(sig2postmedian)
sigcpdr=sqrt(sig2cpdr); sigpriormedian= sqrt(sig2priormedian)
sigv=seq(0.01,3,0.01); prior=dgamma(1/sigv^2,alp,bet)*2/sigv^3
post=dgamma(1/sigv^2,a,b)*2/sigv^3;
like=dgamma(1/sigv^2,a-alp-1/2,b-bet+0)*2/sigv^3
plot(c(0,2.5),c(0,4.1),type="n",
main="Inference on the model standard deviation parameter",
xlab="sigma = 1/sqrt(lambda)",ylab="density/likelihood")
lines(sigv,prior,lty=1,lwd=2)
lines(sigv,like,lty=2,lwd=2)
lines(sigv,post,lty=3,lwd=2)
points(c(sigpriormean, sigpriormode, sigpriormedian, siglikemode,
sigpostmode, sigpostmedian,sigpostmean),
rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(sigcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,4.1,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(1.7,4.1,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(1.7,2.3,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
CHAPTER 2
Bayesian Basics Part 2
2.1 Frequentist characteristics of Bayesian
estimators
Consider a Bayesian model defined by a likelihood f(y | θ) and a prior f(θ), leading to the posterior
f(θ | y) = f(θ) f(y | θ) / f(y).
Then θ̂, I, L and U are functions of the data y and may be written θ̂(y), I(y), L(y) and U(y). Once these functions are defined, the estimates which they define stand on their own, so to speak, and may be studied from many different perspectives.
As we noted earlier, these estimates are exactly the same as the usual
estimates used in the context of the corresponding classical model,
Work out general formulae for the frequentist and relative bias of the posterior mean of μ, and for the frequentist coverage probability of the 1 − α HPDR for μ.
Recall that
(μ | y) ~ N(μ*, σ*²),
where:
μ* = (1 − k)μ₀ + kȳ is μ’s posterior mean
σ*² = kσ²/n is μ’s posterior variance
k = n/(n + σ²/σ₀²) is a credibility factor.
Figures 2.1, 2.2 and 2.3 (pages 66 and 67) show Bµ , Rµ and Cµ for
selected values of σ 0 , with n = 10 , µ0 = 1 , σ = 1 and α = 0.05 in each
case. The strength of the prior belief is represented by σ 0 , with large
values of this parameter indicating relative ignorance.
In Figure 2.1, we see that, for any given value of μ, the frequentist bias Bμ of the posterior mean μ* = E(μ | y) converges to zero as the prior belief tends to total ignorance, that is, in the limit as σ₀ → ∞. Also, Bμ → μ₀ − μ as σ₀ → 0, the extreme case of ‘absolute’ prior belief that μ = μ₀.
Note: One of the thin dotted guidelines in Figure 2.1 shows the function Bμ = μ₀ − μ in this latter extreme case of ‘absolute’ prior belief that μ = μ₀. In all of the examples, μ₀ = 1.
In Figure 2.2, we see that, for any given value of µ , the frequentist
relative bias Rµ of the posterior mean µ* = E ( µ | y ) converges to zero
as σ 0 → ∞ . Also, Rµ → ( µ0 / µ ) − 1 as σ 0 → 0 .
Note: The curved thin dotted guidelines in Figure 2.2 show the function Rμ = (μ₀/μ) − 1 in this latter extreme case of ‘absolute’ prior belief that μ = μ₀.
In Figure 2.3, we see that, for any given value of μ, the frequentist coverage probability Cμ of the 1 − α (i.e. 0.95 or 95%) HPDR, namely (μ* ± z_{α/2}σ*), converges to 1 − α as σ₀ → ∞.
Note: In Figure 2.3, the thin dotted horizontal guidelines show the
values 0, 0.95 and 1.
biasfun = function(mu,n,sig,mu0,sig0){
k = n/(n+(sig/sig0)^2)
(1-k)*mu0-mu*(1-k) }
coverfun = function(mu,n,sig,mu0,sig0,alp=0.05){
k = n/(n + (sig/sig0)^2)
sigstar = sig*sqrt(k/n); z=qnorm(1-alp/2)
a= ( mu-(1-k)*mu0-z*sigstar ) / k
b= ( mu-(1-k)*mu0+z*sigstar ) / k
u= pnorm((b-mu)/(sig/sqrt(n)))
l= pnorm((a-mu)/(sig/sqrt(n)))
u-l }
X11(w=8,h=5.5); par(mfrow=c(1,1))
muvec=seq(-5,5,0.01); mu0=1; sig=1; n=10; sig0v=c(0.1,0.2,0.5,1)
plot(c(-2,2),c(-1,3),type="n",xlab="mu",ylab="",main=" ")
abline(1,-1,lty=3); abline(v=0,lty=3); abline(h=0,lty=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),
lty=1,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[2]),
lty=2,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[3]),
lty=3,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[4]),
lty=4,lwd=3)
legend(1,2.8,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
plot(c(-2,2),c(-2,4),type="n",xlab="mu",ylab="",main=" ")
abline(v=0,lty=3); abline(h=0,lty=3); lines(muvec, mu0/muvec-1,lty=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[1])/muvec,
lty=1,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[2])/muvec,
lty=2,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[3])/muvec,
lty=3,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[4])/muvec,
lty=4,lwd=3)
legend(-2,4,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
plot(c(-1,3),c(0,1),type="n",xlab="mu",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),
lty=1,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[2]),
lty=2,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[3]),
lty=3,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[4]),
lty=4,lwd=3)
legend(-0.55,0.6,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
(a) Work out general formulae for the frequentist bias and relative bias of the posterior mean of σ² = 1/λ, and for the frequentist coverage probability of the 1 − α CPDR for σ².
(b) Attempt to find a single prior under this model (that is, a single suitable pair of values α, β) which results in both:
(i) a Bayesian posterior mean of σ² that is unbiased (in the frequentist sense) for all possible values of σ²; and
(ii) a CPDR for σ² that has frequentist coverage probabilities exactly equal to the desired coverage for all possible values of σ².
Thus, σ̂² = E(σ² | y) = [β + (n/2)sμ²] / [(α + n/2) − 1] = (2β + nsμ²)/(2α + n − 2).
Also, nsμ²/σ² = Σᵢ₌₁ⁿ [(yᵢ − μ)/σ]² ~ χ²(n) (with mean n).
= P( u − 2β/σ² < nsμ²/σ² < v − 2β/σ² | σ² )
= F_{χ²(n)}(v − 2β/σ²) − F_{χ²(n)}(u − 2β/σ²).
Figures 2.4, 2.5 and 2.6 (pages 72 and 73) show Bσ 2 , Rσ 2 and Cσ 2 for
selected values of α and β , with n = 10 and A = 0.05 in each case.
So an unbiased estimate of σ² is obtained by taking α = 1 and β = 0, in which case
σ̂² = (2β + nsμ²)/(2α + n − 2) = (0 + nsμ²)/(2 + n − 2) = sμ² (i.e. the MLE).
coverfun = function(sig2,n=10,alp=0,bet=0,A=0.05){
u = qchisq(A/2,2*alp+n); v = qchisq(1-A/2,2*alp+n)
pchisq(v-2*bet/sig2, n) - pchisq(u-2*bet/sig2, n) }
X11(w=8,h=5.5); par(mfrow=c(1,1))
sig2vec=seq(0.01,5,0.01); n=10; alpv=c(0.1,1,5); betv=c(0.1,1,5)
plot(c(0,5),c(-2,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=0,bet=0), lty=1,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=0,bet=1), lty=2,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=1,bet=0), lty=3,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=1,bet=1), lty=4,lwd=3)
legend(0,-0.5,c("alp=0, bet=0","alp=0, bet=1","alp=1, bet=0","alp=1, bet=1"),
lty=1:4,lwd=rep(3,4))
plot(c(0,3),c(-1,6),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3); abline(v=0,lty=3)
plot(c(0,2),c(0,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
where each f m ( x) is a proper density and the cm values are positive and
sum to 1.
It can be shown (see Exercise 2.3 below) that if each component prior
f m (θ ) is conjugate then f (θ ) is also conjugate. This means that θ ’s
posterior distribution is also a mixture with density of the form
f(θ | y) = Σ_{m=1}^{M} c′_m f_m(θ | y),   (2.1)
Also calculate the prior mean of θ, the posterior mean of θ and the MLE of θ. Then mark these three points in the figure.
(b) Show that any mixture of conjugate priors is also conjugate and
derive a general formula which could be used to calculate the mixture
weights cm′ in (2.1) above.
f(θ | y) ∝ f(θ) f(y | θ)
∝ k × [B(a₁ + y, b₁ + n − y)/B(a₁, b₁)] × θ^(a₁+y−1)(1 − θ)^(b₁+n−y−1)/B(a₁ + y, b₁ + n − y)
+ (1 − k) × [B(a₂ + y, b₂ + n − y)/B(a₂, b₂)] × θ^(a₂+y−1)(1 − θ)^(b₂+n−y−1)/B(a₂ + y, b₂ + n − y).
Thus
f(θ | y) ∝ c₁ f₁(θ | y) + c₂ f₂(θ | y),
where:
c₁ = k B(a₁ + y, b₁ + n − y)/B(a₁, b₁)
c₂ = (1 − k) B(a₂ + y, b₂ + n − y)/B(a₂, b₂).
Now,
∫ f(θ | y) dθ = 1,
and so
f(θ | y) = c f_Beta(a₁+y, b₁+n−y)(θ) + (1 − c) f_Beta(a₂+y, b₂+n−y)(θ),
where
c = c₁/(c₁ + c₂).
We see that the prior f(θ) and posterior f(θ | y) are in the same family, namely the family of mixtures of two beta distributions. Therefore the mixture prior is conjugate.
Figure 2.7 shows the prior density f(θ), the likelihood function L(θ), and the posterior density f(θ | y), as well as the prior mean, the MLE and the posterior mean.
Note: The likelihood function in Figure 2.7 has been normalised so that the area underneath it is exactly 1. This means that this likelihood function is identical to the posterior density under the standard uniform prior, i.e. under f(θ) = f_U(0,1)(θ) = f_Beta(1,1)(θ). Thus, L(θ) = f_Beta(1+y, 1+n−y)(θ).
Figure 2.7 also shows the two component prior densities and the two
component posterior densities. It may be observed that, whereas the
lower component prior has the highest weight, 0.8, the opposite is the
case regarding the component posteriors. For these, the weight
associated with the lower posterior is only 0.2583. This is because the
inference is being ‘pulled up’ in the direction of the likelihood (with the
posterior mean being between the prior mean and the MLE, 0.8).
It follows that
f(θ | y) = Σ_{m=1}^{M} c′_m f_m(θ | y),
c1=k*beta(a1+y,b1+n-y)/beta(a1,b1); c2=(1-k)*beta(a2+y,b2+n-y)/beta(a2,b2)
c=c1/(c1+c2); post=c*post1 + (1-c)*post2; options(digits=4); c # 0.2583
like=dbeta(thetav,1+y,1+n-y) # likelihood = post. under U(0,1)=beta(1,1) prior
X11(w=8,h=5.5)
plot(c(0,1),c(0,8),type="n",xlab="theta",ylab="density/likelihood")
lines(thetav,prior,lty=1,lwd=4)
lines(thetav,like,lty=2,lwd=4)
lines(thetav,post,lty=3,lwd=4)
legend(0,8,c("Prior","Likelihood","Posterior"),lty=c(1,2,3),lwd=c(4,4,4))
lines(thetav,prior1,lty=1,lwd=2)
lines(thetav,prior2,lty=1,lwd=2)
lines(thetav,post1,lty=3,lwd=2)
lines(thetav,post2,lty=3,lwd=2)
legend(0.3,8,c("Component priors","Component posteriors"),
lty=c(1,3),lwd=c(2,2))
mle=y/n; priormean=k*a1/(a1+b1)+(1-k)*a2/(a2+b2)
postmean=c*(a1+y)/(a1+b1+n) + (1-c)*(a2+y)/(a2+b2+n)
points(c(priormean,mle,postmean),c(0,0,0),pch=c(1,2,4),cex=c(1.5,1.5,1.5),
lwd=c(2,2,2))
c(priormean,mle,postmean) # 0.3068 0.8000 0.4772
legend(0.7,8,c(" Prior mean"," MLE"," Posterior mean"),
pch=c(1,2,4),pt.cex=c(1.5,1.5,1.5),pt.lwd=c(2,2,2))
Unlike for the normal-normal and normal-gamma models, more than one
uninformative prior specification has been proposed as reasonable in the
context of the binomial-beta model.
This reduces to the MLE y/n under the Haldane prior but not under the
Bayes prior. In contrast, the Bayes prior leads to a posterior mode which
is equal to the MLE.
No such problems occur using the Bayes prior. This is because that prior
is proper and so cannot lead to an improper posterior, whatever the data
may be. Interestingly, there is a third choice which provides a kind of
compromise between the Bayes and Haldane priors, as described below.
f(φ) ∝ √[ I(θ) (∂θ/∂φ)² ], where
I(θ) (∂θ/∂φ)² = E{ [∂/∂θ log f(y | θ)]² | θ } × (∂θ/∂φ)²
= E{ [ (∂/∂θ log f(y | θ)) × (∂θ/∂φ) ]² | θ }
= E{ [∂/∂φ log f(y | φ)]² | φ }
= I(φ).
Here: f(y | μ) ∝ Πᵢ₌₁ⁿ exp{ −(1/(2σ²))(yᵢ − μ)² } = exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² }
log f(y | μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² + c (where c is a constant)
∂/∂μ log f(y | μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ 2(yᵢ − μ)(−1) = (n/σ²)(ȳ − μ)
[∂/∂μ log f(y | μ)]² = (n²/σ⁴)(ȳ − μ)².
I(μ) = E{ [∂/∂μ log f(y | μ)]² | μ } = E{ (n²/σ⁴)(ȳ − μ)² | μ }
= (n²/σ⁴) V(ȳ | μ) = (n²/σ⁴)(σ²/n) = n/σ².
It follows that the Jeffreys prior is f(μ) ∝ √I(μ) = √(n/σ²) ∝ 1, μ ∈ ℝ.
Here: f(y | λ) ∝ Πᵢ₌₁ⁿ λ^(1/2) exp{ −(λ/2)(yᵢ − μ)² } = λ^(n/2) exp{ −(λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² }
log f(y | λ) = (n/2) log λ − (λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² + c (where c is a constant)
∂ log f(y | λ)/∂λ = n/(2λ) − (1/2) Σᵢ₌₁ⁿ (yᵢ − μ)²,   ∂² log f(y | λ)/∂λ² = −n/(2λ²).
So the Jeffreys prior is f(λ) ∝ √I(λ) = √(n/(2λ²)) ∝ 1/λ, λ > 0.
Note 2: Another way to obtain the Fisher information is to first write
∂ log f(y | λ)/∂λ = n/(2λ) − (1/(2λ)) λ Σᵢ₌₁ⁿ (yᵢ − μ)² = (1/(2λ))(n − q),
where: q = Σᵢ₌₁ⁿ [(yᵢ − μ)/(1/√λ)]², (q | λ) ~ χ²(n), E(q | λ) = n, V(q | λ) = 2n.
We may then write [∂ log f(y | λ)/∂λ]² = (1/(4λ²))(n² − 2nq + q²),
and so the Fisher information is
I(λ) = E{ [∂ log f(y | λ)/∂λ]² | λ }
= (1/(4λ²)){ n² − 2nE(q | λ) + E(q² | λ) }
= (1/(4λ²)){ n² − 2n·n + (2n + n²) } = n/(2λ²).
Here: f(y | θ) = C(n, y) θ^y (1 − θ)^(n−y)
log f(y | θ) = log C(n, y) + y log θ + (n − y) log(1 − θ)
∂/∂θ log f(y | θ) = 0 + yθ^(−1) − (n − y)(1 − θ)^(−1)
∂²/∂θ² log f(y | θ) = −yθ^(−2) − (n − y)(1 − θ)^(−2).
= n(1/θ) + n/(1 − θ) = n(1 − θ + θ)/[θ(1 − θ)] = n/[θ(1 − θ)].
So the Jeffreys prior is f(θ) ∝ √I(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), which is the kernel of the Beta(1/2, 1/2) density.
Here,
f(y | θ) = 1/θ = θ^(−1)
⇒ log f(y | θ) = −log θ
⇒ ∂/∂θ log f(y | θ) = −1/θ
⇒ [∂/∂θ log f(y | θ)]² = 1/θ²
⇒ I(θ) = E{ [∂/∂θ log f(y | θ)]² | θ } = 1/θ².
The loss function L represents the cost incurred when the true value θ is estimated by θ̂ and usually satisfies the property L(θ, θ) = 0.
The three most commonly used loss functions are defined as follows:
L(θ̂, θ) = |θ̂ − θ|, the absolute error loss function (AELF)
L(θ̂, θ) = (θ̂ − θ)², the quadratic error loss function (QELF)
L(θ̂, θ) = I(θ̂ ≠ θ), which equals 0 if θ̂ = θ and 1 if θ̂ ≠ θ, the indicator error loss function (IELF), also known as the zero-one loss function (ZOLF) or the all-or-nothing error loss function (ANLF).
Figures 2.8 and 2.9 illustrate these three basic loss functions.
To obtain the overall expected loss we need to average the risk function over all possible values of θ. This overall expected loss is called the Bayes risk and may be defined as
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | θ}] = E R(θ) = ∫ R(θ) f(θ) dθ.
For each of the following estimators, derive a formula for the risk function under the quadratic error loss function:
(a) μ̂ = ȳ = (1/n)(y₁ + ... + yₙ) (the sample mean)
(b) μ̂ = |ȳ| (the absolute value of the sample mean).
In each case, use the derived risk function to determine the Bayes risk.
R(μ) = E{ (|ȳ| − μ)² | μ } = E{ ȳ² − 2|ȳ|μ + μ² | μ }
= E(ȳ² | μ) − 2μ E(|ȳ| | μ) + μ²
= σ²/n + 2μ² − 2mμ, where m = E(|ȳ| | μ).
Now,
m = ∫_{−∞}^{0} (−ȳ) f(ȳ | μ) dȳ + ∫_{0}^{∞} ȳ f(ȳ | μ) dȳ
= 2∫_{0}^{∞} ȳ f(ȳ | μ) dȳ − ∫_{−∞}^{∞} ȳ f(ȳ | μ) dȳ
= 2I − μ, where I = ∫_{0}^{∞} ȳ f(ȳ | μ) dȳ.
Here,
I = ∫_{−μ/c}^{∞} (μ + cz) φ(z) dz after putting z = (ȳ − μ)/(σ/√n), with c = σ/√n
= μ[1 − Φ(−μ/c)] + cJ, where J = ∫_{−μ/c}^{∞} z φ(z) dz.
Note: Here, φ(z) = (1/√(2π)) e^(−z²/2) and Φ(z) = ∫_{−∞}^{z} φ(t) dt are the standard normal pdf and cdf, respectively.
Now, J = ∫_{−μ/c}^{∞} z (1/√(2π)) e^(−z²/2) dz
= ∫_{μ²/(2c²)}^{∞} (1/√(2π)) e^(−w) dw after substituting w = z²/2
= (1/√(2π)) e^(−μ²/(2c²)) = φ(μ/c).
Hence I = μΦ(μ/c) + cJ = μΦ(μ/c) + cφ(μ/c),
and so m = 2I − μ = 2μΦ(μ/c) + 2cφ(μ/c) − μ.
Therefore
R(μ) = σ²/n + 2μ² − 2mμ = σ²/n + 2μ² − 2μ[ 2μΦ(μ/c) + 2cφ(μ/c) − μ ].
Thereby we obtain:
R(μ) = σ²/n + 4μ²Φ(−μ/(σ/√n)) − 4μ(σ/√n)φ(μ/(σ/√n)), μ ∈ ℝ.
r = E R(μ) = ∫ R(μ) f(μ) dμ = ∫ g(μ) dμ,
where
g(μ) = [ σ²/n + 4μ²Φ(−μ/(σ/√n)) − 4μ(σ/√n)φ(μ/(σ/√n)) ] × (1/σ₀)φ((μ − μ₀)/σ₀).
We see that the Bayes risk r is an intractable integral equal to the area under the integrand, g(μ) = R(μ) f(μ). However, this area can be evaluated numerically (using techniques discussed later). Figures 2.11 and 2.12 show examples of the risk function R(μ) and the integrand function g(μ). For the case n = σ = μ₀ = σ₀ = 1, we find that r = 1.16.
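The risk function derived above can be coded directly and the Bayes risk evaluated by numerical integration. The following is a sketch of the function Rfun assumed by the plotting code below:
Rfun = function(mu, sig=1, n=1){ cee = sig/sqrt(n)
  sig^2/n + 4*mu^2*pnorm(-mu/cee) - 4*mu*cee*dnorm(mu/cee) }   # R(mu) as derived above
integrate(function(mu){ Rfun(mu)*dnorm(mu, mean=1, sd=1) }, -Inf, Inf)$value
# approx 1.16, the Bayes risk for n = sig = mu0 = sig0 = 1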
X11(w=8,h=5.5); par(mfrow=c(1,1));
plot(c(-0.5,4),c(0,3),type="n",xlab="mu",ylab="R(mu)",main=" ")
Ifun = function(mu,sig,n,mu0,sig0){
Rfun(mu=mu,sig=sig,n=n)*dnorm(mu,mu0,sig0) }
Then, just as the risk function can be used to compute the Bayes risk according to
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | θ}] = E R(θ) = ∫ R(θ) f(θ) dθ,
so also can the PEL be used, but with the formula
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | y}] = E{PEL(y)} = ∫ PEL(y) f(y) dy.
Note: Both of these formulae for the Bayes risk use the law of iterated
expectation, but with different conditionings.
For each of the following estimators, derive a formula for the posterior
expected loss under the quadratic error loss function:
(a) μ̂ = ȳ = (1/n)(y₁ + ... + yₙ) (the sample mean)
(b) μ̂ = |ȳ| (the absolute value of the sample mean).
In each case, use the derived PEL to obtain the Bayes risk.
(a) Here PEL(y) = E{(ȳ − μ)² | y} = (ȳ − μ*)² + σ*² = σ*² + (1 − k)²(ȳ − μ₀)².
Thus r = E{PEL(y)} = σ*² + (1 − k)²(σ₀² + σ²/n)
= kσ²/n + (1 − k)²(σ₀² + σ²/n)   (where k = n/(n + σ²/σ₀²))
= σ²/n (after a little algebra).
Note: This is in agreement with Exercise 2.8, where the result was obtained much more easily by taking the mean of the risk function, as follows:
r = E R(μ) = E(σ²/n) = σ²/n.
Some examples of this PEL function are shown in Figure 2.13. In all
these examples, n= σ= 1 .
PELfun=function(ybar,sig,n,sig0,mu0){
k=n/(n+sig^2/sig0^2)
mustar=(1-k)*mu0+k*ybar
sigstar2=k*sig^2/n
ybar^2-2*abs(ybar)*mustar+sigstar2 + mustar^2
}
ybarvec=seq(-10,10,0.01); options(digits=4)
X11(w=8,h=5.5); par(mfrow=c(1,1));
# Calculate r when n=1, sig=1, mu0=1, sig0=1 (should get 1.16 as before)
Jfun = function(ybar,sig,n,sig0,mu0){
PELfun(ybar=ybar,sig=sig,n=n,sig0=sig0,mu0=mu0)*
dnorm(ybar,mu0,sqrt(sig0^2+sig^2/n))
}
plot(ybarvec, PELfun(ybar=ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0)*
dnorm(ybarvec,mu0,sqrt(sig0^2+sig^2/n)),
type="l", xlab="ybar",ylab="PEL(ybar)*f(ybar)", lwd=3)
Find the Bayes estimate under the quadratic error loss function.
Note 1: This result can also be obtained using Leibniz’s rule for differentiating an integral, which is generally
(d/dx) ∫_a^b G(u, x) du = ∫_a^b ∂G(u, x)/∂x du + G(b, x) db/dx − G(a, x) da/dx,
and which reduces to ∫_a^b ∂G(u, x)/∂x du + 0 − 0 if a and b are constants.
Thus we may write
(∂/∂θ̂) PEL(y) = (∂/∂θ̂) ∫ (θ̂ − θ)² f(θ | y) dθ
= ∫ (∂/∂θ̂){ (θ̂ − θ)² f(θ | y) } dθ + 0 − 0
= ∫ 2(θ̂ − θ)(1) f(θ | y) dθ = 2{ θ̂ − ∫ θ f(θ | y) dθ }.
Setting this to zero yields θ̂ = ∫ θ f(θ | y) dθ = E(θ | y).
Note 2: To check that this minimises the PEL (rather than maximises it) we may further calculate
(∂²/∂θ̂²) PEL(y) = 2 (∂/∂θ̂){ θ̂ − ∫ θ f(θ | y) dθ } = 2{1 − 0} > 0.
Find the Bayesian estimate under the absolute error loss function.
Let t denote θ̂ = θ̂(y). Then
PEL(y) = ∫ |t − θ| f(θ | y) dθ
= ∫_{−∞}^{t} (t − θ) f(θ | y) dθ + ∫_{t}^{∞} (θ − t) f(θ | y) dθ.
Find the Bayes estimate under the indicator error loss function.
Let t denote θˆ = θˆ( y ) and first suppose that the parameter θ is discrete.
The indicator error loss function is L(t, θ) = I(t ≠ θ) = 1 − I(t = θ).
Therefore
PEL(y) = E{L(t, θ) | y} = E{1 − I(t = θ) | y} = 1 − E{I(t = θ) | y}
= 1 − P(θ = t | y)
= 1 − f(θ = t | y).
(a) Find the risk function, Bayes risk and posterior expected loss implied by the estimator λ̂ = 2ȳ under the quadratic error loss function.
R(λ) = E{ (2ȳ − λ)² | λ }
= E{ 4ȳ² − 4ȳλ + λ² | λ }
= 4E(ȳ² | λ) − 4λE(ȳ | λ) + λ²
= 4[ V(ȳ | λ) + {E(ȳ | λ)}² ] − 4λ² + λ²
= 4[ λ/n + λ² ] − 4λ² + λ²
= λ² + 4λ/n, λ > 0 (an increasing quadratic).
We see that
(λ | y) ~ Gamma(α + nȳ, β + n).
It follows that
PEL(y) = E{L(λ̂, λ) | y}
= E{ (2ȳ − λ)² | y }
= E{ 4ȳ² − 4ȳλ + λ² | y }
= 4ȳ² − 4ȳE(λ | y) + E(λ² | y)
= 4ȳ² − 4ȳ (α + nȳ)/(β + n) + [ (α + nȳ)/(β + n)² + {(α + nȳ)/(β + n)}² ].
Note: The Bayes risk could also be computed using an argument which begins as follows:
r = E{PEL(y)}
= E[ 4ȳ² − 4ȳ (α + nȳ)/(β + n) + (α + nȳ)/(β + n)² + {(α + nȳ)/(β + n)}² ],
where, for example,
Eȳ = E{E(ȳ | λ)} = E{E(y₁ | λ)} = Eλ = α/β.
(b) The Bayes estimate under the QELF is the posterior mean,
E(λ | y) = (α + nȳ)/(β + n).
This estimator has the smallest Bayes risk amongst all possible
estimators, including the one in (a), which is different. So E ( | y ) must
have a smaller Bayes risk than the estimator in (a).
Discussion
E .
n
(a) Find the risk function and Bayes risk for the estimator θ̂ = y.
(b) Find the Bayes estimate and sketch it as a function of the data y.
In summary, R(θ) = 1 for θ ≤ 0, and R(θ) = 1.5 − Φ(θ) for θ > 0, as shown in Figure 2.16.
Now
L(t, θ) = 1 − I(θ < t < 2θ),
and so
PEL(y) = E{1 − I(θ < t < 2θ) | y}
= 1 − P(θ < t < 2θ | y).
Also, if t > 0 then
PEL(t) = 1 − E{I(θ < t < 2θ) | y}
= 1 − P(θ < t < 2θ | y)
= 1 − P(t/2 < θ < t | y)
= 1 − ∆(t),
where
∆(t) = F(t | y) − F(t/2 | y)
is to be maximised.
Setting the derivative of ∆(t) to zero, i.e. f(t | y) = (1/2) f(t/2 | y), yields
exp{ −(t − y/2)²/(2 × ½) } = (1/2) exp{ −(t/2 − y/2)²/(2 × ½) }
⇒ (t²/4 − ty/2 + y²/4) − (t² − ty + y²/4) = −log 2
⇒ (3/4)t² − (1/2)ty − log 2 = 0
⇒ t = [ y/2 + √(y²/4 + 4(3/4) log 2) ] / [2(3/4)] = [ y + √(y² + 12 log 2) ] / 3.
X11(w=8,h=5.5)
(1/3)*(c(-1,0,1)+sqrt(c(-1,0,1)^2 + 12*log(2)))
# 0.6841672 0.9613513 1.3508339
CHAPTER 3
Bayesian Basics Part 3
3.1 Inference given functions of the data
Sometimes we observe a function of the data rather than the data itself.
In such cases the function typically degrades the information available
in some way. An example is censoring, where we observe a value only if
that value is less than some cut-off point (right censoring) or greater than
some cut-off value (left censoring). It is also possible to have censoring
on the left and right simultaneously. Another example is rounding,
where we only observe values to the nearest multiple of 0.1, 1 or 5, etc.
Find the posterior distribution and mean of the average light bulb
lifetime, m.
P(yᵢ > 6 | c) = ∫_6^∞ c e^(−c yᵢ) dyᵢ = e^(−6c).
6
The estimate 6.667 is also higher than the estimate obtained by simply replacing the censored values with 6, namely
(1/5)(2.6 + 3.2 + 6 + 1.2 + 6) = 3.8.
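A sketch of this calculation in R. The prior for c is specified in a part of the exercise not shown above; the code below assumes a standard exponential prior, c ~ G(1, 1), which reproduces the stated posterior mean of 6.667.
obs = c(2.6, 3.2, 1.2)             # uncensored lifetimes
ncens = 2; cutoff = 6              # two lifetimes right-censored at 6, each contributing exp(-6c)
a = 1 + length(obs)                # posterior shape under the assumed G(1,1) prior for c
b = 1 + sum(obs) + ncens*cutoff    # posterior rate: (c | data) ~ G(4, 20)
b/(a - 1)                          # posterior mean of m = 1/c: 20/3 = 6.667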
Suppose that:
( y | θ ) ~ U (0, θ )
θ ~ U (0, 2) ,
where the data is
x = g ( y ) = the value of y rounded to the nearest integer.
Observe that:
x = 0 if 0 < y < 1/2
x = 1 if 1/2 < y < 3/2
x = 2 if 3/2 < y < 2.
P(x = 0 | θ) = P(0 < y < 1/2 | θ) = 1 if θ < 1/2, and = 1/(2θ) if θ > 1/2
P(x = 1 | θ) = P(1/2 < y < 3/2 | θ) = 0 if 0 < θ < 1/2, = (θ − 1/2)/θ if 1/2 < θ < 3/2, and = 1/θ if 3/2 < θ < 2
P(x = 2 | θ) = P(3/2 < y < 2 | θ) = 0 if 0 < θ < 3/2, and = (θ − 3/2)/θ if 3/2 < θ < 2.
In particular, B = ∫_{1/2}^{3/2} [(θ − 1/2)/θ] dθ + ∫_{3/2}^{2} (1/θ) dθ
= 3/2 − (1/2)log(3/2) − 1/2 + (1/2)log(1/2) + log 2 − log(3/2)
= 0.7383759.
E₁ = E(θ | x = 1) = ∫_{1/2}^{3/2} θ [(θ − 1/2)/(Bθ)] dθ + ∫_{3/2}^{2} θ [1/(Bθ)] dθ
= 1/B = 1.354 (after some working).
Discussion
For completeness and checking we now also calculate the other two
posterior means:
E₀ = E(θ | x = 0) = 7/(8A) = 0.7334
E₂ = E(θ | x = 2) = 1/(8C) = 1.8254,
legend(0,2,c("f(theta|x=1)","f(theta|y=0.6)","f(theta|y=1)","f(theta|y=1.1)",
"f(theta|y=1.4)"), lty=c(1,2,3,4,5), lwd=c(3,3,3,3,3))
C=2-1.5*log(2)-1.5+1.5*log(1.5)
A=0.5+0.5*log(2)-0.5*log(0.5)
options(digits=7); c(A,B,C) # 1.19314718 0.73837593 0.06847689
E0=7/(8*A); E1=1/B; E2=1/(8*C); c(E0,E1,E2)
# 0.7333546 1.3543237 1.8254333
P0=1/4+(1/4)*(log(2)-log(1/2))
P1=0.5*(1.5-0.5*log(1.5)-0.5+0.5*log(0.5)) +0.5*(log(2)-log(1.5))
P2=0.5*(2-1.5*log(2)-1.5+1.5*log(1.5))
P0+P1+P2 # 1 Correct
c(P0,P1,P2) # 0.59657359 0.36918796 0.03423845
E0*P0 + E1*P1 + E2*P2 # 1 Correct
postvecA=thetavec; postvecC=thetavec;
for(i in 1:length(thetavec)){ postvecA[i]=postfunA(theta=thetavec[i])
postvecC[i]=postfunC(theta=thetavec[i]) }
plot(c(0,2),c(0,3.7),type="n",xlab="theta",ylab="density", main=" ")
lines(thetavec, postvecA,lty=2,lwd=3)
lines(thetavec, postvecB,lty=1,lwd=3)
lines(thetavec, postvecC,lty=3,lwd=3)
for(y in seq(0.1,1.9,0.1)){ k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/ thetavec[thetavec>y], lty=1,lwd=1) }
legend(0.7,3.6,c("f(theta|x=0)","f(theta|x=1)","f(theta|x=2)","f(theta|y)"),
lty=c(2,1,3,1), lwd=c(3,3,3,1))
Also, the ‘P’ in HPDR and CPDR may be read as predictive rather than as posterior. For example, the CPDR for x is now the central predictive density region for x.
Note: This follows from the basic law of iterated variance (LIV), Vx = E{V(x | θ)} + V{E(x | θ)}, after conditioning throughout on y.
Note: The last equation indicates that the pdf of ( x | y, θ ) is the same as
the pdf of ( y | θ ) but with y changed to x in the density formula.
Then, for y = 2.0, find the predictive mean, mode and median of x, and
also the 80% central predictive density region and 80% highest
predictive density region for x.
f(x | y) = 2(y + 1)²/(x + y + 1)³, x > 0.
Check: ∫_0^∞ f(x | y) dx = ∫_0^∞ [2(y + 1)²/(x + y + 1)³] dx
= 2(y + 1)² ∫_{y+1}^{∞} u^(−3) du (where u = x + y + 1)
= 2(y + 1)² × [−u^(−2)/2]_{y+1}^{∞} = 2(y + 1)² × 1/[2(y + 1)²] = 1 (correct).
E(x | y) = ∫_0^∞ x × 18(x + 3)^(−3) dx = 3.
Also,
F(x | y) = ∫_0^x 18(t + 3)^(−3) dt
= ∫_3^{3+x} 18u^(−3) du, where u = 3 + t
= 18 [−u^(−2)/2]_3^{3+x} = 9 [1/3² − 1/(3 + x)²] = 1 − 9/(3 + x)².
Setting this to p and solving for x yields the predictive quantile function,
Q(p) = F^(−1)(p | y) = 3[ 1/√(1 − p) − 1 ].
So the predictive median is Q(1/2) = 3[ 1/√(1 − 1/2) − 1 ] = 1.2426.
The predictive quantile function can now also be used to calculate the
80% CPDR for x,
(Q (0.1), Q (0.9) ) = (0.1623, 6.4868),
and the 80% HPDR for x,
( 0, Q (0.8) ) = (0, 3.7082).
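A sketch of these calculations in R, using the quantile function just derived (for y = 2):
Qfun = function(p){ 3*(1/sqrt(1 - p) - 1) }   # predictive quantile function for y = 2
Qfun(0.5)                                     # predictive median: 1.2426
Qfun(c(0.1, 0.9))                             # 80% CPDR: (0.1623, 6.4868)
c(0, Qfun(0.8))                               # 80% HPDR: (0, 3.7082)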
Another way to calculate the predictive median of x is as the solution in
q of
1/2 = P(x < q | y),
after noting that the right hand side of this equation also equals
E{P(x < q | y, θ) | y} = E(1 − e^(−θq) | y)
= 1 − m(−q),
where m(t) is the posterior moment generating function (mgf) of θ.
But (θ | y) ~ Gamma(2, y + 1), and so m(t) = (1 − t/(y + 1))^(−2).
You are visiting a small town with buses whose license plates show their
numbers consecutively from 1 up to however many there are. In your
mind the number of buses could be anything from 1 to 5, with all
possibilities equally likely. Whilst touring the town you first happen to
see Bus 3.
Assuming that at any point in time you are equally likely to see any of
the buses in the town, how likely is it that the next bus number you see
will be at least 4?
Also, what is the expected value of the bus number that you will next
see?
As in Exercise 1.6, let θ be the number of buses in the town and let y be
the number of the bus you happen to first see. Recall that a suitable
Bayesian model is:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5 (prior),
and that the posterior density of θ works out as
f(θ | y) = 20/47 for θ = 3, 15/47 for θ = 4, and 12/47 for θ = 5.
Now let x be the number on the next bus that you happen to see in the
town. Then
f(x | y, θ) = 1/θ,  x = 1, …, θ  (same distribution as that of (y | θ)).
This may also be written
f(x | y, θ) = I(x ≤ θ)/θ,  x = 1, 2, 3, …,
and so the posterior predictive density of x is
f(x | y) = Σ_θ f(x, θ | y) = Σ_θ f(x | y, θ) f(θ | y) = Σ_{θ=y}^{5} I(x ≤ θ) f(θ | y)/θ.

Check:  Σ_{x=1}^{5} f(x | y) = 0.27270 × 3 + 0.13085 + 0.05106
= 1 (correct).
121
Bayesian Methods for Statistical Analysis
In summary, for y = 3, we have that f(x | y) = 0.27270 for x = 1, 2, 3;
0.13085 for x = 4; and 0.05106 for x = 5.
So the probability that the next bus you see will have a number on it
which is at least 4 equals
P(x ≥ 4 | y) = Σ_{x: x ≥ 4} f(x | y) = f(x = 4 | y) + f(x = 5 | y) = 0.13085 + 0.05106 = 0.18191.
Alternatively,
E(x | y) = E{ E(x | y, θ) | y } = E{ (1+θ)/2 | y } = [ 1 + E(θ | y) ] / 2
= [ 1 + 3(20/47) + 4(15/47) + 5(12/47) ] / 2 = [ 1 + 180/47 ] / 2 = 227/94 = 2.4149.
(a) For the Bayesian model given by (Y | θ) ~ Bin(n, θ) and the prior
θ ~ Beta(α, β), find the posterior predictive density of a future data
value x, whose distribution is defined by (x | y, θ) ~ Bin(m, θ).
(b) A bent coin is tossed 20 times and 6 heads come up. Assuming a flat
prior on the probability of heads on a single toss, what is the probability
that exactly one head will come up on the next two tosses of the same
coin? Answer this using results in (a).
122
Chapter 3: Bayesian Basics Part 3
(c) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.
Find the expected number of times you will have to toss the same coin
again repeatedly until the next head comes up.
(d) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.
Now consider tossing the coin repeatedly until the next head, writing
down the number of tosses, and then doing all of this again repeatedly,
again and again.
Next define ψ to be the average of a very long sequence like this (e.g.
one of length 1,000,000). Find the posterior predictive density and mean
of ψ (approximately).
Note: In parts (c) and (d) the parameters of the beta distribution (both
20.3) represent a prior belief that the probability of heads is about 1/2, is
equally likely to be on either side of 1/2, and is 80% likely to be between
0.4 and 0.6. See the R Code below for details.
(a) First note that x is not a future independent replicate of the observed
data y, except in the special case where m = n.
f(x | y) = ∫ f(x | y, θ) f(θ | y) dθ
= ∫_0^1 C(m, x) θ^x (1−θ)^(m−x) × [ θ^(a−1)(1−θ)^(b−1) / B(a, b) ] dθ
= C(m, x) B(a + x, b + m − x) / B(a, b),   x = 0, 1, …, m,
where C(m, x) is the binomial coefficient, a = α + y and b = β + n − y.
123
Bayesian Methods for Statistical Analysis
124
Chapter 3: Bayesian Basics Part 3
It follows that
P(x = 1 | y) = E{ P(x = 1 | y, θ) | y }
= E{ 2θ(1−θ) | y }
= 2{ E(θ | y) − E(θ² | y) }
= 2{ E(θ | y) − [ V(θ | y) + (E(θ | y))² ] }
= 2{ 7/22 − [ 0.009432 + (7/22)² ] } = 0.415.
(c) Let z be the number of tosses until the next head. Then
( z | y, θ ) ~ Geometric(θ )
with pdf
f(z | y, θ) = (1−θ)^(z−1) θ,   z = 1, 2, 3, ….
f(z | y) = ∫ f(z, θ | y) dθ = ∫ f(z | y, θ) f(θ | y) dθ.
E(z | y) = E{ E(z | y, θ) | y } = E{ 1/θ | y } = ∫_0^1 (1/θ) × θ^(a−1)(1−θ)^(b−1) / B(a, b) dθ
= [ B(a−1, b) / B(a, b) ] ∫_0^1 θ^((a−1)−1)(1−θ)^(b−1) / B(a−1, b) dθ
= B(a−1, b)/B(a, b) = (a + b − 1)/(a − 1) = 2.356.
125
Bayesian Methods for Statistical Analysis
x=0:2
( 2*factorial(6+x)*factorial(16-x)/factorial(23) )/
( factorial(x)*factorial(2-x) * factorial(6)*factorial(14)/factorial(21) )
# 0.4743 0.4150 0.1107
7*15/(22^2*23) # 0.009432
2 * (7/22 - ( 0.009432267 + (7/22)^2 ) ) # 0.415
(20.3+20.3+20-1)/(20.3+6-1) # 2.356
126
Chapter 3: Bayesian Basics Part 3
σ 02
* .
n
f(x | y) = ∫ f(x | y, μ) f(μ | y) dμ
∝ ∫ exp{ −(x − μ)² / (2σ²/m) } exp{ −(μ − μ*)² / (2σ*²) } dμ.
127
Bayesian Methods for Statistical Analysis
δ² = V(x | y)
= E{ V(x | y, μ) | y } + V{ E(x | y, μ) | y }
= E{ σ²/m | y } + V{ μ | y } = σ²/m + σ*².
Now, (x | y, λ) ~ N( μ, 1/(mλ) ), and therefore
128
Chapter 3: Bayesian Basics Part 3
f(x | y) = ∫ f(x | y, λ) f(λ | y) dλ
∝ ∫_0^∞ √(mλ) exp{ −(mλ/2)(x − μ)² } × λ^(a−1) exp{ −bλ } dλ
∝ ∫_0^∞ λ^((a + 1/2) − 1) exp{ −λ [ b + (m/2)(x − μ)² ] } dλ
∝ [ b + (m/2)(x − μ)² ]^(−(a + 1/2))
∝ [ 1 + (1/(2a)) × m·2a(x − μ)²/(2b) ]^(−(2a+1)/2).
Now let Q = (x − μ) / ( √(b/a) / √m )  (so that Q² = m·2a(x − μ)²/(2b)),
so that x = μ + Q √(b/a)/√m.
129
Bayesian Methods for Statistical Analysis
Note 2: The discrepancy measure may or may not depend on the model
parameter, θ . Thus in some cases, T ( y , θ ) may also be written as T ( y ) .
130
Chapter 3: Bayesian Basics Part 3
Note: This is just the probability that a Poisson(1) random variable will
take on a value greater than 2, and so is the same as the classical
p-value which would be used in this situation.
131
Bayesian Methods for Statistical Analysis
Thus:  P(λ = 1 | y, H₀) = e^(−2×1) 1³ / ( e^(−2×1) 1³ + e^(−2×2) 2³ ) = 0.48015
P(λ = 2 | y, H₀) = 1 − 0.48015 = 0.51985.
So a suitable ppp-value is
p = P(x ≥ y | y, H₀) = E{ P(x ≥ y | y, H₀, λ) | y, H₀ }
= E{ 1 − F_Poi(λ)(y − 1) | y, H₀ }
= 0.48015 × (1 − F_Poi(1)(2)) + 0.51985 × (1 − F_Poi(2)(2))
= 0.48015 [ 1 − ( e^(−1)1⁰/0! + e^(−1)1¹/1! + e^(−1)1²/2! ) ]
 + 0.51985 [ 1 − ( e^(−2)2⁰/0! + e^(−2)2¹/1! + e^(−2)2²/2! ) ]
= 0.20664.
132
Chapter 3: Bayesian Basics Part 3
Derive a formula for the ppp-value under each of the following three
choices of the test statistic:
(a) T(y, λ) = ȳ,  (b) T(y, λ) = (ȳ − μ)/(σ/√n),  (c) T(y, λ) = (ȳ − μ)/(s_y/√n),
where:  ȳ = (1/n) Σ_{i=1}^n yᵢ (the sample mean)
s_y² = (1/(n−1)) Σ_{i=1}^n (yᵢ − ȳ)² (the sample variance).
For each of these choices of test statistic, report the ppp-value for the
case where µ = 2 and y = (2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2).
(a) If T(y, λ) = ȳ then the ppp-value is p = P(x̄ ≥ ȳ | y).
Then, by Exercise 3.7, (x̄ − μ)/(s_yμ/√n) | y ~ t(n), where s_yμ² = (1/n) Σ_{i=1}^n (yᵢ − μ)².
Here: μ = 2, n = 10, ȳ = (1/n) Σ yᵢ = 4.370, s_yμ = √[ (1/n) Σ (yᵢ − μ)² ] = 2.978.
Therefore (ȳ − μ)/(s_yμ/√n) = 2.51658, and so p = 1 − F_t(10)(2.51658) = 0.01528.
(b) If T(y, λ) = (ȳ − μ)/(σ/√n) then the ppp-value is
p = P( (x̄ − μ)/(σ/√n) > (ȳ − μ)/(σ/√n) | y ) = P(x̄ > ȳ | y).
133
Bayesian Methods for Statistical Analysis
(c) If T(y, λ) = (ȳ − μ)/(s_y/√n) then the ppp-value is
p = P( (x̄ − μ)/(s_x/√n) > (ȳ − μ)/(s_y/√n) | y ),  where s_x² = (1/(n−1)) Σ_{i=1}^n (xᵢ − x̄)²
= E{ P( (x̄ − μ)/(s_x/√n) > (ȳ − μ)/(s_y/√n) | y, λ ) | y }
   by the law of iterated expectation
= E{ 1 − F_t(n−1)( (ȳ − μ)/(s_y/√n) ) | y }   since ( (x̄ − μ)/(s_x/√n) | y, λ ) ~ t(n−1)
= 1 − F_t(n−1)( (ȳ − μ)/(s_y/√n) ).
We see that the ppp-value derived is exactly the same as the classical
p-value which would be used in this setting. Numerically, we have that:
s_y = √[ (1/(n−1)) Σᵢ (yᵢ − ȳ)² ] = 1.901 and (ȳ − μ)/(s_y/√n) = 3.942645,
so that p = 1 − F_t(9)(3.942645) = 0.001696.
Note: A fourth test statistic which makes sense in the present context is
T(y, λ) = (ȳ − μ)/(s_yμ/√n), where s_yμ² = (1/n) Σᵢ (yᵢ − μ)² (as before).
134
Chapter 3: Bayesian Basics Part 3
options(digits=4); mu=2; y = c(2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2);
n=length(y); ybar=mean(y); s=sd(y); smu=sqrt(mean((y-mu)^2))
c(ybar,s,smu) # 4.370 1.901 2.978
arga=(ybar-mu)/(smu/sqrt(n)); pppa=1-pt(arga,n); c(arga,pppa)
# 2.51658 0.01528
argc=(ybar-mu)/(s/sqrt(n)); pppc=1-pt(argc,n-1); c(argc,pppc)
# 3.942645 0.001696
The first task now is to find the joint posterior density of θ1 and θ 2 ,
according to
f (θ | y ) ∝ f (θ ) f ( y | θ ) ,
or equivalently
f (θ1 , θ 2 | y ) ∝ f (θ1 , θ 2 ) f ( y | θ1 , θ 2 ) ,
where
f (θ ) = f (θ1 , θ 2 )
is the joint prior density of the two parameters.
Once a Bayesian model with two parameters has been defined, one task
is to find the marginal posterior densities of θ1 and θ 2 , respectively, via
the equations:
f (θ1 | y ) = ∫ f (θ1 , θ 2 | y )dθ 2
f (θ 2 | y ) = ∫ f (θ1 , θ 2 | y )dθ1 .
135
Bayesian Methods for Statistical Analysis
From these two marginal posteriors, one may obtain point and interval
estimates of θ1 and θ 2 in the usual way (treating each parameter
separately). For example, the marginal posterior mean of θ1 is
θ̂₁ = E(θ₁ | y) = ∫ θ₁ f(θ₁ | y) dθ₁.
The main idea of Equation (3.1) is to examine the joint posterior density
f (θ1 , θ 2 | y )
(or any kernel thereof), think of all terms in this as constant except for
θ1 , and then try to recognise a well-known density function of θ1 .
136
Chapter 3: Bayesian Basics Part 3
This posterior density may then be used to calculate point and interval
estimates of ψ . For example, the posterior mean of ψ is
ψ̂ = E(ψ | y) = ∫ ψ f(ψ | y) dψ.
Alternatively, this mean may be obtained using the equation
ψ̂ = E( g(θ₁, θ₂) | y ) = ∫∫ g(θ₁, θ₂) f(θ₁, θ₂ | y) dθ₁ dθ₂.
137
Bayesian Methods for Statistical Analysis
Under this model, the joint posterior density of the two parameters n and
θ is
f(n, θ | y) ∝ f(n, θ) f(y | n, θ)
= f(n) f(θ | n) f(y | n, θ)
∝ (1/k) × 1 × C(n, y) θ^y (1−θ)^(n−y)
∝ C(n, y) θ^y (1−θ)^(n−y),   0 < θ < 1,  n ≥ y,  y = 1, …, 9.
Hence f(n | y) ∝ ∫_0^1 C(n, y) θ^y (1−θ)^(n−y) dθ,   n = y, y+1, …, 9  (since y = 0, …, n)
= C(n, y) B(y+1, n−y+1) ∫_0^1 [ θ^((y+1)−1)(1−θ)^((n−y+1)−1) / B(y+1, n−y+1) ] dθ,   n = 5, 6, 7, 8, 9
= C(n, y) Γ(y+1)Γ(n−y+1)/Γ(y+1+n−y+1)   (since the integral equals 1)
= [ n! / (y!(n−y)!) ] × [ y!(n−y)! / (n+1)! ]
= 1/(n+1)
= 1/6 for n = 5, 1/7 for n = 6, 1/8 for n = 7, 1/9 for n = 8, and 1/10 for n = 9.
138
Chapter 3: Bayesian Basics Part 3
After normalising (i.e. dividing each of these five numbers by their sum,
0.6456), we find that, to four decimals, n’s posterior pdf is
f(n | y) = 0.2581 for n = 5, 0.2213 for n = 6, 0.1936 for n = 7,
0.1721 for n = 8, and 0.1549 for n = 9.
Thus, for example, there is a 17.2% chance a posteriori that the coin was
tossed 8 times.
The marginal posterior density of θ is then
f(θ | y) = 0.2581 × θ⁵(1−θ)^(5−5) / [ 5!(5−5)!/(5+1)! ] + … + 0.1549 × θ⁵(1−θ)^(9−5) / [ 5!(9−5)!/(9+1)! ].
139
Bayesian Methods for Statistical Analysis
Figures 3.3 and 3.4 (page 141) show the marginal posterior densities of n
and θ, respectively, with the posterior means n̂ = 6.744 and θ̂ = 0.7040
marked by vertical lines.
140
Chapter 3: Bayesian Basics Part 3
141
Bayesian Methods for Statistical Analysis
X11(w=8,h=4); par(mfrow=c(1,1))
plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)",ylim=c(0,0.4))
points(nvec,fny,pch=16,cex=1); abline(v=nhat)
plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y) ",ylim=c(0,2.5))
lines(thvec,fthyvec,lwd=3); abline(v=thhat)
(c) Find the posterior mean of the signal to noise ratio, defined as
γ = μ/σ = μ√λ.
Note: Both μ and λ are assigned uninformative priors. The joint prior
distribution of these two parameters could also be specified by:
f(μ | λ) ∝ 1,  μ ∈ ℝ,
f(λ) ∝ 1/λ,  λ > 0,
or by the single statement
f(μ, λ) ∝ 1/λ,  μ ∈ ℝ,  λ > 0.
142
Chapter 3: Bayesian Basics Part 3
Σ_{i=1}^n (yᵢ − μ)² = Σ_{i=1}^n [ (yᵢ − ȳ) + (ȳ − μ) ]²
= Σ_{i=1}^n (yᵢ − ȳ)² + 2(ȳ − μ) Σ_{i=1}^n (yᵢ − ȳ) + n(ȳ − μ)²
= (n−1) × (1/(n−1)) Σ_{i=1}^n (yᵢ − ȳ)² + 2(ȳ − μ)(nȳ − nȳ) + n(ȳ − μ)²
= (n−1)s² + n(ȳ − μ)²,   where s² is the sample variance.
143
Bayesian Methods for Statistical Analysis
f(μ | y) ∝ [ 1 + n(μ − ȳ)² / ((n−1)s²) ]^(−{(n−1)+1}/2)
= [ 1 + (1/(n−1)) ( (μ − ȳ)/(s/√n) )² ]^(−{(n−1)+1}/2).
We now define r = (μ − ȳ)/(s/√n), so that μ = ȳ + r s/√n and dμ/dr = s/√n.

Note 1: In result (3.2), the data vector y appears only by way of the
sample mean ȳ and sample standard deviation s. So it is also true that
(μ − ȳ)/(s/√n) | ȳ, s ~ t(n−1).
Here, s may not be left out of the conditioning. So it is not true that
(μ − ȳ)/(s/√n) | ȳ ~ t(n−1).
Note 2: Result (3.2) implies that the marginal posterior mean, mode and
median of μ are all equal to ȳ, and the 1 − α CPDR/HPDR for μ is
144
Chapter 3: Bayesian Basics Part 3
( ȳ ± t_{α/2}(n−1) × s/√n ).
This inference is identical to that obtained via the classical approach and
thereby justifies the use of the joint prior
f(μ, λ) ∝ 1/λ,  μ ∈ ℝ,  λ > 0
in cases of a priori ignorance regarding both μ and λ.
Thus f(μ | y) = [ Γ({(n−1)+1}/2) / ( √((n−1)π) Γ((n−1)/2) ) ]
× [ 1 + (1/(n−1)) ( (μ − ȳ)/(s/√n) )² ]^(−{(n−1)+1}/2) × (√n/s),   μ ∈ ℝ.
145
Bayesian Methods for Statistical Analysis
It follows that
(λ | y) ~ Gamma( (n−1)/2, ((n−1)/2) s² ),
(3.3)
Thus (u | y) ~ Gamma( (n−1)/2, 1/2 ) ~ χ²(n−1), which confirms (3.4).
Note 2: Results (3.3) and (3.4) imply that λ has posterior mean 1/ s 2 .
This makes sense because λ = 1/ σ 2 , and s 2 is an unbiased estimator of
σ 2 . We see that the inverse of the posterior mean of λ provides us with
the classical estimator of σ 2 .
146
Chapter 3: Bayesian Basics Part 3
( (n−1)s² / χ²_{α/2}(n−1),  (n−1)s² / χ²_{1−α/2}(n−1) ).
It will be observed that this is exactly the same as the usual classical
1 − α CI for σ 2 when the normal mean µ is unknown.
μ̂ = E(μ | y) = ∫_0^∞ ∫ μ f(μ, λ | y) dμ dλ,
where:  f(μ, λ | y) = k(μ, λ)/c
k(μ, λ) = λ^(n/2 − 1) exp{ −(λ/2) Σ_{i=1}^n (yᵢ − μ)² }
c = ∫_0^∞ ∫ k(μ, λ) dμ dλ.
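This double integral can be evaluated numerically. The following is a minimal R sketch, not the book's own code, and the data vector used is purely hypothetical for illustration:

# Hypothetical illustration: posterior mean of mu by numerical integration of the kernel.
y <- c(2.1, 3.2, 5.2, 1.7); n <- length(y)                   # illustrative data only
kfun  <- function(mu, lam) lam^(n/2 - 1) * exp(-(lam/2) * sum((y - mu)^2))
kmarg <- function(mu) sapply(mu, function(m)
           integrate(function(l) kfun(m, l), 0, Inf)$value)  # integrate lambda out
cc    <- integrate(kmarg, -20, 20)$value                     # normalising constant c
muhat <- integrate(function(m) m * kmarg(m), -20, 20)$value / cc
muhat                                                        # close to mean(y) = 3.05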
147
Bayesian Methods for Statistical Analysis
where
c_n = Γ( (n−1)/2 + 1/2 ) / [ Γ( (n−1)/2 ) ( (n−1)/2 )^(1/2) ].
Hence f(x | y) = ∫ f(x | y, λ) f(λ | y) dλ
∝ ∫_0^∞ λ^(1/2) exp{ −nmλ(x − ȳ)² / (2(n+m)) } × λ^((n−1)/2 − 1) exp{ −λ(n−1)s²/2 } dλ
= ∫_0^∞ λ^(n/2 − 1) exp{ −λ [ nm(x − ȳ)²/(2(n+m)) + (n−1)s²/2 ] } dλ
∝ [ nm(x − ȳ)²/(2(n+m)) + (n−1)s²/2 ]^(−n/2)
148
Chapter 3: Bayesian Basics Part 3
∝ [ 1 + nm(x − ȳ)² / ((n−1)(n+m)s²) ]^(−n/2)
= [ 1 + (1/(n−1)) ( (x − ȳ) / ( (s/√n) √((n+m)/m) ) )² ]^(−n/2).
It follows that
(x − ȳ) / ( (s/√n) √((n+m)/m) ) | y ~ t(n−1).   (3.5)
Consequently,
x = ( (n+m)a − nȳ ) / m.
149
Bayesian Methods for Statistical Analysis
This may look familiar to some readers, the reason being as follows.
It will be noted that this inference is exactly the same as implied by the
standard approach in the classical survey sampling framework (e.g. see
Cochran, 1977).
150
Chapter 3: Bayesian Basics Part 3
We thereby obtain the density
f(Ȳ | y) = [ Γ({(n−1)+1}/2) / ( √((n−1)π) Γ((n−1)/2) ) ]
× [ 1 + (1/(n−1)) ( (Ȳ − ȳ) / ( (s/√n) √(1 − n/N) ) )² ]^(−{(n−1)+1}/2)
× 1/( (s/√n) √(1 − n/N) ),   Ȳ ∈ ℝ.
151
Bayesian Methods for Statistical Analysis
X11(w=8,h=6); par(mfrow=c(1,1))
ybar=10; s=2; cv=seq(0,20,0.005)
plot(c(4,16),c(0,1),type="n",xlab="Ybar",ylab="f( Ybar | y )", main=" ")
n=5; rv=(cv-ybar)*sqrt(n)/s; lines(cv, dt(rv,n-1)*sqrt(n)/s,lty=1,lwd=2)
Nvec=c(6,7,10,40)
for(i in 1:length(Nvec)){ N=Nvec[i]; qv=rv/sqrt(1-n/N)
lines(cv, dt(qv,n-1)*sqrt(n)/(s*sqrt(1-n/N)),lty=i+1,lwd=2) }
legend(4,1,
c("N=6 (m=1)","N=7 (m=2)","N=10 (m=5)","N=40 (m=35)","N=infinity (=m)"),
lty=c(2:5,1),lwd=2)
text(6,0.6,
"The solid line is also the\nposterior density of mu,\nnamely f( mu | y ).")
152
CHAPTER 4
Computational Tools
4.1 Solving equations
In most of the Bayesian models so far examined, the calculations required
could be done analytically. For example, the model given by:
(Y | θ) ~ Binomial(5, θ)
θ ~ U(0, 1),
together with data y = 5, implies the posterior (θ | y) ~ Beta(6, 1). So θ
has posterior pdf f(θ | y) = 6θ⁵ and posterior cdf F(θ | y) = θ⁶. Then,
setting F(θ | y) = 1/2 yields the posterior median, (1/2)^(1/6) = 0.8909.
qbeta(0.5,6,1) # 0.8908987
How does the NR algorithm work? Figure 4.1 illustrates the idea.
153
Bayesian Methods for Statistical Analysis
154
Chapter 4: Computational Tools
j      0        1        2        3        4
θⱼ     1.0000   0.9167   0.8926   0.8909   0.8909

j      0        1        2        3        4
θⱼ     0.8000   0.9210   0.8933   0.8909   0.8909
Note 2: In this simple example, one could get the answer by solving the
equation g(θ) = 0 analytically. In general, that won't be possible, and
iterating the algorithm will be required. Of course, if it is possible to
solve that equation analytically, there is no need to iterate.
155
Bayesian Methods for Statistical Analysis
NR <- function(th,J=5){
# This function performs the Newton-Raphson algorithm for J iterations
# after starting at the value th. It outputs a vector of th values of length J+1.
thvec <- th; for(j in 1:J){
num <- th^6-1/2 # theta’s posterior cdf minus 1/2 (numerator)
den <- 6*th^5 # theta’s posterior pdf (denominator)
th <- th - num/den
thvec <- c(thvec,th) }
thvec }
options(digits=4)
NR(th=1,J=6) # 1.0000 0.9167 0.8926 0.8909 0.8909 0.8909 0.8909
NR(th=0.8,J=6) # 0.8000 0.9210 0.8933 0.8909 0.8909 0.8909 0.8909
0.8909-(0.8909^6-0.5)/(6*0.8909^5) # 0.8909 (Check)
We wish to solve g(t) = 0, where g(t) = t² − e^t. Now, g′(t) = 2t − e^t.
So we iterate according to t_{j+1} = t_j − ( t_j² − e^(t_j) ) / ( 2t_j − e^(t_j) ).
Let us arbitrarily choose t₀ = 0. Then we get:
t₁ = 0 − ( 0² − e⁰ ) / ( 2(0) − e⁰ ) = −1.000000
t₂ = (−1) − ( (−1)² − e^(−1) ) / ( 2(−1) − e^(−1) ) = −0.733044
t₃ = (−0.733044) − ( (−0.733044)² − e^(−0.733044) ) / ( 2(−0.733044) − e^(−0.733044) ) = −0.703808
t₄ = (−0.703808) − ( (−0.703808)² − e^(−0.703808) ) / ( 2(−0.703808) − e^(−0.703808) ) = −0.703467
t₅ = (−0.703467) − ( (−0.703467)² − e^(−0.703467) ) / ( 2(−0.703467) − e^(−0.703467) ) = −0.703467, etc.
156
Chapter 4: Computational Tools
Also, we find that the output of the NR algorithm starting from 1 is:
1.000000, -1.392211, -0.835088, -0.709834, -0.703483, -0.703467,
-0.703467, -0.703467, .....
Figure 4.2 illustrates the function g and the output of the NR algorithm
starting from −5, which is:
-5.000000, -2.502357, -1.287421, -0.802834, -0.707162, -0.703473,
-0.703467, -0.703467, .....
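For completeness, here is a small R sketch (written in the style of the NR() function above, but not part of the original listing) that reproduces these sequences:

NR2 <- function(t, J=7){
  # Newton-Raphson for g(t) = t^2 - exp(t), starting at t
  tvec <- t
  for(j in 1:J){ t <- t - (t^2 - exp(t))/(2*t - exp(t)); tvec <- c(tvec, t) }
  tvec }
NR2(0)   # 0 -1.000000 -0.733044 -0.703808 -0.703467 -0.703467 ...
NR2(1)   # 1 -1.392211 -0.835088 -0.709834 -0.703483 -0.703467 ...
NR2(-5)  # -5 -2.502357 -1.287421 -0.802834 -0.707162 -0.703473 ...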
157
Bayesian Methods for Statistical Analysis
158
Chapter 4: Computational Tools
Figures 4.3 and 4.4 show the posterior median 0.61427, as well as the
other solution of g ( p ) = 0 (i.e. root of g), namely 1.24748. This is not
actually a solution of F ( p | x) = 0.5, because the values of F ( p | x) for
p < 0 and p > 1 are 0 and 1, respectively.
159
Bayesian Methods for Statistical Analysis
160
Chapter 4: Computational Tools
4*(0.614272)^3-3*(0.614272)^4 # 0.499999
pvec=seq(-0.5,1.4,0.005); Fvec = 4*pvec^3-3*pvec^4
Fvec[pvec<=0] = 0; Fvec[pvec>=1] = 1
X11(w=8,h=4.5); par(mfrow=c(1,1))
161
Bayesian Methods for Statistical Analysis
Let x = (x₁, …, x_K)ᵀ, g(x) = (g₁(x), …, g_K(x))ᵀ and 0 = (0, …, 0)ᵀ
(a column vector of length K). Then the multivariate NR algorithm
iterates according to
x^(j+1) = x^(j) − [ ∂g(x)/∂xᵀ |_{x = x^(j)} ]^(−1) g( x^(j) ),
where ∂g(x)/∂xᵀ is the K × K matrix of partial derivatives with (i, k)th
element ∂gᵢ(x)/∂x_k.
First, f(λ | x) ∝ f(λ) f(x | λ) = 1 × e^(−λ) λ^x / x! ∝ λ e^(−λ), since x = 1.
162
Chapter 4: Computational Tools
The 80% HPDR for λ is (a,b), where a and b satisfy the two equations:
F (b | x) − F (a | x) = 0.8 (4.1)
f (b | x) = f (a | x) . (4.2)
Starting at
t^(0) = (a₀, b₀) = (0.5, 3.0)
(based on a visual inspection of the posterior density f(λ | x) = λ e^(−λ)), we
obtain results as shown in Table 4.3.
j     0      1           2          3          4         5
aⱼ    0.5    0.0776524   0.163185   0.167317   0.16730   0.16730
bⱼ    3.0    2.7406883   3.025571   3.079274   3.08029   3.08029
163
Bayesian Methods for Statistical Analysis
It seems that the 80% HPDR for λ is (0.16730, 3.08029). This interval is
illustrated in Figure 4.5 and appears to be correct.
gfun = function(a,b){
g1=pgamma(b,2,1)-pgamma(a,2,1)-0.8; g2=dgamma(b,2,1)-dgamma(a,2,1);
c(g1,g2) }
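The code that actually produced gmat is not shown above; the following is an assumed reconstruction (not the book's original listing) of the bivariate NR iteration, using the analytic Jacobian of (g1, g2) with respect to (a, b):

NRmat=function(a=0.5, b=3.0, J=7){
  # Bivariate Newton-Raphson for g1(a,b)=0 and g2(a,b)=0, storing successive (a,b) values.
  gmat=matrix(c(a,b),nrow=2)
  for(j in 1:J){
    g=gfun(a,b)
    # Jacobian: dg1/da=-f(a), dg1/db=f(b); dg2/da=-f'(a), dg2/db=f'(b),
    # where f is the Gamma(2,1) pdf, so f'(x)=(1-x)*exp(-x).
    Jac=rbind( c(-dgamma(a,2,1),  dgamma(b,2,1)),
               c(-(1-a)*exp(-a),  (1-b)*exp(-b)) )
    step=solve(Jac,g)
    a=a-step[1]; b=b-step[2]
    gmat=cbind(gmat,c(a,b)) }
  gmat }
gmat=NRmat()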
options(digits=6); gmat
# [1,] 0.5 0.0776524 0.163185 0.167317 0.16730 0.16730 0.16730 0.16730
# [2,] 3.0 2.7406883 3.025571 3.079274 3.08029 3.08029 3.08029 3.08029
164
Chapter 4: Computational Tools
lamv=seq(0,5,0.01); fv=dgamma(lamv,2,1)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(lamv,fv,type="l",lwd=3,xlab="lambda",ylab="f(lambda|x)", main=" ")
abline(h=c(dgamma(a,2,1)),v=c(a,b),lty=1)
# Checks:
c(a,b,dgamma(c(a,b),2,1)) # 0.167300 3.080291 0.141527 0.141527
c(pgamma(a,2,1), pgamma(b,2,1), pgamma(b,2,1) - pgamma(a,2,1))
# 0.0125275 0.8125275 0.8000000
This algorithm first requires the specification (i.e. definition by the user)
of some suitable latent data, which we will denote by z, and then the
application of the following two steps iteratively until convergence.
Note: The choice of the latent data z will depend on the particular
application.
165
Bayesian Methods for Statistical Analysis
or, in words, as
the expectation of the log-augmented posterior density with respect
to the distribution of the latent data given the observed data and
current parameter estimates.
Find the value of θ which maximises the Q-function, for example using
the Newton-Raphson algorithm.
This value becomes the current parameter estimate in the next iteration.
Suppose that the data, denoted D, consists of the observed data vector,
denoted by
166
Chapter 4: Computational Tools
yo = ( y1 ,..., yk ) ,
and the partially observed (or missing) data vector, denoted by
ym = ( yk +1 ,..., yn ) .
We don’t know the values in ym exactly, only that each of those values
is greater than some specified constant c.
(a) First, f(λ | D) ∝ f(λ) f(D | λ)
∝ 1 × ∏_{i=1}^k f(yᵢ | λ) ∏_{i=k+1}^n P(yᵢ > c | λ),
where:  f(yᵢ | λ) = λ e^(−λyᵢ)
P(yᵢ > c | λ) = ∫_c^∞ λ e^(−λyᵢ) dyᵢ = e^(−cλ).
Then f(λ | D) ∝ ∏_{i=1}^k λ e^(−λyᵢ) ∏_{i=k+1}^n e^(−cλ)
= λ^k exp{ −λ [ y_oT + (n−k)c ] },   where y_oT = y₁ + … + y_k.
So l(λ) ≡ log f(λ | D) = k log λ − λ [ y_oT + (n−k)c ]
⇒ l′(λ) = k/λ − [ y_oT + (n−k)c ],
and setting this to zero gives the posterior mode λ̂ = k / [ y_oT + (n−k)c ].
167
Bayesian Methods for Statistical Analysis
Now, f(yᵢ | yᵢ > c, λ) = λ e^(−λyᵢ) / e^(−λc) = λ e^(−λ(yᵢ − c)),   yᵢ > c
(an exponential pdf shifted to the right by c).
Therefore, E(yᵢ | yᵢ > c, λ) = c + 1/λ.
It follows that the Q-function is given by
Q_j(λ) = n log λ − λ [ y_oT + (n−k)( c + 1/λ_j ) ]
(note the distinction here between λ and λ_j).
168
Chapter 4: Computational Tools
Note: Writing (4.4) with λ_j = λ_{j+1} = λ (i.e. the limiting value) gives
λ = n / [ y_oT + (n−k)( c + 1/λ ) ],
and this can be solved easily for the same formula as derived in (a),
namely
λ = k / [ y_oT + (n−k)c ].
# (a)
n=5; k=3; c=10; yo=c(3.1, 8.2, 6.9); yoT=sum(yo); yoT # 18.2
k/(yoT+(n-k)*c) # 0.078534
# (b)
lam = 1; lamv = lam; options(digits=5)
for(j in 1:20){ lam=n/(yoT+(n-k)*(c+1/lam)); lamv=c(lamv,lam) }
lamv
# 1.000000 0.124378 0.092115 0.083456 0.080431 0.079282 0.078832
# 0.078653 0.078581 0.078553 0.078542 0.078537 0.078535 0.078535
# 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534
Suppose that the data, denoted D, consists of the observed data vector
yo = ( y1 ,..., yk )
and the partially observed (or ‘missing’) data vector
ym = ( yk +1 ,..., yn ) .
169
Bayesian Methods for Statistical Analysis
We don’t know the values in ym exactly, but only that each of these
values is greater than some specified constant c.
(a) Find the log-posterior density of µ and describe how it could be used
to find the posterior mode of µ . (Do not actually find that mode in this
way.)
(b) Find the posterior mode of µ using the EM algorithm. Then check
your answer by showing the mode in plots of the likelihood and log-
likelihood functions.
(a) Observe that f(μ | D) ∝ 1 × ∏_{i=1}^k f(yᵢ | μ) ∏_{i=k+1}^n P(yᵢ > c | μ).
Here, ∏_{i=1}^k f(yᵢ | μ) ∝ ∏_{i=1}^k exp{ −(yᵢ − μ)²/(2σ²) }
= exp{ −(1/(2σ²)) Σ_{i=1}^k (yᵢ − μ)² }
= exp{ −(1/(2σ²)) [ (k−1)s_o² + k(μ − ȳ_o)² ] },
where:  ȳ_o = (1/k) Σ_{i=1}^k yᵢ (the observed sample mean)
s_o² = (1/(k−1)) Σ_{i=1}^k (yᵢ − ȳ_o)² (the observed sample variance).
Also, P(yᵢ > c | μ) = ∫_c^∞ (1/(σ√(2π))) exp{ −(yᵢ − μ)²/(2σ²) } dyᵢ
= P( Z > (c−μ)/σ ) = 1 − Φ( (c−μ)/σ ),
where Z ~ N(0,1) and Φ(z) = P(Z ≤ z) (the standard normal cdf).
Therefore f(μ | D) ∝ exp{ −(k/(2σ²)) (μ − ȳ_o)² } [ 1 − Φ( (c−μ)/σ ) ]^(n−k).
170
Chapter 4: Computational Tools
So the log-posterior is
log f(μ | D) = −(k/(2σ²)) (μ − ȳ_o)² + (n−k) log[ 1 − Φ( (c−μ)/σ ) ] + c₁
(where c₁ is a term which does not depend on μ),
where l″(μ) = ∂l′(μ)/∂μ = −k/σ² + …
As a further exercise, one could complete the formula for l ′′( µ ) above
and actually implement the NR algorithm.
Note: The posterior mode here is also the maximum likelihood estimate,
since the prior is proportional to a constant.
171
Bayesian Methods for Statistical Analysis
= −(1/(2σ²)) [ Σ_{i=1}^k ( yᵢ² − 2μyᵢ + μ² ) + Σ_{i=k+1}^n ( yᵢ² − 2μyᵢ + μ² ) ] + c₁
= c₂ { ( kμ² − 2μ k ȳ_o ) + ( (n−k)μ² − 2μ(n−k) ȳ_m ) } + c₃,
where:  ȳ_o = (1/k) Σ_{i=1}^k yᵢ (the sample mean of the observed values)
ȳ_m = (1/(n−k)) Σ_{i=k+1}^n yᵢ (the sample mean of the missing values).
We see that eⱼ = E( X | X > c ) evaluated at μ = μⱼ,
where P(X > c) = 1 − P(X < c) = 1 − P( Z < (c−μ)/σ ) = 1 − Φ( (c−μ)/σ ),
and where
I = ∫_c^∞ x (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx
= ∫_c^∞ (x−μ) (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx + ∫_c^∞ μ (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx
= ∫_{(c−μ)²/2}^∞ (1/(σ√(2π))) e^(−t/σ²) dt + μ P(X > c)
   where t = (x−μ)²/2 and dt = (x−μ) dx
172
Chapter 4: Computational Tools
= (1/(σ√(2π))) σ² e^(−(c−μ)²/(2σ²)) + μ P(X > c)
= σ φ( (c−μ)/σ ) + μ P(X > c),   where φ(z) is the standard normal pdf.
Thus E( X | X > c ) = (1/P(X > c)) [ σ φ( (c−μ)/σ ) + μ P(X > c) ]
= μ + σ φ( (c−μ)/σ ) / [ 1 − Φ( (c−μ)/σ ) ],
and consequently eⱼ = μⱼ + σ φ( (c−μⱼ)/σ ) / [ 1 − Φ( (c−μⱼ)/σ ) ].
j
Figure 4.6 shows the posterior density (top subplot) and the log-posterior
density (bottom subplot). Each of these density functions is drawn scaled,
meaning correct only up to a constant of proportionality. In each subplot,
the posterior mode is indicated by way of a vertical dashed line.
173
Bayesian Methods for Statistical Analysis
# (b)
options(digits=6); yo = c(3.1, 8.2, 6.9); n=5; k = 3; c= 10; sig=3;
yoT=sum(yo); c(yoT, yoT/3) # 18.20000 6.06667
mu=5; muv=mu; for(j in 1:10){
ej = mu + sig * dnorm((c-mu)/sig) / ( 1-pnorm((c-mu)/sig) )
mu = ( yoT + (n-k)*ej ) / n
muv=c(muv,mu) }
muv # 5.00000 8.13784 8.37179 8.39570 8.39821 8.39847
# 8.39850 8.39850 8.39850 8.39850 8.39850
modeval=muv[length(muv)]; modeval # 8.3985
muvec=seq(0,20,0.001); lvec=muvec
for(i in 1:length(muvec)){ muval=muvec[i]
lvec[i]=(-1/(2*sig^2))*sum((yo-muval)^2) +
(n-k)*log(1-pnorm((c-muval)/sig)) }
iopt=(1:length(muvec))[lvec==max(lvec)]; muopt=muvec[iopt]; muopt # 8.399
X11(w=8,h=6); par(mfrow=c(2,1));
plot(muvec,exp(lvec),type="l",lwd=2); abline(v=modeval,lty=2,lwd=2)
plot(muvec,lvec,type="l",lwd=2); abline(v=modeval,lty=2,lwd=2)
174
Chapter 4: Computational Tools
The idea is, at each M-Step, to maximise the Q-function with respect to
θ1 , with θ 2 fixed at its current value; and then to maximise the Q-function
with respect to θ 2 , with θ1 fixed at its current value.
175
Bayesian Methods for Statistical Analysis
Assuming the current values of a and b are aⱼ and bⱼ, this can be achieved
via the NR algorithm by setting a′₀ = aⱼ and iterating until convergence
as follows (k = 0, 1, 2, ...):
a′_{k+1} = a′_k − [ ∂g(a,b)/∂a |_{a = a′_k, b = bⱼ} ] / [ ∂²g(a,b)/∂a² |_{a = a′_k, b = bⱼ} ],
and finally setting
a_{j+1} = a′_∞.   (4.5)
This can be achieved via the NR algorithm by setting b′₀ = bⱼ and iterating
until convergence as follows (k = 0, 1, 2, ...):
b′_{k+1} = b′_k − [ ∂g(a,b)/∂b |_{a = a_{j+1}, b = b′_k} ] / [ ∂²g(a,b)/∂b² |_{a = a_{j+1}, b = b′_k} ],
and finally setting
b_{j+1} = b′_∞.   (4.6)
176
Chapter 4: Computational Tools
One application of the CNR and CNR1 algorithms is to finding the HPDR
for a parameter.
The 80% HPDR for λ was shown to be (a,b), where a and b are the
simultaneous solutions of the two equations:
g₁(a, b) = F(b | x) − F(a | x) − 0.8
g₂(a, b) = f(b | x) − f(a | x).
This model says that each value yi has a common variance σ 2 and one
of two means, these being: µ if Ri = 0
µ + δ if Ri = 1.
177
Bayesian Methods for Statistical Analysis
(d) Create a plot which shows the routes taken by the algorithms in parts
(b) and (c).
(a) Figure 4.7 shows a histogram of the sampled values which clearly
shows the two component normal densities and the mixture density. The
sample mean of the data is 23.16. Also, 29 of the 100 Ri values are equal
to 1, and 71 of them are equal to 0.
178
Chapter 4: Computational Tools
(b) We will here take the vector R = (R₁, …, Rₙ) as the latent data. The
conditional posterior of μ and δ given this latent data is
f(μ, δ | y, R) ∝ f(μ, δ, y, R)
= f(μ, δ) f(R | μ, δ) f(y | R, μ, δ)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }
∝ 1 × 1 × exp{ −(1/(2σ²)) Σ_{i=1}^n ( yᵢ − [μ + Rᵢδ] )² }.
Now, the exponent can be expanded as
−(1/(2σ²)) Σ_{i=1}^n ( yᵢ² − 2yᵢ[μ + Rᵢδ] + [μ + Rᵢδ]² )
= −(1/(2σ²)) { Σ_{i=1}^n yᵢ² − 2 Σ_{i=1}^n yᵢ[μ + Rᵢδ] + Σ_{i=1}^n [μ + Rᵢδ]² }
= −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢRᵢ + nμ² + 2μδ Σ_{i=1}^n Rᵢ + δ² Σ_{i=1}^n Rᵢ² },
where c₁ and c₂ are positive constants which do not depend on μ or δ
in any way. We see that (using Rᵢ² = Rᵢ)
log f(μ, δ | y, R) = −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢRᵢ + nμ² + 2μδ R_T + δ² R_T },
where R_T = Σ_{i=1}^n Rᵢ.
So the Q-function is
Q_j(μ, δ) = E_R{ log f(μ, δ | y, R) | y, μⱼ, δⱼ }
= −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢ e_ij + nμ² + 2μδ e_Tj + δ² e_Tj },
where:  e_ij = E( Rᵢ | y, μⱼ, δⱼ )
e_Tj = E( R_T | y, μⱼ, δⱼ ) = Σ_{i=1}^n e_ij.
179
Bayesian Methods for Statistical Analysis
We now need to obtain formulae for the e_ij values. Observe that
f(R | y, μ, δ) ∝ f(μ, δ, y, R)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }.
It follows that
( Rᵢ | y, μ, δ ) ~ ⊥ Bernoulli(eᵢ),  i = 1, …, n,
where
eᵢ = π exp{ −(1/(2σ²)) ( yᵢ − [μ+δ] )² }
     / [ π exp{ −(1/(2σ²)) ( yᵢ − [μ+δ] )² } + (1−π) exp{ −(1/(2σ²)) ( yᵢ − μ )² } ].
Therefore
e_ij = π exp{ −(1/(2σ²)) ( yᵢ − [μⱼ+δⱼ] )² }
       / [ π exp{ −(1/(2σ²)) ( yᵢ − [μⱼ+δⱼ] )² } + (1−π) exp{ −(1/(2σ²)) ( yᵢ − μⱼ )² } ].
180
Chapter 4: Computational Tools
Running the algorithm from different starting points we obtain the same
final results. Unlike the NR algorithm, we find that the EM algorithm
always converges, regardless of the point from which it is started.
j µj δj
0 10.000 1.000
1 21.169 3.032
2 20.321 7.07
3 19.843 9.139
4 19.926 9.518
5 20.005 9.626
6 20.046 9.674
7 20.066 9.697
8 20.075 9.708
9 20.08 9.713
10 20.082 9.715
11 20.083 9.717
12 20.084 9.717
13 20.084 9.717
14 20.084 9.718
15 20.084 9.718
16 20.084 9.718
17 20.084 9.718
18 20.084 9.718
19 20.084 9.718
20 20.084 9.718
181
Bayesian Methods for Statistical Analysis
Thus, setting
∂Q_j(μ, δ)/∂μ = −c₁ { 0 − 2nȳ − 0 + 2nμ + 2δ e_Tj + 0 }
to zero we get μ_{j+1} = ȳ − (1/n) δⱼ e_Tj (after substituting in δ = δⱼ).
Then, setting
∂Q_j(μ, δ)/∂δ = −c₁ { 0 − 0 − 2 Σ_{i=1}^n yᵢ e_ij + 0 + 2μ e_Tj + 2δ e_Tj }
to zero we get δ_{j+1} = ( Σ_{i=1}^n yᵢ e_ij ) / e_Tj − μ_{j+1} (the same form of
δ-update equation as for the EM algorithm).
We see that the ECM algorithm here is fairly similar to the EM algorithm.
(d) Figure 4.8 (page 185) shows a contour plot of the log-posterior density
log f ( µ , δ | y, R) and the routes of the EM and ECM algorithms in parts
(b) and (c), each from the starting point ( µ0 , δ 0 ) = (10, 1) to the mode,
( µˆ , δˆ ) = (20.08, 9.72). Also shown are two other pairs of routes, one pair
starting from (5, 30), and the other from (35, 20).
Note 2: The log-posterior density in Figure 4.8 has a formula which can
be derived as follows. First, the joint posterior of all unknowns in the
model is
f(μ, δ, R | y) ∝ f(μ, δ, y, R)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }
182
Chapter 4: Computational Tools
= ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }.
∂l(μ,δ)/∂μ, ∂l(μ,δ)/∂δ, ∂²l(μ,δ)/∂μ², ∂²l(μ,δ)/∂δ² and ∂²l(μ,δ)/∂δ∂μ,
and could prove to be unstable. That is, the algorithm might fail to
converge if started from a point not very near the required solution.
183
Bayesian Methods for Statistical Analysis
j µj δj
0 10.000 1.000
1 22.505 1.696
2 22.566 3.882
3 21.905 6.811
4 21.139 8.729
5 20.611 9.501
6 20.322 9.732
7 20.181 9.774
8 20.118 9.764
9 20.093 9.746
10 20.085 9.732
11 20.083 9.725
12 20.083 9.720
13 20.083 9.719
14 20.084 9.718
15 20.084 9.718
16 20.084 9.718
17 20.084 9.718
18 20.084 9.718
19 20.084 9.718
20 20.084 9.718
184
Chapter 4: Computational Tools
# (a)
X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4)
ntrue=100; pitrue=1/3; mutrue=20; deltrue=10; sigtrue=3
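# The simulation of yvec (and the true R_i values) is not shown on this page.
# A sketch consistent with the stated model follows; any seed gives similar,
# but not identical, numbers to those quoted in the text (e.g. ybar = 23.16).
set.seed(123)
Rtrue=rbinom(ntrue,1,pitrue)
yvec=rnorm(ntrue, mutrue + deltrue*Rtrue, sigtrue)
c(mean(yvec), sum(Rtrue))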
hist(yvec,prob=T,breaks=seq(0,50,0.5),xlim=c(10,40),ylim=c(0,0.2),
xlab="y", main=" ")
185
Bayesian Methods for Statistical Analysis
yv=seq(0,50,0.01); lines(yv,dnorm(yv,mutrue,sigtrue),lty=2,lwd=2)
lines(yv,dnorm(yv,mutrue+deltrue, sigtrue),lty=2,lwd=2)
lines(yv, (1-pitrue)*dnorm(yv,mutrue,sigtrue)+
pitrue*dnorm(yv,mutrue+deltrue,sigtrue), lty=1,lwd=2)
legend(10,0.2,c("Components","Mixture"),lty=c(2,1),lwd=c(2,2))
# (b)
evalsfun= function(y=yvec, pii=pitrue, mu=mutrue,del=deltrue,sig=sigtrue){
# This function outputs (e1,e2,...,en)
term1vals=pii*dnorm(y,mu+del,sig)
term0vals=(1-pii)*dnorm(y,mu,sig)
term1vals/(term1vals+term0vals) }
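# The original definition of EMfun (used below) is not shown on this page.
# The following is an assumed sketch: the E-step computes the e_i values via
# evalsfun(), and the M-step jointly solves the two stationarity equations
# derived in part (b) for mu and delta.
EMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
  muv=mu; delv=del; ybar=mean(y); n=length(y)
  for(j in 1:J){
    evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)   # E-step
    sumyevals=sum(y*evals); sumevals=sum(evals)
    del=(sumyevals/sumevals - ybar)/(1 - sumevals/n)        # joint M-step solution
    mu=ybar - del*sumevals/n
    muv=c(muv,mu); delv=c(delv,del) }
  list(muv=muv, delv=delv) }
EMres=EMfun(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue)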
muhat=EMres$muv[21]; delhat=EMres$delv[21];
c(muhat,delhat) # 20.084 9.718
186
Chapter 4: Computational Tools
# (c)
CEMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
muv=mu; delv=del; ybar=mean(y); n=length(y)
for(j in 1:J){
evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)
sumyevals = sum(y*evals); sumevals=sum(evals)
mu=ybar-del*sumevals/n
del=sumyevals/sumevals - mu
muv=c(muv,mu); delv=c(delv,del)
}
list(muv=muv,delv=delv)
}
CEMres=CEMfun(J=20, mu=10, del=1,y=yvec,pii=pitrue,sig=sigtrue)
outmat2 = cbind(0:20, CEMres$muv, CEMres$delv)
print.matrix(outmat2)
# (d)
X11(w=8,h=9); par(mfrow=c(1,1))
logpostfun=function(mu=10,del=10,y=yvec,pii=pitrue,sig=sigtrue){
sum(log(pii*dnorm(y,mu+del,sig)+(1-pii)*dnorm(y,mu,sig))) }
mugrid=seq(0,35,0.5); delgrid=seq(0,30,0.5)
logpostmat=matrix(NA,nrow=length(mugrid),ncol=length(delgrid)) # log-posterior on the grid
for(i in 1:length(mugrid)) for(k in 1:length(delgrid))
  logpostmat[i,k]=logpostfun(mu=mugrid[i],del=delgrid[k])
dim(logpostmat) # 71 61
contour(mugrid,delgrid,logpostmat,nlevels=30,xlab="mu",ylab="delta")
points(10,1,pch=16,cex=1.2)
187
Bayesian Methods for Statistical Analysis
points(5,30,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=5, del=30,y=yvec,pii=pitrue,sig=sigtrue)
CEMres=CEMfun(J=50, mu=5, del=30, y=yvec,pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)
points(35,20,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=35, del=20,y=yvec,pii=pitrue,sig=sigtrue)
CEMres=CEMfun(J=50, mu=35, del=20, y=yvec,pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)
legend(21,30,c("EM","ECM"),lty=c(1,2),lwd=c(3,3))
ψ̂ = E(θ² | y) = ∫_0^1 θ² × 6θ⁵ dθ = 0.75.
But what if this integral did not have a simple analytical solution?
188
Chapter 4: Computational Tools
ψ̂ = ∫_0^1 ψ × 3ψ² dψ = 0.75.
If this strategy does not help, we may then consider using a numerical
integration technique.
Applying this method (see the R code below for details) yields 0.7558 as
an estimate of ψ̂. Repeating, but with the evaluations on the grid 0.01,
0.02, ...,1 yields 0.7500. Repeating again, but with evaluations on the grid
0.001, 0.002, ..., 1 yields 0.7500. It appears that a limit has been reached
and that using a finer grid would not result in any improvements to the
results of this numerical procedure.
189
Bayesian Methods for Statistical Analysis
gfun=function(t){ 6*t^7 }
tvec <- seq(0,1,0.1); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.755803
tvec <- seq(0,1,0.01); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75
tvec <- seq(0,1,0.001); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75
Suppose that X ~ N(μ, σ²) and Y = (X | X > c), where μ = 8, σ = 3
and c = 10. Find EY using numerical techniques and compare your answer
with the exact value,
μ + σ φ( (c−μ)/σ ) / [ 1 − Φ( (c−μ)/σ ) ],
which was derived analytically in Exercise 4.6.
One approach is to compute EY = ∫_c^∞ g(x) dx numerically,
where:  g(x) = x f(x) / P(X > c),  f(x) = (1/σ) φ( (x−μ)/σ ),
P(X > c) = 1 − Φ( (c−μ)/σ ).
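A quick numerical check along these lines (a sketch, not the book's own code) is:

mu=8; sig=3; c0=10
pXgtc = 1 - pnorm((c0-mu)/sig)                       # P(X > c)
EY.num = integrate(function(x) x*dnorm(x,mu,sig)/pXgtc, lower=c0, upper=Inf)$value
EY.exact = mu + sig*dnorm((c0-mu)/sig)/(1-pnorm((c0-mu)/sig))
c(EY.num, EY.exact)                                  # both approximately 11.8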
190
Chapter 4: Computational Tools
Use the integrate() and INTEG() functions in at least two different ways
so as to calculate the double integral
I = ∫_{x=0}^{1} ∫_{t=0}^{x³} t^t dt dx.
191
Bayesian Methods for Statistical Analysis
Using the integrate() function alone (and not the INTEG() function), the
integral can be worked out as follows:
integrate(function(x) {
sapply(x, function(x) {
integrate(function(t) {
sapply(t, function(t) t^t )
}, 0, x^3)$value }) }, 0, 1)
where
g(x) = ∫_{t=0}^{x³} h(t) dt
and
h(t) = t^t.
We will now use the integrate() function to obtain g ( x) for each value of
x in the grid 0, 0.01, 0.02, ..., 1. We will then apply the INTEG() function
to the resulting coordinates.
Figure 4.9 below displays the two functions h(t) and g(x). The value
g(0.8) = 0.381116 is the area under h(t) between 0 and 0.8³ = 0.512. The
total area under h(t) (from 0 to 1) is 0.78343.
192
Chapter 4: Computational Tools
One could also adapt the second approach above so as to calculate the
double integral using the INTEG() function only (without using the
integrate() function directly). This might be useful if the inner integral
g(x) = ∫_{t=0}^{x³} h(t) dt,  where h(t) = t^t,
Note: The integrate() function is called within the INTEG() function and
so is used at least indirectly in all of the approaches considered here.
integrate(function(x) {
sapply(x, function(x) {
integrate(function(t) {
sapply(t, function(t) t^t )
}, 0, x^3)$value }) }, 0, 1)
# 0.192723 with absolute error < 7.8e-10
193
Bayesian Methods for Statistical Analysis
hfun=function(t){ t^t } # h(t) = t^t (definition assumed; not shown on this page)
integrate(f=hfun,lower=0,upper=0.8^3)$value
# 0.381116 This is g(0.8), the area under h(t) between 0 and 0.8^3 = 0.512
integrate(f=hfun,lower=0,upper=1)$value
# 0.78343 This is the total area under h(t) (from 0 to 1)
The second of the next two exercises shows how the optim() function can
be used to specify a prior distribution.
194
Chapter 4: Computational Tools
Use the optim() function to ‘find’ the mode of each of the following:
(a) g(x) = x² e^(−5x),  x > 0  (mode = 2/5)
(b) g(x) = |x|^x e^(−(x−1)²) / (1 + |x|),  x ∈ ℝ  (the mode has no closed form)
(b) The function returns a value of 1.5047. (We presume that this is
correct; see below for a verification.)
Figure 4.10 illustrates these three solutions, with each mode being marked
by a dot and vertical line. Subplot (c) shows several examples of the
function g ( x, y ) in part (c) considered as a function of only x, with each
line defined by a fixed value of y on the grid 0, 0.5, 1, ...,4.5, 5.
195
Bayesian Methods for Statistical Analysis
# (a)
fun=function(x){ -x^2 * exp(-5*x) }
res0=optim(par=0.5,fn=fun)$par; res0 # 0.4
# Warning message:
# In optim(par = 0.5, fn = fun) :
# one-diml optimization by Nelder-Mead is unreliable:
# use "Brent" or optimize() directly
plot(seq(0,5,0.01), -fun(seq(0,5,0.01)),type="l",lwd=3,xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.02,"(a)",cex=2)
# (b)
fun=function(x){ -exp(-(x-1)^2) * abs(x)^x/(1+abs(x)) }
res0=optim(par=1,fn=fun)$par; res0 # 1.5047
plot(seq(-2,5,0.01), -fun(seq(-2,5,0.01)),type="l",lwd=3, xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.45,"(b)",cex=2)
196
Chapter 4: Computational Tools
# (c)
fun=function(v){ -v[2]^3 * exp( -v[2] * ( (v[1]-1)^2 + (v[1]-3)^2 ) ) }
fun2=function(x,y){ y^3 * exp( -y * ( (x-1)^2 + (x-3)^2 ) ) } # positive version of fun, used for plotting
res0=optim(par=c(2,2),fn=fun, lower = c(-Inf,0), upper = c(Inf,Inf),
method = "L-BFGS-B")$par; res0 # 2.0 1.5
plot(c(0.5,3.5),c(0,0.2), type="n",xlab="x",ylab="f(x,y)")
for(y in seq(0,5,0.5))
lines(seq(0,5,0.01), fun2(x=seq(0,5,0.01),y=y), lty=1)
abline(v=res0[1]); points(res0[1],fun2(res0[1],res0[2]), pch=16, cex=2);
lines(seq(0,5,0.01),fun2(x= seq(0,5,0.01), y=res0[2]),lty=1,lwd=3);
text(3,0.17,"(c)",cex=2)
We wish to find the values of η and τ which satisfy the two equations:
P(σ < a) = α/2 and P(σ < b) = 1 − α/2,
where a = 0.5, b = 1 and α = 0.05.
These two equations are together equivalent to each of the following five
pairs of equations:
P(σ² < a²) = α/2 and P(σ² < b²) = 1 − α/2
P(1/λ < a²) = α/2 and P(1/λ < b²) = 1 − α/2
P(1/a² < λ) = α/2 and P(1/b² < λ) = 1 − α/2
P(λ < 1/a²) = 1 − α/2 and P(λ < 1/b²) = α/2
F_G(η,τ)(1/a²) − (1 − α/2) = 0 and F_G(η,τ)(1/b²) − α/2 = 0.
197
Bayesian Methods for Statistical Analysis
We now focus on the last of these pairs of equations. Two obvious
ways to solve these equations are by trial and error and via the multivariate
Newton-Raphson algorithm, as illustrated earlier. But the solution can be
obtained more easily by using the optim() function to minimise
g(η, τ) = [ F_G(η,τ)(1/a²) − (1 − α/2) ]² + [ F_G(η,τ)(1/b²) − α/2 ]².
Note: Clearly, this function has a value of zero at the required values of
η and τ.
However, applying the optim() function again but starting at the previous
solution, namely η = 8.4764 and τ = 3.7679, yielded a ‘refined’
solution, η = 8.4748 and τ = 3.7654.
Discussion
The three densities are plotted in Figure 4.11 (in the stated order from top
to bottom). The vertical lines show the 0.025 and 0.975 quantiles of each
distribution. The formulae for the three densities are as follows:
f(λ) = f_G(η,τ)(λ) = τ^η λ^(η−1) e^(−τλ) / Γ(η),   λ > 0
198
Chapter 4: Computational Tools
f(σ²) = f_IG(η,τ)(σ²) = f(λ) |dλ/d(σ²)| = f_G(η,τ)( λ = σ^(−2) ) × | −(σ²)^(−2) |,
   where λ = (σ²)^(−1)
= [ τ^η (1/σ²)^(η−1) e^(−τ(1/σ²)) / Γ(η) ] (σ²)^(−2),   σ² > 0
f(σ) = f(λ) |dλ/dσ| = f_G(η,τ)( λ = σ^(−2) ) × | −2σ^(−3) |,   where λ = σ^(−2)
= [ τ^η (1/σ²)^(η−1) e^(−τ(1/σ²)) / Γ(η) ] 2σ^(−3),   σ > 0.
As a check on the last of these three densities, the integrate() function was
used to show that the area under that density is exactly 1, and that the areas
underneath it to the left of 0.5 and to the right of 1 are both exactly 0.025.
199
Bayesian Methods for Statistical Analysis
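The definition of the objective function fun passed to optim() below is not reproduced in this excerpt; a sketch consistent with g(η, τ) as given above (with a = 0.5, b = 1 and α = 0.05) is:

a=0.5; b=1; alpha=0.05
fun=function(v){ eta=v[1]; tau=v[2]
  # sum of squared deviations from the two target quantile conditions
  ( pgamma(1/a^2, eta, tau) - (1-alpha/2) )^2 +
  ( pgamma(1/b^2, eta, tau) - alpha/2 )^2 }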
res0=optim(par=c(0.2,6),fn=fun)$par
res0 # 8.4764 3.7679
pgamma(c(1/b^2,1/a^2),res0[1],res0[2]) # 0.025048 0.975104 Close
par(mfrow=c(3,1)); tv=seq(0,10,0.01)
200
CHAPTER 5
Monte Carlo Basics
5.1 Introduction
The term Monte Carlo (MC) methods refers to a broad collection of tools
that are useful for approximating quantities based on artificially generated
random samples. These include Monte Carlo integration (for
estimating an integral using such a sample), the inversion technique (for
generating the required sample), and Markov chain Monte Carlo methods
(an advanced topic in Chapter 6). In principle, the approximation can be
made as good as required simply by making the Monte Carlo sample size
sufficiently large. As will be seen (further down), Monte Carlo methods
are a very useful tool in Bayesian inference.
This method will be faster and more accurate; but it will also require at
least some mathematical work to identify exactly what the parameters of
each drop are and what configuration of those parameters corresponds to
the needle crossing a line (again, this is done in one of the exercises
below).
201
Bayesian Methods for Statistical Analysis
In this chapter, we will first discuss Monte Carlo methods and their
usefulness under the assumption that we have available or can generate
the required random samples. As we will see in the exercises and their
solutions, such samples can often be obtained very easily using inbuilt R
functions, e.g. runif() and rnorm().
Also, as part of the structure of the present chapter, we will first discuss
Monte Carlo methods and random number generation in a fully general
setting. Only after we have finished our treatment of these two topics (to
a certain level at least) will we discuss their application to Bayesian
inference. Hopefully this format will minimise any confusion.
μ = Ex = ∫ x f(x) dx
(or μ = Ex = Σ_x x f(x),
or μ = Ex = ∫ x dF(x) ).
Also suppose, however, that we are able to generate (or obtain) a random
sample from the distribution in question. Denote this sample as
x₁, …, x_J ~ iid f(x)
(or x₁, …, x_J ~ iid F(x) ).
202
Chapter 5: Monte Carlo Basics
(b) Repeat (a) but with MC sample sizes of 1,000 and 10,000, and discuss
the results.
203
Bayesian Methods for Statistical Analysis
(a) Applying the above procedure (see the R code below) we estimate µ
by x = 1.5170. The Monte Carlo 95% confidence interval for µ is
CI= ( x ± z0.025 s / J ) = (1.3539, 1.6800).
We note that x is ‘close’ to the true value, µ = 1.5, and the CI contains
that true value.
(b) Repeating (a) with J = 1,000 we obtain the point estimate 1.5199 and
the interval estimate (1.4658, 1.5740).
Repeating (a) with J = 10,000 we obtain the point estimate 1.4942 and the
interval estimate (1.4773, 1.5110).
204
Chapter 5: Monte Carlo Basics
ψ = Ey = ∫ y f(y) dy = E g(x) = ∫ g(x) f(x) dx.
Then we simply calculate y j = g ( x j ) for each j = 1,..., J . The result will
be a random sample y1 ,..., yJ ~ iid f ( y ) to which the method of Monte
Carlo can then be applied in the usual way. Thus, an estimate of ψ is
ȳ = (1/J) Σ_{j=1}^J yⱼ (the sample mean of the y-values),
and a 1 − α CI for ψ is
ȳ ± z_{α/2} s_y/√J,
205
Bayesian Methods for Statistical Analysis
where s_y² = (1/(J−1)) Σ_{j=1}^J (yⱼ − ȳ)² (the sample variance of the y-values).
Note 1: As we will see later, it is often the case that we are able to sample
from a distribution without knowing—or being able to derive—the
exact form of its density function.
Present your results graphically, and wherever possible show the true
values of the quantities being estimated. Then repeat everything but using
a Monte Carlo sample size of J = 10,000.
The required graphs are shown in Figures 5.1 to 5.4. See the R code below
for more details.
206
Chapter 5: Monte Carlo Basics
207
Bayesian Methods for Statistical Analysis
208
Chapter 5: Monte Carlo Basics
hist(xv,prob=T,breaks=seq(0,7,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x",
main=""); lines(xden,lty=2,lwd=2)
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2)
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2)
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(yv,prob=T,breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y",
main=""); lines(yden,lty=2,lwd=2)
abline(v= c(ybar, yci, ycdr), lty=2, lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(xv,prob=T,breaks=seq(0,9,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x",
main=""); lines(xden,lty=2,lwd=2)
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2)
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2)
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(yv,prob=T,breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y",
main="")
lines(yden,lty=2,lwd=2); abline(v= c(ybar, yci, ycdr), lty=2, lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
209
Bayesian Methods for Statistical Analysis
210
Chapter 5: Monte Carlo Basics
This suggests that we sample x₁, …, x_J ~ iid h(x) (as before) and apply
MC estimation to the means of w(x) and u(x), respectively (each with
respect to the distribution defined by density h(x)) so as to obtain the
estimate
ψ̂ = w̄/ū = [ (1/J) Σ_{j=1}^J wⱼ ] / [ (1/J) Σ_{j=1}^J uⱼ ] = ( w₁ + … + w_J ) / ( u₁ + … + u_J ),
where wⱼ = w(xⱼ) and uⱼ = u(xⱼ).

f(x) ∝ (1/(x+1)) e^(−x),   x > 0.
Here, k(x) = (1/(x+1)) e^(−x), and it is convenient to use h(x) = e^(−x), x > 0
(the standard exponential density, or Gamma(1,1) density). Then,
μ = Ex = ∫ x f(x) dx = ∫ x k(x) dx / ∫ k(x) dx
= ∫ x [k(x)/h(x)] h(x) dx / ∫ [k(x)/h(x)] h(x) dx
= ∫ [x/(x+1)] h(x) dx / ∫ [1/(x+1)] h(x) dx.
So a MC estimate of μ is
μ̂ = [ (1/J) Σ_{j=1}^J xⱼ/(xⱼ+1) ] / [ (1/J) Σ_{j=1}^J 1/(xⱼ+1) ],
where x₁, …, x_J ~ iid G(1,1).
211
Bayesian Methods for Statistical Analysis
0.40345
Implementing this with J = 100,000, we get µˆ = = 0.67631.
0.59655
Note 1: For interest we use numerical techniques to get the exact answer,
µ = 0.67687.
Thus the relative error is –0.084%. Figure 5.5 illustrates.
options(digits=10);
kfun=function(x){ exp(-x)/(x+1) }
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.5963473624
ffun=function(x){ (1/ 0.5963473624)*exp(-x)/(x+1) }
integrate(f=ffun,lower=0,upper=Inf)$value; # 0.9999999999
xffun= function(x){ x*(1/0.5963474)*exp(-x)/(x+1) }
mu= integrate(f=xffun,lower=0,upper=Inf)$value; mu # 0.6768749849
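The Monte Carlo computation itself is not reproduced above; a sketch (with an arbitrary seed, so the figures will differ slightly from the 0.40345 and 0.59655 quoted) is:

J=100000; set.seed(123)
xv=rgamma(J,1,1)                        # x_1,...,x_J ~ iid G(1,1)
wbar=mean(xv/(xv+1)); ubar=mean(1/(xv+1))
est=wbar/ubar                           # ratio estimate of mu
c(wbar, ubar, est)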
212
Chapter 5: Monte Carlo Basics
plot(c(0,3),c(0,2),type="n",xlab="x",ylab="density"); xvec=seq(0,5,0.01);
lines(xvec,dgamma(xvec,1,1),lty=1,lwd=3)
lines(xvec,xvec*dgamma(xvec,1,1),lty=1,lwd=1)
lines(xvec,ffun(xvec),lty=2,lwd=3); lines(xvec,xvec*ffun(xvec),lty=2,lwd=1)
points(c(1,mu,est),c(0,0,0),pch=c(16,4,1),lwd=c(2,2,2),cex=c(1.2,1.2,1.2))
legend(1.7,2,c( "f(x) = (1/c)*exp(-x)/(x+1)", "h(x) = exp(-x)" ),
lty=c(2,1), lwd=c(3,3))
legend(1.7,1.3,c( "x*f(x)", "x*h(x)" ), lty=c(2,1), lwd=c(1,1))
legend(0.5,2,c("E(x) = area under x*f(x)", "E(x) = area under x*h(x)",
"MC estimate of E(x)"), pch=c(4,16,1),pt.lwd=c(2,2,2),pt.cex=c(1.2,1.2,1.2))
213
Bayesian Methods for Statistical Analysis
214
Chapter 5: Monte Carlo Basics
Exercise 5.4
215
Bayesian Methods for Statistical Analysis
hist(rv,prob=T, breaks=seq(-1,1.8,0.1),xlim=c(-1,1.6),ylim=c(0,1.3),xlab="r",
main=""); lines(rden,lty=1,lwd=2); abline(v= c(rbar, rci, rcdr), lty=2, lwd=2)
So the MC SE is s/√J = (1/√J) √[ (J/(J−1)) x̄(1−x̄) ] = √[ x̄(1−x̄)/(J−1) ].
216
Chapter 5: Monte Carlo Basics
Note 1: The above theory is really nothing other than the usual classical
theory for estimating a binomial proportion. Thus, there are many other
CIs that could be substituted, (e.g. the Wilson CI whose coverage is
closer to 1 − α , and the Clopper-Pearson CI whose coverage is always
guaranteed to be at least 1 − α but which is typically wider).
The procedure for the second example is similar, except that it involves
sampling (x₁, y₁) ~ f(x, y) and determining r₁ = I(x₁ < y₁), etc.
217
Bayesian Methods for Statistical Analysis
The result will be a sample r1 ,..., rM ~ iid Bern( p ) , where p is the true
coverage probability, which can then be estimated via MC methods in
the usual way.
Use MC to estimate p = P( x/(x+1) > 0.3 e^x ), where x ~ Gamma(3, 2).

Here rⱼ = I( xⱼ/(xⱼ+1) > 0.3 e^(xⱼ) ) with x₁, …, x_J ~ iid Gamma(3, 2).
Thereby we obtain an estimate of p equal to p̂ = (1/J) Σ_{j=1}^J rⱼ = 0.2117
and a 95% CI for p equal to p̂ ± 1.96 √[ p̂(1−p̂)/J ] = (0.2060, 0.2173).
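A minimal sketch of this computation (not the book's own code; the value J = 20,000 is an assumption, chosen to be consistent with the width of the CI quoted above) is:

J=20000; set.seed(1)
xv=rgamma(J,3,2)
rv=as.numeric( xv/(xv+1) > 0.3*exp(xv) )
phat=mean(rv); ci=phat+c(-1,1)*1.96*sqrt(phat*(1-phat)/J)
c(phat,ci)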
Note 1: We may also view p as p = P(y > 0.3), where y = [x/(x+1)] e^(−x)
(for example). In that case, we sample x₁, …, x_J ~ iid G(3, 2), calculate
(for example). In that case, we sample x1 ,..., x J ~ iid G (3, 2) , calculate
218
Chapter 5: Monte Carlo Basics
219
Bayesian Methods for Statistical Analysis
(a) Analytically derive p, the probability that the needle crosses a line.
(b) Now forget that you know p. Estimate p using Monte Carlo methods
on a computer and a sample size of 1,000. Also provide a 95% confidence
interval for p. Then repeat with a sample size of 10,000 and discuss.
It follows that
p = P(C) = P(X < sin Y)
= ∬_{x < sin y} f(x, y) dx dy = ∫_{y=0}^{π/2} ∫_{x=0}^{sin y} (2/π) dx dy = (2/π) ∫_0^{π/2} sin y dy
= (2/π) [ −cos y ]_0^{π/2} = −(2/π) ( cos(π/2) − cos 0 )
= −(2/π)(0 − 1) = 2/π = 0.63662.
220
Chapter 5: Monte Carlo Basics
221
Bayesian Methods for Statistical Analysis
Note 1: Another way to express the above working is to first note that
P(C | Y) = P(X < sin Y | Y) = sin Y. It follows that
p = P(C) = E P(C | Y) = E(sin Y) = ∫_0^{π/2} (sin y)(2/π) dy = 2/π,
as before.
Note 2: It can be shown that if the length of the needle is r times the
distance between lines, then the probability that the needle will cross a
line is given by the formula
p = 2r/π for r ≤ 1,  and  p = 1 − (2/π) [ √(r² − 1) − r + sin^(−1)(1/r) ] for r > 1.
(b) For this part, we will make use of the analysis in (a) whereby
C = { (x, y): x < sin y },
and where x ~ U(0, 1), y ~ U(0, π/2), and X ⊥ Y.
Note: We suppose that these facts are understood but that the integration
required to then proceed on from these facts to the final answer (as in
(a)) is too difficult.
We now sample x₁, …, x_J ~ iid U(0, 1) and y₁, …, y_J ~ iid U(0, π/2) (all
independently of one another). Next, we obtain the indicators defined by
rⱼ = I(xⱼ < sin yⱼ) = 1 if xⱼ < sin yⱼ, and 0 otherwise.
222
Chapter 5: Monte Carlo Basics
The MC estimate of p is p̂ = r̄ = (1/J) Σ_{j=1}^J rⱼ = r_T/J,
and a 95% CI for p is CI = p̂ ± z_{α/2} √[ p̂(1−p̂)/J ].
J
We see that increasing the MC sample size (from 1,000 to 10,000) has
reduced the width of the MC CI from 0.060 to 0.019. Both intervals
contain the true value, namely 2 / π = 0.6366.
# (a)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(seq(0,pi/2,0.01),sin(seq(0,pi/2,0.01)), type="l",lwd=3,xlab="y", ylab="x")
abline(v=c(0,pi/2),lty=3); abline(h=c(0,1),lty=3)
text(0.2,0.4,"x = sin(y)"); text(1,0.4,"C"); text(0.35,0.8,"Complement of C")
text(1.52,0.06,"pi/2")
# (b)
J=1000; set.seed(213); xv=runif(J,0,1); yv=runif(J,0,pi/2); rv=rep(0,J)
options(digits=4); for(j in 1:J) if(xv[j]<sin(yv[j])) rv[j]=1
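# The remaining lines of this listing are not shown on this page; they would
# compute the estimate and CI along the lines of (for example):
phat=mean(rv); ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J)
c(phat,ci)   # compare with 2/pi = 0.6366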
223
Bayesian Methods for Statistical Analysis
(b) Repeat (a) but with J = 200, 500, 1,000, 10,000 and 100,000,
respectively. Report the widths of the resulting CIs and, for each CI, state
whether it contains µ . Discuss any patterns that you see.
(c) Repeat (a) M = 100 times and report the proportion of the resulting M
95% MC CIs which contain the true value of the mean. (In each case use
J = 100.) Hence calculate a 95% CI for p, the true coverage probability of
the 95% MC CI for µ based on a MC sample of size J = 100 from the
Gamma(3,2) distribution.
(d) Repeat (c), but with M = 200, 500, 1,000 and 10,000, respectively.
Discuss any patterns that you see.
224
Chapter 5: Monte Carlo Basics
A 95% CI for p is 0.93 ± 1.96 √[ 0.93(1 − 0.93)/100 ] = (0.880, 0.980).
This is consistent with the MC 95% CI for µ having coverage 95%.
(d) Repeating (a) M = 200 times leads to p̂ = 94.5% of the 200 CIs
containing 1.5, with a 95% CI for p of
0.945 ± 1.96 √[ 0.945(1 − 0.945)/200 ] = (0.913, 0.977).
The widths of all five CIs for p are: 0.100, 0.063, 0.041, 0.027 and 0.009.
We see that the CI for p becomes narrower as M increases. Also, the
proportion of CIs containing 1.5 converges towards 95% as M increases.
The convergence does not seem to be uniform. This is because of Monte
Carlo error. If we repeated the experiment again, we might find a slightly
different pattern.
Each of the CIs for p is consistent with p = 0.95, except the one with
M = 10,000, which is the most reliable. In that case the CI for p is
225
Bayesian Methods for Statistical Analysis
(0.940, 0.949), which is entirely below 0.95. This suggests that the true
coverage probability of the 95% MC CI for µ is slightly less than 95%.
# (a)
options(digits=5); J = 100; set.seed(221); xv=rgamma(J,3,2)
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(xbar,ci) # 1.5170 1.3539 1.6800
# (b)
Jvec=c(100,200,500,1000,10000,100000); K = length(Jvec)
xbarvec=rep(NA,K); LBvec= rep(NA,K); UBvec= rep(NA,K);
set.seed(221);
for(k in 1:K){ J=Jvec[k]; xv=rgamma(J,3,2); xbar=mean(xv)
ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
xbarvec[k]=xbar; LBvec[k]=ci[1]; UBvec[k]=ci[2]
}
Wvec=UBvec-LBvec
print(rbind(Jvec, xbarvec, LBvec,UBvec, Wvec),digits=4)
# (c)
J=100; M=100; ct=0; set.seed(442); for(m in 1:M){
xv=rgamma(J,3,2)
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1 }
p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M) # CI for p uses M repetitions (here M = J = 100)
c(ct,p,ci) # 93.00000 0.93000 0.87999 0.98001
226
Chapter 5: Monte Carlo Basics
# (d)
J=100; Mvec=c(200,500,1000,10000); set.seed(651)
for(M in Mvec){ ct=0
for(m in 1:M){
xv=rgamma(J,3,2); xbar=mean(xv)
ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1
}
p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M)
print(c(M,p,ci,ci[2]-ci[1]),digits=3) }
So we will next discuss some basic techniques that can be used to generate
the required Monte Carlo sample from a given distribution. More
advanced techniques will be treated later. We will first treat the discrete
case, which is the simplest, and then the continuous case. It will be
assumed throughout that we can at least sample easily from the standard
uniform distribution, i.e. that we can readily generate u ~ U (0,1) .
227
Bayesian Methods for Statistical Analysis
Note 1: We see that this procedure will work also in the case where K is
infinite. In that case a practical alternative is to redefine K as a value k
for which Fk is very close to 1 (e.g. 0.9999) and then approximate f ( x )
by zero for all x > xK .
Show that the above method works when applied to generating a value x
from the Bin(2,1/2) distribution, i.e. that it returns x = 0, 1 and 2 with
probabilities 1/4, 1/2 and 1/4, respectively.
228
Chapter 5: Monte Carlo Basics
Using R we calculate k(x) = x³ e^(−x) / (1 + x), x = 1, 3, 5, …, 41 (here k stands for
kernel), noting that the last two values of k(x) are tiny (9.455201e-14 and
1.454999e-14).
229
Bayesian Methods for Statistical Analysis
Note: We could also change fvec to kvec here, where kvec is a vector
with the values k (1), k (3),..., k (41) ; both possibilities will work since
sample() will automatically normalise the values in its parameter ‘prob’.
230
Chapter 5: Monte Carlo Basics
First derive the quantile function of X, denoted F_X^(−1)(p) (0 < p < 1).
(This can be done by setting F_X(x) to p and solving for x.)
231
Bayesian Methods for Statistical Analysis
F(x) = ∫_0^x e^(−t) dt = 1 − e^(−x), x > 0. The quantile function here is the solution
of 1 − e^(−x) = p, namely F^(−1)(p) = −log(1 − p).
This results in the required sample x1 ,..., x J ~ iid G (1,1) . Using this
sample, the MC estimate of µ = EX is 0.9967, and a 95% CI for µ is
(0.9322, 1.0613). We see that the CI contains the true value being
estimated (i.e. 1).
F(x) = ∫_0^x t e^(−t) dt = [ −t e^(−t) ]_0^x + ∫_0^x e^(−t) dt
= −x e^(−x) + [ −e^(−t) ]_0^x = −x e^(−x) − e^(−x) + 1 = 1 − (x + 1) e^(−x).
232
Chapter 5: Monte Carlo Basics
However, for any p we can obtain that root using the Newton-Raphson
algorithm by iterating
x_{j+1} = xⱼ − g(xⱼ)/g′(xⱼ),   where g′(x) = F′(x) − 0 = f(x) = x e^(−x)
= xⱼ − [ 1 − (xⱼ + 1) e^(−xⱼ) − p ] / ( xⱼ e^(−xⱼ) ).
options(digits=5)
# (a)
-log(1-0.371) # 0.463624
J=1000; set.seed(221); uv=runif(J,0,1)
xv=-log(1-uv) # Generate a random sample of size 1000 from the G(1,1) dsn
est=mean(xv); std=sd(xv); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)
c(est,ci) # 0.99673 0.93216 1.06130
233
Bayesian Methods for Statistical Analysis
# (b)
u=0.371; x=1; xv=x; for(j in 1:7) { x=x-(1-(x+1)*exp(-x)-u)/(x*exp(-x)); xv=c(xv,x) }
xv # 1.0000 1.2902 1.2939 1.2939 1.2939 1.2939 1.2939 1.2939
pgamma(x,2,1) # 0.371 Just checking that F(1.293860) = 0.371
pgamma(1.2939,2,1) # 0.37101
It can be shown that z1 , z2 ~ iid N (0,1) . If we only need one value from
the standard normal distribution then we may arbitrarily discard z2 and
return only z1 .
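The transformation being referred to is not reproduced in this excerpt; assuming it is the Box-Muller transform, a minimal sketch is:

u1=runif(1); u2=runif(1)                 # two independent U(0,1) draws
z1=sqrt(-2*log(u1))*cos(2*pi*u2)         # z1, z2 ~ iid N(0,1)
z2=sqrt(-2*log(u1))*sin(2*pi*u2)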
234
Chapter 5: Monte Carlo Basics
So U has pdf f(u) = F′(u) = (1/2) e^u for u < 0, and 0 − (1/2) e^(−u)(−1) = (1/2) e^(−u) for u ≥ 0.
That is, f(u) = (1/2) e^(−|u|), −∞ < u < ∞, which is the same as the pdf of X.
235
Bayesian Methods for Statistical Analysis
Suppose we want to sample x ~ f(x) where f(x) = x for 0 < x < 1 and
f(x) = 2 − x for 1 < x < 2.
Sample the two random variables r ~ Bern(0.5) and y ~ Beta(2, 1). Then
calculate x = ry + (1 − r)(2 − y). This way, there is a 50% chance that x
will equal y, whose pdf is f(y) = 2y, 0 < y < 1, and a 50% chance that x
will equal z = 2 − y, whose pdf is f(z) = 2(2 − z), 1 < z < 2.
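A quick empirical check of this composition method (a sketch, not taken from the book) is:

J=100000; set.seed(1)
r=rbinom(J,1,0.5); y=rbeta(J,2,1)
x=r*y+(1-r)*(2-y)
hist(x,prob=TRUE,breaks=50)   # should resemble the triangular density on (0,2)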
In such cases, one convenient and easy way to obtain a value from the
distribution of interest may be via rejection sampling (also known as the
rejection method or the acceptance-rejection method). This method works
as follows.
236
Chapter 5: Monte Carlo Basics
The idea here is that f(x) lies entirely beneath ch(x), except that it may touch ch(x) at one or more points. Then p(x), which is called the acceptance probability, appropriately lies between 0 and 1 (inclusive).
Figure 5.10 illustrates this setup. The rejection algorithm is as follows:
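In outline, a minimal R sketch (illustrative only; here f denotes the target density, h a proposal density which can be sampled via rh(), and c a constant such that f(x) ≤ ch(x) for all x):

REJ <- function(J, f, h, rh, c){
  # Generates J values from f by rejection sampling: propose from h and
  # accept with probability p(x) = f(x)/(c*h(x)).
  xv <- rep(NA, J)
  for(j in 1:J){
    repeat{
      x <- rh(1)                              # propose a candidate from h
      if(runif(1) <= f(x)/(c*h(x))) break     # accept with probability p(x)
    }
    xv[j] <- x
  }
  xv }
# e.g., for the two beta densities plotted in the code below:
# xv <- REJ(1000, f=function(x) dbeta(x,4,8), h=function(x) dbeta(x,2,2),
#           rh=function(n) rbeta(n,2,2), c=2.5)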
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(c(0,1), c(0,6),type="n",xlab="x",ylab="")
xv=seq(0.001,0.999,0.01); hxv=dbeta(xv,2,2); lines(xv,hxv,lty=2,lwd=3)
kfun=function(x){ dbeta(x,4,8) }
# We could specify any positive function here (*)
k0=integrate(f=kfun,lower=0,upper=1)$value
# This calculates the normalising constant
fxv=kfun(xv)/k0; # This ensures f(x) as defined at (*) is a proper density
lines(xv,fxv,lty=1,lwd=3)
c=max(fxv/hxv); c # 2.4472
lines(xv,c*hxv,lty=3,lwd=3)
legend(0,6,c("f(x)","h(x)","c*h(x)"),lty=c(1,2,3),lwd=c(3,3,3))
text(0.07,3,"c = 2.45")
Here: c = max_x { f(x)/g(x) } = (1/2)/(1/3) = 3/2, and p(x) = f(x)/{c g(x)} = 1/2 for x = 0, 2 and p(x) = 1 for x = 1.
We see that about 2/3 of all the proposed values will be accepted, and of
these about 25% will be 0, 50% will be 1, and 25% will be 2. About 1/3
of the candidates will be rejected, about half of these being 0 and half
being 2. The overall acceptance rate is 1/c = 1/(3/2) = 2/3, and the wastage
is 1 − 1 / c = 1/3. On average, c = 1.5 candidates will have to be proposed
until an acceptance. Thus, generation of 1,000 Bin(2,1/2) values (say) will
require about 1,500 candidates.
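As a numerical check, a minimal R sketch of this scheme (illustrative only):

set.seed(42)
fprob <- c(0.25, 0.5, 0.25)                # target: Bin(2,1/2) probabilities
J <- 10000; xv <- rep(NA, J); ncand <- 0
for(j in 1:J){
  repeat{
    ncand <- ncand + 1
    x <- sample(0:2, 1)                    # propose from DU(0,1,2), i.e. g(x) = 1/3
    p <- fprob[x + 1]/((3/2)*(1/3))        # acceptance probability: 1/2, 1, 1/2
    if(runif(1) <= p) break
  }
  xv[j] <- x
}
table(xv)/J       # approx. 0.25, 0.50, 0.25
J/ncand           # overall acceptance rate, approx. 2/3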
First, denote the Monte Carlo sample as θ_1, ..., θ_J ~ iid f(θ|x). Then the MC estimate of the posterior mean of θ, namely
θ̂ = E(θ|x) = ∫ θ f(θ|x) dθ,
is
θ̄ = (1/J) Σ_{j=1}^J θ_j (the MC sample mean),
and a 1 − α CI for θ̂ is
θ̄ ± z_{α/2} s_θ / √J,
where
s_θ^2 = {1/(J − 1)} Σ_{j=1}^J (θ_j − θ̄)^2.
Further, when the posterior density f (θ | x) does not have a closed form
expression (as is often the case), it can be estimated by smoothing a
probability histogram of θ1 ,..., θ J .
Once an estimate of the posterior density has been obtained, the mode of
that estimate defines the MC estimate of the posterior mode.
One may then apply any of the ideas above, just as before. For example, the posterior mean of ψ, namely
ψ̂ = E(ψ|x) = ∫ ψ f(ψ|x) dψ = ∫ g(θ) f(θ|x) dθ,
can be estimated by its MC estimate,
ψ̄ = (1/J) Σ_{j=1}^J ψ_j,
and a 1 − α CI for ψ̂ is
ψ̄ ± z_{α/2} s_ψ / √J,
where
s_ψ^2 = {1/(J − 1)} Σ_{j=1}^J (ψ_j − ψ̄)^2.
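To illustrate these formulas, here is a minimal R sketch (the Beta(8,4) posterior and the logit transformation g are chosen arbitrarily for illustration):

set.seed(1)
J <- 10000
theta <- rbeta(J, 8, 4)               # MC sample from the (illustrative) posterior
psi <- log(theta/(1 - theta))         # psi_j = g(theta_j)
psibar <- mean(psi); spsi <- sd(psi)
ci <- psibar + c(-1,1)*qnorm(0.975)*spsi/sqrt(J)   # 95% CI for the posterior mean of psi
c(psibar, ci)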
Suppose we observe the data vector y = ( y1 ,..., yn ) = (2.1, 3.2, 5.2, 1.7).
(c) Use MC methods to estimate the signal to noise ratio (SNR), defined as γ = μ/σ = μ√λ. Illustrate your inferences with a suitable graph.
So, again by the method of composition, but this time using the identity
f(μ, λ | y) = f(μ | y) f(λ | y, μ),
we make use of the sample already generated in (a) and sample
λ_j ~ Gamma( n/2, (n/2) s_{μ_j}^2 ), where s_μ^2 = (1/n) Σ_i (y_i − μ)^2,
for each j = 1, ..., J. The result is (μ_1, λ_1), ..., (μ_J, λ_J) ~ iid f(μ, λ | y), and thereby λ_1, ..., λ_J ~ iid f(λ | y) (after discarding all of the μ_j values).
λ̂ = 1/s^2 = 0.4071
95% CPDR = ( F^{-1}(0.025), F^{-1}(0.975) ) = (0.0293, 1.2684),
where F^{-1} denotes the quantile function of the G( (n−1)/2, {(n−1)/2}s^2 ) distribution.
We see that the true posterior mean is contained in the 95% MC CI for
that mean. Figure 5.12 illustrates these Monte Carlo and ‘exact’ inferences.
(c) Using the values sampled in (a) and (b), we now calculate γ_j = μ_j √λ_j for each j = 1, ..., J, and hence obtain a MC sample γ_1, ..., γ_J ~ iid f(γ | y), which can then be used to perform MC inference on γ. Implementing this strategy, we estimate γ's posterior mean by 1.800, with (1.745, 1.854) as a 95% CI for that mean, and we estimate γ's 95% CPDR as (0.228, 3.543).
Figure 5.13 illustrates these Monte Carlo estimates. Also shown are:
• the exact posterior mean of γ, which is γ̂ = E(γ | y) = 1.793
• the exact 95% CPDR for γ, which is (0.0733, 3.5952)
• the exact posterior density of γ
• the MLE of γ, which is ȳ/s = 3.05/1.567 = 1.946.
See the Note and R Code below for details of these calculations.
This follows from the uninformative normal-normal model, i.e. from the fact that
(μ | y, λ) ~ N( ȳ, 1/(nλ) ),
so that (γ | y, λ) = (μ√λ | y, λ) ~ N( ȳ√λ, 1/n ), and hence
f(γ | y) = E{ f(γ | y, λ) | y } = ∫_0^∞ f(γ | y, λ) f(λ | y) dλ,
where:
f(γ | y, λ) = f_{N(ȳ√λ, 1/n)}(γ) = √(n/(2π)) exp{ −(n/2)(γ − ȳ√λ)^2 }
f(λ | y) = f_{G((n−1)/2, {(n−1)/2}s^2)}(λ) = [ {(n−1)s^2/2}^{(n−1)/2} λ^{(n−1)/2 − 1} e^{−{(n−1)/2}s^2 λ} ] / Γ((n−1)/2), λ > 0.
The exact 95% CPDR for γ may be obtained by using the optim() function to minimise
g(L, U) = { F(U | y) − F(L | y) − 0.95 }^2 + { f(U | y) − f(L | y) }^2
= { ∫_L^U ∫_0^∞ f(γ | y, λ) f(λ | y) dλ dγ − 0.95 }^2
+ { ∫_0^∞ f(U | y, λ) f(λ | y) dλ − ∫_0^∞ f(L | y, λ) f(λ | y) dλ }^2,
with the result being (L, U) = (0.0733, 3.5952).
# (a)
y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s # 1.567
J=1000; set.seed(144); options(digits=4)
wv=rt(J,n-1); muv=ybar+s*wv/sqrt(n)
mubar=mean(muv); muci=mubar + c(-1,1)*qnorm(0.975)*sd(muv)/sqrt(J)
mucpdr=quantile(muv,c(0.025,0.975))
c(mubar,muci,mucpdr) # 3.0770 3.0012 3.1528 0.6848 5.5069
muhat=ybar; mucpdrtrue= ybar+(s/sqrt(n))*qt(c(0.025,0.975),n-1)
c(muhat,mucpdrtrue) # 3.050 0.556 5.544
X11(w=8,h=5); par(mfrow=c(1,1))
hist(muv,prob=T,xlab="mu",xlim=c(-2,7.5), ylim=c(0,0.5),main="",
breaks=seq(-20,20,0.25))
muvec=seq(-20,20,0.01);
postvec=dt( (muvec-ybar)/(s/sqrt(n)) , n-1 ) / (s/sqrt(n))
lines(muvec,postvec, lty=1,lwd=3)
lines(density(muv),lty=2,lwd=3)
abline(v=c(mubar,muci,mucpdr),lty=2,lwd=3)
abline(v=c(ybar, mucpdrtrue) , lty=1,lwd=3)
legend(-2,0.5,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
# (b)
lamv=rep(NA,J); set.seed(332)
for(j in 1:J) lamv[j] = rgamma(1,n/2,(n/2)*mean((y-muv[j])^2))
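# Sketch (assumed): summary quantities used in the abline() calls below,
# computed in the same way as for mu in part (a)
lambar=mean(lamv); lamci=lambar+c(-1,1)*qnorm(0.975)*sd(lamv)/sqrt(J)
lamcpdr=quantile(lamv,c(0.025,0.975))
lamcpdrtrue=qgamma(c(0.025,0.975),(n-1)/2,((n-1)/2)*s^2)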
hist(lamv,prob=T,xlab="lam",xlim=c(0,2.5), ylim=c(0,2),main="",
breaks=seq(0,3,0.05))
lamvec=seq(0,3,0.01) ; lampostvec= dgamma(lamvec,(n-1)/2,((n-1)/2)*s^2)
lines(lamvec, lampostvec, lty=1,lwd=3)
lines(density(lamv),lty=2,lwd=3)
abline(v=c(lambar, lamci, lamcpdr),lty=2,lwd=3)
abline(v=c(1/s^2, lamcpdrtrue), lty=1,lwd=3)
legend(1.5,2,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
# (c)
gamv=muv*sqrt(lamv)
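# Sketch (assumed): summary quantities used in the plotting code below
gambar=mean(gamv); gamci=gambar+c(-1,1)*qnorm(0.975)*sd(gamv)/sqrt(J)
gamcpdr=quantile(gamv,c(0.025,0.975)); mle=ybar/s
c(gambar,gamci,gamcpdr)  # cf. 1.800, (1.745, 1.854) and (0.228, 3.543) in the text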
gamhat=(ybar/s)*gamma(0.5+(n-1)/2)/(sqrt((n-1)/2)*gamma((n-1)/2))
print(c(ybar,s,gamhat),digits=8) # 3.0500000 1.5673757 1.7928178
intfun=function(lam,gam, ybar=3.05,s=1.5673757,n=4){
dnorm(gam,ybar*sqrt(lam),1/sqrt(n))*dgamma(lam,(n-1)/2,s^2*(n-1)/2) }
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1 with absolute error < 4.7e-07 OK (Just checking)
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) gam*intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1.793 with absolute error < 4.7e-06 OK (Agrees with exact calculation)
gamvec=seq(-5,10,0.01); fgamvec=gamvec
for(i in 1:length(gamvec)){
fgamvec[i]=integrate( f=intfun, lower=0, upper=Inf,
gam=gamvec[i])$value }
plot(gamvec,fgamvec) # OK
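# Sketch (assumed): a function of the following form, consistent with the
# criterion g(L,U) described above, is needed for the calls below; the exact
# original definition is a guess.
gfun=function(v){ L=v[1]; U=v[2]
  Fdiff=integrate(function(gam) {
    sapply(gam, function(gam) {
      integrate(function(lam) {
        sapply(lam, function(lam) intfun(lam,gam) )
      }, 0, Inf)$value }) }, L, U)$value
  fL=integrate( f=intfun, lower=0, upper=Inf, gam=L)$value
  fU=integrate( f=intfun, lower=0, upper=Inf, gam=U)$value
  (Fdiff-0.95)^2 + (fU-fL)^2 }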
gfun(v=c(-0.1,4.2)) # 0.001473 OK
gfun(v=c(1,3)) # 0.08562 OK
res0=optim(par=c(0,4),fn=gfun)$par
res0 # 0.07334 3.59516
res1=optim(par=res0,fn=gfun)$par
res1 # 0.07332 3.59518
res2=optim(par=res1,fn=gfun)$par
res2 # 0.07332 3.59518 OK
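# Sketch (assumed): the CPDR endpoints used in the checks below
L=res2[1]; U=res2[2]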
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, L,U)
# 0.95 with absolute error < 3.2e-07
integrate( f=intfun, lower=0, upper=Inf, gam=L)$value # 0.06598
integrate( f=intfun, lower=0, upper=Inf, gam=U)$value # 0.06598 All OK
hist(gamv,prob=T,xlab="gam",xlim=c(-1,6), ylim=c(0,0.6),main="",
breaks=seq(-2,7,0.1))
lines(density(gamv),lty=2,lwd=3)
abline(v=c(gambar, gamci, gamcpdr),lty=2,lwd=3)
points(mle,0,pch=4,lwd=3,cex=2)
lines(gamvec,fgamvec,lty=1,lwd=3)
abline(v=c(gamhat,L,U),lty=1,lwd=3)
legend(3,0.6,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
text(5,0.4,"The cross shows the MLE")
options(digits=5)
n=50; y=32; alp=1;bet=1; a=alp+y; b=bet+n-y; m=10; J=10000
set.seed(443); tv=rbeta(J,a,b); xv=rbinom(J,m,tv)
phat=length(xv[xv>=6])/J;
ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J)
c(phat,ci) # 0.70840 0.69949 0.71731
xvec=0:m; fxgiveny=
choose(m,xvec)*beta(y+xvec+alp,n-y+m-xvec+bet)/beta(y+alp,n-y+bet)
sum(fxgiveny) # 1 Just checking
sum(fxgiveny[xvec>=6]) # 0.70296
deviation of θ1 ,..., θ J .
Note: The first of the three choices for E_j is typically the easiest to calculate but also leads to the least improvement over the ordinary ‘histogram’ predictor, x̄ = (1/J) Σ_{j=1}^J x_j.
Suppose that we observe the vector y = ( y1 ,..., yn ) = (2.1, 3.2, 5.2, 1.7).
In each case, report the associated 95% CI for that mean. Compare your
results with the true value of that mean. Produce a probability histogram
of the simulated λ -values. Overlay a smooth of this histogram and the
Rao-Blackwell estimate of λ ’s marginal posterior density. Also overlay
the exact density.
So we first sample
λ ~ Gamma( (n − 1)/2, {(n − 1)/2} s^2 ),
and then we sample
μ ~ N( ȳ, 1/(nλ) ).
The result is (μ, λ) ~ f(μ, λ | y).
Next let e_j = E(λ | y, μ_j).
It will be observed that this second CI is narrower than the first (having width 0.0053 compared with 0.0133). It will also be observed that both CIs contain the true value, λ̂ = 1/s^2 = 0.4071.
options(digits=4)
# (a)
y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s2=s^2
J=100; set.seed(254); lamv=rgamma(J,(n-1)/2,s2*(n-1)/2);
muv=rnorm(J,ybar,1/sqrt(n*lamv)); est0=1/s^2
est1=mean(lamv); std1=sd(lamv); ci1=est1 + c(-1,1)*qnorm(0.975)*std1/sqrt(J)
ev=rep(NA,J); for(j in 1:J){ muval=muv[j]; ev[j]=1/mean((y-muval)^2) }
est2=mean(ev); std2=sd(ev); ci2=est2 + c(-1,1)*qnorm(0.975)*std2/sqrt(J)
rbind( c(est0,NA,NA,NA), c(est1,ci1,ci1[2]-ci1[1]), c(est2,ci2,ci2[2]-ci2[1]) )
# [1,] 0.4071 NA NA NA
# [2,] 0.4396 0.3767 0.5026 0.12589
# [3,] 0.4150 0.3892 0.4408 0.05166
# (b)
X11(w=8,h=5); par(mfrow=c(1,1))
hist(lamv,xlab="lambda",ylab="density",prob=T,xlim=c(0,2.5),
ylim=c(0,2.5),main="",breaks=seq(0,4,0.05))
lines(density(lamv),lty=1,lwd=3)
lamvec=seq(0,3,0.01); RBvec=lamvec; smu2v=1/ev
for(k in 1:length(lamvec)){ lamval=lamvec[k]
RBvec[k]=mean(dgamma(lamval,n/2,(n/2)*smu2v)) }
lines(lamvec,RBvec,lty=1,lwd=1)
lines(seq(0,3,0.005),dgamma(seq(0,3,0.005),(n-1)/2,s2*(n-1)/2), lty=3,lwd=3)
legend(1.2,2,c("Histogram estimate of posterior","Rao-Blackwell estimate",
"True marginal posterior"), lty=c(1,1,3),lwd=c(3,1,3))
2. Generate x_j ~ f(y | θ_j) independently for each j = 1, …, J (so that x_1, ..., x_J ~ iid f(x | y)).
4. Estimate p by p̂ = (1/J) Σ_{j=1}^J I_j, with associated 1 − α CI
p̂ ± z_{α/2} √{ p̂(1 − p̂)/J }.
The bent coin is tossed 10 times. Heads come up on the first seven tosses
and tails come up on the last three tosses.
The observed number of runs (of heads or tails in a row) is 2, which seems
rather small.
Let yi be the indicator for heads on the ith toss, (i = 1,…,n) (n = 10), and
let θ be the unknown probability of heads coming up on any single toss.
Also let xi be the indicator for heads coming up on the ith of the next n
tosses of the same coin, tossed independently each time.
Further, let y = ( y1 ,..., yn ) and x = ( x1 ,..., xn ) , and choose the test statistic
as
T ( y,θ ) = R( y ) ,
defined as the number of runs in the vector y.
1. Sample x_{1j}, ..., x_{nj} ~ iid Bern(θ_j) and form the vector x_j = (x_{1j}, ..., x_{nj}).
3. Obtain I_j = I(R_j ≤ R), where R = R(y) = 2.
Thereby we estimate p by
p̂ = (1/J) Σ_{j=1}^J I_j = 0.0995,
with 95% CI
p̂ ± 1.96 √{ p̂(1 − p̂)/J } = (0.0936, 0.1054).
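A minimal R sketch of this posterior predictive check (illustrative; it reproduces the above values only approximately, since the random seed differs):

R <- function(v){ m <- length(v); sum(abs(v[-1]-v[-m]))+1 }  # number of runs in v
set.seed(123)
n <- 10; J <- 10000; Iv <- rep(NA, J)
thv <- rbeta(J, 8, 4)                     # theta_j ~ (theta | y) ~ Beta(8,4)
for(j in 1:J){
  xj <- rbinom(n, 1, thv[j])              # x_j ~ f(x | theta_j)
  Iv[j] <- (R(xj) <= 2)                   # I_j = I(R_j <= R(y)), with R(y) = 2
}
phat <- mean(Iv)
c(phat, phat + c(-1,1)*1.96*sqrt(phat*(1-phat)/J))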
Note 1: Using a suitable formula from runs theory, the exact value of p could be obtained as
p = ∫_0^1 P( R(x) ≤ 2 | θ ) f_{Beta(8,4)}(θ) dθ
= ∫_0^1 Σ_{x_T = 0}^n P( R(x) ≤ 2 | θ, x_T ) f(x_T | θ) f_{Beta(8,4)}(θ) dθ,
where:
• f(x_T | θ) = (n choose x_T) θ^{x_T} (1 − θ)^{n − x_T} is the binomial density with parameters n and θ, evaluated at x_T.
For these data, R(y) = 2 again, but with n = 20 and y_T = 14. In this case,
(θ | y) ~ Beta( y_T + 1, n − y_T + 1 ) ~ Beta(15, 7),
and we obtain the estimate p̂ = 0.0088 with 95% CI (0.0070, 0.0106).
R=function(v){m=length(v); sum(abs(v[-1]-v[-m]))+1}
# Calculates the runs in vector v
R(c(1,1,1,0,1)) # 3 testing …
R(c(1,1)) # 1
R(c(1,0,1,0,1)) # 5
R(c(0,0,1,1,1)) # 2
R(c(1,0,0,1,1,0,0,1,1,1,1,0)) # 6 …. all OK
CHAPTER 6
MCMC Methods Part 1
6.1 Introduction
Monte Carlo methods were introduced in the last chapter. These included
basic techniques for generating a random sample and methods for using
such a sample to estimate quantities such as difficult integrals. This
chapter will focus on advanced techniques for generating a random
sample, in particular the class of techniques known as Markov chain
Monte Carlo (MCMC) methods. Applying an MCMC method involves designing a suitable Markov chain, generating a large sample from that chain, allowing a burn-in period until stochastic convergence is (approximately) reached, and making appropriate use of the values following that burn-in period.
For now, we will assume the driver to be symmetric, in the sense that
g(t | x) = g(x | t),
or more precisely,
g(t = a | b) = g(t = b | a) for all a, b ∈ ℝ.
Note: The driver distribution may also be non-symmetric, but this case
will be discussed later.
(b) Calculate the acceptance probability as p = f(x = x_j′) / f(x = x_{j−1}).
A problem with this second method of generating the sample values is that
they will be autocorrelated to some extent, i.e. not a truly random (iid)
sample from the distribution f ( x ) . We will later discuss this issue and
how to deal with the problems that may arise from it. For the moment, we
stress that x1 ,..., x J will be approximately a random sample from f ( x ) .
Moreover, if J is sufficiently large, then these values will be effectively
independent. This means that a probability histogram of these values will
in fact converge to f ( x ) as J tends to infinity.
Note: This is just the Beta(6,1) density and could be sampled from easily
in many other ways.
The jth iteration of the algorithm involves first sampling a candidate value (or proposed value) from the driver distribution centred at the last value, namely
x_j′ ~ U( x_{j−1} − c, x_{j−1} + c ),
and then accepting this candidate value with probability
p = f(x = x_j′)/f(x = x_{j−1}) = 6(x_j′)^5 / {6(x_{j−1})^5} = (x_j′ / x_{j−1})^5,   (6.1)
where p is taken to be:
• 0 in the case where x_j′ < 0 or x_j′ > 1
• 1 in the case where x_{j−1} < x_j′ < 1.
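A minimal R sketch of this algorithm (illustrative only; not the code used for the figures below):

MET61 <- function(K, x, c){
  # Metropolis algorithm for the target f(x) = 6*x^5, 0 < x < 1 (the Beta(6,1)
  # density), using a U(x-c, x+c) driver. Returns the chain and the acceptance rate.
  xv <- x; ct <- 0
  for(j in 1:K){
    xp <- runif(1, x - c, x + c)                    # candidate value
    p <- 0
    if(xp > 0 && xp < 1) p <- min(1, (xp/x)^5)      # acceptance probability (6.1)
    if(runif(1) <= p){ x <- xp; ct <- ct + 1 }
    xv <- c(xv, x)
  }
  list(xv = xv, ar = ct/K) }
# e.g.: res <- MET61(K = 500, x = 0.1, c = 0.15); res$ar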
Note: There were four rejections until the first acceptance, at iteration 5, where x_5 = x_5′ = 0.1861, as underlined above.
The acceptance rate (AR) for this Markov chain is found to be 64%, meaning that 320 of the 500 candidate values x_j′ were accepted and 36% (or 180) were rejected.
In this case the acceptance rate is only 20.8% and the histogram is a poorer estimate of the true density (to which it would however converge as J → ∞). We say that the algorithm is now displaying poor mixing, compared with the results in the first run of 500 iterations, where c = 0.15.
What happens if we make c = 0.15 smaller? Figures 6.5 and 6.6 are a
repeat of Figures 6.1 and 6.2, respectively, but using simulated values
from a run of the Metropolis algorithm with c = 0.05.
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),
xlab="x",ylab="density",main="")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.64
print(res$vec[1+c(0,1:10,301:310,491:500)], digits=4)
# [1] 0.1000 0.1000 0.1000 0.1000 0.1000 0.1861 0.2650 0.2650 0.4065 0.4388
# [11] 0.4388 0.9261 0.9987 0.9987 0.9987 0.9987 0.9725 0.8889 0.8889 0.9672
# [21] 0.9315 0.8058 0.6811 0.6073 0.4587 0.4353 0.3462 0.3462 0.4177 0.4177
# [31] 0.4656
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
ylab="density", main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.208
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
ylab="density", main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.83
Check your result by comparing the sample mean and sample standard
deviation of your sample to the true theoretical values, 0 and 1.
Calculate a Monte Carlo 95% confidence interval for the normal mean, 0.
Since f(x) ∝ exp(−x^2/2), the acceptance probability at iteration j is given by
p = f(x = x_j′)/f(x = x_{j−1}) = exp{−(x_j′)^2/2} / exp{−(x_{j−1})^2/2} = exp{ ( (x_{j−1})^2 − (x_j′)^2 ) / 2 }.
Figure 6.8 shows a histogram of the last J = 10,000 values, together with
the standard normal density overlaid.
The average of the J sampled values is 0.0355 (close to 0) and their sample
standard deviation is 1.0047 (close to 1). These values lead to a 95% CI for the normal mean equal to (0.0158, 0.0552). We note that, contrary to what one might expect, this CI does not contain the true value, 0. The underlying issue behind this fact will be discussed generally in the next section.
MET <- function(K, x, c){
# This function implements a simple Metropolis algorithm for sampling from the
# standard normal density, using a U(x-c, x+c) driver.
# Inputs: K = total number of iterations, x = starting value, c = driver halfwidth.
# Outputs: $vec = vector of (K+1) x-values, $ar = acceptance rate.
vec = x; ct = 0
for(j in 1:K){ prop = runif(1,x-c,x+c)
p = exp(-0.5*(prop^2-x^2)); u = runif(1)
if(u <= p){ x = prop; ct = ct + 1 }
vec <- c(vec,x) }
ar = ct/K; list(vec=vec,ar=ar) }
B=500; J = 10000; K = B + J
set.seed(117); res <- MET(K=K,x=5,c=2.5); res$ar # 0.548381
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main=" ")
hist(res$vec[-(1:(B+1))],prob=T,xlim=c(-4,4),ylim=c(0,0.5),xlab="x",
ylab="density",nclass=50, main=" ")
lines(seq(-4,4,0.01),dnorm(seq(-4,4,0.01)),lwd=2)
est=mean(res$vec[-(1:(B+1))]); std=sd(res$vec[-(1:(B+1))])
ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(10000)
c(est,std,ci) # 0.03550254 1.00470749 0.01581064 0.05519445
The batch means CI will be different from the ordinary CI, namely (x̄ ± 1.96 s_x/√J), where x̄ and s_x are the sample mean and sample standard deviation of x_1, ..., x_J. The batch means CI is obtained as follows.
First, break up the J sample values into m batches of size n each (so that J = mn):
………………………………………………………….....
Next: Let y_k be the mean of the n x_j-values in the kth batch (k = 1,...,m).
Let s_y^2 be the sample variance of y_1, ..., y_m.
Note: Thus s_y^2 = {1/(m − 1)} Σ_{k=1}^m (y_k − ȳ)^2, where ȳ = (1/m) Σ_{k=1}^m y_k = x̄ is the mean of the batch means and identical to the mean of all J x_j-values.
Discussion
The rationale for the batch means method is as follows. If the batch size n is sufficiently large then, by the central limit theorem,
y_1, ..., y_m ~ iid N(μ, σ^2/n) (approximately),
where μ = E(x_j) and σ^2 = Var(x_j).
Consequently,
ȳ ~ N( μ, (σ^2/n)/m ) ~ N( μ, σ^2/J ),
since J = mn.
Therefore a 1 − α CI for μ is
( ȳ ± z_{α/2} r/√J ),
where r is an estimate of σ. Now, since the batch means y_1, ..., y_m behave approximately as an iid sample with variance σ^2/n, their sample variance s_y^2 has expectation close to σ^2/n. So an unbiased estimator of σ^2 is n s_y^2.
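For example, a minimal R sketch of a batch means CI (illustrative; the function and argument names are not from the text):

batchCI <- function(xv, m, alpha = 0.05){
  # xv = MCMC output of length J = m*n (J assumed to be a multiple of m)
  J <- length(xv); n <- J/m
  yv <- sapply(1:m, function(k) mean(xv[((k-1)*n + 1):(k*n)]))  # batch means
  se <- sqrt(n*var(yv))/sqrt(J)     # estimate of sigma/sqrt(J), since E(n*s_y^2) = sigma^2
  mean(xv) + c(-1, 1)*qnorm(1 - alpha/2)*se }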
Then use this sample to estimate EX, together with a 95% confidence interval for EX. For this CI use the formula (x̄ ± 1.96 s/√J), where s^2 is the sample variance of the J sampled X-values. Also draw a histogram of the J X-values overlaid with the exact pdf of X.
(b) Use the output from the Metropolis algorithm in (a) to construct another 95% CI for EX, this time using the batch means method, as follows:
Let s_y^2 be the sample variance of the m batch means y_1, ..., y_m.
(c) Conduct a Monte Carlo experiment to assess the quality of the two CIs
for EX in (a) and (b).
Now divide the total count from (ii) by R to get an unbiased point estimate
of the probability that the ordinary CI for EX in (a) contains EX.
Similarly, divide the total count from (iii) by R to get an unbiased point estimate of the probability that the batch means CI for EX in (b) contains EX.
Also produce 95% CIs for the two probabilities just mentioned.
(d) Repeat the experiment in (c) but with the following in place of (i):
(a) Let us specify a uniform driver centred at the last value and with half-
width h. We now iterate as follows after choosing a suitable starting value
of x:
Sample x′ ~ U(x − h, x + h).
(b) Applying the batch means method with m = 20 and n = 50, we estimate
EX as 1.539 again, but with 95% CI (1.467, 1.611). Note that this CI is
wider than the CI in (a) and does contain the true value, 1.5.
We also estimate p2 , the true probability content of the batch means 95%
CI in (b) (with m = 20 and n = 50), as 90.0%, with 95% CI 84.1% to
95.9%.
We see that in this example the batch means method has performed far
better than the ordinary method for constructing 95% CIs for EX from the
output of a Metropolis algorithm.
We see that the two CIs have performed about equally well when
calculated using a truly random sample from X’s distribution. In such
situations, the batch means CI is in fact slightly inferior and the ordinary
CI should be used.
# (a)
MET <- function(Jp,x,h){
# This function implements a simple Metropolis algorithm.
# Inputs: Jp = total number of iterations
# x = starting value of x
# h = halfwidth of uniform driver.
# Outputs: $xv = vector of x-values of length (Jp + 1)
# $ar = acceptance rate.
xv <- x; ct <- 0
for(j in 1:Jp){ xprop <- runif(1,x-h,x+h)
if( (xprop>0) && (xprop<2) ){
p <- xprop^2 / x^2; u <- runif(1)
if(u < p){ x <- xprop; ct <- ct + 1 } }
xv <- c(xv,x) }
list(xv=xv,ar=ct/Jp) }
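# Sketch (assumed): a run of the algorithm with burn-in 100 and J = 1000 further
# iterations, starting at x = 1 with driver halfwidth h = 0.7 (the seed is arbitrary)
Jp <- 1100; set.seed(101); res <- MET(Jp=Jp, x=1, h=0.7); res$ar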
plot(0:Jp,res$xv,type="l",xlab="j",ylab="x_j")
xv <- res$xv[-c(1:101)]; J= length(xv)
hist(xv,xlab="x",prob=T,ylim=c(0,2),nclass=20,ylab="density", main="")
xvec <- seq(0,2,0.1); fvec <- (3/8)*xvec^2; lines(xvec,fvec)
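# Sketch (assumed): the point estimate and ordinary 95% CI for EX described in part (a)
EXhat <- mean(xv); sdhat1 <- sd(xv)
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat1/sqrt(J)
c(EXhat, EXci)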
# (b)
m <- 20; n <- 50; yv <- rep(NA,m)
for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ]
yv[k] <- mean(xvsub) }
sdhat2 <- sqrt(n*var(yv)); sdhat2 # 1.15783
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J)
c(EXhat,EXci) # 1.538984 1.467222 1.610746
# (c)
R<- 100; m <- 20; n <- 50; J <- 1000; burn <- 100; EX <- 1.5; ct1 <- 0; ct2 <- 0;
yv <- rep(NA,m); set.seed(214)
for(r in 1:R){
xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)]
# xv <- rbeta(J,3,1)*2 # for use in (d) (see below)
for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ]
yv[k] <- mean(xvsub) }
EXhat <- mean(xv); sdhat1 <- sqrt(var(xv)); sdhat2 <- sqrt(n*var(yv))
ci1 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat1/sqrt(J)
ci2 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J)
if( (EX >= ci1[1]) && (EX <= ci1[2])) ct1 <- ct1 + 1
if( (EX >= ci2[1]) && (EX <= ci2[2])) ct2 <- ct2 + 1 }
date() # took 2 secs
# (d)
# Repeat code in (c) but with the line
# "xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)]"
# replaced by the line "xv <- rbeta(J,3,1)*2".
(b) approximately, using a Monte Carlo method that does not involve
Markov chains
since P(y > 0 | μ) = 1 − P( z < (0 − μ)/1 ) = 1 − Φ(−μ).
Thus f(μ | y) ∝ {1 − Φ(−μ)}^{-n} exp{ −(1/2) Σ_{i=1}^n (y_i − μ)^2 }
= {1 − Φ(−μ)}^{-n} exp{ −(1/2)[ (n − 1)s^2 + n(ȳ − μ)^2 ] }
∝ {1 − Φ(−μ)}^{-n} exp{ −(n/2)(μ − ȳ)^2 }
≡ k(μ), μ > 0 (this is the kernel of the posterior density).
Thus μ̂ = E(μ | y) = { ∫_0^∞ μ k(μ) dμ } / { ∫_0^∞ k(μ) dμ } = I_1/I_0,
where I_q = ∫_0^∞ μ^q k(μ) dμ, q = 0, 1.
(b) Observe that
μ̂ = { ∫_0^∞ μ {1 − Φ(−μ)}^{-n} h(μ) dμ } / { ∫_0^∞ {1 − Φ(−μ)}^{-n} h(μ) dμ },
where
h(μ) = √(n/(2π)) exp{ −(n/2)(μ − ȳ)^2 } / { 1 − Φ( (0 − ȳ)/(1/√n) ) }, μ > 0,
i.e. h is the N(ȳ, 1/n) density truncated to the positive half-line.
Thus μ̂ = E_1/E_0, where:
E_q = E{ μ^q (1 − Φ(−μ))^{-n} }, q = 0, 1
μ ~ h(μ) ~ N(ȳ, 1/n)I(μ > 0).
So μ̂ may be estimated by Ê_1/Ê_0,
where:
Ê_q = (1/J) Σ_{j=1}^J μ_j^q (1 − Φ(−μ_j))^{-n}
μ_1, ..., μ_J ~ iid h(μ).
(c) Using the Metropolis algorithm and a normal driver distribution with
standard deviation 0.5, we obtain a Markov chain of size 10,000 following
a burn-in of size 100. The acceptance rate is found to be 59%.
Then taking every 10th value results in a very nearly uncorrelated sample of size 1,000 from the posterior distribution of μ. Using these 1,000 values, we estimate μ̂ as 0.5297, with associated 95% CI equal to (0.5047, 0.5547).
We note that the true exact value calculated in (a), 0.5379, is contained in
this CI.
# (b)
J=110000; set.seed(551); samp=rnorm(J,ybar,1/sqrt(n))
samppos=samp[samp>0]; length(samppos) # 102763
samppos=samppos[1:100000]
numer=mean(samppos*(1-pnorm(-samppos))^(-n) )
denom=mean( (1-pnorm(-samppos))^(-n) )
c(numer,denom,numer/denom) # 1.9900593 3.7059926 0.5369842
# (c)
MET <- function(K,mu,del,y){
# This function implements a simple Metropolis algorithm.
# Inputs: K = total number of iterations
# mu = starting value of mu
# del = standard deviation of normal driver
# y = data vector
# Outputs: $muv = vector of mu-values of length (K + 1)
# $ar = acceptance rate
muv = mu; ct = 0; n=length(y); ybar=mean(y)
kfun=function(mu,ybar,n){ exp(-0.5*n*(mu-ybar)^2) / (1-pnorm(-mu))^n }
for(j in 1:K){ muprop = rnorm(1,mu,del)
if( muprop>0 ){
p=kfun(mu=muprop,ybar=ybar,n=n)/kfun(mu=mu,ybar=ybar,n=n)
u=runif(1); if(u < p){ mu = muprop; ct = ct + 1 } }
muv = c(muv,mu) }
list(muv=muv,ar=ct/K) }
plot(0:K,res$muv,type="l")
vec1=res$muv[-(1:101)]
print(acf(vec1)$acf[1:10],digits=2) # Evidence of strong autocorrelation
# 1.00 0.78 0.61 0.48 0.39 0.30 0.24 0.19 0.14 0.11
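# Sketch (assumed): thin the chain by taking every 10th value, as described in the text
v = vec1[seq(10, length(vec1), by=10)]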
J=length(v); J # 1000
est=mean(v); std=sd(v); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)
c(est,std,ci) # 0.5296887 0.4039238 0.5046537 0.5547237
One relevant fact here is that in R on most computers (at present), 5e-324 (meaning 5 × 10^{-324}) is the smallest representable non-zero number. This problem can often be resolved by calculating p as
p = exp(q)
after first computing
q = log f(x_j′) − log f(x_{j−1}),
but even this formulation may not be sufficient in every situation.
It may sometimes also be necessary to replace the calculation of a function,
say h ( r ) , by
h(max( r,5e − 324))
if that function requires a non-zero argument r which is likely to be
reported by R as 0 (because the exact value of r is likely to be between 0
and 5e − 324 ).
Further, and by the same token, if
0 < h(max( r,5e − 324)) < 5e − 324
then R will report a value of 0. In that case, if a non-zero value of
h is absolutely required (for some subsequent calculation) then the
code for h ( r ) should be replaced by code which returns
max( h(max( r,5e − 324)),5e − 324) .
The rationale for this choice of driver is that the proposed value is certainly positive, and it has:
mean x_{j−1}δ / δ = x_{j−1}
variance x_{j−1}δ / δ^2 = x_{j−1}/δ.
Also, its variance around that last value is proportional to it (by a factor of 1/δ). This ensures that values near zero are appropriately ‘explored’
by the Markov chain.
Even with this use of the logarithmic function, computational issues arose
in R on account of limitations with the functions rgamma() and lgamma().
These limitations are acknowledged in the help files for these functions
in R.
To give an example:
set.seed(321)
v = rgamma(10000,0.001,0.001)
# Large sample from the G(0.001,0.001) distribution.
mean(v) # 0.5827886
# This is clearly wrong since the mean is 0.001/0.001 = 1.
length(v[v==0]) # 4777
# Almost HALF of the values are EXACTLY zero.
The R code was appropriately modified so that whenever very small but
non-zero values were reported as zero by R (and problems ensued or
potentially ensued because of this) those values were changed in the code
to 5e-324 (the smallest representable non-zero number in R).
With the above specification and fixes, the Metropolis algorithm was run
for 10,000 iterations following a burn-in of size 100 and starting at 1. The
value of δ used was 1.3 and this resulted in an acceptance rate of 53% as
well as good mixing. Figure 6.11 shows the resulting trace of all 10,101
values of x, and Figure 6.12 shows the required probability histogram of
the last 10,000 values, together with the exact density f ( x ) overlaid.
set.seed(321); v = rgamma(10000,0.001,0.001)
# Large sample from the G(0.001,0.001) distribution.
mean(v) # 0.5827886 This is clearly wrong since the mean is 0.001/0.001 = 1.
length(v[v==0]) # 4777 Almost HALF of the values are EXACTLY zero.
logffun=function(x){ res=-0.5*log(x)-log(4); if(x>1) res=1-x-log(2); res }
loggfun=function(t,x,del){
x*del*log(del)+(x*del-1)*log(t)-t*del-lgamma(max( x*del, 5e-324 )) }
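# Sketch (assumed): with this non-symmetric gamma driver, the acceptance
# probability for a candidate xp given the current value x would be computed
# on the log scale as
#   q <- logffun(xp) + loggfun(x, xp, del) - logffun(x) - loggfun(xp, x, del)
#   p <- exp(q)
# where del is the tuning parameter delta.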
summary(res$xv)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.004243 0.309400 1.034000 1.218000 1.738000 9.356000 (OK, as Min > 0)
First let us again review the Metropolis algorithm for sampling from a
univariate density, f ( x) . This involves choosing an arbitrary starting
value of x, a suitable driver density g (t | x) and then repeatedly proposing
a value x′ ~ g (t | x) , each time accepting this value with probability
p = { f(x′)/f(x) } × { g(x | x′)/g(x′ | x) }
(or p = f(x′)/f(x) in the case of a symmetric driver).
For simplicity we will first focus on the bivariate case (M = 2). Thus, suppose we wish to generate a random sample from the distribution of a random vector X = (X_1, X_2) with pdf f(x), where x = (x_1, x_2) denotes a value of X.
Note: As before, the x1( j ) values on their own then constitute a sample
from the marginal distribution of x1 , whose density is now
f ( x1 ) = ∫∫ f ( x1 , x2 , x3 )dx2 dx3 ,
and the three acceptance probabilities can also be expressed as
p_1 = { f(x_1′, x_2, x_3) g_1(x_1 | x_1′, x_2, x_3) } / { f(x_1, x_2, x_3) g_1(x_1′ | x_1, x_2, x_3) }, etc.
specifying M drivers,
g m (t | x1 ,..., xM ) ( m = 1,..., M ),
and repeatedly iterating M steps as follows:
……..…………………………………………………………….
Note: Again, the x_1^{(j)} values on their own then constitute a sample from the marginal distribution of x_1, whose density is now
f(x_1) = ∫⋯∫ f(x_1, ..., x_M) dx_2 ... dx_M,
and the M acceptance probabilities can also be expressed as
p_1 = { f(x_1′, x_2, ..., x_M) g_1(x_1 | x_1′, x_2, ..., x_M) } / { f(x_1, x_2, ..., x_M) g_1(x_1′ | x_1, x_2, ..., x_M) }, etc.
Let us now specify the driver for n as discrete uniform over the integers from n − r to n + r, where r is a tuning parameter.
1. Propose a value
n′ ~ DU(n − r, ..., n + r),
and accept this value with probability
p_1 = f(n′, θ | y)/f(n, θ | y) = { h(n′, θ) n′! θ^y (1 − θ)^{n′−y} / (n′ − y)! } / { h(n, θ) n! θ^y (1 − θ)^{n−y} / (n − y)! }
= { n′! (1 − θ)^{n′} / (n′ − y)! } / { n! (1 − θ)^n / (n − y)! }.
2. Propose a value
θ′ ~ U(θ − c, θ + c),
and accept this value with probability
p_2 = f(n, θ′ | y)/f(n, θ | y) = { h(n, θ′) n! (θ′)^y (1 − θ′)^{n−y} / (n − y)! } / { h(n, θ) n! θ^y (1 − θ)^{n−y} / (n − y)! }
= { (θ′)^y (1 − θ′)^{n−y} } / { θ^y (1 − θ)^{n−y} }.
The first 100 iterations were thrown away as the burn-in, and then every
20th value (only) was recorded so as to thereby yield an approximately
random sample of size J = 500 from the joint posterior distribution of n
and θ, namely (n_1, θ_1), ..., (n_J, θ_J) ~ iid f(n, θ | y).
Figures 6.13 and 6.14 (pages 299 and 300) show the traces for all 10,101 values of n and θ, respectively, and Figures 6.15 and 6.16 (pages 300 and 301) show the traces for the final 500 values of n and θ, respectively.
The final bivariate sample of size J = 500 was used for Monte Carlo
inference in the usual way, with the following results.
Each histogram also includes vertical lines showing the true distribution
mean, the MC estimate of that mean, and the 95% CI for that mean.
For example, the height of the bar above 6 is the proportion of sample
values n1 ,..., nJ equal to 6, which is 117/500 = 0.234, and the short
vertical bar above 6 is the MC 95% CI for P(n = 6 | y), which is
( 0.234 ± 1.96 √{0.234(1 − 0.234)/500} ) = (0.1969, 0.2711).
X11(w=8,h=6); par(mfrow=c(2,1))
plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)",ylim=c(0,0.4))
points(nvec,fny,pch=16,cex=1); abline(v=nhat)
plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y) ",ylim=c(0,2.5))
lines(thvec,fthyvec,lwd=3); abline(v=thhat)
}
}
thprop = runif(1,th-c,th+c)
if(thprop > 0) if(thprop < 1){
logp2 = logfun(n=n,th=thprop,y=y) - logfun(n=n,th=th,y=y)
p2 = exp(logp2); u = runif(1)
if(u < p2){ th = thprop; thct = thct + 1}
}
nvec = c(nvec,n); thvec = c(thvec,th)
}
nar = nct/Jp; thar = thct/Jp; list(nvec=nvec,thvec=thvec,nar=nar,thar=thar) }
# END
X11(w=8,h=5); par(mfrow=c(1,1))
Jp = 10100; set.seed(135); res = MH(Jp=Jp,n=7,th=0.5,c=0.3,r=1,y=5,k=9)
c(res$nar,res$thar) # 0.7344 0.5847
plot(0:Jp,res$nvec,type="l", xlab="j",ylab="n_j")
plot(0:Jp,res$thvec,type="l", xlab="j",ylab="theta_j")
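# Sketch (assumed): discard the burn-in of 100 values and keep every 20th value
# thereafter, as described in the text
nv = res$nvec[-(1:101)][seq(20, 10000, by=20)]
thv = res$thvec[-(1:101)][seq(20, 10000, by=20)]
J = length(nv)  # 500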
plot(1:J,nv,type="l", xlab="j",ylab="n_j")
plot(1:J,thv,type="l", xlab="j",ylab="theta_j")
rbind(nvals,fvals,pvals,Lvals,Uvals)
# nvals 5.0000 6.0000 7.0000 8.0000 9.0000
# fvals 128.0000 117.0000 98.0000 87.0000 70.0000
# pvals 0.2560 0.2340 0.1960 0.1740 0.1400
# Lvals 0.2177 0.1969 0.1612 0.1408 0.1096
# Uvals 0.2943 0.2711 0.2308 0.2072 0.1704
par(mfrow=c(1,1))
hist(nv,prob=T,xlim=c(4,10),ylim=c(0,0.5),xlab="n",breaks=seq(4.5,9.5,1),
main="", ylab="density")
points(nvec,fny,pch=16); abline(v=nhat)
for(i in 1:length(nvals)) lines(rep(nvals[i],2),c(Lvals[i],Uvals[i]),lwd=2)
abline(v=nbar,lty=4); abline(v=nci,lty=2)
legend(8,0.5,c("True mean","Estimate of mean","95% CI for mean"),lty=c(1,4,2))
legend(4,0.5,c("True posterior"),pch=16,cex=1)
legend(4,0.4,c("95% CI for f(n|y)"),lty=1,lwd=2)
hist(thv,prob=T,xlim=c(0,1),ylim=c(0,3.2),xlab="theta",
main="", ylab="density")
lines(thvec,fthyvec,lwd=2); abline(v=thhat)
thdensity <- density( c(thv,1+abs(1-thv)), from=0, to=1,width=0.2)
lines(density(thv,from=0,to=1,width=0.2),lty=2,lwd=2)
# Note: This is the simplest way to estimate the density
lines(thdensity$x,thdensity$y*2,lty=3,lwd=2)
# Note: This density estimate is forced to be higher at theta=1
abline(v=thbar,lty=4); abline(v=thci,lty=2); abline(v=thcpdr,lty=3)
legend(0,3.2,c("True mean","Estimate of mean","95% CI for mean",
"95% CPDR estimate"),lty=c(1,4,2,3))
legend(0,1.6,c("True posterior","Estimate 1","Estimate 2"),lty=c(1,2,3),lwd=2)
In fact, this is the norm in practice, and it was the case for both drivers in
the last exercise.
Also, one may ‘bundle’ any of the M random variables into blocks and
thereby reduce the number of actual Metropolis steps per iteration. For
example, instead of doing a Metropolis step for each of x3 and x4 at each
iteration, one may do a single Metropolis step as follows:
This idea can be used to improve mixing and speed up the rate of
convergence but may require more work sampling from the bivariate
driver and determining the optimal tuning constant. Note that to sample
(x_3, x_4), it may be possible to do this in two steps via the method of composition, according to
g_{34}(t, u | x_3, x_4) = g_3(t | x_3, x_4) g_{4|3}(u | x_3, x_4, t).
This means that the candidate value xm is definitely accepted at every
iteration. In that case we call the mth step of the Metropolis-Hastings
algorithm a Gibbs step.
If all the Metropolis steps are Gibbs steps then the algorithm may also be
called a Gibbs sampler.
In any case, the mth conditional distribution can be obtained by examining the joint density of all the variables and viewing that joint density as a function of x_m only.
Example
This density was used as a basis for the following Metropolis step for θ at each iteration:
Instead of this Metropolis step at each iteration, it would be better and also easier to apply a Gibbs step which involves sampling the next value of θ directly from the Beta(y + 1, n − y + 1) distribution.
2. Draw θ ~ Beta(y + 1, n − y + 1).
(c) Repeat (b) but with a Gibbs sampler in place of the MH algorithm.
k(μ, λ) = λ^{α + n/2 − 1} exp{ −βλ − (1/(2σ_0^2))(μ − μ_0)^2 − (λ/2) Σ_{i=1}^n (y_i − μ)^2 }.
1. Draw a value μ′ ~ U(μ − c, μ + c)
and accept it with probability p_1 = k(μ′, λ)/k(μ, λ).
2. Draw a value λ′ ~ U(λ − r, λ + r)
and accept it with probability p_2 = k(μ, λ′)/k(μ, λ).
The acceptance rates for μ and λ were 92% and 92%. These rates were judged to be unduly high because they led to very strong serial correlation in the simulated values (i.e. poor mixing).
So the algorithm was run again from the same starting values but with c = 0.9 and r = 0.08 (both larger). This resulted in Figures 6.23 and 6.24 (pages 312 and 313), with much better mixing, faster convergence, and the better acceptance rates of 59% and 58%.
The last 5,000 pairs of values from this second run of the algorithm were
then collected and used to produce the two histograms in Figures 6.25 and
6.26 (pages 313 and 314). Each histogram is overlaid by a density estimate
of the corresponding posterior and shows a dot indicating the true value
of the parameter (which was initially sampled from its prior).
(c) Examining the kernel of the joint posterior in (b) and studying previous
exercises (involving the normal-normal model and the normal-gamma
model) we easily identify the two conditional distributions which define
the Gibbs sampler. These are defined as follows:
1. Sample μ ~ f(μ | y, λ) ~ N(μ*, σ*^2), where: μ* = (1 − k)μ_0 + kȳ,
σ*^2 = kσ^2/n, k = n/(n + σ^2/σ_0^2) = n/(n + 1/(λσ_0^2)), σ^2 ≡ 1/λ.
2. Sample λ ~ f(λ | y, μ) ~ G( α + n/2, β + (1/2){ (n − 1)s^2 + n(ȳ − μ)^2 } ).
This Gibbs sampler was started at μ = 0 and λ = 1 and run for a total of 6,000 iterations. The resulting traces are shown in Figures 6.27 and 6.28.
The last 5,000 pairs of values were then collected and used to produce the
histograms in Figures 6.29 and 6.30 (page 316). Each histogram is
overlaid by a density estimate of the corresponding posterior and shows a
dot indicating the true value of the parameter.
We see that the Gibbs sampler has produced very similar output to that in
(b) as obtained using the Metropolis-Hastings algorithm, but with less
effort (e.g. no need to worry about tuning constants) and with arguably
better results.
By this we mean that the output from the Gibbs sampler exhibits far less
serial correlation. This is evidenced clearly in Figure 6.31 (page 317),
which shows the sample autocorrelation functions of the simulated values
of μ and λ in (b) (top two subplots) and in (c) (bottom two subplots).
# (a)
mu0=10; sig0=2; alp=3; bet=6; n=40; options(digits=4)
set.seed(226); lam=rgamma(1,alp,bet); mu=rnorm(1,mu0,sig0);
sig=1/sqrt(lam); y=rnorm(n,mu,sig)
c(lam, sig, sig^2, mu, mean(y), sd(y))
# 0.1292 2.7822 7.7405 11.9511 12.2768 2.5919
X11(w=8,h=5); par(mfrow=c(1,1))
# (b)
MH <- function(Jp, mu, lam, y, c, r, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Metropolis-Hastings algorithm for the general
# normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# c, r = tuning parameters for mu and lambda
# alp, bet = parameters of lambda’s gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda
# $muar, $lamar = acceptance rates for mu and lambda.
muv <- mu; lamv <- lam; ybar <- mean(y); n <- length(y); muct <- 0; lamct <- 0
logpost <- function(n,y,mu,lam,alp,bet,mu0,sig0){
(alp + n/2-1)*log(lam) - bet*lam -
0.5*lam*sum((y-mu)^2) -0.5*(mu-mu0)^2/sig0^2 }
for(j in 1:Jp){
mup <- runif(1,mu-c,mu+c) # propose a value of mu
q1 <-
logpost(n=n,y=y,mu=mup, lam=lam,alp=alp,bet=bet,mu0=mu0,sig0=sig0)-
logpost(n=n,y=y,mu=mu ,lam=lam,alp=alp,bet=bet, mu0=mu0,sig0=sig0)
p1 <- exp(q1) # acceptance probability
u <- runif(1); if(u < p1){ mu <- mup; muct <- muct + 1 }
lamp <- runif(1,lam-r,lam+r) # propose a value of lambda
if(lamp > 0){ # automatically reject if lamp < 0
q2 <-
logpost(n=n,y=y,mu=mu,lam=lamp,alp=alp,bet=bet, mu0=mu0,sig0=sig0)-
logpost(n=n,y=y,mu=mu,lam=lam ,alp=alp,bet=bet, mu0=mu0,sig0=sig0)
p2 <- exp(q2) # acceptance probability
u <- runif(1); if(u < p2){ lam <- lamp; lamct <- lamct + 1 }
}
muv <- c(muv,mu); lamv <- c(lamv,lam)
}
list(muv=muv,lamv=lamv,muar=muct/Jp,lamar=lamct/Jp)
}
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
text(3000,0.6,"c=0.9, r=0.08")
hist(lamv,prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)
# (c)
GS = function(Jp, mu, lam, y, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Gibbs sampler for the general normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# alp, bet = parameters of lambda’s gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda
muv = mu; lamv = lam; n = length(y); ybar = mean(y); s2 = var(y); sig02 = sig0^2
for(j in 1:Jp){
sig2=1/lam; k=n/(n+sig2/sig02); sig2star=k*sig2/n;
mustar=(1-k)*mu0+k*ybar
mu = rnorm(1,mustar,sqrt(sig2star))
lam=rgamma( 1, alp+0.5*n, bet+0.5*((n-1)*s2+n*(mu-ybar)^2) )
muv = c(muv,mu); lamv = c(lamv,lam) }
list(muv=muv,lamv=lamv)
}
Jp = 6000; set.seed(331)
res = GS(Jp=Jp, mu=0,lam=1, y=y, alp=3,bet=6, mu0=10,sig0=2)
plot(0:Jp,res$muv,type="l",xlab="j",ylab="mu_j");
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
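# Sketch (assumed): collect the last 5,000 pairs of values, as described in the text
muv = res$muv[-(1:1001)]; lamv = res$lamv[-(1:1001)]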
hist(muv,prob=T,xlab="mu",nclass=20,main="",ylim=c(0,1.1),
ylab="density/relative frequency"); lines(density(muv),lwd=2);
points(mu,0,pch=16,cex=1.5)
hist(lamv,prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)
muvc=muv; lamvc=lamv
X11(w=8,h=7); par(mfrow=c(2,2))
CHAPTER 7
MCMC Methods Part 2
7.1 Introduction
In the last chapter we introduced a set of very powerful tools for
generating samples required for Bayesian Monte Carlo inference, namely
Markov chain Monte Carlo (MCMC) methods. The topics we covered
included the Metropolis algorithm, the Metropolis Hastings algorithm and
the Gibbs sampler.
We now present one more topic, stochastic data augmentation, and
provide some further exercises in MCMC. These exercises will illustrate
how many statistical problems can be cast in the Bayesian framework and
how easily inference can then proceed relative to the classical framework.
The examples below include simple linear regression, logistic regression
(an example of generalised linear modelling and survival analysis),
autocorrelated Bernoulli data, and inference on the unknown bounds of a
uniform distribution.
g ( x ) = ∫ q(u | x )du
(c) Estimate EX using a Monte Carlo sample obtained via the Metropolis
algorithm.
(d) Estimate EX using a Monte Carlo sample obtained via a Gibbs sampler
designed using the principles of data augmentation.
(a) Let the kernel be k(x) = e^{-x}/(x + 1).
So EX = 0.40365/0.59635 = 0.6769.
Thus p(x) = 1/(x + 1).
Figure 7.1 shows a trace plot of the simulated values and (just for interest)
the associated sample ACF of these values (showing the complete absence
of autocorrelation), respectively.
(c) Using a normal driver distribution centred at the last value and with
standard deviation 0.6 we ran a Metropolis algorithm for 40,500 iterations,
starting at x = 1. We kept every 40th sampled value after first discarding
the first 500 iterations as the burn-in. Using the resulting Monte Carlo
sample of size 1,000, we estimated EX as 0.7049 with 95% CI (0.6561,
0.7537). The overall acceptance rate of the algorithm was 58%. Figure 7.2
shows a trace plot of all 40,500 simulated values, the sample ACF of those
values (showing a very strong autocorrelation), a trace plot of the 1,000
values used for inference, and the sample ACF of those values (showing
very little autocorrelation).
(d) Observe that 1/(x + 1) = ∫_0^∞ e^{-(x+1)w} dw.
Therefore f(x) = e^{-x}/(x + 1) ∝ ∫_0^∞ e^{-(x+1)w} e^{-x} dw.
Hence we may define an artificial latent variable w such that the joint density of w and x is
f(w, x) ∝ e^{-(x+1)w} e^{-x}, w > 0, x > 0.
We see that:
f(w | x) ∝ f(w, x) ∝ e^{-(x+1)w}, w > 0 (as a function of w)
f(x | w) ∝ f(w, x) ∝ e^{-(w+1)x}, x > 0 (as a function of x).
Figure 7.3 shows a trace plot of all 5,100 simulated values, their sample
ACF (showing a slight autocorrelation), a trace plot of the 1,000 values
used for inference, and the sample ACF of these 1,000 values (showing
very little autocorrelation).
Note that similar plots could also be produced for the simulated latent
variable, w. Also note how data augmentation and a Gibbs sampler have
resulted in a usable Monte Carlo sample more easily and effectively than
the Metropolis algorithm.
# (a)
options(digits=5); kfun=function(x){ exp(-x)/(x+1) }
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.59635
xkfun =function(x){ x*exp(-x)/(x+1) }
top=integrate(f=xkfun,lower=0,upper=Inf)$value; top # 0.40365
EX=top/c; EX # 0.67688
# (b)
J=1000; xv=rep(NA,J); ct=0; set.seed(331)
for(j in 1:J){ acc=F; while(acc==F){ ct=ct+1
x=rgamma(1,1,1); p=1/(x+1); u=runif(1); if(u<p){ acc=T; xv[j]=x } } }
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(ct,xbar,ci) # 1651.00000 0.68754 0.64016 0.73492
par(mfrow=c(2,1)); plot(1:J,xv,type="l")
acf(xv)$acf[1:5] # 1.0000000 -0.0205516 -0.0100987 -0.0040018 0.0732520
# (c)
MET <- function(K,x,c){
# This function applies the Metropolis algorithm to sampling from
# f(x)~exp(-x)/(x+1),x>0.
# Inputs: K = total number of iterations
# x = initial value of x, c = standard deviation of normal driver
# Outputs: $xv = vector of (K+1) values of x, $ar = acceptance rate
xv = x; ct = 0
for(j in 1:K){
xp = rnorm(1,x,c)
if(xp>0) {
q = (-xp-log(xp+1)) - (-x-log(x+1)); p = exp(q); u = runif(1)
if(u < p){ x = xp; ct = ct + 1 }
}
xv <- c(xv,x) }
ar = ct/K; list(xv=xv,ar=ar) }
# (d)
GIBBS <- function(K,x){
# This generates a sample using the Gibbs sampler and data augmentation.
# Inputs: K = total number of iterations, x = initial value of x
# Outputs: $xv = vector of (K+1) values of x, $wv = vector of (K+1) values of w
xv = x; wv=NA; for(j in 1:K){
w=rgamma(1,1,x+1); x=rgamma(1,1,w+1); xv=c(xv,x); wv=c(wv,w) }
list(xv=xv,wv=wv) }
(b) Conduct a classical analysis of the data in (a). Report the MLEs and
95% CIs for a and b. Also create a single graph which shows:
• the fitted regression line Ê(Y | x) = â + b̂x
Use a suitable joint uninformative and improper prior for the three
parameters in the model.
(d) Create a single graph showing all the information in the two graphs in
(b) and (c).
Note: The Bayesian analysis in (c) could also be performed via the
Gibbs sampler.
(a) The simulated data are shown in Table 7.1. Note that x_i = i.
i     1      2     3      4      5      6      7      8      9      10
y_i   5.879  8.54  14.12  13.14  15.26  20.43  19.92  18.47  21.63  24.11
(b) The MLE of b is
b̂ = { Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) } / { Σ_{i=1}^n (x_i − x̄)^2 } = 1.836,
and the MLE of a is then â = ȳ − b̂x̄ = 6.051.
An unbiased estimate of σ^2 = 1/λ (= 4) is
s^2 = {1/(n − 2)} Σ_{i=1}^n ( y_i − {â + b̂x_i} )^2 = 3.816.
Let X be the n × 2 design matrix with ith row (1, x_i), and let
M = (X′X)^{-1}, a 2 × 2 matrix with entries m_{11}, m_{12}, m_{21}, m_{22}.
Let us now solve this Bayesian model so as to estimate the posterior means and 95% CPDRs for a and b. The joint posterior density of the three model parameters is
f(a, b, λ | y) ∝ λ^{n/2 − 1} exp{ −(λ/2) Σ_{i=1}^n (y_i − μ_i)^2 }
(where μ_i = a + bx_i as already defined).
Applying the MH algorithm for 2,500 iterations, we obtain traces for the
three parameters as shown in Figure 7.5. The horizontal lines show the
true values of the three parameters. The fourth subplot (bottom right) is a
histogram of the last 2,000 values of b simulated.
Using output from the last 2,000 iterations only, we estimate the posterior mean and 95% CPDR for a (= 5) as 6.3445 and (3.578, 8.808), and the same for b (= 2) as about 1.7881 and (1.392, 2.234).
Figure 7.6 shows the Bayesian analogue of Figure 7.5 in part (b).
# (a) **************************************************
options(digits=4)
n <- 10; a <- 5; b <- 2; lam <- 0.25; sig <- 1/sqrt(lam); c(sig,sig^2) # 2 4
xdat <- 1:n; set.seed(123); ydat <- rnorm(n,a+b*xdat,sig)
rbind(xdat,ydat)
# xdat 1.000 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
# ydat 5.879 8.54 14.12 13.14 15.26 20.43 19.92 18.47 21.63 24.11
# (b) **********************************************************
fit <- lm(ydat ~ xdat); summary(fit)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.051 1.335 4.53 0.0019 **
# xdat 1.836 0.215 8.54 2.7e-05 ***
df <- length(ydat)-length(fit$coef)
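# Sketch (assumed): the quantities used in the CI calculations below
ahat <- fit$coef[1]; bhat <- fit$coef[2]; sig2hat <- summary(fit)$sigma^2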
aCI <- ahat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[1,1])
aCI # 2.973 9.128
bCI <- bhat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[2,2])
bCI # 1.340 2.332
lines(xxv,muhatub,lty=3,lwd=2)
lines(xxv, predlb,lty=2,lwd=2)
lines(xxv, predub,lty=2,lwd=2)
legend(6,12,c("True mean of Y given x","Least squares fit","95% CI for mean",
"95% prediction interval"),lty=c(1,4,3,2),lwd=rep(2,4))
# (c) **********************************************************
MH.SLR <- function(Jp, x, y, a, b, lam, asd, bsd, lamsd){
# This function implements a Metropolis Hastings algorithm for a
# simple linear regression model with uninformative priors.
# Inputs: Jp = total number of iterations
# x = vector of covariates
# y = vector of observations
# a,b,lam = starting values of a,b,lambda
# asd,bsd,lamsd = st. dev.s of drivers for a,b,lambda.
# Outputs: $av,$bv,$lamv = (Jp+1)-vectors of values of a,b,lambda
# $aar,$bar,$lamar = acceptance rates for a,b,lambda.
av <- a; bv <- b; lamv <- lam; ybar <- mean(y); n <- length(y)
act <- 0; bct <- 0; lamct <- 0
logpost <- function(n, x, y, a, b, lam){ # logposterior
(n/2 - 1) * log(lam) - 0.5 * lam * sum((y - a - b * x)^2) }
for(j in 1:Jp) {
ap <- rnorm(1, a, asd) # propose a value of a
k <- logpost(n=n, x=x, y=y, a=ap, b=b, lam=lam) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { a <- ap; act <- act + 1 }
bp <- rnorm(1, b, bsd) # propose a value of b
k <- logpost(n=n, x=x, y=y, a=a, b=bp, lam=lam) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { b <- bp; bct <- bct + 1 }
lamp <- rnorm(1, lam, lamsd) # propose a value of lambda
if(lamp > 0) { # automatically reject if lamp < 0
k <- logpost(n=n, x=x, y=y, a=a, b=b, lam=lamp) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { lam <- lamp; lamct <- lamct + 1 }
}
av <- c(av, a); bv <- c(bv, b); lamv <- c(lamv, lam)
}
list(av = av, bv = bv, lamv = lamv, aar = act/Jp, bar = bct/Jp, lamar = lamct/Jp)
}
cpdrLBs <- xxv; cpdrUBs <- xxv; predLBs <- xxv; predUBs <- xxv; set.seed(171)
for(i in 1:nn){
mus <- av + bv*xxv[i]
cpdrLBs[i] <- quantile(mus,0.025)
cpdrUBs[i] <- quantile(mus,0.975)
sim <- rnorm(J,mus,1/sqrt(lamv))
predLBs[i] <- quantile(sim,0.025)
predUBs[i] <- quantile(sim,0.975)
}
# (d) **********************************************************
Table 7.2 shows data on the number of rats who died in each of n = 10
experiments within one month of being administered a particular dose of
radiation. For example in Experiment 3, a total of 40 rats were exposed to
radiation for 3.6 hours, and 23 of them died within one month. Thus an
estimate of the probability of a rat dying within one month if it is exposed
to 3.6 hours of radiation is 23/40 = 57.5%.
i    n_i   x_i    y_i    p̂_i = y_i/n_i
1 10 0.1 1 1/10 = 0.1
2 30 1.4 0 0/30 = 0
3 40 3.6 23 23/40 = 0.575
4 20 3.8 12 12/20 = 0.6
5 15 5.2 8 8/15 = 0.5333
6 46 6.1 32 32/46 = 0.696
7 12 8.7 10 10/12 = 0.833
8 37 9.1 35 35/37 = 0.946
9 23 9.1 19 19/23 = 0.826
10 8 13.6 8 8/8 = 1
(a) Find the ML estimates of a and b using the glm() function in R. For
each parameter also calculate a suitable 95% CI.
(b) Find the ML estimates and associated 95% CIs in R using your own
code for the Newton-Raphson algorithm and without using the glm()
function.
(c) Find the ML estimates using a modification of the Newton-Raphson
algorithm which does not require the inversion of matrices.
(d) Suppose that a and b are assigned independent flat priors over the
whole real line. Thus consider the Bayesian model:
(Y_i | a, b) ~ Bin(n_i, p_i), i = 1, ..., n
p_i = 1/{1 + exp(−z_i)} (probability of death for experiment i)
z_i = a + bx_i (linear predictor)
f(a, b) ∝ 1, a, b ∈ ℝ.
Hence estimate the posterior means of a and b, together with 95% MC CIs
for these estimates, and also estimate the 95% CPDRs.
Show graphs of the traces and histograms. Overlay the MC estimates and
MLEs over the traces, together with 95% CPDRs and CIs, respectively.
Also, overlay kernel density estimates over the histograms.
(e) Use the sample in (d) to estimate p(x), the probability of a rat dying if
it is exposed to x hours of radiation, for each x = 0,1,2,...,15.
Graph these results with a line in a figure which also shows the 10 pˆ i
values.
Also include:
• the MC 95% CI for each estimate of p(x) (i.e. for each E{p(x) | y})
• the MC 95% CPDR for each p(x)
• the MLE of each p(x) using standard GLM procedures,
together with associated large-sample 95% CIs.
(f) Suppose that 20 more rats are about to be exposed to exactly five hours
of radiation. Use the sample in (d) to estimate how many of these 20 rats
will die, together with a 95% CI for your estimate. Also construct an
approximate 95% prediction region for the number of rats that will die
and report the estimated actual probability content of this region.
(g) Use the sample in (d) to estimate LD50, the lethal dose of radiation at
which 50% of rats die, together with a 95% CPDR. Also compute an
estimate and 95% CI for LD50 using standard GLM techniques.
(h) Consider the Bayesian model and data in (d). Modify the model
suitably so as to constrain the probability of death at a dose of zero to be
exactly zero. Estimate the parameters in the new model and draw a graph
similar to the one in (e) which shows the posterior probability of death for
each dose x from zero to 15, together with the associated 95% CPDRs.
(a) Using the glm() function in R, we find that the MLE and 95% CI for
a are –2.156 and (–2.9998, –1.3113). Also, the MLE and 95% CI for b are
0.5028 and (0.3456, 0.6601).
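A minimal R sketch of this fit (illustrative; the variable names are taken from Table 7.2):

ni <- c(10,30,40,20,15,46,12,37,23,8)
xi <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yi <- c(1,0,23,12,8,32,10,35,19,8)
fit <- glm(cbind(yi, ni - yi) ~ xi, family = binomial)
summary(fit)$coef        # MLEs of a (intercept) and b (slope)
confint.default(fit)     # large-sample 95% CIs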
(b) Since the priors on a and b are flat, finding the maximum likelihood estimate of (a, b) is the same as finding the posterior mode of (a, b). Now, the posterior density of a and b is
f(a, b | y) ∝ Π_{i=1}^n p_i^{y_i} (1 − p_i)^{n_i − y_i}.
So the log-posterior is
l(a, b) = log f(a, b | y) = Σ_{i=1}^n q_i + constant, where q_i = y_i log p_i + (n_i − y_i) log(1 − p_i).
Let:
d_{1i} = dq_i/da = y_i − n_i p_i,  d_{2i} = dq_i/db = (y_i − n_i p_i) x_i
d_{11i} = d^2 q_i/da^2 = −n_i p_i(1 − p_i),  d_{12i} = d^2 q_i/dadb = −n_i p_i(1 − p_i) x_i,
d_{22i} = d^2 q_i/db^2 = −n_i p_i(1 − p_i) x_i^2
d_1 = Σ_{i=1}^n d_{1i},  d_2 = Σ_{i=1}^n d_{2i},  d_{11} = Σ_{i=1}^n d_{11i},  d_{22} = Σ_{i=1}^n d_{22i},  d_{12} = Σ_{i=1}^n d_{12i}
v = (a, b)′,  D = D(v) = (d_1, d_2)′,  M = M(v) = [ d_{11} d_{12} ; d_{12} d_{22} ].
Starting from the origin, the iterates of a and b are as shown in Table 7.3.
t      0        1        2        3        4        5
a_t    0     –1.474   –2.013   –2.148   –2.156   –2.156
b_t    0     0.3369   0.4670   0.5008   0.5028   0.5028
Thus the MLEs of a and b are â = –2.156 and b̂ = 0.5028. This agrees
perfectly with the results in (a).
A 95% CI for a is â ± t_0.025(8) s_a and a 95% CI for b is b̂ ± t_0.025(8) s_b,
where:
t_0.025(8) = 2.306
s_a² is the top left element of V
s_b² is the bottom right element of V
V = (X′WX)^{−1} (a 2 by 2 matrix)
X is the n × 2 design matrix whose ith row is (1, x_i)
W = diag(w_1, ..., w_n), with w_i = n_i / {V(μ_i) g′(μ_i)²}
μ_i = p̂_i = 1/(1 + exp(−ẑ_i))  (MLE of the probability at x = x_i)
ẑ_i = â + b̂ x_i  (MLE of the linear predictor at x = x_i)
V(μ) = μ(1 − μ),  g(μ) = log{μ/(1 − μ)} (the logit link function)
g′(μ) = 1/{μ(1 − μ)},
so that w_i = n_i μ_i(1 − μ_i).
(c) The modification replaces the inversion of the 2 by 2 matrix M by separate
univariate updates a ← a − d_1/d_11 and b ← b − d_2/d_22 (see the R code below).
Starting from the origin (a, b) = (0,0) we obtain the results in Table 7.4.
t      0        1          2          3          4
a_t    0     0.4564   –0.45034   –0.06132   –0.7294
b_t    0     0.1401    0.09223    0.20571    0.1690
t     20       21         99        100
a_t  –1.8585  –1.8619   –2.1555    –2.1555
b_t   0.4424   0.4532    0.5028     0.5028
We see that this modified and simpler algorithm converges more slowly
than plain NR. Also, it is less stable, as it fails to converge if started from
(a, b) = (0.3, 0.3), unlike plain NR. Both algorithms fail to converge if
started from (0.5, 0.5). (See the R code below for details.)
(d) We apply the Metropolis-Hastings algorithm with a burn-in of 500, starting
from the origin, to get a sample of size J = 10,000 from f(a, b | y). The
acceptance rates were 37% for a and 55% for b. The Markov chain was not thinned
for subsequent inference, meaning that the CIs obtained below are perhaps
narrower than they should be.
Note: Figure 7.9 shows that the probability of a rat dying when given no
radiation is about 10%. We should interpret this result and the graph
near x = 0 with caution. Ideally, we would conduct another experiment
with only small values of x and a second logistic regression, perhaps
using the log of x as the explanatory variable. On the other hand, maybe
the 10% figure is reasonable because rats could die within one month
for reasons other than radiation. Alternatively, we could modify our
model so as to force p(0) = 0 (see (h) below).
(f) Let d be the number of rats which will die if exposed to radiation for
five hours. Then
(d | y, a, b) ~ Bin(20, p(a,b)),
where
p(a,b) = 1/(1 + exp(−a − 5b)).
Thus for each sampled (a,b) we calculate p(a,b) and sample from the
binomial distribution of d above. The frequencies of the resulting 10,000
values of d are shown in Table 7.5.
d 3 4 5 6 7 8
frequency 1 3 20 75 217 472
d 9 10 11 12 13 14
frequency 845 1188 1562 1733* 1546 1123
d 15 16 17 18 19
frequency 709 332 131 37 6
(g) First observe that the LD50 is the value of x such that p(x) = 0.5.
Using the sample of 10,000 in part (f), we estimate the posterior mean of
LD50 as 4.279, with 95% MC CI (4.273, 4.286). The MC 95% CPDR for
LD50 is (3.584, 4.916). Thus we can be 95% confident that the dose
required to kill half of a large number of rats is between 3.6 and 4.9.
Using standard GLM procedures and the delta method we estimate LD50
as 4.287 (the MLE) with 95% CI (3.532, 5.042). Thus we can be 95%
confident that the dose required to kill half of a large number of rats is
between 3.5 and 5.0. We see that Bayesian and classical methods have
resulted in inferences which are very similar.
(h) An alternative to the logistic model in (d), one with zero probability
of death at zero dosage of radiation, is as follows:
(Y_i | a, b) ~ Bin(n_i, p_i), i = 1, ..., n
p_i = 1 − exp(−z_i),  z_i = a x_i + b x_i²
f(a, b) ∝ 1,  a, b > 0.
# (a) ********************************************************
nvec <- c(10,30,40,20,15,46,12,37,23,8)
xvec <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yvec <- c(1,0,23,12,8,32,10,35,19,8)
pvec <- yvec/nvec
options(digits=4)
cbind(xvec,nvec,yvec,pvec)
# xvec nvec yvec pvec
# [1,] 0.1 10 1 0.1000
# [2,] 1.4 30 0 0.0000
# [3,] 3.6 40 23 0.5750
# [4,] 3.8 20 12 0.6000
# [5,] 5.2 15 8 0.5333
# [6,] 6.1 46 32 0.6957
# [7,] 8.7 12 10 0.8333
# [8,] 9.1 37 35 0.9459
# [9,] 9.1 23 19 0.8261
# [10,] 13.6 8 8 1.0000
# (b) *****************************************************
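# The definition of the function NR.LOGISTIC used below was not reproduced
# on this page. A minimal sketch, consistent with the derivatives given above,
# is included here; the argument names match the calls below, but the body is
# an assumed reconstruction rather than the author's original code.
NR.LOGISTIC <- function(m=20, alp=0, bet=0, xv, nv, yv){
  # Newton-Raphson for the logistic model p_i = 1/(1+exp(-alp-bet*x_i))
  alpv <- alp; betv <- bet
  for(t in 1:m){
    pv <- 1/(1+exp(-alp-bet*xv))
    D <- c(sum(yv-nv*pv), sum((yv-nv*pv)*xv))                     # score vector
    M <- -rbind(c(sum(nv*pv*(1-pv)),    sum(nv*pv*(1-pv)*xv)),
                c(sum(nv*pv*(1-pv)*xv), sum(nv*pv*(1-pv)*xv^2)))  # Hessian
    v <- c(alp,bet) - solve(M) %*% D                              # NR update
    alp <- v[1]; bet <- v[2]
    alpv <- c(alpv,alp); betv <- c(betv,bet)
  }
  list(alpv=alpv, betv=betv)
}
# The quantities alpmle, betmle and varmat used further below can then be taken
# as the final iterates and as solve(-M) evaluated at those final iterates.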
options(digits=4)
nrres <- NR.LOGISTIC(m=20,alp=0,bet=0,xv=xvec,nv=nvec,yv=yvec)
nrres
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....
NR.LOGISTIC(m=20,alp=0.3,bet=0.3,xv=xvec,nv=nvec,yv=yvec)
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....
NR.LOGISTIC(m=20,alp=0.5,bet=0.5,xv=xvec,nv=nvec,yv=yvec)
# Error in solve.default(M) :
# system is computationally singular: reciprocal condition
# number = 9.01649e-18
qt(0.975,8) # 2.306
alpmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[1,1]) # -3.000 -1.311
betmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[2,2]) # 0.3456 0.6601
# (c) ****************************************************
for(t in 1:m){
pv <- 1/(1+exp(-alp-bet*xv))
d1 <- sum(yv - nv*pv)
d2 <- sum((yv - nv*pv)*xv)
d11 <- -sum(nv*pv*(1-pv))
d22 <- -sum(nv*pv*(1-pv)*xv^2)
alp <- alp - d1/d11
bet <- bet - d2/d22
alpv <- c(alpv,alp); betv <- c(betv,bet)
}
list(alpv=alpv,betv=betv)
}
# (d) ****************************************************
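# The sampler below calls a log-posterior function logfun() whose definition
# is not shown on this page. A minimal sketch under the flat prior in (d)
# (an assumed reconstruction, with argument names matching the calls below):
logfun <- function(a, b, xv, yv, nv){
  pv <- 1/(1 + exp(-a - b*xv))
  sum(yv*log(pv) + (nv - yv)*log(1 - pv))   # log of the unnormalised posterior
}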
for(j in 1:its){
a2 <- rnorm(1,a,sa)
logpr <- logfun(a=a2,b=b,xv=xv,yv=yv,nv=nv)-
logfun(a=a,b=b, xv=xv,yv=yv,nv=nv)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ a <- a2; arav[j+1] <- 1 }
b2 <- rnorm(1,b,sb)
logpr <- logfun(a=a,b=b2, xv=xv,yv=yv,nv=nv)-
logfun(a=a,b=b, xv=xv,yv=yv,nv=nv)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ b <- b2; arbv[j+1] <- 1 }
list(av=av,bv=bv,ara=ara,arb=arb)
}
burn <- 500; K <- 10000; its <- burn + K; set.seed(221); date() #
res <- MHLR(burn=burn,J=K,a0=0,b0=0,xv=xvdata,
yv=yvdata,nv=nvdata,sa=0.5,sb=0.05); date() # 10000 Took 1 second
c(res$ara,res$arb) # 0.3650 0.5544
par(mfrow=c(2,1)); plot(res$av,type="l"); plot(res$bv,type="l") # OK
options(digits=4); J = K; thin=1
# thin=1 means no thinning (for experimentation)
av <- res$av[-(1:(burn+1))][seq(thin,K,thin)]; length(av) # 10000
acf(av)$acf[1:5] # 1.0000 0.9283 0.8756 0.8324 0.7945
# (very high autocorrelation)
ahat <- mean(av); aci <- ahat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(av)/J)
acpdr <- quantile(av,c(0.025,0.975))
c(ahat,aci,acpdr) # -2.207 -2.214 -2.199 -2.963 -1.521
X11(w=8,h=8); par(mfrow=c(2,2))
plot(0:its,res$av,type="l",xlab="j",ylab="a_j")
abline(h=c(ahat,aci,acpdr))
abline(h=c(fit$coef[1],fitaci),lty=4)
legend(400,0,c("MC est, 95% CI & CPDR",
"MLE & classical 95% CI"),lty=c(1,4))
plot(0:its,res$bv,type="l", xlab="j",ylab="b_j")
abline(h=c(bhat,bci,bcpdr))
abline(h=c(fit$coef[2],fitbci),lty=4)
legend(400,0.2,c("MC est, 95% CI & CPDR",
"MLE & classical 95% CI"),lty=c(1,4))
hist(av,prob=T, xlim=c(-4,0),ylim=c(0,1.5),nclass=20,xlab="a")
lines(dena$x,dena$y,lwd=2)
hist(bv,prob=T, xlim=c(0.2,0.8),ylim=c(0,7),nclass=20,xlab="b")
lines(denb$x,denb$y,lwd=2)
# (e) ***************************************************
for(i in 1:len){
xx <- xxv[i]
ppsim <- 1/(1+exp(-av-bv*xx))
pp <- mean(ppsim)
ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
ppcpdr <- quantile(ppsim,c(0.025,0.975))
ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}
X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
points(xxv,pihat); lines(xxv,pihatlb,lty=4); lines(xxv,pihatub,lty=4)
legend(8,0.65, c("MC est & 95% CI","95% CPDR","Classical GLM 95% CI"),
lty=c(1,2,4))
legend(8,0.35,c("Sample proportions","Standard GLM estimates"),pch=c(16,1))
# pphatv <- 1/(1+exp(-ahat-bhat*xxv))
# lines(xxv,pphatv,lty=3) # This alternative estimate is practically
# indistinguishable from ppv and so is not plotted
# (f) *****************************************************
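# The full code for (f) is not reproduced here. A minimal sketch, assuming
# av and bv hold the 10,000 posterior draws of a and b from part (d):
p5sim <- 1/(1 + exp(-av - 5*bv))                     # draws of p(a,b) at x = 5
dsim  <- rbinom(length(p5sim), size=20, prob=p5sim)  # one value of d per draw
table(dsim)                                          # frequencies as in Table 7.5
dhat <- mean(dsim)                                   # point estimate
dci  <- dhat + c(-1,1)*qnorm(0.975)*sd(dsim)/sqrt(length(dsim))  # 95% CI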
# (g) ****************************************************
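# Likewise for (g), a minimal sketch assuming av and bv are the draws from (d):
ld50sim  <- -av/bv                                   # posterior draws of LD50 = -a/b
ld50hat  <- mean(ld50sim)                            # posterior mean estimate
ld50ci   <- ld50hat + c(-1,1)*qnorm(0.975)*sd(ld50sim)/sqrt(length(ld50sim))
ld50cpdr <- quantile(ld50sim, c(0.025,0.975))        # 95% CPDR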
# (h) ****************************************************
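# For (h) the sampler in (d) can be reused with the new model
# p_i = 1 - exp(-(a*x_i + b*x_i^2)) and the constraint a, b > 0. A sketch of
# the corresponding log-posterior (an assumption, not the author's code):
logfun.h <- function(a, b, xv, yv, nv){
  if(a <= 0 || b <= 0) return(-Inf)       # flat prior restricted to a, b > 0
  pv <- 1 - exp(-a*xv - b*xv^2)
  sum(yv*log(pv) + (nv - yv)*log(1 - pv))
}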
for(i in 1:len){
xx <- xxv[i]
ppsim <- 1-exp(-av*xx-bv*xx^2)
pp <- mean(ppsim)
ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
ppcpdr <- quantile(ppsim,c(0.025,0.975))
ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}
X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
Hence, with
p_i = P(Y_i = 1 | a, b, y_{i−1}) = 1/(1 + exp(−a − b y_{i−1}))
(as already defined), the joint posterior pdf of a and b is
f(a, b | y) ∝ f(a, b) f(y | a, b)
 ∝_{a,b} 1 × f(y_1 | a, b) ∏_{i=2}^{n} f(y_i | a, b, y_{i−1})
 = q_1^{y_1}(1 − q_1)^{1−y_1} ∏_{i=2}^{n} p_i^{y_i}(1 − p_i)^{1−y_i}.
The traces of a and b over all 11,000 iterations, and histograms of the last
10,000 values of a and b, respectively, are shown in Figure 7.11, together
with posterior density estimates.
yv <- c(1,1,1,1,1, 1,1,0,0,0); n <- length(yv); ybar <- mean(yv); ydot <- sum(yv)
for(j in 1:K){
a2 <- rnorm(1,a,sa) # proposed value of a
logpr <- logfun(a=a2,b=b,yv=yv,n=n)-logfun(a=a,b=b,yv=yv,n=n)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ a <- a2; cta <- cta + 1 }
if(sb > 0){
b2 <- rnorm(1,b,sb) # proposed value of b
logpr <- logfun(a=a,b=b2,yv=yv,n=n)-logfun(a=a,b=b,yv=yv,n=n)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ b <- b2; ctb <- ctb + 1 }
}
av <- c(av,a); bv <- c(bv,b)
}
list(av=av,bv=bv,ara=cta/K,arb=ctb/K)
}
rbind(c(abar,acpdr),c(bbar,bcpdr))
# [1,] -2.337 -6.3980 0.8313
# [2,] 5.411 0.9098 11.8691
X11(w=8,h=6); par(mfrow=c(2,2));
plot(av,type="l",xlab="j",ylab="a_j",cex=1.2)
plot(bv,type="l",xlab="j",ylab="b_j",cex=1.2)
hist(av,prob=T,xlab="a",ylab="relative frequency",cex=1.2);
abline(v=c(abar,acpdr), lty=1,lwd=3); lines(density(av),lwd=2)
hist(bv,prob=T,xlab="b",ylab="relative frequency",cex=1.2);
abline(v=c(bbar,bcpdr), lty=1,lwd=3); lines(density(bv),lwd=2)
Generate a random sample of size n = 20 from the model with a = 0.6 and
b = 0.8. Then apply MCMC methods to generate a random sample from
the joint posterior of a and b. Then use this sample to perform Monte Carlo
inference on m = E(y_i | a, b) = (a + b)/2.
i 1 2 3 4 5
yi 0.7846 0.7572 0.6381 0.7626 0.6105
i 6 7 8 9 10
yi 0.6990 0.7728 0.7113 0.7314 0.7435
i 11 12 13 14 15
yi 0.6324 0.7072 0.7493 0.7979 0.6182
i 16 17 18 19 20
yi 0.7652 0.7883 0.7194 0.6211 0.6054
Note: The range of this data is from 0.6054 to 0.7979. This tells us
immediately that 0 ≤ a ≤ 0.6054 and 0.7979 ≤ b ≤ 1.
The Metropolis-Hastings acceptance probabilities (with a* and b* denoting the
proposed values) are
p_a = f(a* | y, b)/f(a | y, b) = {1/(b − a*)^n} / {1/(b − a)^n} = {(b − a)/(b − a*)}^n
p_b = f(b* | y, a)/f(b | y, a) = {1/(b*(b* − a)^n)} / {1/(b(b − a)^n)} = (b/b*){(b − a)/(b* − a)}^n.
Starting at a = 0.1 and b = 0.9, and using the tuning constants r = 0.008
and t = 0.01, the algorithm was run for 2,500 iterations. The resulting trace
plots are shown in Figure 7.12.
The algorithm was then run for a further 50,000 iterations, starting at the
last values in the previous run (a = 0.5979 and b = 0.8123). The acceptance
rates were now 61% and 54%, and this second run took 14 seconds of
computer time.
Then every 50th value was recorded so as to yield a final random sample
of size J = 1,000 from the joint posterior distribution of a and b, i.e.
(a1 , b1 ),..., (aJ , bJ ) ~ iid f (a, b | y ) .
As a check, the sample ACF of each sample of size 1,000 was calculated.
Figure 7.13 shows the ACF estimates for a and b, and these provide no
evidence for residual autocorrelation in either series.
options(digits=4)
MH = function(B,J=1000,y,a,b,r,t){
# This function performs a Metropolis-Hastings algorithm for a model
# involving three uniforms.
# Inputs: B = burn-in length
# J = desired Monte Carlo size
# y = (y1,...,yn) = data (yi ~ iid U(a,b))
# a = starting value of a (a ~ U(0,b))
# b = starting value of b (b ~ U(0,1))
# r,t = tuning constants for a & b, respectively
# Outputs: $av = (1+B+J) vector of a-values
# $bv = (1+B+J) vector of b-values
# $ar = acceptance rate for a (over last J iterations)
# $br = acceptance rate for b (over last J iterations)
av = a; bv = b; an=0; bn=0; miny=min(y); maxy=max(y); n=length(y);
for(j in 1:(B+J)){
ap = rnorm(1,a,r)
if((0<ap)&&(ap<miny)){
p = ((b-a)/(b-ap))^n; u = runif(1)
if(u<p){ a=ap; if(j>B) an=an+1 } }
bp = rnorm(1,b,t)
if((maxy<bp)&&(bp<1)){
q = (b/bp)*((b-a)/(bp-a))^n; v = runif(1)
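# [The remaining lines of this function were not reproduced on this page.
#  The following is an assumed completion, consistent with the outputs
#  $av, $bv, $ar and $br described in the comments above.]
if(v<q){ b=bp; if(j>B) bn=bn+1 } }
av = c(av,a); bv = c(bv,b)
}
list(av=av, bv=bv, ar=an/J, br=bn/J)
}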
X11(w=8,h=5); par(mfrow=c(1,1))
hist(mv,prob=T,xlab="m",main="",
xlim=c(0.65,0.75), ylim=c(0,80))
lines(density(mv),lwd=2)
est=mean(mv); ci=est+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
cpdr=quantile(mv,c(0.025,0.975))
print(c(est,ci,cpdr),digits=4) # 0.7013 0.7008 0.7019 0.6837 0.7173
abline(v= c(est,ci,cpdr),lwd=2)
CHAPTER 8
Inference via WinBUGS
8.1 Introduction to BUGS
We have illustrated the usefulness of MCMC methods by applying them
to a variety of statistical contexts. In each case, specialised R code was
used to implement the chosen method. Writing such code is typically time
consuming and requires a great deal of attention to details such as
choosing suitable tuning constants in the Metropolis-Hastings algorithm.
Figure 8.2 shows the Wikipedia article on WinBUGS (on the same day):
http://en.wikipedia.org/wiki/WinBUGS
The preferred reference for citing WinBUGS in scientific papers is:
Lunn, D.J., Thomas, A., Best, N., and Spiegelhalter, D. (2000).
WinBUGS – A Bayesian modelling framework: Concepts,
structure, and extensibility. Statistics and Computing, 10:
325–337.
Suppose the data is y = ( y1 ,..., yn ) = (2.4, 1.2, 5.3, 1.1, 3.9, 2.0), and we
wish to find the posterior mean and 95% posterior interval for each of µ
and γ = µ√τ (the signal-to-noise ratio).
To perform this in WinBUGS 1.4.3, open a new window (select ‘File’ and
then ‘New’ in the BUGS toolbar), and type the following BUGS code:
model
{
for(i in 1:n){
y[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0,0.0001)
tau ~ dgamma(0.001, 0.001)
gam <- mu*sqrt(tau)
}
# data
list(y=c(2.4,1.2,5.3,1.1,3.9,2.0), n=6)
# inits
list(tau=1)
Alternatively, copy this text from a Word document into a Notepad file,
and then copy the text from the Notepad file into the WinBUGS window.
Note: Do not copy text from Word to WinBUGS directly or you may
get an error message.
Next, select ‘Model’ (in the WinBUGS toolbar) and then ‘Specification’.
Then highlight the word ‘model’ (in the BUGS code above) and click on
‘check model’ in the ‘Specification Tool’.
Then highlight the first word ‘list’, click on ‘load data’ and click on
‘compile’.
Then highlight the second word ‘list’, click on ‘load inits’ and click on
‘gen inits’.
Next, select ‘Inference’ and then ‘Samples’. Then, in the ‘Sample Monitor
Tool’ which appears, type ‘mu’ in the ‘node’ box, click ‘set’, type ‘gam’
in the ‘node’ box and click ‘set’ again.
In the ‘Update Tool’ which appears, change ‘1000’ to ‘1500’ and click
‘update’. This will implement 1,500 iterations of an MCMC algorithm.
Next type ‘*’ (an asterisk) in the ‘node’ box, change ‘1’ to ‘501’ in the
‘beg’ box (meaning beginning) and click ‘stats’ (statistics).
This should produce something similar to what is shown in Figure 8.4 and
Table 8.1.
From these results, we see that the posterior mean and 95% posterior
interval for µ are about 2.64 and (0.94, 4.31), and the same quantities
for γ are about 1.54 and (0.38, 2.91).
The posterior mean and CPDR for γ do not have such simple formulae.
To see line plots of the simulated values, click on ‘history’ (in the ‘Sample
Monitor Tool’), and to view smoothed histograms of them, click ‘density’.
Figure 8.5 illustrates.
Clicking 'coda' in the 'Sample Monitor Tool' produces two boxes of output.
The first box, 'CODA index', should look as follows:
gam 1 1000
mu 1001 2000
The other box, called 'CODA for chain 1', should have two columns and
2,000 rows and look as follows:
501 1.298
502 1.307
503 1.478
.......................
1498 0.8303
1499 1.993
1500 2.326
501 1.812
502 1.999
503 2.8
......................
1498 1.628
1499 2.161
1500 2.748
Next, copy the contents of ‘CODA for chain 1’ into a Notepad file called
‘out.txt’ (say). Save that file somewhere, e.g. onto the desktop.
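The R code that reads this file back into R is not reproduced in full below;
a minimal sketch, assuming 'out.txt' has been saved to the R working directory, is:
out  <- read.table("out.txt")    # 2,000 rows and 2 columns (iteration, value)
gamv <- out[1:1000, 2]           # rows 1-1000 are 'gam' (see the CODA index)
muv  <- out[1001:2000, 2]        # rows 1001-2000 are 'mu'
c(mean(muv), quantile(muv, c(0.025, 0.975)))   # posterior mean and 95% CPDR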
dim(out) # 2000 2
One can then use the MCMC output in many other ways, e.g. to simulate
from a posterior predictive distribution via the method of composition.
For more information on BUGS, click on ‘Help’ and ‘User manual’ in the
toolbar. Also see ‘Examples Vol I’ and ‘Examples Vol II’ for several
dozen worked examples in BUGS. The examples are very user-friendly.
They contain data, code and everything one needs to reproduce the results
shown. Figure 8.7 shows various excerpts from these files.
x_i (= i)    1       2       3       4       5
yi 5.879 8.54 14.12 13.14 15.26
i 6 7 8 9 10
yi 20.43 19.92 18.47 21.63 24.11
Using the following WinBUGS code, we obtain the results in Table 8.3:
model{
for(i in 1:n){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
lam ~ dgamma(0.001,0.001)
}
# data
list(n = 10, x = c(1,2,3,4,5,6,7,8,9,10), y=c(5.879,8.54,14.12,
13.14,15.26,20.43,19.92,18.47,21.63,24.11))
# inits
list(a=0,b=0,lam=1)
Using the results in Table 8.3, we estimate a by 6.039 with 95% CPDR
(2.955, 9.107), and we estimate b by 1.836 with 95% CPDR (1.342, 2.334).
It may be noted that these results are very similar to those obtained via
classical techniques in an earlier exercise: 6.051 and (2.973, 9.128) for a,
and 1.836 and (1.340, 2.332) for b.
Figure 8.8 shows trace plots and density estimates produced as part of the
WinBUGS output.
Consider the data in Table 8.4, which is the same as in Table 7.2 of
Exercise 7.3 (where, for example, in Experiment 3 a total of 40 rats were
exposed to radiation for 3.6 hours, and 23 of them died within one month).
i    n_i    x_i    y_i    y_i/n_i = p̂_i
1 10 0.1 1 1/10 = 0.1
2 30 1.4 0 0/30 = 0
3 40 3.6 23 23/40 = 0.575
4 20 3.8 12 12/20 = 0.6
5 15 5.2 8 8/15 = 0.5333
6 46 6.1 32 32/46 = 0.696
7 12 8.7 10 10/12 = 0.833
8 37 9.1 35 35/37 = 0.946
9 23 9.1 19 19/23 = 0.826
10 8 13.6 8 8/8 = 1
In your results, also include inference on LD50, the dose at which 50% of
rats will die (= −a/b), and on d, defined as the number of rats that will die
out of 20 that are exposed to five hours of radiation.
model
{
for(i in 1:N){
z[i] <- a + b*x[i]
logit(p[i])<- z[i]
y[i] ~ dbin(p[i],n[i])
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
logit(p5) <- a+5*b
d ~ dbin(p5,20)
LD50 <- -a/b
}
# data
list(N=10,n=c(10,30,40,20,15,46,12,37,23,8),
x=c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6),
y=c(1,0,23,12,8,32,10,35,19,8))
# inits
list(a=0,b=0)
These results are very similar to those obtained via classical techniques in
Exercise 7.3, namely –2.156 and (–3.000, –1.311) for a, etc.
Figure 8.9 shows some traces and density estimates produced as part of
the WinBUGS output. Here, ‘p5’ represents the probability of a rat dying
within one month if exposed to five hours of radiation. We chose to
monitor this node so as to estimate its posterior density.
Suppose that n = 20 data values from this model with a = 0.6 and
b = 0.8 are as shown in Table 8.6 (which is the same as Table 7.6 in
Exercise 7.5).
i 1 2 3 4 5
yi 0.7846 0.7572 0.6381 0.7626 0.6105
i 6 7 8 9 10
yi 0.6990 0.7728 0.7113 0.7314 0.7435
i 11 12 13 14 15
yi 0.6324 0.7072 0.7493 0.7979 0.6182
i 16 17 18 19 20
yi 0.7652 0.7883 0.7194 0.6211 0.6054
Applying the following WinBUGS code we obtain the results in Table 8.7:
model
{
for(i in 1:n){ y[i] ~ dunif(a,b) }
b ~ dunif(0,1)
a ~ dunif(0,b)
m <- (a+b)/2
}
# data
list(n=20, y=c(0.7846,0.7572,0.6381,0.7626,0.6105,0.6990,0.7728,0.7113,0.7314,0.7435,
0.6324,0.7072,0.7493,0.7979,0.6182,0.7652,0.7883,0.7194,0.6211,0.6054))
# inits
list(a=0.1, b=0.9)
0.7016 +c(-1,1)*qnorm(0.975)*0.0001388
0.7016 +c(-1,1)*qnorm(0.975)*0.008201/sqrt(10000)
in place of the corresponding row of Table 8.7. Then, the 95% CI for
m’s posterior mean becomes (0.7009, 0.7023), obtained via
0.7016 +c(-1,1)*qnorm(0.975)*0.0003573
This CI has a width of 0.0014, which is greater than 0.0006, the width
of (0.7013, 0.7019), and closer to 0.0011, the width of the CI in Note 2.
Figure 8.10 shows some traces and density estimates produced as part of
the WinBUGS output.
To call WinBUGS from within R, first install the R2WinBUGS package by typing
install.packages("R2WinBUGS") at the R prompt.
Note: You must have a connection to the internet for this to work. This
command is required only once for each installed version of R.
Then type
library("R2WinBUGS")
Note: This loads the necessary functions and must be done at the
beginning of each R session in which WinBUGS is to be called.
model
mu ~ dnorm(0,0.0001)
y <- c(2.4,1.2,5.3,1.1,3.9,2.0)
n <- length(y)
model.file= "C:/R-3.0.1/BugsCode1.txt",
bugs.directory = "C:/WinBUGS14/",
working.directory = "C:/R-3.0.1/BugsOut/")
This sets things up, starts WinBUGS, runs the BUGS code, closes
WinBUGS, and creates a number of files in the working directory, similar
to the ones shown in Figure 8.11.
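The listing above is abbreviated. A minimal sketch of a complete call, using
standard R2WinBUGS arguments (the iteration settings shown are illustrative
assumptions, not necessarily those used by the author), is:
library(R2WinBUGS)
sim <- bugs(
  data  = list(y=c(2.4,1.2,5.3,1.1,3.9,2.0), n=6),
  inits = list(list(tau=1)),               # one chain; mu generated by WinBUGS
  parameters.to.save = c("mu","gam"),
  model.file = "C:/R-3.0.1/BugsCode1.txt",
  n.chains = 1, n.iter = 1500, n.burnin = 500,
  bugs.directory = "C:/WinBUGS14/",
  working.directory = "C:/R-3.0.1/BugsOut/")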
These files contain information which can then be accessed within R, for
example as follows:
print(sim,digits=4)
par(mfrow=c(2,1))
hist(sim$sims.list$mu, breaks=20)
hist(sim$sims.list$gam, breaks=20)
After typing these commands, you should see two histograms similar to
the ones shown in Figure 8.12. For more information on the bugs()
function, simply type
help(bugs)
Note: If your WinBUGS code has an error, the procedure will crash,
with little to tell you what went wrong. In that case, first iron out any
‘bugs’ directly in WinBUGS, and only then run your WinBUGS code
in R, as above.
Using classical methods, fit a suitable ARIMA model to this time series.
Then forecast the time series forward for one up to twelve quarters.
Then repeat your analysis and forecasts using WinBUGS called from R.
Figure 8.13 shows plots of the original time series x_t, its logarithm
(showing stabilised variability), the difference of the logarithm (showing
a removal of the trend), and yt , the fourth seasonal difference of the first
difference of the logarithm (showing that seasonality has been removed).
The last two (bottom) plots are the sample ACF and sample PACF for yt .
The last two plots in Figure 8.13 suggest SAR(1) or SMA(1) processes.
Both fits pass standard diagnostic checks, the second being marginally
better. Figure 8.14 shows some diagnostic plots for the SMA(1) fit (see
the R Code below for further details).
The chosen SMA(1) model for the TIAP time series xt may be expressed
by writing
yt = ∇4∇ log xt ,
where
y_t = w_t + Θ_1 w_{t−4},  w_t ~ iid N(0, σ²). Here σ̂² = 0.0013.
Figure 8.15 shows the time series xt plus predictions 12 quarters ahead
based on the above fitted model. The dashed lines show the 95%
prediction interval at each of the 12 future times points. (See the R code
below for details regarding all calculations.)
We now fit the same model to the time series but using MCMC via
WinBUGS called from R. Some graphical output from the WinBUGS run
is shown in Figure 8.16. (See the code below for details.)
To compare the classical and Bayesian analyses, we combine the two sets
of forecasts into a single plot, as shown in Figure 8.18 (page 399). Figure
8.19 (page 399) is a detail in Figure 8.18.
We see from Figures 8.18 and 8.19 that the two approaches to inference
have yielded very similar results, at least as regards prediction.
The Bayesian approach has produced 95% prediction intervals which are
slightly wider than those obtained via the classical approach.
It may be argued that such wider intervals are more appropriate, since the
classical approach makes forecasts without taking into account any
uncertainty in the parameter estimates.
To conclude, we report that the fitted model for the TIAP time series xt
is given by
yt = ∇4∇ log xt ,
with
ŷ_t = ŵ_t + Θ̂_1 ŵ_{t−4},  ŵ_t ~ iid N(0, σ̂²),
where σ̂² = 0.0013 via the classical analysis, and σ̂² = 0.0015 via the
Bayesian analysis.
# Classical analysis in R
# ==========================================================
x <-
c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474,
544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742,
854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020,
1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283 )
n <- length(x); n # 48
X11(w=8,h=9); par(mfrow=c(3,2))
plot(x,type="l"); abline(v=seq(0,48,4),h=seq(0,2000,100), lty=3)
plot(log(x),type="l"); abline(v=seq(0,48,4), lty=3)
plot(diff(log(x)),type="l"); abline(v=seq(0,48,4), lty=3)
plot(diff(diff(log(x),lag=4)),type="l"); abline(v=seq(0,48,4), lty=3)
y <- diff(diff(log(x),lag=4))
acf(y, lag=24)
pacf(y,lag=24)
tsdiag(fit1); fit1
# sar1
# -0.4990
# s.e. 0.1417
# sigma^2 estimated as 0.001310: log lik. = 81.12, aic = -158.24
X11(w=8,h=5); par(mfrow=c(2,2))
acf(fit$resid, lag=24)
pacf(fit$resid, lag=24)
qqnorm(fit$resid)
hist(fit$resid, nclass=12)
# ----------------------------------------------------------------------
model
{
for(t in 1:n) { z[t] <- log(x[t]) }
for(t in 1:5){ y[t] <- 0; w[t] ~ dnorm(0,tau) }
for(t in 6:n){ y[t] <- z[t] - z[t-1] - z[t-4] + z[t-5] }
for(t in 6:N){ # N=n+12=60
m[t] <- Phi1*w[t-4]
y[t] ~ dnorm(m[t],tau)
w[t] <- y[t] - m[t]
}
tau ~ dgamma(0.001,0.001)
Phi1dum ~ dbeta(1,1); Phi1 <- 2*Phi1dum-1
for(k in 1:12) {
z[n+k] <- z[n+k-1] + z[n+k-4] - z[n+k-5] + y[n+k]
x[n+k] <- exp(z[n+k])
}
sig2 <- 1/tau
}
# ----------------------------------------------------------------------
x <- c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474,
544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742,
854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020,
1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283,
NA,NA,NA,NA, NA,NA,NA,NA, NA,NA,NA,NA)
# This starts WinBUGS, runs the BUGS code for 6000 iterations, closes
# WinBUGS, and creates a number of files in the working directory. These
# files contain information which can also be accessed within R, as follows.
print(sim,digits=4)
par(mfrow=c(2,2))
hist(Phi1v, breaks=20); hist(sig2v, breaks=20)
hist(xm[,1], breaks=20); hist(xm[,2], breaks=20)
# Let’s now make the forecasts of the series using the BUGS output.
xF2 <- xF; xL2 <- xL; xU2 <- xU; for(t in 1:12){
xF2[t] <- mean(xm[,t])
xL2[t] <- quantile(xm[,t], 0.025)
xU2[t] <- quantile(xm[,t], 0.975) } # Calc. estimates
X11(h=5); par(mfrow=c(1,1));
par(mfrow=c(1,1))
plot(c(40,60),c(1000,3500), type="n", xlab="t", ylab="xt")
lines(x, lwd=2); points(x, lwd=2)
points((n+1):(n+12), xF, pch=16, cex=1.5, col="red");
lines(n:(n+12), c(x[n],xF), lty=1,lwd=2, col="red")
lines((n+1):(n+12), xL, lty=1, lwd=2, col="red")
lines((n+1):(n+12), xU, lty=1, lwd=2, col="red")
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3)
points((n+1):(n+12), xF2, pch=16, cex=1.5, col="blue" );
lines(n:(n+12), c(x[n],xF2), lty=2,lwd=2, col="blue")
lines((n+1):(n+12), xL2, lty=2, lwd=2, col="blue")
lines((n+1):(n+12), xU2, lty=2, lwd=2, col="blue")
legend(40,3000,c("Classical","Bayesian"), lty=c(1,2),
lwd=c(2,2), col=c("red", "blue"), bg="white" )
CHAPTER 9
Bayesian Finite Population Theory
9.1 Introduction
In this chapter we will focus on the topic of Bayesian methods for finite
population inference in the sample survey context. We have previously
touched on this topic when considering posterior predictive inference of
‘future’ values in the context of the normal-normal-gamma model. The
topic will now be treated more generally and systematically.
There are many and various ways in which Bayesian finite population
inference can be categorised, for example:
Each of these categories can in turn be broken down further. For example,
Monte Carlo based techniques may or may not require Markov chain
Monte Carlo methods for generating the sample required for inference.
We see there is potentially a vast subject ground to cover.
Suppose that n units are selected from the finite population without
replacement.
Let s = ( s1 ,..., sn ) be the vector of the ordered labels of the sampled units.
Define ys = ( ys1 ,..., ysn ) to be the sample vector, and likewise define
yr = ( yr1 ,..., yrm ) to be the nonsample vector.
Also, the population vector may sometimes be written using upper case
letters, as Y = (Y1 ,..., YN ) or Y = (Y1 ,..., YN )′ . For the remainder of this
chapter, these alternative notations will not be used.
Also suppose that a sample of size n is drawn from the finite population
without replacement according to some probability distribution for s.
Note 1: The values of s and r here are fixed at their observed values
defined by the data. Thus, given D = ( s, ys ) , we may always express
Q = g ( y,θ ) as h(( ys , yr ),θ ) for some function h (which will in many
cases be the same function as g), and there should be no ambiguity in
the meaning of quantities such as f ( ys , yr | θ ) in (9.1).
In this case it may be necessary to modify the notation to account for the
number of distinct units sampled, previously the fixed constant n, due to
the possibility of multiple selections under sampling with replacement.
is the sample space for s (the set of all possible combinations of n integers
taken from N).
In this case, f ( s | y,θ ) does not depend on y or θ at all and so may also
be written simply as f ( s ) . This then guarantees that
f(s | y_s, y_r, θ) = f(s) = (N choose n)^{−1}
at the single observed value of s, whatever that value may be.
This result tells us that f (Q | D ) will be the same when the sampling
mechanism density f ( s | y s , yr , θ ) is ‘ignored’ in the model, so to speak.
f(θ | D) ∝ f(θ, s, y_s) = ∫ f(θ, s, y_r, y_s) dy_r
 = f(θ) ∫ f(y_s, y_r | θ) f(s | y_s, y_r, θ) dy_r.
Since the sampling mechanism density is a constant here, this becomes
f(θ | D) ∝ f(θ) ∫ f(y_s, y_r | θ) × 1 dy_r
 = f(θ) f(y_s | θ) ∫ f(y_r | y_s, θ) dy_r   (since f(y_s, y_r | θ) = f(y_s | θ) f(y_r | θ, y_s))
 = f(θ) f(y_s | θ).
Similarly, f(y_r | D) ∝ f(y_r, s, y_s) = ∫ f(θ, s, y_r, y_s) dθ
 = ∫ f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ) dθ.
With the constant sampling mechanism density this becomes
f(y_r | D) ∝ ∫ f(θ) f(y_s, y_r | θ) × 1 dθ
 = ∫ f(y_r | y_s, θ) f(θ) f(y_s | θ) dθ
 ∝ ∫ f(y_r | y_s, θ) f(θ | y_s) dθ,
since f(θ | y_s) ∝ f(θ) f(y_s | θ).
(b) Find the predictive distribution of the finite population total, namely
yT = y1 + ... + yN .
f(y_s | θ) = ∏_{i∈s} θ^{y_i}(1 − θ)^{1−y_i} = θ²  (since n = 2 and y_i = 1 for all i ∈ s)
 = (1/4)² = 1/16 when θ = 1/4
 = (1/2)² = 4/16 when θ = 1/2.
It follows that f(θ | D) = 1/5 for θ = 1/4 and 4/5 for θ = 1/2.
(b) Next, observe that
f(y_r | D, θ) = (1 − θ)² for y_r = (0,0),
 = (1 − θ)θ for y_r = (0,1),
 = θ(1 − θ) for y_r = (1,0),
 = θ² for y_r = (1,1).
So f(y_r | D) = Σ_θ f(y_r | D, θ) f(θ | D)
 = (1 − 1/4)²(1/5) + (1 − 1/2)²(4/5) = 25/80 for y_r = (0,0)
 = (1 − 1/4)(1/4)(1/5) + (1 − 1/2)(1/2)(4/5) = 19/80 for y_r = (0,1)
 = (1/4)(1 − 1/4)(1/5) + (1/2)(1 − 1/2)(4/5) = 19/80 for y_r = (1,0)
 = (1/4)²(1/5) + (1/2)²(4/5) = 17/80 for y_r = (1,1).
Therefore f(y_rT | D) = 25/80 for y_rT = 0, 38/80 for y_rT = 1, and 17/80 for y_rT = 2.
(b) Find the predictive distribution of the finite population total, namely
yT = y1 + ... + yN
In this case the sampling mechanism is nonignorable and the first thing
we should do is determine the exact form of the sampling density of
s = ( s1 , s2 ) . Now,
f(s | y, θ) = c y_sT = c(y_{s1} + y_{s2})
for some constant c such that
1 = Σ_s f(s | y, θ)
 = c{(y_1 + y_2) + (y_1 + y_3) + (y_1 + y_4) + (y_2 + y_3) + (y_2 + y_4) + (y_3 + y_4)}
 = c{3(y_1 + y_2 + y_3 + y_4)} = 3c y_T.
Hence c = 1/(3y_T) and f(s | y, θ) = y_sT/(3y_T).
Note 2: This formula is only true when the finite population total yT is
positive, i.e. when at least one of y1 ,..., y N is nonzero. In the case where
all population values are zero, we have that y sT = ys1 + ys2 = 0 for all
possible samples s = ( s1 , s2 ) , and consequently f ( s | y, θ ) ∝ 0 , which
must be understood to mean that that no sample actually gets drawn. The
fact that a sample has been observed implies f ( s | y, θ ) > 0 for at least
one value of s, which implies that at least one population value is
positive, which in turn implies that yT > 0. This would be true even if
all the sample values were zero; but as it happens, at least one of them
is positive (in fact both are), which in itself implies that yT > 0 .
We may now work out the joint density of all quantities in the model:
f(θ, y_s, y_r, s) = f(θ) f(y_s | θ) f(y_r | θ) f(s | y_s, y_r, θ)
 = (1/2) × ∏_{i∈s} θ^{y_i}(1 − θ)^{1−y_i} × ∏_{i∈r} θ^{y_i}(1 − θ)^{1−y_i} × (y_{s1} + y_{s2})/(3y_T)
 = (1/2) × θ² × θ^{y_1+y_3}(1 − θ)^{2−y_1−y_3} × (1 + 1)/{3(y_1 + 1 + y_3 + 1)}
 ∝ θ^{2+y_1+y_3}(1 − θ)^{2−y_1−y_3} / (2 + y_1 + y_3).
Then
f(θ | D) ∝ f(θ, s, y_s) = Σ_{y_r} f(θ, s, y_s, y_r)
 ∝ θ²(1 − θ)² Σ_{y_1=0}^{1} Σ_{y_3=0}^{1} {θ/(1 − θ)}^{y_1+y_3} / (2 + y_1 + y_3)
 = θ²(1 − θ)² { 1/2 + (1/3)(θ/(1 − θ)) + (1/3)(θ/(1 − θ)) + (1/4)(θ/(1 − θ))² }
 = (1/12){ 6θ²(1 − θ)² + 8θ³(1 − θ) + 3θ⁴ }
 = {6(9) + 8(3) + 3(1)}/(12 × 256) = 81/(12 × 256) when θ = 1/4
 = {6(16) + 8(16) + 3(16)}/(12 × 256) = 272/(12 × 256) when θ = 2/4,
i.e. f(θ | D) ∝ 81 for θ = 1/4 and 272 for θ = 2/4.
After normalising (81 + 272 = 353), f(θ | D) = 81/353 = 0.22946 for θ = 1/4 and
272/353 = 0.77054 for θ = 2/4.
Likewise, the predictive density of the nonsample vector y_r = (y_1, y_3) is
f(y_r | D) ∝ f(y_r, s, y_s) = Σ_θ f(θ, s, y_s, y_r)
 ∝ Σ_{θ=1/4, 2/4} θ^{2+y_1+y_3}(1 − θ)^{2−y_1−y_3} / (2 + y_1 + y_3)
 = { (1/4)^{2+y_1+y_3}(3/4)^{2−y_1−y_3} + (2/4)^{2+y_1+y_3}(2/4)^{2−y_1−y_3} } / (2 + y_1 + y_3)
 = (16 + 3^{2−y_1−y_3}) / {256(2 + y_1 + y_3)}
 ∝ (16 + 3^{2−y_1−y_3}) / (2 + y_1 + y_3)
 = (16 + 9)/2 = 150/12 for (y_1, y_3) = (0,0)
 = (16 + 3)/3 = 76/12 for (y_1, y_3) = (0,1)
 = (16 + 3)/3 = 76/12 for (y_1, y_3) = (1,0)
 = (16 + 1)/4 = 51/12 for (y_1, y_3) = (1,1),
i.e. f(y_r | D) ∝ 150, 76, 76 and 51, respectively.
After normalising (150 + 76 + 76 + 51 = 353), f(y_r | D) = 150/353, 76/353, 76/353
and 51/353 for y_r = (0,0), (0,1), (1,0) and (1,1), respectively.
For y_r = (0,0):
f(θ | y_s, y_r, s) ∝ θ^{2+0+0}(1 − θ)^{2−0−0} = (1/4)²(3/4)² = 9/256 when θ = 1/4,
 and = (2/4)²(2/4)² = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 9/25 for θ = 1/4 and 16/25 for θ = 1/2.
For y_r = (0,1):
f(θ | y_s, y_r, s) ∝ θ^{2+0+1}(1 − θ)^{2−0−1} = (1/4)³(3/4) = 3/256 when θ = 1/4,
 and = (2/4)⁴ = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 3/19 for θ = 1/4 and 16/19 for θ = 1/2.
For y_r = (1,0):
f(θ | y_s, y_r, s) ∝ θ³(1 − θ) = 3/256 when θ = 1/4, and = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 3/19 for θ = 1/4 and 16/19 for θ = 1/2.
For y_r = (1,1):
f(θ | y_s, y_r, s) ∝ θ^{2+1+1}(1 − θ)^{2−1−1} = (1/4)⁴ = 1/256 when θ = 1/4,
 and = (2/4)⁴ = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 1/17 for θ = 1/4 and 16/17 for θ = 1/2.
Now,
f(θ | y_s, s) = Σ_{y_r} f(θ, y_r | y_s, s) = Σ_{y_r} f(θ | y_s, y_r, s) f(y_r | y_s, s).
Thus
f(θ = 1/4 | y_s, s) = Σ_{y_r} f(θ = 1/4 | y_s, y_r, s) f(y_r | y_s, s)
 = (9/25)(150/353) + (3/19)(76/353) + (3/19)(76/353) + (1/17)(51/353) = 0.22946
f(θ = 1/2 | y_s, s) = Σ_{y_r} f(θ = 1/2 | y_s, y_r, s) f(y_r | y_s, s)
 = (16/25)(150/353) + (16/19)(76/353) + (16/19)(76/353) + (16/17)(51/353) = 0.77054.
These results are all in agreement with those obtained in (a) using a
different approach.
(d) (i) The probability of selecting unit i into the sample given y and θ has
the same form for each i. Considering i = 1,
P(1 ∈ s | y, θ) = Σ_{s: 1∈s} f(s | y, θ) = {(y_1+y_2) + (y_1+y_3) + (y_1+y_4)}/(3y_T)
 = (y_T + 2y_1)/(3y_T) = 1/3 + 2y_1/(3y_T),
assuming that y_T > 0; otherwise, P(1 ∈ s | y, θ) = 0.
Thus, for each i = 1, ..., 4 we have that
P(i ∈ s | y, θ) = 1/3 + 2y_i/(3y_T) if y_T > 0, and = 0 if y_T = 0.
(ii) The answer is yes, assuming that y is such that y_T > 0; in that case,
Σ_{i=1}^{N} P(i ∈ s | y, θ) = Σ_{i=1}^{4} {1/3 + 2y_i/(3y_T)}
 = 4/3 + 2(y_1 + y_2 + y_3 + y_4)/(3y_T) = 4/3 + 2/3 = 2 = n.
(iii) The probability of selecting unit i into the sample given θ is the same
for all i, in particular i = 1, and so may be written
P(i ∈ s | θ) = P(1 ∈ s | θ) = Σ_y P(1 ∈ s | θ, y) f(y | θ)
 = 0 × P(y = (0,0,0,0) | θ) + Σ_{y: y_T>0} {1/3 + 2y_1/(3y_T)} ∏_{i=1}^{4} θ^{y_i}(1 − θ)^{1−y_i}
 = Σ_{y: y_T>0} {1/3 + 2y_1/(3y_T)} θ^{y_T}(1 − θ)^{4−y_T}
 = 0.34180 when θ = 1/4, and 0.46875 when θ = 1/2.
(iv) The unconditional probability that any particular population unit i will
be selected into the sample is
P(i ∈ s) = Σ_θ P(i ∈ s | θ) f(θ) = 0.34180 × (1/2) + 0.46875 × (1/2) = 0.40527.
The first of these quantities is Σ_{i=1}^{4} P(i ∈ s) = 4 × 0.40527 = 1.6211.
options(digits=5)
kern=function(th,yr){ th^(2+sum(yr))*(1-th)^(2-sum(yr))/(2+sum(yr)) }
postyr =c(kernyr00,kernyr01,kernyr10,kernyr11)/
(kernyr00+kernyr01+kernyr10+kernyr11)
postyr # 0.42493 0.21530 0.21530 0.14448
# (c)
# (d)
ymat
# [1,] 0 0 0 0
# [2,] 0 0 0 1
# ...............................
# [15,] 1 1 1 0
# [16,] 1 1 1 1
(b) Find the predictive distribution of the finite population mean, namely
ȳ = (y_1 + ... + y_N)/N.
The easiest way to do this exercise is to first identify eight equally likely
possibilities to start with. These possibilities are:
1. θ = 0, y = (0,0,0,0) with ȳ = 0
2. θ = 0, y = (0,0,0,1) with ȳ = 1/4
3. θ = 0, y = (0,0,1,1) with ȳ = 1/2
4. θ = 0, y = (0,1,1,1) with ȳ = 3/4
5. θ = 1, y = (1,1,1,1) with ȳ = 1
6. θ = 1, y = (1,1,1,0) with ȳ = 3/4
7. θ = 1, y = (1,1,0,0) with ȳ = 1/2
8. θ = 1, y = (1,0,0,0) with ȳ = 1/4.
After observing y_s = (y_2, y_3) = (1,1), there are only three possibilities
remaining (numbers 4, 5 and 6 in the list).
Alternative solution
The above results can also be obtained by working through in the style of
the solutions to previous exercises, as follows. Before the data is observed,
the Bayesian model may be written:
f(s | y, θ) = (N choose n)^{−1} = (4 choose 2)^{−1} = 1/6,
 s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
f(y | θ) = 1/4,  y = (θ,θ,θ,θ), (θ,θ,θ,1−θ), (θ,θ,1−θ,1−θ), (θ,1−θ,1−θ,1−θ)
f(θ) = 1/2,  θ = 0, 1  (the prior density of the parameter).
f(θ, s, y) = f(θ, s, y_s, y_r) = f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
 = {I(θ ∈ {0,1})/2} × {I(y = (0,1,1,1), θ = 0) + I(y ∈ {(1,1,1,1),(1,1,1,0)}, θ = 1)}/4 × (1/6)
 ∝_{θ, y_r} I(y_r = (0,1), θ = 0) + I(y_r ∈ {(1,1),(1,0)}, θ = 1).
Then
f(θ | D) ∝ f(θ, s, y_s) = Σ_{y_r} f(θ, s, y)
 ∝ Σ_{y_r} I(y_r = (0,1)) = 1 for θ = 0,
 and Σ_{y_r} I(y_r ∈ {(1,1),(1,0)}) = 2 for θ = 1.
After normalising, we see that f(θ | D) = 1/3 for θ = 0 and 2/3 for θ = 1.
(b) Also,
f(y_r | D) ∝ f(y_r, s, y_s) = Σ_θ f(θ, s, y_s, y_r)
 = Σ_{θ=0}^{1} {I(y_r = (0,1), θ = 0) + I(y_r ∈ {(1,1),(1,0)}, θ = 1)}
 = 1 for each of y_r = (0,1), (1,1) and (1,0).
Consequently, f(y | D) = 1/3 for y = (0,1,1,1), (1,1,1,1) and (1,1,1,0).
Now, the values of y listed here as possible given the observed data have
means 3/4, 1 and 3/4, respectively.
Find the posterior distribution of the Poisson mean and the predictive
distribution of the nonsample total.
Also find these distributions under the (false) assumption that the
sampling is SRSWR.
Then create two plots which suitably compare the four distributions
indicated above.
Note: The concepts here involve a biased sampling mechanism and are
relevant to on-site sampling, where for example we wish to estimate the
total number of times that visitors (or potential visitors) to a recreational
park actually visit there in some specified time period.
r = ( r1 ,..., rm ) is the vector of the labels of the m units that are not sampled.
Since we are interested in the nonsample values only by way of their total
yrT , a suitable Bayesian finite population model in this context is:
f(I | y, λ) = {n!/∏_{i=1}^{N} I_i!} ∏_{i=1}^{N} (y_i/y_T)^{I_i},
 I ∈ {(a_1, ..., a_N): a_i ∈ {0,1,...,n} ∀ i, a_1 + ... + a_N = n}
f(y_s, y_rT | λ) = ∏_{i∈s} {e^{−λ} λ^{y_i}/y_i!} × e^{−mλ}(mλ)^{y_rT}/y_rT!
λ ~ G(η, τ).
Note: The probability of sampling unit 2 once and unit 4 twice (as is
assumed to have occurred) equals
(y_2/y_T)(y_4/y_T)(y_4/y_T) + (y_4/y_T)(y_2/y_T)(y_4/y_T) + (y_4/y_T)(y_4/y_T)(y_2/y_T)
 = {3!/(1! 2!)} (y_2/y_T)^1 (y_4/y_T)^2,
which is of the general form
f(I | y, λ) = {n!/∏_{i=1}^{N} I_i!} ∏_{i=1}^{N} (y_i/y_T)^{I_i}.
For this exercise we will first derive the predictive distribution of yrT and
then use this to obtain the posterior distribution of λ only afterwards. The
predictive density of yrT is
f(y_rT | D) ∝ ∫ f(y_rT, y_s, I, λ) dλ
 = ∫ f(λ) f(y_s | λ) f(y_rT | λ) f(I | y_s, y_rT, λ) dλ
 ∝ ∫_0^∞ λ^{η−1} e^{−τλ} × ∏_{i∈s} e^{−λ} λ^{y_i} × {e^{−mλ}(mλ)^{y_rT}/y_rT!} × (1/y_T^n) dλ
   (note that ∏_{i=1}^{N} (1/y_T)^{I_i} = 1/y_T^n)
 = (m^{y_rT}/y_rT!) (1/y_T^n) ∫_0^∞ λ^{η + y_sT + y_rT − 1} e^{−λ(τ + d + m)} dλ
   (where d is the number of distinct units in the sample, so that τ + d + m = τ + N)
 = (m^{y_rT}/y_rT!) × (1/y_T^n) × Γ(η + y_sT + y_rT)/(τ + d + m)^{η + y_sT + y_rT}.
Thus
f(y_rT | D) = k(y_rT)/c,  y_rT = 0, 1, 2, ...,
where
k(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / {y_rT! (y_sT + y_rT)^n}
and
c = Σ_{y_rT=0}^{∞} k(y_rT).
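In practice the infinite sum defining c can be evaluated by truncating it at a
suitably large value of y_rT. A minimal R sketch follows; the parameter values
shown are illustrative assumptions only (the actual values of N, d, n, η, τ and
y_sT come from the exercise specification), but the vectors yrTv and fv
correspond to those used in the R code later in this solution.
N <- 20; d <- 2; n <- 3; eta <- 0.01; tau <- 0.01; ysT <- 12   # assumed values
yrTv <- 0:200                                  # truncation point chosen large
logk <- yrTv*log((N-d)/(N+tau)) + lgamma(eta+ysT+yrTv) -
        lfactorial(yrTv) - n*log(ysT+yrTv)     # log k(yrT), for numerical stability
fv <- exp(logk - max(logk)); fv <- fv/sum(fv)  # normalised f(yrT | D)
sum(yrTv*fv)                                   # predictive mean of yrT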
Using the predictive density of yrT , we can now obtain the posterior
density of λ as
f(λ | D) = Σ_{y_rT=0}^{∞} f(λ | D, y_rT) f(y_rT | D),
where
(λ | D, y_rT) ~ G(η + y_sT + y_rT, τ + N).
Under the (false) assumption that the sampling is SRSWR, the sampling mechanism
density does not involve the y-values, and
f(λ | D) ∝ Σ_{y_rT} f(y_rT, y_s, I, λ).
The result is then almost the same as before, the only difference being that
the term
∏_{i=1}^{N} y_T^{I_i} = y_T^n = (y_sT + y_rT)^n
in
k(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / {y_rT! (y_sT + y_rT)^n}
is replaced by 1. That is, k(y_rT) is replaced by
K(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / (y_rT! × 1)
and c by
C = Σ_{y_rT=0}^{∞} K(y_rT).
We see that the inference under the assumption of length-bias is the lower
of the two. This is because it appropriately corrects for large finite
population values being more likely to be selected. If we ‘ignore’ the fact
that large values are more likely to be selected, then we will erroneously
over-estimate the superpopulation mean, λ .
Figure 9.2 shows the predictive density f ( yrT | D ) , again under the two
assumptions.
lamv=seq(0,10,0.01); lamfv=lamv
for(i in 1:length(lamv)) lamfv[i]=sum(fv*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
plot(lamv,lamfv,type="l", lty=1, lwd=3,
xlab="lambda",ylab="posterior density", main=" ")
lamfvigno=lamv
for(i in 1:length(lamv))
lamfvigno[i]=sum(fvigno*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
# lines(lamv,lamfvigno,lty=2,lwd=1) # Can do as a check on calculations
lines(lamv,dgamma(lamv,eta+ysT,tau+d),lty=2,lwd=3)
legend(4,0.5,c("Length-bias assumed (Inference is correct)",
"SRSWR assumed (Inference is too high)"),lty=c(1,2),lwd=c(3,3))
Units 3 and 5 are selected, and their values are 1.6 and 0.4, respectively.
Find and sketch the posterior density of the superpopulation mean µ and
the predictive density of the finite population mean y under each of the
following specifications:
Note: Here, the sample size n is not fixed and is a random variable.
f(I | y, λ) = ∏_{i=1}^{N} π_i^{I_i} (1 − π_i)^{1−I_i}
f(y | λ) = ∏_{i=1}^{N} λ e^{−λ y_i},  y_i > 0 ∀ i
f(λ) ∝ 1/λ,  λ > 0.
Here,
π_1 = ... = π_N = 0.3,
and the data is
D = (I, y_s) = ((0,0,1,0,1,0,0), (1.6, 0.4)),
with n = 2 (the achieved sample size).
Next,
(y_rT | λ) ~ G(m, λ),
where m = N − n = 7 − 2 = 5.
It follows that
f(y_rT | D) = ∫ f(y_rT | D, λ) f(λ | D) dλ
 ∝ ∫_0^∞ λ^m y_rT^{m−1} e^{−λ y_rT} × λ^{n−1} e^{−λ y_sT} dλ
 = y_rT^{m−1} ∫_0^∞ λ^{n+m−1} e^{−λ(y_rT + y_sT)} dλ
 = y_rT^{m−1} Γ(n + m)/(y_rT + y_sT)^{n+m}
 ∝ y_rT^{m−1}/(y_rT + y_sT)^N,  y_rT > 0.
Hence
f(ȳ | D) ∝ (ȳ − nȳ_s/N)^{N−n−1} / ȳ^N,  ȳ > nȳ_s/N
(using the fact that y_rT = Nȳ − nȳ_s).
(b) In this case, inferences will be exactly the same as in (a). This is
because, even though the sampling mechanism is potentially nonignorable
due to f ( I | y , λ ) depending on a population value y5 , that value happens
to be known (since unit 5 is in the sample, i.e. 5 ∈ s ).
To clarify, we write
π_5 = π_5(y_5) = 0.3 + 0.6 I(y_5 > 1),
i.e. π_5 = 0.3 if y_5 < 1 and π_5 = 0.9 if y_5 > 1.
Thus
f(I | y, λ) = ∏_{i=1}^{N} π_i^{I_i} (1 − π_i)^{1−I_i}.
Therefore,
f(λ | D) ∝ f(λ, I, y_s) = ∫ f(λ, I, y_s, y_r) dy_r
 = ∫ f(λ) f(y_s, y_r | λ) f(I | y_s, y_r, λ) dy_r
 = f(λ) f(y_s | λ) × constant,
since f(I | y_s, y_r, λ) here depends only on y_5, which is part of y_s. This is
proportional to the posterior obtained in (a).
So f(I | y, λ) = ∏_{i=1}^{N} f(I_i | y, λ) is unknown.
Now,
f(λ | D) ∝ f(λ, I, y_s) = ∫ f(λ, I, y_s, y_r) dy_r
 = ∫ f(λ) f(y_s, y_r | λ) f(I | y_s, y_r, λ) dy_r
 ∝ f(λ) f(y_s | λ) W,
where
W = W(λ) = ∫ f(y_r | λ) f(I_4 | y_4) dy_r
 = ∏_{i∈r, i≠4} ∫_0^∞ f(y_i | λ) dy_i × ∫_0^∞ f(y_4 | λ) f(I_4 | y_4) dy_4
 = ∏_{i∈r, i≠4} 1 × ∫_0^∞ λ e^{−λ y_4}{0.7 − 0.6 I(y_4 > 1)} dy_4
   (since f(y_i | λ) = λ e^{−λ y_i} ∀ i)
 = 0.7 ∫_0^∞ λ e^{−λ y_4} dy_4 − 0.6 ∫_1^∞ λ e^{−λ y_4} dy_4
 = 0.7 × 1 − 0.6 e^{−λ}.
Thus
f(λ | D) ∝ λ^{n−1} e^{−λ y_sT}(7 − 6e^{−λ}) = 7λ^{n−1} e^{−λ y_sT} − 6λ^{n−1} e^{−λ(y_sT + 1)}.
Thus
f(λ | D) = c { (7/y_sT^n) × y_sT^n λ^{n−1} e^{−λ y_sT}/Γ(n)
 − (6/(y_sT + 1)^n) × (y_sT + 1)^n λ^{n−1} e^{−λ(y_sT+1)}/Γ(n) },
where
1 = ∫ f(λ | D) dλ = c { (7/y_sT^n)(1) − (6/(y_sT + 1)^n)(1) }
⇒ c = { 7/y_sT^n − 6/(y_sT + 1)^n }^{−1}.
First,
F(y_4 | D, λ) ∝ 7 ∫_0^{y_4} λ e^{−λt} dt,  0 < y_4 < 1
 ∝ 7 ∫_0^{y_4} λ e^{−λt} dt − 6 ∫_1^{y_4} λ e^{−λt} dt,  y_4 > 1
 = 7(1 − e^{−λ y_4}),  0 < y_4 < 1
 = 7(1 − e^{−λ y_4}) − 6(e^{−λ} − e^{−λ y_4}),  y_4 > 1.
Thus
F(y_4 | D, λ) = k(7 − 7e^{−λ y_4}),  0 < y_4 < 1
 = k(7 − e^{−λ y_4} − 6e^{−λ}),  y_4 > 1,
where k = k(λ) = 1/(7 − 6e^{−λ}), since 1 = F(y_4 = ∞ | D, λ) = k(7 − 6e^{−λ}).
Now let a = y_0 + y_4, where y_0 denotes the total of the other m − 1 nonsample
values, so that (y_0 | D, λ) ~ G(m − 1, λ). Then (a convolution)
F(a | D, λ) = ∫_0^a F(y_4 = a − y_0 | D, λ) f_{G(m−1,λ)}(y_0) dy_0
 = ∫_0^a k{7 − 7e^{−λ(a−y_0)}} f_{G(m−1,λ)}(y_0) dy_0,  0 < a < 1.
For 0 < a < 1, differentiating under the integral sign (the boundary terms
vanish because 7 − 7e^{−λ(a−a)} = 0),
f(a | D, λ) = d/da F(a | D, λ) = ∫_0^a 7kλ e^{−λ(a−y_0)} f_{G(m−1,λ)}(y_0) dy_0
 = 7kλ e^{−λa} ∫_0^a e^{λ y_0} {λ^{m−1} y_0^{m−2} e^{−λ y_0}/Γ(m − 1)} dy_0
 = 7kλ e^{−λa} {λ^{m−1}/(m − 2)!} ∫_0^a y_0^{m−2} dy_0
 = 7kλ e^{−λa} λ^{m−1} a^{m−1}/(m − 1)!
 = 7k λ^m a^{m−1} e^{−λa}/(m − 1)! = 7k f_{G(m,λ)}(a).
For a > 1 we write
F(a | D, λ) = ∫_0^{a−1} k{7 − e^{−λ(a−y_0)} − 6e^{−λ}} f_{G(m−1,λ)}(y_0) dy_0
 + ∫_{a−1}^{a} k{7 − 7e^{−λ(a−y_0)}} f_{G(m−1,λ)}(y_0) dy_0,
since y_4 = a − y_0 > 1 exactly when y_0 < a − 1. Differentiating in the same
way, the boundary terms at y_0 = a − 1 cancel (both equal
k(7 − 7e^{−λ}) f_{G(m−1,λ)}(a − 1)), and the remaining integral terms give
f(a | D, λ) = kλ e^{−λa} {λ^{m−1}/(m − 1)!} { (a − 1)^{m−1} + 7[a^{m−1} − (a − 1)^{m−1}] }
 = k { 7 λ^m a^{m−1} e^{−λa}/(m − 1)! − 6 λ^m (a − 1)^{m−1} e^{−λa}/(m − 1)! }
 = k { 7 f_{G(m,λ)}(a) − 6 e^{−λ} f_{G(m,λ)}(a − 1) }.
In summary so far,
f(a | D, λ) = k × 7 f_{G(m,λ)}(a),  0 < a < 1
 = k × { 7 f_{G(m,λ)}(a) − 6e^{−λ} f_{G(m,λ)}(a − 1) },  a > 1.
Check: Here,
∫ f(a | D, λ) da = k × { 7 F_{G(m,λ)}(1) + 7[1 − F_{G(m,λ)}(1)] − 6e^{−λ}[1 − F_{G(m,λ)}(0)] }
 = {1/(7 − 6e^{−λ})} × { 7 − 6e^{−λ}[1 − 0] } = 1
(which is correct).
f(ȳ | D, λ) = f_1(ȳ, λ) ≡ N k(λ) 7 f_{G(m,λ)}(Nȳ − nȳ_s)
 for nȳ_s/N < ȳ < (nȳ_s + 1)/N
f(ȳ | D, λ) = f_2(ȳ, λ) ≡ N k(λ) { 7 f_{G(m,λ)}(Nȳ − nȳ_s) − 6e^{−λ} f_{G(m,λ)}(Nȳ − nȳ_s − 1) }
 for ȳ > (nȳ_s + 1)/N,
where:
nȳ_s/N = 0.2857
(nȳ_s + 1)/N = 0.4286
k(λ) = 1/(7 − 6e^{−λ}) (as before).
We see that inferences under the length-biased sampling scheme in (c) are
lower than those under SRSWR in (a). This is because, generally
speaking, length bias makes larger units more likely to be selected, and
not adjusting for that bias naturally leads to inferences that are too high.
Note 1: In (a), (µ | D) ~ IG(n, y_sT), and therefore
E(µ | D) = y_sT/(n − 1) = 2/(2 − 1) = 2 (exactly).
# (a)
X11(w=8,h=4); par(mfrow=c(1,1))
N=7; ys=c(1.6,0.4); ysT=sum(ys); ysbar=mean(ys); n=length(ys); m=N-n
c(ysT,ysbar,n,m) # 2 1 2 5
fmufun=function(mu,n,ysT) dgamma(1/mu,n,ysT)/mu^2
integrate(fmufun,0, Inf,n=n,ysT=ysT)$value # 1 check
muv=seq(0.0001,20.0001,0.005); fmuv= fmufun(muv,n=n,ysT=ysT)
plot(muv,fmuv,type="l",xlim=c(0,20)) # check
integrate(function(mu,n,ysT) mu*fmufun(mu,n,ysT),
0,Inf,n=n,ysT=ysT)$value # 2 check (posterior mean of mu)
# (c)
c = 1 / ( 7/ysT^n - 6/(ysT+1)^n ); c # 0.9230769
flamfunc=function(lam,n,ysT,c) c*
( (7/ysT^n)*dgamma(lam,n,ysT) - (6/(ysT+1)^n)*dgamma(lam,n,ysT+1) )
integrate(flamfunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check
lamv=seq(0,20,0.01)
plot(lamv,flamfunc(lamv,n=n,ysT=ysT,c=c),type="l") # OK
fmufunc=function(mu,n,ysT,c) c*(1/mu^2)*
( (7/ysT^n)*dgamma(1/mu,n,ysT) - (6/(ysT+1)^n)*dgamma(1/mu,n,ysT+1) )
integrate(fmufunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check
integrate(function(mu,n,ysT,c) mu*fmufunc(mu,n,ysT,c),
0,Inf,n=n,ysT=ysT,c=c)$value # 1.384615 (posterior mean of mu)
fmuvc=fmufunc(mu=muv,n=n,ysT=ysT,c); plot(muv,fmuvc) # OK
f1fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
7*dgamma(N*ybar-ysT,m,lam)
f2fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
(7*dgamma(N*ybar-ysT,m,lam)-6*exp(-lam)*dgamma(N*ybar-ysT-1,m,lam) )
g1fun=function(ybar,n,N,m,ysT,c)
integrate(function(lam,ybar,n,N,m,ysT,c)
f1fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
0,Inf, ybar=ybar, n=n,N=N,m=m,ysT=ysT,c=c)$value
g2fun=function(ybar,n,N,m,ysT,c)
integrate(function(lam,ybar,n,N,m,ysT,c)
f2fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
0,Inf, ybar=ybar, n=n,N=N,m=m,ysT=ysT,c=c)$value
# Check:
g1fun(ybar=0.4,n,N,m,ysT,c) # 0.4119163 OK
g2fun(ybar=0.6,n,N,m,ysT,c) # 1.274185 OK
ybarv1=seq(ybarmin,ybarcut,length.out=400); fybarv1=ybarv1
for(j in 1:length(ybarv1)) fybarv1[j] =
g1fun(ybar=ybarv1[j],n=n,N=N,m=m,ysT=ysT,c=c)
plot(c(0,5),c(0,1.5),type="n")
lines(ybarv1, fybarv1,lty=1,lwd=2)
lines(ybarv2, fybarv2,lty=1,lwd=2) # OK
# Check
INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
# Integrates numerically under a spline through the points given by
# the vectors xvec and yvec, from a to b.
fit <- smooth.spline(xvec, yvec); spline.f <- function(x){predict(fit, x)$y }
integrate(spline.f, a, b)$value }
INTEG(seq(0,1,0.01),seq(0,1,0.01)^2,0,1) # 0.3333333 check
prob1=INTEG(ybarv1,fybarv1,ybarmin,ybarcut)
prob2=INTEG(ybarv2,fybarv2,ybarcut,10000)
c(prob1,prob2,prob1+prob2) # 0.02880659 0.97119399 1.00000058 OK
INTEG(c(ybarv1,ybarv2),c(fybarv1,fybarv2),ybarmin,10000) # 1.000004 OK
X11(w=8,h=6); par(mfrow=c(2,1))
plot(ybarv1, ybarv1* fybarv1, xlim=c(0,1)) # OK
plot(ybarv2, ybarv2* fybarv2, xlim=c(0,20)) # OK
f(λ) ∝ 1/λ,  λ > 0.
So
f(I, y_s, y_0, y_4, λ) ∝ (1/λ) × ∏_{i=1}^{n} λ e^{−λ y_i} × λ^{m−1} y_0^{m−2} e^{−λ y_0}
 × λ e^{−λ y_4} × {7 − 6 I(y_4 > 1)}.
The full conditional distributions are therefore:
1. f(λ | I, y_s, y_0, y_4) ∝ λ^{N−1} e^{−λ(y_sT + y_0 + y_4)}
 ⇒ (λ | I, y_s, y_0, y_4) ~ G(N, y_sT + y_0 + y_4)
2. f(y_0 | I, y_s, λ, y_4) ∝ y_0^{m−2} e^{−λ y_0}
 ⇒ (y_0 | I, y_s, λ, y_4) ~ G(m − 1, λ)
3. f(y_4 | I, y_s, λ, y_0) ∝ e^{−λ y_4}{7 − 6 I(y_4 > 1)},  y_4 > 0.
The first two of these three conditionals are straightforward and easy to sample
from. The third conditional can be sampled from via the inversion
technique as follows.
F(x | D, λ) = r(7 − 7e^{−λx}),  0 < x < 1
 = r(7 − e^{−λx} − 6e^{−λ}),  x > 1,
where r = 1/(7 − 6e^{−λ}). Now observe that F(x = 1) = (7 − 7e^{−λ})/(7 − 6e^{−λ}).
Let p ~ U(0, 1). First, if p < (7 − 7e^{−λ})/(7 − 6e^{−λ}), then we solve
p = r(7 − 7e^{−λx}) and thereby obtain
x = −(1/λ) log{1 − p/(7r)}.
Secondly, if p > (7 − 7e^{−λ})/(7 − 6e^{−λ}), then we solve
p = r(7 − e^{−λx} − 6e^{−λ}) and thereby obtain
x = −(1/λ) log{7 − 6e^{−λ} − p/r}.
Figure 9.4 displays trace plots for the three unknowns, λ , y0 , y4 , sample
ACFs for these over the last 20,000 iterations, and the three sample ACFs
again over the final samples of size J. Figure 9.5 is a histogram of the J
simulated values of µ = 1/ λ and Figure 9.6 is a histogram of the J
simulated values of ȳ = (y_sT + y_0 + y_4)/N. In each histogram are shown
a density estimate as well as three vertical lines for the Monte Carlo point
estimate and 95% CI for the mean.
The posterior mean of µ , i.e. E ( µ | D), was also estimated via Rao-
Blackwell as
µ̂ = (1/J) Σ_{j=1}^{J} (y_sT + y_0^{(j)} + y_4^{(j)})/(N − 1)
 = (1/J) Σ_{j=1}^{J} Nȳ^{(j)}/(N − 1) = 1.41.
So we define
e_j = E(ȳ | D, λ_j, y_4^{(j)}) = (1/N){ y_sT + y_4^{(j)} + (m − 1)/λ_j }.
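A minimal sketch of this Rao-Blackwell calculation, assuming lamvec and y4vec
hold the J retained draws of λ and y_4 (these vector names are assumptions,
not necessarily those used in the code below), is:
ev <- (ysT + y4vec + (m-1)/lamvec)/N    # e_j = E(ybar | D, lambda_j, y4_j)
mean(ev)                                # Rao-Blackwell estimate of E(ybar | D)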
Qfun = function(p=0.5,lam=1){
c1 = (7-7*exp(-lam))/(7-6*exp(-lam))
if(p <= c1) c2 = 1- (p/7) * (7-6*exp(-lam))
if(p > c1) c2 = 7 - 6*exp(-lam) - p*(7-6*exp(-lam))
-(1/lam)*log(c2) }
# Check:
pvec=seq(0,1,0.001); Qvec=pvec
for(i in 1:length(pvec)) Qvec[i] = Qfun(p=pvec[i],lam=1.3)
plot(pvec,Qvec); plot(Qvec,pvec) # OK
muhat=(N/(N-1))*ybarhat
muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 1.405272 1.343556 1.466989
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )
X11(w=8,h=5)
hist(muvec,prob=T, xlim=c(0,5),ylim=c(0,1),breaks=seq(0,80,0.1),
xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)
hist(ybarvec,prob=T, xlim=c(0,5),ylim=c(0,1.2),breaks=seq(0,80,0.1),
xlab="ybar", main=" ")
lines(density(ybarvec),lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)
f(L | y, λ) = ∏_{i=1}^{n} y_{L_i} / (y_T − Σ_{j=1}^{i−1} y_{L_j}),
 L = (L_1, ..., L_n) ∈ {(a_1, ..., a_n): a_1, ..., a_n are distinct elements of {1, ..., N}}
f(y | λ) = ∏_{i=1}^{N} λ e^{−λ y_i},  y_i > 0 ∀ i
f(λ) ∝ 1/λ,  λ > 0.
This pdf implies that units are selected from the finite population, one by
one and without replacement, in such a way that the probability of
selecting a unit on any given draw is its value divided by the sum of the
values of all units which have not yet been sampled at that point in time.
We call this procedure length-biased sampling without replacement.
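As a rough illustration of this mechanism, the following R sketch (illustrative only; the population values here are simulated) draws one length-biased sample without replacement.
# Illustrative sketch: length-biased sampling without replacement
set.seed(1)
y <- rexp(20)                    # hypothetical finite population values
n <- 5
L <- integer(n)                  # labels in order of selection
avail <- seq_along(y)            # units not yet sampled
for (i in 1:n) {
  probs <- y[avail]/sum(y[avail])            # selection prob. proportional to value
  pick  <- sample(length(avail), 1, prob = probs)
  L[i]  <- avail[pick]
  avail <- avail[-pick]
}
L                                # the ordered sample of labels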
( yrT | λ ) ~ G ( m, λ ) ,
the joint posterior density of λ and yrT (given the data, D = ( L, ys ) ) may
now be written as
f(λ, y_rT | D) ∝ f(λ, y_rT, y_s, L) = f(λ) f(y_s | λ) f(y_rT | λ) f(L | y_s, y_rT)
∝ (1/λ) × ∏_{i=1}^n λ e^{−λ y_i} × λ^m y_rT^{m−1} e^{−λ y_rT} × ∏_{i=1}^n 1/( y_i + ... + y_n + y_rT ).
This sample can then be used for Monte Carlo inference on the quantities
of interest, namely µ = 1/ λ and=y ( ysT + yrT ) / N .
Applying the above Gibbs sampler (with a suitable burn-in and thinning)
we obtained a random sample of size J = 2,000 from the joint posterior
distribution of λ , yrT and w = ( w1 ,..., wn ) .
where w_T^{(j)} = w_1^{(j)} + ... + w_n^{(j)}.
Figure 9.7 shows trace plots for λ , yrT and w1 , sample ACFs for these
quantities over the last 10,000 iterations, and these three sample ACFs
again but calculated using only the final smaller samples of size J = 2,000.
Figures 9.8 and 9.9 show two histograms, of the J simulated values of
µ = 1/λ and of the J simulated values of ȳ = ( y_sT + y_rT )/N.
In each histogram are shown a density estimate and three vertical lines
representing the Monte Carlo point estimate and 95% CI for the posterior
mean.
Figure 9.7 Trace plots and sample ACFs for samples obtained
via MCMC
GS = function(J=1000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7),
lam=1,yrT=1,w=rep(1,3)){
ysT=sum(ys); lamv=lam; yrTv=yrT; wmat=w; for(j in 1:J){
lam=rgamma(1,N,ysT+yrT);
yrT=rgamma(1,m,lam+sum(w))
for(i in 1:n) w[i] = rgamma(1,1,sum(ys[i:n]))
lamv=c(lamv,lam); yrTv=c(yrTv,yrT); wmat=rbind(wmat,w)
}
list(lamv=lamv, yrTv=yrTv, wmat=wmat)
}
set.seed(321); date()
res=GS(J=11000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7), lam=1,yrT=1,w=rep(1,3))
date() # took 4 secs
X11(w=8,h=9); par(mfrow=c(3,3));
lamv=res$lamv[-(1:1001)]; yrTv=res$yrTv[-(1:1001)];
wmat=res$wmat[-(1:1001),]
acf(lamv); acf(yrTv); acf(wmat[,1]) #
inc= seq(5,10000,5); lamvec=lamv[inc]; yrTvec=yrTv[inc]; wmatrix=wmat[inc,];
acf(lamvec); acf(yrTvec); acf(wmatrix[,1]) # OK
J = length(lamvec); J # 2000
muhat=(N/(N-1))*ybarhat
muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 0.6188547 0.6136692 0.6240401
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )
ybargrid=seq(0,10,0.01)
fybarhat= ybargrid; for(i in 1:length(ybargrid))
fybarhat[i] = mean( dgamma(N*ybargrid[i]-ysT, m, lamvec+wTvec )*N )
X11(w=8,h=5); par(mfrow=c(1,1))
hist(muvec,prob=T, xlim=c(0,3),ylim=c(0,2.5),breaks=seq(0,80,0.1),
xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)
hist(ybarvec,prob=T, xlim=c(0.3,1.2),ylim=c(0,7),breaks=seq(0,80,0.025),
xlab="ybar", main="")
lines(ybargrid, fybarhat,lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)
CHAPTER 10
Normal Finite Population Models
10.1 The basic normal-normal finite
population model
For convenience, we will in what follows label (or rather relabel) the
n sample units as 1,..., n and the m= N − n nonsample units as
n + 1,..., N . This convention simplifies notation and allows us to write
the finite population vector, originally defined by y = (y_1, ..., y_N), as
y = ( (y_1, ..., y_n), (y_{n+1}, ..., y_N) ) = (y_s, y_r).
where: µ_* = (1 − k)µ_0 + k ȳ_s (the posterior mean as a credibility estimate)
σ_*² = (σ²/n) k (the posterior variance), k = n/(n + σ²/σ_0²)
(the credibility factor and weight given to the MLE, ȳ_s).
f(y_r | y_s) = ∫ f(y_r | y_s, µ) f(µ | y_s) dµ.
c = E(ȳ | y_s) = E[ (n ȳ_s + m ȳ_r)/N | y_s ] = ( n ȳ_s + m E(ȳ_r | y_s) )/N
= ( n ȳ_s + m a )/N = ( n ȳ_s + m µ_* )/N
d² = V(ȳ | y_s) = V[ (n ȳ_s + m ȳ_r)/N | y_s ] = (m/N)² V(ȳ_r | y_s)
= (m²/N²) b² = (m²/N²)( σ_*² + σ²/m ).
The 1 − α CPDR for ȳ_r is (a ± z_{α/2} b).
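The quantities just defined are easy to compute numerically. The following R sketch is a minimal illustration only (not the book's code; the function name and the inputs in the example call are purely hypothetical) packaging the calculation of k, µ_*, σ_*², a, b², c, d² and the CPDR.
# Sketch: predictive inference on ybar in the normal-normal finite population model
nnfp <- function(ysbar, n, N, sig, mu0, sig0, alp = 0.05) {
  m  <- N - n
  k  <- n/(n + sig^2/sig0^2)                 # credibility factor
  mustar   <- (1 - k)*mu0 + k*ysbar          # posterior mean of mu
  sig2star <- k*sig^2/n                      # posterior variance of mu
  a  <- mustar;  b2 <- sig2star + sig^2/m    # predictive mean and variance of yrbar
  cc <- (n*ysbar + m*a)/N                    # predictive mean of ybar ('cc' avoids clashing with c())
  d2 <- (m^2/N^2)*b2                         # predictive variance of ybar
  cpdr <- cc + c(-1, 1)*qnorm(1 - alp/2)*sqrt(d2)
  list(k = k, mustar = mustar, a = a, b2 = b2, c = cc, d2 = d2, cpdr = cpdr)
}
nnfp(ysbar = 8, n = 3, N = 7, sig = 2, mu0 = 10, sig0 = 1.5)   # hypothetical inputs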
(b) What is the predictive distribution in the case of very weak prior
information?
(c) What is the predictive distribution in the case of very strong prior
information?
(d) What is the predictive distribution in the case of a very large sample
size?
In your graph indicate the predictive mean and 95% highest predictive
density region for the average of all seven values in the finite population.
This makes sense because if the sample data values are given ‘full
credibility’ then their straight average should intuitively be used to
estimate the finite population mean.
This also makes sense because if the sample data are given ‘zero
credibility’ then each of the N − n nonsampled values should
intuitively be estimated by the prior mean of the superpopulation mean
µ.
(b) In the case of very weak prior information we have (in the limit) that
σ_0 = ∞, hence k = 1, and hence q = 1. Consequently
(ȳ | y_s) ~ N( (1 − 1)µ_0 + 1 × ȳ_s , 1 × (σ²/n)(1 − n/N) ) ~ N( ȳ_s , (σ²/n)(1 − n/N) ).
Note: This is the same inference one would make via classical
techniques after substituting the sample standard deviation
s = √( (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ_s)² )
for σ.
(c) In the case of very strong prior information we have (in the limit)
that σ_0 = 0, hence k = 0, and hence q = n/N. Consequently,
(ȳ | y_s) ~ N( (1 − n/N)µ_0 + (n/N)ȳ_s , (n/N)(σ²/n)(1 − n/N) )
~ N( ( (N − n)µ_0 + n ȳ_s )/N , (σ²/N)(1 − n/N) ).
(d) In the case of a very large sample size we have (in the limit) that
n = ∞, hence k = 1, and hence q = 1. Consequently (just as in (b) for
the case of very weak prior information),
(ȳ | y_s) ~ N( (1 − 1)µ_0 + 1 × ȳ_s , 1 × (σ²/n)(1 − n/N) ) ~ N( ȳ_s , (σ²/n)(1 − n/N) ).
µ_0 = 10, σ_0 = 3/1.96 = 1.53064
µ ~ N(µ_0, σ_0²)
k = n/(n + σ²/σ_0²) = 0.63731
µ_* = (1 − k)µ_0 + k ȳ_s = 8.6404, σ_* = √(k σ²/n) = 0.9218141
(µ | y_s) ~ N(µ_*, σ_*²)
a = µ_* = 8.6404, b = √( σ_*² + σ²/m ) = 1.3601
(ȳ_r | y_s) ~ N(a, b²)
q = ( n + (N − n)k )/N = 0.79275
c = ( n ȳ_s + m µ_* )/N = (1 − q)µ_0 + q ȳ_s = 8.3088
d = √( m² b²/N² ) = (m/N) b = 0.77717
(ȳ | y_s) ~ N(c, d²).
X11(w=8,h=7); par(mfrow=c(1,1))
plot(c(4,15),c(0,0.6),type="n",xlab="mu, yrbar, ybar",
ylab="density, likelihood", main="")
v=seq(0,20,0.01)
lines(v,dnorm(v,ysbar,sig/sqrt(n)),lty=1,lwd=3,col="black")
# likelihood function (i)
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/m)),lty=3,lwd=2, col="blue")
# prior pdf of yrbar (iv); note the sqrt(), since dnorm() takes a standard deviation
lines(v,dnorm(v,a,sqrt(b2)),lty=3,lwd=3, col="blue")
# predictive pdf of yrbar (v)
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/N)),lty=4,lwd=2, col="green")
# prior pdf of ybar (vi)
lines(v,dnorm(v,c,sqrt(d2)),lty=4,lwd=3, col="green")
# predictive pdf of ybar (vii)
abline(v=c(c,HPDR),lty=1,lwd=1)
legend(3.8,0.6,c("(i) Likelihood","(ii) Prior","(iii) Posterior"),
lty=c(1,2,2), lwd=c(3,2,3), col=c("black","red","red"))
legend(10,0.6,c("(iv) Prior pdf of yrbar","(v) Predictive pdf for yrbar",
"(vi) Prior pdf of ybar","(vii) Predictive pdf for ybar"),
lty=c(3,3,4,4), lwd=c(2,3,2,3), col=c("blue","blue","green","green"))
text(12.5,0.38, "The thin vertical lines show the predictive")
text(12.5,0.345,"mean and 95% HPDR bounds for ybar")
We will continue to assume that the values in the population are all
(conditionally) normally distributed, and that the (conditional) variance
of each value in the finite population is known. We will now also
assume that all the covariance terms between these values are known.
(These assumptions will be relaxed at a later stage.)
x_i = (x_{i1}, ..., x_{ip})′
is the covariate vector for the ith population unit (i = 1, ..., N) and
X_j = (x_{1j}, ..., x_{Nj})′
is the population vector for the jth explanatory variable (j = 1, ..., p).
Also suppose that the finite population vector y has a known variance-
covariance structure in the form of an N by N positive definite matrix
Σ = [ σ_{11} ⋯ σ_{1N} ; ⋮ ⋱ ⋮ ; σ_{N1} ⋯ σ_{NN} ],
where: σ_{ij} = C(y_i, y_j) = σ_{ji}
σ_{ii} = V(y_i) ≡ σ_i²,
with the covariance and variance operations here (C and V) implicitly
conditional on all model parameters.
f(y) = ∫ f(y, β) dβ = ∫ f(y | β) f(β) dβ.
Thus, y ~ N_N( Xδ, Σ + XΩX′ ).
Thus, X_s = (x_1, ..., x_n)′ is the submatrix consisting of the first n rows of X, etc.
Note: We have here used the following result (e.g. see equation
(8a.2.11) in Rao, 1973): if
(X_1, X_2)′ ~ N_{n_1+n_2}( (µ_1, µ_2)′ , [ Σ_{11} Σ_{12} ; Σ_{21} Σ_{22} ] ),
then (X_2 | X_1) ~ N_{n_2}( µ_2 + Σ_{21} Σ_{11}^{−1}(X_1 − µ_1), Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12} ).
It follows that:
D = ( Ω^{−1} + X_s′ Σ_ss^{−1} X_s )^{−1}
β̂ = D( Ω^{−1} δ + X_s′ Σ_ss^{−1} y_s ).
It follows that:
E(y_r | y_s) = E{ E(y_r | y_s, β) | y_s }
= E{ X_r β + Σ_rs Σ_ss^{−1}( y_s − X_s β ) | y_s }
= X_r β̂ + Σ_rs Σ_ss^{−1}( y_s − X_s β̂ )   (10.5)
V(y_r | y_s) = E{ V(y_r | y_s, β) | y_s } + V{ E(y_r | y_s, β) | y_s }
= E{ Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr | y_s } + V{ X_r β + Σ_rs Σ_ss^{−1}( y_s − X_s β ) | y_s }
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + V{ ( X_r − Σ_rs Σ_ss^{−1} X_s ) β | y_s }
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′.   (10.6)
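Formulas (10.5) and (10.6) can be evaluated directly by matrix algebra. The following R sketch is a minimal illustration only (the inputs are assumed to be set up elsewhere, and Σ_sr is taken as the transpose of Σ_rs).
# Sketch: predictive mean and variance of yr given ys, per (10.5) and (10.6)
predict_yr <- function(ys, Xs, Xr, Sss, Srs, Srr, Omega, delta) {
  Sss_inv <- solve(Sss)
  D       <- solve(solve(Omega) + t(Xs) %*% Sss_inv %*% Xs)             # posterior variance of beta
  betahat <- D %*% (solve(Omega) %*% delta + t(Xs) %*% Sss_inv %*% ys)  # posterior mean of beta
  A       <- Xr - Srs %*% Sss_inv %*% Xs
  Er      <- Xr %*% betahat + Srs %*% Sss_inv %*% (ys - Xs %*% betahat) # (10.5)
  Vr      <- Srr - Srs %*% Sss_inv %*% t(Srs) + A %*% D %*% t(A)        # (10.6)
  list(mean = Er, var = Vr, betahat = betahat, D = D)
}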
Note: The expression for E* at (10.1) must be the same as that for
E ( yr | ys ) at (10.5), and likewise the expression for V* at (10.2) must
be the same as that for V ( yr | ys ) at (10.6). This equivalence can also
be shown with some algebra by making use of the formula
(Σ_ss + X_s Ω X_s′)^{−1} = Σ_ss^{−1}{ I_s − X_s( Ω^{−1} + X_s′ Σ_ss^{−1} X_s )^{−1} X_s′ Σ_ss^{−1} },
which in turn follows from the general matrix identity
( A − U W^{−1} V )^{−1} = A^{−1} + A^{−1} U ( W − V A^{−1} U )^{−1} V A^{−1}.
Note: Here, 1′r denotes the row vector with m= N − n ones. This
vector could also be written 1′m or 1′N −n or (1,...,1) .
V_* = ( Σ_rr + X_r Ω X_r′ ) − ( Σ_rs + X_r Ω X_s′ )( Σ_ss + X_s Ω X_s′ )^{−1}( Σ_sr + X_s Ω X_r′ )
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′.
Under this model it can be shown that the predictive distribution of the
finite population mean is given by
( y | ys ) ~ N ( A, B 2 ) ,
where:
A = (n/N) ȳ_s + (1 − n/N) x̄_r ( δσ² + ω² Σ_{i=1}^n y_i x_i^{1−2γ} ) / ( σ² + ω² Σ_{i=1}^n x_i^{2−2γ} )
B² = (σ²/N²)[ Σ_{i=n+1}^N x_i^{2γ} + m² ω² x̄_r² / ( σ² + ω² Σ_{i=1}^n x_i^{2−2γ} ) ]
x̄_r = (1/m) Σ_{i=n+1}^N x_i (the average of the covariate values in the nonsample)
x̄_s = (1/n) Σ_{i=1}^n x_i (the average of the covariate values in the sample).
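The following R sketch (illustrative only; the function and argument names are assumptions, with sig, om, gam and del standing for σ, ω, γ and δ) computes A and B² from the sample data and the full population covariate vector, whose first n entries are taken to be the sampled units.
# Sketch: predictive mean A and variance B^2 under the model above
AB <- function(ys, x, n, sig, om, gam, del) {
  N <- length(x); m <- N - n
  xs <- x[1:n]; xr <- x[(n + 1):N]
  denom <- sig^2 + om^2*sum(xs^(2 - 2*gam))
  bpost <- (del*sig^2 + om^2*sum(ys*xs^(1 - 2*gam)))/denom   # posterior mean of beta
  A  <- (n/N)*mean(ys) + (1 - n/N)*mean(xr)*bpost
  B2 <- (sig^2/N^2)*( sum(xr^(2*gam)) + m^2*om^2*mean(xr)^2/denom )
  c(A = A, B2 = B2)
}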
As regards this last special case, we see that the predictive mean A is
identical to the common design-based ratio estimator.
Note 1: If units with relatively large y-values are selected, then x̄_s will
likely be larger than x̄_r, so that x̄_r/x̄_s will likely be small, and thereby
B² = V(ȳ | y_s) = (σ²/n)(1 − n/N)( x̄_r/x̄_s ) x̄
will also likely be small.
Note 2: The same formulae as derived in the last special case will also
apply approximately when the sample size n is very large. This makes
sense because the effect of a very large sample size is the same as that
of a very diffuse prior. Note that in the case of a census, n = N and we
find that the above formulae correctly yield A = y s and B 2 = 0 .
D = ( Σ_{i=1}^n x_i σ_i^{−2} x_i )^{−1} = σ²/x_sT
β̂ = (σ²/x_sT) Σ_{i=1}^n x_i σ_i^{−2} y_i = (σ²/x_sT)(1/σ²) Σ_{i=1}^n x_i x_i^{−1} y_i = y_sT/x_sT = ȳ_s/x̄_s.
Next,
(y_r | y_s) ~ N_m( E_*, V_* ),
where:
m = N − n
E_* = X_r β̂ + Σ_rs Σ_ss^{−1}( y_s − X_s β̂ ) = ( x_{n+1}, ..., x_N )′ β̂ + 0 = ( x_{n+1}, ..., x_N )′ ( ȳ_s/x̄_s )
V_* = Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′
= σ² diag( x_{n+1}, ..., x_N ) − 0 + ( x_{n+1}, ..., x_N )′ ( σ²/x_sT ) ( x_{n+1}, ..., x_N )
= σ² [ diag( x_{n+1}, ..., x_N ) + (1/x_sT)( x_{n+1}, ..., x_N )′( x_{n+1}, ..., x_N ) ].
v_* = 1_r′ V_* 1_r / N²
= (σ²/N²) (1 ⋯ 1) [ diag( x_{n+1}, ..., x_N ) + (1/x_sT)( x_{n+1}, ..., x_N )′( x_{n+1}, ..., x_N ) ] (1 ⋯ 1)′
= (σ²/N²) [ ( x_{n+1} + ... + x_N ) + (1/x_sT)( x_{n+1} Σ_{i=n+1}^N x_i + ... + x_N Σ_{i=n+1}^N x_i ) ]
= (σ²/N²) [ x_rT + x_rT²/x_sT ] = ( x_rT σ²/N² )( x_sT + x_rT )/x_sT
= σ² × (1/N) × ( (N − n) x̄_r )/( n x̄_s ) × ( x_sT + x_rT )/N = (σ²/n)(1 − n/N)( x̄_r/x̄_s ) x̄.
Let n0 denote the number of covariate values xi in the sample (of size n)
which are 0, and let n1 be the number which are 1. Likewise, let m0
denote the number of covariate values xi in the nonsample (of size
m= N − n ) which are 0, and let m1 be the number which are 1.
(Thus, in each of the sample and nonsample vectors, place the values
with covariate 0 first, and place the values with covariate 1 last.)
Then
( β | ys ) ~ N p ( βˆ , D ) ,
where:
D = 1/( n_0 σ_0^{−2} + n_1 σ_1^{−2} )
Note: Here, y_{s0T} = Σ_{i=1}^{n_0} y_i denotes the total of the sample values whose covariate value equals 0, etc.
Thus (ȳ | y_s) ~ N( e_*, v_* ), where:
e_* = ( y_sT + 1_r′ E_* )/N = (1/N){ y_sT + 1_r′ 1_r β̂ } = (1/N){ y_sT + m β̂ }
v_* = 1_r′ V_* 1_r / N² = (1/N²)( m_0 σ_0² + m_1 σ_1² − D m² ).
We see that:
n_0 = 3, n_1 = 2, m_0 = 4, m_1 = 8
y_{s0T} = 2.1 + 2.0 + 2.3 = 6.4, y_{s1T} = 4.9 + 0.2 = 5.1,
y_sT = 6.4 + 5.1 = 11.5
ȳ_{s0} = 6.4/3 = 2.1333, ȳ_{s1} = 5.1/2 = 2.55, ȳ_s = 11.5/5 = 2.3.
options(digits=4)
sig0=0.08; sig1=1.2; ys = c(2.1,2.0,2.3,4.9,0.2); n=length(ys)
xs=c(0,0,0,1,1); xr = c(0,0,0,0,1, 1,1,1,1,1, 1,1); m=length(xr); N = n+m
n1=sum(xs); n0=n-n1; m1=sum(xr); m0=m-m1
c(n,n0,n1, m,m0,m1, N) # 5 3 2 12 4 8 17
ysT=sum(ys); ys1T=sum(ys*xs); ys0T=ysT-ys1T
ysbar=ysT/n; ys1bar=ys1T/n1; ys0bar=ys0T/n0
c(ys0T,ys1T,ysT, ys0bar,ys1bar,ysbar)
# 6.400 5.100 11.500 2.133 2.550 2.300
D = 1/( n0/ sig0^2 + n1/ sig1^2 ); betahat = D*(ys0T/ sig0^2 + ys1T/ sig1^2 )
estar=(1/N)*( ysT+m*betahat );
vstar=(1/N^2)*(m0* sig0^2+m1* sig1^2-D*m^2)
c(D,betahat,estar,vstar) # 0.002127 2.134564 2.183222 0.038890
hpdr=estar+c(-1,1)*qnorm(0.975)*sqrt(vstar); c(hpdr) # 1.797 2.570
Then the result is the same as the classical design-based CI one would
use in the same situation of a large sample size.
However, this strategy will not work well generally. For example, if n is
small then it will lead to an interval which has a frequentist coverage
well below the intended level of 1 − α . In such cases, the problem could
be addressed to some extent by applying an adjustment which reflects
uncertainty regarding the unknown variance parameter. However, the
nature of this type of adjustment would be ad hoc and could possibly lead
to other problems with the inference.
Perhaps the best way to deal with uncertainty regarding the variance
parameter is to incorporate it into the finite population model as yet
another random variable with its own prior distribution, i.e. to add
another level to the hierarchical structure of that model. This is the
approach we will now take. Note that parts of the exposition below will
be a review of material already covered in previous chapters.
f(y_r | y_s) = ∫∫ f(y_r, β, λ | y_s) dβ dλ ∝ ∫∫ f(y, β, λ) dβ dλ,   (10.7)
where f(y, β, λ) = f(λ) f(β | λ) f(y | β, λ)
∝ λ^{η−1} e^{−τλ} × exp{ −(1/2)(β − δ)′ Ω^{−1}(β − δ) }
× λ^{N/2} exp{ −(λ/2)(y − Xβ)′ Σ^{−1}(y − Xβ) }
is the joint density of all random variables involved in the model,
namely the N finite population values, y_1, ..., y_N, and the p + 1 model
parameters, namely λ, β_1, ..., β_p.
Using these two distributions, one can solve for the predictive density of
the finite population mean via the identity
f(ȳ | y_s) = ∫ f(ȳ, λ | y_s) dλ = ∫ f(ȳ | y_s, λ) f(λ | y_s) dλ.
We see that
M = X_s′ Σ_ss^{−1} X_s and MT = X_s′ Σ_ss^{−1} y_s,
so that
T = M^{−1}(MT) = ( X_s′ Σ_ss^{−1} X_s )^{−1} X_s′ Σ_ss^{−1} y_s.
Thus
f(λ | y_s) ∝ ∫ λ^{η−1} e^{−τλ} × 1 × λ^{n/2} exp{ −(λ/2)[ (β − T)′ M (β − T) + R ] } dβ
= λ^{η+n/2−1} exp{ −λ( τ + R/2 ) } × I,
where
I = ∫ exp{ −(1/2)(β − T)′ ( M^{−1}/λ )^{−1} (β − T) } dβ
= (2π)^{p/2} det( M^{−1}/λ )^{1/2}
(using standard multivariate normal theory)
∝ λ^{−p/2} (since M = X_s′ Σ_ss^{−1} X_s is a p by p matrix).
It follows that
f(λ | y_s) ∝ λ^{η+(n−p)/2−1} exp{ −λ( τ + R/2 ) } = λ^{A/2−1} exp{ −(B/2) λ },
where:
A = 2η + n − p, B = 2τ + R, R = ( y_s − X_s T )′ Σ_ss^{−1} ( y_s − X_s T ).
G = Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr, A = X_r − Σ_rs Σ_ss^{−1} X_s.
Therefore
f(ȳ | y_s) = ∫ f(ȳ | y_s, λ) f(λ | y_s) dλ
∝ ∫ λ^{1/2} exp{ −(λ/(2w_0))(ȳ − e_0)² } × λ^{A/2−1} exp{ −(B/2) λ } dλ
= ∫ λ^{(A+1)/2−1} exp{ −λ[ B/2 + (ȳ − e_0)²/(2w_0) ] } dλ
∝ [ B/2 + (ȳ − e_0)²/(2w_0) ]^{−(A+1)/2} ∝ [ 1 + (ȳ − e_0)²/(B w_0) ]^{−(A+1)/2}
∝ [ 1 + (1/A)( (ȳ − e_0)/√(B w_0/A) )² ]^{−(A+1)/2}.
It follows that
( (ȳ − e_0)/h_0 | y_s ) ~ t(A), where h_0² = B w_0/A.
h_0² = (B/A) w_0 = ( (2τ + R)/(2η + n − p) ) × ( 1_r′ V_0 1_r )/N²
With such a sample we can, for example, estimate y’s predictive mean,
namely yˆ = E ( y | ys ) , by the average of y (1) ,..., y ( J ) , and estimate y’s
95% CPDR by the empirical 0.025 and 0.975 quantiles of ȳ^{(1)}, ..., ȳ^{(J)}.
This then raises the question of how the Monte Carlo sample can be
obtained. In this context, we may employ the method of composition via
the equation
f ( y , β , λ | ys ) = f ( y | ys , β , λ ) f ( β , λ | ys ) .
Thus, we first generate a sample from the joint posterior distribution of the two parameters,
( β^{(1)}, λ^{(1)} ), ..., ( β^{(J)}, λ^{(J)} ) ~ iid f(β, λ | y_s).
This in turn raises the question of how to obtain the sample from
f ( β , λ | ys ) . In this case an ideal solution is to apply a Gibbs sampler
defined by the following conditional distributions:
1. ( β | y_s, λ ) ~ N_p( β, D ),
where: β = D( Ω^{−1} δ + λ X_s′ Σ_ss^{−1} y_s )
D = ( Ω^{−1} + λ X_s′ Σ_ss^{−1} X_s )^{−1}
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ).
Note: The first of these distributions derives directly from the normal-
normal finite population model with Σ sr and Σ ss replaced by Σ sr / λ
and Σ ss / λ , etc.
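A minimal R sketch of this Gibbs sampler is given below (not the book's code; the inputs Xs, ys, Sss, Omega, delta, eta and tau are assumed to be set up elsewhere, with Xs of dimension n by p).
# Sketch: Gibbs sampler for (beta, lambda) using conditionals 1 and 2 above
gibbs_betalam <- function(J, ys, Xs, Sss, Omega, delta, eta, tau, lam = 1) {
  n <- length(ys); p <- ncol(Xs)
  Sss_inv <- solve(Sss); Om_inv <- solve(Omega)
  betamat <- matrix(NA, J, p); lamv <- rep(NA, J)
  for (j in 1:J) {
    Dj   <- solve(Om_inv + lam*t(Xs) %*% Sss_inv %*% Xs)
    bj   <- Dj %*% (Om_inv %*% delta + lam*t(Xs) %*% Sss_inv %*% ys)
    beta <- as.vector(bj + t(chol(Dj)) %*% rnorm(p))      # draw beta | ys, lambda
    resid <- ys - Xs %*% beta
    rate  <- tau + 0.5*as.numeric(t(resid) %*% Sss_inv %*% resid)
    lam   <- rgamma(1, eta + n/2, rate)                   # draw lambda | ys, beta
    betamat[j, ] <- beta; lamv[j] <- lam
  }
  list(beta = betamat, lam = lamv)
}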
f(λ | y_s, β) ∝ f(λ, β, y_s) = f(λ) f(β | λ) f(y_s | λ, β)
∝ λ^{η−1} e^{−τλ} × exp{ −(1/2)(β − δ)′ Ω^{−1}(β − δ) }
× λ^{n/2} exp{ −(λ/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) }
∝ λ^{η+n/2−1} exp{ −λ[ τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ] }.
Find the predictive mean and 95% central predictive density region for
the finite population mean y in each of the following scenarios.
(a) There are no covariates, the population values are conditionally iid
and there is no prior information available regarding the model
parameters.
(c) There are no covariates, the population values are conditionally iid,
the prior on the normal mean is normal with mean 10 and variance 2.25,
and (independently) the prior on the normal precision parameter (inverse
of the normal variance) is gamma with mean 2 and variance 1/2 (or
equivalently, gamma with parameters 8 and 4).
Note: This inference is lower than that in (a) because the mean of the
covariate values in the nonsample is 4.7, which is much lower than
their mean in the sample, 9.58. The regression coefficient β in our
model is estimated as 0.5365, reflecting the positive linear relationship
between the x and y values in the sample.
(c) In this case, a good option is to first employ the Gibbs sampler to
generate a random sample from the joint posterior distribution of β and
λ , with:
p = 1, δ = 10, Ω =9 , η = 8, τ = 4 , X = 1N , Σ =diag (1N ) .
1. ( β | y_s, λ ) ~ N_p( β, D ),
where:
β = D( Ω^{−1} δ + λ X_s′ Σ_ss^{−1} y_s )
D = ( Ω^{−1} + λ X_s′ Σ_ss^{−1} X_s )^{−1}
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ).
1. ( β | y_s, λ ) ~ N( β_λ, σ_λ² ),
where:
β_λ = (1 − k_λ) β_0 + k_λ ȳ_s
σ_λ² = k_λ/(nλ), k_λ = n/( n + 1/(λ σ_0²) )
β_0 = 10, σ_0 = 3
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (n/2) s_β² ),
where
s_β² = (1/n) Σ_{i=1}^n ( y_i − β )².
Either way, implementing this Gibbs sampler for 10,100 iterations with a
burn-in of 100 we obtain the trace plots and histograms for β and λ in
Figure 10.2. (The two subplots on the left are for β , and the two on the
right are for λ . The histograms do not include the first 100 iterations.)
The sample ACFs over the entire sample of 10,000 and over the thinned
sample of 1,000 are shown for each of β and λ in Figure 10.3. (E.g. the
top-left subplot is for β over the entire sample of 10,000.) The thinning
process has virtually eliminated all signs of autocorrelation.
Using our sample from the joint posterior of the two parameters, we now
generate a sample from the predictive distribution of the nonsample
mean by drawing
ȳ_r^{(j)} ~ f( ȳ_r | y_s, β_j, λ_j ) ~ N( β_j , 1/((N − n) λ_j) ) for each j = 1, ..., J.
We then calculate
ȳ^{(j)} = (1/N)( n ȳ_s + (N − n) ȳ_r^{(j)} ) for each j = 1, ..., J.
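In R this composition step might look as follows (a sketch only; betvec and lamvec are assumed to hold the J posterior draws of β and λ, and ysbar, n and N are as defined above).
# Sketch: method of composition for ybar
yrbarvec <- rnorm(length(betvec), mean = betvec, sd = 1/sqrt((N - n)*lamvec))
ybarvec  <- (n*ysbar + (N - n)*yrbarvec)/N
mean(ybarvec); quantile(ybarvec, c(0.025, 0.975))    # point estimate and 95% CPDR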
We also estimate the 95% CPDR for y by (4.685, 6.633), where the
bounds of this interval are the empirical 0.025 and 0.975 quantiles of
y (1) ,..., y ( J ) .
f(ȳ | y_s) = ∫∫ f(ȳ, β, λ | y_s) dβ dλ
= ∫∫ f(ȳ | y_s, β, λ) f(β, λ | y_s) dβ dλ
ŷ = E(ȳ | y_s) = E_{β,λ}{ E(ȳ | y_s, β, λ) | y_s }
f(ȳ | y_s) = E_{β,λ}{ f(ȳ | y_s, β, λ) | y_s }.
So we now define:
e(β, λ) = E(ȳ | y_s, β, λ)
= (1/N)( n ȳ_s + (N − n) E(ȳ_r | y_s, β, λ) )
= (1/N)( n ȳ_s + (N − n) β )
v(β, λ) = V(ȳ | y_s, β, λ)
= ( (N − n)²/N² ) V(ȳ_r | y_s, β, λ)
= ( (N − n)²/N² ) × 1/((N − n)λ) = (N − n)/(N² λ)
e_j = e(β_j, λ_j) = (1/N)( n ȳ_s + (N − n) β_j )
v_j = v(β_j, λ_j) = (N − n)/(N² λ_j).
We can now also obtain the Rao-Blackwell estimate of the CPDR for y .
Here U satisfies
∫_{−∞}^U (1/J) Σ_{j=1}^J ( 1/√(2π v_j) ) exp{ −(1/(2 v_j))(ȳ − e_j)² } dȳ = 0.975,
that is, (1/J) Σ_{j=1}^J P( X_j ≤ U ) = 0.975, where X_j ~ N(e_j, v_j); equivalently, L satisfies
(1/J) Σ_{j=1}^J Φ( (L − e_j)/√v_j ) = 0.025 (where Φ is the standard normal cdf).
Note: We could also obtain L and U using trial and error or the
Newton-Raphson algorithm.
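For example, the bounds may be obtained numerically with uniroot(), as in the following sketch (assuming evec and vvec hold the e_j and v_j values defined above).
# Sketch: Rao-Blackwell CPDR bounds for ybar via uniroot()
rbcdf <- function(t) mean(pnorm((t - evec)/sqrt(vvec)))     # averaged normal cdfs
intv  <- range(evec) + c(-10, 10)*sqrt(max(vvec))           # a safely wide search interval
L <- uniroot(function(t) rbcdf(t) - 0.025, intv)$root
U <- uniroot(function(t) rbcdf(t) - 0.975, intv)$root
c(L, U)                                                     # estimated 95% CPDR for ybar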
# (a)
# (b)
b=sqrt(b2); cpdr=a+c(-1,1)*qt(1-alp/2,c)*b
list(a=a,b=b,c=c,beta=beta, cpdr=cpdr)
}
# (c)
betbar=mean(betvec); betci=betbar+c(-1,1)*qnorm(0.975)*sd(betvec)/sqrt(J)
c(betbar,betci) # 5.766 5.731 5.801
X11(w=8,h=7); par(mfrow=c(1,1))
hist(ybarvec,prob=T,nclass=20,xlim=c(3.5,8),
xlab="ybar",ylab="density/relative frequency",main="")
lines(density(ybarvec),lty=2,lwd=3,col="blue")
abline(v=c(ybarbar,ybarci,ybarcpdr),lty=2,lwd=3,col="blue")
ybarv=seq(3,8,0.01); fv=rep(NA,length(ybarv))
for(i in 1:length(ybarv)) fv[i] = mean(dnorm(ybarv[i], evec, sqrt(vvec)))
lines(ybarv,fv,lty=1,lwd=2,col="red")
abline(v=c(ebar,eci,ecpdr),lty=1,lwd=2,col="red")
legend(3.4,0.9,c("Histogram","Rao-Blackwell"),
lty=c(2,1), lwd=c(3,2),col=c("blue","red"), bg="white")
CHAPTER 11
Transformations and Other Topics
11.1 Inference on complicated quantities
Step 2. Use the sample in Step 1 to generate a random sample from the
predictive distribution of the nonsample vector yr = ( yn +1 ,..., y N ) , that is
yr(1) ,..., yr( J ) ~ iid f ( yr | D ) , where yr( j ) = ( yn( +j )1 ,..., y N( j ) ) .
Often, the sample can be obtained easily via the method of composition
and the identity
f ( yr , θ | D ) = f ( yr | D, θ ) f (θ | D ) ,
namely by sampling
yr( j ) ~ f ( yr | D, θ ( j ) )
for each j = 1,..., J .
In many cases, each sampled nonsample vector y_r^{(j)} here can be obtained
by sampling
y_i^{(j)} ~ ⊥ f( y_i | D, θ^{(j)} ), i = n + 1, ..., N,
and then forming the vector according to
y_r^{(j)} = ( y_{n+1}^{(j)}, ..., y_N^{(j)} ).
We may then estimate the predictive mean
ψ̂ = E(ψ | D) = ∫ ψ f(ψ | D) dψ
(which may be impossible to obtain analytically) by the Monte Carlo
sample mean ψ̄ = (1/J) Σ_{j=1}^J ψ^{(j)} (which is unbiased, in that E(ψ̄ | D) = ψ̂).
The precision of ψ̄ may be gauged in the usual way, via the Monte Carlo
variance estimate ( 1/(J(J − 1)) ) Σ_{j=1}^J ( ψ^{(j)} − ψ̄ )².
(a) Suppose that 2.1, 5.2, 3.0, 7.7 and 9.3 constitute a random sample
from a normal finite population of size 20 whose mean and variance are
unknown. We are interested in the finite population median. Estimate
this quantity using a suitable Bayesian model.
For the purposes of this exercise, let y( i ) denote the ith finite population
order statistic, meaning the ith value amongst y1 ,..., y N after these are
ordered from smallest to largest. We are interested in three finite
population quantities, as follows:
(a) ψ_1 = g_1(y, θ) = g_1(y) = ( y_(N/2) + y_(N/2+1) )/2
(b) ψ_2 = g_2(y, θ) = g_2(y) = (100/N) Σ_{i=2}^N [ ( y_(i) − y_(i−1) )/y_(i−1) ] I( y_(i−1) > 4 )
Note 1: The median ψ_1 is the average of the middle two values, since
N = 20 is even.
Step 1. Generate λ_1, ..., λ_J ~ iid f(λ | D) ~ G( (n − 1)/2 , ((n − 1)/2) s_s² ),
where s_s² = ( 1/(n − 1) ) Σ_{i=1}^n ( y_i − ȳ_s )².
(This step derives from results for the normal-normal-gamma model.)
Step 2. Generate µ_j ~ f(µ | D, λ_j) ~ N( ȳ_s , 1/(n λ_j) ) for each j = 1, ..., J.
(This step derives from results for the normal-normal model.)
Step 4. Use the values ψ (1) ,...,ψ ( J ) ~ iid f (ψ | D ) for Monte Carlo
inference on ψ in the usual way.
Step 2′. Generate λ_j ~ ⊥ f(λ | D, µ_j) ~ G( n/2 , (n/2) s_{µ_j}² ), where
s_{µ_j}² = (1/n) Σ_{i=1}^n ( y_i − µ_j )².
Applying the above four-step procedure (using the original Steps 1 and
2) with Monte Carlo sample size J = 1,000, we obtain Table 11.1 which
shows numerical estimates for the three quantities of interest:
ψ = ψ 1 , ψ 2 and ψ 3 , respectively. Figure 11.1 shows histograms which
illustrate these inferences.
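The following R sketch (illustrative only; results will differ slightly from those reported, owing to Monte Carlo error) implements Steps 1 and 2 above and the simulation of ψ_1, the finite population median, for the data in (a).
# Sketch: Monte Carlo inference on the finite population median psi_1
ys <- c(2.1, 5.2, 3.0, 7.7, 9.3); n <- length(ys); N <- 20; J <- 1000
set.seed(1)
lamv <- rgamma(J, (n - 1)/2, (n - 1)*var(ys)/2)   # Step 1: lambda | D
muv  <- rnorm(J, mean(ys), 1/sqrt(n*lamv))        # Step 2: mu | D, lambda
psi1 <- rep(NA, J)
for (j in 1:J) {
  yr <- rnorm(N - n, muv[j], 1/sqrt(lamv[j]))     # simulate the nonsample values
  yy <- sort(c(ys, yr))
  psi1[j] <- (yy[N/2] + yy[N/2 + 1])/2            # median of all N values
}
mean(psi1); quantile(psi1, c(0.025, 0.975))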
Table 11.1 and Figure 11.1 also contain analogous results for a fourth
quantity of interest which may be defined as
ψ_4 = g_4(y, θ) = ( ψ_3 | ψ_3 ≠ 0 )
= Σ_{i=1}^N y_i I( y_i > µ + Φ^{−1}(0.75)/√λ ) / Σ_{i=1}^N I( y_i > µ + Φ^{−1}(0.75)/√λ ),
provided the denominator is greater than 0.
This is because it might be the case that the upper quartile of the
normal distribution is negative and many of the finite population
values happen (by a very small chance) to lie between that negative
quartile and zero.
Table 11.1 (column headings): Quantity of interest: ψ_1, ψ_2, ψ_3, ψ_4 = ( ψ_3 | ψ_3 ≠ 0 )
options(digits=4)
X11(w=9,h=6.5); par(mfrow=c(2,1))
psivec=psi1vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 5.842 5.790 5.893 5.528 5.769 4.308 7.528
hist(psivec, prob=T, xlab="psi1",xlim=c(0,10),ylim=c(0,0.6),
breaks=seq(0,10,0.25), main="Monte Carlo inference on psi1")
lines(fpsi,lwd=3)
abline(v= c(psibar, psici, psicpdr, psimedian, psimode) ,
lty=c(1,1,1,1,1,2,2), lwd=rep(2,7))
legend(0,0.6,
c("Posterior mean, 95% CI \n & 95% CPDR","Posterior mode & median"),
lty=c(1,2), lwd=c(2,2), bg="white")
psivec=psi2vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 9.975 9.775 10.175 8.150 9.377 5.522 17.770
length(psi3vec[psi3vec!=0]) # 960
length(psi3vec[psi3vec==0]) # 40 40/1000 = 4%
psivec=psi3vec[psi3vec!=0]; J=length(psivec); J # 960 Condition on psi > 0
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 60.74 58.99 62.49 62.45 60.59 11.72 114.96
Finally, we calculate
ψ ( j) = g( y( j) )
for each j = 1,..., J .
This results in
ψ (1) ,...,ψ ( J ) ~ iid f (ψ | D ) ,
namely a sample from the predictive distribution of the finite population
quantity of interest, on the original scale required for that quantity. This
sample can then be used for Monte Carlo inference on ψ in the usual
way.
28.374, 69.857, 22.721, 57.593, 126.965, 17.816, 16.078, 0.803, 3.164, 3.544,
2.123, 2.353, 184.539, 59.856, 63.701, 585.684, 29.094, 79.245, 18.105, 1.623,
5.513, 1.629, 63.654, 22.060, 187.463, 5.051, 34.299, 27.475, 0.746, 34.016,
8.547, 1.081, 3.151, 55.569, 2.593, 522.377, 1.660, 130.435, 1.246, 169.462,
3.444, 6.376, 18.735, 51.312, 33.920, 350.346, 475.795, 4.972, 24.451, 86.987.
We create a histogram of the sample values and see that the underlying
distribution is highly right skewed. However, a histogram of the natural
logarithm of the sample values is consistent with a normal
superpopulation model. The histograms are shown in Figure 11.2.
The data is D = (s, z_s) = ( (1, ..., 50), (28.374, 69.857, ..., 86.987) ) (after a
convenient ordering), and the quantity of interest is
ȳ = (1/N) Σ_{i=1}^N y_i = g(z) = (1/N) Σ_{i=1}^N h^{−1}(z_i) = (1/N) Σ_{i=1}^N exp(z_i).
So we generate
( µ1 , λ1 ),...,( µ J , λJ ) ~ iid f ( µ , λ | D )
(using methods detailed previously).
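A minimal R sketch of this transformation approach follows (muv and lamv are assumed to hold the J posterior draws of µ and λ obtained from the logged data, and ys, n, m and N are as defined above).
# Sketch: prediction of ybar via the log transformation
J <- length(muv); yrbarvec <- rep(NA, J)
for (j in 1:J) {
  zr <- rnorm(m, muv[j], 1/sqrt(lamv[j]))    # nonsample values on the log scale
  yrbarvec[j] <- mean(exp(zr))               # back-transform, then average
}
ybarvec <- (n*mean(ys) + m*yrbarvec)/N
mean(ybarvec); quantile(ybarvec, c(0.025, 0.975))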
We also estimate the bounds of the 95% CPDR for y by 49.26 and
302.05, where these are the empirical 0.025 and 0.975 quantiles of
y (1) ,..., y ( J ) .
Discussion
Figure 11.4 shows histograms of the values z_1, ..., z_N which were in fact
drawn from the normal distribution with mean 3 and standard deviation
2 (left plot), and of the values y_1 = exp(z_1), ..., y_N = exp(z_N) (right plot),
together with the true underlying superpopulation densities of the
variables z_i and y_i.
Figure 11.5 shows the original data values (untransformed) and both sets
of inferences above. It highlights the value of performing an appropriate
prior transformation for purposes of estimating the finite population
mean.
X11(w=8,h=4); par(mfrow=c(1,2))
hist(Z,prob=T,xlim=c(-4,10), ylim=c(0,0.25),breaks=seq(-3,12,0.5))
lines(seq(-5,12,0.01),dnorm(seq(-5,12,0.01),3,2),lwd=3)
hist(Y,prob=T,xlim=c(0,600),ylim=c(0,0.08), breaks=seq(0,5000,10));
yg=seq(0.1,700,0.5); lines(yg ,dnorm( log(yg),3,2)/yg, lwd=3)
# Look at given data and the log of that data (load data etc.) ------------------
N = 200; n = 50; m = N-n; options(digits=4)
ys = c( 28.374, 69.857, 22.721, 57.593, 126.965,
17.816, 16.078, 0.803, 3.164, 3.544,
2.123, 2.353, 184.539, 59.856, 63.701,
585.684, 29.094, 79.245, 18.105, 1.623,
5.513, 1.629, 63.654, 22.060, 187.463,
5.051, 34.299, 27.475, 0.746, 34.016,
8.547, 1.081, 3.151, 55.569, 2.593,
522.377, 1.660, 130.435, 1.246, 169.462,
3.444, 6.376, 18.735, 51.312, 33.920,
350.346, 475.795, 4.972, 24.451, 86.987)
summary(ys)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.7 3.5 23.6 74.2 63.7 586.0
zs=log(ys); par(mfrow=c(1,2))
hist(ys,prob=T); hist(zs,prob=T) # preliminary plots
hist(ys,prob=T,xlim=c(0,600),ylim=c(0,0.045),
breaks=seq(0,700,10), main="Sample values");
hist(zs,prob=T,xlim=c(-2,8), ylim=c(0,0.35),
breaks=seq(-3,10,0.5), main="Log of sample values");
summary(ybarvec)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 37.0 70.6 89.4 111.0 122.0 2080.0
ybarvec=(1/N)*(n*ysbar+m*yrbarvec); ybarhat=mean(ybarvec)
ybarci=ybarhat+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J)
ybarcpdr=quantile(ybarvec,c(0.025,0.975))
inf.transform = c(ybarhat,ybarci,ybarcpdr)
c(inf.transform,YBAR) # 11.006 10.904 11.108 8.478 15.016 11.698
X11(w=8,h=4); par(mfrow=c(1,1))
hist(ys,prob=T) # preliminary plot
hist(ys,prob=T,xlim=c(0,40),ylim=c(0,0.2), breaks=seq(0,40,1), main=" ");
abline(v=inf.original,lty=2,lwd=2); abline(v=inf.transform,lty=1,lwd=2)
points(YBAR,0,pch=16)
legend(20,0.2,
c("Inference using original scale", "Inference using log transformation"),
lty=c(2,1),lwd=c(2,2))
text(30,0.1,
"The dot shows 11.7, the true value \nof the finite population mean")
Now suppose that in the context of this general model, data and quantity
of interest, we derive a point estimate for ψ (such as the posterior mean,
mode or median) of the form
ψˆ = ψˆ ( D )
and a 1 − α interval estimate for ψ (such as the CPDR or HPDR) of the
form
I = (L, U) = I(D) = ( L(D), U(D) ).
Also, we define:
E_y( ȳ_s | θ, s ) = ∫ ȳ_s f(y | θ, s) dy.
Now, in this case,
f(s | y, θ) = f(s) = C(N, n)^{−1} (the reciprocal of the binomial coefficient) for all s = (1, ..., n), ..., (N − n + 1, ..., N),
so that
f(y | θ, s) ∝ f(y, θ, s) = f(s | y, θ) f(y | θ) f(θ) ∝ 1 × f(y | θ) × 1,
and therefore
f(y | θ, s) = f(y | θ) = f( y_r, y_s | θ ) = f( y_r | θ, y_s ) f( y_s | θ ),
with s fixed at its observed value.
Therefore B_{θ,s} = E( ȳ_s | θ ) − θ = E( ȳ_s − θ | θ ).
We have shown that the model bias here is the same as the bias of y s
in the earlier non-finite population context (where s did not feature in
the notation).
Now, E_s( ȳ_s | θ, y ) = Σ_s ȳ_s f(s | y, θ) = ( 1/(kn) ) Σ_s ( y_{s_1} + ... + y_{s_n} ),
where f(s | y, θ) = 1/k and k = C(N, n),
= ( 1/(kn) ) { ( y_1 + ... + y_n ) + ... + ( y_{N−n+1} + ... + y_N ) }.
We see that { ... } = ( kn/N )( y_1 + ... + y_N ) = kn ȳ.
Thus E_s( ȳ_s | θ, y ) = ( 1/(kn) ) kn ȳ = ȳ,
and so B_{θ,y} = E_s( ȳ_s | θ, y ) − ȳ = ȳ − ȳ = 0.
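This zero design bias is easy to confirm empirically; the following R sketch (illustrative only, using a simulated population) compares the average of the sample means over repeated SRSWOR draws with the finite population mean.
# Sketch: checking the design-unbiasedness of the sample mean under SRSWOR
set.seed(1)
y <- rexp(50); n <- 10                            # hypothetical finite population
means <- replicate(10000, mean(sample(y, n)))     # repeated SRSWOR draws
c(mean(means), mean(y))                           # the two values should nearly agree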
What is the difference between your point estimate and γ? Does γ lie
inside the interval? Calculate γ̂, the MLE of γ, and report the difference
between γ̂ and γ.
Based on your results, estimate the model bias and relative model bias of
your point estimator, and the model coverage of your interval estimator.
Also estimate the model bias and relative model bias of the MLE γ̂.
What is the difference between your point estimate and ψ ? Does ψ lie
inside the interval?
Based on your results, estimate the design bias and relative design bias
of your point estimator, and the design coverage of your interval
estimator.
Then the first n = 20 values were taken as a sample from the finite
population. Figure 11.8 shows a histogram of these sample values. The
sample mean and standard deviation of the sample values were
ȳ_s = 10.516 and s_s = 1.749. So the MLE of γ = µ/σ was calculated
as γ̂ = µ̂/σ̂ = ȳ_s/s_s = 6.011.
Then a Monte Carlo sample of size J = 1,000 was taken from the joint
posterior distribution of µ and λ = 1 / σ 2 , i.e. from f ( µ , λ | D ) where
D = ( s, ys ). Hence a MC sample of size J was obtained from the
posterior distribution of γ , namely γ 1 ,..., γ J ~ iid f (γ | D ) .
Note that this applies in a very particular situation, namely one with
N = 100, n = 20, µ = 10, σ = 2, and a MC estimation scheme as
described above with specifically J = 1,000.
(c) Repeating (a) and (b) with K = 5,000, we obtained the following
results:
Estimate of model bias of the Bayesian estimator is 0.1616, with 95% CI (0.1359, 0.1872)
Estimate of model bias of the MLE γ̂ is 0.2301, with 95% CI (0.2041, 0.2561)
Estimate of relative model bias of the Bayesian estimator is 3.2%, with 95% CI (2.7, 3.7)%
Estimate of relative model bias of the MLE γ̂ is 4.6%, with 95% CI (4.1, 5.1)%.
From these results it appears that both the Bayesian and ML estimators
are indeed positively biased by several percent, with the Bayesian
estimator slightly outperforming the MLE.
It also appears that the model coverage of the Bayesian interval estimate
is very close to the nominal 95%.
Then a MC sample of size J = 1,000 was taken from the joint posterior
distribution of µ and λ = 1 / σ 2 , i.e. from f ( µ , λ | D ) with D = ( s, y s ) .
Hence a MC sample of size J was obtained from the predictive
distribution of ψ , namely ψ 1 ,...,ψ J ~ iid f (ψ | D ) .
We note that the true value of ψ lies in the Bayesian interval estimate,
and the difference between the Bayesian estimate and the true value is
1.715 − 1.536 = 0.179.
Note: The empirical mode was obtained using the R function density().
We see that the design bias of the empirical mode appears to be smaller
than that of the empirical median, which in turn is smaller than that of
the posterior mean. The biases of the Monte Carlo predictive mean,
median and mode estimates (based on a Monte Carlo sample size of
J = 1,000) are estimated as +5.3%, +3.8% and +1.4%.
Note: From Figure 11.15 in (d) we may have already guessed that the
posterior mode is better than the posterior mean as an estimate of ψ
(whose true value is 1.536, as shown by the dot in Figures 11.15–18).
# (a)
ys=y[1:n]
hist(ys,prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
main=" ")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)
J=1000; set.seed(171);
lamv=rgamma(J,(n-1)/2,sys^2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
gamv=muv*sqrt(lamv)
gambar=mean(gamv); gamint=quantile(gamv,c(0.025,0.975))
c(gambar, gamint) # 5.925 4.115 7.963
Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.2226 4.9986 5.4466 0.9100 0.8539 0.9661
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.298 5.070 5.526
Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.222583 -0.001418 0.446583 0.298019 0.070493 0.525544
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.0445165 -0.0002836 0.0893166 0.0596037 0.0140986 0.1051088
# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
ylim=c(0,0.6), breaks=seq(0,12,0.5), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec),lty=2,lwd=3)
points(gam,0,pch=16)
legend(6.5,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
lty=c(1,2), lwd=c(3,3))
# (c)
Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.162 5.136 5.187 0.951 0.945 0.957
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.230 5.204 5.256
Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.1616 0.1359 0.1872 0.2301 0.2041 0.2561
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.03231 0.02718 0.03745 0.04602 0.04081 0.05122
# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
ylim=c(0,0.6), breaks=seq(2,12,0.25), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec),lty=2,lwd=3)
points(gam,0,pch=16)
legend(6,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
lty=c(1,2), lwd=c(3,3))
# (d)
set.seed(421); s=sample(1:N,n)
ys=y[s]; ysbar=mean(ys); sy=sd(ys); sy2=var(ys)
c(ysbar,sy, sy2) # 9.438 2.448 5.994
set.seed(323); J=1000;
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J);
for(j in 1:J){ yrsim=rnorm(N-n,muv,1/sqrt(lamv)); ysim=c(ys,yrsim);
psiv[j]=psifun(y=ysim) }
psibar=mean(psiv); psiint=quantile(psiv,c(0.025,0.975))
c(psibar,psiint) # 1.715 1.456 2.078
summary(psiv)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.37 1.60 1.69 1.72 1.81 2.34
# hist(psiv,prob=T)
hist(psiv,prob=T,xlab="psi", xlim=c(1.3,2.4), ylim=c(0,4),breaks=seq(1,2.5,0.05),
main="")
abline(v=c(psibar,psiint),lwd=3); den=density(psiv)
lines(den,lwd=3); points(psi,0,pch=16)
psimedian=median(psiv)
psimode=den$x[den$y==max(den$y)]
c(psibar,psimedian,psimode) # 1.715 1.688 1.659
date() #
for(k in 1:K){
ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J); for(j in 1:J){
yrsim=rnorm(N-n,muv,1/sqrt(lamv))
ysim=c(ys,yrsim)
psiv[j]=psifun(y=ysim)
}
psibarvec[k] = mean(psiv);
LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
};
date() # Simulation with K=100 & J=1000 takes 12 seconds
# hist(psibarvec,prob=T)
hist(psibarvec,prob=T,xlab="psibar", xlim=c(1.2,2), ylim=c(0,6.5),
breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
date() #
for(k in 1:K){
ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J); for(j in 1:J){
yrsim=rnorm(N-n,muv,1/sqrt(lamv))
ysim=c(ys,yrsim)
psiv[j]=psifun(y=ysim)
}
psimedianvec[k] = median(psiv)
den=density(psiv); psimodevec[k]=den$x[den$y==max(den$y)]
LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
}
date() # Simulation with K=100 & J=1000 takes 12 seconds
ct=0; for(k in 1:K) if((LBvec[k]<=psi)&&(psi<=UBvec[k])) ct=ct+1
# hist(psimedianvec,prob=T)
hist(psimedianvec,prob=T,xlab="psimedian", xlim=c(1.2,2),
ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
# hist(psimodevec,prob=T)
hist(psimodevec,prob=T,xlab="psimode", xlim=c(1.2,2),
ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
CHAPTER 12
Biased Sampling and Nonresponse
12.1 Review of sampling mechanisms
We have already discussed the topic of ignorable and nonignorable
sampling in the context of Bayesian finite population models. To be
definite, let us now focus on the model defined by:
f ( s | y,θ ) (the probability of obtaining sample s for given
values of y and θ )
f (y |θ ) (the model density of the finite population vector)
f (θ ) (the prior density of the parameter),
where the data is D = ( s, ys ) and the quantity of interest is some
functional ψ = g (θ , y ) , e.g. a function of two components of θ or a
function of y only, etc.
With these definitions we may now augment our ‘base model’ above
with a new level in the hierarchy, typically in between y and s, as
follows:
Then define
o = (o1 ,..., ono )
as the observed vector (the vector of the labels of the units sampled and
observed), and define
u = (u1 ,..., unu )
as the unobserved vector (the vector of the labels of the units sampled
and unobserved).
These two basic definitions then lead to four general cases, defined as
follows:
f ( s | R, y , θ )
f ( R | y,θ )
f (y |θ )
f (θ ) ,
D = ( s, Rs , yo )
ψ = g(θ, y, R) = 1_N′ y = Σ_{i=1}^N y_i = y_T (the finite population total)
θ = (µ, π)
Show that the sampling mechanism and response mechanism are both
ignorable, and that this is true for all possible values of the data.
y_uT = 1_u′ y_u = Σ_{i∈u} y_i is the total of the unobserved sample values
y_rT = 1_r′ y_r = Σ_{i∈r} y_i is the total of the nonsample values.
f( y_u, y_r | s, R_s, y_o ) ∝ f( y_u, y_r, s, R_s, y_o )
= Σ_{R_r} ∫∫ f( y_u, y_r, s, R_s, y_o, R_r, µ, π ) dµ dπ
= Σ_{R_r} ∫∫ f(µ) f(π) f( y_o | µ ) f( y_u, y_r | µ, y_o ) × f( R_s | π ) f( R_r | π ) f(s) dµ dπ
= f(s) × { ∫ f( y_u, y_r | µ, y_o ) f(µ) f( y_o | µ ) dµ } × { ∫ f(π) f( R_s | π ) Σ_{R_r} f( R_r | π ) dπ },
where the last factor equals ∫ f(π, R_s) × 1 dπ = f(R_s). Hence, as a function of y_u and y_r,
f( y_u, y_r | s, R_s, y_o ) ∝ 1 × ∫ f( y_u, y_r | µ, y_o ) [ f(µ) f( y_o | µ )/f( y_o ) ] dµ × 1
= ∫ f( y_u, y_r | µ, y_o ) f( µ | y_o ) dµ
= ∫ f( y_u, y_r, µ | y_o ) dµ
= f( y_u, y_r | y_o ).
That is,
f( y_u, y_r | s, R_s, y_o ) = f( y_u, y_r | y_o ),
as required.
(c) Repeat (b) but using a suitable Bayesian model which takes into
account the response mechanism and appropriately incorporates it into
the inferential procedure.
Noting that the sampling mechanism is ignorable, and that the response
mechanism would be ignorable if all n sample values were known, we
posit a suitable Bayesian model as follows:
(ȳ | y_rT, R_s, y_s, µ, λ) ~ (ȳ | y_rT, y_s) = (1/N)( y_sT + y_rT )
(y_rT | R_s, y_s, µ, λ) ~ (y_rT | µ, λ) ~ N( (N − n)µ , (N − n)/λ )
f( R_s | y_s, µ, λ ) = f( R_s | y_s ) = ∏_{i∈s} p_i^{R_i} (1 − p_i)^{1−R_i},
where p_i = 1/( 1 + e^{−(a + b y_i)} )
f( y_s | µ, λ ) = ∏_{i∈s} √( λ/(2π) ) e^{−(λ/2)( y_i − µ )²}
f(µ, λ) ∝ 1/λ, µ ∈ ℝ, λ > 0.
Discussion
It is instructive to now reveal that the data values in this exercise were in
fact generated as follows.
First, a finite population of size N = 500 was generated from the normal
distribution with mean µ = 10 and standard deviation σ = 2. The mean
of the finite population values was calculated as y = 10.10.
Note: We see that the CPDR in (c), (9.013, 10.32), contains this true
value of y , whereas the CPDRs in (a) and (b), (11.42, 12.46) and
(10.33, 11.51), do not. This suggests the analysis in (c) was on the
right track.
Then a random sample of size n = 100 was taken from the finite
population according to SRSWOR. The sample mean was calculated as
ys = 9.91.
Note: Thus, if there had been no nonresponse then the finite population
mean (with true value 10.10) would have been estimated by 9.91.
Figure 12.3 shows histograms of the population and sample values, each
overlaid by the superpopulation density. The dots in the two subplots
show y = 10.10 and ys = 9.91, respectively.
Thereby it was established which sample units would respond and which
would not. Figure 12.4 shows histograms of these two groups (of size
no = 34 and nu = 66), each overlaid by the superpopulation density. The
dots in the left and right subplots show yo and yu , respectively, and
each histogram is overlaid by the superpopulation density.
We see how the respondent values are systematically larger than the
nonrespondent values. This reflects the fact that units with larger values
were more likely to respond.
rbind(s[1:10],Rs[1:10])
# [1,] 6 7 14 17 22 37 39 48 66 69
# [2,] 0 0 1 0 1 0 0 0 1 1
o[1:5] # 14 22 66 69 78 Correct
u[1:5] # 6 7 17 37 39 Correct
yo = y[o]; yu = y[u]
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);
yobar=mean(yo); yubar=mean(yu)
c(ybar,ysbar,yrbar,yobar,yubar) # 10.095 9.907 10.143 11.938 8.860
# (a) ===================================
yo = c(12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28, 9.7,
12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75, 10.34, 14.37, 12.13,
8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28, 9.91, 10.94, 13.28, 11.43)
no=length(yo); N=500; ybarhata = mean(yo); so=sd(yo)
ybarcpdra=ybarhata+c(-1,1)*qt(0.975,no-1)*(so/sqrt(no))*sqrt(1-no/N)
c(no,so,ybarhata, ybarcpdra) # 34.000 1.552 11.939 11.416 12.461
# (b) ===================================
yf = c(5.4,9.41,7.03,8.88,11.47,7,9.44,8.58,9.27,8.18,8.62,8.73,7.33, 9.81,9.88)
nf=length(yf); yof=c(yo,yf); nof=no+nf; ybarhatb = mean(yof); sof=sd(yof)
ybarcpdrb=ybarhatb+c(-1,1)*qt(0.975,nof-1)*(sof/sqrt(nof))*sqrt(1-nof/N)
c(nof,sof,ybarhatb, ybarcpdrb) # 49.000 2.168 10.917 10.326 11.509
# (c) ============================================
# Plot observed and follow-up sample values separately
par(mfrow=c(1,2))
hist(yo,prob=T,xlab="value", main="Initially observed",
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1));
points(mean(yo),0,pch=16);
hist(yf,prob=T,xlab="value", main="Follow-up",
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1));
points(mean(yf),0,pch=16)
model
{
for(i in 1:n){
zs[i] <- a + b*ys[i]
logit(ps[i])<- zs[i]
rs[i] ~ dbern(ps[i])
ys[i] ~ dnorm(mu,lam)
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
mu ~ dnorm(0.0,0.001)
lam ~ dgamma(0.001,0.001)
ysT <- sum(ys[])
meanyrT <- nr*mu
precyrT <- lam/nr
yrT ~ dnorm(meanyrT,precyrT)
ybar <- (ysT+yrT)/(n+nr)
}
# data
list( n=100, nr=400,
rs=c( 1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1, 1,1,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0),
ys=c(
12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28,
9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75,
10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28,
9.91, 10.94, 13.28, 11.43, 5.4, 9.41, 7.03, 8.88, 11.47, 7,
9.44, 8.58, 9.27, 8.18, 8.62, 8.73, 7.33, 9.81, 9.88, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA) )
# inits
list(a=0,b=0,mu=0,lam=1)
Then define:
yi as the indicator for pop. unit i having the characteristic (0 or 1)
π i as the probability that unit i will be sampled (e.g. phone in to
answer the question)
I i as the indicator that population unit i is sampled.
We now wish to generalise this model to account for the possibility that
ys may be biased. To this end, suppose each π i can be one of two
values:
φ1 if that unit has the characteristic in question, i.e. if yi = 1
φ0 if that unit does not have the characteristic, i.e. if yi = 0.
Writing φ ≡ φ_0 and λ = φ_1/φ_0, we have π_i = λφ if y_i = 1 and π_i = φ if y_i = 0. Then:
P( y_i = 1 | I_i = 1 )
= P( y_i = 1 ) P( I_i = 1 | y_i = 1 ) / [ P( y_i = 0 ) P( I_i = 1 | y_i = 0 ) + P( y_i = 1 ) P( I_i = 1 | y_i = 1 ) ]
= pφ_1 / ( (1 − p)φ_0 + pφ_1 ) = pφλ / ( (1 − p)φ + pφλ ) = pλ / ( 1 − p + pλ ).
Note: Observe how one of the parameters, namely φ, cancels out here.
We may now write y_sT ~ Bin(n, ω), where ω = pλ/( 1 − p + pλ ).
Also, solving ω = pλ/( 1 − p + pλ ) for p yields p = ω/( λ − λω + ω ).
It follows that the MLE and MOME of p is p̂ = ȳ_s/( λ − λȳ_s + ȳ_s ).
Also, (L, U) = ȳ_s ± z_{α/2} √( ȳ_s(1 − ȳ_s)/n ) is a 1 − α CI for ω.
Therefore, a 1 − α CI for p is ( L/( λ − λL + L ), U/( λ − λU + U ) ).
Also, the bias of p̂ is B(p̂) = E[ ȳ_s/( λ − λȳ_s + ȳ_s ) ] − p.
That is, p̂ = ȳ_s/( λ − λȳ_s + ȳ_s ) is asymptotically unbiased for p as n → ∞.
Observe that ω = pλ/( 1 − p + pλ ) implies λ = ω(1 − p)/( p(1 − ω) ).
Then recall that the phone-in poll conducted by the TV network yielded
an estimate of 0.33, and that the parallel scientifically designed (and
‘proper’) survey yielded an estimate of 0.72.
This estimate being less than unity is consistent with our earlier intuition
that the phone-in poll estimate might be too low due to yes-respondents
being less likely to phone in than no-respondents.
To this poll there were 4,941 yes-responses and 4,512 no-responses, thus
a proportion of
4,941/(4,941 + 4,512) = 4,941/9,453 = 0.523 yeses.
This suggests that persons who wanted the flag replaced were almost
twice as likely to register their opinion via the Internet poll as persons
who were happy with the old flag.
Now recall Example 12.2. Clearly there is some similarity between the
two polls. Both were conducted on the Internet by the same organisation
within the same half-year, and the two questions asked both relate to
changing something about Australia’s heritage. This similarity suggests
that 1.84 may be a plausible value of λ = π 1 / π 0 to be used in the 4 June
poll here.
Then, a 95% CI for ω = pλ/( 1 − p + pλ ) (the probability of a yes-response for
a respondent) is
(L, U) = ȳ_s ± z_{α/2} √( ȳ_s(1 − ȳ_s)/n ) = 0.592 ± 1.96 √( 0.592(1 − 0.592)/4,299 )
= (0.577, 0.607).
Therefore, a 1 − α CI for p is
( L/( λ − λL + L ), U/( λ − λU + U ) )
= ( 0.577/( 1.84 − 1.84 × 0.577 + 0.577 ), 0.607/( 1.84 − 1.84 × 0.607 + 0.607 ) )
= (0.426, 0.456).
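These calculations are easily reproduced in R, as in the following short sketch.
# Sketch: point estimate and 95% CI for p from the 4 June 2000 poll, taking lambda = 1.84
n <- 4299; ysbar <- 0.592; lam <- 1.84
LU   <- ysbar + c(-1, 1)*qnorm(0.975)*sqrt(ysbar*(1 - ysbar)/n)   # 95% CI for omega
p_ci <- LU/(lam - lam*LU + LU)                                    # 95% CI for p
phat <- ysbar/(lam - lam*ysbar + ysbar)                           # point estimate of p
round(c(phat, p_ci), 3)                                           # approx. 0.441, 0.426, 0.456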
( y_sT | p, λ ) ~ Bin(n, ω), where ω = pλ/( 1 − p + pλ ) (as before)
( p | λ ) ~ Beta(α, β)
λ ~ Gamma(η, τ).   (12.2)
Recall the 28 January 2000 Internet poll yielding 4,941 yeses out of
9,453 responses and the related properly conducted probability survey
yielding 829 yeses and 1,394 nos.
n = 9,453, y sT = 4,941
(the observed data in the self-selected sample).
Using suitable WinBUGS code (see below) and a sample size of 10,000
after a burn-in of 1,000, we obtained results shown in Table 12.2. Figure
12.7 shows some of the graphical output from WinBUGS.
We see that λ has been estimated as 1.84 again, but now with some
measure of uncertainty: the 95% posterior interval estimate for λ is
(1.68, 2.03).
Equating the sample mean and sample variance of the 10,000 simulated
values with the theoretical mean and variance of the Gamma (η ,τ ) ,
namely η / τ and η / τ 2 , respectively, we may approximate the posterior
distribution of λ as Gamma (η ,τ ) with η = 431 and τ = 234.
model;
{
ysT~ dbin(omega,n)
omega <- (p*lam)/(1-p+lam*p)
lam ~ dgamma(eta,tau)
p ~ dbeta(alpha,beta)
}
# data
list(ysT=4941,n=9453,eta=0.000001,
tau=0.000001,alpha=830,beta=1395)
# inits
list(p=0.5,lam=1)
# Need to run BUGS code above first, using coda to create output in data.txt
Recall the 4 June 2000 poll yielding 2,544 yeses out of 4,299 responses,
leading to 0.441 as an estimate of p, with 95% CI (0.426, 0.456), based
on λ being exactly equal to 1.84. This suggests we apply our Bayesian
model in WinBUGS to estimate p with:
η = 431, τ = 234
(using the posterior for λ in Example 12.4 as the prior)
n = 4,299, y sT = 2,544
(the observed data in the self-selected sample).
We see that p has been estimated as 0.441 again, with 95% interval
estimate (0.414, 0.470). It will be noted that this interval is wider than
the one in Example 12.3; this may be attributed to the fact that in
Example 12.3 uncertainty regarding λ was not properly taken into
account. For more information on the topic in this section, see Puza and
O’Neill (2006).
Note: The posterior for λ is virtually the same as the prior for λ . This
was to be expected, since—unlike in Example 12.4—the data here
does not contain any structure which could tell us anything about the
relationship between the sampling propensities π 0 and π 1 .
model;
{
ysT~ dbin(omega,n)
omega <- (p*lam)/(1-p+lam*p)
lam ~ dgamma(eta,tau)
p ~ dbeta(alpha,beta)
}
# data
list(ysT=2544,n=4299,eta=431,
tau=234,alpha=1,beta=1)
# inits
list(p=0.5,lam=1)
This is a clue to the fact that the Bayesian model in that section is only useful for infinite population inference, in particular on the superpopulation parameter p, and cannot be used for inference on finite population quantities, in particular the finite population mean ȳ = (y_1 + ... + y_N)/N.
This is not an issue when N is very large (as it was assumed there), since in that case inference on ȳ is, by the law of large numbers, virtually identical to inference on the superpopulation mean p.
A sample is selected from the finite population in such a way that every
unit without the characteristic has probability φ of being sampled, and
every unit with the characteristic has probability λφ of being selected.
Every unit that is sampled has its value fully observed.
The prior on φ is beta with parameters δ and γ but stretched evenly over the interval (0, c) (that is, φ/c ~ Beta(δ, γ)), where c < 1 is a specified constant representing an absolute upper bound for what the value of φ could possibly be. (Examples of potentially suitable values of c are 0.1, 0.2 and 0.5.)
Also, the prior on λ is beta with parameters η and τ but stretched evenly over the interval (0, 1/c) (that is, cλ ~ Beta(η, τ)), so as to permit a suitably wide range of possible values for the ratio of the sampling propensities π_1 = λφ and π_0 = φ. (For example, if c = 0.2 then that ratio could be anything from 0 to 5.)
(b) Suppose we are interested in both the superpopulation mean (i.e. the
common probability of a unit having the characteristic, p) and the finite
population mean (i.e. proportion of the N finite population units which
have the characteristic, ȳ). Write down a formula for the joint posterior (and predictive) density of all quantities which are relevant to and could be used as a basis for the desired inference.
(d) Modify the MH algorithm in (c) so that its output features only the
three model parameters and none of the nonsample values. (NB: The
idea here is to design a superior MH algorithm, one with better ‘mixing’
than the one in (c).)
(e) Describe a procedure whereby the output from the algorithm in (d)
could be used to obtain a sample from the predictive distribution of the
nonsample mean. Then run that algorithm and implement the procedure
so as to produce results intended to be equivalent to those in the
reanalysis of Example 5 in (c) with N = 200,000.
(p | λ, φ) ~ Beta(α, β),   (λ | φ) ~ (1/c) × Beta(η, τ)
f(φ, λ, p, y_r | I, y_s) ∝ f(φ, λ, p, y_r, I, y_s)
= f(φ, λ, p, y_r, I_s, I_r, y_s)
= f(φ) f(λ) f(p) × f(y_s | p) f(y_r | p) × f(I_s | y_s, φ, λ) f(I_r | y_r, φ, λ)   (12.3)

= [ (φ/c)^{δ−1} (1 − φ/c)^{γ−1} / (c B(δ, γ)) ] × [ c (cλ)^{η−1} (1 − cλ)^{τ−1} / B(η, τ) ] × [ p^{α−1} (1 − p)^{β−1} / B(α, β) ]
× ∏_{i∈s} p^{y_i} (1 − p)^{1−y_i} ∏_{i∈r} p^{y_i} (1 − p)^{1−y_i}
× ∏_{i∈s} (φ λ^{y_i})^{I_i} (1 − φ λ^{y_i})^{1−I_i} ∏_{i∈r} (φ λ^{y_i})^{I_i} (1 − φ λ^{y_i})^{1−I_i}   (12.4)

∝ φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}
× p^{y_sT} (1 − p)^{n−y_sT} p^{y_rT} (1 − p)^{N−n−y_rT}
× ∏_{i∈s} (φ λ^{y_i})^{1} (1 − φ λ^{y_i})^{1−1} ∏_{i∈r} (φ λ^{y_i})^{0} (1 − φ λ^{y_i})^{1−0}   (12.5)

= φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}
× p^{y_sT+y_rT} (1 − p)^{N−y_sT−y_rT} × φ^n λ^{y_sT} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT}.   (12.6)
Note 1: In all of the above e.g. (12.3), s and r are fixed at their
observed values.
If w ≡ φ/c ~ Beta(δ, γ) then f(w) = w^{δ−1} (1 − w)^{γ−1} / B(δ, γ).
Therefore
f(φ) = f(w) dw/dφ = (φ/c)^{δ−1} (1 − φ/c)^{γ−1} / ( c B(δ, γ) ).
⇒ (y_rT | D, φ, λ, p) ~ Bin(N − n, q),
where q = p(1 − φλ) / [ p(1 − φλ) + (1 − p)(1 − φ) ]   (12.7)
f(p | D, φ, λ, y_rT) ∝ p^{α+y_sT+y_rT−1} (1 − p)^{β+N−n−y_sT−y_rT−1}   (12.8)
Also:
f(φ | D, y_rT, λ, p) ∝ φ^{δ+n−1} (1 − φ/c)^{γ−1} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT}   (12.9)
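As a rough illustration in R of how the first two of these conditionals might be used within a data-augmentation sampler (the variable names p, phi, lam, ysT, n, N, alpha and beta are assumptions for the sketch; the conditionals for φ and λ are non-standard and would need, for example, Metropolis-Hastings updates):
q = p*(1-phi*lam)/( p*(1-phi*lam) + (1-p)*(1-phi) )        # as in (12.7)
yrT = rbinom(1, N-n, q)                                    # impute the nonsample total
p = rbeta(1, alpha + ysT + yrT, beta + N-n - ysT - yrT)    # as in (12.8)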
Our point and interval estimates for λ are 1.85 and (1.68, 2.02), which are very similar to 1.84 and (1.68, 2.03) in Example 12.4.
Repeating the above but with finite population sizes 400,000 and 40,000, respectively, we obtain the corresponding results shown in Table 12.5.

Table 12.5 (rows: posterior mean, SD, 2.5% and 97.5% points)

              N = 400,000                              N = 40,000
         phi (φ)  lam (λ)  p        ybar (ȳ)      phi (φ)  lam (λ)  p         ybar (ȳ)
Mean     0.01803  1.83548  0.37394  0.373948      0.18123  1.81588  0.375693  0.375834
SD       0.08546  0.08546  0.00981  0.009832      0.07579  0.07579  0.009203  0.009399
2.5%     0.01731  1.68407  0.35413  0.354113      0.17492  1.66922  0.357356  0.357050
97.5%    0.01878  2.00923  0.39122  0.391193      0.18813  1.97208  0.393969  0.394500
Note: The three sets of inferences in Tables 12.4 and 12.5 have yielded
different estimates of φ but very similar results for the other three
quantities, in particular the object of this study, λ .
Figure 12.10 shows graphical output from the first of the three
Metropolis-Hastings algorithms (i.e. the one with N = 200,000).
This posterior for λ was then fed in as the prior for λ so as to redo the
analysis in Example 12.5.
Accordingly, the MH algorithm in (b) was next applied once again but
with the following specifications:
Thus point and interval estimates for p are 0.440 and (0.413, 0.466), which we note are similar to 0.441 and (0.414, 0.470) in Example 12.5. Also, point and 95% interval estimates for ȳ are 0.452 and (0.426, 0.478).
Note 1: The inference on ȳ here was not possible using the theory in the section just above the present exercise, i.e. using the infinite population models developed in that section.
Note 2: The posterior for λ is very similar to its prior, which is as one
might expect, since the data now has no structure which could tell us
anything further about that parameter.
Repeating the above but with finite population sizes 400,000 and 40,000, respectively, we obtain the corresponding results shown in Table 12.7.

Table 12.7 (rows: posterior mean, SD, 2.5% and 97.5% points)

              N = 400,000                               N = 40,000
         phi (φ)   lam (λ)  p        ybar (ȳ)      phi (φ)  lam (λ)  p        ybar (ȳ)
Mean     0.007863  1.83516  0.44193  0.44792       0.07888  1.82895  0.44228  0.50220
SD       0.087755  0.08776  0.01375  0.01372       0.08162  0.08162  0.01359  0.01337
2.5%     0.007482  1.66809  0.41563  0.42160       0.07538  1.66402  0.41490  0.47517
97.5%    0.008299  2.00048  0.46819  0.47409       0.08278  1.99275  0.47007  0.52985
Discussion
We see no problem in the first two of these three cases. But for
N = 15,000, the estimation of φ appears to be artificially restricted by
our arbitrary choice of c as 0.2. (Observe that the simulated values are
strongly ‘bunched up’ at just below 0.2.)
Repeating the MCMC run with N = 15,000 but with c also changed to 0.5 appears to solve this problem. Results are shown in Figure 12.14 (page 599). We note that estimation of λ has changed from about 2 to less than 1. This suggests that we might get very similar results with c even larger, e.g. c = 1. But when we do this, we get very different results (not shown). Why? The reason is that the prior for λ is a beta distribution stretched over the interval (0, 1/c) and so involves c; whenever c is changed, the parameters η and τ need to be reconfigured so as to preserve the intended prior for λ, and this was not done here.
Note: The prior for φ also involves c but does not need reconfiguring (because that prior is uniform for all values of c, since δ = γ = 1).
Thus, Figure 12.14 (the case of N = 15,000 and c = 0.5) in fact illustrates output which is ‘flawed’ (in this sense) and so should be disregarded.
(d) Recall the joint density (12.6). This density may also be written as:
f(φ, λ, p, y_r | I, y_s) ∝ f(φ, λ, p) p^{y_sT+y_rT} (1 − p)^{N−y_sT−y_rT} × φ^n λ^{y_sT} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT},
where f(φ, λ, p) ∝ φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}.
Now define
z = p(1 − φλ) / [ p(1 − φλ) + (1 − p)(1 − φ) ].
Then
∑_{y_r} ∏_{i∈r} z^{y_i} (1 − z)^{1−y_i} = ∏_{i∈r} ∑_{y_i=0}^{1} z^{y_i} (1 − z)^{1−y_i} = 1.
It follows that
f(φ, λ, p | I, y_s) = ∑_{y_r} f(φ, λ, p, y_r | I, y_s)
∝ f(φ, λ, p) × p^{y_sT} (1 − p)^{n−y_sT} φ^n λ^{y_sT} × [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}.
The corresponding full conditionals of φ and λ are then:
f(φ | D, λ, p) ∝ φ^{δ+n−1} (1 − φ/c)^{γ−1} [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}
f(λ | D, φ, p) ∝ λ^{η+y_sT−1} (1 − cλ)^{τ−1} [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}
2. Sample y_rT^{(j)} ~ Bin(N − n, z_j), where
z_j = p_j (1 − φ_j λ_j) / [ p_j (1 − φ_j λ_j) + (1 − p_j)(1 − φ_j) ],  j = 1,…,J (from (12.11)).
3. Calculate ȳ^{(j)} = (1/N)( y_sT + y_rT^{(j)} ),  j = 1,…,J.
We now perform the MH algorithm in (d) and the above procedure with:
N = 200,000, n = 4,299, y_sT = 2,544, c = 0.2,
α = 1, β = 1, η = 278.1, τ = 474.8, δ = γ = 1.
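As a rough sketch in R of this prediction step (assuming pv, phiv and lamv denote the retained draws of p, φ and λ from the MH run, and that N, n and ysT are set as above):
zv = pv*(1-phiv*lamv)/( pv*(1-phiv*lamv) + (1-pv)*(1-phiv) )   # z_j for each draw
yrTv = rbinom(length(zv), N-n, zv)                             # step 2
ybarv = (ysT + yrTv)/N                                         # step 3
c(mean(ybarv), quantile(ybarv, c(0.025, 0.975)))               # point and interval estimates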
Thus, since
ȳ = (1/N)( y_sT + (N − n) ȳ_r ),
the RB estimate of ŷ is actually
(1/N)( y_sT + (N − n) z̄ ) = 0.440,
with a 95% confidence interval for ŷ equal to
( (1/N)( y_sT + (N − n) × 0.4361 ), (1/N)( y_sT + (N − n) × 0.4367 ) ) = (0.439, 0.440).
Note 2: The Monte Carlo 95% confidence intervals reported here are
unduly narrow (i.e. will have less than 95% actual coverage). This is
because we did not address the problem of the very strong serial
correlation amongst the values outputted from the Metropolis-Hastings
algorithm, for example by way of thinning or the batch means method.
But this remark only applies to confidence intervals for mean estimates and not to posterior or predictive interval estimates, such as (0.413, 0.467) for ȳ in Table 12.8.
# A ----------------------------------
# B ----------------------------------
# Now calculate new prior from posterior of lambda (based on 1st run above):
c(lamhat,lamse) # 1.846864 0.087889
# Find (eta,tau) such that a Beta(eta,tau) variable stretched over (0,1/c) has mean
# lamhat and sd lamse, by minimising the sum of squared discrepancies:
fun=function(etatau, c=0.2, est=lamhat, se=lamse){
(est-(1/c)*etatau[1]/sum(etatau))^2+
( se^2 - (1/c^2)*prod(etatau)/( sum(etatau)^2*(1 + sum(etatau)) ) )^2 }
etataunew0 = optim(par=c(2,5), fn=fun)$par         # first pass
etataunew = optim(par= etataunew0, fn=fun)$par     # refine from the first solution
etanew=etataunew[1]; taunew=etataunew[2]
c(etanew, taunew) # 278.10 474.79
(1/0.2)*etanew/(etanew+taunew) # 1.8469 (check: implied prior mean for lambda)
sqrt((1/0.2^2)*etanew*taunew/((etanew+taunew)^2*(etanew+taunew+1)))
# 0.087889 OK (check: implied prior sd for lambda)
# C -----------------------------------------------------------
# D -------------------------------------------------
# Repeat above exactly from C to D but with N=20000 and 15000 to produce
# extra graphs. We omit the code for the case N = 15000, c=0.5 and the case
# N = 15000, c = 1
# (e)
# MH2: componentwise random-walk Metropolis sampler for (p, phi, lam), with the
# nonsample values summed out of the posterior kernel (as derived in (d)).
MH2 = function(J=100, n=9453, ysT=4941, alp=830, bet=1395,
p0=0.5, phi0=0.1, lam0=1, psd=0.1, phisd=0.1, lamsd=0.1,
eta=1, tau=1, del=1, gam=1, c=0.2, N=200000 ){
p=p0; phi=phi0; lam=lam0; pv=p; phiv=phi; lamv=lam; pct=0; phict=0; lamct=0;
for(j in 1:J){
pnew=rnorm(1,p,psd)
if((pnew >0)&&(pnew <1)){
logprobnum=(alp-1+ysT)*log(pnew)+(bet-1+n-ysT)*log(1-pnew) +
(N-n)*log((1-pnew)*(1-phi)+pnew*(1-phi*lam))
logprobden=(alp-1+ysT)*log(p)+(bet-1+n-ysT)*log(1-p) +
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ pct=pct+1; p=pnew } }
phinew=rnorm(1,phi,phisd)
if((phinew>0)&&(phinew<c)){
logprobnum=(del-1+n)*log(phinew)+(gam-1)*log(1- phinew/c)+
(N-n)*log((1-p)*(1-phinew)+p*(1-phinew*lam))
logprobden=(del-1+n)*log(phi)+(gam-1)*log(1-phi/c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ phict=phict+1; phi=phinew } }
lamnew=rnorm(1,lam,lamsd)
if((lamnew>0)&&(lamnew<(1/c))){
logprobnum= (eta-1+ysT)*log(lamnew)+(tau-1)*log(1- lamnew*c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lamnew))
logprobden= (eta-1+ysT)*log(lam)+(tau-1)*log(1- lam*c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ lamct=lamct+1; lam=lamnew } }
pv=c(pv,p); phiv=c(phiv,phi); lamv=c(lamv,lam) }
par=pct/J; phiar=phict/J; lamar=lamct/J
list(pv=pv, phiv=phiv, lamv=lamv, par=par, phiar=phiar, lamar=lamar) }
# end fn
X11(w=8,h=6); par(mfrow=c(2,2))
N=200000; n = 4299; ysT=2544; K=2000
set.seed(531); res=MH2(J=K, n=4299, ysT=2544, alp=1, bet=1,
p0=0.5, phi0=0.1, lam0=1, psd=0.008, phisd=0.0007, lamsd=0.04,
eta= etanew, tau= taunew, del=1, gam=1, c=0.2, N=N )
c(res$par, res$phiar,res$lamar) # 0.6580 0.4135 0.6045 OK
plot(res$pv); plot(res$phiv); plot(res$lamv) # Has burnt in OK
p0=res$pv[2001]; lam0=res$lamv[2001]; phi0=res$phiv[2001]
# record last values
# Calculate estimates
# (pv, phiv, lamv and zv denote the draws and z-values from the main, post burn-in run)
phat=mean(pv); pcpdr=quantile(pv,c(0.025,0.975)); pse=sd(pv)
lamhat=mean(lamv); lamcpdr=quantile(lamv,c(0.025,0.975)); lamse=sd(lamv)
phihat=mean(phiv); phicpdr=quantile(phiv,c(0.025,0.975)); phise=sd(phiv)
RBest=mean(zv); RBci=RBest+c(-1,1)*qnorm(0.975)*sd(zv)/sqrt(J)
c(RBest,RBci) # 0.43639 0.43612 0.43667
(1/N)*(ysT+(N-n)*RBest) # 0.43973
(1/N)*(ysT+(N-n)*RBci) # 0.43946 0.44000
APPENDIX A
Additional Exercises
Exercise A.1 Practice with the Metropolis algorithm
Illustrate your results with suitable figures (for example, trace plots and
histograms).
3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5.
Using MCMC methods, estimate the finite population mean and provide
a suitable 95% interval estimate.
(a) The sampled value of m was 0.7071. A histogram of the 100 sampled
normal values is shown in Figure A.1(a) (page 612). This histogram is
overlaid by the (known) normal distribution with mean m and variance
v = m² = 0.5.
The posterior density of m is
f(m | y) ∝ f(m) f(y | m)
∝ e^{−m} ∏_{i=1}^{n} (1/(m√(2π))) exp{ −(y_i − m)²/(2m²) }
∝ e^{−m} m^{−n} exp{ −(1/(2m²)) ∑_{i=1}^{n} (y_i − m)² }.
So the log-posterior is
l(m) = log f(m | y) = −m − n log m − (1/(2m²)) ∑_{i=1}^{n} (y_i − m)².
The dots show the true posterior mean, m̂ = E(m | y) = 0.7393, and the true 95% CPDR for m. The cross shows the true value of m, 0.7071.
The Monte Carlo sample was used to generate a random sample from the predictive distribution of
c = (y_{n+1} + ... + y_{n+10})/10
by sampling
c_j ~ N(m_j, m_j²/10),  j = 1,…,J.
A histogram of these c-values is shown in Figure A.1(f).
The vertical lines show the predictive mean estimate, c̄ = 0.741, the 95% CI for the predictive mean, (0.7270, 0.7549), and the 95% CPDR estimate for c, (0.3063, 1.1893).
• with c = (y_{11} + ... + y_{50})/40 (instead of c = (y_{101} + ... + y_{110})/10).
Figure A.2 is an analogue of Figure A.1, except that subplot (a) does not have a normal density overlaid, and there is an extra subplot (g) that shows inference on the finite population mean, which may be denoted here by
a = (1/50)(10 × 3.71 + 40c).
Some of the estimates and quantities shown in the last subplot (g) are as follows. The histogram estimate of a's predictive mean is ā = 3.061, with 95% CI (3.028, 3.094). The Rao-Blackwell estimate of a's predictive mean is (10 × 3.71 + 40 m̄)/50 = 3.055, with 95% CI (3.031, 3.078). The exact predictive mean of a is (10 × 3.71 + 40 m̂)/50 = 3.068, where m̂ = 2.907 is the exact posterior mean of m. The 95% CPDR estimate for a is (2.190, 4.256).
# (a)
options(digits=4)
INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
# Integrates numerically under a spline through the points given by
# the vectors xvec and yvec, from a to b.
fit <- smooth.spline(xvec, yvec)
spline.f <- function(x){predict(fit, x)$y }
integrate(spline.f, a, b)$value }
INTEG(seq(0,1,0.01), seq(0,1,0.01)^2, 0,1) # 0.3333 correct
X11(w=8,h=6); par(mfrow=c(2,2));
set.seed(221); m=rgamma(1,1,1); v=m^2; n=100; y=rnorm(n,m,m); c(m,v)
# 0.7071 0.5000
hist(y,prob=T,xlim=c(-2,4),ylim=c(0,0.8), breaks=seq(-2,4,0.25),
main="(a) Histogram of 100 y-values")
yvec=seq(-2,4,0.01); lines(yvec,dnorm(yvec,m,m),lwd=3)
abline(v=c(m,m+c(-1,1)*qnorm(0.975)*m), lwd=3)
LOGPOST=function(m=2,n=10,y=c(2,1)){
-m-n*log(m)-(1/(2*m^2))*sum((y-m)^2) }
LOGPOST() # -9.056 OK
METALG = function(J=1000,y,m0=1,mdel=0.4){
m=m0; mv=m; mct=0; n=length(y); for(j in 1:J){
mcand=runif(1,m-mdel,m+mdel)
if(mcand>0){ logprob=LOGPOST(m= mcand,n=n,y=y)-
LOGPOST(m=m,n=n,y=y)
prob=exp(logprob)
u=runif(1); if(u<=prob){ mct=mct+1; m= mcand }
}
mv=c(mv,m)
}
list(mv=mv,mar=mct/J) }
J=length(mv); J # 1000
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
mcpdr=quantile(mv,c(0.025,0.975));
mvec=seq(0.5,1,0.01); kvec=mvec;
for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y))
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 6.269e-11
mhat=INTEG(mvec,mvec*postvec);
c(mbar,sd(mv),mhat,mci,mcpdr)
# 0.73769 0.04305 0.73935 0.73502 0.74036 0.66197 0.82984
fun=function(q,p=0.025){ (INTEG(mvec,postvec,0,q)-p)^2 }
LB0 = optim(par=0.5,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par
fun=function(q,p=0.975){ (INTEG(mvec,postvec,0,q)-p)^2 }
UB0 = optim(par=0.8,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par
c(LB,UB) # 0.6609 0.8305
INTEG(mvec,postvec,0,LB) # 0.025
INTEG(mvec,postvec,UB,1) # 0.025 OK (Ignore all the warnings)
par(mfrow=c(2,1))
hist(mv,prob=T,xlim=c(0.6,0.9),ylim=c(0,10), breaks=seq(0.5,1,0.01),
xlab="x",main="(e) Histogram of 1000 m-values")
lines(mvec,postvec,lty=1,lwd=3)
lines(density(mv),lty=2,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(c(mhat,LB,UB),c(0,0,0),pch=16)
points(m,0,pch=4,lwd=3)
# Prediction of c -----------------------
set.seed(332); cv=rnorm(J,mv,mv/sqrt(10))
cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J)
ccpdr=quantile(cv,c(0.025,0.975))
c(cbar,sd(cv),cci,ccpdr) # 0.7410 0.2253 0.7270 0.7549 0.3063 1.1893
hist(cv,prob=T,xlim=c(0,1.6),ylim=c(0,2.5), breaks=seq(0,1.6,0.05),
xlab="c",main="(f) Histogram of 1000 c-values")
cvec=seq(0,1.5,0.01); fcvec=seq(0,1.5,0.01); for(i in 1:length(cvec))
fcvec[i]=mean(dnorm(cvec[i],mv,mv/sqrt(10)))
lines(cvec,fcvec,lty=1,lwd=3)
lines(density(cv),lty=2,lwd=3)
abline(v=c(cbar,cci,ccpdr),lwd=2)
points(mhat,0,pch=16)
# (b)
X11(w=8,h=6); par(mfrow=c(2,2));
y = c(3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5); n = 10; ybar=mean(y);
ybar # 3.71
hist(y,prob=T,xlim=c(0,10),ylim=c(0,0.6), breaks=seq(0,10,0.5),
main="(a) Histogram of 10 y-values")
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
mcpdr=quantile(mv,c(0.025,0.975));
mvec=seq(1.8,5,0.01); kvec=mvec;
for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y))
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 3.317e-08
mhat=INTEG(mvec,mvec*postvec);
c(mbar,sd(mv),mhat,mci,mcpdr)
# 2.8907 0.4823 2.9071 2.8608 2.9206 2.1456 3.9827
fun=function(q,p=0.025){ (INTEG(mvec,postvec,1.8,q)-p)^2 }
LB0 = optim(par=2.1,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par
fun=function(q,p=0.975){ (INTEG(mvec,postvec,1.8,q)-p)^2 }
UB0 = optim(par=4.1,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par
c(LB,UB) # 2.143 4.033
INTEG(mvec,postvec,1.8,LB) # 0.025
INTEG(mvec,postvec,UB,5) # 0.025 OK (Ignore all the warnings)
par(mfrow=c(2,1))
hist(mv,prob=T,xlim=c(1,5),ylim=c(0,1), breaks=seq(1,5,0.2),
xlab="x",main="(e) Histogram of 1000 m-values")
lines(mvec,postvec,lty=1,lwd=3)
lines(density(mv),lty=2,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(c(mhat,LB,UB),c(0,0,0),pch=16)
points(m,0,pch=4,lwd=3)
X11(w=8,h=4); par(mfrow=c(1,1))
hist(av,prob=T,xlim=c(1.5,5.5), ylim=c(0,1), breaks=seq(1,6,0.2), xlab="c",
main="(g) Histogram of 1000 a-values (finite population mean)")
avec=seq(1,6,0.01); favec=seq(1,6,0.01); for(i in 1:length(avec))
favec[i]=
mean( dnorm( avec[i], (1/50)*( 10*ybar+40*mv), mv*sqrt(40)/50 ) )
lines(avec,favec,lty=1,lwd=3); lines(density(av),lty=2,lwd=3)
abline(v=c(abar,aci,acpdr),lwd=2)
points( (1/50)*(10*ybar+40*mbar) ,0.1,pch=1,cex=1, lwd=2)
points( (1/50)*(10*ybar+40*mci) ,c(0.06,0.14), pch=1,cex=1, lwd=2)
points( (1/50)*(10*ybar+40*mhat) ,0,pch=4,lwd=2,cex=2)
points(ybar,0,cex=1,lwd=2,pch=16)
legend(3.9,1, c("Histogram density estimate","Rao-Blackwell estimate"),
lty=c(2,1), lwd=c(3,3), bg="white")
legend(3.83,0.67,c("Sample mean","Rao-Blackwell estimate & 95% CI",
"Exact predictive mean"),
pch=c(16,1,4), pt.cex=c(1,1,2), pt.lwd= c(2,2,2), bg="white")
Then randomly sample n = 100 values from the gamma distribution with mean m = a/b and variance v = a/b².
Illustrate your results with suitable figures (e.g. trace plots and
histograms).
The sampled values of a and b were 1.463 and 5.528. So the value of m
was a/b = 0.2647. The 100 sampled gamma values are shown in Figure
A.3(a) (page 621).
f(a, b | y) ∝ f(a, b) f(y | a, b)
∝ e^{−a} ∏_{i=1}^{n} [ b^a y_i^{a−1} e^{−b y_i} / Γ(a) ] = e^{−a} b^{na} ( ∏_{i=1}^{n} y_i )^{a−1} e^{−b y_T} / Γ(a)^n.
So the log-posterior is
l(a, b) = −a + na log b + (a − 1) ∑_{i=1}^{n} log y_i − b y_T − n log Γ(a).
The posterior is then sampled using an algorithm which at each iteration:
1. Proposes a value a′ ~ U(a − δ_a, a + δ_a), where δ_a is a tuning constant, and accepts this value with probability p = e^q, where q = l(a′, b) − l(a, b)
2. Proposes a value b′ ~ U(b − δ_b, b + δ_b), where δ_b is a tuning constant, and accepts this value with probability p = e^q, where q = l(a, b′) − l(a, b).
The Monte Carlo sample was then used to generate a random sample from the predictive distribution of
c = (y_{n+1} + ... + y_{n+10})/10.
(b) Here we repeat the procedure in (a) but using n = 6 (rather than 100), and the 6 given sample values whose mean is 2.25 (instead of the 100 generated values as before), so as to generate a Monte Carlo sample of size J = 1,000 from the posterior distribution of a and b.
Then for each j we calculate the associated value of the MAD, namely
ψ_j = (1/N) ∑_{i=1}^{N} | y_i − a_j/b_j |.
We then use the resulting J values of the MAD, i.e. ψ_1,...,ψ_J, for Monte Carlo inference in the usual way.
# (a)
options(digits=4); n = 100; X11(w=8,h=4); par(mfrow=c(1,1));
set.seed(192); a=rgamma(1,1,1); b=runif(1,0,10); y=rgamma(n,a,b);
m=a/b; v=a/b^2; c(a,b,m,v) # 1.46321 5.52763 0.26471 0.04789
hist(y,prob=T,xlim=c(0,1.5),ylim=c(0,3), breaks=seq(0,1.5,0.05),
main="(a) Histogram of 100 y-values")
yvec=seq(0,1.5,0.01); lines(yvec,dgamma(yvec,a,b),lwd=3)
abline(v=m,lwd=3)
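# A sketch of the LOGPOST function assumed by MHALG below, based on the posterior
# density derived in the solution above:
# l(a,b) = -a + n*a*log(b) + (a-1)*sum(log(y)) - b*sum(y) - n*log(Gamma(a)).
# (Argument names are chosen to match the calls made inside MHALG.)
LOGPOST = function(a=1, b=1, n=100, sumlogy=0, sumy=1){
-a + n*a*log(b) + (a-1)*sumlogy - b*sumy - n*lgamma(a) }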
MHALG = function(J=1000,y,a0=1,b0=1,adel=1,bdel=1){
a=a0; b=b0; av=a; bv=b; act=0; bct=0; n=length(y);
sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics
for(j in 1:J){
acand=runif(1,a-adel,a+adel)
if(acand>0){
logprob=
LOGPOST (a=acand,b=b,n=n,sumlogy=sumlogy,sumy=sumy)-
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy)
prob=exp(logprob)
u=runif(1); if(u<=prob){ act=act+1; a= acand } }
bcand=runif(1,b-bdel,b+bdel)
if((bcand>0)&&(bcand<10)){
logprob=
LOGPOST (a=a,b=bcand,n=n,sumlogy=sumlogy,sumy=sumy)-
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy)
prob=exp(logprob)
u=runif(1); if(u<=prob){ bct=bct+1; b= bcand }
}
av=c(av,a); bv=c(bv,b)
}
list(av=av,bv=bv,aar=act/J,bar=bct/J)
}
set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=0.3,bdel=1)
X11(w=8,h=6); par(mfrow=c(2,1));
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5055 0.5611
X11(w=8,h=4); par(mfrow=c(1,1));
hist(mv,prob=T,xlim=c(0.2,0.4),ylim=c(0,20), breaks=seq(0.2,0.4,0.005),
xlab="m",main="(b) Histogram of 1000 m-values")
lines(density(mv),lty=1,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(m,0,pch=4,lwd=3)
# Prediction of c -----------------------
set.seed(332); cv=rep(NA,J); for(j in 1:J) cv[j]=mean(rgamma(10,av[j],bv[j]))
cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J)
ccpdr=quantile(cv,c(0.025,0.975))
c(cbar,sd(cv),cci,ccpdr) # 0.29812 0.08356 0.29294 0.30329 0.15843 0.48783
hist(cv,prob=T,xlim=c(0.05,0.7),ylim=c(0,7), breaks=seq(0,1.6,0.02),
xlab="c",main="(c) Histogram of 1000 c-values")
lines(density(cv),lty=1,lwd=3); abline(v=c(cbar,cci,ccpdr),lwd=2)
# (b)
y=c( 0.4, 3.3, 1.0, 2.9, 1.8, 4.1); X11(w=8,h=6); par(mfrow=c(2,1));
n=length(y); sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics
set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=1.3,bdel=0.7)
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5129 0.5094
X11(w=8,h=4); par(mfrow=c(1,1));
hist(mv,prob=T,xlim=c(0,7),ylim=c(0,0.8), breaks=seq(0,10,0.5),
xlab="x",main="Histogram of 1000 simulated m-values")
lines(density(mv),lty=2,lwd=3); abline(v=c(mbar,mci,mcpdr),lwd=2)
hist(psiv,prob=T,xlim=c(0,4),ylim=c(0,1.5), breaks=seq(0,7,0.1),
xlab="psi",main="")
lines(density(psiv),lty=1,lwd=3); abline(v=c(psibar,psici,psicpdr),lwd=2)
Then select a random sample of size n = 20 from the N units in the finite
population, without replacement.
Plot the y values against the x values, over the population and over the
sample, respectively. Draw the true regression line y= a + bx and the
two least squares regression lines estimated using the population data and
sample data, respectively.
Then use this sample and R to estimate each of the following quantities:
m = a + 16b (the average of a hypothetically infinite number of values with covariate 16)
ȳ = (y_1 + ... + y_N)/N (the finite population mean)
ψ = 2 y_{(100)} / ( y_{(50)} + y_{(51)} ) (the ratio of the maximum to the median of the 100 finite population values).
(c) Repeat the inferences in (b) but using WinBUGS and a sample size of
J = 10,000.
(a) The required plot and regression lines are shown in Figure A.5.
(b) Denote the sample values by s_1,...,s_n ∈ {1,...,N}, where s_1 < ... < s_n, and define s = (s_1,...,s_n). Also define r = (r_1,...,r_{N−n}) = {1,...,N} − s in such a way that r_1 < ... < r_{N−n}, and define the nonsample vector as y_r = (y_{r_1},...,y_{r_{N−n}})′.
Thus, to do the required inference, first carry out the following steps:
1. Relabel the population units so that y_s = (y_1,...,y_n)′, x_s = (x_1,...,x_n)′, y_r = (y_{n+1},...,y_N)′, x_r = (x_{n+1},...,x_N)′, etc., so that y = (y_s′, y_r′)′, etc.
2. Calculate A, B, D and T as per the above.
3. Generate λ_1,...,λ_J ~ iid G(A/2, B/2) (easy).
4. Generate β^{(j)} ~ ⊥ N_2(T, D/λ_j), for j = 1,…,J (easy).
5. Generate y_r^{(1)},...,y_r^{(J)} ~ N_{N−n}(X_r β^{(j)}, Σ_rr/λ_j), for j = 1,…,J (e.g. for each j, generate y_i^{(j)} ~ ⊥ N(a_j + b_j x_i, 1/λ_j), i = n+1,...,N, and form y_r^{(j)} = (y_{n+1}^{(j)},...,y_N^{(j)})′).
6. Form y^{(j)} = (y_s′, y_r^{(j)}′)′ for each j = 1,…,J.
Now calculate
m_j = a_j + 16 b_j
and perform Monte Carlo inference on m, using the fact that m_1,...,m_J ~ iid f(m | D).
(For example, estimate m by m̄ = J^{−1} ∑_{j=1}^{J} m_j.)
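As a rough sketch in R of steps 3 to 6 and the inference on m (assuming A, B, D and T have been computed as in step 2, and that N, n, xr and ys are available from the setup; the names lamvec, betamat and yrmat match those used in the code further below):
J = 1000
lamvec = rgamma(J, A/2, B/2)                          # step 3
betamat = sapply(lamvec, function(lam)                # step 4: beta^(j) ~ N2(T, D/lam)
T + t(chol(D/lam)) %*% rnorm(2) )
mvec = betamat[1,] + 16*betamat[2,]                   # m_j = a_j + 16*b_j
yrmat = sapply(1:J, function(j)                       # step 5: nonsample values for draw j
rnorm(length(xr), betamat[1,j] + betamat[2,j]*xr, 1/sqrt(lamvec[j])) )
ybarvec = (sum(ys) + colSums(yrmat))/N                # step 6: finite population mean draws
c(mean(mvec), quantile(mvec, c(0.025, 0.975)))        # Monte Carlo inference on m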
Finally, calculate
ψ_j = 2 y_{(100)}^{(j)} / ( y_{(50)}^{(j)} + y_{(51)}^{(j)} ).
Table A.1 shows some of the true values and corresponding numerical
estimates featuring in Figure A.6.
# (a)
X11(w=8,h=5.5); par(mfrow=c(1,1)); options(digits=4)
N=100; n=20; a=3; b=0.5; sig=2; set.seed(312); x=runif(N,10,20);
y=rnorm(N,a+b*x,sig); s=sort(sample(1:N,n)); xs=x[s]; ys=y[s];
r=(1:N)[-s]; xr=x[r]; yr=y[r]; yT=sum(y); ysT=sum(ys); yrT=sum(yr)
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);
xT=sum(x); xsT=sum(xs); xrT=sum(xr)
xbar=mean(x); xsbar=mean(xs); xrbar=mean(xr);
m=a+16*b; psi=max(y)/median(y)
c(m, ybar,max(y),median(y),psi) # 11.000 10.473 15.234 10.616 1.435
plot(x,y,xlim=c(0,20),ylim=c(0,17));
points(xs,ys,pch=16); abline(v=0,lty=3); abline(h=0,lty=3); abline(v=16,lty=3);
abline(h=a+16*b,lty=3);
abline(a,b,lwd=3);
abline(lm(y~x),lty=2,lwd=3); abline(lm(ys~xs),lty=3,lwd=3);
abline(lm(yr~xr),lty=4,lwd=3)
legend(0,17,bg="white", c("True regression line","Estimate from population",
"Estimate from sample","Estimate from nonsample"),
lty=1:4,lwd=rep(3,4) )
text(16,2,"The solid dots show the sample values")
avec=betamat[1,]; bvec=betamat[2,]
ahat=mean(avec); bhat=mean(bvec); c(ahat,bhat) # -0.5742 0.7175
yrmat=matrix(NA,nrow=N-n,ncol=J)
set.seed(334); for(j in 1:J)
yrmat[,j]= rnorm(N-n,avec[j]+bvec[j]*xr,1/sqrt(lamvec[j]))
hist(mvec,prob=T,xlim=c(8,14),ylim=c(0,1), breaks=seq(7,14,0.25),
xlab="m",main="(a) Histogram of 1000 m-values") # Ignore warnings
lines(density(mvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(mhat,mci,mcpdr),lty=2,lwd=3) # Histogram estimates
mhat2=c(1,16)%*%T; points(mhat2,0, pch=16,cex=1.5) # Exact posterior mean
mvarterm2=c(1,16)%*%D%*%c(1,16); msdterm2=sqrt(mvarterm2)
mv=seq(6,16,0.05); fmv2=mv
for(k in 1:length(mv))
fmv2[k]=mean(dnorm(mv[k],mhat2,msdterm2/sqrt(lamvec)))
lines(mv,fmv2,lwd=3); # Exact posterior density of m
points(m,0, pch=4,cex=2,lwd=3 ) # True value of m
legend(8,1,c("Histogram estimate","Exact density"), lty=c(2,1),lwd=c(3,3),
bg="white")
legend(8,0.6,c("Rao-Blackwell","True"),pch=c(16,4),
pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")
hist(ybarvec,prob=T,xlim=c(8,12),ylim=c(0,1), breaks=seq(3,18,0.25),
xlab="ybar",main="(b) Histogram of 1000 ybar-values")
lines(density(ybarvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(ybarhat, ybarci, ybarcpdr),lty=2,lwd=3) # Histogram estimates
ybarv=seq(8,13,0.02); fybarhatv=ybarv;
meanvalvec = (1/N)*( ysT+(N-n)*(avec+bvec*xrbar) )
varvalvec = (N-n)/(lamvec*N^2)
for(k in 1:length(ybarv)){
fybarhatv[k]= mean( dnorm(ybarv[k], meanvalvec, sqrt(varvalvec) ) ) }
lines(ybarv, fybarhatv,lty=1,lwd=3) # Rao-Blackwell
points(mean(meanvalvec),0,pch=16,cex=1.5) # Rao-Blackwell
points(ybar, 0, pch=4,cex=2,lwd=3 ) # True value of ybar
legend(8,1,c("Histogram estimate","Rao-Blackwell"),
lty=c(2,1),lwd=c(3,3), bg="white")
legend(8,0.6,c("Rao-Blackwell","True value"),pch=c(16,4),
pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")
model
{
for(i in 1:100){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.0001)
b ~ dnorm(0.0,0.0001)
lam ~ dgamma(0.0001,0.0001)
m <- a+16*b
ybar <- mean(y[])
max <- ranked(y[],100)
medL <- ranked(y[],50)
medU <- ranked(y[],51)
med <- (medL + medU)/2
psi <- max/med
}
# data
list(y=c(
14.98,10.99,9.58,6.56,13.83, 11.38,9.13,13.25,7.03,11.14,
2.74,11.97,12.15,9.39,11.71, 10.25,7.98,8.54,10.66,10.41,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA),
x=c(19.34,18.2,14.27,10.91,13.45,13.3,11.31,16.62,13.07,17.45,10.55,
17.66,17.34,17.46,16.14,17.19,10.96,14.19,16.08,14.83,17.92,16.61,
14.52,16.7,12.28,14.61,14.51,11.5,15.17,16.72,11.27,15.21,16.34,
10.36,12.62,19.27,19.7,12.26,10.07,18.74,11.86,12.35,16.79,13.18,
14.05,17.52,18.17,18.7,18.1,10.17,10.26,12.95,12.64,12.35,18.39,
12.08,17.48,13.47,14.47,16.76,17.64,14.32,19.07,17.29,15.87,14.2,
18.49,14.69,13.57,14.74,12.41,19.99,18.39,16.43,15.6,15.74,18.33,
16.98,16.72,19.3,13.92,11.4,11.55,13.83,12.36,13.3,15.3,19.26,18.15,
17.75,10.72,13.78,13.2,14.98,13.53,10.19,16.46,12.57,10.36,19.49))
# inits
list(a=0,b=0,lam=1)
the finite population mean ȳ = (y_1 + ... + y_N)/N.
if the value of unit 1 is 1 then each sample with unit 1 is twice as likely to be selected as each sample without unit 1.
We observe the values of the two sampled units (each being 0 or 1) as well
as the labels identifying them (each being 1, 2, 3 or 4).
(a) Write down a suitable Bayesian model for the above scenario in terms
of the densities of the parameter θ , the finite population vector,
y = ( y1 ,..., y N ) , and the sample, s = ( s1 ,..., sn ) .
Your formulae may involve only these variables, as well as n, N, and the
vector of inclusion counters, I = ( I1 ,..., I N ) , where I i = 1 if the ith unit is
in the sample, and I i = 0 otherwise. (Note that there is a one-to-one
correspondence between s and I in this exercise.)
(i) Design and run a Gibbs sampler to check the posterior mean of θ
in (c) and the predictive mean of y in (f).
(j) Use Monte Carlo methods to check the two design biases in (h).
(k) Find the mean of the predictive mean of the finite population mean.
Then apply Monte Carlo methods to check your answer.
f(θ) = 1/2,  θ = 1/4, 3/4.
Also, if y_1 = 1 then
f(s | y, θ) = f(s | y_1) = { 2c, 1 ∈ s ; c, 1 ∉ s }
= { 2c, s = (1,2), (1,3), (1,4) ; c, s = (2,3), (2,4), (3,4) }.
Hence
1 = ∑_s f(s | y) = c ∑_s (1 + I_1) = c [ ∑_s 1 + ∑_{s: 1∈s} 1 ]
= c [ C(N, n) + C(N−1, n−1) ] = c [ C(4, 2) + C(3, 1) ] = c(6 + 3) = 9c
⇒ c = 1/9.
Note 2: There are a total of C(N−1, n−1) samples s which contain any given particular unit i. So if y_1 = 1 then
f(s | y, θ) = f(s | y_1) = { 2/9, s = (1,2), (1,3), (1,4) ; 1/9, s = (2,3), (2,4), (3,4) }.
Putting together the two cases above (y_1 = 0 and 1), we see that the sampling mechanism is given generally by
f(s | y, θ) = f(s | y_1) = (1 + I_1 y_1) / [ C(N, n) + C(N−1, n−1) y_1 ]
= (1 + I_1 y_1) / (6 + 3 y_1),  s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4),
where of course I_1 = I( s ∈ {(1,2), (1,3), (1,4)} ).
From Table A.2 we may also confirm that, as specified in the problem:
In that case:
f(s | y, θ) = (3 − y_1)/18 = { 1/6 = 3/18, y_1 = 0 ; 1/9 = 2/18, y_1 = 1 },  s = (2,3), (2,4), (3,4).
First, suppose that unit 1 is sampled, so that y_1 is observed. Then
f(θ | D) = f(θ | s, y_s) ∝ f(θ, s, y_s) = ∑_{y_r} f(θ, s, y_s, y_r)
= ∑_{y_r} f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
= ∑_{y_r} f(θ) f(y_s | θ) f(y_r | θ) f(s | y_1)
= f(θ) f(y_s | θ) f(s | y_1) ∑_{y_r} f(y_r | θ)
∝ ∏_{i∈s} θ^{y_i} (1 − θ)^{1−y_i} = θ^{y_sT} (1 − θ)^{2−y_sT}
= { (1/4)^{y_sT} (3/4)^{2−y_sT}, θ = 1/4 ; (3/4)^{y_sT} (1/4)^{2−y_sT}, θ = 3/4 }
∝ { 3^{2−y_sT}, θ = 1/4 ; 3^{y_sT}, θ = 3/4 }
∝ { 3^2, θ = 1/4 ; 3^{y_sT + y_sT}, θ = 3/4 }
= { 9, θ = 1/4 ; 9^{y_sT}, θ = 3/4 }.
That is (if 1 ∈ s),
f(θ | D) = { 9/(9 + 9^{y_sT}), θ = 1/4 ; 9^{y_sT}/(9 + 9^{y_sT}), θ = 3/4 }
= { (9/10, 1/10), y_sT = 0 ; (1/2, 1/2), y_sT = 1 ; (1/10, 9/10), y_sT = 2 }
(the pairs giving the probabilities of θ = 1/4 and θ = 3/4 respectively).
Note: This could also be written as θ̂ = (3 + 2 y_sT)/10 (if 1 ∈ s).
Next, suppose that unit 1 is not sampled. Then the value of unit 1 is unknown and so the sampling mechanism is nonignorable. In this case,
f(θ | D) ∝ ∑_{y_r} f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
= ∑_{y_r} f(θ) f(y_s | θ) f(y_r | θ) f(s | y_1)
= f(θ) f(y_s | θ) ∑_{y_r} f(s | y_1) f(y_r | θ)
= f(θ) f(y_s | θ) q(θ),
where
q(θ) ∝ ∑_{y_r} (3 − y_1) f(y_r | θ) = E_{y_r}(3 − y_1 | θ) = 3 − θ.
In detail (writing y_r = (y_1, y_k), where k is the other nonsampled unit),
∑_{y_r} (3 − y_1) f(y_r | θ) = [ ∑_{y_k=0}^{1} θ^{y_k} (1 − θ)^{1−y_k} ] × [ ∑_{y_1=0}^{1} (3 − y_1) θ^{y_1} (1 − θ)^{1−y_1} ]
= 1 × { (3 − 0) θ^0 (1 − θ)^{1−0} + (3 − 1) θ^1 (1 − θ)^{1−1} }
= 3(1 − θ) + 2θ
= 3 − θ.
∝ { 3^2 × 11, θ = 1/4 ; 3^{y_sT + y_sT} × 9, θ = 3/4 }
= { 11, θ = 1/4 ; 9^{y_sT}, θ = 3/4 }.
Putting the two cases together we find that the posterior mean of θ is given generally by
θ̂ = E(θ | D) = θ̂(D) = θ̂(s, y_s),
a function of the data whose six possible values are listed further below.
Note: Here:
1 ∈ s ⇔ I_1 = 1 ⇔ s = (1,2), (1,3) or (1,4)
1 ∉ s ⇔ I_1 = 0 ⇔ s = (2,3), (2,4) or (3,4).
Also:
ysT = 0 iff both sampled values are 0
ysT = 1 iff one sampled value is 0 and the other is 1
ysT = 2 iff both sampled values are 1.
Now,
f(y | θ, s) = f(y, s | θ) / f(s | θ),
where:
f(y, s | θ) = f(s | y, θ) f(y | θ) = [ (3 + y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i}
(using the result in (b) that f(s | y, θ) = (3 + y_1)/18 if 1 ∈ s)
f(s | θ) = ∑_y f(y, s | θ) = ∑_y f(s | y, θ) f(y | θ) = E_y{ f(s | y, θ) | θ } = E_y{ (3 + y_1)/18 | θ } = (3 + θ)/18.
Therefore
f(y | θ, s) = [ (3 + y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i} / [ (3 + θ)/18 ]
= [ (3 + y_1)/(3 + θ) ] θ^{y_1} (1 − θ)^{1−y_1} ∏_{i=2}^{4} θ^{y_i} (1 − θ)^{1−y_i}.
We see that
(y_i | θ, s) ~ ⊥ Bernoulli(π_i),  i = 1, 2, 3, 4,
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 + 1)/(3 + θ) ] θ^1 (1 − θ)^{1−1} = 4θ/(3 + θ).
Check: [ (3 + 0)/(3 + θ) ] θ^0 (1 − θ)^{1−0} = 3(1 − θ)/(3 + θ) = 1 − 4θ/(3 + θ) = 1 − π_1.
It follows that
E(y_sT | θ, s) = E(y_1 | θ, s) + E(y_3 | θ, s) = π_1 + π_3 = 4θ/(3 + θ) + θ = θ(7 + θ)/(3 + θ)
= (1/4)(7 + 1/4) / (3 + 1/4) = (29/16)/(13/4) = 29/52.
Hence
E(θ̂ | θ, s) = (1/10)[ 3 + 2(29/52) ] = 107/260 = 0.4115.
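This value can be checked numerically in R (a quick sketch, taking θ = 1/4 and s = (1,3) as above):
pi1 = 4*(1/4)/(3 + 1/4); pi3 = 1/4     # P(y1=1) and P(y3=1) given theta and s
(3 + 2*(pi1 + pi3))/10                 # 0.4115, i.e. 107/260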
In this case,
f(y | θ, s) = f(y, s | θ) / f(s | θ),
as before, but with
f(y, s | θ) = f(s | y, θ) f(y | θ) = [ (3 − y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i}
(using the result in (b) that f(s | y, θ) = (3 − y_1)/18 if 1 ∉ s).
Thus,
f(s | θ) = ∑_y f(y, s | θ) = ∑_y f(s | y, θ) f(y | θ) = E_y{ f(s | y, θ) | θ } = E_y{ (3 − y_1)/18 | θ } = (3 − θ)/18.
So
f(y | θ, s) = [ (3 − y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i} / [ (3 − θ)/18 ]
= [ (3 − y_1)/(3 − θ) ] θ^{y_1} (1 − θ)^{1−y_1} ∏_{i=2}^{4} θ^{y_i} (1 − θ)^{1−y_i}.
We see that
(y_i | θ, s) ~ ⊥ Bernoulli(π_i),  i = 1, 2, 3, 4,
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 − 1)/(3 − θ) ] θ^1 (1 − θ)^{1−1} = 2θ/(3 − θ).
Check: [ (3 − 0)/(3 − θ) ] θ^0 (1 − θ)^{1−0} = 3(1 − θ)/(3 − θ) = 1 − 2θ/(3 − θ) = 1 − π_1.
It follows that
E(y_sT | θ, s) = E(y_2 | θ, s) + E(y_3 | θ, s) = π_2 + π_3 = θ + θ = 1/4 + 1/4 = 1/2.
Equivalently,
(y_sT | θ, s) ~ Bin(2, θ),
and so
E(y_sT | θ, s) = 2θ.
Hence, weighting the three possible values of θ̂ by the Bin(2, 1/4) probabilities 9/16, 6/16 and 1/16,
E(θ̂ | θ, s) = (9/16)(7/24) + (6/16)(19/40) + (1/16)(127/184) = 2127/5520 = 0.3853.
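Again this can be verified numerically in R (a quick sketch):
sum( dbinom(0:2, 2, 1/4) * c(7/24, 19/40, 127/184) )   # 0.3853, i.e. 2127/5520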
∝ f(θ) f(y_s | θ)
∝ 1 × ∏_{i∈s} θ^{y_i} (1 − θ)^{1−y_i}.
Note: In the above, E(θ̂ | θ, y) does not depend on θ. So, for the case θ = 3/4 and y = (0,0,1,1), the design bias of θ̂ is 1/2 − 3/4 = −0.25.
Recall from (c) that the posterior mean of θ is a function of the data given generally by
θ̂ = θ̂(s, y_s) =
3/10 = 0.3000 if 1 ∈ s and y_sT = 0
1/2 = 0.5000 if 1 ∈ s and y_sT = 1
7/10 = 0.7000 if 1 ∈ s and y_sT = 2
7/24 = 0.2917 if 1 ∉ s and y_sT = 0
19/40 = 0.4750 if 1 ∉ s and y_sT = 1
127/184 = 0.6902 if 1 ∉ s and y_sT = 2.
Then y_s = (y_1, y_2) = (1, 0).
Likewise:
If s = (1,3) then y_s = (y_1, y_3) = (1, 1) and so
θ̂(s, y_s) f(s | θ, y) = (7/10) × (3 + 1)/18 = 7/45.
Therefore
E(y_rT | θ, s, y_s) = E(y_rT | θ, s) = { θ + θ, 1 ∈ s ; θ + φ, 1 ∉ s },
where
φ = 2θ/(3 − θ).
So
E(y_rT | s, y_s) = E{ E(y_rT | θ, s, y_s) | s, y_s }
= { E(2θ | D), 1 ∈ s ; E(θ | D) + E(φ | D), 1 ∉ s }
= { 2θ̂, 1 ∈ s ; θ̂ + φ̂, 1 ∉ s },
where
φ̂ = E(φ | D) = E_θ{ 2θ/(3 − θ) | D } = ∑_{θ = 1/4, 3/4} [ 2θ/(3 − θ) ] f(θ | D).
Note: Working through the above equation using exact fractions, it can be shown that
ŷ = ŷ(s, y_s) =
3/20, 1 ∈ s, y_sT = 0
1/2, 1 ∈ s, y_sT = 1
17/20, 1 ∈ s, y_sT = 2
37/288, 1 ∉ s, y_sT = 0
15/32, 1 ∉ s, y_sT = 1
607/736, 1 ∉ s, y_sT = 2.
The following are details of the working for 37/288, 15/32 and 607/736.
Observe that
E(y_rT | θ, s, y_s) = θ + 2θ/(3 − θ) = θ(5 − θ)/(3 − θ).
Therefore
ŷ_rT = E{ E(y_rT | s, y_s, θ) | s, y_s } = E{ θ(5 − θ)/(3 − θ) | s, y_s }.
If y_sT = 0 then
ŷ_rT = E{ θ(5 − θ)/(3 − θ) | D } = (11/12) × [ (1/4)(5 − 1/4)/(3 − 1/4) ] + (1/12) × [ (3/4)(5 − 3/4)/(3 − 3/4) ]
= (11/12)(19/44) + (1/12)(17/12) = 57/144 + 17/144 = 74/144 = 37/72.
Also, if y_sT = 1 then
ŷ_rT = (11/20)(19/44) + (9/20)(17/12) = 57/240 + 153/240 = 210/240 = 7/8.
And if y_sT = 2 then
ŷ_rT = (11/92)(19/44) + (81/92)(17/12) = 57/1104 + 1377/1104 = 1434/1104 = 239/184.
Hence
ŷ_T = y_sT + ŷ_rT = { 0 + 37/72 = 37/72, y_sT = 0 ; 1 + 7/8 = 15/8, y_sT = 1 ; 2 + 239/184 = 607/184, y_sT = 2 }.
A similar logic can be used to obtain the fractions 3/20, 1/2 and 17/20.
In this case,
y_sT = y_1 + y_3,
and so:
P(y_sT = 0 | θ, s) = (9/13) × (3/4) = 27/52
P(y_sT = 2 | θ, s) = (4/13) × (1/4) = 4/52
P(y_sT = 1 | θ, s) = 1 − 27/52 − 4/52 = 21/52.
In this case,
y_sT = y_2 + y_3,
and so:
P(y_sT = 0 | θ, s) = (3/4) × (3/4) = 9/16
P(y_sT = 2 | θ, s) = (1/4) × (1/4) = 1/16
P(y_sT = 1 | θ, s) = 1 − 9/16 − 1/16 = 6/16.
Note: The derivation of this result did not involve θ . So for the case
θ = 3/4 and y = (0,0,1,1) , the design bias of ŷ is also −0.01463.
E(ŷ | θ, y) = E{ E(ŷ | θ, y, s) | θ, y } = ∑_s ŷ(s, y_s) f(s | θ, y)
= (2/9) ŷ((1,2), (1,0)) + (2/9) ŷ((1,3), (1,1)) + (2/9) ŷ((1,4), (1,1))
+ (1/9) ŷ((2,3), (0,1)) + (1/9) ŷ((2,4), (0,1)) + (1/9) ŷ((3,4), (1,1))
= (2/9)(0.5 + 0.85 + 0.85) + (1/9)(0.46875 + 0.46875 + 0.8247283)
= 0.684692.
Note: The derivation of this result did not involve θ. So for the case θ = 3/4 and y = (1,0,1,1), the design bias of ŷ is also −0.06531.
= { (1/4)^{y_T} (1 − 1/4)^{4−y_T}, θ = 1/4 ; (3/4)^{y_T} (1 − 3/4)^{4−y_T}, θ = 3/4 }.   (A.1)
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 − 1)/(3 − θ) ] θ^1 (1 − θ)^{1−1} = 2θ/(3 − θ).
Therefore
(y_{r_2} | θ, s, y_s, y_{r_1}) ~ Bernoulli(θ).   (A.2)
However, there are two possibilities for y_{r_1}. If the data is such that s_1 = 1 then
(y_{r_1} | θ, s, y_s, y_{r_2}) ~ Bernoulli(θ).   (A.3)
On the other hand, if the data is such that s_1 > 1 then r_1 = 1, and this implies that
(y_{r_1} | θ, s, y_s, y_{r_2}) ~ Bernoulli( 2θ/(3 − θ) ).   (A.4)
It will be observed that these numbers are very close to the corresponding values obtained in (c), namely
θ̂ =
3/10 = 0.3000 if 1 ∈ s and y_sT = 0
1/2 = 0.5000 if 1 ∈ s and y_sT = 1
7/10 = 0.7000 if 1 ∈ s and y_sT = 2
7/24 = 0.2917 if 1 ∉ s and y_sT = 0
19/40 = 0.4750 if 1 ∉ s and y_sT = 1
127/184 = 0.6902 if 1 ∉ s and y_sT = 2.
It will be noted that these are very close to the corresponding values obtained in (f), namely:
0.15, 0.5, 0.85, 0.1284722, 0.4687500, 0.8247283.
(j) To check the design bias in (h)(i) we note that for y = (0,0,1,1) the
sampling mechanism is ignorable.
To check the design bias in (h)(ii) we note that for y = (1,0,1,1) the sampling mechanism is nonignorable, with each sample containing unit 1 twice as likely as each sample not containing unit 1.
So, select a sample s from (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), in such a
way that each of the first three of these has probability 2/9 and each of the
last three has probability 1/9. Then calculate the corresponding value of
ŷ . Repeat another J − 1 times, independently. Then take the mean of the
simulated ŷ values and subtract y = 3/4.
(k) The mean of the predictive mean of the finite population mean is the
same as the unconditional mean of the finite population mean, which is
the same as the prior mean of the superpopulation mean, which in our case
equals 1/2. Mathematically,
E ŷ = E E(ȳ | s, y_s)   by the definition of ŷ
= E ȳ   by the law of conditional expectation
= E E(ȳ | θ)   by the law of conditional expectation
= E θ   since E(ȳ | θ) = (1/4) ∑_{i=1}^{4} E(y_i | θ) = (1/4) ∑_{i=1}^{4} θ = θ
= ∑_θ θ f(θ) = (1/4)(1/2) + (3/4)(1/2) = 1/2.
To verify this obvious result via Monte Carlo is a good final check on
previous calculations.
To this end, simulate θ , then simulate y, then simulate s, hence obtain the
data ( s, y s ) , then calculate the associated ŷ . Then repeat all of the above
independently another J − 1 times.
# (g)
# postfun returns the posterior probabilities f(theta | s, ys) for theta = 1/4 and 3/4
postfun = function(s=c(1,2), ys=c(0,1)){ ysT=sum(ys)
if(any(s==1)==T){ if(ysT==0) probs=c(0.9,0.1)
if(ysT==1) probs=c(0.5,0.5)
if(ysT==2) probs=c(0.1,0.9) }
if(any(s==1)==F){ if(ysT==0) probs=c(11/12,1/12)
if(ysT==1) probs=c(11/20,9/20)
if(ysT==2) probs=c(11/92,81/92) }
probs }
smat=matrix(c(1,2, 1,2, 1,2, 1,2, 2,3, 2,3, 2,3, 2,3), byrow=T,nrow=8, ncol=2)
ysmat= matrix(c(0,0, 0,1, 1,0, 1,1, 0,0, 0,1, 1,0, 1,1),
byrow=T,nrow=8, ncol=2)
thetahatvec=rep(NA,8); phihatvec=rep(NA,8); ybarhatvec=rep(NA,8);
# (h)
(1/6)*(0.15 + 0.5 + 0.5 + 0.46875+ 0.46875 + 0.8247283) # 0.4853714
(2/9)*(0.5 + 0.85 + 0.85) + (1/9)*(0.46875+ 0.46875 + 0.8247283) # 0.684692
# (i) Check posterior means and predictive means via Gibbs sampler
options(digits=4)
GS=function(J=1000, s=c(1,2),ys=c(1,0), theta=1/4 ){
thetav=rep(NA,J); yrTv=rep(NA,J); yTv=rep(NA,J)
yrmat=matrix(NA,nrow=J,ncol=2); ysT=sum(ys)
for(j in 1:J){
probsyi = c(1-theta, theta)
yr2=sample(x=c(0,1),size=1,prob=probsyi)
if(s[1]==1) yr1=sample(x=c(0,1),size=1,prob=probsyi) else
yr1=sample(x=c(0,1),size=1,prob=c(3,2)*probsyi)
yr=c(yr1,yr2); yrT=sum(yr); yT=ysT+yrT
probstheta=c( (1/4)^yT *(3/4)^(4-yT), (3/4)^yT *(1/4)^(4-yT) )
theta = sample( x=c(1/4,3/4), size=1, prob= probstheta)
thetav[j]=theta; yrTv[j]=yrT; yTv[j]=yT; yrmat[j,]=yr
}
list(thetav=thetav, yrTv=yrTv, yTv=yTv, ybarv=yTv/4, yrmat=yrmat) }
# (j) Check design bias of predictive mean of ybar if theta=1/4 and y=(0,0,1,1)
smatrix=matrix(c(1,2, 1,3, 1,4, 2,3, 2,4, 3,4), byrow=T,nrow=6, ncol=2)
y=c(0,0,1,1); J = 10000; ybarhatsimv=rep(NA,J); set.seed(413)
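# The simulation loop assumed here for the first check (y = (0,0,1,1): unit 1 has
# value 0, so all six samples are equally likely); ybarhatfun is taken to be a
# function returning the value of yhat(s, ys) derived in (f):
for(j in 1:J){
indexsim = sample(1:6,1,prob=c(1,1,1,1,1,1))
ssim=smatrix[indexsim,]; yssim= y[ssim]
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }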
est=mean(ybarhatsimv)-0.5;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.5)/sqrt(J)
c(est,ci) # -0.01562 -0.01945 -0.01179 Consistent with -0.01463 in (h)(i)
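# Analogous loop assumed for the second check, now with y = (1,0,1,1), so that each
# sample containing unit 1 is twice as likely (weights 2,2,2,1,1,1):
y=c(1,0,1,1)
for(j in 1:J){
indexsim = sample(1:6,1,prob=c(2,2,2,1,1,1))
ssim=smatrix[indexsim,]; yssim= y[ssim]
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }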
est=mean(ybarhatsimv)-0.75;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.5)/sqrt(J)
c(est,ci) # -0.06592 -0.06944 -0.06239 Consistent with -0.06531 in (h)(ii)
# (k) Simulate theta, then y, then s; the mean of the resulting ybarhat values should be 1/2
for(j in 1:J){
thetasim=sample(c(1/4,3/4),1); ysim=rbinom(4,1,thetasim)
if(ysim[1]==0) indexsim = sample(1:6,1,prob=c(1,1,1,1,1,1))
if(ysim[1]==1) indexsim = sample(1:6,1,prob=c(2,2,2,1,1,1))
ssim=smatrix[indexsim,]; yssim= ysim[ssim];
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }
est = mean(ybarhatsimv);
ci = est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv)/sqrt(J)
c(est,ci) # 0.4992 0.4938 0.5047 Consistent with 0.5
APPENDIX B
Distributions and Notation
Below are several probability distributions which feature in this book. The
purpose of this appendix is to provide a brief guide to the style of notation
and terminology used throughout. It is not intended to be a comprehensive
listing. Some of the notation introduced here is repeated in Appendix C.
If X ~ N(μ, σ²) then EX = Mode(X) = Median(X) = μ and VX = σ².
The cdf of X may be written
F(x) = P(X ≤ x) = F_{N(μ,σ²)}(x) = F(x, N(μ, σ²)) = ∫_{−∞}^{x} f_{N(μ,σ²)}(t) dt.
Thus the p-quantile of X is the value of the inverse cdf of X at p. This may also be written
F^{−1}(p) = F_X^{−1}(p) = F_{N(μ,σ²)}^{−1}(p) = FInv(p, N(μ, σ²)).
If Z ~ N(0,1), we say that Z has the standard normal distribution. The pdf, cdf, (lower) p-quantile and upper p-quantile of Z may be denoted by φ(z), Φ(z), Φ^{−1}(p), and z_p = Φ^{−1}(1 − p), respectively.
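For instance, in R the corresponding normal density, cdf and quantile functions are dnorm, pnorm and qnorm; a brief illustration:
qnorm(0.975)        # upper 0.025-quantile of N(0,1), about 1.96
qnorm(0.10, 5, 2)   # lower 0.10-quantile of N(5, 4), i.e. mean 5 and sd 2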
If X ~ G(a, b) then:
Mode(X) = (a − 1)/b if a > 1
Mode(X) = 0 if a ≤ 1
EX = a/b, VX = a/b²
EX^k = Γ(a + k) / ( b^k Γ(a) ) (the kth raw moment of X),
where Γ(k) = ∫_0^∞ t^{k−1} e^{−t} dt.
Note: We do not write X ~ Exp(b) because this could more easily be confused with X = exp(b) = e^b (where exp is the exponential function).
Note: The symbol ∝_y here denotes ‘proportionality with respect to y’. The statement g ∝_t h means g = c × h, where c is a constant that does not depend on t. E.g. if g = 5t²r³, we may write: g ∝_t t², g ∝_r r³, g ∝_{t,r} t²r³, etc. By default, g(t) ∝ t⁵ means g(t) ∝_t t⁵, and g(t | u) ∝ t⁵ means g(t | u) ∝_t t⁵ (not g(t | u) ∝_{t,u} t⁵).
The upper p-quantile of the t distribution with m degrees of freedom may be written t_p(m) = F_{t(m)}^{−1}(1 − p) = FInv(1 − p, t(m)). We call m the degrees of freedom parameter.
Suppose that U ~ χ²(a), W ~ χ²(b) and U ⊥ W. Then X = (U/a)/(W/b) has the F distribution with parameters a and b. We then write X ~ F(a, b). The pdf and cdf of X (both omitted here) may be denoted f_{F(a,b)}(x) and F_{F(a,b)}(x), respectively. We call a the numerator degrees of freedom and b the denominator degrees of freedom. The upper p-quantile of X may be denoted as F_p(a, b) or F_{F(a,b)}^{−1}(1 − p) or FInv(1 − p, F(a, b)).
APPENDIX C
Abbreviations and Acronyms
Below are some of the abbreviations and acronyms used in this book. The
list may not be comprehensive. Some of the expressions listed have more
than one meaning, depending on the context.
D data
DA data augmentation (algorithm)
df distribution function (same as cdf)
dof degrees of freedom
dsn distribution
DU discrete uniform distribution
E expectation operator
e Euler’s number (2.71828)
ECM Expectation-Conditional-Maximisation (algorithm)
ELF error loss function
EM Expectation-Maximisation (algorithm)
E-Step Expectation Step (in EM algorithm)
exp exponential function (e raised to a power)
Expo exponential distribution
m nonsample size ( m= N − n )
MA moving average (process); Metropolis algorithm
MAD mean absolute deviation; finite population mean
absolute deviation about the superpopulation mean
max maximum/maximise
MC Monte Carlo (method); Markov chain
MCMC Markov chain Monte Carlo (method)
MH Metropolis-Hastings (algorithm)
min minimum/minimise
ML maximum likelihood (method)
MLE maximum likelihood estimate/estimator/estimation
MOME method of moments estimate/estimator/estimation
M-Step Maximisation Step (in EM algorithm)
Bibliography
Leonard, T., and Hsu, J.S.J. (1999). Bayesian Methods: An Analysis for
Statisticians and Interdisciplinary Researchers. Cambridge:
Cambridge University Press.
Lee, P. (1997). Bayesian Statistics: An Introduction. New York: Oxford
University Press.
Lunn, D.J., Thomas, A., Best, N., and Spiegelhalter, D. (2000).
WinBUGS − A Bayesian modelling framework: Concepts,
structure, and extensibility. Statistics and Computing, 10: 325−337.
Maindonald, J., and Braun, W.J. (2010). Data Analysis and Graphics
Using R: An Example-Based Approach, 3rd Edition. Cambridge:
Cambridge University Press.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of
Statistics, 22: 1142−1160.
Ntzoufras, I. (2009). Bayesian Modeling Using WinBUGS. Hoboken NJ:
Wiley.
O’Hagan, A., and Forster, J. (2004). Kendall’s Advanced Theory of
Statistics, Second Edition, Volume 2B, Bayesian Inference. London:
Arnold.
Puza, B. (1995). Monte Carlo Methods for Finite Population Inference.
Internal document. Canberra: Australian Bureau of Statistics.
Puza, B.D. (2002). ‘Postscript: Bayesian methods for estimation’ and
‘Appendix C: Details of calculations in the Postscript’. In Combined
Survey Sampling Inference: Weighing Basu’s Elephants, by
K. Brewer, London: Arnold, 2002, pp 293−296 and 299−302.
Puza, B.D., and O’Neill, T.J. (2005). Length-biased, with-replacement
sampling from an exponential finite population. Journal of
Statistical Computation and Simulation, 75: 159−174.
Puza, B. and O’Neill, T.J. (2006). Selection bias in binary data from
volunteer surveys. The Mathematical Scientist, 31: 85−94.
Rao, C.R. (1973). Linear Statistical Inference and its Applications, 2nd
Edition. New York: Wiley.
Rao, J.N.K. (2011). Impact of frequentist and Bayesian methods on
survey sampling practice: a selective appraisal. Statistical Science,
26: 240−256.
Särndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted
Survey Sampling. New York: Springer.
Seaman, S., Galati, J., Jackson, D., and Carlin, J. (2013). What is meant
by ‘Missing at Random’? Statistical Science, 28(2): 257−268.
Shaw, D. (1988). On-site samples’ regression: Problems of non-negative
integers, truncation, and endogenous stratification. Journal of
Econometrics, 37: 211−223.