Bayesian Methods for Statistical Analysis
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise,
without the prior permission of the publisher.
Abstract
Acknowledgements
Preface
Overview
Acknowledgements
‘Bayesian Methods for Statistical Analysis’ derives from the lecture notes
for a four-day course titled ‘Bayesian Methods’, which was presented to
staff of the Australian Bureau of Statistics, at ABS House in Canberra, in
2013. Lectures of three hours each were held in the mornings of 11, 18
and 25 November and 9 December, and three-hour tutorials were held in
the mornings of 14, 20 and 27 November and 11 December.
Of the 30-odd participants, some of whom attended via video link from
regional ABS offices, special thanks go to Anura Amarasinghe, Rachel
Barker, Geoffrey Brent, Joseph Chien, Alexander Hanysz, Sebastien
Lucie, Peter Radisich and Anthony Russo, who asked insightful questions,
pointed out errors, and contributed to an improved second edition of the
lecture notes. Thanks also to Siu-Ming Tam, First Assistant Statistician
of the Methodology and Data Management Division at ABS, for useful
comments, and for inviting the author to present the course in the first
place, after having read Puza (1995). Last but not least, special thanks go
to Kylie Johnson for her excellent work as the course administrator.
Preface
The software packages which feature in this book are R and WinBUGS.
This book is in the form of an Adobe PDF file saved from Microsoft Word
2013 documents, with the equations as MathType 6.9 objects. The figures
in the book were created using Microsoft Paint, the Snipping Tool in
Windows, WinBUGS and R. In the few instances where color is used, this
is only for additional clarity. Thus, the book can be printed in black and
white with no loss of essential information.
Overview
CHAPTER 1
Bayesian Basics Part 1
1.1 Introduction
Bayesian methods is a term which may be used to refer to any
mathematical tools that are useful and relevant in some way to Bayesian
inference, an approach to statistics based on the work of Thomas Bayes
(1701–1761). Bayes was an English mathematician and Presbyterian
minister who is best known for having formulated a basic version of the
well-known Bayes’ Theorem.
Figure 1.1 (page 3) shows part of the Wikipedia article for Thomas
Bayes. Bayes’ ideas were later developed and generalised by many
others, most notably the French mathematician Pierre-Simon Laplace
(1749–1827) and the British astronomer Harold Jeffreys (1891–1989).
How to generate this sample presents another problem, but one which
can typically be solved easily via Markov chain Monte Carlo (MCMC)
methods. Both MC and MCMC methods will feature in later chapters of
the course.
More generally, we may consider any event B such that P(B) > 0 and k > 1 events A1, ..., Ak which form a partition of any superset of B (such as the entire sample space S). Then, for any i = 1, ..., k, it is true that
P(Ai | B) = P(AiB) / P(B),
where P(B) = Σ_{j=1}^{k} P(AjB) and P(AjB) = P(Aj) P(B | Aj).
The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.
A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease).
What is the probability that the person actually has the disease?
Let A be the event that the person has the disease, and let B be the event
that they test positive for the disease. Then:
P(A) = 0.01 (the prior probability of the person having the disease)
P(B | A) = 0.9 (the true positive rate, also called the sensitivity of the test)
P(B′ | A′) = 0.9 (the true negative rate, also called the specificity of the test).
Discussion
It may seem the posterior probability that the person has the disease (1/12) is rather low, considering the high accuracy of the test (namely P(B | A) = P(B′ | A′) = 0.9).
On the other hand, it may be noted that the posterior probability of the person having the disease is actually very high relative to the prior probability of them having the disease (P(A) = 0.01). The positive test result has greatly increased the person’s chance of having the disease (increased it by more than 700%, since 0.01 + 7.333 × 0.01 = 0.08333).
We find that
P(A | B) = P(A)P(B | A) / [P(A)P(B | A) + P(A′)P(B | A′)] = pq / [pq + (1 − p)(1 − q)].
Figure 1.3 shows the posterior probability of the person having the disease (P(A | B)) as a function of p with q fixed at 0.9 and 0.95, respectively (subplot (a)), and as a function of q with p fixed at 0.01 and 0.05, respectively (subplot (b)). In each case, the answer (1/12) is represented as a dot corresponding to p = 0.01 and q = 0.9.
pvec=seq(0,1,0.01); Pveca=PAgBfun(p=pvec,q=0.9)
Pveca2=PAgBfun(p=pvec,q=0.95)
qvec=seq(0,1,0.01); Pvecb=PAgBfun(p=0.01,q=qvec)
Pvecb2=PAgBfun(p=0.05,q=qvec)
X11(w=8,h=7); par(mfrow=c(2,1));
plot(pvec,Pveca,type="l",xlab="p=P(A)",ylab="P(A|B)",lwd=2)
points(0.01,1/12,pch=16,cex=1.5); text(0.05,0.8,"(a)",cex=1.5)
lines(pvec,Pveca2,lty=2,lwd=2)
legend(0.7,0.5,c("q = 0.9","q = 0.95"),lty=c(1,2),lwd=c(2,2))
plot(qvec,Pvecb,type="l",xlab="q=P(B|A)=P(B'|A')",ylab="P(A|B)",lwd=2)
points(0.9,1/12,pch=16,cex=1.5); text(0.05,0.8,"(b)",cex=1.5)
lines(qvec,Pvecb2,lty=2,lwd=2)
legend(0.2,0.8,c("p = 0.01","p = 0.05"),lty=c(1,2),lwd=c(2,2))
# Technical note: The graph here was copied from R as ‘bitmap’ and then
# pasted into a Word document, which was then saved as a PDF. If the graph
# is copied from R as ‘metafile’, it appears correct in the Word document,
# but becomes corrupted in the PDF, with axis legends slightly off-centre.
# So, all graphs in this book created in R were copied into Word as ‘bitmap’.
In a particular population:
10% of persons have Type 1 blood,
and of these, 2% have a particular disease;
30% of persons have Type 2 blood,
and of these, 4% have the disease;
60% of persons have Type 3 blood,
and of these, 3% have the disease.
A person is randomly selected from the population and found to have the
disease.
Hence: P(C | D) = P(CD)/P(D) = 0.018/0.032 = 9/16 = 56.25%
(where C is the event that the person has Type 3 blood and D is the event that they have the disease).
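A quick check of this calculation in R (a sketch using the figures given in the exercise):
priorv = c(0.10, 0.30, 0.60)     # P(Type 1), P(Type 2), P(Type 3)
diseasev = c(0.02, 0.04, 0.03)   # P(disease | blood type)
jointv = priorv*diseasev         # 0.002 0.012 0.018
sum(jointv)                      # P(D) = 0.032
jointv[3]/sum(jointv)            # P(Type 3 | D) = 0.5625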
Then we calculate:
π 0 = P( E0 ) = the prior probability of the null hypothesis
π 1 = P( E1 ) = the prior probability of the alternative hypothesis
PRO = π 0 / π 1 = the prior odds in favour of the null hypothesis
p0 = P( E0 | D ) = the posterior probability of the null hypothesis
p1 = P( E1 | D ) = the posterior probability of the alternative hypothesis
POO = p0 / p1 = the posterior odds in favour of the null hypothesis.
The Bayes factor is then defined as BF = POO/PRO. Since POO = PRO × [P(D | E0)/P(D | E1)] by Bayes’ rule, the Bayes factor may also be interpreted as the ratio of the likelihood of the data given the null hypothesis to the likelihood of the data given the alternative hypothesis.
Note 2: The idea of a Bayes factor extends to situations where the null
and alternative hypotheses are statistical models rather than events. This
idea may be taken up later.
The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.
A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease).
Calculate the Bayes factor for testing that the person has the disease
versus that they do not have the disease.
This means the positive test result has multiplied the odds of the person
having the disease relative to not having it by a factor of 9 or 900%.
Another way to say this is that those odds have increased by 800%.
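A sketch of these odds calculations in R:
PA = 0.01; qv = 0.9                       # prior probability of disease; test accuracy
priorodds = PA/(1 - PA)                   # 1/99
postodds = (PA*qv)/((1 - PA)*(1 - qv))    # 1/11
postodds/priorodds                        # Bayes factor = 9
(postodds - priorodds)/priorodds          # 8, i.e. an 800% increase in the odds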
In the first few examples below, we will focus on the simplest case where both y and θ are scalar and discrete.
Consider six loaded dice with the following properties. Die A has
probability 0.1 of coming up 6, each of Dice B and C has probability 0.2
of coming up 6, and each of Dice D, E and F has probability 0.3 of
coming up 6.
A die is chosen randomly from the six dice and rolled twice. On both
occasions, 6 comes up.
Let y be the number of times that 6 comes up on the two rolls of the
chosen die, and let θ be the probability of 6 coming up on a single roll
of that die. Then the Bayesian model is:
(y | θ) ~ Bin(2, θ)
f(θ) = 1/6 for θ = 0.1, 2/6 for θ = 0.2, and 3/6 for θ = 0.3.
So f(y) = Σ_θ f(θ) f(y | θ) = (1/6)(0.1)² + (2/6)(0.2)² + (3/6)(0.3)² = 0.06.
So f(θ | y) = f(θ) f(y | θ)/f(y)
= (1/6)(0.1)²/0.06 = 0.02778 for θ = 0.1
= (2/6)(0.2)²/0.06 = 0.22222 for θ = 0.2
= (3/6)(0.3)²/0.06 = 0.75 for θ = 0.3.
Note: This result means that if the chosen die were to be tossed again a
large number of times (say 10,000) then there is a 75% chance that 6
would come up about 30% of the time, a 22.2% chance that 6 would
come up about 20% of the time, and a 2.8% chance that 6 would come
up about 10% of the time.
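A sketch of the same calculation in R:
thetav = c(0.1, 0.2, 0.3)            # probability of a 6 for the three kinds of dice
priorv = c(1, 2, 3)/6                # prior f(theta)
likev = dbinom(2, 2, thetav)         # f(y|theta) for y = 2 sixes in 2 rolls
sum(priorv*likev)                    # f(y) = 0.06
priorv*likev/sum(priorv*likev)       # posterior: 0.02778 0.22222 0.75000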
This method is to multiply the prior density (or the kernel of that density) by the likelihood function and try to identify the resulting function of θ as the density of a well-known or common distribution.
A die is chosen randomly from the six dice and rolled twice. On both
occasions, 6 comes up.
With y denoting the number of times 6 comes up, the Bayesian model may be written:
f(y | θ) = C(2, y) θ^y (1 − θ)^(2−y), y = 0, 1, 2
f(θ) = 10θ/6, θ = 0.1, 0.2, 0.3.
Note: 10θ/6 = 1/6, 2/6 and 3/6 for θ = 0.1, 0.2 and 0.3, respectively.
Hence f(θ | y) ∝ f(θ) f(y | θ)
= (10θ/6) × C(2, y) θ^y (1 − θ)^(2−y)
∝ θ × θ² since y = 2.
Thus f(θ | y) ∝ θ³ = 0.1³ = 1/1000 for θ = 0.1, 0.2³ = 8/1000 for θ = 0.2, and 0.3³ = 27/1000 for θ = 0.3; that is, f(θ | y) ∝ 1, 8 and 27 for θ = 0.1, 0.2 and 0.3, respectively.
Now, 1 + 8 + 27 = 36, and so
f(θ | y) = 1/36 = 0.02778 for θ = 0.1
= 8/36 = 0.22222 for θ = 0.2
= 27/36 = 0.75 for θ = 0.3,
which is the same result as obtained earlier in Exercise 1.4.
You are visiting a town with buses whose licence plates show their numbers consecutively from 1 up to however many there are. In your mind the number of buses could be anything from one to five, with all possibilities equally likely. Whilst touring the town you first happen to see Bus 3.
Assuming that at any point in time you are equally likely to see any of the buses in the town, how likely is it that the town has at least four buses?
Let θ be the number of buses in the town and let y be the number of the bus that you happen to first see. Then an appropriate Bayesian model is:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5 (prior).
Having seen Bus 3 (y = 3), the posterior is f(θ | y) ∝ f(θ) f(y | θ) ∝ 1/θ for θ = 3, 4, 5, which normalises to f(θ = 3 | y) = 20/47, f(θ = 4 | y) = 15/47 and f(θ = 5 | y) = 12/47.
So the posterior probability that the town has at least four buses is
P(θ ≥ 4 | y) = Σ_{θ ≥ 4} f(θ | y) = f(θ = 4 | y) + f(θ = 5 | y)
= 1 − f(θ = 3 | y) = 1 − 20/47 = 27/47 = 0.5745.
Discussion
So, under this alternative prior, the probability of there being at least four buses in the town (given that you have seen Bus 3) works out as
P(θ ≥ 4 | y) = 1 − P(θ = 3 | y) = 1 − 1/(9c) = 0.7187.
In each of nine indistinguishable boxes there are nine balls, the ith box
having i red balls and 9 − i white balls (i = 1,…,9).
One box is selected randomly from the nine, and then three balls are
chosen randomly from the selected box (without replacement and
without looking at the remaining balls in the box).
Exactly two of the three chosen balls are red. Find the probability that
the selected box has at least four red balls remaining in it.
In our case,
f(θ | y) ∝ θ!(9 − θ)! / [(θ − 2)!(9 − θ − (3 − 2))!], θ = 2, ..., 9 − (3 − 2),
or more simply,
f(θ | y) ∝ θ(θ − 1)(9 − θ), θ = 2, ..., 8.
Thus f(θ | y) ∝ k(θ), where k(θ) = 14, 36, 60, 80, 90, 84 and 56 for θ = 2, 3, 4, 5, 6, 7 and 8, respectively, and where
c ≡ Σ_{θ=2}^{8} k(θ) = 14 + 36 + … + 56 = 420.
So f(θ | y) = k(θ)/c
= 14/420 = 0.03333 for θ = 2
= 36/420 = 0.08571 for θ = 3
= 60/420 = 0.14286 for θ = 4
= 80/420 = 0.19048 for θ = 5
= 90/420 = 0.21429 for θ = 6
= 84/420 = 0.20000 for θ = 7
= 56/420 = 0.13333 for θ = 8.
The probability that the selected box has at least four red balls remaining
is the posterior probability that θ (the number of red balls initially in the
box) is at least 6 (since two red balls have already been taken out of the
box). So the required probability is
P(θ ≥ 6 | y) = (90 + 84 + 56)/420 = 23/42 = 0.5476.
23/42 # 0.5476
1-0.45238 # 0.5476 (alternative calculation of the required probability)
sum((kv/c)[tv>=6]) # 0.5476
# (yet another calculation of the required probability)
These three posteriors and the prior are illustrated in Figure 1.5.
X11(w=8,h=5); par(mfrow=c(1,1));
plot(c(0,1),c(0,3),type="n",xlab="theta",ylab="density")
lines(c(0,1),c(1,1),lty=1,lwd=3); tv=seq(0,1,0.01)  # prior: U(0,1) = Beta(1,1) density
lines(tv,3*(1-tv)^2,lty=2,lwd=3)     # Beta(1,3) posterior density
lines(tv,3*2*tv*(1-tv),lty=3,lwd=3)  # Beta(2,2) posterior density
lines(tv,3*tv^2,lty=4,lwd=3)         # Beta(3,1) posterior density
In contrast, for the ‘buses’ example further above (Exercise 1.6), which
involves the model:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5,
the quantity of interest θ represents the number of buses in a population
of buses, which of course is finite.
Noting that 0 < y < θ < 1, we see that the posterior density is
f(θ | y) = f(θ) f(y | θ)/f(y) = [1 × (1/θ)] / ∫_y^1 1 × (1/θ) dθ
= (1/θ)/(log 1 − log y) = −1/(θ log y), y < θ < 1.
1.10 Conjugacy
When the prior and posterior distributions are members of the same class
of distributions, we say that they form a conjugate pair, or that the prior
is conjugate. For example, consider the binomial-beta model:
(y | θ) ~ Binomial(n, θ)
θ ~ Beta(α, β) (prior)
⇒ (θ | y) ~ Beta(α + y, β + n − y) (posterior).
Since both prior and posterior are beta, the prior is conjugate.
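A quick numerical check of this conjugacy in R (a sketch; the values of n, y, α and β below are arbitrary illustrative choices, not taken from an exercise in the book):
alp = 2; bet = 3; n = 10; y = 4                         # hypothetical values
thetav = seq(0.001, 0.999, 0.001)
post = dbeta(thetav, alp, bet)*dbinom(y, n, thetav)     # prior x likelihood (unnormalised)
post = post/sum(post*0.001)                             # normalise numerically on the grid
max(abs(post - dbeta(thetav, alp+y, bet+n-y)))          # close to 0: posterior is Beta(alp+y, bet+n-y)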
Since both prior and posterior are gamma, the prior is conjugate.
or lim_{θ→m} f(θ | x) = sup_θ f(θ | x), or the set of all such values.
• The posterior median of θ is
Median(θ | y) = any value m of θ such that
P(θ ≤ m | y) ≥ 1/2 and P(θ ≥ m | y) ≥ 1/2,
or the set of all such values.
Note 1: In some cases, the posterior mean does not exist or it is equal to
infinity or minus infinity.
Note 2: Typically, the posterior mode and posterior median are unique.
The above definitions are given for completeness.
Figure 1.6 illustrates the idea of the HPDR. In the very common situation where θ is scalar, continuous and has a posterior density which is unimodal with no local modes (i.e. has the form of a single ‘mound’), the 1 − α HPDR takes on the form of a single interval defined by two points at which the posterior density has the same value. When the HPDR is a single interval, it is the shortest possible single interval over which the area under the posterior density is 1 − α.
Figure 1.7 illustrates the idea of the CPDR. One drawback of the CPDR
is that it is only defined for a scalar parameter. Another drawback is that
some values inside the CPDR may be less likely a posteriori than some
values outside it (which is not the case with the HPDR). For example, in
Figure 1.7, a value just below the upper bound of the 80% CPDR has a
smaller posterior density than a value just below the lower bound of that
CPDR. However, CPDRs are typically easier to calculate than HPDRs.
Other variations are possible (of the form [a,b) and (a,b]); but when the parameter of interest is continuous these definitions are all equivalent. Yet another definition of the 1 − α CPDR is any of the CPDRs as defined above but with all a posteriori impossible values of θ excluded.
We have a bent coin, for which θ, the probability of heads coming up, is unknown. Our prior beliefs regarding θ may be described by a standard uniform distribution. Thus no value of θ is deemed more or less likely than any other. The coin is tossed five times and heads comes up on every toss.
Find the posterior mean, mode and median of θ. Also find the 80% HPDR and CPDR for θ.
The posterior cdf is F(θ | y) = ∫_0^θ 6t⁵ dt = θ⁶, 0 < θ < 1.
Therefore:
E(θ | y) = 6/(6 + 1) = 6/7 = 0.8571
Mode(θ | y) = (6 − 1)/[(6 − 1) + (1 − 1)] = 1
Median(θ | y) = solution in θ of F(θ | y) = 1/2, i.e. θ⁶ = 0.5,
= (0.5)^(1/6) = 0.8909.
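The quantities plotted in the code below (postmean, postmode, postmedian, hpdr, cpdr) come from a part of the book not shown here; a sketch consistent with the Beta(6,1) posterior derived above is:
postmean = 6/7; postmode = 1
postmedian = qbeta(0.5, 6, 1)        # 0.8909 = 0.5^(1/6)
cpdr = qbeta(c(0.1, 0.9), 6, 1)      # 80% CPDR: (0.6813, 0.9826)
hpdr = c(0.2^(1/6), 1)               # 80% HPDR: (0.7647, 1), since the density 6*theta^5 is increasing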
points(cpdr,rep(0.4,2),pch=16); lines(cpdr,rep(0.4,2),lty=2,lwd=2)
abline(v=c(postmean,postmode,postmedian),lty=3)
abline(v=c(0,hpdr,cpdr),lty=3); abline(h=c(0,6),lty=3)
legend(0.2,5.8,c("posterior mean","posterior mode",
"posterior median"),pch=c(1,2,4))
legend(0.2,2.8,c("80% CPDR","80% HPDR"),lty=c(2,3),lwd=c(2,2))
Find the 90% HPDR and 90% CPDR for θ . Also find the 50% HPDR
and 50% CPDR for θ . For each region, calculate the associated exact
coverage probability.
The smallest set S such that P(θ ∈ S | y) ≥ 0.4 is {2} or {3}. With the additional requirement that f(θ₁ | y) ≥ f(θ₂ | y) whenever θ₁ ∈ S and θ₂ ∉ S, we see that S = {3} (only). That is, the 40% HPDR is the singleton set {3}.
Note: In the context where we toss a bent coin five times and get heads
every time (and the prior on the probability of heads is standard
uniform), the quantity ψ may be interpreted as the probability of the
next two tosses both coming up heads, or equivalently, as the proportion
of times heads will come up twice if the coin is repeatedly tossed in
groups of two tosses a hypothetically infinite number of times.
It follows that the posterior mean of ψ is
ψ̂ = E(ψ | y) = ∫_0^1 ψ (3ψ²) dψ = 3/4 = 0.75,
or
ψ̂ = E(θ² | y) = V(θ | y) + {E(θ | y)}²
= (6 × 1)/[(6 + 1)²(6 + 1 + 1)] + [6/(6 + 1)]² = 0.75.
Thus θ̂ = (1 − k)A + kB,
where A = α/(α + β), B = y/n, and k = n/(α + β + n).
(a) n = 5, y = 4, α = 2, β = 6
(b) n = 20, y = 16, α = 2, β = 6.
In both cases, the prior mean is the same (A = 2/(2 + 6) = 0.25), as is the
MLE (B = 4/5 = 16/20 = 0.8). However, due to n being larger in case (b)
(i.e. there being more direct data), case (b) leads to a larger credibility
factor (0.714 compared to 0.385) and hence a posterior mean closer to
the MLE (0.643 compared to 0.462).
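A sketch in R reproducing the two cases:
credest = function(n, y, alp, bet){
  A = alp/(alp + bet); B = y/n; k = n/(alp + bet + n)   # prior mean, MLE, credibility factor
  c(k = k, postmean = (1 - k)*A + k*B) }
credest(n=5, y=4, alp=2, bet=6)      # k = 0.385, posterior mean = 0.462
credest(n=20, y=16, alp=2, bet=6)    # k = 0.714, posterior mean = 0.643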
Note: Each likelihood function in Figure 1.9 has been normalised so that the area underneath it is exactly 1. This means that in each case (a) and (b), the likelihood function L(θ) as shown is identical to the posterior density which would be implied by the standard uniform prior, i.e. under f(θ) = f_U(0,1)(θ) = f_Beta(1,1)(θ). Thus, L(θ) = f_Beta(1+y, 1+n−y)(θ).
X11(w=8,h=7); par(mfrow=c(2,1))
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3),
cex=rep(1.5,3),lwd=2); text(0,2.5,"(a)",cex=1.5)
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.4615385
n/(alp+bet+n) # 0.3846154
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3),
cex=rep(1.5,3),lwd=2); text(0,4.5,"(b)",cex=1.5)
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.6428571
n/(alp+bet+n) # 0.7142857
Now, the prior mode of θ is Mode(θ) = (α − 1)/[(α − 1) + (β − 1)] = (α − 1)/(α + β − 2).
So we write
Mode(θ | y) = (α − 1)/(α + β − 2 + n) + y/(α + β − 2 + n)
= [(α + β − 2)/(α + β − 2 + n)] × (α − 1)/(α + β − 2) + [n/(α + β − 2 + n)] × (y/n).
Find the posterior distribution of μ given data in the form of the vector y = (y₁, ..., yₙ).
f(μ | y) ∝ exp{ −(1/2)[ (1/σ₀²)(μ² − 2μμ₀ + μ₀²) + (1/σ²)(Σᵢ yᵢ² − 2nȳμ + nμ²) ] },   (1.1)
where ȳ = (y₁ + ... + yₙ)/n is the sample mean.
It remains to find the normal mean and variance parameters, μ* and σ*². (These must be functions of the known quantities n, ȳ, σ, μ₀ and σ₀.)
Writing the exponent in (1.1) in the form −(1/2)(aμ² − 2bμ + c), we have σ*² = 1/a and μ* = b/a.
μ* = b/a = [μ₀/σ₀² + nȳ/σ²] / [1/σ₀² + n/σ²] = (σ²μ₀ + nσ₀²ȳ)/(σ² + nσ₀²).   (1.4)
Note 3: Since both prior and posterior are normal, the prior is
conjugate.
Note 4: The posterior mean, mode and median of μ are the same and equal to μ*. The 1 − α CPDR and 1 − α HPDR for μ are the same and equal to (μ* ± z_{α/2} σ*).
That is, if we know only the sample mean ȳ, the posterior distribution of μ is the same as if we know y, i.e. all n sample values. Knowing the individual yᵢ values makes no difference to the inference.
That is, if the prior information is very ‘precise’ or ‘definite’, the data has little influence on the posterior. So the posterior is approximately equal to the prior; i.e. f(μ | y) ≈ f(μ), or equivalently, (μ | y) ~ μ approximately. In this case the posterior mean, mode and median of μ are approximately equal to μ₀. Also, the 1 − α CPDR and 1 − α HPDR for μ are approximately equal to (μ₀ ± z_{α/2} σ₀).
So, in this case, just as when σ₀ is large, the prior distribution has very little influence on the posterior, and the ensuing inference is almost the same as that implied by the classical approach.
Create a graph which shows these estimates as well as the prior density,
prior mean, likelihood, MLE and posterior density.
Here:
n = 3
ȳ = (8.4 + 10.1 + 9.4)/3 = 9.3
k = 1/[1 + (1²/3)/(1/2)²] = 3/7 = 0.4285714
μ* = (1 − 3/7) × 5 + (3/7) × 9.3 = 6.8428571
σ*² = (3/7) × (1²/3) = 1/7 = 0.1428571.
Figure 1.10 shows the various densities and estimates here, as well as the
normalised likelihood. Note that the likelihood function as shown is also
the posterior density if the prior is taken to be uniform over the whole
real line, i.e. µ ~ U (−∞, ∞) .
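A sketch in R reproducing these values and the 95% CPDR marked in Figure 1.10 (the N(5, (1/2)²) prior and known standard deviation σ = 1 are those used in the working above):
y = c(8.4, 10.1, 9.4); n = length(y); ybar = mean(y)  # ybar = 9.3
sig = 1; mu0 = 5; sig0 = 1/2                          # known sd and prior parameters
k = n/(n + sig^2/sig0^2)                              # 3/7 = 0.4285714
mus = (1 - k)*mu0 + k*ybar                            # 6.8428571
sigs2 = k*sig^2/n                                     # 1/7 = 0.1428571
cpdr = mus + c(-1, 1)*qnorm(0.975)*sqrt(sigs2)        # 95% CPDR: (6.102, 7.584)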
Discussion
Note that the posteriors in Figures 1.12 and 1.13 have the same mean but
different variances.
plot(c(0,11),c(-0.1,1.3),type="n",xlab="",ylab="density/likelihood")
lines(muv,prior,lty=1,lwd=2); lines(muv,like,lty=2,lwd=2)
lines(muv,post,lty=3,lwd=2)
points(c(mu0,ybar,mus),c(0,0,0),pch=c(1,2,4),cex=rep(1.5,3),lwd=2)
points(cpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,1.3,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(0,0.7,c("Prior mean","Sample mean (MLE)","Posterior mean",
"95% CPDR bounds"), pch=c(1,2,4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
text(10.8,-0.075,"m", vfont=c("serif symbol","italic"), cex=1.5)
f(λ | y) ∝ f(λ) f(y | λ)
∝ λ^(α−1) e^(−βλ) × λ^(n/2) exp{ −(λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² }
= λ^(a−1) e^(−bλ) for some a and b.
We see that
(λ | y) ~ G(a, b),
where: a = α + n/2
b = β + (n/2) sμ²
sμ² = (1/n) Σᵢ₌₁ⁿ (yᵢ − μ)².
So the 1 − A CPDR for u is ( χ²_{1−A/2}(2α + n), χ²_{A/2}(2α + n) ).
So the 1 − A CPDR for λ = u/(2β + nsμ²) is
( χ²_{1−A/2}(2α + n)/(2β + nsμ²), χ²_{A/2}(2α + n)/(2β + nsμ²) ).
So
(y₁ − μ)²/σ², ..., (yₙ − μ)²/σ² ~ iid χ²(1).
So
Σᵢ₌₁ⁿ [(yᵢ − μ)/σ]² = nsμ²/σ² ~ χ²(n).
So
1 − A = P( χ²_{1−A/2}(n) < nsμ²/σ² < χ²_{A/2}(n) )
= P( nsμ²/χ²_{A/2}(n) < σ² < nsμ²/χ²_{1−A/2}(n) ).
Observe that Eλ = ε/ε = 1 for all ε, and Vλ = ε/ε² = 1/ε → ∞ as ε → 0.
The 1 − A CPDR for σ² = 1/λ is ( (2β + nsμ²)/χ²_{A/2}(2α + n), (2β + nsμ²)/χ²_{1−A/2}(2α + n) ).
(a) Calculate the posterior mean, mode and median of the model precision λ. Also calculate the 95% CPDR for λ. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(b) Calculate the posterior mean, mode and median of the model variance σ² = 1/λ. Also calculate the 95% CPDR for σ². Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(c) Calculate the posterior mean, mode and median of the model standard deviation σ. Also calculate the 95% CPDR for σ. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
(d) Examine each of the point estimates in (a), (b) and (c) and determine
which ones, if any, can be easily expressed in the form of a credibility
estimate.
So:
• the posterior mean of λ is E(λ | y) = a/b = 0.8547
• the posterior mode is Mode(λ | y) = (a − 1)/b = 0.6648
• the posterior median is the 0.5 quantile of the G(a,b) distribution and works out as Median(λ | y) = 0.7923 (as obtained using the qgamma() function in R; see below)
• the 95% CPDR for λ is (0.2564, 1.8065) (where the bounds are the 0.025 and 0.975 quantiles of the G(a,b) distribution).
Also:
• the prior mean is Eλ = α/β = 1.5
• the prior mode is Mode(λ) = (α − 1)/β = 1
• the prior median is Median(λ) = 1.3370
• the MLE of λ is λ̂ = 1/sμ² = 0.4594.
Figure 1.14 shows the various densities and estimates here, as well as the
normalised likelihood function.
f(σ²) = [β^α/Γ(α)] [(σ²)^(−1)]^(α−1) e^(−β(σ²)^(−1)) (σ²)^(−2)
= [β^α/Γ(α)] (σ²)^(−(α+1)) e^(−β/σ²), σ² > 0.   (1.6)
Figure 1.15 shows the various densities and estimates here, as well as the
normalised likelihood function.
We find that:
• the prior mean of σ is
E(σ) = E(λ^(−1/2)) = ∫_0^∞ λ^(−1/2) [β^α λ^(α−1) e^(−βλ)/Γ(α)] dλ
= [β^α Γ(α − 1/2) / (Γ(α) β^(α−1/2))] ∫_0^∞ [β^(α−1/2) λ^((α−1/2)−1) e^(−βλ)/Γ(α − 1/2)] dλ
= β^(1/2) Γ(α − 1/2)/Γ(α) = 0.9400
• the prior mode of σ is Mode(σ) = √(2β/(2α + 1)) = 0.7559
(obtained by setting the derivative of the logarithm of (1.7) to zero, where that derivative is derived as follows:
l(σ) = log f(σ) = −(2α + 1) log σ − β/σ² + constant
l′(σ) = −(2α + 1)/σ + 2β/σ³, which set to 0 yields σ² = 2β/(2α + 1))
• the prior median of σ is Median(σ) = √Median(σ²) = 0.8648
• the MLE of σ is σ̂ = √(sμ²) = 1.4754 (which is biased).
By analogy with the above, f(σ | y) = [2b^a/Γ(a)] σ^(−(2a+1)) e^(−b/σ²), σ > 0.
So we find that:
• the posterior mean of σ is E(σ | y) = b^(1/2) Γ(a − 1/2)/Γ(a) = 1.1836
• the posterior mode is Mode(σ | y) = √(2b/(2a + 1)) = 1.0262
Figure 1.16 shows the various densities and estimates here, as well as the
normalised likelihood function.
Likewise,
Mode(σ² | y) = b/(a + 1) = [β + (n/2)sμ²] / [(α + n/2) + 1] = (2β + nsμ²)/(2α + n + 2)
= [n/(n + 2α + 2)] sμ² + [(2α + 2)/(n + 2α + 2)] × [2β/(2α + 2)],
where 2β/(2α + 2) = Mode(σ²), so that
Mode(σ² | y) = [n/(n + 2α + 2)] sμ² + [1 − n/(n + 2α + 2)] Mode(σ²).
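The quantities referenced in the code below (alp, bet, a, b, the data and the various point estimates) are defined in a part of the book not shown here; the following sketch uses assumed values consistent with the printed output, so that the code can be run:
alp = 3; bet = 2                               # assumed prior: lambda ~ G(3, 2), so prior mean 1.5, mode 1
y = c(8.4, 10.1, 9.4); n = length(y); mu = 8   # assumed data and known mean
sigmu2 = mean((y - mu)^2)                      # 2.1767, so the MLE of lambda is 0.4594
a = alp + n/2; b = bet + n*sigmu2/2            # posterior: (lambda | y) ~ G(a, b)
lampriormean = alp/bet; lampriormode = (alp - 1)/bet; lampriormedian = qgamma(0.5, alp, bet)
lamlikemode = 1/sigmu2                         # MLE of lambda
lampostmean = a/b; lampostmode = (a - 1)/b; lampostmedian = qgamma(0.5, a, b)
lamcpdr = qgamma(c(0.025, 0.975), a, b)        # 95% CPDR for lambda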
c(lampriormean,lamlikemode,lampriormode,lampriormedian,
lampostmode,lampostmedian, lampostmean,lamcpdr)
# 1.5000 0.4594 1.0000 1.3370 0.6648 0.7923 0.8547 0.2564 1.8065
lamv=seq(0,5,0.01); prior=dgamma(lamv,alp,bet)
post=dgamma(lamv,a,b); like=dgamma(lamv,a-alp+1,b-bet+0)
X11(w=8,h=4); par(mfrow=c(1,1))
plot(c(0,5),c(0,1.9),type="n",
main="Inference on the model precision parameter",
xlab="lambda",ylab="density/likelihood")
lines(lamv,prior,lty=1,lwd=2); lines(lamv,like,lty=2,lwd=2);
lines(lamv,post,lty=3,lwd=2)
points(c(lampriormean,lampriormode, lampriormedian,
lamlikemode,lampostmode,lampostmedian,lampostmean),
rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(lamcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,1.9,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(3,1.9,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(3,1,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
sig2v=seq(0.01,10,0.01); prior=dgamma(1/sig2v,alp,bet)/sig2v^2
post=dgamma(1/sig2v,a,b)/sig2v^2;
like=dgamma(1/sig2v,a-alp-1,b-bet+0)/sig2v^2
plot(c(0,10),c(0,1.2),type="n",
main="Inference on the model variance parameter",
xlab="sigma^2 = 1/lambda",ylab="density/likelihood")
lines(sig2v,prior,lty=1,lwd=2); lines(sig2v,like,lty=2,lwd=2)
lines(sig2v,post,lty=3,lwd=2)
legend(1.8,1.2,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(7,1.2,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(6,0.65,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
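Similarly, the remaining quantities used below are not shown in this excerpt; they follow from the λ-scale results by the monotone transformation σ² = 1/λ, and a consistent sketch is:
sig2priormedian = 1/lampriormedian   # median of sigma^2 = 1/median of lambda
sig2postmedian = 1/lampostmedian
sig2cpdr = rev(1/lamcpdr)            # 95% CPDR for sigma^2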
sigpriormean=sqrt(bet)*gamma(alp-1/2)/gamma(alp);
siglikemode=sqrt(sigmu2); sigpriormode=sqrt(2*bet/(2*alp+1))
sigpostmean= sqrt(b)*gamma(a-1/2)/gamma(a)
sigpostmode= sqrt(2*b/(2*a+1)); sigpostmedian=sqrt(sig2postmedian)
sigcpdr=sqrt(sig2cpdr); sigpriormedian= sqrt(sig2priormedian)
sigv=seq(0.01,3,0.01); prior=dgamma(1/sigv^2,alp,bet)*2/sigv^3
post=dgamma(1/sigv^2,a,b)*2/sigv^3;
like=dgamma(1/sigv^2,a-alp-1/2,b-bet+0)*2/sigv^3
plot(c(0,2.5),c(0,4.1),type="n",
main="Inference on the model standard deviation parameter",
xlab="sigma = 1/sqrt(lambda)",ylab="density/likelihood")
lines(sigv,prior,lty=1,lwd=2)
lines(sigv,like,lty=2,lwd=2)
lines(sigv,post,lty=3,lwd=2)
points(c(sigpriormean, sigpriormode, sigpriormedian, siglikemode,
sigpostmode, sigpostmedian,sigpostmean),
rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(sigcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,4.1,
c("Prior density","Likelihood function (normalised)","Posterior density"),
lty=c(1,2,3),lwd=c(2,2,2))
legend(1.7,4.1,c("Prior mode, median\n & mean (left to right)",
"MLE"), pch=c(1,2),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
legend(1.7,2.3,c("Posterior mode, median\n & mean (left to right)",
"95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
CHAPTER 2
Bayesian Basics Part 2
2.1 Frequentist characteristics of Bayesian
estimators
Consider a Bayesian model defined by a likelihood f(y | θ) and a prior f(θ), leading to the posterior
f(θ | y) = f(θ) f(y | θ) / f(y).
Then θ̂, I, L and U are functions of the data y and may be written θ̂(y), I(y), L(y) and U(y). Once these functions are defined, the estimates which they define stand on their own, so to speak, and may be studied from many different perspectives.
As we noted earlier, these estimates are exactly the same as the usual
estimates used in the context of the corresponding classical model,
Work out general formulae for the frequentist and relative bias of the posterior mean of μ, and for the frequentist coverage probability of the 1 − α HPDR for μ.
Recall that
(μ | y) ~ N(μ*, σ*²),
where:
μ* = (1 − k)μ₀ + kȳ is μ’s posterior mean
σ*² = kσ²/n is μ’s posterior variance
k = n/(n + σ²/σ₀²) is a credibility factor.
Figures 2.1, 2.2 and 2.3 (pages 66 and 67) show Bµ , Rµ and Cµ for
selected values of σ 0 , with n = 10 , µ0 = 1 , σ = 1 and α = 0.05 in each
case. The strength of the prior belief is represented by σ 0 , with large
values of this parameter indicating relative ignorance.
In Figure 2.1, we see that, for any given value of μ, the frequentist bias Bμ of the posterior mean μ* = E(μ | y) converges to zero as the prior belief tends to total ignorance, that is, in the limit as σ₀ → ∞. Also, Bμ → μ₀ − μ as σ₀ → 0, the extreme case of ‘absolute’ prior belief that μ = μ₀.
Note: One of the thin dotted guidelines in Figure 2.1 shows the function Bμ = μ₀ − μ in this latter extreme case of ‘absolute’ prior belief that μ = μ₀. In all of the examples, μ₀ = 1.
In Figure 2.2, we see that, for any given value of µ , the frequentist
relative bias Rµ of the posterior mean µ* = E ( µ | y ) converges to zero
as σ 0 → ∞ . Also, Rµ → ( µ0 / µ ) − 1 as σ 0 → 0 .
Note: The curved thin dotted guidelines in Figure 2.2 show the function Rμ = (μ₀/μ) − 1 in this latter extreme case of ‘absolute’ prior belief that μ = μ₀.
In Figure 2.3, we see that, for any given value of μ, the frequentist coverage probability Cμ of the 1 − α (i.e. 0.95 or 95%) HPDR, namely (μ* ± z_{α/2}σ*), converges to 1 − α as σ₀ → ∞.
Note: In Figure 2.3, the thin dotted horizontal guidelines show the
values 0, 0.95 and 1.
biasfun = function(mu,n,sig,mu0,sig0){
k = n/(n+(sig/sig0)^2)
(1-k)*mu0-mu*(1-k) }
coverfun = function(mu,n,sig,mu0,sig0,alp=0.05){
k = n/(n + (sig/sig0)^2)
sigstar = sig*sqrt(k/n); z=qnorm(1-alp/2)
a= ( mu-(1-k)*mu0-z*sigstar ) / k
b= ( mu-(1-k)*mu0+z*sigstar ) / k
u= pnorm((b-mu)/(sig/sqrt(n)))
l= pnorm((a-mu)/(sig/sqrt(n)))
u-l }
X11(w=8,h=5.5); par(mfrow=c(1,1))
muvec=seq(-5,5,0.01); mu0=1; sig=1; n=10; sig0v=c(0.1,0.2,0.5,1)
plot(c(-2,2),c(-1,3),type="n",xlab="mu",ylab="",main=" ")
abline(1,-1,lty=3); abline(v=0,lty=3); abline(h=0,lty=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),
lty=1,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[2]),
lty=2,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[3]),
lty=3,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig, mu0=mu0,sig0=sig0v[4]),
lty=4,lwd=3)
legend(1,2.8,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
plot(c(-2,2),c(-2,4),type="n",xlab="mu",ylab="",main=" ")
abline(v=0,lty=3); abline(h=0,lty=3); lines(muvec, mu0/muvec-1,lty=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[1])/muvec,
lty=1,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[2])/muvec,
lty=2,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[3])/muvec,
lty=3,lwd=3)
lines(muvec, biasfun(mu=muvec,n=n,sig=sig,mu0=mu0, sig0=sig0v[4])/muvec,
lty=4,lwd=3)
legend(-2,4,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
plot(c(-1,3),c(0,1),type="n",xlab="mu",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),
lty=1,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[2]),
lty=2,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[3]),
lty=3,lwd=3)
lines(muvec, coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[4]),
lty=4,lwd=3)
legend(-0.55,0.6,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
lty=1:4,lwd=rep(3,4))
(a) Work out general formulae for the frequentist bias and relative bias of the posterior mean of σ² = 1/λ, and for the frequentist coverage probability of the 1 − α CPDR for σ².
(b) Attempt to find a single prior under this model (that is, a single suitable pair of values α, β) which results in both:
(i) a Bayesian posterior mean of σ² that is unbiased (in the frequentist sense) for all possible values of σ²; and
(ii) a CPDR for σ² that has frequentist coverage probabilities exactly equal to the desired coverage for all possible values of σ².
Thus, σ̂² = E(σ² | y) = [β + (n/2)sμ²] / [(α + n/2) − 1] = (2β + nsμ²)/(2α + n − 2).
Also, nsμ²/σ² = Σᵢ₌₁ⁿ [(yᵢ − μ)/σ]² ~ χ²(n) (with mean n).
= P( u − 2β/σ² < nsμ²/σ² < v − 2β/σ² | σ² )
= F_{χ²(n)}(v − 2β/σ²) − F_{χ²(n)}(u − 2β/σ²).
Figures 2.4, 2.5 and 2.6 (pages 72 and 73) show Bσ 2 , Rσ 2 and Cσ 2 for
selected values of α and β , with n = 10 and A = 0.05 in each case.
So an unbiased estimate of σ² is obtained by taking α = 1 and β = 0, in which case
σ̂² = (2β + nsμ²)/(2α + n − 2) = (0 + nsμ²)/(2 + n − 2) = sμ² (i.e. the MLE).
coverfun = function(sig2,n=10,alp=0,bet=0,A=0.05){
u = qchisq(A/2,2*alp+n); v = qchisq(1-A/2,2*alp+n)
pchisq(v-2*bet/sig2, n) - pchisq(u-2*bet/sig2, n) }
X11(w=8,h=5.5); par(mfrow=c(1,1))
sig2vec=seq(0.01,5,0.01); n=10; alpv=c(0.1,1,5); betv=c(0.1,1,5)
plot(c(0,5),c(-2,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=0,bet=0), lty=1,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=0,bet=1), lty=2,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=1,bet=0), lty=3,lwd=3)
lines(sig2vec,biasfun(sig2=sig2vec,alp=1,bet=1), lty=4,lwd=3)
legend(0,-0.5,c("alp=0, bet=0","alp=0, bet=1","alp=1, bet=0","alp=1, bet=1"),
lty=1:4,lwd=rep(3,4))
plot(c(0,3),c(-1,6),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3); abline(v=0,lty=3)
plot(c(0,2),c(0,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
where each f m ( x) is a proper density and the cm values are positive and
sum to 1.
It can be shown (see Exercise 2.3 below) that if each component prior
f m (θ ) is conjugate then f (θ ) is also conjugate. This means that θ ’s
posterior distribution is also a mixture with density of the form
f(θ | y) = Σ_{m=1}^{M} c′_m f_m(θ | y),   (2.1)
Also calculate the prior mean of θ, the posterior mean of θ and the MLE of θ. Then mark these three points in the figure.
(b) Show that any mixture of conjugate priors is also conjugate and
derive a general formula which could be used to calculate the mixture
weights cm′ in (2.1) above.
f(θ | y) ∝ f(θ) f(y | θ)
∝ k × [B(a₁ + y, b₁ + n − y)/B(a₁, b₁)] × θ^(a₁+y−1)(1 − θ)^(b₁+n−y−1)/B(a₁ + y, b₁ + n − y)
+ (1 − k) × [B(a₂ + y, b₂ + n − y)/B(a₂, b₂)] × θ^(a₂+y−1)(1 − θ)^(b₂+n−y−1)/B(a₂ + y, b₂ + n − y).
Thus
f(θ | y) ∝ c₁ f₁(θ | y) + c₂ f₂(θ | y),
where:
c₁ = k B(a₁ + y, b₁ + n − y)/B(a₁, b₁)
c₂ = (1 − k) B(a₂ + y, b₂ + n − y)/B(a₂, b₂).
Now,
∫ f(θ | y) dθ = 1,
and so
f(θ | y) = c f_Beta(a₁+y, b₁+n−y)(θ) + (1 − c) f_Beta(a₂+y, b₂+n−y)(θ),
where
c = c₁/(c₁ + c₂).
We see that the prior f(θ) and posterior f(θ | y) are in the same family, namely the family of mixtures of two beta distributions. Therefore the mixture prior is conjugate.
Figure 2.7 shows the prior density f(θ), the likelihood function L(θ), and the posterior density f(θ | y), as well as the prior mean, the MLE and the posterior mean.
Note: The likelihood function in Figure 2.7 has been normalised so that the area underneath it is exactly 1. This means that this likelihood function is identical to the posterior density under the standard uniform prior, i.e. under f(θ) = f_U(0,1)(θ) = f_Beta(1,1)(θ). Thus, L(θ) = f_Beta(1+y, 1+n−y)(θ).
Figure 2.7 also shows the two component prior densities and the two
component posterior densities. It may be observed that, whereas the
lower component prior has the highest weight, 0.8, the opposite is the
case regarding the component posteriors. For these, the weight
associated with the lower posterior is only 0.2583. This is because the
inference is being ‘pulled up’ in the direction of the likelihood (with the
posterior mean being between the prior mean and the MLE, 0.8).
It follows that
f(θ | y) = Σ_{m=1}^{M} c′_m f_m(θ | y),
c1=k*beta(a1+y,b1+n-y)/beta(a1,b1); c2=(1-k)*beta(a2+y,b2+n-y)/beta(a2,b2)
c=c1/(c1+c2); post=c*post1 + (1-c)*post2; options(digits=4); c # 0.2583
like=dbeta(thetav,1+y,1+n-y) # likelihood = post. under U(0,1)=beta(1,1) prior
X11(w=8,h=5.5)
plot(c(0,1),c(0,8),type="n",xlab="theta",ylab="density/likelihood")
lines(thetav,prior,lty=1,lwd=4)
lines(thetav,like,lty=2,lwd=4)
lines(thetav,post,lty=3,lwd=4)
legend(0,8,c("Prior","Likelihood","Posterior"),lty=c(1,2,3),lwd=c(4,4,4))
lines(thetav,prior1,lty=1,lwd=2)
lines(thetav,prior2,lty=1,lwd=2)
lines(thetav,post1,lty=3,lwd=2)
lines(thetav,post2,lty=3,lwd=2)
legend(0.3,8,c("Component priors","Component posteriors"),
lty=c(1,3),lwd=c(2,2))
mle=y/n; priormean=k*a1/(a1+b1)+(1-k)*a2/(a2+b2)
postmean=c*(a1+y)/(a1+b1+n) + (1-c)*(a2+y)/(a2+b2+n)
points(c(priormean,mle,postmean),c(0,0,0),pch=c(1,2,4),cex=c(1.5,1.5,1.5),
lwd=c(2,2,2))
c(priormean,mle,postmean) # 0.3068 0.8000 0.4772
legend(0.7,8,c(" Prior mean"," MLE"," Posterior mean"),
pch=c(1,2,4),pt.cex=c(1.5,1.5,1.5),pt.lwd=c(2,2,2))
Unlike for the normal-normal and normal-gamma models, more than one
uninformative prior specification has been proposed as reasonable in the
context of the binomial-beta model.
This reduces to the MLE y/n under the Haldane prior but not under the
Bayes prior. In contrast, the Bayes prior leads to a posterior mode which
is equal to the MLE.
No such problems occur using the Bayes prior. This is because that prior
is proper and so cannot lead to an improper posterior, whatever the data
may be. Interestingly, there is a third choice which provides a kind of
compromise between the Bayes and Haldane priors, as described below.
f(φ) ∝ √[ I(θ) (∂θ/∂φ)² ], where
I(θ) (∂θ/∂φ)² = E{ [∂/∂θ log f(y | θ)]² | θ } × (∂θ/∂φ)²
= E{ [ (∂/∂θ log f(y | θ)) × (∂θ/∂φ) ]² | θ }
= E{ [∂/∂φ log f(y | φ)]² | φ }
= I(φ).
Here: f(y | μ) ∝ Πᵢ₌₁ⁿ exp{ −(1/(2σ²))(yᵢ − μ)² } = exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² }
log f(y | μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² + c (where c is a constant)
∂/∂μ log f(y | μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ 2(yᵢ − μ)(−1) = (n/σ²)(ȳ − μ)
[∂/∂μ log f(y | μ)]² = (n²/σ⁴)(ȳ − μ)².
I(μ) = E{ [∂/∂μ log f(y | μ)]² | μ } = E{ (n²/σ⁴)(ȳ − μ)² | μ }
= (n²/σ⁴) V(ȳ | μ) = (n²/σ⁴)(σ²/n) = n/σ².
It follows that the Jeffreys prior is f(μ) ∝ √I(μ) = √(n/σ²) ∝ 1, μ ∈ ℝ.
Here: f(y | λ) ∝ Πᵢ₌₁ⁿ λ^(1/2) exp{ −(λ/2)(yᵢ − μ)² } = λ^(n/2) exp{ −(λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² }
log f(y | λ) = (n/2) log λ − (λ/2) Σᵢ₌₁ⁿ (yᵢ − μ)² + c (where c is a constant)
∂ log f(y | λ)/∂λ = n/(2λ) − (1/2) Σᵢ₌₁ⁿ (yᵢ − μ)²,   ∂² log f(y | λ)/∂λ² = −n/(2λ²).
So the Jeffreys prior is f(λ) ∝ √I(λ) = √(n/(2λ²)) ∝ 1/λ, λ > 0.
Note 2: Another way to obtain the Fisher information is to first write
∂ log f(y | λ)/∂λ = n/(2λ) − (1/(2λ)) λ Σᵢ₌₁ⁿ (yᵢ − μ)² = (1/(2λ))(n − q),
where: q = Σᵢ₌₁ⁿ [(yᵢ − μ)/(1/√λ)]², (q | λ) ~ χ²(n), E(q | λ) = n, V(q | λ) = 2n.
We may then write [∂ log f(y | λ)/∂λ]² = (1/(4λ²))(n² − 2nq + q²),
and so the Fisher information is
I(λ) = E{ [∂ log f(y | λ)/∂λ]² | λ }
= (1/(4λ²)){ n² − 2nE(q | λ) + E(q² | λ) }
= (1/(4λ²)){ n² − 2n·n + (2n + n²) } = n/(2λ²).
Here: f(y | θ) = C(n, y) θ^y (1 − θ)^(n−y)
log f(y | θ) = log C(n, y) + y log θ + (n − y) log(1 − θ)
∂/∂θ log f(y | θ) = 0 + yθ^(−1) − (n − y)(1 − θ)^(−1)
∂²/∂θ² log f(y | θ) = −yθ^(−2) − (n − y)(1 − θ)^(−2).
= n(1/θ) + n/(1 − θ) = n(1 − θ + θ)/[θ(1 − θ)] = n/[θ(1 − θ)].
So the Jeffreys prior is f(θ) ∝ √I(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), which is the kernel of the Beta(1/2, 1/2) density.
Here,
f(y | θ) = 1/θ = θ^(−1)
⇒ log f(y | θ) = −log θ
⇒ ∂/∂θ log f(y | θ) = −1/θ
⇒ [∂/∂θ log f(y | θ)]² = 1/θ²
⇒ I(θ) = E{ [∂/∂θ log f(y | θ)]² | θ } = 1/θ².
The loss function L represents the cost incurred when the true value θ is estimated by θ̂ and usually satisfies the property L(θ, θ) = 0.
The three most commonly used loss functions are defined as follows:
L(θ̂, θ) = |θ̂ − θ|, the absolute error loss function (AELF)
L(θ̂, θ) = (θ̂ − θ)², the quadratic error loss function (QELF)
L(θ̂, θ) = I(θ̂ ≠ θ), which equals 0 if θ̂ = θ and 1 if θ̂ ≠ θ, the indicator error loss function (IELF), also known as the zero-one loss function (ZOLF) or the all-or-nothing error loss function (ANLF).
Figures 2.8 and 2.9 illustrate these three basic loss functions.
To obtain the overall expected loss we need to average the risk function over all possible values of θ. This overall expected loss is called the Bayes risk and may be defined as
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | θ}] = E R(θ) = ∫ R(θ) f(θ) dθ.
For each of the following estimators, derive a formula for the risk function under the quadratic error loss function:
(a) μ̂ = ȳ = (1/n)(y₁ + ... + yₙ) (the sample mean)
(b) μ̂ = |ȳ| (the absolute value of the sample mean).
In each case, use the derived risk function to determine the Bayes risk.
R(μ) = E{ (|ȳ| − μ)² | μ } = E{ ȳ² − 2|ȳ|μ + μ² | μ }
= E(ȳ² | μ) − 2μ E(|ȳ| | μ) + μ²
= σ²/n + 2μ² − 2mμ, where m = E(|ȳ| | μ).
Now,
m = ∫_{−∞}^{0} (−ȳ) f(ȳ | μ) dȳ + ∫_{0}^{∞} ȳ f(ȳ | μ) dȳ
= 2∫_{0}^{∞} ȳ f(ȳ | μ) dȳ − ∫_{−∞}^{∞} ȳ f(ȳ | μ) dȳ
= 2I − μ, where I = ∫_{0}^{∞} ȳ f(ȳ | μ) dȳ.
Here,
I = ∫_{−μ/c}^{∞} (μ + cz) φ(z) dz after putting z = (ȳ − μ)/(σ/√n), with c = σ/√n
= μ[1 − Φ(−μ/c)] + cJ, where J = ∫_{−μ/c}^{∞} z φ(z) dz.
Note: Here, φ(z) = (1/√(2π)) e^(−z²/2) and Φ(z) = ∫_{−∞}^{z} φ(t) dt are the standard normal pdf and cdf, respectively.
Now, J = ∫_{−μ/c}^{∞} z (1/√(2π)) e^(−z²/2) dz
= ∫_{μ²/(2c²)}^{∞} (1/√(2π)) e^(−w) dw after substituting w = z²/2
= (1/√(2π)) e^(−μ²/(2c²)) = φ(μ/c).
Hence I = μΦ(μ/c) + cJ = μΦ(μ/c) + cφ(μ/c),
and so m = 2I − μ = 2μΦ(μ/c) + 2cφ(μ/c) − μ.
Therefore
R(μ) = σ²/n + 2μ² − 2mμ = σ²/n + 2μ² − 2μ[ 2μΦ(μ/c) + 2cφ(μ/c) − μ ].
Thereby we obtain:
R(μ) = σ²/n + 4μ²Φ(−μ/(σ/√n)) − 4μ(σ/√n)φ(μ/(σ/√n)), μ ∈ ℝ.
r = E R(μ) = ∫ R(μ) f(μ) dμ = ∫ g(μ) dμ,
where
g(μ) = [ σ²/n + 4μ²Φ(−μ/(σ/√n)) − 4μ(σ/√n)φ(μ/(σ/√n)) ] × (1/σ₀)φ((μ − μ₀)/σ₀).
We see that the Bayes risk r is an intractable integral equal to the area under the integrand, g(μ) = R(μ) f(μ). However, this area can be evaluated numerically (using techniques discussed later). Figures 2.11 and 2.12 show examples of the risk function R(μ) and the integrand function g(μ). For the case n = σ = μ₀ = σ₀ = 1, we find that r = 1.16.
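The risk function derived above can be coded directly and the Bayes risk evaluated by numerical integration. The following is a sketch of the function Rfun assumed by the plotting code below:
Rfun = function(mu, sig=1, n=1){ cee = sig/sqrt(n)
  sig^2/n + 4*mu^2*pnorm(-mu/cee) - 4*mu*cee*dnorm(mu/cee) }   # R(mu) as derived above
integrate(function(mu){ Rfun(mu)*dnorm(mu, mean=1, sd=1) }, -Inf, Inf)$value
# approx 1.16, the Bayes risk for n = sig = mu0 = sig0 = 1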
X11(w=8,h=5.5); par(mfrow=c(1,1));
plot(c(-0.5,4),c(0,3),type="n",xlab="mu",ylab="R(mu)",main=" ")
Ifun = function(mu,sig,n,mu0,sig0){
Rfun(mu=mu,sig=sig,n=n)*dnorm(mu,mu0,sig0) }
Then, just as the risk function can be used to compute the Bayes risk according to
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | θ}] = E R(θ) = ∫ R(θ) f(θ) dθ,
so also can the PEL be used, but with the formula
r = EL(θ̂, θ) = E[E{L(θ̂, θ) | y}] = E{PEL(y)} = ∫ PEL(y) f(y) dy.
Note: Both of these formulae for the Bayes risk use the law of iterated
expectation, but with different conditionings.
For each of the following estimators, derive a formula for the posterior
expected loss under the quadratic error loss function:
(a) μ̂ = ȳ = (1/n)(y₁ + ... + yₙ) (the sample mean)
(b) μ̂ = |ȳ| (the absolute value of the sample mean).
In each case, use the derived PEL to obtain the Bayes risk.
(a) Here PEL(y) = E{(ȳ − μ)² | y} = (ȳ − μ*)² + σ*² = σ*² + (1 − k)²(ȳ − μ₀)².
Thus r = E{PEL(y)} = σ*² + (1 − k)²(σ₀² + σ²/n)
= kσ²/n + (1 − k)²(σ₀² + σ²/n)   (where k = n/(n + σ²/σ₀²))
= σ²/n (after a little algebra).
Note: This is in agreement with Exercise 2.8, where the result was obtained much more easily by taking the mean of the risk function, as follows:
r = E R(μ) = E(σ²/n) = σ²/n.
Some examples of this PEL function are shown in Figure 2.13. In all
these examples, n= σ= 1 .
PELfun=function(ybar,sig,n,sig0,mu0){
k=n/(n+sig^2/sig0^2)
mustar=(1-k)*mu0+k*ybar
sigstar2=k*sig^2/n
ybar^2-2*abs(ybar)*mustar+sigstar2 + mustar^2
}
ybarvec=seq(-10,10,0.01); options(digits=4)
X11(w=8,h=5.5); par(mfrow=c(1,1));
# Calculate r when n=1, sig=1, mu0=1, sig0=1 (should get 1.16 as before)
Jfun = function(ybar,sig,n,sig0,mu0){
PELfun(ybar=ybar,sig=sig,n=n,sig0=sig0,mu0=mu0)*
dnorm(ybar,mu0,sqrt(sig0^2+sig^2/n))
}
plot(ybarvec, PELfun(ybar=ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0)*
dnorm(ybarvec,mu0,sqrt(sig0^2+sig^2/n)),
type="l", xlab="ybar",ylab="PEL(ybar)*f(ybar)", lwd=3)
Find the Bayes estimate under the quadratic error loss function.
Note 1: This result can also be obtained using Leibniz’s rule for differentiating an integral, which is generally
(d/dx) ∫_a^b G(u, x) du = ∫_a^b ∂G(u, x)/∂x du + G(b, x) db/dx − G(a, x) da/dx,
and which reduces to ∫_a^b ∂G(u, x)/∂x du + 0 − 0 if a and b are constants.
Thus we may write
(∂/∂θ̂) PEL(y) = (∂/∂θ̂) ∫ (θ̂ − θ)² f(θ | y) dθ
= ∫ (∂/∂θ̂){ (θ̂ − θ)² f(θ | y) } dθ + 0 − 0
= ∫ 2(θ̂ − θ)(1) f(θ | y) dθ = 2{ θ̂ − ∫ θ f(θ | y) dθ }.
Setting this to zero yields θ̂ = ∫ θ f(θ | y) dθ = E(θ | y).
Note 2: To check that this minimises the PEL (rather than maximises it) we may further calculate
(∂²/∂θ̂²) PEL(y) = 2 (∂/∂θ̂){ θ̂ − ∫ θ f(θ | y) dθ } = 2{1 − 0} > 0.
Find the Bayesian estimate under the absolute error loss function.
Let t denote θ̂ = θ̂(y). Then
PEL(y) = ∫ |t − θ| f(θ | y) dθ
= ∫_{−∞}^{t} (t − θ) f(θ | y) dθ + ∫_{t}^{∞} (θ − t) f(θ | y) dθ.
Find the Bayes estimate under the indicator error loss function.
Let t denote θˆ = θˆ( y ) and first suppose that the parameter θ is discrete.
The indicator error loss function is L(t, θ) = I(t ≠ θ) = 1 − I(t = θ).
Therefore
PEL(y) = E{L(t, θ) | y} = E{1 − I(t = θ) | y} = 1 − E{I(t = θ) | y}
= 1 − P(θ = t | y)
= 1 − f(θ = t | y).
(a) Find the risk function, Bayes risk and posterior expected loss implied by the estimator λ̂ = 2ȳ under the quadratic error loss function.
R(λ) = E{ (2ȳ − λ)² | λ }
= E{ 4ȳ² − 4ȳλ + λ² | λ }
= 4E(ȳ² | λ) − 4λE(ȳ | λ) + λ²
= 4[ V(ȳ | λ) + {E(ȳ | λ)}² ] − 4λ² + λ²
= 4[ λ/n + λ² ] − 4λ² + λ²
= λ² + 4λ/n, λ > 0 (an increasing quadratic).
We see that
(λ | y) ~ Gamma(α + nȳ, β + n).
It follows that
PEL(y) = E{L(λ̂, λ) | y}
= E{ (2ȳ − λ)² | y }
= E{ 4ȳ² − 4ȳλ + λ² | y }
= 4ȳ² − 4ȳE(λ | y) + E(λ² | y)
= 4ȳ² − 4ȳ (α + nȳ)/(β + n) + [ (α + nȳ)/(β + n)² + {(α + nȳ)/(β + n)}² ].
Note: The Bayes risk could also be computed using an argument which begins as follows:
r = E{PEL(y)}
= E[ 4ȳ² − 4ȳ (α + nȳ)/(β + n) + (α + nȳ)/(β + n)² + {(α + nȳ)/(β + n)}² ],
where, for example,
Eȳ = E{E(ȳ | λ)} = E{E(y₁ | λ)} = Eλ = α/β.
(b) The Bayes estimate under the QELF is the posterior mean,
E(λ | y) = (α + nȳ)/(β + n).
This estimator has the smallest Bayes risk amongst all possible
estimators, including the one in (a), which is different. So E ( | y ) must
have a smaller Bayes risk than the estimator in (a).
Discussion
E .
n
(a) Find the risk function and Bayes risk for the estimator θ̂ = y.
(b) Find the Bayes estimate and sketch it as a function of the data y.
In summary, R(θ) = 1 for θ ≤ 0, and R(θ) = 1.5 − Φ(θ) for θ > 0, as shown in Figure 2.16.
Now
L(t, θ) = 1 − I(θ < t < 2θ),
and so
PEL(y) = E{1 − I(θ < t < 2θ) | y}
= 1 − P(θ < t < 2θ | y).
Also, if t > 0 then
PEL(t) = 1 − E{I(θ < t < 2θ) | y}
= 1 − P(θ < t < 2θ | y)
= 1 − P(t/2 < θ < t | y)
= 1 − ∆(t),
where
∆(t) = F(t | y) − F(t/2 | y)
is to be maximised.
Setting the derivative of ∆(t) to zero, i.e. f(t | y) = (1/2) f(t/2 | y), yields
exp{ −(t − y/2)²/(2 × ½) } = (1/2) exp{ −(t/2 − y/2)²/(2 × ½) }
⇒ (t²/4 − ty/2 + y²/4) − (t² − ty + y²/4) = −log 2
⇒ (3/4)t² − (1/2)ty − log 2 = 0
⇒ t = [ y/2 + √(y²/4 + 4(3/4) log 2) ] / [2(3/4)] = [ y + √(y² + 12 log 2) ] / 3.
X11(w=8,h=5.5)
(1/3)*(c(-1,0,1)+sqrt(c(-1,0,1)^2 + 12*log(2)))
# 0.6841672 0.9613513 1.3508339
CHAPTER 3
Bayesian Basics Part 3
3.1 Inference given functions of the data
Sometimes we observe a function of the data rather than the data itself.
In such cases the function typically degrades the information available
in some way. An example is censoring, where we observe a value only if
that value is less than some cut-off point (right censoring) or greater than
some cut-off value (left censoring). It is also possible to have censoring
on the left and right simultaneously. Another example is rounding,
where we only observe values to the nearest multiple of 0.1, 1 or 5, etc.
Find the posterior distribution and mean of the average light bulb
lifetime, m.
P(yᵢ > 6 | c) = ∫_6^∞ c e^(−c yᵢ) dyᵢ = e^(−6c).
6
The estimate 6.667 is also higher than the estimate obtained by simply replacing the censored values with 6, namely
(1/5)(2.6 + 3.2 + 6 + 1.2 + 6) = 3.8.
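A sketch of this calculation in R. The prior for c is specified in a part of the exercise not shown above; the code below assumes a standard exponential prior, c ~ G(1, 1), which reproduces the stated posterior mean of 6.667.
obs = c(2.6, 3.2, 1.2)             # uncensored lifetimes
ncens = 2; cutoff = 6              # two lifetimes right-censored at 6, each contributing exp(-6c)
a = 1 + length(obs)                # posterior shape under the assumed G(1,1) prior for c
b = 1 + sum(obs) + ncens*cutoff    # posterior rate: (c | data) ~ G(4, 20)
b/(a - 1)                          # posterior mean of m = 1/c: 20/3 = 6.667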
Suppose that:
( y | θ ) ~ U (0, θ )
θ ~ U (0, 2) ,
where the data is
x = g ( y ) = the value of y rounded to the nearest integer.
Observe that:
x = 0 if 0 < y < 1/2
x = 1 if 1/2 < y < 3/2
x = 2 if 3/2 < y < 2.
P(x = 0 | θ) = P(0 < y < 1/2 | θ) = 1 if θ < 1/2, and = 1/(2θ) if θ > 1/2
P(x = 1 | θ) = P(1/2 < y < 3/2 | θ) = 0 if 0 < θ < 1/2, = (θ − 1/2)/θ if 1/2 < θ < 3/2, and = 1/θ if 3/2 < θ < 2
P(x = 2 | θ) = P(3/2 < y < 2 | θ) = 0 if 0 < θ < 3/2, and = (θ − 3/2)/θ if 3/2 < θ < 2.
In particular, B = ∫_{1/2}^{3/2} [(θ − 1/2)/θ] dθ + ∫_{3/2}^{2} (1/θ) dθ
= 3/2 − (1/2)log(3/2) − 1/2 + (1/2)log(1/2) + log 2 − log(3/2)
= 0.7383759.
E₁ = E(θ | x = 1) = ∫_{1/2}^{3/2} θ [(θ − 1/2)/(Bθ)] dθ + ∫_{3/2}^{2} θ [1/(Bθ)] dθ
= 1/B = 1.354 (after some working).
Discussion
For completeness and checking we now also calculate the other two
posterior means:
E₀ = E(θ | x = 0) = 7/(8A) = 0.7334
E₂ = E(θ | x = 2) = 1/(8C) = 1.8254,
legend(0,2,c("f(theta|x=1)","f(theta|y=0.6)","f(theta|y=1)","f(theta|y=1.1)",
"f(theta|y=1.4)"), lty=c(1,2,3,4,5), lwd=c(3,3,3,3,3))
C=2-1.5*log(2)-1.5+1.5*log(1.5)
A=0.5+0.5*log(2)-0.5*log(0.5)
options(digits=7); c(A,B,C) # 1.19314718 0.73837593 0.06847689
E0=7/(8*A); E1=1/B; E2=1/(8*C); c(E0,E1,E2)
# 0.7333546 1.3543237 1.8254333
P0=1/4+(1/4)*(log(2)-log(1/2))
P1=0.5*(1.5-0.5*log(1.5)-0.5+0.5*log(0.5)) +0.5*(log(2)-log(1.5))
P2=0.5*(2-1.5*log(2)-1.5+1.5*log(1.5))
P0+P1+P2 # 1 Correct
c(P0,P1,P2) # 0.59657359 0.36918796 0.03423845
E0*P0 + E1*P1 + E2*P2 # 1 Correct
postvecA=thetavec; postvecC=thetavec;
for(i in 1:length(thetavec)){ postvecA[i]=postfunA(theta=thetavec[i])
postvecC[i]=postfunC(theta=thetavec[i]) }
plot(c(0,2),c(0,3.7),type="n",xlab="theta",ylab="density", main=" ")
lines(thetavec, postvecA,lty=2,lwd=3)
lines(thetavec, postvecB,lty=1,lwd=3)
lines(thetavec, postvecC,lty=3,lwd=3)
for(y in seq(0.1,1.9,0.1)){ k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/ thetavec[thetavec>y], lty=1,lwd=1) }
legend(0.7,3.6,c("f(theta|x=0)","f(theta|x=1)","f(theta|x=2)","f(theta|y)"),
lty=c(2,1,3,1), lwd=c(3,3,3,1))
Also, the ‘P’ in HPDR and CPDR may be read as predictive rather than as posterior. For example, the CPDR for x is now the central predictive density region for x.
Note: This follows from the basic law of iterated variance (LIV), Vx = E{V(x | θ)} + V{E(x | θ)}, after conditioning throughout on y.
Note: The last equation indicates that the pdf of ( x | y, θ ) is the same as
the pdf of ( y | θ ) but with y changed to x in the density formula.
Then, for y = 2.0, find the predictive mean, mode and median of x, and
also the 80% central predictive density region and 80% highest
predictive density region for x.
f(x | y) = 2(y + 1)²/(x + y + 1)³, x > 0.
Check: ∫_0^∞ f(x | y) dx = ∫_0^∞ [2(y + 1)²/(x + y + 1)³] dx
= 2(y + 1)² ∫_{y+1}^{∞} u^(−3) du (where u = x + y + 1)
= 2(y + 1)² × [−u^(−2)/2]_{y+1}^{∞} = 2(y + 1)² × 1/[2(y + 1)²] = 1 (correct).
E(x | y) = ∫_0^∞ x × 18(x + 3)^(−3) dx = 3.
Also,
F(x | y) = ∫_0^x 18(t + 3)^(−3) dt
= ∫_3^{3+x} 18u^(−3) du, where u = 3 + t
= 18 [−u^(−2)/2]_3^{3+x} = 9 [1/3² − 1/(3 + x)²] = 1 − 9/(3 + x)².
Setting this to p and solving for x yields the predictive quantile function,
Q(p) = F^(−1)(p | y) = 3[ 1/√(1 − p) − 1 ].
So the predictive median is Q(1/2) = 3[ 1/√(1 − 1/2) − 1 ] = 1.2426.
The predictive quantile function can now also be used to calculate the
80% CPDR for x,
(Q (0.1), Q (0.9) ) = (0.1623, 6.4868),
and the 80% HPDR for x,
( 0, Q (0.8) ) = (0, 3.7082).
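A sketch of these calculations in R, using the quantile function just derived (for y = 2):
Qfun = function(p){ 3*(1/sqrt(1 - p) - 1) }   # predictive quantile function for y = 2
Qfun(0.5)                                     # predictive median: 1.2426
Qfun(c(0.1, 0.9))                             # 80% CPDR: (0.1623, 6.4868)
c(0, Qfun(0.8))                               # 80% HPDR: (0, 3.7082)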
Another way to calculate the predictive median of x is as the solution in
q of
1/2 = P(x < q | y),
after noting that the right hand side of this equation also equals
E{P(x < q | y, θ) | y} = E(1 − e^(−θq) | y)
= 1 − m(−q),
where m(t) is the posterior moment generating function (mgf) of θ.
But (θ | y) ~ Gamma(2, y + 1), and so m(t) = (1 − t/(y + 1))^(−2).
You are visiting a small town with buses whose license plates show their
numbers consecutively from 1 up to however many there are. In your
mind the number of buses could be anything from 1 to 5, with all
possibilities equally likely. Whilst touring the town you first happen to
see Bus 3.
Assuming that at any point in time you are equally likely to see any of
the buses in the town, how likely is it that the next bus number you see
will be at least 4?
Also, what is the expected value of the bus number that you will next
see?
As in Exercise 1.6, let θ be the number of buses in the town and let y be
the number of the bus you happen to first see. Recall that a suitable
Bayesian model is:
f(y | θ) = 1/θ, y = 1, ..., θ
f(θ) = 1/5, θ = 1, ..., 5 (prior),
and that the posterior density of θ works out as
f(θ | y) = 20/47 for θ = 3, 15/47 for θ = 4, and 12/47 for θ = 5.
Now let x be the number on the next bus that you happen to see in the
town. Then
f(x | y, θ) = 1/θ,  x = 1, …, θ  (same distribution as that of (y | θ)).
This may also be written
f(x | y, θ) = I(x ≤ θ)/θ,  x = 1, 2, 3, …,
and so the posterior predictive density of x is
f(x | y) = Σ_θ f(x, θ | y) = Σ_θ f(x | y, θ) f(θ | y) = Σ_{θ=y}^{5} I(x ≤ θ) f(θ | y)/θ.

Check:  Σ_{x=1}^{5} f(x | y) = 0.27270 × 3 + 0.13085 + 0.05106
= 1 (correct).
121
Bayesian Methods for Statistical Analysis
In summary, for y = 3, we have that f(x | y) = 0.27270 for x = 1, 2, 3;
0.13085 for x = 4; and 0.05106 for x = 5.
So the probability that the next bus you see will have a number on it
which is at least 4 equals
P(x ≥ 4 | y) = Σ_{x: x ≥ 4} f(x | y) = f(x = 4 | y) + f(x = 5 | y) = 0.13085 + 0.05106 = 0.18191.
Alternatively,
E(x | y) = E{ E(x | y, θ) | y } = E{ (1+θ)/2 | y } = [ 1 + E(θ | y) ] / 2
= [ 1 + 3(20/47) + 4(15/47) + 5(12/47) ] / 2 = [ 1 + 180/47 ] / 2 = 227/94 = 2.4149.
(a) For the Bayesian model given by (Y | θ) ~ Bin(n, θ) and the prior
θ ~ Beta(α, β), find the posterior predictive density of a future data
value x, whose distribution is defined by (x | y, θ) ~ Bin(m, θ).
(b) A bent coin is tossed 20 times and 6 heads come up. Assuming a flat
prior on the probability of heads on a single toss, what is the probability
that exactly one head will come up on the next two tosses of the same
coin? Answer this using results in (a).
122
Chapter 3: Bayesian Basics Part 3
(c) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.
Find the expected number of times you will have to toss the same coin
again repeatedly until the next head comes up.
(d) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.
Now consider tossing the coin repeatedly until the next head, writing
down the number of tosses, and then doing all of this again repeatedly,
again and again.
Next define ψ to be the average of a very long sequence like this (e.g.
one of length 1,000,000). Find the posterior predictive density and mean
of ψ (approximately).
Note: In parts (c) and (d) the parameters of the beta distribution (both
20.3) represent a prior belief that the probability of heads is about 1/2, is
equally likely to be on either side of 1/2, and is 80% likely to be between
0.4 and 0.6. See the R Code below for details.
(a) First note that x is not a future independent replicate of the observed
data y, except in the special case where m = n.
f(x | y) = ∫ f(x | y, θ) f(θ | y) dθ
= ∫_0^1 C(m, x) θ^x (1−θ)^(m−x) × [ θ^(a−1)(1−θ)^(b−1) / B(a, b) ] dθ
= C(m, x) B(a + x, b + m − x) / B(a, b),   x = 0, 1, …, m,
where C(m, x) is the binomial coefficient, a = α + y and b = β + n − y.
123
Bayesian Methods for Statistical Analysis
124
Chapter 3: Bayesian Basics Part 3
It follows that
P(x = 1 | y) = E{ P(x = 1 | y, θ) | y }
= E{ 2θ(1−θ) | y }
= 2{ E(θ | y) − E(θ² | y) }
= 2{ E(θ | y) − [ V(θ | y) + (E(θ | y))² ] }
= 2{ 7/22 − [ 0.009432 + (7/22)² ] } = 0.415.
(c) Let z be the number of tosses until the next head. Then
( z | y, θ ) ~ Geometric(θ )
with pdf
f(z | y, θ) = (1−θ)^(z−1) θ,   z = 1, 2, 3, ….
f(z | y) = ∫ f(z, θ | y) dθ = ∫ f(z | y, θ) f(θ | y) dθ.
E(z | y) = E{ E(z | y, θ) | y } = E{ 1/θ | y } = ∫_0^1 (1/θ) × θ^(a−1)(1−θ)^(b−1) / B(a, b) dθ
= [ B(a−1, b) / B(a, b) ] ∫_0^1 θ^((a−1)−1)(1−θ)^(b−1) / B(a−1, b) dθ
= B(a−1, b)/B(a, b) = (a + b − 1)/(a − 1) = 2.356.
125
Bayesian Methods for Statistical Analysis
x=0:2
( 2*factorial(6+x)*factorial(16-x)/factorial(23) )/
( factorial(x)*factorial(2-x) * factorial(6)*factorial(14)/factorial(21) )
# 0.4743 0.4150 0.1107
7*15/(22^2*23) # 0.009432
2 * (7/22 - ( 0.009432267 + (7/22)^2 ) ) # 0.415
(20.3+20.3+20-1)/(20.3+6-1) # 2.356
126
Chapter 3: Bayesian Basics Part 3
σ 02
* .
n
f(x | y) = ∫ f(x | y, μ) f(μ | y) dμ
∝ ∫ exp{ −(x − μ)² / (2σ²/m) } exp{ −(μ − μ*)² / (2σ*²) } dμ.
127
Bayesian Methods for Statistical Analysis
δ² = V(x | y)
= E{ V(x | y, μ) | y } + V{ E(x | y, μ) | y }
= E{ σ²/m | y } + V{ μ | y } = σ²/m + σ*².
Now, (x | y, λ) ~ N( μ, 1/(mλ) ), and therefore
128
Chapter 3: Bayesian Basics Part 3
f(x | y) = ∫ f(x | y, λ) f(λ | y) dλ
∝ ∫_0^∞ √(mλ) exp{ −(mλ/2)(x − μ)² } × λ^(a−1) exp{ −bλ } dλ
∝ ∫_0^∞ λ^((a + 1/2) − 1) exp{ −λ [ b + (m/2)(x − μ)² ] } dλ
∝ [ b + (m/2)(x − μ)² ]^(−(a + 1/2))
∝ [ 1 + (1/(2a)) × m·2a(x − μ)²/(2b) ]^(−(2a+1)/2).
Now let Q = (x − μ) / ( √(b/a) / √m )  (so that Q² = m·2a(x − μ)²/(2b)),
so that x = μ + Q √(b/a)/√m.
129
Bayesian Methods for Statistical Analysis
Note 2: The discrepancy measure may or may not depend on the model
parameter, θ . Thus in some cases, T ( y , θ ) may also be written as T ( y ) .
130
Chapter 3: Bayesian Basics Part 3
Note: This is just the probability that a Poisson(1) random variable will
take on a value greater than 2, and so is the same as the classical
p-value which would be used in this situation.
131
Bayesian Methods for Statistical Analysis
Thus:  P(λ = 1 | y, H₀) = e^(−2×1) 1³ / ( e^(−2×1) 1³ + e^(−2×2) 2³ ) = 0.48015
P(λ = 2 | y, H₀) = 1 − 0.48015 = 0.51985.
So a suitable ppp-value is
p = P(x ≥ y | y, H₀) = E{ P(x ≥ y | y, H₀, λ) | y, H₀ }
= E{ 1 − F_Poi(λ)(y − 1) | y, H₀ }
= 0.48015 × (1 − F_Poi(1)(2)) + 0.51985 × (1 − F_Poi(2)(2))
= 0.48015 [ 1 − ( e^(−1)1⁰/0! + e^(−1)1¹/1! + e^(−1)1²/2! ) ]
 + 0.51985 [ 1 − ( e^(−2)2⁰/0! + e^(−2)2¹/1! + e^(−2)2²/2! ) ]
= 0.20664.
132
Chapter 3: Bayesian Basics Part 3
Derive a formula for the ppp-value under each of the following three
choices of the test statistic:
(a) T(y, λ) = ȳ,  (b) T(y, λ) = (ȳ − μ)/(σ/√n),  (c) T(y, λ) = (ȳ − μ)/(s_y/√n),
where:  ȳ = (1/n) Σ_{i=1}^n yᵢ (the sample mean)
s_y² = (1/(n−1)) Σ_{i=1}^n (yᵢ − ȳ)² (the sample variance).
For each of these choices of test statistic, report the ppp-value for the
case where µ = 2 and y = (2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2).
(a) If T(y, λ) = ȳ then the ppp-value is p = P(x̄ ≥ ȳ | y).
Then, by Exercise 3.7, (x̄ − μ)/(s_yμ/√n) | y ~ t(n), where s_yμ² = (1/n) Σ_{i=1}^n (yᵢ − μ)².
Here: μ = 2, n = 10, ȳ = (1/n) Σ yᵢ = 4.370, s_yμ = √[ (1/n) Σ (yᵢ − μ)² ] = 2.978.
Therefore (ȳ − μ)/(s_yμ/√n) = 2.51658, and so p = 1 − F_t(10)(2.51658) = 0.01528.
(b) If T(y, λ) = (ȳ − μ)/(σ/√n) then the ppp-value is
p = P( (x̄ − μ)/(σ/√n) > (ȳ − μ)/(σ/√n) | y ) = P(x̄ > ȳ | y).
133
Bayesian Methods for Statistical Analysis
(c) If T(y, λ) = (ȳ − μ)/(s_y/√n) then the ppp-value is
p = P( (x̄ − μ)/(s_x/√n) > (ȳ − μ)/(s_y/√n) | y ),  where s_x² = (1/(n−1)) Σ_{i=1}^n (xᵢ − x̄)²
= E{ P( (x̄ − μ)/(s_x/√n) > (ȳ − μ)/(s_y/√n) | y, λ ) | y }
   by the law of iterated expectation
= E{ 1 − F_t(n−1)( (ȳ − μ)/(s_y/√n) ) | y }   since ( (x̄ − μ)/(s_x/√n) | y, λ ) ~ t(n−1)
= 1 − F_t(n−1)( (ȳ − μ)/(s_y/√n) ).
We see that the ppp-value derived is exactly the same as the classical
p-value which would be used in this setting. Numerically, we have that:
s_y = √[ (1/(n−1)) Σᵢ (yᵢ − ȳ)² ] = 1.901 and (ȳ − μ)/(s_y/√n) = 3.942645,
so that p = 1 − F_t(9)(3.942645) = 0.001696.
Note: A fourth test statistic which makes sense in the present context is
T(y, λ) = (ȳ − μ)/(s_yμ/√n), where s_yμ² = (1/n) Σᵢ (yᵢ − μ)² (as before).
134
Chapter 3: Bayesian Basics Part 3
options(digits=4); mu=2; y = c(2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2);
n=length(y); ybar=mean(y); s=sd(y); smu=sqrt(mean((y-mu)^2))
c(ybar,s,smu) # 4.370 1.901 2.978
arga=(ybar-mu)/(smu/sqrt(n)); pppa=1-pt(arga,n); c(arga,pppa)
# 2.51658 0.01528
argc=(ybar-mu)/(s/sqrt(n)); pppc=1-pt(argc,n-1); c(argc,pppc)
# 3.942645 0.001696
The first task now is to find the joint posterior density of θ1 and θ 2 ,
according to
f (θ | y ) ∝ f (θ ) f ( y | θ ) ,
or equivalently
f (θ1 , θ 2 | y ) ∝ f (θ1 , θ 2 ) f ( y | θ1 , θ 2 ) ,
where
f (θ ) = f (θ1 , θ 2 )
is the joint prior density of the two parameters.
Once a Bayesian model with two parameters has been defined, one task
is to find the marginal posterior densities of θ1 and θ 2 , respectively, via
the equations:
f (θ1 | y ) = ∫ f (θ1 , θ 2 | y )dθ 2
f (θ 2 | y ) = ∫ f (θ1 , θ 2 | y )dθ1 .
135
Bayesian Methods for Statistical Analysis
From these two marginal posteriors, one may obtain point and interval
estimates of θ1 and θ 2 in the usual way (treating each parameter
separately). For example, the marginal posterior mean of θ1 is
θ̂₁ = E(θ₁ | y) = ∫ θ₁ f(θ₁ | y) dθ₁.
The main idea of Equation (3.1) is to examine the joint posterior density
f (θ1 , θ 2 | y )
(or any kernel thereof), think of all terms in this as constant except for
θ1 , and then try to recognise a well-known density function of θ1 .
136
Chapter 3: Bayesian Basics Part 3
This posterior density may then be used to calculate point and interval
estimates of ψ . For example, the posterior mean of ψ is
ψ̂ = E(ψ | y) = ∫ ψ f(ψ | y) dψ.
Alternatively, this mean may be obtained using the equation
ψ̂ = E( g(θ₁, θ₂) | y ) = ∫∫ g(θ₁, θ₂) f(θ₁, θ₂ | y) dθ₁ dθ₂.
137
Bayesian Methods for Statistical Analysis
Under this model, the joint posterior density of the two parameters n and
θ is
f(n, θ | y) ∝ f(n, θ) f(y | n, θ)
= f(n) f(θ | n) f(y | n, θ)
∝ (1/k) × 1 × C(n, y) θ^y (1−θ)^(n−y)
∝ C(n, y) θ^y (1−θ)^(n−y),   0 < θ < 1,  n ≥ y,  y = 1, …, 9.
Hence f(n | y) ∝ ∫_0^1 C(n, y) θ^y (1−θ)^(n−y) dθ,   n = y, y+1, …, 9  (since y = 0, …, n)
= C(n, y) B(y+1, n−y+1) ∫_0^1 [ θ^((y+1)−1)(1−θ)^((n−y+1)−1) / B(y+1, n−y+1) ] dθ,   n = 5, 6, 7, 8, 9
= C(n, y) Γ(y+1)Γ(n−y+1)/Γ(y+1+n−y+1)   (since the integral equals 1)
= [ n! / (y!(n−y)!) ] × [ y!(n−y)! / (n+1)! ]
= 1/(n+1)
= 1/6 for n = 5, 1/7 for n = 6, 1/8 for n = 7, 1/9 for n = 8, and 1/10 for n = 9.
138
Chapter 3: Bayesian Basics Part 3
After normalising (i.e. dividing each of these five numbers by their sum,
0.6456), we find that, to four decimals, n’s posterior pdf is
f(n | y) = 0.2581 for n = 5, 0.2213 for n = 6, 0.1936 for n = 7,
0.1721 for n = 8, and 0.1549 for n = 9.
Thus, for example, there is a 17.2% chance a posteriori that the coin was
tossed 8 times.
The marginal posterior density of θ is then
f(θ | y) = 0.2581 × θ⁵(1−θ)^(5−5) / [ 5!(5−5)!/(5+1)! ] + … + 0.1549 × θ⁵(1−θ)^(9−5) / [ 5!(9−5)!/(9+1)! ].
139
Bayesian Methods for Statistical Analysis
Figures 3.3 and 3.4 (page 141) show the marginal posterior densities of n
and θ, respectively, with the posterior means n̂ = 6.744 and θ̂ = 0.7040
marked by vertical lines.
140
Chapter 3: Bayesian Basics Part 3
141
Bayesian Methods for Statistical Analysis
X11(w=8,h=4); par(mfrow=c(1,1))
plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)",ylim=c(0,0.4))
points(nvec,fny,pch=16,cex=1); abline(v=nhat)
plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y) ",ylim=c(0,2.5))
lines(thvec,fthyvec,lwd=3); abline(v=thhat)
(c) Find the posterior mean of the signal to noise ratio, defined as
γ = μ/σ = μ√λ.
Note: Both μ and λ are assigned uninformative priors. The joint prior
distribution of these two parameters could also be specified by:
f(μ | λ) ∝ 1,  μ ∈ ℝ,
f(λ) ∝ 1/λ,  λ > 0,
or by the single statement
f(μ, λ) ∝ 1/λ,  μ ∈ ℝ,  λ > 0.
142
Chapter 3: Bayesian Basics Part 3
Σ_{i=1}^n (yᵢ − μ)² = Σ_{i=1}^n [ (yᵢ − ȳ) + (ȳ − μ) ]²
= Σ_{i=1}^n (yᵢ − ȳ)² + 2(ȳ − μ) Σ_{i=1}^n (yᵢ − ȳ) + n(ȳ − μ)²
= (n−1) × (1/(n−1)) Σ_{i=1}^n (yᵢ − ȳ)² + 2(ȳ − μ)(nȳ − nȳ) + n(ȳ − μ)²
= (n−1)s² + n(ȳ − μ)²,   where s² is the sample variance.
143
Bayesian Methods for Statistical Analysis
f(μ | y) ∝ [ 1 + n(μ − ȳ)² / ((n−1)s²) ]^(−{(n−1)+1}/2)
= [ 1 + (1/(n−1)) ( (μ − ȳ)/(s/√n) )² ]^(−{(n−1)+1}/2).
We now define r = (μ − ȳ)/(s/√n), so that μ = ȳ + r s/√n and dμ/dr = s/√n.

Note 1: In result (3.2), the data vector y appears only by way of the
sample mean ȳ and sample standard deviation s. So it is also true that
(μ − ȳ)/(s/√n) | ȳ, s ~ t(n−1).
Here, s may not be left out of the conditioning. So it is not true that
(μ − ȳ)/(s/√n) | ȳ ~ t(n−1).
Note 2: Result (3.2) implies that the marginal posterior mean, mode and
median of μ are all equal to ȳ, and the 1 − α CPDR/HPDR for μ is
144
Chapter 3: Bayesian Basics Part 3
( ȳ ± t_{α/2}(n−1) × s/√n ).
This inference is identical to that obtained via the classical approach and
thereby justifies the use of the joint prior
f(μ, λ) ∝ 1/λ,  μ ∈ ℝ,  λ > 0
in cases of a priori ignorance regarding both μ and λ.
Thus f(μ | y) = [ Γ({(n−1)+1}/2) / ( √((n−1)π) Γ((n−1)/2) ) ]
× [ 1 + (1/(n−1)) ( (μ − ȳ)/(s/√n) )² ]^(−{(n−1)+1}/2) × (√n/s),   μ ∈ ℝ.
145
Bayesian Methods for Statistical Analysis
It follows that
(λ | y) ~ Gamma( (n−1)/2, ((n−1)/2) s² ),
(3.3)
Thus (u | y) ~ Gamma( (n−1)/2, 1/2 ) ~ χ²(n−1), which confirms (3.4).
Note 2: Results (3.3) and (3.4) imply that λ has posterior mean 1/ s 2 .
This makes sense because λ = 1/ σ 2 , and s 2 is an unbiased estimator of
σ 2 . We see that the inverse of the posterior mean of λ provides us with
the classical estimator of σ 2 .
146
Chapter 3: Bayesian Basics Part 3
( (n−1)s² / χ²_{α/2}(n−1),  (n−1)s² / χ²_{1−α/2}(n−1) ).
It will be observed that this is exactly the same as the usual classical
1 − α CI for σ 2 when the normal mean µ is unknown.
μ̂ = E(μ | y) = ∫_0^∞ ∫ μ f(μ, λ | y) dμ dλ,
where:  f(μ, λ | y) = k(μ, λ)/c
k(μ, λ) = λ^(n/2 − 1) exp{ −(λ/2) Σ_{i=1}^n (yᵢ − μ)² }
c = ∫_0^∞ ∫ k(μ, λ) dμ dλ.
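This double integral can be evaluated numerically. The following is a minimal R sketch, not the book's own code, and the data vector used is purely hypothetical for illustration:

# Hypothetical illustration: posterior mean of mu by numerical integration of the kernel.
y <- c(2.1, 3.2, 5.2, 1.7); n <- length(y)                   # illustrative data only
kfun  <- function(mu, lam) lam^(n/2 - 1) * exp(-(lam/2) * sum((y - mu)^2))
kmarg <- function(mu) sapply(mu, function(m)
           integrate(function(l) kfun(m, l), 0, Inf)$value)  # integrate lambda out
cc    <- integrate(kmarg, -20, 20)$value                     # normalising constant c
muhat <- integrate(function(m) m * kmarg(m), -20, 20)$value / cc
muhat                                                        # close to mean(y) = 3.05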
147
Bayesian Methods for Statistical Analysis
where
c_n = Γ( (n−1)/2 + 1/2 ) / [ Γ( (n−1)/2 ) ( (n−1)/2 )^(1/2) ].
Hence f(x | y) = ∫ f(x | y, λ) f(λ | y) dλ
∝ ∫_0^∞ λ^(1/2) exp{ −nmλ(x − ȳ)² / (2(n+m)) } × λ^((n−1)/2 − 1) exp{ −λ(n−1)s²/2 } dλ
= ∫_0^∞ λ^(n/2 − 1) exp{ −λ [ nm(x − ȳ)²/(2(n+m)) + (n−1)s²/2 ] } dλ
∝ [ nm(x − ȳ)²/(2(n+m)) + (n−1)s²/2 ]^(−n/2)
148
Chapter 3: Bayesian Basics Part 3
∝ [ 1 + nm(x − ȳ)² / ((n−1)(n+m)s²) ]^(−n/2)
= [ 1 + (1/(n−1)) ( (x − ȳ) / ( (s/√n) √((n+m)/m) ) )² ]^(−n/2).
It follows that
(x − ȳ) / ( (s/√n) √((n+m)/m) ) | y ~ t(n−1).   (3.5)
Consequently,
x = ( (n+m)a − nȳ ) / m.
149
Bayesian Methods for Statistical Analysis
This may look familiar to some readers, the reason being as follows.
It will be noted that this inference is exactly the same as implied by the
standard approach in the classical survey sampling framework (e.g. see
Cochran, 1977).
150
Chapter 3: Bayesian Basics Part 3
We thereby obtain the density
f(Ȳ | y) = [ Γ({(n−1)+1}/2) / ( √((n−1)π) Γ((n−1)/2) ) ]
× [ 1 + (1/(n−1)) ( (Ȳ − ȳ) / ( (s/√n) √(1 − n/N) ) )² ]^(−{(n−1)+1}/2)
× 1/( (s/√n) √(1 − n/N) ),   Ȳ ∈ ℝ.
151
Bayesian Methods for Statistical Analysis
X11(w=8,h=6); par(mfrow=c(1,1))
ybar=10; s=2; cv=seq(0,20,0.005)
plot(c(4,16),c(0,1),type="n",xlab="Ybar",ylab="f( Ybar | y )", main=" ")
n=5; rv=(cv-ybar)*sqrt(n)/s; lines(cv, dt(rv,n-1)*sqrt(n)/s,lty=1,lwd=2)
Nvec=c(6,7,10,40)
for(i in 1:length(Nvec)){ N=Nvec[i]; qv=rv/sqrt(1-n/N)
lines(cv, dt(qv,n-1)*sqrt(n)/(s*sqrt(1-n/N)),lty=i+1,lwd=2) }
legend(4,1,
c("N=6 (m=1)","N=7 (m=2)","N=10 (m=5)","N=40 (m=35)","N=infinity (=m)"),
lty=c(2:5,1),lwd=2)
text(6,0.6,
"The solid line is also the\nposterior density of mu,\nnamely f( mu | y ).")
152
CHAPTER 4
Computational Tools
4.1 Solving equations
In most of the Bayesian models so far examined, the calculations required
could be done analytically. For example, the model given by:
(Y | θ) ~ Binomial(5, θ)
θ ~ U(0, 1),
together with data y = 5, implies the posterior (θ | y) ~ Beta(6, 1). So θ
has posterior pdf f(θ | y) = 6θ⁵ and posterior cdf F(θ | y) = θ⁶. Then,
setting F(θ | y) = 1/2 yields the posterior median, (1/2)^(1/6) = 0.8909.
qbeta(0.5,6,1) # 0.8908987
How does the NR algorithm work? Figure 4.1 illustrates the idea.
153
Bayesian Methods for Statistical Analysis
154
Chapter 4: Computational Tools
j      0        1        2        3        4
θⱼ     1.0000   0.9167   0.8926   0.8909   0.8909

j      0        1        2        3        4
θⱼ     0.8000   0.9210   0.8933   0.8909   0.8909
Note 2: In this simple example, one could get the answer by solving the
equation g(θ) = 0 analytically. In general, that won't be possible, and
iterating the algorithm will be required. Of course, if it is possible to
solve that equation analytically, there is no need to iterate.
155
Bayesian Methods for Statistical Analysis
NR <- function(th,J=5){
# This function performs the Newton-Raphson algorithm for J iterations
# after starting at the value th. It outputs a vector of th values of length J+1.
thvec <- th; for(j in 1:J){
num <- th^6-1/2 # theta’s posterior cdf minus 1/2 (numerator)
den <- 6*th^5 # theta’s posterior pdf (denominator)
th <- th - num/den
thvec <- c(thvec,th) }
thvec }
options(digits=4)
NR(th=1,J=6) # 1.0000 0.9167 0.8926 0.8909 0.8909 0.8909 0.8909
NR(th=0.8,J=6) # 0.8000 0.9210 0.8933 0.8909 0.8909 0.8909 0.8909
0.8909-(0.8909^6-0.5)/(6*0.8909^5) # 0.8909 (Check)
We wish to solve g(t) = 0, where g(t) = t² − e^t. Now, g′(t) = 2t − e^t.
So we iterate according to t_{j+1} = t_j − ( t_j² − e^(t_j) ) / ( 2t_j − e^(t_j) ).
Let us arbitrarily choose t₀ = 0. Then we get:
t₁ = 0 − ( 0² − e⁰ ) / ( 2(0) − e⁰ ) = −1.000000
t₂ = (−1) − ( (−1)² − e^(−1) ) / ( 2(−1) − e^(−1) ) = −0.733044
t₃ = (−0.733044) − ( (−0.733044)² − e^(−0.733044) ) / ( 2(−0.733044) − e^(−0.733044) ) = −0.703808
t₄ = (−0.703808) − ( (−0.703808)² − e^(−0.703808) ) / ( 2(−0.703808) − e^(−0.703808) ) = −0.703467
t₅ = (−0.703467) − ( (−0.703467)² − e^(−0.703467) ) / ( 2(−0.703467) − e^(−0.703467) ) = −0.703467, etc.
156
Chapter 4: Computational Tools
Also, we find that the output of the NR algorithm starting from 1 is:
1.000000, -1.392211, -0.835088, -0.709834, -0.703483, -0.703467,
-0.703467, -0.703467, .....
Figure 4.2 illustrates the function g and the output of the NR algorithm
starting from −5, which is:
-5.000000, -2.502357, -1.287421, -0.802834, -0.707162, -0.703473,
-0.703467, -0.703467, .....
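For completeness, here is a small R sketch (written in the style of the NR() function above, but not part of the original listing) that reproduces these sequences:

NR2 <- function(t, J=7){
  # Newton-Raphson for g(t) = t^2 - exp(t), starting at t
  tvec <- t
  for(j in 1:J){ t <- t - (t^2 - exp(t))/(2*t - exp(t)); tvec <- c(tvec, t) }
  tvec }
NR2(0)   # 0 -1.000000 -0.733044 -0.703808 -0.703467 -0.703467 ...
NR2(1)   # 1 -1.392211 -0.835088 -0.709834 -0.703483 -0.703467 ...
NR2(-5)  # -5 -2.502357 -1.287421 -0.802834 -0.707162 -0.703473 ...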
157
Bayesian Methods for Statistical Analysis
158
Chapter 4: Computational Tools
Figures 4.3 and 4.4 show the posterior median 0.61427, as well as the
other solution of g ( p ) = 0 (i.e. root of g), namely 1.24748. This is not
actually a solution of F ( p | x) = 0.5, because the values of F ( p | x) for
p < 0 and p > 1 are 0 and 1, respectively.
159
Bayesian Methods for Statistical Analysis
160
Chapter 4: Computational Tools
4*(0.614272)^3-3*(0.614272)^4 # 0.499999
pvec=seq(-0.5,1.4,0.005); Fvec = 4*pvec^3-3*pvec^4
Fvec[pvec<=0] = 0; Fvec[pvec>=1] = 1
X11(w=8,h=4.5); par(mfrow=c(1,1))
161
Bayesian Methods for Statistical Analysis
Let x = (x₁, …, x_K)ᵀ, g(x) = (g₁(x), …, g_K(x))ᵀ and 0 = (0, …, 0)ᵀ
(a column vector of length K). Then the multivariate NR algorithm
iterates according to
x^(j+1) = x^(j) − [ ∂g(x)/∂xᵀ |_{x = x^(j)} ]^(−1) g( x^(j) ),
where ∂g(x)/∂xᵀ is the K × K matrix of partial derivatives with (i, k)th
element ∂gᵢ(x)/∂x_k.
First, f(λ | x) ∝ f(λ) f(x | λ) = 1 × e^(−λ) λ^x / x! ∝ λ e^(−λ), since x = 1.
162
Chapter 4: Computational Tools
The 80% HPDR for λ is (a,b), where a and b satisfy the two equations:
F (b | x) − F (a | x) = 0.8 (4.1)
f (b | x) = f (a | x) . (4.2)
Starting at
t^(0) = (a₀, b₀) = (0.5, 3.0)
(based on a visual inspection of the posterior density f(λ | x) = λ e^(−λ)), we
obtain results as shown in Table 4.3.
j     0      1           2          3          4         5
aⱼ    0.5    0.0776524   0.163185   0.167317   0.16730   0.16730
bⱼ    3.0    2.7406883   3.025571   3.079274   3.08029   3.08029
163
Bayesian Methods for Statistical Analysis
It seems that the 80% HPDR for λ is (0.16730, 3.08029). This interval is
illustrated in Figure 4.5 and appears to be correct.
gfun = function(a,b){
g1=pgamma(b,2,1)-pgamma(a,2,1)-0.8; g2=dgamma(b,2,1)-dgamma(a,2,1);
c(g1,g2) }
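The code that actually produced gmat is not shown above; the following is an assumed reconstruction (not the book's original listing) of the bivariate NR iteration, using the analytic Jacobian of (g1, g2) with respect to (a, b):

NRmat=function(a=0.5, b=3.0, J=7){
  # Bivariate Newton-Raphson for g1(a,b)=0 and g2(a,b)=0, storing successive (a,b) values.
  gmat=matrix(c(a,b),nrow=2)
  for(j in 1:J){
    g=gfun(a,b)
    # Jacobian: dg1/da=-f(a), dg1/db=f(b); dg2/da=-f'(a), dg2/db=f'(b),
    # where f is the Gamma(2,1) pdf, so f'(x)=(1-x)*exp(-x).
    Jac=rbind( c(-dgamma(a,2,1),  dgamma(b,2,1)),
               c(-(1-a)*exp(-a),  (1-b)*exp(-b)) )
    step=solve(Jac,g)
    a=a-step[1]; b=b-step[2]
    gmat=cbind(gmat,c(a,b)) }
  gmat }
gmat=NRmat()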
options(digits=6); gmat
# [1,] 0.5 0.0776524 0.163185 0.167317 0.16730 0.16730 0.16730 0.16730
# [2,] 3.0 2.7406883 3.025571 3.079274 3.08029 3.08029 3.08029 3.08029
164
Chapter 4: Computational Tools
lamv=seq(0,5,0.01); fv=dgamma(lamv,2,1)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(lamv,fv,type="l",lwd=3,xlab="lambda",ylab="f(lambda|x)", main=" ")
abline(h=c(dgamma(a,2,1)),v=c(a,b),lty=1)
# Checks:
c(a,b,dgamma(c(a,b),2,1)) # 0.167300 3.080291 0.141527 0.141527
c(pgamma(a,2,1), pgamma(b,2,1), pgamma(b,2,1) - pgamma(a,2,1))
# 0.0125275 0.8125275 0.8000000
This algorithm first requires the specification (i.e. definition by the user)
of some suitable latent data, which we will denote by z, and then the
application of the following two steps iteratively until convergence.
Note: The choice of the latent data z will depend on the particular
application.
165
Bayesian Methods for Statistical Analysis
or, in words, as
the expectation of the log-augmented posterior density with respect
to the distribution of the latent data given the observed data and
current parameter estimates.
Find the value of θ which maximises the Q-function, for example using
the Newton-Raphson algorithm.
This value becomes the current parameter estimate in the next iteration.
Suppose that the data, denoted D, consists of the observed data vector,
denoted by
166
Chapter 4: Computational Tools
yo = ( y1 ,..., yk ) ,
and the partially observed (or missing) data vector, denoted by
ym = ( yk +1 ,..., yn ) .
We don’t know the values in ym exactly, only that each of those values
is greater than some specified constant c.
(a) First, f(λ | D) ∝ f(λ) f(D | λ)
∝ 1 × ∏_{i=1}^k f(yᵢ | λ) ∏_{i=k+1}^n P(yᵢ > c | λ),
where:  f(yᵢ | λ) = λ e^(−λyᵢ)
P(yᵢ > c | λ) = ∫_c^∞ λ e^(−λyᵢ) dyᵢ = e^(−cλ).
Then f(λ | D) ∝ ∏_{i=1}^k λ e^(−λyᵢ) ∏_{i=k+1}^n e^(−cλ)
= λ^k exp{ −λ [ y_oT + (n−k)c ] },   where y_oT = y₁ + … + y_k.
So l(λ) ≡ log f(λ | D) = k log λ − λ [ y_oT + (n−k)c ]
⇒ l′(λ) = k/λ − [ y_oT + (n−k)c ],
and setting this to zero gives the posterior mode λ̂ = k / [ y_oT + (n−k)c ].
167
Bayesian Methods for Statistical Analysis
Now, f(yᵢ | yᵢ > c, λ) = λ e^(−λyᵢ) / e^(−λc) = λ e^(−λ(yᵢ − c)),   yᵢ > c
(an exponential pdf shifted to the right by c).
Therefore, E(yᵢ | yᵢ > c, λ) = c + 1/λ.
It follows that the Q-function is given by
Q_j(λ) = n log λ − λ [ y_oT + (n−k)( c + 1/λ_j ) ]
(note the distinction here between λ and λ_j).
168
Chapter 4: Computational Tools
Note: Writing (4.4) with λ_j = λ_{j+1} = λ (i.e. the limiting value) gives
λ = n / [ y_oT + (n−k)( c + 1/λ ) ],
and this can be solved easily for the same formula as derived in (a),
namely
λ = k / [ y_oT + (n−k)c ].
# (a)
n=5; k=3; c=10; yo=c(3.1, 8.2, 6.9); yoT=sum(yo); yoT # 18.2
k/(yoT+(n-k)*c) # 0.078534
# (b)
lam = 1; lamv = lam; options(digits=5)
for(j in 1:20){ lam=n/(yoT+(n-k)*(c+1/lam)); lamv=c(lamv,lam) }
lamv
# 1.000000 0.124378 0.092115 0.083456 0.080431 0.079282 0.078832
# 0.078653 0.078581 0.078553 0.078542 0.078537 0.078535 0.078535
# 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534
Suppose that the data, denoted D, consists of the observed data vector
yo = ( y1 ,..., yk )
and the partially observed (or ‘missing’) data vector
ym = ( yk +1 ,..., yn ) .
169
Bayesian Methods for Statistical Analysis
We don’t know the values in ym exactly, but only that each of these
values is greater than some specified constant c.
(a) Find the log-posterior density of µ and describe how it could be used
to find the posterior mode of µ . (Do not actually find that mode in this
way.)
(b) Find the posterior mode of µ using the EM algorithm. Then check
your answer by showing the mode in plots of the likelihood and log-
likelihood functions.
(a) Observe that f(μ | D) ∝ 1 × ∏_{i=1}^k f(yᵢ | μ) ∏_{i=k+1}^n P(yᵢ > c | μ).
Here, ∏_{i=1}^k f(yᵢ | μ) ∝ ∏_{i=1}^k exp{ −(yᵢ − μ)²/(2σ²) }
= exp{ −(1/(2σ²)) Σ_{i=1}^k (yᵢ − μ)² }
= exp{ −(1/(2σ²)) [ (k−1)s_o² + k(μ − ȳ_o)² ] },
where:  ȳ_o = (1/k) Σ_{i=1}^k yᵢ (the observed sample mean)
s_o² = (1/(k−1)) Σ_{i=1}^k (yᵢ − ȳ_o)² (the observed sample variance).
Also, P(yᵢ > c | μ) = ∫_c^∞ (1/(σ√(2π))) exp{ −(yᵢ − μ)²/(2σ²) } dyᵢ
= P( Z > (c−μ)/σ ) = 1 − Φ( (c−μ)/σ ),
where Z ~ N(0,1) and Φ(z) = P(Z ≤ z) (the standard normal cdf).
Therefore f(μ | D) ∝ exp{ −(k/(2σ²)) (μ − ȳ_o)² } [ 1 − Φ( (c−μ)/σ ) ]^(n−k).
170
Chapter 4: Computational Tools
So the log-posterior is
log f(μ | D) = −(k/(2σ²)) (μ − ȳ_o)² + (n−k) log[ 1 − Φ( (c−μ)/σ ) ] + c₁
(where c₁ is a term which does not depend on μ),
where l″(μ) = ∂l′(μ)/∂μ = −k/σ² + …
As a further exercise, one could complete the formula for l ′′( µ ) above
and actually implement the NR algorithm.
Note: The posterior mode here is also the maximum likelihood estimate,
since the prior is proportional to a constant.
171
Bayesian Methods for Statistical Analysis
= −(1/(2σ²)) [ Σ_{i=1}^k ( yᵢ² − 2μyᵢ + μ² ) + Σ_{i=k+1}^n ( yᵢ² − 2μyᵢ + μ² ) ] + c₁
= c₂ { ( kμ² − 2μ k ȳ_o ) + ( (n−k)μ² − 2μ(n−k) ȳ_m ) } + c₃,
where:  ȳ_o = (1/k) Σ_{i=1}^k yᵢ (the sample mean of the observed values)
ȳ_m = (1/(n−k)) Σ_{i=k+1}^n yᵢ (the sample mean of the missing values).
We see that eⱼ = E( X | X > c ) evaluated at μ = μⱼ,
where P(X > c) = 1 − P(X < c) = 1 − P( Z < (c−μ)/σ ) = 1 − Φ( (c−μ)/σ ),
and where
I = ∫_c^∞ x (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx
= ∫_c^∞ (x−μ) (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx + ∫_c^∞ μ (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx
= ∫_{(c−μ)²/2}^∞ (1/(σ√(2π))) e^(−t/σ²) dt + μ P(X > c)
   where t = (x−μ)²/2 and dt = (x−μ) dx
172
Chapter 4: Computational Tools
= (1/(σ√(2π))) σ² e^(−(c−μ)²/(2σ²)) + μ P(X > c)
= σ φ( (c−μ)/σ ) + μ P(X > c),   where φ(z) is the standard normal pdf.
Thus E( X | X > c ) = (1/P(X > c)) [ σ φ( (c−μ)/σ ) + μ P(X > c) ]
= μ + σ φ( (c−μ)/σ ) / [ 1 − Φ( (c−μ)/σ ) ],
and consequently eⱼ = μⱼ + σ φ( (c−μⱼ)/σ ) / [ 1 − Φ( (c−μⱼ)/σ ) ].
j
Figure 4.6 shows the posterior density (top subplot) and the log-posterior
density (bottom subplot). Each of these density functions is drawn scaled,
meaning correct only up to a constant of proportionality. In each subplot,
the posterior mode is indicated by way of a vertical dashed line.
173
Bayesian Methods for Statistical Analysis
# (b)
options(digits=6); yo = c(3.1, 8.2, 6.9); n=5; k = 3; c= 10; sig=3;
yoT=sum(yo); c(yoT, yoT/3) # 18.20000 6.06667
mu=5; muv=mu; for(j in 1:10){
ej = mu + sig * dnorm((c-mu)/sig) / ( 1-pnorm((c-mu)/sig) )
mu = ( yoT + (n-k)*ej ) / n
muv=c(muv,mu) }
muv # 5.00000 8.13784 8.37179 8.39570 8.39821 8.39847
# 8.39850 8.39850 8.39850 8.39850 8.39850
modeval=muv[length(muv)]; modeval # 8.3985
muvec=seq(0,20,0.001); lvec=muvec
for(i in 1:length(muvec)){ muval=muvec[i]
lvec[i]=(-1/(2*sig^2))*sum((yo-muval)^2) +
(n-k)*log(1-pnorm((c-muval)/sig)) }
iopt=(1:length(muvec))[lvec==max(lvec)]; muopt=muvec[iopt]; muopt # 8.399
X11(w=8,h=6); par(mfrow=c(2,1));
plot(muvec,exp(lvec),type="l",lwd=2); abline(v=modeval,lty=2,lwd=2)
plot(muvec,lvec,type="l",lwd=2); abline(v=modeval,lty=2,lwd=2)
174
Chapter 4: Computational Tools
The idea is, at each M-Step, to maximise the Q-function with respect to
θ1 , with θ 2 fixed at its current value; and then to maximise the Q-function
with respect to θ 2 , with θ1 fixed at its current value.
175
Bayesian Methods for Statistical Analysis
Assuming the current values of a and b are aⱼ and bⱼ, this can be achieved
via the NR algorithm by setting a′₀ = aⱼ and iterating until convergence
as follows (k = 0, 1, 2, ...):
a′_{k+1} = a′_k − [ ∂g(a,b)/∂a |_{a = a′_k, b = bⱼ} ] / [ ∂²g(a,b)/∂a² |_{a = a′_k, b = bⱼ} ],
and finally setting
a_{j+1} = a′_∞.   (4.5)
This can be achieved via the NR algorithm by setting b′₀ = bⱼ and iterating
until convergence as follows (k = 0, 1, 2, ...):
b′_{k+1} = b′_k − [ ∂g(a,b)/∂b |_{a = a_{j+1}, b = b′_k} ] / [ ∂²g(a,b)/∂b² |_{a = a_{j+1}, b = b′_k} ],
and finally setting
b_{j+1} = b′_∞.   (4.6)
176
Chapter 4: Computational Tools
One application of the CNR and CNR1 algorithms is to finding the HPDR
for a parameter.
The 80% HPDR for λ was shown to be (a,b), where a and b are the
simultaneous solutions of the two equations:
g₁(a, b) = F(b | x) − F(a | x) − 0.8
g₂(a, b) = f(b | x) − f(a | x).
This model says that each value yi has a common variance σ 2 and one
of two means, these being: µ if Ri = 0
µ + δ if Ri = 1.
177
Bayesian Methods for Statistical Analysis
(d) Create a plot which shows the routes taken by the algorithms in parts
(b) and (c).
(a) Figure 4.7 shows a histogram of the sampled values which clearly
shows the two component normal densities and the mixture density. The
sample mean of the data is 23.16. Also, 29 of the 100 Ri values are equal
to 1, and 71 of them are equal to 0.
178
Chapter 4: Computational Tools
(b) We will here take the vector R = (R₁, …, Rₙ) as the latent data. The
conditional posterior of μ and δ given this latent data is
f(μ, δ | y, R) ∝ f(μ, δ, y, R)
= f(μ, δ) f(R | μ, δ) f(y | R, μ, δ)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }
∝ 1 × 1 × exp{ −(1/(2σ²)) Σ_{i=1}^n ( yᵢ − [μ + Rᵢδ] )² }.
Now, the exponent can be expanded as
−(1/(2σ²)) Σ_{i=1}^n ( yᵢ² − 2yᵢ[μ + Rᵢδ] + [μ + Rᵢδ]² )
= −(1/(2σ²)) { Σ_{i=1}^n yᵢ² − 2 Σ_{i=1}^n yᵢ[μ + Rᵢδ] + Σ_{i=1}^n [μ + Rᵢδ]² }
= −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢRᵢ + nμ² + 2μδ Σ_{i=1}^n Rᵢ + δ² Σ_{i=1}^n Rᵢ² },
where c₁ and c₂ are positive constants which do not depend on μ or δ
in any way. We see that (using Rᵢ² = Rᵢ)
log f(μ, δ | y, R) = −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢRᵢ + nμ² + 2μδ R_T + δ² R_T },
where R_T = Σ_{i=1}^n Rᵢ.
So the Q-function is
Q_j(μ, δ) = E_R{ log f(μ, δ | y, R) | y, μⱼ, δⱼ }
= −c₁ { c₂ − 2μnȳ − 2δ Σ_{i=1}^n yᵢ e_ij + nμ² + 2μδ e_Tj + δ² e_Tj },
where:  e_ij = E( Rᵢ | y, μⱼ, δⱼ )
e_Tj = E( R_T | y, μⱼ, δⱼ ) = Σ_{i=1}^n e_ij.
179
Bayesian Methods for Statistical Analysis
We now need to obtain formulae for the e_ij values. Observe that
f(R | y, μ, δ) ∝ f(μ, δ, y, R)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }.
It follows that
( Rᵢ | y, μ, δ ) ~ ⊥ Bernoulli(eᵢ),  i = 1, …, n,
where
eᵢ = π exp{ −(1/(2σ²)) ( yᵢ − [μ+δ] )² }
     / [ π exp{ −(1/(2σ²)) ( yᵢ − [μ+δ] )² } + (1−π) exp{ −(1/(2σ²)) ( yᵢ − μ )² } ].
Therefore
e_ij = π exp{ −(1/(2σ²)) ( yᵢ − [μⱼ+δⱼ] )² }
       / [ π exp{ −(1/(2σ²)) ( yᵢ − [μⱼ+δⱼ] )² } + (1−π) exp{ −(1/(2σ²)) ( yᵢ − μⱼ )² } ].
180
Chapter 4: Computational Tools
Running the algorithm from different starting points we obtain the same
final results. Unlike the NR algorithm, we find that the EM algorithm
always converges, regardless of the point from which it is started.
j µj δj
0 10.000 1.000
1 21.169 3.032
2 20.321 7.07
3 19.843 9.139
4 19.926 9.518
5 20.005 9.626
6 20.046 9.674
7 20.066 9.697
8 20.075 9.708
9 20.08 9.713
10 20.082 9.715
11 20.083 9.717
12 20.084 9.717
13 20.084 9.717
14 20.084 9.718
15 20.084 9.718
16 20.084 9.718
17 20.084 9.718
18 20.084 9.718
19 20.084 9.718
20 20.084 9.718
181
Bayesian Methods for Statistical Analysis
Thus, setting
∂Q_j(μ, δ)/∂μ = −c₁ { 0 − 2nȳ − 0 + 2nμ + 2δ e_Tj + 0 }
to zero we get μ_{j+1} = ȳ − (1/n) δⱼ e_Tj (after substituting in δ = δⱼ).
Then, setting
∂Q_j(μ, δ)/∂δ = −c₁ { 0 − 0 − 2 Σ_{i=1}^n yᵢ e_ij + 0 + 2μ e_Tj + 2δ e_Tj }
to zero we get δ_{j+1} = ( Σ_{i=1}^n yᵢ e_ij ) / e_Tj − μ_{j+1} (the same form of
δ-update equation as for the EM algorithm).
We see that the ECM algorithm here is fairly similar to the EM algorithm.
(d) Figure 4.8 (page 185) shows a contour plot of the log-posterior density
log f ( µ , δ | y, R) and the routes of the EM and ECM algorithms in parts
(b) and (c), each from the starting point ( µ0 , δ 0 ) = (10, 1) to the mode,
( µˆ , δˆ ) = (20.08, 9.72). Also shown are two other pairs of routes, one pair
starting from (5, 30), and the other from (35, 20).
Note 2: The log-posterior density in Figure 4.8 has a formula which can
be derived as follows. First, the joint posterior of all unknowns in the
model is
f(μ, δ, R | y) ∝ f(μ, δ, y, R)
∝ 1 × ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) × ∏_{i=1}^n exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }
182
Chapter 4: Computational Tools
= ∏_{i=1}^n π^(Rᵢ)(1−π)^(1−Rᵢ) exp{ −(1/(2σ²)) ( yᵢ − [μ + Rᵢδ] )² }.
∂l(μ,δ)/∂μ, ∂l(μ,δ)/∂δ, ∂²l(μ,δ)/∂μ², ∂²l(μ,δ)/∂δ² and ∂²l(μ,δ)/∂δ∂μ,
and could prove to be unstable. That is, the algorithm might fail to
converge if started from a point not very near the required solution.
183
Bayesian Methods for Statistical Analysis
j µj δj
0 10.000 1.000
1 22.505 1.696
2 22.566 3.882
3 21.905 6.811
4 21.139 8.729
5 20.611 9.501
6 20.322 9.732
7 20.181 9.774
8 20.118 9.764
9 20.093 9.746
10 20.085 9.732
11 20.083 9.725
12 20.083 9.720
13 20.083 9.719
14 20.084 9.718
15 20.084 9.718
16 20.084 9.718
17 20.084 9.718
18 20.084 9.718
19 20.084 9.718
20 20.084 9.718
184
Chapter 4: Computational Tools
# (a)
X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4)
ntrue=100; pitrue=1/3; mutrue=20; deltrue=10; sigtrue=3
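# The simulation of yvec (and the true R_i values) is not shown on this page.
# A sketch consistent with the stated model follows; any seed gives similar,
# but not identical, numbers to those quoted in the text (e.g. ybar = 23.16).
set.seed(123)
Rtrue=rbinom(ntrue,1,pitrue)
yvec=rnorm(ntrue, mutrue + deltrue*Rtrue, sigtrue)
c(mean(yvec), sum(Rtrue))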
hist(yvec,prob=T,breaks=seq(0,50,0.5),xlim=c(10,40),ylim=c(0,0.2),
xlab="y", main=" ")
185
Bayesian Methods for Statistical Analysis
yv=seq(0,50,0.01); lines(yv,dnorm(yv,mutrue,sigtrue),lty=2,lwd=2)
lines(yv,dnorm(yv,mutrue+deltrue, sigtrue),lty=2,lwd=2)
lines(yv, (1-pitrue)*dnorm(yv,mutrue,sigtrue)+
pitrue*dnorm(yv,mutrue+deltrue,sigtrue), lty=1,lwd=2)
legend(10,0.2,c("Components","Mixture"),lty=c(2,1),lwd=c(2,2))
# (b)
evalsfun= function(y=yvec, pii=pitrue, mu=mutrue,del=deltrue,sig=sigtrue){
# This function outputs (e1,e2,...,en)
term1vals=pii*dnorm(y,mu+del,sig)
term0vals=(1-pii)*dnorm(y,mu,sig)
term1vals/(term1vals+term0vals) }
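# The original definition of EMfun (used below) is not shown on this page.
# The following is an assumed sketch: the E-step computes the e_i values via
# evalsfun(), and the M-step jointly solves the two stationarity equations
# derived in part (b) for mu and delta.
EMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
  muv=mu; delv=del; ybar=mean(y); n=length(y)
  for(j in 1:J){
    evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)   # E-step
    sumyevals=sum(y*evals); sumevals=sum(evals)
    del=(sumyevals/sumevals - ybar)/(1 - sumevals/n)        # joint M-step solution
    mu=ybar - del*sumevals/n
    muv=c(muv,mu); delv=c(delv,del) }
  list(muv=muv, delv=delv) }
EMres=EMfun(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue)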
muhat=EMres$muv[21]; delhat=EMres$delv[21];
c(muhat,delhat) # 20.084 9.718
186
Chapter 4: Computational Tools
# (c)
CEMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
muv=mu; delv=del; ybar=mean(y); n=length(y)
for(j in 1:J){
evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)
sumyevals = sum(y*evals); sumevals=sum(evals)
mu=ybar-del*sumevals/n
del=sumyevals/sumevals - mu
muv=c(muv,mu); delv=c(delv,del)
}
list(muv=muv,delv=delv)
}
CEMres=CEMfun(J=20, mu=10, del=1,y=yvec,pii=pitrue,sig=sigtrue)
outmat2 = cbind(0:20, CEMres$muv, CEMres$delv)
print.matrix(outmat2)
# (d)
X11(w=8,h=9); par(mfrow=c(1,1))
logpostfun=function(mu=10,del=10,y=yvec,pii=pitrue,sig=sigtrue){
sum(log(pii*dnorm(y,mu+del,sig)+(1-pii)*dnorm(y,mu,sig))) }
mugrid=seq(0,35,0.5); delgrid=seq(0,30,0.5)
logpostmat=matrix(NA,nrow=length(mugrid),ncol=length(delgrid)) # log-posterior on the grid
for(i in 1:length(mugrid)) for(k in 1:length(delgrid))
  logpostmat[i,k]=logpostfun(mu=mugrid[i],del=delgrid[k])
dim(logpostmat) # 71 61
contour(mugrid,delgrid,logpostmat,nlevels=30,xlab="mu",ylab="delta")
points(10,1,pch=16,cex=1.2)
187
Bayesian Methods for Statistical Analysis
points(5,30,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=5, del=30,y=yvec,pii=pitrue,sig=sigtrue)
CEMres=CEMfun(J=50, mu=5, del=30, y=yvec,pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)
points(35,20,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=35, del=20,y=yvec,pii=pitrue,sig=sigtrue)
CEMres=CEMfun(J=50, mu=35, del=20, y=yvec,pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)
legend(21,30,c("EM","ECM"),lty=c(1,2),lwd=c(3,3))
ψ̂ = E(θ² | y) = ∫_0^1 θ² × 6θ⁵ dθ = 0.75.
But what if this integral did not have a simple analytical solution?
188
Chapter 4: Computational Tools
ψ̂ = ∫_0^1 ψ × 3ψ² dψ = 0.75.
If this strategy does not help, we may then consider using a numerical
integration technique.
Applying this method (see the R code below for details) yields 0.7558 as
an estimate of ψ̂. Repeating, but with the evaluations on the grid 0.01,
0.02, ...,1 yields 0.7500. Repeating again, but with evaluations on the grid
0.001, 0.002, ..., 1 yields 0.7500. It appears that a limit has been reached
and that using a finer grid would not result in any improvements to the
results of this numerical procedure.
189
Bayesian Methods for Statistical Analysis
gfun=function(t){ 6*t^7 }
tvec <- seq(0,1,0.1); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.755803
tvec <- seq(0,1,0.01); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75
tvec <- seq(0,1,0.001); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75
Suppose that X ~ N(μ, σ²) and Y = (X | X > c), where μ = 8, σ = 3
and c = 10. Find EY using numerical techniques and compare your answer
with the exact value,
μ + σ φ( (c−μ)/σ ) / [ 1 − Φ( (c−μ)/σ ) ],
which was derived analytically in Exercise 4.6.
One approach is to compute EY = ∫_c^∞ g(x) dx numerically,
where:  g(x) = x f(x) / P(X > c),  f(x) = (1/σ) φ( (x−μ)/σ ),
P(X > c) = 1 − Φ( (c−μ)/σ ).
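A quick numerical check along these lines (a sketch, not the book's own code) is:

mu=8; sig=3; c0=10
pXgtc = 1 - pnorm((c0-mu)/sig)                       # P(X > c)
EY.num = integrate(function(x) x*dnorm(x,mu,sig)/pXgtc, lower=c0, upper=Inf)$value
EY.exact = mu + sig*dnorm((c0-mu)/sig)/(1-pnorm((c0-mu)/sig))
c(EY.num, EY.exact)                                  # both approximately 11.8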
190
Chapter 4: Computational Tools
Use the integrate() and INTEG() functions in at least two different ways
so as to calculate the double integral
I = ∫_{x=0}^{1} ∫_{t=0}^{x³} t^t dt dx.
191
Bayesian Methods for Statistical Analysis
Using the integrate() function alone (and not the INTEG() function), the
integral can be worked out as follows:
integrate(function(x) {
sapply(x, function(x) {
integrate(function(t) {
sapply(t, function(t) t^t )
}, 0, x^3)$value }) }, 0, 1)
where
g(x) = ∫_{t=0}^{x³} h(t) dt
and
h(t) = t^t.
We will now use the integrate() function to obtain g ( x) for each value of
x in the grid 0, 0.01, 0.02, ..., 1. We will then apply the INTEG() function
to the resulting coordinates.
Figure 4.9 below displays the two functions h(t) and g(x). The value
g(0.8) = 0.381116 is the area under h(t) between 0 and 0.8³ = 0.512. The
total area under h(t) (from 0 to 1) is 0.78343.
192
Chapter 4: Computational Tools
One could also adapt the second approach above so as to calculate the
double integral using the INTEG() function only (without using the
integrate() function directly). This might be useful if the inner integral
g(x) = ∫_{t=0}^{x³} h(t) dt,  where h(t) = t^t,
Note: The integrate() function is called within the INTEG() function and
so is used at least indirectly in all of the approaches considered here.
integrate(function(x) {
sapply(x, function(x) {
integrate(function(t) {
sapply(t, function(t) t^t )
}, 0, x^3)$value }) }, 0, 1)
# 0.192723 with absolute error < 7.8e-10
193
Bayesian Methods for Statistical Analysis
hfun=function(t){ t^t } # h(t) = t^t (definition assumed; not shown on this page)
integrate(f=hfun,lower=0,upper=0.8^3)$value
# 0.381116 This is g(0.8), the area under h(t) between 0 and 0.8^3 = 0.512
integrate(f=hfun,lower=0,upper=1)$value
# 0.78343 This is the total area under h(t) (from 0 to 1)
The second of the next two exercises shows how the optim() function can
be used to specify a prior distribution.
194
Chapter 4: Computational Tools
Use the optim() function to ‘find’ the mode of each of the following:
(a) g(x) = x² e^(−5x),  x > 0  (mode = 2/5)
(b) g(x) = |x|^x e^(−(x−1)²) / (1 + |x|),  x ∈ ℝ  (the mode has no closed form)
(b) The function returns a value of 1.5047. (We presume that this is
correct; see below for a verification.)
Figure 4.10 illustrates these three solutions, with each mode being marked
by a dot and vertical line. Subplot (c) shows several examples of the
function g ( x, y ) in part (c) considered as a function of only x, with each
line defined by a fixed value of y on the grid 0, 0.5, 1, ...,4.5, 5.
195
Bayesian Methods for Statistical Analysis
# (a)
fun=function(x){ -x^2 * exp(-5*x) }
res0=optim(par=0.5,fn=fun)$par; res0 # 0.4
# Warning message:
# In optim(par = 0.5, fn = fun) :
# one-diml optimization by Nelder-Mead is unreliable:
# use "Brent" or optimize() directly
plot(seq(0,5,0.01), -fun(seq(0,5,0.01)),type="l",lwd=3,xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.02,"(a)",cex=2)
# (b)
fun=function(x){ -exp(-(x-1)^2) * abs(x)^x/(1+abs(x)) }
res0=optim(par=1,fn=fun)$par; res0 # 1.5047
plot(seq(-2,5,0.01), -fun(seq(-2,5,0.01)),type="l",lwd=3, xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.45,"(b)",cex=2)
196
Chapter 4: Computational Tools
# (c)
fun=function(v){ -v[2]^3 * exp( -v[2] * ( (v[1]-1)^2 + (v[1]-3)^2 ) ) }
fun2=function(x,y){ y^3 * exp( -y * ( (x-1)^2 + (x-3)^2 ) ) } # positive version of fun, used for plotting
res0=optim(par=c(2,2),fn=fun, lower = c(-Inf,0), upper = c(Inf,Inf),
method = "L-BFGS-B")$par; res0 # 2.0 1.5
plot(c(0.5,3.5),c(0,0.2), type="n",xlab="x",ylab="f(x,y)")
for(y in seq(0,5,0.5))
lines(seq(0,5,0.01), fun2(x=seq(0,5,0.01),y=y), lty=1)
abline(v=res0[1]); points(res0[1],fun2(res0[1],res0[2]), pch=16, cex=2);
lines(seq(0,5,0.01),fun2(x= seq(0,5,0.01), y=res0[2]),lty=1,lwd=3);
text(3,0.17,"(c)",cex=2)
We wish to find the values of η and τ which satisfy the two equations:
P(σ < a) = α/2 and P(σ < b) = 1 − α/2,
where a = 0.5, b = 1 and α = 0.05.
These two equations are together equivalent to each of the following five
pairs of equations:
P(σ² < a²) = α/2 and P(σ² < b²) = 1 − α/2
P(1/λ < a²) = α/2 and P(1/λ < b²) = 1 − α/2
P(1/a² < λ) = α/2 and P(1/b² < λ) = 1 − α/2
P(λ < 1/a²) = 1 − α/2 and P(λ < 1/b²) = α/2
F_G(η,τ)(1/a²) − (1 − α/2) = 0 and F_G(η,τ)(1/b²) − α/2 = 0.
197
Bayesian Methods for Statistical Analysis
We now focus on the last of these pairs of equations. Two obvious
ways to solve these equations are by trial and error and via the multivariate
Newton-Raphson algorithm, as illustrated earlier. But the solution can be
obtained more easily by using the optim() function to minimise
g(η, τ) = [ F_G(η,τ)(1/a²) − (1 − α/2) ]² + [ F_G(η,τ)(1/b²) − α/2 ]².
Note: Clearly, this function has a value of zero at the required values of
η and τ.
However, applying the optim() function again but starting at the previous
solution, namely η = 8.4764 and τ = 3.7679, yielded a ‘refined’
solution, η = 8.4748 and τ = 3.7654.
Discussion
The three densities are plotted in Figure 4.11 (in the stated order from top
to bottom). The vertical lines show the 0.025 and 0.975 quantiles of each
distribution. The formulae for the three densities are as follows:
f(λ) = f_G(η,τ)(λ) = τ^η λ^(η−1) e^(−τλ) / Γ(η),   λ > 0
198
Chapter 4: Computational Tools
f(σ²) = f_IG(η,τ)(σ²) = f(λ) |dλ/d(σ²)| = f_G(η,τ)( λ = σ^(−2) ) × | −(σ²)^(−2) |,
   where λ = (σ²)^(−1)
= [ τ^η (1/σ²)^(η−1) e^(−τ(1/σ²)) / Γ(η) ] (σ²)^(−2),   σ² > 0
f(σ) = f(λ) |dλ/dσ| = f_G(η,τ)( λ = σ^(−2) ) × | −2σ^(−3) |,   where λ = σ^(−2)
= [ τ^η (1/σ²)^(η−1) e^(−τ(1/σ²)) / Γ(η) ] 2σ^(−3),   σ > 0.
As a check on the last of these three densities, the integrate() function was
used to show that the area under that density is exactly 1, and that the areas
underneath it to the left of 0.5 and to the right of 1 are both exactly 0.025.
199
Bayesian Methods for Statistical Analysis
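The definition of the objective function fun passed to optim() below is not reproduced in this excerpt; a sketch consistent with g(η, τ) as given above (with a = 0.5, b = 1 and α = 0.05) is:

a=0.5; b=1; alpha=0.05
fun=function(v){ eta=v[1]; tau=v[2]
  # sum of squared deviations from the two target quantile conditions
  ( pgamma(1/a^2, eta, tau) - (1-alpha/2) )^2 +
  ( pgamma(1/b^2, eta, tau) - alpha/2 )^2 }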
res0=optim(par=c(0.2,6),fn=fun)$par
res0 # 8.4764 3.7679
pgamma(c(1/b^2,1/a^2),res0[1],res0[2]) # 0.025048 0.975104 Close
par(mfrow=c(3,1)); tv=seq(0,10,0.01)
200
CHAPTER 5
Monte Carlo Basics
5.1 Introduction
The term Monte Carlo (MC) methods refers to a broad collection of tools
that are useful for approximating quantities based on artificially generated
random samples. These include Monte Carlo integration (for
estimating an integral using such a sample), the inversion technique (for
generating the required sample), and Markov chain Monte Carlo methods
(an advanced topic in Chapter 6). In principle, the approximation can be
made as good as required simply by making the Monte Carlo sample size
sufficiently large. As will be seen (further down), Monte Carlo methods
are a very useful tool in Bayesian inference.
This method will be faster and more accurate; but it will also require at
least some mathematical work to identify exactly what the parameters of
each drop are and what configuration of those parameters corresponds to
the needle crossing a line (again, this is done in one of the exercises
below).
201
Bayesian Methods for Statistical Analysis
In this chapter, we will first discuss Monte Carlo methods and their
usefulness under the assumption that we have available or can generate
the required random samples. As we will see in the exercises and their
solutions, such samples can often be obtained very easily using inbuilt R
functions, e.g. runif() and rnorm().
Also, as part of the structure of the present chapter, we will first discuss
Monte Carlo methods and random number generation in a fully general
setting. Only after we have finished our treatment of these two topics (to
a certain level at least) will we discuss their application to Bayesian
inference. Hopefully this format will minimise any confusion.
μ = Ex = ∫ x f(x) dx
(or μ = Ex = Σ_x x f(x),
or μ = Ex = ∫ x dF(x) ).
Also suppose, however, that we are able to generate (or obtain) a random
sample from the distribution in question. Denote this sample as
x₁, …, x_J ~ iid f(x)
(or x₁, …, x_J ~ iid F(x) ).
202
Chapter 5: Monte Carlo Basics
(b) Repeat (a) but with MC sample sizes of 1,000 and 10,000, and discuss
the results.
203
Bayesian Methods for Statistical Analysis
(a) Applying the above procedure (see the R code below) we estimate µ
by x = 1.5170. The Monte Carlo 95% confidence interval for µ is
CI= ( x ± z0.025 s / J ) = (1.3539, 1.6800).
We note that x is ‘close’ to the true value, µ = 1.5, and the CI contains
that true value.
(b) Repeating (a) with J = 1,000 we obtain the point estimate 1.5199 and
the interval estimate (1.4658, 1.5740).
Repeating (a) with J = 10,000 we obtain the point estimate 1.4942 and the
interval estimate (1.4773, 1.5110).
204
Chapter 5: Monte Carlo Basics
ψ = Ey = ∫ y f(y) dy = E g(x) = ∫ g(x) f(x) dx.
Then we simply calculate y j = g ( x j ) for each j = 1,..., J . The result will
be a random sample y1 ,..., yJ ~ iid f ( y ) to which the method of Monte
Carlo can then be applied in the usual way. Thus, an estimate of ψ is
ȳ = (1/J) Σ_{j=1}^J yⱼ (the sample mean of the y-values),
and a 1 − α CI for ψ is
ȳ ± z_{α/2} s_y/√J,
205
Bayesian Methods for Statistical Analysis
where s_y² = (1/(J−1)) Σ_{j=1}^J (yⱼ − ȳ)² (the sample variance of the y-values).
Note 1: As we will see later, it is often the case that we are able to sample
from a distribution without knowing—or being able to derive—the
exact form of its density function.
Present your results graphically, and wherever possible show the true
values of the quantities being estimated. Then repeat everything but using
a Monte Carlo sample size of J = 10,000.
The required graphs are shown in Figures 5.1 to 5.4. See the R code below
for more details.
206
Chapter 5: Monte Carlo Basics
207
Bayesian Methods for Statistical Analysis
208
Chapter 5: Monte Carlo Basics
hist(xv,prob=T,breaks=seq(0,7,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x",
main=""); lines(xden,lty=2,lwd=2)
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2)
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2)
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(yv,prob=T,breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y",
main=""); lines(yden,lty=2,lwd=2)
abline(v= c(ybar, yci, ycdr), lty=2, lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(xv,prob=T,breaks=seq(0,9,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x",
main=""); lines(xden,lty=2,lwd=2)
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2)
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2)
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
hist(yv,prob=T,breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y",
main="")
lines(yden,lty=2,lwd=2); abline(v= c(ybar, yci, ycdr), lty=2, lwd=2)
legend(4,0.6,c("MC estimates","True values"),lty=c(2,1),lwd=c(2,2))
209
Bayesian Methods for Statistical Analysis
210
Chapter 5: Monte Carlo Basics
This suggests that we sample x₁, …, x_J ~ iid h(x) (as before) and apply
MC estimation to the means of w(x) and u(x), respectively (each with
respect to the distribution defined by density h(x)) so as to obtain the
estimate
ψ̂ = w̄/ū = [ (1/J) Σ_{j=1}^J wⱼ ] / [ (1/J) Σ_{j=1}^J uⱼ ] = ( w₁ + … + w_J ) / ( u₁ + … + u_J ),
where wⱼ = w(xⱼ) and uⱼ = u(xⱼ).

f(x) ∝ (1/(x+1)) e^(−x),   x > 0.
Here, k(x) = (1/(x+1)) e^(−x), and it is convenient to use h(x) = e^(−x), x > 0
(the standard exponential density, or Gamma(1,1) density). Then,
μ = Ex = ∫ x f(x) dx = ∫ x k(x) dx / ∫ k(x) dx
= ∫ x [k(x)/h(x)] h(x) dx / ∫ [k(x)/h(x)] h(x) dx
= ∫ [x/(x+1)] h(x) dx / ∫ [1/(x+1)] h(x) dx.
So a MC estimate of μ is
μ̂ = [ (1/J) Σ_{j=1}^J xⱼ/(xⱼ+1) ] / [ (1/J) Σ_{j=1}^J 1/(xⱼ+1) ],
where x₁, …, x_J ~ iid G(1,1).
211
Bayesian Methods for Statistical Analysis
0.40345
Implementing this with J = 100,000, we get µˆ = = 0.67631.
0.59655
Note 1: For interest we use numerical techniques to get the exact answer,
µ = 0.67687.
Thus the relative error is –0.084%. Figure 5.5 illustrates.
options(digits=10);
kfun=function(x){ exp(-x)/(x+1) }
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.5963473624
ffun=function(x){ (1/ 0.5963473624)*exp(-x)/(x+1) }
integrate(f=ffun,lower=0,upper=Inf)$value; # 0.9999999999
xffun= function(x){ x*(1/0.5963474)*exp(-x)/(x+1) }
mu= integrate(f=xffun,lower=0,upper=Inf)$value; mu # 0.6768749849
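The Monte Carlo computation itself is not reproduced above; a sketch (with an arbitrary seed, so the figures will differ slightly from the 0.40345 and 0.59655 quoted) is:

J=100000; set.seed(123)
xv=rgamma(J,1,1)                        # x_1,...,x_J ~ iid G(1,1)
wbar=mean(xv/(xv+1)); ubar=mean(1/(xv+1))
est=wbar/ubar                           # ratio estimate of mu
c(wbar, ubar, est)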
212
Chapter 5: Monte Carlo Basics
plot(c(0,3),c(0,2),type="n",xlab="x",ylab="density"); xvec=seq(0,5,0.01);
lines(xvec,dgamma(xvec,1,1),lty=1,lwd=3)
lines(xvec,xvec*dgamma(xvec,1,1),lty=1,lwd=1)
lines(xvec,ffun(xvec),lty=2,lwd=3); lines(xvec,xvec*ffun(xvec),lty=2,lwd=1)
points(c(1,mu,est),c(0,0,0),pch=c(16,4,1),lwd=c(2,2,2),cex=c(1.2,1.2,1.2))
legend(1.7,2,c( "f(x) = (1/c)*exp(-x)/(x+1)", "h(x) = exp(-x)" ),
lty=c(2,1), lwd=c(3,3))
legend(1.7,1.3,c( "x*f(x)", "x*h(x)" ), lty=c(2,1), lwd=c(1,1))
legend(0.5,2,c("E(x) = area under x*f(x)", "E(x) = area under x*h(x)",
"MC estimate of E(x)"), pch=c(4,16,1),pt.lwd=c(2,2,2),pt.cex=c(1.2,1.2,1.2))
213
Bayesian Methods for Statistical Analysis
214
Chapter 5: Monte Carlo Basics
Exercise 5.4
215
Bayesian Methods for Statistical Analysis
hist(rv,prob=T, breaks=seq(-1,1.8,0.1),xlim=c(-1,1.6),ylim=c(0,1.3),xlab="r",
main=""); lines(rden,lty=1,lwd=2); abline(v= c(rbar, rci, rcdr), lty=2, lwd=2)
So the MC SE is s/√J = (1/√J) √[ (J/(J−1)) x̄(1−x̄) ] = √[ x̄(1−x̄)/(J−1) ].
216
Chapter 5: Monte Carlo Basics
Note 1: The above theory is really nothing other than the usual classical
theory for estimating a binomial proportion. Thus, there are many other
CIs that could be substituted, (e.g. the Wilson CI whose coverage is
closer to 1 − α , and the Clopper-Pearson CI whose coverage is always
guaranteed to be at least 1 − α but which is typically wider).
The procedure for the second example is similar, except that it involves
sampling (x₁, y₁) ~ f(x, y) and determining r₁ = I(x₁ < y₁), etc.
217
Bayesian Methods for Statistical Analysis
The result will be a sample r1 ,..., rM ~ iid Bern( p ) , where p is the true
coverage probability, which can then be estimated via MC methods in
the usual way.
Use MC to estimate p = P( x/(x+1) > 0.3 e^x ), where x ~ Gamma(3, 2).

Here rⱼ = I( xⱼ/(xⱼ+1) > 0.3 e^(xⱼ) ) with x₁, …, x_J ~ iid Gamma(3, 2).
Thereby we obtain an estimate of p equal to p̂ = (1/J) Σ_{j=1}^J rⱼ = 0.2117
and a 95% CI for p equal to p̂ ± 1.96 √[ p̂(1−p̂)/J ] = (0.2060, 0.2173).
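A minimal sketch of this computation (not the book's own code; the value J = 20,000 is an assumption, chosen to be consistent with the width of the CI quoted above) is:

J=20000; set.seed(1)
xv=rgamma(J,3,2)
rv=as.numeric( xv/(xv+1) > 0.3*exp(xv) )
phat=mean(rv); ci=phat+c(-1,1)*1.96*sqrt(phat*(1-phat)/J)
c(phat,ci)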
Note 1: We may also view p as p = P(y > 0.3), where y = [x/(x+1)] e^(−x)
(for example). In that case, we sample x₁, …, x_J ~ iid G(3, 2), calculate
(for example). In that case, we sample x1 ,..., x J ~ iid G (3, 2) , calculate
218
Chapter 5: Monte Carlo Basics
219
Bayesian Methods for Statistical Analysis
(a) Analytically derive p, the probability that the needle crosses a line.
(b) Now forget that you know p. Estimate p using Monte Carlo methods
on a computer and a sample size of 1,000. Also provide a 95% confidence
interval for p. Then repeat with a sample size of 10,000 and discuss.
It follows that
p = P(C) = P(X < sin Y)
= ∬_{x < sin y} f(x, y) dx dy = ∫_{y=0}^{π/2} ∫_{x=0}^{sin y} (2/π) dx dy = (2/π) ∫_0^{π/2} sin y dy
= (2/π) [ −cos y ]_0^{π/2} = −(2/π) ( cos(π/2) − cos 0 )
= −(2/π)(0 − 1) = 2/π = 0.63662.
220
Chapter 5: Monte Carlo Basics
221
Bayesian Methods for Statistical Analysis
Note 1: Another way to express the above working is to first note that
P(C | Y) = P(X < sin Y | Y) = sin Y. It follows that
p = P(C) = E P(C | Y) = E(sin Y) = ∫_0^{π/2} (sin y)(2/π) dy = 2/π,
as before.
Note 2: It can be shown that if the length of the needle is r times the
distance between lines, then the probability that the needle will cross a
line is given by the formula
p = 2r/π for r ≤ 1,  and  p = 1 − (2/π) [ √(r² − 1) − r + sin^(−1)(1/r) ] for r > 1.
(b) For this part, we will make use of the analysis in (a) whereby
C = { (x, y): x < sin y },
and where x ~ U(0, 1), y ~ U(0, π/2), and X ⊥ Y.
Note: We suppose that these facts are understood but that the integration
required to then proceed on from these facts to the final answer (as in
(a)) is too difficult.
We now sample x₁, …, x_J ~ iid U(0, 1) and y₁, …, y_J ~ iid U(0, π/2) (all
independently of one another). Next, we obtain the indicators defined by
rⱼ = I(xⱼ < sin yⱼ) = 1 if xⱼ < sin yⱼ, and 0 otherwise.
222
Chapter 5: Monte Carlo Basics
The MC estimate of p is p̂ = r̄ = (1/J) Σ_{j=1}^J rⱼ = r_T/J,
and a 95% CI for p is CI = p̂ ± z_{α/2} √[ p̂(1−p̂)/J ].
J
We see that increasing the MC sample size (from 1,000 to 10,000) has
reduced the width of the MC CI from 0.060 to 0.019. Both intervals
contain the true value, namely 2 / π = 0.6366.
# (a)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(seq(0,pi/2,0.01),sin(seq(0,pi/2,0.01)), type="l",lwd=3,xlab="y", ylab="x")
abline(v=c(0,pi/2),lty=3); abline(h=c(0,1),lty=3)
text(0.2,0.4,"x = sin(y)"); text(1,0.4,"C"); text(0.35,0.8,"Complement of C")
text(1.52,0.06,"pi/2")
# (b)
J=1000; set.seed(213); xv=runif(J,0,1); yv=runif(J,0,pi/2); rv=rep(0,J)
options(digits=4); for(j in 1:J) if(xv[j]<sin(yv[j])) rv[j]=1
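# The remaining lines of this listing are not shown on this page; they would
# compute the estimate and CI along the lines of (for example):
phat=mean(rv); ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J)
c(phat,ci)   # compare with 2/pi = 0.6366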
223
Bayesian Methods for Statistical Analysis
(b) Repeat (a) but with J = 200, 500, 1,000, 10,000 and 100,000,
respectively. Report the widths of the resulting CIs and, for each CI, state
whether it contains µ . Discuss any patterns that you see.
(c) Repeat (a) M = 100 times and report the proportion of the resulting M
95% MC CIs which contain the true value of the mean. (In each case use
J = 100.) Hence calculate a 95% CI for p, the true coverage probability of
the 95% MC CI for µ based on a MC sample of size J = 100 from the
Gamma(3,2) distribution.
(d) Repeat (c), but with M = 200, 500, 1,000 and 10,000, respectively.
Discuss any patterns that you see.
224
Chapter 5: Monte Carlo Basics
A 95% CI for p is 0.93 ± 1.96 √[ 0.93(1 − 0.93)/100 ] = (0.880, 0.980).
This is consistent with the MC 95% CI for µ having coverage 95%.
(d) Repeating (a) M = 200 times leads to p̂ = 94.5% of the 200 CIs
containing 1.5, with a 95% CI for p of
0.945 ± 1.96 √[ 0.945(1 − 0.945)/200 ] = (0.913, 0.977).
The widths of all five CIs for p are: 0.100, 0.063, 0.041, 0.027 and 0.009.
We see that the CI for p becomes narrower as M increases. Also, the
proportion of CIs containing 1.5 converges towards 95% as M increases.
The convergence does not seem to be uniform. This is because of Monte
Carlo error. If we repeated the experiment again, we might find a slightly
different pattern.
Each of the CIs for p is consistent with p = 0.95, except the one with
M = 10,000, which is the most reliable. In that case the CI for p is
225
Bayesian Methods for Statistical Analysis
(0.940, 0.949), which is entirely below 0.95. This suggests that the true
coverage probability of the 95% MC CI for µ is slightly less than 95%.
# (a)
options(digits=5); J = 100; set.seed(221); xv=rgamma(J,3,2)
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(xbar,ci) # 1.5170 1.3539 1.6800
# (b)
Jvec=c(100,200,500,1000,10000,100000); K = length(Jvec)
xbarvec=rep(NA,K); LBvec= rep(NA,K); UBvec= rep(NA,K);
set.seed(221);
for(k in 1:K){ J=Jvec[k]; xv=rgamma(J,3,2); xbar=mean(xv)
ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
xbarvec[k]=xbar; LBvec[k]=ci[1]; UBvec[k]=ci[2]
}
Wvec=UBvec-LBvec
print(rbind(Jvec, xbarvec, LBvec,UBvec, Wvec),digits=4)
# (c)
J=100; M=100; ct=0; set.seed(442); for(m in 1:M){
xv=rgamma(J,3,2)
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1 }
p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M) # CI for p uses M repetitions (here M = J = 100)
c(ct,p,ci) # 93.00000 0.93000 0.87999 0.98001
226
Chapter 5: Monte Carlo Basics
# (d)
J=100; Mvec=c(200,500,1000,10000); set.seed(651)
for(M in Mvec){ ct=0
for(m in 1:M){
xv=rgamma(J,3,2); xbar=mean(xv)
ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1
}
p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M)
print(c(M,p,ci,ci[2]-ci[1]),digits=3) }
So we will next discuss some basic techniques that can be used to generate
the required Monte Carlo sample from a given distribution. More
advanced techniques will be treated later. We will first treat the discrete
case, which is the simplest, and then the continuous case. It will be
assumed throughout that we can at least sample easily from the standard
uniform distribution, i.e. that we can readily generate u ~ U (0,1) .
227
Bayesian Methods for Statistical Analysis
Note 1: We see that this procedure will work also in the case where K is
infinite. In that case a practical alternative is to redefine K as a value k
for which Fk is very close to 1 (e.g. 0.9999) and then approximate f ( x )
by zero for all x > xK .
Show that the above method works when applied to generating a value x
from the Bin(2,1/2) distribution, i.e. that it returns x = 0, 1 and 2 with
probabilities 1/4, 1/2 and 1/4, respectively.
228
Chapter 5: Monte Carlo Basics
Using R we calculate k(x) = x³ e^(−x) / (1 + x), x = 1, 3, 5, …, 41 (here k stands for
kernel), noting that the last two values of k(x) are tiny (9.455201e-14 and
1.454999e-14).
229
Bayesian Methods for Statistical Analysis
Note: We could also change fvec to kvec here, where kvec is a vector
with the values k (1), k (3),..., k (41) ; both possibilities will work since
sample() will automatically normalise the values in its parameter ‘prob’.
230
Chapter 5: Monte Carlo Basics
First derive the quantile function of X, denoted F_X^(−1)(p) (0 < p < 1).
(This can be done by setting F_X(x) to p and solving for x.)
231
Bayesian Methods for Statistical Analysis
F(x) = ∫_0^x e^(−t) dt = 1 − e^(−x), x > 0. The quantile function here is the solution
of 1 − e^(−x) = p, namely F^(−1)(p) = −log(1 − p).
This results in the required sample x1 ,..., x J ~ iid G (1,1) . Using this
sample, the MC estimate of µ = EX is 0.9967, and a 95% CI for µ is
(0.9322, 1.0613). We see that the CI contains the true value being
estimated (i.e. 1).
F(x) = ∫_0^x t e^(−t) dt = [ −t e^(−t) ]_0^x + ∫_0^x e^(−t) dt
= −x e^(−x) + [ −e^(−t) ]_0^x = −x e^(−x) − e^(−x) + 1 = 1 − (x + 1) e^(−x).
232
Chapter 5: Monte Carlo Basics
However, for any p we can obtain that root using the Newton-Raphson
algorithm by iterating
x_{j+1} = xⱼ − g(xⱼ)/g′(xⱼ),   where g′(x) = F′(x) − 0 = f(x) = x e^(−x)
= xⱼ − [ 1 − (xⱼ + 1) e^(−xⱼ) − p ] / ( xⱼ e^(−xⱼ) ).
options(digits=5)
# (a)
-log(1-0.371) # 0.463624
J=1000; set.seed(221); uv=runif(J,0,1)
xv=-log(1-uv) # Generate a random sample of size 1000 from the G(1,1) dsn
est=mean(xv); std=sd(xv); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)
c(est,ci) # 0.99673 0.93216 1.06130
233
Bayesian Methods for Statistical Analysis
# (b)
u=0.371; x=1; xv=x; for(j in 1:7) { x=x-(1-(x+1)*exp(-x)-u)/(x*exp(-x)); xv=c(xv,x) }
xv # 1.0000 1.2902 1.2939 1.2939 1.2939 1.2939 1.2939 1.2939
pgamma(x,2,1) # 0.371 Just checking that F(1.293860) = 0.371
pgamma(1.2939,2,1) # 0.37101
It can be shown that z1 , z2 ~ iid N (0,1) . If we only need one value from
the standard normal distribution then we may arbitrarily discard z2 and
return only z1 .
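The transformation being referred to is not reproduced in this excerpt; assuming it is the Box-Muller transform, a minimal sketch is:

u1=runif(1); u2=runif(1)                 # two independent U(0,1) draws
z1=sqrt(-2*log(u1))*cos(2*pi*u2)         # z1, z2 ~ iid N(0,1)
z2=sqrt(-2*log(u1))*sin(2*pi*u2)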
234
Chapter 5: Monte Carlo Basics
So U has pdf f(u) = F′(u) = (1/2) e^u for u < 0, and 0 − (1/2) e^(−u)(−1) = (1/2) e^(−u) for u ≥ 0.
That is, f(u) = (1/2) e^(−|u|), −∞ < u < ∞, which is the same as the pdf of X.
235
Bayesian Methods for Statistical Analysis
Suppose we want to sample x ~ f(x) where f(x) = x for 0 < x < 1 and
f(x) = 2 − x for 1 < x < 2.
Sample the two random variables r ~ Bern(0.5) and y ~ Beta(2, 1). Then
calculate x = ry + (1 − r)(2 − y). This way, there is a 50% chance that x
will equal y, whose pdf is f(y) = 2y, 0 < y < 1, and a 50% chance that x
will equal z = 2 − y, whose pdf is f(z) = 2(2 − z), 1 < z < 2.
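A quick empirical check of this composition method (a sketch, not taken from the book) is:

J=100000; set.seed(1)
r=rbinom(J,1,0.5); y=rbeta(J,2,1)
x=r*y+(1-r)*(2-y)
hist(x,prob=TRUE,breaks=50)   # should resemble the triangular density on (0,2)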
In such cases, one convenient and easy way to obtain a value from the
distribution of interest may be via rejection sampling (also known as the
rejection method or the acceptance-rejection method). This method works
as follows.
236
Chapter 5: Monte Carlo Basics
The idea here is that f(x) lies entirely beneath ch(x), except that it may touch ch(x) at one or more points. Then p(x), which is called the acceptance probability, appropriately lies between 0 and 1 (inclusive).
Figure 5.10 illustrates this setup. The rejection algorithm is as follows:
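In outline, a minimal R sketch (illustrative only; here f denotes the target density, h a proposal density which can be sampled via rh(), and c a constant such that f(x) ≤ ch(x) for all x):

REJ <- function(J, f, h, rh, c){
  # Generates J values from f by rejection sampling: propose from h and
  # accept with probability p(x) = f(x)/(c*h(x)).
  xv <- rep(NA, J)
  for(j in 1:J){
    repeat{
      x <- rh(1)                              # propose a candidate from h
      if(runif(1) <= f(x)/(c*h(x))) break     # accept with probability p(x)
    }
    xv[j] <- x
  }
  xv }
# e.g., for the two beta densities plotted in the code below:
# xv <- REJ(1000, f=function(x) dbeta(x,4,8), h=function(x) dbeta(x,2,2),
#           rh=function(n) rbeta(n,2,2), c=2.5)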
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(c(0,1), c(0,6),type="n",xlab="x",ylab="")
xv=seq(0.001,0.999,0.01); hxv=dbeta(xv,2,2); lines(xv,hxv,lty=2,lwd=3)
kfun=function(x){ dbeta(x,4,8) }
# We could specify any positive function here (*)
k0=integrate(f=kfun,lower=0,upper=1)$value
# This calculates the normalising constant
fxv=kfun(xv)/k0; # This ensures f(x) as defined at (*) is a proper density
lines(xv,fxv,lty=1,lwd=3)
c=max(fxv/hxv); c # 2.4472
lines(xv,c*hxv,lty=3,lwd=3)
legend(0,6,c("f(x)","h(x)","c*h(x)"),lty=c(1,2,3),lwd=c(3,3,3))
text(0.07,3,"c = 2.45")
Here: c = max_x { f(x)/g(x) } = (1/2)/(1/3) = 3/2, and p(x) = f(x)/{c g(x)} = 1/2 for x = 0, 2 and p(x) = 1 for x = 1.
We see that about 2/3 of all the proposed values will be accepted, and of
these about 25% will be 0, 50% will be 1, and 25% will be 2. About 1/3
of the candidates will be rejected, about half of these being 0 and half
being 2. The overall acceptance rate is 1/c = 1/(3/2) = 2/3, and the wastage
is 1 − 1 / c = 1/3. On average, c = 1.5 candidates will have to be proposed
until an acceptance. Thus, generation of 1,000 Bin(2,1/2) values (say) will
require about 1,500 candidates.
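As a numerical check, a minimal R sketch of this scheme (illustrative only):

set.seed(42)
fprob <- c(0.25, 0.5, 0.25)                # target: Bin(2,1/2) probabilities
J <- 10000; xv <- rep(NA, J); ncand <- 0
for(j in 1:J){
  repeat{
    ncand <- ncand + 1
    x <- sample(0:2, 1)                    # propose from DU(0,1,2), i.e. g(x) = 1/3
    p <- fprob[x + 1]/((3/2)*(1/3))        # acceptance probability: 1/2, 1, 1/2
    if(runif(1) <= p) break
  }
  xv[j] <- x
}
table(xv)/J       # approx. 0.25, 0.50, 0.25
J/ncand           # overall acceptance rate, approx. 2/3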
First, denote the Monte Carlo sample as θ_1, ..., θ_J ~ iid f(θ|x). Then the MC estimate of the posterior mean of θ, namely
θ̂ = E(θ|x) = ∫ θ f(θ|x) dθ,
is
θ̄ = (1/J) Σ_{j=1}^J θ_j (the MC sample mean),
and a 1 − α CI for θ̂ is
θ̄ ± z_{α/2} s_θ / √J,
where
s_θ^2 = {1/(J − 1)} Σ_{j=1}^J (θ_j − θ̄)^2.
Further, when the posterior density f (θ | x) does not have a closed form
expression (as is often the case), it can be estimated by smoothing a
probability histogram of θ1 ,..., θ J .
Once an estimate of the posterior density has been obtained, the mode of
that estimate defines the MC estimate of the posterior mode.
One may then apply any of the ideas above, just as before. For example, the posterior mean of ψ, namely
ψ̂ = E(ψ|x) = ∫ ψ f(ψ|x) dψ = ∫ g(θ) f(θ|x) dθ,
can be estimated by its MC estimate,
ψ̄ = (1/J) Σ_{j=1}^J ψ_j,
and a 1 − α CI for ψ̂ is
ψ̄ ± z_{α/2} s_ψ / √J,
where
s_ψ^2 = {1/(J − 1)} Σ_{j=1}^J (ψ_j − ψ̄)^2.
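To illustrate these formulas, here is a minimal R sketch (the Beta(8,4) posterior and the logit transformation g are chosen arbitrarily for illustration):

set.seed(1)
J <- 10000
theta <- rbeta(J, 8, 4)               # MC sample from the (illustrative) posterior
psi <- log(theta/(1 - theta))         # psi_j = g(theta_j)
psibar <- mean(psi); spsi <- sd(psi)
ci <- psibar + c(-1,1)*qnorm(0.975)*spsi/sqrt(J)   # 95% CI for the posterior mean of psi
c(psibar, ci)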
Suppose we observe the data vector y = ( y1 ,..., yn ) = (2.1, 3.2, 5.2, 1.7).
(c) Use MC methods to estimate the signal to noise ratio (SNR), defined as γ = μ/σ = μ√λ. Illustrate your inferences with a suitable graph.
So, again by the method of composition, but this time using the identity
f(μ, λ | y) = f(μ | y) f(λ | y, μ),
we make use of the sample already generated in (a) and sample
λ_j ~ Gamma( n/2, (n/2) s_{μ_j}^2 ), where s_μ^2 = (1/n) Σ_i (y_i − μ)^2,
for each j = 1, ..., J. The result is (μ_1, λ_1), ..., (μ_J, λ_J) ~ iid f(μ, λ | y), and thereby λ_1, ..., λ_J ~ iid f(λ | y) (after discarding all of the μ_j values).
λ̂ = 1/s^2 = 0.4071
95% CPDR = ( F^{-1}(0.025), F^{-1}(0.975) ) = (0.0293, 1.2684),
where F^{-1} denotes the quantile function of the G( (n−1)/2, {(n−1)/2}s^2 ) distribution.
We see that the true posterior mean is contained in the 95% MC CI for
that mean. Figure 5.12 illustrates these Monte Carlo and ‘exact’ inferences.
(c) Using the values sampled in (a) and (b), we now calculate γ_j = μ_j √λ_j for each j = 1, ..., J, and hence obtain a MC sample γ_1, ..., γ_J ~ iid f(γ | y), which can then be used to perform MC inference on γ. Implementing this strategy, we estimate γ's posterior mean by 1.800, with (1.745, 1.854) as a 95% CI for that mean, and we estimate γ's 95% CPDR as (0.228, 3.543).
Figure 5.13 illustrates these Monte Carlo estimates. Also shown are:
• the exact posterior mean of γ, which is γ̂ = E(γ | y) = 1.793
• the exact 95% CPDR for γ, which is (0.0733, 3.5952)
• the exact posterior density of γ
• the MLE of γ, which is ȳ/s = 3.05/1.567 = 1.946.
See the Note and R Code below for details of these calculations.
This follows from the uninformative normal-normal model, i.e. from the fact that
(μ | y, λ) ~ N( ȳ, 1/(nλ) ),
so that (γ | y, λ) = (μ√λ | y, λ) ~ N( ȳ√λ, 1/n ), and hence
f(γ | y) = E{ f(γ | y, λ) | y } = ∫_0^∞ f(γ | y, λ) f(λ | y) dλ,
where:
f(γ | y, λ) = f_{N(ȳ√λ, 1/n)}(γ) = √(n/(2π)) exp{ −(n/2)(γ − ȳ√λ)^2 }
f(λ | y) = f_{G((n−1)/2, {(n−1)/2}s^2)}(λ) = [ {(n−1)s^2/2}^{(n−1)/2} λ^{(n−1)/2 − 1} e^{−{(n−1)/2}s^2 λ} ] / Γ((n−1)/2), λ > 0.
The exact 95% CPDR for γ may be obtained by using the optim() function to minimise
g(L, U) = { F(U | y) − F(L | y) − 0.95 }^2 + { f(U | y) − f(L | y) }^2
= { ∫_L^U ∫_0^∞ f(γ | y, λ) f(λ | y) dλ dγ − 0.95 }^2
+ { ∫_0^∞ f(U | y, λ) f(λ | y) dλ − ∫_0^∞ f(L | y, λ) f(λ | y) dλ }^2,
with the result being (L, U) = (0.0733, 3.5952).
# (a)
y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s # 1.567
J=1000; set.seed(144); options(digits=4)
wv=rt(J,n-1); muv=ybar+s*wv/sqrt(n)
mubar=mean(muv); muci=mubar + c(-1,1)*qnorm(0.975)*sd(muv)/sqrt(J)
mucpdr=quantile(muv,c(0.025,0.975))
c(mubar,muci,mucpdr) # 3.0770 3.0012 3.1528 0.6848 5.5069
muhat=ybar; mucpdrtrue= ybar+(s/sqrt(n))*qt(c(0.025,0.975),n-1)
c(muhat,mucpdrtrue) # 3.050 0.556 5.544
X11(w=8,h=5); par(mfrow=c(1,1))
hist(muv,prob=T,xlab="mu",xlim=c(-2,7.5), ylim=c(0,0.5),main="",
breaks=seq(-20,20,0.25))
muvec=seq(-20,20,0.01);
postvec=dt( (muvec-ybar)/(s/sqrt(n)) , n-1 ) / (s/sqrt(n))
lines(muvec,postvec, lty=1,lwd=3)
lines(density(muv),lty=2,lwd=3)
abline(v=c(mubar,muci,mucpdr),lty=2,lwd=3)
abline(v=c(ybar, mucpdrtrue) , lty=1,lwd=3)
legend(-2,0.5,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
# (b)
lamv=rep(NA,J); set.seed(332)
for(j in 1:J) lamv[j] = rgamma(1,n/2,(n/2)*mean((y-muv[j])^2))
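# Sketch (assumed): summary quantities used in the abline() calls below,
# computed in the same way as for mu in part (a)
lambar=mean(lamv); lamci=lambar+c(-1,1)*qnorm(0.975)*sd(lamv)/sqrt(J)
lamcpdr=quantile(lamv,c(0.025,0.975))
lamcpdrtrue=qgamma(c(0.025,0.975),(n-1)/2,((n-1)/2)*s^2)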
hist(lamv,prob=T,xlab="lam",xlim=c(0,2.5), ylim=c(0,2),main="",
breaks=seq(0,3,0.05))
lamvec=seq(0,3,0.01) ; lampostvec= dgamma(lamvec,(n-1)/2,((n-1)/2)*s^2)
lines(lamvec, lampostvec, lty=1,lwd=3)
lines(density(lamv),lty=2,lwd=3)
abline(v=c(lambar, lamci, lamcpdr),lty=2,lwd=3)
abline(v=c(1/s^2, lamcpdrtrue), lty=1,lwd=3)
legend(1.5,2,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
# (c)
gamv=muv*sqrt(lamv)
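# Sketch (assumed): summary quantities used in the plotting code below
gambar=mean(gamv); gamci=gambar+c(-1,1)*qnorm(0.975)*sd(gamv)/sqrt(J)
gamcpdr=quantile(gamv,c(0.025,0.975)); mle=ybar/s
c(gambar,gamci,gamcpdr)  # cf. 1.800, (1.745, 1.854) and (0.228, 3.543) in the text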
gamhat=(ybar/s)*gamma(0.5+(n-1)/2)/(sqrt((n-1)/2)*gamma((n-1)/2))
print(c(ybar,s,gamhat),digits=8) # 3.0500000 1.5673757 1.7928178
intfun=function(lam,gam, ybar=3.05,s=1.5673757,n=4){
dnorm(gam,ybar*sqrt(lam),1/sqrt(n))*dgamma(lam,(n-1)/2,s^2*(n-1)/2) }
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1 with absolute error < 4.7e-07 OK (Just checking)
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) gam*intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1.793 with absolute error < 4.7e-06 OK (Agrees with exact calculation)
gamvec=seq(-5,10,0.01); fgamvec=gamvec
for(i in 1:length(gamvec)){
fgamvec[i]=integrate( f=intfun, lower=0, upper=Inf,
gam=gamvec[i])$value }
plot(gamvec,fgamvec) # OK
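# Sketch (assumed): a function of the following form, consistent with the
# criterion g(L,U) described above, is needed for the calls below; the exact
# original definition is a guess.
gfun=function(v){ L=v[1]; U=v[2]
  Fdiff=integrate(function(gam) {
    sapply(gam, function(gam) {
      integrate(function(lam) {
        sapply(lam, function(lam) intfun(lam,gam) )
      }, 0, Inf)$value }) }, L, U)$value
  fL=integrate( f=intfun, lower=0, upper=Inf, gam=L)$value
  fU=integrate( f=intfun, lower=0, upper=Inf, gam=U)$value
  (Fdiff-0.95)^2 + (fU-fL)^2 }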
gfun(v=c(-0.1,4.2)) # 0.001473 OK
gfun(v=c(1,3)) # 0.08562 OK
res0=optim(par=c(0,4),fn=gfun)$par
res0 # 0.07334 3.59516
res1=optim(par=res0,fn=gfun)$par
res1 # 0.07332 3.59518
res2=optim(par=res1,fn=gfun)$par
res2 # 0.07332 3.59518 OK
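# Sketch (assumed): the CPDR endpoints used in the checks below
L=res2[1]; U=res2[2]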
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, L,U)
# 0.95 with absolute error < 3.2e-07
integrate( f=intfun, lower=0, upper=Inf, gam=L)$value # 0.06598
integrate( f=intfun, lower=0, upper=Inf, gam=U)$value # 0.06598 All OK
hist(gamv,prob=T,xlab="gam",xlim=c(-1,6), ylim=c(0,0.6),main="",
breaks=seq(-2,7,0.1))
lines(density(gamv),lty=2,lwd=3)
abline(v=c(gambar, gamci, gamcpdr),lty=2,lwd=3)
points(mle,0,pch=4,lwd=3,cex=2)
lines(gamvec,fgamvec,lty=1,lwd=3)
abline(v=c(gamhat,L,U),lty=1,lwd=3)
legend(3,0.6,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
text(5,0.4,"The cross shows the MLE")
options(digits=5)
n=50; y=32; alp=1;bet=1; a=alp+y; b=bet+n-y; m=10; J=10000
set.seed(443); tv=rbeta(J,a,b); xv=rbinom(J,m,tv)
phat=length(xv[xv>=6])/J;
ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J)
c(phat,ci) # 0.70840 0.69949 0.71731
xvec=0:m; fxgiveny=
choose(m,xvec)*beta(y+xvec+alp,n-y+m-xvec+bet)/beta(y+alp,n-y+bet)
sum(fxgiveny) # 1 Just checking
sum(fxgiveny[xvec>=6]) # 0.70296
deviation of θ1 ,..., θ J .
Note: The first of the three choices for E_j is typically the easiest to calculate but also leads to the least improvement over the ordinary ‘histogram’ predictor, x̄ = (1/J) Σ_{j=1}^J x_j.
Suppose that we observe the vector y = ( y1 ,..., yn ) = (2.1, 3.2, 5.2, 1.7).
In each case, report the associated 95% CI for that mean. Compare your
results with the true value of that mean. Produce a probability histogram
of the simulated λ -values. Overlay a smooth of this histogram and the
Rao-Blackwell estimate of λ ’s marginal posterior density. Also overlay
the exact density.
So we first sample
λ ~ Gamma( (n − 1)/2, {(n − 1)/2} s^2 ),
and then we sample
μ ~ N( ȳ, 1/(nλ) ).
The result is (μ, λ) ~ f(μ, λ | y).
Next let e_j = E(λ | y, μ_j).
It will be observed that this second CI is narrower than the first (having width 0.0053 compared with 0.0133). It will also be observed that both CIs contain the true value, λ̂ = 1/s^2 = 0.4071.
options(digits=4)
# (a)
y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s2=s^2
J=100; set.seed(254); lamv=rgamma(J,(n-1)/2,s2*(n-1)/2);
muv=rnorm(J,ybar,1/sqrt(n*lamv)); est0=1/s^2
est1=mean(lamv); std1=sd(lamv); ci1=est1 + c(-1,1)*qnorm(0.975)*std1/sqrt(J)
ev=rep(NA,J); for(j in 1:J){ muval=muv[j]; ev[j]=1/mean((y-muval)^2) }
est2=mean(ev); std2=sd(ev); ci2=est2 + c(-1,1)*qnorm(0.975)*std2/sqrt(J)
rbind( c(est0,NA,NA,NA), c(est1,ci1,ci1[2]-ci1[1]), c(est2,ci2,ci2[2]-ci2[1]) )
# [1,] 0.4071 NA NA NA
# [2,] 0.4396 0.3767 0.5026 0.12589
# [3,] 0.4150 0.3892 0.4408 0.05166
# (b)
X11(w=8,h=5); par(mfrow=c(1,1))
hist(lamv,xlab="lambda",ylab="density",prob=T,xlim=c(0,2.5),
ylim=c(0,2.5),main="",breaks=seq(0,4,0.05))
lines(density(lamv),lty=1,lwd=3)
lamvec=seq(0,3,0.01); RBvec=lamvec; smu2v=1/ev
for(k in 1:length(lamvec)){ lamval=lamvec[k]
RBvec[k]=mean(dgamma(lamval,n/2,(n/2)*smu2v)) }
lines(lamvec,RBvec,lty=1,lwd=1)
lines(seq(0,3,0.005),dgamma(seq(0,3,0.005),(n-1)/2,s2*(n-1)/2), lty=3,lwd=3)
legend(1.2,2,c("Histogram estimate of posterior","Rao-Blackwell estimate",
"True marginal posterior"), lty=c(1,1,3),lwd=c(3,1,3))
2. Generate x_j ~ f(y | θ_j) independently for each j = 1, …, J (so that x_1, ..., x_J ~ iid f(x | y)).
4. Estimate p by p̂ = (1/J) Σ_{j=1}^J I_j, with associated 1 − α CI
p̂ ± z_{α/2} √{ p̂(1 − p̂)/J }.
The bent coin is tossed 10 times. Heads come up on the first seven tosses
and tails come up on the last three tosses.
The observed number of runs (of heads or tails in a row) is 2, which seems
rather small.
Let yi be the indicator for heads on the ith toss, (i = 1,…,n) (n = 10), and
let θ be the unknown probability of heads coming up on any single toss.
Also let xi be the indicator for heads coming up on the ith of the next n
tosses of the same coin, tossed independently each time.
Further, let y = ( y1 ,..., yn ) and x = ( x1 ,..., xn ) , and choose the test statistic
as
T ( y,θ ) = R( y ) ,
defined as the number of runs in the vector y.
1. Sample x_{1j}, ..., x_{nj} ~ iid Bern(θ_j) and form the vector x_j = (x_{1j}, ..., x_{nj}).
3. Obtain I_j = I(R_j ≤ R), where R = R(y) = 2.
Thereby we estimate p by
p̂ = (1/J) Σ_{j=1}^J I_j = 0.0995,
with 95% CI
p̂ ± 1.96 √{ p̂(1 − p̂)/J } = (0.0936, 0.1054).
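A minimal R sketch of this posterior predictive check (illustrative; it reproduces the above values only approximately, since the random seed differs):

R <- function(v){ m <- length(v); sum(abs(v[-1]-v[-m]))+1 }  # number of runs in v
set.seed(123)
n <- 10; J <- 10000; Iv <- rep(NA, J)
thv <- rbeta(J, 8, 4)                     # theta_j ~ (theta | y) ~ Beta(8,4)
for(j in 1:J){
  xj <- rbinom(n, 1, thv[j])              # x_j ~ f(x | theta_j)
  Iv[j] <- (R(xj) <= 2)                   # I_j = I(R_j <= R(y)), with R(y) = 2
}
phat <- mean(Iv)
c(phat, phat + c(-1,1)*1.96*sqrt(phat*(1-phat)/J))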
Note 1: Using a suitable formula from runs theory, the exact value of p could be obtained as
p = ∫_0^1 P( R(x) ≤ 2 | θ ) f_{Beta(8,4)}(θ) dθ
= ∫_0^1 Σ_{x_T = 0}^n P( R(x) ≤ 2 | θ, x_T ) f(x_T | θ) f_{Beta(8,4)}(θ) dθ,
where:
• f(x_T | θ) = (n choose x_T) θ^{x_T} (1 − θ)^{n − x_T} is the binomial density with parameters n and θ, evaluated at x_T.
For these data, R(y) = 2 again, but with n = 20 and y_T = 14. In this case,
(θ | y) ~ Beta( y_T + 1, n − y_T + 1 ) ~ Beta(15, 7),
and we obtain the estimate p̂ = 0.0088 with 95% CI (0.0070, 0.0106).
R=function(v){m=length(v); sum(abs(v[-1]-v[-m]))+1}
# Calculates the runs in vector v
R(c(1,1,1,0,1)) # 3 testing …
R(c(1,1)) # 1
R(c(1,0,1,0,1)) # 5
R(c(0,0,1,1,1)) # 2
R(c(1,0,0,1,1,0,0,1,1,1,1,0)) # 6 …. all OK
CHAPTER 6
MCMC Methods Part 1
6.1 Introduction
Monte Carlo methods were introduced in the last chapter. These included
basic techniques for generating a random sample and methods for using
such a sample to estimate quantities such as difficult integrals. This
chapter will focus on advanced techniques for generating a random
sample, in particular the class of techniques known as Markov chain
Monte Carlo (MCMC) methods. Applying an MCMC method involves designing a suitable Markov chain, generating a large sample from that chain, allowing a burn-in period until stochastic convergence is (approximately) reached, and making appropriate use of the values following that burn-in period.
For now, we will assume the driver to be symmetric, in the sense that
g(t | x) = g(x | t),
or more precisely,
g(t = a | b) = g(t = b | a) for all a, b ∈ ℝ.
Note: The driver distribution may also be non-symmetric, but this case
will be discussed later.
(b) Calculate the acceptance probability as p = f(x = x_j′) / f(x = x_{j−1}).
A problem with this second method of generating the sample values is that
they will be autocorrelated to some extent, i.e. not a truly random (iid)
sample from the distribution f ( x ) . We will later discuss this issue and
how to deal with the problems that may arise from it. For the moment, we
stress that x1 ,..., x J will be approximately a random sample from f ( x ) .
Moreover, if J is sufficiently large, then these values will be effectively
independent. This means that a probability histogram of these values will
in fact converge to f ( x ) as J tends to infinity.
Note: This is just the Beta(6,1) density and could be sampled from easily
in many other ways.
The jth iteration of the algorithm involves first sampling a candidate value (or proposed value) from the driver distribution centred at the last value, namely
x_j′ ~ U( x_{j−1} − c, x_{j−1} + c ),
and then accepting this candidate value with probability
p = f(x = x_j′)/f(x = x_{j−1}) = 6(x_j′)^5 / {6(x_{j−1})^5} = (x_j′ / x_{j−1})^5,   (6.1)
where p is taken to be:
• 0 in the case where x_j′ < 0 or x_j′ > 1
• 1 in the case where x_{j−1} < x_j′ < 1.
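A minimal R sketch of this algorithm (illustrative only; not the code used for the figures below):

MET61 <- function(K, x, c){
  # Metropolis algorithm for the target f(x) = 6*x^5, 0 < x < 1 (the Beta(6,1)
  # density), using a U(x-c, x+c) driver. Returns the chain and the acceptance rate.
  xv <- x; ct <- 0
  for(j in 1:K){
    xp <- runif(1, x - c, x + c)                    # candidate value
    p <- 0
    if(xp > 0 && xp < 1) p <- min(1, (xp/x)^5)      # acceptance probability (6.1)
    if(runif(1) <= p){ x <- xp; ct <- ct + 1 }
    xv <- c(xv, x)
  }
  list(xv = xv, ar = ct/K) }
# e.g.: res <- MET61(K = 500, x = 0.1, c = 0.15); res$ar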
Note: There were four rejections until the first acceptance, at iteration 5, where x_5 = x_5′ = 0.1861, as underlined above.
The acceptance rate (AR) for this Markov chain is found to be 64%, meaning that 320 of the 500 candidate values x_j′ were accepted and 36% (or 180) were rejected.
In this case the acceptance rate is only 20.8% and the histogram is a poorer estimate of the true density (to which it would however converge as J → ∞). We say that the algorithm is now displaying poor mixing, compared with the results in the first run of 500 iterations, where c = 0.15.
What happens if we make c = 0.15 smaller? Figures 6.5 and 6.6 are a
repeat of Figures 6.1 and 6.2, respectively, but using simulated values
from a run of the Metropolis algorithm with c = 0.05.
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),
xlab="x",ylab="density",main="")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.64
print(res$vec[1+c(0,1:10,301:310,491:500)], digits=4)
# [1] 0.1000 0.1000 0.1000 0.1000 0.1000 0.1861 0.2650 0.2650 0.4065 0.4388
# [11] 0.4388 0.9261 0.9987 0.9987 0.9987 0.9987 0.9725 0.8889 0.8889 0.9672
# [21] 0.9315 0.8058 0.6811 0.6073 0.4587 0.4353 0.3462 0.3462 0.4177 0.4177
# [31] 0.4656
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
ylab="density", main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.208
hist(res$vec[-(1:101)],prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
ylab="density", main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.83
Check your result by comparing the sample mean and sample standard
deviation of your sample to the true theoretical values, 0 and 1.
Calculate a Monte Carlo 95% confidence interval for the normal mean, 0.
Since f(x) ∝ exp(−x^2/2), the acceptance probability at iteration j is given by
p = f(x = x_j′)/f(x = x_{j−1}) = exp{−(x_j′)^2/2} / exp{−(x_{j−1})^2/2} = exp{ ( (x_{j−1})^2 − (x_j′)^2 ) / 2 }.
Figure 6.8 shows a histogram of the last J = 10,000 values, together with
the standard normal density overlaid.
The average of the J sampled values is 0.0355 (close to 0) and their sample
standard deviation is 1.0047 (close to 1). These values lead to a 95% CI for the normal mean equal to (0.0158, 0.0552). We note that, contrary to what one might expect, this CI does not contain the true value, 0. The underlying issue behind this fact will be discussed generally in the next section.
MET <- function(K, x, c){
# This function implements a simple Metropolis algorithm for sampling from the
# standard normal density, using a U(x-c, x+c) driver.
# Inputs: K = total number of iterations, x = starting value, c = driver halfwidth.
# Outputs: $vec = vector of (K+1) x-values, $ar = acceptance rate.
vec = x; ct = 0
for(j in 1:K){ prop = runif(1,x-c,x+c)
p = exp(-0.5*(prop^2-x^2)); u = runif(1)
if(u <= p){ x = prop; ct = ct + 1 }
vec <- c(vec,x) }
ar = ct/K; list(vec=vec,ar=ar) }
B=500; J = 10000; K = B + J
set.seed(117); res <- MET(K=K,x=5,c=2.5); res$ar # 0.548381
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main=" ")
hist(res$vec[-(1:(B+1))],prob=T,xlim=c(-4,4),ylim=c(0,0.5),xlab="x",
ylab="density",nclass=50, main=" ")
lines(seq(-4,4,0.01),dnorm(seq(-4,4,0.01)),lwd=2)
est=mean(res$vec[-(1:(B+1))]); std=sd(res$vec[-(1:(B+1))])
ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(10000)
c(est,std,ci) # 0.03550254 1.00470749 0.01581064 0.05519445
The batch means CI will be different from the ordinary CI, namely (x̄ ± 1.96 s_x/√J), where x̄ and s_x are the sample mean and sample standard deviation of x_1, ..., x_J. The batch means CI is obtained as follows.
First, break up the J sample values into m batches of size n each (so that J = mn):
………………………………………………………….....
Next: Let y_k be the mean of the n x_j-values in the kth batch (k = 1,...,m).
Let s_y^2 be the sample variance of y_1, ..., y_m.
Note: Thus s_y^2 = {1/(m − 1)} Σ_{k=1}^m (y_k − ȳ)^2, where ȳ = (1/m) Σ_{k=1}^m y_k = x̄ is the mean of the batch means and identical to the mean of all J x_j-values.
Discussion
The rationale for the batch means method is as follows. If the batch size n is sufficiently large then, by the central limit theorem,
y_1, ..., y_m ~ iid N(μ, σ^2/n) (approximately),
where μ = E(x_j) and σ^2 = Var(x_j).
Consequently,
ȳ ~ N( μ, (σ^2/n)/m ) ~ N( μ, σ^2/J ),
since J = mn.
Therefore a 1 − α CI for μ is
( ȳ ± z_{α/2} r/√J ),
where r is an estimate of σ. Now, since the batch means y_1, ..., y_m behave approximately as an iid sample with variance σ^2/n, their sample variance s_y^2 has expectation close to σ^2/n. So an unbiased estimator of σ^2 is n s_y^2.
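For example, a minimal R sketch of a batch means CI (illustrative; the function and argument names are not from the text):

batchCI <- function(xv, m, alpha = 0.05){
  # xv = MCMC output of length J = m*n (J assumed to be a multiple of m)
  J <- length(xv); n <- J/m
  yv <- sapply(1:m, function(k) mean(xv[((k-1)*n + 1):(k*n)]))  # batch means
  se <- sqrt(n*var(yv))/sqrt(J)     # estimate of sigma/sqrt(J), since E(n*s_y^2) = sigma^2
  mean(xv) + c(-1, 1)*qnorm(1 - alpha/2)*se }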
Then use this sample to estimate EX, together with a 95% confidence interval for EX. For this CI use the formula (x̄ ± 1.96 s/√J), where s^2 is the sample variance of the J sampled X-values. Also draw a histogram of the J X-values overlaid with the exact pdf of X.
(b) Use the output from the Metropolis algorithm in (a) to construct another 95% CI for EX, this time using the batch means method, as follows:
Let s_y^2 be the sample variance of the m batch means y_1, ..., y_m.
(c) Conduct a Monte Carlo experiment to assess the quality of the two CIs
for EX in (a) and (b).
Now divide the total count from (ii) by R to get an unbiased point estimate
of the probability that the ordinary CI for EX in (a) contains EX.
Similarly, divide the total count from (iii) by R to get an unbiased point estimate of the probability that the batch means CI for EX in (b) contains EX.
Also produce 95% CIs for the two probabilities just mentioned.
(d) Repeat the experiment in (c) but with the following in place of (i):
(a) Let us specify a uniform driver centred at the last value and with half-
width h. We now iterate as follows after choosing a suitable starting value
of x:
Sample x′ ~ U(x − h, x + h).
(b) Applying the batch means method with m = 20 and n = 50, we estimate
EX as 1.539 again, but with 95% CI (1.467, 1.611). Note that this CI is
wider than the CI in (a) and does contain the true value, 1.5.
We also estimate p2 , the true probability content of the batch means 95%
CI in (b) (with m = 20 and n = 50), as 90.0%, with 95% CI 84.1% to
95.9%.
We see that in this example the batch means method has performed far
better than the ordinary method for constructing 95% CIs for EX from the
output of a Metropolis algorithm.
We see that the two CIs have performed about equally well when
calculated using a truly random sample from X’s distribution. In such
situations, the batch means CI is in fact slightly inferior and the ordinary
CI should be used.
# (a)
MET <- function(Jp,x,h){
# This function implements a simple Metropolis algorithm.
# Inputs: Jp = total number of iterations
# x = starting value of x
# h = halfwidth of uniform driver.
# Outputs: $xv = vector of x-values of length (Jp + 1)
# $ar = acceptance rate.
xv <- x; ct <- 0
for(j in 1:Jp){ xprop <- runif(1,x-h,x+h)
if( (xprop>0) && (xprop<2) ){
p <- xprop^2 / x^2; u <- runif(1)
if(u < p){ x <- xprop; ct <- ct + 1 } }
xv <- c(xv,x) }
list(xv=xv,ar=ct/Jp) }
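# Sketch (assumed): a run of the algorithm with burn-in 100 and J = 1000 further
# iterations, starting at x = 1 with driver halfwidth h = 0.7 (the seed is arbitrary)
Jp <- 1100; set.seed(101); res <- MET(Jp=Jp, x=1, h=0.7); res$ar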
plot(0:Jp,res$xv,type="l",xlab="j",ylab="x_j")
xv <- res$xv[-c(1:101)]; J= length(xv)
hist(xv,xlab="x",prob=T,ylim=c(0,2),nclass=20,ylab="density", main="")
xvec <- seq(0,2,0.1); fvec <- (3/8)*xvec^2; lines(xvec,fvec)
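# Sketch (assumed): the point estimate and ordinary 95% CI for EX described in part (a)
EXhat <- mean(xv); sdhat1 <- sd(xv)
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat1/sqrt(J)
c(EXhat, EXci)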
# (b)
m <- 20; n <- 50; yv <- rep(NA,m)
for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ]
yv[k] <- mean(xvsub) }
sdhat2 <- sqrt(n*var(yv)); sdhat2 # 1.15783
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J)
c(EXhat,EXci) # 1.538984 1.467222 1.610746
# (c)
R<- 100; m <- 20; n <- 50; J <- 1000; burn <- 100; EX <- 1.5; ct1 <- 0; ct2 <- 0;
yv <- rep(NA,m); set.seed(214)
for(r in 1:R){
xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)]
# xv <- rbeta(J,3,1)*2 # for use in (d) (see below)
for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ]
yv[k] <- mean(xvsub) }
EXhat <- mean(xv); sdhat1 <- sqrt(var(xv)); sdhat2 <- sqrt(n*var(yv))
ci1 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat1/sqrt(J)
ci2 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J)
if( (EX >= ci1[1]) && (EX <= ci1[2])) ct1 <- ct1 + 1
if( (EX >= ci2[1]) && (EX <= ci2[2])) ct2 <- ct2 + 1 }
date() # took 2 secs
# (d)
# Repeat code in (c) but with the line
# "xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)]"
# replaced by the line "xv <- rbeta(J,3,1)*2".
(b) approximately, using a Monte Carlo method that does not involve
Markov chains
since P(y > 0 | μ) = 1 − P( z < (0 − μ)/1 ) = 1 − Φ(−μ).
Thus f(μ | y) ∝ {1 − Φ(−μ)}^{-n} exp{ −(1/2) Σ_{i=1}^n (y_i − μ)^2 }
= {1 − Φ(−μ)}^{-n} exp{ −(1/2)[ (n − 1)s^2 + n(ȳ − μ)^2 ] }
∝ {1 − Φ(−μ)}^{-n} exp{ −(n/2)(μ − ȳ)^2 }
≡ k(μ), μ > 0 (this is the kernel of the posterior density).
Thus μ̂ = E(μ | y) = { ∫_0^∞ μ k(μ) dμ } / { ∫_0^∞ k(μ) dμ } = I_1/I_0,
where I_q = ∫_0^∞ μ^q k(μ) dμ, q = 0, 1.
(b) Observe that
μ̂ = { ∫_0^∞ μ {1 − Φ(−μ)}^{-n} h(μ) dμ } / { ∫_0^∞ {1 − Φ(−μ)}^{-n} h(μ) dμ },
where
h(μ) = √(n/(2π)) exp{ −(n/2)(μ − ȳ)^2 } / { 1 − Φ( (0 − ȳ)/(1/√n) ) }, μ > 0,
i.e. h is the N(ȳ, 1/n) density truncated to the positive half-line.
Thus μ̂ = E_1/E_0, where:
E_q = E{ μ^q (1 − Φ(−μ))^{-n} }, q = 0, 1
μ ~ h(μ) ~ N(ȳ, 1/n)I(μ > 0).
So μ̂ may be estimated by Ê_1/Ê_0,
where:
Ê_q = (1/J) Σ_{j=1}^J μ_j^q (1 − Φ(−μ_j))^{-n}
μ_1, ..., μ_J ~ iid h(μ).
(c) Using the Metropolis algorithm and a normal driver distribution with
standard deviation 0.5, we obtain a Markov chain of size 10,000 following
a burn-in of size 100. The acceptance rate is found to be 59%.
Then taking every 10th value results in a very nearly uncorrelated sample of size 1,000 from the posterior distribution of μ. Using these 1,000 values, we estimate μ̂ as 0.5297, with associated 95% CI equal to (0.5047, 0.5547).
We note that the true exact value calculated in (a), 0.5379, is contained in
this CI.
# (b)
J=110000; set.seed(551); samp=rnorm(J,ybar,1/sqrt(n))
samppos=samp[samp>0]; length(samppos) # 102763
samppos=samppos[1:100000]
numer=mean(samppos*(1-pnorm(-samppos))^(-n) )
denom=mean( (1-pnorm(-samppos))^(-n) )
c(numer,denom,numer/denom) # 1.9900593 3.7059926 0.5369842
# (c)
MET <- function(K,mu,del,y){
# This function implements a simple Metropolis algorithm.
# Inputs: K = total number of iterations
# mu = starting value of mu
# del = standard deviation of normal driver
# y = data vector
# Outputs: $muv = vector of mu-values of length (K + 1)
# $ar = acceptance rate
muv = mu; ct = 0; n=length(y); ybar=mean(y)
kfun=function(mu,ybar,n){ exp(-0.5*n*(mu-ybar)^2) / (1-pnorm(-mu))^n }
for(j in 1:K){ muprop = rnorm(1,mu,del)
if( muprop>0 ){
p=kfun(mu=muprop,ybar=ybar,n=n)/kfun(mu=mu,ybar=ybar,n=n)
u=runif(1); if(u < p){ mu = muprop; ct = ct + 1 } }
muv = c(muv,mu) }
list(muv=muv,ar=ct/K) }
plot(0:K,res$muv,type="l")
vec1=res$muv[-(1:101)]
print(acf(vec1)$acf[1:10],digits=2) # Evidence of strong autocorrelation
# 1.00 0.78 0.61 0.48 0.39 0.30 0.24 0.19 0.14 0.11
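# Sketch (assumed): thin the chain by taking every 10th value, as described in the text
v = vec1[seq(10, length(vec1), by=10)]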
J=length(v); J # 1000
est=mean(v); std=sd(v); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)
c(est,std,ci) # 0.5296887 0.4039238 0.5046537 0.5547237
One relevant fact here is that in R on most computers (at present), 5e-324 (meaning 5 × 10^{-324}) is the smallest representable non-zero number. This problem can often be resolved by calculating p as
p = exp(q)
after first computing
q = log f(x_j′) − log f(x_{j−1}),
but even this formulation may not be sufficient in every situation.
It may sometimes also be necessary to replace the calculation of a function,
say h ( r ) , by
h(max( r,5e − 324))
if that function requires a non-zero argument r which is likely to be
reported by R as 0 (because the exact value of r is likely to be between 0
and 5e − 324 ).
Further, and by the same token, if
0 < h(max( r,5e − 324)) < 5e − 324
then R will report a value of 0. In that case, if a non-zero value of
h is absolutely required (for some subsequent calculation) then the
code for h ( r ) should be replaced by code which returns
max( h(max( r,5e − 324)),5e − 324) .
The rationale for this choice of driver is that the proposed value is certainly positive, and it has:
mean x_{j−1}δ / δ = x_{j−1}
variance x_{j−1}δ / δ^2 = x_{j−1}/δ.
Also, its variance around that last value is proportional to it (by a factor of 1/δ). This ensures that values near zero are appropriately ‘explored’
by the Markov chain.
Even with this use of the logarithmic function, computational issues arose
in R on account of limitations with the functions rgamma() and lgamma().
These limitations are acknowledged in the help files for these functions
in R.
To give an example:
set.seed(321)
v = rgamma(10000,0.001,0.001)
# Large sample from the G(0.001,0.001) distribution.
mean(v) # 0.5827886
# This is clearly wrong since the mean is 0.001/0.001 = 1.
length(v[v==0]) # 4777
# Almost HALF of the values are EXACTLY zero.
The R code was appropriately modified so that whenever very small but
non-zero values were reported as zero by R (and problems ensued or
potentially ensued because of this) those values were changed in the code
to 5e-324 (the smallest representable non-zero number in R).
With the above specification and fixes, the Metropolis algorithm was run
for 10,000 iterations following a burn-in of size 100 and starting at 1. The
value of δ used was 1.3 and this resulted in an acceptance rate of 53% as
well as good mixing. Figure 6.11 shows the resulting trace of all 10,101
values of x, and Figure 6.12 shows the required probability histogram of
the last 10,000 values, together with the exact density f ( x ) overlaid.
set.seed(321); v = rgamma(10000,0.001,0.001)
# Large sample from the G(0.001,0.001) distribution.
mean(v) # 0.5827886 This is clearly wrong since the mean is 0.001/0.001 = 1.
length(v[v==0]) # 4777 Almost HALF of the values are EXACTLY zero.
logffun=function(x){ res=-0.5*log(x)-log(4); if(x>1) res=1-x-log(2); res }
loggfun=function(t,x,del){
x*del*log(del)+(x*del-1)*log(t)-t*del-lgamma(max( x*del, 5e-324 )) }
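# Sketch (assumed): with this non-symmetric gamma driver, the acceptance
# probability for a candidate xp given the current value x would be computed
# on the log scale as
#   q <- logffun(xp) + loggfun(x, xp, del) - logffun(x) - loggfun(xp, x, del)
#   p <- exp(q)
# where del is the tuning parameter delta.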
summary(res$xv)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.004243 0.309400 1.034000 1.218000 1.738000 9.356000 (OK, as Min > 0)
First let us again review the Metropolis algorithm for sampling from a
univariate density, f ( x) . This involves choosing an arbitrary starting
value of x, a suitable driver density g (t | x) and then repeatedly proposing
a value x′ ~ g (t | x) , each time accepting this value with probability
p = { f(x′)/f(x) } × { g(x | x′)/g(x′ | x) }
(or p = f(x′)/f(x) in the case of a symmetric driver).
For simplicity we will first focus on the bivariate case (M = 2). Thus, suppose we wish to generate a random sample from the distribution of a random vector X = (X_1, X_2) with pdf f(x), where x = (x_1, x_2) denotes a value of X.
Note: As before, the x1( j ) values on their own then constitute a sample
from the marginal distribution of x1 , whose density is now
f ( x1 ) = ∫∫ f ( x1 , x2 , x3 )dx2 dx3 ,
and the three acceptance probabilities can also be expressed as
p_1 = { f(x_1′, x_2, x_3) g_1(x_1 | x_1′, x_2, x_3) } / { f(x_1, x_2, x_3) g_1(x_1′ | x_1, x_2, x_3) }, etc.
specifying M drivers,
g m (t | x1 ,..., xM ) ( m = 1,..., M ),
and repeatedly iterating M steps as follows:
……..…………………………………………………………….
Note: Again, the x_1^{(j)} values on their own then constitute a sample from the marginal distribution of x_1, whose density is now
f(x_1) = ∫⋯∫ f(x_1, ..., x_M) dx_2 ... dx_M,
and the M acceptance probabilities can also be expressed as
p_1 = { f(x_1′, x_2, ..., x_M) g_1(x_1 | x_1′, x_2, ..., x_M) } / { f(x_1, x_2, ..., x_M) g_1(x_1′ | x_1, x_2, ..., x_M) }, etc.
Let us now specify the driver for n as discrete uniform over the integers from n − r to n + r, where r is a tuning parameter.
1. Propose a value
n′ ~ DU(n − r, ..., n + r),
and accept this value with probability
p_1 = f(n′, θ | y)/f(n, θ | y) = { h(n′, θ) n′! θ^y (1 − θ)^{n′−y} / (n′ − y)! } / { h(n, θ) n! θ^y (1 − θ)^{n−y} / (n − y)! }
= { n′! (1 − θ)^{n′} / (n′ − y)! } / { n! (1 − θ)^n / (n − y)! }.
2. Propose a value
θ′ ~ U(θ − c, θ + c),
and accept this value with probability
p_2 = f(n, θ′ | y)/f(n, θ | y) = { h(n, θ′) n! (θ′)^y (1 − θ′)^{n−y} / (n − y)! } / { h(n, θ) n! θ^y (1 − θ)^{n−y} / (n − y)! }
= { (θ′)^y (1 − θ′)^{n−y} } / { θ^y (1 − θ)^{n−y} }.
The first 100 iterations were thrown away as the burn-in, and then every
20th value (only) was recorded so as to thereby yield an approximately
random sample of size J = 500 from the joint posterior distribution of n
and θ, namely (n_1, θ_1), ..., (n_J, θ_J) ~ iid f(n, θ | y).
Figures 6.13 and 6.14 (pages 299 and 300) show the traces for all 10,101 values of n and θ, respectively, and Figures 6.15 and 6.16 (pages 300 and 301) show the traces for the final 500 values of n and θ, respectively.
The final bivariate sample of size J = 500 was used for Monte Carlo
inference in the usual way, with the following results.
Each histogram also includes vertical lines showing the true distribution
mean, the MC estimate of that mean, and the 95% CI for that mean.
For example, the height of the bar above 6 is the proportion of sample
values n1 ,..., nJ equal to 6, which is 117/500 = 0.234, and the short
vertical bar above 6 is the MC 95% CI for P(n = 6 | y), which is
( 0.234 ± 1.96 √{0.234(1 − 0.234)/500} ) = (0.1969, 0.2711).
X11(w=8,h=6); par(mfrow=c(2,1))
plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)",ylim=c(0,0.4))
points(nvec,fny,pch=16,cex=1); abline(v=nhat)
plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y) ",ylim=c(0,2.5))
lines(thvec,fthyvec,lwd=3); abline(v=thhat)
}
}
thprop = runif(1,th-c,th+c)
if(thprop > 0) if(thprop < 1){
logp2 = logfun(n=n,th=thprop,y=y) - logfun(n=n,th=th,y=y)
p2 = exp(logp2); u = runif(1)
if(u < p2){ th = thprop; thct = thct + 1}
}
nvec = c(nvec,n); thvec = c(thvec,th)
}
nar = nct/Jp; thar = thct/Jp; list(nvec=nvec,thvec=thvec,nar=nar,thar=thar) }
# END
X11(w=8,h=5); par(mfrow=c(1,1))
Jp = 10100; set.seed(135); res = MH(Jp=Jp,n=7,th=0.5,c=0.3,r=1,y=5,k=9)
c(res$nar,res$thar) # 0.7344 0.5847
plot(0:Jp,res$nvec,type="l", xlab="j",ylab="n_j")
plot(0:Jp,res$thvec,type="l", xlab="j",ylab="theta_j")
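# Sketch (assumed): discard the burn-in of 100 values and keep every 20th value
# thereafter, as described in the text
nv = res$nvec[-(1:101)][seq(20, 10000, by=20)]
thv = res$thvec[-(1:101)][seq(20, 10000, by=20)]
J = length(nv)  # 500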
plot(1:J,nv,type="l", xlab="j",ylab="n_j")
plot(1:J,thv,type="l", xlab="j",ylab="theta_j")
rbind(nvals,fvals,pvals,Lvals,Uvals)
# nvals 5.0000 6.0000 7.0000 8.0000 9.0000
# fvals 128.0000 117.0000 98.0000 87.0000 70.0000
# pvals 0.2560 0.2340 0.1960 0.1740 0.1400
# Lvals 0.2177 0.1969 0.1612 0.1408 0.1096
# Uvals 0.2943 0.2711 0.2308 0.2072 0.1704
par(mfrow=c(1,1))
hist(nv,prob=T,xlim=c(4,10),ylim=c(0,0.5),xlab="n",breaks=seq(4.5,9.5,1),
main="", ylab="density")
points(nvec,fny,pch=16); abline(v=nhat)
for(i in 1:length(nvals)) lines(rep(nvals[i],2),c(Lvals[i],Uvals[i]),lwd=2)
abline(v=nbar,lty=4); abline(v=nci,lty=2)
legend(8,0.5,c("True mean","Estimate of mean","95% CI for mean"),lty=c(1,4,2))
legend(4,0.5,c("True posterior"),pch=16,cex=1)
legend(4,0.4,c("95% CI for f(n|y)"),lty=1,lwd=2)
hist(thv,prob=T,xlim=c(0,1),ylim=c(0,3.2),xlab="theta",
main="", ylab="density")
lines(thvec,fthyvec,lwd=2); abline(v=thhat)
thdensity <- density( c(thv,1+abs(1-thv)), from=0, to=1,width=0.2)
lines(density(thv,from=0,to=1,width=0.2),lty=2,lwd=2)
# Note: This is the simplest way to estimate the density
lines(thdensity$x,thdensity$y*2,lty=3,lwd=2)
# Note: This density estimate is forced to be higher at theta=1
abline(v=thbar,lty=4); abline(v=thci,lty=2); abline(v=thcpdr,lty=3)
legend(0,3.2,c("True mean","Estimate of mean","95% CI for mean",
"95% CPDR estimate"),lty=c(1,4,2,3))
legend(0,1.6,c("True posterior","Estimate 1","Estimate 2"),lty=c(1,2,3),lwd=2)
In fact, this is the norm in practice, and it was the case for both drivers in
the last exercise.
Also, one may ‘bundle’ any of the M random variables into blocks and
thereby reduce the number of actual Metropolis steps per iteration. For
example, instead of doing a Metropolis step for each of x3 and x4 at each
iteration, one may do a single Metropolis step as follows:
This idea can be used to improve mixing and speed up the rate of
convergence but may require more work sampling from the bivariate
driver and determining the optimal tuning constant. Note that to sample
(x_3, x_4), it may be possible to do this in two steps via the method of composition, according to
g_{34}(t, u | x_3, x_4) = g_3(t | x_3, x_4) g_{4|3}(u | x_3, x_4, t).
This means that the candidate value xm is definitely accepted at every
iteration. In that case we call the mth step of the Metropolis-Hastings
algorithm a Gibbs step.
If all the Metropolis steps are Gibbs steps then the algorithm may also be
called a Gibbs sampler.
In any case, the mth conditional distribution can be obtained by examining the joint density of all the variables and viewing that joint density as a function of x_m only.
Example
This density was used as a basis for the following Metropolis step for θ at each iteration:
Instead of this Metropolis step at each iteration, it would be better and also easier to apply a Gibbs step which involves sampling the next value of θ directly from the Beta(y + 1, n − y + 1) distribution.
2. Draw θ ~ Beta(y + 1, n − y + 1).
(c) Repeat (b) but with a Gibbs sampler in place of the MH algorithm.
k(μ, λ) = λ^{α + n/2 − 1} exp{ −βλ − (1/(2σ_0^2))(μ − μ_0)^2 − (λ/2) Σ_{i=1}^n (y_i − μ)^2 }.
1. Draw a value μ′ ~ U(μ − c, μ + c)
and accept it with probability p_1 = k(μ′, λ)/k(μ, λ).
2. Draw a value λ′ ~ U(λ − r, λ + r)
and accept it with probability p_2 = k(μ, λ′)/k(μ, λ).
The acceptance rates for μ and λ were 92% and 92%. These rates were judged to be unduly high because they led to very strong serial correlation in the simulated values (i.e. poor mixing).
So the algorithm was run again from the same starting values but with c = 0.9 and r = 0.08 (both larger). This resulted in Figures 6.23 and 6.24 (pages 312 and 313), with much better mixing, faster convergence, and the better acceptance rates of 59% and 58%.
The last 5,000 pairs of values from this second run of the algorithm were
then collected and used to produce the two histograms in Figures 6.25 and
6.26 (pages 313 and 314). Each histogram is overlaid by a density estimate
of the corresponding posterior and shows a dot indicating the true value
of the parameter (which was initially sampled from its prior).
(c) Examining the kernel of the joint posterior in (b) and studying previous
exercises (involving the normal-normal model and the normal-gamma
model) we easily identify the two conditional distributions which define
the Gibbs sampler. These are defined as follows:
1. Sample μ ~ f(μ | y, λ) ~ N(μ*, σ*^2), where: μ* = (1 − k)μ_0 + kȳ,
σ*^2 = kσ^2/n, k = n/(n + σ^2/σ_0^2) = n/(n + 1/(λσ_0^2)), σ^2 ≡ 1/λ.
2. Sample λ ~ f(λ | y, μ) ~ G( α + n/2, β + (1/2){ (n − 1)s^2 + n(ȳ − μ)^2 } ).
This Gibbs sampler was started at μ = 0 and λ = 1 and run for a total of 6,000 iterations. The resulting traces are shown in Figures 6.27 and 6.28.
The last 5,000 pairs of values were then collected and used to produce the
histograms in Figures 6.29 and 6.30 (page 316). Each histogram is
overlaid by a density estimate of the corresponding posterior and shows a
dot indicating the true value of the parameter.
We see that the Gibbs sampler has produced very similar output to that in
(b) as obtained using the Metropolis-Hastings algorithm, but with less
effort (e.g. no need to worry about tuning constants) and with arguably
better results.
By this we mean that the output from the Gibbs sampler exhibits far less
serial correlation. This is evidenced clearly in Figure 6.31 (page 317),
which shows the sample autocorrelation functions of the simulated values
of μ and λ in (b) (top two subplots) and in (c) (bottom two subplots).
# (a)
mu0=10; sig0=2; alp=3; bet=6; n=40; options(digits=4)
set.seed(226); lam=rgamma(1,alp,bet); mu=rnorm(1,mu0,sig0);
sig=1/sqrt(lam); y=rnorm(n,mu,sig)
c(lam, sig, sig^2, mu, mean(y), sd(y))
# 0.1292 2.7822 7.7405 11.9511 12.2768 2.5919
X11(w=8,h=5); par(mfrow=c(1,1))
# (b)
MH <- function(Jp, mu, lam, y, c, r, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Metropolis-Hastings algorithm for the general
# normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# c, r = tuning parameters for mu and lambda
# alp, bet = parameters of lambda’s gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda
# $muar, $lamar = acceptance rates for mu and lambda.
muv <- mu; lamv <- lam; ybar <- mean(y); n <- length(y); muct <- 0; lamct <- 0
logpost <- function(n,y,mu,lam,alp,bet,mu0,sig0){
(alp + n/2-1)*log(lam) - bet*lam -
0.5*lam*sum((y-mu)^2) -0.5*(mu-mu0)^2/sig0^2 }
for(j in 1:Jp){
mup <- runif(1,mu-c,mu+c) # propose a value of mu
q1 <-
logpost(n=n,y=y,mu=mup, lam=lam,alp=alp,bet=bet,mu0=mu0,sig0=sig0)-
logpost(n=n,y=y,mu=mu ,lam=lam,alp=alp,bet=bet, mu0=mu0,sig0=sig0)
p1 <- exp(q1) # acceptance probability
u <- runif(1); if(u < p1){ mu <- mup; muct <- muct + 1 }
lamp <- runif(1,lam-r,lam+r) # propose a value of lambda
if(lamp > 0){ # automatically reject if lamp < 0
q2 <-
logpost(n=n,y=y,mu=mu,lam=lamp,alp=alp,bet=bet, mu0=mu0,sig0=sig0)-
logpost(n=n,y=y,mu=mu,lam=lam ,alp=alp,bet=bet, mu0=mu0,sig0=sig0)
p2 <- exp(q2) # acceptance probability
u <- runif(1); if(u < p2){ lam <- lamp; lamct <- lamct + 1 }
}
muv <- c(muv,mu); lamv <- c(lamv,lam)
}
list(muv=muv,lamv=lamv,muar=muct/Jp,lamar=lamct/Jp)
}
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
text(3000,0.6,"c=0.9, r=0.08")
hist(lamv,prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)
# (c)
GS = function(Jp, mu, lam, y, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Gibbs sampler for the general normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# alp, bet = parameters of lambda’s gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda
muv = mu; lamv = lam; n = length(y); ybar = mean(y); s2 = var(y); sig02 = sig0^2
for(j in 1:Jp){
sig2=1/lam; k=n/(n+sig2/sig02); sig2star=k*sig2/n;
mustar=(1-k)*mu0+k*ybar
mu = rnorm(1,mustar,sqrt(sig2star))
lam=rgamma( 1, alp+0.5*n, bet+0.5*((n-1)*s2+n*(mu-ybar)^2) )
muv = c(muv,mu); lamv = c(lamv,lam) }
list(muv=muv,lamv=lamv)
}
Jp = 6000; set.seed(331)
res = GS(Jp=Jp, mu=0,lam=1, y=y, alp=3,bet=6, mu0=10,sig0=2)
plot(0:Jp,res$muv,type="l",xlab="j",ylab="mu_j");
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
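# Sketch (assumed): collect the last 5,000 pairs of values, as described in the text
muv = res$muv[-(1:1001)]; lamv = res$lamv[-(1:1001)]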
hist(muv,prob=T,xlab="mu",nclass=20,main="",ylim=c(0,1.1),
ylab="density/relative frequency"); lines(density(muv),lwd=2);
points(mu,0,pch=16,cex=1.5)
hist(lamv,prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)
muvc=muv; lamvc=lamv
X11(w=8,h=7); par(mfrow=c(2,2))
CHAPTER 7
MCMC Methods Part 2
7.1 Introduction
In the last chapter we introduced a set of very powerful tools for
generating samples required for Bayesian Monte Carlo inference, namely
Markov chain Monte Carlo (MCMC) methods. The topics we covered
included the Metropolis algorithm, the Metropolis Hastings algorithm and
the Gibbs sampler.
We now present one more topic, stochastic data augmentation, and
provide some further exercises in MCMC. These exercises will illustrate
how many statistical problems can be cast in the Bayesian framework and
how easily inference can then proceed relative to the classical framework.
The examples below include simple linear regression, logistic regression
(an example of generalised linear modelling and survival analysis),
autocorrelated Bernoulli data, and inference on the unknown bounds of a
uniform distribution.
g ( x ) = ∫ q(u | x )du
(c) Estimate EX using a Monte Carlo sample obtained via the Metropolis
algorithm.
(d) Estimate EX using a Monte Carlo sample obtained via a Gibbs sampler
designed using the principles of data augmentation.
(a) Let the kernel be k(x) = e^{-x}/(x + 1).
So EX = 0.40365/0.59635 = 0.6769.
Thus p(x) = 1/(x + 1).
Figure 7.1 shows a trace plot of the simulated values and (just for interest)
the associated sample ACF of these values (showing the complete absence
of autocorrelation), respectively.
(c) Using a normal driver distribution centred at the last value and with
standard deviation 0.6 we ran a Metropolis algorithm for 40,500 iterations,
starting at x = 1. We kept every 40th sampled value after first discarding
the first 500 iterations as the burn-in. Using the resulting Monte Carlo
sample of size 1,000, we estimated EX as 0.7049 with 95% CI (0.6561,
0.7537). The overall acceptance rate of the algorithm was 58%. Figure 7.2
shows a trace plot of all 40,500 simulated values, the sample ACF of those
values (showing a very strong autocorrelation), a trace plot of the 1,000
values used for inference, and the sample ACF of those values (showing
very little autocorrelation).
(d) Observe that 1/(x + 1) = ∫_0^∞ e^{-(x+1)w} dw.
Therefore f(x) = e^{-x}/(x + 1) ∝ ∫_0^∞ e^{-(x+1)w} e^{-x} dw.
Hence we may define an artificial latent variable w such that the joint density of w and x is
f(w, x) ∝ e^{-(x+1)w} e^{-x}, w > 0, x > 0.
We see that:
f(w | x) ∝ f(w, x) ∝ e^{-(x+1)w}, w > 0 (as a function of w)
f(x | w) ∝ f(w, x) ∝ e^{-(w+1)x}, x > 0 (as a function of x).
Figure 7.3 shows a trace plot of all 5,100 simulated values, their sample
ACF (showing a slight autocorrelation), a trace plot of the 1,000 values
used for inference, and the sample ACF of these 1,000 values (showing
very little autocorrelation).
Note that similar plots could also be produced for the simulated latent
variable, w. Also note how data augmentation and a Gibbs sampler have
resulted in a usable Monte Carlo sample more easily and effectively than
the Metropolis algorithm.
# (a)
options(digits=5); kfun=function(x){ exp(-x)/(x+1) }
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.59635
xkfun =function(x){ x*exp(-x)/(x+1) }
top=integrate(f=xkfun,lower=0,upper=Inf)$value; top # 0.40365
EX=top/c; EX # 0.67688
# (b)
J=1000; xv=rep(NA,J); ct=0; set.seed(331)
for(j in 1:J){ acc=F; while(acc==F){ ct=ct+1
x=rgamma(1,1,1); p=1/(x+1); u=runif(1); if(u<p){ acc=T; xv[j]=x } } }
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(ct,xbar,ci) # 1651.00000 0.68754 0.64016 0.73492
par(mfrow=c(2,1)); plot(1:J,xv,type="l")
acf(xv)$acf[1:5] # 1.0000000 -0.0205516 -0.0100987 -0.0040018 0.0732520
# (c)
MET <- function(K,x,c){
# This function applies the Metropolis algorithm to sampling from
# f(x)~exp(-x)/(x+1),x>0.
# Inputs: K = total number of iterations
# x = initial value of x, c = standard deviation of normal driver
# Outputs: $xv = vector of (K+1) values of x, $ar = acceptance rate
xv = x; ct = 0
for(j in 1:K){
xp = rnorm(1,x,c)
if(xp>0) {
q = (-xp-log(xp+1)) - (-x-log(x+1)); p = exp(q); u = runif(1)
if(u < p){ x = xp; ct = ct + 1 }
}
xv <- c(xv,x) }
ar = ct/K; list(xv=xv,ar=ar) }
# (d)
GIBBS <- function(K,x){
# This generates a sample using the Gibbs sampler and data augmentation.
# Inputs: K = total number of iterations, x = initial value of x
# Outputs: $xv = vector of (K+1) values of x, $wv = vector of (K+1) values of w
xv = x; wv=NA; for(j in 1:K){
w=rgamma(1,1,x+1); x=rgamma(1,1,w+1); xv=c(xv,x); wv=c(wv,w) }
list(xv=xv,wv=wv) }
(b) Conduct a classical analysis of the data in (a). Report the MLEs and
95% CIs for a and b. Also create a single graph which shows:
• the fitted regression line Ê(Y | x) = â + b̂x
Use a suitable joint uninformative and improper prior for the three
parameters in the model.
(d) Create a single graph showing all the information in the two graphs in
(b) and (c).
Note: The Bayesian analysis in (c) could also be performed via the
Gibbs sampler.
(a) The simulated data are shown in Table 7.1. Note that x_i = i.
i     1      2     3      4      5      6      7      8      9      10
y_i   5.879  8.54  14.12  13.14  15.26  20.43  19.92  18.47  21.63  24.11
(b) The MLE of b is
b̂ = { Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) } / { Σ_{i=1}^n (x_i − x̄)^2 } = 1.836,
and the MLE of a is then â = ȳ − b̂x̄ = 6.051.
An unbiased estimate of σ^2 = 1/λ (= 4) is
s^2 = {1/(n − 2)} Σ_{i=1}^n ( y_i − {â + b̂x_i} )^2 = 3.816.
Let X be the n × 2 design matrix with ith row (1, x_i), and let
M = (X′X)^{-1}, a 2 × 2 matrix with entries m_{11}, m_{12}, m_{21}, m_{22}.
Let us now solve this Bayesian model so as to estimate the posterior means and 95% CPDRs for a and b. The joint posterior density of the three model parameters is
f(a, b, λ | y) ∝ λ^{n/2 − 1} exp{ −(λ/2) Σ_{i=1}^n (y_i − μ_i)^2 }
(where μ_i = a + bx_i as already defined).
Applying the MH algorithm for 2,500 iterations, we obtain traces for the
three parameters as shown in Figure 7.5. The horizontal lines show the
true values of the three parameters. The fourth subplot (bottom right) is a
histogram of the last 2,000 values of b simulated.
Using output from the last 2,000 iterations only, we estimate the posterior mean and 95% CPDR for a (= 5) as 6.3445 and (3.578, 8.808), and the same for b (= 2) as about 1.7881 and (1.392, 2.234).
Figure 7.6 shows the Bayesian analogue of Figure 7.5 in part (b).
# (a) **************************************************
options(digits=4)
n <- 10; a <- 5; b <- 2; lam <- 0.25; sig <- 1/sqrt(lam); c(sig,sig^2) # 2 4
xdat <- 1:n; set.seed(123); ydat <- rnorm(n,a+b*xdat,sig)
rbind(xdat,ydat)
# xdat 1.000 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
# ydat 5.879 8.54 14.12 13.14 15.26 20.43 19.92 18.47 21.63 24.11
# (b) **********************************************************
fit <- lm(ydat ~ xdat); summary(fit)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.051 1.335 4.53 0.0019 **
# xdat 1.836 0.215 8.54 2.7e-05 ***
df <- length(ydat)-length(fit$coef)
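# Sketch (assumed): the quantities used in the CI calculations below
ahat <- fit$coef[1]; bhat <- fit$coef[2]; sig2hat <- summary(fit)$sigma^2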
aCI <- ahat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[1,1])
aCI # 2.973 9.128
bCI <- bhat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[2,2])
bCI # 1.340 2.332
lines(xxv,muhatub,lty=3,lwd=2)
lines(xxv, predlb,lty=2,lwd=2)
lines(xxv, predub,lty=2,lwd=2)
legend(6,12,c("True mean of Y given x","Least squares fit","95% CI for mean",
"95% prediction interval"),lty=c(1,4,3,2),lwd=rep(2,4))
# (c) **********************************************************
MH.SLR <- function(Jp, x, y, a, b, lam, asd, bsd, lamsd){
# This function implements a Metropolis Hastings algorithm for a
# simple linear regression model with uninformative priors.
# Inputs: Jp = total number of iterations
# x = vector of covariates
# y = vector of observations
# a,b,lam = starting values of a,b,lambda
# asd,bsd,lamsd = st. dev.s of drivers for a,b,lambda.
# Outputs: $av,$bv,$lamv = (Jp+1)-vectors of values of a,b,lambda
# $aar,$bar,$lamar = acceptance rates for a,b,lambda.
av <- a; bv <- b; lamv <- lam; ybar <- mean(y); n <- length(y)
act <- 0; bct <- 0; lamct <- 0
logpost <- function(n, x, y, a, b, lam){ # logposterior
(n/2 - 1) * log(lam) - 0.5 * lam * sum((y - a - b * x)^2) }
for(j in 1:Jp) {
ap <- rnorm(1, a, asd) # propose a value of a
k <- logpost(n=n, x=x, y=y, a=ap, b=b, lam=lam) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { a <- ap; act <- act + 1 }
bp <- rnorm(1, b, bsd) # propose a value of b
k <- logpost(n=n, x=x, y=y, a=a, b=bp, lam=lam) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { b <- bp; bct <- bct + 1 }
lamp <- rnorm(1, lam, lamsd) # propose a value of lambda
if(lamp > 0) { # automatically reject if lamp < 0
k <- logpost(n=n, x=x, y=y, a=a, b=b, lam=lamp) -
logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
p <- exp(k) # acceptance probability
u <- runif(1); if(u < p) { lam <- lamp; lamct <- lamct + 1 }
}
av <- c(av, a); bv <- c(bv, b); lamv <- c(lamv, lam)
}
list(av = av, bv = bv, lamv = lamv, aar = act/Jp, bar = bct/Jp, lamar = lamct/Jp)
}
cpdrLBs <- xxv; cpdrUBs <- xxv; predLBs <- xxv; predUBs <- xxv; set.seed(171)
for(i in 1:nn){
mus <- av + bv*xxv[i]
cpdrLBs[i] <- quantile(mus,0.025)
cpdrUBs[i] <- quantile(mus,0.975)
sim <- rnorm(J,mus,1/sqrt(lamv))
predLBs[i] <- quantile(sim,0.025)
predUBs[i] <- quantile(sim,0.975)
}
# (d) **********************************************************
Table 7.2 shows data on the number of rats who died in each of n = 10
experiments within one month of being administered a particular dose of
radiation. For example in Experiment 3, a total of 40 rats were exposed to
radiation for 3.6 hours, and 23 of them died within one month. Thus an
estimate of the probability of a rat dying within one month if it is exposed
to 3.6 hours of radiation is 23/40 = 57.5%.
i    n_i   x_i    y_i    p̂_i = y_i/n_i
1 10 0.1 1 1/10 = 0.1
2 30 1.4 0 0/30 = 0
3 40 3.6 23 23/40 = 0.575
4 20 3.8 12 12/20 = 0.6
5 15 5.2 8 8/15 = 0.5333
6 46 6.1 32 32/46 = 0.696
7 12 8.7 10 10/12 = 0.833
8 37 9.1 35 35/37 = 0.946
9 23 9.1 19 19/23 = 0.826
10 8 13.6 8 8/8 = 1
(a) Find the ML estimates of a and b using the glm() function in R. For
each parameter also calculate a suitable 95% CI.
(b) Find the ML estimates and associated 95% CIs in R using your own
code for the Newton-Raphson algorithm and without using the glm()
function.
(c) Find the ML estimates using a modification of the Newton-Raphson
algorithm which does not require the inversion of matrices.
(d) Suppose that a and b are assigned independent flat priors over the
whole real line. Thus consider the Bayesian model:
(Y_i | a, b) ~ Bin(n_i, p_i), i = 1, ..., n
p_i = 1/{1 + exp(−z_i)} (probability of death for experiment i)
z_i = a + bx_i (linear predictor)
f(a, b) ∝ 1, a, b ∈ ℝ.
Hence estimate the posterior means of a and b, together with 95% MC CIs
for these estimates, and also estimate the 95% CPDRs.
Show graphs of the traces and histograms. Overlay the MC estimates and
MLEs over the traces, together with 95% CPDRs and CIs, respectively.
Also, overlay kernel density estimates over the histograms.
(e) Use the sample in (d) to estimate p(x), the probability of a rat dying if
it is exposed to x hours of radiation, for each x = 0,1,2,...,15.
Graph these results with a line in a figure which also shows the 10 pˆ i
values.
Also include:
• the MC 95% CI for each estimate of p(x) (i.e. for each E{p(x) | y})
• the MC 95% CPDR for each p(x)
• the MLE of each p(x) using standard GLM procedures,
together with associated large-sample 95% CIs.
(f) Suppose that 20 more rats are about to be exposed to exactly five hours
of radiation. Use the sample in (d) to estimate how many of these 20 rats
will die, together with a 95% CI for your estimate. Also construct an
approximate 95% prediction region for the number of rats that will die
and report the estimated actual probability content of this region.
(g) Use the sample in (d) to estimate LD50, the lethal dose of radiation at
which 50% of rats die, together with a 95% CPDR. Also compute an
estimate and 95% CI for LD50 using standard GLM techniques.
(h) Consider the Bayesian model and data in (d). Modify the model
suitably so as to constrain the probability of death at a dose of zero to be
exactly zero. Estimate the parameters in the new model and draw a graph
similar to the one in (e) which shows the posterior probability of death for
each dose x from zero to 15, together with the associated 95% CPDRs.
(a) Using the glm() function in R, we find that the MLE and 95% CI for
a are –2.156 and (–2.9998, –1.3113). Also, the MLE and 95% CI for b are
0.5028 and (0.3456, 0.6601).
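A minimal R sketch of this fit (illustrative; the variable names are taken from Table 7.2):

ni <- c(10,30,40,20,15,46,12,37,23,8)
xi <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yi <- c(1,0,23,12,8,32,10,35,19,8)
fit <- glm(cbind(yi, ni - yi) ~ xi, family = binomial)
summary(fit)$coef        # MLEs of a (intercept) and b (slope)
confint.default(fit)     # large-sample 95% CIs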
(b) Since the priors on a and b are flat, finding the maximum likelihood estimate of (a, b) is the same as finding the posterior mode of (a, b). Now, the posterior density of a and b is
f(a, b | y) ∝ Π_{i=1}^n p_i^{y_i} (1 − p_i)^{n_i − y_i}.
So the log-posterior is
l(a, b) = log f(a, b | y) = Σ_{i=1}^n q_i + constant, where q_i = y_i log p_i + (n_i − y_i) log(1 − p_i).
Let:
d_{1i} = dq_i/da = y_i − n_i p_i,  d_{2i} = dq_i/db = (y_i − n_i p_i) x_i
d_{11i} = d^2 q_i/da^2 = −n_i p_i(1 − p_i),  d_{12i} = d^2 q_i/dadb = −n_i p_i(1 − p_i) x_i,
d_{22i} = d^2 q_i/db^2 = −n_i p_i(1 − p_i) x_i^2
d_1 = Σ_{i=1}^n d_{1i},  d_2 = Σ_{i=1}^n d_{2i},  d_{11} = Σ_{i=1}^n d_{11i},  d_{22} = Σ_{i=1}^n d_{22i},  d_{12} = Σ_{i=1}^n d_{12i}
v = (a, b)′,  D = D(v) = (d_1, d_2)′,  M = M(v) = [ d_{11} d_{12} ; d_{12} d_{22} ].
Starting from the origin, the iterates of a and b are as shown in Table 7.3.
t      0        1        2        3        4        5
a_t    0     –1.474   –2.013   –2.148   –2.156   –2.156
b_t    0     0.3369   0.4670   0.5008   0.5028   0.5028
Thus the MLEs of a and b are â = –2.156 and b̂ = 0.5028. This agrees
perfectly with the results in (a).
A 95% CI for a is â ± t_0.025(8) s_a and a 95% CI for b is b̂ ± t_0.025(8) s_b,
where:
t_0.025(8) = 2.306
s_a² is the top left element of V
s_b² is the bottom right element of V
V = (X′WX)^{−1} (a 2 by 2 matrix)
X is the n × 2 design matrix whose ith row is (1, x_i)
W = diag(w_1, ..., w_n), with w_i = n_i / {V(μ_i) g′(μ_i)²}
μ_i = p̂_i = 1/(1 + exp(−ẑ_i))  (MLE of the probability at x = x_i)
ẑ_i = â + b̂ x_i  (MLE of the linear predictor at x = x_i)
V(μ) = μ(1 − μ),  g(μ) = log{μ/(1 − μ)} (the logit link function)
g′(μ) = 1/{μ(1 − μ)},
so that w_i = n_i μ_i(1 − μ_i).
(c) The modification replaces the inversion of the 2 by 2 matrix M by separate
univariate updates a ← a − d_1/d_11 and b ← b − d_2/d_22 (see the R code below).
Starting from the origin (a, b) = (0,0) we obtain the results in Table 7.4.
t      0        1          2          3          4
a_t    0     0.4564   –0.45034   –0.06132   –0.7294
b_t    0     0.1401    0.09223    0.20571    0.1690
t     20       21         99        100
a_t  –1.8585  –1.8619   –2.1555    –2.1555
b_t   0.4424   0.4532    0.5028     0.5028
We see that this modified and simpler algorithm converges more slowly
than plain NR. Also, it is less stable, as it fails to converge if started from
(a, b) = (0.3, 0.3), unlike plain NR. Both algorithms fail to converge if
started from (0.5, 0.5). (See the R code below for details.)
(d) We apply the Metropolis-Hastings algorithm with a burn-in of 500, starting
from the origin, to get a sample of size J = 10,000 from f(a, b | y). The
acceptance rates were 37% for a and 55% for b. The Markov chain was not thinned
for subsequent inference, meaning that the CIs obtained below are perhaps
narrower than they should be.
Note: Figure 7.9 shows that the probability of a rat dying when given no
radiation is about 10%. We should interpret this result and the graph
near x = 0 with caution. Ideally, we would conduct another experiment
with only small values of x and a second logistic regression, perhaps
using the log of x as the explanatory variable. On the other hand, maybe
the 10% figure is reasonable because rats could die within one month
for reasons other than radiation. Alternatively, we could modify our
model so as to force p(0) = 0 (see (h) below).
(f) Let d be the number of rats which will die if exposed to radiation for
five hours. Then
(d | y, a, b) ~ Bin(20, p(a,b)),
where
p(a,b) = 1/(1 + exp(−a − 5b)).
Thus for each sampled (a,b) we calculate p(a,b) and sample from the
binomial distribution of d above. The frequencies of the resulting 10,000
values of d are shown in Table 7.5.
d 3 4 5 6 7 8
frequency 1 3 20 75 217 472
d 9 10 11 12 13 14
frequency 845 1188 1562 1733* 1546 1123
d 15 16 17 18 19
frequency 709 332 131 37 6
(g) First observe that the LD50 is the value of x such that p(x) = 0.5.
Using the sample of 10,000 in part (f), we estimate the posterior mean of
LD50 as 4.279, with 95% MC CI (4.273, 4.286). The MC 95% CPDR for
LD50 is (3.584, 4.916). Thus we can be 95% confident that the dose
required to kill half of a large number of rats is between 3.6 and 4.9.
Using standard GLM procedures and the delta method we estimate LD50
as 4.287 (the MLE) with 95% CI (3.532, 5.042). Thus we can be 95%
confident that the dose required to kill half of a large number of rats is
between 3.5 and 5.0. We see that Bayesian and classical methods have
resulted in inferences which are very similar.
(h) An alternative to the logistic model in (d), one with zero probability
of death at zero dosage of radiation, is as follows:
(Y_i | a, b) ~ Bin(n_i, p_i), i = 1, ..., n
p_i = 1 − exp(−z_i),  z_i = a x_i + b x_i²
f(a, b) ∝ 1,  a, b > 0.
# (a) ********************************************************
nvec <- c(10,30,40,20,15,46,12,37,23,8)
xvec <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yvec <- c(1,0,23,12,8,32,10,35,19,8)
pvec <- yvec/nvec
options(digits=4)
cbind(xvec,nvec,yvec,pvec)
# xvec nvec yvec pvec
# [1,] 0.1 10 1 0.1000
# [2,] 1.4 30 0 0.0000
# [3,] 3.6 40 23 0.5750
# [4,] 3.8 20 12 0.6000
# [5,] 5.2 15 8 0.5333
# [6,] 6.1 46 32 0.6957
# [7,] 8.7 12 10 0.8333
# [8,] 9.1 37 35 0.9459
# [9,] 9.1 23 19 0.8261
# [10,] 13.6 8 8 1.0000
# (b) *****************************************************
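# The definition of the function NR.LOGISTIC used below was not reproduced
# on this page. A minimal sketch, consistent with the derivatives given above,
# is included here; the argument names match the calls below, but the body is
# an assumed reconstruction rather than the author's original code.
NR.LOGISTIC <- function(m=20, alp=0, bet=0, xv, nv, yv){
  # Newton-Raphson for the logistic model p_i = 1/(1+exp(-alp-bet*x_i))
  alpv <- alp; betv <- bet
  for(t in 1:m){
    pv <- 1/(1+exp(-alp-bet*xv))
    D <- c(sum(yv-nv*pv), sum((yv-nv*pv)*xv))                     # score vector
    M <- -rbind(c(sum(nv*pv*(1-pv)),    sum(nv*pv*(1-pv)*xv)),
                c(sum(nv*pv*(1-pv)*xv), sum(nv*pv*(1-pv)*xv^2)))  # Hessian
    v <- c(alp,bet) - solve(M) %*% D                              # NR update
    alp <- v[1]; bet <- v[2]
    alpv <- c(alpv,alp); betv <- c(betv,bet)
  }
  list(alpv=alpv, betv=betv)
}
# The quantities alpmle, betmle and varmat used further below can then be taken
# as the final iterates and as solve(-M) evaluated at those final iterates.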
options(digits=4)
nrres <- NR.LOGISTIC(m=20,alp=0,bet=0,xv=xvec,nv=nvec,yv=yvec)
nrres
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....
NR.LOGISTIC(m=20,alp=0.3,bet=0.3,xv=xvec,nv=nvec,yv=yvec)
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....
NR.LOGISTIC(m=20,alp=0.5,bet=0.5,xv=xvec,nv=nvec,yv=yvec)
# Error in solve.default(M) :
# system is computationally singular: reciprocal condition
# number = 9.01649e-18
qt(0.975,8) # 2.306
alpmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[1,1]) # -3.000 -1.311
betmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[2,2]) # 0.3456 0.6601
# (c) ****************************************************
for(t in 1:m){
pv <- 1/(1+exp(-alp-bet*xv))
d1 <- sum(yv - nv*pv)
d2 <- sum((yv - nv*pv)*xv)
d11 <- -sum(nv*pv*(1-pv))
d22 <- -sum(nv*pv*(1-pv)*xv^2)
alp <- alp - d1/d11
bet <- bet - d2/d22
alpv <- c(alpv,alp); betv <- c(betv,bet)
}
list(alpv=alpv,betv=betv)
}
# (d) ****************************************************
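# The sampler below calls a log-posterior function logfun() whose definition
# is not shown on this page. A minimal sketch under the flat prior in (d)
# (an assumed reconstruction, with argument names matching the calls below):
logfun <- function(a, b, xv, yv, nv){
  pv <- 1/(1 + exp(-a - b*xv))
  sum(yv*log(pv) + (nv - yv)*log(1 - pv))   # log of the unnormalised posterior
}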
for(j in 1:its){
a2 <- rnorm(1,a,sa)
logpr <- logfun(a=a2,b=b,xv=xv,yv=yv,nv=nv)-
logfun(a=a,b=b, xv=xv,yv=yv,nv=nv)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ a <- a2; arav[j+1] <- 1 }
b2 <- rnorm(1,b,sb)
logpr <- logfun(a=a,b=b2, xv=xv,yv=yv,nv=nv)-
logfun(a=a,b=b, xv=xv,yv=yv,nv=nv)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ b <- b2; arbv[j+1] <- 1 }
list(av=av,bv=bv,ara=ara,arb=arb)
}
burn <- 500; K <- 10000; its <- burn + K; set.seed(221); date() #
res <- MHLR(burn=burn,J=K,a0=0,b0=0,xv=xvdata,
yv=yvdata,nv=nvdata,sa=0.5,sb=0.05); date() # 10000 Took 1 second
c(res$ara,res$arb) # 0.3650 0.5544
par(mfrow=c(2,1)); plot(res$av,type="l"); plot(res$bv,type="l") # OK
options(digits=4); J = K; thin=1
# thin=1 means no thinning (for experimentation)
av <- res$av[-(1:(burn+1))][seq(thin,K,thin)]; length(av) # 10000
acf(av)$acf[1:5] # 1.0000 0.9283 0.8756 0.8324 0.7945
# (very high autocorrelation)
ahat <- mean(av); aci <- ahat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(av)/J)
acpdr <- quantile(av,c(0.025,0.975))
c(ahat,aci,acpdr) # -2.207 -2.214 -2.199 -2.963 -1.521
X11(w=8,h=8); par(mfrow=c(2,2))
plot(0:its,res$av,type="l",xlab="j",ylab="a_j")
abline(h=c(ahat,aci,acpdr))
abline(h=c(fit$coef[1],fitaci),lty=4)
legend(400,0,c("MC est, 95% CI & CPDR",
"MLE & classical 95% CI"),lty=c(1,4))
plot(0:its,res$bv,type="l", xlab="j",ylab="b_j")
abline(h=c(bhat,bci,bcpdr))
abline(h=c(fit$coef[2],fitbci),lty=4)
legend(400,0.2,c("MC est, 95% CI & CPDR",
"MLE & classical 95% CI"),lty=c(1,4))
hist(av,prob=T, xlim=c(-4,0),ylim=c(0,1.5),nclass=20,xlab="a")
lines(dena$x,dena$y,lwd=2)
hist(bv,prob=T, xlim=c(0.2,0.8),ylim=c(0,7),nclass=20,xlab="b")
lines(denb$x,denb$y,lwd=2)
# (e) ***************************************************
for(i in 1:len){
xx <- xxv[i]
ppsim <- 1/(1+exp(-av-bv*xx))
pp <- mean(ppsim)
ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
ppcpdr <- quantile(ppsim,c(0.025,0.975))
ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}
X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
points(xxv,pihat); lines(xxv,pihatlb,lty=4); lines(xxv,pihatub,lty=4)
legend(8,0.65, c("MC est & 95% CI","95% CPDR","Classical GLM 95% CI"),
lty=c(1,2,4))
legend(8,0.35,c("Sample proportions","Standard GLM estimates"),pch=c(16,1))
# pphatv <- 1/(1+exp(-ahat-bhat*xxv))
# lines(xxv,pphatv,lty=3) # This alternative estimate is practically
# indistinguishable from ppv and so is not plotted
# (f) *****************************************************
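# The full code for (f) is not reproduced here. A minimal sketch, assuming
# av and bv hold the 10,000 posterior draws of a and b from part (d):
p5sim <- 1/(1 + exp(-av - 5*bv))                     # draws of p(a,b) at x = 5
dsim  <- rbinom(length(p5sim), size=20, prob=p5sim)  # one value of d per draw
table(dsim)                                          # frequencies as in Table 7.5
dhat <- mean(dsim)                                   # point estimate
dci  <- dhat + c(-1,1)*qnorm(0.975)*sd(dsim)/sqrt(length(dsim))  # 95% CI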
# (g) ****************************************************
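# Likewise for (g), a minimal sketch assuming av and bv are the draws from (d):
ld50sim  <- -av/bv                                   # posterior draws of LD50 = -a/b
ld50hat  <- mean(ld50sim)                            # posterior mean estimate
ld50ci   <- ld50hat + c(-1,1)*qnorm(0.975)*sd(ld50sim)/sqrt(length(ld50sim))
ld50cpdr <- quantile(ld50sim, c(0.025,0.975))        # 95% CPDR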
# (h) ****************************************************
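# For (h) the sampler in (d) can be reused with the new model
# p_i = 1 - exp(-(a*x_i + b*x_i^2)) and the constraint a, b > 0. A sketch of
# the corresponding log-posterior (an assumption, not the author's code):
logfun.h <- function(a, b, xv, yv, nv){
  if(a <= 0 || b <= 0) return(-Inf)       # flat prior restricted to a, b > 0
  pv <- 1 - exp(-a*xv - b*xv^2)
  sum(yv*log(pv) + (nv - yv)*log(1 - pv))
}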
for(i in 1:len){
xx <- xxv[i]
ppsim <- 1-exp(-av*xx-bv*xx^2)
pp <- mean(ppsim)
ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
ppcpdr <- quantile(ppsim,c(0.025,0.975))
ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}
X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
Hence, with
p_i = P(Y_i = 1 | a, b, y_{i−1}) = 1/(1 + exp(−a − b y_{i−1}))
(as already defined), the joint posterior pdf of a and b is
f(a, b | y) ∝ f(a, b) f(y | a, b)
 ∝_{a,b} 1 × f(y_1 | a, b) ∏_{i=2}^{n} f(y_i | a, b, y_{i−1})
 = q_1^{y_1}(1 − q_1)^{1−y_1} ∏_{i=2}^{n} p_i^{y_i}(1 − p_i)^{1−y_i}.
The traces of a and b over all 11,000 iterations, and histograms of the last
10,000 values of a and b, respectively, are shown in Figure 7.11, together
with posterior density estimates.
yv <- c(1,1,1,1,1, 1,1,0,0,0); n <- length(yv); ybar <- mean(yv); ydot <- sum(yv)
for(j in 1:K){
a2 <- rnorm(1,a,sa) # proposed value of a
logpr <- logfun(a=a2,b=b,yv=yv,n=n)-logfun(a=a,b=b,yv=yv,n=n)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ a <- a2; cta <- cta + 1 }
if(sb > 0){
b2 <- rnorm(1,b,sb) # proposed value of b
logpr <- logfun(a=a,b=b2,yv=yv,n=n)-logfun(a=a,b=b,yv=yv,n=n)
pr <- exp(logpr); u <- runif(1)
if(u<pr){ b <- b2; ctb <- ctb + 1 }
}
av <- c(av,a); bv <- c(bv,b)
}
list(av=av,bv=bv,ara=cta/K,arb=ctb/K)
}
rbind(c(abar,acpdr),c(bbar,bcpdr))
# [1,] -2.337 -6.3980 0.8313
# [2,] 5.411 0.9098 11.8691
X11(w=8,h=6); par(mfrow=c(2,2));
plot(av,type="l",xlab="j",ylab="a_j",cex=1.2)
plot(bv,type="l",xlab="j",ylab="b_j",cex=1.2)
hist(av,prob=T,xlab="a",ylab="relative frequency",cex=1.2);
abline(v=c(abar,acpdr), lty=1,lwd=3); lines(density(av),lwd=2)
hist(bv,prob=T,xlab="b",ylab="relative frequency",cex=1.2);
abline(v=c(bbar,bcpdr), lty=1,lwd=3); lines(density(bv),lwd=2)
Generate a random sample of size n = 20 from the model with a = 0.6 and
b = 0.8. Then apply MCMC methods to generate a random sample from
the joint posterior of a and b. Then use this sample to perform Monte Carlo
inference on m = E(y_i | a, b) = (a + b)/2.
i 1 2 3 4 5
yi 0.7846 0.7572 0.6381 0.7626 0.6105
i 6 7 8 9 10
yi 0.6990 0.7728 0.7113 0.7314 0.7435
i 11 12 13 14 15
yi 0.6324 0.7072 0.7493 0.7979 0.6182
i 16 17 18 19 20
yi 0.7652 0.7883 0.7194 0.6211 0.6054
Note: The range of this data is from 0.6054 to 0.7979. This tells us
immediately that 0 ≤ a ≤ 0.6054 and 0.7979 ≤ b ≤ 1.
The Metropolis-Hastings acceptance probabilities (with a* and b* denoting the
proposed values) are
p_a = f(a* | y, b)/f(a | y, b) = {1/(b − a*)^n} / {1/(b − a)^n} = {(b − a)/(b − a*)}^n
p_b = f(b* | y, a)/f(b | y, a) = {1/(b*(b* − a)^n)} / {1/(b(b − a)^n)} = (b/b*){(b − a)/(b* − a)}^n.
Starting at a = 0.1 and b = 0.9, and using the tuning constants r = 0.008
and t = 0.01, the algorithm was run for 2,500 iterations. The resulting trace
plots are shown in Figure 7.12.
The algorithm was then run for a further 50,000 iterations, starting at the
last values in the previous run (a = 0.5979 and b = 0.8123). The acceptance
rates were now 61% and 54%, and this second run took 14 seconds of
computer time.
Then every 50th value was recorded so as to yield a final random sample
of size J = 1,000 from the joint posterior distribution of a and b, i.e.
(a1 , b1 ),..., (aJ , bJ ) ~ iid f (a, b | y ) .
As a check, the sample ACF of each sample of size 1,000 was calculated.
Figure 7.13 shows the ACF estimates for a and b, and these provide no
evidence for residual autocorrelation in either series.
options(digits=4)
MH = function(B,J=1000,y,a,b,r,t){
# This function performs a Metropolis-Hastings algorithm for a model
# involving three uniforms.
# Inputs: B = burn-in length
# J = desired Monte Carlo size
# y = (y1,...,yn) = data (yi ~ iid U(a,b))
# a = starting value of a (a ~ U(0,b))
# b = starting value of b (b ~ U(0,1))
# r,t = tuning constants for a & b, respectively
# Outputs: $av = (1+B+J) vector of a-values
# $bv = (1+B+J) vector of b-values
# $ar = acceptance rate for a (over last J iterations)
# $br = acceptance rate for b (over last J iterations)
av = a; bv = b; an=0; bn=0; miny=min(y); maxy=max(y); n=length(y);
for(j in 1:(B+J)){
ap = rnorm(1,a,r)
if((0<ap)&&(ap<miny)){
p = ((b-a)/(b-ap))^n; u = runif(1)
if(u<p){ a=ap; if(j>B) an=an+1 } }
bp = rnorm(1,b,t)
if((maxy<bp)&&(bp<1)){
q = (b/bp)*((b-a)/(bp-a))^n; v = runif(1)
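# [The remaining lines of this function were not reproduced on this page.
#  The following is an assumed completion, consistent with the outputs
#  $av, $bv, $ar and $br described in the comments above.]
if(v<q){ b=bp; if(j>B) bn=bn+1 } }
av = c(av,a); bv = c(bv,b)
}
list(av=av, bv=bv, ar=an/J, br=bn/J)
}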
X11(w=8,h=5); par(mfrow=c(1,1))
hist(mv,prob=T,xlab="m",main="",
xlim=c(0.65,0.75), ylim=c(0,80))
lines(density(mv),lwd=2)
est=mean(mv); ci=est+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
cpdr=quantile(mv,c(0.025,0.975))
print(c(est,ci,cpdr),digits=4) # 0.7013 0.7008 0.7019 0.6837 0.7173
abline(v= c(est,ci,cpdr),lwd=2)
CHAPTER 8
Inference via WinBUGS
8.1 Introduction to BUGS
We have illustrated the usefulness of MCMC methods by applying them
to a variety of statistical contexts. In each case, specialised R code was
used to implement the chosen method. Writing such code is typically time
consuming and requires a great deal of attention to details such as
choosing suitable tuning constants in the Metropolis-Hastings algorithm.
Figure 8.2 shows the Wikipedia article on WinBUGS (on the same day):
http://en.wikipedia.org/wiki/WinBUGS
The preferred reference for citing WinBUGS in scientific papers is:
Lunn, D.J., Thomas, A., Best, N., and Spiegelhalter, D. (2000).
WinBUGS – A Bayesian modelling framework: Concepts,
structure, and extensibility. Statistics and Computing, 10:
325–337.
Suppose the data is y = ( y1 ,..., yn ) = (2.4, 1.2, 5.3, 1.1, 3.9, 2.0), and we
wish to find the posterior mean and 95% posterior interval for each of µ
and γ = µ√τ (the signal-to-noise ratio).
To perform this in WinBUGS 1.4.3, open a new window (select ‘File’ and
then ‘New’ in the BUGS toolbar), and type the following BUGS code:
model
{
for(i in 1:n){
y[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0,0.0001)
tau ~ dgamma(0.001, 0.001)
gam <- mu*sqrt(tau)
}
# data
list(y=c(2.4,1.2,5.3,1.1,3.9,2.0), n=6)
# inits
list(tau=1)
Alternatively, copy this text from a Word document into a Notepad file,
and then copy the text from the Notepad file into the WinBUGS window.
Note: Do not copy text from Word to WinBUGS directly or you may
get an error message.
Next, select ‘Model’ (in the WinBUGS toolbar) and then ‘Specification’.
Then highlight the word ‘model’ (in the BUGS code above) and click on
‘check model’ in the ‘Specification Tool’.
Then highlight the first word ‘list’, click on ‘load data’ and click on
‘compile’.
Then highlight the second word ‘list’, click on ‘load inits’ and click on
‘gen inits’.
Next, select ‘Inference’ and then ‘Samples’. Then, in the ‘Sample Monitor
Tool’ which appears, type ‘mu’ in the ‘node’ box, click ‘set’, type ‘gam’
in the ‘node’ box and click ‘set’ again.
In the ‘Update Tool’ which appears, change ‘1000’ to ‘1500’ and click
‘update’. This will implement 1,500 iterations of an MCMC algorithm.
Next type ‘*’ (an asterisk) in the ‘node’ box, change ‘1’ to ‘501’ in the
‘beg’ box (meaning beginning) and click ‘stats’ (statistics).
This should produce something similar to what is shown in Figure 8.4 and
Table 8.1.
From these results, we see that the posterior mean and 95% posterior
interval for µ are about 2.64 and (0.94, 4.31), and the same quantities
for γ are about 1.54 and (0.38, 2.91).
The posterior mean and CPDR for γ do not have such simple formulae.
To see line plots of the simulated values, click on ‘history’ (in the ‘Sample
Monitor Tool’), and to view smoothed histograms of them, click ‘density’.
Figure 8.5 illustrates.
Clicking 'coda' in the 'Sample Monitor Tool' produces two boxes of output.
The first box, 'CODA index', should look as follows:
gam 1 1000
mu 1001 2000
The other box, called 'CODA for chain 1', should have two columns and
2,000 rows and look as follows:
501 1.298
502 1.307
503 1.478
.......................
1498 0.8303
1499 1.993
1500 2.326
501 1.812
502 1.999
503 2.8
......................
1498 1.628
1499 2.161
1500 2.748
Next, copy the contents of ‘CODA for chain 1’ into a Notepad file called
‘out.txt’ (say). Save that file somewhere, e.g. onto the desktop.
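The R code that reads this file back into R is not reproduced in full below;
a minimal sketch, assuming 'out.txt' has been saved to the R working directory, is:
out  <- read.table("out.txt")    # 2,000 rows and 2 columns (iteration, value)
gamv <- out[1:1000, 2]           # rows 1-1000 are 'gam' (see the CODA index)
muv  <- out[1001:2000, 2]        # rows 1001-2000 are 'mu'
c(mean(muv), quantile(muv, c(0.025, 0.975)))   # posterior mean and 95% CPDR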
dim(out) # 2000 2
One can then use the MCMC output in many other ways, e.g. to simulate
from a posterior predictive distribution via the method of composition.
For more information on BUGS, click on ‘Help’ and ‘User manual’ in the
toolbar. Also see ‘Examples Vol I’ and ‘Examples Vol II’ for several
dozen worked examples in BUGS. The examples are very user-friendly.
They contain data, code and everything one needs to reproduce the results
shown. Figure 8.7 shows various excerpts from these files.
x_i (= i)    1       2       3       4       5
yi 5.879 8.54 14.12 13.14 15.26
i 6 7 8 9 10
yi 20.43 19.92 18.47 21.63 24.11
Using the following WinBUGS code, we obtain the results in Table 8.3:
model{
for(i in 1:n){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
lam ~ dgamma(0.001,0.001)
}
# data
list(n = 10, x = c(1,2,3,4,5,6,7,8,9,10), y=c(5.879,8.54,14.12,
13.14,15.26,20.43,19.92,18.47,21.63,24.11))
# inits
list(a=0,b=0,lam=1)
Using the results in Table 8.3, we estimate a by 6.039 with 95% CPDR
(2.955, 9.107), and we estimate b by 1.836 with 95% CPDR (1.342, 2.334).
It may be noted that these results are very similar to those obtained via
classical techniques in an earlier exercise: 6.051 and (2.973, 9.128) for a,
and 1.836 and (1.340, 2.332) for b.
Figure 8.8 shows trace plots and density estimates produced as part of the
WinBUGS output.
Consider the data in Table 8.4, which is the same as in Table 7.2 of
Exercise 7.3 (where, for example, in Experiment 3 a total of 40 rats were
exposed to radiation for 3.6 hours, and 23 of them died within one month).
i    n_i    x_i    y_i    y_i/n_i = p̂_i
1 10 0.1 1 1/10 = 0.1
2 30 1.4 0 0/30 = 0
3 40 3.6 23 23/40 = 0.575
4 20 3.8 12 12/20 = 0.6
5 15 5.2 8 8/15 = 0.5333
6 46 6.1 32 32/46 = 0.696
7 12 8.7 10 10/12 = 0.833
8 37 9.1 35 35/37 = 0.946
9 23 9.1 19 19/23 = 0.826
10 8 13.6 8 8/8 = 1
In your results, also include inference on LD50, the dose at which 50% of
rats will die (= −a/b), and on d, defined as the number of rats that will die
out of 20 that are exposed to five hours of radiation.
model
{
for(i in 1:N){
z[i] <- a + b*x[i]
logit(p[i])<- z[i]
y[i] ~ dbin(p[i],n[i])
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
logit(p5) <- a+5*b
d ~ dbin(p5,20)
LD50 <- -a/b
}
# data
list(N=10,n=c(10,30,40,20,15,46,12,37,23,8),
x=c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6),
y=c(1,0,23,12,8,32,10,35,19,8))
# inits
list(a=0,b=0)
These results are very similar to those obtained via classical techniques in
Exercise 7.3, namely –2.156 and (–3.000, –1.311) for a, etc.
Figure 8.9 shows some traces and density estimates produced as part of
the WinBUGS output. Here, ‘p5’ represents the probability of a rat dying
within one month if exposed to five hours of radiation. We chose to
monitor this node so as to estimate its posterior density.
Suppose that n = 20 data values from this model with a = 0.6 and
b = 0.8 are as shown in Table 8.6 (which is the same as Table 7.6 in
Exercise 7.5).
i 1 2 3 4 5
yi 0.7846 0.7572 0.6381 0.7626 0.6105
i 6 7 8 9 10
yi 0.6990 0.7728 0.7113 0.7314 0.7435
i 11 12 13 14 15
yi 0.6324 0.7072 0.7493 0.7979 0.6182
i 16 17 18 19 20
yi 0.7652 0.7883 0.7194 0.6211 0.6054
Applying the following WinBUGS code we obtain the results in Table 8.7:
model
{
for(i in 1:n){ y[i] ~ dunif(a,b) }
b ~ dunif(0,1)
a ~ dunif(0,b)
m <- (a+b)/2
}
# data
list(n=20, y=c(0.7846,0.7572,0.6381,0.7626,0.6105,0.6990,0.7728,0.7113,0.7314,0.7435,
0.6324,0.7072,0.7493,0.7979,0.6182,0.7652,0.7883,0.7194,0.6211,0.6054))
# inits
list(a=0.1, b=0.9)
0.7016 +c(-1,1)*qnorm(0.975)*0.0001388
0.7016 +c(-1,1)*qnorm(0.975)*0.008201/sqrt(10000)
in place of the corresponding row of Table 8.7. Then, the 95% CI for
m’s posterior mean becomes (0.7009, 0.7023), obtained via
0.7016 +c(-1,1)*qnorm(0.975)*0.0003573
This CI has a width of 0.0014, which is greater than 0.0006, the width
of (0.7013, 0.7019), and closer to 0.0011, the width of the CI in Note 2.
Figure 8.10 shows some traces and density estimates produced as part of
the WinBUGS output.
To call WinBUGS from within R, first install the R2WinBUGS package by typing
install.packages("R2WinBUGS") at the R prompt.
Note: You must have a connection to the internet for this to work. This
command is required only once for each installed version of R.
Then type
library("R2WinBUGS")
Note: This loads the necessary functions and must be done at the
beginning of each R session in which WinBUGS is to be called.
model
mu ~ dnorm(0,0.0001)
y <- c(2.4,1.2,5.3,1.1,3.9,2.0)
n <- length(y)
model.file= "C:/R-3.0.1/BugsCode1.txt",
bugs.directory = "C:/WinBUGS14/",
working.directory = "C:/R-3.0.1/BugsOut/")
This sets things up, starts WinBUGS, runs the BUGS code, closes
WinBUGS, and creates a number of files in the working directory, similar
to the ones shown in Figure 8.11.
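The listing above is abbreviated. A minimal sketch of a complete call, using
standard R2WinBUGS arguments (the iteration settings shown are illustrative
assumptions, not necessarily those used by the author), is:
library(R2WinBUGS)
sim <- bugs(
  data  = list(y=c(2.4,1.2,5.3,1.1,3.9,2.0), n=6),
  inits = list(list(tau=1)),               # one chain; mu generated by WinBUGS
  parameters.to.save = c("mu","gam"),
  model.file = "C:/R-3.0.1/BugsCode1.txt",
  n.chains = 1, n.iter = 1500, n.burnin = 500,
  bugs.directory = "C:/WinBUGS14/",
  working.directory = "C:/R-3.0.1/BugsOut/")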
These files contain information which can then be accessed within R, for
example as follows:
print(sim,digits=4)
par(mfrow=c(2,1))
hist(sim$sims.list$mu, breaks=20)
hist(sim$sims.list$gam, breaks=20)
After typing these commands, you should see two histograms similar to
the ones shown in Figure 8.12. For more information on the bugs()
function, simply type
help(bugs)
Note: If your WinBUGS code has an error, the procedure will crash,
with little to tell you what went wrong. In that case, first iron out any
‘bugs’ directly in WinBUGS, and only then run your WinBUGS code
in R, as above.
Using classical methods, fit a suitable ARIMA model to this time series.
Then forecast the time series forward for one up to twelve quarters.
Then repeat your analysis and forecasts using WinBUGS called from R.
Figure 8.13 shows plots of the original time series x_t, its logarithm
(showing stabilised variability), the difference of the logarithm (showing
a removal of the trend), and yt , the fourth seasonal difference of the first
difference of the logarithm (showing that seasonality has been removed).
The last two (bottom) plots are the sample ACF and sample PACF for yt .
The last two plots in Figure 8.13 suggest SAR(1) or SMA(1) processes.
Both fits pass standard diagnostic checks, the second being marginally
better. Figure 8.14 shows some diagnostic plots for the SMA(1) fit (see
the R Code below for further details).
The chosen SMA(1) model for the TIAP time series xt may be expressed
by writing
yt = ∇4∇ log xt ,
where
y_t = w_t + Θ_1 w_{t−4},  w_t ~ iid N(0, σ²). Here σ̂² = 0.0013.
Figure 8.15 shows the time series xt plus predictions 12 quarters ahead
based on the above fitted model. The dashed lines show the 95%
prediction interval at each of the 12 future times points. (See the R code
below for details regarding all calculations.)
We now fit the same model to the time series but using MCMC via
WinBUGS called from R. Some graphical output from the WinBUGS run
is shown in Figure 8.16. (See the code below for details.)
To compare the classical and Bayesian analyses, we combine the two sets
of forecasts into a single plot, as shown in Figure 8.18 (page 399). Figure
8.19 (page 399) is a detail in Figure 8.18.
We see from Figures 8.18 and 8.19 that the two approaches to inference
have yielded very similar results, at least as regards prediction.
The Bayesian approach has produced 95% prediction intervals which are
slightly wider than those obtained via the classical approach.
It may be argued that such wider intervals are more appropriate, since the
classical approach makes forecasts without taking into account any
uncertainty in the parameter estimates.
To conclude, we report that the fitted model for the TIAP time series xt
is given by
yt = ∇4∇ log xt ,
with
ŷ_t = ŵ_t + Θ̂_1 ŵ_{t−4},  ŵ_t ~ iid N(0, σ̂²),
where σ̂² = 0.0013 via the classical analysis, and σ̂² = 0.0015 via the
Bayesian analysis.
# Classical analysis in R
# ==========================================================
x <-
c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474,
544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742,
854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020,
1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283 )
n <- length(x); n # 48
X11(w=8,h=9); par(mfrow=c(3,2))
plot(x,type="l"); abline(v=seq(0,48,4),h=seq(0,2000,100), lty=3)
plot(log(x),type="l"); abline(v=seq(0,48,4), lty=3)
plot(diff(log(x)),type="l"); abline(v=seq(0,48,4), lty=3)
plot(diff(diff(log(x),lag=4)),type="l"); abline(v=seq(0,48,4), lty=3)
y <- diff(diff(log(x),lag=4))
acf(y, lag=24)
pacf(y,lag=24)
tsdiag(fit1); fit1
# sar1
# -0.4990
# s.e. 0.1417
# sigma^2 estimated as 0.001310: log lik. = 81.12, aic = -158.24
X11(w=8,h=5); par(mfrow=c(2,2))
acf(fit$resid, lag=24)
pacf(fit$resid, lag=24)
qqnorm(fit$resid)
hist(fit$resid, nclass=12)
# ----------------------------------------------------------------------
model
{
for(t in 1:n) { z[t] <- log(x[t]) }
for(t in 1:5){ y[t] <- 0; w[t] ~ dnorm(0,tau) }
for(t in 6:n){ y[t] <- z[t] - z[t-1] - z[t-4] + z[t-5] }
for(t in 6:N){ # N=n+12=60
m[t] <- Phi1*w[t-4]
y[t] ~ dnorm(m[t],tau)
w[t] <- y[t] - m[t]
}
tau ~ dgamma(0.001,0.001)
Phi1dum ~ dbeta(1,1); Phi1 <- 2*Phi1dum-1
for(k in 1:12) {
z[n+k] <- z[n+k-1] + z[n+k-4] - z[n+k-5] + y[n+k]
x[n+k] <- exp(z[n+k])
}
sig2 <- 1/tau
}
# ----------------------------------------------------------------------
x <- c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474,
544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742,
854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020,
1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283,
NA,NA,NA,NA, NA,NA,NA,NA, NA,NA,NA,NA)
# This starts WinBUGS, runs the BUGS code for 6000 iterations, closes
# WinBUGS, and creates a number of files in the working directory. These
# files contain information which can also be accessed within R, as follows.
print(sim,digits=4)
par(mfrow=c(2,2))
hist(Phi1v, breaks=20); hist(sig2v, breaks=20)
hist(xm[,1], breaks=20); hist(xm[,2], breaks=20)
# Let’s now make the forecasts of the series using the BUGS output.
xF2 <- xF; xL2 <- xL; xU2 <- xU; for(t in 1:12){
xF2[t] <- mean(xm[,t])
xL2[t] <- quantile(xm[,t], 0.025)
xU2[t] <- quantile(xm[,t], 0.975) } # Calc. estimates
X11(h=5); par(mfrow=c(1,1));
par(mfrow=c(1,1))
plot(c(40,60),c(1000,3500), type="n", xlab="t", ylab="xt")
lines(x, lwd=2); points(x, lwd=2)
points((n+1):(n+12), xF, pch=16, cex=1.5, col="red");
lines(n:(n+12), c(x[n],xF), lty=1,lwd=2, col="red")
lines((n+1):(n+12), xL, lty=1, lwd=2, col="red")
lines((n+1):(n+12), xU, lty=1, lwd=2, col="red")
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3)
points((n+1):(n+12), xF2, pch=16, cex=1.5, col="blue" );
lines(n:(n+12), c(x[n],xF2), lty=2,lwd=2, col="blue")
lines((n+1):(n+12), xL2, lty=2, lwd=2, col="blue")
lines((n+1):(n+12), xU2, lty=2, lwd=2, col="blue")
legend(40,3000,c("Classical","Bayesian"), lty=c(1,2),
lwd=c(2,2), col=c("red", "blue"), bg="white" )
CHAPTER 9
Bayesian Finite Population Theory
9.1 Introduction
In this chapter we will focus on the topic of Bayesian methods for finite
population inference in the sample survey context. We have previously
touched on this topic when considering posterior predictive inference of
‘future’ values in the context of the normal-normal-gamma model. The
topic will now be treated more generally and systematically.
There are many and various ways in which Bayesian finite population
inference can be categorised, for example:
Each of these categories can in turn be broken down further. For example,
Monte Carlo based techniques may or may not require Markov chain
Monte Carlo methods for generating the sample required for inference.
We see there is potentially a vast subject ground to cover.
Suppose that n units are selected from the finite population without
replacement.
Let s = ( s1 ,..., sn ) be the vector of the ordered labels of the sampled units.
Define ys = ( ys1 ,..., ysn ) to be the sample vector, and likewise define
yr = ( yr1 ,..., yrm ) to be the nonsample vector.
Also, the population vector may sometimes be written using upper case
letters, as Y = (Y1 ,..., YN ) or Y = (Y1 ,..., YN )′ . For the remainder of this
chapter, these alternative notations will not be used.
Also suppose that a sample of size n is drawn from the finite population
without replacement according to some probability distribution for s.
Note 1: The values of s and r here are fixed at their observed values
defined by the data. Thus, given D = ( s, ys ) , we may always express
Q = g ( y,θ ) as h(( ys , yr ),θ ) for some function h (which will in many
cases be the same function as g), and there should be no ambiguity in
the meaning of quantities such as f ( ys , yr | θ ) in (9.1).
In this case it may be necessary to modify the notation to account for the
number of distinct units sampled, previously the fixed constant n, due to
the possibility of multiple selections under sampling with replacement.
is the sample space for s (the set of all possible combinations of n integers
taken from N).
In this case, f ( s | y,θ ) does not depend on y or θ at all and so may also
be written simply as f ( s ) . This then guarantees that
f(s | y_s, y_r, θ) = f(s) = (N choose n)^{−1}
at the single observed value of s, whatever that value may be.
This result tells us that f (Q | D ) will be the same when the sampling
mechanism density f ( s | y s , yr , θ ) is ‘ignored’ in the model, so to speak.
f(θ | D) ∝ f(θ, s, y_s) = ∫ f(θ, s, y_r, y_s) dy_r
 = f(θ) ∫ f(y_s, y_r | θ) f(s | y_s, y_r, θ) dy_r.
Since the sampling mechanism density is a constant here, this becomes
f(θ | D) ∝ f(θ) ∫ f(y_s, y_r | θ) × 1 dy_r
 = f(θ) f(y_s | θ) ∫ f(y_r | y_s, θ) dy_r   (since f(y_s, y_r | θ) = f(y_s | θ) f(y_r | θ, y_s))
 = f(θ) f(y_s | θ).
Similarly, f(y_r | D) ∝ f(y_r, s, y_s) = ∫ f(θ, s, y_r, y_s) dθ
 = ∫ f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ) dθ.
With the constant sampling mechanism density this becomes
f(y_r | D) ∝ ∫ f(θ) f(y_s, y_r | θ) × 1 dθ
 = ∫ f(y_r | y_s, θ) f(θ) f(y_s | θ) dθ
 ∝ ∫ f(y_r | y_s, θ) f(θ | y_s) dθ,
since f(θ | y_s) ∝ f(θ) f(y_s | θ).
(b) Find the predictive distribution of the finite population total, namely
yT = y1 + ... + yN .
f(y_s | θ) = ∏_{i∈s} θ^{y_i}(1 − θ)^{1−y_i} = θ²  (since n = 2 and y_i = 1 for all i ∈ s)
 = (1/4)² = 1/16 when θ = 1/4
 = (1/2)² = 4/16 when θ = 1/2.
It follows that f(θ | D) = 1/5 for θ = 1/4 and 4/5 for θ = 1/2.
(b) Next, observe that
f(y_r | D, θ) = (1 − θ)² for y_r = (0,0),
 = (1 − θ)θ for y_r = (0,1),
 = θ(1 − θ) for y_r = (1,0),
 = θ² for y_r = (1,1).
So f(y_r | D) = Σ_θ f(y_r | D, θ) f(θ | D)
 = (1 − 1/4)²(1/5) + (1 − 1/2)²(4/5) = 25/80 for y_r = (0,0)
 = (1 − 1/4)(1/4)(1/5) + (1 − 1/2)(1/2)(4/5) = 19/80 for y_r = (0,1)
 = (1/4)(1 − 1/4)(1/5) + (1/2)(1 − 1/2)(4/5) = 19/80 for y_r = (1,0)
 = (1/4)²(1/5) + (1/2)²(4/5) = 17/80 for y_r = (1,1).
Therefore f(y_rT | D) = 25/80 for y_rT = 0, 38/80 for y_rT = 1, and 17/80 for y_rT = 2.
(b) Find the predictive distribution of the finite population total, namely
yT = y1 + ... + yN
In this case the sampling mechanism is nonignorable and the first thing
we should do is determine the exact form of the sampling density of
s = ( s1 , s2 ) . Now,
f(s | y, θ) = c y_sT = c(y_{s1} + y_{s2})
for some constant c such that
1 = Σ_s f(s | y, θ)
 = c{(y_1 + y_2) + (y_1 + y_3) + (y_1 + y_4) + (y_2 + y_3) + (y_2 + y_4) + (y_3 + y_4)}
 = c{3(y_1 + y_2 + y_3 + y_4)} = 3c y_T.
Hence c = 1/(3y_T) and f(s | y, θ) = y_sT/(3y_T).
Note 2: This formula is only true when the finite population total yT is
positive, i.e. when at least one of y1 ,..., y N is nonzero. In the case where
all population values are zero, we have that y sT = ys1 + ys2 = 0 for all
possible samples s = ( s1 , s2 ) , and consequently f ( s | y, θ ) ∝ 0 , which
must be understood to mean that that no sample actually gets drawn. The
fact that a sample has been observed implies f ( s | y, θ ) > 0 for at least
one value of s, which implies that at least one population value is
positive, which in turn implies that yT > 0. This would be true even if
all the sample values were zero; but as it happens, at least one of them
is positive (in fact both are), which in itself implies that yT > 0 .
We may now work out the joint density of all quantities in the model:
f(θ, y_s, y_r, s) = f(θ) f(y_s | θ) f(y_r | θ) f(s | y_s, y_r, θ)
 = (1/2) × ∏_{i∈s} θ^{y_i}(1 − θ)^{1−y_i} × ∏_{i∈r} θ^{y_i}(1 − θ)^{1−y_i} × (y_{s1} + y_{s2})/(3y_T)
 = (1/2) × θ² × θ^{y_1+y_3}(1 − θ)^{2−y_1−y_3} × (1 + 1)/{3(y_1 + 1 + y_3 + 1)}
 ∝ θ^{2+y_1+y_3}(1 − θ)^{2−y_1−y_3} / (2 + y_1 + y_3).
Then
f(θ | D) ∝ f(θ, s, y_s) = Σ_{y_r} f(θ, s, y_s, y_r)
 ∝ θ²(1 − θ)² Σ_{y_1=0}^{1} Σ_{y_3=0}^{1} {θ/(1 − θ)}^{y_1+y_3} / (2 + y_1 + y_3)
 = θ²(1 − θ)² { 1/2 + (1/3)(θ/(1 − θ)) + (1/3)(θ/(1 − θ)) + (1/4)(θ/(1 − θ))² }
 = (1/12){ 6θ²(1 − θ)² + 8θ³(1 − θ) + 3θ⁴ }
 = {6(9) + 8(3) + 3(1)}/(12 × 256) = 81/(12 × 256) when θ = 1/4
 = {6(16) + 8(16) + 3(16)}/(12 × 256) = 272/(12 × 256) when θ = 2/4,
i.e. f(θ | D) ∝ 81 for θ = 1/4 and 272 for θ = 2/4.
After normalising (81 + 272 = 353), f(θ | D) = 81/353 = 0.22946 for θ = 1/4 and
272/353 = 0.77054 for θ = 2/4.
Likewise, the predictive density of the nonsample vector y_r = (y_1, y_3) is
f(y_r | D) ∝ f(y_r, s, y_s) = Σ_θ f(θ, s, y_s, y_r)
 ∝ Σ_{θ=1/4, 2/4} θ^{2+y_1+y_3}(1 − θ)^{2−y_1−y_3} / (2 + y_1 + y_3)
 = { (1/4)^{2+y_1+y_3}(3/4)^{2−y_1−y_3} + (2/4)^{2+y_1+y_3}(2/4)^{2−y_1−y_3} } / (2 + y_1 + y_3)
 = (16 + 3^{2−y_1−y_3}) / {256(2 + y_1 + y_3)}
 ∝ (16 + 3^{2−y_1−y_3}) / (2 + y_1 + y_3)
 = (16 + 9)/2 = 150/12 for (y_1, y_3) = (0,0)
 = (16 + 3)/3 = 76/12 for (y_1, y_3) = (0,1)
 = (16 + 3)/3 = 76/12 for (y_1, y_3) = (1,0)
 = (16 + 1)/4 = 51/12 for (y_1, y_3) = (1,1),
i.e. f(y_r | D) ∝ 150, 76, 76 and 51, respectively.
After normalising (150 + 76 + 76 + 51 = 353), f(y_r | D) = 150/353, 76/353, 76/353
and 51/353 for y_r = (0,0), (0,1), (1,0) and (1,1), respectively.
For y_r = (0,0):
f(θ | y_s, y_r, s) ∝ θ^{2+0+0}(1 − θ)^{2−0−0} = (1/4)²(3/4)² = 9/256 when θ = 1/4,
 and = (2/4)²(2/4)² = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 9/25 for θ = 1/4 and 16/25 for θ = 1/2.
For y_r = (0,1):
f(θ | y_s, y_r, s) ∝ θ^{2+0+1}(1 − θ)^{2−0−1} = (1/4)³(3/4) = 3/256 when θ = 1/4,
 and = (2/4)⁴ = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 3/19 for θ = 1/4 and 16/19 for θ = 1/2.
For y_r = (1,0):
f(θ | y_s, y_r, s) ∝ θ³(1 − θ) = 3/256 when θ = 1/4, and = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 3/19 for θ = 1/4 and 16/19 for θ = 1/2.
For y_r = (1,1):
f(θ | y_s, y_r, s) ∝ θ^{2+1+1}(1 − θ)^{2−1−1} = (1/4)⁴ = 1/256 when θ = 1/4,
 and = (2/4)⁴ = 16/256 when θ = 2/4
⇒ f(θ | y_s, y_r, s) = 1/17 for θ = 1/4 and 16/17 for θ = 1/2.
Now,
f(θ | y_s, s) = Σ_{y_r} f(θ, y_r | y_s, s) = Σ_{y_r} f(θ | y_s, y_r, s) f(y_r | y_s, s).
Thus
f(θ = 1/4 | y_s, s) = Σ_{y_r} f(θ = 1/4 | y_s, y_r, s) f(y_r | y_s, s)
 = (9/25)(150/353) + (3/19)(76/353) + (3/19)(76/353) + (1/17)(51/353) = 0.22946
f(θ = 1/2 | y_s, s) = Σ_{y_r} f(θ = 1/2 | y_s, y_r, s) f(y_r | y_s, s)
 = (16/25)(150/353) + (16/19)(76/353) + (16/19)(76/353) + (16/17)(51/353) = 0.77054.
These results are all in agreement with those obtained in (a) using a
different approach.
(d) (i) The probability of selecting unit i into the sample given y and θ has
the same form for each i. Considering i = 1,
P(1 ∈ s | y, θ) = Σ_{s: 1∈s} f(s | y, θ) = {(y_1+y_2) + (y_1+y_3) + (y_1+y_4)}/(3y_T)
 = (y_T + 2y_1)/(3y_T) = 1/3 + 2y_1/(3y_T),
assuming that y_T > 0; otherwise, P(1 ∈ s | y, θ) = 0.
Thus, for each i = 1, ..., 4 we have that
P(i ∈ s | y, θ) = 1/3 + 2y_i/(3y_T) if y_T > 0, and = 0 if y_T = 0.
(ii) The answer is yes, assuming that y is such that y_T > 0; in that case,
Σ_{i=1}^{N} P(i ∈ s | y, θ) = Σ_{i=1}^{4} {1/3 + 2y_i/(3y_T)}
 = 4/3 + 2(y_1 + y_2 + y_3 + y_4)/(3y_T) = 4/3 + 2/3 = 2 = n.
(iii) The probability of selecting unit i into the sample given θ is the same
for all i, in particular i = 1, and so may be written
P(i ∈ s | θ) = P(1 ∈ s | θ) = Σ_y P(1 ∈ s | θ, y) f(y | θ)
 = 0 × P(y = (0,0,0,0) | θ) + Σ_{y: y_T>0} {1/3 + 2y_1/(3y_T)} ∏_{i=1}^{4} θ^{y_i}(1 − θ)^{1−y_i}
 = Σ_{y: y_T>0} {1/3 + 2y_1/(3y_T)} θ^{y_T}(1 − θ)^{4−y_T}
 = 0.34180 when θ = 1/4, and 0.46875 when θ = 1/2.
(iv) The unconditional probability that any particular population unit i will
be selected into the sample is
P(i ∈ s) = Σ_θ P(i ∈ s | θ) f(θ) = 0.34180 × (1/2) + 0.46875 × (1/2) = 0.40527.
The first of these quantities is Σ_{i=1}^{4} P(i ∈ s) = 4 × 0.40527 = 1.6211.
options(digits=5)
kern=function(th,yr){ th^(2+sum(yr))*(1-th)^(2-sum(yr))/(2+sum(yr)) }
postyr =c(kernyr00,kernyr01,kernyr10,kernyr11)/
(kernyr00+kernyr01+kernyr10+kernyr11)
postyr # 0.42493 0.21530 0.21530 0.14448
# (c)
# (d)
ymat
# [1,] 0 0 0 0
# [2,] 0 0 0 1
# ...............................
# [15,] 1 1 1 0
# [16,] 1 1 1 1
(b) Find the predictive distribution of the finite population mean, namely
ȳ = (y_1 + ... + y_N)/N.
The easiest way to do this exercise is to first identify eight equally likely
possibilities to start with. These possibilities are:
1. θ = 0, y = (0,0,0,0) with ȳ = 0
2. θ = 0, y = (0,0,0,1) with ȳ = 1/4
3. θ = 0, y = (0,0,1,1) with ȳ = 1/2
4. θ = 0, y = (0,1,1,1) with ȳ = 3/4
5. θ = 1, y = (1,1,1,1) with ȳ = 1
6. θ = 1, y = (1,1,1,0) with ȳ = 3/4
7. θ = 1, y = (1,1,0,0) with ȳ = 1/2
8. θ = 1, y = (1,0,0,0) with ȳ = 1/4.
After observing y_s = (y_2, y_3) = (1,1), there are only three possibilities
remaining (numbers 4, 5 and 6 in the list).
Alternative solution
The above results can also be obtained by working through in the style of
the solutions to previous exercises, as follows. Before the data is observed,
the Bayesian model may be written:
f(s | y, θ) = (N choose n)^{−1} = (4 choose 2)^{−1} = 1/6,
 s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
f(y | θ) = 1/4,  y = (θ,θ,θ,θ), (θ,θ,θ,1−θ), (θ,θ,1−θ,1−θ), (θ,1−θ,1−θ,1−θ)
f(θ) = 1/2,  θ = 0, 1  (the prior density of the parameter).
f(θ, s, y) = f(θ, s, y_s, y_r) = f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
 = {I(θ ∈ {0,1})/2} × {I(y = (0,1,1,1), θ = 0) + I(y ∈ {(1,1,1,1),(1,1,1,0)}, θ = 1)}/4 × (1/6)
 ∝_{θ, y_r} I(y_r = (0,1), θ = 0) + I(y_r ∈ {(1,1),(1,0)}, θ = 1).
Then
f(θ | D) ∝ f(θ, s, y_s) = Σ_{y_r} f(θ, s, y)
 ∝ Σ_{y_r} I(y_r = (0,1)) = 1 for θ = 0,
 and Σ_{y_r} I(y_r ∈ {(1,1),(1,0)}) = 2 for θ = 1.
After normalising, we see that f(θ | D) = 1/3 for θ = 0 and 2/3 for θ = 1.
(b) Also,
f(y_r | D) ∝ f(y_r, s, y_s) = Σ_θ f(θ, s, y_s, y_r)
 = Σ_{θ=0}^{1} {I(y_r = (0,1), θ = 0) + I(y_r ∈ {(1,1),(1,0)}, θ = 1)}
 = 1 for each of y_r = (0,1), (1,1) and (1,0).
Consequently, f(y | D) = 1/3 for y = (0,1,1,1), (1,1,1,1) and (1,1,1,0).
Now, the values of y listed here as possible given the observed data have
means 3/4, 1 and 3/4, respectively.
Find the posterior distribution of the Poisson mean and the predictive
distribution of the nonsample total.
Also find these distributions under the (false) assumption that the
sampling is SRSWR.
Then create two plots which suitably compare the four distributions
indicated above.
Note: The concepts here involve a biased sampling mechanism and are
relevant to on-site sampling, where for example we wish to estimate the
total number of times that visitors (or potential visitors) to a recreational
park actually visit there in some specified time period.
r = ( r1 ,..., rm ) is the vector of the labels of the m units that are not sampled.
Since we are interested in the nonsample values only by way of their total
yrT , a suitable Bayesian finite population model in this context is:
f(I | y, λ) = {n!/∏_{i=1}^{N} I_i!} ∏_{i=1}^{N} (y_i/y_T)^{I_i},
 I ∈ {(a_1, ..., a_N): a_i ∈ {0,1,...,n} ∀ i, a_1 + ... + a_N = n}
f(y_s, y_rT | λ) = ∏_{i∈s} {e^{−λ} λ^{y_i}/y_i!} × e^{−mλ}(mλ)^{y_rT}/y_rT!
λ ~ G(η, τ).
Note: The probability of sampling unit 2 once and unit 4 twice (as is
assumed to have occurred) equals
(y_2/y_T)(y_4/y_T)(y_4/y_T) + (y_4/y_T)(y_2/y_T)(y_4/y_T) + (y_4/y_T)(y_4/y_T)(y_2/y_T)
 = {3!/(1! 2!)} (y_2/y_T)^1 (y_4/y_T)^2,
which is of the general form
f(I | y, λ) = {n!/∏_{i=1}^{N} I_i!} ∏_{i=1}^{N} (y_i/y_T)^{I_i}.
For this exercise we will first derive the predictive distribution of yrT and
then use this to obtain the posterior distribution of λ only afterwards. The
predictive density of yrT is
f(y_rT | D) ∝ ∫ f(y_rT, y_s, I, λ) dλ
 = ∫ f(λ) f(y_s | λ) f(y_rT | λ) f(I | y_s, y_rT, λ) dλ
 ∝ ∫_0^∞ λ^{η−1} e^{−τλ} × ∏_{i∈s} e^{−λ} λ^{y_i} × {e^{−mλ}(mλ)^{y_rT}/y_rT!} × (1/y_T^n) dλ
   (note that ∏_{i=1}^{N} (1/y_T)^{I_i} = 1/y_T^n)
 = (m^{y_rT}/y_rT!) (1/y_T^n) ∫_0^∞ λ^{η + y_sT + y_rT − 1} e^{−λ(τ + d + m)} dλ
   (where d is the number of distinct units in the sample, so that τ + d + m = τ + N)
 = (m^{y_rT}/y_rT!) × (1/y_T^n) × Γ(η + y_sT + y_rT)/(τ + d + m)^{η + y_sT + y_rT}.
Thus
f(y_rT | D) = k(y_rT)/c,  y_rT = 0, 1, 2, ...,
where
k(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / {y_rT! (y_sT + y_rT)^n}
and
c = Σ_{y_rT=0}^{∞} k(y_rT).
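In practice the infinite sum defining c can be evaluated by truncating it at a
suitably large value of y_rT. A minimal R sketch follows; the parameter values
shown are illustrative assumptions only (the actual values of N, d, n, η, τ and
y_sT come from the exercise specification), but the vectors yrTv and fv
correspond to those used in the R code later in this solution.
N <- 20; d <- 2; n <- 3; eta <- 0.01; tau <- 0.01; ysT <- 12   # assumed values
yrTv <- 0:200                                  # truncation point chosen large
logk <- yrTv*log((N-d)/(N+tau)) + lgamma(eta+ysT+yrTv) -
        lfactorial(yrTv) - n*log(ysT+yrTv)     # log k(yrT), for numerical stability
fv <- exp(logk - max(logk)); fv <- fv/sum(fv)  # normalised f(yrT | D)
sum(yrTv*fv)                                   # predictive mean of yrT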
Using the predictive density of yrT , we can now obtain the posterior
density of λ as
f(λ | D) = Σ_{y_rT=0}^{∞} f(λ | D, y_rT) f(y_rT | D),
where
(λ | D, y_rT) ~ G(η + y_sT + y_rT, τ + N).
Under the (false) assumption that the sampling is SRSWR, the sampling mechanism
density does not involve the y-values, and
f(λ | D) ∝ Σ_{y_rT} f(y_rT, y_s, I, λ).
The result is then almost the same as before, the only difference being that
the term
∏_{i=1}^{N} y_T^{I_i} = y_T^n = (y_sT + y_rT)^n
in
k(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / {y_rT! (y_sT + y_rT)^n}
is replaced by 1. That is, k(y_rT) is replaced by
K(y_rT) = {(N − d)/(N + τ)}^{y_rT} Γ(η + y_sT + y_rT) / (y_rT! × 1)
and c by
C = Σ_{y_rT=0}^{∞} K(y_rT).
We see that the inference under the assumption of length-bias is the lower
of the two. This is because it appropriately corrects for large finite
population values being more likely to be selected. If we ‘ignore’ the fact
that large values are more likely to be selected, then we will erroneously
over-estimate the superpopulation mean, λ .
Figure 9.2 shows the predictive density f ( yrT | D ) , again under the two
assumptions.
lamv=seq(0,10,0.01); lamfv=lamv
for(i in 1:length(lamv)) lamfv[i]=sum(fv*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
plot(lamv,lamfv,type="l", lty=1, lwd=3,
xlab="lambda",ylab="posterior density", main=" ")
lamfvigno=lamv
for(i in 1:length(lamv))
lamfvigno[i]=sum(fvigno*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
# lines(lamv,lamfvigno,lty=2,lwd=1) # Can do as a check on calculations
lines(lamv,dgamma(lamv,eta+ysT,tau+d),lty=2,lwd=3)
legend(4,0.5,c("Length-bias assumed (Inference is correct)",
"SRSWR assumed (Inference is too high)"),lty=c(1,2),lwd=c(3,3))
Units 3 and 5 are selected, and their values are 1.6 and 0.4, respectively.
Find and sketch the posterior density of the superpopulation mean µ and
the predictive density of the finite population mean y under each of the
following specifications:
Note: Here, the sample size n is not fixed and is a random variable.
f(I | y, λ) = ∏_{i=1}^{N} π_i^{I_i} (1 − π_i)^{1−I_i}
f(y | λ) = ∏_{i=1}^{N} λ e^{−λ y_i},  y_i > 0 ∀ i
f(λ) ∝ 1/λ,  λ > 0.
Here,
π_1 = ... = π_N = 0.3,
and the data is
D = (I, y_s) = ((0,0,1,0,1,0,0), (1.6, 0.4)),
with n = 2 (the achieved sample size).
Next,
(y_rT | λ) ~ G(m, λ),
where m = N − n = 7 − 2 = 5.
It follows that
f(y_rT | D) = ∫ f(y_rT | D, λ) f(λ | D) dλ
 ∝ ∫_0^∞ λ^m y_rT^{m−1} e^{−λ y_rT} × λ^{n−1} e^{−λ y_sT} dλ
 = y_rT^{m−1} ∫_0^∞ λ^{n+m−1} e^{−λ(y_rT + y_sT)} dλ
 = y_rT^{m−1} Γ(n + m)/(y_rT + y_sT)^{n+m}
 ∝ y_rT^{m−1}/(y_rT + y_sT)^N,  y_rT > 0.
Hence
f(ȳ | D) ∝ (ȳ − nȳ_s/N)^{N−n−1} / ȳ^N,  ȳ > nȳ_s/N
(using the fact that y_rT = Nȳ − nȳ_s).
(b) In this case, inferences will be exactly the same as in (a). This is
because, even though the sampling mechanism is potentially nonignorable
due to f ( I | y , λ ) depending on a population value y5 , that value happens
to be known (since unit 5 is in the sample, i.e. 5 ∈ s ).
To clarify, we write
π_5 = π_5(y_5) = 0.3 + 0.6 I(y_5 > 1),
i.e. π_5 = 0.3 if y_5 < 1 and π_5 = 0.9 if y_5 > 1.
Thus
f(I | y, λ) = ∏_{i=1}^{N} π_i^{I_i} (1 − π_i)^{1−I_i}.
Therefore,
f(λ | D) ∝ f(λ, I, y_s) = ∫ f(λ, I, y_s, y_r) dy_r
 = ∫ f(λ) f(y_s, y_r | λ) f(I | y_s, y_r, λ) dy_r
 = f(λ) f(y_s | λ) × constant,
since f(I | y_s, y_r, λ) here depends only on y_5, which is part of y_s. This is
proportional to the posterior obtained in (a).
So f(I | y, λ) = ∏_{i=1}^{N} f(I_i | y, λ) is unknown.
Now,
f(λ | D) ∝ f(λ, I, y_s) = ∫ f(λ, I, y_s, y_r) dy_r
 = ∫ f(λ) f(y_s, y_r | λ) f(I | y_s, y_r, λ) dy_r
 ∝ f(λ) f(y_s | λ) W,
where
W = W(λ) = ∫ f(y_r | λ) f(I_4 | y_4) dy_r
 = ∏_{i∈r, i≠4} ∫_0^∞ f(y_i | λ) dy_i × ∫_0^∞ f(y_4 | λ) f(I_4 | y_4) dy_4
 = ∏_{i∈r, i≠4} 1 × ∫_0^∞ λ e^{−λ y_4}{0.7 − 0.6 I(y_4 > 1)} dy_4
   (since f(y_i | λ) = λ e^{−λ y_i} ∀ i)
 = 0.7 ∫_0^∞ λ e^{−λ y_4} dy_4 − 0.6 ∫_1^∞ λ e^{−λ y_4} dy_4
 = 0.7 × 1 − 0.6 e^{−λ}.
Thus
f(λ | D) ∝ λ^{n−1} e^{−λ y_sT}(7 − 6e^{−λ}) = 7λ^{n−1} e^{−λ y_sT} − 6λ^{n−1} e^{−λ(y_sT + 1)}.
Thus
f(λ | D) = c { (7/y_sT^n) × y_sT^n λ^{n−1} e^{−λ y_sT}/Γ(n)
 − (6/(y_sT + 1)^n) × (y_sT + 1)^n λ^{n−1} e^{−λ(y_sT+1)}/Γ(n) },
where
1 = ∫ f(λ | D) dλ = c { (7/y_sT^n)(1) − (6/(y_sT + 1)^n)(1) }
⇒ c = { 7/y_sT^n − 6/(y_sT + 1)^n }^{−1}.
First,
F(y_4 | D, λ) ∝ 7 ∫_0^{y_4} λ e^{−λt} dt,  0 < y_4 < 1
 ∝ 7 ∫_0^{y_4} λ e^{−λt} dt − 6 ∫_1^{y_4} λ e^{−λt} dt,  y_4 > 1
 = 7(1 − e^{−λ y_4}),  0 < y_4 < 1
 = 7(1 − e^{−λ y_4}) − 6(e^{−λ} − e^{−λ y_4}),  y_4 > 1.
Thus
F(y_4 | D, λ) = k(7 − 7e^{−λ y_4}),  0 < y_4 < 1
 = k(7 − e^{−λ y_4} − 6e^{−λ}),  y_4 > 1,
where k = k(λ) = 1/(7 − 6e^{−λ}), since 1 = F(y_4 = ∞ | D, λ) = k(7 − 6e^{−λ}).
Now let a = y_0 + y_4, where y_0 denotes the total of the other m − 1 nonsample
values, so that (y_0 | D, λ) ~ G(m − 1, λ). Then (a convolution)
F(a | D, λ) = ∫_0^a F(y_4 = a − y_0 | D, λ) f_{G(m−1,λ)}(y_0) dy_0
 = ∫_0^a k{7 − 7e^{−λ(a−y_0)}} f_{G(m−1,λ)}(y_0) dy_0,  0 < a < 1.
For 0 < a < 1, differentiating under the integral sign (the boundary terms
vanish because 7 − 7e^{−λ(a−a)} = 0),
f(a | D, λ) = d/da F(a | D, λ) = ∫_0^a 7kλ e^{−λ(a−y_0)} f_{G(m−1,λ)}(y_0) dy_0
 = 7kλ e^{−λa} ∫_0^a e^{λ y_0} {λ^{m−1} y_0^{m−2} e^{−λ y_0}/Γ(m − 1)} dy_0
 = 7kλ e^{−λa} {λ^{m−1}/(m − 2)!} ∫_0^a y_0^{m−2} dy_0
 = 7kλ e^{−λa} λ^{m−1} a^{m−1}/(m − 1)!
 = 7k λ^m a^{m−1} e^{−λa}/(m − 1)! = 7k f_{G(m,λ)}(a).
For a > 1 we write
F(a | D, λ) = ∫_0^{a−1} k{7 − e^{−λ(a−y_0)} − 6e^{−λ}} f_{G(m−1,λ)}(y_0) dy_0
 + ∫_{a−1}^{a} k{7 − 7e^{−λ(a−y_0)}} f_{G(m−1,λ)}(y_0) dy_0,
since y_4 = a − y_0 > 1 exactly when y_0 < a − 1. Differentiating in the same
way, the boundary terms at y_0 = a − 1 cancel (both equal
k(7 − 7e^{−λ}) f_{G(m−1,λ)}(a − 1)), and the remaining integral terms give
f(a | D, λ) = kλ e^{−λa} {λ^{m−1}/(m − 1)!} { (a − 1)^{m−1} + 7[a^{m−1} − (a − 1)^{m−1}] }
 = k { 7 λ^m a^{m−1} e^{−λa}/(m − 1)! − 6 λ^m (a − 1)^{m−1} e^{−λa}/(m − 1)! }
 = k { 7 f_{G(m,λ)}(a) − 6 e^{−λ} f_{G(m,λ)}(a − 1) }.
In summary so far,
f(a | D, λ) = k × 7 f_{G(m,λ)}(a),  0 < a < 1
 = k × { 7 f_{G(m,λ)}(a) − 6e^{−λ} f_{G(m,λ)}(a − 1) },  a > 1.
Check: Here,
∫ f(a | D, λ) da = k × { 7 F_{G(m,λ)}(1) + 7[1 − F_{G(m,λ)}(1)] − 6e^{−λ}[1 − F_{G(m,λ)}(0)] }
 = {1/(7 − 6e^{−λ})} × { 7 − 6e^{−λ}[1 − 0] } = 1
(which is correct).
f(ȳ | D, λ) = f_1(ȳ, λ) ≡ N k(λ) 7 f_{G(m,λ)}(Nȳ − nȳ_s)
 for nȳ_s/N < ȳ < (nȳ_s + 1)/N
f(ȳ | D, λ) = f_2(ȳ, λ) ≡ N k(λ) { 7 f_{G(m,λ)}(Nȳ − nȳ_s) − 6e^{−λ} f_{G(m,λ)}(Nȳ − nȳ_s − 1) }
 for ȳ > (nȳ_s + 1)/N,
where:
nȳ_s/N = 0.2857
(nȳ_s + 1)/N = 0.4286
k(λ) = 1/(7 − 6e^{−λ}) (as before).
We see that inferences under the length-biased sampling scheme in (c) are
lower than those under SRSWR in (a). This is because, generally
speaking, length bias makes larger units more likely to be selected, and
not adjusting for that bias naturally leads to inferences that are too high.
Note 1: In (a), (µ | D) ~ IG(n, y_sT), and therefore
E(µ | D) = y_sT/(n − 1) = 2/(2 − 1) = 2 (exactly).
# (a)
X11(w=8,h=4); par(mfrow=c(1,1))
N=7; ys=c(1.6,0.4); ysT=sum(ys); ysbar=mean(ys); n=length(ys); m=N-n
c(ysT,ysbar,n,m) # 2 1 2 5
fmufun=function(mu,n,ysT) dgamma(1/mu,n,ysT)/mu^2
integrate(fmufun,0, Inf,n=n,ysT=ysT)$value # 1 check
muv=seq(0.0001,20.0001,0.005); fmuv= fmufun(muv,n=n,ysT=ysT)
plot(muv,fmuv,type="l",xlim=c(0,20)) # check
integrate(function(mu,n,ysT) mu*fmufun(mu,n,ysT),
0,Inf,n=n,ysT=ysT)$value # 2 check (posterior mean of mu)
# (c)
c = 1 / ( 7/ysT^n - 6/(ysT+1)^n ); c # 0.9230769
flamfunc=function(lam,n,ysT,c) c*
( (7/ysT^n)*dgamma(lam,n,ysT) - (6/(ysT+1)^n)*dgamma(lam,n,ysT+1) )
integrate(flamfunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check
lamv=seq(0,20,0.01)
plot(lamv,flamfunc(lamv,n=n,ysT=ysT,c=c),type="l") # OK
fmufunc=function(mu,n,ysT,c) c*(1/mu^2)*
( (7/ysT^n)*dgamma(1/mu,n,ysT) - (6/(ysT+1)^n)*dgamma(1/mu,n,ysT+1) )
integrate(fmufunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check
integrate(function(mu,n,ysT,c) mu*fmufunc(mu,n,ysT,c),
0,Inf,n=n,ysT=ysT,c=c)$value # 1.384615 (posterior mean of mu)
fmuvc=fmufunc(mu=muv,n=n,ysT=ysT,c); plot(muv,fmuvc) # OK
f1fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
7*dgamma(N*ybar-ysT,m,lam)
f2fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
(7*dgamma(N*ybar-ysT,m,lam)-6*exp(-lam)*dgamma(N*ybar-ysT-1,m,lam) )
g1fun=function(ybar,n,N,m,ysT,c)
integrate(function(lam,ybar,n,N,m,ysT,c)
f1fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
0,Inf, ybar=ybar, n=n,N=N,m=m,ysT=ysT,c=c)$value
g2fun=function(ybar,n,N,m,ysT,c)
integrate(function(lam,ybar,n,N,m,ysT,c)
f2fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
0,Inf, ybar=ybar, n=n,N=N,m=m,ysT=ysT,c=c)$value
# Check:
g1fun(ybar=0.4,n,N,m,ysT,c) # 0.4119163 OK
g2fun(ybar=0.6,n,N,m,ysT,c) # 1.274185 OK
ybarv1=seq(ybarmin,ybarcut,length.out=400); fybarv1=ybarv1
for(j in 1:length(ybarv1)) fybarv1[j] =
g1fun(ybar=ybarv1[j],n=n,N=N,m=m,ysT=ysT,c=c)
plot(c(0,5),c(0,1.5),type="n")
lines(ybarv1, fybarv1,lty=1,lwd=2)
lines(ybarv2, fybarv2,lty=1,lwd=2) # OK
# Check
INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
# Integrates numerically under a spline through the points given by
# the vectors xvec and yvec, from a to b.
fit <- smooth.spline(xvec, yvec); spline.f <- function(x){predict(fit, x)$y }
integrate(spline.f, a, b)$value }
INTEG(seq(0,1,0.01),seq(0,1,0.01)^2,0,1) # 0.3333333 check
prob1=INTEG(ybarv1,fybarv1,ybarmin,ybarcut)
prob2=INTEG(ybarv2,fybarv2,ybarcut,10000)
c(prob1,prob2,prob1+prob2) # 0.02880659 0.97119399 1.00000058 OK
INTEG(c(ybarv1,ybarv2),c(fybarv1,fybarv2),ybarmin,10000) # 1.000004 OK
X11(w=8,h=6); par(mfrow=c(2,1))
plot(ybarv1, ybarv1* fybarv1, xlim=c(0,1)) # OK
plot(ybarv2, ybarv2* fybarv2, xlim=c(0,20)) # OK
f(λ) ∝ 1/λ,  λ > 0.
So
f(I, y_s, y_0, y_4, λ) ∝ (1/λ) × ∏_{i=1}^{n} λ e^{−λ y_i} × λ^{m−1} y_0^{m−2} e^{−λ y_0}
 × λ e^{−λ y_4} × {7 − 6 I(y_4 > 1)}.
The full conditional distributions are therefore:
1. f(λ | I, y_s, y_0, y_4) ∝ λ^{N−1} e^{−λ(y_sT + y_0 + y_4)}
 ⇒ (λ | I, y_s, y_0, y_4) ~ G(N, y_sT + y_0 + y_4)
2. f(y_0 | I, y_s, λ, y_4) ∝ y_0^{m−2} e^{−λ y_0}
 ⇒ (y_0 | I, y_s, λ, y_4) ~ G(m − 1, λ)
3. f(y_4 | I, y_s, λ, y_0) ∝ e^{−λ y_4}{7 − 6 I(y_4 > 1)},  y_4 > 0.
The first two of these three conditionals are straightforward and easy to sample
from. The third conditional can be sampled from via the inversion
technique as follows.
F(x | D, λ) = r(7 − 7e^{−λx}),  0 < x < 1
 = r(7 − e^{−λx} − 6e^{−λ}),  x > 1,
where r = 1/(7 − 6e^{−λ}). Now observe that F(x = 1) = (7 − 7e^{−λ})/(7 − 6e^{−λ}).
Let p ~ U(0, 1). First, if p < (7 − 7e^{−λ})/(7 − 6e^{−λ}), then we solve
p = r(7 − 7e^{−λx}) and thereby obtain
x = −(1/λ) log{1 − p/(7r)}.
Secondly, if p > (7 − 7e^{−λ})/(7 − 6e^{−λ}), then we solve
p = r(7 − e^{−λx} − 6e^{−λ}) and thereby obtain
x = −(1/λ) log{7 − 6e^{−λ} − p/r}.
Figure 9.4 displays trace plots for the three unknowns, λ , y0 , y4 , sample
ACFs for these over the last 20,000 iterations, and the three sample ACFs
again over the final samples of size J. Figure 9.5 is a histogram of the J
simulated values of µ = 1/ λ and Figure 9.6 is a histogram of the J
simulated values of ȳ = (y_sT + y_0 + y_4)/N. In each histogram are shown
a density estimate as well as three vertical lines for the Monte Carlo point
estimate and 95% CI for the mean.
The posterior mean of µ , i.e. E ( µ | D), was also estimated via Rao-
Blackwell as
µ̂ = (1/J) Σ_{j=1}^{J} (y_sT + y_0^{(j)} + y_4^{(j)})/(N − 1)
 = (1/J) Σ_{j=1}^{J} Nȳ^{(j)}/(N − 1) = 1.41.
So we define
e_j = E(ȳ | D, λ_j, y_4^{(j)}) = (1/N){ y_sT + y_4^{(j)} + (m − 1)/λ_j }.
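A minimal sketch of this Rao-Blackwell calculation, assuming lamvec and y4vec
hold the J retained draws of λ and y_4 (these vector names are assumptions,
not necessarily those used in the code below), is:
ev <- (ysT + y4vec + (m-1)/lamvec)/N    # e_j = E(ybar | D, lambda_j, y4_j)
mean(ev)                                # Rao-Blackwell estimate of E(ybar | D)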
Qfun = function(p=0.5,lam=1){
c1 = (7-7*exp(-lam))/(7-6*exp(-lam))
if(p <= c1) c2 = 1- (p/7) * (7-6*exp(-lam))
if(p > c1) c2 = 7 - 6*exp(-lam) - p*(7-6*exp(-lam))
-(1/lam)*log(c2) }
# Check:
pvec=seq(0,1,0.001); Qvec=pvec
for(i in 1:length(pvec)) Qvec[i] = Qfun(p=pvec[i],lam=1.3)
plot(pvec,Qvec); plot(Qvec,pvec) # OK
muhat=(N/(N-1))*ybarhat
muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 1.405272 1.343556 1.466989
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )
X11(w=8,h=5)
hist(muvec,prob=T, xlim=c(0,5),ylim=c(0,1),breaks=seq(0,80,0.1),
xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)
hist(ybarvec,prob=T, xlim=c(0,5),ylim=c(0,1.2),breaks=seq(0,80,0.1),
xlab="ybar", main=" ")
lines(density(ybarvec),lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)
f(L | y, λ) = ∏_{i=1}^{n} y_{L_i} / (y_T − Σ_{j=1}^{i−1} y_{L_j}),
 L = (L_1, ..., L_n) ∈ {(a_1, ..., a_n): a_1, ..., a_n are distinct elements of {1, ..., N}}
f(y | λ) = ∏_{i=1}^{N} λ e^{−λ y_i},  y_i > 0 ∀ i
f(λ) ∝ 1/λ,  λ > 0.
This pdf implies that units are selected from the finite population, one by
one and without replacement, in such a way that the probability of
selecting a unit on any given draw is its value divided by the sum of the
values of all units which have not yet been sampled at that point in time.
We call this procedure length-biased sampling without replacement.
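As a rough illustration of this mechanism, the following R sketch (illustrative only; the population values here are simulated) draws one length-biased sample without replacement.
# Illustrative sketch: length-biased sampling without replacement
set.seed(1)
y <- rexp(20)                    # hypothetical finite population values
n <- 5
L <- integer(n)                  # labels in order of selection
avail <- seq_along(y)            # units not yet sampled
for (i in 1:n) {
  probs <- y[avail]/sum(y[avail])            # selection prob. proportional to value
  pick  <- sample(length(avail), 1, prob = probs)
  L[i]  <- avail[pick]
  avail <- avail[-pick]
}
L                                # the ordered sample of labels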
( yrT | λ ) ~ G ( m, λ ) ,
the joint posterior density of λ and yrT (given the data, D = ( L, ys ) ) may
now be written as
f(λ, y_rT | D) ∝ f(λ, y_rT, y_s, L) = f(λ) f(y_s | λ) f(y_rT | λ) f(L | y_s, y_rT)
∝ (1/λ) × ∏_{i=1}^n λ e^{−λ y_i} × λ^m y_rT^{m−1} e^{−λ y_rT} × ∏_{i=1}^n 1/( y_i + ... + y_n + y_rT ).
This sample can then be used for Monte Carlo inference on the quantities
of interest, namely µ = 1/ λ and=y ( ysT + yrT ) / N .
Applying the above Gibbs sampler (with a suitable burn-in and thinning)
we obtained a random sample of size J = 2,000 from the joint posterior
distribution of λ , yrT and w = ( w1 ,..., wn ) .
where w_T^{(j)} = w_1^{(j)} + ... + w_n^{(j)}.
Figure 9.7 shows trace plots for λ , yrT and w1 , sample ACFs for these
quantities over the last 10,000 iterations, and these three sample ACFs
again but calculated using only the final smaller samples of size J = 2,000.
Figures 9.8 and 9.9 show two histograms, of the J simulated values of
µ = 1/λ and of the J simulated values of ȳ = ( y_sT + y_rT )/N.
In each histogram are shown a density estimate and three vertical lines
representing the Monte Carlo point estimate and 95% CI for the posterior
mean.
Figure 9.7 Trace plots and sample ACFs for samples obtained
via MCMC
GS = function(J=1000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7),
lam=1,yrT=1,w=rep(1,3)){
ysT=sum(ys); lamv=lam; yrTv=yrT; wmat=w; for(j in 1:J){
lam=rgamma(1,N,ysT+yrT);
yrT=rgamma(1,m,lam+sum(w))
for(i in 1:n) w[i] = rgamma(1,1,sum(ys[i:n]))
lamv=c(lamv,lam); yrTv=c(yrTv,yrT); wmat=rbind(wmat,w)
}
list(lamv=lamv, yrTv=yrTv, wmat=wmat)
}
set.seed(321); date()
res=GS(J=11000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7), lam=1,yrT=1,w=rep(1,3))
date() # took 4 secs
X11(w=8,h=9); par(mfrow=c(3,3));
lamv=res$lamv[-(1:1001)]; yrTv=res$yrTv[-(1:1001)];
wmat=res$wmat[-(1:1001),]
acf(lamv); acf(yrTv); acf(wmat[,1]) #
inc= seq(5,10000,5); lamvec=lamv[inc]; yrTvec=yrTv[inc]; wmatrix=wmat[inc,];
acf(lamvec); acf(yrTvec); acf(wmatrix[,1]) # OK
J = length(lamvec); J # 2000
muhat=(N/(N-1))*ybarhat
muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 0.6188547 0.6136692 0.6240401
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )
ybargrid=seq(0,10,0.01)
fybarhat= ybargrid; for(i in 1:length(ybargrid))
fybarhat[i] = mean( dgamma(N*ybargrid[i]-ysT, m, lamvec+wTvec )*N )
X11(w=8,h=5); par(mfrow=c(1,1))
hist(muvec,prob=T, xlim=c(0,3),ylim=c(0,2.5),breaks=seq(0,80,0.1),
xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)
hist(ybarvec,prob=T, xlim=c(0.3,1.2),ylim=c(0,7),breaks=seq(0,80,0.025),
xlab="ybar", main="")
lines(ybargrid, fybarhat,lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)
CHAPTER 10
Normal Finite Population Models
10.1 The basic normal-normal finite
population model
For convenience, we will in what follows label (or rather relabel) the
n sample units as 1,..., n and the m= N − n nonsample units as
n + 1,..., N . This convention simplifies notation and allows us to write
the finite population vector, originally defined by y = (y_1, ..., y_N), as
y = ( (y_1, ..., y_n), (y_{n+1}, ..., y_N) ) = (y_s, y_r).
where: µ_* = (1 − k)µ_0 + k ȳ_s (the posterior mean as a credibility estimate)
σ_*² = (σ²/n) k (the posterior variance), k = n/(n + σ²/σ_0²)
(the credibility factor and weight given to the MLE, ȳ_s).
f(y_r | y_s) = ∫ f(y_r | y_s, µ) f(µ | y_s) dµ.
c = E(ȳ | y_s) = E[ (n ȳ_s + m ȳ_r)/N | y_s ] = ( n ȳ_s + m E(ȳ_r | y_s) )/N
= ( n ȳ_s + m a )/N = ( n ȳ_s + m µ_* )/N
d² = V(ȳ | y_s) = V[ (n ȳ_s + m ȳ_r)/N | y_s ] = (m/N)² V(ȳ_r | y_s)
= (m²/N²) b² = (m²/N²)( σ_*² + σ²/m ).
The 1 − α CPDR for ȳ_r is (a ± z_{α/2} b).
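The quantities just defined are easy to compute numerically. The following R sketch is a minimal illustration only (not the book's code; the function name and the inputs in the example call are purely hypothetical) packaging the calculation of k, µ_*, σ_*², a, b², c, d² and the CPDR.
# Sketch: predictive inference on ybar in the normal-normal finite population model
nnfp <- function(ysbar, n, N, sig, mu0, sig0, alp = 0.05) {
  m  <- N - n
  k  <- n/(n + sig^2/sig0^2)                 # credibility factor
  mustar   <- (1 - k)*mu0 + k*ysbar          # posterior mean of mu
  sig2star <- k*sig^2/n                      # posterior variance of mu
  a  <- mustar;  b2 <- sig2star + sig^2/m    # predictive mean and variance of yrbar
  cc <- (n*ysbar + m*a)/N                    # predictive mean of ybar ('cc' avoids clashing with c())
  d2 <- (m^2/N^2)*b2                         # predictive variance of ybar
  cpdr <- cc + c(-1, 1)*qnorm(1 - alp/2)*sqrt(d2)
  list(k = k, mustar = mustar, a = a, b2 = b2, c = cc, d2 = d2, cpdr = cpdr)
}
nnfp(ysbar = 8, n = 3, N = 7, sig = 2, mu0 = 10, sig0 = 1.5)   # hypothetical inputs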
(b) What is the predictive distribution in the case of very weak prior
information?
(c) What is the predictive distribution in the case of very strong prior
information?
(d) What is the predictive distribution in the case of a very large sample
size?
In your graph indicate the predictive mean and 95% highest predictive
density region for the average of all seven values in the finite population.
This makes sense because if the sample data values are given ‘full
credibility’ then their straight average should intuitively be used to
estimate the finite population mean.
This also makes sense because if the sample data are given ‘zero
credibility’ then each of the N − n nonsampled values should
intuitively be estimated by the prior mean of the superpopulation mean
µ.
(b) In the case of very weak prior information we have (in the limit) that
σ_0 = ∞, hence k = 1, and hence q = 1. Consequently
(ȳ | y_s) ~ N( (1 − 1)µ_0 + 1 × ȳ_s , 1 × (σ²/n)(1 − n/N) ) ~ N( ȳ_s , (σ²/n)(1 − n/N) ).
Note: This is the same inference one would make via classical
techniques after substituting the sample standard deviation
s = √( (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ_s)² )
for σ.
(c) In the case of very strong prior information we have (in the limit)
that σ_0 = 0, hence k = 0, and hence q = n/N. Consequently,
(ȳ | y_s) ~ N( (1 − n/N)µ_0 + (n/N)ȳ_s , (n/N)(σ²/n)(1 − n/N) )
~ N( ( (N − n)µ_0 + n ȳ_s )/N , (σ²/N)(1 − n/N) ).
(d) In the case of a very large sample size we have (in the limit) that
n = ∞, hence k = 1, and hence q = 1. Consequently (just as in (b) for
the case of very weak prior information),
(ȳ | y_s) ~ N( (1 − 1)µ_0 + 1 × ȳ_s , 1 × (σ²/n)(1 − n/N) ) ~ N( ȳ_s , (σ²/n)(1 − n/N) ).
µ_0 = 10, σ_0 = 3/1.96 = 1.53064
µ ~ N(µ_0, σ_0²)
k = n/(n + σ²/σ_0²) = 0.63731
µ_* = (1 − k)µ_0 + k ȳ_s = 8.6404, σ_* = √(k σ²/n) = 0.9218141
(µ | y_s) ~ N(µ_*, σ_*²)
a = µ_* = 8.6404, b = √( σ_*² + σ²/m ) = 1.3601
(ȳ_r | y_s) ~ N(a, b²)
q = ( n + (N − n)k )/N = 0.79275
c = ( n ȳ_s + m µ_* )/N = (1 − q)µ_0 + q ȳ_s = 8.3088
d = √( m² b²/N² ) = (m/N) b = 0.77717
(ȳ | y_s) ~ N(c, d²).
X11(w=8,h=7); par(mfrow=c(1,1))
plot(c(4,15),c(0,0.6),type="n",xlab="mu, yrbar, ybar",
ylab="density, likelihood", main="")
v=seq(0,20,0.01)
lines(v,dnorm(v,ysbar,sig/sqrt(n)),lty=1,lwd=3,col="black")
# likelihood function (i)
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/m)),lty=3,lwd=2, col="blue")
# prior pdf of yrbar (iv); note the sqrt(), since dnorm() takes a standard deviation
lines(v,dnorm(v,a,sqrt(b2)),lty=3,lwd=3, col="blue")
# predictive pdf of yrbar (v)
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/N)),lty=4,lwd=2, col="green")
# prior pdf of ybar (vi)
lines(v,dnorm(v,c,sqrt(d2)),lty=4,lwd=3, col="green")
# predictive pdf of ybar (vii)
abline(v=c(c,HPDR),lty=1,lwd=1)
legend(3.8,0.6,c("(i) Likelihood","(ii) Prior","(iii) Posterior"),
lty=c(1,2,2), lwd=c(3,2,3), col=c("black","red","red"))
legend(10,0.6,c("(iv) Prior pdf of yrbar","(v) Predictive pdf for yrbar",
"(vi) Prior pdf of ybar","(vii) Predictive pdf for ybar"),
lty=c(3,3,4,4), lwd=c(2,3,2,3), col=c("blue","blue","green","green"))
text(12.5,0.38, "The thin vertical lines show the predictive")
text(12.5,0.345,"mean and 95% HPDR bounds for ybar")
We will continue to assume that the values in the population are all
(conditionally) normally distributed, and that the (conditional) variance
of each value in the finite population is known. We will now also
assume that all the covariance terms between these values are known.
(These assumptions will be relaxed at a later stage.)
x_i = (x_{i1}, ..., x_{ip})′
is the covariate vector for the ith population unit (i = 1, ..., N) and
X_j = (x_{1j}, ..., x_{Nj})′
is the population vector for the jth explanatory variable (j = 1, ..., p).
Also suppose that the finite population vector y has a known variance-
covariance structure in the form of an N by N positive definite matrix
Σ = [ σ_{11} ⋯ σ_{1N} ; ⋮ ⋱ ⋮ ; σ_{N1} ⋯ σ_{NN} ],
where: σ_{ij} = C(y_i, y_j) = σ_{ji}
σ_{ii} = V(y_i) ≡ σ_i²,
with the covariance and variance operations here (C and V) implicitly
conditional on all model parameters.
f(y) = ∫ f(y, β) dβ = ∫ f(y | β) f(β) dβ.
Thus, y ~ N_N( Xδ, Σ + XΩX′ ).
Thus, X_s = (x_1, ..., x_n)′ is the submatrix consisting of the first n rows of X, etc.
Note: We have here used the following result (e.g. see equation
(8a.2.11) in Rao, 1973): if
(X_1, X_2)′ ~ N_{n_1+n_2}( (µ_1, µ_2)′ , [ Σ_{11} Σ_{12} ; Σ_{21} Σ_{22} ] ),
then (X_2 | X_1) ~ N_{n_2}( µ_2 + Σ_{21} Σ_{11}^{−1}(X_1 − µ_1), Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12} ).
It follows that:
D = ( Ω^{−1} + X_s′ Σ_ss^{−1} X_s )^{−1}
β̂ = D( Ω^{−1} δ + X_s′ Σ_ss^{−1} y_s ).
It follows that:
E(y_r | y_s) = E{ E(y_r | y_s, β) | y_s }
= E{ X_r β + Σ_rs Σ_ss^{−1}( y_s − X_s β ) | y_s }
= X_r β̂ + Σ_rs Σ_ss^{−1}( y_s − X_s β̂ )   (10.5)
V(y_r | y_s) = E{ V(y_r | y_s, β) | y_s } + V{ E(y_r | y_s, β) | y_s }
= E{ Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr | y_s } + V{ X_r β + Σ_rs Σ_ss^{−1}( y_s − X_s β ) | y_s }
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + V{ ( X_r − Σ_rs Σ_ss^{−1} X_s ) β | y_s }
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′.   (10.6)
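Formulas (10.5) and (10.6) can be evaluated directly by matrix algebra. The following R sketch is a minimal illustration only (the inputs are assumed to be set up elsewhere, and Σ_sr is taken as the transpose of Σ_rs).
# Sketch: predictive mean and variance of yr given ys, per (10.5) and (10.6)
predict_yr <- function(ys, Xs, Xr, Sss, Srs, Srr, Omega, delta) {
  Sss_inv <- solve(Sss)
  D       <- solve(solve(Omega) + t(Xs) %*% Sss_inv %*% Xs)             # posterior variance of beta
  betahat <- D %*% (solve(Omega) %*% delta + t(Xs) %*% Sss_inv %*% ys)  # posterior mean of beta
  A       <- Xr - Srs %*% Sss_inv %*% Xs
  Er      <- Xr %*% betahat + Srs %*% Sss_inv %*% (ys - Xs %*% betahat) # (10.5)
  Vr      <- Srr - Srs %*% Sss_inv %*% t(Srs) + A %*% D %*% t(A)        # (10.6)
  list(mean = Er, var = Vr, betahat = betahat, D = D)
}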
Note: The expression for E* at (10.1) must be the same as that for
E ( yr | ys ) at (10.5), and likewise the expression for V* at (10.2) must
be the same as that for V ( yr | ys ) at (10.6). This equivalence can also
be shown with some algebra by making use of the formula
(Σ_ss + X_s Ω X_s′)^{−1} = Σ_ss^{−1}{ I_s − X_s( Ω^{−1} + X_s′ Σ_ss^{−1} X_s )^{−1} X_s′ Σ_ss^{−1} },
which in turn follows from the general matrix identity
( A − U W^{−1} V )^{−1} = A^{−1} + A^{−1} U ( W − V A^{−1} U )^{−1} V A^{−1}.
Note: Here, 1′r denotes the row vector with m= N − n ones. This
vector could also be written 1′m or 1′N −n or (1,...,1) .
V_* = ( Σ_rr + X_r Ω X_r′ ) − ( Σ_rs + X_r Ω X_s′ )( Σ_ss + X_s Ω X_s′ )^{−1}( Σ_sr + X_s Ω X_r′ )
= Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′.
Under this model it can be shown that the predictive distribution of the
finite population mean is given by
( y | ys ) ~ N ( A, B 2 ) ,
where:
A = (n/N) ȳ_s + (1 − n/N) x̄_r ( δσ² + ω² Σ_{i=1}^n y_i x_i^{1−2γ} ) / ( σ² + ω² Σ_{i=1}^n x_i^{2−2γ} )
B² = (σ²/N²)[ Σ_{i=n+1}^N x_i^{2γ} + m² ω² x̄_r² / ( σ² + ω² Σ_{i=1}^n x_i^{2−2γ} ) ]
x̄_r = (1/m) Σ_{i=n+1}^N x_i (the average of the covariate values in the nonsample)
x̄_s = (1/n) Σ_{i=1}^n x_i (the average of the covariate values in the sample).
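The following R sketch (illustrative only; the function and argument names are assumptions, with sig, om, gam and del standing for σ, ω, γ and δ) computes A and B² from the sample data and the full population covariate vector, whose first n entries are taken to be the sampled units.
# Sketch: predictive mean A and variance B^2 under the model above
AB <- function(ys, x, n, sig, om, gam, del) {
  N <- length(x); m <- N - n
  xs <- x[1:n]; xr <- x[(n + 1):N]
  denom <- sig^2 + om^2*sum(xs^(2 - 2*gam))
  bpost <- (del*sig^2 + om^2*sum(ys*xs^(1 - 2*gam)))/denom   # posterior mean of beta
  A  <- (n/N)*mean(ys) + (1 - n/N)*mean(xr)*bpost
  B2 <- (sig^2/N^2)*( sum(xr^(2*gam)) + m^2*om^2*mean(xr)^2/denom )
  c(A = A, B2 = B2)
}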
As regards this last special case, we see that the predictive mean A is
identical to the common design-based ratio estimator.
Note 1: If units with relatively large y-values are selected, then x̄_s will
likely be larger than x̄_r, so that x̄_r/x̄_s will likely be small, and thereby
B² = V(ȳ | y_s) = (σ²/n)(1 − n/N)( x̄_r/x̄_s ) x̄
will also likely be small.
Note 2: The same formulae as derived in the last special case will also
apply approximately when the sample size n is very large. This makes
sense because the effect of a very large sample size is the same as that
of a very diffuse prior. Note that in the case of a census, n = N and we
find that the above formulae correctly yield A = y s and B 2 = 0 .
D = ( Σ_{i=1}^n x_i σ_i^{−2} x_i )^{−1} = σ²/x_sT
β̂ = (σ²/x_sT) Σ_{i=1}^n x_i σ_i^{−2} y_i = (σ²/x_sT)(1/σ²) Σ_{i=1}^n x_i x_i^{−1} y_i = y_sT/x_sT = ȳ_s/x̄_s.
Next,
(y_r | y_s) ~ N_m( E_*, V_* ),
where:
m = N − n
E_* = X_r β̂ + Σ_rs Σ_ss^{−1}( y_s − X_s β̂ ) = ( x_{n+1}, ..., x_N )′ β̂ + 0 = ( x_{n+1}, ..., x_N )′ ( ȳ_s/x̄_s )
V_* = Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr + ( X_r − Σ_rs Σ_ss^{−1} X_s ) D ( X_r − Σ_rs Σ_ss^{−1} X_s )′
= σ² diag( x_{n+1}, ..., x_N ) − 0 + ( x_{n+1}, ..., x_N )′ ( σ²/x_sT ) ( x_{n+1}, ..., x_N )
= σ² [ diag( x_{n+1}, ..., x_N ) + (1/x_sT)( x_{n+1}, ..., x_N )′( x_{n+1}, ..., x_N ) ].
v_* = 1_r′ V_* 1_r / N²
= (σ²/N²) (1 ⋯ 1) [ diag( x_{n+1}, ..., x_N ) + (1/x_sT)( x_{n+1}, ..., x_N )′( x_{n+1}, ..., x_N ) ] (1 ⋯ 1)′
= (σ²/N²) [ ( x_{n+1} + ... + x_N ) + (1/x_sT)( x_{n+1} Σ_{i=n+1}^N x_i + ... + x_N Σ_{i=n+1}^N x_i ) ]
= (σ²/N²) [ x_rT + x_rT²/x_sT ] = ( x_rT σ²/N² )( x_sT + x_rT )/x_sT
= σ² × (1/N) × ( (N − n) x̄_r )/( n x̄_s ) × ( x_sT + x_rT )/N = (σ²/n)(1 − n/N)( x̄_r/x̄_s ) x̄.
Let n0 denote the number of covariate values xi in the sample (of size n)
which are 0, and let n1 be the number which are 1. Likewise, let m0
denote the number of covariate values xi in the nonsample (of size
m= N − n ) which are 0, and let m1 be the number which are 1.
(Thus, in each of the sample and nonsample vectors, place the values
with covariate 0 first, and place the values with covariate 1 last.)
Then
( β | ys ) ~ N p ( βˆ , D ) ,
where:
D = 1/( n_0 σ_0^{−2} + n_1 σ_1^{−2} )
Note: Here, y_{s0T} = Σ_{i=1}^{n_0} y_i denotes the total of the sample values whose covariate value equals 0, etc.
Thus (ȳ | y_s) ~ N( e_*, v_* ), where:
e_* = ( y_sT + 1_r′ E_* )/N = (1/N){ y_sT + 1_r′ 1_r β̂ } = (1/N){ y_sT + m β̂ }
v_* = 1_r′ V_* 1_r / N² = (1/N²)( m_0 σ_0² + m_1 σ_1² − D m² ).
We see that:
n_0 = 3, n_1 = 2, m_0 = 4, m_1 = 8
y_{s0T} = 2.1 + 2.0 + 2.3 = 6.4, y_{s1T} = 4.9 + 0.2 = 5.1,
y_sT = 6.4 + 5.1 = 11.5
ȳ_{s0} = 6.4/3 = 2.1333, ȳ_{s1} = 5.1/2 = 2.55, ȳ_s = 11.5/5 = 2.3.
options(digits=4)
sig0=0.08; sig1=1.2; ys = c(2.1,2.0,2.3,4.9,0.2); n=length(ys)
xs=c(0,0,0,1,1); xr = c(0,0,0,0,1, 1,1,1,1,1, 1,1); m=length(xr); N = n+m
n1=sum(xs); n0=n-n1; m1=sum(xr); m0=m-m1
c(n,n0,n1, m,m0,m1, N) # 5 3 2 12 4 8 17
ysT=sum(ys); ys1T=sum(ys*xs); ys0T=ysT-ys1T
ysbar=ysT/n; ys1bar=ys1T/n1; ys0bar=ys0T/n0
c(ys0T,ys1T,ysT, ys0bar,ys1bar,ysbar)
# 6.400 5.100 11.500 2.133 2.550 2.300
D = 1/( n0/ sig0^2 + n1/ sig1^2 ); betahat = D*(ys0T/ sig0^2 + ys1T/ sig1^2 )
estar=(1/N)*( ysT+m*betahat );
vstar=(1/N^2)*(m0* sig0^2+m1* sig1^2-D*m^2)
c(D,betahat,estar,vstar) # 0.002127 2.134564 2.183222 0.038890
hpdr=estar+c(-1,1)*qnorm(0.975)*sqrt(vstar); c(hpdr) # 1.797 2.570
Then the result is the same as the classical design-based CI one would
use in the same situation of a large sample size.
However, this strategy will not work well generally. For example, if n is
small then it will lead to an interval which has a frequentist coverage
well below the intended level of 1 − α . In such cases, the problem could
be addressed to some extent by applying an adjustment which reflects
uncertainty regarding the unknown variance parameter. However, the
nature of this type of adjustment would be ad hoc and could possibly lead
to other problems with the inference.
Perhaps the best way to deal with uncertainty regarding the variance
parameter is to incorporate it into the finite population model as yet
another random variable with its own prior distribution, i.e. to add
another level to the hierarchical structure of that model. This is the
approach we will now take. Note that parts of the exposition below will
be a review of material already covered in previous chapters.
f(y_r | y_s) = ∫∫ f(y_r, β, λ | y_s) dβ dλ ∝ ∫∫ f(y, β, λ) dβ dλ,   (10.7)
where f(y, β, λ) = f(λ) f(β | λ) f(y | β, λ)
∝ λ^{η−1} e^{−τλ} × exp{ −(1/2)(β − δ)′ Ω^{−1}(β − δ) }
× λ^{N/2} exp{ −(λ/2)(y − Xβ)′ Σ^{−1}(y − Xβ) }
is the joint density of all random variables involved in the model,
namely the N finite population values, y_1, ..., y_N, and the p + 1 model
parameters, namely λ, β_1, ..., β_p.
Using these two distributions, one can solve for the predictive density of
the finite population mean via the identity
f(ȳ | y_s) = ∫ f(ȳ, λ | y_s) dλ = ∫ f(ȳ | y_s, λ) f(λ | y_s) dλ.
We see that
M = X_s′ Σ_ss^{−1} X_s and MT = X_s′ Σ_ss^{−1} y_s,
so that
T = M^{−1}(MT) = ( X_s′ Σ_ss^{−1} X_s )^{−1} X_s′ Σ_ss^{−1} y_s.
Thus
f(λ | y_s) ∝ ∫ λ^{η−1} e^{−τλ} × 1 × λ^{n/2} exp{ −(λ/2)[ (β − T)′ M (β − T) + R ] } dβ
= λ^{η+n/2−1} exp{ −λ( τ + R/2 ) } × I,
where
I = ∫ exp{ −(1/2)(β − T)′ ( M^{−1}/λ )^{−1} (β − T) } dβ
= (2π)^{p/2} det( M^{−1}/λ )^{1/2}
(using standard multivariate normal theory)
∝ λ^{−p/2} (since M = X_s′ Σ_ss^{−1} X_s is a p by p matrix).
It follows that
f(λ | y_s) ∝ λ^{η+(n−p)/2−1} exp{ −λ( τ + R/2 ) } = λ^{A/2−1} exp{ −(B/2) λ },
where:
A = 2η + n − p, B = 2τ + R, R = ( y_s − X_s T )′ Σ_ss^{−1} ( y_s − X_s T ).
G = Σ_rr − Σ_rs Σ_ss^{−1} Σ_sr, A = X_r − Σ_rs Σ_ss^{−1} X_s.
Therefore
f(ȳ | y_s) = ∫ f(ȳ | y_s, λ) f(λ | y_s) dλ
∝ ∫ λ^{1/2} exp{ −(λ/(2w_0))(ȳ − e_0)² } × λ^{A/2−1} exp{ −(B/2) λ } dλ
= ∫ λ^{(A+1)/2−1} exp{ −λ[ B/2 + (ȳ − e_0)²/(2w_0) ] } dλ
∝ [ B/2 + (ȳ − e_0)²/(2w_0) ]^{−(A+1)/2} ∝ [ 1 + (ȳ − e_0)²/(B w_0) ]^{−(A+1)/2}
∝ [ 1 + (1/A)( (ȳ − e_0)/√(B w_0/A) )² ]^{−(A+1)/2}.
It follows that
( (ȳ − e_0)/h_0 | y_s ) ~ t(A), where h_0² = B w_0/A.
h_0² = (B/A) w_0 = ( (2τ + R)/(2η + n − p) ) × ( 1_r′ V_0 1_r )/N²
With such a sample we can, for example, estimate y’s predictive mean,
namely yˆ = E ( y | ys ) , by the average of y (1) ,..., y ( J ) , and estimate y’s
95% CPDR by the empirical 0.025 and 0.975 quantiles of ȳ^{(1)}, ..., ȳ^{(J)}.
This then raises the question of how the Monte Carlo sample can be
obtained. In this context, we may employ the method of composition via
the equation
f ( y , β , λ | ys ) = f ( y | ys , β , λ ) f ( β , λ | ys ) .
Thus, we first generate a sample from the joint posterior distribution of the two parameters,
( β^{(1)}, λ^{(1)} ), ..., ( β^{(J)}, λ^{(J)} ) ~ iid f(β, λ | y_s).
This in turn raises the question of how to obtain the sample from
f ( β , λ | ys ) . In this case an ideal solution is to apply a Gibbs sampler
defined by the following conditional distributions:
1. ( β | y_s, λ ) ~ N_p( β, D ),
where: β = D( Ω^{−1} δ + λ X_s′ Σ_ss^{−1} y_s )
D = ( Ω^{−1} + λ X_s′ Σ_ss^{−1} X_s )^{−1}
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ).
Note: The first of these distributions derives directly from the normal-
normal finite population model with Σ sr and Σ ss replaced by Σ sr / λ
and Σ ss / λ , etc.
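A minimal R sketch of this Gibbs sampler is given below (not the book's code; the inputs Xs, ys, Sss, Omega, delta, eta and tau are assumed to be set up elsewhere, with Xs of dimension n by p).
# Sketch: Gibbs sampler for (beta, lambda) using conditionals 1 and 2 above
gibbs_betalam <- function(J, ys, Xs, Sss, Omega, delta, eta, tau, lam = 1) {
  n <- length(ys); p <- ncol(Xs)
  Sss_inv <- solve(Sss); Om_inv <- solve(Omega)
  betamat <- matrix(NA, J, p); lamv <- rep(NA, J)
  for (j in 1:J) {
    Dj   <- solve(Om_inv + lam*t(Xs) %*% Sss_inv %*% Xs)
    bj   <- Dj %*% (Om_inv %*% delta + lam*t(Xs) %*% Sss_inv %*% ys)
    beta <- as.vector(bj + t(chol(Dj)) %*% rnorm(p))      # draw beta | ys, lambda
    resid <- ys - Xs %*% beta
    rate  <- tau + 0.5*as.numeric(t(resid) %*% Sss_inv %*% resid)
    lam   <- rgamma(1, eta + n/2, rate)                   # draw lambda | ys, beta
    betamat[j, ] <- beta; lamv[j] <- lam
  }
  list(beta = betamat, lam = lamv)
}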
f(λ | y_s, β) ∝ f(λ, β, y_s) = f(λ) f(β | λ) f(y_s | λ, β)
∝ λ^{η−1} e^{−τλ} × exp{ −(1/2)(β − δ)′ Ω^{−1}(β − δ) }
× λ^{n/2} exp{ −(λ/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) }
∝ λ^{η+n/2−1} exp{ −λ[ τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ] }.
Find the predictive mean and 95% central predictive density region for
the finite population mean y in each of the following scenarios.
(a) There are no covariates, the population values are conditionally iid
and there is no prior information available regarding the model
parameters.
(c) There are no covariates, the population values are conditionally iid,
the prior on the normal mean is normal with mean 10 and variance 2.25,
and (independently) the prior on the normal precision parameter (inverse
of the normal variance) is gamma with mean 2 and variance 1/2 (or
equivalently, gamma with parameters 8 and 4).
Note: This inference is lower than that in (a) because the mean of the
covariate values in the nonsample is 4.7, which is much lower than
their mean in the sample, 9.58. The regression coefficient β in our
model is estimated as 0.5365, reflecting the positive linear relationship
between the x and y values in the sample.
(c) In this case, a good option is to first employ the Gibbs sampler to
generate a random sample from the joint posterior distribution of β and
λ , with:
p = 1, δ = 10, Ω =9 , η = 8, τ = 4 , X = 1N , Σ =diag (1N ) .
1. ( β | y_s, λ ) ~ N_p( β, D ),
where:
β = D( Ω^{−1} δ + λ X_s′ Σ_ss^{−1} y_s )
D = ( Ω^{−1} + λ X_s′ Σ_ss^{−1} X_s )^{−1}
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (1/2)( y_s − X_s β )′ Σ_ss^{−1}( y_s − X_s β ) ).
1. ( β | y_s, λ ) ~ N( β_λ, σ_λ² ),
where:
β_λ = (1 − k_λ) β_0 + k_λ ȳ_s
σ_λ² = k_λ/(nλ), k_λ = n/( n + 1/(λ σ_0²) )
β_0 = 10, σ_0 = 3
2. ( λ | y_s, β ) ~ G( η + n/2 , τ + (n/2) s_β² ),
where
s_β² = (1/n) Σ_{i=1}^n ( y_i − β )².
Either way, implementing this Gibbs sampler for 10,100 iterations with a
burn-in of 100 we obtain the trace plots and histograms for β and λ in
Figure 10.2. (The two subplots on the left are for β , and the two on the
right are for λ . The histograms do not include the first 100 iterations.)
The sample ACFs over the entire sample of 10,000 and over the thinned
sample of 1,000 are shown for each of β and λ in Figure 10.3. (E.g. the
top-left subplot is for β over the entire sample of 10,000.) The thinning
process has virtually eliminated all signs of autocorrelation.
Using our sample from the joint posterior of the two parameters, we now
generate a sample from the predictive distribution of the nonsample
mean by drawing
ȳ_r^{(j)} ~ f( ȳ_r | y_s, β_j, λ_j ) ~ N( β_j , 1/((N − n) λ_j) ) for each j = 1, ..., J.
We then calculate
ȳ^{(j)} = (1/N)( n ȳ_s + (N − n) ȳ_r^{(j)} ) for each j = 1, ..., J.
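In R this composition step might look as follows (a sketch only; betvec and lamvec are assumed to hold the J posterior draws of β and λ, and ysbar, n and N are as defined above).
# Sketch: method of composition for ybar
yrbarvec <- rnorm(length(betvec), mean = betvec, sd = 1/sqrt((N - n)*lamvec))
ybarvec  <- (n*ysbar + (N - n)*yrbarvec)/N
mean(ybarvec); quantile(ybarvec, c(0.025, 0.975))    # point estimate and 95% CPDR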
We also estimate the 95% CPDR for y by (4.685, 6.633), where the
bounds of this interval are the empirical 0.025 and 0.975 quantiles of
y (1) ,..., y ( J ) .
f(ȳ | y_s) = ∫∫ f(ȳ, β, λ | y_s) dβ dλ
= ∫∫ f(ȳ | y_s, β, λ) f(β, λ | y_s) dβ dλ
ŷ = E(ȳ | y_s) = E_{β,λ}{ E(ȳ | y_s, β, λ) | y_s }
f(ȳ | y_s) = E_{β,λ}{ f(ȳ | y_s, β, λ) | y_s }.
So we now define:
e(β, λ) = E(ȳ | y_s, β, λ)
= (1/N)( n ȳ_s + (N − n) E(ȳ_r | y_s, β, λ) )
= (1/N)( n ȳ_s + (N − n) β )
v(β, λ) = V(ȳ | y_s, β, λ)
= ( (N − n)²/N² ) V(ȳ_r | y_s, β, λ)
= ( (N − n)²/N² ) × 1/((N − n)λ) = (N − n)/(N² λ)
e_j = e(β_j, λ_j) = (1/N)( n ȳ_s + (N − n) β_j )
v_j = v(β_j, λ_j) = (N − n)/(N² λ_j).
We can now also obtain the Rao-Blackwell estimate of the CPDR for y .
Here U satisfies
∫_{−∞}^U (1/J) Σ_{j=1}^J ( 1/√(2π v_j) ) exp{ −(1/(2 v_j))(ȳ − e_j)² } dȳ = 0.975,
that is, (1/J) Σ_{j=1}^J P( X_j ≤ U ) = 0.975, where X_j ~ N(e_j, v_j); equivalently, L satisfies
(1/J) Σ_{j=1}^J Φ( (L − e_j)/√v_j ) = 0.025 (where Φ is the standard normal cdf).
Note: We could also obtain L and U using trial and error or the
Newton-Raphson algorithm.
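For example, the bounds may be obtained numerically with uniroot(), as in the following sketch (assuming evec and vvec hold the e_j and v_j values defined above).
# Sketch: Rao-Blackwell CPDR bounds for ybar via uniroot()
rbcdf <- function(t) mean(pnorm((t - evec)/sqrt(vvec)))     # averaged normal cdfs
intv  <- range(evec) + c(-10, 10)*sqrt(max(vvec))           # a safely wide search interval
L <- uniroot(function(t) rbcdf(t) - 0.025, intv)$root
U <- uniroot(function(t) rbcdf(t) - 0.975, intv)$root
c(L, U)                                                     # estimated 95% CPDR for ybar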
# (a)
# (b)
b=sqrt(b2); cpdr=a+c(-1,1)*qt(1-alp/2,c)*b
list(a=a,b=b,c=c,beta=beta, cpdr=cpdr)
}
# (c)
betbar=mean(betvec); betci=betbar+c(-1,1)*qnorm(0.975)*sd(betvec)/sqrt(J)
c(betbar,betci) # 5.766 5.731 5.801
X11(w=8,h=7); par(mfrow=c(1,1))
hist(ybarvec,prob=T,nclass=20,xlim=c(3.5,8),
xlab="ybar",ylab="density/relative frequency",main="")
lines(density(ybarvec),lty=2,lwd=3,col="blue")
abline(v=c(ybarbar,ybarci,ybarcpdr),lty=2,lwd=3,col="blue")
ybarv=seq(3,8,0.01); fv=rep(NA,length(ybarv))
for(i in 1:length(ybarv)) fv[i] = mean(dnorm(ybarv[i], evec, sqrt(vvec)))
lines(ybarv,fv,lty=1,lwd=2,col="red")
abline(v=c(ebar,eci,ecpdr),lty=1,lwd=2,col="red")
legend(3.4,0.9,c("Histogram","Rao-Blackwell"),
lty=c(2,1), lwd=c(3,2),col=c("blue","red"), bg="white")
CHAPTER 11
Transformations and Other Topics
11.1 Inference on complicated quantities
Step 2. Use the sample in Step 1 to generate a random sample from the
predictive distribution of the nonsample vector yr = ( yn +1 ,..., y N ) , that is
yr(1) ,..., yr( J ) ~ iid f ( yr | D ) , where yr( j ) = ( yn( +j )1 ,..., y N( j ) ) .
Often, the sample can be obtained easily via the method of composition
and the identity
f ( yr , θ | D ) = f ( yr | D, θ ) f (θ | D ) ,
namely by sampling
yr( j ) ~ f ( yr | D, θ ( j ) )
for each j = 1,..., J .
In many cases, each sampled nonsample vector y_r^{(j)} here can be obtained
by sampling
y_i^{(j)} ~ ⊥ f( y_i | D, θ^{(j)} ), i = n + 1, ..., N,
and then forming the vector according to
y_r^{(j)} = ( y_{n+1}^{(j)}, ..., y_N^{(j)} ).
We may then estimate the predictive mean
ψ̂ = E(ψ | D) = ∫ ψ f(ψ | D) dψ
(which may be impossible to obtain analytically) by the Monte Carlo
sample mean ψ̄ = (1/J) Σ_{j=1}^J ψ^{(j)} (which is unbiased, in that E(ψ̄ | D) = ψ̂).
The precision of ψ̄ may be gauged in the usual way, via the Monte Carlo
variance estimate ( 1/(J(J − 1)) ) Σ_{j=1}^J ( ψ^{(j)} − ψ̄ )².
(a) Suppose that 2.1, 5.2, 3.0, 7.7 and 9.3 constitute a random sample
from a normal finite population of size 20 whose mean and variance are
unknown. We are interested in the finite population median. Estimate
this quantity using a suitable Bayesian model.
For the purposes of this exercise, let y( i ) denote the ith finite population
order statistic, meaning the ith value amongst y1 ,..., y N after these are
ordered from smallest to largest. We are interested in three finite
population quantities, as follows:
(a) ψ_1 = g_1(y, θ) = g_1(y) = ( y_(N/2) + y_(N/2+1) )/2
(b) ψ_2 = g_2(y, θ) = g_2(y) = (100/N) Σ_{i=2}^N [ ( y_(i) − y_(i−1) )/y_(i−1) ] I( y_(i−1) > 4 )
Note 1: The median ψ_1 is the average of the middle two values, since
N = 20 is even.
Step 1. Generate λ_1, ..., λ_J ~ iid f(λ | D) ~ G( (n − 1)/2 , ((n − 1)/2) s_s² ),
where s_s² = ( 1/(n − 1) ) Σ_{i=1}^n ( y_i − ȳ_s )².
(This step derives from results for the normal-normal-gamma model.)
Step 2. Generate µ_j ~ f(µ | D, λ_j) ~ N( ȳ_s , 1/(n λ_j) ) for each j = 1, ..., J.
(This step derives from results for the normal-normal model.)
Step 4. Use the values ψ (1) ,...,ψ ( J ) ~ iid f (ψ | D ) for Monte Carlo
inference on ψ in the usual way.
Step 2′. Generate λ_j ~ ⊥ f(λ | D, µ_j) ~ G( n/2 , (n/2) s_{µ_j}² ), where
s_{µ_j}² = (1/n) Σ_{i=1}^n ( y_i − µ_j )².
Applying the above four-step procedure (using the original Steps 1 and
2) with Monte Carlo sample size J = 1,000, we obtain Table 11.1 which
shows numerical estimates for the three quantities of interest:
ψ = ψ 1 , ψ 2 and ψ 3 , respectively. Figure 11.1 shows histograms which
illustrate these inferences.
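The following R sketch (illustrative only; results will differ slightly from those reported, owing to Monte Carlo error) implements Steps 1 and 2 above and the simulation of ψ_1, the finite population median, for the data in (a).
# Sketch: Monte Carlo inference on the finite population median psi_1
ys <- c(2.1, 5.2, 3.0, 7.7, 9.3); n <- length(ys); N <- 20; J <- 1000
set.seed(1)
lamv <- rgamma(J, (n - 1)/2, (n - 1)*var(ys)/2)   # Step 1: lambda | D
muv  <- rnorm(J, mean(ys), 1/sqrt(n*lamv))        # Step 2: mu | D, lambda
psi1 <- rep(NA, J)
for (j in 1:J) {
  yr <- rnorm(N - n, muv[j], 1/sqrt(lamv[j]))     # simulate the nonsample values
  yy <- sort(c(ys, yr))
  psi1[j] <- (yy[N/2] + yy[N/2 + 1])/2            # median of all N values
}
mean(psi1); quantile(psi1, c(0.025, 0.975))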
Table 11.1 and Figure 11.1 also contain analogous results for a fourth
quantity of interest which may be defined as
ψ_4 = g_4(y, θ) = ( ψ_3 | ψ_3 ≠ 0 )
= Σ_{i=1}^N y_i I( y_i > µ + Φ^{−1}(0.75)/√λ ) / Σ_{i=1}^N I( y_i > µ + Φ^{−1}(0.75)/√λ ),
provided the denominator is greater than 0.
This is because it might be the case that the upper quartile of the
normal distribution is negative and many of the finite population
values happen (by a very small chance) to lie between that negative
quartile and zero.
Table 11.1 (column headings): Quantity of interest: ψ_1, ψ_2, ψ_3, ψ_4 = ( ψ_3 | ψ_3 ≠ 0 )
options(digits=4)
X11(w=9,h=6.5); par(mfrow=c(2,1))
psivec=psi1vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 5.842 5.790 5.893 5.528 5.769 4.308 7.528
hist(psivec, prob=T, xlab="psi1",xlim=c(0,10),ylim=c(0,0.6),
breaks=seq(0,10,0.25), main="Monte Carlo inference on psi1")
lines(fpsi,lwd=3)
abline(v= c(psibar, psici, psicpdr, psimedian, psimode) ,
lty=c(1,1,1,1,1,2,2), lwd=rep(2,7))
legend(0,0.6,
c("Posterior mean, 95% CI \n & 95% CPDR","Posterior mode & median"),
lty=c(1,2), lwd=c(2,2), bg="white")
psivec=psi2vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 9.975 9.775 10.175 8.150 9.377 5.522 17.770
length(psi3vec[psi3vec!=0]) # 960
length(psi3vec[psi3vec==0]) # 40 40/1000 = 4%
psivec=psi3vec[psi3vec!=0]; J=length(psivec); J # 960 Condition on psi > 0
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar,psici,psimode,psimedian,psicpdr)
# 60.74 58.99 62.49 62.45 60.59 11.72 114.96
Finally, we calculate
ψ ( j) = g( y( j) )
for each j = 1,..., J .
This results in
ψ (1) ,...,ψ ( J ) ~ iid f (ψ | D ) ,
namely a sample from the predictive distribution of the finite population
quantity of interest, on the original scale required for that quantity. This
sample can then be used for Monte Carlo inference on ψ in the usual
way.
28.374, 69.857, 22.721, 57.593, 126.965, 17.816, 16.078, 0.803, 3.164, 3.544,
2.123, 2.353, 184.539, 59.856, 63.701, 585.684, 29.094, 79.245, 18.105, 1.623,
5.513, 1.629, 63.654, 22.060, 187.463, 5.051, 34.299, 27.475, 0.746, 34.016,
8.547, 1.081, 3.151, 55.569, 2.593, 522.377, 1.660, 130.435, 1.246, 169.462,
3.444, 6.376, 18.735, 51.312, 33.920, 350.346, 475.795, 4.972, 24.451, 86.987.
We create a histogram of the sample values and see that the underlying
distribution is highly right skewed. However, a histogram of the natural
logarithm of the sample values is consistent with a normal
superpopulation model. The histograms are shown in Figure 11.2.
The data is D = (s, z_s) = ( (1, ..., 50), (28.374, 69.857, ..., 86.987) ) (after a
convenient ordering), and the quantity of interest is
ȳ = (1/N) Σ_{i=1}^N y_i = g(z) = (1/N) Σ_{i=1}^N h^{−1}(z_i) = (1/N) Σ_{i=1}^N exp(z_i).
So we generate
( µ1 , λ1 ),...,( µ J , λJ ) ~ iid f ( µ , λ | D )
(using methods detailed previously).
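A minimal R sketch of this transformation approach follows (muv and lamv are assumed to hold the J posterior draws of µ and λ obtained from the logged data, and ys, n, m and N are as defined above).
# Sketch: prediction of ybar via the log transformation
J <- length(muv); yrbarvec <- rep(NA, J)
for (j in 1:J) {
  zr <- rnorm(m, muv[j], 1/sqrt(lamv[j]))    # nonsample values on the log scale
  yrbarvec[j] <- mean(exp(zr))               # back-transform, then average
}
ybarvec <- (n*mean(ys) + m*yrbarvec)/N
mean(ybarvec); quantile(ybarvec, c(0.025, 0.975))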
We also estimate the bounds of the 95% CPDR for y by 49.26 and
302.05, where these are the empirical 0.025 and 0.975 quantiles of
y (1) ,..., y ( J ) .
Discussion
Figure 11.4 shows histograms of the values z_1, ..., z_N which were in fact
drawn from the normal distribution with mean 3 and standard deviation
2 (left plot), and of the values y_1 = exp(z_1), ..., y_N = exp(z_N) (right plot),
together with the true underlying superpopulation densities of the
variables z_i and y_i.
Figure 11.5 shows the original data values (untransformed) and both sets
of inferences above. It highlights the value of performing an appropriate
prior transformation for purposes of estimating the finite population
mean.
X11(w=8,h=4); par(mfrow=c(1,2))
hist(Z,prob=T,xlim=c(-4,10), ylim=c(0,0.25),breaks=seq(-3,12,0.5))
lines(seq(-5,12,0.01),dnorm(seq(-5,12,0.01),3,2),lwd=3)
hist(Y,prob=T,xlim=c(0,600),ylim=c(0,0.08), breaks=seq(0,5000,10));
yg=seq(0.1,700,0.5); lines(yg ,dnorm( log(yg),3,2)/yg, lwd=3)
# Look at given data and the log of that data (load data etc.) ------------------
N = 200; n = 50; m = N-n; options(digits=4)
ys = c( 28.374, 69.857, 22.721, 57.593, 126.965,
17.816, 16.078, 0.803, 3.164, 3.544,
2.123, 2.353, 184.539, 59.856, 63.701,
585.684, 29.094, 79.245, 18.105, 1.623,
5.513, 1.629, 63.654, 22.060, 187.463,
5.051, 34.299, 27.475, 0.746, 34.016,
8.547, 1.081, 3.151, 55.569, 2.593,
522.377, 1.660, 130.435, 1.246, 169.462,
3.444, 6.376, 18.735, 51.312, 33.920,
350.346, 475.795, 4.972, 24.451, 86.987)
summary(ys)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.7 3.5 23.6 74.2 63.7 586.0
zs=log(ys); par(mfrow=c(1,2))
hist(ys,prob=T); hist(zs,prob=T) # preliminary plots
hist(ys,prob=T,xlim=c(0,600),ylim=c(0,0.045),
breaks=seq(0,700,10), main="Sample values");
hist(zs,prob=T,xlim=c(-2,8), ylim=c(0,0.35),
breaks=seq(-3,10,0.5), main="Log of sample values");
summary(ybarvec)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 37.0 70.6 89.4 111.0 122.0 2080.0
ybarvec=(1/N)*(n*ysbar+m*yrbarvec); ybarhat=mean(ybarvec)
ybarci=ybarhat+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J)
ybarcpdr=quantile(ybarvec,c(0.025,0.975))
inf.transform = c(ybarhat,ybarci,ybarcpdr)
c(inf.transform,YBAR) # 11.006 10.904 11.108 8.478 15.016 11.698
X11(w=8,h=4); par(mfrow=c(1,1))
hist(ys,prob=T) # preliminary plot
hist(ys,prob=T,xlim=c(0,40),ylim=c(0,0.2), breaks=seq(0,40,1), main=" ");
abline(v=inf.original,lty=2,lwd=2); abline(v=inf.transform,lty=1,lwd=2)
points(YBAR,0,pch=16)
legend(20,0.2,
c("Inference using original scale", "Inference using log transformation"),
lty=c(2,1),lwd=c(2,2))
text(30,0.1,
"The dot shows 11.7, the true value \nof the finite population mean")
Now suppose that in the context of this general model, data and quantity
of interest, we derive a point estimate for ψ (such as the posterior mean,
mode or median) of the form
ψˆ = ψˆ ( D )
and a 1 − α interval estimate for ψ (such as the CPDR or HPDR) of the
form
I = (L, U) = I(D) = ( L(D), U(D) ).
Also, we define:
E_y( ȳ_s | θ, s ) = ∫ ȳ_s f(y | θ, s) dy.
Now, in this case,
f(s | y, θ) = f(s) = C(N, n)^{−1} (the reciprocal of the binomial coefficient) for all s = (1, ..., n), ..., (N − n + 1, ..., N),
so that
f(y | θ, s) ∝ f(y, θ, s) = f(s | y, θ) f(y | θ) f(θ) ∝ 1 × f(y | θ) × 1,
and therefore
f(y | θ, s) = f(y | θ) = f( y_r, y_s | θ ) = f( y_r | θ, y_s ) f( y_s | θ ),
with s fixed at its observed value.
Therefore B_{θ,s} = E( ȳ_s | θ ) − θ = E( ȳ_s − θ | θ ).
We have shown that the model bias here is the same as the bias of y s
in the earlier non-finite population context (where s did not feature in
the notation).
Now, E_s( ȳ_s | θ, y ) = Σ_s ȳ_s f(s | y, θ) = ( 1/(kn) ) Σ_s ( y_{s_1} + ... + y_{s_n} ),
where f(s | y, θ) = 1/k and k = C(N, n),
= ( 1/(kn) ) { ( y_1 + ... + y_n ) + ... + ( y_{N−n+1} + ... + y_N ) }.
We see that { ... } = ( kn/N )( y_1 + ... + y_N ) = kn ȳ.
Thus E_s( ȳ_s | θ, y ) = ( 1/(kn) ) kn ȳ = ȳ,
and so B_{θ,y} = E_s( ȳ_s | θ, y ) − ȳ = ȳ − ȳ = 0.
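This zero design bias is easy to confirm empirically; the following R sketch (illustrative only, using a simulated population) compares the average of the sample means over repeated SRSWOR draws with the finite population mean.
# Sketch: checking the design-unbiasedness of the sample mean under SRSWOR
set.seed(1)
y <- rexp(50); n <- 10                            # hypothetical finite population
means <- replicate(10000, mean(sample(y, n)))     # repeated SRSWOR draws
c(mean(means), mean(y))                           # the two values should nearly agree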
What is the difference between your point estimate and γ? Does γ lie
inside the interval? Calculate γ̂, the MLE of γ, and report the difference
between γ̂ and γ.
Based on your results, estimate the model bias and relative model bias of
your point estimator, and the model coverage of your interval estimator.
Also estimate the model bias and relative model bias of the MLE γ̂.
What is the difference between your point estimate and ψ ? Does ψ lie
inside the interval?
Based on your results, estimate the design bias and relative design bias
of your point estimator, and the design coverage of your interval
estimator.
Then the first n = 20 values were taken as a sample from the finite
population. Figure 11.8 shows a histogram of these sample values. The
sample mean and standard deviation of the sample values were
ȳ_s = 10.516 and s_s = 1.749. So the MLE of γ = µ/σ was calculated
as γ̂ = µ̂/σ̂ = ȳ_s/s_s = 6.011.
Then a Monte Carlo sample of size J = 1,000 was taken from the joint
posterior distribution of µ and λ = 1 / σ 2 , i.e. from f ( µ , λ | D ) where
D = ( s, ys ). Hence a MC sample of size J was obtained from the
posterior distribution of γ , namely γ 1 ,..., γ J ~ iid f (γ | D ) .
Note that this applies in a very particular situation, namely one with
N = 100, n = 20, µ = 10, σ = 2, and a MC estimation scheme as
described above with specifically J = 1,000.
(c) Repeating (a) and (b) with K = 5,000, we obtained the following
results:
Estimate of model bias of the Bayesian estimator is 0.1616, with 95% CI (0.1359, 0.1872)
Estimate of model bias of the MLE γ̂ is 0.2301, with 95% CI (0.2041, 0.2561)
Estimate of relative model bias of the Bayesian estimator is 3.2%, with 95% CI (2.7, 3.7)%
Estimate of relative model bias of the MLE γ̂ is 4.6%, with 95% CI (4.1, 5.1)%.
From these results it appears that both the Bayesian and ML estimators
are indeed positively biased by several percent, with the Bayesian
estimator slightly outperforming the MLE.
It also appears that the model coverage of the Bayesian interval estimate
is very close to the nominal 95%.
Then a MC sample of size J = 1,000 was taken from the joint posterior
distribution of µ and λ = 1 / σ 2 , i.e. from f ( µ , λ | D ) with D = ( s, y s ) .
Hence a MC sample of size J was obtained from the predictive
distribution of ψ , namely ψ 1 ,...,ψ J ~ iid f (ψ | D ) .
We note that the true value of ψ lies in the Bayesian interval estimate,
and the difference between the Bayesian estimate and the true value is
1.715 − 1.536 = 0.179.
Note: The empirical mode was obtained using the R function density().
We see that the design bias of the empirical mode appears to be smaller
than that of the empirical median, which in turn is smaller than that of
the posterior mean. The biases of the Monte Carlo predictive mean,
median and mode estimates (based on a Monte Carlo sample size of
J = 1,000) are estimated as +5.3%, +3.8% and +1.4%.
Note: From Figure 11.15 in (d) we may have already guessed that the
posterior mode is better than the posterior mean as an estimate of ψ
(whose true value is 1.536, as shown by the dot in Figures 11.15–18).
# (a)
ys=y[1:n]
hist(ys,prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
main=" ")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)
J=1000; set.seed(171);
lamv=rgamma(J,(n-1)/2,sys^2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
gamv=muv*sqrt(lamv)
gambar=mean(gamv); gamint=quantile(gamv,c(0.025,0.975))
c(gambar, gamint) # 5.925 4.115 7.963
Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.2226 4.9986 5.4466 0.9100 0.8539 0.9661
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.298 5.070 5.526
Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.222583 -0.001418 0.446583 0.298019 0.070493 0.525544
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.0445165 -0.0002836 0.0893166 0.0596037 0.0140986 0.1051088
# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
ylim=c(0,0.6), breaks=seq(0,12,0.5), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec),lty=2,lwd=3)
points(gam,0,pch=16)
legend(6.5,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
lty=c(1,2), lwd=c(3,3))
# (c)
Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.162 5.136 5.187 0.951 0.945 0.957
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.230 5.204 5.256
Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.1616 0.1359 0.1872 0.2301 0.2041 0.2561
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.03231 0.02718 0.03745 0.04602 0.04081 0.05122
# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
ylim=c(0,0.6), breaks=seq(2,12,0.25), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec),lty=2,lwd=3)
points(gam,0,pch=16)
legend(6,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
lty=c(1,2), lwd=c(3,3))
# (d)
set.seed(421); s=sample(1:N,n)
ys=y[s]; ysbar=mean(ys); sy=sd(ys); sy2=var(ys)
c(ysbar,sy, sy2) # 9.438 2.448 5.994
set.seed(323); J=1000;
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J);
for(j in 1:J){ yrsim=rnorm(N-n,muv,1/sqrt(lamv)); ysim=c(ys,yrsim);
psiv[j]=psifun(y=ysim) }
psibar=mean(psiv); psiint=quantile(psiv,c(0.025,0.975))
c(psibar,psiint) # 1.715 1.456 2.078
summary(psiv)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.37 1.60 1.69 1.72 1.81 2.34
# hist(psiv,prob=T)
hist(psiv,prob=T,xlab="psi", xlim=c(1.3,2.4), ylim=c(0,4),breaks=seq(1,2.5,0.05),
main="")
abline(v=c(psibar,psiint),lwd=3); den=density(psiv)
lines(den,lwd=3); points(psi,0,pch=16)
psimedian=median(psiv)
psimode=den$x[den$y==max(den$y)]
c(psibar,psimedian,psimode) # 1.715 1.688 1.659
date() #
for(k in 1:K){
ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J); for(j in 1:J){
yrsim=rnorm(N-n,muv,1/sqrt(lamv))
ysim=c(ys,yrsim)
psiv[j]=psifun(y=ysim)
}
psibarvec[k] = mean(psiv);
LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
};
date() # Simulation with K=100 & J=1000 takes 12 seconds
# hist(psibarvec,prob=T)
hist(psibarvec,prob=T,xlab="psibar", xlim=c(1.2,2), ylim=c(0,6.5),
breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
date() #
for(k in 1:K){
ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J); for(j in 1:J){
yrsim=rnorm(N-n,muv,1/sqrt(lamv))
ysim=c(ys,yrsim)
psiv[j]=psifun(y=ysim)
}
psimedianvec[k] = median(psiv)
den=density(psiv); psimodevec[k]=den$x[den$y==max(den$y)]
LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
}
date() # Simulation with K=100 & J=1000 takes 12 seconds
ct=0; for(k in 1:K) if((LBvec[k]<=psi)&&(psi<=UBvec[k])) ct=ct+1
# hist(psimedianvec,prob=T)
hist(psimedianvec,prob=T,xlab="psimedian", xlim=c(1.2,2),
ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
# hist(psimodevec,prob=T)
hist(psimodevec,prob=T,xlab="psimode", xlim=c(1.2,2),
ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)
CHAPTER 12
Biased Sampling and Nonresponse
12.1 Review of sampling mechanisms
We have already discussed the topic of ignorable and nonignorable
sampling in the context of Bayesian finite population models. To be
definite, let us now focus on the model defined by:
f ( s | y,θ ) (the probability of obtaining sample s for given
values of y and θ )
f (y |θ ) (the model density of the finite population vector)
f (θ ) (the prior density of the parameter),
where the data is D = ( s, ys ) and the quantity of interest is some
functional ψ = g (θ , y ) , e.g. a function of two components of θ or a
function of y only, etc.
With these definitions we may now augment our ‘base model’ above
with a new level in the hierarchy, typically in between y and s, as
follows:
Then define
o = (o1 ,..., ono )
as the observed vector (the vector of the labels of the units sampled and
observed), and define
u = (u1 ,..., unu )
as the unobserved vector (the vector of the labels of the units sampled
and unobserved).
These two basic definitions then lead to four general cases, defined as
follows:
f ( s | R, y , θ )
f ( R | y,θ )
f (y |θ )
f (θ ) ,
D = ( s, Rs , yo )
ψ = g(θ, y, R) = 1_N′ y = Σ_{i=1}^N y_i = y_T (the finite population total)
θ = (µ, π)
Show that the sampling mechanism and response mechanism are both
ignorable, and that this is true for all possible values of the data.
y_uT = 1_u′ y_u = Σ_{i∈u} y_i is the total of the unobserved sample values
y_rT = 1_r′ y_r = Σ_{i∈r} y_i is the total of the nonsample values.
f( y_u, y_r | s, R_s, y_o ) ∝ f( y_u, y_r, s, R_s, y_o )
= Σ_{R_r} ∫∫ f( y_u, y_r, s, R_s, y_o, R_r, µ, π ) dµ dπ
= Σ_{R_r} ∫∫ f(µ) f(π) f( y_o | µ ) f( y_u, y_r | µ, y_o ) × f( R_s | π ) f( R_r | π ) f(s) dµ dπ
= f(s) × { ∫ f( y_u, y_r | µ, y_o ) f(µ) f( y_o | µ ) dµ } × { ∫ f(π) f( R_s | π ) Σ_{R_r} f( R_r | π ) dπ },
where the last factor equals ∫ f(π, R_s) × 1 dπ = f(R_s). Hence, as a function of y_u and y_r,
f( y_u, y_r | s, R_s, y_o ) ∝ 1 × ∫ f( y_u, y_r | µ, y_o ) [ f(µ) f( y_o | µ )/f( y_o ) ] dµ × 1
= ∫ f( y_u, y_r | µ, y_o ) f( µ | y_o ) dµ
= ∫ f( y_u, y_r, µ | y_o ) dµ
= f( y_u, y_r | y_o ).
That is,
f( y_u, y_r | s, R_s, y_o ) = f( y_u, y_r | y_o ),
as required.
(c) Repeat (b) but using a suitable Bayesian model which takes into
account the response mechanism and appropriately incorporates it into
the inferential procedure.
Noting that the sampling mechanism is ignorable, and that the response
mechanism would be ignorable if all n sample values were known, we
posit a suitable Bayesian model as follows:
(ȳ | y_rT, R_s, y_s, µ, λ) ~ (ȳ | y_rT, y_s) = (1/N)( y_sT + y_rT )
(y_rT | R_s, y_s, µ, λ) ~ (y_rT | µ, λ) ~ N( (N − n)µ , (N − n)/λ )
f( R_s | y_s, µ, λ ) = f( R_s | y_s ) = ∏_{i∈s} p_i^{R_i} (1 − p_i)^{1−R_i},
where p_i = 1/( 1 + e^{−(a + b y_i)} )
f( y_s | µ, λ ) = ∏_{i∈s} √( λ/(2π) ) e^{−(λ/2)( y_i − µ )²}
f(µ, λ) ∝ 1/λ, µ ∈ ℝ, λ > 0.
Discussion
It is instructive to now reveal that the data values in this exercise were in
fact generated as follows.
First, a finite population of size N = 500 was generated from the normal
distribution with mean µ = 10 and standard deviation σ = 2. The mean
of the finite population values was calculated as y = 10.10.
Note: We see that the CPDR in (c), (9.013, 10.32), contains this true
value of y , whereas the CPDRs in (a) and (b), (11.42, 12.46) and
(10.33, 11.51), do not. This suggests the analysis in (c) was on the
right track.
Then a random sample of size n = 100 was taken from the finite
population according to SRSWOR. The sample mean was calculated as
ys = 9.91.
Note: Thus, if there had been no nonresponse then the finite population
mean (with true value 10.10) would have been estimated by 9.91.
Figure 12.3 shows histograms of the population and sample values, each
overlaid by the superpopulation density. The dots in the two subplots
show y = 10.10 and ys = 9.91, respectively.
Thereby it was established which sample units would respond and which
would not. Figure 12.4 shows histograms of these two groups (of size
no = 34 and nu = 66), each overlaid by the superpopulation density. The
dots in the left and right subplots show yo and yu , respectively, and
each histogram is overlaid by the superpopulation density.
We see how the respondent values are systematically larger than the
nonrespondent values. This reflects the fact that units with larger values
were more likely to respond.
rbind(s[1:10],Rs[1:10])
# [1,] 6 7 14 17 22 37 39 48 66 69
# [2,] 0 0 1 0 1 0 0 0 1 1
o[1:5] # 14 22 66 69 78 Correct
u[1:5] # 6 7 17 37 39 Correct
yo = y[o]; yu = y[u]
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);
yobar=mean(yo); yubar=mean(yu)
c(ybar,ysbar,yrbar,yobar,yubar) # 10.095 9.907 10.143 11.938 8.860
# (a) ===================================
yo = c(12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28, 9.7,
12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75, 10.34, 14.37, 12.13,
8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28, 9.91, 10.94, 13.28, 11.43)
no=length(yo); N=500; ybarhata = mean(yo); so=sd(yo)
ybarcpdra=ybarhata+c(-1,1)*qt(0.975,no-1)*(so/sqrt(no))*sqrt(1-no/N)
c(no,so,ybarhata, ybarcpdra) # 34.000 1.552 11.939 11.416 12.461
# (b) ===================================
yf = c(5.4,9.41,7.03,8.88,11.47,7,9.44,8.58,9.27,8.18,8.62,8.73,7.33, 9.81,9.88)
nf=length(yf); yof=c(yo,yf); nof=no+nf; ybarhatb = mean(yof); sof=sd(yof)
ybarcpdrb=ybarhatb+c(-1,1)*qt(0.975,nof-1)*(sof/sqrt(nof))*sqrt(1-nof/N)
c(nof,sof,ybarhatb, ybarcpdrb) # 49.000 2.168 10.917 10.326 11.509
# (c) ============================================
# Plot observed and follow-up sample values separately
par(mfrow=c(1,2))
hist(yo,prob=T,xlab="value", main="Initially observed",
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1));
points(mean(yo),0,pch=16);
hist(yf,prob=T,xlab="value", main="Follow-up",
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1));
points(mean(yf),0,pch=16)
model
{
for(i in 1:n){
zs[i] <- a + b*ys[i]
logit(ps[i])<- zs[i]
rs[i] ~ dbern(ps[i])
ys[i] ~ dnorm(mu,lam)
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
mu ~ dnorm(0.0,0.001)
lam ~ dgamma(0.001,0.001)
ysT <- sum(ys[])
meanyrT <- nr*mu
precyrT <- lam/nr
yrT ~ dnorm(meanyrT,precyrT)
ybar <- (ysT+yrT)/(n+nr)
}
# data
list( n=100, nr=400,
rs=c( 1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1, 1,1,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0),
ys=c(
12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28,
9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75,
10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28,
9.91, 10.94, 13.28, 11.43, 5.4, 9.41, 7.03, 8.88, 11.47, 7,
9.44, 8.58, 9.27, 8.18, 8.62, 8.73, 7.33, 9.81, 9.88, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA) )
# inits
list(a=0,b=0,mu=0,lam=1)
Then define:
yi as the indicator for pop. unit i having the characteristic (0 or 1)
π i as the probability that unit i will be sampled (e.g. phone in to
answer the question)
I i as the indicator that population unit i is sampled.
We now wish to generalise this model to account for the possibility that
ys may be biased. To this end, suppose each π i can be one of two
values:
φ1 if that unit has the characteristic in question, i.e. if yi = 1
φ0 if that unit does not have the characteristic, i.e. if yi = 0.
Writing φ ≡ φ_0 and λ = φ_1/φ_0, we have π_i = λφ if y_i = 1 and π_i = φ if y_i = 0. Then:
P( y_i = 1 | I_i = 1 )
= P( y_i = 1 ) P( I_i = 1 | y_i = 1 ) / [ P( y_i = 0 ) P( I_i = 1 | y_i = 0 ) + P( y_i = 1 ) P( I_i = 1 | y_i = 1 ) ]
= pφ_1 / ( (1 − p)φ_0 + pφ_1 ) = pφλ / ( (1 − p)φ + pφλ ) = pλ / ( 1 − p + pλ ).
Note: Observe how one of the parameters, namely φ, cancels out here.
We may now write y_sT ~ Bin(n, ω), where ω = pλ/( 1 − p + pλ ).
Also, solving ω = pλ/( 1 − p + pλ ) for p yields p = ω/( λ − λω + ω ).
It follows that the MLE and MOME of p is p̂ = ȳ_s/( λ − λȳ_s + ȳ_s ).
Also, (L, U) = ȳ_s ± z_{α/2} √( ȳ_s(1 − ȳ_s)/n ) is a 1 − α CI for ω.
Therefore, a 1 − α CI for p is ( L/( λ − λL + L ), U/( λ − λU + U ) ).
Also, the bias of p̂ is B(p̂) = E[ ȳ_s/( λ − λȳ_s + ȳ_s ) ] − p.
That is, p̂ = ȳ_s/( λ − λȳ_s + ȳ_s ) is asymptotically unbiased for p as n → ∞.
Observe that ω = pλ/( 1 − p + pλ ) implies λ = ω(1 − p)/( p(1 − ω) ).
Then recall that the phone-in poll conducted by the TV network yielded
an estimate of 0.33, and that the parallel scientifically designed (and
‘proper’) survey yielded an estimate of 0.72.
This estimate being less than unity is consistent with our earlier intuition
that the phone-in poll estimate might be too low due to yes-respondents
being less likely to phone in than no-respondents.
To this poll there were 4,941 yes-responses and 4,512 no-responses, thus
a proportion of
4,941/(4,941 + 4,512) = 4,941/9,453 = 0.523 yeses.
This suggests that persons who wanted the flag replaced were almost
twice as likely to register their opinion via the Internet poll as persons
who were happy with the old flag.
Now recall Example 12.2. Clearly there is some similarity between the
two polls. Both were conducted on the Internet by the same organisation
within the same half-year, and the two questions asked both relate to
changing something about Australia’s heritage. This similarity suggests
that 1.84 may be a plausible value of λ = π 1 / π 0 to be used in the 4 June
poll here.
Then, a 95% CI for ω = pλ/( 1 − p + pλ ) (the probability of a yes-response for
a respondent) is
(L, U) = ȳ_s ± z_{α/2} √( ȳ_s(1 − ȳ_s)/n ) = 0.592 ± 1.96 √( 0.592(1 − 0.592)/4,299 )
= (0.577, 0.607).
Therefore, a 1 − α CI for p is
( L/( λ − λL + L ), U/( λ − λU + U ) )
= ( 0.577/( 1.84 − 1.84 × 0.577 + 0.577 ), 0.607/( 1.84 − 1.84 × 0.607 + 0.607 ) )
= (0.426, 0.456).
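These calculations are easily reproduced in R, as in the following short sketch.
# Sketch: point estimate and 95% CI for p from the 4 June 2000 poll, taking lambda = 1.84
n <- 4299; ysbar <- 0.592; lam <- 1.84
LU   <- ysbar + c(-1, 1)*qnorm(0.975)*sqrt(ysbar*(1 - ysbar)/n)   # 95% CI for omega
p_ci <- LU/(lam - lam*LU + LU)                                    # 95% CI for p
phat <- ysbar/(lam - lam*ysbar + ysbar)                           # point estimate of p
round(c(phat, p_ci), 3)                                           # approx. 0.441, 0.426, 0.456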
( y_sT | p, λ ) ~ Bin(n, ω), where ω = pλ/( 1 − p + pλ ) (as before)
( p | λ ) ~ Beta(α, β)
λ ~ Gamma(η, τ).   (12.2)
Recall the 28 January 2000 Internet poll yielding 4,941 yeses out of
9,453 responses and the related properly conducted probability survey
yielding 829 yeses and 1,394 nos.
n = 9,453, y sT = 4,941
(the observed data in the self-selected sample).
Using suitable WinBUGS code (see below) and a sample size of 10,000
after a burn-in of 1,000, we obtained results shown in Table 12.2. Figure
12.7 shows some of the graphical output from WinBUGS.
We see that λ has been estimated as 1.84 again, but now with some
measure of uncertainty: the 95% posterior interval estimate for λ is
(1.68, 2.03).
Equating the sample mean and sample variance of the 10,000 simulated
values with the theoretical mean and variance of the Gamma (η ,τ ) ,
namely η / τ and η / τ 2 , respectively, we may approximate the posterior
distribution of λ as Gamma (η ,τ ) with η = 431 and τ = 234.
model;
{
ysT~ dbin(omega,n)
omega <- (p*lam)/(1-p+lam*p)
lam ~ dgamma(eta,tau)
p ~ dbeta(alpha,beta)
}
# data
list(ysT=4941,n=9453,eta=0.000001,
tau=0.000001,alpha=830,beta=1395)
# inits
list(p=0.5,lam=1)
# Need to run BUGS code above first, using coda to create output in data.txt
Recall the 4 June 2000 poll yielding 2,544 yeses out of 4,299 responses,
leading to 0.441 as an estimate of p, with 95% CI (0.426, 0.456), based
on λ being exactly equal to 1.84. This suggests we apply our Bayesian
model in WinBUGS to estimate p with:
η = 431, τ = 234
(using the posterior for λ in Example 12.4 as the prior)
n = 4,299, y sT = 2,544
(the observed data in the self-selected sample).
We see that p has been estimated as 0.441 again, with 95% interval
estimate (0.414, 0.470). It will be noted that this interval is wider than
the one in Example 12.3; this may be attributed to the fact that in
Example 12.3 uncertainty regarding λ was not properly taken into
account. For more information on the topic in this section, see Puza and
O’Neill (2006).
Note: The posterior for λ is virtually the same as the prior for λ . This
was to be expected, since—unlike in Example 12.4—the data here
does not contain any structure which could tell us anything about the
relationship between the sampling propensities π 0 and π 1 .
model;
{
ysT~ dbin(omega,n)
omega <- (p*lam)/(1-p+lam*p)
lam ~ dgamma(eta,tau)
p ~ dbeta(alpha,beta)
}
# data
list(ysT=2544,n=4299,eta=431,
tau=234,alpha=1,beta=1)
# inits
list(p=0.5,lam=1)
This is a clue to the fact that the Bayesian model in that section is only useful for infinite population inference, in particular on the superpopulation parameter p, and cannot be used for inference on finite population quantities, in particular the finite population mean ȳ = (y_1 + ... + y_N)/N.
This is not an issue when N is very large (as it was assumed there), since in that case inference on ȳ is, by the law of large numbers, virtually identical to inference on the superpopulation mean p.
A sample is selected from the finite population in such a way that every
unit without the characteristic has probability φ of being sampled, and
every unit with the characteristic has probability λφ of being selected.
Every unit that is sampled has its value fully observed.
The prior on φ is beta with parameters δ and γ but stretched evenly over the interval (0, c) (that is, φ/c ~ Beta(δ, γ)), where c < 1 is a specified constant representing an absolute upper bound for what the value of φ could possibly be. (Examples of potentially suitable values of c are 0.1, 0.2 and 0.5.)
Also, the prior on λ is beta with parameters η and τ but stretched evenly over the interval (0, 1/c) (that is, cλ ~ Beta(η, τ)), so as to permit a suitably wide range of possible values for the ratio of the sampling propensities π_1 = λφ and π_0 = φ. (For example, if c = 0.2 then that ratio could be anything from 0 to 5.)
(b) Suppose we are interested in both the superpopulation mean (i.e. the
common probability of a unit having the characteristic, p) and the finite
population mean (i.e. proportion of the N finite population units which
have the characteristic, ȳ). Write down a formula for the joint posterior (and predictive) density of all quantities which are relevant to and could be used as a basis for the desired inference.
(d) Modify the MH algorithm in (c) so that its output features only the
three model parameters and none of the nonsample values. (NB: The
idea here is to design a superior MH algorithm, one with better ‘mixing’
than the one in (c).)
(e) Describe a procedure whereby the output from the algorithm in (d)
could be used to obtain a sample from the predictive distribution of the
nonsample mean. Then run that algorithm and implement the procedure
so as to produce results intended to be equivalent to those in the
reanalysis of Example 5 in (c) with N = 200,000.
(p | λ, φ) ~ Beta(α, β),   (λ | φ) ~ (1/c) × Beta(η, τ)
f(φ, λ, p, y_r | I, y_s) ∝ f(φ, λ, p, y_r, I, y_s)
= f(φ, λ, p, y_r, I_s, I_r, y_s)
= f(φ) f(λ) f(p) × f(y_s | p) f(y_r | p) × f(I_s | y_s, φ, λ) f(I_r | y_r, φ, λ)   (12.3)

= [ (φ/c)^{δ−1} (1 − φ/c)^{γ−1} / (c B(δ, γ)) ] × [ c (cλ)^{η−1} (1 − cλ)^{τ−1} / B(η, τ) ] × [ p^{α−1} (1 − p)^{β−1} / B(α, β) ]
× ∏_{i∈s} p^{y_i} (1 − p)^{1−y_i} ∏_{i∈r} p^{y_i} (1 − p)^{1−y_i}
× ∏_{i∈s} (φ λ^{y_i})^{I_i} (1 − φ λ^{y_i})^{1−I_i} ∏_{i∈r} (φ λ^{y_i})^{I_i} (1 − φ λ^{y_i})^{1−I_i}   (12.4)

∝ φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}
× p^{y_sT} (1 − p)^{n−y_sT} p^{y_rT} (1 − p)^{N−n−y_rT}
× ∏_{i∈s} (φ λ^{y_i})^{1} (1 − φ λ^{y_i})^{1−1} ∏_{i∈r} (φ λ^{y_i})^{0} (1 − φ λ^{y_i})^{1−0}   (12.5)

= φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}
× p^{y_sT+y_rT} (1 − p)^{N−y_sT−y_rT} × φ^n λ^{y_sT} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT}.   (12.6)
Note 1: In all of the above e.g. (12.3), s and r are fixed at their
observed values.
If w ≡ φ/c ~ Beta(δ, γ) then f(w) = w^{δ−1} (1 − w)^{γ−1} / B(δ, γ).
Therefore
f(φ) = f(w) dw/dφ = (φ/c)^{δ−1} (1 − φ/c)^{γ−1} / ( c B(δ, γ) ).
⇒ (y_rT | D, φ, λ, p) ~ Bin(N − n, q),
where q = p(1 − φλ) / [ p(1 − φλ) + (1 − p)(1 − φ) ]   (12.7)
f(p | D, φ, λ, y_rT) ∝ p^{α+y_sT+y_rT−1} (1 − p)^{β+N−n−y_sT−y_rT−1}   (12.8)
Also:
f(φ | D, y_rT, λ, p) ∝ φ^{δ+n−1} (1 − φ/c)^{γ−1} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT}   (12.9)
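As a rough illustration in R of how the first two of these conditionals might be used within a data-augmentation sampler (the variable names p, phi, lam, ysT, n, N, alpha and beta are assumptions for the sketch; the conditionals for φ and λ are non-standard and would need, for example, Metropolis-Hastings updates):
q = p*(1-phi*lam)/( p*(1-phi*lam) + (1-p)*(1-phi) )        # as in (12.7)
yrT = rbinom(1, N-n, q)                                    # impute the nonsample total
p = rbeta(1, alpha + ysT + yrT, beta + N-n - ysT - yrT)    # as in (12.8)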
Our point and interval estimates for λ are 1.85 and (1.68, 2.02), which are very similar to 1.84 and (1.68, 2.03) in Example 12.4.
Repeating the above but with finite population sizes 400,000 and 40,000, respectively, we obtain the corresponding results shown in Table 12.5.

Table 12.5 (rows: posterior mean, SD, 2.5% and 97.5% points)

              N = 400,000                              N = 40,000
         phi (φ)  lam (λ)  p        ybar (ȳ)      phi (φ)  lam (λ)  p         ybar (ȳ)
Mean     0.01803  1.83548  0.37394  0.373948      0.18123  1.81588  0.375693  0.375834
SD       0.08546  0.08546  0.00981  0.009832      0.07579  0.07579  0.009203  0.009399
2.5%     0.01731  1.68407  0.35413  0.354113      0.17492  1.66922  0.357356  0.357050
97.5%    0.01878  2.00923  0.39122  0.391193      0.18813  1.97208  0.393969  0.394500
Note: The three sets of inferences in Tables 12.4 and 12.5 have yielded
different estimates of φ but very similar results for the other three
quantities, in particular the object of this study, λ .
Figure 12.10 shows graphical output from the first of the three
Metropolis-Hastings algorithms (i.e. the one with N = 200,000).
This posterior for λ was then fed in as the prior for λ so as to redo the
analysis in Example 12.5.
Accordingly, the MH algorithm in (b) was next applied once again but
with the following specifications:
Thus point and interval estimates for p are 0.440 and (0.413, 0.466), which we note are similar to 0.441 and (0.414, 0.470) in Example 12.5. Also, point and 95% interval estimates for ȳ are 0.452 and (0.426, 0.478).
Note 1: The inference on ȳ here was not possible using the theory in the section just above the present exercise, i.e. using the infinite population models developed in that section.
Note 2: The posterior for λ is very similar to its prior, which is as one
might expect, since the data now has no structure which could tell us
anything further about that parameter.
Repeating the above but with finite population sizes 400,000 and 40,000, respectively, we obtain the corresponding results shown in Table 12.7.

Table 12.7 (rows: posterior mean, SD, 2.5% and 97.5% points)

              N = 400,000                               N = 40,000
         phi (φ)   lam (λ)  p        ybar (ȳ)      phi (φ)  lam (λ)  p        ybar (ȳ)
Mean     0.007863  1.83516  0.44193  0.44792       0.07888  1.82895  0.44228  0.50220
SD       0.087755  0.08776  0.01375  0.01372       0.08162  0.08162  0.01359  0.01337
2.5%     0.007482  1.66809  0.41563  0.42160       0.07538  1.66402  0.41490  0.47517
97.5%    0.008299  2.00048  0.46819  0.47409       0.08278  1.99275  0.47007  0.52985
Discussion
We see no problem in the first two of these three cases. But for
N = 15,000, the estimation of φ appears to be artificially restricted by
our arbitrary choice of c as 0.2. (Observe that the simulated values are
strongly ‘bunched up’ at just below 0.2.)
Repeating the MCMC run with N = 15,000 but with c also changed to 0.5 appears to solve this problem. Results are shown in Figure 12.14 (page 599). We note that estimation of λ has changed from about 2 to less than 1. This suggests that we might get very similar results with c even larger, e.g. c = 1. But when we do this, we get very different results (not shown). Why? The reason is that the prior for λ is a beta distribution stretched over the interval (0, 1/c) and so involves c; whenever c is changed, the parameters η and τ need to be reconfigured so as to preserve the intended prior for λ, and this was not done here.
Note: The prior for φ also involves c but does not need reconfiguring (because that prior is uniform for all values of c, since δ = γ = 1).
Thus, Figure 12.14 (the case of N = 15,000 and c = 0.5) in fact illustrates output which is ‘flawed’ (in this sense) and so should be disregarded.
(d) Recall the joint density (12.6). This density may also be written as:
f(φ, λ, p, y_r | I, y_s) ∝ f(φ, λ, p) p^{y_sT+y_rT} (1 − p)^{N−y_sT−y_rT} × φ^n λ^{y_sT} (1 − φλ)^{y_rT} (1 − φ)^{N−n−y_rT},
where f(φ, λ, p) ∝ φ^{δ−1} (1 − φ/c)^{γ−1} × λ^{η−1} (1 − cλ)^{τ−1} × p^{α−1} (1 − p)^{β−1}.
Now define
z = p(1 − φλ) / [ p(1 − φλ) + (1 − p)(1 − φ) ].
Then
∑_{y_r} ∏_{i∈r} z^{y_i} (1 − z)^{1−y_i} = ∏_{i∈r} ∑_{y_i=0}^{1} z^{y_i} (1 − z)^{1−y_i} = 1.
It follows that
f(φ, λ, p | I, y_s) = ∑_{y_r} f(φ, λ, p, y_r | I, y_s)
∝ f(φ, λ, p) × p^{y_sT} (1 − p)^{n−y_sT} φ^n λ^{y_sT} × [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}.
The corresponding full conditionals of φ and λ are then:
f(φ | D, λ, p) ∝ φ^{δ+n−1} (1 − φ/c)^{γ−1} [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}
f(λ | D, φ, p) ∝ λ^{η+y_sT−1} (1 − cλ)^{τ−1} [ p(1 − φλ) + (1 − p)(1 − φ) ]^{N−n}
2. Sample y_rT^{(j)} ~ Bin(N − n, z_j), where
z_j = p_j (1 − φ_j λ_j) / [ p_j (1 − φ_j λ_j) + (1 − p_j)(1 − φ_j) ],  j = 1,…,J (from (12.11)).
3. Calculate ȳ^{(j)} = (1/N)( y_sT + y_rT^{(j)} ),  j = 1,…,J.
We now perform the MH algorithm in (d) and the above procedure with:
N = 200,000, n = 4,299, y_sT = 2,544, c = 0.2,
α = 1, β = 1, η = 278.1, τ = 474.8, δ = γ = 1.
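As a rough sketch in R of this prediction step (assuming pv, phiv and lamv denote the retained draws of p, φ and λ from the MH run, and that N, n and ysT are set as above):
zv = pv*(1-phiv*lamv)/( pv*(1-phiv*lamv) + (1-pv)*(1-phiv) )   # z_j for each draw
yrTv = rbinom(length(zv), N-n, zv)                             # step 2
ybarv = (ysT + yrTv)/N                                         # step 3
c(mean(ybarv), quantile(ybarv, c(0.025, 0.975)))               # point and interval estimates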
Thus, since
ȳ = (1/N)( y_sT + (N − n) ȳ_r ),
the RB estimate of ŷ is actually
(1/N)( y_sT + (N − n) z̄ ) = 0.440,
with a 95% confidence interval for ŷ equal to
( (1/N)( y_sT + (N − n) × 0.4361 ), (1/N)( y_sT + (N − n) × 0.4367 ) ) = (0.439, 0.440).
Note 2: The Monte Carlo 95% confidence intervals reported here are
unduly narrow (i.e. will have less than 95% actual coverage). This is
because we did not address the problem of the very strong serial
correlation amongst the values outputted from the Metropolis-Hastings
algorithm, for example by way of thinning or the batch means method.
But this remark only applies to confidence intervals for mean estimates and not to posterior or predictive interval estimates, such as (0.413, 0.467) for ȳ in Table 12.8.
# A ----------------------------------
# B ----------------------------------
# Now calculate new prior from posterior of lambda (based on 1st run above):
c(lamhat,lamse) # 1.846864 0.087889
# Find (eta,tau) such that a Beta(eta,tau) variable stretched over (0,1/c) has mean
# lamhat and sd lamse, by minimising the sum of squared discrepancies:
fun=function(etatau, c=0.2, est=lamhat, se=lamse){
(est-(1/c)*etatau[1]/sum(etatau))^2+
( se^2 - (1/c^2)*prod(etatau)/( sum(etatau)^2*(1 + sum(etatau)) ) )^2 }
etataunew0 = optim(par=c(2,5), fn=fun)$par         # first pass
etataunew = optim(par= etataunew0, fn=fun)$par     # refine from the first solution
etanew=etataunew[1]; taunew=etataunew[2]
c(etanew, taunew) # 278.10 474.79
(1/0.2)*etanew/(etanew+taunew) # 1.8469 (check: implied prior mean for lambda)
sqrt((1/0.2^2)*etanew*taunew/((etanew+taunew)^2*(etanew+taunew+1)))
# 0.087889 OK (check: implied prior sd for lambda)
# C -----------------------------------------------------------
# D -------------------------------------------------
# Repeat above exactly from C to D but with N=20000 and 15000 to produce
# extra graphs. We omit the code for the case N = 15000, c=0.5 and the case
# N = 15000, c = 1
# (e)
# MH2: componentwise random-walk Metropolis sampler for (p, phi, lam), with the
# nonsample values summed out of the posterior kernel (as derived in (d)).
MH2 = function(J=100, n=9453, ysT=4941, alp=830, bet=1395,
p0=0.5, phi0=0.1, lam0=1, psd=0.1, phisd=0.1, lamsd=0.1,
eta=1, tau=1, del=1, gam=1, c=0.2, N=200000 ){
p=p0; phi=phi0; lam=lam0; pv=p; phiv=phi; lamv=lam; pct=0; phict=0; lamct=0;
for(j in 1:J){
pnew=rnorm(1,p,psd)
if((pnew >0)&&(pnew <1)){
logprobnum=(alp-1+ysT)*log(pnew)+(bet-1+n-ysT)*log(1-pnew) +
(N-n)*log((1-pnew)*(1-phi)+pnew*(1-phi*lam))
logprobden=(alp-1+ysT)*log(p)+(bet-1+n-ysT)*log(1-p) +
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ pct=pct+1; p=pnew } }
phinew=rnorm(1,phi,phisd)
if((phinew>0)&&(phinew<c)){
logprobnum=(del-1+n)*log(phinew)+(gam-1)*log(1- phinew/c)+
(N-n)*log((1-p)*(1-phinew)+p*(1-phinew*lam))
logprobden=(del-1+n)*log(phi)+(gam-1)*log(1-phi/c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ phict=phict+1; phi=phinew } }
lamnew=rnorm(1,lam,lamsd)
if((lamnew>0)&&(lamnew<(1/c))){
logprobnum= (eta-1+ysT)*log(lamnew)+(tau-1)*log(1- lamnew*c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lamnew))
logprobden= (eta-1+ysT)*log(lam)+(tau-1)*log(1- lam*c)+
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam))
logprob= logprobnum- logprobden; prob=exp(logprob)
u=runif(1); if(u<=prob){ lamct=lamct+1; lam=lamnew } }
pv=c(pv,p); phiv=c(phiv,phi); lamv=c(lamv,lam) }
par=pct/J; phiar=phict/J; lamar=lamct/J
list(pv=pv, phiv=phiv, lamv=lamv, par=par, phiar=phiar, lamar=lamar) }
# end fn
X11(w=8,h=6); par(mfrow=c(2,2))
N=200000; n = 4299; ysT=2544; K=2000
set.seed(531); res=MH2(J=K, n=4299, ysT=2544, alp=1, bet=1,
p0=0.5, phi0=0.1, lam0=1, psd=0.008, phisd=0.0007, lamsd=0.04,
eta= etanew, tau= taunew, del=1, gam=1, c=0.2, N=N )
c(res$par, res$phiar,res$lamar) # 0.6580 0.4135 0.6045 OK
plot(res$pv); plot(res$phiv); plot(res$lamv) # Has burnt in OK
p0=res$pv[2001]; lam0=res$lamv[2001]; phi0=res$phiv[2001]
# record last values
# Calculate estimates
# (pv, phiv, lamv and zv denote the draws and z-values from the main, post burn-in run)
phat=mean(pv); pcpdr=quantile(pv,c(0.025,0.975)); pse=sd(pv)
lamhat=mean(lamv); lamcpdr=quantile(lamv,c(0.025,0.975)); lamse=sd(lamv)
phihat=mean(phiv); phicpdr=quantile(phiv,c(0.025,0.975)); phise=sd(phiv)
RBest=mean(zv); RBci=RBest+c(-1,1)*qnorm(0.975)*sd(zv)/sqrt(J)
c(RBest,RBci) # 0.43639 0.43612 0.43667
(1/N)*(ysT+(N-n)*RBest) # 0.43973
(1/N)*(ysT+(N-n)*RBci) # 0.43946 0.44000
APPENDIX A
Additional Exercises
Exercise A.1 Practice with the Metropolis algorithm
Illustrate your results with suitable figures (for example, trace plots and
histograms).
3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5.
Using MCMC methods, estimate the finite population mean and provide
a suitable 95% interval estimate.
(a) The sampled value of m was 0.7071. A histogram of the 100 sampled
normal values is shown in Figure A.1(a) (page 612). This histogram is
overlaid by the (known) normal distribution with mean m and variance
v = m² = 0.5.
The posterior density of m is
f(m | y) ∝ f(m) f(y | m)
∝ e^{−m} ∏_{i=1}^{n} (1/(m√(2π))) exp{ −(y_i − m)²/(2m²) }
∝ e^{−m} m^{−n} exp{ −(1/(2m²)) ∑_{i=1}^{n} (y_i − m)² }.
So the log-posterior is
l(m) = log f(m | y) = −m − n log m − (1/(2m²)) ∑_{i=1}^{n} (y_i − m)².
The dots show the true posterior mean, m̂ = E(m | y) = 0.7393, and the true 95% CPDR for m. The cross shows the true value of m, 0.7071.
The Monte Carlo sample was used to generate a random sample from the predictive distribution of
c = (y_{n+1} + ... + y_{n+10})/10
by sampling
c_j ~ N(m_j, m_j²/10),  j = 1,…,J.
A histogram of these c-values is shown in Figure A.1(f).
The vertical lines show the predictive mean estimate, c̄ = 0.741, the 95% CI for the predictive mean, (0.7270, 0.7549), and the 95% CPDR estimate for c, (0.3063, 1.1893).
• with c = (y_{11} + ... + y_{50})/40 (instead of c = (y_{101} + ... + y_{110})/10).
Figure A.2 is an analogue of Figure A.1, except that subplot (a) does not have a normal density overlaid, and there is an extra subplot (g) that shows inference on the finite population mean, which may be denoted here by
a = (1/50)(10 × 3.71 + 40c).
Some of the estimates and quantities shown in the last subplot (g) are as follows. The histogram estimate of a's predictive mean is ā = 3.061, with 95% CI (3.028, 3.094). The Rao-Blackwell estimate of a's predictive mean is (10 × 3.71 + 40 m̄)/50 = 3.055, with 95% CI (3.031, 3.078). The exact predictive mean of a is (10 × 3.71 + 40 m̂)/50 = 3.068, where m̂ = 2.907 is the exact posterior mean of m. The 95% CPDR estimate for a is (2.190, 4.256).
# (a)
options(digits=4)
INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
# Integrates numerically under a spline through the points given by
# the vectors xvec and yvec, from a to b.
fit <- smooth.spline(xvec, yvec)
spline.f <- function(x){predict(fit, x)$y }
integrate(spline.f, a, b)$value }
INTEG(seq(0,1,0.01), seq(0,1,0.01)^2, 0,1) # 0.3333 correct
X11(w=8,h=6); par(mfrow=c(2,2));
set.seed(221); m=rgamma(1,1,1); v=m^2; n=100; y=rnorm(n,m,m); c(m,v)
# 0.7071 0.5000
hist(y,prob=T,xlim=c(-2,4),ylim=c(0,0.8), breaks=seq(-2,4,0.25),
main="(a) Histogram of 100 y-values")
yvec=seq(-2,4,0.01); lines(yvec,dnorm(yvec,m,m),lwd=3)
abline(v=c(m,m+c(-1,1)*qnorm(0.975)*m), lwd=3)
LOGPOST=function(m=2,n=10,y=c(2,1)){
-m-n*log(m)-(1/(2*m^2))*sum((y-m)^2) }
LOGPOST() # -9.056 OK
METALG = function(J=1000,y,m0=1,mdel=0.4){
m=m0; mv=m; mct=0; n=length(y); for(j in 1:J){
mcand=runif(1,m-mdel,m+mdel)
if(mcand>0){ logprob=LOGPOST(m= mcand,n=n,y=y)-
LOGPOST(m=m,n=n,y=y)
prob=exp(logprob)
u=runif(1); if(u<=prob){ mct=mct+1; m= mcand }
}
mv=c(mv,m)
}
list(mv=mv,mar=mct/J) }
J=length(mv); J # 1000
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
mcpdr=quantile(mv,c(0.025,0.975));
mvec=seq(0.5,1,0.01); kvec=mvec;
for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y))
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 6.269e-11
mhat=INTEG(mvec,mvec*postvec);
c(mbar,sd(mv),mhat,mci,mcpdr)
# 0.73769 0.04305 0.73935 0.73502 0.74036 0.66197 0.82984
fun=function(q,p=0.025){ (INTEG(mvec,postvec,0,q)-p)^2 }
LB0 = optim(par=0.5,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par
fun=function(q,p=0.975){ (INTEG(mvec,postvec,0,q)-p)^2 }
UB0 = optim(par=0.8,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par
c(LB,UB) # 0.6609 0.8305
INTEG(mvec,postvec,0,LB) # 0.025
INTEG(mvec,postvec,UB,1) # 0.025 OK (Ignore all the warnings)
par(mfrow=c(2,1))
hist(mv,prob=T,xlim=c(0.6,0.9),ylim=c(0,10), breaks=seq(0.5,1,0.01),
xlab="x",main="(e) Histogram of 1000 m-values")
lines(mvec,postvec,lty=1,lwd=3)
lines(density(mv),lty=2,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(c(mhat,LB,UB),c(0,0,0),pch=16)
points(m,0,pch=4,lwd=3)
# Prediction of c -----------------------
set.seed(332); cv=rnorm(J,mv,mv/sqrt(10))
cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J)
ccpdr=quantile(cv,c(0.025,0.975))
c(cbar,sd(cv),cci,ccpdr) # 0.7410 0.2253 0.7270 0.7549 0.3063 1.1893
hist(cv,prob=T,xlim=c(0,1.6),ylim=c(0,2.5), breaks=seq(0,1.6,0.05),
xlab="c",main="(f) Histogram of 1000 c-values")
cvec=seq(0,1.5,0.01); fcvec=seq(0,1.5,0.01); for(i in 1:length(cvec))
fcvec[i]=mean(dnorm(cvec[i],mv,mv/sqrt(10)))
lines(cvec,fcvec,lty=1,lwd=3)
lines(density(cv),lty=2,lwd=3)
abline(v=c(cbar,cci,ccpdr),lwd=2)
points(mhat,0,pch=16)
# (b)
X11(w=8,h=6); par(mfrow=c(2,2));
y = c(3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5); n = 10; ybar=mean(y);
ybar # 3.71
hist(y,prob=T,xlim=c(0,10),ylim=c(0,0.6), breaks=seq(0,10,0.5),
main="(a) Histogram of 10 y-values")
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
mcpdr=quantile(mv,c(0.025,0.975));
mvec=seq(1.8,5,0.01); kvec=mvec;
for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y))
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 3.317e-08
mhat=INTEG(mvec,mvec*postvec);
c(mbar,sd(mv),mhat,mci,mcpdr)
# 2.8907 0.4823 2.9071 2.8608 2.9206 2.1456 3.9827
fun=function(q,p=0.025){ (INTEG(mvec,postvec,1.8,q)-p)^2 }
LB0 = optim(par=2.1,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par
fun=function(q,p=0.975){ (INTEG(mvec,postvec,1.8,q)-p)^2 }
UB0 = optim(par=4.1,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par
c(LB,UB) # 2.143 4.033
INTEG(mvec,postvec,1.8,LB) # 0.025
INTEG(mvec,postvec,UB,5) # 0.025 OK (Ignore all the warnings)
par(mfrow=c(2,1))
hist(mv,prob=T,xlim=c(1,5),ylim=c(0,1), breaks=seq(1,5,0.2),
xlab="x",main="(e) Histogram of 1000 m-values")
lines(mvec,postvec,lty=1,lwd=3)
lines(density(mv),lty=2,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(c(mhat,LB,UB),c(0,0,0),pch=16)
points(m,0,pch=4,lwd=3)
X11(w=8,h=4); par(mfrow=c(1,1))
hist(av,prob=T,xlim=c(1.5,5.5), ylim=c(0,1), breaks=seq(1,6,0.2), xlab="c",
main="(g) Histogram of 1000 a-values (finite population mean)")
avec=seq(1,6,0.01); favec=seq(1,6,0.01); for(i in 1:length(avec))
favec[i]=
mean( dnorm( avec[i], (1/50)*( 10*ybar+40*mv), mv*sqrt(40)/50 ) )
lines(avec,favec,lty=1,lwd=3); lines(density(av),lty=2,lwd=3)
abline(v=c(abar,aci,acpdr),lwd=2)
points( (1/50)*(10*ybar+40*mbar) ,0.1,pch=1,cex=1, lwd=2)
points( (1/50)*(10*ybar+40*mci) ,c(0.06,0.14), pch=1,cex=1, lwd=2)
points( (1/50)*(10*ybar+40*mhat) ,0,pch=4,lwd=2,cex=2)
points(ybar,0,cex=1,lwd=2,pch=16)
legend(3.9,1, c("Histogram density estimate","Rao-Blackwell estimate"),
lty=c(2,1), lwd=c(3,3), bg="white")
legend(3.83,0.67,c("Sample mean","Rao-Blackwell estimate & 95% CI",
"Exact predictive mean"),
pch=c(16,1,4), pt.cex=c(1,1,2), pt.lwd= c(2,2,2), bg="white")
Then randomly sample n = 100 values from the gamma distribution with mean m = a/b and variance v = a/b².
Illustrate your results with suitable figures (e.g. trace plots and
histograms).
The sampled values of a and b were 1.463 and 5.528. So the value of m
was a/b = 0.2647. The 100 sampled gamma values are shown in Figure
A.3(a) (page 621).
f(a, b | y) ∝ f(a, b) f(y | a, b)
∝ e^{−a} ∏_{i=1}^{n} [ b^a y_i^{a−1} e^{−b y_i} / Γ(a) ] = e^{−a} b^{na} ( ∏_{i=1}^{n} y_i )^{a−1} e^{−b y_T} / Γ(a)^n.
So the log-posterior is
l(a, b) = −a + na log b + (a − 1) ∑_{i=1}^{n} log y_i − b y_T − n log Γ(a).
The posterior is then sampled using an algorithm which at each iteration:
1. Proposes a value a′ ~ U(a − δ_a, a + δ_a), where δ_a is a tuning constant, and accepts this value with probability p = e^q, where q = l(a′, b) − l(a, b)
2. Proposes a value b′ ~ U(b − δ_b, b + δ_b), where δ_b is a tuning constant, and accepts this value with probability p = e^q, where q = l(a, b′) − l(a, b).
The Monte Carlo sample was then used to generate a random sample from the predictive distribution of
c = (y_{n+1} + ... + y_{n+10})/10.
(b) Here we repeat the procedure in (a) but using n = 6 (rather than 100), and the 6 given sample values whose mean is 2.25 (instead of the 100 generated values as before), so as to generate a Monte Carlo sample of size J = 1,000 from the posterior distribution of a and b.
Then for each j we calculate the associated value of the MAD, namely
ψ_j = (1/N) ∑_{i=1}^{N} | y_i − a_j/b_j |.
We then use the resulting J values of the MAD, i.e. ψ_1,...,ψ_J, for Monte Carlo inference in the usual way.
# (a)
options(digits=4); n = 100; X11(w=8,h=4); par(mfrow=c(1,1));
set.seed(192); a=rgamma(1,1,1); b=runif(1,0,10); y=rgamma(n,a,b);
m=a/b; v=a/b^2; c(a,b,m,v) # 1.46321 5.52763 0.26471 0.04789
hist(y,prob=T,xlim=c(0,1.5),ylim=c(0,3), breaks=seq(0,1.5,0.05),
main="(a) Histogram of 100 y-values")
yvec=seq(0,1.5,0.01); lines(yvec,dgamma(yvec,a,b),lwd=3)
abline(v=m,lwd=3)
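# A sketch of the LOGPOST function assumed by MHALG below, based on the posterior
# density derived in the solution above:
# l(a,b) = -a + n*a*log(b) + (a-1)*sum(log(y)) - b*sum(y) - n*log(Gamma(a)).
# (Argument names are chosen to match the calls made inside MHALG.)
LOGPOST = function(a=1, b=1, n=100, sumlogy=0, sumy=1){
-a + n*a*log(b) + (a-1)*sumlogy - b*sumy - n*lgamma(a) }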
MHALG = function(J=1000,y,a0=1,b0=1,adel=1,bdel=1){
a=a0; b=b0; av=a; bv=b; act=0; bct=0; n=length(y);
sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics
for(j in 1:J){
acand=runif(1,a-adel,a+adel)
if(acand>0){
logprob=
LOGPOST (a=acand,b=b,n=n,sumlogy=sumlogy,sumy=sumy)-
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy)
prob=exp(logprob)
u=runif(1); if(u<=prob){ act=act+1; a= acand } }
bcand=runif(1,b-bdel,b+bdel)
if((bcand>0)&&(bcand<10)){
logprob=
LOGPOST (a=a,b=bcand,n=n,sumlogy=sumlogy,sumy=sumy)-
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy)
prob=exp(logprob)
u=runif(1); if(u<=prob){ bct=bct+1; b= bcand }
}
av=c(av,a); bv=c(bv,b)
}
list(av=av,bv=bv,aar=act/J,bar=bct/J)
}
set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=0.3,bdel=1)
X11(w=8,h=6); par(mfrow=c(2,1));
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5055 0.5611
X11(w=8,h=4); par(mfrow=c(1,1));
hist(mv,prob=T,xlim=c(0.2,0.4),ylim=c(0,20), breaks=seq(0.2,0.4,0.005),
xlab="m",main="(b) Histogram of 1000 m-values")
lines(density(mv),lty=1,lwd=3)
abline(v=c(mbar,mci,mcpdr),lwd=2)
points(m,0,pch=4,lwd=3)
# Prediction of c -----------------------
set.seed(332); cv=rep(NA,J); for(j in 1:J) cv[j]=mean(rgamma(10,av[j],bv[j]))
cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J)
ccpdr=quantile(cv,c(0.025,0.975))
c(cbar,sd(cv),cci,ccpdr) # 0.29812 0.08356 0.29294 0.30329 0.15843 0.48783
hist(cv,prob=T,xlim=c(0.05,0.7),ylim=c(0,7), breaks=seq(0,1.6,0.02),
xlab="c",main="(c) Histogram of 1000 c-values")
lines(density(cv),lty=1,lwd=3); abline(v=c(cbar,cci,ccpdr),lwd=2)
# (b)
y=c( 0.4, 3.3, 1.0, 2.9, 1.8, 4.1); X11(w=8,h=6); par(mfrow=c(2,1));
n=length(y); sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics
set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=1.3,bdel=0.7)
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5129 0.5094
X11(w=8,h=4); par(mfrow=c(1,1));
hist(mv,prob=T,xlim=c(0,7),ylim=c(0,0.8), breaks=seq(0,10,0.5),
xlab="x",main="Histogram of 1000 simulated m-values")
lines(density(mv),lty=2,lwd=3); abline(v=c(mbar,mci,mcpdr),lwd=2)
hist(psiv,prob=T,xlim=c(0,4),ylim=c(0,1.5), breaks=seq(0,7,0.1),
xlab="psi",main="")
lines(density(psiv),lty=1,lwd=3); abline(v=c(psibar,psici,psicpdr),lwd=2)
Then select a random sample of size n = 20 from the N units in the finite
population, without replacement.
Plot the y values against the x values, over the population and over the
sample, respectively. Draw the true regression line y= a + bx and the
two least squares regression lines estimated using the population data and
sample data, respectively.
Then use this sample and R to estimate each of the following quantities:
m = a + 16b (the average of a hypothetically infinite number of values with covariate 16)
ȳ = (y_1 + ... + y_N)/N (the finite population mean)
ψ = 2 y_{(100)} / ( y_{(50)} + y_{(51)} ) (the ratio of the maximum to the median of the 100 finite population values).
(c) Repeat the inferences in (b) but using WinBUGS and a sample size of
J = 10,000.
(a) The required plot and regression lines are shown in Figure A.5.
(b) Denote the sample values by s_1,...,s_n ∈ {1,...,N}, where s_1 < ... < s_n, and define s = (s_1,...,s_n). Also define r = (r_1,...,r_{N−n}) = {1,...,N} − s in such a way that r_1 < ... < r_{N−n}, and define the nonsample vector as y_r = (y_{r_1},...,y_{r_{N−n}})′.
Thus, to do the required inference, first carry out the following steps:
1. Relabel the population units so that y_s = (y_1,...,y_n)′, x_s = (x_1,...,x_n)′, y_r = (y_{n+1},...,y_N)′, x_r = (x_{n+1},...,x_N)′, etc., so that y = (y_s′, y_r′)′, etc.
2. Calculate A, B, D and T as per the above.
3. Generate λ_1,...,λ_J ~ iid G(A/2, B/2) (easy).
4. Generate β^{(j)} ~ ⊥ N_2(T, D/λ_j), for j = 1,…,J (easy).
5. Generate y_r^{(1)},...,y_r^{(J)} ~ N_{N−n}(X_r β^{(j)}, Σ_rr/λ_j), for j = 1,…,J (e.g. for each j, generate y_i^{(j)} ~ ⊥ N(a_j + b_j x_i, 1/λ_j), i = n+1,...,N, and form y_r^{(j)} = (y_{n+1}^{(j)},...,y_N^{(j)})′).
6. Form y^{(j)} = (y_s′, y_r^{(j)}′)′ for each j = 1,…,J.
Now calculate
m_j = a_j + 16 b_j
and perform Monte Carlo inference on m, using the fact that m_1,...,m_J ~ iid f(m | D).
(For example, estimate m by m̄ = J^{−1} ∑_{j=1}^{J} m_j.)
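As a rough sketch in R of steps 3 to 6 and the inference on m (assuming A, B, D and T have been computed as in step 2, and that N, n, xr and ys are available from the setup; the names lamvec, betamat and yrmat match those used in the code further below):
J = 1000
lamvec = rgamma(J, A/2, B/2)                          # step 3
betamat = sapply(lamvec, function(lam)                # step 4: beta^(j) ~ N2(T, D/lam)
T + t(chol(D/lam)) %*% rnorm(2) )
mvec = betamat[1,] + 16*betamat[2,]                   # m_j = a_j + 16*b_j
yrmat = sapply(1:J, function(j)                       # step 5: nonsample values for draw j
rnorm(length(xr), betamat[1,j] + betamat[2,j]*xr, 1/sqrt(lamvec[j])) )
ybarvec = (sum(ys) + colSums(yrmat))/N                # step 6: finite population mean draws
c(mean(mvec), quantile(mvec, c(0.025, 0.975)))        # Monte Carlo inference on m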
Finally, calculate
ψ_j = 2 y_{(100)}^{(j)} / ( y_{(50)}^{(j)} + y_{(51)}^{(j)} ).
Table A.1 shows some of the true values and corresponding numerical
estimates featuring in Figure A.6.
# (a)
X11(w=8,h=5.5); par(mfrow=c(1,1)); options(digits=4)
N=100; n=20; a=3; b=0.5; sig=2; set.seed(312); x=runif(N,10,20);
y=rnorm(N,a+b*x,sig); s=sort(sample(1:N,n)); xs=x[s]; ys=y[s];
r=(1:N)[-s]; xr=x[r]; yr=y[r]; yT=sum(y); ysT=sum(ys); yrT=sum(yr)
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);
xT=sum(x); xsT=sum(xs); xrT=sum(xr)
xbar=mean(x); xsbar=mean(xs); xrbar=mean(xr);
m=a+16*b; psi=max(y)/median(y)
c(m, ybar,max(y),median(y),psi) # 11.000 10.473 15.234 10.616 1.435
plot(x,y,xlim=c(0,20),ylim=c(0,17));
points(xs,ys,pch=16); abline(v=0,lty=3); abline(h=0,lty=3); abline(v=16,lty=3);
abline(h=a+16*b,lty=3);
abline(a,b,lwd=3);
abline(lm(y~x),lty=2,lwd=3); abline(lm(ys~xs),lty=3,lwd=3);
abline(lm(yr~xr),lty=4,lwd=3)
legend(0,17,bg="white", c("True regression line","Estimate from population",
"Estimate from sample","Estimate from nonsample"),
lty=1:4,lwd=rep(3,4) )
text(16,2,"The solid dots show the sample values")
avec=betamat[1,]; bvec=betamat[2,]
ahat=mean(avec); bhat=mean(bvec); c(ahat,bhat) # -0.5742 0.7175
yrmat=matrix(NA,nrow=N-n,ncol=J)
set.seed(334); for(j in 1:J)
yrmat[,j]= rnorm(N-n,avec[j]+bvec[j]*xr,1/sqrt(lamvec[j]))
hist(mvec,prob=T,xlim=c(8,14),ylim=c(0,1), breaks=seq(7,14,0.25),
xlab="m",main="(a) Histogram of 1000 m-values") # Ignore warnings
lines(density(mvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(mhat,mci,mcpdr),lty=2,lwd=3) # Histogram estimates
mhat2=c(1,16)%*%T; points(mhat2,0, pch=16,cex=1.5) # Exact posterior mean
mvarterm2=c(1,16)%*%D%*%c(1,16); msdterm2=sqrt(mvarterm2)
mv=seq(6,16,0.05); fmv2=mv
for(k in 1:length(mv))
fmv2[k]=mean(dnorm(mv[k],mhat2,msdterm2/sqrt(lamvec)))
lines(mv,fmv2,lwd=3); # Exact posterior density of m
points(m,0, pch=4,cex=2,lwd=3 ) # True value of m
legend(8,1,c("Histogram estimate","Exact density"), lty=c(2,1),lwd=c(3,3),
bg="white")
legend(8,0.6,c("Rao-Blackwell","True"),pch=c(16,4),
pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")
hist(ybarvec,prob=T,xlim=c(8,12),ylim=c(0,1), breaks=seq(3,18,0.25),
xlab="ybar",main="(b) Histogram of 1000 ybar-values")
lines(density(ybarvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(ybarhat, ybarci, ybarcpdr),lty=2,lwd=3) # Histogram estimates
ybarv=seq(8,13,0.02); fybarhatv=ybarv;
meanvalvec = (1/N)*( ysT+(N-n)*(avec+bvec*xrbar) )
varvalvec = (N-n)/(lamvec*N^2)
for(k in 1:length(ybarv)){
fybarhatv[k]= mean( dnorm(ybarv[k], meanvalvec, sqrt(varvalvec) ) ) }
lines(ybarv, fybarhatv,lty=1,lwd=3) # Rao-Blackwell
points(mean(meanvalvec),0,pch=16,cex=1.5) # Rao-Blackwell
points(ybar, 0, pch=4,cex=2,lwd=3 ) # True value of ybar
legend(8,1,c("Histogram estimate","Rao-Blackwell"),
lty=c(2,1),lwd=c(3,3), bg="white")
legend(8,0.6,c("Rao-Blackwell","True value"),pch=c(16,4),
pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")
model
{
for(i in 1:100){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.0001)
b ~ dnorm(0.0,0.0001)
lam ~ dgamma(0.0001,0.0001)
m <- a+16*b
ybar <- mean(y[])
max <- ranked(y[],100)
medL <- ranked(y[],50)
medU <- ranked(y[],51)
med <- (medL + medU)/2
psi <- max/med
}
# data
list(y=c(
14.98,10.99,9.58,6.56,13.83, 11.38,9.13,13.25,7.03,11.14,
2.74,11.97,12.15,9.39,11.71, 10.25,7.98,8.54,10.66,10.41,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA),
x=c(19.34,18.2,14.27,10.91,13.45,13.3,11.31,16.62,13.07,17.45,10.55,
17.66,17.34,17.46,16.14,17.19,10.96,14.19,16.08,14.83,17.92,16.61,
14.52,16.7,12.28,14.61,14.51,11.5,15.17,16.72,11.27,15.21,16.34,
10.36,12.62,19.27,19.7,12.26,10.07,18.74,11.86,12.35,16.79,13.18,
14.05,17.52,18.17,18.7,18.1,10.17,10.26,12.95,12.64,12.35,18.39,
12.08,17.48,13.47,14.47,16.76,17.64,14.32,19.07,17.29,15.87,14.2,
18.49,14.69,13.57,14.74,12.41,19.99,18.39,16.43,15.6,15.74,18.33,
16.98,16.72,19.3,13.92,11.4,11.55,13.83,12.36,13.3,15.3,19.26,18.15,
17.75,10.72,13.78,13.2,14.98,13.53,10.19,16.46,12.57,10.36,19.49))
# inits
list(a=0,b=0,lam=1)
the finite population mean ȳ = (y_1 + ... + y_N)/N.
if the value of unit 1 is 1 then each sample with unit 1 is twice as likely to be selected as each sample without unit 1.
We observe the values of the two sampled units (each being 0 or 1) as well
as the labels identifying them (each being 1, 2, 3 or 4).
(a) Write down a suitable Bayesian model for the above scenario in terms
of the densities of the parameter θ , the finite population vector,
y = ( y1 ,..., y N ) , and the sample, s = ( s1 ,..., sn ) .
Your formulae may involve only these variables, as well as n, N, and the
vector of inclusion counters, I = ( I1 ,..., I N ) , where I i = 1 if the ith unit is
in the sample, and I i = 0 otherwise. (Note that there is a one-to-one
correspondence between s and I in this exercise.)
(i) Design and run a Gibbs sampler to check the posterior mean of θ
in (c) and the predictive mean of y in (f).
(j) Use Monte Carlo methods to check the two design biases in (h).
(k) Find the mean of the predictive mean of the finite population mean.
Then apply Monte Carlo methods to check your answer.
f(θ) = 1/2,  θ = 1/4, 3/4.
Also, if y_1 = 1 then
f(s | y, θ) = f(s | y_1) = { 2c, 1 ∈ s ; c, 1 ∉ s }
= { 2c, s = (1,2), (1,3), (1,4) ; c, s = (2,3), (2,4), (3,4) }.
Hence
1 = ∑_s f(s | y) = c ∑_s (1 + I_1) = c [ ∑_s 1 + ∑_{s: 1∈s} 1 ]
= c [ C(N, n) + C(N−1, n−1) ] = c [ C(4, 2) + C(3, 1) ] = c(6 + 3) = 9c
⇒ c = 1/9.
Note 2: There are a total of C(N−1, n−1) samples s which contain any given particular unit i. So if y_1 = 1 then
f(s | y, θ) = f(s | y_1) = { 2/9, s = (1,2), (1,3), (1,4) ; 1/9, s = (2,3), (2,4), (3,4) }.
Putting together the two cases above (y_1 = 0 and 1), we see that the sampling mechanism is given generally by
f(s | y, θ) = f(s | y_1) = (1 + I_1 y_1) / [ C(N, n) + C(N−1, n−1) y_1 ]
= (1 + I_1 y_1) / (6 + 3 y_1),  s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4),
where of course I_1 = I( s ∈ {(1,2), (1,3), (1,4)} ).
From Table A.2 we may also confirm that, as specified in the problem:
In that case:
f(s | y, θ) = (3 − y_1)/18 = { 1/6 = 3/18, y_1 = 0 ; 1/9 = 2/18, y_1 = 1 },  s = (2,3), (2,4), (3,4).
First, suppose that unit 1 is sampled, so that y_1 is observed. Then
f(θ | D) = f(θ | s, y_s) ∝ f(θ, s, y_s) = ∑_{y_r} f(θ, s, y_s, y_r)
= ∑_{y_r} f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
= ∑_{y_r} f(θ) f(y_s | θ) f(y_r | θ) f(s | y_1)
= f(θ) f(y_s | θ) f(s | y_1) ∑_{y_r} f(y_r | θ)
∝ ∏_{i∈s} θ^{y_i} (1 − θ)^{1−y_i} = θ^{y_sT} (1 − θ)^{2−y_sT}
= { (1/4)^{y_sT} (3/4)^{2−y_sT}, θ = 1/4 ; (3/4)^{y_sT} (1/4)^{2−y_sT}, θ = 3/4 }
∝ { 3^{2−y_sT}, θ = 1/4 ; 3^{y_sT}, θ = 3/4 }
∝ { 3^2, θ = 1/4 ; 3^{y_sT + y_sT}, θ = 3/4 }
= { 9, θ = 1/4 ; 9^{y_sT}, θ = 3/4 }.
That is (if 1 ∈ s),
f(θ | D) = { 9/(9 + 9^{y_sT}), θ = 1/4 ; 9^{y_sT}/(9 + 9^{y_sT}), θ = 3/4 }
= { (9/10, 1/10), y_sT = 0 ; (1/2, 1/2), y_sT = 1 ; (1/10, 9/10), y_sT = 2 }
(the pairs giving the probabilities of θ = 1/4 and θ = 3/4 respectively).
Note: This could also be written as θ̂ = (3 + 2 y_sT)/10 (if 1 ∈ s).
Next, suppose that unit 1 is not sampled. Then the value of unit 1 is unknown and so the sampling mechanism is nonignorable. In this case,
f(θ | D) ∝ ∑_{y_r} f(θ) f(y_s, y_r | θ) f(s | y_s, y_r, θ)
= ∑_{y_r} f(θ) f(y_s | θ) f(y_r | θ) f(s | y_1)
= f(θ) f(y_s | θ) ∑_{y_r} f(s | y_1) f(y_r | θ)
= f(θ) f(y_s | θ) q(θ),
where
q(θ) ∝ ∑_{y_r} (3 − y_1) f(y_r | θ) = E_{y_r}(3 − y_1 | θ) = 3 − θ.
In detail (writing y_r = (y_1, y_k), where k is the other nonsampled unit),
∑_{y_r} (3 − y_1) f(y_r | θ) = [ ∑_{y_k=0}^{1} θ^{y_k} (1 − θ)^{1−y_k} ] × [ ∑_{y_1=0}^{1} (3 − y_1) θ^{y_1} (1 − θ)^{1−y_1} ]
= 1 × { (3 − 0) θ^0 (1 − θ)^{1−0} + (3 − 1) θ^1 (1 − θ)^{1−1} }
= 3(1 − θ) + 2θ
= 3 − θ.
∝ { 3^2 × 11, θ = 1/4 ; 3^{y_sT + y_sT} × 9, θ = 3/4 }
= { 11, θ = 1/4 ; 9^{y_sT}, θ = 3/4 }.
Putting the two cases together we find that the posterior mean of θ is given generally by
θ̂ = E(θ | D) = θ̂(D) = θ̂(s, y_s),
a function of the data whose six possible values are listed further below.
Note: Here:
1 ∈ s ⇔ I_1 = 1 ⇔ s = (1,2), (1,3) or (1,4)
1 ∉ s ⇔ I_1 = 0 ⇔ s = (2,3), (2,4) or (3,4).
Also:
ysT = 0 iff both sampled values are 0
ysT = 1 iff one sampled value is 0 and the other is 1
ysT = 2 iff both sampled values are 1.
Now,
f(y | θ, s) = f(y, s | θ) / f(s | θ),
where:
f(y, s | θ) = f(s | y, θ) f(y | θ) = [ (3 + y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i}
(using the result in (b) that f(s | y, θ) = (3 + y_1)/18 if 1 ∈ s)
f(s | θ) = ∑_y f(y, s | θ) = ∑_y f(s | y, θ) f(y | θ) = E_y{ f(s | y, θ) | θ } = E_y{ (3 + y_1)/18 | θ } = (3 + θ)/18.
Therefore
f(y | θ, s) = [ (3 + y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i} / [ (3 + θ)/18 ]
= [ (3 + y_1)/(3 + θ) ] θ^{y_1} (1 − θ)^{1−y_1} ∏_{i=2}^{4} θ^{y_i} (1 − θ)^{1−y_i}.
We see that
(y_i | θ, s) ~ ⊥ Bernoulli(π_i),  i = 1, 2, 3, 4,
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 + 1)/(3 + θ) ] θ^1 (1 − θ)^{1−1} = 4θ/(3 + θ).
Check: [ (3 + 0)/(3 + θ) ] θ^0 (1 − θ)^{1−0} = 3(1 − θ)/(3 + θ) = 1 − 4θ/(3 + θ) = 1 − π_1.
It follows that
E(y_sT | θ, s) = E(y_1 | θ, s) + E(y_3 | θ, s) = π_1 + π_3 = 4θ/(3 + θ) + θ = θ(7 + θ)/(3 + θ)
= (1/4)(7 + 1/4) / (3 + 1/4) = (29/16)/(13/4) = 29/52.
Hence
E(θ̂ | θ, s) = (1/10)[ 3 + 2(29/52) ] = 107/260 = 0.4115.
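This value can be checked numerically in R (a quick sketch, taking θ = 1/4 and s = (1,3) as above):
pi1 = 4*(1/4)/(3 + 1/4); pi3 = 1/4     # P(y1=1) and P(y3=1) given theta and s
(3 + 2*(pi1 + pi3))/10                 # 0.4115, i.e. 107/260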
In this case,
f(y | θ, s) = f(y, s | θ) / f(s | θ),
as before, but with
f(y, s | θ) = f(s | y, θ) f(y | θ) = [ (3 − y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i}
(using the result in (b) that f(s | y, θ) = (3 − y_1)/18 if 1 ∉ s).
Thus,
f(s | θ) = ∑_y f(y, s | θ) = ∑_y f(s | y, θ) f(y | θ) = E_y{ f(s | y, θ) | θ } = E_y{ (3 − y_1)/18 | θ } = (3 − θ)/18.
So
f(y | θ, s) = [ (3 − y_1)/18 ] ∏_{i=1}^{4} θ^{y_i} (1 − θ)^{1−y_i} / [ (3 − θ)/18 ]
= [ (3 − y_1)/(3 − θ) ] θ^{y_1} (1 − θ)^{1−y_1} ∏_{i=2}^{4} θ^{y_i} (1 − θ)^{1−y_i}.
We see that
(y_i | θ, s) ~ ⊥ Bernoulli(π_i),  i = 1, 2, 3, 4,
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 − 1)/(3 − θ) ] θ^1 (1 − θ)^{1−1} = 2θ/(3 − θ).
Check: [ (3 − 0)/(3 − θ) ] θ^0 (1 − θ)^{1−0} = 3(1 − θ)/(3 − θ) = 1 − 2θ/(3 − θ) = 1 − π_1.
It follows that
E(y_sT | θ, s) = E(y_2 | θ, s) + E(y_3 | θ, s) = π_2 + π_3 = θ + θ = 1/4 + 1/4 = 1/2.
Equivalently,
(y_sT | θ, s) ~ Bin(2, θ),
and so
E(y_sT | θ, s) = 2θ.
Hence, weighting the three possible values of θ̂ by the Bin(2, 1/4) probabilities 9/16, 6/16 and 1/16,
E(θ̂ | θ, s) = (9/16)(7/24) + (6/16)(19/40) + (1/16)(127/184) = 2127/5520 = 0.3853.
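Again this can be verified numerically in R (a quick sketch):
sum( dbinom(0:2, 2, 1/4) * c(7/24, 19/40, 127/184) )   # 0.3853, i.e. 2127/5520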
∝ f(θ) f(y_s | θ)
∝ 1 × ∏_{i∈s} θ^{y_i} (1 − θ)^{1−y_i}.
Note: In the above, E(θ̂ | θ, y) does not depend on θ. So, for the case θ = 3/4 and y = (0,0,1,1), the design bias of θ̂ is 1/2 − 3/4 = −0.25.
Recall from (c) that the posterior mean of θ is a function of the data given generally by
θ̂ = θ̂(s, y_s) =
3/10 = 0.3000 if 1 ∈ s and y_sT = 0
1/2 = 0.5000 if 1 ∈ s and y_sT = 1
7/10 = 0.7000 if 1 ∈ s and y_sT = 2
7/24 = 0.2917 if 1 ∉ s and y_sT = 0
19/40 = 0.4750 if 1 ∉ s and y_sT = 1
127/184 = 0.6902 if 1 ∉ s and y_sT = 2.
Then y_s = (y_1, y_2) = (1, 0).
Likewise:
If s = (1,3) then y_s = (y_1, y_3) = (1, 1) and so
θ̂(s, y_s) f(s | θ, y) = (7/10) × (3 + 1)/18 = 7/45.
Therefore
E(y_rT | θ, s, y_s) = E(y_rT | θ, s) = { θ + θ, 1 ∈ s ; θ + φ, 1 ∉ s },
where
φ = 2θ/(3 − θ).
So
E(y_rT | s, y_s) = E{ E(y_rT | θ, s, y_s) | s, y_s }
= { E(2θ | D), 1 ∈ s ; E(θ | D) + E(φ | D), 1 ∉ s }
= { 2θ̂, 1 ∈ s ; θ̂ + φ̂, 1 ∉ s },
where
φ̂ = E(φ | D) = E_θ{ 2θ/(3 − θ) | D } = ∑_{θ = 1/4, 3/4} [ 2θ/(3 − θ) ] f(θ | D).
Note: Working through the above equation using exact fractions, it can be shown that
ŷ = ŷ(s, y_s) =
3/20, 1 ∈ s, y_sT = 0
1/2, 1 ∈ s, y_sT = 1
17/20, 1 ∈ s, y_sT = 2
37/288, 1 ∉ s, y_sT = 0
15/32, 1 ∉ s, y_sT = 1
607/736, 1 ∉ s, y_sT = 2.
The following are details of the working for 37/288, 15/32 and 607/736.
Observe that
E(y_rT | θ, s, y_s) = θ + 2θ/(3 − θ) = θ(5 − θ)/(3 − θ).
Therefore
ŷ_rT = E{ E(y_rT | s, y_s, θ) | s, y_s } = E{ θ(5 − θ)/(3 − θ) | s, y_s }.
If y_sT = 0 then
ŷ_rT = E{ θ(5 − θ)/(3 − θ) | D } = (11/12) × [ (1/4)(5 − 1/4)/(3 − 1/4) ] + (1/12) × [ (3/4)(5 − 3/4)/(3 − 3/4) ]
= (11/12)(19/44) + (1/12)(17/12) = 57/144 + 17/144 = 74/144 = 37/72.
Also, if y_sT = 1 then
ŷ_rT = (11/20)(19/44) + (9/20)(17/12) = 57/240 + 153/240 = 210/240 = 7/8.
And if y_sT = 2 then
ŷ_rT = (11/92)(19/44) + (81/92)(17/12) = 57/1104 + 1377/1104 = 1434/1104 = 239/184.
Hence
ŷ_T = y_sT + ŷ_rT = { 0 + 37/72 = 37/72, y_sT = 0 ; 1 + 7/8 = 15/8, y_sT = 1 ; 2 + 239/184 = 607/184, y_sT = 2 }.
A similar logic can be used to obtain the fractions 3/20, 1/2 and 17/20.
In this case,
y_sT = y_1 + y_3,
and so:
P(y_sT = 0 | θ, s) = (9/13) × (3/4) = 27/52
P(y_sT = 2 | θ, s) = (4/13) × (1/4) = 4/52
P(y_sT = 1 | θ, s) = 1 − 27/52 − 4/52 = 21/52.
In this case,
y_sT = y_2 + y_3,
and so:
P(y_sT = 0 | θ, s) = (3/4) × (3/4) = 9/16
P(y_sT = 2 | θ, s) = (1/4) × (1/4) = 1/16
P(y_sT = 1 | θ, s) = 1 − 9/16 − 1/16 = 6/16.
Note: The derivation of this result did not involve θ . So for the case
θ = 3/4 and y = (0,0,1,1) , the design bias of ŷ is also −0.01463.
E(ŷ | θ, y) = E{ E(ŷ | θ, y, s) | θ, y } = ∑_s ŷ(s, y_s) f(s | θ, y)
= (2/9) ŷ((1,2), (1,0)) + (2/9) ŷ((1,3), (1,1)) + (2/9) ŷ((1,4), (1,1))
+ (1/9) ŷ((2,3), (0,1)) + (1/9) ŷ((2,4), (0,1)) + (1/9) ŷ((3,4), (1,1))
= (2/9)(0.5 + 0.85 + 0.85) + (1/9)(0.46875 + 0.46875 + 0.8247283)
= 0.684692.
Note: The derivation of this result did not involve θ. So for the case θ = 3/4 and y = (1,0,1,1), the design bias of ŷ is also −0.06531.
= { (1/4)^{y_T} (1 − 1/4)^{4−y_T}, θ = 1/4 ; (3/4)^{y_T} (1 − 3/4)^{4−y_T}, θ = 3/4 }.   (A.1)
where:
π_2 = π_3 = π_4 = θ
π_1 = [ (3 − 1)/(3 − θ) ] θ^1 (1 − θ)^{1−1} = 2θ/(3 − θ).
Therefore
(y_{r_2} | θ, s, y_s, y_{r_1}) ~ Bernoulli(θ).   (A.2)
However, there are two possibilities for y_{r_1}. If the data is such that s_1 = 1 then
(y_{r_1} | θ, s, y_s, y_{r_2}) ~ Bernoulli(θ).   (A.3)
On the other hand, if the data is such that s_1 > 1 then r_1 = 1, and this implies that
(y_{r_1} | θ, s, y_s, y_{r_2}) ~ Bernoulli( 2θ/(3 − θ) ).   (A.4)
It will be observed that these numbers are very close to the corresponding values obtained in (c), namely
θ̂ =
3/10 = 0.3000 if 1 ∈ s and y_sT = 0
1/2 = 0.5000 if 1 ∈ s and y_sT = 1
7/10 = 0.7000 if 1 ∈ s and y_sT = 2
7/24 = 0.2917 if 1 ∉ s and y_sT = 0
19/40 = 0.4750 if 1 ∉ s and y_sT = 1
127/184 = 0.6902 if 1 ∉ s and y_sT = 2.
It will be noted that these are very close to the corresponding values obtained in (f), namely:
0.15, 0.5, 0.85, 0.1284722, 0.4687500, 0.8247283.
(j) To check the design bias in (h)(i) we note that for y = (0,0,1,1) the
sampling mechanism is ignorable.
To check the design bias in (h)(ii) we note that for y = (1,0,1,1) the sampling mechanism is nonignorable, with each sample containing unit 1 twice as likely as each sample not containing unit 1.
So, select a sample s from (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), in such a
way that each of the first three of these has probability 2/9 and each of the
last three has probability 1/9. Then calculate the corresponding value of
ŷ . Repeat another J − 1 times, independently. Then take the mean of the
simulated ŷ values and subtract y = 3/4.
(k) The mean of the predictive mean of the finite population mean is the
same as the unconditional mean of the finite population mean, which is
the same as the prior mean of the superpopulation mean, which in our case
equals 1/2. Mathematically,
E ŷ = E E(ȳ | s, y_s)   by the definition of ŷ
= E ȳ   by the law of conditional expectation
= E E(ȳ | θ)   by the law of conditional expectation
= E θ   since E(ȳ | θ) = (1/4) ∑_{i=1}^{4} E(y_i | θ) = (1/4) ∑_{i=1}^{4} θ = θ
= ∑_θ θ f(θ) = (1/4)(1/2) + (3/4)(1/2) = 1/2.
To verify this obvious result via Monte Carlo is a good final check on
previous calculations.
To this end, simulate θ , then simulate y, then simulate s, hence obtain the
data ( s, y s ) , then calculate the associated ŷ . Then repeat all of the above
independently another J − 1 times.
# (g)
# postfun returns the posterior probabilities f(theta | s, ys) for theta = 1/4 and 3/4
postfun = function(s=c(1,2), ys=c(0,1)){ ysT=sum(ys)
if(any(s==1)==T){ if(ysT==0) probs=c(0.9,0.1)
if(ysT==1) probs=c(0.5,0.5)
if(ysT==2) probs=c(0.1,0.9) }
if(any(s==1)==F){ if(ysT==0) probs=c(11/12,1/12)
if(ysT==1) probs=c(11/20,9/20)
if(ysT==2) probs=c(11/92,81/92) }
probs }
smat=matrix(c(1,2, 1,2, 1,2, 1,2, 2,3, 2,3, 2,3, 2,3), byrow=T,nrow=8, ncol=2)
ysmat= matrix(c(0,0, 0,1, 1,0, 1,1, 0,0, 0,1, 1,0, 1,1),
byrow=T,nrow=8, ncol=2)
thetahatvec=rep(NA,8); phihatvec=rep(NA,8); ybarhatvec=rep(NA,8);
# (h)
(1/6)*(0.15 + 0.5 + 0.5 + 0.46875+ 0.46875 + 0.8247283) # 0.4853714
(2/9)*(0.5 + 0.85 + 0.85) + (1/9)*(0.46875+ 0.46875 + 0.8247283) # 0.684692
# (i) Check posterior means and predictive means via Gibbs sampler
options(digits=4)
GS=function(J=1000, s=c(1,2),ys=c(1,0), theta=1/4 ){
thetav=rep(NA,J); yrTv=rep(NA,J); yTv=rep(NA,J)
yrmat=matrix(NA,nrow=J,ncol=2); ysT=sum(ys)
for(j in 1:J){
probsyi = c(1-theta, theta)
yr2=sample(x=c(0,1),size=1,prob=probsyi)
if(s[1]==1) yr1=sample(x=c(0,1),size=1,prob=probsyi) else
yr1=sample(x=c(0,1),size=1,prob=c(3,2)*probsyi)
yr=c(yr1,yr2); yrT=sum(yr); yT=ysT+yrT
probstheta=c( (1/4)^yT *(3/4)^(4-yT), (3/4)^yT *(1/4)^(4-yT) )
theta = sample( x=c(1/4,3/4), size=1, prob= probstheta)
thetav[j]=theta; yrTv[j]=yrT; yTv[j]=yT; yrmat[j,]=yr
}
list(thetav=thetav, yrTv=yrTv, yTv=yTv, ybarv=yTv/4, yrmat=yrmat) }
# (j) Check design bias of predictive mean of ybar if theta=1/4 and y=(0,0,1,1)
smatrix=matrix(c(1,2, 1,3, 1,4, 2,3, 2,4, 3,4), byrow=T,nrow=6, ncol=2)
y=c(0,0,1,1); J = 10000; ybarhatsimv=rep(NA,J); set.seed(413)
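# The simulation loop assumed here for the first check (y = (0,0,1,1): unit 1 has
# value 0, so all six samples are equally likely); ybarhatfun is taken to be a
# function returning the value of yhat(s, ys) derived in (f):
for(j in 1:J){
indexsim = sample(1:6,1,prob=c(1,1,1,1,1,1))
ssim=smatrix[indexsim,]; yssim= y[ssim]
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }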
est=mean(ybarhatsimv)-0.5;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.5)/sqrt(J)
c(est,ci) # -0.01562 -0.01945 -0.01179 Consistent with -0.01463 in (h)(i)
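# Analogous loop assumed for the second check, now with y = (1,0,1,1), so that each
# sample containing unit 1 is twice as likely (weights 2,2,2,1,1,1):
y=c(1,0,1,1)
for(j in 1:J){
indexsim = sample(1:6,1,prob=c(2,2,2,1,1,1))
ssim=smatrix[indexsim,]; yssim= y[ssim]
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }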
est=mean(ybarhatsimv)-0.75;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.5)/sqrt(J)
c(est,ci) # -0.06592 -0.06944 -0.06239 Consistent with -0.06531 in (h)(ii)
# (k) Simulate theta, then y, then s; the mean of the resulting ybarhat values should be 1/2
for(j in 1:J){
thetasim=sample(c(1/4,3/4),1); ysim=rbinom(4,1,thetasim)
if(ysim[1]==0) indexsim = sample(1:6,1,prob=c(1,1,1,1,1,1))
if(ysim[1]==1) indexsim = sample(1:6,1,prob=c(2,2,2,1,1,1))
ssim=smatrix[indexsim,]; yssim= ysim[ssim];
ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }
est = mean(ybarhatsimv);
ci = est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv)/sqrt(J)
c(est,ci) # 0.4992 0.4938 0.5047 Consistent with 0.5
APPENDIX B
Distributions and Notation
Below are several probability distributions which feature in this book. The
purpose of this appendix is to provide a brief guide to the style of notation
and terminology used throughout. It is not intended to be a comprehensive
listing. Some of the notation introduced here is repeated in Appendix C.
If X ~ N(μ, σ²) then EX = Mode(X) = Median(X) = μ and VX = σ².
The cdf of X may be written
F(x) = P(X ≤ x) = F_{N(μ,σ²)}(x) = F(x, N(μ, σ²)) = ∫_{−∞}^{x} f_{N(μ,σ²)}(t) dt.
Thus the p-quantile of X is the value of the inverse cdf of X at p. This may also be written
F^{−1}(p) = F_X^{−1}(p) = F_{N(μ,σ²)}^{−1}(p) = FInv(p, N(μ, σ²)).
If Z ~ N(0,1), we say that Z has the standard normal distribution. The pdf, cdf, (lower) p-quantile and upper p-quantile of Z may be denoted by φ(z), Φ(z), Φ^{−1}(p), and z_p = Φ^{−1}(1 − p), respectively.
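For instance, in R the corresponding normal density, cdf and quantile functions are dnorm, pnorm and qnorm; a brief illustration:
qnorm(0.975)        # upper 0.025-quantile of N(0,1), about 1.96
qnorm(0.10, 5, 2)   # lower 0.10-quantile of N(5, 4), i.e. mean 5 and sd 2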
If X ~ G(a, b) then:
Mode(X) = (a − 1)/b if a > 1
Mode(X) = 0 if a ≤ 1
EX = a/b, VX = a/b²
EX^k = Γ(a + k) / ( b^k Γ(a) ) (the kth raw moment of X),
where Γ(k) = ∫_0^∞ t^{k−1} e^{−t} dt.
Note: We do not write X ~ Exp(b) because this could more easily be confused with X = exp(b) = e^b (where exp is the exponential function).
Note: The symbol ∝_y here denotes ‘proportionality with respect to y’. The statement g ∝_t h means g = c × h, where c is a constant that does not depend on t. E.g. if g = 5t²r³, we may write: g ∝_t t², g ∝_r r³, g ∝_{t,r} t²r³, etc. By default, g(t) ∝ t⁵ means g(t) ∝_t t⁵, and g(t | u) ∝ t⁵ means g(t | u) ∝_t t⁵ (not g(t | u) ∝_{t,u} t⁵).
The upper p-quantile of the t distribution with m degrees of freedom may be written t_p(m) = F_{t(m)}^{−1}(1 − p) = FInv(1 − p, t(m)). We call m the degrees of freedom parameter.
Suppose that U ~ χ²(a), W ~ χ²(b) and U ⊥ W. Then X = (U/a)/(W/b) has the F distribution with parameters a and b. We then write X ~ F(a, b). The pdf and cdf of X (both omitted here) may be denoted f_{F(a,b)}(x) and F_{F(a,b)}(x), respectively. We call a the numerator degrees of freedom and b the denominator degrees of freedom. The upper p-quantile of X may be denoted as F_p(a, b) or F_{F(a,b)}^{−1}(1 − p) or FInv(1 − p, F(a, b)).
APPENDIX C
Abbreviations and Acronyms
Below are some of the abbreviations and acronyms used in this book. The
list may not be comprehensive. Some of the expressions listed have more
than one meaning, depending on the context.
D data
DA data augmentation (algorithm)
df distribution function (same as cdf)
dof degrees of freedom
dsn distribution
DU discrete uniform distribution
E expectation operator
e Euler’s number (2.71828)
ECM Expectation-Conditional-Maximisation (algorithm)
ELF error loss function
EM Expectation-Maximisation (algorithm)
E-Step Expectation Step (in EM algorithm)
exp exponential function (e raised to a power)
Expo exponential distribution
m nonsample size ( m= N − n )
MA moving average (process); Metropolis algorithm
MAD mean absolute deviation; finite population mean
absolute deviation about the superpopulation mean
max maximum/maximise
MC Monte Carlo (method); Markov chain
MCMC Markov chain Monte Carlo (method)
MH Metropolis-Hastings (algorithm)
min minimum/minimise
ML maximum likelihood (method)
MLE maximum likelihood estimate/estimator/estimation
MOME method of moments estimate/estimator/estimation
M-Step Maximisation Step (in EM algorithm)
Bibliography
Leonard, T., and Hsu, J.S.J. (1999). Bayesian Methods: An Analysis for
Statisticians and Interdisciplinary Researchers. Cambridge:
Cambridge University Press.
Lee, P. (1997). Bayesian Statistics: An Introduction. New York: Oxford
University Press.
Lunn, D.J., Thomas, A., Best, N., and Spiegelhalter, D. (2000).
WinBUGS − A Bayesian modelling framework: Concepts,
structure, and extensibility. Statistics and Computing, 10: 325−337.
Maindonald, J., and Braun, W.J. (2010). Data Analysis and Graphics
Using R: An Example-Based Approach, 3rd Edition. Cambridge:
Cambridge University Press.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of
Statistics, 22: 1142−1160.
Ntzoufras, I. (2009). Bayesian Modeling Using WinBUGS. Hoboken NJ:
Wiley.
O’Hagan, A., and Forster, J. (2004). Kendall’s Advanced Theory of
Statistics, Second Edition, Volume 2B, Bayesian Inference. London:
Arnold.
Puza, B. (1995). Monte Carlo Methods for Finite Population Inference.
Internal document. Canberra: Australian Bureau of Statistics.
Puza, B.D. (2002). ‘Postscript: Bayesian methods for estimation’ and
‘Appendix C: Details of calculations in the Postscript’. In Combined
Survey Sampling Inference: Weighing Basu’s Elephants, by
K. Brewer, London: Arnold, 2002, pp 293−296 and 299−302.
Puza, B.D., and O’Neill, T.J. (2005). Length-biased, with-replacement
sampling from an exponential finite population. Journal of
Statistical Computation and Simulation, 75: 159−174.
Puza, B. and O’Neill, T.J. (2006). Selection bias in binary data from
volunteer surveys. The Mathematical Scientist, 31: 85−94.
Rao, C.R. (1973). Linear Statistical Inference and its Applications, 2nd
Edition. New York: Wiley.
Rao, J.N.K. (2011). Impact of frequentist and Bayesian methods on
survey sampling practice: a selective appraisal. Statistical Science,
26: 240−256.
Särndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted
Survey Sampling. New York: Springer.
Seaman, S., Galati, J., Jackson, D., and Carlin, J. (2013). What is meant
by ‘Missing at Random’? Statistical Science, 28(2): 257−268.
Shaw, D. (1988). On-site samples’ regression: Problems of non-negative
integers, truncation, and endogenous stratification. Journal of
Econometrics, 37: 211−223.