Block 4 ST3189
Learning outcomes
Understand the basic concepts of Bayesian inference
Identify differences between Frequentist and Bayesian Inference in point and interval
estimation, hypothesis testing and prediction
Implement Bayesian Inference in various examples including linear regression using R.
Introductory example
Recall the example of the 'Advertising' dataset of the James et al book that we analysed in the Linear
Regression block.
Suppose that an expert from the marketing department of the company that supplies this product
provides us with the following additional information.
'Let's assume an increase of $1,000 in the budget of either TV, radio or newspaper advertising. It is
extremely unlikely that sales will change (upwards or downwards) by more than 50 units as a
result of this increased budget.'
How can we incorporate this extra information in our conclusions about the role of these types of
advertisement on the sales of this product?
One way to address this question formally is through an alternative school of statistics, that of
Bayesian inference. In addition to allowing additional information to be incorporated in a probabilistic
manner, Bayesian inference can be useful for other reasons, such as incorporating parameter
uncertainty into prediction (although this can perhaps also be done via the Bootstrap). In this block we will
review and apply its basic concepts, noting differences with frequentist inference.
The starting point is Bayes theorem. For two events A and B with P(A) > 0, it states that

P(B|A) = P(A|B)P(B) / P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)]

Note that the events B and B^c form a partition of the sample space (i.e. they are disjoint and their
union is equal to the sample space). The law of total probability then ensures that
P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).

Suppose now that the unknown quantity of interest is a parameter θ taking one of the values θ_1, θ_2, …,
with probabilities π(θ_j) representing our uncertainty about θ before seeing the data y (a-priori), and that the
data have probability f(y|θ_j) under each value. Applying Bayes theorem gives

π(θ_j|y) = f(y|θ_j)π(θ_j) / Σ_k f(y|θ_k)π(θ_k)    (4.2)
The distribution π(θ|y), formed by the probabilities π(θ_j|y) for each j, is termed the posterior
distribution and represents our uncertainty about θ after observing y (a-posteriori).
If θ is a continuous variable, i.e. takes values on an interval such as the real line R, then π(θ) denotes the
probability density function (pdf) of the prior. The posterior pdf is then given by:
π(θ|y) = f(y|θ)π(θ) / f(y),  where f(y) = ∫ f(y|θ)π(θ) dθ    (4.3)
The term f(y) is known as marginal likelihood or evidence and reflects the density (probability for
discrete y) of the data under the adopted probability model. Note that it does not depend on θ, hence it
may be viewed as the normalising constant of π(θ|y) (to ensure that it integrates/sums to 1).
Bayes Estimators
In general, point estimators are functions of y and other known quantities (but not θ), the output of which
provides our 'educated guess' for θ. Examples of point estimators include the maximum likelihood
estimators (MLE), method of moments estimators, least squares estimators, etc.
Bayes estimators provide another alternative. Their exact description requires concepts from
statistical decision theory such as loss (utility) functions and the frequentist, posterior and Bayes risks (in fact,
Bayes estimators minimise the posterior and Bayes risk), which are beyond the scope of the course. So
we will just focus on the following three Bayes estimators:
The posterior mean E(θ|y).
The posterior median of π(θ|y).
The posterior mode θ̂, i.e. the value q such that π(θ|y) ≤ π(q|y) for all θ, also known as the MAP (maximum a-posteriori) estimator.
Bayesian Hypothesis Testing
Suppose we want to compare two hypotheses, H_0 and H_1. The Bayes factor of H_1 against H_0 is defined as the
ratio of the posterior odds to the prior odds of H_1:

B_10(y) = [P(H_1|y)/P(H_0|y)] / [P(H_1)/P(H_0)]    (4.4)

The value of the Bayes factor quantifies the strength of the evidence against H_0. In terms of wording, the
following guidelines are available:
Sometimes the following expression is used. Applying Bayes theorem to P(H_1|y) and P(H_0|y) in
Equation 4.4 we get

B_10(y) = [P(y|H_1)P(H_1) / (P(y)P(H_1))] / [P(y|H_0)P(H_0) / (P(y)P(H_0))] = P(y|H_1) / P(y|H_0)

which is also known as the ratio of the marginal likelihoods (evidences) of H_1 and H_0. Hence, the Bayes
factor chooses the hypothesis with the higher marginal likelihood.
Bayesian model choice is a straightforward extension of Bayesian hypothesis testing. In the
presence of several models (possibly more than two), we choose the one with the highest marginal
likelihood.
Bayesian Prediction/Forecasting
Let 𝑦𝑛 denote a future observation. Under the assumption that 𝑦𝑛 comes from the same probability
model as y, we are interested in predicting its value. Under Bayesian prediction/forecasting this is done
via the (posterior-)predictive distribution, which combines the uncertainty about the unknown
parameters θ with the uncertainty about the future observation:

f(𝑦𝑛|y) = ∫ f(𝑦𝑛|θ) π(θ|y) dθ
The predictive distribution can be used in different ways (e.g. point prediction, interval prediction,
etc) depending on the forecasting task at hand.
Prior Specification
As mentioned earlier the prior distribution represents our uncertainty regarding the unknown
parameter θ prior to seeing the data y.
In some cases information may be available prior to the experiment about various features of the
distribution, such as moments, percentiles, probabilities of certain intervals, etc. This information can
then be summarised to define the prior distribution. For example, in an experiment where the
observations are independent Poisson random variables with unknown mean λ, it may be known from
previous experience that the mean of λ is around 5 and the variance around 4.
In this case we can use the Gamma(α,β) distribution, which takes values on the positive real line
as λ does. The prior parameters α and β can be determined using the fact that if λ is Gamma(α,β)
then E(λ) = α/β = 5 and Var(λ) = α/β² = 4. Combining these, we get α = 6.25 and β = 1.25, hence we can
use the Gamma(6.25, 1.25) as our prior. This procedure is known as prior elicitation.
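As an illustration, a minimal R sketch of this elicitation step (the mean 5 and variance 4 come from the example above; the code itself is only a sketch, not part of the course materials) that solves for the Gamma hyperparameters and checks the implied prior:

m = 5; v = 4                          # elicited prior mean and variance of lambda
beta = m/v                            # since Var = mean/beta  => beta = mean/variance
alpha = m*beta                        # since mean = alpha/beta => alpha = mean*beta
c(alpha, beta)                        # 6.25 and 1.25
c(alpha/beta, alpha/beta^2)           # check: prior mean 5 and prior variance 4
qgamma(c(0.025, 0.975), alpha, beta)  # implied 95% prior interval for lambda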
When there is no information about θ prior to the experiment, it is not straightforward to reflect this
through a distribution. Quite often low-information priors are used, where the
variance of the prior distribution is set to a very large value. This is usually accompanied by a sensitivity
analysis at the end, to ensure that this choice (the particular large value of the variance) does not affect the
conclusions much.
Jeffreys Prior
An alternative approach is the Jeffreys prior, which is based on Fisher's information for θ. Given the
joint density (likelihood) f(y|θ), Fisher's information for θ is defined as

[I(θ)]_ij = −E( ∂² log f(y|θ) / ∂θ_i ∂θ_j )

and the Jeffreys prior is taken to be proportional to the square root of its determinant, π(θ) ∝ |I(θ)|^(1/2).

A closely related quantity is the Hessian matrix of the log of the unnormalised posterior π*(θ|y) = f(y|θ)π(θ),

[H(θ)]_ij = − ∂² log π*(θ|y) / ∂θ_i ∂θ_j

Then, as n → ∞, the posterior is approximately

π(θ|y) ≈ N_p(θ_M, H(θ_M)^(−1))    (4.5)

where θ_M denotes the posterior mode and N_p denotes a p-dimensional Normal distribution. The proof of the
above statement is similar to that of the asymptotic distribution of maximum likelihood estimators. Hence it is
often referred to as the 'Bayesian Central Limit Theorem'.
The expression in Equation 4.5 offers a way to conduct Bayesian Inference in an 'asymptotic' manner,
i.e. based on an approximate posterior distribution whose approximation error is small when we have
a 'large' amount of data. Note however that this is just an approximation and it could perform quite
poorly, especially in relatively small datasets.
Introduction
We will go through some examples of Bayesian inference. In most of them we will use the following
trick in order to find the posterior distribution without calculating the denominator in Bayes
theorem, which is generally a difficult integral: we write the posterior up to proportionality, keeping only
the terms that involve the parameter, and try to recognise the kernel of a known distribution. For example,
for a Normal kernel,

p(θ|x) ∝ exp(−(θ − μ)² / (2σ²)) ∝ exp(−(θ² − 2θμ) / (2σ²))

since the term exp(−μ²/(2σ²)) does not depend on θ.
Beta-Binomial
Suppose that y is an observation from a Binomial(n,θ) random variable. The likelihood is given by the
probability of y given θ, which is provided by the Binomial distribution:

f(y|θ) = (n choose y) θ^y (1−θ)^(n−y) ∝ θ^y (1−θ)^(n−y)

As 0 < θ < 1, a distribution with the same support must be chosen as the prior. A Beta distribution with
hyperparameters α and β, denoted Beta(α,β), provides such an example, with π(θ) ∝ θ^(α−1)(1−θ)^(β−1).
The posterior is then

π(θ|y) ∝ θ^(y+α−1)(1−θ)^(n−y+β−1)

i.e. a Beta(α+y, β+n−y) distribution.
Note that the posterior mean, one of the most commonly used Bayes estimators, which could be
used as an estimator for θ, is equal to (α+y)/(α+β+n). For α = β = 0 it coincides with the maximum
likelihood estimator (MLE) y/n. No proper prior distribution corresponds to this choice (it is an
improper prior), but the posterior is still well defined.
Other Bayes estimators include the posterior mode, which is equal to (α+y−1)/(α+β+n−2), and the
posterior median, which is not available in closed form but can be calculated via Monte Carlo; see the
relevant section in this block and the sketch below.
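A minimal R sketch of these three Bayes estimators (the values y = 7, n = 10, α = β = 2 are made-up illustrations, not from the course materials); the median is obtained both exactly via qbeta and by Monte Carlo:

y = 7; n = 10; alpha = 2; beta = 2          # illustrative data and hyperparameters
a.post = alpha + y; b.post = beta + n - y   # Beta posterior parameters
a.post/(a.post + b.post)                    # posterior mean
(a.post - 1)/(a.post + b.post - 2)          # posterior mode (MAP)
qbeta(0.5, a.post, b.post)                  # posterior median (exact quantile)
median(rbeta(1e6, a.post, b.post))          # posterior median via Monte Carlo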
Poisson-Gamma
Let y = (𝑦1, …, 𝑦𝑛) be a random sample (the 𝑦𝑖's are independent and identically distributed) from the
Poisson(λ) population. The likelihood is given by the joint density of the sample:

f(y|λ) = ∏_{i=1}^n exp(−λ) λ^{y_i} / y_i! ∝ exp(−nλ) λ^{Σ_i y_i}

As λ > 0, a distribution on the positive real line must be chosen as the prior. The Gamma distribution with
hyperparameters α and β, denoted Gamma(α,β), provides such an example. The posterior is then

π(λ|y) ∝ exp(−nλ) λ^{Σ_i y_i} × λ^{α−1} exp(−βλ) = λ^{α+Σ_i y_i −1} exp(−(β+n)λ)

i.e. a Gamma(α + Σ_i y_i, β + n) distribution.
Note that the posterior mean, one of the standard Bayes estimators, can be written as

(α + Σ_i y_i) / (β + n) = [β/(β + n)] (α/β) + [n/(β + n)] ȳ

which is a weighted average between the prior mean α/β and ȳ. As n → ∞ the posterior mean converges
to ȳ, which is the MLE in this case.
Let now 𝑦𝑛 denote a future observation from the same Poisson(λ) model (likelihood). The predictive
distribution for 𝑦𝑛 is

f(𝑦𝑛|y) = ∫_0^∞ f(𝑦𝑛|λ) π(λ|y) dλ = [Γ(α + Σ_i y_i + 𝑦𝑛) / (𝑦𝑛! Γ(α + Σ_i y_i))] × (β + n)^{α+Σ_i y_i} / (β + n + 1)^{α+Σ_i y_i+𝑦𝑛}

for 𝑦𝑛 = 0, 1, … (a Negative Binomial distribution).
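As a quick check, a hedged R sketch (the data vector, α = 2 and β = 1 are made-up assumptions for illustration) that compares the closed-form predictive above, evaluated via dnbinom with size α+Σy_i and probability (β+n)/(β+n+1), against a Monte Carlo approximation that first draws λ from the posterior and then a new Poisson observation:

set.seed(1)
y = c(3, 1, 4, 2, 2)                                 # illustrative data (an assumption)
alpha = 2; beta = 1                                  # illustrative hyperparameters
a.post = alpha + sum(y); b.post = beta + length(y)   # Gamma posterior parameters
# closed-form predictive: Negative Binomial probabilities for 0,...,5
dnbinom(0:5, size = a.post, prob = b.post/(b.post + 1))
# Monte Carlo approximation: draw lambda from the posterior, then a new Poisson value
lam = rgamma(1e6, a.post, b.post)
ynew = rpois(1e6, lam)
table(ynew[ynew <= 5])/1e6                           # empirical predictive probabilities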
Normal-Normal
Let y = (𝑦1, …, 𝑦𝑛) be a random sample (the 𝑦𝑖's are independent and identically distributed) from a
N(θ, 𝜎²) distribution with 𝜎² known. The likelihood is given by the joint density of the sample and is
proportional to

f(y|θ) ∝ exp(−n(θ − ȳ)² / (2σ²))

(see Exercise 2 in the Consolidation Activities). The prior for θ is set to a N(μ, τ²) distribution, so that

π(θ) ∝ exp(−(θ − μ)² / (2τ²))

The posterior can then be obtained as

π(θ|y) ∝ exp(−n(θ − ȳ)² / (2σ²) − (θ − μ)² / (2τ²))

which, by completing the square in θ, is recognised as a Normal distribution with
mean (ȳτ² + μσ²/n)/(σ²/n + τ²) and variance (σ²/n)τ²/(σ²/n + τ²).
As with the previous example, note that the posterior mean can be written as a weighted average
between the prior mean μ and the MLE ȳ:

(1 − τ²/(σ²/n + τ²)) μ + (τ²/(σ²/n + τ²)) ȳ
Note that in the case of the Normal distribution the posterior mean is the same as the posterior median
and mode. Hence all previously mentioned Bayes estimators coincide with the one above.
In case we want to obtain a symmetric 100(1−α)% credible interval for θ, we can use the mean and the
standard deviation of the posterior (similarly to confidence intervals) to get
(ȳτ² + μσ²/n)/(σ²/n + τ²) ± z_{α/2} √[ (σ²/n)τ² / (σ²/n + τ²) ]
where z_{α/2} is the upper α/2 point (the 1−α/2 quantile) of a N(0,1). If we assume the improper prior
π(θ) ∝ 1 (which is the Jeffreys prior in this case), we get as posterior the N(ȳ, σ²/n). The corresponding
100(1−α)% credible interval will then be

ȳ ± z_{α/2} √(σ²/n)

which is numerically the same as the 100(1−α)% confidence interval. Their interpretation is, however,
different: the credible interval is a probability statement about θ given the observed data, whereas the
confidence interval refers to the long-run frequency with which intervals constructed in this way contain
the true θ.
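A minimal R sketch of these calculations (the values n = 20, ȳ = 1.3, σ² = 4, μ = 0, τ² = 1 are made-up assumptions for illustration):

n = 20; ybar = 1.3; sigma2 = 4                            # made-up data summary, known variance
mu = 0; tau2 = 1                                          # prior mean and variance for theta
post.var = (sigma2/n)*tau2/(sigma2/n + tau2)              # posterior variance
post.mean = (ybar*tau2 + mu*sigma2/n)/(sigma2/n + tau2)   # posterior mean
post.mean + c(-1, 1)*qnorm(0.975)*sqrt(post.var)          # 95% credible interval
ybar + c(-1, 1)*qnorm(0.975)*sqrt(sigma2/n)               # interval under the flat prior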
Linear Regression
Consider the multiple linear regression model that we first discussed in Block 2.
Using matrix algebra notation, define ϵ = (𝜖1, …, 𝜖𝑛)^T, β = (𝛽0, 𝛽1, 𝛽2, 𝛽3)^T and the
design matrix

X = ( 1  X_11  ⋯  X_p1
      1  X_12  ⋯  X_p2
      ⋮    ⋮          ⋮
      1  X_1n  ⋯  X_pn )
Then we can rewrite the model in matrix notation as
Y = Xβ + ϵ
where ϵ ∼ 𝑁𝑛 (0𝑛 ,𝜎 2 𝐼𝑛 ), with 𝑁𝑛 (⋅) denoting the multivariate Normal distribution of
dimension n, 0𝑛 being n-dimensional vector of zeros and 𝐼𝑛 being the identity matrix of dimension n.
We first need to specify a prior on the parameters θ = (β, 𝜎²). We will factorise the joint density
of θ as π(β, σ²) = π(β)π(σ²), where a-priori β is assigned a multivariate Normal distribution with mean
μ_0 and covariance matrix Ω_0, and σ² an Inverse-Gamma distribution with density

π(σ²) = (b_0^{a_0} / Γ(a_0)) (σ²)^{−a_0−1} exp(−b_0/σ²)

(the scale parameter is written b_0 here to avoid a clash with the intercept 𝛽0).
With standard, yet tedious, matrix algebra calculations that are beyond the scope of this course, one
can then show that the posterior can again be factorised in a similar way; in particular, the posterior
distribution of β is Normal with a mean vector 𝜇𝑛 that combines the prior mean 𝜇0 with the least
squares estimate.
Loosely speaking, one can note that as Ω_0 goes to infinity, 𝜇𝑛 converges to the MLE (𝑋^T X)^{−1} 𝑋^T y; in
other words, as the prior information vanishes the Bayes estimator is the same as the MLE. But if there
is some prior information, reflected through the prior mean 𝜇0 and the prior variance Ω_0, then the
Bayes estimator combines information from the data and the prior beliefs.
Screencast transcript
The second thing you need to do then is to assign a prior on the unknown lambda parameter, and we
have to choose a suitable distribution; this is the case where we assign a Gamma prior. We will assume
that, a priori, our uncertainty about lambda is reflected by the Gamma distribution, which has two
hyperparameters, alpha and beta. Typically the researcher has to choose them, but in order to derive a
general expression here we just denote them with the Greek letters alpha and beta, and we will give the
result in terms of these parameters at the end.
Again, you only need to write down the prior up to proportionality, and that saves a lot of trouble. That
means that you only need to include the terms that involve lambda; any term that has nothing to do with
lambda you can skip, remembering though to put the proportionality sign instead of the equality sign.
That's step number two. Step number three is to derive the posterior. As always, we know that the
posterior of lambda given x is proportional to the product of the likelihood and the prior, so we just write
that down, and then I'm going to substitute the likelihood and the prior with the expressions from the
previous slides.
The likelihood is this expression here and the prior is this one; that's all I did here. The next step is to
put together the lambdas and the exponentials. If we do that, we get lambda to the power alpha plus the
sum of the x_i minus one, and the exponential terms can be put together as e to the minus lambda times
(n plus beta). Now it is time for the standard trick. You have to ask: does this expression resemble the
density of a distribution that you know?
Actually, if we go back to where lambda was given a Gamma prior — remember it was just a Gamma of
alpha and beta — its pdf was lambda to the power of the first parameter, alpha, minus one, times e to the
minus lambda times the second parameter, beta. Now let's see where we ended up. Here we have lambda
to something minus one, and then e to the minus lambda times something. It is not very difficult to see
that this something and this something can be the parameters of a new Gamma posterior. That is exactly
how we derive it. Here it is: the derivation of the Gamma posterior in the Poisson likelihood-Gamma
prior conjugate model.
That was the first thing I wanted to talk about in Bayesian inference. The second thing I want to
talk about is the predictive distribution, in particular Bayesian forecasting. Essentially, in Bayesian
inference you do forecasting using the predictive distribution. What is the predictive distribution? We
denote it by f of y given x; y here represents the future observation that we haven't seen, and x represents
the training data. Obviously, we have a model for the data, and the predictive distribution is this integral:
f of y given theta is the density of the future observation, and pi of theta given x is the posterior of theta
based on the training data x.
Now, what do we know about the future observation y? The first thing that can express our uncertainty
is its density assuming that we knew theta; that is what f of y given theta does. Of course, this assumption
is not correct — we don't know theta. What Bayesians do is take a sort of weighted average: you can see
this integral as a weighted average where the weights are given by the posterior. The most probable
thetas, according to the posterior, get more weight and the less probable ones get less weight. This
averaging, which is what the integral does, gives you the predictive distribution.
I'm going to show you in the next slide an example of how to derive this predictive distribution. In
order to do this derivation, I'm going to use a special trick, which is very handy and which you can use
in other exercises. It's a neat way of calculating very difficult integrals without doing any calculus,
essentially just by remembering the formula of a probability density function. Here is the trick. Let's say
that lambda has a Gamma distribution with parameters a and b. Then, because the pdf f of lambda is a
pdf, its integral over the range of lambda is equal to one. That's the definition of a pdf.
What I'm going to do is substitute f of lambda with its exact expression — remember, this is the Gamma
distribution with parameters a and b, so this is the pdf of lambda. Then I take out of the integral
everything that does not involve lambda: the integral is with respect to lambda, so anything that doesn't
involve lambda can come out of it. I take this fraction out and bring it to the other side.
I then get this expression, and it tells me that this integral, which is by no means easy to compute
directly, can be expressed in that form. If you were to do it with calculus, you would have to do
integration by parts several times. That is something nice to avoid, and if you just remember that the pdf
of lambda integrates to one, you can immediately find the solution. This holds for all positive a and b.
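A small R sketch (the values a = 4.5 and b = 2 are arbitrary illustrations) that checks the identity described above, ∫_0^∞ λ^(a−1) e^(−bλ) dλ = Γ(a)/b^a, numerically:

a = 4.5; b = 2                                            # arbitrary positive values
integrate(function(l) l^(a - 1)*exp(-b*l), 0, Inf)$value  # numerical value of the integral
gamma(a)/b^a                                              # closed form from the Gamma pdf trick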
Now, let's go to the example. Let's derive the predictive distribution in a case where the likelihood is an
Exponential(lambda) distribution. If we assume a Gamma prior and do a similar calculation to the one
we did earlier, we get the posterior as a Gamma with parameters n plus alpha and n x-bar plus beta. We
have the posterior, we have the model f of y given lambda; how do we find the predictive distribution?
Remember that the future observation y is positive, so the range of y is important — don't forget to write
it down. What I'm doing now may look like a big integral that can seem a bit scary, but it's not. All I am
doing, if you go back to the earlier formula, is substituting f of y given theta — in this case lambda —
and the posterior.
In this integral, this part comes from the fact that the new y will be coming from an Exponential(lambda)
distribution — that's the pdf of an Exponential(lambda) — and this part here is just the posterior, which
is a Gamma with these parameters; this is just the pdf of that Gamma. Here it's important to write
everything with equality, not proportionality, because all the terms are used.
Next step: we have an integral over lambda. That means that all the terms that do not involve lambda
can be taken out of the integral. This factor here has no lambda, so I can take it out; I will also put
together the lambdas and the exponentials. Doing so gives me this expression: this factor comes out here,
the powers of lambda combine — n plus alpha minus one plus another one — and similarly for the
exponential terms. Where am I now? Well, I still have a difficult integral to calculate. What am I going
to do? I can use the trick from the previous slide. All I have to see is that if I set this quantity to be a and
this quantity to be b, then I am exactly in the situation of the trick.
For this a and b, I have exactly the same expression, so this integral will be equal to Gamma of a over b
to the power a; all I have to do is make sure I put the correct a and b in. If you put it together, you get
this somewhat complicated-looking expression, and we were able to derive it in just a few lines of
calculation. That concludes the purpose of this video: we were able to demonstrate the types of
calculation that we often encounter in Bayesian inference.
Self-test Activities
1. Let x=(x1,…,xn) be a random sample from the Exponential(λ) distribution. Set the prior
for λ to be a Gamma(α,β) and derive its posterior distribution. Then find a Bayes estimator
for λ and the predictive distribution for a new observation y, which again comes from the
Exponential(λ) model.
Solution
The joint density (likelihood) can be written as

f(x|λ) = ∏_{i=1}^n λ exp(−λx_i) = λ^n exp(−λ Σ_i x_i)

The prior is

π(λ) ∝ λ^{α−1} exp(−βλ)

The posterior is then proportional to

π(λ|x) ∝ λ^{n+α−1} exp(−(β + Σ_i x_i)λ)

which is recognised as a Gamma(n+α, β+Σ_i x_i) distribution. A Bayes estimator is provided by the
posterior mean, which in this case is equal to

(n + α) / (β + Σ_{i=1}^n x_i)

The predictive distribution (for y > 0) is

f(y|x) = ∫_0^∞ λ exp(−λy) π(λ|x) dλ = (n + α)(β + Σ_i x_i)^{n+α} / (β + Σ_i x_i + y)^{n+α+1}
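A hedged R sketch (the data vector, α = 2, β = 1 and the evaluation point y = 1.2 are made-up assumptions) that checks the closed-form predictive against a Monte Carlo approximation obtained by drawing λ from the posterior and then averaging the Exponential(λ) density:

set.seed(2)
x = c(0.8, 1.5, 0.3, 2.1, 0.6)                       # made-up exponential data
alpha = 2; beta = 1                                  # illustrative hyperparameters
a.post = length(x) + alpha; b.post = beta + sum(x)   # Gamma posterior parameters
y = 1.2                                              # point at which to evaluate the predictive
a.post*b.post^a.post/(b.post + y)^(a.post + 1)       # closed-form predictive density
lam = rgamma(1e6, a.post, b.post)                    # posterior draws of lambda
mean(dexp(y, rate = lam))                            # Monte Carlo estimate of the predictive density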
2. Let x=(𝑥1 ,…,𝑥𝑛 ) be a random sample from a N(0,𝜎 2 ) distribution. Set the prior for 𝜎 2 to be
IGamma(α,β) and derive its posterior distribution.
Solution
The joint density (likelihood) can be written as

f(x|σ²) ∝ (σ²)^{−n/2} exp(−Σ_i x_i² / (2σ²))

The prior is

π(σ²) ∝ (σ²)^{−α−1} exp(−β/σ²)

The posterior is then proportional to

π(σ²|x) ∝ (σ²)^{−(α+n/2)−1} exp(−(β + Σ_i x_i²/2)/σ²)

which is recognised as an IGamma(α + n/2, β + Σ_i x_i²/2) distribution.
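A minimal R sketch of sampling from this posterior (the data are simulated and the hyperparameters α = 2, β = 1 are assumptions for illustration), using the fact that if σ² is IGamma(a,b) then 1/σ² is Gamma(a, rate = b):

set.seed(3)
x = rnorm(30, 0, sd = 2)                     # simulated N(0, sigma2 = 4) data
alpha = 2; beta = 1                          # illustrative IGamma hyperparameters
a.post = alpha + length(x)/2                 # posterior shape
b.post = beta + sum(x^2)/2                   # posterior scale
sig2 = 1/rgamma(1e5, a.post, rate = b.post)  # posterior draws of sigma^2
mean(sig2)                                   # posterior mean estimate
quantile(sig2, c(0.025, 0.975))              # 95% credible interval for sigma^2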
3. A student uses Bayesian inference in modeling his IQ score x|θ which is N(θ,80). His prior is
N(110,120). After a score of x=98 the posterior becomes N(102.8,48).
a. Find a 95% confidence interval and a 95% credible interval, and comment on their
differences.
b. The student claims it was not his day and his genuine IQ is at least 105. So we want
to test 𝐻0 : θ ≥ 105 vs 𝐻1 :θ < 105.
Solution
a. The frequentist 95% confidence interval is x ± 1.96√80 = 98 ± 17.53, i.e. (80.47, 115.53), whereas the
Bayesian 95% credible interval is 102.8 ± 1.96√48 = 102.8 ± 13.58, i.e. (89.22, 116.38). The credible
interval is centred on the posterior mean (which is pulled towards the prior mean of 110) and is narrower,
since it combines prior and data information; it is interpreted as containing θ with probability 0.95 given
the data, whereas the confidence interval refers to the long-run coverage of the procedure.
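A quick R check of these two intervals (a sketch using only the numbers given in the question):

x = 98; sigma2 = 80                              # observed score and known variance
mupost = 102.8; varpost = 48                     # posterior mean and variance
x + c(-1, 1)*qnorm(0.975)*sqrt(sigma2)           # 95% confidence interval
mupost + c(-1, 1)*qnorm(0.975)*sqrt(varpost)     # 95% credible interval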
b. P(θ ≥ 105 | x) = P(Z ≥ (105 − 102.8)/√48) = 1 − Φ(0.3175) = 0.3754
Since this probability is less than 50% (in fact less than 38%), the claim of the student is rejected.
The Bayes factor against 𝐻0 is log𝐵10 (98)=1.244, indicating strong evidence against 𝐻0 .
The R code for this calculation is given below:
mupost = 102.8
sdpost = sqrt(48)
muprior = 110
sdprior = sqrt(120)
postH0 = 1 - pnorm(105, mupost, sdpost)     # posterior probability of H0: theta >= 105
postH1 = pnorm(105, mupost, sdpost)         # posterior probability of H1: theta < 105
priorH0 = 1 - pnorm(105, muprior, sdprior)  # prior probability of H0
priorH1 = pnorm(105, muprior, sdprior)      # prior probability of H1
logB10 = log(postH1/postH0) - log(priorH1/priorH0)  # log Bayes factor of H1 against H0
logB10
Consolidation Activities
Exercise 1
A large roll of magnetic tape needs to be checked for defects. An experiment is conducted in which, each time, 1 meter of
the tape is examined at random. The procedure is repeated 5 times and the numbers of defects
recorded are 2, 2, 6, 0 and 3 respectively. The researcher assumes a Poisson(λ) distribution for the
number of defects per meter. From previous experience, the beliefs of the researcher about λ can be expressed by a
Gamma distribution with mean and variance both equal to 3. Derive the posterior distribution that will be
obtained. What would be the expected mean and variance of the number of defects per meter of tape
after the experiment?
Solution
From the section on Bayesian inference examples (Poisson-Gamma) we get that, for a Poisson likelihood
with a Gamma(α,β) prior, the posterior is Gamma(α + Σ_i x_i, β + n). Matching the prior mean and
variance, α/β = 3 and α/β² = 3, gives β = 1 and α = 3. With Σ_i x_i = 13 and n = 5, the posterior is
therefore Gamma(3 + 13, 1 + 5) = Gamma(16, 6). The updated (posterior) mean and variance of λ, the
mean number of defects per meter, are 16/6 ≈ 2.67 and 16/36 ≈ 0.44 respectively.
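A short R sketch verifying these numbers (no assumptions beyond the exercise itself):

x = c(2, 2, 6, 0, 3)                 # observed defect counts
alpha = 3; beta = 1                  # from prior mean = variance = 3
a.post = alpha + sum(x)              # 16
b.post = beta + length(x)            # 6
c(a.post/b.post, a.post/b.post^2)    # posterior mean and variance of lambda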
Exercise 2
Let 𝑥 = (𝑥1, …, 𝑥𝑛) be a random sample from a N(θ, 𝜎²) distribution with 𝜎² known.
a. Show that the likelihood is proportional to

f(x|θ) ∝ exp(−(θ − x̄)² / (2σ²/n))

where x̄ = (1/n)Σ_{i=1}^n x_i and S² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)².
Solution
The joint density of the sample x is

f(x|θ) = ∏_{i=1}^n (2πσ²)^{−1/2} exp(−(x_i − θ)²/(2σ²)) ∝ exp(−Σ_{i=1}^n (x_i − θ)²/(2σ²))

Note that

Σ_{i=1}^n (x_i − θ)² = Σ_{i=1}^n (x_i − x̄)² + n(x̄ − θ)² = (n−1)S² + n(θ − x̄)²

and the first term does not depend on θ. Hence f(x|θ) ∝ exp(−n(θ − x̄)²/(2σ²)) = exp(−(θ − x̄)²/(2σ²/n)).
b. Set the prior for θ to be N(μ,𝜏 2 ) and derive its posterior distribution. (You can use the above
result)
Solution
The prior for θ is set to be N(μ, 𝜏²). Hence we can write:

π(θ) ∝ exp(−(θ − μ)² / (2τ²))

Using the result of part (a), the posterior is then proportional to

π(θ|x) ∝ exp(−(θ − x̄)²/(2σ²/n) − (θ − μ)²/(2τ²))

which, by completing the square in θ (as in the Normal-Normal section), is a Normal distribution with
mean (x̄τ² + μσ²/n)/(σ²/n + τ²) and variance (σ²/n)τ²/(σ²/n + τ²).
Exercise 3
Consider a sample 𝑥 = (𝑥1, …, 𝑥𝑛) of independent random variables, identically distributed from the
Geometric(θ) distribution, i.e. the probability mass function for each 𝑥𝑖 is

f(x_i|θ) = θ^{x_i}(1 − θ),  for x_i = 0, 1, 2, …
a. Choose a suitable proper prior for θ and derive the corresponding posterior. Justify your
choice of prior.
Solution
The likelihood for the sample is given by

f(x|θ) = ∏_{i=1}^n θ^{x_i}(1 − θ) = θ^{nx̄}(1 − θ)^n

As 0 < θ < 1, a Beta(α,β) prior is a suitable proper prior, giving the posterior

π(θ|x) ∝ θ^{nx̄+α−1}(1 − θ)^{n+β−1}, i.e. a Beta(nx̄ + α, n + β) distribution.

In our case there is no prior information available, so the prior parameters (α, β) can be chosen such
that the distribution is as flat as possible. Setting α = β = 1 corresponds to the Uniform(0,1)
distribution and gives the Beta(nx̄ + 1, n + 1) posterior.
b. Derive the Jeffreys prior for θ. Use it to obtain the corresponding posterior distribution.
Solution
To find the Jeffreys prior we write

log f(x|θ) = nx̄ log θ + n log(1 − θ)

so that −∂² log f(x|θ)/∂θ² = nx̄/θ² + n/(1 − θ)², and taking expectations (using E(x_i) = θ/(1 − θ)) gives
the Fisher information I(θ) = n/(θ(1 − θ)²). The Jeffreys prior is therefore
π(θ) ∝ I(θ)^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1}, and the corresponding posterior is

π^J(θ|x) ∝ θ^{nx̄−1/2}(1 − θ)^{n−1}, i.e. a Beta(nx̄ + 1/2, n) distribution.
c. Find a Normal approximation to the posterior based on the mode of the likelihood and the Fisher
information.
Solution
We will need the mode of the likelihood, 𝜃𝑀, and the Fisher information. The latter was derived in
part (b) and is equal to n/(θ(1 − θ)²). For the former we set

∂ log f(x|θ)/∂θ = nx̄/θ − n/(1 − θ) = 0

which gives 𝜃𝑀 = x̄/(1 + x̄). The requested Normal approximation to the posterior has mean x̄/(1 + x̄)
and variance 𝜃𝑀(1 − 𝜃𝑀)²/n.
d. Find the Bayes estimator corresponding to the posterior mean for the posteriors in parts (a),
(b) and (c).
Solution
For part (a) it is equal to

(nx̄ + 1) / (nx̄ + n + 2)

whereas for part (b) it is

(nx̄ + 1/2) / (nx̄ + n + 1/2)

and for (c) it is equal to

x̄ / (x̄ + 1)
e. Find the predictive distribution for a future observation y from the same Geometric(θ) model, based
on the posterior from part (b).
Solution
Note that from the pdf of the Beta distribution we get, for all positive α, β, that

∫_0^1 θ^{α−1}(1 − θ)^{β−1} dθ = B(α, β) = Γ(α)Γ(β)/Γ(α + β)

Consider the posterior 𝜋^J(θ|x) corresponding to the Jeffreys prior. The predictive distribution for
y = 0, 1, 2, … is

f(y|x) = ∫_0^1 θ^y(1 − θ) 𝜋^J(θ|x) dθ = B(nx̄ + y + 1/2, n + 1) / B(nx̄ + 1/2, n)
Exercise 4
Let x be an observation from a Binomial (n,θ) and assign the prior for θ to be the Beta(α,β).
a. Find a Normal approximation to the posterior based on the mode and the Hessian matrix
of π∗(θ|x)=f(x|θ)π(θ).
Solution
We can write

log π*(θ|x) = log f(x|θ) + log π(θ) = (x + α − 1) log θ + (n − x + β − 1) log(1 − θ) + constant

so that

∂ log π*(θ|x)/∂θ = (x + α − 1)/θ − (n − x + β − 1)/(1 − θ)

and

∂² log π*(θ|x)/∂θ² = −(x + α − 1)/θ² − (n − x + β − 1)/(1 − θ)²

which is negative for x ≠ 0 and x ≠ n, so the mode is at 𝜃𝑀 = (x + α − 1)/(n + α + β − 2). The Hessian is

H(θ) = (x + α − 1)/θ² + (n − x + β − 1)/(1 − θ)²

The Normal approximation to the posterior for θ will have mean 𝜃𝑀 = (x + α − 1)/(n + α + β − 2) and
variance H(𝜃𝑀)^{−1}.
b. Assume x=15, n=20 and α=β=2. Write an R script to provide a graph of true posterior density
and the normal approximation derived in the previous part.
c. Repeat with x=75, n=100.
Solution
R code for parts (b) and (c)
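The code below is a possible sketch of such a script (the helper function name plot.approx is ours, not from the course materials):

# Compare the true Beta posterior with its Normal approximation based on the
# mode and Hessian of pi*(theta|x) = f(x|theta)pi(theta)
plot.approx = function(x, n, alpha, beta){
  mode = (x + alpha - 1)/(n + alpha + beta - 2)                   # posterior mode
  H = (x + alpha - 1)/mode^2 + (n - x + beta - 1)/(1 - mode)^2    # Hessian at the mode
  s = sqrt(1/H)                                                   # approximate posterior sd
  theta = seq(0.001, 0.999, length.out = 1000)
  plot(theta, dbeta(theta, alpha + x, beta + n - x), type = "l",
       ylab = "density", main = paste("x =", x, ", n =", n))      # true Beta posterior
  lines(theta, dnorm(theta, mode, s), lty = 2)                    # Normal approximation
  legend("topleft", c("true posterior", "normal approx."), lty = c(1, 2))
}
par(mfrow = c(1, 2))
plot.approx(15, 20, 2, 2)   # part (b)
plot.approx(75, 100, 2, 2)  # part (c)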
Exercise 5
Let x=(𝑥 1 ,…,𝑥 𝑛 ) be a r.s. from a Poisson(λ) and assign a Gamma(α,β) prior to λ
a. Find a Normal approximation to the posterior based on the mode and the Hessian matrix
of π*(λ|x) = f(x|λ)π(λ).
Solution
We can write

log π*(λ|x) = (α + Σ_i x_i − 1) log λ − (n + β)λ + constant

so that

∂ log π*(λ|x)/∂λ = (α + Σ_i x_i − 1)/λ − (n + β)

and

∂² log π*(λ|x)/∂λ² = −(α + Σ_i x_i − 1)/λ²

which is negative for α + Σ_i x_i > 1, implying the mode is at 𝜆𝑀 = (α + Σ_i x_i − 1)/(n + β) in this case.
The Hessian is

H(λ) = (α + Σ_i x_i − 1)/λ²

The Normal approximation to the posterior for λ will have mean 𝜆𝑀 and variance H(𝜆𝑀)^{−1}.
b. Assume Σ_i x_i = 95, n = 50 and α = β = 1. Write an R script to calculate the posterior
probability of λ being less than 1.7, using both the true posterior density and the normal
approximation derived in the previous part.
c. Using R, provide a 95% credible set for λ using the true posterior.
Solution
Code for finding the probability and the credible interval:
sumx=95
n=50
alpha=1
beta=1
mu=(alpha+sumx-1)/(n+beta) # mode. substituted alpha=beta=1
S=sqrt(mu*mu/(alpha+sumx -1)) # square root of inverse hessian at the mode
pgamma(1.7,1+sumx,n+1) # true requested probability
pnorm(1.7,mu,S) # probability from normal approximation
N=1000000
rx=rgamma(N, alpha+sumx, n+beta) # Generate N posterior samples
quantile(rx,c(0.025,0.975)) # 2.5 and 97.5 posterior sample percentiles
Exercise 6
Consider the problem raised in the section 'Introductory Example' and provide estimates of the
parameters of the relevant linear regression model.
Solution
The additional information states that changes in sales of more than 50 units per $1,000 spent are
extremely unlikely. Since sales in this dataset are recorded in thousands of units, we would expect the
regression coefficients of the three predictors to lie between -0.05 and 0.05. A prior distribution that
reflects this is the N(0, 0.025²) for each coefficient, as it implies that the coefficient lies in that interval
with probability of roughly 95%. We complete the prior specification by assuming independence between
the β parameters a-priori. For the constant 𝛽0 we can assign a very large variance, say 10^6, to reflect our
ignorance about its value.
R code for fitting the model is given below. It will be necessary to install the R packages
'MCMCpack', 'MASS' and 'coda'. Note that the package requires the precision, i.e. the inverse of
the variance, rather than the variance. In our case this will be 10^{-6} for 𝛽0 and around 1540
(approximately 1/(0.05/1.96)²) for the remaining β's.
library(MCMCpack) # initialise the MCMCpack package
advertising=read.csv("advertising.csv",header=T,na.strings="?") # load data
model.freq=lm(sales~TV+radio+newspaper,data=advertising) # fit frequentist linear regression
summary(model.freq)
# set prior parameters
mu=c(0,0,0,0)                               # prior means for the beta's
Precision=diag(c(1e-6,1540,1540,1540))      # prior precisions (1/variance)
# fit Bayesian linear regression
model.bayes=MCMCregress(sales~TV+radio+newspaper, data=advertising,b0=mu,B0=Precision)
summary(model.bayes)
We note that the coefficient corresponding to radio is the largest under frequentist linear regression,
around 0.188. Bayesian linear regression provides a slightly smaller value, which is due to the prior
distribution pulling the estimate towards its mean of 0. The results are otherwise similar.