ParameterEstimation
Himanshu Yadav
2024-03-05
Contents
1 The goal
1.1 The challenge
2 Parameter estimation using analytically-derived posterior
3 Parameter estimation using posterior simulation algorithms
4 Grid approximation
4.1 Idea
4.2 Method
4.3 Implementation
4.4 Drawbacks
5 Monte Carlo integration
5.1 Idea
5.2 Method
5.3 Implementation
5.4 Complications
6 Markov chain Monte Carlo (MCMC)
6.1 Idea
6.2 Method
6.3 Implementation
7 Recap: Metropolis algorithms
8 Gibbs sampling
8.1 Idea
8.2 Method
8.3 Implementation
8.4 Limitations
9 Hamiltonian Monte Carlo (HMC)
9.1 Idea
9.2 Hamiltonian dynamics
9.3 Using Hamiltonian dynamics for MCMC
9.4 Step-by-step procedure
9.5 Implementation
1 The goal
Given the observed data y, the likelihood function L(θ |y), and the prior distribution p(θ ), we want
to estimate the posterior distribution of the parameter θ. Not just that, we need samples from the
posterior distribution of θ.
What do we mean by samples from the posterior distribution of a parameter?
[Figure: posterior density of θ; the density (y-axis) is highest for θ between roughly 1.5 and 2.5.]
The above graph depicts the posterior density function of the parameter θ. Samples from the posterior would be a set of values of θ that have relatively high posterior density. For example, in the above graph, values of θ between 1.5 and 2.5 have higher posterior density than values above 2.5 or below 1.5. Thus, we need a ‘sampler’ that somehow gives us many values from the high posterior density regions and fewer values from the low posterior density regions.
Consider the following vector ‘theta_samples’. It contains 50 samples from the posterior distribution of θ. This is what it looks like.
theta_samples
A histogram of these values should look similar to the posterior density graph.
hist(theta_samples,breaks = 8)
[Figure: histogram of the 50 values in theta_samples (frequency on the y-axis).]
2 Parameter estimation using analytically-derived posterior
If the posterior density function p(θ|y) belongs to the same probability distribution family as the prior distribution p(θ), the prior is called a conjugate prior for the likelihood function L(θ|y).
In these cases, where you (can) specify conjugate priors for all the free parameters of the model,
you can derive a posterior distribution analytically and you can easily sample from this distribution.
Some examples of the conjugate distributions are:
3. Normal likelihood — Normal prior on the mean; Inverse gamma prior on the variance
4. Multivariate normal likelihood — Multivariate normal prior on the mean; Inverse wishart prior on
the variance-covariance matrix
Example. A beta-binomial model
x ∼ Binomial(N, θ)
θ ∼ Beta(α, β)
Suppose the observed data from n successive experiments are x_1, x_2, x_3, ..., x_n.
The analytically derived posterior of θ given the data x_{1:n} will be a Beta distribution with parameters α* and β*, where
α* = α + ∑_{i=1}^{n} x_i and β* = β + ∑_{i=1}^{n} (N − x_i)
We can write the sampling statement
θ | x_{1:n} ∼ Beta( α + ∑_{i=1}^{n} x_i , β + ∑_{i=1}^{n} (N − x_i) )
Let us use the above statement to draw samples from the posterior distribution of θ.
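The code that generates these samples is not shown above; a minimal sketch of how they could be drawn, assuming hypothetical data x, number of trials N, and prior parameters (none of these values come from the original text):
# A minimal sketch with assumed data and prior values (not the original ones)
set.seed(1)
N <- 20                                  # trials per experiment (assumed)
x <- rbinom(10, size = N, prob = 0.4)    # hypothetical observed successes
alpha_prior <- 1; beta_prior <- 1        # Beta prior parameters (assumed)
# Draw independent samples from the analytically derived Beta posterior
theta_samples <- rbeta(10000,
                       shape1 = alpha_prior + sum(x),
                       shape2 = beta_prior + sum(N - x))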
hist(theta_samples)
[Figure: histogram of theta_samples (frequency on the y-axis).]
plot(density(theta_samples))
[Figure: kernel density estimate of theta_samples.]
3 Parameter estimation using posterior simulation algorithms
When we do not have an analytically-derived posterior to sample from, we need to find other solutions for parameter estimation.
Four classes of solutions:
2. Estimate the denominator using independent samples from a continuous probability distribution
3. Draw dependent samples from the posterior based on their relative posterior densities (the method
uses only the numerator term of Bayes’ rule)
The methods (1) and (2) aim to approximate the precise posterior density for a given value of the parameter θ, but you cannot use them to draw samples from the posterior. These methods are mostly used for estimating the marginal likelihood; the marginal likelihood estimate is needed for quantifying evidence for a model.
The methods (3) and (4) are commonly used for drawing samples from the posterior.
I will introduce methods (1) and (2) first, because they provide some foundational ideas for sampling algorithms.
4 Grid approximation
Goal: we want to approximate the exact posterior densities for a set of parameter values.
4.1 Idea
• Discretise a continuous parameter
4.2 Method
• Suppose we want to estimate the marginal likelihood ∫ p(y|θ) p(θ) dθ.
• Divide the parameter space of θ into n equally spaced points, v_1, v_2, ..., v_n, called grid points
• For each grid point, v_i, calculate the likelihood p(y|v_i) and the prior density p(v_i)
• Add the product of the likelihood and the prior from all grid points to approximate the marginal likelihood, p(y) ≈ ∑_{i=1}^{n} p(y|v_i) p(v_i)
• Estimate the posterior density at each grid point, v_i, as p(y|v_i) p(v_i) / ∑_{i=1}^{n} p(y|v_i) p(v_i)
Note: If θ represents a set of m parameters such that θ = {θ_1, θ_2, ..., θ_m}, then divide each parameter space, θ_j, into n equally spaced points, v_{j,1}, ..., v_{j,n}, and use all possible combinations to create n^m grid points.
4.3 Implementation
Example. A normal model with unknown mean and known variance
Suppose you are given 10 independent and identically distributed data points that are assumed to
come from a Normal distribution with mean µ and variance 4. Let yi be the ith data point,
y_i ∼ Normal(µ, σ² = 4)
µ ∼ Normal(µ0 = 0, σ0² = 9)
We can derive the posterior distribution analytically,
µ | y ∼ Normal( (σ²µ0 + σ0² ∑_{i=1}^{n} y_i) / (σ² + nσ0²) , (1/σ0² + n/σ²)^(−1) )
where n is the number of data points.
#Analytical posterior
sigma = 2 # Known standard deviation of normal distribution
mu_prior = 0 # Mean of prior distribution on mu
sigma_prior = 3 # Standard deviation of prior distribution on mu
n = 10 # no. of observations
# y is the vector of 10 observations (its construction is not shown)
analytical_mu_post <- rnorm(10000,
mean=(((sigma^2)*(mu_prior))+
((sigma_prior^2)*sum(y)))/
(sigma^2 + (n*(sigma_prior^2))),
sd=sqrt(1/((1/(sigma_prior^2))+(n/(sigma^2))))) # sd = sqrt of the posterior variance
hist(analytical_mu_post,freq = FALSE)
[Figure: histogram of analytical_mu_post, plotted on the density scale.]
# Grid approximation
length(mu_grid)
## [1] 1000
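The code that builds mu_grid and df.posterior is not shown above; a minimal sketch of what it could look like, assuming 1000 equally spaced grid points on [−15, 15] (the grid range is my own choice):
# A sketch of the grid approximation (assumed grid range)
mu_grid <- seq(-15, 15, length.out = 1000)             # grid points for mu
prior_grid <- dnorm(mu_grid, mu_prior, sigma_prior)    # prior density at each grid point
lik_grid <- sapply(mu_grid,
                   function(mu) prod(dnorm(y, mu, sigma)))  # likelihood at each grid point
unnorm <- lik_grid * prior_grid
df.posterior <- data.frame(mu = mu_grid,
                           posterior = unnorm / sum(unnorm)) # normalised over the grid
plot(df.posterior$mu, df.posterior$posterior, type = "l")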
[Figure: grid-approximated posterior, df.posterior$posterior plotted against df.posterior$mu.]
4.4 Drawbacks
• The curse of dimensionality
– The number of computations increases exponentially with an increase in the number of parameters
– Given m parameters, if we divide each parameter space into n equally spaced points, we will create n^m grid points
– Say we have to estimate 4 parameters and we discretize each parameter space into 1000 points; the number of grid points will be 1000^4
5 Monte Carlo integration
5.1 Idea
The expectation of a function f(θ) can be approximated using independent samples from the probability density function of θ, i.e., p(θ):
∫ f(θ) p(θ) dθ ≈ (1/n) ∑_{i=1}^{n} f(θ̃_i), where θ̃_i ∼ p(θ)
The expression (1/n) ∑_{i=1}^{n} f(θ̃_i), θ̃_i ∼ p(θ), means that we draw n independent samples from the distribution p(θ) and calculate the average of the function f applied to all n samples.
– ML = E[p(y|θ)] = ∫ p(y|θ) p(θ) dθ
– ML ≈ (1/n) ∑_{i=1}^{n} p(y|θ̃_i), θ̃_i ∼ p(θ)
* This means you are drawing n independent samples from the prior distribution and calculating the average of the likelihood of all n samples
– Alternatively, draw the samples from an importance density g(θ) and use the weighted estimate ML ≈ (1/n) ∑_{i=1}^{n} p(y|θ̃_i) p(θ̃_i) / g(θ̃_i), θ̃_i ∼ g(θ)
– The importance density g(θ) should resemble the posterior distribution and have fatter tails than the posterior distribution
– When is it useful?
* When the posterior distribution is peaked (too narrow) relative to the prior
* In this situation, most of the samples from the prior will have near-zero likelihood
5.2 Method
• Draw n independent samples θ̃1 , θ̃2 , .., θ̃n from the prior distribution p(θ )
5.3 Implementation
Example. A beta-binomial model
Suppose n is the sample size, k is the number of successes,
k|n, θ ∼ Binomial (n, θ )
θ ∼ Beta( a, b)
We can derive the marginal likelihood, ML, analytically,
ML = C(n, k) · B(k + a, n − k + b) / B(a, b),
where C(n, k) is the binomial coefficient and B(·, ·) is the Beta function. Suppose
k = 2, n = 10, a = 1, b = 1; then
ML = 1/(10 + 1) ≈ 0.0909
and the posterior distribution of θ will be
θ | k = 2, n = 10 ∼ Beta(3, 9)
Let us estimate the marginal likelihood and the posterior using a Monte Carlo estimator.
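The code that builds df.estimate is not shown; a minimal sketch of a Monte Carlo estimator consistent with the line below (the number of prior draws is my own choice):
# A sketch of the Monte Carlo estimator (assumed number of draws)
k <- 2; n <- 10; a <- 1; b <- 1
nsim <- 10000
theta_tilde <- rbeta(nsim, a, b)          # independent draws from the Beta(a, b) prior
df.estimate <- data.frame(theta = theta_tilde,
                          likelihood = dbinom(k, n, theta_tilde))  # likelihood of each draw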
# Marginal likelihood
ML <- mean(df.estimate$likelihood)
ML
## [1] 0.08987182
5.4 Complications
• But what about the posterior distribution?
• You have estimated the marginal likelihood ML, you know the likelihood function, you know the
prior distribution; can you draw independent samples from the posterior distribution?
– Not directly! But you can compute posterior density for a parameter value
– Apply rejection sampling to draw independent samples from the posterior:
* Draw a sample θ* from a uniform distribution Uniform(a, b), where [a, b] is the ‘possible’ range of parameter values having non-zero posterior density
* Compute the posterior density for the sample θ*; call it pd*
* Sample t from a uniform distribution, t ∼ Uniform(0, 1)
* If pd* > t, accept θ* as a sample from the posterior distribution
– Drawbacks: most proposals fall in low posterior density regions and get rejected, so the sampler becomes inefficient when the posterior is concentrated in a small part of [a, b]
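The code that produced estimated_posterior below is not shown; a minimal sketch of the rejection sampler just described, applied to the beta-binomial example, could look like this (the number of proposals and the bound M are my own choices, and t is drawn from Uniform(0, M) rather than Uniform(0, 1) so that the scheme remains valid when the posterior density exceeds 1):
# A sketch of the rejection sampler (assumed M and number of proposals)
M <- 4                                   # assumed upper bound on the posterior density
estimated_posterior <- c()
for(s in 1:20000){
  theta_star <- runif(1, 0, 1)           # proposal from the 'possible' range [0, 1]
  # Posterior density: likelihood * prior / estimated marginal likelihood
  pd_star <- dbinom(k, n, theta_star) * dbeta(theta_star, a, b) / ML
  if(pd_star > runif(1, 0, M)){          # accept with probability pd_star / M
    estimated_posterior <- c(estimated_posterior, theta_star)
  }
}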
hist(estimated_posterior,freq = FALSE)
[Figures: histogram of estimated_posterior, histogram of analytical_posterior, and their overlaid density curves over theta.]
• We have not estimated the absolute posterior density; but we are able to estimate relative posterior
density of the samples quite well
– Relative posterior density is the basis of dependent sampling in Markov chain Monte Carlo
algorithms
Unresolved issues:
• The parameter space explored by the estimator depends on the proposal distribution, i.e., the prior distribution or the importance distribution
– Ideally, we want most of the samples from non-zero posterior density regions
6 Markov chain Monte Carlo (MCMC)
6.1 Idea
• Explore the parameter space in such a way that the histogram of the drawn samples reproduces the target distribution
– Do not care about absolute posterior density
– Draw zero samples from zero posterior density regions
– Draw relatively more samples from higher posterior density regions
• The collection of samples from one iteration to the next is a Markov process.
– Evolution of the chain only depends on the current position; past samples cannot be used to
determine new positions in parameter space
• A proposal sample is evaluated based on its relative posterior density
6.2 Method
Suppose we want to estimate the posterior distribution, p(θ |y),
p(θ|y) = p(y|θ) p(θ) / p(y)
We are going to generate a chain of n samples from the posterior distribution of θ.
• for i = 1 to i = n:
• The proposal distribution q proposes a new parameter value conditional on the current state of the
chain.
– A commonly used proposal distribution is a normal distribution with mean θ_i and some variance σ², such that θ* ∼ Normal(θ_i, σ²)
– σ is called the step-size parameter and it is critical for the algorithm to work
• The chain will converge to the target distribution if the proposal distribution q has the following three properties:
– Rejection rate increases with increase in the value of the step-size parameter
• MCMC algorithms may differ in the choice of proposal distribution or in methods to evaluate the
relative posterior density of the samples
6.3 Implementation
Example. A beta-binomial model
Suppose n is the sample size, k is the number of successes,
k|n, θ ∼ Binomial (n, θ )
θ ∼ Beta( a, b)
We can derive the marginal likelihood, ML, analytically,
ML = C(n, k) · B(k + a, n − k + b) / B(a, b)
Suppose
k = 2, n = 10, a = 1, b = 1; then
ML = 1/(10 + 1) ≈ 0.0909
and the posterior distribution of θ will be
θ | k = 2, n = 10 ∼ Beta(3, 9)
Let us estimate the posterior distribution using a simple Metropolis-Hastings sampler.
k <- 2
n <- 10
a <- 1
b <- 1
# Markov chain
nsamp <- 50000
theta_chain <- rep(NA,nsamp)
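The body of the sampling loop is not shown above; here is a minimal sketch of a random-walk Metropolis step consistent with the quantities used later in this section (theta_chain and the reject counter). The starting value and the proposal step size are my own guesses:
# A sketch of the sampling loop (assumed starting value and step size)
theta_chain[1] <- 0.5
step <- 0.1
reject <- 0
i <- 1
while(i < nsamp){
  # Propose a new value conditional on the current state of the chain
  proposal_theta <- rnorm(1, theta_chain[i], step)
  if(proposal_theta > 0 && proposal_theta < 1){
    # Relative posterior density: only the numerator of Bayes' rule is needed
    post_new  <- dbinom(k, n, proposal_theta) * dbeta(proposal_theta, a, b)
    post_prev <- dbinom(k, n, theta_chain[i]) * dbeta(theta_chain[i], a, b)
    if(min(post_new/post_prev, 1) > runif(1)){
      theta_chain[i+1] <- proposal_theta    # accept the proposal
      i <- i + 1
    }else{
      reject <- reject + 1                  # reject and propose again
    }
  }
}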
hist(theta_chain)
[Figure: histogram of theta_chain.]
[Figure: overlaid posterior densities of analytical_posterior and theta_chain.]
plot(theta_chain)
[Figure: trace plot of theta_chain against the iteration index.]
# Markov chain
nsamp <- 50000
theta_chain <- rep(NA,nsamp)
[Figure: histogram of theta_chain.]
ggplot(melt(posteriors),aes(x=value,colour=variable))+
geom_density(size=1.2)+theme_bw()+
xlab("theta")+theme(legend.title = element_blank(),
legend.position = "top")
[Figure: overlaid posterior densities of analytical_posterior and theta_chain.]
plot(theta_chain)
[Figure: trace plot of theta_chain.]
• The samples are highly correlated; the chain varies slowly around the mean
reject*100/(reject+nsamp)
## [1] 0.5113716
The rejection rate is too low. We should target a rejection rate of at least 44% (for less than 5 parameters).
# Markov chain
nsamp <- 50000
theta_chain <- rep(NA,nsamp)
[Figure: histogram of theta_chain.]
[Figure: overlaid posterior densities of analytical_posterior and theta_chain.]
plot(theta_chain)
[Figure: trace plot of theta_chain.]
reject*100/(reject+nsamp)
## [1] 60.22022
# Markov chain
nsamp <- 50000
theta_chain <- rep(NA,nsamp)
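# The initialisation and loop header were not shown in the original; the
# following lines are an assumed reconstruction (starting value and step size
# are guesses)
theta_chain[1] <- 0.5   # assumed starting value
step <- 0.1             # assumed proposal step size
ML <- 0
reject <- 0
i <- 1
while(i<nsamp){
#Sample from proposal distribution
proposal_theta <- rnorm(1,theta_chain[i],step)
if(proposal_theta>0 && proposal_theta<1){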
# Compute prior*likelihood
post_new <- dbinom(k,n,proposal_theta)*dbeta(proposal_theta,a,b)
post_prev <- dbinom(k,n,theta_chain[i])*dbeta(theta_chain[i],a,b)
#Compute Hastings ratio
Hastings_ratio <- (post_new*dnorm(theta_chain[i],proposal_theta,step))/
(post_prev*dnorm(proposal_theta,theta_chain[i],step))
p_str <- min(Hastings_ratio,1) # probability of acceptance
if(p_str>runif(1,0,1)){
theta_chain[i+1] <- proposal_theta
lkl <- dbinom(k,n,proposal_theta)*dbeta(proposal_theta,a,b)/
dnorm(proposal_theta,theta_chain[i],step)
ML <- ML+lkl
i <- i+1
}else{
reject <- reject+1
}
}
}
estimated_ML <- ML/nsamp
estimated_ML
## [1] 0.09208308
# For comparison, the analytical value 1/(n + 1):
## [1] 0.09090909
[Figure: histogram of the observed data y.]
You assume that the data come from a lognormal distribution with parameters µ and σ.
Priors:
µ ∼ Normal(10, 6)
σ ∼ Normal+(0, 2)
Let us estimate the posterior distributions of µ and σ using our simple Metropolis-Hastings sampler.
# Markov chain
nsamp <- 6000
mu_chain <- rep(NA,nsamp)
sigma_chain <- rep(NA,nsamp)
# (the sampling loop is not shown here; a complete two-parameter
#  Metropolis-Hastings loop appears in the next example)
[Figure: posterior density of mu_chain, concentrated around 6.0.]
ggplot(posteriors[-(1:2000),],aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")
[Figure: posterior density of sigma_chain, concentrated around 0.5.]
#Rejection rate
reject*100/(reject+nsamp)
## [1] 93.49939
#Chain inspection
plot(mu_chain)
[Figure: trace plot of mu_chain.]
plot(sigma_chain)
[Figure: trace plot of sigma_chain.]
#Zooming in
plot(mu_chain[1:500])
[Figure: trace plot of the first 500 samples of mu_chain.]
plot(sigma_chain[1:500])
[Figure: trace plot of the first 500 samples of sigma_chain.]
[Figure: histogram of the observed data y.]
This time, assume the data come from a normal distribution with mean µ and variance σ², with a Normal prior on µ (mean 1000, standard deviation 100) and an InverseGamma(21, 2000) prior on σ², as used in the code below. Let us estimate the posterior distributions of µ and σ² using our simple Metropolis-Hastings sampler.
# Markov chain
nsamp <- 8000
mu_chain <- rep(NA,nsamp)
sigma_chain <- rep(NA,nsamp)
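# The starting values, step size, counters, and package imports were not shown;
# the lines below are assumed (dinvgamma() is from the invgamma package,
# rtruncnorm()/dtruncnorm() from the truncnorm package)
library(truncnorm)
library(invgamma)
mu_chain[1] <- mean(y)        # assumed starting values
sigma_chain[1] <- var(y)
step <- 10                    # assumed proposal step size
reject <- 0
i <- 1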
while(i<nsamp){
#Sample from proposal distribution
proposal_mu <- rnorm(1,mu_chain[i],step)
proposal_sigma <- rtruncnorm(n=1,mean=sigma_chain[i],sd=step,a=0)
# Compute prior*likelihood
post_new <- sum(dnorm(y,proposal_mu,sqrt(proposal_sigma),log = TRUE))+
dnorm(proposal_mu,1000,100,log = TRUE)+
dinvgamma(proposal_sigma,21,2000,log=TRUE)
post_prev <- sum(dnorm(y,mu_chain[i],sqrt(sigma_chain[i]),log = TRUE))+
dnorm(mu_chain[i],1000,100,log = TRUE)+
dinvgamma(sigma_chain[i],21,2000,log=TRUE)
#Compute Hastings ratio
Hastings_ratio <-
exp((post_new+dnorm(mu_chain[i],proposal_mu,step,log=TRUE)+
log(dtruncnorm(x=sigma_chain[i],mean=proposal_sigma,sd=step,a=0)))-
(post_prev+dnorm(proposal_mu,mu_chain[i],step,log=TRUE)+
log(dtruncnorm(x=proposal_sigma,mean=sigma_chain[i],sd=step,a=0))))
p_str <- min(Hastings_ratio,1) # probability of acceptance
if(p_str>runif(1,0,1)){
mu_chain[i+1] <- proposal_mu
sigma_chain[i+1] <- proposal_sigma
i <- i+1
}else{
reject <- reject+1
}
}
ggplot(posteriors[-(1:2000),],aes(x=mu_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("mu")+
theme(legend.title = element_blank(),
legend.position = "top")+
geom_vline(xintercept=800,size=1.5,color="red")
[Figure: posterior density of mu_chain with a vertical reference line at 800.]
ggplot(posteriors[-(1:2000),],aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
geom_vline(xintercept = 100,size=1.5,color="red")
[Figure: posterior density of sigma_chain with a vertical reference line at 100.]
#Rejection rate
reject*100/(reject+nsamp)
## [1] 87.30219
Key features
Possible improvements
– Gibbs sampling
8 Gibbs sampling
8.1 Idea
Suppose the data y come from a model having two parameters θ1 and θ2 ,
y ∼ f ( θ1 , θ2 )
Our goal is to estimate the posterior distributions for θ1 and θ2 . We can write the joint posterior for
θ1 and θ2 using Bayes’ rule,
p(θ1, θ2 | y) = p(y|θ1, θ2) p(θ1) p(θ2) / p(y)
We cannot analytically derive the posterior distribution p(θ1 , θ2 |y) because we cannot solve the
denominator p(y).
But in some cases, we can derive the conditional posterior distributions
p(θ1 |θ2 , y) and p(θ2 |θ1 , y)
The idea is to sample from the conditional posteriors of θ1 and θ2 using a Markov process.
8.2 Method
• In each iteration, sample from the conditional posterior of θi, where the known values of the other parameters θ−i come from the current state of the Markov chain
Steps:
• Set θ1 = u0 and θ2 = v0
• In iteration t + 1, draw u_{t+1} from p(θ1 | θ2 = v_t, y), and then draw v_{t+1} from p(θ2 | θ1 = u_{t+1}, y)
• The sequence (u0, v0), (u1, v1), (u2, v2), ... satisfies the property of being a Markov chain.
8.3 Implementation
Example. A Normal model with semi-conjugate priors
You are given 500 data points, y1 , y2 , ..., y500 that are assumed to come from a normal distribution
with mean µ and variance σ2 ,
Let yi be ith data point,
y_i ∼ Normal(µ, σ²)
µ ∼ Normal(µ0, σ0²)
σ² ∼ InverseGamma(α, β)
We can derive the conditional posterior distributions for µ and σ²,
µ | σ², y ∼ Normal( (nσ0²ȳ + σ²µ0) / (nσ0² + σ²) , σ0²σ² / (nσ0² + σ²) )
σ² | µ, y ∼ InverseGamma( n/2 + α , (n/2)((µ − ȳ)² + ∑_{i=1}^{n}(y_i − ȳ)²/n) + β )
where n is the total number of data points and ȳ is the sample mean.
Let us estimate the posterior distributions of µ and σ using a Gibbs sampler.
#Priors
# mu ~ Normal(m,s)
m <- 1000
s <- 100
# sigma^2 ~ InverseGamma(a,b)
a <- 21
b <- 2000
# Data
y <- y
n <- length(y)
# Gibbs sampler
# Markov chain
nsamp <- 10000
mu_chain <- rep(NA,nsamp)
sigma_chain <- rep(NA,nsamp)
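The update loop itself is not shown; a minimal sketch of one way to implement it using the conditional posteriors derived above (the starting values are my own choices, s is treated as the prior standard deviation on µ, and rinvgamma() is assumed to come from the invgamma package):
# A sketch of the Gibbs updates (assumed starting values)
library(invgamma)
ybar <- mean(y)
mu_chain[1] <- ybar
sigma_chain[1] <- var(y)
for(i in 2:nsamp){
  # Draw mu from its conditional posterior, given the current sigma^2
  sigma2 <- sigma_chain[i-1]
  cond_var  <- (s^2 * sigma2) / (n * s^2 + sigma2)
  cond_mean <- (n * s^2 * ybar + sigma2 * m) / (n * s^2 + sigma2)
  mu_chain[i] <- rnorm(1, cond_mean, sqrt(cond_var))
  # Draw sigma^2 from its conditional posterior, given the new mu
  shape <- n/2 + a
  rate  <- (n/2) * ((mu_chain[i] - ybar)^2 + sum((y - ybar)^2)/n) + b
  sigma_chain[i] <- rinvgamma(1, shape = shape, rate = rate)
}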
[Figure: posterior density of mu_chain.]
ggplot(posteriors[-(1:5000),],aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
geom_vline(xintercept = 100,size=1.5,color="red")
[Figure: posterior density of sigma_chain plotted over a very wide range (0 to 200,000).]
ggplot(posteriors[-(1:5000),],aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
geom_vline(xintercept = 100,size=1.5,color="red")+
scale_x_continuous(limits = c(0,1000))
[Figure: posterior density of sigma_chain with the x-axis limited to 0–1000 and a reference line at 100.]
ggplot(gibbs_metro,aes(x=mu_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("mu")+
theme(legend.title = element_blank(),
legend.position = "top")+
facet_wrap(~algorithm,scales = "free")+
scale_x_continuous(limits = c(750,850))
[Figure: posterior densities of mu_chain from the Gibbs and Metropolis samplers, both centred near 800.]
ggplot(gibbs_metro,aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
facet_wrap(~algorithm,scales = "free")+
scale_x_continuous(limits = c(50,500))
[Figure: posterior densities of sigma_chain from the Gibbs and Metropolis samplers.]
8.4 Limitations
• We cannot derive conditional posterior densities in most cases
9 Hamiltonian Monte Carlo (HMC)
9.1 Idea
• Exploration of parameter space is informed by the geometry of the posterior distribution
– Random walk: randomly jump from the current location to a new location in the parameter
space
– Hamiltonian slide: slide along the posterior density space from the current location to a new
location
9.2 Hamiltonian dynamics
• Total energy of the system at a point x is the sum of the potential and the kinetic energy of the
particle at that point,
– Say, H is the total energy, V is the potential energy and T is the kinetic energy
– H ( x, ρ) = V ( x ) + T (ρ)
* T(ρ) = ρ²/2m, where ρ is the momentum and m is the mass
– The potential energy of the particle, V(x), is a function of its position x
– δx/δt = δH/δρ = δT(ρ)/δρ
– δρ/δt = −δH/δx = −δV(x)/δx
– These are called Hamilton's equations.
– Given initial position and initial momentum of the particle at time t0 , we can determine its
position and momentum at time t0 + T
9.3 Using Hamiltonian dynamics for MCMC
• A particle of mass m moves in negative log posterior density space. The potential energy function, V(θ), is the negative log of the (unnormalized) posterior density function
• In each iteration, the particle is pushed from its current position θi in random direction with some
momentum ρi .
– After time T has elapsed, the particle acquires new position θ j and new momentum ρ j
– The new position and momentum can be calculated using Hamilton's equations
– In practice, Hamilton's equations must be numerically approximated by discretizing time. This is done by splitting the time interval T into small intervals of size ϵ.
• Accept the new position θ j with probability min(1, exp( H (θi , ρi ) − H (θ j , ρ j )))
9.4 Step-by-step procedure
• Define the gradient function, δV/δθ
• Choose internal parameters: number of samples, number of leapfrog steps, step size
• In each iteration:
9.5 Implementation
Example. A Normal model with normal priors on µ and σ
You are given 500 data points, y1, y2, ..., y500, that are assumed to come from a normal distribution with mean µ and standard deviation σ.
Let yi be the ith data point,
Model:
yi ∼ Normal (µ, σ )
µ ∼ Normal (m, s)
σ ∼ Normal ( a, b)
Data
[Figure: histogram of the observed data y.]
#Gradient functions
gradient <- function(mu,sigma,y,n,m,s,a,b){
grad_mu <- (((n*mu)-sum(y))/(sigma^2))+((mu-m)/(s^2))
grad_sigma <- (n/sigma)-(sum((y-mu)^2)/(sigma^3))+((sigma-a)/(b^2))
return(c(grad_mu,grad_sigma))
}
#Data
y <- y
n <- length(y)
#Model parameters
m <- 1000
s <- 100
a <- 10
b <- 2
# HMC sampler
# Internal parameters
# Step size
step <- 0.02
# Number of leapfrog steps
L <- 20
# Markov chain
nsamp <- 8000
mu_chain <- rep(NA,nsamp)
sigma_chain <- rep(NA,nsamp)
reject <- 0
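The sampling loop itself is not shown; a minimal sketch of a leapfrog-based HMC update consistent with the gradient function and internal parameters defined above (the potential-energy function, starting values, and unit particle mass are my assumptions):
# A sketch of the HMC loop (assumed potential energy, starting values, unit mass)
# Potential energy: negative log of the unnormalized posterior
V <- function(mu,sigma,y,n,m,s,a,b){
  sum((y-mu)^2)/(2*sigma^2) + n*log(sigma) +
    (mu-m)^2/(2*s^2) + (sigma-a)^2/(2*b^2)
}
mu_chain[1] <- mean(y)      # assumed starting values
sigma_chain[1] <- sd(y)
i <- 1
while(i < nsamp){
  theta <- c(mu_chain[i], sigma_chain[i])
  rho <- rnorm(2)           # fresh momentum, unit mass
  H_current <- V(theta[1], theta[2], y, n, m, s, a, b) + sum(rho^2)/2
  # Leapfrog integration of Hamilton's equations: L steps of size `step`
  rho_new <- rho - (step/2)*gradient(theta[1], theta[2], y, n, m, s, a, b)
  theta_new <- theta
  for(l in 1:L){
    theta_new <- theta_new + step*rho_new
    if(l < L) rho_new <- rho_new - step*gradient(theta_new[1], theta_new[2], y, n, m, s, a, b)
  }
  rho_new <- rho_new - (step/2)*gradient(theta_new[1], theta_new[2], y, n, m, s, a, b)
  accept <- FALSE
  if(theta_new[2] > 0){     # sigma must stay positive
    H_new <- V(theta_new[1], theta_new[2], y, n, m, s, a, b) + sum(rho_new^2)/2
    # Accept with probability min(1, exp(H_current - H_new))
    accept <- min(1, exp(H_current - H_new)) > runif(1)
  }
  if(accept){
    mu_chain[i+1] <- theta_new[1]
    sigma_chain[i+1] <- theta_new[2]
    i <- i + 1
  }else{
    reject <- reject + 1
  }
}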
[Figure: posterior density of mu_chain from HMC, concentrated around 800.]
ggplot(posteriors[-(1:2000),],aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
geom_vline(xintercept = 10,size=1.5,color="red")
[Figure: posterior density of sigma_chain from HMC, with a reference line at 10.]
ggplot(gibbs_metro_hmc,aes(x=mu_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("mu")+
theme(legend.title = element_blank(),
legend.position = "top")+
facet_wrap(~algorithm,scales = "free",nrow = 3)+
scale_x_continuous(limits = c(750,850))
[Figure: posterior densities of mu_chain from the Gibbs, HMC, and Metropolis samplers, each centred near 800.]
ggplot(gibbs_metro_hmc,aes(x=sigma_chain))+
geom_density(size=1.2)+
theme_bw()+xlab("sigma")+
theme(legend.title = element_blank(),
legend.position = "top")+
facet_wrap(~algorithm,scales = "free",nrow = 3)+
scale_x_continuous(limits = c(50,500))
[Figure: posterior densities of sigma_chain from the Gibbs, HMC, and Metropolis samplers.]
# Inspect chains
posteriors <- data.frame(mu_chain,sigma_chain)
posteriors$id <- 1:nsamp # iteration index used in the trace plots below
[Figure: trace plot of mu_chain over all iterations, moving from the starting region near 1000 down to around 800.]
ggplot(posteriors[-c(1:2000),],aes(x=id,y=mu_chain))+
geom_line(size=1.2,color="blue")+
theme_bw()+xlab("mu chain")
[Figure: trace plot of mu_chain after discarding the first 2000 samples.]
ggplot(posteriors[-c(1:2000),],aes(x=id,y=sigma_chain))+
geom_line(size=1.2,color="blue")+
theme_bw()+xlab("sigma chain")
[Figure: trace plot of sigma_chain after discarding the first 2000 samples.]
• There are packages in R and Python that can estimate parameters for you using one of the posterior simulation algorithms introduced here.
• You only need to define your likelihood and the priors in a given syntax; the algorithm (defined for the package) will start drawing samples from the posterior.
• Popular examples are the RStan/PyStan/brms packages, which use a Hamiltonian Monte Carlo algorithm for sampling.
• We will talk about them in the next lecture; you can read chapter 3 of the book "An Introduction to Bayesian Data Analysis for Cognitive Science" (https://vasishth.github.io/bayescogsci/book/) for reference.
## y obs
## 1 791.4674 1
## 2 797.0719 2
## 3 786.8546 3
## 4 798.3879 4
## 5 786.7599 5
## 6 801.9058 6
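The call that fits the model is not shown; a minimal sketch of a brms call that would produce a fit object like fit_normal (the data-frame name dat is an assumption, and the priors are left at their brms defaults):
# A sketch of the brms fit (assumed data-frame name, default priors)
library(brms)
fit_normal <- brm(y ~ 1, data = dat, family = gaussian(), chains = 4)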
## Start sampling
plot(fit_normal)
[Figure: posterior density and trace plots for b_Intercept and sigma from the brms fit, for 4 chains.]
posterior_summary(fit_normal)