Lec16: Summarizing Posterior Distributions &
Bayesian Model Selection
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
Learn about the Bayesian and Akaike Information Criteria (BIC, AIC)
Familiarize ourselves with criteria for Bayesian model selection and comparison, Bayes' factors and
Occam's razor
However, the posterior mode, aka the MAP estimate, is the most popular
choice because it reduces to an optimization problem, for which efficient
algorithms often exist.
In many applications, it is important to know how much one can trust a given
estimate.
In the example on the right, the mode is 0, but the mean is non-zero. Such skewed
distributions arise when inferring variance parameters, especially in hierarchical models. In
such cases the MAP estimate is a very poor estimate.
[Figures: a bimodal density whose mode is unrepresentative of the distribution, and a skewed Gamma density whose mode is 0. Run bimodalDemo and gammaPlotDemo from Kevin Murphy's PMTK.]
Decision Theory and Loss Functions
How should we summarize a posterior if the mode is not a good choice?
The answer is to use decision theory. The basic idea is to specify a loss function $L(\theta, \hat{\theta})$, which is the loss you incur if the truth is $\theta$ and your estimate is $\hat{\theta}$.
A common choice is the squared loss $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$, which gives rise to the posterior mean. Or we can use a more robust loss function, $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$, which gives rise to the posterior median.
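As a quick illustration (not part of the original slides), here is a minimal Python sketch comparing the optimal point estimates under these losses, using samples from a hypothetical skewed Gamma "posterior": the sample mean minimizes the expected squared loss and the sample median minimizes the expected absolute loss.

```python
import numpy as np

# Hypothetical skewed "posterior": a Gamma distribution, explored by MC sampling.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=1.5, scale=1.0, size=100_000)

post_mean = samples.mean()        # minimizes E[(theta - a)^2]
post_median = np.median(samples)  # minimizes E[|theta - a|]
post_mode = (1.5 - 1.0) * 1.0     # analytic mode of Gamma(1.5, scale=1): (shape - 1) * scale

print(f"mean   = {post_mean:.3f}")
print(f"median = {post_median:.3f}")
print(f"mode   = {post_mode:.3f}")
```

For a skewed posterior such as this one, the three summaries can differ substantially, which is exactly why the choice of loss function matters.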
Suppose we compute the posterior for 𝑥. If we define 𝑦 = 𝑓(𝑥), the distribution for 𝑦 is given
by
$$p_y(y) = p_x(x) \left| \frac{dx}{dy} \right|$$
The Jacobian measures the change in size of a unit volume passed through 𝑓.
For example, let 𝑥~𝒩(6,1), and 𝑦 = 𝑓(𝑥) = 1/(1 + exp(−𝑥 + 5)). We can derive the
distribution of 𝑦 using MC. See the following implementation.
[Figure: the transformation g, the Gaussian input density pX, and the resulting skewed density pY. Run bayesChangeOfVar from Kevin Murphy's PMTK.]
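A minimal Python sketch of the same Monte Carlo change-of-variables experiment (the transformation and parameters follow the example above; this is an illustrative reimplementation, not the PMTK code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# x ~ N(6, 1) and y = f(x) = 1 / (1 + exp(-x + 5))
x = rng.normal(loc=6.0, scale=1.0, size=100_000)
y = 1.0 / (1.0 + np.exp(-x + 5.0))

# Monte Carlo approximation of p_y: histogram of the transformed samples.
plt.hist(y, bins=100, density=True, label="MC estimate of $p_y(y)$")
plt.xlabel("y")
plt.legend()
plt.show()

# Note: the mode of p_y is NOT f(mode of p_x) in general -- this is the
# reparametrization issue discussed in the text.
print("f(E[x]) =", 1.0 / (1.0 + np.exp(-6.0 + 5.0)), " vs  E[y] =", y.mean())
```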
The MLE does not suffer from this issue (the likelihood is a function, not a probability density).
Bayesian inference also does not suffer from it, since the change of measure is taken into
account when integrating over the parameter space.
MAP Estimation and Reparametrization
Consider a Bernoulli likelihood and uniform prior as follows:
$$p(y = 1 \mid \theta) = \theta, \quad y \in \{0, 1\}, \qquad p(\theta) = 1, \quad 0 \le \theta \le 1$$
Without data, the MAP estimate is the mode of the prior, which is anywhere on the interval [0, 1].
Druilhet, P. and J.-M. Marin (2007). Invariant HPD credible sets and MAP estimators. Bayesian
Analysis 2(4), 681–692.
Jermyn, I. (2005). Invariant Bayesian estimation on manifolds. Annals of Statistics 33(2), 583–605.
A $100(1-\alpha)\%$ credible interval is a contiguous region $C = (\ell, u)$ that contains $1 - \alpha$ of the posterior mass:
$$C_{\alpha}(\mathcal{D}) = (\ell, u) : p(\ell \le \theta \le u \mid \mathcal{D}) = 1 - \alpha$$
There may be many such intervals, so we choose one such that there is 𝛼/2 mass in each tail; this is
called a central interval.
We simply sort the $S$ samples and find the ones that occur at locations
$\ell = \lfloor S\alpha/2 \rfloor$ and $u = \lfloor S(1 - \alpha/2) \rfloor$ along the sorted list. As $S \to \infty$, this converges to the
true quantiles.
Run mcQuantileDemo from Kevin Murphy's PMTK
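A minimal Python sketch of this sample-based central interval (the Beta(3,9) posterior used here matches the figure below):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

# Draw S samples from the posterior (here a Beta(3,9), as in the figure below).
S = 100_000
samples = rng.beta(3, 9, size=S)

# Central 100*(1-alpha)% interval: sort and read off the alpha/2 and 1-alpha/2 quantiles.
sorted_samples = np.sort(samples)
lo = sorted_samples[int(np.floor(S * alpha / 2))]
hi = sorted_samples[int(np.floor(S * (1 - alpha / 2)))]
print(f"95% central interval: ({lo:.3f}, {hi:.3f})")
```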
In general, credible intervals are usually what people want to compute, but
confidence intervals are usually what they actually compute!
This motivates the highest posterior density (HPD) region. This is defined as the set of most probable points that in total constitute $100(1 - \alpha)\%$ of the probability mass. More formally, we find the threshold $p^*$ on the pdf such that

$$1 - \alpha = \int_{\theta : p(\theta \mid \mathcal{D}) > p^*} p(\theta \mid \mathcal{D})\, d\theta$$

and then define the HPD region as

$$C_{\alpha}(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \ge p^*\}$$

[Figure: central interval for a Beta(3,9) posterior; from Kevin Murphy's PMTK.]
We see that this is narrower than the CI, even though it still contains 95% of the mass. Also
every point inside of it has higher density than every point outside of it.
[Figure: a central interval with $\alpha/2$ mass in each tail, compared with the HPD region defined by the density threshold $p_{MIN}$.]
For a unimodal distribution, the HPD interval (also called the highest density interval, HDI) will be the narrowest interval around the mode containing
95% of the mass.
If the posterior is multimodal, the HPD region may not even be a connected region. Note that
summarizing multimodal posteriors is always difficult.
Run postDensityIntervals from Kevin Murphy's PMTK
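A minimal Python sketch of a sample-based HPD interval for a unimodal posterior (this follows the common "narrowest window containing $1-\alpha$ of the sorted samples" construction, not necessarily the exact PMTK implementation):

```python
import numpy as np

def hpd_interval(samples, alpha=0.05):
    """Narrowest interval containing (1 - alpha) of the samples (unimodal case)."""
    sorted_s = np.sort(samples)
    S = len(sorted_s)
    n_keep = int(np.floor((1 - alpha) * S))
    # Slide a window of n_keep consecutive sorted samples; pick the narrowest one.
    widths = sorted_s[n_keep:] - sorted_s[:S - n_keep]
    i = np.argmin(widths)
    return sorted_s[i], sorted_s[i + n_keep]

rng = np.random.default_rng(0)
samples = rng.beta(3, 9, size=100_000)
print("95% HPD interval:", hpd_interval(samples))
```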
Inference for a Difference in Proportions
Often we have multiple parameters, and we are interested in computing the posterior
distribution of some function of these parameters.
For example, suppose we want to buy an item online and can choose between two sellers offering it for the same price. Given:
a. Seller 1 has 𝑦1 = 90 positive reviews and 10 negative reviews.
b. Seller 2 has 𝑦2 = 2 positive reviews and 0 negative reviews.
It seems you should pick seller 2, but we cannot be very confident that seller 2 is better since
it has had so few reviews.
We sketch a Bayesian analysis of this problem. Similar methodology can be used to compare
rates or proportions across groups for a variety of other settings.
The posteriors are 𝑝(𝜃1 |𝒟1) = ℬℯ𝓉𝒶(91,11) and 𝑝(𝜃2 |𝒟2) = ℬℯ𝓉𝒶(3,1).
We want to compute 𝑝(𝜃1 > 𝜃2|𝒟). For convenience, let us define 𝛿 = 𝜃1 − 𝜃2 as the
difference in the rates. We can compute the desired quantity using numerical integration:
$$p(\delta > 0 \mid \mathcal{D}) = \int_0^1\!\!\int_0^1 \mathbb{I}(\theta_1 > \theta_2)\, \mathcal{B}eta(\theta_1 \mid y_1 + 1, N_1 - y_1 + 1)\, \mathcal{B}eta(\theta_2 \mid y_2 + 1, N_2 - y_2 + 1)\, d\theta_1\, d\theta_2$$
We find 𝑝(𝛿 > 0|𝒟) = 0.710, which means you are better off buying from seller 1!
Run amazonSellerDemo from Kevin Murphy's PMTK
[Figure: MC approximation to p(δ|D) with a 95% central interval (left); posteriors p(θ_i|D_i) (right).]
We approximate the posterior 𝑝(𝛿|𝒟) by MC sampling. 𝜃1 and 𝜃2 are independent and both have Beta
distributions, which can be sampled easily.
𝑝(𝜃𝑖|𝒟𝑖) are shown on the right, and a MC approximation to 𝑝(𝛿|𝒟) together with a 95% central interval
on the left. An MC approximation to 𝑝(𝛿 > 0|𝒟) is obtained by counting the fraction of samples where
𝜃1 > 𝜃2 . This turns out to be 0.718, which is very close to the exact value.
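A minimal Python sketch of this MC approximation (the posteriors follow the example above; exact numbers will vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000

# Posteriors from the review data: Beta(91, 11) and Beta(3, 1).
theta1 = rng.beta(91, 11, size=S)   # seller 1
theta2 = rng.beta(3, 1, size=S)     # seller 2

delta = theta1 - theta2
p_delta_pos = np.mean(theta1 > theta2)    # MC estimate of p(delta > 0 | D)
ci = np.percentile(delta, [2.5, 97.5])    # 95% central interval for delta

print(f"p(delta > 0 | D) approx {p_delta_pos:.3f}")
print(f"95% central interval for delta: ({ci[0]:.3f}, {ci[1]:.3f})")
```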
Model Selection
A number of complexity parameters (polynomial order, regularization parameter, etc.) need to
be selected to optimize performance/predictive capability. This is a model selection problem.
In MLE, the performance on the training set is not a good indicator of predictive performance
due to the problem of over-fitting.
We often use some of the available data to train a range of models (or a given model with a
range of values for its complexity parameters) and then to compare them on a validation set.
We then select the one having the best predictive performance.
Some over-fitting to the validation data can occur, so a third test set on which the
performance of the selected model is finally evaluated may be needed.
Training Set
Validation Set (Optimize hyperparameters)
Test Set (Measure Performance)
Alternatively, an information criterion such as the AIC, $\ln p(\mathcal{D} \mid \boldsymbol{w}_{ML}) - M$, with $M$ the number of adjustable model parameters, can be used to penalize the maximized likelihood by the model complexity.

In the Bayesian approach, we compute the posterior over models:

$$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m' \in \mathcal{M}} p(m', \mathcal{D})}, \qquad \hat{m} = \arg\max_m p(m \mid \mathcal{D})$$

$$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$$
This quantity is called the evidence for model 𝑚.
The details on how to perform this integral will be discussed with examples
later on.
One might think that the most complex model will always be preferred. This is true if we use $p(\mathcal{D} \mid \hat{\theta}_m)$ to select models, where $\hat{\theta}_m$ is the MLE or MAP estimate of the
parameters for model $m$: models with more parameters will fit the data better, and hence
achieve higher likelihood.
However, if we integrate out the parameters, rather than maximizing them, we are
automatically protected from overfitting.
Models with more parameters do not necessarily have higher marginal likelihood.
This is called the Bayesian Occam's razor effect (MacKay 1995b; Murray and Ghahramani
2005).
Occam's razor principle: one should pick the simplest model that adequately
explains the data.
Bayesian Occam’s Razor
The marginal likelihood can be rewritten as a sequential, cross-validation-like product of predictive terms:

$$p(\mathcal{D}) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_{1:2}) \cdots p(y_N \mid y_{1:N-1})$$

If a model is too complex, it will overfit the early examples and will then predict the remaining
ones poorly.
For each model $M_i$ with parameters $\theta_i$, likelihood $f_i(\boldsymbol{x} \mid \theta_i, M_i)$ and prior $\pi_i(\theta_i \mid M_i)$, the parameter posterior is

$$\pi_i(\theta_i \mid \boldsymbol{x}, M_i) = \frac{f_i(\boldsymbol{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)}{\pi_i(\boldsymbol{x} \mid M_i)}$$

We define as the best model the one that is most probable to have generated the data $\boldsymbol{x}$ that we observed, i.e. the one with the largest model posterior

$$\pi(M_i \mid \boldsymbol{x}) = \frac{\pi_i(\boldsymbol{x} \mid M_i)\, \pi(M_i)}{\pi(\boldsymbol{x})}$$

Noting that

$$\pi_i(\boldsymbol{x} \mid M_i) = \int f_i(\boldsymbol{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$$

we can find the best model that represents the data by computing:

$$\frac{\pi(M_1 \mid \boldsymbol{x})}{\pi(M_2 \mid \boldsymbol{x})} = \frac{\pi(\boldsymbol{x} \mid M_1)}{\pi(\boldsymbol{x} \mid M_2)}\, \frac{\pi(M_1)}{\pi(M_2)} = \underbrace{\frac{\int f_1(\boldsymbol{x} \mid \theta_1, M_1)\, \pi_1(\theta_1 \mid M_1)\, d\theta_1}{\int f_2(\boldsymbol{x} \mid \theta_2, M_2)\, \pi_2(\theta_2 \mid M_2)\, d\theta_2}}_{\text{Bayes' factor } B_{12}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}}$$
Bayesian Model Comparison - Example
Consider the coin-flipping example. The model comparison becomes

$$\frac{\pi(M_1 \mid \boldsymbol{x})}{\pi(M_2 \mid \boldsymbol{x})} = \underbrace{\frac{\int f(\boldsymbol{x} \mid \theta_1, M_1)\, \pi(\theta_1 \mid M_1)\, d\theta_1}{\int f(\boldsymbol{x} \mid \theta_2, M_2)\, \pi(\theta_2 \mid M_2)\, d\theta_2}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}} = 2.58\, \frac{\pi(M_1)}{\pi(M_2)}$$
Remark: Bayes' factors and posterior model probabilities should be used with caution when non-informative priors are applied.
Bayes Factors and Jeffreys Scale
Suppose our prior on models is uniform, $p(m) \propto 1$. Then model selection is equivalent to
picking the model with the highest marginal likelihood. Now suppose we just have two models
we are considering, call them the null hypothesis, $M_0$, and the alternative hypothesis, $M_1$, and define the Bayes factor as the ratio of their marginal likelihoods, $B_{10} = p(\mathcal{D} \mid M_1)/p(\mathcal{D} \mid M_0)$.
On the Jeffreys scale, for $\log(B_{10})$ between 0 and 0.5, the evidence against $H_0$ is poor.
The Bayes factor does not tell us whether any of these models is sensible.
Estimation and Beyond in the Bayes Universe, Brani Vidakovic (online Course on Bayesian Stat. for Engineers)
[Figure: coinsModelSelDemo from Kevin Murphy's PMTK; the x-axis is the number of heads in the observed coin flips.]
To compare the two models we compare $p(M_0)\, p(\mathcal{D} \mid M_0)$ with $p(M_1)\, p(\mathcal{D} \mid M_1)$, where
$$p(\mathcal{D} \mid M_i) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid M_i)\, d\theta$$
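A minimal Python sketch of this comparison for a coin, assuming (as is common in such demos, though the exact setup of coinsModelSelDemo is not reproduced here) that $M_0$ is a fair coin with $\theta = 0.5$ and $M_1$ places a $\mathrm{Beta}(a, b)$ prior on $\theta$:

```python
import numpy as np
from scipy.special import betaln

def log_marglik_fair(n_heads, n_tails):
    """log p(D | M0) for a fair coin, theta = 0.5."""
    return (n_heads + n_tails) * np.log(0.5)

def log_marglik_beta(n_heads, n_tails, a=1.0, b=1.0):
    """log p(D | M1) with a Beta(a, b) prior: B(a + N1, b + N0) / B(a, b)."""
    return betaln(a + n_heads, b + n_tails) - betaln(a, b)

n_heads, n_tails = 9, 1   # hypothetical data
log_m0 = log_marglik_fair(n_heads, n_tails)
log_m1 = log_marglik_beta(n_heads, n_tails)

# With equal prior model probabilities, p(M1 | D) follows from the marginal likelihoods.
p_m1 = np.exp(log_m1) / (np.exp(log_m0) + np.exp(log_m1))
print(f"Bayes factor B10 = {np.exp(log_m1 - log_m0):.3f},  p(M1 | D) = {p_m1:.3f}")
```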
Note that the marginal likelihood normalizes over all possible datasets:
$$\sum_{\mathcal{D}'} p(\mathcal{D}' \mid m) = 1$$
[Figure: marginal likelihood p(D|m) plotted over all possible datasets for three models of increasing complexity. From Bayesian Methods for Machine Learning, ICML Tutorial, 2004, Zoubin Ghahramani.]
Model 2 is "just right": it predicts the observed data with a reasonable degree of confidence,
but does not predict too many other things. Hence model 2 is the most probable model.
For any model $M$:
$$\sum_{\text{all } d \in \mathcal{D}} p(\mathcal{D} = d \mid M) = 1$$
A note on the evidence and Bayesian Occam's razor, I. Murray and Z. Ghahramani (2005),
Gatsby Unit Technical Report GCNU-TR 2005-003.
Occam's Razor, C. Rasmussen and Z. Ghahramani, in T. K. Leen, T. G. Dietterich and V. Tresp (eds),
Neural Information Processing Systems 13, pp. 294-300, 2001, MIT Press.
Bayesian Occam’s Razor
Bayesian Methods for Machine Learning, ICML Tutorial, 2004, Zoubin Ghahramani
[Figure: polynomials of degrees 1, 2, 3 fit to N = 5 data points using empirical Bayes (panel shown: d=3, logev=-21.777, EB), together with the model posterior P(M|D) over M = 1, 2, 3 (panel: N=5, method=EB). Run linregEbModelSelVsN from Kevin Murphy's PMTK.]
Bayesian Occam’s Razor
Polynomials of degrees 1, 2, 3 fit to N = 30 data points using empirical Bayes. The solid green curve is the
true function, the dashed red curve is the prediction (dotted blue lines represent σ around the mean). The
posterior over models p(m|D) is also shown, using a Gaussian prior p(m).
[Figure: polynomial fits for N = 30 data points (panel shown: d=3, logev=-108.181, EB) and the model posterior P(M|D) over M = 1, 2, 3 (panel: N=30, method=EB).]
Marginal Likelihood (Evidence)
Let us review once more how we can compute the evidence for conjugate prior
models.
However, when comparing models, we need to know how to compute the marginal
likelihood, 𝑝(𝒟|𝑚).
In general, this can be quite hard, since we have to integrate over all possible
parameter values, but when we have a conjugate prior, it is easy to compute.
$$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$$
Let $p(\mathcal{D} \mid \theta) = q(\mathcal{D} \mid \theta)/Z_l$ be the likelihood, where $Z_l$ contains any constant factors in the
likelihood, and let $p(\theta) = q(\theta)/Z_0$ be the prior, with normalization constant $Z_0$.
Let $p(\theta \mid \mathcal{D}) = q(\theta \mid \mathcal{D})/Z_N$ be our posterior, where $q(\theta \mid \mathcal{D}) = q(\mathcal{D} \mid \theta)\, q(\theta)$ is the unnormalized
posterior, and $Z_N$ is the normalization constant of the posterior. Then

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \;\;\Rightarrow\;\; \frac{q(\theta \mid \mathcal{D})}{Z_N} = \frac{q(\mathcal{D} \mid \theta)\, q(\theta)}{Z_l\, Z_0\, p(\mathcal{D})} \;\;\Rightarrow\;\; p(\mathcal{D}) = \frac{Z_N}{Z_0\, Z_l}$$
So assuming the relevant normalization constants are tractable, we have an easy way to
compute the marginal likelihood.
For the Beta-Binomial model, equating the normalized posterior $\mathcal{B}eta(\theta \mid a + N_1, b + N_0)$ with Bayes' rule gives

$$\frac{1}{B(a + N_1, b + N_0)} = \binom{N}{N_1} \frac{1}{p(\mathcal{D})\, B(a, b)} \;\;\Rightarrow\;\; p(\mathcal{D}) = \binom{N}{N_1} \frac{B(a + N_1, b + N_0)}{B(a, b)}$$

The marginal likelihood for the Beta-Bernoulli model is the same as above, but without the $\binom{N}{N_1}$ term.
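A minimal Python sketch checking the Beta-Bernoulli marginal likelihood against brute-force numerical integration:

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

a, b = 2.0, 2.0   # Beta prior hyperparameters
N1, N0 = 7, 3     # observed heads / tails (a specific sequence)

# Closed form: p(D) = B(a + N1, b + N0) / B(a, b)
log_evidence = betaln(a + N1, b + N0) - betaln(a, b)

# Brute force: integrate likelihood * prior over theta.
def integrand(theta):
    prior = theta**(a - 1) * (1 - theta)**(b - 1) / np.exp(betaln(a, b))
    lik = theta**N1 * (1 - theta)**N0
    return lik * prior

numeric, _ = quad(integrand, 0.0, 1.0)
print(np.exp(log_evidence), numeric)   # the two should agree
```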
Dirichlet-Multinoulli Model
One can show that the marginal likelihood for the Dirichlet-multinoulli model is given by

$$p(\mathcal{D}) = \frac{B(\boldsymbol{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}, \qquad B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}$$

Hence, we can rewrite the above result in the following form, which is more often used:

$$p(\mathcal{D}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(N + \sum_k \alpha_k\right)} \prod_{k=1}^K \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}$$
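A minimal Python sketch of this formula in log space (working with log-Gamma functions avoids overflow):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_dirichlet_multinoulli(counts, alpha):
    """log p(D) = log Gamma(sum a_k) - log Gamma(N + sum a_k)
                  + sum_k [log Gamma(N_k + a_k) - log Gamma(a_k)]"""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    N = counts.sum()
    return (gammaln(alpha.sum()) - gammaln(N + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

# Example: K = 3 categories with counts N_k and a symmetric Dirichlet(1,1,1) prior.
print(log_evidence_dirichlet_multinoulli([5, 2, 3], [1.0, 1.0, 1.0]))
```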
Similarly, for a multivariate Gaussian with a Normal-Inverse-Wishart (NIW) prior, with prior hyperparameters $(\kappa_0, \nu_0, \boldsymbol{S}_0)$ and posterior hyperparameters $(\kappa_N, \nu_N, \boldsymbol{S}_N)$, the marginal likelihood is

$$p(\mathcal{D}) = \frac{1}{\pi^{ND/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \frac{|\boldsymbol{S}_0|^{\nu_0/2}}{|\boldsymbol{S}_N|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2}$$
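A minimal Python sketch of this formula; the posterior hyperparameters $(\kappa_N, \nu_N, \boldsymbol{S}_N)$ are taken as given inputs here, since their update equations are not shown above:

```python
import numpy as np
from scipy.special import multigammaln

def gaussian_niw_log_evidence(N, D, kappa0, nu0, S0, kappaN, nuN, SN):
    """log p(D) = -ND/2 log(pi) + log Gamma_D(nuN/2) - log Gamma_D(nu0/2)
                  + nu0/2 log|S0| - nuN/2 log|SN| + D/2 log(kappa0/kappaN)"""
    return (-0.5 * N * D * np.log(np.pi)
            + multigammaln(0.5 * nuN, D) - multigammaln(0.5 * nu0, D)
            + 0.5 * nu0 * np.linalg.slogdet(S0)[1]
            - 0.5 * nuN * np.linalg.slogdet(SN)[1]
            + 0.5 * D * np.log(kappa0 / kappaN))
```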
The NIW prior itself (this equation will prove useful later on) is

$$\begin{aligned} NIW(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{m}_0, \kappa_0, \boldsymbol{S}_0, \nu_0) &= \mathcal{N}\!\left(\boldsymbol{\mu} \,\middle|\, \boldsymbol{m}_0, \tfrac{1}{\kappa_0}\boldsymbol{\Sigma}\right) \times InvWis(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0) \\ &= \frac{1}{Z_{NIW}}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{\kappa_0}{2}(\boldsymbol{\mu} - \boldsymbol{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{m}_0)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\!\left(-\frac{1}{2}\mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{S}_0\right)\right) \\ &= \frac{1}{Z_{NIW}} \exp\!\left(-\frac{\kappa_0}{2}(\boldsymbol{\mu} - \boldsymbol{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{m}_0) - \frac{1}{2}\mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{S}_0\right)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2} \end{aligned}$$

$$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2) \left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\boldsymbol{S}_0|^{-\nu_0/2}, \qquad \Gamma_D = \text{multivariate Gamma function}$$
Laplace Approximation
The Laplace approximation allows a Gaussian approximation of the parameter
posterior about the maximum a posteriori (MAP) parameter estimate.
Consider a data set 𝒟 and 𝑀 models ℳ𝑖, 𝑖 = 1, . . , 𝑀 with corresponding parameters
𝜽𝑖 , 𝑖 = 1, … 𝑀. We compare models using the posteriors:
$$p(\mathcal{M} \mid \mathcal{D}) \propto p(\mathcal{M})\, p(\mathcal{D} \mid \mathcal{M})$$
For large sets of data 𝒟 (relative to the model parameters), the parameter posterior is
approximately Gaussian around the MAP estimate 𝜽𝑀𝐴𝑃 𝑚 (use 2𝑛𝑑 order Taylor
expansion of the log-posterior):
$$p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m) \approx (2\pi)^{-d/2}\, |\boldsymbol{A}|^{1/2} \exp\!\left(-\frac{1}{2}\left(\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP}\right)^T \boldsymbol{A} \left(\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP}\right)\right),$$

where $\boldsymbol{A}$ is the negative Hessian of the log-posterior evaluated at the MAP estimate:

$$A_{ij} = -\left.\frac{\partial^2 \log p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m)}{\partial \theta_{m,i}\, \partial \theta_{m,j}}\right|_{\boldsymbol{\theta}_m^{MAP}}$$
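A minimal Python sketch of a 1D Laplace approximation (applied here, as an assumed example, to a Beta(5,3) "posterior"; the MAP is found numerically and A is the negative second derivative of the log-posterior at the MAP):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta, norm

# Target "posterior": Beta(5, 3) on (0, 1).
log_post = lambda t: beta.logpdf(t, 5, 3)

# 1. Find the MAP estimate numerically.
res = minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# 2. A = -d^2/dtheta^2 log p(theta | D) at the MAP (finite differences).
h = 1e-5
A = -(log_post(theta_map + h) - 2 * log_post(theta_map) + log_post(theta_map - h)) / h**2

# 3. Laplace approximation: N(theta_map, 1/A).
laplace = norm(loc=theta_map, scale=1.0 / np.sqrt(A))
print(f"MAP = {theta_map:.3f}, Laplace std = {1.0/np.sqrt(A):.3f}")
print("exact pdf at 0.6:", np.exp(log_post(0.6)), " Laplace pdf:", laplace.pdf(0.6))
```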
The Bayesian information criterion (BIC) approximates the log marginal likelihood as

$$\mathrm{BIC} \triangleq \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)}{2}\log N \approx \log p(\mathcal{D} \mid \mathcal{M}_m)$$

where $\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)$ is the number of degrees of freedom in the model, and
$\hat{\boldsymbol{\theta}}_m$ is the MLE for the model. We see that this has the form of a
penalized log likelihood, where the penalty term depends on the
model complexity.

As an example, consider a linear regression model. The MLE is

$$\hat{\boldsymbol{\theta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N \left(y_i - \hat{\boldsymbol{\theta}}^T\boldsymbol{x}_i\right)^2$$

The corresponding log likelihood is

$$\log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) = -\frac{1}{2\hat{\sigma}^2}\sum_i \left(y_i - \hat{\boldsymbol{\theta}}^T\boldsymbol{x}_i\right)^2 - \frac{N}{2}\log(2\pi\hat{\sigma}^2) = -\frac{N}{2} - \frac{N}{2}\log(2\pi\hat{\sigma}^2)$$

so the BIC score becomes (dropping constants)

$$\mathrm{BIC} = -\frac{N}{2}\log(2\pi\hat{\sigma}^2) - \frac{D}{2}\log N$$

where $D$ is the number of variables in the model.
The Akaike information criterion (AIC), $\mathrm{AIC}(m, \mathcal{D}) \triangleq \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{MLE}) - \mathrm{dof}(m)$, penalizes complex models less heavily than BIC, since its penalty does not grow with $N$. This causes AIC to pick more complex models. However, this can sometimes result in better predictive accuracy.
Clarke, B., E. Fokoue, and H. H. Zhang (2009). Principles and Theory for Data Mining and Machine
Learning. Springer.
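A minimal Python sketch computing the BIC score above for a linear regression fit (synthetic data; the expression matches the formula given earlier, up to the dropped constants):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

# MLE for linear regression and the plug-in noise variance.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.mean((y - X @ theta_hat) ** 2)

# BIC = -N/2 log(2 pi sigma^2) - D/2 log N   (constants dropped)
bic = -0.5 * N * np.log(2 * np.pi * sigma2_hat) - 0.5 * D * np.log(N)
print(f"BIC = {bic:.2f}")
```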
When the prior on $\boldsymbol{\theta}$ has unknown hyperparameters $\boldsymbol{\eta}$ with prior $p(\boldsymbol{\eta} \mid m)$, the marginal likelihood requires integrating over both:

$$p(\mathcal{D} \mid m) = \int\!\!\int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\eta}, m)\, p(\boldsymbol{\eta} \mid m)\, d\boldsymbol{\theta}\, d\boldsymbol{\eta}$$

The empirical Bayes (evidence) approximation replaces the integral over $\boldsymbol{\eta}$ with a point estimate:

$$p(\mathcal{D} \mid m) \approx \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \hat{\boldsymbol{\eta}}, m)\, d\boldsymbol{\theta},$$

where

$$\hat{\boldsymbol{\eta}} = \arg\max_{\boldsymbol{\eta}} p(\mathcal{D} \mid \boldsymbol{\eta}, m) = \arg\max_{\boldsymbol{\eta}} \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\eta}, m)\, d\boldsymbol{\theta}$$
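A minimal Python sketch of this idea, assuming (as a hypothetical setup, not taken from the slides) several related Bernoulli datasets that share a common $\mathrm{Beta}(a, b)$ prior; the hyperparameters $\boldsymbol{\eta} = (a, b)$ are chosen by maximizing the closed-form marginal likelihood derived earlier:

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

# Hypothetical data: successes y_j out of n_j trials in several related groups,
# all sharing a common Beta(a, b) prior on their success probabilities.
y = np.array([2, 5, 1, 9, 4])
n = np.array([10, 10, 10, 10, 10])

def neg_log_marglik(log_eta):
    a, b = np.exp(log_eta)   # optimize in log space so that a, b > 0
    # log p(D | a, b) = sum_j [ log B(a + y_j, b + n_j - y_j) - log B(a, b) ]
    return -np.sum(betaln(a + y, b + n - y) - betaln(a, b))

res = minimize(neg_log_marglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"empirical Bayes estimates: a = {a_hat:.2f}, b = {b_hat:.2f}")
```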