
Summarizing Posterior

Distributions &
Bayesian Model Selection
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 15, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 1


Contents
 Summarizing Posterior Distributions – An Introduction, MAP Estimation, Drawbacks of MAP Estimation,
MAP Estimation and Reparametrization
 Credible Intervals, HPD Intervals, Bayesian Inference for a Difference in Proportions
 Model Selection and Cross Validation, AIC Information Criterion, Posterior over Models, Model Evidence,
Bayesian Occam’s Razor, Bayesian Model Comparison, Bayes Factors and Jeffreys Scale of Evidence,
Examples, Jeffreys-Lindley Paradox
 Back to Bayesian Occam’s Razor, Marginal Likelihood, Evidence Approximation, Laplace Approximation,
Bayesian Information Criterion, Akaike Information Criterion, Effect of the Prior on Marginal Likelihood,
Empirical Bayes

• Following closely Chris Bishop's PRML book, Chapters 1 & 2
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 5
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003
• J.-M. Marin and C. P. Robert, The Bayesian Core, Springer Verlag, 2007 (online resource)
• Bayesian Statistics for Engineering, Online Course at Georgia Tech, B. Vidakovic


Goals
 The goals for today’s lecture include:

 Understand the drawbacks of MAP estimators

 Learn how to compute HPD intervals

 Learn about the Bayesian and Akaike Information Criteria (BIC and AIC)

 Familiarize ourselves with criteria for Bayesian model selection and comparison, Bayes' factors and Occam's Razor

 Learn how to compute the model evidence

 Understand how to perform a Laplace approximation to the posterior distribution

 Understand the effects of the prior on model evidence



Introduction
 We assume that we computed the posterior of unknown parameters from data.
 Using the posterior distribution 𝑝(𝜽|𝒟) to summarize everything we know about these
variables is at the core of Bayesian statistics.
 We discuss here this approach to statistics in more detail. In particular, we emphasize that
point estimates are not the best approach.
 We discuss next some simple quantities that can be derived from 𝑝(𝜽|𝒟), such as
 the posterior mean,
 the MAP estimate,
 the median, etc.
 These summary statistics (point estimates) are often easier to understand and visualize than
the full posterior distribution.



MAP Estimation
 Typically the posterior mean or median is the most appropriate choice for a
real-valued quantity, and the vector of posterior marginals is the best choice for
a discrete quantity.

 However, the posterior mode, aka the MAP estimate, is the most popular
choice because it reduces to an optimization problem, for which efficient
algorithms often exist.

 There are various drawbacks to MAP estimation, which we discuss below.

 This will provide motivation for a more thoroughly Bayesian approach.



Drawbacks of MAP Estimation
 The most obvious drawback of MAP estimation (and other point estimates
such as the posterior mean or median) is that it does not provide any measure
of uncertainty.

 In many applications, it is important to know how much one can trust a given
estimate.

 We will derive such confidence measures from the posterior.



Plugging-in the MAP Estimate Can Overfit
 In machine learning, we care more about predictive accuracy than about interpreting the parameters of our models.

 However, if we don't model the uncertainty in our parameters, then our predictive distribution will be overconfident.

 Overconfidence in predictions is particularly problematic in situations where we don't want to take any risk.


The Mode is an Untypical Estimate
 The mode is usually quite untypical of the distribution, unlike the mean or median (left Fig.)

 The mean/median take the whole probability mass into account.

 In the example on the right, the mode is 0, but the mean is non-zero. Such skewed distributions arise when inferring variance parameters, especially in hierarchical models. In such cases the MAP estimate is a very poor estimate.

Figure: left, a bimodal density where the mode is very untypical of the distribution; the mean is a better summary since it is near the majority of the probability mass. Right, a skewed (Gamma) distribution in which the mode is quite different from the mean. Run bimodalDemo and gammaPlotDemo from Kevin Murphy's PMTK.
Decision Theory and Loss Functions
 How should we summarize a posterior if the mode is not a good choice?

 The answer is to use decision theory. The basic idea is to specify a loss function $L(\theta, \hat{\theta})$, which is the loss you incur if the truth is $\theta$ and your estimate is $\hat{\theta}$.

 If we use the 0-1 loss, $L(\theta, \hat{\theta}) = \mathbb{I}(\theta \neq \hat{\theta})$, the optimal estimate is the posterior mode. Under 0-1 loss you pay the same penalty for any error, regardless of its size.

 For continuous-valued quantities, we often prefer the squared error loss, $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$; the optimal estimator is then the posterior mean.

 Or we can use a more robust loss function, $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$, which gives rise to the posterior median.
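To make the three losses concrete, here is a minimal Python sketch (not from the slides; the Gamma "posterior" and the helper names are illustrative choices only) that computes the point estimates optimal under each loss from posterior samples.

```python
import numpy as np

# Hypothetical skewed "posterior": Gamma samples standing in for draws from p(theta | D)
rng = np.random.default_rng(0)
samples = rng.gamma(shape=1.5, scale=1.0, size=100_000)

post_mean = samples.mean()            # optimal under squared error loss
post_median = np.median(samples)      # optimal under absolute (L1) loss
# crude histogram-based mode estimate (the optimum under 0-1 loss)
counts, edges = np.histogram(samples, bins=200)
i = np.argmax(counts)
post_mode = 0.5 * (edges[i] + edges[i + 1])

print(f"mode={post_mode:.3f}, median={post_median:.3f}, mean={post_mean:.3f}")
```

For a skewed distribution like this one, the three summaries differ noticeably, which is exactly why the choice of loss matters.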


MAP Estimation and Reparametrization
 In MAP estimation the result we get depends on the parametrization of the probability
distribution. Changing from one representation to another equivalent representation changes
the result.

 Suppose we compute the posterior for $x$. If we define $y = f(x)$, the distribution of $y$ is given by

$p_y(y) = p_x(x) \left| \frac{dx}{dy} \right|$

 The Jacobian factor $|dx/dy|$ measures the change in size of a unit volume passed through $f$.

 If $\hat{x} = \arg\max_x p_x(x)$ and $\hat{y} = \arg\max_y p_y(y)$, then in general $\hat{y} \neq f(\hat{x})$.

 For example, let $x \sim \mathcal{N}(6, 1)$ and $y = f(x) = 1/(1 + \exp(-x + 5))$. We can derive the distribution of $y$ using Monte Carlo; see the sketch below.
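The following is a minimal Monte Carlo sketch of this example (it mirrors the idea of the bayesChangeOfVar PMTK demo but is not that code; the histogram-based mode estimate is a simplification).

```python
import numpy as np

# x ~ N(6, 1), y = 1 / (1 + exp(-x + 5)): compare f(mode of x) with the mode of y
rng = np.random.default_rng(0)
x = rng.normal(6.0, 1.0, size=1_000_000)
y = 1.0 / (1.0 + np.exp(-x + 5.0))

def hist_mode(s, bins=200):
    """Crude histogram-based mode estimate of a sample."""
    counts, edges = np.histogram(s, bins=bins)
    i = np.argmax(counts)
    return 0.5 * (edges[i] + edges[i + 1])

y_of_x_mode = 1.0 / (1.0 + np.exp(-hist_mode(x) + 5.0))   # transform of the mode of x
y_mode = hist_mode(y)                                      # mode of the transformed variable
print(y_of_x_mode, y_mode)                                 # these differ, as the next figure shows
```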


MAP Estimation and Reparametrization
 Transformation of a density under a nonlinear transform. Note how the mode of the
transformed distribution is not the transform of the original mode.
Figure: the original density p_X, the transformed density p_Y, and the transformation g; the mode of p_Y is not the image under g of the mode of p_X. Run bayesChangeOfVar from Kevin Murphy's PMTK.
 The MLE does not suffer from this issue (the likelihood is a function not a probability density).

Bayesian inference also does not suffer from this since the change of measure is taken into
account when integrating in the parameter space.
MAP Estimation and Reparametrization
 Consider a Bernoulli likelihood and a uniform prior:

$p(y = 1 \mid \theta) = \theta, \quad y \in \{0, 1\}$
$p(\theta) = 1, \quad 0 \le \theta \le 1$

 Without data, the MAP estimate is the mode of the prior, which is anywhere on the interval [0, 1].

 Case 1: Consider the parametrization $q = \sqrt{\theta}$, so that

$\theta = q^2, \quad d\theta/dq = 2q \;\Rightarrow\; p_q(q) = p(\theta)\,|d\theta/dq| = 2q$

 The MAP estimate is $\hat{\theta} = \hat{q}^2 = 1$.

 Case 2: Consider the parametrization $q = 1 - \sqrt{1 - \theta}$, so that

$\theta = 1 - (1 - q)^2, \quad d\theta/dq = 2(1 - q) \;\Rightarrow\; p_q(q) = 2(1 - q)$

 The MAP estimate is $\hat{\theta} = 1 - (1 - \hat{q})^2 = 0$.
MAP Estimation and Reparametrization
 Using the Fisher information matrix $I(\boldsymbol{\theta})$ associated with the likelihood $p(\mathbf{x} \mid \boldsymbol{\theta})$, a solution to the problem is to optimize the following objective function:

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, |I(\boldsymbol{\theta})|^{-1/2}$

 This estimate is parameterization independent.

 The optimization problem above is difficult to implement in practice.

 Druilhet, P. and J.-M. Marin (2007). Invariant HPD credible sets and MAP estimators. Bayesian
Analysis 2(4), 681–692.
 Jermyn, I. (2005). Invariant Bayesian estimation on manifolds. Annals of Statistics 33(2), 583–605.



Credible (Central) Interval
 Credible Interval: A standard measure of confidence in some (scalar) quantity 𝜃 is the "width" of its
posterior distribution. This can be measured using a 100(1 − 𝛼) % credible interval, which is a
(contiguous) region 𝐶 = (ℓ, 𝑢) (standing for lower and upper) which contains 1 − 𝛼 of the posterior
probability mass, i.e.,

C (D )  ( , u ) : p(  q  u | D )  1  

 There may be many such intervals, so we choose one such that there is 𝛼/2 mass in each tail; this is
called a central interval.

 Some examples (for 𝛼 = 0.05):


Run quantileDemo
 Gaussian distribution: from Kevin Murphys’ PMTK

For p (q | D)  N (0,1), ( , u )    1 ( / 2),  1 (1   / 2)   ( 1.96, 1.96)


For p (q | D)  N (  ,  2 ), ( , u )    2
 Beta prior in a coin example. The posterior is ℬℯ𝓉𝒶. Run betaCreditbleInt
from Kevin Murphys’ PMTK

For p (q | D)  Beta (48,54) (47 H in 100 trials), ( , u )   0.3749, 0.5673 


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 14
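These central intervals can be reproduced with a few lines of SciPy; this is a small sketch using the quantile (ppf) functions, not the PMTK demos.

```python
from scipy.stats import beta, norm

alpha = 0.05
# N(0, 1) posterior: central 95% interval
print(norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))                   # approx (-1.96, 1.96)
# Beta(48, 54) posterior (47 heads in 100 trials, uniform prior)
print(beta.ppf(alpha / 2, 48, 54), beta.ppf(1 - alpha / 2, 48, 54))   # approx (0.3749, 0.5673)
```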
Credible Intervals
 If we don't know the functional form, but we can draw samples from the posterior, then we can use a Monte Carlo approximation to the posterior quantiles.

 We simply sort the $S$ samples and find the ones that occur at positions $S\alpha/2$ and $S(1 - \alpha/2)$ along the sorted list. As $S \to \infty$, this converges to the true quantiles.

Run mcQuantileDemo from Kevin Murphy's PMTK
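A minimal sketch of this sample-sorting approximation (not the mcQuantileDemo code), using Beta(48, 54) draws as an example posterior.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
S, alpha = 100_000, 0.05
samples = np.sort(beta.rvs(48, 54, size=S, random_state=rng))   # draws from the posterior
l = samples[int(S * alpha / 2)]          # sample at position S*alpha/2 of the sorted list
u = samples[int(S * (1 - alpha / 2))]    # sample at position S*(1 - alpha/2)
print(l, u)                              # converges to the exact quantiles as S grows
```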


Credible Vs. Confidence Intervals
 Bayesian credible intervals and frequentist confidence intervals are not the same thing.

 In general, credible intervals are usually what people want to compute, but
confidence intervals are usually what they actually compute!

 Fortunately, the mechanics of computing a credible interval are just as easy as computing a confidence interval.


Highest Posterior Density Regions
 In central intervals there might be points outside the CI which have higher probability density. This is
illustrated in the Figure, where we see that points outside the left-most CI boundary have higher density
than those just inside the right-most CI boundary.

 This motivates the highest posterior density or HPD region. This is defined as the set of most probable points that in total constitute 100(1 − α)% of the probability mass. More formally, we find the threshold $p^*$ on the pdf such that

$1 - \alpha = \int_{\{\theta : \, p(\theta \mid \mathcal{D}) > p^*\}} p(\theta \mid \mathcal{D}) \, d\theta$

 We then define the HPD as

$C_\alpha(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \ge p^*\}$

Figure: central interval for a Beta(3,9) posterior. Run betaHPD from Kevin Murphy's PMTK.


Highest Posterior Density Regions
 The HPD region is sometimes called a highest density interval or HDI. The figure shows the 95% HDI of a Beta(3, 9) distribution, which is (0.04, 0.48).

 We see that this is narrower than the CI, even though it still contains 95% of the mass. Also, every point inside of it has higher density than every point outside of it.

Figure: left, central interval for a Beta(3,9) posterior; right, HPD interval for the same posterior. Run betaHPD from Kevin Murphy's PMTK.

 The HPD region for a unimodal distribution has minimal width and contains 95% of the mass. It can be computed by optimization using the inverse CDF, as in the sketch below.
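A minimal sketch of the optimization-plus-inverse-CDF computation (not the betaHPD code); it assumes a unimodal posterior available as a SciPy frozen distribution, and the function name is illustrative.

```python
from scipy.stats import beta
from scipy.optimize import minimize_scalar

def hpd_interval(dist, alpha=0.05):
    """HPD interval of a unimodal distribution via its inverse CDF (ppf)."""
    def width(p):                        # width of the interval covering mass 1 - alpha
        return dist.ppf(p + (1 - alpha)) - dist.ppf(p)
    p = minimize_scalar(width, bounds=(0.0, alpha), method="bounded").x
    return dist.ppf(p), dist.ppf(p + (1 - alpha))

print(hpd_interval(beta(3, 9)))          # roughly (0.04, 0.48), narrower than the central interval
```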
Highest Posterior Density Regions
Figure: a central interval, which leaves α/2 mass in each tail, versus an HPD interval, which cuts the density at a threshold p_MIN.

 For a unimodal distribution, the HDI will be the narrowest interval around the mode containing
95% of the mass.

 If the posterior is multimodal, the HDI may not even be a connected region. Note that
summarizing multimodal posteriors is always difficult.

Run postDensityIntervals
from Kevin Murphys’ PMTK
Inference for a Difference in Proportions
 Often we have multiple parameters, and we are interested in computing the posterior
distribution of some function of these parameters.

 Example: suppose you are about to buy a book from Amazon.com

 Given:
 a. Seller 1 has 𝑦1 = 90 positive reviews and 10 negative reviews.
 b. Seller 2 has 𝑦2 = 2 positive reviews and 0 negative reviews.

 It seems you should pick seller 2, but we cannot be very confident that seller 2 is better since
it has had so few reviews.

 We sketch a Bayesian analysis of this problem. Similar methodology can be used to compare
rates or proportions across groups for a variety of other settings.



Inference for a Difference in Proportions
 Let 𝜃1 and 𝜃2 be the unknown reliabilities of the two sellers. We endow them both with uniform
priors, 𝜃𝑖 ~ ℬℯ𝓉𝒶(1,1).

 The posteriors are 𝑝(𝜃1 |𝒟1) = ℬℯ𝓉𝒶(91,11) and 𝑝(𝜃2 |𝒟2) = ℬℯ𝓉𝒶(3,1).

 We want to compute 𝑝(𝜃1 > 𝜃2|𝒟). For convenience, let us define 𝛿 = 𝜃1 − 𝜃2 as the
difference in the rates. We can compute the desired quantity using numerical integration:

$p(\delta > 0 \mid \mathcal{D}) = \int_0^1 \!\! \int_0^1 \mathbb{I}(\theta_1 > \theta_2)\, \mathcal{Beta}(\theta_1 \mid y_1 + 1, N_1 - y_1 + 1)\, \mathcal{Beta}(\theta_2 \mid y_2 + 1, N_2 - y_2 + 1)\, d\theta_1\, d\theta_2$

 We find 𝑝(𝛿 > 0|𝒟) = 0.710, which means you are better off buying from seller 1!

Run amazonSellerDemo
from Kevin Murphys’ PMTK



Inference for a Difference in Proportions
Figure: right, the posteriors p(θ1|data) and p(θ2|data); left, a Monte Carlo approximation to p(δ|D) with a 95% central interval. Run amazonSellerDemo from Kevin Murphy's PMTK.

 We approximate the posterior 𝑝(𝛿|𝒟) by MC sampling. 𝜃1 and 𝜃2 are independent and both have Beta
distributions, which can be sampled easily.

 𝑝(𝜃𝑖|𝒟𝑖) are shown on the right, and a MC approximation to 𝑝(𝛿|𝒟) together with a 95% central interval
on the left. An MC approximation to 𝑝(𝛿 > 0|𝒟) is obtained by counting the fraction of samples where
𝜃1 > 𝜃2 . This turns out to be 0.718, which is very close to the exact value.
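A minimal sketch of this Monte Carlo computation (not the amazonSellerDemo code):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
S = 100_000
theta1 = beta.rvs(91, 11, size=S, random_state=rng)   # seller 1: 90 positive, 10 negative
theta2 = beta.rvs(3, 1, size=S, random_state=rng)     # seller 2: 2 positive, 0 negative
delta = theta1 - theta2
print((delta > 0).mean())                     # approx 0.71, close to the exact integral
print(np.percentile(delta, [2.5, 97.5]))      # 95% central interval for delta
```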
Model Selection
 A number of complexity parameters (polynomial order, regularization parameter, etc.) need to
be selected to optimize performance/predictive capability. This is a model selection problem.

 In MLE, the performance on the training set is not a good indicator of predictive performance
due to the problem of over-fitting.

 We often use some of the available data to train a range of models (or a given model with a
range of values for its complexity parameters) and then to compare them on a validation set.
We then select the one having the best predictive performance.

 Some over-fitting to the validation data can occur, and a third test set on which the performance of the selected model is finally evaluated may be needed.

 Training Set
 Validation Set (Optimize hyperparameters)
 Test Set (Measure Performance)



Model Selection: Cross Validation

 The technique of 𝑆-fold cross-validation (here 𝑆 = 4) involves taking the


available data and partitioning it into 𝑆 groups.
 𝑆 − 1 of the groups are used to train a set of models that are then evaluated
on the remaining group. This procedure is repeated for all 𝑆 possible choices
for the held-out group and the performance scores from the 𝑆 runs are then
averaged.
 The cross-validation cost increases by a factor of 𝑆.

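A minimal sketch of S-fold cross-validation (an illustration only; `fit` and `score` are hypothetical user-supplied callables, and X, y are assumed to be NumPy arrays):

```python
import numpy as np

def cross_val_score(fit, score, X, y, S=4, seed=0):
    """Average held-out score over S folds; `fit` and `score` are user-supplied."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)
    scores = []
    for k in range(S):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        model = fit(X[train], y[train])                 # train on S-1 groups
        scores.append(score(model, X[test], y[test]))   # evaluate on the held-out group
    return float(np.mean(scores))
```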


Akaike Information Criterion
 To correct for the bias of the MLE, we use different information criteria (here $M$ = number of parameters in the model), e.g. the Akaike Information Criterion (AIC):

$\mathrm{AIC} = \ln p(\mathcal{D} \mid \mathbf{w}_{ML}) - M$

 We choose the model for which the AIC is largest.

 AIC does not account for uncertainty in the model parameters.

 It favors simple models.


Bayesian Model Selection
 In general, when faced with a set of models (i.e., families of
parametric distributions) of different complexity, how should we
choose the best one?

This is called the model selection problem.

 Examples:

 a low-order polynomial in linear regression underfits, while a high-order polynomial overfits

 a small regularization parameter λ results in overfitting, and too large a λ in underfitting
Bayesian Model Selection
 We can use CV to estimate the generalization error of all the candidate models, and then pick the model that performs best. This requires fitting each model $K$ times, where $K$ is the number of CV folds. A more efficient approach is to compute the posterior over models:

$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m' \in \mathcal{M}} p(\mathcal{D} \mid m')\, p(m')}$

 From this, we can easily compute the MAP model

$\hat{m} = \arg\max_m p(m \mid \mathcal{D})$

 This is called Bayesian model selection.


Model Evidence
 If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which maximizes the marginal likelihood:

$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$
 This quantity is called the evidence for model 𝑚.

 The details on how to perform this integral will be discussed with examples
later on.

 An intuitive interpretation of model evidence is discussed next.



Bayesian Occam’s Razor
 One might think that using 𝑝(𝒟|𝑚) to select models would always favor the model with the
most parameters.

This is true if we use $p(\mathcal{D} \mid \hat{\theta}_m)$ to select models, where $\hat{\theta}_m$ is the MLE or MAP estimate of the parameters for model $m$: models with more parameters will fit the data better, and hence achieve higher likelihood.

 However, if we integrate out the parameters, rather than maximizing them, we are
automatically protected from overfitting.

 Models with more parameters do not necessarily have higher marginal likelihood.

 This is called the Bayesian Occam’s razor effect (MacKay 1995b; Murray and Ghahramani
2005)

Occam's Razor principle: one should pick the simplest model that adequately explains the data.
Bayesian Occam’s Razor
The marginal likelihood can be rewritten as follows:

$p(\mathcal{D}) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_{1:2}) \cdots p(y_N \mid y_{1:N-1})$

where we have dropped the conditioning on 𝑚 for brevity.

 This is similar to a leave-one-out cross-validation estimate of the likelihood, since we predict


each future point given all the previous ones.

 If a model is too complex, it will overfit the early examples and will then predict the remaining
ones poorly.



Bayesian Model Comparison
 Suppose we have two models 𝑀1 and 𝑀2

 Each is associated with a set of parameters 𝜃1 and 𝜃2

 We consider priors $\pi_i(\theta_i \mid M_i)$, likelihoods $f_i(\mathbf{x} \mid \theta_i, M_i)$, and posteriors

$\pi_i(\theta_i \mid \mathbf{x}, M_i) = \frac{f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)}{\pi_i(\mathbf{x} \mid M_i)}$

 We define as the best model the one that is more probable to have generated the data $\mathbf{x}$ that we observed.


Bayesian Model Comparison
From the data we can learn the parameters for each model and then the model itself:

$\mathbf{x} \;\rightarrow\; \pi_i(\theta_i \mid \mathbf{x}, M_i) = \frac{f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)}{\pi_i(\mathbf{x} \mid M_i)} \;\rightarrow\; \pi(M_i \mid \mathbf{x}) = \frac{\pi(\mathbf{x} \mid M_i)\, \pi(M_i)}{\pi(\mathbf{x})}$

Noting that

$\pi_i(\mathbf{x} \mid M_i) = \int f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$

we can find the best model that represents the data by computing:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \frac{\pi(\mathbf{x} \mid M_1)\, \pi(M_1)}{\pi(\mathbf{x} \mid M_2)\, \pi(M_2)} = \underbrace{\frac{\int f_1(\mathbf{x} \mid \theta_1, M_1)\, \pi_1(\theta_1 \mid M_1)\, d\theta_1}{\int f_2(\mathbf{x} \mid \theta_2, M_2)\, \pi_2(\theta_2 \mid M_2)\, d\theta_2}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}}$
Bayesian Model Comparison - Example
Consider the coin flipping example.

 Let $\theta$ be the probability of getting heads.

 Consider two models:

 $M_1$, coin is fair: $\theta \mid M_1 \sim \mathcal{Beta}(100, 100)$
 $M_2$, coin is unfair: $\theta \mid M_2 \sim \mathcal{Beta}(0.5, 0.5)$

 Data: $\mathbf{x} = \{2H, 3T\}$

 Marginal likelihoods:

$\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta = \int \theta^2 (1 - \theta)^3 \, \frac{\theta^{99} (1 - \theta)^{99}}{B(100, 100)}\, d\theta \approx 0.031$

$\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta = \int \theta^2 (1 - \theta)^3 \, \frac{\theta^{-0.5} (1 - \theta)^{-0.5}}{B(0.5, 0.5)}\, d\theta \approx 0.012$

 Model comparison:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \underbrace{\frac{\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta}{\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}} \approx 2.58\, \frac{\pi(M_1)}{\pi(M_2)}$
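The two marginal likelihoods above can be reproduced by direct numerical integration; this is a small Python sketch (not from the slides, with illustrative function names).

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def marginal_likelihood(a, b, n_heads, n_tails):
    """Integrate theta^h (1-theta)^t against a Beta(a, b) prior numerically."""
    def integrand(t):
        return t**n_heads * (1 - t)**n_tails * t**(a - 1) * (1 - t)**(b - 1) / beta_fn(a, b)
    return quad(integrand, 0.0, 1.0)[0]

m1 = marginal_likelihood(100, 100, 2, 3)   # fair-coin prior Beta(100, 100): approx 0.031
m2 = marginal_likelihood(0.5, 0.5, 2, 3)   # unfair-coin prior Beta(0.5, 0.5): approx 0.012
print(m1, m2, m1 / m2)                     # the ratio is the Bayes factor in favor of M1
```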


Bayesian Model Comparison - Example
Consider the coin flipping example again.
 Let $\theta$ be the probability of getting heads.
 Two models:
 $M_1$, coin is fair: $\theta \mid M_1 \sim \mathcal{Beta}(100, 100)$
 $M_2$, coin is unfair: $\theta \mid M_2 \sim \mathcal{Beta}(0.5, 0.5)$
 Data: $\mathbf{x} = \{5H\}$

$\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta = \int \theta^5 \, \frac{\theta^{99} (1 - \theta)^{99}}{B(100, 100)}\, d\theta \approx 0.033$

$\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta = \int \theta^5 \, \frac{\theta^{-0.5} (1 - \theta)^{-0.5}}{B(0.5, 0.5)}\, d\theta \approx 0.25$

 Model comparison:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \underbrace{\frac{\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta}{\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}} \approx 0.13\, \frac{\pi(M_1)}{\pi(M_2)}$

 Remark: Bayes' factors and posterior model probabilities should be used with caution when non-informative priors are applied.
Bayes Factors and Jeffreys Scale
 Suppose our prior on models is uniform, 𝑝(𝑚)~1. Then model selection is equivalent to
picking the model with the highest marginal likelihood. Now suppose we just have two models
we are considering, call them the null hypothesis, 𝑀0, and the alternative hypothesis, 𝑀1.

 Define the Bayes factor as the ratio of marginal likelihoods:

$BF_{1,0} = \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)} = \frac{p(M_1 \mid \mathcal{D})}{p(M_0 \mid \mathcal{D})} \Big/ \frac{p(M_1)}{p(M_0)}$
 If 𝐵𝐹1,0 > 1, we prefer model 1, otherwise we prefer model 0. Jeffreys proposed a scale of
evidence shown below

Bayes factor BF(1,0) Interpretation


BF< 1/100 Decisive evidence for M0
BF< 1/10 Strong evidence for M0
1/10< BF< 1/3 Moderate evidence for M0
1/3< BF < 1 Weak evidence for M0
1 < BF < 3 Weak evidence for M1
3 < BF < 10 Moderate evidence for M1
BF>10 Strong evidence for M1
BF>100 Decisive evidence for M1
Jeffrey’s Scale of Evidence
 Using the alternative reference below, Jeffrey’s scale of evidence says:

𝜋
 For log(𝐵10 ) between 0 and 0.5, the evidence against 𝐻0 is poor

 In between 0.5 and 1, it is substantial


  ( x | H1 )
 In between 1 and 2, it is strong and B10 
 (x | H0 )
 Above 2, it is decisive.

 Bayes’ factor tells us if one should prefer 𝐻0 to 𝐻1 (relative comparison of models).

 Bayes’ factor does not tell us whether any of these models is sensible

Estimation and Beyond in the Bayes Universe, Brani Vidakovic (online Course on Bayesian Stat. for Engineers)

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 36


Example: Testing if a Coin is Fair
 Suppose we observe some coin tosses, and want to decide if the data was
generated by a fair coin, 𝜃 = 0.5, or a potentially biased coin, where 𝜃 in
[0, 1]. Denote the fair coin model by 𝑀0 and the biased coin model by 𝑀1.

 The marginal likelihood under $M_0$ is simply

$p(\mathcal{D} \mid M_0) = \left(\frac{1}{2}\right)^N$

where $N$ is the number of coin tosses.

 The marginal likelihood under $M_1$, using a Beta prior, is

$p(\mathcal{D} \mid M_1) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid M_1)\, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$
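A small sketch of these two closed-form marginal likelihoods (not the coinsModelSelDemo code), evaluating the log Bayes factor for N = 5 tosses with α1 = α0 = 1 as on the next slide.

```python
import numpy as np
from scipy.special import betaln

def log_marglik_fair(N):
    return N * np.log(0.5)                               # p(D | M0) = (1/2)^N

def log_marglik_biased(N1, N0, a1=1.0, a0=1.0):
    return betaln(a1 + N1, a0 + N0) - betaln(a1, a0)     # B(a1+N1, a0+N0) / B(a1, a0)

N = 5
for N1 in range(N + 1):
    log_bf = log_marglik_biased(N1, N - N1) - log_marglik_fair(N)
    print(N1, log_bf)    # negative for N1 = 2, 3 (favoring M0), positive at the extremes
```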


Example: Testing if a Coin is Fair
 We plot log 𝑝(𝒟|𝑀0) and log 𝑝(𝒟|𝑀1) vs the number of heads 𝑁1 with 𝑁 = 5 and
𝑎1 = 𝑎0 = 1.
 If we observe 2 or 3 heads, the unbiased coin hypothesis 𝑀0 is more likely than M1
since 𝑀0 is a simpler model - it would be a suspicious coincidence if the coin were
biased but happened to produce almost exactly 50/50 heads/tails.
 However, as the counts become more extreme, we favor the biased coin hypothesis.
Note that, if we plot the log Bayes factor, 𝑙𝑜𝑔𝐵𝐹10 it will have exactly the same
shape, since 𝑙𝑜𝑔𝑝(𝒟|𝑀0) is a constant.
Figure: left, the marginal likelihood for the Beta-Bernoulli model, ∫ p(D|θ) Beta(θ|1,1) dθ, versus the number of heads; right, the Bayes factor BF(1,0). Run coinsModelSelDemo from Kevin Murphy's PMTK.


Example: Testing if a Coin is Fair
 Log marginal likelihood for coins example and the BIC approximation to
log 𝑝(𝒟|𝑀1) for our biased coin example.
 The curve has approximately the same shape as the exact log marginal
likelihood.
 It favors the simpler model unless the data is overwhelmingly in support of
the more complex model.
Figure: left, the log marginal likelihood log10 p(D|M1) versus the number of heads; right, the BIC approximation to log10 p(D|M1). Run coinsModelSelDemo from Kevin Murphy's PMTK.


Jeffreys-Lindley Paradox
 Define the marginal density of $\theta$ as $p(\theta) = p(\theta \mid M_0)\, p(M_0) + p(\theta \mid M_1)\, p(M_1)$, where we consider the hypotheses $M_0: \theta \in \Theta_0$ vs $M_1: \theta \in \Theta_1$.

 We can estimate the posterior as (denote $p(M_0) = \pi_0$, $p(M_1) = 1 - \pi_0$):

$p(M_0 \mid \mathcal{D}) = \frac{p(M_0)\, p(\mathcal{D} \mid M_0)}{p(M_0)\, p(\mathcal{D} \mid M_0) + p(M_1)\, p(\mathcal{D} \mid M_1)} = \frac{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_0)\, d\theta}{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_0)\, d\theta + (1 - \pi_0) \int_{\Theta_1} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_1)\, d\theta}$

 Let us now assume that the priors are improper: $p(\theta \mid M_0) \propto c_0$, $p(\theta \mid M_1) \propto c_1$. The posterior becomes

$p(M_0 \mid \mathcal{D}) = \frac{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, d\theta}{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, d\theta + (1 - \pi_0)\, (c_1 / c_0) \int_{\Theta_1} p(\mathcal{D} \mid \theta)\, d\theta}$

which is determined by the arbitrary ratio $c_0/c_1$ (i.e., it can be anything we want!).

 Using proper but very vague priors causes a similar problem.

 The Bayes factor then favors the simpler model: the probability of the observed data under a complex model with a diffuse prior is very low.

 Jeffreys-Lindley paradox → use proper priors for model selection. If $M_0$ and $M_1$ share the same prior over a subset of $\theta$, this part of the prior can be improper, since $c_0/c_1$ will cancel out.
Bayesian Occam’s Razor
 To further understand the Bayesian Occam's razor effect, note that probabilities must sum to one over all possible data sets:

$\sum_{\mathcal{D}'} p(\mathcal{D}' \mid m) = 1$

(Figure from Bayesian Methods for Machine Learning, ICML Tutorial, 2004, Zoubin Ghahramani; the observed data set is marked $\mathcal{D}_0$.)

 Model 1 is too simple and assigns low probability to $\mathcal{D}_0$.

 Model 3 also assigns $\mathcal{D}_0$ relatively low probability, because it can predict many data sets, and hence it spreads its probability quite widely and thinly.

 Model 2 is "just right": it predicts the observed data with a reasonable degree of confidence, but does not predict too many other things. Hence model 2 is the most probable model.


Bayesian Occam’s Razor

For any model $M$:

$\sum_{\text{all } d \in \mathcal{D}} p(\mathcal{D} = d \mid M) = 1$

 The law of conservation of belief states that models that explain many possible data sets must necessarily assign each of them a low probability.

 A note on the evidence and Bayesian Occam's razor, I. Murray and Z. Ghahramani (2005), Gatsby Unit Technical Report GCNU-TR 2005-003
 Occam's Razor, C. Rasmussen and Z. Ghahramani, in T. K. Leen, T. G. Dietterich and V. Tresp (eds), Neural Information Processing Systems 13, pp. 294-300, 2001, MIT Press
Bayesian Occam’s Razor
Bayesian Methods for
Machine Learning, ICML
Tutorial, 2004,
Zoubin Ghahramani

 $M_1$: the too-simple model is unlikely to generate this data.

 $M_3$: the too-complex model spreads its probability over a lot of data sets; it is a little better, but still unlikely to have generated our data.

 $M_2$: the "just right" model has the highest marginal likelihood.
Bayesian Occam’s Razor
 Polynomials of degrees 1, 2, 3 fit to 𝑁 = 5 data points using empirical Bayes. Solid green curve is the
true function, Dashed red curve is the prediction (dotted blue lines represent 𝜎 around the mean). The
posterior over models 𝑝(𝑚|𝒟) is also shown using a Gaussian prior 𝑝(𝑚).
Figure: polynomial fits for d = 1 (logev = -18.899), d = 2 (logev = -20.486) and d = 3 (logev = -21.777), together with the posterior P(M|D) over the three models (N = 5, method = EB). Run linregEbModelSelVsN from Kevin Murphy's PMTK.
Bayesian Occam’s Razor
 Polynomials of degrees 1, 2, 3 fit to 𝑁 = 30 data points using empirical Bayes. Solid green curve is the
true function, Dashed red curve is the prediction (dotted blue lines represent 𝜎 around the mean). The
posterior over models 𝑝(𝑚|𝒟) is also shown using a Gaussian prior 𝑝(𝑚).

Figure: polynomial fits for d = 1 (logev = -106.337), d = 2 (logev = -103.490) and d = 3 (logev = -108.181), together with the posterior P(M|D) over the three models (N = 30, method = EB). Run linregEbModelSelVsN from Kevin Murphy's PMTK.
Marginal Likelihood (Evidence)
 Let us review once more on how we can compute the evidence for conjugate prior
models.

 When discussing parameter inference for a fixed model, we often write

$p(\boldsymbol{\theta} \mid \mathcal{D}, m) \propto p(\boldsymbol{\theta} \mid m)\, p(\mathcal{D} \mid \boldsymbol{\theta}, m)$

 We thus ignore the normalization constant $p(\mathcal{D} \mid m)$. This is valid since $p(\mathcal{D} \mid m)$ is constant with respect to $\boldsymbol{\theta}$.

 However, when comparing models, we need to know how to compute the marginal likelihood, $p(\mathcal{D} \mid m)$.

 In general, this can be quite hard, since we have to integrate over all possible parameter values, but when we have a conjugate prior, it is easy to compute:

$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$


Marginal Likelihood - Evidence
 Let 𝑝(𝜃) = 𝑞(𝜃)/𝑍0 be our prior, where 𝑞(𝜃) is an unnormalized distribution, and 𝑍0 is the
normalization constant of the prior.

 Let 𝑝(𝒟|𝜃) = 𝑞(𝒟|𝜃)/𝑍𝑙 be the likelihood, where 𝑍𝑙 contains any constant factors in the
likelihood.

 Let 𝑝(𝜃|𝒟) = 𝑞(𝜃|𝒟)/𝑍𝑁 be our posterior, where 𝑞(𝜃|𝒟) = 𝑞(𝒟|𝜃)𝑞(𝜃) is the unnormalized
posterior, and 𝑍𝑁 is the normalization constant of the posterior.

p q  p  D | q  q q | D  q q  q  D | q  ZN
p q | D      p D  
p D  ZN Z0 Zl p  D  Z0 Zl
 So assuming the relevant normalization constants are tractable, we have an easy way to
compute the marginal likelihood.

 Several examples are presented next.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 47


Beta-Binomial Model
 Let us apply the above result to the Beta-binomial model. Since we know $p(\theta \mid \mathcal{D}) = \mathcal{Beta}(\theta \mid a', b')$, where $a' = a + N_1$ and $b' = b + N_0$, we know the normalization constant of the posterior is $B(a', b')$. Hence

$p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{p(\mathcal{D})} = \frac{1}{p(\mathcal{D})} \left[ \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1} \right] \left[ \binom{N}{N_1}\, \theta^{N_1} (1 - \theta)^{N_0} \right]$

$\Rightarrow\; \frac{1}{B(a + N_1, b + N_0)} = \frac{1}{p(\mathcal{D})} \binom{N}{N_1} \frac{1}{B(a, b)} \;\Rightarrow\; p(\mathcal{D}) = \binom{N}{N_1} \frac{B(a + N_1, b + N_0)}{B(a, b)}$

 The marginal likelihood for the Beta-Bernoulli model is the same as above, but without the $\binom{N}{N_1}$ term.
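A small sketch of this closed-form evidence using log-Beta and log-Gamma functions for numerical stability (the counts used here are hypothetical):

```python
from scipy.special import betaln, gammaln

def log_evidence_beta_binomial(N1, N0, a, b):
    """log p(D) = log C(N, N1) + log B(a + N1, b + N0) - log B(a, b)."""
    N = N1 + N0
    log_binom = gammaln(N + 1) - gammaln(N1 + 1) - gammaln(N0 + 1)
    return log_binom + betaln(a + N1, b + N0) - betaln(a, b)

# Beta-Bernoulli version: drop log_binom.  Hypothetical counts with a uniform prior:
print(log_evidence_beta_binomial(47, 53, 1, 1))
```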
Dirichlet-Multinoulli Model
 One can show that the marginal likelihood for the Dirichlet-multinoulli model is given by

$p(\mathcal{D}) = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}, \qquad B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}$

 Hence, we can rewrite the above result in the following form, which is more often used:

$p(\mathcal{D}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(N + \sum_k \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}$
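A small sketch of this formula using log-Gamma functions (the counts and the uniform Dirichlet prior are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_dirichlet_multinoulli(counts, alpha):
    """log p(D) for counts N_k under a Dirichlet(alpha) prior, via log-Gamma functions."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

print(log_evidence_dirichlet_multinoulli([10, 5, 2], [1.0, 1.0, 1.0]))
```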


Gaussian-Gaussian-Wishart Model
 Consider the case of an MVN with a conjugate NIW prior. Let $Z_0$ be the normalizer for the prior, $Z_N$ the normalizer for the posterior, and $Z_l = (2\pi)^{ND/2}$ the normalizer for the likelihood. Then it is easy to see that

$p(\mathcal{D}) = \frac{Z_N}{Z_0\, Z_l} = \frac{1}{(2\pi)^{ND/2}} \cdot \frac{\left(\frac{2\pi}{\kappa_N}\right)^{D/2} |\mathbf{S}_N|^{-\nu_N/2}\, 2^{\nu_N D/2}\, \Gamma_D(\nu_N/2)}{\left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\mathbf{S}_0|^{-\nu_0/2}\, 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2)} = \frac{1}{\pi^{ND/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \frac{|\mathbf{S}_0|^{\nu_0/2}}{|\mathbf{S}_N|^{\nu_N/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}$

 The following expression for the NIW density and its normalization constant will prove useful later on:

$NIW(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{m}_0, \kappa_0, \mathbf{S}_0, \nu_0) = \mathcal{N}\!\left(\boldsymbol{\mu} \,\Big|\, \mathbf{m}_0, \tfrac{1}{\kappa_0}\boldsymbol{\Sigma}\right) \times \mathrm{InvWish}(\boldsymbol{\Sigma} \mid \mathbf{S}_0, \nu_0)$

$= \frac{1}{Z_{NIW}}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\!\left(-\frac{1}{2} \mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1} \mathbf{S}_0\right)\right)$

$= \frac{1}{Z_{NIW}} \exp\!\left(-\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0) - \frac{1}{2} \mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1} \mathbf{S}_0\right)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}$

$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2) \left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\mathbf{S}_0|^{-\nu_0/2}, \qquad \Gamma_D = \text{multivariate Gamma function}$
Laplace Approximation
 The Laplace approximation allows a Gaussian approximation of the parameter
posterior about the maximum a posteriori (MAP) parameter estimate.
 Consider a data set 𝒟 and 𝑀 models ℳ𝑖, 𝑖 = 1, . . , 𝑀 with corresponding parameters
𝜽𝑖 , 𝑖 = 1, … 𝑀. We compare models using the posteriors:
$p(\mathcal{M} \mid \mathcal{D}) \propto p(\mathcal{M})\, p(\mathcal{D} \mid \mathcal{M})$

 For large data sets $\mathcal{D}$ (relative to the number of model parameters), the parameter posterior is approximately Gaussian around the MAP estimate $\boldsymbol{\theta}_m^{MAP}$ (use a 2nd order Taylor expansion of the log-posterior):

$p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m) \approx (2\pi)^{-d/2}\, |\mathbf{A}|^{1/2} \exp\!\left(-\frac{1}{2} (\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP})^T \mathbf{A}\, (\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP})\right)$

$A_{ij} = -\left.\frac{\partial^2 \log p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m)}{\partial \theta_{mi}\, \partial \theta_{mj}}\right|_{\boldsymbol{\theta}_m^{MAP}}$


Laplace Approximation to Model Evidence
 We can write the model evidence as

$p(\mathcal{D} \mid \mathcal{M}_m) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}_m, \mathcal{M}_m)\, p(\boldsymbol{\theta}_m \mid \mathcal{M}_m)}{p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m)}$

 Using the Laplace approximation for the posterior of the parameters and evaluating the equation above at $\boldsymbol{\theta}_m^{MAP}$ (where the quadratic term vanishes):

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP} \mid \mathcal{M}_m) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{A}|$

 This Laplace approximation is often used for model comparison.

 Other approximations are also very useful:

• Bayesian Information Criterion (BIC) (in the limit of large $N$)
• MCMC (sampling approach)
• Variational methods
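A minimal numerical sketch of the Laplace approximation above (not the PMTK code); `neg_log_joint` is a hypothetical user-supplied function returning the negative of log p(D|θ, M) + log p(θ|M) for a 1-D parameter array, and the Hessian is obtained by finite differences rather than analytically.

```python
import numpy as np
from scipy.optimize import minimize

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at point x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def laplace_log_evidence(neg_log_joint, theta_init):
    """log p(D|M) ~ log p(D, theta_MAP | M) + (d/2) log(2 pi) - (1/2) log |A|."""
    res = minimize(neg_log_joint, theta_init)          # theta_MAP minimizes the negative log joint
    A = numerical_hessian(neg_log_joint, res.x)        # A = -Hessian of the log posterior at the MAP
    d = np.atleast_1d(res.x).size
    _, logdet_A = np.linalg.slogdet(np.atleast_2d(A))
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_A
```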
Bayesian Information Criterion
 Start with the Laplace approximation for large data sets $N$:

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP} \mid \mathcal{M}_m) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{A}|$

 As $N$ grows, $\mathbf{A}$ grows as $N \mathbf{A}_0$ for some fixed matrix $\mathbf{A}_0$, thus

$\log |\mathbf{A}| \approx \log |N \mathbf{A}_0| = \log\!\left(N^d |\mathbf{A}_0|\right) = d \log N + \log |\mathbf{A}_0| \;\xrightarrow{N \to \infty}\; d \log N$

 Then the Laplace approximation is simplified as:

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) - \frac{d}{2} \log N \quad (\text{as } N \to \infty)$

 Note interesting properties of (the easy to compute) BIC:
• No dependence on the prior
• One can use the MLE rather than the MAP estimate (but use MAP when working with mixtures of Gaussians)
• If not all parameters are well determined from the data, $d$ = number of effective parameters

 Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6(2), 461-464.
BIC Approximation to Log Marginal Likelihood
 The Bayesian information criterion or BIC thus has the following form:

$\mathrm{BIC}: \quad \log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)}{2} \log N \quad (\text{as } N \to \infty)$

 $\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)$ is the number of degrees of freedom in the model, and $\hat{\boldsymbol{\theta}}_m$ is the MLE for the model. We see that this has the form of a penalized log likelihood, where the penalty term depends on the model complexity.


BIC for Linear Regression
 As an example, consider linear regression with the model $p(y \mid \mathbf{x}) = \mathcal{N}(y \mid \boldsymbol{\theta}^T \mathbf{x}, \sigma^2)$. The MLE, log likelihood and BIC are:

$\text{MLE:}\quad \hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{\boldsymbol{\theta}}^T \mathbf{x}_i\right)^2$

$\log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) = -\frac{N}{2} \log\!\left(2\pi \hat{\sigma}^2\right) - \frac{1}{2 \hat{\sigma}^2} \sum_i \left(y_i - \hat{\boldsymbol{\theta}}^T \mathbf{x}_i\right)^2 = -\frac{N}{2} \log\!\left(2\pi \hat{\sigma}^2\right) - \frac{N}{2}$

$\mathrm{BIC} = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) - \frac{D}{2} \log N$

 $D$ is the number of variables in the model.


BIC for Linear Regression
 Hence the BIC score is as follows (dropping constant terms):

$\mathrm{BIC} = -\frac{N}{2} \log \hat{\sigma}^2 - \frac{D}{2} \log N$

 $D$ is the number of variables in the model. In the statistics literature, it is common to use an alternative definition of BIC, which we call the BIC cost (since we want to minimize it):

$\mathrm{BIC\text{-}cost} = -2 \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) + \mathrm{dof}(\hat{\boldsymbol{\theta}}_m)\, \log N \;\approx\; -2 \log p(\mathcal{D} \mid \mathcal{M}_m)$

 In the context of the regression example, this becomes:

$\mathrm{BIC\text{-}cost} = N \log \hat{\sigma}^2 + N + D \log N$

 The BIC method is related to the minimum description length or MDL principle. It characterizes the score of how well the model fits the data, minus how complex the model is.
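A small sketch of the BIC cost for linear regression, following the formulas above (the design matrix X and targets y are assumed to be given as NumPy arrays; the function name is illustrative):

```python
import numpy as np

def bic_cost_linreg(X, y):
    """BIC cost = -2 log p(D | theta_hat) + D log N for the Gaussian linear model above."""
    N, D = X.shape
    theta = np.linalg.lstsq(X, y, rcond=None)[0]     # MLE of the weights
    sigma2 = np.mean((y - X @ theta) ** 2)           # MLE of the noise variance
    log_lik = -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * N
    return -2.0 * log_lik + D * np.log(N)            # lower cost = preferred model

# usage: compare design matrices with different numbers of features by their BIC cost
```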
Akaike Information Criterion
 There is a very similar expression to BIC/MDL called the Akaike information criterion or AIC, defined as

$\mathrm{AIC}(m, \mathcal{D}) = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \mathrm{dof}(\hat{\boldsymbol{\theta}}_m)$

 This is derived from a frequentist framework, and cannot be interpreted as an approximation to the marginal likelihood.

 The penalty for AIC is less than for BIC:

$\mathrm{BIC} = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)}{2} \log N \;\approx\; \log p(\mathcal{D} \mid \mathcal{M}_m)$

 This causes AIC to pick more complex models. However, this can sometimes result in better predictive accuracy!

 Clarke, B., E. Fokoue, and H. H. Zhang (2009). Principles and Theory for Data Mining and Machine Learning. Springer.


Effect of the Prior on Marginal Likelihood
 When performing posterior inference, the prior may not matter much since the
likelihood often overwhelms the prior.
 But when computing the marginal likelihood, the prior plays a much more
important role, since we are averaging the likelihood over all possible
parameter settings, as weighted by the prior.
 If the prior is unknown, the correct Bayesian procedure is to put a prior on the prior. That is, we should put a prior on $\boldsymbol{\theta}$ as well as on the hyper-parameter $\boldsymbol{\alpha}$. To compute the marginal likelihood, we should integrate out all unknowns, i.e., we should compute

$p(\mathcal{D} \mid m) = \iint p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, m)\, p(\boldsymbol{\alpha} \mid m)\, d\boldsymbol{\theta}\, d\boldsymbol{\alpha}$


Empirical Bayes
 This requires specifying the hyper-prior.

 Fortunately, the higher up we go in the Bayesian hierarchy, the less sensitive the results are to the prior settings. Thus we can usually make the hyper-prior uninformative.

 A computational shortcut is to optimize $\boldsymbol{\alpha}$ rather than integrating it out:

$p(\mathcal{D} \mid m) \approx \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \hat{\boldsymbol{\alpha}}, m)\, d\boldsymbol{\theta},$

where

$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} p(\mathcal{D} \mid \boldsymbol{\alpha}, m) = \arg\max_{\boldsymbol{\alpha}} \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, m)\, d\boldsymbol{\theta}$

 This approach is called empirical Bayes (EB).
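A minimal sketch of empirical Bayes for a Beta-Bernoulli model: the hyperparameter α of a symmetric Beta(α, α) prior is chosen by maximizing the closed-form marginal likelihood over a grid. The counts and the grid are hypothetical choices for illustration.

```python
import numpy as np
from scipy.special import betaln

def log_marglik(alpha, N1, N0):
    """Closed-form log p(D | alpha) for a Beta(alpha, alpha) prior on a Bernoulli rate."""
    return betaln(alpha + N1, alpha + N0) - betaln(alpha, alpha)

N1, N0 = 2, 8                               # hypothetical coin-toss counts
grid = np.logspace(-2, 2, 400)
alpha_hat = grid[np.argmax([log_marglik(a, N1, N0) for a in grid])]
print(alpha_hat)                            # the type-II MLE of the hyperparameter
```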
