
Summarizing Posterior

Distributions &
Bayesian Model Selection
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 15, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 1


Contents
 Summarizing Posterior Distributions – An Introduction, MAP Estimation, Drawbacks of MAP Estimation,
MAP Estimation and Reparametrization
 Credible Intervals, HPD Intervals, Bayesian Inference for a Difference in Proportions
 Model Selection and Cross Validation, AIC Information Criterion, Posterior over Models, Model Evidence,
Bayesian Occam’s Razor, Bayesian Model Comparison, Bayes Factors and Jeffreys Scale of Evidence,
Examples, Jeffreys-Lindley Paradox
 Back to Bayesian Occam’s Razor, Marginal Likelihood, Evidence Approximation, Laplace Approximation,
Bayesian Information Criterion, Akaike Information Criterion, Effect of the Prior on Marginal Likelihood,
Empirical Bayes

• Following closely Chris Bishop's PRML book, Chapters 1 & 2
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 5
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003
• J.-M. Marin and C. P. Robert, The Bayesian Core, Springer Verlag, 2007 (online resource)
• Bayesian Statistics for Engineering, Online Course at Georgia Tech, B. Vidakovic


Goals
 The goals for today’s lecture include:

 Understand the drawbacks of MAP estimators

 Learn how to compute HPD intervals

 Learn about the Bayesian and Akaike Information Criteria (BIC and AIC)

 Familiarize ourselves with criteria for Bayesian model selection and comparison, Bayes' factors and Occam's Razor

 Learn how to compute the model evidence

 Understand how to perform a Laplace approximation to the posterior distribution

 Understand the effects of the prior on model evidence



Introduction
 We assume that we computed the posterior of unknown parameters from data.
 Using the posterior distribution 𝑝(𝜽|𝒟) to summarize everything we know about these
variables is at the core of Bayesian statistics.
 We discuss here this approach to statistics in more detail. In particular, we emphasize that
point estimates are not the best approach.
 We discuss next some simple quantities that can be derived from 𝑝(𝜽|𝒟), such as
 the posterior mean,
 the MAP estimate,
 the median, etc.
 These summary statistics (point estimates) are often easier to understand and visualize than
the full posterior distribution.



MAP Estimation
 Typically the posterior mean or median is the most appropriate choice for a
real-valued quantity, and the vector of posterior marginals is the best choice for
a discrete quantity.

 However, the posterior mode, aka the MAP estimate, is the most popular
choice because it reduces to an optimization problem, for which efficient
algorithms often exist.

 There are various drawbacks to MAP estimation, which we discuss below.

 This will provide motivation for a more thoroughly Bayesian approach.



Drawbacks of MAP Estimation
 The most obvious drawback of MAP estimation (and other point estimates
such as the posterior mean or median) is that it does not provide any measure
of uncertainty.

 In many applications, it is important to know how much one can trust a given
estimate.

 We will derive such confidence measures from the posterior.



Plugging-in the MAP Estimate Can Overfit
 In machine learning, we care more about predictive accuracy than about interpreting the parameters of our models.

 However, if we don't model the uncertainty in our parameters, then our predictive distribution will be overconfident.

 Overconfidence in predictions is particularly problematic in situations where we don't want to take any risk.


The Mode is an Untypical Estimate
 The mode is usually quite untypical of the distribution, unlike the mean or median (left Fig.)

 The mean/median take the whole probability mass into account.

 In the example on the right, the mode is 0, but the mean is non-zero. Such skewed distributions arise when inferring variance parameters, especially in hierarchical models. In such cases the MAP estimate is a very poor estimate.

Figure: left, a bimodal density where the mode is very untypical of the distribution; the mean is a better summary since it is near the majority of the probability mass. Right, a skewed (Gamma) distribution in which the mode is quite different from the mean. Run bimodalDemo and gammaPlotDemo from Kevin Murphy's PMTK.
Decision Theory and Loss Functions
 How should we summarize a posterior if the mode is not a good choice?

 The answer is to use decision theory. The basic idea is to specify a loss function $L(\theta, \hat{\theta})$, which is the loss you incur if the truth is $\theta$ and your estimate is $\hat{\theta}$.

 If we use the 0-1 loss, $L(\theta, \hat{\theta}) = \mathbb{I}(\theta \neq \hat{\theta})$, the optimal estimate is the posterior mode. Under 0-1 loss you pay the same penalty for any error, regardless of its size.

 For continuous-valued quantities, we often prefer the squared error loss, $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$; the optimal estimator is then the posterior mean.

 Or we can use a more robust loss function, $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$, which gives rise to the posterior median.
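To make the three losses concrete, here is a minimal Python sketch (not from the slides; the Gamma "posterior" and the helper names are illustrative choices only) that computes the point estimates optimal under each loss from posterior samples.

```python
import numpy as np

# Hypothetical skewed "posterior": Gamma samples standing in for draws from p(theta | D)
rng = np.random.default_rng(0)
samples = rng.gamma(shape=1.5, scale=1.0, size=100_000)

post_mean = samples.mean()            # optimal under squared error loss
post_median = np.median(samples)      # optimal under absolute (L1) loss
# crude histogram-based mode estimate (the optimum under 0-1 loss)
counts, edges = np.histogram(samples, bins=200)
i = np.argmax(counts)
post_mode = 0.5 * (edges[i] + edges[i + 1])

print(f"mode={post_mode:.3f}, median={post_median:.3f}, mean={post_mean:.3f}")
```

For a skewed distribution like this one, the three summaries differ noticeably, which is exactly why the choice of loss matters.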


MAP Estimation and Reparametrization
 In MAP estimation the result we get depends on the parametrization of the probability
distribution. Changing from one representation to another equivalent representation changes
the result.

 Suppose we compute the posterior for $x$. If we define $y = f(x)$, the distribution of $y$ is given by

$p_y(y) = p_x(x) \left| \frac{dx}{dy} \right|$

 The Jacobian factor $|dx/dy|$ measures the change in size of a unit volume passed through $f$.

 If $\hat{x} = \arg\max_x p_x(x)$ and $\hat{y} = \arg\max_y p_y(y)$, then in general $\hat{y} \neq f(\hat{x})$.

 For example, let $x \sim \mathcal{N}(6, 1)$ and $y = f(x) = 1/(1 + \exp(-x + 5))$. We can derive the distribution of $y$ using Monte Carlo; see the sketch below.
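The following is a minimal Monte Carlo sketch of this example (it mirrors the idea of the bayesChangeOfVar PMTK demo but is not that code; the histogram-based mode estimate is a simplification).

```python
import numpy as np

# x ~ N(6, 1), y = 1 / (1 + exp(-x + 5)): compare f(mode of x) with the mode of y
rng = np.random.default_rng(0)
x = rng.normal(6.0, 1.0, size=1_000_000)
y = 1.0 / (1.0 + np.exp(-x + 5.0))

def hist_mode(s, bins=200):
    """Crude histogram-based mode estimate of a sample."""
    counts, edges = np.histogram(s, bins=bins)
    i = np.argmax(counts)
    return 0.5 * (edges[i] + edges[i + 1])

y_of_x_mode = 1.0 / (1.0 + np.exp(-hist_mode(x) + 5.0))   # transform of the mode of x
y_mode = hist_mode(y)                                      # mode of the transformed variable
print(y_of_x_mode, y_mode)                                 # these differ, as the next figure shows
```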


MAP Estimation and Reparametrization
 Transformation of a density under a nonlinear transform. Note how the mode of the
transformed distribution is not the transform of the original mode.
Figure: the original density p_X, the transformed density p_Y, and the transformation g; the mode of p_Y is not the image under g of the mode of p_X. Run bayesChangeOfVar from Kevin Murphy's PMTK.
 The MLE does not suffer from this issue (the likelihood is a function not a probability density).

Bayesian inference also does not suffer from this since the change of measure is taken into
account when integrating in the parameter space.
MAP Estimation and Reparametrization
 Consider a Bernoulli likelihood and a uniform prior:

$p(y = 1 \mid \theta) = \theta, \quad y \in \{0, 1\}$
$p(\theta) = 1, \quad 0 \le \theta \le 1$

 Without data, the MAP estimate is the mode of the prior, which is anywhere on the interval [0, 1].

 Case 1: Consider the parametrization $q = \sqrt{\theta}$, so that

$\theta = q^2, \quad d\theta/dq = 2q \;\Rightarrow\; p_q(q) = p(\theta)\,|d\theta/dq| = 2q$

 The MAP estimate is $\hat{\theta} = \hat{q}^2 = 1$.

 Case 2: Consider the parametrization $q = 1 - \sqrt{1 - \theta}$, so that

$\theta = 1 - (1 - q)^2, \quad d\theta/dq = 2(1 - q) \;\Rightarrow\; p_q(q) = 2(1 - q)$

 The MAP estimate is $\hat{\theta} = 1 - (1 - \hat{q})^2 = 0$.
MAP Estimation and Reparametrization
 Using the Fisher information matrix $I(\boldsymbol{\theta})$ associated with the likelihood $p(\mathbf{x} \mid \boldsymbol{\theta})$, a solution to the problem is to optimize the following objective function:

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, |I(\boldsymbol{\theta})|^{-1/2}$

 This estimate is parameterization independent.

 The optimization problem above is difficult to implement in practice.

 Druilhet, P. and J.-M. Marin (2007). Invariant HPD credible sets and MAP estimators. Bayesian
Analysis 2(4), 681–692.
 Jermyn, I. (2005). Invariant Bayesian estimation on manifolds. Annals of Statistics 33(2), 583–605.



Credible (Central) Interval
 Credible Interval: A standard measure of confidence in some (scalar) quantity 𝜃 is the "width" of its
posterior distribution. This can be measured using a 100(1 − 𝛼) % credible interval, which is a
(contiguous) region 𝐶 = (ℓ, 𝑢) (standing for lower and upper) which contains 1 − 𝛼 of the posterior
probability mass, i.e.,

C (D )  ( , u ) : p(  q  u | D )  1  

 There may be many such intervals, so we choose one such that there is 𝛼/2 mass in each tail; this is
called a central interval.

 Some examples (for 𝛼 = 0.05):


Run quantileDemo
 Gaussian distribution: from Kevin Murphys’ PMTK

For p (q | D)  N (0,1), ( , u )    1 ( / 2),  1 (1   / 2)   ( 1.96, 1.96)


For p (q | D)  N (  ,  2 ), ( , u )    2
 Beta prior in a coin example. The posterior is ℬℯ𝓉𝒶. Run betaCreditbleInt
from Kevin Murphys’ PMTK

For p (q | D)  Beta (48,54) (47 H in 100 trials), ( , u )   0.3749, 0.5673 


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 14
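These central intervals can be reproduced with a few lines of SciPy; this is a small sketch using the quantile (ppf) functions, not the PMTK demos.

```python
from scipy.stats import beta, norm

alpha = 0.05
# N(0, 1) posterior: central 95% interval
print(norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))                   # approx (-1.96, 1.96)
# Beta(48, 54) posterior (47 heads in 100 trials, uniform prior)
print(beta.ppf(alpha / 2, 48, 54), beta.ppf(1 - alpha / 2, 48, 54))   # approx (0.3749, 0.5673)
```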
Credible Intervals
 If we don't know the functional form, but we can draw samples from the posterior, then we can use a Monte Carlo approximation to the posterior quantiles.

 We simply sort the $S$ samples and find the ones that occur at positions $S\alpha/2$ and $S(1 - \alpha/2)$ along the sorted list. As $S \to \infty$, this converges to the true quantiles.

Run mcQuantileDemo from Kevin Murphy's PMTK
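A minimal sketch of this sample-sorting approximation (not the mcQuantileDemo code), using Beta(48, 54) draws as an example posterior.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
S, alpha = 100_000, 0.05
samples = np.sort(beta.rvs(48, 54, size=S, random_state=rng))   # draws from the posterior
l = samples[int(S * alpha / 2)]          # sample at position S*alpha/2 of the sorted list
u = samples[int(S * (1 - alpha / 2))]    # sample at position S*(1 - alpha/2)
print(l, u)                              # converges to the exact quantiles as S grows
```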


Credible Vs. Confidence Intervals
 Bayesian credible intervals and frequentist confidence intervals are not the same thing.

 In general, credible intervals are usually what people want to compute, but
confidence intervals are usually what they actually compute!

 Fortunately, the mechanics of computing a credible interval are just as easy as computing a confidence interval.


Highest Posterior Density Regions
 In central intervals there might be points outside the CI which have higher probability density. This is
illustrated in the Figure, where we see that points outside the left-most CI boundary have higher density
than those just inside the right-most CI boundary.

 This motivates the highest posterior density or HPD region. This is defined as the set of most probable points that in total constitute 100(1 − α)% of the probability mass. More formally, we find the threshold $p^*$ on the pdf such that

$1 - \alpha = \int_{\{\theta : \, p(\theta \mid \mathcal{D}) > p^*\}} p(\theta \mid \mathcal{D}) \, d\theta$

 We then define the HPD as

$C_\alpha(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \ge p^*\}$

Figure: central interval for a Beta(3,9) posterior. Run betaHPD from Kevin Murphy's PMTK.


Highest Posterior Density Regions
 The HPD region is sometimes called a highest density interval or HDI. The figure shows the 95% HDI of a Beta(3, 9) distribution, which is (0.04, 0.48).

 We see that this is narrower than the CI, even though it still contains 95% of the mass. Also, every point inside of it has higher density than every point outside of it.

Figure: left, central interval for a Beta(3,9) posterior; right, HPD interval for the same posterior. Run betaHPD from Kevin Murphy's PMTK.

 The HPD region for a unimodal distribution has minimal width and contains 95% of the mass. It can be computed by optimization using the inverse CDF, as in the sketch below.
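A minimal sketch of the optimization-plus-inverse-CDF computation (not the betaHPD code); it assumes a unimodal posterior available as a SciPy frozen distribution, and the function name is illustrative.

```python
from scipy.stats import beta
from scipy.optimize import minimize_scalar

def hpd_interval(dist, alpha=0.05):
    """HPD interval of a unimodal distribution via its inverse CDF (ppf)."""
    def width(p):                        # width of the interval covering mass 1 - alpha
        return dist.ppf(p + (1 - alpha)) - dist.ppf(p)
    p = minimize_scalar(width, bounds=(0.0, alpha), method="bounded").x
    return dist.ppf(p), dist.ppf(p + (1 - alpha))

print(hpd_interval(beta(3, 9)))          # roughly (0.04, 0.48), narrower than the central interval
```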
Highest Posterior Density Regions
Figure: a central interval, which leaves α/2 mass in each tail, versus an HPD interval, which cuts the density at a threshold p_MIN.

 For a unimodal distribution, the HDI will be the narrowest interval around the mode containing
95% of the mass.

 If the posterior is multimodal, the HDI may not even be a connected region. Note that
summarizing multimodal posteriors is always difficult.

Run postDensityIntervals
from Kevin Murphys’ PMTK
Inference for a Difference in Proportions
 Often we have multiple parameters, and we are interested in computing the posterior
distribution of some function of these parameters.

 Example: suppose you are about to buy a book from Amazon.com

 Given:
 a. Seller 1 has 𝑦1 = 90 positive reviews and 10 negative reviews.
 b. Seller 2 has 𝑦2 = 2 positive reviews and 0 negative reviews.

 It seems you should pick seller 2, but we cannot be very confident that seller 2 is better since
it has had so few reviews.

 We sketch a Bayesian analysis of this problem. Similar methodology can be used to compare
rates or proportions across groups for a variety of other settings.



Inference for a Difference in Proportions
 Let 𝜃1 and 𝜃2 be the unknown reliabilities of the two sellers. We endow them both with uniform
priors, 𝜃𝑖 ~ ℬℯ𝓉𝒶(1,1).

 The posteriors are 𝑝(𝜃1 |𝒟1) = ℬℯ𝓉𝒶(91,11) and 𝑝(𝜃2 |𝒟2) = ℬℯ𝓉𝒶(3,1).

 We want to compute 𝑝(𝜃1 > 𝜃2|𝒟). For convenience, let us define 𝛿 = 𝜃1 − 𝜃2 as the
difference in the rates. We can compute the desired quantity using numerical integration:

$p(\delta > 0 \mid \mathcal{D}) = \int_0^1 \!\! \int_0^1 \mathbb{I}(\theta_1 > \theta_2)\, \mathcal{Beta}(\theta_1 \mid y_1 + 1, N_1 - y_1 + 1)\, \mathcal{Beta}(\theta_2 \mid y_2 + 1, N_2 - y_2 + 1)\, d\theta_1\, d\theta_2$

 We find 𝑝(𝛿 > 0|𝒟) = 0.710, which means you are better off buying from seller 1!

Run amazonSellerDemo
from Kevin Murphys’ PMTK



Inference for a Difference in Proportions
Figure: right, the posteriors p(θ1|data) and p(θ2|data); left, a Monte Carlo approximation to p(δ|D) with a 95% central interval. Run amazonSellerDemo from Kevin Murphy's PMTK.

 We approximate the posterior 𝑝(𝛿|𝒟) by MC sampling. 𝜃1 and 𝜃2 are independent and both have Beta
distributions, which can be sampled easily.

 𝑝(𝜃𝑖|𝒟𝑖) are shown on the right, and a MC approximation to 𝑝(𝛿|𝒟) together with a 95% central interval
on the left. An MC approximation to 𝑝(𝛿 > 0|𝒟) is obtained by counting the fraction of samples where
𝜃1 > 𝜃2 . This turns out to be 0.718, which is very close to the exact value.
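A minimal sketch of this Monte Carlo computation (not the amazonSellerDemo code):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
S = 100_000
theta1 = beta.rvs(91, 11, size=S, random_state=rng)   # seller 1: 90 positive, 10 negative
theta2 = beta.rvs(3, 1, size=S, random_state=rng)     # seller 2: 2 positive, 0 negative
delta = theta1 - theta2
print((delta > 0).mean())                     # approx 0.71, close to the exact integral
print(np.percentile(delta, [2.5, 97.5]))      # 95% central interval for delta
```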
Model Selection
 A number of complexity parameters (polynomial order, regularization parameter, etc.) need to
be selected to optimize performance/predictive capability. This is a model selection problem.

 In MLE, the performance on the training set is not a good indicator of predictive performance
due to the problem of over-fitting.

 We often use some of the available data to train a range of models (or a given model with a
range of values for its complexity parameters) and then to compare them on a validation set.
We then select the one having the best predictive performance.

 Some over-fitting to the validation data can occur, and a third test set on which the performance of the selected model is finally evaluated may be needed.

 Training Set
 Validation Set (Optimize hyperparameters)
 Test Set (Measure Performance)



Model Selection: Cross Validation

 The technique of 𝑆-fold cross-validation (here 𝑆 = 4) involves taking the


available data and partitioning it into 𝑆 groups.
 𝑆 − 1 of the groups are used to train a set of models that are then evaluated
on the remaining group. This procedure is repeated for all 𝑆 possible choices
for the held-out group and the performance scores from the 𝑆 runs are then
averaged.
 The cross-validation cost increases by a factor of 𝑆.

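A minimal sketch of S-fold cross-validation (an illustration only; `fit` and `score` are hypothetical user-supplied callables, and X, y are assumed to be NumPy arrays):

```python
import numpy as np

def cross_val_score(fit, score, X, y, S=4, seed=0):
    """Average held-out score over S folds; `fit` and `score` are user-supplied."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)
    scores = []
    for k in range(S):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        model = fit(X[train], y[train])                 # train on S-1 groups
        scores.append(score(model, X[test], y[test]))   # evaluate on the held-out group
    return float(np.mean(scores))
```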


Akaike Information Criterion
 To correct for the bias of the MLE, we use different information criteria (here $M$ = number of parameters in the model), e.g. the Akaike Information Criterion (AIC):

$\mathrm{AIC} = \ln p(\mathcal{D} \mid \mathbf{w}_{ML}) - M$

 We choose the model for which the AIC is largest.

 AIC does not account for uncertainty in the model parameters.

 It favors simple models.


Bayesian Model Selection
 In general, when faced with a set of models (i.e., families of
parametric distributions) of different complexity, how should we
choose the best one?

This is called the model selection problem.

 Examples:

 a low-order polynomial in linear regression underfits, while a high-order polynomial overfits

 a small regularization parameter λ results in overfitting, and too large a λ in underfitting
Bayesian Model Selection
 We can use CV to estimate the generalization error of all the candidate models, and then pick the model that performs best. This requires fitting each model $K$ times, where $K$ is the number of CV folds. A more efficient approach is to compute the posterior over models:

$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m' \in \mathcal{M}} p(\mathcal{D} \mid m')\, p(m')}$

 From this, we can easily compute the MAP model

$\hat{m} = \arg\max_m p(m \mid \mathcal{D})$

 This is called Bayesian model selection.


Model Evidence
 If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which maximizes the marginal likelihood:

$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$
 This quantity is called the evidence for model 𝑚.

 The details on how to perform this integral will be discussed with examples
later on.

 An intuitive interpretation of model evidence is discussed next.



Bayesian Occam’s Razor
 One might think that using 𝑝(𝒟|𝑚) to select models would always favor the model with the
most parameters.

This is true if we use $p(\mathcal{D} \mid \hat{\theta}_m)$ to select models, where $\hat{\theta}_m$ is the MLE or MAP estimate of the parameters for model $m$: models with more parameters will fit the data better, and hence achieve higher likelihood.

 However, if we integrate out the parameters, rather than maximizing them, we are
automatically protected from overfitting.

 Models with more parameters do not necessarily have higher marginal likelihood.

 This is called the Bayesian Occam’s razor effect (MacKay 1995b; Murray and Ghahramani
2005)

Occam's Razor principle: one should pick the simplest model that adequately explains the data.
Bayesian Occam’s Razor
The marginal likelihood can be rewritten as follows:

$p(\mathcal{D}) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_{1:2}) \cdots p(y_N \mid y_{1:N-1})$

where we have dropped the conditioning on 𝑚 for brevity.

 This is similar to a leave-one-out cross-validation estimate of the likelihood, since we predict


each future point given all the previous ones.

 If a model is too complex, it will overfit the early examples and will then predict the remaining
ones poorly.



Bayesian Model Comparison
 Suppose we have two models 𝑀1 and 𝑀2

 Each is associated with a set of parameters 𝜃1 and 𝜃2

 We consider priors $\pi_i(\theta_i \mid M_i)$, likelihoods $f_i(\mathbf{x} \mid \theta_i, M_i)$, and posteriors

$\pi_i(\theta_i \mid \mathbf{x}, M_i) = \frac{f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)}{\pi_i(\mathbf{x} \mid M_i)}$

 We define as the best model the one that is more probable to have generated the data $\mathbf{x}$ that we observed.


Bayesian Model Comparison
From the data we can learn the parameters for each model and then the model itself:

$\mathbf{x} \;\rightarrow\; \pi_i(\theta_i \mid \mathbf{x}, M_i) = \frac{f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)}{\pi_i(\mathbf{x} \mid M_i)} \;\rightarrow\; \pi(M_i \mid \mathbf{x}) = \frac{\pi(\mathbf{x} \mid M_i)\, \pi(M_i)}{\pi(\mathbf{x})}$

Noting that

$\pi_i(\mathbf{x} \mid M_i) = \int f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$

we can find the best model that represents the data by computing:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \frac{\pi(\mathbf{x} \mid M_1)\, \pi(M_1)}{\pi(\mathbf{x} \mid M_2)\, \pi(M_2)} = \underbrace{\frac{\int f_1(\mathbf{x} \mid \theta_1, M_1)\, \pi_1(\theta_1 \mid M_1)\, d\theta_1}{\int f_2(\mathbf{x} \mid \theta_2, M_2)\, \pi_2(\theta_2 \mid M_2)\, d\theta_2}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}}$
Bayesian Model Comparison - Example
Consider the coin flipping example.

 Let $\theta$ be the probability of getting heads.

 Consider two models:

 $M_1$, coin is fair: $\theta \mid M_1 \sim \mathcal{Beta}(100, 100)$
 $M_2$, coin is unfair: $\theta \mid M_2 \sim \mathcal{Beta}(0.5, 0.5)$

 Data: $\mathbf{x} = \{2H, 3T\}$

 Marginal likelihoods:

$\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta = \int \theta^2 (1 - \theta)^3 \, \frac{\theta^{99} (1 - \theta)^{99}}{B(100, 100)}\, d\theta \approx 0.031$

$\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta = \int \theta^2 (1 - \theta)^3 \, \frac{\theta^{-0.5} (1 - \theta)^{-0.5}}{B(0.5, 0.5)}\, d\theta \approx 0.012$

 Model comparison:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \underbrace{\frac{\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta}{\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}} \approx 2.58\, \frac{\pi(M_1)}{\pi(M_2)}$
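The two marginal likelihoods above can be reproduced by direct numerical integration; this is a small Python sketch (not from the slides, with illustrative function names).

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def marginal_likelihood(a, b, n_heads, n_tails):
    """Integrate theta^h (1-theta)^t against a Beta(a, b) prior numerically."""
    def integrand(t):
        return t**n_heads * (1 - t)**n_tails * t**(a - 1) * (1 - t)**(b - 1) / beta_fn(a, b)
    return quad(integrand, 0.0, 1.0)[0]

m1 = marginal_likelihood(100, 100, 2, 3)   # fair-coin prior Beta(100, 100): approx 0.031
m2 = marginal_likelihood(0.5, 0.5, 2, 3)   # unfair-coin prior Beta(0.5, 0.5): approx 0.012
print(m1, m2, m1 / m2)                     # the ratio is the Bayes factor in favor of M1
```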


Bayesian Model Comparison - Example
Consider the coin flipping example again.
 Let $\theta$ be the probability of getting heads.
 Two models:
 $M_1$, coin is fair: $\theta \mid M_1 \sim \mathcal{Beta}(100, 100)$
 $M_2$, coin is unfair: $\theta \mid M_2 \sim \mathcal{Beta}(0.5, 0.5)$
 Data: $\mathbf{x} = \{5H\}$

$\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta = \int \theta^5 \, \frac{\theta^{99} (1 - \theta)^{99}}{B(100, 100)}\, d\theta \approx 0.033$

$\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta = \int \theta^5 \, \frac{\theta^{-0.5} (1 - \theta)^{-0.5}}{B(0.5, 0.5)}\, d\theta \approx 0.25$

 Model comparison:

$\frac{\pi(M_1 \mid \mathbf{x})}{\pi(M_2 \mid \mathbf{x})} = \underbrace{\frac{\int f(\mathbf{x} \mid \theta, M_1)\, \pi(\theta \mid M_1)\, d\theta}{\int f(\mathbf{x} \mid \theta, M_2)\, \pi(\theta \mid M_2)\, d\theta}}_{\text{Bayes' factor}} \times \underbrace{\frac{\pi(M_1)}{\pi(M_2)}}_{\text{ratio of priors}} \approx 0.13\, \frac{\pi(M_1)}{\pi(M_2)}$

 Remark: Bayes' factors and posterior model probabilities should be used with caution when non-informative priors are applied.
Bayes Factors and Jeffreys Scale
 Suppose our prior on models is uniform, 𝑝(𝑚)~1. Then model selection is equivalent to
picking the model with the highest marginal likelihood. Now suppose we just have two models
we are considering, call them the null hypothesis, 𝑀0, and the alternative hypothesis, 𝑀1.

 Define the Bayes factor as the ratio of marginal likelihoods:

$BF_{1,0} = \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)} = \frac{p(M_1 \mid \mathcal{D})}{p(M_0 \mid \mathcal{D})} \Big/ \frac{p(M_1)}{p(M_0)}$
 If 𝐵𝐹1,0 > 1, we prefer model 1, otherwise we prefer model 0. Jeffreys proposed a scale of
evidence shown below

Bayes factor BF(1,0) Interpretation


BF< 1/100 Decisive evidence for M0
BF< 1/10 Strong evidence for M0
1/10< BF< 1/3 Moderate evidence for M0
1/3< BF < 1 Weak evidence for M0
1 < BF < 3 Weak evidence for M1
3 < BF < 10 Moderate evidence for M1
BF>10 Strong evidence for M1
BF>100 Decisive evidence for M1
Jeffrey’s Scale of Evidence
 Using the alternative reference below, Jeffrey’s scale of evidence says:

𝜋
 For log(𝐵10 ) between 0 and 0.5, the evidence against 𝐻0 is poor

 In between 0.5 and 1, it is substantial


  ( x | H1 )
 In between 1 and 2, it is strong and B10 
 (x | H0 )
 Above 2, it is decisive.

 Bayes’ factor tells us if one should prefer 𝐻0 to 𝐻1 (relative comparison of models).

 Bayes’ factor does not tell us whether any of these models is sensible

Estimation and Beyond in the Bayes Universe, Brani Vidakovic (online Course on Bayesian Stat. for Engineers)

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 36


Example: Testing if a Coin is Fair
 Suppose we observe some coin tosses, and want to decide if the data was
generated by a fair coin, 𝜃 = 0.5, or a potentially biased coin, where 𝜃 in
[0, 1]. Denote the fair coin model by 𝑀0 and the biased coin model by 𝑀1.

 The marginal likelihood under $M_0$ is simply

$p(\mathcal{D} \mid M_0) = \left(\frac{1}{2}\right)^N$

where $N$ is the number of coin tosses.

 The marginal likelihood under $M_1$, using a Beta prior, is

$p(\mathcal{D} \mid M_1) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid M_1)\, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$
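A small sketch of these two closed-form marginal likelihoods (not the coinsModelSelDemo code), evaluating the log Bayes factor for N = 5 tosses with α1 = α0 = 1 as on the next slide.

```python
import numpy as np
from scipy.special import betaln

def log_marglik_fair(N):
    return N * np.log(0.5)                               # p(D | M0) = (1/2)^N

def log_marglik_biased(N1, N0, a1=1.0, a0=1.0):
    return betaln(a1 + N1, a0 + N0) - betaln(a1, a0)     # B(a1+N1, a0+N0) / B(a1, a0)

N = 5
for N1 in range(N + 1):
    log_bf = log_marglik_biased(N1, N - N1) - log_marglik_fair(N)
    print(N1, log_bf)    # negative for N1 = 2, 3 (favoring M0), positive at the extremes
```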


Example: Testing if a Coin is Fair
 We plot log 𝑝(𝒟|𝑀0) and log 𝑝(𝒟|𝑀1) vs the number of heads 𝑁1 with 𝑁 = 5 and
𝑎1 = 𝑎0 = 1.
 If we observe 2 or 3 heads, the unbiased coin hypothesis 𝑀0 is more likely than M1
since 𝑀0 is a simpler model - it would be a suspicious coincidence if the coin were
biased but happened to produce almost exactly 50/50 heads/tails.
 However, as the counts become more extreme, we favor the biased coin hypothesis.
Note that, if we plot the log Bayes factor, 𝑙𝑜𝑔𝐵𝐹10 it will have exactly the same
shape, since 𝑙𝑜𝑔𝑝(𝒟|𝑀0) is a constant.
Figure: left, the marginal likelihood for the Beta-Bernoulli model, ∫ p(D|θ) Beta(θ|1,1) dθ, versus the number of heads; right, the Bayes factor BF(1,0). Run coinsModelSelDemo from Kevin Murphy's PMTK.


Example: Testing if a Coin is Fair
 Log marginal likelihood for coins example and the BIC approximation to
log 𝑝(𝒟|𝑀1) for our biased coin example.
 The curve has approximately the same shape as the exact log marginal
likelihood.
 It favors the simpler model unless the data is overwhelmingly in support of
the more complex model.
Figure: left, the log marginal likelihood log10 p(D|M1) versus the number of heads; right, the BIC approximation to log10 p(D|M1). Run coinsModelSelDemo from Kevin Murphy's PMTK.


Jeffreys-Lindley Paradox
 Define the marginal density of $\theta$ as $p(\theta) = p(\theta \mid M_0)\, p(M_0) + p(\theta \mid M_1)\, p(M_1)$, where we consider the hypotheses $M_0: \theta \in \Theta_0$ vs $M_1: \theta \in \Theta_1$.

 We can estimate the posterior as (denote $p(M_0) = \pi_0$, $p(M_1) = 1 - \pi_0$):

$p(M_0 \mid \mathcal{D}) = \frac{p(M_0)\, p(\mathcal{D} \mid M_0)}{p(M_0)\, p(\mathcal{D} \mid M_0) + p(M_1)\, p(\mathcal{D} \mid M_1)} = \frac{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_0)\, d\theta}{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_0)\, d\theta + (1 - \pi_0) \int_{\Theta_1} p(\mathcal{D} \mid \theta)\, p(\theta \mid M_1)\, d\theta}$

 Let us now assume that the priors are improper: $p(\theta \mid M_0) \propto c_0$, $p(\theta \mid M_1) \propto c_1$. The posterior becomes

$p(M_0 \mid \mathcal{D}) = \frac{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, d\theta}{\pi_0 \int_{\Theta_0} p(\mathcal{D} \mid \theta)\, d\theta + (1 - \pi_0)\, (c_1 / c_0) \int_{\Theta_1} p(\mathcal{D} \mid \theta)\, d\theta}$

which is determined by the arbitrary ratio $c_0/c_1$ (i.e., it can be anything we want!).

 Using proper but very vague priors causes a similar problem.

 The Bayes factor then favors the simpler model: the probability of the observed data under a complex model with a diffuse prior is very low.

 Jeffreys-Lindley paradox → use proper priors for model selection. If $M_0$ and $M_1$ share the same prior over a subset of $\theta$, this part of the prior can be improper, since $c_0/c_1$ will cancel out.
Bayesian Occam’s Razor
 To further understand the Bayesian Occam's razor effect, note that probabilities must sum to one over all possible data sets:

$\sum_{\mathcal{D}'} p(\mathcal{D}' \mid m) = 1$

(Figure from Bayesian Methods for Machine Learning, ICML Tutorial, 2004, Zoubin Ghahramani; the observed data set is marked $\mathcal{D}_0$.)

 Model 1 is too simple and assigns low probability to $\mathcal{D}_0$.

 Model 3 also assigns $\mathcal{D}_0$ relatively low probability, because it can predict many data sets, and hence it spreads its probability quite widely and thinly.

 Model 2 is "just right": it predicts the observed data with a reasonable degree of confidence, but does not predict too many other things. Hence model 2 is the most probable model.


Bayesian Occam’s Razor

For any model $M$:

$\sum_{\text{all } d \in \mathcal{D}} p(\mathcal{D} = d \mid M) = 1$

 The law of conservation of belief states that models that explain many possible data sets must necessarily assign each of them a low probability.

 A note on the evidence and Bayesian Occam's razor, I. Murray and Z. Ghahramani (2005), Gatsby Unit Technical Report GCNU-TR 2005-003
 Occam's Razor, C. Rasmussen and Z. Ghahramani, in T. K. Leen, T. G. Dietterich and V. Tresp (eds), Neural Information Processing Systems 13, pp. 294-300, 2001, MIT Press
Bayesian Occam’s Razor
Bayesian Methods for
Machine Learning, ICML
Tutorial, 2004,
Zoubin Ghahramani

 $M_1$: the too-simple model is unlikely to generate this data.

 $M_3$: the too-complex model spreads its probability over a lot of data sets; it is a little better, but still unlikely to have generated our data.

 $M_2$: the "just right" model has the highest marginal likelihood.
Bayesian Occam’s Razor
 Polynomials of degrees 1, 2, 3 fit to 𝑁 = 5 data points using empirical Bayes. Solid green curve is the
true function, Dashed red curve is the prediction (dotted blue lines represent 𝜎 around the mean). The
posterior over models 𝑝(𝑚|𝒟) is also shown using a Gaussian prior 𝑝(𝑚).
Figure: polynomial fits for d = 1 (logev = -18.899), d = 2 (logev = -20.486) and d = 3 (logev = -21.777), together with the posterior P(M|D) over the three models (N = 5, method = EB). Run linregEbModelSelVsN from Kevin Murphy's PMTK.
Bayesian Occam’s Razor
 Polynomials of degrees 1, 2, 3 fit to 𝑁 = 30 data points using empirical Bayes. Solid green curve is the
true function, Dashed red curve is the prediction (dotted blue lines represent 𝜎 around the mean). The
posterior over models 𝑝(𝑚|𝒟) is also shown using a Gaussian prior 𝑝(𝑚).

Figure: polynomial fits for d = 1 (logev = -106.337), d = 2 (logev = -103.490) and d = 3 (logev = -108.181), together with the posterior P(M|D) over the three models (N = 30, method = EB). Run linregEbModelSelVsN from Kevin Murphy's PMTK.
Marginal Likelihood (Evidence)
 Let us review once more on how we can compute the evidence for conjugate prior
models.

 When discussing parameter inference for a fixed model, we often write

$p(\boldsymbol{\theta} \mid \mathcal{D}, m) \propto p(\boldsymbol{\theta} \mid m)\, p(\mathcal{D} \mid \boldsymbol{\theta}, m)$

 We thus ignore the normalization constant $p(\mathcal{D} \mid m)$. This is valid since $p(\mathcal{D} \mid m)$ is constant with respect to $\boldsymbol{\theta}$.

 However, when comparing models, we need to know how to compute the marginal likelihood, $p(\mathcal{D} \mid m)$.

 In general, this can be quite hard, since we have to integrate over all possible parameter values, but when we have a conjugate prior, it is easy to compute:

$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, m)\, p(\boldsymbol{\theta} \mid m)\, d\boldsymbol{\theta}$


Marginal Likelihood - Evidence
 Let 𝑝(𝜃) = 𝑞(𝜃)/𝑍0 be our prior, where 𝑞(𝜃) is an unnormalized distribution, and 𝑍0 is the
normalization constant of the prior.

 Let 𝑝(𝒟|𝜃) = 𝑞(𝒟|𝜃)/𝑍𝑙 be the likelihood, where 𝑍𝑙 contains any constant factors in the
likelihood.

 Let 𝑝(𝜃|𝒟) = 𝑞(𝜃|𝒟)/𝑍𝑁 be our posterior, where 𝑞(𝜃|𝒟) = 𝑞(𝒟|𝜃)𝑞(𝜃) is the unnormalized
posterior, and 𝑍𝑁 is the normalization constant of the posterior.

p q  p  D | q  q q | D  q q  q  D | q  ZN
p q | D      p D  
p D  ZN Z0 Zl p  D  Z0 Zl
 So assuming the relevant normalization constants are tractable, we have an easy way to
compute the marginal likelihood.

 Several examples are presented next.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 47


Beta-Binomial Model
 Let us apply the above result to the Beta-binomial model. Since we know $p(\theta \mid \mathcal{D}) = \mathcal{Beta}(\theta \mid a', b')$, where $a' = a + N_1$ and $b' = b + N_0$, we know the normalization constant of the posterior is $B(a', b')$. Hence

$p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{p(\mathcal{D})} = \frac{1}{p(\mathcal{D})} \left[ \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1} \right] \left[ \binom{N}{N_1}\, \theta^{N_1} (1 - \theta)^{N_0} \right]$

$\Rightarrow\; \frac{1}{B(a + N_1, b + N_0)} = \frac{1}{p(\mathcal{D})} \binom{N}{N_1} \frac{1}{B(a, b)} \;\Rightarrow\; p(\mathcal{D}) = \binom{N}{N_1} \frac{B(a + N_1, b + N_0)}{B(a, b)}$

 The marginal likelihood for the Beta-Bernoulli model is the same as above, but without the $\binom{N}{N_1}$ term.
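A small sketch of this closed-form evidence using log-Beta and log-Gamma functions for numerical stability (the counts used here are hypothetical):

```python
from scipy.special import betaln, gammaln

def log_evidence_beta_binomial(N1, N0, a, b):
    """log p(D) = log C(N, N1) + log B(a + N1, b + N0) - log B(a, b)."""
    N = N1 + N0
    log_binom = gammaln(N + 1) - gammaln(N1 + 1) - gammaln(N0 + 1)
    return log_binom + betaln(a + N1, b + N0) - betaln(a, b)

# Beta-Bernoulli version: drop log_binom.  Hypothetical counts with a uniform prior:
print(log_evidence_beta_binomial(47, 53, 1, 1))
```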
Dirichlet-Multinoulli Model
 One can show that the marginal likelihood for the Dirichlet-multinoulli model is given by

$p(\mathcal{D}) = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}, \qquad B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}$

 Hence, we can rewrite the above result in the following form, which is more often used:

$p(\mathcal{D}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(N + \sum_k \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}$
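A small sketch of this formula using log-Gamma functions (the counts and the uniform Dirichlet prior are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence_dirichlet_multinoulli(counts, alpha):
    """log p(D) for counts N_k under a Dirichlet(alpha) prior, via log-Gamma functions."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

print(log_evidence_dirichlet_multinoulli([10, 5, 2], [1.0, 1.0, 1.0]))
```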


Gaussian-Gaussian-Wishart Model
 Consider the case of an MVN with a conjugate NIW prior. Let $Z_0$ be the normalizer for the prior, $Z_N$ the normalizer for the posterior, and $Z_l = (2\pi)^{ND/2}$ the normalizer for the likelihood. Then it is easy to see that

$p(\mathcal{D}) = \frac{Z_N}{Z_0\, Z_l} = \frac{1}{(2\pi)^{ND/2}} \cdot \frac{\left(\frac{2\pi}{\kappa_N}\right)^{D/2} |\mathbf{S}_N|^{-\nu_N/2}\, 2^{\nu_N D/2}\, \Gamma_D(\nu_N/2)}{\left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\mathbf{S}_0|^{-\nu_0/2}\, 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2)} = \frac{1}{\pi^{ND/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \frac{|\mathbf{S}_0|^{\nu_0/2}}{|\mathbf{S}_N|^{\nu_N/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}$

 The following expression for the NIW density and its normalization constant will prove useful later on:

$NIW(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{m}_0, \kappa_0, \mathbf{S}_0, \nu_0) = \mathcal{N}\!\left(\boldsymbol{\mu} \,\Big|\, \mathbf{m}_0, \tfrac{1}{\kappa_0}\boldsymbol{\Sigma}\right) \times \mathrm{InvWish}(\boldsymbol{\Sigma} \mid \mathbf{S}_0, \nu_0)$

$= \frac{1}{Z_{NIW}}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\!\left(-\frac{1}{2} \mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1} \mathbf{S}_0\right)\right)$

$= \frac{1}{Z_{NIW}} \exp\!\left(-\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0) - \frac{1}{2} \mathrm{Tr}\!\left(\boldsymbol{\Sigma}^{-1} \mathbf{S}_0\right)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}$

$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2) \left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\mathbf{S}_0|^{-\nu_0/2}, \qquad \Gamma_D = \text{multivariate Gamma function}$
Laplace Approximation
 The Laplace approximation allows a Gaussian approximation of the parameter
posterior about the maximum a posteriori (MAP) parameter estimate.
 Consider a data set 𝒟 and 𝑀 models ℳ𝑖, 𝑖 = 1, . . , 𝑀 with corresponding parameters
𝜽𝑖 , 𝑖 = 1, … 𝑀. We compare models using the posteriors:
$p(\mathcal{M} \mid \mathcal{D}) \propto p(\mathcal{M})\, p(\mathcal{D} \mid \mathcal{M})$

 For large data sets $\mathcal{D}$ (relative to the number of model parameters), the parameter posterior is approximately Gaussian around the MAP estimate $\boldsymbol{\theta}_m^{MAP}$ (use a 2nd order Taylor expansion of the log-posterior):

$p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m) \approx (2\pi)^{-d/2}\, |\mathbf{A}|^{1/2} \exp\!\left(-\frac{1}{2} (\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP})^T \mathbf{A}\, (\boldsymbol{\theta}_m - \boldsymbol{\theta}_m^{MAP})\right)$

$A_{ij} = -\left.\frac{\partial^2 \log p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m)}{\partial \theta_{mi}\, \partial \theta_{mj}}\right|_{\boldsymbol{\theta}_m^{MAP}}$


Laplace Approximation to Model Evidence
 We can write the model evidence as

$p(\mathcal{D} \mid \mathcal{M}_m) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}_m, \mathcal{M}_m)\, p(\boldsymbol{\theta}_m \mid \mathcal{M}_m)}{p(\boldsymbol{\theta}_m \mid \mathcal{D}, \mathcal{M}_m)}$

 Using the Laplace approximation for the posterior of the parameters and evaluating the equation above at $\boldsymbol{\theta}_m^{MAP}$ (where the quadratic term vanishes):

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP} \mid \mathcal{M}_m) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{A}|$

 This Laplace approximation is often used for model comparison.

 Other approximations are also very useful:

• Bayesian Information Criterion (BIC) (in the limit of large $N$)
• MCMC (sampling approach)
• Variational methods
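A minimal numerical sketch of the Laplace approximation above (not the PMTK code); `neg_log_joint` is a hypothetical user-supplied function returning the negative of log p(D|θ, M) + log p(θ|M) for a 1-D parameter array, and the Hessian is obtained by finite differences rather than analytically.

```python
import numpy as np
from scipy.optimize import minimize

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at point x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def laplace_log_evidence(neg_log_joint, theta_init):
    """log p(D|M) ~ log p(D, theta_MAP | M) + (d/2) log(2 pi) - (1/2) log |A|."""
    res = minimize(neg_log_joint, theta_init)          # theta_MAP minimizes the negative log joint
    A = numerical_hessian(neg_log_joint, res.x)        # A = -Hessian of the log posterior at the MAP
    d = np.atleast_1d(res.x).size
    _, logdet_A = np.linalg.slogdet(np.atleast_2d(A))
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_A
```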
Bayesian Information Criterion
 Start with the Laplace approximation for large data sets $N$:

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP} \mid \mathcal{M}_m) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{A}|$

 As $N$ grows, $\mathbf{A}$ grows as $N \mathbf{A}_0$ for some fixed matrix $\mathbf{A}_0$, thus

$\log |\mathbf{A}| \approx \log |N \mathbf{A}_0| = \log\!\left(N^d |\mathbf{A}_0|\right) = d \log N + \log |\mathbf{A}_0| \;\xrightarrow{N \to \infty}\; d \log N$

 Then the Laplace approximation is simplified as:

$\log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_m^{MAP}, \mathcal{M}_m) - \frac{d}{2} \log N \quad (\text{as } N \to \infty)$

 Note interesting properties of (the easy to compute) BIC:
• No dependence on the prior
• One can use the MLE rather than the MAP estimate (but use MAP when working with mixtures of Gaussians)
• If not all parameters are well determined from the data, $d$ = number of effective parameters

 Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6(2), 461-464.
BIC Approximation to Log Marginal Likelihood
 The Bayesian information criterion or BIC thus has the following form:

$\mathrm{BIC}: \quad \log p(\mathcal{D} \mid \mathcal{M}_m) \approx \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)}{2} \log N \quad (\text{as } N \to \infty)$

 $\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)$ is the number of degrees of freedom in the model, and $\hat{\boldsymbol{\theta}}_m$ is the MLE for the model. We see that this has the form of a penalized log likelihood, where the penalty term depends on the model complexity.


BIC for Linear Regression
 As an example, consider linear regression with the model $p(y \mid \mathbf{x}) = \mathcal{N}(y \mid \boldsymbol{\theta}^T \mathbf{x}, \sigma^2)$. The MLE, log likelihood and BIC are:

$\text{MLE:}\quad \hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{\boldsymbol{\theta}}^T \mathbf{x}_i\right)^2$

$\log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) = -\frac{N}{2} \log\!\left(2\pi \hat{\sigma}^2\right) - \frac{1}{2 \hat{\sigma}^2} \sum_i \left(y_i - \hat{\boldsymbol{\theta}}^T \mathbf{x}_i\right)^2 = -\frac{N}{2} \log\!\left(2\pi \hat{\sigma}^2\right) - \frac{N}{2}$

$\mathrm{BIC} = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}) - \frac{D}{2} \log N$

 $D$ is the number of variables in the model.


BIC for Linear Regression
 Hence the BIC score is as follows (dropping constant terms):

$\mathrm{BIC} = -\frac{N}{2} \log \hat{\sigma}^2 - \frac{D}{2} \log N$

 $D$ is the number of variables in the model. In the statistics literature, it is common to use an alternative definition of BIC, which we call the BIC cost (since we want to minimize it):

$\mathrm{BIC\text{-}cost} = -2 \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) + \mathrm{dof}(\hat{\boldsymbol{\theta}}_m)\, \log N \;\approx\; -2 \log p(\mathcal{D} \mid \mathcal{M}_m)$

 In the context of the regression example, this becomes:

$\mathrm{BIC\text{-}cost} = N \log \hat{\sigma}^2 + N + D \log N$

 The BIC method is related to the minimum description length or MDL principle. It characterizes the score of how well the model fits the data, minus how complex the model is.
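A small sketch of the BIC cost for linear regression, following the formulas above (the design matrix X and targets y are assumed to be given as NumPy arrays; the function name is illustrative):

```python
import numpy as np

def bic_cost_linreg(X, y):
    """BIC cost = -2 log p(D | theta_hat) + D log N for the Gaussian linear model above."""
    N, D = X.shape
    theta = np.linalg.lstsq(X, y, rcond=None)[0]     # MLE of the weights
    sigma2 = np.mean((y - X @ theta) ** 2)           # MLE of the noise variance
    log_lik = -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * N
    return -2.0 * log_lik + D * np.log(N)            # lower cost = preferred model

# usage: compare design matrices with different numbers of features by their BIC cost
```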
Akaike Information Criterion
 There is a very similar expression to BIC/MDL called the Akaike information criterion or AIC, defined as

$\mathrm{AIC}(m, \mathcal{D}) = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \mathrm{dof}(\hat{\boldsymbol{\theta}}_m)$

 This is derived from a frequentist framework, and cannot be interpreted as an approximation to the marginal likelihood.

 The penalty for AIC is less than for BIC:

$\mathrm{BIC} = \log p(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_m, \mathcal{M}_m) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}}_m)}{2} \log N \;\approx\; \log p(\mathcal{D} \mid \mathcal{M}_m)$

 This causes AIC to pick more complex models. However, this can sometimes result in better predictive accuracy!

 Clarke, B., E. Fokoue, and H. H. Zhang (2009). Principles and Theory for Data Mining and Machine Learning. Springer.


Effect of the Prior on Marginal Likelihood
 When performing posterior inference, the prior may not matter much since the
likelihood often overwhelms the prior.
 But when computing the marginal likelihood, the prior plays a much more
important role, since we are averaging the likelihood over all possible
parameter settings, as weighted by the prior.
 If the prior is unknown, the correct Bayesian procedure is to put a prior on the prior. That is, we should put a prior on $\boldsymbol{\theta}$ as well as on the hyper-parameter $\boldsymbol{\alpha}$. To compute the marginal likelihood, we should integrate out all unknowns, i.e., we should compute

$p(\mathcal{D} \mid m) = \iint p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, m)\, p(\boldsymbol{\alpha} \mid m)\, d\boldsymbol{\theta}\, d\boldsymbol{\alpha}$


Empirical Bayes
 This requires specifying the hyper-prior.

 Fortunately, the higher up we go in the Bayesian hierarchy, the less sensitive the results are to the prior settings. Thus we can usually make the hyper-prior uninformative.

 A computational shortcut is to optimize $\boldsymbol{\alpha}$ rather than integrating it out:

$p(\mathcal{D} \mid m) \approx \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \hat{\boldsymbol{\alpha}}, m)\, d\boldsymbol{\theta},$

where

$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} p(\mathcal{D} \mid \boldsymbol{\alpha}, m) = \arg\max_{\boldsymbol{\alpha}} \int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, m)\, d\boldsymbol{\theta}$

 This approach is called empirical Bayes (EB).
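A minimal sketch of empirical Bayes for a Beta-Bernoulli model: the hyperparameter α of a symmetric Beta(α, α) prior is chosen by maximizing the closed-form marginal likelihood over a grid. The counts and the grid are hypothetical choices for illustration.

```python
import numpy as np
from scipy.special import betaln

def log_marglik(alpha, N1, N0):
    """Closed-form log p(D | alpha) for a Beta(alpha, alpha) prior on a Bernoulli rate."""
    return betaln(alpha + N1, alpha + N0) - betaln(alpha, alpha)

N1, N0 = 2, 8                               # hypothetical coin-toss counts
grid = np.logspace(-2, 2, 400)
alpha_hat = grid[np.argmax([log_marglik(a, N1, N0) for a in grid])]
print(alpha_hat)                            # the type-II MLE of the hyperparameter
```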
