Using Stacking to Average Bayesian Predictive Distributions (with Discussion)
Y. Yao, A. Vehtari, D. Simpson, and A. Gelman
Abstract. Bayesian model averaging is flawed in the M-open setting in which the
true data-generating process is not one of the candidate models being fit. We take
the idea of stacking from the point estimation literature and generalize to the com-
bination of predictive distributions. We extend the utility function to any proper
scoring rule and use Pareto smoothed importance sampling to efficiently compute
the required leave-one-out posterior distributions. We compare stacking of pre-
dictive distributions to several alternatives: stacking of means, Bayesian model
averaging (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized
using the Bayesian bootstrap. Based on simulations and real-data applications, we
recommend stacking of predictive distributions, with bootstrapped-Pseudo-BMA
as an approximate alternative when computation cost is an issue.
Keywords: Bayesian model averaging, model combination, proper scoring rule,
predictive distribution, stacking, Stan.
1 Introduction
A general challenge in statistics is prediction in the presence of multiple candidate mod-
els or learning algorithms M = (M1 , . . . , MK ). Choosing one model that can give opti-
mal performance for future data can be unstable and wasteful of information (see, e.g.,
Piironen and Vehtari, 2017). An alternative is model averaging, which tries to find an
optimal model combination in the space spanned by all individual models. In a Bayesian
context, the natural target for prediction is a predictive distribution that is
close to the true data-generating distribution (Gneiting and Raftery, 2007; Vehtari and
Ojanen, 2012).
Ideally, we would avoid the Bayesian model combination problem by extending the
model to include the separate models Mk as special cases (Gelman, 2004). In practice,
constructing such an expansion requires a lot of conceptual and computational effort.
Hence, in this paper we focus on simpler tools that work with existing inferences from
models that have been fitted separately.
This paper is organized as follows. In Section 2, we give a brief review of some
existing model averaging methods. Then we propose our stacking method in Section 3. In
Section 4, we compare stacking, Bayesian model averaging, and several other alternatives
through a Gaussian mixture model, a series of linear regression simulations, two real data
examples, and an application in variational inference. We conclude with Section 5 where
we give general recommendations. We provide the R and Stan code in the Supplementary
Material (Yao et al., 2018).
2 Existing approaches
In Bayesian model comparison, the relationship between the true data generator and
the model list M = (M1, ..., MK) can be classified into three categories: M-closed, in which
the true data-generating model is one of the Mk; M-complete, in which a true model exists
and can be conceptualized but, for computational or other reasons, we still restrict attention
to the list M; and M-open, in which we cannot specify an explicit true model. We adopt this
taxonomy from Bernardo and Smith (1994) (see also Key et al., 1999, and Clyde and Iversen, 2013).
Bayesian model averaging If all candidate models are generative, the Bayesian solu-
tion is to simply average the separate models, weighing each by its marginal posterior
probability. This is called Bayesian model averaging (BMA) and is optimal if the method
is evaluated based on its frequency properties over the joint prior distribution
of the models and their internal parameters (Madigan et al., 1996; Hoeting et al., 1999).
If $y = (y_1, \ldots, y_n)$ represents the observed data, then the posterior distribution for any
quantity of interest $\Delta$ is
$$p(\Delta \mid y) = \sum_{k=1}^K p(\Delta \mid M_k, y)\, p(M_k \mid y).$$
In this expression, each model is weighted by its posterior probability,
$$p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_{k=1}^K p(y \mid M_k)\, p(M_k)}.$$
These weights depend on the marginal likelihoods $p(y \mid M_k)$, which can be highly sensitive to
the choice of prior within each model. Consider, for example, a parameter that
has been assigned a normal prior distribution with center 0 and scale 10, and whose
estimate is likely to be in the range (−1, 1). The chosen prior is then essentially flat,
as would also be the case if the scale were increased to 100 or 1000. But such a change
would divide the posterior probability of the model by roughly a factor of 10 or 100.
Stacking Stacking (Wolpert, 1992; Breiman, 1996; LeBlanc and Tibshirani, 1996) is a
direct approach for averaging point estimates from multiple models. In supervised learning,
where the data are $((x_i, y_i), i = 1, \ldots, n)$ and each model $M_k$ has a parametric form
$\hat y_k = f_k(x \mid \theta_k)$, stacking is done in two steps (Ting and Witten, 1999). First, each model
is fitted separately and the leave-one-out (LOO) predictor $\hat f_k^{(-i)}(x_i) = \mathrm{E}[y_i \mid \hat\theta_{k, y_{-i}}, M_k]$
is obtained for each model $k$ and each data point $i$. In the second step, a weight for each
model is obtained by minimizing the LOO mean squared error
$$\hat w = \arg\min_w \sum_{i=1}^n \Big( y_i - \sum_k w_k \hat f_k^{(-i)}(x_i) \Big)^2. \qquad (1)$$
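As a minimal illustration of (1), and not part of our supplementary code, the simplex-constrained least-squares problem can be solved in R with a softmax reparameterization and a general-purpose optimizer; the matrix loo_pred of LOO point predictions and the function name are our own illustrative choices:

# Stacking of means, eq. (1): minimize the LOO mean squared error over simplex weights.
# loo_pred is an n x K matrix whose (i, k) entry is the LOO prediction f_k^(-i)(x_i);
# y is the length-n response vector.
stacking_means_weights <- function(loo_pred, y) {
  K <- ncol(loo_pred)
  obj <- function(u) {                       # w = softmax(c(0, u)) stays on the simplex
    w <- exp(c(0, u)); w <- w / sum(w)
    sum((y - loo_pred %*% w)^2)
  }
  u_hat <- optim(rep(0, K - 1), obj, method = "BFGS")$par
  w_hat <- exp(c(0, u_hat))
  w_hat / sum(w_hat)
}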
It is not surprising that stacking typically outperforms BMA when the criterion is
mean squared predictive error (Clarke, 2003), because BMA is not optimized to this
task. Wong and Clarke (2004) emphasize that the BMA weights reflect the fit to the
data rather than evaluating the prediction accuracy. On the other hand, stacking is not
widely used in Bayesian model combination because the classical stacking only works
with point estimates, not the entire posterior distribution (Hoeting et al., 1999).
Clyde and Iversen (2013) give a Bayesian interpretation for stacking by considering
model combination as a decision problem when the true model $M_t$ is not in the model
list. If the decision is of the form $a(y, w) = \sum_{k=1}^K w_k \hat y_k$, then the expected utility under
quadratic loss is
$$\mathrm{E}_{\tilde y}\big[ u\big(\tilde y, a(y, w)\big) \mid y \big] = - \int \Big\| \tilde y - \sum_{k=1}^K w_k \hat{\tilde y}_k \Big\|^2 p(\tilde y \mid y, M_t)\, d\tilde y,$$
where $\hat{\tilde y}_k$ is the predictor of new data $\tilde y$ in model $k$. Hence, the stacking weights are the
solution to the LOO estimator
$$\hat w = \arg\max_w \frac{1}{n} \sum_{i=1}^n u\big(y_i, a(y_{-i}, w)\big),$$
where $a(y_{-i}, w) = \sum_{k=1}^K w_k\, \mathrm{E}[y_i \mid y_{-i}, M_k]$.
Le and Clarke (2017) prove that the stacking solution is asymptotically the Bayes solution.
Under mild conditions on the distributions, the following asymptotic relation holds:
$$\int l\big(\tilde y, a(y, w)\big)\, p(\tilde y \mid y)\, d\tilde y - \frac{1}{n} \sum_{i=1}^n l\big(y_i, a(y_{-i}, w)\big) \xrightarrow{L_2} 0,$$
where $l$ is the squared loss, $l(\tilde y, a) = (\tilde y - a)^2$. They also prove that when the action is a
predictive distribution $a(y_{-i}, w) = \sum_{k=1}^K w_k\, p(y_i \mid y_{-i}, M_k)$, the asymptotic relation still
holds for the negative logarithmic scoring rule.
However, most early literature limited stacking to averaging point predictions, rather
than predictive distributions. In this paper, we extend stacking from minimizing the
squared error to maximizing scoring rules, hence making stacking applicable to combining
a set of Bayesian posterior predictive distributions. We argue this is the appropriate
version of Bayesian model averaging in the M-open situation. In practice, we use
Pseudo-BMA+ weighting (defined in Section 3.4) as an initial guess for the optimization
procedure in the log score stacking.
Other model weighting approaches Besides BMA, stacking, and AIC-type weighting,
some other methods have been introduced to combine Bayesian models. Gutiérrez-Peña
and Walker (2005) consider using a nonparametric prior in the decision problem stated
above. Essentially they are fitting a mixture model with a Dirichlet process, yielding a
posterior expected utility of
$$U_n(w_k, \theta_k) = \sum_{i=1}^n \sum_{k=1}^K w_k\, f_k(y_i \mid \theta_k).$$
They then solve for the optimal weights $\hat w_k = \arg\max_{w_k, \theta_k} U_n(w_k, \theta_k)$.
Li and Dunson (2016) propose model averaging using weights based on divergences
from a reference model in M-complete settings. If the true data generating density
function is known to be f ∗ , then an AIC-type weight can be defined as
$$w_k = \frac{\exp\big(-n\,\mathrm{KL}(f^*, f_k)\big)}{\sum_{k=1}^K \exp\big(-n\,\mathrm{KL}(f^*, f_k)\big)}. \qquad (2)$$
The true model can be approximated with a reference model $M_0$ with density $f_0(\cdot \mid \theta_0)$,
using nonparametric methods such as Gaussian processes or Dirichlet processes, and $\mathrm{KL}(f^*, f_k)$
can be estimated by its posterior mean,
$$\widehat{\mathrm{KL}}_1(f_0, f_k) = \iint \mathrm{KL}\big(f_0(\cdot \mid \theta_0),\, f_k(\cdot \mid \theta_k)\big)\, p(\theta_k \mid y, M_k)\, p(\theta_0 \mid y, M_0)\, d\theta_k\, d\theta_0.$$
Here, $\widehat{\mathrm{KL}}_1$ corresponds to the Gibbs utility, which can be criticized for not using the posterior
predictive distributions (Vehtari and Ojanen, 2012). An alternative is to compare posterior
predictive distributions directly: let $p(\tilde y \mid y, M_k) = \int f_k(\tilde y \mid \theta_k)\, p(\theta_k \mid y, M_k)\, d\theta_k$, $k = 0, \ldots, K$, and define
$$\widehat{\mathrm{KL}}_2(f_0, f_k) = -\int \log p(\tilde y \mid y, M_k)\, p(\tilde y \mid y, M_0)\, d\tilde y + \int \log p(\tilde y \mid y, M_0)\, p(\tilde y \mid y, M_0)\, d\tilde y.$$
Asymptotically the two utilities are identical, and $\widehat{\mathrm{KL}}_1$ is often computationally simpler than $\widehat{\mathrm{KL}}_2$.
As the entropy of the reference model, $\int \log p(\tilde y \mid y, M_0)\, p(\tilde y \mid y, M_0)\, d\tilde y$, is constant
across models, the corresponding terms cancel out in the weight (2), leaving
$$w_k = \frac{\exp\big(n \int \log p(\tilde y \mid y, M_k)\, p(\tilde y \mid y, M_0)\, d\tilde y\big)}{\sum_{k=1}^K \exp\big(n \int \log p(\tilde y \mid y, M_k)\, p(\tilde y \mid y, M_0)\, d\tilde y\big)}.$$
This weight is proportional to the exponential of the expected log predictive density, where the
expectation is taken with respect to the reference model $M_0$. Comparing with the definition (8) in
Section 3.4, this method could be called Reference-Pseudo-BMA.
1. Quadratic score: $\mathrm{QS}(p, y) = 2p(y) - \|p\|_2^2$ with the divergence $d(p, q) = \|p - q\|_2^2$.
2. Logarithmic score: $\mathrm{LogS}(p, y) = \log p(y)$ with $d(p, q) = \mathrm{KL}(q, p)$. The logarithmic
score is the only proper local score, assuming regularity conditions.
3. Continuous-ranked probability score: $\mathrm{CRPS}(F, y) = - \int_{\mathbb{R}} \big(F(y') - \mathbb{1}(y' \ge y)\big)^2\, dy'$
with $d(F, G) = \int_{\mathbb{R}} \big(F(y) - G(y)\big)^2\, dy$, where $F$ and $G$ are the corresponding distribution functions.
4. Energy score: $\mathrm{ES}(P, y) = \frac{1}{2} \mathrm{E}_P \|Y - Y'\|_2^{\beta} - \mathrm{E}_P \|Y - y\|_2^{\beta}$, where $Y$ and $Y'$ are two
independent random variables from distribution $P$. When $\beta = 2$, this becomes
$\mathrm{ES}(P, y) = -\|\mathrm{E}_P(Y) - y\|^2$. The energy score is strictly proper when $\beta \in (0, 2)$
but not when $\beta = 2$.
5. Scoring rules depending on first and second moments: examples include $S(P, y) =
-\log \det(\Sigma_P) - (y - \mu_P)^T \Sigma_P^{-1} (y - \mu_P)$, where $\mu_P$ and $\Sigma_P$ are the mean vector
and covariance matrix of distribution $P$.
The stacking weights are then defined as the solution to
$$\max_{w \in \mathcal{S}_1^K} S\Big( \sum_{k=1}^K w_k\, p(\cdot \mid y, M_k),\; p_t(\cdot \mid y) \Big), \qquad (3)$$
where $p(\tilde y \mid y, M_k)$ is the predictive density of new data $\tilde y$ in model $M_k$ that has been
trained on observed data $y$, and $p_t(\tilde y \mid y)$ refers to the true distribution.
An empirical approximation to (3) can be constructed by replacing the full predictive
distribution $p(\tilde y \mid y, M_k)$ evaluated at a new data point $\tilde y$ with the corresponding LOO
predictive distribution, $\hat p_{k,-i}(y_i) = \int p(y_i \mid \theta_k, M_k)\, p(\theta_k \mid y_{-i}, M_k)\, d\theta_k$. The corresponding
stacking weights are the solution to the optimization problem
$$\max_{w \in \mathcal{S}_1^K} \frac{1}{n} \sum_{i=1}^n S\Big( \sum_{k=1}^K w_k\, \hat p_{k,-i},\; y_i \Big). \qquad (4)$$
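Under the logarithmic score, our default choice below, (4) can be solved with any simplex-constrained or reparameterized optimizer. A minimal R sketch (ours, not the supplementary code) taking an n × K matrix lpd_loo of pointwise log LOO predictive densities $\log \hat p_{k,-i}(y_i)$:

# Stacking of predictive distributions (eq. 4 with the log score):
# maximize sum_i log( sum_k w_k exp(lpd_loo[i, k]) ) over the simplex.
stacking_weights <- function(lpd_loo) {
  K <- ncol(lpd_loo)
  lpd_c <- sweep(lpd_loo, 1, apply(lpd_loo, 1, max))  # row-wise shift for numerical stability
  obj <- function(u) {                                # negated objective for optim()
    w <- exp(c(0, u)); w <- w / sum(w)
    -sum(log(exp(lpd_c) %*% w))
  }
  u_hat <- optim(rep(0, K - 1), obj, method = "BFGS")$par
  w_hat <- exp(c(0, u_hat))
  w_hat / sum(w_hat)
}

In practice, the loo R package provides an implementation of these stacking weights (and of the Pseudo-BMA+ weights of Section 3.4) through its loo_model_weights function, which we would recommend over a hand-rolled optimizer.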
The choice of scoring rules can depend on the underlying application. Stacking of
means (1) corresponds to the energy score with β = 2. The reasons why we prefer
stacking of predictive distributions (corresponding to the logarithmic score) to stacking
of means are: (i) the energy score with β = 2 is not a strictly proper scoring rule and can
give rise to identification problems, and (ii) without further smoothness assumptions,
every proper local scoring rule is equivalent to the logarithmic score (Gneiting and
Raftery, 2007).
Under mild conditions, the LOO approximation in (4) is asymptotically justified in the same sense as above:
$$\frac{1}{n} \sum_{i=1}^n S\Big( \sum_{k=1}^K w_k\, \hat p_{k,-i},\; y_i \Big) - \mathrm{E}_{\tilde y \mid y}\, S\Big( \sum_{k=1}^K w_k\, p(\tilde y \mid y, M_k),\; \tilde y \Big) \xrightarrow{L_2} 0.$$
In terms of Vehtari and Ojanen (2012, Section 3.3), the proposed stacking of predic-
tive distributions is the M∗ -optimal projection of the information in the actual belief
model M∗ to ŵ, where explicit specification of M∗ is avoided by re-using data as a proxy
for the predictive distribution of the actual belief model and the weights wk are the free
parameters.
Exact LOO requires refitting each model n times. To avoid this onerous computation,
we use the following approximate method. For the $k$-th model, we fit the model to all the data,
obtain $S$ simulation draws $\theta_k^s$ ($s = 1, \ldots, S$) from the full posterior $p(\theta_k \mid y, M_k)$, and
calculate the importance ratios
$$r_{i,k}^s = \frac{1}{p(y_i \mid \theta_k^s, M_k)} \propto \frac{p(\theta_k^s \mid y_{-i}, M_k)}{p(\theta_k^s \mid y, M_k)}. \qquad (6)$$
The distribution of the ratios $r_{i,k}^s$ can have a long right tail, which makes the plain
importance sampling estimate unstable. This problem can be resolved using Pareto smoothed
importance sampling (PSIS, Vehtari et al., 2017a). For each fixed model $k$ and data point $y_i$,
we fit a generalized Pareto distribution to the largest importance ratios $r_{i,k}^s$ and calculate the
expected values of the order statistics of the fitted generalized Pareto distribution. These values
are used as the smoothed importance weights $w_{i,k}^s$, which replace the raw ratios $r_{i,k}^s$.
For details of PSIS, see Vehtari et al. (2017a). PSIS-LOO importance sampling (Vehtari
et al., 2017b) computes the LOO predictive density as
$$p(y_i \mid y_{-i}, M_k) = \int p(y_i \mid \theta_k, M_k)\, \frac{p(\theta_k \mid y_{-i}, M_k)}{p(\theta_k \mid y, M_k)}\, p(\theta_k \mid y, M_k)\, d\theta_k \approx \frac{\sum_{s=1}^S w_{i,k}^s\, p(y_i \mid \theta_k^s, M_k)}{\sum_{s=1}^S w_{i,k}^s}. \qquad (7)$$
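The computation in (6)–(7) can be sketched as follows, given an S × n matrix log_lik of pointwise log-likelihoods $\log p(y_i \mid \theta_k^s, M_k)$ evaluated at the full-data posterior draws. For brevity this sketch (ours) merely truncates the largest ratios instead of fitting the generalized Pareto tail; the actual Pareto smoothing is implemented in the loo R package.

# LOO predictive densities by importance sampling, eqs. (6)-(7).
# log_lik: S x n matrix of log p(y_i | theta^s, M_k) from the full posterior.
# The raw ratios are truncated here, a crude stand-in for the Pareto smoothing of PSIS.
loo_lpd_is <- function(log_lik) {
  S <- nrow(log_lik)
  apply(log_lik, 2, function(ll) {
    log_r <- -ll                            # log importance ratios, eq. (6), up to a constant
    r <- exp(log_r - max(log_r))            # rescale for numerical stability
    r <- pmin(r, sqrt(S) * mean(r))         # truncate the long right tail
    log(sum(r * exp(ll)) / sum(r))          # self-normalized estimate, eq. (7), on the log scale
  })
}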
The reliability of the PSIS approximation can be determined by the estimated shape
parameter k̂ in the generalized Pareto distribution. For the left-out data points where
k̂ > 0.7, Vehtari et al. (2017b) suggest replacing the PSIS approximation for those
problematic cases with exact LOO or k-fold cross-validation.
One potential drawback of LOO is its large variance when the sample size is small.
We see in simulations that when the ratio of the sample size to the effective number
of parameters is small, the weighting can be unstable. How to adjust for this small-sample
behavior is left for future research.
3.4 Pseudo-BMA
In this paper, we also consider an AIC-type weighting using leave-one-out cross-validation.
As mentioned in Section 2, these weights estimate the same quantity as the approach of Li and
Dunson (2016), which uses divergences from a reference model.
To maintain comparability with the given dataset and to make differences in the scale of the
effective number of parameters easier to interpret, we define the expected
log pointwise predictive density (elpd) for a new dataset $\tilde y$ as a measure of the predictive
accuracy of a given model for the $n$ data points taken one at a time (Gelman et al.,
2014; Vehtari et al., 2017b). In model $M_k$,
$$\mathrm{elpd}^k = \sum_{i=1}^n \int p_t(\tilde y_i) \log p(\tilde y_i \mid y, M_k)\, d\tilde y_i,$$
where $p_t(\tilde y_i)$ denotes the true distribution of future data $\tilde y_i$.
Given observed data y and model k, we use LOO to estimate the elpd as
$$\widehat{\mathrm{elpd}}^k_{\mathrm{loo}} = \sum_{i=1}^n \log \hat p(y_i \mid y_{-i}, M_k) = \sum_{i=1}^n \log \left( \frac{\sum_{s=1}^S w_{i,k}^s\, p(y_i \mid \theta_k^s, M_k)}{\sum_{s=1}^S w_{i,k}^s} \right).$$
However, this estimation doesn’t take into account the uncertainty resulting from having
a finite number of proxy samples from the future data distribution. Taking into account
the uncertainty would regularize the weights making them go further away from 0 and
1.
The computed estimate $\widehat{\mathrm{elpd}}^k_{\mathrm{loo}}$ is defined as a sum of $n$ independent components,
so it is trivial to compute its standard error from the standard deviation of
the $n$ pointwise values (Vehtari and Lampinen, 2002). As in (7), define
$$\widehat{\mathrm{elpd}}^k_{\mathrm{loo},i} = \log \hat p(y_i \mid y_{-i}, M_k).$$
Finally, the Bayesian bootstrap (BB) can be used to compute uncertainties related
to the LOO estimation (Vehtari and Lampinen, 2002). The Bayesian bootstrap (Rubin,
1981) gives a simple non-parametric approximation to a distribution. Having samples
$(z_1, \ldots, z_n)$ of a random variable $Z$, the BB draws weights $\alpha = (\alpha_1, \ldots, \alpha_n)$ from a
Dirichlet$(1, \ldots, 1)$ distribution and computes the corresponding weighted estimate $\hat\phi(Z \mid \alpha)$
of any statistic $\phi(Z)$ of interest. The distribution over all replicated $\hat\phi(Z \mid \alpha)$ (i.e.,
generated by repeated sampling of $\alpha$) produces an approximation to the distribution of $\phi(Z)$.
As the distribution of $\widehat{\mathrm{elpd}}^k_{\mathrm{loo},i}$ is often highly skewed, the BB is likely to work better
than a Gaussian approximation. In our model weighting, we define
$$z_i^k = \widehat{\mathrm{elpd}}^k_{\mathrm{loo},i}, \quad i = 1, \ldots, n.$$
We sample vectors $(\alpha_{1,b}, \ldots, \alpha_{n,b})$, $b = 1, \ldots, B$, from the $n$-dimensional Dirichlet$(1, \ldots, 1)$
distribution and compute the weighted means
$$\bar z_b^k = \sum_{i=1}^n \alpha_{i,b}\, z_i^k.$$
The log-score weights for each bootstrap replicate are
$$w_{k,b} = \frac{\exp(n \bar z_b^k)}{\sum_{k=1}^K \exp(n \bar z_b^k)}, \quad b = 1, \ldots, B,$$
and the final Pseudo-BMA+ weight of model $k$ is the average over the replicates, $w_k = \frac{1}{B} \sum_{b=1}^B w_{k,b}$.
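The whole Pseudo-BMA and Pseudo-BMA+ computation is then a few lines of R. This is a sketch with our own names, again taking the n × K matrix lpd_loo of pointwise log LOO densities:

# Pseudo-BMA weights: exponentiate and normalize the total elpd_loo of each model.
pseudo_bma <- function(lpd_loo) {
  elpd <- colSums(lpd_loo)
  exp(elpd - max(elpd)) / sum(exp(elpd - max(elpd)))
}

# Pseudo-BMA+ weights: Bayesian-bootstrap the pointwise elpd contributions,
# form log-score weights for each replicate, and average over the replicates.
pseudo_bma_plus <- function(lpd_loo, B = 1000) {
  n <- nrow(lpd_loo); K <- ncol(lpd_loo)
  w <- matrix(0, B, K)
  for (b in 1:B) {
    alpha <- rgamma(n, 1); alpha <- alpha / sum(alpha)   # Dirichlet(1, ..., 1) draw
    lw <- n * colSums(alpha * lpd_loo)                   # n times the weighted mean pointwise elpd
    w[b, ] <- exp(lw - max(lw)) / sum(exp(lw - max(lw)))
  }
  colMeans(w)
}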
4 Simulation examples
4.1 Gaussian mixture model
This simple example helps us understand how BMA and stacking behave differently.
It also illustrates the importance of the choice of scoring rules when combining distri-
butions. Suppose the observed data y = (yi , i = 1, . . . , n) come independently from a
normal distribution N(3.4, 1), not known to the data analyst, and there are 8 candidate
models, $\mathrm{N}(\mu_k, 1)$ with $\mu_k = k$ for $1 \le k \le 8$. This is an M-open problem in that none
of the candidates is the true model, and we have set the parameters so that the models
are somewhat separate but not completely distinct in their predictive distributions.
For BMA with a uniform prior $\Pr(M_k) = \frac{1}{8}$, $k = 1, \ldots, 8$, we can write the posterior
distribution explicitly:
$$\hat w_k^{\mathrm{BMA}} = P(M_k \mid y) = \frac{\exp\big(-\frac{1}{2} \sum_{i=1}^n (y_i - \mu_k)^2\big)}{\sum_{k'} \exp\big(-\frac{1}{2} \sum_{i=1}^n (y_i - \mu_{k'})^2\big)},$$
from which we see that $\hat w_3^{\mathrm{BMA}} \xrightarrow{P} 1$ and $\hat w_k^{\mathrm{BMA}} \xrightarrow{P} 0$ for $k \ne 3$ as the sample size $n \to \infty$.
Furthermore, for any given $n$,
$$\mathrm{E}_{y \sim \mathrm{N}(\mu, 1)}\big[\hat w_k^{\mathrm{BMA}}\big] \propto \mathrm{E}_y \exp\Big(-\frac{1}{2} \sum_{i=1}^n (y_i - \mu_k)^2\Big) \propto \left( \int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2}\big((y - \mu_k)^2 + (y - \mu)^2\big)\Big)\, dy \right)^n \propto \exp\Big(-\frac{n(\mu_k - \mu)^2}{4}\Big).$$
This example is simple in that there is no parameter to estimate within each of the
models: p(ỹ|y, Mk ) = p(ỹ|Mk ). Hence, in this case the weights from Pseudo-BMA and
Pseudo-BMA+ are the same as the BMA weights.
For stacking of means, we need to solve
$$\hat w = \arg\min_w \sum_{i=1}^n \Big( y_i - \sum_{k=1}^8 w_k\, k \Big)^2, \quad \text{s.t.} \ \sum_{k=1}^8 w_k = 1, \ w_k \ge 0.$$
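For this example all the weights can be computed in closed form or with a few lines of simulation. A small R sketch (ours, reusing the stacking_weights function sketched earlier; the sample size and seed are illustrative):

set.seed(1)
n  <- 50
y  <- rnorm(n, 3.4, 1)                       # data from the true model N(3.4, 1)
mu <- 1:8                                    # candidate models N(mu_k, 1)

# BMA weights under a uniform model prior (closed form: no free parameters in any model)
log_marg <- sapply(mu, function(m) sum(dnorm(y, m, 1, log = TRUE)))
w_bma <- exp(log_marg - max(log_marg)); w_bma <- w_bma / sum(w_bma)

# Stacking of predictive distributions: here p(y_i | y_{-i}, M_k) = N(y_i | mu_k, 1) exactly,
# because the candidate models have no parameters to estimate.
lpd_loo <- sapply(mu, function(m) dnorm(y, m, 1, log = TRUE))
w_stack <- stacking_weights(lpd_loo)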
Figure 1: For the Gaussian mixture example, the predictive distribution p(ỹ|y) of BMA
(green curve), stacking of means (blue) and stacking of predictive distributions (red). In
each graph, the gray distribution represents the true model N(3.4, 1). Stacking of means
matches the first moment but can ignore the distribution. For this M-open problem,
stacking of predictive distributions outperforms BMA as sample size increases.
In fact, this example is a density estimation problem. Smyth and Wolpert (1998)
first applied stacking to non-parametric density estimation, which they called stacked
density estimation. It can be viewed as a special case of our stacking method.
We compare the posterior predictive distribution $\hat p(\tilde y \mid y) = \sum_k \hat w_k\, p(\tilde y \mid y, M_k)$ for these
three methods of model averaging. Figure 1 shows the predictive distributions in one
simulation when the sample size n varies from 3 to 200. Stacking of means (the middle
row of graphs) gives an unappealing predictive distribution, even if its point estimate is
reasonable. The broad and oddly spaced distribution here arises from nonidentification
of w, and it demonstrates the general point that stacking of means does not even try to
match the shape of the predictive distribution. The top and bottom row of graphs show
how BMA picks out the single model that is closest in KL divergence, while stacking
picks a combination; the benefits of stacking become clear for large n.
In this trivial non-parametric case, stacking of predictive distributions is almost the
same as fitting a mixture model, except for the absence of the prior. The true model
N(3.4, 1) is actually a convolution of single models rather than a mixture, hence no
approach can recover the true one from the model list. From Figure 2 we can compare
the mean squared error and the mean logarithmic score of these three combination
methods. The log scores and errors are calculated through 500 repeated simulations
and 200 test data points. The left panel shows the logarithmic score (or equivalently, the expected
log predictive density) of the predictive distribution. Stacking of predictive distributions
Figure 2: (a) The left panel shows the expected log predictive density of the combined
distribution under BMA, stacking of means and stacking of predictive distributions.
Stacking of predictive distributions performs best for moderate and large sample sizes.
(b) The middle panel shows the mean squared error treating the posterior mean of ŷ as
a point estimation. Stacking of predictive distributions gives almost the same optimal
mean squared error as stacking of means, both of which perform better than BMA. (c)
The right panel shows the expected log predictive density of stacking and BMA when
adding some more N(4, 1) models to the model list, where sample size is fixed to be 15.
All average log scores and errors are calculated through 500 repeated simulations and
200 test data points generated from the true distribution.
always gives a larger score except for extremely small n. The middle panel shows
the mean squared error obtained by treating the posterior mean of the predictive distribution
as a point estimate, even though this is not our focus. In this case, it is not surprising that
stacking of predictive distributions gives almost the same optimal mean squared error
as stacking of means, both of which are better than BMA. Two distributions
close in KL divergence are close in each moment, while the reverse does not necessarily
hold. This illustrates the necessity of matching the distributions, rather than matching
the moments.
Stacking depends only on the space spanned by all candidate models, while BMA
or Pseudo-BMA weighting may be misled by such model expansion. If we add another
N(4, 1) as the 9th model in the model list above, stacking will not change at all in
theory, even though the optimization becomes non-strictly convex, with infinitely many
equal-height modes. For BMA, adding the duplicate is equivalent to putting double prior
mass on the original 4th model, which doubles its final weight. The right panel of Figure 2
shows this phenomenon: we fix the sample size n to be 15 and add more and more N(4, 1)
models. As a result, BMA (or Pseudo-BMA weighting) puts larger weight on N(4, 1) and
behaves worse, while stacking is essentially unchanged. This illustrates another benefit of
stacking compared to
BMA or Pseudo-BMA weights. If the performance of a combination method decays as
the list of candidate models is expanded, this may indicate disastrous performance if
there are many similar weak models on the candidate list. We are not saying BMA can
never work in this case. In fact, other methods have been proposed to help BMA overcome
such drawbacks. For example, George (2010) proposes dilution priors to compensate
for model space redundancy in linear models, putting smaller weights on those models
that are close to each other. Fokoue and Clarke (2011) introduce prequential model
list selection to obtain an optimal model space. But we propose stacking as a more
straightforward solution.
4.2 Linear subset regressions
The true data-generating model is
$$Y = \beta_1 X_1 + \cdots + \beta_J X_J + \epsilon, \quad \epsilon \sim \mathrm{N}(0, 1).$$
All the covariates $X_j$ are drawn independently from $\mathrm{N}(5, 1)$. The number of predictors $J$ is 15.
The coefficients are generated by
$$\beta_j = \gamma \Big( \mathbb{1}_{\{|j-4|<h\}}\, (h - |j - 4|)^2 + \mathbb{1}_{\{|j-8|<h\}}\, (h - |j - 8|)^2 + \mathbb{1}_{\{|j-12|<h\}}\, (h - |j - 12|)^2 \Big),$$
where $\gamma$ is a scale constant. The value $h$ determines the number of nonzero coefficients in the
true model. For $h = 1$, there are 3 "strong" coefficients. For $h = 5$, there are 15 "weak" coefficients.
In the following simulation, we fix $h = 5$. We consider the following two cases (a data-generating
sketch in R is given after the list):
1. M-open: Each subset contains only one single variable, so the $k$-th model is
a univariate linear regression with the $k$-th variable $X_k$. We have $K = J = 15$
different models in total. One advantage of stacking and Pseudo-BMA weighting
is that they are not sensitive to the prior, hence even a flat prior will work, while
BMA can be sensitive to the prior. For each single model $M_k$: $Y \sim \mathrm{N}(\beta_k X_k, \sigma^2)$,
we set priors $\beta_k \sim \mathrm{N}(0, 10)$ and $\sigma \sim \mathrm{Gamma}(0.1, 0.1)$.
2. M-closed: Let model $k$ be the linear regression with the subset $(X_1, \ldots, X_k)$. There
are still $K = 15$ different models. Similarly, in model $M_k$: $Y \sim \mathrm{N}(\sum_{j=1}^k \beta_j X_j, \sigma^2)$,
we set priors $\beta_j \sim \mathrm{N}(0, 10)$, $j = 1, \ldots, k$, and $\sigma \sim \mathrm{Gamma}(0.1, 0.1)$.
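The data-generating process for these simulations can be sketched as follows; the value of the scale constant gamma and the sample size are illustrative assumptions, not the exact settings of our experiments:

# Simulated regression data: J = 15 covariates, h = 5 gives 15 "weak" coefficients.
J <- 15; h <- 5; gamma <- 1; n <- 100        # gamma and n are illustrative choices
beta <- gamma * sapply(1:J, function(j)
  sum(sapply(c(4, 8, 12), function(ctr) (abs(j - ctr) < h) * (h - abs(j - ctr))^2)))
X <- matrix(rnorm(n * J, 5, 1), n, J)        # covariates drawn independently from N(5, 1)
y <- as.vector(X %*% beta + rnorm(n))        # response with N(0, 1) noise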
In both cases, we have seven methods for combining predictive densities: (1) stacking
of predictive distributions, (2) stacking of means, (3) Pseudo-BMA, (4) Pseudo-BMA+,
(5) best model selection by mean LOO value, (6) best model selection by marginal
likelihood, and (7) BMA. We generate a test dataset $(\tilde x_i, \tilde y_i)$, $i = 1, \ldots, 200$, from the
underlying true distribution to calculate the out-of-sample log scores of the combined
distribution under each method, and we repeat the simulation 100 times to compute
the expected predictive accuracy of each method.
Figure 3: Mean log predictive densities of 7 combination methods in the linear regression
example: the k-th model is a univariate regression with the k-th variable (1 ≤ k ≤ 15).
We evaluate the log predictive densities using 100 repeated experiments and 200 test
data.
Figure 3 shows the expected out-of-sample log predictive densities for the seven
methods, for a set of experiments with sample size n ranging from 5 to 200. Stacking
outperforms all other methods even for small n. Stacking of predictive distributions is
asymptotically better than any other combination method. Pseudo-BMA+ weighting
dominates naive Pseudo-BMA weighting. Finally, BMA performs similarly to Pseudo-
BMA weighting, always better than any kind of model selection, but that advantage
vanishes in the limit since BMA picks up one model. In this M-open setting, model
selection can never be optimal.
The results change when we move to the second case, in which the k-th model con-
tains variables X1 , . . . , Xk so that we are comparing models of differing dimensionality.
The problem is M-closed because the largest subset contains all the variables, and we
have simulated data from this model. Figure 4 shows the mean log predictive densities of
the seven combination methods in this case. For a large sample size n, almost all meth-
ods recover the true model (putting weight 1 on the full model), except BMA and model
selection based on marginal likelihood. The poor performance of BMA comes from the
parameter priors: recall that the optimality of BMA arises when averaging over the
priors and not necessarily conditional on any particular chosen set of parameter values.
There is no general rule for obtaining a "correct" prior that accounts for model complexity in
BMA over an arbitrary model space. Model selection by LOO can recover the true model,
while selection by marginal likelihood cannot due to the same prior problems. Once
Figure 4: Mean log predictive densities of 7 combination methods in the linear regression
example: the k-th model is the regression with the first k variables (1 ≤ k ≤ 15). We
evaluate the log predictive densities using 100 repeated experiments and 200 test data.
again, BMA eventually becomes the same as model selection by marginal likelihood,
which is asymptotically much worse than the other methods.
In this example, stacking is unstable for extremely small n. In fact, our compu-
tations for stacking of predictive distributions and Pseudo-BMA depend on the PSIS
approximation to log p(yi |y−i ). If this approximation is crude, then the second step
optimization cannot be accurate. It is known that the parameter k̂ in the generalized
Pareto distribution can be used to diagnose the accuracy of PSIS approximation. When
k̂ > 0.7 for a datapoint, we cannot trust the PSIS-LOO estimate and so we re-run the
full inference scheme on the dataset with that particular point left out (see Vehtari
et al., 2017b).
Figure 5 shows the comparison of the mean elpd estimated by LOO and the mean
elpd calculated using 200 independent test data for each model and each sample size in
the simulation described above. The area of each dot in Figure 5 represents the relative
complexity of the model as measured by the effective number of parameters divided
by sample size. We evaluate the effective number of parameters using LOO (Vehtari
et al., 2017b). The sample size n varies from 30 to 200 and variable size is fixed to be
20. Clearly, the relationship is far from the line y = x for extremely small sample size,
and the relative bias ratio (elpdloo /elpdtest ) depends on model complexity. Empirically,
we have found the approximation to be poor when the sample size is less than 5 times
the number of parameters. Further diagnostics for PSIS are described by Vehtari et al.
(2017a).
Figure 5: Comparison of the mean elpd estimated by LOO and the mean elpd calculated
from test data, for each model and each sample size in the simulation described above.
The area of each dot represents the relative complexity of the model as measured by
the effective number of parameters divided by the sample size.
As a result, in the small sample case, LOO can lead to relatively large variance,
which makes the stacking of predictive distributions and Pseudo-BMA/ Pseudo-BMA+
unstable, with performance improving quickly as n grows.
The mixture model seems to be the most straightforward continuous model ex-
pansion. Nevertheless, there are several reasons why we may prefer stacking to fitting
a mixture model. Firstly, Markov chain Monte Carlo (MCMC) methods for mixture
models are difficult to implement and generally quite expensive. Secondly, if the sample
size is small or several components in the mixture could do the same thing, the mixture
model can face non-identification or instability problems unless a strong prior is added.
Figure 6 shows a comparison of mixture models and other model averaging methods
in a numerical experiment, in which the true model is
$$Y \sim \mathrm{N}(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3,\, 1), \quad \beta_k \ \text{generated independently from } \mathrm{N}(0, 1),$$
and there are 3 candidate models, each containing one covariate.
One widely used variational family is the mean-field family, in which the parameters are
assumed to be mutually independent: $Q = \{q(\theta) : q(\theta_1, \ldots, \theta_m) = \prod_{j=1}^m q_j(\theta_j)\}$. Some
Figure 7: (1) A multi-modal posterior distribution of (µ1 , µ2 ). (2–3) Posterior draws from
variational inference with different initial values. (4–5) Averaged posterior distribution
using stacking of predictive distributions and Pseudo-BMA+ weighting.
recent progress has been made in running variational inference algorithms in a black-box way. For
example, Kucukelbir et al. (2017) implement Automatic Differentiation Variational Inference (ADVI)
in Stan (Stan Development Team, 2017). Assuming all parameters $\theta$ are continuous and the model
likelihood is differentiable, ADVI transforms $\theta$ into the real coordinate space $\mathbb{R}^m$
through $\zeta = T(\theta)$ and uses the normal approximation $p(\zeta \mid \mu, \sigma^2) = \prod_{j=1}^m \mathrm{N}(\zeta_j \mid \mu_j, \sigma_j^2)$. Plug-
ging this into (10) leads to an optimization problem over $(\mu, \sigma^2)$, which can be solved
by stochastic gradient descent. Under mild conditions it eventually converges to
a local optimum $q^*$. However, $q^*$ may depend on the initialization, since such optimization
problems are in general non-convex, particularly when the true posterior density $p(\theta|y)$ is
multi-modal.
Stacking of predictive distributions and Pseudo-BMA+ weighting can be used to
average several sets of posterior draws coming from different approximation distribu-
tions. To do this, we repeat the variational inference K times. At time k, we start
from a random initial point and use stochastic gradient descent to solve the optimiza-
tion problem (10), ending up with an approximation distribution $q_k^*$. Then we draw $S$
samples $(\theta_k^{(1)}, \ldots, \theta_k^{(S)})$ from $q_k^*(\theta)$ and calculate the importance ratio $r_{i,k}^s$ defined in
(6) as $r_{i,k}^s = 1/p(y_i \mid \theta_k^{(s)})$. After this, the remaining steps follow as before. We obtain
stacking or Pseudo-BMA+ weights $w_k$ and average all the approximation distributions as
$\sum_{k=1}^K w_k\, q_k^*$.
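Given the pointwise log-likelihood matrices evaluated at the draws from each approximation, this recipe reuses the functions sketched earlier; the function and variable names below are our own illustrative choices, not supplement code:

# Averaging K variational approximations. log_lik_list[[k]] is the S x n matrix of
# log p(y_i | theta_k^(s)) evaluated at draws theta_k^(s) from the k-th approximation q_k*.
average_vi_runs <- function(log_lik_list) {
  lpd_loo <- sapply(log_lik_list, loo_lpd_is)   # n x K matrix of log LOO densities
  stacking_weights(lpd_loo)                     # or pseudo_bma_plus(lpd_loo)
}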
Figure 7 gives a simple example of how the averaging strategy helps adjust for the
optimization's sensitivity to initial values. Suppose the data are two-dimensional, $y = (y^{(1)}, y^{(2)})$,
and the parameter is $(\mu_1, \mu_2) \in \mathbb{R}^2$. The likelihood $p(y \mid \mu_1, \mu_2)$ is given by
where xi represents the i-th voter’s preferred ideological position, and xD and xR repre-
sent the ideological positions of the Democratic and Republican candidates, respectively.
In contrast, the i-th voter’s directional comparison is defined by
$M_\gamma: \; y \mid (X, \beta_0, \beta_\gamma) \sim \mathrm{N}(\beta_0 + X_\gamma \beta_\gamma,\ \sigma^2)$.
Entries are posterior mean (standard error):

                   Full model                  BMA                         Stacking of pred. distr.    Pseudo-BMA+ weighting
                   Mean Cand.    Voter-spec.   Mean Cand.    Voter-spec.   Mean Cand.   Voter-spec.    Mean Cand.    Voter-spec.
prox. adv.        -3.05 (1.32)  -2.01 (1.06)  -0.22 (0.95)   0.75 (0.68)   0.00 (0.00)  0.00 (0.00)   -0.02 (0.08)   0.04 (0.24)
direct. adv.       7.95 (2.85)   4.18 (1.36)   3.58 (2.02)   2.36 (0.84)   2.56 (2.32)  1.93 (1.16)    1.60 (4.91)   1.78 (1.22)
incumb. adv.       1.06 (1.20)   1.14 (1.19)   1.61 (1.24)   1.30 (1.24)   0.48 (1.70)  0.34 (0.89)    0.66 (1.13)   0.54 (1.03)
quality adv.       3.12 (1.24)   2.38 (1.22)   2.96 (1.25)   2.74 (1.22)   2.20 (1.71)  2.30 (1.52)    2.05 (2.86)   1.89 (1.61)
spend adv.         0.27 (0.04)   0.27 (0.04)   0.32 (0.04)   0.31 (0.04)   0.31 (0.07)  0.31 (0.03)    0.31 (0.04)   0.30 (0.04)
partisan adv.      0.06 (0.05)   0.06 (0.05)   0.08 (0.06)   0.07 (0.06)   0.01 (0.04)  0.00 (0.00)    0.03 (0.05)   0.03 (0.05)
constant          53.3 (1.2)    52.0 (0.8)    51.4 (1.0)    51.6 (0.8)    51.9 (1.1)   51.6 (0.7)     51.5 (1.2)    51.4 (0.8)
Figure 8: Regression coefficients and standard errors in the voting example, from the full
model (columns 1–2), the averaged subset regression model using BMA (columns 3–4),
stacking of predictive distributions (columns 5–6) and Pseudo-BMA+ (columns 7–8).
Democratic proximity advantage and Democratic directional advantage are two highly
correlated variables. Mean candidate and Voter-specific are two datasets that provide
different measurements on candidates’ ideological placement.
Accounting for the differing model complexities, they used the hyper-$g$ prior (Liang et al., 2008).
Let $\phi$ be the inverse of the variance, $\phi = 1/\sigma^2$. The hyper-$g$ prior with a hyper-parameter
$\alpha$ is
$$p(\phi) \propto \frac{1}{\phi}, \qquad \beta \mid (g, \phi, X) \sim \mathrm{N}\Big(0,\ \frac{g}{\phi} (X^T X)^{-1}\Big), \qquad p(g \mid \alpha) = \frac{\alpha - 2}{2} (1 + g)^{-\alpha/2}, \quad g > 0.$$
The first two columns of Figure 8 show the linear regression coefficients as estimated
using least squares. The remaining columns show the posterior mean and standard er-
ror of the regression coefficients using BMA, stacking of predictive distributions, and
Pseudo-BMA+ respectively. Under all three averaging strategies, the coefficient of prox-
imity advantage is no longer statistically significantly negative, and the coefficient of
directional advantage is shrunk. As fit to these data, stacking puts near-zero weights
on all subset models containing proximity advantage, whereas Pseudo-BMA+ weight-
ing always gives some weight to each model. In this example, averaging subset models
by stacking or Pseudo-BMA+ weighting gives a way to deal with competing variables,
which should be more reliable than BMA according to our previous argument.
• dist: the distance (in meters) to the closest known safe well,
• arsenic: the arsenic level (in 100 micrograms per liter) of the respondent’s well,
• assoc: whether a member of the household is active in any community association,
• educ: the education level of the head of the household.
We start with what we call Model 1, a simple logistic regression with all variables
above as well as a constant term,
$$y \sim \mathrm{Bernoulli}(\theta), \qquad \theta = \mathrm{logit}^{-1}(\beta_0 + \beta_1\,\mathrm{dist} + \beta_2\,\mathrm{arsenic} + \beta_3\,\mathrm{assoc} + \beta_4\,\mathrm{educ}).$$
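As a rough illustration (the Supplementary Material contains the actual code), Model 1 can be fit with the rstanarm package, assuming a data frame wells whose columns carry the variable names above:

library(rstanarm)
fit1 <- stan_glm(switch ~ dist + arsenic + assoc + educ,
                 family = binomial(link = "logit"), data = wells)
loo1 <- loo(fit1)    # PSIS-LOO, later combined across the eight models to form the weights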
Furthermore, we can use splines to capture the nonlinear relationship between the logit
switching probability and the distance or the arsenic level. Model 3 contains B-splines
for the distance and the arsenic level with polynomial degree 2,
$$\theta = \mathrm{logit}^{-1}(\beta_0 + \beta_1\,\mathrm{dist} + \beta_2\,\mathrm{arsenic} + \beta_3\,\mathrm{assoc} + \beta_4\,\mathrm{educ} + \alpha_{\mathrm{dis}} B_{\mathrm{dis}} + \alpha_{\mathrm{ars}} B_{\mathrm{ars}}),$$
where $B_{\mathrm{dis}}$ is the B-spline basis for distance, of the form $(B_{\mathrm{dis},1}(\mathrm{dist}), \ldots, B_{\mathrm{dis},q}(\mathrm{dist}))$,
and $\alpha_{\mathrm{dis}}, \alpha_{\mathrm{ars}}$ are vectors of coefficients. We fix the number of spline knots to be 10.
Models 4 and 5 are similar models with degree-3 and degree-5 B-splines, respectively.
Next, we can add a bivariate spline term $\alpha_{\mathrm{dis,ars}} B_{\mathrm{dis,ars}}$ to capture nonlinear interactions,
where $B_{\mathrm{dis,ars}}$ is the bivariate spline basis with degree 2×2, 3×3, and 5×5 in
Models 6, 7, and 8 respectively.
Figure 9 shows the inference results in all 8 models, which are summarized by the
posterior mean, 50% confidence interval and 95% confidence interval of the probability
of switching from an unsafe well as a function of the distance or the arsenic level. The
other variables, assoc and educ, are fixed at their means. It is not obvious from these
results which one is the best model. Spline models give a more flexible shape, but also
introduce more variance for posterior estimation.
Finally, we run stacking of predictive distributions and Pseudo-BMA+ to combine
these 8 models. The calculated model weights are printed above each panel in Figure 9.
For both combination methods, Model 5 (univariate splines with degree 5) accounts
Figure 9: The posterior mean, 50% and 95% confidence interval of the well switching
probability in Models 1–8. For each model, the switching probability is shown as a
function of (a) the distance to the nearest safe well or (b) the arsenic level of the existing
well. In each subplot, other input variables are held constant. The model weights by
stacking of predictive distributions and Pseudo-BMA+ are printed above each panel.
Figure 10: The posterior mean, 50% and 95% confidence interval of the well switching
probability in the combined model via stacking of predictive distributions. Pseudo-
BMA+ weighting gives a similar result for the combination.
for the majority share. Model 8 is the most complicated one, but both stacking and
Pseudo-BMA+ avoid overfitting by assigning it a negligible weight.
Figure 10 shows the posterior mean, 50% confidence interval, and 95% confidence
interval of the switching probability in the stacking-combined model. Pseudo-BMA+
weighting gives a similar combination result for this example. At first glance, the combination
looks quite similar to Model 5, and it may not seem necessary to put an
extra 0.09 weight on Model 1 in the stacking combination, since Model 1 is completely
contained in Model 5 when $\alpha_{\mathrm{dis}} = \alpha_{\mathrm{ars}} = 0$. However, Model 5 is not perfect, since
it predicts that the posterior mean of switching probability will decrease as a function
of the distance to the nearest safe well, for small distances. In fact, without further
control, boundary fluctuations are a common drawback of higher-order splines. This decreasing
trend near the left boundary is flatter in the combined distribution, since the combination
includes a share of the plain logistic regression (under the stacking weights) or of lower-order
splines (under the Pseudo-BMA+ weights). In this example the sample size n = 3020 is large,
hence we have reason to believe stacking of predictive distributions gives the optimal combination.
5 Discussion
5.1 Sparse structure and high dimensions
Yang and Dunson (2014) propose to combine multiple point forecasts, $f = \sum_{k=1}^K w_k f_k$,
using a Dirichlet aggregation prior, $w \sim \mathrm{Dirichlet}\big(\frac{\alpha}{K^\gamma}, \ldots, \frac{\alpha}{K^\gamma}\big)$, and adaptive
regression. Their goal is to impose a sparsity structure (certain models can receive
zero weights). They show their combination algorithm can achieve the minimax squared
risk among all convex combinations,
$$\sup_{f_1, \ldots, f_K \in \mathcal{F}_0}\ \inf_{\hat f}\ \sup_{f_\lambda^* \in \mathcal{F}_\Gamma} \mathrm{E}\,\|\hat f - f_\lambda^*\|^2.$$
Our explanation is that when the model list is large, the convex span should be large
enough to approximate the true model, and this is why we prefer adding
stronger priors to stabilize the estimation of the weights in high dimensions.
Supplementary Material
Supplementary Material to “Using Stacking to Average Bayesian Predictive Distribu-
tions” (DOI: 10.1214/17-BA1091SUPP; .pdf).
References
Adams, J., Bishin, B. G., and Dow, J. K. (2004). “Representation in Congressional
Campaigns: Evidence for Discounting/Directional Voting in U.S. Senate Elections.”
Journal of Politics, 66(2): 348–373. 936
Akaike, H. (1978). “On the likelihood of a time series model.” The Statistician, 217–235.
920
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons.
918
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A
review for statisticians.” Journal of the American Statistical Association, 112(518):
859–877. 934
Breiman, L. (1996). “Stacked regressions.” Machine Learning, 24(1): 49–64. 919, 930
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multi-Model Inference:
A Practical Information-Theoretic Approach. Springer, 2nd edition. MR1919620.
920
Clarke, B. (2003). “Comparing Bayes model averaging and stacking when model approx-
imation error cannot be ignored.” Journal of Machine Learning Research, 4: 683–712.
919, 934, 940
Clyde, M. and Iversen, E. S. (2013). “Bayesian model averaging in the M-open frame-
work.” In Damien, P., Dellaportas, P., Polson, N. G., and Stephens, D. A. (eds.),
Bayesian Theory and Applications, 483–498. Oxford University Press. 918, 919, 923
Fokoue, E. and Clarke, B. (2011). “Bias-variance trade-off for prequential model list
selection.” Statistical Papers, 52(4): 813–833. 930
Geisser, S. and Eddy, W. F. (1979). “A Predictive Approach to Model Selection.” Jour-
nal of the American Statistical Association, 74(365): 153–160. 920
Gelfand, A. E. (1996). “Model determination using sampling-based methods.” In Gilks,
W. R., Richardson, S., and Spiegelhalter, D. J. (eds.), Markov Chain Monte Carlo in
Practice, 145–162. Chapman & Hall. 920
Gelman, A. (2004). “Parameterization and Bayesian modeling.” Journal of the Ameri-
can Statistical Association, 99(466): 537–545. 917
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multi-
level/Hierarchical Models. Cambridge University Press. 937
Gelman, A., Hwang, J., and Vehtari, A. (2014). “Understanding predictive infor-
mation criteria for Bayesian models.” Statistics and Computing, 24(6): 997–1016.
MR3253850. doi: https://doi.org/10.1007/s11222-013-9416-2. 925
George, E. I. (2010). “Dilution priors: Compensating for model space redundancy.” In
Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D.
Brown, 158–165. Institute of Mathematical Statistics. MR2798517. 930
Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econo-
metrics, 164(1): 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.
2011.02.017. 920
Geweke, J. and Amisano, G. (2012). “Prediction with misspecified models.” American
Economic Review , 102(3): 482–486. 920
Gneiting, T. and Raftery, A. E. (2007). “Strictly proper scoring rules, prediction, and es-
timation.” Journal of the American Statistical Association, 102(477): 359–378. 917,
922, 923
Gutiérrez-Peña, E. and Walker, S. G. (2005). “Statistical decision problems and
Bayesian nonparametric methods.” International Statistical Review , 73(3): 309–330.
921
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). “Bayesian
model averaging: A tutorial.” Statistical Science, 14(4): 382–401. MR1765176.
doi: https://doi.org/10.1214/ss/1009212519. 918, 919
Key, J. T., Pericchi, L. R., and Smith, A. F. M. (1999). “Bayesian model choice: What
and why.” Bayesian Statistics, 6: 343–370. 918, 923
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). “Auto-
matic differentiation variational inference.” Journal of Machine Learning Research,
18(1): 430–474. MR3634881. 935
Le, T. and Clarke, B. (2017). “A Bayes interpretation of stacking for M-
complete and M-open settings.” Bayesian Analysis, 12(3): 807–829. MR3655877.
doi: https://doi.org/10.1214/16-BA1023. 920, 923
LeBlanc, M. and Tibshirani, R. (1996). “Combining estimates in regression and classi-
fication.” Journal of the American Statistical Association, 91(436): 1641–1650. 919
Li, M. and Dunson, D. B. (2016). “A framework for probabilistic inferences from im-
perfect models.” ArXiv e-prints:1611.01241. 921, 925
Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). “Mix-
tures of g priors for Bayesian variable selection.” Journal of the American
Statistical Association, 103(481): 410–423. MR2420243. doi: https://doi.org/
10.1198/016214507000001337. 937
Madigan, D., Raftery, A. E., Volinsky, C., and Hoeting, J. (1996). “Bayesian model
averaging.” In Proceedings of the AAAI Workshop on Integrating Multiple Learned
Models, 77–83. MR1820767. doi: https://doi.org/10.1214/ss/1009212814. 918
Merz, C. J. and Pazzani, M. J. (1999). “A principal components approach to combining
regression estimates.” Machine Learning, 36(1–2): 9–32. 919
Montgomery, J. M. and Nyhan, B. (2010). “Bayesian model averaging: Theoretical de-
velopments and practical applications.” Political Analysis, 18(2): 245–270. 936
Piironen, J. and Vehtari, A. (2017). “Comparison of Bayesian predictive methods for
model selection.” Statistics and Computing, 27(3): 711–735. 917
Rubin, D. B. (1981). “The Bayesian bootstrap.” Annals of Statistics, 9(1): 130–134.
925
Smyth, P. and Wolpert, D. (1998). “Stacked density estimation.” In Advances in Neural
Information Processing Systems, 668–674. 928
Stan Development Team (2017). Stan modeling language: User’s guide and reference
manual . Version 2.16.0, http://mc-stan.org/. 935
Stone, M. (1977). “An Asymptotic Equivalence of Choice of Model by Cross-Validation
and Akaike’s Criterion.” Journal of the Royal Statistical Society. Series B (Method-
ological), 39(1): 44–47. MR0501454. 920
Ting, K. M. and Witten, I. H. (1999). “Issues in stacked generalization.” Journal of
Artificial Intelligence Research, 10: 271–289. 919
Vehtari, A., Gelman, A., and Gabry, J. (2017a). “Pareto smoothed importance sam-
pling.” ArXiv e-print:1507.02646. 920, 924, 932
Vehtari, A., Gelman, A., and Gabry, J. (2017b). “Practical Bayesian model evaluation
using leave-one-out cross-validation and WAIC.” Statistics and Computing, 27(5):
1413–1432. 920, 924, 925, 932
Vehtari, A. and Lampinen, J. (2002). “Bayesian model assessment and comparison using
cross-validation predictive densities.” Neural Computation, 14(10): 2439–2468. 925
Vehtari, A. and Ojanen, J. (2012). “A survey of Bayesian predictive methods for model
assessment, selection and comparison.” Statistics Surveys, 6: 142–228. MR3011074.
917, 920, 921, 924
Wagenmakers, E.-J. and Farrell, S. (2004). “AIC model selection using Akaike weights.”
Psychonomic bulletin & review , 11(1): 192–196. 920
Watanabe, S. (2010). “Asymptotic Equivalence of Bayes Cross Validation and Widely
Applicable Information Criterion in Singular Learning Theory.” Journal of Machine
Learning Research, 11: 3571–3594. MR2756194. 920
Wolpert, D. H. (1992). “Stacked generalization.” Neural Networks, 5(2): 241–259. 919
Wong, H. and Clarke, B. (2004). “Improvement over Bayes prediction in small samples
in the presence of model uncertainty.” Canadian Journal of Statistics, 32(3): 269–283.
919
Yang, Y. and Dunson, D. B. (2014). “Minimax Optimal Bayesian Aggregation.” ArXiv
e-prints:1403.1345. 919, 940
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). “Supplementary Material to
“Using stacking to average Bayesian predictive distributions”.” Bayesian Analysis.
doi: https://doi.org/10.1214/17-BA1091SUPP. 918
Acknowledgments
We thank the U.S. National Science Foundation, Institute for Education Sciences, Office of
Naval Research, and Defense Advanced Research Projects Agency (under agreement
number D17AC00001) for partial support of this work. We also thank the Editor, Associate
Editor, and two anonymous referees for their valuable comments that helped strengthen this
manuscript.
Invited Discussion
Bertrand Clarke∗
bclarke3@unl.edu
in which the subscript −i indicates that the i-th data point, i = 1, . . . , n, has been left
out. It is also assumed that model Mk is defined by a conditional density for Yi and
includes a prior p(θk ) where θk is the parameter for model Mk . Using (2) in (1) and
reverting to the initial definition of the score function, the stacking weights are
$$(\hat w_1, \ldots, \hat w_K) = \arg\max_{w \in \mathcal{S}^K} \frac{1}{n} \sum_{i=1}^n S\Big( \sum_{k=1}^K w_k\, \hat p_{k,-i}(\cdot),\; y_i \Big), \qquad (3)$$
assuming they exist and are unique. It is important to note the interchangeability of
pT and Yi = yi which is ‘true’ in the sense that it is a valid outcome of pT . Now, the
‘stack’ of predictive densities is
K
#
p̂(yn+1 | Y = y) = ŵk p(yn+1 | Y = y, Mk ), (4)
k=1
where the coefficients come from (3). Expression (4) can be used to obtain point and
interval predictors for Yn+1 .
ordering is imposed by the analysis. Parallel to this generality, and given the ubiquity
of data that is not plausibly stochastic, it makes sense to merge the definitions of M-
complete and M-open and redefine M-open as those DG’s that are not meaningfully
described by any stochastic process, i.e., no true probability model exists. Indeed, the
definition of M-complete and M-open offered originally in Bernardo and Smith (2000)
p. 384-5 allows for this but permits M-complete and M-open problems to overlap.
Redefining the M-open class of problems so it is disjoint from the M-complete (and
M-closed) classes of problems seems logical. Doing this also helps focus on model list
selection which deserves more attention; see the comparisons in Clarke et al. (2013).
In the context of the Prequential Principle, the assumption built into the paper
(and comments) is that stacking the posterior predictive densities is done using only
the first n data points and the goal is to predict the (n + 1)-st observation. That is, n data points (xi , yi )
for i = 1, . . . , n are available to form a predictor such as p̂ and that the predictor is
evaluated at xn+1 to predict Yn+1 (xn+1 ) and its performance assessed in some way that
does not involve S. Implicitly, it is assumed the prediction problem will be repeated
many times and we are looking only at the n-th stage so as to examine the updating of
the predictor. In this way the authors’ framework may be seen as Prequential (although
this is a point on which reasonable people might disagree). Aside from the formula
(4), updating can include changing the models, the Mk ’s, or even the model averaging
technique itself. This is done chiefly by comparing the predictions with the realized
values. Here, the xi ’s are regarded as deterministic explanatory variables and the Yi ’s
are random. In much of the paper, the xi ’s are suppressed in the notation.
In their Gaussian mixture model example (Section 4.1), the authors treat their prob-
lem as M-open because the true model is not on the model list. While this is reasonable
for the sake of argument, it actually underscores the importance of model list selection
because choosing a better model list would make the problem M-closed. Nevertheless,
in this example, the authors make a compelling point by comparing three different pre-
dictors: Bayes model averaging (BMA), stacking of means (under squared error), and
stacking of predictive distributions by using a log score as in (4). Figure 1 shows that
BMA converges to the model on the model list closest to the true model, i.e., BMA has
unavoidable (and undetectable) bias. By contrast, stacking of means and stacking of
predictive distributions both do well in terms of their means (Figure 2, middle panel)
and stacking of predictive densities outperforms both BMA and stacking of means in
other senses (Figure 2, left and right panels) because, as shown in Figure 1, it converges
to the correct predictive distribution. The distribution associated with the stacking
of means converges to a broad lump that does not seem useful. This example shows
that matching whole distributions is feasible and sensible. It also shows BMA does not
routinely perform well despite its asymptotic optimality; see Raftery and Zheng (2003).
In the example of Section 4.2, Figures 3 and 4 show that, again, stacking means
and stacking predictive densities are the best among seven model averaging methods
while BMA ties for fourth place or is in last place. Other comparisons have found BMA
to have similarly disappointing finite sample behavior. (There is good evidence that a
technique called Pseudo-BMA+ is competitive with the two versions of stacking.)
In the example of Section 4.3 where the goal is to obtain a density, the authors show
stacking predictive densities and Pseudo-BMA+ outperform four other techniques for
obtaining a density; one is BMA. The remaining examples show further properties of
stacking predictive densities (Section 4.4) or use the method to do predictive inference
(Sections 4.5 and 4.6). The results on real data seem reasonable.
A remaining question is the relationship between using predictions from stacked
predictive distributions and the Prequential Principle. Obviously, point predictors from
stacked predictive distributions can be compared directly to outcomes and hence the
Prequential Principle is satisfied. However, one of the authors’ arguments is that ob-
taining a full distribution, as can be done by using their method, is better than simply
using point predictors; this is justified by using what appear to be non-Prequential as-
sessments such as the mean log density; see Figures 2, 3, 4, and 6. First, the log-score
was used to form the stacks. So, is log really the right way to assess performance? Sec-
ond, all too often, taking a mean requires a distribution to exist and so the evaluation of
the predictor may therefore depend on the true distribution if only to define the mode
of convergence. It’s hard to tell if this is the case with the present predictor; these are
points on which the authors should comment. Moreover, effectively, the new method
gives a prediction distribution (4) that leaves us with two questions: i) How should
we use the distributional information, including that from the smoothing and Bayesian
bootstrap, assuming it’s valid? and ii) Is the distributional information associated with
the stacking of means or densities valid, i.e., an accurate representation of its variability?
The answer to i) might simply be the obvious: It’s a distribution and therefore any
operation we might wish to perform on it, e.g., extract interval or percentile predictors,
is feasible and it can be assessed in the score function of our choice. Of course, in
practice, we do not know the actual distribution of the future outcome; we have only an
estimate of it that we hope is good. Perhaps the consistency statement in Section 3.2
is enough. The answer to ii) seems to require more thought on what exactly the Pareto
smoothing and Bayesian bootstrap are producing. This is important because an extra
quantity, the score function, has been introduced and the solution in (4) – and hence
the distribution assigned to stacked means or predictive densities – can depend on it,
possibly delicately. The effect of the score function and the validity of the distribution
the authors have associated to stacking of means are points for which the authors might
be able to provide some insight.
Recall the idea behind Shtarkov’s original formulation is that prediction can be
treated as a game in which a Forecaster chooses a density q (for prediction) and Nature
chooses an outcome y not constrained by any rule. The payoff to Nature (or loss to
the Forecaster) is log q(y), i.e., log loss. Naturally, the Forecaster wants to minimize the
loss. So, assume the Forecaster has ‘Advisors’ represented by densities pθ ; each Advisor
corresponds to a θ. The advisors announce their densities before Nature issues a y. If the
Forecaster has a pre-game idea about the relative abilities of the advisors to give good
advice, this may be formulated into a prior p(θ). Now it makes sense for the Forecaster
to minimize the maximum regret, i.e., to seek the smallest loss (incurred by the best
Advisor). This means finding the q that minimizes
$$\sup_y \left\{ \log \frac{1}{q(y)} - \inf_\theta \log \frac{1}{p(\theta)\, p(y \mid \theta)} \right\}. \qquad (5)$$
The solution exists in closed form and can be computed; see Le and Clarke (2016).
Following the authors, consider replacing the log loss in (5) by a general score function,
S. Now, the Forecaster wants the q ∈ Q, say qopt,S , that minimizes
$$\sup_y \left\{ S(q(\cdot), y) - \inf_\theta S\big(p(\theta)\, p(\cdot \mid \theta),\; y\big) \right\}, \qquad (6)$$
References
Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. Chichester, West
Sussex, UK: John Wiley and Sons. MR1274699. doi: https://doi.org/10.1002/
9780470316870. 947
Clarke, J., Yu, C.-W., and Clarke, B. (2013). “Prediction in M-complete prob-
lems with limited sample size.” Bayesian Analysis, 8: 647–690. MR3102229.
doi: https://doi.org/10.1214/13-BA826. 947
Dawid, A. P. (1984). “Present Position and Potential Developments: Some Personal
Views: Statistical Theory: The Prequential Approach.” Journal of the Royal Statis-
tical Society. Series A, 147: 278–292. MR0763811. doi: https://doi.org/10.2307/
2981683. 946
Le, T. and Clarke, B. (2016). “Using the Bayesian Shtarkov solution for pre-
dictions.” Computational Statistics & Data Analysis, 104: 183–196. MR3540994.
doi: https://doi.org/10.1016/j.csda.2016.06.018. 948, 949
Invited Discussion
Meng Li∗†
1 Introduction
I would like to offer my congratulations to Yao et al. for a welcome and interesting
addition to the growing literature on model averaging. Earlier papers on stacking cited
by the authors have mostly focused on averaging models to improve point estimation.
The authors now demonstrate that the same idea can be extended to the combination of
predictive distributions in the M-open case. In the next several pages, I will first review
the paper connecting it to the related literature (Section 2), then comment on the M-
complete case (Section 3) and an application that may be favorable to the proposed
method (Section 4). Section 5 concludes this comment.
2 Overview
Suppose we have a list of parametric models under consideration M = {M1 , . . . , MK }
for the observations y (n) = {y1 , . . . , yn } ∈ Y n with Y the sample space. Yao et al.
(2017) address a general problem of model aggregation from an interesting perspective:
how to average the multiple models in M such that the resulting model combination
has an optimal predictive distribution. This distinguishes its goal from two areas that
have been well studied, i.e., weighing models targeted to an optimal point prediction
and selecting a single model possibly with uncertainty quantification. Yao et al. (2017)
particularly focus on the M-open case (Bernardo and Smith, 1994) to allow the true
model to fall outside of M.
One of the most popular approaches is to use Bayesian model probabilities pr(Mk |
y (n) ) as weights, with these weights forming the basis of Bayesian Model Averaging
(BMA). Philosophically, in order to interpret pr(Mj | y (n) ) as a model probability, one
must rely on the assumption that one of the models in the list M is exactly true,
known as the M-closed case. This assumption is arguably always flawed, although one
can still use pr(Mk | y (n) ) as a model weight from a pragmatic perspective, regardless
of the question of interpretation. In the case of M-complete or M-open, an alternative
approach is to formulate the model selection problem in a decision theoretic framework,
selecting the model in M that maximizes expected utility. Yao et al. (2017) adopt a
stacking approach along the line of this decision theoretic framework (Bernardo and
Smith, 1994; Gutiérrez-Peña et al., 2009; Clyde and Iversen, 2013).
There are various scoring rules available when defining the utility function in a de-
cision theoretic framework. The choice can and probably should depend on the specific application.
• Stacking using proper scoring rules. One may obtain the weights by optimizing an approximated version of (3):
$$\min_{w \in \mathcal{S}_1^K}\, d\Bigl(\sum_{k=1}^{K} w_k\, p(\cdot \mid y, M_k),\ \hat{p}_t(\cdot \mid y)\Bigr) \quad\text{or}\quad \max_{w \in \mathcal{S}_1^K}\, S\Bigl(\sum_{k=1}^{K} w_k\, p(\cdot \mid y, M_k),\ \hat{p}_t(\cdot \mid y)\Bigr), \qquad (1)$$
where $\hat{p}_t(\tilde{y}_i)$ is a nonparametric Bayes model and other notation is defined in Yao et al. (2017).
• Pseudo-BMA. We replace the empirical observations used in Section 3.4 by the nonparametric surrogate. Specifically, the quantity $\mathrm{elpd}^{k}$ can be estimated by
$$\widehat{\mathrm{elpd}}^{\,k} \;=\; \sum_{i=1}^{n} \int \hat{p}_t(\tilde{y}_i)\, \log p(\tilde{y}_i \mid y, M_k)\, d\tilde{y}_i \qquad (2)$$
instead of the $\widehat{\mathrm{elpd}}^{\,k}_{\mathrm{loo}}$ used in the paper, and the final weights become
$$w_k \;=\; \frac{\exp\bigl(\widehat{\mathrm{elpd}}^{\,k}\bigr)}{\sum_{k=1}^{K} \exp\bigl(\widehat{\mathrm{elpd}}^{\,k}\bigr)}. \qquad (3)$$
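A small numerical sketch of (2)-(3) in R (the reference draws and the component predictive densities below are made up, and the sum over i is collapsed to a single future point for brevity):

set.seed(1)
S <- 5000
draw_ref <- function(S) rnorm(S, 0.2, 1.1)                    # stand-in draws from the reference p_hat_t
log_pred <- list(function(y) dnorm(y, 0, 1, log = TRUE),      # stand-ins for log p(y_tilde | y, M_k)
                 function(y) dnorm(y, 0.5, 1, log = TRUE),
                 function(y) dt(y, df = 3, log = TRUE))
elpd_hat <- sapply(log_pred, function(lp) mean(lp(draw_ref(S))))  # Monte Carlo version of (2)
w <- exp(elpd_hat - max(elpd_hat))                                # softmax, as in (3); subtracting the max is only for numerical stability
w / sum(w)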
Much of the existing literature concerns estimating the divergence between two unknown densities based on samples from these densities (Leonenko et al.,
2008; Pérez-Cruz, 2008; Bu et al., 2018). Here the setting is somewhat different as there
is only one sample, but the local likelihood methods of Lee and Park (2006) and the
Bayesian approach of Viele (2007) can potentially be used, among others.
Furthermore, all of these methods, whether in a decision theoretic framework or BMA, focus on providing weights for model aggregation and are not useful for goodness-of-fit assessment of (absolute) model adequacy. The nonparametric reference model in the M-complete case enables assessment of the quality of each individual model in M as well as of the entire model list. One of course needs to specify an absolute scale to
define what is adequate, but rules of thumb such as the one provided by Li and Dunson
(2018) based on the convention of Bayes factors may be obtainable.
4 Data integration
Section 5.3 makes a great point that an ideal case for stacking is that the K models in
the model list are orthogonal. This ideal case is not fully demonstrated by the paper, but
it insightfully points to a possible solution to a challenging problem—data integration.
Modern techniques enable researchers to acquire rich data from multiple platforms, making it possible to combine fundamentally different data types into a single decision that is hopefully more informative than any decision based on an individual data source. In response to this demand, there has been a recent surge of interest
in data integration expanding into a variety of emerging areas, for example, imaging
genetics, omics data, and analysis of covariate adjusted complex objects (such as func-
tions, images, trees, and networks). One concrete example that I have been working on
comprises a cohort of patients with demographic, clinical and omics data; the omics data
include single nucleotide polymorphisms (SNPs), expression, and micro-ribonucleic acids
(miRNAs). In these cases, the dramatic heterogeneity across data structures, which is one of the root causes of failure for many attempts, especially those trying to map the various data structures to a common space such as Euclidean space, seems to be a characteristic favorable to the stacking approach. The sample size is usually not large, so even the cross-validation approach without approximation may be computationally manageable.
5 Summary
To summarize, Yao et al. (2017) have tackled the model averaging problem, one of the fundamental tasks in statistics. They have proposed improvements to existing stacking methods for the stacking of densities. The method requires computing many leave-one-out posterior distributions to approximate the expected utility, and the authors propose to use Pareto smoothed importance sampling to scale up the implementation.
I would like to thank Yao et al. for writing an interesting paper. I appreciated that
the paper has several detailed and thoughtful demonstrations to compare the proposed
methods to existing model weights and help readers understand how stacking and BMA
behave differently. The integration with R and Stan makes the method immediately
available to practitioners. I find the work useful and expect the proposed methods to positively impact model averaging and its application to a wide range of problems in practice. I hope the comments on the M-complete case and a possible application to data integration add some useful insights to a paper already rich in content.
References
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, New York, NY.
MR1274699. doi: https://doi.org/10.1002/9780470316870. 951, 953
Bhattacharya, A., Pati, D., and Dunson, D. (2014). Anisotropic function estimation
using multi-bandwidth Gaussian processes. The Annals of Statistics, 42(1):352–381.
MR3189489. doi: https://doi.org/10.1214/13-AOS1192. 952
Bu, Y., Zou, S., Liang, Y., and Veeravalli, V. V. (2018). Estimation of KL divergence:
optimal minimax rate. IEEE Transactions on Information Theory, 64(4):2648–2674.
MR3782280. doi: https://doi.org/10.1109/TIT.2018.2805844. 954
Castillo, I. (2014). On Bayesian supremum norm contraction rates. The Annals of Statis-
tics, 42(5):2058–2091. MR3262477. doi: https://doi.org/10.1214/14-AOS1253.
952
Claeskens, G. and Hjort, N. L. (2003). The Focused Information Criterion.
Journal of the American Statistical Association, 98(464):900–916. MR2041482.
doi: https://doi.org/10.1198/016214503000000819. 952
Clyde, M. A. and Iversen, E. S. (2013). Bayesian model averaging in the M-open
framework. In Damien, P., Dellaportas, P., Polson, N. G., and Stephens, D. A., ed-
itors, Bayesian Theory and Applications, pages 483–498. Oxford University Press.
MR3221178. doi: https://doi.org/10.1093/acprof:oso/9780199695607.003.
0024. 951, 952
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparamet-
ric Bayesian Inference. Cambridge University Press, Cambridge. MR3587782.
doi: https://doi.org/10.1017/9781139029834. 952
Gutiérrez-Peña, E., Rueda, R., and Contreras-Cristán, A. (2009). Objective para-
metric model selection procedures from a Bayesian nonparametric perspec-
tive. Computational Statistics & Data Analysis, 53(12):4255–4265. MR2744322.
doi: https://doi.org/10.1016/j.csda.2009.05.018. 951, 952
Gutiérrez-Peña, E. and Walker, S. G. (2005). Statistical decision problems and Bayesian
nonparametric methods. International Statistical Review, 73(3):309–330. 952
Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G., editors (2010).
Bayesian Nonparametrics. Cambridge University Press, Cambridge. MR2722988.
doi: https://doi.org/10.1017/CBO9780511802478.002. 952
Lee, Y. K. and Park, B. U. (2006). Estimation of Kullback–Leibler divergence by
local likelihood. Annals of the Institute of Statistical Mathematics, 58(2):327–340.
MR2246160. doi: https://doi.org/10.1007/s10463-005-0014-8. 954
956 Invited Discussion
Leonenko, N., Pronzato, L., and Savani, V. (2008). A class of Rényi information es-
timators for multidimensional densities. The Annals of Statistics, 36(5):2153–2182.
MR2458183. doi: https://doi.org/10.1214/07-AOS539. 954
Li, M. and Dunson, D. B. (2018). Comparing and weighting imperfect models using
D-probabilities. arXiv preprint arXiv:1611.01241v3. 953, 954
Li, M. and Ghosal, S. (2017). Bayesian detection of image boundaries. The An-
nals of Statistics, 45(5):2190–2217. MR3718166. doi: https://doi.org/10.1214/16-
AOS1523. 952
Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). Mix-
tures of g priors for Bayesian variable selection. Journal of the American Statis-
tical Association, 103(481):410–423. MR2420243. doi: https://doi.org/10.1198/
016214507000001337. 952
Pérez-Cruz, F. (2008). Kullback–Leibler divergence estimation of continuous distribu-
tions. In IEEE International Symposium on Information Theory – Proceedings, pages
1666–1670. IEEE. 954
Rigollet, P. and Tsybakov, A. B. (2012). Sparse Estimation by Exponential Weight-
ing. Statistical Science, 27(4):558–575. MR3025134. doi: https://doi.org/10.1214/
12-STS393. 953
Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete, 4(1):10–26. 952
Shen, W. and Ghosal, S. (2015). Adaptive Bayesian procedures using random
series priors. Scandinavian Journal of Statistics, 42(4):1194–1213. MR3426318.
doi: https://doi.org/10.1111/sjos.12159. 952
Smyth, P. and Wolpert, D. (1998). Stacked density estimation. In Advances in neural
information processing systems, pages 668–674. 952
van der Vaart, A. W. and van Zanten, J. H. (2009). Adaptive Bayesian estimation using
a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics,
37(5B):2655–2675. MR2541442. doi: https://doi.org/10.1214/08-AOS678. 952
Vehtari, A. and Lampinen, J. (2002). Bayesian model assessment and comparison using
cross-validation predictive densities. Neural Computation, 14(10):2439–2468. 952
Vehtari, A., Ojanen, J., et al. (2012). A survey of bayesian predictive methods for model
assessment, selection and comparison. Statistics Surveys, 6:142–228. MR3011074.
doi: https://doi.org/10.1214/12-SS102. 952
Viele, K. (2007). Nonparametric estimation of Kullback–Leibler information illus-
trated by evaluating goodness of fit. Bayesian Analysis, 2(2):239–280. MR2312281.
doi: https://doi.org/10.1214/07-BA210. 954
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2017). Using stacking to average
Bayesian predictive distributions. Bayesian Analysis, pages 1–28. 951, 952, 953, 954
Invited Discussion
Peter Grünwald∗ and Rianne de Heide†
Yao et al. (2018) aim to improve Bayesian model averaging (BMA) in the M-open
(misspecified) case by replacing it with stacking, which is extended to combine predictive
distributions rather than point estimates. We generally applaud the program to adjust
Bayesian methods to better deal with M-open cases and we can definitely see merit
in stacking-based approaches. Yet, we feel that the main method advocated by Yao
et al. (2018), which stacks based on the log score, while often outperforming BMA,
fails to address a crucial problem of the M-open-BMA setting. This is the problem of
hypercompression as identified by Grünwald and Van Ommen (2017), and shown also
to occur with real-world data by De Heide (2016). We explore this issue in Section 2;
first, we very briefly compare stacking to a related method called switching.
Figure 1: The conditional expectation E[Y |X] according to the predictive distribution
found by stacking (red), standard BMA (green) and SafeBayesian regression (blue),
based on models M1 , . . . , M30 with polynomial basis functions, given 50 data points
sampled i.i.d. ∼ P ∗ , of which approximately half are placed in (0, 0). The true regres-
sion function is depicted in black. Behaviour of stacking and standard BMA slowly
improves as sample size increases and becomes comparable to SafeBMA around n = 80
for stacking and n = 120 for BMA. Implementation details are given at the end of the
section.
Although model selection and stacking based on the LOO log score yield very similar results, we do think that there is an issue here: stacking in itself
is not sufficient to get useful weighted combinations of Bayes predictive distributions in
some small-sample situations where such combinations do exist.
But why would this be undesirable? It turns out that the predictive distribution p(· | y−i, Mk) in (2) can be significantly closer to P∗ in terms of KL divergence than any of the elements inside Mk, by being a mixture of elements of Mk that are themselves all very 'bad', i.e., very far from P∗ in terms of KL divergence (see in particular Figures 7 and 8 of Grünwald and Van Ommen (2017)). As a result, using a log-score-oriented averaging procedure, whether it be BMA or stacking, one can select an Mk whose predictive is good, at sample size i, in log score, but quite bad in terms of just about any other measure. For example, consider a linear model Mk as above. For such models, for fixed σ², as a function of β, the KL divergence D(P∗ ∥ p_{β,σ²}) := E_{X∼P∗} E_{Y∼P∗|X}[log p∗(Y | X)/p_{β,σ²}(Y | X)] is linearly increasing in the mean squared error E_{X,Y∼P∗}(Y − β^T X)². Therefore, one commonly expects a predictive distribution p(y_i | x_i) that behaves well in terms of log score (close in KL divergence to P∗) to also be good at predicting y_i as a function of the newly observed x_i in terms of the squared prediction error. Yet, this is true only if p is actually of the form p_{β,σ²} ∈ Mk; the Bayes predictive distribution, being a mixture, is simply not of this form and can be good at the log score yet very bad at squared error predictions.
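To spell out the linearity claim above (a routine expansion, with the notation of the preceding paragraph): writing $p_{\beta,\sigma^2}(y \mid x)$ for the $N(\beta^{T}x, \sigma^2)$ density,

$$D(P^{*}\,\|\,p_{\beta,\sigma^2}) \;=\; \mathrm{E}_{X,Y\sim P^{*}}\log p^{*}(Y \mid X) \;+\; \tfrac{1}{2}\log(2\pi\sigma^2) \;+\; \frac{\mathrm{E}_{X,Y\sim P^{*}}\,(Y-\beta^{T}X)^2}{2\sigma^2},$$

so for fixed $\sigma^2$ the divergence is an increasing affine function of the mean squared error.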
Now it might of course be argued that none of this matters: stacking for the log
score was designed to come up with a predictive that is good in terms of log score. . .
and it does! Indeed, if one really deals with a practical prediction problem in which
one’s prediction quality will be directly measured by log score, then stacking with the
log score should work great. But to our knowledge, the only such problems are data
compression problems in which log score represents codelength. In most applications
in which the log score is used, it is used rather for its generic properties, and the resulting predictive distributions may then be used in other ways (they may be plotted to give insight into the data, or used to make predictions against other loss functions that may not have been specified in advance). For example, Yao et al. (2018, end of Section 3.1) cite the generic properties that the log score is local and proper as a reason for adopting it. Our example indicates that, in the M-open case, using the log score only for these generic properties can give misleading results. The SafeBayesian
method overcomes this problem by exponentiating the likelihood to the power η that
minimizes a variation of log-score for predictive densities (the R-log loss, Eq. (23) in
Grünwald and Van Ommen (2017)) in which loss cannot be made smaller by mixing
together bad densities.
(not shown). The regression function according to the SafeBMA predictive distribution is a mixture of all these ridge-based regression functions and hence is also close to 0.
As Yao et al. (2018) note, the implementation of their method can be unstable when the ratio of relative sample size to the effective number of parameters is small. We encountered this unstable behaviour for a large proportion of the simulations when the sample size was relatively small and the Pareto k diagnostic (which flags instability when large) was above 0.5, though mostly below 0.7, for some data points. In those cases the method
did not give sensible outputs, irrespective of the true regression function (which we set
to, among others, Y_i = 0.5X_i + ξ_i and Y_i = X_i^2 + ξ_i, and we also experimented with a
Fourier basis). Thus, we re-generated the whole sample of size n = 50 many times and
only considered the runs in which the k-diagnostic was below 0.5 for all data points.
In all those cases, we observed the overfitting behaviour depicted in Figure 1. This
‘sampling towards stable behaviour’ may of course induce bias. Nevertheless, the fact
that we get very similar results for model selection rather than stacking (mixing) based
on LOO with log-score indicates that the stacking curve in Figure 1 is representative.
References
Dawid, A. (1984). “Present Position and Potential Developments: Some Personal Views,
Statistical Theory, The Prequential Approach.” Journal of the Royal Statistical
Society, Series A, 147(2): 278–292. MR0763811. doi: https://doi.org/10.2307/
2981683. 957
Van Erven, T., Grünwald, P., and de Rooij, S. (2012). “Catching up faster by switch-
ing sooner: a predictive approach to adaptive estimation with an application to
the AIC–BIC dilemma.” Journal of the Royal Statistical Society: Series B (Sta-
tistical Methodology), 74(3): 361–417. With discussion, pp. 399–417. MR2925369.
doi: https://doi.org/10.1111/j.1467-9868.2011.01025.x. 957
Grünwald, P. and Van Ommen, T. (2017). “Inconsistency of Bayesian inference for
misspecified linear models, and a proposal for repairing it.” Bayesian Analysis, 12(4):
1069–1103. MR3724979. doi: https://doi.org/10.1214/17-BA1085. 957, 958, 959,
960
De Heide, R. (2016). “The Safe–Bayesian Lasso.” Master’s thesis, Leiden University.
957, 958, 960
Van der Pas, S. and Grünwald, P. (2018). “Almost the Best of Three Worlds: Risk, Con-
sistency and Optional Stopping for the Switch Criterion in Nested Model Selection.”
Statistica Sinica, 28(1): 229–255. MR3752259. 957
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). “Using Stacking to Average
Bayesian Predictive Distributions.” Bayesian Analysis. 957, 958, 959, 960, 961
Contributed Discussion
A. Philip Dawid∗
1 Scoring rules
The paper extends the method of stacking to allow probabilistic rather than point
predictions. It does this by applying a scoring rule, which is a loss function S(y, Q)
for critiquing a quoted probabilistic forecast Q for a quantity Y in the light of Nature’s
realised outcome y. In retrospect this simple extension has been a long time coming.
Although theory and applications of proper scoring rules have been around for at least
70 years, this fruitful and versatile concept has lain dormant until quite recently, and is
still woefully unfamiliar to most statisticians. See Dawid (1986); Dawid and Sebastiani
(1999); Grünwald and Dawid (2004); Parry et al. (2012); Dawid and Musio (2014, 2015);
Dawid et al. (2016) for a variety of theory, examples and applications.
A scoring rule S(y, Q) is a special case of a loss function L(y, a), measuring the
negative worth of taking an act a in some action space A, when the variable Y turns
out to have value y. In this special case, A is the set of probability distributions. Conversely,
given any action space and loss function we can construct an associated proper scoring
rule S(y, Q) := L(y, aQ ), where aQ is a Bayes act against Q, thus minimising L(Q, a) :=
EY ∼Q L(Y, a). This extends stacking to general decision problems.
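As a small standard illustration of this construction: with squared-error loss,

$$L(y,a) = (y-a)^2 \;\Longrightarrow\; a_Q = \mathrm{E}_Q[Y], \qquad S(y,Q) := L(y, a_Q) = \bigl(y - \mathrm{E}_Q[Y]\bigr)^2,$$

a proper (though not strictly proper) scoring rule, so stacking under this score targets the quality of the predictive mean.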
2 Prequential strategies
However, when we take probability forecasting seriously, there are more principled ways
to proceed. Rather than assessing forecasts using a cross-validatory approach—which,
though popular, has little foundational theory and does not easily extend beyond the
context of independent identically distributed (“IID”) observations—we can conduct
prequential (predictive sequential) assessment (Dawid, 1984). This considers the obser-
vations in sequence, and at any time t constructs a probabilistic prediction Pt+1 for
the next observable Yt+1 , based on the currently known outcomes yt = (y1 , . . . , yt ).
Any method of doing this is a prequential strategy. Many (though by no means all)
strategies can be formed by applying some principle to a parametric statistical model
M = {Pθ : θ ∈ Θ} for the sequence of observables (Y1 , Y2 , . . .)—which need not incor-
porate independence, and (in contrast to all the approaches mentioned in § 2 of the
paper) not only need not be considered as containing the “true generating process”,
but does not even require that such a process exist (the “M-absent” case). Example
M-based strategies are the "plug-in" density forecast $p_{t+1} = p(y_{t+1} \mid y_t; \hat{\theta}_t)$, with $\hat{\theta}_t$ the maximum likelihood estimate based on the current data $y_t$; and the Bayesian density forecast $p_{t+1} = \int p(y_{t+1} \mid y_t; \theta)\, \pi_t(\theta)\, d\theta$, with $\pi_t(\theta)$ the posterior density of $\theta$ given $y_t$.
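A small illustrative R sketch of these two strategies (the data-generating values and the N(0, 100) prior below are made up), scored by the cumulative one-step-ahead log score under a N(θ, 1) working model:

set.seed(1)
y <- rnorm(200, mean = 1.5, sd = 1.3)      # data; the N(theta, 1) model is mildly misspecified
n <- length(y)
log_plug <- log_bayes <- numeric(n - 1)
for (t in 1:(n - 1)) {
  mle <- mean(y[1:t])                                      # plug-in forecast N(mle, 1)
  log_plug[t] <- dnorm(y[t + 1], mle, 1, log = TRUE)
  s2_t <- 1 / (1 / 100 + t)                                # posterior variance under a N(0, 100) prior
  m_t  <- s2_t * sum(y[1:t])                               # posterior mean
  log_bayes[t] <- dnorm(y[t + 1], m_t, sqrt(1 + s2_t), log = TRUE)  # Bayesian forecast
}
c(plug_in = sum(log_plug), bayes = sum(log_bayes))         # cumulative prequential log scores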
By adding probabilistic assumptions about the origin of y we can then (if so desired)
recover some of the consistency results in § 2.1 and § 2.2 above. And again extension
to forecasts based on a collection M of models is straightforward. I wonder how well
stacking would perform by this criterion?
References
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning and Games. Cam-
bridge: Cambridge University Press. MR2409394. doi: https://doi.org/10.1017/
CBO9780511546921. 964
Dawid, A. P. (1984). “Statistical Theory. The Prequential Approach (with Discus-
sion).” Journal of the Royal Statistical Society, Series A, 147: 278–292. MR0763811.
doi: https://doi.org/10.2307/2981683. 962, 963
Dawid, A. P. (1986). “Probability Forecasting.” In Kotz, S., Johnson, N. L., and
Read, C. B. (eds.), Encyclopedia of Statistical Sciences, volume 7, 210–218. Wiley-
Interscience. MR0892738. 962
Dawid, A. P. and Musio, M. (2014). “Theory and Applications of Proper Scoring
Rules.” Metron, 72: 169–183. MR3233147. doi: https://doi.org/10.1007/s40300-
014-0039-y. 962
Dawid, A. P. and Musio, M. (2015). “Bayesian Model Selection Based on Proper
Scoring Rules (with Discussion).” Bayesian Analysis, 10: 479–521. MR3420890.
doi: https://doi.org/10.1214/15-BA942. 962
Dawid, A. P., Musio, M., and Ventura, L. (2016). “Minimum Scoring Rule Inference.”
Scandinavian Journal of Statistics, 43: 123–138. MR3466997. doi: https://doi.org/
10.1111/sjos.12168. 962
Dawid, A. P. and Sebastiani, P. (1999). “Coherent Dispersion Criteria for
Optimal Experimental Design.” Annals of Statistics, 27: 65–81. MR1701101.
doi: https://doi.org/10.1214/aos/1018031101. 962
Grünwald, P. D. and Dawid, A. P. (2004). “Game Theory, Maximum Entropy, Minimum
Discrepancy, and Robust Bayesian Decision Theory.” Annals of Statistics, 32: 1367–
1433. MR2089128. doi: https://doi.org/10.1214/009053604000000553. 962
Parry, M. F., Dawid, A. P., and Lauritzen, S. L. (2012). “Proper Local Scoring Rules.”
Annals of Statistics, 40: 561–92. MR3014317. doi: https://doi.org/10.1214/
12-AOS971. 962
Rakhlin, A., Sridharan, K., and Tiwari, A. (2015). “Online Learning Via Sequential
Complexities.” Journal of Machine Learning Research, 16: 155–186. MR3333006.
964
Skouras, K. and Dawid, A. P. (2000). “Consistency in Misspecified Models.” Research
Report 218, Department of Statistical Science, University College London. 963
Vovk, V. (2001). “Competitive On-Line Statistics.” International Statistical Review , 69:
213–248. 964
Contributed Discussion
William Weimin Yoo∗
In this paper, Yao et al. (2018) consider the problems of model selection and aggre-
gation of different candidate models for inference. Inspired by the stacking of means
method proposed in the frequentist literature, the authors generalize this idea by devel-
oping a procedure to stack Bayesian predictive distributions. Given a list of candidate
models with their corresponding predictive distributions, the goal here is to find a lin-
ear combination of these distributions such that it is as close as possible to the true
data generating distribution, under some score criterion. To find the linear combination
(model) weights, they replace the full predictive distributions with their leave-one-out
(LOO) versions in the objective function, and proceed to solve this convex optimization
problem. The authors then propose a further approximation to LOO computation by
importance sampling, with the importance weights obtained by fitting a generalized
Pareto distribution. To showcase the benefits of the newly proposed stacking method,
the authors conducted extensive simulation studies and data analyses, by comparing
with other techniques in the literature such as BMA (Bayesian Model Averaging), BMA
with LOO (Pseudo-BMA), BMA with LOO and the Bayesian bootstrap, and others.
We can take a graphical modeling perspective on LOO. For example, replacing marginal likelihoods p(y | M_k) with $\prod_{i=1}^{n} p(y_i \mid y_j : j \neq i, M_k)$ is akin to simplifying a complete (fully connected) graph linking observations to one where a Markov assumption holds, i.e., the node corresponding to y_i is conditionally independent of the rest given its neighbors {y_j : j ≠ i}. In the proposed stacking method, the full predictive distribution p(ỹ | y) is approximated by the LOO p(y_i | y_j : j ≠ i), and further approximation is needed because the LOO is typically expensive to compute. However, if there is structure in the data, such as clusters of data points/nodes, then one can take advantage of this by conditioning on a smaller cluster B of nodes around y_i and computing instead p(y_i | y_j : j ≠ i, j ∈ B). This would further speed up computations, as one can fit models using only local data points.
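A toy R sketch of this local-conditioning idea (the zero-mean Gaussian model, the AR-type covariance, and the neighbourhood size are made up for illustration): the full LOO density p(y_i | y_{-i}) is compared with the cheaper p(y_i | y_B) that conditions only on a small cluster B around i.

set.seed(1)
n <- 50
Sigma <- 0.7^abs(outer(1:n, 1:n, "-"))             # made-up AR(1)-type correlation structure
y <- drop(t(chol(Sigma)) %*% rnorm(n))             # one draw from N(0, Sigma)

cond_dens <- function(i, idx) {                    # p(y_i | y_idx) under N(0, Sigma)
  S12 <- Sigma[i, idx, drop = FALSE]
  S22 <- Sigma[idx, idx, drop = FALSE]
  m   <- drop(S12 %*% solve(S22, y[idx]))
  v   <- Sigma[i, i] - drop(S12 %*% solve(S22, t(S12)))
  dnorm(y[i], m, sqrt(v))
}
i <- 25
c(full  = cond_dens(i, setdiff(1:n, i)),                 # condition on all other points
  local = cond_dens(i, setdiff((i - 3):(i + 3), i)))     # condition on a small cluster B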
Another point I would like to make is that the superb performance of the stacking
method warrants further theoretical investigations. Figure 2(c) in the simulations shows
that the proposed method is robust to incorrect/bad models, in the sense that its per-
formance stays unchanged even when more incorrect models are added to the list. It would be nice to also have theoretical guarantees that the stacking method will concentrate on the correct model (M-closed case) or the best models (M-complete case) in the model list. In addition, Figure 9 shows that this method is able to "borrow strength" across
different models, by using some aspects of a model to enhance performance of a different
model. Therefore aggregation by stacking adds value by bringing the best out of each
individual model component, and it would be interesting to characterize through theory
what this added value is. This then invites us to reflect on how the quality of individual
model component affects the final stacked distribution. For example, given posterior
contraction rates for the different posterior models, what would be the aggregated rate
for the stacked posterior? Also, how do prediction/credible sets constructed using the stacked posterior compare with those constructed from each model component: will they be bigger, smaller, or something in between?
As most existing methods including the newly proposed stacking method use some
form of linear combinations, it would be interesting to find other ways of aggregation.
As pointed out by the authors, linear combinations of predictive distributions will not
be able to reproduce truths that are generated through some other means, e.g., convo-
lutions. To apply stacking in the convolutional case, I think one way is to do everything
in the Fourier domain, by stacking log Fourier transforms (i.e., log characteristic func-
tions) of the predictive posterior densities, exponentiate and then apply inverse Fourier
transform to approximate the truth generated through convolutions.
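A minimal numerical sketch of this suggestion (component means, variances, and weights are made up; normal components are used so that the answer is known in closed form): stack the log characteristic functions, exponentiate, and invert numerically.

mu <- c(0, 3); v <- c(1, 4); w <- c(0.6, 0.4)
log_cf <- function(tt, m, vv) 1i * tt * m - 0.5 * vv * tt^2      # log CF of N(m, vv)
tt <- seq(-30, 30, length.out = 4001); dt <- tt[2] - tt[1]
stacked_cf <- exp(w[1] * log_cf(tt, mu[1], v[1]) + w[2] * log_cf(tt, mu[2], v[2]))
x <- seq(-6, 8, length.out = 141)                                # inversion grid
dens <- sapply(x, function(xx) Re(sum(exp(-1i * tt * xx) * stacked_cf)) * dt / (2 * pi))
# for normal components the stacked density is N(sum(w*mu), sum(w*v)); check the match:
max(abs(dens - dnorm(x, sum(w * mu), sqrt(sum(w * v)))))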
I think another possible area of application for the stacking method is in distributed
computing. In the modern big data era, data have grown to such a size that it is sometimes infeasible or impossible to analyze them using standard Markov chain Monte Carlo (MCMC) algorithms on a single machine. This has given rise to the divide-and-conquer strategy, where the data are first divided into batches and a (sub)posterior distribution is computed for each batch. The final posterior for inference is then obtained by aggre-
gating these (sub)posteriors. To this end, I think the present stacking method can be
deployed after some modifications, with potential to yield superior performance when
compared with existing weighted average-type approaches.
I find the paper very interesting, and the proposed stacking method is a key contribution to the Bayesian model averaging/selection literature. It is shown to be superior to the gold standard, i.e., BMA, and its finite-sample performance is tested comprehensively through a series of numerical experiments and real data analyses. I think this is a very promising research direction, and any future contributions are very welcome.
References
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). “Using stacking to average
Bayesian predictive distributions.” Bayesian Analysis, 1–28. Advance publication.
966
Contributed Discussion
Robert L. Winkler∗ , Victor Richmond R. Jose† ,
Kenneth C. Lichtendahl Jr.‡ , and Yael Grushka-Cockayne§
Yao et al. (2018) propose a stacking approach for aggregating predictive distributions and compare their approach with several alternatives, such as Bayesian model averaging
and mean stacking. Because distributions are more informative than point estimates and
aggregating distributions can increase the information upon which decisions are made,
any paper that encourages the use of distributions and their aggregation is important
and most welcome. Therefore, we applaud the authors for bringing attention to these
issues and for their careful approach to distribution stacking using various scoring rules.
To stack distributions, the authors propose the use of convex combinations of com-
peting models’ cdfs. The different cdf stackers they consider vary in the (non-negative)
weights assigned to each model in the combination. This cdf stacking approach is framed
by the logic of Bayesian model averaging: there is a prior distribution over which model
is correct and the data are generated by this one true model. The result is a convex com-
bination of the competing models, where the weights follow from the prior distribution
over which model is correct.
The idea of a "true model" is in the spirit of hypothesis testing in the sense that it is black and white: one truth that we try to identify. In most real-world situations, however,
the existence of a “true model” is highly doubtful at best. A better approach may be
to think in terms of information aggregation from multiple inputs/forecasters/models.
Under this approach, no base model is true, but all provide some useful information
that is worth combining. The following example illustrates this point and motivates the
idea of aggregating information with a quantile stacker.
Suppose there are k ≥ 2 forecasters. Forecaster i privately observes the sample $x_i = (x_{N_{i-1}+1}, \ldots, x_{N_i})$ of size $n_i$ for $i = 1, \ldots, k$, where $N_i = \sum_{j=1}^{i} n_j$. Each forecaster will report their quantile function $Q_i$ for the uncertain quantity of interest $x_{N_k+1}$. The data are drawn from the normal-normal model: $\mu \sim N(\mu_0, m\lambda)$ and $(x_j \mid \mu) \sim_{iid} N(\mu, \lambda)$ for $j = 1, \ldots, N_k + 1$, where $\lambda$ denotes the precision.
$$Q_{dm}(u) \;=\; \frac{m}{m+N_k}\,\mu_0 \;+\; \frac{N_k}{m+N_k}\sum_{i=1}^{k}\frac{n_i \bar{x}_i}{N_k} \;+\; \lambda_{dm}^{-1/2}\,\Phi^{-1}(u)$$
$$\;=\; \frac{m - km}{m+N_k}\,\mu_0 \;+\; \sum_{i=1}^{k}\Bigl(\frac{m+n_i}{m+N_k} - \frac{1}{k}\,\frac{\lambda_i^{1/2}}{\lambda_{dm}^{1/2}}\Bigr)\mu_i \;+\; \sum_{i=1}^{k}\frac{1}{k}\,\frac{\lambda_i^{1/2}}{\lambda_{dm}^{1/2}}\,Q_i(u). \qquad (1)$$
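A quick numerical check of (1) in R, with made-up numbers and taking λ_i and λ_dm to be the posterior-predictive precisions implied by the normal-normal model above and μ_i to be forecaster i's posterior-predictive mean:

set.seed(1)
k <- 3; n_i <- c(5, 8, 12); m <- 2; mu0 <- 0; lambda <- 1
xbar <- c(0.4, -0.2, 0.9); Nk <- sum(n_i)
lam_i  <- lambda * (m + n_i) / (m + n_i + 1)       # forecaster i's predictive precision (assumed form)
lam_dm <- lambda * (m + Nk)  / (m + Nk + 1)        # pooled-data predictive precision (assumed form)
mu_i   <- (m * mu0 + n_i * xbar) / (m + n_i)
Qi     <- function(u, i) mu_i[i] + qnorm(u) / sqrt(lam_i[i])

u <- 0.8
direct  <- (m * mu0 + sum(n_i * xbar)) / (m + Nk) + qnorm(u) / sqrt(lam_dm)
stacked <- (m - k * m) / (m + Nk) * mu0 +
  sum(((m + n_i) / (m + Nk) - sqrt(lam_i / lam_dm) / k) * mu_i) +
  sum(sqrt(lam_i / lam_dm) / k * sapply(1:k, function(j) Qi(u, j)))
c(direct = direct, stacked = stacked)              # the two sides of (1) agree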
According to the example above, the result of aggregating the forecasters’ informa-
tion is mean/quantile stacking. The mean/quantile stacker in (1) is a linear combination
of the prior-predictive mean, the forecasters' posterior-predictive means, and the forecasters' posterior-predictive quantiles. Interestingly, only the weights on the quantiles
are necessarily positive. Choosing this linear combination to optimize a quantile scoring
rule (Jose and Winkler, 2009; Gneiting, 2011; Grushka-Cockayne et al., 2017), such as
the pinball loss function, may be a useful way to choose the weights.
In fact, a quantile stacker solves the example in Yao et al.'s Section 4.1 exactly and
would be perfectly calibrated. The quantile stacker 0.6Q3 (u) + 0.4Q4 (u), where Qi (u) =
µi +Φ−1 (u) and µi = i, yields the aggregate quantile function 3.4+Φ−1 (u), which is the
quantile function of the true data-generating process in Section 4.1. A quantile stacker
may also work well in the regression example of Section 4.2.
When aggregating model forecasts of a continuous quantity of interest, each model
may be well-calibrated to the training data, although each model may have different
sharpness. In this case, a result in Hora (2004) kicks in. The convex combination of
well-calibrated models’ cdfs cannot be well-calibrated (unless they are all identical).
Typically, the convex combination of cdfs will lead to an underconfident aggregate cdf
(Lichtendahl Jr et al., 2013). The same holds for forecasts of a binary event; see Ranjan
and Gneiting (2010). Because the average quantile forecast is always sharper than the
average probability forecast (Lichtendahl Jr et al., 2013), the average quantile may be
better calibrated when the average cdf is underconfident.
In the case of aggregating binary-event forecasts, such as the example in Yao et al.'s
Section 4.6, it might be optimal to transform the probabilities prior to combining them
in a generalized linear model. Lichtendahl Jr et al. (2018) propose a Bayesian model in
the spirit of the example given here. The model results in a generalized additive model
for combining the base models' binary-event forecasts.
We are thankful that the authors highlight the importance of combining predictive
distributions, and we hope this paper stimulates further work in this area.
References
Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. Chichester, Eng-
land: John Wiley & Sons, Ltd. MR1274699. doi: https://doi.org/10.1002/
9780470316870. 968
Contributed Discussion
Kenichiro McAlinn∗ , Knut Are Aastveit† , and Mike West‡
cases. BPS (Bayesian predictive synthesis) links to past literature on subjective Bayesian "agent/expert opinion analysis" (e.g. Genest and Schervish, 1985; West and Crosse, 1992; West, 1992) and provides
a formal Bayesian framework that regards predictive densities from multiple models (or
individuals, agencies, etc) as data to be used in prior-posterior updating by a Bayesian
observer (see also West and Harrison, 1997, Sect 16.3.2). The approach allows for the
integration of other sources of information and explicitly provides the ability to deal
with M-incompleteness. A main theoretical component of BPS is a general theorem
describing a subset of Bayesian analyses showing how densities can be “synthesized”.
Special cases include traditional BMA, most existing forecast pooling rules, and, in terms of theoretical construction, the stacking approach in the article.
In McAlinn and West (2018) and McAlinn et al. (2018) BPS is developed for time
series forecasting where the underlying Bayesian foundation defines a class of dynamic
latent factor models. The sequences of predictive densities define time-varying priors for
inherent latent factor processes linked to the time series of interest. BPS is able to learn
and adapt to the biases, aspects of mis-calibration, and, critically, the inter-dependences among predictive densities. A further key practical point is that BPS can, and should, be defined with respect to specific predictive goals; this is a point of wider import
presaged in the earlier Bayesian macroeconomics literature and illustrated in McAlinn
and West (2018) and McAlinn et al. (2018) through separate forecast combination
models for multiple different goals (multiple-step ahead forecasting). Applications in
macroeconomic forecasting in these papers demonstrate that a class of proposed BPS
models can significantly improve over conventional methods (including BMA and other
pooling/weighting schemes). Further, as BPS is a fully specified Bayesian model within which the information from each of the sources generating predictive densities is treated as (complicated) "covariates," posterior inference on the (time-varying or otherwise) parameters weighting and relating the sources provides direct access to inferences on their biases and inter-dependencies.
It is of interest to consider how the current stacking approach relates to BPS through
an understanding of how the resulting rule for predictive density combination can be
interpreted in BPS theory (see equation (1), and the discussion thereafter, in McAlinn
and West 2018). As with other combination rules, an inherent latent factor interpreta-
tion is implied and this may provide opportunity for further development. In related
work with BPS based on mixture models, Johnson and West (2018) highlight the opportunities both to improve the resulting predictions and to generate insights about the practical impact of model inter-dependencies that are largely ignored by other approaches. This
can be particularly important in dealing with larger numbers of predictive densities
when the underlying models generating the densities are known or expected to have
strong dependencies (a topic touched upon in Section 5.3 of the article).
References
Aastveit, K. A., Gerdrup, K. R., Jore, A. S., and Thorsrud, L. A. (2014). “Nowcasting
GDP in real time: A density combination approach.” Journal of Business & Eco-
nomic Statistics, 32: 48–68. MR3173707. doi: https://doi.org/10.1080/07350015.
2013.844155. 971
Aastveit, K. A., Ravazzolo, F., and van Dijk, H. K. (2018). “Combined density Now-
casting in an uncertain economic environment.” Journal of Business & Economic
Statistics, 36: 131–145. MR3750914. doi: https://doi.org/10.1080/07350015.
2015.1137760. 971
Billio, M., Casarin, R., Ravazzolo, F., and van Dijk, H. K. (2013). “Time-varying
combinations of predictive densities using nonlinear filtering.” Journal of Economet-
rics, 177: 213–232. MR3118557. doi: https://doi.org/10.1016/j.jeconom.2013.
04.009. 971
Del Negro, M., Hasegawa, R. B., and Schorfheide, F. (2016). “Dynamic prediction pools:
an investigation of financial frictions and forecasting performance.” Journal of Econo-
metrics, 192(2): 391–405. MR3488085. doi: https://doi.org/10.1016/j.jeconom.
2016.02.006. 971
Genest, C. and Schervish, M. J. (1985). “Modelling expert judgements for Bayesian
updating.” Annals of Statistics, 13: 1198–1212. MR0803766. doi: https://doi.org/
10.1214/aos/1176349664. 972
Geweke, J. F. and Amisano, G. G. (2011). “Optimal prediction pools.” Journal of Econo-
metrics, 164: 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.
2011.02.017. 971
Hall, S. G. and Mitchell, J. (2007). “Combining density forecasts.” International Journal
of Forecasting, 23: 1–13. 971
Johnson, M. C. and West, M. (2018). “Bayesian predictive synthesis: Forecast calibra-
tion and combination.” Technical report. ArXiv:1803.01984. MR3755068. 972
Kapetanios, G., Mitchell, J., Price, S., and Fawcett, N. (2015). “Generalised den-
sity forecast combinations.” Journal of Econometrics, 188: 150–165. MR3371665.
doi: https://doi.org/10.1016/j.jeconom.2015.02.047. 971
McAlinn, K., Aastveit, K. A., Nakajima, J., and West, M. (2018). “Multivariate
Bayesian predictive synthesis in macroeconomic forecasting.” Submitted . ArXiv:
1711.01667. 971, 972
McAlinn, K. and West, M. (2018). “Dynamic Bayesian predictive synthesis in time series
forecasting.” Journal of Econometrics, to appear. ArXiv:1601.07463. MR3664859.
971, 972
West, M. (1992). “Modelling agent forecast distributions.” Journal of the Royal Statis-
tical Society (Series B: Methodological), 54: 553–567. MR1160482. 972
West, M. and Crosse, J. (1992). “Modelling of probabilistic agent opinion.” Journal
of the Royal Statistical Society (Series B: Methodological), 54: 285–299. MR1157726.
972
West, M. and Harrison, P. J. (1997). Bayesian Forecasting & Dynamic Models. Springer
Verlag, 2nd edition. MR1482232. 972
Contributed Discussion
Minsuk Shin∗
Overview
I congratulate the authors on an improvement in the predictive performance of Bayesian model averaging. This improvement is particularly significant under M-open settings (Bernardo and Smith, 2009), where we know that the true model is not one of the considered candidate models. As the authors point out, it is well known that the weights of standard Bayesian model averaging (BMA) asymptotically concentrate on the single model that is closest to the true model in Kullback–Leibler (KL) divergence. This is problematic under M-open settings, because a single (wrong) model would dominate the predictive inference, and predictive performance may suffer as a result.
To address this issue, the authors bring the stacking idea to BMA. Instead of mini-
mizing the mean square error of the point estimate as in original stacking procedures,
they propose a procedure to evaluate the model weights that maximizes the empirical
scoring rule based on leave-one-out (LOO) predictive densities. This results in a convex combination of models that is empirically close (in terms of the considered scoring rule or divergence function) to the true model that generates the observed data. The authors
also circumvent the computational intensity of the LOO procedure by adopting Pareto
smoothed importance sampling (Vehtari et al., 2017). This importance sampling proce-
dure uses the importance sampling weights approximated by the order statistics of the
fitted generalized Pareto distribution, so refitting each model n times can be avoided.
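A short R sketch of the weight optimisation just described (the n × K matrix of LOO predictive densities below is made up; in practice it would come from PSIS-LOO as in the paper), using a softmax parameterisation so that a standard unconstrained optimiser can be used:

set.seed(1)
n <- 100; K <- 3
lpd <- matrix(rexp(n * K), n, K)           # stand-in for p(y_i | y_{-i}, M_k), i = 1..n, k = 1..K

neg_score <- function(alpha) {             # negative LOO log score of the mixture
  w <- exp(c(alpha, 0)); w <- w / sum(w)   # softmax keeps w on the simplex
  -sum(log(lpd %*% w))
}
fit <- optim(rep(0, K - 1), neg_score, method = "BFGS")
w_hat <- exp(c(fit$par, 0)); w_hat / sum(w_hat)   # estimated stacking weights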
Conclusion
Once again, I would like to congratulate the authors of this paper, and I think that it
would be interesting to extend the idea to high-dimensional model selection problems.
References
Bernardo, J. M. and Smith, A. F. (2009). Bayesian Theory, volume 405. John Wiley &
Sons. MR1274699. doi: https://doi.org/10.1002/9780470316870. 974
Castillo, I., Schmidt-Hieber, J., Van der Vaart, A., et al. (2015). “Bayesian linear re-
gression with sparse priors.” The Annals of Statistics, 43(5): 1986–2018. MR3375874.
doi: https://doi.org/10.1214/15-AOS1334. 974
George, E. I. and McCulloch, R. E. (1993). “Variable selection via Gibbs sampling.”
Journal of the American Statistical Association, 88(423): 881–889. 975
Hans, C., Dobra, A., and West, M. (2007). “Shotgun stochastic search for “large
p” regression.” Journal of the American Statistical Association, 102(478): 507–516.
MR2370849. doi: https://doi.org/10.1198/016214507000000121. 975
Narisetty, N. N., He, X., et al. (2014). “Bayesian variable selection with shrink-
ing and diffusing priors.” The Annals of Statistics, 42(2): 789–817. MR3210987.
doi: https://doi.org/10.1214/14-AOS1207. 974
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997). “Bayesian model averaging for
linear regression models.” Journal of the American Statistical Association, 92(437):
179–191. MR1436107. doi: https://doi.org/10.2307/2291462. 975
Vehtari, A., Gelman, A., and Gabry, J. (2017). “Pareto smoothed importance sampling.”
arXiv preprint arXiv:1507.02646. 974
Contributed Discussion
Tianjian Zhou∗
Model Space and Computational Complexity  Some considerations relevant to BMA also apply to stacking, for example the choice of candidate models. For a regression analysis
with p predictors, there are 2^p possible linear subset regression models. One could
potentially consider most or all of them, as in Section 4.5. Sometimes it is also necessary
to consider interaction and nonlinear terms, as in Section 4.6. Thus, the size of the model
space can be enormous, making computation infeasible for reasonably large p. Similar
strategies as for implementing BMA could potentially also be used for stacking.
Compared to BMA, stacking is more appropriate for the M-open case. That is, the
true model need not be in the model space M. Still, it would be better if the true model
or a model close to the true model were included in the set of candidate models. For
example, comparing Figures 3 and 4, the performance of stacking is better in the M-closed case than in the M-open case, in terms of predictive density. Thus, from a
practical perspective, it is still desirable to work with a reasonably large class of models.
In all examples, the candidate models belong to the same model family (e.g., of
regression models). It would be interesting to see if performance can be further improved
by combining models that belong to different model families. For example, combining
linear regression models and regression tree models.
a natural choice would be the widely used Dirichlet process mixture model. For Sec-
tion 4.6, a natural choice could be a Bayesian additive regression tree (BART) model
or a Gaussian process prior. Of course, BNP models have their own limitations, such as difficult implementation, complex computation, and possibly poor performance with small sample sizes. A possible way out could be to include BNP models as candidate
models in stacking. Little would change in the overall setup.
Contributed Discussion∗
Lennart Hoogerheide† and Herman K. van Dijk‡
Proposal by Yao, Vehtari, Simpson and Gelman In recent literature and practice in
statistics as well as in econometrics, it is shown that Bayesian Model Averaging (BMA)
has its limitations for forecast averaging, see the earlier reference for a summary of the
literature in economics. The authors focus in their paper on the specific limitation of BMA when the true data generating process is not in the model set, and also indicate the sensitivity of BMA in the case of weakly or non-informative priors. As a better approach in terms of forecast accuracy and robustness, the authors propose the use of stacking, which is used in point estimation, and extend it to the case of combinations of predictive densities. A key step in the stacking procedure is an optimisation step that determines the weights of a mixture model in such a way that the averaging method is relatively robust to misspecified models, in particular in large samples.
Dynamic learning to average predictively We fully agree that BMA has the earlier
mentioned restrictions. However, we argue that a static approach to forecast averaging,
as suggested by the authors, will in many cases remain sensitive to the presence of a bad forecast and extremely sensitive to a very bad forecast. We suggest extending the
approach of the authors to a setting where learning about features of predictive densities
of possibly incomplete or misspecified models can take place. This extension will im-
prove the process of averaging over good and bad forecasts. To back up our suggestion, we summarise how this has been developed in empirical econometrics in recent years by Billio et al. (2013), Casarin et al. (2018), and Baştürk, Borowska, Grassi, Hoogerheide, and Van Dijk (2018). Moreover, we show that this approach can be extended to combining not only forecasts but also policies. The technical tools necessary for the implementation are filtering methods from the nonlinear time series literature, and we show their connection with dynamic machine learning.
The fundamental predictive density combination  Let the predictive probability distribution of the variable of interest $y_t$ of (1), given the set $\tilde{y}_t = (\tilde{y}_{1t}, \ldots, \tilde{y}_{nt})'$, be specified as a large discrete mixture of the conditional probabilities of $y_t$ given $\tilde{y}_t$ coming from $n$ different models, with weights $w_t = (w_{1t}, \ldots, w_{nt})'$ that are interpreted as probabilities and form a convex combination. One can then give (1) a stochastic interpretation using mixtures. Such a probability model, in terms of densities, is given as:
$$f(y_t \mid \tilde{y}_t) \;=\; \sum_{i=1}^{n} w_{it}\, f(y_t \mid \tilde{y}_{it}). \qquad (2)$$
Let the predictive densities from the n models be denoted as f (ỹit |Ii ), i = 1, . . . , n,
where Ii is the information set of model i. Given the fundamental density combination
model of (2) and the predictive densities from the n models, one can specify, given
standard regularity conditions about existence of sums and integrals, that the marginal
predictive density of yt is given as a discrete/continuous mixture,
$$f(y_t \mid I) \;\sim\; \sum_{i=1}^{n} w_{it} \int f(y_t \mid \tilde{y}_{it})\, f(\tilde{y}_{it} \mid I_i)\, d\tilde{y}_{it}, \qquad (3)$$
where I is the joint information set of all models. The numerical evaluation of this equa-
tion is simple when all distributions have known simulation properties. An important
research line in economics and finance has been to make this approach operational to
more realistic environments by allowing for model incompleteness and dynamic learning
where the densities have no known simulation properties; see the earlier cited references.
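A minimal R sketch of that numerical evaluation (weights, component simulators, and the conditional density below are all made up): each integral in (3) is approximated by averaging f(y_t | ỹ_it) over draws from f(ỹ_it | I_i).

set.seed(1)
w <- c(0.5, 0.3, 0.2)                               # combination weights
sim_ytilde <- list(function(S) rnorm(S, 0, 1),      # stand-in draws from f(ytilde_it | I_i)
                   function(S) rnorm(S, 1, 2),
                   function(S) rt(S, df = 5))
f_cond <- function(y, ytilde) dnorm(y, mean = ytilde, sd = 0.5)   # assumed f(y_t | ytilde_it)

S <- 1e4
mix_density <- function(y) {
  comps <- sapply(seq_along(w), function(i) mean(f_cond(y, sim_ytilde[[i]](S))))
  sum(w * comps)                                    # Monte Carlo estimate of (3) at y
}
mix_density(0.7)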
Mixtures with model incompleteness and dynamic weight learning  A first step is to introduce possibly time-varying model incompleteness by specifying a Gaussian mixture combination model with time-varying volatility that controls the potential size of the misspecification in all models in the mixture. When the uncertainty level tends to zero, the mixture of experts or the smoothly mixing regressions model is recovered as a limiting case; see Geweke and Keane (2007) and Jacobs et al. (1991). The
weights can be interpreted as a convex set of probabilistic weights of different models
which are updated periodically using Bayesian learning procedures. One can write the model in the form of a nonlinear state space, which allows the use of algorithms based on sequential Monte Carlo methods, such as particle filters, to approximate the combination weights and predictive densities.
References
Aastveit, K. A., Mitchell, J., Ravazzolo, F., and van Dijk, H. K. (2018). “The evolution
of forecast density combinations in economics.” Forthcoming in the Oxford Research
Encyclopaedia in Economics and Finance. 978
Baştürk, N., Borowska, A., Grassi, S., Hoogerheide, L., and Van Dijk, H. K. (2018).
“Learning Combinations of Bayesian Dynamic Models and Equity Momentum Strate-
gies.” Journal of Econometrics, Forthcoming. 979, 980
Billio, M., Casarin, R., Ravazzolo, F., and van Dijk, H. K. (2013). “Time-varying
combinations of predictive densities using nonlinear filtering.” Journal of Econo-
metrics, 177: 213–232. MR3118557. doi: https://doi.org/10.1016/j.jeconom.
2013.04.009. 979
Casarin, R., Grassi, S., Ravazzolo, F., and van Dijk, H. K. (2018). “Predictive Den-
sity Combinations with Dynamic Learning for Large Data Sets in Economics and
Finance.” Technical report. 979
Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econo-
metrics, 164(1): 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.
2011.02.017. 980
Geweke, J. and Keane, M. (2007). “Smoothly mixing regressions.” Journal of Econo-
metrics, 138: 252–290. MR2380699. doi: https://doi.org/10.1016/j.jeconom.
2006.05.022. 979
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). “Adaptive mix-
tures of local experts.” Journal of Neural Computation, 3: 79–87. 979
McAlinn, K. and West, M. (2018). “Dynamic Bayesian predictive synthesis for time
series forecasting.” Journal of Econometrics. Forthcoming. MR3664859. 978
Contributed Discussion
Haakon C. Bakka∗ , Daniela Castro-Camilo† , Maria Franco-Villoria‡ ,
Anna Freni-Sterrantino§ , Raphaël Huser¶ , Thomas Opitz∥ , and Håvard Rue∗∗
The problem of estimating the leave-one-out predictive density (LOOPD) for a model
can be clarified when considering a regression-type setup. Let η be the linear predictor
for conditionally independent data y, so that yi relates to ηi only through the likeli-
hood p(yi |ηi ). For simplicity, we fix and then ignore the remaining variables (see Rue
et al., 2009, Sec. 6.3, for a more general treatment). We can compute the LOOPD from
p(ηi |y−i ) ∝p(ηi |y)/p(yi |ηi ), noting that p(yi |ηi ) is a known function of ηi . Suppose that
we can estimate p(ηi |y) well in the region [µi − γσi , µi + γσi ] (with obvious notation),
and that this region contains most of the probability mass. The question is whether the
correction needed for removing yi (i.e., the denominator p(yi |ηi )) is “small enough” so
that also p(ηi |y−i ) has most of its probability mass in the same region. If so, computing
p(ηi |y−i ) by correcting p(ηi |y) in this way is stable; otherwise, it is potentially unre-
liable and should be computed from a rerun without yi . Depending on the inference
algorithm, initial values can be extracted from the full model to speed up the corrected
run. Following this rationale, (R-)INLA (Rue et al., 2009; Martins et al., 2013; Rue
et al., 2017; Bakka et al., 2018) compute LOOPDs using integrated nested Laplace ap-
proximations. Cases where the above test does not hold are marked as “failures”. The
failed cases can then be recomputed after the corresponding observations are removed,
and we gain speed by using the joint fit as initial values. In addition to being faster than Markov chain Monte Carlo methods, this approach also gives smooth estimates of the posterior marginals, which helps the optimisation step for the weights. Held et al. (2010) discuss
this approach in more details and compare it with estimates obtained by Markov chain
Monte Carlo.
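A toy R sketch of this correction (the full-data marginal, the likelihood, and the held-out value below are made up, and a grid stands in for whatever representation the inference algorithm provides):

eta   <- seq(-4, 6, length.out = 2001)
d_eta <- eta[2] - eta[1]
post  <- dnorm(eta, mean = 1, sd = 0.4)            # stand-in for p(eta_i | y)
lik_i <- dnorm(3.5, mean = eta, sd = 1)            # p(y_i | eta_i) for a held-out y_i = 3.5
loo   <- post / lik_i                              # unnormalised p(eta_i | y_{-i})
loo   <- loo / sum(loo * d_eta)                    # renormalise on the grid

# stability check in the spirit of the text: how much LOO mass stays in the region
# [mu_i - 3*sigma_i, mu_i + 3*sigma_i] where p(eta_i | y) was well estimated?
region <- abs(eta - 1) <= 3 * 0.4
sum(loo[region] * d_eta)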
Recently, Bakka et al. (2018) used leave-one-out cross-validation (LOOCV) log-
scores in spatial modelling. That paper introduces the Barrier model, a non-stationary
model dealing with coastlines and other physical barriers. The goal was to compare
several spatial and non-spatial models through LOOCV. When comparing several mod-
els using the mean LOOCV log-score, we always end up choosing one model as “the
best”. However, such a way to rank models ignores uncertainty. With our dataset, a
∗ CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, haakon.bakka@kaust.edu.sa
† CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, daniela.castro@kaust.edu.sa
‡ Department of Economics and Statistics, University of Torino, Italy, maria.francovilloria@unito.it
§ Small Area Health Statistics Unit, Department of Epidemiology and Biostatistics, Imperial College
¶ raphael.huser@kaust.edu.sa
∥ Biostatistics and Spatial Processes Unit, French National Institute for Agronomic Research, 84914
∗∗ haavard.rue@kaust.edu.sa
small subset of the individual log-scores strongly influenced the model selection result.
For illustration, the left panel of Figure 1 depicts the histogram of log-score differences
between two example models (for each held-out point in the leave-one-out procedure),
from which the mean (or sum) is usually computed. It is clear that we cannot conclude
that one model is superior to the other (i.e., that zero is a bad choice for the center of
this distribution). In the context of stacking, we cannot give more weight to one of these
two models with any degree of confidence. To further assess the variability inherent to
the LOOCV estimate of marginal predictive performance, we bootstrapped the mean-
differences to compute uncertainty intervals, and decided to conclude that one model
was better than another only if this interval did not include zero; see the right panel
of Figure 1 for this computation on our dataset. The first interval in this figure corre-
sponds to the histogram in Figure 1. The non-spatial model 5 (M5) performs poorly,
but we cannot conclude that there is a best model. In the context of stacking, the five
“equivalent” models would be weighted by highly arbitrary weights to create a stacked
model, which we find questionable. We wonder whether combining the bootstrapped
uncertainty intervals with the stacking idea (in some way) could lead to a more robust
approach to stacking.
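For concreteness, the bootstrap check can be written in a few lines of base R; the sketch below assumes we already have the vector d of pointwise LOOCV log-score differences between two models, and the final example uses simulated differences purely for illustration:

    ## Sketch: percentile bootstrap interval for the mean LOOCV log-score difference,
    ## where d[i] = log p_A(y_i | y_-i) - log p_B(y_i | y_-i).
    bootstrap_mean_diff <- function(d, B = 10000, level = 0.95, seed = 1) {
      set.seed(seed)
      boot_means <- replicate(B, mean(sample(d, length(d), replace = TRUE)))
      alpha <- (1 - level) / 2
      ci <- quantile(boot_means, c(alpha, 1 - alpha))
      list(mean = mean(d), ci = ci,
           distinguishable = !(ci[1] < 0 && ci[2] > 0))  # does the interval exclude zero?
    }

    ## Example with simulated differences (illustration only):
    d <- rnorm(200, mean = 0.02, sd = 0.5)
    bootstrap_mean_diff(d)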
We question the authors’ choice to compare the stacking approach to the other
methods presented in the paper. Indeed, they point out that Bayesian model averaging
(BMA) weights reflect only the fit to the data (i.e., within-sample performance) without
maximizing the prediction accuracy (i.e., out-of-sample performance). Thus, the com-
parison of BMA (or its modified versions, Pseudo-BMA and Pseudo-BMA+) against
the stacking of distributions, which is conveniently constructed to improve prediction
accuracy, does not seem fair. To highlight the gains and the pitfalls of stacking predic-
tive distributions, it would be more reasonable to compare the prediction ability of the
stacking approach against the prediction ability of each one of the stacked models.
In Section 3.3, the authors advocate the use of Pareto-smoothed importance sam-
pling as a cheap alternative to exact LOOCV, which can be computationally expensive
for large sample sizes. We agree with the authors’ guidelines, and we here re-emphasize
that this approach is potentially unstable/invalid when the importance ratios have a
very heavy or “noisy” tail. Indeed, it is well known that the ith order statistic X(i) in
a sample of size n from the GP distribution with shape parameter k has finite mean
for k < n − i + 1, and finite variance for k < (n − i + 1)/2; see, e.g., Vännman (1976).
In particular, this implies that the maximum X(n) has infinite mean for k ≥ 1. As the
GP shape parameter is usually estimated with high uncertainty, especially with heavy
tails, a conservative decision rule is preferred in practice. Moreover, we want to stress
that the estimation of the shape parameter k via maximum likelihood may be strongly
influenced by the largest observations. Therefore, more robust approaches might be
preferred. Possibilities include using methods based on probability weighted moments,
which were found to have good small sample properties (Hosking and Wallis, 1987;
Naveau et al., 2016), or using a Bayesian approach with strong prior shrinkage towards
light tails. Opitz et al. (2018) recently developed a penalized complexity (PC) prior
(Simpson et al., 2017) for k, designed for this purpose.
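As an illustration of such a moment-based alternative, the following base-R sketch implements the probability-weighted-moment estimator of Hosking and Wallis (1987) for generalized Pareto exceedances, written here in the heavy-tailed sign convention (shape > 0 for heavy tails) used for the Pareto k̂ diagnostic; it is illustrative code rather than the estimator used in any particular package:

    ## Sketch: probability-weighted-moment (PWM) estimates of the generalized Pareto
    ## scale and shape for exceedances x > 0 over a known threshold.
    gpd_pwm <- function(x) {
      x  <- sort(x)
      n  <- length(x)
      a0 <- mean(x)                                   # estimates E[X]
      a1 <- mean(((n - seq_len(n)) / (n - 1)) * x)    # estimates E[X (1 - F(X))]
      shape <- (a0 - 4 * a1) / (a0 - 2 * a1)          # the k discussed above (k > 0: heavy tail)
      scale <- 2 * a0 * a1 / (a0 - 2 * a1)
      c(scale = scale, shape = shape)
    }

    ## Quick sanity check on unit-exponential data (true shape 0, scale 1):
    gpd_pwm(rexp(1e5))

In a PSIS-type application this would be applied to the exceedances of the importance ratios over a high threshold.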
References
Bakka, H., Rue, H., Fuglstad, G. A., Riebler, A., Bolin, D., Illian, J., Krainski, E., Simp-
son, D., and Lindgren, F. (2018). “Spatial modelling with R-INLA: A review.” WIREs
Computational Statistics. doi: https://doi.org/10.1002/wics.1443.
982, 983
Held, L., Schrödle, B., and Rue, H. (2010). “Posterior and cross-validatory predictive
checks: A comparison of MCMC and INLA.” In Kneib, T. and Tutz, G. (eds.), Statisti-
cal Modelling and Regression Structures – Festschrift in Honour of Ludwig Fahrmeir ,
91–110. Berlin: Springer Verlag. MR2664630. doi: https://doi.org/10.1007/978-
3-7908-2413-1 6. 982
Hosking, J. R. M. and Wallis, J. R. (1987). “Parameter and quantile estimation
for the generalized Pareto distribution.” Technometrics, 29: 339–349. MR0906643.
doi: https://doi.org/10.2307/1269343. 984
Martins, T. G., Simpson, D., Lindgren, F., and Rue, H. (2013). “Bayesian computing
with INLA: New features.” Computational Statistics & Data Analysis, 67: 68–83.
MR3079584. doi: https://doi.org/10.1016/j.csda.2013.04.014. 982
Naveau, P., Huser, R., Ribereau, P., and Hannart, A. (2016). “Modeling jointly low,
moderate, and heavy rainfall intensities without a threshold selection.” Water Re-
sources Research, 52: 2753–2769. 984
Opitz, T., Huser, R., Bakka, H., and Rue, H. (2018). “INLA goes extreme: Bayesian
tail regression for the estimation of high spatio-temporal quantiles.” Extremes.
doi: https://doi.org/10.1007/s10687-018-0324-x. 984
Rue, H., Martino, S., and Chopin, N. (2009). “Approximate Bayesian Inference for La-
tent Gaussian Models Using Integrated Nested Laplace Approximations (with discus-
sion).” Journal of the Royal Statistical Society, Series B , 71(2): 319–392. MR2649602.
doi: https://doi.org/10.1111/j.1467-9868.2008.00700.x. 982
Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P., and Lindgren, F. K.
(2017). “Bayesian Computing with INLA: A Review.” Annual Review of Statistics
and Its Application, 4: 395–421. MR3634300. 982
Simpson, D. P., Rue, H., Riebler, A., Martins, T., and Sørbye, S. H. (2017). “Pe-
nalising model component complexity: A principled, practical approach to con-
structing priors.” Statistical Science, 32: 1–28. MR3634300. doi: https://doi.org/
10.1214/16-STS576. 984
Vännman, K. (1976). “Estimators based on order statistics from a Pareto distribution.”
Journal of the American Statistical Association, 71: 704–708. MR0440779. 984
Contributed Discussion
Marco A. R. Ferreira∗
We note that after yt has been observed, comparing p(yt |y1:(t−1) , M1 ), . . . , p(yt |y1:(t−1) ,
MK ) allows one to evaluate the relative ability of each model to predict at time t − 1
the vector of observations yt . Hence, in the context of data observed over time, instead
of p(yi |y−i , Mk ), it seems more natural to use the one-step ahead predictive density
p(yt |y1:(t−1) , Mk ). Thus, for data observed over time the stacking of predictive distribu-
tions would choose weights
ŵ = argmax_{w∈S_1^K} ∑_{t=t∗+1}^{T} log ∑_{k=1}^{K} w_k p(yt | y_{1:(t−1)}, Mk),        (1)
where the summation on t starts at t∗ + 1 because the first t∗ observations are used
to train the models to reduce dependence on priors for parameters. We note that the
above equation is very similar to that of the optimal prediction pools of Geweke and
Amisano (2011, 2012), except that they start the summation at t = 1.
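To make this concrete, a small base-R sketch of (1) is given below, assuming a (T − t∗) × K matrix lpd whose row-t, column-k entry is log p(y_{t∗+t} | y_{1:(t∗+t−1)}, Mk); the simplex constraint is handled through a softmax parameterization, and the same function would apply to the leave-one-out densities of the original paper by filling lpd with log p(y_i | y_{−i}, Mk) instead. The names are ours, for illustration only.

    ## Sketch: stacking weights maximizing the summed log score of the weighted mixture.
    stacking_weights_seq <- function(lpd) {
      K <- ncol(lpd)
      neg_obj <- function(a) {                      # a parameterizes w = softmax(c(a, 0))
        w <- exp(c(a, 0)); w <- w / sum(w)
        m <- apply(lpd, 1, max)                     # row-wise max for a stable log-sum-exp
        -sum(m + log(as.vector(exp(sweep(lpd, 1, m)) %*% w)))
      }
      fit <- optim(rep(0, K - 1), neg_obj, method = "BFGS")
      w <- exp(c(fit$par, 0))
      w / sum(w)
    }

Parameterizing the weights through a softmax keeps the optimization unconstrained while respecting the simplex, which is convenient for a quick sketch even if dedicated solvers would be preferable in practice.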
It is also helpful to consider the formula for the posterior probability for each model.
To keep exposition simple, let us assume equal prior probabilities for the competing
models. Further, we assume that the first t∗ observations are used for training. Then,
the posterior probability for model Mk is
P(Mk | y_{1:T}) = ∏_{t=t∗+1}^{T} p(yt | y_{1:(t−1)}, Mk) / ∑_{k′=1}^{K} ∏_{t=t∗+1}^{T} p(yt | y_{1:(t−1)}, Mk′).        (2)
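Given the same matrix lpd of one-step-ahead log predictive densities, (2) reduces to normalizing the column sums on the log scale; a short illustrative base-R sketch, with a log-sum-exp step to avoid underflow, is:

    ## Sketch: posterior model probabilities (2), assuming equal prior probabilities.
    post_model_prob <- function(lpd) {
      s <- colSums(lpd)                    # log prod_t p(y_t | y_{1:(t-1)}, M_k)
      exp(s - max(s)) / sum(exp(s - max(s)))
    }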
Keeping (1) and (2) in mind, what can we infer when a model M̃ has posterior
probability close to one but its weight ŵ in the stacking of predictive distributions is
much smaller than one? The posterior probability being close to one means that M̃
is probably, amongst the K models being considered, the model closest in Kullback–
Leibler sense to the true data generating mechanism. But its weight ŵ being much
smaller than one means that there are important aspects of the true data generating
mechanism that have not been incorporated in M̃.
We note that both (1) and (2) depend on the data only through the one-step ahead
predictive densities p(yt |y1:(t−1) , Mk ). Thus, for data observed over time, when there are
disagreements between the posterior probabilities of models and the stacking weights,
an examination of the one-step ahead predictive densities p(yt |y1:(t−1) , Mk ) such as
plotting them over time as in Vivar and Ferreira (2009) may help identify what aspects
of the true data generating mechanism are being neglected by model M̃.
For example, an examination of p(yt |y1:(t−1) , Mk ) may indicate that model M̃ provides
better probabilistic predictions 95% of the time, but that in the remaining 5%
of the time the observations are outliers with respect to M̃ but are not outliers with
respect to a model M∗ that has fatter tails than M̃. In that situation, the outlying
observations would prevent ŵ from being close to one. Further examination of the
outlying observations could possibly suggest ways to improve model M̃ to get it closer to
the true data generating mechanism.
As another example, an examination of p(yt |y1:(t−1) , Mk ) may indicate that M̃ and
another model M∗ take turns at providing better probabilistic predictions. For example,
say that for a certain environmental process, M̃ provides better predictions during a
certain period of time, then M∗ provides better predictions after that, then M̃ again,
and so on. In that case, the environmental process probably has different regimes, and
thus, for example, a Markov switching model (Frühwirth-Schnatter, 2006) may be
adequate to model such an environmental process.
I would imagine that a sensibly estimated leave-one-out predictive density p(yi |y−i ,
Mk ) could also be used for diagnostics. I would appreciate it if the authors could comment
on the advantages and difficulties associated with such use.
Finally, in the M-closed case, will the stacking weight of the true model converge
to one as the sample size increases?
References
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer.
MR2265601. 987
Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econo-
metrics, 164(1): 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.
2011.02.017. 986
Geweke, J. and Amisano, G. (2012). “Prediction with misspecified models.” American
Economic Review , 102(3): 482–86. 986
Vivar, J. C. and Ferreira, M. A. R. (2009). “Spatio-temporal models for Gaussian areal
data.” Journal of Computational and Graphical Statistics, 18: 658–674. MR2751645.
doi: https://doi.org/10.1198/jcgs.2009.07076. 987
Contributed Discussion
Luis Pericchi∗†
This article gave me a déjà vu of 25 years ago in London, when the book by Bernardo
and Smith (1994) was being finished and Key et al. (1999) was starting. With the
formalization of the complement of the M-closed perspective, the end of Bayes factors
and Bayesian model averaging (BMA) was predicted, or at least their confinement to a
very small corner of statistical practice. However, re-sampling Bayes factors, particularly the
geometric intrinsic Bayes factors, were being invented around those same years, and these
re-sampling schemes changed the perspective completely. For historical reasons,
however, non-re-sampling intrinsic Bayes factors were developed much further over the
years. Perhaps one positive consequence of this paper, and of the previous ones in this
line, will be to recover the thread of development, theoretical and practical, of the rich
vein of re-sampling Bayes factors.
Just two illustrations of the fundamentally different behavior of re-sampling Bayes
factors, more in tune with open perspectives, are in Bernardo and Smith (1994), p. 406
(Lindley's paradox revisited), and in Key et al. (1999), p. 369, where the Intrinsic
Implicit Priors were first named. However, this paper seems to restrict its scope to
non-re-sampling Bayes factors, which is insufficient. On the other hand, the paper,
interestingly, relinquishes marginal likelihoods and appears to get away with it, at least
to some extent, by using the K-L loss. This should also be studied theoretically, but it
should be noted that the loss function, even restricted to the K-L functional form, also
changes whenever the training-sample and validation-sample sizes change, and the change
is huge. This paper also seems restricted to n − 1 (leave-one-out) cross-validation.
Moreover, it seems as if the only goal of statistics were to make good predictions, but
good explanations are also of paramount importance. In that direction, Key et al. (1999)
define different combinations of training sizes that show the differences.
I finish with a list of questions and a conjecture. The questions are:
2. Are these approximations really Bayesian, in the sense that there exists a prior that
would produce asymptotically equivalent inference to stacking? In other words, do
Intrinsic Implicit Priors exist here?
3. Is there an optimal training sample size, or combination of global and local utility
functions, when the objective is prediction? Identification?
∗ I am grateful to Adrian Smith and Jim Berger for many discussions on the subject of Bayesian
luis.pericchi@upr.edu
Finally, I conjecture that a casual choice of estimators within models would lead to
un-Bayesian, inefficient solutions, and the authors seem to agree with this conjecture in
Section 5.3. All the entertained models and admissible estimators for their parameters
should be considered carefully prior to the optimization procedures. In fact, this may
solve the old conundrum of whether the same priors should or should not be used for
estimation and selection.
References
Bernardo, J. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons, 1st.
edition. MR1274699. doi: https://doi.org/10.1002/9780470316870. 988
Key, J., Pericchi, L. R., and Smith, A. F. M. (1999). “Bayesian model choice: what and
why?” In Bayesian Statistics 6 , 343–370. Oxford University Press. MR1723504. 988
Contributed Discussion
Christopher T. Franck∗
I congratulate the authors on a fascinating article which will positively impact statistical
research and practice in the years to come. The authors’ procedure for stacking Bayesian
predictive distributions differs from Bayesian model averaging (BMA) in two important
ways. First, the stacking procedure chooses weights based directly on the predictive
distribution of new data while BMA chooses weights based on fit to observed data.
Second, the stacking procedure is not sensitive to priors on parameters to the extent that
BMA is. The leave-one-out approach in the stacking procedure bears some resemblance
to intrinsic Bayes factors, which made me curious as to whether intrinsic Bayesian
model averaging (iBMA) can make up some of the reported performance difference
between stacking and BMA. In this note, I restrict attention to the authors’ linear
subset regressions example, adopt the authors’ Bayesian model, and compare BMA
with iBMA. Since iBMA does not improve prediction over BMA in this initial study,
I ultimately suggest that the stacking procedure is superior to iBMA for prediction.
It appears that the stacking procedure's replacement of the full predictive distri-
bution with the leave-one-out predictive distribution ∫ p(yi |θk , Mk )p(θk |y−i , Mk )dθk is
the major mechanism that bestows invariance to priors. This approach makes priors
resemble posterior distributions by conditioning on a subset of the data. Further, a
simple re-expression of p(θk |y−i , Mk ) via Bayes’ rule reveals that nuisance constants
which accompany priors (discussed by the authors in the BMA segment of Section 2)
cancel out when any portion of the data is used to train priors. This is the same tac-
tic that partial Bayes factors (Berger and Pericchi, 1996) use to cancel the unspecified
constants which accompany improper priors and contaminate resulting Bayes factors.
Briefly, a partial Bayes factor takes a training sample from the observed data, uses the
likelihood of the training sample to update the prior, and forms a Bayes factor as the
ratio of marginal likelihoods that adopt the trained prior alongside the remainder of the
likelihood. An intrinsic Bayes factor is an average across some or all possible training
samples. Where the original motivation for intrinsic Bayes was to enable model selec-
tion using improper priors, the approach is also used for model selection that is robust
to vague proper priors. For improper priors, the goal is usually to choose a minimal
training sample size to render the prior proper. As the training sample size increases
for fixed n, the prior exerts less influence on the posterior model probabilities, but the
method becomes less able to discern competitive models (De Santis and Spezzaferri, 1997).
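A schematic base-R sketch of this construction is given below; log_marg_lik is a hypothetical user-supplied function returning the log marginal likelihood of the observations indexed by its first argument under model 1 or 2 (available in closed form for conjugate linear models), and the averaging over random training samples yields an arithmetic intrinsic Bayes factor in the spirit of Berger and Pericchi (1996), not a complete implementation:

    ## Schematic sketch: arithmetic intrinsic Bayes factor of model 1 vs model 2.
    ## log_marg_lik(idx, model) is a hypothetical helper giving the log marginal
    ## likelihood of the observations indexed by idx under the given model.
    intrinsic_bf <- function(n, m, n_train, log_marg_lik) {
      log_pbf <- replicate(n_train, {
        train <- sample.int(n, m)                  # a random training sample of size m
        # partial Bayes factor, written as a ratio of full-data to training-data
        # marginal likelihoods under each model
        (log_marg_lik(1:n, model = 1) - log_marg_lik(train, model = 1)) -
          (log_marg_lik(1:n, model = 2) - log_marg_lik(train, model = 2))
      })
      mean(exp(log_pbf))                           # average of partial Bayes factors
    }

Translating such intrinsic Bayes factors into posterior model probabilities for iBMA then proceeds exactly as for ordinary Bayes factors.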
Using the authors’ Bayesian model and data generating process and a similar out-
of-sample testing procedure, I compared standard BMA with iBMA. In the iBMA case,
I formed intrinsic Bayes factors which I then translated to posterior model probabilities
for use in model averaging. I obtained 50 Monte Carlo replicates with 10 test points.
I considered iBMA training sample sizes of 1, 5, and n−1. The M-open case (not shown)
favored the iBMA setting with n − 1 training samples, leaving only one data point for
the likelihood.
Figure 1: Averages of log predictive densities for 10 test points based on 50 Monte Carlo
replicates. Black corresponds to the authors' prior, red corresponds to a more vague prior
on coefficients. Increasing the training sample size diminishes the effect of the prior but
also reduces log predictive density at higher sample sizes. None of the iBMA settings
produce predictive densities substantially higher than the BMA approach (open circles).
This setting barely changed the posterior model probabilities from their
uniform prior, which works well only in this specific case where a near uniform mixture
of the 15 candidate models performs adequately. The M-closed results shown in Figure 1
confirm that (i) iBMA diminishes influence of the prior as the size of the training sample
increases (note overlap in n − 1 training lines), (ii) an excessively large training sample
proportion erodes the predictive density especially at larger sample sizes, and more
importantly suggests that (iii) endowing BMA with prior-invariance machinery that
resembles the stacking procedure’s does not appear to offer any advantage in predictive
density. Hence, I second the authors’ conclusion that stacking is a superior approach for
prediction. The present study suggests that the stacking procedure’s prior invariance
property is a convenient bonus but not the major reason for its impressive performance.
References
Berger, J. O. and Pericchi, L. R. (1996). “The Intrinsic Bayes Factor for
Model Selection and Prediction.” Journal of the American Statistical Associa-
tion, 91(433): 109–122. URL http://www.jstor.org/stable/2291387. MR1394065.
doi: https://doi.org/10.2307/2291387. 990
De Santis, F. and Spezzaferri, F. (1997). “Alternative Bayes factors for model selection.”
Canadian Journal of Statistics, 25(4): 503–515. MR1614347. doi: https://doi.org/
10.2307/3315344. 990
Contributed Discussion
Eduard Belitser† and Nurzhan Nurushev∗‡
We thank the authors for an interesting paper that takes the idea of stacking from the
point estimation problem and generalizes it to the combination of predictive distributions.
Let us first mention some key ideas of the present paper. One of the main problems in
statistics is prediction in the presence of multiple candidate models or learning algo-
rithms M = (M1 , . . . , MK ). In Bayesian model comparison, the relationship between
the true data generator and the model list M = (M1 , . . . , MK ) can be classified into
three categories: M-closed, M-complete and M-open. The present paper addresses
these problems by providing a new stacking method. The authors com-
pare this method to several alternatives: stacking of means, Bayesian model averag-
ing (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized using the
Bayesian bootstrap. Based on simulations and real-data applications, they recommend
stacking of predictive distributions, with bootstrapped-Pseudo-BMA as an approximate
alternative when computation cost is an issue.
We enjoyed reading the paper and would like to make three comments/questions.
First, the methodology of the present paper relies on the knowledge of K, the number
of models in the list M = (M1 , . . . , MK ). Without loss of generality assume K ∈ (0, n].
We wonder whether the authors could come up with a general idea of how to extend the
stacking method to the unknown number of models, i.e., K is unknown. Perhaps the
problem can be addressed by adding a prior on K in the authors' framework, but this might
lead to large computational costs.
Second, all problems discussed in the present paper are examples of supervised learn-
ing. In other words, for each observation of the predictor measurements xi , i = 1, . . . , n,
there is an associated response measurement yi . The authors wish to fit a model that
relates the response to the predictors, with the aim of accurately predicting the response
for future observations (prediction) or better understanding the relationship between
the response and the predictors (inference). In contrast, unsupervised learning describes
the somewhat more difficult situation in which for every observation i = 1, . . . , n, we
observe a vector of measurements xi but no associated response yi . For instance, it is
not possible to fit linear or logistic regression models, since there is no response vari-
able to predict. This situation is referred to as unsupervised because we lack a response
variable that can supervise our analysis. One of the popular examples of unsupervised
learning is cluster analysis. The goal of cluster analysis is to ascertain, on the basis of
x1 , . . . , xn , whether the observations fall into relatively distinct groups (e.g., stochastic
block model, see Holland et al. (1983)). We wonder whether it is possible to extend
the stacking method to the examples of unsupervised learning (e.g., stochastic block
model).
∗ Research
funded by the Netherlands Organisation for Scientific Research NWO.
† Departmentof Mathematics, VU Amsterdam, e.n.belitser@vu.nl
‡Korteweg-de Vries Institute for Mathematics, University of Amsterdam, n.nurushev@uva.nl
The third comment is related to the simulation subsection 4.2. In that subsection
the authors consider K = 15 different models of linear regression Y = β1 X1 +
. . . + βJ XJ + ϵ, ϵ ∼ N(0, 1), where the number of predictors J is 15. However, the total
number of all possible linear regressions with at most J = 15 predictors is 2^15. We
wonder whether the methods studied in the present paper would be computationally
costly with all K = 2^15 possible models. For instance, the LASSO and ridge methods
solve this problem without large computational costs. It would also be interesting to
know, for future research, whether the corresponding estimators of the present paper
can achieve the minimax rate for the sparse linear regression problem studied in Bunea
et al. (2007).
We hope these comments will inspire the authors and other people to work on these
interesting problems in the future.
References
Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007). “Aggregation
for Gaussian regression.” Annals of Statistics, 35(4): 1674–1697. MR2351101.
doi: https://doi.org/10.1214/009053606000001587. 993
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). “Stochastic blockmodels:
First steps.” Social Networks, 5(2): 109–137. MR0718088. doi: https://doi.org/
10.1016/0378-8733(83)90021-7. 992
Contributed Discussion
Matteo Iacopini∗,† and Stefano Tonellato‡
We congratulate the authors for their excellent research, which led to the development
of a new statistical method for model comparison. The procedure is computationally
fast and can be applied in a variety of settings, ranging from mixture models to variable
selection in regression frameworks.
Figure 1: Posterior log-predictive density of model (a) (blue) and model (b) (black ), for
different values of the sample size N ∈ {5, 10, 20, 30, 40, 50}.
Figure 2: Conditional densities p(y|x) for several values of x, N = 20. True function
(red ), model (b) estimate (black ) and model (a) estimate (blue).
models under comparison, not to the methods used in order to produce stacking. What
is interesting to notice is that, despite the different weights computed according to the
two schemes, the combined conditional predictive densities are rather similar for all
values of the conditioning variable x. This feature has been observed to hold also when
the sample size increases, and similar results (not reported here) have been obtained for
different model specifications.
The main insight from this small simulation study is twofold: first, the approach
proposed by the authors outperforms the alternative weighting schemes, both in fitting
and in computational efficiency. Second, the scheme of Li and Dunson (2016) yields
comparable results in terms of conditional density estimation. This might suggest that
coupling Pseudo-BMA and Reference-Pseudo-BMA could be a successful strategy in
those circumstances in which leave-one-out or Pareto smoothed importance sampling
leave-one-out cross-validation is suspected to produce unstable results, due to small
sample size or large values of k̂.
References
Li, M. and Dunson, D. B. (2016). “Comparing and weighting imperfect models using
D-probabilities.” 994, 996
Müller, P., Quintana, F. A., Jara, A., and Hanson, T. (2013). Bayesian nonpara-
metric data analysis. Springer. MR3309338. doi: https://doi.org/10.1007/978-3-
319-18968-0. 994
Rejoinder
Yuling Yao∗ , Aki Vehtari† , Daniel Simpson‡ , and Andrew Gelman§
We thank the editorial team for organizing the discussion. We are pleased to find so many
thoughtful discussants who agree with us on the advantage of having stacking in the
toolbox for combining Bayesian predictive distributions. In this rejoinder we will provide
further clarifications and discuss some limitations and extensions of Bayesian stacking.
where y are partially exchangeable, i.e., ymn are exchangeable within group j, and θm are
exchangeable. Rearranging the data and denoting the group label of (xi , yi ) by zi , (3) can
be reorganized as ∏_{i=1}^{N′} p(yi |xi , zi , θ, ψ), so the previous results follow. Furthermore, de-
pending on whether the prediction task is to predict a new group or a new observation
within a particular group j, we should consider leave-one-point-out or leave-one-group-
out cross-validation, corresponding to modeling the new covariate by
p(x̃, z̃) ∝ δ(z̃ = j) ∑_{i: zi =j} δ(x̃ = xi ) or ∑_{j=1}^{J} ∑_{i: zi =j} δ(z̃ = j, x̃ = xi ).
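As a rough sketch of the leave-one-group-out case (using plain truncated importance sampling in place of PSIS, with illustrative names only), the group-level predictive density can be estimated from draws of the full posterior by weighting each draw by the reciprocal of that group's likelihood:

    ## Sketch: leave-one-group-out log predictive density log p(y_g | y_-g) from S
    ## full-posterior draws. loglik_group[s] = sum over i in group g of log p(y_i | theta_s).
    logo_lpd <- function(loglik_group, trunc = 0.99) {
      log_r <- -loglik_group                           # log importance ratios for dropping group g
      log_r <- pmin(log_r, quantile(log_r, trunc))     # crude truncation; PSIS would smooth the tail
      w <- exp(log_r - max(log_r)); w <- w / sum(w)    # self-normalized weights
      m <- max(loglik_group)
      m + log(sum(w * exp(loglik_group - m)))          # log of sum_s w_s p(y_g | theta_s)
    }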
When there is no obvious re-parametrization making the model conditionally factoriz-
able, the pointwise log-likelihood has the general form log p(yi |y−i , θ). It is still possible
to use PSIS-LOO in some special non-factorizable forms. Vehtari et al. (2018a) provide
a marginalization strategy of PSIS-LOO to evaluate simultaneously autoregressive nor-
mal models. Bakka et al. express concerns about the reliability of PSIS. We refer to
Vehtari et al. (2017) and Vehtari et al. (2018b) for computation and diagnostic details.
Lastly, although we used LOO, other variations of cross-validation could be used in
stacking. Roberts et al. (2017) review cross-validation strategies for data with temporal,
spatial, hierarchical, and phylogenetic structure. Many of these can also be computed
fast by PSIS as demonstrated for m-step-ahead cross-validation for time series (Buerkner
et al., 2018).
and replace the LOO density in (1) by the sequential predictive density leaving out all
future data: p(yt |y<t ) = ∫ p(yt |y1:t−1 , θ)p(θ|y1:t−1 )dθ in each model, and then stacking
follows. This is similar to the approach developed by Geweke and Amisano (2011, 2012).
Ferreira and Dawid suggest similar ideas. The ergodicity of y will yield
lim_{N→∞} (1/N) ∑_{t=1}^{N} S(p(·|y<t ), yt ) − lim_{N→∞} (1/N) ∑_{t=1}^{N} E_{Y1:N} S(p(·|Y<t ), Yt ) → 0.        (4)
When there is a particular horizon of interest for prediction, p(yt |y<t ) above is general-
ized to the m-step-ahead predictive density p(y_{t:(t+m−1)} |y<t ) = p(yt , . . . , yt+m−1 |y1 , . . . , yt−1 ) =
∫ p(y_{t:(t+m−1)} |y<t , θ)p(θ|y<t )dθ.
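Given draws θ_s from p(θ|y_{<t}) and the per-draw joint log densities of the next m observations, this integral is a simple log-mean-exp; a minimal base-R sketch (the input log_joint is a hypothetical vector we assume has already been computed for the model at hand) is:

    ## Sketch: Monte Carlo estimate of the m-step-ahead log predictive density.
    ## log_joint[s] = log p(y_t, ..., y_{t+m-1} | y_{<t}, theta_s), theta_s ~ p(theta | y_{<t}).
    m_step_lpd <- function(log_joint) {
      m <- max(log_joint)
      m + log(mean(exp(log_joint - m)))    # log of the average joint density over draws
    }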
However, the prequential approach introduces a different prediction task. Unless
some stationarity of the true data generating mechanism is assumed (e.g., P (yt |y<t ) =
P (yt′ |y<t′ )), the average cumulative performance (the second term in (4)) is different
from the one-step-ahead assessment in (2), which is evaluated only at the next unseen
observation t = N + 1.
from the model list. Stacking is not strongly sensitive to misspecified models (see
Section 4.1 of our paper), but it will be sensitive to how good an approximation is
possible given the ensemble space.
We discuss the concern about the inflexibility of the linear additive form of density combina-
tion in Section 5.2, and we construct the same orthogonal regression example as Clyde,
in which stacking cannot approximate the true model, which is a convolution of the
individual densities. By optimizing the leave-one-out performance of the combined predic-
tion, the stacking framework can be extended to more general combination forms, such
as the posterior family used in the BPS literature. Furthermore, simplex constraints
become unnecessary once we go beyond the linear combination of densities. We are inter-
ested in testing such approaches. Yoo proposes another way to obtain convolutional
combinations by stacking in the Fourier domain.
would prefer to form the component models using a projection predictive approach which
projects the information from the reference model to the restricted models (Piironen
and Vehtari, 2016, 2017a).
Zhou suggests Bayesian nonparametric (BNP) models as an alternative to model
averaging. Indeed, the spline models used in the experiments in Section 4.6 of our paper
can be considered as BNP models. We can compute fast LOO-CV also for Gaussian
processes and other Gaussian latent variable models (Vehtari et al., 2016).
References
Bernardo, J. M. and Smith, A. F. (1994). Bayesian theory. John Wiley & Sons.
MR1274699. doi: https://doi.org/10.1002/9780470316870. 997
Buerkner, P., Vehtari, A., and Gabry, J. (2018). “PSIS assisted m-step-ahead pre-
dictions for time-series models.” Technical report. URL http://mc-stan.org/loo/
articles/m-step-ahead-predictions.html 999, 1000
Dawid, A. P. (1984). “Present position and potential developments: Some personal
views: Statistical theory: The prequential approach.” Journal of the Royal Statistical
Society, Series A, 147(2): 278–292. MR0763811. doi: https://doi.org/10.2307/2981683.
999
Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econo-
metrics, 164(1): 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.
2011.02.017. 999
Geweke, J. and Amisano, G. (2012). “Prediction with misspecified models.” American
Economic Review , 102(3): 482–486. 999
Kamary, K., Mengersen, K., Robert, C. P., and Rousseau, J. (2014). “Testing hypotheses
via a mixture estimation model.” arXiv preprint arXiv:1412.2044. 1000
McAlinn, K., Aastveit, K. A., Nakajima, J., and West, M. (2017). “Multivari-
ate Bayesian Predictive Synthesis in Macroeconomic Forecasting.” arXiv preprint
arXiv:1711.01667. 1000
McAlinn, K. and West, M. (2017). “Dynamic Bayesian predictive synthesis in time series
forecasting.” arXiv preprint arXiv:1601.07463. MR3664859. 1000
Piironen, J. and Vehtari, A. (2016). “Projection predictive model selection for Gaussian
processes.” In 2016 IEEE 26th International Workshop on Machine Learning for
Signal Processing (MLSP), 1–6. 1001, 1002
Piironen, J. and Vehtari, A. (2017a). “Comparison of Bayesian predictive methods for
model selection.” Statistics and Computing, 27(3): 711–735. 1001, 1002
Piironen, J. and Vehtari, A. (2017b). “On the hyperprior choice for the global shrinkage
parameter in the horseshoe prior.” In Artificial Intelligence and Statistics, 905–913.
1001
Piironen, J. and Vehtari, A. (2017c). “Sparsity information and regularization in the
horseshoe and other shrinkage priors.” Electronic Journal of Statistics, 11(2): 5018–
5051. 1001
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauen-
stein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., et al. (2017). “Cross-
validation strategies for data with temporal, spatial, hierarchical, or phylogenetic
structure.” Ecography, 40(8): 913–929. 999
Shimodaira, H. (2000). “Improving predictive inference under covariate shift by weight-
ing the log-likelihood function.” Journal of Statistical Planning and Inference, 90(2):
227–244. MR1795598. doi: https://doi.org/10.1016/S0378-3758(00)00115-4.
998
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). “Covariate shift adaptation
by importance weighted cross validation.” Journal of Machine Learning Research,
8(May): 985–1005. 998
Sugiyama, M. and Müller, K.-R. (2005). “Input-dependent estimation of generalization
error under covariate shift.” Statistics & Decisions, 23(4/2005): 249–279. MR2255627.
doi: https://doi.org/10.1524/stnd.2005.23.4.249. 998
Vehtari, A., Buerkner, P., and Gabry, J. (2018a). “Leave-one-out cross-validation
for non-factorizable models.” Technical report. URL http://mc-stan.org/loo/
articles/loo2-non-factorizable.html 999
Vehtari, A., Gabry, J., Yao, Y., and Gelman, A. (2018b). “loo: Efficient leave-one-out
cross-validation and WAIC for Bayesian models.” R package version 2.0.0. 999
Vehtari, A., Gelman, A., and Gabry, J. (2017). “Pareto smoothed importance sampling.”
arXiv preprint arXiv:1507.02646. 999
Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., and Winther, O. (2016). “Bayesian
leave-one-out cross-validation approximations for Gaussian latent variable models.”
Journal of Machine Learning Research, 17(1): 3581–3618. MR3543509. 1002