Multivariate Generalized Linear Mixed Models For Count Data: Guilherme P. Silva Henrique A. Laureano
AJS YYYY, Volume VV, 1–xx.
http://www.ajs.or.at/
Abstract
Univariate regression models for count data have a rich literature. However, this is
not the case for multivariate count data. Therefore, we present the Multivariate Generalized
Linear Mixed Models (MGLMM) framework, which deals with a multivariate set of responses and
measures the correlation between them through random effects that follow a multivariate
normal distribution. This model is based on a GLMM with a random intercept and the
estimation process remains the same as a standard GLMM with random effects integrated
out via Laplace approximation. We efficiently implemented this model through the TMB
package available in R. We used Poisson, negative binomial (NB), and COM-Poisson distri-
butions. To assess the estimator properties, we conducted a simulation study considering
four different sample sizes and three different correlation values for each distribution. We
achieved unbiased and consistent estimators for Poisson and NB distributions; for COM-
Poisson estimators were consistent, but biased, especially for dispersion, variance, and
correlation parameter estimators. These models were applied to two datasets. The first
records the number of times each of 41 ant species was registered at 30 sites in Australia,
which results in 820 variance-covariance and 41 dispersion parameters being estimated
simultaneously, in addition to the regression parameters. The second is from the Australian
Health Survey, with 5 response variables and 5190 respondents. Both datasets can be considered
overdispersed according to the generalized dispersion index. The COM-Poisson model outperformed the other
two competitors considering three goodness-of-fit indexes, AIC, BIC, and maximized log-
likelihood values. As a result, it estimated parameters with smaller standard errors and
a greater number of significant correlation coefficients. Therefore, the proposed model
is capable of dealing with multivariate count data with under-, equi-, or overdispersed
responses, and of measuring any kind of correlation between them while taking into account the
effects of the covariates.
1. Introduction
It is well known that the Poisson distribution is the most popular model to deal with count
data under the framework of the generalized linear models (GLM) (Nelder and Wedderburn
1972). However, it is limited to equidispersed count data, i.e., when the mean of the response
variable is equal to the variance. The statistical literature on alternatives to the Poisson
model for univariate count data is rich, covering both overdispersion and underdispersion. For
example, the negative binomial (NB) type II distribution, parameterized by the mean and a
dispersion parameter and known as the quadratic parametrization of the variance (Winkelmann
2008), hurdle and zero-inflated models (Zeileis, Kleiber, and Jackman 2008), and mixed Poisson
regression (Winter and Bürkner 2021) are frequent options to model overdispersed count data. The
Gamma Count (Zeviani, Ribeiro Jr, Bonat, Shimakura, and Muniz 2014), Conway-Maxwell-Poisson
(COM-Poisson) (Shmueli, Minka, Kadane, Borle, and Boatwright 2005), and Extended
Poisson Tweedie (Bonat, Jørgensen, Kokonendji, Hinde, and Demétrio 2018) models, the last based
on the Poisson Tweedie distribution (El-Shaarawi, Zhu, and Joe 2011; Jørgensen and Kokonendji
2016), can handle all three situations (under-, equi-, and overdispersion), even though the
computation of their probability mass function (pmf) relies on numerical methods.
On the other hand, the literature for multivariate count data is scarce. However, there is an
increasing demand from researchers to analyze datasets with more than one response variable. The
benefit is a better investigation of the relationship between the response variables (Bonat
and Jørgensen 2016); at the same time, numerical methods can take advantage of it, since
more data is available to estimate model parameters.
Some methodologies to deal with multivariate data have been proposed and introduced by
Winkelmann (2008) with a focus on building multivariate distributions. However, it comes
with the price of some practical limitations. The multivariate Poisson-lognormal regression
(MPLR) and the latent Poisson-normal regression models admit only overdispersed data. The
multivariate Poisson-gamma mixture model (MPGM) and multivariate NB model (MNBM)
are suitable only for overdispersed data with positive correlations. Another example is the
copulas framework, which allows the building of multivariate distributions. But a negative
correlation between many response variables is difficult to model, especially for count data
(Nikoloulopoulos and Karlis 2009). Inouye, Yang, Allen, and Ravikumar (2017) proposed
three alternatives, constructed either via copulas or via mixtures of distributions, that deal
only with equi- and overdispersed data.
Distributions with no practical limitations were also developed. Famoye (2015) proposed
a multivariate generalized Poisson regression model that can deal with any kind of disper-
sion and correlation with an estimation based on the maximum likelihood (ML) paradigm.
In a similar way, Muñoz-Pichardo, Pino-Mejı́as, Garcı́a-Heras, Ruiz-Muñoz, and González-
Regalado (2021) proposed a multivariate conditional Poisson regression model, where the
relationship between response variables is measured by a coefficient in the linear predictor.
In its turn, the dependence between response variables is conditional on the other response
variables.
A very flexible modeling framework based on estimating functions, the Multivariate Covariance
Generalized Linear Models (MCGLM), can also fit such data (Bonat 2016). It relies
only on second-order moment assumptions, and the parameters are estimated based on
quasi-likelihood (Wedderburn 1974). It can accommodate correlated data based on an approach
similar to the generalized estimating equations (GEE) (Liang and Zeger 1986), allowing both
multivariate responses and correlated data. We can also cite methodologies based on Bayesian
inference, such as MCMC Generalized Linear Mixed Models via the MCMCglmm package (Hadfield
2010) and Bayesian Regression Models using Stan via the brms package (Bürkner 2018), which
come at the price of greater computational time. We are not going to discuss Bayesian
models in this article.
Another alternative is to model the correlation between response variables in the same
individual using the class of hierarchical GLM (Lee and Nelder 1996). This class allows the
modeling of correlated variables or individuals via a random effect, an unobserved variable
that can follow any distribution. When the distribution of the random effect is Gaussian, we have
the Generalized Linear Mixed Models (GLMM). However, GLMM is widely known and used
to model the correlation between sample units, not between response variables. Such a method is
implemented in consolidated packages in the R software (R Core Team 2020), such as glmmTMB
(Brooks, Kristensen, van Benthem, Magnusson, Berg, Nielsen, Skaug, Maechler, and Bolker
2017), lme4 (Bates, Mächler, Bolker, and Walker 2015) and nlme (Pinheiro, Bates, DebRoy,
Sarkar, and R Core Team 2017).
Austrian Journal of Statistics 3
In this article, we propose to model multivariate count data under the framework of GLMMs
to accommodate the correlation between response variables. Parameter estimates are obtained
through the maximum likelihood method (Aldrich et al. 1997). We use multivariate normally
distributed random effects to accommodate the correlation between response variables. The
estimation process is similar to the one for GLMM and was implemented in R (R Core Team
2020) through TMB package (Kristensen, Nielsen, Berg, Skaug, and Bell 2016).
Here, we will focus on multivariate overdispersed data, whereas da Silva, Laureano, Petterle,
Júnior, and Bonat (2022) presented a similar approach focused on underdispersed data. This
modeling approach for multivariate count data is evaluated for three different distributions of
the response variables: Poisson, NB, and COM-Poisson mean parameterized (Huang 2017).
Even though COM-Poisson does not belong to the exponential family, we refer to it as a
GLMM framework, since the estimation process remains the same regardless of the distribution
being used.
This article contains six sections, including this introduction. Section 2 describes the datasets
used to provide illustrative applications of the model. Section 3 proposes the multivariate
generalized linear mixed model (MGLMM). Section 4 shows the results of the simulation
study to assess the estimators’ properties. Section 5 presents the results of the model applied
to the datasets presented in Section 2. Finally, Section 6 discusses the major contributions of
this article and future work.
2. Data sets
ship between the variables shows overdispersion for all variables, with Nhosp being the most
overdispersed. This is characterized by a Fisher dispersion index (DI) greater than 1 (Fisher
1934). Also, the generalized dispersion index (GDI) (Kokonendji and Puig 2018) classifies
this dataset as overdispersed, since it is greater than 1 and its 95% confidence interval does
not contain 1. However, these results should be confirmed by model fitting.
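As a minimal sketch of how these two indexes can be computed, the Python code below evaluates the DI as the sample variance divided by the sample mean, and the GDI as (√m)ᵀ S (√m) / (mᵀ m), where m is the vector of sample means and S the sample covariance matrix, following our reading of Kokonendji and Puig (2018). The function names and the toy counts are illustrative assumptions and are not the AHS data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Fisher dispersion index: sample variance divided by sample mean."""
    return variance(counts) / mean(counts)


def generalized_dispersion_index(columns):
    """GDI = sqrt(m)' S sqrt(m) / (m' m), with m the vector of sample means
    and S the sample covariance matrix (one list of counts per response)."""
    n = len(columns[0])
    m = [mean(col) for col in columns]

    def cov(a, b, ma, mb):
        # Unbiased sample covariance between two responses.
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

    num = 0.0
    for j, cj in enumerate(columns):
        for k, ck in enumerate(columns):
            num += (m[j] ** 0.5) * cov(cj, ck, m[j], m[k]) * (m[k] ** 0.5)
    den = sum(v * v for v in m)
    return num / den


# Toy counts invented for illustration (hypothetical, not the AHS data).
y1 = [0, 0, 1, 0, 5, 0, 2, 0, 0, 9]
y2 = [1, 0, 0, 2, 7, 0, 1, 0, 0, 6]
print(dispersion_index(y1), dispersion_index(y2))
print(generalized_dispersion_index([y1, y2]))
```

For a single response the GDI reduces to the DI, and values above 1 point to overdispersion.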
[Figure 1 image: one bar plot per response variable; x-axis: Occurrence, y-axis: Frequency.]
Figure 1: Barplot for the count of each response variable from Australian Health Survey
(AHS) data. Ndoc (Number of consultations with a doctor or specialist), Nndoc (Number
of consultations with health professionals), Nadm (Number of admissions to a hospital, psy-
chiatric hospital, nursing or convalescence home in the past 12 months), Nhosp (Number of
nights in a hospital during the most recent admission) and Nmed (Total number of prescribed
and non-prescribed medications used in the past two days).
Table 1: Spearman correlation, mean, variance, and dispersion index (DI) for the Australian
Health Survey (AHS) response variables. The generalized dispersion index (GDI) and its
standard error (SE) are 17.944 (1.99). Ndoc (Number of consultations with a doctor or
specialist), Nndoc (Number of consultations with health professionals), Nadm (Number of
admissions to a hospital, psychiatric hospital, nursing or convalescence home in the past 12
months), Nhosp (Number of nights in a hospital during the most recent admission) and Nmed
(Total number of prescribed and non-prescribed medications used in the past two days).
5 environmental variables and we gave their full description in the supplementary material
Table S2. The name of the response variables starts with an index number (1,...,41) followed
by the abbreviated name of each species.
Figure 2 presents the barplot for each response variable from ANT data. The 41 different
species present high variability. Some species were seen only once, at a single site, such
as Polyrhachis and Solenopsis, while Iridomyrmex and Pheidole were seen 20 times at more than
one site.
[Figure 2 image: one bar plot per ant genus.]
Figure 2: Barplot for the occurrence of each ant species genus from ANT data.
Table 2 presents the mean, variance, and DI for every response variable and the GDI for the
dataset. The variance is higher than the mean (DI > 1) for almost all variables, except for
Polyrhachis and Solenopsis. This dataset is classified as overdispersed
based on the GDI = 11.543, and the lower bound of its 95% confidence interval was
greater than 1.
6 MGLMM count
Table 2: Mean, variance, and dispersion index (DI) for each genus for ANT data. The generalized
dispersion index (GDI) and its standard error (SE) equal 11.54 (.92).
Figure 3 explores how the occurrence of different species is related to each other by the
Spearman correlation in a correlogram. The marginal correlation ranges from -.58 up to .74,
which is well spread over the potential values of the correlation parameter, ρ ∈ (−1, 1).
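The Spearman correlation used in the correlogram is the Pearson correlation computed on ranks. A minimal sketch follows, assuming average ranks for ties, which are frequent in count data; the function names and the example values are illustrative only.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical counts of two species over six sites.
print(spearman([0, 0, 2, 5, 1, 3], [4, 3, 1, 0, 2, 0]))
```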
[Figure 3 image: correlogram of the Spearman correlations between the ant species (numbered 1 to 41); color scale from −1 to 1.]
$$
g_r(\mu_{ir}) = \mathbf{x}_{irj}^{\top} \boldsymbol{\beta}_r + b_{ir},
$$
where $g_r(\cdot)$ is a suitable link function, $\boldsymbol{\beta}_r$ is a $p \times 1$ vector of regression parameters and $b_{ir}$ is the
random intercept for subject $i$ and response $r$. Finally, the random effects distribution
is specified as:
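$$
\begin{bmatrix} b_{i1} \\ b_{i2} \\ \vdots \\ b_{ir} \end{bmatrix}
\sim \mathrm{NM}\left(
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix};\;
\Sigma =
\begin{bmatrix}
\sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1r}\sigma_1\sigma_r \\
\rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \rho_{2r}\sigma_2\sigma_r \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{r1}\sigma_r\sigma_1 & \rho_{r2}\sigma_r\sigma_2 & \cdots & \sigma_r^2
\end{bmatrix}_{r \times r}
\right),
$$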
where each random effect has mean 0 and variance $\sigma_r^2$, and $\rho_{rr'}$ ($r \neq r'$) is the correlation between each
pair of random effects. This framework is general and can be applied to any distribution
$f$ and link function $g_r(\cdot)$. Nevertheless, as we are dealing with count data, we considered
only the Poisson, negative binomial type II and COM-Poisson distributions, and a logarithmic
link function. Extending this model to different distributions for each response ($f_r$) is also
possible, but we will not address it.
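To make the random-effects structure concrete, the sketch below builds Σ from the standard deviations and correlations and simulates responses from the Poisson version of the model with an intercept-only linear predictor and a log link. It is a plain-Python illustration under these assumptions, not the TMB implementation used in this article; all function names are hypothetical.

```python
import math
import random


def covariance_matrix(sigmas, rho):
    """Sigma[r][s] = rho[r][s] * sigma_r * sigma_s, with rho[r][r] = 1."""
    R = len(sigmas)
    return [[rho[r][s] * sigmas[r] * sigmas[s] for s in range(R)] for r in range(R)]


def cholesky(a):
    """Lower-triangular L with L L' = a (a must be positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def rpois(mu, rng):
    """Poisson draw by Knuth's multiplication method (adequate for moderate mu)."""
    limit, k, p = math.exp(-mu), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(n, betas, sigmas, rho, seed=1):
    """n subjects: b_i ~ N(0, Sigma), Y_ir | b_ir ~ Poisson(exp(beta_r + b_ir))."""
    rng = random.Random(seed)
    L = cholesky(covariance_matrix(sigmas, rho))
    R = len(betas)
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(R)]
        # Correlated random intercepts: b = L z.
        b = [sum(L[r][k] * z[k] for k in range(r + 1)) for r in range(R)]
        data.append([rpois(math.exp(betas[r] + b[r]), rng) for r in range(R)])
    return data


# Bivariate example with illustrative parameter values.
sample = simulate(5, [math.log(7), math.log(1.5)],
                  [math.sqrt(0.3), math.sqrt(0.15)], [[1.0, 0.5], [0.5, 1.0]])
print(sample)
```

The Cholesky factor is used only to turn independent standard normal draws into draws with covariance Σ; any other factorization with the same property would serve.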
Maximum likelihood is the estimation procedure, and it is fully described in da Silva et al.
(2022). We integrated out the random effects via numerical integration using the Laplace
approximation (Tierney and Kadane 1986). The Newton-Raphson method was efficiently
implemented to perform the inner optimization. We optimized the marginal likelihood using
first-order derivative methods, such as BFGS and OPTIM routines available in the R software.
The derivatives of the joint and marginal likelihood were obtained through automatic differen-
tiation (Baydin, Pearlmutter, Radul, and Siskind 2017). This was efficiently implemented by
the TMB package, which provides C++ templates where an objective function must be supplied,
in this case, the log-likelihood function.
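The Laplace step can be illustrated on the simplest case: one Poisson observation with a single Gaussian random intercept. The sketch below finds the mode of the joint log-density by Newton-Raphson and compares the resulting approximation with a brute-force trapezoidal integration. It is written in plain Python with hand-coded derivatives, whereas TMB obtains them by automatic differentiation; the function names and the example values are assumptions made for illustration.

```python
import math


def joint_logdens(b, y, beta, sigma2):
    """log f(y | b) + log f(b) for y | b ~ Poisson(exp(beta + b)), b ~ N(0, sigma2)."""
    eta = beta + b
    loglik = y * eta - math.exp(eta) - math.lgamma(y + 1)
    logprior = -0.5 * math.log(2 * math.pi * sigma2) - b * b / (2 * sigma2)
    return loglik + logprior


def laplace_logmarginal(y, beta, sigma2, tol=1e-10, max_iter=50):
    """Laplace approximation to the log of the integral of exp(joint_logdens) over b."""
    b = 0.0
    for _ in range(max_iter):
        grad = y - math.exp(beta + b) - b / sigma2  # first derivative in b
        hess = -math.exp(beta + b) - 1.0 / sigma2   # second derivative (always < 0)
        step = grad / hess
        b -= step                                   # Newton-Raphson update
        if abs(step) < tol:
            break
    hess = -math.exp(beta + b) - 1.0 / sigma2
    return joint_logdens(b, y, beta, sigma2) + 0.5 * math.log(2 * math.pi / -hess)


def quadrature_logmarginal(y, beta, sigma2, half_width=8.0, n=4000):
    """Reference value: trapezoidal rule on a wide grid in b."""
    sd = math.sqrt(sigma2)
    lo, hi = -half_width * sd, half_width * sd
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(joint_logdens(lo + i * h, y, beta, sigma2))
    return math.log(total * h)


print(laplace_logmarginal(7, math.log(7), 0.3),
      quadrature_logmarginal(7, math.log(7), 0.3))
```

In the multivariate model the same idea applies with a vector of random effects per subject, so that the second derivative becomes a Hessian matrix and its determinant enters the approximation.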
The variability of the random variable in this model is being measured by two parameters,
the variance of the random effect and the dispersion of the pmf. A crucial point of the model
is the great flexibility to learn these two types of variances with no identifiability problems.
The code used to produce the results of this paper is available at
https://github.com/guilhermeparreira/papers/tree/master/AJS_MGLMM_Overdispersed_Count and
http://www.leg.ufpr.br/doku.php/publications:papercompanions.
4. Simulation Studies
In this section, we present simulation studies to assess the properties of the ML estimators
(bias and consistency). We considered a bivariate regression model for count data. We
designed 12 simulation scenarios with four different sample sizes, 100, 250, 500, and 1000,
and three different correlations between random effects, ρ = −0.5, 0, 0.5. For the regression
structure, we considered only an intercept for each response, with β₀₁ = log(7) and β₀₂ =
log(1.5). The variances of the random effects were σ₁² = .3 and σ₂² = .15. The dispersion
parameters for NB and COM-Poisson were φ = 1 and ν = .7, respectively, which induces a small
overdispersion.
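To illustrate how ν controls the dispersion, the sketch below computes the mean and variance of a COM-Poisson distribution by truncating its normalizing series. It assumes the original (λ, ν) parametrization of Shmueli et al. (2005), with pmf proportional to λ^y / (y!)^ν, rather than the mean parametrization of Huang (2017) used in this article; the value λ = 3 and the truncation point are arbitrary choices.

```python
import math


def com_poisson_moments(lam, nu, max_y=500):
    """Mean and variance of COM-Poisson(lam, nu), truncating the series at max_y."""
    # Log of the unnormalized pmf: y * log(lam) - nu * log(y!).
    logw = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_y + 1)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]  # rescaled to avoid overflow
    z = sum(w)
    mean = sum(y * wy for y, wy in enumerate(w)) / z
    var = sum((y - mean) ** 2 * wy for y, wy in enumerate(w)) / z
    return mean, var


for nu in (0.7, 1.0, 2.0):
    m, v = com_poisson_moments(3.0, nu)
    print(nu, round(m, 3), round(v, 3), round(v / m, 3))
```

With ν = 1 the distribution is Poisson and the ratio of variance to mean equals 1; ν < 1 gives a ratio above 1 (overdispersion) and ν > 1 a ratio below 1 (underdispersion).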
We generated 150, 200, and 300 datasets per scenario for the Poisson, COM-Poisson, and NB
distributions, respectively. The primary idea was to generate 100 datasets for each distribution.
However, as the fitting procedure did not return the SE of the parameter estimates in every
repetition, it was necessary to increase the number of datasets generated proportionally to the
number of SE failures for each distribution to obtain at least 100 valid estimations. The
following three subsections present the results for the Poisson, NB, and COM-Poisson scenarios.
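The size of the simulation design can be checked with a few lines of code. The sketch below enumerates the 12 scenarios and the total number of model fits per distribution, using only the numbers stated in the text; the variable and function names are illustrative.

```python
from itertools import product

sample_sizes = [100, 250, 500, 1000]
correlations = [-0.5, 0.0, 0.5]
# Datasets generated per scenario, as reported in the text.
datasets_per_scenario = {"Poisson": 150, "COM-Poisson": 200, "NB": 300}

# One scenario per combination of sample size and correlation.
scenarios = list(product(sample_sizes, correlations))


def total_fits(reps):
    """Total number of fitted models for one distribution."""
    return len(scenarios) * reps


for dist, reps in datasets_per_scenario.items():
    print(dist, total_fits(reps))
```

The totals agree with the 3×4×150, 3×4×200 and 3×4×300 simulations reported in this section.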
[Figure 4 image: one panel per correlation (ρ = −0.5, ρ = 0, ρ = 0.5); rows: β₀₁, β₀₂, σ₁, σ₂, ρ₁₂; one interval per sample size (100, 250, 500, 1000).]
Figure 4: Average bias and its 95% confidence interval based on the average standard error
(SE) by four different sample sizes and three correlation coefficients for the bivariate Poisson
regression model. The true values of the parameters were β₀₁ = log(7), β₀₂ = log(1.5),
σ₁² = .3 and σ₂² = .15.
Figure 5 presents the coverage rate for each parameter by sample size and simulation scenario
for the bivariate Poisson regression model. Overall, all empirical coverage rates are close
to the nominal level of 95%, varying between 90% and 98% approximately. In particular, the
coverage rate of the regression parameters β₀ᵣ is slightly greater than the nominal level. For
the variance of random effects σᵣ, the coverage rate is slightly smaller than the nominal level.
Finally, for the correlation between random effects ρ, the coverage rate is slightly lower than
the 95% nominal level when ρ = .5, and slightly greater when ρ = {0, −0.5}.
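The two summaries used throughout this section can be sketched as follows, assuming Wald-type intervals of the form estimate ± 1.96 SE. The function name and the toy replicates are hypothetical and are not simulation output from this study.

```python
def bias_and_coverage(estimates, ses, truth, z=1.96):
    """Average bias and share of intervals (estimate ± z * SE) that contain the truth."""
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    covered = sum(1 for e, s in zip(estimates, ses)
                  if e - z * s <= truth <= e + z * s)
    return bias, covered / n


# Five invented replicates for a parameter whose true value is 0.5.
estimates = [0.48, 0.55, 0.41, 0.70, 0.52]
ses = [0.05, 0.05, 0.05, 0.05, 0.05]
print(bias_and_coverage(estimates, ses, truth=0.5))
```

In this toy example four of the five intervals contain the true value, so the empirical coverage rate is 80%.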
Even for the Poisson distribution, which is the simplest case to estimate because there is no
dispersion parameter, 13 out of 1800 (3×4×150) simulations either did not return the
SE for some parameter estimates or produced extreme values due to large SEs; these were
excluded from the results. We used the PORT algorithm to estimate the model because it was
more stable and faster than BFGS in most situations. The same was observed in Kristensen et al.
(2016), where 9 case studies from different model settings, ranging from linear regression to
multivariate stochastic volatility models, were considered, and PORT had a better performance
than BFGS.
[Figure 5 image: coverage rate (y-axis, 0.90 to 0.98) against sample size (x-axis: 100, 250, 500, 1000); one row of panels per correlation (ρ = 0.5, ρ = 0, ρ = −0.5).]
Figure 5: Coverage rate for each parameter (in the columns) by sample size and correlation
coefficient for the bivariate Poisson regression model. The true values of the parameters were β₀₁ =
log(7), β₀₂ = log(1.5), σ₁² = .3 and σ₂² = .15.
[Figure 6 image: one panel per correlation (ρ = −0.5, ρ = 0, ρ = 0.5); rows: β₀₁, β₀₂, σ₁, σ₂, ρ₁₂, φ₁, φ₂; x-axis: Bias ± SE, from −1.0 to 1.0; one interval per sample size (100, 250, 500, 1000).]
Figure 6: Average bias and its 95% confidence interval based on the average standard error
(SE) by four different sample sizes and three correlation coefficients for the bivariate negative
binomial (NB) regression model. The true values of the parameters were β₀₁ = log(7),
β₀₂ = log(1.5), φ = 1, σ₁² = .3 and σ₂² = .15.
Overall, Figure 7 shows that the empirical coverage rates are close to the nominal level of
95%. In particular, the coverage rates of the regression parameters β₀ᵣ, the variance of random
effects σᵣ and the correlation between random effects ρ are slightly greater than the nominal
level. In contrast, the coverage rate for ρ was close to 80% when the
sample size was equal to 100 and ρ = 0.
[Figure 7 image: coverage rate (y-axis, 0.80 to 1.00) against sample size (x-axis: 100, 250, 500, 1000); one row of panels per correlation (ρ = 0.5, ρ = 0, ρ = −0.5).]
Figure 7: Coverage rate for each parameter (in the columns) by sample size and correlation
coefficient for the bivariate negative binomial (NB) regression model. The true values of the
parameters were β₀₁ = log(7), β₀₂ = log(1.5), φ = 1, σ₁² = .3 and σ₂² = .15.
Estimation problems were more severe for the bivariate NB regression model than for
the Poisson model and occurred in 740 out of 3600 (3×4×300) simulations (20.6%). It was also
necessary to remove from the results those replicates in which the estimate of the dispersion
parameter φ was greater than 5 (the value used in the simulation was 1).
[Figure 8 image: one panel per correlation (ρ = −0.5, ρ = 0, ρ = 0.5); rows: β₀₁, β₀₂, σ₁, σ₂, ρ₁₂, ν₁, ν₂; x-axis from −0.5 to 0.5; one interval per sample size (100, 250, 500, 1000).]
Figure 8: Average bias and its 95% confidence interval based on the average standard error
(SE) by four different sample sizes and three correlation coefficients for the bivariate COM-Poisson
regression model. The true values of the parameters were β₀₁ = log(7), β₀₂ = log(1.5), ν = .7,
σ₁² = .3 and σ₂² = .15.
95%, which agrees with the results presented in Figure 8, where the bias did not decrease
even for larger sample sizes. In particular, the coverage rate of σ₂ had the worst results
among all parameters (the coverage rate decreases as the sample size increases), followed by ν₂.
Not surprisingly, these parameters had the two largest biases in Figure 8. In contrast, results
for β₀₁, σ₁ and ρ (when the data were generated with ρ = 0) had coverage rates close to the 95%
level.
Estimation problems were more severe for the COM-Poisson regression model than for the
Poisson model and less severe than for NB, accounting for 245 (10.2%) out of 2400 (3×4×200)
simulations. Besides the rules used for Poisson to classify extreme values, it was necessary to
remove those models in which the estimate of the dispersion parameter ν was greater than 4 and
the SE of the ρ estimate was greater than 2.
5. Data Analyses
This section presents the data analyses of the two datasets presented in Section 2. Initial
values for the model in Section 3 were chosen carefully in the following way. First, for
every dataset, we fitted a Poisson model via quasi-likelihood with the MCGLM package (Bonat 2016)
to obtain initial parameter estimates for the regression and variance parameters (based on the
variance of the residuals). Because of the difference in methodologies, we set the correlation
parameter to 0. These were the initial values for the Poisson model; the final estimates from the
Poisson model became the initial values for the NB model, and those from the NB model became the
initial values for the COM-Poisson model. The initial values for the dispersion parameter were φ = 1 for NB, i.e., small
[Figure 9 image: coverage rate (y-axis, 0.4 to 1.0) against sample size (x-axis: 100, 250, 500, 1000); one row of panels per correlation (ρ = 0.5, ρ = 0, ρ = −0.5).]
Figure 9: Coverage rate for each parameter (in the columns) by sample size and correlation
coefficient for the bivariate COM-Poisson regression model. The true values of the parameters
were β₀₁ = log(7), β₀₂ = log(1.5), ν = .7, σ₁² = .3 and σ₂² = .15.
Table 3: Goodness-of-fit measures for AHS data from different distributions and specifications.
Number of parameters estimated (np), Akaike and Bayesian Information Criterion (AIC and
BIC), and log-likelihood value (Loglik).
The COM-Poisson full model was the best among all distributions and the simpler versions
of each distribution.
Figure 10 presents the regression parameter estimates and 95% confidence intervals by outcome
and final model. The confidence interval width is almost the same
for all models, but slightly smaller for the COM-Poisson model than for its competitors for
almost all estimates for the Ndoc response variable; for the other response variables, we saw
little difference. The point estimates were very close between the Poisson and NB models, with
a difference for the COM-Poisson model in some estimates, for example, freepoor, age, sex,
illness, and chcond limited for Ndoc. We may relate this to the bias found in Figure 8.
[Figure 10 image: rows (covariates): chcond limited, actdays, illness, freerepa, freepoor, levyplus, income, age, sex (fem).]
Figure 10: Regression parameter estimates and 95% confidence intervals by outcome and final
model for the Australian Health Survey (AHS) data.
Table 4 presents the dispersion estimates for each model and outcome. Even though these
data can be considered marginally overdispersed according to the GDI = 17.94 presented
in Table 1, we see that φ approaches infinity (suggesting an equidispersed model under the NB
distribution), and ν is greater than 1 for almost all response variables, which indicates
underdispersion. The expected result would be φ and ν approaching zero, which corresponds to
overdispersion in both distributions. However, as explored in subsection 5.2, this behavior may
occur because of the variance of the random effect. For the COM-Poisson model, the only ν estimate
that indicates overdispersion was for Nmed, which had a sample DI equal to 1.99 (small
overdispersion).
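The role of φ described above can be made explicit with a short sketch. It assumes the quadratic variance function Var(Y) = μ + μ²/φ for the NB type II distribution, which is consistent with the limits discussed in the text (equidispersion as φ grows, overdispersion as φ approaches zero); the mean value used below is arbitrary and the function names are illustrative.

```python
def nb_variance(mu, phi):
    """Variance of the NB type II distribution under Var(Y) = mu + mu**2 / phi."""
    return mu + mu ** 2 / phi


def nb_dispersion_index(mu, phi):
    """Ratio of variance to mean, equal to 1 + mu / phi."""
    return nb_variance(mu, phi) / mu


# Arbitrary mean; phi values from small to the order of magnitude seen in Table 4.
for phi in (0.5, 1.0, 1.7e4):
    print(phi, nb_dispersion_index(2.0, phi))
```

For φ of the order of 10⁴ the ratio is practically 1, so the fitted NB model is indistinguishable from a Poisson model conditional on the random effects.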
Table 4: Dispersion parameter estimates and standard errors (SEs) for each model and
outcome of the Australian Health Survey (AHS) data.

           NB (φ)                COM-Poisson (ν)
Outcome    Estimate   SE         Estimate   SE
Ndoc       1.7e+04    1.8e+05    9.160      0.432
Nndoc      5.9e+03    6.0e+04    2.837      0.336
Nmed       3.6e+03    2.4e+04    0.674      0.032
Nhosp      8.5e+03    1.2e+05    5.110      0.615
Nadm       8.5e+03    8.1e+04    6.665      0.431
We presented the correlation coefficient estimates and their SEs in parentheses in (1) for the
COM-Poisson model. Among the 10 correlation coefficients returned, COM-Poisson had 6
significant correlation coefficients. The stars in the matrix represent significant coefficients at
the 5% level. Even though we do not have a primary interest in interpreting each correlation
coefficient estimate, it is important to know which of them is significant, because it may be
related to a smaller SE.
We can see an almost perfect significant correlation between Nadm and Nhosp, and a strong
correlation between Nmed and Ndoc. We had also seen this pattern in Table 1.
Table 5: Goodness-of-fit measures for ANT data from the best parametrization for each dis-
tribution. Number of parameters estimated (np), Akaike and Bayesian Information Criterion
(AIC and BIC), and log-likelihood value (Loglik).
From Table 5 we can see that the best models were from the COM-Poisson distribution; the Poisson
and NB models had similar logLik results. The fully specified COM-Poisson model had higher logLik
and BIC, and smaller AIC, than the fixed dispersion model. A likelihood ratio test (LRT) between
these two COM-Poisson models resulted in p < .00001 (χ²₄₁ = 132.84), which gives evidence in
favor of the fully specified model. Therefore, we will present the estimates of the full models
for each distribution.
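The p-value of the LRT can be checked from the reported statistic and its 41 degrees of freedom. The sketch below uses the closed-form expressions for the upper tail of the chi-square distribution with integer degrees of freedom (a finite sum for even degrees of freedom, and the normal tail plus a finite sum for odd ones); the function name is illustrative, and a statistical library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: 2*Q(sqrt(x)) + 2*phi(sqrt(x)) * sum_{r=1}^{(df-1)/2} x^(r-1/2) / (1*3*...*(2r-1)),
    # where Q and phi are the standard normal upper tail and density.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))  # equals 2 * Q(sqrt(x))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return tail + 2 * density * total


# LRT statistic and degrees of freedom reported in the text.
print(chi2_sf(132.84, 41))
```

The resulting probability is far below .00001, in agreement with the reported result.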
Figure 11 presents the regression parameter estimates and 95% confidence intervals by outcome
and final model for the first 12 response variables. We presented the same graphic for
the remaining response variables in the supplementary material, Figures S1, S2, and S3, since
we found similar patterns for all response variables. First, we can see that not all response
variables share the same linear predictor (the covariate feral mammal dung is not present in
the second, third and ninth response variables). Second, there were still some regression
estimates whose SEs were not returned by the model; these are represented by a hollow circle. It
was necessary to distinguish between cases where the SE was not returned and cases where it was
tiny (filled circle, such as bare ground for COM-Poisson and response variables 1-8 and 10-12).
Overall, the confidence intervals for the COM-Poisson model were narrower than those of its
counterparts, and the point estimates were close among all models. The feral mammal dung covariate
had the widest confidence intervals compared to the other covariates. There are still regression
coefficients that are very close to zero, suggesting that a better variable selection method
may improve the model fit.
[Figure 11 image: one panel per response variable (panel titles recovered: 5.Camponotus.Co, 6.Camponotus.Ni, 7.Camponotus.Ns, 8.Cardiocondyla, 9.Crematogaster, 10.Heteroponera, 11.Iridomyrmex.Bi, 12.Iridomyrmex.Dr); rows: Feral.mammal.dung, Shrub.cover, Canopy.cover, Bare.ground; x-axis: β̂ ± 1.96SE.]
Figure 11: Regression parameter estimates and 95% confidence intervals by outcome and final
model for ANT data.
The correlation coefficients for the COM-Poisson model are presented in Figure 12. Among the
820 correlation coefficients returned, COM-Poisson had 333 significant correlation coefficients,
NB 278, and Poisson 83. This shows that the correlation coefficients from COM-Poisson had
smaller SEs than their counterparts. The stars in the graphic represent the significant
coefficients at the 5% level. The correlation patterns presented in this figure differ somewhat
from the ones found in the marginal correlation. This shows the importance of calculating the
correlation coefficient in a model that accounts for the effects of the linear predictor. For
example, the sample correlation was nearly zero between responses 1 and 2 and between responses
2 and 3, while the COM-Poisson correlations between the random effects of these pairs of response
variables were negative and significant at the 5% level. For NB only one was significant, and for
Poisson neither was significant, possibly because of a high SE, since ρ̂ was equal to -.7 between
the random effects of Y1 and Y2, and -.54 between the random effects of Y2 and Y3.
Figure 12: Correlogram of ant species occurrence from COM-Poisson Model. Stars represent
significant correlation at 5% level
[Figure 12 image: correlogram with the species numbers on both axes, stars marking the significant correlations, and a color scale for the correlation values.]
6. Discussion
The focus of this article was to propose the MGLMM for count data. The great advantage of
this model is the ability to deal with multiple outcomes. We specified the framework based
on a GLMM with a random intercept, where we structured a joint model whose random
effect follows a multivariate normal distribution. As a result, we can measure the variance
and correlation of the random effects motivated by the multivariate set of responses. The
distributions used for the count variables were Poisson, NB, and COM-Poisson. The estimation
process is the same as for a GLMM. We used the TMB package to implement the model, since
it provides state-of-the-art C++ libraries that handle automatic differentiation, CppAD (Bell
2005), linear algebra, Eigen C++ (Guennebaud, Jacob et al. 2010) and parallel computation,
BLAS (Blackford, Petitet, Pozo, Remington, Whaley, Demmel, Dongarra, Duff, Hammarling,
Henry et al. 2002), among others.
In order to evaluate the properties of the ML estimators, we conducted simulation studies
for each distribution, considering three different values for the correlation parameter and
four different sample sizes. Each scenario was evaluated through the average bias, a confidence
interval based on the average SE, and the coverage rate at a nominal level of 95%. For the Poisson
distribution, we achieved unbiased and consistent estimators for large samples, with intervals
for the bias within at most (−0.2, 0.2), except for the sample size of 100 and for
the parameter ρ. The coverage rate was close to 95% in all scenarios considered, varying
between 90% and 98%. We observed greater variability for the NB than for the Poisson
distribution, mainly because of the dispersion parameter φ in the NB, which is necessary to model
overdispersion. For the NB distribution, we also achieved unbiased and consistent estimators for
large samples, with bias intervals within (−0.5, 0.5) in most cases. In particular, we
noticed wider confidence intervals for the correlation and φ2 parameter estimates,
and a small bias for the σ2 and ν2 parameter estimates. The coverage rate in most cases was equal
to or greater than 95%, with a coverage rate below 80% for the correlation parameter when
ρ = 0 and the sample size was 100.
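The evaluation criteria described above can be sketched as follows. This is a minimal
illustration, assuming Wald-type intervals (estimate ± z × SE) and an interval for the bias
built from the average SE; the function name summarize_simulation and its return structure
are hypothetical and do not reproduce the code used in the simulation study.

```python
def summarize_simulation(estimates, std_errors, true_value, z=1.959964):
    """Summarize simulation replicates for a single parameter.

    estimates, std_errors: one entry per simulated dataset.
    Returns the average bias, an interval for the bias based on the average SE,
    and the empirical coverage rate of the Wald intervals (nominal level 95%).
    """
    n = len(estimates)
    avg_bias = sum(e - true_value for e in estimates) / n
    avg_se = sum(std_errors) / n
    # Interval for the bias, assumed here to be avg_bias +/- z * avg_se.
    bias_interval = (avg_bias - z * avg_se, avg_bias + z * avg_se)
    # A replicate "covers" when its own Wald interval contains the true value.
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= true_value <= e + z * s
    )
    return {
        "avg_bias": avg_bias,
        "bias_interval": bias_interval,
        "coverage": covered / n,
    }
```

An estimator is then judged approximately unbiased when the bias interval contains zero,
and the coverage rate is compared with the nominal 95% level.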
For the COM-Poisson model, the parameter estimators were neither consistent nor unbiased.
The regression parameter β02 was underestimated, whereas β01 showed no bias. The correlation
parameter ρ was always estimated towards zero: it was overestimated when ρ = −0.5,
underestimated when ρ = +0.5, and unbiased when ρ = 0. The standard deviation of the
random effect σ2 was overestimated, while σ1 was underestimated. This behavior may be
related to the dispersion parameter ν: when more variance was captured by σ2, less
dispersion was captured by ν2; conversely, when less variance was captured by σ1,
more dispersion was captured by ν1.
As most parameters were biased, the coverage rate was not close to 95% in every scenario. For
σ2, β02, and νr the coverage rates were between 70% and 90%; for ρ the coverage rate was
close to 80% in 2 out of 3 scenarios (ρ = {−0.5, 0.5}). The remaining cases, β01, σ1, and ρ
when ρ = 0, had coverage rates close to 95%.
After that, the two datasets were analyzed with each model and its variants: no
correlation, fixed dispersion, fixed variance, and common variance for all random effects. The
first dataset was from the AHS, with five response variables and 5190 respondents.
According to the fit measures used, the COM-Poisson model provided the best fit. The SEs
of the β and ρ estimates were similar among the COM-Poisson, NB, and Poisson models. The
ν parameter indicated underdispersion for four of the five response variables and overdispersion
for one, accompanied by a small σ. The COM-Poisson model produced more significant
correlation values than its counterparts.
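To make the role of ν concrete, the sketch below evaluates the COM-Poisson probability
mass function and its dispersion index (variance divided by mean). It assumes the original
(λ, ν) parametrization of Shmueli et al. (2005), which may differ from the parametrization
used in the fitted models, and it truncates the normalizing series at max_y; the function
names are hypothetical.

```python
import math


def com_poisson_pmf(lam, nu, max_y=200):
    """COM-Poisson probabilities P(Y = y) for y = 0, ..., max_y.

    Unnormalized weights are lam**y / (y!)**nu; the infinite normalizing
    series is truncated at max_y (an approximation, fine for moderate lam).
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_y + 1)]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def dispersion_index(pmf):
    """Variance-to-mean ratio of a distribution on 0, 1, 2, ..."""
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return var / mean
```

Under this parametrization, ν = 1 recovers the Poisson distribution (index equal to 1),
ν > 1 gives an index below 1 (underdispersion), and ν < 1 gives an index above 1
(overdispersion), which is how the estimated ν values are read in the data analyses.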
The second dataset was the ANT data, which contains 41 response variables recording the
number of times each ant species was registered at 30 sites in Australia. The multivariate
response can be considered overdispersed according to the GDI. The COM-Poisson model was also
the best model regarding AIC and logLik; the model with the best BIC was the COM-Poisson with
fixed dispersion. In almost all comparisons, the SEs of the COM-Poisson model were smaller than
those of the other models.
The ν parameter was greater than 1 for all response variables, indicating underdispersion.
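The disagreement between AIC and BIC reported above follows from their different penalties
on the number of parameters. The sketch below uses the standard definitions; the
log-likelihoods and parameter counts in the usage note are hypothetical values chosen only
to show the mechanism, not results from the fitted models, and taking the number of sites
as the sample size in the BIC is an assumption.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 logLik (smaller is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k log(n) - 2 logLik (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * loglik
```

With hypothetical values logLik = −1000 and 100 parameters for a full model, against
logLik = −1050 and 60 parameters for a fixed-dispersion model, and n = 30, the AIC favors
the full model while the BIC favors the fixed-dispersion model, because log(30) > 2 makes
the BIC penalty per parameter heavier.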
Therefore, we suggest using the MGLMM framework for count data. In particular,
the best results were obtained with the COM-Poisson model on the two real datasets. The major
advantage of the framework is the possibility of modeling all response variables at the same
time and measuring the correlation between the random effects.
The estimation process of this model was computationally intensive, and implementing the
model was itself a computational challenge. For example, the estimation of the COM-Poisson model
for the AHS data was cumbersome. It took 5 days to fit on a server running Debian GNU/Linux 8
(jessie) with 92 GB of RAM and an AMD Opteron 6136 processor, using two threads. Improving the
computational implementation or even using other frameworks are examples of future work.
Improving the estimation procedure for the COM-Poisson model in this context is much
needed, as is understanding whether the COM-Poisson distribution can replace the NB
distribution for overdispersed datasets.
References
Aldrich J (1997). “RA Fisher and the making of maximum likelihood 1912-1922.”
Statistical Science, 12(3), 162–176.
Bates D, Mächler M, Bolker B, Walker S (2015). “Fitting Linear Mixed-Effects Models Using
lme4.” Journal of Statistical Software, 67(1), 1–48. doi:10.18637/jss.v067.i01.
Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2017). “Automatic differentiation in
machine learning: a survey.” The Journal of Machine Learning Research, 18(1), 5595–5637.
Bell BM (2005). CppAD: a package for C++ algorithmic differentiation. URL
http://www.coin-or.org/CppAD.
Blackford LS, Petitet A, Pozo R, Remington K, Whaley RC, Demmel J, Dongarra J, Duff I,
Hammarling S, Henry G, et al. (2002). “An updated set of basic linear algebra subprograms
(BLAS).” ACM Transactions on Mathematical Software, 28(2), 135–151.
Bonat WH (2018). “Multiple Response Variables Regression Models in R: The mcglm Pack-
age.” Journal of Statistical Software, 84(4), 1–30. doi:10.18637/jss.v084.i04.
Bonat WH, Jørgensen B (2016). “Multivariate covariance generalized linear models.” Journal
of the Royal Statistical Society. Series C (Applied Statistics), pp. 649–675.
Bonat WH, Jørgensen B, Kokonendji CC, Hinde J, Demétrio CG (2018). “Extended Poisson–
Tweedie: Properties and regression models for count data.” Statistical Modelling, 18(1),
24–49.
Breslow NE, Clayton DG (1993). “Approximate inference in generalized linear mixed models.”
Journal of the American statistical Association, 88(421), 9–25.
Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, Skaug
HJ, Maechler M, Bolker BM (2017). “glmmTMB Balances Speed and Flexibility Among
Packages for Zero-inflated Generalized Linear Mixed Modeling.” The R Journal, 9(2), 378–
400. URL https://journal.r-project.org/archive/2017/RJ-2017-066/index.html.
Bürkner PC (2018). “Advanced Bayesian Multilevel Modeling with the R Package brms.”
The R Journal, 10(1), 395–411. doi:10.32614/RJ-2018-017.
da Silva GP, Laureano HA, Petterle RR, Júnior PJR, Bonat WH (2022). “Multivariate
generalized linear mixed models for underdispersed count data.”
doi:10.48550/ARXIV.2205.10486. URL https://arxiv.org/abs/2205.10486.
El-Shaarawi AH, Zhu R, Joe H (2011). “Modelling species abundance using the Poisson–
Tweedie family.” Environmetrics, 22(2), 152–164.
Fisher RA (1934). “The effect of methods of ascertainment upon the estimation of frequencies.”
Annals of eugenics, 6(1), 13–25.
Gibb H, Stoklosa J, Warton DI, Brown A, Andrew NR, Cunningham S (2015). “Does morphol-
ogy predict trophic position and habitat use of ant species and assemblages?” Oecologia,
177(2), 519–531.
Hadfield JD (2010). “MCMC Methods for Multi-Response Generalized Linear Mixed Models:
The MCMCglmm R Package.” Journal of Statistical Software, 33(2), 1–22. URL
http://www.jstatsoft.org/v33/i02/.
Inouye DI, Yang E, Allen GI, Ravikumar P (2017). “A review of multivariate distributions
for count data derived from the Poisson distribution.” Wiley Interdisciplinary Reviews:
Computational Statistics, 9(3), e1398.
Jørgensen B, Kokonendji CC (2016). “Discrete dispersion models and their Tweedie asymp-
totics.” AStA Advances in Statistical Analysis, 100(1), 43–78.
Kokonendji CC, Puig P (2018). “Fisher dispersion index for multivariate count distributions:
A review and a new proposal.” Journal of Multivariate Analysis, 165, 180–193.
Kristensen K, Nielsen A, Berg CW, Skaug H, Bell BM (2016). “TMB: Automatic
Differentiation and Laplace Approximation.” Journal of Statistical Software, 70(5), 1–21.
doi:10.18637/jss.v070.i05.
Lee Y, Nelder JA (1996). “Hierarchical Generalized Linear Models.” Journal of the Royal
Statistical Society. Series B (Methodological), 58(4), 619–678. ISSN 00359246. URL
http://www.jstor.org/stable/2346105.
Liang KY, Zeger SL (1986). “Longitudinal data analysis using generalized linear models.”
Biometrika, 73(1), 13–22.
Nelder JA, Wedderburn RW (1972). “Generalized linear models.” Journal of the Royal Sta-
tistical Society: Series A (General), 135(3), 370–384.
Nikoloulopoulos AK, Karlis D (2009). “Modeling multivariate count data using copulas.”
Communications in Statistics-Simulation and Computation, 39(1), 172–187.
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2017). nlme: Linear and Nonlinear
Mixed Effects Models. R package version 3.1-131, URL
https://CRAN.R-project.org/package=nlme.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Shmueli G, Minka TP, Kadane JB, Borle S, Boatwright P (2005). “A useful distribution for
fitting discrete data: revival of the Conway–Maxwell–Poisson distribution.” Journal of the
Royal Statistical Society: Series C (Applied Statistics), 54(1), 127–142.
Winkelmann R (2008). Econometric analysis of count data. Springer Science & Business
Media.
Zeileis A, Kleiber C, Jackman S (2008). “Regression models for count data in R.” Journal of
statistical software, 27(8), 1–25.
Zeviani WM, Ribeiro Jr PJ, Bonat WH, Shimakura SE, Muniz JA (2014). “The Gamma-
count distribution in the analysis of experimental underdispersed data.” Journal of Applied
Statistics, 41(12), 2616–2626.
Affiliation:
Guilherme P. Silva
Laboratory of Statistics and Geoinformation
Paraná Federal University
CEP-81530-015 Curitiba (PR), Brazil
E-mail: guilhermeparreira.silva@gmail.com
URL: https://github.com/guilhermeparreira