Statistical Inference Based On Pooled Data: A Moment-Based Estimating Equation Approach
Howard D. Bondell
National Institute of Child Health and Human Development, Department of Health and Human Services

We propose a moment-based estimating equation approach to deal with situations where likelihood functions based on pooled data are difficult to work with. We outline the method to obtain estimates and test statistics of the parameters of interest and, in the process, construct tests for goodness of fit based on the pooled data.
1. Introduction
Consider a random sample of size np (with both n and p being integers) from Fθ, say, X1, . . . , Xnp, independent and identically distributed according to Fθ. Suppose that, in order to reduce the cost of the study, the subjects that yield the np samples are randomly grouped into n sets, each of size p. Subsequently, instead of observing each individual value, only the within-set averages, X∗1, . . . , X∗n, are observed. The X∗s are also independent and identically distributed, but each following the distribution Fθ∗ of the average of p random draws from Fθ. We are concerned with inference on θ based on the pooled data X∗1, . . . , X∗n.
The data framework described above, often called group testing because sets of samples rather than individual samples are tested, originated in testing with dichotomous outcomes (Dorfman, 1943; Sobel and Groll, 1959). It was later further developed by Gastwirth and Johnson (1994) and Litvak, Tu and Pagano (1994) in screening for HIV contamination, by Sobel and Elashoff (1975), Chen and Swallow (1990), Farrington (1992), Hughes-Oliver and Swallow (1994), and Tu, Litvak and Pagano (1995) in estimating population prevalence, and by Barcellos et al. (1997) in localizing disease genes.
Recently Weinberg and Umbach (1999) proposed a set-based logistic model to explore
the association between a disease and exposures when only the pooled exposure values
are available. Faraggi, Reiser and Schisterman (2003) and Liu and Schisterman (2003) studied ROC curve analysis based on pooled biospecimens, where the individual measurements are assumed to follow normal or gamma distributions. Other areas where pooling biospecimens has been found useful include gene microarray experiments, where mRNA samples are often pooled across subjects (Jin et al., 2001; Enard et al., 2002; Kendziorski et al., 2003).
Although the strategy of pooling specimens has been used in practice, methods for the analysis of set-based data from such experiments have not been fully developed in the literature, except for certain special cases. This is, perhaps, partly because for a general distribution, Fθ, likelihood methods based on set-based data may not be feasible, since the distribution of the averages involves the p-fold convolution of Fθ. In this paper we propose a methodology for reasonably efficient estimation and testing under a broad class of parametric models, exemplified by the family generated by the Box-Cox transformation model. The context is that Fθ possesses and is fully determined by its first several moments, which may be estimated by converting the estimates of the corresponding moments of the pooled data.
The paper is arranged as follows. In §2 we describe the method under the assumption that Fθ can be parameterized by no more than its first three moments. The method can be extended to distributions with more than three parameters, but we omit the details at the present time. In §3 we derive the large sample distribution of the estimates. We then apply the method in §4 to the family of distributions generated by the Box-Cox transformation; as an extension, we discuss several procedures to test goodness of fit based on the pooled data. The methods are exemplified in §5 using data from a study evaluating oxidative stress and antioxidants on cardiovascular disease in upstate New York. Some comments are given in §6.
2. The moment-based estimating equation approach

As noted above, except in certain special cases, the likelihood function based on the X∗s will be extremely difficult to derive.
Alternatively, we can obtain and connect the moments of X∗ with those of X, and construct inference based on moment estimates. For this purpose, we assume that Fθ has at least 2k moments, of which the first k moments will be used to estimate θ and the remainder to derive the asymptotic variance. Let µ1 = E(X), µ∗1 = E(X∗), µr = E{(X − µ1)^r}, and µ∗r = E{(X∗ − µ∗1)^r}, for r > 1; these are all functions of θ. Then it is straightforward to show that the first three central moments of X∗, in terms of those of X, are

$$\mu^*_1 = \mu_1, \qquad \mu^*_2 = p^{-1}\mu_2, \qquad \mu^*_3 = p^{-2}\mu_3. \qquad (1)$$

Equating these to the corresponding sample moments of the pooled data and solving for θ will then yield an estimator, based on which inference can be conducted. Putting the above into an estimating equation framework, the estimator θ̃ may be defined as the solution of

$$n^{-1}\sum_{i=1}^{n}\psi(X^*_i;\theta) = 0, \qquad \psi(x;\theta) = \bigl(x-\mu_1,\; p(x-\mu_1)^2-\mu_2,\; p^2(x-\mu_1)^3-\mu_3\bigr)^{T}, \qquad (3)$$

whose components are the pooled-data moment equations implied by (1).
We have restricted our attention to at most three parameters, which are sufficient for most practical needs. If more parameters are required, the approach described above can be readily extended to accommodate the additional parameters, though the higher order moments may have less simple formulas (see the next section).
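To illustrate the computations, the following minimal sketch (ours, not from the paper) carries out the moment-matching step numerically for any family whose first three central moments are computable; the functions mu1, mu2 and mu3 are hypothetical placeholders for the model-specific moment formulas.

```python
import numpy as np
from scipy.optimize import fsolve

def pooled_sample_moments(xstar):
    """Sample mean and second/third central moments of the pooled data X*."""
    xstar = np.asarray(xstar, dtype=float)
    m1 = xstar.mean()
    d = xstar - m1
    return m1, (d**2).mean(), (d**3).mean()

def moment_estimate(xstar, p, mu1, mu2, mu3, theta0):
    """Solve the moment equations implied by (1):
    mu1(theta) = m1*,  mu2(theta) = p m2*,  mu3(theta) = p^2 m3*."""
    m1, m2, m3 = pooled_sample_moments(xstar)
    target = np.array([m1, p * m2, p**2 * m3])
    def equations(theta):
        return np.array([mu1(theta), mu2(theta), mu3(theta)]) - target
    return fsolve(equations, theta0)
```

For a two-parameter family, only the first two equations would be used, as in the lognormal example of §5.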
3. Asymptotic theory

Clearly, there is no simple exact distribution theory for the estimator θ̃, since it will depend on the distribution Fθ∗ of X∗, which, as mentioned earlier, may not be feasible to work with unless the original distribution is of a particularly convenient form, such as the normal or gamma. Here we derive asymptotic theory for θ̃, on which statistical inference can be based.
In order to obtain the asymptotic variance of the estimator, we also need the next three central moments of X∗ in terms of those of X:

$$\mu^*_4 = p^{-3}\{\mu_4 + 3(p-1)\mu_2^2\}, \qquad \mu^*_5 = p^{-4}\{\mu_5 + 10(p-1)\mu_3\mu_2\},$$
$$\mu^*_6 = p^{-5}\{\mu_6 + 15(p-1)\mu_4\mu_2 + 10(p-1)\mu_3^2 + 15(p-1)(p-2)\mu_2^3\}. \qquad (4)$$

Under standard regularity conditions for estimating equations, θ̃ is consistent and

$$n^{1/2}(\tilde{\theta} - \theta) \to N(0, \Sigma), \qquad \Sigma = A B A^{T}, \qquad (5)$$

in distribution, where A = {−E(∂ψ/∂θᵀ)}⁻¹ and B = Var{ψ(X∗; θ)}.
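The relations (1) and (4) are easy to check by simulation; the following sketch (ours, using an arbitrary gamma distribution as the individual-level model) compares simulated central moments of pooled averages with the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 200_000
x = rng.gamma(shape=2.0, scale=1.5, size=(n, p))   # individual-level data
xbar = x.mean(axis=1)                              # pooled averages

def cm(v, r):
    """r-th sample central moment."""
    return np.mean((v - v.mean())**r)

mu = {r: cm(x.ravel(), r) for r in range(2, 7)}
print(cm(xbar, 2), mu[2] / p)                                # (1)
print(cm(xbar, 3), mu[3] / p**2)                             # (1)
print(cm(xbar, 4), (mu[4] + 3*(p - 1)*mu[2]**2) / p**3)      # (4)
print(cm(xbar, 5), (mu[5] + 10*(p - 1)*mu[3]*mu[2]) / p**4)  # (4)
print(cm(xbar, 6), (mu[6] + 15*(p - 1)*mu[4]*mu[2]
                    + 10*(p - 1)*mu[3]**2
                    + 15*(p - 1)*(p - 2)*mu[2]**3) / p**5)   # (4)
```

Each printed pair should approximately agree, up to Monte Carlo error.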
For the particular form of the estimating equations (3), we can explicitly express the matrices A and B in terms of the central moments of the original distribution, by using the moment relationships given in (1) and (4). Putting θ = (θ1, θ2, θ3)ᵀ, we then have

$$
A = \begin{pmatrix}
\dfrac{\partial\mu_1}{\partial\theta_1} & \dfrac{\partial\mu_1}{\partial\theta_2} & \dfrac{\partial\mu_1}{\partial\theta_3} \\[4pt]
\dfrac{\partial\mu_2}{\partial\theta_1} & \dfrac{\partial\mu_2}{\partial\theta_2} & \dfrac{\partial\mu_2}{\partial\theta_3} \\[4pt]
3p\mu_2\dfrac{\partial\mu_1}{\partial\theta_1} + \dfrac{\partial\mu_3}{\partial\theta_1} &
3p\mu_2\dfrac{\partial\mu_1}{\partial\theta_2} + \dfrac{\partial\mu_3}{\partial\theta_2} &
3p\mu_2\dfrac{\partial\mu_1}{\partial\theta_3} + \dfrac{\partial\mu_3}{\partial\theta_3}
\end{pmatrix}^{-1} \qquad (6)
$$

and

$$
B = \begin{pmatrix}
\mu_2/p & \mu_3/p & p^2\mu^*_4 \\
\mu_3/p & p^2\mu^*_4 - \mu_2^2 & p^3\mu^*_5 - \mu_3\mu_2 \\
p^2\mu^*_4 & p^3\mu^*_5 - \mu_3\mu_2 & p^4\mu^*_6 - \mu_3^2
\end{pmatrix}. \qquad (7)
$$
We will need estimates of Σ to construct confidence intervals and test statistics for functions of θ. One approach is to substitute the estimate θ̃ for θ in the expressions (6) and (7), respectively, to obtain estimates à and B̃ of A and B. An alternative is a two-stage strategy: first replace θ by θ̃ in (5), and then estimate the resulting functions of the moments by their sample analogues. Either way, we can then make approximate inference on any function of θ, using the asymptotic normality of the statistics. For example, a 100(1 − α) per cent confidence interval for a linear function lᵀθ is lᵀθ̃ ± z_{1−α/2}(lᵀÃB̃Ãᵀl/n)^{1/2}, with z_{1−α/2} being the 100(1 − α/2) percentile of the standard normal distribution.
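In code, the interval takes one line once à and B̃ are available; this is a generic sketch with our own naming, not tied to a particular model.

```python
import numpy as np
from scipy.stats import norm

def linear_ci(theta_tilde, A_tilde, B_tilde, n, l, alpha=0.05):
    """100(1 - alpha)% CI for l'theta based on the sandwich variance A B A' / n."""
    sigma = A_tilde @ B_tilde @ A_tilde.T
    se = np.sqrt(l @ sigma @ l / n)
    z = norm.ppf(1 - alpha / 2)
    estimate = l @ theta_tilde
    return estimate - z * se, estimate + z * se
```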
4. The Box-Cox transformation family

4.1 Preliminaries

In their seminal paper, Box and Cox (1964) developed a widely used transformation family for the linear regression model.
The transformation is given by

$$
Y = \begin{cases} (X^\lambda - 1)/\lambda, & \lambda \neq 0, \\ \log X, & \lambda = 0, \end{cases} \qquad (9)
$$

where X must be strictly positive (in order that all real λ yield real values).
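In code, (9) is simply the following (a trivial sketch; note that for un-pooled data, scipy.stats.boxcox applies the same transformation with a likelihood-based choice of λ):

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation (9); x must be strictly positive."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam
```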
The Box-Cox power transformation assumes that there is some member of the power
family of transformations such that when applied to the data, the transformed data
are normally distributed. Hence the original data can take on a wide range of possible
distributions, and in most practical situations, there exists some member of this family
of distributions that is a reasonable model for the data generating mechanism. Three
important special cases of this family are the normal (λ = 1), lognormal (λ = 0), and
(non-central) χ² (λ = 1/2).
It is now assumed that Y follows a normal distribution with mean µ and variance σ². We would like to conduct inference on the parameter vector θ = (µ, σ², λ)ᵀ based on the pooled data X∗1, . . . , X∗n.
4.2 Inference
The standard approach to inference in the Box-Cox model is via maximum likelihood, though other approaches have also been proposed (see Sakia, 1992, for a review). However, when only pooled data are observed, these methods are extremely difficult or impossible to carry out, except in special cases. We shall instead estimate the three parameters based on the pooled data via the moment-based estimating equation method described in §2 and §3.
To proceed, we obtain expressions for the central moments of the distribution of X, the transformed normal. Once the expressions are obtained, it is then computationally straightforward to derive the estimator and its estimated large sample covariance matrix.
It should be pointed out that the claim that Y has a normal distribution is not exactly true for λ ≠ 0, since Y is bounded from below (above) by −1/λ for λ > 0 (λ < 0). In general practice this effect is assumed to be negligible (and usually is), but it must be accounted for in deriving the moment formulas below.
We first deal with λ ≥ 0, in which case all moments of X exist. For λ > 0, define U = X^λ = λY + 1. The density of U is then given by (10), and the moments of X by (11) and (12), where δ = (λµ + 1)/(|λ|σ), and φ and Φ are the standard normal density and distribution functions. Note that the absolute value sign is used in order to unify the density formulas for λ > 0 and λ < 0. For λ = 0, X follows a lognormal distribution, for which known formulas are available (see, for example, Aitchison and Brown, 1957), and thus explicit expressions for the estimates of µ and σ² based on the pooled data can be obtained.
For the case λ < 0, we notice from (11)-(12) that if X is bounded from above then all moments exist; otherwise, only the moments of order r < −λ exist. To ensure feasible execution of the proposed moment-based procedure, we assume that X ≤ x0 for some x0 > 0. If we define U = X^λ − x0^λ = λY + 1 − x0^λ, then the density of U and the moments of X are still given respectively by (10)-(12), but with δ = (λµ + 1 − x0^λ)/(|λ|σ) and the support of U adjusted accordingly.
Using these moment formulas and the estimating equations (3), we can obtain estimates of θ = (µ, σ², λ)ᵀ. The derivatives required for the asymptotic distribution given by (5)-(7) may be computed by differentiating under the integral sign, or may be computed numerically at the estimated parameters. We may then plug in the estimates to obtain the estimated asymptotic covariance matrix. Note that if we assume λ to be known, then only the first two moments, µ1 and µ2, are needed to obtain estimates of µ and σ², with the next two higher moments, µ3 and µ4, needed to derive the asymptotic variance. See §5 for an explicit example in the lognormal case (λ = 0).
Often we are interested only in inference on µ and σ², with the transformation parameter λ playing the role of a nuisance parameter. Write µ(λ) and σ²(λ) to denote the dependency on the transformation scale. For a fixed λ, we obtain (µ̃(λ), σ̃²(λ)) by using only the first two moment relationships; λ̃ is then found via a grid search using the third moment. Our limited simulation results show that the third moment equation is monotone in λ, when considered as a function of (µ̃(λ), σ̃²(λ), λ), and hence the grid search should yield a unique estimate λ̃ of λ.
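A sketch of this profile-type fit is given below (our code, under two stated assumptions: the moments of the transformed normal are computed by Gauss-Hermite quadrature rather than the closed forms (11)-(12), and the truncation effect noted in §4.2 is ignored).

```python
import numpy as np
from scipy.optimize import fsolve

# Probabilists' Gauss-Hermite rule for E{g(Y)}, Y ~ N(mu, sig2).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
WEIGHTS = WEIGHTS / WEIGHTS.sum()

def x_moments(mu, sig2, lam):
    """First three central moments of X = (lam*Y + 1)^(1/lam), Y ~ N(mu, sig2);
    the truncation of Y at -1/lam is ignored, as its effect is usually small."""
    y = mu + np.sqrt(sig2) * NODES
    x = np.exp(y) if lam == 0 else np.maximum(lam * y + 1.0, 1e-10)**(1.0 / lam)
    m1 = np.sum(WEIGHTS * x)
    return m1, np.sum(WEIGHTS * (x - m1)**2), np.sum(WEIGHTS * (x - m1)**3)

def third_moment_residual(xstar, p, lam):
    """Fit (mu(lam), sig2(lam)) from the first two moment equations, then
    return the residual of the third, used in the grid search for lambda."""
    m1 = np.mean(xstar)
    d = np.asarray(xstar, dtype=float) - m1
    t2, t3 = p * np.mean(d**2), p**2 * np.mean(d**3)
    def eqs(par):  # parameterize sig2 = exp(par[1]) to keep it positive
        a, b, _ = x_moments(par[0], np.exp(par[1]), lam)
        return [a - m1, b - t2]
    par = fsolve(eqs, [np.log(max(m1, 1e-6)), np.log(0.1)])
    return x_moments(par[0], np.exp(par[1]), lam)[2] - t3

def grid_search_lambda(xstar, p, grid=np.linspace(0.0, 2.0, 41)):
    res = [abs(third_moment_residual(xstar, p, lam)) for lam in grid]
    return grid[int(np.argmin(res))]
```

In practice one would refine the grid around the minimizer; given the monotonicity noted above, a root-finding bracket on the signed residual also works.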
Once λ̃ is determined from the data, we then proceed, just as in the standard data transformation situation, as if λ were known (to be λ̃), to estimate µ and σ² and the asymptotic variance, with the last row and column of A⁻¹ and B in (6) and (7) removed.
Whether λ̃ may legitimately be treated as fixed in subsequent inference has generated considerable debate in the literature (Bickel and Doksum, 1981; Box and Cox, 1982; Doksum and Wong, 1983; Hinkley and Runger, 1984, among others). Since λ̃ is a consistent estimate of λ under the Box-Cox model, the asymptotic equivalency of the “conditional” and “unconditional” analyses established by Doksum and Wong (1983) would hold for the t-statistic based on the moment estimates as well. Since formulas for the asymptotic variances are available for both the λ-known and λ-unknown cases, a comparison in situations other than those treated by Doksum and Wong (1983) deserves further investigation.
We now turn to testing goodness of fit based on the pooled data X∗. A classical approach is to embed a hypothesized distribution into a larger family of distributions indexed by one or more additional parameters and then test hypotheses regarding these parameters. Since the Box-Cox family is a diverse family that includes many important special cases, a natural extension of the estimation problem is to test the goodness of fit of a hypothesized distribution based on the pooled data.
Using the estimate of λ derived from the estimating equations (3), with θ = (µ, σ², λ)ᵀ and the moments given in (11) and (12), we may test the fit of the underlying data to a desired distribution of X, based solely on the pooled data. For example, to test the hypothesis that the underlying data are lognormal (λ = 0), we may use the standard normal test statistic Z = λ̃/s, where s is the estimated standard error of λ̃.
The Box-Cox transformation family has been used in the receiver operating characteristic (ROC) curve analysis to evaluate the accuracy of a medical diagnostic test or biomarker that yields continuous outcomes (Faraggi and Reiser, 2002; Zou and Hall, 2000, 2002). A key assumption warranting the use of the Box-Cox transformation theory in such analyses is that the transformation parameter λ be the same for both diseased and non-diseased outcomes. Below we propose a method of testing this key assumption using only pooled data. Suppose we observe pooled averages X∗j (j = 1, . . . , n) from one group and Y∗k (k = 1, . . . , m) from the other; the individual Xs and Y s are not observed. We assume that, for certain transformation parameters λX and λY, the Box-Cox model holds within each group, and we wish to test

H0 : λX − λY = 0 vs. H1 : λX − λY ≠ 0.

We may test this hypothesis in the following manner. Let λ̃X and λ̃Y be the moment-based estimates of λX and λY, and s²X and s²Y be their estimated variances derived from the methods described in §3. We then reject H0 at significance level α if |λ̃X − λ̃Y| > z_{1−α/2}s, where s = (s²X + s²Y)^{1/2}.
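The test is a routine two-sample Z comparison; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def test_common_lambda(lam_x, var_x, lam_y, var_y, alpha=0.05):
    """Reject H0: lambda_X = lambda_Y when |lam_x - lam_y| > z_{1-alpha/2} * s."""
    s = np.sqrt(var_x + var_y)
    z = (lam_x - lam_y) / s
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
    return abs(z) > norm.ppf(1.0 - alpha / 2.0), p_value
```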
If we do not reject the null hypothesis, we may then feel comfortable in combining the two estimates to obtain a weighted estimate of the common value of λ, use the common estimate as the true λ to estimate µ and σ² for each group, and then proceed under the assumption that the true distribution is actually a member of the Box-Cox family. We may also adapt other readily available techniques to test the distributional assumption.
We will assume that the two distributions, Fθ of X and Fθ∗ of X∗, are uniquely determined by each other, which then implies that testing the hypothesis that the unobserved individual data follow Fθ is equivalent to testing that the observed pooled data X∗ follow the distribution Fθ∗. While there are certain exceptions to this uniqueness characterization, we suspect that it holds for most distributions arising in practice; see §6 for further discussion.
One simple method is to draw a Q-Q plot of the pooled data versus a hypothesized pooled distribution Fθ̃∗, computed under the hypothesis that Fθ generates the unobserved data from which the pooled data are observed. Using the moment-based technique we have developed in the previous sections, we obtain an estimate θ̃ of θ. If the quantiles of Fθ̃∗ are difficult to compute, which is the case in general, we may generate a large number of observations from Fθ̃, and group them into sets of size p, to yield an empirical version of the distribution Fθ̃∗. We then plot the quantiles of this large sample empirical distribution versus the quantiles of the empirical distribution of the observed data, and judge the fit by the linearity of the plot.
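A sketch of the simulated Q-Q plot follows (our code; `sampler` is a hypothetical function drawing from the fitted individual-level distribution Fθ̃).

```python
import numpy as np
import matplotlib.pyplot as plt

def pooled_qq_plot(xstar, p, sampler, n_sim=100_000, seed=0):
    """Plot observed pooled data against quantiles of a simulated version
    of the pooled distribution under the fitted parameter."""
    rng = np.random.default_rng(seed)
    sim = sampler(n_sim * p, rng).reshape(n_sim, p).mean(axis=1)
    probs = (np.arange(1, len(xstar) + 1) - 0.5) / len(xstar)
    plt.plot(np.quantile(sim, probs), np.sort(xstar), "o")
    plt.xlabel("simulated pooled quantiles")
    plt.ylabel("observed pooled data")
    plt.show()

# e.g., for a lognormal fit like that of Section 5 with p = 2:
# pooled_qq_plot(xstar, 2, lambda k, rng: rng.lognormal(2.64, 0.176**0.5, k))
```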
A more formal assessment can be based on a Kolmogorov-Smirnov type test. This test is based on the statistic

$$D = \sup_x \bigl| F^*_n(x) - F^*_{\tilde\theta}(x) \bigr|,$$

the largest difference in cumulative distribution functions between the empirical distribution F∗n of the pooled data and the fitted theoretical distribution F∗θ̃. The distribution of D under the null hypothesis that the data follow the hypothesized distribution is complicated by the fact that we are using an estimate of θ, and not the true parameter. Hence, the critical regions of the standard Kolmogorov-Smirnov test cannot be used.
Based on the results of Romano (1988), the following bootstrap method will determine critical values that yield tests with correct asymptotic significance levels.

Step 1. Based on the estimate θ̃, generate a random sample of size np from Fθ̃, and then group into sets of size p to obtain a bootstrap pooled sample.

Step 2. Re-estimate θ from this bootstrap pooled sample, yielding θ̃∗.

Step 3. Generate the empirical distribution F∗θ̃∗ based on a large sample drawn from Fθ̃∗ and grouped into sets of size p.

Step 4. Calculate D∗ = supx |F∗n(x) − F∗θ̃∗(x)| for this sample, where F∗n is the empirical distribution of the bootstrap pooled sample.

Step 5. Repeat a large number of times, and use the frequency distribution of D∗ to determine the critical value.
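A sketch of Steps 1-5 is given below (our code; `sampler(theta, size, rng)` draws individual-level data from Fθ and `fit(xstar, p)` re-estimates θ from a pooled sample, both hypothetical stand-ins for the model at hand).

```python
import numpy as np

def ks_statistic(sample, reference):
    """sup |F_n - F_ref|, with F_ref the ecdf of a large reference sample;
    evaluated at the sample points, which suffices for a sketch."""
    s = np.sort(sample)
    f_n = np.arange(1, len(s) + 1) / len(s)
    f_ref = np.searchsorted(np.sort(reference), s, side="right") / len(reference)
    return np.max(np.abs(f_n - f_ref))

def bootstrap_critical_value(theta_tilde, sampler, fit, n, p,
                             B=500, n_big=100_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d_star = np.empty(B)
    for b in range(B):
        # Step 1: bootstrap pooled sample of size n from the fitted model
        xs = sampler(theta_tilde, n * p, rng).reshape(n, p).mean(axis=1)
        # Step 2: re-estimate theta from the bootstrap pooled sample
        theta_b = fit(xs, p)
        # Step 3: large grouped sample approximating the pooled cdf at theta_b
        ref = sampler(theta_b, n_big * p, rng).reshape(n_big, p).mean(axis=1)
        # Step 4: KS distance for this bootstrap replicate
        d_star[b] = ks_statistic(xs, ref)
    # Step 5: critical value from the bootstrap distribution of D*
    return np.quantile(d_star, 1.0 - alpha)
```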
5. An example
A randomly selected sample of residents of Erie and Niagara counties, 35 to 79 years of age, was the focus of this investigation. The New York State Department of Motor Vehicles drivers' license rolls were utilized as the sampling frame for adults between the ages of 35 and 65, whereas the elderly sample (age 65 to 79) was randomly selected from the Health Care Financing Administration database.
A total of 72 men and women were selected for the analyses. Personal history of myocardial infarction and coronary disease was ascertained: participants were asked if they had been diagnosed with angina pectoris confirmed by a physician, records were reviewed for outcome verification, and such participants were defined as having coronary heart disease. Participants provided a 12-hour fasting blood specimen for biochemical analysis. A number of parameters were examined in fresh blood samples, including routine Vitamin E levels, which we analyze here using the Box-Cox family. Figures 1(a) and 1(b) show the normal Q-Q plots for the original and log-transformed data, respectively.
A lognormal distribution appears to be a reasonable fit to the data. Based on the full (un-pooled) data, the standard Kolmogorov-Smirnov test rejects the normal assumption but not the lognormal one. A 95% confidence interval for λ based on the maximum likelihood estimate, which is obtainable when full data are available, is found to be (−0.6924, 0.3334), covering λ = 0 (lognormal) but excluding λ = 1 (normal).
We now randomly group the subjects into groups of 2 and take the average as the observed pooled data. The resulting interval estimate for λ (upper endpoint 1.2003) again covers zero, indicating lognormality. Moreover, the simulated Q-Q plot of the pooled data (Figure 2) also supports that the data are lognormally distributed. We will therefore treat the underlying distribution as lognormal (λ = 0) in what follows.
For the lognormal distribution, the four required central moments are given by

$$\mu_1 = e^{\mu+\sigma^2/2}, \quad \mu_2 = \mu_1^2\,\omega^2, \quad \mu_3 = \mu_1^3\,\omega^4(\omega^2+3), \quad \mu_4 = \mu_1^4\,\omega^4(\omega^8+6\omega^6+15\omega^4+16\omega^2+3),$$

where ω² = exp(σ²) − 1.
We find that

$$\mu = \log\bigl\{\mu_1^2\,(\mu_1^2 + \mu_2)^{-1/2}\bigr\}, \qquad \sigma^2 = \log\bigl(1 + \mu_1^{-2}\mu_2\bigr).$$

Substituting the pooled-data sample moments µ̃1 = n⁻¹ Σ X∗i and µ̃2 = (p/n) Σ (X∗i − µ̃1)², suggested by (1), into these expressions yields the estimates µ̃ and σ̃², where the X∗s are the pooled observations. Using the explicit formulas for the moments and the asymptotic variance, some straightforward but tedious algebra yields
$$n\,\mathrm{Var}(\tilde\mu) \doteq (4p\gamma^2)^{-1}\{\gamma^6 - 8\gamma^4 + 16\gamma^3 + (2p-11)\gamma^2 - 4(p-1)\gamma + 2(p-1)\},$$
$$n\,\mathrm{Var}(\tilde\sigma^2) \doteq (p\gamma^2)^{-1}\{\gamma^6 - 4\gamma^4 + 4\gamma^3 + (2p-3)\gamma^2 - 4(p-1)\gamma + 2(p-1)\},$$
$$n\,\mathrm{Cov}(\tilde\mu, \tilde\sigma^2) \doteq -(2p\gamma^2)^{-1}\{\gamma^6 - 6\gamma^4 + 8\gamma^3 + (2p-5)\gamma^2 - 4(p-1)\gamma + 2(p-1)\},$$

where γ = 1 + ω² = exp(σ²).
Notice that the above formulas depend only on σ², the variance of the underlying normal distribution, along with the pooling size p. We may plug in γ̃ = exp(σ̃²) to yield consistent estimates of the variances and covariance, and thus to construct test statistics and confidence intervals.
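Putting the pieces together for the lognormal case, the following sketch (our code) computes µ̃, σ̃² and their standard errors from pooled data using the formulas above.

```python
import numpy as np

def lognormal_pooled_fit(xstar, p):
    """Moment-based estimates of (mu, sigma^2) with standard errors, from
    pooled lognormal data, using mu_2 = p * Var(X*) and gamma = exp(sigma^2)."""
    xstar = np.asarray(xstar, dtype=float)
    n = len(xstar)
    m1 = xstar.mean()
    m2 = p * np.mean((xstar - m1)**2)        # estimate of mu_2 via (1)
    mu = np.log(m1**2 / np.sqrt(m1**2 + m2))
    sig2 = np.log(1.0 + m2 / m1**2)
    g = np.exp(sig2)                          # gamma = 1 + omega^2
    tail = -4.0*(p - 1)*g + 2.0*(p - 1)
    var_mu = (g**6 - 8*g**4 + 16*g**3 + (2*p - 11)*g**2 + tail) / (4*p*g**2*n)
    var_sig2 = (g**6 - 4*g**4 + 4*g**3 + (2*p - 3)*g**2 + tail) / (p*g**2*n)
    return mu, sig2, np.sqrt(var_mu), np.sqrt(var_sig2)
```

Applied with p = 2 to 36 pooled observations formed as in this section, it produces estimates of the kind shown in Table 1 (up to the random grouping).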
Applying these formulas, we obtain the estimates of (µ, σ²) and their estimated standard errors, for both the pooled and un-pooled data. The results are presented in Table 1. For comparison, estimates based on pooled data with group sizes of 3 and 4 are also included.

Table 1. Moment-based estimates (± standard error) of µ and σ² for pool size p.

  p    n     µ̃                   σ̃²
  1    72    2.6421 (±0.0498)    0.1733 (±0.0373)
  2    36    2.6408 (±0.0519)    0.1757 (±0.0465)
  3    24    2.6396 (±0.0540)    0.1781 (±0.0545)
  4    18    2.6405 (±0.0554)    0.1764 (±0.0603)
For further comparison, we also computed the maximum likelihood estimates µ̂ of µ and σ̂² of σ², using the un-pooled data. It turned out that µ̂ = 2.6466 and σ̂² = 0.1609, with standard errors 0.0473 and 0.0270, respectively.
We observe that, due to the small value of σ², which is common in lognormal data, there is not a great deal of efficiency loss in the moment-based estimates as compared with the maximum likelihood estimates; there is also only a small loss of efficiency as we pool the data. This small efficiency loss is in agreement with previous studies on the merits of pooling data, under normal and gamma assumptions, to reduce costs associated with bioassays; see Faraggi et al. (2003), Liu and Schisterman (2003) and Weinberg and Umbach (1999).
6. Discussion
In the goodness-of-fit testing problem, we are testing a hypothesis regarding the underlying distribution of the unpooled data. It is implicitly assumed that the distribution of the pooled averages uniquely determines the distribution of the individual observations. This is true under general regularity conditions. For example, when the characteristic function of Fθ has no zeros, the pooled distribution and the original distribution are in one-to-one correspondence. For other characterization conditions see Prokhorov and Ushakov (2002) and the references therein. Regardless of the uniqueness of the characterization, the
type I error of the test is unaffected. However, the test will be unable to detect the difference between any two original distributions that may yield the identical convolution; in such a case, what matters is how different the generating distributions are with respect to an underlying distance between the induced pooled distributions.

One of the problems inherent in this type of set-based data is created by
the central limit theorem. The pooled data tend to a normal distribution as the pooling
size increases, and even for small to moderate pooling sizes, much of the skewness (and
higher moments) of the original distribution is lost in the set-based distribution. While
the loss in variability is linear in the pooling size, this loss of skewness is quadratic,
as can be seen from the moments (1). This hampers the ability to detect differences in distributional shape based on the pooled data.
To our knowledge, the current paper is the first to present a general methodology for dealing with set-based data under a broad class of parametric distributional assumptions. Given the increasing use of pooled specimens, for example in the area of evaluation of disease biomarkers, more research on methods to deal with this form of data needs to be done. For example, under a parametric assumption, one could numerically approximate the likelihood function for the set-based data and proceed via likelihood methods; the accuracy of such approximations would require study. Non-parametric methods for set-based data may also be appropriate. A possible alternative to the parametric models proposed in this paper would be an approach based on nonparametric deconvolution of the pooled data, although this would be challenging.
Acknowledgements
The authors thank W. Jack Hall and Kai F. Yu for helpful discussion and suggestions.
References
Barcellos, L. F., Klitz, W., Field, L. L., Tobias, R., Bowcock, A. M., Wilson, R., Nelson, M. P., Nagatomi, J. and Thomson, G. (1997). Association mapping of disease loci, by use of a pooled DNA genomic screen. American Journal of Human Genetics 61, 734-47.

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, 211-52.
Doksum, K. A. and Wong, C-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association 78, 411-17.

Dorfman, R. (1943). The detection of defective members of large populations. Annals of Mathematical Statistics 14, 436-40.
Enard, W., Khaitovich, P., Klose, J., Zollner, S., Heissig, F., Giavalisco, P., Nieselt-
Struwe, K., Muchmore, E., Varki, A., Ravid, R., Doxiadis, G. M., Bontrop, R. E.
and Paabo, S. (2002). Intra- and interspecific variation in primate gene expression patterns. Science 296, 340-43.
Faraggi, D. and Reiser, B. (2002). Estimation of the area under the ROC curve. Statistics in Medicine 21, 3093-3106.
Faraggi, D., Reiser, B. and Schisterman, E. F. (2003). ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine 22, 2515-27.

Gastwirth, J. L. and Johnson, W. O. (1994). Screening with cost-effective quality control: Potential applications to HIV and drug testing. Journal of the American Statistical Association 89, 972-81.
Jin, W., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gurgel, G. and Gibson, G. (2001). The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nature Genetics 29, 389-95.
Kendziorski, C. M., Zhang, Y., Lan, H. and Attie, A. D. (2003). The efficiency of pooling mRNA in microarray experiments. Biostatistics 4, 465-77.
Litvak, E., Tu, X. M. and Pagano, M. (1994). Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association 89, 424-34.
Sobel, M. and Groll, P. (1959). Group testing to eliminate efficiently all defectives in a binomial sample. Bell System Technical Journal 38, 1179-1252.
Sobel, M. and Elashoff, R. (1975). Group testing with a new goal: Estimation. Biometrika 62, 181-93.
Tu, X. M., Litvak, E. and Pagano, M. (1995). On the informativeness and accuracy of pooled testing in estimating prevalence of a rare disease: Application to HIV screening. Biometrika 82, 287-97.
Zou, K. H. and Hall, W. J. (2000). Two transformation models for estimating an ROC
curve derived from continuous data. Journal of Applied Statistics 27, 621-31.
Zou, K. H. and Hall, W. J. (2002). Semiparametric and parametric transformation models for comparing diagnostic markers with paired design. Journal of Applied Statistics 29, 803-16.
Figure 1. Normal Q-Q plots of (a) the original Vitamin E measurements and (b) the log-transformed measurements.
Figure 2. Q-Q plot of the observed pooled Vitamin E data versus quantiles simulated from the fitted lognormal pooled distribution.