0% found this document useful (0 votes)
44 views

Statistical Inference Based On Pooled Data: A Moment-Based Estimating Equation Approach

This document proposes a moment-based estimating equation approach for statistical inference on distribution parameters when only pooled data are observed rather than individual data points. The approach uses the relationships between the moments of the pooled data distribution and the moments of the original individual data distribution to construct estimating equations. Estimates of the distribution parameters and their asymptotic distributions can then be obtained. The method is generally applicable when likelihood-based inference is difficult with pooled data and is demonstrated for the Box-Cox transformation model family of distributions.

Uploaded by

dinadinic
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views

Statistical Inference Based On Pooled Data: A Moment-Based Estimating Equation Approach

This document proposes a moment-based estimating equation approach for statistical inference on distribution parameters when only pooled data are observed rather than individual data points. The approach uses the relationships between the moments of the pooled data distribution and the moments of the original individual data distribution to construct estimating equations. Estimates of the distribution parameters and their asymptotic distributions can then be obtained. The method is generally applicable when likelihood-based inference is difficult with pooled data and is demonstrated for the Box-Cox transformation model family of distributions.

Uploaded by

dinadinic
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Statistical Inference Based on Pooled Data: A

Moment-Based Estimating Equation Approach

Howard D. Bondell

Department of Statistics, Rutgers University, Piscataway, NJ 08854, U.S.A.

Aiyi Liu∗ and Enrique F. Schisterman

Division of Epidemiology, Statistics and Prevention Research, National Institute of

Child Health and Human Development, Department of Health and Human Services,

6100 Executive Blvd., Bethesda, MD 20852, U.S.A.



email: liua@mail.nih.gov

Summary. We consider statistical inference on parameters of a distribution when only

pooled data are observed. A moment-based estimating equation approach is proposed

to deal with situations where likelihood functions based on pooled data are difficult

to work with. We outline the method to obtain estimates and test statistics of the

parameters of interest in the general setting. We demonstrate the approach on the

family of distributions generated by the Box-Cox transformation model, and in the

process, construct tests for goodness of fit based on the pooled data.

Key words: Pooling biospecimens; Set-based observations; Moments; Box-Cox trans-

formation; Goodness of fit; Lognormal distribution.

1. Introduction

The classical approach to conducting statistical inference on an unknown vector of

parameters θ characterizing a distribution Fθ of a random variable X is based on a ran-

dom sample of size np (with both n and p being integers) from Fθ , say, X1 , . . . , Xnp ,

1
independent and identically distributed according to Fθ . Suppose that, in order to

reduce the cost of the study, the subjects that yield the np samples are randomly

grouped into n sets, each of size p. Subsequently, instead of observing each indi-

vidual X, the average of the Xs in each set is observed, yielding n observations,


Pjp
Xj∗ = i=p(j−1)+1 Xi /p, j = 1, . . . , n, which are often called set-based observations.

The X ∗ s are also independent and identically distributed, but each following the dis-

tribution Fθ∗ of the average of p random draws from Fθ . We are concerned with inference

on θ based on the pooled data X1∗ , . . . , Xn∗ .

The data framework described above, often called group testing as a result of testing

for a microorganism in pooled biospecimens, appeared initially in the context of screen-

ing with dichotomous outcomes (Dorfman, 1943; Sobel and Groll 1959), and was later

further developed by Gastwirth and Johnson (1994) and Litvak, Tu and Pagano (1994)

in HIV contamination, Sobel and Elashoff (1975), Chen and Swallow (1990), Farrington

(1992), Hughes-Oliver and Swallow (1994), and Tu, Litvak and Pagano (1995) in esti-

mating population prevalence, and Barcellos et al. (1997) in localizing disease genes.

Recently Weinberg and Umbach (1999) proposed a set-based logistic model to explore

the association between a disease and exposures when only the pooled exposure values

are available. Farragi, Reiser and Schisterman (2003) and Liu and Schisterman (2003)

considered evaluation of diagnostic biomarkers based on pooled specimens whose mea-

surements are assumed to follow normal or gamma distributions. Other areas where

pooling biospecimens has been found useful include gene microarray experiments where

mRNA samples are often pooled across subjects (Jin et al., 2001; Enard et al., 2002;

Kendziorski et al., 2003).

Although the strategy of pooling specimens has been used in practice, methods

for analysis of set-based data from such experiments have not been fully and well

2
developed in the literature, except for certain special cases. This is, perhaps, partly

because for a general distribution, Fθ , the likelihood methods based on set-based data

may not be feasible since the distribution of the averages involves convolution of p

random variables of Fθ . The purpose of the present paper is to develop a general

methodology for reasonably efficient estimation and testing under a broad class of

distributional assumptions, including the family of distributions generated by the Box-

Cox transformation model. The context is that Fθ possesses and is fully determined

by its first several moments, which may be estimated by converting the estimates of

the moments of the set-based random variable X ∗ .

The paper is arranged as follows. In §2 we shall describe the method under the

assumption that Fθ can be parameterized by no more than its first three moments.

We obtain estimating equations to yield estimates of θ. The method can be readily

extended to distributions with more than three parameters, but we omit the details

at the present time. In §3 we derive the large sample distribution of the estimates.

We then apply the method in §4 to the family of distributions generated by the Box-

Cox transformation model, which is an extremely versatile class of distributions that

includes the normal, lognormal, and (non-central) χ2 distributions as special cases. As

an extension, we discuss several procedures to test goodness of fit based on the pooled

data. The methods are exemplified in §5 using data from a study evaluating oxidative

stress and antioxidants on cardiovascular disease in upstate New York. Some comments

and discussions appear in §6.

2. Estimating Equations Based on the Moments: A General Method

Our aim is to conduct inference on a k-parameter vector, θ, that characterizes the

distribution Fθ of interest, based on the set-based observations X ∗ s. Except for certain

special cases, the likelihood function based on X ∗ s will be extremely difficult to derive.

3
Alternatively we can obtain and connect the moments of X ∗ with that of X, and

construct inference based on moment estimates. For this purpose, we assume that Fθ

has at least 2k moments of which the first k moments will be used to estimate θ and

the rest to estimate the variance.

We illustrate the method by assuming that θ is at most three-dimensional. Define

µ1 = E(X), µ∗1 = E(X ∗ ), µr = E{(X −µ1 )r }, and µ∗r = E{(X ∗ −µ∗1 )r }, for r > 1; these

are all functions of θ. Then it is straightforward to show that the first three central

moments of X ∗ , being the mean of p independent, identically distributed variables of

X, are

µ∗1 = µ1 , µ∗2 = µ2 /p, µ∗3 = µ3 /p2 . (1)

Replacing the left-hand sides of the equations by their corresponding set-based

sample moments and solving for θ will then yield an estimator of θ based on which

inference can be conducted. Putting the above into an estimating equation framework

allows us to conveniently utilize the general asymptotic theory on estimating equations

(See Serfling, 1980, and §3 below). Define


 
 w − µ1 (θ) 
 
 
Ψ(w; θ) =  2
 p{w − µ1 (θ)} − µ2 (θ)
,
 (2)
 
 
p2 {w − µ1 (θ)}3 − µ3 (θ)

then a consistent estimator, θ̃, of θ is the solution to the equation:


n
X
n−1 Ψ(Xj∗ ; θ̃) = 0, (3)
j=1
Pn
or equivalently θ̃ = arg minθ n−1 j=1 Ψ(Xj∗ ; θ̃)T Ψ(Xj∗ ; θ̃). Note that if θ is of dimen-

sion k < 3, we only require the first k components of Ψ.

We have restricted our attention to at most three parameters which are sufficient for

most practical needs. If more parameters are required, the approach described above

4
can be readily extended to accommodate the additional parameters, though the higher

order moments may have less simple formulas. (See next section below.)

3. Distribution Theory for Estimates

Clearly, there is no simple exact distribution theory for the estimator θ̃, since it will

depend on the distribution Fθ∗ of X ∗ , which, as mentioned earlier, may not be feasible

to work with, unless the original distribution is of a particularly convenient form, such

as a normal or gamma. Here we derive asymptotic theory for θ̃, on which statistical

inference may be based.

In order to obtain the asymptotic variance of the estimator, we also need the next

three higher-order moments. Straightforward manipulation leads to the following rela-

tionships (for p > 1):

µ∗4 = p−3 {µ4 + 3(p − 1)µ22 }, µ∗5 = p−4 {µ5 + 10(p − 1)µ3 µ2 },

µ∗6 = p−5 {µ6 + 15(p − 1)µ4 µ2 + 10(p − 1)µ23 + 15(p − 1)(p − 2)µ32 }, (4)

again, all are functions of θ.

Following standard asymptotic theory of estimating equations (for example, Ser-


³ ´
d
fling, 1980, Chapter 7), we can show, via Taylor expansion, that n1/2 θ̃ − θ →

N3 (0, Σ) , where Σ = ABAT , with,


( )
∂ n o
A−1 = −Eθ Ψ(X ∗ ; θ) , B = Eθ Ψ(X ∗ ; θ)Ψ(X ∗ ; θ)T . (5)
∂θ

For the particular form of the estimating equations (3), we can explicitly express

the matrices A and B in terms of the central moments of the original distribution, by

using the moment relationships given in (1) and (4). Putting θ = (θ1 , θ2 , θ3 )T , we then

5
have  
∂µ1 ∂µ1 ∂µ1
 ∂θ1 ∂θ2 ∂θ3 
 
−1
 
A = ∂µ2
 ∂θ1
∂µ2 ∂µ2 
 (6)
 ∂θ2 ∂θ3 
 
3pµ2 ∂µ1
∂θ1
+ ∂µ3
∂θ1
3pµ2 ∂µ
∂θ2
1
+ ∂µ3
∂θ2
3pµ2 ∂µ
∂θ3
1
+ ∂µ3
∂θ3

and  
 µ2 /p µ3 /p p2 µ∗4 
 
 
B= 2 ∗ 2
 µ3 /p p µ4 − µ2 p µ5 − µ3 µ2 
3 ∗
. (7)
 
 
p2 µ∗4 p3 µ∗5 − µ3 µ2 p4 µ∗6 − µ23
We will need estimates of Σ to construct confidence intervals and test statistics for

hypotheses regarding a function of θ. Here we propose two approaches to construct

estimates of A and B which then yield estimates of Σ. One approach is to “plug-in” θ̃

for θ in the expressions (6) and (7), respectively to obtain estimates of A and B.

Alternatively, we may obtain “semi-empirical” estimates for A and B, using a two-

stage strategy. First replace θ by θ̃ in (5), and then estimate the resulting functions

by their empirical sample means, we have


n
X n
X

Ã−1 = −n−1 Ψ(Xj∗ ; θ̃), B̃ = n−1 Ψ(Xj∗ ; θ̃)Ψ(Xj∗ ; θ̃)T . (8)
j ∂θ j

Both approaches lead to consistent estimators of Σ. We may then conduct approx-

imate inference on any function of θ, using the asymptotic normality of the statistics.

For example, a 100(1 − α) per cent confidence interval for a linear function lT θ is

lT θ̃ ± z1−α/2 (lT ÃB̃ ÃT l/n)1/2 , with z being the upper percentile of the standard normal

distribution.

4. The Box-Cox Transformation Family

4.1 Preliminaries

In their seminal paper, Box and Cox (1964) developed a widely used transformation

6
family for the linear regression model.



 (X λ − 1)/λ, λ 6= 0,
Y = , (9)
  log X, λ = 0,

where λ is a real-valued, unknown parameter, and X is the original data, assumed to

be strictly positive (in order that all real λ yield real values).

Based on this transformation, a diverse family of distributions for X is generated.

The Box-Cox power transformation assumes that there is some member of the power

family of transformations such that when applied to the data, the transformed data

are normally distributed. Hence the original data can take on a wide range of possible

distributions, and in most practical situations, there exists some member of this family

of distributions that is a reasonable model for the data generating mechanism. Three

important special cases of this family are the normal (λ = 1), lognormal (λ = 0), and

(non-central) χ2 (λ = 1/2).

It is now assumed that Y follows a normal distribution with mean µ and variance

σ 2 . We would like to conduct inference on the parameter vector θ = (µ, σ 2 , λ)T based

on observing X ∗ , which as before, are averages of p random observations of X.

4.2 Inference

4.2.1 Estimation of parameters

The standard approach to inference in the Box-Cox model is via maximum likeli-

hood, while other approaches have also been proposed (See Sakia, 1992, for a review).

However, when only pooled data are observed, these methods are extremely difficult or

impossible to carry out, except in special cases. We shall instead estimate the three pa-

rameters based on the pooled data via the moment-based estimating equation method

described in §2 and §3.

7
To proceed, we obtain expressions for the central moments of the distribution of X,

the transformed normal. Once the expressions are obtained, it is then computation-

ally straightforward to derive the estimator and its estimated large sample covariance

matrix.

It should be pointed out that the claim that Y has a normal distribution is not

exactly true for λ 6= 0, since Y is bounded by −1/λ from below (above) for λ > (<)

0. This effect in general practice is assumed to be negligible (and usually is), but must

be accounted for in our expression for the moments.

We first deal with λ ≥ 0, in which case all moments exist for X. For λ > 0, define

U = X λ = λY + 1, whose density is a truncated (at zero) normal density:


à !
1 u
f (u; µ, σ, λ) = φ − δ , (u > 0), (10)
|λ|σΦ(δ) |λ|σ

where δ = (λµ + 1)/(|λ|σ), φ, and Φ are the standard normal density and distribution

functions. Note that the absolute value sign is used in order to unify the density

expressions for both λ > 0 and λ < 0 (See below).

The moments of X as functions of µ, σ, and λ are thus given by


Z ∞ Ã !
1 u
µ1 (µ, σ, λ) = u1/λ φ − δ du, (11)
|λ|σΦ(δ) 0 |λ|σ
Z ∞n à !
1 or u
µr (µ, σ, λ) = u1/λ − µ1 (µ, σ, λ) φ − δ du, (r > 1). (12)
|λ|σΦ(δ) 0 |λ|σ
Note that as λ → 0, the moments converge to the moments of the lognormal

distribution, for which known formulas are available (See, for example, Aitchison and

Brown, 1957), and thus explicit expressions for the estimates of µ and σ 2 based on the

pooled data may be obtained.

For the case λ < 0, we notice from (11)-(12) that if X is bounded from above then all

moments exist; otherwise, only the moments of order r < −λ exist. To ensure feasible

8
execution of the proposed moment-based procedure, we assume that X ≤ x0 for some

x0 > 0. If we define U = X λ −xλ0 = λY +1−xλ0 , then the density of U and the moments

of X are still given respectively by (10)-(12), but with δ = (λµ + 1 − xλ0 )/(|λ|σ), and

u in the integrand being replaced by u + xλ0 .

Using these moment formulas and the estimating equations (3), we can obtain

estimates of θ = (µ, σ 2 , λ)T . The derivatives required for the asymptotic distribution

given by (5)-(7) may be computed by differentiating under the integral sign, or may

be computed numerically using the estimated parameters. We may then plug in the

empirical moment estimates to obtain the estimated asymptotic covariance matrix in

order to construct confidence intervals and test statistics.

Note that if we assume λ to be known, then only the first two moments µ1 and µ2

are needed to obtain estimates of µ and σ 2 , and the next two higher moments, µ3 and

µ4 to derive asymptotic variance. See §5 for an explicit example in the lognormal case

(λ = 0).

4.2.2 Computation and inference regarding µ and σ 2

Often, we are only interested in inference on µ and σ 2 , and the transformation param-

eter is regarded as being a nuisance. We adopt the following convenient approach to

conduct the inference.

Write µ(λ) and σ 2 (λ) to denote the dependency on the transformation scale. For

a fixed λ, we obtain (µ̃(λ), σ̃ 2 (λ)) by using only the first two moment relationships. λ̃

is then found via a grid search using the third moment. Our limited simulation results

show that the third moment equation is monotone in λ, when considered as a function

of (µ̃(λ), σ̃ 2 (λ), λ), and hence the grid search should yield a unique estimate λ̃ of λ.

Once λ̃ is determined from the data, we then proceed, just as in the standard data

transformation situation, as if λ were known (to be λ̃), to estimate µ and σ 2 and the

9
asymptotic variance with the last row and column of A−1 and B in (6) and (7) being

removed.

The appropriateness of such “conditional” inference has been a subject of much

debate in the literature (Bickel and Doksum, 1981; Box and Cox, 1982; Doksum and

Wong, 1983; Hinkley and Runger, 1984, among others). Since λ̃ is a consistent esti-

mate of λ under the Box-Cox model, the asymptotic equivalency of the “conditional”

transformed two-sample t-statistic with the “unconditional” one as in Doksum and

Wong (1983) would hold for the t-statistic based on the moment estimates as well.

Since formulas for the asymptotic variances are available for both the λ known, and

the λ estimated situations, the appropriateness of treating λ as known for problems

other than those treated by Doksum and Wong (1983) deserves further investigation.

4.3 Testing Goodness-of-fit

When inference procedures depend heavily on the distributional assumptions, it is

important to justify these assumptions before conducting the inference. Below we

propose several goodness-of-fit tests concerning the distribution of X based on the

pooled data X ∗ .

4.3.1 One sample test

One standard approach to testing goodness-of-fit is to imbed the distribution in ques-

tion into a larger family of distributions indexed by one or more additional parameters

and then test hypotheses regarding these parameters. Since the Box-Cox family is

a diverse family that includes many important special cases, a natural extension to

the estimation problem is the test for goodness-of-fit based on the pooled data to a

hypothesized distribution.

Using the estimate of λ, derived by using the estimating equations (3) with θ =

(µ, σ 2 , λ)T and the moments given in (11) and (12), we may test the fit of the underlying

10
data to a desired distribution of X, based solely on the pooled data. For example,

testing for goodness-of-fit to a lognormal distribution based on the pooled data is

accomplished simply by testing H0 : λ = 0 vs. H1 : λ 6= 0, using the asymptotically

standard normal test statistic Z = λ̃/s, where s is the estimated standard error of λ̃

based on the asymptotic distribution derived in §3.

4.3.2 Two-sample extension

The Box-Cox transformation family has been used in the receiver operating charac-

teristic (ROC) curve analysis to evaluate the accuracy of a medical diagnostic test or

biomarker that yields continuous outcomes (Faraggi and Reiser, 2002; Zou and Hall,

2000, 2002). A key assumption to warrant the use of the Box-Cox transformation

theory in such analysis is that the transformation parameter λ be the same for both

diseased and non-diseased outcomes. Below we propose a method of testing this key

assumption based on the pooled data.

We observe two independent pooled samples from np diseased and mq nondis-


Pjp Pkq
eased subjects, Xj∗ = i=(j−1)p+1 Xi /p, (j = 1, ..., n), and Yk∗ = i=(k−1)q+1 Yi /q,

(k = 1, ..., m). The individual Xs and Y s are not observed. We assume that for cer-

tain unknown parameters λX and λY , (X λX − 1)/λX and (Y λY − 1)/λY both follow

(truncated) normal distributions. The null hypothesis to be tested in this situation is

H0 : λX − λY = 0 vs. H1 : λX − λY 6= 0.

We may test this hypothesis in the following manner. Let λ̃X and λ̃Y be the

estimates of λX and λY , respectively, obtained by solving the estimating equations (3),

and s2X and s2Y be their estimated variances derived from the methods described in §3.

We then reject H0 at significance level α if |λ̃X − λ̃Y | > z1−α/2 s, where s = (s2X +s2Y )1/2 .

If we do not reject the null hypothesis, we may then feel comfortable in combining

the two estimates to obtain a weighted estimate of the common value of λ, use the

11
common estimate as the true λ to estimate µ and σ 2 for each group, and then proceed

with the analysis as proposed by Zou and Hall (2000, 2002).

4.3.3 General goodness-of-fit

The goodness-of-fit test based on an estimate of λ in §4·3·1 is technically valid only

under the assumption that the true distribution is actually a member of the Box-Cox

family. We may also adapt other readily available techniques to test the distributional

assumptions without assuming membership in the Box-Cox family.

We will assume that the two distributions, Fθ of X and Fθ∗ of X ∗ , are uniquely

determined by each other, which then implies that testing the hypothesis that the

unoberserved data X follow the distribution Fθ is equivalent to testing the hypothesis

that the observed pooled data X ∗ follow the distribution Fθ∗ . While there are certain

exceptions to this uniqueness characterization, we suspect that it holds for most of the

distributions actually used in practice. We comment on this further in §6.

One simple method is to draw a Q-Q plot of the pooled data versus a hypothetical

distribution F ∗ of X ∗ . Suppose we want to test the hypothesis that a distribution

Fθ generates the unobserved data from which the pooled data are observed. Using

the moment based technique we have developed in the previous sections, we obtain an

estimator θ̃ of θ based on the pooled data Xj∗ , (j = 1, . . . , n).

If the quantiles of F ∗ are difficult to compute, which is the case in general, We may

generate a large number of observations from Fθ̃ , and group them into sets of size p,

to yield the distribution, Fθ̃∗ . We then plot the quantiles of this large sample empirical

distribution versus the quantiles of the empirical distribution of the observed data, and

check for linearity in the plot.

Another approach would be to use a formal goodness-of-fit hypothesis test. One of

the most common goodness-of-fit tests of a parametric assumption is the Kolmogorov-

12
Smirnoff type test. This test is based on the statistic,

D = supx |Fn∗ (x) − Fθ̃∗ (x)|,

the largest difference in cumulative distribution functions between the empirical and

theoretical distributions. The distribution of D under the null hypothesis that the data

follows the hypothesized distribution is complicated by the fact that we are using an

estimate of θ, and not the true parameter. Hence, the critical regions of the standard

Kolmogorov-Smirnoff test are not valid.

Based on the results of Romano (1988), the following bootstrap method will deter-

mine critical values that will yield tests with correct asymptotic significance levels.

Step 1. Based on the estimate, θ̃, generate a random sample of size np from Fθ̃∗ ,

and then group into sets of size p to obtain the pooled sample.

Step 2. Compute the empirical distribution, Fn∗ , for this sample.

Step 3. Generate the empirical distribution Fθ̃∗∗ based on a large sample grouped

into sets of size p, as in the Q-Q plot.

Step 4. Calculate D∗ = supx |Fn∗ (x) − Fθ̃∗∗ (x)| for this sample.

Step 5. Repeat a large number of times, and use the frequency distribution of D∗

as the null distribution of D to find the critical region.

5. An example

A population-based sample of randomly selected residents of New York State’s Erie

and Niagara counties, 35 to 79 years of age, was the focus of this investigation. The

New York State Department of Motor Vehicles drivers’ license rolls were utilized as the

sampling frame for adults between the ages of 35 and 65; whereas the elderly sample

(age 65 to 79) was randomly selected from the Health Care Financing Administration

database.

13
A total of 72 men and women were selected for the analyses. Personal history of my-

ocardial infarction and angina pectoris was ascertained by self-reported questionnaire.

Participants were asked if they had been diagnosed with angina pectoris confirmed by

angiogram or with myocardial infarction. Medical charts were reviewed by a physician

for outcome verification and were defined as having coronary heart disease. Partici-

pants provided a 12-hour fasting blood specimen for biochemical analysis. A number

of parameters were examined in fresh blood samples, including routine Vitamin E lev-

els. We assume that the distribution of Vitamin E concentrations is a member of the

Box-Cox family. Fig 1(a) and 1(b) shows the normal Q-Q plots for the original and

log-transformed data, respectively.

*** (Insert Figure 1 here) ***

Figure 1. Normal Q-Q plot of Vitamin E concentration.

A lognormal distribution appears to be a reasonable fit to the data. Based on the full

(un-pooled) data, the standard Kolmogorov-Smirnoff test rejects the normal assump-

tion (p-value=0.0023), while not rejecting the lognormal assumption (p-value=0.5).

A 95% confidence interval for λ based on the maximum likelihood estimate, which is

obtainable when full data are available, is found to be (-0.6924, 0.3334), overlapping

zero, further confirming the lognormal assumption.

We now randomly group the subjects into groups of 2 and take the average as the

pooled observation. Based on this pooled sample, the moment-based estimate of λ

is 0.1096 with a standard error of 0.5565, yielding a confidence interval of (-0.9811,

1.2003). This again indicates lognormality. Moreover, the simulated Q-Q plots of the

14
pooled data (Figure 2) also supports that data are lognormally distributed. We will

therefore proceed under the lognormal assumption.

*** (Insert Figure 2 here) ***

Figure 2. Lognormal Q-Q plot of Vitamin E concentration with pooled data.

For the lognormal distribution, the four required central moments are given by

(Aitchison and Brown, 1957):


³ ´
µ1 = exp µ + 21 σ 2 , µ2 = µ21 ω 2 ,

µ3 = µ31 ω 4 (ω 2 + 3), µ4 = µ41 ω 4 (ω 8 + 6ω 6 + 15ω 4 + 16ω 2 + 3)

where ω 2 = exp(σ 2 ) − 1.

We find that

n o n o
µ = log (µ21 + µ2 )−1/2 µ21 , σ 2 = log 1 + µ−2
1 µ2 .

We thus obtain (µ̃, σ̃ 2 ) by respectively replacing µ1 and µ2 by their moment estimates:


n n
1X pX
µ̃1 = Xi∗ , µ̃2 = (X ∗ − µ̃1 )2 ,
n i=1 n i=1 i

where the X ∗ s are the pooled observations. Using the explicit formulas for the mo-

ments and the asymptotic variance, some straightforward but tedious algebra yields

the asymptotic variances and covariance as:

. −1
n V ar (µ̃) = (4pγ 2 ) {γ 6 − 8γ 4 + 16γ 3 + (2p − 11)γ 2 − 4(p − 1)γ + 2(p − 1)} ,
. −1
n V ar (σ̃ 2 ) = (pγ 2 ) {γ 6 − 4γ 4 + 4γ 3 + (2p − 3)γ 2 − 4(p − 1)γ + 2(p − 1)} ,
. −1
n Cov (µ̃, σ̃ 2 ) = − (2pγ 2 ) {γ 6 − 6γ 4 + 8γ 3 + (2p − 5)γ 2 − 4(p − 1)γ + 2(p − 1)} ,

15
where γ = 1 + ω 2 .

Notice that the above formulas depend only on σ 2 , the variance of the underlying

normal distribution, along with the pooling size p. We may plug in γ̃ = exp(σ̃ 2 ) to

yield consistent estimates of the variances and covariance, and thus to construct test

statistics and confidence intervals.

Applying these formulas we obtain the estimates of (µ, σ 2 ) and their estimated

standard errors, for both the pooled and un-pooled data. The results are presented in

Table 1. For comparison, estimates based on pooled data with group size of 3 and 4

are also given in the Table.

Table 1

Estimate (± standard error) of the lognormal mean µ and variance σ 2 .

p n µ̃ σ̃ 2
1 72 2.6421 (±0.0498) 0.1733 (±0.0373)
2 36 2.6408 (±0.0519) 0.1757 (±0.0465)
3 24 2.6396 (±0.0540) 0.1781 (±0.0545)
4 18 2.6405 (±0.0554) 0.1764 (±0.0603)

To evaluate the performance of these estimates, we also computed the maximum

likelihood estimates µ̂ of µ and σ̂ 2 of σ 2 , using the un-pooled data. It turned out that

µ̂ = 2.6466 and σ̂ 2 = 0.1609, with standard errors 0.0473 and 0.0270, respectively.

We observe that, due to the small value of σ 2 , which is common in lognormal data,

there is not a great deal of efficiency loss in the moment based estimates as compared

with the maximum likelihood estimates, especially in the estimation of µ. In addition,

there is also only a small loss of efficiency as we pool the data. This small efficiency

16
loss is in agreement with previous studies on the merits of pooling data, under normal

and gamma assumptions, to reduce costs associated with bioassays. See Faraggi et al.

(2003), Liu and Schisterman (2003) and Weinberg and Umbach (1999).

6. Discussion

6.1 Comments on Goodness-of-fit Test

In the goodness-of-fit testing problem, we are testing a hypothesis regarding the under-

lying distribution of the unpooled data. It is implicitly assumed that the distribution

of the convolution is in one-to-one correspondence with the distribution of the indi-

vidual observations. This is true under general regularity conditions. For example, a

nonvanishing characteristic function is a sufficient condition for this one-to-one corre-

spondence. For other characterization conditions see Prokhorov and Ushakov (2002)

and the references therein. Regardless of the uniqueness of the characterization, the

type I error of the test is unaffected. However, the test will be unable to detect the dif-

ference between any two original distributions that may yield the identical convolution

distribution. An additional question then arises, if the characterization is not unique,

how different are the generating distributions with respect to an underlying distance

such as Kolmogorov-Smirnoff, or (symmetrized) Kullback-Leibler?

6.2 Other Comments And Further Directions

In this paper we proposed inference on pooled data under parametric assumptions

on the individual observations. We also suggested methods to test these parametric

assumptions. One of the problems inherent in this type of set-based data is created by

the central limit theorem. The pooled data tend to a normal distribution as the pooling

size increases, and even for small to moderate pooling sizes, much of the skewness (and

higher moments) of the original distribution is lost in the set-based distribution. While

the loss in variability is linear in the pooling size, this loss of skewness is quadratic,

17
as can be seen by the moments (1). This hampers the ability to detect differences in

distributional shape even for modest pooling sizes.

To our knowledge, the current paper is the first to present a general methodology

for dealing with set-based data under a broad class of parametric distributional as-

sumptions. As the pooling of data becomes a more common procedure, particularly in

the area of evaluation of disease biomarkers, more research on methods to deal with

this form of data needs to be done. For example, under a parametric assumption,

we may be able to use Edgeworth expansions to write out an approximate likelihood

function for the set-based data and proceed via likelihood methods. The accuracy of

inference based on these approximations would be of interest.

Non-parametric methods for set-based data may also be appropriate. A possible al-

ternative to the parametric models proposed in this paper, would be an approach based

on density deconvolution, which again appears to be technically and computationally

challenging.

Acknowledgements

The authors thank W. Jack Hall and Kai F. Yu for helpful discussion and suggestions.

References

Aitchison, J. and Brown, J. A. C. (1957). The Lognormal Distribution. Cambridge:

the University Press.

Barcellos, L. F., Klitz, W., Field, L. L., Tobias, R., Bowcock, A. M., Wilson, R., Nelson,

M. P., Nagatomi, J., Thomson, G. (1997). Association mapping of disease loci, by

18
use of a pooled DNA genomic screen. American Journal of Human Genetics 61,

734-47.

Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited.

Journal of the American Statistical Association 76, 296-311.

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the

Royal Statistical Society B 26, 211-52.

Box, G. E. P. and Cox, D. R. (1982). An analysis of transformations revisited, rebutted.

Journal of the American Statistical Association 77, 209-10.

Chen, C. L. and Swallow, W. H. (1990). Using group testing to estimate a proportion,

and to test the binomial model. Biometrics 46, 1035-46.

Doksum, K. A. and Wong, C-W. (1983). Statistical tests based on transformed data.

Journal of the American Statistical Association 78, 411-7.

Dorfman, R. (1943). The detection of defective members of large populations. Annals

of Mathematical Statistics 14, 436-40.

Enard, W., Khaitovich, P., Klose, J., Zollner, S., Heissig, F., Giavalisco, P., Nieselt-

Struwe, K., Muchmore, E., Varki, A., Ravid, R., Doxiadis, G. M., Bontrop, R. E.

and Paabo, S. (2002). Intra- and interspecific variation in primate gene expression

patterns. Science 296, 340-3.

Faraggi, D. and Reiser, B. (2002). Estimation of the area under the ROC curve.

Statistics in Medicine 21, 3093-106.

Faraggi, D., Reiser, B. and Schisterman, E. F. (2003). ROC curve analysis for biomark-

ers based on pooled assessments. Statistics in Medicine 15, 2515-27.

Farrington, C. (1992). Estimation prevalence by group testing using generalized linear

models. Statistics in Medicine 11, 1591-7.

Gastwirth, J. and Johnson, W. (1994). Screening with cost-effective quality control:

19
Potential applications to HIV and drug testing. Journal of the American Statistical

Association 89, 972-81.

Hinkley, D. V. and Runger, G. (1984). The analysis of transformed data. Journal of

the American Statistical Association 79, 302-20.

Hughes-Oliver, J. M. and Swallow, W. H. (1994). A two-stage adaptive group-testing

procedure for estimating small proportions. Journal of the American Statistical

Association 89, 982-93.

Jin, W., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gurgel, G. and Gibson,

G. (2001). The contributions of sex, genotype and age to transcriptional variance

in Drosophila melanogaster. Nature Genetics 29, 389-95.

Kendziorski, C. M., Zhang, Y. , Lan, H. and Attie, D. (2003). The efficiency of pooling

m RNA in microarray experiments. Biostatistics 4, 465-77.

Litvak, E. , Tu, X. M. and Pagano, M. (1994). Screening for the presence of a disease by

pooling sera samples. Journal of the American Statistical Association 89 , 424-34.

Liu, A. and Schisterman, E. F. (2003). Comparison of diagnostic accuracy of biomark-

ers with pooled assessments. Biometrical Journal 45, 631-44.

Prokhorov, A. V. and Ushakov, N. G. (2002). On the problem of reconstructing a

summands distribution by the distribution of their sum. Theory of Probability and

its Applications 46, 420-30.

Romano, J. P. (1988). A bootstrap revival of some nonparametric distance tests.

Journal of the American Statistical Association 83, 698-708.

Sakia, R. M. (1992). The Box-Cox transformation technique: a review. The Statisti-

cian 41, 169-78.

Serling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York:

Wiley.

20
Sobel, M. and Groll, P. (1959). Group testing to eliminate efficiently all defectives in

a binomial sample. The Bell System Technical Journal 38, 1179-252.

Sobel, M. and Elashoff, R. (1975). Group testing with a new goal: Estimation.

Biometrika 62, 181-93.

Tu, X. M., Litvak, E. and Pagano, M. (1995). On the informativeness and accuracy

of pooled testing in estimating prevalence of a rare disease: Application to HIV

screening. Biometrika 82, 287-97.

Weinberg C. R. and Umbach, M. (1999). Using pooled exposure assessment to improve

efficiency in case-control studies. Biometrics 55, 718-26.

Zou, K. H. and Hall, W. J. (2000). Two transformation models for estimating an ROC

curve derived from continuous data. Journal of Applied Statistics 27, 621-31.

Zou, K. H. and Hall, W. J. (2002). Semiparametric and parametric transformation

models for comparing diagnostic markers with paired design. Journal of Applied

Statistics 29, 803-16.

21
40

3.5
log(VitaminE)
30

3.0
VitaminE
20

2.5
10

2.0
-2 -1 0 1 2 -2 -1 0 1 2

Quantiles of Standard Normal


Quantiles of Standard Normal
(b)
(a)

This is Figure 1.
25
20
VitaminE
15
10

10 20 30 40

Quantiles of Lognormal Average

This is Figure 2.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy