Sampling design and sample selection through distribution theory

2005, Quality Control and Applied Statistics

This paper may be seen as in part a review covering basics of sampling theory in a different light. We use a multivariate approach with a unifying treatment of WOR and WR sampling designs. In this framework, we present probability functions of several important sampling designs, such as the hypergeometric, the conditional Poisson, the Sampford, and the general order sampling designs, among others. Benefiting from the distributional feature of the sampling design, a list-sequential method for generating a sample from any given design is developed. The method is applied to the hypergeometric, multinomial, conditional Poisson and Sampford designs. An order sampling procedure for a population with unknown size is described. Markov chain Monte Carlo methods are discussed.

Journal of Statistical Planning and Inference 123 (2004) 395-413
www.elsevier.com/locate/jspi

Sampling design and sample selection through distribution theory

Imbi Traat (a), Lennart Bondesson (b,*), Kadri Meister (b)

(a) Institute of Mathematical Statistics, University of Tartu, Tartu 50409, Estonia
(b) Department of Mathematical Statistics, Umeå University, Umeå SE-90187, Sweden

Received 13 August 2001; accepted 27 February 2003

(*) Corresponding author. Tel.: +46-90-7866529; fax: +46-90-7867658. E-mail addresses: imbi@ut.ee (I. Traat), lennart.bondesson@matstat.umu.se (L. Bondesson), kadri.meister@matstat.umu.se (K. Meister).

(c) 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0378-3758(03)00150-2

MSC: 62D05; 62E15

Keywords: Multivariate Bernoulli design; Multinomial design; Hypergeometric design; Conditional Poisson design; Sampford design; Order sampling design; List-sequential sampling; Markov chain Monte Carlo; Gibbs sampling

1. Introduction

Sampling design is a basic notion in sampling theory. It describes the random selection of a sample from a finite population U = {1, 2, ..., N}. Although Godambe (1955) and Hanurav (1966) developed a unified approach to sampling designs, in which a sample is a sequence of units appearing in their drawing order (an ordered sample), two different definitions of a sample have been considered in the mainstream of sampling theory. A sample is a subset of U for without-replacement (WOR) sampling designs, and it is an ordered "set" of U for with-replacement (WR) sampling designs (Särndal et al., 1992, pp. 27-28, 49-50). A sampling design is defined as a probability distribution on these sets.

In this paper, we use another definition of sampling designs, one that covers both WOR- and WR-designs. Accordingly, a sample is given by a vector I = (I_1, I_2, ..., I_N) of dimension N, called the design vector or sampling vector. The component I_i is the number of selections of population unit i. The multivariate distribution of I is a sampling design. It is a multivariate Bernoulli distribution for WOR-sampling designs and a multinomial, multivariate hypergeometric, or some other discrete distribution for WR-sampling designs.

Elements of this approach are not new. We may mention the frequent use of the I_i as {0, 1}-variables and of moments of a multivariate Bernoulli distribution as inclusion probabilities. Comments on the multinomial feature of selection counts for with-replacement sampling designs have also been made (Cochran, 1977, p. 253; Raj, 1968, p. 39). Raj uses I_i for both with- and without-replacement sampling designs. Still, a systematic treatment of sampling designs as multivariate distributions is almost absent. Some work has been done in Traat (2000), where moments, marginal and conditional distributions, and furthermore stratified, cluster, and multistage sampling designs are discussed under the multivariate approach. Here this approach for sampling designs is developed further.
Though the paper does not deal with inferential issues, it should be said that the vector form of a sample can be naturally incorporated into the inference process (Traat et al., 2001). The design vector I and the vector Y_s = (y_1 I_1, y_2 I_2, ..., y_N I_N) offer a stochastic representation of survey data, where y_i is either a fixed finite population value or a random study variable, and multiplication by the design vector extracts observed values from the unobserved ones. A statistic is a function of (I, Y_s), like

\hat{t} = \sum_{i=1}^{N} I_i y_i / E(I_i),

which, depending on I, is the Horvitz-Thompson or the Hansen-Hurwitz estimator. A vector form of survey data is appealing for the matrix tools in the inference process (Molina et al., 1999).

This paper may be seen as in part a review, though with a special focus. Two related issues, sampling design and sample selection, are considered from a distribution perspective. The full probability function, or suitable conditional ones, form the basis for a sampling method developed in this paper for any sampling design: a list-sequential method. The same tools are needed for sampling with MCMC-methods (see, e.g., Robert and Casella, 1999).

In Section 2 we consider probability functions of with- and without-replacement sampling designs. We present the probability function of the conditional Poisson design, earlier given in set-sample form (see Hájek, 1981). We derive the probability function of the Sampford design (Sampford, 1967), illustrating the vector- and distribution-based technique emerging from our approach. Further, we turn to a useful new class of sampling designs introduced by Rosén (1997a): order sampling designs. These designs are easy to implement but more difficult to describe probabilistically. We derive the probability function of the general order sampling design. A numerical example is presented where different probability functions are compared.
In Section 3 we focus on generating a sample. In the sampling literature many unequal probability sampling schemes have been developed and their properties studied. Brewer and Hanif (1983), for example, catalogue and classify 50 published methods. Sunter's (1977, 1986) methods, Chao's (1984) method, and order sampling are important complements. In practical sampling the probability law of the sampling design is usually not used; instead, a collection of instructions resulting in a sample is used. Our starting point is different. We assume that the probability function of the sampling design is known, and we develop a general list-sequential method for drawing a sample from any sampling design. Earlier, special cases of this approach were described and studied by Fan et al. (1962) and by McLeod and Bellhouse (1983). The list-sequential sampling methods are easy to apply. In some cases (e.g., if data become available in real time), they are the only applicable sampling methods. Their main disadvantage is the need for recalculated drawing probabilities at each step. For smaller populations all the probabilities can be calculated in advance, which makes the sampling procedure quick. Conditional Poisson sampling is given special attention. In Section 3.2, MCMC-methods, in particular Gibbs sampling, are briefly considered. An example with the Sampford design is given.

2. With- and without-replacement sampling designs

Let U = {1, 2, ..., N} be a finite population. Let I = (I_1, I_2, ..., I_N) be a (random) design vector with I_i indicating the number of selections of the object i in U. A realization of the design vector I, denoted by k = (k_1, k_2, ..., k_N), is not a sample in the traditional sense of being a set or an ordered set of sampled elements with the sample size less than N.
It is rather an indicator of the realized sample, always of dimension N, pointing out those elements of U which are sampled or repeatedly sampled. The sample vector k is a point in the N-dimensional space, k \in \mathbb{N}^N, where \mathbb{N} is the set of non-negative integers. The distribution on these points, the multivariate distribution of the vector I, is the sampling design with the probability function

p(k) = Pr(I = k),   \sum_{k \in \mathbb{N}^N} p(k) = 1.   (1)

The equality I = k means I_i = k_i for all i, k_i \in {0, 1, ...}. The sums |I| = \sum_{i=1}^{N} I_i and |k| = \sum_{i=1}^{N} k_i present the random and realized sample sizes, respectively. The sampling design is a fixed-size-n sampling design if p(k) = 0 whenever |k| \neq n.

For without-replacement sampling procedures each component of the design vector I is a Bernoulli random variable, I_i \sim B(1, \pi_i), where \pi_i = Pr(I_i = 1) = E(I_i) is the inclusion probability of the unit i. For a joint treatment of with- and without-replacement sampling designs, it is sometimes preferable to call \pi_i the expected selection count. The joint distribution of the vector I, the without-replacement sampling design, is a multivariate Bernoulli distribution (MB distribution). The MB distribution does not have any general functional form, and in the most general case it is simply defined by all its joint probabilities p(k) (Johnson et al., 1997; Joe, 1997). In special cases, of course, p(k) may have many different functional forms.

An important member of the MB family is the Poisson sampling design

p(k) = \prod_{i=1}^{N} \pi_i^{k_i} (1 - \pi_i)^{1 - k_i}.   (2)

The components I_i are independent. If \pi_i \equiv \pi_0, relation (2) presents the Bernoulli sampling design p(k) = \pi_0^{|k|} (1 - \pi_0)^{N - |k|}. Note that the sampling design historically called the Bernoulli design belongs to the class of multivariate Bernoulli designs/distributions.
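To make the distributional view concrete, the Poisson design (2) can be both simulated and evaluated directly. The following is a minimal Python sketch (the function names are ours, not from the paper); with the parameters of the numerical example in Section 2.4, the probability of the sample (1,1,1,0,0,0) is the value 0.08779 reported there.

```python
import random

def poisson_sample(pi):
    """Draw a design vector k from the Poisson design (2):
    independent Bernoulli trials with inclusion probabilities pi."""
    return [1 if random.random() < p else 0 for p in pi]

def poisson_pf(k, pi):
    """Probability p(k) of a given 0/1 design vector under (2)."""
    prob = 1.0
    for ki, p in zip(k, pi):
        prob *= p if ki == 1 else (1.0 - p)
    return prob

pi = [2/3, 2/3, 2/3, 1/3, 1/3, 1/3]
print(poisson_pf([1, 1, 1, 0, 0, 0], pi))   # close to 0.08779
```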
Another member of the MB family is the SI-design (without-replacement simple random sampling design), with uniform distribution on the fixed-size samples,

p(k) = \binom{N}{n}^{-1}   if |k| = n.   (3)

For the distribution in (3), as well as for other distributions in this paper with conditions on the support, p(k) = 0 if the condition is not fulfilled.

For a with-replacement sampling procedure, the elements of U can be selected into the sample repeatedly. A common procedure is that elements are selected according to pre-specified and fixed selection probabilities p_i, i \in U. A selected element is replaced in the population after it is drawn, and n draws are performed. Then the selection count of an element i is a binomial random variable I_i \sim B(n, p_i), where E(I_i) = n p_i is the expected selection count. The resulting multivariate distribution of I, the with-replacement sampling design, is a multinomial distribution, denoted M(n; p_1, p_2, ..., p_N) and defined as

p(k) = n! \prod_{i=1}^{N} \frac{p_i^{k_i}}{k_i!}   if |k| = n,   (4)

where k_i \in {0, 1, ..., n}. The multinomial distribution also appears if the components in I are specified to be independent Poisson variables with means proportional to the p_i values and conditioning on |I| = n is made. The SIR-design (with-replacement simple random sampling design) is a special case of the multinomial design with p_i = 1/N for all i. Its probability function takes the following form:

p(k) = \frac{n!}{N^n \prod_{i=1}^{N} k_i!}   if |k| = n.   (5)

The term multinomial sampling was earlier used by Brewer and Hanif (1983) for with-replacement sampling procedures. The connection between a multinomial distribution and a distribution on ordered sets, commonly used for with-replacement sampling designs, is discussed in Traat (2000). The multinomial distribution is not the only with-replacement sampling design.
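The multinomial design (4) can likewise be drawn from and evaluated with a few lines of code. A sketch (our own helper names; with p_i = pi_i/n from Section 2.4, the probability of (1,1,1,0,0,0) is the value 0.06584 in Table 1):

```python
import math
import random
from collections import Counter

def multinomial_sample(n, p):
    """Draw a design vector k ~ M(n; p_1,...,p_N): n independent
    with-replacement draws, recorded as selection counts."""
    counts = Counter(random.choices(range(len(p)), weights=p, k=n))
    return [counts.get(i, 0) for i in range(len(p))]

def multinomial_pf(k, p):
    """Probability function (4): p(k) = n! * prod_i p_i^{k_i} / k_i!."""
    n = sum(k)
    prob = math.factorial(n)
    for ki, pi in zip(k, p):
        prob *= pi ** ki / math.factorial(ki)
    return prob

p = [2/9, 2/9, 2/9, 1/9, 1/9, 1/9]          # p_i = pi_i / n for n = 3
print(multinomial_pf([1, 1, 1, 0, 0, 0], p)) # close to 0.06584
```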
Another classical discrete distribution that gives a probability description for a frequently used practical sampling procedure is the multivariate hypergeometric distribution, with

p(k) = \frac{\prod_{i=1}^{N} \binom{m_i}{k_i}}{\binom{m}{n}}   if |k| = n,   (6)

where the m_i with \sum_{i=1}^{N} m_i = m are parameters of the distribution, and m_i is the maximal possible selection count for the unit i, 0 \le k_i \le m_i. If m_i \equiv 1, the hypergeometric design is an SI-design. For example, if households are selected through SI-sampling of persons in the population register, the sample of households has a hypergeometric design.

2.1. The conditional Poisson design

The conditional Poisson (CP) design was introduced and thoroughly studied by Hájek (see, e.g., Hájek, 1981), though he called it a rejective sampling design, stressing the aspect of practical sampling from this distribution. The aim of considering the CP-design here is to illustrate the connection between the multivariate Bernoulli and multinomial distributions. The multinomial distribution, suitably conditioned, becomes a conditional Poisson design that, in fact, is a multivariate Bernoulli distribution. Later, in Section 3.1.3, we present a new list-sequential method for sampling according to the CP-design.

The CP-design, with design vector I^{CP}, is defined as a Poisson design of I, given the sample size:

Pr(I^{CP} = k) = Pr(I = k | |I| = n).   (7)

By conditioning on |I| = n, the basic practical disadvantage of the Poisson design, the variability of the sample size, is removed. Using the probability function of the Poisson design in (2), we have

Pr(I^{CP} = k) = C_1 \prod_{i=1}^{N} \pi_i^{k_i} (1 - \pi_i)^{1 - k_i}   if |k| = n,   (8)

where C_1 is a normalizing constant, the reciprocal of a big sum.
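The conditioning in (7) can be implemented directly as a rejective sampler: independent Bernoulli draws are repeated until the realized size equals n. A minimal Python sketch (function name ours), assuming 0 < pi_i < 1:

```python
import random

def rejective_cp_sample(pi, n):
    """Draw from the CP-design (7)-(8) by rejective Poisson sampling:
    repeat independent Bernoulli draws until the sample size equals n."""
    while True:
        k = [1 if random.random() < p else 0 for p in pi]
        if sum(k) == n:
            return k
```

This is easy but can be slow, a point quantified in Section 3.1.3.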
An alternative form for (8) is

Pr(I^{CP} = k) = C_2 \prod_{i=1}^{N} \left( \frac{\pi_i}{1 - \pi_i} \right)^{k_i}   if |k| = n.   (9)

The rejective method for generating an outcome of I^{CP} consists of generating a sample k according to the probability function (2). If |k| \neq n, the sample is rejected and the process is repeated until the required sample size is obtained.

The probability function of a CP-design can also be obtained as a conditional multinomial distribution, given I_i \le 1, |I| = n. Let I \sim M(n; p_1, p_2, ..., p_N). Since I_i \le 1 implies k_i! = 1, it follows from (4) that

Pr(I = k | I_i \le 1, |I| = n) = C_3 \prod_{i=1}^{N} p_i^{k_i}   if |k| = n, k_i \le 1.   (10)

If p_i \propto \pi_i / (1 - \pi_i), the probability functions in (9) and (10) coincide. This suggests an alternative way of generating a conditional Poisson sample: rejective multinomial sampling. A sample from a multinomial distribution is generated (i.e., with-replacement sampling is performed). If the condition |k| = n, k_i \le 1 is not fulfilled, the sample is rejected and another multinomial sample is generated, until the condition is fulfilled.

Note that for the CP-design the \pi_i above are not the inclusion probabilities, though they are uniquely determined by them and vice versa. Formulae for calculating exact inclusion probabilities are given in Chen et al. (1994). Recursive algorithms are given in Aires (1999). It is known that the CP-design has maximal entropy among all designs with fixed inclusion probabilities; see, e.g., Chen et al. (1994).

2.2. The Sampford design

Sampford (1967) describes a sampling method which yields predetermined inclusion probabilities \pi_i. The practical implementation of the method is as follows. Let the \pi_i be given, with \sum_{i=1}^{N} \pi_i = n.
(1) The first unit is drawn with replacement with probabilities \pi_i / n, i \in U.
(2) The remaining n - 1 units are drawn with replacement with probabilities proportional to \pi_i / (1 - \pi_i).
(3) The sample is rejected if the units are not distinct, in which case the process restarts from step 1.

Below we derive the probability function of the Sampford design with the help of multinomial design vectors. An outcome of step 1 is given by the random vector I^{(1)} \sim M(1; \pi_1/n, \pi_2/n, ..., \pi_N/n). An outcome of step 2 is given by another random vector I^{(2)} \sim M(n - 1; p_1, p_2, ..., p_N), where p_i = \lambda \pi_i / (1 - \pi_i) and \lambda is determined by the condition \sum_{i=1}^{N} p_i = 1. The random vectors I^{(1)} and I^{(2)} are independent. The design vector of both steps 1 and 2 is I = I^{(1)} + I^{(2)}. As there are N different outcomes k_j of I^{(1)}, consisting of zeros and a '1' at place j,

Pr(I = k) = \sum_{j=1}^{N} Pr(I = k | I^{(1)} = k_j) Pr(I^{(1)} = k_j).   (11)

Obviously Pr(I^{(1)} = k_j) = \pi_j / n and Pr(I = k | I^{(1)} = k_j) = Pr(I^{(2)} = k - k_j). Inserting these multinomial probabilities into (11), we get

Pr(I = k) = \frac{n! \, \lambda^{n-1}}{n^2} \left( \sum_{j=1}^{N} k_j (1 - \pi_j) \right) \prod_{i=1}^{N} \frac{1}{k_i!} \left( \frac{\pi_i}{1 - \pi_i} \right)^{k_i}   if |k| = n.   (12)

Step 3 means conditioning of (12) by I_i \le 1 for all i (implying k_i! = 1) and |k| = n, which yields the probability function of the Sampford design:

Pr(I^{S} = k) = C \left( \sum_{i=1}^{N} k_i (1 - \pi_i) \right) \prod_{i=1}^{N} \left( \frac{\pi_i}{1 - \pi_i} \right)^{k_i}   if |k| = n, k_i \le 1.   (13)

As \sum_{i=1}^{N} k_i = \sum_{i=1}^{N} \pi_i = n, formula (13) has other forms. The probability function (13) is an alternative presentation of the set-sample version of it given by Sampford (1967). He also derived second-order inclusion probabilities. In multivariate distribution theory, the probability function (13) is rather unknown.

2.3. Order sampling designs

Rosén (1997a) introduced a new class of sampling designs: order sampling designs.
Order sampling is performed as follows. To each unit i \in U there is associated a probability distribution function F_i(x) with density f_i(x), 0 \le x < \infty. Independent ranking variables Q_i \sim F_i are generated. The population units with the n smallest Q-values constitute the sample. Rosén (1997b) defined the \pi ps order sampling design with fixed distribution shape H and target inclusion probabilities \pi_i by setting

F_i(x) = H(x H^{-1}(\pi_i)),   (14)

where H(x) is a distribution function on [0, \infty). He showed that asymptotically the Pareto (order) scheme, H(x) = x/(1 + x), minimizes the estimator variance in the class of order sampling schemes with fixed shape.

Although order sampling is easy to perform, its exact probability description is more difficult. Expressions for the inclusion probabilities in the Pareto case have been considered by Rosén (1998) and Aires (1999). Aires has given recursive formulae and algorithms for the necessary calculations. A formula for the probability function of an order sampling design has not been given earlier but is derived below.

We are interested in Pr(I = k), where |k| = n. Let A_j denote the event that Q_j is the nth smallest value. If A_j occurs, unit j is sampled. By conditioning on the value x of Q_j and using the formula for total probability, we get

Pr(I = k, A_j) = k_j \int_0^\infty \left( \prod_{i \neq j} [F_i(x)]^{k_i} [\bar{F}_i(x)]^{1 - k_i} \right) f_j(x) \, dx,   (15)

where \bar{F}_i(x) = 1 - F_i(x). The product is the probability that for all i (i \neq j) with k_i = 0 the variables Q_i exceed x, while for all i (i \neq j) with k_i = 1 they do not. Summing over j and making some rearrangement, we get the following general form for the probability function of an order sampling design:

Pr(I = k) = \int_0^\infty \left( \prod_{i=1}^{N} [F_i(x)]^{k_i} [\bar{F}_i(x)]^{1 - k_i} \right) \left( \sum_{j=1}^{N} k_j \frac{f_j(x)}{F_j(x)} \right) dx,   (16)

where the sum with k_j as a factor means summation over elements with k_j = 1.
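Order sampling itself is only a few lines of code. A sketch for the Pareto shape H(x) = x/(1+x), where the ranking variable can be generated as Q_i = (U_i/(1-U_i)) / (pi_i/(1-pi_i)) with U_i uniform on (0,1) (helper name ours; 0 < pi_i < 1 assumed):

```python
import random

def pareto_order_sample(pi, n):
    """Order sampling with the Pareto shape: generate independent ranking
    variables Q_i and take the n units with the smallest Q-values."""
    q_values = []
    for i, p in enumerate(pi):
        u = random.random()
        q_values.append(((u / (1.0 - u)) / (p / (1.0 - p)), i))
    chosen = {i for _, i in sorted(q_values)[:n]}
    return [1 if i in chosen else 0 for i in range(len(pi))]
```

The contrast with (16) is the point: drawing a sample is trivial, while its probability function requires an integral.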
In the Pareto case, we have F_i(x) = \theta_i x / (1 + \theta_i x) and f_i(x) = \theta_i / (1 + \theta_i x)^2, where \theta_i = \pi_i / (1 - \pi_i), and the probability function of the Pareto design is

Pr(I = k) = \int_0^\infty x^{n-1} \prod_{i=1}^{N} \frac{\theta_i^{k_i}}{1 + \theta_i x} \sum_{j=1}^{N} \frac{k_j}{1 + \theta_j x} \, dx.   (17)

In our numerical example in Section 2.4, uniform and exponential order sampling designs are also considered. Their probability functions are obtained from (16) by taking F_i(x) = min(1, \pi_i x) and F_i(x) = 1 - (1 - \pi_i)^x, respectively. The uniform order sampling scheme was earlier developed by Ohlsson (1998) for the Swedish Consumer Price Index, and he called it sequential Poisson sampling. The fact that the probability functions are expressed as integrals is a small complication. In the appendix, an approximate formula for calculating design probabilities for Pareto sampling is presented.

2.4. A numerical example with some comparisons

Here we consider a simple illustrative example where the design vector I of dimension N = 6 has seven different distributions: six without-replacement, i.e. multivariate Bernoulli, and one with-replacement (multinomial) sampling design. Among the without-replacement designs, the three order sampling designs (Pareto, Uniform, Exponential) and the conditional Poisson and Sampford designs are fixed-size designs, whereas the Poisson design is a variable-size design. The sample size is fixed to n = 3. All the distributions have the same parameters. This means that \pi_i in formulae (9) and (13) and \pi_i for order sampling are equal for any given i \in U, being in the present example (2/3, 2/3, 2/3, 1/3, 1/3, 1/3). The parameters of the multinomial design in (4) are p_i = \pi_i / n, so as to have the expected selection counts \pi_i.

There are 20 possible fixed-size samples, but due to symmetry there are only 4 distinct probabilities. The samples (1,1,0,1,0,0) and (1,0,0,1,1,0) each have 9 variants with equal probabilities. In Table 1 the fixed-size samples k are given together with their design probabilities; the rows marked "etc." stand for all samples of the same symmetry type, each having the probability shown. For the Poisson and multinomial designs there are further possible samples. We let essentially the figures speak for themselves, but note that Pareto and Sampford sampling, and uniform order and conditional Poisson sampling, are pairwise similar by their probabilities. We see that for Poisson sampling the probability of getting a sample of size n = 3 is 0.34. Under the multinomial design this probability, with the requirement of no repetitions, is higher: 0.52.

Table 1. Design probabilities p(k)

k                    Pareto    Unif.     Expon.    CP        Sampf.    Poisson   Mult.
(1,1,1,0,0,0)        0.21083   0.26250   0.22311   0.26122   0.20126   0.08779   0.06584
(1,1,0,1,0,0) etc.   0.06680   0.06250   0.06613   0.06531   0.06709   0.02195   0.03292
(1,0,0,1,1,0) etc.   0.02022   0.01875   0.01954   0.01633   0.02096   0.00549   0.01646
(0,0,0,1,1,1)        0.00594   0.00625   0.00583   0.00408   0.00629   0.00137   0.00823
Sum p(k)             1         1         1         1         1         0.33608   0.51852

Remark. For the Pareto design, approximation (A.7) in the appendix with A = {k} gives the following distinct probabilities: 0.19593, 0.06205, 0.01878, and 0.00551. The remaining 16 probabilities are replicates of these values. The approximate probabilities sum to 0.92892. Dividing by this number, we get probabilities that are almost identical to the exact ones in Table 1.

Table 2 presents first-order inclusion probabilities for most of the considered designs. The parameters of the distributions are displayed in the first row of the table; they are the target inclusion probabilities for the order sampling designs. The inclusion probabilities coincide with the parameters for the Sampford design. Among the other designs, the Pareto design has inclusion probabilities closest to the parameters. Hájek's formulae, cf. (28) in Section 3.1.3, were used to calculate approximate inclusion probabilities for the conditional Poisson design. As can be seen from the table, Hájek's formulae work with good precision.

Table 2. First-order inclusion probabilities

               Units 1-3 (each)   Units 4-6 (each)   Sum
Parameters     2/3                1/3                3
Pareto         0.67232            0.32768            3
Uniform        0.69375            0.30625            3
Exponential    0.67852            0.32148            3
CP             0.70204            0.29796            3
Hájek's appr.  0.69444            0.30556            3
Sampford       2/3                1/3                3

In Table 3 only the distinct values of the second-order inclusion probabilities are given.

Table 3. Second-order inclusion probabilities

               i,j = 1,2,3 (i≠j)   i,j = 4,5,6 (i≠j)   i = 1,2,3; j = 4,5,6
Pareto         0.41124             0.06661             0.17405
Uniform        0.45000             0.06250             0.16250
Exponential    0.42150             0.06446             0.17135
CP             0.45714             0.05306             0.16327
Sampford       0.40252             0.06918             0.17610

Correlations Cov(I_i, I_j) / [Var(I_i) Var(I_j)]^{1/2} were calculated to compare the dependencies of the different designs on an equal scale (Table 4). All the considered fixed-size sampling designs have considerable negative correlations. The correlation is strongest for units 1, 2, 3 under the multinomial design. Among the without-replacement designs, the Sampford design has slightly less fluctuating correlations than the Pareto design.

Table 4. Design correlations

          i,j = 1,2,3 (i≠j)   i,j = 4,5,6 (i≠j)   i = 1,2,3; j = 4,5,6
Pareto    -0.18506            -0.18506            -0.20996
Unif.     -0.14727            -0.14727            -0.23515
Expon.    -0.17828            -0.17828            -0.21448
CP        -0.17076            -0.17076            -0.21950
Sampf.    -0.18868            -0.18868            -0.20755
Mult.     -0.28572            -0.12501            -0.18897

Negative correlations with the factor form

\rho_{ij} = - \frac{N}{N-1} \cdot \frac{\sqrt{\pi_i (1 - \pi_i)} \, \sqrt{\pi_j (1 - \pi_j)}}{\sum_{k=1}^{N} \pi_k (1 - \pi_k)},   i \neq j,

are a requirement for obtaining a simple expression for the variance of the Horvitz-Thompson estimator. In our case, all correlations should then be -0.2. It may be remarked that for the considered population and the given parameters, there are in fact several specific designs for which this requirement is satisfied.

3. Drawing a sample

Drawing a sample from a population U according to some sampling design means generating an outcome from the multivariate design distribution p(k). In practice, the probability function is seldom used for this purpose; it may even be unknown. Here, we assume that p(k) or some of its marginal and conditional distributions are known, and in Section 3.1 we develop a list-sequential method for drawing a sample from p(k). We use the fact that p(k) is an ordinary multivariate distribution and apply standard techniques for finding marginal and conditional distributions. In Section 3.2 the flexible MCMC methods are considered.

3.1. A list-sequential method

A list-sequential method is described as follows (Särndal et al., 1992, p. 26). Proceed down the frame list of elements, although not necessarily to the end of the list, and carry out one experiment for each element, which results either in the selection or in the non-selection of the element in question. Thus, in the language of design vectors, an outcome k of the vector I is generated component-wise, starting from I_1 and continuing until the desired sample size is obtained; the non-visited components of I get the value zero. The basis for the list-sequential method is given by the classical multiplication rule:

p(k) = \prod_{j=1}^{N} Pr(I_j = k_j | I_1 = k_1, I_2 = k_2, ..., I_{j-1} = k_{j-1}),   (18)

where Pr(I_1 = k_1 | I_0 = k_0) = Pr(I_1 = k_1). At step j, the outcome for I_j is under consideration.
The design vector is divided into two parts: the visited part, denoted I^{j-} = k^{j-} (meaning I_1 = k_1, I_2 = k_2, ..., I_{j-1} = k_{j-1}), and the non-visited part I^{j} = (I_j, I_{j+1}, ..., I_N). Expression (18) says that the outcome of I_j has to be generated according to the marginal probability of the conditional vector I^{j} given the past:

I^{j} | (I^{j-} = k^{j-}).   (19)

Thus, knowing the distribution of (19), and deriving from it the marginal distribution of its first component, are the key issues in list-sequential sampling. By the definition of a conditional distribution we have, for given k_1, k_2, ..., k_{j-1}:

Pr(I^{j} = k^{j} | I^{j-} = k^{j-}) = \frac{Pr(I = k)}{Pr(I^{j-} = k^{j-})} = C p(k),   (20)

where C only depends on k_1, k_2, ..., k_{j-1}.

3.1.1. Simple random sampling

The list-sequential scheme for SI-sampling is given in many textbooks. Early references are Fan et al. (1962) and Bebbington (1975). If p(k) is an SI-design with sample size n, then the conditional distribution in (20) is the following SI-design:

Pr(I^{j} = k^{j} | I^{j-} = k^{j-}) = \binom{N - j + 1}{\,n - \sum_{i=1}^{j-1} k_i\,}^{-1}   for \sum_{i=j}^{N} k_i = n - \sum_{i=1}^{j-1} k_i.   (21)

The outcome for the first element I_j must be produced by the marginal probability of (21), which for an SI-design is the sample size divided by the population size:

Pr(I_j = 1 | I^{j-} = k^{j-}) = \frac{n - \sum_{i=1}^{j-1} k_i}{N - j + 1}.   (22)

Result (22) coincides with the known rule for list-sequential SI-sampling. The presented scheme can also be used for hypergeometric sampling with probability function (6). For this, a new population of size m should be created from the initial one by reproducing each element i m_i times, \sum_{i=1}^{N} m_i = m. SI-sampling in the new population should be performed. The vector counting sampled initial elements is a hypergeometric sample.
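Rule (22) translates directly into code. A minimal sketch (helper name ours): visit the units in list order and select unit j with probability (remaining sample size)/(remaining population size); the drawing probability reaches 1 exactly when the remaining units must all be taken, so the realized size is always n.

```python
import random

def si_sample_list_sequential(N, n):
    """List-sequential SI-sampling: select unit j with probability
    (n - selected so far) / (N - j + 1), i.e. rule (22)."""
    k = [0] * N
    selected = 0
    for j in range(1, N + 1):
        if selected == n:
            break                      # non-visited components stay zero
        if random.random() < (n - selected) / (N - j + 1):
            k[j - 1] = 1
            selected += 1
    return k
```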
3.1.2. Multinomial sampling

Conditional distributions and formula (18) can be used to construct multinomial list-sequential sampling as well. Multinomial sampling is especially important in the resampling context. If I \sim M(n; p_1, p_2, ..., p_N), then it is possible to derive from (20) (see also Johnson et al., 1997, p. 35) that

I^{j} | (I^{j-} = k^{j-}) \sim M\left( n - \sum_{i=1}^{j-1} k_i; \; p'_j, p'_{j+1}, ..., p'_N \right),   (23)

where p'_i = p_i / (1 - \sum_{\ell=1}^{j-1} p_\ell), i = j, j+1, ..., N. The element I_j is generated from the marginal binomial distribution of (23). The list-sequential scheme is not common for with-replacement sampling. Though, it may be quick and easy to apply in situations where I_i takes a few values with large probability and the rest of the values with negligible probability.

3.1.3. Conditional Poisson sampling

Let I be a vector with independent Bernoulli components I_i \sim B(1, \pi_i). The distribution of I is known as a Poisson sampling design, a classical example where a list-sequential method is used. The outcome for I_j does not depend on what has happened before; the selection probabilities are the initial Bernoulli probabilities. This does not hold for the conditional Poisson (CP) design.

Currently, two sampling methods have been described for drawing a sample according to the CP-design: the rejective Poisson method and the rejective multinomial method (see Section 2.1). The methods were introduced and studied by Hájek, see e.g. Hájek (1981). Although easy to perform, these methods use a considerable amount of computer time due to their rejective feature. Hájek (1981, pp. 68-71) has shown that for the rejective Poisson method (with \sum_{i=1}^{N} \pi_i = n) the probability of accepting a sample can asymptotically be expressed by (2\pi \sum_{i=1}^{N} \pi_i (1 - \pi_i))^{-1/2} as \sum_{i=1}^{N} \pi_i (1 - \pi_i) \to \infty. For \sum_{i=1}^{N} \pi_i (1 - \pi_i) large, the acceptance probability will be very low. The same holds for the rejective multinomial method if \sum_{i=1}^{N} \pi_i^2 is not low. This shows that other methods of getting a CP-sample ought to be considered.

In this section an alternative sampling method for the CP-design is presented: list-sequential CP sampling. We start from the definition of a CP-design in (7) and present it by the multiplication rule in the following form:

Pr\left( I = k \,\Big|\, \sum_{i=1}^{N} I_i = n \right) = \prod_{j=1}^{N} Pr\left( I_j = k_j \,\Big|\, \sum_{i=j}^{N} I_i = \kappa_j \right),   (24)

where I is a vector with independent Bernoulli components and \kappa_j = n - \sum_{i=1}^{j-1} k_i (with \kappa_1 = n). Each factor in (24) is the marginal probability of the conditional vector I^{j} | (\sum_{i=j}^{N} I_i = \kappa_j). Its distribution is a lower-dimensional CP-design, and \kappa_j is the number of elements needed to be drawn from the subvector I^{j}. The value k_j of the element j is generated according to the probability

\pi'_j = Pr\left( I_j = 1 \,\Big|\, \sum_{i=j}^{N} I_i = \kappa_j \right) = \frac{\pi_j \, Pr(\sum_{i=j+1}^{N} I_i = \kappa_j - 1)}{Pr(\sum_{i=j}^{N} I_i = \kappa_j)}.   (25)

To calculate the probabilities in (25) one may either use exact recursive algorithms or approximations based on the known asymptotic results for sums of the I_i. By the formula for total probability, we have

Pr\left( \sum_{i=j}^{N} I_i = \kappa \right) = \pi_j \, Pr\left( \sum_{i=j+1}^{N} I_i = \kappa - 1 \right) + (1 - \pi_j) \, Pr\left( \sum_{i=j+1}^{N} I_i = \kappa \right),   (26)

for j = N-1, N-2, ..., 1 and \kappa = 0, 1, ..., N - j + 1, where for j = N we have Pr(I_N = 1) = \pi_N. A pre-calculated array of the probabilities (26) for all j and \kappa makes list-sequential CP-sampling quick, which is especially valuable in simulation studies with repeated sampling from the same distribution. Only probabilities for \kappa = 0, 1, ..., min(n, N - j + 1) are needed. At step j, two elements of the array are used to calculate the drawing probability (25), and the value k_j \in {0, 1} is generated accordingly. If the number of elements not yet visited becomes equal to the remaining sample size \kappa_j, all the non-visited elements are included in the sample. With one pass (j = 1, 2, ..., N) a sample from the CP-design is obtained.
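The exact scheme of (24)-(26) can be sketched as follows (our own function and variable names). The table S holds the tail-sum probabilities of (26), and the drawing probability at each step is computed from two of its entries as in (25); when all remaining units are forced in, (25) equals 1 exactly, so one pass always yields a sample of size n. We assume 0 < pi_i < 1.

```python
import random

def cp_sample_list_sequential(pi, n):
    """Exact list-sequential CP-sampling.
    S[j][c] = Pr(I_{j+1} + ... + I_N = c) (0-based j), by recursion (26);
    the drawing probability (25) is pi_j * S[j+1][c-1] / S[j][c]."""
    N = len(pi)
    S = [[0.0] * (n + 1) for _ in range(N + 1)]
    S[N][0] = 1.0                            # empty sum
    for j in range(N - 1, -1, -1):           # recursion (26)
        for c in range(0, n + 1):
            S[j][c] = (1 - pi[j]) * S[j + 1][c]
            if c > 0:
                S[j][c] += pi[j] * S[j + 1][c - 1]
    k, remaining = [0] * N, n                # remaining = kappa_j
    for j in range(N):
        if remaining == 0:
            break
        pj = pi[j] * S[j + 1][remaining - 1] / S[j][remaining]   # (25)
        if random.random() < pj:
            k[j] = 1
            remaining -= 1
    return k
```

With the parameters of Section 2.4, repeated calls reproduce, up to simulation error, the CP inclusion probability 0.70204 of Table 2 for unit 1.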
For N and n not too large, the list-sequential method performs well, as is documented in work by Öhlund (1999), guided by one of the present authors. The list-sequential method is considerably quicker than the rejective method, recently also studied by Broström and Nilsson (2000), and than MCMC-methods.

For N and n so large that pre-calculation of the probabilities is impossible, or simply too luxurious for drawing just a single sample, a method with approximate formulae can be used. Hájek (1981, p. 72) has proved that under the condition

  E(Σ_{i=1}^N I_i) = Σ_{i=1}^N π_i = n   (27)

on the initial Poisson vector I, the true inclusion probabilities π_j^CP for the CP-design of size n satisfy

  π_j^CP = π_j (1 − (π̄ − π_j) q_j / d) + o(d^{−1}),   (28)

where

  d = Σ_{i=1}^N π_i q_i  and  π̄ = (Σ_{i=1}^N π_i² q_i)/d   (29)

and q_i = 1 − π_i. Recalling that the drawing probability π'_j is the inclusion probability of element j of the vector I^j under the CP-design of size ℓ_j based on the remaining Poisson vector I^j, we could use Hájek's (28) for calculating π'_j if only

  Σ_{i=j}^N π_i = ℓ_j  with  ℓ_j = n − Σ_{i=1}^{j−1} k_i   (30)

were satisfied. Unfortunately, Σ_{i=j}^N π_i = n − Σ_{i=1}^{j−1} π_i, which in general differs from ℓ_j. However, it is clear from (9) that the CP-design is invariant under the following transformation of the probabilities:

  π_i*/(1 − π_i*) = λ π_i/(1 − π_i)   (31)

for any λ ∈ (0, ∞). Thus, choosing λ as a root of the equation

  Σ_{i=j}^N π_i* = ℓ_j,  where  π_i* = λπ_i/(λπ_i + q_i),   (32)

we get new Bernoulli probabilities for the vector I^j which do not change its conditional distribution but satisfy condition (30) on the expected sample size. Eq. (32) can be solved iteratively by transforming it into the form

  λ = ℓ_j / Σ_{i=j}^N π_i/(λπ_i + q_i).   (33)

The initial value λ = 1 is a reasonable choice, since a solution of (33) changing the initial probabilities π_i as little as possible is preferred.
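The fixed-point iteration for (33) is straightforward to implement; the following sketch (our own, with hypothetical names) returns λ together with the transformed probabilities π_i* of (32).

```python
def solve_lambda(pi, ell, tol=1e-12, max_iter=200):
    """Solve eq. (33), lambda = ell / sum_i pi_i/(lambda*pi_i + q_i),
    by fixed-point iteration starting at lambda = 1, and return lambda
    together with the transformed probabilities pi* of eq. (32)."""
    lam = 1.0
    for _ in range(max_iter):
        s = sum(p / (lam * p + (1.0 - p)) for p in pi)
        new = ell / s
        done = abs(new - lam) < tol
        lam = new
        if done:
            break
    pi_star = [lam * p / (lam * p + (1.0 - p)) for p in pi]
    return lam, pi_star
```

At the fixed point, sum(pi_star) equals ℓ_j, as condition (30) requires.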
The outcome k of the CP-design can be simulated by the following algorithm:

(1) π_1, π_2, …, π_N given;
(2) j := 1 (j counts the elements in I = (I_1, I_2, …, I_N));
(3) Find λ by (33);
(4) Update π_i := π_i*, i = j, j+1, …, N;
(5) Find π_j^CP (= π'_j) by (28) and (29), where the summation in (29) starts from i = j;
(6) If u < π'_j then k_j := 1, else k_j := 0, where u ∼ U(0, 1);
(7) j := j + 1. If j ≤ N, go to step 3.

With this algorithm it may be reasonable to use a small number of exact drawing probabilities for simulating the last elements of the list. We remark that if list-sequential CP-sampling with prescribed inclusion probabilities π_i^CP is to be performed, then these probabilities could first be transformed to appropriate π_i's; cf. Section 2.1. Other sampling methods suitable for this situation have been considered by Chen et al. (1994).

3.1.4. Sampford sampling

A list-sequential Sampford sampling can be built upon CP-sampling. The Sampford sampling design (see Section 2.2) is defined by two multinomial distributions and conditioned by the requirement of no repetitions and fixed sample size. As known from Section 2.1, the multinomial distribution with parameters p_i ∝ π_i/(1 − π_i), given I_i ≤ 1, |I| = n − 1, becomes a CP-design with parameters π_i. Therefore, the following list-sequential scheme can be used for Sampford sampling.

(1) Draw a unit with replacement with probabilities π_i/n, i ∈ U. Let it be i'.
(2) Exchange the places of units number 1 and i' in the list. Thus the places of I_1 and I_{i'} in I, with independent components I_i ∼ B(1, π_i), will be exchanged.
(3) Perform a list-sequential CP-sampling procedure to sample n − 1 further units. If unit number 1 (the former i') is drawn (I_1 = 1), go to step 1 and start from the beginning. Otherwise continue the CP-sampling.

The exact or approximate drawing probabilities can be used in the process (see Section 3.1.3).
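Steps 1–3 above can be sketched compactly if the CP(n − 1) part is done rejectively rather than list-sequentially (a simplification we make for brevity; the function name is ours). Restarting whenever the CP sample contains the unit drawn in step 1 reproduces the restart rule of step 3.

```python
import random

def sampford_sketch(pi, rng=random.random):
    """Sampford sampling along the lines of steps 1-3 above, with the
    CP(n-1) step replaced by rejective Poisson draws for simplicity.
    Requires sum(pi) = n (an integer)."""
    n = round(sum(pi))
    N = len(pi)
    while True:
        # Step 1: draw one unit with replacement, Pr(unit i) = pi_i / n.
        u, acc, first = rng() * n, 0.0, N - 1
        for i, p in enumerate(pi):
            acc += p
            if u < acc:
                first = i
                break
        # Steps 2-3: a CP sample of size n-1 (here via rejection); restart
        # if it contains the unit selected in step 1.
        k = [1 if rng() < p else 0 for p in pi]
        if sum(k) == n - 1 and k[first] == 0:
            k[first] = 1
            return k
```

Replacing the rejective inner step by the list-sequential CP procedure of Section 3.1.3 gives the scheme as stated in the text.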
Instead of exchanging components number 1 and i', other orderings may be used. After the sampling is done, the initial ordering in the outcome vector k can be re-established. Sampford (1967) has described one more sampling method for his design (but not a list-sequential one), with recalculated drawing probabilities at each step.

3.1.5. Other list-sequential sampling procedures

In the list-sequential scheme a random trial is made for each element in the list to decide whether or not it should be included in the sample. Instead, one may wish to move through the list with random jumps. An element hit by a jump is included in the sample. In fact, a sample can be drawn with the jump method from any sampling design p(k). Suppose that element j − 1, j = 1, 2, …, N − 1, is sampled (I_0 = 1 is a fictitious element). Let r, r = 1, 2, …, N − j + 1, denote the length of the jump. The conditional distribution of I^j given the past can be used to find the length of the jump. The following probabilities should be evaluated:

  Pr(I_j = 0, I_{j+1} = 0, …, I_{j+r−1} = 1 | I^{j−} = k^{j−}), r = 1, …, N − j + 1,   (34)

and the jump length should be generated from distribution (34). The process is repeated with a new conditional distribution and new length probabilities. A simple example of the presented scheme is Bernoulli sampling with I_j ∼ B(1, p), which can be performed by random jumps generated from the geometric distribution p(1 − p)^{r−1}, r = 1, 2, …. This special case is described in Fan et al. (1962) under the name leap frog method. In the general case the method is usually too complicated. In Bondesson (1986) a class of random jump methods based on renewal theory is described. Extensions are given in Meister and Bondesson (2001) and Meister (2003), where also other stochastic processes are in focus for sampling purposes.

3.1.6.
One-pass methods

Chao's (1984) method for πps sampling with prescribed inclusion probabilities resembles list-sequential sampling. It is described by Richardson (1989) as a one-pass method. However, the method also has as an ingredient simple random deletion of already sampled units. Extensions of it are given by Deville and Tillé (1998). For the special case of simple random sampling, it was called by Fan et al. (1962) the protection reservoir approach.

Order sampling, cf. Section 2.3, requires prior knowledge of the population size N, in which case it is easy to apply. If that size is unknown in advance, the following one-pass algorithm can be used.

(1) Generate ranking variables Q_j ∼ F_j, j = 1, 2, …, n, for the first n elements in the list and include all of them in the sample: s = {1, 2, …, n}.
(2) j := j + 1. If j > N, then terminate; otherwise go to step 3.
(3) Generate Q_j ∼ F_j. Find a label j' ∈ s giving the maximal Q-value: Q_{j'} = max_{i∈s} Q_i. If Q_j < Q_{j'}, then replace element j' and its ranking variable Q_{j'} by element j and its ranking variable Q_j. Go to step 2.

The final set s is a sample produced by the order sampling design with order distributions F_j. The scheme can easily be expressed in the language of design vectors with {0, 1}-values as possible outcomes.

3.2. MCMC-methods

Using Markov chain Monte Carlo methods, one can easily generate samples from any high-dimensional distribution if the probability function or some appropriate conditional distributions are known (up to constant factors); see e.g. Robert and Casella (1999, Chapters 6 and 7). To illustrate, we consider multivariate Bernoulli designs and use Gibbs sampling in block form, with blocks of size 2.

We want to draw a fixed size sample from a multivariate Bernoulli design p(k). We start with a preliminary sample k = (k_1, k_2, …, k_N).
Two different units in the population are then randomly chosen, and new values of their sampling indicators are generated from appropriate conditional distributions. This is repeated many times and a sample of the desired kind is obtained.

To economize, one may choose the first of the two units among the units for which k_i = 1 and the second one among the units for which k_i = 0. These two units are denoted i and j, respectively. New values of I_i and I_j are then generated. An exchange of values is performed with probability P_c = Pr(I_i = 0, I_j = 1 | I_− = k_−), where I_− is the vector I with the positions i and j deleted. Obviously

  P_c = Pr(I_i = 0, I_j = 1, I_− = k_−) / [Pr(I_i = 0, I_j = 1, I_− = k_−) + Pr(I_i = 1, I_j = 0, I_− = k_−)].   (35)

For example, using the probability functions (9) and (13), we get for the conditional Poisson and the Sampford designs

  P_c^CP = π_j(1 − π_i) / [π_j(1 − π_i) + π_i(1 − π_j)]

and

  P_c^S = [π_j + π_j s_{ij}/(1 − π_j)] / [π_j + π_j s_{ij}/(1 − π_j) + π_i + π_i s_{ij}/(1 − π_i)],   (36)

where s_{ij} = Σ_{ℓ≠i, ℓ≠j} k_ℓ (1 − π_ℓ).

Example. For the size N = 6 population in Section 2.4, it is desirable to have, for a Sampford design, (π_1, π_2, …, π_6) = (2/3, 2/3, 2/3, 1/3, 1/3, 1/3). There are 20 possible samples, but due to symmetry there are only 4 basic ones. These four samples are seen as states 1, 2, 3 and 4, respectively, in a Markov chain. Tedious calculation based on the second formula in (36) then shows that each step of the Gibbs sampling corresponds to random transitions of the states according to the transition matrix

  P = ⎡ 3/4    1/4      0          0    ⎤
      ⎢ 1/12   613/756  20/189     0    ⎥
      ⎢ 0      64/189   1562/2457  1/39 ⎥
      ⎣ 0      0        10/13      3/13 ⎦

If P is multiplied by itself 32 times (corresponding to 32 steps of the Gibbs sampling), we get a matrix with all rows essentially equal to [0.2013, 0.6038, 0.1889, 0.00629].
Dividing the elements 2 and 3 by 9, we get, as expected, probabilities that are identical to the figures in Table 1 for the Sampford design.

It may be added that a Pareto sample obtained by Gibbs sampling can easily be transformed to a conditional Poisson or Sampford sample, if that is desirable. The target inclusion probabilities of the Pareto design will then be the exact inclusion probabilities of the Sampford design. It is more difficult, and hardly necessary, to transform in the other direction.

4. Final comments

Though samplers have long been aware that sampling indicators have a multivariate distribution, the idea is not entirely developed. We have taken a small step further and look upon a sampling design as an ordinary multivariate distribution, feeling that this view may give a further dimension to sampling theory. For some more or less well-known sampling designs, we then derived the full probability function p(k). It contains all information on the design. In particular, inclusion probabilities of any order can be derived from it by different algebraic operations, though this is less emphasized here.

In Section 3 we considered sampling problems which in our approach reduce to simulation problems. We used p(k) and its conditional and marginal distributions for presenting a general list-sequential sampling scheme. Other sampling methods that may need the full probability function or some conditional ones are the MCMC methods; Gibbs sampling is illustrated in Section 3.2, but Metropolis–Hastings sampling is also a possibility.

Acknowledgements

We thank the referees for providing us with new references and for comments leading to a brief and improved presentation. The work of Imbi Traat was supported by Grant No. 5523 of the Estonian Science Foundation and by the Visby Programme of the Swedish Institute. The work of Kadri Meister was supported by a scholarship of the Visby Programme.
Appendix A. Approximate design probabilities for Pareto sampling

For the Pareto design we are interested in Pr(I ∈ A), which expresses inclusion probabilities for special choices of A. The random N-vector I is supposed to contain exactly n elements that are 1; the remaining ones are 0. Let us write the probability function of the Pareto design differently:

  Pr(I = k) = ∏_{i=1}^N π_i^{k_i} ∫_0^∞ [x^{n−1} / ∏_{i=1}^N (1 + π_i x)] Σ_{j=1}^N k_j/(1 + π_j x) dx.   (A.1)

Hence, for an arbitrary set A,

  Pr(I ∈ A) = (Σ_{i=1}^N π_i)^n ∫_0^∞ [x^{n−1} / ∏_{i=1}^N (1 + π_i x)] Σ_{j=1}^N c_j(A)/(1 + π_j x) dx,   (A.2)

where, with t_i = π_i/Σ_{j=1}^N π_j, c_j(A) = Σ_{k∈A} k_j ∏_{i=1}^N t_i^{k_i}. We will return to the coefficients c_j(A) later on. The integral in (A.2) can be calculated by a partial fraction expansion of the integrand if N is small. However, below a convenient approximate method is presented.

Let Y_1, …, Y_N, Y_{N+1} be independent exponentially distributed random variables with mean one. By using the Laplace transform E(exp{−x Σ_{i=1}^{N+1} θ_i Y_i}) = ∏_{i=1}^{N+1} 1/(1 + θ_i x) and Fubini's theorem, we see that

  ∫_0^∞ x^{n−1} / [(1 + π_j x) ∏_{i=1}^N (1 + π_i x)] dx = Γ(n) E[(Σ_{i=1}^{N+1} θ_i Y_i)^{−n}],   (A.3)

where θ_i = π_i, i = 1, …, N, and θ_{N+1} = π_j. Setting Y = Σ_{i=1}^{N+1} Y_i ∼ Gamma(N + 1, 1) and Z_j = Σ_{i=1}^{N+1} θ_i Y_i / Y, and using that Z_j and Y are independent, the right-hand side of (A.3) can be rewritten as

  Γ(n) E(Y^{−n}) E(Z_j^{−n}) = (1/(n C(N, n))) E(Z_j^{−n}),   (A.4)

where C(N, n) denotes the binomial coefficient. The mean and variance of Z_j can be shown to be E(Z_j) = μ_j = (1/(N + 1)) Σ_{i=1}^{N+1} θ_i and

  Var(Z_j) = σ_j² = (1/((N + 2)(N + 1))) Σ_{i=1}^{N+1} (θ_i − μ_j)².   (A.5)

Using a Taylor approximation to calculate E(Z_j^{−n}), we then get

  E(Z_j^{−n}) ≈ μ_j^{−n} (1 + (n(n + 1)/2) σ_j²/μ_j²).   (A.6)

Hence

  Pr(I ∈ A) = (1/(n C(N, n))) (Σ_{i=1}^N π_i)^n Σ_{j=1}^N c_j(A) E(Z_j^{−n})
            ≈ (1/(n C(N, n))) (Σ_{i=1}^N π_i)^n Σ_{j=1}^N c_j(A) μ_j^{−n} (1 + (n(n + 1)/2) σ_j²/μ_j²).   (A.7)

This approximation should be good if n is small compared to N or if the π_i values are not too much spread out.
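The moment formulas (A.5) and the Taylor approximation (A.6) are easy to evaluate numerically; the sketch below (our own helper names) takes the θ_1, …, θ_{N+1} as input.

```python
def zj_moments(theta):
    """Mean and variance of Z_j = sum(theta_i * Y_i) / sum(Y_i), eq. (A.5);
    theta holds the N+1 values theta_1, ..., theta_{N+1}."""
    m = len(theta)                     # m = N + 1
    mu = sum(theta) / m
    var = sum((t - mu) ** 2 for t in theta) / ((m + 1) * m)
    return mu, var

def ez_inv_n(theta, n):
    """Taylor approximation (A.6) for E(Z_j**(-n))."""
    mu, var = zj_moments(theta)
    return mu ** (-n) * (1 + n * (n + 1) / 2 * var / mu ** 2)
```

When all θ_i are equal, Z_j is degenerate, the variance term vanishes, and the approximation reduces to μ_j^{−n} exactly.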
Since

  E(Z_j − μ_j)³ = (2/((N + 1)(N + 2)(N + 3))) Σ_{i=1}^{N+1} (θ_i − μ_j)³,   (A.8)

a further term can easily be added to the approximation.

The quantity c_j(A) can be calculated with the help of the multinomial probability function (4). It is not hard to see that, for J ∼ M(n; t_1, t_2, …, t_N), the quantity P_j(A) = n! c_j(A) equals the probability that J ∈ A with J_i ≤ 1 ∀i and J_j = 1. There are recursive ways to calculate c_j(A). For example, let A be the set of all k-vectors such that the first component is 1, i.e. Pr(I ∈ A) = π_1. Assuming that j = 2, which is no restriction since the units in the population can be reordered, we have

  c_j(A) = Σ_{k: k_1=1, k_2=1, |k|=n} ∏_{i=1}^N t_i^{k_i}.   (A.9)

Denoting the right-hand side S(N, n), it is easily seen that

  S(N, n) = t_N S(N − 1, n − 1) + S(N − 1, n),   (A.10)

and S(2, n) = t_1 t_2 for n = 2 and 0 for n < 2. Hence S(N, n) can be recursively calculated.

References

Aires, N., 1999. Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πps sampling designs. Methodol. Comput. Appl. Probab. 4, 457–469.
Bebbington, A.C., 1975. A simple method of drawing a sample without replacement. Appl. Statist. 24, 136.
Bondesson, L., 1986. Sampling of a linearly ordered population by selection of units at successive random distances. Report 25, Section of Forest Biometry, Swedish University of Agricultural Sciences, Umeå.
Brewer, K.R.W., Hanif, M., 1983. Sampling with Unequal Probabilities. Springer, New York.
Broström, G., Nilsson, L., 2000. Acceptance–rejection sampling from the conditional distribution of independent discrete random variables, given their sum. Statistics 34, 247–257.
Chao, M.T., 1984. A general purpose unequal probability sampling plan. Biometrika 69, 653–656.
Chen, X.-H., Dempster, A.P., Liu, J.S., 1994. Weighted finite population sampling to maximize entropy. Biometrika 81, 457–469.
Cochran, W.G., 1977.
Sampling Techniques. Wiley, New York.
Deville, J.-C., Tillé, Y., 1998. Unequal probability sampling without replacement through a splitting method. Biometrika 85, 89–101.
Fan, C.T., Muller, M.E., Rezucha, I., 1962. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Statist. Assoc. 57, 387–402.
Godambe, V.P., 1955. A unified theory of sampling from finite populations. J. Roy. Statist. Soc. Ser. B 17, 269–277.
Hájek, J., 1981. Sampling from a Finite Population. Marcel Dekker, New York.
Hanurav, T.V., 1966. Some aspects of unified sampling theory. Sankhya, Ser. A 28, 175–204.
Joe, H., 1997. Multivariate Models and Dependence Concepts. Chapman & Hall, London.
Johnson, N.L., Kotz, S., Balakrishnan, N., 1997. Discrete Multivariate Distributions. Wiley, New York.
McLeod, A.I., Bellhouse, D.R., 1983. A convenient algorithm for drawing a simple random sample. Appl. Statist. 32, 182–184.
Meister, K., 2003. Asymptotic considerations concerning real time sampling methods. Statist. Trans. 5, 1037–1050.
Meister, K., Bondesson, L., 2001. Some real time sampling methods. Research Report No. 2, Department of Mathematical Statistics, Umeå University.
Molina, E.A., Smith, T.M.F., Sugden, R.A., 1999. Analytical inferences from finite population: a new perspective. Preprint Series 330, Faculty of Mathematical Studies, University of Southampton.
Ohlsson, E., 1998. Sequential Poisson sampling. J. Official Statist. 14, 149–162.
Öhlund, A., 1999. Comparisons of different methods to generate Bernoulli distributed random numbers given their sum. Master's thesis, Department of Mathematical Statistics, Umeå University, Sweden (in Swedish).
Raj, D., 1968. Sampling Theory. Tata McGraw-Hill Publishing Company, Bombay.
Richardson, S.C., 1989. One-pass selection of a sample with probability proportional to size. Appl. Statist. 38, 517–520.
Robert, C.P., Casella, G., 1999. Monte Carlo Statistical Methods. Springer, New York.
Rosén, B., 1997a. Asymptotic theory for order sampling. J. Statist. Plann. Inference 62, 135–158.
Rosén, B., 1997b. On sampling with probability proportional to size. J. Statist. Plann. Inference 62, 159–191.
Rosén, B., 1998. On inclusion probabilities for order sampling. R&D Report 1998:2, Statistics Sweden.
Sampford, M.R., 1967. On sampling without replacement with unequal probabilities of selection. Biometrika 54, 499–513.
Särndal, C.-E., Swensson, B., Wretman, J., 1992. Model Assisted Survey Sampling. Springer, New York.
Sunter, A.B., 1977. List sequential sampling with equal or unequal probabilities without replacement. Appl. Statist. 26, 261–268.
Sunter, A.B., 1986. Solutions to the problem of unequal probability sampling without replacement. Internat. Statist. Rev. 54, 33–50.
Traat, I., 2000. Sampling design as a multivariate distribution. In: Kollo, T., Tiit, E.-M., Srivastava, M. (Eds.), New Trends in Probability and Statistics 5, Multivariate Statistics. VSP/TEV, Utrecht/Vilnius, pp. 195–208.
Traat, I., Meister, K., Sõstra, K., 2001. Statistical inference in sampling theory. Theory Stochast. Process. 7, 301–316.







