
Chapter 8

Double Sampling (Two Phase Sampling)

The ratio and regression methods of estimation require knowledge of the population mean of the auxiliary variable ($\bar X$) to estimate the population mean of the study variable ($\bar Y$). If information on the auxiliary variable is not available, there are two options. One option is to collect a sample on the study variable alone and use the sample mean as an estimator of the population mean.

An alternative solution is to use a part of the budget for collecting information on the auxiliary variable, i.e., to collect a large preliminary sample in which $x_i$ alone is measured. The purpose of this sampling is to furnish a good estimate of $\bar X$. This method is appropriate when the information about $x_i$ is on file cards that have not been tabulated. After collecting a large preliminary sample of $n'$ units from the population, select a smaller sample of size $n$ from it and collect the information on $y$. These two estimates are then used to obtain an estimator of the population mean $\bar Y$. This procedure of selecting a large sample for collecting information on the auxiliary variable $x$, and then selecting a subsample from it for collecting information on the study variable $y$, is called double sampling or two phase sampling. It is useful when it is considerably cheaper and quicker to collect data on $x$ than on $y$, and there is high correlation between $x$ and $y$.

In this sampling, the randomization is done twice. First a random sample of size $n'$ is drawn from a population of size $N$, and then again a random sample of size $n$ is drawn from the first sample of size $n'$.

So the sample mean in this sampling is a function of the two phases of sampling. If SRSWOR is utilized to draw the samples at both phases, then
- the number of possible samples at the first phase, when a sample of size $n'$ is drawn from a population of size $N$, is $\binom{N}{n'} = M_0$, say;
- the number of possible samples at the second phase, when a sample of size $n$ is drawn from the first phase sample of size $n'$, is $\binom{n'}{n} = M_1$, say.
[Schematic: population of $X$ ($N$ units) $\rightarrow$ large first-phase sample ($n'$ units; $M_0$ possible samples) $\rightarrow$ small subsample ($n$ units; $M_1$ possible samples).]

Then the sample mean is a function of two variables. Let $\hat\theta$ be the statistic calculated at the second phase, taking values $\hat\theta_{ij}$, $i = 1, 2, \ldots, M_0$, $j = 1, 2, \ldots, M_1$, with $P_{ij}$ being the probability that the $i$th sample is chosen at the first phase and the $j$th sample is chosen at the second phase. Then

$$E(\hat\theta) = E_1\big[E_2(\hat\theta)\big]$$

where $E_2$ denotes the expectation over the second phase and $E_1$ denotes the expectation over the first phase. Thus

$$E(\hat\theta) = \sum_{i=1}^{M_0}\sum_{j=1}^{M_1} P_{ij}\,\hat\theta_{ij} = \sum_{i=1}^{M_0}\sum_{j=1}^{M_1} P_i\,P_{j|i}\,\hat\theta_{ij} \qquad \big(\text{using } P(A \cap B) = P(A)\,P(B \mid A)\big)$$

$$= \underbrace{\sum_{i=1}^{M_0} P_i}_{\text{1st stage}}\;\underbrace{\sum_{j=1}^{M_1} P_{j|i}\,\hat\theta_{ij}}_{\text{2nd stage}}$$
Variance of $\hat\theta$:
$$\begin{aligned}
Var(\hat\theta) &= E\big[\hat\theta - E(\hat\theta)\big]^2 \\
&= E\big[\{\hat\theta - E_2(\hat\theta)\} + \{E_2(\hat\theta) - E(\hat\theta)\}\big]^2 \\
&= E\big[\hat\theta - E_2(\hat\theta)\big]^2 + E\big[E_2(\hat\theta) - E(\hat\theta)\big]^2 + 0 \\
&= E_1E_2\big[\hat\theta - E_2(\hat\theta)\big]^2 + E_1E_2\big[\underbrace{E_2(\hat\theta) - E(\hat\theta)}_{\text{constant for } E_2}\big]^2 \\
&= E_1\big[V_2(\hat\theta)\big] + E_1\big[E_2(\hat\theta) - E_1\{E_2(\hat\theta)\}\big]^2 \\
&= E_1\big[V_2(\hat\theta)\big] + V_1\big[E_2(\hat\theta)\big]
\end{aligned}$$

Note: Two phase sampling can be extended to more than two phases depending upon the need and objective of the experiment. The various expectations can also be extended along similar lines.
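As a quick numerical illustration of the identity $E(\hat\theta) = E_1\big[E_2(\hat\theta)\big]$, the following sketch (with hypothetical population values and sizes, not part of the original notes) simulates two-phase SRSWOR and checks that the second-phase sample mean is unbiased for the population mean:

```python
import random

random.seed(1)
N, n_prime, n = 200, 40, 10
population = [random.gauss(50, 10) for _ in range(N)]  # hypothetical y values

means = []
for _ in range(20000):
    first = random.sample(population, n_prime)   # phase 1: SRSWOR of size n' from N
    second = random.sample(first, n)             # phase 2: SRSWOR of size n from the n'
    means.append(sum(second) / n)

print(sum(population) / N)        # population mean
print(sum(means) / len(means))    # Monte Carlo E(theta_hat); agrees closely with above
```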

Double sampling in ratio method of estimation


If the population mean $\bar X$ is not known, then the double sampling technique is applied. Take a large initial sample of size $n'$ by SRSWOR to estimate the population mean $\bar X$ as
$$\hat{\bar X} = \bar x' = \frac{1}{n'}\sum_{i=1}^{n'} x_i.$$
Then a second sample is a subsample of size $n$ selected from the initial sample by SRSWOR. Let $\bar y$ and $\bar x$ be the means of $y$ and $x$ based on the subsample. Then $E(\bar x') = \bar X$, $E(\bar x) = \bar X$, $E(\bar y) = \bar Y$.

The ratio estimator under double sampling now becomes
$$\hat{\bar Y}_{Rd} = \frac{\bar y}{\bar x}\,\bar x'.$$

The exact expressions for the bias and mean squared error of $\hat{\bar Y}_{Rd}$ are difficult to derive, so we find their approximate expressions using the same approach mentioned while describing the ratio method of estimation.
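As a minimal computational sketch (the data values are hypothetical and the helper name is ours), $\hat{\bar Y}_{Rd}$ can be evaluated as follows:

```python
def ratio_estimator_double(x_first, x_sub, y_sub):
    """Double-sampling ratio estimate: (y_bar / x_bar) * x_bar_prime."""
    x_bar_prime = sum(x_first) / len(x_first)   # phase-1 estimate of X-bar
    x_bar = sum(x_sub) / len(x_sub)             # subsample mean of x
    y_bar = sum(y_sub) / len(y_sub)             # subsample mean of y
    return (y_bar / x_bar) * x_bar_prime

# hypothetical data: n' = 8 first-phase x values, subsample of n = 4 (x, y) pairs
print(ratio_estimator_double([12, 15, 9, 11, 14, 10, 13, 16],
                             [12, 9, 14, 10],
                             [30, 22, 35, 26]))
```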

Let
$$\varepsilon_0 = \frac{\bar y - \bar Y}{\bar Y}, \qquad \varepsilon_1 = \frac{\bar x - \bar X}{\bar X}, \qquad \varepsilon_2 = \frac{\bar x' - \bar X}{\bar X}$$
$$E(\varepsilon_0) = E(\varepsilon_1) = E(\varepsilon_2) = 0$$
$$E(\varepsilon_1^2) = \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2$$
$$\begin{aligned}
E(\varepsilon_1\varepsilon_2) &= \frac{1}{\bar X^2}\,E\big[(\bar x - \bar X)(\bar x' - \bar X)\big] \\
&= \frac{1}{\bar X^2}\,E_1\Big[E_2\big\{(\bar x - \bar X)(\bar x' - \bar X) \mid n'\big\}\Big] \\
&= \frac{1}{\bar X^2}\,E_1\big[(\bar x' - \bar X)^2\big] \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_x^2}{\bar X^2} \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)C_x^2 \\
&= E(\varepsilon_2^2).
\end{aligned}$$

E ( 0 2 )  Cov( y , x ')
 Cov  E ( y | n '), E ( x ' | n ')   E Cov( y , x ') | n '
 Cov Y , X   E Cov( y ', x ') 
 Cov  ( y ', x '
1 1S
    xy
 n ' N  XY
1 1 S S
   x y
 n' N  X Y
1 1
     CxC y
 n' N 
where y ' is the sample mean of y ' s based on the sample size n '.

$$E(\varepsilon_0\varepsilon_1) = \frac{1}{\bar X\bar Y}\,Cov(\bar y, \bar x) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_{xy}}{\bar X\bar Y} = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{\rho S_x S_y}{\bar X\bar Y} = \left(\frac{1}{n} - \frac{1}{N}\right)\rho\,C_x C_y$$

$$\begin{aligned}
E(\varepsilon_0^2) &= \frac{1}{\bar Y^2}\,Var(\bar y) \\
&= \frac{1}{\bar Y^2}\Big[V_1\big\{E_2(\bar y \mid n')\big\} + E_1\big\{V_2(\bar y \mid n')\big\}\Big] \\
&= \frac{1}{\bar Y^2}\left[V_1(\bar y') + E_1\left\{\left(\frac{1}{n} - \frac{1}{n'}\right)s_y'^2\right\}\right] \\
&= \frac{1}{\bar Y^2}\left[\left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \left(\frac{1}{n} - \frac{1}{n'}\right)S_y^2\right] \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_y^2}{\bar Y^2} \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)C_y^2
\end{aligned}$$
where $s_y'^2$ is the mean sum of squares of $y$ based on the initial sample of size $n'$.

Alternatively,
$$E(\varepsilon_1\varepsilon_2) = \frac{1}{\bar X^2}\,Cov(\bar x, \bar x') = \frac{1}{\bar X^2}\Big[Cov\big\{E(\bar x \mid n'),\, E(\bar x' \mid n')\big\} + 0\Big] = \frac{1}{\bar X^2}\,Var(\bar x')$$
where $Var(\bar x')$ is the variance of the mean of $x$ based on the initial sample of size $n'$.

Estimation error of $\hat{\bar Y}_{Rd}$

Write $\hat{\bar Y}_{Rd}$ as
$$\begin{aligned}
\hat{\bar Y}_{Rd} &= \frac{(1 + \varepsilon_0)\bar Y}{(1 + \varepsilon_1)\bar X}\,(1 + \varepsilon_2)\bar X \\
&= \bar Y(1 + \varepsilon_0)(1 + \varepsilon_2)(1 + \varepsilon_1)^{-1} \\
&= \bar Y(1 + \varepsilon_0)(1 + \varepsilon_2)(1 - \varepsilon_1 + \varepsilon_1^2 - \ldots) \\
&\approx \bar Y(1 + \varepsilon_0 + \varepsilon_2 + \varepsilon_0\varepsilon_2 - \varepsilon_1 - \varepsilon_0\varepsilon_1 - \varepsilon_1\varepsilon_2 + \varepsilon_1^2)
\end{aligned}$$
up to terms of order two. Terms of degree greater than two are assumed to be negligible.
Bias of $\hat{\bar Y}_{Rd}$

$$E(\hat{\bar Y}_{Rd}) = \bar Y\big[1 + 0 + 0 + E(\varepsilon_0\varepsilon_2) - 0 - E(\varepsilon_0\varepsilon_1) - E(\varepsilon_1\varepsilon_2) + E(\varepsilon_1^2)\big]$$
$$\begin{aligned}
Bias(\hat{\bar Y}_{Rd}) &= E(\hat{\bar Y}_{Rd}) - \bar Y \\
&= \bar Y\big[E(\varepsilon_0\varepsilon_2) - E(\varepsilon_0\varepsilon_1) - E(\varepsilon_1\varepsilon_2) + E(\varepsilon_1^2)\big] \\
&= \bar Y\left[\left(\frac{1}{n'} - \frac{1}{N}\right)\rho C_x C_y - \left(\frac{1}{n} - \frac{1}{N}\right)\rho C_x C_y - \left(\frac{1}{n'} - \frac{1}{N}\right)C_x^2 + \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2\right] \\
&= \bar Y\left(\frac{1}{n} - \frac{1}{n'}\right)\big(C_x^2 - \rho C_x C_y\big) \\
&= \bar Y\left(\frac{1}{n} - \frac{1}{n'}\right)C_x\big(C_x - \rho C_y\big).
\end{aligned}$$
The bias is negligible if $n$ is large, and the relative bias vanishes if $C_x^2 = \rho C_x C_y$, i.e., if the regression line of $y$ on $x$ passes through the origin.
MSE of $\hat{\bar Y}_{Rd}$:

$$\begin{aligned}
MSE(\hat{\bar Y}_{Rd}) &= E(\hat{\bar Y}_{Rd} - \bar Y)^2 \\
&= \bar Y^2\,E(\varepsilon_0 + \varepsilon_2 - \varepsilon_1)^2 \qquad (\text{retaining the terms up to order two}) \\
&= \bar Y^2\,E\big(\varepsilon_0^2 + \varepsilon_1^2 + \varepsilon_2^2 + 2\varepsilon_0\varepsilon_2 - 2\varepsilon_0\varepsilon_1 - 2\varepsilon_1\varepsilon_2\big) \\
&= \bar Y^2\left[\left(\frac{1}{n} - \frac{1}{N}\right)C_y^2 + \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2 - \left(\frac{1}{n'} - \frac{1}{N}\right)C_x^2 + 2\left(\frac{1}{n'} - \frac{1}{N}\right)\rho C_x C_y - 2\left(\frac{1}{n} - \frac{1}{N}\right)\rho C_x C_y\right] \\
&= \bar Y^2\left(\frac{1}{n} - \frac{1}{N}\right)\big(C_x^2 + C_y^2 - 2\rho C_x C_y\big) + \bar Y^2\left(\frac{1}{n'} - \frac{1}{N}\right)C_x\big(2\rho C_y - C_x\big) \\
&= Var(\bar y) + \bar Y^2\left(\frac{1}{n'} - \frac{1}{n}\right)\big(2\rho C_x C_y - C_x^2\big),
\end{aligned}$$
where $Var(\bar y) = \left(\frac{1}{n} - \frac{1}{N}\right)S_y^2$ is the variance of the sample mean under SRSWOR when no auxiliary information is used.

The second term is the contribution of the second phase of sampling; since $n < n'$, it is negative exactly when $2\rho C_x C_y - C_x^2 > 0$. This method is therefore preferred over estimation by the sample mean alone if
$$2\rho C_x C_y - C_x^2 > 0 \qquad \text{or} \qquad \rho > \frac{1}{2}\,\frac{C_x}{C_y}.$$
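To see this condition numerically, a small sketch (all parameter values hypothetical) evaluates the second-order MSE formula above and compares it with $Var(\bar y)$:

```python
def mse_ratio_double(Ybar, n, n_prime, N, Cy, Cx, rho):
    """Second-order MSE of the double-sampling ratio estimator."""
    return Ybar**2 * ((1/n - 1/N) * Cy**2
                      + (1/n - 1/n_prime) * (Cx**2 - 2 * rho * Cx * Cy))

Ybar, n, n_prime, N, Cy, Cx, rho = 50.0, 20, 80, 1000, 0.3, 0.25, 0.8
var_srs = Ybar**2 * (1/n - 1/N) * Cy**2             # Var(y_bar) under SRSWOR
print(mse_ratio_double(Ybar, n, n_prime, N, Cy, Cx, rho), var_srs)
print(rho > Cx / (2 * Cy))   # True here, so the MSE falls below Var(y_bar)
```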

Choice of n and n'

Write
$$MSE(\hat{\bar Y}_{Rd}) = \frac{V}{n} + \frac{V'}{n'}$$
where $V$ and $V'$ contain all the terms containing $n$ and $n'$, respectively.

The cost function is $C_0 = nC + n'C'$, where $C$ and $C'$ are the costs per unit for selecting the samples of sizes $n$ and $n'$, respectively.

Now we find the optimum sample sizes $n$ and $n'$ for fixed cost $C_0$. The Lagrangian function is
$$\psi = \frac{V}{n} + \frac{V'}{n'} + \lambda(nC + n'C' - C_0)$$
$$\frac{\partial\psi}{\partial n} = 0 \;\Rightarrow\; \lambda C = \frac{V}{n^2}$$
$$\frac{\partial\psi}{\partial n'} = 0 \;\Rightarrow\; \lambda C' = \frac{V'}{n'^2}.$$
Thus $\lambda C n^2 = V$, or
$$n = \sqrt{\frac{V}{\lambda C}} \qquad \text{or} \qquad \sqrt{\lambda}\,nC = \sqrt{VC}.$$
Similarly, $\sqrt{\lambda}\,n'C' = \sqrt{V'C'}$. Thus
$$\sqrt{\lambda} = \frac{\sqrt{VC} + \sqrt{V'C'}}{C_0}$$
and so
$$\text{Optimum } n = \frac{C_0}{\sqrt{VC} + \sqrt{V'C'}}\sqrt{\frac{V}{C}} = n_{opt},\ \text{say}$$
$$\text{Optimum } n' = \frac{C_0}{\sqrt{VC} + \sqrt{V'C'}}\sqrt{\frac{V'}{C'}} = n'_{opt},\ \text{say}$$
$$Var_{opt}(\hat{\bar Y}_{Rd}) = \frac{V}{n_{opt}} + \frac{V'}{n'_{opt}} = \frac{\big(\sqrt{VC} + \sqrt{V'C'}\big)^2}{C_0}.$$
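A short sketch of these allocation formulas (the values of $V$, $V'$, $C$, $C'$, $C_0$ below are hypothetical):

```python
from math import sqrt

def optimum_allocation(V, V_prime, C, C_prime, C0):
    """n_opt, n'_opt and the optimum variance for fixed total cost C0."""
    k = sqrt(V * C) + sqrt(V_prime * C_prime)
    n_opt = C0 / k * sqrt(V / C)
    n_prime_opt = C0 / k * sqrt(V_prime / C_prime)
    return n_opt, n_prime_opt, k**2 / C0

n_opt, n_prime_opt, var_opt = optimum_allocation(40.0, 4.0, 4.0, 0.25, 400.0)
print(round(n_opt, 1), round(n_prime_opt, 1), round(var_opt, 4))
print(round(n_opt * 4.0 + n_prime_opt * 0.25, 6))   # cost constraint nC + n'C' = C0
```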

Comparison with SRS

If $\bar X$ is ignored and all resources are used to estimate $\bar Y$ by $\bar y$, then the required sample size is $C_0/C$ and
$$Var(\bar y) = \frac{S_y^2}{C_0/C} = \frac{C S_y^2}{C_0}.$$
$$\text{Relative efficiency} = \frac{Var(\bar y)}{Var_{opt}(\hat{\bar Y}_{Rd})} = \frac{C S_y^2}{\big(\sqrt{VC} + \sqrt{V'C'}\big)^2}.$$

Double sampling in regression method of estimation


When the population mean of the auxiliary variable $\bar X$ is not known, double sampling is used as follows:
- A large sample of size $n'$ is taken from the population by SRSWOR, from which the population mean $\bar X$ is estimated as $\bar x'$, i.e., $\hat{\bar X} = \bar x'$.
- Then a subsample of size $n$ is chosen from the larger sample, and both the variables $x$ and $y$ are measured on it, taking $\bar x'$ in place of $\bar X$ and treating it as if it were known.

Then $E(\bar x') = \bar X$, $E(\bar x) = \bar X$, $E(\bar y) = \bar Y$. The regression estimate of $\bar Y$ in this case is given by
$$\hat{\bar Y}_{regd} = \bar y + \hat\beta(\bar x' - \bar x)$$
where
$$\hat\beta = \frac{s_{xy}}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$
is an estimator of $\beta = \dfrac{S_{xy}}{S_x^2}$ based on the sample of size $n$.

It is difficult to find the exact properties, such as the bias and mean squared error, of $\hat{\bar Y}_{regd}$, so we derive approximate expressions.

Let
$$\varepsilon_1 = \frac{\bar x - \bar X}{\bar X} \;\Rightarrow\; \bar x = (1 + \varepsilon_1)\bar X$$
$$\varepsilon_2 = \frac{\bar x' - \bar X}{\bar X} \;\Rightarrow\; \bar x' = (1 + \varepsilon_2)\bar X$$
$$\varepsilon_3 = \frac{s_{xy} - S_{xy}}{S_{xy}} \;\Rightarrow\; s_{xy} = (1 + \varepsilon_3)S_{xy}$$
$$\varepsilon_4 = \frac{s_x^2 - S_x^2}{S_x^2} \;\Rightarrow\; s_x^2 = (1 + \varepsilon_4)S_x^2$$
$$E(\varepsilon_1) = 0, \quad E(\varepsilon_2) = 0, \quad E(\varepsilon_3) = 0, \quad E(\varepsilon_4) = 0.$$
Define
$$\mu_{21} = E\big[(x - \bar X)^2(y - \bar Y)\big], \qquad \mu_{30} = E\big[(x - \bar X)^3\big].$$

Estimation error:
Then
$$\begin{aligned}
\hat{\bar Y}_{regd} &= \bar y + \hat\beta(\bar x' - \bar x) \\
&= \bar y + \frac{S_{xy}(1 + \varepsilon_3)}{S_x^2(1 + \varepsilon_4)}(\varepsilon_2 - \varepsilon_1)\bar X \\
&= \bar y + \bar X\,\frac{S_{xy}}{S_x^2}(1 + \varepsilon_3)(\varepsilon_2 - \varepsilon_1)(1 + \varepsilon_4)^{-1} \\
&= \bar y + \bar X\beta(1 + \varepsilon_3)(\varepsilon_2 - \varepsilon_1)(1 - \varepsilon_4 + \varepsilon_4^2 - \ldots).
\end{aligned}$$
Retaining the powers of the $\varepsilon$'s up to order two, assuming $|\varepsilon_4| < 1$ (using the same concept as detailed in the case of the ratio method of estimation),
$$\hat{\bar Y}_{regd} \approx \bar y + \bar X\beta(\varepsilon_2 + \varepsilon_2\varepsilon_3 - \varepsilon_2\varepsilon_4 - \varepsilon_1 - \varepsilon_1\varepsilon_3 + \varepsilon_1\varepsilon_4).$$

Bias:
The bias of $\hat{\bar Y}_{regd}$ up to the second order of approximation is obtained from
$$E(\hat{\bar Y}_{regd}) = \bar Y + \bar X\beta\big[E(\varepsilon_2\varepsilon_3) - E(\varepsilon_2\varepsilon_4) - E(\varepsilon_1\varepsilon_3) + E(\varepsilon_1\varepsilon_4)\big].$$
Using, up to terms of order $1/n$,
$$E(\varepsilon_1\varepsilon_3) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{\mu_{21}}{\bar X S_{xy}}, \qquad E(\varepsilon_1\varepsilon_4) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{\mu_{30}}{\bar X S_x^2},$$
$$E(\varepsilon_2\varepsilon_3) = \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{\mu_{21}}{\bar X S_{xy}}, \qquad E(\varepsilon_2\varepsilon_4) = \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{\mu_{30}}{\bar X S_x^2},$$
we get
$$\begin{aligned}
Bias(\hat{\bar Y}_{regd}) &= E(\hat{\bar Y}_{regd}) - \bar Y \\
&= \bar X\beta\left[\left(\frac{1}{n'} - \frac{1}{N}\right)\frac{\mu_{21}}{\bar X S_{xy}} - \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{\mu_{30}}{\bar X S_x^2} - \left(\frac{1}{n} - \frac{1}{N}\right)\frac{\mu_{21}}{\bar X S_{xy}} + \left(\frac{1}{n} - \frac{1}{N}\right)\frac{\mu_{30}}{\bar X S_x^2}\right] \\
&= \beta\left(\frac{1}{n} - \frac{1}{n'}\right)\left(\frac{\mu_{30}}{S_x^2} - \frac{\mu_{21}}{S_{xy}}\right).
\end{aligned}$$

Mean squared error:

$$\begin{aligned}
MSE(\hat{\bar Y}_{regd}) &= E(\hat{\bar Y}_{regd} - \bar Y)^2 \\
&= E\big[\bar y + \hat\beta(\bar x' - \bar x) - \bar Y\big]^2 \\
&= E\big[(\bar y - \bar Y) + \bar X\beta(1 + \varepsilon_3)(\varepsilon_2 - \varepsilon_1)(1 - \varepsilon_4 + \varepsilon_4^2 - \ldots)\big]^2.
\end{aligned}$$
Retaining the powers of the $\varepsilon$'s up to order two, the mean squared error up to the second order of approximation is
$$\begin{aligned}
MSE(\hat{\bar Y}_{regd}) &= E\big[(\bar y - \bar Y) + \bar X\beta(\varepsilon_2 - \varepsilon_1)\big]^2 \\
&= E(\bar y - \bar Y)^2 + \bar X^2\beta^2 E\big(\varepsilon_1^2 + \varepsilon_2^2 - 2\varepsilon_1\varepsilon_2\big) + 2\bar X\beta\,E\big[(\bar y - \bar Y)(\varepsilon_2 - \varepsilon_1)\big] \\
&= Var(\bar y) + \bar X^2\beta^2\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_x^2}{\bar X^2} + \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_x^2}{\bar X^2} - 2\left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_x^2}{\bar X^2}\right] \\
&\qquad - 2\bar X\beta\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_{xy}}{\bar X} - \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_{xy}}{\bar X}\right] \\
&= Var(\bar y) + \beta^2\left(\frac{1}{n} - \frac{1}{n'}\right)S_x^2 - 2\beta\left(\frac{1}{n} - \frac{1}{n'}\right)S_{xy} \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)S_y^2 - \left(\frac{1}{n} - \frac{1}{n'}\right)\left(\frac{S_{xy}}{S_x}\right)^2 \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)S_y^2 - \left(\frac{1}{n} - \frac{1}{n'}\right)\rho^2 S_y^2 \qquad (\text{using } S_{xy} = \rho S_x S_y) \\
&= \frac{(1 - \rho^2)S_y^2}{n} + \frac{\rho^2 S_y^2}{n'} \qquad (\text{ignoring the finite population correction}).
\end{aligned}$$

Clearly, $\hat{\bar Y}_{regd}$ is more efficient than the sample mean under SRS, i.e., than when no auxiliary variable is used.

Now we address the issue of whether this reduction in variability is worth the extra expenditure required to observe the auxiliary variable.

Let the total cost of the survey be
$$C_0 = C_1 n + C_2 n'$$
where $C_1$ and $C_2$ are the costs per unit of observing the study variable $y$ and the auxiliary variable $x$, respectively.

Now minimize $MSE(\hat{\bar Y}_{regd})$ for fixed cost $C_0$ using the Lagrangian function with Lagrangian multiplier $\lambda$:

$$\psi = \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2 S_y^2}{n'} + \lambda(C_1 n + C_2 n' - C_0)$$
$$\frac{\partial\psi}{\partial n} = 0 \;\Rightarrow\; -\frac{1}{n^2}S_y^2(1 - \rho^2) + \lambda C_1 = 0$$
$$\frac{\partial\psi}{\partial n'} = 0 \;\Rightarrow\; -\frac{1}{n'^2}\rho^2 S_y^2 + \lambda C_2 = 0.$$
Thus
$$n = \frac{S_y\sqrt{1 - \rho^2}}{\sqrt{\lambda C_1}} \qquad \text{and} \qquad n' = \frac{\rho S_y}{\sqrt{\lambda C_2}}.$$
Substituting these values in the cost function, we have
$$C_0 = C_1 n + C_2 n' = \frac{S_y\sqrt{C_1(1 - \rho^2)}}{\sqrt\lambda} + \frac{\rho S_y\sqrt{C_2}}{\sqrt\lambda}$$
or
$$\sqrt\lambda = \frac{1}{C_0}\Big[S_y\sqrt{C_1(1 - \rho^2)} + \rho S_y\sqrt{C_2}\Big].$$

Thus the optimum values of $n$ and $n'$ are
$$n_{opt} = \frac{C_0\,S_y\sqrt{1 - \rho^2}}{\sqrt{C_1}\Big[S_y\sqrt{C_1(1 - \rho^2)} + \rho S_y\sqrt{C_2}\Big]}$$
$$n'_{opt} = \frac{C_0\,\rho S_y}{\sqrt{C_2}\Big[S_y\sqrt{C_1(1 - \rho^2)} + \rho S_y\sqrt{C_2}\Big]}.$$
The optimum mean squared error of $\hat{\bar Y}_{regd}$ is obtained by substituting $n = n_{opt}$ and $n' = n'_{opt}$ as
$$MSE_{opt}(\hat{\bar Y}_{regd}) = \frac{S_y^2(1 - \rho^2)}{n_{opt}} + \frac{\rho^2 S_y^2}{n'_{opt}} = \frac{S_y^2}{C_0}\Big[\sqrt{C_1(1 - \rho^2)} + \rho\sqrt{C_2}\Big]^2.$$
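A sketch of these optimum values (hypothetical $S_y$, $\rho$ and costs), including a check that the cost constraint is met:

```python
from math import sqrt

def regression_optimum(Sy, rho, C1, C2, C0):
    """Optimum n, n' and MSE for the double-sampling regression estimator."""
    k = Sy * sqrt(C1 * (1 - rho**2)) + rho * Sy * sqrt(C2)
    n_opt = C0 * Sy * sqrt(1 - rho**2) / (sqrt(C1) * k)
    n_prime_opt = C0 * rho * Sy / (sqrt(C2) * k)
    mse_opt = Sy**2 / C0 * (sqrt(C1 * (1 - rho**2)) + rho * sqrt(C2))**2
    return n_opt, n_prime_opt, mse_opt

n_opt, n_prime_opt, mse_opt = regression_optimum(Sy=10.0, rho=0.9, C1=16.0, C2=1.0, C0=1000.0)
print(round(n_opt, 1), round(n_prime_opt, 1), round(mse_opt, 4))
print(round(16.0 * n_opt + 1.0 * n_prime_opt, 4))   # equals C0 = 1000
```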
The optimum variance of $\bar y$ under SRS, where no auxiliary information is used, is
$$Var_{opt}(\bar y_{SRS}) = \frac{C_1 S_y^2}{C_0},$$
which is obtained by substituting $\rho = 0$ and $C_2 = 0$ in $MSE_{opt}(\hat{\bar Y}_{regd})$. The relative efficiency is
$$RE = \frac{Var_{opt}(\bar y_{SRS})}{MSE_{opt}(\hat{\bar Y}_{regd})} = \frac{C_1 S_y^2}{S_y^2\Big[\sqrt{C_1(1 - \rho^2)} + \rho\sqrt{C_2}\Big]^2} = \frac{1}{\Big[\sqrt{1 - \rho^2} + \rho\sqrt{\dfrac{C_2}{C_1}}\Big]^2}.$$
Thus double sampling with the regression estimator leads to a gain in precision ($RE > 1$) if
$$\frac{C_1}{C_2} > \frac{\rho^2}{\Big[1 - \sqrt{1 - \rho^2}\Big]^2}.$$
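For instance, a quick check of this condition with hypothetical values:

```python
from math import sqrt

def gains_precision(rho, C1, C2):
    """True when the double-sampling regression estimator beats SRS on y alone."""
    return C1 / C2 > rho**2 / (1 - sqrt(1 - rho**2))**2

print(gains_precision(0.9, 16.0, 1.0))   # True: strong correlation, cheap x
print(gains_precision(0.3, 2.0, 1.0))    # False: weak correlation does not pay
```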

Double sampling for probability proportional to size estimation:


Suppose it is desired to select the sample with probability proportional to the auxiliary variable $x$, but information on $x$ is not available. Then, in this situation, double sampling can be used. An initial sample of size $n'$ is selected by SRSWOR from the population of size $N$, and information on $x$ is collected for this sample. Then a second sample of size $n$ is selected with replacement and with probability proportional to $x$ from the initial sample of size $n'$. Let $\bar x'$ denote the mean of $x$ for the initial sample of size $n'$, and let $\bar x$ and $\bar y$ denote the means of $x$ and $y$, respectively, for the second sample of size $n$. Then we have the following theorem.

Theorem:
(1) An unbiased estimator of the population mean $\bar Y$ is given by
$$\hat{\bar Y} = \frac{x'_{tot}}{n'n}\sum_{i=1}^{n}\frac{y_i}{x_i},$$
where $x'_{tot}$ denotes the total of $x$ in the first sample.

(2) Its variance is
$$Var(\hat{\bar Y}) = \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \frac{(n' - 1)}{nn'N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}X_{tot} - Y_{tot}\right)^2,$$
where $X_{tot}$ and $Y_{tot}$ denote the totals of $x$ and $y$, respectively, in the population.

(3) An unbiased estimator of the variance of $\hat{\bar Y}$ is given by
$$\widehat{Var}(\hat{\bar Y}) = \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{1}{n(n' - 1)}\left[x'_{tot}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x'^2_{tot}(A - B)}{n'(n - 1)}\right] + \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'x_i} - \hat{\bar Y}\right)^2,$$
where
$$A = \left(\sum_{i=1}^{n}\frac{y_i}{x_i}\right)^2 \qquad \text{and} \qquad B = \sum_{i=1}^{n}\frac{y_i^2}{x_i^2}.$$

Proof. Before deriving the results, we first mention the following result, proved in the varying probability scheme sampling.

Result: In sampling with a varying probability scheme, for drawing a sample of size $n$ from a population of size $N$ with replacement:
(i) $\bar z = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} z_i$ is an unbiased estimator of the population mean $\bar Y$, where $z_i = \dfrac{y_i}{Np_i}$, $p_i$ being the probability of selection of the $i$th unit. Note that $y_i$ and $p_i$ can take any one of the $N$ values $Y_1, Y_2, \ldots, Y_N$ with initial probabilities $P_1, P_2, \ldots, P_N$, respectively.
(ii) $Var(\bar z) = \dfrac{1}{nN^2}\left[\displaystyle\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - N^2\bar Y^2\right] = \dfrac{1}{nN^2}\displaystyle\sum_{i=1}^{N}P_i\left(\frac{Y_i}{P_i} - N\bar Y\right)^2.$
(iii) An unbiased estimator of the variance of $\bar z$ is
$$\widehat{Var}(\bar z) = \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{y_i}{Np_i} - \bar z\right)^2.$$

Let $E_2$ denote the expectation of $\hat{\bar Y}$ when the first sample is fixed. The second sample is selected with probability proportional to $x$, hence using result (i) with $p_i = \dfrac{x_i}{x'_{tot}}$, we find that
$$E_2(\hat{\bar Y}) = E_2\left[\frac{x'_{tot}}{n'n}\sum_{i=1}^{n}\frac{y_i}{x_i}\right] = E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{n'p_i}\right] = \bar y',$$
where $\bar y'$ is the mean of $y$ for the first sample. Hence
$$E(\hat{\bar Y}) = E_1\big[E_2(\hat{\bar Y} \mid n')\big] = E_1(\bar y') = \bar Y,$$
which proves part (1) of the theorem. Further,
$$\begin{aligned}
Var(\hat{\bar Y}) &= V_1\big[E_2(\hat{\bar Y} \mid n')\big] + E_1\big[V_2(\hat{\bar Y} \mid n')\big] \\
&= V_1(\bar y') + E_1\big[V_2(\hat{\bar Y} \mid n')\big] \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + E_1\big[V_2(\hat{\bar Y} \mid n')\big].
\end{aligned}$$
Now, using result (ii) with the first sample in place of the population, we get
$$V_2(\hat{\bar Y} \mid n') = \frac{1}{nn'^2}\sum_{i=1}^{n'}\frac{x_i}{x'_{tot}}\left(\frac{y_i}{x_i}x'_{tot} - y'_{tot}\right)^2 = \frac{1}{nn'^2}\sum_{i=1}^{n'}\sum_{j>i}^{n'}x_i x_j\left(\frac{y_i}{x_i} - \frac{y_j}{x_j}\right)^2,$$
where $y'_{tot}$ denotes the total of $y$ in the first sample.
Hence
$$E_1\big[V_2(\hat{\bar Y} \mid n')\big] = \frac{1}{nn'^2}\cdot\frac{n'(n' - 1)}{N(N - 1)}\sum_{i=1}^{N}\sum_{j>i}^{N}x_i x_j\left(\frac{y_i}{x_i} - \frac{y_j}{x_j}\right)^2,$$
using the fact that the probability of a specified pair of units being selected in the first sample is $\dfrac{n'(n' - 1)}{N(N - 1)}$. So we can express
$$E_1\big[V_2(\hat{\bar Y} \mid n')\big] = \frac{(n' - 1)}{nn'N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}X_{tot} - Y_{tot}\right)^2.$$
Substituting this in $Var(\hat{\bar Y})$, we get
$$Var(\hat{\bar Y}) = \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \frac{(n' - 1)}{nn'N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}X_{tot} - Y_{tot}\right)^2.$$
This proves part (2) of the theorem.

We now consider the estimation of $Var(\hat{\bar Y})$. Given the first sample,
$$E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i^2}{p_i}\right] = \sum_{i=1}^{n'}y_i^2,$$
where $p_i = \dfrac{x_i}{x'_{tot}}$. Also, given the first sample,
$$E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{y_i}{n'p_i} - \hat{\bar Y}\right)^2\right] = V_2(\hat{\bar Y}), \qquad E_2(\hat{\bar Y}^2) = V_2(\hat{\bar Y}) + \bar y'^2.$$
Hence
$$E_2\left[\hat{\bar Y}^2 - \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{y_i}{n'p_i} - \hat{\bar Y}\right)^2\right] = \bar y'^2.$$
Substituting $\hat{\bar Y} = \dfrac{x'_{tot}}{n'n}\displaystyle\sum_{i=1}^{n}\frac{y_i}{x_i}$ and $p_i = \dfrac{x_i}{x'_{tot}}$, the expression becomes
$$E_2\left[\frac{x'^2_{tot}}{n'^2 n(n - 1)}\left\{\left(\sum_{i=1}^{n}\frac{y_i}{x_i}\right)^2 - \sum_{i=1}^{n}\frac{y_i^2}{x_i^2}\right\}\right] = \bar y'^2.$$

Using
$$E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i^2}{p_i}\right] = \sum_{i=1}^{n'}y_i^2,$$
we get
$$E_2\left[\frac{x'_{tot}}{n}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x'^2_{tot}(A - B)}{nn'(n - 1)}\right] = \sum_{i=1}^{n'}y_i^2 - n'\bar y'^2,$$
where $A = \left(\displaystyle\sum_{i=1}^{n}\frac{y_i}{x_i}\right)^2$ and $B = \displaystyle\sum_{i=1}^{n}\frac{y_i^2}{x_i^2}$, which further simplifies to
$$E_2\left[\frac{1}{n(n' - 1)}\left\{x'_{tot}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x'^2_{tot}(A - B)}{n'(n - 1)}\right\}\right] = s_y'^2,$$
where $s_y'^2$ is the mean sum of squares of $y$ for the first sample. Thus, we obtain
$$E_1E_2\left[\frac{1}{n(n' - 1)}\left\{x'_{tot}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x'^2_{tot}(A - B)}{n'(n - 1)}\right\}\right] = E_1(s_y'^2) = S_y^2 \qquad (1)$$
which gives an unbiased estimator of $S_y^2$.


Next, since
$$E_1\big[V_2(\hat{\bar Y} \mid n')\big] = \frac{(n' - 1)}{nn'N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}X_{tot} - Y_{tot}\right)^2$$
and, given the first sample,
$$E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'x_i} - \hat{\bar Y}\right)^2\right] = V_2(\hat{\bar Y} \mid n'),$$
we have
$$E_1E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'x_i} - \hat{\bar Y}\right)^2\right] = \frac{(n' - 1)}{nn'N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}X_{tot} - Y_{tot}\right)^2 \qquad (2)$$
which gives an unbiased estimator of the second term of $Var(\hat{\bar Y})$.

Using (1) and (2), an unbiased estimator of the variance of $\hat{\bar Y}$ is obtained as
$$\widehat{Var}(\hat{\bar Y}) = \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{1}{n(n' - 1)}\left[x'_{tot}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x'^2_{tot}(A - B)}{n'(n - 1)}\right] + \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'x_i} - \hat{\bar Y}\right)^2.$$
Thus the theorem is proved.
