Chapter 8: Double Sampling
The ratio and regression methods of estimation require knowledge of the population mean of the auxiliary variable ($\bar{X}$) in order to estimate the population mean of the study variable ($\bar{Y}$). If information on the auxiliary variable is not available, there are two options. One option is to collect a sample on the study variable alone and use the sample mean as an estimator of the population mean.
An alternative solution is to devote part of the budget to collecting a large preliminary sample in which $x_i$ alone is measured. The purpose of this sampling is to furnish a good estimate of $\bar{X}$. This method is appropriate when the information about $x_i$ is on file cards that have not been tabulated. After collecting a large preliminary sample of $n'$ units from the population, select a smaller sample of size $n$ from it and collect the information on $y$. The two samples together are then used to obtain an estimator of the population mean $\bar{Y}$. This procedure of selecting a large sample for collecting information on the auxiliary variable $x$, and then selecting a subsample from it for collecting information on the study variable $y$, is called double sampling or two-phase sampling. It is useful when it is considerably cheaper and quicker to collect data on $x$ than on $y$ and there is a high correlation between $x$ and $y$.
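As an illustration of the procedure (not from the text; the population, sizes, and seed below are hypothetical), the following Python sketch draws the two phases with SRSWOR, measuring $x$ on the large first-phase sample and $y$ only on the subsample, and then combines them with the ratio form derived later in this chapter:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: y is costly to measure, x is a cheap correlated proxy
N = 1000
x = rng.gamma(shape=4.0, scale=5.0, size=N)
y = 2.0 * x + rng.normal(0.0, 5.0, size=N)

n_prime, n = 200, 40                       # first- and second-phase sample sizes

# Phase 1: SRSWOR of n' units, only x is recorded
s1 = rng.choice(N, size=n_prime, replace=False)
x_bar_prime = x[s1].mean()                 # good estimate of the unknown mean of x

# Phase 2: SRSWOR of n units from the first sample, y (and x) recorded
s2 = rng.choice(s1, size=n, replace=False)
y_bar, x_bar = y[s2].mean(), x[s2].mean()

Y_hat_Rd = (y_bar / x_bar) * x_bar_prime   # double-sampling ratio estimate of the mean of y
print(Y_hat_Rd, y.mean())
```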
In this sampling, the randomization is done twice. First a random sample of size n ' is drawn from a
population of size N and then again a random sample of size n is drawn from the first sample of size
n'.
So the sample mean in this sampling is a function of the two phases of sampling. If SRSWOR is utilized to draw the samples at both phases, then
- the number of possible samples at the first phase, when a sample of size $n'$ is drawn from a population of size $N$, is $\binom{N}{n'} = M_0$, say;
- the number of possible samples at the second phase, when a sample of size $n$ is drawn from the first-phase sample of size $n'$, is $\binom{n'}{n} = M_1$, say.
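For concreteness, these counts can be checked directly (the sizes $N = 10$, $n' = 4$, $n = 2$ are hypothetical):

```python
from math import comb

N, n_prime, n = 10, 4, 2
M0 = comb(N, n_prime)    # first-phase samples: C(N, n') = 210
M1 = comb(n_prime, n)    # second-phase samples per first-phase sample: C(n', n) = 6
print(M0, M1)
```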
[Figure: the two-phase sampling scheme. Population of X (N units) → sample (large; $M_0$ possible samples; $n'$ units) → subsample (small; $M_1$ possible samples; $n$ units).]
The sample mean is then a function of both phases of sampling. Let $\bar{\theta}$ be the statistic calculated at the second phase, taking values $\bar{\theta}_{ij}$, $i = 1, 2, \ldots, M_0$, $j = 1, 2, \ldots, M_1$, with $P_{ij}$ being the probability that the $i$th sample is chosen at the first phase and the $j$th sample at the second phase. Let $E_2(\cdot)$ denote the expectation over the second phase (given the first) and $E_1(\cdot)$ the expectation over the first phase. Thus
$$E(\bar{\theta}) = \sum_{i=1}^{M_0}\sum_{j=1}^{M_1} P_{ij}\,\bar{\theta}_{ij} = \sum_{i=1}^{M_0}\sum_{j=1}^{M_1} P_i\,P_{j|i}\,\bar{\theta}_{ij} \qquad (\text{using } P(A \cap B) = P(A)\,P(B|A))$$
$$= \underbrace{\sum_{i=1}^{M_0} P_i}_{\text{1st stage}}\ \underbrace{\sum_{j=1}^{M_1} P_{j|i}\,\bar{\theta}_{ij}}_{\text{2nd stage}} = E_1\big[E_2(\bar{\theta})\big].$$
Variance of $\bar{\theta}$:
$$\begin{aligned}
Var(\bar{\theta}) &= E\big[\bar{\theta} - E(\bar{\theta})\big]^2 \\
&= E\big[\{\bar{\theta} - E_2(\bar{\theta})\} + \{E_2(\bar{\theta}) - E(\bar{\theta})\}\big]^2 \\
&= E\big[\bar{\theta} - E_2(\bar{\theta})\big]^2 + E\big[E_2(\bar{\theta}) - E(\bar{\theta})\big]^2 + 0
\end{aligned}$$
(the cross-product term vanishes because, given the first-phase sample, $E_2(\bar{\theta}) - E(\bar{\theta})$ is a constant for $E_2$ and $E_2\big[\bar{\theta} - E_2(\bar{\theta})\big] = 0$). Hence
$$Var(\bar{\theta}) = E_1 E_2\big[\bar{\theta} - E_2(\bar{\theta})\big]^2 + E_1\big[E_2(\bar{\theta}) - E_1\{E_2(\bar{\theta})\}\big]^2 = E_1\big[V_2(\bar{\theta})\big] + V_1\big[E_2(\bar{\theta})\big],$$
using $E(\bar{\theta}) = E_1\big[E_2(\bar{\theta})\big]$.
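This decomposition can be verified exactly by enumeration on a tiny hypothetical population, taking $\bar{\theta}$ to be the subsample mean and using SRSWOR at both phases (so every first-phase sample, and every subsample within it, is equally likely):

```python
from itertools import combinations
from statistics import mean, pvariance

pop = [3.0, 7.0, 8.0, 12.0, 15.0]        # hypothetical population, N = 5
n_prime, n = 3, 2                        # first- and second-phase sizes

thetas = []                              # theta over all equally likely two-phase outcomes
e2_list, v2_list = [], []                # E2(theta) and V2(theta) per first-phase sample
for s1 in combinations(pop, n_prime):
    sub_means = [mean(s2) for s2 in combinations(s1, n)]
    thetas.extend(sub_means)
    e2_list.append(mean(sub_means))
    v2_list.append(pvariance(sub_means))

print(pvariance(thetas))                   # Var(theta)
print(mean(v2_list) + pvariance(e2_list))  # E1[V2(theta)] + V1[E2(theta)]: identical
```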
Note: Two-phase sampling can be extended to more than two phases depending upon the need and objective of the experiment. The various expectations can also be extended along similar lines.
Ratio estimator in double sampling:
The ratio estimator of the population mean under double sampling is
$$\hat{\bar{Y}}_{Rd} = \frac{\bar{y}}{\bar{x}}\,\bar{x}',$$
where $\bar{y}$ and $\bar{x}$ are the sample means of $y$ and $x$ based on the second-phase sample of size $n$, and $\bar{x}'$ is the sample mean of $x$ based on the first-phase sample of size $n'$. The exact expressions for the bias and the mean squared error of $\hat{\bar{Y}}_{Rd}$ are difficult to derive, so we find their approximate expressions using the same approach as in the ratio method of estimation.
Let
$$\varepsilon_0 = \frac{\bar{y} - \bar{Y}}{\bar{Y}}, \qquad \varepsilon_1 = \frac{\bar{x} - \bar{X}}{\bar{X}}, \qquad \varepsilon_2 = \frac{\bar{x}' - \bar{X}}{\bar{X}},$$
so that $E(\varepsilon_0) = E(\varepsilon_1) = E(\varepsilon_2) = 0$. Then
$$E(\varepsilon_1^2) = \left(\frac{1}{n} - \frac{1}{N}\right) C_x^2,$$
$$\begin{aligned}
E(\varepsilon_1 \varepsilon_2) &= \frac{1}{\bar{X}^2}\,E\big[(\bar{x} - \bar{X})(\bar{x}' - \bar{X})\big] \\
&= \frac{1}{\bar{X}^2}\,E_1\Big[E_2\big\{(\bar{x} - \bar{X})(\bar{x}' - \bar{X}) \mid n'\big\}\Big] \\
&= \frac{1}{\bar{X}^2}\,E_1\big[(\bar{x}' - \bar{X})^2\big] \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_x^2}{\bar{X}^2} = \left(\frac{1}{n'} - \frac{1}{N}\right) C_x^2 = E(\varepsilon_2^2).
\end{aligned}$$
$$\begin{aligned}
E(\varepsilon_0 \varepsilon_2) &= \frac{1}{\bar{X}\bar{Y}}\,Cov(\bar{y}, \bar{x}') \\
&= \frac{1}{\bar{X}\bar{Y}}\Big[Cov\big\{E(\bar{y} \mid n'),\, E(\bar{x}' \mid n')\big\} + E\big\{Cov(\bar{y}, \bar{x}') \mid n'\big\}\Big] \\
&= \frac{1}{\bar{X}\bar{Y}}\,Cov(\bar{y}', \bar{x}') \qquad (\text{the conditional covariance is zero since } \bar{x}' \text{ is fixed given the first sample}) \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_{xy}}{\bar{X}\bar{Y}} = \left(\frac{1}{n'} - \frac{1}{N}\right)\rho\,\frac{S_x S_y}{\bar{X}\bar{Y}} = \left(\frac{1}{n'} - \frac{1}{N}\right)\rho\, C_x C_y,
\end{aligned}$$
where $\bar{y}'$ is the sample mean of the $y$'s based on the sample of size $n'$.
$$\begin{aligned}
E(\varepsilon_0 \varepsilon_1) &= \frac{1}{\bar{X}\bar{Y}}\,Cov(\bar{y}, \bar{x}) \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_{xy}}{\bar{X}\bar{Y}} = \left(\frac{1}{n} - \frac{1}{N}\right)\rho\,\frac{S_x S_y}{\bar{X}\bar{Y}} = \left(\frac{1}{n} - \frac{1}{N}\right)\rho\, C_x C_y.
\end{aligned}$$
$$\begin{aligned}
E(\varepsilon_0^2) &= \frac{1}{\bar{Y}^2}\,Var(\bar{y}) \\
&= \frac{1}{\bar{Y}^2}\Big[V_1\big\{E_2(\bar{y} \mid n')\big\} + E_1\big\{V_2(\bar{y} \mid n')\big\}\Big] \\
&= \frac{1}{\bar{Y}^2}\left[V_1(\bar{y}') + E_1\left\{\left(\frac{1}{n} - \frac{1}{n'}\right) s_y'^2\right\}\right] \\
&= \frac{1}{\bar{Y}^2}\left[\left(\frac{1}{n'} - \frac{1}{N}\right) S_y^2 + \left(\frac{1}{n} - \frac{1}{n'}\right) S_y^2\right] \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_y^2}{\bar{Y}^2} = \left(\frac{1}{n} - \frac{1}{N}\right) C_y^2,
\end{aligned}$$
where $s_y'^2$ is the mean sum of squares of $y$ based on the initial sample of size $n'$.
Alternatively,
$$\begin{aligned}
E(\varepsilon_1 \varepsilon_2) &= \frac{1}{\bar{X}^2}\,Cov(\bar{x}, \bar{x}') \\
&= \frac{1}{\bar{X}^2}\Big[Cov\big\{E(\bar{x} \mid n'),\, E(\bar{x}' \mid n')\big\} + 0\Big] \\
&= \frac{1}{\bar{X}^2}\,Var(\bar{x}'),
\end{aligned}$$
where $Var(\bar{x}')$ is the variance of the mean of $x$ based on the initial sample of size $n'$.
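These moment results lend themselves to a quick Monte Carlo check; the sketch below (hypothetical population and sizes) estimates $E(\varepsilon_1\varepsilon_2)$ over repeated two-phase draws and compares it with $(1/n' - 1/N)C_x^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_prime, n, reps = 500, 100, 25, 20000
x = rng.gamma(4.0, 5.0, size=N)           # hypothetical auxiliary variable

X_bar = x.mean()
C_x2 = x.var(ddof=1) / X_bar**2           # C_x^2 = S_x^2 / X_bar^2

acc = 0.0
for _ in range(reps):
    s1 = rng.choice(N, n_prime, replace=False)   # phase 1: SRSWOR
    s2 = rng.choice(s1, n, replace=False)        # phase 2: SRSWOR from phase 1
    e1 = (x[s2].mean() - X_bar) / X_bar
    e2 = (x[s1].mean() - X_bar) / X_bar
    acc += e1 * e2

print(acc / reps, (1/n_prime - 1/N) * C_x2)      # the two values are close
```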
Estimation error of $\hat{\bar{Y}}_{Rd}$:
Write $\hat{\bar{Y}}_{Rd}$ as
$$\begin{aligned}
\hat{\bar{Y}}_{Rd} &= \frac{(1 + \varepsilon_0)\bar{Y}}{(1 + \varepsilon_1)\bar{X}}\,(1 + \varepsilon_2)\bar{X} \\
&= \bar{Y}(1 + \varepsilon_0)(1 + \varepsilon_2)(1 + \varepsilon_1)^{-1} \\
&= \bar{Y}(1 + \varepsilon_0)(1 + \varepsilon_2)(1 - \varepsilon_1 + \varepsilon_1^2 - \cdots) \\
&= \bar{Y}\big(1 + \varepsilon_0 + \varepsilon_2 - \varepsilon_1 + \varepsilon_0\varepsilon_2 - \varepsilon_0\varepsilon_1 - \varepsilon_1\varepsilon_2 + \varepsilon_1^2\big)
\end{aligned}$$
up to terms of order two; terms of degree greater than two are assumed to be negligible, which requires $|\varepsilon_1| < 1$.
Bias of $\hat{\bar{Y}}_{Rd}$:
$$\begin{aligned}
Bias(\hat{\bar{Y}}_{Rd}) &= E(\hat{\bar{Y}}_{Rd}) - \bar{Y} \\
&= \bar{Y}\big[E(\varepsilon_0\varepsilon_2) - E(\varepsilon_0\varepsilon_1) - E(\varepsilon_1\varepsilon_2) + E(\varepsilon_1^2)\big] \\
&= \bar{Y}\left(\frac{1}{n} - \frac{1}{n'}\right) C_x\big(C_x - \rho C_y\big).
\end{aligned}$$
Mean squared error of $\hat{\bar{Y}}_{Rd}$:
Retaining terms up to order one in the $\varepsilon$'s,
$$\begin{aligned}
MSE(\hat{\bar{Y}}_{Rd}) &= E(\hat{\bar{Y}}_{Rd} - \bar{Y})^2 = \bar{Y}^2\, E(\varepsilon_0 + \varepsilon_2 - \varepsilon_1)^2 \\
&= \bar{Y}^2\Big[\Big(\tfrac{1}{n} - \tfrac{1}{N}\Big)C_y^2 + \Big(\tfrac{1}{n} - \tfrac{1}{N}\Big)C_x^2 - \Big(\tfrac{1}{n'} - \tfrac{1}{N}\Big)C_x^2 + 2\Big(\tfrac{1}{n'} - \tfrac{1}{N}\Big)\rho C_x C_y - 2\Big(\tfrac{1}{n} - \tfrac{1}{N}\Big)\rho C_x C_y\Big] \\
&= \bar{Y}^2\Big(\frac{1}{n} - \frac{1}{N}\Big)\big(C_x^2 + C_y^2 - 2\rho C_x C_y\big) + \bar{Y}^2\Big(\frac{1}{n'} - \frac{1}{N}\Big)C_x\big(2\rho C_y - C_x\big) \\
&= MSE(\text{ratio estimator}) + \bar{Y}^2\Big(\frac{1}{n'} - \frac{1}{N}\Big)\big(2\rho C_x C_y - C_x^2\big).
\end{aligned}$$
The second term is the contribution of the second phase of sampling. Comparing with the variance of the sample mean, $Var(\bar{y}) = \big(\tfrac{1}{n} - \tfrac{1}{N}\big)S_y^2$, we have
$$MSE(\hat{\bar{Y}}_{Rd}) - Var(\bar{y}) = \bar{Y}^2\left(\frac{1}{n} - \frac{1}{n'}\right)\big(C_x^2 - 2\rho C_x C_y\big),$$
so this method is preferred over the sample mean (which uses no auxiliary variable) if
$$2\rho C_x C_y - C_x^2 > 0 \qquad \text{or} \qquad \rho > \frac{1}{2}\,\frac{C_x}{C_y}.$$
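A quick numeric check of this comparison with hypothetical population quantities:

```python
# Hypothetical population quantities
Y_bar, rho, C_x, C_y = 50.0, 0.8, 0.30, 0.25
N, n, n_prime = 5000, 50, 200

f_n, f_np = 1/n - 1/N, 1/n_prime - 1/N
var_srs = Y_bar**2 * f_n * C_y**2                 # variance of the plain sample mean
mse_rd = Y_bar**2 * (f_n * (C_y**2 + C_x**2 - 2*rho*C_x*C_y)
                     + f_np * C_x * (2*rho*C_y - C_x))

print(mse_rd < var_srs, rho > 0.5 * C_x / C_y)    # both True for these values
```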
Cost and optimum allocation:
The cost function is $C_0 = nC + n'C'$, where $C$ and $C'$ are the costs per unit for selecting the samples of sizes $n$ and $n'$, respectively. Ignoring the finite population correction, the mean squared error can be written as
$$MSE(\hat{\bar{Y}}_{Rd}) = \frac{V}{n} + \frac{V'}{n'}, \qquad V = \bar{Y}^2\big(C_y^2 + C_x^2 - 2\rho C_x C_y\big), \quad V' = \bar{Y}^2\big(2\rho C_x C_y - C_x^2\big).$$
Now we find the optimum sample sizes $n$ and $n'$ for fixed cost $C_0$. The Lagrangian function, with Lagrangian multiplier $\lambda$, is
$$\phi = \frac{V}{n} + \frac{V'}{n'} + \lambda(nC + n'C' - C_0).$$
$$\frac{\partial \phi}{\partial n} = 0 \;\Rightarrow\; \lambda C = \frac{V}{n^2}, \qquad \frac{\partial \phi}{\partial n'} = 0 \;\Rightarrow\; \lambda C' = \frac{V'}{n'^2}.$$
Thus $\lambda C n^2 = V$, or
$$n = \sqrt{\frac{V}{\lambda C}}, \quad \text{i.e.,} \quad \sqrt{\lambda}\,nC = \sqrt{VC}.$$
Similarly, $\sqrt{\lambda}\,n'C' = \sqrt{V'C'}$. Thus
$$\sqrt{\lambda}\,C_0 = \sqrt{\lambda}\,(nC + n'C') = \sqrt{VC} + \sqrt{V'C'},$$
and so
$$\text{Optimum } n = \frac{C_0}{\sqrt{VC} + \sqrt{V'C'}}\,\sqrt{\frac{V}{C}} = n_{opt}, \text{ say},$$
$$\text{Optimum } n' = \frac{C_0}{\sqrt{VC} + \sqrt{V'C'}}\,\sqrt{\frac{V'}{C'}} = n'_{opt}, \text{ say},$$
and the resulting optimum variance is
$$Var_{opt}(\hat{\bar{Y}}_{Rd}) = \frac{V}{n_{opt}} + \frac{V'}{n'_{opt}} = \frac{\big(\sqrt{VC} + \sqrt{V'C'}\big)^2}{C_0}.$$
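A small sketch of this allocation rule as a function (the inputs V, V', C, C', C0 below are hypothetical):

```python
from math import sqrt

def optimum_allocation(V, V_prime, C, C_prime, C0):
    """Optimum n, n' and resulting variance for MSE = V/n + V'/n'
    subject to the cost constraint C0 = n*C + n'*C'."""
    k = C0 / (sqrt(V * C) + sqrt(V_prime * C_prime))
    n_opt = k * sqrt(V / C)
    n_prime_opt = k * sqrt(V_prime / C_prime)
    var_opt = (sqrt(V * C) + sqrt(V_prime * C_prime))**2 / C0
    return n_opt, n_prime_opt, var_opt

# Hypothetical: measuring y is five times costlier than measuring x
print(optimum_allocation(V=40.0, V_prime=10.0, C=5.0, C_prime=1.0, C0=1000.0))
```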
Regression estimator in double sampling:
The regression estimator of the population mean under double sampling is
$$\hat{\bar{Y}}_{regd} = \bar{y} + \hat{\beta}(\bar{x}' - \bar{x}),$$
where
$$\hat{\beta} = \frac{s_{xy}}{s_x^2} = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
is an estimator of $\beta = \dfrac{S_{xy}}{S_x^2}$ based on the sample of size $n$.
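A minimal sketch of computing $\hat{\bar{Y}}_{regd}$ from two-phase data (hypothetical population; np.cov and var with ddof=1 give $s_{xy}$ and $s_x^2$):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2000
x = rng.gamma(4.0, 5.0, size=N)                # hypothetical auxiliary variable
y = 10.0 + 1.5 * x + rng.normal(0.0, 6.0, N)   # study variable, linearly related to x

n_prime, n = 300, 60
s1 = rng.choice(N, n_prime, replace=False)      # phase 1: x only
s2 = rng.choice(s1, n, replace=False)           # phase 2: x and y

x_bar_prime = x[s1].mean()
x_bar, y_bar = x[s2].mean(), y[s2].mean()
beta_hat = np.cov(x[s2], y[s2], ddof=1)[0, 1] / x[s2].var(ddof=1)   # s_xy / s_x^2

Y_hat_regd = y_bar + beta_hat * (x_bar_prime - x_bar)
print(Y_hat_regd, y.mean())                     # estimate vs true population mean
```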
It is difficult to find the exact properties, such as the bias and mean squared error, of $\hat{\bar{Y}}_{regd}$, so we derive approximate expressions. Let
$$\varepsilon_1 = \frac{\bar{x} - \bar{X}}{\bar{X}} \;\Rightarrow\; \bar{x} = (1 + \varepsilon_1)\bar{X},$$
$$\varepsilon_2 = \frac{\bar{x}' - \bar{X}}{\bar{X}} \;\Rightarrow\; \bar{x}' = (1 + \varepsilon_2)\bar{X},$$
$$\varepsilon_3 = \frac{s_{xy} - S_{xy}}{S_{xy}} \;\Rightarrow\; s_{xy} = (1 + \varepsilon_3)S_{xy},$$
$$\varepsilon_4 = \frac{s_x^2 - S_x^2}{S_x^2} \;\Rightarrow\; s_x^2 = (1 + \varepsilon_4)S_x^2,$$
with $E(\varepsilon_1) = 0$, $E(\varepsilon_2) = 0$, $E(\varepsilon_3) = 0$, $E(\varepsilon_4) = 0$.
Define
$$\mu_{21} = E\big[(x - \bar{X})^2 (y - \bar{Y})\big], \qquad \mu_{30} = E\big[(x - \bar{X})^3\big].$$
Estimation error:
Then
$$\begin{aligned}
\hat{\bar{Y}}_{regd} &= \bar{y} + \hat{\beta}(\bar{x}' - \bar{x}) \\
&= \bar{y} + \frac{S_{xy}(1 + \varepsilon_3)}{S_x^2(1 + \varepsilon_4)}\,(\varepsilon_2 - \varepsilon_1)\bar{X} \\
&= \bar{y} + \frac{S_{xy}}{S_x^2}\,\bar{X}\,(1 + \varepsilon_3)(\varepsilon_2 - \varepsilon_1)(1 + \varepsilon_4)^{-1}.
\end{aligned}$$
Retaining powers of the $\varepsilon$'s up to order two, assuming $|\varepsilon_4| < 1$ (using the same concept as detailed for $\hat{\bar{Y}}_{Rd}$),
$$\hat{\bar{Y}}_{regd} = \bar{y} + \beta\bar{X}\big(\varepsilon_2 + \varepsilon_2\varepsilon_3 - \varepsilon_2\varepsilon_4 - \varepsilon_1 - \varepsilon_1\varepsilon_3 + \varepsilon_1\varepsilon_4\big).$$
Bias:
The bias of $\hat{\bar{Y}}_{regd}$ up to the second order of approximation is
$$Bias(\hat{\bar{Y}}_{regd}) = E(\hat{\bar{Y}}_{regd}) - \bar{Y} = \beta\bar{X}\big[E(\varepsilon_2\varepsilon_3) - E(\varepsilon_2\varepsilon_4) - E(\varepsilon_1\varepsilon_3) + E(\varepsilon_1\varepsilon_4)\big],$$
where
$$E(\varepsilon_2\varepsilon_3) = \frac{E\big[(\bar{x}' - \bar{X})(s_{xy} - S_{xy})\big]}{\bar{X}S_{xy}} \approx \Big(\frac{1}{n'} - \frac{1}{N}\Big)\frac{\mu_{21}}{\bar{X}S_{xy}}, \qquad
E(\varepsilon_1\varepsilon_3) = \frac{E\big[(\bar{x} - \bar{X})(s_{xy} - S_{xy})\big]}{\bar{X}S_{xy}} \approx \Big(\frac{1}{n} - \frac{1}{N}\Big)\frac{\mu_{21}}{\bar{X}S_{xy}},$$
$$E(\varepsilon_2\varepsilon_4) = \frac{E\big[(\bar{x}' - \bar{X})(s_x^2 - S_x^2)\big]}{\bar{X}S_x^2} \approx \Big(\frac{1}{n'} - \frac{1}{N}\Big)\frac{\mu_{30}}{\bar{X}S_x^2}, \qquad
E(\varepsilon_1\varepsilon_4) = \frac{E\big[(\bar{x} - \bar{X})(s_x^2 - S_x^2)\big]}{\bar{X}S_x^2} \approx \Big(\frac{1}{n} - \frac{1}{N}\Big)\frac{\mu_{30}}{\bar{X}S_x^2}.$$
Hence
$$Bias(\hat{\bar{Y}}_{regd}) = \beta\left(\frac{1}{n} - \frac{1}{n'}\right)\left(\frac{\mu_{30}}{S_x^2} - \frac{\mu_{21}}{S_{xy}}\right).$$
Mean squared error:
Retaining powers of the $\varepsilon$'s up to order two, the mean squared error of $\hat{\bar{Y}}_{regd}$ up to the second order of approximation is
$$\begin{aligned}
MSE(\hat{\bar{Y}}_{regd}) &= E\big[(\bar{y} - \bar{Y}) + \beta\bar{X}(\varepsilon_2 + \varepsilon_2\varepsilon_3 - \varepsilon_2\varepsilon_4 - \varepsilon_1 - \varepsilon_1\varepsilon_3 + \varepsilon_1\varepsilon_4)\big]^2 \\
&\approx E\big[(\bar{y} - \bar{Y}) + \beta\bar{X}(\varepsilon_2 - \varepsilon_1)\big]^2 \\
&= Var(\bar{y}) + \beta^2\bar{X}^2 E(\varepsilon_2 - \varepsilon_1)^2 + 2\beta\bar{X}\,E\big[(\bar{y} - \bar{Y})(\varepsilon_2 - \varepsilon_1)\big].
\end{aligned}$$
Using
$$\bar{X}^2 E(\varepsilon_1^2) = \Big(\tfrac{1}{n} - \tfrac{1}{N}\Big)S_x^2, \qquad \bar{X}^2 E(\varepsilon_2^2) = \bar{X}^2 E(\varepsilon_1\varepsilon_2) = \Big(\tfrac{1}{n'} - \tfrac{1}{N}\Big)S_x^2,$$
$$\bar{X}\,E\big[(\bar{y} - \bar{Y})\varepsilon_2\big] = \Big(\tfrac{1}{n'} - \tfrac{1}{N}\Big)S_{xy}, \qquad \bar{X}\,E\big[(\bar{y} - \bar{Y})\varepsilon_1\big] = \Big(\tfrac{1}{n} - \tfrac{1}{N}\Big)S_{xy},$$
we get
$$\begin{aligned}
MSE(\hat{\bar{Y}}_{regd}) &= Var(\bar{y}) + \beta^2\left(\frac{1}{n} - \frac{1}{n'}\right)S_x^2 - 2\beta\left(\frac{1}{n} - \frac{1}{n'}\right)S_{xy} \\
&= Var(\bar{y}) + \left(\frac{1}{n} - \frac{1}{n'}\right)\big(\beta^2 S_x^2 - 2\beta S_{xy}\big) \\
&= Var(\bar{y}) + \left(\frac{1}{n} - \frac{1}{n'}\right)\left(\frac{S_{xy}^2}{S_x^4}\,S_x^2 - 2\,\frac{S_{xy}}{S_x^2}\,S_{xy}\right) \\
&= Var(\bar{y}) - \left(\frac{1}{n} - \frac{1}{n'}\right)\frac{S_{xy}^2}{S_x^2} \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)S_y^2 - \left(\frac{1}{n} - \frac{1}{n'}\right)\rho^2 S_y^2 \qquad (\text{using } S_{xy} = \rho S_x S_y) \\
&= \frac{(1 - \rho^2)S_y^2}{n} + \frac{\rho^2 S_y^2}{n'} \qquad (\text{ignoring the finite population correction}).
\end{aligned}$$
Clearly, $\hat{\bar{Y}}_{regd}$ is more efficient than the sample mean under SRS (i.e., when no auxiliary variable is used): since $n' > n$, we have $\rho^2 S_y^2/n' < \rho^2 S_y^2/n$, and so $MSE(\hat{\bar{Y}}_{regd}) < S_y^2/n$.
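For instance, with hypothetical values $\rho = 0.8$, $S_y^2 = 100$, $n = 50$, $n' = 200$:

```python
S_y2, rho, n, n_prime = 100.0, 0.8, 50, 200

mse_regd = (1 - rho**2) * S_y2 / n + rho**2 * S_y2 / n_prime
var_srs = S_y2 / n                  # sample mean with the same n, no auxiliary variable
print(mse_regd, var_srs)            # 1.04 versus 2.0: a clear gain in precision
```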
Now we address the issue of whether the reduction in variability is worth the extra expenditure required to observe the auxiliary variable. Let the cost function be $C_0 = C_1 n + C_2 n'$, where $C_1$ and $C_2$ are the costs per unit of observing the study variable $y$ and the auxiliary variable $x$, respectively. Now minimize $MSE(\hat{\bar{Y}}_{regd})$ for fixed cost $C_0$ using the Lagrangian function, with Lagrangian multiplier $\lambda$, as
$$\phi = \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2 S_y^2}{n'} + \lambda(C_1 n + C_2 n' - C_0),$$
$$\frac{\partial \phi}{\partial n} = 0 \;\Rightarrow\; -\frac{1}{n^2}\,S_y^2(1 - \rho^2) + \lambda C_1 = 0,$$
$$\frac{\partial \phi}{\partial n'} = 0 \;\Rightarrow\; -\frac{1}{n'^2}\,\rho^2 S_y^2 + \lambda C_2 = 0.$$
Thus
$$n = \sqrt{\frac{S_y^2(1 - \rho^2)}{\lambda C_1}} = \frac{S_y}{\sqrt{\lambda}}\sqrt{\frac{1 - \rho^2}{C_1}} \qquad \text{and} \qquad n' = \frac{\rho\, S_y}{\sqrt{\lambda C_2}}.$$
Substituting these values in the cost function, we have
$$C_0 = C_1 n + C_2 n' = \frac{S_y}{\sqrt{\lambda}}\sqrt{C_1(1 - \rho^2)} + \frac{S_y}{\sqrt{\lambda}}\,\rho\sqrt{C_2}$$
or
$$\sqrt{\lambda}\,C_0 = \sqrt{C_1 S_y^2(1 - \rho^2)} + \sqrt{C_2\,\rho^2 S_y^2}$$
or
$$\sqrt{\lambda} = \frac{S_y}{C_0}\Big[\sqrt{C_1(1 - \rho^2)} + \rho\sqrt{C_2}\Big].$$
The optimum mean squared error of $\hat{\bar{Y}}_{regd}$ is obtained by substituting $n = n_{opt}$ and $n' = n'_{opt}$:
$$\begin{aligned}
MSE_{opt}(\hat{\bar{Y}}_{regd}) &= \frac{S_y^2(1 - \rho^2)}{n_{opt}} + \frac{\rho^2 S_y^2}{n'_{opt}} \\
&= \sqrt{\lambda}\,S_y\sqrt{C_1(1 - \rho^2)} + \sqrt{\lambda}\,S_y\,\rho\sqrt{C_2} \\
&= \frac{S_y^2}{C_0}\Big[\sqrt{C_1(1 - \rho^2)} + \rho\sqrt{C_2}\Big]^2.
\end{aligned}$$
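A sketch of this allocation as a reusable function (all inputs hypothetical):

```python
from math import sqrt

def regd_optimum(S_y2, rho, C1, C2, C0):
    """Optimum n, n' and optimum MSE for the double-sampling regression
    estimator under the cost constraint C0 = C1*n + C2*n' (fpc ignored)."""
    S_y = sqrt(S_y2)
    sqrt_lam = (S_y / C0) * (sqrt(C1 * (1 - rho**2)) + rho * sqrt(C2))
    n_opt = (S_y / sqrt_lam) * sqrt((1 - rho**2) / C1)
    n_prime_opt = S_y * rho / (sqrt_lam * sqrt(C2))
    mse_opt = (S_y2 / C0) * (sqrt(C1 * (1 - rho**2)) + rho * sqrt(C2))**2
    return n_opt, n_prime_opt, mse_opt

# Hypothetical: strong correlation, y ten times costlier to observe than x
print(regd_optimum(S_y2=100.0, rho=0.9, C1=10.0, C2=1.0, C0=2000.0))
# n' comes out larger than n: the cheap variable is taken on many more units
```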
For comparison, if no auxiliary information is used and the entire budget is spent on $y$ alone, then $n = C_0/C_1$ and the optimum variance of $\bar{y}$ under SRS is
$$Var_{opt}(\bar{y}_{SRS}) = \frac{C_1 S_y^2}{C_0}.$$
The relative efficiency is
$$\frac{MSE_{opt}(\hat{\bar{Y}}_{regd})}{Var_{opt}(\bar{y}_{SRS})} = \frac{1}{C_1}\Big[\sqrt{C_1(1 - \rho^2)} + \rho\sqrt{C_2}\Big]^2 = \left[\sqrt{1 - \rho^2} + \rho\sqrt{\frac{C_2}{C_1}}\,\right]^2.$$
Thus double sampling with the regression estimator leads to a gain in precision when this ratio is less than 1, i.e., if
$$\frac{C_1}{C_2} > \frac{\rho^2}{\big(1 - \sqrt{1 - \rho^2}\big)^2}.$$
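The break-even cost ratio rises quickly as the correlation weakens, as a short tabulation shows:

```python
from math import sqrt

# Minimum C1/C2 for the double-sampling regression estimator to beat SRS of equal cost
for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    threshold = rho**2 / (1 - sqrt(1 - rho**2))**2
    print(f"rho = {rho:4.2f}: gain in precision if C1/C2 > {threshold:6.2f}")
```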
We now consider double sampling when the second-phase sample is selected with probability proportional to the auxiliary variable $x$ observed at the first phase.

Theorem:
(1) An unbiased estimator of the population mean $\bar{Y}$ is given by
$$\hat{\bar{Y}} = \frac{x'_{tot}}{n'\,n}\sum_{i=1}^{n}\frac{y_i}{x_i},$$
where $x'_{tot}$ denotes the total of $x$ in the first sample.
(2) Its variance is
$$Var(\hat{\bar{Y}}) = \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \frac{(n' - 1)}{N(N - 1)\,n\,n'}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}\,X_{tot} - Y_{tot}\right)^2,$$
where $X_{tot}$ and $Y_{tot}$ denote the totals of $x$ and $y$, respectively, in the population.
Proof. Before deriving the results, we first mention the following results, proved in the varying probability scheme sampling.

Result: In sampling with a varying probability scheme, for drawing a sample of size $n$ from a population of size $N$ with replacement:
(i) $\bar{z} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} z_i$ is an unbiased estimator of the population mean $\bar{Y}$, where $z_i = \dfrac{y_i}{N p_i}$, $p_i$ being the probability of selection of the $i$th unit. Note that $y_i$ and $p_i$ can take any one of the $N$ values $Y_1, Y_2, \ldots, Y_N$ and $P_1, P_2, \ldots, P_N$, respectively.
(ii) $Var(\bar{z}) = \dfrac{1}{n}\displaystyle\sum_{i=1}^{N} P_i\left(\dfrac{Y_i}{N P_i} - \bar{Y}\right)^2.$

Let $E_2$ denote the expectation of $\hat{\bar{Y}}$ when the first sample is fixed. The second sample is selected with probability proportional to $x$; hence, using result (i) with $p_i = \dfrac{x_i}{x'_{tot}}$, we find that
$$E_2\big(\hat{\bar{Y}}\big) = E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{n'\,\dfrac{x_i}{x'_{tot}}}\right] = E_2\left[\frac{x'_{tot}}{n\,n'}\sum_{i=1}^{n}\frac{y_i}{x_i}\right] = \bar{y}',$$
where $\bar{y}'$ is the mean of $y$ for the first sample. Hence
$$E\big(\hat{\bar{Y}}\big) = E_1\Big[E_2\big(\hat{\bar{Y}} \mid n'\big)\Big] = E_1(\bar{y}') = \bar{Y},$$
which proves part (1) of the theorem. Further,
$$Var\big(\hat{\bar{Y}}\big) = V_1\Big[E_2\big(\hat{\bar{Y}} \mid n'\big)\Big] + E_1\Big[V_2\big(\hat{\bar{Y}} \mid n'\big)\Big] = \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + E_1\Big[V_2\big(\hat{\bar{Y}} \mid n'\big)\Big].$$
Now, using result (ii) with the first-phase sample playing the role of the population,
$$V_2\big(\hat{\bar{Y}} \mid n'\big) = \frac{1}{n\,n'^2}\sum_{i=1}^{n'}\frac{x_i}{x'_{tot}}\left(\frac{y_i}{x_i}\,x'_{tot} - y'_{tot}\right)^2 = \frac{1}{n\,n'^2}\sum_{i=1}^{n'}\sum_{j > i}^{n'} x_i x_j\left(\frac{y_i}{x_i} - \frac{y_j}{x_j}\right)^2,$$
where $y'_{tot}$ is the total of $y$ in the first sample,
and hence
$$E_1\Big[V_2\big(\hat{\bar{Y}} \mid n'\big)\Big] = \frac{1}{n\,n'^2}\cdot\frac{n'(n' - 1)}{N(N - 1)}\sum_{i=1}^{N}\sum_{j > i}^{N} x_i x_j\left(\frac{y_i}{x_i} - \frac{y_j}{x_j}\right)^2,$$
using the fact that the probability of a specified pair of units being selected in the first-phase sample is $\dfrac{n'(n' - 1)}{N(N - 1)}$. So we can express
$$E_1\Big[V_2\big(\hat{\bar{Y}} \mid n'\big)\Big] = \frac{(n' - 1)}{n\,n'\,N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}\,X_{tot} - Y_{tot}\right)^2.$$
Substituting this in $Var\big(\hat{\bar{Y}}\big) = V_1\big[E_2\big(\hat{\bar{Y}} \mid n'\big)\big] + E_1\big[V_2\big(\hat{\bar{Y}} \mid n'\big)\big]$, we get
$$Var\big(\hat{\bar{Y}}\big) = \left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \frac{(n' - 1)}{n\,n'\,N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}\,X_{tot} - Y_{tot}\right)^2.$$
This proves part (2) of the theorem.
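A simulation sketch of this scheme (hypothetical data): SRSWOR at the first phase, then n with-replacement pps(x) draws within the first sample; averaging the estimator over many repetitions reproduces the population mean, illustrating part (1):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_prime, n, reps = 400, 80, 15, 30000
x = rng.gamma(4.0, 5.0, size=N) + 1.0        # hypothetical, strictly positive x
y = 3.0 * x + rng.normal(0.0, 4.0, N)

est = np.empty(reps)
for r in range(reps):
    s1 = rng.choice(N, n_prime, replace=False)    # phase 1: SRSWOR
    x_tot_prime = x[s1].sum()
    p = x[s1] / x_tot_prime                       # selection probabilities, pps(x)
    s2 = rng.choice(s1, n, replace=True, p=p)     # phase 2: ppswr within first sample
    est[r] = (x_tot_prime / (n_prime * n)) * (y[s2] / x[s2]).sum()

print(est.mean(), y.mean())                       # the averages agree: unbiasedness
```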
We now consider the estimation of $Var\big(\hat{\bar{Y}}\big)$. Given the first sample,
$$E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i^2}{p_i}\right] = \sum_{i=1}^{n'} y_i^2,$$
where $p_i = \dfrac{x_i}{x'_{tot}}$. Also, given the first sample,
$$E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{y_i}{n' p_i} - \hat{\bar{Y}}\right)^2\right] = V_2\big(\hat{\bar{Y}}\big) = E_2\big(\hat{\bar{Y}}^2\big) - \bar{y}'^2.$$
Hence
$$E_2\left[\hat{\bar{Y}}^2 - \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{y_i}{n' p_i} - \hat{\bar{Y}}\right)^2\right] = \bar{y}'^2.$$
Substituting $\hat{\bar{Y}} = \dfrac{x'_{tot}}{n'\,n}\displaystyle\sum_{i=1}^{n}\dfrac{y_i}{x_i}$ and $p_i = \dfrac{x_i}{x'_{tot}}$, and using $E_2\left[\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\dfrac{y_i^2}{p_i}\right] = \displaystyle\sum_{i=1}^{n'} y_i^2$, we get
$$E_2\left[\frac{x'_{tot}}{n}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x_{tot}'^{\,2}}{n\,n'(n - 1)}\,(A - B)\right] = \sum_{i=1}^{n'} y_i^2 - n'\,\bar{y}'^2 = (n' - 1)\,s_y'^2,$$
where
$$A = \left(\sum_{i=1}^{n}\frac{y_i}{x_i}\right)^2, \qquad B = \sum_{i=1}^{n}\frac{y_i^2}{x_i^2},$$
and $s_y'^2$ is the mean sum of squares of $y$ for the first sample. Thus
$$\hat{s}_y'^2 = \frac{1}{n' - 1}\left[\frac{x'_{tot}}{n}\sum_{i=1}^{n}\frac{y_i^2}{x_i} - \frac{x_{tot}'^{\,2}}{n\,n'(n - 1)}\,(A - B)\right]$$
satisfies $E_2\big(\hat{s}_y'^2\big) = s_y'^2$; since the first phase is SRSWOR, $E_1\big(s_y'^2\big) = S_y^2$, and therefore $E\big(\hat{s}_y'^2\big) = S_y^2$.
Further, given the first sample,
$$E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'\,x_i} - \hat{\bar{Y}}\right)^2\right] = V_2\big(\hat{\bar{Y}} \mid n'\big),$$
and thus
$$E_1 E_2\left[\frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'\,x_i} - \hat{\bar{Y}}\right)^2\right] = \frac{(n' - 1)}{n\,n'\,N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}\,X_{tot} - Y_{tot}\right)^2, \qquad (2)$$
which gives an unbiased estimator of
$$\frac{(n' - 1)}{n\,n'\,N(N - 1)}\sum_{i=1}^{N}\frac{x_i}{X_{tot}}\left(\frac{y_i}{x_i}\,X_{tot} - Y_{tot}\right)^2.$$
Combining the two results, an unbiased estimator of $Var\big(\hat{\bar{Y}}\big)$ is
$$\widehat{Var}\big(\hat{\bar{Y}}\big) = \left(\frac{1}{n'} - \frac{1}{N}\right)\hat{s}_y'^2 + \frac{1}{n(n - 1)}\sum_{i=1}^{n}\left(\frac{x'_{tot}\,y_i}{n'\,x_i} - \hat{\bar{Y}}\right)^2.$$
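Putting the pieces together, a sketch of this variance estimator as a function (the argument names are illustrative; x_s1 holds the phase-1 x values, x_s2 and y_s2 the phase-2 data):

```python
import numpy as np

def var_hat_double_pps(x_s1, x_s2, y_s2, N):
    """Unbiased estimator of Var(Y_hat) for double sampling with SRSWOR
    at phase 1 and ppswr(x) at phase 2, following the derivation above."""
    n_prime, n = len(x_s1), len(x_s2)
    x_tot_prime = x_s1.sum()

    r = y_s2 / x_s2
    Y_hat = (x_tot_prime / (n_prime * n)) * r.sum()

    # Estimator of s_y'^2 (phase-1 mean square of y), with A and B as in the text
    A, B = r.sum()**2, (r**2).sum()
    s_hat = ((x_tot_prime / n) * (y_s2**2 / x_s2).sum()
             - x_tot_prime**2 * (A - B) / (n * n_prime * (n - 1))) / (n_prime - 1)

    # Unbiased estimator of E1[V2], from equation (2)
    v2_hat = ((x_tot_prime * y_s2 / (n_prime * x_s2) - Y_hat)**2).sum() / (n * (n - 1))

    return (1/n_prime - 1/N) * s_hat + v2_hat

# Example with small hypothetical arrays (n' = 6 first-phase x values, n = 3 subsample)
x_s1 = np.array([4.0, 9.0, 5.0, 7.0, 6.0, 10.0])
x_s2 = np.array([9.0, 5.0, 10.0])
y_s2 = np.array([20.0, 12.0, 23.0])
print(var_hat_double_pps(x_s1, x_s2, y_s2, N=50))
```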