
Advanced Sampling Theory

The document is a comprehensive text on advanced sampling theory and its applications, authored by Sarjinder Singh. It covers fundamental concepts, various sampling methods, and statistical estimators, providing a detailed exploration of both theoretical and practical aspects of sampling in statistics. The book is structured into multiple chapters, each focusing on specific topics related to sampling techniques and their implications in statistical analysis.

Advanced Sampling Theory with Applications
How Michael 'Selected' Amy
Volume I

by

Sarjinder Singh
St. Cloud State University,
Department of Statistics,
St. Cloud, MN, U.S.A.

Springer-Science+Business Media, B.V.
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-94-010-3728-0 ISBN 978-94-007-0789-4 (eBook)


DOI 10.1007/978-94-007-0789-4

Printed on acid-free paper

All Rights Reserved


© 2003 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 2003
No part of this work may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, microfilming,
recording or otherwise, without written permission from the Publisher, with the exception
of any material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work.
With whose grace...


TABLE OF CONTENTS

PREFACE
xxi

1 BASIC CONCEPTS AND MATHEMATICAL
NOTATION

1.0 Introduction 1
1.1 Population 1
1.1.1 Finite population 1
1.1.2 Infinite population 1
1.1.3 Target population 1
1.1.4 Study population 1
1.2 Sample 2
1.3 Examples of populations and samples 2
1.4 Census 2
1.5 Relative aspects of sampling versus census 2
1.6 Study variable 2
1.7 Auxiliary variable 3
1.8 Difference between study variable and auxiliary variable 3
1.9 Parameter 3
1.10 Statistic 3
1.11 Statistics 4
1.12 Sample selection 4
1.12.1 Chit method or Lottery method 4
1.12.1.1 With replacement sampling 4
1.12.1.2 Without replacement sampling 5
1.12.2 Random number table method 5
1.12.2.1 Remainder method 6
1.13 Probability sampling 7
1.14 Probability of selecting a sample 7
1.15 Population mean/total 8
1.16 Population moments 8
1.17 Population standard deviation 8
1.18 Population coefficient of variation 8
1.19 Relative mean square error 9
1.20 Sample mean 9
1.21 Sample variance 9
1.22 Estimator 10
1.23 Estimate 10
1.24 Sample space 10
1.25 Univariate random variable 11
1.25.1 Qualitative random variables 11
VIII Advanced sampling theory with applications

1.25.2 Quantitative random variables 11


1.25.2.1 Discrete random variable 11
1.25.2.2 Continuous random variable 11
1.26 Probability mass function (p.m.f.) of a univariate discrete random
variable 12
1.27 Probability density function (p.d.f.) of a univariate continuous
random variable 12
1.28 Expected value and variance of a univariate random variable 13
1.29 Distribution function of a univariate random variable 13
1.29.1 Discrete distribution function 14
1.29.2 Continuous distribution function 14
1.30 Selection of a sample using known univariate distribution function 15
1.30.1 Discrete random variable 15
1.30.2 Continuous random variable 17
1.31 Discrete bivariate random variable 19
1.32 Joint probability distribution function of bivariate discrete random
variables 20
1.33 Joint cumulative distribution function of bivariate discrete random
variables 20
1.34 Marginal distributions of a bivariate discrete random variable 20
1.35 Selection of a sample using known discrete bivariate distribution
function 20
1.36 Continuous bivariate random variable 21
1.37 Joint probability distribution function of bivariate continuous
random variable 21
1.38 Joint cumulative distribution function of a bivariate continuous
random variable 22
1.39 Marginal cumulative distributions of bivariate continuous random
variable 22
1.40 Selection of a sample using known bivariate continuous
distribution function 22
1.41 Properties of a best estimator 24
1.41.1 Unbiasedness 24
1.41.1.1 Bias 28
1.41.2 Consistency 28
1.41.3 Sufficiency 28
1.41.4 Efficiency 29
1.41.4.1 Variance 29
1.41.4.2 Mean square error 29
1.42 Relative efficiency 29
1.43 Relative bias 29
1.44 Variance estimation through splitting 30
1.45 Loss function 31
1.46 Admissible estimator 31
1.47 Sample survey 31
1.48 Sampling distribution 32
1.49 Sampling frame 33

1.50 Sample survey design 33


1.51 Errors in the estimators 33
1.51.1 Sampling errors 34
1.51.2 Non-sampling errors 34
1.51.2.1 Non-response errors 35
1.51.2.2 Measurement errors 35
1.51.2.3 Tabulation errors 35
1.51.2.4 Computational errors 35
1.52 Point estimator 35
1.53 Interval estimator 35
1.54 Confidence interval 35
1.55 Population proportion 38
1.56 Sample proportion 38
1.57 Variance of sample proportion and confidence interval estimates 39
1.58 Relative standard error 50
1.59 Auxiliary information 50
1.60 Some useful mathematical formulae 56
1.61 Ordered statistics 57
1.61.1 Population median 57
1.61.2 Population quartiles 58
1.61.3 Population percentiles 59
1.61.4 Population mode 59
1.62 Definition(s) of statistics 59
1.63 Limitations of statistics 60
1.64 Lack of confidence in statistics 60
1.65 Scope of statistics 60
Exercises 60
Practical problems 63

2 SIMPLE RANDOM SAMPLING

2.0 Introduction 71
2.1 Simple random sampling with replacement 71
2.2 Simple random sampling without replacement 79
2.3 Estimation of population proportion 94
2.4 Searls' estimator of population mean 103
2.5 Use of distinct units in the WR sample at the estimation stage 106
2.5.1 Estimation of mean 107
2.5.2 Estimation of finite population variance 113
2.6 Estimation of total or mean of a subgroup (domain) of a population 118
2.7 Dealing with a rare attribute using inverse sampling 123
2.8 Controlled sampling 125
2.9 Determinant sampling 127
Exercises 128
Practical problems 132

3 USE OF AUXILIARY INFORMATION: SIMPLE
RANDOM SAMPLING

3.0 Introduction 137


3.1 Notation and expected values 137
3.2 Estimation of population mean 138
3.2.1 Ratio estimator 138
3.2.2 Product estimator 145
3.2.3 Regression estimator 149
3.2.4 Power transformation estimator 160
3.2.5 A dual of ratio estimator 161
3.2.6 General class of estimators 164
3.2.7 Wider class of estimators 166
3.2.8 Use of known variance of auxiliary variable at estimation
stage of population mean 167
3.2.8.1 A class of estimators 167
3.2.8.2 A wider class of estimators 169
3.2.9 Methods to remove bias from ratio and product type
estimators 173
3.2.9.1 Quenouille's method 173
3.2.9.2 Interpenetrating sampling method 175
3.2.9.3 Exactly unbiased ratio type estimator 180
3.2.9.4 Unbiased product type estimator 183
3.2.9.5 Class of almost unbiased estimators of population
ratio and product 185
3.2.9.6 Filtration of bias 187
3.3 Estimation of finite population variance 191
3.3.1 Ratio type estimator 192
3.3.2 Difference type estimator 197
3.3.3 Power transformation type estimator 198
3.3.4 General class of estimators 199
3.4 Estimation of regression coefficient 203
3.4.1 Usual estimator 203
3.4.2 Unbiased estimator 204
3.4.3 Improved estimators of regression coefficient 207
3.5 Estimation of finite population correlation coefficient 209
3.6 Superpopulation model approach 214
3.6.1 Relationship between linear model and regression
estimator 214
3.6.2 Improved estimator of variance of linear regression
estimator 217
3.6.3 Relationship between linear model and ratio estimator 221
3.7 Jackknife variance estimator 223
3.7.1 Ratio estimator 223
3.7.2 Regression estimator 226

3.8 Estimation of population mean using more than one auxiliary


variable 229
3.8.1 Multivariate ratio estimator 230
3.8.2 Multivariate regression type estimators 231
3.8.3 General class of estimators 239
3.9 General class of estimators to estimate any population parameter 245
3.10 Estimation of ratio or product of two population means 248
3.11 Median estimation in survey sampling 250
Exercises 257
Practical problems 281

4 USE OF AUXILIARY INFORMATION: PROBABILITY
PROPORTIONAL TO SIZE AND WITH
REPLACEMENT (PPSWR) SAMPLING

4.0 Introduction 295


4.1 What is PPSWR sampling? 295
4.1.1 Cumulative total method 300
4.1.2 Lahiri's method 303
4.2 Estimation of population total 306
4.3 Relative efficiency of PPSWR sampling with respect to SRSWR
sampling 312
4.3.1 Superpopulation model approach 312
4.3.2 Cost aspect 315
4.4 PPSWR sampling: More than one auxiliary variable is available 317
4.4.1 Notation and expectations 318
4.4.2 Class of estimators 319
4.4.3 Wider class of estimators 320
4.4.4 PPSWR sampling with negatively correlated variables 324
4.5 Multi-character survey 326
4.5.1 Study variables have poor positive correlation with the
selection probabilities 326
4.5.1.1 General class of estimators 335
4.5.2 Study variables have poor positive as well as poor negative
correlation with the selection probabilities 336
4.6 Concept of revised selection probabilities 339
4.7 Estimation of correlation coefficient using PPSWR sampling 340
Exercises 341
Practical problems 345

5 USE OF AUXILIARY INFORMATION:
PROBABILITY PROPORTIONAL TO SIZE AND
WITHOUT REPLACEMENT (PPSWOR) SAMPLING

5.0 Introduction 349


5.0.1 Useful symbols 349
5.0.2 Some mathematical relations 349
5.1 Horvitz and Thompson estimator and related topics 351
5.2 General class of estimators 373
5.3 Model based estimation strategies 375
5.3.1 A brief history of the superpopulation model 377
5.3.2 Scott, Brewer and Ho's robust estimation strategy 378
5.3.3 Design variance and anticipated variance of linear
regression type estimator 383
5.4 Construction and optimal choice of inclusion probabilities 385
5.4.1 Pareto πps sampling estimation scheme 386
5.4.2 Hanurav's method 387
5.4.3 Brewer' s method 388
5.4.4 Sampford's method 389
5.4.5 Narain's method 390
5.4.6 Midzuno--Sen method 390
5.4.7 Kumar--Gupta--Nigam scheme 391
5.4.8 Dey and Srivastava scheme for even sample size 392
5.4.9 SSS sampling scheme 393
5.4.10 Optimal choice of first order inclusion probabilities 394
5.5 Calibration approach 399
5.6 Calibrated estimator of the variance of the estimator of population
total 409
5.7 Estimation of variance of GREG 413
5.8 Improved estimator of variance of the GREG: The higher level
calibration approach 419
5.8.1 Recalibrated estimator of the variance of GREG 424
5.8.2 Recalibration using optimal designs for the GREG 426
5.9 Calibrated estimators of variance of estimator of total and
distribution function 428
5.9.1 Unified setup 430
5.10 Calibration of estimator of variance of regression predictor 431
5.10.1 Chaudhuri and Roy's results 433
5.10.2 Calibrated estimators of variance of regression predictor 436
5.10.2.1 Model assisted calibration 436
5.10.2.2 Calibration estimators when variance of auxiliary
variable is known 440
5.10.2.2.1 Each component of Vx is known 441
5.10.2.2.2 Compromised calibration 442
5.10.2.3 Prediction variance 444

5.11 Ordered and unordered estimators 444
5.11.1 Ordered estimators 445
5.11.2 Unordered estimators 449
5.12 Rao--Hartley--Cochran (RHC) sampling strategy 452
5.13 Unbiased strategies using IPPS sampling schemes 462
5.13.1 Estimation of population mean using a ratio estimator 462
5.13.2 Estimation of finite population variance 464
5.14 Godambe's strategy: Estimation of parameters in survey sampling 465
5.14.1 Optimal estimating function 470
5.14.2 Regression type estimators 472
5.14.3 Singh's strategy in two-dimensional space 473
5.14.4 Godambe's strategy for linear Bayes and optimal
estimation 476
5.15 Unified theory of survey sampling 479
5.15.1 Class of admissible estimators 479
5.15.2 Estimator 479
5.15.3 Admissible estimator 479
5.15.4 Strictly admissible estimator 479
5.15.5 Linear estimators of population total 483
5.15.6 Admissible estimators of variances of estimators of total 485
5.15.6.1 Condition for the unbiased estimator of variance 485
5.15.6.2 Admissible and unbiased estimator of variance 485
5.15.6.3 Fixed size sampling design 485
5.15.6.4 Horvitz and Thompson estimator and its variance
in two forms 485
5.15.7 Polynomial type estimators 489
5.15.8 Alternative optimality criterion 490
5.15.9 Sufficient statistic in survey sampling 491
5.16 Estimators based on conditional inclusion probabilities 493
5.17 Current topics in survey sampling 494
5.17.1 Survey design 495
5.17.2 Data collection and processing 495
5.17.3 Estimation and analysis of data 496
5.18 Miscellaneous discussions/topics 497
5.18.1 Generalized IPPS designs 497
5.18.2 Tam's optimal strategies 498
5.18.3 Use of ranks in sample selection 498
5.18.4 Prediction approach 498
5.18.5 Total of bottom (or top) percentiles of a finite population 499
5.18.6 General form of estimator of variance 499
5.18.7 Poisson sampling 499
5.18.8 Cosmetic calibration 500
5.18.9 Mixing of non-parametric models in survey sampling 501
5.19 Golden Jubilee Year 2003 of the linear regression estimator 504
Exercises 507
Practical Problems 520

6 USE OF AUXILIARY INFORMATION:
MULTI-PHASE SAMPLING

6.0 Introduction 529


6.1 SRSWOR scheme at the first as well as at the second phases of the
sample selection 530
6.1.0 Notation and expected values 530
6.1.1 Ratio estimator 532
6.1.1.1 Cost function 535
6.1.2 Difference estimator 539
6.1.3 Regression estimator 540
6.1.4 General class of estimators of population mean 541
6.1.5 Estimation of finite population variance 544
6.1.6 Calibration approach in two-phase sampling 545
6.2 Two-phase sampling using two auxiliary variables 549
6.3 Chain ratio type estimators 554
6.4 Calibration using two auxiliary variables 555
6.5 Estimation of variance of calibrated estimator in two-phase
sampling: low and higher level calibration 560
6.6 Two-phase sampling using multi-auxiliary variables 563
6.7 Unified approach in two-phase sampling 563
6.8 Concept of three-phase sampling 565
6.9 Estimation of variance of regression estimator under two-phase
sampling 567
6.10 Two-phase sampling using PPSWR sampling 572
6.11 Concept of dual frame surveys 576
6.11.1 Common variables used for further calibration of weights 576
6.11.2 Estimation of variance using dual frame surveys 577
6.12 Estimation of median using two-phase sampling 578
6.12.1 General class of estimators 578
6.12.2 Regression type estimator 579
6.12.3 Position estimator 581
6.12.4 Stratification estimator 582
6.12.5 Optimum first and second phase samples for median
estimation 584
6.12.5.1 Cost is fixed 584
6.12.5.2 Variance is fixed 584
6.12.6 Kuk and Mak's technique in two-phase sampling 584
6.12.7 Chen and Qin technique in two-phase sampling 586
6.13 Distribution function with two-phase sampling 588
6.14 Improved version of two-phase calibration approach 590
6.14.1 Improved first phase calibration 590
6.14.2 Improved second phase calibration 592
Exercises 594
Practical problems 612
VOLUME II
7 SYSTEMATIC SAMPLING

7.0 Introduction 615


7.1 Systematic sampling 615
7.2 Modified systematic sampling 620
7.3 Circular systematic sampling 621
7.4 PPS circular systematic sampling 623
7.5 Estimation of variance under systematic sampling 624
7.5.1 Sub-sampling or replicated sub-sampling scheme 625
7.5.2 Successive differences 626
7.5.3 Variance of circular systematic sampling 627
7.6 Systematic sampling in population with linear trend 627
7.6.1 Estimators with linear trend 627
7.6.2 Modification of estimates 629
7.6.3 Estimators based on centrally located samples 631
7.6.4 Estimators based on balanced systematic sampling 633
7.7 Singh and Singh 's systematic sampling scheme 635
7.8 Zinger strategy in systematic sampling 637
7.9 Populations with cyclic or periodic trends 638
7.10 Multi-dimensional systematic sampling 639
Exercises 642
Practical problems 646

8 STRATIFIED AND POST-STRATIFIED SAMPLING

8.0 Introduction 649


8.1 Stratified sampling 650
8.2 Different methods of sample allocation 659
8.2.1 Equal allocation 659
8.2.2 Proportional allocation 659
8.2.3 Optimum allocation method 662
8.3 Use of auxiliary information at estimation stage 676
8.3.1 Separate ratio estimator 677
8.3.2 Separate regression estimator 681
8.3.3 Combined ratio estimator 684
8.3.4 Combined regression estimator 688
8.3.5 On degree of freedom in stratified random sampling 693
8.4 Calibration approach for stratified sampling design 696
8.4.1 Exact combined linear regression using calibration 700
8.5 Construction of strata boundaries 701
8.5.1 Strata boundaries for proportional allocation 702
8.5.2 Strata boundaries for Neyman allocation 703
8.5.3 Stratification using auxiliary information 708
8.6 Superpopulation model approach 712
8.7 Multi-way stratification 713

8.8 Stratum boundaries for multi-variate populations 718


8.9 Optimum allocation in multi-variate stratified sampling 723
8.10 Stratification using two-phase sampling 726
8.11 Post-stratified sampling 729
8.11.1 Conditional post-stratification 730
8.11.2 Unconditional post-stratification 731
8.12 Estimation of proportion using stratified random sampling 735
Exercises 738
Practical problems 748

9 NON-OVERLAPPING, OVERLAPPING, POST, AND
ADAPTIVE CLUSTER SAMPLING

9.0 Introduction 765


9.1 Non-overlapping clusters of equal size 766
9.2 Optimum value of non-overlapping cluster size 790
9.3 Estimation of proportion using non-overlapping cluster sampling 792
9.4 Non-overlapping clusters of different sizes 796
9.5 Selection of non-overlapping clusters with unequal probability
sampling 805
9.6 Optimal and robust strategies for non-overlapping cluster sampling 808
9.7 Overlapping cluster sampling 812
9.7.1 Population size is known 812
9.7.2 Population size is unknown 814
9.8 Post-cluster sampling 817
9.9 Adaptive cluster sampling 819
Exercises 820
Practical problems 822

10 MULTI-STAGE, SUCCESSIVE, AND RE-SAMPLING
STRATEGIES

10.0 Introduction 829


10.1 Notation 830
10.2 Procedure for construction of estimators of the total 831
10.3 Method of calculating the variance of the estimators 833
10.3.1 Selection of first and second stage units using SRSWOR
sampling 834
10.3.2 Optimum allocation in two-stage sampling 836
10.4 Optimum allocation of sample in three-stage sampling 837
10.5 Modified three-stage sampling 838
10.6 General class of estimators in two-stage sampling 839
10.7 Prediction estimator under two-stage sampling 842
10.8 Prediction approach to robust variance estimation in two-stage
cluster sampling 844

10.8.1 Royall's technique of variance estimation 846


10.9 Two-stage sampling with successive occasions 847
10.9.1 Arnab's successive sampling scheme 848
10.10 Estimation strategies in supplemented panels 865
10.11 Re-sampling methods 866
10.11.1 Jackknife variance estimator 867
10.11.2 Balanced half sample (BHS) method 871
10.11.3 Bootstrap variance estimator 873
Exercises 873
Practical problems 887

11 RANDOMIZED RESPONSE SAMPLING: TOOLS FOR
SOCIAL SURVEYS

11.0 Introduction 889


11.1 Pioneer model 889
11.2 Franklin 's model 892
11.3 Unrelated question model and related issues 897
11.3.1 When proportion of unrelated character is known 897
11.3.2 When proportion of unrelated character is unknown 898
11.4 Regression analysis 903
11.4.1 Ridge regression estimator 905
11.5 Hidden gangs in finite populations 907
11.5.1 Two sample method 907
11.5.2 One sample method 911
11.5.3 Estimation of correlation coefficient between two
characters of a hidden gang 912
11.6 Unified approach for hidden gangs 916
11.7 Randomized response technique for a quantitative variable 920
11.8 GREG using scrambled responses 924
11.8.1 Calibration of scrambled responses 925
11.8.2 Higher order calibration of the estimators of variance
under scrambled responses 928
11.8.3 General class of estimators 930
11.9 On respondent's protection: Qualitative characters 930
11.9.1 Leysieffer and Warner's measure 930
11.9.2 Lanke's measure 932
11.9.3 Mangat and Singh's two-stage model 933
11.9.4 Mangat and Singh's two-stage and Warner's model at
equal level of protection 935
11.9.5 Mangat's model 939
11.9.6 Mangat's and Warner's model at equal level of protection 940
11.10 On respondent's protection: Quantitative characters 942
11.10.1 Unrelated question model for quantitative data 942
11.10.2 The additive model 943
11.10.3 The multiplicative model 943

11.10.4 Measure of privacy protection 944


11.10.5 Comparison between additive and multiplicative models 945
11.11 Test for detecting untruthful answering 949
11.12 Stochastic randomized response technique 951
Exercises 954
Practical problems 972

12 NON-RESPONSE AND ITS TREATMENTS

12.0 Introduction 975


12.1 Hansen and Hurwitz pioneer model 976
12.2 Politz and Simmons model 980
12.3 Horvitz and Thompson estimator under non-response 984
12.4 Ratio and regression type estimators 986
12.4.1 Distribution and some expected values 987
12.4.2 Estimation of population mean 987
12.4.3 Estimation of finite population variance 993
12.5 Calibrated estimators of total and variance in the presence of
non-response 1000
12.5.1 Estimation of population total and variance 1000
12.5.2 Calibration estimator for the total 1002
12.5.3 Calibration of the estimators of variance 1003
12.5.3.1 PPSWOR Sampling 1005
12.5.3.2 SRSWOR Sampling 1007
12.6 Different treatments of non-response 1009
12.6.1 Ratio method of imputation 1010
12.6.2 Mean method of imputation 1010
12.6.3 Hot deck (HD) method of imputation 1010
12.6.4 Nearest neighbor (NN) method of imputation 1011
12.7 Superpopulation model approach 1013
12.7.1 Different components of variance 1014
12.8 Jackknife technique 1016
12.9 Hot deck imputation for multi-stage designs 1017
12.10 Multiple imputation 1021
12.10.1 Degree of freedom with multiple imputation for small
samples 1024
12.11 Compromised imputation 1025
12.11.1 Practicability of compromised imputation 1027
12.11.2 Recommendations of compromised imputation 1027
12.11.3 Warm deck imputation 1028
12.11.4 Mean cum NN imputation 1028
12.12 Estimation of response probabilities 1031
12.13 Estimators based on estimated response probabilities 1033
12.13.1 Estimators based on response probabilities 1035
12.13.2 Calibration of response probabilities 1037
12.13.2.1 Calibrated estim ator and its variance 1038

12.13.2.2 Estimation of variance of the calibrated


estimator 1039
Exercises 1041
Practical problems 1058

13 MISCELLANEOUS TOPICS

13.0 Introduction 1065


13.1 Estimation of measurement errors 1065
13.1.1 Estimation of measurement error using a single
measurement per element 1066
13.1.1.1 Model and notation 1066
13.1.1.2 Grubbs' estimators 1066
13.1.2 Bhatia, Mangat, and Morrison's (BMM) repeated
measurement estimators 1068
13.1.2.1 Model and notation 1069
13.2 Raking ratio using contingency tables 1073
13.3 Continuous populations 1077
13.4 Small area estimation 1081
13.4.1 Symptomatic accounting techniques 1081
13.4.2 Vital rates method (VRM) 1081
13.4.3 Census component method (CCM) 1082
13.4.4 Housing unit method (HUM) 1083
13.4.5 Synthetic estimator 1083
13.4.6 Composite estimator 1086
13.4.7 Model based techniques 1090
13.4.7.1 Henderson's model 1090
13.4.7.2 Nested error regression model 1093
13.4.7.3 Random regression coefficient model 1095
13.4.7.4 Fay and Herriot model 1097
13.4.8 Further generalizations 1097
13.4.9 Estimation of proportion of a characteristic in small areas
of a population 1099
Exercises 1101
Practical problems 1101

APPENDIX

TABLES

1 Pseudo-Random Numbers (PRN) 1105


2 Critical values based on t distribution 1107
3 Area under the standard normal curve 1109

POPULATIONS

1 All operating banks: Amount (in $000) of agricultural loans


outstanding in different states in 1997 1111
2 Hypothetical situation of a small village having only 30 older
persons (age more than 50 years): Approximate duration of sleep
(in minutes) and age (in years) of the persons 1113
3 Apples, commercial crop: Season average price (in $) per pound, by
States, 1994-1996 1114
4 Fish caught: Estimated number of fish caught by marine
recreational fishermen by species group and year, Atlantic and Gulf
coasts, 1992-1995 1116
5 Tobacco: Area (hectares), yield and production (metric tons) in
specified countries during 1998 1119
6 Age specific death rates from 1990 to 2065 (Number per 100,000
births) 1123
7 State population projections, 1995 and 2000 (Number in thousands) 1124
8 Projected vital statistics by country or area during 2000 1126
9 Number of immigrants admitted to the USA 1129

BIBLIOGRAPHY 1131
AUTHOR INDEX 1193
HANDY SUBJECT INDEX 1215
ADDITIONAL INFORMATION 1219
PREFACE

Advanced Sampling Theory with Applications: How Michael 'Selected' Amy is


a comprehensive exposition of basic and advanced sampling techniques along with
their applications in the diverse fields of science and technology.

This book is a multi-purpose document. It can be used as a text by teachers, as a


reference manual by researchers, and as a practical guide by statisticians. It covers
1179 references from different research journals through almost 2158 citations
across 1248 pages, a large number of complete proofs of theorems, important
results such as corollaries, and 335 unsolved exercises from several research papers.
It includes 162 solved, data based, real life numerical examples in disciplines such
as Agriculture, Demography, Social Science, Applied Economics, Engineering,
Medicine, and Survey Sampling. These solved examples are very useful for an
understanding of the applications of advanced sampling theory in our daily life and
in diverse fields of science. An additional 177 unsolved practical problems are
given at the ends of the chapters. University and college professors may find these
useful when assigning exercises to students. Each exercise gives exposure to several
complete research papers for researchers/students. For example, by referring to
Exercise 3.1 at the back of Chapter 3, different types of estimators of a population
mean studied by Chakrabarty (1968), Vos (1980), Adhvaryu and Gupta (1983),
Walsh (1970), Sahai and Sahai (1985) and Sisodia and Dwivedi (1981) are
examined. Thus, this single exercise covers six research papers. Similarly,
Exercise 5.7 explains the other possibilities in the calibration approach considered
by Deville and Sarndal (1992) and their followers.
The data based problems show statisticians how to select a sample and obtain
estimates of parameters from a given population by using different sampling
strategies like SRSWR, SRSWOR, PPSWR, PPSWOR, RHC, systematic sampling,
stratified sampling, cluster sampling, and multi-stage sampling. Derivations of
calibration weights from the design weights under single phase and two-phase
sampling have been provided for simple numerical examples. These examples will
be useful to understand the meaning of benchmarks to improve the design weights.
These examples also explain the background of well known scientific computer
packages such as CALMAR, GES, SAS, STATA, and SUDAAN, some of which are
very expensive, that most organizations in the public and private sectors use to
generate calibration weights. The ideas of hot deck, cold deck, mean method of
imputation, ratio method of imputation, compromised imputation, and multiple
imputation have been explained with very simple numerical examples. Simple
examples are also provided to understand Jackknife variance estimation under
single phase, two-phase [or random non-response by following Sitter (1997)] and
multi-stage stratified designs.
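The derivation of calibration weights from design weights mentioned above can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (the function names and toy numbers are mine, not taken from the book): draw an SRSWOR sample, attach the design weights N/n, and then rescale them so that the weighted sample total of an auxiliary variable x reproduces its known population total — the single-benchmark (ratio) special case of the chi-square calibration of Deville and Sarndal (1992).

```python
import random

def srswor_design_weights(population_size, sample_size, seed=0):
    """Draw an SRSWOR sample; every sampled unit gets design weight
    d = N/n, the inverse of its inclusion probability n/N."""
    rng = random.Random(seed)
    sample = rng.sample(range(population_size), sample_size)
    return sample, [population_size / sample_size] * sample_size

def ratio_calibrated_weights(design_weights, x_sample, x_total):
    """Rescale the design weights so the weighted total of the auxiliary
    variable x equals its known population benchmark x_total:
        w_i = d_i * x_total / sum_j(d_j * x_j)
    (the single-benchmark case of chi-square calibration)."""
    x_hat = sum(d * x for d, x in zip(design_weights, x_sample))
    return [d * x_total / x_hat for d in design_weights]

# Hypothetical toy data: n = 4 units from N = 20; x is observed on the
# sample, and the population total of x is assumed known (a benchmark).
sample, d = srswor_design_weights(population_size=20, sample_size=4)
x = [3.0, 5.0, 2.0, 4.0]
w = ratio_calibrated_weights(d, x, x_total=80.0)

# The design-weighted total of x is 5 * 14 = 70; calibration rescales every
# weight by 80/70, so the benchmark is reproduced (up to rounding).
print(sum(wi * xi for wi, xi in zip(w, x)))
```

With several benchmarks and a distance function between w and d, the same mechanism yields GREG-type calibration weights of the kind produced by packages such as CALMAR.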

I have provided a summary of my book from which a statistician can reach a fruitful
decision by comparing it with the existing books on the international market.

Title(s) 4
Dedication 2
Table of contents 14
Preface 8
1    70  13  11  20    2  58
2    66  20  22  19   58  24
3   158  36  68  38  307  61
4    54   9  15  10   84  26
5   180  13  43  15  651  43
6    86  10  29  10  170  21
7    34   8  17   9   72  23
8   116  21  24  19  112  70
9    64  12  11  14   61  57
10   60   3  31   4  162  13
11   86   3  33   5  216   7
12   90   8  24   9  154  28
13   40   6   7   5  100  15
Appendix 26 12
Bibliography 62
Author Index 22
Subject Index 4
Related Books 2

This book also covers, in a very simple and compact way, many new topics not yet
available in any book on the international market. A few of these interesting topics
are: median estimation under single phase and two-phase sampling, difference
between low level and higher level calibration approach, calibration weights and
design weights, estimation of parametric function s, hidden gangs in finite
populations, compromised imputation, variance estimation using distinct units ,
general class of estimators of popul ation mean and variance, wider class of
estimators of population mean and variance, power transformation estimators,
estimators based on the mean of non-sampled units of the auxiliary character, ratio
and regression type estimators for estimating finite population variance similar to
those proposed by Isaki (1982), unbiased estimators of mean and variance under
Midzuno's scheme of sampling, usual and modified jackknife variance estimator,

estimation of regression coefficient, concept of revised selection probabilities,


multi-character survey sampling, overlapping, adaptive, and post cluster sampling,
new techniques in systematic sampling, successive sampling, small area estimation,
continuous populations, and estimation of measurement errors.

This book has 459 tables, figures, maps, and graphs to explain the exercises and
theory in a simple way. The collection of 1179 references (assembled over more
than ten years from journals available in India, Australia, Canada, and the USA) is a
vital resource for researchers. The most interesting part is the method of notation
along with complete proofs of the basic theorems. From my experience and
discussion with several research workers in survey sampling, I found that most
people dislike the form or method of notation used by different writers in the past.
In the book I have tried to keep these notations simple, neat, and understandable. I
used data relating to the United States of America and other countries of the world,
so that international students should find it interesting and easy to understand. I am
confident that the book will find a good place and reputation in the international
market, as there is currently no book which is so thorough and simple in its
presentation of the subject of survey sampling.

The objective, style, and pattern of this book are quite different from other books
available in the market. This book will be helpful to:

( a ) Graduates and undergraduates majoring in statistics and programs where


sampling techniques are frequently used;
( b ) Graduates currently involved in M.Sc. or Ph.D. programs in sampling theory
or using sampling techniques in their research;
( c ) Government organizations such as the US Bureau of Statistics, the Statistics
Canada, the Australian Bureau of Statistics, the New Zealand Bureau of Statistics,
and the Indian Statistical Institute, in addition to private organizations such as
RAND and WESTSTAT, etc.

In this book I have begun each chapter with basic concepts and complete
derivations of the theorems or results. I have ended each chapter by filling the gap
between the origin of each topic and the recent references. In each chapter I have
provided exercises which summarize the research papers. Thus this book not only
gives the basic techniques of sampling theory but also reviews most of the research
papers available in the literature related to sampling theory. It will also serve as an
umbrella of references under different topics in sampling theory, in addition to
clarifying the basic mathematical derivations. In short, it is an advanced book, but
it provides an exposure to elementary ideas too. It is a much better restatement of
the existing knowledge available in journals and books. I have used data, graphs,
tables, and pictures to make sampling techniques clear to learners.

EXERCISES

At the end of each chapter I have provided exercises, and their solutions are given
through references to the related research papers. Exercises can be used to clarify
or relate the classroom work to the other possibilities in the literature.

At the end of each chapter I have also provided practical problems which enable
students and teachers to do additional exercises with real data.

I have taken real data related to the United States of America and many other
countries around the world. This data is freely available in libraries for public use
and has been provided in the Appendix of this book for the convenience of the
readers. It will be interesting to international students.

NEW TECHNOLOGIES


This section provides students and researchers with new formulae available in the
literature, which can be used to develop new computer programs for estimating
parameters in survey sampling and to learn basic statistical techniques.

SOLUTION MANUAL
I am working on a complete solution manual to the practical problems and selected
theoretical exercises given at the end of the chapters.

I was born in the village of Ajnoud, in the district of Ludhiana, in the state of
Punjab, India, in 1963. I received my primary education at the Govt. Primary
School, Ajnoud; the Govt. Middle School, Bilga; and the Govt. High School,
Sahnewal, which are near my birthplace. I did my undergraduate work at Govt.
College Karamsar, Rarra Sahib. I still remember that I used to bicycle my way to
college, about 15 km daily, along the bank of canals. It was fun, and that life has
never come back. I completed my M.Sc. and Ph.D. degrees in statistics at the
Punjab Agricultural University (PAU), Ludhiana, and spent most of that time in
room no. 46 of hostel no. 5.

I attended conferences of the Indian Society of Agricultural Statistics held at
Gujarat, Haryana, Orissa, and Kerala, and was a winner of the Gold Medal in 1994
for the Young Scientist Award. I attended conferences of the Australian Statistical
Society in Sydney and the Gold Coast. I attended a conference of the International
Indian Statistical Association at Hamilton, and the Statistical Society of Canada
conferences at Hamilton, Regina, and Halifax, in addition to the Concordia
University conference. I also attended the Joint Statistical Meetings (JSM-2001,
2002) at Atlanta and New York.

At present I am an Assistant Professor at St. Cloud State University, St. Cloud, MN,
USA, where I recently introduced the idea of obtaining the exact traditional linear
regression estimator using the calibration approach. From 2001 to 2002 I did
post-doctoral work at Carleton University, Canada. From 2000 to 2001 I was a
Visiting Instructor at the University of Saskatchewan, Canada. From 1999 to 2000 I
was a Visiting Instructor at the University of Southern Maine, USA, where I taught
several courses to undergraduate and graduate students and introduced the idea of
compromised imputation in survey sampling. From 1998 to 1999 I was a Visiting
Scientist at the University of Windsor, Canada. From 1996 to 1998 I was Research
Officer-II in the Methodology Division of the Australian Bureau of Statistics, where
I developed the higher order calibration approach for estimating the variance of the
GREG, and introduced the concept of hidden gangs in finite populations. From 1995
to 1996 I was a Research Assistant at Monash University, Australia. From 1991 to
1995 I was a Research Fellow, Assistant Statistician, and then Assistant Professor
at PAU, Ludhiana, India, where I was awarded a Ph.D. in statistics in 1991. I have
published over 80 research papers in reputed journals of statistics and energy
science. I am also co-author of a monograph entitled Energy in Punjab Agriculture,
published by the Indian Council of Agricultural Research, New Delhi.

Advanced Sampling Theory with Applications is my additional achievement. In
this book you can enjoy my new ideas, such as:

"How did Michael select Amy?"

"How can you weigh elephants in a circus?"

and

"How many girls like Bob?"

in addition to higher order calibration, bias filtration, hybridising imputation and
calibration techniques, hidden gangs, median estimation using two-phase sampling,
several new randomised response models, and exact traditional linear regression
using the calibration technique, etc.

ACKNOWLEDGEMENTS

Indeed, the words at my command are not adequate to convey my feelings of
gratitude toward the late Prof. Ravindra Singh for his constant, untiring, and ever
encouraging support since 1996, when I started writing this book. Prof. Ravindra
Singh passed away Feb. 4, 2003, which is a great loss to his erstwhile students and
colleagues, including me. He was my major advisor for my Ph.D. and was closely
associated with my research work. Since 1996 Mr. Stephen Horn, supervisor at the
Australian Bureau of Statistics, always encouraged me to complete this book, and I
appreciate his sincere co-operation, contribution, and kindness in joint research
papers as well as guidance to complete this book. The help of Prof. M.L. King,
Monash University, is also appreciated. I started writing this book while staying
with Dr. Jaswinder Singh, his wife Dr. Rajvinder Kaur, and their daughter Miss
Jasraj Kaur in Australia during 1996. For almost seven years I worked day and
night on this book, and during May-July, 2003, I rented a room near an Indian
restaurant in Malton, Canada to save cooking time and spent most of my time on
this book.

Thanks are due to Prof. Ragunath Arnab, University of Durban-Westville, for help
in completing the work in Chapter 10 related to his contribution in successive
sampling, and for completing some joint research papers. The help of Prof. H.P.
Singh, Vikram University, in joint publications is also duly acknowledged.

The contribution of the late Prof. D.S. Tracy, University of Windsor, in reading a
few chapters of the very early draft of the manuscript is also duly acknowledged.
The contribution of Ms. Margot Siekman, University of Southern Maine, in reading
a few chapters is also duly acknowledged. Thanks are also due to a professional
editor, Kathlean Prendergast, University of Saskatchewan, for critically checking
the grammar and punctuation of a few chapters. Prof. M. Bickis, University of
Saskatchewan, really helped me in my career when I was on the road looking for a
job, going from university to university in Canada. The help of Prof. Silvia Valdes
and Ms. Laurie McDermott, University of Southern Maine, has been much
appreciated. Thanks are also due to Professor Patrick Farrell, Carleton University,
for giving me a chance to work with him as a post-doctoral fellow. Thanks are also
due to Prof. David Robinson at SCSU for providing a very peaceful work
environment in the department. The aid of one Stat 321 student, Miss Kok Yuin
Ong, in cross-checking all the solved numerical examples, and of a professional
English editor, Mr. Eric Westphal, in reading the entire manuscript at SCSU, is
much appreciated. Thanks are also due to a professional editor, Dr. M. Cole from
England, for editing the complete manuscript and bringing it into its present form.
The help of Mary Shrode and Mitra Sangrovla, Learning Resources and Technology
Service, SCSU, in drawing a few illustrations using the NOVA Art Explosion
600,000 image collection is duly acknowledged.

I am also thankful to the galaxy of my friends and colleagues, viz., Dr. Inderjit
Grewal (PAU), Dr. B.R. Garg (PAU), Dr. Sukhjinder Sidhu (PAU), Prof. L.N.
Upadhyaya (Indian School of Mines), Er. Amarjot Singh (Australia), Mr. Qasim
Shah (Australia), Mr. Kuldeep Virdi (Canada), Mr. Kulwinder Channa (Canada),
Prof. Balbinder Deo (Canada), Er. Mohan Jhajj (Canada), Mr. Gurbakhash Ubhi
(Canada), Mr. Gurmeet Ghatore (USA), Dr. Gurjit Sidhu (USA), Prof. Balwant
Singh (USA), Prof. Munir Mahmood (USA), and Mr. Suman Kumar (USA). All
cannot be listed, but none is forgotten. I met uncle Mr. Trilochan Singh at Ottawa,
who changed my style of living a bit and taught me to get involved with other
things, not only sampling theory, and I appreciate his advice. I sincerely appreciate
the advice of Dr. Joginder Singh at Ottawa, who taught me to meditate by
imagining the writing of the name of God with eyes closed, and I found it helps
when under pressure from work. I am most grateful to my teachers and colleagues
for their help and co-operation. Special thanks are due to my father Mr. Sardeep
Ubhi and my mother Mrs. Ranjit Ubhi for making this book possible, to my
brothers Jatinder and Kulwinder, and to my late sister Sarjinder.

The permission of Dimitri Chappas, NOAA/National Climatic Data Center, to print
a few maps is also duly acknowledged. Free access to the data given in the
Appendix by Agricultural Statistics and Statistical Abstracts of the United States is
also duly acknowledged. I would also like to extend my thanks to the Editor James
Finlay, Associate Editor Inge Hardon, and the reviewers for bringing the original
version of the manuscript into its present form and into the public domain.

Note that I used EXCEL to solve the numerical examples, so if you use a hand
calculator there may be some discrepancies in the results after one or two decimal
places. Further note that the names used in the examples, such as Amy, Bob, Mr.
Bean, etc., are generic, and are not intended to resemble any real people. I would
also like to submit that all opinions and methods of presentation of results in this
book are solely the author's and are not necessarily representative of any institute
or organization. I tried to collect all recent and old papers, but if you have any
published related paper and would like it to be highlighted in the next volume of
my book, please feel free to mail a copy to me, and it will be my pleasure to give a
suitable place to your paper. To my knowledge this will be the first book in survey
sampling open to everyone to share a contribution, irrespective of your designation,
status, group of scientists, journal names, or any other discriminating character you
feel exists in this world. Your opinions are most welcome and any suggestions for
improvement will be much appreciated via e-mail.

Sarjinder Singh (B.Sc., M.Sc., Ph.D., Gold Medalist, and Post Doctorate)
Assistant Professor, Department of Statistics, St. Cloud State University,
St. Cloud, MN, 56301-4498, USA. E-mail: sarjinder@yahoo.com
1. BASIC CONCEPTS AND MATHEMATICAL NOTATION

1.0 INTRODUCTION

In this chapter we introduce some basic concepts and mathematical notation, which
should be known to every survey statistician. The meaning and use of these terms
are illustrated by applying them in the subsequent chapters.

1.1 POPULATION

In statistical language the term population is applied to any finite or infinite
collection of individuals or units. It has displaced the older term 'universe', and is
practically synonymous with 'aggregate'. A population is a collection of objects or
units about which we want to know something or draw an inference. The population
may be finite or infinite. Assume a population consists of electric bulbs produced
by a plant. We may want to estimate the average life of the bulbs. The number of
bulbs produced by the plant may be finite or infinite.

1.1.1 FINITE POPULATION

If the number of objects or units in the population is countable, it is said to be a
finite population. For example, the houses in a suburb form a finite population.

1.1.2 INFINITE POPULATION

If the number of objects or units in the population is infinite, it is said to be an
infinite population. For example, the stars in the sky form an infinite population.
In general, the population is denoted by Ω and its size is denoted by N. In the case
of an infinite population, N → ∞.

1.1.3 TARGET POPULATION

A finite or infinite population about which we require information is called the
target population. For example, all 18-year-old girls in the United States.

1.1.4 STUDY POPULATION

This is the basic finite set of individuals we intend to study. For example, all
18-year-old girls whose permanent address is in New York.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

A subset of the population which represents the entire population is called a
sample. The sample is denoted by s and its size by n.

We provide here a few examples of populations and samples as follows:

( a ) All bulbs manufactured in a plant constitute a population. Now suppose we
want to estimate the average lifetime of all the bulbs. Instead of taking the whole
population into consideration for testing purposes, we take 50 bulbs. The collection
of 50 bulbs is then called a sample;
( b ) Suppose we want to find the percentage of ticketless travellers in the TTC
buses of Toronto. Then all persons travelling in all the buses of Toronto constitute
the population, and the persons checked by a particular checker(s) form a sample.

A census is a particular case of a sample. If we take the whole population as the
sample, then the sample survey is called a census.

The following table provides some of the major differences between a sample and
a census.

Aspect                      Sample                                     Census
Cost                        Less                                       More
Effort                      Less                                       More
Time consumed               Less                                       More
Errors                      May be predicted with certain confidence   No such errors
Accuracy of measurements    More                                       Less

The variable of interest, or the variable about which we want to draw some
inference, is called a study variable. Its value for the i-th unit is generally denoted
by Y_i. For example, the life of the bulbs produced by a certain plant can be taken
as a study variable.

1.7 AUXILIARY VARIABLE

A variable having a direct or indirect relationship with the study variable is called
an auxiliary variable. The value of an auxiliary variable for the i-th unit is
generally denoted by X_i or Z_i, etc.. For example, the time or money spent on
producing each bulb by the plant to maintain quality can be taken as an auxiliary
variable.

1.8 DIFFERENCE BETWEEN STUDY VARIABLE AND AUXILIARY VARIABLE

The main differences between the study variable and the auxiliary variable are as
follows:

Factors                       Study Variable                  Auxiliary Variable
Cost                          More                            Less
Effort                        More                            Less
Sources of availability       Current surveys or experiments  Current or past surveys, books, or journals, etc.
Interest of an investigator   More                            Less
Error in measurement          More                            Less
Sources of error              More                            Fewer
Notation                      Y                               X, Z

1.9 PARAMETER

An unknown quantity which may vary over the different sets of values forming a
population is called a parameter. Any function of the population values of a
variable is a parameter. It is generally denoted by θ.
Mathematically, suppose a population Ω consists of N units and the value of its
i-th unit is Y_i. Then any function of the Y_i values is a parameter, i.e.,

Parameter = f(Y_1, Y_2, ...., Y_N).    (1.9.1)

For example, if Y_i denotes the total lifetime of the i-th bulb, then the average
lifetime of the bulbs produced by the company is a parameter and is given by

Parameter = (1/N)(Y_1 + Y_2 + .... + Y_N).    (1.9.2)

1.10 STATISTIC

A summary value calculated from a sample of observations, usually but not
necessarily as an estimator of some population parameter, is called a statistic and
is generally denoted by θ̂. Mathematically, suppose a sample s consists of n units
and the value of the i-th unit of the sample is denoted by y_i. Any function of the
y_i values is a statistic, i.e.,

Statistic = f(y_1, y_2, ...., y_n).    (1.10.1)

For example, if y_i denotes the total lifetime of the i-th bulb, then the average
lifetime of the bulbs produced by the company is estimated by the statistic defined
as

Statistic = (1/n)(y_1 + y_2 + .... + y_n).    (1.10.2)

1.11 STATISTICS

Statistics is the science of collecting, analysing, and interpreting numerical data
relating to an aggregate of individuals or units.

1.12 SAMPLE SELECTION

A sample can be selected from a population in many ways. In this chapter we will
discuss only two simple methods of sample selection. As the reader becomes
familiar with sample selection, more complicated schemes will be discussed in the
following chapters.

1.12.1 CHIT METHOD OR LOTTERY METHOD

Suppose we have N = 10,000 blocks in New York City. We wish to draw a sample
of n = 100 blocks to draw an inference about a character under study, e.g., the
average amount of alcohol used, or the number of bulbs of a certain company used
in each block. Assign numbers to the 10,000 blocks, write these numbers on chits,
and fold the chits in such a way that they all look identical. Put all the chits in a
box. Then there are two possibilities:

1.12.1.1 WITH REPLACEMENT SAMPLING

Select one chit out of the 10,000 chits in the box and note the number of the block
written on it. This is the first unit selected in the sample. Before selecting the
second chit, we replace the first chit in the box and mix it with the other chits
thoroughly. Then select the second chit and note the number of the block written on
it. This is the second unit selected in the sample. Go on repeating the process until
100 chits have been selected. Note that because each chit is selected after replacing
the previous chit in the box, some chits may be selected more than once. Such a
sampling procedure is called Simple Random Sampling With Replacement, or
simply SRSWR sampling. Let us explain with a small number of blocks in a
population as follows:

Suppose a population consists of N = 3 blocks, say A, B, and C. We wish to draw
all possible samples of size n = 2 using SRSWR sampling. The possible ordered
samples are: AA, AB, AC, BA, BB, BC, CA, CB, CC. Thus a total of 9 samples of
size 2 can be drawn from the population of size 3, which in fact is given by 3² = 9.

In general, the total number of samples of size n drawn from a population of size
N with replacement sampling is N^n and is denoted by s(n). Thus

s(n) = N^n.    (1.12.1)

Now imagine the situation: 'How many WR samples, each of n = 100 blocks, are
possible out of N = 10,000 blocks?'

1.12.1.2 WITHOUT REPLACEMENT SAMPLING

In the case of without replacement sampling, we do not replace a chit before
selecting the next chit; i.e., the number of chits in the box goes on decreasing as we
go on selecting chits. Hence, there is no chance for a chit to be selected more than
once. Such a sampling procedure is called Simple Random Sampling Without
Replacement, or simply SRSWOR sampling. Let us explain it as follows: Suppose a
population consists of N = 3 blocks A, B, and C. We wish to draw all possible
unordered samples of size n = 2. Evidently, the possible samples are: AB, AC, BC.
Thus a total of 3 samples of size 2 can be drawn from the population of size 3,
which in fact is given by ^3C_2 = 3. In general, the total number of samples of
size n drawn without replacement from a population of size N is given by ^NC_n.
Thus

s(n) = ^NC_n = N! / (n!(N - n)!)    (1.12.2)

where n! = n(n-1)(n-2).......2.1, and 0! = 1.

Now think again: 'How many WOR samples, each of n = 100 blocks, are possible
out of N = 10,000 blocks?'
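These counts are easy to verify computationally. A minimal Python sketch (illustrative only, using the N = 3 blocks A, B, C from the example above):

```python
import math
from itertools import product, combinations

blocks = ["A", "B", "C"]  # the N = 3 population from the example

# SRSWR: ordered samples of size n = 2 -> N^n possibilities, eq. (1.12.1)
wr_samples = ["".join(s) for s in product(blocks, repeat=2)]
# -> AA, AB, AC, BA, BB, BC, CA, CB, CC

# SRSWOR: unordered samples of size n = 2 -> C(N, n) possibilities, eq. (1.12.2)
wor_samples = ["".join(s) for s in combinations(blocks, 2)]
# -> AB, AC, BC

assert len(wr_samples) == 3 ** 2            # N^n = 9
assert len(wor_samples) == math.comb(3, 2)  # C(3, 2) = 3

# For N = 10,000 blocks and n = 100, both counts are astronomical:
print(len(str(10_000 ** 100)))            # N^n = 10^400 has 401 digits
print(len(str(math.comb(10_000, 100))))   # C(N, n) also has hundreds of digits
```

This makes concrete why no one ever lists all possible samples for a realistic population: even modest N and n produce more samples than could ever be enumerated.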

Note that it is a very cumbersome job to make identical chits if the size of the
population is very large. In such situations, another method of sample selection is
based on the use of a random number table. A random number table is a set of
numbers used for drawing random samples. The numbers are usually compiled by
a process involving a chance element and, in their simplest form, consist of a
series of the digits 0 to 9 occurring at random with equal probability.

1.12.2 RANDOM NUMBER TABLE METHOD

As mentioned above, in this table the numbers from 0 to 9 are written in both
columns and rows. For the purpose of illustration, we used Pseudo-Random
Numbers (PRN), generated by using the UNIF subroutine following Bratley, Fox,
and Schrage (1983), as given in Table 1 of the Appendix. We generally apply the
following rules to select a sample:

Rule 1. First we write all random numbers into groups of columns, as already done
in Table 1 of the Appendix. We take as many columns in each group as there are
digits in the population size.
Rule 2. List all the individuals or units in the population and assign them numbers
1, 2, 3, ..., N.
Rule 3. Randomly select any starting point in the table of random numbers. Write
down all the numbers less than or equal to N that follow the starting point until we
obtain n numbers. If we are using SRSWOR sampling, discard any number that is
repeated in the random number table. If we are using SRSWR sampling, retain the
repeated numbers.
Rule 4. Select those units that are assigned the numbers listed in Rule 3. This
constitutes the required random sample.

Let us explain these rules as follows: Suppose we are given a population of
N = 225 units and we want to select a sample of, say, n = 36 units from it. To pick
a random sample of 36 units out of a population of 225 units, use any three
columns from the random number table; for example, use columns 1 to 3, 4 to 6,
etc., rejecting any number greater than 225 (and also the number 000). As an
example, the following table lists the 36 units selected using the SRSWR sampling
procedure with the use of the Pseudo-Random Numbers (PRN) given in Table 1 of
the Appendix.

Units selected in the sample


014 049 053 039 196 183 171 225 179 153 142 138
070 083 001 209 222 075 219 092 155 012 099 211
027 039 048 048 080 161 006 059 199 150 025 173

In the case of SRSWOR sampling, the figures 039 and 048 would not be repeated;
i.e., we would take each unit only once, so we would continue to select two more
distinct random numbers, namely 078 and 163.
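Rules 1 to 4 can be sketched in a few lines of code. The short stream of three-digit numbers below is a hypothetical stand-in for Table 1 of the Appendix, and the function name is mine, not the book's:

```python
def select_sample(stream, N, n, with_replacement):
    """Scan a stream of random numbers, keep those in 1..N (Rule 3),
    discard repeats under SRSWOR, and stop once n units are chosen."""
    chosen = []
    for r in stream:
        if not 1 <= r <= N:
            continue                      # reject numbers greater than N (and 000)
        if not with_replacement and r in chosen:
            continue                      # SRSWOR: discard repeated numbers
        chosen.append(r)
        if len(chosen) == n:
            break
    return chosen

# Hypothetical three-digit random numbers read from a table:
stream = [914, 14, 49, 530, 53, 39, 196, 39, 225, 48, 48, 80]
print(select_sample(stream, N=225, n=6, with_replacement=True))   # repeats kept
print(select_sample(stream, N=225, n=6, with_replacement=False))  # repeats dropped
```

With replacement the number 39 is retained twice; without replacement it is skipped on its second appearance and the scan continues to the next usable number.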

Although the above method of selecting a sample by using a random number table
is very efficient, it may lead to a lot of rejections of random numbers; therefore we
would like to discuss a shortcut method called the remainder method.

1.12.2.1 REMAINDER METHOD

Using the above example, if any selected three-digit random number is greater
than 225, divide it by 225. We choose the serial number from 1 through 224
corresponding to the remainder when it is not zero, and the serial number 225
when the remainder is zero. However, it is necessary to reject the numbers from
901 to 999 (besides 000) in adopting this procedure, as otherwise units with serial
numbers 1 to 99 would have a larger probability (5/999) of selection, while those
with serial numbers 100 to 225 would have a probability equal to only 4/999. If we
use this procedure, with the same three-figure random numbers as given in
columns 1 to 3, 4 to 6, etc., we obtain the sample of units which are assigned the
numbers given below. Again, in SRSWR sampling the numbers that give rise to the
same remainder are not discarded, while in the SRSWOR sampling procedure such
numbers are discarded. Thus an SRSWR sample is as given below:
Units selected in the sample
138 151 099 025 014 022 197 176 111 209 042 194
015 049 095 040 027 124 116 097 126 142 073 158
108 053 046 001 207 156 201 027 111 209 065 184

Note that in the SRSWR sample only one unit, 209, is repeated; thus for SRSWOR
sampling we continue to apply the remainder approach until another distinct unit
is selected, which is 089 in this case. Further note that the first random number,
992, was discarded as required by this rule.
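The rejection rule can be sketched as a small helper (the function and code are illustrative, not from the book): 000 and 901-999 are rejected outright; any accepted number is mapped to its remainder modulo 225, with remainder 0 read as serial number 225.

```python
def remainder_map(r, N=225):
    """Map a three-digit random number r (0..999) to a serial number 1..N
    by the remainder method, or return None if r must be rejected."""
    if r == 0 or r > 4 * N:        # reject 000 and 901..999 (4 * 225 = 900)
        return None
    rem = r % N
    return rem if rem != 0 else N  # remainder 0 corresponds to serial number N

print(remainder_map(992))  # rejected, as for the first random number in the text
print(remainder_map(450))  # remainder is zero, so serial number 225
print(remainder_map(251))  # serial number 26
```

Each serial number 1 to 225 is the image of exactly four of the 900 accepted numbers, which is why every unit keeps the equal selection probability 4/999 mentioned above.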

1.13 PROBABILITY SAMPLING

Probability sampling is any method of selecting a sample based on the theory of
probability. At any stage of the selection process, the probability of a given set of
units being selected must be known.

1.14 PROBABILITY OF SELECTING A SAMPLE

Every sample selected from the population has some known probability of being
selected on any occasion. It is generally denoted by the symbol p_t or p(t). For
example, the probability of selecting a sample using

with replacement sampling is p_t = 1/N^n, t = 1, 2, ..., N^n,    (1.14.1)

and using

without replacement sampling is p_t = 1/^NC_n, t = 1, 2, ..., ^NC_n.    (1.14.2)

The following table describes the differences between the with replacement and
without replacement sampling procedures.

With replacement sampling                        Without replacement sampling
Cheaper.                                         Costly.
A few units may be selected more than once.      A unit can get selected only once.
Less efficient.                                  More efficient.
Number of possible samples s(n) = N^n.           Number of possible samples s(n) = ^NC_n.
Probability of selecting a particular sample     Probability of selecting a particular sample
p_t = 1/N^n, t = 1, 2, ..., N^n.                 p_t = 1/^NC_n, t = 1, 2, ..., ^NC_n.
Probability of selecting the i-th unit in a      Probability of selecting the i-th unit in a
sample P_i = 1/N, i = 1, 2, ..., N.              sample P_i = 1/N, i = 1, 2, ..., N.

1.15 POPULATION MEAN/TOTAL

Let Y_i, i = 1, 2, ..., N, denote the value of the i-th unit in the population. Then the
population mean is defined as

Ȳ = (1/N)(Y_1 + Y_2 + .... + Y_N) = (1/N) Σ_{i=1}^{N} Y_i    (1.15.1)

and the population total is defined as

Y = (Y_1 + Y_2 + .... + Y_N) = Σ_{i=1}^{N} Y_i = N Ȳ.    (1.15.2)

The units of measurement of the population mean are the same as those of the
actual data. For example, if the i-th unit, Y_i, ∀ i, is measured in dollars, then the
population mean, Ȳ, is also measured in dollars.

1.16 POPULATION MOMENTS

The r-th order central population moments are defined as

μ_r = (1/(N-1)) Σ_{i=1}^{N} (Y_i - Ȳ)^r,  r = 2, 3, ....    (1.16.1)

If r = 2, then μ_2 represents the second order population moment, given by

μ_2 = S_y² = (1/(N-1)) Σ_{i=1}^{N} (Y_i - Ȳ)²    (1.16.2)

and is named the population mean square.

Note that the population variance is defined as

σ_y² = (1/N) Σ_{i=1}^{N} (Y_i - Ȳ)² = ((N-1)/N) S_y².    (1.16.3)

If the data are in dollars, then the units of measurement of σ_y² will be dollars².
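The relation between the population variance and the population mean square in (1.16.3) is easy to verify numerically. The four population values below are hypothetical, chosen only for illustration:

```python
Y = [4.0, 7.0, 9.0, 12.0]    # hypothetical population values (in dollars)
N = len(Y)
Ybar = sum(Y) / N            # population mean, as in (1.15.1)

# Population mean square (divisor N - 1) and population variance (divisor N):
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)
sigma2 = sum((y - Ybar) ** 2 for y in Y) / N

# The two quantities differ only by the factor (N - 1)/N, as in (1.16.3):
assert abs(sigma2 - (N - 1) / N * S2) < 1e-12
print(Ybar, S2, sigma2)
```

For these four values the mean is 8.0, the mean square S_y² is 34/3 ≈ 11.33, and the variance σ_y² is 8.5, which is exactly (3/4) of S_y².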

1.17 POPULATION STANDARD DEVIATION

The positive square root of the population variance is called the population
standard deviation and is denoted by σ_y. The units of measurement of σ_y are
again the same as those of the actual data. For instance, in the above example, the
units of measurement of σ_y will be dollars.

1.18 POPULATION COEFFICIENT OF VARIATION

The ratio of the standard deviation to the population mean is called the coefficient
of variation. It is denoted by C_y, that is,

C_y = σ_y / Ȳ.    (1.18.1)

Evidently C_y is a unit-free number. It is useful for comparing the variability in
two different populations having different units of measurement, e.g., $ and kg. It
is also called the relative standard error (RSE). Sometimes we also consider
C_y = S_y / Ȳ.

1.19 RELATIVE MEAN SQUARE ERROR

The relative mean square error is defined as the square of the coefficient of
variation C_y and is generally written as RMSE. Mathematically,

RMSE = C_y² = σ_y² / Ȳ².    (1.19.1)

Sometimes it is also denoted by φ².

1.20 SAMPLE MEAN

Let y_i, i = 1, 2, ..., n, denote the value of the i-th unit selected in the sample. Then
the sample mean is defined as

ȳ = (1/n) Σ_{i=1}^{n} y_i.    (1.20.1)

1.21 SAMPLE VARIANCE

The sample variance s_y² is defined as

s_y² = (1/(n-1)) Σ_{i=1}^{n} (y_i - ȳ)².    (1.21.1)

Remark 1.1. The population mean Ȳ and population variance σ_y², etc., are
unknown quantities (parameters) and can be denoted by the symbol θ. The sample
mean ȳ and sample variance s_y², etc., are known after sampling, are called
statistics, and can be denoted by θ̂. Also note that the sample standard deviation
(or standard error) and the sample coefficient of variation can be defined as
s_y = √(s_y²) and c_y = s_y / ȳ, respectively. Note that the standard error is a
statistic whereas the standard deviation is a parameter.
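The statistics in Remark 1.1 can be computed together in a few lines. The three sample values below are hypothetical, for illustration only:

```python
import math

y = [5.0, 8.0, 11.0]   # hypothetical sample of n = 3 values
n = len(y)

ybar = sum(y) / n                                  # sample mean (1.20.1)
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)     # sample variance (1.21.1)
s_y = math.sqrt(s2)                                # sample standard deviation (standard error)
c_y = s_y / ybar                                   # sample coefficient of variation
print(ybar, s2, s_y, c_y)   # 8.0 9.0 3.0 0.375
```

Each printed value is a statistic: it is computed entirely from the sampled values and would change from sample to sample, unlike the fixed (but unknown) population parameters it estimates.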

1.22 ESTIMATOR

A statistic θ̂_t obtained from the values in a sample s is also called an estimator of
the population parameter θ. Note that the notations θ̂_t, θ̂, and θ̂_n have the same
meaning. For example, the notations ȳ_t, ȳ, and ȳ_n have the same meaning, and
s_y² and s_y(t)² have the same meaning. We choose according to our requirements
for a given topic or exercise.

1.23 ESTIMATE

Any numeric value obtained from the sample information is called an estimate of
the population parameter. It is also called a statistic.

1.24 SAMPLE SPACE

A sample space is the set of all possible values of a variable of interest. It is
denoted by ψ or S. For example, if we toss a pair of fair coins, each having two
faces, then the sample space will consist of all 4 possible outcomes:

ψ = {HH, HT, TH, TT}.

A pictorial representation of such a sample space is given in Figure 1.24.1.

[Figure: Experiment: toss two coins, giving 2 x 2 = 4 outcomes. Tree diagram: the
first coin falls H or T; for each, the second coin falls H or T, producing the sample
space HH, HT, TH, TT.]

Fig. 1.24.1 Sample space while tossing two coins.



1.25 UNIVARIATE RANDOM VARIABLE

A random variable is a real-valued function defined on the sample space ψ. It is
generally of two types:

( i ) Qualitative random variable; ( ii ) Quantitative random variable.

Let us discuss these random variables in more detail as follows:

1.25.1 QUALITATIVE RANDOM VARIABLES

Qualitative random variables assume values that are not necessarily numerical but
can be categorized. For example, Gender has two possible values: Male and
Female. These two can be arbitrarily coded numerically as Female = 0 and
Male = 1. Such coded variables are called Nominal variables. In another example,
consider Grades, which can take five possible values: A, B, C, D, and F. These five
categories can be arbitrarily coded numerically as: A = 4, B = 3, C = 2, D = 1, and
F = 0. Note that here the magnitude of the coding tells us the quality of the Grade:
if the code is 3, then the Grade is better than a Grade with code 2. Such a coded
variable is called an Ordinal variable. Also note that in the case of the Nominal
variable, the coding Male = 1 and Female = 0 does not mean that males are
superior to females. Adding, subtracting, or averaging such qualitative variables
has no meaning. Thus qualitative variables are of two types: ( a ) Nominal
variables; ( b ) Ordinal variables. Pie charts or bar charts are generally used to
present qualitative variables.

1.25.2 QUANTITATIVE RANDOM VARIABLES

Quantitative random variables take numerical values for which adding, subtracting, or averaging does have meaning. Examples of quantitative variables are weight, height, number of students, etc. In general, two types of quantitative random variables are available: ( a ) Discrete random variable; ( b ) Continuous random variable.

1.25.2.1 DISCRETE RANDOM VARIABLE

If a random variable takes a countable number of values, it is called a discrete random variable. In other words, a real-valued function defined on a discrete sample space is called a discrete random variable. For example, the number of students can be 0, 1, 2, etc.

1.25.2.2 CONTINUOUS RANDOM VARIABLE

A random variable is said to be continuous if it can take all possible values between certain limits. For example, the height of a student can be 5.6 feet.

A pictorial representation to differentiate between qualitative and quantitative variables is given in Figure 1.25.1.

A random variable may be qualitative: Nominal (e.g., Gender, Religion) or Ordinal (e.g., Grade, Age groups); or quantitative: Discrete (e.g., No. of grades, No. of students) or Continuous (e.g., Height, Age).

Fig. 1.25.1 Forms of a random variable or data.

Note that Age itself is a quantitative variable whereas Age Groups is a qualitative variable. Pie charts, bar charts, dot plots, line charts, stem and leaf plots, histograms, and box plots are generally used to present quantitative variables.

1.26 PROBABILITY MASS FUNCTION (p.m.f.) OF A UNIVARIATE DISCRETE RANDOM VARIABLE

Let X be a discrete random variable taking at most a countably infinite number of values x_1, x_2, ..., and with each possible outcome x_i associate a number p_i = P(X = x_i) = p(x_i), called the probability of x_i. Then p(x_i) is called the p.m.f. of X if it satisfies the following two conditions:

( a ) p(x_i) ≥ 0 and ( b ) Σ_{i=1}^{∞} p(x_i) = 1. (1.26.1)

1.27 PROBABILITY DENSITY FUNCTION (p.d.f.) OF A UNIVARIATE CONTINUOUS RANDOM VARIABLE

Let X be a continuous random variable on an interval x ∈ [a, b], where −∞ < a ≤ x ≤ b < +∞. Then a function f(x) is said to be a probability density function (p.d.f.) if it satisfies the following two conditions:

( a ) f(x) ≥ 0 and ( b ) ∫_{a}^{b} f(x) dx = 1. (1.27.1)

1.28 EXPECTED VALUE AND VARIANCE OF A UNIVARIATE RANDOM VARIABLE

If a discrete random variable X takes all possible values x_i with probability mass function p(x_i) in the sample space ψ, then its expected value is

E(X) = Σ_ψ x_i p(x_i) (1.28.1)

and the variance of the random variable X is given by

V(X) = Σ_ψ (x_i − E(X))² p(x_i) (1.28.2)

or, equivalently,

V(X) = Σ_ψ x_i² p(x_i) − {E(X)}². (1.28.3)

Sometimes (1.28.2) is called the formula by definition and (1.28.3) is called the computing formula.

If X is a continuous random variable with x ∈ [a, b] and probability density function f(x), then its expected value is

E(X) = ∫_{a}^{b} x f(x) dx (1.28.4)

and the variance of the random variable X is given by

V(X) = ∫_{a}^{b} (x − E(X))² f(x) dx (1.28.5)

or, equivalently,

V(X) = ∫_{a}^{b} x² f(x) dx − {E(X)}². (1.28.6)
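As a quick numerical check of (1.28.1)-(1.28.3), the following sketch (Python, with an illustrative five-point p.m.f. assumed only for the demonstration) computes the expected value and evaluates both the definitional and the computing formula for the variance, which must agree.

```python
# Expected value and variance of a discrete random variable,
# following (1.28.1)-(1.28.3).  The p.m.f. below is illustrative:
# p(x_i) = 0.2 on x = 1, ..., 5.
pmf = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

ex = sum(x * p for x, p in pmf.items())                      # E(X), (1.28.1)
var_def = sum((x - ex) ** 2 * p for x, p in pmf.items())     # V(X), (1.28.2)
var_comp = sum(x * x * p for x, p in pmf.items()) - ex ** 2  # V(X), (1.28.3)

print(ex, var_def, var_comp)  # the two variance formulas agree
```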

1.29 DISTRIBUTION FUNCTION OF A UNIVARIATE RANDOM VARIABLE

Let X be a random variable; then the function F(x) = P(X ≤ x) is called the distribution function of the random variable, and it has the following properties:

( i ) If a ≤ X ≤ b then P(a ≤ X ≤ b) = F(b) − F(a);
( ii ) If a ≤ b then F(a) ≤ F(b);
( iii ) 0 ≤ F(x) ≤ 1;
( iv ) The distribution of F(x) is uniform between 0 and 1.

Again it can be of two types:

1.29.1 DISCRETE DISTRIBUTION FUNCTION

In this case there are a countable number of points x_1, x_2, ... along with associated probabilities p(x_1), p(x_2), ..., where p(x_i) ≥ 0 and Σ_{i=1}^{∞} p(x_i) = 1, such that

F(x) = P(X ≤ x) = Σ_{i: x_i ≤ x} p(x_i).

For example, if x_i takes the integral values {1, 2, 3, 4, 5} with probabilities p(x_i) = 0.2, then the function F(x) is a step function as shown in Figure 1.29.1.

Fig. 1.29.1 Discrete distribution function.

1.29.2 CONTINUOUS DISTRIBUTION FUNCTION

If X is a continuous random variable with the probability density function (p.d.f.) f(x), then the function

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, −∞ < x < +∞ (1.29.1)

is called the distribution function, or sometimes the cumulative distribution function (c.d.f.), of the random variable X. The relationship between F(x) and f(x) is given by f(x) = dF(x)/dx. The c.d.f. F(x) is a non-decreasing function of x and is continuous on the right. Also note that F(−∞) = 0, F(+∞) = 1, 0 ≤ F(x) ≤ 1, and

P(a ≤ x ≤ b) = ∫_{a}^{b} f(x) dx = F(b) − F(a).

For example, if x is a continuous random variable with probability density function (p.d.f.)

f(x) = 1 if 0 < x < 1, and 0 otherwise, (1.29.2)

then its cumulative distribution function (c.d.f.) is given by

F(x) = 0 if x < 0, x if 0 ≤ x ≤ 1, and 1 if x > 1, (1.29.3)

and its graphical representation is given in Figure 1.29.2.

Fig. 1.29.2 Continuous distribution function.

1.30 SELECTION OF A SAMPLE USING KNOWN UNIVARIATE DISTRIBUTION FUNCTION

There are two cases:

1.30.1 DISCRETE RANDOM VARIABLE

If x_i is a discrete random variable with probability mass function p(x_i) and distribution function F(x) = P[X ≤ x] = Σ_{i: x_i ≤ x} p(x_i), let 0 ≤ F(x) ≤ 1 be any random number drawn from the Pseudo-Random Number (PRN) Table 1 given in the Appendix.

Then we can write F(x) as

F(x) = Σ_{i: x_{i−1} < x} p(x_{i−1}) + p(x) (say). (1.30.1)

Then the integral value of the random variable x selected in the sample is given by

x = p^{−1}[ F(x) − Σ_{i: x_{i−1} < x} p(x_{i−1}) ] (1.30.2)

where p^{−1} denotes the inverse function.

Example 1.30.1. A discrete random variable X has the following probability mass function:

Select a random sample of three units using the method of random numbers.

Solution: The cumulative distribution function of the random variable X is given by

We used the first six columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, multiplied by 10^{−6}, as the randomly selected values of F(x). Then the integral value of the random variable x selected in the sample is obtained using the inverse relationship x = p^{−1}[ F(x) − Σ_{i: x_{i−1} < x} p(x_{i−1}) ] as follows:

Value of F(x) using PRN Table 1 | Value of x observed in the sample
0.992954 | 6
0.588183 | 3
0.601448 | 3

In the case of with replacement sampling, the value x = 3 has been selected twice; for WOR sampling we would have to continue the process until three distinct values of x are selected.
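Since the p.m.f. table of this example is not reproduced above, the sketch below (Python) illustrates the same inverse-c.d.f. lookup with the five-point p.m.f. of Section 1.29.1 (p(x) = 0.2 on x = 1, ..., 5); that p.m.f. is an assumption made only for illustration, so the selected units differ from those in the example's table.

```python
# Inverse-c.d.f. selection of a discrete unit: accumulate the p.m.f.
# and return the smallest x with F(x) >= R.  The p.m.f. here is the
# illustrative one of Section 1.29.1, not the (omitted) table above.
pmf = [(1, 0.2), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.2)]

def select_unit(R, pmf):
    cum = 0.0
    for x, p in pmf:
        cum += p
        if R <= cum:
            return x
    return pmf[-1][0]  # guard against rounding at R = 1

# The three PRN values of the example play the role of Table 1 entries.
sample = [select_unit(R, pmf) for R in (0.992954, 0.588183, 0.601448)]
print(sample)
```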

Example 1.30.2. If x follows a binomial distribution with parameters N and p, that is, x ~ B(N, p), say N = 10 and p = 0.4, select an SRSWR sample of n = 4 units by using the random number method.

Solution. Since x follows a binomial distribution, P(X = x) = C(N, x) p^x (1 − p)^{N−x} for x = 0, 1, 2, ..., N. Here N = 10 and p = 0.4, so the cumulative distribution function F(x) = P[X ≤ x] = Σ_{t ≤ x} C(N, t) p^t (1 − p)^{N−t} is:

We used the three columns from the 7th to the 9th of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, multiplied by 10^{−3}, as the randomly selected values of F(x). Then the integral value of the random variable x selected in the sample is obtained using the inverse relationship x = p^{−1}[ F(x) − Σ_{i: x_{i−1} < x} p(x_{i−1}) ] as follows:

Value of F(x) using PRN Table 1 | Value of x observed in the sample
0.622 | 3
0.771 | 4
0.917 | 5
0.675 | 4
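A sketch of the same selection in Python: the Binomial(10, 0.4) c.d.f. is tabulated with `math.comb`, and a unit is selected as the largest x with F(x) ≤ R, the convention that reproduces the table above.

```python
from math import comb

# Binomial(N = 10, p = 0.4) c.d.f. and inverse lookup for Example 1.30.2.
def binom_cdf(x, N=10, p=0.4):
    return sum(comb(N, t) * p ** t * (1 - p) ** (N - t) for t in range(x + 1))

def select_unit(R, N=10, p=0.4):
    # Largest x with F(x) <= R; falls back to 0 when R < F(0).
    hits = [x for x in range(N + 1) if binom_cdf(x, N, p) <= R]
    return max(hits) if hits else 0

# PRN values from columns 7-9 of Table 1, multiplied by 10^-3.
sample = [select_unit(R) for R in (0.622, 0.771, 0.917, 0.675)]
print(sample)
```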

1.30.2 CONTINUOUS RANDOM VARIABLE

If x is a continuous random variable with probability density function f(x) and distribution function F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(t) dt, let 0 ≤ F(x) ≤ 1 be any random number drawn from the Pseudo-Random Number (PRN) Table 1 given in the Appendix.

Then we can write F(x) as

F(x) = ∫_{−∞}^{x_0} f(t) dt + f(x) (say) (1.30.3)

where x_0 < x but very close to x.

Then the value of the random variable x selected in the sample is given by

x = f^{−1}( F(x) − ∫_{−∞}^{x_0} f(t) dt ) (1.30.4)

where f^{−1} denotes the inverse function.


18 Advanced sampling theory with applications

Example 1.30.3. A continuous random variable X has the cumulative distribution function

F(x) = 0 if x < 1, (1/16)(x − 1)⁴ if 1 ≤ x ≤ 3, and 1 if x > 3. (1.30.5)

Select a sample of n = 10 units by using SRSWR sampling.

Solution. We are given F(x) = (1/16)(x − 1)⁴, which implies that x = 2[F(x)]^{1/4} + 1. By using the first three columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, multiplied by 10^{−3}, we obtain the observed values of F(x) and the sampled values of x as:

F(x) | x
0.992 | 2.995988
0.588 | 2.751356
0.601 | 2.760956
0.549 | 2.721563
0.925 | 2.961397
0.014 | 1.687958
0.697 | 2.827419
0.872 | 2.932676
0.626 | 2.778990
0.236 | 2.393985
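The inverse relationship x = 2[F(x)]^{1/4} + 1 can be applied directly; a short Python sketch reproducing the table:

```python
# Inverse transform for F(x) = (x - 1)^4 / 16 on [1, 3], Example 1.30.3:
# x = 2 * F**0.25 + 1, applied to the ten PRN values used in the text.
prn = [0.992, 0.588, 0.601, 0.549, 0.925, 0.014, 0.697, 0.872, 0.626, 0.236]
sample = [2 * F ** 0.25 + 1 for F in prn]
print([round(x, 6) for x in sample])
```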

Example 1.30.4. A continuous random variable x has density function

f(x) = {π(1 + x²)}^{−1} if −∞ < x < +∞, and 0 otherwise. (1.30.6)

Select a sample of n = 5 units by using with replacement sampling.

Solution. The cumulative distribution function (c.d.f.) of x is given by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(x) dx = (1/π) ∫_{−∞}^{x} 1/(1 + x²) dx = (1/π)[tan^{−1}(x)]_{−∞}^{x} = (1/π)[tan^{−1}(x) + π/2]

which implies that

x = tan[π F(x) − π/2]. (1.30.7)

Using three columns, say the 7th to the 9th, of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, multiplied by 10^{−3}, the first five observed values of F(x) are 0.622, 0.771, 0.917, 0.675 and 0.534. Thus the five sampled values from the above distribution are

F(x) | x
0.622 | 0.403214
0.771 | 1.141487
0.917 | 3.747745
0.675 | 0.612801
0.534 | 0.107222

Note that we have used the tan function in radians and π = 4 tan^{−1}(1).
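A Python check of the same inverse transform x = tan(πF − π/2) for the standard Cauchy density:

```python
from math import tan, pi

# Inverse transform for the standard Cauchy density of (1.30.6),
# x = tan(pi * F - pi / 2), at the five PRN values used in the text.
prn = [0.622, 0.771, 0.917, 0.675, 0.534]
sample = [tan(pi * F - pi / 2) for F in prn]
print([round(x, 6) for x in sample])
```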

Example 1.30.5. A continuous random variable X has density function

f(x) = 1/5 if 5 < x < 10, and 0 otherwise. (1.30.8)

Select a sample of n = 5 units by using with replacement sampling.

Solution. The distribution of x is uniform between 5 and 10, so its probability distribution function is

F(x) = P[X ≤ x] = ∫_{5}^{x} f(x) dx = (1/5)(x − 5) (1.30.9)

which implies that

x = 5[F(x) + 1]. (1.30.10)

Using three columns, say the 7th to the 9th, of the Pseudo-Random Number Table 1 given in the Appendix, multiplied by 10^{−3}, the first five observed values of F(x) are 0.622, 0.771, 0.917, 0.675 and 0.534. Thus the five sampled values from the above distribution are given by

F(x) | x
0.622 | 8.110
0.771 | 8.855
0.917 | 9.585
0.675 | 8.375
0.534 | 7.670
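The uniform case is the simplest inverse transform; a one-line Python check of x = 5[F(x) + 1]:

```python
# Inverse transform for the Uniform(5, 10) density of (1.30.8):
# F(x) = (x - 5) / 5, so x = 5 * (F + 1).
prn = [0.622, 0.771, 0.917, 0.675, 0.534]
sample = [5 * (F + 1) for F in prn]
print([round(x, 3) for x in sample])
```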

1.31 DISCRETE BIVARIATE RANDOM VARIABLE

If X and Y are discrete random variables, the function giving the probability that X takes on the value x and Y takes on the value y, P(X = x, Y = y) = p(x, y), is called the joint probability distribution function of a bivariate random variable.
20 Advanced sampling theory with applications

1.32 JOINT PROBABILITY DISTRIBUTION FUNCTION OF BIVARIATE DISCRETE RANDOM VARIABLES

A bivariate function can serve as the joint probability distribution of a pair of discrete random variables X and Y if and only if its values p(x, y) satisfy the conditions:

( a ) p(x, y) ≥ 0 for each pair of values (x, y) within its domain;
( b ) Σ_x Σ_y p(x, y) = 1, where the sum extends over all possible pairs (x, y).

1.33 JOINT CUMULATIVE DISTRIBUTION FUNCTION OF BIVARIATE DISCRETE RANDOM VARIABLES

If X and Y are discrete random variables, the function given by

F(x, y) = P(X ≤ x, Y ≤ y) = Σ_{s ≤ x} Σ_{t ≤ y} p(s, t) (1.33.1)

for −∞ < x < +∞, −∞ < y < +∞, where p(s, t) is the value of the joint probability distribution of X and Y at the point (s, t), is called the joint distribution function, or the joint cumulative distribution, of X and Y.

1.34 MARGINAL DISTRIBUTIONS OF A BIVARIATE DISCRETE RANDOM VARIABLE

If X and Y are discrete random variables and p(x, y) is the value of the joint probability distribution at (x, y), the function given by

p_X(x) = Σ_y p(x, y) (1.34.1)

for each x within the range of X is called the marginal distribution of X, and the function

p_Y(y) = Σ_x p(x, y) (1.34.2)

for each y within the range of Y is called the marginal distribution of Y.

1.35 SELECTION OF A SAMPLE USING KNOWN DISCRETE BIVARIATE DISTRIBUTION FUNCTION

Let p(x, y) denote the joint probability mass function (p.m.f.) of two random variables x and y. Also, let F(x, y) denote the cumulative mass function (c.m.f.) of x and y. It is well known that the distribution of the marginal distribution function (m.d.f.) P_y(y) for any joint probability mass function of x and y is rectangular (or uniform) in the range [0, 1]. Random numbers in the random number table also follow the same distribution. Then to find the value of y one solves equation (1.35.1) below:

P_y(y) = Σ_x Σ_{t=0}^{y} p(x, t) = R_2. (1.35.1)

The known form of the joint mass function p(x, y) [one can choose any suitable form for p(x, y)] can be substituted in (1.35.1). The value y* of y so obtained is used to find the value x* of x. For this we use the conditional mass function of x given y = y*, since the distribution of the conditional mass function will also be uniform in [0, 1]. Thus another random number R_1 is drawn and the value x* of x is determined from the equation

Σ_{t ≤ x} p(t | y = y*) = R_1 (1.35.2)

where p(x | y = y*) is the conditional mass function of x given y = y*. The values x* and y* of x and y, respectively, so obtained will follow the joint probability mass function p(x, y). The equations (1.35.1) and (1.35.2) may be solved either through usual methods or through iteration procedures.

1.36 CONTINUOUS BIVARIATE RANDOM VARIABLE

A bivariate function with values f(x, y), defined over the two-dimensional plane, is called a joint probability density function of the continuous random variables X and Y if and only if

P[(X, Y) ∈ S] = ∫∫_S f(x, y) dx dy. (1.36.1)

1.37 JOINT PROBABILITY DISTRIBUTION FUNCTION OF BIVARIATE CONTINUOUS RANDOM VARIABLES

A bivariate function can serve as the joint probability distribution of a pair of continuous random variables X and Y if and only if its values f(x, y) satisfy the conditions:

( a ) f(x, y) ≥ 0 for each pair of values (x, y) within its domain; (1.37.1)

( b ) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = 1. (1.37.2)
22 Advance d sa mp ling theory with applications

1.38 JOINT CUMULATIVE DISTRIBUTION FUNCTION OF A BIVARIATE CONTINUOUS RANDOM VARIABLE

If X and Y are continuous random variables, the function given by

F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(s, t) ds dt (1.38.1)

for −∞ < x < +∞, −∞ < y < +∞, where f(s, t) is the value of the joint probability distribution of X and Y at the point (s, t), is called the joint distribution function, or the joint cumulative distribution, of X and Y.

1.39 MARGINAL CUMULATIVE DISTRIBUTIONS OF A BIVARIATE CONTINUOUS RANDOM VARIABLE

If X and Y are continuous random variables and f(x, y) is the value of the joint probability density function, then the cumulative marginal probability distribution function of y is given by

F_y(y) = ∫_{−∞}^{y} ∫_{−∞}^{+∞} f(x, y) dx dy (1.39.1)

for −∞ < y < +∞, and the cumulative marginal probability distribution function of x is given by

F_x(x) = ∫_{−∞}^{x} ∫_{−∞}^{+∞} f(x, y) dy dx (1.39.2)

for −∞ < x < +∞.

1.40 SELECTION OF A SAMPLE USING KNOWN BIVARIATE CONTINUOUS DISTRIBUTION FUNCTION

In general, let f(x, y) denote the joint probability density function (p.d.f.) of two continuous random variables x and y. Also let F(x, y) denote the cumulative density function (c.d.f.) of x and y. It is well known that the distribution of the marginal distribution function (m.d.f.) F_2(y) for any joint probability density function of x and y is rectangular (or uniform) in the range [0, 1]. Random numbers in the random number table also follow the same distribution. To find the value of y, one solves equation (1.40.1) below:

F_2(y) = ∫_{0}^{y} [ ∫_{−∞}^{+∞} f(x, y) dx ] dy = R_2. (1.40.1)

The known form of the joint density function f(x, y) [one can choose any suitable form for f(x, y)] can be substituted in (1.40.1). The value y* of y so obtained is used to find the value x* of x. For this we use the conditional density of x given y = y*, since the distribution of the conditional distribution function will also be uniform in [0, 1]. Thus another random number R_1 is drawn and the value x* of x is determined from the equation:

∫_{0}^{x} f_1(x | y = y*) dx = R_1 (1.40.2)

where f_1(x | y = y*) is the conditional density function of x given y = y*. The values x* and y* of x and y, respectively, so obtained will follow the joint probability density function f(x, y). The equations (1.40.1) and (1.40.2) may be solved either through usual methods or through iteration procedures.

Example 1.40.1. If the joint probability density function of two continuous random variables x and y is given by

f(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1, and 0 otherwise,

then select six pairs of observations (x, y) by using the random number table method.

Solution. We have

F_y(y) = ∫_{0}^{y} [ ∫_{−∞}^{+∞} f(x, y) dx ] dy = ∫_{0}^{y} [ (2/3) ∫_{0}^{1} (x + 2y) dx ] dy = (y + 2y²)/3.

Let F_y(y) = R_2 be a Pseudo-Random Number (PRN) selected from Table 1 (say, by using the first three columns); if R_2 = 0.992 then, solving the quadratic equation 2y² + y − 3R_2 = 0, we obtain the one real root 0 < y < 1 as

y = [−1 + √(1 + 24R_2)]/4 = [−1 + √(1 + 24 × 0.992)]/4 = 0.995.

Now, given y = y* = 0.995, we have

f(x | y = y*) = f(x | y = 0.995) = (2/3)(1.99 + x).

Let 0 < R_1 < 1 be any other random number, say obtained by using the 7th to 9th columns of the Pseudo-Random Numbers given in Table 1 of the Appendix; then the value of x is given by solving the integral

∫_{0}^{x} f(t | y = y*) dt = R_1, that is, ∫_{0}^{x} (2/3)(1.99 + t) dt = 0.622,

or, equivalently, solving the quadratic equation x² + 3.98x − 3R_1 = 0, which implies

x = [−3.98 ± √(3.98² + 12R_1)]/2,

and the real root in the range of x from 0 to 1 is given by

x = [−3.98 + √(3.98² + 12R_1)]/2 = [−3.98 + √(3.98² + 12 × 0.622)]/2 = 0.423.

On repeating the above process we obtain a sample of n = 6 observations as given below:

R_2 | y | R_1 | x
0.992 | 0.995 | 0.622 | 0.423
0.588 | 0.722 | 0.771 | 0.514
0.601 | 0.732 | 0.917 | 0.600
0.549 | 0.691 | 0.675 | 0.456
0.925 | 0.954 | 0.534 | 0.368
0.014 | 0.039 | 0.513 | 0.355
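The two quadratic solutions can be scripted; the sketch below (Python) follows the printed solution, including its use of the conditional density at y* = 0.995 (coefficient 3.98) for every row of the table.

```python
from math import sqrt

# Two-step selection for f(x, y) = (2/3)(x + 2y) on the unit square.
# Step 1: F_y(y) = (y + 2y^2)/3 = R2, i.e. 2y^2 + y - 3*R2 = 0.
def y_star(R2):
    return (-1 + sqrt(1 + 24 * R2)) / 4

# Step 2, as in the printed table (conditional kept at y* = 0.995):
# x^2 + 3.98*x - 3*R1 = 0, root in (0, 1).
def x_star(R1):
    return (-3.98 + sqrt(3.98 ** 2 + 12 * R1)) / 2

pairs = [(0.992, 0.622), (0.588, 0.771), (0.601, 0.917),
         (0.549, 0.675), (0.925, 0.534), (0.014, 0.513)]
sample = [(round(x_star(R1), 3), round(y_star(R2), 3)) for R2, R1 in pairs]
print(sample)
```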

1.41 PROPERTIES OF A BEST ESTIMATOR

An estimator θ̂_t of a population parameter θ is said to be best if it has the following properties:

( a ) Unbiasedness; ( b ) Consistency; ( c ) Sufficiency; and ( d ) Efficiency.

Let us now briefly explain the meaning of these terms.

1.41.1 UNBIASEDNESS

An estimator θ̂_t is an unbiased estimator of a population parameter θ if

E(θ̂_t) = Σ_{t=1}^{S(Ω)} p_t θ̂_t = θ (1.41.1)

where p_t denotes the probability of selecting the t-th sample from the population Ω, and Σ_{t=1}^{S(Ω)} p_t = 1. Note that the total number of possible samples is S(Ω) = N^n in the case of SRSWR sampling and S(Ω) = C(N, n) in the case of SRSWOR sampling.

For example:

( i ) The sample mean ȳ_t is an unbiased estimator of the population mean Ȳ under both SRSWR and SRSWOR sampling:

E(ȳ_t) = Σ_{t=1}^{S(Ω)} p_t ȳ_t = Ȳ. (1.41.2)

( ii ) The sample variance s²_y is an unbiased estimator of the population mean squared error S²_y under SRSWOR sampling:

E(s²_y) = Σ_{t=1}^{C(N,n)} p_t (s²_y)_t = S²_y (1.41.3)

and s²_y is also an unbiased estimator of the population variance σ²_y under SRSWR sampling:

E(s²_y) = Σ_{t=1}^{N^n} p_t (s²_y)_t = σ²_y. (1.41.4)

Example 1.41.1. Suppose a population consists of N = 4 units. The variable y_i takes the values 1, 2, 3, 4, and the units are distinguished as A, B, C, and D, respectively. Then the population mean is given by

Ȳ = (1/N) Σ_{i=1}^{N} Y_i = (1/4)(1 + 2 + 3 + 4) = 2.5,

the population mean squared error is given by

S²_y = 1/(N − 1) Σ_{i=1}^{N} (Y_i − Ȳ)² = 1/(4 − 1) [(1 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (4 − 2.5)²] = 5/3,

and the population variance is given by

σ²_y = (1/N) Σ_{i=1}^{N} (Y_i − Ȳ)² = (1/4)[(1 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (4 − 2.5)²] = 5/4 = 1.25.

Show that the sample mean ȳ_t is an unbiased estimator of the population mean Ȳ under both SRSWR and SRSWOR sampling, and that the sample variance s²_y is an unbiased estimator of the population mean squared error S²_y under SRSWOR sampling and of the population variance σ²_y under SRSWR sampling, respectively.

Solution. There are two cases:

Case I. Suppose we are drawing all possible samples of size n = 2 by using SRSWOR sampling.

The total number of all possible samples is S(Ω) = C(N, n) = C(4, 2) = 6 and p_t = 1/6. Now we have the following table.
Sample No. t | Sampled units | Sample mean ȳ_t | Sample variance s²_y(t) | Probability p_t
1 | (A, B) or (1, 2) | (1+2)/2 = 1.5 | 0.5 | 1/6
2 | (A, C) or (1, 3) | (1+3)/2 = 2.0 | 2.0 | 1/6
3 | (A, D) or (1, 4) | (1+4)/2 = 2.5 | 4.5 | 1/6
4 | (B, C) or (2, 3) | (2+3)/2 = 2.5 | 0.5 | 1/6
5 | (B, D) or (2, 4) | (2+4)/2 = 3.0 | 2.0 | 1/6
6 | (C, D) or (3, 4) | (3+4)/2 = 3.5 | 0.5 | 1/6

Thus the expected value of the sample mean ȳ_t is given by

E(ȳ_t) = (1/C(N, n)) Σ_{t=1}^{C(N,n)} ȳ_t = (1/6)(1.5 + 2 + 2.5 + 2.5 + 3 + 3.5) = 2.5 = Ȳ,

and that of the sample variance s²_y is given by

E[s²_y(t)] = (1/C(N, n)) Σ_{t=1}^{C(N,n)} s²_y(t) = (1/6)(0.5 + 2.0 + 4.5 + 0.5 + 2.0 + 0.5) = 10/6 = 5/3 = S²_y.

Distribution of sample mean and sample variance using WOR sampling:

Sample mean | Frequency
1.5 | 1
2.0 | 1
2.5 | 2
3.0 | 1
3.5 | 1

Sample variance | Frequency
0.5 | 3
2.0 | 2
4.5 | 1

The above tab le shows that the distribution of sample means is symmetric and that
of sample variance is skewed to the right in the case of without rep lacement
sampling.
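The Case I enumeration can be verified mechanically; the Python sketch below lists all C(4, 2) = 6 WOR samples and averages their sample means and sample variances.

```python
from itertools import combinations
from statistics import mean, variance

# All SRSWOR samples of size n = 2 from the population {1, 2, 3, 4}.
population = [1, 2, 3, 4]
samples = list(combinations(population, 2))

e_mean = sum(mean(s) for s in samples) / len(samples)     # E(y-bar)
e_var = sum(variance(s) for s in samples) / len(samples)  # E(s^2_y)
print(len(samples), e_mean, e_var)  # 6 samples; E(y-bar) = 2.5, E(s^2_y) = 5/3
```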

Case II. Suppose we are drawing all possible samples of size n = 2 by using SRSWR sampling.

The total number of all possible samples is S(Ω) = N^n = 4² = 16 and p_t = 1/16 for all t = 1, 2, ..., 16.

Now we have the following table:



Sample No. t | Sampled units | Sample mean ȳ_t | Sample variance s²_y(t) | Probability p_t
1 | (A, A) or (1, 1) | (1+1)/2 = 1.0 | 0.0 | 1/16
2 | (A, B) or (1, 2) | (1+2)/2 = 1.5 | 0.5 | 1/16
3 | (A, C) or (1, 3) | (1+3)/2 = 2.0 | 2.0 | 1/16
4 | (A, D) or (1, 4) | (1+4)/2 = 2.5 | 4.5 | 1/16
5 | (B, A) or (2, 1) | (2+1)/2 = 1.5 | 0.5 | 1/16
6 | (B, B) or (2, 2) | (2+2)/2 = 2.0 | 0.0 | 1/16
7 | (B, C) or (2, 3) | (2+3)/2 = 2.5 | 0.5 | 1/16
8 | (B, D) or (2, 4) | (2+4)/2 = 3.0 | 2.0 | 1/16
9 | (C, A) or (3, 1) | (3+1)/2 = 2.0 | 2.0 | 1/16
10 | (C, B) or (3, 2) | (3+2)/2 = 2.5 | 0.5 | 1/16
11 | (C, C) or (3, 3) | (3+3)/2 = 3.0 | 0.0 | 1/16
12 | (C, D) or (3, 4) | (3+4)/2 = 3.5 | 0.5 | 1/16
13 | (D, A) or (4, 1) | (4+1)/2 = 2.5 | 4.5 | 1/16
14 | (D, B) or (4, 2) | (4+2)/2 = 3.0 | 2.0 | 1/16
15 | (D, C) or (4, 3) | (4+3)/2 = 3.5 | 0.5 | 1/16
16 | (D, D) or (4, 4) | (4+4)/2 = 4.0 | 0.0 | 1/16

Distribution of sample mean and sample variance using SRSWR sampling:

Sample mean | Frequency
1.0 | 1
1.5 | 2
2.0 | 3
2.5 | 4
3.0 | 3
3.5 | 2
4.0 | 1

Sample variance | Frequency
0.0 | 4
0.5 | 6
2.0 | 4
4.5 | 2

Again, for SRSWR sampling the sample mean has a symmetric distribution and the distribution of the sample variance is skewed to the right.

Thus the expected value of the sample mean ȳ_t is given by

E(ȳ_t) = (1/N^n) Σ_{t=1}^{N^n} ȳ_t = (1/16)(1 + 1.5 + ... + 4) = 40/16 = 2.5 = Ȳ,

and that of the sample variance s²_y is given by

E[s²_y(t)] = (1/N^n) Σ_{t=1}^{N^n} s²_y(t) = (1/16)(0 + 0.5 + ... + 2 + 0.5) = 20/16 = 1.25 = σ²_y.
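The Case II enumeration admits the same mechanical check; the Python sketch below lists all N^n = 16 WR samples and averages their sample means and sample variances.

```python
from itertools import product
from statistics import mean, variance

# All SRSWR samples of size n = 2 from the population {1, 2, 3, 4}.
population = [1, 2, 3, 4]
samples = list(product(population, repeat=2))

e_mean = sum(mean(s) for s in samples) / len(samples)     # E(y-bar)
e_var = sum(variance(s) for s in samples) / len(samples)  # E(s^2_y)
print(len(samples), e_mean, e_var)  # 16 samples; E(y-bar) = 2.5, E(s^2_y) = 1.25
```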
Then we have the following new term.

1.41.1.1 BIAS

It is the difference between the expected value of a statistic θ̂_t and the actual value of the parameter θ, that is,

B(θ̂_t) = E(θ̂_t) − θ. (1.41.5)

Thus an estimator θ̂_t is unbiased if E(θ̂_t) = θ, which is obvious by setting B(θ̂_t) = 0.
1.41.2 CONSISTENCY

There are several definitions of the consistency of a statistic, but we will use the simplest. An estimator θ̂_t of the population parameter θ is said to be consistent if

lim_{n→∞} θ̂_t = θ. (1.41.6)

For example:

( i ) The sample mean ȳ_t (or simply ȳ) is a consistent estimator of the finite population mean Ȳ.
( ii ) The sample mean squared error s²_y is a consistent estimator of the population mean squared error S²_y.

Remark 1.2. An unbiased estimator need not necessarily be consistent; e.g., the sample mean based on a sample of size one is unbiased but not consistent.

1.41.3 SUFFICIENCY

An estimator θ̂_t is said to be sufficient for a parameter θ if the distribution of a sample y_1, y_2, ..., y_n given θ̂_t does not depend on θ. The distribution of θ̂_t then contains all the information in the sample relevant to the estimation of θ, and knowledge of θ̂_t and its sampling distribution is 'sufficient' to give that information. In general, a set of estimators or statistics θ̂_1, θ̂_2, ..., θ̂_k are 'jointly sufficient' for parameters θ_1, θ_2, ..., θ_k if the distribution of the sample values given θ̂_1, θ̂_2, ..., θ̂_k does not depend on θ_1, θ_2, ..., θ_k.

1.41.4 EFFICIENCY

Before defining the term efficiency, we shall discuss two more terms, viz., variance
and mean square error of the estimator.

1.41.4.1 VARIANCE

The variance of an estimator θ̂_t of a population parameter θ is defined as

V(θ̂_t) = E[θ̂_t − E(θ̂_t)]². (1.41.7)

It is generally denoted by the symbol σ²_θ̂.

1.41.4.2 MEAN SQUARE ERROR

The mean square error (MSE) of an estimator θ̂_t of a parameter θ is defined as:

MSE(θ̂_t) = E[θ̂_t − θ]² = V(θ̂_t) + {B(θ̂_t)}² (1.41.8)

where B(θ̂_t) denotes the bias in the estimator θ̂_t of θ.

Evidently if B(θ̂_t) = 0 then

MSE(θ̂_t) = V(θ̂_t).

Thus if θ̂_1 and θ̂_2 are two different estimators of the parameter θ, then the estimator θ̂_1 is said to be more efficient than the estimator θ̂_2 if and only if

MSE(θ̂_1) < MSE(θ̂_2).
1.42 RELATIVE EFFICIENCY

In general, the relative efficiency of an estimator θ̂_1 with respect to another estimator θ̂_2 is expressed as a percentage and is defined as:

RE = MSE(θ̂_2) × 100 / MSE(θ̂_1). (1.42.1)

1.43 RELATIVE BIAS

The ratio of the absolute value of the bias in an estimator to the square root of the mean square error of the estimator is called the relative bias. It is defined as:

RB = |B(θ̂_t)| / √MSE(θ̂_t) (1.43.1)

where B(θ̂_t) = E(θ̂_t) − θ. The relative bias is independent of the units of measurement of the original data.

1.44 VARIANCE ESTIMATION THROUGH SPLITTING

If θ̂_1, θ̂_2, ..., θ̂_n are independently distributed random variables with E(θ̂_j) = θ for all j, and θ̂ = (1/n) Σ_{j=1}^{n} θ̂_j, then

v(θ̂) = 1/(n(n − 1)) Σ_{j=1}^{n} (θ̂_j − θ̂)² (1.44.1)

is an unbiased estimator of V(θ̂). If θ̂_j = θ̂_(j) is the j-th estimator of θ obtained by dropping the j-th unit from the sample of size n, then such a method of variance estimation is also called the Jackknife method of variance estimation, and the estimator of variance takes the form

v_Jack(θ̂) = ((n − 1)/n) Σ_{j=1}^{n} (θ̂_(j) − θ̂_(·))² (1.44.2)

where θ̂_(·) = (1/n) Σ_{j=1}^{n} θ̂_(j).

For example, if θ̂ = ȳ = (1/n) Σ_{i=1}^{n} y_i is an estimator of the population mean Ȳ under SRSWR sampling, then θ̂_(j) = ȳ_(j) = (1/(n − 1)) Σ_{i≠j} y_i denotes the estimator of the population mean Ȳ obtained by dropping the j-th unit from the sample. Clearly, we can write

ȳ_(j) = (1/(n − 1)) Σ_{i≠j} y_i = (1/(n − 1))[ Σ_{i=1}^{n} y_i − y_j ] = (1/(n − 1))[ nȳ − ȳ + ȳ − y_j ] = ȳ − (1/(n − 1))(y_j − ȳ).

Also,

θ̂_(·) = (1/n) Σ_{j=1}^{n} ȳ_(j) = (1/(n(n − 1))) Σ_{j=1}^{n} [ Σ_{i=1}^{n} y_i − y_j ] = (1/(n(n − 1))) Σ_{j=1}^{n} [ nȳ − y_j ] = (1/(n(n − 1)))[ n²ȳ − nȳ ] = ȳ.

Thus the Jackknife estimator of the variance of θ̂ = ȳ is given by

v_Jack(ȳ)_srswr = ((n − 1)/n) Σ_{j=1}^{n} (ȳ_(j) − ȳ)² = ((n − 1)/n) Σ_{j=1}^{n} [ (1/(n − 1))(y_j − ȳ) ]² = (1/(n(n − 1))) Σ_{j=1}^{n} (y_j − ȳ)² = s²_y / n,

where s²_y = (1/(n − 1)) Σ_{i=1}^{n} (y_i − ȳ)².

Thus v_Jack(ȳ)_srswr is an unbiased estimator of V(ȳ) under SRSWR sampling. The Jackknife technique provides a good estimate of the variance under SRSWR sampling, but for other sampling schemes we need to adjust it to obtain an unbiased estimator of the variance. For example, under SRSWOR sampling, an unbiased estimator of V(ȳ) is

v_Jack(ȳ)_srswor = ((1 − f)(n − 1)/n) Σ_{j=1}^{n} (ȳ_(j) − ȳ)² (1.44.3)

where f = n/N. Note that it is not always possible to adjust the Jackknife estimator of variance to make it unbiased for other sampling schemes available in the literature.
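The identity v_Jack(ȳ)_srswr = s²_y/n can be confirmed numerically; the sketch below (Python, with a made-up sample) computes (1.44.2) from the leave-one-out means and compares it with s²_y/n.

```python
from statistics import mean, variance

# Jackknife variance estimator (1.44.2) for the sample mean under SRSWR,
# checked against the closed form s^2_y / n derived above.
y = [4.0, 7.0, 1.0, 9.0, 5.0, 3.0]  # an illustrative sample
n = len(y)

leave_one_out = [mean(y[:j] + y[j + 1:]) for j in range(n)]  # y-bar_(j)
theta_dot = mean(leave_one_out)                              # equals y-bar
v_jack = (n - 1) / n * sum((t - theta_dot) ** 2 for t in leave_one_out)

print(v_jack, variance(y) / n)  # the two quantities agree
```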

1.45 LOSS FUNCTION

The risk or loss associated with an estimator θ̂_t of θ, of order r, is defined as

R(θ̂_t) = E[θ̂_t − θ]^r (1.45.1)

for r = 2, 3, .... If r = 2 then R(θ̂_t) = MSE(θ̂_t), which is generally called the quadratic loss function.

1.46 ADMISSIBLE ESTIMATOR

Let Γ be a class of estimators of a population parameter θ. For a given loss function, let R(θ̂_t) represent the risk or expected loss associated with the estimator θ̂_t of θ. Out of two estimators θ̂_1 and θ̂_2 of the population parameter θ, the estimator θ̂_1 will be said to be uniformly better than θ̂_2 if, for a given loss function, the inequality

R(θ̂_1) ≤ R(θ̂_2) (1.46.1)

holds for all possible values of the characteristic under study. Now an estimator θ̂_t belonging to Γ is said to be admissible in Γ if there exists no other estimator in Γ which is better than θ̂_t.

1.47 SAMPLE SURVEY

A sample survey is a survey which is carried out using sampling methods, i.e., in
which only a portion and not the whole population is surveyed.

1.48 SAMPLING DISTRIBUTION

A sampling distribution is a distribution of a statistic in all possible samples which


can be chosen according to a specified sampling scheme. The expression almost
always relates to a sampling scheme involving random selection, and most usually
concerns the distribution of a function of a fixed number n of independent
variables.

Example 1.48.1. Select all possible SRSWR samples each of two units from the
population consisting of four units 1,3,5 and 7.
( a ) Construct the sampling distribution of the sample means.
( b ) Construct the sampling distribution of the sample variances.
Solution. The list of the 16 samples of size 2 from the population, with the mean and variance of each sample, is given in the following table.

Sample:    1,1  1,3  1,5  1,7  3,1  3,3  3,5  3,7  5,1  5,3  5,5  5,7  7,1  7,3  7,5  7,7
Means:     1    2    3    4    2    3    4    5    3    4    5    6    4    5    6    7
Variances: 0    2    8    18   2    0    2    8    8    2    0    2    18   8    2    0
( a ) The relative frequency distribution of the sample means is:

Sample mean | Frequency | Relative frequency
1 | 1 | 0.0625
2 | 2 | 0.1250
3 | 3 | 0.1875
4 | 4 | 0.2500
5 | 3 | 0.1875
6 | 2 | 0.1250
7 | 1 | 0.0625

A histogram of the above sampling distribution of sample means is given below.

Fig. 1.48.1 Distribution of the sample means.


Chapter 1: Basic concepts and mathematical notation

( b ) The relative frequency distribution of the sample variances is given in the following table.

Sample variance   Frequency   Relative frequency
      0               4            0.250
      2               6            0.375
      8               4            0.250
     18               2            0.125

A relative histogram of the above sampling distribution of the sample variances is given below.

Fig. 1.48.2 Distribution of the sample variances.
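The enumeration in Example 1.48.1 can be verified with a few lines of code. The sketch below (variable names are illustrative) rebuilds both sampling distributions by listing all 16 SRSWR samples:

```python
from itertools import product
from collections import Counter
from statistics import mean, variance

population = [1, 3, 5, 7]

# All 4^2 = 16 SRSWR samples of size n = 2 (order matters, replacement allowed).
samples = list(product(population, repeat=2))

mean_freq = Counter(mean(s) for s in samples)
var_freq = Counter(variance(s) for s in samples)   # divisor n - 1, as in the text

# Relative frequencies reproduce the two tables above.
mean_dist = {m: f / len(samples) for m, f in sorted(mean_freq.items())}
var_dist = {v: f / len(samples) for v, f in sorted(var_freq.items())}
print(mean_dist)
print(var_dist)
```

The printed relative frequencies agree with the two frequency tables of the example.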

1.49 SAMPLING FRAME

A sample space $\Psi$ (or $S$) of identifiable units or elements of the population to be surveyed is called a sampling frame. It may be a discrete space, such as households or individuals, or a continuous space, such as the area under a particular crop.

1.50 SAMPLE SURVEY DESIGN

Let $\Psi = \{t\}$, $t = 1, 2, \ldots, s_0$, be a specified space of samples, let $B$ be a Borel set in $\Psi$, and let $P$ be the probability measure defined on $B$; then the triplet $(\Psi, B, P)$ is called a sample survey design.

1.51 ERRORS IN THE ESTIMATORS

In general, two types of errors, which arise during the process of sampling, have been observed in actual practice in the estimators:
( a ) Sampling errors; ( b ) Non-sampling errors.
Let us briefly explain these errors.

1.51.1 SAMPLING ERRORS

An error which arises due to sampling is called a sampling error. Let us explain this with the help of the following example. For a population of size N = 4, let the units be A = 1, B = 2, C = 3, and D = 4. The population mean is $\bar{Y} = 2.5$. There are $^{N}C_n = {}^{4}C_2 = 6$ possible samples, each of size n = 2. The units selected in the six samples are (A, B), (A, C), (A, D), (B, C), (B, D), and (C, D). Thus the six sample means are given by:

Sample number     1       2       3       4       5       6
Sample          (A,B)   (A,C)   (A,D)   (B,C)   (B,D)   (C,D)
Sample mean      1.5     2.0     2.5     2.5     3.0     3.5

If we take the absolute difference between each sample mean and the population mean, we have the following cases: error of (A, B) = |1.5 - 2.5| = 1.0; error of (A, C) = |2.0 - 2.5| = 0.5; error of (A, D) = |2.5 - 2.5| = 0.0; error of (B, C) = |2.5 - 2.5| = 0.0; error of (B, D) = |3.0 - 2.5| = 0.5; error of (C, D) = |3.5 - 2.5| = 1.0. Note that we are measuring only two units out of four, i.e., we have only partial information in the sample; therefore sampling error arises. One of the measures of the sampling error is the variance of the estimator. For example, the variance of the sample mean estimator $\bar{y}_t$ is

$V(\bar{y}_t) = E[\bar{y}_t - E(\bar{y}_t)]^2 = \sum_{t=1}^{6} P_t (\bar{y}_t - \bar{Y})^2 = \frac{1}{6}\sum_{t=1}^{6} (\bar{y}_t - \bar{Y})^2$

$= \frac{1}{6}\left[(1.5-2.5)^2 + (2.0-2.5)^2 + (2.5-2.5)^2 + (2.5-2.5)^2 + (3.0-2.5)^2 + (3.5-2.5)^2\right] = 0.41$

because $P_t = \frac{1}{6}\ \forall\ t = 1, \ldots, 6$, and $E(\bar{y}_t) = \bar{Y}$.

Also note that

$\left(\frac{N-n}{Nn}\right)S_y^2 = \left(\frac{4-2}{4\times 2}\right)\times\frac{5}{3} = 0.41 = V(\bar{y}_t)$.

For its theoretical derivation refer to Chapter 2.
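The numerical check above generalizes: for any small population one can enumerate the SRSWOR samples directly and compare the variance of the sample mean with the closed-form expression quoted from Chapter 2. A minimal sketch (names illustrative):

```python
from itertools import combinations
from statistics import mean, variance

y = [1, 2, 3, 4]          # population units A, B, C, D
N, n = len(y), 2
Ybar = mean(y)            # population mean, 2.5
S2 = variance(y)          # population mean square S_y^2 = 5/3

# Variance of the sample mean over the 6 equally likely SRSWOR samples.
sample_means = [mean(s) for s in combinations(y, n)]
V_direct = sum((m - Ybar) ** 2 for m in sample_means) / len(sample_means)

# Closed form ((N - n)/(N n)) S_y^2.
V_formula = (N - n) / (N * n) * S2
print(V_direct, V_formula)   # both equal 5/12, i.e. 0.41 to two decimals
```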

1.51.2 NON-SAMPLING ERRORS

Errors which arise at stages other than sampling are called non-sampling errors. These errors are of four types:
( i ) Non-response errors; ( ii ) Measurement errors;
( iii ) Tabulation errors; and ( iv ) Computational errors.
Let us discuss each of these errors in brief as follows:

1.51.2.1 NON-RESPONSE ERRORS

The people from whom we get the information are called the respondents, and the people in the sample from whom we do not get information are called non-respondents. The error which arises when we fail to get the information is called non-response error, and the phenomenon is called non-response. This error arises because we are not able to cover the whole sample. For example, suppose we want to interview 100 farmers and 5 of them do not allow us to interview them; then we interview only 95, so the sample is not complete. Such errors are called non-response errors.

1.51.2.2 MEASUREMENT ERRORS

The errors that we introduce in measuring the characteristics are called measurement errors. For example, suppose we want to measure the age of the respondents. Among the respondents, some may report an age less than their actual age. These types of errors are called measurement errors.

1.51.2.3 TABULATION ERRORS

The errors which arise from missing some numbers due to the non-availability of data, or from recording some numbers wrongly while making a table, are called tabulation errors.

1.51.2.4 COMPUTATIONAL ERRORS

After the table is formed, we start our calculations. The errors committed in calculations are known as computational errors.
1.52 POINT ESTIMATOR

A point estimator endeavours to give the best single estimated value of the
parameter. For example, the average height of school children is 5.3 feet.

1.53 INTERVAL ESTIMATOR

An interval estimator of a population parameter specifies a range of values, bounded by a lower and an upper limit, within which the true value is asserted to lie. For example, the average height of school children lies between 4.9 feet and 5.5 feet.

1.54 CONFIDENCE INTERVAL

If it is possible to define two statistics $\hat{\theta}_1$ and $\hat{\theta}_2$ (functions of sample values only) to estimate a population parameter $\theta$, such that

$P(\hat{\theta}_1 < \theta < \hat{\theta}_2) = (1-\alpha)$   (1.54.1)

where $1-\alpha$ is some fixed probability, the interval between $\hat{\theta}_1$ and $\hat{\theta}_2$ is called a confidence interval. The assertion that $\theta$ lies in this interval will be true, on average, in a proportion $1-\alpha$ of the cases in which the assertion is made. Note that an interval estimate at the same level of confidence with a smaller width is considered a better estimate. For example, if someone says with 95% confidence that the average marks of a particular class lie between 65% and 85%, this estimate is better than saying with the same confidence that the average marks lie between 0% and 100%. We saw that for SRSWOR sampling the sample mean $\bar{y}_t$ is unbiased for the population mean, with

$V(\bar{y}_t) = \left(\frac{N-n}{Nn}\right)S_y^2$   (1.54.2)

and an unbiased estimator of this variance is given by

$\hat{v}(\bar{y}_t) = \left(\frac{N-n}{Nn}\right)s_y^2$.   (1.54.3)

Thus there are two cases. If $V(\bar{y}_t)$ is known, then for a large sample a $(1-\alpha)100\%$ confidence interval estimate for the population mean $\bar{Y}$ is given by

$\bar{y}_t \pm z_{\alpha/2}\sqrt{V(\bar{y}_t)}$   (1.54.4)

where the $z_{\alpha/2}$ values are given in Table 3 of the Appendix; and if $V(\bar{y}_t)$ is unknown, then for a small sample a $(1-\alpha)100\%$ confidence interval estimate for the population mean $\bar{Y}$ is given by

$\bar{y}_t \pm t_{\alpha/2}(df = n-1)\sqrt{\hat{v}(\bar{y}_t)}$   (1.54.5)

where the $t_{\alpha/2}(df = n-1)$ values are given in Table 2 of the Appendix, and df stands for degrees of freedom. Note that $\alpha = 0.05$ corresponds to a $(1-\alpha)100\% = (1-0.05)100\% = 95\%$ confidence interval.
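The two interval formulas can be collected into one small helper. This is a sketch, not a definitive implementation: it assumes the finite-population variance expression (1.54.2), and the function name `ci_mean` and the illustrative numbers in the calls are not from the text.

```python
import math
from statistics import NormalDist

def ci_mean(ybar, s2, n, N, alpha=0.05, t_quantile=None):
    """(1 - alpha)100% CI for the population mean under SRSWOR.

    Pass s2 = S_y^2 (known) for the large-sample z interval (1.54.4);
    pass s2 = s_y^2 and t_quantile = t_{alpha/2}(df = n - 1) for (1.54.5).
    """
    v = (N - n) / (N * n) * s2                           # (1.54.2) or (1.54.3)
    if t_quantile is None:
        t_quantile = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    half = t_quantile * math.sqrt(v)
    return ybar - half, ybar + half

print(ci_mean(66.75, 280.26, 4, 16))                    # z interval, S_y^2 known
print(ci_mean(66.75, 440.91, 4, 16, t_quantile=3.182))  # t interval, df = 3
```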

Let us illustrate it with the following example:

Example 1.54.1. Consider a population consisting of N = 7 units, viz. A = 1, B = 1, C = 8, D = 8, E = 8, F = 9, and G = 9. Consider all possible SRSWOR samples of n = 2 units, and compute confidence interval estimates of the population mean based on each sample under the following two different situations:
( a ) the variance or mean square error $S_y^2$ is known;
( b ) the variance or mean square error $S_y^2$ is unknown.
Find the proportion of confidence interval estimates in which the true population mean is included.
Solution. Note that for this population we have population mean $\bar{Y} = 6.29$ and population mean square $S_y^2 = 13.24$.

Now the total number of SRSWOR samples will be

$^{N}C_n = {}^{7}C_2 = \frac{7!}{2!(7-2)!} = \frac{7\times6\times5\times4\times3\times2\times1}{(2\times1)\times(5\times4\times3\times2\times1)} = 21$.

( a ) When the population variance is known: The lower and upper limits are

$L_1 = \bar{y}_t - z_{\alpha/2}\sqrt{V(\bar{y}_t)} = \bar{y}_t - 1.96\sqrt{\left(\frac{N-n}{Nn}\right)S_y^2}$,

and

$U_1 = \bar{y}_t + z_{\alpha/2}\sqrt{V(\bar{y}_t)} = \bar{y}_t + 1.96\sqrt{\left(\frac{N-n}{Nn}\right)S_y^2}$.

( b ) When the population variance is not known: The lower and upper limits are

$L_2 = \bar{y}_t - t_{\alpha/2}(df = n-1)\sqrt{\hat{v}(\bar{y}_t)} = \bar{y}_t - 12.71\sqrt{\left(\frac{N-n}{Nn}\right)s_y^2}$,

and

$U_2 = \bar{y}_t + t_{\alpha/2}(df = n-1)\sqrt{\hat{v}(\bar{y}_t)} = \bar{y}_t + 12.71\sqrt{\left(\frac{N-n}{Nn}\right)s_y^2}$.

All possible samples, sample means and variances, lower and upper limits of the 95% confidence intervals, and their coverage are given in the following table.

Sample '" Sample Variance knOwn Variance unknown


Values I ~ Y,
"".
s2
y
LI .
fJ:~~~ Y E C/(I) If L2 U2 Ye''cJ(2)
A B 1 1 1.0 0.00 -3.26 5.26 No 1.0 1.0 No
A C 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A D 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A E 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A F 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
A G 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
B C I 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B D 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B E I 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B F 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
B G 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
C D 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No
C E 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No
C F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
C G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
D E 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No
Continued .
38 Advanced sampling theory with applications

D F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes


D G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
E F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
E G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
F G 9 9 9.0 0.00 4.74 13.26 Yes 9.0 9.0 No

Thus we observe that the population mean $\bar{Y} = 6.29$ lies between $L_1$ and $U_1$ in 20 out of the 21 cases, and hence the observed proportion of confidence intervals containing the population mean is 20/21 = 0.9524. In other words, in 95.24% of the cases the population mean lies between the confidence interval limits when the variance is known. Thus the observed percentage is very close to the expected coverage of 95%.

Also we observe that, when the population variance is not known, the population mean lies between $L_2$ and $U_2$ in 16 cases, and hence the observed proportion of confidence interval estimates containing the population mean is 16/21 = 0.7619; that is, only 76.19% of the time does the population mean lie between the confidence interval limits when the variance is unknown. Here the observed percentage of coverage is lower than the expected coverage of 95%. This may be due to the very small sample and population sizes. In practice, as the sample size becomes large (How large? Just smile, because there is no unique answer), the observed proportion of coverage in both cases converges to 95%.
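The whole coverage experiment of Example 1.54.1 fits in a short script. A sketch, assuming the same critical values 1.96 and 12.71 used above (names illustrative):

```python
import math
from itertools import combinations
from statistics import mean, variance

y = [1, 1, 8, 8, 8, 9, 9]
N, n = len(y), 2
Ybar, S2 = mean(y), variance(y)        # 6.29 and 13.24 to two decimals
fpc = (N - n) / (N * n)

covered_known = covered_unknown = 0
for ids in combinations(range(N), n):
    data = [y[i] for i in ids]
    ybar, s2 = mean(data), variance(data)
    h1 = 1.96 * math.sqrt(fpc * S2)    # z interval with the known S_y^2
    h2 = 12.71 * math.sqrt(fpc * s2)   # t(df = 1) interval with s_y^2
    covered_known += (ybar - h1) < Ybar < (ybar + h1)
    covered_unknown += (ybar - h2) < Ybar < (ybar + h2)

print(covered_known, covered_unknown)  # 20 and 16 out of 21, as in the table
```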

1.55 POPULATION PROPORTION

It is defined as the total number of units of particular interest in a subgroup $A$ (such that $A \cup A^c = \Omega$, the population) divided by the total number of units in the population. In other words, when the variable of interest $Y$ takes only two values 1 and 0, that is, $Y_i = 1$ (if $i \in A$) and $Y_i = 0$ (if $i \in A^c$), then the population mean $\bar{Y}$ also becomes the population proportion $P$ as follows:

$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}(1 + 0 + 1 + 0 + \cdots + 0 + 1) = \frac{N_1}{N} = P$,   (1.55.1)

where $N_1$ denotes the number of units of the population in the group $A$, and $N$ denotes the total number of units in the population. Note that the value of the population proportion $P$ lies between 0 and 1, that is, $0 \le P \le 1$. Further note that here we are dealing with qualitative variables.

1.56 SAMPLE PROPORTION

It is defined as the total number of units of particular interest in a subgroup $A$ (such that $A \cup A^c = s$, the sample) divided by the total number of units in the sample. In other words, when the variable of interest $y$ takes only two values 1 and 0, that is, $y_i = 1$ (if $i \in A$) and $y_i = 0$ (if $i \in A^c$), then the sample mean $\bar{y}_s$ also becomes the sample proportion $\hat{P}$ as follows:

$\bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}(1 + 0 + 1 + 0 + \cdots + 0 + 1) = \frac{n_1}{n} = \hat{P}$,   (1.56.1)

where $n_1$ denotes the number of units of the sample in the group $A$, and $n$ denotes the total number of units in the sample. Note that the value of the sample proportion also lies between 0 and 1, that is, $0 \le \hat{P} \le 1$.

1.57 VARIANCE OF THE SAMPLE PROPORTION

( a ) The variance of the sample proportion under SRSWOR sampling is given by

$V_{wor}(\hat{P}) = \frac{(N-n)}{n(N-1)}P(1-P)$   (1.57.1)

and its estimator is given by

$\hat{v}_{wor}(\hat{P}) = \frac{(N-n)}{N(n-1)}\hat{P}(1-\hat{P})$.   (1.57.2)

Thus a $(1-\alpha)100\%$ confidence interval estimate of the population proportion $P$ is given by

$\hat{P} \pm z_{\alpha/2}\sqrt{\hat{v}_{wor}(\hat{P})}$.   (1.57.3)

( b ) The variance of the sample proportion under SRSWR sampling is given by

$V_{wr}(\hat{P}) = \frac{P(1-P)}{n}$,   (1.57.4)

and its estimator is given by

$\hat{v}_{wr}(\hat{P}) = \frac{\hat{P}(1-\hat{P})}{n-1}$.   (1.57.5)

Thus a $(1-\alpha)100\%$ confidence interval estimate of the population proportion $P$ is given by

$\hat{P} \pm z_{\alpha/2}\sqrt{\hat{v}_{wr}(\hat{P})}$.   (1.57.6)

For details of the derivations of the results related to the proportion, refer to Chapter 2.
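For quick use, estimator (1.57.2) and interval (1.57.3) can be sketched as a helper. The function name is illustrative, and the limits are clipped to [0, 1] since a proportion cannot leave that range:

```python
import math

def prop_ci_wor(p_hat, n, N, z=1.96):
    """95% CI for P under SRSWOR, built on estimator (1.57.2)."""
    v = (N - n) / (N * (n - 1)) * p_hat * (1 - p_hat)   # (1.57.2)
    half = z * math.sqrt(v)
    # clip to [0, 1]: a proportion can never be negative or exceed one
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

print(prop_ci_wor(0.25, 4, 16))   # illustrative values: p_hat = 0.25, n = 4, N = 16
```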

Example 1.57.1. Consider a class consisting of 6 students. Their names and major
are given in the following table:

Name    Major

Amy Math
Bob English
Chris Math
Don English
Erin Math
Frank English

( a ) Find the proportion of English students in the class.
( b ) How many SRSWOR samples, each of n = 4 students, will there be?
( c ) What is the sampling distribution of the estimate of the proportion?

Solution. ( a ) Count the number of students in the population, N = 6.
Count the number of students with major English, COUNT = $N_1$ = 3.
Compute the proportion of English students:

$P = \frac{COUNT}{N} = \frac{N_1}{N} = \frac{3}{6} = 0.5$ (parameter).

Recall that a parameter is an unknown quantity and we try to estimate it by taking a random sample from the population.

( b ) The possible combinations of choosing 4 objects out of 6 objects are given by

$^{6}C_4 = \frac{6!}{4!(6-4)!} = \frac{6\times5\times4\times3\times2\times1}{(4\times3\times2\times1)\times(2\times1)} = 15$.

Note that each combination can be taken as a without replacement sample, so the total number of distinct samples will be 15.
( c ) Sampling distribution of the estimate of the proportion: Let us construct those 15 samples as follows:

Sample   Students                    Majors     $\hat{p}$
 1       Amy, Bob, Chris, Don        M E M E    2/4 = 0.50
 2       Amy, Bob, Chris, Erin       M E M M    1/4 = 0.25
 3       Amy, Bob, Chris, Frank      M E M E    2/4 = 0.50
 4       Amy, Bob, Don, Erin         M E E M    2/4 = 0.50
 5       Amy, Bob, Don, Frank        M E E E    3/4 = 0.75
 6       Amy, Bob, Erin, Frank       M E M E    2/4 = 0.50
 7       Amy, Chris, Don, Erin       M M E M    1/4 = 0.25
 8       Amy, Chris, Don, Frank      M M E E    2/4 = 0.50
 9       Amy, Chris, Erin, Frank     M M M E    1/4 = 0.25
10       Amy, Don, Erin, Frank       M E M E    2/4 = 0.50
11       Bob, Chris, Don, Erin       E M E M    2/4 = 0.50
12       Bob, Chris, Don, Frank      E M E E    3/4 = 0.75
13       Bob, Chris, Erin, Frank     E M M E    2/4 = 0.50
14       Bob, Don, Erin, Frank       E E M E    3/4 = 0.75
15       Chris, Don, Erin, Frank     M E M E    2/4 = 0.50

From the above table we have the following frequency table:

Proportion estimate $\hat{p}$   Tally       Frequency $f_i$   Relative frequency $RF_i = f_i/\sum f_i$
0.25                            |||              3                 3/15
0.50                            |||| ||||        9                 9/15
0.75                            |||              3                 3/15
Sum                                             15                 1

The above table shows that the distribution of the estimates of the proportion is symmetric, or say normal.

Let $x_i = \hat{p}_i$ and $P_i = RF_i$. Then the expected value of $\hat{p}$ is

$E(\hat{p}) = \mu = \sum_{i=1}^{3} P_i x_i = P_1 x_1 + P_2 x_2 + P_3 x_3 = \frac{3}{15}\times 0.25 + \frac{9}{15}\times 0.50 + \frac{3}{15}\times 0.75$
$= 0.05 + 0.30 + 0.15 = 0.5 = P$ (population proportion).

Because $E(\hat{p}) = P$, the sample proportion is an unbiased estimator of the population proportion.

By the computing formula of variance, we have

$\sigma^2 = V(\hat{p}) = \left[\sum_{i=1}^{3} P_i x_i^2\right] - \mu^2 = \left[P_1 x_1^2 + P_2 x_2^2 + P_3 x_3^2\right] - \mu^2$
$= \left[\frac{3}{15}\times 0.25^2 + \frac{9}{15}\times 0.50^2 + \frac{3}{15}\times 0.75^2\right] - (0.5)^2$
$= [0.0125 + 0.15 + 0.1125] - 0.25 = 0.275 - 0.25 = 0.025$.

Also note that

$\left(\frac{N-n}{N-1}\right)\frac{P(1-P)}{n} = \left(\frac{6-4}{6-1}\right)\times\frac{0.5(1-0.5)}{4} = \frac{2}{5}\times\frac{0.25}{4} = 0.025$.

Thus the variance of the estimator of the population proportion using without replacement sampling is given by

$\sigma_{\hat{p}}^2 = \left(\frac{N-n}{N-1}\right)\frac{P(1-P)}{n}$.

Similarly, repeat this example yourself by taking all SRSWR samples; study the unbiasedness and variance of the estimate of the proportion, and also histogram the sampling distribution of the estimate of the proportion.
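Before repeating the exercise with SRSWR samples, the SRSWOR results above can be verified by enumeration. A sketch, with the student data of Example 1.57.1 (names illustrative):

```python
from itertools import combinations

majors = {"Amy": "M", "Bob": "E", "Chris": "M",
          "Don": "E", "Erin": "M", "Frank": "E"}
N, n = 6, 4
P = sum(m == "E" for m in majors.values()) / N      # population proportion 0.5

# Estimate of the proportion from each of the 15 SRSWOR samples.
p_hats = [sum(majors[s] == "E" for s in sample) / n
          for sample in combinations(majors, n)]

E_p = sum(p_hats) / len(p_hats)                     # expected value of p-hat
V_p = sum((p - E_p) ** 2 for p in p_hats) / len(p_hats)
print(E_p, V_p)   # 0.5 and 0.025, matching ((N-n)/(N-1)) * P(1-P)/n
```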

Example 1.57.2. Consider a class of 16 students taking a statistics course; their names, marks, and major subjects are given in the following table:

Sr. No.   Name    Marks   Major
 1        Ruth     92     Math
 2        Ryan     97     Math
 3        Tim      68     English
 4        Raul     62     Math
 5        Marla    97     English
 6        Erin     68     Math
 7        Judy     76     English
 8        Troy     75     English
 9        Tara     51     Math
10        Lisa     94     Math
11        John     70     Math
12        Cher     89     English
13        Lona     62     Math
14        Gina     63     Math
15        Jeff     48     Math
16        Sara     97     Math

1. Compute the following parameters:
( a ) Population mean;
( b ) Population variance;
( c ) Population standard deviation;
( d ) Population mean square error;
( e ) Population coefficient of variation.
2. ( a ) Select an SRSWOR sample of 4 units using the random number method;
( b ) Estimate the population mean and population total;
( c ) Compute the variance of the estimator of the population mean;
( d ) Estimate $S_y^2$;
( e ) Estimate the variance of the estimator of the population mean;
( f ) Construct a 95% confidence interval for the population mean, assuming that the population mean square is known and the sample size is large. Does the population mean fall in it? Interpret it;
( g ) Construct a 95% confidence interval, assuming that the population mean square is unknown and the sample size is small. Does the population mean fall in it? Interpret it.
3. ( a ) Compute the population proportion of students majoring in English;
( b ) Estimate the proportion majoring in English on the basis of the above sample;
( c ) Compute the variance of the estimator of the population proportion;
( d ) Estimate the variance of the estimator of the population proportion;
( e ) Construct a 95% confidence interval for the proportion.

Solution. We have

Name    $Y_i$   $Y_i^2$
Ruth     92      8464
Ryan     97      9409
Tim      68      4624
Raul     62      3844
Marla    97      9409
Erin     68      4624
Judy     76      5776
Troy     75      5625
Tara     51      2601
Lisa     94      8836
John     70      4900
Cher     89      7921
Lona     62      3844
Gina     63      3969
Jeff     48      2304
Sara     97      9409
Sum     1209    95559

From the population information we have

$N = 16$, $\sum_{i=1}^{N} Y_i = 1209$ and $\sum_{i=1}^{N} Y_i^2 = 95559$.

1.
( a ) Population mean:

$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{1209}{16} = 75.56$ (parameter).

( b ) Population variance:

$\sigma_y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2/N}{N} = \frac{95559 - (1209)^2/16}{16} = 262.75$ (parameter).

( c ) Population standard deviation:

$\sigma_y = \sqrt{\sigma_y^2} = \sqrt{262.75} = 16.20$ (parameter).

( d ) Population mean square error:

$S_y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2/N}{N-1} = \frac{95559 - (1209)^2/16}{16-1} = 280.26$ (parameter).

( e ) Population coefficient of variation: We are using SRSWOR sampling, so

$C_y = \frac{S_y}{\bar{Y}} = \frac{\sqrt{S_y^2}}{\bar{Y}} = \frac{\sqrt{280.26}}{75.56} = \frac{16.74}{75.56} = 0.2215$ (parameter).
2.
( a ) Selection of 4 units using SRSWOR sampling (n = 4): Let us start with the 1st row and 6th column of the Pseudo-Random Number Table 1 given in the Appendix.

Random number   Decision (R = Rejection, S = Selection)   Name of the selected student
62 R
77 R
92 R
67 R
53 R
51 R
33 R
07 S Jud y
62 R
69 R
76 R
48 R
50 R
88 R
37 R
72 R
63 R
21 R
33 R
25 R
76 R
09 S Tara
43 R
80 R
94 R
62 R
68 R
15 S Jeff
42 R
93 R
29 R
01 S Ruth

So our SRSWOR sample consists of four students = {Judy, Tara, Jeff, Ruth} .
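The table-scanning procedure above can be sketched in code; here a seeded pseudo-random generator stands in for the printed random number table, and the function name and seed are illustrative:

```python
import random

# Sketch of SRSWOR selection by the random-number method: scan two-digit
# random numbers, keep those in 01..N (the serial numbers of the students),
# and reject out-of-range numbers and repeats until n units are selected.
def srswor_by_random_numbers(N, n, seed=0):
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n:
        r = rng.randint(0, 99)             # a two-digit pseudo-random number
        if 1 <= r <= N and r not in chosen:
            chosen.append(r)
    return chosen

sample = srswor_by_random_numbers(16, 4)
print(sample)   # four distinct serial numbers between 1 and 16
```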

Now from the sample we have the following information:

Name    $y_i$   $y_i^2$
Judy     76      5776
Tara     51      2601
Jeff     48      2304
Ruth     92      8464
Sum     267     19145

Thus $n = 4$, $\sum_{i=1}^{n} y_i = 267$ and $\sum_{i=1}^{n} y_i^2 = 19145$.
i =1 i=1
( b ) Sample mean:

$\bar{y}_t = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{267}{4} = 66.75$ (statistic),

which is an estimate of the population mean.

Note that an estimator of the population total $Y = \sum_{i=1}^{N} Y_i$ will be given by

$\hat{Y}_t = N\bar{y}_t = 16\times 66.75 = 1068$ (statistic).

( c ) The variance of the sample mean estimator is given by

$V(\bar{y}_t) = \left(\frac{N-n}{Nn}\right)S_y^2 = \left(\frac{16-4}{16\times 4}\right)\times 280.26 = 52.548$ (parameter).

( d ) An estimator of $S_y^2$ is given by

$s_y^2 = \frac{\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2/n}{n-1} = \frac{19145 - (267)^2/4}{4-1} = 440.91$ (statistic).

( e ) An estimator of the variance of the estimator of the population mean is

$\hat{v}(\bar{y}_t) = \left(\frac{N-n}{Nn}\right)s_y^2 = \left(\frac{16-4}{16\times 4}\right)\times 440.91 = 82.67$ (statistic).

( f ) Here the 95% confidence interval is given by

$\bar{y}_t \pm 1.96\sqrt{V(\bar{y}_t)}$, or $66.75 \pm 1.96\sqrt{52.548}$, or $66.75 \pm 14.20$, or [52.55, 80.95].

Yes, the true population mean $\bar{Y} = 75.56$ lies in the 95% confidence interval estimate. The interpretation of a 95% confidence interval is that we are 95% sure that the true mean lies between the two limits of this interval estimate. Note that an interval estimate is a statistic.

( g ) Here the 95% confidence interval estimate is given by

$\bar{y}_t \pm t_{\alpha/2}(df = n-1)\sqrt{\hat{v}(\bar{y}_t)}$, or $66.75 \pm t_{0.025}(df = 3)\sqrt{82.67}$,

or $66.75 \pm 3.182\sqrt{82.67}$, or $66.75 \pm 28.93$, or [37.82, 95.68],

where $t_{\alpha/2}(df = n-1) = t_{0.025}(df = 3) = 3.182$ is taken from Table 2 of the Appendix.

Yes, again the true population mean lies in this 95% confidence interval, and its interpretation is the same as above. Again note that an interval estimate is a statistic.

3.
( a ) Let us give an upper case 'FLAG' of 1 to English majors and 0 to Math majors in the whole population; then we have

Sr. No.   Name    Major     FLAG
 1        Ruth    Math      0
 2        Ryan    Math      0
 3        Tim     English   1
 4        Raul    Math      0
 5        Marla   English   1
 6        Erin    Math      0
 7        Judy    English   1
 8        Troy    English   1
 9        Tara    Math      0
10        Lisa    Math      0
11        John    Math      0
12        Cher    English   1
13        Lona    Math      0
14        Gina    Math      0
15        Jeff    Math      0
16        Sara    Math      0

Population proportion:

$P = \frac{\sum_{i=1}^{N} FLAG_i}{N} = \frac{\text{No. of students with English major}}{\text{Total no. of students}} = \frac{5}{16} = 0.3125$ (parameter).

( b ) Let us now give the same lower case 'flag' to the students in the sample.

Name    Major     flag
Judy    English   1
Tara    Math      0
Jeff    Math      0
Ruth    Math      0
Sum               1

Sample proportion: The sample proportion is given by

$\hat{p} = \frac{\sum_{i=1}^{n} flag_i}{n} = \frac{1}{4} = 0.25$.

( c ) The variance of the estimator of the proportion under SRSWOR sampling is given by

$V_{wor}(\hat{p}) = \frac{(N-n)}{n(N-1)}P(1-P) = \frac{(16-4)}{4\times(16-1)}\times 0.3125\times(1-0.3125) = 0.0429$.

( d ) An estimator of the variance of the estimator of the proportion under SRSWOR sampling is given by

$\hat{v}_{wor}(\hat{p}) = \frac{(N-n)}{N(n-1)}\hat{p}(1-\hat{p}) = \frac{(16-4)}{16\times(4-1)}\times 0.25\times(1-0.25) = 0.04687$.

( e ) A 95% confidence interval estimate for the true population proportion is

$\hat{p} \pm 1.96\sqrt{\hat{v}_{wor}(\hat{p})}$,

or $0.25 \pm 1.96\sqrt{0.04687}$, or $0.25 \pm 0.424$, or [-0.174, 0.674], or [0.0, 0.674].

Note that a proportion can never be negative, so the lower limit has been changed to 0. Caution! It must be noted that we have here a very small sample; in practice, when we deal with the problem of estimating a proportion, a minimum sample size of 30 units is recommended from large populations. Note that instead of using 'FLAG' or 'flag', sometimes we assign the codes 0 or 1 directly to the variable Y or X.

Example 1.57.3. For the population considered in the previous example:

( a ) John considers a sampling scheme consisting of only 4 samples, as follows:

Sample number   Sample                      Probability
1               Cher, John, Marla, Sara     0.25
2               Erin, Judy, Raul, Tara      0.25
3               Gina, Lisa, Ruth, Tim       0.25
4               Jeff, Lona, Ryan, Troy      0.25

( b ) Mike considers another sampling plan consisting of 13 samples, each of 4 students, as given below:

Sample number   Sample                     Probability
 1              Cher, Erin, Gina, Jeff     1/13
 2              Cher, Erin, Gina, John     1/13
 3              Cher, Erin, Gina, Judy     1/13
 4              Cher, Erin, Gina, Lisa     1/13
 5              Cher, Erin, Gina, Lona     1/13
 6              Cher, Erin, Gina, Marla    1/13
 7              Cher, Erin, Gina, Raul     1/13
 8              Cher, Erin, Gina, Ruth     1/13
 9              Cher, Erin, Gina, Ryan     1/13
10              Cher, Erin, Gina, Sara     1/13
11              Cher, Erin, Gina, Tara     1/13
12              Cher, Erin, Gina, Tim      1/13
13              Cher, Erin, Gina, Troy     1/13

Let $\bar{y}_t$ be an estimator of the population mean. Find the following for each one of the above sampling schemes:

$E(\bar{y}_t)$; $V(\bar{y}_t)$; $B(\bar{y}_t)$; and $MSE(\bar{y}_t)$.

Comment on the statement, 'Mike's sampling scheme is better than John's sampling scheme.' Justify your logic and discuss the relative efficiency.

Solution. ( a ) John's sampling plan:

Sample t   $\bar{y}_t$   $P_t$   $(\bar{y}_t - \bar{Y})^2$
1          88.25         0.25    161.03
2          64.25         0.25    127.91
3          79.25         0.25     13.61
4          70.50         0.25     25.60
Sum                              328.18

Since each sample has probability $P_t = 1/4$,

$E(\bar{y}_t) = \sum_{t=1}^{4} P_t\bar{y}_t = \frac{1}{4}(88.25 + 64.25 + 79.25 + 70.50) = 75.56 = \bar{Y}$, so $B(\bar{y}_t) = 0$,

and

$MSE(\bar{y}_t) = \sum_{t=1}^{4} P_t(\bar{y}_t - \bar{Y})^2 = \frac{1}{4}\times 328.18 = 82.045 = V(\bar{y}_t)$.

( b ) Mike's sampling plan:

Sample t   y values         $\bar{y}_t$   $P_t$   $\{\bar{y}_t - E(\bar{y}_t)\}^2$   $(\bar{y}_t - \bar{Y})^2$
 1         89, 68, 63, 48   67.00         1/13     49.2696     73.3164
 2         89, 68, 63, 70   72.50         1/13      2.3081      9.3789
 3         89, 68, 63, 76   74.00         1/13      0.0004      2.4414
 4         89, 68, 63, 94   78.50         1/13     20.0773      8.6289
 5         89, 68, 63, 62   70.50         1/13     12.3850     25.6289
 6         89, 68, 63, 97   79.25         1/13     27.3610     13.5977
 7         89, 68, 63, 62   70.50         1/13     12.3850     25.6289
 8         89, 68, 63, 92   78.00         1/13     15.8465      5.9414
 9         89, 68, 63, 97   79.25         1/13     27.3610     13.5977
10         89, 68, 63, 97   79.25         1/13     27.3610     13.5977
11         89, 68, 63, 51   67.75         1/13     39.3033     61.0352
12         89, 68, 63, 68   72.00         1/13      4.0773     12.6914
13         89, 68, 63, 75   73.75         1/13      0.0725      3.2852
Sum                        962.25                 237.8077    268.7696

where $\bar{Y} = 75.56$.

Thus we have

$E(\bar{y}_t) = \sum_{t=1}^{13} P_t\bar{y}_t = \frac{1}{13}\times 962.25 = 74.02$,

and

$B(\bar{y}_t) = E(\bar{y}_t) - \bar{Y} = 74.02 - 75.56 = -1.54$,

$V(\bar{y}_t) = \sum_{t=1}^{13} P_t\{\bar{y}_t - E(\bar{y}_t)\}^2 = \frac{1}{13}\times 237.8077 = 18.2929$,

and

$MSE(\bar{y}_t) = \sum_{t=1}^{13} P_t(\bar{y}_t - \bar{Y})^2 = \frac{1}{13}\times 268.7696 = 20.675$.

Although John's sampling scheme is unbiased, it has a much larger mean square error than Mike's sampling scheme. Thus we shall prefer Mike's sampling scheme over John's sampling scheme. Also note that the relative efficiency of Mike's sampling scheme over John's sampling scheme is given by

$RE = \frac{MSE(\bar{y}_t)_{John}}{MSE(\bar{y}_t)_{Mike}}\times 100 = \frac{82.045}{20.675}\times 100 = 396.83\%$.

Thus one can say that Mike's sampling plan is almost four times more efficient than John's sampling scheme.
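The two plans can be compared by brute force. A sketch (function and variable names illustrative; marks as in Example 1.57.2):

```python
marks = {"Ruth": 92, "Ryan": 97, "Tim": 68, "Raul": 62, "Marla": 97,
         "Erin": 68, "Judy": 76, "Troy": 75, "Tara": 51, "Lisa": 94,
         "John": 70, "Cher": 89, "Lona": 62, "Gina": 63, "Jeff": 48,
         "Sara": 97}
Ybar = sum(marks.values()) / len(marks)            # 75.5625

def plan_summary(samples, probs):
    """Expected value, variance, bias, and MSE of the sample mean for a plan."""
    means = [sum(marks[s] for s in smp) / len(smp) for smp in samples]
    E = sum(p * m for p, m in zip(probs, means))
    V = sum(p * (m - E) ** 2 for p, m in zip(probs, means))
    B = E - Ybar
    return E, V, B, V + B ** 2                     # MSE = V + B^2

john = [["Cher", "John", "Marla", "Sara"], ["Erin", "Judy", "Raul", "Tara"],
        ["Gina", "Lisa", "Ruth", "Tim"], ["Jeff", "Lona", "Ryan", "Troy"]]
others = ["Jeff", "John", "Judy", "Lisa", "Lona", "Marla", "Raul",
          "Ruth", "Ryan", "Sara", "Tara", "Tim", "Troy"]
mike = [["Cher", "Erin", "Gina", x] for x in others]

_, _, _, mse_john = plan_summary(john, [0.25] * 4)
_, _, _, mse_mike = plan_summary(mike, [1 / 13] * 13)
print(100 * mse_john / mse_mike)                   # relative efficiency, ~397%
```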

1.58 RELATIVE STANDARD ERROR

The relative standard error of an estimator $\hat{\theta}$ of a population parameter $\theta$ is defined as the positive square root of the relative variance of the estimator $\hat{\theta}$. Mathematically,

$RSE(\hat{\theta}) = \sqrt{RV(\hat{\theta})}$   (1.58.1)

where $RV(\hat{\theta}) = V(\hat{\theta})/[E(\hat{\theta})]^2$ denotes the relative variance of the estimator $\hat{\theta}$. Another well known name for the relative standard error is the coefficient of variation.
1.59 AUXILIARY INFORMATION

In many sample surveys, it is possible to collect information about some variable(s) in addition to the variable of interest or study variable. The auxiliary information is accurately known from many sources, like reference books, journals, administrative records, etc., and is cheaper to obtain than the study variable. For example, while estimating the average income of people living in a particular city, the plot area owned by each individual may be known from some published sources. Later on we will observe that known auxiliary information is also helpful in increasing the efficiency of the estimators. Before dealing with two variables, we should be familiar with the following terms. If $Y_i$ and $X_i$ denote the values of the $i$-th unit for the study variable $Y$ and auxiliary variable $X$, then we have:

( a ) The covariance between X and Y is

$Cov(X, Y) = E[X - E(X)][Y - E(Y)] = E(XY) - E(X)E(Y)$.   (1.59.1)

The covariance between X and Y is the same as that between Y and X, i.e., $Cov(X, Y) = Cov(Y, X)$. For example, for SRSWOR sampling, the covariance between X and Y is given by

$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$,

where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ and $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$. Note that an unbiased estimator of $S_{xy}$ is given by

$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$,

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
( b ) The population correlation coefficient between X and Y is defined as

$\rho_{xy} = \frac{Cov(X, Y)}{\sqrt{V(X)}\sqrt{V(Y)}}$.   (1.59.2)

For simple random sampling it is given by

$\rho_{xy} = S_{xy}\big/\sqrt{S_x^2 S_y^2}$.   (1.59.3)

A biased estimator of the correlation coefficient $\rho_{xy}$ is defined as

$r_{xy} = s_{xy}\big/\sqrt{s_x^2 s_y^2}$.   (1.59.4)

The value of $\rho_{xy}$ (or $r_{xy}$) is a unit-free number, and it lies in the interval [-1, +1]. It is also independent of a change of origin and scale of the variables X and Y. The linear relationship can also be seen with the help of scatter diagrams, as follows:

SCATTER PLOTS

$\rho_{xy} > 0$: As X increases, Y also increases; the relationship is positive.
$\rho_{xy} < 0$: As X increases, Y decreases; the relationship is negative.
$\rho_{xy} = +1$: As X increases, Y also increases, and all points lie on a straight line; perfect positive relationship.
$\rho_{xy} = -1$: As X increases, Y decreases, and all points lie on a straight line; perfect negative relationship.
$\rho_{xy} = 0$: As X increases, Y may increase or decrease (Y does not care about X); no relationship.
$\rho_{xy}$ may be positive, negative, or zero: As X increases, Y first increases and then decreases; the sign of the relationship is not sure.

Fig. 1.59.1 Scatter plots.

Note that a similar scatter plot can be made from sample values to find the sign of the sample correlation coefficient $r_{xy}$.

( c ) The population regression coefficient of Y on X is defined as

$\beta = Cov(X, Y)/V(X)$.   (1.59.5)

For simple random sampling, it is given by

$\beta = S_{xy}/S_x^2$.   (1.59.6)

A biased estimator of $\beta$ is given by

$b = s_{xy}/s_x^2$,   (1.59.7)

which in fact represents the change in the study variable Y for a unit change in the auxiliary variable X. Note that the sign of $\beta$ (or $b$) is the same as that of $\rho_{xy}$ (or $r_{xy}$).
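Formulas (1.59.1)-(1.59.7) can be checked on a toy data set; the sketch below (variable names illustrative) uses the five-unit population of Example 1.59.1 that follows:

```python
from statistics import mean

y = [9, 11, 13, 16, 21]     # study variable for units A..E
x = [14, 18, 19, 20, 24]    # auxiliary variable for units A..E
N = len(y)
ybar, xbar = mean(y), mean(x)

# Population covariance and mean squares with divisor N - 1.
S_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (N - 1)
S_x2 = sum((a - xbar) ** 2 for a in x) / (N - 1)
S_y2 = sum((b - ybar) ** 2 for b in y) / (N - 1)

rho = S_xy / (S_x2 * S_y2) ** 0.5   # correlation coefficient (1.59.3)
beta = S_xy / S_x2                  # regression coefficient (1.59.6)
print(round(rho, 2), round(beta, 2))   # 0.96 1.25
```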

Example 1.59.1. Consider the following population consisting of five (N = 5) units A, B, C, D, and E, where for each unit in the population two variables Y and X are measured.

Units   A    B    C    D    E
$Y_i$   9    11   13   16   21
$X_i$   14   18   19   20   24

Find the following parameters:

( a ) $\bar{Y}$, $\bar{X}$, $S_x^2$, $S_y^2$, $S_{xy}$, $\rho_{xy}$ and $\beta$.

( b ) Select all possible SRSWOR samples of n = 3 units. Show that $\bar{y}$, $\bar{x}$, $s_x^2$, $s_y^2$, $s_{xy}$ are unbiased for $\bar{Y}$, $\bar{X}$, $S_x^2$, $S_y^2$, $S_{xy}$, but $r_{xy}$ and $b$ remain biased estimators of $\rho_{xy}$ and $\beta$ respectively.
( c ) Compute $Cov(\bar{y}, \bar{x})$ by using the definition.
( d ) Compute $\frac{(1-f)}{n}S_{xy}$ and comment on it.
Solution. From the complete population information, we have

Units   $Y_i$   $X_i$   $(Y_i-\bar{Y})$   $(X_i-\bar{X})$   $(Y_i-\bar{Y})^2$   $(X_i-\bar{X})^2$   $(Y_i-\bar{Y})(X_i-\bar{X})$
A        9      14        -5                -5                25                  25                  25
B       11      18        -3                -1                 9                   1                   3
C       13      19        -1                 0                 1                   0                   0
D       16      20         2                 1                 4                   1                   2
E       21      24         7                 5                49                  25                  35
Sum     70      95         0                 0                88                  52                  65

( a ) From the above table we have $\sum_{i=1}^{N} Y_i = 70$ and $\sum_{i=1}^{N} X_i = 95$, so that

$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{70}{5} = 14$, and $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{95}{5} = 19$.

From the above table, $\sum_{i=1}^{N}(Y_i-\bar{Y})^2 = 88$, $\sum_{i=1}^{N}(X_i-\bar{X})^2 = 52$ and $\sum_{i=1}^{N}(Y_i-\bar{Y})(X_i-\bar{X}) = 65$, so that

$S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})^2 = \frac{88}{5-1} = 22$, $S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2 = \frac{52}{5-1} = 13$,

$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})(X_i-\bar{X}) = \frac{65}{5-1} = 16.25$, $\rho_{xy} = \frac{S_{xy}}{\sqrt{S_x^2 S_y^2}} = \frac{16.25}{\sqrt{13\times 22}} = 0.960$,

and $\beta = \frac{S_{xy}}{S_x^2} = \frac{16.25}{13} = 1.25$.

( b ) Here we have N = 5 and n = 3, so the total number of possible SRSWOR samples will be $^{5}C_3 = 10$.

The ten samples and their observations are:

Sample   Units     y values       x values
 1       A, B, C   9, 11, 13      14, 18, 19
 2       A, B, D   9, 11, 16      14, 18, 20
 3       A, B, E   9, 11, 21      14, 18, 24
 4       A, C, D   9, 13, 16      14, 19, 20
 5       A, C, E   9, 13, 21      14, 19, 24
 6       A, D, E   9, 16, 21      14, 20, 24
 7       B, C, D   11, 13, 16     18, 19, 20
 8       B, C, E   11, 13, 21     18, 19, 24
 9       B, D, E   11, 16, 21     18, 20, 24
10       C, D, E   13, 16, 21     19, 20, 24

For each sample, the deviations $(y_i-\bar{y})$ and $(x_i-\bar{x})$, their squares, and their products are computed exactly as in the population table above.

From the above table we have

Sample    ȳ_t      x̄_t      s²_x     s_xy     r_xy      b
  1      11.00    17.00     7.00     5.00    0.945    0.71
  2      12.00    17.33     9.37    10.02    0.829    1.12
  3      13.67    18.67    25.38    31.37    0.968    1.24
  4      12.67    17.67    10.35    10.84    0.960    1.05
  5      14.33    19.00    25.00    30.00    0.982    1.20
  6      15.33    19.33    25.33    30.33    1.000    1.20
  7      13.33    19.00     1.00     2.50    0.993    2.50
  8      15.00    20.33    10.33    17.00    0.999    1.65
  9      16.00    20.67     9.33    15.00    0.982    1.61
 10      16.67    21.00     7.00    10.50    0.982    1.50

Thus we have the following results:

E(ȳ) = 14 = Ȳ, that is, the sample mean ȳ is unbiased for the population mean of the study variable;

E(x̄) = 19 = X̄, that is, the sample mean x̄ is unbiased for the population mean of the auxiliary variable;

E(s²_y) = 22.00 = S²_y, that is, the sample variance s²_y is unbiased for the population mean square S²_y of the study variable;

E(s²_x) = 13.00 = S²_x, that is, the sample variance s²_x is unbiased for the population mean square S²_x of the auxiliary variable;

E(s_xy) = 16.25 = S_xy, that is, the sample covariance s_xy is unbiased for the population covariance S_xy of both variables;

E(r_xy) = 0.964 ≠ ρ_xy, that is, the sample correlation coefficient r_xy is biased for the population correlation coefficient ρ_xy, and

B(r_xy) = E(r_xy) - ρ_xy = 0.964 - 0.960 = 0.004;

and

E(b) = 1.40 ≠ β, that is, the sample regression coefficient b is biased for the population regression coefficient β,

and

B(b) = E(b) - β = 1.40 - 1.25 = 0.15.
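The whole enumeration above can be checked with a short script; this is a sketch (helper names are ours), giving each of the C(5,3) = 10 SRSWOR samples equal probability 1/10:

```python
from itertools import combinations

# Population values from the worked example (y = study, x = auxiliary).
Y = [9, 11, 13, 16, 21]
X = [14, 18, 19, 20, 24]

def stats(ys, xs):
    # Sample means, variances (divisor n - 1) and covariance for one sample.
    n = len(ys)
    yb, xb = sum(ys) / n, sum(xs) / n
    s2y = sum((v - yb) ** 2 for v in ys) / (n - 1)
    s2x = sum((v - xb) ** 2 for v in xs) / (n - 1)
    sxy = sum((a - yb) * (b - xb) for a, b in zip(ys, xs)) / (n - 1)
    return yb, xb, s2y, s2x, sxy

rows = [stats([Y[i] for i in s], [X[i] for i in s])
        for s in combinations(range(5), 3)]        # the 10 SRSWOR samples

E = lambda vals: sum(vals) / len(vals)             # each sample has p = 1/10
print(E([r[0] for r in rows]))                     # E(ybar) ~ 14.0, unbiased
print(E([r[2] for r in rows]))                     # E(s2y)  ~ 22.0, unbiased
print(E([r[4] for r in rows]))                     # E(sxy)  ~ 16.25, unbiased
Eb = E([r[4] / r[3] for r in rows])                # E(b) ~ 1.37 in exact
print(Eb)                                          # arithmetic (the text's
                                                   # 1.40 uses rounded rows)
```

The small gap between E(b) and β = 1.25 confirms that the sample regression coefficient is biased.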
( c ) The covariance between ȳ and x̄ is defined as:

Cov(ȳ, x̄) = E[{ȳ - E(ȳ)}{x̄ - E(x̄)}] = E[(ȳ - Ȳ)(x̄ - X̄)] = Σ_{t=1}^{10} p_t (ȳ_t - Ȳ)(x̄_t - X̄).
Now we have

Qit.. . f.},Pthf,\{~I:- x)ro' (Y-, - Y-X"


XI ':"<-y.,
, -'l
i'"
~~XI '" , XI X<
11.00 17.00 -3.00 -2.00 6.0000
12.00 17.33 -2.00 -1.67 3.3400
13.67 18.67 -0.33 -0.33 0.1089
12.67 17.67 -1.33 -1.33 1.7689
14.33 19.00 0.33 0.00 0.0000
15.33 19.33 1.33 0.33 0.4389
13.33 19.00 -0.67 0.00 0.0000
15.00 20.33 1.00 1.33 1.3300
16.00 20.67 2.00 1.67 3.3400
16.67 21.00 2.67 2.00 5.3400
.
I. "' co Sum 0.00 <.,r~ o.oo 2 1.6667
So by definition,

Cov(ȳ, x̄) = Σ_{t=1}^{10} p_t (ȳ_t - Ȳ)(x̄_t - X̄) = (1/10) × 21.6667 = 2.1667.

( d ) Now we have

((N - n)/(Nn)) S_xy = ((5 - 3)/(5 × 3)) × 16.25 = 2.16667.

Thus we have

Cov(ȳ, x̄) = ((N - n)/(Nn)) S_xy = ((1 - f)/n) S_xy,  where f = n/N.

For a theoretical proof, refer to Chapter 2.

1.60 SOME USEFUL MATHEMATICAL FORMULAE

If x and y are two random variables and c and d are two real constants, then

( a ) V(cx) = c²V(x);                    (1.60.1)

( b ) Cov(cx, dy) = cd Cov(x, y);        (1.60.2)



( c ) If x = Σ_{i=1}^{n} x_i, where the x_i are also random variables, then we have

V(x) = V(Σ_{i=1}^{n} x_i) = Σ_{i=1}^{n} V(x_i) + Σ_{i≠j=1}^{n} Cov(x_i, x_j).        (1.60.3)

( d ) If x = Σ_{i=1}^{n} c_i x_i, where the c_i are real constants, then we have

V(x) = V(Σ_{i=1}^{n} c_i x_i) = Σ_{i=1}^{n} c_i² V(x_i) + Σ_{i≠j=1}^{n} c_i c_j Cov(x_i, x_j).        (1.60.4)

Note that if x_i and x_j are independent then Cov(x_i, x_j) = 0 and

V(x) = V(Σ_{i=1}^{n} c_i x_i) = Σ_{i=1}^{n} c_i² V(x_i).

( e ) If x = Σ_{i=1}^{n} c_i x_i and y = Σ_{i=1}^{n} d_i y_i, where the c_i and d_i are real constants, then we have

Cov(x, y) = Cov(Σ_{i=1}^{n} c_i x_i, Σ_{i=1}^{n} d_i y_i) = Σ_{i=1}^{n} c_i d_i Cov(x_i, y_i) + Σ_{i≠j=1}^{n} c_i d_j Cov(x_i, y_j).        (1.60.5)

Note that if x_i and y_j are independent then Cov(x_i, y_j) = 0 and

Cov(x, y) = Cov(Σ_{i=1}^{n} c_i x_i, Σ_{i=1}^{n} d_i y_i) = Σ_{i=1}^{n} c_i d_i Cov(x_i, y_i).
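Identities (1.60.1) and (1.60.2) can be verified exactly on a small discrete joint distribution; this is a sketch using a toy probability mass function of our own (not from the text):

```python
# A tiny joint p.m.f. (our own toy example) over pairs (x, y).
pmf = {(0, 1): 0.2, (1, 3): 0.5, (2, 2): 0.3}

def E(f):
    # Expectation of f(x, y) under the joint p.m.f.
    return sum(p * f(x, y) for (x, y), p in pmf.items())

V = lambda g: E(lambda x, y: g(x, y) ** 2) - E(g) ** 2
Cov = lambda g, h: E(lambda x, y: g(x, y) * h(x, y)) - E(g) * E(h)

c, d = 3.0, -2.0
# V(cx) = c^2 V(x)   and   Cov(cx, dy) = cd Cov(x, y)
assert abs(V(lambda x, y: c * x) - c * c * V(lambda x, y: x)) < 1e-9
assert abs(Cov(lambda x, y: c * x, lambda x, y: d * y)
           - c * d * Cov(lambda x, y: x, lambda x, y: y)) < 1e-9
print("identities (1.60.1) and (1.60.2) verified")
```

The same style of check extends to (1.60.3)-(1.60.5) by expanding the sums term by term.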


1.61 ORDERED STATISTICS

These are parameters which deal with arranging the data in ascending or descending order, and we introduce a few of them here as follows:

1.61.1 POPULATION MEDIAN

It is a measure which divides the population into exactly two equal parts, and it is denoted by M_y. Its analogue from the sample is called the sample median, and is denoted by M̂_y. A pictorial representation is given below:

[Figure: data arranged in ascending order, from the minimum value to the maximum value, with 50% of the data values on each side of the median.]

Fig. 1.61.1 Structure of data to find median.


Rules to find sample median: Consider a sample having n observations, and we wish to find the sample median. The first step is to arrange the data in ascending order, and then there are two situations:

( i ) If the sample size n is odd, then the value at the ((n + 1)/2)th position of the ordered data is called the sample median. As an illustration, consider a sample consisting of n = 5 (odd) observations as 50, 90, 30, 60, and 70. The first step is to arrange the data in ascending order as: 30, 50, 60, 70, 90. The second step is to pick up the value at the ((n + 1)/2)th = ((5 + 1)/2)th = 3rd position = 60, so M̂_y = 60.

( ii ) If the sample size n is even, then the average of the values at the (n/2)th and (n/2 + 1)th positions of the ordered data is called the sample median. As an illustration, consider a sample consisting of n = 6 (even) observations as 50, 90, 30, 60, 70 and 20. The first step is to arrange the data in ascending order as: 20, 30, 50, 60, 70, 90. The second step is to pick up two values: one at the (n/2)th = (6/2)th = 3rd position = 50, and a second at the (n/2 + 1)th = (6/2 + 1)th = 4th position = 60. Then the average of these values is the median, given by M̂_y = (50 + 60)/2 = 55.
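The two rules above can be sketched as a small function (a hypothetical helper of ours, not from the text):

```python
def sample_median(data):
    d = sorted(data)                   # step 1: arrange in ascending order
    n = len(d)
    if n % 2 == 1:                     # odd n: value at position (n + 1)/2
        return d[(n + 1) // 2 - 1]     # minus 1 for 0-based indexing
    lo, hi = d[n // 2 - 1], d[n // 2]  # even n: positions n/2 and n/2 + 1
    return (lo + hi) / 2               # their average is the median

print(sample_median([50, 90, 30, 60, 70]))       # 60
print(sample_median([50, 90, 30, 60, 70, 20]))   # 55.0
```

Both illustrations from the text are reproduced: M̂_y = 60 for the odd case and M̂_y = 55 for the even case.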

1.61.2 POPULATION Q UARTILES

These are three measures which divide the population into four equal parts. The i-th quartile is represented by Q_i, i = 1, 2, 3. A pictorial representation is given below:

[Figure: data arranged in ascending order, from the minimum value to the maximum value, divided by the three quartiles into four parts of 25% each.]

Fig. 1.61.2 Structure of data to find quartiles.



Note that the second quartile Q_2 is the median. The first quartile Q_1 is the median of the data less than or equal to the second quartile Q_2, and the third quartile Q_3 is the median of the data greater than or equal to the second quartile Q_2. Thus finding the three quartiles requires finding a median three times from the given ordered data. The population interquartile range is defined as: Q = (Q_3 - Q_1). The sample analogues of the population quartiles are called sample quartiles and are denoted by Q̂_i, i = 1, 2, 3, and the sample interquartile range is defined as: Q̂ = (Q̂_3 - Q̂_1), which is a measure of variation in the data set.
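The rule above can be sketched in code. Note that several quartile conventions exist; this sketch (helper names are ours) implements exactly the one described here, with Q_1 and Q_3 as medians of the lower and upper halves including Q_2:

```python
def median(d):
    d = sorted(d)
    n = len(d)
    return d[n // 2] if n % 2 else (d[n // 2 - 1] + d[n // 2]) / 2

def quartiles(data):
    q2 = median(data)                            # Q2 is the median
    q1 = median([v for v in data if v <= q2])    # median of values <= Q2
    q3 = median([v for v in data if v >= q2])    # median of values >= Q2
    return q1, q2, q3

q1, q2, q3 = quartiles([10, 20, 30, 40, 50, 60, 70])
print(q1, q2, q3, q3 - q1)   # 25.0 40 55.0 30.0 (last: interquartile range)
```

The toy data set is our own illustration; real software (e.g. different statistics packages) may interpolate quartiles differently.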

1.61.3 POPULATION PERCENTILES

These are 99 measures which divide the population into 100 equal parts. The i-th population percentile is represented by P_i, i = 1, 2, ..., 99, and its pictorial representation is given below:

[Figure: data arranged in ascending order, divided by the 99 percentiles into 100 parts of 1% each.]

Fig. 1.61.3 Structure of data to find percentiles.

Its sample analogue is represented by P̂_i, i = 1, 2, ..., 99.


1.61.4 POPULATION MODE

It is the value which occurs most frequently in the population, and it is denoted by M_0. Its sample analogue is called the sample mode and is denoted by M̂_0. As an illustration, for the data set 60, 70, 30, 60, 30, 30, 80, 30, the mode is 30, because it occurs most frequently.

1.62 DEFINITION(S) OF STATISTICS

There are several definitions of statistics; we list a few of them as follows:

( a) It is a science to describe or predict the behaviour of a population based on a


random and representative sample drawn from the same population.

( b ) The science of statistics is the method of judging collective, natural, or social


phenomena from the results obtained from the analysis or enumeration or collection
of estimates.
( c ) The science which deals with the collection , analysis and interpretation of
numerical data.

1.63 LIMITATIONS OF STATISTICS

A few limitations of statistics are:


( a ) Statistics does not deal with individual measurements. This is the reason we
need police to investigate individuals;
( b ) Statistics deals only with quantitative characters or variables, and we have to
assign codes to qualitative variables before analysis ;
( c ) Statistics results are true only on average;
( d ) Statistics can be misused or misinterpreted. For example last year 90% of the
pedestrians who died in road accidents were walking on paths, so it is safer to walk
in the middle of the road.

1.64 LACK OF CONFIDENCE IN STATISTICS

A few people have the following types of views in their mind about statistics:
( a ) Statistics can prove anything;
( b ) There are three types of lies --- lies, damned lies, and statistics;
( c ) Statistics are like clay of which one can make a God or devil as he/she pleases;
( d ) It is only a tool, and cannot prove or disprove anything.

1.65 SCOPE OF STATISTICS

It has scope in almost every field into which our social setup divides this world, for example: Trade, Industry, Commerce, Economics, Biology, Botany, Astronomy, Physics, Chemistry, Education, Medicine, Sociology, Psychology, Religious studies, Meteorology, National defence, and Business: Production, Sale, Purchase, Finance, Accounting, Quality control, etc.

EXERCISES

Exercise 1.1. Define the terms population, parameter, sample, and statistic.

Exercise 1.2. Describe the advantage of a sample survey in comparison with a


census survey. Write the circumstances under which census surveys are preferred to sample surveys and vice versa.

Exercise 1.3. Describe the relationship between the variance and mean squared
error of an estimator. Hence deduc e the term relative efficiency.

Exercise 1.4. You are required to plan a sample survey to study the environment
activities of a business in the United States . Suggest a suitable survey plan on the
following points : ( a ) sampling units; ( b ) sampling frame ; ( c ) method of
sampling; and ( d ) method of collecting information. Prepare a suitable
questionnaire which may be used to collect the required information.

Exercise 1.5. Define population, sampling unit and sampling frame for conducting
surveys on each of the following subjects. Mention other possible sampling units,
if any, in each case and discuss their relative merits .
( a ) Housing conditions in the United States.
( b ) Study of incidence of lung cancer and heart attacks in the United States .
(c) Measurement of the volume of timber available in the forests of Canberra .
( d ) Study of the birth rate in India.
( e ) Study of nutrient contents of food consumed by the residents of California.
( f) Labour manpower of large businesses in Canada .
( g ) Estimation of population density in India.

Exercise 1.6. What do you understand by the following terms?


( a ) Unbiasedness; (b) Consistency; (c) Sufficiency; and (d) Efficiency.

Exercise 1.7. Define the following :


( a ) Sampling frame; (b) Sample survey design ; and ( c ) Nonresponse.

Exercise 1.8. Show that the sample variance

s²_y = (1/(n-1)) Σ_{i=1}^{n} (y_i - ȳ)²

can be put in different ways as

s²_y = (1/(n-1)) [Σ_{i=1}^{n} y_i² - n ȳ²] = (1/(n-1)) [Σ_{i=1}^{n} y_i² - (Σ_{i=1}^{n} y_i)²/n] = [n Σ_{i=1}^{n} y_i² - (Σ_{i=1}^{n} y_i)²] / (n(n-1)),

and the sample covariance

s_xy = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ)

can be written as

s_xy = (1/(n-1)) [Σ_{i=1}^{n} x_i y_i - n x̄ ȳ] = (1/(n-1)) [Σ_{i=1}^{n} x_i y_i - (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)/n] = [n Σ_{i=1}^{n} x_i y_i - (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)] / (n(n-1)).

Exercise 1.9. Show that the population mean square error

S²_y = (1/(N-1)) Σ_{i=1}^{N} (Y_i - Ȳ)²

can be put in different ways as

S²_y = (1/(N-1)) [Σ_{i=1}^{N} Y_i² - N Ȳ²] = (1/(N-1)) [Σ_{i=1}^{N} Y_i² - (Σ_{i=1}^{N} Y_i)²/N] = [N Σ_{i=1}^{N} Y_i² - (Σ_{i=1}^{N} Y_i)²] / (N(N-1)),

and the population covariance

S_xy = (1/(N-1)) Σ_{i=1}^{N} (X_i - X̄)(Y_i - Ȳ)

can be written as

S_xy = (1/(N-1)) [Σ_{i=1}^{N} X_i Y_i - N X̄ Ȳ] = (1/(N-1)) [Σ_{i=1}^{N} X_i Y_i - (Σ_{i=1}^{N} X_i)(Σ_{i=1}^{N} Y_i)/N] = [N Σ_{i=1}^{N} X_i Y_i - (Σ_{i=1}^{N} X_i)(Σ_{i=1}^{N} Y_i)] / (N(N-1)).

Exercise 1.10. Construct a sample space and tree diagram for each one of the
following situations:
( a) Toss a fair coin; ( b ) Toss two fair coins ; ( c ) Toss a fair die; ( d ) Toss a fair
coin and a fair die; (e) Toss two fair dice ; and (f) Toss a fair die and a fair coin .

Exercise 1.11. State what type of variable each of the following is. If a variable is
quantitative, say whether it is discrete or continuous; and if the variable is
qualitative say whether it is nominal or ordinal.

1 Religious preference.
2 Amount of water in a glass.
3 Master card number.
4 Number of students in a class of 32 who turn in assignments on time.
5 Brand of personal computer.
6 Amount of fluid dispensed by a machine used to fill cups with chocolate.
7 Number of graduate applications in statistics each year at the SCSU.
8 Amount of time required to drive a car for 35 miles.
9 Room temperature recorded every half hour.
10 Weight of letters to be mailed.
11 Taste of milk.
12 Occupation list.
13 Coded numbers to different colors, e.g., Red--1, Green--2, and Pink--3.
14 Average daily low temperature per year in the St. Cloud city.
15 Nationality of the students in your University.
16 Phone number.
17 Rent paid by the tenant.
18 Frog jump in cms.
19 Colors of marbles.

20 Number of mistakes in the examination.


21 Time to finish an examination.
22 Shoe number.
23 Gender of a student.
24 Discipline of a student.
25 Rating of a politician: good , better or best.
26 Sum of two real numbers.
27 Sum of two integers (or whole numbers).
28 Number of passengers in a bus.
29 Age of a patient.
30 Age groups, e.g., 0 to 5 years, 6 to 10 years, etc.
31 Area code .
32 Postal code .
33 Product of two pos itive real numbers .
34 Length of a string in cms.
35 Height of door in feet.
36 Weather conditions: good, better and best.
37 Average of real numbers .
38 Proportion of red balls in a bag .
39 Number of e-mail accounts.
40 Number of questions in an examination .

PRACTICAL PROBLEMS

Practical 1.1. From a population of size 5 how many samples of size 2 can be
drawn by using ( a ) SRSWR and ( b ) SRSWOR sampling?

Practical 1.2. Mr. Bean selects all possible samples of two units from a population consisting of four units, viz. 10, 15, 20, 25, by using SRSWOR sampling. He noted that the harmonic mean of this population is given by

H_Y = N / Σ_{i=1}^{N} (1/Y_i) = 4 / {1/10 + 1/15 + 1/20 + 1/25} = 15.58442.

The total number of possible samples = NC_n = 4C2 = 6, and these samples are given by

(10, 15), (10, 20), (10, 25), (15, 20), (15, 25) and (20, 25).

The harmonic means for these samples are, respectively,

Ĥ_1 = n / Σ_{i=1}^{n} (1/y_i) = 2 / {1/10 + 1/15} = 12,      Ĥ_2 = 2 / {1/10 + 1/20} = 13.33333,
Ĥ_3 = 2 / {1/10 + 1/25} = 14.28571,                          Ĥ_4 = 2 / {1/15 + 1/20} = 17.14286,
Ĥ_5 = 2 / {1/15 + 1/25} = 18.75,   and                       Ĥ_6 = 2 / {1/20 + 1/25} = 22.22222.

Mr. Bean took the harmonic mean of these six sample harmonic means, as follows:

HM = 6 / Σ_{t=1}^{6} (1/Ĥ_t)
   = 6 / {1/12 + 1/13.33333 + 1/14.28571 + 1/17.14286 + 1/18.75 + 1/22.22222}
   = 15.58442 = H_Y.

Then Mr. Bean made the following statements:

( a ) The sample harmonic mean is an unbiased estimator of the population harmonic mean.

( b ) The expected value of the sample harmonic mean Ĥ is defined as

E(Ĥ) = Σ_{t=1}^{s(n)} (1/s(n)) Ĥ_t = H_Y.

Do you agree with him? If not, why?

Hint: Expected value: E(Ĥ) = Σ_{t=1}^{s(n)} p_t Ĥ_t, with p_t = 1/s(n) = 1/6 for t = 1, 2, ..., s(n), where s(n) = NC_n = 6.

( c ) Find the bias, variance, and mean squared error of the estimator Ĥ.

( d ) Does the relation MSE(Ĥ) = V(Ĥ) + {B(Ĥ)}² hold?
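Mr. Bean's computation can be reproduced in a few lines; this sketch (helper names are ours) also shows why his conclusion is wrong: the harmonic mean of the six sample harmonic means does equal H_Y, but the expected value of the estimator is their ordinary average, which is larger:

```python
from itertools import combinations

Y = [10, 15, 20, 25]
hm = lambda vals: len(vals) / sum(1 / v for v in vals)   # harmonic mean

H_Y = hm(Y)                                       # population HM, ~15.58442
sample_hms = [hm(s) for s in combinations(Y, 2)]  # the six sample HMs

hm_of_hms = hm(sample_hms)                        # equals H_Y: the "trick"
expected = sum(sample_hms) / len(sample_hms)      # true E(H-hat), ~16.289

print(round(H_Y, 5), round(hm_of_hms, 5), round(expected, 5))
bias = expected - H_Y                             # ~0.705 > 0, so (a) fails
```

Since E(Ĥ) ≈ 16.289 ≠ 15.58442 = H_Y, the sample harmonic mean is a biased estimator, and statement (a) should be rejected.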

Practical 1.3. Suppose that a population consists of 5 units given by: 10, 15, 20, 25, and 30. Select all possible samples of 3 units using SRSWR and SRSWOR sampling.
( a ) Show that the sample mean is an unbiased estimator of the population mean in each case.
( b ) Show that the sample variance is an unbiased estimator of the population variance under SRSWR sampling, and of the population mean squared error under SRSWOR sampling.
( c ) Also plot the sampling distribution of the sample mean and sample variance in each situation.
( d ) Find the variance of the sample mean under SRSWOR sampling using the definition of variance. Show all steps.
( e ) Also compute ((N - n)/(Nn)) S²_y and comment on it.

Practical 1.4. Repeat Mr. Bean's exercise with the geometric mean (GM) and comment on the results.

Hint: The GM of n numbers y_1, y_2, ..., y_n is GM = (Π_{i=1}^{n} y_i)^{1/n}.


Practical 1.5. If a random variable x follows a Poisson distribution, that is, x ~ P(λ) with λ = 0.4 over N = 20 trials, select a with replacement sample of n = 5 units by using the method of the cumulative distribution function.

Hint: The p.d.f. of a Poisson random variable x is given by P[X = x] = e^{-λ} λ^x / x!.

Practical 1.6. Suppose an urn contains N balls of which Np are black and Nq are white, so that p + q = 1. The probability that if n balls are drawn (without replacement), exactly x of them will be black, is given by

P[X = x] = (Np C x)(Nq C (n-x)) / (N C n),

such that 0 ≤ x ≤ Np and 0 ≤ n - x ≤ Nq. Using the concept of the c.d.f., select a sample of three units by using without replacement sampling.

Hint: Hypergeometric distribution.

Practical 1.7. If a discrete random variable X has the cumulative distribution function:

         0     for x < 1,
         1/3   for 1 ≤ x < 4,
F(x) =   1/2   for 4 ≤ x < 6,
         5/6   for 6 ≤ x < 10,
         1     for x ≥ 10,

select a sample of n = 5 units by using with replacement sampling.
Hint: Use random number table method .

Practical 1.8. If the distribution function of a population consisting of N = 5 units is given by

F(x) = (x² + 5x)/50   for x = 1, 2, 3, 4, 5,

draw a without replacement sample of n = 2 units.

Hint: Use the random number table method.

Practical 1.9. If the density function of a continuous random variable x is

f(x) = 1 / (π[1 + (x - 100)²]),   -∞ < x < +∞,

use the first 6 columns multiplied by 10⁻⁶ as the values of the cumulative distribution function (c.d.f.) F(x) of the random variable x, and select a random sample of 15 units by using with replacement sampling.

Hint: x = 100 + tan[π(F(x) - 0.5)].
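The hint is the inverse-c.d.f. (inverse transform) method: feed uniform numbers in (0, 1) into the inverted c.d.f. A sketch (our own helper; `random.random()` stands in for the printed random-number table columns the practical refers to):

```python
import math
import random

random.seed(1)   # any seed; the practical uses a printed random-number table

def draw_cauchy(n, location=100.0):
    # Inverse-c.d.f. draws: u ~ U(0,1) plays the role of F(x), and
    # x = location + tan(pi * (u - 0.5)) inverts the Cauchy c.d.f.
    return [location + math.tan(math.pi * (random.random() - 0.5))
            for _ in range(n)]

sample = draw_cauchy(15)    # with-replacement sample of 15 units
print(len(sample))          # 15
```

The same recipe works for Practical 1.10 with the hint x = θ + zσ, inverting the normal c.d.f. instead.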

Practical 1.10. If the density function of a continuous random variable x in a population is given by

f(x) = (1/(σ√(2π))) exp{-(x - θ)²/(2σ²)},   -∞ < x < +∞,

with θ = 100 and σ = 2.5, use the first 6 columns multiplied by 10⁻⁶ as the values of the cumulative distribution function (c.d.f.) F(x) of the random variable x, and select a random sample of 15 units by using with replacement sampling.

Hint: x = θ + zσ, where z ~ N(0, 1).

Practical 1.11. Find the value of c so that

f(x, y) = { cx(x - y)   for 0 < x < 1, -x < y < +x,
          { 0           otherwise,

becomes a joint probability density function. Select a sample of n = 5 units by using the random number table method.

Hint: ∫_0^1 ∫_{-x}^{+x} f(x, y) dy dx = 1, or see Freund (2000).

Practical 1.12. In the hope of preventing ecological damage from oil spills, a
biochemical company is developing an enzyme to break up oil into less harmful
chemicals. The table below shows the time it took for the enzyme to break up oil
samples at different temperatures. The researcher plans to use these data in
statistical analysis:

( a ) If you are a consultant, which variable will you consider dependent and which independent? Denote your dependent variable by Y and your independent variable by X.
( b ) Assuming that these six observations form a population, compute the following parameters:

Ȳ, X̄, S²_y, S²_x, S_y, S_x, C_y = S_y/Ȳ, C_x = S_x/X̄, S_xy, ρ_xy = S_xy/(S_x S_y), β = S_xy/S²_x, and K = ρ_xy C_y/C_x.

Practical 1.13. Consider the following population consisting of 5 units A = 10,


B = 20, C = 25, D = 50, and E = 4.
( a ) Compute the population harmonic mean .
( b ) Select all possible SRSWOR samples each consisting of 3 units .
( c ) For each sample of 3 units, obtain estimate of the population harmonic mean .
( d ) How many sample harmonic means are less than population harmonic mean?
( e ) Find the bias in the sample harmonic mean.

( f) Find the variance of sample harmonic mean by the definition.


( g ) Find the MSE of the sample harmonic mean by the definition.
( h ) Does the relation MSE(Ĥ_s) = V(Ĥ_s) + {B(Ĥ_s)}² hold?
Practical 1.14 . Consider a population consisting of 15 countries as listed in the
following table, and also gives the hypothetical suicide rates in these countries per
100,000 persons .

Sr. No.    Country                 Suicide rate (%)


1 Australia 22
2 Austria 55
3 Canada 29
4 Denmark 59
5 France 35
6 Ireland 10
7 Israel 12
8 Japan 35
9 Netherlands 20
10 Norway 26
11 Poland 26
12 Sweden 39
13 Switzerland 55
14 United Kingdom 18
15 United States 25
1. Compute the following parameters :
( a ) Population mean;
( b ) Population range;
( c ) Population variance;
( d ) Population standard deviation ;
( e ) Population mean square error;
( f) Population coefficient of variation;
( g ) Proportion of countries having suicide rate more that 25%.

2. (a) Select an SRSWOR sample of 5 units using Random Number Table


method (Rule: Start from 1st row and 3rd column of the Pseudo-Random
Number Table 1 given in the Appendix).
( b ) Estimate the population mean and population total.
( c ) Compute the variance of the estimator of population mean.
( d ) Estimate S²_y.
( e ) Estimate the variance of the estimator of population mean.
( f ) Construct a 95% confidence interval for the population mean, assuming that the population mean square is known and the sample size is large. Does the population mean fall in it? Interpret it.

( g ) Construct a 95% confidence interval, assuming that the population mean square is unknown and the sample size is small. Does the population mean fall in it? Interpret it.
( h ) Find the variance of the estimator of the proportion of countries having a suicide rate more than 25%.

Practical 1.15. Consider a popul ation cons isting of the follow ing six units:

Now con sider the follo wing sampling plan :


Sample No. Samples Prob ability
PI
1 A, C,E 1/9
2 A, C. F 1/9
3 A,D, E 1/9
4 A,D, F 1/9
5 B,C,E 1/9
6 B, C,F 1/9
7 B,D,E 1/9
8 B,D,F 1/9
9 C, D, F 1/9

Compute the following:


( a ) E(ȳ_t), V(ȳ_t), B(ȳ_t), and MSE(ȳ_t).
( b ) Does the relation MSE(ȳ_t) = V(ȳ_t) + {B(ȳ_t)}² hold?

Practical 1.16. For a bivariate data set of n = 10 pairs of observations we are given

Σ_{i=1}^{n} x_i = 57,   Σ_{i=1}^{n} y_i = 263,   and   Σ_{i=1}^{n} x_i y_i = 299.

Assuming that these 10 observations form a sample, compute the following statistics:

ȳ, x̄, s²_y, s²_x, s_y, s_x, C_y = s_y/ȳ, C_x = s_x/x̄, s_xy, r_xy = s_xy/(s_x s_y), and b = s_xy/s²_x.

Practical 1.17. The following data show the daily temperatures in New York over a period of two weeks:

Find the following: sample size; sample mean; median; mode; first quartile ; second
quartile; third quartile; minimum value; maximum value; and interquartile range .

Practical 1.18. Construct scatter diagrams and find the linear correlation coefficient for each one of the following five samples, each of five units, and comment on the different situations that arise:

Practical 1.19. The following balloon is filled with five gases, each with a different atomic number and atomic weight.

( I ) Sampling distribution of atomic weight of gases:

( a ) Find the average atomic weight of all the gases in the balloon;
( b ) Find the population variance σ² of the atomic weight of all the gases in the balloon;
( c ) Select all possible with replacement samples each consisting of two gases;
( d ) Estimate the average atomic weight from each one of the 25 samples;
( e ) Construct a frequency distribution table of all possible sample means;
( f ) Construct a histogram. Is it symmetric?;
( g ) Find the expected value of all sample means of atomic weights from the
frequency distribution table you developed ;
( h ) Find the variance of all the sample means of atomic weights from the
frequency distribution table you developed.

( II ) Sampling distribution of proportion of inert gases in the balloon:

( a ) Find the proportion of inert gases in the balloon, and denote it by P ;


( b ) Select all possible with replacement samples each consisting of three gases;
( c ) Estimate the proportion of inert gases in each sample;
( d ) Construct a frequency distribution table of all possible sample proportions of inert gases;
( e ) Construct a histogram. Is it symmetric?
( f) Find the expected value of all sample proportions of inert gases from the
frequency distribution table you developed;
( g ) Find the variance of all the sample proportions of inert gases from the
frequency distribution table you developed.

Practical 1.20. Consider a sample y_1, y_2, ..., y_n, and let ȳ_k and s²_k denote the sample mean and variance, respectively, of the first k observations.
( a ) Show that

s²_{k+1} = ((k - 1)/k) s²_k + (1/(k + 1)) (y_{k+1} - ȳ_k)².

( b ) Suppose that a sample of 15 observations has sample mean 12.60 and sample standard deviation 0.50. If the 16th observation of the data set is 10.2, what will be the values of the sample mean and sample standard deviation for all 16 observations?
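The updating formula of part (a) can be checked numerically, and then applied directly to the numbers of part (b); this is a sketch (the test data in `check` are our own):

```python
# Part (a): numeric check of the updating formula on arbitrary data.
def check(data):
    for k in range(2, len(data)):
        yk = sum(data[:k]) / k
        s2k = sum((v - yk) ** 2 for v in data[:k]) / (k - 1)
        yk1 = sum(data[:k + 1]) / (k + 1)
        direct = sum((v - yk1) ** 2 for v in data[:k + 1]) / k
        update = ((k - 1) / k) * s2k + (data[k] - yk) ** 2 / (k + 1)
        assert abs(direct - update) < 1e-9       # the two agree
    return True

check([3.1, 4.7, 2.2, 5.9, 4.4, 3.8])            # passes silently

# Part (b): k = 15 observations, mean 12.60, sd 0.50; 16th value 10.2.
k, ybar, sd, new = 15, 12.60, 0.50, 10.2
ybar16 = (k * ybar + new) / (k + 1)              # updated sample mean
s2_16 = ((k - 1) / k) * sd ** 2 + (new - ybar) ** 2 / (k + 1)
print(round(ybar16, 2), round(s2_16 ** 0.5, 3))  # 12.45 0.77
```

So the 16-observation mean is 12.45 and the standard deviation grows to about 0.77, reflecting the outlying 16th value.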
2. SIMPLE RANDOM SAMPLING

2.0 INTRODUCTION

Simple Random Sampling (SRS) is the simplest and most common method of selecting a sample, in which the sample is selected unit by unit, with equal probability of selection for each unit at each draw. In other words, simple random sampling is a method of selecting a sample s of n units from a population Ω of size N by giving an equal probability of selection to all units. It is a sampling scheme in which all possible combinations of n units may be formed from the population of N units with the same chance of selection.
As discussed in Chapter 1:
( a ) If a unit is selected, observed, and replaced in the population before the next draw is made and the procedure is repeated n times, it gives rise to a simple random sample of n units. This procedure is known as simple random sampling with replacement and is denoted by SRSWR.
( b ) If a unit is selected, observed, and not replaced in the population before making the next draw, and the procedure is repeated until n distinct units are selected, ignoring all repetitions, it is called simple random sampling without replacement and is denoted by SRSWOR. Let us discuss the properties of the estimators of the population mean, variance, and proportion in each of these cases.

2.1 SIMPLE RANDOM SAMPLING WITH REPLACEMENT

Suppose we select a sample of n ≥ 2 units from the population of size N by using SRSWR sampling. Let y_i, i = 1, 2, ..., n, denote the value of the i-th unit selected in the sample and Y_i, i = 1, 2, ..., N, be the value of the i-th unit in the population. Then we have the following theorems:

Theorem 2.1.1. The sample mean ȳ_n = n⁻¹ Σ_{i=1}^{n} y_i is an unbiased estimator of the population mean Ȳ = N⁻¹ Σ_{i=1}^{N} Y_i.

Proof. We have to prove that E(ȳ_n) = Ȳ. Now we have

E(ȳ_n) = E[(1/n) Σ_{i=1}^{n} y_i] = (1/n) Σ_{i=1}^{n} E(y_i).        (2.1.1)

Now y_i is a random variable, and each unit has been selected by SRSWR sampling; therefore y_i can take the values Y_1, Y_2, ..., Y_N with probabilities 1/N, 1/N, ..., 1/N. By the definition of the expected value we have
S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003
E(y_i) = (1/N) Σ_{j=1}^{N} Y_j = Ȳ.

Thus (2.1.1) implies

E(ȳ_n) = (1/n) Σ_{i=1}^{n} [(1/N) Σ_{j=1}^{N} Y_j] = (1/n) Σ_{i=1}^{n} Ȳ = Ȳ.        (2.1.2)

Hence the theorem.

Corollary 2.1.1. The estimator Ŷ_n = N ȳ_n is an unbiased estimator of the population total Y.

Proof. We have

E(Ŷ_n) = E[N ȳ_n] = N E(ȳ_n) = N Ȳ = Y,        (2.1.3)

which proves the corollary.

Theorem 2.1.2. The variance of the estimator ȳ_n of the population mean Ȳ is

V(ȳ_n) = n⁻¹ σ²_y,  where  σ²_y = N⁻¹ Σ_{i=1}^{N} (Y_i - Ȳ)² = N⁻¹ [Σ_{i=1}^{N} Y_i² - N Ȳ²].        (2.1.4)

Proof. Because of the independence of draws, we have

V(ȳ_n) = V((1/n) Σ_{i=1}^{n} y_i) = (1/n²) Σ_{i=1}^{n} V(y_i).        (2.1.5)

By the definition of variance we have

V(y_i) = E[y_i - E(y_i)]² = E(y_i²) - {E(y_i)}² = (1/N) Σ_{i=1}^{N} Y_i² - Ȳ²
       = (1/N) [Σ_{i=1}^{N} Y_i² - N Ȳ²] = (1/N) Σ_{i=1}^{N} (Y_i - Ȳ)² = σ²_y.

Using (2.1.5) we have V(ȳ_n) = σ²_y/n. Hence the theorem.

Theorem 2.1.3. An unbiased estimator of the variance V(ȳ_n) is given by

v̂(ȳ_n) = s²_y/n,        (2.1.6)

where

s²_y = (1/(n-1)) Σ_{i=1}^{n} (y_i - ȳ_n)² = (1/(n-1)) [Σ_{i=1}^{n} y_i² - n ȳ_n²].

Proof. We have to show that E[v̂(ȳ_n)] = V(ȳ_n). Now we have

E[v̂(ȳ_n)] = E[s²_y/n] = (1/n) E(s²_y).        (2.1.7)

Note that

E(ȳ_n²) = V(ȳ_n) + Ȳ² = σ²_y/n + Ȳ²

and

E(s²_y) = E[(1/(n-1)) (Σ_{i=1}^{n} y_i² - n ȳ_n²)] = (1/(n-1)) [Σ_{i=1}^{n} E(y_i²) - n E(ȳ_n²)]
        = (n/(n-1)) [(1/n) Σ_{i=1}^{n} E(y_i²) - E(ȳ_n²)] = (n/(n-1)) [(1/N) Σ_{i=1}^{N} Y_i² - (σ²_y/n + Ȳ²)]
        = (n/(n-1)) [(1/N) Σ_{i=1}^{N} Y_i² - Ȳ² - σ²_y/n] = (n/(n-1)) [σ²_y - σ²_y/n]
        = σ²_y.

From (2.1.7) we have

E[v̂(ȳ_n)] = (1/n) E(s²_y) = σ²_y/n = V(ȳ_n).

Hence the theorem.
Corollary 2.1.2. The variance of the estimator Ŷ_n = N ȳ_n of the population total is V(Ŷ_n) = N² V(ȳ_n).
Theorem 2.1.4. Under SRSWR sampling, while estimating the population mean (or total), the minimum sample size with relative standard error (RSE) no larger than φ is given by

n ≥ σ²_y/(φ² Ȳ²).        (2.1.8)

Proof. The relative standard error of the estimator ȳ_n is given by

RSE(ȳ_n) = √{V(ȳ_n)}/E(ȳ_n) = √{σ²_y/(n Ȳ²)}.        (2.1.9)

We need an estimator ȳ_n such that RSE(ȳ_n) ≤ φ, which implies that

σ²_y/(n Ȳ²) ≤ φ²,  or  n ≥ σ²_y/(φ² Ȳ²).

Hence the theorem.

Remark 2.1: If φ = e/(z_{α/2} Ȳ), then P[|ȳ_n - Ȳ| ≤ e] = 1 - α.
Example 2.1.1. In 1995, a fisherman selected an SRSWR sample of six kinds of fish out of the 69 kinds of fish available at the Atlantic and Gulf Coasts, as given below:

Kind of fish   Saltwater   White   Blue     Scup   Summer     Scup
               catfish     perch   runner          flounder
No. of fish      13859      3489    2319    3688    16238     3688

( a ) Estimate the average number of fish in each species group.
( b ) Construct a 95% confidence interval for the average number of fish in each species group.
( c ) Estimate the total number of fish at the Atlantic and Gulf Coasts during 1995.
( d ) Construct a 95% confidence interval for the total number of fish at the Atlantic and Gulf Coasts during 1995.

Solution. We are given N = 69 and n = 6. From the sample information we have

Sr. No.     y_i         y_i²
   1       13859    192071881
   2        3489     12173121
   3        2319      5377761
   4        3688     13601344
   5       16238    263672644
   6        3688     13601344
 Sum       43281    500498095

( a ) Thus the average number of fish in each species group is given by

ȳ_n = (1/n) Σ_{i=1}^{n} y_i = (1/6) Σ_{i=1}^{6} y_i = 43281/6 = 7213.5.

( b ) A (1 - α)100% confidence interval for the population mean Ȳ is given by

ȳ_n ± t_{α/2}(df = n - 1) √{v̂(ȳ_n)},

where v̂(ȳ_n) = s²_y/n. Now we have

s²_y = (1/(n-1)) [Σ_{i=1}^{n} y_i² - (Σ_{i=1}^{n} y_i)²/n] = (1/(6-1)) [500498095 - (43281)²/6] = 37658120.3.

Thus

v̂(ȳ_n) = s²_y/n = 37658120.3/6 = 6276353.38.

Using Table 2 from the Appendix, the 95% confidence interval for the average number of fish is given by

ȳ_n ± t_{0.05/2}(df = 6 - 1) √{v̂(ȳ_n)},  or  7213.5 ± 2.571 √6276353.38,  or  [772.46, 13654.53].

( c ) An estimate of the total number of fish is given by

Ŷ = N ȳ_n = 69 × 7213.5 = 497731.5.

( d ) The 95% confidence interval for the total number of fish is given by

N × [772.46, 13654.53],  or  69 × [772.46, 13654.53],  or  [53299.7, 942162.5].
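The whole of Example 2.1.1 can be reproduced in a short script; this is a sketch (variable names are ours; the t value 2.571 is taken from the text's Table 2):

```python
y = [13859, 3489, 2319, 3688, 16238, 3688]   # sampled fish counts
n, N = len(y), 69

ybar = sum(y) / n                                  # 7213.5
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)     # ~37658120.3
v_ybar = s2 / n                                    # ~6276353.38
half = 2.571 * v_ybar ** 0.5                       # t(df = 5) margin of error

ci_mean = (ybar - half, ybar + half)               # ~(772.5, 13654.5)
total = N * ybar                                   # 497731.5
ci_total = (N * ci_mean[0], N * ci_mean[1])        # ~(53300, 942162)
print(round(ybar, 1), round(total, 1))             # 7213.5 497731.5
```

Multiplying the mean and its interval endpoints by N gives the total and its interval, exactly as in parts (c) and (d).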

Example 2.1.2. We wish to estimate the average number of fish in each one of the
species groups caught by marine recreational fishermen at the Atlantic and Gulf
coasts. There are 69 species groups caught during 1995 as shown in the population
4 in the Appendix. What is the minimum number of species groups to be selected
by SRSWR sampling to attain the accuracy of relative standard error 30%?
Given: S_y² = 37199578 and Y = 311528.

Solution. We are given N = 69, S_y² = 37199578, and Y = 311528; thus
Ȳ = Y/N = 311528/69 = 4514.898
and
σ_y² = {(N-1)/N} S_y² = {(69-1)/69} × 37199578 = 36660453.68.
For φ = 0.30, whether we are estimating the population total or the population mean, the minimum sample size for the required degree of precision is given by
n ≥ σ_y²/(φ² Ȳ²) = 36660453.68/(0.3² × (4514.898)²) = 19.98 ≈ 20.

Thus a sample of size n = 20 units is required to attain 30% relative standard error
of the estimator of population mean under SRSWR sampling.
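As a hedged numerical check of the sample-size formula, using the values from the solution above:

```python
import math

# Example 2.1.2: minimum SRSWR sample size for a 30% relative standard error
sigma2 = 36660453.68        # population variance sigma_y^2
Y_bar = 4514.898            # population mean
phi = 0.30                  # target relative standard error

n_req = sigma2 / (phi ** 2 * Y_bar ** 2)   # n >= sigma_y^2 / (phi^2 * Ybar^2)
n_min = math.ceil(n_req)                   # round up to the next whole unit
```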

Example 2.1.3. Select an SRSWR sample of twenty units from population 4 given in the Appendix. Collect the information on the number of fish during 1995 in each of the species groups selected in the sample. Estimate the average number of fish in each one of the species groups caught by marine recreational fishermen at the Atlantic and Gulf coasts during 1995. Construct the 95% confidence interval for the average number of fish in each species group available in the United States.

Solution. The population size is N = 69, thus we used the first two columns of the
Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 20 random
numbers between 1 and 69. The random numbers so selected are 58, 60, 54, 01, 69, 62, 23, 64, 46, 04, 32, 47, 57, 56, 57, 60, 33, 05, 22, and 38.

Random No.    Species group              y_i      (y_i - ȳ_n)    (y_i - ȳ_n)²
01            Sharks, other              2016     -3977.25       15818517.560
04            Eels                       152      -5841.25       34120201.560
05            Herrings                   30027    24033.75       577621139.100
22            Crevalle jack              3951     -2042.25       4170785.063
23            Blue runner                2319     -3674.25       13500113.060
32            Yellowtail snapper         1334     -4659.25       21708610.560
33            Snappers, other            492      -5501.25       30263751.560
38            Pinfish                    16855    10861.75       117977613.100
46            Spot                       11567    5573.75        31066689.060
47            Kingfish                   4333     -1660.25       2756430.063
54            Tautog                     3816     -2177.25       4740417.563
56            Wrasses, other             185      -5808.25       33735768.060
57            Little tunny/Atl bonito    782      -5211.25       27157126.560
57            Little tunny/Atl bonito    782      -5211.25       27157126.560
58            Atlantic mackerel          4008     -1985.25       3941217.563
60            Spanish mackerel           2568     -3425.25       11732337.560
60            Spanish mackerel           2568     -3425.25       11732337.560
62            Summer flounder            16238    10244.75       104954902.600
64            Southern flounder          1446     -4547.25       20677482.560
69            Other fish                 14426    8432.75        71111272.560
Sum                                      119865   0.00           1165943840.000
An estimate of the average number of fish in each species group during 1995 is:
ȳ_n = (1/n) Σ y_i = 119865/20 = 5993.25.
Now
s_y² = (1/(n-1)) Σ (y_i - ȳ_n)² = 1165943840/(20-1) = 61365465.26,
and the estimate of the variance of the estimator ȳ_n is given by
v(ȳ_n) = s_y²/n = 61365465.26/20 = 3068273.26.
A (1 - α)100% confidence interval for the average number of fish in each one of the species groups caught during 1995 by marine recreational fishermen in the United States is
ȳ_n ± t_{α/2}(df = n - 1)√v(ȳ_n).
Using Table 2 from the Appendix, the 95% confidence interval is given by
ȳ_n ± t_{0.025}(df = 20 - 1)√v(ȳ_n), or 5993.25 ± 2.093√3068273.26, or [2327.05, 9659.45].
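The same estimate and interval can be sketched from the twenty tabulated values (an illustrative check, with the t-value 2.093 taken from Table 2):

```python
import math

# Example 2.1.3: SRSWR sample of 20 species groups (values from the table above)
y = [2016, 152, 30027, 3951, 2319, 1334, 492, 16855, 11567, 4333,
     3816, 185, 782, 782, 4008, 2568, 2568, 16238, 1446, 14426]
n = len(y)

y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
v_ybar = s2 / n                            # no fpc under with-replacement sampling
t = 2.093                                  # t_{0.025}(df = 19), from Table 2
half = t * math.sqrt(v_ybar)
ci = (y_bar - half, y_bar + half)
```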

Example 2.1.4. The depth y of the roots of plants in a field is uniformly distributed between 5 cm and 8 cm with the probability density function
f(y) = 1/3, ∀ 5 < y < 8.

We wish to estimate the average length of the roots of the plants with an accuracy of relative standard error of 5%. What is the required minimum with-replacement sample size n?

Solution. We know that if y has the uniform distribution function
f(y) = 1/(b - a), ∀ a < y < b,
then the population mean is
Ȳ = (a + b)/2 = (5 + 8)/2 = 6.5,
and the population variance is
σ_y² = (b - a)²/12 = (8 - 5)²/12 = 9/12 = 0.75.
We need φ = 0.05; thus the required minimum sample size is given by
n ≥ σ_y²/(φ² Ȳ²) = 0.75/(0.05² × 6.5²) = 7.1 ≈ 7.
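A minimal sketch of the same calculation, noting that the text rounds 7.1 down to n = 7:

```python
# Example 2.1.4: y ~ Uniform(5, 8); minimum SRSWR sample size for RSE = 5%
a, b = 5.0, 8.0
mean = (a + b) / 2            # (a + b)/2 = 6.5
var = (b - a) ** 2 / 12       # (b - a)^2 / 12 = 0.75
phi = 0.05

n_req = var / (phi ** 2 * mean ** 2)   # 7.1, which the text rounds to n = 7
n_min = round(n_req)
```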

Example 2.1.5. The depth y of the roots of plants in a field is uniformly distributed between 5 cm and 8 cm with the probability density function
f(y) = 1/3, ∀ 5 < y < 8.
Select a with-replacement sample of n = 7 units. Construct a 95% confidence interval to estimate the average depth of roots.

Solution. The cumulative distribution function (c.d.f.) is given by
F(y) = P[Y ≤ y] = ∫₅^y f(t) dt = ∫₅^y (1/3) dt = (y - 5)/3,
which implies that y = 3F(y) + 5. We used the first three columns, multiplied by 10⁻³, of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select seven values of F(y), and the required sampled values are computed by using the relationship y = 3F(y) + 5 as follows:

F(y)     y        y²
0.992    7.976    63.61658
0.588    6.764    45.75170
0.601    6.803    46.28081
0.549    6.647    44.18261
0.925    7.775    60.45063
0.014    5.042    25.42176
0.697    7.091    50.28228
Sum      48.098   335.98640

Thus an estimate of the average depth of the roots is given by
ȳ_n = (1/n) Σ y_i = 48.098/7 = 6.8711.
To find the sample variance s_y² we apply here the alternative method:
s_y² = (1/(n-1))[Σ y_i² - (Σ y_i)²/n] = (1/(7-1))[335.9864 - (48.098)²/7] = 0.9164,

and the estimate of the variance of the estimator ȳ_n is given by
v(ȳ_n) = s_y²/n = 0.9164/7 = 0.1309.

A (1 - α)100% confidence interval for the average depth of roots in the field is
ȳ_n ± t_{α/2}(df = n - 1)√v(ȳ_n).
Using Table 2 from the Appendix, the 95% confidence interval estimate of the average depth of the roots is given by
ȳ_n ± t_{0.025}(df = 7 - 1)√v(ȳ_n), or 6.8711 ± 2.447√0.1309, or [5.9857, 7.756].
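The inverse-CDF sampling step and the interval above can be sketched as follows (illustrative only; the t-value 2.447 is from Table 2):

```python
import math

# Example 2.1.5: inverse-CDF sampling from Uniform(5, 8) via y = 3 F(y) + 5
F = [0.992, 0.588, 0.601, 0.549, 0.925, 0.014, 0.697]  # PRN-table values / 1000
y = [3 * u + 5 for u in F]                             # inverse CDF transform
n = len(y)

y_bar = sum(y) / n
s2 = (sum(v * v for v in y) - n * y_bar ** 2) / (n - 1)   # computational form
v_ybar = s2 / n
t = 2.447                                                 # t_{0.025}(df = 6)
half = t * math.sqrt(v_ybar)
ci = (y_bar - half, y_bar + half)
```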

Theorem 2.1.5. The covariance between two sample means ȳ_n and x̄_n under SRSWR sampling is:
Cov(ȳ_n, x̄_n) = σ_xy/n,    (2.1.10)
where
σ_xy = (1/N) Σ_{i=1}^{N} (Y_i - Ȳ)(X_i - X̄).
i= l

Proof. We have
Cov(ȳ_n, x̄_n) = Cov((1/n) Σ_{i=1}^{n} y_i, (1/n) Σ_{i=1}^{n} x_i) = (1/n²) Σ_{i=1}^{n} Cov(y_i, x_i).    (2.1.11)
Now
Cov(y_i, x_i) = E(y_i x_i) - E(y_i)E(x_i).    (2.1.12)
The random variables (y_i x_i), y_i, and x_i can take any one of the values (Y_i X_i), Y_i, and X_i for i = 1, 2, ..., N, each with probability 1/N. Thus we have
Cov(y_i, x_i) = (1/N) Σ Y_i X_i - {(1/N) Σ Y_i}{(1/N) Σ X_i} = (1/N) Σ Y_i X_i - Ȳ X̄ = (1/N) Σ_{i=1}^{N} (Y_i - Ȳ)(X_i - X̄) = σ_xy.

On substituting this in (2.1.11) we obtain
Cov(ȳ_n, x̄_n) = σ_xy/n.
Hence the theorem.

Theorem 2.1.6. An unbiased estimator of Cov(ȳ_n, x̄_n) is given by
ĉov(ȳ_n, x̄_n) = s_xy/n,    (2.1.13)
where
s_xy = (1/(n-1)) Σ_{i=1}^{n} (y_i - ȳ_n)(x_i - x̄_n).
Proof. We have to prove that E(s_xy) = σ_xy. Now we have
E(s_xy) = E[(1/(n-1)) Σ (y_i - ȳ_n)(x_i - x̄_n)] = (1/(n-1)) E[Σ y_i x_i - n ȳ_n x̄_n]
= (1/(n-1)){Σ E(y_i x_i) - n E(ȳ_n x̄_n)}
= (1/(n-1)){Σ_{i=1}^{n} (1/N) Σ_{i=1}^{N} Y_i X_i - n(Cov(ȳ_n, x̄_n) + Ȳ X̄)}
= (1/(n-1)){(n/N) Σ Y_i X_i - n(σ_xy/n + Ȳ X̄)}
= (n/(n-1)){(1/N) Σ Y_i X_i - Ȳ X̄ - σ_xy/n}
= (n/(n-1)){σ_xy - σ_xy/n} = σ_xy.
Hence the theorem.

Suppose we select a sample of n ≥ 2 units from a population of size N by using SRSWOR sampling. Let y_i, i = 1, 2, ..., n, denote the value of the i-th unit selected in the sample and Y_i, i = 1, 2, ..., N, be the value of the i-th unit in the population. Then we have the following theorems:

Theorem 2.2.1. The sample mean ȳ_n = (1/n) Σ_{i=1}^{n} y_i is an unbiased estimator of the population mean Ȳ = (1/N) Σ_{i=1}^{N} Y_i.
Proof. We have to show that E(ȳ_n) = Ȳ. It is interesting to note that this result can be proved by using three different methods as shown below.

Method I. Note that the estimator ȳ_n = (1/n) Σ_{i=1}^{n} y_i can easily be written as
ȳ_n = (1/n) Σ_{i=1}^{N} t_i Y_i,    (2.2.1)
where t_i is a random variable defined as:
t_i = 1 if the i-th unit of the population is in the sample (i.e., if i ∈ s), and t_i = 0 otherwise.

Note that Y_i is a fixed value in the population for the i-th unit; therefore, the expected value of (2.2.1) is given by
E(ȳ_n) = E[(1/n) Σ_{i=1}^{N} t_i Y_i] = (1/n) Σ_{i=1}^{N} Y_i E(t_i).    (2.2.2)
Note that (N-1)C(n-1) is the number of samples in which a given population unit can occur out of all NC_n SRSWOR samples, and therefore the probability that the i-th population unit is selected in the sample is (N-1)C(n-1)/NC_n = n/N. So the random variable t_i takes the value 1 with probability n/N and 0 with probability (1 - n/N). Thus the expected value of t_i is
E(t_i) = (n/N) × 1 + (1 - n/N) × 0 = n/N.    (2.2.3)
From (2.2.2) we have
E(ȳ_n) = (1/n) Σ_{i=1}^{N} E(t_i) Y_i = (1/n) Σ_{i=1}^{N} (n/N) Y_i = (1/N) Σ_{i=1}^{N} Y_i = Ȳ.
Hence the theorem.

Method II. We can also prove the same result as follows:
E(ȳ_n) = E[(1/n) Σ_{i=1}^{n} y_i] = (1/n) Σ_{i=1}^{n} E(y_i).    (2.2.4)
In (2.2.4) the sample value y_i is a random variable and can take any population value Y_j, j = 1, 2, ..., N, with probability 1/N. Thus we have
E(y_i) = (1/N) Σ_{j=1}^{N} Y_j = Ȳ.
From (2.2.4) we have
E(ȳ_n) = (1/n) Σ_{i=1}^{n} E(y_i) = (1/n) Σ_{i=1}^{n} Ȳ = Ȳ.
Hence the theorem.

Method III. To prove the above result by another method, let us consider (ȳ_n)_l = the sample mean ȳ_l based on the l-th sample selected from the population. Note that there are NC_n possible samples, and the probability of selecting the l-th sample is
P_l = 1/(NC_n).
By the definition of expected value, we have
E(ȳ_n) = Σ_{l=1}^{NC_n} P_l (ȳ_n)_l = (1/(n NC_n)) Σ_{l=1}^{NC_n} Σ_{i=1}^{n} (y_i)_l = (1/(n NC_n)) Σ_{i=1}^{N} {(N-1)C(n-1)} Y_i = (1/N) Σ_{i=1}^{N} Y_i = Ȳ.
Hence the theorem.

Example 2.2.1. For a population of N = 4 units with samples of size n = 2, verify the identity Σ_l Σ_i (y_i)_l = Σ_{i=1}^{N} {(N-1)C(n-1)} Y_i used in Method III.
Solution. Let us consider a population consisting of four units A, B, C, and D. The number of possible samples without replacement of size two is NC_n = 4C2 = 6. Let Y_1, Y_2, Y_3, and Y_4 be the values of the four units A, B, C, and D, respectively, in the population. Let y_i denote the value of the i-th unit selected in the sample. Then we have the following situation:

Sample no.          1           2           3           4           5           6
Sampled units       (A, B)      (A, C)      (A, D)      (B, C)      (B, D)      (C, D)
Population units    (Y_1,Y_2)   (Y_1,Y_3)   (Y_1,Y_4)   (Y_2,Y_3)   (Y_2,Y_4)   (Y_3,Y_4)

The values of the units in the sample in all these cases are y_1 and y_2. Thus we have
Σ_{l=1}^{NC_n} Σ_{i=1}^{n} (y_i)_l = Σ_{l=1}^{6} (y_1 + y_2)_l
= (Y_1 + Y_2) + (Y_1 + Y_3) + (Y_1 + Y_4) + (Y_2 + Y_3) + (Y_2 + Y_4) + (Y_3 + Y_4)
= 3Y_1 + 3Y_2 + 3Y_3 + 3Y_4 = 3C1·Y_1 + 3C1·Y_2 + 3C1·Y_3 + 3C1·Y_4
= (4-1)C(2-1)·Y_1 + (4-1)C(2-1)·Y_2 + (4-1)C(2-1)·Y_3 + (4-1)C(2-1)·Y_4
= (N-1)C(n-1)·Y_1 + (N-1)C(n-1)·Y_2 + (N-1)C(n-1)·Y_3 + (N-1)C(n-1)·Y_4
= Σ_{i=1}^{N} {(N-1)C(n-1)} Y_i.
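The enumeration argument above can be checked numerically; the population values below are hypothetical, chosen only for illustration:

```python
from itertools import combinations

Y = [3.0, 7.0, 11.0, 19.0]               # hypothetical values Y1..Y4 of units A, B, C, D
N, n = len(Y), 2

samples = list(combinations(Y, n))       # all NCn = 4C2 = 6 SRSWOR samples
sample_means = [sum(s) / n for s in samples]

# Each sample is equally likely, so E(ybar) is the average of the 6 sample means
E_ybar = sum(sample_means) / len(samples)
pop_mean = sum(Y) / N
```

Because every unit occurs in exactly (N-1)C(n-1) = 3 of the 6 samples, the average of the sample means reproduces the population mean exactly.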

Theorem 2.2.2. The probability for any population unit to be selected in the sample at any particular draw is equal to the inverse of the population size, that is,
Probability of selecting the i-th unit at a given draw = 1/N.    (2.2.5)

Proof. Suppose that at the r-th draw the i-th population unit Y_i is selected. This is possible only if this unit has not been selected in the previous (r-1) draws. Let us now consider the draws one by one.
First draw: The probability for the particular unit Y_i to be selected on the first draw out of N units is 1/N. Note that the probability that Y_i is not selected on the first draw, from a population of N units, is 1 - 1/N = (N-1)/N.

Second draw: The probability that a particular unit is selected on the second draw (if it has not already been selected on the first draw) is the product of two probabilities, namely
(Probability that Y_i is not selected on the first draw) × (Probability that Y_i is selected on the second draw).
Therefore the probability that Y_i is selected on the second draw is equal to
{(N-1)/N} × {1/(N-1)} = 1/N.
Note that the probability that Y_i is not selected on the second draw, out of the remaining (N-1) population units, is equal to
1 - 1/(N-1) = (N-2)/(N-1).

Third draw: The probability that Y_i is selected on the third draw (if it has not been selected on the first or second draw) is the product of three probabilities:
(Probability that Y_i is not selected on the first draw) × (Probability that Y_i is not selected on the second draw) × (Probability that Y_i is selected on the third draw)
= {(N-1)/N} × {(N-2)/(N-1)} × {1/(N-2)} = 1/N.
Note that the probability that Y_i is not selected on the third draw, out of the remaining (N-2) population units, is equal to
1 - 1/(N-2) = (N-3)/(N-2).
This procedure continues up to (r-1) draws.

r-th draw: The probability that Y_i is not selected up to the (r-1)-th draw is given by
{(N-1)/N} × {(N-2)/(N-1)} × ... × {(N-(r-1))/(N-(r-2))} = (N-r+1)/N.
The probability that Y_i is selected at the r-th draw (assuming that it has not been selected at any of the previous (r-1) draws) is equal to
1/{N-(r-1)} = 1/(N-r+1).
So the probability of a particular unit Y_i being selected at the r-th draw is
{(N-r+1)/N} × {1/(N-r+1)} = 1/N.
Hence the theorem.

Theorem 2.2.3. The variance of the estimator ȳ_n is given by
V(ȳ_n) = {(1-f)/n} S_y²,    (2.2.6)
where S_y² = (1/(N-1)) Σ_{i=1}^{N} (Y_i - Ȳ)² and f = n/N is the sampling fraction, so that (1 - f) is the finite population correction (f.p.c.).

Proof. We have
V(ȳ_n) = V((1/n) Σ_{i=1}^{n} y_i) = V((1/n) Σ_{i=1}^{N} t_i Y_i),    (2.2.7)
where t_i is a random variable that takes the value 1 if the i-th unit is included in the sample, and otherwise it takes the value 0. Note that Y_i is fixed for the i-th unit in the population, so we have
V(ȳ_n) = (1/n²) V(Σ_{i=1}^{N} t_i Y_i) = (1/n²)[Σ_{i=1}^{N} V(t_i Y_i) + Σ_{i≠j=1}^{N} Cov(t_i Y_i, t_j Y_j)]
= (1/n²)[Σ_{i=1}^{N} Y_i² V(t_i) + Σ_{i≠j=1}^{N} Y_i Y_j Cov(t_i, t_j)].    (2.2.8)

In (2.2.8) we need to determine V(t_i) and Cov(t_i, t_j). Note that t_i takes the value 1 with probability n/N and the value 0 with probability 1 - (n/N), and t_i² has the same distribution. We have
V(t_i) = E[t_i - E(t_i)]² = E(t_i²) - {E(t_i)}² = (n/N) - (n/N)² = (n/N)(1 - n/N) = n(N-n)/N².    (2.2.9)

The probability that both the i-th and j-th units are included in the sample is (N-2)C(n-2)/NC_n = n(n-1)/{N(N-1)}, and otherwise the probability is 1 - n(n-1)/{N(N-1)}; therefore t_i t_j takes the value 1 with probability n(n-1)/{N(N-1)} and the value 0 with probability 1 - n(n-1)/{N(N-1)}. Now we have
Cov(t_i, t_j) = E(t_i t_j) - E(t_i)E(t_j) = n(n-1)/{N(N-1)} - (n/N)(n/N) = (n/N)[(n-1)/(N-1) - n/N]
= n(n-N)/{N²(N-1)} = -n(N-n)/{N²(N-1)}.    (2.2.10)
Using (2.2.9) and (2.2.10) in (2.2.8) we obtain
V(ȳ_n) = (1/n²)[Σ Y_i² {n(N-n)/N²} - Σ_{i≠j} Y_i Y_j {n(N-n)/(N²(N-1))}]
= {(N-n)/(nN²)}[Σ Y_i² - (1/(N-1)) Σ_{i≠j} Y_i Y_j].    (2.2.11)
Note that
Ȳ² = {(1/N) Σ Y_i}² = (1/N²)[Σ Y_i² + Σ_{i≠j} Y_i Y_j],    (2.2.12)
so we obtain
Σ_{i≠j} Y_i Y_j = N² Ȳ² - Σ Y_i².
On substituting (2.2.12) in (2.2.11) we obtain
V(ȳ_n) = {(N-n)/(nN²)}[Σ Y_i² - (1/(N-1))(N² Ȳ² - Σ Y_i²)]
= {(N-n)/(nN²)}[{1 + 1/(N-1)} Σ Y_i² - {N²/(N-1)} Ȳ²]
= {(N-n)/(nN²)}{N/(N-1)}[Σ Y_i² - N Ȳ²]
= {(N-n)/(nN)} S_y² = {(1-f)/n} S_y²,
where
S_y² = (1/(N-1))[Σ Y_i² - N Ȳ²].
Hence the theorem.
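Formula (2.2.6) can be verified by brute-force enumeration of all SRSWOR samples from a small hypothetical population (values chosen only for illustration):

```python
from itertools import combinations

# Check V(ybar) = ((1 - f)/n) S_y^2 by enumerating all SRSWOR samples
Y = [2.0, 4.0, 6.0, 10.0, 13.0]          # hypothetical population, N = 5
N, n = len(Y), 2
f = n / N

Y_bar = sum(Y) / N
S2 = sum((v - Y_bar) ** 2 for v in Y) / (N - 1)

means = [sum(s) / n for s in combinations(Y, n)]
V_enum = sum((m - Y_bar) ** 2 for m in means) / len(means)  # exact variance of ybar
V_formula = (1 - f) / n * S2
```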



Theorem 2.2.4. An unbiased estimator of the variance V(ȳ_n) is given by
v(ȳ_n) = {(1-f)/n} s_y²,    (2.2.13)
where
s_y² = (1/(n-1))[Σ_{i=1}^{n} y_i² - n ȳ_n²] and f = n/N.
Proof. We have to show that E[v(ȳ_n)] = V(ȳ_n). Note that
E[v(ȳ_n)] = E[{(1-f)/n} s_y²] = {(1-f)/n} E(s_y²).    (2.2.14)
Now
E(s_y²) = E[(1/(n-1)){Σ y_i² - n ȳ_n²}] = (1/(n-1))[E(Σ y_i²) - n E(ȳ_n²)]
= (n/(n-1))[(1/n) Σ_{i=1}^{n} E(y_i²) - E(ȳ_n²)].    (2.2.15)
Note that E(ȳ_n²) = V(ȳ_n) + {E(ȳ_n)}² = {(N-n)/(Nn)} S_y² + Ȳ², and each unit Y_j is selected with probability 1/N, so that (2.2.15) becomes
E(s_y²) = (n/(n-1))[(1/n) Σ_{i=1}^{n} {(1/N) Σ_{j=1}^{N} Y_j²} - {(N-n)/(Nn)} S_y² - Ȳ²]
= (n/(n-1))[(1/N) Σ Y_j² - Ȳ² - {(N-n)/(Nn)} S_y²]
= (n/(n-1))[{(N-1)/N}{(1/(N-1))(Σ Y_j² - N Ȳ²)} - {(N-n)/(Nn)} S_y²]
= (n/(n-1))[{(N-1)/N} S_y² - {(N-n)/(Nn)} S_y²] = S_y².
On substituting this value in (2.2.14), we obtain E[v(ȳ_n)] = V(ȳ_n). Hence the theorem.

Theorem 2.2.5. Under SRSWOR sampling, while estimating the population mean (or total), the minimum sample size attaining a relative standard error (RSE) of at most φ is
n ≥ [1/N + φ² Ȳ²/S_y²]⁻¹.    (2.2.16)

Proof. The relative standard error of the estimator ȳ_n is given by
RSE(ȳ_n) = √V(ȳ_n)/Ȳ = √[(1/n - 1/N) S_y²]/Ȳ.    (2.2.17)
Note that we need an estimator ȳ_n such that RSE(ȳ_n) ≤ φ, which implies that
(1/n - 1/N) S_y²/Ȳ² ≤ φ², or 1/n ≤ 1/N + φ² Ȳ²/S_y², or n ≥ [1/N + φ² Ȳ²/S_y²]⁻¹.
Hence the theorem.

Remark 2.2. If φ = e/(Z_{α/2} Ȳ), where e denotes the permissible margin of error, then P[|ȳ_n - Ȳ| ≤ e] = 1 - α.

Example 2.2.2. A fisherman-recruiting company, XYZ, selected an SRSWOR sample of six kinds of fish out of the 69 kinds of fish available at the Atlantic and Gulf Coasts, as below:
( a ) Estimate the average number of fish in each species group.
( b ) Construct a 95% confidence interval for the average number of fish in each species group.
( c ) Estimate the total number of fish at Atlantic and Gulf Coasts during 1995.
( d ) Construct a 95% confidence interval for the total number of fish at Atlantic and Gulf Coasts during 1995.

Solution. We are given N = 69 and n = 6. From the sample information, we have

Sample unit    y_i      y_i²
1              16855    284091025
2              10940    119683600
3              4793     22972849
4              2146     4605316
5              3816     14561856
6              935      874225
Sum            39485    446788871

( a ) Thus the average number of fish in each species group is given by
ȳ_n = (1/n) Σ y_i = 39485/6 = 6580.83.
( b ) A (1 - α)100% confidence interval for the population mean Ȳ is given by

ȳ_n ± t_{α/2}(df = n - 1)√v(ȳ_n), where v(ȳ_n) = {(1-f)/n} s_y².
Now we have
s_y² = (1/(n-1))[Σ y_i² - (Σ y_i)²/n] = (1/(6-1))[446788871 - (39485)²/6] = 37388933.4.
Thus
v(ȳ_n) = {(1-f)/n} s_y² = {(1 - 0.0869)/6} × 37388933.4 = 5689972.5.

Using Table 2 from the Appendix, the 95% confidence interval for the average number of fish is given by
ȳ_n ± t_{0.025}(df = 6 - 1)√v(ȳ_n), or 6580.83 ± 2.571√5689972.5, or [448.05, 12713.61].
( c ) An estimate of the total number of fish is given by
Ŷ = N ȳ_n = 69 × 6580.83 = 454077.27.
( d ) The 95% confidence interval for the total number of fish is given by
N × [448.05, 12713.61], or 69 × [448.05, 12713.61], or [30915.45, 877239.09].
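A hedged sketch of the same computation; note that the text truncates f to 0.0869, so its v(ȳ_n) = 5689972.5 and interval endpoints differ very slightly from the exact values computed below:

```python
import math

# Example 2.2.2: SRSWOR sample of n = 6 from N = 69 (note the fpc)
y = [16855, 10940, 4793, 2146, 3816, 935]
n, N = len(y), 69
f = n / N                                # exact sampling fraction 6/69

y_bar = sum(y) / n
s2 = (sum(v * v for v in y) - n * y_bar ** 2) / (n - 1)
v_ybar = (1 - f) / n * s2                # fpc enters under SRSWOR
t = 2.571                                # t_{0.025}(df = 5)
half = t * math.sqrt(v_ybar)
ci_mean = (y_bar - half, y_bar + half)
total_hat = N * y_bar
```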

Example 2.2.3. We wish to estimate the average number of fish in each one of the species groups caught by marine recreational fishermen at the Atlantic and Gulf coasts. There were 69 species groups caught during 1995, as shown in population 4 in the Appendix. What is the minimum number of species groups to be selected by SRSWOR sampling to attain a relative standard error of 30%?
Given: S_y² = 37199578 and Y = 311528.

Solution. We are given N = 69, S_y² = 37199578, and Ȳ = 311528/69 = 4514.898. Thus for φ = 0.30, whether we are estimating the population total or the population mean, the minimum sample size under SRSWOR sampling for the required degree of precision is
n ≥ [1/N + φ² Ȳ²/S_y²]⁻¹ = [1/69 + 0.3² × (4514.898)²/37199578]⁻¹ = 15.6 ≈ 16.

Thus a minimum sample of size n = 16 units is required to attain 30% relative standard error of the estimator of the population total or mean under SRSWOR sampling.
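The SRSWOR sample-size formula (2.2.16) can be sketched as follows (illustrative check, values from the solution above):

```python
import math

# Example 2.2.3: minimum SRSWOR sample size for a 30% relative standard error
N = 69
S2 = 37199578.0
Y_bar = 311528 / N            # population mean, about 4514.898
phi = 0.30

n_req = 1 / (1 / N + phi ** 2 * Y_bar ** 2 / S2)   # formula (2.2.16)
n_min = math.ceil(n_req)
```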

Example 2.2.4. Select an SRSWOR sample of sixteen units from population 4 given in the Appendix. Collect the information on the number of fish during 1995 in each of the species groups selected in the sample. Estimate the average number of fish in each one of the species groups caught by marine recreational fishermen at the Atlantic and Gulf coasts during 1995. Construct a 95% confidence interval for the average number of fish in each species group available in the United States.

Solution. The population size is N = 69; therefore we used the second and third columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 16 random numbers between 1 and 69. The random numbers so selected are 01, 49, 25, 14, 20, 36, 42, 44, 65, 26, 40, 66, 17, 08, 33, and 53.

Random No.    Species group              y_i      (y_i - ȳ_n)    (y_i - ȳ_n)²
01            Sharks, other              2016     -937.8130      879492.2852
08            Toadfishes                 1632     -1321.8100     1747188.2850
14            Sculpins                   71       -2882.8100     8310607.9100
17            Temperate basses, other    23       -2930.8100     8589661.9100
20            Sea basses, other          2068     -885.8130      784663.7852
25            Florida pompano            644      -2309.8100     5335233.7850
26            Jacks, other               1625     -1328.8100     1765742.6600
33            Snappers, other            492      -2461.8100     6060520.7850
36            Grunts, other              3379     425.1875       180784.4102
40            Red porgy                  230      -2723.8100     7419154.5350
42            Spotted seatrout           24615    21661.1900     469207043.9000
44            Sand seatrout              4355     1401.1880      1963326.4100
49            Black drum                 1595     -1358.8100     1846371.4100
53            Barracuda                  908      -2045.8100     4185348.7850
65            Winter flounder            2324     -629.8130      396663.7852
66            Flounders, other           1284     -1669.8100     2788273.7850
Sum                                      47261    0.0000         521460078.4000

An estimate of the average number of fish in each species group during 1995 is
ȳ_n = (1/n) Σ y_i = 47261/16 = 2953.813.
Now
s_y² = (1/(n-1)) Σ (y_i - ȳ_n)² = 521460078.4/(16 - 1) = 34764005.23,
and the estimate of the variance of the estimator ȳ_n is
v(ȳ_n) = {(1-f)/n} s_y² = {(1 - 16/69)/16} × 34764005.23 = 1668924.16.
A (1 - α)100% confidence interval for the average number of fish in each one of the species groups caught during 1995 by marine recreational fishermen in the United States is
ȳ_n ± t_{α/2}(df = n - 1)√v(ȳ_n).
Using Table 2 from the Appendix, the 95% confidence interval is given by
ȳ_n ± t_{0.025}(df = 16 - 1)√v(ȳ_n), or 2953.813 ± 2.131√1668924.16, or [200.842, 5706.784].
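Sketch of the same SRSWOR computation from the sixteen tabulated values (illustrative; t-value 2.131 from Table 2):

```python
import math

# Example 2.2.4: SRSWOR sample of n = 16 (values from the table above)
y = [2016, 1632, 71, 23, 2068, 644, 1625, 492, 3379, 230,
     24615, 4355, 1595, 908, 2324, 1284]
n, N = len(y), 69
f = n / N

y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
v_ybar = (1 - f) / n * s2                 # fpc applied under SRSWOR
t = 2.131                                 # t_{0.025}(df = 15)
half = t * math.sqrt(v_ybar)
ci = (y_bar - half, y_bar + half)
```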



Example 2.2.5. The distribution of yield (kg/ha) y of a crop in 1000 plots has a Cauchy distribution:
f(y) = 1/(π[1 + (y - 10)²]), -∞ < y < +∞.

We wish to estimate the average yield with an accuracy of relative standard error of 0.15%. What is the minimum sample size n required while using SRSWOR sampling?
Solution. Since the mean and variance of a variable having a Cauchy distribution do not exist, it is not possible to find the required sample size under such a distribution.

Example 2.2.6. The distribution of yield (kg/ha) y of a crop in 1000 plots has a logistic distribution
f(y) = {1/(4β*)} sech²{(1/2)((y - α*)/β*)}
with α* = 40 and β* = 2.5.
( a ) Find the minimum sample size n required to estimate the average yield with an accuracy of relative standard error of 5%.
( b ) Select a sample of the required size and construct a 95% confidence interval for the average yield.
( c ) Does the true average yield lie in the 95% confidence interval?
Solution. ( a ) We know that the mean and variance of a logistic distribution are given by
Mean = α* = 40
and
Variance = σ_y² = β*² π²/3 = 2.5² × 3.14159²/3 = 20.56.
Also we are given N = 1000; thus
S_y² = {N/(N-1)} σ_y² = {1000/(1000-1)} × 20.56 = 20.5806.
Thus the minimum sample size required for φ = 0.05 is given by
n ≥ [1/N + φ² Ȳ²/S_y²]⁻¹ = [1/1000 + 0.05² × 40²/20.5806]⁻¹ = 5.11 ≈ 5.

( b ) We know that the cumulative distribution function for the logistic distribution is
F(y) = (1/2)[1 + tanh{(y - α*)/(2β*)}],
which implies that
y = α* + 2β* tanh⁻¹{2F(y) - 1} = 40 + 5 tanh⁻¹{2F(y) - 1}.
Using the last three columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, multiplied by 10⁻³, we obtain five values of F(y) and the corresponding values of y as given below:

F(y)     y         y²
0.072    36.460    1329.344
0.776    42.522    1808.111
0.406    39.071    1526.531
0.565    40.646    1652.128
0.108    36.675    1345.089
Sum      195.375   7661.202

Thus an estimate of the average yield (kg/ha) of the crop is given by
ȳ_n = (1/n) Σ y_i = 195.375/5 = 39.075.
We use the alternative method to find s_y², given by
s_y² = (1/(n-1))[Σ y_i² - n ȳ_n²] = (1/(5-1))[7661.202 - 5 × (39.075)²] = 6.7309,

and the estimate of the variance of the estimator ȳ_n is given by
v(ȳ_n) = {(1-f)/n} s_y² = {(1 - 5/1000)/5} × 6.7309 = 1.339449.
A (1 - α)100% confidence interval for the average yield of the crop is given by
ȳ_n ± t_{α/2}(df = n - 1)√v(ȳ_n).
Using Table 2 from the Appendix, the 95% confidence interval is given by
ȳ_n ± t_{0.025}(df = 5 - 1)√v(ȳ_n), or 39.075 ± 2.776√1.339449, or [35.8622, 42.2877].
( c ) Yes, the resultant 95% confidence interval contains the true average yield α* = 40.
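The interval in part (b) can be replayed from the five tabulated depths (an illustrative sketch; small last-digit differences arise from the text's intermediate rounding):

```python
import math

# Example 2.2.6(b)-(c): CI for the average yield from the n = 5 sampled values
y = [36.460, 42.522, 39.071, 40.646, 36.675]
n, N = len(y), 1000
f = n / N

y_bar = sum(y) / n
s2 = (sum(v * v for v in y) - n * y_bar ** 2) / (n - 1)
v_ybar = (1 - f) / n * s2
t = 2.776                                     # t_{0.025}(df = 4)
half = t * math.sqrt(v_ybar)
ci = (y_bar - half, y_bar + half)
contains_true_mean = ci[0] <= 40.0 <= ci[1]   # part (c)
```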

Theorem 2.2.6. The covariance between the two sample means ȳ_n and x̄_n under SRSWOR sampling is:
Cov(ȳ_n, x̄_n) = {(1-f)/n} S_xy,    (2.2.18)
where
S_xy = (1/(N-1)) Σ_{i=1}^{N} (Y_i - Ȳ)(X_i - X̄).

Proof. We have
Cov(ȳ_n, x̄_n) = Cov((1/n) Σ_{i=1}^{n} y_i, (1/n) Σ_{i=1}^{n} x_i) = Cov((1/n) Σ_{i=1}^{N} t_i Y_i, (1/n) Σ_{i=1}^{N} t_i X_i),    (2.2.19)
where t_i is a random variable that takes the value 1 if the i-th unit is included in the sample, and otherwise it takes the value 0. Note that the pair (Y_i, X_i) is fixed for the i-th unit in the population, so we have
Cov(ȳ_n, x̄_n) = (1/n²) Cov(Σ t_i Y_i, Σ t_i X_i)
= (1/n²)[E{(Σ t_i Y_i)(Σ t_i X_i)} - E{Σ t_i Y_i} E{Σ t_i X_i}]
= (1/n²)[E{Σ t_i² Y_i X_i + Σ_{i≠j} t_i t_j Y_i X_j} - E{Σ t_i Y_i} E{Σ t_i X_i}]
= (1/n²)[Σ E(t_i²) Y_i X_i + Σ_{i≠j} E(t_i t_j) Y_i X_j - {Σ E(t_i) Y_i}{Σ E(t_i) X_i}].    (2.2.20)
In (2.2.20) we need to determine E(t_i²) and E(t_i t_j). Note that t_i, and hence t_i², takes the value 1 with probability n/N and the value 0 with probability {1 - (n/N)}. We have
E(t_i²) = 1 × (n/N) + 0 × {1 - (n/N)} = n/N.    (2.2.21)
The probability that both the i-th and j-th units are included in the sample is (N-2)C(n-2)/NC_n = n(n-1)/{N(N-1)}, and otherwise the probability is 1 - n(n-1)/{N(N-1)}; therefore t_i t_j takes the value 1 with probability n(n-1)/{N(N-1)} and the value 0 otherwise. Now we have
E(t_i t_j) = 1 × n(n-1)/{N(N-1)} + 0 × [1 - n(n-1)/{N(N-1)}] = n(n-1)/{N(N-1)}.    (2.2.22)
Using (2.2.21) and (2.2.22) in (2.2.20) we obtain
Cov(ȳ_n, x̄_n) = (1/n²)[(n/N) Σ Y_i X_i + {n(n-1)/(N(N-1))} Σ_{i≠j} Y_i X_j - {(n/N) Σ Y_i}{(n/N) Σ X_i}].    (2.2.23)
Note that
Ȳ X̄ = {(1/N) Σ Y_i}{(1/N) Σ X_i} = (1/N²)[Σ Y_i X_i + Σ_{i≠j} Y_i X_j],
which implies that
Σ_{i≠j} Y_i X_j = N² Ȳ X̄ - Σ Y_i X_i.
Thus from (2.2.23) we have
Cov(ȳ_n, x̄_n) = (1/n²)[(n/N) Σ Y_i X_i + {n(n-1)/(N(N-1))}(N² Ȳ X̄ - Σ Y_i X_i) - n² Ȳ X̄]
= (1/n²)[{n/N - n(n-1)/(N(N-1))} Σ Y_i X_i + Ȳ X̄ {n(n-1)N/(N-1) - n²}]
= (1/n²)[(n/N){(N-n)/(N-1)} Σ Y_i X_i - Ȳ X̄ {n(N-n)/(N-1)}]
= {(N-n)/(nN)}{1/(N-1)}[Σ Y_i X_i - N Ȳ X̄]
= {(1-f)/n}{1/(N-1)} Σ_{i=1}^{N} (Y_i - Ȳ)(X_i - X̄) = {(1-f)/n} S_xy.
Hence the theorem.

Theorem 2.2.7. An unbiased estimator of the covariance Cov(ȳ_n, x̄_n) is given by
ĉov(ȳ_n, x̄_n) = {(1-f)/n} s_xy,    (2.2.24)
where
s_xy = (1/(n-1))[Σ_{i=1}^{n} y_i x_i - n ȳ_n x̄_n].
Proof. We have to show that E[ĉov(ȳ_n, x̄_n)] = Cov(ȳ_n, x̄_n). We have
E[ĉov(ȳ_n, x̄_n)] = E[{(1-f)/n} s_xy] = {(1-f)/n} E(s_xy).
Now
E(s_xy) = E[(1/(n-1)){Σ y_i x_i - n ȳ_n x̄_n}] = (1/(n-1))[Σ E(y_i x_i) - n E(ȳ_n x̄_n)]
= (n/(n-1))[(1/n) Σ_{i=1}^{n} E(y_i x_i) - E(ȳ_n x̄_n)].
Now
E(ȳ_n x̄_n) = Cov(ȳ_n, x̄_n) + E(ȳ_n)E(x̄_n) = {(N-n)/(Nn)} S_xy + Ȳ X̄,
and each pair of values (Y_j, X_j) is selected with probability 1/N; therefore we have
E(s_xy) = (n/(n-1))[(1/n) Σ_{i=1}^{n} {(1/N) Σ_{j=1}^{N} Y_j X_j} - {(N-n)/(Nn)} S_xy - Ȳ X̄]
= (n/(n-1))[(1/N) Σ Y_j X_j - Ȳ X̄ - {(N-n)/(Nn)} S_xy]
= (n/(n-1))[{(N-1)/N}{(1/(N-1))(Σ Y_j X_j - N Ȳ X̄)} - {(N-n)/(Nn)} S_xy]
= (n/(n-1))[{(N-1)/N} S_xy - {(N-n)/(Nn)} S_xy] = S_xy.
Thus we obtain
E[ĉov(ȳ_n, x̄_n)] = {(1-f)/n} S_xy = Cov(ȳ_n, x̄_n).
Hence the theorem.

Example 2.2.7. Consider the joint probability density function of two continuous random variables x and y:
f(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1, and f(x, y) = 0 otherwise.
( a ) Select six pairs of observations (y, x) by using the Random Number Table method.
( b ) Estimate the value of the covariance between x and y.
Solution. ( a ) See Chapter 1.
( b ) Estimate of covariance:

R_1      y        R_2      x
0.992    0.995    0.622    0.423
0.588    0.722    0.771    0.514
0.601    0.732    0.917    0.600
0.549    0.691    0.675    0.456
0.925    0.954    0.534    0.368
0.014    0.039    0.513    0.355

So we obtain

y        (y_i - ȳ)    x        (x_i - x̄)    (y_i - ȳ)(x_i - x̄)
0.995    0.306167     0.423    -0.030       -0.009080
0.722    0.033167     0.514    0.061        0.002034
0.732    0.043167     0.600    0.147        0.006360
0.691    0.002167     0.456    0.003        0.000000
0.954    0.265167     0.368    -0.085       -0.022450
0.039    -0.649830    0.355    -0.098       0.063467
Sum      4.133        0.000000 2.716        0.000      0.040335

Thus an estimate of the covariance between the two sample means is given by
ĉov(ȳ_n, x̄_n) = {(1-f)/n} s_xy = {(1-f)/n}(1/(n-1)) Σ (y_i - ȳ_n)(x_i - x̄_n)
= {(1-0)/6}{1/(6-1)} × 0.040335 = 0.0013445.
Note that f = 0 because of the infinite population size.

2.3 ESTIMATION OF POPULATION PROPORTION

Let N be the total number of units in the population and N_A be the number of units possessing a certain attribute, A (say). Then the population proportion is the ratio of the number of units possessing the attribute A to the total number of units in the population, i.e., P_y = N_A/N. Thus we have the following theorem:

Theorem 2.3.1. The population proportion P_y is a special case of the population mean Ȳ.
Proof. We know that the population mean is given by
Ȳ = (1/N) Σ_{i=1}^{N} Y_i.    (2.3.1)
Let us define
Y_i = 1 if the i-th unit possesses the attribute A, and Y_i = 0 otherwise.
Then (2.3.1) becomes
Ȳ = (1 + 0 + 1 + ... + 1)/N = N_A/N = P_y.
Hence the theorem.

We will discuss the problem of estimation of popu lation proportion using SRSWR
and SRSWOR samp ling.

Case I. When the sample is drawn using simple random sampling with replacement (SRSWR sampling), we have the following theorems.

Theorem 2.3.2. An unbiased estimator of the population proportion P_y is given by
p̂_y = r/n,    (2.3.2)
where r is the number of units in the sample that possess the attribute A.
Proof. To prove that E(p̂_y) = P_y, let us proceed as follows. Defining
y_i = 1 if the i-th sampled unit possesses the attribute A, and y_i = 0 otherwise,
we have
ȳ_n = (1/n) Σ y_i = r/n = p̂_y.
Therefore
E(p̂_y) = E(r/n) = E(ȳ_n) = Ȳ = P_y.
Hence the theorem.

Theorem 2.3.3. The variance of the estimator p̂_y is given by
V(p̂_y) = P_y Q_y/n,    (2.3.3)
where Q_y = 1 - P_y.
Proof. We know that V(ȳ_n) = σ_y²/n under SRSWR sampling, which implies that V(p̂_y) = σ_y²/n. Now
σ_y² = (1/N)[Σ Y_i² - N Ȳ²].
Note that Y_i = 1 if the i-th unit possesses the attribute A and Y_i = 0 otherwise, so that Y_i² = Y_i. So
σ_y² = (1/N)[N_A - N P_y²] = P_y - P_y² = P_y(1 - P_y) = P_y Q_y.
Thus we have
V(p̂_y) = P_y Q_y/n.
Hence the theorem.

Theorem 2.3.4. An unbiased estimator of V(p̂_y) is given by
v(p̂_y) = p̂_y q̂_y/(n - 1),
where q̂_y = 1 - p̂_y.
Proof. We have to show that E[v(p̂_y)] = V(p̂_y); in other words, we have to show that
E[p̂_y q̂_y/(n - 1)] = P_y Q_y/n.
We know that s_y²/n is an unbiased estimator of σ_y²/n. Defining
y_i = 1 if the i-th sampled unit ∈ A and y_i = 0 otherwise (so that y_i² = y_i),
we obtain
s_y² = (1/(n-1))[Σ y_i² - n ȳ_n²] = (1/(n-1))[r - n p̂_y²] = (n/(n-1))[(r/n) - p̂_y²] = (n/(n-1)) p̂_y(1 - p̂_y) = n p̂_y q̂_y/(n-1).
So that
s_y²/n = p̂_y q̂_y/(n-1).
Hence the theorem.

Theorem 2.3.5. Under SRSWR sampling, while estimating the population proportion, the minimum sample size attaining a relative standard error (RSE) of at most φ is
n ≥ Q_y/(φ² P_y).    (2.3.4)
Proof. The relative standard error of the estimator p̂_y is given by
RSE(p̂_y) = √[V(p̂_y)/{E(p̂_y)}²] = √[P_y Q_y/(n P_y²)] = √[Q_y/(n P_y)].    (2.3.5)
Note that we need an estimator p̂_y such that RSE(p̂_y) ≤ φ, which implies that
Q_y/(n P_y) ≤ φ², or n ≥ Q_y/(φ² P_y).
Hence the theorem.

Note that the relation n ≥ Q_y/(φ² P_y) shows that as P_y → 0, n → ∞.



Example 2.3.1. We wish to estimate the proportion of the number of fish in the group Herrings caught by marine recreational fishermen at the Atlantic and Gulf coasts. There are 30027 fish out of a total of 311528 fish caught during 1995, as shown in population 4 in the Appendix. What is the minimum number of fish to be selected by SRSWR sampling to attain 5% relative standard error of the estimator of the population proportion?
Solution. We have
P_y = 30027/311528 = 0.0964 and Q_y = 1 - P_y = 1 - 0.0964 = 0.9036.
Thus for rp = 0.05 , we have

II ;:>: Q / (rp2 P) =
0.9036 = 3749.4 '" 3750.
Y 0.052 x 0.0964
Y
Thus a minimum sample of size II = 3750 fish is required to attain 5% relati ve
standard error of the estimator of population proportion under SRSWR sampling.
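The sample-size rule of Theorem 2.3.5 is easy to check numerically. A minimal sketch (Python is my choice of language here, not the book's) reproducing Example 2.3.1:

```python
import math

def n_min_srswr(P, phi):
    # Theorem 2.3.5: smallest n with RSE(p_hat) <= phi under SRSWR,
    # i.e. n >= Q / (phi^2 * P), with Q = 1 - P.
    Q = 1.0 - P
    return math.ceil(Q / (phi ** 2 * P))

# Example 2.3.1: P_y = 30027/311528, phi = 0.05
n = n_min_srswr(30027 / 311528, 0.05)   # 3750
```

Using the exact proportion rather than the rounded 0.0964 still yields n = 3750 after rounding up.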

Example 2.3.2. A fisherman visited the Atlantic and Gulf coasts and caught 4000 fish one by one. He noted the species group of each fish caught and put the fish back into the sea before making the next catch. He observed that 400 fish belong to the group Herrings.
( a ) Estimate the proportion of fish in the group Herrings living in the Atlantic and Gulf coasts.
( b ) Construct the 95% confidence interval.
Solution. We are given n = 4000 and r = 400 .
( a ) An estimate of the proportion of the fish in the Herrings group is given by
\hat{p}_y = \frac{r}{n} = \frac{400}{4000} = 0.1.
( b ) Under SRSWR sampling an estimate of the V(\hat{p}_y) is given by
\hat{v}(\hat{p}_y) = \frac{\hat{p}_y \hat{q}_y}{n-1} = \frac{0.1 \times 0.9}{4000-1} = 2.2505 \times 10^{-5}.
A (1-\alpha)100\% confidence interval for the true proportion P_y is given by
\hat{p}_y \pm z_{\alpha/2}\sqrt{\hat{v}(\hat{p}_y)} .
Thus the 95% confidence interval for the proportion of fish belonging to the Herrings group is given by
\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)} , or 0.1 \pm 1.96\sqrt{2.2505 \times 10^{-5}} , or [0.0907, 0.1092].
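The interval computation above can be sketched as follows (a hedged illustration, not the book's notation):

```python
import math

def srswr_proportion_ci(r, n, z=1.96):
    # Example 2.3.2: p_hat = r/n, v(p_hat) = p_hat*q_hat/(n-1),
    # CI = p_hat +/- z*sqrt(v).
    p = r / n
    v = p * (1 - p) / (n - 1)
    half = z * math.sqrt(v)
    return p - half, p + half

lo, hi = srswr_proportion_ci(400, 4000)   # approximately (0.0907, 0.1093)
```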

Example 2.3.3. The height y of plants in a field is uniformly distributed between 5 cm and 20 cm with the probability density function
f(y) = \frac{1}{15} , \quad 5 < y < 20 .
We wish to estimate the proportion of plants with height more than 15 cm. What is the minimum required sample size n to attain a relative standard error of 45%?

( a ) Select a sample of the required size, and estimate the proportion of plants with height more than 15 cm.
( b ) Construct a 95% confidence interval estimate, assuming that your sample size is large, and interpret your results.
Solution. We know that y has the uniform distribution function
f(y) = \frac{1}{b-a} , \quad a < y < b .
Thus the proportion of plants with height more than 15 cm is given by
P_y = \int_{15}^{20} f(y)\,dy = \int_{15}^{20} \frac{1}{15}\,dy = \frac{1}{15}(20-15) = \frac{5}{15} = 0.3333,
and the variance
\sigma_y^2 = P_y(1-P_y) = 0.3333(1-0.3333) = 0.2222 .
( a ) We need \phi = 0.45 , thus the required minimum sample size is given by
n \ge \frac{\sigma_y^2}{\phi^2 P_y^2} = \frac{0.2222}{0.45^2 \times 0.3333^2} = 9.8 \approx 10.
( b ) We select a with-replacement sample of n = 10 units as follows. The cumulative distribution function (c.d.f.) is given by
F(y) = P[Y \le y] = \int_{5}^{y} \frac{1}{15}\,dt = \frac{y-5}{15},
which implies that y = 15F(y) + 5 . We used the 4th to 6th columns, multiplied by 10^{-3}, of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select ten values of F(y), and the required sampled values were computed by using the relationship y = 15F(y) + 5 as follows (the first F(y) value is recovered from the listed height 19.31):

  F(y)     y = 15F(y)+5    y > 15 ?
  0.954    19.31           yes (1)
  0.183     7.75           no  (0)
  0.448    11.72           no  (0)
  0.171     7.57           no  (0)
  0.567    13.51           no  (0)
  0.737    16.06           yes (1)
  0.856    17.84           yes (1)
  0.233     8.50           no  (0)
  0.895    18.43           yes (1)
  0.263     8.95           no  (0)

Thus an estimate of the proportion P_y is given by

\hat{p}_y = \frac{\text{number of `yes' answers}}{\text{sample size}} = \frac{4}{10} = 0.4
and an estimate of its variance is given by
\hat{v}(\hat{p}_y) = \frac{\hat{p}_y(1-\hat{p}_y)}{n-1} = \frac{0.4(1-0.4)}{10-1} = 0.0267 .
Thus a 95% confidence interval estimate of the required proportion P_y is given by
\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)} , or 0.4 \pm 1.96\sqrt{0.0267} , or [0.0797, 0.7202] .
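The inverse-c.d.f. step of this example can be replayed directly. In the sketch below the first F(y) value, 0.954, is an assumption inferred from the listed height 19.31 via F = (y-5)/15; the rest are the tabulated values.

```python
# Replay of Example 2.3.3(b): y = 15*F + 5, count plants taller than 15 cm.
F = [0.954, 0.183, 0.448, 0.171, 0.567, 0.737, 0.856, 0.233, 0.895, 0.263]
heights = [15 * f + 5 for f in F]
p_hat = sum(y > 15 for y in heights) / len(heights)   # 4/10 = 0.4
v_hat = p_hat * (1 - p_hat) / (len(heights) - 1)      # about 0.0267
```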

Case II. When a sample is drawn using SRSWOR sampling, we have the following theorems.

Theorem 2.3.6. An unbiased estimator of the population proportion P_y is given by
\hat{p}_y = \frac{r}{n}
where r is the number of sampled units possessing the attribute A.

Proof. Obvious.
Theorem 2.3.7. The variance of the estimator \hat{p}_y is given by
V(\hat{p}_y) = \frac{(N-n)}{n(N-1)} P_y Q_y .
Proof. We know that
V(\bar{y}_n) = \frac{(N-n)}{Nn} S_y^2 , where S_y^2 = \frac{1}{N-1}\left( \sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2 \right).
Again we define
y_i = 1 if the i-th unit possesses the attribute A, and 0 otherwise,
so that y_i^2 = y_i .
So
S_y^2 = \frac{1}{N-1}\left( N_A - N P_y^2 \right) = \frac{N}{N-1}\left( P_y - P_y^2 \right) = \frac{N P_y Q_y}{N-1} = S_p^2 .
Hence we have
V(\hat{p}_y) = \frac{N-n}{Nn} S_p^2 = \frac{N-n}{Nn} \times \frac{N P_y Q_y}{N-1} = \frac{(N-n)}{n(N-1)} P_y Q_y ,
which proves the theorem.

Theorem 2.3.8. An unbiased estimator of the variance V(\hat{p}_y) is given by

\hat{v}(\hat{p}_y) = \frac{(N-n)}{N(n-1)} \hat{p}_y \hat{q}_y .

Proof. We will prove that E[\hat{v}(\hat{p}_y)] = V(\hat{p}_y) , that is,

E\!\left[ \frac{(N-n)}{N(n-1)} \hat{p}_y \hat{q}_y \right] = \frac{(N-n)}{n(N-1)} P_y Q_y .

Now we know that \frac{(N-n)}{Nn} s_y^2 is an unbiased estimator of \frac{(N-n)}{Nn} S_y^2 .
Writing the population values as indicators,
Y_i = 1 if the i-th population unit \in A, and 0 otherwise (so that Y_i^2 = Y_i),
makes
\frac{(N-n)}{Nn} S_y^2 = \frac{(N-n)}{n(N-1)} P_y Q_y .
Similarly, writing the sample values as indicators,
y_i = 1 if the i-th sampled unit \in A, and 0 otherwise,
the sample variance
s_y^2 = \frac{1}{n-1}\left[ \sum_{i=1}^{n} y_i^2 - n\bar{y}_n^2 \right]
reduces to
s_p^2 = \frac{1}{n-1}\left[ r - n\hat{p}_y^2 \right] = \frac{n}{n-1}\left[ \frac{r}{n} - \hat{p}_y^2 \right] = \frac{n}{n-1}\,\hat{p}_y(1-\hat{p}_y) = \frac{n\hat{p}_y \hat{q}_y}{n-1}.
Therefore
\frac{(N-n)}{Nn} s_p^2 = \frac{(N-n)}{Nn} \cdot \frac{n\hat{p}_y \hat{q}_y}{n-1} = \frac{(N-n)}{N(n-1)} \hat{p}_y \hat{q}_y .
Hence the theorem.

Theorem 2.3.9. Under SRSWOR sampling, while estimating the population proportion, the minimum sample size for which the relative standard error (RSE) does not exceed \phi is

n \ge \frac{N Q_y}{\phi^2 (N-1) P_y + Q_y}.   (2.3.6)

Proof. The relative standard error of the estimator \hat{p}_y is given by

RSE(\hat{p}_y) = \sqrt{V(\hat{p}_y)} \Big/ E(\hat{p}_y) = \sqrt{ \frac{(N-n) P_y Q_y}{n(N-1) P_y^2} } = \sqrt{ \frac{(N-n) Q_y}{n(N-1) P_y} }.   (2.3.7)

Note that we need an estimator \hat{p}_y such that RSE(\hat{p}_y) \le \phi , which implies that

\frac{(N-n) Q_y}{n(N-1) P_y} \le \phi^2 , or N Q_y \le n\left[ \phi^2 (N-1) P_y + Q_y \right], or n \ge \frac{N Q_y}{\phi^2 (N-1) P_y + Q_y}.

Hence the theorem.

Example 2.3.4. We wish to estimate the proportion of the number of fish in the group Herrings caught by marine recreational fishermen at the Atlantic and Gulf coasts. There are 30,027 fish out of a total of 311,528 fish caught during 1995, as shown in population 4 in the Appendix. What is the minimum number of fish to be selected by SRSWOR sampling to attain a relative standard error of 5%?
Solution. We have
P_y = 30027/311528 = 0.0964 and Q_y = 1 - P_y = 1 - 0.0964 = 0.9036.
Thus for \phi = 0.05 , we have
n \ge \frac{N Q_y}{\phi^2 (N-1) P_y + Q_y} = \frac{311528 \times 0.9036}{0.05^2 (311528-1) \times 0.0964 + 0.9036} = 3704.8 \approx 3705.
Thus a minimum sample of n = 3705 fish is required to attain a 5% relative standard error of the estimator of the population proportion under SRSWOR sampling.
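The SRSWOR rule (2.3.6) can be checked the same way; the sketch below uses the book's rounded P_y = 0.0964:

```python
import math

def n_min_srswor(N, P, phi):
    # Theorem 2.3.9: n >= N*Q / (phi^2*(N-1)*P + Q), with Q = 1 - P.
    Q = 1.0 - P
    return math.ceil(N * Q / (phi ** 2 * (N - 1) * P + Q))

# Example 2.3.4 with the book's rounded P_y = 0.0964
n = n_min_srswor(311528, 0.0964, 0.05)   # 3705
```

Note the SRSWOR size (3705) is slightly smaller than the SRSWR size (3750), as the finite population correction suggests.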
Example 2.3.5. A fisherman visited the Atlantic and Gulf coasts and caught 4000 fish. He noted the species group of each fish caught. He observed that 400 fish belong to the group Herrings.
( a ) Estimate the proportion of fish in the group Herrings living in the Atlantic and Gulf coasts.
( b ) Construct the 95% confidence interval.
Given: Total number of fish living in the coasts = 311,528.
Solution. We are given N = 311,528, n = 4,000 and r = 400 .
( a ) An estimate of the proportion of the fish in the Herrings group is
\hat{p}_y = \frac{r}{n} = \frac{400}{4000} = 0.1.
( b ) Under SRSWOR sampling, an estimate of the V(\hat{p}_y) is given by
\hat{v}(\hat{p}_y) = \frac{(N-n)}{N} \cdot \frac{\hat{p}_y \hat{q}_y}{n-1} = \frac{(311528-4000)}{311528} \times \frac{0.1 \times 0.9}{4000-1} = 2.2216 \times 10^{-5}.
A (1-\alpha)100\% confidence interval for the true proportion P_y is given by
\hat{p}_y \pm z_{\alpha/2}\sqrt{\hat{v}(\hat{p}_y)} .
Thus the 95% confidence interval for the proportion of fish belonging to the Herrings group is given by
\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)} , or 0.1 \pm 1.96\sqrt{2.2216 \times 10^{-5}} , or [0.0908, 0.1092].
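The only change from Example 2.3.2 is the finite population correction; a sketch:

```python
import math

def srswor_proportion_ci(N, n, r, z=1.96):
    # Example 2.3.5: v(p_hat) = ((N-n)/N) * p_hat*q_hat/(n-1),
    # CI = p_hat +/- z*sqrt(v).
    p = r / n
    v = (N - n) / N * p * (1 - p) / (n - 1)
    return v, p - z * math.sqrt(v), p + z * math.sqrt(v)

v, lo, hi = srswor_proportion_ci(311528, 4000, 400)
```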

Example 2.3.6. In a field there are 1,000 plants and the distribution of their height is given by the probability mass function

  y (cm)   50    100   150   200   225   275
  p(y)     0.1   0.2   0.3   0.1   0.2   0.1

( a ) Select a random sample of n = 10 units and estimate the proportion of plants with height more than or equal to 225 cm.
( b ) Construct a 95% confidence interval, assuming that it is a large sample.

Solution. The cumulative distribution function F(y) is given by

  y (cm)   50    100   150   200   225   275
  p(y)     0.1   0.2   0.3   0.1   0.2   0.1
  F(y)     0.1   0.3   0.6   0.7   0.9   1.0

Using the first three columns, multiplied by 10^{-3}, of the Pseudo-Random Number (PRN) Table 1 given in the Appendix, we obtain the 10 values of F(y) and y as:

  F(y)     y      y >= 225 ? (yes = 1, no = 0)
  0.992    225    1
  0.588    100    0
  0.601    150    0
  0.549    100    0
  0.925    225    1
  0.014    (discard this number)
  0.697    150    0
  0.872    200    0
  0.626    150    0
  0.236     50    0
  0.884    225    1

( a ) An estimate of the proportion of plants with height more than or equal to 225 cm is

\hat{p}_y = \frac{\text{No. of plants with height} \ge 225 \text{ cm}}{\text{No. of plants in the sample}} = \frac{3}{10} = 0.3 .

( b ) We have N = 1000 and n = 10 , therefore an estimate of the V(\hat{p}_y) is given by

\hat{v}(\hat{p}_y) = \frac{(N-n)}{N} \cdot \frac{\hat{p}_y \hat{q}_y}{n-1} = \frac{(1000-10)}{1000} \times \frac{0.3 \times 0.7}{10-1} = 0.0231 .

Thus a 95% confidence interval for the proportion of plants with height more than or equal to 225 cm is:
\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)} , or 0.3 \pm 1.96\sqrt{0.0231} , or [0.0021, 0.5978] .
Note that, for large samples, the estimate of a proportion is approximately normally distributed, so the use of 1.96 for deriving the 95% confidence interval estimate is justified under the large-sample assumption made here.
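The discrete inverse-c.d.f. draw above can be sketched as follows. Two assumptions are made: a random number below the first c.d.f. value 0.1 is discarded (as the book does with 0.014), and the last retained number is taken as 0.984 rather than the printed 0.884, since only a value above 0.9 is consistent with the height 225 the table assigns to it.

```python
import bisect

heights = [50, 100, 150, 200, 225, 275]
cdf = [0.1, 0.3, 0.6, 0.7, 0.9, 1.0]

def draw(u):
    # Return the height whose interval [F(prev), F(y)) contains u;
    # u below the first c.d.f. value is discarded (assumed convention).
    j = bisect.bisect_right(cdf, u)
    return None if j == 0 else heights[j - 1]

us = [0.992, 0.588, 0.601, 0.549, 0.925, 0.014, 0.697, 0.872, 0.626, 0.236, 0.984]
sample = [y for y in map(draw, us) if y is not None]
p_hat = sum(y >= 225 for y in sample) / len(sample)   # 3/10 = 0.3
```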

2.4 SEARLS' ESTIMATOR OF POPULATION MEAN

Searls (1964) considered an estimator of the population mean \bar{Y} defined as

\bar{y}_{searl} = \lambda \bar{y}_n   (2.4.1)

where \lambda is a constant such that the MSE of the estimator \bar{y}_{searl} is minimum. Thus we have the following theorem.

Theorem 2.4.1. The minimum mean squared error of the estimator \bar{y}_{searl} is
Min.MSE(\bar{y}_{searl}) = V(\bar{y}_n) \big/ \left\{ 1 + V(\bar{y}_n)/\bar{Y}^2 \right\}.   (2.4.2)
Proof. We have
MSE(\bar{y}_{searl}) = E[\lambda\bar{y}_n - \bar{Y}]^2 = E[\lambda\{\bar{y}_n - E(\bar{y}_n)\} + (\lambda-1)\bar{Y}]^2
 = \lambda^2 E\{\bar{y}_n - E(\bar{y}_n)\}^2 + (\lambda-1)^2 \bar{Y}^2 + 0 = \lambda^2 V(\bar{y}_n) + (\lambda-1)^2 \bar{Y}^2 .   (2.4.3)

Thus the mean squared error, MSE(\bar{y}_{searl}) , is given by
MSE(\bar{y}_{searl}) = \lambda^2 V(\bar{y}_n) + (\lambda-1)^2 \bar{Y}^2 .   (2.4.4)
On differentiating (2.4.4) with respect to \lambda and equating to zero we obtain
\lambda V(\bar{y}_n) + (\lambda-1)\bar{Y}^2 = 0 , or \lambda = 1 \big/ \{ 1 + V(\bar{y}_n)/\bar{Y}^2 \} = \bar{Y}^2 \big/ \{ \bar{Y}^2 + V(\bar{y}_n) \}.   (2.4.5)
On substituting (2.4.5) in (2.4.4) we obtain

Min.MSE(\bar{y}_{searl}) = \frac{\bar{Y}^4 V(\bar{y}_n)}{\{\bar{Y}^2 + V(\bar{y}_n)\}^2} + \frac{\bar{Y}^2 \{V(\bar{y}_n)\}^2}{\{\bar{Y}^2 + V(\bar{y}_n)\}^2} = \frac{\bar{Y}^2 V(\bar{y}_n)\{\bar{Y}^2 + V(\bar{y}_n)\}}{\{\bar{Y}^2 + V(\bar{y}_n)\}^2}
 = \frac{\bar{Y}^2 V(\bar{y}_n)}{\bar{Y}^2 + V(\bar{y}_n)} = \frac{V(\bar{y}_n)}{1 + V(\bar{y}_n)/\bar{Y}^2}.   (2.4.6)

Hence the theorem.

Theorem 2.4.2. Under SRSWR sampling, the minimum mean squared error of the Searls' estimator is
Min.MSE(\bar{y}_{searl}) = n^{-1}\sigma_y^2 \big/ \{ 1 + n^{-1}\sigma_y^2/\bar{Y}^2 \}.   (2.4.7)
Proof. Obvious from (2.4.2), because under SRSWR sampling we have V(\bar{y}_n) = n^{-1}\sigma_y^2 .

Theorem 2.4.3. The relative efficiency of the Searls' estimator \bar{y}_{searl} with respect to the usual estimator \bar{y}_n , under SRSWR sampling, is given by
RE = 1 + \sigma_y^2 \big/ (n\bar{Y}^2) .   (2.4.8)

Thus the relative gain of the Searls' estimator is inversely proportional to the sample size n . In other words, as n \to \infty , the value of RE \to 1 .

Proof. It follows from the definition of the relative efficiency. Note that the relative efficiency of Searls' estimator with respect to the usual estimator is given by

RE = \frac{MSE(\bar{y}_n)}{MSE(\bar{y}_{searl})} = \frac{V(\bar{y}_n)}{Min.MSE(\bar{y}_{searl})} = \frac{n^{-1}\sigma_y^2}{ n^{-1}\sigma_y^2 \big/ \left( 1 + n^{-1}\sigma_y^2/\bar{Y}^2 \right)} = 1 + n^{-1}\bar{Y}^{-2}\sigma_y^2 .   (2.4.9)
Hence the theorem.

Theorem 2.4.4. Under SRSWOR sampling, the minimum mean squared error of the Searls' estimator is

Min.MSE(\bar{y}_{searl}) = \left( \frac{1}{n} - \frac{1}{N} \right) S_y^2 \Big/ \left\{ 1 + \left( \frac{1}{n} - \frac{1}{N} \right) S_y^2 / \bar{Y}^2 \right\}.   (2.4.10)

Proof. Obvious from (2.4.2), because under SRSWOR sampling we have

V(\bar{y}_n) = \left( \frac{1}{n} - \frac{1}{N} \right) S_y^2 .

Theorem 2.4.5. The relative efficiency of the Searls' estimator \bar{y}_{searl} with respect to \bar{y}_n , under SRSWOR, is given by

RE = 1 + \left( \frac{1}{n} - \frac{1}{N} \right) C_y^2 .

Thus the relative gain in efficiency of the Searls' estimator is inversely proportional to the sample size n . In other words, as n \to N the value of RE \to 1 .

Proof. By the definition of the relative efficiency we have

RE = \frac{V(\bar{y}_n)}{Min.MSE(\bar{y}_{searl})} = \frac{ \left( \frac{1}{n} - \frac{1}{N} \right) S_y^2 }{ \left( \frac{1}{n} - \frac{1}{N} \right) S_y^2 \Big/ \left\{ 1 + \left( \frac{1}{n} - \frac{1}{N} \right) \frac{S_y^2}{\bar{Y}^2} \right\} } = 1 + \left( \frac{1}{n} - \frac{1}{N} \right) \frac{S_y^2}{\bar{Y}^2} = 1 + \left( \frac{1}{n} - \frac{1}{N} \right) C_y^2 .

Hence the theorem.



Example 2.4.1. We wish to estimate the average number of fish in each one of the species groups caught by marine recreational fishermen at the Atlantic and Gulf coasts. There were 69 species caught during 1995, as shown in population 4 of the Appendix. We selected a sample of 20 units by SRSWR sampling. What is the gain in efficiency owed to the Searls' estimator over the sample mean?
Given: S_y^2 = 37199578 and Y = 311528 .
Solution. We are given N = 69 , S_y^2 = 37199578 and Y = 311528 , thus
\bar{Y} = \frac{Y}{N} = \frac{311528}{69} = 4514.898
and
\sigma_y^2 = \frac{(N-1)}{N} S_y^2 = \frac{(69-1)}{69} \times 37199578 = 36660453.68 .

The relative efficiency of the Searls' estimator \bar{y}_{searl} with respect to the usual estimator \bar{y}_n , under SRSWR, is given by
RE = \left[ 1 + \frac{\sigma_y^2}{n\bar{Y}^2} \right] \times 100 = \left[ 1 + \frac{36660453.68}{20 \times 4514.898^2} \right] \times 100 = 108.99\% .
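The arithmetic of this example can be reproduced in a few lines (a hedged sketch, not the book's code):

```python
# Example 2.4.1: percent relative efficiency of Searls' estimator under
# SRSWR, RE = [1 + sigma_y^2 / (n * Ybar^2)] * 100, eq. (2.4.8).
N, n = 69, 20
S2, Y = 37199578, 311528
Ybar = Y / N
sigma2 = (N - 1) / N * S2
RE = (1 + sigma2 / (n * Ybar ** 2)) * 100   # about 108.99
```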

Example 2.4.2. On the bank of a river the height of trees is uniformly distributed with the p.d.f. given by
f(y) = \frac{1}{300} , \quad 200 \le y \le 500 feet.
Find the relative efficiency of Searls' estimator over the usual estimator based on a sample of 5 or 20 units, respectively.

Solution. We are given a = 200 and b = 500 , therefore the population mean
\bar{Y} = \frac{a+b}{2} = \frac{200+500}{2} = 350 feet
and the population variance
\sigma_y^2 = \frac{(b-a)^2}{12} = \frac{(500-200)^2}{12} = 7500 feet^2 .
Thus the relative efficiency of the Searls' estimator over the usual one is given by:

If n = 5 then RE = \left[ 1 + \frac{\sigma_y^2}{n\bar{Y}^2} \right] \times 100 = \left( 1 + \frac{7500}{5 \times 350^2} \right) \times 100 = 101.22\% .

If n = 20 then RE = \left[ 1 + \frac{\sigma_y^2}{n\bar{Y}^2} \right] \times 100 = \left( 1 + \frac{7500}{20 \times 350^2} \right) \times 100 = 100.30\% .

Searls (1967), Reddy (1978a) and Arnholt and Hebert (1995) studied the properties of this estimator and found that it is useful if C_y is large and the sample size is small.

2.5 USE OF DISTINCT UNITS IN THE WR SAMPLE AT THE ESTIMATION STAGE

We shall discuss the problem of estimation of the finite population mean and variance by using only the distinct units from the SRSWR sample. Before going further we shall discuss some results which will be helpful in deriving the results based on the distinct units of an SRSWR sample. Basu (1958) introduced the concept of sufficiency in sampling from finite populations. According to him, for every ordered sample s^o there exists an unordered sample s^{uo} which is obtained from s^o by ignoring the information concerning the order in which the labels occur. The data obtained from the sample s^{uo} can be represented as

d^{uo} = ( y_i : i \in s^{uo} ) .   (2.5.1)

Similarly the data obtained from s^o can be represented as

d^{o} = ( y_i : i \in s^{o} ) .   (2.5.2)

Then the probability of observing the ordered data d^o given the unordered data d^{uo} is

P(d^o \mid d^{uo}) = p(s^o) \Big/ \sum p(s^o)   (2.5.3)

where \sum denotes summation over all those ordered samples s^o which result in the unordered sample s^{uo} . Since the probability P(d^o \mid d^{uo}) is independent of any population parameter, the unordered statistic d^{uo} is a sufficient statistic for any population parameter.

Let us now first state the Rao--Blackwell theorem, which is based on the results of Rao (1945) and Blackwell (1947).

Theorem 2.5.1. Let \hat{\theta}_s^o = \theta(d^o) be an estimator of \theta constructed from the ordered data d^o . Suppose \hat{\theta}_s = E[\hat{\theta}_s^o \mid d^{uo}] ; then:

( a ) E(\hat{\theta}_s) = E(\hat{\theta}_s^o) ;   (2.5.4)

( b ) MSE(\hat{\theta}_s^o) = MSE(\hat{\theta}_s) + E(\hat{\theta}_s^o - \hat{\theta}_s)^2 .   (2.5.5)

Proof. ( a ) We have

E(\hat{\theta}_s) = E\{ E[\hat{\theta}_s^o \mid d^{uo}] \} = \sum_{s^{uo}} \left\{ \sum_{s^o} \hat{\theta}_s^o(d^o) \frac{p(s^o)}{p(s^{uo})} \right\} p(s^{uo}) = \sum_{s^o} \hat{\theta}_s^o(d^o)\, p(s^o) = E(\hat{\theta}_s^o) .

Hence part ( a ) of the theorem.


( b ) We have
MSE(\hat{\theta}_s^o) = E[\hat{\theta}_s^o - \theta]^2 = E[\hat{\theta}_s^o - \hat{\theta}_s + \hat{\theta}_s - \theta]^2
 = E(\hat{\theta}_s^o - \hat{\theta}_s)^2 + E(\hat{\theta}_s - \theta)^2 + 2E[(\hat{\theta}_s^o - \hat{\theta}_s)(\hat{\theta}_s - \theta)]
 = E(\hat{\theta}_s^o - \hat{\theta}_s)^2 + MSE(\hat{\theta}_s) + 0 .
Hence the theorem.

Now we will discuss the problem of estimation of the mean and variance on the basis of the distinct units in the sample. Clearly a unit can be repeated only in WR sampling schemes, hence we are dealing only with the SRSWR sampling scheme. Suppose v denotes the number of distinct units in the sample of size n drawn from the population of N units by using the SRSWR scheme.
The distribution of the number of distinct units in the sample was first derived by Feller (1957) as follows:

P(v = t) = \frac{1}{N^n} \binom{N}{t} \sum_{r=0}^{t-1} (-1)^r \binom{t}{r} (t-r)^n , where t = 1, 2, \ldots, Min.(n, N).   (2.5.6)
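The distribution (2.5.6) can be evaluated directly by inclusion-exclusion; the sketch below also verifies that it sums to one and that its mean matches the closed form E(v) = N[1 - (1 - 1/N)^n] used later in this section.

```python
from math import comb

def feller_pmf(N, n, t):
    # P(v = t) for the number of distinct units in an SRSWR sample of
    # size n from N units (Feller, 1957), eq. (2.5.6).
    s = sum((-1) ** r * comb(t, r) * (t - r) ** n for r in range(t))
    return comb(N, t) * s / N ** n

N, n = 10, 5
pmf = [feller_pmf(N, n, t) for t in range(1, min(N, n) + 1)]
Ev = sum(t * p for t, p in enumerate(pmf, start=1))
# pmf sums to 1, and Ev equals N * (1 - (1 - 1/N) ** n)
```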

2.5.1 ESTIMATION OF MEAN

For estimating the population mean \bar{Y} by using information only from the distinct units we have the following theorems.

Theorem 2.5.1.1. An unbiased estimator of the population mean \bar{Y} is given by

\bar{y}_v = \frac{1}{v} \sum_{i=1}^{v} y_i .   (2.5.1.1)

Proof. Following Raj and Khamis (1958), let E_2 and E_1 be the expected values defined for a given sample (fixed number of distinct units) and over all possible samples, respectively. Then by taking expected values on both sides of (2.5.1.1), we have

E(\bar{y}_v) = E_1 E_2(\bar{y}_v \mid v) = E_1 E_2\!\left( \frac{1}{v} \sum_{i=1}^{v} y_i \right) = E_1\!\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \bar{Y} .   (2.5.1.2)
Hence the theorem.

Theorem 2.5.1.2. The variance of the unbiased estimator \bar{y}_v based on the distinct units is

V(\bar{y}_v) = \left[ E\!\left( \frac{1}{v} \right) - \frac{1}{N} \right] S_y^2 .   (2.5.1.3)

Proof. Suppose V_2 and V_1 denote the variance for the given sample (fixed number of distinct units) and over all possible samples, respectively. Then we have

V(\bar{y}_v) = E_1 V_2(\bar{y}_v \mid v) + V_1 E_2(\bar{y}_v \mid v) = E_1\!\left[ \left( \frac{1}{v} - \frac{1}{N} \right) S_y^2 \right] + V_1(\bar{Y}) = \left[ E\!\left( \frac{1}{v} \right) - \frac{1}{N} \right] S_y^2 .

Hence the theorem.

Corollary 2.5.1.1. Pathak (1961) has shown that

E\!\left( \frac{1}{v} \right) = \frac{1}{N^n} \sum_{j=1}^{N} j^{\,n-1} .   (2.5.1.4)

Thus we have the following theorem.

Theorem 2.5.1.3. The variance of the estimator \bar{y}_v is given by

V(\bar{y}_v) = \left[ \sum_{j=1}^{N-1} j^{\,n-1} \Big/ N^n \right] S_y^2 .   (2.5.1.5)

Proof. We have

V(\bar{y}_v) = \left[ E\!\left( \frac{1}{v} \right) - \frac{1}{N} \right] S_y^2 = \left[ \frac{ \sum_{j=1}^{N} j^{\,n-1} }{N^n} - \frac{1}{N} \right] S_y^2 = \left[ \frac{ \sum_{j=1}^{N} j^{\,n-1} - N^{\,n-1} }{N^n} \right] S_y^2 = \left[ \frac{ \sum_{j=1}^{N-1} j^{\,n-1} }{N^n} \right] S_y^2 .

Hence the theorem.

It is interesting to note that as the sample size n drawn with SRSWR sampling approaches the population size N , the magnitude of the relative efficiency also increases. The reason for the increase in the relative efficiency may be that an increase in sample size also increases the probability of repetition of units in SRSWR sampling.
The relative efficiency under the Feller (1957) distribution is given by

RE = \frac{V(\bar{y}_n)}{V(\bar{y}_v)} = \frac{(N-1)\, N^{\,n-1}}{ n \sum_{j=1}^{N-1} j^{\,n-1} }   (2.5.1.6)

which is free from any population parameter but depends upon the population size and the sample size.

The following table shows the percent relative efficiency of the distinct units based estimator with respect to the estimator based on SRSWR sampling for different values of the sample size n and population size N = 10 .
  Benefit of using distinct units: values of j^{n-1} for N = 10
                            Sample size ( n )
  j        2      3      4      5       6        7        8         9
  1        1      1      1      1       1        1        1         1
  2        2      4      8      16      32       64       128       256
  3        3      9      27     81      243      729      2187      6561
  4        4      16     64     256     1024     4096     16384     65536
  5        5      25     125    625     3125     15625    78125     390625
  6        6      36     216    1296    7776     46656    279936    1679616
  7        7      49     343    2401    16807    117649   823543    5764801
  8        8      64     512    4096    32768    262144   2097152   16777216
  9        9      81     729    6561    59049    531441   4782969   43046721
  Sum      45     285    2025   15333   120825   978405   8080425   67731333

  RE = (N-1) N^{n-1} / [ n \sum_{j=1}^{N-1} j^{n-1} ]

  RE%      100.00 105.26 111.11 117.39  124.15   131.41   139.23    147.64
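The whole table can be regenerated from (2.5.1.6); a short sketch:

```python
def re_distinct(N, n):
    # Percent relative efficiency, eq. (2.5.1.6):
    # RE = 100 * (N-1) * N^(n-1) / (n * sum_{j=1}^{N-1} j^(n-1)).
    return 100 * (N - 1) * N ** (n - 1) / (n * sum(j ** (n - 1) for j in range(1, N)))

row = [round(re_distinct(10, n), 2) for n in range(2, 10)]
# row reproduces the RE% line of the table above
```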

Corollary 2.5.1.2. An approximate expression for V(\bar{y}_v) , valid up to order N^{-2} , is given in Pathak (1962) as

V(\bar{y}_v) = \left[ \frac{1}{n} - \frac{1}{2N} + \frac{(n-1)}{12N^2} \right] S_y^2 .   (2.5.1.7)

Theorem 2.5.1.4. ( a ) Show that an alternative estimator of the population mean based on the distinct units is

\bar{y}_1 = \frac{v}{E(v)} \bar{y}_v + \bar{X}\left( 1 - \frac{v}{E(v)} \right)   (2.5.1.8)

where \bar{X} is a good guess or a priori estimate of the population mean \bar{Y} .

( b ) If there is no a priori information or good guess about the population mean \bar{Y} , then an alternative estimator of the population mean is given by

\bar{y}_2 = \frac{v}{E(v)} \bar{y}_v   (2.5.1.9)

where \bar{y}_v = \frac{1}{v}\sum_{i=1}^{v} y_i is the mean of the distinct units in the sample.

Proof. Let us consider that an estimator of the population mean is given by
\bar{y}_s = f_1(v)\bar{y}_v + f_2(v)   (2.5.1.10)

where I] (v) and h (v) are suitably chosen constants such that ys is an unbiased
estimator of Y and its variance is minimum. Now from the property of
unbiasedness we have
E(ys) = EUi(v)yv + h(v)]= fi(v)Y + h(v)= Y . (2.5.l.l1)
This implies that
h(v)= [1 - fi(v)]Y. (2.5.l.l2)
Evidently the value of h (v) contains the unknown value Y, the exact value of
h (v) is not known unless fi(v)=1, which implies I: (v) = O. Thus we chose
fi(v) = 1, then h (v) = 0 , which means a better estimator of population mean Y
v
would be yy = v-I 2.: Yi . In practical situations, sometimes a priori information or
i=1
knowledge of X (say) is available about population mean Y from past surveys or
pilot surveys . In such situations, the value of h (v) is given by
h(v)= [1- fi(v)]X . (2.5.1.13)
Thus if we will chose h (v) as given in (2.5.l.l3), then the bias in the estimator ys
will be minimum . Unfortunately, I: (v) depends upon the value of II (v) too. The
best method to chose I I (v ) is such that the variance of ys is minimum . Now the
variance of the estimator ys is given by
V(Ys) = E]V2(Y.J + V] E2 (ys) = E]V2Ui(v)yv + h(v)] + V2ElUi (v)Yv + h(v)]

=E{fi 2(v{~- ~ )s;]+ V2[fi(v)y + h(v)]


= E{fi2(v{~ - ~)S;]+ V2[Y] =E{fi2(v{~- ~)s;l (2.5.l.l4)

The variance of Ys will be minimum if E{fi 2(v{ ~ - ~ )s; ] is minimum subject

to the condition E] Ui (v)] = 1. Then by Schwartz inequality, we have

E{fi2(V{~- ~)] ~E{~- ~) . (2.5.l.l5)


In the above inequal ity, the equality sign holds if and only if

fi(v )=U- ~)/EU - ~) = ~(;~;~-:~)J' (2.5.l.l6)


Thus if we have a priori information X about Y , then an optimum estimator of Y
is given by
- (Nv)/(N-v) - X[1 (Nv)/(N- v) ]
Y] = E[(Nv)/(N- v)JYv + - E[(Nv)/(N - v)} . (2.5.l.l7)

If no such information about \bar{Y} is available, then we take \bar{X} = 0 and the above estimator reduces to

\bar{y}_2 = \frac{ (Nv)/(N-v) }{ E[(Nv)/(N-v)] } \bar{y}_v .   (2.5.1.18)

Pathak (1961) has shown that

E\!\left[ \frac{Nv}{N-v} \right] = N^2 \sum_{m=1}^{N} \frac{ P_{12\ldots m} }{ N-m }   (2.5.1.19)

where

P_{12\ldots m} = 1 - \binom{m}{1}\left( 1 - \frac{1}{N} \right)^n + \cdots + (-1)^m \binom{m}{m}\left( 1 - \frac{m}{N} \right)^n for m \le n , and 0 otherwise.   (2.5.1.20)

The relation (2.5.1.20) shows that the estimators \bar{y}_1 and \bar{y}_2 given at (2.5.1.17) and (2.5.1.18) are very difficult to compute for large sample sizes. Now if we ignore the sampling fraction n/N , and hence v/N , then the above estimators, respectively, reduce to

\bar{y}_1 = \frac{v}{E(v)} \bar{y}_v + \bar{X}\left( 1 - \frac{v}{E(v)} \right)   (2.5.1.21)

and

\bar{y}_2 = \frac{v}{E(v)} \bar{y}_v .   (2.5.1.22)

Hence the theorem.

Theorem 2.5.1.5. Show that if the square of the population coefficient of variation C_y^2 = S_y^2/\bar{Y}^2 exceeds (n-1) , then the estimator \bar{y}_2 = (v/E(v))\bar{y}_v is more efficient than \bar{y}_v .

Proof. We know that

V(\bar{y}_v) = \left[ E\!\left( \frac{1}{v} \right) - \frac{1}{N} \right] S_y^2 .   (2.5.1.23)

By the definition of variance we have

V(\bar{y}_2) = E_1 V_2(\bar{y}_2 \mid v) + V_1 E_2(\bar{y}_2 \mid v) = E_1\!\left[ \left\{ \frac{v}{E(v)} \right\}^2 \left( \frac{1}{v} - \frac{1}{N} \right) S_y^2 \right] + V_1\!\left[ \frac{v}{E(v)} \bar{Y} \right]

 = \frac{S_y^2}{\{E(v)\}^2}\left[ E(v) - \frac{E(v^2)}{N} \right] + \frac{\bar{Y}^2}{\{E(v)\}^2} V_1(v) .   (2.5.1.24)

It is very easy to derive that

E(v) = N\left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right]

and

E(v^2) = N\left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right] + N(N-1)\left[ 1 - 2\left( 1 - \frac{1}{N} \right)^n + \left( 1 - \frac{2}{N} \right)^n \right],

therefore

V(v) = E(v^2) - \{E(v)\}^2 = N\left( 1 - \frac{1}{N} \right)^n - N^2\left( 1 - \frac{1}{N} \right)^{2n} + N(N-1)\left( 1 - \frac{2}{N} \right)^n .

From (2.5.1.24), on substituting these moments, we have

V(\bar{y}_2) = \frac{ S_y^2 (N-1)\left[ \left( 1 - \frac{1}{N} \right)^n - \left( 1 - \frac{2}{N} \right)^n \right] }{ N^2\left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right]^2 } + \frac{ \bar{Y}^2\left[ N\left( 1 - \frac{1}{N} \right)^n - N^2\left( 1 - \frac{1}{N} \right)^{2n} + N(N-1)\left( 1 - \frac{2}{N} \right)^n \right] }{ N^2\left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right]^2 } .   (2.5.1.25)

Now from (2.5.1.23) and (2.5.1.25), the difference can be written as

V(\bar{y}_v) - V(\bar{y}_2) = C_1 S_y^2 - C_2 \bar{Y}^2 (say).   (2.5.1.26)

Now the estimator \bar{y}_2 is better than \bar{y}_v if V(\bar{y}_v) - V(\bar{y}_2) > 0 , that is, if ( S_y^2/\bar{Y}^2 ) > ( C_2/C_1 ).
The approximate values of C_1 and C_2 for large populations, correct up to terms of order N^{-2} , are given by

C_1 = \frac{1}{2nN} + \frac{5(n-1)}{12nN^2} and C_2 = \frac{(n-1)}{2nN} - \frac{(n-1)(n-2)}{3nN^2} ,

and thus ( C_2/C_1 ) \approx (n-1) . Hence the theorem.
Theorem 2.5.1.6. If squared error is the loss function, then show that \bar{y}_v is admissible amongst all functions of \bar{y}_v and v .

Proof. Let t = \bar{y}_v + f(\bar{y}_v, v) be a function of \bar{y}_v and v , and suppose that the estimator t is uniformly better than \bar{y}_v . Let R(t) be the risk of the estimator t under squared error loss. Then the estimator t will be uniformly better than the estimator \bar{y}_v if

R(t) \le V(\bar{y}_v) , where V(\bar{y}_v) = E(\bar{y}_v - \bar{Y})^2 .

Also we have

R(t) = E[t - \bar{Y}]^2 = E[\bar{y}_v - \bar{Y} + f(\bar{y}_v, v)]^2 = E(\bar{y}_v - \bar{Y})^2 + E\{f(\bar{y}_v, v)\}^2 + 2E[f(\bar{y}_v, v)(\bar{y}_v - \bar{Y})] .

Thus R(t) \le V(\bar{y}_v) if

E\{f(\bar{y}_v, v)\}^2 + 2E[f(\bar{y}_v, v)(\bar{y}_v - \bar{Y})] \le 0   (2.5.1.27)

holds for all Y_1, Y_2, \ldots, Y_N . In particular, if Y_1 = Y_2 = \cdots = Y_N = C (say), where C is an arbitrarily chosen constant, then \bar{y}_v = C and the relation (2.5.1.27) implies that f(C, v) is zero for every C , which proves the theorem.

2.5.2 ESTIMATION OF FINITE POPULATION VARIANCE

Consider the problem of estimation of

\sigma_y^2 = N^{-1} \sum_{i=1}^{N} ( Y_i - \bar{Y} )^2

using the distinct units in a sample of n units drawn by using SRSWR sampling. The usual estimator of \sigma_y^2 is given by

s_y^2 = (n-1)^{-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = [2n(n-1)]^{-1} \sum_{i \ne j = 1}^{n} ( y_i - y_j )^2 .

If we now construct an estimator based on only the distinct units then we have the following theorem.

Theorem 2.5.2.1. A uniformly better estimator of \sigma_y^2 than s_y^2 is given by

s_v^2 = \left[ 1 - \frac{ C_v(n-1) }{ C_v(n) } \right] s_d^2   (2.5.2.1)

where

s_d^2 = (v-1)^{-1} \sum_{i=1}^{v} (y_i - \bar{y}_v)^2 if v > 1 , and 0 otherwise,   (2.5.2.2)

and

C_v(n) = v^n - \binom{v}{1}(v-1)^n + \cdots + (-1)^{v-1} \binom{v}{v-1} \, 1^n .   (2.5.2.3)

Proof. Suppose we have any convex loss function and let T be the unordered sufficient statistic. Then by the Rao--Blackwell theorem we have

E[ s_y^2 \mid T ] = E\!\left[ \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \,\Big|\, T \right] = E\!\left[ \frac{1}{2n(n-1)} \sum_{i \ne j = 1}^{n} ( y_i - y_j )^2 \,\Big|\, T \right] = E\!\left[ \tfrac{1}{2}( y_1 - y_2 )^2 \,\Big|\, T \right].   (2.5.2.4)

To prove that the estimator at (2.5.2.1) is uniformly better than s_y^2 , let us consider the following cases.

If v = 1 , i.e., only one distinct unit has been selected in the sample, then (2.5.2.4) is obviously zero. Suppose \sum_{I} and \sum_{II} denote the summations over all integral values of a(i) such that the following equalities hold:

\sum_{i=1}^{v} a(i) = n , a(i) > 0 for i = 1, 2, \ldots, v ; and \sum_{j=1}^{v} a(j) = (n-2) with a(j) \ge 0 , a(j') \ge 0 and a(k) > 0 for k \ne j \ne j' = 1, 2, \ldots, v .

Now if v > 1 then we have

P[ x_1 = x_{(i)}, x_2 = x_{(j)} \mid T ] = \frac{ \sum_{II} \frac{(n-2)!}{a(1)!\,a(2)! \cdots a(v)!} \left( \frac{1}{N} \right)^{a(1)} \cdots \left( \frac{1}{N} \right)^{a(v)} }{ \sum_{I} \frac{n!}{a(1)!\,a(2)! \cdots a(v)!} \left( \frac{1}{N} \right)^{a(1)} \cdots \left( \frac{1}{N} \right)^{a(v)} }.   (2.5.2.5)

Pathak (1961) has shown the mathematical relations:

\sum_{I} \frac{n!}{a(1)!\,a(2)! \cdots a(v)!} = C_v(n)   (2.5.2.6)

and

\sum_{II} \frac{(n-2)!}{a(1)!\,a(2)! \cdots a(v)!} = \frac{ C_v(n) - C_v(n-1) }{ v(v-1) }.   (2.5.2.7)

Therefore we have

P[ x_1 = x_{(i)}, x_2 = x_{(j)} \mid T ] = \frac{ C_v(n) - C_v(n-1) }{ v(v-1)\, C_v(n) } , i \ne j = 1, 2, \ldots, v .   (2.5.2.8)

Now if v > 1 then we have

E\!\left[ \tfrac{1}{2}( y_1 - y_2 )^2 \,\Big|\, T \right] = \sum_{i \ne j = 1}^{v} \frac{ ( y_{(i)} - y_{(j)} )^2 }{2}\, P[ x_1 = x_{(i)}, x_2 = x_{(j)} \mid T ].

On substituting the value of P[ x_1 = x_{(i)}, x_2 = x_{(j)} \mid T ] from (2.5.2.8), we obtain

E\!\left[ \tfrac{1}{2}( y_1 - y_2 )^2 \,\Big|\, T \right] = \frac{ C_v(n) - C_v(n-1) }{ C_v(n) } \cdot \frac{1}{2v(v-1)} \sum_{i \ne j = 1}^{v} ( y_{(i)} - y_{(j)} )^2 = \frac{ C_v(n) - C_v(n-1) }{ C_v(n) }\, s_d^2 .

Hence the theorem.

Example 2.5.2.1. We selected an SRSWR sample of 20 units from population 1 by using the 3rd and 4th columns of the Pseudo-Random Numbers (PRN) given in Table 1 in the Appendix. The 20 states corresponding to the serial numbers 29, 14, 47, 22, 42, 23, 48, 06, 07, 42, 21, 31, 31, 36, 16, 27, 10, 18, 26 and 48 were selected in the sample. Later on we observed that the states at serial numbers 42, 31 and 48 had been selected more than once. We reduced our sample size by keeping only 17 states in the sample and collected the information about the real estate farm loans in these states. The data so collected are given below:

  Serial   State   Real estate farm loans
  29       NH         6.044
  14       IN      1213.024
  47       WA      1100.745
  22       MI       323.028
  42       TN       553.266
  23       MN      1354.768
  48       WV        99.277
  06       CO       315.809
  07       CT         7.130
  21       MA         7.590
  31       NM       140.582
  36       OK       612.108
  16       KS      1049.834
  27       NE      1337.852
  10       GA       939.460
  18       LA       282.565
  26       MT       292.965
( a ) Estimate the average real estate farm loans in the United States using
information from distinct units only.
( b ) Estimate the finite population variance of the real estate loans in the US using
information from distinct units only.
( c ) Estimate the average real estate loans and its finite population variance by
including repeated units in the sample . Comment on the results .
Solution. Here n = 20 and v = 17 , and on the basis of the information from the distinct units, we have

  Serial   State   y_i         (y_i - \bar{y}_v)   (y_i - \bar{y}_v)^2
  29       NH         6.044     -560.7820           314476.8000
  14       IN      1213.024      646.1977           417571.5000
  47       WA      1100.745      533.9187           285069.2000
  22       MI       323.028     -243.7980            59437.6100
  42       TN       553.266      -13.5603              183.8816
  23       MN      1354.768      787.9417           620852.1000
  48       WV        99.277     -467.5490           218602.3000
  06       CO       315.809     -251.0170            63009.6800
  07       CT         7.130     -559.6960           313259.9000
  21       MA         7.590     -559.2360           312745.2000
  31       NM       140.582     -426.2440           181684.2000
  36       OK       612.108       45.2817             2050.4330
  16       KS      1049.834      483.0077           233296.4000
  27       NE      1337.852      771.0257           594480.6000
  10       GA       939.460      372.6337           138855.9000
  18       LA       282.565     -284.2610            80804.4800
  26       MT       292.965     -273.8610            75000.0100
  Sum              9636.047        0.0000           3911380.0000
( a ) An unbiased estimate of the average real estate farm loans in the United States is given by

\bar{y}_v = \frac{1}{v} \sum_{i=1}^{v} y_i = \frac{1}{17} \sum_{i=1}^{17} y_i = \frac{9636.047}{17} = 566.826 .

( b ) An estimator of the finite population variance \sigma_y^2 based on the information from the distinct units is given by

s_v^2 = \left[ 1 - \frac{ C_v(n-1) }{ C_v(n) } \right] s_d^2 .

Now

s_d^2 = \frac{1}{v-1} \sum_{i=1}^{v} ( y_i - \bar{y}_v )^2 = \frac{3911380}{17-1} = 244461.25 ,

and

C_v(n) = v^n - \binom{v}{1}(v-1)^n + \cdots + (-1)^{v-1} \binom{v}{v-1} \, 1^n .

C_v(n) = C_{17}(20) = 17^{20} - \binom{17}{1} 16^{20} + \binom{17}{2} 15^{20} - \binom{17}{3} 14^{20} + \binom{17}{4} 13^{20}
 - \binom{17}{5} 12^{20} + \binom{17}{6} 11^{20} - \binom{17}{7} 10^{20} + \binom{17}{8} 9^{20}
 - \binom{17}{9} 8^{20} + \binom{17}{10} 7^{20} - \binom{17}{11} 6^{20} + \binom{17}{12} 5^{20}
 - \binom{17}{13} 4^{20} + \binom{17}{14} 3^{20} - \binom{17}{15} 2^{20} + \binom{17}{16} 1^{20}
 = 2.6366 \times 10^{20} ,

and

C_v(n-1) = C_{17}(19) = 17^{19} - \binom{17}{1} 16^{19} + \binom{17}{2} 15^{19} - \binom{17}{3} 14^{19} + \binom{17}{4} 13^{19}
 - \binom{17}{5} 12^{19} + \binom{17}{6} 11^{19} - \binom{17}{7} 10^{19} + \binom{17}{8} 9^{19}
 - \binom{17}{9} 8^{19} + \binom{17}{10} 7^{19} - \binom{17}{11} 6^{19} + \binom{17}{12} 5^{19}
 - \binom{17}{13} 4^{19} + \binom{17}{14} 3^{19} - \binom{17}{15} 2^{19} + \binom{17}{16} 1^{19}
 = 4.4805 \times 10^{18} .

Hence an estimate of the finite population variance is given by

s_v^2 = \left( 1 - \frac{ 4.4805 \times 10^{18} }{ 2.6366 \times 10^{20} } \right) \times 244461.25 = 240307.004 .
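The inclusion-exclusion sums C_v(n) are tedious by hand but easy to verify numerically; a sketch:

```python
from math import comb

def C(v, n):
    # C_v(n) of eq. (2.5.2.3), computed by inclusion-exclusion.
    return sum((-1) ** r * comb(v, r) * (v - r) ** n for r in range(v))

Cn, Cn1 = C(17, 20), C(17, 19)            # about 2.6366e20 and 4.4805e18
s_v2 = (1 - Cn1 / Cn) * 244461.25         # the shrinkage of s_d^2 in (2.5.2.1)
```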

( c ) From the sample information including repeated units, we have

Random number   State   Real estate farm loans, y_i   (y_i − ȳ_n)   (y_i − ȳ_n)²
     29          NH              6.044                 -515.4150     265652.200
     14          IN           1213.024                  691.5654     478262.700
     47          WA           1100.745                  579.2864     335572.700
     22          MI            323.028                 -198.4310      39374.700
     42          TN            553.266                   31.8074       1011.711
     23          MN           1354.768                  833.3094     694404.600
     48          WV             99.277                 -422.1820     178237.300
     06          CO            315.809                 -205.6500      42291.760
     07          CT              7.130                 -514.3290     264533.900
     21          MA              7.590                 -513.8690     264060.900
     31          NM            140.582                 -380.8770     145067.000
     36          OK            612.108                   90.6494       8217.314
     16          KS           1049.834                  528.3754     279180.600
     27          NE           1337.852                  816.3934     666498.200
     10          GA            939.460                  418.0014     174725.200
     18          LA            282.565                 -238.8940      57070.150
     26          MT            292.965                 -228.4940      52209.330
     42          TN            553.266                   31.8074       1011.711
     31          NM            140.582                 -380.8770     145067.000
     48          WV             99.277                 -422.1820     178237.300
    Sum                      10429.172                    0.000     4270686.000

If the repeated units are included in the sample, then an estimate of the average real estate farm loans in the United States is given by

ȳ_n = (1/n) Σ_{i=1}^{n} y_i = 10429.172/20 = 521.4586

and an estimate of the finite population variance is given by

s²_y = (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ_n)² = 4270686/(20 − 1) = 224772.9474.

Clearly the estimates of the average and of the finite population variance remain underestimates if repeated units are included in the sample. For details on distinct units one can refer to Raj and Khamis (1958), Pathak (1962), and Pathak (1966). Some comparisons of the SRSWR and SRSWOR sampling schemes have also been considered by several other researchers, viz., Deshpande (1980), Ramakrishnan (1969), Seth and Rao (1964), and Basu (1958).
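The column sums and the two estimates above can be checked directly from the 20 loan values in the table:

```python
from math import isclose

# The 20 real estate farm loan values (y_i) listed in the table,
# repeated units included.
y = [6.044, 1213.024, 1100.745, 323.028, 553.266, 1354.768, 99.277,
     315.809, 7.130, 7.590, 140.582, 612.108, 1049.834, 1337.852,
     939.460, 282.565, 292.965, 553.266, 140.582, 99.277]
n = len(y)
ybar = sum(y) / n                               # 521.4586
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)  # about 224772.95
print(round(ybar, 4), round(s2, 4))
```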
118 Advanced sampling theory with applications

2.6 ESTIMATION OF A POPULATION TOTAL AND MEAN FOR A DOMAIN
Sometimes we are interested in estimating the total or mean value of a variable of interest within a subgroup or part of the population. Such a part or subgroup of a population is called the domain of interest. For example, in a statewide survey, a district may be considered as a domain. After completing the survey sampling process for the whole population, one may be interested in estimating the mean or total of a particular subgroup of the population. That is, we are interested in estimating population parameters of a subgroup of a population. For example:

Population                  Domain
United States population    Employed
New York                    Unemployed
Retailers                   Supermarkets
All workers in a firm       Part time workers

Let D be the domain of interest and N_D be the number of units in this domain. Let Y_D = Σ_{i=1}^{N_D} Y_i and Ȳ_D = (1/N_D) Σ_{i=1}^{N_D} Y_i be the total and mean for the domain D, respectively. Suppose we select an SRSWOR sample s of n units from the entire population Ω, and n_D ≤ n units out of the selected units are from the domain D of interest. In certain situations the value of N_D is known and in other situations the value of N_D is unknown. We shall discuss both situations as follows. Define a variable

Y_i* = { Y_i  if i ∈ D,
       { 0    if i ∉ D.        (2.6.1)

Then we have Σ_{i=1}^{N} Y_i* = Y_D = Y* (say).

Case I. When N_D is unknown. Then we have the following theorems.

Theorem 2.6.1. Under SRS sampling an unbiased estimator of the total Y_D of the subgroup (or domain) D is given by

Ŷ_D = (N/n) Σ_{i=1}^{n} y_i*.        (2.6.2)

Proof. Taking expected values on both sides of (2.6.2), we have

E(Ŷ_D) = E[(N/n) Σ_{i=1}^{n} y_i*] = (N/n) Σ_{i=1}^{n} E(y_i*) = (N/n) Σ_{i=1}^{n} [Σ_{j=1}^{N} Y_j* Pr(y_i* = Y_j*)]
       = (N/n) Σ_{i=1}^{n} (1/N) Σ_{j=1}^{N} Y_j* = (N/n)(n/N) Σ_{j=1}^{N} Y_j* = Σ_{j=1}^{N} Y_j* = Σ_{j=1}^{N_D} Y_j = Y_D.

Hence the theorem.

Theorem 2.6.2. The variance of the estimator Ŷ_D under SRSWOR sampling is

V(Ŷ_D) = (N²(1 − f)/n) S_D*²,  where S_D*² = (1/(N−1)) [Σ_{i=1}^{N} Y_i*² − N^{−1}(Σ_{i=1}^{N} Y_i*)²].        (2.6.3)

Proof. Obvious from the results of SRSWOR sampling.

Theorem 2.6.3. An unbiased estimator of V(Ŷ_D) under SRSWOR sampling is given by

v(Ŷ_D) = (N²(1 − f)/n) s_D*²,  where s_D*² = (1/(n−1)) [Σ_{i=1}^{n} y_i*² − n^{−1}(Σ_{i=1}^{n} y_i*)²].        (2.6.4)

Proof. Obvious.
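Theorems 2.6.1 and 2.6.2 can be verified exactly on a toy population by enumerating all C(N, n) equally likely SRSWOR samples. The population values and domain below are hypothetical, chosen only for illustration:

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical toy population of N = 6 units; the first three form domain D.
Y = [12.0, 7.0, 9.0, 4.0, 15.0, 6.0]
D = {0, 1, 2}
N, n = len(Y), 3
Ystar = [y if i in D else 0.0 for i, y in enumerate(Y)]  # Y*_i of (2.6.1)
YD = sum(Ystar)                                          # domain total Y_D

# Apply (2.6.2) to every SRSWOR sample: YD_hat = (N/n) * sum of y*_i.
ests = [N / n * sum(Ystar[i] for i in s)
        for s in combinations(range(N), n)]

SstarD2 = (sum(v * v for v in Ystar) - sum(Ystar) ** 2 / N) / (N - 1)
V = N ** 2 * (1 - n / N) / n * SstarD2                   # formula (2.6.3)
print(mean(ests), YD)        # unbiased: both equal 28.0
print(pvariance(ests), V)    # design variance matches (2.6.3)
```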

Case II. When N_D is known. Here we need the following lemma.

Lemma 2.6.1. If i ∈ s_D indicates that the i-th unit is in the sampled subgroup of D, then we have

Pr(i ∈ s_D | n_D > 0) = n_D/N_D.        (2.6.5)

Proof. We have

Pr(i ∈ s_D | n_D > 0) = Pr(i ∈ s_D, n_D)/Pr(n_D)
  = (Number of samples of size n_D with i ∈ s_D)/(Number of samples of size n_D)
  = [Number of ways (n_D − 1) units can be chosen from (N_D − 1) and (n − n_D) from (N − N_D)] / [Number of ways n_D units can be chosen from N_D and (n − n_D) from (N − N_D)]
  = C(N_D − 1, n_D − 1) C(N − N_D, n − n_D) / [C(N_D, n_D) C(N − N_D, n − n_D)] = n_D/N_D.

Then we have the following theorems.

Theorem 2.6.4. Under SRS sampling a biased estimator of Y_D is given by

Ŷ_D = { (N_D/n_D) Σ_{i=1}^{n} y_i*  if n_D > 0,
      { 0                           otherwise.        (2.6.6)

Its relative bias is equal to the negative of the probability that n_D is equal to zero.

Proof. Here n_D is not fixed before taking the sample and therefore is a random variable. Thus we have

E(Ŷ_D) = E₁[E₂(Ŷ_D | n_D)].

Now when n_D > 0 we have

E₂(Ŷ_D | n_D > 0) = E₂[(N_D/n_D) Σ_{i=1}^{n} y_i* | n_D > 0]
  = E₂[(N_D/n_D) Σ_{i∈s_D} y_i | n_D > 0], where s_D indicates the sampled subgroup of D,
  = E₂[(N_D/n_D) Σ_{i∈D} I_i Y_i | n_D > 0], where I_i = 1 if i ∈ s_D and 0 otherwise,
  = (N_D/n_D) [Σ_{i∈D} E₂(I_i | n_D > 0) Y_i] = (N_D/n_D) [Σ_{i∈D} Y_i Pr(i ∈ s_D | n_D > 0)]
  = (N_D/n_D) [Σ_{i∈D} Y_i × (n_D/N_D)] = Σ_{i∈D} Y_i = Y_D.

Therefore

E₂(Ŷ_D | n_D > 0) = Y_D  and  E₂(Ŷ_D | n_D = 0) = 0.

Thus we have

E(Ŷ_D) = E₁[E₂(Ŷ_D | n_D)] = Y_D Pr(n_D > 0) + 0 × Pr(n_D = 0) = Y_D Pr(n_D > 0).

Thus the relative bias in the estimator Ŷ_D, when N_D is known, is given by

RB(Ŷ_D) = [E(Ŷ_D) − Y_D]/Y_D = [Y_D Pr(n_D > 0) − Y_D]/Y_D = Pr(n_D > 0) − 1 = −Pr(n_D = 0).

Hence the theorem.
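The relative-bias result of Theorem 2.6.4 can also be checked exactly by enumerating all SRSWOR samples of a toy population (the values and domain below are hypothetical):

```python
from itertools import combinations

# Hypothetical toy population: N = 6, domain D of size N_D = 2, n = 2.
Y = [12.0, 7.0, 9.0, 4.0, 15.0, 6.0]
D = {0, 1}
N, n, ND = len(Y), 2, len(D)
YD = sum(Y[i] for i in D)

samples = list(combinations(range(N), n))

def est(s):
    """Estimator (2.6.6): (N_D/n_D) * domain sample total, or 0 if n_D = 0."""
    sD = [i for i in s if i in D]
    return ND / len(sD) * sum(Y[i] for i in sD) if sD else 0.0

EY = sum(est(s) for s in samples) / len(samples)
p0 = sum(1 for s in samples if not set(s) & D) / len(samples)  # Pr(n_D = 0)
rel_bias = (EY - YD) / YD
print(rel_bias, -p0)   # equal, as Theorem 2.6.4 states
```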

To find the variance of the estimator Ŷ_D we need the following lemma.

Lemma 2.6.2. Show that

E(1/n_D | n_D > 0) ≈ 1/(nP_D) + (1 − P_D)/(n²P_D²), where P_D = N_D/N.        (2.6.7)

Proof. We have

1/n_D = 1/[n(n_D/n)] = 1/[n(n_D/n − P_D) + nP_D] = (1/(nP_D)) [1 + (n_D/n − P_D)/P_D]^{−1}
      ≈ (1/(nP_D)) [1 − (n_D/n − P_D)/P_D + (n_D/n − P_D)²/P_D²].

Taking expected values on both sides and using E(n_D/n) ≈ P_D and E(n_D/n − P_D)² ≈ P_D(1 − P_D)/n, we have

E(1/n_D | n_D > 0) ≈ (1/(nP_D)) [1 + P_D(1 − P_D)/(nP_D²)] = 1/(nP_D) + (1 − P_D)/(n²P_D²).        (2.6.8)

Hence the lemma. For more details about the expected values of an inverse random variable one can refer to Stephen (1945).

Theorem 2.6.5. Show that the variance of the estimator Ŷ_D, when N_D is known, is

V(Ŷ_D) ≈ Pr(n_D > 0) [{P_D N²(1 − f)/n + (N²/n²)(1 − P_D)} S_D² + Pr(n_D = 0) Y_D²].        (2.6.9)

Proof. We have

V(Ŷ_D) = E₁[V₂(Ŷ_D | n_D)] + V₁[E₂(Ŷ_D | n_D)].        (2.6.10)

Now

V₂(Ŷ_D | n_D) = { (N_D²/n_D)(1 − n_D/N_D) S_D²  if n_D > 0,      and   E₂(Ŷ_D | n_D) = { Y_D  if n_D > 0,
              { 0                              if n_D = 0,                          { 0    if n_D = 0.

From (2.6.10) we have

V(Ŷ_D) = E₁[(N_D²/n_D)(1 − n_D/N_D) S_D² I(n_D > 0)] + V₁[Y_D I(n_D > 0)].        (2.6.11)

Now

V₁[E₂(Ŷ_D | n_D)] = V₁[Y_D I(n_D > 0)] = Y_D² V₁[I(n_D > 0)] = Y_D² Pr(n_D > 0)[1 − Pr(n_D > 0)]
                  = Y_D² Pr(n_D > 0) Pr(n_D = 0),        (2.6.12)

and

E₁[V₂(Ŷ_D | n_D)] = Σ_{j=1}^{min(N_D, n)} Pr(n_D = j)(N_D²/j)(1 − j/N_D) S_D²

  = Pr(n_D > 0) Σ_{j=1}^{min(N_D, n)} [Pr(n_D = j)/Pr(n_D > 0)] (N_D²/j)(1 − j/N_D) S_D²

  = Pr(n_D > 0) Σ_{j=1}^{min(N_D, n)} Pr(n_D = j | n_D > 0)(N_D²/j)(1 − j/N_D) S_D²

  = Pr(n_D > 0) E[N_D²(1/n_D − 1/N_D) S_D² | n_D > 0]

  = N_D² S_D² [E(1/n_D | n_D > 0) − 1/N_D] Pr(n_D > 0)

  ≈ N_D² S_D² [1/(nP_D) + (1 − P_D)/(n²P_D²) − 1/N_D] Pr(n_D > 0), where P_D = N_D/N,

  = {P_D N²(1 − f)/n + (N²/n²)(1 − P_D)} S_D² Pr(n_D > 0).        (2.6.13)

On using (2.6.12) and (2.6.13) in (2.6.11) we have the required result.

Hence the theorem.

Theorem 2.6.6. An estimator for estimating V(Ŷ_D) is given by

v(Ŷ_D) ≈ {p̂_D N²(1 − f)/n + (N²/n²)(1 − p̂_D)} s_D²,        (2.6.14)

where

p̂_D = n_D/n  and  s_D² = (1/(n_D − 1)) [Σ_{i=1}^{n} y_i*² − n_D^{−1}(Σ_{i=1}^{n} y_i*)²].

Proof. Obvious by the method of moments.



2.7 DEALING WITH A RARE ATTRIBUTE USING INVERSE SAMPLING

The estimation of the proportion of some rare type of gene, or of the acreage under some special type of plant, cannot be done with the help of the direct binomial distribution. The problem is that if we select a sample of size n by SRSWR or SRSWOR sampling, then the observed number of genes of the particular type of interest in the selected sample will be zero owing to its rarity. Thus the traditional SRSWR or SRSWOR sampling schemes cannot be used to estimate the proportion of a rare attribute in survey sampling. One possible solution is Inverse Sampling. Inverse Sampling, a technique in which sampling is continued until a predetermined number of units possessing the attribute occurs in the sample, is useful in estimating the proportion of a rare attribute. Inverse sampling can be done either by SRSWR or by SRSWOR sampling. If the inverse sampling is done by SRSWR sampling, then the total number of trials n (say) needed to get the predetermined number of units m (say) possessing the attribute A (say) follows the Negative Binomial distribution. It is also called the Binomial waiting-time distribution or Pascal distribution. If the inverse sampling is done by SRSWOR sampling, then the total number of trials n (say) needed to get the predetermined number of units m (say) possessing the attribute A (say) follows the Negative Hypergeometric distribution. Figure 2.7.1 is devoted to differentiating between the Negative Binomial and Negative Hypergeometric distributions.

Fig. 2.7.1 Pictorial representation of the Inverse Sampling.



Thus we have the following situations.

Case I. SRSWOR Inverse Sampling: Let N be the population size, P be the proportion of the rare attribute of interest, and Q = 1 − P. In the case of Inverse Sampling the sample size n required to attain m is a random variable, and its probability distribution under SRSWOR sampling is given by

P(n = l) = [C(NP, m−1) C(NQ, l−m) / C(N, l−1)] × (NP − m + 1)/(N − l + 1),  l = m, m+1, ..., m + NQ.        (2.7.1)

Such a distribution is called the negative hypergeometric distribution, and we have Σ_{l=m}^{m+NQ} P(n = l) = 1. Then we have the following theorem.

Theorem 2.7.1. An unbiased estimator of the proportion of the rare attribute P is given by

p̂ = (m − 1)/(n − 1).        (2.7.2)

Proof. We have

E(p̂) = E[(m − 1)/(n − 1)] = Σ_{l=m}^{m+NQ} [(m − 1)/(l − 1)] P(n = l)

 = Σ_{l=m}^{m+NQ} [(m − 1)/(l − 1)] [(NP − m + 1)/(N − l + 1)] C(NP, m−1) C(NQ, l−m) / C(N, l−1)

 = P Σ_{l=m}^{m+NQ} [(NP − m + 1)/(N − l + 1)] C(NP−1, m−2) C(NQ, l−m) / C(N−1, l−2) = P.

Hence the theorem.
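Both the fact that (2.7.1) sums to one and the unbiasedness of p̂ in Theorem 2.7.1 can be checked exactly for a small (hypothetical) population:

```python
from math import comb

# Hypothetical small population: N = 10 units, NP = 4 with the attribute,
# sampling WOR until m = 2 attribute units are observed.
N, NP, m = 10, 4, 2
NQ, P = N - NP, NP / N

def pmf(l):
    """Negative hypergeometric probability (2.7.1): the m-th attribute
    unit appears on draw l under SRSWOR."""
    return (comb(NP, m - 1) * comb(NQ, l - m) / comb(N, l - 1)
            * (NP - m + 1) / (N - l + 1))

support = range(m, m + NQ + 1)
total = sum(pmf(l) for l in support)
Ep = sum((m - 1) / (l - 1) * pmf(l) for l in support)
print(total, Ep)   # 1.0 and P = 0.4
```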

Theorem 2.7.2. An estimator for estimating the variance of p̂ is given by

v(p̂) = [(m − 1)/(n − 1)] [(m − 1)/(n − 1) − (N − 1)(m − 2)/(N(n − 2)) − 1/N].        (2.7.3)

Proof. We have

E[(m − 1)(m − 2)/((n − 1)(n − 2))] = Σ_{l=m}^{m+NQ} [(m − 1)(m − 2)/((l − 1)(l − 2))] P(n = l)

 = [P(NP − 1)/(N − 1)] Σ_{l=m}^{m+NQ} [(NP − m + 1)/(N − l + 1)] C(NP−2, m−3) C(NQ, l−m) / C(N−2, l−3)

 = NP²/(N − 1) − P/(N − 1).

By the method of moments an estimator of P² is given by

(P²)-hat = (N − 1)(m − 1)(m − 2)/(N(n − 1)(n − 2)) + (m − 1)/(N(n − 1)).

The variance of the estimator p̂ is given by

V(p̂) = E(p̂²) − {E(p̂)}².

By the method of moments, an unbiased estimator of V(p̂) is given by

v(p̂) = p̂² − (P²)-hat = (m − 1)²/(n − 1)² − (N − 1)(m − 1)(m − 2)/(N(n − 1)(n − 2)) − (m − 1)/(N(n − 1))

 = [(m − 1)/(n − 1)] [(m − 1)/(n − 1) − (N − 1)(m − 2)/(N(n − 2)) − 1/N].

Hence the theorem.
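The unbiasedness of the variance estimator (2.7.3) can also be checked exactly (for m ≥ 3, so that no denominator vanishes). The population below is hypothetical:

```python
from math import comb

# Hypothetical population: N = 10, NP = 6 attribute units, stop at m = 3.
N, NP, m = 10, 6, 3
NQ, P = N - NP, NP / N

def pmf(l):
    """Negative hypergeometric probability (2.7.1)."""
    return (comb(NP, m - 1) * comb(NQ, l - m) / comb(N, l - 1)
            * (NP - m + 1) / (N - l + 1))

def vhat(l):
    """The variance estimator (2.7.3) evaluated at n = l."""
    p = (m - 1) / (l - 1)
    return p * (p - (N - 1) * (m - 2) / (N * (l - 2)) - 1 / N)

support = range(m, m + NQ + 1)
Vp = sum(((m - 1) / (l - 1)) ** 2 * pmf(l) for l in support) - P ** 2
Ev = sum(vhat(l) * pmf(l) for l in support)
print(Vp, Ev)   # equal: E[v(p_hat)] = V(p_hat)
```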

Case II. SRSWR Inverse Sampling: In the case of large N, the negative hypergeometric probability distribution of the total sample size n becomes the negative binomial distribution and is given by

P(n; m, P) = C(n−1, m−1) P^m Q^{n−m}  for n = m, m+1, m+2, ....

Then we have the following theorem.

Theorem 2.7.3. An unbiased estimator of the required proportion P of a rare attribute is

p̂ = (m − 1)/(n − 1)        (2.7.4)

and an estimator of V(p̂) is given by

v(p̂) = p̂(1 − p̂)/(n − 2).        (2.7.5)

Proof. Obvious for large N from the previous theorems.
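A Monte Carlo sketch of SRSWR inverse sampling: we keep drawing until m attribute units have been observed, so the draw count n is negative binomial, and the average of p̂ = (m − 1)/(n − 1) over many repetitions should be close to P. The values P = 0.3 and m = 5 are hypothetical:

```python
import random

random.seed(1)
P, m, reps = 0.3, 5, 20000

def draws_until_m_successes():
    """Simulate SRSWR inverse sampling: count draws until m successes."""
    hits = n = 0
    while hits < m:
        n += 1
        hits += random.random() < P   # one WR draw; success w.p. P
    return n

phats = [(m - 1) / (draws_until_m_successes() - 1) for _ in range(reps)]
print(sum(phats) / reps)   # close to P = 0.3
```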

2.8 CONTROLLED SAMPLING

While using the simple random sampling and without replacement (SRSWOR) design, the number of possible samples, C(N, n), is very large, even for moderate sample and population sizes. For example:

Number of samples C(N, n)
  n          N = 30            N = 40
  5          142,506           658,008
 10       30,045,015       847,660,528
 15      155,117,520    40,225,345,056
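Each entry in the table is just a binomial coefficient and can be checked directly:

```python
from math import comb

# The number of SRSWOR samples of size n from N units is C(N, n).
for n in (5, 10, 15):
    print(n, comb(30, n), comb(40, n))
```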

Sometimes in field surveys all the possible samples are not equally preferable from the operational point of view, because a few of them may be inaccessible, expensive, or inconvenient, etc. It is therefore advantageous if the sampling design is such that the total number of possible samples is much less than C(N, n), while retaining the unbiasedness properties of the sample mean and sample variance for their respective population parameters. Two early landmarks are Neyman's (1923) notation for causal effects in randomized experiments and Fisher's (1925) proposal to actually randomize treatments to units. Neyman (1923) appears to have been the first to provide a mathematical analysis for a randomized experiment with explicit notation for the potential outcomes, implicitly making the stability assumption. This notation became standard for work in randomized experiments (e.g., Pitman, 1937; Welch, 1937; McCarthy, 1939; Anscombe, 1948; Kempthorne, 1952; Brillinger, Jones, and Tukey, 1978; Hodges and Lehmann, 1970, and dozens of other places, often assuming constant treatment effects as in Cox, 1958, and sometimes being used quite informally as in Freedman, Pisani, and Purves, 1978). Neyman's formalism was a major advance because it allowed explicit probabilistic inferences to be drawn from data, where the probabilities were explicitly defined by the randomized assignment mechanism. Independently and nearly simultaneously, Fisher (1925) invented a somewhat different method of inference for randomized experiments, also based on the special class of randomized assignment mechanisms. Fisher's test and the resulting 'significance levels' (i.e., p values) remain the accepted rigorous standard for the analysis of randomized clinical trials at the end of the twentieth century, the so-called 'intent to treat' analyses. The notion of the central role of randomized experiments seems to have been 'in the air' in the 1920s, but Fisher was the first to combine physical randomization with a theoretical analysis tied to it. A review on randomization is also available by Fienberg and Tanur (1987). These ideas were primarily associated with the notion of fairness and objectivity in the earlier work. The role of the International Statistical Institute in the earlier work related to sample surveys was reviewed by Smith and Sugden (1985). Fienberg and Tanur (1987) explored some of the developments following from this earlier pioneering work, with an emphasis on the parallels between the methodologies in the design of experiments and the design of sample surveys. Chakrabarti (1963) initiated the idea that the results on the existence and construction of balanced sampling designs can easily be translated into the language of design theory by using the correspondence between sampling designs and block designs. Bellhouse (1984a) also worked on these lines and has shown that a systematic application of the treatments minimises the variance of the treatment constant averaged over the applications of the treatment. The lack of cross references in the review papers by Cox (1984) and Smith (1984) suggested that the specialisation extends even to compartmentalisation within the minds and professional lives of outstanding investigators, for both these authors have been steeped in the tradition of parallels.
For example, consider a balanced incomplete block design (BIBD) with standard parameters (b, v, r, k, λ), where v denotes the number of varieties, b the number of blocks, k the block size, r the number of times each treatment occurs, and λ the number of times any pair of treatments occurs together in a block. In practice b < C(v, k). Each treatment represents a population unit, with v = N, and each block a sample with k = n. Thus each unit occurs in r samples and each pair of units in λ samples. Choose each sample with probability 1/b and, under such designs, define the indicator variables

I_si = { 1 if i ∈ s,      such that E(I_si) = r/b,
       { 0 if i ∉ s,

and

I_sij = { 1 if i, j ∈ s,  such that E(I_sij) = λ/b.
        { 0 otherwise,

The sample mean

ȳ = (1/n) Σ_{i=1}^{n} y_i

can be written as

ȳ = (1/n) Σ_{i=1}^{N} Y_i I_si

such that

E(ȳ) = E[(1/n) Σ_{i=1}^{N} Y_i I_si] = (1/n) Σ_{i=1}^{N} Y_i E(I_si) = (1/n) Σ_{i=1}^{N} Y_i (r/b) = (r/(nb)) Σ_{i=1}^{N} Y_i = (1/N) Σ_{i=1}^{N} Y_i = Ȳ

because vr = bk.

Similarly, using r(k − 1) = λ(v − 1) and bk = vr, we have

E(Σ_{i≠j∈s} y_i y_j) = E(Σ_{i≠j}^{N} Y_i Y_j I_sij) = (λ/b) Σ_{i≠j}^{N} Y_i Y_j = [k(k − 1)/(v(v − 1))] Σ_{i≠j}^{N} Y_i Y_j
                     = [n(n − 1)/(N(N − 1))] Σ_{i≠j}^{N} Y_i Y_j.

Thus, we have the following theorem:

Theorem 2.8.1. Under a controlled sampling design, the sample mean and sample variance remain unbiased for their respective parameters.
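Theorem 2.8.1 can be checked exactly on a small controlled design. The sketch below uses the Fano-plane BIBD with b = v = 7, r = k = 3, λ = 1 (so only 7 of the C(7, 3) = 35 possible samples are used), with hypothetical y values; averaging over the b equiprobable samples reproduces Ȳ and S_y² exactly:

```python
from statistics import mean, variance

# Fano-plane BIBD: 7 blocks of size k = 3 on v = 7 units, each unit in
# r = 3 blocks, each pair of units together in exactly lambda = 1 block.
blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
          (1, 4, 6), (2, 3, 6), (2, 4, 5)]
Y = [3.0, 8.0, 1.0, 9.0, 4.0, 7.0, 2.0]   # hypothetical population values

ybar = mean(mean(Y[i] for i in b) for b in blocks)       # E(sample mean)
s2 = mean(variance([Y[i] for i in b]) for b in blocks)   # E(sample variance)
print(ybar, mean(Y))      # both equal the population mean
print(s2, variance(Y))    # both equal S_y^2
```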

2.9 DETERMINANT SAMPLING

Subramani and Tracy (1993) used the concept of incomplete block designs in sample surveys and introduced a new sampling scheme called determinant sampling. This scheme totally ignores the units close to each other for selection in the sample. In the present discussion, the units which are close to each other in some sense are called contiguous units. Chakrabarti (1963) excluded contiguous units when translating the results of sampling designs to experimental designs, since these units have a tendency to provide identical information, which may be induced by factors like time, category, or location. As an example, in socioeconomic surveys people have a tendency to exhibit similar expenditure patterns on household items during different weeks of the month. Moreover, people belonging to the same income category class have a greater tendency to have similar expenditure patterns. With regard to the factor location, residents of a specific area show similar symptoms of a disease caused by environmental pollution or of some infectious disease. Similarly, in crop field surveys contiguous farms and fields should be avoided. Because of this limitation, Rao (1975, 1987) has suggested that if contiguous units occur in any observed sample, they may be collapsed into a single unit, with the corresponding response as the average observed response over these units. An estimate of the unknown parameter is then recommended on the basis of such a reduced sample. The situations for getting more information on the population by avoiding pairs of contiguous units in the observed sample are well summarised by Hedayat, Rao, and Stufken (1988). Tracy and Osahan (1994a) further extended their work for other sampling schemes.

EXERCISES

Exercise 2.1. Define simple random sampling. Is the sample mean a consistent or unbiased estimator of the population mean? Derive the variance of the estimator using ( a ) SRSWR sampling ( b ) SRSWOR sampling. Also derive an unbiased estimator of the variance in each situation.
Exercise 2.2. A population consists of N units, the value of one unit being known to be Y₁. An SRSWOR sample of (n − 1) units is drawn from the remaining (N − 1) population units. Show that the estimator

Ŷ₁ = Y₁ + (N − 1) ȳ_{n−1},

where ȳ_{n−1} = (n − 1)^{−1} Σ_{i=1}^{n−1} y_i, is an unbiased estimator of the population total, Y = Σ_{i=1}^{N} Y_i, but the variance of the estimator Ŷ₁ is not less than the variance of the estimator Ŷ₂ = N ȳ_n, where ȳ_n = n^{−1} Σ_{i=1}^{n} y_i is an estimator of the population mean based on the sample of size n selected from the population of N units. In other words, the estimator Ŷ₁ is no more efficient than Ŷ₂. Give reasons.

Hint: By setting V(Ŷ₁) ≥ V(Ŷ₂) we obtain N > n, which is always true for SRSWOR sampling, where V(Ŷ₁) = (N − 1)²(1/(n − 1) − 1/(N − 1)) S_y² and V(Ŷ₂) = N²(1/n − 1/N) S_y².

Exercise 2.3. Suppose in the list of N businesses serially numbered, k businesses are found to be dead and t new businesses have come into existence, making the total number of businesses (N − k + t). Give a simple procedure for selecting a business with equal probability from the (N − k + t) businesses, avoiding renumbering of the original businesses, and show that the newly developed procedure achieves equal probability for the new businesses too.

Hint: Using SRSWOR sampling the probability of selecting each unit will be (N − k + t)^{−1}.

Exercise 2.4. Show that the bias in the Searls' estimator, defined as ȳ_searl = λ ȳ_n, is B(ȳ_searl) = −Ȳ V(ȳ_n)/{Ȳ² + V(ȳ_n)}. Hence deduce its values under SRSWR and SRSWOR sampling.
Hint: Reddy (1978a).

Exercise 2.5. An analogue of the Searls' estimator for estimating the population proportion is defined as P̂_searl = γ p_y, where γ is a constant. Find the minimum mean square error of the estimator P̂_searl under SRSWR and SRSWOR sampling. Also study the bias of the estimator in each situation.
Hint: Conti (1995).

Exercise 2.6. An estimator of the optimum value of λ in the Searls' estimator of the population mean Ȳ under SRSWR sampling is given by

λ̂ = [1 + (1/n)((N − 1)/N)(s_y²/ȳ_n²)]^{−1}.

Show that λ̂ is a consistent estimator of the optimum value of λ. Also calculate the bias and mean squared error, to the first order of approximation, of the estimator of the population mean defined as ȳ₀ = λ̂ ȳ_n. Deduce the results for estimating the population proportion with the estimator P̂_searl = γ̂ p_y, where γ̂ is a consistent estimator of γ.
Hint: Mangat, Singh, and Singh (1991).

Exercise 2.7. Show that: ( a ) under SRSWR sampling s_y² is an unbiased estimator of σ_y²; ( b ) under SRSWOR sampling s_y² is an unbiased estimator of S_y².

Exercise 2.8. Define the Searls' estimator of the population mean. Show that the relative efficiency of the Searls' estimator is a decreasing function of the sample size under ( a ) SRSWR ( b ) SRSWOR sampling designs.

Exercise 2.9. Show that the probability of selecting the i-th unit at the s-th draw remains the same under SRSWR and SRSWOR sampling and is given by 1/N.

Exercise 2.10. Why is the Searls' estimator not useful in actual practice? Suggest some modifications to make it practicable.
Hint: Use λ̂ in place of λ.

Exercise 2.11. In the case of SRSWR sampling, if there are two characters y and x, the covariance between Y and X is defined as σ_xy = (1/N) Σ_{i=1}^{N} (Y_i − Ȳ)(X_i − X̄). Then the usual estimator of σ_xy is given by s_xy = (1/(n − 1)) Σ_{i=1}^{n} (y_i − ȳ)(x_i − x̄). Show that an estimator better than s_xy based only on distinct units is

ŝ_(xy) = { [C_{v−1}(n − 1)/C_v(n)] s_d(xy)  if v > 1,
         { 0                              otherwise,

where s_d(xy) = (v − 1)^{−1} Σ_{i=1}^{v} (y_i − ȳ_v)(x_i − x̄_v).

Hint: Pathak (1962).

Exercise 2.12. ( a ) Show that the usual estimator of the population total (namely N ȳ) in SRSWOR has the minimum average mean squared error, for permutations of values attached to the units, in the general class of linear translation invariant estimators of the population total Y.
( b ) Show that for SRSWOR sampling of size n, the estimator which minimises the average mean squared error, for permutations of values attached to the units, in the class of all linear estimators is given by

Ŷ_c = (N/n)(1 + δ)^{−1} Σ_{i∈s} y_i,

where δ = ((N − n)/(N − 1))(C_y²/n) and C_y is the known population coefficient of variation.

Hint: Ramakrishnan and Rao (1975).

Exercise 2.13. Let a finite population consist of N units. To every unit there is attached a characteristic y. The characteristics are assumed to be measured on a given scale with distinct points y₁, y₂, ..., y_t. Let N_t be the number of units associated with scale point y_t, with N = Σ_t N_t. A simple random sample of size n is drawn. Using the likelihood function of (N₁, N₂, ..., N_t) and assuming N/n to be an integer, show that a maximum likelihood estimator of the population mean Ȳ is

ȳ_ml = (1/n) Σ_t n_t y_t,

where n_t is the number of times the value y_t is observed in the sample.

Hint: Hartley and Rao (1968).

Exercise 2.14. Suppose we select a sample of size n such that the i-th unit of the population occurs f_i times in the sample. Assume that n₁ of these units (n₁ < n) are selected with frequency one. Evidently n = n₁ + Σ_{i=1}^{r} f_i, where r is the number of units occurring f_i times in the sample. Let d (= n₁ + r) be the number of distinct units in the sample. The d units are measured by one set of investigators and the r repeated units by another set, preferably by the supervising staff. Let the measurements of the d units be denoted by x₁, x₂, ..., x_{n₁} for the non-repeated ones and x_{n₁+1}, x_{n₁+2}, ..., x_{n₁+r} for the repeated ones, and let the measurements of the r repeated units be denoted by z₁, z₂, ..., z_r. Using the above information and notation, study the asymptotic properties of the following estimators of the population mean:

( a ) x̄_d = (1/d)[n₁ x̄_{n₁} + r x̄_r];  ( b ) z̄_R = z̄_r (x̄_d/x̄_r);  ( c ) z̄_lr = z̄_r + β(x̄_d − x̄_r);

( d ) ȳ₁ = (1 − w) x̄_d + w z̄_R;  and  ( e ) ȳ₂ = (1 − w) x̄_d + w z̄_lr.

Hint: Mohanty (1977).

Exercise 2.15. Discuss the problem of the estimation of domain total in survey
sampling. Derive the estimator of domain total and find its variance under different
situations.

Exercise 2.16. Under SRSWR sampling, show that the distinct-units-based unbiased estimators of the finite population variance σ_y² are given by

( a ) v̂₁ = [N_{(v)} C_v(n − 1)/(N^{n−1}(N − 1))] s_y²;   ( b ) v̂₂ = [(1/v − 1/N) + N^{1−n}(1 − 1/v)] s_d²;

( c ) v̂₃ = [C_{v−1}(n − 1)/C_v(n)] s_d²;   ( d ) v̂₄ = [(N − 1)/N][1 − C_v(n − 1)/C_v(n)] s_d²;

and ( e ) v̂₅ = [(1/v − 1/N) + (N − 1)^n/N^{n+1}] s_d².

Hint: Pathak (1966).

Exercise 2.17. Discuss the method and theory of the estimation of rare attributes in survey sampling.

Exercise 2.18. Write a program in FORTRAN or SAS to find the values of the coefficients C_v(n − 1) and C_v(n). Test the results for n = 5 and v = 3 with all steps using your desk calculator.

Exercise 2.19. Under SRSWOR sampling of n units out of a population of N distinct units, consider the following estimator of the population mean Ȳ:

ȳ_new = Σ_{i=1}^{n} c_i y_i,

where c_i is a constant depending on the i-th draw, and y_i is the value of y on the unit selected at the i-th draw.
( a ) Show that ȳ_new is unbiased for the population mean Ȳ if and only if Σ_{i=1}^{n} c_i = 1.

( b ) Show also that under this condition V(ȳ_new) = S_y²(Σ_{i=1}^{n} c_i² − 1/N).

Hint: V(ȳ_new) = V(Σ c_i y_i) = Σ c_i² V(y_i) + ΣΣ_{i≠j} c_i c_j Cov(y_i, y_j), where V(y_i) = ((N − 1)/N) S_y² and Cov(y_i, y_j) = −S_y²/N for i ≠ j.

( c ) Show that V(ȳ_new) is minimised subject to the condition Σ_{i=1}^{n} c_i = 1 if and only if c_i = 1/n, i = 1, 2, ..., n.

Hint: Σ c_i² ≥ (Σ c_i)²/n = 1/n, and equality holds if and only if c_i = 1/n.

Exercise 2.20. An SRSWR sample of size n is drawn until we have a pre-assigned number m of distinct units. Let ȳ_n = (1/n) Σ_{i=1}^{m} f_i y_i and ȳ_m = (1/m) Σ_{i=1}^{m} y_i be the sample means based on the sample including repetitions and without repetitions, where f_i denotes the frequency of the i-th distinct unit.

( a ) Show that both estimators ȳ_n and ȳ_m are unbiased for the population mean Ȳ.

( b ) Show that V(ȳ_n) = σ_y² E(1/n).

( c ) Show that V(ȳ_m) = (1/m − 1/N) S_y².

( d ) Show that the estimator ȳ_m is better than ȳ_n if E(1/n) > (N − m)/(m(N − 1)).

Hint: Raj and Khamis (1958).

Exercise 2.21. Discuss controlled sampling. Show that the sample mean and sample variance remain unbiased for their respective parameters.

Exercise 2.22. Discuss the concept of a rare attribute and give a possible solution using inverse sampling.

PRACTICAL PROBLEMS

Practical 2.1. Consider the problem of estimating the total number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts. We know that there were 69 species caught during 1992, as shown in population 4 in the Appendix. What is the minimum number of species groups to be selected by SRSWR sampling to attain an accuracy of 12% relative standard error?
Given: S_y² = 31,010,599 and Y = 291,882.

Practical 2.2. Your supervisor has suggested that you think about the problem of estimating the total number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts. He told you that there were 69 species caught during 1993, as shown in population 4 in the Appendix. He needs your help in deciding the sample size using the SRSWOR design with a relative standard error of 25%. How can your knowledge of statistics help him?
Given: S_y² = 39,881,874 and Y = 316,784.

Practical 2.3. The demand for Bluefish has been found to be highest in certain markets. In order to supply these types of fish, the estimation of the proportion of Bluefish is an important issue. On the Atlantic and Gulf coasts, in a large sample of 311,528 fish, 10,940 were shown to be Bluefish caught during 1995. What is the minimum number of fish to be selected by SRSWR sampling to attain an accuracy of 12% relative standard error?

Practical 2.4. John considers the problem of estimating the total number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts. There were 69 species caught during 1994, as shown in population 4 in the Appendix. John selected a sample of 20 units by SRSWR sampling. What will be his gain in efficiency if he considers the Searls' estimator instead of the usual estimator?
Given: S_y² = 49,829,270 and Y = 341,856.

Practical 2.5. Select an SRSWR sample of twenty units from population 4 given in the Appendix. Collect the information on the number of fish during 1994 in each of the species groups selected in the sample. Estimate the average number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts during 1994. Construct a 95% confidence interval for the average number of fish in each species group of the United States.

Practical 2.6. Use population 4 of the Appendix to select an SRSWOR sample of sixteen units. Obtain the information on the number of fish during 1993 in each of the species groups selected in the sample. Develop an estimate of the average number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts during 1993. Construct the 95% confidence interval estimate.

Practical 2.7. Select an SRSWR sample of 20 states using the Random Number Table method from population 1 of the Appendix. Note the frequency of each state selected in the sample. Construct a new sample by keeping only distinct states and collect the information about the nonreal estate farm loans in these states. From the information collected in the sample:

( a ) Estimate the average nonreal estate farm loans in the United States using information from distinct units only.
( b ) Estimate the finite population variance of the nonreal estate loans in the United States using distinct units only.

( c ) Estimate the average nonreal estate loans and its finite population variance by including repeated units in the sample. Comment on the results.

Practical 2.8. A fisherman visited the Atlantic and Gulf coasts and caught 6,000 fish one by one. He noted the species group of each fish caught by him and put back that fish in the sea before making the next catch. He observed that 700 fish belong to the group Herrings.
( a ) Estimate the proportion of fish in the group Herrings living on the Atlantic and Gulf coasts.
( b ) Construct the 95% confidence interval.

Practical 2.9. During 1995 Michael visited the Atlantic and Gulf coasts and caught 7,000 fish. He observed the species group of each one of the fish caught by him using SRSWOR sampling and found that 1,068 fish belong to the group Red snapper.
( a ) Estimate the proportion of fish in the group Red snapper living on the Atlantic and Gulf coasts.
( b ) Construct the 95% confidence interval.
Given: Total number of fish living on the coast = 311,528.

Practical 2.10. Following the instructions of an ABC company, select an SRSWR sample of 25 units from population 1 by using the 4th and 5th columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix. Record the states selected more than once in the sample. Reduce the sample size by keeping each state only once in the sample and collect the information about the real estate farm loans in these states. Use this information to:

( a ) Estimate the average real estate farm loans in the United States using information from distinct units only.
( b ) Estimate the finite population variance of the real estate loans in the US using information from distinct units only.
( c ) Estimate the average real estate loans and its finite population variance by including repeated units in the sample. Comment on the results.

Practical 2.11. Think of a practical situation where you have to estimate the total of a variable or characteristic for a subgroup (domain) of a population. Take a sample of reasonable size from the population under study and collect the information from the units selected in the sample. Apply the appropriate formulae to construct the 95% confidence interval estimate.

Practical 2.12. A practical situation arises where you have to estimate the proportion of a rare attribute in a population, e.g., extra-marital relations. Collect the information from the units selected in the sample through inverse sampling from the population under study. Apply the appropriate formulae to construct the 95% confidence intervals for the proportion of the rare attribute in the population.
Chapter 2: Simple Random Sampling 135

Practical 2.13. A sample of 30 out of 100 managers was taken, and they were
asked whether or not they usually take work home. The responses of these
managers are given below, where 'Yes' indicates they usually take work home and
'No' means they do not.

Yes Yes Yes Yes No No Yes No Yes No


No Yes No Yes No Yes Yes Yes Yes Yes
No No Yes Yes Yes Yes Yes Yes No Yes

Construct 95% confidence intervals for the proportion of all managers who take
work home using the following sampling schemes:
( a ) Simple Random Sampling With Replacement;
( b ) Simple Random Sampling Without Replacement.

Practical 2.14. From a list of 80,000 farms in a state, a sample of 2,100 farms was
selected by SRSWOR sampling. The data for the number of cattle for the sample
were as follows:
$\sum_{i=1}^{n} y_i = 38{,}000 \quad \text{and} \quad \sum_{i=1}^{n} y_i^2 = 920{,}000.$
Estimate from the sample the total number of cattle in the state and the average number
of cattle per farm, along with their standard errors, coefficients of variation, and 95%
confidence intervals.

Practical 2.15. At St. Cloud State University, the length of hairs, Y, on the heads
of girls is assumed to be uniformly distributed between 5 cm and 25 cm with the
probability density function
$f(y) = \frac{1}{20}, \quad 5 < y < 25.$
( a ) We wish to estimate the average length of hairs with an accuracy of relative
standard error of 5%; what is the required minimum number of hairs to be taken
from the girls?
( b ) Select a sample of the required size, and use it to construct a 95% confidence
interval for the average length of hairs.

Practical 2.16. The distribution of weight y shipped to 1000 locations has a logistic
distribution
$f(y) = \frac{1}{4\beta}\,\mathrm{sech}^2\left\{\frac{1}{2\beta}(y-\alpha)\right\}$
with $\alpha = 10$ and $\beta = 0.5$.
( a ) Find the value of the minimum sample size n required to estimate the average
weight shipped with an accuracy of standard error of 0.05%.
( b ) Select a sample of the required size and construct a 95% confidence interval for
the average weight shipped.
( c ) Does the true mean weight lie in the 95% confidence interval?
136 Advanced sampl ing theory with applicat ions

Practical 2.17. Assume that the life of every person is made up of an infinite number
of good and bad events. Count the total number of good and bad events you
remember that have happened to you. Estimate the proportion of good events in
your life. Construct a 95% confidence interval estimate. Name the sampling
scheme you adopted to estimate the proportion of good happenings, and comment.

Practical 2.18. Assume that everyone dreams an infinite number of times during
sleeping hours in his or her life. Count the number of good and bad dreams in your life
that you remember. Estimate the proportion of good dreams and construct a 95%
confidence interval estimate. Name the sampling scheme you followed to estimate
the proportion of good dreams, and comment.

Practical 2.19. Dr. Dreamer believes that if a person has good dreams during
sleeping hours then he or she is mentally healthier and a more pleasant person. You are
instructed to report stories of your dreams to the doctor until you have had 15
good dreams. Find Dr. Dreamer's 95% confidence interval estimate of the
proportion of good dreams in your life. Can you be considered a pleasant person?
Comment and list the sampling scheme used.
3. USE OF AUXILIARY INFORMATION: SIMPLE RANDOM SAMPLING

3.0 INTRODUCTION

It is well known that suitable use of auxiliary information in probability sampling
results in considerable reduction in the variance of the estimators of population
parameters, viz. population mean (or total), median, variance, regression coefficient,
and population correlation coefficient, etc. In this chapter we will consider the
problem of estimation of different population parameters of interest to survey
statisticians using known auxiliary information under SRSWOR and SRSWR
sampling schemes only. Before proceeding further it is necessary to define some
notation and expected values, which will be useful throughout this chapter.

3.1 NOTATION AND EXPECTED VALUES

Assume that a simple random sample (SRS) of size $n$ is drawn from the given
population of $N$ units. Let the value of the study variable $Y$ and the auxiliary
variable $X$ for the $i^{th}$ unit $(i = 1,2,\ldots,N)$ of the population be denoted by $Y_i$ and $X_i$,
and for the $i^{th}$ unit in the sample $(i = 1,2,\ldots,n)$ by $y_i$ and $x_i$, respectively. From the
sample observations we have
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2, \quad s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2,$
and
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x}).$

For the population observations we have the analogous quantities
$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i, \quad \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \quad S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})^2, \quad S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2,$
and
$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})(X_i-\bar{X}).$

In general define the following population parameters
$\mu_{rs} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})^r (X_i-\bar{X})^s, \quad \text{and} \quad \lambda_{rs} = \frac{\mu_{rs}}{\mu_{20}^{r/2}\,\mu_{02}^{s/2}}.$
Note that
$\mu_{20} = S_y^2, \quad \mu_{02} = S_x^2, \quad \text{and} \quad \mu_{11} = S_{xy},$
so that
$C_y^2 = \frac{S_y^2}{\bar{Y}^2} = \frac{\mu_{20}}{\bar{Y}^2}, \quad C_x^2 = \frac{S_x^2}{\bar{X}^2} = \frac{\mu_{02}}{\bar{X}^2}, \quad \text{and} \quad \rho_{xy} = \frac{S_{xy}}{S_xS_y} = \frac{\mu_{11}}{\sqrt{\mu_{20}}\sqrt{\mu_{02}}}.$

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

Let us define
$\varepsilon_0 = \frac{\bar{y}}{\bar{Y}}-1, \quad \varepsilon_1 = \frac{\bar{x}}{\bar{X}}-1, \quad \varepsilon_2 = \frac{s_y^2}{S_y^2}-1, \quad \varepsilon_3 = \frac{s_x^2}{S_x^2}-1, \quad \text{and} \quad \varepsilon_4 = \frac{s_{xy}}{S_{xy}}-1.$

To the first order of approximation we have
$E(\varepsilon_0) = E(\varepsilon_1) = E(\varepsilon_2) = E(\varepsilon_3) = E(\varepsilon_4) = 0,$
$E(\varepsilon_0^2) = \left(\frac{1-f}{n}\right)C_y^2, \quad E(\varepsilon_1^2) = \left(\frac{1-f}{n}\right)C_x^2, \quad E(\varepsilon_0\varepsilon_1) = \left(\frac{1-f}{n}\right)\rho_{xy}C_xC_y,$
$E(\varepsilon_2^2) = \left(\frac{1-f}{n}\right)(\lambda_{40}-1), \quad E(\varepsilon_3^2) = \left(\frac{1-f}{n}\right)(\lambda_{04}-1), \quad E(\varepsilon_4^2) = \left(\frac{1-f}{n}\right)\left[\frac{\lambda_{22}}{\rho_{xy}^2}-1\right],$
$E(\varepsilon_0\varepsilon_2) = \left(\frac{1-f}{n}\right)C_y\lambda_{30}, \quad E(\varepsilon_0\varepsilon_3) = \left(\frac{1-f}{n}\right)C_y\lambda_{12}, \quad E(\varepsilon_0\varepsilon_4) = \left(\frac{1-f}{n}\right)C_y\frac{\lambda_{21}}{\rho_{xy}},$
$E(\varepsilon_1\varepsilon_2) = \left(\frac{1-f}{n}\right)C_x\lambda_{21}, \quad E(\varepsilon_1\varepsilon_3) = \left(\frac{1-f}{n}\right)C_x\lambda_{03}, \quad E(\varepsilon_1\varepsilon_4) = \left(\frac{1-f}{n}\right)C_x\frac{\lambda_{12}}{\rho_{xy}},$
$E(\varepsilon_2\varepsilon_3) = \left(\frac{1-f}{n}\right)(\lambda_{22}-1), \quad E(\varepsilon_2\varepsilon_4) = \left(\frac{1-f}{n}\right)\left[\frac{\lambda_{31}}{\rho_{xy}}-1\right], \quad \text{and} \quad E(\varepsilon_3\varepsilon_4) = \left(\frac{1-f}{n}\right)\left[\frac{\lambda_{13}}{\rho_{xy}}-1\right],$
where $f = n/N$ denotes the finite population correction (f.p.c.) factor. These
expected values can easily be obtained by following Sukhatme (1944), Sukhatme
and Sukhatme (1970), Srivastava and Jhajj (1981), and Tracy (1984).
Also define
$\hat{\mu}_{rs} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^r(x_i-\bar{x})^s, \quad \hat{\lambda}_{rs} = \frac{\hat{\mu}_{rs}}{\hat{\mu}_{20}^{r/2}\,\hat{\mu}_{02}^{s/2}}, \quad \text{and} \quad r_{xy} = \frac{s_{xy}}{s_xs_y} = \frac{\hat{\mu}_{11}}{\sqrt{\hat{\mu}_{20}}\sqrt{\hat{\mu}_{02}}}$
as unbiased or consistent estimators of $\mu_{rs}$, $\lambda_{rs}$ and $\rho_{xy}$, respectively.

The next section is devoted to estimating the population mean in the presence
of known auxiliary information.
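The sample moments defined above lend themselves to a direct computational check. The following sketch (Python and the function names are ours, not the book's) computes $\hat{\mu}_{rs}$, $\hat{\lambda}_{rs}$ and $r_{xy}$ exactly as defined:

```python
# Sketch of the moment estimators defined in section 3.1.
def mu_hat(y, x, r, s):
    """mu_hat_rs = (1/(n-1)) * sum (y_i - ybar)^r * (x_i - xbar)^s."""
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    return sum((yi - ybar) ** r * (xi - xbar) ** s
               for yi, xi in zip(y, x)) / (n - 1)

def lam_hat(y, x, r, s):
    """lambda_hat_rs = mu_hat_rs / (mu_hat_20^(r/2) * mu_hat_02^(s/2))."""
    return mu_hat(y, x, r, s) / (mu_hat(y, x, 2, 0) ** (r / 2)
                                 * mu_hat(y, x, 0, 2) ** (s / 2))

def r_xy(y, x):
    """Sample correlation: mu_hat_11 / sqrt(mu_hat_20 * mu_hat_02)."""
    return lam_hat(y, x, 1, 1)
```

Note that $r_{xy}$ is just $\hat{\lambda}_{11}$; for exactly linearly related data it returns 1.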

3.2 ESTIMATION OF POPULATION MEAN

Several estimators of the population mean are available in the literature, and we will
discuss some of them.

3.2.1 RATIO ESTIMATOR

Cochran (1940) was the first to show the contribution of known auxiliary
information in improving the efficiency of the estimator of the population mean $\bar{Y}$
in survey sampling. Assuming that the population mean $\bar{X}$ of the auxiliary variable
is known, he introduced a ratio estimator of the population mean $\bar{Y}$ defined as
$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right). \quad (3.2.1.1)$

Then we have the following theorems:

Theorem 3.2.1.1. The bias in the ratio estimator $\bar{y}_R$ of the population mean $\bar{Y}$, to
the first order of approximation, is
$B(\bar{y}_R) = \left(\frac{1-f}{n}\right)\bar{Y}\left[C_x^2 - \rho_{xy}C_xC_y\right]. \quad (3.2.1.2)$

Proof. The estimator $\bar{y}_R$ in terms of $\varepsilon_0$ and $\varepsilon_1$ can easily be written as
$\bar{y}_R = \bar{Y}(1+\varepsilon_0)(1+\varepsilon_1)^{-1}. \quad (3.2.1.3)$

Assuming $|\varepsilon_1| < 1$ and using the binomial expansion of the term $(1+\varepsilon_1)^{-1}$ we have
$\bar{y}_R = \bar{Y}(1+\varepsilon_0)\left(1-\varepsilon_1+\varepsilon_1^2+O(\varepsilon_1^3)\right) = \bar{Y}\left[1+\varepsilon_0-\varepsilon_1+\varepsilon_1^2-\varepsilon_0\varepsilon_1+O(\varepsilon_1^3)\right], \quad (3.2.1.4)$
where $O(\varepsilon_1^3)$ denotes the higher order terms of $\varepsilon_1$. Note that since $|\varepsilon_1| < 1$,
$\varepsilon_1^g \to 0$ as $g > 1$ increases. Therefore the terms in (3.2.1.4) with higher powers of
$\varepsilon_1$ are negligible and can be ignored. Now taking expected values on both sides of
(3.2.1.4) and using the results from section 3.1 we obtain
$E(\bar{y}_R) = \bar{Y}\left[1+\left(\frac{1-f}{n}\right)\left(C_x^2-\rho_{xy}C_xC_y\right)+o(n^{-1})\right]. \quad (3.2.1.5)$

Thus the bias in the estimator $\bar{y}_R$ to the first order of approximation is given by
(3.2.1.2). Hence the theorem.

Theorem 3.2.1.2. The mean squared error of the ratio estimator $\bar{y}_R$ of the
population mean $\bar{Y}$, to the first order of approximation, is given by
$MSE(\bar{y}_R) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2-2\rho_{xy}C_yC_x\right]. \quad (3.2.1.6)$

Proof. By the definition of mean squared error (MSE) and using (3.2.1.4) we have
$MSE(\bar{y}_R) = E\left[\bar{y}_R-\bar{Y}\right]^2 \approx E\left[\bar{Y}\left(1+\varepsilon_0-\varepsilon_1+\varepsilon_1^2-\varepsilon_0\varepsilon_1+O(\varepsilon^2)\right)-\bar{Y}\right]^2 \approx \bar{Y}^2E\left[\varepsilon_0-\varepsilon_1+\varepsilon_1^2-\varepsilon_0\varepsilon_1\right]^2.$
Again neglecting higher order terms and using results from section 3.1, the MSE to
the first order of approximation is given by
$MSE(\bar{y}_R) = \bar{Y}^2E\left[\varepsilon_0^2+\varepsilon_1^2-2\varepsilon_0\varepsilon_1\right] = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2-2\rho_{xy}C_xC_y\right].$
Hence the theorem.

By substituting the values of $C_y$, $C_x$ and $\rho_{xy}$ in (3.2.1.6), one can easily see that
the mean squared error of the estimator $\bar{y}_R$, to the first order of approximation, can
be written as
$MSE(\bar{y}_R) = \left(\frac{1-f}{n}\right)\frac{1}{N-1}\sum_{i=1}^{N}\left[(Y_i-\bar{Y})-R(X_i-\bar{X})\right]^2, \quad (3.2.1.7)$
where $R = \bar{Y}/\bar{X}$ is the ratio of the two population means.

Theorem 3.2.1.3. An estimator of the mean squared error of the ratio estimator $\bar{y}_R$,
to the first order of approximation, is
$\widehat{MSE}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\left[s_y^2+r^2s_x^2-2rs_{xy}\right], \quad (3.2.1.8)$
where $r = \bar{y}/\bar{x}$ is the ratio of the two sample means.

Proof. Obvious by replacing the population parameters with the corresponding
sample statistics in (3.2.1.6). Such a method of obtaining estimators is also called the
method of moments.

Theorem 3.2.1.4. Another form of the estimator of the mean squared error of the
ratio estimator $\bar{y}_R$, to the first order of approximation, is
$\widehat{MSE}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\frac{1}{n-1}\sum_{i=1}^{n}\left[(y_i-\bar{y})-r(x_i-\bar{x})\right]^2, \quad (3.2.1.9)$
where $r = \bar{y}/\bar{x}$ is the ratio of the two sample means.
Proof. Obvious by the method of moments.

Theorem 3.2.1.5. The ratio estimator $\bar{y}_R$ is more efficient than the sample mean $\bar{y}$ if
$\rho_{xy}\frac{C_y}{C_x} > \frac{1}{2}. \quad (3.2.1.10)$
Proof. The proof follows from the fact that the ratio estimator $\bar{y}_R$ is more efficient
than the sample mean $\bar{y}$ if
$MSE(\bar{y}_R) < V(\bar{y}),$
or if
$\left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2-2\rho_{xy}C_yC_x\right] < \left(\frac{1-f}{n}\right)\bar{Y}^2C_y^2,$
or if
$C_x^2-2\rho_{xy}C_yC_x < 0,$
or if
$\rho_{xy} > \frac{1}{2}\frac{C_x}{C_y}.$
Hence the theorem.
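Condition (3.2.1.10) is easy to verify numerically. The following check (ours, not the book's; Python is not used in the book) uses the population 1 parameters quoted later in Example 3.2.1.2, namely $C_y^2 = 1.1086$, $C_x^2 = 1.5256$ and $\rho_{xy} = 0.8038$:

```python
# Check the efficiency condition rho_xy > Cx/(2*Cy), equivalently
# rho_xy * Cy / Cx > 1/2, for the farm-loans population (Example 3.2.1.2).
Cy = 1.1086 ** 0.5          # Cy^2 = 1.1086
Cx = 1.5256 ** 0.5          # Cx^2 = 1.5256
rho = 0.8038
threshold = Cx / (2 * Cy)   # right-hand side of the condition
ratio_beats_mean = rho > threshold
```

Here the threshold is about 0.587, so the ratio estimator is indeed expected to beat the sample mean for this population, in line with Theorem 3.2.1.6.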

In the condition (3.2.1.10), if we assume that $C_y \approx C_x$, then it holds for all values
of the correlation coefficient $\rho_{xy}$ in the range (0.5, 1.0]. A Monte Carlo study of the
ratio estimator is available from Rao and Beegle (1967). Thus we have the
following theorem.

Theorem 3.2.1.6. The ratio estimator $\bar{y}_R$ is more efficient than the sample mean
$\bar{y}$ if $\rho_{xy} > 0.5$, i.e., if the correlation between $X$ and $Y$ is positive and high.

Example 3.2.1.1. Mr. Bean was interested in estimating the average amount of real
estate farm loans (in $000) during 1997 in the United States. He took an SRSWOR
sample of eight states from population 1 given in the Appendix. From the states
selected in the sample he gathered the following information.

State                           CA        GA       LA       MS       NM       PA       TX        VT
Nonreal estate farm loans (X)   3928.732  540.696  405.799  549.551  274.035  298.351  3520.361  19.363
Real estate farm loans (Y)      1343.461  939.460  282.565  627.013  140.582  756.169  1248.761  57.747

The average amount $878.16 of nonreal estate farm loans (in $000) for the year
1997 is known. Apply the ratio method of estimation for estimating the average
amount of the real estate farm loans (in $000) during 1997. Also find an estimate
of the mean squared error of the ratio estimator and hence deduce the 95% confidence
interval.

Solution. From the sample information, we have

Sr. No.  $y_i$      $x_i$      $(y_i-\bar{y})^2$  $(x_i-\bar{x})^2$  $(y_i-\bar{y})(x_i-\bar{x})$
1        1343.461   3928.732   447549.2926     7489094.4980     1830775.5040
2        939.460    540.696    70219.8326      424341.5022      -172618.6237
3        282.565    405.799    153589.3331     618286.5613      308159.4078
4        627.013    549.551    2252.1431       412883.3536      30493.8092
5        140.582    274.035    285036.1296     842863.5418      490149.5300
6        756.169    298.351    6674.7674       798806.9376      -73019.5217
7        1248.761   3520.361   329810.4398     5420748.0630     1337093.6030
8        57.747     19.363     380346.9504     1375337.8720     723260.3716
Sum      5395.758   9536.888   1675478.8890    17382362.3300    4474294.0800

Thus we have $n = 8$,
$s_x^2 = \frac{\sum_{i=1}^{8}(x_i-\bar{x})^2}{8-1} = \frac{17382362.33}{7} = 2483194.6, \quad \bar{x} = \frac{\sum_{i=1}^{8}x_i}{8} = \frac{9536.888}{8} = 1192.11,$
$s_y^2 = \frac{\sum_{i=1}^{8}(y_i-\bar{y})^2}{8-1} = \frac{1675478.89}{7} = 239354.1, \quad \bar{y} = \frac{\sum_{i=1}^{8}y_i}{8} = \frac{5395.758}{8} = 674.469,$
$s_{xy} = \frac{\sum_{i=1}^{8}(y_i-\bar{y})(x_i-\bar{x})}{8-1} = \frac{4474294.08}{7} = 639184.86, \quad \text{and} \quad r = \frac{\bar{y}}{\bar{x}} = \frac{674.469}{1192.11} = 0.5658.$
We are given $\bar{X} = 878.16$, $N = 50$ and $f = 0.16$.

Thus the ratio estimate of the average amount of real estate farm loans during 1997, $\bar{Y}$
(say), is given by
$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) = 674.469\left(\frac{878.162}{1192.11}\right) = 496.86,$
and an estimate of $MSE(\bar{y}_R)$ is given by
$\widehat{MSE}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\left[s_y^2+r^2s_x^2-2rs_{xy}\right]$
$= \left(\frac{1-0.16}{8}\right)\left[239354.1+(0.5658)^2\times 2483194.6-2\times 0.5658\times 639184.86\right] = 32654.65.$
A $(1-\alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\bar{y}_R \pm t_{\alpha/2}(df = n-1)\sqrt{\widehat{MSE}(\bar{y}_R)}.$
Using Table 2 given in the Appendix, the 95% confidence interval is given by
$496.86 \pm 2.365\sqrt{32654.65}, \quad \text{or} \quad [69.490,\ 924.229].$
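The arithmetic of Example 3.2.1.1 can be re-traced with a short script. This sketch is ours (the book uses no code); small differences in the last digits come from the book's rounding of intermediate values such as $r$:

```python
import math

# Re-trace Example 3.2.1.1: ratio estimate (3.2.1.1), estimated MSE
# (3.2.1.8), and 95% CI.  t = 2.365 (df = 7) is from the book's Table 2.
y = [1343.461, 939.460, 282.565, 627.013, 140.582, 756.169, 1248.761, 57.747]
x = [3928.732, 540.696, 405.799, 549.551, 274.035, 298.351, 3520.361, 19.363]
N, Xbar = 50, 878.162
n = len(y)
f = n / N
ybar, xbar = sum(y) / n, sum(x) / n
s2y = sum((v - ybar) ** 2 for v in y) / (n - 1)
s2x = sum((v - xbar) ** 2 for v in x) / (n - 1)
sxy = sum((v - ybar) * (w - xbar) for v, w in zip(y, x)) / (n - 1)
r = ybar / xbar
yR = ybar * Xbar / xbar                                    # (3.2.1.1)
mse = (1 - f) / n * (s2y + r ** 2 * s2x - 2 * r * sxy)     # (3.2.1.8)
half = 2.365 * math.sqrt(mse)
ci = (yR - half, yR + half)
```

Running it reproduces the estimate near 496.86 and the wide interval of the example; the interval is wide because only eight states were sampled.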

Example 3.2.1.2. After applying the ratio method of estimation, Mr. Bean wants to
know if he achieved any gain in efficiency by using the ratio estimator. The amounts
of real and nonreal estate farm loans (in $000) during 1997 in 50 different states of
the United States are presented in population 1 of the Appendix. Find the
relative efficiency of the ratio estimator, for estimating the average amount of real
estate farm loans during 1997 by using known information on nonreal estate farm
loans during 1997, with respect to the usual estimator of the population mean, given that
the sample size is eight units.

Solution. From the description of the population, we have $Y_i$ = Amount (in $000)
of real estate farm loans in different states during 1997, $X_i$ = Amount (in $000) of
nonreal estate farm loans in different states during 1997, $\bar{Y} = 555.43$, $\bar{X} = 878.16$,
$S_y^2 = 342021.5$, $C_x^2 = 1.5256$, $C_y^2 = 1.1086$, $\rho_{xy} = 0.8038$, and $N = 50$.

Thus we have
$MSE(\bar{y}_R) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2-2\rho_{xy}C_xC_y\right]$
$= \left(\frac{1-0.16}{8}\right)(555.43)^2\left[1.1086+1.5256-2\times 0.8038\sqrt{1.1086\times 1.5256}\right] = 17606.39.$
Also
$v(\bar{y}) = \left(\frac{1-f}{n}\right)S_y^2 = \left(\frac{1-0.16}{8}\right)\times 342021.5 = 35912.26.$
Thus the percent relative efficiency (RE) of the ratio estimator $\bar{y}_R$ with respect to
the usual estimator $\bar{y}$ is given by
$RE = \frac{v(\bar{y})}{MSE(\bar{y}_R)}\times 100 = \frac{35912.26}{17606.39}\times 100 = 203.97\%,$
which shows that the ratio estimator is more efficient than the usual estimator of the
population mean. It should be noted that the relative efficiency does not depend
upon the sample size.
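The relative-efficiency computation of Example 3.2.1.2 can be checked directly from the quoted population parameters (a sketch of ours, not the book's):

```python
# Theoretical MSE of the ratio estimator (3.2.1.6) versus the variance of
# the sample mean, for the population 1 parameters of Example 3.2.1.2.
N, n = 50, 8
f = n / N
Ybar = 555.43
S2y, C2y, C2x, rho = 342021.5, 1.1086, 1.5256, 0.8038
Cy, Cx = C2y ** 0.5, C2x ** 0.5
mse_ratio = (1 - f) / n * Ybar ** 2 * (C2y + C2x - 2 * rho * Cy * Cx)
var_mean = (1 - f) / n * S2y
re_pct = var_mean / mse_ratio * 100       # percent relative efficiency
```

Both `mse_ratio` and `var_mean` carry the common factor $(1-f)/n$, which cancels in `re_pct`; this is why the relative efficiency does not depend on the sample size.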

Theorem 3.2.1.7. The minimum sample size for the relative standard error (RSE) to
be less than or equal to a fixed value $\phi$ is given by
$n \ge \left[\frac{\phi^2\bar{Y}^2}{S_y^2+R^2S_x^2-2RS_{xy}}+\frac{1}{N}\right]^{-1}. \quad (3.2.1.11)$

Proof. The relative standard error of the ratio estimator $\bar{y}_R$ is
$RSE(\bar{y}_R) = \sqrt{V(\bar{y}_R)/\bar{Y}^2} = \sqrt{\left(\frac{1}{n}-\frac{1}{N}\right)\left(C_y^2+C_x^2-2\rho_{xy}C_xC_y\right)}.$
Now
$RSE(\bar{y}_R) \le \phi \quad \text{if} \quad \sqrt{\left(\frac{1}{n}-\frac{1}{N}\right)\left(C_y^2+C_x^2-2\rho_{xy}C_xC_y\right)} \le \phi;$
squaring on both sides we obtain
$\left(\frac{1}{n}-\frac{1}{N}\right)\left(C_y^2+C_x^2-2\rho_{xy}C_xC_y\right) \le \phi^2,$
or
$n \ge \left[\frac{\phi^2}{C_y^2+C_x^2-2\rho_{xy}C_xC_y}+\frac{1}{N}\right]^{-1} = \left[\frac{\phi^2\bar{Y}^2}{S_y^2+R^2S_x^2-2RS_{xy}}+\frac{1}{N}\right]^{-1}.$
Hence the theorem.

Example 3.2.1.3. Mr. Bean wishes to estimate the average real estate farm loans in
the United States with the help of the ratio method of estimation by using known
information about the nonreal estate farm loans as shown in population 1 in the
Appendix. What is the minimum sample size required for the relative standard error
(RSE) to be equal to 12.5%?
Solution. From the description of population 1 given in the Appendix, we have
$N = 50$, $\bar{Y} = 555.43$, $\bar{X} = 878.16$, $S_y^2 = 342021.5$, $S_x^2 = 1176526$, $S_{xy} = 509910.41$,
$R = \bar{Y}/\bar{X} = \frac{555.43}{878.16} = 0.63249$, and $\phi = 0.125$; thus the minimum sample size is
$n \ge \left[\frac{\phi^2\bar{Y}^2}{S_y^2+R^2S_x^2-2RS_{xy}}+\frac{1}{N}\right]^{-1}$
$= \left[\frac{0.125^2\times(555.43)^2}{342021.5+0.63249^2\times 1176526-2\times 0.63249\times 509910.41}+\frac{1}{50}\right]^{-1} = 20.51 \approx 21.$
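The sample-size formula (3.2.1.11) is a one-liner to evaluate; the following sketch (ours) re-traces Example 3.2.1.3 and rounds up, since the sample size must be an integer at least as large as the bound:

```python
import math

# Minimum sample size for RSE <= phi, ratio estimator (3.2.1.11),
# using the population 1 parameters of Example 3.2.1.3.
N, phi = 50, 0.125
Ybar, S2y, S2x, Sxy = 555.43, 342021.5, 1176526.0, 509910.41
R = 0.63249
n_min = 1.0 / (phi ** 2 * Ybar ** 2
               / (S2y + R ** 2 * S2x - 2 * R * Sxy) + 1.0 / N)
n_req = math.ceil(n_min)       # round up to the next whole unit
```

This gives `n_min` close to 20.5 and hence a required sample of 21 states, matching the example.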

Example 3.2.1.4. Mr. Bean selected an SRSWOR sample of 21 states listed in
population 1. He collected information about the real estate farm loans and nonreal
estate farm loans from the selected states. He applied the ratio method of estimation
for estimating the average real estate farm loans assuming that the average nonreal
estate farm loans in the United States is known and is equal to $878.16. Discuss his
95% confidence interval.

Solution. Note that the population size is 50. Mr. Bean started with the first two
columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix
and selected the following 21 distinct random numbers between 1 and 50: 01, 23,
46, 04, 32, 47, 33, 05, 22, 38, 29, 40, 03, 36, 27, 19, 14, 42, 48, 06, and 07.

Random no.  State  $x_i$     $y_i$     $(x_i-\bar{x})^2$  $(y_i-\bar{y})^2$  $(x_i-\bar{x})(y_i-\bar{y})$
01          AL     348.334   408.978   303627.6    21302.6    80424.302
03          AZ     431.439   54.633    218948.3    250299.3   234099.570
04          AR     848.317   907.700   2605.2      124445.1   -18005.653
05          CA     3928.732  1343.461  9177106.0   621777.6   2388748.500
06          CO     906.281   315.809   47.9        57179.9    -1655.427
07          CT     4.373     7.130     800998.3    300087.3   490274.840
14          IN     1022.782  1213.024  15233.5     433084.8   81224.255
19          ME     51.539    8.849     718797.2    298206.9   462979.800
22          MI     440.518   323.028   210534.2    53779.6    106406.960
23          MN     2466.892  1354.768  2457163.0   639737.2   1253769.700
27          NE     3585.406  1337.852  7214853.0   612963.4   2102960.000
29          NH     0.471     6.044     807998.0    301278.3   493388.550
32          NY     426.274   201.631   223808.6    124821.8   167141.200
33          NC     494.730   639.571   163723.9    7163.7     -34247.221
36          OK     1716.087  612.108   667046.1    3269.1     46697.097
38          PA     298.351   756.169   361209.5    40496.2    -120944.720
40          SC     80.750    87.951    670119.2    218071.5   382274.620
42          TN     388.869   553.266   260599.1    2.8        850.596
46          VA     188.477   321.583   505351.9    54451.9    165883.560
47          WA     1228.607  1100.745  108404.8    297911.6   179708.250
48          WV     29.291    99.277    757016.8    207621.7   396450.630
Sum                18886.520 11653.577 25645192.0  4667952.0  8858429.300
x - Nonreal estate farm loans, y - Real estate farm loans.

Given $N = 50$ and $\bar{X} = 878.16$. Now from the above table, $n = 21$, $\bar{y} = 554.93223$,
$\bar{x} = 899.35809$, $s_x^2 = 1282260$, $s_y^2 = 233397.6$, $s_{xy} = 442921.47$, $r = 0.617$, and
$f = 0.42$.

Thus the ratio estimate of the average real estate farm loans in the United States is
$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) = 554.93223\left(\frac{878.16}{899.35809}\right) = \$541.85.$
An estimate of $MSE(\bar{y}_R)$ is given by
$\widehat{MSE}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\left[s_y^2+r^2s_x^2-2rs_{xy}\right]$
$= \left(\frac{1-0.42}{21}\right)\left[233397.6+(0.617)^2\times 1282260-2\times 0.617\times 442921.47\right] = 4832.64.$
A $(1-\alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\bar{y}_R \pm t_{\alpha/2}(df = n-1)\sqrt{\widehat{MSE}(\bar{y}_R)}.$
Using Table 2 from the Appendix, the 95% confidence interval for the average real
estate farm loans is given by
$541.85 \pm 2.086\sqrt{4832.64}, \quad \text{or} \quad [396.84,\ 686.86].$

3.2.2 PRODUCT ESTIMATOR

Murthy (1964) considered another estimator of the population mean $\bar{Y}$ using the known
population mean $\bar{X}$ of the auxiliary variable, the product estimator
$\bar{y}_P = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right). \quad (3.2.2.1)$

Then we have the following theorems:


146 Advanced sampling theory with applications

Theorem 3.2.2.1. The exact bias in the product estimator $\bar{y}_P$ of the population
mean $\bar{Y}$ is given by
$B(\bar{y}_P) = \left(\frac{1-f}{n}\right)\bar{Y}\rho_{xy}C_yC_x. \quad (3.2.2.2)$

Proof. The product estimator $\bar{y}_P$ in terms of $\varepsilon_0$ and $\varepsilon_1$ can easily be written as
$\bar{y}_P = \bar{Y}(1+\varepsilon_0)(1+\varepsilon_1) = \bar{Y}(1+\varepsilon_0+\varepsilon_1+\varepsilon_0\varepsilon_1). \quad (3.2.2.3)$

Taking expected values on both sides of (3.2.2.3) and using the results from section
3.1 we have
$E(\bar{y}_P) = \bar{Y}\left[1+\left(\frac{1-f}{n}\right)\rho_{xy}C_xC_y\right]. \quad (3.2.2.4)$

Thus the bias in the product estimator $\bar{y}_P$ of the population mean is given by
$B(\bar{y}_P) = E(\bar{y}_P)-\bar{Y} = \left(\frac{1-f}{n}\right)\bar{Y}\rho_{xy}C_xC_y.$
Hence the theorem.

Theorem 3.2.2.2. The mean squared error of the product estimator $\bar{y}_P$, to the first
order of approximation, is given by
$MSE(\bar{y}_P) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2+2\rho_{xy}C_yC_x\right]. \quad (3.2.2.5)$

Proof. By the definition of mean squared error (MSE), using (3.2.2.3), and again
neglecting higher order terms and using results from section 3.1 we have
$MSE(\bar{y}_P) = E\left[\bar{y}_P-\bar{Y}\right]^2 = E\left[\bar{Y}(1+\varepsilon_0+\varepsilon_1+\varepsilon_0\varepsilon_1)-\bar{Y}\right]^2 \approx \bar{Y}^2E\left[\varepsilon_0+\varepsilon_1+\varepsilon_0\varepsilon_1\right]^2.$
Thus the MSE, to the first order of approximation, is given by
$MSE(\bar{y}_P) = \bar{Y}^2E\left[\varepsilon_0^2+\varepsilon_1^2+2\varepsilon_0\varepsilon_1\right] = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2+2\rho_{xy}C_yC_x\right].$
Hence the theorem.

Theorem 3.2.2.3. An estimator of the MSE of the product estimator $\bar{y}_P$, to the first
order of approximation, is given by
$\widehat{MSE}(\bar{y}_P) = \left(\frac{1-f}{n}\right)\left[s_y^2+r^2s_x^2+2rs_{xy}\right]. \quad (3.2.2.6)$

Proof. It follows by the method of moments.

Theorem 3.2.2.4. The product estimator $\bar{y}_P$ is more efficient than the sample mean
$\bar{y}$ if
$\rho_{xy}\frac{C_y}{C_x} < -\frac{1}{2}. \quad (3.2.2.7)$

Proof. The proof follows from the fact that the product estimator $\bar{y}_P$ is more
efficient than the sample mean $\bar{y}$ if
$MSE(\bar{y}_P) < V(\bar{y}),$
or if
$\left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2+2\rho_{xy}C_xC_y\right] < \left(\frac{1-f}{n}\right)\bar{Y}^2C_y^2,$
or if
$C_x^2+2\rho_{xy}C_xC_y < 0,$
or if
$\rho_{xy} < -\frac{1}{2}\frac{C_x}{C_y}.$
Hence the theorem.

In the condition (3.2.2.7), if we assume that $C_y \approx C_x$, then it holds for all values of
the correlation coefficient $\rho_{xy}$ in the range [-1.0, -0.5). Thus we have the
following theorem.

Theorem 3.2.2.5. The product estimator $\bar{y}_P$ is more efficient than the sample mean
$\bar{y}$ if $\rho_{xy} < -0.5$, i.e., if the correlation between $X$ and $Y$ is negative and high.

Remark 3.2.2.1. We observed that the product and ratio estimators are better than the
sample mean if the value of $\rho_{xy}$ lies in the interval [-1.0, -0.5) and (+0.5, +1.0],
respectively. Thus the sample mean remains better than both the ratio and
product estimators of the population mean if $\rho_{xy}$ lies in the range [-0.5, +0.5].

Example 3.2.2.1. A psychologist would like to estimate the average duration of
sleep (in minutes) during the night for persons 50 years of age and older in a small
village in the United States. It is known that there are 30 persons living in the
village aged 50 and over. Instead of asking everybody, the psychologist selects an
SRSWOR sample of six persons of this age group and records the information
given below.

Person no.                        3    7    10   17   22   29
Age X (years)                     55   67   56   78   71   66
Duration of sleep Y (in minutes)  408  420  456  345  360  390

Assume that the average age of the subjects, 67.267 years, is known, as shown in
population 2 in the Appendix. Assuming that as the age of a person increases
the sleeping hours decrease, apply the product method of estimation for estimating
the average sleep time in the particular village under study. Also find an estimate
of the mean squared error of the product estimator and deduce a 95% confidence
interval.

Solution. From the sample information, we have

Sr. No.  $y_i$  $x_i$  $(y_i-\bar{y})^2$  $(x_i-\bar{x})^2$  $(y_i-\bar{y})(x_i-\bar{x})$
1        408    55     132.25    110.25   -120.75
2        420    67     552.25    2.25     35.25
3        456    56     3540.25   90.25    -565.25
4        345    78     2652.25   156.25   -643.75
5        360    71     1332.25   30.25    -200.75
6        390    66     42.25     0.25     -3.25
Sum      2379   393    8251.50   389.50   -1498.50

Here $y_i$ = Duration of sleep (in minutes), $x_i$ = Age of subjects ($\ge 50$ years), $n = 6$,
$\bar{y} = 396.5$, $\bar{x} = 65.5$, $s_x^2 = 77.9$, $s_y^2 = 1650.3$, $s_{xy} = -299.7$, and $r = \bar{y}/\bar{x} = 6.053$.
Also we are given $\bar{X} = 67.267$, $N = 30$ and $f = 0.20$.
Thus the product estimate of the average sleep time, $\bar{Y}$ (say), is given by
$\bar{y}_P = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) = 396.5\left(\frac{65.5}{67.267}\right) = 386.08,$
and an estimate of $MSE(\bar{y}_P)$ is given by
$\widehat{MSE}(\bar{y}_P) = \left(\frac{1-f}{n}\right)\left[s_y^2+r^2s_x^2+2rs_{xy}\right]$
$= \left(\frac{1-0.20}{6}\right)\left[1650.3+(6.053)^2\times 77.9-2\times 6.053\times 299.7\right] = 116.83.$
A $(1-\alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\bar{y}_P \pm t_{\alpha/2}(df = n-2)\sqrt{\widehat{MSE}(\bar{y}_P)}.$
Using Table 2 from the Appendix, the 95% confidence interval for the average sleeping
time is given by
$386.08 \pm 2.776\sqrt{116.83}, \quad \text{or} \quad [356.07,\ 416.08].$
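Example 3.2.2.1 can also be re-traced in a few lines (a sketch of ours, not the book's; tiny last-digit differences come from the book rounding $r$ to 6.053):

```python
import math

# Re-trace Example 3.2.2.1: product estimate (3.2.2.1), estimated MSE
# (3.2.2.6), and 95% CI.  t = 2.776 (df = n-2 = 4) is from Table 2.
y = [408, 420, 456, 345, 360, 390]     # sleep duration (minutes)
x = [55, 67, 56, 78, 71, 66]           # age (years)
N, Xbar = 30, 67.267
n = len(y)
f = n / N
ybar, xbar = sum(y) / n, sum(x) / n
s2y = sum((v - ybar) ** 2 for v in y) / (n - 1)
s2x = sum((v - xbar) ** 2 for v in x) / (n - 1)
sxy = sum((v - ybar) * (w - xbar) for v, w in zip(y, x)) / (n - 1)
r = ybar / xbar
yP = ybar * xbar / Xbar                                    # (3.2.2.1)
mse = (1 - f) / n * (s2y + r ** 2 * s2x + 2 * r * sxy)     # (3.2.2.6)
half = 2.776 * math.sqrt(mse)
ci = (yP - half, yP + half)
```

Note the `+ 2*r*sxy` term: since `sxy` is negative here (sleep falls with age), the term shrinks the estimated MSE, which is exactly when the product estimator pays off.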

Example 3.2.2.2. The duration of sleep (in minutes) and age of 30 people aged 50
and over living in a small village of the United States are given in population 2.
Suppose a psychologist selected an SRSWOR sample of six individuals to collect
the required information. Find the relative efficiency of the product estimator, for
estimating average duration of sleep using age as an auxiliary variable, with respect
to the usual estimator of the population mean.

Solution. Using the description of population 2 given in the Appendix, we have
$Y_i$ = Duration of sleep (in minutes), $X_i$ = Age of subjects ($\ge 50$ years), $N = 30$,
$\sum_{i=1}^{N} X_i = 2018$, $\sum_{i=1}^{N} Y_i = 11526$, $\bar{X} = 67.267$, $\bar{Y} = 384.2$, $S_y^2 = 3582.58$, $S_x^2 = 85.237$,
$C_y^2 = 0.0243$, $C_x^2 = 0.0188$, $S_{xy} = -472.607$, and $\rho_{xy} = -0.8552$.

Thus we have
$MSE(\bar{y}_P) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2+C_x^2+2\rho_{xy}C_xC_y\right]$
$= \left(\frac{1-0.20}{6}\right)(384.2)^2\left[0.0243+0.0188-2\times 0.8552\sqrt{0.0243\times 0.0188}\right] = 128.759.$
Also
$v(\bar{y}) = \left(\frac{1-f}{n}\right)S_y^2 = \left(\frac{1-0.20}{6}\right)\times 3582.58 = 477.677.$
Thus the percent relative efficiency (RE) of the product estimator $\bar{y}_P$ with respect
to the usual estimator $\bar{y}$ is given by
$RE = \frac{v(\bar{y})}{MSE(\bar{y}_P)}\times 100 = \frac{477.677}{128.759}\times 100 = 370.98\%,$
which shows that the product estimator is more efficient than the usual estimator of
the population mean. The relative efficiency is independent of the sample size $n$.

Corollary 3.2.2.1. The minimum sample size for the relative standard error (RSE)
to be less than or equal to a fixed value $\phi$ is given by
$n \ge \left[\frac{\phi^2\bar{Y}^2}{S_y^2+R^2S_x^2+2RS_{xy}}+\frac{1}{N}\right]^{-1}. \quad (3.2.2.8)$

3.2.3 REGRESSION ESTIMATOR

We consider an estimator of the population mean $\bar{Y}$ as
$\bar{y}_{dif} = \bar{y}+d(\bar{X}-\bar{x}), \quad (3.2.3.1)$
where $d$ is a constant to be chosen such that the variance of the estimator, $V(\bar{y}_{dif})$,
is minimum. Such an estimator is called the difference estimator. The estimator $\bar{y}_{dif}$
can be written as
$\bar{y}_{dif} = \bar{Y}(1+\varepsilon_0)-d\bar{X}\varepsilon_1. \quad (3.2.3.2)$

Taking expected values on both sides of (3.2.3.2), we obtain
$E(\bar{y}_{dif}) = \bar{Y}. \quad (3.2.3.3)$

Thus the difference estimator $\bar{y}_{dif}$ is unbiased for the population mean $\bar{Y}$. The
variance of the estimator $\bar{y}_{dif}$ is given by
$V(\bar{y}_{dif}) = E\left[\bar{y}_{dif}-\bar{Y}\right]^2 = E\left[\bar{Y}(1+\varepsilon_0)-d\bar{X}\varepsilon_1-\bar{Y}\right]^2 = E\left[\bar{Y}\varepsilon_0-d\bar{X}\varepsilon_1\right]^2$
$= E\left[\bar{Y}^2\varepsilon_0^2+d^2\bar{X}^2\varepsilon_1^2-2d\bar{Y}\bar{X}\varepsilon_0\varepsilon_1\right] = \left(\frac{1-f}{n}\right)\left[\bar{Y}^2C_y^2+d^2\bar{X}^2C_x^2-2d\bar{Y}\bar{X}\rho_{xy}C_xC_y\right]. \quad (3.2.3.4)$
On differentiating (3.2.3.4) with respect to $d$ and equating to zero, we obtain
$d = \rho_{xy}\frac{C_y}{C_x}\frac{\bar{Y}}{\bar{X}} = \frac{S_{xy}}{S_x^2}. \quad (3.2.3.5)$

On substituting the optimum value of $d$ in (3.2.3.4), the minimum variance of the
estimator $\bar{y}_{dif}$ is given by
$\mathrm{Min}\,V(\bar{y}_{dif}) = \left(\frac{1-f}{n}\right)\left[S_y^2-\frac{S_{xy}^2}{S_x^2}\right] = \left(\frac{1-f}{n}\right)S_y^2\left[1-\frac{S_{xy}^2}{S_x^2S_y^2}\right] = \left(\frac{1-f}{n}\right)S_y^2\left(1-\rho_{xy}^2\right). \quad (3.2.3.6)$

For the optimum value of $d = \rho_{xy}\frac{C_y}{C_x}\frac{\bar{Y}}{\bar{X}} = \frac{S_{xy}}{S_x^2} = \beta$ (the regression coefficient, say), the
difference estimator becomes
$\bar{y}_{dif} = \bar{y}+\left(\frac{S_{xy}}{S_x^2}\right)(\bar{X}-\bar{x}). \quad (3.2.3.7)$

Thus the difference estimator becomes non-functional if the value of the regression
coefficient $\beta = S_{xy}/S_x^2$ is unknown. In such situations, Hansen, Hurwitz, and
Madow (1953) consider the linear regression estimator of the population mean
$\bar{Y}$ as
$\bar{y}_{LR} = \bar{y}+\hat{\beta}(\bar{X}-\bar{x}), \quad (3.2.3.8)$
where $\hat{\beta} = s_{xy}/s_x^2$ denotes the estimator of the regression coefficient $\beta = S_{xy}/S_x^2$.
Then we have the following theorems:

Theorem 3.2.3.1. The bias in the linear regression estimator $\bar{y}_{LR}$ of the population
mean $\bar{Y}$ is given by
$B(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)\beta\bar{X}C_x\left(\lambda_{03}-\frac{\lambda_{12}}{\rho_{xy}}\right). \quad (3.2.3.9)$

Proof. The linear regression estimator $\bar{y}_{LR}$, in terms of $\varepsilon_0$, $\varepsilon_1$, $\varepsilon_3$ and $\varepsilon_4$, can
easily be written as
$\bar{y}_{LR} = \bar{Y}(1+\varepsilon_0)+\frac{S_{xy}(1+\varepsilon_4)}{S_x^2(1+\varepsilon_3)}\left[\bar{X}-\bar{X}(1+\varepsilon_1)\right] = \bar{Y}(1+\varepsilon_0)-\beta\bar{X}(1+\varepsilon_4)(1+\varepsilon_3)^{-1}\varepsilon_1.$
Using the binomial expansion $(1+\varepsilon_3)^{-1} = 1-\varepsilon_3+\varepsilon_3^2+O(\varepsilon_3^3)$ we obtain
$\bar{y}_{LR} = \bar{Y}(1+\varepsilon_0)-\beta\bar{X}\left[\varepsilon_1+\varepsilon_1\varepsilon_4-\varepsilon_1\varepsilon_3+O(\varepsilon^3)\right]. \quad (3.2.3.10)$
Taking expected values on both sides of (3.2.3.10) and neglecting higher order
terms, we obtain
$E(\bar{y}_{LR}) = \bar{Y}-\left(\frac{1-f}{n}\right)\beta\bar{X}\left[C_x\frac{\lambda_{12}}{\rho_{xy}}-C_x\lambda_{03}\right] = \bar{Y}+\left(\frac{1-f}{n}\right)\beta\bar{X}C_x\left[\lambda_{03}-\frac{\lambda_{12}}{\rho_{xy}}\right].$
Thus the bias is given by
$B(\bar{y}_{LR}) = E(\bar{y}_{LR})-\bar{Y} = \left(\frac{1-f}{n}\right)\beta\bar{X}C_x\left[\lambda_{03}-\frac{\lambda_{12}}{\rho_{xy}}\right].$
Hence the theorem.

Theorem 3.2.3.2. The mean squared error of the linear regression estimator $\bar{y}_{LR}$,
to the first order of approximation, is
$MSE(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)S_y^2\left(1-\rho_{xy}^2\right). \quad (3.2.3.11)$

Proof. By the definition of mean squared error (MSE) and using (3.2.3.10) and
neglecting higher order terms we have
$MSE(\bar{y}_{LR}) = E\left[\bar{y}_{LR}-\bar{Y}\right]^2 = E\left[\bar{Y}(1+\varepsilon_0)-\beta\bar{X}\left\{\varepsilon_1+\varepsilon_1\varepsilon_4-\varepsilon_1\varepsilon_3+O(\varepsilon^3)\right\}-\bar{Y}\right]^2$
$\approx E\left[\bar{Y}^2\varepsilon_0^2+\beta^2\bar{X}^2\varepsilon_1^2-2\beta\bar{Y}\bar{X}\varepsilon_0\varepsilon_1\right].$
Thus the MSE, to the first order of approximation, is given by
$MSE(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)\left[\bar{Y}^2C_y^2+\beta^2\bar{X}^2C_x^2-2\beta\bar{Y}\bar{X}\rho_{xy}C_xC_y\right]$
$= \left(\frac{1-f}{n}\right)\left[S_y^2+\frac{S_{xy}^2}{S_x^2}-2\frac{S_{xy}^2}{S_x^2}\right] = \left(\frac{1-f}{n}\right)\left[S_y^2-\frac{S_{xy}^2}{S_x^2}\right] = \left(\frac{1-f}{n}\right)S_y^2\left(1-\rho_{xy}^2\right).$
Hence the theorem.
Theorem 3.2.3.3. An estimator of the mean squared error of the linear regression
estimator $\bar{y}_{LR}$, to the first order of approximation, is given by
$\widehat{MSE}(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)s_y^2\left(1-r_{xy}^2\right). \quad (3.2.3.12)$

Proof. It follows by the method of moments.

Theorem 3.2.3.4. The linear regression estimator $\bar{y}_{LR}$ is always more efficient than
the sample mean $\bar{y}$ if $\rho_{xy} \ne 0$.

Remark 3.2.3.1. If $\hat{\beta} = \bar{y}/\bar{x}$ then the linear regression estimator $\bar{y}_{LR}$ reduces to the
usual ratio estimator $\bar{y}_R$, and if $\hat{\beta} = -\bar{y}/\bar{X}$, then the linear regression estimator $\bar{y}_{LR}$
reduces to the usual product estimator $\bar{y}_P$.

Example 3.2.3.1. A bank manager in the United States of America is interested in
estimating the average amount of real estate farm loans (in $000) during 1997. A
statistician took an SRSWOR sample of eight states from population 1 as given
in the Appendix and collected the following information.

State                           AZ       CO       DE      LA       MT       NC       VT      WA
Nonreal estate farm loans (X)   431.439  906.281  43.229  405.799  722.034  494.730  19.363  1228.607
Real estate farm loans (Y)      54.633   315.809  42.808  282.565  292.965  639.571  57.747  1100.745

Apply the regression method of estimation for estimating the average amount of the
real estate farm loans (in $000) during 1997. Also find an estimate of the mean
squared error of the regression estimator and deduce a 95% confidence interval.
Assume that the average amount $878.16 of nonreal estate farm loans (in $000) for
the year 1997 is known.
Solution. From the sample information, we have

Sr. No.  $y_i$     $x_i$     $(y_i-\bar{y})^2$  $(x_i-\bar{x})^2$  $(y_i-\bar{y})(x_i-\bar{x})$
1        54.633    431.439   86272.833580    9999.250014     29371.136040
2        315.809   906.281   1059.266526     140509.336300   -12199.870350
3        42.808    43.229    93359.198370    238345.342500   149170.138100
4        282.565   405.799   4328.373443     15784.467310    8265.656001
5        292.965   722.034   3068.093643     36327.883500    -10557.336240
6        639.571   494.730   84806.540240    1347.275378     -10689.142320
7        57.747    19.363    84453.227620    262217.989200   148812.484500
8        1100.745  1228.607  566090.147800   486048.449000   524544.791500
Sum      2786.843  4251.482  923437.681200   1190579.993000  826717.857300

Here $n = 8$, $\bar{y} = 348.3554$, $\bar{x} = 531.4353$, $s_x^2 = 170082.85$, $s_y^2 = 131919.67$,
$s_{xy} = 118102.55$, $\hat{\beta} = s_{xy}/s_x^2 = 0.6943$, and $r_{xy} = s_{xy}/(s_xs_y) = 0.7884$. Also we are
given $\bar{X} = 878.16$, $N = 50$ and $f = 0.16$.
Thus the regression estimate of the average amount of real estate farm loans during
1997, $\bar{Y}$ (say), is given by
$\bar{y}_{LR} = \bar{y}+\hat{\beta}(\bar{X}-\bar{x}) = 348.3554+0.6943\times(878.16-531.4353) = 589.08,$
and an estimate of $MSE(\bar{y}_{LR})$ is given by
$\widehat{MSE}(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)s_y^2\left[1-r_{xy}^2\right] = \left(\frac{1-0.16}{8}\right)\times 131919.67\times\left[1-(0.7884)^2\right] = 5241.78.$
A $(1-\alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\bar{y}_{LR} \pm t_{\alpha/2}(df = n-2)\sqrt{\widehat{MSE}(\bar{y}_{LR})}.$
Using Table 2 from the Appendix, the 95% confidence interval is given by
$589.08 \pm 2.447\sqrt{5241.78}, \quad \text{or} \quad [411.916,\ 766.24].$
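Example 3.2.3.1 is also easy to re-trace computationally (a sketch of ours; the last digits differ slightly from the book because the book rounds $\hat{\beta}$ to 0.6943 before use):

```python
import math

# Re-trace Example 3.2.3.1: regression estimate (3.2.3.8), estimated MSE
# (3.2.3.12), and 95% CI.  t = 2.447 (df = n-2 = 6) is from Table 2.
y = [54.633, 315.809, 42.808, 282.565, 292.965, 639.571, 57.747, 1100.745]
x = [431.439, 906.281, 43.229, 405.799, 722.034, 494.730, 19.363, 1228.607]
N, Xbar = 50, 878.16
n = len(y)
f = n / N
ybar, xbar = sum(y) / n, sum(x) / n
s2y = sum((v - ybar) ** 2 for v in y) / (n - 1)
s2x = sum((v - xbar) ** 2 for v in x) / (n - 1)
sxy = sum((v - ybar) * (w - xbar) for v, w in zip(y, x)) / (n - 1)
beta_hat = sxy / s2x                       # estimated regression coefficient
r_xy = sxy / math.sqrt(s2x * s2y)          # sample correlation
yLR = ybar + beta_hat * (Xbar - xbar)      # (3.2.3.8)
mse = (1 - f) / n * s2y * (1 - r_xy ** 2)  # (3.2.3.12)
half = 2.447 * math.sqrt(mse)
ci = (yLR - half, yLR + half)
```

With $\hat{\beta} = s_{xy}/s_x^2$ computed in full precision the estimate is within a few cents of the book's 589.08.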

Example 3.2.3.2. Suppose a bank manager selects an SRSWOR sample of eight
states to collect the required information on real estate and nonreal estate farm loans
during 1997. Find the relative efficiency of the regression estimator, for estimating
the average amount of real estate farm loans during 1997 by using data on nonreal
estate farm loans during 1997, with respect to the ratio estimator of the population
mean. The amounts of real and nonreal estate farm loans (in $000) during 1997 in
the 50 states of the United States are given in population 1 of the Appendix.

Solution. From the description of population 1 given in the Appendix we have
Ȳ = 555.43, X̄ = 878.16, S_y² = 342021.5, ρ_xy = 0.8038, and N = 50.
Also from Example 3.2.1.2 we have
MSE(ŷ_R) = 17606.39.
Now

MSE(ŷ_LR) = ((1 - f)/n) S_y² (1 - ρ_xy²) = ((1 - 0.16)/8) × 342021.5 × [1 - (0.8038)²] = 12709.55.

Thus the percent relative efficiency (RE) of the regression estimator ŷ_LR with
respect to the ratio estimator ŷ_R is given by

RE = MSE(ŷ_R) × 100 / MSE(ŷ_LR) = (17606.39 / 12709.55) × 100 = 138.53%,

which shows that the regression estimator is more efficient than the ratio estimator
of the population mean. The relative efficiency is independent of the sample size, n.
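As a quick check, the two mean squared errors and the relative efficiency can be recomputed from the quoted population parameters. A sketch; MSE(ŷ_R) = 17606.39 is simply taken from Example 3.2.1.2 as cited above:

```python
# Relative efficiency of the regression estimator over the ratio estimator,
# using the population-1 parameters quoted in Example 3.2.3.2.
n, N = 8, 50
S2_y, rho = 342021.5, 0.8038
mse_ratio = 17606.39                           # from Example 3.2.1.2

f = n / N
mse_lr = (1 - f) / n * S2_y * (1 - rho ** 2)   # MSE of the regression estimator
re = mse_ratio / mse_lr * 100                  # percent relative efficiency
print(round(mse_lr, 2), round(re, 2))
```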
Corollary 3.2.3.1. The minimum sample size for the relative standard error (RSE)
of the linear regression estimator to be less than or equal to a fixed value φ is

n ≥ [ φ²Ȳ² / {S_y²(1 - ρ_xy²)} + 1/N ]⁻¹.    (3.2.3.13)

Example 3.2.3.3. Suppose a bank manager in the United States of America is
interested in making future plans about the selection of sample size while estimating
average real estate farm loans. The manager would also like to apply the regression
method of estimation using known information about nonreal estate farm loans.
154 Advanced sampling theory with applications

What is the minimum sample size required for the relative standard error (RSE) to be
equal to 12.5%? Use the data shown in population 1 of the Appendix.

Solution. Population 1 given in the Appendix shows N = 50, Ȳ = 555.43,
S_y² = 342021.5, and ρ_xy = 0.8038. Here φ = 0.125, thus the minimum sample size is:

n ≥ [ φ²Ȳ² / {S_y²(1 - ρ_xy²)} + 1/N ]⁻¹
  = [ 0.125² × (555.43)² / {342021.5 × (1 - 0.8038²)} + 1/50 ]⁻¹ = 16.7 ≈ 17.
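The sample-size formula (3.2.3.13) is easy to script. A minimal sketch that recomputes the example and rounds the fractional result up to the next whole sample size:

```python
# Minimum sample size for a target relative standard error, eq. (3.2.3.13).
import math

phi = 0.125                      # target RSE
N, Y_bar = 50, 555.43            # population size and mean (population 1)
S2_y, rho = 342021.5, 0.8038     # variance of y and correlation with x

n_real = 1 / (phi ** 2 * Y_bar ** 2 / (S2_y * (1 - rho ** 2)) + 1 / N)
n_min = math.ceil(n_real)        # a fractional size is rounded up
print(round(n_real, 1), n_min)
```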

Example 3.2.3.4. A bank manager selects an SRSWOR sample of eighteen states
from population 1 of the Appendix and collects information about real estate farm
loans and nonreal estate farm loans. Estimate the average real estate farm loans by
using the regression method of estimation, given that the average amount of nonreal
estate farm loans in the United States is known to be equal to $878.16.
Solution. The bank manager used the 19th and 20th columns of the Pseudo-Random
Numbers (PRN) given in Table 1 of the Appendix to select the following 18 distinct
random numbers between 1 and 50: 16, 31, 50, 29, 08, 33, 19, 28, 11, 07, 27, 37,
48, 22, 24, 46, 41, and 32.

No.  State      x_i         y_i       (x_i - x̄)²      (y_i - ȳ)²    (x_i - x̄)(y_i - ȳ)
07    CT        4.373       7.130    393759.3178     88444.6782     186617.0307
08    DE       43.229      42.808    346504.6366     68496.5732     154059.6645
11    HI       38.067      40.775    352608.4687     69564.8537     156617.8679
16    KS     2580.304    1049.834   3796373.8360    555483.2696    1452178.4160
19    ME       51.539       8.849    336790.3888     87425.1840     171592.4291
22    MI      440.518     323.028     36617.6715       342.3055      -3540.3997
24    MS      549.551     627.013      6777.3141    103997.5427     -26548.5219
27    NE     3585.406    1337.852   8723342.7430   1067761.5890    3051958.4380
28    NV       16.710       5.860    378428.5240     89201.6782     183729.3102
29    NH        0.471       6.044    398671.5725     89091.8028     188463.1771
31    NM      274.035     140.582    128049.7837     26877.7990      58665.9727
32    NY      426.274     201.631     42271.9539     10587.4839      21155.4634
33    NC      494.730     639.571     18808.8729    112254.8170     -45949.8268
37    OR      571.487     114.899      3646.7642     35958.5887      11451.3097
41    SD     1692.817     413.777   1125596.9840     11935.6717     115908.3954
46    VA      188.477     321.583    196602.1805       290.9242      -7562.8256
48    WV       29.291      99.277    363108.0127     42127.3572     123680.1559
50    WY      386.479     100.964     60219.4149     41437.6914      49953.5137
Sum          11373.760    5481.477  16708178.4400   2501279.8100   5842429.5700
x - Nonreal estate farm loans, y - Real estate farm loans.

Here N = 50 and X̄ = 878.16. The above table shows n = 18, ȳ = 304.5265,
x̄ = 631.8754, s_x² = 982834.03, s_y² = 147134.11, and s_xy = 343672.33. Thus
β̂ = s_xy/s_x² = 0.3496, r_xy = 0.9037 and f = n/N = 0.36.
Thus the regression estimate of the average real estate farm loans in the United
States is

ŷ_LR = ȳ + β̂(X̄ - x̄) = 304.5265 + 0.3496 × (878.16 - 631.8754) = 390.627.

An estimate of MSE(ŷ_LR) is given by

MSE(ŷ_LR) = ((1 - f)/n) s_y² (1 - r_xy²) = ((1 - 0.36)/18) × 147134.11 × (1 - (0.9037)²) = 959.059.

A (1 - α)100% confidence interval for the population mean Ȳ is given by

ŷ_LR ± t_{α/2}(df = n - 2) √MSE(ŷ_LR).

Using Table 2 from the Appendix, the 95% confidence interval for the average real
estate farm loans is given by
390.627 ± 2.120 √959.059, or [324.973, 456.280].

Example 3.2.3.5. Consider the following population consisting of five (N = 5)
units A, B, C, D, and E, where for each one of the units in the population two
variables Y and X are measured.

Units   A    B    C    D    E
 y_i    9   11   13   16   21
 x_i   14   18   19   20   24

Do the following:
( a ) Select all possible SRSWOR samples each of n = 3 units;
( b ) Find the variance of the sample mean estimator by definition;
( c ) Find the variance of the sample mean estimator using the formula, and comment;
( d ) Find the exact mean square error of the ratio estimator by definition;
( e ) Find the approximate mean square error of the ratio estimator using the first order approximation;
( f ) Find the ratio of the approximate mean square error to the exact mean square error of the ratio estimator and comment;
( g ) Find the exact mean square error of the regression estimator by definition;
( h ) Find the approximate mean square error of the regression estimator using the first order approximation;
( i ) Find the ratio of the approximate mean square error of the regression estimator to the exact mean square error and comment;
( j ) Find the exact relative efficiency of the ratio estimator with respect to the sample mean estimator;
( k ) Find the approximate relative efficiency of the ratio estimator with respect to the sample mean estimator and comment;
( l ) Find the exact relative efficiency of the regression estimator with respect to the sample mean estimator;
( m ) Find the approximate relative efficiency of the regression estimator with respect to the sample mean and comment.
Solution. ( a ) From Chapter 1 we have the following information for this population:
Ȳ = 14, X̄ = 19, S_y² = 22, S_x² = 13, S_xy = 16.25, ρ_xy = 0.96, and β = 1.25.
The table below summarizes all possible 10 samples of n = 3 units taken from the
population of N = 5 units.

 t    ȳ_t     x̄_t    b_t   P_t(ȳ_t-Ȳ)²  ŷ_R(t)  P_t(ŷ_R(t)-Ȳ)²  ŷ_LR(t)  P_t(ŷ_LR(t)-Ȳ)²
 1   11.00   17.00   0.71    0.900      12.29      0.291         12.42      0.250
 2   12.00   17.33   1.12    0.400      13.16      0.071         13.87      0.002
 3   13.67   18.67   1.24    0.011      13.91      0.001         14.08      0.001
 4   12.67   17.67   1.05    0.177      13.62      0.014         14.07      0.000
 5   14.33   19.00   1.20    0.011      14.33      0.011         14.33      0.011
 6   15.33   19.33   1.20    0.177      15.07      0.114         14.93      0.087
 7   13.33   19.00   2.50    0.045      13.33      0.045         13.33      0.045
 8   15.00   20.33   1.65    0.100      14.02      0.000         12.81      0.143
 9   16.00   20.67   1.61    0.400      14.71      0.050         13.31      0.047
10   16.67   21.00   1.50    0.713      15.08      0.117         13.67      0.011
Sum 140.00  190.00           2.933                 0.714                    0.596

where
ȳ_t = (1/n) Σ_{i∈s} y_i,  x̄_t = (1/n) Σ_{i∈s} x_i,  b_t = s_xy/s_x²,
ŷ_R(t) = ȳ_t (X̄/x̄_t),  ŷ_LR(t) = ȳ_t + b_t(X̄ - x̄_t),
P_t = 1/C(N, n) = 1/10, and t = 1, 2, ..., 10.

Now with this information, we can answer all of the above questions:

( b ) The exact variance of the sample mean ȳ_t is given by

Exact V(ȳ_t) = Σ_{t=1}^{C(N,n)} P_t (ȳ_t - Ȳ)² = 2.933.

( c ) The variance of the sample mean ȳ_t by the formula is given by

V(ȳ_t) = ((1 - f)/n) S_y² = ((1 - 3/5)/3) × 22 = 2.933.

We can see from ( b ) and ( c ) that the exact variance and the variance by the
formula are the same.

( d ) The exact mean square error of the ratio estimator ŷ_R(t) = ȳ_t(X̄/x̄_t) is given by

Exact MSE(ŷ_R) = Σ_{t=1}^{C(N,n)} P_t (ŷ_R(t) - Ȳ)² = 0.714.

( e ) The approximate mean square error of the ratio estimator is given by

Approx. MSE(ŷ_R) = ((1 - f)/n)[S_y² + R²S_x² - 2R S_xy]
= ((1 - 3/5)/3)[22 + (14/19)² × 13 - 2 × (14/19) × 16.25] = 0.681.

( f ) The ratio of the approximate mean square error to the exact mean square error is
given by

Ratio of Mean Square Errors = Approx. MSE(ŷ_R) / Exact MSE(ŷ_R) = 0.681/0.714 = 0.953.

Note that this ratio of the mean square errors approaches unity if the sample size and
population size are such that f = n/N → 0.

( g ) The exact mean square error of the linear regression estimator
ŷ_LR(t) = ȳ_t + b_t(X̄ - x̄_t) is given by

Exact MSE(ŷ_LR) = Σ_{t=1}^{C(N,n)} P_t (ŷ_LR(t) - Ȳ)² = 0.596.

( h ) The approximate mean square error of the linear regression estimator is

Approx. MSE(ŷ_LR) = ((1 - f)/n) S_y² [1 - ρ_xy²] = ((1 - 3/5)/3) × 22 × [1 - 0.96²] = 0.230.

( i ) The ratio of the approximate mean square error to the exact mean square error of
the linear regression estimator is given by

Ratio of Mean Square Errors = Approx. MSE(ŷ_LR) / Exact MSE(ŷ_LR) = 0.230/0.596 = 0.386.

Note that, for this particular example, the ratio of the approximate mean square
error to the exact mean square error is far from one, but if f = n/N → 0 then this
ratio approaches unity.

( j ) The exact relative efficiency (RE) of the ratio estimator with respect to the
sample mean estimator is

Exact RE of the Ratio Estimator = V(ȳ_t) × 100 / Exact MSE(ŷ_R) = (2.933/0.714) × 100 = 410.78%.

( k ) The approximate relative efficiency of the ratio estimator with respect to the
sample mean estimator is

Approx. RE of the Ratio Estimator = V(ȳ_t) × 100 / Approx. MSE(ŷ_R) = (2.933/0.681) × 100 = 430.69%.

It shows that the approximate relative efficiency expression for the ratio estimator
gives a slightly higher efficiency than in reality.

( l ) The exact relative efficiency of the regression estimator with respect to the
sample mean estimator is

Exact RE of the Regression Estimator = V(ȳ_t) × 100 / Exact MSE(ŷ_LR) = (2.933/0.596) × 100 = 492.11%.

( m ) The approximate relative efficiency of the linear regression estimator with
respect to the sample mean estimator is

Approx. RE of the Regression Estimator = V(ȳ_t) × 100 / Approx. MSE(ŷ_LR) = (2.933/0.230) × 100 = 1275.21%.

It also shows that the approximate relative efficiency expression for the regression
estimator gives a higher efficiency than in reality.
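The whole of Example 3.2.3.5 can be verified by brute force: enumerate all C(5, 3) = 10 SRSWOR samples, compute the three estimators for each, and average the squared errors over the samples. A sketch whose results agree with the table above up to rounding:

```python
# Exact variance/MSE of the sample mean, ratio, and regression estimators
# over all SRSWOR samples of size n = 3 from the N = 5 unit population.
from itertools import combinations

y = [9, 11, 13, 16, 21]
x = [14, 18, 19, 20, 24]
N, n = 5, 3
Y_bar, X_bar = sum(y) / N, sum(x) / N      # 14 and 19

v_mean = mse_ratio = mse_reg = 0.0
samples = list(combinations(range(N), n))
P = 1 / len(samples)                       # each sample has probability 1/10
for s in samples:
    yb = sum(y[i] for i in s) / n
    xb = sum(x[i] for i in s) / n
    sxy = sum((y[i] - yb) * (x[i] - xb) for i in s) / (n - 1)
    sx2 = sum((x[i] - xb) ** 2 for i in s) / (n - 1)
    y_r = yb * X_bar / xb                  # ratio estimate for this sample
    y_lr = yb + (sxy / sx2) * (X_bar - xb) # regression estimate for this sample
    v_mean += P * (yb - Y_bar) ** 2
    mse_ratio += P * (y_r - Y_bar) ** 2
    mse_reg += P * (y_lr - Y_bar) ** 2

print(round(v_mean, 3), round(mse_ratio, 3), round(mse_reg, 3))
```

The exact variance 2.933 matches the formula ((1 - f)/n)S_y² exactly, while the exact mean square errors differ from their first-order approximations (0.681 and 0.230), as parts ( f ) and ( i ) discuss.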

Caution: Be careful while using the approximate expression for the mean square
error of the linear regression estimator, or the approximate expression for
estimating that mean square error. The true interval estimate of the population
mean may be wider than the one you construct from the approximate results.
Note the following graphical situations in Figure 3.2.1 for the use of the ratio,
product, and regression estimators in actual practice.

[Three panels, not reproduced here: scatter plots of the study variable (y) against
the auxiliary variable (x), one each for the situations favouring the ratio, product,
and regression estimators.]

Fig. 3.2.1 Situations for using the ratio, product, or regression estimators.

The following table collects some information about these three estimators, which
will be useful to the readers:

1. Correlation:
   Ratio estimator: the correlation between y and x must be positive and high (within +0.5 and +1.0).
   Product estimator: the correlation between y and x must be negative and high (within -1.0 and -0.5).
   Regression estimator: the correlation between y and x must be non-zero within the range [-1.0, +1.0].

2. Regression line:
   Ratio estimator: the regression line between y and x should pass through the origin.
   Product estimator: the regression line between y and x may or may not pass through the origin. Note: if the regression line of two negatively correlated variables passes through the origin, then one of the variables among y and x will be negative, which may not be practicable.
   Regression estimator: the regression line may have both parameters, viz. intercept and slope.

3. For each estimator, the approximate mean square error expression will be small if:
   ( a ) the f.p.c. f = n/N is large;
   ( b ) the sample size n is large;
   ( c ) the correlation between y and x is very close to plus one (ratio estimator), minus one (product estimator), or plus or minus one (regression estimator);
   ( d ) the error terms are small, where e_i = (Y_i - Ȳ) - R(X_i - X̄) for the ratio estimator, e_i = (Y_i - Ȳ) + R(X_i - X̄) for the product estimator, and e_i = (Y_i - Ȳ) - β(X_i - X̄) for the regression estimator.

4. For all three estimators, the usual estimator of the approximate mean square error may be low if the sample size is large, which may provide a smaller confidence interval estimate than the actual one.

5. Degrees of freedom:
   Ratio estimator: we have to estimate only one model parameter, so the degrees of freedom for constructing confidence interval estimates will be df = (n - 1).
   Product estimator: if both variables are positive (x > 0 and y > 0) but the correlation is negative, then we have both intercept and slope, and then we must use df = (n - 2).
   Regression estimator: here we have two unknown parameters, viz. intercept and slope, thus we must use df = (n - 2). More justification is given in Section 3.6.

3.2.4 POWER TRANSFORMATION ESTIMATOR

Srivastava (1967) considered another estimator of the population mean, Ȳ, using the
known population mean, X̄, of the auxiliary variable, as a power transformation
estimator given by

ŷ_PW = ȳ (x̄/X̄)^α    (3.2.4.1)

where α is a suitably chosen constant; if α = 1 then ŷ_PW reduces to the product
estimator ŷ_P, and if α = -1 then ŷ_PW reduces to the ratio estimator ŷ_R.
Then we have the following theorems.

Theorem 3.2.4.1. The bias in the power transformation estimator ŷ_PW, to the first
order of approximation, is given by

B(ŷ_PW) = ((1 - f)/n) Ȳ [ {α(α - 1)/2} C_x² + α ρ_xy C_x C_y ].    (3.2.4.2)

Proof. The power transformation estimator ŷ_PW, in terms of ε₀ and ε₁, can easily
be written as

ŷ_PW = Ȳ(1 + ε₀)(1 + ε₁)^α = Ȳ(1 + ε₀){1 + αε₁ + (α(α - 1)/2) ε₁² + o(ε₁²)}
     = Ȳ[1 + ε₀ + αε₁ + (α(α - 1)/2) ε₁² + αε₀ε₁ + o(ε²)].    (3.2.4.3)

Taking expected values on both sides of (3.2.4.3) and using results from Section
3.1, we obtain (3.2.4.2). Hence the theorem.
Theorem 3.2.4.2. The minimum mean squared error of the power transformation
estimator ŷ_PW, to the first order of approximation, is given by

Min.MSE(ŷ_PW) = ((1 - f)/n) S_y² [1 - ρ_xy²].    (3.2.4.4)

Proof. By the definition of mean squared error (MSE), using (3.2.4.3) and again
neglecting the higher order terms, we have

MSE(ŷ_PW) = E[ŷ_PW - Ȳ]² = E[Ȳ(1 + ε₀ + αε₁ + o(ε)) - Ȳ]² = Ȳ² E[ε₀² + α²ε₁² + 2αε₀ε₁].

Thus the MSE, to the first order of approximation, is given by

MSE(ŷ_PW) = ((1 - f)/n) Ȳ² [C_y² + α²C_x² + 2α ρ_xy C_x C_y].    (3.2.4.5)

On differentiating (3.2.4.5) with respect to α and equating to zero, we obtain the
optimum value of α as

α = -ρ_xy C_y / C_x.    (3.2.4.6)

On substituting the optimum value of α in (3.2.4.5) we obtain

Min.MSE(ŷ_PW) = ((1 - f)/n) S_y² [1 - ρ_xy²].    (3.2.4.7)

Hence the theorem.

The power α depends upon the optimum values of unknown parameters, so the
estimator ŷ_PW is not practicable as it stands. Thus we have the following corollary.

Corollary 3.2.4.1. A practically useful power transformation type estimator
ŷ_PW(pract) of the population mean Ȳ is given by

ŷ_PW(pract) = ȳ (x̄/X̄)^α̂    (3.2.4.8)

where
α̂ = -(x̄ s_xy)/(ȳ s_x²)
is a consistent estimator of α. Note that while making a confidence interval estimate
with the power transformation estimator the degrees of freedom will be (n - 2).

Remark 3.2.4.1. The difference estimator ŷ_dif of the population mean, Ȳ, given as

ŷ_dif = ȳ + d(X̄ - x̄)    (3.2.4.9)

has variance equal to the mean squared error of the linear regression estimator for
the optimum value d = S_xy/S_x² = β. Again note that the degrees of freedom for
constructing confidence interval estimates will be df = (n - 1), because the slope is
assumed to be known, while the intercept is estimated.
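A sketch of the practically useful estimator of Corollary 3.2.4.1, applied to the sample of Example 3.2.3.1 (the data and the known X̄ are borrowed from that example; the numerical result below is an illustration, not a figure from the book):

```python
# Power transformation estimator y_pw = y_bar * (x_bar / X_bar)**alpha_hat,
# with alpha_hat = -(x_bar * s_xy) / (y_bar * s_x^2), eq. (3.2.4.8).
y = [54.633, 315.809, 42.808, 282.565, 292.965, 639.571, 57.747, 1100.745]
x = [431.439, 906.281, 43.229, 405.799, 722.034, 494.730, 19.363, 1228.607]
n, X_bar = len(y), 878.16

y_bar, x_bar = sum(y) / n, sum(x) / n
s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_xy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (n - 1)

alpha_hat = -(x_bar * s_xy) / (y_bar * s2_x)   # consistent estimate of alpha
y_pw = y_bar * (x_bar / X_bar) ** alpha_hat
print(round(alpha_hat, 3), round(y_pw, 2))
```

Here α̂ is negative, as expected for positively correlated variables, and the resulting estimate lands close to the regression estimate 589.08 of Example 3.2.3.1.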

3.2.5 A DUAL OF RATIO ESTIMATOR

Srivenkataramana and Tracy (1980) considered the following estimator of the
population mean Ȳ, based on the mean value of the non-sampled information of the
auxiliary variable, defined as

ŷ_nsu = ȳ (NX̄ - nx̄) / {(N - n)X̄}    (3.2.5.1)

or

ŷ_nsu = ȳ (x̄*/X̄)    (3.2.5.2)

where x̄* = (NX̄ - nx̄)/(N - n) = {1/(N - n)} Σ_{i=1}^{N-n} x_i denotes the mean of the
non-sampled units of the auxiliary variable.

Then we have the following theorems.

Theorem 3.2.5.1. The estimator ŷ_nsu is an inconsistent estimator of the population
mean, Ȳ.

Proof. The estimator ŷ_nsu in terms of ε₀ and ε₁ can be written as

ŷ_nsu = ȳ (NX̄ - nx̄)/{(N - n)X̄} = Ȳ(1 + ε₀){NX̄ - nX̄(1 + ε₁)}/{(N - n)X̄}
      = Ȳ(1 + ε₀){1 - (n/(N - n))ε₁}
      = Ȳ{1 + ε₀ - (n/(N - n))ε₁ - (n/(N - n))ε₀ε₁}.

Taking expected values on both sides we have

E(ŷ_nsu) = Ȳ E{1 + ε₀ - (n/(N - n))ε₁ - (n/(N - n))ε₀ε₁}
         = Ȳ - {n/(N - n)} × {(1 - f)/n} Ȳ ρ_xy C_x C_y = Ȳ - Ȳ ρ_xy C_x C_y / N.

Thus the bias in the estimator ŷ_nsu is given by

B(ŷ_nsu) = E(ŷ_nsu) - Ȳ = -Ȳ ρ_xy C_x C_y / N = -S_xy/(N X̄),

which proves the theorem.

Theorem 3.2.5.2. The mean squared error of the estimator ŷ_nsu is given by

MSE(ŷ_nsu) = ((1 - f)/n) Ȳ² [C_y² + g²C_x² - 2g ρ_xy C_x C_y]    (3.2.5.3)

where g = n/(N - n).

Proof. Neglecting the product term ε₀ε₁, we have

MSE(ŷ_nsu) = E[ŷ_nsu - Ȳ]² = E[Ȳ{1 + ε₀ - (n/(N - n))ε₁ - (n/(N - n))ε₀ε₁} - Ȳ]²
≈ Ȳ² E[ε₀ - (n/(N - n))ε₁]²
= Ȳ² E[ε₀² + (n/(N - n))² ε₁² - 2(n/(N - n)) ε₀ε₁]
= ((1 - f)/n) Ȳ² [C_y² + (n/(N - n))² C_x² - 2(n/(N - n)) ρ_xy C_x C_y]
= ((1 - f)/n) Ȳ² [C_y² + g²C_x² - 2g ρ_xy C_x C_y].

Hence the theorem.

Theorem 3.2.5.3. The estimator ŷ_nsu is more efficient than the ratio estimator ŷ_R if

n < N/2 and ρ_xy < N/{2(N - n)},    (3.2.5.4)

assuming that the correlation coefficient ρ_xy is positive.

Proof. The estimator ŷ_nsu will be more efficient than the ratio estimator ŷ_R if

MSE(ŷ_nsu) < MSE(ŷ_R)

or ((1 - f)/n) Ȳ² [C_y² + g²C_x² - 2g ρ_xy C_x C_y] < ((1 - f)/n) Ȳ² [C_y² + C_x² - 2 ρ_xy C_x C_y]

or (g² - 1)C_x² - 2(g - 1) ρ_xy C_x C_y < 0, or (g - 1)(g + 1)C_x² - 2(g - 1) ρ_xy C_x C_y < 0,

or (g - 1)[(g + 1)C_x² - 2 ρ_xy C_x C_y] < 0.    (3.2.5.5)

Now there are two cases:

Case 1. The inequality (3.2.5.5) will be satisfied if

g - 1 < 0 and (g + 1)C_x² - 2 ρ_xy C_x C_y > 0

or n/(N - n) - 1 < 0 and (g + 1)C_x > 2 ρ_xy C_y

or (n - N + n)/(N - n) < 0 and ρ_xy C_y/C_x < (g + 1)/2

or n < N/2 and ρ_xy C_y/C_x < (n + N - n)/{2(N - n)} = N/{2(N - n)}.

For C_y ≈ C_x we have n < N/2 and ρ_xy < N/{2(N - n)}.

This condition holds in practice. For example, if N = 100 and n = 30 then ρ_xy is
supposed to be less than 0.714.

Case 2. The inequality (3.2.5.5) will be satisfied if

g - 1 > 0 and (g + 1)C_x² - 2 ρ_xy C_x C_y < 0

or n/(N - n) - 1 > 0 and (g + 1)C_x < 2 ρ_xy C_y

or (n - N + n)/(N - n) > 0 and ρ_xy C_y/C_x > (g + 1)/2

or n > N/2 and ρ_xy C_y/C_x > (n + N - n)/{2(N - n)} = N/{2(N - n)}.

For C_y ≈ C_x we have n > N/2 and ρ_xy > N/{2(N - n)}.

This condition will not hold in practice. For example, if N = 100 and n = 70 then the
value of ρ_xy needs to be more than 1.667, which is not possible. Hence the
theorem.
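Theorem 3.2.5.3 can be checked numerically. This sketch compares the two first-order mean squared error expressions for a hypothetical population with C_y = C_x = 1, N = 100 and n = 30, where the threshold is N/{2(N - n)} = 0.714:

```python
# Compare MSE(y_nsu), eq. (3.2.5.3), with the usual ratio-estimator MSE,
# dropping the common factor ((1 - f)/n) * Y_bar^2. C_x = C_y = 1 is assumed.
def mse_terms(rho, n, N, cy=1.0, cx=1.0):
    g = n / (N - n)
    nsu = cy ** 2 + g ** 2 * cx ** 2 - 2 * g * rho * cx * cy
    ratio = cy ** 2 + cx ** 2 - 2 * rho * cx * cy
    return nsu, ratio

threshold = 100 / (2 * (100 - 30))           # = 0.714...
low = mse_terms(rho=0.6, n=30, N=100)        # rho below the threshold
high = mse_terms(rho=0.9, n=30, N=100)       # rho above the threshold
print(round(threshold, 3), low[0] < low[1], high[0] > high[1])
```

With ρ = 0.6 the dual estimator beats the ratio estimator; with ρ = 0.9 it does not, exactly as the theorem predicts.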

3.2.6 GENERAL CLASS OF ESTIMATORS

Srivastava (1971) proposed a general class of estimators to estimate the population
mean Ȳ of the study variable, which in the case of a single known mean X̄ of the
auxiliary variable is given by

t_g = ȳ H(u)    (3.2.6.1)

where u = x̄/X̄ and H(·) is a parametric function satisfying the following
conditions:

( a ) H(1) = 1;    (3.2.6.2)
( b ) The first and second order partial derivatives of H with respect to u exist
and are known constants at the given point u = 1.

Expanding H(u) about the value 1 in a second order Taylor's series we have

H(u) = H[1 + (u - 1)] = H(1) + (u - 1) ∂H/∂u |_{u=1} + {(u - 1)²/2} ∂²H/∂u² |_{u=1} + ...    (3.2.6.3)

Note that |u - 1| < 1, thus the higher order terms can be neglected. Using (3.2.6.2)
and (3.2.6.3) in (3.2.6.1) we obtain

t_g = ȳ[1 + (u - 1)H₁ + (u - 1)²H₂ + ...]    (3.2.6.4)

where H₁ = ∂H/∂u |_{u=1} and H₂ = (1/2) ∂²H/∂u² |_{u=1} denote the first and second order
partial derivatives of H with respect to u and are known constants. Evidently the class
of estimators t_g given at (3.2.6.4) can easily be written in terms of ε₀ and ε₁ as

t_g = Ȳ(1 + ε₀)[1 + ε₁H₁ + ε₁²H₂ + ...]
    = Ȳ[1 + ε₀ + ε₁H₁ + ε₁²H₂ + ε₀ε₁H₁ + o(ε²)].    (3.2.6.5)

Thus we have the following theorems:

Theorem 3.2.6.1. The bias in the general class of estimators t_g defined at (3.2.6.1),
to the first order of approximation, is

B(t_g) = ((1 - f)/n) Ȳ [H₂C_x² + H₁ρ_xy C_y C_x].    (3.2.6.6)

Proof. Taking the expected value on both sides of (3.2.6.5) we obtain

E(t_g) = Ȳ[1 + ((1 - f)/n)(H₂C_x² + H₁ρ_xy C_y C_x)].

Thus the bias in the class of estimators t_g is given by

B(t_g) = E(t_g) - Ȳ = ((1 - f)/n) Ȳ [H₂C_x² + H₁ρ_xy C_y C_x].

Hence the theorem.

Theorem 3.2.6.2. The minimum mean squared error of the general class of
estimators t_g defined at (3.2.6.1), to the first order of approximation, is given by

Min.MSE(t_g) = ((1 - f)/n) Ȳ² C_y² (1 - ρ_xy²).    (3.2.6.7)

Proof. By the definition of the mean squared error we have

MSE(t_g) = E[t_g - Ȳ]² ≈ E[Ȳ{1 + ε₀ + ε₁H₁ + ε₁²H₂ + ...} - Ȳ]².

Again neglecting higher order terms we have

MSE(t_g) = Ȳ² E[ε₀² + H₁²ε₁² + 2H₁ε₀ε₁]
         = ((1 - f)/n) Ȳ² [C_y² + H₁²C_x² + 2H₁ρ_xy C_y C_x].    (3.2.6.8)

On differentiating (3.2.6.8) with respect to H₁ and equating to zero we obtain

H₁ = -ρ_xy C_y/C_x.    (3.2.6.9)

On substituting the optimum value of H₁ from (3.2.6.9) in (3.2.6.8) we obtain
(3.2.6.7). Hence the theorem.

Remark 3.2.6.1. If we attach any function of x̄/X̄ to the sample mean ȳ, the
asymptotic minimum mean squared error of the resultant estimator cannot be
reduced below that given in (3.2.6.7). Thus the usual ratio estimator, product
estimator, and power transformation estimator are special cases of the class of
estimators defined in (3.2.6.1).

3.2.7 WIDER CLASS OF ESTIMATORS

One may note here that the regression estimator and difference estimator are not special
cases of the general class of estimators defined in (3.2.6.1). Srivastava (1980)
defined another class of estimators, named a wider class of estimators, as

t_w = H[ȳ, u]    (3.2.7.1)

where H[ȳ, u] is a function of ȳ and u satisfying the following regularity
conditions:
( a ) The point (ȳ, u) assumes values in a closed convex subset R₂ of two-
dimensional real space containing the point (Ȳ, 1);
( b ) The function H(ȳ, u) is continuous and bounded in R₂;
( c ) H(Ȳ, 1) = Ȳ and H₀(Ȳ, 1) = 1, where H₀(Ȳ, 1) denotes the first order partial
derivative of H with respect to ȳ;
( d ) The first and second order partial derivatives of H(ȳ, u) exist and are
continuous and bounded in R₂.

Expanding H(ȳ, u) about the point (Ȳ, 1) in a second order Taylor series we have

t_w = H(ȳ, u) = H[Ȳ + (ȳ - Ȳ), 1 + (u - 1)]
    = H(Ȳ, 1) + (ȳ - Ȳ) ∂H/∂ȳ |_{(Ȳ,1)} + (u - 1) ∂H/∂u |_{(Ȳ,1)} + (u - 1)² (1/2) ∂²H/∂u² |_{(Ȳ,1)}
      + (ȳ - Ȳ)² (1/2) ∂²H/∂ȳ² |_{(Ȳ,1)} + (ȳ - Ȳ)(u - 1) (1/2) ∂²H/∂ȳ∂u |_{(Ȳ,1)} + ...    (3.2.7.2)

Using the above regularity conditions and ∂H/∂ȳ |_{(Ȳ,1)} = 1, we have

t_w = ȳ + (u - 1)H₁ + (u - 1)²H₂ + (ȳ - Ȳ)(u - 1)H₃ + (ȳ - Ȳ)²H₄ + ...    (3.2.7.3)

where

H₁ = ∂H/∂u |_{(Ȳ,1)}, H₂ = (1/2) ∂²H/∂u² |_{(Ȳ,1)}, H₃ = (1/2) ∂²H/∂ȳ∂u |_{(Ȳ,1)},
and H₄ = (1/2) ∂²H/∂ȳ² |_{(Ȳ,1)}.
Thus we have the following theorems.
Theorem 3.2.7.1. The asymptotic bias in the wider class of estimators t_w of the
population mean Ȳ is:

B(t_w) = ((1 - f)/n)[Ȳρ_xy C_x C_y H₃ + C_x²H₂ + Ȳ²C_y²H₄].    (3.2.7.4)

Proof. The wider class of estimators t_w, in terms of ε₀ and ε₁, can easily be
written as

t_w = Ȳ(1 + ε₀) + ε₁H₁ + ε₁²H₂ + Ȳε₀ε₁H₃ + Ȳ²ε₀²H₄ + ...    (3.2.7.5)

Taking expected values on both sides of (3.2.7.5) and using the definition of bias,
we obtain (3.2.7.4). Hence the theorem.

Theorem 3.2.7.2. The minimum mean squared error of the wider class of
estimators, t_w, is given by

Min.MSE(t_w) = ((1 - f)/n) Ȳ² C_y² (1 - ρ_xy²).    (3.2.7.6)

Proof. By the definition of mean squared error, we have

MSE(t_w) = E[t_w - Ȳ]² ≈ E[ȳ + H₁ε₁ + o(ε) - Ȳ]² = E[Ȳε₀ + H₁ε₁]²
         = E[Ȳ²ε₀² + H₁²ε₁² + 2H₁Ȳε₀ε₁]
         = ((1 - f)/n)[Ȳ²C_y² + H₁²C_x² + 2H₁Ȳρ_xy C_x C_y].    (3.2.7.7)

On differentiating (3.2.7.7) with respect to H₁ and equating to zero we obtain

H₁ = -ρ_xy Ȳ C_y / C_x.    (3.2.7.8)

On substituting (3.2.7.8) in (3.2.7.7), we obtain (3.2.7.6). Hence the theorem.

Remark 3.2.7.1. If we use any function of ȳ, x̄ and X̄ to estimate the population
mean, Ȳ, the asymptotic minimum mean squared error of the resultant estimator
again cannot be reduced below that given in (3.2.6.7). Thus the usual linear
regression estimator and difference estimator are special cases of the wider class of
estimators defined at (3.2.7.1).

3.2.8 USE OF KNOWN VARIANCE OF THE AUXILIARY VARIABLE AT
ESTIMATION STAGE OF POPULATION MEAN

In this section, we will show that the known variance of the auxiliary variable can
also be used as a benchmark, in addition to the known population total or mean of
the auxiliary variable, to improve the estimators of the finite population mean of the
study variable under certain circumstances.

3.2.8.1 A CLASS OF ESTIMATORS


Srivastava and Jhajj (1981) introduced the use of the known value of the variance of
the auxiliary variable to improve the efficiency of the estimators of the population
mean. They considered a general class of ratio type estimators as

ŷ_SJ = ȳ H(u, v)    (3.2.8.1)

where u = x̄/X̄, v = s_x²/S_x², and H(u, v) is a function of u and v such that:
( a ) The point (u, v) assumes values in a closed convex subset R₂ of two-
dimensional real space containing the point (1, 1);
( b ) The function H(u, v) is continuous and bounded in R₂;
( c ) H(1, 1) = 1;
( d ) The first and second order partial derivatives of H(u, v) exist and are continuous
and bounded in R₂.

Thus all ratio and product type estimators of the population mean Ȳ defined as

ȳ₁ = ȳ (X̄/x̄)(S_x²/s_x²),  ȳ₂ = ȳ {X̄/(a x̄ + (1 - a)X̄)} {S_x²/(γ s_x² + (1 - γ)S_x²)},
and  ȳ₃ = ȳ (X̄/x̄)^a (S_x²/s_x²)^γ

are special cases of the class of estimators defined in (3.2.8.1).

Expanding H(u, v) about the point (1, 1) in a second order Taylor's series we obtain

ŷ_SJ = ȳ H(u, v) = ȳ H[1 + (u - 1), 1 + (v - 1)]
     ≈ ȳ[ H(1, 1) + (u - 1) ∂H/∂u |_{(1,1)} + (v - 1) ∂H/∂v |_{(1,1)} + (u - 1)² (1/2) ∂²H/∂u² |_{(1,1)}
        + (v - 1)² (1/2) ∂²H/∂v² |_{(1,1)} + (u - 1)(v - 1) (1/2) ∂²H/∂u∂v |_{(1,1)} + ... ].

Writing ε₁ = u - 1 and ε₃ = v - 1, this becomes

ŷ_SJ = Ȳ(1 + ε₀)[1 + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + ...]
     ≈ Ȳ[1 + ε₀ + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + ε₀ε₁H₁ + ε₀ε₃H₂ + ...]    (3.2.8.2)

where
H₁ = ∂H/∂u |_{(1,1)}, H₂ = ∂H/∂v |_{(1,1)}, H₃ = (1/2) ∂²H/∂u² |_{(1,1)},
H₄ = (1/2) ∂²H/∂v² |_{(1,1)}, and H₅ = (1/2) ∂²H/∂u∂v |_{(1,1)}.
Thus we have the following theorems:

Theorem 3.2.8.1. The asymptotic bias in the class of ratio type estimators ŷ_SJ is

B(ŷ_SJ) = ((1 - f)/n) Ȳ [C_x²H₃ + (λ₀₄ - 1)H₄ + C_xλ₀₃H₅ + ρ_xy C_y C_x H₁ + C_yλ₁₂H₂].    (3.2.8.3)

Proof. Taking expected values on both sides of (3.2.8.2) we have

E(ŷ_SJ) = Ȳ E[1 + ε₀ + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + ε₀ε₁H₁ + ε₀ε₃H₂ + ...]
        = Ȳ[1 + 0 + 0·H₁ + 0·H₂ + E(ε₁²)H₃ + E(ε₃²)H₄ + E(ε₁ε₃)H₅ + E(ε₀ε₁)H₁ + E(ε₀ε₃)H₂].

Thus the bias is given by

B(ŷ_SJ) = E(ŷ_SJ) - Ȳ = ((1 - f)/n) Ȳ [C_x²H₃ + (λ₀₄ - 1)H₄ + C_xλ₀₃H₅ + ρ_xy C_y C_x H₁ + C_yλ₁₂H₂].

Hence the theorem.

Theorem 3.2.8.2. The minimum MSE of the class of estimators ŷ_SJ is given by

Min.MSE(ŷ_SJ) = ((1 - f)/n) Ȳ² C_y² [1 - ρ_xy² - (λ₀₃ρ_xy - λ₁₂)²/(λ₀₄ - 1 - λ₀₃²)].    (3.2.8.4)

Proof. By the definition of the mean squared error we have

MSE(ŷ_SJ) = E[ŷ_SJ - Ȳ]² = Ȳ² E[ε₀ + ε₁H₁ + ε₃H₂ + o(ε)]²
          = Ȳ² E[ε₀² + ε₁²H₁² + ε₃²H₂² + 2ε₀ε₁H₁ + 2ε₀ε₃H₂ + 2ε₁ε₃H₁H₂]
          = ((1 - f)/n) Ȳ² [C_y² + C_x²H₁² + (λ₀₄ - 1)H₂² + 2ρ_xy C_x C_y H₁
            + 2C_yλ₁₂H₂ + 2C_xλ₀₃H₁H₂].    (3.2.8.5)

On differentiating (3.2.8.5) with respect to H₁ and H₂ and equating to zero,
respectively, we obtain

H₁C_x + H₂λ₀₃ = -ρ_xy C_y, and H₁C_xλ₀₃ + H₂(λ₀₄ - 1) = -C_yλ₁₂.    (3.2.8.6)

Solving the equations in (3.2.8.6) for H₁ and H₂ we obtain

H₁ = -C_y{ρ_xy(λ₀₄ - 1) - λ₁₂λ₀₃}/{C_x(λ₀₄ - 1 - λ₀₃²)},
and H₂ = -C_y{λ₁₂ - ρ_xyλ₀₃}/(λ₀₄ - 1 - λ₀₃²).    (3.2.8.7)

On substituting these optimum values of H₁ and H₂ in (3.2.8.5) we obtain (3.2.8.4).
Hence the theorem.
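A numerical sanity check of (3.2.8.5) through (3.2.8.7): plug the optimal H₁ and H₂ back into the quadratic MSE form and confirm that it equals the closed-form minimum (3.2.8.4), and that perturbing them only increases it. The moment values are those of population 1 quoted in Example 3.2.8.1 below; C_x is set to a hypothetical 1.0, since it cancels out of the minimum:

```python
# Check that (H1, H2) of eq. (3.2.8.7) minimizes the MSE form (3.2.8.5)
# and attains the closed form (3.2.8.4). Factor ((1-f)/n)*Y^2 is dropped.
import math

rho, l03, l04, l12 = 0.8038, 1.5936, 4.5247, 1.0982   # population-1 moments
cy, cx = math.sqrt(1.1086), 1.0                        # cx is hypothetical

def mse_form(h1, h2):
    # bracketed part of eq. (3.2.8.5)
    return (cy**2 + cx**2 * h1**2 + (l04 - 1) * h2**2
            + 2 * rho * cx * cy * h1 + 2 * cy * l12 * h2
            + 2 * cx * l03 * h1 * h2)

d = l04 - 1 - l03**2
h1_opt = -cy * (rho * (l04 - 1) - l12 * l03) / (cx * d)
h2_opt = -cy * (l12 - rho * l03) / d

closed = cy**2 * (1 - rho**2 - (l03 * rho - l12)**2 / d)   # eq. (3.2.8.4)
at_opt = mse_form(h1_opt, h2_opt)
worse = mse_form(h1_opt + 0.1, h2_opt - 0.1)               # any perturbation
print(abs(at_opt - closed) < 1e-9, worse > at_opt)
```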

3.2.8.2 A WIDER CLASS OF ESTIMATORS

Srivastava and Jhajj (1981) also considered a wider class of estimators of the
population mean Ȳ as

ŷ_SJ(w) = H(ȳ, u, v)    (3.2.8.8)

where u = x̄/X̄, v = s_x²/S_x², and H(ȳ, u, v) is a function of ȳ, u and v, such that:

( a ) The point (ȳ, u, v) assumes values in a closed convex subset R₃ of three-
dimensional real space containing the point (Ȳ, 1, 1);

( b ) The function H(ȳ, u, v) is continuous and bounded in R₃;
( c ) H(Ȳ, 1, 1) = Ȳ and the first order partial derivative of H with respect to ȳ
equals one at the point (Ȳ, 1, 1);
( d ) The first and second order partial derivatives of H(ȳ, u, v) exist, and are
continuous, and bounded in R₃.

Expanding H(ȳ, u, v) about the point (Ȳ, 1, 1) in a second order Taylor's series,
and using ∂H/∂ȳ |_{(Ȳ,1,1)} = 1, we have

ŷ_SJ(w) = ȳ + (u - 1)H₁ + (v - 1)H₂ + (ȳ - Ȳ)²H₃ + (u - 1)²H₄ + (v - 1)²H₅
          + (ȳ - Ȳ)(u - 1)H₆ + (u - 1)(v - 1)H₇ + (v - 1)(ȳ - Ȳ)H₈ + ...    (3.2.8.9)

where
H₁ = ∂H/∂u |_{(Ȳ,1,1)}, H₂ = ∂H/∂v |_{(Ȳ,1,1)}, H₃ = (1/2) ∂²H/∂ȳ² |_{(Ȳ,1,1)},
H₄ = (1/2) ∂²H/∂u² |_{(Ȳ,1,1)}, H₅ = (1/2) ∂²H/∂v² |_{(Ȳ,1,1)}, H₆ = (1/2) ∂²H/∂ȳ∂u |_{(Ȳ,1,1)},
H₇ = (1/2) ∂²H/∂u∂v |_{(Ȳ,1,1)}, and H₈ = (1/2) ∂²H/∂v∂ȳ |_{(Ȳ,1,1)}.

Thus we have the following theorems:

Theorem 3.2.8.3. The asymptotic bias in the wider class of estimators ŷ_SJ(w) is

B(ŷ_SJ(w)) = ((1 - f)/n)[Ȳ²C_y²H₃ + C_x²H₄ + (λ₀₄ - 1)H₅ + Ȳρ_xy C_y C_x H₆
             + C_xλ₀₃H₇ + ȲC_yλ₁₂H₈].    (3.2.8.10)

Proof. It follows by taking expected values on both sides of (3.2.8.9).

Theorem 3.2.8.4. The minimum MSE of the class of estimators ŷ_SJ(w) is given by

Min.MSE(ŷ_SJ(w)) = ((1 - f)/n) Ȳ² C_y² [1 - ρ_xy² - (λ₀₃ρ_xy - λ₁₂)²/(λ₀₄ - 1 - λ₀₃²)].    (3.2.8.11)
Proof. By the definition of the mean squared error we have

MSE(ŷ_SJ(w)) = E[ŷ_SJ(w) - Ȳ]² = E[Ȳε₀ + ε₁H₁ + ε₃H₂]²
= ((1 - f)/n)[Ȳ²C_y² + C_x²H₁² + (λ₀₄ - 1)H₂² + 2Ȳρ_xy C_x C_y H₁
  + 2ȲC_yλ₁₂H₂ + 2C_xλ₀₃H₁H₂].    (3.2.8.12)

On differentiating (3.2.8.12) with respect to H₁ and H₂ and equating to zero,
respectively, we obtain

H₁C_x + H₂λ₀₃ = -Ȳρ_xy C_y,    (3.2.8.13)
and
H₁C_xλ₀₃ + H₂(λ₀₄ - 1) = -ȲC_yλ₁₂.    (3.2.8.14)

Solving (3.2.8.13) and (3.2.8.14) for H₁ and H₂ we have

H₁ = -ȲC_y{ρ_xy(λ₀₄ - 1) - λ₁₂λ₀₃}/{C_x(λ₀₄ - 1 - λ₀₃²)},
and H₂ = -ȲC_y{λ₁₂ - ρ_xyλ₀₃}/(λ₀₄ - 1 - λ₀₃²).    (3.2.8.15)

On substituting these optimum values of H₁ and H₂ in (3.2.8.12), we obtain
(3.2.8.11). Hence the theorem.

Remark 3.2.8.1.

( a ) The difference type estimator ŷ_d = ȳ + γ₁(X̄ - x̄) + γ₂(S_x² - s_x²), where γ₁ and
γ₂ are real constants, is a special case of the wider class of estimators. Sahoo,
Sahoo, and Espejo (1998) have presented an empirical investigation of the
performance of five strategies for estimating the finite population mean using
parameters such as the mean or variance, or both, of an auxiliary variable. They
considered the problems of comparison of bias, efficiency, and approach to
normality or asymmetry.

( b ) The asymptotic minimum mean squared error of the ratio type class and the
wider class of estimators remains the same.

( c ) Note that λ₁₂ and λ₀₃ are odd ordered moments. In case X and Y follow a
bivariate normal distribution, both λ₁₂ and λ₀₃ are zero. In such situations the
minimum mean squared error of the class of estimators proposed by Srivastava and
Jhajj (1981) reduces to the mean squared error of the usual linear regression
estimator. Thus there is no advantage in using the known variance of the auxiliary
variable in the construction of the estimator of the population mean Ȳ if the joint
distribution of the study variable Y and auxiliary variable X is bivariate normal.
Advanced sampling theory with applications

(d) As the number of known parameters of an auxiliary variable used in the construction of ratio or regression type estimators increases, the mean squared error of the resultant estimators no doubt decreases, but note that at the same time the stability of the estimators decreases.

(e) There are a large number of estimators belonging to the same class of estimators with the same minimum asymptotic mean squared error, so it is difficult to select an estimator for a particular survey, and there is no theoretical technique available in the literature to select one.

Example 3.2.8.1. Use the information given in population I of the Appendix to show the relative efficiency of the general class of estimators over the linear regression estimator while estimating real estate farm loans with the help of known nonreal estate farm loans.

Solution. From the description of population I given in the Appendix we have
$\bar{Y} = 555.43$, $\bar{X} = 878.16$, $C_y^2 = 1.1086$, $\lambda_{03} = 1.5936$, $\rho_{xy} = 0.8038$, $N = 50$, $n = 8$,
$\lambda_{12} = 1.0982$, and $\lambda_{04} = 4.5247$.

Now
$\mathrm{Min.MSE}(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right)\bar{Y}^2C_y^2\left(1 - \rho_{xy}^2\right) = \left(\frac{1 - 0.16}{8}\right)(555.43)^2 \times 1.1086 \times \left(1 - 0.8038^2\right) = 12709.55,$

and the value of the minimum $\mathrm{MSE}(\bar{y}_{SJ})$ is given by

$\mathrm{Min.MSE}(\bar{y}_{SJ}) = \left(\frac{1-f}{n}\right)\bar{Y}^2C_y^2\left[1 - \rho_{xy}^2 - \frac{(\lambda_{03}\rho_{xy} - \lambda_{12})^2}{\lambda_{04} - 1 - \lambda_{03}^2}\right]$
$= \left(\frac{1 - 0.16}{8}\right)(555.43)^2 \times 1.1086 \times \left[1 - (0.8038)^2 - \frac{(1.5936 \times 0.8038 - 1.0982)^2}{4.5247 - 1 - (1.5936)^2}\right] = 11491.74.$

Thus the percent relative efficiency (RE) of the general class of estimators, $\bar{y}_{SJ}$, with respect to the linear regression estimator, $\bar{y}_{LR}$, is given by

$\mathrm{RE} = \frac{\mathrm{Min.MSE}(\bar{y}_{LR})}{\mathrm{Min.MSE}(\bar{y}_{SJ})} \times 100 = \frac{12709.55}{11491.74} \times 100 = 110.59\%.$

It should be noted that in this case the relative efficiency is independent of the sample size $n$.
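The arithmetic of this example can be replayed in a few lines of Python; this is only a sketch using the constants quoted above:

```python
# Relative efficiency of the Srivastava--Jhajj class over the linear
# regression estimator, Example 3.2.8.1 (population I constants).
Ybar, Cy2, rho = 555.43, 1.1086, 0.8038
lam03, lam12, lam04 = 1.5936, 1.0982, 4.5247
N, n = 50, 8
theta = (1 - n / N) / n                  # (1 - f)/n with f = n/N = 0.16

mse_lr = theta * Ybar ** 2 * Cy2 * (1 - rho ** 2)
extra = (lam03 * rho - lam12) ** 2 / (lam04 - 1 - lam03 ** 2)
mse_sj = theta * Ybar ** 2 * Cy2 * (1 - rho ** 2 - extra)
re = 100 * mse_lr / mse_sj               # percent relative efficiency
```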

The next section of this chapter is devoted to constructing unbiased ratio and product type estimators of the population mean. We will discuss Quenouille's method, the interpenetrating sampling method, exactly unbiased ratio and product type estimators, and bias filtration techniques.

3.2.9 METHODS TO REMOVE BIAS FROM RATIO AND PRODUCT TYPE ESTIMATORS

We have observed that the ratio and product type estimators are biased. Several researchers have attempted to reduce the bias of these estimators. We would also like to discuss a few methods to construct unbiased ratio and product type estimators of the population mean before going on to the problems of estimation of the finite population variance, correlation coefficient, and regression coefficient.

3.2.9.1 QUENOUILLE'S METHOD

In this method we draw a sample of $2n$ units from a population of $N$ units by SRSWOR sampling. We divide the sample of $2n$ units into two equal halves, each of size $n$. The sample based on the $2n$ units is called the pooled sample. Then we have three biased ratio estimators of the population mean:

(a) $\bar{y}_{R1} = \bar{y}_1\left(\frac{\bar{X}}{\bar{x}_1}\right)$, where $\bar{y}_1 = n^{-1}\sum_{i=1}^{n}y_i$ and $\bar{x}_1 = n^{-1}\sum_{i=1}^{n}x_i$ are the first half sample means for the $Y$ and $X$ variables, respectively;

(b) $\bar{y}_{R2} = \bar{y}_2\left(\frac{\bar{X}}{\bar{x}_2}\right)$, where $\bar{y}_2 = n^{-1}\sum_{i=1}^{n}y_i$ and $\bar{x}_2 = n^{-1}\sum_{i=1}^{n}x_i$ are the second half sample means for the $Y$ and $X$ variables, respectively;

(c) $\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$, where $\bar{y} = (2n)^{-1}\sum_{i=1}^{2n}y_i$ and $\bar{x} = (2n)^{-1}\sum_{i=1}^{2n}x_i$ are the sample means for the $Y$ and $X$ variables, respectively, based on the pooled sample.

By following the ratio method of estimation, we have

$E(\bar{y}_{R1}) = \bar{Y} + \left(\frac{1}{n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right),$  (3.2.9.1)

$E(\bar{y}_{R2}) = \bar{Y} + \left(\frac{1}{n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right),$  (3.2.9.2)

and

$E(\bar{y}_R) = \bar{Y} + \left(\frac{1}{2n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right).$  (3.2.9.3)

Quenouille (1956) considered an estimator of the population mean $\bar{Y}$ as

$\bar{y}_Q = a\left(\bar{y}_{R1} + \bar{y}_{R2}\right) + (1 - 2a)\bar{y}_R$  (3.2.9.4)

where $a$ is a suitably chosen constant such that the bias in the estimator $\bar{y}_Q$ is zero. Thus we have the following theorem:
Theorem 3.2.9.1. Quenouille's estimator $\bar{y}_Q$ is an unbiased estimator of the population mean $\bar{Y}$ if

$a = -\frac{N - 2n}{2N}.$  (3.2.9.5)
Proof. We have
$E(\bar{y}_Q) = E\left[a(\bar{y}_{R1} + \bar{y}_{R2}) + (1 - 2a)\bar{y}_R\right] = a\left[E(\bar{y}_{R1}) + E(\bar{y}_{R2})\right] + (1 - 2a)E(\bar{y}_R)$
$= a\left[\bar{Y} + \left(\frac{1}{n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right) + \bar{Y} + \left(\frac{1}{n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)\right]$
$\quad + (1 - 2a)\left[\bar{Y} + \left(\frac{1}{2n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)\right]$
$= \bar{Y} + \bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)\left[2a\left(\frac{1}{n} - \frac{1}{N}\right) + (1 - 2a)\left(\frac{1}{2n} - \frac{1}{N}\right)\right].$

Evidently the bias in the estimator $\bar{y}_Q$ will be zero if

$2a\left(\frac{1}{n} - \frac{1}{N}\right) + (1 - 2a)\left(\frac{1}{2n} - \frac{1}{N}\right) = 0, \quad \text{or if} \quad a = -\frac{N - 2n}{2N}.$

Hence the theorem.

[Figure: plot of Quenouille's constant $a = -(N - 2n)/(2N)$ against the sample size $n$ for $N = 20$.]

Fig. 3.2.2 Value of Quenouille's constant.

For more details, one can refer to Singh and Singh (1993), Murthy (1962) and Rao (1965a). The reduction of the bias to any desired degree by using the method of Quenouille (1956) has also been discussed by Singh (1979).
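A small sketch in exact rational arithmetic (sample sizes chosen purely for illustration) confirming that $a = -(N - 2n)/(2N)$ makes the bracketed bias factor of the proof vanish for every admissible $n$:

```python
from fractions import Fraction

def quenouille_a(N, n):
    """Constant a = -(N - 2n)/(2N) of (3.2.9.5) that makes y_Q unbiased."""
    return Fraction(-(N - 2 * n), 2 * N)

def bias_factor(N, n, a):
    """Bracketed term 2a(1/n - 1/N) + (1 - 2a)(1/(2n) - 1/N) of the proof."""
    u = Fraction(1, n) - Fraction(1, N)
    v = Fraction(1, 2 * n) - Fraction(1, N)
    return 2 * a * u + (1 - 2 * a) * v

# For N = 20 (as in Fig. 3.2.2) the bias factor vanishes at a for every n
checks = [bias_factor(20, n, quenouille_a(20, n)) for n in range(1, 10)]
```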

3.2.9.2 INTERPENETRATING SAMPLING METHOD

Let us first present an idea about interpenetrating samples. If we want to select $n$ units with SRSWOR sampling, we can select $k$ independent samples, each of size $m = n/k$, where we assume that $n/k$ is an integer. We draw $m$ units out of the $N$ units, then put back these $m$ units so as to keep the population size the same. To make the $k$ samples independent, each individual sample of $m$ units is selected with SRSWOR sampling. Now we have $k$ samples, each of size $m$. From the $j$th sample, a ratio type estimator of the population mean $\bar{Y}$ is

$\bar{y}_{Rj} = \bar{y}_j\left(\frac{\bar{X}}{\bar{x}_j}\right)$

where $\bar{y}_j = m^{-1}\sum_{i=1}^{m}y_i$ and $\bar{x}_j = m^{-1}\sum_{i=1}^{m}x_i$ denote the $j$th sample means for the $Y$ and $X$ variables, respectively, for $j = 1, 2, \ldots, k$. Let us define a new estimator of the population mean $\bar{Y}$ as
$\bar{y}_{RK} = \frac{1}{k}\sum_{j=1}^{k}\bar{y}_{Rj} = \frac{1}{k}\sum_{j=1}^{k}\bar{y}_j\left(\frac{\bar{X}}{\bar{x}_j}\right).$  (3.2.9.6)
Also, from the full sample information we have the usual ratio estimator of the population mean $\bar{Y}$ given by

$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$

where $\bar{y} = n^{-1}\sum_{i=1}^{n}y_i$ and $\bar{x} = n^{-1}\sum_{i=1}^{n}x_i$ are the sample means based on the full sample information. By following the ratio method of estimation, we have

$E(\bar{y}_{RK}) = \bar{Y} + \left(\frac{1}{m} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right),$  (3.2.9.7)

and

$E(\bar{y}_R) = \bar{Y} + \left(\frac{1}{n} - \frac{1}{kN}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right).$  (3.2.9.8)

Note that $m$ units are drawn $k$ times from a population of size $N$, which is equivalent to a sample of size $n = km$ drawn from a population of size $kN$. Thus we have the following theorem:

Theorem 3.2.9.2. An unbiased estimator of the population mean $\bar{Y}$ is given by

$\bar{y}_u = \frac{k\bar{y}_R - \bar{y}_{RK}}{k - 1}.$  (3.2.9.9)
Proof. We have
$E(\bar{y}_u) = E\left[\frac{k\bar{y}_R - \bar{y}_{RK}}{k - 1}\right] = \frac{kE(\bar{y}_R) - E(\bar{y}_{RK})}{k - 1}$
$= \frac{k\left[\bar{Y} + \left(\frac{1}{n} - \frac{1}{kN}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)\right] - \left[\bar{Y} + \left(\frac{1}{m} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)\right]}{k - 1}$
$= \frac{(k - 1)\bar{Y} + \left(\frac{k}{n} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right) - \left(\frac{1}{m} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)}{k - 1}$
$= \frac{(k - 1)\bar{Y} + \left(\frac{1}{m} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right) - \left(\frac{1}{m} - \frac{1}{N}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_xC_y\right)}{k - 1} = \bar{Y},$
since $k/n = 1/m$. Hence the theorem.

Theorem 3.2.9.3. The variance of the unbiased estimator $\bar{y}_u$ of the population mean $\bar{Y}$ is

$V(\bar{y}_u) = \left(\frac{1}{n} - \frac{1}{kN}\right)\left[S_y^2 + R^2S_x^2 - 2RS_{xy}\right].$  (3.2.9.10)


Proof. It follows by applying the ratio method of estimation to $\bar{y}_u$, recalling that the $n = km$ units behave like a single SRSWOR sample of $n$ units from a population of $kN$ units.

Note that k > 1, thus the unbiased estimat or Yu is less efficient than the ratio
estimator YR in case of finite popu lations.

Sengupta (1981a, 1982a) considered the problem of interpenetrating sub-sampling with unequal sizes of the samples and compared it with an equicost procedure based on equal sized samples. He observed that unequal sized samples lead to more precise estimates of the finite population mean in almost all cases. He considered the simple random sampling design only and assumed that the cost of the survey is proportional to the number of distinct units in the sample, following Koop (1967), Singh and Bansal (1975, 1978), Singh and Singh (1974) and Srikantan (1963). Schucany, Gray, and Owen (1971) considered the problem of higher order bias reduction in estimating a general parameter in survey sampling.

Example 3.2.9.1. Select three different samples, each of five units, by using SRSWOR sampling from population 1 given in the Appendix. Collect the information on the real and nonreal estate farm loans from the states selected in each sample. The average nonreal estate farm loan is assumed to be known. Obtain three different ratio estimates of the average real estate farm loans from the information collected in the three samples. Pool the information collected in the three samples to obtain a pooled ratio estimate of the average real estate farm loans.
(a) Derive an unbiased estimate of the average real estate farm loans.
(b) Construct a 95% confidence interval.
Given: average nonreal estate farm loans $878.16.

Solution. Here $N = 50$, $k = 3$, $m = 5$ and $n = mk = 5 \times 3 = 15$. We selected the following three independent samples, each of size 5 units. The first sample is selected by using the first two columns, the second sample by using the 3rd and 4th columns, and the third sample by using the 5th and 6th columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix.

Sample I
Random Number (1 ≤ R_1i ≤ 50) | State | Real estate farm loans, y_i | Nonreal estate farm loans, x_i
01 | AL | 408.978 | 348.334
23 | MN | 1354.768 | 2466.892
46 | VA | 321.583 | 188.477
04 | AR | 907.700 | 848.317
32 | NY | 201.631 | 426.274
Sum | | 3194.660 | 4278.294

Thus $\bar{y}_1 = 638.932$ and $\bar{x}_1 = 855.6588$.


Sample II
Random Number (1 ≤ R_2i ≤ 50) | State | Real estate farm loans, y_i | Nonreal estate farm loans, x_i
29 | NH | 6.044 | 0.471
14 | IN | 1213.024 | 1022.782
47 | WA | 1100.745 | 1228.607
22 | MI | 323.028 | 440.518
42 | TN | 553.266 | 388.869
Sum | | 3196.107 | 3081.247

Thus $\bar{y}_2 = 639.2214$ and $\bar{x}_2 = 616.2494$.

Sample III
Random Number (1 ≤ R_3i ≤ 50) | State | Real estate farm loans, y_i | Nonreal estate farm loans, x_i
48 | WV | 99.277 | 29.291
37 | OR | 114.899 | 571.487
33 | NC | 639.571 | 494.730
18 | LA | 282.565 | 405.799
25 | MO | 1579.686 | 1519.994
Sum | | 2715.998 | 3021.301

Thus $\bar{y}_3 = 543.1996$ and $\bar{x}_3 = 604.2602$. It is given that $\bar{X} = 878.16$. Thus the three different ratio estimates of the average real estate farm loans in the United States are

$\bar{y}_{R1} = \bar{y}_1\frac{\bar{X}}{\bar{x}_1} = \frac{638.932 \times 878.16}{855.6588} = 655.7339, \quad \bar{y}_{R2} = \bar{y}_2\frac{\bar{X}}{\bar{x}_2} = \frac{639.2214 \times 878.16}{616.2494} = 910.8953,$

and

$\bar{y}_{R3} = \bar{y}_3\frac{\bar{X}}{\bar{x}_3} = \frac{543.1996 \times 878.16}{604.2602} = 789.4218.$
Thus a pooled estimate from the above three ratio estimates of the average real estate farm loans is given by

$\bar{y}_{RK} = \frac{1}{3}\sum_{j=1}^{3}\bar{y}_{Rj} = \frac{655.7339 + 910.8953 + 789.4218}{3} = 785.3503.$

Now we have the pooled sample information as follows:

Pooled Sample
State | y_i | x_i | (y_i − ȳ) | (x_i − x̄) | (y_i − ȳ)² | (x_i − x̄)² | (y_i − ȳ)(x_i − x̄)
AL | 408.978 | 348.334 | −198.1400 | −343.722 | 39259.3 | 118144.9 | 68104.989
MN | 1354.768 | 2466.892 | 747.6503 | 1774.836 | 558981.0 | 3150042.0 | 1326956.627
VA | 321.583 | 188.477 | −285.5350 | −503.579 | 81530.1 | 253591.9 | 143789.300
AR | 907.700 | 848.317 | 300.5823 | 156.261 | 90349.7 | 24417.5 | 46969.256
NY | 201.631 | 426.274 | −405.4870 | −265.782 | 164419.4 | 70640.1 | 107771.111
NH | 6.044 | 0.471 | −601.0740 | −691.585 | 361289.6 | 478290.0 | 415693.612
IN | 1213.024 | 1022.782 | 605.9063 | 330.726 | 367122.5 | 109379.6 | 200388.897
WA | 1100.745 | 1228.607 | 493.6273 | 536.551 | 243667.9 | 287886.8 | 264856.174
MI | 323.028 | 440.518 | −284.0900 | −251.538 | 80706.9 | 63271.4 | 71459.384
TN | 553.266 | 388.869 | −53.8517 | −303.187 | 2900.0 | 91922.4 | 16327.132
WV | 99.277 | 29.291 | −507.8410 | −662.765 | 257902.1 | 439257.6 | 336579.087
OR | 114.899 | 571.487 | −492.2190 | −120.569 | 242279.2 | 14536.9 | 59346.378
NC | 639.571 | 494.730 | 32.4533 | −197.326 | 1053.2 | 38937.6 | −6403.891
LA | 282.565 | 405.799 | −324.5530 | −286.257 | 105334.4 | 81943.1 | 92905.516
MO | 1579.686 | 1519.994 | 972.5683 | 827.939 | 945889.2 | 685481.1 | 805226.151
Sum | 9106.765 | 10380.842 | 0.0000 | 0.000 | 3542685.0 | 5907743.0 | 3949969.724

Thus $\bar{y} = 607.1176$, $\bar{x} = 692.056$, $s_y^2 = 253048.93$, $s_x^2 = 421981.64$, $s_{xy} = 282140.69$ and $r = \bar{y}/\bar{x} = 0.8773$.

A ratio estimate of the average real estate farm loans from the pooled sample information is given by

$\bar{y}_R = \bar{y}\frac{\bar{X}}{\bar{x}} = \frac{607.1176 \times 878.16}{692.056} = 770.380.$

An unbiased estimate of the average real estate farm loans in the United States is

$\bar{y}_u = \frac{k\bar{y}_R - \bar{y}_{RK}}{k - 1} = \frac{3 \times 770.380 - 785.3503}{3 - 1} = 762.894.$

An estimate of $V(\bar{y}_u)$ is given by

$v(\bar{y}_u) = \left(\frac{1}{n} - \frac{1}{kN}\right)\left(s_y^2 + r^2s_x^2 - 2rs_{xy}\right)$
$= \left(\frac{1}{15} - \frac{1}{3 \times 50}\right)\left(253048.93 + 0.8773^2 \times 421981.64 - 2 \times 0.8773 \times 282140.69\right) = 4967.1166.$

A $(1 - \alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by

$\bar{y}_u \pm t_{\alpha/2}(df = n - 1)\sqrt{v(\bar{y}_u)}.$

Using Table 2 from the Appendix, the 95% confidence interval of the average amount of the real estate farm loans in the United States is

$762.894 \pm 2.145\sqrt{4967.1166}$, or $[611.71,\ 914.06]$.
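The computations of this example can be replayed from the summary figures in the tables above (a Python sketch):

```python
import math

# Replays Example 3.2.9.1 from its sub-sample and pooled summary figures.
Xbar, N, k, m = 878.16, 50, 3, 5
n = k * m
ybars = [638.932, 639.2214, 543.1996]    # y-means of the three samples
xbars = [855.6588, 616.2494, 604.2602]   # x-means of the three samples

y_r = [yb * Xbar / xb for yb, xb in zip(ybars, xbars)]  # three ratio estimates
y_rk = sum(y_r) / k                                     # pooled average (3.2.9.6)

ybar, xbar = 607.1176, 692.056                          # pooled-sample means
y_full = ybar * Xbar / xbar                             # full-sample ratio estimate
y_u = (k * y_full - y_rk) / (k - 1)                     # unbiased combination (3.2.9.9)

sy2, sx2, sxy = 253048.93, 421981.64, 282140.69         # pooled (co)variances
r = ybar / xbar
v = (1 / n - 1 / (k * N)) * (sy2 + r ** 2 * sx2 - 2 * r * sxy)
half = 2.145 * math.sqrt(v)                             # t_{0.025} with 14 d.f.
ci = (y_u - half, y_u + half)
```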



3.2.9.3 EXACTLY UNBIASED RATIO TYPE ESTIMATOR

Hartley and Ross (1954) suggested an ingenious method of constructing an exactly unbiased ratio type estimator of the population mean. Suppose we draw a sample of $n$ units from a population of $N$ units. For each unit in the sample, let us define the ratio $r_i = y_i/x_i$, where $i = 1, 2, \ldots, n$, and the sample mean of the ratios as

$\bar{r} = \frac{1}{n}\sum_{i=1}^{n}r_i = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}.$  (3.2.9.11)

We will define a ratio estimator of the population mean as

$\bar{y}_R = \bar{r}\bar{X}.$  (3.2.9.12)
Taking expected values on both sides of (3.2.9.12) we obtain

$E(\bar{y}_R) = E(\bar{r}\bar{X}) = \bar{X}E(\bar{r}) = \bar{X}E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}\right] = \bar{X}\left(\frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i}\right) \neq \bar{Y}.$  (3.2.9.13)

Note that
$\bar{R} = \frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i}$ and $R_i = \frac{Y_i}{X_i}.$
Therefore the bias in the estimator $\bar{y}_R$ is given by
$B(\bar{y}_R) = E(\bar{y}_R) - \bar{Y} = \frac{\bar{X}}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} - \bar{Y} = \bar{X}\bar{R} - \frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i}X_i = \bar{X}\bar{R} - \frac{1}{N}\sum_{i=1}^{N}R_iX_i$
$= -\frac{1}{N}\left[\sum_{i=1}^{N}R_iX_i - N\bar{R}\bar{X}\right] = -\left(\frac{N - 1}{N}\right)S_{rx},$

where

$S_{rx} = \frac{1}{N - 1}\left[\sum_{i=1}^{N}R_iX_i - N\bar{R}\bar{X}\right] = \frac{1}{N - 1}\left[\sum_{i=1}^{N}Y_i - N\bar{R}\bar{X}\right] = \frac{N}{N - 1}\left[\bar{Y} - \bar{R}\bar{X}\right].$

Thus an estimator of $S_{rx}$ is given by

$s_{rx} = \frac{n}{n - 1}\left[\bar{y} - \bar{r}\bar{x}\right]$

and hence an estimator of $B(\bar{y}_R)$ is

$\hat{B}(\bar{y}_R) = -\left(\frac{N - 1}{N}\right)\frac{n}{n - 1}\left(\bar{y} - \bar{r}\bar{x}\right).$  (3.2.9.14)

Thus we have the following theorem:
Theorem 3.2.9.4. An unbiased ratio type estimator of the population mean $\bar{Y}$ is given by

$\bar{y}_{HR} = \bar{r}\bar{X} + \left(\frac{N - 1}{N}\right)\frac{n}{n - 1}\left(\bar{y} - \bar{r}\bar{x}\right).$  (3.2.9.15)

Proof. We have
$E(\bar{y}_{HR}) = E(\bar{r}\bar{X}) + \left(\frac{N - 1}{N}\right)E\left[\frac{n}{n - 1}\left(\bar{y} - \bar{r}\bar{x}\right)\right] = \bar{X}E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}\right] + \left(\frac{N - 1}{N}\right)E(s_{rx})$
$= \frac{\bar{X}}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} + \left(\frac{N - 1}{N}\right)\frac{1}{N - 1}\left[\sum_{i=1}^{N}R_iX_i - N\bar{R}\bar{X}\right]$
$= \frac{\bar{X}}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} + \frac{1}{N}\sum_{i=1}^{N}Y_i - \bar{X}\frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} = \bar{Y}.$
Hence the theorem.
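The exact unbiasedness asserted by the theorem can be checked by brute force. The sketch below uses a tiny, purely illustrative population and exact rational arithmetic: averaging $\bar{y}_{HR}$ over all $\binom{5}{3} = 10$ SRSWOR samples reproduces $\bar{Y}$ exactly, not just to first order:

```python
from itertools import combinations
from fractions import Fraction

# Toy population (values chosen only for illustration)
Y = [Fraction(v) for v in (4, 7, 9, 12, 15)]
X = [Fraction(v) for v in (2, 3, 5, 6, 8)]
N, n = len(Y), 3
Xbar, Ybar = sum(X) / N, sum(Y) / N

samples = list(combinations(range(N), n))
total = Fraction(0)
for s in samples:
    ybar = sum(Y[i] for i in s) / n
    xbar = sum(X[i] for i in s) / n
    rbar = sum(Y[i] / X[i] for i in s) / n
    # Hartley--Ross estimator (3.2.9.15), in exact rational arithmetic
    y_hr = rbar * Xbar + Fraction((N - 1) * n, N * (n - 1)) * (ybar - rbar * xbar)
    total += y_hr
expected = total / len(samples)    # exact design expectation of y_HR
```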

Theorem 3.2.9.5. For large samples, the variance of the estimator $\bar{y}_{HR}$ is given by

$V(\bar{y}_{HR}) = \frac{1}{n}\left[\sigma_y^2 + \bar{R}^2\sigma_x^2 - 2\bar{R}\sigma_{xy}\right], \quad \text{where } \bar{R} = \frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i}.$  (3.2.9.16)

Proof. For large values of $n$ and $N$, the estimator $\bar{y}_{HR}$ can be written as

$\bar{y}_{HR} = \bar{r}\bar{X} + \left(\frac{N - 1}{N}\right)\frac{n}{n - 1}\left(\bar{y} - \bar{r}\bar{x}\right) \approx \bar{r}\bar{X} + \left(\bar{y} - \bar{r}\bar{x}\right) = \bar{y} + \bar{r}\left(\bar{X} - \bar{x}\right).$

Defining $\tau = \bar{r}/\bar{R} - 1$ such that $E(\tau) \approx 0$, the estimator $\bar{y}_{HR}$ in terms of $\varepsilon_0$, $\varepsilon_1$ and $\tau$ can be written as

$\bar{y}_{HR} = \bar{Y}(1 + \varepsilon_0) - \bar{R}\bar{X}\varepsilon_1(1 + \tau).$

Now we have
$V(\bar{y}_{HR}) = E\left[\bar{y}_{HR} - E(\bar{y}_{HR})\right]^2 \approx E\left[\bar{Y}(1 + \varepsilon_0) - \bar{R}\bar{X}\varepsilon_1 - \bar{Y}\right]^2$
$= E\left[\bar{Y}^2\varepsilon_0^2 + \bar{R}^2\bar{X}^2\varepsilon_1^2 - 2\bar{R}\bar{X}\bar{Y}\varepsilon_0\varepsilon_1\right] \approx \frac{1}{n}\left[\bar{Y}^2C_y^2 + \bar{R}^2\bar{X}^2C_x^2 - 2\bar{R}\bar{X}\bar{Y}\rho_{xy}C_xC_y\right]$
$= \frac{1}{n}\left[S_y^2 + \bar{R}^2S_x^2 - 2\bar{R}S_{xy}\right] \approx \frac{1}{n}\left[\sigma_y^2 + \bar{R}^2\sigma_x^2 - 2\bar{R}\sigma_{xy}\right].$
Hence the theorem.

Remark 3.2.9.1. The exact variance of the estimator $\bar{y}_{HR}$ is available in Robson (1957).

Theorem 3.2.9.6. In the case of an infinite population, the unbiased ratio estimator $\bar{y}_{HR}$ is more efficient than the usual ratio estimator $\bar{y}_R$ if either

$(\bar{R} - R) < 0$ and $\beta < \frac{R + \bar{R}}{2}$, or $(\bar{R} - R) > 0$ and $\beta > \frac{R + \bar{R}}{2}$,

where $\beta = \sigma_{xy}/\sigma_x^2$ is the population regression coefficient.

Proof. For large values of $n$ and $N$ we have
$\mathrm{MSE}(\bar{y}_R) = \frac{1}{n}\left[\sigma_y^2 + R^2\sigma_x^2 - 2R\sigma_{xy}\right]$ and $V(\bar{y}_{HR}) = \frac{1}{n}\left[\sigma_y^2 + \bar{R}^2\sigma_x^2 - 2\bar{R}\sigma_{xy}\right].$
Now the estimator $\bar{y}_{HR}$ is more efficient than $\bar{y}_R$ if
$V(\bar{y}_{HR}) < \mathrm{MSE}(\bar{y}_R),$
or if $\frac{1}{n}\left[\sigma_y^2 + \bar{R}^2\sigma_x^2 - 2\bar{R}\sigma_{xy}\right] < \frac{1}{n}\left[\sigma_y^2 + R^2\sigma_x^2 - 2R\sigma_{xy}\right],$
or if $(\bar{R} + R)(\bar{R} - R)\sigma_x^2 - 2(\bar{R} - R)\sigma_{xy} < 0,$
or if $(\bar{R} - R)\sigma_x^2\left[(\bar{R} + R) - 2\beta\right] < 0.$
The above condition will hold if either
$(\bar{R} - R) < 0$ and $\beta < \frac{R + \bar{R}}{2}$, or $(\bar{R} - R) > 0$ and $\beta > \frac{R + \bar{R}}{2}.$
Hence the theorem.

Example 3.2.9.2. The estimation of the average amount of real estate farm loans (in $000) is an important issue for the banks operating in the United States of America. A statistician suggests that they consider the unbiased estimation methodology proposed by Hartley and Ross (1954). Select an SRSWR sample of eight states from population 1 given in the Appendix and derive the appropriate estimate of the real estate farm loans, given that the average amount $878.16 of nonreal estate farm loans (in $000) for the year 1997 is known.

Solution. An SRSWR sample of eight states is selected by using the 31st and 32nd columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix as: 17, 36, 50, 50, 31, 05, 18, and 50. Note that the state WY has been selected three times in the sample.

Random No. | State | x_i | y_i | (x_i − x̄)² | (y_i − ȳ)² | (x_i − x̄)(y_i − ȳ) | r_i = y_i/x_i
05 | CA | 3928.732 | 1343.461 | 8546932.65 | 770219.936 | 2565739.2530 | 0.341958
17 | KY | 557.656 | 1045.106 | 200311.97 | 335549.968 | −259257.9300 | 1.874105
18 | LA | 405.799 | 282.565 | 359303.44 | 33589.451 | 109858.1135 | 0.696318
31 | NM | 274.035 | 140.582 | 534628.95 | 105792.279 | 237822.6531 | 0.513007
36 | OK | 1716.087 | 612.108 | 505334.38 | 21394.547 | 103977.8835 | 0.356688
50 | WY | 386.479 | 100.964 | 382838.26 | 133133.948 | 225762.6385 | 0.261241
50 | WY | 386.479 | 100.964 | 382838.26 | 133133.948 | 225762.6385 | 0.261241
50 | WY | 386.479 | 100.964 | 382838.26 | 133133.948 | 225762.6385 | 0.261241
Sum | | 8041.746 | 3726.714 | 11295026.15 | 1665948.025 | 3435427.8820 | 4.565798

From the above table:

$n = 8$, $\bar{y} = 465.8393$, $\bar{x} = 1005.218$, $s_x^2 = 1613575.16$, $s_y^2 = 237992.57$,
$s_{xy} = 490775.41$, and $\bar{r} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i} = \frac{4.565798}{8} = 0.57072.$

Also $\bar{X} = 878.16$, $N = 50$ and $f = 0.16$ are given.
Thus the Hartley and Ross (1954) unbiased ratio estimate of the average amount of real estate farm loans during 1997, $\bar{Y}$ (say), is given by
$\bar{y}_{HR} = \bar{r}\bar{X} + \left(\frac{N - 1}{N}\right)\frac{n}{n - 1}\left(\bar{y} - \bar{r}\bar{x}\right)$
$= 0.57072 \times 878.16 + \frac{50 - 1}{50} \times \frac{8}{8 - 1} \times \left(465.8393 - 0.57072 \times 1005.218\right)$
$= 501.183 - 120.802 = 380.381.$

A rough estimate of $V(\bar{y}_{HR})$ is given by

$v(\bar{y}_{HR}) = \frac{1}{n}\left[s_y^2 + \bar{r}^2s_x^2 - 2\bar{r}s_{xy}\right]$
$= \frac{1}{8}\left[237992.57 + (0.57072)^2 \times 1613575.16 - 2 \times 0.57072 \times 490775.41\right]$
$= 25422.21.$

A $(1 - \alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\bar{y}_{HR} \pm t_{\alpha/2}(df = n - 1)\sqrt{v(\bar{y}_{HR})}.$
Using Table 2 from the Appendix, the 95% confidence interval is given by
$380.381 \pm t_{0.05/2}(df = 8 - 1)\sqrt{25422.21}$

or $380.381 \pm t_{0.025}(df = 7)\sqrt{25422.21}$

or $380.381 \pm 2.365\sqrt{25422.21}$ or $[3.29,\ 757.46]$.
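A short Python sketch replaying this example from its summary statistics:

```python
import math

# Replays Example 3.2.9.2 (Hartley--Ross estimate for population I).
n, N, Xbar = 8, 50, 878.16
ybar, xbar = 465.8393, 1005.218
rbar = 4.565798 / n                      # mean of the sample ratios y_i/x_i
sy2, sx2, sxy = 237992.57, 1613575.16, 490775.41

y_hr = rbar * Xbar + ((N - 1) / N) * (n / (n - 1)) * (ybar - rbar * xbar)
v = (sy2 + rbar ** 2 * sx2 - 2 * rbar * sxy) / n
half = 2.365 * math.sqrt(v)              # t_{0.025} with 7 d.f.
ci = (y_hr - half, y_hr + half)
```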

3.2.9.4 UNBIASED PRODUCT TYPE ESTIMATOR

The usual product estimator of the population mean $\bar{Y}$ is defined as

$\bar{y}_P = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right).$  (3.2.9.17)

The bias in the estimator $\bar{y}_P$ is given by

$B(\bar{y}_P) = \left(\frac{1-f}{n}\right)\bar{Y}\rho_{xy}C_yC_x = \left(\frac{1-f}{n}\right)\bar{Y}\frac{S_{xy}}{S_xS_y}\frac{S_y}{\bar{Y}}\frac{S_x}{\bar{X}} = \left(\frac{1-f}{n}\right)\frac{S_{xy}}{\bar{X}}.$  (3.2.9.18)

Thus an unbiased estimator of the bias in the product estimator is given by

$\hat{B}(\bar{y}_P) = \left(\frac{1-f}{n}\right)\frac{s_{xy}}{\bar{X}}.$  (3.2.9.19)

Thus we have the following theorem:
Theorem 3.2.9.7. A product type unbiased estimator of the population mean is:

$\bar{y}_{PU} = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \left(\frac{1-f}{n}\right)\frac{s_{xy}}{\bar{X}}.$  (3.2.9.20)

Proof. The estimator $\bar{y}_{PU}$ in terms of $\varepsilon_0$, $\varepsilon_1$ and $\varepsilon_4$, where $s_{xy} = S_{xy}(1 + \varepsilon_4)$, can be written as
$\bar{y}_{PU} = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \left(\frac{1-f}{n}\right)\frac{s_{xy}}{\bar{X}} = \bar{Y}(1 + \varepsilon_0)(1 + \varepsilon_1) - \left(\frac{1-f}{n}\right)\frac{S_{xy}(1 + \varepsilon_4)}{\bar{X}}$
$= \bar{Y}\left(1 + \varepsilon_0 + \varepsilon_1 + \varepsilon_0\varepsilon_1\right) - \left(\frac{1-f}{n}\right)\frac{S_{xy}(1 + \varepsilon_4)}{\bar{X}}.$

Taking expected values on both sides we obtain
$E(\bar{y}_{PU}) = \bar{Y}\left[1 + 0 + 0 + \left(\frac{1-f}{n}\right)\rho_{xy}C_xC_y\right] - \left(\frac{1-f}{n}\right)\frac{S_{xy}(1 + 0)}{\bar{X}}$
$= \bar{Y} + \left(\frac{1-f}{n}\right)\bar{Y}\frac{S_{xy}}{S_xS_y}\frac{S_x}{\bar{X}}\frac{S_y}{\bar{Y}} - \left(\frac{1-f}{n}\right)\frac{S_{xy}}{\bar{X}} = \bar{Y} + \left(\frac{1-f}{n}\right)\frac{S_{xy}}{\bar{X}} - \left(\frac{1-f}{n}\right)\frac{S_{xy}}{\bar{X}} = \bar{Y}.$
Hence the theorem.

Theorem 3.2.9.8. The variance $V(\bar{y}_{PU})$, to the first order of approximation, is the same as $\mathrm{MSE}(\bar{y}_P)$:

$V(\bar{y}_{PU}) = \left(\frac{1-f}{n}\right)\left[S_y^2 + R^2S_x^2 + 2RS_{xy}\right].$  (3.2.9.21)

Proof. The variance of the estimator $\bar{y}_{PU}$ is given by
$V(\bar{y}_{PU}) = E\left[\bar{y}_{PU} - \bar{Y}\right]^2 = E\left[\bar{Y}\left(1 + \varepsilon_0 + \varepsilon_1 + \varepsilon_0\varepsilon_1\right) - \left(\frac{1-f}{n}\right)\frac{S_{xy}(1 + \varepsilon_4)}{\bar{X}} - \bar{Y}\right]^2$
$\approx \left(\frac{1-f}{n}\right)\left[S_y^2 + R^2S_x^2 + 2RS_{xy}\right] = \mathrm{MSE}(\bar{y}_P).$
Hence the theorem.

Example 3.2.9.3. It is a well known phenomenon that as people become older their sleeping time reduces. A psychologist wants to study the average duration of sleep (in minutes) during the night for persons 50 years of age or over in a small village of the United States. Instead of asking everybody, the psychologist selects an SRSWOR sample of six persons and records the information given below.

Person No. | 7 | 12 | 15 | 19 | 24 | 29
Age, x (years) | 67 | 70 | 53 | 77 | 87 | 66
Duration of sleep, y (minutes) | 420 | 360 | 510 | 330 | 270 | 390

The average age of 67.267 years of the subjects is known, as shown in population 2 in the Appendix. Apply the unbiased product method of estimation to estimate the average sleep time in the particular village under study. Also find an estimator of the variance of the unbiased product estimator and hence deduce a 95% confidence interval.
Solution. From the sample information, we have

Person No. | Age, x_i | Sleep, y_i
7 | 67 | 420
12 | 70 | 360
15 | 53 | 510
19 | 77 | 330
24 | 87 | 270
29 | 66 | 390
Sum | 420 | 2280

We have $y_i$ = duration of sleep (in minutes), $x_i$ = age of subjects ($\geq 50$ years), $n = 6$, $\bar{y} = 380$, $\bar{x} = 70$, $s_x^2 = 130.4$, $s_y^2 = 6720$, $s_{xy} = -918$ and $r = \bar{y}/\bar{x} = 5.4286$. Also we are given $\bar{X} = 67.267$, $N = 30$ and $f = 0.20$.
Thus the unbiased product estimate of the average sleep time, $\bar{Y}$ (say), is given by

$\bar{y}_{PU} = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \left(\frac{1-f}{n}\right)\frac{s_{xy}}{\bar{X}} = 380\left(\frac{70}{67.267}\right) - \left(\frac{1 - 0.20}{6}\right)\frac{(-918)}{67.267} = 397.26$

and an estimate of the approximate $V(\bar{y}_{PU})$ is given by

$v(\bar{y}_{PU}) = \left(\frac{1-f}{n}\right)\left[s_y^2 + r^2s_x^2 + 2rs_{xy}\right]$
$= \left(\frac{1 - 0.20}{6}\right)\left[6720 + (5.4286)^2 \times 130.4 - 2 \times 5.4286 \times 918\right] = 79.458.$

A $(1 - \alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by

$\bar{y}_{PU} \pm t_{\alpha/2}(df = n - 2)\sqrt{v(\bar{y}_{PU})}.$

Using Table 2 from the Appendix, the 95% confidence interval is given by
$397.26 \pm 2.776\sqrt{79.458}$ or $[372.51,\ 422.00]$.
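The figures of this example can be recomputed directly from the raw sample data (a sketch):

```python
import math

# Replays Example 3.2.9.3 from the raw sleep data.
x = [67, 70, 53, 77, 87, 66]             # age in years
y = [420, 360, 510, 330, 270, 390]       # sleep in minutes
n, N, Xbar = len(x), 30, 67.267
f = n / N

xbar = sum(x) / n
ybar = sum(y) / n
sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)
sy2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
r = ybar / xbar

# Unbiased product estimator (3.2.9.20) and its estimated variance
y_pu = ybar * xbar / Xbar - (1 - f) / n * sxy / Xbar
v = (1 - f) / n * (sy2 + r ** 2 * sx2 + 2 * r * sxy)
half = 2.776 * math.sqrt(v)              # t_{0.025} with 4 d.f.
ci = (y_pu - half, y_pu + half)
```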

3.2.9.5 CLASS OF ALMOST UNBIASED ESTIMATORS OF POPULATION RATIO AND PRODUCT

Singh and Sahoo (1989) have considered the problem of estimation of the parameter $D = \bar{Y}/\bar{X}^{\delta}$, and defined a generalized estimator $d_g$ of the form

$d_g = \bar{y}/\bar{x}^{\delta}$  (3.2.9.22)

where $\delta$ is a constant which takes the value $+1$ or $-1$ according as we wish to estimate the population ratio or product. The estimator $d_g$ in terms of $\varepsilon_0$ and $\varepsilon_1$ can be written as:

$d_g = \frac{\bar{Y}(1 + \varepsilon_0)}{\left[\bar{X}(1 + \varepsilon_1)\right]^{\delta}} = D(1 + \varepsilon_0)(1 + \varepsilon_1)^{-\delta} = D(1 + \varepsilon_0)\left[1 - \delta\varepsilon_1 + \frac{\delta(\delta + 1)}{2}\varepsilon_1^2 - \cdots\right]$
$= D\left[1 + \varepsilon_0 - \delta\varepsilon_1 + \frac{\delta(\delta + 1)}{2}\varepsilon_1^2 - \delta\varepsilon_0\varepsilon_1 + \cdots\right].$  (3.2.9.23)

Taking expected values on both sides of (3.2.9.23) we obtain

$E(d_g) = D\left[1 + \left(\frac{1-f}{n}\right)\delta\left\{\frac{(\delta + 1)}{2}C_x^2 - \rho_{xy}C_xC_y\right\}\right].$  (3.2.9.24)

Employing the techniques developed by Beale (1962), Tin (1965) and Sahoo (1983), we may construct three almost unbiased estimators, up to terms of order $O(n^{-1})$, for $D$ as

$d_B = d\left(1 + \delta\hat{c}_{yx}\right)\left[1 + \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right]^{-1},$  (3.2.9.25)

$d_T = d\left[1 + \delta\left(\hat{c}_{yx} - \frac{(\delta + 1)}{2}\hat{c}_x^2\right)\right],$  (3.2.9.26)

and

$d_S = d\left[1 + \delta\left(\frac{(\delta + 1)}{2}\hat{c}_x^2 - \hat{c}_{yx}\right)\right]^{-1},$  (3.2.9.27)

respectively, where $\hat{c}_x^2 = s_x^2/\bar{x}^2$ and $\hat{c}_{xy} = s_{xy}/(\bar{x}\bar{y})$ are the estimators of $C_x^2$ and $C_{xy}$, respectively. Singh and Sahoo (1989) proposed a class of almost unbiased ratio and product estimators. According to them, whatever the sample chosen, let $t = (d, \hat{c}_{xy}, \hat{c}_x^2)$ assume values in a bounded, closed, convex subset $S$ of the three-dimensional real space containing the point $T = (D, C_{xy}, C_x^2)$. Let $f(t)$ be a function of $t$ satisfying the following conditions:
(a) the function $f(t)$ is continuous in $S$;
(b) the first and second order partial derivatives of $f(t)$ exist and are continuous in $S$;
(c) after expansion under the given conditions,
$f(t) = 1 + \delta\left(\hat{c}_{xy} - \frac{(\delta + 1)}{2}\hat{c}_x^2\right) + \text{higher order terms}.$
These conditions taken together are called regularity conditions, and any situation where these conditions hold is called a regular estimation case. Then we have the following theorem:
Theorem 3.2.9.9. Let $f(t)$ be any function of $d$, $\hat{c}_{xy}$ and $\hat{c}_x^2$ satisfying the above conditions. Then a class of almost unbiased estimators of $D$ may be defined as

$d_{SS} = d\,f(t).$  (3.2.9.28)

The class of estimators represented by $d_{SS}$ gives us an infinite number of almost unbiased ratio and product estimators on substituting a proper choice of $f(t)$. It is easy to see that the estimators suggested by Beale (1962), Tin (1965), Sahoo (1983), Robson (1957) and estimators of the type

$d_1 = d\left(1 + \delta\hat{c}_{xy}\right)\left[1 - \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right];$ $d_2 = d\left(1 - \delta\hat{c}_{xy}\right)^{-1}\left[1 - \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right];$

$d_3 = d\left(1 - \delta\hat{c}_{xy}\right)^{-1}\left[1 + \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right]^{-1};$ $d_4 = d\left[1 + \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right]^{-1}\exp\left(\delta\hat{c}_{xy}\right);$

$d_5 = d\left(1 + \delta\hat{c}_{xy}\right)\exp\left[-\frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right];$ $d_6 = d\exp\left[\delta\hat{c}_{yx} - \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right];$

$d_7 = d\left[1 + \log\left(1 + \delta\hat{c}_{xy}\right) - \frac{\delta(\delta + 1)}{2}\hat{c}_x^2\right];$ and $d_8 = d\left[1 + \log\left\{1 - \delta\left(\frac{(\delta + 1)}{2}\hat{c}_x^2 - \hat{c}_{xy}\right)\right\}\right],$

etc., for $\delta = \pm 1$, are members of the class $d_{SS}$.

3.2.9.6 FILTRATION OF BIAS

Singh and Singh (1991, 1993a, 1993b, 1993c) have suggested a new method to separate the bias precipitates from the ratio and product type estimators by using a funnel connected with a filter paper. The apparatus consists of a linear variety of estimators and linear constraints. Consider

$\hat{R}_i = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^i, \quad \text{such that } \hat{R}_i \in G \text{ for } i = 1, 2, 3$

where $G$ denotes the set of all possible ratio type estimators for estimating the population mean $\bar{Y}$.
By definition the set $G$ will be a linear variety if

$\hat{R}_s = \sum_{i=1}^{3}g_i\hat{R}_i \in G$  (3.2.9.29)

for

$\sum_{i=1}^{3}g_i = 1$ and $g_i \in \mathbb{R}$  (3.2.9.30)

where $g_i$ ($i = 1, 2, 3$) denotes the amount of chemicals used for separating bias precipitates and $\mathbb{R}$ denotes the set of real numbers. Using (3.2.9.30), the relation (3.2.9.29) in terms of $\varepsilon_0$ and $\varepsilon_1$ may be written as

$\hat{R}_s = \bar{Y} + \bar{Y}\left[\varepsilon_0 - (g_1 + 2g_2 + 3g_3)\varepsilon_1 + O(\varepsilon^2)\right].$  (3.2.9.31)

Let us choose
$g_1 + 2g_2 + 3g_3 = K$ (say, another constant).  (3.2.9.32)
Then the mean square error (MSE), to the first order of approximation, is given by

$\mathrm{MSE}(\hat{R}_s) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2 + K^2C_x^2 - 2K\rho_{xy}C_xC_y\right].$  (3.2.9.33)

The MSE (up to terms of order $O(n^{-1})$) is minimum for

$K = \rho_{xy}C_y/C_x$  (3.2.9.34)

and the minimum mean squared error is given by

$\mathrm{Min.MSE}(\hat{R}_s) = \left(\frac{1-f}{n}\right)S_y^2\left(1 - \rho_{xy}^2\right).$  (3.2.9.35)

From (3.2.9.30), (3.2.9.32) and (3.2.9.34), Singh and Singh (1991, 1993a, 1993b, 1993c) made a funnel consisting of two equations given by

$\sum_{i=1}^{3}g_i = 1$  (3.2.9.36)

and

$\sum_{i=1}^{3}ig_i = \rho_{xy}\frac{C_y}{C_x}.$  (3.2.9.37)
In the above two equations we have three unknowns to be determined. It is therefore not possible to find unique values for the amounts of chemicals $g_1$, $g_2$ and $g_3$ to be used for filtration. They claimed that their funnel is handicapped and cannot be used without a filter paper. In order to get unique values of these chemicals, they suggested using a filter paper with the funnel by imposing a linear constraint:

$\sum_{i=1}^{3}g_iB(\hat{R}_i) = 0$  (3.2.9.38)

where $B(\hat{R}_i)$ denotes the bias in $\hat{R}_i$, $i = 1, 2, 3$, as an estimator of the population mean. The above three equations can be written as

$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ B(\hat{R}_1) & B(\hat{R}_2) & B(\hat{R}_3) \end{bmatrix}\begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix} = \begin{bmatrix} 1 \\ K \\ 0 \end{bmatrix}, \quad \text{or} \quad A_{3\times 3}\,g_{3\times 1} = K_{3\times 1}.$  (3.2.9.39)

The values of $g_i$, $i = 1, 2, 3$ obtained by solving the system of equations (3.2.9.39) separate the bias precipitates from the linear variety of estimators if $|A| \neq 0$, i.e., if

$B(\hat{R}_2) \neq \frac{B(\hat{R}_1) + B(\hat{R}_3)}{2}.$  (3.2.9.40)

Thus we have the following theorem:
Theorem 3.2.9.10. The values of $g_i$, $i = 1, 2, 3$ which filter the bias up to terms of first order are given by

$g_1 = 3 - 3K + K^2, \quad g_2 = -3 + 5K - 2K^2, \quad \text{and} \quad g_3 = 1 - 2K + K^2.$  (3.2.9.41)

Proof. To the first order of approximation we have

$B(\hat{R}_1) = \left(\frac{1-f}{n}\right)\bar{Y}\left[C_x^2 - \rho_{xy}C_yC_x\right],$  (3.2.9.42)

$B(\hat{R}_2) = \left(\frac{1-f}{n}\right)\bar{Y}\left[3C_x^2 - 2\rho_{xy}C_yC_x\right],$  (3.2.9.43)

and

$B(\hat{R}_3) = \left(\frac{1-f}{n}\right)\bar{Y}\left[6C_x^2 - 3\rho_{xy}C_yC_x\right].$  (3.2.9.44)

On substituting these values of the biases in (3.2.9.39), we obtain the values of $g_i$ given in the theorem.

Remark 3.2.9.2. If someone is interested in reducing the bias to second order, then $B(\hat{R}_i)$, $i = 1, 2, 3$, can be obtained to the second order of approximation by following Gupta and Kothwala (1990).

Example 3.2.9.4. Ms. Stephanie Singh wishes to estimate the average amount of real estate farm loans (in $000) during 1997 in the United States using known information about the nonreal estate farm loans. She would like to apply the method of filtration of bias to the ratio method of estimation. Suggest to her the values of the required real constants $g_i$, $i = 1, 2, 3$.
Given: $\rho_{xy} = 0.8038$, $C_x^2 = 1.5256$ and $C_y^2 = 1.1086$.

Solution. Here

$K = \rho_{xy}\frac{C_y}{C_x} = 0.8038\sqrt{\frac{1.1086}{1.5256}} = 0.6852,$

thus the values of $g_1$, $g_2$ and $g_3$ are:

$g_1 = 3 - 3K + K^2 = 3 - 3 \times 0.6852 + 0.6852^2 = 1.414,$

$g_2 = -3 + 5K - 2K^2 = -3 + 5 \times 0.6852 - 2 \times 0.6852^2 = -0.513,$

and

$g_3 = 1 - 2K + K^2 = 1 - 2 \times 0.6852 + 0.6852^2 = 0.099.$
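A sketch that recomputes these filtering constants and numerically verifies the funnel and filter-paper constraints (3.2.9.36)–(3.2.9.38); the biases $B(\hat{R}_i)$ are used up to the common factor $(1-f)\bar{Y}/n$, which cancels in the constraint:

```python
# Filtering constants of Theorem 3.2.9.10 with the Example 3.2.9.4
# constants, plus checks of the three constraints.
rho, Cx2, Cy2 = 0.8038, 1.5256, 1.1086
Cx, Cy = Cx2 ** 0.5, Cy2 ** 0.5
K = rho * Cy / Cx                      # optimum K of (3.2.9.34)

g = (3 - 3 * K + K ** 2,               # g1
     -3 + 5 * K - 2 * K ** 2,          # g2
     1 - 2 * K + K ** 2)               # g3

# Biases B(R_i), dropping the common factor (1 - f) * Ybar / n
biases = [i * (i + 1) / 2 * Cx2 - i * rho * Cy * Cx for i in (1, 2, 3)]

sum_g = sum(g)                                       # should equal 1
sum_ig = sum(i * gi for i, gi in zip((1, 2, 3), g))  # should equal K
sum_gb = sum(gi * bi for gi, bi in zip(g, biases))   # should equal 0
```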

Example 3.2.9.5. Ms. Stephanie Singh considers the problem of estimating the average amount of real estate farm loans (in $000) during 1997 in the United States. She takes an SRSWOR sample of eight states from population 1 of the Appendix and collects the following information:

State | UT | NY | ME | IL | CT | CA | AZ | OH
Nonreal estate farm loans, x_i | 197.244 | 16.710 | 51.539 | 2610.572 | 4.373 | 3928.732 | 431.439 | 635.774
Real estate farm loans, y_i | 56.908 | 5.860 | 8.849 | 2131.048 | 7.130 | 1343.461 | 54.633 | 870.720


The average amount $878.16 of nonreal estate farm loans (in $000) for the year 1997 is known. She wishes to use the three estimators $\hat{R}_i = \bar{y}\left(\bar{X}/\bar{x}\right)^i$, such that $\hat{R}_i \in G$ for $i = 1, 2, 3$, where $G$ denotes the set of all possible ratio type estimators for estimating the population mean $\bar{Y}$. Discuss her estimate based on the estimator

$\hat{R}_s = \sum_{i=1}^{3}g_i\hat{R}_i \in G \quad \text{for} \quad \sum_{i=1}^{3}g_i = 1 \text{ and } g_i \in \mathbb{R}.$

Also find an estimator of the mean squared error of the estimator $\hat{R}_s$ and hence deduce a 95% confidence interval.
Given: $g_1 = 1.414$, $g_2 = -0.513$ and $g_3 = 0.099$.

Solution. From the sample information we have

State | x_i | y_i | (x_i − x̄)² | (y_i − ȳ)² | (x_i − x̄)(y_i − ȳ)
UT | 197.244 | 56.908 | 619847.3916 | 252926.6405 | 395949.3886
NY | 16.710 | 5.860 | 936710.1523 | 306878.4676 | 536149.3972
ME | 51.539 | 8.849 | 870505.5608 | 303575.7923 | 514066.5475
IL | 2610.572 | 2131.048 | 2643954.4550 | 2468738.1800 | 2554844.6740
CT | 4.373 | 7.130 | 960742.7856 | 305473.0066 | 541738.8552
CA | 3928.732 | 1343.461 | 8668220.1620 | 614083.6173 | 2307165.3590
AZ | 431.439 | 54.633 | 305929.4276 | 255220.0935 | 279426.8010
OH | 635.774 | 870.720 | 121643.2159 | 96655.0015 | −108431.6615
Sum | 7876.383 | 4478.609 | 15127553.1500 | 4603550.8000 | 7020909.3610

Here $n = 8$, $\bar{y} = 559.8261$, $\bar{x} = 984.5479$, $s_x^2 = 2161079.02$, $s_y^2 = 657650.11$,
$s_{xy} = 1002987.05$, and $r_{xy} = s_{xy}/(s_xs_y) = 0.8413$. Also given $\bar{X} = 878.16$, $N = 50$ and $f = 0.16$.

Thus the bias-filtered ratio type estimate of the average amount of real estate farm loans during 1997, $\bar{Y}$ (say), is given by

$\hat{R}_s = \sum_{i=1}^{3}g_i\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^i$
$= 1.414 \times 559.8261\left(\frac{878.16}{984.5479}\right) - 0.513 \times 559.8261\left(\frac{878.16}{984.5479}\right)^2 + 0.099 \times 559.8261\left(\frac{878.16}{984.5479}\right)^3$
$= 706.05 - 228.47 + 39.33 = 516.91,$

and an estimator of $\mathrm{MSE}(\hat{R}_s)$ is given by

$\widehat{\mathrm{MSE}}(\hat{R}_s) = \left(\frac{1-f}{n}\right)s_y^2\left(1 - r_{xy}^2\right) = \left(\frac{1 - 0.16}{8}\right) \times 657650.11 \times \left[1 - (0.8413)^2\right] = 20178.35.$

A $(1 - \alpha)100\%$ confidence interval for the population mean $\bar{Y}$ is given by
$\hat{R}_s \pm t_{\alpha/2}(df = n - 1)\sqrt{\widehat{\mathrm{MSE}}(\hat{R}_s)}.$
Using Table 2 from the Appendix, the 95% confidence interval is given by
$516.91 \pm 2.365\sqrt{20178.35}$ or $[180.96,\ 852.86]$.
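Replaying this example from the summary statistics above (a Python sketch):

```python
import math

# Replays Example 3.2.9.5: bias-filtered linear variety of ratio
# estimators with g = (1.414, -0.513, 0.099).
n, N, Xbar = 8, 50, 878.16
ybar, xbar = 559.8261, 984.5479
sy2, rxy = 657650.11, 0.8413
g = (1.414, -0.513, 0.099)

Rs = sum(gi * ybar * (Xbar / xbar) ** i for i, gi in zip((1, 2, 3), g))
mse = (1 - n / N) / n * sy2 * (1 - rxy ** 2)
half = 2.365 * math.sqrt(mse)            # t_{0.025} with 7 d.f.
ci = (Rs - half, Rs + half)
```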

Unbiased ratio and product type estimators have also been discussed by Tukey (1958), Durbin (1959), Murthy and Nanjamma (1959), Nieto de Pascual (1961), Rao (1966), Rao and Webster (1966), Murthy (1967), Rao (1969), Hutchison (1971), Schucany, Gray, and Owen (1971), Sharot (1976) and Rao (1981). A complete review of the reduction of bias using auxiliary information can be had from Sahoo, Sahoo, and Wywial (1997). Williams (1958, 1961) compared the two unbiased regression estimators, one suggested by himself and the other by Mickey (1959). Williams (1963) compared the precision of some unbiased regression estimators. Rao (1967) investigated the performance of Mickey's estimator in a variety of natural populations and concluded that it is usually inferior to the standard ratio and regression estimators. Sahoo (1994) compared the efficiency of Williams' (1963) estimator with the standard regression estimator and found through an empirical study that the former is usually inferior to the latter. Rao and Rao (1971) investigated the exact efficiency of Mickey's unbiased ratio type estimator.

The next section is devoted to estimating the finite population variance in the presence of auxiliary information.

3.3 ESTIMATION OF FINITE POPULATION VARIANCE

We consider the problem of estimating the finite population variance (or population
mean square error) of the study variable Y, defined by

S_y^2 = (N − 1)^{-1} Σ_{i=1}^{N} (Y_i − Ȳ)^2,    (3.3.1)

in the presence of the known population mean
X̄ = N^{-1} Σ_{i=1}^{N} X_i
and the known population variance
S_x^2 = (N − 1)^{-1} Σ_{i=1}^{N} (X_i − X̄)^2
of the auxiliary variable X. Note that the admissibility of the usual estimator of the
finite population variance has been studied by Strauss (1982), but here we will
discuss a few estimators of S_y^2.
192 Advanced sampling theory with applications

3.3.1 RATIO TYPE ESTIMATOR

Isaki (1983) proposed a ratio type estimator of the finite population variance S_y^2 as

s_1^2 = s_y^2 (S_x^2 / s_x^2).    (3.3.1.1)

The estimator s_1^2 in terms of ε_2 and ε_3 (as defined earlier) can easily be written as

s_1^2 = S_y^2 (1 + ε_2)(1 + ε_3)^{-1} = S_y^2 (1 + ε_2)(1 − ε_3 + ε_3^2 + ...)
      = S_y^2 [1 + ε_2 − ε_3 + ε_3^2 − ε_2 ε_3 + ...].    (3.3.1.2)

Thus we have the following theorems:

Theorem 3.3.1.1. The bias, up to terms of order O(n^{-1}), in the estimator s_1^2 is

B(s_1^2) = ((1 − f)/n) S_y^2 (λ_{04} − λ_{22}).    (3.3.1.3)

Proof. Taking expected values on both sides of (3.3.1.2) we have

E(s_1^2) = S_y^2 E[1 + ε_2 − ε_3 + ε_3^2 − ε_2 ε_3] = S_y^2 [1 + ((1 − f)/n){(λ_{04} − 1) − (λ_{22} − 1)}],

and using the result B(s_1^2) = E(s_1^2) − S_y^2 we have (3.3.1.3).
Hence the theorem.

Theorem 3.3.1.2. The mean squared error of the estimator s_1^2, up to the first order
of approximation, is

MSE(s_1^2) = ((1 − f)/n) S_y^4 [λ_{40} + λ_{04} − 2λ_{22}].    (3.3.1.4)

Proof. We have
MSE(s_1^2) = E[s_1^2 − S_y^2]^2 ≈ E[S_y^2 (1 + ε_2 − ε_3 + ε_3^2 − ε_2 ε_3 + ...) − S_y^2]^2
≈ S_y^4 E[ε_2 − ε_3]^2 = S_y^4 E[ε_2^2 + ε_3^2 − 2ε_2 ε_3]
= ((1 − f)/n) S_y^4 [(λ_{40} − 1) + (λ_{04} − 1) − 2(λ_{22} − 1)].
Hence the theorem.

Theorem 3.3.1.3. A consistent estimator of the mean squared error of the estimator
s_1^2, up to the first order of approximation, is given by

\widehat{MSE}(s_1^2) = ((1 − f)/n) s_y^4 [λ̂_{40} + λ̂_{04} − 2λ̂_{22}].    (3.3.1.5)

Proof. It follows from (3.3.1.4) by the method of moments.



Corollary 3.3.1.1. The limiting distribution of the estimator s_1^2 is asymptotically
normal under certain regularity conditions described by Swain and Mishra (1994).
Example 3.3.1.1. Consider the following population consisting of five (N = 5)
units A, B, C, D, and E, where for each unit in the population two variables Y and
X are measured.

Units:  A    B    C    D    E
Y:      9    11   13   16   21
X:      14   18   19   22   24

(a) Find the population mean squares S_y^2 and S_x^2 of the study variable (Y) and
the auxiliary variable (X), respectively.
(b) Select all possible samples of three units (n = 3) with SRSWOR sampling.
(c) Find the sample variances s_y^2 and s_x^2 from each sample.
(d) Find the exact mean square error of the estimator s_y^2 using the definition.
(e) Assuming that the population mean square S_x^2 of the auxiliary variable is
known, find the ratio estimates of S_y^2 as
s_1^2 = s_y^2 (S_x^2 / s_x^2)
from each sample, and its exact mean square error by definition.
(f) Find the relative efficiency of the ratio estimator s_1^2 with respect to the sample
estimator s_y^2.

Solution. (a) The population mean squares of the Y and X variables are given by
S_y^2 = 22 and S_x^2 = 14.8.
(b) and (c) All the 10 possible samples, the estimates s_y^2 and s_x^2 of the
population mean squares from each sample, and related results are given as:

t    Sample values              s_y^2    s_x^2    s_1^2    p_t    (s_1^2 − S_y^2)^2   (s_y^2 − S_y^2)^2
1    y: 9 11 13; x: 14 18 19    4.000    7.000    8.457    0.1    183.409             324.000
2    y: 9 11 16; x: 14 18 22    13.000   16.000   12.025   0.1    99.501              81.000
3    y: 9 11 21; x: 14 18 24    41.333   25.333   24.147   0.1    4.611               373.778
4    y: 9 13 16; x: 14 19 22    12.333   16.333   11.176   0.1    117.170             93.444
5    y: 9 13 21; x: 14 19 24    37.333   25.000   22.101   0.1    0.010               235.111
6    y: 9 16 21; x: 14 22 24    36.333   28.000   19.205   0.1    7.813               205.444
7    y: 11 13 16; x: 18 19 22   6.333    4.333    21.631   0.1    0.136               245.444
8    y: 11 13 21; x: 18 19 24   28.000   10.333   40.103   0.1    327.727             36.000
9    y: 11 16 21; x: 18 22 24   25.000   9.333    39.643   0.1    311.270             9.000
10   y: 13 16 21; x: 19 22 24   16.333   6.333    38.168   0.1    261.418             32.111
                                                           Sum:   1313.065            1635.333
(d) The mean square error of the estimator s_y^2 is given by

MSE(s_y^2) = Σ_{t=1}^{10} p_t {s_y^2(t) − S_y^2}^2 = 163.5333.

(e) The ratio estimates s_1^2(t), t = 1, 2, ..., 10, are given in the above table, and the
mean square error of the ratio estimator s_1^2 is given by

MSE(s_1^2) = Σ_{t=1}^{10} p_t {s_1^2(t) − S_y^2}^2 = 131.3065.

(f) The relative efficiency of the ratio estimator s_1^2 with respect to the usual
estimator s_y^2 is given by

RE = (MSE(s_y^2)/MSE(s_1^2)) × 100 = (163.5333/131.3065) × 100 = 124.54%.

Note that this small example will also be used to demonstrate that the higher level
calibration estimators of variance given in Chapter 5 are better than the low level
calibration estimators of variance; we will study calibration approaches in Chapter 5.
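The enumeration in Example 3.3.1.1 can be verified in a few lines of Python (a sketch; `statistics.variance` uses the n − 1 divisor, matching the mean-square definitions used here):

```python
from itertools import combinations
from statistics import variance

# Population of N = 5 units from Example 3.3.1.1
Y = [9, 11, 13, 16, 21]
X = [14, 18, 19, 22, 24]
Sy2, Sx2 = variance(Y), variance(X)       # S_y^2 = 22, S_x^2 = 14.8

# All C(5,3) = 10 SRSWOR samples, each selected with probability 1/10
mse_usual = mse_ratio = 0.0
for s in combinations(range(5), 3):
    sy2 = variance([Y[i] for i in s])
    sx2 = variance([X[i] for i in s])
    s12 = sy2 * Sx2 / sx2                 # ratio estimate of S_y^2
    mse_usual += (sy2 - Sy2) ** 2 / 10
    mse_ratio += (s12 - Sy2) ** 2 / 10

print(round(mse_usual, 4), round(mse_ratio, 4))   # ≈ 163.5333 and ≈ 131.3065
print(round(100 * mse_usual / mse_ratio, 2))      # ≈ 124.54 (percent RE)
```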

Example 3.3.1.2. Mr. Jack Allen wishes to estimate the finite population variance
of real estate farm loans in the United States using the known benchmark, the
variance of the auxiliary variable nonreal estate farm loans. Using the information
given in population 1 of the Appendix, discuss the relative efficiency of the ratio
type estimator
s_1^2 = s_y^2 (S_x^2 / s_x^2)
with respect to the usual estimator s_y^2.
Given: λ_{40} = 3.5822, λ_{04} = 4.5247 and λ_{22} = 2.8411.

Solution. The percent relative efficiency of the estimator s_1^2 with respect to s_y^2 is
given by

RE = (V(s_y^2)/MSE(s_1^2)) × 100 = ((λ_{40} − 1)/(λ_{40} + λ_{04} − 2λ_{22})) × 100
   = ((3.5822 − 1)/(3.5822 + 4.5247 − 2 × 2.8411)) × 100 = 106.49%.

Theorem 3.3.1.4. The minimum sample size for the relative standard error of the
estimator s_1^2 to be less than or equal to φ is given by

n ≥ [ φ^2/(λ_{40} + λ_{04} − 2λ_{22}) + 1/N ]^{-1}.    (3.3.1.6)

Proof. We have

RSE = √(MSE(s_1^2)) / S_y^2 ≤ φ,

which implies

(1/n − 1/N)(λ_{40} + λ_{04} − 2λ_{22}) ≤ φ^2

or

1/n ≤ φ^2/(λ_{40} + λ_{04} − 2λ_{22}) + 1/N.

Hence the theorem.

Example 3.3.1.3. Mr. Jack Allen is a bank consultant in the United States of
America. He considers the problem of estimating the finite population variance of
the real estate farm loans in the United States. Based on the information given in
population 1 of the Appendix, what is the minimum sample size required for the
estimator s_1^2 to have a relative standard error of at most 45%?

Solution. Here φ = 0.45. From the description of population 1 we have
λ_{40} = 3.5822, λ_{04} = 4.5247 and λ_{22} = 2.8411. Thus using (3.3.1.6), we have

n ≥ [ 0.45^2/(3.5822 + 4.5247 − 2 × 2.8411) + 1/50 ]^{-1} = 9.66 ≈ 10.

Thus a minimum sample of 10 states is required to maintain the required level of
accuracy of the estimator.
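Both the relative-efficiency figure of Example 3.3.1.2 and the sample-size bound (3.3.1.6) of Example 3.3.1.3 follow from the same three moment ratios; a minimal sketch:

```python
import math

# Moment ratios for population 1 (from the Appendix description)
lam40, lam04, lam22 = 3.5822, 4.5247, 2.8411
N = 50

# Percent relative efficiency of s_1^2 vs s_y^2 (Example 3.3.1.2)
re = (lam40 - 1) / (lam40 + lam04 - 2 * lam22) * 100

# Minimum n for a relative standard error of at most phi (eq. 3.3.1.6)
phi = 0.45
n_min = 1 / (phi ** 2 / (lam40 + lam04 - 2 * lam22) + 1 / N)

print(round(re, 2), round(n_min, 2), math.ceil(n_min))   # ≈ 106.5, 9.66, 10
```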

Example 3.3.1.4. Select an SRSWOR sample of 10 states from population 1
given in the Appendix. Apply the ratio method of estimation for estimating the
variance of the real estate farm loans (in $000) during 1997. Use the known
variance of nonreal estate farm loans, S_x^2 = 1176526. Find an estimator of the mean
squared error of the ratio estimator and hence deduce a 95% confidence interval.

Solution. Using the 23rd and 24th columns of the Pseudo-Random Numbers Table 1
given in the Appendix, we selected 10 distinct random numbers between 1 and
50 as: 44, 32, 33, 09, 10, 50, 31, 38, 06, and 13. The results obtained are given
below, where x_i denotes nonreal estate and y_i real estate farm loans (in $000),
with x̄ = 659.918 and ȳ = 610.789.

State  x_i        y_i        (x_i − x̄)^2   (y_i − ȳ)^2   (x_i − x̄)^4    (y_i − ȳ)^4    (x_i − x̄)^2 (y_i − ȳ)^2
CO     906.281    315.809    60694.73      87013.200     3683861941     7571297044     5281251085
FL     464.516    825.748    38181.94      46207.372     1457854696     2135121198     1764283556
GA     540.696    939.460    14213.838     108024.626    202033179.2    11669319875    1535444494
IL     2610.572   2131.048   3805051.808   2311187.427   1.44784E+13    5.34159E+12    8.79419E+12
NM     274.035    140.582    148905.535    221094.623    22172858454    48882832253    32922213175
NY     426.274    201.631    54589.425     167410.269    2980005352     28026198155    9138830368
NC     494.730    639.571    27287.009     828.403       744580874.8    686252.3986    22604654.64
PA     298.351    756.169    130730.551    21135.344     17090476929    446702782.9    2763035216
UT     197.244    56.908     214067.045    306784.162    45824699843    94116522153    65672379110
WY     386.479    100.964    74768.777     259921.531    5590370066     67559202082    19434015051
Sum                          4568490.63    3529606.95    1.45782E+13    5.60200E+12    8.93272E+12

From the above table:

n = 10, N = 50, f = 0.20,
μ̂_{02} = s_x^2 = Σ(x_i − x̄)^2/(10 − 1) = 507610.07,
μ̂_{04} = Σ(x_i − x̄)^4/(10 − 1) = 1.6198×10^12,
μ̂_{20} = s_y^2 = Σ(y_i − ȳ)^2/(10 − 1) = 392178.55,
μ̂_{40} = Σ(y_i − ȳ)^4/(10 − 1) = 6.2244444×10^11,
μ̂_{22} = (10 − 1)^{-1} Σ(y_i − ȳ)^2 (x_i − x̄)^2 = 9.925244×10^11,

λ̂_{40} = μ̂_{40}/μ̂_{20}^2 = 6.2244444×10^11/(392178.55)^2 = 4.04699,
λ̂_{04} = μ̂_{04}/μ̂_{02}^2 = 1.6198×10^12/(507610.07)^2 = 6.28638, and
λ̂_{22} = μ̂_{22}/(μ̂_{20} μ̂_{02}) = 9.925244×10^11/(392178.55 × 507610.07) = 4.985711.

Thus the ratio estimate of the variance S_y^2 of the real estate farm loans is

s_1^2 = s_y^2 (S_x^2/s_x^2) = 392178.55 × (1176526/507610.07) = 908981.69,

and an estimate of MSE(s_1^2) is given by

\widehat{MSE}(s_1^2) = ((1 − f)/n) s_y^4 [λ̂_{40} + λ̂_{04} − 2λ̂_{22}]
= ((1 − 0.20)/10) × (392178.55)^2 × [4.04699 + 6.28638 − 2 × 4.985711] = 4453524452.

A (1 − α)100% confidence interval for the population variance S_y^2 is given by
s_1^2 ± t_{α/2}^2(df = n − 1) √(\widehat{MSE}(s_1^2)).
Using Table 2 from the Appendix and the relationship t^2 = F, the 95% confidence
interval is given by
908981.69 ± (2.262)^2 √4453524452, or [567523.82, 1250439.56].
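The whole of Example 3.3.1.4 can be reproduced from the raw sample values (a sketch; the x and y vectors are the state figures from the table above):

```python
# Sample of 10 states (x = nonreal estate, y = real estate farm loans, $000)
x = [906.281, 464.516, 540.696, 2610.572, 274.035,
     426.274, 494.730, 298.351, 197.244, 386.479]
y = [315.809, 825.748, 939.460, 2131.048, 140.582,
     201.631, 639.571, 756.169, 56.908, 100.964]
n, N, Sx2 = 10, 50, 1176526.0          # known population variance of x
f = n / N

xbar, ybar = sum(x) / n, sum(y) / n
def m(r, s):
    """Sample central moment: sum((y - ybar)^r (x - xbar)^s) / (n - 1)."""
    return sum((yi - ybar) ** r * (xi - xbar) ** s
               for yi, xi in zip(y, x)) / (n - 1)

s_y2, s_x2 = m(2, 0), m(0, 2)
lam40, lam04 = m(4, 0) / s_y2 ** 2, m(0, 4) / s_x2 ** 2
lam22 = m(2, 2) / (s_y2 * s_x2)

s1_2 = s_y2 * Sx2 / s_x2                             # ratio estimate of S_y^2
mse = (1 - f) / n * s_y2 ** 2 * (lam40 + lam04 - 2 * lam22)
print(round(s1_2, 2), round(mse))   # ≈ 908981.7 and ≈ 4.45e9
```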

3.3.2 DIFFERENCE TYPE ESTIMATOR

Isaki (1983) also considered the difference type estimator of the finite population
variance S_y^2 given by

s_d^2 = s_y^2 + k_0 (S_x^2 − s_x^2),    (3.3.2.1)

where k_0 is a real constant. The estimator s_d^2 in terms of ε_2 and ε_3 can be written
as

s_d^2 = S_y^2 (1 + ε_2) + k_0[S_x^2 − S_x^2(1 + ε_3)] = S_y^2 (1 + ε_2) − k_0 S_x^2 ε_3.    (3.3.2.2)

Thus we have the following theorems:

Theorem 3.3.2.1. The estimator s_d^2 is an unbiased estimator of S_y^2.

Proof. Taking expected values on both sides of (3.3.2.2), we have
E(s_d^2) = S_y^2.
Hence the theorem.

Theorem 3.3.2.2. The minimum variance of the estimator s_d^2 is given by

Min.V(s_d^2) = ((1 − f)/n) S_y^4 [λ_{40} − 1 − (λ_{22} − 1)^2/(λ_{04} − 1)].    (3.3.2.3)

Proof. By the definition of variance, we have

V(s_d^2) = E[s_d^2 − S_y^2]^2 = E[S_y^2(1 + ε_2) − k_0 S_x^2 ε_3 − S_y^2]^2 = E[S_y^2 ε_2 − k_0 S_x^2 ε_3]^2
= E[S_y^4 ε_2^2 + k_0^2 S_x^4 ε_3^2 − 2 k_0 S_y^2 S_x^2 ε_2 ε_3]
= ((1 − f)/n)[S_y^4(λ_{40} − 1) + k_0^2 S_x^4(λ_{04} − 1) − 2 k_0 S_y^2 S_x^2 (λ_{22} − 1)].    (3.3.2.4)

On differentiating (3.3.2.4) with respect to k_0 and equating to zero we obtain

k_0 = S_y^2(λ_{22} − 1)/{S_x^2(λ_{04} − 1)}.    (3.3.2.5)

Substitution of the optimum value of k_0 in (3.3.2.4) proves the theorem.

Example 3.3.1.5. Using the information given in population 1 of the Appendix,
find the relative efficiency of the difference (regression) type estimator s_d^2 = s_y^2 + k_0(S_x^2 − s_x^2)
with respect to the ratio type estimator s_1^2 = s_y^2 (S_x^2/s_x^2) while estimating the variance
of real estate farm loans using the known variance of nonreal estate farm loans.
Given: λ_{40} = 3.5822, λ_{04} = 4.5247 and λ_{22} = 2.8411.

Solution. The percent relative efficiency of the estimator s_d^2 with respect to s_1^2 is

RE = (MSE(s_1^2)/Min.V(s_d^2)) × 100
   = (3.5822 + 4.5247 − 2 × 2.8411) × 100 / {3.5822 − 1 − (2.8411 − 1)^2/(4.5247 − 1)} = 149.63%.
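Theorem 3.3.2.1 (unbiasedness of s_d^2 for any constant k_0) can be checked exactly on the small population of Example 3.3.1.1, since all C(5,3) = 10 SRSWOR samples can be enumerated:

```python
from itertools import combinations
from statistics import variance

# Toy population from Example 3.3.1.1
Y = [9, 11, 13, 16, 21]
X = [14, 18, 19, 22, 24]
Sy2, Sx2 = variance(Y), variance(X)          # 22 and 14.8

# s_d^2 = s_y^2 + k0*(S_x^2 - s_x^2) averages to S_y^2 for ANY constant k0
for k0 in (0.0, 0.5, 1.3):                   # arbitrary illustrative values
    est = [variance([Y[i] for i in s]) + k0 * (Sx2 - variance([X[i] for i in s]))
           for s in combinations(range(5), 3)]
    assert abs(sum(est) / len(est) - Sy2) < 1e-9
print("E(s_d^2) = S_y^2 for every k0 checked")
```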

3.3.3 POWER TRANSFORMATION TYPE ESTIMATOR

We consider here an estimator of the finite population variance S_y^2 as

s_p^2 = s_y^2 (x̄/X̄)^{α_0}.    (3.3.3.1)

The estimator s_p^2 in terms of ε_1 and ε_2 can easily be written as
s_p^2 = S_y^2 (1 + ε_2)(1 + ε_1)^{α_0},
which, after ignoring the higher order terms, becomes

s_p^2 = S_y^2 (1 + ε_2)[1 + α_0 ε_1 + (α_0(α_0 − 1)/2) ε_1^2 + ...]
      = S_y^2 [1 + ε_2 + α_0 ε_1 + (α_0(α_0 − 1)/2) ε_1^2 + α_0 ε_1 ε_2 + ...].

Obviously the estimator s_p^2 is a biased estimator of S_y^2. Thus we have the
following theorems:

Theorem 3.3.3.1. The bias in the estimator s_p^2 of S_y^2, to the first order of
approximation, is given by

B(s_p^2) = ((1 − f)/n) S_y^2 [(α_0(α_0 − 1)/2) C_x^2 + α_0 λ_{21} C_x].    (3.3.3.2)

Proof. Taking expected values, we have

B(s_p^2) = E(s_p^2) − S_y^2 = S_y^2 [1 + 0 + 0 + (α_0(α_0 − 1)/2) E(ε_1^2) + α_0 E(ε_1 ε_2)] − S_y^2
= ((1 − f)/n) S_y^2 [(α_0(α_0 − 1)/2) C_x^2 + α_0 λ_{21} C_x].
Hence the theorem.

Theorem 3.3.3.2. The minimum mean squared error of the estimator s_p^2 is

Min.MSE(s_p^2) = ((1 − f)/n) S_y^4 [λ_{40} − 1 − λ_{21}^2].    (3.3.3.3)

Proof. We have

MSE(s_p^2) = E[s_p^2 − S_y^2]^2 ≈ E[S_y^2(1 + ε_2 + α_0 ε_1 + α_0 ε_1 ε_2) − S_y^2]^2 ≈ S_y^4 E[ε_2 + α_0 ε_1]^2
= S_y^4 E[ε_2^2 + α_0^2 ε_1^2 + 2α_0 ε_1 ε_2]
= ((1 − f)/n) S_y^4 [(λ_{40} − 1) + α_0^2 C_x^2 + 2α_0 C_x λ_{21}].    (3.3.3.4)

On differentiating (3.3.3.4) with respect to α_0 and equating to zero, we obtain

α_0 = −λ_{21}/C_x.    (3.3.3.5)

On substituting the optimum value of α_0 from (3.3.3.5) in (3.3.3.4), we have
(3.3.3.3). Hence the theorem.

Remark 3.3.3.1. If the population follows the bivariate normal distribution, then
λ_{21} = 0; hence the minimum mean square error of the proposed estimator becomes

Min.MSE(s_p^2) = ((1 − f)/n) S_y^4 (λ_{40} − 1) = V(s_y^2).    (3.3.3.6)

The expression (3.3.3.6) conveys the message to survey statisticians that if the
study variable and the auxiliary variable follow a bivariate normal distribution, then use
of the known population mean or total of the auxiliary variable is not helpful in improving
the usual estimator of variance.
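A quick numerical check of (3.3.3.3)-(3.3.3.5): the quadratic MSE factor from (3.3.3.4) is minimized exactly at α_0 = −λ_{21}/C_x. Here λ_{40} and λ_{21} are the population 1 values, while C_x = 1.2 is a made-up illustrative value.

```python
lam40, lam21, C_x = 3.5822, 0.9387, 1.2    # C_x is hypothetical

def g(a):
    # Bracketed MSE factor from (3.3.3.4)
    return (lam40 - 1) + (a * C_x) ** 2 + 2 * a * C_x * lam21

a_opt = -lam21 / C_x                        # optimum exponent (3.3.3.5)
grid = [a_opt + d / 100 for d in range(-50, 51)]
assert min(grid, key=g) == a_opt            # minimum of g on the grid is at a_opt
assert abs(g(a_opt) - (lam40 - 1 - lam21 ** 2)) < 1e-12   # matches (3.3.3.3)
print(round(g(a_opt), 4))   # ≈ 1.701
```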

3.3.4 GENERAL CLASS OF ESTIMATORS

Srivastava and Jhajj (1980) proposed a general class of ratio type estimators for
estimating the finite population variance S_y^2 as

s_SJ^2 = s_y^2 H(u, v),    (3.3.4.1)

where u = x̄/X̄, v = s_x^2/S_x^2, and H(u, v) is a parametric function such that H(1, 1) = 1
and satisfies certain regularity conditions as defined earlier. Thus all ratio or
product type estimators of the population variance considered by Das and Tripathi
(1978) and Kaur and Singh (1982), for example the estimators

s_y^2 (X̄/x̄)(S_x^2/s_x^2),   s_y^2 (X̄/x̄)^α (S_x^2/s_x^2)^γ,   and   s_y^2 [α(x̄/X̄) + (1 − α)][γ(s_x^2/S_x^2) + (1 − γ)],

are special cases of the class of estimators defined in (3.3.4.1). Now the class of
estimators s_SJ^2 defined in (3.3.4.1) can be written as

s_SJ^2 = S_y^2 [1 + ε_2 + ε_1 H_1 + ε_3 H_2 + ε_1 ε_2 H_1 + ε_2 ε_3 H_2 + ε_1^2 H_3 + ε_3^2 H_4 + ε_1 ε_3 H_5 + ...],    (3.3.4.2)

where H_1 = ∂H/∂u and H_2 = ∂H/∂v denote the first order, and H_3 = (1/2)∂^2H/∂u^2,
H_4 = (1/2)∂^2H/∂v^2 and H_5 = ∂^2H/∂u∂v the second order, partial derivatives of
H(u, v) evaluated at the point (1, 1).

Thus we have the following theorems:

Theorem 3.3.4.1. The bias in this class of estimators of variance, to the first order
of approximation, is given by

B(s_SJ^2) = ((1 − f)/n) S_y^2 [C_x^2 H_3 + (λ_{04} − 1)H_4 + C_x λ_{03} H_5 + C_x λ_{21} H_1 + (λ_{22} − 1)H_2].    (3.3.4.3)

Proof. Obvious by taking expected values on both sides of (3.3.4.2).

Theorem 3.3.4.2. The minimum mean squared error of the class of estimators s_SJ^2,
to the first order of approximation, is

Min.MSE(s_SJ^2) = ((1 − f)/n) S_y^4 [(λ_{40} − 1) − λ_{21}^2 − {λ_{21} λ_{03} − (λ_{22} − 1)}^2/(λ_{04} − 1 − λ_{03}^2)].    (3.3.4.4)

Proof. By the definition of mean squared error and by ignoring higher order terms
of ε_1, ε_2, and ε_3 we have

MSE(s_SJ^2) = E[s_SJ^2 − S_y^2]^2 ≈ E[S_y^2 (ε_2 + ε_1 H_1 + ε_3 H_2)]^2
= S_y^4 E[ε_2^2 + ε_1^2 H_1^2 + ε_3^2 H_2^2 + 2H_1 ε_1 ε_2 + 2H_2 ε_2 ε_3 + 2ε_1 ε_3 H_1 H_2]
= ((1 − f)/n) S_y^4 [(λ_{40} − 1) + C_x^2 H_1^2 + (λ_{04} − 1)H_2^2 + 2C_x λ_{21} H_1 + 2(λ_{22} − 1)H_2
+ 2C_x λ_{03} H_1 H_2].    (3.3.4.5)

On differentiating it with respect to H_1 and H_2, respectively, and equating to zero
we obtain
C_x H_1 + λ_{03} H_2 = −λ_{21},    (3.3.4.6)
and
C_x λ_{03} H_1 + (λ_{04} − 1) H_2 = −(λ_{22} − 1).    (3.3.4.7)

On solving the equations in (3.3.4.6) and (3.3.4.7) for H_1 and H_2 we have

H_1 = −[λ_{21}(λ_{04} − 1) − λ_{03}(λ_{22} − 1)] / [C_x(λ_{04} − 1 − λ_{03}^2)],  and
H_2 = −[(λ_{22} − 1) − λ_{03} λ_{21}] / (λ_{04} − 1 − λ_{03}^2).    (3.3.4.8)

On substituting these optimum values of H_1 and H_2 in (3.3.4.5) we obtain
(3.3.4.4). Hence the theorem.

Theorem 3.3.4.3. The minimum MSE, to the first order of approximation, of the
wider class of estimators defined as

s_SJ(w)^2 = H(s_y^2, u, v),    (3.3.4.9)

where H(·,·,·) is a parametric function such that H(S_y^2, 1, 1) = S_y^2 and
∂H/∂s_y^2 |_{(S_y^2,1,1)} = 1, and satisfies the regularity conditions, is

Min.MSE(s_SJ(w)^2) = ((1 − f)/n) S_y^4 [(λ_{40} − 1) − λ_{21}^2 − {λ_{21} λ_{03} − (λ_{22} − 1)}^2/(λ_{04} − 1 − λ_{03}^2)].    (3.3.4.10)

Proof. Expanding H(s_y^2, u, v) around the point (S_y^2, 1, 1) in a first order Taylor's
series we have

s_SJ(w)^2 = H(s_y^2, u, v) = H[S_y^2 + (s_y^2 − S_y^2), 1 + (u − 1), 1 + (v − 1)]
= H(S_y^2, 1, 1) + (s_y^2 − S_y^2) ∂H/∂s_y^2 |_{(S_y^2,1,1)} + (u − 1) ∂H/∂u |_{(S_y^2,1,1)} + (v − 1) ∂H/∂v |_{(S_y^2,1,1)} + ...
= S_y^2 + (s_y^2 − S_y^2) + (u − 1)H_1 + (v − 1)H_2 + ...
= S_y^2 (1 + ε_2) + ε_1 H_1 + ε_3 H_2 + ....    (3.3.4.11)

Thus the mean squared error, to the first order of approximation, of the wider class
of estimators is

MSE(s_SJ(w)^2) = E[s_SJ(w)^2 − S_y^2]^2 = E[S_y^2(1 + ε_2) + ε_1 H_1 + ε_3 H_2 − S_y^2]^2
= E[S_y^2 ε_2 + ε_1 H_1 + ε_3 H_2]^2
= E[S_y^4 ε_2^2 + ε_1^2 H_1^2 + ε_3^2 H_2^2 + 2S_y^2 H_1 ε_1 ε_2 + 2S_y^2 H_2 ε_2 ε_3 + 2H_1 H_2 ε_1 ε_3]
= ((1 − f)/n)[S_y^4(λ_{40} − 1) + H_1^2 C_x^2 + H_2^2(λ_{04} − 1) + 2S_y^2 C_x λ_{21} H_1
+ 2H_2 S_y^2 (λ_{22} − 1) + 2H_1 H_2 C_x λ_{03}].    (3.3.4.12)

The mean squared error of the wider class of estimators is minimum for the set of
normal equations

[ C_x          λ_{03}       ] [H_1]   [ −S_y^2 λ_{21}       ]
[ C_x λ_{03}   (λ_{04} − 1) ] [H_2] = [ −S_y^2 (λ_{22} − 1) ].    (3.3.4.13)

On solving (3.3.4.13) for H_1 and H_2 we have

H_1 = −S_y^2 [λ_{21}(λ_{04} − 1) − λ_{03}(λ_{22} − 1)] / [C_x(λ_{04} − 1 − λ_{03}^2)],  and
H_2 = −S_y^2 [(λ_{22} − 1) − λ_{21} λ_{03}] / (λ_{04} − 1 − λ_{03}^2).    (3.3.4.14)

On substituting H_1 and H_2 from (3.3.4.14) in (3.3.4.12) we prove the theorem.
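The closed forms in (3.3.4.14) (and, dropping the common S_y^2 factor, in (3.3.4.8)) can be cross-checked by solving the 2×2 normal equations directly. The λ values below are from population 1; C_x = 1.1 is a made-up value, and S_y^2 is set to 1 so that the common factor drops out.

```python
Cx, lam03, lam04 = 1.1, 1.5936, 4.5247    # Cx hypothetical
lam21, lam22 = 0.9387, 2.8411

# Normal equations: [Cx, lam03; Cx*lam03, lam04-1] [H1, H2]' = [-lam21, -(lam22-1)]'
a11, a12, a21, a22 = Cx, lam03, Cx * lam03, lam04 - 1
b1, b2 = -lam21, -(lam22 - 1)
det = a11 * a22 - a12 * a21
H1 = (b1 * a22 - a12 * b2) / det           # Cramer's rule
H2 = (a11 * b2 - a21 * b1) / det

D = lam04 - 1 - lam03 ** 2
H1_closed = -(lam21 * (lam04 - 1) - lam03 * (lam22 - 1)) / (Cx * D)
H2_closed = -((lam22 - 1) - lam21 * lam03) / D
assert abs(H1 - H1_closed) < 1e-9 and abs(H2 - H2_closed) < 1e-9
print(round(H1, 4), round(H2, 4))
```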

Example 3.3.4.1. Suppose we selected an SRSWOR sample of eight states from
population 1 given in the Appendix and collected the required information. Find the
relative efficiency of the general class of estimators, which makes use of the known
variance of the auxiliary variable at the estimation stage, for estimating the finite
population variance of the amount of real estate farm loans during 1997 by using
information on nonreal estate farm loans during 1997, with respect to the ratio
and difference type estimators of the finite population variance.

Solution. From the description of population 1 given in the Appendix we have
Y_i = amount (in $000) of real estate farm loans in different states during 1997,
X_i = amount (in $000) of nonreal estate farm loans in different states during 1997,
S_y^2 = 342021.5, λ_{03} = 1.5936, N = 50, λ_{21} = 0.9387, λ_{40} = 3.5822, λ_{04} = 4.5247,
λ_{22} = 2.8411 and f = 0.16.

Now we have

MSE(s_1^2) = ((1 − f)/n) S_y^4 [λ_{40} + λ_{04} − 2λ_{22}]
= ((1 − 0.16)/8)(342021.5)^2 [3.5822 + 4.5247 − 2 × 2.8411] = 2.97820183×10^10,

Min.V(s_d^2) = ((1 − f)/n) S_y^4 [λ_{40} − 1 − (λ_{22} − 1)^2/(λ_{04} − 1)]
= ((1 − 0.16)/8)(342021.5)^2 [3.5822 − 1 − (2.8411 − 1)^2/(4.5247 − 1)] = 1.990441025×10^10,

and

Min.MSE(s_SJ^2) = ((1 − f)/n) S_y^4 [λ_{40} − 1 − λ_{21}^2 − {λ_{21} λ_{03} − (λ_{22} − 1)}^2/(λ_{04} − 1 − λ_{03}^2)]
= ((1 − 0.16)/8)(342021.5)^2 [3.5822 − 1 − (0.9387)^2 − (0.9387 × 1.5936 − 1.8411)^2/(4.5247 − 1 − (1.5936)^2)]
= 1.940787669×10^10.

Thus the percent relative efficiency (RE) of s_d^2 with respect to s_1^2 is given by

RE = (MSE(s_1^2)/Min.V(s_d^2)) × 100 = (2.97820183×10^10/1.990441025×10^10) × 100 = 149.63%,

and the percent relative efficiency (RE) of s_SJ^2 with respect to s_d^2 is given by

RE = (Min.V(s_d^2)/Min.MSE(s_SJ^2)) × 100 = (1.990441025×10^10/1.940787669×10^10) × 100 = 102.56%.

Thus the general class of estimators remains more efficient than both ratio and
regression type estimators. It is noted that the relative efficiency, to the first order of
approximation, in each case remains independent of the sample size .
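The three efficiencies of Example 3.3.4.1 depend only on the bracketed moment factors, since (1 − f)/n · S_y^4 is common to all three mean squared errors; a sketch:

```python
lam40, lam04, lam22 = 3.5822, 4.5247, 2.8411
lam21, lam03 = 0.9387, 1.5936

ratio = lam40 + lam04 - 2 * lam22                           # MSE(s_1^2) factor
diff = lam40 - 1 - (lam22 - 1) ** 2 / (lam04 - 1)           # Min.V(s_d^2) factor
general = (lam40 - 1 - lam21 ** 2
           - (lam21 * lam03 - (lam22 - 1)) ** 2 / (lam04 - 1 - lam03 ** 2))

print(round(100 * ratio / diff, 2))       # ≈ 149.63
print(round(100 * diff / general, 2))     # ≈ 102.56
```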

We have seen that the minimum mean square errors of the classes of estimators of the
population mean and variance depend upon the values of the unknown population
parameters H_1 and H_2, etc. Srivastava and Jhajj (1983a) have shown that
consistent estimators of the unknown parameters H_1 and H_2 can be used in place
of the actual values. The minimum mean square error of the resultant class of
estimators with estimated optimum values remains the same as that with the actual
optimum values, to the first order of approximation. We are not discussing this
procedure in detail, but the interested reader may refer to Srivastava and Jhajj
(1983a) and Singh and Singh (1984a). Singh and Zaidi (2000) estimated the square
of the population mean and the variance.

The next section is devoted to studying the asymptotic properties of various
estimators of the regression coefficient

β = S_xy / S_x^2.

3.4 ESTIMATION OF REGRESSION COEFFICIENT

The meaning and purpose of the estimation of the regression coefficient is well known
to survey statisticians. Some like to define it as the slope of the linear regression of
Y on X, or the change in the study variable per unit change in the auxiliary variable.
Here we will discuss a few estimators for estimating the regression coefficient in
survey sampling.

3.4.1 USUAL ESTIMATOR

Srivastava, Jhajj, and Sharma (1986) considered the usual estimator of the regression
coefficient β as

b_1 = s_xy / s_x^2,    (3.4.1.1)

which in terms of ε_3 and ε_4 can easily be written as

b_1 = S_xy(1 + ε_4) / [S_x^2(1 + ε_3)] = β(1 + ε_4)(1 + ε_3)^{-1} = β(1 + ε_4)(1 − ε_3 + ε_3^2 + ...)
    = β[1 + ε_4 − ε_3 + ε_3^2 − ε_3 ε_4 + ...].    (3.4.1.2)

Thus we have the following theorems:

Theorem 3.4.1.1. The bias, to the first order of approximation, in the estimator b_1
of β is

B(b_1) = ((1 − f)/n) β [(λ_{04} − 1) − (λ_{13}/ρ_xy − 1)].    (3.4.1.3)

Proof. Taking expected values on both sides of (3.4.1.2) we have

E(b_1) = βE[1 + ε_4 − ε_3 + ε_3^2 − ε_3 ε_4 + ...]
       = β[1 + E(ε_4) − E(ε_3) + E(ε_3^2) − E(ε_3 ε_4)].

Thus the bias in the estimator b_1 of β, to the first order of approximation, is

B(b_1) = E(b_1) − β = ((1 − f)/n) β [(λ_{04} − 1) − (λ_{13}/ρ_xy − 1)].

Hence the theorem.

Theorem 3.4.1.2. The mean squared error of the usual estimator b_1 of the regression
coefficient β is given by

MSE(b_1) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 + λ_{04} − 2λ_{13}/ρ_xy].    (3.4.1.4)

Proof. By using the definition of mean square error (MSE) we have

MSE(b_1) = E(b_1 − β)^2 ≈ E[β(1 + ε_4 − ε_3 + ...) − β]^2 ≈ β^2 E[ε_4 − ε_3]^2
= β^2 E[ε_4^2 + ε_3^2 − 2ε_3 ε_4]
= ((1 − f)/n) β^2 [(λ_{22}/ρ_xy^2 − 1) + (λ_{04} − 1) − 2(λ_{13}/ρ_xy − 1)].

Hence the theorem.

3.4.2 UNBIASED ESTIMATOR

We also consider an unbiased estimator of the regression coefficient β as

b_2 = s_xy / S_x^2.    (3.4.2.1)

Thus we have the following theorem:

Theorem 3.4.2.1. The variance of the unbiased estimator b_2 of the regression
coefficient β is given by

V(b_2) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 − 1].    (3.4.2.2)

Proof. By the definition of variance we have

V(b_2) = E[b_2 − β]^2 = E[β(1 + ε_4) − β]^2 = β^2 E(ε_4^2) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 − 1].

Hence the theorem.

Example 3.4.2.1. For estimating the regression coefficient of the amount of the real
estate farm loans (in $000) on the nonreal estate farm loans, Mr. Nelson used the
53rd and 54th columns of the Pseudo-Random Numbers given in Table 1 of the
Appendix to select eight distinct random numbers between 1 and 50 as: 06, 08, 45,
15, 22, 39, 43 and 34. He collected the following sample information:

State:                           CO       DE      VT      IA        MI       RI     TX        ND
Nonreal estate farm loans (X)$:  906.281  43.229  19.363  3909.738  440.518  0.233  3520.361  1241.369
Real estate farm loans (Y)$:     315.809  42.808  57.747  2327.025  323.028  1.611  1248.761  449.099

The population variance S_x^2 = 1176526 of nonreal estate farm loans (in $000) for the
year 1997 is known from population 1 of the Appendix.
(a) Estimate the regression coefficient β with two different methods.
(b) Also find an estimate of the mean squared errors in each case and hence derive
the 95% confidence intervals.
Solution. From the sample information, we have:

Sr.  (y_i−ȳ)(x_i−x̄)  (x_i−x̄)^2    (y_i−ȳ)^2    (x_i−x̄)^4    (y_i−ȳ)^4     (y_i−ȳ)^2(x_i−x̄)^2   (y_i−ȳ)(x_i−x̄)^3
1    99053.71         125213.72    78359.12     15678474394  6140152522    9811637177           12402882820
2    672862.23        1480863.86   305729.37    2.19296E+12  93470449627   4.52744E+11          9.96417E+11
3    667522.49        1539518.88   289432.16    2.37012E+12  83770977628   4.45586E+11          1.02766E+12
4    4587225.90       7020388.11   2997361.60   4.92858E+13  8.98418E+12   2.10426E+13          3.22041E+13
5    223516.52        671774.48    74369.65     4.51281E+11  5530845327    49959635561          1.50153E+11
6    748540.17        1587356.83   352984.51    2.5197E+12   1.24598E+11   5.60312E+11          1.1882E+12
7    1475983.10       5108614.79   426441.65    2.60979E+13  1.81852E+11   2.17853E+12          7.54023E+12
8    2752.01          352.22       21502.41     124058.2636  462353625.9   7573558.476          969310.3289
Sum  8477456.20       17534082.89  4546180.49   8.29335E+13  9.48E+12      2.47396E+13          4.31192E+13

Thus from the above table we have

n = 8, N = 50, f = 0.16, ȳ = (1/8)Σy_i = 595.736, x̄ = (1/8)Σx_i = 1260.137,
μ̂_{02} = s_x^2 = (8 − 1)^{-1} Σ(x_i − x̄)^2 = 2504868.98,
μ̂_{04} = (8 − 1)^{-1} Σ(x_i − x̄)^4 = 1.184764×10^13,
μ̂_{20} = s_y^2 = (8 − 1)^{-1} Σ(y_i − ȳ)^2 = 649454.36,
μ̂_{11} = (8 − 1)^{-1} Σ(y_i − ȳ)(x_i − x̄) = 1211065.171,
μ̂_{22} = (8 − 1)^{-1} Σ(y_i − ȳ)^2 (x_i − x̄)^2 = 3.534228×10^12,
μ̂_{13} = (8 − 1)^{-1} Σ(y_i − ȳ)(x_i − x̄)^3 = 6.15988×10^12,

λ̂_{04} = μ̂_{04}/μ̂_{02}^2 = 1.184764×10^13/(2504868.98)^2 = 1.88826,
λ̂_{22} = μ̂_{22}/(μ̂_{20} μ̂_{02}) = 3.534228×10^12/(649454.36 × 2504868.98) = 2.1725,
λ̂_{13} = μ̂_{13}/[√μ̂_{20} (μ̂_{02})^{3/2}] = 6.15988×10^12/[√649454.36 × (2504868.98)^{3/2}] = 1.9280,
and
r_xy = μ̂_{11}/√(μ̂_{20} μ̂_{02}) = 1211065.171/√(649454.36 × 2504868.98) = 0.9495.

(a) Usual estimator of regression coefficient

The usual estimator of the regression coefficient β is given by

b_1 = s_xy/s_x^2 = μ̂_{11}/μ̂_{02} = 1211065.171/2504868.98 = 0.4835,

and an estimator of MSE(b_1) is given by

\widehat{MSE}(b_1) = ((1 − f)/n) b_1^2 [λ̂_{22}/r_xy^2 + λ̂_{04} − 2λ̂_{13}/r_xy]
= ((1 − 0.16)/8)(0.4835)^2 [2.1725/0.9495^2 + 1.88826 − 2 × 1.9280/0.9495]
= 0.005815.

The (1 − α)100% confidence interval for the regression coefficient β is given by
b_1 ± t_{α/2}(df = n − 2) √(\widehat{MSE}(b_1)).
Using Table 2 of the Appendix, the 95% confidence interval of β is given by

0.4835 ± t_{0.05/2}(df = 8 − 2) √0.005815 = 0.4835 ± 2.447 √0.005815, or [0.2969, 0.6701].



(b) Unbiased estimator of regression coefficient

The unbiased estimator of the regression coefficient β is given by

b_2 = s_xy/S_x^2 = μ̂_{11}/S_x^2 = 1211065.171/1176526 = 1.029,

and an estimator of V(b_2) is given by

V̂(b_2) = ((1 − f)/n) b_2^2 [λ̂_{22}/r_xy^2 − 1] = ((1 − 0.16)/8)(1.029)^2 [2.1725/0.9495^2 − 1]
= 0.1568.

The (1 − α)100% confidence interval for the regression coefficient β using b_2 is
b_2 ± t_{α/2}(df = n − 2) √(V̂(b_2)).
Using Table 2 from the Appendix, the 95% confidence interval is given by

1.029 ± t_{0.05/2}(df = 8 − 2) √0.1568 = 1.029 ± 2.447 √0.1568, or [0.060, 1.998].

Note that we used (n − 2) degrees of freedom while constructing the confidence
interval estimates for the regression coefficient.
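Example 3.4.2.1 can be reproduced from the raw sample values (a sketch; small differences from the printed figures are rounding):

```python
# Sample data (x = nonreal estate, y = real estate farm loans, $000)
x = [906.281, 43.229, 19.363, 3909.738, 440.518, 0.233, 3520.361, 1241.369]
y = [315.809, 42.808, 57.747, 2327.025, 323.028, 1.611, 1248.761, 449.099]
n, N, Sx2 = 8, 50, 1176526.0
f, t = n / N, 2.447                      # t_{0.025}(df = n - 2 = 6)

xbar, ybar = sum(x) / n, sum(y) / n
def m(r, s):
    """Sample central moment: sum((y - ybar)^r (x - xbar)^s) / (n - 1)."""
    return sum((yi - ybar) ** r * (xi - xbar) ** s
               for yi, xi in zip(y, x)) / (n - 1)

s_x2, s_y2, s_xy = m(0, 2), m(2, 0), m(1, 1)
lam04, lam22 = m(0, 4) / s_x2 ** 2, m(2, 2) / (s_y2 * s_x2)
lam13 = m(1, 3) / (s_y2 ** 0.5 * s_x2 ** 1.5)
r_xy = s_xy / (s_y2 * s_x2) ** 0.5

b1 = s_xy / s_x2                                        # usual estimator
mse_b1 = (1 - f) / n * b1 ** 2 * (lam22 / r_xy ** 2 + lam04 - 2 * lam13 / r_xy)
b2 = s_xy / Sx2                                         # unbiased estimator
var_b2 = (1 - f) / n * b2 ** 2 * (lam22 / r_xy ** 2 - 1)

print(round(b1, 4), round(mse_b1, 4))    # ≈ 0.4835 and ≈ 0.0058
print(round(b2, 3), round(var_b2, 4))    # ≈ 1.029 and ≈ 0.1568
```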

3.4.3 IMPROVED ESTIMATORS OF REGRESSION COEFFICIENT

Singh and Singh (1988) introduced a general class of estimators to estimate the
regression coefficient as

b_SS = (s_xy/s_x^2) H(u),    (3.4.3.1)

where H(·) is a parametric function such that H(1) = 1, satisfying certain
regularity conditions as defined earlier. Thus we have the following theorem:

Theorem 3.4.3.1. The minimum mean square error, to the first order of
approximation, of the general class of estimators b_SS is given by

Min.MSE(b_SS) = MSE(b_1) − ((1 − f)/n) β^2 (λ_{21}/ρ_xy − λ_{03})^2.    (3.4.3.2)

Proof. The estimator b_SS can be written as

b_SS = (s_xy/s_x^2) H(u) = (s_xy/s_x^2) H[1 + (u − 1)]
= (s_xy/s_x^2)[H(1) + (u − 1) ∂H/∂u |_{u=1} + ((u − 1)^2/2) ∂^2H/∂u^2 |_{u=1} + ...]
= (S_xy(1 + ε_4)/(S_x^2(1 + ε_3)))[1 + ε_1 H_1 + ε_1^2 H_2 + ...]
= β(1 + ε_4)(1 + ε_3)^{-1}[1 + ε_1 H_1 + ε_1^2 H_2 + ...]
= β(1 + ε_4)(1 − ε_3 + ε_3^2 + ...)[1 + ε_1 H_1 + ε_1^2 H_2 + ...]
= β(1 + ε_4 − ε_3 + ε_3^2 − ε_3 ε_4 + ε_1 H_1 + ε_1^2 H_2 + ε_1 ε_4 H_1 − ε_1 ε_3 H_1 + ...).

The mean squared error of the estimator b_SS, to the first order of approximation, is

MSE(b_SS) ≈ E[β(1 + ε_4 − ε_3 + ε_3^2 − ε_3 ε_4 + ε_1 H_1 + ε_1^2 H_2 + ε_1 ε_4 H_1 − ε_1 ε_3 H_1 + ...) − β]^2
≈ E[β(ε_4 − ε_3 + ε_1 H_1)]^2
= β^2 E[ε_4^2 + ε_3^2 − 2ε_3 ε_4 + ε_1^2 H_1^2 + 2H_1(ε_1 ε_4 − ε_1 ε_3)]
= MSE(b_1) + ((1 − f)/n) β^2 [C_x^2 H_1^2 + 2H_1 C_x(λ_{21}/ρ_xy − λ_{03})].    (3.4.3.3)

On differentiating (3.4.3.3) with respect to H_1 and equating to zero, we obtain

H_1 = −C_x^{-1}(λ_{21}/ρ_xy − λ_{03}).    (3.4.3.4)

On substituting the optimum value of H_1 in (3.4.3.3) we obtain

Min.MSE(b_SS) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 + λ_{04} − 2λ_{13}/ρ_xy − (λ_{21}/ρ_xy − λ_{03})^2]
= MSE(b_1) − ((1 − f)/n) β^2 (λ_{21}/ρ_xy − λ_{03})^2.    (3.4.3.5)

Hence the theorem.

Now we have the following remark:

Remark 3.4.3.1. Several improved estimators of the regression coefficient β can be
suggested, for example

b_s1 = (s_xy/s_x^2)(X̄/x̄)^α,  b_s2 = (s_xy/s_x^2)(X̄/x̄)^α (S_x^2/s_x^2)^γ,  and
b_s3 = (s_xy/s_x^2)[α(x̄/X̄) + (1 − α)](S_x^2/s_x^2)^γ,  etc.

It is interesting to note that the use of any one of these estimators of the regression
coefficient β will result in different types of regression estimators of the
population mean Ȳ, defined as

Ȳ_j = ȳ + b_sj(X̄ − x̄),  j = 1, 2, 3, ... etc.    (3.4.3.6)

Sampath (1989) has shown that the resultant regression type estimators of the
population mean have the same minimum mean squared error, to the first order of
approximation, as that of the usual regression estimator. In other words, the
construction of an improved estimator of the regression coefficient β is not helpful in
improving the estimator of the population mean Ȳ.

Example 3.4.3.1. The real and nonreal estate farm loans (in $000) during 1997 in
the 50 states of the United States have been given in population 1 of the
Appendix. If we select an SRSWOR sample of eight states, find the relative
efficiency of the usual estimator b_1 with respect to the unbiased estimator b_2 of
the regression coefficient for this population.

Solution. Using the results from the description of population 1 we have

MSE(b_1) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 + λ_{04} − 2λ_{13}/ρ_xy]
= ((1 − 0.16)/8)(0.4334)^2 [2.8411/(0.8038)^2 + 4.5247 − 2 × 3.2561/0.8038] = 0.016178,

and

V(b_2) = ((1 − f)/n) β^2 [λ_{22}/ρ_xy^2 − 1] = ((1 − 0.16)/8)(0.4334)^2 [2.8411/(0.8038)^2 − 1] = 0.067005.

Thus the percent relative efficiency of b_1 with respect to b_2 is given by

RE = (V(b_2)/MSE(b_1)) × 100 = (0.067005/0.016178) × 100 = 414.17%.

3.5 ESTIMATION OF FINITE POPULATION CORRELATION COEFFICIENT

For a bivariate normal population, an estimator of the population correlation coefficient
ρ_xy due to Pearson (1896) has been defined as

r_xy = s_xy/(s_x s_y).    (3.5.1)

Wakimoto (1971), Gupta, Singh, and Lal (1978, 1979), Rana (1989) and Biradar
and Singh (1992a) studied the behaviour of the estimator r_xy under SRSWOR
sampling. Singh, Mangat, and Gupta (1996) have shown that the class of estimators
of the population correlation coefficient proposed by Srivastava and Jhajj (1986) as

r_SJ = r_xy H(u, v),    (3.5.2)

where u = x̄/X̄, v = s_x^2/S_x^2, and H(·,·) is a parametric function such that H(1, 1) = 1
satisfying certain regularity conditions, can take an inadmissible value, i.e., outside
the range [−1.0, +1.0], for a given sample. In other words, the ratio type or
regression type estimators cannot be made to estimate the correlation coefficient.
The estimator r_xy can easily be written in terms of ε_2, ε_3 and ε_4 as follows:

r_xy = ρ_xy [1 + ε_4 − (1/2)ε_2 − (1/2)ε_3 + (3/8)ε_2^2 + (3/8)ε_3^2 + (1/4)ε_2 ε_3 − (1/2)ε_2 ε_4 − (1/2)ε_3 ε_4 + ...].    (3.5.3)

Thus we have the following theorems:

Theorem 3.5.1. The bias, up to terms of order O(n^{-1}), in the estimator r_xy of ρ_xy is

B(r_xy) = ((1 − f)/n) ρ_xy [(3/8)(λ_{40} + λ_{04} − 2) − (1/2)(λ_{31}/ρ_xy + λ_{13}/ρ_xy − 2) + (1/4)(λ_{22} − 1)].    (3.5.4)

Proof. It follows by taking expected values on both sides of (3.5.3). Hence the
theorem.

Theorem 3.5.2. The mean squared error, up to terms of order O(n^{-1}), of the usual
estimator r_xy is

MSE(r_xy) = ((1 − f)/n) ρ_xy^2 [(λ_{22}/ρ_xy^2 − 1) + (1/2)(λ_{22} − 1) + (1/4)(λ_{40} + λ_{04} − 2)
− (λ_{31}/ρ_xy + λ_{13}/ρ_xy − 2)].    (3.5.5)

Proof. By the definition of the mean squared error (MSE) we have

MSE(r_xy) = E[r_xy − ρ_xy]^2 ≈ E[ρ_xy(1 + ε_4 − (1/2)ε_2 − (1/2)ε_3 + ...) − ρ_xy]^2
= ρ_xy^2 E[ε_4 − (1/2)ε_2 − (1/2)ε_3]^2
= ρ_xy^2 E[ε_4^2 + (1/4)ε_2^2 + (1/4)ε_3^2 − ε_2 ε_4 − ε_3 ε_4 + (1/2)ε_2 ε_3].

Using expected values we obtain (3.5.5). Hence the theorem.



Example 3.5.1. The real and nonreal estate farm loans (in $000) during 1997 in the 50 states of the United States are presented in population 1 of the Appendix. Suppose we select an SRSWOR sample of eight states to collect the required information. Study the relative bias of the usual estimator of the correlation coefficient.

Solution. Using the results from the description of population 1 given in the Appendix, we have

B(r_{xy}) = ((1-f)/n) \rho_{xy} [ (3/8)(\lambda_{40} + \lambda_{04} - 2) - (1/2)(\lambda_{31}/\rho_{xy} + \lambda_{13}/\rho_{xy} - 2) + (1/4)(\lambda_{22} - 1) ]
= ((1-0.16)/8)(0.8038) [ (3/8)(3.5822 + 4.5247 - 2) - (1/2)(2.9287/0.8038 + 3.2561/0.8038 - 2) + (1/4)(2.8411 - 1) ]
= -0.008171.

Also

MSE(r_{xy}) = ((1-f)/n) \rho_{xy}^2 [ (\lambda_{22}/\rho_{xy}^2 - 1) + (1/2)(\lambda_{22} - 1) + (1/4)(\lambda_{40} + \lambda_{04} - 2) - (\lambda_{31}/\rho_{xy} + \lambda_{13}/\rho_{xy} - 2) ]
= ((1-0.16)/8)(0.8038)^2 [ (2.8411/0.8038^2 - 1) + (1/2)(2.8411 - 1) + (1/4)(4.5247 + 3.5822 - 2) - (2.9287/0.8038 + 3.2561/0.8038 - 2) ]
= 0.01018.

Thus the required relative bias is given by

RB = |B(r_{xy})| / \sqrt{MSE(r_{xy})} = 0.008171 / \sqrt{0.01018} = 0.081.
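The bias and MSE expressions above can be checked numerically. The following sketch (not part of the book) plugs the population-1 moment ratios quoted in the solution into (3.5.4) and (3.5.5):

```python
# Bias, MSE, and relative bias of r_xy under SRSWOR, using the
# population-1 values quoted in Example 3.5.1.
n, N = 8, 50
f = n / N                      # sampling fraction
rho = 0.8038                   # population correlation coefficient
l40, l04, l22, l31, l13 = 3.5822, 4.5247, 2.8411, 2.9287, 3.2561

c = (1 - f) / n
# Bias (3.5.4)
bias = c * rho * (3 / 8 * (l40 + l04 - 2)
                  - 0.5 * (l31 / rho + l13 / rho - 2)
                  + 0.25 * (l22 - 1))
# MSE (3.5.5)
mse = c * rho**2 * ((l22 / rho**2 - 1) + 0.5 * (l22 - 1)
                    + 0.25 * (l40 + l04 - 2)
                    - (l31 / rho + l13 / rho - 2))
rel_bias = abs(bias) / mse**0.5
print(bias, mse, rel_bias)
```

The printed values agree with the -0.008171, 0.01018, and 0.081 obtained by hand above, up to rounding of the quoted moment ratios.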

Example 3.5.2. Consider the problem of estimating the correlation coefficient between the amounts of the real and nonreal estate farm loans in the United States. An SRSWOR sample of eight states has been drawn from population 1 given in the Appendix, and the following information was gathered.

State                             MA       NC       MO        FL       TX        IA        TN       NH
Nonreal estate farm loans (X)$    56.471   494.730  1519.994  464.516  3520.361  3909.738  388.869  0.471
Real estate farm loans (Y)$       7.590    639.571  1579.686  825.748  1248.761  2327.025  553.266  6.044

( a ) Estimate the correlation coefficient \rho_{xy} between the real estate farm loans and the nonreal estate farm loans.

( b ) Also find an estimate of the mean squared error and hence deduce the 80%
confidence interval for the correlation coefficient.

Solution. ( a ) Estimation of the correlation coefficient: From the sample information we have

State   y_i        x_i        (y_i - \bar{y})^2   (x_i - \bar{x})^2   (y_i - \bar{y})(x_i - \bar{x})
MA      7.590      56.471     793651.8068    1532452.7350    1102829.9420
NC      639.571    494.730    67024.2263     639462.1131     207025.2481
MO      1579.686   1519.994   464066.9897    50895.4728      153684.4457
FL      825.748    464.516    5287.2349      688697.0799     60343.2120
TX      1248.761   3520.361   122709.8273    4954930.1980    779755.4929
IA      2327.025   3909.738   2040794.0310   6840025.5460    3736185.6620
TN      553.266    388.869    119159.8469    819975.0729     312582.9556
NH      6.044      0.471      796408.7712    1674236.0830    1154719.1440
Sum     7187.691   10355.150  4409102.7340   17200674.3000   7507126.1030

From the above table we have n = 8, N = 50, f = 0.16,

\hat{\mu}_{20} = s_y^2 = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})^2 = 629871.82,

\hat{\mu}_{02} = s_x^2 = (1/(8-1)) \sum_{i=1}^{8} (x_i - \bar{x})^2 = 2457239.19,

and

\hat{\mu}_{11} = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})(x_i - \bar{x}) = 1072446.59.

Thus the usual estimator of the correlation coefficient, \rho_{xy}, is given by

r_{xy} = \hat{\rho}_{xy} = \hat{\mu}_{11} / \sqrt{\hat{\mu}_{20}\hat{\mu}_{02}} = 1072446.59 / \sqrt{629871.82 \times 2457239.19} = 0.8620.

( b ) Estimation of mean squared error and confidence interval: From the sample information we have

State   (y_i-\bar{y})^2(x_i-\bar{x})^2   (y_i-\bar{y})^4   (x_i-\bar{x})^4   (y_i-\bar{y})(x_i-\bar{x})^3   (y_i-\bar{y})^3(x_i-\bar{x})
MA      1.21623E+12   6.29883E+11   2.34841E+12   1.69004E+12   8.75263E+11
NC      4.28595E+10   4.49225E+09   4.08912E+11   1.32383E+11   1.38757E+10
MO      2.36189E+10   2.15358E+11   2.59035E+09   7.82180E+09   7.13199E+10
FL      3.64130E+09   2.79549E+07   4.74304E+11   4.15582E+10   3.19049E+08
TX      6.08019E+11   1.50577E+10   2.45513E+13   3.86363E+12   9.56837E+10
IA      1.39591E+13   4.16484E+12   4.67859E+13   2.55556E+13   7.62479E+12
TN      9.77081E+10   1.41991E+10   6.72359E+11   2.56310E+11   3.72473E+10
NH      1.33338E+12   6.34267E+11   2.80307E+12   1.93327E+12   9.19628E+11
Sum     1.72845E+13   5.67813E+12   7.80469E+13   3.34806E+13   9.63812E+12

From the above table we have

\hat{\mu}_{40} = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})^4 = 8.11161 \times 10^{11},   \hat{\lambda}_{40} = \hat{\mu}_{40}/\hat{\mu}_{20}^2 = 2.04457,

\hat{\mu}_{04} = (1/(8-1)) \sum_{i=1}^{8} (x_i - \bar{x})^4 = 1.11496 \times 10^{13},   \hat{\lambda}_{04} = \hat{\mu}_{04}/\hat{\mu}_{02}^2 = 1.84655,

\hat{\mu}_{22} = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})^2(x_i - \bar{x})^2 = 2.46921 \times 10^{12},
\hat{\lambda}_{22} = \hat{\mu}_{22}/(\hat{\mu}_{20}\hat{\mu}_{02}) = 2.46921 \times 10^{12} / (629871.82 \times 2457239.19) = 1.59536,

\hat{\mu}_{31} = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})^3(x_i - \bar{x}) = 1.37687 \times 10^{12},
\hat{\lambda}_{31} = \hat{\mu}_{31}/(\hat{\mu}_{20}^{3/2}\hat{\mu}_{02}^{1/2}) = 1.37687 \times 10^{12} / (629871.82^{3/2} \sqrt{2457239.19}) = 1.75708,

\hat{\mu}_{13} = (1/(8-1)) \sum_{i=1}^{8} (y_i - \bar{y})(x_i - \bar{x})^3 = 4.78294 \times 10^{12},
and
\hat{\lambda}_{13} = \hat{\mu}_{13}/(\hat{\mu}_{20}^{1/2}\hat{\mu}_{02}^{3/2}) = 4.78294 \times 10^{12} / (\sqrt{629871.82} \times 2457239.19^{3/2}) = 1.56457.

Thus an estimator of MSE(r_{xy}) is given by

\hat{MSE}(r_{xy}) = ((1-f)/n) r_{xy}^2 [ (\hat{\lambda}_{22}/r_{xy}^2 - 1) + (1/2)(\hat{\lambda}_{22} - 1) + (1/4)(\hat{\lambda}_{40} + \hat{\lambda}_{04} - 2) - (\hat{\lambda}_{31}/r_{xy} + \hat{\lambda}_{13}/r_{xy} - 2) ]
= ((1-0.16)/8)(0.8620)^2 [ (1.59536/0.8620^2 - 1) + (1/2)(1.59536 - 1) + (1/4)(2.04457 + 1.84655 - 2) - (1.75708/0.8620 + 1.56457/0.8620 - 2) ]
= ((1-0.16)/8)(0.8620)^2 [ 1.14706 + 0.29768 + 0.47278 - 1.85342 ] = 0.005001.


The (1-\alpha)100% confidence interval for the correlation coefficient \rho_{xy} is given by

r_{xy} \pm t_{\alpha/2}(df = n - 2) \sqrt{\hat{MSE}(r_{xy})}.

Using Table 2 from the Appendix the 80% confidence interval of the correlation coefficient is given by

0.8620 \pm t_{0.20/2}(df = 8 - 2) \sqrt{0.005001}, or 0.8620 \pm t_{0.10}(df = 6) \sqrt{0.005001},

that is, 0.8620 \pm 1.440 \sqrt{0.005001}, or [0.7601, 0.9638].

Again note the use of (n - 2) degrees of freedom while constructing the confidence interval estimate of the population correlation coefficient.
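The whole of Example 3.5.2 can be reproduced from the raw sample values. The sketch below (not part of the book) recomputes the moment estimates, r_xy, the estimated MSE, and the 80% confidence interval, with t_{0.10}(6) = 1.440 taken from the t-table:

```python
# Reproducing Example 3.5.2 from the raw sample data.
y = [7.590, 639.571, 1579.686, 825.748, 1248.761, 2327.025, 553.266, 6.044]
x = [56.471, 494.730, 1519.994, 464.516, 3520.361, 3909.738, 388.869, 0.471]
n, N = len(y), 50
f = n / N
ybar, xbar = sum(y) / n, sum(x) / n

def moment(r, s):
    """Sample central mixed moment mu_rs with divisor n - 1."""
    return sum((yi - ybar)**r * (xi - xbar)**s for yi, xi in zip(y, x)) / (n - 1)

m20, m02, m11 = moment(2, 0), moment(0, 2), moment(1, 1)
m40, m04 = moment(4, 0), moment(0, 4)
m22, m31, m13 = moment(2, 2), moment(3, 1), moment(1, 3)

r_xy = m11 / (m20 * m02)**0.5
l40, l04 = m40 / m20**2, m04 / m02**2
l22 = m22 / (m20 * m02)
l31 = m31 / (m20**1.5 * m02**0.5)
l13 = m13 / (m20**0.5 * m02**1.5)

# Estimated MSE (3.5.5 with sample lambda-hats) and 80% CI with df = n - 2
mse = ((1 - f) / n) * r_xy**2 * ((l22 / r_xy**2 - 1) + 0.5 * (l22 - 1)
                                 + 0.25 * (l40 + l04 - 2)
                                 - (l31 / r_xy + l13 / r_xy - 2))
t = 1.440
ci = (r_xy - t * mse**0.5, r_xy + t * mse**0.5)
print(round(r_xy, 4), round(mse, 6), [round(v, 4) for v in ci])
```

The output matches the hand computation: r_xy ≈ 0.8620, MSE-hat ≈ 0.0050, and an 80% interval of roughly [0.760, 0.964].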

3.6 SUPERPOPULATION MODEL APPROACH

We shall first introduce the concept of a superpopulation model and its role in survey sampling. Estimation of the population mean from a sample is equivalent to predicting the mean of the non-sampled values of the study variable. Thus, in model-based sampling theory, the problem of estimating finite population parameters can be expressed as a prediction problem. Model-based strategies have been shown to bridge the gap between finite population problems and the rest of statistics. We discuss here two types of superpopulation models, for the regression and the ratio estimator of the population mean.

3.6.1 RELATIONSHIP BETWEEN LINEAR MODEL AND REGRESSION ESTIMATOR

Under a superpopulation model, the relation between a study variable Y_i and an auxiliary variable X_i is

Y_i = \alpha + \beta X_i + \varepsilon_i,   (3.6.1.1)

where \varepsilon_i is a random variable such that E(\varepsilon_i) = 0, E(\varepsilon_i^2) = \sigma^2, and E(\varepsilon_i \varepsilon_j) = 0 for i \ne j. In short, under the superpopulation model the population itself becomes a large random sample. Royall (1970a, 1970b, 1970c) was the first to show that the linear regression estimator of the population mean, defined as

\bar{y}_{LR} = \bar{y} + b(\bar{X} - \bar{x}),   (3.6.1.2)

is the best linear unbiased predictor of the population mean \bar{Y}. Following Brewer (1963a), Royall (1970a, 1970b, 1970c), and Scott and Smith (1969), by imposing a superpopulation model on the actual finite population, inference about the characteristics of the finite population can be made via the structure of the model. In this section we shall only introduce the idea of a superpopulation model under the SRSWOR design. Thus we have the following theorems:

Theorem 3.6.1.1. Under the superpopulation model (3.6.1.1) the leading term of the variance of the linear regression estimator \bar{y}_{LR} is given by

V(\bar{y}_{LR}) = ((1-f)/n) (1/(N-1)) \sum_{i=1}^{N} \varepsilon_i^2,   (3.6.1.3)

where

\varepsilon_i = (Y_i - \bar{Y}) - \beta(X_i - \bar{X})   (3.6.1.4)

denotes the residual of Y_i from the fitted value \bar{Y} + \beta(X_i - \bar{X}).

Proof. We have

Y_i = \alpha + \beta X_i + \varepsilon_i for i = 1, 2, ..., N.   (3.6.1.5)

Taking the sum on both sides of (3.6.1.5) and dividing by N we obtain

(1/N) \sum_{i=1}^{N} Y_i = \alpha + \beta (1/N) \sum_{i=1}^{N} X_i,

which in fact implies that

\alpha = \bar{Y} - \beta \bar{X}.   (3.6.1.6)

On substituting the value of \alpha in (3.6.1.5) and solving for \varepsilon_i, we obtain (3.6.1.4). Following the same steps as for the MSE in Theorem 3.2.3.2, we can easily see that the variance, up to terms of O(n^{-1}), of the usual linear regression estimator is

V(\bar{y}_{LR}) = ((1-f)/n) [ S_y^2 + \beta^2 S_x^2 - 2\beta S_{xy} ]
= ((1-f)/n) (1/(N-1)) [ \sum_{i=1}^{N}(Y_i - \bar{Y})^2 + \beta^2 \sum_{i=1}^{N}(X_i - \bar{X})^2 - 2\beta \sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}) ]
= ((1-f)/n) (1/(N-1)) \sum_{i=1}^{N} [ (Y_i - \bar{Y}) - \beta(X_i - \bar{X}) ]^2
= ((1-f)/n) (1/(N-1)) \sum_{i=1}^{N} \varepsilon_i^2.

Hence the theorem.

Theorem 3.6.1.2. An unbiased estimator of the approximate variance of the linear regression estimator is

\hat{V}(\bar{y}_{LR}) = ((1-f)/n) (1/(n-2)) \sum_{i=1}^{n} e_i^2,   (3.6.1.7)

where e_i = (y_i - \bar{y}) - b(x_i - \bar{x}) and b = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) / \sum_{i=1}^{n}(x_i - \bar{x})^2.

Proof. It follows from the well known Gauss-Markov theorem. We have

y_i = \alpha + \beta x_i + \epsilon_i   (3.6.1.8)
and
\bar{y} = \alpha + \beta \bar{x} + \bar{\epsilon},   (3.6.1.9)

where E(\epsilon_i) = 0, E(\epsilon_i^2) = \sigma^2, E(\epsilon_i \epsilon_j) = 0 for i \ne j, and \epsilon_i and x_i are independent. By subtracting (3.6.1.9) from (3.6.1.8), we obtain

y_i - \bar{y} = \beta(x_i - \bar{x}) + (\epsilon_i - \bar{\epsilon}),   (3.6.1.10)
or
y_i^* = \beta x_i^* + (\epsilon_i - \bar{\epsilon}),   (3.6.1.11)

where y_i^* = y_i - \bar{y} and x_i^* = x_i - \bar{x}. Remember that E(\bar{\epsilon}) = 0 but \bar{\epsilon} \ne 0. In other words, in taking a very large number of samples we expect the \epsilon_i to have a mean value of zero by assumption, but in any particular sample \bar{\epsilon} is not necessarily zero. Similarly, for the fitted values we have

\hat{y}_i^* = b x_i^*.   (3.6.1.12)

One can easily observe that the residual is

e_i = y_i^* - \hat{y}_i^* = (\epsilon_i - \bar{\epsilon}) - (b - \beta) x_i^*.

Thus we have

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} [ (\epsilon_i - \bar{\epsilon}) - (b - \beta) x_i^* ]^2
= \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})^2 + (b - \beta)^2 \sum_{i=1}^{n} x_i^{*2} - 2 \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})(b - \beta) x_i^*.

Taking expected values on both sides, we have

E( \sum_{i=1}^{n} e_i^2 ) = E \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})^2 + E[ (b - \beta)^2 \sum_{i=1}^{n} x_i^{*2} ] - 2 E[ \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})(b - \beta) x_i^* ].   (3.6.1.13)

Now we have

E \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})^2 = \sum_{i=1}^{n} E(\epsilon_i^2) - (1/n) E( \sum_{i=1}^{n} \epsilon_i )^2 = n\sigma^2 - (1/n)( n\sigma^2 + 0 ) = (n-1)\sigma^2.   (3.6.1.14)

Also, b - \beta = \sum_{i=1}^{n} w_i \epsilon_i with w_i = x_i^* / \sum_{i=1}^{n} x_i^{*2}, so that

E[ (b - \beta)^2 \sum_{i=1}^{n} x_i^{*2} ] = \sum_{i=1}^{n} x_i^{*2} E(b - \beta)^2 = \sum_{i=1}^{n} x_i^{*2} \times \sigma^2 / \sum_{i=1}^{n} x_i^{*2} = \sigma^2,   (3.6.1.15)

and, using \sum_{i=1}^{n} x_i^* = 0,

E[ \sum_{i=1}^{n} (\epsilon_i - \bar{\epsilon})(b - \beta) x_i^* ] = [ \sum_{i=1}^{n} x_i^{*2} E(\epsilon_i^2) + \sum_{i \ne j} x_i^* x_j^* E(\epsilon_i \epsilon_j) ] / \sum_{i=1}^{n} x_i^{*2} = ( \sigma^2 \sum_{i=1}^{n} x_i^{*2} + 0 ) / \sum_{i=1}^{n} x_i^{*2} = \sigma^2.   (3.6.1.16)

Using these expected values in (3.6.1.13) we have

E( \sum_{i=1}^{n} e_i^2 ) = (n-1)\sigma^2 + \sigma^2 - 2\sigma^2 = (n-2)\sigma^2,

which implies that

\sigma^2 = (1/(n-2)) E( \sum_{i=1}^{n} e_i^2 ).   (3.6.1.17)

By the method of moments an unbiased estimator of \sigma^2 is given by

\hat{\sigma}^2 = (1/(n-2)) \sum_{i=1}^{n} e_i^2.   (3.6.1.18)

Under SRSWOR sampling we have

\sigma^2 = (1/(n-2)) E( \sum_{i=1}^{n} e_i^2 ) = (1/(N-1)) \sum_{i=1}^{N} e_i^2.   (3.6.1.19)

Therefore

V(\bar{y}_{LR}) = ((1-f)/n) (1/(N-1)) \sum_{i=1}^{N} e_i^2 = ((1-f)/n) \sigma^2.   (3.6.1.20)

Hence an unbiased estimator of V(\bar{y}_{LR}) is given by

\hat{V}(\bar{y}_{LR}) = ((1-f)/n) (1/(n-2)) \sum_{i=1}^{n} e_i^2.   (3.6.1.21)

Hence the theorem.
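The key step E(\sum e_i^2) = (n-2)\sigma^2 can be checked by simulation. The following Monte Carlo sketch (not from the book; the model parameters are arbitrary) fits least squares to repeated samples generated from the model (3.6.1.8):

```python
# Monte Carlo check of E(sum e_i^2) = (n - 2) * sigma^2: the residual sum
# of squares from a least-squares fit, averaged over many simulated
# samples, should be close to (n - 2) * sigma^2.
import random

random.seed(1)
alpha, beta, sigma, n, reps = 2.0, 0.5, 3.0, 10, 4000
total = 0.0
for _ in range(reps):
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar)**2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx  # least-squares slope for this sample
    total += sum((yi - ybar - b * (xi - xbar))**2 for xi, yi in zip(x, y))
mean_sse = total / reps
print(mean_sse, (n - 2) * sigma**2)  # the two numbers should nearly agree
```

With n = 10 and sigma = 3 the target value is (n-2)sigma^2 = 72, and the simulated average falls within simulation error of it.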

Deng and Wu (1987) proposed an improved estimator of V(\bar{y}_{LR}) as follows:

\hat{v}(\bar{y}_{LR})_{DW} = ((1-f)/n) (1/(n-2)) \sum_{i=1}^{n} e_i^2 (\bar{x}/\bar{X})^g.   (3.6.2.1)
Now we have

(1/(n-2)) \sum_{i=1}^{n} e_i^2 = (1/n) \sum_{i=1}^{n} e_i^2 (1 - 2/n)^{-1} ≈ (1/n) \sum_{i=1}^{n} z_i (1 + 2/n + 4/n^2 + ...),   (3.6.2.2)

where z_i = e_i^2, i = 1, 2, ..., n, and

(\bar{x}/\bar{X})^g = [1 + (\bar{x}/\bar{X} - 1)]^g ≈ 1 + g(\bar{x}/\bar{X} - 1) + (g(g-1)/2)(\bar{x}/\bar{X} - 1)^2 + ... .   (3.6.2.3)

Using (3.6.2.2) and (3.6.2.3) the estimator (3.6.2.1) can be written as

\hat{v}(\bar{y}_{LR})_{DW} ≈ ((1-f)/n) \bar{z} [ 1 + g(\bar{x}/\bar{X} - 1) + (g(g-1)/2)(\bar{x}/\bar{X} - 1)^2 + ... ].   (3.6.2.4)

Defining

\varepsilon_1 = \bar{x}/\bar{X} - 1 and \delta_1 = \bar{z}/\bar{Z} - 1,

where \bar{Z} = (1/N) \sum_{i=1}^{N} Z_i for Z_i = \varepsilon_i^2, we have

E(\varepsilon_1^2) = ((1-f)/n) C_x^2, E(\delta_1^2) = ((1-f)/n) C_z^2, and E(\varepsilon_1 \delta_1) = ((1-f)/n) \rho_{xz} C_x C_z,

where

S_z^2 = (N-1)^{-1} \sum_{i=1}^{N} (Z_i - \bar{Z})^2, C_z = S_z/\bar{Z}, and \rho_{xz} = S_{xz}/(S_x S_z).

Thus we have the following theorem.

Theorem 3.6.2.1. The minimum mean squared error of the estimator of the approximate variance of the usual regression estimator is given by

Min.MSE[\hat{v}(\bar{y}_{LR})_{DW}] = ((1-f)/n)^3 S_z^2 [1 - \rho_{xz}^2].   (3.6.2.5)

Proof. By the definition of mean squared error we have

MSE[\hat{v}(\bar{y}_{LR})_{DW}] = E[ \hat{v}(\bar{y}_{LR})_{DW} - V(\bar{y}_{LR}) ]^2
≈ E[ ((1-f)/n) \bar{z}( 1 + g\varepsilon_1 + (g(g-1)/2)\varepsilon_1^2 + ... ) - ((1-f)/n) \bar{Z} ]^2
≈ ((1-f)/n)^2 \bar{Z}^2 E[ \delta_1 + g\varepsilon_1 + 2g\varepsilon_1\delta_1 + ... ]^2
= ((1-f)/n)^3 \bar{Z}^2 [ C_z^2 + g^2 C_x^2 + 2g \rho_{xz} C_x C_z ].   (3.6.2.6)

On differentiating (3.6.2.6) with respect to g and equating to zero we obtain

g = -\rho_{xz} C_z / C_x.   (3.6.2.7)

On substituting (3.6.2.7) in (3.6.2.6) we obtain (3.6.2.5), since \bar{Z}^2 C_z^2 = S_z^2. Hence the theorem.



Example 3.6.2.1. Apply the regression method of estimation for estimating the average amount of the real estate farm loans (in $000) during 1997. Also find an estimator of the variance of the regression estimator, assuming that the relationship between Y and X is given by y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. An SRSWOR sample of eight states selected from population 1 of the Appendix is given below.

State                             AK      CA        CT      ME       NE        NY       VA       WI
Nonreal estate farm loans (X)$    3.433   3928.732  4.373   51.539   3585.406  426.274  188.477  1372.439
Real estate farm loans (Y)$       2.605   1343.461  7.130   8.849    1337.852  201.631  321.583  1229.572

The average amount \bar{X} = $878.16 of nonreal estate farm loans (in $000) for the year 1997 is known.
( a ) Derive the 95% confidence interval using the unbiased estimator of variance under the linear model.
( b ) Derive the 95% confidence interval using the estimator proposed by Deng and Wu (1987).
( c ) Compare the confidence interval estimates in ( a ) and ( b ), and comment.
Given: Z_i = \varepsilon_i^2, C_x^2 = 1.5256, C_z^2 = 1.5097, and \rho_{xz} = 0.4762.

Solution. From the sample information we have

i     y_i       x_i        (y_i - \bar{y})^2   (x_i - \bar{x})^2   (y_i - \bar{y})(x_i - \bar{x})
1     2.605     3.433      306894.26     1420032.4     660151.3371
2     1343.461  3928.732   619173.25     7472830.7     2151040.8800
3     7.130     4.373      301901.21     1417793.0     654242.6277
4     8.849     51.539     300015.14     1307695.5     626361.2614
5     1337.852  3585.406   610377.54     5713638.7     1867478.7040
6     201.631   426.274    125992.61     591069.0      272892.5174
7     321.583   188.477    55226.12      1013257.9     236555.0651
8     1229.572  1372.439   452911.00     31454.7       119357.4588
Sum   4452.683  9560.673   2772491.10    18967772.0    6588079.8520

From the above table we have

n = 8, \bar{y} = 556.585, \bar{x} = 1195.084, s_x^2 = 2709682, s_y^2 = 396070.2, s_{xy} = 941154.3,
\hat{\beta} = s_{xy}/s_x^2 = 0.3473, r_{xy} = s_{xy}/(s_x s_y) = 0.9085,
e_i = y_i - 141.496 - 0.3473 x_i, and \sum_{i=1}^{8} e_i^2 = 484252.1. Also \bar{X} = 878.16, N = 50, and f = 0.16.

Thus the regression estimate of the average amount of the real estate farm loans during 1997, \bar{Y} (say), is given by

\bar{y}_{LR} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x}) = 556.585 + 0.3473 \times (878.16 - 1195.084) = 446.517.

( a ) Usual estimator of variance: The usual estimate of V(\bar{y}_{LR}) under the superpopulation model is given by

\hat{v}(\bar{y}_{LR}) = ((1-f)/n) (1/(n-2)) \sum_{i=1}^{8} e_i^2 = ((1-0.16)/8) \times (1/(8-2)) \times 484252.1 = 8474.41.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_{LR} \pm t_{\alpha/2}(df = n - 2) \sqrt{\hat{v}(\bar{y}_{LR})}.

Using Table 2 from the Appendix the 95% confidence interval is given by

446.517 \pm 2.447 \sqrt{8474.41}, or [221.255, 671.779].

( b ) Deng and Wu's estimator: Given C_x^2 = 1.5256, C_z^2 = 1.5097, and \rho_{xz} = 0.4762, we have

g = -\rho_{xz} C_z/C_x = -0.4762 \times \sqrt{1.5097/1.5256} = -0.4737.

The Deng and Wu (1987) estimate of V(\bar{y}_{LR}) under the superpopulation model is then given by

\hat{v}(\bar{y}_{LR})_{DW} = ((1-f)/n) (1/(n-2)) \sum_{i=1}^{8} e_i^2 (\bar{x}/\bar{X})^g
= ((1-0.16)/8) \times (1/6) \times 484252.1 \times (1195.084/878.16)^{-0.4737} = 7323.475.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_{LR} \pm t_{\alpha/2}(df = n - 2) \sqrt{\hat{v}(\bar{y}_{LR})_{DW}}.

Using Table 2 from the Appendix the 95% confidence interval is given by

446.517 \pm 2.447 \sqrt{7323.475}, or [237.109, 655.924].

( c ) Interpretation: Note that in this case the length of the confidence interval from the usual estimator is greater than that from Deng and Wu (1987), which suggests that their estimator performs better than the usual estimator of variance in this situation.
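Example 3.6.2.1 can be verified directly from the data. This sketch (not part of the book) recomputes the regression estimate, the usual variance estimate, and the Deng and Wu (1987) adjustment with g = -0.4737 as given in the text:

```python
# Reproducing Example 3.6.2.1: regression estimate, usual variance
# estimate, and the Deng-Wu adjusted variance estimate.
x = [3.433, 3928.732, 4.373, 51.539, 3585.406, 426.274, 188.477, 1372.439]
y = [2.605, 1343.461, 7.130, 8.849, 1337.852, 201.631, 321.583, 1229.572]
n, N, Xbar = len(x), 50, 878.16
f = n / N
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar)**2 for xi in x) / (n - 1)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
beta = sxy / sxx                       # least-squares slope
y_lr = ybar + beta * (Xbar - xbar)     # regression estimate of Ybar
sse = sum((yi - ybar - beta * (xi - xbar))**2 for xi, yi in zip(x, y))
v_usual = (1 - f) / n * sse / (n - 2)  # estimator (3.6.1.7)
g = -0.4737                            # value computed in the text
v_dw = v_usual * (xbar / Xbar)**g      # Deng-Wu estimator (3.6.2.1)
print(round(y_lr, 3), round(v_usual, 2), round(v_dw, 2))
```

Up to the rounding of the slope used in the hand computation, this gives 446.5, 8474.4, and 7323.5, in agreement with the solution above.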

Deng and Wu (1987) have shown that the estimator in (3.6.2.1) remains better than the estimators proposed by Royall and Eberhardt (1975), Royall and Cumberland (1978, 1981a, 1981b, 1985), and Rao (1968a, 1969). It is to be noted that Deng and Wu (1987) have taken (n-2) in the denominator of the usual estimator of the variance of the regression estimator, which comes from the Gauss-Markov theorem discussed above. Following Deville and Sarndal (1992) we have the following theorem.

Theorem 3.6.2.2. An estimator for estimating the approximate variance of the usual linear regression estimator is given by

\hat{v}(\bar{y}_{LR})_{DS} = ((1-f)/n) (1/(n-1)) \sum_{i=1}^{n} e_i^2,   (3.6.2.8)

where e_i = y_i - \hat{\beta} x_i with \hat{\beta} = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2. This technique is called a model assisted, not a model based, technique.

Proof. Following Deville and Sarndal (1992), the regression estimator can be written as

\bar{y}_{LR} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x}) = (1/n) \sum_{i=1}^{n} (y_i - \hat{\beta} x_i) + \hat{\beta}\bar{X} = (1/n) \sum_{i=1}^{n} e_i + \hat{\beta}\bar{X} = \bar{e} + \hat{\beta}\bar{X}.   (3.6.2.9)

Thus the variance of the linear regression estimator can be approximated as

V(\bar{y}_{LR}) = V(\bar{e} + \hat{\beta}\bar{X}) ≈ V(\bar{e}) = ((1-f)/n) (1/(N-1)) \sum_{i=1}^{N} e_i^2,   (3.6.2.10)

where Y_i = \beta X_i + e_i is the true superpopulation model passing through the origin. By the sample analogue, an estimator of the variance V(\bar{y}_{LR}) given in (3.6.2.10) is given by (3.6.2.8). Hence the theorem.

Remark 3.6.2.1. In the above theorem, if we estimate \beta = \bar{Y}/\bar{X} with \hat{\beta} = \bar{y}/\bar{x}, then the results are true for the usual ratio estimator, and if we estimate it with \hat{\beta} = -\bar{y}/\bar{x}, then the results are true for the product estimator.

3.6.3 RELATIONSHIP BETWEEN LINEAR MODEL AND RATIO ESTIMATOR

Consider a finite population of N units taken from a superpopulation that can be described by the linear model

Y_i = R X_i + \varepsilon_i, i = 1, 2, ..., N,   (3.6.3.1)

where E(\varepsilon_i) = 0, E(\varepsilon_i^2) = \sigma^2, and E(\varepsilon_i \varepsilon_j) = 0, i \ne j. Clearly this model is a special case of (3.6.1.1) with zero intercept and \beta = R. A sample analogue of the above model is given by

y_i = r x_i + e_i, i = 1, 2, ..., n,   (3.6.3.2)

where E(e_i) = 0, E(e_i^2) = \sigma^2, and E(e_i e_j) = 0, i \ne j. The value of R from the model (3.6.3.1) and its estimate from (3.6.3.2) can be obtained under different assumptions:

Method I. From the model (3.6.3.1), on setting E(\varepsilon_i) = 0 we obtain

\sum_{i=1}^{N} \varepsilon_i = 0.   (3.6.3.3)

From (3.6.3.1) we have

\varepsilon_i = Y_i - R X_i.   (3.6.3.4)

On using (3.6.3.4) in (3.6.3.3) we obtain

\sum_{i=1}^{N} (Y_i - R X_i) = 0, or \sum_{i=1}^{N} Y_i - R \sum_{i=1}^{N} X_i = 0, or (1/N) \sum_{i=1}^{N} Y_i - R (1/N) \sum_{i=1}^{N} X_i = 0, or \bar{Y} - R\bar{X} = 0, or

R = \bar{Y}/\bar{X},

which is a ratio of two population means. Using the sample analogue model (3.6.3.2) and setting E(e_i) = 0 we obtain

r = \bar{y}/\bar{x}.   (3.6.3.5)

If we multiply r in (3.6.3.5) by \bar{X}, then we obtain the traditional ratio estimator of the population mean as

\bar{y}_R = r\bar{X} = (\bar{y}/\bar{x})\bar{X}.   (3.6.3.6)

Method II. Again from the model (3.6.3.1), the sum of squares due to error (SSE) is

SSE = \sum_{i=1}^{N} (Y_i - R X_i)^2.   (3.6.3.7)

On setting \partial SSE / \partial R = 0 we obtain \sum_{i=1}^{N} (Y_i - R X_i) X_i = 0, and it gives us

R = \sum_{i=1}^{N} Y_i X_i / \sum_{i=1}^{N} X_i^2.   (3.6.3.8)

Note that we are not interested in this ratio, but the sample analogue of (3.6.3.8) gives us

r = \sum_{i=1}^{n} y_i x_i / \sum_{i=1}^{n} x_i^2.   (3.6.3.9)

If we multiply r in (3.6.3.9) by the known population mean \bar{X}, we obtain a new ratio type estimator of the population mean

\bar{y}_R = \bar{X} \sum_{i=1}^{n} y_i x_i / \sum_{i=1}^{n} x_i^2.   (3.6.3.10)

Note that \sum_{i=1}^{n} y_i x_i = n\bar{x}\bar{y} + (n-1)s_{xy} and \sum_{i=1}^{n} x_i^2 = n\bar{x}^2 + (n-1)s_x^2; then (3.6.3.10) becomes

\bar{y}_R = (\bar{y}/\bar{x}) \bar{X} [ 1 + ((n-1)/n) s_{xy}/(\bar{x}\bar{y}) ] / [ 1 + ((n-1)/n) s_x^2/\bar{x}^2 ],   (3.6.3.11)

which is called the Beale (1962) ratio estimator of the population mean \bar{Y}.

Method III. The weighted sum of squares of errors (WSSE) is given by

WSSE = \sum_{i=1}^{N} (1/X_i) \varepsilon_i^2 = \sum_{i=1}^{N} (1/X_i) (Y_i - R X_i)^2.   (3.6.3.12)

On setting \partial WSSE / \partial R = 0 we obtain the ratio as

R = \bar{Y}/\bar{X},

which is again a ratio of two population means, and its sample analogue will again lead to the traditional ratio estimator of the population mean. An interesting discussion on imposing linear structures in conventional sampling theory and related work is also available in Pokropp (2001).
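Methods I and II give genuinely different estimators. As an illustration (not in the book), applying both to the sample of Example 3.6.2.1 and checking the Beale-form identity (3.6.3.11) numerically:

```python
# Method I (traditional ratio) versus Method II (least-squares ratio,
# equation 3.6.3.10) on the Example 3.6.2.1 sample.
x = [3.433, 3928.732, 4.373, 51.539, 3585.406, 426.274, 188.477, 1372.439]
y = [2.605, 1343.461, 7.130, 8.849, 1337.852, 201.631, 321.583, 1229.572]
n, Xbar = len(x), 878.16
xbar, ybar = sum(x) / n, sum(y) / n

r1 = ybar / xbar                                                    # Method I
r2 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)  # Method II
est1, est2 = r1 * Xbar, r2 * Xbar

# Check that r2 equals the Beale form (3.6.3.11) rewritten in terms of
# sample variances and covariances.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
sxx = sum((xi - xbar)**2 for xi in x) / (n - 1)
r2_check = (ybar / xbar) * (1 + (n - 1) / n * sxy / (xbar * ybar)) \
           / (1 + (n - 1) / n * sxx / xbar**2)
print(round(est1, 2), round(est2, 2))
```

On this sample the two estimators differ noticeably (about 409 versus 344), and the identity between (3.6.3.10) and (3.6.3.11) holds to machine precision.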

3.7 JACKKNIFE VARIANCE ESTIMATOR

Quenouille's (1951) method of bias reduction, popularly known as the Jackknife procedure, has been successfully applied to estimating the variance of estimators. We shall discuss the idea of the Jackknife variance estimator for the ratio and regression estimators of the population mean, although it can be used to estimate the variance of any linear or non-linear estimator \hat{\theta} of a parameter \theta.

3.7.1 RATIO ESTIMATOR

Let (y_i, x_i), i = 1, 2, ..., n, denote a simple random sample of size n from a bivariate infinite population with means (\bar{Y}, \bar{X}). Let \bar{y} = n^{-1} \sum_{i=1}^{n} y_i and \bar{x} = n^{-1} \sum_{i=1}^{n} x_i be the sample means based on the n observations. Divide the sampled data into g groups, each having n_j = n/g, j = 1, 2, ..., g, observations. Let

\bar{y}_{(j)} = (1/(n - n_j)) \sum y_i and \bar{x}_{(j)} = (1/(n - n_j)) \sum x_i,

the sums extending over the (n - n_j) units remaining after dropping the j-th group from the sample, be the corresponding subsample means. From the subsample obtained by dropping the n_j units, a ratio type estimator to estimate the population mean \bar{Y} is given by

\bar{y}_{R(j)} = \bar{y}_{(j)} (\bar{X}/\bar{x}_{(j)}), for j = 1, 2, ..., g.   (3.7.1.1)

Let us define a new estimator of the population mean \bar{Y} as

\bar{y}_{Rg} = (1/g) \sum_{j=1}^{g} \bar{y}_{R(j)} = (1/g) \sum_{j=1}^{g} \bar{y}_{(j)} (\bar{X}/\bar{x}_{(j)}).   (3.7.1.2)

Then the usual Jackknife estimator of variance \hat{v}_u is given by

\hat{v}_u = ((g-1)/g) \sum_{j=1}^{g} [ \bar{y}_{R(j)} - \bar{y}_{Rg} ]^2.   (3.7.1.3)

Also, from the full sample information we have the usual ratio estimator of the population mean defined as

\bar{y}_R = \bar{y} (\bar{X}/\bar{x}),

where \bar{y} = n^{-1} \sum_{i=1}^{n} y_i and \bar{x} = n^{-1} \sum_{i=1}^{n} x_i are the sample means based on the full sample information.

A modified Jackknife estimator of variance is given by

\hat{v}_m = ((g-1)/g) \sum_{j=1}^{g} [ \bar{y}_{R(j)} - \bar{y}_R ]^2.   (3.7.1.4)

For g = n it reduces to the situation of dropping one unit at a time while making groups.

Example 3.7.1.1. People Bank took an SRSWOR sample of eight states from population 1 given in the Appendix and collected the following information:

State                             AR       NY       WA        NC       CA        MI       PA       SD
Nonreal estate farm loans (X)$    848.317  426.274  1228.607  494.730  3928.732  440.518  298.351  1692.817
Real estate farm loans (Y)$       907.700  201.631  1100.745  639.571  1343.461  323.028  756.169  413.777

( a ) Apply the ratio method of estimation for estimating the average amount of the real estate farm loans (in $000) during 1997.
( b ) Estimate the variance of the ratio estimator using the usual Jackknife estimator of variance and derive the 95% confidence interval estimate.
( c ) Estimate the variance of the ratio estimator using the modified Jackknife estimator of variance and derive the 95% confidence interval estimate.
( d ) Which estimate of variance gives the smaller confidence interval estimate?
Given: the known average amount \bar{X} = $878.16 of nonreal estate farm loans.

Solution. From the sample information we have

State   x_i       y_i
AR      848.317   907.700
NY      426.274   201.631
WA      1228.607  1100.745
NC      494.730   639.571
CA      3928.732  1343.461
MI      440.518   323.028
PA      298.351   756.169
SD      1692.817  413.777
Sum     9358.346  5686.082

Thus n = 8, \bar{y} = 710.7603, and \bar{x} = 1169.793. Note that \bar{X} = 878.16.


( a ) The ratio estimate of the average amount of the real estate farm loans during 1997, \bar{Y} (say), is given by

\bar{y}_R = \bar{y} (\bar{X}/\bar{x}) = 710.7603 \times (878.16/1169.793) = 533.5667.

Taking g = n we have the following Jackknife table:

x_j       y_j       \bar{x}_{(j)}  \bar{y}_{(j)}  \bar{y}_{R(j)}  (\bar{y}_{R(j)}-\bar{y}_{Rn})^2  (\bar{y}_{R(j)}-\bar{y}_R)^2
848.317   907.700   1215.718   682.6260   493.0869   2440.14489    1638.50906
426.274   201.631   1276.010   783.4930   539.2059   10.75108      31.81452
1228.607  1100.745  1161.391   655.0481   495.3000   2226.40154    1464.24376
494.730   639.571   1266.231   720.9301   499.9815   1806.52439    1127.87878
3928.732  1343.461  775.659    620.3744   702.3549   25558.47260   28489.89390
440.518   323.028   1273.975   766.1506   528.1128   206.55132     29.73054
298.351   756.169   1294.285   704.2733   477.8427   4178.59394    3105.02188
1692.817  413.777   1095.076   753.1864   603.9932   3783.29055    4960.07229
Sum                                       4339.8780  40210.73030   40847.16470

where

\bar{x}_{(j)} = (n\bar{x} - x_j)/(n-1), \bar{y}_{(j)} = (n\bar{y} - y_j)/(n-1), and \bar{y}_{R(j)} = \bar{y}_{(j)} \bar{X}/\bar{x}_{(j)}.

Thus the average of all the Jackknife ratio estimates is given by

\bar{y}_{Rn} = (1/n) \sum_{j=1}^{n} \bar{y}_{R(j)} = 4339.878/8 = 542.4847.
II j= 1 } 8

( b ) Usual Jackknife estimator of variance: The usual Jackknife estimator of variance \hat{v}_u is given by

\hat{v}_u = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{R(j)} - \bar{y}_{Rn} ]^2 = ((8-1)/8) \times 40210.7303 = 35184.38.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_R \pm t_{\alpha/2}(df = n - 1) \sqrt{\hat{v}_u}.

Using Table 2 from the Appendix the 95% confidence interval is given by

533.5667 \pm 2.365 \sqrt{35184.38}, or [89.9518, 977.1816].

( c ) Modified Jackknife estimator of variance: The modified Jackknife estimator of variance is given by

\hat{v}_m = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{R(j)} - \bar{y}_R ]^2 = ((8-1)/8) \times 40847.1647 = 35741.269.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_R \pm t_{\alpha/2}(df = n - 1) \sqrt{\hat{v}_m}.

Using Table 2 of the Appendix the 95% confidence interval is given by

533.5667 \pm 2.365 \sqrt{35741.269}, or [86.4549, 980.6785].

( d ) In this particular example the usual Jackknife estimator of the variance of the ratio estimate provides the smaller confidence interval estimate at the same level of confidence.
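The computations of Example 3.7.1.1 can be replicated in a few lines. This sketch (not part of the book) performs the drop-one (g = n) Jackknife for the ratio estimator:

```python
# Drop-one jackknife variance estimates for the ratio estimator,
# reproducing Example 3.7.1.1.
x = [848.317, 426.274, 1228.607, 494.730, 3928.732, 440.518, 298.351, 1692.817]
y = [907.700, 201.631, 1100.745, 639.571, 1343.461, 323.028, 756.169, 413.777]
n, Xbar = len(x), 878.16
xbar, ybar = sum(x) / n, sum(y) / n
y_R = ybar * Xbar / xbar              # full-sample ratio estimate

reps = []                             # jackknife replicates y_R(j)
for j in range(n):
    xj = (n * xbar - x[j]) / (n - 1)  # mean after dropping unit j
    yj = (n * ybar - y[j]) / (n - 1)
    reps.append(yj * Xbar / xj)
y_Rn = sum(reps) / n                  # average of the replicates
v_u = (n - 1) / n * sum((r - y_Rn)**2 for r in reps)  # usual (3.7.1.3)
v_m = (n - 1) / n * sum((r - y_R)**2 for r in reps)   # modified (3.7.1.4)
print(round(y_R, 4), round(v_u, 2), round(v_m, 2))
```

The results agree with the table above: roughly 533.57, 35184, and 35741.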

3.7.2 REGRESSION ESTIMATOR

Assume that (y_i, x_i), i = 1, 2, ..., n, denotes a simple random sample of size n from a bivariate infinite population with means (\bar{Y}, \bar{X}). Let \bar{y} = n^{-1} \sum_{i=1}^{n} y_i and \bar{x} = n^{-1} \sum_{i=1}^{n} x_i be the sample means based on the n observations, and let b = s_{xy}/s_x^2 be an estimator of the regression coefficient. Then the linear regression estimator of the population mean \bar{Y} is

\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}).   (3.7.2.1)

Divide the sampled data into n sub-samples, each having (n-1) observations. Let

\bar{y}_{(j)} = (n-1)^{-1} \sum_{i \ne j} y_i = (n\bar{y} - y_j)/(n-1) and \bar{x}_{(j)} = (n-1)^{-1} \sum_{i \ne j} x_i = (n\bar{x} - x_j)/(n-1)

be the sample means after dropping the j-th unit from the sample. From the sub-sample obtained by dropping one unit, a regression estimator to estimate the population mean \bar{Y} is given by

\bar{y}_{lr(j)} = \bar{y}_{(j)} + b_{(j)}(\bar{X} - \bar{x}_{(j)}),   (3.7.2.2)

where b_{(j)} denotes the value of b computed from the (n-1) observations remaining after the j-th unit is dropped. Let us define a new regression estimator of the population mean \bar{Y} as

\bar{y}_{lrn} = (1/n) \sum_{j=1}^{n} \bar{y}_{lr(j)} = (1/n) \sum_{j=1}^{n} [ \bar{y}_{(j)} + b_{(j)}(\bar{X} - \bar{x}_{(j)}) ].   (3.7.2.4)

Then the usual Jackknife estimator of variance \hat{v}_u(\bar{y}_{lr}) is given by

\hat{v}_u(\bar{y}_{lr}) = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{lr(j)} - \bar{y}_{lrn} ]^2.   (3.7.2.5)

A modified Jackknife estimator of variance is given by

\hat{v}_m(\bar{y}_{lr}) = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{lr(j)} - \bar{y}_{lr} ]^2.   (3.7.2.6)

For details see Miller (1974), Rao (1969, 1974, 1979), Rao and Rao (1971), and Krewski and Chakrabarty (1981).

Example 3.7.2.1. A key bank took an SRSWOR sample of eight states from population 1 given in the Appendix and collected the following information:

State                             CA        FL       MO        IN        NJ      MA      OK        ME
Nonreal estate farm loans (X)$    3928.732  464.516  1519.994  1022.782  27.508  56.471  1716.087  51.539
Real estate farm loans (Y)$       1343.461  825.748  1579.686  1213.024  39.860  7.590   612.108   8.849

A statistician advised them to apply the regression method of estimation for estimating the average amount of the real estate farm loans.
( a ) Apply the regression method of estimation for estimating the average amount of the real estate farm loans (in $000) during 1997.
( b ) Estimate the variance of the regression estimator using the usual Jackknife estimator of variance and derive the 95% confidence interval estimate.
( c ) Estimate the variance of the regression estimator using the modified Jackknife estimator of variance and derive the 95% confidence interval estimate.
( d ) Which confidence interval estimate is smaller?
Given: the known average amount \bar{X} = $878.16 of nonreal estate farm loans.
Solution. Using the sample information we have

State   x_i       y_i       (x_i - \bar{x})^2   (y_i - \bar{y})^2   (y_i - \bar{y})(x_i - \bar{x})
CA      3928.732  1343.461  8010476.0    409178.0   1810444.900
FL      464.516   825.748   401876.9     14873.6    -77313.289
MO      1519.994  1579.686  177696.3     767192.5   369225.210
IN      1022.782  1213.024  5726.2       259318.5   -38534.508
NJ      27.508    39.860    1146925.0    440804.0   711033.730
MA      56.471    7.590     1085728.0    484695.5   725429.090
OK      1716.087  612.108   381471.0     8405.7     -56626.326
ME      51.539    8.849     1096030.0    482944.0   727544.680
Sum     8787.629  5630.326  12305929.0   2867412.0  4171203.500

From the above table, n = 8, \bar{y} = 703.7908, \bar{x} = 1098.454, s_x^2 = 1757989.86, s_y^2 = 409630.28, s_{xy} = 595886.21, and b = s_{xy}/s_x^2 = 0.3389. Given \bar{X} = 878.16, N = 50, and f = 0.16.

( a ) The regression estimate of the average amount of the real estate farm loans is

\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) = 703.7908 + 0.3389 \times (878.16 - 1098.454) = 629.133.

Dropping one unit at a time gives the following Jackknife table:

\bar{x}_{(j)}  \bar{y}_{(j)}  w_j       d_j       b_{(j)}   \bar{y}_{lr(j)}  (\bar{y}_{lr(j)}-\bar{y}_{lrn})^2  (\bar{y}_{lr(j)}-\bar{y}_{lr})^2
694.128   612.4093  0.775944  -319.511  0.666878  735.1361  8855.95961   11236.6590
1189.016  686.3683  0.157657  336.798   0.359497  574.6163  4410.78061   2972.0715
1038.234  578.6629  0.139440  733.035   0.309721  529.0847  12531.75270  10009.6650
1109.264  631.0431  0.125465  534.878   0.342661  551.8529  7952.55919   5972.2173
1251.446  798.6380  0.218201  -300.987  0.305395  684.6383  1901.68053   3080.8341
1247.308  803.2480  0.213228  -343.073  0.301978  691.7733  2574.88106   3923.8054
1010.220  716.8883  0.155999  -300.999  0.356799  669.7693  825.94493    1651.3054
1248.013  803.0681  0.214065  -340.142  0.302081  691.3426  2531.35713   3870.0336
Sum                                               5128.2130 41584.91570  42716.5910

Here w_j = 1/n + (x_j - \bar{x})^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2 and d_j = (y_j - \bar{y}) - b(x_j - \bar{x}) are intermediate quantities used in updating the regression coefficient when the j-th unit is dropped, via b_{(j)} = b - d_j(x_j - \bar{x}) / [ (1 - w_j) \sum_{i=1}^{n}(x_i - \bar{x})^2 ].

( b ) Usual Jackknife estimator of variance: The usual Jackknife estimator of the variance \hat{v}_u(\bar{y}_{lr}) is given by

\hat{v}_u(\bar{y}_{lr}) = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{lr(j)} - \bar{y}_{lrn} ]^2 = ((8-1)/8) \times 41584.9157 = 36386.80,

where

\bar{y}_{lrn} = (1/n) \sum_{j=1}^{n} \bar{y}_{lr(j)} = (1/n) \sum_{j=1}^{n} [ \bar{y}_{(j)} + b_{(j)}(\bar{X} - \bar{x}_{(j)}) ] = 5128.213/8 = 641.03.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_{lr} \pm t_{\alpha/2}(df = n - 2) \sqrt{\hat{v}_u(\bar{y}_{lr})}.

Using Table 2 from the Appendix the 95% confidence interval is given by

629.133 \pm 2.447 \sqrt{36386.80}, or [162.36, 1095.90].

( c ) Modified Jackknife estimator of variance: The modified Jackknife estimator of variance is given by

\hat{v}_m(\bar{y}_{lr}) = ((n-1)/n) \sum_{j=1}^{n} [ \bar{y}_{lr(j)} - \bar{y}_{lr} ]^2 = ((8-1)/8) \times 42716.591 = 37377.02.

A (1-\alpha)100% confidence interval for the population mean \bar{Y} is given by

\bar{y}_{lr} \pm t_{\alpha/2}(df = n - 2) \sqrt{\hat{v}_m(\bar{y}_{lr})}.

Using Table 2 of the Appendix the 95% confidence interval for the average real estate farm loans is given by

629.133 \pm 2.447 \sqrt{37377.02}, or [156.05, 1102.21].

( d ) In this particular example, the usual Jackknife estimator of the variance of the regression estimate provides the smaller confidence interval estimate at the same level of confidence.

While using the known population mean of a single auxiliary variable at the estimation stage for the construction of ratio, product, and regression type estimators, a natural question arises of how to utilise the information if it is available on more than one auxiliary variable. The next section studies such situations.

3.8 ESTIMATION OF POPULATION MEAN USING MORE THAN ONE AUXILIARY VARIABLE

Let us first consider the case where the information on two auxiliary variables is
available. Suppose Yj , Xli and X 2i are respectively values of j''' unit of the study
variable Y and auxil iary variables Xl and X 2 from a finite population o. Let Y,
Xl and X2 be the population means of the study variable Y and auxiliary
variables Xl and X 2 . Let the auxiliary variables Xl and X 2 be correlated with Y
with correlation coefficients P YXI and P YX2 ' respectively. A simple random sample
of size n is drawn by SRSWOR sampling from the population n and let Y, Xl
and X2 denote the corresponding sample means.

Now define
- -
&0
Y
=-=-1,
XI
1]1 =~-I and 1]2
X2
=-=--1
Y XI X2

such that
230 Advanced sampling theory with applications

and

£(&6)= C ~/ )c;, £('7 ? )=c ~,f )c;l' £('7i)= C ~f )C;2 '


£(eO'7d=C ~,f )pyX\ CyCr\ , £(e0'72) =C ~f )pYX2CyC_<2'
£('71 '72 )=c ~/)PXIX2 C rt C r2 .
where

\rho_{yx_1} = S_{yx_1}/(S_yS_{x_1}), \quad \rho_{yx_2} = S_{yx_2}/(S_yS_{x_2}), \quad \rho_{x_1x_2} = S_{x_1x_2}/(S_{x_1}S_{x_2}), \quad C_y = S_y/\bar{Y},

C_{x_1} = S_{x_1}/\bar{X}_1, \quad C_{x_2} = S_{x_2}/\bar{X}_2, \quad S_y^2 = (N-1)^{-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2, \quad S_{x_1}^2 = (N-1)^{-1}\sum_{i=1}^{N}(X_{1i} - \bar{X}_1)^2,

S_{x_2}^2 = (N-1)^{-1}\sum_{i=1}^{N}(X_{2i} - \bar{X}_2)^2, \quad S_{yx_1} = (N-1)^{-1}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_{1i} - \bar{X}_1),

S_{yx_2} = (N-1)^{-1}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_{2i} - \bar{X}_2), \quad and \quad S_{x_1x_2} = (N-1)^{-1}\sum_{i=1}^{N}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)

have their usual meanings.

We have redefined these terms only to avoid any kind of confusion in learning the
estimation strategies with more than one auxiliary variable. Note that if we have p
auxiliary variables, say X_1, X_2, ..., X_p, and their population means \bar{X}_1, \bar{X}_2, ..., \bar{X}_p
are known, then the above results can easily be extended.

3.8.1 MULTIVARIATE RATIO ESTIMATOR

Olkin (1958) proposed a weighted ratio type estimator of population mean \bar{Y} as

\bar{y}_{Ra} = w\,\bar{y}_{R1} + (1-w)\,\bar{y}_{R2}    (3.8.1.1)

where \bar{y}_{Rj} = \bar{y}(\bar{X}_j/\bar{x}_j), for j = 1, 2, are the two usual ratio estimators of population
mean using different auxiliary variables at the estimation stage.

Then the estimator \bar{y}_{Ra}, in terms of \varepsilon_0, \eta_1 and \eta_2, can easily be expressed as:

\bar{y}_{Ra} = \bar{Y}(1 + \varepsilon_0 - \eta_2 + \eta_2^2 - \varepsilon_0\eta_2 + \cdots) + w\bar{Y}(\eta_2 - \eta_1 + \eta_1^2 - \eta_2^2 + \varepsilon_0\eta_2 - \varepsilon_0\eta_1 + \cdots).    (3.8.1.2)

Thus we have the following theorems:


Theorem 3.8.1.1. The bias , up to terms of order O~,-\) , in the multi variate ratio
estimator, Y Ra ' is

B~Ra )=c ~f )Y[(C;2 -P YX2C yCr2 )+wk ;1 -e.;2+PYX2 C yCX2- Pyx\ CyCX) )] (3.8.1.3)

Proof. It follows by takin g expected values on both sides of (3.8 .1.2). Hen ce the
theorem .
Chapter 3: Use of auxiliary information: Simple random sampling 231

Theorem 3.8.1.2. The minimum mean squared error of the multivariate ratio
estimator \bar{y}_{Ra} is given by

Min.MSE(\bar{y}_{Ra}) = \frac{1-f}{n}\bar{Y}^2\left[C_y^2 + C_{x_2}^2 - 2\rho_{yx_2}C_yC_{x_2} - \frac{(\rho_{yx_2}C_yC_{x_2} - C_{x_2}^2 - \rho_{yx_1}C_yC_{x_1} + \rho_{x_1x_2}C_{x_1}C_{x_2})^2}{C_{x_1}^2 + C_{x_2}^2 - 2\rho_{x_1x_2}C_{x_1}C_{x_2}}\right].    (3.8.1.4)

Proof. By the definition of mean squared error we have

MSE(\bar{y}_{Ra}) = E[\bar{y}_{Ra} - \bar{Y}]^2 \approx \bar{Y}^2E[(\varepsilon_0 - \eta_2) + w(\eta_2 - \eta_1)]^2
= \bar{Y}^2E[(\varepsilon_0^2 + \eta_2^2 - 2\varepsilon_0\eta_2) + w^2(\eta_1^2 + \eta_2^2 - 2\eta_1\eta_2) + 2w(\varepsilon_0\eta_2 - \eta_2^2 - \varepsilon_0\eta_1 + \eta_1\eta_2)]
= \frac{1-f}{n}\bar{Y}^2[C_y^2 + C_{x_2}^2 - 2\rho_{yx_2}C_yC_{x_2} + w^2(C_{x_1}^2 + C_{x_2}^2 - 2\rho_{x_1x_2}C_{x_1}C_{x_2})
+ 2w(\rho_{yx_2}C_yC_{x_2} - C_{x_2}^2 - \rho_{yx_1}C_yC_{x_1} + \rho_{x_1x_2}C_{x_1}C_{x_2})].    (3.8.1.5)

On differentiating (3.8.1.5) with respect to w and equating to zero we obtain

w = -\frac{\rho_{yx_2}C_yC_{x_2} - C_{x_2}^2 - \rho_{yx_1}C_yC_{x_1} + \rho_{x_1x_2}C_{x_1}C_{x_2}}{C_{x_1}^2 + C_{x_2}^2 - 2\rho_{x_1x_2}C_{x_1}C_{x_2}}.    (3.8.1.6)

On substituting this value of w in (3.8.1.5) we obtain (3.8.1.4). Hence the theorem.
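The minimization above can be checked numerically. The sketch below (not from the book; all parameter values are illustrative) treats (3.8.1.5) as a quadratic in w, takes the stationary point, and evaluates (3.8.1.4). Under symmetric auxiliaries (equal CVs and equal correlations with y) the optimal weight must be 1/2:

```python
# Sketch: optimal weight w for Olkin's two-variable weighted ratio estimator
# ybar_Ra = w*ybar_R1 + (1-w)*ybar_R2, minimizing MSE(w) = K*(A + w^2*B + 2*w*C)
# as in (3.8.1.5). All numeric inputs below are hypothetical.

def olkin_min_mse(Cy, Cx1, Cx2, r_y1, r_y2, r_12, f, n, Ybar):
    """Return (optimal w, minimized MSE)."""
    K = (1 - f) / n * Ybar**2
    A = Cy**2 + Cx2**2 - 2 * r_y2 * Cy * Cx2               # part free of w
    B = Cx1**2 + Cx2**2 - 2 * r_12 * Cx1 * Cx2             # coefficient of w^2
    C = r_y2 * Cy * Cx2 - Cx2**2 - r_y1 * Cy * Cx1 + r_12 * Cx1 * Cx2  # of 2w
    w_opt = -C / B                                         # d(MSE)/dw = 0
    return w_opt, K * (A - C**2 / B)                       # Min.MSE of (3.8.1.4)

# Symmetric auxiliaries: the two ratio estimators are interchangeable, so w = 1/2.
w, mse = olkin_min_mse(Cy=1.0, Cx1=0.8, Cx2=0.8,
                       r_y1=0.7, r_y2=0.7, r_12=0.5, f=0.25, n=30, Ybar=100.0)
```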

3.8.2 MULTIVARIATE REGRESSION TYPE ESTIMATORS

Raj (1965a) proposed a multivariate regression type estimator, which in the case of
two auxiliary variables can be written as

\bar{y}_{Raj} = \bar{y} + \beta_1(\bar{X}_1 - \bar{x}_1) + \beta_2(\bar{X}_2 - \bar{x}_2).    (3.8.2.1)

The above estimator \bar{y}_{Raj}, in terms of \varepsilon_0, \eta_1 and \eta_2, can easily be expressed as:

\bar{y}_{Raj} = \bar{Y}(1 + \varepsilon_0) - \beta_1\bar{X}_1\eta_1 - \beta_2\bar{X}_2\eta_2.    (3.8.2.2)

Thus we have the following theorems:

Theorem 3.8.2.1. The estimator \bar{y}_{Raj} is an unbiased estimator of the population
mean \bar{Y}.
Proof. On taking expected values on both sides of (3.8.2.2) we have

E(\bar{y}_{Raj}) = E[\bar{Y}(1 + \varepsilon_0) - \beta_1\bar{X}_1\eta_1 - \beta_2\bar{X}_2\eta_2] = \bar{Y}.

Hence the theorem.



Theorem 3.8.2.2. The minimum variance of the estimator \bar{y}_{Raj} is given by

Min.V(\bar{y}_{Raj}) = \frac{1-f}{n}\bar{Y}^2C_y^2\left[1 - R_{y\cdot x_1x_2}^2\right]    (3.8.2.3)

where

R_{y\cdot x_1x_2}^2 = \frac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}

denotes the multiple correlation between Y, X_1 and X_2.
Proof. By the definition of the variance of an estimator we have

V(\bar{y}_{Raj}) = E[\bar{y}_{Raj} - \bar{Y}]^2 = E[\bar{Y}(1 + \varepsilon_0) - \beta_1\bar{X}_1\eta_1 - \beta_2\bar{X}_2\eta_2 - \bar{Y}]^2
= E[\bar{Y}\varepsilon_0 - \beta_1\bar{X}_1\eta_1 - \beta_2\bar{X}_2\eta_2]^2
= E[\bar{Y}^2\varepsilon_0^2 + \beta_1^2\bar{X}_1^2\eta_1^2 + \beta_2^2\bar{X}_2^2\eta_2^2 - 2\bar{Y}\bar{X}_1\beta_1\varepsilon_0\eta_1 - 2\bar{Y}\bar{X}_2\beta_2\varepsilon_0\eta_2 + 2\beta_1\beta_2\bar{X}_1\bar{X}_2\eta_1\eta_2]
= \frac{1-f}{n}[\bar{Y}^2C_y^2 + \beta_1^2\bar{X}_1^2C_{x_1}^2 + \beta_2^2\bar{X}_2^2C_{x_2}^2 - 2\beta_1\bar{Y}\bar{X}_1\rho_{yx_1}C_yC_{x_1}
- 2\beta_2\bar{Y}\bar{X}_2\rho_{yx_2}C_yC_{x_2} + 2\beta_1\beta_2\bar{X}_1\bar{X}_2\rho_{x_1x_2}C_{x_1}C_{x_2}].    (3.8.2.4)

On differentiating (3.8.2.4) with respect to \beta_1 and \beta_2, respectively, and equating to
zero we obtain

\beta_1\bar{X}_1C_{x_1} + \beta_2\rho_{x_1x_2}\bar{X}_2C_{x_2} = \bar{Y}\rho_{yx_1}C_y,    (3.8.2.5)

and

\beta_1\rho_{x_1x_2}\bar{X}_1C_{x_1} + \beta_2\bar{X}_2C_{x_2} = \bar{Y}\rho_{yx_2}C_y.    (3.8.2.6)

The above system of equations can be written as

[ \bar{X}_1C_{x_1}              \bar{X}_2C_{x_2}\rho_{x_1x_2} ] [ \beta_1 ]   [ \bar{Y}\rho_{yx_1}C_y ]
[ \bar{X}_1C_{x_1}\rho_{x_1x_2}  \bar{X}_2C_{x_2}             ] [ \beta_2 ] = [ \bar{Y}\rho_{yx_2}C_y ].

Using Cramer's rule we have

\Delta = det[ \bar{X}_1C_{x_1}, \bar{X}_2C_{x_2}\rho_{x_1x_2} ; \bar{X}_1C_{x_1}\rho_{x_1x_2}, \bar{X}_2C_{x_2} ] = \bar{X}_1\bar{X}_2C_{x_1}C_{x_2}(1 - \rho_{x_1x_2}^2),

\Delta_1 = det[ \bar{Y}\rho_{yx_1}C_y, \bar{X}_2C_{x_2}\rho_{x_1x_2} ; \bar{Y}\rho_{yx_2}C_y, \bar{X}_2C_{x_2} ] = \bar{Y}\bar{X}_2C_yC_{x_2}(\rho_{yx_1} - \rho_{yx_2}\rho_{x_1x_2}),

and

\Delta_2 = det[ \bar{X}_1C_{x_1}, \bar{Y}\rho_{yx_1}C_y ; \bar{X}_1C_{x_1}\rho_{x_1x_2}, \bar{Y}\rho_{yx_2}C_y ] = \bar{Y}\bar{X}_1C_yC_{x_1}(\rho_{yx_2} - \rho_{yx_1}\rho_{x_1x_2}).

On solving (3.8.2.5) and (3.8.2.6) for \beta_1 and \beta_2 we obtain

\beta_1 = \frac{\Delta_1}{\Delta} = \frac{\bar{Y}C_y(\rho_{yx_1} - \rho_{x_1x_2}\rho_{yx_2})}{\bar{X}_1C_{x_1}(1 - \rho_{x_1x_2}^2)}, \quad and \quad \beta_2 = \frac{\Delta_2}{\Delta} = \frac{\bar{Y}C_y(\rho_{yx_2} - \rho_{x_1x_2}\rho_{yx_1})}{\bar{X}_2C_{x_2}(1 - \rho_{x_1x_2}^2)}.

On substituting these values of \beta_1 and \beta_2 in (3.8.2.4) and simplifying, we obtain

Min.V(\bar{y}_{Raj}) = \frac{1-f}{n}S_y^2\left[1 - \frac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}\right] = \frac{1-f}{n}\bar{Y}^2C_y^2\left[1 - R_{y\cdot x_1x_2}^2\right].

Hence the theorem.
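The closed-form coefficients obtained via Cramer's rule can be verified by substituting them back into the normal equations (3.8.2.5) and (3.8.2.6). A minimal sketch (the parameter values are hypothetical, chosen only to exercise the algebra):

```python
# Sketch: check that the optimal beta_1, beta_2 of Theorem 3.8.2.2 satisfy the
# normal equations (3.8.2.5)-(3.8.2.6). All numeric values are illustrative.

def optimal_betas(Ybar, X1bar, X2bar, Cy, Cx1, Cx2, r_y1, r_y2, r_12):
    d = 1 - r_12**2
    b1 = Ybar * Cy * (r_y1 - r_12 * r_y2) / (X1bar * Cx1 * d)
    b2 = Ybar * Cy * (r_y2 - r_12 * r_y1) / (X2bar * Cx2 * d)
    return b1, b2

Ybar, X1bar, X2bar = 50.0, 20.0, 10.0       # hypothetical population means
Cy, Cx1, Cx2 = 0.30, 0.25, 0.40             # hypothetical CVs
r_y1, r_y2, r_12 = 0.8, 0.6, 0.5            # hypothetical correlations

b1, b2 = optimal_betas(Ybar, X1bar, X2bar, Cy, Cx1, Cx2, r_y1, r_y2, r_12)

# Residuals of (3.8.2.5) and (3.8.2.6); both should vanish.
res1 = b1 * X1bar * Cx1 + b2 * r_12 * X2bar * Cx2 - Ybar * r_y1 * Cy
res2 = b1 * r_12 * X1bar * Cx1 + b2 * X2bar * Cx2 - Ybar * r_y2 * Cy
```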

Example 3.8.2.1. The season average price (in $) per pound of the
commercial apple crop in 36 different states of the United States has been given in
population 3. Suppose we selected an SRSWOR sample of nine states to collect the
required information from 1996. Find the relative efficiency of the regression type
estimator of average price in the United States that makes use of past information
from two years with respect to the estimator that makes use of past information only
from one year.
Solution. From the description of the population we have
Y_i = season average price per pound during 1996,
X_{1i} = season average price per pound during 1995,
X_{2i} = season average price per pound during 1994,
N = 36, S_y^2 = 0.006488, \rho_{yx_1} = 0.8775, \rho_{yx_2} = 0.8577, \rho_{x_1x_2} = 0.8788, n = 9, and
f = n/N = 9/36 = 0.25.
Now we have:
( a ) Use of one auxiliary variable
Regression estimator: \bar{y}_{l1} = \bar{y} + \beta(\bar{X}_1 - \bar{x}_1).
Mean squared error:

MSE(\bar{y}_{l1}) = \frac{1-f}{n}S_y^2(1 - \rho_{yx_1}^2) = \frac{1-0.25}{9} \times 0.006488 \times [1 - 0.8775^2] = 0.0001243.

( b ) Use of two auxiliary variables
Regression estimator: \bar{y}_{l2} = \bar{y} + \beta_1(\bar{X}_1 - \bar{x}_1) + \beta_2(\bar{X}_2 - \bar{x}_2).
Mean squared error:

MSE(\bar{y}_{l2}) = \frac{1-f}{n}S_y^2\left[1 - \frac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}\right]
= \frac{1-0.25}{9}(0.006488)\left[1 - \frac{0.8775^2 + 0.8577^2 - 2 \times 0.8775 \times 0.8577 \times 0.8788}{1 - 0.8788^2}\right] = 0.0001066.

Thus the required relative efficiency is given by

RE = \frac{MSE(\bar{y}_{l1})}{MSE(\bar{y}_{l2})} \times 100 = \frac{0.0001243}{0.0001066} \times 100 = 116.60\%.
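The arithmetic of this example can be reworked with a short script (the inputs are copied from the solution above):

```python
# Recomputing Example 3.8.2.1: relative efficiency of the two-auxiliary
# regression estimator over the one-auxiliary regression estimator.
r_y1, r_y2, r_12 = 0.8775, 0.8577, 0.8788
Sy2, n, f = 0.006488, 9, 0.25

mse_one = (1 - f) / n * Sy2 * (1 - r_y1**2)                        # one auxiliary
R2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)  # multiple R^2
mse_two = (1 - f) / n * Sy2 * (1 - R2)                             # two auxiliaries
RE = 100 * mse_one / mse_two                                       # in percent
```

Note that the intermediate value R2 here is about 0.8029, the same squared multiple correlation quoted in Example 3.8.2.2 below.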

Now we hav e the following corollaries:



Corollary 3.8.2.1. A practically useful linear regression estimator of the population
mean \bar{Y} in the presence of two auxiliary variables X_1 and X_2 is given by

\bar{y}_{Raj(p)} = \bar{y} + \hat{\beta}_1(\bar{X}_1 - \bar{x}_1) + \hat{\beta}_2(\bar{X}_2 - \bar{x}_2)    (3.8.2.7)

where

\hat{\beta}_1 = \frac{s_y(r_{yx_1} - r_{x_1x_2}r_{yx_2})}{s_{x_1}(1 - r_{x_1x_2}^2)}

and

\hat{\beta}_2 = \frac{s_y(r_{yx_2} - r_{x_1x_2}r_{yx_1})}{s_{x_2}(1 - r_{x_1x_2}^2)}

denote the estimators of the partial regression coefficients \beta_1 and \beta_2, respectively.

Corollary 3.8.2.2. Under the superpopulation model

y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + e_i    (3.8.2.8)

where E(e_i) = 0, E(e_i^2) = \sigma^2, and E(e_ie_j) = 0 for i \ne j = 1, 2, ..., n, the values of
\hat{\beta}_1 and \hat{\beta}_2 in the estimator \bar{y}_{Raj(p)} = \bar{y} + \hat{\beta}_1(\bar{X}_1 - \bar{x}_1) + \hat{\beta}_2(\bar{X}_2 - \bar{x}_2) can be obtained
by solving the following set of normal equations:

[ \sum_{i=1}^{n}(x_{1i} - \bar{x}_1)^2                  \sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2) ] [ \hat{\beta}_1 ]   [ \sum_{i=1}^{n}(y_i - \bar{y})(x_{1i} - \bar{x}_1) ]
[ \sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)  \sum_{i=1}^{n}(x_{2i} - \bar{x}_2)^2                ] [ \hat{\beta}_2 ] = [ \sum_{i=1}^{n}(y_i - \bar{y})(x_{2i} - \bar{x}_2) ]    (3.8.2.9)

or, equivalently,

\hat{\beta}_1 = \frac{s_y(r_{yx_1} - r_{x_1x_2}r_{yx_2})}{s_{x_1}(1 - r_{x_1x_2}^2)}, \quad \hat{\beta}_2 = \frac{s_y(r_{yx_2} - r_{x_1x_2}r_{yx_1})}{s_{x_2}(1 - r_{x_1x_2}^2)}.    (3.8.2.10)

Corollary 3.8.2.3. Under the superpopulation model

y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + e_i    (3.8.2.11)

where E(e_i) = 0, E(e_i^2) = \sigma^2, and E(e_ie_j) = 0 for i \ne j = 1, 2, ..., n, an unbiased estimator
of V(\bar{y}_{Raj(p)}) is given by

\hat{v}(\bar{y}_{Raj(p)}) = \frac{1-f}{n}\cdot\frac{1}{n-3}\sum_{i=1}^{n}e_i^2.    (3.8.2.12)

Corollary 3.8.2.4. An estimator of V(\bar{y}_{Raj(p)}) is given by

\hat{v}(\bar{y}_{Raj(p)}) = \frac{1-f}{n}s_y^2\left(1 - \hat{R}_{y\cdot x_1x_2}^2\right)    (3.8.2.13)

where

\hat{R}_{y\cdot x_1x_2}^2 = \frac{r_{yx_1}^2 + r_{yx_2}^2 - 2r_{yx_1}r_{yx_2}r_{x_1x_2}}{1 - r_{x_1x_2}^2}.    (3.8.2.14)

Theorem 3.8.2.3. The minimum sample size required to achieve a given relative
standard error \phi with the estimator \bar{y}_{Raj} is obtained from

\left(\frac{1}{n} - \frac{1}{N}\right)C_y^2\left(1 - R_{y\cdot x_1x_2}^2\right) \le \phi^2,    (3.8.2.15)

or

\frac{1}{n} \le \frac{\phi^2}{C_y^2\left(1 - R_{y\cdot x_1x_2}^2\right)} + \frac{1}{N},

or

n \ge \left[\frac{\phi^2}{C_y^2\left(1 - R_{y\cdot x_1x_2}^2\right)} + \frac{1}{N}\right]^{-1}.

Hence the theorem.

Example 3.8.2.2. We wish to estimate the season's average price per pound (Y) of
the commercial apple crop during 1996 in the United States. The correlations of
the price during 1996 (Y) with that during 1995 (X_1) and 1994 (X_2) are
assumed to be known. Find the minimum sample size, n, required to estimate the
average price with relative standard error 5.6%.
Given: R_{y\cdot x_1x_2}^2 = 0.8029, C_y^2 = 0.1563, and N = 69.

Solution. The minimum sample size is given by

n \ge \left[\frac{\phi^2}{C_y^2\left(1 - R_{y\cdot x_1x_2}^2\right)} + \frac{1}{N}\right]^{-1} = \left[\frac{(0.056)^2}{0.1563(1 - 0.8029)} + \frac{1}{69}\right]^{-1} = 14.09 \approx 14.
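The sample-size rule of Theorem 3.8.2.3 is easy to put into a helper function. The sketch below uses illustrative (hypothetical) inputs rather than the values of the example above:

```python
# Sketch of the minimum-sample-size rule of Theorem 3.8.2.3:
# smallest n with (1/n - 1/N) * Cy2 * (1 - R2) <= phi^2.
# The inputs below are hypothetical, chosen only for illustration.
import math

def min_sample_size(phi, Cy2, R2, N):
    return math.ceil(1.0 / (phi**2 / (Cy2 * (1 - R2)) + 1.0 / N))

n_req = min_sample_size(phi=0.05, Cy2=0.25, R2=0.5, N=1000)
```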

Example 3.8.2.3. Select an SRSWOR sample of 14 states from the population 3
given in the Appendix. Collect the information about the season's average price per
pound of the apple crop during 1996, 1995, and 1994. Estimate the average per
pound price ($) of the commercial apple crop during 1996 in the United States.
Assume that the average prices per pound of the commercial apple crop during
1995 and 1994 are accurately known.
( a ) Apply the regression estimator of population mean with two auxiliary
variables.
( b ) Derive the 95% confidence intervals using the superpopulation model approach.
( c ) Derive the 95% confidence intervals using the design based approach.
Given: \bar{X}_1 = 0.1856 and \bar{X}_2 = 0.1708.

Solution. We started with the first two columns of the Pseudo-Random Numbers
(PRN) given in Table 1 of the Appendix to select 14 distinct random numbers
1 \le R \le 36 as 01, 23, 04, 32, 33, 05, 22, 29, 03, 36, 27, 19, 14 and 06.

Sr. No.   State and Territory   Year 1994   Year 1995   Year 1996
                                  x_{2i}      x_{1i}       y_i
01        AZ                      0.078       0.071       0.122
03        CA                      0.133       0.183       0.160
04        CO                      0.157       0.145       0.223
05        CT                      0.283       0.276       0.292
06        DE                      0.168       0.125       0.173
14        ME                      0.174       0.179       0.185
19        MO                      0.198       0.160       0.228
22        NM                      0.219       0.298       0.306
23        NY                      0.118       0.121       0.130
27        PA                      0.104       0.095       0.133
29        SC                      0.130       0.126       0.136
32        VT                      0.165       0.181       0.194
33        VA                      0.090       0.099       0.101
36        WI                      0.230       0.241       0.133
Sum                               2.247       2.300       2.516

From the sample information we have

\bar{y} = 0.1797, \bar{x}_1 = 0.1642, \bar{x}_2 = 0.1605, r_{yx_1} = 0.7641, r_{yx_2} = 0.7722, r_{x_1x_2} = 0.8893,
s_y = 0.063063, s_{x_1} = 0.068009, and s_{x_2} = 0.057778.

The estimates of the partial regression coefficients \beta_1 and \beta_2 are

\hat{\beta}_1 = \frac{s_y(r_{yx_1} - r_{x_1x_2}r_{yx_2})}{s_{x_1}(1 - r_{x_1x_2}^2)} = \frac{0.063063(0.7641 - 0.8893 \times 0.7722)}{0.068009(1 - 0.8893^2)} = 0.3431,

and

\hat{\beta}_2 = \frac{s_y(r_{yx_2} - r_{x_1x_2}r_{yx_1})}{s_{x_2}(1 - r_{x_1x_2}^2)} = \frac{0.063063(0.7722 - 0.8893 \times 0.7641)}{0.057778(1 - 0.8893^2)} = 0.4837.

( a ) Regression estimator: An estimate of the average price of the commercial
apple crop during 1996 is given by

\bar{y}_{Raj(p)} = \bar{y} + \hat{\beta}_1(\bar{X}_1 - \bar{x}_1) + \hat{\beta}_2(\bar{X}_2 - \bar{x}_2)
= 0.1797 + 0.3431(0.1856 - 0.1642) + 0.4837(0.1708 - 0.1605) = 0.19202.
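Part (a) can be reproduced with a few lines of code (the summary statistics are those listed above):

```python
# Recomputing part (a) of Example 3.8.2.3 from the sample summary statistics.
sy, sx1, sx2 = 0.063063, 0.068009, 0.057778
r_y1, r_y2, r_12 = 0.7641, 0.7722, 0.8893
ybar, x1bar, x2bar = 0.1797, 0.1642, 0.1605
X1BAR, X2BAR = 0.1856, 0.1708            # known population means

d = 1 - r_12**2
b1 = sy * (r_y1 - r_12 * r_y2) / (sx1 * d)   # estimated partial regression coeff.
b2 = sy * (r_y2 - r_12 * r_y1) / (sx2 * d)
y_reg = ybar + b1 * (X1BAR - x1bar) + b2 * (X2BAR - x2bar)
```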

( b ) Superpopulation model approach: The fitted linear relationship between y, x_1,
and x_2 is \hat{y}_i = 0.04571 + 0.3431x_{1i} + 0.4837x_{2i}; thus the residual terms e_i = y_i - \hat{y}_i are given by

   e_i              e_i^2
 0.0142013      0.000201677
-0.0128294      0.000164594
 0.0515996      0.002662519
 0.0147073      0.000216305
 0.0031409      9.86525E-06
-0.0062887      3.95477E-05
 0.0316214      0.000999913
 0.0521159      0.002716067
-0.0143017      0.000204539
 0.0043907      1.92782E-05
-0.0158216      0.000250323
 0.0063784      4.06840E-05
-0.0222099      0.000493280
-0.1066481      0.011373817

One can see that \sum_{i=1}^{14}e_i = 0.0 and \sum_{i=1}^{14}e_i^2 = 0.019392408. Thus an estimate of V(\bar{y}_{Raj(p)}) is

\hat{v}(\bar{y}_{Raj(p)}) = \frac{1-f}{n}\cdot\frac{1}{n-3}\sum_{i=1}^{14}e_i^2 = \frac{1-0.39}{14} \times \frac{0.019392408}{14-3} = 0.00007694.

A (1 - \alpha)100\% confidence interval for estimating the population mean \bar{Y} is

\bar{y}_{Raj(p)} \pm t_{\alpha/2}(df = n-3)\sqrt{\hat{v}(\bar{y}_{Raj(p)})}.

Using Table 2 from the Appendix the 95% confidence interval for the average price
of the apple crop during 1996 is given by

0.19202 \pm t_{0.05/2}(df = 14-3)\sqrt{0.00007694},

or 0.19202 \pm 2.201\sqrt{0.00007694}, or [0.1727, 0.2113].

( c ) Design based approach: An estimate of the multiple correlation coefficient is

\hat{R}_{y\cdot x_1x_2}^2 = \frac{r_{yx_1}^2 + r_{yx_2}^2 - 2r_{yx_1}r_{yx_2}r_{x_1x_2}}{1 - r_{x_1x_2}^2} = \frac{0.7641^2 + 0.7722^2 - 2 \times 0.7641 \times 0.7722 \times 0.8893}{1 - 0.8893^2} = 0.6249.

An estimate of V(\bar{y}_{Raj(p)}) is given by

\hat{v}(\bar{y}_{Raj(p)}) = \frac{1-f}{n}s_y^2\left(1 - \hat{R}_{y\cdot x_1x_2}^2\right) = \frac{1-0.39}{14} \times 0.003977 \times (1 - 0.6249) = 0.00006499.

A (1 - \alpha)100\% confidence interval for estimating the population mean \bar{Y} is given by

\bar{y}_{Raj(p)} \pm t_{\alpha/2}(df = n-3)\sqrt{\hat{v}(\bar{y}_{Raj(p)})}.

Using Table 2 from the Appendix the 95% confidence interval for the average price
of the apple crop during 1996 is

0.19202 \pm t_{0.05/2}(df = 14-3)\sqrt{0.00006499},

or 0.19202 \pm 2.201\sqrt{0.00006499}, or [0.1743, 0.2097].
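The design-based interval in part (c) can be checked numerically (f is rounded to 0.39 as in the text):

```python
# Recomputing part (c) of Example 3.8.2.3.
import math

r_y1, r_y2, r_12 = 0.7641, 0.7722, 0.8893
sy2, n, f, t = 0.003977, 14, 0.39, 2.201
y_reg = 0.19202

R2_hat = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
v_hat = (1 - f) / n * sy2 * (1 - R2_hat)
half = t * math.sqrt(v_hat)
ci = (y_reg - half, y_reg + half)
```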

Corollary 3.8.2.5. If we have p auxiliary variables, for example X_1, X_2, ..., X_p,
and their population means \bar{X}_1, \bar{X}_2, ..., \bar{X}_p are known, then the multivariate
regression type estimator proposed by Raj (1965a) is

\bar{y}_m = \bar{y} + \sum_{i=1}^{p}\beta_i(\bar{X}_i - \bar{x}_i).    (3.8.2.16)

Proceeding as in the above theorem, the minimum variance of the multivariate
regression type estimator \bar{y}_m is given by

Min.V(\bar{y}_m) = \frac{1-f}{n}\bar{Y}^2C_y^2\left(1 - R_{y\cdot x_1x_2\cdots x_p}^2\right)    (3.8.2.17)

where R_{y\cdot x_1x_2\cdots x_p}^2 denotes the multiple correlation coefficient between Y and X_1,
X_2, ..., X_p.

3.8.3 GENERAL CLASS OF ESTIMATORS

Srivastava (1971) proposed a general class of ratio type estimators for estimating
the population mean \bar{Y} as

\bar{y}_s = \bar{y}H(u_1, u_2, ..., u_p) = \bar{y}H(\underline{u})    (3.8.3.1)

where u_j = \bar{x}_j/\bar{X}_j, j = 1, 2, ..., p, and H(\underline{u}) is a parametric function such that H(\underline{e}) = 1 for
\underline{e} = (1, 1, ..., 1)_{1\times p}, satisfying certain regularity conditions such that the first and second
order partial derivatives of H with respect to \underline{u} exist and are known. Expanding
H(\underline{u}) around the point \underline{e} by a second order Taylor's series we have

H(\underline{u}) = 1 + (\underline{u} - \underline{e})\frac{\partial H}{\partial\underline{u}}\Big|_{\underline{u}=\underline{e}} + \frac{1}{2}(\underline{u} - \underline{e})'\frac{\partial^2H}{\partial\underline{u}'\partial\underline{u}}\Big|_{\underline{u}=\underline{e}}(\underline{u} - \underline{e}) + \cdots.    (3.8.3.2)

Thus the general class of estimators can be written as

\bar{y}_s = \bar{Y}\left[1 + \varepsilon_0 + (\underline{u} - \underline{e})H_1 + (\underline{u} - \underline{e})'H_2(\underline{u} - \underline{e}) + \varepsilon_0(\underline{u} - \underline{e})H_1 + \cdots\right]    (3.8.3.3)

where H_1 = \frac{\partial H}{\partial\underline{u}}\Big|_{\underline{u}=\underline{e}} and H_2 = \frac{1}{2}\frac{\partial^2H}{\partial\underline{u}'\partial\underline{u}}\Big|_{\underline{u}=\underline{e}} are the matrices consisting of first and
second order partial derivatives of the function H with respect to \underline{u}, evaluated
at \underline{u} = \underline{e}. Thus we have the following theorems:

Theorem 3.8.3.1. The bias, to the first order of approximation, in the general class
of estimators of population mean is

B(\bar{y}_s) \approx \frac{1-f}{n}\bar{Y}\left[\sum_{j=1}^{p}C_{x_j}^2H_{2j} + \sum_{j=1}^{p}\rho_{yx_j}C_yC_{x_j}H_{1j}\right]    (3.8.3.4)

where H_{2j} and H_{1j} denote the j-th diagonal element of H_2 and the j-th component of H_1,
respectively. Also C_{x_j}^2 = S_{x_j}^2/\bar{X}_j^2 and \rho_{yx_j} = S_{yx_j}/(S_yS_{x_j}) for j = 1, 2, ..., p.

Proof. Follows by taking expected values on both sides of (3.8.3.3).

Theorem 3.8.3.2. The minimum mean squared error, up to terms of O(n^{-1}), of
the general class of estimators of population mean is given by

Min.MSE(\bar{y}_s) = \frac{1-f}{n}\bar{Y}^2C_y^2\left[1 - R_{y\cdot x_1x_2\cdots x_p}^2\right].    (3.8.3.5)

Proof. By the definition of mean squared error, we have

MSE(\bar{y}_s) = E[\bar{y}_s - \bar{Y}]^2
\approx E\left[\bar{Y}\left(1 + \varepsilon_0 + (\underline{u} - \underline{e})H_1 + (\underline{u} - \underline{e})'H_2(\underline{u} - \underline{e}) + \varepsilon_0(\underline{u} - \underline{e})H_1 + \cdots\right) - \bar{Y}\right]^2
\approx \bar{Y}^2E[\varepsilon_0 + (\underline{u} - \underline{e})H_1]^2 = \bar{Y}^2E\left[\{\varepsilon_0 + (\underline{u} - \underline{e})H_1\}'\{\varepsilon_0 + (\underline{u} - \underline{e})H_1\}\right]
= \bar{Y}^2E\left[\varepsilon_0^2 + H_1'(\underline{u} - \underline{e})'(\underline{u} - \underline{e})H_1 + 2\varepsilon_0(\underline{u} - \underline{e})H_1\right]
= \frac{1-f}{n}\bar{Y}^2\left[C_y^2 + \sum_{t=1}^{p}\sum_{j=1}^{p}H_{1t}H_{1j}\rho_{x_tx_j}C_{x_t}C_{x_j} + 2\sum_{j=1}^{p}\rho_{yx_j}C_yC_{x_j}H_{1j}\right].    (3.8.3.6)

On differentiating (3.8.3.6) with respect to H_1 = (H_{11}, ..., H_{1p})' and equating to zero,
we obtain a set of p equations as

[ C_{x_1}^2                      \rho_{x_1x_2}C_{x_1}C_{x_2}   ...   \rho_{x_1x_p}C_{x_1}C_{x_p} ] [ H_{11} ]   [ -\rho_{yx_1}C_yC_{x_1} ]
[ \rho_{x_1x_2}C_{x_1}C_{x_2}    C_{x_2}^2                     ...   \rho_{x_2x_p}C_{x_2}C_{x_p} ] [ H_{12} ] = [ -\rho_{yx_2}C_yC_{x_2} ]
[ ...                            ...                           ...   ...                         ] [  ...   ]   [  ...                   ]
[ \rho_{x_1x_p}C_{x_1}C_{x_p}    \rho_{x_2x_p}C_{x_2}C_{x_p}   ...   C_{x_p}^2                   ] [ H_{1p} ]   [ -\rho_{yx_p}C_yC_{x_p} ]    (3.8.3.7)

or

\underline{A}H_1 = \underline{C}.    (3.8.3.8)

The set of equations given by (3.8.3.8) can easily be solved for the unknown
parameters as H_1 = \underline{A}^{-1}\underline{C}. As we saw in Theorem 3.8.2.2, on substituting the
optimum values so obtained in (3.8.3.6) we obtain (3.8.3.5). Hence the theorem.
Corollary 3.8.3.1. A wider class of estimators for estimating the population mean \bar{Y}
using p auxiliary variables X_1, X_2, ..., X_p can easily be defined as

\bar{y}_w = H(\bar{y}, \underline{u})    (3.8.3.9)

where H(.,.) is a parametric function such that H(\bar{Y}, \underline{e}) = \bar{Y}, satisfying certain
regularity conditions. It is easy to show that the wider class of estimators \bar{y}_w has
the same asymptotic mean squared error as the general class of ratio type
estimators of population mean defined at (3.8.3.1).

Corollary 3.8.3.2. Let X = (x_{ij})_{n\times p} be the matrix of p variates, and let the equation of
the plane of regression of X_1 on X_2, X_3, ..., X_p be defined as

x_1 = b_{12.34...p}x_2 + b_{13.24...p}x_3 + b_{14.235...p}x_4 + \cdots + b_{1p.23...(p-1)}x_p.    (3.8.3.10)

Let s_i^2 denote the sample variance of the i-th variable and r_{ij} be the simple
correlation between the i-th and j-th variables. Define the correlation matrix as

\underline{\Delta} = [ 1      r_{12}   r_{13}   ...   r_{1p} ;
        r_{21}   1       ...            ;
        ...                             ;
        r_{p1}   r_{p2}   r_{p3}  ...   1 ]_{p\times p}.    (3.8.3.11)

Let \underline{\Delta}_{ij} be the cofactor of the element in the i-th row and j-th column of the correlation
matrix \underline{\Delta}; then the partial regression coefficients b_{1j}, j = 2, 3, ..., p, are given by

b_{1j} = -\frac{s_1}{s_j}\cdot\frac{det(\underline{\Delta}_{1j})}{det(\underline{\Delta}_{11})}, \quad j = 2, 3, ..., p    (3.8.3.12)

and the estimator of the multiple correlation coefficient R_{1\cdot 2,3,...,p}^2 is given by

\hat{R}_{1\cdot 2,3,...,p}^2 = 1 - \frac{det(\underline{\Delta})}{det(\underline{\Delta}_{11})}.    (3.8.3.13)

Example 3.8.3.1. An estimate of the total number of fish at the Atlantic and Gulf
coasts helps fishermen contractors in making decisions about recruiting labour. The average
numbers of fish during 1994, 1993, and 1992 are known to be 4954.435, 4591.072,
and 4230.174, respectively. Apply the following estimator to estimate the total
number of fish in all 69 types of groups

\bar{y}_{lr} = \bar{y} + H_{11}(u_1 - 1) + H_{12}(u_2 - 1) + H_{13}(u_3 - 1)

where u_j = \bar{x}_j/\bar{X}_j and the estimates H_{1j}, j = 1, 2, 3, are obtained by solving the set of
normal equations given by

[ C_{x_1}^2                  r_{x_1x_2}C_{x_1}C_{x_2}   r_{x_1x_3}C_{x_1}C_{x_3} ] [ H_{11} ]   [ -\bar{y}r_{yx_1}C_yC_{x_1} ]
[ r_{x_1x_2}C_{x_1}C_{x_2}   C_{x_2}^2                  r_{x_2x_3}C_{x_2}C_{x_3} ] [ H_{12} ] = [ -\bar{y}r_{yx_2}C_yC_{x_2} ]
[ r_{x_1x_3}C_{x_1}C_{x_3}   r_{x_2x_3}C_{x_2}C_{x_3}   C_{x_3}^2                ] [ H_{13} ]   [ -\bar{y}r_{yx_3}C_yC_{x_3} ].

To estimate the number of fish during 1995, a consultant takes an SRSWOR sample
of 16 types of fish as given in the following table:

Sr. No.   Kind of fish          x_{3i}     x_{2i}     x_{1i}     y_i
03        Skates/rays            2152       1981       2939       2353
05        Saltwater catfishes   13466      12690      14441      13859
04        Eels                    138        222        186        152
16        Striped bass           3840       4799       8521      10758
21        Bluefish              11990      10301      12405      10940
43        Weakfish               1668       2219       4929       5739
55        Cunner                 1931       1876       1255       1375
58        Atlantic mackerel      1045       2307       4860       4008
33        Snappers, other         746        861        462        492
69        Other fishes          12249      14953      20488      14426
39        Sheepshead             5933       5593       4383       5118
10        Pollock                 168        397        862        832
59        King mackerel          1289       1023       1148       1252
45        Silver perch           1198       1034       1729       2146
62        Summer flounder       11918      22919      17741      16238
66        Flounders, other       1103        999        918        897

Solution. From the sample information we have

            x_3            x_2            x_1            y
Mean        4427.1250      5260.8750      6079.1875      5661.5625
Variance    24711755.32    43283631.72    44114004.70    31798379.33
CV          1.1228         1.2505         1.0925         0.9960

and the sample correlation matrix (in the order y, x_1, x_2, x_3) is given by

[ 1.0000   0.9723   0.9246   0.9176 ]
[ 0.9723   1.0000   0.9347   0.9305 ]
[ 0.9246   0.9347   1.0000   0.9133 ]
[ 0.9176   0.9305   0.9133   1.0000 ].

Here
n = 16, p = 3, \bar{y} = 5661.56, \bar{x}_1 = 6079.18, \bar{x}_2 = 5260.87, \bar{x}_3 = 4427.12,
C_y = 0.9960, C_{x_1} = 1.0925, C_{x_2} = 1.2505, C_{x_3} = 1.1229, r_{yx_1} = 0.9723,
r_{yx_2} = 0.9246, r_{yx_3} = 0.9176, r_{x_1x_2} = 0.9347, r_{x_1x_3} = 0.9305 and r_{x_2x_3} = 0.9133.

The set of normal equations becomes

[ 1.1935   1.2769   1.1415 ] [ H_{11} ]   [ -5989.87 ]
[ 1.2769   1.5637   1.2824 ] [ H_{12} ] = [ -6519.78 ]
[ 1.1415   1.2824   1.2609 ] [ H_{13} ]   [ -5810.18 ].

A solution of these normal equations is given by

[ H_{11} ]   [ 1.1935   1.2769   1.1415 ]^{-1} [ -5989.87 ]   [  9.234   -4.127   -4.163 ] [ -5989.87 ]   [ -4215.54 ]
[ H_{12} ] = [ 1.2769   1.5637   1.2824 ]      [ -6519.78 ] = [ -4.127    5.702   -2.059 ] [ -6519.78 ] = [  -492.43 ]
[ H_{13} ]   [ 1.1415   1.2824   1.2609 ]      [ -5810.18 ]   [ -4.163   -2.059    6.657 ] [ -5810.18 ]   [  -318.31 ].

It is given that \bar{X}_1 = 4954.435, \bar{X}_2 = 4591.072, and \bar{X}_3 = 4230.174; therefore

u_1 = \bar{x}_1/\bar{X}_1 = 1.2270, \quad u_2 = \bar{x}_2/\bar{X}_2 = 1.1458, \quad and \quad u_3 = \bar{x}_3/\bar{X}_3 = 1.0465.

Hence a point estimate of the average number of fish during 1995 is given by

\bar{y}_{lr} = \bar{y} + H_{11}(u_1 - 1) + H_{12}(u_2 - 1) + H_{13}(u_3 - 1)
= 5661.56 - 4215.54(1.2270 - 1) - 492.43(1.1458 - 1) - 318.31(1.0465 - 1) = 4618.03.

In this case we have

\underline{\Delta} = det[ 1.0000   0.9723   0.9246   0.9176 ;
           0.9723   1.0000   0.9347   0.9305 ;
           0.9246   0.9347   1.0000   0.9133 ;
           0.9176   0.9305   0.9133   1.0000 ] = 0.000794,

and

\underline{\Delta}_{11} = det[ 1.0000   0.9347   0.9305 ;
              0.9347   1.0000   0.9133 ;
              0.9305   0.9133   1.0000 ] = 0.0150528;

therefore the value of the estimate of the multiple correlation coefficient is given by

\hat{R}_{y\cdot x_1,x_2,x_3}^2 = 1 - \frac{\underline{\Delta}}{\underline{\Delta}_{11}} = 1 - \frac{0.000794}{0.0150528} = 0.9473.
An estimate of the variance of the estimator \bar{y}_{lr} is given by

\hat{v}(\bar{y}_{lr}) = \frac{1-f}{n}s_y^2\left(1 - \hat{R}_{y\cdot x_1,x_2,x_3}^2\right) = \frac{1-0.32}{16} \times 31798379.33 \times (1 - 0.9473^2) = 138687.52.

A (1 - \alpha)100\% confidence interval of the population mean \bar{Y} is given by

\bar{y}_{lr} \pm t_{\alpha/2}(df = n - p - 1)\sqrt{\hat{v}(\bar{y}_{lr})}.

Using Table 2 of the Appendix the 99% confidence interval of the average number
of fish during 1995 is given by

\bar{y}_{lr} \pm t_{0.01/2}(df = 16 - 3 - 1)\sqrt{\hat{v}(\bar{y}_{lr})},

or

4618.03 \pm t_{0.005}(df = 12)\sqrt{138687.52}, or 4618.03 \pm 3.055 \times 372.407, or [3480.32, 5755.73].

A point estimate of the total number of fish during 1995 is given by

\hat{Y}_{lr} = N\bar{y}_{lr} = 69 \times 4618.03 = 318644.07,

and a 99% interval estimate for the same is given by

N\left[\bar{y}_{lr} \pm t_{0.01/2}(df = 16 - 3 - 1)\sqrt{\hat{v}(\bar{y}_{lr})}\right],

or

69 \times [3480.32, 5755.73], or [240142.08, 397145.37].
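The determinant-based multiple correlation of (3.8.3.13) can be recomputed for this example with a small pure-Python determinant routine (Laplace expansion; correlations taken from the solution above):

```python
# Recomputing R^2 = 1 - det(Delta)/det(Delta_11) for Example 3.8.3.1.

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

corr = [  # order: y, x1, x2, x3
    [1.0000, 0.9723, 0.9246, 0.9176],
    [0.9723, 1.0000, 0.9347, 0.9305],
    [0.9246, 0.9347, 1.0000, 0.9133],
    [0.9176, 0.9305, 0.9133, 1.0000],
]
delta = det(corr)
delta11 = det([row[1:] for row in corr[1:]])   # delete first row and column
R2 = 1 - delta / delta11
```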

The next section is devoted to the study of a general class of estimators to
estimate any population parameter (e.g., population mean, population variance,
population correlation coefficient, population coefficient of variation, population
regression coefficient, etc.) by making use of p auxiliary variables at the
estimation stage.

3.9 GENERAL CLASS OF ESTIMATORS TO ESTIMATE ANY POPULATION PARAMETER

Singh, Mangat, and Mahajan (1995) considered the problem of estimation of any
population parameter F_0 of the study variable Y by using known supplementary
information F_1, F_2, ..., F_p on p auxiliary variables. Let f_0, f_1, ..., f_p be the unbiased
or consistent estimators of F_0, F_1, ..., F_p, respectively, each based on a sample of
size n > p. Let \underline{u} = (u_1, u_2, ..., u_p), where u_i = f_i/F_i, i = 1, 2, ..., p, assume values in a
bounded, closed, convex subset R_p of p-dimensional space containing the point
\underline{e} = (1, 1, ..., 1). Let \underline{b}' = (b_1, b_2, ..., b_p), where b_i = \{F_0Cov(f_0, f_i)\}/\{F_iV(f_0)\},
i = 1, 2, ..., p, and \underline{A} = (a_{ij})_{p\times p}, where a_{ij} = \{F_0^2Cov(f_i, f_j)\}/\{F_iF_jV(f_0)\},
i, j = 1, 2, ..., p, and the matrix \underline{A} is assumed to be positive definite. Define

e_0 = (f_0/F_0) - 1, \quad \underline{\varepsilon}' = (e_1, e_2, ..., e_p)

where

e_i = u_i - 1, \quad C_0^2 = V(f_0)/F_0^2, \quad C_i^2 = V(f_i)/F_i^2, \quad and \quad C_{0i} = Cov(f_0, f_i)/(F_0F_i).

Then we have

E(e_i) = B(f_i)/F_i, \quad i = 1, 2, ..., p,

where B(f_i) denotes the bias in the estimator f_i of F_i, and

E(e_0^2) = C_0^2, \quad E(e_i^2) = C_i^2, \quad E(e_0e_i) = C_{0i}, \quad E(e_0\underline{\varepsilon}) = C_0^2\underline{b}, \quad and \quad E(\underline{\varepsilon}\,\underline{\varepsilon}') = C_0^2\underline{A}

for i = 1, 2, ..., p.
A wider class of estimators to estimate any population parameter F_0 of the study
variable is then defined as

t_h = h(f_0, \underline{u})    (3.9.1)

where h(f_0, \underline{u}) is a function of f_0 and u_i, i = 1, 2, ..., p, such that

h(F_0, \underline{e}) = F_0    (3.9.2)

and h is bounded and continuous with bounded and continuous first and second
order partial derivatives in R_{p+1}. Expanding h(f_0, \underline{u}) about the point (F_0, \underline{e}) in a
second order Taylor's series we obtain

t_h = h(F_0, \underline{e}) + (f_0 - F_0)h_0'(F_0, \underline{e}) + \sum_{i=1}^{p}(u_i - 1)h_i'(F_0, \underline{e})
+ \sum_{i<j}^{p}(u_i - 1)(u_j - 1)h_{ij}''(F_0, \underline{e}) + \frac{1}{2}\sum_{i=1}^{p}(u_i - 1)^2h_{ii}''(F_0, \underline{e})
+ \sum_{i=1}^{p}(u_i - 1)(f_0 - F_0)h_{0i}''(F_0, \underline{e}) + \frac{1}{2}(f_0 - F_0)^2h_{00}''(F_0, \underline{e}) + \cdots.    (3.9.3)

Using (3.9.2) and further assuming that h_0'(F_0, \underline{e}) = 1 we obtain

t_h = f_0 + \sum_{i=1}^{p}(u_i - 1)h_i'(F_0, \underline{e}) + \sum_{i<j}^{p}(u_i - 1)(u_j - 1)h_{ij}''(F_0, \underline{e}) + \frac{1}{2}\sum_{i=1}^{p}(u_i - 1)^2h_{ii}''(F_0, \underline{e})
+ \sum_{i=1}^{p}(u_i - 1)(f_0 - F_0)h_{0i}''(F_0, \underline{e}) + \frac{1}{2}(f_0 - F_0)^2h_{00}''(F_0, \underline{e}) + \cdots    (3.9.4)

where h_i'(F_0, \underline{e}) and h_{ij}''(F_0, \underline{e}) denote the first and second order partial derivatives
of h(f_0, \underline{u}) with respect to f_0 and u_i, respectively. Thus we have the following
theorem.

Theorem 3.9.1. The bias in the wider class of estimators of any population
parameter is of the order O(n^{-1}), i.e.,

E(t_h) = F_0 + O(n^{-1}).    (3.9.5)

Proof. Taking expected values on both sides of (3.9.4) we obtain

E(t_h) = F_0 + \sum_{i<j}^{p}C_{ij}h_{ij}''(F_0, \underline{e}) + \frac{1}{2}\sum_{i=1}^{p}C_{ii}h_{ii}''(F_0, \underline{e}) + \frac{1}{2}F_0C_{00}h_{00}''(F_0, \underline{e}) + \cdots.    (3.9.6)

Note that f_0, f_1, f_2, ..., f_p are either unbiased or consistent estimators of
F_0, F_1, F_2, ..., F_p, respectively; by the definition of consistency from Gujarati
(1978), the C_{ij} (i, j = 0, 1, 2, ..., p) representing the variance--covariance terms and
B(f_i) will tend to zero as the sample size n \to \infty. Thus (3.9.6) can be expressed as
(3.9.5).
Hence the theorem.

Theorem 3.9.2. The minimum mean squared error, up to terms of order O(n^{-1}), of
the class of estimators t_h is given by

Min.MSE(t_h) = V(f_0)\left[1 - R_{f_0\cdot f_1,f_2,...,f_p}^2\right]    (3.9.7)

where R_{f_0\cdot f_1,f_2,...,f_p}^2 denotes the multiple correlation coefficient between f_0 and
f_1, f_2, ..., f_p.
Proof. By the definition of mean squared error we have

MSE(t_h) = E[t_h - F_0]^2 = E\left[f_0 + \sum_{i=1}^{p}(u_i - 1)h_i'(F_0, \underline{e}) - F_0\right]^2
= F_0^2C_0^2 + \underline{h}_m'(F_0, \underline{e})C_0^2\underline{A}\,\underline{h}_m(F_0, \underline{e}) + 2C_0^2\underline{b}'\underline{h}_m(F_0, \underline{e})F_0    (3.9.8)

where \underline{h}_m(F_0, \underline{e}) = (h_1'(F_0, \underline{e}), ..., h_p'(F_0, \underline{e}))'. On differentiating (3.9.8) and equating to zero we obtain the optimum values of the
parameters as

\underline{h}_m(F_0, \underline{e}) = -F_0\underline{A}^{-1}\underline{b}.    (3.9.9)

On substituting the optimum values from (3.9.9) in (3.9.8), we obtain

Min.MSE(t_h) = V(f_0)\left[1 - \underline{b}'\underline{A}^{-1}\underline{b}\right]    (3.9.10)

where \underline{b}'\underline{A}^{-1}\underline{b} = R_{f_0\cdot f_1,f_2,...,f_p}^2.
Hence the theorem.

Remark 3.9.1. ( a ) Note that the multiple correlation coefficient increases with the
number of secondary variables; it follows from (3.9.10) that the minimum mean
squared error of t_h is a monotone decreasing function of the number of secondary
variables.

( b ) The value of R_{f_0\cdot f_1,f_2,...,f_p}^2 also increases if there is high correlation between
two auxiliary variables, and such high correlation between the auxiliary variables
may bring an artificial reduction in the variance of the estimator of population mean or
total. Such a problem can be addressed as a problem of multicollinearity in survey
sampling. For example, suppose there are three variables Y, X_1, and X_2. Then the
minimum variance of the linear regression estimator

\bar{y}_{lr} = \bar{y} + \beta_1(\bar{X}_1 - \bar{x}_1) + \beta_2(\bar{X}_2 - \bar{x}_2)

is given by

Min.V(\bar{y}_{lr}) = \frac{1-f}{n}S_y^2\left[1 - \frac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}\right].

Consider the following two situations:

Situation I. Let \rho_{yx_1} = 0.6, \rho_{yx_2} = 0.8, and \rho_{x_1x_2} = 0.3;

Situation II. Let \rho_{yx_1} = 0.6, \rho_{yx_2} = 0.8, and \rho_{x_1x_2} = 0.95.

Then the ratio of V(\bar{y}_{lr}) under situation I to situation II is given by

Ratio = \frac{V(\bar{y}_{lr})_{situation\ I}}{V(\bar{y}_{lr})_{situation\ II}} = \frac{\left[1 - \dfrac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}\right]_{situation\ I}}{\left[1 - \dfrac{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}}{1 - \rho_{x_1x_2}^2}\right]_{situation\ II}} = 2.233.

Clearly the reduction in variance V(\bar{y}_{lr}) in situation II is due to the high correlation
between X_1 and X_2.
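The ratio quoted in this remark can be reproduced directly from the variance formula:

```python
# Reproducing the variance ratio of Remark 3.9.1(b): the factor (1 - R^2)
# under low vs. high collinearity between the two auxiliary variables.
def one_minus_R2(r_y1, r_y2, r_12):
    return 1 - (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

ratio = one_minus_R2(0.6, 0.8, 0.3) / one_minus_R2(0.6, 0.8, 0.95)
```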

Biradar and Singh (1992a), Singh and Kataria (1990), Singh (1988), and Singh and
Upadhyaya (1986) have suggested some methods to improve the general class of
estimators. Srivastava (1992) has shown that those methods are not valid for
improving the general class of estimators. Biradar and Singh (1997, 1998)
considered a class of estimators based on a general sampling design for a
population parameter \phi_0 utilizing the information on two parameters \phi_1 and \phi_2 of
an auxiliary variable.

Remark 3.9.2. It is to be noted that although Srivastava's (1971) general class of
estimators and its ramifications attracted many followers, leading to
numerous papers published in journals worldwide, they are of dubious
theoretical merit and practical use according to the views of a few well known
scientists.

3.10 ESTIMATION OF THE RATIO OF TWO POPULATION MEANS

Suppose there are two variables Y_1 and Y_2 under study in a finite population \Omega of
size N. Let Y_{1i} and Y_{2i} denote the values of the i-th unit in the population. Suppose
we want to estimate the ratio of the two population means defined as

R_{y_1y_2} = \frac{\bar{Y}_1}{\bar{Y}_2}    (3.10.1)

where \bar{Y}_1 = N^{-1}\sum_{i=1}^{N}Y_{1i} and \bar{Y}_2 = N^{-1}\sum_{i=1}^{N}Y_{2i} denote, respectively, the population means
of the two variables. Suppose a sample of n units is drawn by using SRSWOR and
the paired observations (y_{1i}, y_{2i}), i = 1, 2, ..., n, are observed from the sample.
Let \bar{y}_1 = n^{-1}\sum_{i=1}^{n}y_{1i} and \bar{y}_2 = n^{-1}\sum_{i=1}^{n}y_{2i} denote the sample means. Obviously an
estimator of the population ratio, R_{y_1y_2}, is given by

\hat{R}_{y_1y_2} = \frac{\bar{y}_1}{\bar{y}_2}.    (3.10.2)
To find the bias and mean squared error of the estimator \hat{R}_{y_1y_2}, let us define

\varepsilon_0 = \frac{\bar{y}_1}{\bar{Y}_1} - 1 \quad and \quad \delta_0 = \frac{\bar{y}_2}{\bar{Y}_2} - 1,

such that

E(\varepsilon_0) = E(\delta_0) = 0

and

E(\varepsilon_0^2) = \frac{1-f}{n}C_{y_1}^2, \quad E(\delta_0^2) = \frac{1-f}{n}C_{y_2}^2, \quad and \quad E(\varepsilon_0\delta_0) = \frac{1-f}{n}\rho_{y_1y_2}C_{y_1}C_{y_2}.

Now the estimator \hat{R}_{y_1y_2} in terms of \varepsilon_0 and \delta_0 can easily be written as

\hat{R}_{y_1y_2} = R_{y_1y_2}\left[1 + \varepsilon_0 - \delta_0 + \delta_0^2 - \varepsilon_0\delta_0 + \cdots\right].    (3.10.3)

Then we have the following theorems.

Theorem 3.10.1. The bias, to the first order of approximation, in the estimator
\hat{R}_{y_1y_2} is given by

B(\hat{R}_{y_1y_2}) = \frac{1-f}{n}R_{y_1y_2}\left[C_{y_2}^2 - \rho_{y_1y_2}C_{y_1}C_{y_2}\right].    (3.10.4)

Proof. Taking expected values of (3.10.3) we obtain

E(\hat{R}_{y_1y_2}) = R_{y_1y_2}E\left[1 + \varepsilon_0 - \delta_0 + \delta_0^2 - \varepsilon_0\delta_0 + \cdots\right]
= R_{y_1y_2}\left[1 + E(\varepsilon_0) - E(\delta_0) + E(\delta_0^2) - E(\varepsilon_0\delta_0)\right]
= R_{y_1y_2}\left[1 + 0 - 0 + \frac{1-f}{n}\left(C_{y_2}^2 - \rho_{y_1y_2}C_{y_1}C_{y_2}\right)\right].

Thus the bias, to the first order of approximation, is given by

B(\hat{R}_{y_1y_2}) = E(\hat{R}_{y_1y_2}) - R_{y_1y_2} = \frac{1-f}{n}R_{y_1y_2}\left[C_{y_2}^2 - \rho_{y_1y_2}C_{y_1}C_{y_2}\right].

Hence the theorem.

Theorem 3.10.2. The mean squared error, to the first order of approximation, of the
estimator \hat{R}_{y_1y_2} is given by

MSE(\hat{R}_{y_1y_2}) = \frac{1-f}{n}R_{y_1y_2}^2\left[C_{y_1}^2 + C_{y_2}^2 - 2\rho_{y_1y_2}C_{y_1}C_{y_2}\right].    (3.10.5)

Proof. By the definition of mean squared error, we have

MSE(\hat{R}_{y_1y_2}) = E[\hat{R}_{y_1y_2} - R_{y_1y_2}]^2 \approx R_{y_1y_2}^2E[\varepsilon_0 - \delta_0 + \delta_0^2 - \varepsilon_0\delta_0]^2
\approx R_{y_1y_2}^2E[\varepsilon_0^2 + \delta_0^2 - 2\varepsilon_0\delta_0] = \frac{1-f}{n}R_{y_1y_2}^2\left[C_{y_1}^2 + C_{y_2}^2 - 2\rho_{y_1y_2}C_{y_1}C_{y_2}\right].

Hence the theorem.
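Formulas (3.10.4) and (3.10.5) can be coded directly; two algebraic sanity checks are natural (the parameter values below are illustrative only): if y_1 is exactly proportional to y_2 (equal CVs, correlation one) the ratio is estimated without error to this order, so both bias and MSE vanish, while a generic case gives a strictly positive MSE.

```python
# Sketch: first-order bias and MSE of R_hat = ybar1/ybar2 per (3.10.4)-(3.10.5).
def bias_mse_ratio(R, Cy1, Cy2, rho, f, n):
    k = (1 - f) / n
    bias = k * R * (Cy2**2 - rho * Cy1 * Cy2)
    mse = k * R**2 * (Cy1**2 + Cy2**2 - 2 * rho * Cy1 * Cy2)
    return bias, mse

# Perfect proportionality: bias and MSE both vanish.
b0, m0 = bias_mse_ratio(R=2.0, Cy1=0.3, Cy2=0.3, rho=1.0, f=0.1, n=50)

# A generic case: strictly positive MSE.
b1, m1 = bias_mse_ratio(R=2.0, Cy1=0.3, Cy2=0.4, rho=0.6, f=0.1, n=50)
```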

Remark 3.10.1. We can also estimate the product of two population means,
P_{y_1y_2} = \bar{Y}_1\bar{Y}_2, with the help of the estimator \hat{P}_{y_1y_2} = \bar{y}_1\bar{y}_2.

Remark 3.10.2. In the case of the presence of known auxiliary information, general
classes of estimators of the form \hat{R}_H = (\bar{y}_1/\bar{y}_2)H(u) and \hat{P}_H = \bar{y}_1\bar{y}_2H(u), where u = \bar{x}/\bar{X}
and H(.) is a parametric function, can also be constructed to estimate the ratio
R_{y_1y_2} and product P_{y_1y_2} of two population means. The interested reader may refer
to Singh (1982a).

3.11 MEDIAN ESTIMATION IN SURVEY SAMPLING

The median is often regarded as a more appropriate measure of location than the
mean when variables with a highly skewed distribution, such as income, are
studied. As we have seen in the previous sections of this chapter, there is extensive
literature available on the estimation of means and totals in sample surveys.
Relatively few efforts have been made to develop an efficient estimator of the
median. Gross (1980), Sedransk and Meyer (1978), and Smith and Sedransk (1983)
have considered the problem of estimation of the median using simple random
sampling. Kuk and Mak (1989) were the first researchers to attempt the estimation of
the median using auxiliary information. Francisco and Fuller (1991) have also
considered the problem of estimation of the median as a part of estimation of the finite
population distribution function. In this chapter we shall restrict ourselves to the
discussion of the ratio type estimator developed by Kuk and Mak (1989). Let Y_i
and X_i, i = 1, 2, ..., N, be the values of the population units for the study variable Y
and auxiliary variable X, respectively. Furthermore, let y_i and x_i, i = 1, 2, ..., n, be
the values of the units included in an SRSWOR sample of size n. Assuming the
median M_x of the variable X is known, we have the following theorem:

Theorem 3.11.1. The ratio type estimator of the median $M_y$ of the study variable is given by
$$\hat{M}_R = \hat{M}_y\left(\frac{M_x}{\hat{M}_x}\right) \qquad (3.11.1)$$
where $\hat{M}_y$ and $\hat{M}_x$ are the sample median estimators.

Proof. Follows from the usual ratio method of estimation.
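A minimal sketch of (3.11.1), using Python's `statistics.median` for the sample medians (the function name and the small data set are ours, for illustration only):

```python
import statistics

def ratio_median_estimate(y_sample, x_sample, Mx):
    """Kuk and Mak (1989) ratio-type estimate: M_y_hat * (M_x / M_x_hat)."""
    My_hat = statistics.median(y_sample)   # sample median of the study variable
    Mx_hat = statistics.median(x_sample)   # sample median of the auxiliary variable
    return My_hat * (Mx / Mx_hat)

# Hypothetical SRSWOR sample with known auxiliary median Mx = 19
est = ratio_median_estimate([9, 11, 13], [14, 18, 19], Mx=19)   # 11 * 19 / 18
```

The estimator simply rescales the sample median of $y$ by how far the sample median of $x$ drifted from its known population value.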

Suppose $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ are the $y$ values of the sample units in ascending order. Furthermore, let $p = t/n$ be the proportion of $y$ values in the sample that are less than or equal to the median value $M_y$, which is an unknown parameter to be estimated from the sample observations, and so is the case with $p$. If $\hat{p}$ is an estimator of $p$, the sample median $\hat{M}_y$ in terms of quantiles can be written as $\hat{Q}_y(\hat{p})$, where $\hat{p} = 0.5$. Kuk and Mak (1989) defined a matrix of proportions $[p_{ij}]$ as:
             x <= M_x    x > M_x     Total
y <= M_y      p_11        p_12        p_1.
y >  M_y      p_21        p_22        p_2.
Total         p_.1        p_.2        1
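The cell proportions $p_{ij}$ can be estimated from a paired sample by simple counting; a small sketch (the function name is ours):

```python
def proportion_matrix(y, x, My, Mx):
    """2x2 matrix [p_ij]: rows split at y <= My / y > My,
    columns split at x <= Mx / x > Mx; entries sum to 1."""
    n = len(y)
    p = [[0.0, 0.0], [0.0, 0.0]]
    for yi, xi in zip(y, x):
        row = 0 if yi <= My else 1
        col = 0 if xi <= Mx else 1
        p[row][col] += 1.0 / n
    return p   # p[0][0] estimates the p11 cell used in the covariance below
```

The $p_{11}$ cell is the one that drives the covariance between the two sample medians.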

Thus we have the following theorems.

Theorem 3.11.2. The variances of the estimators $\hat{M}_y$ and $\hat{M}_x$ are, respectively, given by
$$V(\hat{M}_y) = \left(\frac{1-f}{n}\right)\frac{1}{4}\{f_y(M_y)\}^{-2} \qquad (3.11.2)$$
and
$$V(\hat{M}_x) = \left(\frac{1-f}{n}\right)\frac{1}{4}\{f_x(M_x)\}^{-2}, \qquad (3.11.3)$$
where $f_y$ and $f_x$ are the density functions of $y$ and $x$ respectively. The covariance between $\hat{M}_y$ and $\hat{M}_x$ is given by
$$\mathrm{Cov}(\hat{M}_y, \hat{M}_x) = \left(\frac{1-f}{n}\right)(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}. \qquad (3.11.4)$$

Proof. Let $F_y$ be the cumulative distribution function of $y$. Then under some assumptions given by Kuk and Mak (1989) and Francisco and Fuller (1991) the Taylor series expansion gives
$$F_y(\hat{M}_y) = F_y[M_y + (\hat{M}_y - M_y)] = F_y(M_y) + (\hat{M}_y - M_y)f_y(M_y) + o_p(n^{-1/2}).$$
So we have
$$\hat{M}_y - M_y = \{f_y(M_y)\}^{-1}[F_y(\hat{M}_y) - F_y(M_y)] + o_p(n^{-1/2}).$$
Furthermore it may be shown that
$$F_y(\hat{M}_y) - F_y(M_y) = \hat{F}_y(\hat{M}_y) - \hat{F}_y(M_y) + o_p(n^{-1/2}),$$
where $\hat{F}_y$ is an estimator of $F_y$. So we have
$$\hat{M}_y - M_y = \{f_y(M_y)\}^{-1}[\hat{F}_y(\hat{M}_y) - \hat{F}_y(M_y)] + o_p(n^{-1/2}) = \{f_y(M_y)\}^{-1}[0.5 - \hat{p}_y] + o_p(n^{-1/2}),$$
with $\hat{p}_y = \hat{F}_y(M_y)$. Similarly
$$\hat{M}_x - M_x = \{f_x(M_x)\}^{-1}[0.5 - \hat{p}_x] + o_p(n^{-1/2}), \quad \text{with } \hat{p}_x = \hat{F}_x(M_x).$$
Then we have
$$V(\hat{M}_y) = V(\hat{M}_y - M_y) = \{f_y(M_y)\}^{-2}V(\hat{p}_y) = \left(\frac{1-f}{4n}\right)\{f_y(M_y)\}^{-2},$$
$$V(\hat{M}_x) = V(\hat{M}_x - M_x) = \{f_x(M_x)\}^{-2}V(\hat{p}_x) = \left(\frac{1-f}{4n}\right)\{f_x(M_x)\}^{-2},$$
and
$$\mathrm{Cov}(\hat{M}_x, \hat{M}_y) = \mathrm{Cov}[\hat{M}_x - M_x,\ \hat{M}_y - M_y] = \{f_x(M_x)f_y(M_y)\}^{-1}\mathrm{Cov}(\hat{p}_x, \hat{p}_y)$$
$$= \{f_x(M_x)f_y(M_y)\}^{-1}[E(\hat{p}_x\hat{p}_y) - E(\hat{p}_x)E(\hat{p}_y)] = \left(\frac{1-f}{n}\right)(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}.$$
Hence the theorem.

Now we have the following theorems:

Theorem 3.11.3. The variance of the ratio estimator $\hat{M}_R$ of the population median $M_y$ is given by
$$V(\hat{M}_R) = \left(\frac{1-f}{n}\right)\left[\frac{1}{4}\{f_y(M_y)\}^{-2} + \frac{1}{4}\left(\frac{M_y}{M_x}\right)^2\{f_x(M_x)\}^{-2} - 2\left(\frac{M_y}{M_x}\right)(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}\right]. \qquad (3.11.5)$$

Proof. By the ratio method of estimation we have
$$V(\hat{M}_R) = V(\hat{M}_y) + \left(\frac{E(\hat{M}_y)}{E(\hat{M}_x)}\right)^2 V(\hat{M}_x) - 2\left(\frac{E(\hat{M}_y)}{E(\hat{M}_x)}\right)\mathrm{Cov}(\hat{M}_y, \hat{M}_x). \qquad (3.11.6)$$
On substituting the values of $V(\hat{M}_y)$, $V(\hat{M}_x)$ and $\mathrm{Cov}(\hat{M}_y, \hat{M}_x)$ from Theorem 3.11.2 in (3.11.6), we have (3.11.5). Hence the theorem.

Theorem 3.11.4. The ratio estimator $\hat{M}_R$ of the population median is more efficient than the sample median estimator $\hat{M}_y$ if
$$\rho_c > \frac{M_x^{-1}\{f_x(M_x)\}^{-1}}{2M_y^{-1}\{f_y(M_y)\}^{-1}}, \qquad (3.11.7)$$
where $\rho_c = 4(p_{11} - 0.25)$ goes from $-1$ to $+1$ as $p_{11}$ increases from $0.0$ to $+0.5$. This condition is analogous to the condition under which the ratio estimator of the population mean remains superior to the sample mean.

Proof. By setting $V(\hat{M}_R) < V(\hat{M}_y)$ we have
$$\frac{1}{4}\left(\frac{M_y}{M_x}\right)^2\{f_x(M_x)\}^{-2} - 2\left(\frac{M_y}{M_x}\right)(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1} < 0,$$
or
$$\frac{1}{4}\left(\frac{M_y}{M_x}\right)\{f_x(M_x)\}^{-1} < 2(p_{11} - 0.25)\{f_y(M_y)\}^{-1},$$
or
$$4(p_{11} - 0.25) > \frac{1}{2}\left(\frac{M_y}{M_x}\right)\frac{f_y(M_y)}{f_x(M_x)},$$
or
$$\rho_c > \frac{M_x^{-1}\{f_x(M_x)\}^{-1}}{2M_y^{-1}\{f_y(M_y)\}^{-1}}.$$
Hence the theorem.


Example 3.11.1. The amounts of the real and nonreal estate farm loans (in $000)
during 1997 in the US have been presented in population I. Suppose we selected an
SRSWOR sample of eight states to collect the required information. Find the
relative efficiency of the ratio estimator of median for estimating median of the
amount of real estate farm loans during 1997 by using information on nonreal estate
farm loans during 1997 with respect to the usual estimator of the population median.
Assume that both real and nonreal estate farm loans follow independent normal
distributions.
Solution. From the description of population I given in the Appendix we have
$y_i$ = amount (in $000) of the real estate farm loans in different states during 1997,
$x_i$ = amount (in $000) of the nonreal estate farm loans in different states during 1997,
$N = 50$, $M_y = 322.305$, $M_x = 452.517$, $\mu_y = 555.434$, $\mu_x = 878.162$, $\sigma_x = 1073.776$, $\sigma_y = 578.948$, and $p_{11} = 0.42$.
Note that we are given
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\,e^{-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2} \quad \text{and} \quad f(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\,e^{-\frac{1}{2}\left(\frac{y-\mu_y}{\sigma_y}\right)^2},$$
which implies that
$$f_x(M_x) = \frac{1}{\sqrt{2\pi}\times 1073.776}\,e^{-\frac{1}{2}\left(\frac{452.517 - 878.162}{1073.776}\right)^2} = 3.4345\times 10^{-4}$$
and
$$f_y(M_y) = \frac{1}{\sqrt{2\pi}\times 578.948}\,e^{-\frac{1}{2}\left(\frac{322.305 - 555.434}{578.948}\right)^2} = 6.3541\times 10^{-4}.$$

Therefore we have
$$V(\hat{M}_y) = \left(\frac{1-f}{n}\right)\frac{1}{4}\{f_y(M_y)\}^{-2} = \left(\frac{1-0.16}{8}\right)\frac{1}{4}\left\{6.3541\times 10^{-4}\right\}^{-2} = 65016.15$$
and
$$V(\hat{M}_R) = \left(\frac{1-f}{n}\right)\left[\frac{1}{4}\{f_y(M_y)\}^{-2} + \frac{1}{4}\left(\frac{M_y}{M_x}\right)^2\{f_x(M_x)\}^{-2} - 2\left(\frac{M_y}{M_x}\right)(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}\right]$$
$$= \left(\frac{1-0.16}{8}\right)\left[\frac{1}{4}\left\{6.3541\times 10^{-4}\right\}^{-2} + \frac{1}{4}\left(\frac{322.305}{452.517}\right)^2\left\{3.4345\times 10^{-4}\right\}^{-2} - 2\left(\frac{322.305}{452.517}\right)(0.42 - 0.25)\left\{6.3541\times 10^{-4}\times 3.4345\times 10^{-4}\right\}^{-1}\right]$$
$$= 61393.76.$$
Thus the percent relative efficiency (RE) of the ratio estimator $\hat{M}_R$ with respect to the usual estimator $\hat{M}_y$ is given by
$$RE = \frac{V(\hat{M}_y)}{MSE(\hat{M}_R)}\times 100 = \frac{65016.15}{61393.76}\times 100 = 105.90\%,$$
which shows that the ratio estimator is more efficient than the usual estimator of the population median.
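The arithmetic of this example can be verified in a few lines of Python; everything below (function and variable names) is ours, with the population figures taken from the solution above. The last two lines also check the efficiency condition (3.11.7) of Theorem 3.11.4 for these data:

```python
import math

def normal_pdf(t, mu, sigma):
    # N(mu, sigma^2) density; the example assumes normal margins for x and y
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

N, n = 50, 8
f = n / N                                    # sampling fraction 0.16
My, Mx = 322.305, 452.517
fy = normal_pdf(My, 555.434, 578.948)        # ~ 6.3541e-4
fx = normal_pdf(Mx, 878.162, 1073.776)       # ~ 3.4345e-4
p11, c = 0.42, (1 - f) / n

V_My = c * 0.25 / fy ** 2                    # ~ 65016, equation (3.11.2)
R = My / Mx
V_MR = c * (0.25 / fy ** 2 + 0.25 * R ** 2 / fx ** 2
            - 2 * R * (p11 - 0.25) / (fx * fy))   # ~ 61394, equation (3.11.5)
RE = 100 * V_My / V_MR                       # ~ 105.9 percent

# Efficiency condition (3.11.7): rho_c must exceed the threshold below
rho_c = 4 * (p11 - 0.25)                     # 0.68
threshold = (1 / (Mx * fx)) / (2 / (My * fy))
```

For these data the threshold is about 0.66, below $\rho_c = 0.68$, which is consistent with the relative efficiency exceeding 100%.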

Remark 3.11.1. In the above example, for simplicity, we have considered univariate normal distributions for $X$ and $Y$ separately, but a more interesting example may be constructed by assuming a joint bivariate normal distribution of $X$ and $Y$.

Theorem 3.11.5. The minimum variance of the regression type estimator of the median defined as
$$\hat{M}_{lr} = \hat{M}_y + r(M_x - \hat{M}_x) \qquad (3.11.8)$$
is given by
$$\mathrm{Min.}V(\hat{M}_{lr}) = \left(\frac{1-f}{n}\right)2p_{11}(1 - 2p_{11})\{f_y(M_y)\}^{-2}. \qquad (3.11.9)$$

Proof. We have
$$V(\hat{M}_{lr}) = V(\hat{M}_y) + r^2 V(\hat{M}_x) - 2r\,\mathrm{Cov}(\hat{M}_y, \hat{M}_x)$$
$$= \left(\frac{1-f}{n}\right)\left[\frac{1}{4}\{f_y(M_y)\}^{-2} + \frac{r^2}{4}\{f_x(M_x)\}^{-2} - 2r(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}\right]. \qquad (3.11.10)$$
On differentiating (3.11.10) with respect to $r$ and equating to zero we obtain
$$r = \frac{4(p_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}}{\{f_x(M_x)\}^{-2}} = 4(p_{11} - 0.25)\frac{f_x(M_x)}{f_y(M_y)}. \qquad (3.11.11)$$
On substituting the optimum value of $r$ in (3.11.10) we obtain
$$\mathrm{Min.}V(\hat{M}_{lr}) = \left(\frac{1-f}{n}\right)\left[\frac{1}{4}\{f_y(M_y)\}^{-2} - 4(p_{11} - 0.25)^2\{f_y(M_y)\}^{-2}\right]$$
$$= \left(\frac{1-f}{n}\right)\frac{1}{4}\{f_y(M_y)\}^{-2}\left[1 - 16(p_{11} - 0.25)^2\right] = \left(\frac{1-f}{n}\right)2p_{11}(1 - 2p_{11})\{f_y(M_y)\}^{-2}.$$
Hence the theorem.
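Both the optimisation step and the closing algebraic identity $1 - 16(p_{11} - 0.25)^2 = 8p_{11}(1 - 2p_{11})$ can be checked numerically; the sketch below borrows the density values from Example 3.11.1 above (all names are ours):

```python
# Check the identity used in the last line of the proof over a grid of p11 values
for k in range(51):
    p = k / 100.0                            # p11 ranges over [0, 0.5]
    assert abs((1 - 16 * (p - 0.25) ** 2) - 8 * p * (1 - 2 * p)) < 1e-12

# Check that r from (3.11.11) minimises the variance expression (3.11.10)
fy, fx = 6.3541e-4, 3.4345e-4                # densities from Example 3.11.1
p11, f, n = 0.42, 0.16, 8
c = (1 - f) / n

def V_lr(r):
    # variance (3.11.10) of the regression-type median estimator, as a function of r
    return c * (0.25 / fy ** 2 + 0.25 * r ** 2 / fx ** 2
                - 2 * r * (p11 - 0.25) / (fx * fy))

r_opt = 4 * (p11 - 0.25) * fx / fy           # optimum coefficient (3.11.11)
closed_form = c * 2 * p11 * (1 - 2 * p11) / fy ** 2   # closed form (3.11.9)
```

Note that $V(\hat{M}_{lr})$ at $r = 0$ reduces to $V(\hat{M}_y)$, so the check that $V_{lr}(r_{opt}) < V_{lr}(0)$ confirms the regression-type estimator beats the sample median for these figures.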

Example 3.11.2. Consider the following population consisting of five ($N = 5$) units A, B, C, D and E, where on each unit in the population two variables $Y$ and $X$ are measured.

Unit:  A    B    C    D    E
$y$:   9    11   13   16   21
$x$:   14   18   19   20   24

( a ) Find the population medians $M_y$ and $M_x$ of the study variable and auxiliary variable, respectively.
( b ) Select all possible samples of three units ($n = 3$) with SRSWOR sampling.
( c ) Find the estimates of the medians $M_y$ and $M_x$ from each sample.
( d ) Find the exact bias in the estimator $\hat{M}_y$ using the definition.
( e ) Find the exact mean square error of the estimator $\hat{M}_y$ using the definition.
( f ) Assuming that the median $M_x$ of the auxiliary variable is known, find the ratio estimate of the median $\hat{M}_R = \hat{M}_y(M_x/\hat{M}_x)$ from each sample.
( g ) Find the exact bias in the ratio estimator $\hat{M}_R$ using the definition.
( h ) Find the exact mean square error of the ratio estimator $\hat{M}_R$ using the definition.
( i ) Find the relative efficiency of the ratio estimator $\hat{M}_R$ with respect to $\hat{M}_y$.
Solution. ( a ) The population medians of the $Y$ and $X$ variables are given by
$$M_y = 13 \quad \text{and} \quad M_x = 19.$$
( b ) and ( c ) All 10 possible samples, the estimates of the medians $\hat{M}_y|s$ and $\hat{M}_x|s$ from each sample, and the related results are given in the following table:
Sample   y values      x values      My|s   Mx|s   ps    MR|s     (My|s-My)^2   (MR|s-My)^2
  1      9, 11, 13     14, 18, 19     11     18    0.1   11.61         4             1.93
  2      9, 11, 16     14, 18, 20     11     18    0.1   11.61         4             1.93
  3      9, 11, 21     14, 18, 24     11     18    0.1   11.61         4             1.93
  4      9, 13, 16     14, 19, 20     13     19    0.1   13.00         0             0.00
  5      9, 13, 21     14, 19, 24     13     19    0.1   13.00         0             0.00
  6      9, 16, 21     14, 20, 24     16     20    0.1   15.20         9             4.84
  7     11, 13, 16     18, 19, 20     13     19    0.1   13.00         0             0.00
  8     11, 13, 21     18, 19, 24     13     19    0.1   13.00         0             0.00
  9     11, 16, 21     18, 20, 24     16     20    0.1   15.20         9             4.84
 10     13, 16, 21     19, 20, 24     16     20    0.1   15.20         9             4.84
Sum                                  133    190         132.43        39            20.31

( d ) The bias in the estimator $\hat{M}_y$ is given by
$$B(\hat{M}_y) = \sum_{s=1}^{10} p_s\,\hat{M}_y|s - M_y = 13.3 - 13 = 0.3.$$
( e ) The mean square error of the estimator $\hat{M}_y$ is given by
$$MSE(\hat{M}_y) = \sum_{s=1}^{10} p_s\left\{\hat{M}_y|s - M_y\right\}^2 = 3.9.$$
( f ) The ratio estimates $\hat{M}_R|s$, $s = 1, 2, \ldots, 10$, are given in the above table.
( g ) The bias in the ratio estimator $\hat{M}_R$ is given by
$$B(\hat{M}_R) = \sum_{s=1}^{10} p_s\,\hat{M}_R|s - M_y = 13.243 - 13 = 0.243.$$
( h ) The mean square error of the ratio estimator $\hat{M}_R$ is given by
$$MSE(\hat{M}_R) = \sum_{s=1}^{10} p_s\left\{\hat{M}_R|s - M_y\right\}^2 = 2.031.$$
( i ) The relative efficiency of the ratio estimator $\hat{M}_R$ with respect to the usual estimator $\hat{M}_y$ is given by
$$RE = \frac{MSE(\hat{M}_y)}{MSE(\hat{M}_R)}\times 100 = \frac{3.9}{2.031}\times 100 = 192.05\%.$$
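The whole example can be reproduced by brute-force enumeration of the $\binom{5}{3} = 10$ SRSWOR samples with the standard library (a sketch; all names are ours):

```python
import itertools
import statistics

y = [9, 11, 13, 16, 21]
x = [14, 18, 19, 20, 24]
My, Mx = 13, 19                               # population medians from part (a)

samples = list(itertools.combinations(range(5), 3))   # the 10 possible samples
p_s = 1.0 / len(samples)                      # each sample has probability 0.1

My_s, MR_s = [], []
for s in samples:
    my = statistics.median([y[i] for i in s])
    mx = statistics.median([x[i] for i in s])
    My_s.append(my)
    MR_s.append(my * Mx / mx)                 # ratio estimate for this sample

bias_My = sum(p_s * m for m in My_s) - My             # 0.3,    part (d)
mse_My = sum(p_s * (m - My) ** 2 for m in My_s)       # 3.9,    part (e)
mse_MR = sum(p_s * (m - My) ** 2 for m in MR_s)       # ~2.031, part (h)
RE = 100 * mse_My / mse_MR                            # ~192 percent, part (i)
```

Exact enumeration of this kind is a convenient sanity check whenever the design is small enough to list every sample.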

Remark 3.11.2. It is not clear if the estimator $\hat{M}_{lr}$ of the median can work as efficiently as the usual linear regression estimator of the population mean, $\bar{y}_{LR}$. Kuk and Mak (1989) studied two more estimators under the names of 'position estimator' and 'stratification estimator'; these were found to be as efficient as $\hat{M}_{lr}$ from the variance point of view. Graf (2002) also pointed out that the estimators of the median developed by Kuk and Mak (1989, 1994) and Ren (2000) deserve practical investigation. A bootstrap method for smoothed estimators of the median has been discussed by Brown, Hall, and Young (2001). Nelson and Meeden (1998) used prior information about the population quartiles of the auxiliary variable to improve the estimator of the median.

EXERCISES

Exercise 3.1. Under SRSWOR sampling, find the first order approximations of the bias and mean squared error of each of the following estimators:
( a ) $\bar{y}_1 = \alpha\bar{y} + (1-\alpha)\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$;  ( b ) $\bar{y}_2 = \alpha\bar{y} + (1-\alpha)\bar{y}\left(\frac{\bar{x}}{\bar{X}}\right)$;
( c ) $\bar{y}_3 = \bar{y}\left[\frac{\bar{X}}{\alpha\bar{x} + (1-\alpha)\bar{X}}\right]$;  ( d ) $\bar{y}_4 = \bar{y}\left[\frac{(1+\alpha)\bar{X} + (1-\alpha)\bar{x}}{(1-\alpha)\bar{X} + (1+\alpha)\bar{x}}\right]$;
( e ) $\bar{y}_5 = (1-\alpha)\bar{y} + \alpha\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^r$;
where $\alpha$ and $r$ are suitably chosen constants such that $MSE(\bar{y}_t)$, $t = 1,2,3,4,5$, is minimum. Show that $\mathrm{Min.}MSE(\bar{y}_t)$, $t = 1,2,3$, to the first order of approximation, is the same as that of the usual linear regression estimator.
Hint: ( a ) and ( b ) Chakrabarty (1968), Vos (1980), Adhvaryu and Gupta (1983); ( c ) Walsh (1970); ( d ) Sahai and Sahai (1985); ( e ) Sisodia and Dwivedi (1981).

Exercise 3.2. Compare the following estimators with the usual ratio estimator under the SRSWOR design:
( a ) $\bar{y}_1 = \bar{y}\left(\frac{\bar{X} + C_x}{\bar{x} + C_x}\right)$, where $C_x = S_x/\bar{X}$ denotes the known coefficient of variation of the auxiliary variable $X$;
( b ) $\bar{y}_2 = \bar{y}\left(\frac{N\bar{X} - n\bar{x}}{(N-n)\bar{X}}\right)$ or $\bar{y}_2 = \bar{y}\,\frac{\bar{x}^*}{\bar{X}}$, where $\bar{x}^* = \frac{1}{N-n}\sum_{i=1}^{N-n} x_i$ denotes the mean of the non-sampled units of the $x$ variable;
( c ) $\bar{y}_3 = \bar{y}^*\left(\frac{\bar{X}}{\bar{x}}\right)$, where $\bar{y}^* = \lambda\bar{y}$ denotes the Searls' estimator;
( d ) $\bar{y}_4 = \frac{\bar{y}\bar{x} + s_{xy}/n}{\bar{x}^2 + s_x^2/n}\,\bar{X}$, where $s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ and $s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$;
( e ) $\bar{y}_5 = \bar{y}\left[1 + \left(\frac{1}{n} - \frac{1}{N}\right)\left(\frac{s_{xy}}{\bar{y}\bar{x}} - \frac{s_x^2}{\bar{x}^2}\right)\right]$;
and

(f)
Hint: ( a ) Sisodia and Dwivedi (1981); (b) Srivenkataramana (1980),
Srivenkataramana and Tracy (1980, 1981); (c) Prasad (1989); ( d ) Srivastava,
Dwivedi, Chaubey, and Bhatnagar (1983); ( e ) and ( f) Swain and Sahoo (1982).

Exercise 3.3. Find the minimum MSE of the estimator of the population mean given by
$$\bar{y}_1 = \bar{y}\left(\frac{s_x^2}{S_x^2}\right)^{\beta},$$
study the behaviour of the resultant MSE under a bivariate normal distribution, and discuss your views.

Exercise 3.4. Show that the minimum MSE of the estimator of the population mean $\bar{Y}$ defined as
$$\bar{y}_{s1} = \alpha\bar{y} + \beta(\bar{X} - \bar{x})$$
is
$$\mathrm{Min.}MSE(\bar{y}_{s1}) = \frac{\left(\frac{1-f}{n}\right)S_y^2(1 - \rho_{xy}^2)}{1 + \left(\frac{1-f}{n}\right)C_y^2(1 - \rho_{xy}^2)}$$
for the optimum values of $\alpha$ and $\beta$.

Exercise 3.5. Suppose a class of ratio type estimators of the population mean $\bar{Y}$ is defined as
$$\bar{y}_c = \bar{y}\,f(u, v, w),$$
where $u = \bar{x}/\bar{X}$, $v = s_x^2/S_x^2$, and $w = r_{xy}/\rho_{xy}$, with $r_{xy} = s_{xy}/(s_x s_y)$ an estimator of the population correlation coefficient $\rho_{xy}$.
( a ) Find the minimum mean squared error of $\bar{y}_c$ by defining certain regularity conditions.
( b ) Also show that the minimum mean squared error of the ratio type class of estimators $\bar{y}_c$ is equivalent to that of the wider class of estimators defined as $\bar{y}_w = f(\bar{y}, u, v, w)$.
Hint: Srivastava and Jhajj (1983b, 1995).

Exercise 3.6. Consider $\hat{y}_i = \bar{y}(\bar{x}/\bar{X})^i$, such that $\hat{y}_i \in H$, $i = 1,2,3$, where $H$ denotes the set of all possible product type estimators of the population mean. Construct the following terms:
( a ) Linear variety of estimators;
( b ) Funnel to filter the bias precipitates;
( c ) Filter paper to filter the bias precipitates;
( d ) Amount of chemicals to reduce the bias of first order of approximation.
Hint: Singh and Singh (1991, 1993a, 1993b, 1993c).

Exercise 3.7. Study the bias and MSE of the predictive product estimators given by
( a ) $\bar{y}_1 = \bar{y}\left(\frac{\tilde{x}}{\tilde{X}}\right)$, and ( b ) $\bar{y}_2 = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i x_i}{\bar{X}}$,
where $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i$ is the usual sample mean, and $\tilde{x} = n\left(\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}$ and $\tilde{X} = N\left(\sum_{i=1}^{N}\frac{1}{x_i}\right)^{-1}$ are the sample and population harmonic means respectively, under the super-population model defined as $y_i = \beta/x_i + e_i$, where $E(e_i) = 0$ and $E(e_i^2) = \lambda x_i^{-1}$.
Hint: Agarwal and Jain (1989), Agrawal and Sthapit (1996).
Exercise 3.8. Study the asymptotic bias and MSE of the following almost unbiased product type estimators of the population mean:
( a ) $\bar{y}_1 = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \frac{(1-f)}{n}\frac{s_{xy}}{\bar{X}}$;
( b ) $\bar{y}_2 = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right)\left[1 + \frac{(1-f)}{n}\frac{s_{xy}}{\bar{x}\bar{y}}\right]^{-1}$;
( c ) $\bar{y}_3 = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \frac{(1-f)}{n}\frac{s_{xy}}{\bar{x}}$; and ( d )
where $a$ is the characterising scalar. Also show that $\bar{y}_1$ and $\bar{y}_2$ are special cases of $\bar{y}_4$ for certain choices of $a$, and that the estimator $\bar{y}_4$ remains better, in the sense of smaller MSE, than the other estimators $\bar{y}_t$, $t = 1,2,3$.
Hint: ( a ) Robson (1957); ( b ) Singh (1989); ( c ) Dubey (1993); ( d ) Srivastava and Bhatnagar (1981), Bhatnagar (1996).

Exercise 3.9. Let $X_m$ and $X_M$ denote the minimum and maximum values of a known positive variate $X$ respectively. Using these values, let us transform the auxiliary variable $X$ to create two new variables $Z$ and $U$ such that
$$Z_i = \frac{X_i + X_m}{X_M + X_m} \quad \text{and} \quad U_i = \frac{X_i + X_M}{X_M + X_m} \quad \text{for } i = 1, 2, \ldots, N.$$
The same transformations are applied to the $x_i$ values in the sample as
$$z_i = \frac{x_i + X_m}{X_M + X_m} \quad \text{and} \quad u_i = \frac{x_i + X_M}{X_M + X_m} \quad \text{for } i = 1, 2, \ldots, n.$$
Find the conditions under which the estimators
( a ) $\bar{y}_1 = \bar{y}\left(\frac{\bar{Z}}{\bar{z}}\right)$ and ( b ) $\bar{y}_2 = \bar{y}\left(\frac{\bar{U}}{\bar{u}}\right)$,
where $\bar{z} = n^{-1}\sum_{i=1}^{n} z_i$, $\bar{u} = n^{-1}\sum_{i=1}^{n} u_i$, $\bar{Z} = N^{-1}\sum_{i=1}^{N} Z_i$ and $\bar{U} = N^{-1}\sum_{i=1}^{N} U_i$, of the population mean $\bar{Y}$ are better than the usual ratio estimator.
Hint: Mohanty and Sahoo (1995), Sahoo, Sahoo, and Mohanty (1995a), Sahoo (1986).

Exercise 3.10. Consider the transformation of the auxiliary information on the sample observations given by $x_i^* = (N\bar{X} - nx_i)/(N - n)$.
( a ) Adjust the bias in the estimator of the population mean given by
$$\bar{y}_1 = \bar{r}^*\bar{X}, \quad \text{where } \bar{r}^* = n^{-1}\sum_{i=1}^{n} r_i^* \text{ for } r_i^* = y_i/x_i^*.$$
( b ) Show that an unbiased estimator of the population mean is given by
$$\bar{y}_2 = \bar{r}^*\bar{X} + \frac{n(N-1)}{N(n-1)}\left(\bar{y} - \bar{r}^*\bar{x}^*\right).$$
Hint: Singh and Singh (1993).

Exercise 3.11. Find the asymptotic bias and mean squared error expressions for three estimators of the parameter $K = \rho_{xy}\frac{C_y}{C_x} = \frac{\bar{X}}{\bar{Y}}\frac{S_{xy}}{S_x^2}$, given by
( a ) $\hat{k}_1 = \frac{\bar{X}\,s_{xy}}{\bar{y}\,s_x^2}$;  ( b ) $\hat{k}_2 = \frac{\bar{X}\,s_{xy}}{\bar{y}\,S_x^2}$;  and ( c ) $\hat{k}_3 = \frac{\bar{X}\,s_{xy}}{\bar{y}\,S_y^2}$.
Also study the properties of the six general classes of estimators defined as
$$\hat{k}_{aJ} = \hat{k}_J H(u) \quad \text{and} \quad \hat{k}_{bJ} = \hat{k}_J H(u, v) \quad \text{for } J = 1, 2, 3,$$
where $u = \bar{x}/\bar{X}$ and $v = s_x^2/S_x^2$. Construct the wider classes of estimators to estimate the parameter $K$ and study their properties. Comment on the results.
Hint: Singh and Singh (1988), Srivastava, Jhajj, and Sharma (1986), Reddy (1978a).

Exercise 3.12. Study the properties of the almost unbiased product type estimators of the population mean $\bar{Y}$ given by
( a ) $\bar{y}_1 = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \frac{(N-n)}{nN}\frac{s_{xy}}{\bar{X}}$;
( b ) $\bar{y}_2 = \frac{n(N-1)}{N(n-1)}\bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \frac{(N-n)}{N(n-1)}\left(\frac{\bar{p}}{\bar{X}}\right)$;
( c ) $\bar{y}_3 = \frac{n}{n-1}\left[\bar{y}\left(\frac{\bar{x}}{\bar{X}}\right) - \frac{\bar{p}}{n\bar{X}}\right]$; and ( d )
where $s_{xy} = (n-1)^{-1}\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$ and $\bar{p} = n^{-1}\sum_{i=1}^{n} y_i x_i$.
Hint: Shah and Shah (1979); Murthy (1964); Pandey and Dubey (1989).

Exercise 3.13. Suppose a ratio type estimator of the finite population variance $S_y^2$ is given by
$$s_1^2 = s_y^{*2}\left(\frac{S_x^2}{s_x^2}\right),$$
where $s_y^{*2} = \lambda s_y^2$ denotes the Searls (1964) type estimator. Find the minimum mean squared error of the estimator $s_1^2$ and find the condition under which it is more efficient than the ratio type estimator of variance proposed by Isaki (1983), recovered for $\lambda = 1$.
Hint: Prasad and Singh (1990).

Exercise 3.14. ( a ) Show that the power type estimator of the variance $S_y^2$ given by
$$s_p^2 = s_y^2\left(\frac{S_x^2}{s_x^2}\right)^{\alpha}$$
is always more efficient than the ratio type estimator $s_r^2 = s_y^2\left(\frac{S_x^2}{s_x^2}\right)$ for the optimum choice of the real constant $\alpha$.
( b ) In the case of multi-auxiliary information (say, $k$ variables), study the asymptotic properties of the estimator defined as
$$s_m^2 = s_y^2\prod_{j=1}^{k}\left(\frac{S_{x_j}^2}{s_{x_j}^2}\right)^{\alpha_j},$$
where $\alpha_j$, $j = 1, 2, \ldots, k$, are real constants to be chosen.
Hint: Garcia and Cebrian (1996); Cebrian and Garcia (1997).

Exercise 3.15. Show that an improved and admissible estimator of the population correlation coefficient $\rho_{xy}$ is given by
$$r_s = \frac{s_{xy}^*}{s_x s_y},$$
where $s_{xy}^* = \lambda s_{xy}$ denotes the Searls (1964) type estimator of $S_{xy}$.
Hint: Singh, Mangat, and Gupta (1996).

Exercise 3.16. Consider
$$\hat{\theta}_1 = \frac{1}{\bar{y}}, \quad \hat{\theta}_2 = \frac{1}{\bar{y}}\left(\frac{\bar{x}}{\bar{X}}\right), \quad \text{and} \quad \hat{\theta}_3 = \frac{1}{\bar{y}}\left(\frac{c_y}{C_y}\right),$$
where $c_y = s_y/\bar{y}$ and $C_y = S_y/\bar{Y}$, such that $\hat{\theta}_i \in H$, $i = 1, 2, 3$, where $H$ denotes the set of all possible estimators for estimating the inverse of the population mean, $\theta = 1/\bar{Y}$.
Define the following terms in statistical language:
( a ) Linear variety of estimators;
( b ) Funnel to filter the bias precipitates;
( c ) Filter paper to filter the bias precipitates;
( d ) Amount of chemicals to reduce the bias of first order of approximation.
Hint: Singh and Gangele (1995, 1997), Singh and Singh (1991, 1993a, 1993b, 1993c).

Exercise 3.17. ( a ) Show that the MSE, to the first order of approximation, of the usual ratio estimator of the population mean can be expressed as
$$MSE(\bar{y}_R) = \frac{(1-f)}{n}\frac{1}{N-1}\sum_{i=1}^{N} E_i^2,$$
where $E_i = (Y_i - \bar{Y}) - R(X_i - \bar{X})$ and $R = \bar{Y}/\bar{X}$ have their usual meanings.
( b ) Study the asymptotic properties of an estimator of $MSE(\bar{y}_R)$ defined as
$$\widehat{MSE}(\bar{y}_R) = \frac{(1-f)}{n}\frac{1}{n-1}\sum_{i=1}^{n} e_i^2\left(\frac{\bar{X}}{\bar{x}}\right)^g,$$
where $g$ is a suitably chosen constant and $e_i = (y_i - \bar{y}) - r(x_i - \bar{x})$ for $r = \bar{y}/\bar{x}$.
Hint: Wu (1982).

Exercise 3.18. Consider $c_y = s_y/\bar{y}$ as an estimator of the population coefficient of variation given by $C_y = S_y/\bar{Y}$.
( a ) Derive the bias and mean squared error of the estimator $c_y$ to the first order of approximation.
( b ) By suggesting an estimator of the bias in the estimator $c_y$, propose an almost unbiased estimator of $C_y$.
( c ) Obtain the mean square error of the almost unbiased estimator in ( b ) to the first order of approximation and compare it with that of the estimator in ( a ).
( d ) Find the bias and MSE of the general class of estimators defined as
$$c_H = c_y H(u),$$
where $u = \bar{x}/\bar{X}$ and $H(\cdot)$ is a parametric function such that $H(1) = 1$, and define the regularity conditions required.
Hint: Expand the ratio $s_y/\bar{y}$ in terms of $\varepsilon_0$ and $\varepsilon_2$ by using the binomial expansion, and use the results from Section 3.1 to proceed further.

Exercise 3.19. Study the second order asymptotic properties of the estimators of the population mean $\bar{Y}$ given by
( a ) $t_1 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^{\alpha}$, and ( b ) $t_2 = \frac{\bar{y}\,\bar{X}}{\alpha\bar{x} + (1-\alpha)\bar{X}}$,
where $\alpha = \frac{\bar{X}\,s_{xy}}{\bar{x}\,s_x^2}$, and compare the results.
Hint: Sahoo and Swain (1987).

Exercise 3.20. ( a ) Show that the minimum mean squared error of the estimator of the population mean
$$\bar{y}_{gen} = \alpha\bar{y} + \beta(\bar{X} - \bar{x}),$$
where $\alpha + \beta = 1$, is given by
$$\mathrm{Min.}MSE(\bar{y}_{gen}) = \left(\frac{1-f}{n}\right)\frac{S_x^2 S_y^2(1 - \rho_{xy}^2)}{S_x^2 + 2\rho_{xy}S_x S_y + S_y^2}.$$
Hint: Jain (1987).
( b ) Find the bias and mean square error of the estimator
$$\bar{y}_{ds} = w_1\bar{y} + w_2\bar{x} + (1 - w_1 - w_2)\bar{X},$$
where $w_1$ and $w_2$ are suitably chosen constants such that the MSE of $\bar{y}_{ds}$ is minimum.
Hint: Dubey and Singh (2001).
Exercise 3.21. Consider an estimator of the population mean $\bar{Y}$ defined as
$$\bar{y}_{sk} = \bar{y}\left[\frac{(A + C)\bar{X} + fB\bar{x}_a}{(A + fB)\bar{X} + C\bar{x}_a}\right],$$
where $\bar{x}_a = a\bar{x} + (1-a)\bar{X}$, $a = n/(N+n)$, $f = n/N$, and $A$, $B$ and $C$ are functions of $k$ given by $A = (k-1)(k-2)$, $B = (k-1)(k-4)$ and $C = (k-2)(k-3)(k-4)$, such that $k \in (0, \infty)$.
( a ) Show that several estimators are special cases of the class of estimators $\bar{y}_{sk}$ for different choices of $A$, $B$ and $C$.
( b ) Find the mean square error of $\bar{y}_{sk}$ over the sample mean estimator for different choices of the parameters involved, and comment.
Hint: Singh and Shukla (1993).

Exercise 3.22. Suppose $Y$ is the variable under study and $X_1, X_2, \ldots, X_p$ are $p$ auxiliary variables correlated with it. A sample of size $n$ is drawn from a finite population of $N$ units with SRSWR. If prior information about the coefficient of variation $C_y$ of $Y$, along with information about the $p$ auxiliary variables, is available, study the properties of the estimator of the population mean $\bar{Y}$ defined as
$$\bar{y}_m = w[\bar{y} + \beta'(\bar{X} - \bar{x})],$$
where $\beta' = [\beta_1, \beta_2, \ldots, \beta_p]$, $\bar{X}' = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p]$, $\bar{x}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$, and $w$ is a suitably chosen positive constant such that the mean square error of the estimator $\bar{y}_m$ is minimum.
Hint: Kothwala and Gupta (1989).

Exercise 3.23. Study the asymptotic properties of the multivariate estimators of the population mean $\bar{Y}$ defined as
( a ) $\bar{y}_1 = \bar{y}\sum_{t=1}^{p} w_t\left(\frac{\bar{X}_t}{\bar{x}_t}\right)$, and ( b ) $\bar{y}_2 = \prod_{t=1}^{p}\left[\bar{y}\left(\frac{\bar{X}_t}{\bar{x}_t}\right)\right]^{w_t}$,
where $\sum_{t=1}^{p} w_t = 1$ and the $w_t$ are real constants, and $\bar{x}_t = n^{-1}\sum_{i=1}^{n} x_{ti}$ denotes the sample mean unbiased estimator of the known population mean $\bar{X}_t = N^{-1}\sum_{i=1}^{N} x_{ti}$ of the $t$-th auxiliary variable, $t = 1, 2, \ldots, p$.
Hint: Singh (1967b), Tuteja and Bahl (1991).

Exercise 3.24. Suppose there are two auxiliary variables $X$ and $Z$ on which information is available, and we wish to estimate the population mean $\bar{Y}$ of the study variable $Y$. Then study the behaviour of the estimators
( a ) $\bar{y}_0 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left(\frac{\bar{z}}{\bar{Z}}\right)$, and ( b ) $\bar{y}_1 = w\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left(\frac{\bar{z}}{\bar{Z}}\right)$,
assuming that $\rho_{yx} > 0$ and $\rho_{yz} < 0$, where $w$ is a constant.
( c ) Obtain an expression for the bias of the estimator of the population mean given by
$$\bar{y}_2 = \frac{\bar{X}}{n}\sum_{i=1}^{n}\left(\frac{y_i}{x_i}\right)\left(\frac{z_i}{\bar{Z}}\right)$$
and show that an estimator after adjusting the bias is given by
$$\bar{y}_3 = \bar{y}_2 + \frac{n(N-1)}{N(n-1)}\left[(\bar{y} - \bar{r}\bar{x}) - \left(\frac{\bar{X}}{\bar{Z}}\right)(\bar{m} - \bar{r}\bar{z})\right], \quad \text{where } \bar{r} = n^{-1}\sum_{i=1}^{n}\frac{y_i}{x_i} \text{ and } \bar{m} = n^{-1}\sum_{i=1}^{n}\frac{y_i z_i}{x_i}.$$
( d ) Show that the estimator $\bar{y}_0$ defined in ( a ) is a special case of another estimator defined as
$$\bar{y}_4 = w_1\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) + w_2\bar{y}\left(\frac{\bar{z}}{\bar{Z}}\right), \quad \text{where } w_1 + w_2 = 1.$$
( e ) Find the bias and variance of the estimator
$$\bar{y}_5 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left(\frac{\bar{u}}{\bar{U}}\right),$$
where $\bar{u} = n^{-1}\sum_{i=1}^{n} u_i$ and $\bar{U} = N^{-1}\sum_{i=1}^{N} u_i$ for $u_i = L - x_i$, and $L$ is a scalar to be chosen so that the mean squared error of $\bar{y}_5$ is minimum.
( f ) Find the bias and variance of the product-cum-difference estimator, defined as
$$\bar{y}_{pd} = \frac{\bar{y}}{\bar{X}}\left[\bar{x} + k(\bar{z} - \bar{Z})\right],$$
for the optimum value of $k$.
( g ) Study the generalized regression ratio estimator
$$\bar{y}_{grr} = \left[\bar{y} + b(\bar{X} - \bar{x})\right]\left(\frac{\bar{Z}}{\bar{z}}\right)^{\alpha}$$
for the optimum values of $b$ and $\alpha$.
( h ) Study the asymptotic properties of estimators of the population mean of the form
$$\bar{y}_w = w_1\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^{\alpha_1} + w_2\bar{y}\left(\frac{\bar{z}}{\bar{Z}}\right)^{\alpha_2},$$
where $w_1 + w_2 = 1$, and $\alpha_1$ and $\alpha_2$ are real constants.
( i ) Discuss the optimum choice of $k$ in the estimator of the population mean
$$\bar{y}_k = \bar{y}\left[\frac{k\bar{X} + \bar{Z}}{k\bar{x} + \bar{z}}\right].$$
Hint: ( a ) and ( b ) Singh (1967a, 1967b, 1969), Tracy and Singh (1998); ( c ) Sahoo and Swain (1980); ( d ) Biradar and Singh (1992b), Tracy, Singh, and Singh (1996); ( e ) Singh and Ray (1981); ( f ) and ( g ) Khare and Srivastava (1981); ( h ) Singh and Singh (1984b); ( i ) Agarwal (1980).

Exercise 3.25. Suppose $\hat{M}_y$ and $\hat{M}_x$ denote the estimators of the population medians $M_y$ and $M_x$ of the study variable and auxiliary variable, respectively. Find the condition under which the product estimator of the population median
$$\hat{M}_P = \hat{M}_y\left(\frac{\hat{M}_x}{M_x}\right)$$
remains more efficient than the sample median estimator $\hat{M}_y$.
Hint: Singh and Joarder (2002).

Exercise 3.26. Use the power transformation type estimator of the population median $M_y$, defined as
$$\hat{M}_{pw} = \hat{M}_y\left(\frac{M_x}{\hat{M}_x}\right)^{\alpha}.$$
( a ) Show that the ratio and product type estimators are special cases of $\hat{M}_{pw}$.
( b ) Find the optimum value of $\alpha$ such that the MSE of $\hat{M}_{pw}$ is minimum.
( c ) Discuss the difficulties in the use of the optimum value of $\alpha$.
( d ) Compare the minimum MSE of $\hat{M}_{pw}$ with that of the usual ratio and product estimators of the median.
Exercise 3.27. In case ofSRSWR sample of n units, the well known Hartley and
Ross (1954) unbi ased ratio type estimator of population mean is given by
Y IIR = rX __ n_(y - r r)
(n - I)
where r = n -I f (Yi / Xi ) and X = N- 1 I Xi , etc., have their usual meanings. Show
i=1 i=1
that an est imator better than YH R based only on the distinct units v in the sample
is:

if v> I,

otherwise,
_ I VM _ Iv
where rv =- .L and Xv =- .LX(i) .
V, =IX(i) V ,=I
Hint: Pathak (1962).
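A minimal sketch of the SRSWR form of the Hartley-Ross estimator displayed above (the function name and data are ours, for illustration only):

```python
def hartley_ross(y, x, X_bar):
    """Hartley-Ross unbiased ratio-type estimator, SRSWR form:
    r_bar * X_bar + n/(n-1) * (y_bar - r_bar * x_bar)."""
    n = len(y)
    r_bar = sum(yi / xi for yi, xi in zip(y, x)) / n   # mean of the unit ratios
    y_bar, x_bar = sum(y) / n, sum(x) / n
    return r_bar * X_bar + n / (n - 1) * (y_bar - r_bar * x_bar)

# When y is exactly proportional to x (here y_i = 2 x_i), the correction
# term vanishes and the estimate equals 2 * X_bar regardless of the sample.
est = hartley_ross([2.0, 6.0, 10.0], [1.0, 3.0, 5.0], X_bar=4.0)   # 8.0
```

The second term is the bias correction that makes the naive ratio-mean estimator $\bar{r}\bar{X}$ exactly unbiased.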

Exercise 3.28. Find choices of $\delta$, $\omega$, $\eta$ and $\alpha$ such that the general class of estimators $\bar{y}_G$ reduces to the following estimators:
( a ) $\bar{y}_{Sr} = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)^{\alpha}$ [ Srivastava (1967) ]
( b ) [ Reddy (1973, 1974), Walsh (1970) ]
( c ) [ Gupta (1978) ]
( d ) [ Ray and Sahai (1979), Chakrabarty (1979) ]
( e ) [ Sahai and Ray (1980) ]
( f ) [ Tripathi (1980) ]
( g ) [ Adhvaryu and Gupta (1983) ]
( h ) [ Adhvaryu and Gupta (1983) ]
( i ) $\bar{y}_{MS1} = \bar{y}\,\frac{\bar{X}}{(1-\omega)\bar{x} + \omega\bar{X}}$ [ Mohanty and Sahoo (1987) ]
( j ) $\bar{y}_{MS2} = \bar{y}\,\frac{(1-\omega)\bar{x} + \omega\bar{X}}{\bar{X}}$ [ Mohanty and Sahoo (1987) ]
( k ) Replace $\bar{x}$ by $\bar{x}^* = (1+d)\bar{x} - d\bar{X}$ for some real constant $d$ in $\bar{y}_G$ and discuss the different members of the resultant class of estimators.
Hint: ( a ) to ( j ) Ceccon, Diana, and Salvan (1991); ( k ) Diana (1992); David and Sukhatme (1974).

Exercise 3.29. Suppose $\bar{x}$ and $\bar{y}$ denote the sample means in an SRSWOR sample of size $n$, and let $w = \bar{x}/\bar{X}$ and $z = \bar{y}/Y_k$, where $Y_k$ is some $Y_i$, $i = 1, 2, \ldots, N$, less than the population mean $\bar{Y}$, known from past surveys. Find the bias and variance of the estimator of the population mean $\bar{Y}$ defined as
$$\bar{y}_d = Y_k z_d,$$
where $z_d = 1 + z - w$.
Hint: Srivenkataramana (1978).

Exercise 3.30. Discuss the efficiency conditions of the following estimators of the population mean $\bar{Y}$:
( a ) $\bar{y}_1 = \bar{y}_p - \left(\frac{s_{xy}}{\bar{X}}\right)$, and ( b ) $\bar{y}_2 = \bar{y}_p - \left(\frac{s_{xy}}{\bar{x}}\right)$,
where $\bar{y}_p = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right)$, under the super-population model $y_i = \alpha + \beta x_i + e_i$ with $\beta < 0$, $E(e_i \mid x_i) = 0$, $E(e_i e_j \mid x_i x_j) = 0\ \forall i \ne j$ and $V(e_i \mid x_i) = n\delta$, where $\delta$ is a constant of order $n^{-1}$ and the variate $n\bar{x}$ has a gamma distribution with the parameter $m = nh$.
Hint: Singh and Singh (1997), Dalabehera and Sahoo (1995).

Exercise 3.31. Suppose $r = \bar{y}/\bar{x}$ and $\bar{r} = n^{-1}\sum_{i=1}^{n}(y_i/x_i)$.
( a ) Find the values of $\theta_t$, $t = 1, 2, 3$, such that the estimator of $\bar{Y}$ defined as
$$\bar{y}_1 = \theta_1\bar{y} + \theta_2 r\bar{X} + \theta_3\bar{r}\bar{X}$$
is unbiased under SRSWOR sampling for $\sum_{t=1}^{3}\theta_t = 1$.
( b ) Show that the estimator $\bar{y}_2 = \lambda r\bar{X} + (1-\lambda)\bar{r}\bar{X}$ is almost unbiased for the population mean for the optimum value of $\lambda$, of the form $\lambda_{opt} = k + (1 - ck)(\,\cdot\,)$, where $c = (n-1)/n$ and $k$ is some constant.
Hint: Rao (1981).
Exercise 3.32. Compare the predictive ratio estimator
$$\bar{y}_1 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) + \left(\frac{1}{n} - \frac{1}{N}\right)\bar{y}\left(\frac{\bar{X}}{\bar{x}_*}\right)\left(c_{xy} - c_x^2\right),$$
where
$$\bar{x}_* = \frac{N\bar{X} - n\bar{x}}{N - n}, \quad c_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i}{\bar{x}} - 1\right)\left(\frac{y_i}{\bar{y}} - 1\right), \quad \text{and} \quad c_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i}{\bar{x}} - 1\right)^2,$$
with the usual ratio estimator and with the estimator given by
$$\bar{y}_2 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left[1 + \left(\frac{1}{n} - \frac{1}{N}\right)\left(c_{xy} - c_x^2\right)\right].$$
Hint: Sahoo, Sahoo, and Mohanty (1995b), Srivastava (1983), and Tin (1965).
Exercise 3.33. Consider the problem of estimation of the median of a study variable $Y$ through
$$\hat{M}_{new} = \hat{M}_y\prod_{i=1}^{3}\left(\frac{Q_{ix}}{\hat{Q}_{ix}}\right)^{\alpha_i},$$
where $Q_{ix}$, $i = 1, 2, 3$, denote the $i$-th known quartiles of the auxiliary variable and $\hat{Q}_{ix}$ denotes the sample analogue. Find the minimum MSE of $\hat{M}_{new}$ for the optimum values of $\alpha_i$.
Hint: Singh, Singh, and Puertas (2003c).

Exercise 3.34. ( a ) Assuming that the mode $M_o$ of the auxiliary variable is known, find the bias and variance of the estimator of the median $M_y$ defined as
$$\hat{M}_{new} = \hat{M}_y\left(\frac{M_x + M_o}{\hat{M}_x + M_o}\right).$$
Compare the mean squared error of $\hat{M}_{new}$ with that of the usual ratio estimator
$$\hat{M}_R = \hat{M}_y\left(\frac{M_x}{\hat{M}_x}\right),$$
and develop a condition for it to be the more efficient estimator.
( b ) Study the following estimator of the population median $M_y$, defined as
$$\hat{M}(A) = \hat{M}_y\left(\frac{A - M_x}{A - \hat{M}_x}\right),$$
where $A$ is a suitably chosen scalar.
Hint: Singh, Singh, and Joarder (2003), Singh, Singh, and Puertas (2003a).

Exercise 3.35. Study the asymptotic properties of an estimator of the median $M_y$ defined as
$$\hat{M}_{new} = \hat{M}_y\prod_{i=1}^{99}\left(\frac{P_{ix}}{\hat{P}_{ix}}\right)^{\alpha_i},$$
where $P_{ix}$, $i = 1, 2, \ldots, 99$, denotes the $i$-th known percentile of the auxiliary variable and $\hat{P}_{ix}$ denotes the sample analogue. Find the minimum MSE of $\hat{M}_{new}$ for the optimum values of $\alpha_i$.
Hint: Singh (2002a).

Exercise 3.36. An SRSWOR sample is selected from a finite population of $N$ units generated from the superpopulation model
$$M:\quad y_i = \alpha + \beta x_i + e_i$$
with $E_M(e_i) = 0$, $V_M(e_i) = \sigma_y^2(1 - \rho_{xy}^2)$, and $\mathrm{Cov}_M(e_i, e_j) = 0$ for $i \ne j$.
Show that under this model the variance of the linear regression estimator
$$\bar{y}_{LR} = \bar{y} + (s_{xy}/s_x^2)(\bar{X} - \bar{x})$$
can be written as
$$E_M[V(\bar{y}_{LR})] = \frac{(1-f)}{n}\,\sigma_y^2(1 - \rho_{xy}^2)\left\{1 + \frac{1}{n-3}\right\}.$$
Hint: Sukhatme, Sukhatme, Sukhatme, and Asok (1984).

Exercise 3.37. Study the asymptotic properties of the multivariate estimator of the population mean $\bar{Y}$ given by
$$\bar{y}_1 = \sum_{k=1}^{p} W_k\left[\bar{y} + a_k\left(\bar{X}_k - \bar{x}_k\right)\right].$$
Discuss the choice of $W_k$ and $a_k$ such that the estimator $\bar{y}_1$ reduces to the well known multivariate ratio type estimator proposed by Olkin (1958),
$$\bar{y}_0 = \sum_{k=1}^{p} W_k\left(\frac{\bar{y}}{\bar{x}_k}\bar{X}_k\right).$$
Also discuss the nature of the estimator defined as
$$\bar{y}_{HRM} = \sum_{k=1}^{p} W_k\,\bar{y}_{HR(k)},$$
where $\bar{y}_{HR(k)} = \bar{r}_k\bar{X}_k + \frac{n(N-1)}{N(n-1)}(\bar{y} - \bar{r}_k\bar{x}_k)$, $\sum_{k=1}^{p} W_k = 1$, and $\bar{r}_k$, $\bar{x}_k$ and $\bar{X}_k$ have the same meanings as $\bar{r}$, $\bar{x}$ and $\bar{X}$ for the $k$-th auxiliary variable.
Hint: Raj (1965), Agarwal, Sharma, and Kashyap (1997), Ramachandran and Pillai (1976), Goodman and Hartley (1958).

Exercise 3.38. Study the asymptotic properties of the following two estimators of the population total $Y$, defined as
( a ) $T_1 = \hat{Y}\left(\frac{X}{\hat{X}}\right)$, and ( b ) $T_2 = \hat{Y}\left(\frac{X^*}{\hat{X}}\right)$,
where $X^* = (1+g)(X - f\hat{X})$, $g$ is a suitably chosen constant and $f = n/N$.
Hint: Bandyopadhyay (1980).

Exercise 3.39. Show that under the transformation $u_i = a + bx_i$, where $a$ and $b$ are pre-assigned constants, a product type estimator of the population mean $\bar{Y}$, defined as
$$\bar{y}_{(1)} = \bar{y}\left(\frac{\bar{u}}{\bar{U}}\right),$$
can be expressed as
$$\bar{y}_{(1)} = \bar{y}\left[1 + \frac{b(\bar{x} - \bar{X})}{a + b\bar{X}}\right].$$
( a ) Find the bias and variance of the resultant estimator of the population mean.
( b ) If $a = C_x$ and $b = \beta_2(x)$, where $\beta_2(x)$ denotes the kurtosis and $C_x$ the coefficient of variation of the auxiliary variable, then find the bias and MSE expressions for the ratio type estimator
$$\bar{y}_r = \bar{y}\left(\frac{\bar{U}}{\bar{u}}\right).$$
Hint: ( a ) Tripathi and Singh (1992), ( b ) Upadhyaya and Singh (1999).



Exercise 3.40. Show that an unbiased estimator of the finite population variance $S_y^2$ can be defined as
$$d_a = s_y^2 - k\left(s_x^2/S_x^2\right) + k.$$
Deduce its minimum variance for the optimum value of the real constant $k$.
Hint: Prasad and Singh (1992).

Exercise 3.41. Let $\hat{Y}$ and $\hat{X}$ be the unbiased estimators of the population totals $Y$ and $X$ of the two variables $y$, $x$ respectively, based on a sample of size $n$ drawn by SRSWOR sampling from a finite population of size $N$. Consider a transformation defining $W$ and $\hat{W}$ in terms of $X$, $\hat{X}$ and a scalar $\lambda$. Obtain the minimum mean squared error of the estimator
$$\hat{Y}_1 = \hat{Y}\left(\frac{W}{\hat{W}}\right)$$
for the optimum value of $\lambda$.
Hint: Srivenkataramana and Tracy (1983, 1986).
Exercise 3.42. Suppose $y_i = \begin{cases}1, & \text{if } i \in A,\\ 0, & \text{otherwise},\end{cases}$ and $x_i = \begin{cases}1, & \text{if } i \in B,\\ 0, & \text{otherwise},\end{cases}$ for all $i$, where $A$ and $B$ represent two groups in the population.
( a ) Show that the usual difference estimator
$$\bar{y}_p = \bar{y} + k(\bar{X} - \bar{x})$$
reduces to an estimator of the population proportion of the units belonging to $A$.
( b ) Assume that the proportion of units possessing the attribute $B$ is known. Discuss the optimum choice of $k$.
Hint: Das (1982).
Exercise 3.43. Suppose $\bar{y}^*_{(j)} = \frac{n\bar{y}_n - m\bar{y}_j}{n - m}$ and $\bar{x}^*_{(j)} = \frac{n\bar{x}_n - m\bar{x}_j}{n - m}$, where $\bar{y}_n = n^{-1}\sum_{i=1}^{n} y_i$ and $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$ are the full sample means, $\bar{y}^*_{(j)}$ and $\bar{x}^*_{(j)}$ are the sample means based on the $(n-m)$ units obtained by omitting the $j$-th group, and $\bar{y}_j$, $\bar{x}_j$ are the sample means based on the subsample of size $m = n/g$. Then show that a general class of almost unbiased ratio cum product estimators is given by
$$\bar{y}_{au} = \{1 - \lambda(1 - B)\}\bar{y} + \lambda\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left(\frac{\bar{z}}{\bar{Z}}\right) - \frac{\lambda B}{g}\sum_{j=1}^{g}\bar{y}^*_{(j)}\left(\frac{\bar{X}}{\bar{x}^*_{(j)}}\right)\left(\frac{\bar{z}^*_{(j)}}{\bar{Z}}\right),$$
with $B = \frac{(N-n)(n-m)}{n(N-n+m)}$ and $\lambda$ a real constant.
Hint: Biradar and Singh (1992).

Exercise 3.44. Study the asymptotic properties of an estimator of $\bar{Y}$ given by
$$\bar{y}^*_r = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) - \hat{\beta}\left\{\bar{y} - \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right)\right\},$$
where $\hat{\beta}$ is a suitably chosen constant, under the super-population model
$$y_i = \alpha + \beta x_i + \varepsilon_i,$$
where $\alpha$ and $\beta$ are unknown real constants and the $\varepsilon_i$ are random errors distributed
with $E(\varepsilon_i \mid x_i) = 0$, $E(\varepsilon_i^2 \mid x_i) = \delta x_i^g$, and $E(\varepsilon_i\varepsilon_j \mid x_i, x_j) = 0$ for every $i \neq j$. Also
assume that $0 < \delta < \infty$, $0 \le g \le 2$, and that the $x_i$ are independently identically distributed
with a common gamma density
$$f(x) = \frac{1}{\Gamma(\theta)}\, e^{-x} x^{\theta-1} \quad \text{for } x > 0 \text{ and } 2 \le \theta < \infty.$$
Hint: Singh, Singh, and Espejo (1998).

Exercise 3.45. Let the variables $y, x, z$ take real values $(y_i, x_i, z_i)$ on the $i$th unit
$(i = 1, 2, \ldots, N)$ in a finite population. Assume that the population means $\bar{X}$ and $\bar{Z}$
of the auxiliary variables are known and we are to estimate the population mean $\bar{Y}$ of
the study variable $y$. Assuming that $y$ is positively correlated with $x$ and
negatively correlated with $z$, study the asymptotic properties of the ratio cum product
estimator of the population mean defined as
$$\bar{y}_1 = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)\left(\frac{\bar{z}}{\bar{Z}}\right),$$
where $(\bar{y}, \bar{x}, \bar{z})$ are the unbiased estimators of the population means $(\bar{Y}, \bar{X}, \bar{Z})$
respectively, based on a simple random sample of size $n$ drawn without
replacement. Define $u_i = A - x_i$ and $v_i = B + z_i$, $i = 1, 2, \ldots, N$, where $A$ and $B$ are
suitably chosen scalars. Then $\bar{u} = A - \bar{x}$ and $\bar{v} = B + \bar{z}$ are the unbiased estimators
of $\bar{U} = A - \bar{X}$ and $\bar{V} = B + \bar{Z}$, respectively. Study the asymptotic bias and variance
of the estimator
$$\bar{y}_2 = \bar{y}\left(\frac{\bar{u}}{\bar{U}}\right)\left(\frac{\bar{V}}{\bar{v}}\right).$$
Compare the estimator $\bar{y}_2$ with the estimator $\bar{y}_1$ and discuss your opinions.
Hint: Singh (1967a, 1967b), Tracy, Singh, and Singh (1998), Chang and Huang
(2001a).

Exercise 3.46. Suppose a simple random sample of $n < N$ units is drawn without
replacement from the population under consideration and observations are made on
the selected units. Let the totals of the $y, x$ variates for the $n$ units in the sample be
$y(1)$ and $x(1)$, and the similar totals of the $(N-n)$ units not included in the sample be
$y(2)$ and $x(2)$. Study the asymptotic properties of the following two estimators of the
population ratio $R = Y/X$, given by
$$\text{( a )}\ \hat{R}_1 = \frac{y(1)}{X}\left[1 + \frac{x(2)}{x(1)}\right] \qquad \text{and} \qquad \text{( b )}\ \hat{R}_2 = \frac{y(1)}{X}\left[1 + \frac{g}{f}\right],$$
with $f = n/N$ and $g = 1 - f$.
Hint: Srivenkataramana and Tracy (1979).

Exercise 3.47. Consider a ratio estimator
$$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$$
based on an SRSWOR sample of $n$ units, where $\bar{y}$, $\bar{x}$ and $\bar{X}$ have their usual
meanings. Under the model
$$E_m(Y_i) = \beta X_i, \qquad V_m(Y_i) = \sigma^2 X_i, \qquad \text{and} \qquad C_m(Y_i, Y_j) = 0,$$
show that:
( a ) The ratio estimator can be written as
$$\bar{y}_R = \bar{y} + r\left(\bar{X} - \bar{x}\right),$$
where $r = \bar{y}/\bar{x}$;
( b ) $\operatorname{Cov}(\bar{x}, \bar{y}_R) \approx \dfrac{1-f}{n} S_x^2\left[Q - R\right]$, where $f = n/N$ and $Q = S_{xy}/S_x^2$;
( c ) When the model holds true, we expect a negligible gain in efficiency from the
optimum estimator $\bar{y}_{MU} = \bar{y} + Q(\bar{X} - \bar{x})$;
( d ) If the model is wrong, the relative gain in efficiency of $\bar{y}_{MU} = \bar{y} + Q(\bar{X} - \bar{x})$ over
$\bar{y}_R$ is expected to be substantial.
Hint: Montanari (1998, 1999).

Exercise 3.48. Suppose $(\bar{Y}, \bar{X})$ denote the population means of the two variables
$y$ and $x$, and $(\bar{y}, \bar{x})$ denote the means of a random sample of size $n$. Let $\bar{y}(j)$ and
$\bar{x}(j)$ be the means obtained by deleting the $n/g$ observations of the $j$th group.
( a ) Compare the classical estimator $\hat{R} = \bar{y}/\bar{x}$ and the Jackknife estimator
$$\hat{R}^* = g\hat{R} - \frac{(g-1)}{g}\sum_{j=1}^{g}\frac{\bar{y}(j)}{\bar{x}(j)}$$
with $g = 2$.
( b ) Study the following three estimators of the population mean given by
$$\bar{y}_1 = \hat{R}\bar{X} = \bar{y} + \hat{R}\left(\bar{X} - \bar{x}\right), \qquad \bar{y}_2 = \hat{R}^*\bar{X}, \qquad \text{and} \qquad \bar{y}_3 = \bar{y} + \hat{R}^*\left(\bar{X} - \bar{x}\right)$$
under the super-population model
$$y_i = \alpha + \beta x_i + \varepsilon_i,$$
where $\varepsilon_i$ has mean zero and variance $\delta x_i^t$ $(0 \le t \le 2)$ and is uncorrelated with $x_i$.
Hint: Rao (1979).



Exercise 3.49. Compare the usual Jackknife variance estimator
$$v_u = \frac{1}{g(g-1)}\sum_{j=1}^{g}\left[\hat{\theta}_j(r) - \hat{\theta}(r)\right]^2,$$
where $\hat{\theta}_j(r) = g\hat{r} - (g-1)\hat{r}_j$ and $\hat{\theta}(r) = \dfrac{1}{g}\sum_{j=1}^{g}\hat{\theta}_j(r)$, with the modified Jackknife
variance estimator
$$v_m = \frac{1}{g(g-1)}\sum_{j=1}^{g}\left[\hat{\theta}_j(r) - \hat{\theta}\right]^2,$$
where $\hat{\theta} = \hat{r}$.
Hint: Krewski and Chakrabarty (1981).
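For Exercise 3.49, a minimal Python sketch of $v_u$ and $v_m$, assuming the parameter is the ratio $\hat{r} = \bar{y}/\bar{x}$ and the sample splits into $g$ equal groups (function and variable names are illustrative, not from the text):

```python
# Usual (v_u) and modified (v_m) jackknife variance estimators for the ratio
# r_hat = ybar/xbar; g equal groups of size n/g are omitted one at a time.
def jackknife_variances(y, x, g):
    n = len(y)
    m = n // g
    r_hat = (sum(y) / n) / (sum(x) / n)            # full-sample ratio
    thetas = []                                     # pseudo-values theta_j
    for j in range(g):
        keep = [i for i in range(n) if not (j * m <= i < (j + 1) * m)]
        r_j = (sum(y[i] for i in keep) / len(keep)) / (sum(x[i] for i in keep) / len(keep))
        thetas.append(g * r_hat - (g - 1) * r_j)    # theta_j = g r - (g-1) r_j
    theta_bar = sum(thetas) / g
    v_u = sum((t - theta_bar) ** 2 for t in thetas) / (g * (g - 1))
    v_m = sum((t - r_hat) ** 2 for t in thetas) / (g * (g - 1))
    return v_u, v_m
```

Since $\sum_j(\hat{\theta}_j - \bar{\theta})^2 \le \sum_j(\hat{\theta}_j - \hat{r})^2$ for any data, $v_m \ge v_u$ always, which is the substance of the comparison.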

Exercise 3.50. In the SRSWOR sampling scheme, consider the following class of
estimators of the population mean:
$$\bar{y}_c = \sum_{i=1}^{n} w_i y_i,$$
where $w_i$ is a constant depending upon the $i$th draw and $y_i$ is the value of $y$ on
the unit selected at the $i$th draw.
( a ) Show that $\bar{y}_c$ is unbiased for $\bar{Y}$ if and only if $\sum_{i=1}^{n} w_i = 1$.
( b ) Show that the variance of the estimator $\bar{y}_c$ is given by
$$V(\bar{y}_c) = S_y^2\left(\sum_{i=1}^{n} w_i^2 - \frac{1}{N}\right).$$
( c ) Show that $V(\bar{y}_c)$ is minimum if and only if $w_i = 1/n$ for all $i$.
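The unbiasedness condition and the variance formula of Exercise 3.50 can be checked by brute force on a tiny population: every ordered SRSWOR draw is equally likely, so exact moments follow from enumeration (toy population and weights below are illustrative):

```python
# Exact check: for ordered SRSWOR draws, ybar_c = sum_i w_i * y_(i) is unbiased
# when sum w_i = 1, and its variance is S_y^2 * (sum w_i^2 - 1/N).
from itertools import permutations

pop = [3.0, 7.0, 8.0, 12.0, 20.0]   # toy population, N = 5
N, n = len(pop), 3
w = [0.5, 0.3, 0.2]                 # weights summing to 1 (unbiasedness condition)

Ybar = sum(pop) / N
S2 = sum((v - Ybar) ** 2 for v in pop) / (N - 1)

# every ordered without-replacement sample of size n is equally likely
ests = [sum(wi * yi for wi, yi in zip(w, s)) for s in permutations(pop, n)]
mean_est = sum(ests) / len(ests)
var_est = sum((e - mean_est) ** 2 for e in ests) / len(ests)
var_formula = S2 * (sum(wi ** 2 for wi in w) - 1 / N)
```

Repeating the enumeration with $w_i = 1/n$ reproduces the familiar SRSWOR variance $(1-f)S_y^2/n$, the minimum claimed in part ( c ).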


Exercise 3.51. Let $\bar{y}_s$ and $\bar{y}_{\bar{s}}$ denote, respectively, the means of the observed and
unobserved $y$ values, while $s^2(y_s)$ and $s^2(y_{\bar{s}})$ denote the corresponding variances.
( a ) Show that an estimator of the population variance $S_y^2$ can be written as
$$\hat{\sigma}_y^2 = \frac{1}{N-1}\left[(n-1)s_y^2 + (N-n-1)T + \frac{n(N-n)}{N}\left(\bar{y}_s - T^*\right)^2\right],$$
where $T$ and $T^*$ are, respectively, the predictors of $s^2(y_{\bar{s}})$ and $\bar{y}_{\bar{s}}$.
( b ) Show that if $T = s_y^2\left(s_x^2(\bar{s})/s_x^2\right)$ and $T^* = \bar{y} + \left(s_{xy}/s_x^2\right)\left(\bar{x}_{\bar{s}} - \bar{x}\right)$, then the estimator
$\hat{\sigma}_y^2$ reduces to $\hat{S}_y^2 = s_y^2\left(S_x^2/s_x^2\right)$.
Hint: Biradar and Singh (1998).

Exercise 3.52. Let $\hat{M}_y$, $\hat{M}_x$ and $\hat{M}_z$ denote, respectively, the median estimators
of $M_y$, $M_x$ and $M_z$ for the $y$, $x$, and $z$ values. Study the asymptotic properties
of the multivariate ratio type estimator
$$\hat{M} = \hat{M}_y\left(\frac{M_x}{\hat{M}_x}\right)^{\alpha}\left(\frac{M_z}{\hat{M}_z}\right)^{\beta},$$
where $\alpha$ and $\beta$ are real constants to be chosen.
Hint: Garcia and Cebrian (2001).

Exercise 3.53. If the relationship between the two variates $x$ and $y$ is given by
$y = a + bx$, $b \neq 0$, then show that in SRSWOR sampling the sample
estimator $\hat{Y} = N\bar{y}$ is more efficient than the ratio estimator $\hat{Y}_R = \hat{Y}\left(X/\hat{X}\right)$, where
$\hat{X} = N\bar{x}$ and $X$ denotes the known total, if
$$\frac{a^2\bar{X}^2}{b^2 S_x^2}\left\{\bar{X}\,E\!\left(\bar{x}^{-1}\right) - 1\right\} > \frac{1-f}{n}.$$
Hint: Raj (1968).

Exercise 3.54. ( a ) An SRSWOR sample $s$ of $n$ units is divided at random into $g$
subgroups, each of size $n/g$ rounded to the nearest integer. Let $\hat{Y}_{R(j)} = \hat{y}_{[j]}\left(X/\hat{x}_{[j]}\right)$
and $\hat{Y}_R = \hat{y}\left(X/\hat{x}\right)$ denote, respectively, the ratio estimator of the population total based
on the $j$th sub-sample and on the complete sample, $j = 1, 2, \ldots, g$. If $\alpha = -(1-f)/(n-g)$,
where $f = n/N$, then an unbiased estimator of the population total is given by
$$\hat{Y}_a = \alpha\sum_{j=1}^{g}\hat{Y}_{R(j)} + (1 - g\alpha)\hat{Y}_R.$$
( b ) Suppose $g$ is the number of independent interpenetrating samples, each of $n$
units, drawn from the population using SRSWOR sampling. Let $\bar{y}_i$ and $\bar{x}_i$ be
unbiased estimators of $\bar{Y}$ and $\bar{X}$ based on the $i$th sample. Show that an unbiased
estimator of the population mean $\bar{Y}$ is given by
$$\bar{y}_u = \bar{r}\bar{X} + \frac{g\left(\bar{y} - \bar{r}\bar{x}\right)}{g-1},$$
where $\bar{y} = g^{-1}\sum_{i=1}^{g}\bar{y}_i$, $\bar{x} = g^{-1}\sum_{i=1}^{g}\bar{x}_i$ and $\bar{r} = g^{-1}\sum_{i=1}^{g}\left(\bar{y}_i/\bar{x}_i\right)$.
( c ) An unbiased regression estimator based on splitting the sample into $g$ groups,
each of size $m = n/g$, is given by
$$\bar{y}_{lru} = \bar{y} + b_g\left(\bar{X} - \bar{x}\right) + \frac{1-f}{g}\left\{\sum_{j=1}^{g}\bar{x}_j b_{-j} - n\bar{x}b_g\right\},$$
where $b_g = g^{-1}\sum_{j=1}^{g} b_{-j}$, $b_{-j}$ is the sample regression coefficient computed from the
sample after omitting the $j$th group, and $\bar{x}_j$ is the sample mean for the $j$th group.
Hint: Mickey (1959), Rao (1969).

Exercise 3.55. Let $\bar{y}_1 = \bar{y}$, $\bar{y}_2 = \bar{y}\left(\bar{X}/\bar{x}\right)$ and $\bar{y}_3 = \bar{y}\left(\bar{x}/\bar{X}\right)$ denote the usual, ratio and
product estimators of the population mean $\bar{Y}$. Consider a new estimator $\bar{y}_{new} = \sum_{t=1}^{3}\alpha_t\bar{y}_t$,
such that $\sum_{t=1}^{3}\alpha_t = 1$ and $\sum_{t=1}^{3}\alpha_t B(\bar{y}_t) = 0$, where $B(\bar{y}_t)$ denotes the bias in the $t$th
estimator of the population mean. Choose $\alpha_t$ such that $MSE(\bar{y}_{new})$ is minimum.
Hint: Singh and Singh (1993a).

Exercise 3.56. An SRSWOR sample of $n$ units is drawn from a finite population
of $N$ units. Show that the bias in the estimator $\hat{R} = \bar{y}/\bar{x}$ of the ratio $R = \bar{Y}/\bar{X}$ is given by
$$B(\hat{R}) = -\operatorname{Cov}(\hat{R}, \bar{x})/\bar{X},$$
where $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i$ and $\bar{x} = n^{-1}\sum_{i=1}^{n} x_i$.
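The bias identity $B(\hat{R}) = -\operatorname{Cov}(\hat{R}, \bar{x})/\bar{X}$ is exact and can be verified by enumerating every SRSWOR sample of a tiny population (toy data; names are illustrative):

```python
# Enumerate all equally likely SRSWOR samples and check the exact identity
# bias(R_hat) = -Cov(R_hat, xbar) / Xbar, which follows from R_hat*xbar = ybar.
from itertools import combinations

ys = [2.0, 5.0, 7.0, 11.0]
xs = [1.0, 3.0, 4.0, 6.0]
N, n = 4, 2

samples = list(combinations(range(N), n))
r_hats, x_bars = [], []
for s in samples:
    ybar = sum(ys[i] for i in s) / n
    xbar = sum(xs[i] for i in s) / n
    r_hats.append(ybar / xbar)
    x_bars.append(xbar)

M = len(samples)
E_R = sum(r_hats) / M
Xbar, Ybar = sum(xs) / N, sum(ys) / N
cov = sum(r * x for r, x in zip(r_hats, x_bars)) / M - E_R * (sum(x_bars) / M)
bias = E_R - Ybar / Xbar          # equals -cov / Xbar exactly
```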

Exercise 3.57. Let $y_{ji}$ be the value of the $i$th unit in the population
$\Omega = [U_1, U_2, \ldots, U_N]$ for the $j$th variable $y_j$ $(j = 0, 1, 2)$, of which $y_0$ is the variable under
study and $y_1$ and $y_2$ are the auxiliary variables defined over $\Omega$. Let $\bar{y}_j$ be the
conventional unbiased estimator of the population mean $\bar{Y}_j$ $(j = 0, 1, 2)$, based on a
simple random sample of size $n$ drawn without replacement from the population.
When the correlation between the variable under study $y_0$ and the auxiliary
variable $y_1$ is positive (high), the ratio estimator $\hat{y}_1 = \bar{y}_0\left(\bar{Y}_1/\bar{y}_1\right)$ has been widely
used. If the correlation between the study variable $y_0$ and the auxiliary variable $y_2$ is
negative (high), a product estimator $\hat{y}_2 = \bar{y}_0\left(\bar{y}_2/\bar{Y}_2\right)$ can be used effectively. It is
assumed that the population means $\bar{Y}_1$ and $\bar{Y}_2$ of the auxiliary variables $y_1$ and $y_2$ are
known. In large scale sample surveys, we often collect data on more than one
auxiliary variable, and some of these may be correlated positively with the study
variable while others may be negatively correlated. Let the correlation between $y_0$
and $y_1$ be positive and between $y_0$ and $y_2$ be negative.
( a ) Study the following estimators of $\bar{Y}_0$:
$$\bar{y}_s = u\hat{y}_1 + v\hat{y}_2 \qquad \text{[ Srivastava (1965) ]}$$
where $u$ and $v$ are constants such that $u + v = 1$, and
$$\bar{y}_{rp} = \bar{y}_0\left(\frac{\bar{Y}_1}{\bar{y}_1}\right)\left(\frac{\bar{y}_2}{\bar{Y}_2}\right) \qquad \text{[ Singh (1967a, 1967b) ]}$$
known as the ratio cum product estimator of $\bar{Y}_0$.
( b ) Consider $\hat{y}_0 = \bar{y}_0$, $\hat{y}_1 = \bar{y}_0\left(\bar{Y}_1/\bar{y}_1\right)$, $\hat{y}_2 = \bar{y}_0\left(\bar{y}_2/\bar{Y}_2\right)$ and $\hat{y}_3 = \bar{y}_0\left(\bar{Y}_1/\bar{y}_1\right)\left(\bar{y}_2/\bar{Y}_2\right)$, such
that $\hat{y}_t \in H$ for $t = 0, 1, 2, 3$, where $H$ denotes the set of all possible estimators of the
population mean $\bar{Y}_0$. The set $H$ is a linear variety if $\hat{y}_h = \sum_{t=0}^{3} h_t\hat{y}_t \in H$ for $\sum_{t=0}^{3} h_t = 1$ and
$h_t \in R$, where the $h_t$ $(t = 0, 1, 2, 3)$ denote real constants used for reducing the bias and
$R$ stands for the set of real numbers. Find the values of $h_t$ $(t = 0, 1, 2, 3)$ such that the
variance of the estimator $\hat{y}_h$ is minimum and the bias is zero.
Hint: Tracy, Singh, and Singh (2001).

Exercise 3.58. Consider any variable $V = \{V : Y \text{ or } X\}$ in a finite population having
values $\Omega = \{V_i : 1 \le i \le N\}$, and let the values of the variable in a sample $s$ be
denoted $s = \{v_i : 1 \le i \le n\}$. Then the $\alpha$th quantile is defined as $V_\alpha = \inf\{v : F_V(v) \ge \alpha\}$,
where $F_V$ is the population cumulative distribution function of the variable $V$.
Consider $\hat{F}_y(y) = n^{-1}\sum_{i=1}^{n}\Delta(y - y_i)$, where $\Delta(a) = 1$ if $a > 0$ and $\Delta(a) = 0$ if $a \le 0$. The
inverse of $\hat{F}_y$ is $\hat{F}_y^{-1}(\alpha) = \inf\{y : \hat{F}_y(y) \ge \alpha\} = \hat{Y}_\alpha$. Note that $\alpha = F_Y(Y_\alpha)$ and that
$Y_\alpha = F_Y^{-1}\{\hat{F}_Y(Y_\alpha)\}$. Define a new variable $Z_i = \Delta(Y_\alpha - Y_i)$, $i = 1, 2, \ldots, N$. If we estimate
$Y_\alpha$ as $\hat{z} = \bar{z} + b\left(\bar{W} - \bar{w}\right)$, where $W = g(X)$ is correlated with the $Z$ variable, find the
minimum variance of the resultant estimator.
Hint: Mak and Kuk (1993).
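The indicator-based distribution function of Exercise 3.58 and its inverse can be sketched directly; note the strict inequality $\Delta(a) = 1$ only for $a > 0$, which makes $\hat{F}_y$ count values strictly below $y$ (toy data, illustrative names):

```python
def F_hat(data, y):
    """Empirical CDF with Delta(a) = 1 iff a > 0: fraction of data strictly below y."""
    return sum(1 for yi in data if yi < y) / len(data)

def F_hat_inv(data, alpha):
    """Smallest observed value y with F_hat(data, y) >= alpha."""
    for y in sorted(data):
        if F_hat(data, y) >= alpha:
            return y
    return max(data)

data = [10, 20, 30, 40, 50]
q = F_hat_inv(data, 0.5)   # with the strict Delta convention this returns 40 here
```

With the weak convention $\Delta(a) = 1$ for $a \ge 0$ the same search would return 30, the usual sample median; the strict convention used in the exercise shifts the inverse one order statistic up.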

Exercise 3.59. Let $y$ and $x_k$, $k = 1, 2, \ldots, p$, respectively, be the survey variable
and the auxiliary variables related to $y$, and suppose the information about the quantiles
of the auxiliary variables, or their distribution functions, is known. From the sample of $n$
units from a population of size $N$ we observe $(x_{ki}, y_i)$, where $i \in s$. Suppose
$Q_{x_k}(\alpha_k)$ for $\alpha_k \in (0.0, 0.5)\cup(0.5, 1.0)$ are known and we wish to estimate $Q_y(\beta)$ with
$\beta = 1/2$. Study the asymptotic properties of the following estimators of $Q_y(\beta)$:
$$\hat{F}_R = \hat{F}_y\left(\hat{Q}_y(\beta)\right)\sum_{k=1}^{p} w_k\left(\frac{F_{x_k}\left(Q_{x_k}(\alpha_k)\right)}{\hat{F}_{x_k}\left(Q_{x_k}(\alpha_k)\right)}\right)$$
and
$$\hat{F}_D = \hat{F}_y\left(\hat{Q}_y(\beta)\right) + \sum_{k=1}^{p} b_k\left\{F_{x_k}\left(Q_{x_k}(\alpha_k)\right) - \hat{F}_{x_k}\left(Q_{x_k}(\alpha_k)\right)\right\}, \qquad \text{where } \sum_{k=1}^{p} w_k = 1.$$
Hint: Rueda and Arcos (2002).

Exercise 3.60. Let $y$ and $x$, respectively, be the survey variable and the auxiliary
variable related to it, and suppose the information about the quantiles of the
auxiliary variable, or its distribution function, is known. From the sample of $n$ units
from a population of size $N$ we observe $(x_i, y_i)$, where $i \in s$. Suppose $F_x(M_x)$ is
known and we wish to estimate $F_y(M_y)$. Study the asymptotic properties of the
following estimators of $F_y(M_y)$:
$$\hat{F}_R = \hat{F}_y\left(\hat{M}_y\right)\left(\frac{F_x(M_x)}{\hat{F}_x(M_x)}\right)$$
and
$$\hat{F}_D = \hat{F}_y\left(\hat{M}_y\right) + b\left\{F_x(M_x) - \hat{F}_x(M_x)\right\},$$
where $b$ is a real constant.
Hint: Rueda, Arcos, and Artes (1998).

Exercise 3.61. Show that under simple random sampling without replacement
(SRSWOR), the following product-type estimator
$$\hat{Y}_u = N\left[\bar{y}\,\frac{\bar{x}}{\bar{X}} - \frac{(1-f)}{n}\,\frac{s_{xy}}{\bar{X}}\right]$$
is unbiased for the population total.
Hint: Singh and Srivastava (1980).

Exercise 3.62. For the following situations, discuss how you may use ratio,
product, difference or regression estimators:
( a ) Estimate the average number of fish caught per month by marine recreational
fishermen on the Atlantic and Gulf coasts, assuming that the number of employees per
month is known.
( b ) Estimate the average amount that the graduate students in your class spent on
stationery, assuming that the weekly sale of stationery at a local shop is known.
( c ) Estimate the proportion of time devoted to politics on the television on the
national channel of your country, where no further auxiliary information is available.
( d ) Estimate the total weight of bones discarded at the time of shipment of usable
chicken meat, assuming that the number of shipments is known.

Exercise 3.63. Consider a population of $N$ identifiable units on which a study
variable $y$ is associated with $p$ auxiliary variables $x_1, \ldots, x_p$, whose population
variances $\sigma_k^2$, $k = 1, 2, \ldots, p$, are assumed to be known. Let $(y_i, x_{ik})$, $i = 1, 2, \ldots, n$, be
the observed values in an SRSWR sample on the $(p+1)$ variables. Consider the
problem of estimation of the finite population variance
$$\sigma_0^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2, \qquad \text{where } \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i.$$
Find the bias and variance of the following estimators of $\sigma_0^2$:
$$\hat{\sigma}_r^2 = \sum_{k=1}^{p} w_k\left(s_0^2/s_k^2\right)\sigma_k^2, \qquad k = 1, 2, \ldots, p, \quad 0 < w_k < 1 \text{ and } \sum_{k=1}^{p} w_k = 1,$$
and
$$\hat{\sigma}_s^2 = \sum_{k=1}^{p} w_k\left(s_0^2/s_k^2\right)\sigma_k^2 + w_{p+1}s_0^2, \qquad \text{where } \sum_{k=1}^{p+1} w_k = 1.$$
Hint: Isaki (1983), John (1969), Shukla (1996), Mohanty and Pattanaik (1984),
Singh and Singh (2001).

Exercise 3.64. Let $Y_i$ and $X_i$, $i = 1, 2, \ldots, N$, denote the values of the population
units for the study variable $Y$ and the auxiliary variable $X$, respectively. Further,
let $y_i$ and $x_i$, $i = 1, 2, \ldots, n$, denote the values of the units included in a sample $s_n$ of
size $n$ drawn by simple random sampling without replacement (SRSWOR). The
parameter of interest is the population interquartile range of the study variable
$Y$, defined by
$$\Theta_y = Q_{3y} - Q_{1y},$$
where $Q_{1y}$ and $Q_{3y}$ denote the first and third population quartiles of $Y$ respectively.
The conventional estimator of $\Theta_y$ is
$$\hat{\Theta}_y = \hat{Q}_{3y} - \hat{Q}_{1y},$$
where $\hat{Q}_{1y}$ and $\hat{Q}_{3y}$ are sample estimates of $Q_{1y}$ and $Q_{3y}$ respectively. Let
$\Theta_x = Q_{3x} - Q_{1x}$ be the known population interquartile range of the auxiliary
variable $X$, and its conventional estimator is
$$\hat{\Theta}_x = \hat{Q}_{3x} - \hat{Q}_{1x},$$
where $\hat{Q}_{1x}$ and $\hat{Q}_{3x}$ are sample estimates of the first quartile $Q_{1x}$ and the third quartile $Q_{3x}$
of $X$ respectively.
( a ) Study the asymptotic properties of the following estimators of $\Theta_y$ defined as
$$\hat{\Theta}_y^{(1)} = \hat{\Theta}_y\left(\Theta_x/\hat{\Theta}_x\right) \qquad \text{and} \qquad \hat{\Theta}_y^{(2)} = \hat{\Theta}_y + \beta_{iqr}\left(\Theta_x - \hat{\Theta}_x\right),$$
where $\beta_{iqr}$ is a suitably chosen constant such that the variance of $\hat{\Theta}_y^{(2)}$ is minimum.
( b ) Study the asymptotic properties of the following estimators of $M_y$ defined as
$$\hat{M}_y^{(1)} = \hat{M}_y\left(\Theta_x/\hat{\Theta}_x\right) \qquad \text{and} \qquad \hat{M}_y^{(2)} = \hat{M}_y + \beta_{iqr}\left(\Theta_x - \hat{\Theta}_x\right),$$
where $\beta_{iqr}$ is a suitably chosen constant such that the variance of $\hat{M}_y^{(2)}$ is minimum.
Hint: Singh and Singh (2002), Singh, Singh, and Puetas (2003b).

Exercise 3.65. Study the asymptotic properties of the following estimators of the
population mean $\bar{Y}$, defined as
$$\text{( a )}\ \bar{y}_{SD} = \bar{y}\left(\frac{\bar{X} + C_x}{\bar{x} + C_x}\right),$$
where $C_x$ denotes the known population coefficient of variation of the auxiliary variable;
$$\text{( b )}\ \bar{y}_{SK} = \bar{y}\left(\frac{\bar{X} + \beta_2(x)}{\bar{x} + \beta_2(x)}\right),$$
where $\beta_2(x)$ is the population coefficient of kurtosis of the auxiliary variable;
$$\text{( c )}\ \bar{y}_{US1} = \bar{y}\left(\frac{\bar{X}\beta_2(x) + C_x}{\bar{x}\beta_2(x) + C_x}\right) \qquad \text{and} \qquad \text{( d )}\ \bar{y}_{US2} = \bar{y}\left(\frac{\bar{X}C_x + \beta_2(x)}{\bar{x}C_x + \beta_2(x)}\right).$$
Hint: Sisodia and Dwivedi (1981), Singh and Kakran (1993), Upadhyaya and
Singh (1999).
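The four estimators of Exercise 3.65 are simple to compute once $C_x$ and $\beta_2(x)$ are treated as known population constants; a Python sketch on toy values (all inputs illustrative):

```python
# Transformed ratio estimators of the population mean using the known
# coefficient of variation Cx and kurtosis beta2x of the auxiliary variable.
def transformed_ratio_estimates(y, x, Xbar, Cx, beta2x):
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    y_sd  = ybar * (Xbar + Cx) / (xbar + Cx)                   # Sisodia--Dwivedi
    y_sk  = ybar * (Xbar + beta2x) / (xbar + beta2x)           # Singh--Kakran
    y_us1 = ybar * (Xbar * beta2x + Cx) / (xbar * beta2x + Cx) # Upadhyaya--Singh
    y_us2 = ybar * (Xbar * Cx + beta2x) / (xbar * Cx + beta2x)
    return y_sd, y_sk, y_us1, y_us2
```

A quick sanity check: whenever $\bar{x} = \bar{X}$, every correction factor equals one and all four estimates collapse to $\bar{y}$.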

Exercise 3.66. Consider a finite population of $N$ units $\Omega : \{U_1, U_2, \ldots, U_N\}$. Let $y$
and $x$ be the variables taking values $y_i$ and $x_i$ respectively on $U_i$ $(i = 1, 2, \ldots, N)$.
For estimating $\bar{Y}$, Srivenkataramana (1980) and Bandyopadhyay (1980) proposed
a dual to product estimator as
$$\hat{y}_r = \bar{y}\left(\frac{\bar{X}}{\bar{x}^*}\right), \qquad \text{where } \bar{x}^* = \frac{N\bar{X} - n\bar{x}}{N - n}.$$
Using the predictive approach advocated by Basu (1971), Srivastava (1983) envisaged
another estimator for $\bar{Y}$ as
$$\hat{y}_s = \bar{y}\left[\frac{N\bar{X} - n\bar{x}}{n\bar{X} + (N - 2n)\bar{x}}\right].$$
Let a sample of size $n$ be drawn without replacement from the population and let it
be split into $g$ sub-samples, each of size $m = n/g$, where $m$ is an integer. Let
$(\bar{x}_j, \bar{y}_j)$, $j = 1, 2, \ldots, g$, be the unbiased estimators of $(\bar{X}, \bar{Y})$ based on the $j$th sub-sample
of size $m$. The Jackknife versions of the above estimators are given by
$$\hat{y}_{r_j} = \bar{y}_j\left(\frac{\bar{X}}{\bar{x}^*_j}\right) \quad \text{and} \quad \hat{y}_{s_j} = \bar{y}_j\left[1 - \frac{\bar{X} - \bar{x}_j}{\bar{x}^*_j}\right]^{-1}, \qquad \text{where } \bar{x}^*_j = \frac{N\bar{X} - n\bar{x}_j}{N-n}, \quad j = 1, 2, \ldots, g.$$
Let $\hat{y}_1 = \dfrac{1}{g}\sum_{j=1}^{g}\hat{y}_{r_j}$, $\hat{y}_2 = \dfrac{1}{g}\sum_{j=1}^{g}\hat{y}_{s_j}$, $\hat{y}_3 = \hat{y}_r$, $\hat{y}_4 = \hat{y}_s$ and $\hat{y}_5 = \bar{y}$. Study the
asymptotic properties of an unbiased estimator of $\bar{Y}$ defined as
$$\hat{y}_0 = \sum_{i=1}^{5}\delta_i\hat{y}_i, \qquad \text{such that } \sum_{i=1}^{5}\delta_i = 1,$$
where $\delta_i$, $i = 1, 2, 3, 4, 5$, are real constants.
Hint: Upadhyaya, Singh, and Singh (2003).



Exercise 3.67. Consider an estimator of the population mean $\bar{Y}$ as
$$\bar{y}_{ilu} = \frac{1}{N}\sum_{i \in \Omega}\hat{y}_{i0},$$
where
$$\hat{y}_{i0} = \begin{cases} y_i & \text{if } i \in s, \\ a + bx_i & \text{if } i \in (\Omega - s), \end{cases}$$
and $a$ and $b$ are constants to be chosen such that the mean square error of the
estimator is minimum.
( a ) Show that the estimator $\bar{y}_{ilu}$ can be written as
$$\bar{y}_{ilu} = f\bar{y} + (1-f)a + b\left(\bar{X} - f\bar{x}\right),$$
where $f = n/N$.
( b ) Show that the minimum mean square error of the estimator $\bar{y}_{ilu}$ is given by
$$\text{Min.MSE}\left(\bar{y}_{ilu}\right) = \left(\frac{1-f}{n}\right)f\,S_y^2\left(1 - \rho_{xy}^2\right).$$
Hint: Singh (2003a).

Exercise 3.68. Find two complex numbers such that the variance of the difference
estimator $\bar{y}_{dif} = \bar{y} + k\left(\bar{X} - \bar{x}\right)$ is zero.
Hint: Set $V(\bar{y}_{dif}) = 0$ and solve the quadratic equation for two complex values of $k$.

Practical 3.1. A private company ABC was interested in estimating the average
amount of real estate farm loans (in $000) during 1997 in the United States. The
company collected information from six states included in an SRSWOR sample, as
shown below:

State: CT, ME, NE, NY, VA, WI
Nonreal estate farm loans ( X ) $: 4.373, 51.539, 3585.406, 426.274, 188.477, 1372.439
Real estate farm loans ( Y ) $: 7.130, 8.849, 1337.852, 201.631, 321.583, 1229.572

( a ) Given that the average amount of nonreal estate farm loans is $878.16 (in
$000) for the year 1997, obtain the ratio estimate for the average amount of the real
estate farm loans (in $000) during 1997. Develop an estimator of the mean squared
error of the ratio estimator and hence deduce the 95% confidence interval. Verify,
using the information given in population 1 of the Appendix, whether the true mean lies
in the confidence interval you suggested. Interpret your findings in two lines.
( b ) Estimate the average amount of the real estate farm loans during 1997 by using
Beale's estimator, defined as
282 Advanced sampling theory with applications

Y-{I + -(n -I)


-C.yx }
n - • S • s2r
x.2
.0.
where C
X Y , and
= --!L C
Yb == -{I (n -I) . } X xy xx = .
X +- - c....
11

Construct 95% confidence interval for estimating the average real estate farm loans
by assuming that the Beale's estimator has the same mean squared error as that of
usual ratio estimator .
( c ) Also apply Tin's estimator of the population mean given by

.0.
Yt =Y
-(XJ[ (I-/)r.
x 1+- . )~
- \cXy - c'''' J '
Il

Construct 95% confidence interval estimate of the average real estate farm loans
assuming that the mean squared error of Tin's estimator is same as that of usual
ratio estimator .
( d ) Estimate the average of the real estimate farm loans using a new estimator of
population mean, defined as

LXiYi
• .n I ] -
Y new == .!.=.-.- X.
II 2
[ L Xi
i=1
Construct 95% confidence interval estimate for the real estate farm loans assuming
that the mean squared error of this new estimator is same as that of usual ratio
estimator .
( e ) Compare your results obtained in part ( a ), ( b ) , ( c ) and ( d ), and comment.
Answer:
(a) 484.69, [120.86,848.52]; (b) cxy == I .5306 , c == 2.1997 , Yb = 389.30;
xx

( c) Yt == 437.129 ; ( d) Y new = 389.30 .
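Answer ( a ) can be reproduced directly from the printed sample; here the first data row is taken as the auxiliary variable $X$ (nonreal estate loans), since that assignment reproduces the printed 484.69 (a hedged check, with illustrative variable names):

```python
# Ratio estimate for Practical 3.1(a): ybar * (Xbar / xbar).
x = [4.373, 51.539, 3585.406, 426.274, 188.477, 1372.439]  # nonreal estate (X)
y = [7.130, 8.849, 1337.852, 201.631, 321.583, 1229.572]   # real estate (Y)
Xbar = 878.16                              # known population mean of X
n = len(y)
ybar, xbar = sum(y) / n, sum(x) / n
ratio_est = ybar * Xbar / xbar             # about 484.69
```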

Practical 3.2. An instructor suggested to the class to use the ratio estimator while
estimating the average nonreal estate farm loans by making use of the known real estate
farm loans. Use the information given in population 1 of the Appendix to support
the instructor's statement.
Hint: Discuss the relative efficiency of the ratio estimator with respect to the usual
estimator.

Practical 3.3. A team of doctors wishes to estimate the average duration of sleep
(in minutes) during the night for persons aged 50 years and over in a small village
in the United States. It is known that there are 30 persons living in the village aged
50 years and over. Instead of asking everybody, the team selects an
SRSWOR sample of five of these people and records the information as given
below:

Person no.: 18, 24, 9, 12, 24
Age X (years): 63, 87, 84, 70, 87
Duration of sleep Y (minutes): 405, 270, 276, 360, 270

It is a well-known fact that as the age of a person increases, the sleeping hours
decrease. Apply the appropriate method of estimation for estimating the average
sleep time in the particular village under study. Find an estimator of the mean
squared error of the estimator you used and derive the 95% confidence interval
estimate. Assume that the average age of the subjects, 67.267 years, is known, as
shown in population 2 in the Appendix.
Hint: Apply the product method of estimation.
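With age negatively related to sleep, the product estimator $\bar{y}_P = \bar{y}\,\bar{x}/\bar{X}$ applies; a quick check on the printed sample (the numerical result below is computed from the data, not printed in the text):

```python
# Product-method estimate for Practical 3.3, using the known mean age Xbar.
ages  = [63, 87, 84, 70, 87]        # auxiliary variable x
sleep = [405, 270, 276, 360, 270]   # study variable y (minutes)
Xbar = 67.267                       # known mean age of the 30 villagers
n = len(ages)
ybar, xbar = sum(sleep) / n, sum(ages) / n
product_est = ybar * xbar / Xbar    # about 367.6 minutes
```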

Practical 3.4. The age and sleeping time (in minutes) of 30 persons aged 50 and
over living in a small village in the United States are given in population 2 of the
Appendix. Discuss the relative efficiency of the product estimator under the SRSWOR
design.

Practical 3.5. The regression method of estimation has been found to be the most
efficient method among others. Use it to estimate the average amount of the real
estate farm loans (in $000) during 1997 based on an SRSWOR sample of six states
selected from population 1 in the Appendix, given below:

State: AL, FL, MD, OH, TX, VT
Nonreal estate farm loans ( X ) $: 348.334, 464.516, 57.684, 635.774, 3520.361, 19.363
Real estate farm loans ( Y ) $: 408.978, 825.748, 139.628, 871.720, 1248.761, 57.747

The average amount of nonreal estate farm loans, $878.16 ($000) for the year 1997,
is known. Construct a 95% confidence interval estimate and interpret it in non-
technical language.

Practical 3.6. Find the relative efficiency of the regression estimator for estimating
the average amount of the nonreal estate farm loans during 1997 by using data on
the real estate farm loans during 1997 as an auxiliary variable, with respect to the
ratio estimator of the population mean. The real and nonreal estate farm loans (in $000)
during 1997 in the 50 states of the United States have been presented in population
1 of the Appendix.

Practical 3.7. The amounts of the real and nonreal estate farm loans (in $000)
during 1997 in the 50 states of the United States have been given in population 1 in
the Appendix. If we select an SRSWOR sample of six states to collect the required
information, find the relative efficiency of the general class of estimators which makes
use of the known variance of the auxiliary variable at the estimation stage, for
estimating the average amount of nonreal estate farm loans during 1997 by using
information from real estate farm loans during 1997 as an auxiliary variable, with
respect to the regression estimator of the population mean.

Practical 3.8. The study of the relationship between age and duration of sleep helps
a local hospital in developing future policies. If hospital researchers consider 10
patients to collect the information from population 2 of the Appendix, then what
will be the relative bias of the usual estimator of the correlation coefficient under the
SRSWOR design?

Practical 3.9. A bank manager raised the issue that the variation in the nonreal
estate farm loans affects their customers. The bank selected an SRSWOR sample of
six states. The manager decides to pick an estimator from the general class of
estimators. Discuss the relative efficiency of the general class of estimators which
makes use of the known variance of the auxiliary variable at the estimation stage, for
estimating the finite population variance of the amount of the nonreal estate farm
loans during 1997 by using known information about the real estate farm loans
during 1997, with respect to the ratio and regression estimators of the finite population
variance.
Hint: Use information from population 1 of the Appendix.

Practical 3.10. A private company, Kitty Management, believes that real and nonreal
estate farm loans have a cause and effect relationship between them. They want to
know the effect of a unit change in real estate farm loans on nonreal estate farm
loans. A statistician suggests to them that there are two different measuring tools to
estimate the regression coefficient. Study the relative efficiency of the usual estimator
$b_1$ with respect to the unbiased estimator $b_2$ of the regression coefficient by using
complete information available in population 1 of the Appendix.

Practical 3.11. Your instructor provided you an SRSWOR sample of six states
from population 3 of the Appendix as:

State: DE, MD, NC, VT, WA, WI
Year 1994: 0.168, 0.173, 0.088, 0.165, 0.138, 0.230
Year 1996: 0.173, 0.158, 0.117, 0.194, 0.204, 0.241

Find the mistake made by the instructor during the collection of data. Correct the
data accordingly. Apply the regression method of estimation for estimating the
average price of the apple crop during 1996, assuming that the average price during
1994 ($0.1708) is known.

Practical 3.12. The Jackknife variance estimation technique has become popular due to
its simplicity. Suppose we took an SRSWOR sample of six states from
population 1 in the Appendix and gathered the following information:

State: AR, KY, MN, OK, UT, WA
Nonreal estate farm loans ( X ) $: 848.317, 557.656, 2466.892, 1716.087, 197.244, 1228.607
Real estate farm loans ( Y ) $: 907.700, 1045.106, 1354.768, 612.108, 56.908, 1100.745

Apply the ratio method of estimation for estimating the average amount of the real
estate farm loans (in $000) during 1997. Also find an estimate of the variance of the
ratio estimator using the Jackknife technique and deduce the 95% confidence intervals.
Assume that the average amount $878.16 of nonreal estate farm loans (in $000) for
the year 1997 is known.
Answers: 95% CIs are [210.830, 1060.399] and [209.063, 1062.166].
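The ratio estimate behind both printed intervals, together with a delete-one Jackknife variance ($g = n$ groups of size one — an assumed grouping; variable names illustrative), can be sketched as:

```python
# Ratio estimate and delete-one jackknife variance for Practical 3.12.
x = [848.317, 557.656, 2466.892, 1716.087, 197.244, 1228.607]  # nonreal estate (X)
y = [907.700, 1045.106, 1354.768, 612.108, 56.908, 1100.745]   # real estate (Y)
Xbar = 878.16
n = len(y)
r_full = (sum(y) / n) * Xbar / (sum(x) / n)      # ratio estimate, about 635.6

pseudo = []
for j in range(n):                                # delete-one pseudo-values
    yj = (sum(y) - y[j]) / (n - 1)
    xj = (sum(x) - x[j]) / (n - 1)
    pseudo.append(n * r_full - (n - 1) * (yj * Xbar / xj))
pbar = sum(pseudo) / n
v_jack = sum((p - pbar) ** 2 for p in pseudo) / (n * (n - 1))
```

The point estimate agrees with the midpoint of the printed confidence intervals, about 635.6.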

Practical 3.13. Sometimes the estimation of the variance of the regression estimator is
difficult, but the Jackknife technique has been found to be the best solution in such
situations. Apply the regression method of estimation for estimating the average
amount of the real estate farm loans (in $000) during 1997. Also find estimates of the
variance of the regression estimator using the Jackknife technique and hence deduce
the 95% confidence intervals by using the data given below:

State: KY, NC, SD, UT, AL, FL
Nonreal estate farm loans ( X ) $: 557.656, 494.73, 1692.817, 197.244, 348.334, 464.516
Real estate farm loans ( Y ) $: 1045.106, 639.571, 413.777, 56.908, 408.978, 825.784

The average amount $878.16 of nonreal estate farm loans (in $000) for the year
1997 is known.

Practical 3.14. The season average price (in $) per pound of the commercial apple
crop in 36 states of the United States has been given in population 3 of the
Appendix. Suppose we selected an SRSWOR sample of six states to collect the
required information for 1996. Find the relative efficiency of the regression type
estimator of the average price in the United States that makes use of past information
from two years with respect to the estimator that makes use of past information
from only one year.

Practical 3.15. The real and nonreal estate farm loans (in $000) during 1997 in the 50
states of the United States have been presented in population 1 of the Appendix.
Suppose we selected an SRSWOR sample of six states to collect the required
information. Find the relative efficiency of the ratio estimator of the median, for
estimating the median of the amount of the nonreal estate farm loans during 1997 by
using information from the real estate farm loans during 1997, with respect to the usual
estimator of the population median. Assume that both real and nonreal estate farm loans
follow normal distributions.

Practical 3.16. Consider the problem of estimating the average nonreal estate farm
loan in the United States. We wish to apply the ratio method of estimation using
known information about the real estate farm loans, as shown in population 1 in the
Appendix. What is the minimum sample size required for the relative standard error
(RSE) to be equal to 25%?

Practical 3.17. Select an SRSWOR sample of ten states listed in population 1 of
the Appendix. Collect information about the real estate farm loans and nonreal
estate farm loans from the selected states. Apply the ratio method of estimation for
estimating the average nonreal estate farm loans, assuming that the average real
estate farm loans in the United States is known.

Practical 3.18. Consider the population under study to consist of the students
present in the class today. Construct a list of names of the students and assign a number
to each of them. Use a random number table to select a sample of 20% of the students
present in the class. Collect information about their GPA from the students selected
in the sample. Assume that the average number of lectures attended by the whole
class is known (or can be found from the register). Also collect information about
the number of lectures attended by the students selected in the sample. Assuming
that the relationship between the average number of lectures attended and the GPA of
the students is positive, apply the appropriate method(s) to estimate the average
GPA of the class, and derive 95% confidence interval estimate(s).

Practical 3.19. John needs an estimate of the average nonreal estate farm loan in
the United States. His supervisor has advised him to apply the regression method of
estimation using known information about the real estate farm loans, as shown in
population 1 of the Appendix, with minimum relative standard error equal to 10%.
What will be John's sample size to meet his supervisor's conditions?

Practical 3.20. Select an SRSWOR sample of the size obtained in Practical 3.19
from population 1 of the Appendix. Collect information about the real estate farm
loans and nonreal estate farm loans from the selected states. Apply the regression
method of estimation for estimating the average nonreal estate farm loans, assuming
that the average real estate farm loans in the United States is known.

Practical 3.21. For estimating the regression coefficient of the amount of the real
estate farm loans (in $000) on the nonreal estate farm loans during 1997 in the
United States, we took an SRSWOR sample of six states from population 1 in
the Appendix. From the states selected in the sample, we collected the following
information:

State: AK, CA, CT, ME, VA, WI
Nonreal estate farm loans ( X ) $: 3.433, 3928.732, 4.373, 51.539, 188.477, 1372.439
Real estate farm loans ( Y ) $: 2.605, 1343.461, 7.130, 8.849, 321.583, 1229.572

Assume that the population variance $S_x^2 = 1176526$ of the nonreal estate farm loans
(in $000) for the year 1997 is known. Estimate the regression coefficient $\beta$ with
two different methods. Also find an estimate of the mean squared error in each
case and hence deduce 95% confidence intervals.

Practical 3.22. The amounts of the real and nonreal estate farm loans (in $000)
during 1997 in 50 states of the United States have been given in population 1 in the
Appendix. Suppose we selected an SRSWOR sample of six states to collect the
required information. Study the relative efficiency of the usual estimator b with
respect to the unbiased estimator b_1 of the regression coefficient for this population.

Practical 3.23. A team of medical doctors claims that there is a strong negative
relationship between the age of a person and the hours of sleep required. Justify
their statement based on the information given in population 2 of the Appendix.
Also study the relative bias of the usual estimator of the correlation coefficient
based on a sample of 10 units.

Practical 3.24. A student in a medical college studies the statement made by the team
of medical doctors about the negative relationship between age and duration of
sleep, and takes an SRSWOR sample of 6 persons from population 2 of the
Appendix as given below:

Person No.        | 17  | 08  | 24  | 30  | 02  | 16
Age ( x )         | 78  | 74  | 87  | 72  | 72  | 66
Duration of sleep | 345 | 381 | 270 | 345 | 384 | 420

Estimate the correlation coefficient ρ_xy between the age and the duration of sleep
and comment. Find an estimate of the mean squared error and derive a 75%
confidence interval for the correlation coefficient.

Practical 3.25. Select four different samples each of four units by using SRSWOR
sampling from the population 1 of the Appendix. Collect the information for the
real and nonreal estate farm loans from the states selected in each sample. The
average nonreal estate farm loan is assumed to be known. Obtain four different ratio
estimates of the average real estate farm loans from the information collected in the
four samples. Pool the information collected in the four samples to obtain a pooled
ratio estimate of the average real estate farm loans.
( a ) Derive an unbiased estimate of the average real estate farm loans.
( b ) Construct a 95% confidence interval.
Given: Average nonreal estate farm loans $878.16.
Rules: Use the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix
to select four samples with starting columns as:

Sample Number | Starting Columns
      1       | 3 and 4
      2       | 8 and 9
      3       | 5 and 6
      4       | 4 and 5
Answer: The 95% CI is [545.36, 904.80].
288 Advanced sampling theory with applications

Practical 3.26. Consider the problem of estimating the finite population variance of
the duration of sleep in the United States by using the known benchmark as the
variance of the auxiliary variable, age. Using information given in population 2 of
the Appendix, find the relative efficiency of the product type of estimator
s_p^2 = s_y^2 ( s_x^2 / S_x^2 ) with respect to the estimator s_y^2.

Practical 3.27. Estimate the finite population variance of the duration of sleep in
the United States based on the information given in the population 2 of the
Appendix. What is the minimum sample size required for the estimator
s_p^2 = s_y^2 ( s_x^2 / S_x^2 ) to have minimum relative standard deviation 30%?
Practical 3.28. A pilot survey related to population 1 of the Appendix indicates that
the values of certain parameters of interest are given as
λ_40 = 3.5822, λ_04 = 4.5247 and λ_22 = 2.8411. Use this information to study the
relative efficiency of the regression type estimator s_k^2 = s_y^2 + k( S_x^2 - s_x^2 ) with respect
to the ratio type of estimator s_r^2 = s_y^2 S_x^2 / s_x^2 under two situations:

( a ) estimate the finite population variance of the nonreal estate farm loans using real
estate farm loans as the known benchmark;
( b ) estimate the finite population variance of the real estate farm loans using nonreal
estate farm loans as the known benchmark.

Practical 3.29. A supermarket is worried about the average price of the commercial
apple crop during 1996. The correlation of the price during 1996 ( Y ) with
that during 1995 ( X_2 ) and 1994 ( X_1 ) is assumed to be known. Find the minimum
sample size, n, required to estimate the average price with relative standard
deviation 15% from the population 3 given in the Appendix.
Given: R^2_{y.x1,x2} = 0.8029.

Practical 3.30. Select an SRSWOR sample of the size obtained in Practical 3.29
from population 3 given in the Appendix. Collect the
information about the season's average price per pound of the apple crop during
1996, 1995, and 1994. Estimate the average price per pound ( $ ) of the commercial
apple crop during 1996 in the United States. Assume that the average prices per
pound of the commercial apple crop during 1995 and 1994 are accurately known.
Apply the regression estimator of the population mean with two auxiliary variables.
Derive the 95% confidence interval estimates using ( a ) the superpopulation model
approach, and ( b ) the design based approach.
Given: X̄_1 = 0.1856 and X̄_2 = 0.1708.

Practical 3.31. A private organisation PQR was interested in estimating the average
amount of real estate farm loans (in $000) during 1997 in the United States. The
organisation collected information from 15 states, the real estate ( y ) and nonreal
estate ( x ) farm loans, included in an SRSWOR sample taken from a list of 50
states, and observed the following results:

Σ_{i=1}^{n} x_i = 18867.089,  Σ_{i=1}^{n} y_i = 12525.246,  Σ_{i=1}^{n} x_i^2 = 48780336.98,
Σ_{i=1}^{n} y_i^2 = 18001501.38,  and  Σ_{i=1}^{n} x_i y_i = 26591710.56.

The average amount of nonreal estate farm loans, $878.16 (in $000), for the year
1997 is known.
( a ) Obtain a ratio estimate for the average amount of the real estate farm loans (in
$000) during 1997. Develop an estimator of the mean squared error of the ratio
estimator and hence deduce the 95% confidence interval.
( b ) Obtain a regression estimate for the average amount of the real estate farm
loans (in $000) during 1997. Develop an estimator of the mean squared error of the
regression estimator and hence deduce the 95% confidence interval.
( c ) Comment on the interval estimates.
Answer: (a) [329.37 , 836.58] , (b) [460.08, 881.46] .
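The point estimates underlying the two interval answers can be reproduced from the summary totals. A minimal Python sketch (the interval endpoints themselves also need the MSE estimators, which the practical asks for):

```python
# Sketch for Practical 3.31: point estimates behind the interval answers.
# Computes the ratio and regression estimates of the mean real estate
# farm loan from the summary totals given above (n = 15, Xbar = 878.16).

n, N = 15, 50
sum_x, sum_y = 18867.089, 12525.246
sum_x2 = 48780336.98
sum_xy = 26591710.56
Xbar = 878.16  # known population mean of nonreal estate loans

xbar, ybar = sum_x / n, sum_y / n
sx2 = (sum_x2 - n * xbar**2) / (n - 1)    # sample variance of x
sxy = (sum_xy - n * xbar * ybar) / (n - 1)  # sample covariance

ratio_est = (ybar / xbar) * Xbar          # ratio estimate of the mean
b = sxy / sx2                             # sample regression coefficient
reg_est = ybar + b * (Xbar - xbar)        # regression estimate of the mean

print(round(ratio_est, 2), round(reg_est, 2))
```

The two point estimates fall at the midpoints of the 95% intervals given in the answer, roughly 582.98 for the ratio estimator and 670.77 for the regression estimator.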

Practical 3.32. From a population consisting of 100 individuals with their weekly
income ( y ) and age ( x ), we have the following information:

Σ_{i=1}^{N} x_i = 3,500,  Σ_{i=1}^{N} y_i = 50,000,  Σ_{i=1}^{N} x_i^2 = 850,000,
Σ_{i=1}^{N} y_i^2 = 90,000,000  and  Σ_{i=1}^{N} x_i y_i = 7,500,000.

If we consider a sample of n = 20 units, find the following :

( a ) Compute the variance of the sample mean estimator;


( b ) Compute the mean square error of the ratio estimator;
( c ) Find the relative efficiency of the ratio estimator with respect to usual sample
mean estimator;
( d ) Find the mean square error of the regression estimator;
( e ) Find the relative efficiency of the regression estimator with respect to both
ratio and sample mean estimator.
Answer: ( a ) 26262.63; ( b ) 19872.19; ( c ) 132.15%; ( d ) 7900.31;
( e ) 251.53% and 332.42%.
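The listed answers follow from the totals via the standard SRSWOR formulas Var(ȳ) = (1/n − 1/N)S_y^2, MSE(ratio) ≈ (1/n − 1/N)(S_y^2 + R^2 S_x^2 − 2R S_xy) and MSE(regression) ≈ (1/n − 1/N)S_y^2(1 − ρ^2). A sketch that reproduces them:

```python
# Sketch for Practical 3.32: reproducing the listed answers from the
# population totals (N = 100, n = 20) with the usual SRSWOR formulas.

N, n = 100, 20
sum_x, sum_y = 3500, 50000
sum_x2, sum_y2 = 850000, 90000000
sum_xy = 7500000

Xbar, Ybar = sum_x / N, sum_y / N
Sy2 = (sum_y2 - N * Ybar**2) / (N - 1)
Sx2 = (sum_x2 - N * Xbar**2) / (N - 1)
Sxy = (sum_xy - N * Xbar * Ybar) / (N - 1)
f = 1 / n - 1 / N              # finite population correction factor
R = Ybar / Xbar                # population ratio
rho2 = Sxy**2 / (Sx2 * Sy2)    # squared correlation coefficient

var_mean = f * Sy2                                  # ( a )
mse_ratio = f * (Sy2 + R**2 * Sx2 - 2 * R * Sxy)    # ( b )
mse_reg = f * Sy2 * (1 - rho2)                      # ( d )
re_ratio = 100 * var_mean / mse_ratio               # ( c )
re_reg_ratio = 100 * mse_ratio / mse_reg            # ( e ), vs ratio
re_reg_mean = 100 * var_mean / mse_reg              # ( e ), vs sample mean

print(round(var_mean, 2), round(mse_ratio, 2), round(mse_reg, 2))
print(round(re_ratio, 2), round(re_reg_ratio, 2), round(re_reg_mean, 2))
```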

Practical 3.33. An estimate of the total fertility rate (TFR) in the world is helpful to
policy makers in each country. The fertility rate has been found to
have a relationship with the crude birth rate (CBR), crude death rate (CDR) and infant
mortality rate (IMR). The average CBR, CDR and IMR for 96 countries during
1997 have been found to be 26.011, 10.872 and 50.138, respectively. To estimate
the total fertility rate (TFR) in the world, a team of consultants takes an SRSWOR
sample of 20 countries out of a list of 96 countries as given in the following table:

Sr. No. | Country or area | CBR  | CDR  | IMR   | TFR
      3 | Algeria         | 26.5 |  5.4 |  42.2 | 3.16
      9 | Bangladesh      | 27.4 | 10.1 |  93.0 | 3.08
     14 | Bulgaria        | 12.5 | 13.6 |  14.8 | 1.74
     19 | Cameroon        | 41.4 | 13.9 |  74.1 | 5.73
     23 | China           | 15.0 |  6.8 |  32.6 | 1.81
     28 | Czech Republic  | 13.3 | 11.0 |   8.0 | 1.74
     32 | Ethiopia        | 44.3 | 17.6 | 117.7 | 6.75
     36 | Greece          | 10.7 |  9.5 |   6.6 | 1.52
     37 | Guatemala       | 31.2 |  6.5 |  44.6 | 4.00
     42 | India           | 23.5 |  8.9 |  63.5 | 2.90
     44 | Iran            | 29.2 |  5.8 |  45.1 | 3.90
     48 | Kazakhstan      | 18.8 |  9.8 |  59.6 | 2.29
     51 | Korea, South    | 15.8 |  5.7 |   7.4 | 1.80
     57 | Morocco         | 25.1 |  5.1 |  32.9 | 3.13
     62 | Nigeria         | 41.5 | 11.7 |  63.7 | 5.95
     66 | Poland          | 14.2 | 10.2 |  11.8 | 1.93
     69 | Russia          | 14.5 | 14.8 |  21.6 | 1.95
     76 | Spain           | 12.0 |  9.1 |   5.7 | 1.51
     82 | Taiwan          | 14.4 |  5.7 |   6.3 | 1.75
     95 | Zambia          | 43.8 | 25.8 |  97.6 | 6.28

Apply the following estimator to estimate the TFR of all 96 countries listed in
population 8 of the Appendix:

ȳ_r = ȳ + Ĥ_11( u_1 - 1 ) + Ĥ_12( u_2 - 1 ) + Ĥ_13( u_3 - 1 ),

where u_j = x̄_j / X̄_j and the estimates Ĥ_1j, j = 1, 2, 3, are obtained by solving the
normal equations:

ĉ_x1^2 Ĥ_11 + r̂_x1x2 ĉ_x1 ĉ_x2 Ĥ_12 + r̂_x1x3 ĉ_x1 ĉ_x3 Ĥ_13 = -ȳ r̂_yx1 ĉ_y ĉ_x1,
r̂_x2x1 ĉ_x2 ĉ_x1 Ĥ_11 + ĉ_x2^2 Ĥ_12 + r̂_x2x3 ĉ_x2 ĉ_x3 Ĥ_13 = -ȳ r̂_yx2 ĉ_y ĉ_x2,
r̂_x3x1 ĉ_x3 ĉ_x1 Ĥ_11 + r̂_x3x2 ĉ_x3 ĉ_x2 Ĥ_12 + ĉ_x3^2 Ĥ_13 = -ȳ r̂_yx3 ĉ_y ĉ_x3.



Practical 3.34. Consider the following population consisting of five ( N = 5 ) units
P, Q, R, S and T where for each one of the units in the population two variables
Y and X are measured.

Unit  | P  | Q  | R  | S  | T
Y_i   | 9  | 11 | 13 | 16 | 21
X_i   | 24 | 20 | 19 | 18 | 14

( a ) Find the population medians M_y and M_x of the study variable and auxiliary
variable, respectively.
( b ) Select all possible samples of three units ( n = 3 ) with SRSWOR sampling.
( c ) Find the estimates of the medians M̂_y and M̂_x from each sample.
( d ) Find the exact bias in the estimator M̂_y using the definition.
( e ) Find the exact mean square error of the estimator M̂_y using the definition.
( f ) Assuming that the median M_x of the auxiliary variable is known, find the
product estimates of the median M̂_p = M̂_y ( M̂_x / M_x ) from each sample.
( g ) Find the exact bias in the product estimator M̂_p using the definition.
( h ) Find the exact mean square error of the product estimator M̂_p using the
definition.
( i ) Find the relative efficiency of the product estimator M̂_p with respect to the
sample estimator M̂_y.

Practical 3.35. Consider the following population consisting of five ( N = 5 ) units
A, B, C, D, and E where for each one of the units in the population two variables
Y and X are measured.

Unit  | A   | B   | C   | D   | E
Y_i   | 190 | 210 | 230 | 260 | 290
X_i   | 142 | 182 | 192 | 202 | 242

( a ) Select all possible samples each of n = 3 units.
( b ) Estimate the population mean using the ratio estimator from each sample. Find the
exact bias and exact mean square error of the ratio estimator.
( c ) Estimate the population mean using the Hartley and Ross (1954) estimator
from each sample. Find the exact bias and exact mean square error.
( d ) Find the loss in relative efficiency of the Hartley and Ross estimator at the cost
of reducing bias.
( e ) Give your views on the results.

Practical 3.36. Consider the following population consisting of seven ( N = 7 )
units P, Q, R, S, T, U and V, where for each one of the units in the population
two variables Y and X are measured.

Unit  | P  | Q  | R  | S  | T  | U  | V
Y_i   | 9  | 11 | 13 | 19 | 21 | 26 | 29
X_i   | 10 | 26 | 28 | 25 | 36 | 37 | 40

( a ) Find the population medians M_y and M_x of the study variable and auxiliary
variable, respectively.
( b ) Select all possible samples of three units ( n = 3 ) with SRSWOR sampling.
( c ) Find the estimates of the medians M̂_y and M̂_x from each sample.
( d ) Find the exact bias in the estimator M̂_y using the definition.
( e ) Find the exact mean square error of the estimator M̂_y using the definition.
( f ) Assuming that the median M_x of the auxiliary variable is known, find the
linear regression type estimator of the median M̂_LR = M̂_y + r( M_x - M̂_x ) from each
sample, where

r = 4( F_11 - 0.25 ) { f_x( M_x ) / f_y( M_y ) },

and given that f( x ) = 1/30 for 10 ≤ x ≤ 40, f( y ) = 1/20 for 9 ≤ y ≤ 29 and
f( x, y ) = 1/961 for 9 ≤ x, y ≤ 40.
( g ) Find the exact bias in the linear regression type estimator M̂_LR using the
definition.
( h ) Find the exact mean square error of the linear regression type estimator M̂_LR
using the definition.
( i ) Find the relative efficiency of the linear regression type estimator M̂_LR with
respect to the sample estimator M̂_y.

Hint: F_11 = ∫_9^{M_y} ∫_9^{M_x} f( x, y ) dx dy, Singh and Joarder (2002).

Practical 3.37. From a population consisting of 200 individuals with their weekly
income ( y ) and age ( x ), we have the following information:

Σ_{i=1}^{N} x_i = 18867.089,  Σ_{i=1}^{N} y_i = 12525.246,  Σ_{i=1}^{N} x_i^2 = 48780336.98,
Σ_{i=1}^{N} y_i^2 = 18001501.38  and  Σ_{i=1}^{N} x_i y_i = 26591710.56.

( I ) If we consider 10% of the population as an SRSWOR sample of units, then
find the following:

( a ) Compute the variance of the sample mean estimator;
( b ) Compute the mean square error of the ratio estimator;
( c ) Find the relative efficiency of the ratio estimator with respect to the usual sample
mean estimator;
( d ) Compute the mean square error of the regression estimator;
( e ) Find the relative efficiency of the regression estimator with respect to the ratio
estimator;
( f ) Find the relative efficiency of the regression estimator with respect to the
sample mean estimator.

( II ) If we consider 20% of the population as an SRSWOR sample of units, then
again repeat ( a ) to ( f ), and report the change in relative efficiencies of the ratio
and regression estimators with respect to the usual sample mean estimator. Comment
on your findings.

Practical 3.38. The following map shows states in the USA that have beaches. We
wish to estimate the average of certain parameters of interest with the beaches
across the USA.

[Map: Beaches in the USA, showing the coastal states, HI, other Pacific islands,
and other Atlantic islands. Source: Printed with permission from NOAA.]

( a ) Make a list of all the states that have beaches in the USA and arrange them in
alphabetical order. (Rule: Use two-letter abbreviations for sorting the states.)

( b ) Select an SRSWOR sample of 5 states from the sorted list of states. (Rule :
Start from the first two columns of the Pseudo-Random Number Table given in the
Appendix). Collect the information on the number of immigrants admitted during
1996 in these states from the population 9 given in the Appendix.
( c ) Estimate the total number of immigrants admitted to these states during 1996
and construct a 95% confidence interval estimate.
( d ) Assume that the number of immigrants admitted to all the states shown in the
above map are known for the year 1994. Using this information, apply the ratio
estimator to estimate the total number of immigrants during 1996. Also construct
95% confidence interval estimate for it.
( e ) Apply the regression estimator to estimate the total number of immigrants
during 1996, and compare the resultant 95% confidence interval estimate with the other
two cases.
( f ) Use the Jackknife to estimate the variance of the ratio estimator, and construct a
95% confidence interval estimate for the total number of immigrants during 1996.
( g ) Use the Jackknife to estimate the variance of the regression estimator, and
construct a 95% confidence interval estimate for the total number of immigrants
during 1996.

Optional:

( h ) Collect information from the internet about the current temperature in the
states selected in your sample. Estimate the average temperature on the beaches in
the USA , and construct a 95% confidence interval estimate.
( i ) Collect information about the precipitation (or any other auxiliary variable
related to temperature, for example the number of visitors) for all the beaches in states
of the USA. Use this information to apply the ratio (or product) estimator to find the
average temperature on the beaches, and construct a 95% confidence interval
estimate.
(j) Use the regression estimator to estimate the average temperature and develop
95% confidence interval estimate.
( k ) Use the Jackknife to estimate the variance of the ratio (or product) estimator,
and construct 95% confidence interval estimate of the average temperature across
all beaches in the USA.
( l ) Use the Jackknife to estimate the variance of the regression estimator, and
construct a 95% confidence interval estimate of the average temperature across all
beaches in the USA.
( m ) Discuss your interval estimates in each case.
4. USE OF AUXILIARY INFORMATION: PROBABILITY
PROPORTIONAL TO SIZE AND WITH REPLACEMENT
(PPSWR) SAMPLING

4.0 INTRODUCTION

Through the previous chapter we have seen that the proper use of auxiliary
information at the estimation stage for estimating any population parameter results
in a gain in the efficiency of the resultant estimators. For example, the product and
ratio estimators of the population mean remain better than the sample mean when
the correlation between the study variable and the auxiliary variable lies in the
interval [-1.0, -0.5) and (+0.5, +1.0], respectively. In this chapter we shall show
that the auxiliary information can also be used to select a sample which can provide
better estimators of population parameters. In other words, the auxiliary
information can be used at the sample selection stage as well as at the estimation
stage. A sampling scheme with replacement in which each sampling unit has an
unequal probability of selection, the probability being proportional to the size of the
auxiliary variable associated with the particular unit, is called the probability
proportional to size and with replacement (PPSWR) sampling scheme.

4.1 WHAT IS PPSWR SAMPLING?

Let Y be a study variable and X be an auxiliary variable. For example, consider
we want to estimate the population in the villages of a particular district. Then we
would choose as our auxiliary variable a variable on which we have information,
e.g.:
( a ) Area of each village of the district (correlation with the study variable = 0.70,
say);
( b ) Number of households in each village of the district (correlation with the study
variable = 0.85, say).
On the basis of the above information, we would choose the auxiliary variable
which has maximum correlation with the study variable. Thus the variable at ( b )
may be a more useful auxiliary variable when selecting a sample using PPSWR
sampling.
Let us explain the method of PPSWR sampling with the help of a simple example.
Consider a population consisting of N = 4 units, viz., A, B, C, and D. Consider there
are two variables Y and X associated with each unit as follows:

Unit number or identifier               | A | B | C  | D
Values of a study variable ( Y_i )      | 2 | 4 | 5  | 6
Values of an auxiliary variable ( X_i ) | 4 | 8 | 10 | 12

S. Singh, Advanced Sampling Theory with Applications. © Kluwer Academic Publishers 2003.

Obviously we have X = Σ_{i=1}^{4} X_i = 34 and Y = Σ_{i=1}^{4} Y_i = 17. Consider we wish to draw a
sample of n = 2 units by using PPSWR sampling. Since we are using WR
sampling, the total number of possible samples will be N^n = 4^2 = 16; they are listed
in Table 4.1.1. Evidently we have

E( Ŷ ) = Σ_{j=1}^{N^n} p( j ) Ŷ_j = (16/1156)×17 + (32/1156)×17 + ... + (144/1156)×17 = 17 = Y.   (4.1.1)

Hence

Ŷ = ( X / n ) Σ_{i=1}^{n} ( y_i / x_i )

is an unbiased estimator of the population total Y. Now the variance of this estimator is
given by

V( Ŷ ) = E[ Ŷ - E( Ŷ ) ]^2 = Σ_{j=1}^{N^n} p( j )( Ŷ_j - E( Ŷ ) )^2
       = (16/1156)(17 - 17)^2 + ... + (144/1156)(17 - 17)^2 = 0.   (4.1.2)

The possible 16 samples are shown in the following table:

Table 4.1.1. Possible 16 with replacement samples.

Sample j | Sample | ( y_1, x_1 ) | ( y_2, x_2 ) | y_i / x_i | Probability p( j ) of selecting the j-th sample with the PPSWR design | Ŷ_j = ( X / n ) Σ y_i / x_i
 1 | (A, A) | (2, 4)  | (2, 4)  | 0.5, 0.5 | (4/34)×(4/34)   = 16/1156  | 17
 2 | (A, B) | (2, 4)  | (4, 8)  | 0.5, 0.5 | (4/34)×(8/34)   = 32/1156  | 17
 3 | (A, C) | (2, 4)  | (5, 10) | 0.5, 0.5 | (4/34)×(10/34)  = 40/1156  | 17
 4 | (A, D) | (2, 4)  | (6, 12) | 0.5, 0.5 | (4/34)×(12/34)  = 48/1156  | 17
 5 | (B, A) | (4, 8)  | (2, 4)  | 0.5, 0.5 | (8/34)×(4/34)   = 32/1156  | 17
 6 | (B, B) | (4, 8)  | (4, 8)  | 0.5, 0.5 | (8/34)×(8/34)   = 64/1156  | 17
 7 | (B, C) | (4, 8)  | (5, 10) | 0.5, 0.5 | (8/34)×(10/34)  = 80/1156  | 17
 8 | (B, D) | (4, 8)  | (6, 12) | 0.5, 0.5 | (8/34)×(12/34)  = 96/1156  | 17
 9 | (C, A) | (5, 10) | (2, 4)  | 0.5, 0.5 | (10/34)×(4/34)  = 40/1156  | 17
10 | (C, B) | (5, 10) | (4, 8)  | 0.5, 0.5 | (10/34)×(8/34)  = 80/1156  | 17
11 | (C, C) | (5, 10) | (5, 10) | 0.5, 0.5 | (10/34)×(10/34) = 100/1156 | 17
12 | (C, D) | (5, 10) | (6, 12) | 0.5, 0.5 | (10/34)×(12/34) = 120/1156 | 17
13 | (D, A) | (6, 12) | (2, 4)  | 0.5, 0.5 | (12/34)×(4/34)  = 48/1156  | 17
14 | (D, B) | (6, 12) | (4, 8)  | 0.5, 0.5 | (12/34)×(8/34)  = 96/1156  | 17
15 | (D, C) | (6, 12) | (5, 10) | 0.5, 0.5 | (12/34)×(10/34) = 120/1156 | 17
16 | (D, D) | (6, 12) | (6, 12) | 0.5, 0.5 | (12/34)×(12/34) = 144/1156 | 17

The following graph shows the relationship between Y and X as X = 2Y, i.e.,
Y = X/2, with θ = tan⁻¹(1/2).

[Graph: Relation between Y and X; the points lie on the line Y = X/2 through the
origin.]

Fig. 4.1.1 Linear relationship with zero intercept.



We observed that the variance of the estimator reduces to zero if an exact linear
relationship exists between the study variable and the auxiliary variable. As the
direct proportionality between Y and X deviates, the variance of the estimator
increases. If Y and X are perfectly correlated and the regression line passes
through the origin, then the relationship is of the type shown in Fig. 4.1.1.
Note that if Y and X are perfectly correlated but the regression line does not pass
through the origin, then the accuracy of the estimators under a PPSWR design will
be decreased. Thus the conditions determining the use of the PPSWR sampling
scheme being more efficient than SRSWR sampling are:

( a ) there is high positive correlation between Y and X; and
( b ) the line of regression of Y on X passes through the origin.

Let us now consider the following graph:

[Graph: Relation between Y and X; the points lie on a straight line with positive
slope and a positive intercept, not passing through the origin.]

Fig. 4.1.2 Linear relationship with non-zero intercept.

The above regression line shows a perfect positive correlation, but it does not pass
through the origin, and therefore the second condition is not satisfied. In this case,
the accuracy of estimators will decrease. To discuss the effect on accuracy, let us
again consider a population consisting of N = 4 units, viz., A, B, C and D.
Assume there are two variables Y and X associated with each other through the
relation Y = 2 + 0.5X as follows:

Unit number or identifier               | A | B | C | D
Values of a study variable ( Y_i )      | 3 | 4 | 5 | 6
Values of an auxiliary variable ( X_i ) | 2 | 4 | 6 | 8

Obviously we have X = Σ_{i=1}^{4} X_i = 20 and Y = Σ_{i=1}^{4} Y_i = 18. Consider we wish to draw a
sample of n = 2 units by using PPSWR sampling. Since we are using with
replacement (WR) sampling, the total number of possible samples will be
N^n = 4^2 = 16. The possible 16 samples are shown in Table 4.1.2.

Evidently we have

E( Ŷ ) = Σ_{j=1}^{N^n} p( j ) Ŷ_j = (4/400)×30 + (8/400)×25 + ... + (64/400)×15 = 18 = Y.   (4.1.3)

Hence Ŷ = ( X / n ) Σ_{i=1}^{n} ( y_i / x_i ) is again an unbiased estimator of the population total Y.
However, the variance of this estimator is

V( Ŷ ) = (4/400)(30 - 18)^2 + (8/400)(25 - 18)^2 + ... + (64/400)(15 - 18)^2 = 9.667,   (4.1.4)

which is obviously greater than zero.

Thus both conditions ( a ) and ( b ) must be satisfied for PPSWR sampling to


provide more efficient estimators.

Table 4.1.2. Possible 16 with replacement samples.

Sample j | Sample | ( y_1, x_1 ) | ( y_2, x_2 ) | y_i / x_i    | Probability p( j )     | Ŷ_j
 1 | (A, A) | (3, 2) | (3, 2) | 1.5, 1.5     | (2/20)×(2/20) = 4/400  | 30.00
 2 | (A, B) | (3, 2) | (4, 4) | 1.5, 1.0     | (2/20)×(4/20) = 8/400  | 25.00
 3 | (A, C) | (3, 2) | (5, 6) | 1.5, 0.833   | (2/20)×(6/20) = 12/400 | 23.33
 4 | (A, D) | (3, 2) | (6, 8) | 1.5, 0.75    | (2/20)×(8/20) = 16/400 | 22.50
 5 | (B, A) | (4, 4) | (3, 2) | 1.0, 1.5     | (4/20)×(2/20) = 8/400  | 25.00
 6 | (B, B) | (4, 4) | (4, 4) | 1.0, 1.0     | (4/20)×(4/20) = 16/400 | 20.00
 7 | (B, C) | (4, 4) | (5, 6) | 1.0, 0.833   | (4/20)×(6/20) = 24/400 | 18.33
 8 | (B, D) | (4, 4) | (6, 8) | 1.0, 0.75    | (4/20)×(8/20) = 32/400 | 17.50
 9 | (C, A) | (5, 6) | (3, 2) | 0.833, 1.5   | (6/20)×(2/20) = 12/400 | 23.33
10 | (C, B) | (5, 6) | (4, 4) | 0.833, 1.0   | (6/20)×(4/20) = 24/400 | 18.33
11 | (C, C) | (5, 6) | (5, 6) | 0.833, 0.833 | (6/20)×(6/20) = 36/400 | 16.67
12 | (C, D) | (5, 6) | (6, 8) | 0.833, 0.75  | (6/20)×(8/20) = 48/400 | 15.83
13 | (D, A) | (6, 8) | (3, 2) | 0.75, 1.5    | (8/20)×(2/20) = 16/400 | 22.50
14 | (D, B) | (6, 8) | (4, 4) | 0.75, 1.0    | (8/20)×(4/20) = 32/400 | 17.50
15 | (D, C) | (6, 8) | (5, 6) | 0.75, 0.833  | (8/20)×(6/20) = 48/400 | 15.83
16 | (D, D) | (6, 8) | (6, 8) | 0.75, 0.75   | (8/20)×(8/20) = 64/400 | 15.00
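Both worked examples can be verified by brute force: enumerate all N^n = 16 ordered samples, weight each estimate by its selection probability, and accumulate the mean and variance. A short sketch:

```python
# Sketch verifying the two worked PPSWR examples by brute force:
# enumerate all N^n with-replacement samples, weight each by its
# selection probability, and compute the mean and variance of the
# estimator Yhat = (X/n) * sum(y_i / x_i).

from itertools import product

def ppswr_mean_var(y, x, n=2):
    X, N = sum(x), len(x)
    p = [xi / X for xi in x]                  # selection probabilities
    mean = second_moment = 0.0
    for s in product(range(N), repeat=n):     # all N^n ordered samples
        prob = 1.0
        for i in s:
            prob *= p[i]
        est = (X / n) * sum(y[i] / x[i] for i in s)
        mean += prob * est
        second_moment += prob * est ** 2
    return mean, second_moment - mean ** 2

# Population with Y = X/2 (line through the origin): zero variance.
m1, v1 = ppswr_mean_var([2, 4, 5, 6], [4, 8, 10, 12])
# Population with Y = 2 + 0.5 X (non-zero intercept): variance 9.667.
m2, v2 = ppswr_mean_var([3, 4, 5, 6], [2, 4, 6, 8])
print(m1, v1)   # approximately 17 and 0 (up to float rounding)
print(m2, v2)   # approximately 18 and 9.667
```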

Remark 4.1.1. ( a ) The probability of selecting the i-th unit in a particular sample by
using PPSWR sampling is given by p_i = X_i / X, where X = Σ_{i=1}^{N} X_i, for i = 1, 2, ..., N.
( b ) The probability of selecting any particular sample s of n units from a
population Ω of N units by using PPSWR sampling is given by

p( s ) = p_1 p_2 ... p_n = Π_{i=1}^{n} p_i, where s = 1, 2, 3, ..., N^n.   (4.1.5)

There are several methods of selecting a sample by using PPSWR sampling, but we
will discuss here only two methods:

( a ) Cumulative total method; ( b) Lahiri's method.

4.1.1 CUMULATIVE TOTAL METHOD

Consider we have N serial numbers or identifiable labels of the units in a
population Ω. Consider we wish to select a sample s of size n with probability
proportional to the size of the auxiliary variable X. The first step is to prepare a
table of cumulative totals on the auxiliary variable as shown below:

Cumulative Total Table

Sr. No. | Auxiliary Variable ( X_i ) | Cumulative Total       | Abbreviation
1       | X_1                        | X_1                    | T_1
2       | X_2                        | X_1 + X_2              | T_2
3       | X_3                        | X_1 + X_2 + X_3        | T_3
...     | ...                        | ...                    | ...
N       | X_N                        | X_1 + X_2 + ... + X_N  | T_N
Total   | X = Σ_{i=1}^{N} X_i        | Remark: Note that X = T_N

The second step is to select n random numbers ( R_i, i = 1, 2, ..., n ), say, between 0
and T_N. Consider the first random number selected is R_1; now if T_{i-1} < R_1 ≤ T_i,
then the i-th population unit X_i is selected to be included in the sample. Then a
second random number R_2 is selected and again tested in the same manner, and the
process is continued till n units are selected. These n selected units will form a
sample s of size n with the PPSWR sampling scheme. A pictorial representation
of the method is shown below:

<-- X_1 --> <-- X_2 --> <-- X_3 --> ... <-- X_i --> ... <-- X_N -->
T_0        T_1         T_2         T_3 ... T_{i-1}   T_i ... T_{N-1}   T_N

Fig. 4.1.1.1 Cumulative total method.


The probability of any random number falling in any particular interval is
proportional to its length; that is, the probability of the i-th unit getting selected in
a sample is

P_i = ( T_i - T_{i-1} ) / T_N = X_i / X ∝ X_i.   (4.1.1.1)

Note that X is constant; thus the probability of selection of the i-th population unit is
proportional to the size of the i-th unit of the auxiliary variable. Hence this method is
called a method of probability proportional to size.
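The cumulative total method translates directly into code: build the running totals, draw a random number up to T_N, and locate the interval it falls in. The sketch below assumes integer sizes X_i (for real-valued sizes, draw R with random.uniform(0, T_N) instead) and checks empirically that the selection frequencies approach X_i / X.

```python
# Sketch of the cumulative total method described above: build the running
# totals T_1, ..., T_N, draw a random integer R between 1 and T_N, and
# select the unit i for which T_{i-1} < R <= T_i.

import bisect
import random

def ppswr_cumulative_total(x, n, rng=random):
    totals = []
    running = 0
    for xi in x:
        running += xi
        totals.append(running)              # T_1, T_2, ..., T_N (T_N = X)
    sample = []
    for _ in range(n):
        r = rng.randint(1, totals[-1])      # random number in [1, T_N]
        # first index whose cumulative total is >= r, i.e. T_{i-1} < r <= T_i
        sample.append(bisect.bisect_left(totals, r))
    return sample

# Empirical check on the illustrative population X = (4, 8, 10, 12):
# selection frequencies should approach 4/34, 8/34, 10/34, 12/34.
random.seed(2003)
x = [4, 8, 10, 12]
counts = [0] * len(x)
for _ in range(20000):
    counts[ppswr_cumulative_total(x, n=1)[0]] += 1
props = [c / 20000 for c in counts]
print([round(p, 3) for p in props])
```

The returned indices are 0-based; a with-replacement sample of size n is simply n independent such draws.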

Example 4.1.1. Use the cumulative total method to select a sample of eight units
from population 1 given in the Appendix by using nonreal estate farm loans as an
auxiliary variable.

Solution. The cumulative totals of the auxiliary variable 'nonreal estate farm loans'
are given in column 2 of Table 4.1.1.1.

Table 4.1.1.1.Cumul ative totals (C.T.), random numbers, and states selected.
Sr. No. | C.T. | Random Number | Unit Selected || Sr. No. | C.T. | Random Number | Unit Selected
I 348.334 26 25514.990
2 351.767 27 29100.390
3 783.206 28 29117.100
4 1631.523 01473 AR 29 29117.570
5 5560.255 04981,05365 CA,CA 30 29145.080
6 6466.536 31 29419.120
7 6470.909 32 29845.390
8 6514.138 33 30340.120
9 6978.654 34 31581.490
10 7519.350 35 32217.260 32063 OU
II 7557.4 17 36 33933.350 33313 OK
12 8563.453 37 34504.840
13 11174.030 38 34803.190
14 12196.810 39 34803.420
15 16106.550 40 34884.170
16 18686.850 41 36576.990
17 19244.510 42 36965.860
18 19650.300 43 40486.220 38107 TX
19 19701.840 44 40683.460
20 19759.530 45 40702.830
21 19816.000 46 40891.300
22 20256.520 47 42119.910
23 22723.410 22650 MN 48 42149.200
24 23272.960 49 43521.640
25 24792.950 23626 MO 50 43908.120

We used the first five columns of the Pseudo-Random Numbers (PRN) given in
Table 1 of the Appendix to select eight random numbers between 1 and
T_N = 43908. These selected random numbers came in the sequence 01473,
23626, 04981, 32063, 33313, 05365, 22650 and 38107. These random numbers
have been shown in column 3 of Table 4.1.1.1. The last column of this table
shows the states selected in the sample from population 1 given in the
Appendix. It is to be noted that the state CA has been selected twice.

Remark 4.1.1.1. The difficulty in this method is in calculating the cumulative totals
when the population size N is too large. Thus for large populations we shall
discuss another method, called Lahiri's method.

4.1.2 LAHIRI'S METHOD

Lahiri (1951) introduced a new method, which does not need cumulative totals, for
selecting a PPSWR sample; but in this method we need to know the maximum
value of X_i, which we denote by X_0. Sometimes it is not possible to know the
maximum value of X_i, e.g., X = number of errors in a book. In such cases, we
choose X_0 to be more than the maximum among all values of X_i. In other words,
we choose X_0 such that X_0 ≥ Max( X_1, X_2, ..., X_N ). Consider we expect (depending
upon the quality of printing) that the maximum number of errors in a book may be
100. Then if we choose any value of X_0 greater than or equal to 100, for example
X_0 = 200, it will not affect the sampling procedure too much. The steps for
selecting a sample by using Lahiri's method are as follows:

( a ) Select a random number R_i such that 1 ≤ R_i ≤ N.
( b ) Select another random number R_j such that 1 ≤ R_j ≤ X_0.
( c ) Compare the magnitude of the random number R_j with that of X_i. If the
magnitude of R_j ≤ X_i, then the unit with serial number R_i is selected to be
included in the sample; otherwise it is rejected. The process is continued till we
select a random sample of the desired size.

Remarks 4.1.2.1.

( i ) Any draw on which a unit is selected is called an effective draw; otherwise it is
called an ineffective draw.
( ii ) One should note that the larger the difference between the exact maximum and
X_0, the larger the number of rejections.
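Steps ( a )-( c ) can be sketched in code as follows; the empirical check illustrates that the effective-draw probabilities come out proportional to X_i.

```python
# Sketch of Lahiri's method as described in steps (a)-(c): repeatedly draw
# a label R_i uniformly from 1..N and a value R_j uniformly from 1..X0,
# and accept the labelled unit as soon as R_j <= X_i (an effective draw).

import random

def lahiri_draw(x, x0, rng=random):
    n_units = len(x)
    while True:
        i = rng.randrange(n_units)      # step (a): pick a label (0-based here)
        rj = rng.randint(1, x0)         # step (b): pick a trial value
        if rj <= x[i]:                  # step (c): effective draw -> select
            return i

def lahiri_sample(x, x0, n, rng=random):
    # n independent effective draws give a PPSWR sample of size n
    return [lahiri_draw(x, x0, rng) for _ in range(n)]

# Empirical check on the population X = (4, 8, 10, 12) with X0 = 12:
# selection frequencies should approach X_i / X = 4/34, 8/34, 10/34, 12/34.
random.seed(1951)
x = [4, 8, 10, 12]
counts = [0] * len(x)
for _ in range(20000):
    counts[lahiri_draw(x, x0=12)] += 1
props = [c / 20000 for c in counts]
print([round(p, 3) for p in props])
```

Choosing X_0 much larger than Max(X_i) leaves the frequencies unchanged but increases the number of ineffective draws, exactly as Remark ( ii ) warns.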

Now we have the following theorem:

Theorem 4.1.2.1. With Lahiri's method of PPSWR sampling, the probability
for the i-th population unit to get selected on the first effective draw is proportional to
the size of the i-th unit of the auxiliary variable. In other words, the probability of
selecting the i-th unit in the sample by using PPSWR sampling is given by

P_i = X_i / X, where X = Σ_{i=1}^{N} X_i.   (4.1.2.1)

Proof. We shall not take into account those draws which are ineffective, because
units are selected only on the effective draws. There are two possibilities, i.e., either
the draw is effective or not effective. Thus if the first draw is effective then the
corresponding unit is selected, and if the first draw is ineffective then we go to the
second draw, and so on.

The probability for the $i$th unit to be selected on the first draw if it is effective is
$$\frac{x_i}{N X_0},$$
where $1/N$ is the probability for a particular unit to be selected and $x_i/X_0$ is the probability of an effective draw.

The probability of the first draw not being effective, and consequently the probability for the $i$th unit not being selected on the first draw, is
$$1 - \frac{x_i}{N X_0}.$$
Therefore the probability that no unit among the $N$ units is selected on the first draw is
$$1 - \sum_{i=1}^{N}\frac{x_i}{N X_0} = 1 - \frac{\bar{X}}{X_0}, \quad\text{where}\quad \bar{X} = N^{-1}\sum_{i=1}^{N} x_i.$$

Now the probability for the $i$th unit to be selected on the second draw (in case the first draw was ineffective and the second draw is effective) is
$$\left(1 - \frac{\bar{X}}{X_0}\right)\left(\frac{x_i}{N X_0}\right). \qquad (4.1.2.2)$$

Similarly, if the first and second draws are ineffective and the third is effective, then the probability for the $i$th unit to be selected on the third draw is
$$\left(1 - \frac{\bar{X}}{X_0}\right)^2\left(\frac{x_i}{N X_0}\right),$$
and so on.
Hence the probability that the $i$th unit will be selected on the first effective draw is
$$\frac{x_i}{N X_0} + \left(1 - \frac{\bar{X}}{X_0}\right)\frac{x_i}{N X_0} + \left(1 - \frac{\bar{X}}{X_0}\right)^2\frac{x_i}{N X_0} + \cdots = \frac{x_i}{N X_0}\cdot\frac{X_0}{\bar{X}} = \frac{x_i}{X}. \qquad (4.1.2.3)$$
Hence the theorem.
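The geometric series in (4.1.2.3) can be checked numerically. The snippet below (an illustration with made-up size measures) truncates the series after many terms and compares it with $x_i/X$:

```python
N = 5
x = [3.0, 7.0, 10.0, 20.0, 60.0]   # made-up size measures
X0 = 60.0                          # taken equal to max(x)
X = sum(x)
xbar = X / N

def first_effective_prob(xi, terms=200):
    # sum over k >= 0 of (1 - xbar/X0)^k * xi/(N*X0): the probability
    # that unit i is the one selected on the first effective draw
    return sum((1 - xbar / X0) ** k * xi / (N * X0) for k in range(terms))

for xi in x:
    assert abs(first_effective_prob(xi) - xi / X) < 1e-12
```

The truncated sums agree with $x_i/X$, and they add to one over the whole population, as a selection probability must.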

Example 4.1.2.1. Use Lahiri's method to select a sample of eight units from population 1 given in the Appendix by using the nonreal estate farm loans as an auxiliary variable.
Solution. The maximum value of the nonreal estate farm loans is $3928.732. Therefore we decided to choose $X_0 = 4000$. We started with the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to choose a random number $R_i$ between 1 and $N = 50$ and reported it in the second column of
Chapter 4: Use of auxiliary information: PPSWR Sampling 305

Table 4.1.2.1. We used the 7th to 10th columns to choose another random number $R_j$ between 1 and 4000, and the random numbers so obtained have been presented in the third column of Table 4.1.2.1. The value of the nonreal estate farm loans ($x_i$) corresponding to the first random number $R_i$ has been coded in the 4th column. The 5th column of Table 4.1.2.1 has been devoted to making the decision following Lahiri's instructions.
( a ) If $R_j > x_i$ then the pair of selected random numbers $(R_i, R_j)$ is rejected and marked 'R' in the 5th column.
( b ) If $R_j \le x_i$ then the pair of random numbers $(R_i, R_j)$ is selected and marked 'S' in the 5th column.

Table 4.1.2.1. Lahiri's method of sample selection.

Attempt | R_i | R_j  | x_i      | Decision
1       | 01  | 3399 | 348.334  | R
2       | 23  | 0757 | 2466.892 | S
3       | 46  | 3707 | 188.477  | R
4       | 04  | 2194 | 848.317  | R
5       | 32  | 3331 | 426.274  | R
6       | 47  | 2597 | 1228.607 | R
7       | 33  | 0926 | 494.730  | R
8       | 05  | 1557 | 3928.732 | S
9       | 22  | 2958 | 440.518  | R
10      | 38  | 0122 | 298.351  | S
11      | 29  | 3235 | 0.471    | R
12      | 40  | 0990 | 80.750   | R
13      | 46  | 2111 | 188.477  | R
14      | 03  | 0275 | 431.439  | S
15      | 36  | 3363 | 1716.083 | R
16      | 27  | 3900 | 3585.406 | R
17      | 19  | 3548 | 51.539   | R
18      | 29  | 2605 | 0.471    | R
19      | 29  | 0395 | 0.471    | R
20      | 14  | 1574 | 1022.782 | R
21      | 47  | 2207 | 1228.607 | R
22      | 22  | 2013 | 440.518  | R
23      | 42  | 0607 | 388.869  | R
24      | 23  | 2782 | 2466.892 | R
25      | 48  | 2579 | 29.291   | R
26      | 06  | 3849 | 906.281  | R
27      | 07  | 3466 | 4.373    | R
28      | 42  | 0270 | 388.869  | S
29      | 21  | 3490 | 56.471   | R
30      | 31  | 1064 | 274.035  | R
31      | 31  | 1101 | 274.035  | R
32      | 36  | 3770 | 1716.083 | R
33      | 16  | 2300 | 2580.304 | S
34      | 27  | 1036 | 3585.406 | S
35      | 10  | 3688 | 540.696  | R
36      | 18  | 3591 | 405.799  | R
37      | 26  | 0747 | 722.034  | R
38      | 48  | 2486 | 29.291   | R
39      | 02  | 0949 | 3.433    | R
40      | 44  | 0635 | 197.244  | R
41      | 12  | 2000 | 1006.036 | R
42      | 49  | 0905 | 1372.439 | S

Thus this method will include the following states in the sample.

Sample unit   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8
Serial number | 23 | 05 | 38 | 03 | 42 | 16 | 27 | 49
State         | MN | CA | PA | AZ | TN | KS | NE | WI

We have seen that every unit within the sample has a different probability of being selected. Also it follows from (4.1.5) that in the PPSWR sampling scheme, unlike SRSWR or SRSWOR sampling, every sample has a different probability of selection from a given population. Now we shall discuss the problem of estimation of the population total using PPSWR sampling.

4.2 ESTIMATION OF POPULATION TOTAL

Assume we selected a sample $s$ of size $n$ using PPSWR sampling from a finite population $\Omega$ of $N$ units. Let $(y_i, x_i)$, $i = 1,2,\ldots,n$, denote the values of the study variable $y$ and auxiliary variable $x$ for the $n$ units selected in the sample. Let $P_i = x_i/X$, where $X = \sum_{i=1}^{N} x_i$, denote the known and positive probability of selecting the $i$th population unit in the sample for $i = 1,2,\ldots,N$. Following Hansen and Hurwitz (1943) we have the following theorems:

Theorem 4.2.1. An unbiased estimator of the population total $Y = \sum_{i=1}^{N} Y_i$ is given by
$$\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i}{P_i}\right). \qquad (4.2.1)$$

Proof. This can be proved by two different methods as given below.

Method I. We have
$$E(\hat{Y}_{HH}) = E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{P_i}\right] = \frac{1}{n}\sum_{i=1}^{n} E\left(\frac{y_i}{P_i}\right). \qquad (4.2.2)$$

Now $y_i/P_i$ is a random variable and can take the values $Y_1/P_1, Y_2/P_2, \ldots, Y_N/P_N$. Note that the draws are independent; the probability for the unit with value $Y_1/P_1$ to be selected is $P_1$, for the unit with value $Y_2/P_2$ it is $P_2$, and so on.
Therefore
$$E\left(\frac{y_i}{P_i}\right) = \sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i}\right) = \sum_{i=1}^{N} Y_i = Y. \qquad (4.2.3)$$
Using (4.2.3) in (4.2.2) we obtain
$$E(\hat{Y}_{HH}) = \frac{1}{n}\sum_{i=1}^{n} Y = Y. \qquad (4.2.4)$$
Hence the theorem.

Method II. We have
$$E(\hat{Y}_{HH}) = E\left[\frac{1}{n}\sum_{i=1}^{N} r_i\left(\frac{Y_i}{P_i}\right)\right] = \frac{1}{n}\sum_{i=1}^{N} E(r_i)\left(\frac{Y_i}{P_i}\right), \qquad (4.2.5)$$
where $r_i$ is such that
$$r_i = \begin{cases} 1 & \text{if the } i\text{th unit is selected in the sample,} \\ 0 & \text{otherwise.} \end{cases} \qquad (4.2.6)$$

Obviously $r_i$ is a binomial variate and it can take any value between $0$ and $n$ depending upon how many times the $i$th unit is selected in the sample under the PPSWR sampling scheme with probability of success $P_i$.
Thus
$$E(r_i) = nP_i.$$
Therefore (4.2.5) reduces to
$$E(\hat{Y}_{HH}) = \frac{1}{n}\sum_{i=1}^{N} E(r_i)\left(\frac{Y_i}{P_i}\right) = \frac{1}{n}\sum_{i=1}^{N} nP_i\left(\frac{Y_i}{P_i}\right) = \sum_{i=1}^{N} Y_i = Y. \qquad (4.2.7)$$
Hence the theorem.
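Estimator (4.2.1) is a one-liner; the sketch below also verifies unbiasedness exactly for $n = 1$ by enumerating the single draw over all $N$ units (the values are illustrative, not from the book):

```python
def hansen_hurwitz_total(y, p):
    """Hansen-Hurwitz estimator (4.2.1): the mean of y_i/p_i over the n draws."""
    n = len(y)
    return sum(yi / pi for yi, pi in zip(y, p)) / n

# Unbiasedness for n = 1, checked by exact enumeration over the N units:
Y = [10.0, 20.0, 30.0]
P = [0.2, 0.3, 0.5]
expected = sum(pi * hansen_hurwitz_total([yi], [pi]) for yi, pi in zip(Y, P))
assert abs(expected - sum(Y)) < 1e-12
```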

Theorem 4.2.2. The variance of the estimator $\hat{Y}_{HH}$ is given by
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\left(Y_i^2/P_i\right) - Y^2\right]. \qquad (4.2.8)$$

Proof. Again there are two methods to prove this theorem as follows:

Method I. Note that the draws are independent in PPSWR sampling, therefore
$$V(\hat{Y}_{HH}) = V\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{P_i}\right] = \frac{1}{n^2}\sum_{i=1}^{n} V\left(\frac{y_i}{P_i}\right). \qquad (4.2.9)$$

Now we have
$$V(y_i/P_i) = E(y_i/P_i)^2 - \left[E(y_i/P_i)\right]^2 = \sum_{i=1}^{N} P_i\left(Y_i^2/P_i^2\right) - \left[\sum_{i=1}^{N} P_i(Y_i/P_i)\right]^2 = \sum_{i=1}^{N}\left(Y_i^2/P_i\right) - Y^2. \qquad (4.2.10)$$

Further note that $y_i^2/P_i^2$ takes the value $Y_i^2/P_i^2$ with probability $P_i$, and $y_i/P_i$ takes the value $Y_i/P_i$ with probability $P_i$. On using (4.2.10) in (4.2.9) we have the theorem.

Method II. Again using the binomial variate $r_i$ defined in (4.2.6) we have
$$V(\hat{Y}_{HH}) = V\left(\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{P_i}\right) = V\left(\frac{1}{n}\sum_{i=1}^{N}\frac{r_iY_i}{P_i}\right) = \frac{1}{n^2}V\left(\sum_{i=1}^{N}\frac{r_iY_i}{P_i}\right)$$
$$= \frac{1}{n^2}\left[\sum_{i=1}^{N} V\left(\frac{r_iY_i}{P_i}\right) + \sum_{i\ne j=1}^{N}\mathrm{Cov}\left(\frac{r_iY_i}{P_i}, \frac{r_jY_j}{P_j}\right)\right]$$
$$= \frac{1}{n^2}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i^2}V(r_i) + \sum_{i\ne j=1}^{N}\frac{Y_iY_j}{P_iP_j}\mathrm{Cov}(r_i, r_j)\right]. \qquad (4.2.11)$$
Now we have
$$V(r_i) = nP_i(1 - P_i) \qquad (4.2.12)$$
and
$$\mathrm{Cov}(r_i, r_j) = E(r_ir_j) - E(r_i)E(r_j) = E(r_ir_j) - nP_i\cdot nP_j. \qquad (4.2.13)$$
Also, if $E_1$ and $E_2$ denote the expected values over all possible values of $r_i$ and for the given value of $r_i$, respectively, then
$$E(r_ir_j) = E_1E_2[r_ir_j \mid r_i] = E_1\left[r_iE_2(r_j \mid r_i)\right] = E_1\left[r_i(n - r_i)\frac{P_j}{1 - P_i}\right] = \frac{P_j}{1 - P_i}E_1\left[nr_i - r_i^2\right]$$
$$= \frac{P_j}{1 - P_i}\left[n\cdot nP_i - \left\{nP_i(1 - P_i) + (nP_i)^2\right\}\right] = n(n-1)P_iP_j. \qquad (4.2.14)$$
From (4.2.13) and (4.2.14) we have
$$\mathrm{Cov}(r_i, r_j) = n(n-1)P_iP_j - nP_i\cdot nP_j = -nP_iP_j. \qquad (4.2.15)$$
Note that
$$Y^2 = \left(\sum_{i=1}^{N} Y_i\right)^2 = \sum_{i=1}^{N} Y_i^2 + \sum_{i\ne j=1}^{N} Y_iY_j,$$
and putting $V(r_i)$ and $\mathrm{Cov}(r_i, r_j)$ in (4.2.11) we have
$$V(\hat{Y}_{HH}) = \frac{1}{n^2}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i^2}nP_i(1 - P_i) - \sum_{i\ne j=1}^{N}\frac{Y_iY_j}{P_iP_j}nP_iP_j\right] = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i}(1 - P_i) - \sum_{i\ne j=1}^{N} Y_iY_j\right] = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right].$$

Hence the theorem.



Theorem 4.2.3. An unbiased estimator of $V(\hat{Y}_{HH})$ is given by
$$v(\hat{Y}_{HH}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_i^2}{P_i^2} - n\hat{Y}_{HH}^2\right]. \qquad (4.2.16)$$

Proof. We have
$$E\left[v(\hat{Y}_{HH})\right] = \frac{1}{n(n-1)}E\left[\sum_{i=1}^{n}\frac{y_i^2}{P_i^2} - n\hat{Y}_{HH}^2\right] = \frac{1}{n(n-1)}\left[E\left(\sum_{i=1}^{n}\frac{y_i^2}{P_i^2}\right) - nE\left(\hat{Y}_{HH}^2\right)\right]$$
$$= \frac{1}{n(n-1)}\left[n\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - n\left\{V(\hat{Y}_{HH}) + \left(E(\hat{Y}_{HH})\right)^2\right\}\right]$$
$$= \frac{1}{(n-1)}\left[\left\{\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right\} - \frac{1}{n}\left\{\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right\}\right] = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right] = V(\hat{Y}_{HH}).$$

Hence the theorem.
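The variance (4.2.8) and its unbiased estimator (4.2.16) can be coded directly. The exact enumeration below over all ordered PPSWR samples of size two from a toy population (my own numbers) confirms that $E[v(\hat{Y}_{HH})] = V(\hat{Y}_{HH})$:

```python
import itertools

def var_hh(Y, P, n):
    """Exact variance (4.2.8): (1/n)(sum Y_i^2/P_i - Y^2)."""
    tot = sum(Y)
    return (sum(yi * yi / pi for yi, pi in zip(Y, P)) - tot * tot) / n

def est_var_hh(y, p):
    """Unbiased estimator (4.2.16) computed from the n sampled values."""
    n = len(y)
    t_hat = sum(yi / pi for yi, pi in zip(y, p)) / n
    s = sum((yi / pi) ** 2 for yi, pi in zip(y, p))
    return (s - n * t_hat ** 2) / (n * (n - 1))

Y = [10.0, 20.0, 30.0]
P = [0.2, 0.3, 0.5]
# Enumerate all ordered samples of size 2 and average est_var_hh exactly:
ev = sum(P[i] * P[j] * est_var_hh([Y[i], Y[j]], [P[i], P[j]])
         for i, j in itertools.product(range(3), repeat=2))
assert abs(ev - var_hh(Y, P, 2)) < 1e-9
```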


Remark 4.2.1. If $P_i = 1/N$ $\forall\ i = 1,2,\ldots,N$, then the results of PPSWR sampling reduce to those of SRSWR sampling.

Example 4.2.1. For estimating the total amount of the real estate farm loans (in $000) during 1997 in the United States we took a PPSWR sample of eight states from population 1 given in the Appendix by using Lahiri's method. From the states selected in the sample we gathered the following information.

State                        | AZ      | CA       | KS       | MN       | NE       | PA      | TN      | WI
Nonreal estate farm loans, x | 431.439 | 3928.732 | 2580.304 | 2466.892 | 3585.406 | 298.351 | 388.869 | 1372.439
Real estate farm loans, y    | 54.633  | 1343.461 | 1049.834 | 1354.768 | 1337.852 | 756.169 | 553.266 | 1229.572

Assume that the total amount $43908.12 of nonreal estate farm loans (in $000) for the year 1997 is known. Apply the Hansen and Hurwitz (1943) estimator for estimating the total amount of the real estate farm loans (in $000) during 1997 in the United States. Also find an estimator of the variance of the estimator used and hence deduce a 95% confidence interval.

Solution. We are given $X = 43908.12$ and $P_i = x_i/X$, therefore we have

Sampled unit | y_i      | x_i      | P_i = x_i/X | y_i/P_i       | y_i^2/P_i^2
AZ           | 54.633   | 431.439  | 0.009825950 | 5560.072965   | 30914411.4
CA           | 1343.461 | 3928.732 | 0.089476206 | 15014.729130  | 225442090.8
KS           | 1049.834 | 2580.304 | 0.058765987 | 17864.653640  | 319145849.8
MN           | 1354.768 | 2466.892 | 0.056183048 | 24113.465820  | 581459233.7
NE           | 1337.852 | 3585.406 | 0.081657015 | 16383.797580  | 268428823.3
PA           | 756.169  | 298.351  | 0.006794894 | 111284.893300 | 12384327470.0
TN           | 553.266  | 388.869  | 0.008856426 | 62470.574720  | 3902572706.0
WI           | 1229.572 | 1372.439 | 0.031257066 | 39337.409480  | 1547431784.0
Sum          |          |          |             | 292029.596600 | 19259722369.0

Thus an estimate of the population total $Y$ (say) is given by
$$\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{P_i} = \frac{1}{8}\times 292029.5966 = 36503.7.$$

An estimator of $V(\hat{Y}_{HH})$ is given by
$$v(\hat{Y}_{HH}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_i^2}{P_i^2} - n\hat{Y}_{HH}^2\right] = \frac{1}{8(8-1)}\left[19259722369 - 8\times(36503.7)^2\right] = 153563601.9.$$

Using Table 2 from the Appendix the 95% confidence interval is given by
$$\hat{Y}_{HH} \pm t_{\alpha/2}(df = n-1)\sqrt{v(\hat{Y}_{HH})},$$
or $36503.7 \pm 2.365\sqrt{153563601.9}$, or $[7196.43,\ 65810.96]$.
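The whole of Example 4.2.1 can be reproduced from the sampled $y$ and $x$ values (figures as tabulated above; 2.365 is the tabulated $t$-value with 7 degrees of freedom):

```python
import math

y = [54.633, 1343.461, 1049.834, 1354.768, 1337.852, 756.169, 553.266, 1229.572]
x = [431.439, 3928.732, 2580.304, 2466.892, 3585.406, 298.351, 388.869, 1372.439]
X = 43908.12                                  # known total of x
n = len(y)
p = [xi / X for xi in x]                      # P_i = x_i / X
t_hat = sum(yi / pi for yi, pi in zip(y, p)) / n
v = (sum((yi / pi) ** 2 for yi, pi in zip(y, p)) - n * t_hat ** 2) / (n * (n - 1))
half = 2.365 * math.sqrt(v)                   # t_{0.025}(df = 7) = 2.365
ci = (t_hat - half, t_hat + half)
```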

Example 4.2.2. The amounts of the real and nonreal estate farm loans (in $000) during 1997 in the different 50 states of the United States have been presented in population 1 of the Appendix. Consider that we selected a PPSWR sample of eight states to collect the required information. Find the relative efficiency of the PPSWR sampling estimator for estimating the total amount of the real estate farm loans during 1997 by using information on the nonreal estate farm loans during 1997, with respect to the ratio estimator of the population total based on an SRSWR sample.

Solution. From the description of the population we have $N = 50$,
$y_i$ = Amount (in $000) of the real estate farm loans in different states during 1997,
$x_i$ = Amount (in $000) of the nonreal estate farm loans in different states during 1997,
$\bar{Y} = 555.43$, $\bar{X} = 878.16$, $S_y^2 = 342021.5$, $C_x^2 = 1.5256$, $C_y^2 = 1.1086$, and $\rho_{xy} = 0.8038$.
From Chapter 3, under SRSWR sampling ($f = 0$) the MSE of the ratio estimator
$$\hat{Y}_R = N\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$$
of the population total $Y$ is given by
$$MSE(\hat{Y}_R) = \frac{N^2\bar{Y}^2}{n}\left[C_y^2 + C_x^2 - 2\rho_{xy}C_xC_y\right] = \frac{(2500)(555.43)^2}{8}\left[1.1086 + 1.5256 - 2\times 0.8038\sqrt{1.1086\times 1.5256}\right] = 52399977.81.$$

The variance of the estimator $\hat{Y}_{HH}$ of the population total under PPSWR sampling is given by
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right] = \frac{1}{8}\left[1139152442 - (27771.726)^2\right] = 45985459.59.$$
Thus the percent relative efficiency (RE) of the PPSWR sampling estimator $\hat{Y}_{HH}$ with respect to the ratio estimator $\hat{Y}_R$ under SRSWR sampling is given by
$$RE = \frac{MSE(\hat{Y}_R)}{V(\hat{Y}_{HH})}\times 100 = \frac{52399977.81}{45985459.59}\times 100 = 113.95\%,$$
which shows that the PPSWR estimator is more efficient than the ratio estimator under the SRSWR design in this situation.
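The arithmetic of Example 4.2.2 checks out in a few lines (using the population figures quoted above):

```python
import math

N, n = 50, 8
Ybar, Cx2, Cy2, rho = 555.43, 1.5256, 1.1086, 0.8038
mse_ratio = (N ** 2 * Ybar ** 2 / n) * (Cy2 + Cx2 - 2 * rho * math.sqrt(Cx2 * Cy2))
v_hh = (1139152442.0 - 27771.726 ** 2) / n     # (1/n)(sum Y_i^2/P_i - Y^2)
re = 100.0 * mse_ratio / v_hh                  # percent relative efficiency
```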

Theorem 4.2.4. The variance $V(\hat{Y}_{HH})$ is minimum if $P_i \propto Y_i$; indeed the variance $V(\hat{Y}_{HH}) = 0$ if $P_i = Y_i/Y$, $i = 1,2,\ldots,N$.
Proof. We would like to choose $P_i$ to minimise
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left(\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right)$$
under the constraint $\sum_{i=1}^{N} P_i = 1$.
The Lagrange function is given by
$$L = \sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2 + \lambda\left(\sum_{i=1}^{N} P_i - 1\right).$$

Now
$$\frac{\partial L}{\partial P_i} = 0$$
implies that
$$-\frac{Y_i^2}{P_i^2} + \lambda = 0, \quad\text{or}\quad P_i = \frac{Y_i}{\sqrt{\lambda}} \propto Y_i, \quad\text{if } Y_i \ge 0\ \forall\ i.$$

Note that $\sum_{i=1}^{N} P_i = 1$, which implies that $P_i = Y_i/Y$ and
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right] = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{Y_i/Y} - Y^2\right] = \frac{1}{n}\left[Y\sum_{i=1}^{N} Y_i - Y^2\right] = \frac{1}{n}\left[Y^2 - Y^2\right] = 0.$$

Hence the theorem.

Thus the optimum values of the size measures are the normed values of $Y_i$. However, note that the values of $Y_i$ are not known for all the units in the population; thus the values of an auxiliary variable $X$, which is expected to have a high positive correlation with the study variable $Y$, can be taken as optimum values of the size measures.
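Theorem 4.2.4 is easy to confirm numerically: with $P_i = Y_i/Y$ the variance collapses to zero, while any other choice leaves it positive (toy values of my own):

```python
def var_hh(Y, P, n=1):
    """Variance (4.2.8) of the Hansen-Hurwitz estimator."""
    tot = sum(Y)
    return (sum(yi * yi / pi for yi, pi in zip(Y, P)) - tot * tot) / n

Y = [5.0, 15.0, 30.0]
tot = sum(Y)
opt = [yi / tot for yi in Y]           # optimum size measures P_i = Y_i / Y
assert abs(var_hh(Y, opt)) < 1e-9      # zero variance at the optimum
assert var_hh(Y, [1/3, 1/3, 1/3]) > 0  # equal probabilities do worse
```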

As noted, it is not necessarily the case that PPSWR sampling is always more efficient than SRSWR sampling; the next section has been devoted to studying the efficiency conditions.

4.3 RELATIVE EFFICIENCY OF PPSWR SAMPLING WITH RESPECT TO SRSWR SAMPLING

Here we shall discuss two different methods, based on the superpopulation and cost aspects in survey sampling.

4.3.1 SUPERPOPULATION MODEL APPROACH

We have the following theorem.

Theorem 4.3.1. PPSWR sampling remains more efficient than SRSWR sampling if the regression line passes through or near the origin.
Proof. In order to study the relative efficiency of PPSWR sampling with respect to SRSWR sampling we have to consider the superpopulation model approach by following Foreman and Brewer (1971). The superpopulation model is defined by the linear relationship between $Y$ and $X$ as follows:
$$Y_i = a + bX_i + e_i, \qquad (4.3.1.1)$$
where $e_i$ is an error term satisfying the following assumptions:

( a ) $E(e_i \mid X_i) = 0$;

( b ) $E(e_i^2 \mid X_i) = a'X_i^g$, $a' > 0$, and $g$ is a constant which in most applications takes a value between 1 and 2. The value of $a'$ is a constant for a given value of $g$ and determines the coefficient of correlation;

( c ) $E(e_ie_j \mid X_iX_j) = 0$, $\forall\ i \ne j$.

To study the relative efficiency let us define

$V_1$ = Variance of the estimator of total under SRSWR $= N^2\dfrac{\sigma_y^2}{n}$, (4.3.1.2)

and

$V_2$ = Variance of the estimator of total under PPSWR $= \dfrac{\sigma_z^2}{n}$, (4.3.1.3)

where
$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 \quad\text{and}\quad \sigma_z^2 = \sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2.$$

Now we have
$$E(V_1) = \frac{N^2}{n}E\left(\sigma_y^2\right) = \frac{N^2}{n}E\left[\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2\right] = \frac{N}{n}E\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2.$$
Also
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N}\left(a + bX_i + e_i\right) = a + b\bar{X} + \frac{1}{N}\sum_{i=1}^{N} e_i.$$

Thus we have
$$E(V_1) = \frac{N}{n}\sum_{i=1}^{N} E\left[Y_i - \bar{Y}\right]^2 = \frac{N}{n}\sum_{i=1}^{N} E\left[a + bX_i + e_i - \left(a + b\bar{X} + \frac{1}{N}\sum_{i=1}^{N} e_i\right)\right]^2$$
$$= \frac{N}{n}\sum_{i=1}^{N} E\left[b\left(X_i - \bar{X}\right) + \left(e_i - \bar{e}\right)\right]^2. \qquad (4.3.1.4)$$
Now
$$E\left(e_i - \bar{e}\right)^2 = E\left(e_i^2\right) + E\left(\bar{e}^2\right) - 2E\left(e_i\bar{e}\right),$$
where
$$E\left(e_i^2\right) = E_1E_2\left[e_i^2 \mid X_i\right] = E_1\left[a'X_i^g\right] = a'X_i^g,$$
$$E\left(\bar{e}^2\right) = E\left[\frac{1}{N}\sum_{i=1}^{N} e_i\right]^2 = \frac{1}{N^2}\left[\sum_{i=1}^{N} E\left(e_i^2\right) + \sum_{i\ne j=1}^{N} E\left(e_ie_j\right)\right] = \frac{a'}{N^2}\sum_{i=1}^{N} X_i^g,$$
$$E\left(e_i\bar{e}\right) = \frac{1}{N}E\left[e_i\sum_{j=1}^{N} e_j\right] = \frac{1}{N}E\left[e_i^2 + \sum_{j\ne i} e_ie_j\right] = \frac{a'}{N}X_i^g.$$

Thus we have
$$E\left(e_i - \bar{e}\right)^2 = a'X_i^g + \frac{a'}{N^2}\sum_{j=1}^{N} X_j^g - \frac{2a'}{N}X_i^g,$$
and
$$E\left(X_i - \bar{X}\right)\left(e_i - \bar{e}\right) = E\left[e_i\left(X_i - \bar{X}\right) - \bar{e}\left(X_i - \bar{X}\right)\right] = E\left[\left(X_i - \bar{X}\right)E\left(e_i \mid X_i\right)\right] = 0.$$
Putting these values in (4.3.1.4) we obtain
$$E(V_1) = \frac{N}{n}\sum_{i=1}^{N}\left[b^2\left(X_i - \bar{X}\right)^2 + E\left(e_i - \bar{e}\right)^2\right] = \frac{N^2}{n}\left[b^2\sigma_x^2 + \frac{a'}{N}\sum_{i=1}^{N} X_i^g - \frac{a'}{N^2}\sum_{i=1}^{N} X_i^g\right]. \qquad (4.3.1.5)$$

We now consider
$$E(V_2) = \frac{1}{n}E\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right]. \qquad (4.3.1.6)$$

Obviously
$$E\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right] = E\left[X\sum_{i=1}^{N}\frac{\left(a + bX_i + e_i\right)^2}{X_i} - \left\{\sum_{i=1}^{N}\left(a + bX_i + e_i\right)^2 + \sum_{i\ne j=1}^{N}\left(a + bX_i + e_i\right)\left(a + bX_j + e_j\right)\right\}\right]$$
$$= X\sum_{i=1}^{N}\frac{\left(a + bX_i\right)^2 + a'X_i^g}{X_i} - \left\{\sum_{i=1}^{N}\left[\left(a + bX_i\right)^2 + a'X_i^g\right] + \sum_{i\ne j=1}^{N}\left(a + bX_i\right)\left(a + bX_j\right)\right\}$$
$$= X\sum_{i=1}^{N}\frac{\left(a + bX_i\right)^2 + a'X_i^g}{X_i} - \sum_{i=1}^{N} a'X_i^g - \left\{\sum_{i=1}^{N}\left(a + bX_i\right)\right\}^2$$
$$= a^2X\sum_{i=1}^{N}\frac{1}{X_i} + a'X\sum_{i=1}^{N} X_i^{g-1} - a'\sum_{i=1}^{N} X_i^g - N^2a^2. \qquad (4.3.1.7)$$

Now the harmonic mean of the values of $X_i$ is given by
$$HM = N\Big/\sum_{i=1}^{N}\frac{1}{X_i} = \tilde{X}\ \text{(say)},$$
which implies that
$$\sum_{i=1}^{N}\frac{1}{X_i} = \frac{N}{\tilde{X}}, \quad\text{and}\quad X = N\bar{X}.$$
Therefore (4.3.1.7) becomes
$$E\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y^2\right] = N^2a^2\left(\frac{\bar{X}}{\tilde{X}} - 1\right) - a'\sum_{i=1}^{N} X_i^g + a'N\bar{X}\sum_{i=1}^{N} X_i^{g-1}. \qquad (4.3.1.8)$$

On using (4.3.1.8) in (4.3.1.6) we obtain
$$E(V_2) = \frac{1}{n}\left[N^2a^2\left(\frac{\bar{X}}{\tilde{X}} - 1\right) - a'\sum_{i=1}^{N} X_i^g + a'N\bar{X}\sum_{i=1}^{N} X_i^{g-1}\right]. \qquad (4.3.1.9)$$

From (4.3.1.9) and (4.3.1.5), PPSWR sampling will be more efficient than SRSWR if
$$E(V_1) - E(V_2) > 0, \qquad (4.3.1.10)$$
or if
$$\frac{N^2}{n}\left[b^2\sigma_x^2 + \frac{a'}{N}\sum_{i=1}^{N} X_i^g - \frac{a'}{N^2}\sum_{i=1}^{N} X_i^g\right] - \frac{1}{n}\left[N^2a^2\left(\frac{\bar{X}}{\tilde{X}} - 1\right) - a'\sum_{i=1}^{N} X_i^g + a'N\bar{X}\sum_{i=1}^{N} X_i^{g-1}\right] > 0,$$
or if
$$\frac{a'}{N}\left[\sum_{i=1}^{N} X_i^g - \bar{X}\sum_{i=1}^{N} X_i^{g-1}\right] > a^2\left(\frac{\bar{X}}{\tilde{X}} - 1\right) - b^2\sigma_x^2. \qquad (4.3.1.11)$$

Now we know that for two variables $x$ and $y$,
$$\mathrm{Cov}(x, y) = \frac{1}{N}\left[\sum_{i=1}^{N} x_iy_i - N\bar{x}\bar{y}\right]. \qquad (4.3.1.12)$$

In the present case we have $x_i = X_i$ and $y_i = X_i^{g-1}$, so that
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} X_i^{g-1}.$$

Using this value in (4.3.1.12) we obtain
$$\mathrm{Cov}\left(X_i, X_i^{g-1}\right) = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^g - \bar{X}\sum_{i=1}^{N} X_i^{g-1}\right]. \qquad (4.3.1.13)$$

From (4.3.1.11) and (4.3.1.13) we find that the condition under which PPSWR sampling is better than SRSWR is
$$a'\,\mathrm{Cov}\left(X_i, X_i^{g-1}\right) > a^2\left(\frac{\bar{X}}{\tilde{X}} - 1\right) - b^2\sigma_x^2. \qquad (4.3.1.14)$$

From the relation (4.3.1.14) the following cases are obvious:

( a ) If the line passes through the origin, i.e., $a = 0$ and $g > 1$, then the condition (4.3.1.14) is always satisfied;
( b ) If the intercept $a$ increases, then a value of $a'$ may be found such that the inequality (4.3.1.14) is not satisfied.

In such situations PPSWR sampling will be less efficient than SRSWR sampling. Hence the theorem.

4.3.2 COST ASPECT

The relative efficiency of PPSWR sampling with respect to SRSWR sampling is
$$RE = \frac{V(\hat{Y}_{SRSWR})}{V(\hat{Y}_{PPSWR})} = \left\{\frac{N^2\sigma_y^2}{n}\right\}\Big/\left\{\frac{1}{n}\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2\right\} = N^2\sigma_y^2\Big/\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2. \qquad (4.3.2.1)$$

Consider that the total cost of the survey depends upon the number of draws and the total sample size. For simplicity, let us consider the following cost function
$$C = C_0 + nC_1 + C_2\sum_{i=1}^{n} x_i, \qquad (4.3.2.2)$$
where $C_0$ is the overhead cost, $C_1$ is the cost per sampled unit, and $C_2$ is the cost of collecting data per unit size of the sample. In (4.3.2.2) the values of $C_0$, $C_1$, $C_2$ and $n$ are fixed, but $\sum_{i=1}^{n} x_i = n\bar{x}$ changes from sample to sample. Thus
( a ) For SRSWR sampling $E\left(\sum_{i=1}^{n} x_i\right) = E(n\bar{x}) = n\bar{X}$; thus the expected cost of the survey is given by
$$E(C) = C^* = C_0 + nC_1 + n\bar{X}C_2. \qquad (4.3.2.3)$$
The Lagrange function is then given by
$$L_1 = \frac{N^2\sigma_y^2}{n} - \lambda_1\left[C^* - C_0 - n\left(C_1 + \bar{X}C_2\right)\right]. \qquad (4.3.2.4)$$

Setting $\partial L_1/\partial n = 0$ implies $n = N\sigma_y\Big/\left(\sqrt{\lambda_1}\sqrt{C_1 + C_2\bar{X}}\right)$, and substituting it in (4.3.2.3) we have
$$n = \left(C^* - C_0\right)\big/\left(C_1 + C_2\bar{X}\right). \qquad (4.3.2.5)$$
Thus the minimum variance under SRSWR sampling is given by
$$\mathrm{Min.}V(\hat{Y}_{SRSWR}) = \frac{N^2\sigma_y^2}{n} = \frac{N^2\left(C_1 + C_2\bar{X}\right)\sigma_y^2}{C^* - C_0}. \qquad (4.3.2.6)$$

( b ) For PPSWR sampling $E\left(\sum_{i=1}^{n} x_i\right) = n\sum_{i=1}^{N} X_iP_i = nX^{-1}\sum_{i=1}^{N} X_i^2$; thus the expected cost of the survey is given by
$$E(C) = C^* = C_0 + nC_1 + nC_2X^{-1}\sum_{i=1}^{N} X_i^2. \qquad (4.3.2.7)$$
The Lagrange function is then given by
$$L_2 = \frac{1}{n}\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2 - \lambda_2\left[C^* - C_0 - n\left(C_1 + C_2\sum_{i=1}^{N} X_i^2/X\right)\right]. \qquad (4.3.2.8)$$

Setting $\partial L_2/\partial n = 0$ implies
$$n = \sqrt{\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2\Bigg/\lambda_2\left(C_1 + C_2\sum_{i=1}^{N} X_i^2/X\right)},$$
and substituting it in (4.3.2.7) we obtain
$$n = \frac{C^* - C_0}{C_1 + C_2\sum_{i=1}^{N} X_i^2/X}. \qquad (4.3.2.9)$$
Thus the minimum variance under PPSWR sampling is given by
$$\mathrm{Min.}V(\hat{Y}_{PPSWR}) = \frac{1}{n}\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2 = \frac{C_1 + C_2\sum_{i=1}^{N} X_i^2/X}{C^* - C_0}\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2. \qquad (4.3.2.10)$$

Using (4.3.2.6) and (4.3.2.10) the relative efficiency of PPSWR sampling with respect to SRSWR sampling for the fixed cost of the survey is given by
$$RE_{Cost} = \frac{\mathrm{Min.}V(\hat{Y}_{SRSWR})}{\mathrm{Min.}V(\hat{Y}_{PPSWR})} = \left(C_1 + C_2\bar{X}\right)\left(C_1 + C_2\sum_{i=1}^{N} X_i^2/X\right)^{-1}\times RE. \qquad (4.3.2.11)$$

Case I. If $C_2 = 0$ and $C_1 > 0$ then $RE_{Cost} = RE$.

Case II. If $C_1 = 0$ and $C_2 > 0$ then
$$RE_{Cost} = \frac{RE}{\sum_{i=1}^{N} X_i^2\big/\left(N\bar{X}^2\right)} = \frac{RE}{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i}{\bar{X}} - 1 + 1\right)^2} = \frac{RE}{\frac{1}{N}\sum_{i=1}^{N}\left[\left(\frac{X_i}{\bar{X}} - 1\right)^2 + 1 + 2\left(\frac{X_i}{\bar{X}} - 1\right)\right]} = \frac{RE}{1 + C_x^2}, \qquad (4.3.2.12)$$
where $C_x$ denotes the coefficient of variation of the auxiliary variable $X$. The larger the variation in the auxiliary variable, the less will be the efficiency of PPSWR sampling for the expected cost. If there is no variation in the auxiliary variable, i.e., $C_x = 0$, then $RE_{Cost} = RE$.
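The identity used in Case II, $\frac{1}{N}\sum_{i=1}^{N}(X_i/\bar{X})^2 = 1 + C_x^2$, is checked below with arbitrary sizes of my own choosing:

```python
x = [2.0, 4.0, 6.0, 8.0]               # arbitrary auxiliary values
N = len(x)
xbar = sum(x) / N
cx2 = sum((xi - xbar) ** 2 for xi in x) / (N * xbar ** 2)  # C_x^2
lhs = sum(xi ** 2 for xi in x) / (N * xbar ** 2)
assert abs(lhs - (1 + cx2)) < 1e-12
```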

4.4 ESTIMATORS AVAILABLE IN PPSWR SAMPLING
We have observed in the previous chapter that for a simple random sample (SRS) of size $n$ drawn with replacement from a population of size $N$, the usual estimators $\hat{R} = \bar{y}/\bar{x}$ and $\hat{P} = \bar{y}\bar{x}$ for estimating the ratio $R = \bar{Y}/\bar{X}$ and product $P = \bar{Y}\bar{X}$, respectively, are well known. Prasad (1989) has increased the efficiency of the estimator of $R$ by certain scalar multiplication. The estimators of the product $P$ proposed by Beale (1962), Robson (1957), Tin (1965) and Sahoo (1983) are found to be special cases of the class of estimators suggested by Singh and Sahoo (1989). The efficiency of the estimators of the ratio and product available in the literature is generally low. It is well known that suitable use of auxiliary information in probability sampling results in a considerable reduction in the variance of the estimators used for estimating a population parameter. Deng and Wu (1987) have increased the efficiency of the estimator of the variance of the linear regression estimator by making use of prior information about the population mean of the auxiliary variable. Srivastava and Jhajj (1980, 1986) have shown that the efficiency of the estimators for estimating the finite population variance and the finite population correlation coefficient, $\rho_{xy}$, may be increased by making use of some known population parameters of the auxiliary variable at the estimation stage. If there is more than one auxiliary variable, the problem remains as to how the entire information can be utilised in a better way. Multi-variate ratio and regression methods of estimation provide a solution to the problem. Singh (1967a, 1969) has also suggested a method of using two auxiliary variates, and the estimators suggested by him would be more efficient than the usual ratio estimator under some conditions. Agarwal and Kumar

(1980) have used two auxiliary variables, one at the stage of selection of the sample and the other at the estimation stage, and then taken the best linear combination of the probability proportional to size (PPS) estimator and the ratio estimator so obtained to estimate the population mean of the study variable. Jhajj and Srivastava (1983) have proposed a class of estimators of the population mean of the study variable when the sampling is done by the method of probability proportional to a suitable size variable which is different from the auxiliary variable used at the estimation stage. We would like to first discuss the class of estimators proposed by Jhajj and Srivastava (1983) and to show that other estimators available in the literature are special cases of their class of estimators.

4.4.1 NOTATION AND EXPECTATIONS

Assume that information on two auxiliary variables $x_i$ and $z_i$ (say) for $i = 1,2,\ldots,N$, highly correlated with the study variable $y$, is available. Let a sample of size $n$ be drawn with PPSWR sampling. Let $P_i = z_i\Big/\sum_{i=1}^{N} z_i$ denote the probability of selection on the basis of the known auxiliary variable $z_i$, $i = 1,2,\ldots,N$. Let $y_i$ and $x_i$ denote the values of the variable under study $y$ and the auxiliary variable $x$ for the $i$th unit of the population, and $\bar{Y}$ and $\bar{X}$ denote their population means respectively. Let
$$u_i = \frac{y_i}{NP_i}, \quad v_i = \frac{x_i}{NP_i}, \quad \bar{u}_n = \frac{1}{n}\sum_{i=1}^{n} u_i, \quad\text{and}\quad \bar{v}_n = \frac{1}{n}\sum_{i=1}^{n} v_i.$$
Defining
$$\delta_0 = \frac{\bar{u}_n}{\bar{Y}} - 1 \quad\text{and}\quad \delta = \frac{\bar{v}_n}{\bar{X}} - 1,$$
we have
$$E(\delta_0) = E(\delta) = 0$$
and
$$E\left(\delta_0^2\right) = n^{-1}C_u^2, \quad E\left(\delta^2\right) = n^{-1}C_v^2, \quad E\left(\delta_0\delta\right) = n^{-1}\rho_{uv}C_uC_v,$$
where
$$C_u^2 = \sigma_u^2/\bar{Y}^2 = \text{relative variance of } u, \text{ with } \sigma_u^2 = \sum_{i=1}^{N} P_i\left(u_i - \bar{Y}\right)^2,$$
$$C_v^2 = \sigma_v^2/\bar{X}^2 = \text{relative variance of } v, \text{ with } \sigma_v^2 = \sum_{i=1}^{N} P_i\left(v_i - \bar{X}\right)^2,$$
and
$$\rho_{uv} = \frac{\sigma_{uv}}{\sigma_u\sigma_v} = \frac{\mathrm{Cov}(\bar{u}_n, \bar{v}_n)}{\sqrt{V(\bar{u}_n)V(\bar{v}_n)}} = \text{correlation between } y_i \text{ and } x_i \text{ for PPSWR sampling},$$
where $\sigma_{uv} = \sum_{i=1}^{N} P_i\left(u_i - \bar{Y}\right)\left(v_i - \bar{X}\right)$.

It may be noted that the results for SRSWR sampling follow straightforwardly if $P_i = 1/N$ $\forall\ i$ (i.e., if $z_i = 1$ $\forall\ i$). It is worth noting that the same auxiliary variable cannot be used at both the selection stage and the estimation stage. In other words, if $z_i = x_i$ $\forall\ i = 1,2,\ldots,N$, then $E(\delta^2) = 0$ and $\rho_{uv}$ will take an indeterminate form. Thus it is compulsory to use one auxiliary variable at the selection stage and the other at the estimation stage.

4.4.2 CLASS OF ESTIMATORS

Jhajj and Srivastava (1983) defined a class of ratio type estimators with PPSWR sampling for estimating the population mean $\bar{Y}$ as
$$\bar{y}_{JS} = \bar{u}_nH(w), \qquad (4.4.2.1)$$
where $w = \bar{v}_n/\bar{X}$ and $H(\cdot)$ is a parametric function such that $H(1) = 1$, satisfying certain regularity conditions. Expanding $H(w)$ around the point 1 by using Taylor's second order series we have
$$\bar{y}_{JS} = \bar{u}_nH(w) = \bar{u}_nH[1 + (w - 1)] = \bar{u}_n\left[H(1) + (w - 1)\frac{\partial H}{\partial w}\Big|_{w=1} + \frac{(w - 1)^2}{2}\frac{\partial^2H}{\partial w^2}\Big|_{w=1} + \cdots\right]. \qquad (4.4.2.2)$$

The expression (4.4.2.2) in terms of $\delta_0$ and $\delta$ can easily be written as
$$\bar{y}_{JS} = \bar{Y}(1 + \delta_0)\left[1 + \delta H_1 + \delta^2H_2 + \cdots\right] = \bar{Y}\left[1 + \delta_0 + \delta H_1 + \delta^2H_2 + \delta_0\delta H_1 + \cdots\right], \qquad (4.4.2.3)$$
where $H_1 = \frac{\partial H}{\partial w}\Big|_{w=1}$ and $H_2 = \frac{1}{2}\frac{\partial^2H}{\partial w^2}\Big|_{w=1}$ denote the first and second order partial derivatives of $H$ with respect to $w$, and are assumed to be known constants at the given point. Thus we have the following theorems.

Theorem 4.4.2 .1. The bias, up to terms of order O(n -I ), in the class of ratio type of
estimators in PPSWR sampling is

B(YJs)= r[c';H2+ HIPl/vCl/cJ (4 .4.2.4)


n
Proof. On taking expected values on both sides of (4.4 .2.3) we have
E(YJs ) = rE[1 + 00 +oHI + 02H2+00OH)+....]
= r[l + E(OQ) + E(0)H1+E(oZ )Hz +E(OQO)H]]

= r[l + ~{H2C,7 + H\Pl/vCI/Cv}].


Thus the required bias is give n by

B(YJs) = E(YJs) - r = r [C,7H2+ H\Pl/vCuCv] '


n
Hence the theorem .

Theorem 4.4.2.2. The minimum mean squared error of the general class of ratio type estimators $\bar{y}_{JS}$ for estimating the population mean $\bar{Y}$ is given by
$$\mathrm{Min.}MSE(\bar{y}_{JS}) = \frac{\bar{Y}^2}{n}C_u^2\left[1 - \rho_{uv}^2\right]. \qquad (4.4.2.5)$$

Proof. By the definition of the mean squared error we have
$$MSE(\bar{y}_{JS}) = E\left[\bar{y}_{JS} - \bar{Y}\right]^2 \approx E\left[\bar{Y}\left(1 + \delta_0 + \delta H_1 + \cdots\right) - \bar{Y}\right]^2 \approx \bar{Y}^2E\left[\delta_0 + \delta H_1\right]^2$$
$$= \bar{Y}^2E\left[\delta_0^2 + \delta^2H_1^2 + 2\delta_0\delta H_1\right] = \frac{\bar{Y}^2}{n}\left[C_u^2 + H_1^2C_v^2 + 2\rho_{uv}C_uC_vH_1\right]. \qquad (4.4.2.6)$$
On differentiating (4.4.2.6) with respect to $H_1$ and equating to zero we have
$$H_1 = -\rho_{uv}\frac{C_u}{C_v}. \qquad (4.4.2.7)$$
On substituting the value of $H_1$ from (4.4.2.7) in (4.4.2.6) we obtain the minimum mean squared error given by (4.4.2.5). Hence the theorem.

Remark 4.4.2.1. One can easily observe that the following estimators of the population mean $\bar{Y}$ are special cases of the general class of ratio type estimators under PPSWR sampling:
$$\bar{y}_1 = \alpha\bar{u}_n + (1 - \alpha)\bar{u}_n\left(\frac{\bar{X}}{\bar{v}_n}\right), \quad \bar{y}_2 = \bar{u}_n\left(\frac{\bar{v}_n}{\bar{X}}\right)^{\alpha}, \quad\text{and}\quad \bar{y}_3 = \bar{u}_n\bar{X}\left[\alpha\bar{v}_n + (1 - \alpha)\bar{X}\right]^{-1}.$$

The estimator $\bar{y}_1$ was proposed by Agarwal and Kumar (1980), while the estimators $\bar{y}_2$ and $\bar{y}_3$ are the analogues of the estimators proposed by Srivastava (1967) and Walsh (1970) for PPSWR sampling strategies. Estimation of the population mean using supplementary information has been considered by Unam (1995), who compares ratio type estimators with PPSWR sampling through simulation.
Thus all ratio type estimators are special cases of the class of estimators $\bar{y}_{JS}$. In order to cover the regression type estimators, Jhajj and Srivastava (1983) also proposed a wider class of estimators, as discussed in the following section.

4.4.3 WIDER CLASS OF ESTIMATORS

The wider class of estimators of the population mean $\bar{Y}$ is defined as
$$\bar{y}_{JSW} = H(\bar{u}_n, w), \qquad (4.4.3.1)$$
where $w = \bar{v}_n/\bar{X}$ and $H(\cdot,\cdot)$ is a parametric function such that $H(\bar{Y}, 1) = \bar{Y}$, satisfying certain regularity conditions:

( a ) The first and second order partial derivatives of the function $H$ with respect to $\bar{u}_n$ and $w$ exist and are known constants;

( b ) $\dfrac{\partial H}{\partial\bar{u}_n}\Big|_{(\bar{Y},1)} = 1$.

Expanding $H(\bar{u}_n, w)$ around the point $(\bar{Y}, 1)$ by Taylor's second order series we have
$$\bar{y}_{JSW} = H\left[\bar{Y} + (\bar{u}_n - \bar{Y}),\ 1 + (w - 1)\right]$$
$$= H(\bar{Y}, 1) + (\bar{u}_n - \bar{Y})\frac{\partial H}{\partial\bar{u}_n}\Big|_{(\bar{Y},1)} + (w - 1)\frac{\partial H}{\partial w}\Big|_{(\bar{Y},1)} + \frac{(w - 1)^2}{2}\frac{\partial^2H}{\partial w^2} + \frac{(\bar{u}_n - \bar{Y})^2}{2}\frac{\partial^2H}{\partial\bar{u}_n^2} + (\bar{u}_n - \bar{Y})(w - 1)\frac{\partial^2H}{\partial\bar{u}_n\partial w} + \cdots.$$
Using the regularity conditions we have
$$\bar{y}_{JSW} = \bar{u}_n + (w - 1)H_{01} + (w - 1)^2H_{02} + (\bar{u}_n - \bar{Y})^2H_{20} + (\bar{u}_n - \bar{Y})(w - 1)H_{11} + \cdots, \qquad (4.4.3.2)$$
where $H_{jk}$ denotes the partial derivative of the function $H$ taken $j$ times with respect to $\bar{u}_n$ and $k$ times with respect to $w$. Then (4.4.3.2) in terms of $\delta_0$ and $\delta$ can easily be written as
$$\bar{y}_{JSW} = \bar{Y}(1 + \delta_0) + \delta H_{01} + \delta^2H_{02} + \bar{Y}^2\delta_0^2H_{20} + \bar{Y}\delta_0\delta H_{11} + \cdots. \qquad (4.4.3.3)$$
Thus we have the following theorems:

Theorem 4.4.3.1. The bias, up to terms of order $O(n^{-1})$, of the wider class of estimators $\bar{y}_{JSW}$ is given by
$$B(\bar{y}_{JSW}) = \frac{1}{n}\left[C_v^2H_{02} + \bar{Y}^2C_u^2H_{20} + \bar{Y}\rho_{uv}C_uC_vH_{11}\right]. \qquad (4.4.3.4)$$
Proof. Taking expected values on both sides of (4.4.3.3) we have
$$E(\bar{y}_{JSW}) = \bar{Y}\left(1 + E(\delta_0)\right) + E(\delta)H_{01} + E\left(\delta^2\right)H_{02} + \bar{Y}^2E\left(\delta_0^2\right)H_{20} + \bar{Y}E\left(\delta_0\delta\right)H_{11}$$
$$= \bar{Y} + \frac{1}{n}\left[C_v^2H_{02} + \bar{Y}^2C_u^2H_{20} + \bar{Y}\rho_{uv}C_uC_vH_{11}\right].$$
Thus the required bias is given by
$$B(\bar{y}_{JSW}) = E(\bar{y}_{JSW}) - \bar{Y} = \frac{1}{n}\left[C_v^2H_{02} + \bar{Y}^2C_u^2H_{20} + \bar{Y}\rho_{uv}C_uC_vH_{11}\right].$$
Hence the theorem.

Theorem 4.4.3.2. The minimum mean squared error of the wider class of estimators $\bar{y}_{JSW}$ is given by
$$\mathrm{Min.}MSE(\bar{y}_{JSW}) = \frac{\bar{Y}^2C_u^2}{n}\left(1 - \rho_{uv}^2\right). \qquad (4.4.3.5)$$
Proof. By the definition of the mean squared error we have
$$MSE(\bar{y}_{JSW}) = E\left[\bar{y}_{JSW} - \bar{Y}\right]^2 \approx E\left[\bar{Y}\delta_0 + H_{01}\delta\right]^2 = E\left[\bar{Y}^2\delta_0^2 + H_{01}^2\delta^2 + 2H_{01}\bar{Y}\delta_0\delta\right]. \qquad (4.4.3.6)$$
On differentiating (4.4.3.6) with respect to $H_{01}$ and equating to zero we have
$$H_{01} = -\bar{Y}\rho_{uv}\frac{C_u}{C_v}. \qquad (4.4.3.7)$$
On substituting this value of $H_{01}$ in (4.4.3.6) we have
$$\mathrm{Min.}MSE(\bar{y}_{JSW}) = \frac{1}{n}\left[\bar{Y}^2C_u^2 + \left(-\bar{Y}\rho_{uv}\frac{C_u}{C_v}\right)^2C_v^2 + 2\left(-\bar{Y}\rho_{uv}\frac{C_u}{C_v}\right)\bar{Y}\rho_{uv}C_uC_v\right]$$
$$= \frac{1}{n}\left[\bar{Y}^2C_u^2 + \bar{Y}^2\rho_{uv}^2C_u^2 - 2\bar{Y}^2\rho_{uv}^2C_u^2\right] = \frac{\bar{Y}^2C_u^2}{n}\left[1 - \rho_{uv}^2\right].$$
Hence the theorem.

Remark 4.4.3.1. It may be noted that estimators of the population mean $\bar{Y}$ of the form
$$\bar{y}_{10} = \bar{u}_n + \alpha\left(\bar{X} - \bar{v}_n\right) \quad\text{and}\quad \bar{y}_{11} = \bar{u}_n + \hat{\alpha}\left(\bar{X} - \bar{v}_n\right),$$
where $\hat{\alpha} = \sum_{i=1}^{n} P_i\left(u_i - \bar{u}_n\right)\left(v_i - \bar{v}_n\right)\Big/\sum_{i=1}^{n} P_i\left(v_i - \bar{v}_n\right)^2$ is an estimator of $\alpha$, are special cases of the wider class of estimators.

Example 4.4.3.1. Select a PPSWR sample of six units from population 3 given in the Appendix by Lahiri's method, using the season average price during 1994 of the apple crop as a benchmark or auxiliary variable in the United States. Collect the information on the prices of the apple crop during 1995 and 1996 from population 3 for the units selected in the sample. Use the estimator
$$\bar{y}_{11} = \bar{u}_n + \hat{\alpha}\left(\bar{X} - \bar{v}_n\right)$$
for estimating the average price of the apple crop during 1996, assuming that the average price $\bar{X} = 0.1856$ during 1995 is known. Also construct the 95% confidence interval for the average price of the apple crop during 1996 in the United States.

Solution. We applied Lahiri's method to select a sample of $n = 6$ units from the population of $N = 36$ units. The maximum value of the price of apples during 1994 is 0.332. Taking $X_0 = 0.333$, we used the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select a random number $1 \le R_i \le 36$, and the three columns from the 7th to the 9th to select a random number $0 < R_j \le 0.333$.

Then the first six effective pairs to select a sample of six units are (01, .075), (05, .155), (29, .012), (36, .099), (19, .027), and (29, .039). Ultimately we have the following sampled information.
Chapte r 4: Use of auxiliary information: PPSWR Samp ling 323

" '{ ~
1 2 3 4 5 6 ,,;,:i SUin '"
orate
, '0;:+
AZ CT MO SC SC WI i:t :',
;, Z'f}ii10.0780000 0.2830000 0.1980000 0.1300000 0.1300000 0.2300000 ;",
,, " " ;iY< ~;{:
It ,f;':"" 0.0126788 0.0460013 0.0321847 0.021 1313 0.0211313 0.0373862
I :'{;':::~i?'ff 0.0710000 0.2760000 0.1600000 0.1260000 0.1260000 0.2410000 !Y f+r;;' ;:) :;,Y">
~::" : ""
1 :~~'.Y i 1 :" 0.1220000 0.2920000 0.2280000 0.1260000 0.1260000 0.1330000 ii' ,, ' XI::
l i!;i u;; i;:'i 0.2672877 0.1763235 0.19678 11 0.1656308 0.1656308 0.0988184 ,<
1.0704723
'
I>
,
I"t Vi ' 0.1555527 0.1666620 0. 1380920 0. 1656308 0.1656308 0.17906 18 0.9706301
:b .' +>+, "
l'il.U':'i vi5;'> -0.0000070 -0.0000005 -0.0000140 -0.0000010 -0.0000010 -0.0000515 :0.0000750
""' +2 ,',
~Ui j 0.0001001 0.0000002 0.0000109 0.0000035 0.0000035 0.0002368 5 ;0003550
,~,0R~?li, 0.0000005 0.00000 11 0.0000 180 0.0000003 0.0000003 0.0000112 '0.00003 14
where $u_i^* = u_i - \bar{u}_n$, $v_i^* = v_i - \bar{v}_n$, $\hat{p}_i = z_i\big/\sum_{i=1}^{N} z_i$ with $\sum_{i=1}^{N} z_i = 6.152$ (given), $u_i = y_i/(N\hat{p}_i)$, and $v_i = x_i/(N\hat{p}_i)$.
Then we have
$$\bar{u}_n = n^{-1}\sum_{i=1}^{n} u_i = 0.178412, \qquad \bar{v}_n = n^{-1}\sum_{i=1}^{n} v_i = 0.161772,$$
and
$$\hat{\alpha} = \frac{\sum_{i=1}^{n}\hat{p}_i u_i^* v_i^*}{\sum_{i=1}^{n}\hat{p}_i v_i^{*2}} = -2.385636.$$
Thus an estimate of the average price of the apple crop during 1996 is given by
$$\hat{y}_{ll} = \bar{u}_n + \hat{\alpha}\left(\bar{X} - \bar{v}_n\right) = 0.178412 - 2.385636\times(0.1856 - 0.161772) = 0.121567.$$
Taking
$$\widehat{MSE}(\hat{y}_{ll}) = \frac{s_u^2}{n}\left(1 - r_{uv}^2\right)$$
as an estimator of $MSE(\hat{y}_{ll})$, where
$$s_u^2 = \sum_{i=1}^{n}\hat{p}_i\left(u_i - \bar{u}_n\right)^2 = 0.000355$$
and
$$r_{uv} = \frac{\sum_{i=1}^{n}\hat{p}_i(u_i - \bar{u}_n)(v_i - \bar{v}_n)}{\sqrt{\sum_{i=1}^{n}\hat{p}_i(u_i - \bar{u}_n)^2\sum_{i=1}^{n}\hat{p}_i(v_i - \bar{v}_n)^2}} = \frac{-0.0000750}{\sqrt{0.0003550\times 0.0000314}} = -0.71036,$$
we have
$$\widehat{MSE}(\hat{y}_{ll}) = \frac{0.000355}{6}\left[1 - (-0.71036)^2\right] = 0.00002931.$$
Using Table 2 from the Appendix, the 95% confidence interval of the average price of the apple crop during 1996 in the United States is
324 Advanced sampling theory with applications

$$\hat{y}_{ll} \pm t_{\alpha/2}(df = n - 2)\sqrt{\widehat{MSE}(\hat{y}_{ll})} = 0.12156 \pm 2.776\sqrt{0.00002931}, \text{ or } [0.1065,\ 0.1366].$$
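The calculations above can be reproduced with a short script (a sketch: the inputs are read off the table, and last-digit differences from the printed values are due to rounding of the tabled entries):

```python
# Reproduces the example: difference-type estimate of the mean 1996 price
# and its estimated MSE, from the n = 6 sampled states.
from math import sqrt

N, n, X_bar, t = 36, 6, 0.1856, 2.776            # t value for df = n - 2
p_hat = [0.0126788, 0.0460013, 0.0321847, 0.0211313, 0.0211313, 0.0373862]
x = [0.071, 0.276, 0.160, 0.126, 0.126, 0.241]   # 1995 prices
y = [0.122, 0.292, 0.228, 0.126, 0.126, 0.133]   # 1996 prices

u = [yi / (N * pi) for yi, pi in zip(y, p_hat)]
v = [xi / (N * pi) for xi, pi in zip(x, p_hat)]
u_bar, v_bar = sum(u) / n, sum(v) / n

s_uv = sum(pi * (ui - u_bar) * (vi - v_bar) for pi, ui, vi in zip(p_hat, u, v))
s_u2 = sum(pi * (ui - u_bar) ** 2 for pi, ui in zip(p_hat, u))
s_v2 = sum(pi * (vi - v_bar) ** 2 for pi, vi in zip(p_hat, v))

alpha = s_uv / s_v2                              # regression coefficient
y_ll = u_bar + alpha * (X_bar - v_bar)           # difference-type estimate
r_uv = s_uv / sqrt(s_u2 * s_v2)
mse = (s_u2 / n) * (1 - r_uv ** 2)
half = t * sqrt(mse)
print(y_ll, mse, (y_ll - half, y_ll + half))
```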

4.4.4 PPSWR SAMPLING WITH NEGATIVELY CORRELATED VARIABLES

Consider the situation where $y$ and $x$ are highly negatively correlated and $X = \sum_{i=1}^{N} x_i > n\,\mathrm{Max}(x_i)$. Srivenkataramana (1980) and Sahoo, Sahoo, and Mohanty (1994) considered a transformed auxiliary variable $z$ whose value for $x_i$ is defined as
$$z_i = \frac{X - n x_i}{N - n}, \quad i = 1, 2, \ldots, N, \qquad (4.4.4.1)$$
and proposed an unbiased estimator of the population total $Y$ as
$$\hat{Y}_{SM} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i^-}, \quad \text{where } p_i^- = z_i\Big/\sum_{i=1}^{N} z_i = z_i\Big/\sum_{i=1}^{N} x_i. \qquad (4.4.4.2)$$
Thus we have the following theorem:

Theorem 4.4.4.1. The variance of the estimator $\hat{Y}_{SM}$ is given by
$$V(\hat{Y}_{SM}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i^-} - Y^2\right]. \qquad (4.4.4.3)$$

Proof. Obvious by following Hansen and Hurwitz (1943).

The limitation of this strategy is that in some situations the condition $X > n\,\mathrm{Max}(x_i)$ may not be satisfied, although the condition is reasonably satisfied in most practical situations. A practical example in which this condition holds is given below.

Example 4.4.4.1. Suppose the ranks of the auxiliary variable $x_i$ are used to select a sample of $n$ units with PPSWR sampling from a population of size $N$. The condition $X > n\,\mathrm{Max}(x_i)$ will be satisfied if
$$n < 1 + \left\{\sum_{i=1}^{N} x_i - \mathrm{Max}(x_i)\right\}\Big/\mathrm{Max}(x_i). \qquad (4.4.4.4)$$

Evidently $\mathrm{Max}(x_i) = N$ and $\sum_{i=1}^{N} x_i - \mathrm{Max}(x_i) = \frac{N(N+1)}{2} - N = \frac{N(N-1)}{2}$, therefore (4.4.4.4) reduces to the condition on sample size
$$n < \frac{N + 1}{2}. \qquad (4.4.4.5)$$

In other words, the condition $X > n\,\mathrm{Max}(x_i)$ remains satisfied if the size of the selected sample is less than 50% of the population size. Thus the condition $X > n\,\mathrm{Max}(x_i)$ may be satisfied in most practical situations. The literature related to the use of ranks and their benefits in selecting the sample is available in Wright (1990).

We observe that if the correlation between $y_i$ and $x_i$ is positive and high and the regression line passes through the origin, then PPSWR sampling remains better than SRSWR sampling. If the correlation between $y_i$ and $x_i$ is negative, then the transformation suggested by Sahoo, Sahoo, and Mohanty (1994) works well.

Example 4.4.4.2. In population 2 the age and duration (minutes) of sleep are negatively correlated. Assuming that the whole population information is known, discuss the gain in efficiency due to the Sahoo, Sahoo, and Mohanty (1994) method of estimation over the Hansen and Hurwitz (1943) estimator. Assume that the sample consists of five units.

Solution. Define $p_i = x_i/X$, $z_i = (X - n x_i)/(N - n)$, and $p_i^- = z_i/X$. Then from population 2 we have:

i    x_i   y_i    p_i           y_i^2/p_i      z_i     p_i^-          y_i^2/p_i^-
I 60 492 0.029732408 8141419 .200 68.72 0.034053518 7108340 .396


2 72 384 0.035678890 4132864.000 66.32 0.032864222 4486824 .608
3 55 408 0.027254708 6107715.491 69.72 0.034549058 4818192 .083
4 56 465 0.027750248 7791822.321 69.52 0.034449950 6276496 .692
5 82 312 0.040634291 2395612 .098 64.32 0.031873142 3054107.463
6 78 315 0.038652131 2567128 .846 65.12 0.032269574 3074877.918
7 67 420 0.033201189 5313062 .687 67.32 0.033359762 5287807.487
8 74 381 0.036669970 3958579 .703 65.92 0.032666006 4443793.962
9 84 276 0.041625372 1830037.714 63.92 0.031674926 2404930 .663
10 56 465 0.027750248 7791822.321 69.52 0.034449950 6276496.692
11 68 420 0.033696729 5234929.412 67.12 0.033260654 5303563 .766
12 70 360 0.034687810 3736182 .857 66.72 0.033062438 3919856 .115
13 59 435 0.029236868 6472136.441 68.92 0.034152626 5540569.501
14 64 405 0.031714569 5171913 .281 67.92 0.033657086 4873416.519
15 53 510 0.026263627 9903430 .189 70.12 0.034747275 7485479 .179
16 66 420 0.032705649 5393563 .636 67.52 0.033458870 5272144 .550
17 78 345 0.038652131 3079390 .385 65.12 0.032269574 3688458 .999
18 63 405 0.031219029 5254007.143 68.12 0.033756194 4859108.191
19 77 330 0.038156591 2854028 .571 65.32 0.032368682 3364363.135
20 73 285 0.036174430 2245370.548 66.12 0.032765114 2479008 .621
21 55 438 0.027254708 7038930.764 69.72 0.034549058 5552799 .656
22 71 360 0.035183350 3683560 .563 66.52 0.032963330 3931641.612

23 63 390 0.031219029 4872028.571 68.12 0.033756194 4505839 .695


24 87 270 0.043111992 1690944.828 63.32 0.031377602 2323313 .329
25 61 375 0.030227948 4652151.639 68.52 0.033954410 4141582 .750
26 58 375 0.028741328 4892780.172 69.12 0.034251734 4105631.510
27 60 390 0.029732408 5115630 .000 68.72 0.034053518 4466498 .836
28 70 360 0.034687810 3736182 .857 66.72 0.033062438 3919856 .115
29 66 390 0.032705649 4650572.727 67.52 0.033458870 4545879.739
30 72 345 0.035678890 3336006.250 66.32 0.032864222 3621719.692
Sum  2018  11526  1.000000000  143043805.25  2018.0  1.000000000  135132599.5
From the above table we have
$$X = \sum_{i=1}^{N} z_i = 2018, \quad \text{and} \quad Y = 11526.$$
Also we have
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - Y^2\right] = \frac{1}{5}\left[143043805.25 - 11526^2\right] = 2039025.84,$$
and
$$V(\hat{Y}_{SM}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i^-} - Y^2\right] = \frac{1}{5}\left[135132599.5 - 11526^2\right] = 456784.69.$$
Thus the percent relative efficiency (RE) of the estimator $\hat{Y}_{SM}$ with respect to the estimator $\hat{Y}_{HH}$ is given by
$$RE = \frac{V(\hat{Y}_{HH})}{V(\hat{Y}_{SM})}\times 100 = \frac{2039025.84}{456784.69}\times 100 = 446.38\%.$$
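These efficiency calculations can be checked with a short script (a sketch; the population values are the age/sleep data tabulated above):

```python
# Recomputes V(Y_HH), V(Y_SM) and the relative efficiency for the
# N = 30, n = 5 population of the example (age x, minutes of sleep y).
x = [60, 72, 55, 56, 82, 78, 67, 74, 84, 56, 68, 70, 59, 64, 53,
     66, 78, 63, 77, 73, 55, 71, 63, 87, 61, 58, 60, 70, 66, 72]
y = [492, 384, 408, 465, 312, 315, 420, 381, 276, 465, 420, 360, 435, 405, 510,
     420, 345, 405, 330, 285, 438, 360, 390, 270, 375, 375, 390, 360, 390, 345]
N, n = len(x), 5
X, Y = sum(x), sum(y)

p = [xi / X for xi in x]                      # original selection probabilities
z = [(X - n * xi) / (N - n) for xi in x]      # transformed sizes
p_minus = [zi / X for zi in z]                # sum(z) = X, so p_i^- = z_i / X

V_HH = (sum(yi**2 / pi for yi, pi in zip(y, p)) - Y**2) / n
V_SM = (sum(yi**2 / pi for yi, pi in zip(y, p_minus)) - Y**2) / n
RE = 100 * V_HH / V_SM
print(V_HH, V_SM, RE)
```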

The next section is devoted to the situation of large scale multipurpose surveys, where the cost involved is quite high and information is collected on several variables of interest. Some of these study variables may be positively and highly correlated with the auxiliary information used at the selection stage, while others may be poorly correlated. Such a situation is called a 'multi-character survey'.

We shall here discuss two cases:

( a ) Study variables have poor positive correlation with the selection probabilities.
( b ) Study variables have poor positive as well as poor negative correlation with the selection probabilities.

4.5.1 STUDY VARIABLES HAVE POOR POSITIVE CORRELATION WITH THE SELECTION PROBABILITIES
Bansal and Singh (1985) proposed an estimator of the population total $Y$ given by
$$\hat{Y}_{BS} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i^*}, \qquad (4.5.1.1)$$
where
$$p_i^* = \left(1 + \frac{1}{N}\right)^{1-\rho_{xy}}\left(1 + p_i\right)^{\rho_{xy}} - 1. \qquad (4.5.1.2)$$

( a ) If $\rho_{xy} = 0$ then (4.5.1.2) becomes $p_i^* = 1/N$ and the estimator (4.5.1.1) reduces to the well known estimator owed to Rao (1966a, 1966b), that is $\hat{Y}_{Rao} = \frac{N}{n}\sum_{i=1}^{n} y_i$.

( b ) If $\rho_{xy} = 1$ then (4.5.1.2) becomes $p_i^* = p_i$ and the estimator (4.5.1.1) reduces to the well known estimator of Hansen and Hurwitz (1943), that is $\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}$.
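A small sketch (with assumed illustrative probabilities) shows how the transformation (4.5.1.2) interpolates between the two limiting cases:

```python
# The Bansal-Singh transformation p_i* equals 1/N when rho = 0 and p_i when
# rho = 1, and lies between the two for intermediate rho.
N = 5
p = [0.10, 0.15, 0.20, 0.25, 0.30]

def p_star(p_i, rho, N):
    # p_i* = (1 + 1/N)^(1 - rho) * (1 + p_i)^rho - 1
    return (1 + 1 / N) ** (1 - rho) * (1 + p_i) ** rho - 1

for rho in (0.0, 0.5, 1.0):
    print(rho, [round(p_star(pi, rho, N), 4) for pi in p])
```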

We now study some further properties of the estimator $\hat{Y}_{BS}$ in the following theorems.

Theorem 4.5.1.1. The estimator $\hat{Y}_{BS}$ is not consistent.

Proof. Taking expected values on both sides of (4.5.1.1) we have
$$E(\hat{Y}_{BS}) = \sum_{i=1}^{N}\frac{p_i}{p_i^*}Y_i. \qquad (4.5.1.3)$$
Thus the bias in the estimator $\hat{Y}_{BS}$ is given by
$$B(\hat{Y}_{BS}) = E(\hat{Y}_{BS}) - Y = \sum_{i=1}^{N}\left(\frac{p_i}{p_i^*} - 1\right)Y_i. \qquad (4.5.1.4)$$
Evidently $B(\hat{Y}_{BS})$ does not converge to zero as the sample size increases. Hence the theorem.

Theorem 4.5.1.2. The variance of the estimator $\hat{Y}_{BS}$ is given by
$$V(\hat{Y}_{BS}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2 p_i}{p_i^{*2}} - \left(\sum_{i=1}^{N}\frac{Y_i p_i}{p_i^*}\right)^2\right]. \qquad (4.5.1.5)$$

Now we have the following remarks:

Remark 4.5.1.1. On putting $p_i^* = 1/N$ in (4.5.1.5) we have the variance of the estimator of population total proposed by Rao (1966a, 1966b) as follows:
$$V(\hat{Y}_{Rao}) = \frac{N^2}{n}\left[\sum_{i=1}^{N} Y_i^2 p_i - \left(\sum_{i=1}^{N} Y_i p_i\right)^2\right]. \qquad (4.5.1.6)$$

Remark 4.5.1.2. On putting $p_i^* = p_i$ in (4.5.1.5) we have the variance of the estimator of population total proposed by Hansen and Hurwitz (1943) as follows:
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - Y^2\right]. \qquad (4.5.1.7)$$

Theorem 4.5.1.3. An unbiased estimator for estimating $V(\hat{Y}_{BS})$ is given by
$$\hat{v}(\hat{Y}_{BS}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_i^2}{p_i^{*2}} - n\hat{Y}_{BS}^2\right]. \qquad (4.5.1.8)$$

Proof. Taking expected values on both sides of (4.5.1.8), and noting that $E(\hat{Y}_{BS}^2) = V(\hat{Y}_{BS}) + \left(\sum_{i=1}^{N} Y_i p_i/p_i^*\right)^2$, we obtain
$$E\left[\hat{v}(\hat{Y}_{BS})\right] = \frac{1}{n(n-1)}\left\{E\left[\sum_{i=1}^{n}\frac{y_i^2}{p_i^{*2}}\right] - nE\left(\hat{Y}_{BS}^2\right)\right\} = \frac{1}{n(n-1)}\left\{n\sum_{i=1}^{N}\frac{Y_i^2 p_i}{p_i^{*2}} - n\left[V(\hat{Y}_{BS}) + \left(\sum_{i=1}^{N}\frac{Y_i p_i}{p_i^*}\right)^2\right]\right\}.$$
By (4.5.1.5), $\sum_{i=1}^{N} Y_i^2 p_i/p_i^{*2} - \left(\sum_{i=1}^{N} Y_i p_i/p_i^*\right)^2 = nV(\hat{Y}_{BS})$, so
$$E\left[\hat{v}(\hat{Y}_{BS})\right] = \frac{1}{n-1}\left[nV(\hat{Y}_{BS}) - V(\hat{Y}_{BS})\right] = V(\hat{Y}_{BS}).$$
Hence the theorem.

Theorem 4.5.1.4. An unbiased estimator of the bias $B(\hat{Y}_{BS})$ is given by
$$\hat{B}(\hat{Y}_{BS}) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{p_i}{p_i^*} - 1\right)\frac{y_i}{p_i}. \qquad (4.5.1.9)$$

Proof. Taking expected values on both sides of (4.5.1.9) we have
$$E\left[\hat{B}(\hat{Y}_{BS})\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{p_i}{p_i^*} - 1\right)\frac{y_i}{p_i}\right] = \sum_{i=1}^{N}\left(\frac{p_i}{p_i^*} - 1\right)Y_i = B(\hat{Y}_{BS}).$$
Hence the theorem.

Remark 4.5.1.3. An estimator for estimating $MSE(\hat{Y}_{BS})$ is given by
$$\widehat{MSE}(\hat{Y}_{BS}) = \hat{v}(\hat{Y}_{BS}) + \left[\hat{B}(\hat{Y}_{BS})\right]^2. \qquad (4.5.1.10)$$
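Because PPSWR draws are independent, the unbiasedness claims in (4.5.1.5), (4.5.1.8) and (4.5.1.9) can be checked exactly by enumerating all ordered samples of size $n = 2$ from an assumed tiny population:

```python
# Exact check of the variance (4.5.1.5) and the unbiasedness of (4.5.1.8)
# and (4.5.1.9) by complete enumeration of ordered PPSWR samples, n = 2.
from itertools import product

Y = [3.0, 5.0, 8.0, 10.0]
p = [0.1, 0.2, 0.3, 0.4]
rho, N, n = 0.6, len(Y), 2
ps = [(1 + 1 / N) ** (1 - rho) * (1 + pi) ** rho - 1 for pi in p]  # p_i*

EY = sum(Yi * pi / psi for Yi, pi, psi in zip(Y, p, ps))           # E(Y_BS)
V = (sum(Yi**2 * pi / psi**2 for Yi, pi, psi in zip(Y, p, ps)) - EY**2) / n
B = EY - sum(Y)                                                    # true bias

E_Y, E_v, E_B = 0.0, 0.0, 0.0
for i, j in product(range(N), repeat=n):          # all ordered WR samples
    prob = p[i] * p[j]
    y_hat = (Y[i] / ps[i] + Y[j] / ps[j]) / n
    v_hat = (Y[i]**2 / ps[i]**2 + Y[j]**2 / ps[j]**2 - n * y_hat**2) / (n * (n - 1))
    b_hat = ((p[i] / ps[i] - 1) * Y[i] / p[i] + (p[j] / ps[j] - 1) * Y[j] / p[j]) / n
    E_Y += prob * y_hat
    E_v += prob * v_hat
    E_B += prob * b_hat
print(E_Y - EY, E_v - V, E_B - B)
```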
In order to compare the three estimators let us consider a superpopulation model as
follows:

Let $Y_i$ and $P_i$ denote respectively the value of the variable $y$ and the relative measure of size of $X$ for the $i$th $(i = 1, 2, \ldots, N)$ unit in the population, so that the values $P_i = X_i/X$ will serve the purpose of selection probabilities with $\sum_{i=1}^{N} P_i = 1$. A general superpopulation model following Cochran (1963) is given by
$$Y_i = \beta P_i + e_i, \quad i = 1, 2, \ldots, N, \qquad (4.5.1.11)$$
where the $e_i$ are error terms such that
$$E_m[e_i \mid P_i] = 0, \quad E_m\left(e_i^2 \mid P_i\right) = aP_i^g, \ a > 0, \ g \ge 0, \qquad (4.5.1.12)$$
and $E_m(e_i e_j \mid P_i, P_j) = 0$, where $E_m$ denotes the expected value under the superpopulation model. Here $E_m(e_i^2 \mid P_i)$ is the residual variance of $y$ for $P = P_i$. The expected value of this residual variance is given by
$$E\left(aP^g\right) = \frac{a}{N}\sum_{i=1}^{N} P_i^g,$$
when the infinite superpopulation is simulated by the finite population of $N$ units having the same characteristics as the superpopulation. Also, this expected value of the residual variance is known to be given by $\sigma_y^2\left(1 - \rho_{xy}^2\right)$, where $\rho_{xy}$ is the correlation coefficient between $y$ and $P$. Thus we have
$$\frac{a}{N}\sum_{i=1}^{N} P_i^g = \sigma_y^2\left(1 - \rho_{xy}^2\right). \qquad (4.5.1.13)$$
Now we have the following theorems:

Theorem 4.5.1.5. The expected value of the variance $V(\hat{Y}_{Rao})$ under the superpopulation model (4.5.1.11) is
$$E_m\left[V(\hat{Y}_{Rao})\right] = \frac{N^2}{n}\left\{a\sum_{i=1}^{N} P_i^{g+1}\left(1 - P_i\right) + \beta^2\left[\sum_{i=1}^{N} P_i^3 - \left(\sum_{i=1}^{N} P_i^2\right)^2\right]\right\}. \qquad (4.5.1.14)$$

Proof. The variance $V(\hat{Y}_{Rao})$ in (4.5.1.6) under the model (4.5.1.11) can easily be written as
$$V(\hat{Y}_{Rao}) = \frac{N^2}{n}\left[\sum_{i=1}^{N}\left(\beta P_i + e_i\right)^2 P_i - \left(\sum_{i=1}^{N}\left(\beta P_i + e_i\right)P_i\right)^2\right]$$
$$= \frac{N^2}{n}\left[\beta^2\sum_{i=1}^{N} P_i^3 + \sum_{i=1}^{N} P_i e_i^2 + 2\beta\sum_{i=1}^{N} P_i^2 e_i - \beta^2\left(\sum_{i=1}^{N} P_i^2\right)^2 - \sum_{i=1}^{N} P_i^2 e_i^2 - \sum_{i\ne j} P_i P_j e_i e_j - 2\beta\left(\sum_{i=1}^{N} P_i^2\right)\left(\sum_{i=1}^{N} P_i e_i\right)\right]. \qquad (4.5.1.15)$$
On taking expected values on both sides of (4.5.1.15) and using (4.5.1.12) we have (4.5.1.14). Hence the theorem.

Theorem 4.5.1.6. The expected value of the variance $V(\hat{Y}_{BS})$ under the superpopulation model (4.5.1.11) is
$$E_m\left[V(\hat{Y}_{BS})\right] = \frac{1}{n}\left\{a\sum_{i=1}^{N}\frac{P_i^{g+1}\left(1 - P_i\right)}{p_i^{*2}} + \beta^2\left[\sum_{i=1}^{N}\frac{P_i^3}{p_i^{*2}} - \left(\sum_{i=1}^{N}\frac{P_i^2}{p_i^*}\right)^2\right]\right\}. \qquad (4.5.1.16)$$

Proof. Follows along the lines of the proof of the above theorem.

Theorem 4.5.1.7. Under the superpopulation model (4.5.1.11) the estimator $\hat{Y}_{BS}$ of population total will be more efficient than the Rao (1966a, 1966b) estimator if
$$\rho_{xy}^2 > (1 - \delta)^{-1}, \qquad (4.5.1.17)$$
where $\delta$ is given by (4.5.1.18).

Proof. It follows by setting $E_m\left[V(\hat{Y}_{BS})\right] \le E_m\left[V(\hat{Y}_{Rao})\right]$ and using (4.5.1.14) and (4.5.1.16). Hence the theorem.

Example 4.5.1.1. A team of medical doctors wishes to estimate the totals of three variables, 'Crude Birth Rate (CBR)', 'Crude Death Rate (CDR)' and 'Infant Mortality Rate (IMR)', for the world. We know that CBR and IMR have a high positive correlation with the 'Total Fertility Rate (TFR)', whereas CDR has a low positive correlation with TFR. Select a sample of 10 countries from the list given in population 8 of the Appendix by using PPSWR sampling with TFR as the known auxiliary information. The values of the correlation coefficient of TFR with CBR, CDR and IMR are +0.9855, +0.5492 and +0.8528, respectively. Apply the appropriate transformations on the selection probabilities using the known values of the correlation coefficient to obtain estimates of total CBR, CDR and IMR. Also construct a 95% confidence interval in each situation.

Solution. ( a ) Selection of the PPSWR sample:
Population 8 in the Appendix consists of $N = 96$ countries. The maximum value of TFR is 7.17. Thus in Lahiri's method we selected random numbers $1 \le R_i \le 96$ and $1 \le R_j \le 8$. We used the first two columns of the Pseudo-Random Numbers (PRN) Table 1 given in the Appendix to select the random number $R_i$, whereas the random number $R_j$ was selected using the 13th column. Then we performed the following trials:

Trial  R_i  R_j  TFR of unit R_i
2 60 7 1.53
3 54 3 3.10
4 92 5 2.53
5 01 4 2.07
6 69 1 1.95
7 87 1 2.35
8 62 8 5.95
9 23 8 1.81
10 88 4 6.24
11 94 8 6.86
12 64 8 2.66
13 46 2 1.55
14 04 7 6.05
15 32 5 6.75
16 94 6 6.86
17 47 4 1.50 R
18 57 8 3.13 R
19 56 6 2.79 R
20 77 7 1.95 R
21 57 7 3.13
22 81 4 5.19
23 60 4 1.53
24 33 6 1.73
25 05 6 2.64
26 72 2 5.87
27 22 6 1.80
28 88 6 1.80
29 38 3 3.16
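The accept-reject rule of Lahiri's method used in these trials can be sketched as follows (the function name and the toy size vector are illustrative, and a continuous uniform is used in place of the tabled pseudo-random digits):

```python
# Lahiri's method: draw a unit label and a size value at random, and keep the
# unit only when the size value does not exceed the unit's x-value.
import random

def lahiri_draw(x, x_max, rng):
    """Return one unit index selected with probability x_i / sum(x)."""
    while True:
        i = rng.randrange(len(x))            # unit label (0-based here)
        r = x_max * rng.random()             # size value in (0, x_max], x_max >= max(x)
        if r <= x[i]:
            return i

# quick frequency check on an assumed toy size vector
x = [1.5, 3.1, 2.5, 7.2]
rng = random.Random(42)
draws = [lahiri_draw(x, 8.0, rng) for _ in range(20000)]
freq = [draws.count(i) / len(draws) for i in range(len(x))]
print(freq, [xi / sum(x) for xi in x])
```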

Thus we have the following sample information collected from the 10 countries selected in the PPSWR sample out of the 96 countries or areas listed in population 8. Assuming that the total of the TFR is 336.15 and is known, the selection probabilities $p_i$ have been listed in the last column of the table given below.

Sr.No.  Country or area  Crude birth  Crude death  Infant mortality  Total fertility  Selection
                         rate         rate         rate              rate             probability p_i
32      Ethiopia         44.3         17.6         117.7             6.75             0.020080321
38      Guinea           40.0         16.9         123.7             5.46             0.016242749
54      Malaysia         24.1         5.2          20.9              3.10             0.009222073
58      Mozambique       42.0         16.8         114.9             5.76             0.017135207
69      Russia           14.5         14.8         21.6              1.95             0.005800982
72      Senegal          43.4         10.4         58.4              6.04             0.017968169
81      Syria            36.1         5.3          35.2              5.19             0.015439536
87      Turkey           20.4         5.2          33.3              2.35             0.006990927
88      Uganda           43.1         21.8         95.6              6.24             0.018563141
94      Yemen            43.3         7.9          57.9              6.86             0.020407556

( b ) Estimation of total Crude Birth Rate (CBR):

We have $N = 96$, $\rho_{xy(1)} = 0.9855$ and $\hat{p}_{1i} = \left(1 + \frac{1}{N}\right)^{1-\rho_{xy(1)}}\left(1 + p_i\right)^{\rho_{xy(1)}} - 1$, thus we have:

Sr.No.  Country      y_{1i} (CBR)  p_i          p̂_{1i}       y_{1i}/p̂_{1i}  (y_{1i}/p̂_{1i})^2
32      Ethiopia     44.3          0.020080321  0.019939540  2221.716        4936023.195
38      Guinea       40.0          0.016242749  0.016158031  2475.549        6128343.431
54      Malaysia     24.1          0.009222073  0.009239385  2608.399        6803744.595
58      Mozambique   42.0          0.017135207  0.017037470  2465.155        6076988.131
69      Russia       14.5          0.005800982  0.005867759  2471.131        6106488.384
72      Senegal      43.4          0.017968169  0.017858270  2430.247        5906098.555
81      Syria        36.1          0.015439536  0.015366526  2349.262        5519033.079
87      Turkey       20.4          0.006990927  0.007040517  2897.514        8395589.856
88      Uganda       43.1          0.018563141  0.018444549  2336.734        5460324.512
94      Yemen        43.3          0.020407556  0.020261985  2137.007        4566798.293
Sum                                                          24392.710       59899432.030

An estimate of the total CBR is given by
$$\hat{Y}_{CBR} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_{1i}}{\hat{p}_{1i}} = \frac{24392.71}{10} = 2439.271$$
and an estimate of $V(\hat{Y}_{CBR})$ is given by
$$\hat{v}(\hat{Y}_{CBR}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_{1i}^2}{\hat{p}_{1i}^2} - n\hat{Y}_{CBR}^2\right] = \frac{1}{10(10-1)}\left[59899432.03 - 10\times 2439.271^2\right] = 4433.35.$$
A $(1-\alpha)100\%$ confidence interval of the total CBR is given by
$$\hat{Y}_{CBR} \pm t_{\alpha/2}(df = n - 1)\sqrt{\hat{v}(\hat{Y}_{CBR})}.$$
Using Table 2 from the Appendix, the 95% confidence interval estimate of the total CBR is given by
$$2439.271 \pm 2.262\sqrt{4433.35}, \text{ or } [2288.66,\ 2589.88].$$

( c ) Estimation of total Crude Death Rate (CDR):

Here $N = 96$, $\rho_{xy(2)} = 0.5492$, and $\hat{p}_{2i} = \left(1 + \frac{1}{N}\right)^{1-\rho_{xy(2)}}\left(1 + p_i\right)^{\rho_{xy(2)}} - 1$, thus we have:

Sr.No.  Country      y_{2i} (CDR)  p_i          p̂_{2i}       y_{2i}/p̂_{2i}  (y_{2i}/p̂_{2i})^2
32      Ethiopia     17.6          0.020080321  0.015712557  1120.123        1254676.010
38      Guinea       16.9          0.016242749  0.013612204  1241.533        1541404.074
54      Malaysia     5.2           0.009222073  0.009760421  532.764         283837.384
58      Mozambique   16.8          0.017135207  0.014100977  1191.407        1419450.260
69      Russia       14.8          0.005800982  0.007879117  1878.383        3528323.172
72      Senegal      10.4          0.017968169  0.014556991  714.433         510415.059
81      Syria        5.3           0.015439536  0.013172143  402.364         161897.024
87      Turkey       5.2           0.006990927  0.008533810  609.341         371296.383
88      Uganda       21.8          0.018563141  0.014882611  1464.797        2145629.387
94      Yemen        7.9           0.020407556  0.015891492  497.121         247129.634
Sum                                                          9652.267        11464058.390

An estimate of the total CDR is given by
$$\hat{Y}_{CDR} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_{2i}}{\hat{p}_{2i}} = \frac{9652.267}{10} = 965.2267$$
and an estimate of $V(\hat{Y}_{CDR})$ is given by
$$\hat{v}(\hat{Y}_{CDR}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_{2i}^2}{\hat{p}_{2i}^2} - n\hat{Y}_{CDR}^2\right] = \frac{1}{10(10-1)}\left[11464058.39 - 10\times 965.2267^2\right] = 23860.36.$$
A $(1-\alpha)100\%$ confidence interval of the total CDR is given by
$$\hat{Y}_{CDR} \pm t_{\alpha/2}(df = n - 1)\sqrt{\hat{v}(\hat{Y}_{CDR})}.$$
Using Table 2 from the Appendix, the 95% confidence interval estimate of the total CDR is given by
$$965.2267 \pm 2.262\sqrt{23860.36}, \text{ or } [615.82,\ 1314.63].$$

( d ) Estimation of total Infant Mortality Rate (IMR):

Here $N = 96$, $\rho_{xy(3)} = 0.8528$, and $\hat{p}_{3i} = \left(1 + \frac{1}{N}\right)^{1-\rho_{xy(3)}}\left(1 + p_i\right)^{\rho_{xy(3)}} - 1$, thus we have:

Sr.No.  Country      y_{3i} (IMR)  p_i          p̂_{3i}       y_{3i}/p̂_{3i}  (y_{3i}/p̂_{3i})^2
32      Ethiopia     117.7         0.020080321  0.018652051  6310.298        39819860.21
38      Guinea       123.7         0.016242749  0.015383046  8041.320        64662833.81
54      Malaysia     20.9          0.009222073  0.009397829  2223.918        4945811.29
58      Mozambique   114.9         0.017135207  0.016143441  7117.442        50657976.27
69      Russia       21.6          0.005800982  0.006479085  3333.804        11114249.47
72      Senegal      58.4          0.017968169  0.016853056  3465.247        12007936.64
81      Syria        35.2          0.015439536  0.014698605  2394.785        5734995.29
87      Turkey       33.3          0.006990927  0.007494466  4443.279        19742724.66
88      Uganda       95.6          0.018563141  0.017359870  5506.954        30326538.49
94      Yemen        57.9          0.020407556  0.018930720  3058.521        9354549.37
Sum                                                          45895.570       248367475.50

An estimate of the total IMR is given by
$$\hat{Y}_{IMR} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_{3i}}{\hat{p}_{3i}} = \frac{45895.57}{10} = 4589.557$$
and an estimate of $V(\hat{Y}_{IMR})$ is given by
$$\hat{v}(\hat{Y}_{IMR}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_{3i}^2}{\hat{p}_{3i}^2} - n\hat{Y}_{IMR}^2\right] = \frac{1}{10(10-1)}\left[248367475.5 - 10\times 4589.557^2\right] = 419190.45.$$
A $(1-\alpha)100\%$ confidence interval of the total IMR is given by
$$\hat{Y}_{IMR} \pm t_{\alpha/2}(df = n - 1)\sqrt{\hat{v}(\hat{Y}_{IMR})}.$$
Using Table 2 from the Appendix, the 95% confidence interval estimate of the total IMR is given by
$$4589.557 \pm 2.262\sqrt{419190.45}, \text{ or } [3125.02,\ 6054.08].$$
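Parts ( b ) to ( d ) can be reproduced from the sampled data with one short script (a sketch; tiny last-digit differences from the printed tables are rounding):

```python
# Estimates of total CBR, CDR and IMR from the 10 sampled countries,
# using the Bansal-Singh transformation with the stated correlations.
from math import sqrt

N, n, t = 96, 10, 2.262                      # t value for df = n - 1
p = [0.020080321, 0.016242749, 0.009222073, 0.017135207, 0.005800982,
     0.017968169, 0.015439536, 0.006990927, 0.018563141, 0.020407556]
data = {                                     # variable: (rho_xy, sampled y's)
    "CBR": (0.9855, [44.3, 40.0, 24.1, 42.0, 14.5, 43.4, 36.1, 20.4, 43.1, 43.3]),
    "CDR": (0.5492, [17.6, 16.9, 5.2, 16.8, 14.8, 10.4, 5.3, 5.2, 21.8, 7.9]),
    "IMR": (0.8528, [117.7, 123.7, 20.9, 114.9, 21.6, 58.4, 35.2, 33.3, 95.6, 57.9]),
}

est = {}
for name, (rho, y) in data.items():
    ps = [(1 + 1 / N) ** (1 - rho) * (1 + pi) ** rho - 1 for pi in p]
    w = [yi / psi for yi, psi in zip(y, ps)]
    total = sum(w) / n
    v = (sum(wi**2 for wi in w) - n * total**2) / (n * (n - 1))
    half = t * sqrt(v)
    est[name] = (total, v, (total - half, total + half))
    print(name, est[name])
```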

The problem of estimation of the population total in multi-character surveys has also been considered by Pathak (1966), Kumar and Herzel (1988), Amahia, Chaubey, and Rao (1989) and Rao (1993a, 1993b). They considered the following types of transformations on the selection probabilities:

$p_{i0}^* = \dfrac{1}{N}$  [ Rao (1966a, 1966b) ]

$p_{1i}^* = \left(1 + \dfrac{1}{N}\right)^{1-\rho_{xy}}\left(1 + p_i\right)^{\rho_{xy}} - 1$  [ Bansal and Singh (1985) ]

$p_{2i}^* = \dfrac{1 - \rho_{xy}}{N} + \rho_{xy}p_i$  [ Amahia, Chaubey, and Rao (1989) ]

$p_{3i}^* = \left[\dfrac{\rho_{xy}}{p_i} + \left(1 - \rho_{xy}\right)N\right]^{-1}$  [ Amahia, Chaubey, and Rao (1989) ]

[ Mangat and Singh (1992-93) ]

4.5.1.1 GENERAL CLASS OF ESTIMATORS

Singh, Grewal, and Joarder (2002) proposed a general class of estimators as
$$\hat{Y}_g = \frac{1}{n}\sum_{i=1}^{n} y_i H(p_i), \qquad (4.5.1.1.1)$$
where $H(p_i)$ is a function of $p_i$ satisfying certain regularity conditions such as:

( i ) $H\left(\frac{1}{N}\right) = N$;
( ii ) the first and second order partial derivatives of $H$ with respect to $p_i$ exist and are assumed to be known constants.

Expanding $H(p_i)$ around $1/N$ with Taylor's series, and noting that $\left|p_i - \frac{1}{N}\right| < 1$, we have
$$\hat{Y}_g = \frac{1}{n}\sum_{i=1}^{n} y_i\left[N + \left(p_i - \frac{1}{N}\right)H' + \frac{1}{2}\left(p_i - \frac{1}{N}\right)^2 H'' + \cdots\right], \qquad (4.5.1.1.2)$$
where
$$H' = \left.\frac{\partial H}{\partial p_i}\right|_{p_i = 1/N} \quad \text{and} \quad H'' = \left.\frac{\partial^2 H}{\partial p_i^2}\right|_{p_i = 1/N}$$
denote the first and second order partial derivatives of $H$ with respect to $p_i$, and are known constants at $p_i = 1/N$.

Thus we have the following cases:

Case I. If we choose $H' = -\rho_{xy}N^2$ and $H'' = \left\{N^3/(N+1)\right\}\rho_{xy}\left[\rho_{xy}(1 + 2N) + 1\right]$ then the class of estimators $\hat{Y}_g$ reduces to the estimator of Bansal and Singh (1985).

Case II. If we choose $H' = -\rho_{xy}N^2$ and $H'' = 2\rho_{xy}^2 N^3$ then the class of estimators $\hat{Y}_g$ reduces to the first estimator of Amahia, Chaubey, and Rao (1989).

Case III. If we choose $H' = -\rho_{xy}N^2$ and $H'' = 2\rho_{xy}N^3$ then the class of estimators $\hat{Y}_g$ reduces to the second estimator of Amahia, Chaubey, and Rao (1989).

Thus several choices of $H'$ and $H''$ exist, depending upon the form of the function $H(p_i)$, but it is still not known which one is the best or optimum.
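A numerical check of Case I (a sketch with assumed $N$ and $\rho_{xy}$): a central difference applied to $H(p) = 1/p_{1i}^*$, the Bansal and Singh (1985) choice, recovers $H' = -\rho_{xy}N^2$ at $p_i = 1/N$:

```python
# Central-difference check that H(p) = 1/p* has H(1/N) = N and
# H'(1/N) = -rho * N^2 for the Bansal-Singh transformation.
N, rho = 20.0, 0.6

def H(p):
    p_star = (1 + 1 / N) ** (1 - rho) * (1 + p) ** rho - 1
    return 1.0 / p_star

h, p0 = 1e-6, 1 / N
H1 = (H(p0 + h) - H(p0 - h)) / (2 * h)       # numerical H'
print(H(p0), H1, -rho * N**2)
```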

4.5.2 STUDY VARIABLES HAVE POOR POSITIVE AS WELL AS POOR NEGATIVE CORRELATION WITH THE SELECTION PROBABILITIES

A natural question arises: in a multi-character survey, some variables may be poorly negatively correlated and some may be poorly positively correlated with the selection probabilities. Singh and Horn (1998) have made an attempt to suggest a transformation for such situations. They suggested an alternative estimator of the population total $Y$ as
$$\hat{Y}_P = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i^\circ}, \qquad (4.5.2.1)$$
where
$$p_i^\circ = \left(p_i^+\right)^{\rho_{xy}(1+\rho_{xy})/2}\left(p_i^-\right)^{-\rho_{xy}(1-\rho_{xy})/2}\left(\frac{1}{N}\right)^{(1-\rho_{xy})(1+\rho_{xy})}, \qquad (4.5.2.2)$$
with
$$p_i^+ = x_i\Big/\sum_{i=1}^{N} x_i, \qquad p_i^- = z_i\Big/\sum_{i=1}^{N} z_i = z_i\Big/\sum_{i=1}^{N} x_i \quad \text{for } z_i = \frac{X - n x_i}{N - n},$$
and $\rho_{xy}$ is the correlation between $y$ and $x$.

Thus there are the following cases:

Case I. If $\rho_{xy} = 1$ then $p_i^\circ = p_i^+$ and $\hat{Y}_P$ reduces to $\hat{Y}_{HH}$ owed to Hansen and Hurwitz (1943).

Case II. If $\rho_{xy} = 0$ then $p_i^\circ = 1/N$ and $\hat{Y}_P$ reduces to the estimator proposed by Rao (1966a, 1966b).

Case III. If $\rho_{xy} = -1$ then $p_i^\circ = p_i^-$ and $\hat{Y}_P$ reduces to $\hat{Y}_{SM}$ owed to Sahoo, Sahoo, and Mohanty (1994).
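The three limiting cases can be verified numerically (a sketch with an assumed small population):

```python
# The Singh-Horn transformation p_i reduces to p_i^+ (rho = 1), 1/N (rho = 0)
# and p_i^- (rho = -1).
N, n = 6, 2
x = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]
X = sum(x)

p_plus = [xi / X for xi in x]
p_minus = [(X - n * xi) / ((N - n) * X) for xi in x]

def p_circ(i, rho):
    return (p_plus[i] ** (rho * (1 + rho) / 2)
            * p_minus[i] ** (-rho * (1 - rho) / 2)
            * (1 / N) ** ((1 - rho) * (1 + rho)))

for rho in (1.0, 0.0, -1.0):
    print(rho, [round(p_circ(i, rho), 4) for i in range(N)])
```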

The transformation at (4.5.2.2) makes use of the known correlation coefficient $\rho_{xy}$ between the study variable $y$ and the auxiliary variable $x$. The study variable $y$ can be any one among the $k$ variables of interest, say $y_1, y_2, \ldots, y_k$, some of them having low positive and others having low negative correlation with the auxiliary variable $x$. In actual practice the value of $\rho_{xy}$ is not known in most surveys. Thus Singh and Horn (1998) have advised using an estimator of $\rho_{xy}$ in (4.5.2.2). Suppose $r_{xy}$ is an estimator of the correlation coefficient $\rho_{xy}$ defined as
$$r_{xy} = \frac{\sum_{i=1}^{n}\hat{p}_i(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n}\hat{p}_i(u_i - \bar{u})^2\sum_{i=1}^{n}\hat{p}_i(v_i - \bar{v})^2}}, \qquad (4.5.2.3)$$
where $u_i = \frac{y_i}{N\hat{p}_i}$, $v_i = \frac{x_i}{N\hat{p}_i}$, $\bar{u} = n^{-1}\sum_{i=1}^{n} u_i$, $\bar{v} = n^{-1}\sum_{i=1}^{n} v_i$, and
$$\hat{p}_i = \alpha p_i^+ + (1 - \alpha)p_i^-$$
denotes the probability of selecting the $i$th unit in the sample for a suitable choice of $\alpha$. In practice $\alpha$ can be taken as 0.5. Note that if $\alpha = 1$ then the estimator $r_{xy}$ will not work, since $v_i = X/N$ is then the same for every sampled unit; in such a situation the size measure is supposed to be different from the $x$ variable. Then an estimator of the transformation at (4.5.2.2) is given by
$$\hat{p}_i^\circ = \left(p_i^+\right)^{r_{xy}(1+r_{xy})/2}\left(p_i^-\right)^{-r_{xy}(1-r_{xy})/2}\left(\frac{1}{N}\right)^{(1-r_{xy})(1+r_{xy})}. \qquad (4.5.2.4)$$

We have the following theorems :

Theorem 4.5.2.1. The estimator j/I at (4.5.2.4) is a consistent estimator of / I .


Proof. Note that

I~ -11 < 1, Ipt -11 <1, and Ipi -11 < 1,


therefore using the binomial expansion and neglecting the higher order terms the
estimator (4.5.2.4) can easily be rewritten as

.' ( 2Y I) +2':ry (+
Pi = 1- rxy \ N
-) r;y (+ -) [+2 _2 ( 1)2 J.
Pi - Pi +2 Pi +Pi +0 Pi .r, ' N (4.5.2.5)

Let £1 denotes the expected value over all possible values of the correlation
coefficient, rxy' then
£1 (p;)= P; +O(n-I). (4.5.2.6)
Hence the theorem.

Theorem 4.5.2.2. A practically useful estimator of the population total in multi-character surveys is obtained by replacing $p_i^\circ$ by $\hat{p}_i^\circ$ in (4.5.2.1) as
$$\hat{Y}_S = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\hat{p}_i^\circ}. \qquad (4.5.2.7)$$

Theorem 4.5.2.3. The expected value of the estimator $\hat{Y}_S$ is given by
$$E(\hat{Y}_S) = \sum_{i=1}^{N}\frac{Y_i p_i}{p_i^\circ} + O\left(n^{-1}\right). \qquad (4.5.2.8)$$

Proof. Let $E_2$ denote the expected value for a given value of $r_{xy}$; then we have
$$E(\hat{Y}_S) = E_1 E_2\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\hat{p}_i^\circ}\ \Big|\ r_{xy}\right] = E_1\left[\sum_{i=1}^{N}\frac{Y_i p_i}{\hat{p}_i^\circ}\right] = \sum_{i=1}^{N}\frac{Y_i p_i}{p_i^\circ} + O\left(n^{-1}\right).$$
Hence the theorem.

In order to find the variance of the estimator $\hat{Y}_S$, let us define
$$\mu_{rs} = \sum_{i=1}^{N} p_i\left(U_i - \bar{U}\right)^r\left(V_i - \bar{V}\right)^s, \qquad \hat{\mu}_{rs} = \sum_{i=1}^{n}\hat{p}_i\left(u_i - \bar{u}\right)^r\left(v_i - \bar{v}\right)^s, \qquad \lambda_{rs} = \frac{\mu_{rs}}{\mu_{20}^{r/2}\mu_{02}^{s/2}}.$$
Defining
$$\delta = \frac{\hat{\mu}_{02}}{\mu_{02}} - 1, \qquad \eta = \frac{\hat{\mu}_{11}}{\mu_{11}} - 1, \qquad \varepsilon = \frac{\hat{\mu}_{20}}{\mu_{20}} - 1,$$
we have, to the first order of approximation,
$$E_1\left(\delta^2\right) = n^{-1}\left(\lambda_{04} - 1\right), \quad E_1\left(\eta^2\right) = n^{-1}\left(\frac{\lambda_{22}}{\rho_{xy}^2} - 1\right), \quad E_1\left(\varepsilon^2\right) = n^{-1}\left(\lambda_{40} - 1\right),$$
$$E_1\left(\delta\eta\right) = n^{-1}\left(\frac{\lambda_{13}}{\rho_{xy}} - 1\right), \quad E_1\left(\varepsilon\eta\right) = n^{-1}\left(\frac{\lambda_{31}}{\rho_{xy}} - 1\right), \quad \text{and} \quad E_1\left(\varepsilon\delta\right) = n^{-1}\left(\lambda_{22} - 1\right).$$
Let $V_1$ denote the variance over all possible values of the estimator of the correlation coefficient; then we have the following lemmas.

Lemma 4.5.2.1. The value of $V_1(r_{xy})$ is given by
$$V_1(r_{xy}) = n^{-1}\rho_{xy}^2\left[\frac{\lambda_{22}\left(2 + \rho_{xy}^2\right)}{2\rho_{xy}^2} + \frac{1}{4}\left(\lambda_{40} + \lambda_{04}\right) - \frac{\lambda_{13} + \lambda_{31}}{\rho_{xy}}\right]. \qquad (4.5.2.9)$$

Lemma 4.5.2.2. To the same order of approximation, the value of $V_1\left(r_{xy}^2\right)$ is given by
$$V_1\left(r_{xy}^2\right) = 4\rho_{xy}^2 V_1\left(r_{xy}\right). \qquad (4.5.2.10)$$

Thus we have the following theorem:

Theorem 4.5.2.4. The variance of the estimator $\hat{Y}_S$, to the first order of approximation, is given by
$$V(\hat{Y}_S) \approx \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2 p_i}{p_i^{\circ 2}} - \left(\sum_{i=1}^{N}\frac{Y_i p_i}{p_i^\circ}\right)^2\right] + V_1\left[\sum_{i=1}^{N}\frac{Y_i p_i}{\hat{p}_i^\circ}\right], \qquad (4.5.2.11)$$
where the second term, obtained by expanding $\hat{p}_i^\circ$ as in (4.5.2.5) and applying the expected values listed above, contributes terms of order $n^{-1}$ involving $\lambda_{22}$, $\lambda_{40}$, $\lambda_{04}$, $\lambda_{13}$ and $\lambda_{31}$ through $V_1(r_{xy})$ and $V_1\left(r_{xy}^2\right)$.

Proof. Let $V_1$ and $V_2$ denote the variance over all possible values of $r_{xy}$ and for a given value of $r_{xy}$, respectively. Then we have
$$V(\hat{Y}_S) = E_1\left[V_2\left(\hat{Y}_S \mid r_{xy}\right)\right] + V_1\left[E_2\left(\hat{Y}_S \mid r_{xy}\right)\right] = E_1\left\{\frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2 p_i}{\hat{p}_i^{\circ 2}} - \left(\sum_{i=1}^{N}\frac{Y_i p_i}{\hat{p}_i^\circ}\right)^2\right]\right\} + V_1\left[\sum_{i=1}^{N}\frac{Y_i p_i}{\hat{p}_i^\circ}\right],$$
which, after using the above expected values and simplification, reduces to (4.5.2.11). Hence the theorem.

4.6 CONCEPT OF REVISED SELECTION PROBABILITIES

We observed that SRSWR will be better than PPSWR sampling if the line of regression passes far away from the origin. Thus linearity between $y$ and $x$ is not a sufficient condition for the PPSWR scheme to be better than the SRSWR scheme, so in practice it is very difficult to know whether PPSWR sampling can be preferred over SRSWR sampling or not. In order to overcome this problem, Reddy and Rao (1977) suggested that the sample be selected with probability proportional to revised sizes and with replacement. The revised sizes are obtained through a location shift in the auxiliary variable as
$$X_i' = X_i + \left(L^{-1} - 1\right)\bar{X}, \quad \text{where } 0 < L \le 1.$$
This can also be treated as a compromise between PPSWR and SRSWR, leading to new selection probabilities given by
$$p_i' = X_i'\Big/\sum_{i=1}^{N} X_i' = Lp_i + (1 - L)\frac{1}{N}. \qquad (4.6.1)$$
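The identity (4.6.1) can be verified numerically (a sketch with assumed sizes):

```python
# The location shift X_i' = X_i + (1/L - 1)*Xbar yields the compromise
# probabilities p_i' = L*p_i + (1 - L)/N of (4.6.1).
X = [12.0, 7.0, 25.0, 4.0, 2.0]
N = len(X)
Xbar = sum(X) / N
L = 0.3

Xp = [Xi + (1 / L - 1) * Xbar for Xi in X]       # revised sizes
p_prime = [Xpi / sum(Xp) for Xpi in Xp]          # revised probabilities
p = [Xi / sum(X) for Xi in X]
compromise = [L * pi + (1 - L) / N for pi in p]  # right-hand side of (4.6.1)
print(p_prime, compromise)
```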

Thus we have the following theorem:



Theorem 4.6.1. An unbiased estimator of the population total $Y$ is given by
$$\hat{Y}_{RR} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i'} \qquad (4.6.2)$$
with variance
$$V(\hat{Y}_{RR}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i'} - Y^2\right]. \qquad (4.6.3)$$

Proof. Obvious by following Hansen and Hurwitz (1943).

For details one can refer to Reddy (1974), Singh, Kumar, and Chandak (1983) and Bedi and Rao (1996).

Example 4.6.1. One can observe for population 1 that, as the value of $L$ changes from 0.1 to 0.9 in steps of 0.2, the percent relative efficiency of the estimator $\hat{Y}_{RR}$ with respect to the estimator $\hat{Y}_{HH}$ changes as shown in the following table.

Thus a proper choice of L may lead to efficient estimators of population total.

The next section is devoted to defining an estimator of the finite population correlation coefficient $\rho_{xy}$ under PPSWR sampling.

4.7 ESTIMATION OF CORRELATION COEFFICIENT USING PPSWR SAMPLING

Assume $(y_i, x_i)$, $i = 1, 2, \ldots, n$, denotes a pair of observations on the variables $y$ and $x$ for a sample of size $n$ drawn with varying probabilities $p_i$ and with replacement from a population of size $N$. Following Gupta, Singh, and Kashani (1993), we have the following theorem.

Theorem 4.7.1. An estimator of $\rho_{xy} = S_{xy}\left(S_x^2 S_y^2\right)^{-1/2}$ under PPSWR sampling is given by
$$r_{xy} = \hat{\theta}_1\left(\hat{\theta}_2\hat{\theta}_3\right)^{-1/2}, \qquad (4.7.1)$$
where
$$\hat{\theta}_1 = \sum_{i=1}^{n}\frac{x_i y_i}{p_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{p_i}\right)\left(\sum_{i=1}^{n}\frac{y_i}{p_i}\right) + N(n-1)\sum_{i=1}^{n}\frac{x_i y_i}{p_i}, \qquad (4.7.2)$$
$$\hat{\theta}_2 = \sum_{i=1}^{n}\frac{x_i^2}{p_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{p_i}\right)^2 + N(n-1)\sum_{i=1}^{n}\frac{x_i^2}{p_i}, \qquad (4.7.3)$$
and
$$\hat{\theta}_3 = \sum_{i=1}^{n}\frac{y_i^2}{p_i^2} - \left(\sum_{i=1}^{n}\frac{y_i}{p_i}\right)^2 + N(n-1)\sum_{i=1}^{n}\frac{y_i^2}{p_i}. \qquad (4.7.4)$$

It is remarkable here that if $p_i = x_i/X$, then the above estimator of the correlation coefficient becomes non-functional. Thus one can use a second auxiliary variable, say $z$, for selecting the sample while using the above estimator of the correlation coefficient. For the bias and variance of the estimator $r_{xy}$, one can refer to Gupta, Singh, and Kashani (1993); the expressions are complicated and hence are not reported here. Gupta (2002) suggested a new estimator of the correlation coefficient under PPSWR sampling which guarantees an admissible estimate under such designs.
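Since the constants cancel in the ratio (4.7.1), each $\hat{\theta}$ needs only to be proportional, in expectation, to the corresponding population moment. A sketch (with an assumed tiny population and a second size variable $z$, as recommended above) checks by complete enumeration of ordered PPSWR samples of size $n = 2$ that $E(\hat{\theta}_2) = n(n-1)N(N-1)S_x^2$; the analogous identity holds for $\hat{\theta}_1$ and $\hat{\theta}_3$:

```python
# Exact expectation of theta_2 of (4.7.3) over all ordered PPSWR samples.
from itertools import product

x = [2.0, 4.0, 7.0, 9.0]
z = [1.0, 2.0, 3.0, 4.0]                 # size measure (second auxiliary variable)
N, n = len(x), 2
p = [zi / sum(z) for zi in z]

xbar = sum(x) / N
S_x2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)

E_theta2 = 0.0
for i, j in product(range(N), repeat=n):
    prob = p[i] * p[j]
    s1 = x[i] ** 2 / p[i] ** 2 + x[j] ** 2 / p[j] ** 2
    s2 = (x[i] / p[i] + x[j] / p[j]) ** 2
    s3 = x[i] ** 2 / p[i] + x[j] ** 2 / p[j]
    E_theta2 += prob * (s1 - s2 + N * (n - 1) * s3)
print(E_theta2, n * (n - 1) * N * (N - 1) * S_x2)
```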

EXERCISES

Exercise 4.1. Define an estimator of the regression coefficient $\beta = S_{xy}\left(S_x^2\right)^{-1}$ in PPSWR sampling.

Exercise 4.2. Write a short note on PPSWR sampling. Discuss at least one method of selecting a sample by PPSWR sampling.

Exercise 4.3. Consider a population consisting of $N = 5$ units, with population values given by:

$Y_i$: 3, 4, 5, 6, 8
$X_i$: 2, 4, 7, 9, 6

Draw all possible samples of size $n = 3$ units. Construct the sample space. Show that the estimator
$$\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}, \quad \text{where } p_i = X_i\Big/\sum_{i=1}^{N} X_i,$$
is an unbiased estimator of the population total. Also show that an unbiased estimator of the variance
$$V(\hat{Y}_{HH}) = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - Y^2\right]$$
is given by
$$\hat{v}(\hat{Y}_{HH}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n}\frac{y_i^2}{p_i^2} - n\hat{Y}_{HH}^2\right].$$

Exercise 4.4. Show that the probability of selection of the $i$th unit in the sample by using Lahiri's PPSWR sampling scheme is given by
$$p_i = X_i\Big/\sum_{i=1}^{N} X_i.$$
What is the probability of selecting a given sample of $n$ units using PPSWR sampling?

Exercise 4.5. Show that the estimator $\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}$ is an unbiased estimator of the population total. Find its variance and deduce the unbiased estimator of the variance.
Exercise 4.6. Write a short note on the relative efficiency of PPSWR sampling with respect to SRSWR sampling. Also discuss the concept of revised selection probabilities.

Exercise 4.7. Find the asymptotic bias and variance of the following estimators:
$$\bar{y}_1 = \alpha\bar{u}_n + (1 - \alpha)\bar{u}_n\left(\frac{\bar{X}}{\bar{v}_n}\right) \quad \text{and} \quad \bar{y}_2 = \bar{u}_n\left(\frac{\bar{X}}{\bar{v}_n}\right)^\alpha,$$
where $\bar{u}_n = n^{-1}\sum_{i=1}^{n}\frac{y_i}{Np_i}$, $\bar{v}_n = n^{-1}\sum_{i=1}^{n}\frac{x_i}{Np_i}$, and $p_i = Z_i\Big/\sum_{i=1}^{N} Z_i$ have their usual meanings under PPSWR sampling.

( a ) Discuss your views.
( b ) Suggest an estimator of the optimum value of $\alpha$ and study the effect of the replacement of $\alpha$ by its estimator in each case.
Hint: Pandey and Singh (1984).

Exercise 4.8. Is it possible to apply the PPSWR sampling scheme if the study variable and the auxiliary variable are negatively correlated?
Hint: Sahoo, Sahoo, and Mohanty (1994).

Exercise 4.9. Under multi-character surveys, consider an estimator of the population total $Y$:
$$\hat{Y}_M = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i^*},$$
where $p_i^* = \rho_{xy}p_i + \left(1 - \rho_{xy}\right)/N$ or $p_i^* = \left[\rho_{xy}/p_i + \left(1 - \rho_{xy}\right)N\right]^{-1}$.
( a ) Find the values of the correlation coefficient $\rho_{xy}$ under which the estimator $\hat{Y}_M$ reduces to $\hat{Y}_{Rao} = \frac{N}{n}\sum_{i=1}^{n} y_i$ and to the PPSWR estimator $\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}$.
( b ) Find the variance of the estimator $\hat{Y}_M$ and derive the values of $V(\hat{Y}_{Rao})$ and $V(\hat{Y}_{HH})$.
Hint: Amahia, Chaubey, and Rao (1989), Kumar and Agarwal (1997), Arnab (2001).
( c ) Consider a general class of estimators of the form $\hat{Y}_g = \frac{1}{n}\sum_{i=1}^{n} y_i H(p_i)$, where $H(p_i)$ is a function of $p_i$ satisfying certain regularity conditions, namely $H(1/N) = N$, and the first and second order partial derivatives of $H$ with respect to $p_i$ exist and are assumed to be known constants.
Show that the estimators in ( a ) and ( b ) are special cases of ( c ) for certain choices of parameters in the function $H(p_i)$. Study the bias and variance of the general class of estimators.
Hint: Singh, Grewal, and Joarder (2002), Espejo, Pineda, and Nadarajah (2003).

Exercise 4.10. ( a ) List a few situations where PPSWR sampling can be used in
actual practice.
( b ) Pick one situation of interest to you and collect information on one or more
variables of interest from a reasonably large number of respondents.
( c ) Suppose you later find that a few of the variables have very low correlation with
the selection probabilities. What kind of analysis might then be helpful to you?

Exercise 4.11. Let Ŷ = N ȳ be the usual unbiased estimator of Y based on an
SRSWR sampling scheme of size n. Let

Ŷ_pps = (X/n) Σ_{i=1}^{n} y_i/x_i

be the usual unbiased estimator of Y based on a PPSWR sample of size n, and let

Ŷ*_pps = (X*/n) Σ_{i=1}^{n} y_i/(x_i + dX̄)

be an alternative unbiased estimator of Y based on a PPSWR sample of size n,
where the size measure is the transformed variate x_i + dX̄ with total

X* = Σ_{i=1}^{N} (x_i + dX̄) = X(1 + d).

Show that

V(Ŷ*_pps) ≤ (d/(1 + d)) V(Ŷ) + (1/(1 + d)) V(Ŷ_pps).

Hint: Reddy and Rao (1977), Agarwal, Singh, and Goel (1979).

Exercise 4.12. ( a ) Under multi-character surveys, an estimator of the population
total Y is

Ŷ_p = (1/n) Σ_{i=1}^{n} y_i/p_i,

where ρ_xy is the correlation between y and x, p_i⁺ = x_i/X and p_i⁻ = (X − n x_i)/{(N − n)X}
with X = Σ_{i=1}^{N} x_i. Discuss the cases ρ_xy = 1, ρ_xy = 0, and ρ_xy = −1.
( b ) If the value of ρ_xy is unknown, then use its consistent estimator r_xy,

where u_i = y_i/(N p_i), ū = n^{-1} Σ_{i=1}^{n} u_i, v̄ = n^{-1} Σ_{i=1}^{n} v_i, and

p_i = a p_i⁺ + (1 − a) p_i⁻ for some real constant 0 ≤ a ≤ 1.


Study the asymptotic properties of the practical estimator

ŷ_pc = (1/n) Σ_{i=1}^{n} y_i/p_i**,

where

p_i** = (p_i⁺)^{r_xy(1 + r_xy)/2} (p_i⁻)^{−r_xy(1 − r_xy)/2} (1/N)^{(1 − r_xy)(1 + r_xy)}.

Hint: Singh and Horn (1998).

Exercise 4.13. If y is perfectly linearly correlated with the auxiliary variable x, so
that y = α + βx, where α, β(≠ 0) are constants, show that the sample mean estimator
Ŷ = N ȳ, with ȳ = n^{-1} Σ_{i=1}^{n} y_i, under SRSWR sampling is more efficient than the
Hansen and Hurwitz (1943) estimator

Ŷ_hh = (1/n) Σ_{i=1}^{n} y_i/p_i, with p_i = x_i/X, if (β²/α²) < (X̄ − H)/(H σ_x²),

where H = N(Σ_{i=1}^{N} x_i^{-1})^{-1} and X̄ = N^{-1} Σ_{i=1}^{N} x_i.

Hint: Raj (1968).

Exercise 4.14. Under the superpopulation model M: Y_i = β X_i + e_i such that

E_M(e_i | X_i) = 0, E_M(e_i² | X_i) = σ² X_i^g, and E_M(e_i e_j | X_i, X_j) = 0, g ≥ 0, show that

E_M[V(Ŷ_hh)] = (σ² X^g / n)[Σ_{i=1}^{N} p_i^{g−1} − Σ_{i=1}^{N} p_i^{g}], where Ŷ_hh = (1/n) Σ_{i=1}^{n} y_i/p_i.

Hint: Rao (1966a, 1966b).

Exercise 4.15. Consider a situation of having information on two auxiliary
variables x_i and z_i (say), for i = 1, 2, ..., N. Let a sample s of size n be drawn
with PPSWR sampling. Let p_i = z_i / Σ_{i=1}^{N} z_i denote the probability of selection on the
basis of the known auxiliary variable z_i for i = 1, 2, ..., N. Let y_i and x_i denote the values
of the variable under study y and of the second auxiliary variable x to be used at the
estimation stage. Let u_i = y_i/(N p_i) and v_i = x_i/(N p_i) for i = 1, 2, ..., n. Obviously
R̂_1 = ū_n/v̄_n, R̂_2 = ū_n/X̄, and R̂_3 = (ū_n/v̄_n)(X̄/v̄_n) can be considered as three
estimators of the ratio R = Y/X. Find the values of the real constants g_t, t = 1, 2, 3,
such that the linear variety of the estimators of R defined as

R̂_ms = Σ_{t=1}^{3} g_t R̂_t, with Σ_{t=1}^{3} g_t = 1,

remains unbiased with minimum variance.
Hint: Mahajan and Singh (1997).

PRACTICAL PROBLEMS

Practical 4.1. Use the cumulative total method to select a sample of ten units from
population 1 given in the Appendix by using nonreal estate farm loans as an
auxiliary variable.
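The cumulative total method referred to above can be sketched in a few lines of code. This is a hedged illustration, not part of the text: the function name `cumulative_total_select` and the convention of passing the random draws in explicitly are our own assumptions, made so the selection rule (a draw r selects the first unit whose cumulative total T_i satisfies r ≤ T_i) is easy to verify.

```python
from itertools import accumulate

def cumulative_total_select(sizes, draws):
    """Cumulative total method for PPSWR selection (sketch).

    sizes: integer size measures X_1, ..., X_N.
    draws: random integers r with 1 <= r <= X = sum(sizes); unit i is
           selected when T_{i-1} < r <= T_i, T_i being the cumulative totals.
    Returns the list of selected unit indices (0-based); selection is
    with replacement, so repeats are allowed.
    """
    totals = list(accumulate(sizes))  # T_1, T_2, ..., T_N
    chosen = []
    for r in draws:
        if not 1 <= r <= totals[-1]:
            raise ValueError("draw outside 1..X")
        # first index whose cumulative total covers r
        chosen.append(next(i for i, t in enumerate(totals) if r <= t))
    return chosen
```

For example, with sizes (3, 1, 6) the cumulative totals are (3, 4, 10), so a draw of 4 selects the second unit.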

Practical 4.2. Use Lahiri's method to select a sample of ten units from population
1 given in the Appendix by using nonreal estate farm loans as the auxiliary variable.
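Lahiri's method avoids cumulating the sizes: repeatedly draw a unit index i at random together with an integer r between 1 and M = max size, and accept unit i as soon as r does not exceed its size measure X_i. The sketch below is a minimal, assumed implementation (the name `lahiri_select` and the deterministic list of (i, r) pairs are ours, used in place of a random number generator so the accept/reject rule can be checked).

```python
def lahiri_select(sizes, pairs):
    """Lahiri's method for one PPS draw (sketch).

    sizes: size measures X_1, ..., X_N.
    pairs: sequence of candidate draws (i, r) with 0 <= i < N and
           1 <= r <= M, where M = max(sizes).  The pair is accepted
           (unit i selected) iff r <= X_i; otherwise the next pair is tried.
    Returns the index of the first accepted unit, or None if all rejected.
    """
    m = max(sizes)
    for i, r in pairs:
        if not (0 <= i < len(sizes) and 1 <= r <= m):
            raise ValueError("candidate pair out of range")
        if r <= sizes[i]:
            return i
    return None
```

With sizes (3, 1, 6) and M = 6, the pair (1, 2) is rejected because 2 > X_2 = 1, while (2, 5) is accepted because 5 ≤ X_3 = 6.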

Practical 4.3. Your supervisor used PPSWR sampling to select a sample of six
states from population 1 in the Appendix and collected the following information.

State                              KY        NE        ME      OH       OK        TX
Nonreal estate farm loans ($000)   557.656   3585.406  51.539  635.774  1716.087  3520.361
Real estate farm loans ($000)      1045.106  1337.852  8.849   870.720  612.108   1248.761

The total amount $43908.12 of nonreal estate farm loans (in $000) for the year
1997 is known. Use the Hansen and Hurwitz (1943) estimator to estimate the
total amount of the real estate farm loans (in $000) during 1997 in the United
States. Also find an estimator of the variance of the estimator and hence derive the 95%
confidence interval.
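For Practical 4.3 and similar problems, the Hansen and Hurwitz (1943) estimator and its usual variance estimator can be computed as below. This is a sketch under our own naming (`hansen_hurwitz` is not from the text); it assumes the standard PPSWR formulas Ŷ = (1/n) Σ y_i/p_i and v(Ŷ) = {n(n − 1)}^{-1} Σ (y_i/p_i − Ŷ)², with a normal-theory 95% interval Ŷ ± 1.96 √v(Ŷ). For Practical 4.3 one would take p_i = x_i/43908.12 (nonreal estate loans) and y_i the real estate loans.

```python
import math

def hansen_hurwitz(y, p, z=1.96):
    """Hansen-Hurwitz estimation under PPSWR (sketch).

    y: sampled study-variable values; p: their selection probabilities p_i.
    Returns (estimate, variance estimate, confidence interval) using
        Y_hat = (1/n) * sum(y_i / p_i)
        v(Y_hat) = 1/(n(n-1)) * sum((y_i/p_i - Y_hat)^2).
    """
    n = len(y)
    ratios = [yi / pi for yi, pi in zip(y, p)]
    est = sum(ratios) / n
    var = sum((r - est) ** 2 for r in ratios) / (n * (n - 1))
    half = z * math.sqrt(var)
    return est, var, (est - half, est + half)
```

A quick sanity check: with two draws y = (2, 4) and equal probabilities p = (0.5, 0.5), the ratios are 4 and 8, so the estimate is 6 and the variance estimate is 4.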

Practical 4.4. Discuss the relative efficiency of the PPSWR sampling estimator for
estimating the total amount of the real estate farm loans during 1997, using
information on nonreal estate farm loans during 1997, with respect to the usual and
ratio estimators of the population mean based on an SRSWOR sample. Use the information
given in population 1 of the Appendix.

Practical 4.5. Consider the problem of estimating the average price of the
commercial apple crop. Select a PPSWR sample of seven units from population 3 in
the Appendix by Lahiri's method, using the season average price of the apple crop
during 1994 in the United States as a benchmark or auxiliary variable. Collect the
information on the prices of the apple crop during 1995 and 1996 from population 3
for the units included in the sample.
Use the estimator

ȳ_lr = ū_n + â(X̄ − v̄_n),

where

â = {Σ_{i=1}^{n} (u_i − ū_n)(v_i − v̄_n)} / {Σ_{i=1}^{n} (v_i − v̄_n)²},

for estimating the average price of the apple crop during 1996 by assuming that the
average price during 1995 is known. Also construct a 95% confidence interval for
the average price of the apple crop during 1996 in the United States.

Practical 4.6. It is a well known fact that age and duration of sleep are negatively
correlated. Assume that the whole population information is known, as presented
in population 2 of the Appendix. Discuss the gain in efficiency due to the Sahoo,
Sahoo, and Mohanty (1994) method of estimation over the Hansen and Hurwitz
(1943) estimator. Assume that the sample consists of ten units.

Practical 4.7. Your boss Ms. Stephanie Singh used the TFR to select a PPSWR sample
of 5 units from population 8 given in the Appendix. Her interest was to estimate
the total 'Crude Birth Rate (CBR)', 'Crude Death Rate (CDR)', 'Infant Mortality
Rate (IMR)' and 'Expectation of Life at Birth (ELB)' in the world, as shown in the
following table.

No.  Country         CBR   CDR   IMR   ELB   TFR
42   India           23.5  8.9   63.5  61.5  2.90
06   Australia       13.0  6.9   5.0   80.4  1.80
20   Canada          12.3  7.2   5.5   80.0  1.80
01   United States   14.2  8.8   6.2   76.3  2.07
90   United Kingdom  12.1  10.9  6.0   77.1  1.79

Later she found that CBR and IMR have high positive correlation with the 'Total
Fertility Rate (TFR)', whereas CDR has low positive correlation with TFR and
ELB has high negative correlation. She was worried when she came to know that the
correlation coefficients of TFR with CBR, CDR, IMR, and ELB are +0.9855,
+0.5492, +0.8525, and -0.8152, respectively. Help her by suggesting an appropriate
transformation of the selection probabilities using the known values of the correlation
coefficients to obtain good estimates of the total CBR, CDR, IMR, and ELB. Also
suggest that she construct a 95% confidence interval in each situation.
Hint: Singh and Horn (1998).

Practical 4.8. The following table gives the weekly wages and expenditure of all
the seven households on a driveway:

Household No.           1    2    3    4    5    6    7
Weekly wages ($)        358  540  225  188  579  150  530
Weekly expenditure ($)  225  329  198  160  357  130  329

( I ) John considers the problem of estimation of the weekly expenditure using an
SRSWR sampling design and selects only two households, the second and the seventh.
( a ) Estimate the total weekly expenditure based on John's sampling scheme.
( b ) Estimate the variance of the estimator of total weekly expenditure and derive
the 95% confidence interval estimate for John's SRSWR design.
( II ) Michael considers the same problem of estimation of the weekly expenditure
using the probability proportional to weekly wages design and again selects the
same second and seventh households in the sample.
( d ) Estimate the total expenditure based on Michael's sampling scheme.
( e ) Estimate the variance of the estimator of total weekly expenditure and derive
the 95% confidence interval estimate for Michael's PPSWR design.
( III ) Who is more accurate: John or Michael?

Practical 4.9. Consider a class with six girls; their heights and responses to the
question, 'Do you like Bob?', are given in the following table:

Girl         Amy  Cathy  Eileen  Heather  Jennifer  Kelly
Likes Bob    Yes  No     Yes     No       Yes       No
Height (cm)  167  165    169     164      171       162

( a ) Find the true proportion of girls who like Bob.

( b ) Not all girls want to disclose their response, but they agree to give four responses.
Select a sample of four girls using SRSWR sampling and estimate the proportion of
girls who like Bob. (Rule: Use the first column of the Pseudo-Random Number
Table 1 given in the Appendix to select the sample.)
( c ) Estimate the variance of the estimate of the proportion and construct a 95%
confidence interval estimate.
( d ) Does the true proportion lie in your confidence interval estimate?
( e ) Select a sample of four girls with probability proportional to the height of the
girls. (Rule: Use the Cumulative Total Method and the first three columns of the Pseudo-
Random Number Table 1 given in the Appendix.)
( f ) Estimate the proportion of girls who like Bob using PPSWR sampling.
( g ) Estimate the variance of the estimator of the proportion of girls who like Bob and
construct a 95% confidence interval estimate.
( h ) Does the true proportion lie in your new confidence interval estimate?

Practical 4.10. Consider a class with six students; their marks in the assignments
and the final examination are given in the following table:

Student            Amy  Bob  Cathy  Don  Erika  Frank
Assignments        78   58   87     75   56     69
Final examination  72   56   94     67   58     75

( a ) Find the true average marks for the final examination.

( b ) Select a sample of three students using SRSWR sampling and estimate the
average marks for the final examination. (Rule: Use the third column of the Pseudo-
Random Number Table 1 given in the Appendix to select the sample.)
( c ) Estimate the variance of the estimate of the average and construct a 95% confidence
interval estimate.
( d ) Does the true average lie in your confidence interval estimate?
( e ) Select a sample of three students with probability proportional to marks on the
assignments. (Rule: Use Lahiri's method and specify the columns used from the
Pseudo-Random Number Table 1 given in the Appendix.)
( f ) Estimate the average marks for the final examination using PPSWR sampling.
( g ) Estimate the variance of the estimator of the average marks and construct a 95%
confidence interval estimate.
( h ) Do the true average marks for the final examination lie in your new
confidence interval estimate?
5. USE OF AUXILIARY INFORMATION: PROBABILITY
PROPORTIONAL TO SIZE AND WITHOUT
REPLACEMENT (PPSWOR) SAMPLING

5.0 INTRODUCTION

In the probability proportional to size and without replacement (PPSWOR) sampling
scheme, we will discuss the Horvitz and Thompson (1952) estimator, two forms of
the variance of the Horvitz and Thompson (1952) estimator and their estimators, the
superpopulation model, construction of inclusion probabilities, calibrated estimators
of the population total and calibrated estimators of the variance of the estimators of the total
such as the ratio estimator, linear regression estimator, regression predictor, and distribution
function, the Rao, Hartley, and Cochran (1962) sampling scheme, unbiased estimation
strategies under IPPS sampling, and a unified approach. At the end, we will celebrate the
Golden Jubilee Year 2003 of the traditional linear regression estimator due to
Hansen, Hurwitz, and Madow (1953). Before going further we would like to define
a few important symbols and mathematical relations.

( a ) p_i = X_i/X, where X = Σ_{i∈Ω} X_i, denotes the probability of selecting the i-th
population unit at any particular draw.
( b ) π_i : Probability of including the i-th population unit in the sample.
( c ) π_ij : Probability of including both the i-th and j-th (such that i ≠ j) population units in
the sample, and so on.

5.0.2 SOME MATHEMATICAL RELATIONS

To discuss some mathematical relations, we make use of the following theorems:

Theorem 5.0.1. The sum of the probabilities of including the i-th population unit in
the sample over the whole population is equal to the sample size, i.e.,

Σ_{i∈Ω} π_i = n,                                                     (5.0.1)

where i ∈ Ω implies that the i-th unit belongs to the population Ω of size N.
Proof. Let us define a variable t_i which takes the value 1 if the i-th population unit is
included in the sample and zero otherwise, i.e.,

t_i = 1 with probability π_i, and t_i = 0 with probability (1 − π_i).   (5.0.2)

The expected value of the random variable t_i is

E(t_i) = 1 × π_i + (1 − π_i) × 0 = π_i.                              (5.0.3)
S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

Note that we are using without replacement (WOR) sampling, so there is no chance
of any unit being repeated. The n selected units will be distinct, and we have

Σ_{i∈Ω} t_i = n.                                                     (5.0.4)

Taking expected values on both sides of (5.0.4) we have

Σ_{i∈Ω} E(t_i) = n.                                                  (5.0.5)

On using (5.0.3) in (5.0.5) we have

Σ_{i∈Ω} π_i = n.                                                     (5.0.6)

Hence the theorem.
Theorem 5.0.2. The sum of the probabilities of including pairs of units from the
whole population in the sample is given by the product of the probability of including
the i-th population unit and (n − 1). In other words,

Σ_{j(≠i)∈Ω} π_ij = (n − 1)π_i.                                       (5.0.7)

Proof. Let us consider a random variable t_ij which takes the value 1 if both the i-th and
j-th population units are included in the sample and 0 otherwise, i.e.,

t_ij = 1 with probability π_ij, and t_ij = 0 with probability (1 − π_ij).   (5.0.8)

Thus the expected value of t_ij is given by

E(t_ij) = π_ij × 1 + (1 − π_ij) × 0 = π_ij.                          (5.0.9)

Now we have

E[Σ_{j(≠i)∈Ω} t_ij] = Σ_{j(≠i)∈Ω} E(t_ij) = Σ_{j(≠i)∈Ω} π_ij.        (5.0.10)

Note that π_ij is the probability for both the i-th and j-th (i ≠ j) units to be included in the
sample s, so we have

Σ_{j(≠i)∈Ω} π_ij = Σ_{j(≠i)∈Ω} P[both the i-th and j-th (i ≠ j) units are included in the sample]
                 = Σ_{j(≠i)∈Ω} P(j ∈ s | i ∈ s) P(i ∈ s)
                 = π_i Σ_{j(≠i)∈Ω} P(j ∈ s | i ∈ s).                 (5.0.11)

In Theorem 5.0.1 we proved that the sum of the probabilities of including the i-th
population unit in the sample is Σ_{i∈Ω} π_i = n, where Ω denotes the set of all N units
in the population and n is the number of units in the sample s. Now in
Σ_{j(≠i)∈Ω} P(j ∈ s | i ∈ s) the number of population units is (N − 1) (since the i-th unit has
already been selected) and in the sample we have to select a further (n − 1) units. Thus

Σ_{j(≠i)∈Ω} P(j ∈ s | i ∈ s) = n − 1.                                (5.0.12)

Hence the theorem.
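The identities (5.0.1) and (5.0.7) can be verified numerically for any without replacement design by enumerating its samples. The helper below is our own sketch (the representation of a design as a dictionary mapping each sample to its probability is an assumption for illustration, not the book's notation).

```python
from itertools import combinations

def inclusion_probs(design):
    """Given a design {frozenset(sample): probability}, return (pi, pi2):
    pi[i] is the first order and pi2[(i, j)] (with i < j) the second order
    inclusion probability, each obtained by summing over the samples
    that contain the unit(s)."""
    units = sorted(set().union(*design))
    pi = {i: sum(p for s, p in design.items() if i in s) for i in units}
    pi2 = {(i, j): sum(p for s, p in design.items() if i in s and j in s)
           for i, j in combinations(units, 2)}
    return pi, pi2

# Example: all C(5, 3) = 10 samples of size n = 3, each with probability 0.1.
design = {frozenset(c): 0.1 for c in combinations(range(5), 3)}
pi, pi2 = inclusion_probs(design)
```

For this equal-probability design, Σ π_i = 3 = n, and for each i the row sum Σ_{j≠i} π_ij equals (n − 1)π_i = 2 × 0.6 = 1.2, as the two theorems require.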

5.1 HORVITZ AND THOMPSON ESTIMATOR AND RELATED TOPICS

The Horvitz and Thompson (1952) estimator (or HT estimator) of the population
total Y is a linear estimator in the sample observations. The Horvitz and Thompson
estimator (a universal estimator), on the basis of the n sample observations y_i, i = 1, 2, ..., n, can
be defined as

Ŷ_HT = Σ_{i∈s} d_i y_i,                                              (5.1.1)

where d_i, i = 1, 2, ..., n, are predetermined real constants or design weights. Thus we
have the following theorems:

Theorem 5.1.1. For the estimator Ŷ_HT to be unbiased for the population total Y,
the design weights are given by

d_i = 1/π_i.                                                          (5.1.2)

Proof. Taking expected values on both sides of (5.1.1), we have

E(Ŷ_HT) = E[Σ_{i∈s} d_i y_i] = E[Σ_{i∈Ω} t_i d_i Y_i],

where t_i is a random variable that takes the value 1 if the i-th population unit is
included in the sample and 0 otherwise. Note that here both d_i and Y_i (the i-th
population value) are constants. Therefore we have

E(Ŷ_HT) = Σ_{i∈Ω} E(t_i) d_i Y_i = Σ_{i∈Ω} π_i d_i Y_i.              (5.1.3)

Now Σ_{i∈Ω} π_i d_i Y_i is equal to Σ_{i∈Ω} Y_i if d_i = 1/π_i. Therefore the condition for Ŷ_HT to be an
unbiased estimator of the population total Y is that the design weights d_i are equal to
1/π_i. Thus we have the following theorem:

Theorem 5.1.2. An unbiased estimator of the population total Y is given by

Ŷ_HT = Σ_{i∈s} y_i/π_i.                                               (5.1.4)

Proof. Follows by substituting d_i = 1/π_i in (5.1.1).

Theorem 5.1.3. Under SRSWOR sampling, the estimator (5.1.4) reduces to the
estimator of the population total Y given by

Ŷ_HT = N ȳ.                                                           (5.1.5)

Proof. Under SRSWOR sampling, the probability of including the i-th population
unit in the sample s is

π_i = C(N − 1, n − 1)/C(N, n) = n/N.                                  (5.1.6)

On substituting the value of π_i from (5.1.6) in (5.1.4) we have

Ŷ_HT = Σ_{i∈s} y_i/(n/N) = (N/n) Σ_{i=1}^{n} y_i = N ȳ.               (5.1.7)

Hence the theorem.
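A one-line computational version of (5.1.4) follows; the function name is our own. Under SRSWOR with π_i = n/N the estimate reduces to N ȳ, as Theorem 5.1.3 asserts, which gives a quick check.

```python
def ht_estimate(y, pi):
    """Horvitz-Thompson estimator of the population total, eq. (5.1.4):
    the sum of y_i / pi_i over the sampled units."""
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    return sum(yi / p for yi, p in zip(y, pi))
```

For example, with N = 5, n = 3 (so π_i = 0.6) and sample values (2, 4, 6), the estimate is 12/0.6 = 20, which equals N ȳ = 5 × 4.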

Theorem 5.1.4. The variance of the Horvitz and Thompson (1952) estimator,
Ŷ_HT = Σ_{i∈s} y_i/π_i, of the population total Y is

V(Ŷ_HT) = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j.   (5.1.8)

Proof. Let us define a random variable t_i as in (5.0.2), i.e.,

t_i = 1 with probability π_i, and t_i = 0 with probability (1 − π_i).   (5.1.9)

Now we have

V(Ŷ_HT) = V(Σ_{i∈s} y_i/π_i) = V(Σ_{i∈Ω} t_i Y_i/π_i).                (5.1.10)

Because the sample has been taken by using WOR sampling, the random variables
t_i and t_j for i ≠ j are not independent; thus

V(Ŷ_HT) = Σ_{i∈Ω} V(t_i Y_i/π_i) + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} Cov(t_i Y_i/π_i, t_j Y_j/π_j).   (5.1.11)

Now we have

V(t_i) = E(t_i²) − {E(t_i)}².                                         (5.1.12)

From (5.1.9) we have

t_i = 1 if the i-th population unit is included in the sample, and 0 otherwise.   (5.1.13)

Thus the expected value of t_i is given by

E(t_i) = π_i × 1 + (1 − π_i) × 0 = π_i.                               (5.1.14)

Also from (5.1.13) we have

t_i² = 1 if the i-th population unit is included in the sample, and 0 otherwise.   (5.1.15)

Therefore

E(t_i²) = π_i × 1 + (1 − π_i) × 0 = π_i                               (5.1.16)

and

V(t_i) = E(t_i²) − {E(t_i)}² = π_i − π_i² = π_i(1 − π_i).             (5.1.17)

Now

Cov(t_i, t_j) = E(t_i t_j) − E(t_i)E(t_j)                             (5.1.18)

and

t_i t_j = 1 if both the i-th and j-th population units are included in the sample, and 0 otherwise.   (5.1.19)

Thus

E(t_i t_j) = 1 × π_ij + 0 × (1 − π_ij) = π_ij.

Therefore we get

Cov(t_i, t_j) = E(t_i t_j) − E(t_i)E(t_j) = π_ij − π_i π_j.           (5.1.20)

Now plugging the values of V(t_i) and Cov(t_i, t_j) into (5.1.11) we have

V(Ŷ_HT) = Σ_{i∈Ω} (Y_i²/π_i²) V(t_i) + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (Y_i Y_j/(π_i π_j)) Cov(t_i, t_j)
        = Σ_{i∈Ω} (Y_i²/π_i²) π_i(1 − π_i) + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (Y_i Y_j/(π_i π_j)) (π_ij − π_i π_j),

which on simplifying reduces to (5.1.8). Hence the theorem.

Theorem 5.1.5. In the case of SRSWOR sampling, the variance of the Horvitz and
Thompson (1952) estimator reduces to

V(Ŷ_HT)_SRSWOR = N²((1 − f)/n) S_y²,                                  (5.1.21)

where f = n/N denotes the finite population correction (f.p.c.) factor.
Proof. We know that the probability of including the i-th population unit in the
sample by using SRSWOR sampling is

π_i = C(N − 1, n − 1)/C(N, n) = n/N                                   (5.1.22)

and the probability of including both the i-th and j-th population units in the sample is

π_ij = C(N − 2, n − 2)/C(N, n) = n(n − 1)/(N(N − 1)).                 (5.1.23)

On substituting the values of π_i and π_ij in (5.1.8) we have

V(Ŷ_HT) = Σ_{i∈Ω} ((1 − n/N)/(n/N)) Y_i²
          + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} {(n(n − 1)/(N(N − 1)) − n²/N²)/(n²/N²)} Y_i Y_j
        = ((N − n)/n) Σ_{i=1}^{N} Y_i² − ((N − n)/(n(N − 1))) Σ_{i=1}^{N} Σ_{j≠i=1}^{N} Y_i Y_j.   (5.1.24)

Note that

Y² = (Σ_{i=1}^{N} Y_i)² = Σ_{i=1}^{N} Y_i² + Σ_{i=1}^{N} Σ_{j≠i=1}^{N} Y_i Y_j.   (5.1.25)

Therefore

Σ_{i=1}^{N} Σ_{j≠i=1}^{N} Y_i Y_j = Y² − Σ_{i=1}^{N} Y_i².            (5.1.26)

On substituting (5.1.26) in (5.1.24) we have

V(Ŷ_HT) = ((N − n)/n)[Σ_{i=1}^{N} Y_i² − (1/(N − 1)){Y² − Σ_{i=1}^{N} Y_i²}]
        = ((N − n)/(n(N − 1)))[N Σ_{i=1}^{N} Y_i² − Y²]
        = ((N − n)/(n(N − 1)))[N Σ_{i=1}^{N} Y_i² − (N Ȳ)²]
        = (N(N − n)/n) S_y² = N²((1 − f)/n) S_y².                     (5.1.27)

Hence the theorem.
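Formula (5.1.8) and the SRSWOR special case (5.1.21) can be checked against each other numerically. The sketch below uses our own naming; the population values Y, the first order probabilities π_i, and the joint probabilities π_ij (keyed by index pairs with i < j) are passed in directly.

```python
def ht_variance(Y, pi, pi2):
    """Variance of the Horvitz-Thompson estimator by eq. (5.1.8).

    Y:   population values Y_1, ..., Y_N.
    pi:  first order inclusion probabilities pi_i.
    pi2: dict mapping (i, j) with i < j to pi_ij.
    """
    N = len(Y)
    v = sum((1 - pi[i]) / pi[i] * Y[i] ** 2 for i in range(N))
    for i in range(N):
        for j in range(N):
            if i != j:
                pij = pi2[(min(i, j), max(i, j))]
                v += (pij - pi[i] * pi[j]) / (pi[i] * pi[j]) * Y[i] * Y[j]
    return v
```

With N = 5, n = 3, Y = (1, 2, 3, 4, 5) and the SRSWOR probabilities π_i = 0.6, π_ij = 0.3, both (5.1.8) and N²(1 − f)S_y²/n give 25/3.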

Theorem 5.1.6. Another form of V(Ŷ_HT), developed by Sen (1953) and by Yates
and Grundy (1953) independently, is given by

V(Ŷ_HT)_SYG = (1/2) Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i/π_i − Y_j/π_j)².   (5.1.28)

Proof. We have

V(Ŷ_HT)_SYG = (1/2) Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i/π_i − Y_j/π_j)²
            = (1/2) Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i²/π_i² + Y_j²/π_j²)
              + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_ij − π_i π_j)(Y_i Y_j/(π_i π_j)).        (5.1.29)

Note that the terms Y_i²/π_i² and Y_j²/π_j² enter the symmetric double sum with the same
coefficients, so (Y_i²/π_i² + Y_j²/π_j²) contributes 2Y_i²/π_i², and (5.1.29) becomes

V(Ŷ_HT)_SYG = Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i²/π_i²)
              + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_ij − π_i π_j)(Y_i Y_j/(π_i π_j)).        (5.1.30)

Note that Σ_{j(≠i)∈Ω} π_j = n − π_i; the i-th unit with probability π_i is not there in the sum
because the sum of all the inclusion probabilities is n. Also we know that
Σ_{j(≠i)∈Ω} π_ij = (n − 1)π_i. Using these results in (5.1.30) we have

V(Ŷ_HT)_SYG = Σ_{i∈Ω} (Y_i²/π_i)(n − π_i) − Σ_{i∈Ω} (Y_i²/π_i)(n − 1)
              + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_ij − π_i π_j)(Y_i Y_j/(π_i π_j))
            = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j
            = V(Ŷ_HT).                                                           (5.1.31)

Hence the theorem.

Corollary 5.1.1. The variance V(Ŷ_HT)_SYG reduces to zero if π_i ∝ Y_i.

Proof. We have π_i ∝ Y_i. Let a be the constant of proportionality; then π_i = aY_i.
Taking the sum on both sides we have Σ_{i∈Ω} π_i = a Σ_{i∈Ω} Y_i, which implies that a = n/Y.
Thus we have π_i = nY_i/Y. On substituting π_i = nY_i/Y in (5.1.28) we obtain
V(Ŷ_HT)_SYG = 0. Hence the corollary.
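The Sen-Yates-Grundy form (5.1.28) can be computed in the same style as (5.1.8); on any fixed-size design it should agree with (5.1.8), and by Corollary 5.1.1 it vanishes when π_i ∝ Y_i. A hedged sketch with our own naming:

```python
def ht_variance_syg(Y, pi, pi2):
    """Sen-Yates-Grundy form of the HT variance, eq. (5.1.28):
    (1/2) * sum over i != j of (pi_i*pi_j - pi_ij) * (Y_i/pi_i - Y_j/pi_j)^2.

    pi2 maps (i, j) with i < j to pi_ij."""
    N = len(Y)
    v = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                pij = pi2[(min(i, j), max(i, j))]
                v += 0.5 * (pi[i] * pi[j] - pij) * (Y[i] / pi[i] - Y[j] / pi[j]) ** 2
    return v
```

On the SRSWOR design with N = 5, n = 3, π_i = 0.6, π_ij = 0.3, and Y = (1, 2, 3, 4, 5), this again yields 25/3; if all Y_i/π_i are equal (e.g., equal π_i and equal Y_i), every squared difference is zero and the variance vanishes.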

Theorem 5.1.7. An unbiased estimator of the variance V(Ŷ_HT) of the Horvitz and
Thompson (1952) estimator of the population total Y is given by

v(Ŷ_HT) = Σ_{i∈s} ((1 − π_i)/π_i²) y_i² + Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/π_ij)(y_i y_j/(π_i π_j)).   (5.1.32)

Proof. We know that the variance of the Horvitz and Thompson (1952) estimator of
the population total Y is given by

V(Ŷ_HT) = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² + Σ_{i∈Ω} Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j.   (5.1.33)

The estimator of the first term in (5.1.33) will be of the type

Σ_{i∈s} ((1 − π_i)/π_i) y_i² a_i.                                     (5.1.34)

Now choose a_i in such a manner that

E[Σ_{i∈s} ((1 − π_i)/π_i) y_i² a_i] = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i².   (5.1.35)

Now

E[Σ_{i∈s} ((1 − π_i)/π_i) y_i² a_i] = E[Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² a_i t_i]
                                    = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² a_i π_i.   (5.1.36)

Note that (5.1.35) will be true if

Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i² a_i π_i = Σ_{i∈Ω} ((1 − π_i)/π_i) Y_i²,   (5.1.37)

or if a_i π_i = 1, i.e., a_i = 1/π_i. Thus an unbiased estimator of the first term of
(5.1.33) is

Σ_{i∈s} ((1 − π_i)/π_i²) y_i².                                        (5.1.38)

Similarly, the estimator of the second term in (5.1.33) will be of the type

Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/(π_i π_j)) y_i y_j a_ij.        (5.1.39)

Now choose a_ij in such a manner that

E[Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/(π_i π_j)) y_i y_j a_ij]
  = Σ_{i∈Ω} Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j.         (5.1.40)

Taking expectation with the help of the random variable t_ij we have

E[Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/(π_i π_j)) y_i y_j a_ij]
  = Σ_{i∈Ω} Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j a_ij π_ij.   (5.1.41)

Thus (5.1.40) will be true if

Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j a_ij π_ij = Σ_{j(≠i)∈Ω} ((π_ij − π_i π_j)/(π_i π_j)) Y_i Y_j,   (5.1.42)

or in other words a_ij = 1/π_ij. Therefore an estimator of the second term of (5.1.33)
is given by

Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/(π_i π_j)) y_i y_j (1/π_ij)
  = Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/π_ij)(y_i y_j/(π_i π_j)).   (5.1.43)

On combining (5.1.38) and (5.1.43) an unbiased estimator of V(Ŷ_HT) is given by

v(Ŷ_HT) = Σ_{i∈s} ((1 − π_i)/π_i²) y_i² + Σ_{i∈s} Σ_{j(≠i)∈s} ((π_ij − π_i π_j)/π_ij)(y_i y_j/(π_i π_j)).   (5.1.44)

Hence the theorem.
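The variance estimator (5.1.44) takes the sample values y_i together with their π_i and π_ij. The sketch below is ours; a convenient check is the SRSWOR case, where it reduces to N²(1 − f)s_y²/n with s_y² the sample variance.

```python
def ht_variance_estimate(y, pi, pi2):
    """Unbiased HT variance estimator, eq. (5.1.44), from a sample.

    y, pi: sampled values and their first order inclusion probabilities.
    pi2:   dict mapping sample-index pairs (i, j), i < j, to pi_ij.
    """
    n = len(y)
    v = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in range(n))
    for i in range(n):
        for j in range(n):
            if i != j:
                pij = pi2[(min(i, j), max(i, j))]
                v += (pij - pi[i] * pi[j]) / pij * y[i] * y[j] / (pi[i] * pi[j])
    return v
```

For an SRSWOR sample y = (1, 2, 3) with N = 5, n = 3 (π_i = 0.6, π_ij = 0.3), the sample variance is s_y² = 1, so the estimate is 25 × 0.4/3 = 10/3.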

Theorem 5.1.8. An unbiased estimator of the variance of the Horvitz and
Thompson (1952) estimator of the population total Y in the Sen--Yates--Grundy
(1953) form is given by

v_SYG(Ŷ_HT) = (1/2) Σ_{i∈s} Σ_{j(≠i)∈s} ((π_i π_j − π_ij)/π_ij)(y_i/π_i − y_j/π_j)².   (5.1.45)

Proof. We know that the Sen--Yates--Grundy (1953) form of the variance of the
Horvitz and Thompson (1952) estimator of the population total Y is given by

V(Ŷ_HT)_SYG = (1/2) Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i/π_i − Y_j/π_j)².   (5.1.46)

Consider an estimator of the type

(1/2) Σ_{i∈s} Σ_{j(≠i)∈s} (π_i π_j − π_ij)(y_i/π_i − y_j/π_j)² a_ij,   (5.1.47)

and choose a_ij such that its expected value equals (5.1.46). Taking expectation with
the help of the random variable t_ij we have

E[(1/2) Σ_{i∈s} Σ_{j(≠i)∈s} (π_i π_j − π_ij)(y_i/π_i − y_j/π_j)² a_ij]
  = (1/2) Σ_{i∈Ω} Σ_{j(≠i)∈Ω} (π_i π_j − π_ij)(Y_i/π_i − Y_j/π_j)² a_ij π_ij,   (5.1.48)

which equals (5.1.46) if

a_ij π_ij = 1,                                                        (5.1.49)

that is, if

a_ij = 1/π_ij.                                                        (5.1.50)

Therefore an unbiased estimator of the variance in the Sen--Yates--Grundy (1953)
form is given by (5.1.45). Hence the theorem.

The condition for this estimator of the variance of the Horvitz and Thompson (1952)
estimator of the population total Y to be non-negative is given by

π_i π_j ≥ π_ij for all i ≠ j ∈ s.                                     (5.1.51)
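The Sen-Yates-Grundy estimator (5.1.45) in the same computational style; note that when condition (5.1.51) holds, every term in the double sum is non-negative, so the estimate cannot be negative. Naming is ours.

```python
def syg_variance_estimate(y, pi, pi2):
    """Sen-Yates-Grundy variance estimator, eq. (5.1.45), from a sample;
    non-negative whenever pi_i * pi_j >= pi_ij for all pairs (5.1.51).

    pi2 maps sample-index pairs (i, j), i < j, to pi_ij."""
    n = len(y)
    v = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                pij = pi2[(min(i, j), max(i, j))]
                v += 0.5 * (pi[i] * pi[j] - pij) / pij \
                     * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v
```

On the SRSWOR sample y = (1, 2, 3) with π_i = 0.6, π_ij = 0.3, this agrees with the estimator (5.1.44), giving 10/3, and is non-negative as 0.36 ≥ 0.30.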

Example 5.1.1. Consider a population of N = 5 units, Ω = {A, B, C, D, E}, and suppose we
wish to select a sample of n = 3 units using without replacement sampling out of the
total of 10 possible samples s_t, t = 1, 2, 3, ..., 10, given by
s_1 = {A, B, C}, s_2 = {A, B, D}, s_3 = {A, B, E}, s_4 = {A, C, D}, s_5 = {A, C, E},
s_6 = {A, D, E}, s_7 = {B, C, D}, s_8 = {B, C, E}, s_9 = {B, D, E}, and s_10 = {C, D, E}.

( a ) Find the first order and second order inclusion probabilities provided that all
samples have an equal chance of selection, that is, P(s_t) = 1/10 for all t.
( b ) Compare your results with the SRSWOR sampling scheme.
( c ) If P(s_1) = 0.50, P(s_2) = 0.30, P(s_3) = 0.20, and P(s_t) = 0.00 for t = 4, 5, 6, 7, 8, 9, 10, then
find the new first order and second order inclusion probabilities.
Solution. ( a ) When all samples have an equal chance of selection, that is, P(s_t) = 0.1
for all t, then the first order inclusion probabilities π_i, i = 1, 2, ..., 5, are given by

π_1 = P[First unit A from the population is included in the sample]
    = P(s_1) + P(s_2) + P(s_3) + P(s_4) + P(s_5) + P(s_6)
    = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 = 0.6,
π_2 = P[Second unit B from the population is included in the sample]
    = P(s_1) + P(s_2) + P(s_3) + P(s_7) + P(s_8) + P(s_9) = 0.6,
π_3 = P[Third unit C from the population is included in the sample]
    = P(s_1) + P(s_4) + P(s_5) + P(s_7) + P(s_8) + P(s_10) = 0.6,
π_4 = P[Fourth unit D from the population is included in the sample]
    = P(s_2) + P(s_4) + P(s_6) + P(s_7) + P(s_9) + P(s_10) = 0.6,
and
π_5 = P[Fifth unit E from the population is included in the sample]
    = P(s_3) + P(s_5) + P(s_6) + P(s_8) + P(s_9) + P(s_10) = 0.6.

The second order inclusion probabilities π_ij, i ≠ j = 1, 2, 3, 4, 5 (note that π_ij = π_ji),
are given by

π_12 = P[A and B are both included in the sample] = P(s_1) + P(s_2) + P(s_3) = 0.3,
π_13 = P[A and C are both included in the sample] = P(s_1) + P(s_4) + P(s_5) = 0.3,
π_14 = P[A and D are both included in the sample] = P(s_2) + P(s_4) + P(s_6) = 0.3,
π_15 = P[A and E are both included in the sample] = P(s_3) + P(s_5) + P(s_6) = 0.3,
π_23 = P[B and C are both included in the sample] = P(s_1) + P(s_7) + P(s_8) = 0.3,
π_24 = P[B and D are both included in the sample] = P(s_2) + P(s_7) + P(s_9) = 0.3,
π_25 = P[B and E are both included in the sample] = P(s_3) + P(s_8) + P(s_9) = 0.3,
π_34 = P[C and D are both included in the sample] = P(s_4) + P(s_7) + P(s_10) = 0.3,
π_35 = P[C and E are both included in the sample] = P(s_5) + P(s_8) + P(s_10) = 0.3,
and
π_45 = P[D and E are both included in the sample] = P(s_6) + P(s_9) + P(s_10) = 0.3.

Thus we observe that if all the samples have the same chance of selection then
π_i = 0.6 for all i = 1, 2, ..., 5 and π_ij = 0.3 for all i ≠ j = 1, 2, ..., 5.

( b ) Also note that under SRSWOR sampling

π_i = n/N = 3/5 = 0.6  and  π_ij = n(n − 1)/(N(N − 1)) = 3(3 − 1)/(5(5 − 1)) = 6/20 = 0.3.

Thus if all samples have the same chance of selection then the PPSWOR and SRSWOR
sampling schemes are equivalent.
( c ) In this case we have

π_1 = P[First unit A from the population is included in the sample]
    = P(s_1) + P(s_2) + P(s_3) + P(s_4) + P(s_5) + P(s_6)
    = 0.5 + 0.3 + 0.2 + 0.0 + 0.0 + 0.0 = 1.0,
π_2 = P[Second unit B from the population is included in the sample]
    = P(s_1) + P(s_2) + P(s_3) + P(s_7) + P(s_8) + P(s_9)
    = 0.5 + 0.3 + 0.2 + 0.0 + 0.0 + 0.0 = 1.0,
π_3 = P[Third unit C from the population is included in the sample]
    = P(s_1) + P(s_4) + P(s_5) + P(s_7) + P(s_8) + P(s_10)
    = 0.5 + 0.0 + 0.0 + 0.0 + 0.0 + 0.0 = 0.5,
π_4 = P[Fourth unit D from the population is included in the sample]
    = P(s_2) + P(s_4) + P(s_6) + P(s_7) + P(s_9) + P(s_10)
    = 0.3 + 0.0 + 0.0 + 0.0 + 0.0 + 0.0 = 0.3,
and
π_5 = P[Fifth unit E from the population is included in the sample]
    = P(s_3) + P(s_5) + P(s_6) + P(s_8) + P(s_9) + P(s_10)
    = 0.2 + 0.0 + 0.0 + 0.0 + 0.0 + 0.0 = 0.2.

The second order inclusion probabilities π_ij, i ≠ j = 1, 2, 3, 4, 5 (note that π_ij = π_ji),
are given by

π_12 = P(s_1) + P(s_2) + P(s_3) = 0.5 + 0.3 + 0.2 = 1.0,
π_13 = P(s_1) + P(s_4) + P(s_5) = 0.5 + 0.0 + 0.0 = 0.5,
π_14 = P(s_2) + P(s_4) + P(s_6) = 0.3 + 0.0 + 0.0 = 0.3,
π_15 = P(s_3) + P(s_5) + P(s_6) = 0.2 + 0.0 + 0.0 = 0.2,
π_23 = P(s_1) + P(s_7) + P(s_8) = 0.5 + 0.0 + 0.0 = 0.5,
π_24 = P(s_2) + P(s_7) + P(s_9) = 0.3 + 0.0 + 0.0 = 0.3,
π_25 = P(s_3) + P(s_8) + P(s_9) = 0.2 + 0.0 + 0.0 = 0.2,
π_34 = P(s_4) + P(s_7) + P(s_10) = 0.0 + 0.0 + 0.0 = 0.0,
π_35 = P(s_5) + P(s_8) + P(s_10) = 0.0 + 0.0 + 0.0 = 0.0,
and
π_45 = P(s_6) + P(s_9) + P(s_10) = 0.0 + 0.0 + 0.0 = 0.0.
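Part ( c ) can be reproduced by enumeration. In the sketch below (helper names are our own), the ten samples generated in lexicographic order by `itertools.combinations` coincide with s_1, ..., s_10 of the example.

```python
from itertools import combinations

def first_order(design, unit):
    # pi_i: total probability of the samples containing the unit
    return sum(p for s, p in design.items() if unit in s)

def second_order(design, i, j):
    # pi_ij: total probability of the samples containing both units
    return sum(p for s, p in design.items() if i in s and j in s)

# Samples s_1, ..., s_10 of Example 5.1.1 over {A, B, C, D, E};
# lexicographic order matches the example's listing.
samples = [frozenset(c) for c in combinations("ABCDE", 3)]
probs = [0.5, 0.3, 0.2] + [0.0] * 7   # design of part (c)
design = dict(zip(samples, probs))
```

The resulting probabilities match the example: π_A = 1.0, π_C = 0.5, π_E = 0.2, π_AB = 1.0, and π_CD = 0.0.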

The next example explains how Michael 'selected' Amy.

Example 5.1.2. John and Michael were appointed to select three players (n = 3) out
of five players (N = 5) from the list Ω = {Amy, Bob, Chris, Don, Eric} with their
scores 125, 126, 128, 90, and 127, respectively.

( I ) John considers the following sampling scheme consisting of only four possibilities:

s_1 = {Amy, Bob, Chris}, s_2 = {Bob, Chris, Don},

s_3 = {Chris, Don, Eric}, and s_4 = {Amy, Don, Eric},

such that

P(s_1) = 0.25, P(s_2) = 0.25, P(s_3) = 0.25, and P(s_4) = 0.25.

( a ) Find the first order inclusion probabilities for John's sampling scheme.
( b ) Find the estimates of the total score from each sample using John's sampling.
( c ) Find the bias in John's sampling scheme.
( d ) Find the second order inclusion probabilities for John's sampling scheme.

( e ) Find the variance of John's sampling scheme using the Sen--Yates--Grundy
formula.
( f ) Find the variance of John's sampling plan using the usual formula.
( g ) Find the variance of John's sampling plan using the definition of variance.
( h ) Are the three variances equal for John's sampling scheme?
( II ) Michael likes Amy and cleverly suggests the following changes in John's
sampling scheme: P(s_1) = 0.50, P(s_2) = 0.00, P(s_3) = 0.00, and P(s_4) = 0.50.
( i ) Find the first order inclusion probabilities for Michael's sampling scheme.
( j ) Find the estimates of the total score from each sample using Michael's sampling.
( k ) Find the bias in Michael's sampling scheme.
( l ) Find the second order inclusion probabilities for Michael's sampling scheme.
( m ) Find the variance of Michael's sampling scheme using the Sen--Yates--
Grundy formula.
( n ) Find the variance of Michael's sampling plan using the usual formula.
( o ) Find the variance of Michael's sampling plan using the definition of variance.
( p ) Are the three variances equal for Michael's sampling scheme?
( III ) Discussion of John's and Michael's schemes:
( q ) Find the relative efficiency of Michael's sampling scheme over John's
sampling.

Solution: ( I ) John's Sampling Scheme:

( a ) The first order inclusion probabilities for John's sampling scheme are
π_1 = P[Amy is included in the sample] = P(s_1) + P(s_4) = 0.25 + 0.25 = 0.50,
π_2 = P[Bob is included in the sample] = P(s_1) + P(s_2) = 0.25 + 0.25 = 0.50,
π_3 = P[Chris is included in the sample] = P(s_1) + P(s_2) + P(s_3) = 0.25 + 0.25 + 0.25 = 0.75,
π_4 = P[Don is included in the sample] = P(s_2) + P(s_3) + P(s_4) = 0.25 + 0.25 + 0.25 = 0.75,
and
π_5 = P[Eric is included in the sample] = P(s_3) + P(s_4) = 0.25 + 0.25 = 0.50.

Note that Σ_{i=1}^{5} π_i = 3.

(b) Let Ŷ_HT(t), t = 1, 2, 3, 4, be the estimates of total score based on the first, second, third and fourth sample respectively; then we have

Ŷ_HT(1),John = Σ_{i∈s1} Yi/πi = Amy/π1 + Bob/π2 + Chris/π3 = 125/0.50 + 126/0.50 + 128/0.75 = 672.666,

Ŷ_HT(2),John = Σ_{i∈s2} Yi/πi = Bob/π2 + Chris/π3 + Don/π4 = 126/0.50 + 128/0.75 + 90/0.75 = 542.666,

Ŷ_HT(3),John = Σ_{i∈s3} Yi/πi = Chris/π3 + Don/π4 + Eric/π5 = 128/0.75 + 90/0.75 + 127/0.50 = 544.666,

and

Ŷ_HT(4),John = Σ_{i∈s4} Yi/πi = Amy/π1 + Don/π4 + Eric/π5 = 125/0.50 + 90/0.75 + 127/0.50 = 624.000.

(c) The bias in John's sampling scheme: Note that the true total score of the five players is

Y = Σ_{i=1}^{5} Yi = Amy + Bob + Chris + Don + Eric = 125 + 126 + 128 + 90 + 127 = 596.

Now the bias in John's sampling scheme is given by

B(Ŷ_HT)_John = E(Ŷ_HT) − Y = Σ_{t=1}^{4} {p(st) Ŷ_HT(t),John} − Y
= {0.25 × 672.666 + 0.25 × 542.666 + 0.25 × 544.666 + 0.25 × 624.000} − 596
= 596.000 − 596.000 = 0.000.

Thus John's sampling scheme is unbiased for estimating the total score.

(d) The second order inclusion probabilities for John's sampling scheme are
π12 = P[Amy and Bob are included in the sample] = p(s1) = 0.25,
π13 = P[Amy and Chris are included in the sample] = p(s1) = 0.25,
π14 = P[Amy and Don are included in the sample] = p(s4) = 0.25,
π15 = P[Amy and Eric are included in the sample] = p(s4) = 0.25,
π23 = P[Bob and Chris are included in the sample] = p(s1) + p(s2) = 0.25 + 0.25 = 0.50,
π24 = P[Bob and Don are included in the sample] = p(s2) = 0.25,
π25 = P[Bob and Eric are included in the sample] = 0.00,
π34 = P[Chris and Don are included in the sample] = p(s2) + p(s3) = 0.25 + 0.25 = 0.50,
π35 = P[Chris and Eric are included in the sample] = p(s3) = 0.25,
and
π45 = P[Don and Eric are included in the sample] = p(s3) + p(s4) = 0.25 + 0.25 = 0.50.

(e) The variance of John's sampling plan using the Sen--Yates--Grundy formula is

V_SYG(Ŷ_HT)_John = Σ Σ_{i<j} (πi πj − πij)(Yi/πi − Yj/πj)²

= (0.50×0.50 − 0.25)(125/0.50 − 126/0.50)² + (0.50×0.75 − 0.25)(125/0.50 − 128/0.75)²
+ (0.50×0.75 − 0.25)(125/0.50 − 90/0.75)² + (0.50×0.50 − 0.25)(125/0.50 − 127/0.50)²
+ (0.50×0.75 − 0.50)(126/0.50 − 128/0.75)² + (0.50×0.75 − 0.25)(126/0.50 − 90/0.75)²
+ (0.50×0.50 − 0.00)(126/0.50 − 127/0.50)² + (0.75×0.75 − 0.50)(128/0.75 − 90/0.75)²
+ (0.75×0.50 − 0.25)(128/0.75 − 127/0.50)² + (0.75×0.50 − 0.50)(90/0.75 − 127/0.50)²

= 3035.333.

(f) The variance of John's sampling plan using the usual formula is given by

V(Ŷ_HT)_John = Σ_{i∈Ω} [(1 − πi)/πi] Yi² + Σ Σ_{i≠j∈Ω} [(πij − πi πj)/(πi πj)] Yi Yj

= [(1 − 0.50)/0.50](125)² + [(1 − 0.50)/0.50](126)² + [(1 − 0.75)/0.75](128)² + [(1 − 0.75)/0.75](90)² + [(1 − 0.50)/0.50](127)²

+ 2[ (0.25 − 0.50×0.50)/(0.50×0.50) (125×126) + (0.25 − 0.50×0.75)/(0.50×0.75) (125×128)
+ (0.25 − 0.50×0.75)/(0.50×0.75) (125×90) + (0.25 − 0.50×0.50)/(0.50×0.50) (125×127)
+ (0.50 − 0.50×0.75)/(0.50×0.75) (126×128) + (0.25 − 0.50×0.75)/(0.50×0.75) (126×90)
+ (0.00 − 0.50×0.50)/(0.50×0.50) (126×127) + (0.50 − 0.75×0.75)/(0.75×0.75) (128×90)
+ (0.25 − 0.75×0.50)/(0.75×0.50) (128×127) + (0.50 − 0.75×0.50)/(0.75×0.50) (90×127) ]

= 55791.333 + 2(−26378) = 3035.333.

(g) The variance by definition for John's sampling scheme is given by

V_Def(Ŷ_HT)_John = Σ_{t=1}^{4} p(st){Ŷ_HT(t),John − Y}²

= 0.25(672.666 − 596)² + 0.25(542.666 − 596)² + 0.25(544.666 − 596)² + 0.25(624 − 596)²

= 3035.333.

(h) Clearly all three variances are equal to 3035.333 for John's sampling scheme.
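The arithmetic in parts (a)-(g) can also be checked by brute force. The short Python sketch below (illustrative only; the variable names and layout are ours, not the book's) enumerates John's four samples, recovers the first and second order inclusion probabilities, and reproduces the zero bias and the variance 3035.333 both by definition and by the Sen--Yates--Grundy formula.

```python
from itertools import combinations

# Scores and John's design: four samples of three players each, p(s) = 0.25.
y = {"Amy": 125, "Bob": 126, "Chris": 128, "Don": 90, "Eric": 127}
design = {("Amy", "Bob", "Chris"): 0.25, ("Bob", "Chris", "Don"): 0.25,
          ("Chris", "Don", "Eric"): 0.25, ("Amy", "Don", "Eric"): 0.25}

# First and second order inclusion probabilities from the design.
pi = {u: sum(p for s, p in design.items() if u in s) for u in y}
pij = {frozenset([u, v]): sum(p for s, p in design.items() if u in s and v in s)
       for u, v in combinations(y, 2)}

Y = sum(y.values())                                      # true total = 596
ht = {s: sum(y[u] / pi[u] for u in s) for s in design}   # HT estimate per sample

bias = sum(p * ht[s] for s, p in design.items()) - Y
v_def = sum(p * (ht[s] - Y) ** 2 for s, p in design.items())
v_syg = sum((pi[u] * pi[v] - pij[frozenset([u, v])]) *
            (y[u] / pi[u] - y[v] / pi[v]) ** 2 for u, v in combinations(y, 2))

print(pi)                              # Amy/Bob/Eric 0.50, Chris/Don 0.75
print(bias, v_def, v_syg)              # bias ~ 0; both variances ~ 3035.333
```

Because the same design enters all three formulas, the agreement of v_def and v_syg here is a direct numerical confirmation of part (h).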

(II) Michael's Sampling Scheme:

(i) The first order inclusion probabilities for Michael's sampling scheme are
π1 = P[Amy is included in the sample] = p(s1) + p(s4) = 0.50 + 0.50 = 1.00,
π2 = P[Bob is included in the sample] = p(s1) + p(s2) = 0.50 + 0.00 = 0.50,
π3 = P[Chris is included in the sample] = p(s1) + p(s2) + p(s3) = 0.50 + 0.00 + 0.00 = 0.50,
π4 = P[Don is included in the sample] = p(s2) + p(s3) + p(s4) = 0.00 + 0.00 + 0.50 = 0.50,
and
π5 = P[Eric is included in the sample] = p(s3) + p(s4) = 0.00 + 0.50 = 0.50.

Note that Σ_{i=1}^{5} πi = 3.

(j) Let Ŷ_HT(t),Michael, t = 1, 2, 3, 4, be the estimates of total score based on the first, second, third and fourth sample respectively; then we have

Ŷ_HT(1),Michael = Σ_{i∈s1} Yi/πi = Amy/π1 + Bob/π2 + Chris/π3 = 125/1.00 + 126/0.50 + 128/0.50 = 633.000,

Ŷ_HT(2),Michael = Σ_{i∈s2} Yi/πi = Bob/π2 + Chris/π3 + Don/π4 = 126/0.50 + 128/0.50 + 90/0.50 = 688.000,

Ŷ_HT(3),Michael = Σ_{i∈s3} Yi/πi = Chris/π3 + Don/π4 + Eric/π5 = 128/0.50 + 90/0.50 + 127/0.50 = 690.000,

and

Ŷ_HT(4),Michael = Σ_{i∈s4} Yi/πi = Amy/π1 + Don/π4 + Eric/π5 = 125/1.00 + 90/0.50 + 127/0.50 = 559.000.

(k) The bias in Michael's sampling scheme is given by

B(Ŷ_HT)_Michael = E(Ŷ_HT) − Y = Σ_{t=1}^{4} {p(st) Ŷ_HT(t),Michael} − Y
= {0.50 × 633.000 + 0.00 × 688.000 + 0.00 × 690.000 + 0.50 × 559.000} − 596
= 596.000 − 596 = 0.000.

Thus Michael's sampling scheme is also unbiased for estimating the total score.

(l) The second order inclusion probabilities for Michael's sampling scheme are
π12 = P[Amy and Bob are included in the sample] = p(s1) = 0.50,
π13 = P[Amy and Chris are included in the sample] = p(s1) = 0.50,
π14 = P[Amy and Don are included in the sample] = p(s4) = 0.50,
π15 = P[Amy and Eric are included in the sample] = p(s4) = 0.50,
π23 = P[Bob and Chris are included in the sample] = p(s1) + p(s2) = 0.50 + 0.00 = 0.50,
π24 = P[Bob and Don are included in the sample] = p(s2) = 0.00,
π25 = P[Bob and Eric are included in the sample] = 0.00,
π34 = P[Chris and Don are included in the sample] = p(s2) + p(s3) = 0.00 + 0.00 = 0.00,
π35 = P[Chris and Eric are included in the sample] = p(s3) = 0.00,
and
π45 = P[Don and Eric are included in the sample] = p(s3) + p(s4) = 0.00 + 0.50 = 0.50.

(m) The variance of Michael's sampling plan using the Sen--Yates--Grundy formula is given by

V_SYG(Ŷ_HT)_Michael

= (1.00×0.50 − 0.50)(125/1.00 − 126/0.50)² + (1.00×0.50 − 0.50)(125/1.00 − 128/0.50)²
+ (1.00×0.50 − 0.50)(125/1.00 − 90/0.50)² + (1.00×0.50 − 0.50)(125/1.00 − 127/0.50)²
+ (0.50×0.50 − 0.50)(126/0.50 − 128/0.50)² + (0.50×0.50 − 0.00)(126/0.50 − 90/0.50)²
+ (0.50×0.50 − 0.00)(126/0.50 − 127/0.50)² + (0.50×0.50 − 0.00)(128/0.50 − 90/0.50)²
+ (0.50×0.50 − 0.00)(128/0.50 − 127/0.50)² + (0.50×0.50 − 0.50)(90/0.50 − 127/0.50)²

= 1369.000.

(n) The variance of Michael's sampling plan using the usual formula is given by

V(Ŷ_HT)_Michael

= [(1 − 1.00)/1.00](125)² + [(1 − 0.50)/0.50](126)² + [(1 − 0.50)/0.50](128)² + [(1 − 0.50)/0.50](90)² + [(1 − 0.50)/0.50](127)²

+ 2[ (0.50 − 1.00×0.50)/(1.00×0.50) (125×126) + (0.50 − 1.00×0.50)/(1.00×0.50) (125×128)
+ (0.50 − 1.00×0.50)/(1.00×0.50) (125×90) + (0.50 − 1.00×0.50)/(1.00×0.50) (125×127)
+ (0.50 − 0.50×0.50)/(0.50×0.50) (126×128) + (0.00 − 0.50×0.50)/(0.50×0.50) (126×90)
+ (0.00 − 0.50×0.50)/(0.50×0.50) (126×127) + (0.00 − 0.50×0.50)/(0.50×0.50) (128×90)
+ (0.00 − 0.50×0.50)/(0.50×0.50) (128×127) + (0.50 − 0.50×0.50)/(0.50×0.50) (90×127) ]

= 56489 + 2(−27560) = 1369.000.

(o) The variance by definition for Michael's sampling scheme is given by

V_Def(Ŷ_HT)_Michael = Σ_{t=1}^{4} p(st){Ŷ_HT(t),Michael − Y}²

= 0.50(633 − 596)² + 0.00(688 − 596)² + 0.00(690 − 596)² + 0.50(559 − 596)²

= 1369.000.

(p) Again, all three variances are equal to 1369.000 for Michael's sampling scheme.
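As with John's scheme, Michael's figures can be checked by enumeration. The sketch below (again ours, purely illustrative) re-derives the variance both by definition and by the Sen--Yates--Grundy formula, and the resulting ratio to John's variance anticipates the relative efficiency computed next.

```python
from itertools import combinations

# Michael's design: only two of the four samples receive positive probability.
y = {"Amy": 125, "Bob": 126, "Chris": 128, "Don": 90, "Eric": 127}
michael = {("Amy", "Bob", "Chris"): 0.50, ("Bob", "Chris", "Don"): 0.00,
           ("Chris", "Don", "Eric"): 0.00, ("Amy", "Don", "Eric"): 0.50}

pi = {u: sum(p for s, p in michael.items() if u in s) for u in y}
pij = {frozenset(c): sum(p for s, p in michael.items() if set(c) <= set(s))
       for c in combinations(y, 2)}

Y = sum(y.values())
v_def = sum(p * (sum(y[u] / pi[u] for u in s) - Y) ** 2
            for s, p in michael.items())
v_syg = sum((pi[u] * pi[v] - pij[frozenset([u, v])]) *
            (y[u] / pi[u] - y[v] / pi[v]) ** 2 for u, v in combinations(y, 2))

print(v_def, v_syg)                      # both 1369.0
print(round(100 * 3035.333 / v_def, 2))  # relative efficiency, about 221.72
```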

(III) Comparison among the schemes:

(q) The percent relative efficiency of Michael's sampling scheme over John's sampling scheme is given by

RE = [V(Ŷ_HT)_John / V(Ŷ_HT)_Michael] × 100 = (3035.333/1369.000) × 100 = 221.71%.

(r) Comment: When we read this example we thought John's scheme would be more efficient, and it looks as if Michael is doing an unnecessary favor to Amy; but from the bias and variance point of view we were surprised to find that Michael's scheme is more efficient than John's sampling scheme. There may be some other reason behind it, but two reasons come to our minds:

(i) Why did John not consider all possible samples?

If John did not consider all possible samples we cannot do anything. If he had selected all possible samples, then we do not know how much probability he would have assigned to the different samples, and how would Michael have reacted? We may get an answer to this question by doing an unsolved practical at the back of this chapter, so let us think more here!

(ii) What is the correlation between the inclusion probabilities and the study variable?

This seems to be a good point. We observed that the value of the correlation coefficient between John's first order inclusion probabilities and the study variable is negative, that is, ρ_πy(John) = −0.569, while that for Michael's sampling scheme is positive, that is, ρ_πy(Michael) = +0.198.

Caution: While using PPSWOR sampling, we should make sure that the inclusion probabilities have a positive correlation with the study variable. Note that the estimation of variance from each sample under John's sampling scheme is possible, but remains biased. Further note that estimation of variance from each sample under Michael's sampling scheme is not possible, because certain useful second order inclusion probabilities are zero.

Moral: Note that three of the candidates (Bob, Chris and Eric) have higher scores than Amy, and John did not consider them together. Michael took the benefit because John was trying to break the merit. Message for future generations: "Do not attempt to break a merit, and be honest if you get a chance to be an administrator, as otherwise someone, like Michael, may take benefit of your limitation."

The next example has been taken from Ghosh (1998), which compares two estimators of the variance of the Horvitz and Thompson (1952) estimator.

Example 5.1.3. A computer salesman wishes to estimate the number of left handed students in a town having eight elementary schools. The salesman must make a decision to ensure that enough left handed mice are ordered to accommodate his expected sales. He used prior information on the number of registered students to select a sample of n = 3 units by using PPSWOR sampling. The information collected by him is given in the following table:

School i    yi    πi      πij
1           10    0.45    π12 = 0.20, π13 = 0.20
2            4    0.40    π12 = 0.20, π23 = 0.20
3            2    0.50    π13 = 0.20, π23 = 0.20

Discuss two 75% confidence intervals based on two estimators of the variance of the Horvitz and Thompson estimator.

Solution: An estimate of the total number of left handed students in all schools of the town is given by

Ŷ_HT = Σ_{i∈s} yi/πi = y1/π1 + y2/π2 + y3/π3 = 10/0.45 + 4/0.40 + 2/0.50 = 36.22 ≈ 36.

Now the variance of the estimator Ŷ_HT can be estimated in two different ways:

Case I. The usual estimator of the variance of Ŷ_HT is given by

v̂(Ŷ_HT) = Σ_{i∈s} [(1 − πi)/πi] yi²/πi + Σ Σ_{i≠j∈s} [(πij − πi πj)/(πi πj)] yi yj/πij

= [(1 − 0.45)/0.45²] × 10² + [(1 − 0.40)/0.40²] × 4² + [(1 − 0.50)/0.50²] × 2²
+ 2 [(0.20 − 0.45×0.40)/(0.45×0.40)] (10×4)/0.20 + 2 [(0.20 − 0.45×0.50)/(0.45×0.50)] (10×2)/0.20
+ 2 [(0.20 − 0.40×0.50)/(0.40×0.50)] (4×2)/0.20

= 271.6049 + 60 + 8 + 44.4444 − 22.2222 + 0 = 361.82.
Thus a (1 − α)100% confidence interval for the number of left handed students will be given by

Ŷ_HT ± t_{α/2}(df = n − 1) √v̂(Ŷ_HT).

Using Table 2 from the Appendix, the 75% confidence interval estimate of the number of left handed students in the town is given by

Ŷ_HT ± t_{0.25/2}(df = 3 − 1) √v̂(Ŷ_HT)

or

36.22 ± 1.604 √361.82, or [5.709, 66.73], or [6, 67].

Case II. The Sen--Yates--Grundy estimator of the variance of Ŷ_HT is given by

v̂_SYG(Ŷ_HT) = (1/2) Σ Σ_{i≠j∈s} [(πi πj − πij)/πij](yi/πi − yj/πj)² = Σ Σ_{i<j∈s} [(πi πj − πij)/πij](yi/πi − yj/πj)²

= [(0.45×0.4 − 0.2)/0.2](10/0.45 − 4/0.4)² + [(0.45×0.5 − 0.2)/0.2](10/0.45 − 2/0.5)² + [(0.4×0.5 − 0.2)/0.2](4/0.4 − 2/0.5)²

= −14.938 + 41.506 + 0 = 26.568.

A (1 − α)100% confidence interval for the number of left handed students will be

Ŷ_HT ± t_{α/2}(df = n − 1) √v̂_SYG(Ŷ_HT).

Using Table 2 from the Appendix, the 75% confidence interval estimate of the number of left handed students in the town is given by

Ŷ_HT ± t_{0.25/2}(df = 3 − 1) √v̂_SYG(Ŷ_HT), or 36.22 ± 1.604 √26.568, or [27.95, 44.488], or [28, 44].
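Both interval computations can be replicated directly from the sample data. The sketch below is illustrative only; the t value 1.604 is the one read from the book's Table 2 for a 75% interval with 2 degrees of freedom.

```python
from itertools import combinations
from math import sqrt

y = {1: 10, 2: 4, 3: 2}
pi = {1: 0.45, 2: 0.40, 3: 0.50}
pij = {(1, 2): 0.20, (1, 3): 0.20, (2, 3): 0.20}

y_ht = sum(y[i] / pi[i] for i in y)   # 36.22

# Usual (Horvitz--Thompson) estimator of variance.
v_usual = (sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in y)
           + 2 * sum((pij[(i, j)] - pi[i] * pi[j]) / (pi[i] * pi[j])
                     * y[i] * y[j] / pij[(i, j)]
                     for i, j in combinations(y, 2)))

# Sen--Yates--Grundy estimator of variance.
v_syg = sum((pi[i] * pi[j] - pij[(i, j)]) / pij[(i, j)]
            * (y[i] / pi[i] - y[j] / pi[j]) ** 2
            for i, j in combinations(y, 2))

t = 1.604  # 75% two-sided t value with 2 df, from the book's Table 2
for v in (v_usual, v_syg):
    print(round(y_ht - t * sqrt(v), 2), round(y_ht + t * sqrt(v), 2))
# prints the intervals [5.71, 66.73] and [27.95, 44.49]
```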

Following Ghosh (1998), it seems difficult to conclude which estimator of variance is better. In this example the Sen--Yates--Grundy estimator performs better because it provides a smaller interval estimate at the same level of confidence. Ghosh (1998) found that it is difficult to decide how to select an estimator of variance in actual practice, because Rao and Singh (1973) have shown that the usual estimator of variance is hyper-admissible. Ghosh (1998) ends his paper with a criticism similar to Basu (1971). Basu's famous circus elephant example is available in recent books written by Thompson (1997) and Brewer (2002). Basu considers the problem of estimation of the weight of 50 elephants in a circus. Instead of weighing all the elephants, the owner of the circus weighs only one elephant named Sambo, using three year old information about the size of Sambo as a middle sized elephant in the circus. If w is the weight of Sambo then an estimate of the total weight of the 50 elephants in
the circus is Ŷ = 50w. Later a circus statistician suggests that the owner should apply random sampling in place of purposive sampling. Both the owner and the statistician decide to use a random sampling device which gives a 99/100 chance of selection to Sambo and a 1/4900 chance of selection to each of the remaining 49 elephants in the circus. Thus the first order inclusion probabilities are given by

πi = 99/100 if Sambo is included in the sample, and 1/4900 otherwise.

Clearly Σ_{i=1}^{50} πi = 1, indicating selection of a sample of only one unit. Owing to the high probability of selection, Sambo is selected in the sample and the circus statistician reports an estimate of the total weight of the 50 elephants as

Ŷ_HT = Σ_{i∈s} Yi/πi = (100/99) w

which is approximately the weight of Sambo. Then the owner asks the statistician, "If a large elephant named Jumbo had been selected, what would have been the estimate of total weight?"

Let W be the weight of Jumbo. Then the Horvitz and Thompson estimate of the total weight of all the elephants is given by

Ŷ_HT = Σ_{i∈s} Yi/πi = 4900 W

which clearly is an overestimate of the true weight of the 50 elephants. The main mistake made by the circus statistician was to give a small selection probability to Jumbo and a large selection probability to Sambo. This ignores the fact that PPS sampling works only if the correlation between the selection probabilities and the study variable is positive and high. In fact the selection method chosen by the circus statistician shows that he might be a circus clown instead of a statistician, or he may not have understood the meaning of Horvitz and Thompson (1952). Thus the aim of Basu (1971) is to show that if some wrong prior (or Bayes) information is used in the estimation process the results may be badly biased, which gives us a caution while using Bayes estimates! Let us make the circus statistician's problem clear with the help of the following examples.

Example 5.1.4. Suppose there are 10 elephants in a circus and their weights and diets are given in the following table:

Elephant No.:     1     2     3     4     5     6     7     8     9    10
Elephant name: Jumbo Jumbo Sambo Sambo  Niko  Niko  Niko  Niko  Niko  Niko
Weight (kg):   5000  5000  1000  1000   500   500   500   500   500   500
Diet (kg):      300   250    75    75    50    50    50    50    50    50
Based on a sample of one elephant, estimate the total weight of the 10 elephants using the Horvitz and Thompson estimator.

Solution. In this case we are considering a sample of n = 1 unit. Then we can take the inclusion probabilities to be the same as the selection probabilities, because Σ_{i∈Ω} πi = Σ_{i=1}^{N} pi = n = 1. Assuming that the diet of each elephant is known, the selection probabilities are proportional to diet: p1 = 300/1000 = 0.300, p2 = 250/1000 = 0.250, p3 = p4 = 75/1000 = 0.075, and p5 = ... = p10 = 50/1000 = 0.050.

Now if the first elephant, Jumbo, is selected in the sample, the Horvitz and Thompson estimate of the total weight of the 10 elephants is given by

Ŷ_HT(1) = 5000/0.30 = 16666.67 kg.

Similarly, if the second, third or fifth elephant (that is, either Jumbo, Sambo, or Niko) is included in the sample, then the respective estimates of the weight of all the elephants are given by

Ŷ_HT(2) = 5000/0.25 = 20000 kg, Ŷ_HT(3) = 1000/0.075 = 13333.33 kg, and Ŷ_HT(5) = 500/0.050 = 10000 kg.

The true weight of the 10 elephants is Y = 15000 kg. Thus one can easily observe that in Basu's (1971) example the Horvitz and Thompson (1952) estimator was correct; rather, it was misused by the circus statistician by assigning incorrect inclusion probabilities. The circus statistician's problem is illustrated in the next example.
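The whole set of possible estimates in Example 5.1.4 can be generated at once. The sketch below (ours) also verifies that the design expectation of the HT estimator equals the true total of 15000 kg, which is exactly the sense in which the estimator, as opposed to the statistician's probabilities, is sound.

```python
weights = [5000, 5000, 1000, 1000] + [500] * 6
diets = [300, 250, 75, 75] + [50] * 6

# Selection (and, for n = 1, inclusion) probabilities proportional to diet.
p = [d / sum(diets) for d in diets]
estimates = [w / pi for w, pi in zip(weights, p)]

print(round(estimates[0], 2), round(estimates[2], 2))        # 16666.67 13333.33
print(round(sum(pi * e for pi, e in zip(p, estimates)), 2))  # 15000.0, true total
```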
Example 5.1.5. Suppose there are 10 elephants in a circus and their weights and diets are given in the following table:

Elephant No.:     1     2     3     4     5     6     7     8     9    10
Elephant name: Jumbo Jumbo Sambo Sambo  Niko  Niko  Niko  Niko  Niko  Niko
Weight (kg):   5000  5000  1000  1000   500   500   500   500   500   500
Diet (kg):        1     1    81     1     1     1     1     1     1     1

Based on a sample of one elephant, estimate the total weight of the 10 elephants using the Horvitz and Thompson estimator.

Solution. Again, if we consider a sample of n = 1 unit then the inclusion probabilities are the same as the selection probabilities, because Σ_{i∈Ω} πi = Σ_{i=1}^{N} pi = n = 1. Assuming that the diet of each elephant is known, the selection probabilities are p1 = p2 = 1/90, p3 = 81/90, and p4 = ... = p10 = 1/90.

Now if the first elephant, Jumbo, is selected in the sample, the Horvitz and Thompson estimate of the total weight of the 10 elephants is given by

Ŷ_HT(1) = 5000/(1/90) = 450000 kg.
The true weight of the 10 elephants is only Y = 15000 kg. Similarly to Basu (1971), the total weight of the elephants is highly overestimated. The circus statistician never considered that Jumbo's diet was heavier than Sambo's diet. Thompson (1997) also accepted that the circus statistician lost his job and perhaps became a teacher of statistics, but he cannot be a teacher. Brewer (2002) attempted to weigh elephants under SRSWOR sampling by taking a sample of 5 elephants, but Basu's (1971) case is very serious. No doubt a sample of 5 units may represent a heterogeneous population of 50 units, and it may be possible to weigh an elephant on a spring balance, but how can a sample of one unit represent a heterogeneous population? There is a very basic assumption in sampling that the sample has to be random and representative of the population. If the circus statistician did not know about these two requirements of sampling theory, then it would have been better if the circus statistician had become a clown rather than a teacher. Caution! The use of wrong Bayes information is more dangerous than using no information at all. A layman can understand this in simpler language as follows: consider a policeman following a thief going north, and the thief is indeed going north; but on the way, while following the thief, the policeman receives a phone call from the police station saying that the thief went south, and the policeman changes his direction towards the south. Owing to the inaccurate information from the police station, the policeman is now going away from the thief and will never catch him. Basu (1971) alerts survey statisticians to be careful while using Bayes estimates. More details about Basu's contribution to the foundations of survey sampling theory can be found in Meeden (1992) and in a decent monograph by Ghosh and Meeden (1997).

As we discussed in the previous chapters, ratio and regression type estimators under simple random sampling have been studied by a number of researchers, including Cochran (1963), Srivastava (1967), Reddy (1974), Gupta (1978), Vos (1980), Srivenkataramana and Tracy (1980, 1981), and Singh and Singh (1993a). Most of them are special cases of the class of estimators proposed by Srivastava (1971), in which the efficiency of the optimum estimators is the same as that of the linear regression estimator. There are a number of estimators, viz. the Srivenkataramana and Tracy (1980) and Ray and Singh (1981) estimators, which do not belong to the Srivastava (1971) class; these are more efficient than the optimum estimators in the Srivastava (1971) class. Das and Tripathi (1980) have used the coefficient of variation of the auxiliary variable to form a class of estimators which are more efficient than the linear regression estimator. Srivastava and Jhajj (1981) have proposed a general class of estimators in which, along with the ratio of the sample mean to the population mean of the auxiliary variable, the ratio of the sample variance to the population
variance of the auxiliary variable has also been used, and the optimum estimator of the proposed class was shown to be better than the linear regression estimator. For a general sampling design, an unbiased estimator of S²x cannot be easily derived, but an unbiased estimator of Σ_{i=1}^{N} Xi^r can be easily developed for any r ≥ 1. Following Cassel, Sarndal, and Wretman (1977), under any sampling design the population total can be estimated unbiasedly if and only if the first order inclusion probabilities πi are positive for all the units in the population. For any such sampling design, obviously

Ŷ = Σ_{i∈s} Yi/πi, X̂1 = Σ_{i∈s} Xi/πi and X̂2 = Σ_{i∈s} Xi²/πi

are unbiased estimators of

Y = Σ_{i=1}^{N} Yi, X1 = Σ_{i=1}^{N} Xi and X2 = Σ_{i=1}^{N} Xi², respectively.
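The design unbiasedness of these Horvitz--Thompson totals can be checked by enumeration on a toy design. The population and values below are invented purely for illustration; under an equal-probability design over all samples of size 2 from 4 units, every πi equals 1/2 and each HT estimator averages exactly to the corresponding population total.

```python
from itertools import combinations

# Toy population: study variable Y and auxiliary variable X (invented values).
Y = [3.0, 5.0, 8.0, 4.0]
X = [1.0, 2.0, 4.0, 3.0]
N = 4
samples = list(combinations(range(N), 2))   # all C(4,2) = 6 samples
p = 1 / len(samples)                        # equal-probability design
pi = [sum(p for s in samples if i in s) for i in range(N)]   # each = n/N = 0.5

def ht(values, s):
    """Horvitz--Thompson estimator of the total of `values` from sample s."""
    return sum(values[i] / pi[i] for i in s)

X2 = [x * x for x in X]
results = [(sum(p * ht(vals, s) for s in samples), sum(vals))
           for vals in (Y, X, X2)]
for expectation, total in results:
    print(round(expectation, 6), total)   # expectation matches each total
```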
Sampath and Chandra (1990) assume that, whatever sample is chosen, (u1, u2) assumes values only in a closed convex subset D of the two-dimensional real space containing the point (1, 1), where u1 = X̂1/X1 and u2 = X̂2/X2. They considered the general class of estimators of the population total Y given by

Ŷg = Ŷ g(u1, u2)      (5.2.1)

where g(u1, u2) is a function satisfying the following conditions:

(a) g(1, 1) = 1;
(b) g(u1, u2) is continuous and bounded in D; and
(c) the first and second order partial derivatives of the function g exist and are continuous and bounded.

Then we have the following theorem:

Theorem 5.2.1. (a) The lower bound of the asymptotic mean squared error (AMSE) of the general class of estimators Ŷg defined in (5.2.1) is given by

min MSE(Ŷg) = V0 − B(V2 C01² + V1 C02² − 2 C01 C02 C12)      (5.2.2)

where B = (V1 V2 − C12²)⁻¹, V0 = V(Ŷ), V1 = V(X̂1), V2 = V(X̂2), C01 = Cov(Ŷ, X̂1), C02 = Cov(Ŷ, X̂2) and C12 = Cov(X̂1, X̂2).

(b) Also show that MSE(Ŷg) ≤ MSE(Ŷlr), where MSE(Ŷlr) = V0{1 − C01²/(V0 V1)} denotes the mean squared error of the general linear regression type estimator studied by Sarndal (1980a) and given by Ŷlr = Ŷ + β̂(X1 − X̂1).
Proof. (a) Following the previous chapter, expanding the function g(u1, u2) around the point (1, 1) in a second order Taylor's series, the general class of estimators can be written approximately as

Ŷg ≈ Ŷ + Y{g1(1,1)(u1 − 1) + g2(1,1)(u2 − 1)}

where g1(1,1) and g2(1,1) denote the first order partial derivatives of the function g with respect to u1 and u2, respectively. Then by the definition of MSE we have

MSE(Ŷg) = V0 + Y²[(V1/X1²) g1²(1,1) + (V2/X2²) g2²(1,1) + 2 C01/(Y X1) g1(1,1)
+ 2 C02/(Y X2) g2(1,1) + 2 C12/(X1 X2) g1(1,1) g2(1,1)].      (5.2.3)

On differentiating (5.2.3) with respect to g1(1,1) and g2(1,1) and equating to zero, we obtain their optimum values as

g1(1,1) = (X1/Y) B(V2 C01 − C12 C02), and g2(1,1) = (X2/Y) B(V1 C02 − C12 C01).      (5.2.4)

On putting (5.2.4) into (5.2.3) we have the minimum mean squared error given by (5.2.2).

(b) Now if

MSE(Ŷg) − MSE(Ŷlr) = V0 − B[V2 C01² + V1 C02² − 2 C01 C02 C12] − V0[1 − C01²/(V0 V1)] ≤ 0

then we have

C01²/(V0 V1) ≤ (V1 C02² + V2 C01² − 2 C01 C02 C12)/{V0 (V1 V2 − C12²)},

or, in terms of correlation coefficients,

ρ01² ≤ (ρ02² + ρ01² − 2 ρ01 ρ02 ρ12)/(1 − ρ12²), or (ρ02 − ρ01 ρ12)² ≥ 0

which is always true. Hence the theorem.
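The final inequality can also be spot-checked numerically: for any valid covariance structure of (Ŷ, X̂1, X̂2), the optimum of the class cannot do worse than the regression estimator. The sketch below (ours, with an arbitrary simulated covariance matrix) compares the two mean squared errors of Theorem 5.2.1(b).

```python
import random

random.seed(1)

# Random positive definite covariance matrix C = A A' for (Yhat, X1hat, X2hat),
# then compare the two mean squared errors appearing in Theorem 5.2.1(b).
A = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]
C = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(3)]
     for i in range(3)]

V0, V1, V2 = C[0][0], C[1][1], C[2][2]
C01, C02, C12 = C[0][1], C[0][2], C[1][2]

B = 1 / (V1 * V2 - C12 ** 2)
mse_g = V0 - B * (V2 * C01 ** 2 + V1 * C02 ** 2 - 2 * C01 * C02 * C12)
mse_lr = V0 * (1 - C01 ** 2 / (V0 * V1))

print(mse_g <= mse_lr + 1e-12)   # True for any valid covariance matrix
```

Re-running with any other seed leaves the conclusion unchanged, as the theorem guarantees.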

5.3 MODEL-BASED ESTIMATION STRATEGIES

Before defining the model we should define a few notations which will be helpful in understanding model based estimation strategies. Suppose we wish to estimate the population mean, Ȳ = N⁻¹ Σ_{i=1}^{N} Yi, based on a sample s of n observations drawn with probability p(s) from a population of N units. The function p(s), defined for all samples s, is called the sampling design. We shall consider the problem of estimation of the population mean using a fixed effective sample size, i.e., all the units in the sample are distinct. Let πi and πij be the probabilities of including the i-th unit, and both the i-th and j-th population units, in the sample; these are called the first and second order inclusion probabilities. These inclusion probabilities can also be defined as

πi = Σ_{s∋i} p(s)      (5.3.1)

and

πij = Σ_{s∋i,j} p(s).      (5.3.2)

Definition 5.3.1. An estimator θ̂s is said to be model unbiased for the population parameter θ if

Em(θ̂s) = θ      (5.3.3)

where Em denotes the expected value over the model M.

Definition 5.3.2. An estimator θ̂s is said to be design unbiased for the population parameter θ if

Ep(θ̂s) = Σ_s p(s) θ̂s = θ      (5.3.4)

where Ep denotes the expected value with respect to the design p(s).

Definition 5.3.3. Under the design based approach the mean squared error of the estimator θ̂s is defined as

MSE(θ̂s) = Ep[θ̂s − θ]².      (5.3.5)

Under the model M we have two situations:

(a) The finite population mean squared error averaged over the superpopulation model, due to Cochran (1963), is defined as

MSE(θ̂s) = Em Ep[θ̂s − θ]².      (5.3.6)

(b) The mean squared error of the estimator and finite population parameter over the superpopulation model M, due to Royall (1971), is defined as

MSE(θ̂s) = Em[θ̂s − θ]².      (5.3.7)

The above definitions lead to several possibilities depending upon the interest of the investigators or survey statisticians.

For example:
(a) Rao (1979) minimized (5.3.6) subject to (5.3.4) and found that it leads to the conventional sampling strategies involving randomization;
(b) Rao (1979) also noted that minimization of (5.3.7) subject to (5.3.3) provides purposive selection strategies;
(c) minimization of (5.3.6) or (5.3.7) is also possible subject to some cost function.

Definition 5.3.4. A cost function is any linear or non-linear function of the costs of selecting a unit in the sample and of the number of units in the sample.

Definition 5.3.5. A general superpopulation model can be defined as

M : Yi = Σ_{k=0}^{m} βk Xi^k + Ei.      (5.3.8)

The regressors Xi, i = 1, 2, ..., N, are non-stochastic auxiliary variables and Ei is a random error such that

Em(Ei | Xi) = 0, Em(Ei² | Xi) = σ² f(Xi), and Em(Ei Ej | Xi, Xj) = 0 for i ≠ j,

where f(Xi) is a function of Xi.

5.3.1 A BRIEF HISTORY OF THE SUPERPOPULATION MODEL

Various strategies have been proposed for estimating a finite population mean or population total under a superpopulation model M that relates the variable of interest to one or more auxiliary variables. Brewer (1963a) and Royall (1970a, 1970b, 1970c, 1971, 1976) have adapted linear model prediction theory to the finite population situation and have derived the best linear M-unbiased (BLU) predictor. Cassel, Sarndal, and Wretman (1976, 1977) and Sarndal (1980b) have proposed a generalized regression predictor that is asymptotically design unbiased (ADU). Brewer (1979) suggested a predictor that blends aspects of the BLU and generalized regression predictors and retains the ADU property by using a single auxiliary variable. Isaki and Fuller (1982) proposed some ADU predictors involving several auxiliary variables or characters. Wright (1983) examined strategies that are approximately design unbiased and nearly optimal, assuming a large sample survey and a regression superpopulation model, and suggested a new class of predictors to link certain features of optimal design-unbiased and model-unbiased predictors.

Thus we have the following theorem:

Theorem 5.3.1.1. Under the superpopulation model

Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + Ei      (5.3.1.1)

with the assumptions

E(Ei) = 0, E(Ei²) = σ² v(xi) and E(Ei Ej) = 0 for all i ≠ j,      (5.3.1.2)

a predictor of the population mean proposed by Wright (1983) is given by

ȳp = x̄' β̂ + N⁻¹ Σ_{i=1}^{n} Êi      (5.3.1.3)

where β̂ = (X'X)⁻¹ X'Y, with

X the n × (k+1) matrix whose i-th row is (1, X1i, X2i, ..., Xki), Y = (Y1, Y2, Y3, ..., Yn)', β = (β0, β1, β2, ..., βk)', and Ê = Y − X β̂.

Royall (1971) showed that the choice of design p(s) which minimizes EmEp[ȳs − Ȳ]² leads to a purposive design. If f(xi) is a non-decreasing function of xi and f(xi)/xi² is non-increasing, then the n units with the largest x values provide the optimal sample selection. Brewer (1963a) noted that, under the purposive design which minimizes EmEp(ȳs − Ȳ)², the ratio method of estimation is optimal when f(Xi) = a Xi^g for a = 1 and 0 ≤ g ≤ 1. Cassel and Sarndal (1974) pointed out that the study of Brewer (1963a) holds well even for continuous distributions of the auxiliary characters. Royall's (1970a, 1970b, 1970c) results and his related work have been the subject of much controversy among statisticians. Criticisms were noted by Royall (1971), Cox (1971), and Wynn (1977a, 1977b). Neyman (1971) was somewhat stronger in his criticism, saying that it would be dangerous to draw a sample based on an unverified model: the optimal result depends on the model assumed, and in other words different models could lead to different kinds of results, such as bias, etc. Related criticism and interesting results can also be seen in Royall and Herson (1973a, 1973b). We will discuss the robust estimation procedure of Scott, Brewer, and Ho (1978), which is in fact an extension, to the case of a more general regression estimator, of the work of Royall and Herson (1973a, 1973b) ensuring the robustness of the standard ratio estimator against polynomial superpopulation models by choosing balanced samples.

5.3.2 SCOTT, BREWER, AND HO'S ROBUST ESTIMATION STRATEGY

Scott, Brewer, and Ho (1978) have shown that the requirement for robustness is a relationship between the moments of the sample units and those of the remainder of the population, and that it can be achieved approximately by an unequal probability sampling scheme. Let us first introduce the Royall and Herson (1973a) notation for the purpose of clarity. Royall and Herson (1973a) found that efficiency and robustness can be combined by choosing an optimal estimator of the population total under a superpopulation model and a selection procedure such that the resultant estimator is a BLU estimator under a more general family of polynomial models. They used the notation ξ[δ0, δ1, ..., δp : v(x)] to represent the superpopulation model given by

Yi = Σ_{j=0}^{p} δj βj xi^j + ei, i = 1, 2, ..., N,      (5.3.2.1)
where
δj = 1 if the term βj x^j appears in the model, and 0 otherwise,

E(ei) = 0, and E(ei ej) = σ² v(xi) if i = j, and 0 otherwise,

with v(x) assumed to be known. Royall and Herson (1973a) considered the model ξ[0, 1 : x], under which the ratio estimator is the best linear unbiased (BLU) estimator. They proved that the ratio estimator is optimal under any p-th degree polynomial regression model with variance function v(x) = Σ_{j=0}^{p} δj aj x^j, provided that the sample is balanced up to degree p, that is, x̄s(j) = x̄(j) for j = 1, 2, ..., p, where

x̄s(j) = n⁻¹ Σ_{i∈s} xi^j and x̄(j) = N⁻¹ Σ_{i∈Ω} xi^j.

Superpopulation models of the form ξ[0, 1 : x²] have been used by several researchers, including Smith (1938), Jessen (1942), Raj (1958), Rao and Bayless (1969), and Bayless and Rao (1970). It is interesting to note that a more general form of the superpopulation model, ξ[0, 1 : v(x)], has been considered by Scott, Brewer, and Ho (1978). Following them, regardless of the way the sample observations have been obtained, the BLU estimator of the population total Y is given by

Ŷ0 = Ŷ[0, 1 : v(x)] = Σ_{i∈s} Yi + [Σ_{i∈s} v⁻¹(xi) Yi xi / Σ_{i∈s} v⁻¹(xi) xi²] (Σ_{i∈Ω} xi − Σ_{i∈s} xi).      (5.3.2.2)

Thus we have the following theorems.

Theorem 5.3.2.1. Let $s^*(p)$ be a particular sample for which

$$ \frac{\sum_{i \in (\Omega - s)} x_i^j}{\sum_{i \in (\Omega - s)} x_i} = \frac{\sum_{i \in s} v^{-1}(x_i)\, x_i^{j+1}}{\sum_{i \in s} v^{-1}(x_i)\, x_i^2}, \quad \text{for } j = 1,2,...,p. $$

Then the estimator $\hat{Y}_0$ is $\xi$ unbiased (Royall, 1970a, 1970b, 1970c) under the
model $\xi(\delta_0, \delta_1, ..., \delta_p : v^*(x))$.
Proof. We have

$$ E_\xi(\hat{Y}_0 - Y) = \frac{\sum_{i \in s}\left(\sum_{j=0}^{p}\delta_j \beta_j x_i^{j+1}\right)\big/ v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)}\sum_{i \in \Omega - s} x_i - \sum_{i \in \Omega - s}\left(\sum_{j=0}^{p}\delta_j \beta_j x_i^j\right) $$

$$ = \sum_{j=0}^{p}\delta_j \beta_j\left[\frac{\sum_{i \in s} x_i^{j+1}/v(x_i)}{\sum_{i \in s} x_i^2 / v(x_i)}\sum_{i \in \Omega - s} x_i - \sum_{i \in \Omega - s} x_i^j\right]. $$
380 Advanced sampling theory with applications


Each bracketed term vanishes under the condition of balanced sampling, since then

$$ \frac{\sum_{i \in s} x_i^{j+1}/v(x_i)}{\sum_{i \in s} x_i^2/v(x_i)} = \frac{\sum_{i \in \Omega - s} x_i^j}{\sum_{i \in \Omega - s} x_i}. $$

On simplifying and using this condition of balanced sampling we have
$E_\xi(\hat{Y}_0 - Y) = 0$ if $s = s^*(p)$.
Hence the theorem.

Theorem 5.3.2.2. If $s = s^*(p)$, then $\hat{Y}_0$ is the BLU estimator of the population total
under the model $\xi(\delta_0, ..., \delta_p : v^*(x))$ for any variance function of the form

$$ v^*(x) = v(x)\sum_{j=0}^{p}\delta_j a_j x^{j-1}. \qquad (5.3.2.3) $$

Proof. Under the model $\xi(0,...,0,\delta_j = 1 : v(x)x^{j-1})$, the BLU estimator of the
population total $Y$ is given by

$$ \hat{Y}_{(j)} = \sum_{i \in s} y_i + \left\{\frac{\sum_{i \in s} y_i x_i^j\big/\{v(x_i)x_i^{j-1}\}}{\sum_{i \in s} x_i^{2j}\big/\{v(x_i)x_i^{j-1}\}}\right\}\left(\sum_{i \in \Omega} x_i^j - \sum_{i \in s} x_i^j\right) = \hat{Y}_0 \qquad (5.3.2.4) $$

for $j = 0,1,...,p$ and $s = s^*(p)$.

Now consider the more general model $\xi_j(\delta_0, ..., \delta_j = 1, ..., \delta_p : v(x)x^{j-1})$. For any
linear unbiased estimator $\hat{Y}$, $E_{\xi_j}(\hat{Y} - Y)^2$ depends only on the variance $v(x)x^{j-1}$
and not on the coefficients $\beta_0, ..., \beta_p$. Since $\hat{Y}_0$ is unbiased when $s = s^*(p)$ and is
the BLU estimator when $\beta_k = 0$ for $k \neq j$, this implies that $\hat{Y}_0$ is the BLU estimator
under $\xi_j$.

Thus if $\hat{Y}$ is an unbiased linear estimator under the model $\xi(\delta_0, ..., \delta_p : v^*(x))$ with

$$ v^*(x) = \sum_{j=0}^{p}\delta_j a_j v(x) x^{j-1}, $$

then its mean squared error is

$$ E_\xi(\hat{Y} - Y)^2 = \sum_{j=0}^{p}\delta_j a_j E_{\xi_j}(\hat{Y} - Y)^2. \qquad (5.3.2.5) $$

Hence the theorem.

If $v(x) = x$ then the model considered by Scott, Brewer, and Ho (1978) reduces to
the Royall and Herson (1973a, 1973b) result leading to balanced sampling, and the
estimator $\hat{Y}_0$ reduces to the traditional ratio estimator of the population total $Y$, defined
as

$$ \hat{Y}_1 = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}\sum_{i \in \Omega} x_i . $$

In this case the reduction of bias through balanced sampling may be expensive in
terms of efficiency under the model $\xi(0,1 : x)$. Thus the most efficient sampling
strategy is to choose the $n$ units with the largest $x_i$ values. The relative efficiency
of balanced sampling is then given by the ratio $\bar{x}_{\Omega - s}/\bar{x}_s$.

Royall and Herson (1973a) have given numerical values of relative efficiency for a
variety of distributions.

If $v(x) = x^2$ then $\hat{Y}_0$ becomes

$$ \hat{Y}_2 = \sum_{i \in s} y_i + \left(\frac{1}{n}\sum_{i \in s}\frac{y_i}{x_i}\right)\left(\sum_{i \in \Omega} x_i - \sum_{i \in s} x_i\right) $$

and the condition of balanced sampling

$$ \frac{\sum_{i \in (\Omega - s)} x_i^j}{\sum_{i \in (\Omega - s)} x_i} = \frac{\sum_{i \in s} v^{-1}(x_i)\, x_i^{j+1}}{\sum_{i \in s} v^{-1}(x_i)\, x_i^2} $$

reduces to

$$ \frac{1}{n}\sum_{i \in s} x_i^{j-1} = \frac{\sum_{i \in \Omega - s} x_i^j}{\sum_{i \in \Omega - s} x_i}, \quad j = 0,1,...,p, $$

which is always true for $j = 1$. A sample satisfying this condition is called
'overbalanced'.

The MSE of $\hat{Y}_2$ under the model $\xi(0,1 : x^2)$ is given by

$$ \mathrm{MSE}(\hat{Y}_2) = \sigma^2\left[\frac{1}{n}\left(\sum_{i \in \Omega - s} x_i\right)^2 + \sum_{i \in \Omega - s} x_i^2\right]. \qquad (5.3.2.6) $$

Thus (5.3.2.6) shows that if the sampling fraction is small and no single $x_i$
dominates the others, the MSE is affected very little by the choice of sample, and
little efficiency is lost by choosing an overbalanced sample.

As suggested by Scott, Brewer, and Ho (1978), in many real life situations $v(x)$
increases more quickly than $x$ but less quickly than $x^2$, so that $\xi(\delta_0, ..., \delta_p : v(x))$
with $v(x) = a_1^2 x + a_2^2 x^2$ is often a fairly realistic model.

Thus we have

$$ \mathrm{MSE}(\hat{Y}_1) = \frac{N(N-n)}{n}\left[a_1^2\bar{x} + a_2^2\bar{x}^{(2)}\right] \qquad (5.3.2.7) $$

and

$$ \mathrm{MSE}(\hat{Y}_2) = \frac{N(N-n)}{n}\,\bar{x}_{(\Omega - s)}\left[a_1^2 + a_2^2\bar{x}\right], \qquad (5.3.2.8) $$

where $\bar{x}_{(\Omega - s)}$ is the mean of the elements not included in the overbalanced sample,
$\bar{x}^{(2)} = N^{-1}\sum_{i=1}^{N} x_i^2$ and $\bar{x} = N^{-1}\sum_{i=1}^{N} x_i$.

Thus it follows from the condition

$$ \frac{1}{n}\sum_{i \in s} x_i^{j-1} = \frac{\sum_{i \in \Omega - s} x_i^j}{\sum_{i \in \Omega - s} x_i} $$

with $j = 0$ that $\bar{x}_{(\Omega - s)}$ is smaller than $\bar{x}_s$, and hence less than $\bar{x}$, which implies that
$\mathrm{MSE}(\hat{Y}_2) < \mathrm{MSE}(\hat{Y}_1)$.
In other words, the ratio estimator with balanced sampling is less efficient than $\hat{Y}_2$
with overbalanced sampling. It may be noted that the loss in efficiency will be less
if $a_1^2$ dominates $a_2^2$, but can be substantial if $a_2^2$ is relatively large. Thus to form
an efficient estimator, we have to look for overbalanced sampling.
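As a quick numerical illustration of (5.3.2.7) and (5.3.2.8), the toy sketch below (our own made-up size values, not from the text) evaluates both MSE formulas on a skewed size variable; the harmonic mean of the $x$ values is used as an illustrative stand-in for $\bar{x}_{(\Omega - s)}$, since the overbalance condition with $j = 0$ forces $\bar{x}_{(\Omega - s)}$ below the arithmetic mean.

```python
# Illustration of (5.3.2.7) vs (5.3.2.8) on a skewed size variable.
x = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
N, n = len(x), 4
a1sq, a2sq = 1.0, 0.5        # a_1^2 and a_2^2 in v(x) = a_1^2 x + a_2^2 x^2
xbar = sum(x) / N
xbar2 = sum(xi * xi for xi in x) / N
# Harmonic mean is <= xbar (AM-HM inequality); used as a stand-in for
# the complement mean xbar_{Omega - s} of an overbalanced sample.
xbar_rest = N / sum(1.0 / xi for xi in x)
mse1 = N * (N - n) / n * (a1sq * xbar + a2sq * xbar2)      # balanced, (5.3.2.7)
mse2 = N * (N - n) / n * xbar_rest * (a1sq + a2sq * xbar)  # overbalanced, (5.3.2.8)
print(mse1, mse2, mse2 < mse1)
```

Because $\bar{x}_{(\Omega - s)} < \bar{x}$ and $\bar{x}^{(2)} \ge \bar{x}^2$, the overbalanced MSE is smaller for any such inputs.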
Note that

$$ E\left(\frac{1}{n}\sum_{i \in s} x_i^{j-1}\right) = \sum_{i=1}^{N} x_i^j\Big/\sum_{i=1}^{N} x_i , $$

so the selection of the sample with probability proportional to $x$ yields an
approximately overbalanced sample if the sample size is large and the sampling
fraction is small. For example, let the first order inclusion probability $\pi_i$ be given
by

$$ \pi_i = \lambda x_i/(1 + \lambda x_i), \qquad (5.3.2.9) $$
where $\lambda$ is obtained from the condition $\sum_{i=1}^{N}\pi_i = n$, that is, $\lambda$ is a solution of the
equation

$$ \sum_{i=1}^{N}\lambda x_i/(1 + \lambda x_i) = n. \qquad (5.3.2.10) $$

The solution may be obtained iteratively by taking the initial value as

$$ \lambda^{(0)} = n\Big/\sum_{i=1}^{N} x_i \quad \text{and} \quad \lambda^{(k+1)} = n\Big/\sum_{i=1}^{N}\left\{x_i\big/\left(1 + \lambda^{(k)} x_i\right)\right\}. $$

Now the probability of not selecting the $i$th unit in the sample, say $\pi_i^c$, is

$$ \pi_i^c = 1 - \pi_i = 1 - \lambda x_i/(1 + \lambda x_i) = 1/(1 + \lambda x_i). \qquad (5.3.2.11) $$

Thus for all $j$ we have

$$ E\left(\sum_{i \in \Omega - s} x_i^j\right) = \sum_{i=1}^{N}\pi_i^c x_i^j = \sum_{i=1}^{N} x_i^j/(1 + \lambda x_i). \qquad (5.3.2.12) $$

Also we have

$$ E\left(\sum_{i \in s} x_i^{j-1}\right) = \sum_{i=1}^{N}\pi_i x_i^{j-1} = \sum_{i=1}^{N}\lambda x_i^j/(1 + \lambda x_i). \qquad (5.3.2.13) $$

From (5.3.2.12) and (5.3.2.13) we have

$$ E\left(\sum_{i \in \Omega - s} x_i^j\right) = E\left(\sum_{i \in s} x_i^{j-1}\right)\Big/\lambda . \qquad (5.3.2.14) $$

Hence it follows that

$$ E\left(\sum_{i \in s} x_i^{j-1}\Big/ n\right) = E\left(\sum_{i \in \Omega - s} x_i^j\right)\Big/ E\left(\sum_{i \in \Omega - s} x_i\right). \qquad (5.3.2.15) $$

Thus an approximately overbalanced sample can be obtained if the sample size is
large enough.
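The iterative solution of (5.3.2.10) is easy to sketch in code. The following minimal Python snippet (an illustration with made-up size values; the function name is ours) runs the fixed-point iteration $\lambda^{(k+1)} = n\big/\sum_i\{x_i/(1+\lambda^{(k)}x_i)\}$ and checks that the resulting $\pi_i = \lambda x_i/(1+\lambda x_i)$ sum to $n$.

```python
def solve_lambda(x, n, tol=1e-12, max_iter=1000):
    """Iteratively solve sum_i lam*x_i/(1 + lam*x_i) = n  (eq. 5.3.2.10)."""
    lam = n / sum(x)                 # initial value lambda^(0) = n / sum(x)
    for _ in range(max_iter):
        new = n / sum(xi / (1.0 + lam * xi) for xi in x)   # lambda^(k+1)
        if abs(new - lam) < tol:
            return new
        lam = new
    return lam

x = [3, 7, 1, 9, 4, 6, 2, 8, 5, 10]
lam = solve_lambda(x, n=4)
pi = [lam * xi / (1.0 + lam * xi) for xi in x]   # first order inclusion probs
print(round(sum(pi), 6))   # -> 4.0
```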

5.3.3 DESIGN VARIANCE AND ANTICIPATED VARIANCE OF LINEAR REGRESSION TYPE ESTIMATOR

Consider a superpopulation model $m$ as

$$ m : y_i = \beta x_i + e_i \qquad (5.3.3.1) $$

such that $E_m(e_i) = 0$, $E_m(e_i^2) = \sigma^2 x_i^g$ and $E_m(e_i e_j) = 0$ for $i \neq j$, where $g$ is any real
number. In the previous sections, we have seen that if the sample size is fixed at $n$,
the design variance of the Horvitz and Thompson (1952) type estimator of the
population total (called the generalised linear regression estimator, or GREG) is defined
as

(5.3.3.2)

where $\hat{\beta}$ is a prediction unbiased and prediction consistent estimator of $\beta$. If
(5.3.3.1) holds, then the above estimator can be written as

(5.3.3.3)

If the sample size $n$ is fixed and the sampling design is $p$, then the design variance of
the estimator $\hat{Y}_g$ becomes

$$ V_p(\hat{Y}_g) = \frac{1}{2}\sum_{i \neq j \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right)\left(d_i e_i - d_j e_j\right)^2 , \qquad (5.3.3.4) $$

where $d_i = 1/\pi_i$.

Consider the random variable $\hat{Y}_g - Y$; the variance of this random variable
under the model $m$ and the sampling design $p$ is called the Anticipated Variance
(AV).

Thus the AV of the estimator $\hat{Y}_g$ is given by

$$ \mathrm{AV}(\hat{Y}_g) = E_m\left[V_p(\hat{Y}_g)\right] = E_m\left[\frac{1}{2}\sum_{i \neq j \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right)\left(d_i e_i - d_j e_j\right)^2\right] $$
$$ = E_m\left[\frac{1}{2}\sum_{i \neq j \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right)\left(d_i^2 e_i^2 + d_j^2 e_j^2 - 2 d_i e_i d_j e_j\right)\right] $$
$$ = \frac{1}{2}\sum_{i \in \Omega}\sum_{j(\neq i) \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right)\left\{d_i^2 E_m(e_i^2) + d_j^2 E_m(e_j^2) - 2 d_i d_j E_m(e_i e_j)\right\} $$
$$ = \frac{1}{2}\sum_{i \in \Omega}\sum_{j(\neq i) \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right)\left(d_i^2\sigma^2 x_i^g + d_j^2\sigma^2 x_j^g - 0\right) $$
$$ = \sigma^2\sum_{i \in \Omega} d_i^2 x_i^g\sum_{j(\neq i) \in \Omega}\left(\pi_i\pi_j - \pi_{ij}\right) = \sigma^2\sum_{i \in \Omega} d_i^2 x_i^g\,\pi_i\left(1 - \pi_i\right) = \sigma^2\sum_{i \in \Omega}\left(\frac{1}{\pi_i} - 1\right)x_i^g , $$

on using $\sum_{j(\neq i) \in \Omega}\pi_j = n - \pi_i$, $\sum_{j(\neq i) \in \Omega}\pi_{ij} = (n-1)\pi_i$ and $d_i = 1/\pi_i$. This is
the same expression as was shown by Godambe (1955) to be the minimum
possible anticipated variance for any design unbiased estimator of the population total.
Similar results can also be had from Brewer (1979) and Sarndal and Wright (1984).
The Anticipated Variance (AV) can also be derived by an alternative method as
follows:

$$ \mathrm{AV}(\hat{Y}_g) = E_m V_p(\hat{Y}_g) + V_m E_p(\hat{Y}_g) \approx E_m\left[\sum_{i \in \Omega}\sum_{j \in \Omega}\left(\frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\right)e_i e_j\right] $$
$$ = \sum_{i \in \Omega}\sum_{j \in \Omega}\left(\frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\right)E_m(e_i e_j) = \sigma^2\sum_{i \in \Omega}\left(\frac{\pi_i - \pi_i^2}{\pi_i^2}\right)x_i^g = \sigma^2\sum_{i \in \Omega}\left(\frac{1}{\pi_i} - 1\right)x_i^g , $$

where $\pi_{ii} = \pi_i$.

We have seen in Corollary 5.1.1 that if $\pi_i \propto Y_i$, the variance of the Horvitz and
Thompson (1952) estimator of the population total $Y$ reduces to zero.
Unfortunately the values of $Y_i$ are not usually known in practice, but the value $X_i$
of an auxiliary variable correlated with $Y_i$ may be known. Thus if $\pi_i$ is chosen
proportional to $X_i$, there may be a substantial reduction in the variance. Now we
will consider the construction of the first and second order inclusion probabilities in
the following section.
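To see the lower bound $\sigma^2\sum_i(1/\pi_i - 1)x_i^g$ at work, the hedged sketch below (toy data and function names of our own, not from the text) evaluates the anticipated variance for three choices of $\pi_i$ with $\sum_i\pi_i = n$. Minimizing $\sum_i x_i^g/\pi_i$ subject to that constraint gives $\pi_i \propto x_i^{g/2}$ by the Cauchy--Schwarz inequality, so that choice should come out smallest.

```python
def anticipated_variance(pi, x, g=1.0, sigma2=1.0):
    # AV = sigma^2 * sum_i (1/pi_i - 1) * x_i^g   (Godambe lower bound form)
    return sigma2 * sum((1.0 / p - 1.0) * xi**g for p, xi in zip(pi, x))

def scale_to_n(w, n):
    # Rescale positive weights so the inclusion probabilities sum to n.
    s = sum(w)
    return [n * wi / s for wi in w]

x = [2, 3, 4, 5, 6, 7, 8, 9]
n, g = 2, 1.0
pi_equal = [n / len(x)] * len(x)                      # equal probabilities
pi_ppx = scale_to_n(x, n)                             # pi_i proportional to x_i
pi_opt = scale_to_n([xi ** (g / 2) for xi in x], n)   # pi_i prop. to x_i^{g/2}
av_equal = anticipated_variance(pi_equal, x, g)
av_ppx = anticipated_variance(pi_ppx, x, g)
av_opt = anticipated_variance(pi_opt, x, g)
print(av_equal, av_ppx, av_opt)
```

For these data the $\pi_i \propto x_i^{g/2}$ choice attains the smallest anticipated variance, as the bound predicts.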
Chapter 5: Use of auxiliary information: PPSWOR Sampling 385

5.4 CONSTRUCTION AND OPTIMAL CHOICE OF INCLUSION PROBABILITIES

Theorem 5.4.1. The inclusion probability of selecting the $i$th population unit using the
PPSWOR sampling scheme is given by

$$ \pi_i = P_i\left\{1 + S - P_i/(1 - P_i)\right\}, \quad \text{where } S = \sum_{j=1}^{N} P_j/(1 - P_j). \qquad (5.4.1) $$

Proof. For a sample of size $n = 2$ we have

$$ \pi_i = \text{Probability of selecting the } i\text{th unit in the sample} = P_i + \sum_{j \neq i = 1}^{N} P_j P_i/(1 - P_j), $$

where $P_i$ is the probability of the $i$th unit being selected in the sample at the first draw.
If the $i$th unit is not selected at the first draw, then some other unit will be selected. Let
the other unit selected be the $j$th unit. Its probability of being selected is $P_j$, so that
at the second draw the probability of the $i$th unit being selected (if the $i$th unit is not
selected on the first draw) is the product of two probabilities, namely
( a ) the probability that on the first draw the $i$th unit is not selected but some other
unit is selected,
and
( b ) the probability that the $i$th unit is selected on the second draw, given that it was not
selected on the first draw.
In other words, on the second draw, the probability of selecting the $i$th unit is

$$ P_j\left\{P_i/(1 - P_j)\right\}, \qquad (5.4.2) $$

where $P_j$ is the probability of the first (ineffective) draw, and $P_i/(1 - P_j)$ is the probability of the
second (effective) draw. The summation sign is taken with the second factor for the
reason that any of the remaining units (different from the first selected unit) can be
selected on the second draw. Therefore

$$ \pi_i = P_i\left(1 + \sum_{j \neq i = 1}^{N} P_j/(1 - P_j)\right) = P_i\left(1 + \sum_{j=1}^{N} P_j/(1 - P_j) - P_i/(1 - P_i)\right) = P_i\left(1 + S - P_i/(1 - P_i)\right), \qquad (5.4.3) $$

where $P_i/(1 - P_i)$ is the term corresponding to the $i$th unit itself and $S = \sum_{j=1}^{N} P_j/(1 - P_j)$.

Now the factor $P_i/(1 - P_i)$ is not constant because it depends upon $P_i$, but $S$ is a
constant since it is a sum over all the units. Therefore $\pi_i$ is proportional not only to
$P_i$ but also to the factor $P_i/(1 - P_i)$. To achieve minimum variance, the probabilities
should be chosen such that $\pi_i$ becomes proportional to $P_i$. Narain (1951),
Yates and Grundy (1953), Fellegi (1963), and Hanurav (1967) have suggested
different methods to obtain values of $\pi_i$ proportional to $P_i$. Chaudhuri (1981)
has shown that the application of the sampling method of Fellegi (1963) to the first
occasion units and the unmatched second occasion units provides the required
inclusion probabilities. Now we shall discuss in detail the methods for the
construction and optimal choice of inclusion probabilities.
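Theorem 5.4.1 is easy to check numerically. The small sketch below (illustrative probabilities only; function names are ours) compares the direct two-draw sum with the closed form (5.4.1) and confirms that the $\pi_i$ add to $n = 2$.

```python
def inclusion_prob_direct(P, i):
    # pi_i = P_i + sum_{j != i} P_j * P_i / (1 - P_j)   (two draws, PPSWOR, n = 2)
    return P[i] + sum(P[j] * P[i] / (1.0 - P[j]) for j in range(len(P)) if j != i)

def inclusion_prob_formula(P, i):
    # pi_i = P_i * (1 + S - P_i/(1 - P_i)),  S = sum_j P_j/(1 - P_j)   (5.4.1)
    S = sum(p / (1.0 - p) for p in P)
    return P[i] * (1.0 + S - P[i] / (1.0 - P[i]))

P = [0.1, 0.2, 0.3, 0.15, 0.25]   # selection probabilities summing to one
for i in range(len(P)):
    assert abs(inclusion_prob_direct(P, i) - inclusion_prob_formula(P, i)) < 1e-12
print(round(sum(inclusion_prob_formula(P, i) for i in range(len(P))), 6))  # -> 2.0
```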

Let $\pi_i$, $i = 1,2,...,N$, be the set of desired inclusion probabilities for a sample of
size $n$ drawn using WOR sampling, say the target inclusion probabilities or $\pi$ps
probabilities. Note that $\sum_{i=1}^{N}\pi_i = n$; the $\pi$ps probabilities $\pi_i^*$ should be such that

$$ \pi_i^* = \pi_i \quad \forall\, i. \qquad (5.4.4) $$

Let us now consider the construction of the $\pi_i^*$ based on a size variable
$X_i > 0$, $i = 1,2,...,N$, i.e., $\pi_i^* = nX_i\big/\sum_{i=1}^{N} X_i$.

The first problem with these inclusion probabilities is that some size variables lead
to problems, since one or more of the $\pi_i^*$ may exceed one. This problem can partially
be solved by using a revised selection probability, as discussed in Section 5.4.5. The
second problem concerns the exhibition of a good, preferably a best, scheme. To
solve this problem, Rosen (1997a) has stipulated the following requirements:
( a ) The sample selection should be simple to implement.
( b ) The scheme should lead to good estimation precision.
( c ) The scheme should have good variance estimation properties.

We will now discuss some target inclusion probability sampling schemes:

5.4.1 PARETO πps SAMPLING ESTIMATION SCHEME

As discussed by Rosen (1997a), the main steps in this scheme are as follows:

( a ) A sampling frame $\Omega = (1,2,....,N)$ with sizes $X_i > 0$, $i = 1,2,...,N$, is at hand, and
a sample size $n$ is specified;
( b ) Compute the target inclusion probabilities $\pi_i = nX_i/X$, $i = 1,2,...,N$. It is
presumed that $\pi_i < 1$ $\forall\, i$. If not, then modify the size measures;
( c ) Realize independent standard uniform random variables $U_1, U_2, ..., U_N$, thereby
realizing the ranking variables

$$ Q_i = \frac{U_i(1 - \pi_i)}{\pi_i(1 - U_i)}, \quad i = 1,2,...,N. \qquad (5.4.1.1) $$

The sample consists of the $n$ units with labels $(j_1, j_2, ..., j_n)$ determined by
$Q_{j_1}, Q_{j_2}, ..., Q_{j_n}$ being the $n$ smallest values among the realized $Q_1, Q_2, ..., Q_N$. The
sampled values so obtained result in a non-negative estimator of variance
following Rosen (1997b). The general structure of the Pareto $\pi$ps sampling
estimation procedure applies to all order sampling with fixed distribution shape
(OSFS) sampling schemes. Sunter (1977) also introduced a $\pi$ps scheme with
control of variance estimation. Aires (2000) has considered the problem of
comparison of Pareto $\pi$ps sampling with conditional Poisson sampling by
introducing algorithms to calculate the first and second order exact inclusion
probabilities.
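Steps ( a )--( c ) can be sketched in a few lines of Python. This is a minimal illustration with made-up sizes (the function name is ours, not Rosen's), returning the labels of the $n$ smallest ranking variables.

```python
import random

def pareto_pips_sample(x, n, rng=random):
    """Draw a Pareto pips sample of n labels from sizes x (a sketch)."""
    N = len(x)
    X = sum(x)
    pi = [n * xi / X for xi in x]          # target inclusion probabilities
    assert all(p < 1 for p in pi), "modify the size measures if some pi >= 1"
    Q = []
    for i in range(N):
        U = rng.random()
        # ranking variable (5.4.1.1): Q_i = U_i (1 - pi_i) / {pi_i (1 - U_i)}
        Q.append(U * (1.0 - pi[i]) / (pi[i] * (1.0 - U)))
    # the sample = labels of the n smallest ranking variables
    return sorted(range(N), key=lambda i: Q[i])[:n]

random.seed(42)
x = [5, 9, 2, 7, 4, 8, 3, 6]
s = pareto_pips_sample(x, n=3)
print(s)
```

The scheme is fixed-size by construction, and units with larger $\pi_i$ tend to receive smaller $Q_i$ and hence enter the sample more often.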

5.4.2 HANURAV'S METHOD

Keeping in view the stability of the Sen--Yates--Grundy (1953) estimator of
variance of the Horvitz and Thompson (1952) estimator of population total,
Hanurav (1967) proposed a method such that $\pi_{ij} > \gamma\,\pi_i\pi_j$ for all pairs $(i,j)$, where
$\gamma$ is a positive constant. Hanurav's (1967) sampling scheme for a sample of size
$n = 2$ is as follows:

Without loss of generality let us assume that

$$ 0 < X_1 \le X_2 \le .... \le X_N < \frac{1}{2}\sum_{i=1}^{N} X_i . $$

Conduct a Bernoulli trial with the probability of success

$$ \phi = 2(1 - P_N)(P_N - P_{N-1})/(1 - P_N - P_{N-1}). $$

Then there are two possibilities:

( a ) If the trial results in success, select one of the units $U_1, U_2, ..., U_{N-1}$ with
probabilities proportional to $P_1, P_2, ..., P_{N-1}$. Thus for example, if the $j$th unit is selected,
accept $U_N$ and $U_j$ as the sample;

( b ) If the trial results in failure, draw two units from the population with
replacement and with probabilities $\phi_i^* = P_i/(1 - P_N + P_{N-1})$ for the $i$th unit,
$1 \le i \le N-1$, and $\phi_N^* = (P_N - \phi/2)/(1 - \phi)$.

In case the selections do not coincide, accept the sample; otherwise reject the
sample and select two units from the population with replacement and with
probabilities proportional to $\phi_i^{*2}$. If the sample consists of distinct units, then accept
it; otherwise reject it and repeat the process. Hanurav (1967) listed the following
obviously desirable properties of a sampling design on which to base the Horvitz and
Thompson (1952) estimator:
( a ) $\pi_i = nP_i$, $i = 1,2,...,N$; ( b ) number of distinct units, $\nu = n$ $\forall\, s : p(s) > 0$;
( c ) $\pi_{ij} = \pi_{ji} > 0$ $\forall\, i \neq j = 1,2,....,N$; ( d ) $\pi_{ij} < \pi_i\pi_j$ $\forall\, i \neq j = 1,2,....,N$;
( e ) $\pi_{ij}$ should be easily derivable from simple formulae.

Several other researchers have also worked on similar procedures, including Fellegi
(1963), Vijayan (1968), Fuller (1971), Brewer (1967, 1975), and Asok and
Sukhatme (1975, 1976a).

5.4.3 BREWER'S METHOD

Suppose a population consists of $N$ units labelled $i = 1,2,...,N$, with measures of
size $P_i$ such that $\sum_{i=1}^{N} P_i = 1$ and $\pi_i = nP_i$ with $0 < \pi_i < 1$. Here we select
the $(n-r+1)$th unit ($r$th last draw), from among those not already selected, with
working probabilities proportional to $P_i(1 - P_i)/(1 - rP_i)$. Following Brewer (1963a),
the probability of including the $i$th and $j$th units for $n = 2$ is given by

$$ \pi_{ij} = 2P_iP_j\left\{\frac{1}{1 - 2P_i} + \frac{1}{1 - 2P_j}\right\}\Big/\left\{1 + \sum_{i=1}^{N}\frac{P_i}{1 - 2P_i}\right\}. \qquad (5.4.3.1) $$

For $n > 2$, Brewer (1975) provided a recursive formula for the probabilities of
inclusion in the sample as follows. The working selection
probability for the $i$th unit at the $(n-r+1)$th draw ($r$th last draw), given that it was not
selected in the first $(n-r)$ draws, is

$$ P_i(r) = rP_i\Big/\left(1 - {\sum_{j=1}^{n-r}}' P_j\right), \qquad (5.4.3.2) $$

where $\sum'_{j=1}^{n-r} P_j$ denotes the sum of the selection probabilities corresponding to the
selected $(n-r)$ units in the sample. Thus the conditional probabilities of inclusion
in the last $r$ draws are proportional to the measures of size of the remaining units.
The method of induction can be applied to derive a formula for the working
probabilities. Suppose a method is available for selecting $(n-1)$ units. Then if the
first unit is selected with working probabilities $P_i(n)$, the last $(n-1)$ can be selected
from the remaining units with probabilities proportional to the $P_i$ values, so that the
probability of including the $i$th unit in the last $(n-1)$ units, given that the first unit
selected is the $j$th, will be $(n-1)P_i/(1 - P_j)$. Then the condition required on $P_i(n)$ will
be

$$ nP_i = P_i(n) + \sum_{j \neq i} P_j(n)\left\{(n-1)P_i/(1 - P_j)\right\}. \qquad (5.4.3.3) $$

Putting $S_i(n) = P_i(n)/(1 - P_i)$ and $S(n) = \sum_{i=1}^{N} S_i(n)$ and eliminating $P_i(n)$ from (5.4.3.3),

we have

$$ S_i(n) = P_i\left[n - (n-1)S(n)\right]/(1 - nP_i). \qquad (5.4.3.4) $$

Putting (5.4.3.4) in $S(n) = \sum_{i=1}^{N} S_i(n)$ we have

$$ S(n) = nT(n)\big/\left\{1 + (n-1)T(n)\right\}, \qquad (5.4.3.5) $$

where $T(n) = \sum_{i=1}^{N} P_i/(1 - nP_i)$. Using (5.4.3.5) in (5.4.3.4) we have the required
working probabilities as

$$ P_i(n) = nP_i(1 - P_i)\big/\left[(1 - nP_i)\left\{1 + (n-1)T(n)\right\}\right]. \qquad (5.4.3.6) $$

It is obvious that $\sum_{i=1}^{N} P_i(n) = 1$ and $0 < P_i(n) < 1$, since $0 < \pi_i = nP_i < 1$. Keeping in view the
Sen--Yates--Grundy estimator of variance, Brewer (1975) also suggested a formula
for calculating the second order inclusion probabilities, given by

$$ \pi_{ij}(n) = (n-1)\left[S_i(n)P_j + S_j(n)P_i\right] + \sum_{k(\neq i,j)=1}^{N} P_k(n)\,\pi_{ij}^{(k)}(n-1), \qquad (5.4.3.7) $$

where $\pi_{ij}^{(k)}(n-1)$ denotes the joint probability of inclusion of the $i$th and $j$th units in
the remaining $(n-1)$ units, given that the $k$th unit was selected first. Thus the joint
inclusion probabilities can also be calculated by using the above recursive formula.
Chromy (1974) found that the second order inclusion probabilities $\pi_{ij}$ given by this
procedure asymptotically minimized the expected variance of the Horvitz and
Thompson (1952) estimator when $r = 1/2$. Rao (1963a, 1963b) showed that the
Horvitz and Thompson (1952) estimator is always more efficient than the
corresponding Hansen and Hurwitz (1943) estimator for multinomial sampling, and
that its variance estimator was never negative. Thus in this case, the joint
probabilities of inclusion, and hence also the variance estimator, are simple
functions of size.
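The recursion bottoms out in the closed form (5.4.3.6), which is straightforward to compute. The sketch below (with illustrative $P_i$ of our own choosing) verifies that the working probabilities sum to one, as stated above.

```python
def brewer_working_probs(P, n):
    """Working first-draw probabilities P_i(n) of (5.4.3.6); needs n*P_i < 1."""
    assert all(n * p < 1 for p in P)
    T = sum(p / (1.0 - n * p) for p in P)          # T(n)
    denom = 1.0 + (n - 1.0) * T                    # 1 + (n-1) T(n)
    return [n * p * (1.0 - p) / ((1.0 - n * p) * denom) for p in P]

P = [0.05, 0.10, 0.15, 0.20, 0.12, 0.08, 0.18, 0.12]   # sums to 1
p3 = brewer_working_probs(P, n=3)
print(round(sum(p3), 10))  # -> 1.0
```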

5.4.4 SAMPFORD'S METHOD

Sampford (1967) proposed a method for an inclusion probabilities proportional to size
(IPPS) scheme for selecting a sample of $n$ units, which in fact is an extension of the
methods proposed by Brewer (1963a), Rao (1965b), and Durbin (1967). For this
scheme of sampling, the first order inclusion probability is given by

$$ \pi_i = \mathrm{prob}\left[i\text{th unit is selected at the first draw}\right] + \mathrm{prob}\left[i\text{th unit is selected at the second draw}\right] = P_i + \sum_{j \neq i = 1}^{N} P_j P_{i.j} = 2P_i , \qquad (5.4.4.1) $$

where $P_{i.j}$ denotes the conditional probability $P(i \mid j)$. If $P_i$ denotes the probability
of selecting the $i$th unit at the first draw, then the probability of selecting the second
unit from the remaining $(N-1)$ units in the population is given by

$$ P(j \mid i) = P_{j.i} = P_j\left[\frac{1}{1 - 2P_i} + \frac{1}{1 - 2P_j}\right]\Big/\left[1 + \sum_{i=1}^{N}\frac{P_i}{1 - 2P_i}\right], \quad j(\neq i) = 1,2,...,N. \qquad (5.4.4.2) $$

Furthermore, the second order inclusion probability is given by

$$ \pi_{ij} = P_i P_{j.i} + P_j P_{i.j} = 2P_iP_j\left[\frac{1}{1 - 2P_i} + \frac{1}{1 - 2P_j}\right]\Big/\left[1 + \sum_{j=1}^{N}\frac{P_j}{1 - 2P_j}\right]. \qquad (5.4.4.3) $$

It can be easily verified that $\pi_i\pi_j - \pi_{ij} \ge 0$ for Sampford's sampling procedure.
Midha (1980), Hartley and Rao (1962), and Asok and Sukhatme (1976b) have also
suggested some procedures to approximate these inclusion probabilities. The
Durbin (1967) procedure has also been used by Brewer and Hanif (1970) for
developing a new multistage estimator of variance.
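The $n = 2$ formulas above are easy to verify numerically. The sketch below (illustrative $P_i$ with $P_i < 1/2$; function name is ours) computes $\pi_{ij}$ from (5.4.4.3) and checks both that $\sum_{j \neq i}\pi_{ij} = 2P_i = \pi_i$ and that the Sen--Yates--Grundy condition $\pi_i\pi_j - \pi_{ij} \ge 0$ holds.

```python
def joint_inclusion_n2(P):
    """Joint inclusion probabilities for n = 2 per (5.4.4.3); needs P_i < 1/2."""
    D = 1.0 + sum(p / (1.0 - 2.0 * p) for p in P)
    N = len(P)
    pij = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j:
                pij[i][j] = 2.0 * P[i] * P[j] * (
                    1.0 / (1.0 - 2.0 * P[i]) + 1.0 / (1.0 - 2.0 * P[j])) / D
    return pij

P = [0.10, 0.15, 0.20, 0.25, 0.30]
pij = joint_inclusion_n2(P)
for i in range(len(P)):
    # first order inclusion probabilities: sum_{j != i} pi_ij = 2 P_i
    assert abs(sum(pij[i]) - 2.0 * P[i]) < 1e-12
    for j in range(len(P)):
        if i != j:  # Sen--Yates--Grundy nonnegativity: pi_i pi_j - pi_ij >= 0
            assert 4.0 * P[i] * P[j] - pij[i][j] >= 0.0
print("ok")
```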

5.4.5 NARAIN'S METHOD

Narain (1951) proposed another method of sample selection which is free from any
restriction on the set of initial selection probabilities and leads to a more efficient
estimator of the population parameters than with replacement sampling. This
method consists of making revised selection probabilities $P_i'$ $(i = 1,2,...,N)$ such that
the resulting inclusion probabilities $\pi_i$ are proportional to the original probabilities
of selection $P_i$, $i = 1,2,...,N$. For a sample of two units, the revised selection
probabilities are given by

$$ \pi_i = P_i'\left(1 + S' - \frac{P_i'}{1 - P_i'}\right) = 2P_i, \quad \text{where } S' = \sum_{i=1}^{N}\frac{P_i'}{1 - P_i'}. \qquad (5.4.5.1) $$

Rao (1989), Yates and Grundy (1953) and Brewer and Undy (1962) have done
further work in these directions.

5.4.6 MIDZUNO--SEN METHOD

Midzuno (1952) and Sen (1952) introduced an interesting method for selecting the
sample, which is useful in many ways. Using this method, the unit at the first draw
is selected with unequal probability, while the rest of the units are selected with
equal probabilities and without replacement. Define a random variable $t_i$ such
that

$$ t_i = \begin{cases} 1 & \text{if the } i\text{th unit is selected in the sample}, \\ 0 & \text{otherwise}. \end{cases} \qquad (5.4.6.1) $$

We have

$$ E(t_i) = \pi_i = P_i + P\left[\begin{array}{l}i\text{th unit is not selected at the first draw and is selected}\\ \text{at any of the remaining } (n-1) \text{ draws}\end{array}\right] $$
$$ = P_i + (1 - P_i)\left(\frac{n-1}{N-1}\right) = \left(\frac{N-n}{N-1}\right)P_i + \left(\frac{n-1}{N-1}\right) \qquad (5.4.6.2) $$

and

$$ E(t_i t_j) = \pi_{ij} = P\left[\begin{array}{l}i\text{th unit is selected at the first draw and } j\text{th unit is selected}\\ \text{at any of the remaining } (n-1) \text{ draws}\end{array}\right] $$
$$ + P\left[\begin{array}{l}j\text{th unit is selected at the first draw and } i\text{th unit is selected}\\ \text{at any of the remaining } (n-1) \text{ draws}\end{array}\right] $$
$$ + P\left[\begin{array}{l}\text{neither the } i\text{th nor the } j\text{th unit is selected at the first draw but}\\ \text{both of them are selected in the subsequent } (n-1) \text{ draws}\end{array}\right] $$
$$ = P_i\left(\frac{n-1}{N-1}\right) + P_j\left(\frac{n-1}{N-1}\right) + \frac{\left(1 - P_i - P_j\right)(n-1)(n-2)}{(N-1)(N-2)} $$

$$ = \left(\frac{n-1}{N-1}\right)\left[\left(\frac{N-n}{N-2}\right)\left(P_i + P_j\right) + \left(\frac{n-2}{N-2}\right)\right]. \qquad (5.4.6.3) $$

For this sampling scheme, one can easily see that $\pi_i\pi_j - \pi_{ij} > 0$, which guarantees
non-negativity of the estimator of variance proposed by Sen (1953) and Yates and
Grundy (1953). At the same time the limitation is that the first order inclusion
probabilities are not proportional to the selection probabilities. This can be rectified
by finding a new set of revised selection probabilities $P_i^*$ by applying a suitable
transformation to the selection probabilities $P_i$ such that the resultant inclusion
probabilities $\pi_i$, $i = 1,2,..,N$, are proportional to the values of $P_i$. Noting that the $P_i^*$ are
the revised selection probabilities, we have

$$ \pi_i = \left(\frac{N-n}{N-1}\right)P_i^* + \left(\frac{n-1}{N-1}\right). \qquad (5.4.6.4) $$

Simplifying (5.4.6.4) for $P_i^*$ and using $\pi_i = nP_i$ we have

$$ P_i^* = \frac{nP_i(N-1)}{N-n} - \frac{n-1}{N-n}. \qquad (5.4.6.5) $$

Since the revised selection probabilities $P_i^*$ must always be positive, the initial
selection probabilities $P_i$ must satisfy

$$ P_i > \frac{n-1}{n(N-1)} \quad \forall\, i. \qquad (5.4.6.6) $$

Thus the use of revised selection probabilities for deriving efficient estimates of the
population total through the Horvitz and Thompson estimator will be possible only in
those cases where the original selection probabilities satisfy the above condition.
This condition on the initial probabilities usually does not hold, and hence limits the
use of the scheme in practice. Some more relevant work related to the discussion of
the validity of these inclusion probabilities can be found in Rao (1963a, 1963b),
Asok (1974, 1980), and Asok and Sukhatme (1978). A generalization of the Midzuno--
Sen sampling scheme has been given by Prasad and Srivenkataramana (1980),
Deshpande and Ajgaonkar (1987), and Kumar and Srivenkataramana (1994). Bedi
and Agarwal (1999) suggested a new set of revised probabilities under the Midzuno
(1952) sampling scheme. The revised probabilities are functions of a location
shift factor $L$ between 0 and 1, and it is remarkable that the optimum value of $L$ is
free from any knowledge of unknown population parameters.
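The revision (5.4.6.5) can be sketched in a few lines. The illustrative $P_i$ below satisfy (5.4.6.6) and are our own; plugging the revised $P_i^*$ back into (5.4.6.4) should return exactly $\pi_i = nP_i$.

```python
def midzuno_revised_probs(P, n):
    """Revised probabilities (5.4.6.5); requires P_i > (n-1)/{n(N-1)} for all i."""
    N = len(P)
    Pstar = [n * p * (N - 1) / (N - n) - (n - 1.0) / (N - n) for p in P]
    assert all(ps > 0 for ps in Pstar)
    return Pstar

def first_order_pi(Pstar, n):
    # (5.4.6.4): pi_i = (N-n)/(N-1) * P_i^* + (n-1)/(N-1)
    N = len(Pstar)
    return [(N - n) / (N - 1.0) * ps + (n - 1.0) / (N - 1.0) for ps in Pstar]

P = [0.12, 0.13, 0.14, 0.15, 0.16, 0.14, 0.16]   # sums to 1, all > (n-1)/{n(N-1)}
n = 3
Pstar = midzuno_revised_probs(P, n)
pi = first_order_pi(Pstar, n)
for p, t in zip(P, pi):
    assert abs(t - n * p) < 1e-12      # pi_i = n P_i, as desired
print(round(sum(Pstar), 6))  # -> 1.0
```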

5.4.7 KUMAR--GUPTA--NIGAM SCHEME


Following the concept of Gupta, Nigam, and Kumar (1982), Kumar, Gupta, and
Nigam (1985) introduced a family of inclusion probabilities proportional to size
(IPPS) sampling schemes based on balanced incomplete block designs, following
Raghavarao (1971), for selecting a sample of $n$ units. The main steps of the
proposed scheme are:

( a ) Consider two balanced incomplete block designs $D_1$ ($v = N$, $b_1$, $r_1$, $k_1 \ge n$,
$\lambda_1$) and $D_2$ ($v = N$, $b_2$, $r_2$, $k_2 \le k_1$, $\lambda_2$), where the parameters have their usual
meaning as in the case of incomplete block designs;
( b ) Perform a random experiment which selects the design $D_1$ with probability
$W$ $(0 \le W \le 1)$ and the design $D_2$ with probability $(1 - W)$;
( c ) From the design $D_j$ selected at step ( b ), select one block $(s, D_j)$ with a
preassigned probability

$$ P(s, D_j) = \sum_{t \in (s, D_j)}\phi_t \quad (s = 1,2,...,b_j;\ j = 1 \text{ or } 2), \qquad (5.4.7.1) $$

where

$$ \phi_t = \left[k_1k_2(v-1)P_t - \left\{k_1(k_2 - 1) - W(k_2 - k_1)\right\}\right]\left[k_1(v - k_2) + W(k_2 - k_1)v\right]^{-1}, \qquad (5.4.7.2) $$

$t = 1,2,..,N$. If $k_j = n$, the elements of the selected block constitute the required
sample of size $n$. Otherwise go to step ( d );
( d ) Select a sub-sample of size $n$ by equal probability sampling without
replacement from the $k_j$ units belonging to the block $(s, D_j)$ selected at step ( c ).
For this scheme, the first and second order inclusion probabilities are

$$ \pi_i = \frac{nW}{k_1}\sum_{(s,D_1)\ni i} P(s, D_1) + \frac{n(1-W)}{k_2}\sum_{(s,D_2)\ni i} P(s, D_2) \qquad (5.4.7.3) $$

and

$$ \pi_{ij} = n(n-1)\left[W\sum_{(s,D_1)\ni i,j}\frac{P(s, D_1)}{k_1(k_1 - 1)} + (1-W)\sum_{(s,D_2)\ni i,j}\frac{P(s, D_2)}{k_2(k_2 - 1)}\right]. \qquad (5.4.7.4) $$

Kumar, Srivastava, and Agarwal (1986) have discussed a general class of unequal
probability sampling.

5.4.8 DEY AND SRIVASTAVA SCHEME FOR EVEN SAMPLE SIZE

Dey and Srivastava (1987) considered the following IPPS sampling scheme.
Consider a population of $N$ units with $y$ as the study variable and $x$, an auxiliary
variable, as the size. It is assumed that the $x$ values are known for all the population
units. A sample of size $n$ $(> 2)$ is to be selected. To start with, it is assumed that $n$ is
even. Divide the population into $m$ $(> n/2)$ groups so that the $i$th group contains
$N_i$ $(> 2)$ units $(i = 1,2,...,m)$, and for each group

$$ X_i/X > (n-2)/\{n(m-1)\}, \qquad (5.4.8.1) $$

where $X_i = \sum_{u=1}^{N_i} x_{iu}$, $x_{iu}$ is the value of $x$ for the $u$th unit in the $i$th group, and
$X = \sum_{i=1}^{m} X_i$.
Equation (5.4.8.1) is satisfied if the $X_i$ $(i = 1,2,...,m)$ are made nearly equal. It has
been seen in actual populations, considered by Rao and Bayless (1969), that this
condition is satisfied for quite a few values of $m$. Rao and Lanke (1984) suggested a
grouping procedure in which $N$ units are divided into $R$ groups so that the group
totals $X_i$ are nearly equal and the group sizes are either $[N/R]$ or $[N/R] + 1$, where $[\cdot]$ is

the largest integer. Having formed the $m$ groups, the suggested sampling procedure
consists of the following steps:

Step I. Select $n/2$ groups out of the $m$ groups using the Midzuno (1952) sampling
procedure, i.e., select one group with probability

$$ P_i' = \left\{n(m-1)P_i - (n-2)\right\}/(2m - n), \quad \text{with } P_i = X_i/X, \qquad (5.4.8.2) $$

and the remaining $(n/2) - 1$ groups with equal probabilities without replacement.

Step II. From each of the selected groups, select two units by any IPPS procedure,
say by Durbin's (1967) procedure; that is, from the $i$th selected group
$(i = 1,2,...,n/2)$ select one unit with probability

$$ P_{iu|i} = x_{iu}/X_i \qquad (5.4.8.3) $$

and the second unit with revised probability

$$ P_{iv|iu} = x_{iv}\left[(X_i - 2x_{iv})^{-1} + (X_i - 2x_{iu})^{-1}\right]\big/ D_i , \qquad (5.4.8.4) $$

where $D_i = \left\{1 + \sum_{u=1}^{N_i} x_{iu}/(X_i - 2x_{iu})\right\}$.

For this sampling scheme the inclusion probability for the $(iu)$th unit is evidently given
by

$$ \pi_{iu} = nP_{iu}, \qquad (5.4.8.5) $$

where $P_{iu} = x_{iu}/X$, and the joint inclusion probabilities for a pair of units are given
by

$$ \pi_{iu,iv} = \frac{nP_{iu}P_{iv}\left(P_i - P_{iu} - P_{iv}\right)}{D_i\left(P_i - 2P_{iu}\right)\left(P_i - 2P_{iv}\right)}. \qquad (5.4.8.6) $$
5.4.9 SSS SAMPLING SCHEME

Saxena, Singh, and Srivastava (1986) suggested the following IPPS sampling
strategy. Consider a population of size $N$ with $Y_i$ and $X_i$ as the study and the
auxiliary variable values, respectively, for the $i$th unit. We further assume that
$X_i > 0$ for all units in the population. The following steps make up the IPPS sampling
strategy:

Step I. Select a sample, $s_1$ (say), of size $n$ from the population by simple random
sampling without replacement. Let $s_1^c$ be its complement, that is, $s_1^c = \Omega - s_1$. Perform
independent Bernoulli trials on each unit of $s_1$, with probability of success $P_i$ for
the $i$th unit in the population. Let the number of successes be $r$.

Step II. If $r < n$, select $n - r$ units from $s_1^c$ by simple random sampling without
replacement.

The ultimate sample $s$ will consist of the $r$ units selected at Step I and the $(n-r)$ units
selected at Step II. For this sampling scheme, the first and second order inclusion
probabilities are given by

$$ \pi_i = n\left(NP_i + \sum_{j=1}^{N} Q_j - 1\right)\Big/\left\{N(N-1)\right\} $$

and

$$ \pi_{ij} = \frac{n(n-1)}{N(N-1)}\left[P_iP_j + \frac{\left(P_i + P_j\right)\sum_{k(\neq i,j)=1}^{N} Q_k}{N-2} + \frac{\sum_{k \neq l(\neq i,j)} Q_kQ_l}{(N-2)(N-3)}\right], $$

respectively, where $Q_j = 1 - P_j$. Several other research workers, such as Chaudhuri
(1975a, 1975b), Choudhry and Singh (1979) and Singh (1978), have also suggested
some IPPS sampling schemes. As reported by Deville and Tille (1998), 50 different
sampling procedures have been reviewed by Brewer and Hanif (1983). Interested
readers may refer to Brewer and Hanif (1983) to study in depth the construction of
inclusion probabilities. In addition, Chen (1998) has discussed weighted
polynomial models (WPM) for generating the first order inclusion probabilities.
Deville and Tille (2000) have extended the problem of unequal probability
sampling of fixed size from a finite population to the generalised case of partitioning
a population into subsets with unequal inclusion probabilities in each subset.

5.4.10 OPTIMAL CHOICE OF FIRST ORDER INCLUSION PROBABILITIES

Cassel, Sarndal, and Wretman (1976) have shown that the average variance of any
estimator $\hat{\theta}_s$ satisfying the condition of unbiasedness may be written as

(5.4.10.1)

where, under the model $m$,

(5.4.10.2)

and

$$ B_m(\hat{\theta}_s) = E_m(\hat{\theta}_s) - E_m(\theta) \qquad (5.4.10.3) $$

are respectively the variance and the bias of the estimator $\hat{\theta}_s$. Since $V_m(\theta)$ is
constant, it does not enter into the minimization process. Consider an estimator $\hat{\theta}_s^*$
and a set of inclusion probabilities that yield $E_p(\hat{\theta}_s^*) = \theta$. If $\hat{\theta}_s^*$ minimizes $E_pV_m(\hat{\theta}_s)$
subject to the condition $\sum_s p(s)\hat{\theta}_s = \theta$, then the choice of $\hat{\theta}_s^*$ also minimizes (5.4.10.1)
if $B_m(\hat{\theta}_s^*) = 0$. For example, a model unbiased estimator in the case of a linear
homogeneous class of estimators that satisfies the above criterion is defined as

$$ \hat{Y}_s = \sum_{i \in s} d_i y_i , \qquad (5.4.10.4) $$

which is the well known Horvitz and Thompson estimator. Hajek (1958) obtained
the optimal estimator and design pair for estimating the population mean $\bar{Y}$ and
introduced a general cost constraint, defined as

(5.4.10.5)

to obtain the optimal choice of the first order inclusion probabilities, given by
$\pi_i \propto \sigma_i/\sqrt{c_i}$ under the superpopulation model $m : y_i = a_i + e_i$, where $E_m(e_i) = 0$,
$E(e_i^2) = \sigma_i^2$ and $E(e_ie_j) = 0$. The restriction $B_m(\hat{Y}_s) = 0$ implies that

(5.4.10.6)

Several other researchers have also suggested methods to find the optimal first
order inclusion probabilities. For example, Godambe (1955) obtained results
similar to Hajek (1958). Rao (1975), Cassel, Sarndal, and Wretman (1976), and
Rao and Bellhouse (1978) have also suggested optimal choices of the inclusion
probabilities. Maxmin $\pi$ps sampling designs have been discussed by Hanurav
(1967), Rao and Bayless (1969), Sinha (1973), Chao (1982), Gabler (1984), Herzel
(1986), Chaudhuri and Vos (1988), and Herzel (1993). Comparisons of PPSWR
and Brewer's $\pi$ps WOR procedures have been done by Sampford (1967),
Chaudhuri (1974), Gabler (1981), and Sengupta (1986).

Example 5.4.1. We wish to estimate the total number of fish of all kinds caught by
marine recreational fishermen of the Atlantic and Gulf coasts during 1995.
Population 4 in the Appendix shows that information on the number of different
kinds of fish caught during 1992 is available. Use the known information on the
number of fish caught during 1992 to select a sample of eight units by using
PPSWOR sampling. Collect the required information from population 4 to estimate
the total number of fish caught during 1995. Apply the Sen--Yates--Grundy
estimator of variance to construct a 95% confidence interval.

Solution. Although we discussed several methods to select a sample of required
size by using PPSWOR sampling, let us restrict ourselves to the case of the
Midzuno--Sen sampling scheme. In the Midzuno--Sen method the first unit is
selected with probability proportional to the number of fish caught during 1992 and
the remaining seven units are selected by SRSWOR sampling. We have used
Lahiri's method to select the first unit and a random number table method to select
the remaining seven units. Using the Pseudo-Random Numbers (PRN) given in
Table 1 of the Appendix, we used the first two columns to select a random number
1 ≤ R_i ≤ 69 and columns 7 to 11 to select another random number 1 ≤ R_j ≤ 28933.
We performed the following trials to select one unit by using Lahiri's method.
396 Advanced sampling theory with applications

Trial   1 ≤ R_i ≤ 69   1 ≤ R_j ≤ 28933   No. of fish caught   Decision
number                                   in 1992 (X_i)        (R = Rejected, S = Selected)
  1         58             07572              1045                R
  2         60             21945              5575                R
  3         54             25975              4195                R
  4         01             09266              1467                R
  5         69             15573             12249                R
  6         62             01228             11918                S
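The six trials above follow the acceptance/rejection logic of Lahiri's method: a unit number and a second random number are drawn, and the unit is accepted only when the second number does not exceed its size measure. A minimal sketch (the trial data are copied from the table; everything else is illustrative):

```python
# Lahiri's method: for each trial (unit, R_j, X_i), accept the unit when R_j <= X_i.
# The six (R_i, R_j, X_i) triples below are the trials read from the PRN table.
trials = [(58, 7572, 1045), (60, 21945, 5575), (54, 25975, 4195),
          (1, 9266, 1467), (69, 15573, 12249), (62, 1228, 11918)]

selected = None
for unit, r_j, x_i in trials:
    if r_j <= x_i:        # accept: random number does not exceed the size measure
        selected = unit
        break             # the first acceptance ends the search

print(selected)           # unit 62 ('Summer flounder') is accepted at trial 6
```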
Thus the species 'Summer flounder' at serial number 62 has been selected as the
first unit in the sample. Out of the remaining 68 species, we now have to select
seven units by SRSWOR sampling. We used the first two columns of the Pseudo-
Random Numbers (PRN) given in Table 1 of the Appendix to select seven random
numbers between 1 and 68. These random numbers came in the sequence: 58,
60, 54, 01, 62, 23, 64. Thus the ultimate sample is as shown below.
Table 5.4.1. Sample selected using Midzuno--Sen sampling scheme.

Sample  PRN  Population                       1992    1995
unit    No.  unit        Species group         x_i     y_i       P_i       π_i     y_i/π_i
  1     01      01       Sharks, other        1467    2016   0.005026  0.107450   18762.21
  2     23      23       Blue runner          2371    2319   0.008123  0.110228   21038.21
  3     54      54       Tautog               4195    3816   0.014372  0.115834   32943.70
  4     58      58       Atlantic mackerel    1045    4008   0.003580  0.106153   37756.82
  5     60      60       Spanish mackerel     5575    2568   0.019100  0.120075   21386.63
  6     62      62       Summer flounder     11918   16238   0.040832  0.139569  116343.90
  7     62      63       Gulf flounder         216     163   0.000740  0.103605    1573.28
  8     64      65       Winter flounder      1544    2324   0.005290  0.107686   21581.26
Sum                                                                              271386.01

From population 4 we have the total number of fish caught during 1992,
X = 291882. Thus the selection probabilities P_i = x_i/X have been calculated as
shown in Table 5.4.1. The values of the first order inclusion probabilities based
on the Midzuno--Sen sampling scheme, given by

π_i = ((N − n)/(N − 1)) P_i + (n − 1)/(N − 1) ,

have also been presented in the above table. For example, the value of π_1 is
calculated as

π_1 = ((N − n)/(N − 1)) P_1 + (n − 1)/(N − 1) = ((69 − 8)/(69 − 1)) × 0.005026 + (8 − 1)/(69 − 1) = 0.10745
and so on. The values of the second order inclusion probabilities based on the
Midzuno--Sen sampling scheme, given by

π_ij = ((n − 1)/(N − 1)) [ ((N − n)/(N − 2)) (P_i + P_j) + (n − 2)/(N − 2) ] ,

have been given in Table 5.4.2. For example, the value of π_12 is calculated as

π_12 = ((8 − 1)/(69 − 1)) [ ((69 − 8)/(69 − 2)) (0.005026 + 0.008123) + (8 − 2)/(69 − 2) ] = 0.010451

and so on.
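Both formulas are easy to check numerically; the sketch below (using the P_i values of Table 5.4.1) reproduces π_1 and π_12:

```python
# First and second order inclusion probabilities under the Midzuno--Sen scheme,
# with N = 69, n = 8 and the selection probabilities P_i from Table 5.4.1.
N, n = 69, 8
P = [0.005026, 0.008123, 0.014372, 0.003580, 0.019100,
     0.040832, 0.000740, 0.005290]   # sampled units in table order

def pi_first(p):
    # pi_i = (N - n)/(N - 1) * P_i + (n - 1)/(N - 1)
    return (N - n) / (N - 1) * p + (n - 1) / (N - 1)

def pi_second(p_i, p_j):
    # pi_ij = (n - 1)/(N - 1) * [ (N - n)/(N - 2) * (P_i + P_j) + (n - 2)/(N - 2) ]
    return (n - 1) / (N - 1) * ((N - n) / (N - 2) * (p_i + p_j) + (n - 2) / (N - 2))

print(round(pi_first(P[0]), 6))          # ≈ 0.107450, as in Table 5.4.1
print(round(pi_second(P[0], P[1]), 6))   # ≈ 0.010451, as in Table 5.4.2
```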

Table 5.4.2. Second order inclusion probabilities π_ij for the units selected in the
sample.

i \ j       1         2         3         4         5         6         7
  2     0.010451
  3     0.011037  0.011327
  4     0.010025  0.010315  0.010901
  5     0.011480  0.011770  0.012356  0.011344
  6     0.013517  0.013807  0.014392  0.013381  0.014836
  7     0.009759  0.010049  0.010635  0.009623  0.011078  0.013115
  8     0.010185  0.010476  0.011061  0.010050  0.011505  0.013541  0.009784

The values of the Sen--Yates--Grundy weights, (π_i π_j − π_ij)/π_ij, are given below:

Table 5.4.3. Sen--Yates--Grundy weights.

i \ j       1         2         3         4         5         6         7
  2     0.133289
  3     0.127726  0.127238
  4     0.137742  0.134317  0.127962
  5     0.123894  0.124520  0.125694  0.123590
  6     0.109511  0.114270  0.123288  0.107219  0.129638
  7     0.140723  0.136417  0.128445  0.142822  0.122973  0.102578
  8     0.136022  0.133105  0.127683  0.137442  0.123948  0.109925  0.140346
Using information from Table 5.4.1, the HT estimate of the total number of fish
caught during 1995 is:

Ŷ_HT = Σ_{i∈s} y_i/π_i = 271386.01 .

Now using information from Table 5.4.1 and Table 5.4.3, the Sen--Yates--Grundy
estimator of the variance of the HT estimator of total is given by

v_SYG(Ŷ_HT) = Σ Σ_{i<j∈s} ((π_i π_j − π_ij)/π_ij) (y_i/π_i − y_j/π_j)² .

The above 28 values of ((π_i π_j − π_ij)/π_ij)(y_i/π_i − y_j/π_j)² are given in the
following table.

Table 5.4.4. The values of ((π_i π_j − π_ij)/π_ij)(y_i/π_i − y_j/π_j)² for different
values of i and j.

i \ j        1             2             3            4            5             6            7
  2       690440.4
  3     25687680.6    18035050.2
  4     49697155.8    37543877.8    2964455.1
  5       853303.5       15116.2   16788659.6   33120629.9
  6   1042769336.0  1037919851.0  857526233.2  662159238.9  1168914287.0
  7     41578123.1    51686249.9  126403473.1  186991336.4    48275318.4  1351165614.0
  8      1080852.7       39235.0   16485132.9   35962754.1        4689.3   987107399.0   56182541.0

Thus we have

v_SYG(Ŷ_HT) = Σ Σ_{i<j∈s} ((π_i π_j − π_ij)/π_ij)(y_i/π_i − y_j/π_j)² = 7857648034 .

Note that the (1 − α)100% confidence interval of the population total Y is given by

Ŷ_HT ± t_{α/2}(df = n − 1) √(v_SYG(Ŷ_HT)) .

Using Table 2 from the Appendix, the 95% confidence interval of the total number
of fish caught during 1995 is given by

271386.01 ± 2.365 √7857648034 , or [61744.43, 481027.59] .
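The whole computation of Example 5.4.1 can be reproduced in a few lines; the sketch below uses the rounded P_i and y_i values from Table 5.4.1, so the results agree with the text up to rounding:

```python
import math

# HT estimate, Sen--Yates--Grundy variance estimate, and 95% CI for Example 5.4.1.
N, n = 69, 8
P = [0.005026, 0.008123, 0.014372, 0.003580, 0.019100, 0.040832, 0.000740, 0.005290]
y = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]

pi = [(N - n) / (N - 1) * p + (n - 1) / (N - 1) for p in P]
y_ht = sum(yi / pii for yi, pii in zip(y, pi))           # HT estimate, ~271386.01

v = 0.0
for i in range(n):
    for j in range(i + 1, n):                            # the 28 pairs of Table 5.4.4
        pij = (n - 1) / (N - 1) * ((N - n) / (N - 2) * (P[i] + P[j]) + (n - 2) / (N - 2))
        v += (pi[i] * pi[j] - pij) / pij * (y[i] / pi[i] - y[j] / pi[j]) ** 2

half = 2.365 * math.sqrt(v)                              # t(df = 7) at the 95% level
print(round(y_ht, 2), round(y_ht - half, 2), round(y_ht + half, 2))
```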

The next section is devoted to discussing the calibration approach in sampling
theory.

5.5 CALIBRATION APPROACH

Statisticians are often interested in the precision of survey estimates. The most
commonly used estimator of the population total or population mean is the generalized
linear regression (GREG) estimator. Let us consider the simplest case of the GREG,
where information on only one auxiliary variable is available. Consider a
population Ω = {1, 2, ..., i, ..., N}, from which a probability sample s (s ⊂ Ω) is drawn
with a given sampling design p(·). The inclusion probabilities π_i = P(i ∈ s) and
π_ij = P(i ∈ s, j ∈ s) are assumed to be strictly positive and known. Let y_i be the value
of the study variable, y, for the i-th population unit and let x_i be the value for the i-th
unit of the associated auxiliary variable. The population total X = Σ_{i∈Ω} x_i of the
auxiliary variable x is assumed to be accurately known. The objective is to
estimate the population total Y = Σ_{i∈Ω} y_i. The concept of linear weighting of sample

survey data can be found in Bethlehem and Keller (1987). Deville and Sarndal
(1992) used calibration on the known population total, X, to modify the basic
sampling design weights, d_i = 1/π_i, that appear in the Horvitz and Thompson
(1952) estimator

Ŷ_HT = Σ_{i∈s} d_i y_i .     (5.5.1)

A new estimator

Ŷ_w = Σ_{i∈s} w_i y_i     (5.5.2)

was proposed by Deville and Sarndal (1992), with weights w_i as close as possible
in an average sense to the d_i for a given measure and subject to the calibration
constraint

Σ_{i∈s} w_i x_i = X .     (5.5.3)

Now we have the following theorems:

Theorem 5.5.1. Minimization of the chi square (CS) distance between the new weights
w_i and the selection (or design) weights d_i leads to a general regression type
estimator (GREG) of the population total, Y, given by

Ŷ_G = Σ_{i∈s} d_i y_i + β̂_ds (X − Σ_{i∈s} d_i x_i)     (5.5.4)

where β̂_ds = Σ_{i∈s} d_i q_i x_i y_i / Σ_{i∈s} d_i q_i x_i² .

Proof. Let us define the chi square (CS) type distance function D as

D = Σ_{i∈s} (w_i − d_i)² (d_i q_i)^{-1}     (5.5.5)

where the q_i are suitably chosen constants, so that the resulting estimator depends
upon their choice. The Lagrange function L for minimizing D in (5.5.5) subject to the
constraint in (5.5.3) is then given by

L = Σ_{i∈s} (w_i − d_i)² (d_i q_i)^{-1} − 2λ(Σ_{i∈s} w_i x_i − X) .     (5.5.6)

On differentiating (5.5.6) with respect to w_i and equating to zero we have

w_i = d_i + λ d_i q_i x_i .     (5.5.7)

On substituting (5.5.7) in (5.5.3) and solving for λ we have

λ = (Σ_{i∈s} d_i q_i x_i²)^{-1} (X − Σ_{i∈s} d_i x_i) .     (5.5.8)

On substituting (5.5.8) in (5.5.7) we have

w_i = d_i + (d_i q_i x_i / Σ_{i∈s} d_i q_i x_i²)(X − Σ_{i∈s} d_i x_i) .     (5.5.9)

Substitution of the value of w_i from (5.5.9) in (5.5.2) leads to the general
regression estimator (GREG) of total given by (5.5.4). Hence the theorem.

Corollary 5.5.1. If q_i = 1/x_i, then the optimal weight w_i becomes

w_i = d_i X / Σ_{i∈s} d_i x_i

and the resultant estimator reduces to the ratio estimator of the population total as

Ŷ_R = (Σ_{i∈s} d_i y_i)(X / Σ_{i∈s} d_i x_i) .     (5.5.10)

Remark 5.5.1. Singh, Horn, and Yu (1998) reported that there is no choice of q_i
such that the resultant estimator (5.5.4) reduces to the product estimator of
population total discussed by Cochran (1963).

The main difficulty with the calibrated weights given in (5.5.9) is that they do not
satisfy the desired constraint of being non-negative. Deville and Sarndal
(1992) have considered several distance functions which guarantee the non-
negativity of the weights. Let us discuss a new distance function, which also
guarantees the non-negativity of the weights, in the following theorem.

Theorem 5.5.2. The optimal weights obtained by minimizing the distance function

D = Σ_{i∈s} w_i ln(1/d_i) − Σ_{i∈s} w_i ln(1/w_i)     (5.5.11)

subject to the calibration constraint (5.5.3) are non-negative.

Proof. In this situation, the Lagrange function L is given by

L = Σ_{i∈s} w_i ln(w_i/d_i) − λ(Σ_{i∈s} w_i x_i − X) = Σ_{i∈s} {w_i ln(w_i) − w_i ln(d_i)} − λ{Σ_{i∈s} w_i x_i − X} .     (5.5.12)

On differentiating (5.5.12) with respect to w_i and equating to zero we have

∂L/∂w_i = 1 + ln(w_i) − ln(d_i) − λ x_i = 0 ,     (5.5.13)

which implies

w_i = exp[ln(d_i) + λ x_i − 1] ,     (5.5.14)

where the value of λ can be obtained by solving

Σ_{i∈s} w_i x_i = X .     (5.5.15)

Thus (5.5.14) shows that the calibrated weights are always non-negative if the
distance function (5.5.11) is minimized subject to the calibration constraint (5.5.3).
Hence the theorem.

Thus we conclude that the calibration approach can guarantee non-negative
estimators of the population total, depending upon the choice of the distance
function to be minimized.
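The non-negativity result can be illustrated numerically. In the sketch below (all data values are illustrative, not from the book's example), λ in (5.5.15) is found by simple bisection, and the resulting weights (5.5.14) are strictly positive by construction since each is an exponential of a real number:

```python
import math

# Raking-type calibration per (5.5.11)-(5.5.15): w_i = exp(ln d_i + lam*x_i - 1).
# The d_i, x_i and X below are illustrative values only.
d = [9.3, 9.1, 8.6, 9.4]
x = [2001.0, 5692.0, 2653.0, 4860.0]
X = 150000.0                       # assumed known auxiliary total

def constraint(lam):
    # sum_i w_i(lam) * x_i - X, with w_i from (5.5.14)
    return sum(math.exp(math.log(di) + lam * xi - 1) * xi
               for di, xi in zip(d, x)) - X

lo, hi = -1e-3, 1e-3               # bracket of the root of the constraint
for _ in range(200):               # bisection on lam
    mid = (lo + hi) / 2
    if constraint(lo) * constraint(mid) <= 0:
        hi = mid
    else:
        lo = mid
lam = (lo + hi) / 2
w = [math.exp(math.log(di) + lam * xi - 1) for di, xi in zip(d, x)]
print(all(wi > 0 for wi in w))     # the weights are always positive
```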

Example 5.5.1. Find the calibration weights for the units selected in the sample by
using PPSWOR sampling, making use of the known information about the number
of fish caught during 1994 as an auxiliary variable at the estimation stage.

Sample  PRN  Population                       1992    1994    1995
unit    No.  unit        Species group         x_i   x_{1i}    y_i
  1     01      01       Sharks, other        1467    2001    2016
  2     23      23       Blue runner          2371    5692    2319
  3     54      54       Tautog               4195    2653    3816
  4     58      58       Atlantic mackerel    1045    4860    4008
  5     60      60       Spanish mackerel     5575    3850    2568
  6     62      62       Summer flounder     11918   17741   16238
  7     62      63       Gulf flounder         216     776     163
  8     64      65       Winter flounder      1544    2300    2324

Use the chi square distance function between the design weights and calibrated
weights. Discuss two cases where these weights lead to the ratio and GREG
estimators for estimating the total number of fish during 1995. Deduce the value of
the estimate in each case, given that the sample has been selected using the
Midzuno--Sen scheme of sampling as in Example 5.4.1 by using information
on the number of fish caught during 1992.

Solution. The chi square distance function gives the calibration weights as

w_i = d_i + (d_i q_i x_{1i} / Σ_{i∈s} d_i q_i x_{1i}²)(X_1 − Σ_{i∈s} d_i x_{1i}) .

From population 4 we have the value of X_1 = 341856.

Case I. If q_i = 1/x_{1i} then these weights become w_i = d_i X_1 / Σ_{i∈s} d_i x_{1i} .

,
IES
I,>':@ .._ :i:;';'; "i' -I'
I.; <C;"'AA;A~ ";~,.,::,;"" ;i o o c.
·'i.··i·.W.. ;' I~).lv;~iiu':i
1~~~~4, 1 ;: ; Yi ;i~ . I ;~ ' " : : :dr~I~'
1

'{i'~;i:.: ';1<';; =
l:1:'1J. •
I '';;;:.~ i'
....,. , ' ;i " ;: 1
:- i':; :. ::.: , '- I:F i" ii'
Sharks, other 2001 2016 0.107450 9.306654258 18622:6152 9.73030 19616.3
Blue runner 5692 2319 0.110228 9.072105091 51638.4222 9.48508 21995 .9
Tautog 2653 3816 0.115834 8.633043839 22903.4653 9.02603 34443.3
Atlantic mackerel 4860 4008 0.106 153 9.420364945 45782 .9736 9.84919 39475 .6
Spanish mackerel 3850 2568 0.120075 8.328128253 32063 .2938 8.70723 22360.2
Summer flounder 17741 16238 0.139569 7.1649 14845 127112.7543 7.49107 121640 .0
Gulf flounder 776 163 0.103605 9.652043820 7489.9860 10.09141 1644.9
Winter flounder 2300 2324 0.107686 9.286258195 21358 .3939 9.70898 22563.7
;i ;;
,'iiii;> e...s,i;i;i> '.>';;i Xi,i!>'!x . ." . ", ~111 3.26971,9042 'K '" ,283739.8
Note: To find Jri refer to Example 5.4.1.

Thus we have Σ_{i=1}^{8} d_i x_{1i} = 326971.9042 and the ratio estimate of the total number of
fish caught during 1995 is given by

Ŷ_R = Σ_{i∈s} w_i y_i = 283739.8 .

Case II. If q_i = 1 then these weights become

w_i = d_i + (d_i x_{1i} / Σ_{i∈s} d_i x_{1i}²)(X_1 − Σ_{i∈s} d_i x_{1i})

and are given in the following table:

x_{1i}     y_i      π_i        d_i      d_i x_{1i}    d_i x_{1i}²      w_i      w_i y_i
 2001    2016   0.107450   9.306654    18622.6152     37263852.9   9.397594    18945.5
 5692    2319   0.110228   9.072105    51638.4222    293925899.0   9.324272    21623.0
 2653    3816   0.115834   8.633044    22903.4653     60762893.5   8.744889    33370.5
 4860    4008   0.106153   9.420365    45782.9736    222505251.9   9.643938    38652.9
 3850    2568   0.120075   8.328128    32063.2938    123443681.0   8.484704    21788.7
17741   16238   0.139569   7.164915   127112.7543   2255107373.0   7.785647   126423.3
  776     163   0.103605   9.652044     7489.9860      5812229.1   9.688620     1579.2
 2300    2324   0.107686   9.286258    21358.3938     49124305.8   9.390558    21823.6
Sum                                   326971.9042   3047945487.0              284206.9

From the above table we have

Σ_{i=1}^{8} d_i x_{1i} = 326971.9042 , and Σ_{i=1}^{8} d_i x_{1i}² = 3047945487 .

Thus a GREG estimate of the total number of fish caught during 1995 is

Ŷ_G = Σ_{i∈s} w_i y_i = 284206.90 .
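Both cases of Example 5.5.1 can be verified with a few lines of code, using the design weights d_i = 1/π_i and the 1994/1995 counts from the tables above (results agree with the text up to rounding):

```python
# Chi-square-distance calibration from Example 5.5.1: ratio case (q_i = 1/x_i)
# and GREG case (q_i = 1), with the 1994 counts x1 and the 1995 counts y.
d  = [9.306654258, 9.072105091, 8.633043839, 9.420364945,
      8.328128253, 7.164914845, 9.652043820, 9.286258195]
x1 = [2001, 5692, 2653, 4860, 3850, 17741, 776, 2300]
y  = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]
X1 = 341856                                    # known 1994 total

t_dx = sum(di * xi for di, xi in zip(d, x1))   # ~326971.90

# Case I (q_i = 1/x_i): w_i = d_i * X1 / sum(d_i x_i)  ->  ratio estimator
w_ratio = [di * X1 / t_dx for di in d]
y_ratio = sum(wi * yi for wi, yi in zip(w_ratio, y))   # ~283739.8

# Case II (q_i = 1): w_i = d_i + (d_i x_i / sum(d_i x_i^2)) * (X1 - sum(d_i x_i))
t_dx2 = sum(di * xi * xi for di, xi in zip(d, x1))     # ~3047945487
w_greg = [di + di * xi / t_dx2 * (X1 - t_dx) for di, xi in zip(d, x1)]
y_greg = sum(wi * yi for wi, yi in zip(w_greg, y))     # ~284206.9

print(round(y_ratio, 1), round(y_greg, 1))
```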

Now the question arises of how the calibration can be done if there are two or more
auxiliary variables. To answer this question we have the following theorem.

Theorem 5.5.3. Suppose X_1 and X_2 are the known totals of two auxiliary
characters x_{1i} and x_{2i}, for i = 1, 2, ..., N. The minimization of the CS distance
function (5.5.5) subject to the two linear calibration constraints given by

Σ_{i∈s} w_i x_{1i} = X_1     (5.5.16)

and

Σ_{i∈s} w_i x_{2i} = X_2     (5.5.17)

leads to the regression type estimator

Ŷ_G = Σ_{i∈s} d_i y_i + β̂_1(X_1 − Σ_{i∈s} d_i x_{1i}) + β̂_2(X_2 − Σ_{i∈s} d_i x_{2i})     (5.5.18)

where β̂_1 and β̂_2 have their usual meanings.

Proof. In this case the Lagrange function L is given by

L = Σ_{i∈s} (w_i − d_i)²(d_i q_i)^{-1} − 2λ_1(Σ_{i∈s} w_i x_{1i} − X_1) − 2λ_2(Σ_{i∈s} w_i x_{2i} − X_2) .     (5.5.19)

On differentiating (5.5.19) with respect to w_i and equating to zero we have

w_i = d_i + λ_1 d_i q_i x_{1i} + λ_2 d_i q_i x_{2i} .     (5.5.20)

On substituting (5.5.20) in (5.5.16) and (5.5.17), respectively, we have

λ_1(Σ_{i∈s} d_i q_i x_{1i}²) + λ_2(Σ_{i∈s} d_i q_i x_{1i} x_{2i}) = X_1 − Σ_{i∈s} d_i x_{1i}     (5.5.21)

and

λ_1(Σ_{i∈s} d_i q_i x_{1i} x_{2i}) + λ_2(Σ_{i∈s} d_i q_i x_{2i}²) = X_2 − Σ_{i∈s} d_i x_{2i} .     (5.5.22)

The system of equations given by (5.5.21) and (5.5.22) can be written as

[ Σ_{i∈s} d_i q_i x_{1i}² ,      Σ_{i∈s} d_i q_i x_{1i} x_{2i} ] [ λ_1 ]   [ X_1 − Σ_{i∈s} d_i x_{1i} ]
[ Σ_{i∈s} d_i q_i x_{2i} x_{1i} , Σ_{i∈s} d_i q_i x_{2i}²      ] [ λ_2 ] = [ X_2 − Σ_{i∈s} d_i x_{2i} ]     (5.5.23)

and solving it for λ_1 and λ_2 and substituting in (5.5.20) gives the calibrated weights

w_i = d_i + d_i q_i (λ_1 x_{1i} + λ_2 x_{2i}) .     (5.5.24)

On inserting this value of w_i in (5.5.2) we obtain

Ŷ_G = Σ_{i∈s} d_i y_i + β̂_1(X_1 − Σ_{i∈s} d_i x_{1i}) + β̂_2(X_2 − Σ_{i∈s} d_i x_{2i})     (5.5.25)

where β̂_1 and β̂_2 are the partial regression coefficients obtained from the solution
of (5.5.23). Hence the theorem.

Example 5.5.2. Find the calibration weights for the units selected in the sample by
using PPSWOR sampling, making use of the known information about the number
of fish caught during 1993 and 1994 as auxiliary variables. Use the chi square
distance function between the design weights and calibrated weights. Discuss the
case where these weights lead to the regression estimator for estimating the total
number of fish during 1995. Deduce the value of the estimate.

Sample selected using Midzuno--Sen sampling scheme using 1992 data

Sample  PRN  Population                       1992    1993    1994    1995
unit    No.  unit        Species group         x_i   x_{2i}  x_{1i}    y_i
  1     01      01       Sharks, other        1467    1385    2001    2016
  2     23      23       Blue runner          2371    3800    5692    2319
  3     54      54       Tautog               4195    4215    2653    3816
  4     58      58       Atlantic mackerel    1045    2307    4860    4008
  5     60      60       Spanish mackerel     5575    3653    3850    2568
  6     62      62       Summer flounder     11918   22919   17741   16238
  7     62      63       Gulf flounder         216     189     776     163
  8     64      65       Winter flounder      1544    3582    2300    2324
Solution. If q_i = 1 ∀ i, the calibration weights in the case of two auxiliary variables
become

w_i = d_i + d_i { x_{1i}[(Σ d_i x_{2i}²)(X_1 − Σ d_i x_{1i}) − (Σ d_i x_{1i} x_{2i})(X_2 − Σ d_i x_{2i})]
             + x_{2i}[(Σ d_i x_{1i}²)(X_2 − Σ d_i x_{2i}) − (Σ d_i x_{1i} x_{2i})(X_1 − Σ d_i x_{1i})] }
             / [(Σ d_i x_{1i}²)(Σ d_i x_{2i}²) − (Σ d_i x_{1i} x_{2i})²] ,

where all sums are over i ∈ s. To calculate these weights we proceed as follows:

   π_i        d_i     d_i x_{1i}  d_i x_{2i}   d_i x_{1i}²    d_i x_{2i}²   d_i x_{1i} x_{2i}
0.107450   9.306654    18622.62    12889.72     37263853.0     17852256.9      25792322.0
0.110228   9.072105    51638.42    34474.00    293925899.0    131001197.5     196226004.3
0.115834   8.633044    22903.47    36388.28     60762893.5    153376599.3      96538106.3
0.106153   9.420365    45782.97    21732.78    222505252.0     50137527.9     105621320.2
0.120075   8.328128    32063.29    30422.65    123443681.0    111133949.6     117127212.2
0.139569   7.164915   127112.75   164212.70   2255107373.0   3763590489.0    2913297215.0
0.103605   9.652044     7489.99     1824.24      5812229.1       344780.6       1415607.3
0.107686   9.286258    21358.39    33263.38     49124305.9    119149415.9      76505766.8
Sum                   326971.90   335207.70   3047945487.0   4346586217.0    3532523554.0

Note: To find π_i refer to Example 5.4.1.

Thus the calibrated weights w_i and the products w_i y_i are as follows:

     w_i        w_i y_i
10.62613538    21422.3
12.91002878    29938.4
 7.35385154    28062.3
14.07120429    56397.4
 9.43691815    24234.0
 5.39870687    87664.2
10.65755444     1737.2
 8.18806195    19029.1
Sum           268484.8

Hence an estimate of the total number of fish caught during 1995 is given by

Ŷ_G = Σ_{i∈s} w_i y_i = 268484.80 .
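The two-constraint system (5.5.23) can also be solved directly. In the sketch below the sample data are those of Example 5.5.2, but the 1993 total X2 is an assumed illustrative value (the book reads the true total from Population 4 in the Appendix), so only the calibration constraints themselves are checked:

```python
# Two-constraint chi-square calibration (Theorem 5.5.3) with q_i = 1, solving
# the 2x2 system (5.5.23) by Cramer's rule.
d  = [9.306654258, 9.072105091, 8.633043839, 9.420364945,
      8.328128253, 7.164914845, 9.652043820, 9.286258195]
x1 = [2001, 5692, 2653, 4860, 3850, 17741, 776, 2300]   # 1994 counts
x2 = [1385, 3800, 4215, 2307, 3653, 22919, 189, 3582]   # 1993 counts
y  = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]
X1 = 341856.0               # known 1994 total
X2 = 316743.0               # 1993 total: ASSUMED value, for illustration only

a11 = sum(di * u * u for di, u in zip(d, x1))
a22 = sum(di * v * v for di, v in zip(d, x2))
a12 = sum(di * u * v for di, u, v in zip(d, x1, x2))
c1 = X1 - sum(di * u for di, u in zip(d, x1))
c2 = X2 - sum(di * v for di, v in zip(d, x2))

det = a11 * a22 - a12 * a12            # determinant of the system (5.5.23)
lam1 = (c1 * a22 - c2 * a12) / det
lam2 = (c2 * a11 - c1 * a12) / det

w = [di + di * (lam1 * u + lam2 * v) for di, u, v in zip(d, x1, x2)]
y_greg2 = sum(wi * yi for wi, yi in zip(w, y))
print(round(y_greg2, 1))               # estimate under the assumed X2
```

By construction the weights satisfy both constraints (5.5.16) and (5.5.17) exactly, whatever totals are supplied.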

Remark 5.5.2: Note that there is no choice of the weights q_i such that the estimator
Ŷ_G in (5.5.25) will reduce to a multivariate ratio type estimator with two auxiliary
variables.

Now if there are p auxiliary variables x_{ji}, j = 1, 2, ..., p, and the population totals
X_j = Σ_{i∈Ω} x_{ji} are known, then we have to minimize the CS type distance function
(5.5.5) subject to the p calibration constraints:

Σ_{i∈s} w_i x_{ji} = X_j , for j = 1, 2, ..., p.     (5.5.26)
Under such situations, minimizing the Lagrange function yields the system of
linear equations

[ Σ d_i q_i x_{1i}² ,      Σ d_i q_i x_{1i} x_{2i} , ... , Σ d_i q_i x_{1i} x_{pi} ] [ λ_1 ]   [ X_1 − Σ d_i x_{1i} ]
[ Σ d_i q_i x_{2i} x_{1i} , Σ d_i q_i x_{2i}² ,      ... , Σ d_i q_i x_{2i} x_{pi} ] [ λ_2 ] = [ X_2 − Σ d_i x_{2i} ]     (5.5.27)
[ ........................................................................... ] [ ... ]   [ .................. ]
[ Σ d_i q_i x_{pi} x_{1i} , Σ d_i q_i x_{pi} x_{2i} , ... , Σ d_i q_i x_{pi}²      ] [ λ_p ]   [ X_p − Σ d_i x_{pi} ]

where all sums are over i ∈ s. If the system of equations given by (5.5.27) is
expressed as

AΛ = C     (5.5.28)

then its solution is given by

Λ = A^{-1} C .     (5.5.29)

Then we have the following theorem.

Theorem 5.5.4. The optimal weights in the case of p auxiliary variables are given by

w_i = d_i + d_i q_i Λ'x_i     (5.5.30)

where x_i denotes the i-th column of the matrix X = (x_{ji})_{p×n}.

The resultant estimator of the population total in the presence of p auxiliary
variables is given by

Ŷ_G = Σ_{i∈s} d_i y_i + Σ_{i∈s} d_i q_i y_i Λ'x_i .     (5.5.31)
Now we shall study some properties of the above estimator in the following
theorems.

Theorem 5.5.5. The variance of the estimator Ŷ_G based on p auxiliary characters is

V(Ŷ_G) = V(Ŷ_HT)(1 − R²_{y·x_1,x_2,...,x_p}) ,     (5.5.32)

where R²_{y·x_1,x_2,...,x_p} denotes the squared multiple correlation coefficient of y on
x_1, x_2, ..., x_p.

Theorem 5.5.6. The Sen--Yates--Grundy form of the variance of the estimator Ŷ_G
can also be written as

V_2(Ŷ_G) = (1/2) Σ Σ_{i≠j∈Ω} (π_i π_j − π_ij)(d_i e_i − d_j e_j)²     (5.5.33)

where e_i = y_i − β'x_i denotes the error term in the model m.

Theorem 5.5.7. Another form of the variance of the estimator Ŷ_G is given by

V_3(Ŷ_G) = Σ_{i∈Ω} Σ_{j∈Ω} (π_ij − π_i π_j)(e_i/π_i)(e_j/π_j) .     (5.5.34)

In the following section we shall discuss the calibration of the estimator of the
variance of the HT estimator of the population total, Y, defined as

Ŷ_HT = Σ_{i∈s} d_i y_i ,

and then, in section 5.7, the estimation of the variance of the GREG defined as

Ŷ_G = Σ_{i∈s} d_i y_i + (Σ_{i∈s} d_i q_i x_i y_i / Σ_{i∈s} d_i q_i x_i²)(X − Σ_{i∈s} d_i x_i) .

Remark 5.5.3: The drawback of the above approach is that it may produce negative
or very large calibrated weights for some units. This can lead to unstable parameter
estimates in some domains. We may also produce implausible estimates, such as a
negative total for a variable which is strictly positive. In order to alleviate these
problems we would like to introduce restrictions on the values that the calibrated
weights can take. Suppose we specify a lower bound l_i and an upper bound u_i for
each unit i ∈ s, where the bounds may differ from unit to unit. Thus the above
problem can be restated as

Minimize: Σ_{i∈s} (w_i − d_i)²(d_i q_i)^{-1}

subject to

Σ_{i∈s} w_i x_i = X

and

l_i ≤ w_i ≤ u_i for i ∈ s .

This is now a quadratic programming problem with equality and inequality
constraints. The solution to this problem is more complex because of the presence
of inequality constraints, and there is no closed form expression for the solution. For
details one can refer to Estevao (1994). Singh and Mohl (1996) have also tried to
understand calibration approaches and to explain them to others in a simple way.

5.6 CALIBRATED ESTIMATOR OF THE VARIANCE OF THE ESTIMATOR OF POPULATION TOTAL

The Sen--Yates--Grundy (1953) form of the variance of the estimator Ŷ_HT for a
fixed sample size is given by

V_SYG(Ŷ_HT) = (1/2) Σ Σ_{i≠j∈Ω} (π_i π_j − π_ij)(y_i/π_i − y_j/π_j)²     (5.6.1)

and its unbiased estimator is given by

v_SYG(Ŷ_HT) = (1/2) Σ Σ_{i≠j∈s} D_ij (d_i y_i − d_j y_j)²     (5.6.2)

where D_ij = (π_i π_j − π_ij)/π_ij denotes the design weight. Singh, Horn, Chowdhury,
and Yu (1999) consider an estimator of the variance of the Horvitz and Thompson
(1952) estimator

v_1(Ŷ_HT) = (1/2) Σ Σ_{i≠j∈s} w_ij (d_i y_i − d_j y_j)²     (5.6.3)

where the w_ij are modified weights, as close as possible in an average sense to the
D_ij for a given measure and subject to the calibration constraint

(1/2) Σ Σ_{i≠j∈s} w_ij (d_i x_i − d_j x_j)² = V_SYG(X̂_HT) ,     (5.6.4)

where V_SYG(X̂_HT) is the known variance of the Horvitz and Thompson (1952)
estimator of the auxiliary total X = Σ_{i∈Ω} x_i given by X̂_HT = Σ_{i∈s} d_i x_i. The idea of
adjusting the denominator of the weights D_ij has also been discussed by Fuller (1970),
but his method has the limitation of not guaranteeing the non-negativity of the
estimates of variance.

For simplicity they restricted themselves to the two dimensional CS type distance
D between the two n × n grids formed by the weights w_ij and D_ij for i, j = 1, 2, ..., n,
defined as

D = (1/2) Σ Σ_{i≠j∈s} (w_ij − D_ij)² (D_ij Q_ij)^{-1} .     (5.6.5)

In most situations Q_ij = 1, but other types of weights can also be used. It will be
shown that the ratio type estimator proposed by Isaki (1983) is a special case for a
particular choice of Q_ij. Minimization of (5.6.5) subject to (5.6.4) leads to the
modified optimal weights

Wij=Dij I{ DijQij(diXi - d j x)
} [ (A) 1
VSYG X HT - - .2: ,2:Dij\diXi-djxj}
{ \2] . ( 6)
1 (\4 2 l'T'j E S 5.6.
'2,2:,2: DijQij\diXi - dj xj)
' *jES

Substitution of w_ij from (5.6.6) in (5.6.3) leads to the following regression type
estimator:

v_1(Ŷ_HT) = v_SYG(Ŷ_HT) + B̂ [ V_SYG(X̂_HT) − v_SYG(X̂_HT) ]     (5.6.7)

where

B̂ = Σ Σ_{i≠j∈s} D_ij Q_ij (d_i y_i − d_j y_j)²(d_i x_i − d_j x_j)² / Σ Σ_{i≠j∈s} D_ij Q_ij (d_i x_i − d_j x_j)⁴ = μ̂_22/μ̂_04     (5.6.8)

and

v_SYG(X̂_HT) = (1/2) Σ Σ_{i≠j∈s} D_ij (d_i x_i − d_j x_j)² .     (5.6.9)
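The mechanics of (5.6.7)-(5.6.9) with Q_ij = 1 can be sketched on the Example 5.4.1 sample. The true design variance V_SYG(X̂_HT) must be known from the frame; here it is set to an assumed illustrative value, so the code only demonstrates the computation, not the book's numbers:

```python
# Regression-type calibrated variance estimator (5.6.7)-(5.6.9) with Q_ij = 1.
N, n = 69, 8
P = [0.005026, 0.008123, 0.014372, 0.003580, 0.019100, 0.040832, 0.000740, 0.005290]
x = [1467, 2371, 4195, 1045, 5575, 11918, 216, 1544]    # 1992 counts (size measure)
y = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]    # 1995 counts

pi = [(N - n) / (N - 1) * p + (n - 1) / (N - 1) for p in P]
d = [1 / p for p in pi]

v_y = v_x = mu22 = mu04 = 0.0
for i in range(n):
    for j in range(i + 1, n):
        pij = (n - 1) / (N - 1) * ((N - n) / (N - 2) * (P[i] + P[j]) + (n - 2) / (N - 2))
        D = (pi[i] * pi[j] - pij) / pij
        gy = (d[i] * y[i] - d[j] * y[j]) ** 2
        gx = (d[i] * x[i] - d[j] * x[j]) ** 2
        v_y += D * gy             # v_SYG(Y_HT), eq. (5.6.2)
        v_x += D * gx             # v_SYG(X_HT), eq. (5.6.9)
        mu22 += D * gy * gx
        mu04 += D * gx * gx

B = mu22 / mu04                   # eq. (5.6.8)
V_X_KNOWN = 1.1 * v_x             # ASSUMED known design variance of X_HT
v1 = v_y + B * (V_X_KNOWN - v_x)  # eq. (5.6.7)
print(round(v_y), round(v1))
```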

The leading term of the mean squared error of the regression type estimator (5.6.7)
is

MSE[v_1(Ŷ_HT)] = V[v_SYG(Ŷ_HT)] + B²V[v_SYG(X̂_HT)] − 2B Cov[v_SYG(Ŷ_HT), v_SYG(X̂_HT)]     (5.6.10)

where

B = Σ Σ_{i≠j∈Ω} Q_ij (π_i π_j − π_ij)(d_i y_i − d_j y_j)²(d_i x_i − d_j x_j)² / Σ Σ_{i≠j∈Ω} Q_ij (π_i π_j − π_ij)(d_i x_i − d_j x_j)⁴

and the variance and covariance terms involve inclusion probabilities up to the
fourth order.

Here π_ijkl denotes the positive probability of including four units in the sample, i.e.,
π_ijkl = P(i, j, k, l ∈ s).

Expression (5.6.10) shows that the estimator v_1(Ŷ_HT) of variance is not always more
efficient than the estimator given by Sen--Yates--Grundy (1953), but the estimator
(5.6.7) is consistent because the ratio of the modified weights to the original weights,
i.e., w_ij/D_ij, converges in design probability to unity. This condition of consistency is
an analogue of the condition given by Sarndal, Swensson, and Wretman (1989).
Ramakrishnan (1975a) has also modified the Sen--Yates--Grundy (1953) estimator to
suit varying sample size designs. It is also possible to calibrate the denominator
d_ij = π_ij^{-1} of D_ij similarly to Fuller (1970), but this has the limitation of not
guaranteeing the non-negativity of estimates of variance. Sitter and Wu (2002) claim
that the two-dimensional chi square distance function can take negative values, and
hence calibrate only the design weights d_ij = π_ij^{-1}. Note that the calibration of
d_ij = π_ij^{-1} will not guarantee a non-negative estimate of variance, but the
calibration of D_ij can guarantee the non-negativity of the variance estimate if
calibrated under bounded conditions by using quadratic programming. Further note
that a distance can be negative, but the magnitude of a distance cannot be negative.
Stukel, Hidiroglou, and Sarndal (1996) have attempted to compare Jackknife and
linearization forms of the estimators of variance.

Here we would like to discuss the cases where the estimator v_1(Ŷ_HT) reduces to
the usual estimators under simple designs, as follows.

Case I. Under SRSWOR, π_i = π_j = n/N, π_ij = n(n − 1)/{N(N − 1)}, and if Q_ij = 1 then
(5.6.7) becomes
v_1(Ŷ_srswor) = (N²(1 − f)/n) [ s_y² + b̂(S_x² − s_x²) ] ,     (5.6.11)

where Ŷ_srswor denotes the estimator of the population total under the SRSWOR design,

s_y² = (1/(2n(n − 1))) Σ_{i=1}^{n} Σ_{j=1}^{n} (y_i − y_j)²  is an unbiased estimator of S_y² ,

s_x² = (1/(2n(n − 1))) Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i − x_j)²  is an unbiased estimator of S_x² ,

and

b̂ = μ̂_22/μ̂_04

where

μ̂_22 = (N⁴(1 − f)/(n⁴(n − 1))) Σ_{i=1}^{n} Σ_{j=1}^{n} (y_i − y_j)²(x_i − x_j)² , and
μ̂_04 = (N⁴(1 − f)/(n⁴(n − 1))) Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i − x_j)⁴ .

The quantity s_y² + b̂(S_x² − s_x²) is a regression type estimator of the finite
population variance S_y² as proposed by Isaki (1983).

Case II. For an IPPS sampling scheme for which π_i = n p_i and π_ij is a second order
inclusion probability such that Σ_{i∈Ω} π_i = n and Σ_{j(≠i)∈Ω} π_ij = (n − 1)π_i, the Horvitz
and Thompson (1952) estimator becomes

Ŷ_HT = (1/n) Σ_{i=1}^{n} y_i/p_i

with

v_1(Ŷ_HT) = v_y(Ŷ_HT) + b̂ [ V_x(X̂_HT) − v_x(X̂_HT) ]     (5.6.12)

where

v_y(Ŷ_HT) = (1/(2n²)) Σ Σ_{i≠j∈s} k_ij (y_i/p_i − y_j/p_j)² ,
v_x(X̂_HT) = (1/(2n²)) Σ Σ_{i≠j∈s} k_ij (x_i/p_i − x_j/p_j)²

with

k_ij = n² p_i p_j / π_ij − 1 ,  b̂ = μ̂_22/μ̂_04

where

μ̂_22 = Σ Σ_{i≠j∈s} k_ij (y_i/p_i − y_j/p_j)²(x_i/p_i − x_j/p_j)² , and
μ̂_04 = Σ Σ_{i≠j∈s} k_ij (x_i/p_i − x_j/p_j)⁴

have their usual meanings under IPPS sampling.

Case III. If Q_ij = (d_i x_i − d_j x_j)^{-2} then we have the ratio type estimator given by

v_1(Ŷ_HT)_Ratio = v_SYG(Ŷ_HT) [ V_SYG(X̂_HT) / v_SYG(X̂_HT) ] .     (5.6.13)

Under SRSWOR sampling,

v_1(Ŷ_HT)_Ratio / { N²(1 − f)/n }

leads to the ratio type estimator

s_I² = s_y² (S_x²/s_x²)

proposed by Isaki (1983).

5.7 ESTIMATION OF VARIANCE OF GREG

Following Sarndal, Swensson, and Wretman (1989), Deville and Sarndal (1992),
Sarndal (1996), and Rao (1997), the GREG can be written as

Ŷ_G = Σ_{i∈s} e_i/π_i + β̂_ds X     (5.7.1)

and the Sen--Yates--Grundy (1953) form of the estimator of the variance of the GREG is

v_SYG(Ŷ_G) = (1/2) Σ Σ_{i≠j∈s} D_ij (w_i e_i − w_j e_j)²     (5.7.2)

where D_ij = (π_i π_j − π_ij)/π_ij, i ≠ j, and e_i = y_i − β̂_ds x_i. This estimator can easily be
written as

v_SYG(Ŷ_G) = (1/2) Σ Σ_{i≠j∈s} D_ij (d_i e_i − d_j e_j)² + v̂_11(X − Σ_{i=1}^{n} d_i x_i) + v̂_12(X − Σ_{i=1}^{n} d_i x_i)²     (5.7.3)

where

v̂_11 = {Σ_{i∈s} d_i q_i x_i²}^{-1} Σ Σ_{i≠j∈s} D_ij (d_i e_i − d_j e_j)(d_i q_i x_i e_i − d_j q_j x_j e_j)

and

v̂_12 = 0.5 {Σ_{i∈s} d_i q_i x_i²}^{-2} Σ Σ_{i≠j∈s} D_ij (d_i q_i x_i e_i − d_j q_j x_j e_j)² .

The estimator in (5.7.2) covers a variety of estimators of variance. Let us consider
the SRSWOR design, i.e., π_i = π_j = n/N and π_ij = n(n − 1)/{N(N − 1)}.

Then we have the following cases:

Case I. If q_i = 1, then Ŷ_G reduces to the usual regression estimator of total. Now
if w_i = d_i in (5.7.2), it reduces to

v(Ŷ_G) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i²     (5.7.4)

which is the usual estimator of the variance of the regression estimator.

Case II. If q_i = 1/x_i then the estimator Ŷ_G reduces to the ratio estimator Ŷ_R of the
population total. The estimator of variance (5.7.2) reduces to

v(Ŷ_R) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i² { X̄/x̄ }²     (5.7.5)

where e_i = y_i − (ȳ/x̄)x_i and x̄ = (1/n) Σ_{i=1}^{n} x_i. The estimator at (5.7.5) is a special case of
a class of estimators of the variance of the ratio estimator proposed by Wu (1982) as

v_w(Ŷ_R) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i² { X̄/x̄ }^g  for g = 2.     (5.7.6)

Case III. If q_i = 1 and the calibrated weights w_i are retained, then the estimator
(5.7.2) becomes

v_SYG(Ŷ_G) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i² + ψ̂_1(X − X̂) + ψ̂_2(X − X̂)²     (5.7.7)

where X̂ = (N/n) Σ_{i=1}^{n} x_i ,

ψ̂_1 = ((N − n)/(n(n − 1))) Σ Σ_{i≠j∈s} (e_i − e_j)(x_i e_i − x_j e_j) / Σ_{i=1}^{n} x_i² ,

and

ψ̂_2 = ((N − n)/(2N(n − 1))) Σ Σ_{i≠j∈s} (x_i e_i − x_j e_j)² / (Σ_{i=1}^{n} x_i²)² .
Deng and Wu (1987) have defined a general class of estimators of the variance of
the regression estimator:

v_DW(Ŷ_G) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i² { X̄/x̄ }^g .     (5.7.8)

The linear form of the class of estimators (5.7.8) takes the form

v_DW(Ŷ_G) = (N²(1 − f)/(n(n − 1))) Σ_{i=1}^{n} e_i² [ 1 + g(X̄/x̄ − 1) + (g(g − 1)/2)(X̄/x̄ − 1)² + .... ]     (5.7.9)

which is again similar to (5.7.3).
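A sketch of the Deng--Wu class under SRSWOR, treating the eight sampled species as if they were drawn by SRSWOR purely for illustration. The residuals e_i here are ordinary least squares regression residuals, and g = 0, 1, 2 give three members of the class (g = 2 being Wu's choice for the ratio estimator):

```python
# Deng and Wu's (1987) class (5.7.8) under SRSWOR for several values of g.
N, n = 69, 8
x = [2001, 5692, 2653, 4860, 3850, 17741, 776, 2300]   # 1994 counts
y = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]   # 1995 counts
X_bar = 341856 / N                                     # known population mean of x

x_bar = sum(x) / n
y_bar = sum(y) / n
# least squares slope and residuals e_i = y_i - y_bar - b*(x_i - x_bar)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
e = [yi - y_bar - b * (xi - x_bar) for xi, yi in zip(x, y)]

f = n / N
base = N ** 2 * (1 - f) / (n * (n - 1)) * sum(ei ** 2 for ei in e)
v_dw = {g: base * (X_bar / x_bar) ** g for g in (0, 1, 2)}
print(round(v_dw[0]), round(v_dw[1]), round(v_dw[2]))
```

Since X̄/x̄ is close to one for this sample, the three estimates differ only mildly, which is the point of the asymptotic equivalence discussed next.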

Following Singh, Horn, and Yu (1998), the estimators of variance of the estimators
of total considered so far belong to the low level calibration approach. The
estimators studied by Chaudhuri and Mitra (1992) are also special cases of the low
level calibration approach. As noted earlier, there is no choice of q_i which reduces
\hat{Y}_G to the product method of estimation considered by Cochran (1963); thus the
estimation of the variance of the product estimator has not been discussed here. To
discuss the efficiency of such estimators, consider an analogue of the general class
of estimators of the variance of the GREG, following Srivastava (1971), given as

v_s(\hat{Y}_G) = \frac{N^2(1-f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 \, H\!\left(\frac{\bar{x}}{\bar{X}}\right)   (5.7.10)

where H(\cdot) is a parametric function such that H(1) = 1, satisfying certain regularity
conditions defined earlier in Chapter 3. Following Srivastava (1971), it is easy to
see that analogues of the general class of estimators (5.7.10) attain the minimum
variance of the class of estimators proposed by Deng and Wu (1987) for the regression
estimator and by Wu (1982) for the ratio estimator. If we attach any function of the
ratio \bar{x}/\bar{X} to the usual estimator of variance \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n} e_i^2, as in
(5.7.10), the asymptotic variance of the resultant estimator remains the same. In
other words, the efficiency of the estimators of variance of the GREG estimator of
total obtained through low level calibration remains less than or equal to that of
the class of estimators proposed by Wu (1982) and Deng and Wu (1987). The
weights w_i used to construct the estimator of variance of the GREG in (5.7.2) were
obtained while estimating the population total, and hence are named low level
calibration weights for variance estimation.
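As a quick numerical illustration of this family (a hypothetical sketch, not code from the book: the data, population size N, and known mean \bar{X} below are invented), the following computes v_0 = N^2(1-f)/\{n(n-1)\}\sum e_i^2 under SRSWOR together with the Deng--Wu type adjustment v_0(\bar{x}/\bar{X})^g; the choice g = 0 returns v_0 unchanged.

```python
import random

def deng_wu_variance(e, x, X_bar, N, g=0.0):
    # v0 = N^2 (1 - f) / (n (n - 1)) * sum(e_i^2), with f = n/N (SRSWOR)
    n = len(e)
    f = n / N
    v0 = N**2 * (1 - f) / (n * (n - 1)) * sum(ei**2 for ei in e)
    x_bar = sum(x) / len(x)
    # attach the ratio (xbar / Xbar)^g; g = 0 leaves v0 unchanged
    return v0 * (x_bar / X_bar) ** g

random.seed(7)
x = [random.uniform(10, 50) for _ in range(12)]   # hypothetical auxiliary values
e = [random.gauss(0, 2) for _ in range(12)]       # hypothetical residuals
v0 = deng_wu_variance(e, x, X_bar=30.0, N=120, g=0.0)
vg = deng_wu_variance(e, x, X_bar=30.0, N=120, g=2.0)
```

For g = 2 the adjustment simply rescales v_0 by (\bar{x}/\bar{X})^2, which is the low level calibration form used for the ratio estimator later in this chapter.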
Chapter 5: Use of auxiliary information: PPSWOR Sampling 415

Example 5.7.1. We wish to estimate the total number of fish of all kinds caught by
recreational fishermen on the Atlantic and Gulf coasts during 1995. Population 4 in
the Appendix shows that information on the number of different kinds of fish
caught during 1992 and 1994 is available. Use the known information on the
number of fish caught during 1992 to select a sample of eight units by using
PPSWOR sampling. Collect the required information from population 4 to estimate
the total number of fish caught during 1995 using a regression estimator, taking the
1994 counts as the auxiliary variable. Apply the Sen--Yates--Grundy form of the
estimator of variance of the regression estimator to construct a 95% confidence
interval. Also use the estimator of variance of the GREG proposed by Särndal,
Swensson, and Wretman (1989).

Solution. Again referring to Example 5.4.1, the ultimate sample is as shown below.

Sample selected using Midzuno--Sen's sampling scheme:

Sl. | Random | Pop.  | Species group     | 1992  | 1994, x_i | 1995, y_i
No. | number | unit  |                   |       |           |
 1  |   01   |  01   | Sharks, other     |  1467 |   2001    |   2016
 2  |   23   |  23   | Blue runner       |  2371 |   5692    |   2319
 3  |   54   |  54   | Tautog            |  4195 |   2653    |   3816
 4  |   58   |  58   | Atlantic mackerel |  1045 |   4860    |   4008
 5  |   60   |  60   | Spanish mackerel  |  5575 |   3850    |   2568
 6  |   62   |  62   | Summer flounder   | 11918 |  17741    |  16238
 7  |   62   |  63   | Gulf flounder     |   216 |    776    |    163
 8  |   64   |  65   | Winter flounder   |  1544 |   2300    |   2324

As before, the Sen--Yates--Grundy weights (\pi_i\pi_j - \pi_{ij})/\pi_{ij}, based on the first and
second order inclusion probabilities, are:

i \ j |    1     |    2     |    3     |    4     |    5     |    6     |    7
  2   | 0.133289
  3   | 0.127726 | 0.127238
  4   | 0.137742 | 0.134317 | 0.127962
  5   | 0.123894 | 0.124520 | 0.125694 | 0.123590
  6   | 0.109511 | 0.114270 | 0.123288 | 0.107219 | 0.129638
  7   | 0.140723 | 0.136417 | 0.128445 | 0.142822 | 0.122973 | 0.102578
  8   | 0.136022 | 0.133105 | 0.127683 | 0.137442 | 0.123948 | 0.109925 | 0.140346

If q_i = 1 then these weights become

w_i = d_i + \frac{d_i x_i}{\sum_{i\in s} d_i x_i^2}\left( X - \sum_{i\in s} d_i x_i \right)

and are given in the following table:

x_i   | y_i   | \pi_i    | d_i = \pi_i^{-1} | d_i x_i     | d_i x_i^2     | w_i      | w_i y_i
2001  | 2016  | 0.107450 | 9.306654 | 18622.61517  |   37263852.9 | 9.397594 |  18945.5
5692  | 2319  | 0.110228 | 9.072105 | 51638.42218  |  293925899.0 | 9.324272 |  21622.9
2653  | 3816  | 0.115834 | 8.633044 | 22903.46530  |   60762893.4 | 8.744889 |  33370.5
4860  | 4008  | 0.106153 | 9.420365 | 45782.97363  |  222505251.9 | 9.643938 |  38652.9
3850  | 2568  | 0.120075 | 8.328128 | 32063.29377  |  123443681.0 | 8.484704 |  21788.7
17741 | 16238 | 0.139569 | 7.164915 | 127112.75430 | 2255107373.0 | 7.785647 | 126423.3
776   | 163   | 0.103605 | 9.652044 | 7489.98600   |    5812229.1 | 9.688620 |   1579.2
2300  | 2324  | 0.107686 | 9.286258 | 21358.39385  |   49124305.8 | 9.390558 |  21823.6
Sum   |       |          |          | 326971.90420 | 3047945487.0 |          | 284206.9

From the above table we have

\sum_{i=1}^{8} d_i x_i = 326971.9042 and \sum_{i=1}^{8} d_i x_i^2 = 3047945487.

Thus the regression estimate of the total number of fish caught during 1995 is

\hat{Y}_G = \sum_{i\in s} w_i y_i = 284206.90.
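The calibrated weights w_i in the table above are designed so that they reproduce the known auxiliary total exactly. The sketch below (illustrative only: it reuses the first three sampled units from the table, and the auxiliary total X = 90000 is an invented value, not the population figure) implements the q_i = 1 chi-square calibration and checks the benchmark property \sum w_i x_i = X.

```python
def calibration_weights(d, x, X):
    # chi-square distance calibration with q_i = 1:
    # w_i = d_i + d_i x_i (X - sum d_i x_i) / sum d_i x_i^2,
    # which guarantees sum w_i x_i = X exactly.
    t_dx = sum(di * xi for di, xi in zip(d, x))
    t_dx2 = sum(di * xi * xi for di, xi in zip(d, x))
    return [di + di * xi * (X - t_dx) / t_dx2 for di, xi in zip(d, x)]

d = [9.306654, 9.072105, 8.633044]   # design weights 1/pi_i (first three units)
x = [2001.0, 5692.0, 2653.0]         # auxiliary values (1994 counts)
X = 90000.0                          # assumed known auxiliary total (hypothetical)
w = calibration_weights(d, x, X)
greg_total = sum(wi * yi for wi, yi in zip(w, [2016.0, 2319.0, 3816.0]))
```

The GREG total \sum w_i y_i then inherits this benchmarking, which is why the w_i y_i column of the table sums directly to the regression estimate.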

Now we assume the following superpopulation model passing through the origin:

y_i = \beta x_i + e_i.

Estimates of the error terms:

d_i         | y_i   | x_i   | d_i x_i^2     | d_i x_i y_i   | e_i
9.306671806 |  2016 |  2001 |   37263923.22 |   37543262.97 |  292.3759
9.072095548 |  2319 |  5692 |  293925589.80 |  119749375.10 | -2583.9800
8.633049291 |  3816 |  2653 |   60762931.83 |   87399678.80 | 1530.7550
9.420379236 |  4008 |  4860 |  222505589.40 |  183498436.70 | -178.3130
8.328116785 |  2568 |  3850 |  123443511.00 |   82338425.03 | -748.3180
7.164889400 | 16238 | 17741 | 2255099365.00 | 2064049574.00 |  956.2333
9.652041724 |   163 |   776 |    5812227.87 |    1220867.45 | -505.4320
9.286219716 |  2324 |  2300 |   49124102.30 |   49636701.63 |  342.8229
Sum         |       |       | 3047937240.00 | 2625436321.00 | -893.8559

Fig. 5.7.1. A residual plot of the residuals e_i against the number of fish during 1994.


Note that we used

e_i = y_i - \hat{\beta}_{ds} x_i with \hat{\beta}_{ds} = \sum_{i\in s} d_i x_i y_i \Big/ \sum_{i\in s} d_i x_i^2 = 0.861381359;

therefore, the sum of the estimates of these residuals may or may not be equal to
zero.
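The slope \hat{\beta}_{ds} is just a ratio of two design-weighted sums, so it is simple to compute. The sketch below (illustrative only; it reuses the first four rows of the table above) fits the through-the-origin model and shows that the design-weighted residuals are orthogonal to x even though their plain sum need not vanish, as noted in the text.

```python
def ratio_of_sums_slope(d, x, y):
    # beta_ds = sum(d_i x_i y_i) / sum(d_i x_i^2): design-weighted
    # least squares slope for the through-the-origin model y = beta x + e
    num = sum(di * xi * yi for di, xi, yi in zip(d, x, y))
    den = sum(di * xi * xi for di, xi in zip(d, x))
    return num / den

d = [9.306672, 9.072096, 8.633049, 9.420379]   # first four rows of the table
x = [2001.0, 5692.0, 2653.0, 4860.0]
y = [2016.0, 2319.0, 3816.0, 4008.0]
beta = ratio_of_sums_slope(d, x, y)
e = [yi - beta * xi for xi, yi in zip(x, y)]
# by construction sum(d_i x_i e_i) = 0, but sum(e_i) generally is not
```

Contrast this with ordinary least squares with an intercept, where the unweighted residuals always sum to zero; here only the weighted normal equation \sum d_i x_i e_i = 0 holds.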

The Sen--Yates--Grundy estimator of variance of the GREG estimator of the
population total is

v_{SYG}(\hat{Y}_G) = \frac{1}{2}\sum_{i\neq j\in s}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)\left(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\right)^2 = \sum\sum_{i<j}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)\left(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\right)^2 .



The above 28 values of \left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)\left(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\right)^2 for the different combinations of i and j
are given in the following table:

i \ j |      1       |      2       |      3       |      4      |      5      |      6      |     7
  2   |  91238120.44
  3   |  14065768.94 | 170976306.70
  4   |   2667685.08 |  63612617.02 |  28389310.20
  5   |   9931124.16 |  36881063.74 |  47536545.93 |  2561215.89
  6   |   1868160.75 | 104865300.70 |   4992906.75 |  7803347.57 | 22190860.32
  7   |   8127085.79 |  47010740.63 |  42049622.94 |  1461282.77 |   225325.07 | 14113424.37
  8   |     29093.82 |  94361304.94 |  12849024.92 |  3250757.81 | 10988465.09 |  1478773.16 | 9121878.27

Thus we have

v_{SYG}(\hat{Y}_G) = \sum\sum_{i<j}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)\left(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\right)^2 = 854647113.8 .

A (1-\alpha)100% confidence interval of the population total Y is given by

\hat{Y}_G \mp t_{\alpha/2}(df = n-2)\sqrt{v_{SYG}(\hat{Y}_G)} .

Using Table 2 from the Appendix, the 95% confidence interval of the total number
of fish caught during 1995 is given by

284206.9 \mp 2.447\sqrt{854647113.8}, or [212670.45, 355743.35].
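The double sum above can be evaluated directly from the first and second order inclusion probabilities. The sketch below (illustrative code, not from the book; the inclusion probabilities and residuals for this toy n = 3 sample are made up, though the \pi_{ij} are chosen so every pair weight is positive) computes the Sen--Yates--Grundy form.

```python
def syg_variance(pi, pij, t):
    # Sen--Yates--Grundy estimator:
    # v = sum_{i<j} (pi_i pi_j - pi_ij)/pi_ij * (t_i/pi_i - t_j/pi_j)^2
    n = len(pi)
    v = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            contrast = t[i] / pi[i] - t[j] / pi[j]
            v += (pi[i] * pi[j] - pij[i][j]) / pij[i][j] * contrast**2
    return v

# toy n = 3 sample with made-up (but valid) inclusion probabilities
pi = [0.3, 0.4, 0.5]
pij = [[0.0, 0.10, 0.13],
       [0.0, 0.00, 0.17],
       [0.0, 0.00, 0.00]]   # only the upper triangle i < j is used
e = [12.0, -5.0, -7.0]      # residuals for the sampled units
v = syg_variance(pi, pij, e)
```

With e_i in place of y_i this is exactly the form applied to the fish data above; replacing e_i/\pi_i by w_i e_i gives the Särndal--Swensson--Wretman variant discussed next.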


An estimator of variance of the GREG due to Särndal, Swensson, and Wretman
(1989) is given by

v_2(\hat{Y}_G) = \frac{1}{2}\sum_{i\neq j\in s}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)(w_i e_i - w_j e_j)^2 = \sum\sum_{i<j}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)(w_i e_i - w_j e_j)^2 .

The 28 values of \left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)(w_i e_i - w_j e_j)^2 for the different combinations of i and j are:

i \ j |      1       |      2       |      3       |      4      |      5      |      6      |     7
  2   |  96029625.10
  3   |  14456136.50 | 178738153.50
  4   |   2748863.48 |  67239252.44 |  29199645.52
  5   |  10252616.95 |  39207263.34 |  48956859.12 |  2648940.95
  6   |   2416283.22 | 113663403.70 |   4352131.18 |  9005203.82 | 24667396.60
  7   |   8223784.93 |  50272113.22 |  42935981.61 |  1441813.32 |   259376.84 | 15624764.01
  8   |     30258.71 |  99296317.46 |  13198368.82 |  3352643.45 | 11348317.19 |  1962788.15 | 9121878.27

Thus we have

v_2(\hat{Y}_G) = \sum\sum_{i<j}\left(\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\right)(w_i e_i - w_j e_j)^2 = 900650181.40 .

A (1-\alpha)100% confidence interval of the population total Y is given by

\hat{Y}_G \mp t_{\alpha/2}(df = n-2)\sqrt{v_2(\hat{Y}_G)} .

Using Table 2 of the Appendix, the 95% confidence interval of the total number of
fish caught during 1995 is

284206.9029 \mp 2.447\sqrt{900650181.40}, or [210770.40, 357643.40].

The next section is devoted to the higher level calibration approach introduced by
Singh, Horn, and Yu (1998), where the variance of the auxiliary variable is
assumed to be known. Several new estimators are shown to be special cases of the
higher level calibration approach.

5.8 IMPROVED ESTIMATOR OF VARIANCE OF THE GREG: THE HIGHER LEVEL CALIBRATION APPROACH
Singh, Horn, and Yu (1998) applied the calibration approach to estimate the
variance of the GREG estimator. The weights D_{ij} of the Sen--Yates--Grundy
(1953) estimator of variance given in (5.7.2) are calibrated so that the estimator of
variance for the auxiliary variable attains the exact known variance, which suggests
an estimator of the variance of the GREG of the form

v(\hat{Y}_G) = \frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}(w_i e_i - w_j e_j)^2   (5.8.1)

where \Omega_{ij} are the modified weights attached to the quadratic expression of the
Sen--Yates--Grundy (1953) type of estimator and are as close as possible, in an
average sense for a given measure, to the D_{ij} with respect to the calibration
equation, defined as

\frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}(d_i x_i - d_j x_j)^2 = V_{SYG}(\hat{X}_{HT})   (5.8.2)

where

V_{SYG}(\hat{X}_{HT}) = \frac{1}{2}\sum_{i\neq j\in\Omega}(\pi_i\pi_j - \pi_{ij})(d_i x_i - d_j x_j)^2

denotes the known variance of the estimator \hat{X}_{HT} = \sum_{i\in s} d_i x_i of the auxiliary
total X = \sum_{i\in\Omega} x_i. To compute the right hand side of (5.8.2), we need either
information on every unit of the auxiliary variable in the population, or only
V_{SYG}(\hat{X}_{HT}) obtained from a past survey or a pilot survey. Examples of situations
where information on every unit of the auxiliary variable is known are
establishment turnovers recorded from a census, administrative records, a Business
Register (BR), the Internal Revenue Service, etc. The use of a known variance of the
auxiliary variable has also been supported by Das and Tripathi (1978), Singh and
Srivastava (1980), Srivastava and Jhajj (1980, 1981), Isaki (1983), Singh and Singh
(1988), Swain and Mishra (1992), Shah and Patel (1996), and Garcia and Cebrian
(1996). Singh, Mangat, and Mahajan (1995) have reviewed classes of estimators of
unknown population parameters making use of the known variance of an auxiliary
variable. For simplicity, Singh, Horn, and Yu (1998) restricted themselves to the
two-dimensional chi square (CS) type distance, D, between two n x n grids formed
by the weights \Omega_{ij} and D_{ij} for i, j = 1, 2, ..., n, given by

D = \frac{1}{2}\sum_{i\neq j\in s}(\Omega_{ij} - D_{ij})^2 (D_{ij}Q_{ij})^{-1} .   (5.8.3)

In most situations Q_{ij} = 1, but other types of weights can also be used. They
have shown that the ratio type adjustment using a known variance of the auxiliary
variable is a special case for a particular choice of Q_{ij}. Minimization of (5.8.3)
subject to (5.8.2) leads to the modified optimal weights

\Omega_{ij} = D_{ij} + \frac{D_{ij}Q_{ij}(d_i x_i - d_j x_j)^2}{\frac{1}{2}\sum_{i\neq j\in s} D_{ij}Q_{ij}(d_i x_i - d_j x_j)^4}\left[ V_{SYG}(\hat{X}_{HT}) - \frac{1}{2}\sum_{i\neq j\in s} D_{ij}(d_i x_i - d_j x_j)^2 \right]   (5.8.4)

for the optimal choice of the Lagrange multiplier \lambda given by

\lambda = \left[ V_{SYG}(\hat{X}_{HT}) - \frac{1}{2}\sum_{i\neq j\in s} D_{ij}(d_i x_i - d_j x_j)^2 \right] \Big/ \left[ \frac{1}{2}\sum_{i\neq j\in s} D_{ij}Q_{ij}(d_i x_i - d_j x_j)^4 \right] .   (5.8.5)

Substituting \Omega_{ij} from (5.8.4) in (5.8.1) leads to the following regression type of
estimator:

v(\hat{Y}_G) = v_{SYG}(\hat{Y}_G) + \hat{B}_1\left[ V_{SYG}(\hat{X}_{HT}) - v_{SYG}(\hat{X}_{HT}) \right]   (5.8.6)

where

\hat{B}_1 = \sum_{i\neq j\in s} D_{ij}Q_{ij}(d_i x_i - d_j x_j)^2 (w_i e_i - w_j e_j)^2 \Big/ \sum_{i\neq j\in s} D_{ij}Q_{ij}(d_i x_i - d_j x_j)^4 = \hat{\mu}_{22}/\hat{\mu}_{04} .
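The adjustment (5.8.4)--(5.8.6) is easy to verify numerically. The sketch below (hypothetical data: the pair weights, squared contrasts, and known variance are invented, and each unordered pair i < j is stored once so the factor 1/2 is absorbed) recalibrates the pair weights and checks that the resulting estimator equals the regression form (5.8.6).

```python
def recalibrated_pair_weights(D, Q, u, V_known):
    # Omega_ij = D_ij + lambda * D_ij Q_ij u_ij, with u_ij = (d_i x_i - d_j x_j)^2
    # and lambda chosen so that sum(Omega * u) equals the known V_SYG(X_HT)
    v_x = sum(Dk * uk for Dk, uk in zip(D, u))
    lam = (V_known - v_x) / sum(Dk * Qk * uk * uk for Dk, Qk, uk in zip(D, Q, u))
    return [Dk + lam * Dk * Qk * uk for Dk, Qk, uk in zip(D, Q, u)]

def regression_form(D, Q, u, r, V_known):
    # equivalent regression form (5.8.6): v_SYG(Y) + B1 * (V_known - v_SYG(X))
    v_y = sum(Dk * rk for Dk, rk in zip(D, r))
    v_x = sum(Dk * uk for Dk, uk in zip(D, u))
    B1 = (sum(Dk * Qk * uk * rk for Dk, Qk, uk, rk in zip(D, Q, u, r))
          / sum(Dk * Qk * uk * uk for Dk, Qk, uk in zip(D, Q, u)))
    return v_y + B1 * (V_known - v_x)

# three sampled pairs with made-up weights and squared contrasts
D = [0.133, 0.128, 0.138]   # Sen--Yates--Grundy pair weights
Q = [1.0, 1.0, 1.0]
u = [4.0, 9.0, 1.0]         # (d_i x_i - d_j x_j)^2
r = [2.5, 7.0, 0.8]         # (w_i e_i - w_j e_j)^2
V_known = 2.0               # known variance of X_HT (assumed)
Omega = recalibrated_pair_weights(D, Q, u, V_known)
v_cal = sum(O * rk for O, rk in zip(Omega, r))
```

The calibrated weights reproduce V_known exactly on the auxiliary contrasts, and applying them to the residual contrasts gives exactly the regression-type estimator (5.8.6).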

The regression coefficient \hat{B}_1 makes use of the known total X of the auxiliary
variable and hence can be treated as an improved estimator of the regression
coefficient, following Singh and Singh (1988). Under the higher level calibration
approach, Singh, Horn, and Yu (1998) have discussed the following cases:

Case I. Under the SRSWOR sampling design, if q_i = x_i^{-1} and Q_{ij} = (d_i x_i - d_j x_j)^{-2} are,
respectively, the weights attached in the low level and higher level calibration
approaches, then the estimator (5.8.6) reduces to the estimator of the variance of the
ratio estimator given in (5.8.7).

Case II. If q_i = 1 and Q_{ij} = 1 \forall i, j, then we have

v_{SYG}(\hat{Y}_G) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n} e_i^2 + \hat{\psi}_1(\bar{X} - \bar{x}) + \hat{\psi}_2(\bar{X} - \bar{x})^2 + \hat{\psi}_3(S_x^2 - s_x^2)   (5.8.8)

where \hat{\psi}_1 and \hat{\psi}_2 are as given earlier and \hat{\psi}_3 is given by (5.8.9).

Without loss of generality, the estimators of variance of the ratio estimator and of
the GREG given in (5.8.7) and (5.8.8) are neither members of the low level
calibration class nor of the class of estimators by Deng and Wu (1987). These
estimators are members of the analogues of the classes of estimators for estimating
the variance of the GREG given by Srivastava and Jhajj (1981) as

v_{SJ}(\hat{Y}_G) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n} e_i^2 \, H\!\left(\frac{\bar{x}}{\bar{X}}, \frac{s_x^2}{S_x^2}\right)   (5.8.10)

where H(\cdot,\cdot) is a parametric function such that H(1, 1) = 1 which satisfies certain
regularity conditions defined in Chapter 3. Following Srivastava and Jhajj (1981)
and Deng and Wu (1987), it is easy to see that the class of estimators in (5.8.10)
remains better than the class of estimators defined in (5.7.6) and hence (5.7.10). A
difficult issue in using (5.8.1) is how to obtain non-negative estimates of variance
using calibration. The simplest way is to optimise the CS distance function (5.8.3)
subject to the calibration constraint (5.8.2) along with the conditions \Omega_{ij} \geq 0
\forall i, j = 1, 2, ..., n. Constrained calibration weights can be obtained by following
Estevao (1994). While it is difficult to develop a solution to this problem
theoretically, well known quadratic programming techniques can yield useful
numerical results. A straightforward extension to the two-dimensional problem
using other distance functions, as discussed by Deville and Särndal (1992) for
instance, is not possible owing to the unpredictable nature of the weights D_{ij}.

Padmawar (1994, 1998a) has suggested sampling strategies admitting non-negative
unbiased estimators of the variance following the lines of Rao and Vijayan (1977)
and Rao (1979). Chaudhuri (1981) also discussed methods for constructing
non-negative estimators of the variance by following Sharma (1970), Rao (1972,
1977a), Vijayan (1975), Chaudhuri (1976), and Bandopadhyaya, Chattopadhyaya,
and Kundu (1977). Arnab (1992) and Chaudhuri and Roy (1997a) have also
considered the problem of estimation of the population total under a
superpopulation model.

Example 5.8.1. We want to estimate the total real estate farm loans of all the
operating banks in the United States. Take an SRSWOR sample of 15 states from
population 1 and note the records of the real estate farm loans as well as the nonreal
estate farm loans. Given that information on the nonreal estate farm loans is
available for all states, apply the ratio estimator for estimating the total real estate
farm loans in the United States. Construct the 95% confidence intervals using the low
and higher level calibration estimators.
Given: N = 50, X = 43908.12 and S_x^2 = 1176526.
Solution. We used the first two columns of the Pseudo-Random Numbers (PRN)
given in Table 1 of the Appendix to select 15 distinct random numbers 1 \leq R_i \leq 50.
We observed the random numbers in the sequence 01, 23, 46, 04, 32, 47, 33, 05,
22, 38, 29, 40, 03, 36, and 27.
", *;;:~ Selected sampl e and analysis ~

l I!oP;<' ~itate I N onrealJ Real estate ;,</ '<,,,!!C~~\ ' " '{ I;; - 1:,:
"

;' No! ~~t(lte~anil 'fann loails ~Xi! - X,l,/;;!~ :" \Y I:~


.Y ? ... '};'
{ ! !' '.
lo~Qs, Xi 'I!- "
8 "

01 AL 348.334 408.978 563337 .81 49034 .26 209.145 43741.647


23 MN 2466.892 1354.768 1871423.08 524687.56 -60.444 3653.434
46 VA 188.477 321.583 828856.08 95377.08 213.457 45563 .946
04 AR 848.317 907.700 62787.99 76887.08 421.036 177271.119
32 NY 426 .274 201.631 452415.42 183855.54 -42.9 15 1841.675
47 WA 1228.607 1100.745 16825.89 2212 10.49 395.915 156748.865
33 NC 494.730 639.571 365012.12 83.84 355.753 126560.397
05 CA 3928.732 1343.461 8007992 .54 508434 .88 -9 10.382 828795 .394
22 MI 440.518 323.028 433456 .76 94486.64 70.311 4943.599
38 PA 298.351 756.169 640866.43 15814.12 585.010 342237 .154
29 NH 0.471 6.044 1206529.42 389838 .89 5.774 33.337
40 SC 80.750 87.951 1036613.81 294266 .97 41.626 1732.738
03 AZ 431.439 54.633 445493 .95 331524.68 -192.876 37201.077
36 OK 1716.087 612.108 380929 .26 335.14 -372.380 138667.086
27 NE 3585.406 1337.852 6182750 .22 500467 .39 -719 .031 517005.641
Sum 16483.385 9456.222 ;22495290.81 3286304.58 / ' 0.000 2425997. 112
Chapter 5: Use of aux iliary information: PPSWOR Sampling 423

From the above table we have

\bar{y} = 630.414, \bar{x} = 1098.892, s_x^2 = 1606806.48, s_y^2 = 234736.042, r = \bar{y}/\bar{x} = 0.57368,
e_i = (y_i - \bar{y}) - r(x_i - \bar{x}), and \sum_{i=1}^{15} e_i^2 = 2425997.11.

An estimate of the total nonreal estate farm loans in the United States during 1997
is given by

\hat{X} = N\bar{x} = 50 \times 1098.892 = 54944.6.

Also we are given \bar{X} = 878.1624 and f = n/N = 0.3.

Thus a ratio estimate of the total real estate farm loans in the United States during
1997 is given by

\hat{Y}_R = N\bar{y}\left(\frac{\bar{X}}{\bar{x}}\right) = 50 \times 630.414 \times \left(\frac{878.1624}{1098.892}\right) = 25189.276.

( a ) Low level calibration:

v_1(\hat{Y}_R) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n} e_i^2 \left(\frac{X}{\hat{X}}\right)^2 = \frac{50^2(1-0.3)}{15(15-1)} \times 2425997.11 \times \left(\frac{43908.12}{54944.6}\right)^2 = 12910667.19.

Making use of Table 2 from the Appendix, the 95% confidence interval based on the
low level calibration estimator for estimating the total real estate farm loans in the
United States is given by

\hat{Y}_R \mp t_{\alpha/2}(df = n-1)\sqrt{v_1(\hat{Y}_R)}, or 25189.27 \mp t_{0.025}(df = 14)\sqrt{12910667.19},

or 25189.27 \mp 2.145\sqrt{12910667.19}, or [17481.98, 32896.56].

( b ) Higher level calibration:

v_2(\hat{Y}_R) = v_1(\hat{Y}_R) \times \frac{S_x^2}{s_x^2} = 12910667.19 \times \frac{1176526}{1606806.486} = 9453369.624.

Using Table 2 of the Appendix, the 95% confidence interval based on the higher level
calibration estimator for estimating the total real estate farm loans in the United
States is given by

\hat{Y}_R \mp t_{\alpha/2}(df = n-1)\sqrt{v_2(\hat{Y}_R)}, or 25189.27 \mp t_{0.025}(df = 14)\sqrt{9453369.624},

or 25189.27 \mp 2.145\sqrt{9453369.624}, or [18594.18, 31784.36].

This example shows that the width of the 95% confidence interval obtained from the
higher level calibrated estimator is smaller than that obtained from the low level
calibration estimator.
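The arithmetic of parts (a) and (b) can be reproduced directly from the summary statistics above. The sketch below (illustrative; it simply re-evaluates the formulas with the quantities quoted in this example) computes both variance estimates and the corresponding 95% confidence intervals.

```python
import math

# summary statistics from Example 5.8.1 (SRSWOR, n = 15 states)
N, n = 50, 15
f = n / N
sum_e2 = 2425997.11          # sum of squared ratio-estimator residuals
X = 43908.12                 # known total of nonreal estate farm loans
X_hat = 54944.6              # N * xbar, its sample estimate
S2x, s2x = 1176526.0, 1606806.486
Y_R = 25189.276              # ratio estimate of real estate farm loans
t14 = 2.145                  # t_{0.025} with 14 degrees of freedom

v1 = N**2 * (1 - f) / (n * (n - 1)) * sum_e2 * (X / X_hat) ** 2   # low level
v2 = v1 * S2x / s2x                                               # higher level
ci_low = (Y_R - t14 * math.sqrt(v1), Y_R + t14 * math.sqrt(v1))
ci_high = (Y_R - t14 * math.sqrt(v2), Y_R + t14 * math.sqrt(v2))
```

Because S_x^2 < s_x^2 in this sample, the higher level adjustment shrinks the variance estimate, which is exactly why the interval in (b) is the narrower of the two.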

5.8.1 RECALIBRATED ESTIMATOR OF THE VARIANCE OF GREG

Farrell and Singh (2002a) considered a new estimator of the variance of the GREG as

v(\hat{Y}_G) = \frac{1}{2}\sum_{i\neq j\in s}\tilde{\Omega}_{ij}(w_i e_i - w_j e_j)^2   (5.8.1.1)

where \tilde{\Omega}_{ij} are recalibrated weights such that the chi square type of distance function

D^* = \frac{1}{2}\sum_{i\neq j\in s}(\tilde{\Omega}_{ij} - \Omega_{ij})^2 (c_{ij}\Omega_{ij})^{-1}   (5.8.1.2)

is minimized subject to the calibration constraint

E_m\left[\frac{1}{2}\sum_{i\neq j\in s}\tilde{\Omega}_{ij}(w_i e_i - w_j e_j)^2\right] = E_m V_{SYG}(\hat{Y}_G)   (5.8.1.3)

or equivalently

\sum_{i\neq j\in s}\tilde{\Omega}_{ij}\left(w_i^2 v(x_i) + w_j^2 v(x_j) - 2\rho_e w_i w_j\sqrt{v(x_i)v(x_j)}\right) = \sum_{i\neq j\in\Omega} D_{ij}\left(d_i^2 v(x_i) + d_j^2 v(x_j) - 2\rho_e d_i d_j\sqrt{v(x_i)v(x_j)}\right)

where the c_{ij} are real constants, and e_i = y_i - \hat{\beta}x_i is such that E_m(e_i) = 0,
V_m(e_i) = \sigma^2 v(x_i), and E_m(e_i e_j) = \rho_e \sigma^2\sqrt{v(x_i)v(x_j)} for i \neq j, \sigma^2 > 0. Here \rho_e is
the correlation coefficient between successive error terms, which are related
according to e_i = \rho_e e_{i-1} + u_i where the u_i are i.i.d. N(0, 1).
Optimization of (5.8.1.2) subject to (5.8.1.3) yields the recalibrated weights

\tilde{\Omega}_{ij} = \Omega_{ij} + \frac{2\, c_{ij}\Omega_{ij}\left(w_i^2 v(x_i) + w_j^2 v(x_j) - 2\rho_e w_i w_j\sqrt{v(x_i)v(x_j)}\right)}{\sum_{i\neq j\in s} c_{ij}\Omega_{ij}\left(w_i^2 v(x_i) + w_j^2 v(x_j) - 2\rho_e w_i w_j\sqrt{v(x_i)v(x_j)}\right)^2}\left\{v_{ma} - v_{ma1}\right\} .   (5.8.1.4)

Note that optimization means the distance may be a minimum or a maximum. The
optimisation of a distance function always leads to a new estimator in the calibration
approach, but it may or may not be an efficient one. Further note that the total of the
distances between the design and calibration weights can be negative, but the
magnitude of a distance cannot be negative. If a few weights are negative and others
are positive, the resultant estimator of a parameter in the calibration approach can
still be positive. Substitution of (5.8.1.4) into (5.8.1.1) provides a new estimator of
the variance of the GREG as

v_2(\hat{Y}_G) = v_0(\hat{Y}_G) + \hat{B}\left[V_{SYG}(\hat{X}_{HT}) - v_{SYG}(\hat{X}_{HT})\right] + \hat{\beta}\left[v_{ma} - v_{ma1}\right]   (5.8.1.5)

where

\hat{B} = \sum_{i\neq j\in s} D_{ij}q_{ij}(d_i x_i - d_j x_j)^2(w_i e_i - w_j e_j)^2 \Big/ \sum_{i\neq j\in s} D_{ij}q_{ij}(d_i x_i - d_j x_j)^4 ,

v_{ma} = \frac{1}{2}\sum_{i\neq j\in\Omega} D_{ij}\left(d_i^2 v(x_i) + d_j^2 v(x_j) - 2\rho_e d_i d_j\sqrt{v(x_i)v(x_j)}\right) ,

and

v_{ma1} = \frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}\left(w_i^2 v(x_i) + w_j^2 v(x_j) - 2\rho_e w_i w_j\sqrt{v(x_i)v(x_j)}\right) .

The class (5.8.1.5) is wider than those of Wu (1982), Deville and Särndal (1992),
Deng and Wu (1987), Singh, Horn, and Yu (1998), and Wu and Sitter (2001), and
they named it higher order model assisted calibration. Note that the estimator
suggested by Shah and Patel (1996),

Q_{greg} = s_e^2 + \gamma_1(S_x^2 - s_x^2) + \gamma_2\left(\frac{1}{N}\sum_{i\in\Omega} v(x_i) - \frac{1}{n}\sum_{i\in s} v(x_i)\right)   (5.8.1.6)

where \gamma_1 and \gamma_2 are sample dependent constants, is also a special case of
(5.8.1.5). We illustrate some special cases of (5.8.1.5) here, but a large number of
estimators can be derived as special cases.

Case I. Improved estimator of the variance of the ratio predictor under
SRSWOR sampling:

If q_i = 1/x_i, \pi_i = n/N, \pi_{ij} = n(n-1)/\{N(N-1)\}, and v(x_i) = x_i^g for 0 \leq g \leq 2, then

v(\hat{Y}_{ratio}) = v_0\, \frac{\frac{1}{N}\sum_{i\in\Omega} x_i^g - \frac{\rho_e}{N(N-1)}\left\{\left(\sum_{i\in\Omega} x_i^{g/2}\right)^2 - \sum_{i\in\Omega} x_i^g\right\}}{\frac{1}{n}\sum_{i\in s} x_i^g - \frac{\rho_e}{n(n-1)}\left\{\left(\sum_{i\in s} x_i^{g/2}\right)^2 - \sum_{i\in s} x_i^g\right\}}   (5.8.1.7)

where v_0 = \frac{N^2(1-f)}{n(n-1)}\sum_{i\in s} e_i^2 and e_i = y_i - (\bar{y}/\bar{x})x_i. Assuming \left|(x_i - \bar{X})/\bar{X}\right| < 1,
\left|(x_i - \bar{x})/\bar{x}\right| < 1, and autocorrelation \rho_e = 0, then (5.8.1.7) can be approximated as

v(\hat{Y}_{ratio}) = v_0\left(\frac{\bar{X}}{\bar{x}}\right)^g \frac{1 + \frac{1}{N}\sum_{i\in\Omega}\sum_{j=0}^{\infty}\left(\frac{x_i - \bar{X}}{\bar{X}}\right)^{j+1}\prod_{k=0}^{j}\frac{g-k}{k+1}}{1 + \frac{1}{n}\sum_{i\in s}\sum_{j=0}^{\infty}\left(\frac{x_i - \bar{x}}{\bar{x}}\right)^{j+1}\prod_{k=0}^{j}\frac{g-k}{k+1}} .   (5.8.1.8)

If j = 0 then (5.8.1.8) becomes

v(\hat{Y}_{ratio}) = v_0\left(\frac{\bar{X}}{\bar{x}}\right)^g

which is the same as the class of estimators defined by Wu (1982).

If j = 1 then (5.8.1.8) becomes

v(\hat{Y}_{ratio}) = v_0\left(\frac{\bar{X}}{\bar{x}}\right)^g \frac{1 + \frac{g(g-1)}{2}\,\frac{(N-1)S_x^2}{N\bar{X}^2}}{1 + \frac{g(g-1)}{2}\,\frac{(n-1)s_x^2}{n\bar{x}^2}}

which is a member of the class of Singh, Horn, and Yu (1998).

However, if j = 2 then the resulting estimator also involves

\mu_{03} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{X})^3 and \hat{\mu}_{03} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^3 ,

the third order population and sample moments of the auxiliary variable. This
estimator does not belong to the class of estimators proposed by Singh, Horn, and
Yu (1998), implying that the recalibration method is more general than that of these
authors.

5.8.2 RECALIBRATION USING OPTIMAL DESIGNS FOR THE GREG

Consider the case where the autocorrelation \rho_e is not necessarily zero, and note
that if \pi_i \propto \sqrt{v(x_i)} then (5.8.1.3) becomes

\sum_{i\neq j\in s}\tilde{\Omega}_{ij}\left(h_i^2 + h_j^2 - 2\rho_e h_i h_j\right) = 2(1-\rho_e)\sum_{i\neq j\in\Omega} D_{ij}   (5.8.2.1)

where h_i = 1 + \frac{q_i x_i}{\sum_{i\in s} d_i q_i x_i^2}\left(X - \hat{X}_{HT}\right). The recalibrated estimator of variance
takes the form

v_2(\hat{Y}_G) = \frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}(w_i e_i - w_j e_j)^2 + \hat{\beta}_{opt}\left[(1-\rho_e)\sum_{i\neq j\in\Omega} D_{ij} - \frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}\left(h_i^2 + h_j^2 - 2\rho_e h_i h_j\right)\right]   (5.8.2.2)

where

\hat{\beta}_{opt} = \sum_{i\neq j\in s}\Omega_{ij}c_{ij}\left(h_i^2 + h_j^2 - 2\rho_e h_i h_j\right)(w_i e_i - w_j e_j)^2 \Big/ \sum_{i\neq j\in s} c_{ij}\Omega_{ij}\left(h_i^2 + h_j^2 - 2\rho_e h_i h_j\right)^2 .

This result illustrates that the Farrell and Singh (2002a) technique works to
recalibrate the Yates and Grundy (1953) form of the estimator of the variance of the
GREG under the condition of minimum variance for the estimator of the total under
the true model. It is often the case that c_{ij} = 1. If this is so, then (5.8.2.2) is a new
estimator of the variance of the GREG estimator. Three other special cases of
(5.8.2.2) are also worthy of note.

Case I. If \rho_e = 0 and c_{ij} = (h_i^2 + h_j^2)^{-1}, then the estimator (5.8.2.2) reduces to

v_2(\hat{Y}_G) = \left[\frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}(w_i e_i - w_j e_j)^2\right] \frac{\sum_{i\neq j\in\Omega} D_{ij}}{\frac{1}{2}\sum_{i\neq j\in s}\Omega_{ij}\left(h_i^2 + h_j^2\right)} .   (5.8.2.3)

Case III. If \rho_e = +1, then for c_{ij} = (h_i - h_j)^{-2} the estimator of variance in (5.8.2.2)
becomes

v_2(\hat{Y}_G) = 0 .   (5.8.2.5)

Note also that the condition \pi_i \propto \sqrt{v(x_i)} corresponds to the Godambe and Joshi
(1965) lower bound of variance, so the variance for a fixed sample size design under
the true model may be equal to zero. This demonstrates the usefulness of the
recalibration method for the estimator of the variance of the GREG estimator.
Although different choices of the function v(x_i) lead to different estimators of the
variance of the GREG, the most common form of the function is v(x_i) = x_i^g, where
g is a known model parameter.
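Case III can be verified numerically. The sketch below (hypothetical values: the h_i pairs, weights \Omega_{ij}, and residual contrasts are invented, and D_pop_sum stands in for the known population sum of the D_{ij}) implements (5.8.2.2) over the sampled pairs and shows that the estimator collapses to zero when \rho_e = 1 and c_{ij} = (h_i - h_j)^{-2}.

```python
def recalibrated_v2(Omega, c, phi, r, rho, D_pop_sum):
    # (5.8.2.2) over sampled pairs i < j, with
    # phi_ij = h_i^2 + h_j^2 - 2 rho h_i h_j and r_ij = (w_i e_i - w_j e_j)^2:
    # v2 = sum(Omega r) + beta_opt [ (1 - rho) D_pop_sum - sum(Omega phi) ]
    v = sum(O * rk for O, rk in zip(Omega, r))
    beta = (sum(O * ck * p * rk for O, ck, p, rk in zip(Omega, c, phi, r))
            / sum(O * ck * p * p for O, ck, p in zip(Omega, c, phi)))
    return v + beta * ((1 - rho) * D_pop_sum
                       - sum(O * p for O, p in zip(Omega, phi)))

# hypothetical g-weight pairs (h_i, h_j) for three sampled pairs
h = [(1.1, 0.9), (1.1, 1.3), (0.9, 1.3)]
rho = 1.0
phi = [hi**2 + hj**2 - 2 * rho * hi * hj for hi, hj in h]   # = (h_i - h_j)^2
c = [1.0 / p for p in phi]                                  # c_ij = (h_i - h_j)^-2
Omega = [0.12, 0.10, 0.15]
r = [4.0, 9.0, 1.0]
v2 = recalibrated_v2(Omega, c, phi, r, rho, D_pop_sum=0.5)
```

With \rho_e = 1 the first bracketed term vanishes and c_{ij}\phi_{ij} = 1, so \hat{\beta}_{opt} cancels the whole quadratic form, reproducing (5.8.2.5) exactly.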

5.9 ESTIMATORS OF VARIANCE OF ESTIMATOR OF DISTRIBUTION FUNCTION

Rao (1994) has considered the problem of estimation of general parameters of
interest, given by

H_y = \sum_{j\in\Omega} h(y_j), and \bar{H}_y = N^{-1} H_y   (5.9.1)

for a specified function h. The choice h(y) = y gives the population total
H_y = Y and the population mean \bar{H}_y = \bar{Y}, while the choice h(y) = \Delta(t - y), with
\Delta(a) = 1 when a \geq 0 and \Delta(a) = 0 otherwise, gives the distribution function

H_y = F(t) = N^{-1}\sum_{j\in\Omega}\Delta(t - y_j)   (5.9.2)
for each t. Rao (1994) has suggested a general class of estimators of H_y given by

\hat{H}_y = \sum_{i\in s} d_i(s) h(y_i)   (5.9.3)

where the basic weights d_i(s) can depend both on s and on i (i \in s) and satisfy the
design unbiasedness condition. The choice h(y) = y in (5.9.3) gives Godambe's
(1955) class of estimators of the total. If d_i(s) = \pi_i^{-1} then (5.9.3) reduces to the
Horvitz and Thompson (1952) estimator of the population total. If d_i(s) = w_i and
h(y_i) = \Delta(t - y_i), then (5.9.3) reduces to the estimator \hat{F}(t) suggested by Silva and
Skinner (1995). Rao (1979) has suggested an estimator of the variance of the
estimator \hat{H}_y as

v(\hat{H}_y) = \sum\sum_{i<j;\, i,j\in s} D_{ij}(s) w_i w_j (z_i - z_j)^2   (5.9.4)

where z_i = h(y_i)/w_i and the weights D_{ij}(s) can depend both on s and on (i, j) \in s,
and satisfy the unbiasedness condition. For example, the Sen--Yates--Grundy (1953)
estimator of the variance of the Horvitz and Thompson estimator is a special case of
(5.9.4) with w_i = \pi_i and D_{ij}(s) = (\pi_i\pi_j - \pi_{ij})/(\pi_{ij}\pi_i\pi_j) for any design with fixed
sample size n. Singh (2001) proposed an estimator of the variance of \hat{H}_y as

v_s(\hat{H}_y) = \sum\sum_{i<j;\, i,j\in s} w_{ij}(s) w_i w_j (z_i - z_j)^2   (5.9.5)

where the w_{ij}(s) are modified weights that are as close as possible, in an average
sense for a given measure, to the D_{ij}(s) with respect to the calibration equation

\sum\sum_{i<j;\, i,j\in s} w_{ij}(s) w_i w_j (q_i - q_j)^2 = V(\hat{H}_x)   (5.9.6)

where q_i = h(x_i)/w_i and

V(\hat{H}_x) = \sum\sum_{i<j;\, i,j\in\Omega} D_{ij}(\Omega) w_i w_j (q_i - q_j)^2, for D_{ij}(\Omega) = (\pi_i\pi_j - \pi_{ij})/(\pi_i\pi_j)

denotes the known second order moment of the estimator \hat{H}_x = \sum_{i\in s} d_i(s)h(x_i) of
the auxiliary parameter H_x.
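As a small illustration of the choice h(y) = \Delta(t - y), the sketch below (hypothetical data: the y values, inclusion probabilities, and population size N are invented) computes the Horvitz--Thompson type estimator \hat{F}(t) = N^{-1}\sum_{i\in s}\pi_i^{-1}\Delta(t - y_i) obtained from (5.9.3) with d_i(s) = 1/\pi_i.

```python
def ht_distribution_function(y_s, pi_s, N, t):
    # F_hat(t) = N^{-1} sum_{i in s} (1/pi_i) Delta(t - y_i),
    # Delta(a) = 1 if a >= 0, else 0 (cf. (5.9.2)-(5.9.3))
    return sum(1.0 / p for yi, p in zip(y_s, pi_s) if t - yi >= 0) / N

y_s = [3.0, 7.0, 9.0, 4.0]      # sampled study values (hypothetical)
pi_s = [0.2, 0.25, 0.5, 0.25]   # first order inclusion probabilities
F5 = ht_distribution_function(y_s, pi_s, N=20, t=5.0)
```

Note that \hat{F}(t) is a step function, nondecreasing in t, but it is not automatically bounded by 1 unless the weights are normalised, which is one motivation for the calibrated versions above.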
The minimisation of the two-dimensional CS type distance D between the two lower
triangular n x n grids formed by the weights w_{ij}(s) and D_{ij}(s) for i, j = 1, 2, ..., n,
defined as

D = \sum\sum_{i<j;\, i,j\in s}\left\{w_{ij}(s) - D_{ij}(s)\right\}^2 (D_{ij}(s)Q_{ij})^{-1},   (5.9.7)

subject to (5.9.6) leads to the modified optimal weights given by

w_{ij}(s) = D_{ij}(s) + \frac{w_i w_j D_{ij}(s)Q_{ij}(q_i - q_j)^2}{\sum\sum_{i<j;\, i,j\in s} w_i^2 w_j^2 D_{ij}(s)Q_{ij}(q_i - q_j)^4}\left[ V(\hat{H}_x) - \sum\sum_{i<j;\, i,j\in s} D_{ij}(s) w_i w_j (q_i - q_j)^2 \right] .   (5.9.8)

On substituting the value of w_{ij}(s) from (5.9.8) in (5.9.5) we obtain a regression
type estimator of the variance of \hat{H}_y as

v_s(\hat{H}_y) = v(\hat{H}_y) + \hat{B}\left[ V(\hat{H}_x) - v(\hat{H}_x) \right]   (5.9.9)

where

\hat{B} = \sum\sum_{i<j;\, i,j\in s} w_i^2 w_j^2 D_{ij}(s)Q_{ij}(q_i - q_j)^2 (z_i - z_j)^2 \Big/ \sum\sum_{i<j;\, i,j\in s} w_i^2 w_j^2 D_{ij}(s)Q_{ij}(q_i - q_j)^4

and

v(\hat{H}_x) = \sum\sum_{i<j;\, i,j\in s} D_{ij}(s) w_i w_j (q_i - q_j)^2 .

The approximate leading term of the mean squared error of the regression type
estimator (5.9.9) is given by

MSE[v_s(\hat{H}_y)] = V[v(\hat{H}_y)] + B^2 V[v(\hat{H}_x)] - 2B\, Cov[v(\hat{H}_y), v(\hat{H}_x)]   (5.9.10)

where

B = \sum\sum_{i<j;\, i,j\in\Omega} w_i^2 w_j^2 D_{ij}(\Omega)Q_{ij}(q_i - q_j)^2 (z_i - z_j)^2 \Big/ \sum\sum_{i<j;\, i,j\in\Omega} w_i^2 w_j^2 D_{ij}(\Omega)Q_{ij}(q_i - q_j)^4

and

Cov[v(\hat{H}_y), v(\hat{H}_x)] \approx \sum\sum_{i<j}\sum\sum_{k<l;\, i,j,k,l\in\Omega} D_{ij}(\Omega)D_{kl}(\Omega)(\pi_{ijkl} - \pi_{ij}\pi_{kl})(q_i - q_j)^2 (z_k - z_l)^2 .

Note that D_{ij}, D_{ij}(s), and D_{ij}(\Omega) have different meanings. Dorfman and Hall
(1993) considered the estimation of the distribution function of a variable over a finite
population, when a sample of units is available and the values of a related auxiliary
variable are known for the whole population, and developed several estimators

based on nonparametric regression models. Wu and Sitter (2001) also considered
the problem of estimation of the variance of the estimator of the distribution function
in the presence of complete auxiliary information. Rueda and Arcos (2002) make use
of quantiles of the auxiliary variable while estimating the distribution function and
hence the median.

5.9.1 UNIFIED SETUP

An estimator for estimating the variance of an estimator of the population total, e.g., a
ratio estimator, regression predictor, or estimator of the distribution function, can be
written as

v_y = \sum\sum D_{ij}\, f(y_i, y_j, x_i, x_j)   (5.9.1.1)

where the D_{ij} are some design weights, f(\cdot,\cdot,\cdot,\cdot) is a quadratic function of y_i and
x_i, and \sum\sum denotes the sum over sampled values. A calibrated estimator of
variance is then defined as

v_p = \sum\sum w_{ij}\, f(y_i, y_j, x_i, x_j)   (5.9.1.2)

where the w_{ij} are calibrated weights subject to a distance function and a calibration
constraint. Consider |w_{ij} - D_{ij}| < 1. Let g(w_{ij}) be any function of the new weights
w_{ij} satisfying the regularity conditions:

( a ) g(D_{ij}) = 0;
( b ) the first and second order partial derivatives of the function g exist and are
known.

The weights w_{ij} are obtained such that the function g(w_{ij}) has the minimum value
subject to the constraint

\sum\sum w_{ij}\, h(x_i, x_j) = V_x .   (5.9.1.3)

Expanding the function g(w_{ij}) around the point D_{ij} by using a second order Taylor
series, we have

g(w_{ij}) = g[D_{ij} + (w_{ij} - D_{ij})] = g(D_{ij}) + (w_{ij} - D_{ij})\left.\frac{\partial g}{\partial w_{ij}}\right|_{w_{ij}=D_{ij}} + \frac{1}{2}(w_{ij} - D_{ij})^2\left.\frac{\partial^2 g}{\partial w_{ij}^2}\right|_{w_{ij}=D_{ij}} + \ldots .   (5.9.1.4)

Using regularity condition ( a ) we have

g(w_{ij}) = (w_{ij} - D_{ij})\psi_{ij} + \frac{1}{2}(w_{ij} - D_{ij})^2\psi_{ij}^*   (5.9.1.5)

where \psi_{ij} and \psi_{ij}^* denote, respectively, the first and second order derivatives of
the function g with respect to w_{ij}, and are known constants. For example, if \psi_{ij} = 0
and \psi_{ij}^* = 1/(D_{ij}Q_{ij}), then \sum\sum g(w_{ij}) reduces to the CS distance function discussed
in the previous sections. Minimization of \sum\sum g(w_{ij}) subject to the calibration
constraint (5.9.1.3) leads to the optimum weights given by

w_{ij} = D_{ij} - \frac{\psi_{ij}}{\psi_{ij}^*} + \frac{h(x_i, x_j)/\psi_{ij}^*}{\sum\sum h^2(x_i, x_j)/\psi_{ij}^*}\left[ V_x - \sum\sum D_{ij}\, h(x_i, x_j) + \sum\sum \frac{\psi_{ij}}{\psi_{ij}^*}\, h(x_i, x_j) \right] .   (5.9.1.6)

On substituting (5.9.1.6) in (5.9.1.2), one can easily obtain the higher order calibrated
estimator of variance.
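The general solution (5.9.1.6) can be checked numerically: whatever \psi_{ij} and \psi_{ij}^* are, the resulting weights must reproduce the calibration constraint (5.9.1.3). The sketch below (hypothetical pair weights and h values; the CS case \psi_{ij} = 0, \psi_{ij}^* = 1/(D_{ij}Q_{ij}) is used) implements the formula and verifies that \sum\sum w_{ij} h(x_i, x_j) = V_x.

```python
def calibrated_pair_weights(D, h, V_x, psi1, psi2):
    # w_ij = D_ij - psi/psi* + (h/psi*) * lam, with Lagrange multiplier
    # lam = (V_x - sum(D h) + sum((psi/psi*) h)) / sum(h^2 / psi*)
    lam = ((V_x - sum(Dk * hk for Dk, hk in zip(D, h))
            + sum((p1 / p2) * hk for p1, p2, hk in zip(psi1, psi2, h)))
           / sum(hk * hk / p2 for hk, p2 in zip(h, psi2)))
    return [Dk - p1 / p2 + (hk / p2) * lam
            for Dk, p1, p2, hk in zip(D, psi1, psi2, h)]

D = [0.12, 0.10, 0.15]                 # initial pair weights (hypothetical)
h = [2.5, 4.0, 1.5]                    # h(x_i, x_j) for each sampled pair
Q = [1.0, 1.0, 1.0]
psi1 = [0.0, 0.0, 0.0]                 # first derivatives: CS distance case
psi2 = [1.0 / (Dk * Qk) for Dk, Qk in zip(D, Q)]   # second derivatives
w = calibrated_pair_weights(D, h, V_x=1.4, psi1=psi1, psi2=psi2)
```

Choosing other (\psi_{ij}, \psi_{ij}^*) pairs changes the distance being minimised but not the benchmarking property, which is the point of the unified setup.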

5.10 CALIBRATION OF ESTIMATOR OF VARIANCE OF REGRESSION PREDICTOR

The Horvitz and Thompson (1952) estimator and the generalized regression
(GREG) predictor of the population total, Y, are given respectively by

\hat{Y}_{HT} = \sum_{i\in\Omega} d_i y_i I_{si}   (5.10.1)

and

\hat{Y}_g = \sum_{i\in\Omega} d_i y_i I_{si} + \hat{\beta}_Q\left( X - \sum_{i\in\Omega} d_i x_i I_{si} \right)   (5.10.2)

where d_i = 1/\pi_i, I_{si} = \{1 if i \in s; 0 otherwise\}, \hat{\beta}_Q = \left(\sum_{i\in\Omega} Q_i x_i y_i I_{si}\right)\Big/\left(\sum_{i\in\Omega} Q_i x_i^2 I_{si}\right),
and Q_i is a suitably chosen constant. An approximate expression for the variance of
\hat{Y}_g was obtained by Särndal (1982) and is given by

V_y \approx V_p(\hat{Y}_g) = \frac{1}{2}\sum_{i\neq j}\Theta_{ij}(d_i E_i - d_j E_j)^2

where E_i = y_i - \beta_Q x_i, \beta_Q = \left(\sum_{i\in\Omega} Q_i x_i y_i \pi_i\right)\Big/\left(\sum_{i\in\Omega} Q_i x_i^2 \pi_i\right), and \Theta_{ij} = (\pi_i\pi_j - \pi_{ij}).
Särndal (1982) introduced two estimators of variance as

v_1 = \frac{1}{2}\sum_{i\neq j} D_{ij}(d_i e_i - d_j e_j)^2 I_{sij}   (5.10.3)

and

v_2 = \frac{1}{2}\sum_{i\neq j} D_{ij}(g_{si} d_i e_i - g_{sj} d_j e_j)^2 I_{sij}   (5.10.4)

where D_{ij} = (\pi_i\pi_j - \pi_{ij})/\pi_{ij}, e_i = y_i - \hat{\beta}_Q x_i, g_{si} = 1 + \frac{Q_i x_i}{\sum_{i\in\Omega} d_i Q_i x_i^2 I_{si}}\left(X - \sum_{i\in\Omega} d_i x_i I_{si}\right), and

I_{sij} = \{1 if (i and j) \in s; 0 otherwise\}.

Kott (1990) proposed the following estimators:

v_{kj} = \hat{w}_j v_j, j = 1, 2   (5.10.5)

where \hat{w}_j is a calibration weight satisfying

E_m(\hat{Y}_g - Y)^2 = \hat{w}_j E_m(v_j) .   (5.10.6)

The estimators are practicable under the model M(j) with \sigma_i^2 = \sigma^2 f_i. Chaudhuri and
Roy (1997a) have shown that V_y, the variance of the regression predictor, can be
written as

V_y = \sum_i a_i y_i^2 + \sum\sum_{i\neq j} a_{ij} y_i y_j ,   (5.10.7)

where a_i is given by (5.10.8) and

a_{ij} = -d_i d_j\Theta_{ij} + \frac{Q_i Q_j \pi_i \pi_j x_i x_j\, V_p\left(\sum_{i\in s} d_i x_i I_{si}\right)}{\left(\sum_{i\in\Omega} Q_i x_i^2 \pi_i\right)^2} + \frac{Q_j x_j \pi_j \sum_{k=1}^{N} d_i d_k \Theta_{ik} x_k}{\sum_{i\in\Omega} Q_i x_i^2 \pi_i} + \frac{Q_i x_i \pi_i \sum_{k=1}^{N} d_j d_k \Theta_{jk} x_k}{\sum_{i\in\Omega} Q_i x_i^2 \pi_i}   (5.10.9)

with Q_i > 0 being an assignable constant to form different estimators of the
regression coefficient, and V_p\left(\sum_{i\in s} d_i x_i I_{si}\right) = \frac{1}{2}\sum\sum_{i\neq j\in\Omega}\Theta_{ij}(d_i x_i - d_j x_j)^2. They
considered a class H of non-homogeneous quadratic unbiased estimators:

v_y = a_s + \sum_i b_{si} y_i^2 I_{si} + \sum\sum_{i\neq j} b_{sij} y_i y_j I_{sij}   (5.10.10)

where a_s, b_{si}, and b_{sij} are constants free from the y_i values and satisfy the
unbiasedness conditions

E_p(a_s) = 0, E_p(b_{si} I_{si}) = a_i, and E_p(b_{sij} I_{sij}) = a_{ij} .

Chaudhuri and Roy (1997a) derived the lower bound of the variance of an estimator
belonging to the class H under the following superpopulation model M:

E_m(y_i) = \mu_i, V_m(y_i) = \sigma_i^2, and C_m(y_i, y_j) = 0 for i \neq j,

and showed that the variance estimator

v_0 = \sum_i d_i a_i\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)I_{si} + \sum\sum_{i\neq j} a_{ij} d_{ij}\left(y_i y_j - \mu_i\mu_j\right)I_{sij} + \sum_i a_i\left(\sigma_i^2 + \mu_i^2\right) + \sum\sum_{i\neq j} a_{ij}\mu_i\mu_j   (5.10.11)

is optimal within H in the sense that its variance attains the lower bound. In the
next section, Arnab and Singh (2002a) have shown that the result concerning the
lower bound is incorrect and hence that the optimality property of v_0 cannot be
accepted.

The estimator v_0 at (5.10.11) cannot be used in practice since it involves the
unknown parameters \mu_i and \sigma_i^2. So Chaudhuri and Roy (1997a) proposed the
following alternative estimators when \mu_i = \beta x_i and \sigma_i^2 = \sigma^2 x_i^g, obtained by
replacing the unknown \beta and \sigma^2 with their suitable estimators:

v_3 = \sum_i a_i\left(y_i^2 - \hat{\beta}^2 x_i^2 - \hat{\sigma}^2 x_i^g\right)\frac{I_{si}}{\pi_i} + \hat{\sigma}^2\sum_i a_i x_i^g   (5.10.12)

and a second estimator v_4 of a similar form given by (5.10.13), where \hat{\beta} and \hat{\sigma}^2
denote suitable sample based estimators of \beta and \sigma^2.
5~10:L:CIIAUDIIURI AND ROY'Siq:SULTS

The lower bound of the estimator of variance by Chaudhuri and Roy (1997a) has
been restated in the following theorem:
Theorem 5.10.1.1.Under model M, and vy E H
Vm(Vy)~ L:a?(di -lh? + L:L:aJ(lfIij -lhij (5 .10 .1.1)

h 2 s: {2 2 \2 d (2 2V 2 2) 2 2
were 77i = vi - \Ui + fli J an 77ij = \Ui + fli AUj + flj - fl i flj .

The equality is attained in the above if the estimator of variance vy takes the form
vo=L:diai~?--CJ}-fl? frsi+L:L:a ijlflij&iYrfliflj YSij+L:a;{u?+fl? ~L:L:aijfliflj' (5 .10.1 .2)
Thus we have the following theorem:

Theorem 5.10.1.2. By relaxing the assumptions of Chaudhuri and Roy (1997a), a new expression for the variance of $v_0$ is given by
$$M\left(v_0\right) = E_m E_p\left(v_0 - v\right)^2 \qquad (5.10.1.3)$$

434 Advanced sampling theory with applications

where $\gamma_i = E_m\left(y_i^4\right) < \infty$.


Proof. Let $E_p$ ($E_m$), $V_p$ ($V_m$), and $C_p$ ($C_m$) denote the expectation, variance, and covariance with respect to the design $p$ (the model $M$). Then
$$M\left(v_0\right) = E_m E_p\left(v_0 - v\right)^2 = E_m V_p\left(v_0\right), \quad \text{since } E_p\left(v_0\right) = v.$$
Now
$$V_p\left(v_0\right) = V_p\left[\sum_i d_i a_i\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)I_{si} + \sum\sum_{i\neq j} a_{ij} d_{ij}\left(y_i y_j - \mu_i\mu_j\right)I_{sij}\right]$$
$$= V_p\left[\sum_i d_i a_i\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)I_{si}\right] + V_p\left[\sum\sum_{i\neq j} a_{ij} d_{ij}\left(y_i y_j - \mu_i\mu_j\right)I_{sij}\right] + 2C_p\left[\sum_i d_i a_i\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)I_{si},\; \sum\sum_{i\neq j} a_{ij} d_{ij}\left(y_i y_j - \mu_i\mu_j\right)I_{sij}\right]$$
$$= \sum_i a_i^2\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)^2 V_p\!\left(\frac{I_{si}}{\pi_i}\right) + \sum\sum_{i\neq j} a_i a_j\left(y_i^2 - \sigma_i^2 - \mu_i^2\right)\left(y_j^2 - \sigma_j^2 - \mu_j^2\right)C_p\!\left(\frac{I_{si}}{\pi_i}, \frac{I_{sj}}{\pi_j}\right) + 2\sum\sum_{i\neq j} a_{ij}^2\left(y_i y_j - \mu_i\mu_j\right)^2 V_p\!\left(\frac{I_{sij}}{\pi_{ij}}\right)$$
$$+ \sum\sum\sum\sum_{i\neq j\neq k\neq l} a_{ij} a_{kl}\left(y_i y_j - \mu_i\mu_j\right)\left(y_k y_l - \mu_k\mu_l\right)C_p\!\left(\frac{I_{sij}}{\pi_{ij}}, \frac{I_{skl}}{\pi_{kl}}\right) + 4\sum\sum\sum_{i\neq j\neq k} a_{ij} a_{ik}\left(y_i y_j - \mu_i\mu_j\right)\left(y_i y_k - \mu_i\mu_k\right)C_p\!\left(\frac{I_{sij}}{\pi_{ij}}, \frac{I_{sik}}{\pi_{ik}}\right)$$
$$+ 4\sum\sum_{i\neq j} a_i a_{ij}\left\{y_i^2 - \left(\sigma_i^2 + \mu_i^2\right)\right\}\left(y_i y_j - \mu_i\mu_j\right)C_p\!\left(\frac{I_{si}}{\pi_i}, \frac{I_{sij}}{\pi_{ij}}\right). \qquad (5.10.1.4)$$

Further noting that

( i ) $V_p\left(\dfrac{I_{si}}{\pi_i}\right) = \dfrac{\pi_i - \pi_i^2}{\pi_i^2} = \dfrac{1 - \pi_i}{\pi_i}$,

( ii ) $C_p\left(\dfrac{I_{si}}{\pi_i}, \dfrac{I_{sj}}{\pi_j}\right) = \dfrac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}$ for $i \neq j$,

( iii ) $V_p\left(\dfrac{I_{sij}}{\pi_{ij}}\right) = \dfrac{\pi_{ij} - \pi_{ij}^2}{\pi_{ij}^2} = \dfrac{1 - \pi_{ij}}{\pi_{ij}}$ for $i \neq j$,

( iv ) $C_p\left(\dfrac{I_{si}}{\pi_i}, \dfrac{I_{sij}}{\pi_{ij}}\right) = \dfrac{\pi_{ij} - \pi_i\pi_{ij}}{\pi_i\pi_{ij}}$ for $i \neq j$,

( v ) $C_p\left(\dfrac{I_{sij}}{\pi_{ij}}, \dfrac{I_{skl}}{\pi_{kl}}\right) = \dfrac{\pi_{ijkl} - \pi_{ij}\pi_{kl}}{\pi_{ij}\pi_{kl}}$ for distinct $i, j, k, l$,

( vi ) $C_p\left(\dfrac{I_{sij}}{\pi_{ij}}, \dfrac{I_{sik}}{\pi_{ik}}\right) = \dfrac{\pi_{ijk} - \pi_{ij}\pi_{ik}}{\pi_{ij}\pi_{ik}}$ for distinct $i, j, k$,

( vii ) $E_m\left\{y_i^2 - \left(\sigma_i^2 + \mu_i^2\right)\right\}^2 = E_m\left(y_i^4\right) - \left(\sigma_i^2 + \mu_i^2\right)^2 = \eta_i^2$,

( viii ) $E_m\left[\left\{y_i^2 - \left(\sigma_i^2 + \mu_i^2\right)\right\}\left\{y_j^2 - \left(\sigma_j^2 + \mu_j^2\right)\right\}\right] = 0$ for $i \neq j$,

( ix ) $E_m\left(y_i y_j - \mu_i\mu_j\right)^2 = \left(\sigma_i^2 + \mu_i^2\right)\left(\sigma_j^2 + \mu_j^2\right) - \mu_i^2\mu_j^2 = \eta_{ij}$ for $i \neq j$,

( x ) $E_m\left[\left(y_i y_j - \mu_i\mu_j\right)\left(y_k y_l - \mu_k\mu_l\right)\right] = 0$ for $i \neq j \neq k \neq l$,

( xi ) $E_m\left[\left(y_i y_j - \mu_i\mu_j\right)\left(y_i y_k - \mu_i\mu_k\right)\right] = \sigma_i^2\mu_j\mu_k$,

and

( xii ) $E_m\left[\left\{y_i^2 - \left(\sigma_i^2 + \mu_i^2\right)\right\}\left(y_i y_j - \mu_i\mu_j\right)\right] = \mu_j\left\{\lambda_i - \mu_i\left(\sigma_i^2 + \mu_i^2\right)\right\}$,

where $\lambda_i = E_m\left(y_i^3\right) < \infty$.

Hence we have
$$M\left(v_0\right) = E_m V_p\left(v_0\right) = \sum_i a_i^2\eta_i^2\left(d_i - 1\right) + 2\sum\sum_{i\neq j} a_{ij}^2\eta_{ij}\left(\psi_{ij} - 1\right) + 4\sum\sum\sum_{i\neq j\neq k}\sigma_i^2 a_{ij} a_{ik}\mu_j\mu_k\left(\pi_{ijk}\psi_{ij}\psi_{ik} - 1\right)$$
$$+ 4\sum\sum_{i\neq j} a_i a_{ij}\mu_j\left\{\lambda_i - \mu_i\left(\sigma_i^2 + \mu_i^2\right)\right\}\left(d_i - 1\right). \qquad (5.10.1.5)$$
i*j
Note that the results in (viii), (ix), (x), and (xi) are derived under the assumptions $C_m\left(y_i, y_j\right) = 0$, $C_m\left(y_i^2, y_j^2\right) = 0$, and $C_m\left(y_i^2, y_j\right) = 0$ for $i \neq j$, but these assumptions may also be relaxed. The following theorem states a new lower bound of variance:

Theorem 5.10.1.3. A new lower bound of the variance $M\left(v_0\right)$ is given by
$$\sum_i a_i^2\left(d_i - 1\right)\eta_i^2 + \sum\sum_{i\neq j} a_{ij}^2\left(\psi_{ij} - 1\right)\eta_{ij}. \qquad (5.10.1.6)$$
Proof. Following Chaudhuri and Roy (1997a) we can write
$$M\left(v_y\right) = E_m E_p\left(v_y - v\right)^2 = E_p V_m\left(v_y\right) + E_p\left\{E_m\left(v_y - v\right)\right\}^2 - V_m(v).$$
Now we have
$$V_m\left(v_y\right) = \sum_i b_{si}^2 I_{si} V_m\left(y_i^2\right) + 2\sum\sum_{i\neq j} b_{sij}^2 I_{sij} V_m\left(y_i y_j\right) + 4\sum\sum\sum_{i\neq j\neq k} b_{sij} b_{sik} I_{sij} I_{sik} C_m\left(y_i y_j, y_i y_k\right) + 4\sum\sum_{i\neq j} b_{si} b_{sij} I_{si} I_{sij} C_m\left(y_i^2, y_i y_j\right)$$
and
$$V_m(v) = \sum_i a_i^2 V_m\left(y_i^2\right) + 2\sum\sum_{i\neq j} a_{ij}^2 V_m\left(y_i y_j\right) + 4\sum\sum\sum_{i\neq j\neq k} a_{ij} a_{ik} C_m\left(y_i y_j, y_i y_k\right) + 4\sum\sum_{i\neq j} a_i a_{ij} C_m\left(y_i^2, y_i y_j\right).$$
Now noting
$$V_m\left(y_i^2\right) = \eta_i^2, \quad V_m\left(y_i y_j\right) = \eta_{ij}, \quad C_m\left(y_i y_j, y_i y_k\right) = \sigma_i^2\mu_j\mu_k,$$
and
$$C_m\left(y_i^2, y_i y_j\right) = \mu_j\left\{\lambda_i - \mu_i\left(\sigma_i^2 + \mu_i^2\right)\right\},$$
we have
$$M\left(v_y\right) = A_0 + B_0 \qquad (5.10.1.7)$$
where
$$A_0 = E_p\left(a_s - \tau_s\right)^2 + \sum_i E_p\left(b_{si} I_{si} - a_i\right)^2\eta_i^2 + 2\sum\sum_{i\neq j} E_p\left(b_{sij} I_{sij} - a_{ij}\right)^2\eta_{ij} \qquad (5.10.1.8)$$

436 Advanced sampling theory with applications

and
$$B_0 = 4\sum\sum\sum_{i\neq j\neq k}\left\{E_p\left(b_{sij} b_{sik} I_{sij} I_{sik}\right) - a_{ij} a_{ik}\right\}\sigma_i^2\mu_j\mu_k + 4\sum\sum_{i\neq j}\left\{E_p\left(b_{si} b_{sij} I_{si} I_{sij}\right) - a_i a_{ij}\right\}\mu_j\left\{\lambda_i - \mu_i\left(\sigma_i^2 + \mu_i^2\right)\right\} \qquad (5.10.1.9)$$
with
$$\tau_s = -\sum_i\left(\sigma_i^2 + \mu_i^2\right)\left(b_{si} I_{si} - a_i\right) - \sum\sum_{i\neq j}\mu_i\mu_j\left(b_{sij} I_{sij} - a_{ij}\right).$$
Now from Chaudhuri and Roy (1997a) we note that $A_0$ attains its minimum when $b_{si} = \pi_i^{-1}$, $b_{sij} = \pi_{ij}^{-1}$, and $a_s = \tau_s$, and the minimum value is
$$A_0 = \sum_i a_i^2\left(\frac{1}{\pi_i} - 1\right)\eta_i^2 + \sum\sum_{i\neq j} a_{ij}^2\left(\frac{1}{\pi_{ij}} - 1\right)\eta_{ij},$$
which Chaudhuri and Roy (1997a) claimed as the lower bound of $M\left(v_y\right)$. Obviously the claim is not justifiable under the new assumptions, because $B_0$ does not attain a minimum when $b_{si} = \pi_i^{-1}$ and $b_{sij} = \pi_{ij}^{-1}$.

5.10.2 CALIBRATED ESTIMATORS OF VARIANCE OF REGRESSION PREDICTOR
In this section, we discuss the calibration techniques of Arnab and Singh (2002a) under the following situations:
( i ) Model assisted calibration (MAC), when the regression of $y$ on $x$ passes through the origin;
( ii ) The regression is unknown, but the population variance $V_x$ of the GREG predictor for the auxiliary variable $x$ is known.

5.10.2.1 MODEL ASSISTED CALIBRATION

The Horvitz and Thompson (195 2) type estimator of the variance of the GREG
predictor is given by

VhJy)= '[,aidiYl ls i + '[,'[, a ij'l'ijYiY/sij' (5.10 .2.1.1)


i i* j
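As a quick numerical illustration, the HT-type variance estimator (5.10.2.1.1) can be evaluated directly once the sampled $y$ values, the known constants $a_i$, $a_{ij}$, and the inclusion probabilities are in hand. The short Python sketch below assumes $d_i = 1/\pi_i$ and $\psi_{ij} = 1/\pi_{ij}$; the function and argument names are this example's own, not the book's:

```python
def ht_variance_estimator(sample, y, a, a2, pi, pi2):
    """Horvitz-Thompson-type estimate of the variance of a GREG predictor,
    following (5.10.2.1.1):
        v_ht(y) = sum_i a_i y_i^2 / pi_i + sum_{i != j} a_ij y_i y_j / pi_ij,
    with both sums running over sampled units only.

    sample : list of sampled unit labels
    y      : dict unit -> study variable value
    a, a2  : dicts of the known constants a_i and a_ij
    pi, pi2: first- and second-order inclusion probabilities
    """
    total = sum(a[i] * y[i] ** 2 / pi[i] for i in sample)
    for i in sample:
        for j in sample:
            if i != j:
                total += a2[(i, j)] * y[i] * y[j] / pi2[(i, j)]
    return total
```

For two sampled units with $y = (2, 3)$, $a_i = 1$, $\pi_i = 0.5$, $a_{ij} = 0.5$, and $\pi_{ij} = 0.25$, the estimate is $8 + 18 + 12 + 12 = 50$.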
Now we consider a calibrated estimator of the variance of the regression predictor as
$$v_c = \sum_i a_i w_i\, y_i^2 I_{si} + \sum\sum_{i\neq j} a_{ij} w_{ij}\, y_i y_j I_{sij}, \qquad (5.10.2.1.2)$$
where $w_i$ and $w_{ij}$ are calibration weights, again obtained by minimising a compromised distance function
$$D = \sum_i\frac{\left(w_i a_i - d_i a_i\right)^2}{d_i a_i q_i} I_{si} + \sum\sum_{i\neq j}\frac{\left(a_{ij} w_{ij} - \psi_{ij} a_{ij}\right)^2}{\psi_{ij} a_{ij} q_{ij}} I_{sij} \qquad (5.10.2.1.3)$$

where $q_i$ and $q_{ij}$ are suitably chosen constants to form different kinds of estimators, subject to a model assisted calibration (MAC) constraint given by
$$E_m\left(v_c(y)\right) = E_m\left(v_y\right) \qquad (5.10.2.1.4)$$
or equivalently
$$\sum_i a_i w_i E_m\left(y_i^2\right)I_{si} + \sum\sum_{i\neq j} a_{ij} w_{ij} E_m\left(y_i y_j\right)I_{sij} = \sum_i a_i E_m\left(y_i^2\right) + \sum\sum_{i\neq j} a_{ij} E_m\left(y_i y_j\right). \qquad (5.10.2.1.5)$$
For the superpopulation model
$$M: y_i = \beta x_i + e_i \qquad (5.10.2.1.6)$$
such that $E_m\left(y_i\right) = \beta x_i$, $V_m\left(y_i\right) = \sigma^2 x_i^g$, and $C_m\left(y_i, y_j\right) = 0$, the calibration constraint (5.10.2.1.5) reduces to
$$\sum_i a_i w_i\left(\sigma^2 x_i^g + \beta^2 x_i^2\right)I_{si} + \sum\sum_{i\neq j} a_{ij} w_{ij}\beta^2 x_i x_j I_{sij} = \sum_i a_i\left(\sigma^2 x_i^g + \beta^2 x_i^2\right) + \sum\sum_{i\neq j} a_{ij}\beta^2 x_i x_j$$
$$= \sigma^2\sum_i a_i x_i^g + \beta^2\left\{\sum_i a_i x_i^2 + \sum\sum_{i\neq j} a_{ij} x_i x_j\right\}. \qquad (5.10.2.1.7)$$

On comparing the coefficients of $\sigma^2$ and $\beta^2$ on both sides of (5.10.2.1.7), the system of calibration equations becomes
$$\sum_i a_i w_i x_i^2 I_{si} + \sum\sum_{i\neq j} a_{ij} w_{ij} x_i x_j I_{sij} = \sum_i a_i x_i^2 + \sum\sum_{i\neq j} a_{ij} x_i x_j \qquad (5.10.2.1.8)$$
and
$$\sum_i a_i w_i x_i^g I_{si} = \sum_i a_i x_i^g. \qquad (5.10.2.1.9)$$
Note that a slightly different set of calibration constraints can also be developed by considering an autocorrelated model. Then we have the following theorems:

Theorem 5.10.2.1.1. If $V_x > 0$, that is, the first and second order inclusion probabilities are functions of another auxiliary variable, then the calibrated weights are given by
$$w_i a_i = d_i a_i + \frac{C\Delta_1 - B\Delta_2}{AC - B^2}\, d_i q_i a_i x_i^2 + \frac{A\Delta_2 - B\Delta_1}{AC - B^2}\, d_i q_i a_i x_i^g \qquad (5.10.2.1.10)$$
and
$$w_{ij} a_{ij} = \psi_{ij} a_{ij} + \frac{C\Delta_1 - B\Delta_2}{AC - B^2}\,\psi_{ij} q_{ij} a_{ij} x_i x_j, \qquad (5.10.2.1.11)$$
where
$$A = \sum_i a_i d_i q_i x_i^4 I_{si} + \sum\sum_{i\neq j} a_{ij}\psi_{ij} q_{ij} x_i^2 x_j^2 I_{sij}, \quad B = \sum_i a_i d_i q_i x_i^{g+2} I_{si}, \quad \Delta_1 = V_x - v_{ht}(x),$$
$$C = \sum_i a_i d_i q_i x_i^{2g} I_{si}, \quad \Delta_2 = \sum_i a_i x_i^g - \sum_i a_i d_i x_i^g I_{si},$$
and $v_{ht}(x) = \sum_i a_i d_i x_i^2 I_{si} + \sum\sum_{i\neq j} a_{ij}\psi_{ij} x_i x_j I_{sij}$.

Proof. In order to minimise (5.10.2.1.3) subject to (5.10.2.1.8) and (5.10.2.1.9), consider the Lagrange function
$$\phi = \sum_i\frac{\left(w_i a_i - d_i a_i\right)^2}{d_i a_i q_i} I_{si} + \sum\sum_{i\neq j}\frac{\left(a_{ij} w_{ij} - \psi_{ij} a_{ij}\right)^2}{\psi_{ij} a_{ij} q_{ij}} I_{sij} - 2\lambda\left\{\sum_i a_i w_i x_i^2 I_{si} + \sum\sum_{i\neq j} a_{ij} w_{ij} x_i x_j I_{sij} - \sum_i a_i x_i^2 - \sum\sum_{i\neq j} a_{ij} x_i x_j\right\} - 2\mu\left\{\sum_i a_i w_i x_i^g I_{si} - \sum_i a_i x_i^g\right\}. \qquad (5.10.2.1.12)$$
Now
$$\frac{\partial\phi}{\partial w_i} = 0 \;\Rightarrow\; w_i a_i = d_i a_i + d_i a_i q_i\left(\lambda x_i^2 + \mu x_i^g\right) \qquad (5.10.2.1.13)$$
and
$$\frac{\partial\phi}{\partial w_{ij}} = 0 \;\Rightarrow\; w_{ij} a_{ij} = \psi_{ij} a_{ij} + \lambda\psi_{ij} a_{ij} q_{ij} x_i x_j. \qquad (5.10.2.1.14)$$
On substituting (5.10.2.1.13) and (5.10.2.1.14) into (5.10.2.1.8) and (5.10.2.1.9) we have
$$\lambda = \frac{C\Delta_1 - B\Delta_2}{AC - B^2}, \quad \text{and} \quad \mu = \frac{A\Delta_2 - B\Delta_1}{AC - B^2}.$$
On substituting these values of $\lambda$ and $\mu$ into (5.10.2.1.13) and (5.10.2.1.14) we have the theorem.

Theorem 5.10.2.1.2. The new calibrated estimator of the variance is given by
$$v_c = v_{ht}(y) + \hat\beta^2\left(V_x - v_{ht}(x)\right) + \hat\sigma^2\left(\sum_i a_i x_i^g - \sum_i a_i d_i x_i^g I_{si}\right), \qquad (5.10.2.1.15)$$
where
$$\hat\beta^2 = \frac{CP - BQ}{AC - B^2} \quad \text{and} \quad \hat\sigma^2 = \frac{AQ - BP}{AC - B^2}$$
with
$$P = \sum_i a_i d_i q_i x_i^2 y_i^2 I_{si} + \sum\sum_{i\neq j} a_{ij}\psi_{ij} q_{ij} x_i x_j y_i y_j I_{sij}, \quad \text{and} \quad Q = \sum_i a_i d_i q_i x_i^g y_i^2 I_{si}.$$

Proof. Using (5.10.2.1.10) and (5.10.2.1.11) in (5.10.2.1.2) we have the theorem.
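A small numerical sketch of Theorem 5.10.2.1.2 may help. The Python below builds $A, B, C, P, Q$ and returns the calibrated value $v_c$ of (5.10.2.1.15); it assumes $d_i = 1/\pi_i$ and $\psi_{ij} = 1/\pi_{ij}$, and all names are hypothetical choices for this illustration. A useful sanity check is that when $y_i = x_i$ the estimator reproduces the known $V_x$ exactly, since then $P = A$ and $Q = B$, so $\hat\beta^2 = 1$ and $\hat\sigma^2 = 0$:

```python
def calibrated_variance(sample, y, x, a, a2, pi, pi2, q, q2, g, Vx, T1g):
    """Calibrated variance estimator in the spirit of (5.10.2.1.15):
        v_c = v_ht(y) + beta2*(Vx - v_ht(x)) + sigma2*(T1g - t1g_hat),
    where T1g = sum over the population of a_i * x_i**g (assumed known)."""
    d = {i: 1.0 / pi[i] for i in sample}
    pairs = [(i, j) for i in sample for j in sample if i != j]
    psi = {p: 1.0 / pi2[p] for p in pairs}

    def single(f):    # sum_i a_i d_i q_i f(i) over the sample
        return sum(a[i] * d[i] * q[i] * f(i) for i in sample)

    def double(f):    # sum_{i != j} a_ij psi_ij q_ij f(i, j) over sampled pairs
        return sum(a2[p] * psi[p] * q2[p] * f(*p) for p in pairs)

    A = single(lambda i: x[i] ** 4) + double(lambda i, j: x[i] ** 2 * x[j] ** 2)
    B = single(lambda i: x[i] ** (g + 2))
    C = single(lambda i: x[i] ** (2 * g))
    P = single(lambda i: x[i] ** 2 * y[i] ** 2) + double(lambda i, j: x[i] * x[j] * y[i] * y[j])
    Q = single(lambda i: x[i] ** g * y[i] ** 2)
    beta2 = (C * P - B * Q) / (A * C - B * B)     # estimate of beta^2
    sigma2 = (A * Q - B * P) / (A * C - B * B)    # estimate of sigma^2

    def vht(z):       # HT-type variance estimator, as in (5.10.2.1.1)
        return sum(a[i] * d[i] * z[i] ** 2 for i in sample) + \
               sum(a2[p] * psi[p] * z[p[0]] * z[p[1]] for p in pairs)

    t1g_hat = sum(a[i] * d[i] * x[i] ** g for i in sample)
    return vht(y) + beta2 * (Vx - vht(x)) + sigma2 * (T1g - t1g_hat)
```

Passing `y = x` to this function returns exactly the supplied `Vx`, illustrating the calibration property.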

Case I. For SRSWOR sampling with $Q_i = q_i = 1/x_i$, $q_{ij} = 1/\left(x_i x_j\right)$, we have $\pi_i = n/N$, $\pi_{ij} = n(n-1)/\left(N(N-1)\right)$,
$$a_i = \frac{N(1-f)}{n} + \frac{(1-f)}{n}C_x^2 - \frac{2N(1-f)}{n(N-1)}\left(\frac{x_i - \bar X}{\bar X}\right) \equiv a_{io} \qquad (5.10.2.1.16)$$
and
$$a_{ij} = \frac{N(1-f)}{n(N-1)} + \frac{(1-f)}{n}C_x^2 - \frac{N(1-f)}{n(N-1)\bar X}\left\{\left(x_i - \bar X\right) + \left(x_j - \bar X\right)\right\} \equiv a_{ijo}, \qquad (5.10.2.1.17)$$
where $f = n/N$ and $\bar X = N^{-1}\sum_{i=1}^N x_i$.

In this situation $\hat Y_g$ reduces to the ratio estimator, viz.
$$\hat Y_g = N\bar y\left(\frac{\bar X}{\bar x}\right) = \hat Y_{ratio}.$$
The conventional Horvitz and Thompson (1952) and the calibrated variance estimators become
$$v_{ht}\left(\hat Y_{ratio}\right)_{stage=1} = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j. \qquad (5.10.2.1.18)$$
The second stage calibrated estimator of variance of the ratio estimator becomes
$$v_{ht}\left(\hat Y_{ratio}\right)_{stage=2} = v_{ht}\left(\hat Y_{ratio}\right)_{stage=1} + \hat\gamma_{1go}\left(V_x - \hat v_x\right) + \hat\gamma_{2go}\left(\sum_{i\in\Omega} a_{io} x_i^g - \frac{N}{n}\sum_{i\in s} a_{io} x_i^g\right) \qquad (5.10.2.1.19)$$
where
$$\hat v_x = \frac{N}{n}\sum_{i\in s} a_{io} x_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j,$$
$$\hat\gamma_{1go} = \frac{\left(\sum\limits_{i\in s} a_{io} x_i^{2g-1}\right)\left(\sum\limits_{i\in s} a_{io} x_i y_i^2 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} y_i y_j\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+1}\right)\left(\sum\limits_{i\in s} a_{io} x_i^{g-1} y_i^2\right)}{\left(\sum\limits_{i\in s} a_{io} x_i^3 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i x_j\right)\left(\sum\limits_{i\in s} a_{io} x_i^{2g-1}\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+1}\right)^2}$$
and
$$\hat\gamma_{2go} = \frac{\left(\sum\limits_{i\in s} a_{io} x_i^3 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i x_j\right)\left(\sum\limits_{i\in s} a_{io} x_i^{g-1} y_i^2\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+1}\right)\left(\sum\limits_{i\in s} a_{io} x_i y_i^2 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} y_i y_j\right)}{\left(\sum\limits_{i\in s} a_{io} x_i^3 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i x_j\right)\left(\sum\limits_{i\in s} a_{io} x_i^{2g-1}\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+1}\right)^2}.$$

Case II. For SRSWOR sampling with $q_i = Q_i = 1$, $q_{ij} = 1$, we have $\pi_i = n/N$, $\pi_{ij} = n(n-1)/\left(N(N-1)\right)$,
$$a_i = \frac{N(1-f)}{n}\left[1 + \frac{N S_x^2}{\left(\sum_{j\in\Omega} x_j\right)^2} - \frac{2N x_i\left(x_i - \bar X\right)}{(N-1)\sum_{j\in\Omega} x_j^2}\right] \equiv a_{io} \qquad (5.10.2.1.20)$$
and the corresponding second-order constant $a_{ij} \equiv a_{ijo}$. (5.10.2.1.21)

In this situation $\hat Y_g$ reduces to the regression estimator
$$\hat Y_{reg} = N\left[\bar y + \hat\beta_{ds}\left(\bar X - \bar x\right)\right], \quad \text{where} \quad \hat\beta_{ds} = \left(\sum_{i\in s} x_i y_i\right)\Big/\left(\sum_{i\in s} x_i^2\right).$$


The Horvitz and Thompson (1952) and calibrated variance estimators are respectively given by
$$v_{ht}\left(\hat Y_{reg}\right)_{stage=1} = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j \qquad (5.10.2.1.22)$$
and
$$v_{ht}\left(\hat Y_{reg}\right)_{stage=2} = v_{ht}\left(\hat Y_{reg}\right)_{stage=1} + \hat\gamma_{1g}\left(V_x - \hat v_x\right) + \hat\gamma_{2g}\left(\sum_{i\in\Omega} a_{io} x_i^g - \frac{N}{n}\sum_{i\in s} a_{io} x_i^g\right) \qquad (5.10.2.1.23)$$
where
$$\hat v_x = \frac{N}{n}\sum_{i\in s} a_{io} x_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j,$$
$$\hat\gamma_{1g} = \frac{\left(\sum\limits_{i\in s} a_{io} x_i^{2g}\right)\left(\sum\limits_{i\in s} a_{io} x_i^2 y_i^2 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i x_j y_i y_j\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+2}\right)\left(\sum\limits_{i\in s} a_{io} x_i^g y_i^2\right)}{\left(\sum\limits_{i\in s} a_{io} x_i^4 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i^2 x_j^2\right)\left(\sum\limits_{i\in s} a_{io} x_i^{2g}\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+2}\right)^2}$$
and
$$\hat\gamma_{2g} = \frac{\left(\sum\limits_{i\in s} a_{io} x_i^4 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i^2 x_j^2\right)\left(\sum\limits_{i\in s} a_{io} x_i^g y_i^2\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+2}\right)\left(\sum\limits_{i\in s} a_{io} x_i^2 y_i^2 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i x_j y_i y_j\right)}{\left(\sum\limits_{i\in s} a_{io} x_i^4 + \frac{N-1}{n-1}\sum\sum\limits_{i\neq j\in s} a_{ijo} x_i^2 x_j^2\right)\left(\sum\limits_{i\in s} a_{io} x_i^{2g}\right) - \left(\sum\limits_{i\in s} a_{io} x_i^{g+2}\right)^2}.$$

5.10.2.2 CALIBRATION ESTIMATORS WHEN THE VARIANCE OF THE AUXILIARY VARIABLE IS KNOWN

Arnab and Singh (2002a) proposed an alternative calibrated estimator of the variance of the regression predictor as
$$v_p = a_s + \sum_i b_{si}^*\, y_i^2 I_{si} + \sum\sum_{i\neq j} b_{sij}^*\, y_i y_j I_{sij} \qquad (5.10.2.2.1)$$
where $b_{si}^*$ and $b_{sij}^*$ are the modified weights. Now the variance of the regression predictor of the auxiliary variable is obtained by replacing $y_i$ with $x_i$ in (5.10.7) and is given by
$$V_x = \sum_i a_i x_i^2 + \sum\sum_{i\neq j} a_{ij} x_i x_j. \qquad (5.10.2.2.2)$$

An estimator of the variance of the regression predictor of the auxiliary variable is obtained by substituting $x_i$ for $y_i$, given by
$$\hat v_x = a_s + \sum_i b_{si} x_i^2 I_{si} + \sum\sum_{i\neq j} b_{sij} x_i x_j I_{sij}. \qquad (5.10.2.2.3)$$

Thus the choice of these weights can be made in several ways.

Consider the case where each component of the decomposition of $V_x = \sum_i a_i x_i^2 + \sum\sum_{i\neq j} a_{ij} x_i x_j$ is known. Let $\sum_i a_i x_i^2 = T_1(x)$ and $\sum\sum_{i\neq j} a_{ij} x_i x_j = T_2(x)$. The calibrated weights $b_{si}^*$ and $b_{sij}^*$ are obtained by minimizing the CS distance functions
$$\sum_i\frac{\left(b_{si}^* - b_{si}\right)^2}{b_{si} q_{si}} I_{si} \quad \text{and} \quad \sum\sum_{i\neq j}\frac{\left(b_{sij}^* - b_{sij}\right)^2}{b_{sij} q_{sij}} I_{sij}$$
subject to $\sum_i b_{si}^* x_i^2 I_{si} = T_1(x)$ and $\sum\sum_{i\neq j} b_{sij}^* x_i x_j I_{sij} = T_2(x)$, respectively.

Minimisation yields
$$b_{si}^* = b_{si} + \frac{b_{si} q_{si} x_i^2}{\sum_i b_{si} q_{si} x_i^4 I_{si}}\left[T_1(x) - \sum_i b_{si} x_i^2 I_{si}\right] \qquad (5.10.2.2.4)$$
and
$$b_{sij}^* = b_{sij} + \frac{b_{sij} q_{sij} x_i x_j}{\sum\sum_{i\neq j} b_{sij} q_{sij} x_i^2 x_j^2 I_{sij}}\left[T_2(x) - \sum\sum_{i\neq j} b_{sij} x_i x_j I_{sij}\right]. \qquad (5.10.2.2.5)$$
On putting the values of $b_{si}^*$ and $b_{sij}^*$ from (5.10.2.2.4) and (5.10.2.2.5) in (5.10.2.2.1), we obtain another calibrated estimator of the variance of the regression predictor as
$$v_{p1} = v_y + \hat b_1\left[T_1(x) - \sum_i b_{si} x_i^2 I_{si}\right] + \hat b_2\left[T_2(x) - \sum\sum_{i\neq j} b_{sij} x_i x_j I_{sij}\right], \qquad (5.10.2.2.6)$$
where
$$\hat b_1 = \sum_i b_{si} q_{si} x_i^2 y_i^2 I_{si}\Big/\sum_i b_{si} q_{si} x_i^4 I_{si} \quad \text{and} \quad \hat b_2 = \sum\sum_{i\neq j} b_{sij} q_{sij} x_i x_j y_i y_j I_{sij}\Big/\sum\sum_{i\neq j} b_{sij} q_{sij} x_i^2 x_j^2 I_{sij}.$$
Case I. Consider an SRSWOR design with $a_s = 0$, $b_{si} = \frac{N}{n} a_{io}$ and $b_{sij} = \frac{N(N-1)}{n(n-1)} a_{ijo}$, $q_i = Q_i = x_i^{-1}$, $q_{sij} = \left(x_i x_j\right)^{-1}$, where $a_{io}$ and $a_{ijo}$ are given in (5.10.2.1.16) and (5.10.2.1.17). In this case the variance estimator for the ratio estimator $\hat Y_{ratio}$ is given by
$$v_{p1}\left(\hat Y_{ratio}\right) = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j + \hat\gamma_{1o}\left[T_1(x) - \frac{N}{n}\sum_{i\in s} a_{io} x_i^2\right] + \hat\gamma_{2o}\left[T_2(x) - \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j\right] \qquad (5.10.2.2.7)$$

where
$$\hat\gamma_{1o} = \sum_{i\in s} a_{io} x_i y_i^2\Big/\sum_{i\in s} a_{io} x_i^3, \quad \hat\gamma_{2o} = \sum\sum_{i\neq j\in s} a_{ijo} y_i y_j\Big/\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j, \quad \sum_{i\in\Omega} a_{io} x_i^2 = T_1(x),$$
and $\sum\sum_{i\neq j\in\Omega} a_{ijo} x_i x_j = T_2(x)$.

Case II. Consider an SRSWOR design with $a_s = 0$, $b_{si} = (N/n)\, a_{io}$ and $b_{sij} = \left\{N(N-1)\right\}/\left\{n(n-1)\right\}\, a_{ijo}$, $q_i = Q_i = 1$, $q_{sij} = 1$, where $a_{io}$ and $a_{ijo}$ are as given in (5.10.2.1.20) and (5.10.2.1.21). In this case we have
$$v_{p1}\left(\hat Y_{reg}\right) = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j + \hat\gamma_{1o}\left[T_1(x) - \frac{N}{n}\sum_{i\in s} a_{io} x_i^2\right] + \hat\gamma_{2o}\left[T_2(x) - \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j\right] \qquad (5.10.2.2.8)$$
where $\sum_{i\in\Omega} a_{io} x_i^2 = T_1(x)$ and $\sum\sum_{i\neq j\in\Omega} a_{ijo} x_i x_j = T_2(x)$.

Arnab and Singh (2002a) further considered the situation when $V_x$, the variance of the regression predictor for the auxiliary variable, is known but $T_1(x)$ and $T_2(x)$ are unknown, and determined the calibrated weights $b_{si}^*$ and $b_{sij}^*$ by minimizing
$$D = \sum_i\frac{\left(b_{si}^* - b_{si}\right)^2}{b_{si} q_{si}} + \sum\sum_{i\neq j}\frac{\left(b_{sij}^* - b_{sij}\right)^2}{b_{sij} q_{sij}} \qquad (5.10.2.2.9)$$
subject to the calibration constraint
$$a_s + \sum_i b_{si}^* x_i^2 I_{si} + \sum\sum_{i\neq j} b_{sij}^* x_i x_j I_{sij} = V_x. \qquad (5.10.2.2.10)$$
Minimisation yields
$$b_{si}^* = b_{si} + \frac{b_{si} q_{si} x_i^2 I_{si}\left(V_x - \hat v_x\right)}{\sum_i b_{si} q_{si} x_i^4 I_{si} + \sum\sum_{i\neq j} b_{sij} q_{sij} x_i^2 x_j^2 I_{sij}} \qquad (5.10.2.2.11)$$
and
$$b_{sij}^* = b_{sij} + \frac{b_{sij} q_{sij} x_i x_j I_{sij}\left(V_x - \hat v_x\right)}{\sum_i b_{si} q_{si} x_i^4 I_{si} + \sum\sum_{i\neq j} b_{sij} q_{sij} x_i^2 x_j^2 I_{sij}}. \qquad (5.10.2.2.12)$$
The resultant estimator of the variance of the regression predictor is given by
$$v_{p2} = v_y + \hat\gamma_3\left(V_x - \hat v_x\right), \quad \text{where} \quad \hat\gamma_3 = \frac{\sum_i b_{si} q_{si} x_i^2 y_i^2 I_{si} + \sum\sum_{i\neq j} b_{sij} q_{sij} x_i x_j y_i y_j I_{sij}}{\sum_i b_{si} q_{si} x_i^4 I_{si} + \sum\sum_{i\neq j} b_{sij} q_{sij} x_i^2 x_j^2 I_{sij}}. \qquad (5.10.2.2.13)$$
If we assume $b_{sij}$ (and hence $b_{sij}^*$) equal to zero for all values of $i$ and $j$ in the sample, then the proposed strategy is an improved version of the estimator studied by Sarndal (1996).

Case I. Consider an SRSWOR design with $a_s = 0$, $b_{si} = (N/n)\, a_{io}$ and $b_{sij} = \left\{N(N-1)/n(n-1)\right\}\, a_{ijo}$, $q_i = Q_i = x_i^{-1}$, $q_{sij} = \left(x_i x_j\right)^{-1}$, where $a_{io}$ and $a_{ijo}$ are as given earlier. In this case the estimator of the variance of the ratio estimator $\hat Y_{ratio}$ is
$$v_{p2}\left(\hat Y_{ratio}\right) = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j + \hat\gamma_3\left[V_x - \left\{\frac{N}{n}\sum_{i\in s} a_{io} x_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j\right\}\right] \qquad (5.10.2.2.14)$$
where
$$\hat\gamma_3 = \frac{\sum_{i\in s} a_{io} x_i y_i^2 + \frac{N-1}{n-1}\sum\sum_{i\neq j\in s} a_{ijo} y_i y_j}{\sum_{i\in s} a_{io} x_i^3 + \frac{N-1}{n-1}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j} \quad \text{and} \quad V_x = \sum_{i\in\Omega} a_{io} x_i^2 + \sum\sum_{i\neq j\in\Omega} a_{ijo} x_i x_j.$$

Case II. Consider an SRSWOR design with $a_s = 0$, $b_{si} = (N/n)\, a_{io}$ and $b_{sij} = \left\{N(N-1)/n(n-1)\right\}\, a_{ijo}$, $q_i = Q_i = 1$, $q_{sij} = 1$, where $a_{io}$ and $a_{ijo}$ are given earlier. In this case the estimator of the variance of the regression estimator $\hat Y_{reg}$ is
$$v_{p2}\left(\hat Y_{reg}\right) = \frac{N}{n}\sum_{i\in s} a_{io}\, y_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo}\, y_i y_j + \hat\gamma_3\left[V_x - \left\{\frac{N}{n}\sum_{i\in s} a_{io} x_i^2 + \frac{N(N-1)}{n(n-1)}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j\right\}\right] \qquad (5.10.2.2.15)$$
where
$$\hat\gamma_3 = \frac{\sum_{i\in s} a_{io} x_i^2 y_i^2 + \frac{N-1}{n-1}\sum\sum_{i\neq j\in s} a_{ijo} x_i x_j y_i y_j}{\sum_{i\in s} a_{io} x_i^4 + \frac{N-1}{n-1}\sum\sum_{i\neq j\in s} a_{ijo} x_i^2 x_j^2} \quad \text{and} \quad V_x = \sum_{i\in\Omega} a_{io} x_i^2 + \sum\sum_{i\neq j\in\Omega} a_{ijo} x_i x_j.$$
Brewer (1999) said, "It is appropriate to estimate the anticipated variance for sample design purposes, but for the analysis of any particular sample the prediction variance is the more logical choice."

Thus the next section is devoted to finding the prediction variance of the calibrated estimator of variance.

The prediction variance is given by
$$E_m\left(v_c - v_y\right)^2 = E_m\left[\sum_{i\in s}\left(w_i - a_i\right)y_i^2 + \sum\sum_{i\neq j\in s}\left(w_{ij} - a_{ij}\right)y_i y_j - \left\{\sum_{i\notin s} a_i y_i^2 + \sum\sum_{i\neq j\notin s} a_{ij} y_i y_j\right\}\right]^2$$
$$= E_m\left[\sum_{i\in\Omega} w_i I_{si} y_i^2 + \sum\sum_{i\neq j\in\Omega} w_{ij} I_{sij} y_i y_j - \left\{\sum_{i\in\Omega} a_i y_i^2 + \sum\sum_{i\neq j\in\Omega} a_{ij} y_i y_j\right\}\right]^2$$
$$= \sum_{i\in s}\left(w_i^2 - 2 w_i a_i\right)m_{4i} + \sum_{i\in\Omega} a_i^2 m_{4i} + \sum\sum_{i\neq j\in s} w_i w_j m_{2i} m_{2j} - 2\sum_{i\in s} w_i m_{2i}\sum_{j\in\Omega} a_j m_{2j} + 2\sum_{i\in s} w_i a_i m_{2i}^2 + \sum\sum_{i\neq j\in\Omega} a_i a_j m_{2i} m_{2j}$$
$$+\; 2\beta^2\left[\sum\sum\sum_{i\neq j\neq k\in s} w_i w_{jk} m_{2i} x_j x_k - \sum_{i\in s} w_i m_{2i}\sum\sum_{j\neq k\in\Omega} a_{jk} x_j x_k - \sum\sum_{j\neq k\in s} w_{jk} x_j x_k\sum_{i\in\Omega} a_i m_{2i} + \sum\sum\sum_{i\neq j\neq k\in\Omega} a_i a_{jk} m_{2i} x_j x_k\right],$$
where $m_{2i} = E_m\left(y_i^2\right)$ and $m_{4i} = E_m\left(y_i^4\right)$.

We will now discuss estimators that take into account the order of the units selected in the sample, as well as those that ignore the order of the units. It is remarkable here that:
( a ) For each ordered estimator, we can find an unordered estimator;
( b ) The unordered estimator is more efficient than its corresponding ordered estimator.
Let us discuss each of these situations:

5.11.1 ORDERED ESTIMATORS

Suppose that $n$ units are selected in the sample in the order $y_1, y_2, \ldots, y_n$ with selection probabilities $p_1, p_2, \ldots, p_n$, respectively. Then an estimator of the population total can be defined as
$$\hat T_i = \sum_{j=1}^{i-1} y_j + \frac{y_i}{p_i}\left[1 - \sum_{j=1}^{i-1} p_j\right] \quad \text{for } i = 1, 2, \ldots, n. \qquad (5.11.1.1)$$
Thus we have $n$ possible estimators of the population total, and each estimator is found to be unbiased.
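For concreteness, the $n$ ordered estimators of (5.11.1.1), their average, and the unbiased variance estimator derived later in Theorem 5.11.1.3 can be coded in a few lines. The Python sketch below takes the $y$ values and initial selection probabilities in the order the units were drawn; the function names are this illustration's own:

```python
def raj_ordered_estimators(y, p):
    """Des Raj's ordered estimators T_i of (5.11.1.1); y[k] and p[k] are
    the value and initial selection probability of the unit obtained at
    draw k (k = 0, ..., n-1)."""
    ests = []
    for i in range(len(y)):
        head = sum(y[:i])                 # units from the first i-1 draws
        remaining = 1.0 - sum(p[:i])      # probability mass still unselected
        ests.append(head + (y[i] / p[i]) * remaining)
    return ests


def raj_total_and_variance(y, p):
    """Average of the ordered estimators together with the unbiased
    variance estimator v(T) = (sum T_i^2 - n*T^2) / (n(n-1))."""
    t = raj_ordered_estimators(y, p)
    n = len(t)
    t_bar = sum(t) / n
    v = (sum(ti ** 2 for ti in t) - n * t_bar ** 2) / (n * (n - 1))
    return t_bar, v
```

For $n = 2$ with $y = (2, 3)$ and $p = (0.4, 0.2)$ this gives $\hat T_1 = 5$, $\hat T_2 = 11$, $\hat T = 8$, and $v(\hat T) = 9$, which agrees with the closed form $\frac{1}{4}(1-p_1)^2(y_1/p_1 - y_2/p_2)^2$ appearing later for samples of two units.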

Theorem 5.11.1.1. Each of the estimators $\hat T_i$, $i = 1, 2, \ldots, n$, is an unbiased estimator of the population total $Y$.

Proof. Here we assume that the results of the first $(i-1)$ draws are known. The theorem is proved if we show that $E(\hat T_i) = Y$. Now we have
$$E\left(\hat T_i\right) = E_1 E_2\left[\hat T_i \mid \text{results of the first } (i-1) \text{ draws}\right] \qquad (5.11.1.2)$$
since $E_2\left[\hat T_i \mid \cdot\right]$ is a sum of two terms. The first term is the sum of the units selected in the first $(i-1)$ draws, whereas the second term is obtained as follows. The number of units in the population is $N$, and $(i-1)$ units are drawn in the first $(i-1)$ draws, so at the time of the $i$th draw the number of units available is $N - (i-1) = N - i + 1$. Now suppose that the $j$th unit is selected out of the $(N-i+1)$ units at the $i$th draw. The conditional probability of selecting the $j$th unit at the $i$th draw is
$$p_j^* = p_j\Big/\left[1 - \sum_{j=1}^{i-1} p_j\right] \qquad (5.11.1.3)$$
where $1 - \sum_{j=1}^{i-1} p_j$ is the sum of the probabilities of the $(N-i+1)$ remaining units. Therefore
$$E_2\left[\hat T_i \mid \cdot\right] = \sum_{j=1}^{i-1} y_j + \sum_{j\,\text{remaining}} p_j^*\,\frac{y_j}{p_j}\left\{1 - \sum_{j=1}^{i-1} p_j\right\} = \sum_{j=1}^{i-1} y_j + \left[Y - \sum_{j=1}^{i-1} y_j\right] = Y,$$
noting that $p_j^* = p_j\left(1 - \sum_{j=1}^{i-1} p_j\right)^{-1}$. Therefore
$$E\left(\hat T_i\right) = E_1(Y) = Y. \qquad (5.11.1.4)$$
Hence the theorem.

Corollary 5.11.1.1. The estimator of the population total $Y$ given by
$$\hat T = \frac{1}{n}\sum_{i=1}^n \hat T_i \qquad (5.11.1.5)$$
is an unbiased estimator of the population total.

Proof. It follows from (5.11.1.4).

Theorem 5.11.1.2. The ordered estimators $\hat T_i$ and $\hat T_j$ for $i \neq j$ are uncorrelated.

Proof. We have to show that $\text{cov}\left(\hat T_i, \hat T_j\right) = 0$ for all $i \neq j$. For simplicity, let us assume that $i < j$ and that the results up to the $i$th draw are known. Now we know that
$$\text{cov}\left(\hat T_i, \hat T_j\right) = E_1 C_2\left[\hat T_i, \hat T_j \mid A_{j-1}\right] + C_1\left[E_2\left(\hat T_i \mid A_{j-1}\right), E_2\left(\hat T_j \mid A_{j-1}\right)\right] \qquad (5.11.1.6)$$
where
$E_1, E_2$ = conditional expected values,
$C_1, C_2$ = conditional covariance terms,
$A_{j-1}$ = results of the first $(j-1)$ draws, which are known,
$\hat T_j$ = involves the outcome of the $j$th draw and is a random variable,
and
$\hat T_i$ = involves the outcome of the $i$th draw and is constant for a given $A_{j-1}$.
The covariance between a constant and a random variable is zero; therefore
$$C_2\left(\hat T_i, \hat T_j \mid A_{j-1}\right) = 0. \qquad (5.11.1.7)$$
Now
$$C_1\left[E_2\left(\hat T_i \mid A_{j-1}\right), E_2\left(\hat T_j \mid A_{j-1}\right)\right] = C_1\left[\hat T_i, Y\right] = 0. \qquad (5.11.1.8)$$
Using (5.11.1.7) and (5.11.1.8) in (5.11.1.6) we have
$$\text{cov}\left(\hat T_i, \hat T_j\right) = 0. \qquad (5.11.1.9)$$
Hence the theorem.

Theorem 5.11.1.3. An unbiased estimator of the variance $V(\hat T)$ is given by
$$v\left(\hat T\right) = \frac{1}{n(n-1)}\left[\sum_{i=1}^n \hat T_i^2 - n\hat T^2\right]. \qquad (5.11.1.10)$$
Proof. The theorem is proved if we can show that
$$E\left[v\left(\hat T\right)\right] = V\left(\hat T\right), \qquad (5.11.1.11)$$
where
$$V\left(\hat T\right) = V\left[\frac{1}{n}\sum_{i=1}^n \hat T_i\right] = \frac{1}{n^2}\sum_{i=1}^n V\left(\hat T_i\right). \qquad (5.11.1.12)$$
Then we have
$$E\left[v\left(\hat T\right)\right] = \frac{1}{n(n-1)} E\left[\sum_{i=1}^n \hat T_i^2 - n\hat T^2\right] = \frac{1}{n(n-1)}\left[\sum_{i=1}^n E\left(\hat T_i^2\right) - n E\left(\hat T^2\right)\right]. \qquad (5.11.1.13)$$
Now
$$E\left(\hat T_i^2\right) = V\left(\hat T_i\right) + \left\{E\left(\hat T_i\right)\right\}^2 = V\left(\hat T_i\right) + Y^2 \quad \text{and} \quad E\left(\hat T^2\right) = V\left(\hat T\right) + Y^2.$$
Using these in (5.11.1.13) we have
$$E\left[v\left(\hat T\right)\right] = \frac{1}{n(n-1)}\left[\sum_{i=1}^n\left\{V\left(\hat T_i\right) + Y^2\right\} - n\left\{V\left(\hat T\right) + Y^2\right\}\right] = V\left(\hat T\right).$$
Hence the theorem.

For simplicity we restrict ourselves to the case of a sample of two units. Let $y_i$ and $y_j$ be the values of the units selected with varying probabilities and without replacement in a sample of size two, and let $p_i$ and $p_j$ be the corresponding initial selection probabilities. If $y_i$ and $y_j$ are the values of the units drawn at the first and second draws respectively, then Raj's ordered estimator of the population total $Y$ corresponding to the ordered sample $s_1 = \left(y_i, y_j\right)$ is given by
$$\hat Y_{1o} = \frac{1}{2}\left\{\left(1 + p_i\right)\frac{y_i}{p_i} + \left(1 - p_i\right)\frac{y_j}{p_j}\right\} \qquad (5.11.1.14)$$
and an estimator of the variance $V\left(\hat Y_{1o}\right)$ is given by
$$v\left(\hat Y_{1o}\right) = \frac{1}{4}\left(1 - p_i\right)^2\left(\frac{y_i}{p_i} - \frac{y_j}{p_j}\right)^2. \qquad (5.11.1.15)$$
If $y_j$ and $y_i$ are the values of the units drawn at the first and second draws respectively, then Raj's ordered estimator of the population total $Y$ corresponding to the ordered sample $s_2 = \left(y_j, y_i\right)$ is given by
$$\hat Y_{2o} = \frac{1}{2}\left\{\left(1 + p_j\right)\frac{y_j}{p_j} + \left(1 - p_j\right)\frac{y_i}{p_i}\right\} \qquad (5.11.1.16)$$
and an estimator of the variance $V\left(\hat Y_{2o}\right)$ is given by
$$v\left(\hat Y_{2o}\right) = \frac{1}{4}\left(1 - p_j\right)^2\left(\frac{y_i}{p_i} - \frac{y_j}{p_j}\right)^2. \qquad (5.11.1.17)$$
The probabilities of selecting the two ordered samples are respectively
$$P\left(s_1\right) = p_i p_j/\left(1 - p_i\right), \quad \text{and} \quad P\left(s_2\right) = p_i p_j/\left(1 - p_j\right).$$
Then Raj's (1956) unordered estimator of the population total $Y$ is defined as
$$\hat T_{raj} = \frac{P\left(s_1\right)\hat Y_{1o} + P\left(s_2\right)\hat Y_{2o}}{P\left(s_1\right) + P\left(s_2\right)}. \qquad (5.11.1.18)$$

Theorem 5.11.1.4. An expression for the variance of the estimator $\hat T_{raj}$ for a sample of two units is given by
$$V\left(\hat T_{raj}\right) = \left(1 - \frac{1}{2}\sum_{i=1}^N p_i^2\right)\frac{1}{2}\sum_{i=1}^N p_i\left(\frac{y_i}{p_i} - Y\right)^2 - \frac{1}{4}\sum_{i=1}^N p_i^2\left(\frac{y_i}{p_i} - Y\right)^2. \qquad (5.11.1.19)$$
The variance expression (5.11.1.19) becomes too complicated for $n > 2$.

Pathak (1967a, 1967b) provides an upper bound for the variance of Raj's estimator for any sample size $n$, given by
$$V\left(\hat T_{raj}\right) \leq \frac{1}{n}\sum_{i=1}^N p_i\left(\frac{y_i}{p_i} - Y\right)^2 - \left\{1 + O(n)\right\}\frac{(n-1)}{2nN^2}\left[N\sum_{i=1}^N p_i^2\sum_{i=1}^N p_i\left(\frac{y_i}{p_i} - Y\right)^2 + \sum_{i=1}^N p_i^2\left(\frac{y_i}{p_i} - Y\right)^2\right].$$

Under the assumption of a large population size, Mukhopadhyay (1977) showed that
$$V\left(\hat T_{raj}\right) = \frac{1}{2n}\sum\sum_{i\neq j=1}^N\left(\frac{y_i}{p_i} - \frac{y_j}{p_j}\right)^2 p_i p_j\left[1 - n^{-1}\left(p_i + p_j\right) + \frac{(n-1)(n-2)}{n^2}\left(p_i^2 + p_j^2 + p_i p_j\right) - \frac{(n-1)(n-2)}{2n^2}\left(p_i + p_j\right)\sum_{l=1}^N p_l^2\right].$$
The admissibility of the estimator $\hat T_{raj}$ for PPSWOR samples of two units within the class of all unbiased estimators of the population total was claimed to be proved by Joshi (1970). While indicating that this claim is incorrect, Patel and Dharmadhikari (1978) proved its admissibility when restricted to the class of linear unbiased estimators only. Sengupta (1980, 1982b) generalized the results of Joshi (1966) by showing that an estimator identical to $\hat T_{raj}$ remains admissible within the class of all estimators of the population total for any fixed size sampling design of size two. Sengupta (1983) also provided a sufficient condition for the admissibility of unbiased estimators of finite population parameters when the sample size is at most two. The result is used to check the admissibility of several unbiased estimators of the population total. Rosen (1997b) provided an asymptotic theory for order sampling and introduced a novel general class of varying probability sampling schemes, called order sampling schemes. The main result concerns asymptotic distributions of linear statistics. Even though the results are theoretical, they provide the groundwork for applications of practical sampling interest. Rosen showed that order sampling yields interesting contributions to the problem of finding simple and good $\pi$ps schemes. Bhargava (1978) considered some applications of the technique of combined unordering of different estimators, which enables us to obtain various new results in addition to those given by Basu (1958) and Pathak (1967a, 1967b). Rosen (1998) discussed in detail the methods for calculating the inclusion probabilities for ordered sampling. Some order relations between the selection and the inclusion probabilities for the PPSWOR sampling scheme have also been discussed by Rao, Sengupta, and Sinha (1991). Das (1951) proposed an estimator based on ordered samples; however, his estimator does not have a non-negative variance estimator. Mukhopadhyay (1977) considered the comparison of ordered estimators under Midzuno--Sen's scheme and probability proportional to size with replacement sampling, based on samples of two units. Andreatta and Kaufman (1986) studied these estimators under informative designs, that is, designs whose selection probabilities depend upon the values of the study variable.

5.11.2 UNORDERED ESTIMATORS

If we have a sample of $n$ units, then there can be $n!$ arrangements of the units; that is, there will be $n!$ ordered samples. For example, if $n = 2$, we have two ordered samples and one unordered sample. In other words, AB and BA are two ordered samples, but if we ignore the order, there is only one sample of two units. For a sample of $n = 3$ units, the number of ordered samples is $3! = 6$, namely ABC, ACB, BAC, BCA, CAB, and CBA. Suppose $s_u$ denotes the $u$th unordered sample; then the number of unordered samples is $\binom{N}{n}$, and if $s_o$ denotes the $o$th ordered sample of the $u$th unordered sample, then the number of ordered samples $s_o$ is $n!$.

Defining
$\hat T\left(s_o, o\right)$ = estimate of the population total based on the $o$th ordered sample corresponding to the $u$th unordered sample,
$\hat T\left(s_u\right)$ = unordered estimator of the population total based on the $u$th unordered sample,
$P\left(s_o, o\right)$ = probability of selecting the $o$th ordered sample of the $u$th unordered sample,
$P\left(s_u\right) = \sum_{o=1}^{n!} P\left(s_o, o\right)$ = probability of selecting the $u$th unordered sample.

Then we have the following theorems:

Theorem 5.11.2.1. An unordered estimator of the population total is given by
$$\hat T\left(s_u\right) = \sum_{o=1}^{n!}\frac{P\left(s_o, o\right)}{P\left(s_u\right)}\,\hat T\left(s_o, o\right). \qquad (5.11.2.1)$$
Proof. It follows from the result given in (5.11.2.10), based on the next two theorems.

For $n = 2$ the estimator (5.11.2.1) reduces to Murthy's (1957) unordered estimator of the population total, which we discuss in the following theorem.

Theorem 5.11.2.2. Murthy's (1957) unordered estimator of the population total based on two units is given by
$$\hat T_M = \frac{1}{\left(2 - p_i - p_j\right)}\left[\frac{y_i}{p_i}\left(1 - p_j\right) + \frac{y_j}{p_j}\left(1 - p_i\right)\right]. \qquad (5.11.2.2)$$
Proof. Suppose $\left(u_i, u_j\right)$ and $\left(u_j, u_i\right)$ constitute the two possible orders of the units in the sample, with selection probabilities $\left(p_i, p_j\right)$ and $\left(p_j, p_i\right)$ attached to the units, respectively. Now if $p_i$ is the probability of selecting the $i$th unit at the first draw, then the probability of selecting the $j$th unit at the second draw, given that the $i$th unit has already been selected at the first draw, is $p_j/\left(1 - p_i\right)$.
Therefore the probability of selecting $u_i$ followed by $u_j$ is $p_i p_j/\left(1 - p_i\right) = P\left(s_o, 1\right)$. Similarly, the probability of selecting $u_j$ followed by $u_i$ is $p_i p_j/\left(1 - p_j\right) = P\left(s_o, 2\right)$. Thus the sum of these probabilities is given by
$$P\left(s_u\right) = P\left(s_o, 1\right) + P\left(s_o, 2\right) = \frac{p_i p_j}{1 - p_i} + \frac{p_i p_j}{1 - p_j} = \frac{p_i p_j\left(2 - p_i - p_j\right)}{\left(1 - p_i\right)\left(1 - p_j\right)}. \qquad (5.11.2.3)$$
Now the first ordered estimator, based on the order $\left(u_i, u_j\right)$, is given by
$$\hat T\left(s_o, 1\right) = y_i + \frac{y_j}{p_j}\left(1 - p_i\right) \qquad (5.11.2.4)$$
with probability $P\left(s_o, 1\right)$, and the second ordered estimator, based on the order $\left(u_j, u_i\right)$, is given by
$$\hat T\left(s_o, 2\right) = y_j + \frac{y_i}{p_i}\left(1 - p_j\right) \qquad (5.11.2.5)$$
with probability $P\left(s_o, 2\right)$. Then we have
$$\hat T\left(s_u\right) = \sum_{o=1}^{2}\frac{P\left(s_o, o\right)}{P\left(s_u\right)}\,\hat T\left(s_o, o\right) = \frac{P\left(s_o, 1\right)\hat T\left(s_o, 1\right) + P\left(s_o, 2\right)\hat T\left(s_o, 2\right)}{P\left(s_u\right)}$$
$$= \left[\left\{y_i + \frac{y_j}{p_j}\left(1 - p_i\right)\right\}\frac{p_i p_j}{\left(1 - p_i\right)} + \left\{y_j + \frac{y_i}{p_i}\left(1 - p_j\right)\right\}\frac{p_i p_j}{\left(1 - p_j\right)}\right]\Bigg/\left[\frac{p_i p_j\left(2 - p_i - p_j\right)}{\left(1 - p_i\right)\left(1 - p_j\right)}\right]$$
$$= \left(2 - p_i - p_j\right)^{-1}\left[\frac{y_i}{p_i}\left(1 - p_j\right) + \frac{y_j}{p_j}\left(1 - p_i\right)\right].$$
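The algebra above is easy to confirm numerically. The Python sketch below computes (5.11.2.2) directly and also rebuilds it as the probability-weighted mixture of the two ordered estimators (5.11.2.4) and (5.11.2.5); the two routes must agree. The function names are this example's own:

```python
def murthy_estimator(yi, yj, pi, pj):
    """Murthy's (1957) unordered estimator (5.11.2.2) for a PPSWOR
    sample of two units with initial selection probabilities pi, pj."""
    return ((yi / pi) * (1 - pj) + (yj / pj) * (1 - pi)) / (2 - pi - pj)


def murthy_from_ordered(yi, yj, pi, pj):
    """Same estimate built as the mixture of the two ordered
    estimators, weighted by the ordered-sample probabilities (5.11.2.3)."""
    t1 = yi + (yj / pj) * (1 - pi)        # order (u_i, u_j), eq. (5.11.2.4)
    t2 = yj + (yi / pi) * (1 - pj)        # order (u_j, u_i), eq. (5.11.2.5)
    p1 = pi * pj / (1 - pi)
    p2 = pi * pj / (1 - pj)
    return (p1 * t1 + p2 * t2) / (p1 + p2)
```

With $y_i = 2$, $y_j = 3$, $p_i = 0.4$, $p_j = 0.2$, both routes give $13/1.4 \approx 9.286$.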


Thus we have the following theorem:

Theorem 5.11.2.3. The ordered estimator of the population total is always less efficient than the corresponding unordered estimator.

Proof. We have to show that
$$V\left[\hat T\left(s_o, o\right)\right] \geq V\left[\hat T\left(s_u\right)\right]. \qquad (5.11.2.6)$$
Now
$$V\left[\hat T\left(s_o, o\right)\right] = E_1 V_2\left[\hat T\left(s_o, o\right) \mid s_u\right] + V_1 E_2\left[\hat T\left(s_o, o\right) \mid s_u\right], \qquad (5.11.2.7)$$

where v[i (so, o)] is based on the olhordered of the s:,h unordered sample. E2 , v2
are the conditional expectation and variance for a given unordered sample Su and
E I, VI are the expectation and variance over all the unordered samples s" .
Therefore we have
V [i(so,0 )]=E)V2 [i(so'0)1sul+VI [Q], (5.11.2.8)
where
Q= Ezli(so ,o)1sJ (5 .11.2.9)
Now to find the value of Q in (5.11.2.9), we proceed as follow s: The ordered
estimators i (so ,o) for a given unordered sample s" have values corresponding to
each order, that is i(so,I), i(so,2), .....,i (so , n!),where
i(so, I) = Estimator of the population total based on the first order sample from
1h
the u unordered sample s",

i(so, 2) = Estimator of the population total based on the second sample from the
the z/h unordered sample Su ,

and
i (so,n!) = Estimator of population total based on II! th ordered sample from the
u,h unordered sample s"
with prob abilities p(so, I), p(so, 2) , and p(so, n!), respectively. Now to find the
ordered estimator i(so, 0) we adjust the probabilities such that sum of these
probabilities equals one.

Thu s we use the adjusted probabilities given by p(so, 1)/p(s,,) , p(so, 2)/ p(sJ , 00'
II !
p(so'n!)/ p(s,,) , respectively, where p(sJ= IP(so'o) .
0= 1
Thus we have

Q" -- E2 [T"(so'o)1 s"] -- ~L. i(so'o)p(so'o) -- T"(su')


P s"0= 1
() (5.11.2.10)
Using (5.11.2.10) in (5.11.2.8), we have
V [i(so,0 )]= E) V2 [i(so,o)1 su]+ VI [i(s,,)]. (5.11.2.11)
Now the first term , namel y E\ v2[i(so,0)Is,, ], will be zero if for each unordered
sample s" , the estimator takes the same value s as for the ordered samples ;
otherwise, it is positive.

Therefore we have

Hence the theorem .
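Theorem 5.11.2.3 can be checked by brute force on a tiny population. The Python sketch below enumerates every ordered PPSWOR sample of size two from a three-unit population (the values and probabilities are arbitrary choices for this illustration), and compares the exact design variance of Raj's ordered estimator with that of Murthy's unordered estimator; both are unbiased, and the unordered one is never worse:

```python
import itertools

y = [1.0, 2.0, 4.0]
p = [0.2, 0.3, 0.5]           # initial selection probabilities, sum to 1


def ordered_samples():
    """All ordered PPSWOR samples of size 2 with their probabilities."""
    for i, j in itertools.permutations(range(len(p)), 2):
        yield (i, j), p[i] * p[j] / (1 - p[i])


def raj(i, j):
    """Des Raj's ordered estimator for the sample drawn in order (i, j)."""
    t1 = y[i] / p[i]
    t2 = y[i] + (y[j] / p[j]) * (1 - p[i])
    return 0.5 * (t1 + t2)


def murthy(i, j):
    """Murthy's unordered estimator, the same value for both orders."""
    return ((y[i] / p[i]) * (1 - p[j])
            + (y[j] / p[j]) * (1 - p[i])) / (2 - p[i] - p[j])


def mean_and_variance(est):
    """Exact design expectation and variance by full enumeration."""
    m = sum(pr * est(i, j) for (i, j), pr in ordered_samples())
    v = sum(pr * (est(i, j) - m) ** 2 for (i, j), pr in ordered_samples())
    return m, v
```

Both estimators have expectation $Y = 7$ over the design, and the enumerated variance of the unordered estimator does not exceed that of the ordered one, in line with the Rao-Blackwell-type argument of the proof.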



5.12 RAO--HARTLEY--COCHRAN (RHC) SAMPLING STRATEGY

Sampling Scheme: Suppose a population consists of $N$ units and we wish to draw a sample of $n$ units. First of all, divide the $N$ units at random into $n$ groups as shown below:

Fig. 5.12.1 Rao--Hartley--Cochran strategy: the population of $N$ units is divided by SRSWOR into $n$ random groups (the $i$th group containing $N_i$ units), and one unit is selected from each group with PPS.

First random group: Out of the $N$ units, select $N_1$ units by SRSWOR sampling.

Second random group: Out of the remaining $\left(N - N_1\right)$ units, select $N_2$ units by SRSWOR sampling, and so on, such that
$$\sum_{i=1}^n N_i = N. \qquad (5.12.1)$$
The allocation of units to the different groups is done randomly, and we select one unit from each of the $n$ groups with probability proportional to size (PPS); thus we obtain a sample of size $n$.

Suppose $P_1, P_2, \ldots, P_N$ are the probabilities associated with the $N$ units in the population, with $\sum_{i=1}^N P_i = 1$. Further suppose that $P_{ij}$ denotes the probability corresponding to the $j$th unit in the $i$th group $G_i$, for all $i = 1, 2, \ldots, n$.

Thus the Rao, Hartley, and Cochran (1962) mechanism can be better understood
from the following table, which gives the structure of population units after making
random groups, as follows:

where r, = L Pi}, i = 1, 2,..., n ,denotes the sum of selection probabilities of the /h


jeGi
random group.

Now we have the following theorems:

Theorem 5.12.1. An unbiased estimator of the population total $Y$ is given by
$$\hat Y_{RHC} = \sum_{i=1}^n\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}. \qquad (5.12.2)$$
Proof. Suppose $E_2$ denotes the expected value for the given random groups $G_i$ and $E_1$ denotes the expected value over all possible random groups. Then we have
$$E\left(\hat Y_{RHC}\right) = E_1 E_2\left[\hat Y_{RHC} \mid G_i\right] = E_1 E_2\left[\sum_{i=1}^n\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\,\Big|\, G_i\right] = E_1\left[\sum_{i=1}^n\sum_{j\in G_i} Y_{ij}\right] = E_1\left(\sum_{i=1}^n Y_{i\cdot}\right) = Y.$$
Hence the theorem.
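The scheme and the estimator (5.12.2) are straightforward to simulate. In the Python sketch below (the function names are illustrative, not the book's), `rhc_estimate` performs one draw of the strategy, and `rhc_conditional_expectation` enumerates the within-group PPS draws for a fixed grouping, reproducing the step $E_2\left[\hat Y_{RHC} \mid G_i\right] = Y$ used in the proof:

```python
import itertools
import random


def rhc_estimate(y, p, group_sizes, rng):
    """One draw of the Rao--Hartley--Cochran strategy: split the units at
    random into groups of the given sizes, pick one unit per group with
    probability proportional to p within the group, and return
    sum_i y_i1 * (pi_i / p_i1) as in (5.12.2)."""
    units = list(range(len(y)))
    rng.shuffle(units)
    est, start = 0.0, 0
    for size in group_sizes:
        group = units[start:start + size]
        start += size
        pi_i = sum(p[j] for j in group)            # total p-mass of the group
        chosen = rng.choices(group, weights=[p[j] for j in group])[0]
        est += y[chosen] * pi_i / p[chosen]
    return est


def rhc_conditional_expectation(y, p, groups):
    """E[Y_hat | grouping]: enumerate every one-unit-per-group PPS draw.
    The answer is always the population total Y, for any fixed grouping."""
    total = 0.0
    for picks in itertools.product(*groups):
        pr, est = 1.0, 0.0
        for group, chosen in zip(groups, picks):
            pi_i = sum(p[j] for j in group)
            pr *= p[chosen] / pi_i
            est += y[chosen] * pi_i / p[chosen]
        total += pr * est
    return total
```

For example, with four units split into the fixed groups $\{1, 3\}$ and $\{2, 4\}$, the conditional expectation equals the population total regardless of the $P$ values, illustrating the unbiasedness argument of Theorem 5.12.1.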

Theorem 5.12.2. The variance of the estimator $\hat Y_{RHC}$ is given by
$$V\left(\hat Y_{RHC}\right) = \frac{\sum_{i=1}^n N_i^2 - N}{N(N-1)}\left[\sum_{j=1}^N\frac{Y_j^2}{P_j} - Y^2\right]. \qquad (5.12.3)$$
Proof. We have
$$V\left(\hat Y_{RHC}\right) = E_1 V_2\left(\hat Y_{RHC} \mid G_i\right) + V_1 E_2\left(\hat Y_{RHC} \mid G_i\right) = E_1 V_2\left[\sum_{i=1}^n\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\,\Big|\, G_i\right] + V_1(Y) = E_1 V_2\left[\sum_{i=1}^n\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\,\Big|\, G_i\right]. \qquad (5.12.4)$$

Note that we have selected independent samples from each group; therefore
$$E_1 V_2\left[\sum_{i=1}^n\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\,\Big|\, G_i\right] = E_1\left[\sum_{i=1}^n V_2\left(\frac{y_{i1}}{p_{i1}/\pi_i}\,\Big|\, G_i\right)\right]. \qquad (5.12.5)$$
Thus we have
$$V_2\left(\frac{y_{i1}}{p_{i1}/\pi_i}\,\Big|\, G_i\right) = E_2\left[\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\right]^2 - \left\{E_2\left[\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\right]\right\}^2. \qquad (5.12.6)$$
In a given random group of $N_i$ units, the random variable $\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}$ can take any of the values $\frac{Y_{i1}}{\left(P_{i1}/\pi_i\right)}, \frac{Y_{i2}}{\left(P_{i2}/\pi_i\right)}, \ldots, \frac{Y_{iN_i}}{\left(P_{iN_i}/\pi_i\right)}$ with probabilities $\frac{P_{i1}}{\pi_i}, \frac{P_{i2}}{\pi_i}, \ldots, \frac{P_{iN_i}}{\pi_i}$, respectively.
Thus we have
$$E_2\left[\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\right] = \frac{Y_{i1}}{\left(P_{i1}/\pi_i\right)}\times\frac{P_{i1}}{\pi_i} + \frac{Y_{i2}}{\left(P_{i2}/\pi_i\right)}\times\frac{P_{i2}}{\pi_i} + \cdots + \frac{Y_{iN_i}}{\left(P_{iN_i}/\pi_i\right)}\times\frac{P_{iN_i}}{\pi_i} = Y_{i1} + Y_{i2} + \cdots + Y_{iN_i} = \sum_{j=1}^{N_i} Y_{ij} = Y_{i\cdot}. \qquad (5.12.7)$$
Also we have
$$E_2\left[\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\right]^2 = \sum_{j=1}^{N_i}\frac{Y_{ij}^2}{\left(P_{ij}/\pi_i\right)} \qquad (5.12.8)$$
because $\left[\frac{y_{i1}}{\left(p_{i1}/\pi_i\right)}\right]^2$ takes the values $\left[\frac{Y_{i1}}{\left(P_{i1}/\pi_i\right)}\right]^2, \left[\frac{Y_{i2}}{\left(P_{i2}/\pi_i\right)}\right]^2, \ldots, \left[\frac{Y_{iN_i}}{\left(P_{iN_i}/\pi_i\right)}\right]^2$ with probabilities $\frac{P_{i1}}{\pi_i}, \frac{P_{i2}}{\pi_i}, \ldots, \frac{P_{iN_i}}{\pi_i}$, respectively. Using (5.12.7) and (5.12.8) in (5.12.6) we have
$$V_2\left(\frac{y_{i1}}{p_{i1}/\pi_i}\,\Big|\, G_i\right) = \sum_{j=1}^{N_i}\frac{Y_{ij}^2}{\left(P_{ij}/\pi_i\right)} - Y_{i\cdot}^2. \qquad (5.12.9)$$
Note that
N;
'; = I Pij
j='
therefore (5.12.5) implies that
$$E_1\!\left[\sum_{i=1}^{n} V_2\!\left(\frac{y_{i1}}{p_{i1}/r_i} \,\Big|\, G_i\right)\right]
= E_1\!\left[\sum_{i=1}^{n}\left\{\sum_{j=1}^{N_i} \frac{y_{ij}^2\, r_i}{p_{ij}} - Y_i^2\right\}\right]
= \sum_{i=1}^{n}\left[E_1\!\left(\sum_{j=1}^{N_i} \frac{y_{ij}^2\, r_i}{p_{ij}}\right) - E_1\!\left(Y_i^2\right)\right]. \qquad (5.12.10)$$

Since the $N_i$ units of the $i$th group form a simple random sample (without replacement) of the $N$ population units, and $r_i = \sum_{k \in G_i} p_{ik}$,
$$E_1\!\left(\sum_{j=1}^{N_i} \frac{y_{ij}^2\, r_i}{p_{ij}}\right)
= E_1\!\left[\sum_{j=1}^{N_i} y_{ij}^2 + \sum_{j \neq k}^{N_i} \frac{y_{ij}^2\, p_{ik}}{p_{ij}}\right]
= \frac{N_i}{N}\sum_{j=1}^{N} y_j^2 + \frac{N_i(N_i-1)}{N(N-1)}\sum_{j \neq k}^{N} \frac{p_k\, y_j^2}{p_j}.$$

Furthermore
$$E_1\!\left(Y_i^2\right) = V_1(Y_i) + \left\{E_1(Y_i)\right\}^2 = V_1\!\left(N_i\,\bar{y}_i\right) + \left\{E_1\!\left(N_i\,\bar{y}_i\right)\right\}^2
= N_i^2\,\frac{(N-N_i)}{N N_i}\,S_y^2 + N_i^2\,\bar{Y}^2. \qquad (5.12.11)$$
Now consider
$$\sum_{j=1}^{N} \frac{y_j^2}{p_j} = \sum_{j=1}^{N} \frac{y_j^2}{p_j} \times 1 = \sum_{j=1}^{N} \frac{y_j^2}{p_j} \times \sum_{k=1}^{N} p_k = \sum_{j=1}^{N} y_j^2 + \sum_{j \neq k}^{N} \frac{p_k\, y_j^2}{p_j}, \qquad (5.12.12)$$
which implies that
$$\sum_{j \neq k}^{N} \frac{p_k\, y_j^2}{p_j} = \sum_{j=1}^{N} \frac{y_j^2}{p_j} - \sum_{j=1}^{N} y_j^2. \qquad (5.12.13)$$

Therefore, on substituting (5.12.11) and (5.12.13), and using $S_y^2 = (N-1)^{-1}\left(\sum_{j=1}^{N} y_j^2 - N\bar{Y}^2\right)$, equation (5.12.10) becomes
$$E_1\!\left[\sum_{i=1}^{n} V_2\!\left(\frac{y_{i1}}{p_{i1}/r_i}\,\Big|\, G_i\right)\right]
= \sum_{i=1}^{n}\left[\frac{N_i}{N}\sum_{j=1}^{N} y_j^2
+ \frac{N_i(N_i-1)}{N(N-1)}\left\{\sum_{j=1}^{N} \frac{y_j^2}{p_j} - \sum_{j=1}^{N} y_j^2\right\}
- \frac{N_i(N-N_i)}{N(N-1)}\left\{\sum_{j=1}^{N} y_j^2 - N\bar{Y}^2\right\} - N_i^2\,\bar{Y}^2\right].$$
Here the coefficient of $\sum_{j=1}^{N} y_j^2$ is
$$\frac{N_i}{N} - \frac{N_i(N_i-1)}{N(N-1)} - \frac{N_i(N-N_i)}{N(N-1)} = \frac{N_i\{(N-1) - (N_i-1) - (N-N_i)\}}{N(N-1)} = 0,$$
while the coefficient of $\bar{Y}^2$ is
$$\frac{N_i(N-N_i)}{N-1} - N_i^2 = -\,\frac{N\,N_i(N_i-1)}{N-1} = -\,\frac{N_i(N_i-1)}{N(N-1)}\,N^2,$$
so that, using $Y = N\bar{Y}$ and $\sum_{i=1}^{n} N_i = N$,
$$E_1\!\left[\sum_{i=1}^{n} V_2\!\left(\frac{y_{i1}}{p_{i1}/r_i}\,\Big|\, G_i\right)\right]
= \sum_{i=1}^{n} \frac{N_i(N_i-1)}{N(N-1)}\left[\sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2\right]
= \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)}\left[\sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2\right]. \qquad (5.12.14)$$

Hence the theorem.

Theorem 5.12.3. The RHC scheme is more efficient than PPSWR sampling if
$$N_i = \frac{N}{n}, \quad \forall\, i = 1, 2, \ldots, n.$$

Proof. Without loss of generality we have
$$V(\hat{Y}_{\mathrm{PPSWR}}) = \frac{1}{n}\left[\sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2\right]
\quad\text{and}\quad
V(\hat{Y}_{\mathrm{RHC}}) = \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)}\left[\sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2\right].$$
Combining these results we have
$$V(\hat{Y}_{\mathrm{RHC}}) = \frac{n\left(\sum_{i=1}^{n} N_i^2 - N\right)}{N(N-1)}\, V(\hat{Y}_{\mathrm{PPSWR}}). \qquad (5.12.15)$$
To find the minimum value of the variance $V(\hat{Y}_{\mathrm{RHC}})$ with respect to the $N_i$, subject to $\sum_{i=1}^{n} N_i = N$, we have the Lagrange function
$$L = \sum_{i=1}^{n} N_i^2 - \lambda\left[\sum_{i=1}^{n} N_i - N\right]. \qquad (5.12.16)$$
On differentiating (5.12.16) with respect to $N_i$ and equating to zero we have
$$N_i = \lambda/2. \qquad (5.12.17)$$
On substituting (5.12.17) in the constraint $\sum_{i=1}^{n} N_i = N$ we have
$$\lambda = 2N/n. \qquad (5.12.18)$$
Plugging this value of $\lambda$ back into (5.12.17), the optimum value of $N_i$ is
$$N_i = N/n. \qquad (5.12.19)$$
On substituting the optimum value of $N_i$ in (5.12.15), and noting that $\sum_{i=1}^{n}(N/n)^2 = N^2/n$, we have
$$V(\hat{Y}_{\mathrm{RHC}}) = \frac{n\left(N^2/n - N\right)}{N(N-1)}\, V(\hat{Y}_{\mathrm{PPSWR}}) = \frac{(N-n)}{(N-1)}\, V(\hat{Y}_{\mathrm{PPSWR}}). \qquad (5.12.20)$$

Note that $(N-n)/(N-1) < 1$ for all $n \ge 2$, so $V(\hat{Y}_{\mathrm{RHC}}) < V(\hat{Y}_{\mathrm{PPSWR}})$. Hence the theorem.
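The efficiency comparison can be checked numerically on a toy population. A minimal sketch (the population values below are hypothetical): the two population variance formulas are computed directly, and with equal group sizes $N_i = N/n$ their ratio is exactly $(N-n)/(N-1)$, as in (5.12.20).

```python
def var_ppswr(y, p, n):
    """V(Y_PPSWR) = (1/n) * [sum_j y_j^2 / p_j - Y^2] over the population."""
    Y = sum(y)
    return (sum(yj * yj / pj for yj, pj in zip(y, p)) - Y * Y) / n

def var_rhc(y, p, group_sizes):
    """V(Y_RHC), eq. (5.12.3), for group sizes N_1,...,N_n."""
    N, Y = len(y), sum(y)
    coef = (sum(Ni * Ni for Ni in group_sizes) - N) / (N * (N - 1))
    return coef * (sum(yj * yj / pj for yj, pj in zip(y, p)) - Y * Y)
```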

Theorem 5.12.4. An unbiased estimator of the variance $V(\hat{Y}_{\mathrm{RHC}})$ is given by
$$v(\hat{Y}_{\mathrm{RHC}}) = \frac{\sum_{i=1}^{n} N_i^2 - N}{N^2 - \sum_{i=1}^{n} N_i^2}\left[\sum_{i=1}^{n} \frac{y_{i1}^2\, r_i}{p_{i1}^2} - \hat{Y}_{\mathrm{RHC}}^2\right]. \qquad (5.12.21)$$

Proof. We know that
$$v(\hat{Y}_{\mathrm{RHC}}) = \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)} \times n \times v(\hat{Y}_{\mathrm{PPSWR}}). \qquad (5.12.22)$$
Also we know that
$$n\, V(\hat{Y}_{\mathrm{PPSWR}}) = \sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2. \qquad (5.12.23)$$
An estimator of the right hand side of (5.12.23) is given by
$$\sum_{i=1}^{n} \frac{y_{i1}^2\, r_i}{p_{i1}^2} - \left[\left(\sum_{i=1}^{n} \frac{y_{i1}}{p_{i1}/r_i}\right)^2 - v(\hat{Y}_{\mathrm{RHC}})\right]. \qquad (5.12.24)$$
Note that an unbiased estimator of $\sum_{j=1}^{N} \dfrac{y_j^2}{p_j}$ is $\sum_{i=1}^{n} \dfrac{y_{i1}^2\, r_i}{p_{i1}^2}$, because
$$E_2\!\left[\frac{y_{i1}^2\, r_i}{p_{i1}^2}\right] = \sum_{j=1}^{N_i} \frac{y_{ij}^2\, r_i}{p_{ij}^2}\cdot\frac{p_{ij}}{r_i} = \sum_{j=1}^{N_i} \frac{y_{ij}^2}{p_{ij}},$$
and an unbiased estimator of $Y^2$ is $\hat{Y}_{\mathrm{RHC}}^2 - v(\hat{Y}_{\mathrm{RHC}})$. Thus we have
$$n\, v(\hat{Y}_{\mathrm{PPSWR}}) = \sum_{i=1}^{n} \frac{y_{i1}^2\, r_i}{p_{i1}^2} - \hat{Y}_{\mathrm{RHC}}^2 + v(\hat{Y}_{\mathrm{RHC}}). \qquad (5.12.25)$$
Therefore, substituting (5.12.25) in (5.12.22),
$$v(\hat{Y}_{\mathrm{RHC}})\left[1 - \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)}\right] = \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)}\left[\sum_{i=1}^{n} \frac{y_{i1}^2\, r_i}{p_{i1}^2} - \hat{Y}_{\mathrm{RHC}}^2\right],$$
which, on noting that $1 - \dfrac{\sum_{i=1}^{n} N_i^2 - N}{N(N-1)} = \dfrac{N^2 - \sum_{i=1}^{n} N_i^2}{N(N-1)}$, reduces to (5.12.21). Hence the theorem.

Some improvements in the RHC strategy have also been suggested by Hartley, Rao, and Kiefer (1969), Gabler and Horst (1995), Mangat (1993), Bansal and Singh (1986), and Singh and Kishore (1975). Padmawar (1996) has considered an interesting extension of the RHC strategy to the case of continuous populations. Its comparison with Midzuno's scheme of sampling has been discussed by Chaudhuri (1977) under a superpopulation model setup.

Example 5.12.1. From population 1, select a sample of five units by using the RHC scheme. Estimate the total real estate farm loans using the RHC estimator, making use of the nonreal estate farm loans as an auxiliary variable. Also find a 95% confidence interval for the total real estate farm loans in the United States.
Solution. We are to select a sample of size 5 using the RHC scheme, so the population must be divided into five random groups. To do this we selected 50 distinct random numbers between 1 and 50 by starting with the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix.

These random numbers came in the sequence 01, 23, 46, 04, 32, 47, 33, 05, 22, 38, 29, 40, 03, 36, 27, 19, 14, 42, 48, 06, 07, 21, 31, 16, 10, 18, 26, 02, 44, 12, 37, 25, 50, 30, 09, 11, 49, 43, 15, 39, 35, 20, 34, 13, 24, 41, 17, 08, 28, and 45.

The states bearing serial numbers corresponding to the first ten selected random numbers constitute the first random group, the next ten form the second random group, and so on.

Let
$y_{ij}$ = real estate farm loans ($000) for the $j$th state in the $i$th random group,
$x_{ij}$ = nonreal estate farm loans ($000) for the $j$th state in the $i$th random group,
and
$p_{ij}$ = the initial selection probability of the $j$th unit in the $i$th random group.

We are given $X = 43908.12$; thus the following are the five random groups of units along with their initial selection probabilities.
Group 1
Sr.No.  Random No.  State   y_1j        x_1j        p_1j
 1      01          AL       408.978     348.334    0.007933
 2      23          MN      1354.768    2466.892    0.056183
 3      46          VA       321.583     188.477    0.004293
 4      04          AR       907.700     848.317    0.019320
 5      32          NY       201.631     426.274    0.009708
 6      47          WA      1100.745    1228.607    0.027981
 7      33          NC       639.571     494.730    0.011267
 8      05          CA      1343.467    3928.732    0.089476
 9      22          MI       323.028     440.518    0.010033
10      38          PA       756.169     298.351    0.006795
                    Sum                10669.230    0.242990

Group 2
Sr.No.  Random No.  State   y_2j        x_2j        p_2j
 1      29          NH         6.044       0.471    0.0000107
 2      40          SC        87.951      80.750    0.0018391
 3      03          AZ        54.633     431.439    0.0098260
 4      36          OK       612.108    1716.087    0.0390836
 5      27          NE      1337.852    3586.406    0.0816798
 6      19          ME         8.849      57.684    0.0013137
 7      14          IN      1213.024    1022.782    0.0232937
 8      42          TN       553.266     388.869    0.0088564
 9      48          WV        99.277      29.291    0.0006671
10      06          CO       315.809     906.281    0.0206404
                    Sum                 8220.060    0.1872100

Group 3
Sr.No.  Random No.  State   y_3j        x_3j        p_3j
 1      07          CT         7.130       4.373    0.000099
 2      21          MA         7.590      56.471    0.001286
 3      31          NM       140.582     274.035    0.006241
 4      16          KS      1049.834    2580.304    0.058766
 5      10          GA       939.460     540.696    0.012314
 6      18          LA       282.565     405.799    0.009242
 7      26          MT       292.965     722.034    0.016444
 8      02          AK         2.605       3.433    0.000078
 9      44          UT        56.908     197.244    0.004492
10      12          ID        53.753    1006.036    0.022912
                    Sum                 5790.425    0.131876

Group 4
Sr.No.  Random No.  State   y_4j        p_4j
 1      37          OR       114.899    0.034606
 2      25          MO      1579.686    0.034618
 3      50          WY       100.964    0.008802
 4      30          NJ        39.860    0.000626
 5      09          FL       825.748    0.010579
 6      11          HI        40.775    0.000867
 7      49          WI      1229.752    0.031258
 8      43          TX      1248.761    0.080176
 9      15          IA      2327.025    0.089044
10      39          RI         1.611    0.000005
                    Sum                 0.290580

Group 5
Sr.No.  Random No.  State   y_5j        x_5j        p_5j
 1      35          OH       870.720     635.774    0.014480
 2      20          MD       139.628      57.684    0.001314
 3      34          ND       449.099    1241.369    0.028272
 4      13          IL      2131.048    2610.572    0.059455
 5      24          MS       627.013     549.551    0.012516
 6      41          SD       413.777    1692.817    0.038554
 7      17          KY      1045.106     557.656    0.012701
 8      08          DE        42.808      43.229    0.000985
 9      28          NV         5.860      16.710    0.000381
10      45          VT        57.747      19.363    0.000441
                    Sum                 7424.730    0.169100

Now apply Lahiri's method of sample selection within each group to select one unit independently. The first random group consists of $N_1 = 10$ units and the maximum value of the auxiliary variable $x_{1j}$ is 3928.732. Choosing $X_0 = 4000$ we select a random number $1 \le R_i \le 10$ by starting with the first two columns and another random number $1 \le R_j \le 4000$ by starting from the 7th to 10th columns of the Pseudo-Random Numbers. Then the first effective pair of random numbers is (04, 0757). Thus from the first random group the unit with Sr. No. 4, that is the state AR, will be included in the sample.

The second random group consists of $N_2 = 10$ units and the maximum value of the auxiliary variable $x_{2j}$ is 3586.406. Choosing $X_0 = 3600$ we select a random number $1 \le R_i \le 10$ by starting with the 7th and 8th columns and another random number $1 \le R_j \le 3600$ by starting from the 13th to 16th columns of the Pseudo-Random Numbers. Then the first effective pair of random numbers is (07, 0536). Thus from the second random group the unit with Sr. No. 7, that is the state IN, will be included in the sample.

The third random group consists of $N_3 = 10$ units and the maximum value of the auxiliary variable $x_{3j}$ is 2580.304. Choosing $X_0 = 2600$ we select a random number $1 \le R_i \le 10$ by starting with the 13th and 14th columns and another random number $1 \le R_j \le 2600$ by starting from the 19th to 22nd columns of the Pseudo-Random Numbers. Then the first effective pair of random numbers is (07, 0705). Thus from the third random group the unit with Sr. No. 7, that is the state MT, will be included in the sample.

The fourth random group consists of $N_4 = 10$ units and the maximum value of the auxiliary variable $x_{4j}$ is 3909.738. Choosing $X_0 = 4000$ we select a random number $1 \le R_i \le 10$ by starting with the 19th and 20th columns and another random number $1 \le R_j \le 4000$ by starting from the 25th to 28th columns of the Pseudo-Random Numbers. Then the first effective pair of random numbers is (08, 1230). Thus from the fourth random group the unit with Sr. No. 8, that is the state TX, will be included in the sample.

The fifth random group consists of $N_5 = 10$ units and the maximum value of the auxiliary variable $x_{5j}$ is 2610.572. Choosing $X_0 = 2700$ we select a random number $1 \le R_i \le 10$ by starting with the 25th and 26th columns and another random number $1 \le R_j \le 2700$ by starting from the 31st to 34th columns of the Pseudo-Random Numbers. Then the first effective pair of random numbers is (06, 0599). Thus from the fifth random group the unit with Sr. No. 6, that is the state SD, will be included in the sample.

After combining all the above steps, the ultimate sample consists of the following information.
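The within-group Lahiri selection described above can be sketched as an acceptance procedure. This is an illustrative sketch, not the book's tabled procedure: it draws a continuous uniform value in place of the four-digit random numbers from the PRN table.

```python
import random

def lahiri_select(x_values, x0, rng):
    """Lahiri's method: select one unit with probability proportional to x.

    Repeatedly draw a unit index R_i and a value R_j in (0, x0], where x0
    is at least the maximum of x_values; accept R_i when R_j does not
    exceed that unit's auxiliary value.  Returns the selected 0-based index.
    """
    while True:
        i = rng.randrange(len(x_values))
        r = rng.uniform(0.0, x0)       # continuous stand-in for the PRN table
        if r <= x_values[i]:
            return i
```

A rejected pair simply triggers another draw, which is why the method needs $X_0$ to be no smaller than the largest auxiliary value in the group.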

11416.090 536347692.4
9749.003 507681529.1
2349.463 41857340.6
4525.876 70491971.3
0.169100 1814.867 19478071.8
29855.3001175856605.0
Note that here 'i = I Pij' i = 1,2,3 ,4,5.
jEGi
462 Advanced sampling theory with applications

The estimate of the total real estate farm loans during 1997 in the US is given by
• n y OI
YRHC = I:-(_,-) = 29855.30
1=1 Pn /Ti
and an estimate of variance of the estimator YRHC is given by

'(9; )- (±Nl - NJ[ ± Yn2


i=1
v N itNl J (p;? ITi)
RHC - ( 2 _ i=\

J~ 175856605 - (29855.30f I= 64016475.05.


( h02 -50)
• ( ,., 5
502 - I10 2
i= 1
Using table 2 from the Appendix the 95% confidence interval of the total farm real
estate loan is given by

YRHc+ta/2(df=n-l~v(YRHC)' or 29855.30+2.776.J64016475 .05


or
[7644.44, 52066.16] .
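The arithmetic of this example can be re-checked numerically. A sketch using the book's five sampled triples (the small last-digit discrepancies come from the rounded probabilities printed in the tables):

```python
from math import sqrt

# (y_i1, p_i1, r_i) for the sampled states AR, IN, MT, TX, SD
sample = [(907.700, 0.0193200, 0.242990), (1213.024, 0.0232937, 0.187210),
          (292.965, 0.0164440, 0.131876), (1248.761, 0.0801760, 0.290580),
          (413.777, 0.0385540, 0.169100)]
N, group_sizes = 50, [10] * 5

y_rhc = sum(y * r / p for y, p, r in sample)               # eq. (5.12.2)
ssq = sum(y * y * r / (p * p) for y, p, r in sample)       # sum of y^2 r / p^2
s2 = sum(Ni * Ni for Ni in group_sizes)
v = (s2 - N) / (N * N - s2) * (ssq - y_rhc ** 2)           # eq. (5.12.21)
half_width = 2.776 * sqrt(v)                               # t_{0.025}(df = 4)
ci = (y_rhc - half_width, y_rhc + half_width)
```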

5.13 UNBIASED STRATEGIES USING PPS SAMPLING SCHEMES

We shall discuss a few sampling schemes under which the usual ratio type estimators of population mean and variance become unbiased.

Consider a population consisting of $N$ units from which we want to draw a sample of $n$ units. Assume that the $x$ values are known for all units and that $P_i = X_i / X$, with $X = \sum_{i=1}^{N} X_i$, denotes the probability of selecting the $i$th unit. Following the Midzuno--Sen sampling scheme:

( a ) The first unit is selected with probability proportional to its $X_i$ value;

( b ) The remaining $(n-1)$ units are selected by SRSWOR from the remaining $(N-1)$ units with equal probabilities.

Let the $t$th sample $s_t$ consist of $n$ units $(y_i, x_i)$, $i \in s_t$, where $t = 1, 2, \ldots, \binom{N}{n}$ and $\binom{N}{n}$ is the total number of samples possible under SRSWOR sampling. Our aim is to find the probability $p(t)$ of selecting any given sample.
Suppose the unit with value $x_1$ is selected on the first draw. Then the probability of selecting the $t$th sample via that first unit is $\left(\dfrac{x_1}{X}\right)\dbinom{N-1}{n-1}^{-1}$, where $\dbinom{N-1}{n-1}^{-1}$ denotes the probability of selecting the remaining $(n-1)$ units out of the $(N-1)$ units. Similarly, the probability of selecting the $t$th sample via the second unit on the first draw is $\left(\dfrac{x_2}{X}\right)\dbinom{N-1}{n-1}^{-1}$, and so on; via the $n$th unit it is $\left(\dfrac{x_n}{X}\right)\dbinom{N-1}{n-1}^{-1}$.

Hence the probability of selecting the $t$th sample is given by
$$p(t) = \left(\frac{x_1}{X}\right)\binom{N-1}{n-1}^{-1} + \left(\frac{x_2}{X}\right)\binom{N-1}{n-1}^{-1} + \cdots + \left(\frac{x_n}{X}\right)\binom{N-1}{n-1}^{-1}
= \binom{N-1}{n-1}^{-1}\frac{\sum_{i=1}^{n} x_i}{X}. \qquad (5.13.1.1)$$

Let $\bar{y}_t = n^{-1}\sum_{i \in s_t} y_i$ and $\bar{x}_t = n^{-1}\sum_{i \in s_t} x_i$ be the sample means obtained from the $t$th sample, for $t = 1, 2, \ldots, \binom{N}{n}$.

Also, assuming that the population mean $\bar{X}$ of the auxiliary variable is known, the usual ratio estimator of the population mean is given by
$$\bar{y}_R = \frac{\bar{y}_t}{\bar{x}_t}\,\bar{X}. \qquad (5.13.1.2)$$
Then we have the following theorem:

Theorem 5.13.1. The ratio estimator $\bar{y}_R$ is unbiased for the population mean.
Proof. We have
$$E(\bar{y}_R) = \sum_{t=1}^{\binom{N}{n}} \left(\frac{\bar{y}_t}{\bar{x}_t}\,\bar{X}\right)p(t)
= \sum_{t=1}^{\binom{N}{n}} \frac{\bar{y}_t}{\bar{x}_t}\,\bar{X}\,\binom{N-1}{n-1}^{-1}\frac{n\,\bar{x}_t}{X}
= \frac{n}{N}\binom{N-1}{n-1}^{-1}\sum_{t=1}^{\binom{N}{n}} \bar{y}_t
= \binom{N}{n}^{-1}\sum_{t=1}^{\binom{N}{n}} \bar{y}_t = \bar{Y},$$
since $X = N\bar{X}$, $\binom{N}{n} = \dfrac{N}{n}\dbinom{N-1}{n-1}$, and the average of the sample means over all SRSWOR samples equals $\bar{Y}$.

Hence the theorem.
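Theorem 5.13.1 can also be verified by brute-force enumeration on a toy population (an illustrative sketch; the data values are hypothetical):

```python
from itertools import combinations
from math import comb

def midzuno_ratio_expectation(y, x, n):
    """E(ratio estimator of the mean) under the Midzuno--Sen scheme.

    Enumerates all C(N, n) SRSWOR samples t, weights each by
    p(t) = C(N-1, n-1)^{-1} * (sum of x over t) / X, eq. (5.13.1.1),
    and accumulates (ybar_t / xbar_t) * Xbar.
    """
    N, X = len(y), sum(x)
    c = 1.0 / comb(N - 1, n - 1)
    expectation = 0.0
    for t in combinations(range(N), n):
        p_t = c * sum(x[i] for i in t) / X
        ybar_t = sum(y[i] for i in t) / n
        xbar_t = sum(x[i] for i in t) / n
        expectation += (ybar_t / xbar_t) * (X / N) * p_t
    return expectation
```

Whatever the population, the accumulated expectation equals the population mean of $y$ exactly, up to floating-point error.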


Nanjamma, Murthy, and Sethi (1959), Singh and Srivastava (1980), and Swain and Mishra (1992) have considered the use of a Midzuno type of sampling scheme as follows:

Step I. Select two units, say the $i$th and $j$th, with the probability of their joint selection being proportional to $(x_i - x_j)^2$;
Step II. Select $(n-2)$ units from the remaining units of the population by simple random sampling and without replacement.

Note that the first step may be performed by considering all possible pairs of units and selecting a pair with the assigned probability.

Evidently the probability of selecting the $t$th sample is given by
$$p(t) = \binom{N-2}{n-2}^{-1}\frac{\sum\limits_{i < j \in s_t}(x_i - x_j)^2}{\sum\limits_{i < j \in \Omega}(x_i - x_j)^2}
= \binom{N-2}{n-2}^{-1}\frac{n(n-1)\, s_{xt}^2}{N(N-1)\, S_x^2},$$
since $\sum_{i<j \in s_t}(x_i - x_j)^2 = n(n-1)s_{xt}^2$ and $\sum_{i<j \in \Omega}(x_i - x_j)^2 = N(N-1)S_x^2$.

Thus we have the following theorem:

Theorem 5.13.2.1. The ratio type estimator
$$s_t^2 = s_y^2\left(\frac{S_x^2}{s_x^2}\right)$$
is unbiased for $S_y^2$.
Proof. We have
$$E\!\left(s_t^2\right) = E\!\left[s_y^2\,\frac{S_x^2}{s_x^2}\right]
= \sum_{t=1}^{\binom{N}{n}} s_{yt}^2\,\frac{S_x^2}{s_{xt}^2}\, p(t)
= \frac{n(n-1)}{N(N-1)}\binom{N-2}{n-2}^{-1}\sum_{t=1}^{\binom{N}{n}} s_{yt}^2
= \binom{N}{n}^{-1}\sum_{t=1}^{\binom{N}{n}} s_{yt}^2 = S_y^2,$$
since $\binom{N}{n} = \dfrac{N(N-1)}{n(n-1)}\dbinom{N-2}{n-2}$ and the average of the sample variances $s_{yt}^2$ over all SRSWOR samples equals $S_y^2$. Hence the theorem.

Singh and Srivastava (1980) proposed two unbiased regression type strategies which depend on only one auxiliary variable. Wywial (1999) has considered an elegant extension of such sampling schemes to the case of multi-auxiliary information. Further, Wywial (2000) considered a study of Horvitz and Thompson (1952) type estimators under such sampling schemes. Chen (1998) has also proposed weighted polynomial models and weighted sampling schemes for finite populations.
5.14 GODAMBE'S STRATEGY: ESTIMATION OF PARAMETERS IN SURVEY SAMPLING

Godambe (1955) has considered the problem of estimation strategies in survey sampling through the concept of a unified approach. His way of estimating the parameters in survey sampling is unique and can be studied from Godambe (1955, 1960, 1976, 1980, 1984, 1989, 1991, 1995), Godambe and Heyde (1987), Godambe and Kale (1991), Godambe and Thompson (1984, 1986, 1989, 1996--97), and Binder and Patak (1994). We will discuss Godambe's strategy only briefly, as his detailed work is available in a book by Thompson (1997). Following Godambe's strategy, let us consider the problem of estimation of population parameters which can be implicitly defined by a population equation of the form
$$\sum_{j \in \Omega} \phi_j(y_j, x_j, \theta) = 0, \qquad (5.14.1)$$
where $y_j$ and $x_j$ are values of the observed variables $Y$ and $X$ respectively, the $\phi_j$ are known real valued functions, and $\theta$ is a real valued parameter of interest. Some special cases of (5.14.1) are as follows:

( a ) the population mean $\mu_y$ defined by
$$\sum_{j \in \Omega} (y_j - \mu_y) = 0; \qquad (5.14.2)$$

( b ) the population total $t_y$ defined by
$$\sum_{j \in \Omega} (y_j - t_y/N) = 0; \qquad (5.14.3)$$

( c ) the population ratio $R$ of $Y$ to $X$ defined by
$$\sum_{j \in \Omega} (y_j - R\, x_j) = 0; \qquad (5.14.4)$$

( d ) the population c.d.f. evaluated at $y$, defined through
$$I(y_j \le y) = \begin{cases} 1 & \text{if } y_j \le y, \\ 0 & \text{otherwise}; \end{cases} \qquad (5.14.5)$$

( e ) the population median, defined as the least value of $\theta$ such that
$$\sum_{j \in \Omega} \left[I(y_j \le \theta) - \tfrac{1}{2}\right] = 0. \qquad (5.14.6)$$

When a population function (or parameter) is defined by (5.14.1), its estimator can be defined as a solution of the sample estimating equation
$$\sum_{j \in s} \frac{\phi_j(y_j, x_j, \theta)}{\pi_j} = 0, \qquad (5.14.7)$$
where $\pi_j$ denotes the probability of including the $j$th unit in the sample. An estimator of the population mean can easily be obtained by solving $\sum_{j \in s} (y_j - \theta)/\pi_j = 0$, that is,
$$\bar{y}_s = \left(\sum_{j \in s} \frac{y_j}{\pi_j}\right)\Big/\left(\sum_{j \in s} \frac{1}{\pi_j}\right). \qquad (5.14.8)$$

An estimator of the population total can be obtained by solving $\sum_{j \in s} (y_j - \theta/N)/\pi_j = 0$, that is,
$$\hat{Y}_s = N\left(\sum_{j \in s} \frac{y_j}{\pi_j}\right)\Big/\left(\sum_{j \in s} \frac{1}{\pi_j}\right). \qquad (5.14.9)$$
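Each of the estimating equations above is linear in $\theta$, so solving them is direct. A minimal sketch (function names are my own, not from the source):

```python
def solve_mean(y, pi):
    """Solve sum_{i in s} (y_i - theta) / pi_i = 0, giving eq. (5.14.8)."""
    return sum(yi / p for yi, p in zip(y, pi)) / sum(1.0 / p for p in pi)

def solve_total(y, pi, N):
    """Solve sum_{i in s} (y_i - theta / N) / pi_i = 0, giving eq. (5.14.9)."""
    return N * solve_mean(y, pi)

def solve_ratio(y, x, pi):
    """Solve sum_{i in s} (y_i - R x_i) / pi_i = 0 for R, cf. eq. (5.14.4)."""
    return sum(yi / p for yi, p in zip(y, pi)) / sum(xi / p for xi, p in zip(x, pi))
```

With equal inclusion probabilities these collapse to the ordinary sample mean, the expansion total, and the ratio of sample totals.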

Similarly, estimators of the ratio $R$ of $Y$ to $X$, the median of $Y$, and the c.d.f. of $Y$ can be obtained by following Thompson (1997). Godambe (1995) also considered the problem of estimation of two types of parameters, defined below:

( a ) The parameters of the superpopulation from which the present survey population may be assumed to have been drawn as a random sample.
( b ) The parameters of the survey population itself.

The parameters ( a ) and ( b ) may or may not be correlated. A survey may be conducted with the intention of estimating both the ( a ) and ( b ) types of parameters. On the other hand, if the survey's emphasis is on the estimation of either ( a ) or ( b ), the former is a case of analytical surveys and the latter is a case of enumerative surveys. If $\xi_i(\theta)$ denotes the robust optimal estimating function for the $i$th unit of the survey population, then the induced parameter solves the equation
$$\sum_{i \in \Omega} \xi_i(\theta) = 0. \qquad (5.14.10)$$

Thus the underlying concept of the term 'target parameter' can most simply be explicated by 'induced parameter'. To explain it, let $m_i(\theta)$ and $v_i(\theta)$ be the mean and variance of the random variable $y_i$ associated with the $i$th unit of the survey population $(i = 1, 2, \ldots, N)$. In equation (5.14.10), let
$$\xi_i(\theta) = \frac{\{y_i - m_i(\theta)\}}{v_i(\theta)}\,\frac{\partial m_i(\theta)}{\partial \theta}, \qquad (5.14.11)$$
or its trimmed version if one wants to exclude the tails. The score function of the logistic model discussed by Skinner, Holt, and Smith (1989) can easily be seen as a special case of (5.14.11) and hence of (5.14.10). The robust optimality is as compelling as the optimality of the sample mean, or its weighted version, for the Gaussian linear models. The joint distribution of the study variable $Y_\Omega$ and the known auxiliary variable $X_\Omega$ is assumed to be such that $(y_i, x_i)$, $i = 1, 2, \ldots, N$, are independent with density
$$f(y_i, x_i; \theta, \eta_i) = f_2(x_i; \theta, \eta_i)\, f_1(y_i \mid x_i; \theta), \quad i = 1, 2, \ldots, N, \qquad (5.14.12)$$
where $\theta$ is a parameter of interest and $\eta_\Omega = (\eta_1, \eta_2, \ldots, \eta_N)$ is a set of nuisance parameters. The data consist of the known values of the auxiliary variate $X_\Omega$, the sample $s$ drawn with sampling design $p(\cdot \mid X_\Omega)$, and the variable $y_i$, $i \in s$, and can be written as
$$d = \{X_\Omega,\, s,\, (i, y_i) : i \in s\}. \qquad (5.14.13)$$
Then the probability density of the data $d$ is given by
$$\mathrm{Prob}(d; \theta, \eta_\Omega) = \left\{\prod_{i=1}^{N} f_2(x_i; \theta, \eta_i)\right\} p(s \mid X_\Omega) \left\{\prod_{i \in s} f_1(y_i \mid x_i; \theta)\right\}. \qquad (5.14.14)$$
The nuisance parameter $\eta_i$ can vary here. In (5.14.14), for every fixed $\theta$ the class of distributions of $X_\Omega$ obtained for the possible variations of the nuisance parameter is called a complete class of distributions. Following Godambe (1991), an unbiased estimating function $g = g(d, \theta)$ is optimal if
$$E\!\left[g \big/ E\!\left(\partial g / \partial \theta\right)\right]^2 \qquad (5.14.15)$$
is a minimum for $g = g^*$. Then the optimal estimating function for estimating $\theta$, based on the data $d$, is
$$g^*(d, \theta) = \sum_{i \in s} \frac{\partial \log f_1(y_i \mid x_i; \theta)}{\partial \theta}; \qquad (5.14.16)$$
that is, the optimal estimate of $\theta$ is the solution of the equation
$$g^*(d, \theta) = 0, \qquad (5.14.17)$$
which, in fact, is a special case of the general result of Godambe (1976). The interesting result is that the estimate of $\theta$ is free from the sampling design $p(\cdot \mid X_\Omega)$, which, in fact, supports the result of Royall (1970a, 1970b, 1970c). The estimates of the parameters of interest in general can be obtained by solving the maximum likelihood equations given by
$$\frac{\partial \log[\mathrm{Prob}(d; \theta, \eta_\Omega)]}{\partial \theta} = 0 \qquad (5.14.18)$$
and
$$\frac{\partial \log[\mathrm{Prob}(d; \theta, \eta_\Omega)]}{\partial \eta_\Omega} = 0. \qquad (5.14.19)$$
If $(\hat{\theta}, \hat{\eta}_\Omega)$ is the solution of the equations (5.14.18) and (5.14.19), then the estimate of $\mu$ is given by
$$\hat{\mu} = \mu(\hat{\theta}, \hat{\eta}_\Omega). \qquad (5.14.20)$$
It is to be noted that the equations (5.14.18) and (5.14.19) are independent of the sampling design $p(\cdot \mid X_\Omega)$, and so are the estimates $\hat{\theta}$, $\hat{\eta}_\Omega$, and $\hat{\mu}$. The estimating equations (5.14.18) and (5.14.19), and hence the implied estimate $\hat{\mu}$ of $\mu$, are dependent on the full model (5.14.12). Godambe raised the following interesting
question: Can this dependence of the estimate on the entire model be reduced by some alternative procedure of estimation? This query is particularly meaningful in that, as remarked earlier, the modelling of the design or auxiliary variate $X$ must in practice be very tentative. An alternative estimating procedure that utilizes only the conditional distribution in (5.14.12), namely $f_1$, would be very desirable. It would be even more helpful if this alternative procedure depended only on some semiparametric relationship underlying the conditional distribution $f_1$. Such an alternative procedure is given below. The main concept here is that of the 'induced parameter' defined by (5.14.10). Since the variates $y_i$, $i = 1, \ldots, N_0$, are i.i.d. with the unknown mean $\mu$, the induced parameter is given by the solution $\mu_0$ of the equation
$$\sum_{i=1}^{N_0} (y_i - \mu_0) = 0, \qquad (5.14.21)$$
that is,
$$\mu_0 = \frac{1}{N_0}\sum_{i=1}^{N_0} y_i. \qquad (5.14.22)$$
Assume that in the conditional distribution $f_1$, $\theta$ is a regression parameter and
$$E_m\{(y_i - \alpha - \theta x_i) \mid x_i\} = 0, \quad i = 1, \ldots, N_0, \qquad (5.14.23)$$
where $\alpha$ is a known constant. The theory of estimating functions given by Godambe and Thompson (1986) can be seen to provide optimal estimation of the induced parameter $\mu_0$ in (5.14.22) by using (5.14.23) and the sampling design $p(\cdot \mid X_\Omega)$. Note that this optimality is both conditional on holding the design variable $X_\Omega$ fixed and unconditional. The unconditional optimality is important here, for it is in the unconditional distribution that
$$\mu_0 \to \mu \quad\text{as}\quad N_0 \to \infty. \qquad (5.14.24)$$
In other words, for survey populations with large size $N_0$ we have $\mu_0 \cong \mu$. This also provides justification for using the estimate which is optimal, or approximately optimal, for the induced parameter $\mu_0$ in place of the parent parameter $\mu$. The conditional and unconditional optimality results are given below. Let
$$H = \sum_{i=1}^{N_0} (y_i - \alpha - \theta x_i). \qquad (5.14.25)$$
It is remarkable here that for given $(\alpha, \theta, X_\Omega)$, $H$ is a function of $\mu_0$. Using the data $d$ in (5.14.13) and the parameter $\theta$, we can obtain an optimal estimating function $h^*(d, \theta)$ for $H$. For the given sampling design $p(\cdot \mid X_\Omega)$, let the first order inclusion probabilities be
$$\pi_i = \sum_{s \ni i} p(s \mid X_\Omega), \quad i = 1, \ldots, N_0. \qquad (5.14.26)$$
For a fixed $\theta$, let $h(d, \theta)$ denote an estimating function based on the data $d$ which is design unbiased for the function $H$, that is, $E_p(h) = \sum_s h\, p(s \mid X_\Omega) = H$. Let $E_m$ denote the expectation with respect to the model (5.14.23), both conditionally on $X_\Omega$ and unconditionally. In the class of all design unbiased estimating functions $h$, $h^*$ is said to be optimal if, in this class, $E_m E_p(h - H)^2$ is minimized for $h = h^*$; in other words, if
$$E_m E_p(h^* - H)^2 \le E_m E_p(h - H)^2. \qquad (5.14.27)$$

Following Godambe and Thompson (1986), one such optimum estimating function is defined as
$$h^*(d, \theta) = \sum_{i \in s} \frac{y_i - \alpha - \theta x_i}{\pi_i}, \qquad (5.14.28)$$
which is also called a design weighted optimum estimating function. Similarly, for a fixed $\theta$, the optimal estimating function or estimate $\hat{\mu}_0$ for $\mu_0$ is given by the solution of the equation $h^* - H = 0$, or equivalently
$$\hat{\mu}_0 = \frac{h^*}{N_0} + \left(\alpha + \frac{\theta}{N_0}\sum_{i=1}^{N_0} x_i\right). \qquad (5.14.29)$$

It is clear that $\hat{\mu}_0$ in (5.14.29) depends upon the unknown parameter $\theta$. Let $\hat{\theta}$ be an estimator of $\theta$ obtained by setting $g^*(d, \hat{\theta}) = 0$. Then an approximation to $\hat{\mu}_0$ is given by
$$\tilde{\mu}_0 = \frac{1}{N_0}\, h^*(d, \hat{\theta}) + \left(\alpha + \frac{\hat{\theta}}{N_0}\sum_{i=1}^{N_0} x_i\right). \qquad (5.14.30)$$

In the case of the unconditional distribution, where $X_\Omega$ is allowed to vary, $\mu_0 \to \mu$. This is an analog of the condition given by Godambe (1976) for the conditional score function $g^*$. The approximation $\tilde{\mu}_0$ of $\hat{\mu}_0$ can easily be justified by solving the equation $h^* - H = 0$ for a given $\theta$. Godambe (1991) has shown that for large samples the difference $h^* - H$ is almost free from $\theta$ because $\frac{\partial}{\partial \theta} E(h^* - H) = 0$. Thus replacement of $\theta$ by its estimator $\hat{\theta}$ can be expected to provide an approximately optimal, or near optimal, estimate. The optimality property of $h^*$, given by
$$E_m E_p(h^* - H)^2 \le E_m E_p(h - H)^2,$$
is restricted to design unbiased estimating functions. Here it is important to point out that the design unbiased estimating functions form a wider class than the design unbiased estimates. For many sampling designs which admit a 'natural' sequence of sample sizes and (survey) population sizes $\to \infty$, design unbiased estimating functions in common use provide 'consistent' estimates under appropriate conditions. Godambe and Thompson (1986) have pointed out that an
obvious finite sample explication of consistent estimation is provided by the notion of a design unbiased estimating function. It is notable that the appropriate design weights can lead to optimal estimation for a model less restrictive than (5.14.23) discussed above. For example, the estimating function $g_s^* = \sum_{i \in s} \xi_i(\theta)$, and hence the estimate $\hat{\theta}$, depend upon the conditional variances of the $y_i$, namely $\sigma_i^2$, $i = 1, \ldots, N_0$. Hence these conditional variances can affect the estimate $\hat{\mu}_0$ of the population parameter of interest $\mu_0$. Consider a sampling design of fixed sample size with inclusion probabilities $\pi_i \propto x_i$, $i = 1, \ldots, N_0$. In such situations the estimating function $h^* - H$, and hence the solution $\hat{\mu}_0$ of the equation $h^* - H = 0$, are independent of $\theta$, rendering estimation of $\theta$ irrelevant. If the above mentioned sampling design is implemented with an appropriate stratification, $h^* - H$, and hence $\hat{\mu}_0$, will also be independent of the intercept $\alpha$; in other words, the use of appropriate design weights can robustify the optimality. Although it is assumed in the above discussion that the entire design vector $X_\Omega$ is known, it is to be noted that the estimating function $h^*$ is also optimal if only the sample values $x_i$, $i \in s$, and the population mean $\bar{X} = N_0^{-1}\sum_{i=1}^{N_0} x_i$ of the auxiliary variable are known in addition to the design inclusion probabilities $\pi_i$, $i = 1, \ldots, N_0$.

5.14.1 OPTIMAL ESTIMATING FUNCTIONS

Suppose $y_i$, $i = 1, \ldots, N$, are drawn from a superpopulation model with a parameter $\theta$, possibly a vector. Assume that the model is such that the $y_i$, $i = 1, \ldots, N$, are independent and that, for some specified real functions $\phi_i(y_i, \theta)$ of the indicated variables,
$$E_\theta\{\phi_i(y_i, \theta)\} = 0, \qquad (5.14.1.1)$$
where $i = 1, \ldots, N$, and the choice of $\phi_i$ depends upon the problem of estimation under consideration. Godambe (1995) considered the problem of estimation of the general function defined as
$$H = \sum_{i \in \Omega} \phi_i(y_i, \theta) \qquad (5.14.1.2)$$
with the estimator defined as
$$h^* = \sum_{i \in s} \frac{\phi_i}{\pi_i}, \qquad (5.14.1.3)$$
where $\pi_i = \sum_{s \ni i} p(s \mid X_\Omega)$ denotes the first order inclusion probability. The function $h^*$ is optimal in the sense of having minimum variance.
Assume the function $\phi_i$ can be written as
$$\phi_i(y_i, \theta) = f_i(y_i) - a_i(\theta), \qquad (5.14.1.4)$$
where $f_i(y_i)$ and $a_i(\theta)$ are specified functions of the indicated variables. Then it follows from the optimality of $h^*$ that, for any given $\theta$, the optimal estimate of
$$\sum_{i \in \Omega} f_i(y_i) \qquad (5.14.1.5)$$
is given by
$$h = \sum_{i \in s} \frac{f_i(y_i) - a_i(\theta)}{\pi_i} + \sum_{i \in \Omega} a_i(\theta). \qquad (5.14.1.6)$$

It is to be noted that the replacement of $\theta$ in (5.14.1.6) by a plausible estimate can be expected to affect the optimality of the estimating function $h$ only a little in large samples. The reason behind this is that its optimality is, in particular, independent of any assumption concerning the variance functions $E_m(y_i - E_m(y_i))^2$, $i = 1, \ldots, N$, in the superpopulation model. The optimal estimate $h$ given in (5.14.1.6) is itself an estimating function and contains the unknown superpopulation parameter $\theta$. Also, we have $\frac{\partial}{\partial \theta} E(h - H) = 0$.

The estimating function (5.14.1.6) is calibrated in the sense that for $f_i(y_i) = a_i(\theta)$, $i = 1, \ldots, N$, we have
$$h = \sum_{i \in \Omega} a_i(\theta) = \sum_{i \in \Omega} f_i(y_i). \qquad (5.14.1.7)$$
This will remain true if $\theta$ is replaced by its estimate. We list here a few assumptions about the functions $f_i$ and $a_i$ which have important implications for robust estimation. For the superpopulation model satisfying the relations $E_\theta\{\phi_i(y_i, \theta)\} = 0$ and $\phi_i(y_i, \theta) = f_i(y_i) - a_i(\theta)$, there exists a survey population based estimate $\theta_N$ with the following properties:

( a ) $\theta_N$ is consistent for $\theta$;

( b ) $\theta_N$ solves the population estimating equation
$$\sum_{i \in \Omega} \{f_i(y_i) - a_i(\theta)\} = 0; \qquad (5.14.1.8)$$

( c ) $\left|\dfrac{\partial a_i(\theta)}{\partial \theta}\right| < C$, a fixed constant, for all $\theta$ and $i = 1, \ldots, N$.
Using a first order Taylor series expansion of $a_i(\theta)$ around the point $\theta_N$, under assumption ( b ), and for large $N$ if $N^{-1}\sum_i a_i$ is of $O(1)$, the optimal estimating function $h$ satisfies
$$\frac{1}{N}\sum_{i \in s} d_i\{f_i(y_i) - a_i(\theta)\} \cong 0, \qquad (5.14.1.9)$$
where $d_i = \pi_i^{-1}$. We know that the replacement of $\theta$ by a plausible estimate can be expected to affect the optimality of the estimating function $h$ only moderately in large samples. Let $\hat{\theta}$ be a consistent sample based estimate of the population parameter $\theta$, and suppose the approximation given in (5.14.1.9) holds good with $\theta$ replaced by $\hat{\theta}$. Then, under assumption ( b ), $\sum_{i \in \Omega} a_i(\hat{\theta})$ is a near optimal estimate of the population parameter $\sum_{i \in \Omega} f_i(y_i)$.

5.14.2 REGRESSION TYPE ESTIMATORS

If $f_i(y_i) = y_i$, $i = 1, 2, \ldots, N$, then the problem of estimation of $\sum_{i \in \Omega} f_i(y_i)$ reduces to the problem of estimation of the population total $Y = \sum_{i \in \Omega} y_i$. Assume that $a_i(\theta) = \beta x_i$, $i = 1, 2, \ldots, N$, such that $\sum_{i \in \Omega} a_i(\theta) = \beta \sum_{i \in \Omega} x_i = \beta X$ is known.

Let $[f_i(y_i) = y_i,\ a_i(\theta) = \beta x_i;\ i = 1, 2, \ldots, n]$ be the $n$ sampled observed values on the study and auxiliary variables. Then the estimator (5.14.1.6) becomes
$$h = \sum_{i \in s} \frac{f_i(y_i) - a_i(\theta)}{\pi_i} + \sum_{i \in \Omega} a_i(\theta)
= \sum_{i \in s} \frac{y_i - \beta x_i}{\pi_i} + \beta X
= \sum_{i \in s} \frac{y_i}{\pi_i} + \beta\left(X - \sum_{i \in s} \frac{x_i}{\pi_i}\right), \qquad (5.14.2.1)$$
which is the usual difference estimator if $\beta$ is known; otherwise it is the usual regression estimator.

If $\beta = \left(\sum_{i \in s} y_i/\pi_i\right)\big/\left(\sum_{i \in s} x_i/\pi_i\right)$, then the estimator (5.14.2.1) reduces to the usual ratio estimator. Godambe (1995) has also discussed ratio and regression type estimators under stratified random sampling and over different occasions. For details on stratified random sampling and on sampling over different occasions, one can refer to Chapter 8 and Chapter 10. Godambe's paradox and the ancillary principle have also been discussed by Bhave (1987), and their resolution is discussed by Godambe (1987). Godambe and Thompson (1999) have shown that the theory of estimating functions can also be used to construct confidence intervals for population parameters in survey sampling. Godambe (1998) also considered the problem of estimation of parameters in survey sampling.
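Estimator (5.14.2.1) can be sketched in a few lines (the function name and the `beta=None` convention are my own, not the book's):

```python
def difference_estimator(y, x, pi, X, beta=None):
    """h = sum_s y_i/pi_i + beta * (X - sum_s x_i/pi_i), eq. (5.14.2.1).

    y, x, pi are the sampled values and inclusion probabilities; X is the
    known population total of x.  Passing beta=None uses the design-weighted
    choice beta = (sum y/pi)/(sum x/pi), which collapses h to the usual
    ratio estimator.
    """
    ty = sum(yi / p for yi, p in zip(y, pi))   # Horvitz-Thompson total of y
    tx = sum(xi / p for xi, p in zip(x, pi))   # Horvitz-Thompson total of x
    if beta is None:
        beta = ty / tx                         # ratio-estimator choice
    return ty + beta * (X - tx)
```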
Chapter 5: Use of auxiliary information : PPSWOR Sampling 473

5.14.3 SINGH1~. .S TRATEGY IN TWO-DIMENSIONAL SPACE

Singh (2000a) considered the problem of estimation of a general parameter of
interest in two-dimensional space, defined as

\frac{1}{2}\sum_{i\in\Omega}\sum_{j\in\Omega}\varphi(y_i, y_j, x_i, x_j, \theta) = 0.   (5.14.3.1)

The indices $i$ and $j$ may or may not be equal. An unbiased estimator of (5.14.3.1)
for a measurable design is

\frac{1}{2}\sum_{i\in s}\sum_{j\in s} d_{ij}\,\varphi(y_i, y_j, x_i, x_j, \theta) = 0   (5.14.3.2)

with $d_{ij} = \pi_{ij}^{-1}$. Singh (2000a) developed a new estimator of variance by considering
the following function to be estimated:

\frac{1}{2}\sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}\left[\Omega_{ij}(d_i y_i - d_j y_j)^2 - 2V\{N(N-1)\}^{-1}\right] = 0,   (5.14.3.3)

where $d_i = \pi_i^{-1}$, $\Omega_{ij} = \pi_i\pi_j - \pi_{ij}$,

V = \frac{1}{2}\sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}\Omega_{ij}(d_i y_i - d_j y_j)^2,

and $2/\{N(N-1)\}$ denotes an adjustment factor. Using (5.14.3.2), an unbiased
estimator of (5.14.3.3) is given by

\frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s} d_{ij}\left[\Omega_{ij}(d_i y_i - d_j y_j)^2 - 2\hat{V}\{N(N-1)\}^{-1}\right] = 0,   (5.14.3.4)

which leads to the following new estimator of variance of the Horvitz and Thompson
(1952) estimator of total in Sen--Yates--Grundy (1953) form, given by

\hat{v}_2 = \frac{N(N-1)}{2}\;
\frac{\displaystyle\sum_{i\in s}\sum_{j(\ne i)\in s}\frac{\Omega_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2}
{\displaystyle\sum_{i\in s}\sum_{j(\ne i)\in s}\frac{1}{\pi_{ij}}},   (5.14.3.5)

where

\hat{v}_1 = \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s} d_{ij}\Omega_{ij}(d_i y_i - d_j y_j)^2

denotes the usual Sen--Yates--Grundy (1953)
estimator of variance for a fixed sample size and a measurable design.
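As a concrete check of (5.14.3.5), the following sketch computes both $\hat{v}_1$ and $\hat{v}_2$ by brute force over the sampled pairs; all inputs (the sample, $\pi_i$, $\pi_{ij}$, $N$) are hypothetical.

```python
# v1: Sen--Yates--Grundy estimator; v2: Singh's (2000a) adjusted estimator
# of (5.14.3.5).  d_i = 1/pi_i, d_ij = 1/pi_ij, Omega_ij = pi_i*pi_j - pi_ij.

def v1_syg(s, y, pi, pi2):
    """(1/2) * sum over ordered pairs i != j in s of
    d_ij * Omega_ij * (d_i y_i - d_j y_j)^2."""
    total = 0.0
    for i in s:
        for j in s:
            if i != j:
                omega = pi[i] * pi[j] - pi2[(i, j)]
                total += (omega / pi2[(i, j)]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return 0.5 * total

def v2_singh(s, y, pi, pi2, N):
    """N(N-1)/2 times the ratio of the weighted sum of squared paired
    differences to the sum of 1/pi_ij, as in (5.14.3.5)."""
    num = den = 0.0
    for i in s:
        for j in s:
            if i != j:
                omega = pi[i] * pi[j] - pi2[(i, j)]
                num += (omega / pi2[(i, j)]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2
                den += 1.0 / pi2[(i, j)]
    return N * (N - 1) / 2.0 * num / den
```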

Under Midzuno--Sen's scheme of sampling with revised selection probabilities

\pi_i = nP_i \quad\text{and}\quad \pi_{ij} = \frac{n(n-1)}{N-2}\left\{P_i + P_j - \frac{1}{N-1}\right\},

the estimator in (5.14.3.5) reduces to

\hat{v}_2 = \frac{\displaystyle\frac{1}{2n(N-2)}\sum_{i\in s}\sum_{j(\ne i)\in s}\left\{n(N-2)P_iP_j\left(P_i + P_j - \frac{1}{N-1}\right)^{-1} - (n-1)\right\}\left(\frac{y_i}{P_i} - \frac{y_j}{P_j}\right)^2}
{\displaystyle\{N(N-1)\}^{-1}\sum_{i\in s}\sum_{j(\ne i)\in s}\left(P_i + P_j - \frac{1}{N-1}\right)^{-1}}.   (5.14.3.6)
474 Advanced sampling theory with applications

Under PPSWR sampling, $\pi_i = nP_i$, $\pi_{ij} = n(n-1)P_iP_j$, and therefore (5.14.3.5) reduces
to

\hat{v}_2 = \frac{1}{2n}\sum_{i\in s}\sum_{j(\ne i)\in s}\left(\frac{y_i}{P_i} - \frac{y_j}{P_j}\right)^2 \Bigg/ \left[\{N(N-1)\}^{-1}\sum_{i\in s}\sum_{j(\ne i)\in s}\frac{1}{P_iP_j}\right].   (5.14.3.7)

Following Singh (2000a) one can easily get an improved estimator of the variance of
the GREG, given by

v_2(\hat{Y}_{greg}) = \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}\hat{\Omega}_{ij}(w_ie_i - w_je_j)^2 \Bigg/ \left[\{N(N-1)\}^{-1}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\right],   (5.14.3.8)

where

\hat{\Omega}_{ij} = d_{ij}\Omega_{ij} + \frac{d_{ij}\Omega_{ij}Q_{ij}(d_ix_i - d_jx_j)^2}{\frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}Q_{ij}(d_ix_i - d_jx_j)^2}\left[\hat{V}(\hat{X}_{HT}) - \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}(d_ix_i - d_jx_j)^2\right]

and

w_i = d_i + d_iq_ix_i\left(\sum_{i=1}^{n}d_iq_ix_i^2\right)^{-1}(X - \hat{X}_{HT}),

with $q_i$ and $Q_{ij}$ suitably chosen constants; these are the higher and lower
order calibrated weights as reported by Singh, Horn, and Yu (1998). Singh (2000a)
considered the problem of estimation of the ratio of the variance of the study variable
to that of the auxiliary variable as

\frac{1}{2}\sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}\Omega_{ij}\left[(d_iy_i - d_jy_j)^2 - V_R(d_ix_i - d_jx_j)^2\right] = 0.   (5.14.3.9)

Obviously, for a measurable design, an estimator of (5.14.3.9) is given by

\frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}\left[(d_iy_i - d_jy_j)^2 - \hat{V}_R(d_ix_i - d_jx_j)^2\right] = 0.   (5.14.3.10)

Thus an estimator of the variance ratio is given by

\hat{V}_R = \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}(d_iy_i - d_jy_j)^2 \Bigg/ \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}(d_ix_i - d_jx_j)^2.   (5.14.3.11)

He then considered the problem of estimation of the regression coefficient as

\frac{1}{2}\sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}\Omega_{ij}\left[(d_iy_i - d_jy_j)(d_ix_i - d_jx_j) - \beta(d_ix_i - d_jx_j)^2\right] = 0.   (5.14.3.12)

For a measurable design, an unbiased estimator of (5.14.3.12) is given by

\frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}\left[(d_iy_i - d_jy_j)(d_ix_i - d_jx_j) - \hat{\beta}(d_ix_i - d_jx_j)^2\right] = 0.   (5.14.3.13)

Thus an estimator of the regression coefficient is given by

\hat{\beta} = \sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}(d_ix_i - d_jx_j)(d_iy_i - d_jy_j) \Bigg/ \sum_{i\in s}\sum_{j(\ne i)\in s}d_{ij}\Omega_{ij}(d_ix_i - d_jx_j)^2.   (5.14.3.14)
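Under the stated assumptions on $d_{ij}$ and $\Omega_{ij}$, both (5.14.3.11) and (5.14.3.14) are ratios of the same kind of weighted double sums, so a single helper suffices; the sketch below is hypothetical and not from the text.

```python
# Paired-difference estimators (5.14.3.11) and (5.14.3.14).

def paired_sums(s, y, x, pi, pi2):
    """Weighted double sums over ordered pairs i != j in s, with
    weight d_ij * Omega_ij = (pi_i*pi_j - pi_ij) / pi_ij."""
    syy = sxx = sxy = 0.0
    for i in s:
        for j in s:
            if i != j:
                w = (pi[i] * pi[j] - pi2[(i, j)]) / pi2[(i, j)]
                dy = y[i] / pi[i] - y[j] / pi[j]
                dx = x[i] / pi[i] - x[j] / pi[j]
                syy += w * dy * dy
                sxx += w * dx * dx
                sxy += w * dx * dy
    return syy, sxx, sxy

def variance_ratio_hat(s, y, x, pi, pi2):
    """V_R_hat of (5.14.3.11)."""
    syy, sxx, _ = paired_sums(s, y, x, pi, pi2)
    return syy / sxx

def beta_hat(s, y, x, pi, pi2):
    """beta_hat of (5.14.3.14)."""
    _, sxx, sxy = paired_sums(s, y, x, pi, pi2)
    return sxy / sxx
```

If $y_i = \beta x_i$ exactly for every unit, then each paired difference satisfies $d_iy_i - d_jy_j = \beta(d_ix_i - d_jx_j)$, so $\hat{\beta} = \beta$ and $\hat{V}_R = \beta^2$, as one would hope.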

In Chapter 3 we considered the problem of estimation of the median $M_y$ of the
study variable in the presence of a known median $M_x$ of the auxiliary variable, as
considered by Kuk and Mak (1989), by defining a matrix of proportions $(P_{ij})$ as

                 y <= M_y    y > M_y    Total
   x <= M_x       P_11        P_21      P_1.
   x > M_x        P_12        P_22      P_2.
   Total          P_.1        P_.2       1
Kuk and Mak (1989) considered three estimators $\hat{M}_R$, $\hat{M}_S$ and $\hat{M}_P$ under the
names of ratio, stratified and position estimators of the median of $y$, respectively, in
the presence of a known median of the variable $x$. It is interesting to note that the
variances of all three estimators are a function of $P_{11}$, for which $x \le M_x$ and
$y \le M_y$ as defined above. The value of $P_{11}$ can be determined by

\frac{1}{2}\sum_{i\in\Omega}\sum_{j\in\Omega}\left[I(y_i \le M_y \cap x_j \le M_x) - P_{11}\right] = 0,   (5.14.3.15)

where

I(y_i \le M_y \cap x_j \le M_x) = \begin{cases}1 & \text{if } y_i \le M_y \cap x_j \le M_x,\\ 0 & \text{otherwise.}\end{cases}

An unbiased estimator of (5.14.3.15) is given by

\frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}\left[I(y_i \le M_y \cap x_j \le M_x) - \hat{P}_{11}\right] = 0,   (5.14.3.16)

or equivalently

\hat{P}_{11} = \frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}\,I(y_i \le M_y \cap x_j \le M_x) \Bigg/ \frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}.   (5.14.3.17)

Singh (2000a) suggested that this procedure would be useful for estimating the
variances of the estimators of the median proposed by Kuk and Mak (1989). Finally, he
considered the problem of poststratification in the two-dimensional plane. Let $y$
and $x$ be the two variables used for poststratifying the given data set. Let us define

y_i = \begin{cases}1 & \text{if } a \le y \le b,\\ 0 & \text{otherwise,}\end{cases}
\qquad
x_i = \begin{cases}1 & \text{if } c \le x \le d,\\ 0 & \text{otherwise,}\end{cases}

where $a$, $b$, $c$ and $d$ are predefined limits used for defining the stratum
boundaries. For example, the stratum proportion $W$ defined through

\frac{1}{2}\sum_{i\in\Omega}\sum_{j\in\Omega}\left[I(a \le y_i \le b,\; c \le x_j \le d) - W\right] = 0,   (5.14.3.18)

where

I(a \le y_i \le b,\; c \le x_j \le d) = \begin{cases}1 & \text{if } a \le y_i \le b,\; c \le x_j \le d,\\ 0 & \text{otherwise,}\end{cases}

counts the units with $a \le y_i \le b$ and $c \le x_j \le d$. In the case of
poststratification the sample size $n$ is a random variable; therefore an estimator of
(5.14.3.18) is given by

\frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}\left[I(a \le y_i \le b,\; c \le x_j \le d) - \hat{W}\right] = 0,   (5.14.3.19)

or equivalently

\hat{W} = \frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}\,I(a \le y_i \le b,\; c \le x_j \le d) \Bigg/ \frac{1}{2}\sum_{i\in s}\sum_{j\in s}d_{ij}.   (5.14.3.20)
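Both (5.14.3.17) and (5.14.3.20) are the same computation: a ratio of weighted double sums of 0/1 indicators. A minimal sketch, with hypothetical weights $d_{ij}$:

```python
# Weighted-proportion estimators of the form (5.14.3.17) / (5.14.3.20).

def weighted_proportion(s, d, indicator):
    """[sum_{i,j in s} d_ij * I(i, j)] / [sum_{i,j in s} d_ij]."""
    num = sum(d[(i, j)] * indicator(i, j) for i in s for j in s)
    den = sum(d[(i, j)] for i in s for j in s)
    return num / den

# For P11_hat of (5.14.3.17): indicator(i, j) = 1 iff y_i <= M_y and x_j <= M_x.
# For W_hat of (5.14.3.20):   indicator(i, j) = 1 iff a <= y_i <= b, c <= x_j <= d.
```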

5.14.4 GODAMBE'S STRATEGY FOR LINEAR BAYES AND OPTIMAL ESTIMATION

Godambe (1999) has discussed very thoroughly the optimal estimating function in
linear Bayes form. He followed the linear Bayes methodology introduced by
Hartigan (1969) to handle Bayesian semi-parametric models based on fewer
moments. As in the case of non-Bayesian statistics, it is common practice to replace
a full distributional assumption by a much weaker assumption about its first few
moments, such as the mean and variance. In the Bayesian approach one may similarly
consider the replacement of a completely specified prior distribution by an
assumption about just a few moments of the distribution. Interestingly, Godambe
(1999) proposed an alternative methodology based on the theory of optimum
estimating functions and showed that his strategy is more readily applicable and
efficient in common problems than the linear Bayes methodology. Godambe (1999)
generalized results of Godambe and Thompson (1989) by extending the theory of
optimum estimating functions to semi-parametric Bayesian models. Following
Godambe (1999), let $X = \{x\}$ be an abstract sample space, $P = \{p\}$ be a class of
probability distributions of $X$, $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$ be a real valued $m$-dimensional
parameter of interest defined on $P$, and $\Omega = \{\theta(p) : p \in P\}$. Here we wish to estimate
$\theta$ based on the sample information $X = \{x\}$ through elementary estimating functions.
An elementary estimating function $h_j$ is a real valued function defined on $X \times \Omega$
such that, under the distribution $p \in P$,

E(h_j \mid X_j) = 0,   (5.14.4.1)

where $X_j$ is some partition of the sample space or a $\sigma$-field generated by a
partition. Such an elementary estimating function $h_j$ is called unbiased
conditionally on $X_j$. Let us now consider
g = (g_1, \ldots, g_m)   (5.14.4.2)

to be an estimating function, such that $g_r = \sum_{j=1}^{k} h_j a_{jr}$, where the $a_{jr}$ are any functions
of $(x, \theta)$ which are measurable with respect to $X_j$, $j = 1, 2, \ldots, k$; $r = 1, 2, \ldots, m$.

Suppose a class $G$ of all such estimating functions $g$ is given by
G = \{g\}.   (5.14.4.3)
Now if we introduce another estimating function

g^* = (g_1^*, \ldots, g_m^*),   (5.14.4.4)

where $g_r^* = \sum_{j=1}^{k} h_j a_{jr}^*$ with $a_{jr}^* = E\{\partial h_j/\partial\theta_r \mid X_j\}\big/E\{h_j^2 \mid X_j\}$, assumed to exist for $r = 1, 2, \ldots, m$. One
may note here that the estimating function $g^*$ is also a member of the class $G$.
Denoting $h_j a_{jr}^*$ by $h_{jr}$, the elementary estimating functions $h_j$, $j = 1, 2, \ldots, k$, are said
to be mutually orthogonal if

E(h_{jr} h_{j'r'} \mid X_j) = 0   (5.14.4.5)

for $j \ne j'$; $j, j' = 1, 2, \ldots, k$; $r, r' = 1, 2, \ldots, m$. Corresponding to the
estimating function $g$, define the two matrices

J = E\begin{pmatrix} g_1^2 & g_1g_2 & \cdots & g_1g_m \\ g_2g_1 & g_2^2 & \cdots & g_2g_m \\ \vdots & \vdots & & \vdots \\ g_mg_1 & g_mg_2 & \cdots & g_m^2 \end{pmatrix} = \left\|E(g_r g_{r'})\right\| \quad\text{and}\quad H = \left\|E\!\left(\frac{\partial g_r}{\partial \theta_{r'}}\right)\right\|.

Similarly, for the estimating function $g^*$ define the two matrices $J^*$ and $H^*$ as

J^* = \left\|E(g_r^* g_{r'}^*)\right\| \quad\text{and}\quad H^* = \left\|E\!\left(\frac{\partial g_r^*}{\partial \theta_{r'}}\right)\right\|.

Thus we have the following theorem.

Theorem 5.14.4.1. If the elementary estimating functions $h_j$, $j = 1, 2, \ldots, k$, are
mutually orthogonal, then in the class $G$ the estimating function $g^*$ is optimal in
the sense that the matrix

D = J - H(H^*)^{-} J^* \{(H^*)'\}^{-} H'   (5.14.4.6)

is positive semi-definite for all $g \in G$, where $A'$ and $A^{-}$ denote the transpose and
generalized inverse of the corresponding matrix $A$. An estimate of $\theta$ is obtained
by solving the equation $g^* = 0$ for the observed value of $x$. The details of this
theorem can be found in Godambe and Thompson (1989).

In order to generalize the above theorem, Godambe (1999) first considered $X_j$ to be
a partition of $X \times \Omega$, with the parameter $\theta$ having some distribution. Let $p$
denote a joint distribution of $(x, \theta)$ and $P = \{p\}$. Thus, in the above discussion, all
the expected values are to be replaced by expected values with respect to the
joint distribution $p$ of $(x, \theta)$. In other words, $E(\cdot \mid \theta)$ reduces to the expected
value in the above theorem when the partition of $X \times \Omega$ is obtained by holding $\theta$
fixed. Thus the elementary estimating functions $h_j$, $j = 1, 2, \ldots, k$, would be functions
of the parameter $\theta$. Following Godambe (1999), the estimating functions $g$,
including $g^*$, in the class $G$ under the new setup are obtained from the functions
$h_j$ with coefficients $a_{jr}$ that are measurable functions with respect to the new
partition of $X \times \Omega$ corresponding to $h_j$, $j = 1, \ldots, k$; $r = 1, \ldots, m$. Assuming the
orthogonality of the estimating functions $h_j$, $j = 1, 2, \ldots, k$, defining the class $G = \{g\}$,
and defining the matrices $J_1$ and $H_1$ similar to $J$ and $H$, respectively, we have the
following theorem due to Godambe (1999).

Theorem 5.14.4.2. Assume the interpretation of the estimating function $g^*$, the class
$G$, the orthogonality, and the matrices $J_1$ and $H_1$. If the elementary
estimating functions $h_j$, $j = 1, \ldots, k$, are mutually orthogonal, then in the class $G$
the estimating function $g^*$ is optimal in the sense that the matrix

D^* = J_1 - H_1(H_1^*)^{-} J_1^* \{(H_1^*)'\}^{-} H_1'   (5.14.4.7)

is positive semi-definite for all $g \in G$.

Proof. Because of the orthogonality of the estimating functions $h_j$, $j = 1, 2, \ldots, k$, we
have

\left\|E\!\left(\frac{\partial g_r}{\partial \theta_{r'}}\right)\right\| = \left\|E(g_r g_{r'}^*)\right\|, \qquad r, r' = 1, \ldots, m.   (5.14.4.8)

Following Godambe and Thompson (1987), this equality establishes the positive
semi-definiteness of the matrix $D^*$. Hence the theorem.

Again we may note here that the estimate of $\theta$ is obtained by solving the equation
$g^* = 0$ for given $x$. In the case of a scalar parameter $\theta = \theta_1$, the optimality of the
corresponding estimating function $g^*$ in the above theorem is equivalent to the
inequality

\left\{E\!\left(\frac{\partial g}{\partial \theta_1}\right)\right\}^2 \Big/ E(g^2) \;\le\; \left\{E\!\left(\frac{\partial g^*}{\partial \theta_1}\right)\right\}^2 \Big/ E(g^{*2}).

For more details on this topic one can refer to Basawa, Godambe, and Taylor (1997).



5.15 UNIFIED THEORY OF SURVEY SAMPLING

In this section we consider the problem of estimating a population total or mean
through a unified approach to survey sampling. The concepts of uniform
admissibility, hyper-admissibility, the uniformly minimum variance unbiased
estimator, and sufficiency in survey sampling are discussed.

5.15.1 CLASS OF ADMISSIBLE ESTIMATORS

We first discuss possible definitions of best and admissible estimators of the population
total $Y = \sum_{i\in\Omega} y_i$ for a given sampling design $P$. Let $s$ be a sample of units drawn
according to a probability scheme $P : \{p(s)\}$ from the totality of possible
samples $S$, where $p(s) \ge 0$ for all $s \in S$ and $\sum_{s\in S} p(s) = 1$. Let $y = (y_1, y_2, \ldots, y_N)$, a
vector of variate values associated with the different units in a population of size $N$, be
an element of the Euclidean space $R_N$.

5.15.2 ESTIMATOR

An estimator of a population parameter is a function defined on $S \times R_N$ whose
value for any $s \in S$ depends upon $y$ only through those $y_i$ with $i \in s$. Murthy and Singh
(1969) discussed the following definitions of admissibility of estimators in a class
$C(P)$ of estimators of the population total. Let $e_1(s, y)$ and $e_2(s, y)$ be any two
estimators of the population total $Y$ based on a sample $s$ under the design $P$.
Then we have the following definitions:

5.15.3 ADMISSIBLE ESTIMATOR

In a class $C(P)$ of estimators of the population total $Y$, an estimator $e_1 \in C(P)$ is
admissible if for every other estimator $e \in C(P)$ either $MSE(e_1) = MSE(e)$ for all $y$ or
$MSE(e_1) < MSE(e)$ for at least one $y$.

5.15.4 STRICTLY ADMISSIBLE ESTIMATOR

An estimator $e_1 \in C(P)$ is strictly admissible if for every other estimator $e \in C(P)$
we have $MSE(e_1) < MSE(e)$ for at least one $y$.

Now we have the following theorem :



Theorem 5.15.1. If an estimator is strictly admissible, it is necessarily admissible,
but the reverse is not true.

Proof. Consider a class $C(P)$ containing four estimators $e_1$, $e_2$, $e_3$ and $e_4$. Let
$MSE(e_1) = MSE(e_2)$ for all $y$ and $MSE(e_1) \le MSE(e_i)$, $i = 3, 4$, for all $y$, with strict
inequality holding for some of the $y$ values. In this case both $e_1$ and $e_2$ are
admissible, but neither of them is strictly admissible. Now if $e_1$ is excluded then $e_2$
becomes strictly admissible. This proves the theorem.


Godambe (1955) proved the non-existence of a uniformly best estimator in a
class of homogeneous linear unbiased estimators of the population total. Following
Godambe (1955), a most general type of linear estimator of the population total $Y$ may
be defined as $e_s = \sum_{i\in s}\beta_{si}y_i$. It may be worthwhile to mention that $\sum_{i\in s} y_i$ stands for
the summation over all different individuals in the sample $s$ (these may be, in
number, $\le n$). The estimator $e_s$ will be unbiased if and only if $\sum_{s\ni i}\beta_{si}p_s = 1$ for
$i = 1, 2, \ldots, N$, where $\sum_{s\ni i}$ denotes the sum over all samples which include the $i$-th unit.
Let us first explain the meaning of this sum with the help of a numerical example.

Example 5.15.1. Consider a population of $N = 5$ units with values 10, 20, 30, 40 and 50.
Select all possible samples of size $n = 3$ units by using SRSWOR sampling. Find
the following:
( i ) $\sum_{s\ni 2} y_s$;   ( ii ) find $\beta_{s2}$ such that $\sum_{s\ni 2}\beta_{s2}p_s = 1$.
Solution. Let A = 10, B = 20, C = 30, D = 40 and E = 50 be the units in the
population. Here $N = 5$ and $n = 3$. Therefore the total number of possible without
replacement samples is $n(s) = \binom{N}{n} = \binom{5}{3} = 10$, and the sample space $S$
is given below:

Table 5.15.1 Sample space.
Sample number   Units included in the sample   Values of the units   Totals
1               A, B, C                        10, 20, 30            60
2               A, B, D                        10, 20, 40            70
3               A, B, E                        10, 20, 50            80
4               A, C, D                        10, 30, 40            80
5               A, C, E                        10, 30, 50            90
6               A, D, E                        10, 40, 50            100
7               B, C, D                        20, 30, 40            90
8               B, C, E                        20, 30, 50            100
9               B, D, E                        20, 40, 50            110
10              C, D, E                        30, 40, 50            120

( i ) In the sum $\sum_{s\ni 2} y_s$, the value $y_s$ corresponds to the total of each sample
having the second unit, B, in it. Clearly the unit B is included in the 6
samples numbered 1, 2, 3, 7, 8 and 9 in the above table. Thus we have

\sum_{s\ni 2} y_s = 60 + 70 + 80 + 90 + 100 + 110 = 510.

( ii ) To find $\beta_{s2}$ such that $\sum_{s\ni 2}\beta_{s2}p_s = 1$: under SRSWOR,

p_s = \binom{N}{n}^{-1} = \binom{5}{3}^{-1} = \frac{1}{10} \quad\text{for all } s \in S.

Thus $\sum_{s\ni 2}\beta_{s2}p_s = 1$ implies $\sum_{s\ni 2}\beta_{s2}\,\frac{1}{10} = 1$, or $\sum_{s\ni 2}\beta_{s2} = 10$.

Keep in mind that the second unit belongs to the six samples shown above in Table
5.15.1. In other words, we are looking for six constants such that their sum is 10.
The following table shows that such a choice of constants is not unique unless we
give equal weight to all samples in the sample space.

Table 5.15.2 Choice of weights.
Sample no.    1          2          3          7          8          9          Sum
beta_s2       5/3        5/3        5/3        5/3        5/3        5/3        10
beta'_s2      2          2          3          2          0.5        0.5        10
beta''_s2     1.176471   1.372549   1.568627   1.764706   1.960784   2.156863   10

Thus all these choices of $\beta_{s2}$ provide unbiased estimators of the population total or
mean under SRSWOR sampling. We shall see in the next theorem that the
equal-weight row of Table 5.15.2 provides the correct choice, that is,
$\beta_{s2} = N/n = 5/3$.
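The enumeration in Example 5.15.1 can be reproduced directly; the sketch below rebuilds the sample space of Table 5.15.1 and checks both parts of the example.

```python
# Enumeration check of Example 5.15.1 (SRSWOR, N = 5, n = 3).
from itertools import combinations

units = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50}
samples = list(combinations(units, 3))     # the 10 equally likely samples
p_s = 1 / len(samples)                     # p(s) = 1/10 for every s

# (i) sum of sample totals over the samples containing the second unit, B
total_B = sum(sum(units[u] for u in s) for s in samples if 'B' in s)

# (ii) the equal-weight choice beta_{s2} = N/n = 5/3 satisfies the
# unbiasedness condition: sum over samples containing B of beta * p(s) = 1
check = sum((5 / 3) * p_s for s in samples if 'B' in s)
```

Running this confirms total_B = 510 and check = 1 (up to floating point), in agreement with the text.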

Now let $B_p$ denote the class of all $\beta$ for which the corresponding estimators are
unbiased. One would then usually define an estimator $e_{sb}$ as the best linear
estimator of the population total if its variance is minimum for all $y$, which is not
possible. We have the following theorem:

Theorem 5.15.2. There does not exist any $\beta_0 \in B_p$ such that the variance of
the estimator $e(s, \beta_0)$ is always less than or equal to that of all other estimators $e(s, \beta)$.

Proof. If $e_s$ is unbiased then

V(e_s) = \sum_{s\in S} e_s^2 p_s - Y^2.
The Lagrange function is given by

L = V(e_s) - \sum_{i}\lambda_i\left(\sum_{s\ni i}\beta_{si}p_s - 1\right).

Suppose $\beta_0$ exists; then for a given $y$ and $i \in s$ we have

\left\{\frac{\partial V(e_s)}{\partial \beta_{si}} - \lambda_i \frac{\partial}{\partial \beta_{si}}\sum_{s\ni i}\beta_{si}p_s\right\}_{\beta_0} = 0,

where $\lambda_i$ is a Lagrange multiplier. Upon differentiating, we obtain $2y_i\{e_s\}_{\beta_0} = \lambda_i$
for all $s \ni i$ with $p_s \ne 0$. Moreover, this must hold for all $y$, since $\beta_0$ is
supposed to give the minimum for all values of $y$. That is, for any $s_1$ and $s_2$
containing the $i$-th unit, $p_{s_1}$ and $p_{s_2}$ being positive, we have $\{e_{s_1}\}_{\beta_0} = \{e_{s_2}\}_{\beta_0}$ for all
values of $y$. For example, if we take $y_i = 1$ and $y_j = 0$ for all $j \ne i$, then the weight
$\beta_{si}$ must be the same constant $\beta_0$ over all such samples, which implies that

\beta_0 = \frac{1}{\sum_{s\ni i} p_s} = \frac{1}{\pi_i} \quad\text{for all } s \text{ and } i,

$\pi_i = \sum_{s\ni i} p_s$ denoting the first order inclusion probability. Hence if $\beta_0$ exists it must
satisfy the above condition. (Under SRSWOR sampling, $\beta_0 = N/n$.)

Conversely, we have to choose $\beta_0$ such that the variance of the estimator $e_s$ is
minimum. The choice $\beta_0 = 1/\sum_{s\ni i}p_s = d_i$ satisfies unbiasedness but does not provide
minimum variance for every $y$. Hence an optimum $\beta_0$ does not exist in the class of
linear unbiased estimators.
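The conclusion of Theorem 5.15.2 can also be seen numerically: the weights $\beta_0 = N/n$ are unbiased, but so are other weight systems, and neither dominates the other in variance for every $y$. The tiny design below (SRSWOR with N = 3, n = 2) and all numbers in it are illustrative assumptions.

```python
# No uniformly best choice of beta: two unbiased weight systems, each
# better than the other at some y (SRSWOR, N = 3, n = 2).
from itertools import combinations

N, n = 3, 2
samples = list(combinations(range(N), n))   # (0,1), (0,2), (1,2)
p_s = 1 / len(samples)

def variance(weights, y):
    """Design variance of e_s = sum_{i in s} beta_{si} y_i."""
    Y = sum(y)
    return sum(p_s * (sum(weights[(s, i)] * y[i] for i in s) - Y) ** 2
               for s in samples)

ht = {(s, i): N / n for s in samples for i in s}        # beta_0 = 3/2
alt = dict(ht)                                          # perturb unit 0's weights:
alt[(samples[0], 0)], alt[(samples[1], 0)] = 1.0, 2.0   # they still sum to 3 over
# the samples containing unit 0, so alt remains unbiased.
```

For y = (2, 4, 0) the perturbed weights have smaller variance than the Horvitz-Thompson weights, while for y = (2, 1, 1) the ordering reverses.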

Thus Godambe (1955) proved the non-existence of a uniformly best estimator
in the class of homogeneous linear unbiased estimators of the population total. The
linearity restriction was removed by Godambe and Joshi (1965), who proved
the non-existence of a best estimator in the entire class of unbiased estimators.
Murthy and Singh (1969) extended this non-existence result to wide classes
of estimators of the population total, and we have the following theorems:

Theorem 5.15.3. For a given design $P$, if an estimator $e_1$ is admissible in the class
$C(P)$ and there is another estimator $e_2$ in the class $C(P)$ such that $e_1$ is not at least
as good as $e_2$, then $e_1$ is not a best estimator in that class. This implies the non-existence
of a best estimator in the wider class of estimators $C(P)$.

Proof. Since the estimator $e_1$ is admissible, for any other estimator
$e \in C(P)$ either $MSE(e_1) = MSE(e)$ for all $y$ or $MSE(e_1) < MSE(e)$ for at least one $y$.
Further, the estimator $e_1$ is not at least as good as $e_2$; therefore
$MSE(e_1) > MSE(e_2)$ for at least one $y$. Hence the theorem, by contradiction.

Theorem 5.15.4. If there exist two (or more) admissible estimators in a class $C(P)$,
with unequal MSEs for at least one $y$ for a given design $P$, then that class does
not contain a best estimator.

Proof. Let $e_1$ and $e_2$ be two admissible estimators in $C(P)$. For one of them, say
$e_1$, to be at least as good as the other, we would need either $MSE(e_1) = MSE(e_2)$ for
all $y$ or $MSE(e_1) \le MSE(e_2)$ for all $y$ with strict inequality for at least one $y$. The first
relation fails because the MSEs are unequal for at least one $y$, and the second fails
because $e_2$ is admissible, so $MSE(e_2) < MSE(e_1)$ for at least one $y$. By symmetry the
same holds with the roles of $e_1$ and $e_2$ interchanged. That is, neither $e_1$ nor $e_2$ is a
best estimator. Hence the theorem.

5.15.5 LINEAR ESTIMATORS OF POPULATION TOTAL

Let $y = (y_1, y_2, \ldots, y_N)$ be a vector of variate values associated with the different units in
a population of size $N$, and an element of the Euclidean space $R_N$. Then a linear
estimator $e_b$ is a function on $S \times R_N$ such that

e_b(s, y) = \sum_{i\in\Omega} b(s, i)\, y_i, \quad\text{where } b(s, i) = 0 \text{ if } i \notin s.

It is noted here that the estimator is unbiased for the population total $Y$ if

\sum_{s\in S} e_b(s, y)\, p(s) = Y

for all $y \in R_N$.

Theorem 5.15.5. For any given design $P$, any constant is strictly admissible in the
class $L$ of linear estimators of the population total, except in trivial cases.

Proof. For a design $P$ consider the estimator $e_0 = \delta$ for all $y$, where $\delta$ is a
constant not equal to zero. Let $e_1 \in L$ be any other estimator, say

e_1 = a_s + \sum_{i\in s} b_{si}\, y_i.

Suppose $Y_{(\delta)}$ is the set of vectors in $R_N$ for which the population total $Y = \delta$.
Clearly the mean squared error of the constant estimator $e_0$ is given by

MSE(e_0) = \begin{cases} (\delta - Y)^2 & \text{if } y \notin Y_{(\delta)}, \\ 0 & \text{if } y \in Y_{(\delta)}. \end{cases}

Keep in mind that $MSE(e_1) \ge 0$ for all $y \in R_N$. In other words, the estimator $e_0$ is
strictly admissible unless $MSE(e_1) = 0$ for all $y \in Y_{(\delta)}$, which would show that both
estimators $e_0$ and $e_1$ are identical for all $y \in Y_{(\delta)}$. One can easily see that, in that case,
$e_0$ and $e_1$ are in fact identical for all $y \in R_N$. Hence the theorem.

Now we have the following theorem, which states a very useful result:

Theorem 5.15.6. There does not exist a best estimator, and hence no uniformly
best estimator, in the classes of linear ($L$), linear unbiased ($L_u$),
homogeneous linear ($L_h$), homogeneous linear unbiased ($L_{hu}$), all (both linear and
non-linear) unbiased ($A_u$), and all ($A$) estimators of the population total for any
sampling design $P$.

Proof. In the previous theorem we have seen that any constant belongs to the class
of linear estimators ($L$) and is strictly admissible in $L$; hence there does not
exist a best estimator in this class. The proof also follows from the fact that if there
are at least two strictly admissible estimators in $C(P)$, then a best estimator does not
exist in that class of estimators.

Godambe and Joshi (1965) have shown that the Horvitz and Thompson (1952)
estimator of the population total, defined as

e_{ht} = \sum_{i\in s} d_i\, y_i,

where $d_i = \pi_i^{-1}$ and $\pi_i = \sum_{s\ni i} p_s$, is admissible in the class of all unbiased estimators $A_u$, and hence
in the class of homogeneous linear unbiased estimators ($L_{hu}$). Following
Murthy (1967) and Rao and Bayless (1969), if the design is not unicluster, then it is
possible to find another estimator $e_{ht}^*$ in the class of homogeneous linear
unbiased estimators ($L_{hu}$) such that $e_{ht}$ is not at least as good as $e_{ht}^*$ for a given design $P$.
We know that for a given design $P$, if an estimator $e_{ht}$ is admissible in the class
$C(P)$ and there is another estimator $e_{ht}^*$ in the class $C(P)$ such that $e_{ht}$ is not at
least as good as $e_{ht}^*$, then $e_{ht}$ is not a best estimator in that class. This implies the non-existence
of a best estimator in the wider class of estimators $C(P)$, which proves
the non-existence of a best estimator in ($L_{hu}$). The results are obvious for the linear
unbiased ($L_u$) and all other unbiased ($A_u$) classes of estimators. Now consider a class $A$
consisting of all estimators of the population total. Then obviously two admissible
estimators with unequal mean squared errors for at least some $y$ are the usual and
ratio estimators, given by

e_u = \frac{N}{n}\sum_{i\in s} y_i \quad\text{and}\quad e_r = \frac{\sum_{i\in s} y_i}{\sum_{i\in s} x_i}\, X,

where $X = \sum_{i\in\Omega} x_i$. We know that if there exist two (or more) admissible estimators
in a class $C(P)$, with unequal MSEs for at least one $y$, for a given design $P$, then
that class does not contain a best estimator. This shows the non-existence of a best
estimator in the class of all estimators ($A$). Hence the theorem.

Let $\Omega = \{U_1, U_2, \ldots, U_N\}$ denote a finite population. With each unit $U_i$, $i = 1, 2, \ldots, N$, is
associated a variate value $y_i$. Let $y = (y_1, y_2, \ldots, y_N)$ denote a point in the $N$-dimensional
Euclidean space $R_N$. Let $s$ denote any non-empty subset of $\Omega$ and $S$
denote the set of all possible samples $s$. Then a sampling design is defined by
attaching a selection probability $p(s)$ to each $s \in S$ so that $p(s) \ge 0$, $\sum_{s\in S} p(s) = 1$. The
size $n(s)$ of a sample $s$ denotes the number of units $U_i$ included in $s$. Let
$\pi_i = \sum_{s\ni i} p(s)$ and $\pi_{ij} = \sum_{s\ni i,j} p(s)$ denote the positive first and second order inclusion
probabilities.

5.15.6.1 CONDITION FOR THE UNBIASED ESTIMATOR OF VARIANCE

An estimator $e(s, y)$ is said to be an unbiased estimator of a function $V(y)$ defined
on $R_N$ if and only if $\sum_{s\in S} p(s)\, e(s, y) = V(y)$ for all $y \in R_N$.

5.15.6.2 ADMISSIBLE AND UNBIASED ESTIMATOR OF VARIANCE

An estimator $e(s, y)$ is said to be admissible in the class of all unbiased estimators of
a function $V(y)$ if there exists no other unbiased estimator $e_1(s, y)$ of $V(y)$ such that

\sum_{s\in S} p(s)\left[e_1(s, y) - V(y)\right]^2 \le \sum_{s\in S} p(s)\left[e(s, y) - V(y)\right]^2

for all $y \in R_N$, with the strict inequality holding for at least one $y \in R_N$.

5.15.6.3 FIXED SIZE SAMPLING DESIGN

A sampling design is said to be of fixed size $\nu$ if $p(s) = 0$ whenever $n(s) \ne \nu$.

The Horvitz and Thompson (1952) estimator of the population total $Y$, defined as

e_{ht}(s, y) = \sum_{i\in s} d_i\, y_i,

is unbiased if and only if the first order inclusion probabilities are known and
positive. The first form of the variance of the Horvitz and Thompson estimator,
valid under any kind of sampling design, is given by

V(y) = \sum_{i\in\Omega}(d_i - 1)y_i^2 + \sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}(d_i d_j \pi_{ij} - 1)y_i y_j,

for which one can consider a class of unbiased estimators of variance.
On the other hand, for a fixed sample size design the Sen--Yates--Grundy form of the
variance is given by

V_{syg}(y) = \frac{1}{2}\sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega}(\pi_i\pi_j - \pi_{ij})(d_i y_i - d_j y_j)^2.

Assuming that the second order inclusion probabilities are known and positive, an
unbiased estimator of $V(y) = V_{syg}(y)$ based on a fixed sample size design is given by

v_{syg}\{e_{ht}(s, y)\} = \frac{1}{2}\sum_{i\in s}\sum_{j(\ne i)\in s} d_{ij}(\pi_i\pi_j - \pi_{ij})(d_i y_i - d_j y_j)^2.
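Before turning to admissibility, it is easy to verify the unbiasedness of $v_{syg}$ by direct enumeration on a toy fixed-size-two design; the design and data below are illustrative assumptions.

```python
# E[v_syg] equals the Sen--Yates--Grundy variance form on a fixed-size-two
# design (here: SRSWOR of n = 2 from N = 3, every sample with p = 1/3).
from itertools import combinations

N = 3
y = [1.0, 4.0, 7.0]
samples = list(combinations(range(N), 2))
p = {s: 1 / 3 for s in samples}
pi = {i: sum(ps for s, ps in p.items() if i in s) for i in range(N)}  # each 2/3
pi2 = {s: p[s] for s in samples}   # each unordered pair occurs in one sample

def syg_term(i, j):
    key = tuple(sorted((i, j)))
    return (pi[i] * pi[j] - pi2[key]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2

# population value V_syg(y), summing over ordered pairs i != j
V = 0.5 * sum(syg_term(i, j) for i in range(N) for j in range(N) if i != j)

def v_syg(s):
    """Sample estimate: each size-two sample contributes its single pair,
    weighted by d_ij = 1/pi_ij."""
    i, j = s
    return syg_term(i, j) / pi2[s]

EV = sum(p[s] * v_syg(s) for s in samples)   # design expectation of v_syg
```

Here EV reproduces V exactly, as unbiasedness requires.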
Then following Joshi (1970) we have the following theorem.

Theorem 5.15.6.1. For a fixed size sampling design of size two, the Sen--Yates--Grundy
estimator of variance is admissible in the entire class of unbiased estimators of
variance.

Proof. Suppose the theorem is not true. Then there exists an unbiased
estimator $v_u\{e_{ht}(s, y)\}$ such that

\sum_{s\in S} p(s)\left[v_u\{e_{ht}(s, y)\} - V(y)\right]^2 \le \sum_{s\in S} p(s)\left[v_{syg}\{e_{ht}(s, y)\} - V(y)\right]^2

for all $y \in R_N$, with strict inequality for at least one $y \in R_N$. Suppose the new
unbiased estimator is related to the Sen--Yates--Grundy estimator of
variance by

v_u\{e_{ht}(s, y)\} = v_{syg}\{e_{ht}(s, y)\} + H(s, y),

where $H(s, y)$ is any function of the sample values. Then we have

\sum_{s\in S} p(s)\left[v_{syg}\{e_{ht}(s, y)\} - V(y) + H(s, y)\right]^2 \le \sum_{s\in S} p(s)\left[v_{syg}\{e_{ht}(s, y)\} - V(y)\right]^2,

or

\sum_{s\in S} p(s)\left[H(s, y)\right]^2 \le -2\sum_{s\in S} p(s)H(s, y)\left[v_{syg}\{e_{ht}(s, y)\} - V(y)\right].

The estimators $v_u\{e_{ht}(s, y)\}$ and $v_{syg}\{e_{ht}(s, y)\}$ are both unbiased; therefore

\sum_{s\in S} p(s)H(s, y) = 0,

and the above inequality becomes

\sum_{s\in S} p(s)\left[H(s, y)\right]^2 \le -2\sum_{s\in S} p(s)H(s, y)\,v_{syg}\{e_{ht}(s, y)\}.

Considering $y_i \propto \pi_i$, the Sen--Yates--Grundy estimator vanishes for every sample,
so the right hand side is zero; one can then easily see that the above inequality cannot
hold with the required strictness, a contradiction to our assumption. Hence the theorem.

We have already discussed the Murthy (1957, 1963) estimators in detail based on a sample
of two units. Joshi (1970) used the above mechanism to prove the admissibility of
Murthy's estimators. Patel and Dharmadhikari (1978) have shown that Joshi's
method is not applicable; they discussed the admissibility (within the class of linear
unbiased estimators) of Murthy's estimators with an effective sample of two units.

Now a linear estimator of the population total, defined as

e_1(s) = \sum_{i\in s} b(s, i)\, y_i,

is linearly unbiased if $\sum_{s\ni i} b(s, i)\, p(s) = 1$, $i = 1, 2, \ldots, N$.

Suppose $V_i$ denotes the variance of the estimator $e_1$ when $y_i = 1$ and $y_j = 0$ for
$j \ne i$. Assuming the estimator $e_1$ is unbiased, we have

V_i = \sum_{s\ni i} b^2(s, i)\, p(s) - 1.

Let $k_1, k_2, \ldots, k_N$ be non-zero constants such that $\sum_{i\in\Omega} k_i = 1$, and let $q_1, q_2, \ldots, q_N$ be
positive numbers such that the weighted variance $\sum_{i\in\Omega} q_i V_i$ is minimized subject to
the conditions of unbiasedness and of zero variance at the given point $(k_1, k_2, \ldots, k_N)$,
defined as

\sum_{i\in s} b(s, i)\, k_i = 1, \quad s \in S.

Consider a Lagrange function defined as

L = \sum_{i\in\Omega} q_i V_i - 2\sum_{i\in\Omega}\lambda_i \sum_{s\ni i} b(s, i)\, p(s) - 2\sum_{s\in S}\mu_s \sum_{i\in s} b(s, i)\, k_i

  = \sum_{i\in\Omega} q_i\left\{\sum_{s\ni i} b^2(s, i)\, p(s) - 1\right\} - 2\sum_{i\in\Omega}\lambda_i \sum_{s\ni i} b(s, i)\, p(s) - 2\sum_{s\in S}\mu_s \sum_{i\in s} b(s, i)\, k_i,
where $\lambda_i$ and $\mu_s$ are the Lagrange multipliers. Upon differentiating $L$ with
respect to $b(s, i)$ and equating to zero, we have

\frac{\partial L}{\partial b(s, i)} = 2q_i b(s, i)\, p(s) - 2\lambda_i p(s) - 2\mu_s k_i = 0,

which implies that

q_i b(s, i)\, p(s) = \lambda_i p(s) + \mu_s k_i, \quad i = 1, 2, \ldots, N,

or

b(s, i) = \frac{\lambda_i}{q_i} + \frac{\mu_s}{p(s)}\frac{k_i}{q_i} = \xi_i + \alpha_s \eta_i,

where $\xi_i = \lambda_i/q_i$, $\alpha_s = \mu_s/p(s)$ and $\eta_i = k_i/q_i$. We know that

\sum_{i\in s} b(s, i)\, k_i = 1,

which implies that

\sum_{i\in s}(\xi_i + \alpha_s\eta_i)\, k_i = 1,

or

\sum_{i\in s}\xi_i k_i + \alpha_s \sum_{i\in s}\eta_i k_i = 1,

which implies that

\alpha_s = \frac{1}{\delta(s)}\left(1 - \sum_{i\in s}\xi_i k_i\right),

where $\delta(s) = \sum_{i\in s}\eta_i k_i$. Also we have

\sum_{s\ni i} b(s, i)\, p(s) = 1,

which implies that

\sum_{s\ni i}(\xi_i + \alpha_s\eta_i)\, p(s) = 1,

or

\sum_{s\ni i}\xi_i p(s) + \sum_{s\ni i}\eta_i \alpha_s p(s) = 1,

or

\xi_i \pi_i + \eta_i \sum_{s\ni i}\frac{p(s)}{\delta(s)}\left(1 - \sum_{j\in s}\xi_j k_j\right) = 1,

or

\xi_i \pi_i - \eta_i \sum_{s\ni i}\frac{p(s)}{\delta(s)}\sum_{j\in s}\xi_j k_j = 1 - \eta_i \sum_{s\ni i}\frac{p(s)}{\delta(s)} \quad\text{for } i = 1, 2, \ldots, N.

Remember that $k_i$ and $q_i$ are known, and hence $\eta_i$ and $\delta(s)$ are known. For a
given design, $p(s)$ and $\pi_i$ are also known. Thus the only unknown quantities are the
values of $\xi_i$, which are, in fact, functions of the Lagrange multipliers. The above
relations form a system of $N$ linear equations in the $N$ unknowns $(\xi_1, \xi_2, \ldots, \xi_N)$,
with right hand sides

1 - \eta_i \sum_{s\ni i}\frac{p(s)}{\delta(s)}, \quad i = 1, 2, \ldots, N.

Clearly a solution $(\xi_1, \xi_2, \ldots, \xi_N)$ to this system of equations can be obtained
to find the weights $b(s, i)$ subject to the conditions of unbiasedness and admissibility.
In other words, to show the admissibility of an estimator of the form
$e_1(s) = \sum_{i\in s} b(s, i)\, y_i$, we have to show that the quantity $b(s, i)$ given by the above
system of equations is independent of the choice of a particular solution.

Consider a situation where the sample consists of only two units, that is, $s = \{i, j\}$.
Now if we take $q_i = k_i^2/(1 - P_i)$, so that $\eta_i = (1 - P_i)/k_i$ and $\eta_i k_i = (1 - P_i)$, then
$\delta(s) = 2 - P_i - P_j$ and the weights $b(s, i)$ become

b(s, i) = \frac{1 - P_j}{k_i(2 - P_i - P_j)} + \frac{k_i P_j - P_i k_j}{k_i\, p(s)}.

Now if we take $k_i = P_i$, then $b(s, i) = (1 - P_j)/\{P_i(2 - P_i - P_j)\}$, $i = 1, 2$, and the
estimator of the population total, defined as

e_1(s) = \sum_{i\in s} b(s, i)\, y_i = \frac{1}{2 - P_i - P_j}\left[\frac{(1 - P_j)y_i}{P_i} + \frac{(1 - P_i)y_j}{P_j}\right],

is the same estimator proposed by Murthy (1957, 1963).

Again, if we take $k_i = \pi_i - P_i$, so that $\sum_{i\in\Omega} k_i = 1$, we can consider the trivial solution
$(\xi_1, \xi_2, \ldots, \xi_N) = (0, 0, \ldots, 0)$. Then the weights $b(s, i)$ can be written as

b(s, i) = (1 - P_i)/\{(2 - P_i - P_j)(\pi_i - P_i)\},

and the estimator of the population total, defined as

e_1(s) = \sum_{i\in s} b(s, i)\, y_i = \frac{1}{2 - P_i - P_j}\left[\frac{(1 - P_i)y_i}{\pi_i - P_i} + \frac{(1 - P_j)y_j}{\pi_j - P_j}\right],

is again the same estimator discussed by Murthy (1957, 1963).
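Murthy's two-unit estimator above is simple enough to state as a one-line function; the sketch and its inputs are illustrative.

```python
# Murthy's (1957, 1963) estimator of the total from a sample of two
# units drawn with unequal selection probabilities P_i, P_j.

def murthy_two_units(y_i, y_j, P_i, P_j):
    """e = [ (1-P_j) y_i / P_i + (1-P_i) y_j / P_j ] / (2 - P_i - P_j)."""
    return ((1 - P_j) * y_i / P_i + (1 - P_i) * y_j / P_j) / (2 - P_i - P_j)
```

Note the expression is symmetric in the two units, so the order in which they were selected does not matter.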

Thus we have the following theorem .

Theorem 5.15.6.2. The estimators of the population total proposed by Murthy (1957,
1963) are admissible.

5.15.7 POLYNOMIAL TYPE ESTIMATORS

Hanurav (1966) starts with the basic concepts in sampling theory for finite
populations and considers the fundamental problem of optimum estimation
procedures to estimate the population total. For unicluster designs, Hanurav (1966)
has shown that any estimator $e_0$ in $M^*(P)$, where $M^*(P)$ is the class of all
polynomial unbiased estimators of the total, is admissible in $M^*(P)$ if and only if it is
of the form

e_0 = \hat{Y}_{HT}(P) + g_s,

where $\hat{Y}_{HT}(P)$ denotes the Horvitz and Thompson (1952) type estimator under the
design $P$ and the values $g_s$ are constants independent of the study variable satisfying
the condition $\sum_{s\in S} g_s p_s = 0$. A given parametric function $g(y)$ is
said to be estimable in a given design $P$ if and only if there exists a statistic
$e_s$ such that

E(e) = \sum_{s\in S} e_s p_s = g(y).

Theorem 5.15.7.1. A set of necessary and sufficient conditions for the estimability
of the quadratic parametric function

Q = l_0 + \sum_{i\in\Omega} l_i y_i + \sum_{i\in\Omega} q_{ii} y_i^2 + \sum_{i\in\Omega}\sum_{j(\ne i)\in\Omega} q_{ij} y_i y_j

in a design $P$ is given by

( i ) $\pi_i > 0$ if $l_i^2 + q_{ii}^2 > 0$;  ( ii ) $\pi_{ij} > 0$ if $q_{ij} + q_{ji} \ne 0$.

Theorem 5.15.7.2. Given a design P and an estimable parametric function g(y),
suppose s_1 and s_2 are effectively equivalent. If e is any unbiased estimator of g(y)
with respect to the design P such that, for two such samples s_1 and s_2 with
P_{s1} > 0 and P_{s2} > 0, the values of e differ, then the estimator e* = E(e | (s̄, ȳ))
is also unbiased for g(y) and V(e*) ≤ V(e) for all y ∈ R^N, with strict inequality
occurring at least once. The above theorem basically restates the well known
Basu (1958) result.

5.15.8 ALTERNATIVE OPTIMALITY CRITERION

While looking for an optimal estimator in survey sampling, we have to keep the
following criteria in mind:
( i ) Bayesian approach; ( ii ) Linear invariance;
( iii ) Regular estimators; ( iv ) Hyper-admissibility.
( i ) Bayesian approach: This approach makes use of prior information about the
known distribution of the study variable. In this case we generally prefer to
minimize a loss function in place of the variance or mean squared error.
( ii ) Linear invariance: Roy and Chakravorty (1960) introduced the concept that
an estimator should remain invariant under a linear transformation of the study
variable y.
( iii ) Regular estimator: An estimator e is said to be a regular estimator if
V(e) = kσ², where k is a constant and σ² is the finite population variance.
( iv ) Hyper-admissibility: An estimator e is said to be hyper-admissible if it is
admissible not only in the whole of R^N but also in each of its principal hyperplanes.

For example, for any design P which is not a unicluster design, the class M*(P) of
polynomial unbiased estimators of the population total Y admits just one
estimator which is hyper-admissible. This optimum estimator is the Horvitz
and Thompson (1952) estimator. A class of hyper-admissible estimators for
unicluster designs is given by e_0 = g_s + Ŷ_HT(P) as defined earlier.

5.15.9 SUFFICIENT STATISTIC IN SURVEY SAMPLING

An estimator is said to be sufficient for a parameter if it contains all the information in
the sample regarding the parameter. More precisely, if e = e_s(y_1, y_2, ..., y_n) is an
estimator of a parameter θ based on sample information y_1, y_2, ..., y_n of size n
from a population with density f(y, θ) such that the conditional distribution of
y_1, y_2, ..., y_n given e is independent of the parameter θ, then e is a sufficient
estimator of θ. For example, let y_1, y_2, ..., y_n be a random sample from a Bernoulli
population with parameter p such that

y_i = 1 with probability p, and y_i = 0 with probability (1 − p) = q,

where 0 < p < 1.

Then the statistic e = e_s(y_1, y_2, ..., y_n) = Σ_{i=1}^{n} y_i follows a binomial distribution with
parameters n and p, and

P(e = x) = ⁿCₓ pˣ q^(n−x), x = 0, 1, 2, ..., n.


The conditional distribution of y_1, y_2, ..., y_n given e is

P(y_1 ∩ y_2 ∩ ... ∩ y_n | e = x) = P[(y_1 ∩ y_2 ∩ ... ∩ y_n) ∩ (e = x)] / P(e = x)
= pˣ q^(n−x) / ( ⁿCₓ pˣ q^(n−x) ) = 1/ⁿCₓ,

which is free from p, and indicates that e = e_s(y_1, y_2, ..., y_n) = Σ_{i=1}^{n} y_i is a sufficient
statistic for p.
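The sufficiency argument above can be checked numerically: enumerating all binary sequences shows that the conditional probability of each sequence given e = x equals 1/ⁿCₓ whatever the value of p. The sample size and p values below are illustrative.

```python
# Numerical check that the conditional distribution of the sample given
# e = sum(y_i) is free of p: for every binary sequence with sum x, the
# conditional probability equals 1 / C(n, x) for any p in (0, 1).
from itertools import product
from math import comb

def conditional_prob(seq, p):
    n, x = len(seq), sum(seq)
    joint = p ** x * (1 - p) ** (n - x)                   # P(y_1, ..., y_n)
    marginal = comb(n, x) * p ** x * (1 - p) ** (n - x)   # P(e = x)
    return joint / marginal

n = 4
for p in (0.2, 0.5, 0.9):
    for seq in product((0, 1), repeat=n):
        x = sum(seq)
        assert abs(conditional_prob(seq, p) - 1 / comb(n, x)) < 1e-12
print("conditional distribution is free of p")
```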

There is another way to find a sufficient statistic, called the Factorization Theorem.
The statistic e = e_s(y_1, y_2, ..., y_n) is sufficient for θ if and only if the likelihood
function L of the sample values can be expressed in the form L = g_θ(e(y)) h(y),
where g_θ(e(y)) depends on θ and y only through the value of e(y) and h(y) is
independent of θ. For example, suppose y_1, y_2, ..., y_n is a random sample from a
population with normal density N(μ, σ²). Then the statistic e_s = Σ_{i=1}^{n} y_i is sufficient
for μ and the statistic e_s = Σ_{i=1}^{n} y_i² is sufficient for σ², because the likelihood
function can be written as

L = Π_{i=1}^{n} f(y_i, θ) = ( 1/(σ√(2π)) )ⁿ exp[ −(1/(2σ²)) ( Σ_{i=1}^{n} y_i² − 2μ Σ_{i=1}^{n} y_i + nμ² ) ] = g_θ(e(y)) h(y),

where

e(y) = ( Σ_{i=1}^{n} y_i, Σ_{i=1}^{n} y_i² ) and h(y) = 1.
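The factorization above can also be verified numerically: computing the likelihood term by term and computing it through g_θ(e(y)) with h(y) = 1 gives the same value. The data and parameter values below are hypothetical.

```python
# Numerical check of the factorization L = g_theta(e(y)) h(y) for the
# normal likelihood: the likelihood depends on the data only through
# e(y) = (sum y_i, sum y_i^2).  Data and parameters are illustrative.
import math

def likelihood(y, mu, sigma):
    return math.prod(
        math.exp(-(yi - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for yi in y
    )

def g_factor(s1, s2, n, mu, sigma):
    # depends on the data only through e(y) = (s1, s2)
    return (sigma * math.sqrt(2 * math.pi)) ** (-n) * math.exp(
        -(s2 - 2 * mu * s1 + n * mu ** 2) / (2 * sigma ** 2)
    )

y = [1.2, -0.4, 0.9, 2.1]
s1, s2 = sum(y), sum(yi ** 2 for yi in y)
L = likelihood(y, mu=0.5, sigma=1.3)
G = g_factor(s1, s2, len(y), mu=0.5, sigma=1.3)   # h(y) = 1
assert abs(L - G) < 1e-12 * max(abs(L), abs(G))
print("factorization holds")
```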

It is easy to recognize the concept of sufficiency in general from the above
discussion. Yamada and Morimoto (1992) have been able to compare the following
four definitions of sufficiency in a very elegant way.

( i ) The conditional distribution of a sample, when given the estimator e, does not
depend on the parameter.
( ii ) The distribution of a sample can be reconstructed from that of the estimator e
through randomisation, or mathematically using a stochastic kernel.
( iii ) For every decision problem, given a decision function based on the sample,
there exists a decision function based on e which is at least as good as the former.
( iv ) For any prior distribution of the parameter, the posterior is a function of the
sample through e.

The first definition is due to Fisher (1920, 1922), the second and third are due to
Blackwell (1951), and the fourth is due to Kolmogorov (1942). Based on the first
definition of sufficiency, Halmos and Perlman (1974), and Bahadur (1954)
introduced its new definition as follows.

"Let E = (X, A, p) be a statistical experiment and B be a sub-field (a sub-σ-field
of A). B is called sufficient if for every A ∈ A there exists a B-measurable
function p̂(A | B)(y) which satisfies, for all B in B,

p(A ∩ B) = ∫_B p̂(A | B)(y) dp."

Further, Yamada and Morimoto (1992) have developed relationships between the above
definitions of sufficiency in a more descriptive way. The readers may follow the
lecture notes of Ghosh and Pathak (1992) to gain a better understanding of
sufficiency in survey sampling.

5.16 ESTIMATORS BASED ON CONDITIONAL INCLUSION PROBABILITIES

Following Tillé (1998), let η = η(x_i, i ∈ s) be a statistic based on the auxiliary
information. Since the population is finite, the statistic η takes a finite number of
possible values, denoted {η_1, η_2, ..., η_L}.
Define an indicator variable

I_i = 1 if i ∈ s, and I_i = 0 otherwise. (5.16.1)

Then the first order conditional inclusion probabilities are given by

π_{i|η} = E(I_i | η) for all i ∈ Ω, (5.16.2)

and the second order conditional inclusion probabilities are given by

π_{ij|η} = E(I_i I_j | η) for all i ≠ j ∈ Ω. (5.16.3)

Then we have the following theorem.

Theorem 5.16.1. Show that the simple conditionally weighted (SCW) estimator of the
population mean Ȳ defined as

Ȳ_scw = (1/N) Σ_{i∈s} y_i / π_{i|η} (5.16.4)

is virtually conditionally unbiased.
Proof. An estimator θ̂ of θ is said to be virtually conditionally unbiased (VCU)
with respect to the auxiliary statistic η if

B(θ̂ | η) = Σ_{i∈Ω} θ_i a_i(η) I(π_{i|η} = 0) (5.16.5)

for all (θ_1, θ_2, ..., θ_N) ∈ R^N, where the coefficients a_i(η) depend on η and I(·) is
an indicator function such that

I(π_{i|η} = 0) = 1 if π_{i|η} = 0, and I(π_{i|η} = 0) = 0 if π_{i|η} > 0.

Now the bias in the estimator Ȳ_scw is given by

B[Ȳ_scw | η] = E(Ȳ_scw | η) − Ȳ = E[ (1/N) Σ_{i∈s} y_i/π_{i|η} | η ] − Ȳ
= E[ (1/N) Σ_{i∈Ω: π_{i|η}>0} y_i I_i / π_{i|η} | η ] − Ȳ

= (1/N) Σ_{i∈Ω: π_{i|η}>0} (y_i/π_{i|η}) E(I_i | η) − (1/N) Σ_{i∈Ω} y_i = −(1/N) Σ_{i∈Ω} y_i I(π_{i|η} = 0).

Hence the theorem.



Theorem 5.16.2. Show that the corrected conditionally weighted (CCW) estimator of the
population mean Ȳ, defined as

Ȳ_ccw = (1/N) Σ_{i∈s} y_i / (h_i π_{i|η}), (5.16.6)

where h_i = E[I(π_{i|η} > 0)] = Pr(π_{i|η} > 0), is an unbiased estimator of the population mean.
Proof. The bias in the estimator Ȳ_ccw is given by

B(Ȳ_ccw) = E[ (1/N) Σ_{i∈s} y_i/(h_i π_{i|η}) ] − Ȳ = E[ (1/N) Σ_{i∈Ω} y_i I_i/(h_i π_{i|η}) ] − (1/N) Σ_{i∈Ω} y_i

= (1/N) Σ_{i∈Ω} (y_i/h_i) ( E[I(π_{i|η} > 0)] − h_i ) = 0.

Hence the theorem.

Note that Rao (1985) also discussed conditional inference in survey sampling.

Remarks 5.16.1. Note the following points carefully:

( a ) The CCW estimator is not VCU, but it is unconditionally unbiased;
( b ) The estimators CCW and SCW are not invariant.
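A small exact illustration of the conditional inclusion probabilities (5.16.2) and the SCW estimator (5.16.4): here the design is assumed to be SRSWOR, and the auxiliary statistic η is taken to be a hypothetical domain count, with all population values invented. Because every π_{i|η} > 0 in this toy setting, the conditional expectation of Ȳ_SCW given η reproduces Ȳ exactly.

```python
# Exact illustration of conditional inclusion probabilities on a toy
# population: SRSWOR of n = 3 from N = 6, with eta = number of sampled
# units belonging to a domain D.  All quantities are computed by
# enumerating the 20 equally likely samples; the data are hypothetical.
from itertools import combinations

N, n = 6, 3
D = {0, 1}                      # hypothetical domain defined by the x-values
y = [10, 14, 7, 9, 12, 8]
samples = list(combinations(range(N), n))   # each has probability 1/20

def pi_cond(i, t):
    """pi_{i|eta} = P(i in s | eta = t) under SRSWOR."""
    with_t = [smp for smp in samples if len(D & set(smp)) == t]
    return sum(i in smp for smp in with_t) / len(with_t)

# Conditional expectation of the SCW estimator given eta = t:
t = 1
scw_values = []
for smp in samples:
    if len(D & set(smp)) == t:
        scw_values.append(sum(y[i] / pi_cond(i, t) for i in smp) / N)
cond_mean = sum(scw_values) / len(scw_values)
print(cond_mean, sum(y) / N)
```

Since all π_{i|η=1} are strictly positive here, the indicator I(π_{i|η} = 0) vanishes and the SCW estimator has zero conditional bias, matching Theorem 5.16.1.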

5.17 CURRENT TOPICS IN SURVEY SAMPLING

Rao (1996) discussed some current topics in survey sampling at the Golden Jubilee
conference of the Indian Society of Agricultural Statistics , New Delhi. In particular,
inferential issues were studied and the advantage of conditional design based
approach were demonstrated. Practically useful estimators for dual frame surveys
were presented. The jackknife method was shown to provide a unified, but
computer intensive, approach to variance estimation and analysis of survey data.
Finally, small area estimation was considered and model based indirect estimators
that borrow information from related small areas were introduced. Moors, Smeets,
and Boekema (1998) have considered an interesting problem of sampling with
probabilities proportional to the variable of interest. Brewer (1994) has discussed
the past and present prospects in survey sampling inference. Again Rao (1999a,
1999b) has considered the problem of reviewing some current trends in sample
survey theory and methods. He provides a brief discussion on developments in
survey design and data collection and processing, issues related to inference from
survey data, re-sampling methods for analysis of survey data, and small area
estimation. Following him the principal steps in sample survey have been

( a ) Survey design,
( b ) Data collection and processing,
( c ) Estimation and analysis of data.
We would like to discuss these issues briefly here. They are covered in more depth
by Rao (1999a, 1999b).

5.17.1 SURVEY DESIGN

Researchers have paid much attention to sampling errors, and have developed
numerous methods for optimal allocation of resources to minimize the sampling
variance associated with the estimators of the total or mean. Much less attention has
been paid to reducing the total survey error arising from both sampling and
non-sampling errors. Many researchers, including Fellegi and Sunter (1974), Linacare
and Trewin (1993) , and Smith (1995), have emphasized the need for a total survey
design approach in which resources are allocated to those sources of error where
error reduction is most effective thus resulting in superior survey designs . Linacare
and Trewin (1993) applied this approach to the design of the Construction Industry
Surveys. Smith (1995) proposed the sum of component MSEs, rather than the MSE
of the estimated total, as a measure of the total error which may be written as the
sum of the errors from different sources. While estimating the population mean or
total, it is customary to study the effect of measurement errors. It has been found
that the usual estimators are design unbiased and consistent under the assumption of
zero mean measurement errors. Following Mahalanobis (1946), the traditional
variance estimators remain valid provided the sample is in the form of
interpenetrating sub-samples. It is interesting to note that this useful feature no
longer holds in the case of distribution functions, quantiles, and some other
complex parameters as shown by Fuller (1995) . The usual estimators are biased and
inconsistent and thus can lead to erroneous inferences . Fuller (1995) obtained bias
adjusted estimators under the assumption of independent and normally distributed
errors. Eltinge (1999) extended Fuller's results to the case of non-normal errors
using small standard deviation approximations. Singh , Gambino, and Mantel
(1994) illustrated the use of compromise allocation to redesign the Canadian
Labour Force Survey. As mentioned by Rao (1999a, 1999b), the interpenetrating
sub-samples provide a valid estimate of the total variance of an estimated total in
the presence of measurement errors, but such designs are not often used, at least in
North America, due to cost and operational considerations. Hartley and Rao (1968)
and Hartley and Biemer (1978) provided interview and coder assignment conditions
that permit the estimation of total variance and its components, such as sampling,
interviewer, and coder variances, directly from stratified multistage surveys that
satisfy the estimability conditions. Groves (1996) noted that the customary one way
random effect interviewer variance model may be unrealistic because it fails to
reflect 'fixed' effects of interviewer attributes such as race, age, and gender that can
affect the responses.

5.17.2 DATA COLLECTION AND PROCESSING

In comparison with the face to face interviewing technique of data collection,


telephone surveys have become popular in developed countries and are growing
very rapidly in developing countries too. Computer Assisted Telephone Interview
(CATI) has also been found helpful in reducing both measurement and data
processing errors. A combination of both face to face interviews and telephone
surveys has also been found useful in monthly surveys . There are several more

methods of collecting data like random digit dialing (RDD), which provides
coverage of both listed and unlisted telephone households. The two stage
Mitofsky--Waksberg technique and its refinements are designed to increase the proportion of
eligible numbers in the sample and thus reduce data collection cost by following
Casady and Lepkowski (1993). Following Groves and Lepkowski (1986) the dual
frame approaches are also useful in obtaining more efficient estimates by
combining a sample selected from an incomplete directory list frame with another
sample selected by random digit dialing . Many large scale surveys, especially
surveys on family expenditure and health, use long questionnaires for data
collection . Such surveys can lead to high rates of non-response and decreases in the
quality of response, but this problem may be reduced by splitting the long
questionnaire into two or more parts. For example, Wretman (1995) splits his long
questionnaire into five non-overlapping parts. For surveys dealing with sensitive
questions, the quality of responses and response rates might depend on the ordering
of the questions in the list. It is also important to keep in mind that data on sensitive
characters or variables can also be collected through a Randomized Response
Technique, which makes use of a device to collect indirect answers to the questions
in the surveys. Data collected from a survey or census by any one of the above
methods needs editing. The purpose of editing is to see which records are
unacceptable, outliers, or missing. Then imputation of the missing records and
proper treatment of the outliers or unacceptable values is most important, where care is
taken to ensure that the editing procedure changes as few values as possible. For
example, Fellegi and Holt (1976) developed a method for automatic editing of
survey data with the help of computers under certain assumptions.

5.17.3 ESTIMATION AND ANALYSIS OF DATA

Suppose we are interested in estimating the population total of a characteristic of


interest y. A sample s is selected according to a specified sampling design p(s)
from a population of size N and the sampled data {(i, y_i), i ∈ s} are collected,
assuming non-sampling errors are absent.

The basic idea of the inference is to obtain an estimator Ŷ, its standard error s(Ŷ)
or coefficient of variation,

c(Ŷ) = s(Ŷ)/Ŷ,

and the associated normal theory intervals

Ŷ ± z_{α/2} s(Ŷ)

for Y from the sampled data for large sample size n. As we noted in this chapter,
there are three approaches:
( a ) Design based approa ch;
( b ) Model based approach ;
and
( c ) Model assisted approach.

The traditional work in sampling, including the Horvitz and Thompson estimator,
comes under the design based approach. The work related to Royall (1970a, 1970b, 1970c)
and Brewer (1963a) comes under the model based approach. The contribution
related to Sarndal, Swensson, and Wretman (1991) comes under the concept of the
model assisted approach. Following Rao (1999a, 1999b), the methods of re-sampling and
small area estimation, which we shall discuss in subsequent chapters, also fall under
the current topics in survey sampling.

5.18 MISCELLANEOUS DISCUSSIONS/TOPICS

In this section we introduce some topics which exist in the literature and may be
useful for the research oriented readers.

Generalized πps designs were defined by Rao (1972). The term IPPS stands for
Inclusion Probabilities Proportional to Size. Working with a general
superpopulation model ξ(g), the strategy consisting of the Gπps design together with
the associated HT estimator of the population total has proved to be better than two
other well known strategies of Rao (1971, 1972). Following Rao (1972), if a design
is such that π_i ∝ x_i^{g/2} (i = 1, 2, ..., N) and Σ_{i∈s} x_i^{1−(g/2)} = k, a constant for any sample s
with P_s > 0, then that design is called a generalized πps design. For g = 2, the
Gπps design is a πps design with fixed sample size. Ramachandran (1982) has
shown the ξ(g) optimality of the strategy consisting of the Gπps design together with
the associated HT estimator in the entire class of design based unbiased strategies of
the population total with fixed expected sample size. Ramachandran generalized the
results of Rao (1971, 1972) by following Godambe and Joshi (1965). Pedgaonkar
and Prabhu-Ajgaonkar (1978) have shown that the Gπps strategy is better than
the RHC strategy with fixed sample size. McLeod and Bellhouse (1983) gave a
simple and useful algorithm for drawing a simple random sample without
replacement in a single pass through a sampling frame for a finite population whose
size is unknown. Richardson (1989) extended their method to probability
proportional to size sampling. Korwar (1996) provides a method, which is an
adaptation of the method of McLeod and Bellhouse for the simple random sample
without replacement, for drawing a sample with probability proportional to
aggregate size in a single pass through a sampling frame of a finite population
whose size is unknown. Some discussion on the question of the availability of a unique
best estimator in the Horvitz and Thompson (1952) class of estimators has been
given by Chaudhuri (1975a, 1975b).

5.18.2 TAM'S OPTIMAL STRATEGIES

Tam (1984) has provided necessary and sufficient conditions for an estimator
design pair to be optimal under a regression superpopulation model with correlated
residuals. Tam (1986) has provided necessary and sufficient conditions for the
optimality of an arbitrary linear predictor of the total of a finite population in survey
sampling under a general linear model with a symmetric and positive definite
covariance matrix. He extended the work of Pfeffermann (1984) and Tallis (1978)
by following Cassel, Sarndal, and Wretman (1977), Royall (1970a, 1970b, 1970c,
1976), Royall and Herson (1973a, 1973b), Royall and Pfeffermann (1982), Sarndal
(1980a, 1980b), Scott, Brewer and Ho (1978), and Zyskind (1967).

5.18.3 USE OF RANKS IN SAMPLE SELECTION

Wright (1990) has shown that there can be a gain in estimation strategies over equal
probability sampling methods when one makes use of auxiliary information for
probability proportional to size with replacement sampling methods. When a
suitable variable X is not available, one may know how to rank units reasonably
well relative to the unknown y values before sample selection. When such ranking
is possible, Wright (1990) has introduced a simple and efficient sampling plan using
the ranks as the unknown X measure of size. He showed that the resultant
sampling plan is similar to, has the simplicity of, and has no greater sampling
variance than with replacement sampling, but is without replacement. Kumar,
Srivenkataramana, and Srinath (1996) have also shown the use of ranks in unequal
probability sampling for sample selection and stratification including determining
the strata boundaries. They also suggested a few sampling schemes . For samples of
size two, two sampling schemes and their IPPS versions have been discussed along
with their extension to large sample situations. Non-negative unbiased estimators of
the variance have also been suggested.

5.18.4 PREDICTION APPROACH

Mukhopadhyay (1994) has considered the optimal prediction of finite population


total and variance under a location model with measurement errors . Bayes
predictors of population total and variance under a class of priors have been derived
and a minimax predictor for population total has been obtained. Under a regression
superpopulation model with measurement errors, an optimal predictor of total has
been derived. Srivastava and Ouyang (1992) have studied a general estimator in
sampling by utilising extraneous information through a sample weight function.
They discussed the use of extraneous information (including that collected during
the survey) to obtain guesses . Mukhopadhyay and Bhattacharyya (1990, 1991)
have provided optimal estimation of a finite population variance under some
exchangeable general linear models. They examined the robustness of the optimal
strategies under a class of alternative models and sampling designs ensuring near
unbiasedness. Samiuddin and Kattan (1991) have suggested a new unequal
probability sampling scheme. Samiuddin, Kattan, Hanif, and Asad (1992) have

contributed some useful remarks on the use of models and sampling schemes while
using unequal probability sampling strategies.

Reddy and Rao (1990) have considered the problem of estimation of population
total of bottom (top) P percentiles of a finite population using the Horvitz and
Thompson (1952) type estimator.

5.18.6 GENERAL FORM OF ESTIMATOR OF VARIANCE

Let Q denote a scalar quantity to be estimated. A point estimator of Q will be of the
form Q̂ = Q̂(X, y) with variance V(Q̂) = V(X, y), where X = (x_ij)_{n×p} is the matrix of
auxiliary information and y = (y_i)_{n×1} is the vector of the study variable. The estimator Q̂
can be considered as

Q̂ = g(T_{x1}, T_{x2}, ..., T_{xp}, T_y), (5.18.6.1)

where T_{xj} = n⁻¹ Σ_{i=1}^{n} x_ij, j = 1, 2, ..., p, T_y = n⁻¹ Σ_{i=1}^{n} y_i, and g is a smooth function. It is
to be noted that estimation of a mean, proportion, and ratio of means is a special case
of the general estimator Q̂, but the estimation of a median, variance, or correlation is
not a member of this general estimator. The estimand Q will be the same smooth
function g of the expectations of the means, Q = g(E(T_{x1}), E(T_{x2}), ..., E(T_{xp}), E(T_y)),
where the expectations are taken over repeated sampling, and the estimator Q̂ is called
an estimator based on the method of moments. Thus the general form of the
estimator of the variance V(Q̂) is given by

v(Q̂) = n⁻¹ [∂g(T)/∂T]' S [∂g(T)/∂T] (5.18.6.2)

where T = (T_{x1}, T_{x2}, ..., T_{xp}, T_y)' and S = (n − 1)⁻¹ Σ_{i=1}^{n} (Z_i − Z̄)(Z_i − Z̄)' with Z_i = (x_i, y_i). Schafer
and Schenker (2000) have developed a model for missing data based on this
procedure.
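The variance formula (5.18.6.2) can be sketched for the ratio of means Q̂ = T_y/T_x, using a numerical gradient in place of the analytic one; all data values below are hypothetical.

```python
# Sketch of the method-of-moments variance (5.18.6.2) for a smooth
# function of sample means: here Q-hat = g(T_x, T_y) = T_y / T_x (a ratio
# of means), with the gradient of g evaluated numerically by central
# differences.  The data are hypothetical.
import numpy as np

x = np.array([4.0, 5.5, 6.0, 5.0, 4.5, 6.5])
y = np.array([8.1, 10.9, 12.2, 9.8, 9.1, 13.0])
Z = np.column_stack([x, y])                 # Z_i = (x_i, y_i)
n = len(x)
T = Z.mean(axis=0)                          # (T_x, T_y)

def g(t):
    return t[1] / t[0]

# numerical gradient of g at T
eps = 1e-6
grad = np.array([(g(T + eps * np.eye(2)[k]) - g(T - eps * np.eye(2)[k]))
                 / (2 * eps) for k in range(2)])

S = np.cov(Z, rowvar=False, ddof=1)         # sample covariance matrix S
v_Q = grad @ S @ grad / n                   # estimated variance of g(T)
print(g(T), v_Q)
```

For this two-variable case the numerical gradient agrees with the analytic gradient (−T_y/T_x², 1/T_x) of the ratio, so the result matches the familiar delta-method variance of a ratio of means.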

5.18.7 POISSON SAMPLING

Hajek (1964) suggested a sampling procedure in which a Bernoulli trial is
conducted with probability of success π_i such that if the i-th trial results in a success
then the i-th unit is included in the sample. Clearly the sample size m is a random
variable for this sampling procedure. The second order inclusion probabilities are
given by π_ij = π_i π_j. Under such a sampling procedure an unbiased estimator of the
population total is given by

Ŷ_P = Σ_{i∈s} y_i / π_i (5.18.7.1)

with variance

V(Ŷ_P) = Σ_{i∈Ω} (1 − π_i) y_i² / π_i (5.18.7.2)

and an unbiased estimator of V(Ŷ_P) is given by

v(Ŷ_P) = Σ_{i∈s} (1 − π_i) y_i² / π_i². (5.18.7.3)

For more depth in Poisson sampling, one can refer to Ogus and Clark (1971) and
Brewer, Early, and Hanif (1984).
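A minimal simulation sketch of Poisson sampling and the estimators (5.18.7.1) and (5.18.7.3); the π_i and y values below are hypothetical.

```python
# Sketch of Hajek's Poisson sampling: each unit enters the sample via an
# independent Bernoulli trial with success probability pi_i, so the
# sample size m is random.  The estimator (5.18.7.1) and the variance
# estimator (5.18.7.3) are computed for one simulated sample.
import random

random.seed(42)
y  = [12.0, 7.5, 20.0, 3.2, 15.8, 9.9, 11.1, 4.4]
pi = [0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.3]

sample = [i for i in range(len(y)) if random.random() < pi[i]]    # random size m

Y_hat = sum(y[i] / pi[i] for i in sample)                         # (5.18.7.1)
v_hat = sum((1 - pi[i]) * y[i] ** 2 / pi[i] ** 2 for i in sample) # (5.18.7.3)
print(len(sample), Y_hat, v_hat)
```

Because every unit is included independently, averaging Ŷ_P over many replications of this simulation would converge to the true total Σ y_i.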

5.18.8 COSMETIC CALIBRATION

Cosmetic calibration was introduced by Sarndal and Wright (1984). Brewer (1995)
suggested a procedure for constructing a cosmetic estimator. Brewer (1999) has
shown that cosmetic estimators are by definition interpretable both as design based
and as prediction based estimators. Formulae for them can be obtained directly by
equating these two estimators, or indirectly by a simple form of calibration. Note
that since they constitute a subset of GREG, their design variances cannot be estimated
without knowing the relevant second order inclusion probabilities; but under the
prediction model to which they are calibrated those probabilities do not affect their
anticipated variances, so it is more appropriate to estimate these and/or their
prediction variances. Brewer (1999) has shown that cosmetic calibration is a simple
and effective method for eliminating negative and unacceptably small positive
sample weights. Interestingly, he suggests here to estimate the anticipated variance
of any calibrated estimator of the population total under the superpopulation model

m: y_i = βx_i + e_i, such that E_m(e_i) = 0, E_m(e_i²) = σ² x_i^g, and E_m(e_i e_j) = 0 for i ≠ j.
We have seen that a calibrated estimator of the population total Y can be written as

Ŷ_c = Σ_{i∈s} w_i y_i, (5.18.8.1)

where w_i are the calibration weights. Then by the definition of the Anticipated Variance,

AV(Ŷ_c) = E_m[Ŷ_c − Y]² = E_m[ Σ_{i∈s} w_i y_i − Σ_{i∈Ω} y_i ]² = E_m[ Σ_{i∈s} w_i y_i − { Σ_{i∈s} y_i + Σ_{i∉s} y_i } ]²

= E_m[ Σ_{i∈s} (w_i − 1) y_i − Σ_{i∉s} y_i ]²

= σ² [ Σ_{i∈s} w_i(w_i − 1) x_i^g + ( Σ_{i∈Ω} x_i^g − Σ_{i∈s} π_i⁻¹ x_i^g ) − ( Σ_{i∈s} w_i x_i^g − Σ_{i∈s} π_i⁻¹ x_i^g ) ]. (5.18.8.2)
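The bracketed expression in (5.18.8.2) can be checked against the more direct form σ²[ Σ_{i∈s}(w_i − 1)² x_i^g + Σ_{i∈Ω∖s} x_i^g ]; the equality of the two bracketed quantities is purely algebraic, so the hypothetical weights below need not satisfy any calibration constraint. The population, sample, weights, and g = 1 are all invented for illustration.

```python
# Check that the bracketed expression in (5.18.8.2) equals the direct
# form  sum_s (w_i - 1)^2 v_i + sum_{Omega\s} v_i  with v_i = x_i^g.
# Population, sample, weights and g = 1 are all hypothetical.
N = 8
x = [2.0, 3.0, 1.5, 4.0, 2.5, 3.5, 1.0, 2.0]
s = [0, 2, 4, 6]                       # sampled units
pi = [0.5] * N                         # inclusion probabilities
w = [2.1, 1.8, 2.2, 1.9]               # weights for the units in s
g = 1
v = [xi ** g for xi in x]

direct = sum((w[k] - 1) ** 2 * v[i] for k, i in enumerate(s)) \
         + sum(v[i] for i in range(N) if i not in s)
bracket = sum(w[k] * (w[k] - 1) * v[i] for k, i in enumerate(s)) \
          + (sum(v) - sum(v[i] / pi[i] for i in s)) \
          - (sum(w[k] * v[i] for k, i in enumerate(s))
             - sum(v[i] / pi[i] for i in s))
assert abs(direct - bracket) < 1e-12
print(direct)
```

Expanding both forms gives Σ_s w_i² v_i − 2 Σ_s w_i v_i + Σ_Ω v_i, which is why the π_i⁻¹ terms in (5.18.8.2) cancel.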

Brewer (1999) discusses the following advantages of the cosmetically calibrated
estimator:
( a ) The estimator is clearly interpretable as prediction based as well as design
based;
( b ) Its anticipated variance and its prediction variance can both be estimated more
easily and more efficiently than the design variance of the standard GREG;
( c ) Design based estimation has a tendency to be more reliable for large samples,
and prediction based estimation for small samples and small domains. Thus the
estimators used for large domains are typically design based while those for small
domains are often purely prediction based or synthetic. If the large domain
estimators are calibrated, the estimates for their component small domains
automatically sum to them without forcing;
( d ) As an unexpected spin off, the elimination of negative and other unacceptably
small weights is streamlined by the use of cosmetic calibration.

5.18.9 MIXING OF NON-PARAMETRIC MODELS IN SURVEY SAMPLING

In order to estimate the population total t_y = Σ_{i∈Ω} y_i, the Horvitz and Thompson
(1952) estimator is given by

t̂_y = Σ_{i∈s} y_i/π_i = Σ_{i∈Ω} y_i I_i/π_i, (5.18.9.1)

where

I_i = 1 if i ∈ s, and I_i = 0 if i ∉ s, (5.18.9.2)

such that E_p(I_i) = π_i. The Sen--Yates--Grundy form of the variance of t̂_y is

V_syg(t̂_y) = (1/2) Σ Σ_{i≠j∈Ω} Ω_ij ( y_i/π_i − y_j/π_j )², (5.18.9.3)

where Ω_ij = (π_i π_j − π_ij) for i ≠ j, such that lim sup_{N→∞} (n) max_{i,j∈Ω} |Ω_ij| < ∞. An estimator
for (5.18.9.3) is

v_syg(t̂_y) = (1/2) Σ Σ_{i≠j∈s} D_ij ( y_i/π_i − y_j/π_j )², (5.18.9.4)

where D_ij = (π_i π_j − π_ij)/π_ij are called the Sen--Yates--Grundy (SYG) design
weights. In order to derive an alternative estimator of the population total, Breidt
and Opsomer (2000 ) consider modelling Yi and Xi under a superpopulation model
~ through the relationship
Yi = m(x; )+&i (5. I8.9.5)
where m(x;) is a smooth function of x, such that E~(y;)= m(x;), V~(y;) = v(x;), and
V(Xi ) is smooth and strictly positive. Let K denote a continuous kernel function
and let h denot e the bandwidth. Breidt and Opsomer (2000) propo sed a local
polynomial kernel estimator of degree q for m(xi) that is based on the entire finite
popula tion as follows. Let Yo = Lvit e O be the N-dimensional vector of v. values in
502 Advanced sampling theory with applications

the finite population, er a vector with a 1 in the r,h position and 0 elsewhere,
ε' = (ε_1, ε_2, ..., ε_N)_{1×N},

X_Ωi = [ 1, (x_1 − x_i), (x_1 − x_i)², ..., (x_1 − x_i)^q ;
         1, (x_2 − x_i), (x_2 − x_i)², ..., (x_2 − x_i)^q ;
         ...
         1, (x_N − x_i), (x_N − x_i)², ..., (x_N − x_i)^q ],

and

W_Ωi = diag{ h⁻¹ K((x_1 − x_i)/h), h⁻¹ K((x_2 − x_i)/h), ..., h⁻¹ K((x_N − x_i)/h) }_{N×N}.

Minimization of ε' W_Ωi ε leads to the local polynomial kernel estimator of the
regression function at x_i given by

m_i = e_1' (X_Ωi' W_Ωi X_Ωi)⁻¹ X_Ωi' W_Ωi y_Ω = w_Ωi' y_Ω. (5.18.9.6)
This estimator is well defined if (X_Ωi' W_Ωi X_Ωi) is non-singular. If the m_i are known,
then a design unbiased estimator of t_y would be the generalized difference
estimator

t̂_y* = Σ_{i∈s} (y_i − m_i)/π_i + Σ_{i∈U} m_i. (5.18.9.7)
The Sen--Yates--Grundy form of the variance of (5.18.9.7) is

V(t̂_y*) = (1/2) Σ Σ_{i≠j∈Ω} Ω_ij [ (y_i − m_i)/π_i − (y_j − m_j)/π_j ]² (5.18.9.8)

for i ≠ j. If the m_i are known, a design unbiased estimator of V(t̂_y*) is

v(t̂_y*) = (1/2) Σ Σ_{i≠j∈s} D_ij [ (y_i − m_i)/π_i − (y_j − m_j)/π_j ]². (5.18.9.9)
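A toy evaluation of the generalized difference estimator (5.18.9.7) and its SYG variance estimator (5.18.9.9), assuming SRSWOR of n = 2 from N = 4 and a hypothetical known smooth function m(x) = 2x; all data values are invented.

```python
# Sketch of the generalized difference estimator (5.18.9.7) and its
# Sen--Yates--Grundy variance estimator (5.18.9.9) under SRSWOR of n = 2
# from N = 4, where pi_i = n/N and pi_ij = n(n-1)/(N(N-1)).  The smooth
# function m(x) = 2x and the data are hypothetical.
N, n = 4, 2
x = [1.0, 2.0, 3.0, 4.0]
y = [2.2, 3.9, 6.1, 8.0]
m = [2 * xi for xi in x]                    # assumed-known m(x_i)
pi = n / N
pij = n * (n - 1) / (N * (N - 1))
s = [0, 3]                                  # one particular sample

t_star = sum((y[i] - m[i]) / pi for i in s) + sum(m)          # (5.18.9.7)

D = (pi * pi - pij) / pij                   # SYG design weight D_ij
v_star = 0.5 * sum(
    D * ((y[i] - m[i]) / pi - (y[j] - m[j]) / pi) ** 2
    for i in s for j in s if i != j
)                                           # (5.18.9.9)
print(t_star, v_star)
```

Because m(x) tracks y closely here, the residuals y_i − m_i are small and the variance estimate is much smaller than it would be for the plain Horvitz and Thompson estimator.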
In practice, the values of m_i are unknown and must be estimated using sample
information. Let the n-dimensional vector y_s = (y_i)_{i∈s}, the n by (q + 1) matrix X_si,
and the n by n matrix W_si be the sample analogues of y_Ω, X_Ωi, and W_Ωi.
Minimization of ε' W_si ε leads to design based estimators of m_i given by

m̂_i = e_1' (X_si' W_si X_si)⁻¹ X_si' W_si y_s = w_si' y_s. (5.18.9.10)

If t_ij denotes the (i, j)-th element of the inverse (X_si' W_si X_si)⁻¹, then (5.18.9.10) can be
written as

m̂_i = Σ_{l=1}^{n} [ Σ_j t_1j (x_l − x_i)^j ] (π_l h)⁻¹ K((x_l − x_i)/h) y_l (5.18.9.11)

for i = 1, 2, ..., n.
Note that since x_i is known for the entire population, the m̂_i can be calculated for all
i ∈ Ω. Using this result Breidt and Opsomer (2000) proposed the estimator

t̂_y° = Σ_{i∈s} (y_i − m̂_i)/π_i + Σ_{i∈U} m̂_i (5.18.9.12)

for t_y. Following Sarndal, Swensson, and Wretman (1992), the variance of t̂_y° can
be approximated as

V(t̂_y°) ≈ V(t̂_y*) = (1/2) Σ Σ_{i≠j∈Ω} Ω_ij [ (y_i − m_i)/π_i − (y_j − m_j)/π_j ]². (5.18.9.13)

An estimator for the variance (5.18.9.13) in Sen--Yates--Grundy form is given
by

v_syg(t̂_y°) = (1/2) Σ Σ_{i≠j∈s} D_ij [ (y_i − m̂_i)/π_i − (y_j − m̂_j)/π_j ]². (5.18.9.14)

If (X_si' W_si X_si) is not invertible, one possibility is to replace it with

(X_si' W_si X_si + diag(R)_{(q+1)×(q+1)}),

where R stands for a ridge constant. Fan (1993) made use of this approach, choosing
diag(R)_{(q+1)×(q+1)} = diag(δ/N)_{(q+1)×(q+1)}, with δ being a small positive number.


Let lni be a new set of new estimates for mi obtained by either one of these two
methods . Using these estimates, the method of Breidt and Opsomer (2000) yields
7;, = L Yi- Illi + LI/li (5.18.9.15)
iES Jri i EU

as an estimator for the population total, and an estimator for estimating the v(t"y) in
Sen--Yates--Grundy form is given by

VSyg(7;,)= ~L L Dij(Yi -ffli _Yj_mj] 2 (5.18 .9.16)


2 i* jES Jri Jr j

Still more remains to be done along these lines. Note that one thing is very
obvious by following Singh, Horn and Yu (1998): one can calibrate the
estimator of variance v_syg(t̃_y) by constructing model assisted calibration constraints
under the model (5.18.9.5) such that

E_ξ{V_syg(t̂_y)} = E_ξ{v_syg(t̃_y)},

and the chi square distance between the design weights and the calibrated
weights is minimum. Further note that Breidt and Opsomer (2000) have reported a second form
of the variance of the Horvitz and Thompson estimator.

5.19 GOLDEN JUBILEE YEAR 2003 OF THE LINEAR REGRESSION ESTIMATOR
In this section we address a question raised by Deville and Sarndal's (1992)
calibration approach to the eminent survey statisticians working at government
organizations such as the U.S. Bureau of Statistics, Statistics Canada, the Australian
Bureau of Statistics, and private organizations such as RAND, WESTAT, etc.,
and their consultants from different universities across the world. Note that during
the last decade the survey statisticians working for these institutes have tried hard to
construct the traditional linear regression estimator through the calibration approach, but
without success. For example, as reported by Farrell and Singh (2002b),
Wu and Sitter (2001) rediscovered the Deville and Sarndal (1992) estimator by
setting one of the auxiliary variables at a constant level while searching for a
traditional linear regression estimator using the calibration approach.

Singh (2003c) discovered that the traditional linear regression estimator can also be shown to be a special case of the calibration approach, and pointed out that all the papers related to minimizing the chi square distance function in survey sampling need modification. There is a series of such papers by many followers of Deville and Sarndal (1992), and it seems that everyone skipped a very important point while using the chi square distance function. The technique developed by Singh (2003c) is logically more accurate than what has been done by survey statisticians during the last decade. The traditional linear regression estimator due to Hansen, Hurwitz, and Madow (1953) is shown to be unique in its class of estimators, and celebrates its Golden Jubilee Year 2003 for its outstanding performance. Singh (2003c) considers an estimator of the population total Y as
Ŷ_s = Σ_{i∈s} w_i^⊕ y_i,      (5.19.1)

where the w_i^⊕ are called the calibrated ⊕ (read as plus) weights, such that the chi square distance function defined as

D^⊕ = (1/2) Σ_{i∈s} (w_i^⊕ − d_i)²/(d_i q_i^⊕)      (5.19.2)

is minimum subject to the two constraints, given by

Σ_{i∈s} w_i^⊕ = Σ_{i∈s} d_i,      (5.19.3)

and

Σ_{i∈s} w_i^⊕ x_i = X.      (5.19.4)

The choice of the weights q_i^⊕ makes different forms of estimators. Note that the condition (5.19.3) is a requirement of the chi square test given by Sir R.A. Fisher, and is ignored by all the followers of Deville and Sarndal (1992). Obviously the Lagrange function is given by

L = (1/2) Σ_{i∈s} (w_i^⊕ − d_i)²/(d_i q_i^⊕) − λ₁ {Σ_{i∈s} w_i^⊕ − Σ_{i∈s} d_i} − λ₂ {Σ_{i∈s} w_i^⊕ x_i − X}.

On setting ∂L/∂w_i^⊕ = 0 we have

w_i^⊕ = d_i + λ₁ d_i q_i^⊕ + λ₂ d_i q_i^⊕ x_i.      (5.19.5)
On using (5.19.5) in (5.19.3) we have

λ₁ (Σ_{i∈s} d_i q_i^⊕) + λ₂ (Σ_{i∈s} d_i q_i^⊕ x_i) = 0.      (5.19.6)

On using (5.19.5) in (5.19.4) we have

λ₁ (Σ_{i∈s} d_i q_i^⊕ x_i) + λ₂ (Σ_{i∈s} d_i q_i^⊕ x_i²) = (X − X̂_HT),      (5.19.7)

where

X̂_HT = Σ_{i∈s} d_i x_i.

On solving (5.19.6) and (5.19.7) for λ₁ and λ₂ we have

λ₁ = −(Σ_{i∈s} d_i q_i^⊕ x_i)(X − X̂_HT) / [(Σ_{i∈s} d_i q_i^⊕)(Σ_{i∈s} d_i q_i^⊕ x_i²) − (Σ_{i∈s} d_i q_i^⊕ x_i)²]

and

λ₂ = (Σ_{i∈s} d_i q_i^⊕)(X − X̂_HT) / [(Σ_{i∈s} d_i q_i^⊕)(Σ_{i∈s} d_i q_i^⊕ x_i²) − (Σ_{i∈s} d_i q_i^⊕ x_i)²].

On substituting these values in (5.19.5) the calibrated plus weights are

w_i^⊕ = d_i + [d_i q_i^⊕ x_i (Σ_{i∈s} d_i q_i^⊕) − d_i q_i^⊕ (Σ_{i∈s} d_i q_i^⊕ x_i)] (X − X̂_HT) / [(Σ_{i∈s} d_i q_i^⊕)(Σ_{i∈s} d_i q_i^⊕ x_i²) − (Σ_{i∈s} d_i q_i^⊕ x_i)²].      (5.19.8)

On substituting (5.19.8) in (5.19.1) we have

Ŷ_s = Σ_{i∈s} d_i y_i + β̂_ols (X − Σ_{i∈s} d_i x_i),      (5.19.9)

where

β̂_ols = [(Σ_{i∈s} d_i q_i^⊕ x_i y_i)(Σ_{i∈s} d_i q_i^⊕) − (Σ_{i∈s} d_i q_i^⊕ y_i)(Σ_{i∈s} d_i q_i^⊕ x_i)] / [(Σ_{i∈s} d_i q_i^⊕)(Σ_{i∈s} d_i q_i^⊕ x_i²) − (Σ_{i∈s} d_i q_i^⊕ x_i)²],      (5.19.10)

which is clearly the usual traditional linear regression estimator.
Note that if q_i^⊕ = 1 and under SRSWOR sampling, where d_i = N/n, the estimator (5.19.9) reduces to

Ŷ_s = N[ȳ + (s_xy/s_x²)(X̄ − x̄)],      (5.19.11)

where

ȳ = n⁻¹ Σ_{i=1}^n y_i,  x̄ = n⁻¹ Σ_{i=1}^n x_i,  s_x² = (n−1)⁻¹ Σ_{i=1}^n (x_i − x̄)², and s_xy = (n−1)⁻¹ Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),

and it is the famous traditional linear regression estimator of Hansen, Hurwitz, and Madow (1953), which celebrates its Golden Jubilee Year 2003 for its outstanding performance. Note that there is no choice of the weights q_i^⊕ which reduces the estimator (5.19.9) to the ratio or product methods of estimation. The only choice of weights, q_i^⊕ = 1, reduces it to the traditional linear regression estimator, and leads to the following theorem.
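The reduction of (5.19.9) to (5.19.11) under q_i^⊕ = 1 and SRSWOR can be checked numerically; in the sketch below the sample data, N and the population total X are made up for illustration:

```python
# Check (a sketch): with q_i = 1 and d_i = N/n (SRSWOR), the calibrated
# estimator (5.19.9)-(5.19.10) equals N*[ybar + (s_xy/s_x^2)(Xbar - xbar)].
x = [4.0, 7.0, 5.0, 9.0, 6.0]
y = [12.0, 20.0, 15.0, 26.0, 17.0]
n, N, X = len(x), 20, 130.0            # hypothetical N and population x-total
d = [N / n] * n                         # SRSWOR design weights
q = [1.0] * n

sd   = sum(di * qi for di, qi in zip(d, q))
sdx  = sum(di * qi * xi for di, qi, xi in zip(d, q, x))
sdx2 = sum(di * qi * xi * xi for di, qi, xi in zip(d, q, x))
sdy  = sum(di * qi * yi for di, qi, yi in zip(d, q, y))
sdxy = sum(di * qi * xi * yi for di, qi, xi, yi in zip(d, q, x, y))

beta  = (sdxy * sd - sdy * sdx) / (sd * sdx2 - sdx ** 2)          # (5.19.10)
y_cal = sum(di * yi for di, yi in zip(d, y)) + beta * (X - sdx)   # (5.19.9)

# Classical form (5.19.11)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
y_reg = N * (ybar + (sxy / sxx) * (X / N - xbar))
# y_cal and y_reg agree
```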

Theorem 5.19.1. The traditional linear regression estimator due to Hansen, Hurwitz, and Madow (1953) is unique in its class of estimators.

A new estimator of the variance of the traditional linear regression estimator Ŷ_s has also been suggested by Singh (2003c) as

v̂(Ŷ_s) = (1/2) Σ_{i≠j∈s} D_ij (w_i^⊕ e_i^⊕ − w_j^⊕ e_j^⊕)²,      (5.19.12)

where e_i^⊕ = y_i − α̂_ols − β̂_ols x_i. Note that α̂_ols and β̂_ols are the least squares estimates of α and β in the model y_i = α + β x_i + e_i obtained by minimizing Σ_{i∈s} d_i q_i^⊕ e_i^⊕².
Singh (2003c) also studied a further calibrated estimator of variance given by

v_sc(Ŷ_s) = (1/2) Σ_{i≠j∈s} Ω_ij^⊕ (w_i^⊕ e_i^⊕ − w_j^⊕ e_j^⊕)²,      (5.19.13)

where the Ω_ij^⊕ are two dimensional calibrated plus weights chosen such that the two dimensional chi square distance

D^⊕ = (1/2) Σ_{i≠j∈s} (Ω_ij^⊕ − D_ij)² / (D_ij Q_ij^⊕)      (5.19.14)

is minimum subject to the two calibration constraints given by

Σ_{i≠j∈s} Ω_ij^⊕ = Σ_{i≠j∈s} D_ij      (5.19.15)

and

(1/2) Σ_{i≠j∈s} Ω_ij^⊕ (d_i x_i − d_j x_j)² = V(X̂_HT).      (5.19.16)

Singh (2003c) suggested that the statistical package GES developed by Statistics Canada could be modified to obtain the traditional linear regression type estimates of the population total, and to estimate its variance using the modified calibration approach discussed in this section. Similar changes in other statistical packages such as SUDAAN, CALMAR, SAS, and STATA are also suggested.

Example 5.19.1. Continuing with Example 5.5.1, find the calibrated plus weights which lead to the traditional linear regression estimate of the number of fish caught during 1995.

Solution. Continuing with Example 5.5.1, for q_i^⊕ = 1 we have the following table:

x_i      y_i      π_i        d_i      d_i x_i       d_i x_i²          w_i^⊕    w_i^⊕ y_i
2001     2016     0.107450   9.307    18622.615     37263852.955      9.071    18288.135
5692     2319     0.110228   9.072    51638.422     293925899.046     9.167    21257.489
2653     3816     0.115834   8.633    22903.465     60762893.451      8.469    32318.978
4860     4008     0.106153   9.420    45782.974     222505251.853     9.443    37846.597
3850     2568     0.120075   8.328    32063.294     123443681.033     8.267    21228.615
17741    16238    0.139569   7.165    127112.754    2255107373.414    8.074    131111.675
776      163      0.103605   9.652    7489.986      5812229.140       9.294    1514.894
2300     2324     0.107686   9.286    21358.394     49124305.852      9.078    21098.349
Sum                          70.863   326971.904    3047945486.743    70.863   284664.732

Thus a traditional linear regression estimate of the number of fish caught during 1995 is given by

Ŷ_lr = Σ_{i∈s} w_i^⊕ y_i = 284664.732.
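The table can be reproduced from (5.19.8); the sketch below assumes q_i^⊕ = 1 and takes the calibration total to be X = 341856, the known 1994 catch total quoted with Practical 5.3:

```python
# Sketch reproducing Example 5.19.1 from (5.19.8), with q_i = 1 and an
# assumed calibration total X = 341856 (the 1994 catch total).
x  = [2001, 5692, 2653, 4860, 3850, 17741, 776, 2300]
y  = [2016, 2319, 3816, 4008, 2568, 16238, 163, 2324]
pi = [0.107450, 0.110228, 0.115834, 0.106153,
      0.120075, 0.139569, 0.103605, 0.107686]
X  = 341856.0

d     = [1.0 / p for p in pi]               # design weights d_i = 1/pi_i
sd    = sum(d)
sdx   = sum(di * xi for di, xi in zip(d, x))
sdx2  = sum(di * xi * xi for di, xi in zip(d, x))
denom = sd * sdx2 - sdx ** 2

# Calibrated 'plus' weights (5.19.8)
w = [di + (di * xi * sd - di * sdx) * (X - sdx) / denom
     for di, xi in zip(d, x)]
y_lr = sum(wi * yi for wi, yi in zip(w, y))   # close to 284664.732
```

The weights satisfy both constraints (5.19.3) and (5.19.4) exactly, and w matches the book's column to three decimals.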

Exercise 5.1. Find the bias and variance of the estimators of the population total, Y, defined as

Ŷ_g1 = Ŷ u₁^α u₂^β,  Ŷ_g2 = Ŷ{1 + α(u₁ − 1) + β(u₂ − 1)}, and Ŷ_g3 = Ŷ[1 + α(u₁ − 1) + β(u₂ − 1)]⁻¹,

where u₁ = X̂₁/X₁ and u₂ = X̂₂/X₂ for X̂_j = Σ_{i∈s} x_ji/π_i and X_j = Σ_{i=1}^N x_ji, j = 1, 2.

Hint: Sampath and Chandra (1990).

Exercise 5.2. Find the bias and mean squared error of the estimators of the population total Y defined as

Ŷ_kg1 = αŶ(X₁/X̂₁) + βŶ(X₁/X̂₁)² and Ŷ_kg2,

where α, β and γ are suitably chosen constants such that the MSEs of the estimators are minimum.
Hint: Kapadia and Gupta (1984).

Exercise 5.3. Show that for any sampling design with positive first order inclusion probabilities for all the units in the population, the covariance between Ŷ = Σ_{i=1}^n d_i y_i and X̂ = Σ_{i=1}^n d_i x_i is given by

Cov(Ŷ, X̂) = Σ_{i=1}^N Y_i X_i (1 − π_i)/π_i + Σ_{i≠j=1}^N Y_i X_j (π_ij − π_i π_j)/(π_i π_j),

where d_i = π_i⁻¹ for i = 1, 2, ..., n. Derive its value under the SRSWOR sampling design.
Hint: Sampath and Chandra (1990).

Exercise 5.4. Study the asymptotic properties of the estimator of the population total Y defined as

Ŷ_a = Ŷ [ Σ_{r=1}^m λ_r (X_r/X̂_r) ],

where X̂_r = Σ_{i=1}^n x_ri/π_i has its usual meaning.

Exercise 5.5. Show that the minimization of E_m E_p [Ŷ_s − Y]² for any design p(s) under the model m: Y_i = βX_i + ε_i, where E_m(ε_i² | X_i) = σ² f(X_i), leads to the estimator of the total, Y, given by

Ŷ_l = Σ_{i∈s} y_i + [ Σ_{i∈s} x_i y_i/f(x_i) / Σ_{i∈s} x_i²/f(x_i) ] Σ_{i∉s} x_i.

Hint: Royall (1970a, 1970b, 1970c, 1971), Bellhouse (1984).

Exercise 5.6. Consider ȳ_s = (1/N) Σ_{i∈s} y_i/π_i, x̄_sj = (1/N) Σ_{i∈s} x_ij/π_i and X̄_j = (1/N) Σ_{i=1}^N x_ij for j = 1, 2, ..., p, with their usual meanings. Study the bias and variance of the estimator of the population mean Ȳ defined as

ȳ_r = ȳ_s + Σ_{j=1}^p β̂_sj (X̄_j − x̄_sj),

where β̂_sj denotes the j-th partial regression coefficient, under the superpopulation model.
Hint: Sarndal (1980b).

Exercise 5.7. (a) Minimize each of the following distance functions:

D₁ = Σ_{i=1}^n (w_i − d_i)²/(2d_i);  D₂ = Σ_{i=1}^n [w_i ln(w_i/d_i) − w_i + d_i];  D₃ = 2 Σ_{i=1}^n (√w_i − √d_i)²;

D₄ = Σ_{i=1}^n [−d_i ln(w_i/d_i) + w_i − d_i];  and  D₅ = Σ_{i=1}^n (w_i − d_i)²/(2w_i),

subject to the calibration constraint

Σ_{i=1}^n w_i x_i = X,

where the d_i are the known design weights and the w_i are the calibrated weights to be found. Discuss the nature of the calibrated weights in each situation.

(b) Optimise the generalized distance function

D = Σ_{i=1}^n (w_i − d_i)^α/(d_i q_i)

subject to

Σ_{i=1}^n w_i x_i = X.

Find α such that the variance of the resultant estimator is minimum. Is the optimum α = 2?

(c) In each one of the above cases (a) and (b), study the following two estimators of the population total, Y, defined as

Ŷ_ds = Σ_{i=1}^n w_i y_i and Ŷ_es = Σ_{i=1}^n w_i {y_i + e_i} − Σ_{i=1}^n d_i e_i,

where e_i = y_i − β̂_ds x_i.
Hint: Deville and Sarndal (1992), Estevao and Sarndal (2000).
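The distances in part (a) lead to quite different weight systems. As a sketch (toy data, single constraint), the chi square distance D₁ gives linear weights in closed form, while the information distance D₂ gives multiplicative "raking" weights w_i = d_i·exp(λx_i), always positive, with λ found numerically:

```python
# Sketch: calibrated weights under D1 (chi-square) vs D2 (raking),
# for the single constraint sum(w_i x_i) = X. Toy data, made up.
import math

d = [2.0, 2.0, 2.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0]
X = 22.0                                   # target; sum(d_i x_i) = 20 here

# D1: closed form w_i = d_i + d_i x_i (X - sum d x) / sum d x^2
sdx  = sum(di * xi for di, xi in zip(d, x))
sdx2 = sum(di * xi * xi for di, xi in zip(d, x))
w_chi2 = [di + di * xi * (X - sdx) / sdx2 for di, xi in zip(d, x)]

# D2: w_i = d_i exp(lam x_i); solve sum(w_i x_i) = X for lam by bisection
def resid(lam):
    return sum(di * math.exp(lam * xi) * xi for di, xi in zip(d, x)) - X

lo, hi = -1.0, 1.0                         # resid(-1) < 0 < resid(1) here
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if resid(lo) * resid(mid) <= 0:
        hi = mid
    else:
        lo = mid
lam = 0.5 * (lo + hi)
w_rake = [di * math.exp(lam * xi) for di, xi in zip(d, x)]
```

Both weight systems meet the constraint; only the raking weights are guaranteed positive.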

Exercise 5.8. Find the bias and variance of the Rao, Hartley, and Cochran (1962) strategy and the Horvitz and Thompson (1952) strategy for a multi-character survey for estimating population totals, given by

Ŷ_RHC = Σ_{i=1}^n (y_i/p_i*) P_i and Ŷ_HT = (1/n) Σ_{i=1}^n y_i/p_i*,

where

p_i* = [(1 + p_i)^{ρ_xy} − 1] / Σ_{j=1}^N [(1 + p_j)^{ρ_xy} − 1]

and ρ_xy is the known correlation between the selection probabilities p_i and the variable under study, under the superpopulation model y_i = βp_i + e_i, where the e_i are error terms such that E(e_i | p_i) = 0, E(e_i² | p_i) = a p_i^g with a > 0 and g ≥ 0, and E(e_i e_j | p_i p_j) = 0 ∀ i ≠ j. Can you suggest two more transformations p_i*?
Hint: Bansal and Singh (1989, 1990), Bedi (1995), Mangat and Singh (1992-93).

Exercise 5.9. Let there be a population of N units from which we want to select a sample of n units. For this selection, the population is randomly divided into (n + k) groups of sizes N₁, N₂, ..., N_{n+k} such that Σ_{i=1}^{n+k} N_i = N. For the first group we select N₁ units out of N units with SRSWOR sampling. Then for the second group we select N₂ units out of (N − N₁) units with the same sampling scheme, and so on. From these (n + k) groups we then select a sample of n random groups using SRSWOR sampling. Now, for selecting the ultimate sample of n units, we select one unit from each of the n groups independently, with probability proportional to the original probabilities p_i such that Σ_{i=1}^N p_i = 1. For this scheme, show that an unbiased estimator of the population total is given by

Ŷ = ((n + k)/n) Σ_{i∈s} y_i/(p_i/T_i), where T_i = Σ_{j∈G_i} p_j.

Find its variance and compare it with the usual RHC scheme.
Hint: Bansal and Singh (1986).
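The unbiasedness argument works group by group: given a fixed grouping, drawing one unit j from group G_i with probability p_j/T_i and weighting it by T_i/p_j recovers the group total in expectation. A sketch with made-up values (the (n + k)/n factor for subsampling the groups is omitted here):

```python
# Sketch: conditional (on the grouping) unbiasedness of the RHC-type
# building block -- the expected within-group contribution equals the
# group total of y, so the sum over groups equals the population total.
p = [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]
y = [3.0, 8.0, 6.0, 10.0, 12.0, 9.0]
groups = [[0, 1], [2, 3], [4, 5]]          # an arbitrary fixed grouping

total_expected = 0.0
for g in groups:
    T = sum(p[j] for j in g)
    # E[ y_j * T / p_j ] under the within-group draw with probs p_j / T
    total_expected += sum((p[j] / T) * (y[j] * T / p[j]) for j in g)
# total_expected equals sum(y) exactly
```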

Exercise 5.10. Suppose a selection scheme consists of the following steps:

Step I. Select the first unit with probability for the i-th unit proportional to (X_i − X̄)² + N⁻¹ Σ_{j=1}^N (X_j − X̄)².

Step II. Select the second unit from the remaining (N − 1) units with conditional probability for the j-th unit proportional to (X_j − X_i)².

Step III. Select (n − 2) units from the remaining units of the population by simple random sampling without replacement.

Show that under such a sampling scheme, the probability of selecting the s-th sample is

p(s) = s_x² / {(N choose n) S_x²}.

Deduce that the regression estimator of the population mean defined as

ȳ_lr = ȳ_s + (s_xy/s_x²)(X̄ − x̄_s)

and the ratio type estimator of the finite population variance defined as

ŝ_y² = s_y² (S_x²/s_x²)

are unbiased for the population mean and the population variance, respectively.

Hint: Singh and Srivastava (1980), Swain and Mishra (1992).
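The unbiasedness claims can be checked by enumerating all (N choose n) samples of a small made-up population under the design p(s) = s_x²/{(N choose n) S_x²}:

```python
# Sketch: exact enumeration check that the regression estimator is
# design-unbiased under p(s) proportional to s_x^2. Toy population.
from itertools import combinations

X = [1.0, 3.0, 4.0, 7.0, 10.0]
Y = [2.0, 5.0, 9.0, 11.0, 16.0]
N, n = len(X), 3

Xbar, Ybar = sum(X) / N, sum(Y) / N
S2x = sum((xi - Xbar) ** 2 for xi in X) / (N - 1)

samples = list(combinations(range(N), n))
total_p, expected = 0.0, 0.0
for s in samples:
    xs, ys = [X[i] for i in s], [Y[i] for i in s]
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s2x = sum((xi - xbar) ** 2 for xi in xs) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / (n - 1)
    p = s2x / (len(samples) * S2x)        # the design of Exercise 5.10
    total_p += p
    expected += p * (ybar + (sxy / s2x) * (Xbar - xbar))
# total_p -> 1 and expected -> Ybar
```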

Exercise 5.11. Show that the difference between the variance of the estimator of the population total Y under PPSWOR and PPSWR sampling schemes is

Difference = V(Ŷ_HT) − V(Ŷ_HH) = Σ_{i≠j=1}^N (Y_i − p_i Y)(Y_j − p_j Y) π_ij/(n² p_i p_j).

Show that Difference ≥ 0 for the Midzuno (1952) and Sen (1952) sampling schemes.
Hint: Prabhu-Ajgaonkar (1975).

Exercise 5.12. If the probability of selecting the s-th sample is given by

p(s) = m / {(N choose n) x̄_s},

show that a product type estimator of the population mean, ȳ₁ = (ȳ_s x̄_s)/m, where m is the equiprobable harmonic mean of the x̄_s values, is unbiased.
Hint: Ruiz and Santos (1990).

Exercise 5.13. In the RHC sampling scheme, let the first random group be made by using the Midzuno--Sen scheme of sampling while the remaining (n − 1) groups are constructed as usual. Show that if the random groups are of equal size then the resultant strategy and the usual RHC strategy are equally efficient; otherwise, find the condition under which the resultant strategy fares better than the usual RHC strategy.
Hint: Singh and Lal (1978).

Exercise 5.14. Write the FORTRAN codes to generate the first and second order
inclusion probabilities to estimate the variance of the Horvitz and Thompson
estimator of population total.
Hint: Bandyopadhyay, Chattopadhyaya, and Kundu (1977).

Exercise 5.15. In the RHC sampling scheme let the sample be selected in such a
way that:
( a ) The cost is proportional to the expected number of distinct units in the sample;
and
( b ) The cost is proportional to the total expected size of the sample where {Pj } are
taken as the relative measures of sizes for the respective units in the population .
Then show that the RHC scheme remains less efficient than the PPSWR sampling
scheme for the fixed cost of surveys.
Hint: Singh and Kishore (1975).

Exercise 5.16. Show that under the Midzuno--Sen scheme of sampling, the variance of the Horvitz and Thompson (1952) estimator (HTE) of the total may not generally decrease with an increase in sample size or average effective sample size. Suggest an improvement so that the resultant πps property yields a variance of the HTE which decreases with increasing sample size.
Hint: Chaudhuri and Arnab (1978).

Exercise 5.17. Discuss the efficiency of the usual ratio estimator under the Midzuno--Sen scheme of sampling.
Hint: Singh (1975a).

Exercise 5.18. Show that the general formula for estimating the variance of the estimator of the population total under the RHC scheme is

v_b = Σ_{i=1}^n Σ_{j=1}^n a_ij d_ij,

where a_ij = b_ij/{(N_i N_j) c_ij}, d_ij = (Q_i Q_j/2)(y_i/p_i − y_j/p_j)², and Σ_{i=1}^n Σ_{j=1}^n b_ij = Σ_{i=1}^n N_i(N_i − 1).

Hint: Chaudhuri and Mitra (1992), Ohlsson (1989).

Exercise 5.19. Discuss the asymptotic properties of the two estimators of the variance of the linear regression estimators:

v₁ = Σ_{j≠i=1} Σ D_ij (d_i e_i − d_j e_j)² and v₂ = Σ_{j≠i=1} Σ D_ij (d_i g_i e_i − d_j g_j e_j)²,

where g_i = π_i w_i and the other symbols have their usual meanings.
Hint: Chaudhuri and Maiti (1994), Valliant (2002).

Exercise 5.20. Show that an estimator of the variance of the estimator of the population total can be written as

e(s, y) = Σ_{k∈s} b(s, kk) y_k² + Σ_{k≠l∈s} b(s, kl) y_k y_l,

where b(s, kk) and b(s, kl) are suitably chosen constants. Find the bias and variance of this estimator.
Hint: Mukhopadhyay (1982), Hanif, Mukhopadhyay, and Bhattacharyya (1993).

Exercise 5.21. Discuss the relative efficiencies of the strategies due to Horvitz and
Thompson (1952), Rao, Hartley, and Cochran (1962), Sen (1952) and Midzuno
(1952) sampling schemes in estimating a finite population total under the
assumptions of a superpopulation model.
Hint: Chaudhuri and Arnab (1979).

Exercise 5.22. Is there any model free evaluation technique to compare the bias
and the mean squared error of the regression estimator?
Hint: Konijn (1979).

Exercise 5.23. Assume that the sampling design is such that, when the population size N and the sample size n are large enough, the joint probability sampling distribution of the vector ẑ = (Ŷ, X̂ᵀ)ᵀ can be assumed normal with mean vector z = (Y, Xᵀ)ᵀ and variance matrix

V(ẑ) = [ V(Ŷ)       C(X̂, Ŷ)ᵀ ]
       [ C(X̂, Ŷ)    V(X̂)    ],

where V(Ŷ) and V(X̂) are the sampling variances of Ŷ and X̂, respectively, and C(X̂, Ŷ) is the sampling covariance vector. Then show that, conditionally on X̂, the estimator Ŷ is normal with mean Y + Bᵀ(X̂ − X) and variance V(Ŷ) − C(X̂, Ŷ)ᵀ V(X̂)⁻¹ C(X̂, Ŷ), where B = V(X̂)⁻¹ C(X̂, Ŷ). Also show that the conditional mean squared error of the estimator Ŷ can be written as

MSE(Ŷ | X̂) = V(Ŷ) − C(X̂, Ŷ)ᵀ V(X̂)⁻¹ C(X̂, Ŷ) + {Bᵀ(X̂ − X)}².

Hint: Montanari (1999).
Exercise 5.24. If the estimating function g* is optimal, then prove the following results:

(a) corr²{g*, ∂ log p(θ|x)/∂θ} ≥ corr²{g, ∂ log p(θ|x)/∂θ};

(b) E{g* − ∂ log p(θ|x)/∂θ}² ≤ E{g − ∂ log p(θ|x)/∂θ}²

for all distributions p ∈ P and estimating functions g ∈ G.
Hint: Godambe (1999).

Exercise 5.25. Consider a πps sampling design with π_i = n x_i/X = n p_i, i = 1, 2, ..., N. Let y_i, i = 1, 2, ..., n, be the values of the n units in the sample s drawn by the above πps sampling design. Suppose that the finite population {Y₁, Y₂, ..., Y_N} is a random sample from the following superpopulation model:

y_i = β x_i + e_i, with E(e_i) = 0, E(e_i²) = σ² x_i, E(e_i e_j) = 0, i ≠ j, E(e_i³) = 0 and E(e_i⁴) = 3σ⁴ x_i²,

where i = 1, 2, ..., N, and β and σ² > 0 are unknown parameters. An unbiased estimator of the population mean Ȳ is

ȳ_HT = (1/N) Σ_{i=1}^n y_i/π_i = (1/(Nn)) Σ_{i=1}^n y_i/p_i.

Assuming PPSWR sampling designs, compare the following estimators of the variance V(ȳ_HT), given by

(a) v_WR = (X̄²/(n(n − 1))) Σ_{i=1}^n (y_i/x_i − (1/n) Σ_{j=1}^n y_j/x_j)²,

(c) v_c = (1 − n⁻¹ Σ_{i=1}^n π_i) v_WR,

and

(e) v_z* = ((n − 1)/(n + 1)) [1 − 2 Σ_{i=1}^n p_i + n Σ_{i=1}^N p_i² + (2/(n − 1))(n Σ_{i=1}^n p_i² − (Σ_{i=1}^n p_i)²)] v_WR.

Hint: Kott (1988), Zou (1999).

Exercise 5.26. If a simple random sample is drawn without replacement from a population of size N with random sample size n_s, where n_s > 0, then show that the first order conditional inclusion probability is given by

π_{i|n_s} = n_s/N, ∀ i ∈ Ω.

Hint: Tille (1998).

Exercise 5.27. Suppose σ² is known. Then the necessary and sufficient conditions for the estimator Ŷ = w_{0s} + Σ_{k∈s} w_{ks} y_k of the linear function Σ_{k=1}^N p_k Y_k to be admissible are that there exists λ_s such that w_{ks} = λ_s a_k + p_k (k ∈ s), and one of the following two conditions is satisfied:

(i) 0 ≤ λ_s ≤ 2c_s/d_s, where c_s = Σ_{k∈s} p_k a_k and d_s = Σ_{k∈s} a_k²;

(ii) λ_s = 2c_s/d_s, and w_{0s} = −(2c_s/d_s) Σ_{k∈s} a_k b_k + Σ_{k∈s} p_k b_k − ασ² [(2c_s/d_s)² d_s + Σ_{k∈s} p_k²],

under the superpopulation model Y_k = α a_k + b_k + ε_k, k = 1, 2, ..., N.
Hint: Zou (1997).

Exercise 5.28. Show that a necessary and sufficient condition for the estimator

v̂_s = ((1 − f)/n) s_y²

to be admissible in the class of all quadratic estimators is f ≥ 2/(n + 1).
Hint: Zou and Liang (1997).

Exercise 5.29. Consider a design with p(s) = (N−1 choose n−1)⁻¹ Σ_{i∈s} p_i. Show that an unbiased estimator of the population total is given by

Ŷ = ȳ_s (X/x̄_s*),

where x̄_s* = (N − n)⁻¹ Σ_{i∉s} x_i.
Hint: Deshpande (1978).

Exercise 5.30. For any sampling design p and any real number a, show that

E(ȳᵃ) = n⁻ᵃ Σ_{i₁=1}^N Σ_{i₂=1}^N ··· Σ_{i_a=1}^N Y_{i₁} Y_{i₂} ··· Y_{i_a} π(i₁, i₂, ..., i_a),

where ȳ = n⁻¹ Σ_{i∈s} y_i is the sample mean, and π(i₁, i₂, ..., i_a) is the a-th order inclusion probability of including the units (i₁, i₂, ..., i_a). Also, for any real number β, show that

E{(Σ_{i∈s} y_i^β) ȳᵃ} = n⁻ᵃ Σ_{i₀=1}^N Σ_{i₁=1}^N ··· Σ_{i_a=1}^N Y_{i₀}^β Y_{i₁} Y_{i₂} ··· Y_{i_a} π(i₀, i₁, i₂, ..., i_a).

Deduce the result that, for any non-negative real number β, if M_β = Σ_{i∈s} (y_i − ȳ)^β then

E(M_β) = Σ_{i=1}^N Y_i^β π(i) + Σ_{t=1}^{β−2} (−1)ᵗ (β choose t) n⁻ᵗ Σ_{i₀=1}^N Σ_{i₁=1}^N ··· Σ_{i_t=1}^N Y_{i₀}^{β−t} Y_{i₁} ··· Y_{i_t} π(i₀, i₁, ..., i_t) + (−1)^{β−1} (β − 1) n^{1−β} Σ_{i₁=1}^N ··· Σ_{i_β=1}^N Y_{i₁} ··· Y_{i_β} π(i₁, ..., i_β).

Hint: Srivastava and Saleh (1985).

Exercise 5.31. Find the bias and variance of the following estimator of the population total Y:

Ŷ_H = Σ_{i∈s} d_i y_i [X / Σ_{i∈s} d_i x_i], where d_i = π_i⁻¹.

Hint: Hajek (1959).

Exercise 5.32. Find the bias and variance of the estimator of the population total Y defined as

Ŷ_H = Σ_{i∈s} d_i y_i + b(X − Σ_{i∈s} d_i x_i),

where d_i = π_i⁻¹. Discuss the optimum choice of the constant b.
Hint: Cassel, Sarndal, and Wretman (1977).

Exercise 5.33. Justify the statement: 'For each sampling design there exists a rejective method of drawing a sample which terminates with the selection of a sample with probability one.' Also show that for Hajek's (1964) method the expected number of trials for drawing a sample is

[ Σ_{s∈Q} {Π_{i∈s} n p_i} {Π_{j∉s} (1 − n p_j)} ]⁻¹.

Hint: Hajek (1959), Deshpande and Ajgaonkar (1977).

Exercise 5.34. In the non-parametric model Y_i = m(x_i) + e_i, where E_m(e_i) = 0 and E_m(e_i²) = a x_i^g, g ≥ 0, discuss different methods of estimating m(x_i). Show that if m(x_i) = β x_i the non-parametric model reduces to a parametric one.
Hint: Breidt and Opsomer (2000).

Exercise 5.35. Let y and x_i, i = 1, 2, ..., t, respectively, be the survey variable and the auxiliary variables related to y, and suppose the information about the quantiles of the auxiliary characters, or their distribution functions, is known. From a sample of n units from a population of size N we observe (x_ik, y_k), where k ∈ s. Consider Q_{x_i}(α_i), for α_i ∈ (0.0, 0.5) ∪ (0.5, 1.0), to be known, and suppose we wish to estimate Q_y(β) with β = 1/2. Study the asymptotic properties of the following estimators of Q_y(β):

F̂_R = F̂_HTy(Q_y(β)) Π_{i=1}^t [F_{HTx_i}(Q_{x_i}(α_i)) / F̂_{HTx_i}(Q_{x_i}(α_i))]^{w_i}, with Σ_{i=1}^t w_i = 1,

and

F̂_D = F̂_HTy(Q_y(β)) + Σ_{i=1}^t b_i {F_{HTx_i}(Q_{x_i}(α_i)) − F̂_{HTx_i}(Q_{x_i}(α_i))},

where

F̂_HTy = Σ_{k∈s} Δ(Q_y(β) − y_k)/(N π_k) and F̂_{HTx_i} = Σ_{k∈s} Δ(Q_{x_i}(α_i) − x_ik)/(N π_k).

Hint: Rueda and Arcos (2002).

Exercise 5.36. Consider a penalized chi square distance function between the design weights d_i and the calibrated weights w_i*:

D = (1/2) Σ_{i∈s} (w_i* − d_i)²/(d_i q_i) + (1/2) Φ² Σ_{i∈s} w_i*²/(d_i q_i),

where the q_i are weights and Φ is a positive quantity that reflects a penalty to be decided by the investigator based on prior knowledge, or the desire for certain levels of efficiency and bias. Minimize D subject to the following two situations:

(i) no auxiliary information is available;
and
(ii) the calibration constraint of Deville and Sarndal (1992).

Discuss the different estimators of Ŷ_new = Σ_{i∈s} w_i* y_i as special cases for different sampling schemes and choices of the weights q_i. What choice of penalty and which situation leads to Searls' estimator?
Hint: Farrell and Singh (2002b).

Exercise 5.37. Under the linear model Y_i = α + β X_i + ε_i, such that E(ε_i) = 0, E(ε_i²) = σ² and E(ε_i ε_j) = 0, i ≠ j, consider the problem of estimation of the population regression coefficient, defined as

β = {N Σ_{i=1}^N X_i Y_i − Σ_{i=1}^N X_i Σ_{i=1}^N Y_i} / {N Σ_{i=1}^N X_i² − (Σ_{i=1}^N X_i)²}.

Show that the usual OLS estimator of β defined as

β̂_ols = {n Σ_{i=1}^n x_i y_i − Σ_{i=1}^n x_i Σ_{i=1}^n y_i} / {n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²}

is unbiased, with variance

V(β̂_ols) = nσ² / {n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²}.

Consider a sample s taken from a finite population Ω with a sampling design p such that π_i and π_ij denote the first and second order inclusion probabilities. Show that the design consistent estimator of β, given by

β̂_dc = {Σ_{i∈s} d_i Σ_{i∈s} d_i x_i y_i − Σ_{i∈s} d_i x_i Σ_{i∈s} d_i y_i} / {Σ_{i∈s} d_i Σ_{i∈s} d_i x_i² − (Σ_{i∈s} d_i x_i)²},

where d_i = 1/π_i, has the variance

V(β̂_dc) = {Σ_{i∈Ω} Σ_{j∈Ω} Δ_ij d_i d_j (X_i − X̄)(X_j − X̄) ε_i ε_j} / {Σ_{i∈Ω} (X_i − X̄)²}²,

where Δ_ij = (π_ij − π_i π_j) and ε_j = (Y_j − Ȳ) − β(X_j − X̄).
Hint: Sharma, Singh, Rai, and Verma (2000).

Exercise 5.38. Consider an auxiliary variable x that has a negative correlation with the study variable y. Let X = Σ_{i=1}^N X_i and consider a transformed variable X_i* = (X − X_i), i = 1, 2, ..., N, so that Σ_{i=1}^N X_i* = (N − 1)X and the probabilities of selection are

p_i* = (1 − p_i)/(N − 1), i = 1, 2, ..., N.

Then study the following estimators of the population total:

Ŷ_HH* = (1/n) Σ_{i=1}^n y_i/p_i* and Ŷ_HT* = Σ_{i∈s} y_i/π_i*,

where π_i* is the first order inclusion probability under the probability set p_i*.
Hint: Bedi and Rao (2001).

Exercise 5.39. Find the condition on the real constants c_i, i = 1, 2, ..., N, such that the estimator

Ŷ_B = Σ_{i∈s} (y_i − c_i)/π_i + C,

where C = Σ_{i=1}^N c_i, is unbiased for estimating the population total Y.
Hint: Basu (1971).

Exercise 5.40. An estimator of the population total, Y, due to Deville and Sarndal (1992) is given by

Ŷ_ds = Σ_{i∈s} d_i y_i + β̂_ds (X − Σ_{i∈s} d_i x_i),

where β̂_ds = Σ_{i∈s} d_i q_i x_i y_i / Σ_{i∈s} d_i q_i x_i², with approximate variance

V(Ŷ_ds) = Σ_{i∈Ω} d_i(1 − π_i) e_i² − Σ_{i∈Ω} Σ_{j(≠i)∈Ω} d_i d_j D_ij e_i e_j,

and consider an estimator of this variance given by

v₁(Ŷ_ds) = Σ_{i∈s} w_i d_i(1 − π_i) e_i² − Σ_{i∈s} Σ_{j(≠i)∈s} d_i d_j D_ij e_i e_j.

(a) Consider a model m such that E_m(e_i | x_i) = 0, E_m(e_i² | x_i) = σ² ν(x_i) and E_m(e_i e_j | x_i x_j) = 0. Assuming that ν(x_i) is known, show that the calibration equation obtained by equating

E_m[v₁(Ŷ_ds)] = E_m[V(Ŷ_ds)]

is given by

Σ_{i∈s} w_i d_i(1 − π_i) ν(x_i) = Σ_{i∈Ω} d_i(1 − π_i) ν(x_i).

(b) Consider an estimator of the population total Y as

Ŷ_ss = Σ_{i∈s} w_i* y_i,

where the w_i* are recalibrated weights, obtained by the minimization of a penalized chi square distance function

D_p = (1/2) Σ_{i∈s} (w_i* − w_i)²/(w_i q_i) + (1/2) Φ² Σ_{i∈s} w_i*²/(w_i q_i),

assuming w_i > 0 for all i = 1, 2, ..., n, where the w_i are already calibrated weights (or design weights), and Φ is a known real constant called the penalty to the function.

(c) If no information on the model parameter ν(x_i) is available, then show that the minimization of D_p leads to an estimator of the total given by

Ŷ_ss = (1/(1 + Φ²)) Σ_{i∈s} w_i y_i.

Also deduce Searls' (1964) estimator as a special case for a certain choice of Φ.

(d) Show that minimization of D_p subject to the calibration constraint developed in (a) leads to a new estimator given by

Ŷ_ss = (1/(1 + Φ²)) [Σ_{i∈s} w_i y_i + β̂_ss {Σ_{i∈Ω} d_i(1 − π_i) ν(x_i) − Σ_{i∈s} w_i d_i(1 − π_i) ν(x_i)}].

Further deduce the estimator due to Prasad (1989), Ŷ_p(r) = N ȳ_Searls (X̄/x̄), as a special case of it for a certain choice of Φ.

(e) Suggest some improved estimators of variance through calibration.
Hint: Singh (2002b), Singh and Horn (1999).

Exercise 5.41. (a) Let p_i be the probability mass at y_i, for i ∈ s, and let X̄ be the known population mean for the (vector valued) auxiliary variable x. Study the asymptotic properties of the empirical maximum likelihood estimator of the population mean, Ȳ, defined as

Ȳ_EL = Σ_{i∈s} p_i y_i,

where the values of p_i maximize the empirical likelihood function L(p) = Π_{i∈s} p_i subject to Σ_{i∈s} p_i = 1, p_i > 0, and Σ_{i∈s} p_i x_i = X̄.

(b) Consider the estimation of a quadratic parameter, V = Σ_{i=1}^{N−1} Σ_{j=i+1}^N Φ(y_i, y_j), which includes the population variance and covariance as special cases. Study the asymptotic properties of an estimator of V defined as

V̂_wu = N(N − 1) Σ_{i∈s} Σ_{j>i} p_ij Φ(y_i, y_j),

where the p_ij maximize the pseudo empirical likelihood function defined as

l(p) = Σ_{i∈s} Σ_{j>i} d_ij log(p_ij),

subject to the conditions

Σ_{i∈s} Σ_{j>i} p_ij = 1 (p_ij > 0), and Σ_{i∈s} Σ_{j>i} p_ij Φ(ỹ_i, ỹ_j) = (1/(N(N − 1))) Σ_{i=1}^{N−1} Σ_{j=i+1}^N Φ(ỹ_i, ỹ_j),

where the ỹ_i are the fitted values.
Hint: Wu (2001), Wu and Sitter (2001), Sitter and Wu (2002), Chen, Sitter, and Wu (2002).
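For a scalar auxiliary variable, the empirical likelihood weights in part (a) take the familiar form p_i = 1/{n(1 + λ(x_i − X̄))}, with λ the root of Σ_{i∈s}(x_i − X̄)/(1 + λ(x_i − X̄)) = 0. A sketch with made-up data, solving for λ by bisection:

```python
# Sketch: empirical likelihood weights for one scalar constraint.
# Data and the 'known' mean Xbar are made up for illustration.
x = [2.0, 4.0, 5.0, 9.0]
y = [1.0, 3.0, 4.0, 8.0]
Xbar = 5.5
n = len(x)
u = [xi - Xbar for xi in x]

def h(lam):
    return sum(ui / (1.0 + lam * ui) for ui in u)

# bracket lam so that every 1 + lam*u_i stays positive
lo = -1.0 / max(u) + 1e-9
hi = -1.0 / min(u) - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if h(lo) * h(mid) <= 0:
        hi = mid
    else:
        lo = mid
lam = 0.5 * (lo + hi)
p = [1.0 / (n * (1.0 + lam * ui)) for ui in u]   # EL weights
Y_el = sum(pi * yi for pi, yi in zip(p, y))       # EL estimate of the mean
```

A solution with all p_i > 0 exists whenever X̄ lies strictly inside the range of the sampled x values.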

Exercise 5.42. Study the asymptotic properties of an unbiased ratio estimator of the population total defined as

T̂_Rao = (T_x/n) Σ_{i∈s} (y_i/x_i) + (1/N) Σ_{i<j∈s} (B_i y_i − B_j y_j)(B_i⁻¹ − B_j⁻¹) − (1 − N⁻¹ Σ_{i∈Ω} B_i⁻¹)(n⁻¹ Σ_{i∈s} B_i y_i).

Show that if B_i = T_x/(N X_i), and Σ_{i∈Ω} B_i⁻¹ = N, then T̂_Rao reduces to the Hartley and Ross (1954) unbiased ratio estimator under SRSWOR sampling.
Hint: Rao (2002).

Exercise 5.43. Suppose X₁ and X₂ are the known totals of two auxiliary characters x_{1i} and x_{2i}, for i = 1, 2, ..., N. Consider an estimator of the population total Y as

Ŷ_lr = Σ_{i∈s} w_i^⊕ y_i.

Then show that the minimization of the chi square distance function

D = (1/2) Σ_{i∈s} (w_i^⊕ − d_i)²/(d_i q_i^⊕)

subject to the three linear calibration constraints, given by

Σ_{i∈s} w_i^⊕ = Σ_{i∈s} d_i,  Σ_{i∈s} w_i^⊕ x_{1i} = X₁, and Σ_{i∈s} w_i^⊕ x_{2i} = X₂,

leads to the traditional linear regression estimator with two auxiliary variables:

Ŷ_lr = Σ_{i∈s} d_i y_i + β̂_{1(ols)} (X₁ − Σ_{i∈s} d_i x_{1i}) + β̂_{2(ols)} (X₂ − Σ_{i∈s} d_i x_{2i}),

where β̂_{1(ols)} and β̂_{2(ols)} are the ordinary least squares estimates.
Hint: Singh (2003c).

PRACTICAL PROBLEMS

Practical 5.1. John and Michael were appointed to select three players (n = 3) out of five players (N = 5) from the list Ω = {Amy, Bob, Chris, Don, Eric} with scores 125, 126, 128, 90 and 127, respectively.

(I) John considers the following sampling scheme consisting of all possibilities:

s₁ = {Amy, Bob, Chris}, s₂ = {Amy, Bob, Don}, s₃ = {Amy, Bob, Eric},
s₄ = {Amy, Chris, Don}, s₅ = {Amy, Chris, Eric}, s₆ = {Amy, Don, Eric},
s₇ = {Bob, Chris, Don}, s₈ = {Bob, Chris, Eric}, s₉ = {Bob, Don, Eric},
and s₁₀ = {Chris, Don, Eric}, such that

p(s_t) = 0.1, ∀ t = 1, 2, ..., 10.

(a) Find the first order inclusion probabilities for John's sampling scheme.

(b) Find the estimates of the total score from each sample using John's sampling scheme.
(c) Find the bias in John's sampling scheme.
(d) Find the second order inclusion probabilities for John's sampling scheme.
(e) Find the variance of John's sampling scheme using the Sen--Yates--Grundy formula.
(f) Find the variance of John's sampling plan using the usual formula.
(g) Find the variance of John's sampling plan using the definition of variance.
(h) Are the three variances equal for John's sampling scheme?

(II) Michael likes Amy and cleverly suggests the following changes in John's sampling scheme:

p(s_t) = 1/6, ∀ t = 1, 2, 3, 4, 5, 6, and p(s_t) = 0.00, ∀ t = 7, 8, 9, 10.
(i) Find the first order inclusion probabilities for Michael's sampling scheme.
(j) Find the estimates of the total score from each sample using Michael's sampling scheme.
(k) Find the bias in Michael's sampling scheme.
(l) Find the second order inclusion probabilities for Michael's sampling scheme.
(m) Find the variance of Michael's sampling scheme using the Sen--Yates--Grundy formula.
(n) Find the variance of Michael's sampling plan using the usual formula.
(o) Find the variance of Michael's sampling plan using the definition of variance.
(p) Are the three variances equal for Michael's sampling scheme?

(III) Comparison among techniques:

(q) Find the relative efficiency of Michael's sampling scheme over John's sampling scheme.
(r) Would you like to comment on your results?
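Both schemes are small enough to enumerate completely. The sketch below computes first order inclusion probabilities and the bias and variance of the Horvitz--Thompson estimates under each scheme; the order produced by combinations() matches the listing s₁, ..., s₁₀:

```python
# Sketch: enumeration of John's and Michael's schemes from Practical 5.1.
from itertools import combinations

scores = {'Amy': 125, 'Bob': 126, 'Chris': 128, 'Don': 90, 'Eric': 127}
players = list(scores)
samples = [set(c) for c in combinations(players, 3)]   # s1..s10 in order
Y = sum(scores.values())                               # true total

def scheme_summary(probs):
    """First-order inclusion probs, bias and variance of the HT estimator."""
    pi = {u: sum(p for smp, p in zip(samples, probs) if u in smp)
          for u in players}
    pairs = [(sum(scores[u] / pi[u] for u in smp), p)
             for smp, p in zip(samples, probs) if p > 0]
    mean = sum(t * p for t, p in pairs)
    var = sum(p * (t - mean) ** 2 for t, p in pairs)
    return pi, mean - Y, var

john = [0.1] * 10
michael = [1 / 6] * 6 + [0.0] * 4    # s1..s6 are exactly the Amy samples
pi_j, bias_j, var_j = scheme_summary(john)
pi_m, bias_m, var_m = scheme_summary(michael)
```

Under John's scheme every player has π_i = 0.6; under Michael's, Amy has π = 1 and the rest π = 0.5, yet both Horvitz--Thompson strategies remain unbiased.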

Practical 5.2. Use the known information on the number of fish caught during 1992
to select a sample of seven units by using PPSWOR sampling. Collect the required
information from population 4 to estimate the total number of fish caught during
1995 through a regression estimator by using information during 1994 as the
auxiliary variable. Apply the Sen--Yates--Grundy form of the estimator of variance
of the regression estimator to construct a 95% confidence interval. Also use the
calibrated estimators of variance of the general linear regression estimator.
Population 4 in the Appendix shows that the information on the number of different
kinds of fish caught during 1992 and 1994 is available.

Practical 5.3. Develop the calibration weights for the units selected in the sample
using PPSWOR sampling by making use of known information about the number
of fish caught during 1994 as the auxiliary variable.

1 10 Pollock 862 832


2 15 White perch 4648 3489
3 28 Gray snapper 4845 4552
4 44 Sand seatrout 5665 4355
5 63 Gulf flounder 776 163
6 68 Puffers 1141 935

For simplicity use the chi square distance function between the design weights and
calibrated weights. Discuss the three cases where these weights lead to the ratio,
GREG and traditional linear regression estimator for estimating the total number of
fish caught during 1995. Derive the value of the estimate in each case.
Given : Total number of fish caught during 1994 is 341856.

Practical 5.4. Take an SRSWOR sample of 10 states from population 1 and note
the records of real estate farm loans as well as nonreal estate farm loans. Given that
information on the nonreal estate farm loans is available for all states, apply the
ratio estimator for estimating the total real estate farm loans in the United States.
Construct the 95% confidence intervals using the lower and higher level calibration
approach.
Given: N = 50, X = 43908.12 and S_x² = 1176526 . Repeat the exercise by taking a
PPSWOR sample.

Practical 5.5. Select a sample of 4 units by using the RHC scheme from population
1 of the Appendix. Estimate the total real estate farm loans using the RHC
estimator, making use of nonreal estate farm loans as an auxiliary variable. Find
a 95% confidence interval for the total real estate farm loans in the United States.
Hint: Divide the population into four random groups of unequal sizes.

Practical 5.6. The demand for fish for human consumption creates the need to
estimate the total number of fish of all kinds caught by recreational fishermen of the
Atlantic and Gulf coasts during 1994. Population 4 in the Appendix shows that the
information on the number of different kinds of fish caught during 1992 is
available. Use the known information on the number of fish caught during 1992 to
select a sample of seven units by using PPSWOR sampling (Midzuno--Sen
Sampling Scheme). Collect the required information from population 4 to estimate
the total number of fish caught during 1994. Apply the Sen--Yates--Grundy
estimator of variance to construct a 95% confidence interval.
Chapter 5: Use of auxiliary information: PPSWOR Sampling 523

Practical 5.7. Mr. Mario was interested in estimating the total value of production of
soybeans for beans in all the 30 states of the United States of America growing
this particular crop. Mr. Mario selected a sample of n = 5 states based on some prior
information and listed the first order ( π_i ) and second order ( π_ij ) inclusion
probabilities based on PPSWOR sampling as

 Value    First order       Second order inclusion probabilities ( π_ij )
 (y_i)    inclusion π_i      i      1      2      3      4      5
 5430        0.55            1      ?      ?      ?      ?      ?
 7009        0.58            2     0.25    ?      ?      ?      ?
 7180        0.60            3     0.27   0.29    ?      ?      ?
 6062        0.63            4     0.28   0.30   0.32    ?      ?
 4054        0.65            5     0.30   0.32   0.32   0.36    ?

( a ) Apply the Horvitz and Thompson estimator to estimate the total value of
production of the crop.
( b ) Construct a 95% confidence interval of the total value of production based on
the Sen--Yates--Grundy estimator of variance.
( c ) Construct a 95% confidence interval of the total value of production based on
the usual estimator of variance.
( d ) Comment on the confidence interval estimates.

Practical 5.8. Calibration weights are found to be useful in survey sampling. Find
the calibration weights for the units selected in the sample using PPSWOR
sampling by making use of known information about the number of fish caught
during 1992 and 1993 as auxiliary variables. Use the chi-square distance function
between the design weights and calibrated weights. Discuss situations where these
weights lead to the GREG and traditional linear regression estimator for estimating
the total number of fish caught during 1995. Construct a 95% confidence interval.

Sr. No.   Population unit   Kind of fish             1992    1993    1994    1995
   1            05          Herrings                28933   34060   38007   30027
   2            11          Red hake                  559     216     369     184
   3            17          Temperate basses            5      35      32      23
   4            33          Snappers, others          746     861     462     492
   5            42          Spotted seatrout        22304   21538   22181   24615
   6            52          Mullets                  5571    4186    4386    4657
   7            61          Tunas/mackerels, oth.    1190     794    1018    1029
   8            69          Other fish              12249   14953   20488   14426

Practical 5.9. The following table provides a set of first order ( π_i ) and second
order ( π_ij ) inclusion probabilities based on a population of size N = 5 units with
values X and selection probabilities P_i .

 Data    Selection       First order       Second order inclusion probabilities
  X      probabilities   inclusion                      ( π_ij )
             P_i         probabilities
                             π_i            i      1      2      3      4      5
  50        0.10            0.55            1     d_1     ?      ?      ?      ?
  75        0.15            0.58            2     0.25   d_2     ?      ?      ?
 100        0.20            0.60            3     0.27   0.29   d_3     ?      ?
 125        0.25            0.63            4     0.28   0.30   0.32   d_4     ?
 150        0.30            0.65            5     0.30   0.32   0.32   0.36   d_5


( a ) Find the value of the sample size n .
( b ) Based on the sample size found in part ( a ), show that the second order
inclusion probabilities are correct.
( c ) Complete the missing entries ( ? ) in the above table with the appropriate
second order inclusion probabilities.
( d ) Are these selection probabilities correct? Why?
( e ) What will be the diagonal entries d_i , i = 1, 2, 3, 4, 5 , in the above table?
( f ) Compute V(Ŷ_HT) and V_SYG(Ŷ_HT) if Y_1 = Y_2 = 25, Y_3 = 40, Y_4 = 55, and Y_5 = 75 .
( g ) Does the equality V(Ŷ_HT) = V_SYG(Ŷ_HT) hold?
Hint: Σ_{i=1}^{N} π_i = n ,  Σ_{j≠i} π_ij = (n - 1) π_i ,  π_ij = π_ji ∀ i ≠ j ,  Σ_{i=1}^{N} P_i = 1 , and π_ii = π_i .
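The identities in the hint can be checked numerically. A sketch that fills the upper triangle of the table by symmetry and tests the row-sum identity (note that the first order probabilities sum to 3.01, which bears on part ( d )):

```python
# Consistency check for the table above: fill the upper triangle by symmetry
# (pi_ij = pi_ji), take pi_ii = pi_i, and verify sum_{j != i} pi_ij = (n-1) pi_i.
pi = [0.55, 0.58, 0.60, 0.63, 0.65]
low = {(2, 1): 0.25,
       (3, 1): 0.27, (3, 2): 0.29,
       (4, 1): 0.28, (4, 2): 0.30, (4, 3): 0.32,
       (5, 1): 0.30, (5, 2): 0.32, (5, 3): 0.32, (5, 4): 0.36}

def pij(i, j):
    if i == j:
        return pi[i - 1]                    # diagonal entries d_i = pi_i
    return low.get((i, j)) or low[(j, i)]   # symmetry fills the '?' entries

n = 3  # suggested by the row sums; yet sum(pi) = 3.01, slightly off n
for i in range(1, 6):
    rowsum = sum(pij(i, j) for j in range(1, 6) if j != i)
    print(i, round(rowsum, 2), round((n - 1) * pi[i - 1], 2))
```

Every row satisfies Σ_{j≠i} π_ij = (n − 1)π_i with n = 3, while Σ π_i = 3.01 ≠ 3; reconciling the two observations is the point of part ( d ).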

Practical 5.10. A psychologist wants to know if the birth rate of left handed girls
and left handed boys is the same in a town having eight elementary schools. The
psychologist used the prior information on the number of registered boys and girls
to select two samples (one of boys and another of girls), each consisting of n = 3 units,
by using PPSWOR sampling. The information collected by him is given in the
following table:

 Sampled   School    First order       Second order inclusion probabilities
 unit i              inclusion         for sampled units ( π_ij ), j ≠ i
                     probability π_i
   1         4           0.45                0.20           0.20
   2         3           0.40                0.20           0.20
   3         2           0.50                0.20           0.20

( a ) Derive a 75% confidence interval estimate for the total number of left handed
boys using the Horvitz and Thompson estimator and the usual estimator of its variance.
( b ) Derive a 75% confidence interval estimate for the total number of left handed
girls using the Horvitz and Thompson estimator and the usual estimator of its variance.
( c ) Assuming that both samples are independent, use ( a ) and ( b ) to estimate the
difference between the total number of left handed boys and girls, and derive a 75%
confidence interval estimate for the same.
( d ) Derive a 75% confidence interval estimate for the total number of left handed
boys using the Horvitz and Thompson estimator and the Sen--Yates--Grundy
estimator of its variance.
( e ) Derive a 75% confidence interval estimate for the total number of left handed
girls using the Horvitz and Thompson estimator and the Sen--Yates--Grundy
estimator of its variance.
( f ) Assuming that both samples are independent, use ( d ) and ( e ) to estimate the
difference between the total number of left handed boys and girls, and derive a 75%
confidence interval estimate for the same.
( g ) Comment on the confidence interval estimates obtained.
Hint: Ghosh (1998).

Practical 5.11. Consider a population of N = 6 units Ω = {A, B, C, D, E, F} and


suppose we wish to select a sample of n = 4 units using without replacement sampling
out of the total of 15 samples s_t , t = 1, 2, 3, ..., 15 , given by
s_1 = {A,B,C,D}, s_2 = {A,B,C,E}, s_3 = {A,B,C,F}, s_4 = {A,B,D,E},
s_5 = {A,B,D,F}, s_6 = {A,B,E,F}, s_7 = {A,C,D,E}, s_8 = {A,C,D,F},
s_9 = {A,C,E,F}, s_10 = {A,D,E,F}, s_11 = {B,C,D,E}, s_12 = {B,C,D,F},
s_13 = {B,C,E,F}, s_14 = {B,D,E,F}, and s_15 = {C,D,E,F} .
( a ) Find the first order and second order inclusion probabilities provided that all
samples have an equal chance of selection, that is, p(s_t) = 1/15 ∀ t .
( b ) If p(s_t) = 0.050, t = 1,2,3,4,5 and p(s_t) = 0.075, t = 6,7,8,9,10,11,12,13,14,15, then
find the first order and second order inclusion probabilities.
( c ) If p(s_t) = 0.10, t = 1,2,3,4,5,6,7,8,9,10 and p(s_t) = 0.00, t = 11,12,13,14,15, then
find the first order and second order inclusion probabilities.
( d ) If p(s_t) = 0.050, t = 1,2,3,4,5, p(s_t) = 0.10, t = 6,7,8,9,10, and p(s_t) = 0.05,
t = 11,12,13,14,15 , then find the first order and second order inclusion probabilities.
( e ) For each one of the above situations find the variances V(Ŷ_HT) and Sen--Yates--
Grundy V_SYG(Ŷ_HT) given that A = 10, B = 20, C = 45, D = 40, E = 50, and
F = 75 .

( f ) Does the relation V(Ŷ_HT) = V_SYG(Ŷ_HT) hold?


( g ) Which sampling scheme is most efficient?

Miscellaneous:

( h ) For each one of the sampling plans ( a ) to ( d ), from each sample estimate the
population total using the Horvitz and Thompson estimator and test for unbiasedness.
( i ) For each one of the sampling plans ( a ) to ( d ), from each sample estimate the
variance using the usual estimator v(Ŷ_HT) , and test for unbiasedness.
( j ) For each one of the sampling plans ( a ) to ( d ), from each sample estimate the
variance using the Sen--Yates--Grundy estimator v_SYG(Ŷ_HT) , and test for
unbiasedness.
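Parts ( a ) and ( h ) lend themselves to direct enumeration; a sketch for the equal-probability design of part ( a ):

```python
from itertools import combinations

# Practical 5.11(a)/(h) for the equal-probability design p(s_t) = 1/15:
# enumerate all 15 samples, derive pi_i and pi_ij, and verify that the
# Horvitz-Thompson estimator of the total is design-unbiased.
y = {'A': 10, 'B': 20, 'C': 45, 'D': 40, 'E': 50, 'F': 75}
samples = list(combinations(y, 4))
p = {s: 1 / 15 for s in samples}

pi = {u: sum(p[s] for s in samples if u in s) for u in y}          # first order
pi2 = {(u, v): sum(p[s] for s in samples if u in s and v in s)
       for u in y for v in y if u < v}                             # second order

Y = sum(y.values())
ht = {s: sum(y[u] / pi[u] for u in s) for s in samples}            # HT estimates
expected = sum(p[s] * ht[s] for s in samples)                      # E(HT) = Y
print(pi['A'], pi2[('A', 'B')], expected)  # about 2/3, 0.4, and 240
```

Replacing `p` with the unequal probabilities of parts ( b )–( d ) repeats the exercise for the other plans.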

Practical 5.12. Consider a population of N = 4 units Ω = {A, B, C, D} with A = 10,


B = 15, C = 12 and D = 25 . Consider a sampling scheme which gives π_i = 0.50,
i = 1, 2, 3, 4 and π_ij = 0.20, ∀ i ≠ j = 1, 2, 3, 4.
( a ) Estimate the population total, Y , with the Horvitz and Thompson estimator,
Ŷ_HT , assuming that the units A and D are included in the sample.
( b ) Find v(Ŷ_HT) and construct a 95% confidence interval estimate.
( c ) Find v_SYG(Ŷ_HT) and construct a 95% confidence interval estimate.
( d ) Comment on the interval estimates.
( e ) Are v(Ŷ_HT) and v_SYG(Ŷ_HT) equal?
( f ) Find the population total Y , V(Ŷ_HT) and V_SYG(Ŷ_HT) .
( g ) Does the relation V(Ŷ_HT) = V_SYG(Ŷ_HT) hold?
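Parts ( a )–( c ) can be checked numerically. A sketch assuming the standard textbook forms of the Horvitz--Thompson and Sen--Yates--Grundy variance estimators:

```python
# Practical 5.12(a)-(c): HT estimate and the two variance estimators for the
# sample {A, D}, with pi_i = 0.5 and pi_ij = 0.2 for i != j.
ys = [10, 25]          # y-values of the sampled units A and D
pi = [0.5, 0.5]
pij = 0.2              # joint inclusion probability of the sampled pair

ht = sum(yi / p for yi, p in zip(ys, pi))

# usual (Horvitz-Thompson) variance estimator
v_ht = sum((1 - p) / p**2 * yi**2 for yi, p in zip(ys, pi)) \
     + 2 * (pij - pi[0] * pi[1]) / (pi[0] * pi[1] * pij) * ys[0] * ys[1]

# Sen--Yates--Grundy variance estimator (n = 2, a single pair)
v_syg = (pi[0] * pi[1] - pij) / pij * (ys[0] / pi[0] - ys[1] / pi[1])**2

print(ht, v_ht, v_syg)   # about 70, 950, 225
```

That the two variance estimates differ so sharply for the same sample is exactly the point of parts ( d ) and ( e ).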

Practical 5.13. Consider a bivariate population of N = 6 units


Ω = { Y_i :  A, B, C, D, E, F
      X_i :  P, Q, R, S, T, U }
with A = 10, B = 15, C = 12, D = 25, E = 27, F = 30, P = 8, Q = 12, R = 18, S = 23,
T = 30, and U = 28 .
Consider a sample of n = 4 units
s = { Y_i :  A, B, C, F
      X_i :  P, Q, R, U }
and a sampling scheme which gives
π_i = 0.75, i = 1, 2, 3, 4 , π_5 = 0.2 and π_6 = 0.8 , and π_ij = 0.15, ∀ i ≠ j = 1, 2, 3, 4, 5, 6 .
( a ) Find the design weights.
( b ) Derive the calibrated weights with q_i = 1 using the chi-square distance function,
assuming the population total of the auxiliary variable X is known.
( c ) Estimate the population total, Y , with the GREG estimator ( Ŷ_G ) .
( d ) Find v_SYG(Ŷ_G) using the usual estimator and construct 95% confidence interval
estimates.

( e ) Find v_SYG(Ŷ_G) using the low level calibration approach and construct 95%
confidence interval estimates.
( f ) Assume that V_SYG(X̂_HT) is known and construct a 95% confidence interval
estimate using the higher order calibration approach.
( g ) Discuss the estimates.

Practical 5.14. Consider a population of N = 4 students


Ω = {Raghunath, Mohan, Kulwinder, Sarjinder}
with their weights 110, 135, 120, and 145 lbs, respectively.
Two methodologists, Melissa and Stephanie, were asked to suggest a sampling
scheme to select one student to estimate the total weight.
( I ) Melissa likes every one and considers the following sampling scheme
consisting of four possibilities:
s_1 = {Kulwinder}, s_2 = {Mohan}, s_3 = {Raghunath} and s_4 = {Sarjinder}

such that
P(s_1) = 0.25, P(s_2) = 0.25, P(s_3) = 0.25, P(s_4) = 0.25 .
( a ) Find the first order inclusion probabilities for Melissa's sampling scheme.
( b ) Find the second order inclusion probabilities for Melissa's sampling
scheme.
( c ) Find the variance of Melissa's sampling plan using the Sen--Yates--Grundy
formula.
( d ) Find the variance of Melissa's sampling plan using the usual formula of the
variance.
( e ) Does the relation V(Ŷ_HT) = V_SYG(Ŷ_HT) hold for Melissa's sampling plan?
( II ) Stephanie likes Sarjinder and cleverly suggests the following changes in the
above sampling plan:
P(s_1) = 0.20, P(s_2) = 0.25, P(s_3) = 0.24, P(s_4) = 0.31 .
( f ) Find the first order inclusion probabilities for Stephanie's sampling scheme.
( g ) Find the second order inclusion probabilities for Stephanie's sampling scheme.
( h ) Find the variance of Stephanie's sampling plan using the usual formula of
the variance.
( i ) Does the relation V(Ŷ_HT) = V_SYG(Ŷ_HT) hold for Stephanie's sampling plan?
( j ) Find the relative efficiency of Stephanie's cleverness over Melissa's
sampling plan.
( k ) What is your opinion about Stephanie's sampling scheme?
( l ) Can you estimate the variance using either formula v(Ŷ_HT) or v_SYG(Ŷ_HT)
for Melissa's (or Stephanie's) sampling scheme? Give your opinion.
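Because each scheme selects a single student, every quantity in this practical can be enumerated exactly; a sketch computing the variance by definition, Σ_t P(s_t)(Ŷ_t − Y)², for both schemes:

```python
# Practical 5.14: single-unit designs, so pi_i = P(s_i) and the HT estimator
# from sample {i} is y_i / pi_i.  Variance by definition: sum p_t (est_t - Y)^2.
weights = {'Kulwinder': 120, 'Mohan': 135, 'Raghunath': 110, 'Sarjinder': 145}
Y = sum(weights.values())                      # 510

def scheme_variance(p):
    est = {u: weights[u] / p[u] for u in p}    # HT estimate from each sample
    mean = sum(p[u] * est[u] for u in p)       # equals Y: both schemes unbiased
    var = sum(p[u] * (est[u] - Y) ** 2 for u in p)
    return mean, var

melissa = {'Kulwinder': 0.25, 'Mohan': 0.25, 'Raghunath': 0.25, 'Sarjinder': 0.25}
stephanie = {'Kulwinder': 0.20, 'Mohan': 0.25, 'Raghunath': 0.24, 'Sarjinder': 0.31}

mM, vM = scheme_variance(melissa)
mS, vS = scheme_variance(stephanie)
print(vM, vS, vS > vM)   # Stephanie's tweak turns out LESS efficient here
```

Both schemes are design-unbiased, yet the variance by definition is 2900 under Melissa's plan and about 3039 under Stephanie's, which answers part ( j ).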

Practical 5.15. Stephen and Sarjinder were appointed to select four students
( n = 4 ) out of a multicultural list of six students ( N = 6 ) with different cultural
backgrounds as: Ω = {Poonam, Quang, Ryan, Stephanie, Tom, Udeesh} and GPA 3.70,
3.20, 3.80, 3.72, 3.90 and 3.92, respectively.
( I ) Stephen considers the following sampling scheme consisting of only four
possibilities:
s_1 = {Poonam, Quang, Ryan, Stephanie} , s_2 = {Quang, Ryan, Tom, Udeesh} ,
s_3 = {Ryan, Stephanie, Tom, Udeesh} , and s_4 = {Poonam, Stephanie, Tom, Udeesh} ,
such that
p(s_t) = 0.25 ∀ t = 1, 2, 3, 4.
( a ) Find the first order inclusion probabilities for Stephen's sampling scheme, and
derive the estimates of the average GPA from each sample.
( b ) Find the bias in Stephen's sampling scheme.
( c ) Find the second order inclusion probabilities for Stephen's sampling scheme,
and derive the variance using:
( i ) the Sen--Yates--Grundy formula, ( ii ) the usual formula, and ( iii ) the
definition of variance.
( d ) Are the three variances equal for Stephen's sampling scheme?

( II ) Sarjinder likes both Poonam and Stephanie and suggests the following
changes in Stephen's sampling scheme as:
p(s_t) = 0.50 for t = 1, 4, and p(s_t) = 0.00 for t = 2, 3.

( h ) Find the first order inclusion probabilities for Sarjinder's sampling scheme,
and estimates of the average GPA from each sample.
( i ) Find the bias in Sarjinder's sampling scheme.
( j ) Find the second order inclusion probabilities for Sarjinder's sampling scheme,
and derive the variance using:
( i ) the Sen--Yates--Grundy formula, ( ii ) the usual formula, and ( iii ) the definition of variance.
( k ) Are the three variances equal for Sarjinder's sampling scheme?

( III ) Comparison among techniques:


( o ) Find the relative efficiency of Sarjinder's sampling scheme over Stephen's
sampling scheme.
( p ) Would you like to comment on your results?
6. USE OF AUXILIARY INFORMATION: MULTI-PHASE
SAMPLING

6.0 INTRODUCTION

In the previous chapters we have seen that the use of known auxiliary information at
the estimation stage as well as at the selection stage leads to improved estimation
strategies in survey sampling. When such information is not completely known or is
lacking, and it is relatively cheaper to obtain information on the auxiliary
variable(s), one can consider taking a large preliminary sample for estimating the
population mean(s) of the auxiliary variable(s) to be used at the estimation or
selection stage of the ultimate estimation strategies. For example, in the case of a
single auxiliary variable x , since it is cheaper to obtain information on x , we
consider taking a large preliminary sample for estimating the population mean X̄ or
the distribution of x as the case may be, and only a small sample (sometimes a sub-
sample) for measuring the study variable y .

[Figure: a population of N units, from which a preliminary large sample of m units
is drawn, and from that a sample of n units.]

Fig. 6.0.1 Two-phase sampling scheme.

This could mean devoting a part of the resources to this large preliminary sample
and, therefore, a reduction in sample size for measuring the study variable. This
sampling technique is called double sampling or two-phase sampling and was
invented by Neyman (1938). In cases in which the sample for the main survey is
selected in three or more phases, the sampling procedure is called three-phase or
multi-phase sampling. This procedure is advantageous when the gain in precision is
substantial as compared to the increase in cost owing to the collection of information
on the auxiliary variate for large samples.

S. Singh, Advanced Sampling Theory with Applications
© Kluwer Academic Publishers 2003

In this chapter, we will discuss two-phase sampling estimation strategies under
different situations such as:

( a ) SRSWOR scheme at the first as well as the second phase of the sample selection;
( b ) SRSWOR scheme at the first phase and PPSWR sampling at the second phase;
( c ) PPSWOR scheme at the first as well as the second phase of the sample selection.

In the following section, we will discuss the situation when simple random without
replacement sampling is applied at both phases of the sample selection.

6.1 SRSWOR SCHEME AT THE FIRST AND SECOND PHASES OF THE SAMPLE SELECTION

Under this strategy we would like to discuss the usual ratio and regression type
strategies using one and two auxiliary variables. Before proceeding further, it is
necessary to define the notation and expected values, which will remain useful
throughout this chapter.

6.1.0 NOTATION AND EXPECTED VALUES

Let (x_1*, x_2*, ...., x_m*) be the first phase sample s_1 (say) drawn by simple random
sampling from the population of N units, and let only the auxiliary variable x be
measured on it. Also, let (y_1, y_2, ...., y_n) and (x_1, x_2, ...., x_n) denote, respectively, the
second phase sample s_2 (say) drawn by simple random sampling from the first phase
sample for the study variable y and the auxiliary variable x .
Let
ȳ = n⁻¹ Σ_{i=1}^{n} y_i ,   x̄ = n⁻¹ Σ_{i=1}^{n} x_i ,   x̄* = m⁻¹ Σ_{i=1}^{m} x_i* ,   s_x² = (n-1)⁻¹ Σ_{i=1}^{n} (x_i - x̄)² ,

s_y² = (n-1)⁻¹ Σ_{i=1}^{n} (y_i - ȳ)²   and   s_x*² = (m-1)⁻¹ Σ_{i=1}^{m} (x_i* - x̄*)² .

Defining
ε_0 = ȳ/Ȳ - 1 ,  ε_1 = x̄/x̄* - 1 ,  ε_2 = x̄/X̄ - 1 ,  ε_3 = x̄*/X̄ - 1 ,  δ_0 = s_y²/S_y² - 1  and  δ_1 = s_x²/S_x² - 1
such that
E(ε_0) = E(ε_1) = E(ε_2) = E(ε_3) = E(δ_0) = E(δ_1) = 0
and
E(ε_0²) = (1/n - 1/N) C_y² ,   E(ε_1²) = (1/n - 1/m) C_x² ,   E(ε_2²) = (1/n - 1/N) C_x² ,

E(ε_3²) = (1/m - 1/N) C_x² ,   E(δ_0²) = (1/n - 1/N)(λ_40 - 1) ,   E(δ_1²) = (1/n - 1/N)(λ_04 - 1) ,

E(ε_0 ε_1) = (1/n - 1/m) ρ_xy C_y C_x ,  E(ε_0 ε_2) = (1/n - 1/N) ρ_xy C_y C_x ,  E(ε_0 ε_3) = (1/m - 1/N) ρ_xy C_y C_x ,

E(ε_2 ε_3) = (1/m - 1/N) C_x² ,   E(ε_0 δ_1) = (1/n - 1/m) C_y λ_12 ,   E(ε_1 δ_1) = (1/n - 1/m) C_x λ_03 ,

E(ε_0 δ_0) = (1/n - 1/N) C_y λ_30 ,   E(δ_0 ε_2) = (1/n - 1/N) C_x λ_21 , and   E(δ_0 δ_1) = (1/n - 1/N)(λ_22 - 1) ,
where
λ_rs = μ_rs / (μ_20^{r/2} μ_02^{s/2}) ,   C_x² = μ_02/X̄² ,   ρ_xy = μ_11/√(μ_20 μ_02)
for
μ_rs = N⁻¹ Σ_{i=1}^{N} (Y_i - Ȳ)^r (X_i - X̄)^s ,  r, s being non-negative integers.

Note that some of these expected values are true only up to the first order of
approximation.
The above results can easily be proven along the lines of the following two theorems:

Theorem 6.1.0.1. E(ε_0²) = (1/n - 1/N) C_y² .


Proof. Note that
ȳ* = m⁻¹ Σ_{i=1}^{m} y_i*   and   s_y*² = (m-1)⁻¹ Σ_{i=1}^{m} (y_i* - ȳ*)²
denote the first phase sample mean and variance of the study variable, had y also been
measured on the first phase sample.

Then we have
V(ε_0) = E(ε_0²) - {E(ε_0)}² = E(ε_0²)
= E_1 V_2 [ ε_0 | first phase sample ] + V_1 E_2 [ ε_0 | first phase sample ]

= E_1 V_2 [ (ȳ/Ȳ - 1) | first phase sample ] + V_1 E_2 [ (ȳ/Ȳ - 1) | first phase sample ]

= E_1 [ V_2(ȳ)/Ȳ² | first phase sample ] + V_1 [ {E_2(ȳ) - Ȳ}/Ȳ | first phase sample ]

= E_1 [ (1/n - 1/m) s_y*²/Ȳ² ] + V_1 [ (ȳ* - Ȳ)/Ȳ ]

= (1/n - 1/m) S_y²/Ȳ² + (1/m - 1/N) S_y²/Ȳ² = (1/n - 1/N) C_y² .

Hence the theorem.



Theorem 6.1.0.2. E(ε_0 ε_1) = (1/n - 1/m) ρ_xy C_y C_x .


Proof. We have
Cov(ε_0, ε_1) = E(ε_0 ε_1) - E(ε_0)E(ε_1) = E(ε_0 ε_1)

= E_1 [ C_2(ε_0, ε_1) | first phase sample ] + C_1 [ {E_2(ε_0), E_2(ε_1)} | first phase sample ]

= E_1 [ C_2 { (ȳ/Ȳ - 1), (x̄/x̄* - 1) } | first phase sample ]

+ C_1 [ { E_2(ȳ/Ȳ - 1), E_2(x̄/x̄* - 1) } | first phase sample ]

= E_1 [ C_2(ȳ, x̄)/(Ȳ x̄*) | first phase sample ] + C_1 [ (ȳ*/Ȳ - 1), 0 ]

= E_1 [ (1/n - 1/m) s_xy*/(Ȳ x̄*) ] ≈ (1/n - 1/m) S_xy/(Ȳ X̄) = (1/n - 1/m) ρ_xy C_y C_x ,

where s_xy* denotes the first phase sample covariance between x and y .

Hence the theorem.
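The result of Theorem 6.1.0.1 — that under two-phase SRSWOR the second phase mean has the same variance, (1/n − 1/N)S_y², as a direct SRSWOR sample of n units — can be spot-checked by simulation; a sketch with an artificial population (sizes, seed, and distribution are arbitrary choices):

```python
import random
import statistics

# Monte Carlo check: variance of the second-phase mean under two-phase SRSWOR
# should match (1/n - 1/N) * S_y^2.
random.seed(7)
N, m, n, reps = 40, 15, 6, 20000
pop = [random.gauss(50, 10) for _ in range(N)]      # artificial population
Ybar = sum(pop) / N
S2 = sum((v - Ybar) ** 2 for v in pop) / (N - 1)

means = []
for _ in range(reps):
    s1 = random.sample(pop, m)      # first-phase SRSWOR of m units
    s2 = random.sample(s1, n)       # second-phase subsample of n units
    means.append(sum(s2) / n)

empirical = statistics.pvariance(means)
theoretical = (1 / n - 1 / N) * S2
print(round(empirical, 2), round(theoretical, 2))   # close for large reps
```

The agreement illustrates the E_1 V_2 + V_1 E_2 decomposition used in both proofs above.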

6.1.1 RATIO ESTIMATOR

The ratio estimator ȳ_rd of Ȳ in two-phase or double sampling takes the form

ȳ_rd = ȳ ( x̄*/x̄ ) .     (6.1.1.1)

Assuming that | ε_i | < 1, i = 0, 1 and using the binomial expansion
(1 + ε_1)⁻¹ = 1 - ε_1 + ε_1² - ... ,
the above estimator can easily be written as
ȳ_rd = Ȳ(1 + ε_0)(1 + ε_1)⁻¹ = Ȳ(1 + ε_0)(1 - ε_1 + ε_1² - ......)
= Ȳ[ 1 + ε_0 - ε_1 + ε_1² - ε_0 ε_1 + ..... ] .     (6.1.1.2)
Thus we have the following theorems:

Theorem 6.1.1.1. The bias in the ratio estimator ȳ_rd , to the first order of
approximation, is given by

B(ȳ_rd) = (1/n - 1/m) Ȳ [ C_x² - ρ_xy C_y C_x ] .     (6.1.1.3)


Proof. Taking expected values on both sides of (6.1.1.2) we have
E(ȳ_rd) = Ȳ E[ 1 + ε_0 - ε_1 + ε_1² - ε_0 ε_1 + ..... ] = Ȳ[ 1 + E(ε_0) - E(ε_1) + E(ε_1²) - E(ε_0 ε_1) ]

= Ȳ[ 1 + 0 - 0 + (1/n - 1/m)(C_x² - ρ_xy C_x C_y) ] .

Thus the bias in the estimator ȳ_rd , to the first order of approximation, is given by
B(ȳ_rd) = E(ȳ_rd) - Ȳ = (1/n - 1/m) Ȳ (C_x² - ρ_xy C_x C_y) .
Hence the theorem.

Theorem 6.1.1.2. The mean squared error of the estimator ȳ_rd , to the first order of
approximation, is

MSE(ȳ_rd) = (1/m - 1/N) S_y² + (1/n - 1/m)[ S_y² + R² S_x² - 2R S_xy ]     (6.1.1.4)

where R = Ȳ/X̄ has its usual meaning.


Proof. By the definition of mean squared error we have
MSE(ȳ_rd) = E(ȳ_rd - Ȳ)² ≈ E[ Ȳ(1 + ε_0 - ε_1 + ...) - Ȳ ]² ≈ Ȳ² E[ ε_0² + ε_1² - 2 ε_0 ε_1 ]

= Ȳ²[ (1/n - 1/N) C_y² + (1/n - 1/m){ C_x² - 2 ρ_xy C_x C_y } ]

= (1/m - 1/N) S_y² + (1/n - 1/m)[ S_y² + R² S_x² - 2R S_xy ] .

Hence the theorem.

Corollary 6.1.1.1. An estimator of the mean squared error of the estimator ȳ_rd , to
the first order of approximation, is

mse(ȳ_rd) = (1/m - 1/N) s_y² + (1/n - 1/m)[ s_y² + r² s_x² - 2r s_xy ]     (6.1.1.5)

where r = ȳ/x̄ has its usual meaning.

Example 6.1.1.1. From population 1 in the Appendix select a first phase sample of
10 units by SRSWOR sampling and note only the nonreal estate farm loans from
the selected units in the sample. From the selected first phase sample of 10 units,
select a sub-sample of 5 units and note the real estate farm loans as well as nonreal
estate farm loans. Estimate the average real estate farm loans by using the ratio
estimator in two-phase sampling. Deduce the 95% confidence interval.
Solution. We used the first two columns of the Pseudo-Random Number (PRN)
Table 1 given in the Appendix to select an SRSWOR sample of m = 10 units. The
following 10 distinct random numbers 01, 23, 46, 04, 32, 47, 33, 05, 22, and 38
between 1 and 50 resulted in the following first phase sample.

First phase sample information

Sr. No.      1        2        3        4        5        6        7        8        9       10
Pop. unit   01       23       46       04       32       47       33       05       22       38
State       AL       MN       VA       AR       NY       WA       NC       CA       MI       PA
x_i*     348.334 2466.892  188.477  848.317  426.274 1228.607  494.730 3928.732  440.518  298.351

where x_i* = nonreal estate farm loans.


We used the 7th and 8th columns of the Pseudo-Random Numbers to select a second
phase sample of n = 5 units from the above list of selected first phase sample units.
The following five distinct random numbers between 1 and 10 were observed: 07,
09, 01, 02 and 03. Thus the second phase sample consists of the following
information:

Second phase sample information

Sr. No.   First phase   Pop.    State    Nonreal estate     Real estate
             unit       unit             farm loans, x_i    farm loans, y_i
   1          07         33      NC          494.730            639.571
   2          09         22      MI          440.518            323.028
   3          01         01      AL          348.334            408.978
   4          02         23      MN         2466.892           1354.768
   5          03         46      VA          188.477            321.583

From the second phase sample information we have

State      x_i        y_i       (x_i - x̄)²     (y_i - ȳ)²     (x_i - x̄)(y_i - ȳ)
 NC      494.730    639.571      85884.281        899.12          -8787.5273
 MI      440.518    323.028     120597.980      82115.26          99513.4880
 AL      348.334    408.978     193121.750      40243.42          88158.2540
 MN     2466.892   1354.768    2819382.900     555296.80        1251237.1000
 VA      188.477    321.583     359176.310      82945.50         172603.7600
Sum     3938.951   3047.928    3578163.221     761500.10        1602725.1193

Note that
x̄ = (1/n) Σ x_i = 3938.951/5 = 787.7902 ,   ȳ = (1/n) Σ y_i = 3047.928/5 = 609.5856 ,

r = ȳ/x̄ = 609.5856/787.7902 = 0.7738 ,   s_x² = (n-1)⁻¹ Σ (x_i - x̄)² = 3578163.221/4 = 894540.79 ,

s_y² = (n-1)⁻¹ Σ (y_i - ȳ)² = 761500.10/4 = 190375.025 ,
and
s_xy = (n-1)⁻¹ Σ (y_i - ȳ)(x_i - x̄) = 1602725.1/4 = 400681.275 .

Thus the ratio estimate of the real estate farm loans is given by

ȳ_rd = ȳ ( x̄*/x̄ ) = 609.5856 × (1066.9232/787.7902) = 825.576 .

An estimate of the MSE(ȳ_rd) is given by


mse(ȳ_rd) = (1/m - 1/N) s_y² + (1/n - 1/m)[ s_y² + r² s_x² - 2r s_xy ]

= (1/10 - 1/50) × 190375.025

+ (1/5 - 1/10)[ 190375.025 + 0.7738² × 894540.79 - 2 × 0.7738 × 400681.275 ]

= 25820.171 .

The (1 - α)100% confidence interval estimate for the average real estate farm loans
in the US during 1997 is
ȳ_rd ∓ t_{α/2}(df = n - 1) √mse(ȳ_rd) .
Using Table 2 from the Appendix, the 95% confidence interval for the average real
estate farm loans in the US during 1997 is given by

825.576 ∓ 2.776 √25820.171 , or [379.51, 1271.63].
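The computations of this example can be reproduced programmatically; tiny differences in the last digits come from the rounding of r to 0.7738 in the hand calculation above:

```python
import math

# Reproduce Example 6.1.1.1: two-phase ratio estimate and its estimated MSE.
x1 = [348.334, 2466.892, 188.477, 848.317, 426.274,
      1228.607, 494.730, 3928.732, 440.518, 298.351]    # first phase x*
x2 = [494.730, 440.518, 348.334, 2466.892, 188.477]     # second phase x
y2 = [639.571, 323.028, 408.978, 1354.768, 321.583]     # second phase y

m, n, N = len(x1), len(x2), 50
xbar_star = sum(x1) / m                  # 1066.9232
xbar, ybar = sum(x2) / n, sum(y2) / n    # 787.7902, 609.5856
r = ybar / xbar

y_rd = ybar * xbar_star / xbar           # ratio estimate, about 825.58

s2x = sum((v - xbar) ** 2 for v in x2) / (n - 1)
s2y = sum((v - ybar) ** 2 for v in y2) / (n - 1)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x2, y2)) / (n - 1)

mse = (1 / m - 1 / N) * s2y + (1 / n - 1 / m) * (s2y + r**2 * s2x - 2 * r * sxy)
half = 2.776 * math.sqrt(mse)            # t-value for 4 df, 95% level
print(round(y_rd, 3), round(mse, 1), round(y_rd - half, 2), round(y_rd + half, 2))
```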

Suppose C_0 is the overhead cost, C_1 is the cost of selection and processing of a
single unit in the first phase, and C_2 is the cost of selection and processing of a
single unit in the second phase; then the total cost function C for selecting m units
in the first phase and n units in the second phase is given by
C = C_0 + m C_1 + n C_2 .     (6.1.1.6)

Theorem 6.1.1.3. The minimum MSE of the ratio estimator ȳ_rd for the fixed cost
C given by (6.1.1.6) is

Min.MSE(ȳ_rd) = D²/(C - C_0) - S_y²/N ,     (6.1.1.7)

where
V_R = S_y² + R² S_x² - 2R S_xy     (6.1.1.8)
and

D = √(C_1(S_y² - V_R)) + √(C_2 V_R) .     (6.1.1.9)

Proof. The Lagrange function is


L = (1/m - 1/N) S_y² + (1/n - 1/m) V_R - λ[ C - C_0 - m C_1 - n C_2 ] .     (6.1.1.10)

On differentiating (6.1.1.10) with respect to m and equating to zero we have

∂L/∂m = -S_y²/m² + V_R/m² + λ C_1 = 0 ,
which implies that
m = √(S_y² - V_R) / { √λ √C_1 } .     (6.1.1.11)
On differentiating (6.1.1.10) with respect to n and equating to zero we have
∂L/∂n = -V_R/n² + λ C_2 = 0 ,
which implies that
n = √V_R / { √λ √C_2 } .     (6.1.1.12)
On substituting these values of m and n in equation (6.1.1.6) we have
√λ = { √(C_1(S_y² - V_R)) + √(C_2 V_R) } / (C - C_0) = D/(C - C_0) (say).     (6.1.1.13)

On substituting this value of λ in (6.1.1.11) and (6.1.1.12) we obtain the optimum
sample sizes
m = { (C - C_0) √(S_y² - V_R) } / { √C_1 D }     (6.1.1.14)

and
n = { (C - C_0) √V_R } / { √C_2 D } .     (6.1.1.15)
On substituting (6.1.1.14) and (6.1.1.15) in (6.1.1.4) we obtain (6.1.1.7). Hence the
theorem.

Theorem 6.1.1.4. The minimum cost C for the fixed MSE(ȳ_rd) = V_0 (say) is

Min.C = C_0 + D² (V_0 + S_y²/N)⁻¹ ,     (6.1.1.16)

where D = √(C_1(S_y² - V_R)) + √(C_2 V_R) and V_R = S_y² + R² S_x² - 2R S_xy .


Proof. Consider the Lagrange function

L = C - C_0 - m C_1 - n C_2 - λ[ (1/m - 1/N) S_y² + (1/n - 1/m) V_R - V_0 ] .     (6.1.1.17)
On differentiating (6.1.1.17) with respect to m and equating to zero we have
∂L/∂m = -C_1 + λ S_y²/m² - λ V_R/m² = 0 ,

which implies that


m = { √λ √(S_y² - V_R) } / √C_1 .     (6.1.1.18)
On differentiating (6.1.1.17) with respect to n and equating to zero we have
∂L/∂n = -C_2 + λ V_R/n² = 0 ,
which implies that
n = { √λ √V_R } / √C_2 .     (6.1.1.19)
On substituting these values of m and n in the equation V(ȳ_rd) = V_0 we have
√λ = (V_0 + S_y²/N)⁻¹ { √(C_1(S_y² - V_R)) + √(C_2 V_R) } = D (V_0 + S_y²/N)⁻¹ .     (6.1.1.20)


On substituting this value of λ in (6.1.1.18) and (6.1.1.19) we obtain the optimum
sample sizes
m = D √(S_y² - V_R) / { √C_1 (V_0 + S_y²/N) }     (6.1.1.21)

and

n = D √V_R / { √C_2 (V_0 + S_y²/N) } .     (6.1.1.22)

On substituting (6.1.1.21) and (6.1.1.22) in C = C_0 + m C_1 + n C_2 we have (6.1.1.16).


Hence the theorem.

Theorem 6.1.1.5. The minimum variance under the single-phase SRSWOR
scheme for the fixed cost of survey is

V(ȳ) = ( C_2/(C - C_0) - 1/N ) S_y² .     (6.1.1.23)

Proof. Under SRSWOR sampling we know that
V(ȳ) = (1/n - 1/N) S_y² .     (6.1.1.24)

The cost function in this situation will be


C = C_0 + n C_2 ,     (6.1.1.25)
which implies that
n = (C - C_0)/C_2 .     (6.1.1.26)
On substituting (6.1.1.26) in (6.1.1.24), we have (6.1.1.23). Hence the theorem.

Example 6.1.1.2. The amounts of the real and nonreal estate farm loans (in $000)
during 1997 in the 50 states of the United States have been presented in population
1 of the Appendix. Suppose we selected first phase and second phase samples of
size 10 and 5 respectively.
( a ) Find the relative efficiency of the ratio estimator, for estimating the average
amount of real estate farm loans during 1997 by using information selected in the
first phase sample only on the nonreal estate farm loans during 1997, with respect
to the usual estimator of the population mean.
( b ) Suppose a budget of US$5000 is available to spend on the survey, $2000 of
which will be the overhead cost. Suppose selection, compilation, and analysis of
one unit in the first phase sample cost $50, whereas for a second phase unit it is
$500. Find the optimum values of the first phase and second phase sample sizes.
Also find the relative efficiency of the ratio estimator over the sample mean for the
fixed cost.
( c ) What will be the minimum cost for attaining a 30% relative standard deviation?
Solution. From the description of the population we have
Y_i = Amount ($000) of the real estate farm loans during 1997.
X_i = Amount ($000) of the nonreal estate farm loans during 1997.

Ȳ = 555.43 ,  X̄ = 878.16 ,  R = 0.6325 ,  S_y² = 342021.5 ,  S_x² = 1176526 ,  C_x² = 1.5256 ,

C_y² = 1.1086 ,  S_xy = 509910.41 ,  ρ_xy = 0.8038 , and N = 50 .

( a ) We have m = 10 and n = 5, therefore

MSE(ȳ_rd) = (1/m - 1/N) S_y² + (1/n - 1/m)[ S_y² + R² S_x² - 2R S_xy ]

= (1/10 - 1/50) × 342021.5

+ (1/5 - 1/10)[ 342021.5 + 0.6325² × 1176526 - 2 × 0.6325 × 509910.41 ]

= 44127.86 .
Also
V(ȳ) = (1/n - 1/N) S_y² = (1/5 - 1/50) × 342021.5 = 61563.87 .

Thus the relative efficiency of the ratio estimator over the sample mean is given by
RE = V(ȳ)/MSE(ȳ_rd) × 100 = 61563.87/44127.86 × 100 = 139.51% .

( b ) We have
C = 5000 ,  C_0 = 2000 ,  C_1 = 50 ,  C_2 = 500 ,

V_R = S_y² + R² S_x² - 2R S_xy

= 342021.5 + 0.6325² × 1176526 - 2 × 0.6325 × 509910.41 = 167661.41

and

D = √(C_1(S_y² - V_R)) + √(C_2 V_R) = √(50(342021.5 - 167661.41)) + √(500 × 167661.41) = 12108.53 .

The optimum first phase and second phase sample sizes are, respectively, given by

m = { (C - C_0) √(S_y² - V_R) } / { √C_1 D }
= { (5000 - 2000) √(342021.5 - 167661.41) } / { √50 × 12108.53 } = 14.6 ≈ 15 ,
and
n = { (C - C_0) √V_R } / { √C_2 D } = { (5000 - 2000) √167661.41 } / { √500 × 12108.53 } = 4.5 ≈ 5 .
Thus we have

MSE(Yrd) = (~ - ~ )s; + (~- ~ )[s; + R 2S; - 2 RSxy ]


Chapter 6.: Use of auxiliary information: Multi-Phase Sampling 539

= ( ...!... _ J...)X 342021 .5


15 50

+ (~ - ...!...) [342021.5 + 0.63252 x 1176526- 2 x 0.6325 x 509910.41 ]


5 15
= 38315.86.
For single-phase sampling, the optimum sample size for the fixed cost w ill be
n = (C-C O)fc 2 = (5000-2000)/500 = 6,
and

v(y) = (~ - . .!. . )s; = (~ - J...) x 342021.5 = 50163.15 .


n N 6 50
Thus percent re lative efficiency of the ratio estimator over the sample mea n for the
fixed cost is g iven by
RE= V(y) x I00=50163 .15 x I00::d30.92% .
MSE(Yrd) 38315.85

( c ) By the definition of relative standard deviation we have

√( MSE(ȳ_rd)/Ȳ² ) = 0.30 , or √( MSE(ȳ_rd)/555.43² ) = 0.30 , or

MSE(ȳ_rd) = 27765.22 = V_0 (say).

Thus the minimum cost of the survey for the fixed precision will be

Min.C = C_0 + D² (V_0 + S_y²/N)⁻¹ = 2000 + 12108.53² (27765.22 + 342021.5/50)⁻¹

= 6236.77 .
Thus a minimum of $6236.77 instead of $5000 is required for attaining the 30%
relative standard deviation of the estimator.
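The optimum-allocation and minimum-cost formulas of Theorems 6.1.1.3 and 6.1.1.4 can be coded directly; a sketch reproducing parts ( b ) and ( c ):

```python
import math

# Optimum two-phase allocation with the figures of Example 6.1.1.2.
S2y, S2x, R, Sxy, N = 342021.5, 1176526.0, 0.6325, 509910.41, 50
C, C0, C1, C2 = 5000.0, 2000.0, 50.0, 500.0

VR = S2y + R**2 * S2x - 2 * R * Sxy                   # about 167661.41
D = math.sqrt(C1 * (S2y - VR)) + math.sqrt(C2 * VR)   # about 12108.53

m_opt = (C - C0) * math.sqrt(S2y - VR) / (math.sqrt(C1) * D)   # about 14.6
n_opt = (C - C0) * math.sqrt(VR) / (math.sqrt(C2) * D)         # about 4.5

# minimum cost for a 30% relative standard deviation (part (c))
Ybar = 555.43
V0 = (0.30 * Ybar) ** 2                                # 27765.22
min_cost = C0 + D**2 / (V0 + S2y / N)                  # about 6236.8
print(round(m_opt, 1), round(n_opt, 1), round(min_cost, 2))
```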

6.1.2 DIFFERENCE ESTIMATOR

The difference estimator of the population mean Ȳ in two-phase sampling is


ȳ_dd = ȳ + d( x̄* - x̄ )     (6.1.2.1)
where d is a known constant and can be chosen such that the variance of the
estimator ȳ_dd is minimum. The above estimator can easily be written as
ȳ_dd = Ȳ(1 + ε_0) + d X̄ (ε_3 - ε_2) .     (6.1.2.2)

Thus we have the following theorems:


Theorem 6.1.2.1. The difference estimator ȳ_dd is an unbiased estimator of the
population mean Ȳ .
Proof. Taking expected values on both sides of (6.1.2.2) we have
E(ȳ_dd) = Ȳ(1 + E(ε_0)) + d X̄ (E(ε_3) - E(ε_2)) = Ȳ(1 + 0) + d X̄ (0 - 0) = Ȳ .
Hence the theorem.

Theorem 6.1.2.2. The minimum variance of the difference estimator ȳ_dd of the
population mean Ȳ is
Min.V(ȳ_dd) = (1/m - 1/N) S_y² + (1/n - 1/m) S_y² (1 - ρ_xy²) .     (6.1.2.3)

Proof. We have
V(ȳ_dd) = E[ ȳ_dd - E(ȳ_dd) ]² = E[ Ȳ(1 + ε_0) + d X̄ (ε_3 - ε_2) - Ȳ ]² = E[ Ȳ ε_0 + d X̄ (ε_3 - ε_2) ]²

= E[ Ȳ² ε_0² + d² X̄² (ε_3² + ε_2² - 2 ε_2 ε_3) + 2d Ȳ X̄ (ε_0 ε_3 - ε_0 ε_2) ]

= (1/n - 1/N) Ȳ² C_y² + (1/n - 1/m)[ d² X̄² C_x² - 2d Ȳ X̄ ρ_xy C_y C_x ] .     (6.1.2.4)
On differentiating (6.1.2.4) with respect to d and equating to zero we have

d = ρ_xy (C_y/C_x)(Ȳ/X̄) = S_xy/S_x² .     (6.1.2.5)

On substituting (6.1.2.5) in (6.1.2.4) we obtain (6.1.2.3). Hence the theorem.

6:1:3:REGRESSION·' ESTIMATOR .

The exact distribution of the regression estimator in two-phase sampling has been derived by Causeur (1999). We consider the regression estimator of population mean in two-phase sampling in the following theorem:

Theorem 6.1.3.1. Find the asymptotic MSE of the linear regression estimator of population mean \bar{Y} in two-phase sampling

\bar{y}_{lrd} = \bar{y} + \hat{\beta}(\bar{x}^* - \bar{x})  (6.1.3.1)

where \hat{\beta} = s_{xy}/s_x^2 is an estimator of the regression coefficient \beta = S_{xy}/S_x^2.
Proof. Let us define \eta = \hat{\beta}/\beta - 1, such that E(\eta) \approx 0. Then the estimator \bar{y}_{lrd} of the population mean \bar{Y} can be written as
\bar{y}_{lrd} = \bar{Y}(1 + \epsilon_0) + \beta\bar{X}(1 + \eta)(\epsilon_3 - \epsilon_2) = \bar{Y}(1 + \epsilon_0) + \beta\bar{X}(\epsilon_3 - \epsilon_2)(1 + \eta).

Thus the MSE of the estimator \bar{y}_{lrd} is given by

MSE(\bar{y}_{lrd}) = E[\bar{Y}(1 + \epsilon_0) + \beta\bar{X}(\epsilon_3 - \epsilon_2)(1 + \eta) - \bar{Y}]^2
\approx E[\bar{Y}^2\epsilon_0^2 + \beta^2\bar{X}^2(\epsilon_3^2 + \epsilon_2^2 - 2\epsilon_2\epsilon_3) + 2\beta\bar{Y}\bar{X}(\epsilon_0\epsilon_3 - \epsilon_0\epsilon_2)]
= (1/n - 1/N)\bar{Y}^2 C_y^2 + \beta^2\bar{X}^2[(1/m - 1/N)C_x^2 + (1/n - 1/N)C_x^2 - 2(1/m - 1/N)C_x^2]
  + 2\beta\bar{Y}\bar{X}[(1/m - 1/N)\rho_{xy}C_xC_y - (1/n - 1/N)\rho_{xy}C_xC_y]
= (1/n - 1/N)S_y^2 + (1/n - 1/m)[\beta^2\bar{X}^2 C_x^2 - 2\beta\bar{Y}\bar{X}\rho_{xy}C_yC_x]
Chapter 6.: Use of auxiliary information : Multi-Phase Sampling 541

= (1/n - 1/m + 1/m - 1/N)S_y^2 + (1/n - 1/m)[\beta^2 S_x^2 - 2\beta S_{xy}]

= (1/m - 1/N)S_y^2 + (1/n - 1/m)[S_y^2 + \beta^2 S_x^2 - 2\beta S_{xy}]

= (1/m - 1/N)S_y^2 + (1/n - 1/m)S_y^2(1 - \rho_{xy}^2).
Hence the theorem.

Example 6.1.3.1. The amounts of real and nonreal estate farm loans (in $000) during 1997 in the United States have been presented in population 1 of the Appendix. Suppose we selected first phase and second phase samples of sizes 10 and 5 respectively. Find the relative efficiency of the regression estimator, for estimating the average amount of the real estate farm loans during 1997 using information collected in the first phase sample only on the nonreal estate farm loans during 1997, with respect to the ratio estimator of population mean.
Solution. Continuing from example 6.1.1.2, we have MSE(\bar{y}_{rd}) = 44127.86. Also the mean squared error of the regression estimator is given by

MSE(\bar{y}_{lrd}) = (1/m - 1/N)S_y^2 + (1/n - 1/m)S_y^2(1 - \rho_{xy}^2)
= (1/10 - 1/50) \times 342021.5 + (1/5 - 1/10) \times 342021.5 \times (1 - 0.8038^2)
= 39466.05.

Thus the percent relative efficiency of the linear regression estimator \bar{y}_{lrd} over the ratio estimator \bar{y}_{rd} is

RE = [V(\bar{y}_{rd}) / MSE(\bar{y}_{lrd})] \times 100 = (44127.86 / 39466.05) \times 100 = 111.8%.
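The arithmetic of Example 6.1.3.1 can be reproduced in a few lines, using the values as printed in the example:

```python
# Re-computing Example 6.1.3.1 (values as printed in the example).
N, m, n = 50, 10, 5
S_y2, rho_xy = 342021.5, 0.8038
mse_rd = 44127.86                        # MSE of the ratio estimator (given)

mse_lrd = (1/m - 1/N)*S_y2 + (1/n - 1/m)*S_y2*(1 - rho_xy**2)
re = mse_rd / mse_lrd * 100              # percent relative efficiency
print(round(mse_lrd, 2), round(re, 1))   # 39466.05 and 111.8
```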

6.1.4 GENERAL CLASS OF ESTIMATORS

We now discuss the results for a general class of estimators, which is in fact a two-phase sampling analogue of the general class of estimators proposed by Srivastava and Jhajj (1981) for single-phase sampling. Srivastava (1981a, 1981b) and Srivastava and Jhajj (1987) have also considered some generalized estimators of population mean in two-phase sampling. We have the following theorem:

Theorem 6.1.4.1. Consider a general class of ratio type estimators \bar{y}_{gd} of the population mean \bar{Y} in two-phase sampling given by

\bar{y}_{gd} = \bar{y} H(u, v)  (6.1.4.1)

where H(.,.) is a parametric function such that it satisfies the following regularity conditions:
( a ) H(1,1) = 1;
( b ) The first and second order partial derivatives of H with respect to u = \bar{x}/\bar{x}^* and v = s_x^2/s_x^{*2} exist and are continuous and known constants.
Then show that its minimum MSE is given by

Min.MSE(\bar{y}_{gd}) = (1/m - 1/N)S_y^2 + (1/n - 1/m)S_y^2[1 - \rho_{xy}^2 - (\lambda_{12} - \rho_{xy}\lambda_{03})^2/(\lambda_{04} - 1 - \lambda_{03}^2)].  (6.1.4.2)

Proof. Expanding the function H(u, v) around the point (1,1) up to the first order Taylor's series we have

\bar{y}_{gd} = \bar{y}[H(1,1) + (u - 1)\partial H/\partial u|_{(1,1)} + (v - 1)\partial H/\partial v|_{(1,1)} + ....]

= \bar{Y}(1 + \epsilon_0)(1 + \epsilon_1 H_{10} + \delta_1 H_{01} + ....)  (6.1.4.3)

where \epsilon_1 = u - 1, \delta_1 = v - 1, and H_{10} = \partial H/\partial u|_{(1,1)} and H_{01} = \partial H/\partial v|_{(1,1)} denote the known first order partial derivatives of the function H with respect to u and v respectively. By definition, we have

MSE(\bar{y}_{gd}) = E[\bar{y}_{gd} - \bar{Y}]^2 = E[\bar{Y}\{1 + \epsilon_0 + \epsilon_1 H_{10} + \delta_1 H_{01} + ...\} - \bar{Y}]^2
= \bar{Y}^2 E[\epsilon_0^2 + H_{10}^2\epsilon_1^2 + H_{01}^2\delta_1^2 + 2\epsilon_0\epsilon_1 H_{10} + 2\epsilon_0\delta_1 H_{01} + 2\epsilon_1\delta_1 H_{10}H_{01}]
= \bar{Y}^2[(1/n - 1/N)C_y^2 + (1/n - 1/m)\{H_{10}^2 C_x^2 + H_{01}^2(\lambda_{04} - 1) + 2H_{10}\rho_{xy}C_yC_x
+ 2H_{01}C_y\lambda_{12} + 2H_{01}H_{10}C_x\lambda_{03}\}].  (6.1.4.4)

On differentiating (6.1.4.4) with respect to H_{10} and H_{01}, respectively, and equating to zero we have

H_{10}C_x + H_{01}\lambda_{03} = -\rho_{xy}C_y,  (6.1.4.5)
and
H_{10}C_x\lambda_{03} + H_{01}(\lambda_{04} - 1) = -C_y\lambda_{12}.  (6.1.4.6)

Solving (6.1.4.5) and (6.1.4.6) for H_{10} and H_{01} we have

H_{10} = C_y[\lambda_{03}\lambda_{12} - \rho_{xy}(\lambda_{04} - 1)]/[C_x\{\lambda_{04} - 1 - \lambda_{03}^2\}],  (6.1.4.7)

and

H_{01} = C_y[\rho_{xy}\lambda_{03} - \lambda_{12}]/\{\lambda_{04} - 1 - \lambda_{03}^2\}.  (6.1.4.8)

On substituting (6.1.4.7) and (6.1.4.8) in (6.1.4.4) we obtain the minimum mean squared error given by (6.1.4.2). Hence the theorem.

Remark 6.1.4.1. The optimum first phase and second phase sample sizes for the fixed cost (or variance) can also be derived for the difference, regression, and general class of estimators of mean. The optimum first phase and second phase sample sizes from the above theorems can easily be obtained by replacing V_R with

V_R = S_y^2(1 - \rho_{xy}^2)

for the difference and regression estimator, and with

V_R^* = S_y^2[1 - \rho_{xy}^2 - (\lambda_{12} - \rho_{xy}\lambda_{03})^2/(\lambda_{04} - 1 - \lambda_{03}^2)]

for the general class of estimators.

Example 6.1.4.1. The amounts of the real and nonreal estate farm loans (in $000) during 1997 in the United States have been presented in population 1 of the Appendix. Suppose we selected first phase and second phase samples of sizes 10 and 5 respectively. Find the relative efficiency of the general class of estimators, for estimating the average amount of the real estate farm loans during 1997 by using information collected in the first phase sample only on the nonreal estate farm loans during 1997, with respect to the regression estimator in two-phase sampling of the population mean.
Solution. We have
Y_i = Amount of the real estate farm loans in different states during 1997.
X_i = Amount of the nonreal estate farm loans in different states during 1997.
N = 50, \bar{Y} = 555.43, S_y^2 = 342021.5, \lambda_{03} = 1.5936, \rho_{xy} = 0.8038, \lambda_{12} = 1.0983, and \lambda_{04} = 4.5247.
So we have

MSE(\bar{y}_{lrd}) = (1/m - 1/N)S_y^2 + (1/n - 1/m)S_y^2(1 - \rho_{xy}^2)
= (1/10 - 1/50) \times 342021.5 + (1/5 - 1/10) \times 342021.5 \times (1 - 0.8038^2) = 39466.05.

Also for the general class of estimators, we have

MSE(\bar{y}_{gd}) = (1/m - 1/N)S_y^2 + (1/n - 1/m)S_y^2[1 - \rho_{xy}^2 - (\lambda_{12} - \rho_{xy}\lambda_{03})^2/(\lambda_{04} - 1 - \lambda_{03}^2)]
= (1/10 - 1/50) \times 342021.5
+ (1/5 - 1/10) \times 342021.5 \times [1 - 0.8038^2 - (1.0983 - 0.8038 \times 1.5936)^2/(4.5247 - 1 - 1.5936^2)]
= 38308.00.

Thus the percent relative efficiency (RE) of the general class of estimators \bar{y}_{gd} with respect to the regression estimator \bar{y}_{lrd} is given by

RE = [MSE(\bar{y}_{lrd}) / MSE(\bar{y}_{gd})] \times 100 = (39466.05 / 38308.00) \times 100 = 103.02%.
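The computation in Example 6.1.4.1 can be verified directly, again using the moment values as printed:

```python
# Re-computing Example 6.1.4.1 (moment values as printed in the example).
N, m, n = 50, 10, 5
S_y2, rho = 342021.5, 0.8038
l03, l12, l04 = 1.5936, 1.0983, 4.5247

bracket = 1 - rho**2 - (l12 - rho*l03)**2 / (l04 - 1 - l03**2)
mse_gd = (1/m - 1/N)*S_y2 + (1/n - 1/m)*S_y2*bracket
mse_lrd = (1/m - 1/N)*S_y2 + (1/n - 1/m)*S_y2*(1 - rho**2)
re = mse_lrd / mse_gd * 100
print(round(mse_gd, 1), round(re, 2))
```

The extra subtracted term in the bracket is never negative, so the general class can only improve on the regression estimator.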

6.1.5 ESTIMATION OF THE FINITE POPULATION VARIANCE

Singh (1991) defined a general class of estimators for estimating the finite population variance S_y^2 given by

s_H^2 = s_y^2 H(u, v)  (6.1.5.1)

where u = \bar{x}/\bar{x}^*, v = s_x^2/s_x^{*2}, and H(.,.) is a parametric function and it satisfies some regularity conditions such that:
( a ) H(1,1) = 1;
( b ) The first and second order partial derivatives of H with respect to u and v exist and are continuous and known constants.
Thus we have the following theorem:

Theorem 6.1.5.1. The minimum mean squared error of the general class of estimators, s_H^2, is given by

Min.MSE(s_H^2) = S_y^4[(1/m - 1/N)(\lambda_{40} - 1) + (1/n - 1/m)\{(\lambda_{40} - 1) - \lambda_{21}^2 - (\lambda_{22} - 1 - \lambda_{03}\lambda_{21})^2/(\lambda_{04} - 1 - \lambda_{03}^2)\}].  (6.1.5.2)

Proof. Expanding H(u, v) around the point (1,1) up to the first order Taylor's series, we have

s_H^2 = s_y^2 H(u, v) = s_y^2 H[1 + (u - 1), 1 + (v - 1)]
= s_y^2[1 + (u - 1)H_{10} + (v - 1)H_{01} + (u - 1)^2 H_{20} + (v - 1)^2 H_{02} + (u - 1)(v - 1)H_{11} + ....]

where

H_{10} = \partial H/\partial u|_{(1,1)}, H_{01} = \partial H/\partial v|_{(1,1)}, H_{20} = \partial^2 H/\partial u^2|_{(1,1)}, H_{02} = \partial^2 H/\partial v^2|_{(1,1)}, and H_{11} = \partial^2 H/\partial u\partial v|_{(1,1)}.

Thus the mean squared error of the general class of estimators s_H^2, to the first order of approximation, is given by

MSE(s_H^2) = E[s_H^2 - S_y^2]^2 = E[s_y^2\{1 + (u - 1)H_{10} + (v - 1)H_{01}\} - S_y^2]^2
= E[(s_y^2 - S_y^2) + s_y^2(u - 1)H_{10} + s_y^2(v - 1)H_{01} + ....]^2
\approx S_y^4 E[\delta_0 + \epsilon_1 H_{10} + \delta_1 H_{01}]^2, with \delta_0 = s_y^2/S_y^2 - 1,
= S_y^4 E[\delta_0^2 + \epsilon_1^2 H_{10}^2 + \delta_1^2 H_{01}^2 + 2\delta_0\epsilon_1 H_{10} + 2\delta_0\delta_1 H_{01} + 2\epsilon_1\delta_1 H_{10}H_{01}]
= S_y^4[(1/n - 1/N)(\lambda_{40} - 1) + (1/n - 1/m)\{H_{10}^2 C_x^2 + H_{01}^2(\lambda_{04} - 1) + 2C_x\lambda_{21}H_{10}
+ 2(\lambda_{22} - 1)H_{01} + 2C_x\lambda_{03}H_{10}H_{01}\}].  (6.1.5.3)

On differentiating (6.1.5.3) with respect to H_{10} and H_{01}, equating each derivative to zero, we have the following set of normal equations

[ C_x             \lambda_{03}       ] [H_{10}]   [ -\lambda_{21}       ]
[ C_x\lambda_{03}  (\lambda_{04} - 1) ] [H_{01}] = [ -(\lambda_{22} - 1) ].  (6.1.5.4)

On solving for H_{10} and H_{01} we obtain

H_{10} = -[\lambda_{21}(\lambda_{04} - 1) - \lambda_{03}(\lambda_{22} - 1)]/[C_x(\lambda_{04} - 1 - \lambda_{03}^2)] and H_{01} = -[(\lambda_{22} - 1) - \lambda_{21}\lambda_{03}]/(\lambda_{04} - 1 - \lambda_{03}^2).

On substituting these values of H_{10} and H_{01} in (6.1.5.3) we have the theorem.

The optimum first phase and second phase sample sizes for the fixed cost (or variance) can also be obtained for estimating the finite population variance.
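Formula (6.1.5.2) is straightforward to evaluate once the standardized moments are available. In the sketch below the \lambda values are hypothetical, chosen only to illustrate the formula; the comparison is against the first-order MSE of s_y^2 when no auxiliary information is used.

```python
# Evaluating (6.1.5.2) for an assumed (hypothetical) set of standardized
# moments lambda_rs; these are not taken from any example in the book.
N, m, n = 50, 10, 5
S_y2 = 342021.5
l40, l04, l03, l21, l22 = 4.5, 4.52, 1.59, 0.9, 2.2

gain = l21**2 + (l22 - 1 - l03*l21)**2 / (l04 - 1 - l03**2)
min_mse = S_y2**2 * ((1/m - 1/N)*(l40 - 1) + (1/n - 1/m)*((l40 - 1) - gain))

# With no auxiliary information the first-order MSE of s_y^2 is
base = S_y2**2 * (1/n - 1/N)*(l40 - 1)
print(min_mse, base)
```

Since (\lambda_{04} - 1 - \lambda_{03}^2) is non-negative for any auxiliary variable, the "gain" term is non-negative and the calibrated class cannot do worse than s_y^2 alone, to this order of approximation.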

6.1.6 CALIBRATION APPROACH IN TWO-PHASE SAMPLING

We introduce here the concept of the calibration approach in two-phase sampling and its generalisation, which has been studied by Dupont (1995), Hidiroglou and Sarndal (1995, 1998) and Estevao and Sarndal (2002). At the moment we are using only one auxiliary variable to keep the procedure simple. Suppose a first phase probability sample s_1 is drawn from the population \Omega using a sampling design that generates the selection probabilities \pi_{1i}. From the given first phase sample s_1, the second phase sample s_2 (a subset of s_1) is drawn with a sampling design with the selection probabilities \pi_{2i} = \pi_{i|s_1}. Evidently from the first phase sample s_1 the Horvitz and Thompson (1952) type estimator of population total X is given by

\hat{X}^* = \sum_{i \in s_1} d_{1i} x_i

where d_{1i} = 1/\pi_{1i}. From the second phase sample unbiased estimators of Y and X are, respectively, given by

\hat{Y} = \sum_{i \in s_2} d_{2i} y_i, and \hat{X} = \sum_{i \in s_2} d_{2i} x_i

where d_{2i} = (1/\pi_{1i}) \times (1/\pi_{2i}).

Our interest is in the estimator given by

\hat{Y}_w = \sum_{i \in s_2} w_{2i} y_i.  (6.1.6.1)

Following the simplest case of Deville and Sarndal (1992), we choose the weights w_{2i} such that the chi square (CS) distance given by

D = \sum_{i \in s_2} (w_{2i} - d_{2i})^2 (d_{2i} q_{2i})^{-1}  (6.1.6.2)

is minimum subject to the calibration constraint given by

\sum_{i \in s_2} w_{2i} x_i = \hat{X}^*  (6.1.6.3)

where q_{2i} are suitable weights as discussed in Chapter 5.

Thus we have the following theorem:

Theorem 6.1.6.1. The calibration estimator \hat{Y}_w of population total Y in two-phase sampling is given by

\hat{Y}_w = \hat{Y} + \hat{\beta}_{ds}(\hat{X}^* - \hat{X}), where \hat{\beta}_{ds} = (\sum_{i \in s_2} d_{2i}q_{2i}x_iy_i)/(\sum_{i \in s_2} d_{2i}q_{2i}x_i^2).  (6.1.6.4)

Proof. From (6.1.6.2) and (6.1.6.3), the Lagrange function L is

L = \sum_{i \in s_2} (w_{2i} - d_{2i})^2 (d_{2i}q_{2i})^{-1} - 2\lambda(\sum_{i \in s_2} w_{2i}x_i - \hat{X}^*).  (6.1.6.5)

On differentiating (6.1.6.5) with respect to w_{2i} and equating to zero we have

w_{2i} = d_{2i} + \lambda d_{2i}q_{2i}x_i.  (6.1.6.6)

On substituting (6.1.6.6) in (6.1.6.3) we have

\lambda = (\hat{X}^* - \sum_{i \in s_2} d_{2i}x_i) / \sum_{i \in s_2} d_{2i}q_{2i}x_i^2.  (6.1.6.7)

On substituting (6.1.6.7) in (6.1.6.6), we obtain the second phase calibrated weights

w_{2i} = d_{2i} + d_{2i}q_{2i}x_i (\sum_{i \in s_2} d_{2i}q_{2i}x_i^2)^{-1} (\hat{X}^* - \sum_{i \in s_2} d_{2i}x_i).  (6.1.6.8)

On putting these calibrated weights in (6.1.6.1) we have the estimator (6.1.6.4). Hence the theorem.
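The calibrated weights (6.1.6.8) are easy to compute in vectorized form. The sketch below uses placeholder data (the arrays are hypothetical, not the book's example); by construction the weights reproduce the benchmark exactly, and the choice q_{2i} = 1/x_i collapses them to simple ratio-adjusted weights.

```python
import numpy as np

# Chi-square-distance calibration of second-phase design weights d2 so that
# sum(w2 * x) hits the first-phase benchmark x_star; see (6.1.6.6)-(6.1.6.8).
def calibrate(d2, x, q2, x_star):
    lam = (x_star - np.sum(d2 * x)) / np.sum(d2 * q2 * x**2)   # (6.1.6.7)
    return d2 + lam * d2 * q2 * x                              # (6.1.6.6)

# Placeholder data (hypothetical values).
d2 = np.array([11.5, 16.4, 14.4, 10.5, 15.9])
x = np.array([17741., 1008., 4707., 18491., 1255.])
x_star = 622966.9

w_greg = calibrate(d2, x, np.ones_like(x), x_star)   # q2i = 1   -> GREG weights
w_ratio = calibrate(d2, x, 1.0 / x, x_star)          # q2i = 1/xi -> ratio weights
```

With q_{2i} = 1/x_i the formula reduces algebraically to w_{2i} = d_{2i} \hat{X}^*/\hat{X}, which is Corollary 6.1.6.1 below.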

Corollary 6.1.6.1. If q_{2i} = 1/x_i, then the regression estimator \hat{Y}_w reduces to the ratio estimator

\hat{Y}_{rd} = \hat{Y}(\hat{X}^*/\hat{X})

of the population total in two-phase sampling.



Example 6.1.6.1. Select a preliminary large sample of 10 units by PPSWOR sampling using the Midzuno-Sen sampling scheme, with the number of fish caught during 1992 as an auxiliary variable, given in population 4 of the Appendix. Select a second phase sample of 5 units from the given first phase sample by using the Midzuno-Sen sampling scheme. Find the calibration weights for the units selected in the ultimate sample by making use of known information about the number of fish caught during 1994 as an auxiliary variable. Use the chi square distance function between the design weights and calibration weights. Discuss two cases when these weights lead to the ratio and generalized regression estimator (GREG) for estimating the total number of fish during 1995. Deduce the estimate in each case.
Solution. In the Midzuno-Sen sampling scheme the first unit is to be selected with probability proportional to the number of fish during 1992 and the remaining 9 units with SRSWOR sampling. We used the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select a random number 1 <= R_i <= 69 and another random number 1 <= R_j <= 28933 by using the 7th to 11th columns. The first effective pair of random numbers is (62, 01228). Thus the unit at serial number 62 is included as the first unit in the preliminary large sample of 10 units. The remaining 9 units are selected by SRSWOR sampling from the remaining 58 units in the population. We used the 13th and 14th columns of the Pseudo-Random Numbers to draw 9 distinct random numbers between 1 and 58. The random numbers came in the sequence 05, 34, 30, 55, 46, 07, 13, 19 and 44.
First phase sample information

Sr.No. | Serial No. | Species group      | Fish in 1992 | P_{1i}  | \pi_{1i} | Fish in 1994 (x_i) | d_{1i}x_i
1      | 62         | Summer flounder    | 11918        | 0.0408  | 0.1678   | 17741              | 105727.1
2      | 05         | Herrings           | 28933        | 0.0991  | 0.2184   | 38007              | 174024.7
3      | 34         | Pigfish            | 2955         | 0.0101  | 0.1411   | 4918               | 34854.71
4      | 30         | Lane snapper       | 919          | 0.0031  | 0.1351   | 1088               | 8053.3
5      | 55         | Cunner             | 1931         | 0.0066  | 0.1381   | 1255               | 9087.6
6      | 46         | Spot               | 14974        | 0.0513  | 0.1769   | 18491              | 104528.0
7      | 07         | Saltwater catfish  | 13466        | 0.0461  | 0.1724   | 14441              | 83764.5
8      | 13         | Searobins          | 4768         | 0.0163  | 0.1465   | 4707               | 32129.7
9      | 19         | Groupers           | 4661         | 0.0159  | 0.1462   | 4583               | 31347.5
10     | 44         | Sand seatrout      | 3780         | 0.0129  | 0.1436   | 5665               | 39449.9
Sum    |            |                    | 88305        |         |          |                    | 622966.9

In the above table the values of \pi_{1i} are given by

\pi_{1i} = (N - m)P_{1i}/(N - 1) + (m - 1)/(N - 1)

for N = 69, m = 10, and P_{1i} = z_i/Z with Z = 291882 (given), where z_i is the 1992 catch.

Thus an estimate of the total number of fish caught during 1994 is

\hat{X}^* = \sum_{i \in s_1} d_{1i}x_i = 622966.9.

Again applying the Midzuno-Sen sampling scheme to the units selected in the first phase sample, we selected one unit with probability proportional to the number of fish during 1992 and the remaining 4 units with SRSWOR sampling. We used the first two columns of the Pseudo-Random Numbers to select a random number 1 <= R_i <= 10 and another random number 1 <= R_j <= 28933 by using the 7th to 11th columns. The first effective pair of random numbers is (01, 07572). Thus the unit at serial number 01 from the given first phase sample is included as the first unit in the second phase sample of n = 5 units. The remaining 4 units are selected by SRSWOR sampling from the remaining 9 units in the given first phase sample. We used the 13th column of the Pseudo-Random Numbers to draw 4 distinct random numbers between 1 and 9. The random numbers came in the sequence 3, 7, 5, and 4. Thus the ultimate sample consists of the following information.

Second phase sample information

Random No. | First phase Sr.No. | Pop. Serial No. | \pi_{1i} | Fish in 1992 (z_i) | P_{2i} | \pi_{2i} | d_{2i}  | x_i (1994) | y_i (1995)
1          | 1                  | 62              | 0.1678   | 11918              | 0.1349 | 0.5194   | 11.4737 | 17741      | 16238
3          | 4                  | 30              | 0.1351   | 919                | 0.0104 | 0.4502   | 16.4414 | 1008       | 859
7          | 8                  | 13              | 0.1465   | 4768               | 0.0539 | 0.4744   | 14.3886 | 4707       | 4793
5          | 6                  | 46              | 0.1769   | 14974              | 0.1696 | 0.5387   | 10.4936 | 18491      | 11567
4          | 5                  | 55              | 0.1381   | 1931               | 0.0219 | 0.4566   | 15.8588 | 1255       | 1375

In the above table we used

d_{2i} = 1/(\pi_{1i}\pi_{2i}), and \pi_{2i} = (m - n)P_{2i}/(m - 1) + (n - 1)/(m - 1) with P_{2i} = z_i / \sum_{i \in s_1} z_i.

Under the chi square distance function the second phase calibration weights are

w_{2i} = d_{2i} + d_{2i}q_{2i}x_i (\sum_{i \in s_2} d_{2i}q_{2i}x_i^2)^{-1} (\hat{X}^* - \sum_{i \in s_2} d_{2i}x_i)

where

\hat{X}^* = \sum_{i \in s_1} d_{1i}x_i.

Case I. If q_{2i} = 1/x_i, then these weights become

w_{2i} = d_{2i} (\sum_{i \in s_2} d_{2i}x_i)^{-1} \hat{X}^* = d_{2i} \times 622966.9/501796.3.

Sr.No. | d_{2i}  | x_i   | y_i   | d_{2i}x_i | w_{2i}   | w_{2i}y_i
1      | 11.4737 | 17741 | 16238 | 203554.91 | 14.24429 | 231298.78
2      | 16.4414 | 1008  | 859   | 16572.93  | 20.41156 | 17533.53
3      | 14.3885 | 4707  | 4793  | 67726.67  | 17.86673 | 85635.23
4      | 10.4936 | 18491 | 11567 | 194037.16 | 13.02752 | 150689.32
5      | 15.8588 | 1255  | 1375  | 19902.79  | 19.68828 | 27071.38
Sum    |         |       |       | 501794.46 |          | 512228.24

Thus a ratio estimate of the total number of fish during 1995 in the United States is

\hat{Y}_{rd} = \sum_{i \in s_2} w_{2i}y_i = 512228.24.

Case II. If q_{2i} = 1, then these weights become

w_{2i} = d_{2i} + d_{2i}x_i (\sum_{i \in s_2} d_{2i}x_i^2)^{-1} (\hat{X}^* - \sum_{i \in s_2} d_{2i}x_i).

Sr.No. | d_{2i}  | x_i   | y_i   | d_{2i}x_i^2   | w_{2i}   | w_{2i}y_i
1      | 11.4737 | 17741 | 16238 | 3611267688.00 | 14.73639 | 239289.40
2      | 16.4414 | 1008  | 859   | 16705514.65   | 16.70704 | 14351.35
3      | 14.3885 | 4707  | 4793  | 318789433.30  | 15.47406 | 74167.17
4      | 10.4936 | 18491 | 11567 | 3587941081.00 | 13.60373 | 157354.40
5      | 15.8588 | 1255  | 1375  | 24978006.47   | 16.17781 | 22244.49
Sum    |         |       |       | 7559681724.00 |          | 507406.80

Thus the generalized regression estimate (GREG) of the total number of fish in the United States during 1995 is given by

\hat{Y}_w = \sum_{i \in s_2} w_{2i}y_i = 507406.8.
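Both estimates can be reproduced from the tabulated second-phase data. The sketch below uses the rounded d_{2i} from the table, so the results differ slightly from the printed totals:

```python
import numpy as np

# Re-computing Example 6.1.6.1 from the tabulated second-phase data;
# small discrepancies with the printed totals come from the rounded d_2i.
d2 = np.array([11.4737, 16.4414, 14.3885, 10.4936, 15.8588])
x = np.array([17741., 1008., 4707., 18491., 1255.])    # catch in 1994
y = np.array([16238., 859., 4793., 11567., 1375.])     # catch in 1995
x_star = 622966.9                                      # first-phase estimate

# Case I (q2i = 1/xi): ratio estimator
y_ratio = np.sum(d2 * y) * x_star / np.sum(d2 * x)

# Case II (q2i = 1): GREG estimator
lam = (x_star - np.sum(d2 * x)) / np.sum(d2 * x**2)
w2 = d2 + lam * d2 * x
y_greg = np.sum(w2 * y)
print(round(y_ratio, 1), round(y_greg, 1))   # close to 512228.24 and 507406.8
```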

6.2 TWO-PHASE SAMPLING USING TWO AUXILIARY VARIABLES

In the case of two auxiliary variables, Rao (1972) and Singh (1984) have considered the following sampling strategies. Consider a finite population of size N. Let X and Z denote the two auxiliary variables and let Y denote the variable under study. Now when two auxiliary variables are used in a survey, there may be three possible sampling schemes, namely:

Scheme I. A first phase sample of size m is selected by SRSWOR for observing the auxiliary variables X and Z. A smaller sub-sample of size n for observing Y, X, and Z is selected from the first phase sample;

Scheme II. The first phase sample is selected as in Scheme I but the smaller sample of size n is selected independently from the whole population;

Scheme III. Often the auxiliary information may be collected by two different agencies and hence two independent preliminary samples of sizes m_1 and m_2 are selected for observing X and Z, and the small sample of size n is also selected independently from the population by SRSWOR. For simplicity we will consider m_1 = m_2 = m.
Before proceeding further let us define

\epsilon_0 = \bar{y}/\bar{Y} - 1, \epsilon_1 = \bar{x}/\bar{X} - 1, \epsilon_2 = \bar{x}^*/\bar{X} - 1, \epsilon_3 = \bar{z}/\bar{Z} - 1, and \epsilon_4 = \bar{z}^*/\bar{Z} - 1,

such that

E(\epsilon_j) = 0, j = 0, 1, 2, 3, 4.

The expected values of the squares and products of the \epsilon_j differ under Scheme I, Scheme II, and Scheme III, according as the samples involved are nested, the second phase sample is independent of the first, or all the samples are mutually independent.

Now we have the following theorem:

Theorem 6.2.1. The difference type and unbiased estimator of population mean \bar{Y} is given by

\bar{y}_1 = \bar{y} + \beta_1(\bar{x}^* - \bar{x}) + \beta_2(\bar{z}^* - \bar{z}).  (6.2.1)

Under Scheme I the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_I = (S_y^2/n)[1 - ((m - n)/m)(\rho_{xy}^2 + \rho_{yz}^2 - 2\rho_{xy}\rho_{yz}\rho_{xz})/(1 - \rho_{xz}^2)] - S_y^2/N.  (6.2.2)

Under Scheme II and as N \to \infty, the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_{II} = (S_y^2/n)[1 - (m/(n + m))(\rho_{xy}^2 + \rho_{yz}^2 - 2\rho_{xy}\rho_{yz}\rho_{xz})/(1 - \rho_{xz}^2)].  (6.2.3)

Under Scheme III and as N \to \infty, the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_{III} = (S_y^2/n)[1 - (m/(n + m))\{\rho_{xy}^2 + \rho_{yz}^2 - 2\rho_{xy}\rho_{yz}\rho_{xz}(m/(m + n))\}/\{1 - (m/(m + n))^2\rho_{xz}^2\}].  (6.2.4)
Proof. The estimator \bar{y}_1 in terms of \epsilon_j, j = 0, 1, 2, 3, 4 can easily be written as

\bar{y}_1 = \bar{Y}(1 + \epsilon_0) + \beta_1\bar{X}(\epsilon_2 - \epsilon_1) + \beta_2\bar{Z}(\epsilon_4 - \epsilon_3).  (6.2.5)

Taking expected values on both sides of (6.2.5) we have

E(\bar{y}_1) = \bar{Y},  (6.2.6)

which proves the unbiasedness.

By the definition of variance we have

V(\bar{y}_1) = E[\bar{y}_1 - \bar{Y}]^2 = E[\bar{Y}\epsilon_0 + \beta_1\bar{X}(\epsilon_2 - \epsilon_1) + \beta_2\bar{Z}(\epsilon_4 - \epsilon_3)]^2
= E[\bar{Y}^2\epsilon_0^2 + \beta_1^2\bar{X}^2(\epsilon_2^2 + \epsilon_1^2 - 2\epsilon_1\epsilon_2) + \beta_2^2\bar{Z}^2(\epsilon_4^2 + \epsilon_3^2 - 2\epsilon_3\epsilon_4)
+ 2\bar{Y}\bar{X}\beta_1(\epsilon_0\epsilon_2 - \epsilon_0\epsilon_1) + 2\beta_2\bar{Y}\bar{Z}(\epsilon_0\epsilon_4 - \epsilon_0\epsilon_3)
+ 2\beta_1\beta_2\bar{X}\bar{Z}(\epsilon_2\epsilon_4 - \epsilon_2\epsilon_3 - \epsilon_1\epsilon_4 + \epsilon_1\epsilon_3)].  (6.2.7)

Under Scheme I the variance at (6.2.7) reduces to

V(\bar{y}_1)_I = (1/n - 1/N)S_y^2 + (1/n - 1/m)[\beta_1^2 S_x^2 + \beta_2^2 S_z^2 - 2\beta_1 S_{xy} - 2\beta_2 S_{yz} + 2\beta_1\beta_2 S_{xz}].  (6.2.8)

On differentiating (6.2.8) with respect to \beta_1 and \beta_2, respectively, and equating to zero we have

\beta_1 S_x^2 + \beta_2 S_{xz} = S_{xy},  (6.2.9)
and
\beta_1 S_{xz} + \beta_2 S_z^2 = S_{yz}.  (6.2.10)

On solving (6.2.9) and (6.2.10) for \beta_1 and \beta_2 we have

\beta_1 = S_y(\rho_{xy} - \rho_{yz}\rho_{xz})/[S_x(1 - \rho_{xz}^2)], and \beta_2 = S_y(\rho_{yz} - \rho_{xz}\rho_{xy})/[S_z(1 - \rho_{xz}^2)].  (6.2.11)

On substituting \beta_1 and \beta_2 from (6.2.11) in (6.2.8), we have (6.2.2). Hence the first part of the theorem.

Under Scheme II and as N \to \infty, (6.2.7) becomes

V(\bar{y}_1)_{II} = S_y^2/n + (1/n + 1/m)(\beta_1^2 S_x^2 + \beta_2^2 S_z^2) - 2\beta_1 S_{xy}/n - 2\beta_2 S_{yz}/n + 2\beta_1\beta_2 S_{xz}(1/n + 1/m).  (6.2.12)

On differentiating (6.2.12) with respect to \beta_1 and \beta_2, respectively, equating to zero and solving for \beta_1 and \beta_2, we obtain

\beta_1 = (m/(m + n)) S_y(\rho_{xy} - \rho_{yz}\rho_{xz})/[S_x(1 - \rho_{xz}^2)], and \beta_2 = (m/(m + n)) S_y(\rho_{yz} - \rho_{xy}\rho_{xz})/[S_z(1 - \rho_{xz}^2)].  (6.2.13)

On substituting \beta_1 and \beta_2 from (6.2.13) in (6.2.12), we have (6.2.3). Hence the second part of the theorem.

Under Scheme III and as N \to \infty, (6.2.7) becomes

V(\bar{y}_1)_{III} = S_y^2/n + (1/m + 1/n)(\beta_1^2 S_x^2 + \beta_2^2 S_z^2) - (2/n)(\beta_1 S_{xy} + \beta_2 S_{yz} - \beta_1\beta_2 S_{xz}),  (6.2.14)

which in fact reduces to (6.2.4) for the optimum values of \beta_1 and \beta_2 obtained from (6.2.14). Hence the theorem.

The estimation procedure based on two-phase sampling schemes, when none of the population means of the auxiliary variables is known, has been considered by Khan and Tripathi (1967), Tripathi (1970, 1976, 1987), and Adhvaryu (1978). However, in many socio-economic and agricultural surveys, the population means (totals) of some of the auxiliary variables may be known while those of the others may not be readily available. For example, to estimate the total number of agricultural labourers in a rural county, the information about the area and population of the villages may be known from recent county records, while the information about the number of cultivators and cultivated areas of the villages in the county may not be readily available. The estimation of the population mean of a survey variable under partial knowledge of the auxiliary means has been considered by Singh (1969), Chand (1975), Kiregyera (1980, 1984), Mukerjee, Rao, and Vijayan (1987), and Srivastava, Khare, and Srivastava (1990). These estimators and their modifications are popular in survey sampling under the name 'Chain Ratio Type Estimators', which will be discussed in the next section, but let us first do an example.

Example 6.2.1. The season average per pound prices (in $) of the commercial apple crop in 36 different American states have been given in population 3.

Scheme I. Select a first phase sample of 10 units by SRSWOR for observing the auxiliary variables:
X_{1i} = Season average price per pound during 1995;
X_{2i} = Season average price per pound during 1994.
Select a second phase sample of 5 units from the first phase sample and observe:
Y_i = Season average price per pound during 1996;
X_{1i} = Season average price per pound during 1995;
X_{2i} = Season average price per pound during 1994.

Scheme II. Select the first phase sample of 10 units as in Scheme I and collect information on the two auxiliary variables. Select the second phase sample of 5 units independently from the whole population and collect information on all three variables.

Scheme III. Suppose the information on 10 units selected by SRSWOR sampling about the season average price per pound during 1994 is collected by a company XYZ. The information about the season average price per pound during 1995 on an independent sample of 10 units is collected by another company ABC in the United States. A small sample of 5 units is also selected independently from the population by SRSWOR sampling to collect information on the season average price per pound during 1996.

We wish to estimate the average season price per pound during 1996 by making proper use of the information under the different schemes. Which sampling scheme would you recommend for the future?
Solution. From the description of the population we have Y_i = season average price per pound during 1996, X_{1i} = season average price per pound during 1995, and X_{2i} = season average price per pound during 1994. Here Y = 7.317, X_1 = 6.683, X_2 = 5.9222, N = 36, \bar{Y} = 0.2033, \bar{X}_1 = 0.1856, \bar{X}_2 = 0.1645, S_y^2 = 0.15633, S_{x_1}^2 = 0.16406, S_{x_2}^2 = 0.17396, C_y^2 = 0.1563, C_{x_1}^2 = 0.1641, C_{x_2}^2 = 0.1739, \rho_{yx_1} = 0.8775, \rho_{yx_2} = 0.8759, and \rho_{x_1x_2} = 0.74135. Also we have m = 10, n = 5.

Under Scheme I the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_I \approx (S_y^2/n)[1 - ((m - n)/m)(\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2})/(1 - \rho_{x_1x_2}^2)]
= (0.15633/5)[1 - ((10 - 5)/10)(0.8775^2 + 0.8759^2 - 2 \times 0.8775 \times 0.8759 \times 0.7413)/(1 - 0.7413^2)]
= 0.017465.

Under Scheme II the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_{II} \approx (S_y^2/n)[1 - (m/(n + m))(\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2})/(1 - \rho_{x_1x_2}^2)]
= (0.15633/5)[1 - (10/(10 + 5))(0.8775^2 + 0.8759^2 - 2 \times 0.8775 \times 0.8759 \times 0.7413)/(1 - 0.7413^2)]
= 0.012865.

Under Scheme III the minimum variance of \bar{y}_1 is given by

Min.V(\bar{y}_1)_{III} \approx (S_y^2/n)[1 - (m/(n + m))\{\rho_{yx_1}^2 + \rho_{yx_2}^2 - 2\rho_{yx_1}\rho_{yx_2}\rho_{x_1x_2}(m/(m + n))\}/\{1 - (m/(m + n))^2\rho_{x_1x_2}^2\}]
= 0.021389.

In this situation the estimator of population mean under Scheme II has the minimum variance. Hence from the efficiency point of view Scheme II will be preferred.
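The Scheme I and Scheme II figures can be verified directly from (6.2.2) and (6.2.3), ignoring the finite population correction as the example does:

```python
# Re-computing the Scheme I and Scheme II minimum variances of Example 6.2.1
# from (6.2.2) and (6.2.3), ignoring the finite population correction.
N, m, n = 36, 10, 5
S_y2 = 0.15633
r1, r2, r12 = 0.8775, 0.8759, 0.7413     # rho_yx1, rho_yx2, rho_x1x2

D = (r1**2 + r2**2 - 2*r1*r2*r12) / (1 - r12**2)
v1 = S_y2/n * (1 - (m - n)/m * D)        # Scheme I
v2 = S_y2/n * (1 - m/(n + m) * D)        # Scheme II
print(round(v1, 6), round(v2, 6))        # 0.017465 and 0.012865
```

The only difference between the two schemes here is the multiplier on the common term D, and since m/(n + m) > (m - n)/m whenever n < m, Scheme II dominates Scheme I in this comparison.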

6.3 CHAIN RATIO TYPE ESTIMATORS

Following the notation of the previous section and assuming that the population mean \bar{Z} of the second auxiliary variable is known and \rho_{xy} > \rho_{yz} (variable Z is closely related to X; however, it is not as closely related to Y as X is related to Y), Chand (1975) proposed a chain ratio type estimator of population mean \bar{Y} as

\bar{y}_c = \bar{y}(\bar{x}^*/\bar{x})(\bar{Z}/\bar{z}^*).  (6.3.1)

Thus we have the following theorems:

Theorem 6.3.1. Under Scheme I the bias in the estimator \bar{y}_c, to the first order of approximation, is

B(\bar{y}_c) = \bar{Y}[(1/n - 1/m)(C_x^2 - \rho_{xy}C_xC_y) + (1/m - 1/N)(C_z^2 - \rho_{yz}C_yC_z)].  (6.3.2)

Proof. The estimator \bar{y}_c in terms of \epsilon_0, \epsilon_1, \epsilon_2, and \epsilon_4 can be written as

\bar{y}_c = \bar{Y}(1 + \epsilon_0)(1 + \epsilon_2)(1 + \epsilon_1)^{-1}(1 + \epsilon_4)^{-1}.  (6.3.3)

Taking expected values on both sides of (6.3.3) after expansion, and taking its deviation from the population mean \bar{Y}, we have (6.3.2). Hence the theorem.

Theorem 6.3.2. Under Scheme I the mean squared error of the estimator \bar{y}_c, to the first order of approximation, is

MSE(\bar{y}_c) = (1/n - 1/N)\bar{Y}^2[C_y^2 + C_x^2 - 2\rho_{xy}C_xC_y]
+ (1/m - 1/N)\bar{Y}^2[C_z^2 - C_x^2 + 2\rho_{xy}C_xC_y - 2\rho_{yz}C_yC_z].  (6.3.4)

Proof. We have

MSE(\bar{y}_c) = E[\bar{y}_c - \bar{Y}]^2 \approx E[\bar{Y}(\epsilon_0 - \epsilon_1 + \epsilon_2 - \epsilon_4)]^2
= \bar{Y}^2 E[\epsilon_0^2 + \epsilon_1^2 + \epsilon_2^2 + \epsilon_4^2 - 2\epsilon_0\epsilon_1 + 2\epsilon_0\epsilon_2 - 2\epsilon_0\epsilon_4 - 2\epsilon_1\epsilon_2 + 2\epsilon_1\epsilon_4 - 2\epsilon_2\epsilon_4].

On substituting the expected values and after simplification we obtain (6.3.4). Hence the theorem.
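As a small illustration of Theorems 6.3.1-6.3.2, the bias and MSE of the chain ratio estimator can be evaluated for assumed coefficients of variation and correlations; all values below are hypothetical.

```python
# Evaluating the bias and MSE (6.3.4) of the chain ratio estimator for
# assumed (hypothetical) coefficients of variation and correlations.
N, m, n = 50, 10, 5
Ybar = 100.0
Cy, Cx, Cz = 0.5, 0.4, 0.3
rxy, ryz = 0.7, 0.5

bias = Ybar*((1/n - 1/m)*(Cx**2 - rxy*Cx*Cy)
             + (1/m - 1/N)*(Cz**2 - ryz*Cy*Cz))
mse = ((1/n - 1/N)*Ybar**2*(Cy**2 + Cx**2 - 2*rxy*Cx*Cy)
       + (1/m - 1/N)*Ybar**2*(Cz**2 - Cx**2 + 2*rxy*Cx*Cy - 2*ryz*Cy*Cz))
print(bias, mse)
```

Note that the bias vanishes when C_x = \rho_{xy}C_y and C_z = \rho_{yz}C_y, i.e., when both ratio adjustments are, in effect, regressions through the origin.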

Khare and Srivastava (1998) and Tracy and Singh (1999) have considered the study of chain ratio type estimators in more detail by proposing a few classes of estimators of population mean.

6.4 CALIBRATION USING TWO AUXILIARY VARIABLES

In this case a first phase probability sample s_1 of size m is drawn from the population \Omega, using a sampling design that generates the selection probabilities \pi_{1i}. Given that sample s_1 has been drawn, the second phase sample s_2 (a subset of s_1) of size n is selected from s_1 using a sampling design with the selection probabilities \pi_{2i} = \pi_{i|s_1}. The first phase sampling weight of the i-th unit is denoted by d_{1i} = 1/\pi_{1i}, and the second phase sampling weight by d_{2i} = (\pi_{1i}\pi_{2i})^{-1}.
Table 6.4.1. Relationship between set of units and available data at different levels.

Set of units                                      | HT estimators                                       | Calibrated estimators
Population {i \in \Omega}                         | Z = \sum_{i=1}^{N} Z_i is known.                    |
First phase sample {(x_i, z_i), i \in s_1}        | \hat{X}^* = \sum_{i \in s_1} d_{1i}x_i,             | \hat{X}_c^* = \sum_{i \in s_1} w_{1i}x_i
                                                  | \hat{Z} = \sum_{i \in s_1} d_{1i}z_i                |
Second phase sample {(x_i, y_i, z_i), i \in s_2}  | \hat{Y} = \sum_{i \in s_2} d_{2i}y_i,               | \hat{Y}_c = \sum_{i \in s_2} w_{2i}y_i
                                                  | \hat{X} = \sum_{i \in s_2} d_{2i}x_i,               |
                                                  | \hat{Z} = \sum_{i \in s_2} d_{2i}z_i                |

Furthermore we assume that:

( a ) Z_i is known for all i \in \Omega (or Z = \sum_{i=1}^{N} Z_i is known) and z_i is observed for all i \in s_1;
( b ) x_i is observed for all i \in s_1;
( c ) y_i is observed for all i \in s_2.

Table 6.4.1 summarises our assumptions on the auxiliary information available for estimation.
Now we have the following theorems:

Theorem 6.4.1. Under the chi square (CS) type of distance function for the first phase data set defined as

D_1 = \sum_{i \in s_1} (w_{1i} - d_{1i})^2 (d_{1i}q_{1i})^{-1}  (6.4.1)

subject to the calibration equation

\sum_{i \in s_1} w_{1i}z_i = Z,  (6.4.2)

the first phase calibrated estimator

\hat{X}_c^* = \sum_{i \in s_1} w_{1i}x_i  (6.4.3)

becomes

\hat{X}_c^* = \sum_{i \in s_1} d_{1i}x_i + (\sum_{i \in s_1} d_{1i}q_{1i}x_iz_i / \sum_{i \in s_1} d_{1i}q_{1i}z_i^2)(Z - \sum_{i \in s_1} d_{1i}z_i).  (6.4.4)

Proof. Consider the Lagrange function L_1 given by

L_1 = \sum_{i \in s_1} (w_{1i} - d_{1i})^2 (d_{1i}q_{1i})^{-1} - 2\lambda_1(\sum_{i \in s_1} w_{1i}z_i - Z).  (6.4.5)

On differentiating (6.4.5) with respect to w_{1i} and equating to zero we have

w_{1i} = d_{1i} + \lambda_1 d_{1i}q_{1i}z_i.  (6.4.6)

On substituting (6.4.6) in (6.4.2) we have

\lambda_1 = (Z - \sum_{i \in s_1} d_{1i}z_i) / \sum_{i \in s_1} d_{1i}q_{1i}z_i^2.  (6.4.7)

On substituting (6.4.7) in (6.4.6) we obtain the optimal weights given by

w_{1i} = d_{1i} + (d_{1i}q_{1i}z_i / \sum_{i \in s_1} d_{1i}q_{1i}z_i^2)(Z - \sum_{i \in s_1} d_{1i}z_i).  (6.4.8)

On substituting (6.4.8) in (6.4.3) we have (6.4.4). Hence the theorem.

Theorem 6.4.2. Under the CS type of distance function for the second phase data set defined as

D_2 = \sum_{i \in s_2} (w_{2i} - d_{2i})^2 (d_{2i}q_{2i})^{-1}  (6.4.9)

subject to the calibration equation

\sum_{i \in s_2} w_{2i}x_i = \hat{X}_c^*,  (6.4.10)

the second phase calibrated estimator

\hat{Y}_c = \sum_{i \in s_2} w_{2i}y_i  (6.4.11)

becomes

\hat{Y}_c = \sum_{i \in s_2} d_{2i}y_i + \hat{\beta}_2(ds)[(\sum_{i \in s_1} d_{1i}x_i - \sum_{i \in s_2} d_{2i}x_i) + \hat{\beta}_1(ds)(Z - \sum_{i \in s_1} d_{1i}z_i)]  (6.4.12)

where \hat{\beta}_1(ds) = \sum_{i \in s_1} d_{1i}q_{1i}x_iz_i / \sum_{i \in s_1} d_{1i}q_{1i}z_i^2 and \hat{\beta}_2(ds) = \sum_{i \in s_2} d_{2i}q_{2i}x_iy_i / \sum_{i \in s_2} d_{2i}q_{2i}x_i^2.

Proof. The Lagrange function L_2 is

L_2 = \sum_{i \in s_2} (w_{2i} - d_{2i})^2 (d_{2i}q_{2i})^{-1} - 2\lambda_2(\sum_{i \in s_2} w_{2i}x_i - \hat{X}_c^*).  (6.4.13)

On differentiating (6.4.13) with respect to w_{2i} and equating to zero we have

w_{2i} = d_{2i} + \lambda_2 d_{2i}q_{2i}x_i.  (6.4.14)

On substituting (6.4.14) in (6.4.10) we have

\lambda_2 = (\hat{X}_c^* - \sum_{i \in s_2} d_{2i}x_i) / \sum_{i \in s_2} d_{2i}q_{2i}x_i^2,  (6.4.15)

which on substitution in (6.4.14) leads to the second phase optimal weights

w_{2i} = d_{2i} + (d_{2i}q_{2i}x_i / \sum_{i \in s_2} d_{2i}q_{2i}x_i^2)(\hat{X}_c^* - \sum_{i \in s_2} d_{2i}x_i).  (6.4.16)

Substituting (6.4.16) in (6.4.11), we have (6.4.12). Hence the theorem.
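The two-stage chain of Theorems 6.4.1-6.4.2 can be sketched end to end: calibrate the first phase weights to the known total Z, form the benchmark \hat{X}_c^*, then calibrate the second phase weights to it. The data below are made up (all numbers hypothetical), and the second phase design weights are simply assumed rather than derived from an actual subsampling design.

```python
import numpy as np

# A sketch of the two-phase calibration chain of Theorems 6.4.1-6.4.2
# on small made-up data (all numbers hypothetical).
rng = np.random.default_rng(0)
m1, n1 = 8, 4                       # first and second phase sizes
d1 = rng.uniform(2, 6, m1)          # first phase design weights
z1 = rng.uniform(10, 50, m1)        # z observed on s1
x1 = rng.uniform(5, 40, m1)         # x observed on s1
Z = 250.0                           # known population total of z

q1 = np.ones(m1)
w1 = d1 + d1*q1*z1/np.sum(d1*q1*z1**2)*(Z - np.sum(d1*z1))    # (6.4.8)
Xc_star = np.sum(w1*x1)                                       # (6.4.4)

s2 = np.arange(n1)                  # take the first n1 units as s2
d2 = d1[s2]*2.0                     # d2i = (pi_1i pi_2i)^(-1); assumed here
x2, y2 = x1[s2], rng.uniform(20, 80, n1)
q2 = np.ones(n1)
w2 = d2 + d2*q2*x2/np.sum(d2*q2*x2**2)*(Xc_star - np.sum(d2*x2))  # (6.4.16)
Yc = np.sum(w2*y2)                                                # (6.4.11)
```

By construction, \sum w_{1i}z_i = Z and \sum w_{2i}x_i = \hat{X}_c^* hold exactly, and \hat{Y}_c agrees with the closed form (6.4.12).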

Example 6.4.1. Select a preliminary large sample of 15 units by using the Midzuno-Sen sampling scheme, with the number of fish caught during 1992 in the United States as the selection variable, given in population 4 of the Appendix. Collect the information on the number of fish caught during 1993 and 1994 for the units selected in the sample. Assuming that the total number of fish caught during 1993 is known, derive the first phase calibration weights and hence estimate the total number of fish caught during 1994 in the United States. Select a second phase sample of 10 units from the given first phase sample by using the Midzuno-Sen sampling scheme. Collect the information on the number of fish caught during 1994 and 1995 for the units selected in the second phase sample. Derive the second phase calibration weights, and hence deduce the estimate of the number of fish caught during 1995 in the United States.
Solution. In the Midzuno-Sen sampling scheme the first unit is to be selected with probability proportional to the number of fish caught during 1992 and the remaining 14 units with SRSWOR sampling. We used the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select a random number 1 <= R_i <= 69 and another random number 1 <= R_j <= 28933 by using the 7th to 11th columns. The first effective pair of random numbers is (62, 01228). Thus

the unit at serial number 62 is included as the first unit in the preliminary large sample of 15 units. The remaining 14 units are selected by SRSWOR sampling from the remaining 58 units in the population. We used the 13th and 14th columns of the Pseudo-Random Numbers to draw 14 distinct random numbers between 1 and 58. The random numbers came in the sequence as 05, 34, 30, 55, 46, 07, 13, 19, 44, 25, 58, 68, 47, and 67.

" C,' First Il base samn lelnformatloh


Sr. PRN ' Species gr~up"
. 1992 P.,i , "I!iy I,dli l'! 1993 d1iz i , I'· ..Jd 1iZ?
.
No . No.
,
""'. C ,,/fi,"'" .c. " l!
*
I ~Xi ~, ,.t~ ..
1>1';'"
,'S." +."'" ,
I ~ ~i
, i'"
,. ll:-,'.
'", ~
1 62 Summer flounder 119 18 0.0408 0.2383 4. 1962 229 19 96 174.090 22042 14018.00
2 05 Herrings 28933 0.0991 0.2846 3.5137 34060 119676.800 4076 192604.00
3 34 Piafish 2955 0.0101 0.2139 4.6746 269 1 12579.350 33851040.24
4 30 Lane snapper 919 0.003 1 0.2083 4.7988 1079 5177.974 5587034.02
5 55 Cunner 1931 0.0066 0.2111 4.7362 1876 8885.269 16668764.30
6 46 Spot 14974 0.0513 0.2466 4.0547 14263 57833.490 824879029.40
7 07 Saltwater catfish 13466 0.046 1 0.2425 4.1233 12690 52325 .790 6640 14268.20
8 13 Searobi ns 4768 0.0163 0.2188 4.5692 7726 3530 1.980 272743125.60
9 19 Groupers 4661 0.0159 0.2185 4.5753 4236 19381.100 82098340. 12
10 44 Sand Seatrout 3780 0.0129 0.2161 4.6260 4068 18818.820 76554980 .01
11 25 Florida pompano 498 0.0017 0.2072 4.8253 641 3093.073 1982659.95
12 58 Atlantic macker el 1045 0.0035 0.2087 4.7909 2307 11052.800 25498800.66
13 68 Other fish 2 100 0.0071 0.2115 4.7259 1323 6252.4 88 8272041.08
14 47 King fish 3778 0.0129 0.216 1 4.6261 3304 15284.900 50501301.24
15 67 Puffers 1103 0.0037 0.2088 4.7873 999 4782.576 4777793 .03
,. .4·, . Sum 466620 .500 8347835800.00

In the abo ve table the values of Jrli are given by


N- m m-l
Jrli = N _ 1 it + N - 1 '

with N = 69, m= 15, and 11i = x;/ X· with X· = 291882 (given, the total number of
fish caught duri ng 1992) .

Thus an estima te of the total number of speci es caug ht duri ng 1993 is give n by
z= Idliz i = 466620.5 .
;Esl

For % = 1Vi and the total number offish during 19 93 are known to be Z = 3 16784 ,
the first phase calibration weig hts are given by

WI; = dli + (dliz;I ,IdIi Z1)(Z- ,I dl;Zi )


'I, Es l IESI

= d li + d1i zi (316784 - 466620.5).


8347835800
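The weight computation above can be sketched in Python (a reconstruction of the worked numbers, not code from the book; the data lists are transcribed from the table):

```python
# Sketch: the first phase calibration weights
# w_1i = d_1i + (d_1i z_i / sum d_1i z_i^2)(Z - sum d_1i z_i)
# for the Midzuno--Sen example, with N = 69, m = 15, X* = 291882, Z = 316784.
N, m, X_star, Z_total = 69, 15, 291882, 316784

x92 = [11918, 28933, 2955, 919, 1931, 14974, 13466, 4768,
       4661, 3780, 498, 1045, 2100, 3778, 1103]      # 1992 catches
z93 = [22919, 34060, 2691, 1079, 1876, 14263, 12690, 7726,
       4236, 4068, 641, 2307, 1323, 3304, 999]       # 1993 catches

p = [xi / X_star for xi in x92]                       # p_i = x_i / X*
pi1 = [(N - m) / (N - 1) * pi + (m - 1) / (N - 1) for pi in p]
d1 = [1 / v for v in pi1]                             # design weights d_1i

Z_hat = sum(d * z for d, z in zip(d1, z93))           # estimate of 1993 total
S = sum(d * z * z for d, z in zip(d1, z93))           # sum d_1i z_i^2
w1 = [d + d * z * (Z_total - Z_hat) / S for d, z in zip(d1, z93)]
```

Running this reproduces Ẑ ≈ 466620.5 and the w_1i column of the next table up to rounding.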
First phase calibration weights

Sr. No.  w_1i      1994 catch, x_i  w_1i x_i
1        2.470018  17741            43820.590
2        1.365609  38007            51902.720
3        4.448813  4918             21879.260
4        4.705924  1088             5120.045
5        4.576801  1255             5743.885
6        3.016730  18491            55782.350
7        3.184185  14441            45982.810
8        3.935604  4707             18524.890
9        4.227456  4583             19374.430
10       4.288281  5665             24293.110
11       4.769869  425              2027.194
12       4.592594  4860             22320.010
13       4.613765  1141             5264.306
14       4.351829  4805             20910.540
15       4.701520  918              4315.995
         Sum                        347262.100

Thus an estimate, based on the first phase sample, of the total number of fish caught
during 1994 in the US is given by

X̂ = Σ_{i∈s1} w_1i x_i = 347262.1.

Again applying the Midzuno--Sen sampling scheme to the units selected in the first
phase sample, we selected one unit with probability proportional to the number of
fish caught during 1992 and the remaining 9 units with SRSWOR sampling. We used
the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of
the Appendix to select a random number 1 ≤ R_i ≤ 15 and another random number
1 ≤ R_j ≤ 28933 by using the 7th to 11th columns. The first effective pair of random
numbers is (01, 07572). Thus the unit at serial number 01 in the given first phase
sample is included as the first unit in the second phase sample of n = 10 units. The
remaining 9 units are selected by SRSWOR sampling from the remaining 14 units
in the given first phase sample. We used the 13th and 14th columns of the Pseudo-
Random Numbers to draw 9 distinct random numbers between 1 and 14. The
random numbers came in the sequence as 05, 07, 13, 09, 06, 11, 12, 04, and 03.

Thus the second phase sample consists of the following information:



Species group      p_i    π_1i   p_2i   π_2i   d_2i  1994 catch  d_2i x_i  d_2i x_i²     1995 catch  w_2i  w_2i y_i
                                                     x_i                                 y_i
Summer flounder
Lane snapper       0.003  0.208  0.003  0.644  7.45  1088        8107.6    8821100.0     859         7.26  6243.62
Cunner             0.007  0.211  0.006  0.645  7.34  1255        9212.4    11561586.8    1375        7.13  9806.76
Spot               0.051  0.247  0.051  0.661  6.13  18491       113399.0  2096863507.0  11567       3.56  41266.91
Saltwater catfish  0.046  0.243  0.046  0.659  6.25  14441       90312.1   1304197188.0  13859       4.21  58361.07
Searobins          0.016  0.219  0.016  0.648  7.04  4707        33155.1   156061147.0   4793        6.29  30166.37
Sand Seatrout      0.013  0.216  0.013  0.647  7.14  5665        40474.7   229289141.0   4355        6.22  27128.08
Atlantic mackerel  0.004  0.209  0.003  0.644  7.43  4860        36147.9   175678928.0   4008        6.62  26533.76
Other fish         0.007  0.212  0.007  0.645  7.32  1141        8354.72   9532731.7                       6669.63
King fish          0.013  0.216  0.012  0.647  7.14  4805        34331.3   164961659.0                     27594.05
Sum                                                              486731.0  6165886793.0                    295822.10
In the above table we used

π_2i = ((m − n)/(m − 1)) p_2i + (n − 1)/(m − 1),  with p_2i = x_i/X*.

Under the chi square distance function the second phase calibration weights are

w_2i = d_2i + (d_2i Q_2i x_i / Σ_{i∈s2} d_2i Q_2i x_i²)(X̂ − Σ_{i∈s2} d_2i x_i),

where X̂ = Σ_{i=1}^{m} w_1i x_i.

If Q_2i = 1, the second phase calibration weights are

w_2i = d_2i + (d_2i x_i / Σ_{i∈s2} d_2i x_i²)(X̂ − Σ_{i∈s2} d_2i x_i)
     = d_2i + d_2i x_i (347262.1 − 486731)/6165886793.

Thus the regression estimate of the total number of fish caught in the United States
during 1995 is given by

Ŷ_w = Σ_{i∈s2} w_2i y_i = 295822.1.

6.5 ESTIMATION OF VARIANCE OF CALIBRATED ESTIMATOR IN TWO-PHASE SAMPLING FOR LOW AND HIGHER LEVEL CALIBRATION

In general, a calibrated estimator in two-phase sampling can be written as

Ŷ_c = Σ_{i∈s2} w_2i y_i.  (6.5.1)
Chapter 6.: Use of auxiliary information: Multi-Phase Sampling 561

Following Sarndal, Swensson, and Wretman (1992) the two required sets of
residuals are given by

e_1i = y_i − β̂ z_i  ∀ i ∈ s2  (6.5.2)
and
e_2i = y_i − β̂ x_i  ∀ i ∈ s2.  (6.5.3)

Using the concept of two-phase sampling, an unbiased estimator of V(Ŷ_c) is given by

v_u(Ŷ_c) = Σ_{i∈s2} Σ_{j∈s2} ((π_2ij − π_2i π_2j)/(π_2i π_2j π_2ij)) e_2i e_2j
         + (1/2) Σ_{i∈s2} Σ_{j∈s2} ((π_1ij − π_1i π_1j)/(π_1i π_1j π_1ij π_2ij)) e_1i e_1j.  (6.5.4)

Following Hidiroglou and Sarndal (1995, 1998), the low level calibrated estimator
of the variance of Ŷ_c is

v_B(Ŷ_c) = Σ_{i∈s2} Σ_{j∈s2} W_2ij w*_i w*_j e_2i e_2j
         + (1/2) Σ_{i∈s2} Σ_{j∈s2} W_1ij w_1i w_1j e_1i e_1j,  (6.5.5)

where W_2ij = (π_2ij − π_2i π_2j)(π_2ij)^{-1} and W_1ij = (π_1ij − π_1i π_1j)(π_1ij π_2ij)^{-1}.

A large number of estimators of variance can be shown as special cases of the
estimator considered at (6.5.5). For simplicity, if q_1i = 1/z_i and q_2i = 1/x_i then
(6.5.5) takes the form

v_ratio(Ŷ_c) = Σ_{i∈s2} Σ_{j∈s2} W_2ij e_2i e_2j [Σ_{i∈s1} d_1i x_i / Σ_{i∈s2} w_i x_i]^{g_1} [Z / Σ_{i∈s1} d_1i z_i]^{g_2}
            + Σ_{i∈s2} Σ_{j∈s2} W_1ij e_1i e_1j [Z / Σ_{i∈s1} d_1i z_i]^{g_3},

which is an analogue of the class of ratio type estimators proposed by Wu (1982)
for g_1 = g_2 = g_3 = 2.

Similarly from (6.5.5) a regression type estimator for q_1i = q_2i = 1 can be developed
for estimating the variance of the chain regression type estimators.
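The double-sum structure of (6.5.5) can be sketched as follows (a minimal illustration with caller-supplied inclusion probabilities and weights; the function name and plain-list representation are ours, not the book's):

```python
# Sketch: the low level calibrated variance estimator of (6.5.5) as a
# double sum over the second phase sample.  pi2/pi2j are the single and
# joint second phase inclusion probabilities, pi1/pi1j the first phase
# ones; e2, e1 are residuals and ws, w1 the calibrated/first phase weights.
def v_low(pi2, pi2j, pi1, pi1j, e2, e1, ws, w1):
    n = len(e2)
    W2 = [[(pi2j[i][j] - pi2[i] * pi2[j]) / pi2j[i][j]
           for j in range(n)] for i in range(n)]
    W1 = [[(pi1j[i][j] - pi1[i] * pi1[j]) / (pi1j[i][j] * pi2j[i][j])
           for j in range(n)] for i in range(n)]
    t1 = sum(W2[i][j] * ws[i] * ws[j] * e2[i] * e2[j]
             for i in range(n) for j in range(n))
    t2 = 0.5 * sum(W1[i][j] * w1[i] * w1[j] * e1[i] * e1[j]
                   for i in range(n) for j in range(n))
    return t1 + t2

# In a census both phases have unit inclusion probabilities, so every
# pi_ij - pi_i pi_j term vanishes and the estimator is exactly zero.
```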

Singh (2000b) suggested a higher order calibration estimator of the variance in two-
phase sampling as

v_ho(Ŷ_c) = Σ_{i∈s2} Σ_{j∈s2} Ω_2ij w*_i w*_j e_2i e_2j
          + (1/2) Σ_{i∈s2} Σ_{j∈s2} Ω_1ij w_1i w_1j e_1i e_1j,  (6.5.6)

where Ω_1ij and Ω_2ij are weights such that the distance between Ω_1ij and W_1ij,
and that between Ω_2ij and W_2ij, is minimum. Define two chi square type distance
functions as

D_1 = (1/2) Σ_{i∈s1} Σ_{j∈s1} (Ω_1ij − W_1ij)² (Q_1ij W_1ij)^{-1}  (6.5.7)

and

D_2 = (1/2) Σ_{i∈s2} Σ_{j∈s2} (Ω_2ij − W_2ij)² (Q_2ij W_2ij)^{-1}.  (6.5.8)
Also let us define the first calibration constraint as follows:

Σ_{i∈s1} Σ_{j∈s1} Ω_1ij d_1i d_1j z_i z_j = V(Ẑ),  (6.5.9)

where V(Ẑ) denotes the known variance of the estimator of the total of the cheaper
auxiliary character, z. The second calibration constraint is

Σ_{i∈s2} Σ_{j∈s2} Ω_2ij d_2i d_2j x_i x_j = v(X̂),  (6.5.10)

where

v(X̂) = Σ_{i∈s1} Σ_{j∈s1} ((π_1ij − π_1i π_1j)/(π_1ij π_1i π_1j)) x_i x_j.

Minimisation of (6.5.7) subject to (6.5.9) leads to second order calibrated weights
obtained from first phase sample information, given by

Ω_1ij = W_1ij + (Q_1ij W_1ij d_1i d_1j z_i z_j / Σ_{i∈s1} Σ_{j∈s1} Q_1ij W_1ij d_1i d_1j z_i z_j)
        × [V(Ẑ) − Σ_{i∈s1} Σ_{j∈s1} W_1ij d_1i d_1j z_i z_j].  (6.5.11)

Similarly, minimisation of (6.5.8) subject to (6.5.10) leads to second order
calibrated weights obtained from second phase sample information, given by

Ω_2ij = W_2ij + (Q_2ij W_2ij d_2i d_2j x_i x_j / Σ_{i∈s2} Σ_{j∈s2} Q_2ij W_2ij d_2i d_2j x_i x_j)
        × [v(X̂) − Σ_{i∈s2} Σ_{j∈s2} W_2ij d_2i d_2j x_i x_j].  (6.5.12)

Use of (6.5.11) and (6.5.12) in (6.5.6) forms the higher order calibration estimator
of variance. Several estimators can be shown as special cases of the higher order
calibration approach.

For example, if Q_1ij = 1/(d_1i d_1j x_1i x_1j) and Q_2ij = 1/(d_2i d_2j x_2i x_2j), then an
estimator of the variance of the chain ratio type estimator is obtained. Similarly,
higher order calibration estimators for estimating the variance of the chain
regression type estimators can be developed by choosing
q_1i = q_2i = Q_1i = Q_2i = 1 ∀ i and j.

6.6 TWO-PHASE SAMPLING USING MULTI-AUXILIARY VARIABLES

We have described the estimation methods of two-phase sampling through simple


examples. Complicated situations, with more than two auxiliary characters, have
been considered by Dupont (1995) and Hidiroglou and Sarndal (1995, 1998).
Dupont (1995) has used two different approaches for deriving the estimators: a
regression model assisted approach, which seeks to adapt the idea of the regression
estimator, and a calibration approach, which attempts to adapt the idea of
calibration. The estimators obtained by the two approaches may be linked together.
Hidiroglou and Sarndal (1995, 1998) considered multiplicative and additive type
weights obtained through the combination of first phase and second phase
calibration weights or selection weights. Tripathi and Ahmed (1995) have
considered a class of estimators for a finite population mean for the situations where
population means of some auxiliary variables are known while those of others are
unknown . Hidiroglou (1995) has applied the two-phase calibration approach in the
survey of employment, payrolls, and hours in a panel survey. Singh and Singh
(1997--98) considered the problem of estimating the ratio of two finite population
means using two auxiliary variables in double sampling. Dalabehera and Sahoo
(2000) proposed a class of unbiased estimators in two-phase sampling .

Chaudhuri and Adhikary (1983) showed the admissibility of an estimator in two-


phase sampling with varying probabilities following the original work by Rao and
Bellhouse (1978). Later Chaudhuri and Adhikary (1985) modified their results and
extended it for two-phase sampling by following Godambe and Joshi (1965). The
results for single-phase sampling discussed by Godambe (1969), Sekkappan (1973),
Sekkappan and Thompson (1975) and Ramakrishnan (1975b) can also be extended
for two-phase sampling. Chaudhuri and Adhikary (1985) have defined additional
concepts for deducing the single-phase Bayesian results of Meeden and Ghosh
(1981, 1983) and Vardeman and Meeden (1983) for two-phase sampling.

Suppose a first phase sample s_1 of size m is drawn from the population of N units
with sampling design p_1, and the values x_i (i ∈ s_1) of an auxiliary variable x are
ascertained. A sub-sample s_2 of size n is drawn from the first phase sample s_1
with design p_2, and the values y_i, x_i for i ∈ s_2 of the variables y and x are
observed. The resulting two-phase sample is s = (s_1, s_2), the two-phase sampling
design is p, and the sample selection probabilities are given by
p(s) = p_1(s_1) p_2(s_2 | s_1). Such a class of designs p will be denoted by D.
564 Advanced sampling theory with applications

The first phase inclusion probabilities will be denoted by P_j and the second phase
inclusion probabilities will be denoted by Q_j, and both are assumed to be positive.
Define

P_j(s_1) = Σ_{s_2 ∋ j} p_2(s_2 | s_1).

Let E_p denote the design expectation and V_p denote the design variance.

Let X = (X_1, X_2, ..., X_N), Y = (Y_1, Y_2, ..., Y_N), U_i = Y_i − X_i and
U = (U_1, U_2, ..., U_N), such that X, Y, and U are the totals of the X_i, Y_i, and U_i.
Also define x = {(i, x_i) | i ∈ s_1}, y = {(i, y_i) | i ∈ s_2} and
u = {(i, u_i = (y_i − x_i)) | i ∈ s_1}.

Thus we can define two sub-classes of the main class D as follows:


( a ) A sub-class D_1 can be defined for which the sizes of the first phase sample are
fixed at m;
( b ) Another sub-class D_2 can be defined for which the sizes of the second phase
sample are fixed at n < m.
Then we have the following results:
Result 6.7.1. A class C of design unbiased estimators of the population total, Y, is
given by

e_0 = Σ_{j∈s_2} U_j/Q_j + h_1(s, Y, X) + Σ_{j∈s_1} X_j/P_j + h_2(s_1, X)  (6.7.1)

based on a design in the restrictive class D_3 such that:

( a ) h_1 is free of y_i, x_i for i ∉ s_2 but may involve them and i when i ∈ s_2;
( b ) h_2 is free of x_i for i ∉ s_1 but may involve them and i when i ∈ s_1;
( c ) E_p2[h_1 + h_2] = 0 if p_1(s_1) > 0.

Result 6.7.2. Another admissible estimator of the population total in class C using
two-phase sampling is

e_1 = Σ_{j∈s_2} U_j/Q_j + Σ_{j∈s_1} X_j/P_j, based on any p in D_3.  (6.7.2)

Result 6.7.3. An admissible estimator within a wider class C_1 of p unbiased
estimators is given by

e_2 = e(u) + e(x),  (6.7.3)

where
( a ) e(u) is free of y_i, x_i for i ∉ s_2 but may involve them, in the form u_i, and i
for i ∈ s_2,
( b ) e(x) is free of x_i for i ∉ s_1 but may involve them and i for i ∈ s_1,
and
( c ) E_p{e(u)} = U and E_p{e(x)} = X.
Chapter 6: Use of auxiliary information: Multi-Phase Sampling 565

Result 6.7.4. Suppose a prior ξ = ξ(x) for Y may be postulated prior to second
phase sampling. Let E_ξ denote the prior mean operator and denote the posterior
mean by E_ξ(·|d), where d = (s, y, x). Then under a squared error loss function, a
Bayes estimator of the population total in two-phase sampling is given by

e_B = Σ_{i∈s_2} y_i + E_ξ[(Σ_{j∈s_1−s_2} Y_j + Σ_{j∈Ω−s_1} Y_j) | d].  (6.7.4)

Result 6.7.5. For every design p in class D the estimator

e_g = Σ_{i∈s_2} y_i + Σ_{i∈s_1−s_2} a_i + Σ_{i∈Ω−s_1} b_i  (6.7.5)

is admissible for Y with a_i, b_i arbitrary.

Mukerjee and Chaudhuri (1990) considered the problem of sampling a finite
population in two phases with varying probabilities, using a first phase sample
with known size measures and a second phase sub-sample drawn therefrom,
utilising, in addition, ascertained auxiliary variate values for the initial sample.

6.8 CONCEPT OF THREE-PHASE SAMPLING

Let (x_1#, x_2#, ..., x_l#) be the first phase sample s_1 (say) drawn by simple random
sampling from a population of N units, where only the auxiliary variable x is
measured. Let (x_1*, x_2*, ..., x_m*) be the second phase sample s_2 (say) drawn by
simple random sampling from the first phase sample s_1 units, where again only the
auxiliary variable x is measured. Let (y_1, y_2, ..., y_n) and (x_1, x_2, ..., x_n) denote the
third phase sample s_3 (say) drawn by simple random sampling from the second
phase sample, on which both the study variable y and the auxiliary variable x are
measured.
Let

ȳ = n⁻¹ Σ_{i=1}^{n} y_i,  x̄ = n⁻¹ Σ_{i=1}^{n} x_i,  x̄* = m⁻¹ Σ_{i=1}^{m} x_i*,  and  x̄# = l⁻¹ Σ_{i=1}^{l} x_i#.

Define

ε_0 = ȳ/Ȳ − 1,  ε_1 = x̄/X̄ − 1,  ε_2 = x̄*/X̄ − 1,  and  ε_3 = x̄#/X̄ − 1,

so that

E(ε_j) = 0, j = 0, 1, 2, 3.

Now we have the following theorem:

Theorem 6.8.1.

E(ε_0²) = (1/n − 1/N) C_y².  (6.8.1)

Proof. Let E_1, E_2, and E_3 denote the expected values over all possible first,
second, and third phase samples, respectively, and let V_1, V_2 and V_3 denote the
variances over all possible first, second, and third phase samples, respectively. Then
we have

E(ε_0²) = V(ε_0) = E_1E_2V_3(ε_0) + E_1V_2E_3(ε_0) + V_1E_2E_3(ε_0)
        = E_1E_2V_3(ȳ/Ȳ − 1) + E_1V_2E_3(ȳ/Ȳ − 1) + V_1E_2E_3(ȳ/Ȳ − 1)
        = (1/n − 1/m)C_y² + (1/m − 1/l)C_y² + (1/l − 1/N)C_y²
        = (1/n − 1/N)C_y².

Hence the theorem.
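The proof rests on the telescoping of the finite population correction factors across the three phases, which can be checked numerically (illustrative sizes n < m < l < N, not from the book):

```python
# The three conditional variance components of Theorem 6.8.1 telescope:
# (1/n - 1/m) + (1/m - 1/l) + (1/l - 1/N) = (1/n - 1/N).
n, m, l, N = 5, 10, 20, 100          # illustrative sample sizes
parts = [(1/n - 1/m), (1/m - 1/l), (1/l - 1/N)]
assert abs(sum(parts) - (1/n - 1/N)) < 1e-12
```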

Now we have the following corollary:

Corollary 6.8.1. One can easily prove that

E(ε_1²) = (1/n − 1/N)C_x²,  E(ε_2²) = (1/m − 1/N)C_x²,  and  E(ε_3²) = (1/l − 1/N)C_x².

Theorem 6.8.2. Under the concept of three-phase sampling,

E(ε_0 ε_1) = (1/n − 1/N) ρ_xy C_x C_y.  (6.8.2)

Proof. Let C_1, C_2 and C_3 denote the covariances over all possible first, second,
and third phase samples, respectively. Then

E(ε_0 ε_1) = Cov(ε_0, ε_1)
= E_1E_2{C_3(ε_0, ε_1)} + E_1C_2{E_3(ε_0), E_3(ε_1)} + C_1{E_2E_3(ε_0), E_2E_3(ε_1)}
= (1/n − 1/m)ρ_xy C_x C_y + (1/m − 1/l)ρ_xy C_x C_y + (1/l − 1/N)ρ_xy C_x C_y
= (1/n − 1/N)ρ_xy C_x C_y.

Hence the theorem.

Corollary 6.8.2. Following the above theorem we can easily prove that

E(ε_0 ε_2) = (1/m − 1/N)ρ_xy C_x C_y,  E(ε_0 ε_3) = (1/l − 1/N)ρ_xy C_x C_y,
E(ε_1 ε_2) = (1/m − 1/N)C_x²,  E(ε_1 ε_3) = (1/l − 1/N)C_x²,  and  E(ε_2 ε_3) = (1/l − 1/N)C_x².

Remark 6.8.1. Under three-phase sampling the following types of estimators can
be studied:

ŷ_t1 = ȳ (x̄*/x̄)^α (x̄#/x̄*)^β  and  ŷ_t2 = ȳ + α(x̄* − x̄) + β(x̄# − x̄*).

The bias and variance expressions for the above estimators can easily be obtained
using three-phase expected values.

Remark 6.8.2. Three-phase sampling can be extended to four-phase sampling
and so on. Such a sampling plan is called multi-phase sampling.

6.9 ESTIMATION OF VARIANCE OF THE REGRESSION ESTIMATOR IN TWO-PHASE SAMPLING

Suppose that a first phase sample s_1 of fixed size m is taken by SRSWOR from a
population of N units and the auxiliary variable x_i is observed for all i ∈ s_1. A
simple random sub-sample s_2 of size n is taken by SRSWOR from sample s_1, and
the variable of interest y_i and auxiliary variable x_i are observed for all i ∈ s_2. The
simple linear regression estimator for two-phase sampling is given by

ȳ_lr = ȳ + b(x̄* − x̄),  (6.9.1)

where ȳ = n⁻¹ Σ_{i=1}^{n} y_i and x̄ = n⁻¹ Σ_{i=1}^{n} x_i are the means for the second
phase sample s_2, x̄* = m⁻¹ Σ_{i=1}^{m} x_i is the mean for the first phase sample s_1,
and b = s_xy/s_x². The variance of the estimator (6.9.1) is given by
V(ȳ_lr) = (1/m − 1/N)S_y² + (1/n − 1/m)[S_y² + β²S_x² − 2βS_xy].  (6.9.2)

The variance expression in (6.9.2) can be easily written as

V(ȳ_lr) = (1/m − 1/N)S_y² + (1/n − 1/m)S_D²,  (6.9.3)

where

S_D² = (N − 1)⁻¹ Σ_{i=1}^{N} [(Y_i − Ȳ) − β(X_i − X̄)]² = (N − 1)⁻¹ Σ_{i=1}^{N} D_i²
with D_i = (Y_i − Ȳ) − β(X_i − X̄).

By the method of moments an obvious estimator of V(ȳ_lr) is given by

v(ȳ_lr) = (1/m − 1/N)s_y² + (1/n − 1/m)s_d²,  (6.9.4)

where

s_d² = (n − 1)⁻¹ Σ_{i=1}^{n} [(y_i − ȳ) − b(x_i − x̄)]² = (n − 1)⁻¹ Σ_{i=1}^{n} d_i²
with d_i = (y_i − ȳ) − b(x_i − x̄).

Following Sitter (1997) we have

S_y² = S_D² + β²S_x².  (6.9.5)
A sample analogue of (6.9.5) leads to

s_y² = s_d² + b²s_x².  (6.9.6)

On the basis of the relationship (6.9.6), Sitter (1997) considered the following two
estimators of the variance of the regression estimator in two-phase sampling:

v_1(ȳ_lr) = (1/n − 1/N)s_d² + (1/m − 1/N)b²s_x*²  (6.9.7)
and
v_0(ȳ_lr) = (1/n − 1/N)s_d² + (1/m − 1/N)b²s_x²,  (6.9.8)

where s_x*² denotes the sample variance of x in the first phase sample.

Sitter (1997) also considered the problem of estimation of the variance of the
regression estimator in two-phase sampling as follows. Defining

x̄*(j) = (m x̄* − x_j)/(m − 1) if j ∈ s_1,  (6.9.9)

x̄(j) = (n x̄ − x_j)/(n − 1) if j ∈ s_2,  and  x̄(j) = x̄ if j ∈ s_1 − s_2,  (6.9.10)

ȳ(j) = (n ȳ − y_j)/(n − 1) if j ∈ s_2,  and  ȳ(j) = ȳ if j ∈ s_1 − s_2,  (6.9.11)

and

b(j) = s_xy(j)/s_x²(j) if j ∈ s_2,  and  b(j) = b if j ∈ s_1 − s_2,  (6.9.12)

where s_xy(j) and s_x²(j) are computed from the second phase sample with unit j
removed.

The Jackknife variance estimator in two-phase sampling is given by

v_J(ȳ_lr) = ((m − 1)/m) Σ_{j=1}^{m} [ȳ_lr(j) − ȳ_lr]²,  (6.9.13)

where

ȳ_lr(j) = ȳ(j) + b(j)[x̄*(j) − x̄(j)]  (6.9.14)

denotes the j-th regression estimator of the population mean in two-phase sampling.
Due to the computer facilities now available, the problem of estimation of the
variance of parameters of interest following the Jackknife technique has become
quite popular. The usual ratio and product estimators in two-phase sampling have
been studied by Sengupta (1981b), and the stability of the Jackknife variance
estimator in ratio estimation has been considered by Krewski and Chakrabarty
(1981). Sitter (1997) also considered the problem of estimation of the variance of
the general linear regression estimator using two-phase sampling in the presence of
multi-auxiliary information. Fuller (1998) considered the problem of estimation of
the variance of the regression estimator for two-phase sampling. Given the
covariance matrix of the first phase control variables, a replication variance
estimator which uses only the second phase sample has been suggested.
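The jackknife recipe of (6.9.9)--(6.9.14) can be sketched as follows (our own minimal implementation, not code from the book; it assumes the first n entries of the first phase list form the second phase sample):

```python
# Sketch of the Jackknife variance estimator (6.9.13) for the two-phase
# regression estimator.  xs1 holds x for the whole first phase sample;
# the first len(ys2) entries form the second phase sample.
def jackknife_var(xs1, ys2):
    m, n = len(xs1), len(ys2)
    xs2 = xs1[:n]
    xbar_s = sum(xs1) / m
    xbar, ybar = sum(xs2) / n, sum(ys2) / n
    def slope(xv, yv):
        xm, ym = sum(xv) / len(xv), sum(yv) / len(yv)
        sxy = sum((x - xm) * (y - ym) for x, y in zip(xv, yv))
        sxx = sum((x - xm) ** 2 for x in xv)
        return sxy / sxx
    b = slope(xs2, ys2)
    ylr = ybar + b * (xbar_s - xbar)
    tot = 0.0
    for j in range(m):
        xbs_j = (m * xbar_s - xs1[j]) / (m - 1)            # (6.9.9)
        if j < n:                                          # j in s2
            xb_j = (n * xbar - xs1[j]) / (n - 1)           # (6.9.10)
            yb_j = (n * ybar - ys2[j]) / (n - 1)           # (6.9.11)
            b_j = slope(xs2[:j] + xs2[j+1:], ys2[:j] + ys2[j+1:])
        else:                                              # j in s1 - s2
            xb_j, yb_j, b_j = xbar, ybar, b
        tot += (yb_j + b_j * (xbs_j - xb_j) - ylr) ** 2    # (6.9.14)
    return (m - 1) / m * tot
```

A quick property check: when y is exactly linear in x, every leave-one-out slope equals b, and the replicates vary only through x̄*(j).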

Example 6.9.1. From population 1 in the Appendix, select a first phase sample of
10 units by SRSWOR sampling and note only the nonreal estate farm loans for the
units selected in the sample. From the selected 10 units select a sub-sample of 5
units and note the real estate farm loans as well as nonreal estate farm loans.
Estimate the average real estate farm loans by using the regression estimator.
Construct the 95% confidence intervals by using three different estimators of the
variance.

Solution. We used the first two columns of the Pseudo-Random Number (PRN)
Table 1 given in the Appendix to select an SRSWOR sample of m = 10 units. The
following 10 distinct random numbers 01, 23, 46, 04, 32, 47, 33, 05, 22, and 38
between 1 and 50 resulted in the following first phase sample.

First phase sample information

Sr. No.     1        2         3        4        5        6         7        8         9        10
Pop. units  01       23        46       04       32       47        33       05        22       38
States      AL       MN        VA       AR       NY       WA        NC       CA        MI       PA
x_i         348.334  2466.892  188.477  848.317  426.274  1228.607  494.730  3928.732  440.518  298.351

where x_i = nonreal estate farm loans. From the first phase sample, x̄* = 1066.9232
and s_x*² = 1470305.988.

We used the 7th and 8th columns of the Pseudo-Random Numbers to select a second
phase sample of n = 5 units from the above list of selected first phase sample units.
The following five distinct random numbers between 1 and 10 were observed: 07,
09, 01, 02 and 03. Thus the second phase sample consists of the following
information.
Second phase sample information

Sr.  First phase  Population   State and  Nonreal estate   Real estate
No.  unit number  unit number  Territory  farm loans, x_i  farm loans, y_i
1    07           33           NC         494.730          639.571
2    09           22           MI         440.518          327.028
3    01           01           AL         348.334          408.978
4    02           23           MN         2466.892         1354.768
5    03           46           VA         188.477          321.583
     Sum                                  3938.951         3051.928

From the second phase sample we have

Sr.  State and  (x_i − x̄)²  (y_i − ȳ)²   (y_i − ȳ)(x_i − x̄)  d_i       d_i²
No.  Territory
1    NC         85884.28    851.7875     −8553.079           160.338   25708.49
2    MI         120597.98   80291.5294   98402.217           −127.943  16369.36
3    AL         193121.75   40565.0213   88509.818           −4.737    22.45
4    MN         2819382.85  554105.1574  1249893.828         −7.0663   49.93
5    VA         359176.31   83406.9417   173083.210          −20.592   424.02
     Sum        3578163.18  759220.4376  1601335.995                   42574.26

Now

x̄ = n⁻¹ Σ x_i = 3938.951/5 = 787.7902,  ȳ = n⁻¹ Σ y_i = 3051.928/5 = 610.3856,

r̂ = ȳ/x̄ = 610.3856/787.7902 = 0.7748,  s_x² = (n − 1)⁻¹ Σ (x_i − x̄)² = 3578163.18/4 = 894540.795,

s_y² = (n − 1)⁻¹ Σ (y_i − ȳ)² = 759220.4/4 = 189805.1,

s_xy = (n − 1)⁻¹ Σ (y_i − ȳ)(x_i − x̄) = 1601335.995/4 = 400333.9988,

b = s_xy/s_x² = 0.4475,  and  s_d² = (n − 1)⁻¹ Σ d_i² = 42574.26/4 = 10643.56.
Thus the regression estimate of the real estate farm loans is:

ȳ_lr = ȳ + b(x̄* − x̄) = 610.3856 + 0.4475(1066.9232 − 787.7902) = 735.297.

Case I. We have

v_0(ȳ_lr) = (1/n − 1/N)s_d² + (1/m − 1/N)b²s_x²
         = (1/5 − 1/50) × 10643.56 + (1/10 − 1/50) × 0.4475² × 894540.795 = 16246.83.

Using Table 2 from the Appendix, the 95% confidence interval for the average real
estate farm loans in the United States during 1997 is

ȳ_lr ∓ t_{α/2}(df = n − 2)√v_0(ȳ_lr),  or  735.297 ∓ 3.182√16246.83,

or [329.70, 1140.89].
Case II. We have

VI(Ylr) = (~- ~}~ +(~ -~}2s;2


= (~_...!...J x 10643.56 + (-.!..._...!...J x0.44752 x1470305.988
5 50 10 50
= 25470.878.
The (1- a)1 00% and hence using Table 2 from the Appendix the 95% confidence
interval for the average real estate farm loans in the United States during 1997 is
given by
Ylr=Fta/2(df=n-2~vl(Ylr) , or 735.297=F3.182,J25470.878 or [227.46, 1243.13] .

Case III. We have

x_j       x̄*(j)    x̄(j)     ȳ(j)    y_j      b(j)   ȳ_lr(j)  [ȳ_lr(j) − ȳ_lr]²
494.730   1130.50  861.077  602.09  639.57   0.464  727.364  52.5628
440.518   1136.52  874.630  681.25  323.03   0.431  794.196  3549.9610
348.334   1146.77  897.676  659.74  408.98   0.447  771.165  1335.9420
2466.892  911.37   368.014  423.29  1354.77  0.717  812.953  6136.9170
188.477   1164.53  937.641  681.58  321.58   0.443  782.161  2260.6620
848.317   1091.21  787.810  609.58           0.448  745.485  118.1760
426.274   1138.12  787.810  609.58           0.448  766.489  1016.0420
1228.607  1048.96  787.810  609.58           0.448  726.558  64.8937
3928.732  748.950  787.810  609.58           0.448  592.177  20288.4100
298.351   1152.32  787.810  609.58           0.448  772.856  1462.4490
                                                    Sum      36286.0200

Thus the Jackknife estimate of the variance of the linear regression estimator ȳ_lr is

v_J(ȳ_lr) = ((m − 1)/m) Σ_{j=1}^{m} [ȳ_lr(j) − ȳ_lr]² = ((10 − 1)/10) × 36286.02 = 32657.42.

A (1 − α)100% confidence interval for the average real estate farm loans in the
United States during 1997 is given by

ȳ_lr ∓ t_{α/2}(df = n − 2)√v_J(ȳ_lr).

Using Table 2 from the Appendix, the 95% confidence interval for the average real
estate farm loans in the United States during 1997 is given by

735.297 ∓ 3.182√32657.42,  or [160.266, 1310.327].
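The point estimate and the Case I/Case II variance estimates above can be checked numerically (a sketch using the second phase data of this example; small rounding differences against the hand computation are expected):

```python
# Reproducing Example 6.9.1: second phase data from the tables above;
# first phase mean and variance as computed in the text.
x = [494.730, 440.518, 348.334, 2466.892, 188.477]
y = [639.571, 327.028, 408.978, 1354.768, 321.583]
n, m, N = 5, 10, 50
xbar_s, sx1_sq = 1066.9232, 1470305.988      # first phase mean, variance

xbar, ybar = sum(x) / n, sum(y) / n
sx_sq = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
b = sxy / sx_sq
sd_sq = sum(((yi - ybar) - b * (xi - xbar)) ** 2
            for xi, yi in zip(x, y)) / (n - 1)

ylr = ybar + b * (xbar_s - xbar)                          # regression estimate
v0 = (1/n - 1/N) * sd_sq + (1/m - 1/N) * b**2 * sx_sq     # Case I, (6.9.8)
v1 = (1/n - 1/N) * sd_sq + (1/m - 1/N) * b**2 * sx1_sq    # Case II, (6.9.7)
```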

6.10 TWO-PHASE SAMPLING USING PPSWR SAMPLING

Raj (1964) and Singh and Singh (1965) considered the problem of estimation of the
population mean using probability proportional to size and with replacement
(PPSWR) sampling for the second phase sample of n units drawn from a given first
phase sample of m units selected with SRSWOR sampling. A pictorial
representation of such a two-phase sampling scheme is shown in Figure 6.10.1.

Fig. 6.10.1 Two-phase strategy using PPSWR sampling: from the population of N
units, a first phase sample of m units is drawn with SRSWOR sampling and only the
auxiliary variable, x, is measured; from it, a second phase sample of n units is
drawn with PPSWR sampling and both the study variable, y, and the auxiliary
variable, x, are measured.


It means that the probability of selecting one unit in the first phase sample of size
m is given by 1/N, and that of selecting a unit in the second phase sample from the
given first phase sample is given by p_i = x_i/x*, where x* = Σ_{i=1}^{m} x_i.
Then we have the following theorem:

Theorem 6.10.1. An unbiased estimator of the population mean Ȳ under PPSWR
two-phase sampling is given by

ȳ_ppsd = (1/(mn)) Σ_{i=1}^{n} y_i/p_i.  (6.10.1)

Proof. Defining E_2 as the expected value for the given first phase sample and E_1
as the expected value over all possible first phase samples, we have

E(ȳ_ppsd) = E_1E_2[ȳ_ppsd | first phase] = E_1E_2[(1/(mn)) Σ_{i=1}^{n} y_i/p_i | first phase]
          = E_1[(1/m) Σ_{i=1}^{m} y_i] = Ȳ.

Hence the theorem.

Theorem 6.10.2. The variance of the estimator ȳ_ppsd is given by

V(ȳ_ppsd) = (1/m − 1/N)S_y² + ((m − 1)/(mnN(N − 1))) S_z²,  (6.10.2)

where

S_y² = (N − 1)⁻¹ Σ_{i=1}^{N} (Y_i − Ȳ)²  and  S_z² = Σ_{i=1}^{N} p_i (Z_i − Y)²  with Z_i = Y_i/p_i.

Proof. The variance of the estimator ȳ_ppsd is given by

V(ȳ_ppsd) = E_1V_2(ȳ_ppsd | first phase) + V_1E_2(ȳ_ppsd | first phase)
= E_1V_2((1/(mn)) Σ_{i=1}^{n} y_i/p_i | first phase) + V_1E_2((1/(mn)) Σ_{i=1}^{n} y_i/p_i | first phase)
= (m(m − 1)/(m²nN(N − 1))) Σ_{i=1}^{N} Σ_{j>i}^{N} p_i p_j (Y_i/p_i − Y_j/p_j)² + (1/m − 1/N)S_y²
= ((m − 1)/(mnN(N − 1))) Σ_{i=1}^{N} p_i (Y_i/p_i − Y)² + (1/m − 1/N)S_y²
= (1/m − 1/N)S_y² + ((m − 1)/(mnN(N − 1))) S_z².

Hence the theorem.

Theorem 6.10.3. An unbiased estimator of the variance of the estimator ȳ_ppsd is

v(ȳ_ppsd) = (1/m − 1/N)(m Σ_{i=1}^{n} p_i z_i² − s_z²)/(nm(m − 1)) + s_z²/m²,  (6.10.3)

where

s_z² = (n − 1)⁻¹ [Σ_{i=1}^{n} z_i² − n⁻¹(Σ_{i=1}^{n} z_i)²]  with z_i = y_i/(N p_i).

Proof. Obvious by taking expected values on both sides.
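Formulas (6.10.1) and (6.10.3) can be sketched as a small function (our own wrapper, not code from the book; with equal selection probabilities p_i = 1/m the estimator reduces to the plain mean of the sampled y values, which gives a quick check):

```python
# Sketch of the PPSWR two-phase estimator (6.10.1) and its variance
# estimator (6.10.3); y and p hold the values and selection
# probabilities for the n second phase draws.
def ppswr_two_phase(y, p, m, N):
    n = len(y)
    ybar_ppsd = sum(yi / pi for yi, pi in zip(y, p)) / (m * n)   # (6.10.1)
    z = [yi / (N * pi) for yi, pi in zip(y, p)]
    sz_sq = (sum(zi**2 for zi in z) - sum(z)**2 / n) / (n - 1)
    v = ((1/m - 1/N) * (m * sum(pi * zi**2 for pi, zi in zip(p, z)) - sz_sq)
         / (n * m * (m - 1)) + sz_sq / m**2)                     # (6.10.3)
    return ybar_ppsd, v
```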

Example 6.10.1. From population 1 given in the Appendix, select a first phase
sample of 10 units by SRSWOR sampling and note only the nonreal estate farm
loans for the units selected in the sample. From the selected 10 units select a sub-
sample of 5 units using PPSWR sampling and note the real estate farm loans as
well as nonreal estate farm loans. Estimate the average real estate farm loans by
using the PPSWR two-phase sampling estimator. Construct the 95% confidence
intervals.

Solution. We used the first two columns of the Pseudo-Random Number (PRN)
Table 1 given in the Appendix to select an SRSWOR sample of m = 10 units. The
following 10 distinct random numbers 01, 23, 46, 04, 32, 47, 33, 05, 22, and 38
between 1 and 50 resulted in the following first phase sample.

First phase sample information

Sr. No.     1        2         3        4        5        6         7        8         9        10
Pop. units  01       23        46       04       32       47        33       05        22       38
States      AL       MN        VA       AR       NY       WA        NC       CA        MI       PA
x_i         348.334  2466.892  188.477  848.317  426.274  1228.607  494.730  3928.732  440.518  298.351

where x_i = nonreal estate farm loans.

Sr.  Pop.  State  x_i       Cumulative  Random        Selected  p_i
No.  unit         total     numbers
1    01    AL     348.334   348.334     01473         S         0.032649
2    23    MN     2466.892  2815.226                            0.231226
3    46    VA     188.477   3003.703                            0.017651
4    04    AR     848.317   3852.020    03965         S         0.079511
5    32    NY     426.274   4278.294    04981, 05365  S, S      0.039954
6    47    WA     1228.607  5506.901                            0.115155
7    33    NC     494.730   6001.631    07673         S         0.046370
8    05    CA     3928.732  9930.363                            0.368233
9    22    MI     440.518   10370.881                           0.041289
10   38    PA     298.351   10669.232                           0.027964

We used the cumulative total method to select the second phase sample as follows.
The cumulative totals of the auxiliary variable, nonreal estate farm loans, are given
in the fifth column of the above table. We used the first five columns of the Pseudo-
Random Numbers given in Table 1 in the Appendix to select five random numbers
between 1 and 10,670. These random numbers came in the sequence as 01473,
04981, 05365, 03965, and 07673. These random numbers have been shown in the
sixth column of the above table. The seventh column shows the states selected in
the second phase sample with PPSWR sampling. Note that the state NY has been
selected twice. Now, from the ultimate sample selected in the second phase, we
have the following table:

z_i = y_i/(N p_i)  z_i²
...                ...
275.86             76096.26
Sum  956.66        211406.30

Thus an estimate of the average real estate farm loans is given by

ȳ_ppsd = (1/(mn)) Σ_{i=1}^{n} y_i/p_i = (1/(10 × 5)) × 47833.08 = 956.66.

Now we have

s_z² = [Σ z_i² − n⁻¹(Σ z_i)²]/(n − 1) = [211406.31 − (956.66)²/5]/(5 − 1) = 7091.66.

Thus an estimate of the variance of ȳ_ppsd is given by

v(ȳ_ppsd) = (1/m − 1/N)(m Σ_{i=1}^{n} p_i z_i² − s_z²)/(nm(m − 1)) + s_z²/m²
         = (1/10 − 1/50) × (10 × 10537.48 − 7091.659)/(10 × 5 × (10 − 1)) + 7091.659/10²
         = 88.389.

A (1 − α)100% confidence interval estimate of the population mean is given by

ȳ_ppsd ∓ t_{α/2}(df = n − 1)√v(ȳ_ppsd).

Using Table 2 from the Appendix, the 95% confidence interval is given by

956.66 ∓ 2.776√88.389,  or [930.56, 982.76].
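The cumulative total method used in this solution can be sketched with a binary search (illustrative data only, not the loan figures above):

```python
# Sketch of the cumulative total method: a random integer R in
# [1, total] selects the first unit whose cumulative total reaches R.
from bisect import bisect_left
from itertools import accumulate

def select_ppswr(x, draws):
    cum = list(accumulate(x))                 # cumulative totals
    return [bisect_left(cum, r) for r in draws]

# Illustrative sizes: totals 2, 5, 3 give cumulative totals 2, 7, 10;
# draws 1, 3, 10 select units 0, 1, 2 (repeats are allowed, as in PPSWR).
assert select_ppswr([2, 5, 3], [1, 3, 10]) == [0, 1, 2]
```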


The generalized linear regression (GREG) estimator is the most commonly used
estimator of the population total/mean in survey sampling. Let us consider the
simplest case of the GREG, where information on only one auxiliary variable is
available. Consider two populations Ω_t = {1_t, 2_t, .., i_t, .., N_t}, for t = 1, 2, from
which two independent probability samples s_t (s_t ⊂ Ω_t) are drawn with a given
sampling design, p_t(·). The inclusion probabilities π_it = Pr(i_t ∈ s_t) and
π_ij(t) = Pr(i_t and j_t ∈ s_t) are assumed to be strictly positive and known. Let y_it be
the value of the variable of interest, y, for the i-th population element, with which
is also associated an auxiliary variable x_it for the t-th population. For the elements
i_1 ∈ s_1, we observe (y_i1, x_i1, z_i1) in the main or first survey. In the second
independent survey, we observe (x_i2, z_i2). The population total of the auxiliary
variable, X_t = Σ_{i_t=1}^{N_t} x_it, t = 1, 2, is assumed to be accurately known in
both surveys. The objective is to estimate the population total Y = Σ_{i_1=1}^{N_1} y_i1
using auxiliary information available in both surveys.

6.11.1 COMMON VARIABLES USED FOR THE CALIBRATION OF WEIGHTS

Suppose that the two sample surveys have one variable in common. If the
population total of this variable is known, then it can be used as a control variable
in the GREG. Suppose the population total of this common variable is unknown.
Let it in the first survey, or sample s_1 of n_1 units, take values z_i1, i = 1, 2, ..., n_1.
In the second independent survey, or sample of n_2 units, it takes values z_i2,
i = 1, 2, ..., n_2. Define estimators of the unknown common total as
Ẑ_1 = Σ_{i∈s_1} d_i1 z_i1 and Ẑ_2 = Σ_{i∈s_2} d_i2 z_i2, where d_i2 = 1/π_i2, such that
E(Ẑ_t) = Z, the unknown total for the common variable z. Let us define
Ẑ = Σ_{t=1}^{2} a_t Ẑ_t, where Σ_{t=1}^{2} a_t = 1, an unbiased estimator of Z based on the
estimators obtained in both surveys. Let us consider a new estimator of the
population total, Y, as

Ŷ_p = Σ_{i_1∈s_1} r_i1 y_i1  (6.11.1)

with weights r_i1 as close as possible, for a given metric, to the w_i1, while
respecting the calibration equation

Σ_{i_1∈s_1} r_i1 z_i1 = Ẑ.  (6.11.2)

A simple case is the minimization of the chi square type distance function
$$\sum_{i_1 \in s_1} \frac{(r_{i_1} - w_{i_1})^2}{w_{i_1} h_{i_1}}, \qquad (6.11.3)$$
where $h_{i_1}$ are suitably chosen weights. By minimizing (6.11.3) subject to the calibration equation (6.11.2) we obtain the weights
$$r_{i_1} = w_{i_1} + \left( w_{i_1} h_{i_1} z_{i_1} \Big/ \sum_{i_1 \in s_1} w_{i_1} h_{i_1} z_{i_1}^2 \right) \left( \hat{Z} - \sum_{i_1 \in s_1} w_{i_1} z_{i_1} \right). \qquad (6.11.4)$$
Substitution of the value of $r_{i_1}$ from (6.11.4) in (6.11.1) leads to a new estimator of the population total from the first survey as
$$\hat{Y}_p = \sum_{i_1 \in s_1} d_{i_1} y_{i_1} + \hat{\beta}_{11}(X - \hat{X}_{HT}) + \hat{\beta}_{21}(\hat{Z} - \hat{Z}_{GREG}), \qquad (6.11.5)$$
where
$$\hat{X}_{HT} = \sum_{i_1 \in s_1} d_{i_1} x_{i_1} \quad \text{and} \quad \hat{Z}_{GREG} = \sum_{i_1 \in s_1} w_{i_1} z_{i_1}.$$
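The calibration step in (6.11.3)-(6.11.4) is easy to verify numerically. The sketch below (all data values, and the names `w`, `h`, `z`, `Z_total`, are illustrative, not from the book) computes the closed-form weights and, by construction, reproduces the control total exactly:

```python
# Chi-square calibration of weights, as in (6.11.3)-(6.11.4):
# minimize sum (r - w)^2 / (w h) subject to sum r*z = Z_total.
# The closed-form solution shifts each weight proportionally to w*h*z.

def calibrate(w, h, z, Z_total):
    swz = sum(wi * zi for wi, zi in zip(w, z))                     # sum w*z
    swhz2 = sum(wi * hi * zi * zi for wi, hi, zi in zip(w, h, z))  # sum w*h*z^2
    return [wi + (wi * hi * zi / swhz2) * (Z_total - swz)
            for wi, hi, zi in zip(w, h, z)]

w = [2.0, 3.0, 5.0, 4.0]    # starting (design or GREG) weights -- illustrative
h = [1.0, 1.0, 1.0, 1.0]    # chosen h weights
z = [10.0, 12.0, 7.0, 9.0]  # common variable values
Z_total = 130.0             # known or estimated total of z
r = calibrate(w, h, z, Z_total)
```

Whatever $h$ weights are chosen, the resulting $r$ satisfy the calibration equation (6.11.2) exactly.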

Now from Renssen and Nieuwenbroek (1997) we have the following corollary:

Corollary 6.11.1.1. An adjusted estimator of the population total, $Y$, is given by
$$\hat{Y}_{AD} = \sum_{i_1 \in s_1} d_{i_1} y_{i_1} + \hat{\beta}^*_{11}(X - \hat{X}_{HT}) + \hat{\beta}^*_{21}(\hat{Z} - \hat{Z}_{HT}), \qquad (6.11.6)$$
where $\hat{\beta}^*_{11}$ and $\hat{\beta}^*_{21}$ have their usual meanings. On comparing (6.11.5) with (6.11.6), one can easily see that (6.11.5) uses the regression type estimator $\hat{Z}_{GREG}$ for estimating the unknown total $Z$, whereas in (6.11.6) Renssen and Nieuwenbroek (1997) used the Horvitz and Thompson (1952) type estimator, given by $\hat{Z}_{HT}$.
6.11.2 ESTIMATION OF VARIANCE USING DUAL FRAME SURVEYS

Conditioning on the calibrated weights $r_{i_1}$ as fixed or given, an expression for the variance of the proposed estimator $\hat{Y}_p = \sum_{i_1 \in s_1} r_{i_1} y_{i_1}$ is given by (6.11.8).

Several estimators of variance can be shown to be special cases of this estimator. Let us consider the following case:

Case I. Suppose both surveys used the simple random sampling without replacement (SRSWOR) scheme for selecting the same number, $n$, of units. In this situation, $\pi_{i_t} = n/N$ and $\pi_{ij(t)} = n(n-1)/\{N(N-1)\}$. If we choose the weights $q_{i_t} = 1/x_{i_t}$ and $h_{i_t} = 1/z_{i_t}$, then the calibrated weights are $r_{i_t} = d_{i_t}(\hat{Z}/\hat{Z}_{HT})$ and the estimator of variance becomes
$$\hat{V}_0(\hat{Y}_{p(t)}) = \frac{N^2(1-f)}{n} \left[ \sum_{i \in s_t} \sum_{j \in s_t} \frac{D_{ij(t)}}{\pi_{ij(t)}} \left( r_{i_t} e_{i_t} - r_{j_t} e_{j_t} \right)^2 \right]. \qquad (6.11.9)$$

Hartley (1962, 1974) pointed out that a dual frame design can result in considerable cost savings over a single frame design of comparable precision. For example, if frame $\Omega_1$ is an area frame and frame $\Omega_2$ is a list frame, the frame $\Omega_1$ may be complete but expensive to sample, while the frame $\Omega_2$ may be incomplete but have a lower per-unit sampling cost. Bankier (1986), Fuller and Burmeister (1972), Kalton and Anderson (1986), Skinner (1991), and Skinner and Rao (1996) have considered the problem of estimation of the population total using dual frame surveys. Lohr and Rao (2000) considered several estimators of the population total in dual frame surveys and compared them under a unified setup. They also considered the problem of estimation of the variance of the estimator of the population total under a dual frame survey through the concept of jackknifing. Deville and Goga (2002) studied the Horvitz and Thompson estimator when information is gathered on two samples.

6.12 ESTIMATION OF MEDIAN USING TWO-PHASE SAMPLING

In two-phase sampling, in the first phase we select a preliminary large sample $s'$ of $n'$ units by simple random sampling without replacement (SRSWOR) and only the auxiliary character $X$ is measured. Let $\hat{M}'_x$ be the estimator of the median $M_x$ of the auxiliary character $X$ based on the first phase sample. In the second phase a sub-sample $s$ of $n$ units is drawn from the preliminary large sample by SRSWOR and both the study variable $Y$ and the auxiliary variable $X$ are measured. Let $\hat{M}_x$ and $\hat{M}_y$ denote the estimators of $M_x$ and $M_y$, respectively, based on the sample drawn at the second phase.

6.12.1 GENERAL CLASS OF ESTIMATORS

Following Srivastava and Jhajj (1981) and Srivastava (1971), Singh, Joarder, and Tracy (2001) proposed a general class of estimators for estimating the median as
$$\hat{M}_1 = H(\hat{M}_y, \hat{U}), \qquad (6.12.1)$$
where $\hat{U} = \hat{M}_x/\hat{M}'_x$. Whatever the sample chosen, $H(\hat{M}_y, \hat{U})$ assumes a value in a bounded closed convex subset $R_2$ of two-dimensional real space containing the point $(M_y, 1)$ such that $H(M_y, 1) = M_y$. Using a first order Taylor's series expansion around the point $(M_y, 1)$ we have
$$\hat{M}_1 = H(M_y, 1) + (\hat{M}_y - M_y)H_1(M_y, 1) + (\hat{U} - 1)H_2(M_y, 1) + o(n^{-1}), \qquad (6.12.2)$$
where $H_1(M_y, 1)$ and $H_2(M_y, 1)$ denote the first order partial derivatives of $H(\hat{M}_y, \hat{U})$ with respect to $\hat{M}_y$ and $\hat{U}$, respectively. Furthermore, under the assumptions that $H_1(M_y, 1) = 1$ and that $H_2(M_y, 1)$ is an unknown constant, we have the following theorems:

Theorem 6.12.1. The general class of estimators, $\hat{M}_1$, is asymptotically unbiased and normal.
Proof. Following Kuk and Mak (1989), we have
$$\hat{M}_y - M_y = \{f_y(M_y)\}^{-1}(\tfrac{1}{2} - \hat{P}_y) + o_p(n^{-0.5}),$$
$$\hat{M}_x - M_x = \{f_x(M_x)\}^{-1}(\tfrac{1}{2} - \hat{P}_x) + o_p(n^{-0.5}),$$
and
$$\hat{M}'_x - M_x = \{f_x(M_x)\}^{-1}(\tfrac{1}{2} - \hat{P}'_x) + o_p(n'^{-0.5}),$$
which proves the theorem.

Theorem 6.12.2. The minimum variance of $\hat{M}_1$, up to terms of order $o(n^{-1})$, is given by
$$\mathrm{Min.}V(\hat{M}_1) = \{f_y(M_y)\}^{-2}\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P_{11} - 0.25)^2\right], \qquad (6.12.3)$$
where $f_y(M_y)$ denotes the value of the marginal density $f_y(y)$ at the median value of $y$.
Proof. Follows from Chapter 3.

We also consider the following regression type estimator
$$\hat{M}_2 = \hat{M}_y + d(\hat{M}'_x - \hat{M}_x), \qquad (6.12.4)$$
where $d$ is a suitably chosen constant such that the variance of the proposed estimator $\hat{M}_2$ is minimum.

The minimum variance of the estimator $\hat{M}_2$ up to terms of order $o(n^{-1})$ is given by
$$\mathrm{Min.}V(\hat{M}_2) = \{f_y(M_y)\}^{-2}\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P_{11} - 0.25)^2\right] \qquad (6.12.5)$$
for the optimum value of $d$ given by
$$d = f_x(M_x)\{f_y(M_y)\}^{-1}(4P_{11} - 1). \qquad (6.12.6)$$
An estimator of the optimum value of $d$ is given by
$$\hat{d} = \hat{f}_x(\hat{M}_x)\{\hat{f}_y(\hat{M}_y)\}^{-1}(4\hat{P}_{11} - 1), \qquad (6.12.7)$$
where $\hat{f}_y(\hat{M}_y)$ and $\hat{f}_x(\hat{M}_x)$ may be obtained by following Silverman (1986). The regression type estimator after replacing $d$ with $\hat{d}$ becomes
$$\hat{M}_{2lr} = \hat{M}_y + \hat{d}(\hat{M}'_x - \hat{M}_x). \qquad (6.12.8)$$

Theorem 6.12.3. The estimators $\hat{M}_2$ and $\hat{M}_{2lr}$ have the same asymptotic variance because
$$E(\hat{d}) = d + O(n^{-1}).$$
Proof. Define
$$\delta_0 = \frac{\hat{f}_x(\hat{M}_x)}{f_x(M_x)} - 1, \quad \delta_1 = \frac{\hat{f}_y(\hat{M}_y)}{f_y(M_y)} - 1, \quad \text{and} \quad \delta_2 = \frac{4\hat{P}_{11} - 1}{4P_{11} - 1} - 1,$$
such that
$$E(\delta_0) = E(\delta_1) = E(\delta_2) \approx 0,$$
which proves the theorem.
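The regression-type estimator (6.12.4), with $d$ estimated as in (6.12.7), can be sketched numerically. A Gaussian kernel density estimate with a rule-of-thumb bandwidth is one concrete way to obtain $\hat{f}_x$ and $\hat{f}_y$ in the spirit of Silverman (1986); all data and names below are illustrative, not from the book:

```python
import math

def kde(sample, t):
    # Gaussian kernel density estimate at t, rule-of-thumb bandwidth 1.06*s*n^(-1/5)
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    b = 1.06 * sd * n ** (-0.2)
    return sum(math.exp(-0.5 * ((t - v) / b) ** 2)
               for v in sample) / (n * b * math.sqrt(2 * math.pi))

def median(sample):
    s = sorted(sample)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def m2_estimator(y, x, x_first):
    # Regression-type median estimator M2 = My + d_hat*(Mx' - Mx), eqs (6.12.7)-(6.12.8)
    My, Mx, Mxp = median(y), median(x), median(x_first)
    p11 = sum(1 for yi, xi in zip(y, x) if yi <= My and xi <= Mx) / len(y)
    d = kde(x, Mx) / kde(y, My) * (4 * p11 - 1)   # estimated optimum d
    return My + d * (Mxp - Mx)
```

When the first phase median of $x$ coincides with the second phase median, the correction vanishes and the estimator reduces to $\hat{M}_y$, as (6.12.8) requires.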
The estimators of the form
$$\hat{M}_3 = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right), \quad \hat{M}_4 = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)^d, \quad \text{and} \quad \hat{M}_5 = \hat{M}_y\left[\frac{d\hat{M}'_x + (1 - d)\hat{M}_x}{\hat{M}_x}\right],$$
which are in fact analogues of the ratio estimator and of the estimators proposed by Srivastava (1967) and Walsh (1968) in double sampling, respectively, are also special cases of the class of estimators $\hat{M}_1$.

Theorem 6.12.4. The general class of estimators, $\hat{M}_1$, is always more efficient than the ratio estimator $\hat{M}_3$.
Proof. If we set $V(\hat{M}_1) < V(\hat{M}_3)$, where $V(\hat{M}_3)$ denotes the variance of the ratio estimator $\hat{M}_3$, then it reduces to
$$\left[M_y\{2M_x f_x(M_x)\}^{-1} - 2(P_{11} - 0.25)\{f_y(M_y)\}^{-1}\right]^2 > 0, \qquad (6.12.9)$$
which is always true. Hence the theorem.

6.12.3 POSITION ESTIMATOR

Suppose $n'_x$ is the number of units in the second phase sample $s$ with $X \le \hat{M}'_x$. Thus if the values of $P_{ij}$ $(i, j = 1, 2)$ are known we can predict $P_y$ by (6.12.10), because $P_{\cdot j} = P_{1j} + P_{2j} \approx 0.5$, $j = 1, 2$. If we replace $P_{ij}$ by $\hat{P}_{ij}$ from the sample in (6.12.10) we obtain an estimator of $P_y$ as
$$\hat{p}'_1 = n^{-1}\left[n'_x \hat{P}_{11}/\hat{P}_{\cdot 1} + (n - n'_x)\hat{P}_{12}/\hat{P}_{\cdot 2}\right] \approx 2n^{-1}\left[n'_x \hat{P}_{11} + (n - n'_x)(0.5 - \hat{P}_{11})\right]. \qquad (6.12.11)$$
Then a position estimator in two-phase sampling is given by
$$\hat{M}_6 = \hat{Q}_y(\hat{p}'_1). \qquad (6.12.12)$$
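The position estimator can be sketched directly from (6.12.11)-(6.12.12): classify the second phase sample by the first phase median of $x$, build the adjusted proportion, and read off the corresponding $y$-quantile. The data and names below are illustrative, and the quantile is taken as the usual order statistic:

```python
import math

def quantile(sample, p):
    # left-continuous sample quantile Q(p) = inf{t : F_hat(t) >= p}
    s = sorted(sample)
    k = max(1, math.ceil(p * len(s)))
    return s[k - 1]

def position_estimator(y, x, mx_first):
    # mx_first: median of x estimated from the first phase sample
    n = len(y)
    my = quantile(y, 0.5)
    nx = sum(1 for xi in x if xi <= mx_first)              # n'_x in the text
    p11 = sum(1 for yi, xi in zip(y, x)
              if yi <= my and xi <= mx_first) / n
    p1 = (2.0 / n) * (nx * p11 + (n - nx) * (0.5 - p11))   # eq. (6.12.11)
    p1 = min(max(p1, 1.0 / n), 1.0)                        # keep inside (0, 1]
    return quantile(y, p1)                                 # M_6 = Q_y(p'_1)
```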

Thus we have the following theorems:

Theorem 6.12.5. The estimator $\hat{M}_6$ is asymptotically unbiased and normal.

Proof. We have
$$\hat{M}_6 - M_y = \{f_y(M_y)\}^{-1}(\hat{p}'_1 - P_y) + o_p(n^{-0.5}),$$
which proves the theorem.

Theorem 6.12.6. The variance of the estimator $\hat{M}_6$, up to terms of order $O(n^{-1})$, is
$$V(\hat{M}_6) = \{f_y(M_y)\}^{-2}\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P_{11} - 0.25)^2\right]. \qquad (6.12.13)$$

Proof. We have $\hat{M}_6 = \hat{Q}_y(\hat{p}'_1)$. Following Kuk and Mak (1989), we have
$$F_y(\hat{M}_6) = F_y[M_y + (\hat{M}_6 - M_y)] = F_y(M_y) + f_y(M_y)[\hat{M}_6 - M_y] + O_p(n^{-0.5}).$$
This implies
$$\hat{M}_6 - M_y = \{f_y(M_y)\}^{-1}[F_y(\hat{M}_6) - F_y(M_y)] + o_p(n^{-0.5}).$$
Also we have
$$\hat{F}_y(\hat{M}_6) - F_y(\hat{M}_6) = \hat{F}_y(M_y) - F_y(M_y) + o_p(n^{-0.5}).$$
So we have
$$\hat{M}_6 - M_y = \{f_y(M_y)\}^{-1}[F_y(\hat{M}_6) - F_y(M_y)] + o_p(n^{-0.5}) = \{f_y(M_y)\}^{-1}[\hat{p}'_1 - P_y] + o_p(n^{-0.5}).$$
Now we have
$$\hat{p}'_0 - P_y = 2[\hat{P}'_x P_{11} + (1 - \hat{P}'_x)(0.5 - P_{11})] - P_y = (4P_{11} - 1)(\hat{P}'_x - 0.5) - (P_y - 0.5).$$
Also we have
$$\hat{p}'_1 - \hat{p}'_0 = 2[\hat{P}'_x \hat{P}_{11} + (1 - \hat{P}'_x)(0.5 - \hat{P}_{11})] - 2[\hat{P}'_x P_{11} + (1 - \hat{P}'_x)(0.5 - P_{11})] = 2(\hat{P}_{11} - P_{11})(2\hat{P}'_x - 1).$$
Clearly $(\hat{p}'_1 - \hat{p}'_0) \to 0$ in probability, since $\hat{P}'_x \to 0.5$ in probability and $(\hat{P}_{11} - P_{11})$ is of order $O_p(n^{-0.5})$, so we may replace $\hat{p}'_1$ with $\hat{p}'_0$. The details of these assumptions are available in Kuk and Mak (1989), and we have
$$\hat{M}_6 - M_y = \{f_y(M_y)\}^{-1}(\hat{p}'_0 - P_y) + o_p(n^{-0.5}).$$
Hence $\hat{M}_6$ is approximately normal with mean $M_y$. To find the variance of the estimator $\hat{M}_6$, let $E_2$ and $E_1$ denote the expected values for the given first phase sample and over all possible first phase samples, respectively. Similarly defining the variances $V_2$ and $V_1$, we have
$$V(\hat{M}_6) = E_1 V_2(\hat{M}_6) + V_1 E_2(\hat{M}_6). \qquad (6.12.14)$$
Now we have
$$V_2(\hat{M}_6) = V_2(\hat{M}_6 - M_y) = \{f_y(M_y)\}^{-2} V_2(\hat{p}'_0 - P_y). \qquad (6.12.15)$$
Again
$$V_2(\hat{p}'_0 - P_y) = V_2\left[(4P^*_{11} - 1)(\hat{P}'_x - 0.5) - (\hat{P}_y - 0.5)\right]$$
$$= (4P^*_{11} - 1)^2 V(\hat{P}'_x) + V(\hat{P}_y) - 2(4P^*_{11} - 1)\,\mathrm{Cov}_2(\hat{P}'_x, \hat{P}_y)$$
$$= \left(\frac{1}{n} - \frac{1}{n'}\right)\frac{(4P^*_{11} - 1)^2}{4} + \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 2\left(\frac{1}{n} - \frac{1}{n'}\right)(4P^*_{11} - 1)(P^*_{11} - 0.25)$$
$$= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} + \left(\frac{1}{n} - \frac{1}{n'}\right)\left[\frac{(4P^*_{11} - 1)^2}{4} - 2(4P^*_{11} - 1)(P^*_{11} - 0.25)\right]$$
$$= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P^*_{11} - 0.25)^2, \qquad (6.12.16)$$
where $P^*_{11}$ is defined by a cross classification of the first phase sample using $\hat{M}'_x$ and $\hat{M}'_y$ as the cut-offs, and $\mathrm{Cov}_2$ denotes the covariance term for the given first phase sample. Using (6.12.16) in (6.12.15) and then using the result in (6.12.14), we have the theorem.

Suppose $\hat{F}_{y1}(y)$ and $\hat{F}_{y2}(y)$ denote the proportions of units in the second phase sample for which $X \le \hat{M}'_x$ and $X > \hat{M}'_x$, respectively, that have $y$ values less than or equal to $y$. Then $F_y(y)$ can be estimated by
$$\tilde{F}_y(y) = \frac{n'_x}{n'}\hat{F}_{y1}(y) + \left(1 - \frac{n'_x}{n'}\right)\hat{F}_{y2}(y), \qquad (6.12.17)$$
where $n'_x$ is the number of units in the preliminary large sample with $X \le \hat{M}'_x$. Then we may consider an analogue of the stratification estimator in two-phase sampling, which is of the form
$$\hat{M}_7 = \tilde{F}_y^{-1}(0.5). \qquad (6.12.18)$$
The estimator $\hat{M}_7$ is asymptotically unbiased and normal. Its variance is given in the following theorem:

Theorem 6.12.7. The variance of the estimator $\hat{M}_7$, up to terms of order $O(n^{-1})$, is
$$V(\hat{M}_7) = \{f_y(M_y)\}^{-2}\left[\left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P^*_{11} - 0.25)^2\right]. \qquad (6.12.19)$$

Proof. We have
$$F_y(\hat{M}_7) = F_y[M_y + (\hat{M}_7 - M_y)] = F_y(M_y) + f_y(M_y)[\hat{M}_7 - M_y] + O_p(n^{-0.5}).$$
This implies
$$\hat{M}_7 - M_y = \{f_y(M_y)\}^{-1}[F_y(\hat{M}_7) - F_y(M_y)] + o_p(n^{-0.5}) = \{f_y(M_y)\}^{-1}[0.5 - \hat{P}_y] + o_p(n^{-0.5}).$$
For large $N$ we have $\hat{F}_y(M_y) \approx \hat{F}_{y1}(M_y) + \hat{F}_{y2}(M_y)$, so that
$$V[\hat{F}_y(M_y)] = E_1 V_2[\hat{F}_y(M_y)] + V_1 E_2[\hat{F}_y(M_y)]. \qquad (6.12.20)$$
Now we have
$$V_2[\hat{F}_y(M_y)] = V_2[\hat{F}_{y1}(M_y)] + V_2[\hat{F}_{y2}(M_y)]$$
$$= 0.5\left[\left(\frac{1}{n} - \frac{1}{n'}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P^*_{11} - 0.25)^2\right] + 0.5\left[\left(\frac{1}{n} - \frac{1}{n'}\right)\frac{1}{4} - 4\left(\frac{1}{n} - \frac{1}{n'}\right)(P^*_{11} - 0.25)^2\right].$$
On substituting this in (6.12.20) we have the theorem.

The regression type estimator, the position estimator, the stratification estimator, and the general class of estimators attain the same minimum variance to the first order of approximation. An estimator of the variance of the estimators $\hat{M}_i$, $i = 1, 2, 4, 5, 6, 7$, is obtained by replacing the unknown quantities in this common minimum variance expression with their sample estimates.

6.12.5 OPTIMUM FIRST AND SECOND PHASE SAMPLES FOR MEDIAN ESTIMATION

We obtain the optimum first phase and second phase sample sizes for the fixed cost as well as for the fixed variance cases.

6.12.5.1 COST IS FIXED

Let $C_1$ and $C_2$ denote the cost per unit in the second phase and first phase, respectively; then the fixed cost $C$ is given by
$$C = nC_1 + n'C_2. \qquad (6.12.21)$$
The variance of $\hat{M}_i$, $i = 1, 2, 4, 5, 6, 7$, will be minimum for the fixed cost $C$ given by (6.12.21) if the optimum values of $n$ and $n'$ are, respectively, given by
$$n = \frac{C\sqrt{2P_{11}(1 - 2P_{11})/C_1}}{\sqrt{2P_{11}(1 - 2P_{11})C_1} + 2|P_{11} - 0.25|\sqrt{C_2}} \quad \text{and} \quad n' = \frac{2C|P_{11} - 0.25|/\sqrt{C_2}}{\sqrt{2P_{11}(1 - 2P_{11})C_1} + 2|P_{11} - 0.25|\sqrt{C_2}}.$$

6.12.5.2 VARIANCE IS FIXED

Also a fixed variance $V_0$ can be achieved by $\hat{M}_i$ for $i = 1, 2, 4, 5, 6, 7$ with the minimum cost function (6.12.21) if the optimum values of $n$ and $n'$ are, respectively,
$$n = \frac{\{f_y(M_y)\}^{-2}\sqrt{2P_{11}(1 - 2P_{11})}\left[\sqrt{2P_{11}(1 - 2P_{11})C_1} + 2|P_{11} - 0.25|\sqrt{C_2}\right]}{\sqrt{C_1}\left[V_0 + \{f_y(M_y)\}^{-2}/(4N)\right]}, \qquad (6.12.24)$$
and
$$n' = \frac{2|P_{11} - 0.25|\{f_y(M_y)\}^{-2}\left[\sqrt{2P_{11}(1 - 2P_{11})C_1} + 2|P_{11} - 0.25|\sqrt{C_2}\right]}{\sqrt{C_2}\left[V_0 + \{f_y(M_y)\}^{-2}/(4N)\right]}. \qquad (6.12.25)$$
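Writing the minimum variance as $\{f_y(M_y)\}^{-2}(A/n + B/n' - 1/(4N))$ with $A = 2P_{11}(1 - 2P_{11})$ and $B = 4(P_{11} - 0.25)^2$, the fixed-cost allocation follows from a standard Lagrange argument. A numeric sketch (the cost figures are illustrative, not from the book):

```python
import math

def optimum_sizes_fixed_cost(p11, c1, c2, total_cost):
    # A and B are the coefficients of 1/n and 1/n' in the minimum variance,
    # apart from the common factor {f_y(M_y)}^-2.
    A = 2 * p11 * (1 - 2 * p11)
    B = 4 * (p11 - 0.25) ** 2
    denom = math.sqrt(A * c1) + math.sqrt(B * c2)
    n = total_cost * math.sqrt(A / c1) / denom        # second phase size
    n_prime = total_cost * math.sqrt(B / c2) / denom  # first phase size
    return n, n_prime
```

Any reallocation of the same budget between the two phases can only increase $A/n + B/n'$, which is easy to check numerically.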

Kuk and Mak (1994) proposed another ingenious method which in double sampling can also be extended for estimating the finite population distribution function of $Y$, defined as
$$F(y) = N^{-1}\sum_{i=1}^{N}\Delta(y - y_i),$$
where $\Delta(a) = 1$ when $a \ge 0$ and $\Delta(a) = 0$ otherwise. Note that $F$ simply puts probability $N^{-1}$ at each $y_i$, $i = 1, 2, ..., N$, assuming them to be distinct. The naive estimator of $F$ is the sample distribution function defined as
$$\hat{F}(y) = n^{-1}\sum_{i \in s}\Delta(y - y_i), \qquad (6.12.26)$$
which puts the probability mass $n^{-1}$ at each sample point. Before we introduce a new estimator in double sampling, let us first review the definition of the customary ratio estimator of the population mean of $Y$ in double sampling, given by
$$\bar{y}_r = \left(\frac{\bar{x}'}{\bar{x}}\right)\bar{y}, \qquad (6.12.27)$$
where $\bar{y}$ and $\bar{x}$ are the second phase sample means of $y$ and $x$, respectively, and $\bar{x}'$ is the first phase sample mean of the $x$ variable. Also $h(a) = (\bar{x}'/\bar{x})a$ is a function satisfying $h(\bar{x}) = \bar{x}'$ for the given first phase sample. Let $G$ denote the population distribution function of the auxiliary variable $x$. Also let $\hat{G}'$ and $\hat{G}$ be the sample distribution functions of the auxiliary variable $X$ based on the first and second phase sample information, respectively. An obvious choice of $h(\cdot)$ is
$$h(\cdot) = \hat{G}' \circ \hat{G}^{-1}(\cdot), \qquad (6.12.28)$$
where $\hat{G}^{-1}$ denotes the inverse of the function $\hat{G}$ and $\circ$ denotes the composition of two functions. Based on this information, define an estimator of $F(y)$ in double sampling as
$$\hat{F}_1(y) = \hat{G}' \circ \hat{G}^{-1} \circ \hat{F}(y) = \hat{G}'[\hat{G}^{-1}\{\hat{F}(y)\}]. \qquad (6.12.29)$$
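The composition in (6.12.29) is concrete once empirical distribution functions and their generalized inverses are written out; a sketch (all samples are illustrative):

```python
import bisect
import math

def ecdf(sample):
    # empirical distribution function t -> #{v <= t}/n
    s = sorted(sample)
    return lambda t: bisect.bisect_right(s, t) / len(s)

def quantile(sample, p):
    # generalized inverse G^{-1}(p) = inf{t : G(t) >= p}
    s = sorted(sample)
    k = max(1, math.ceil(p * len(s)))
    return s[k - 1]

def f1_hat(y, y2, x2, x1):
    # F1(y) = G'(G^{-1}(F(y))): second-phase samples y2 and x2, first-phase x1
    a = ecdf(y2)(y)          # F-hat(y) from the second phase
    if a == 0:
        return 0.0
    t = quantile(x2, a)      # G-hat^{-1}(a) from the second phase
    return ecdf(x1)(t)       # G'-hat at that point, from the first phase
```

When the two $x$ samples coincide, the composition is the identity and $\hat{F}_1$ reduces to the naive estimator $\hat{F}$.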
Let the variables $y_i$ and $x_i$ be related to each other through a model defined as
$$y^*_{ij} = \hat{B}x_i + v(x_i)\hat{e}_j, \quad j \in s, \ i \in s' \subset \Omega, \qquad (6.12.30)$$
where $s'$ and $s$ denote the first phase and second phase samples from the population $\Omega$. Here the aim is to redistribute the point mass at each $x_i$ of the given first phase sample evenly to the $n$ points. Also
$$\hat{B} = \left\{\sum_{i \in s}\frac{x_i y_i}{v(x_i)}\right\}\Big/\left\{\sum_{i \in s}\frac{x_i^2}{v(x_i)}\right\}$$
is the weighted least squares estimator of the regression coefficient. After the redistribution of mass based on a given first phase sample, we obtain $\hat{Q}'$ and $\hat{Q}$ as
$$\hat{Q}'(y) = (n'n)^{-1}\sum_{i \in s'}\sum_{j \in s}\Delta(y - y^*_{ij}) = n'^{-1}\sum_{i \in s'}\hat{P}_i(y), \qquad (6.12.31)$$
$$\hat{Q}(y) = n^{-2}\sum_{i \in s}\sum_{j \in s}\Delta(y - y^*_{ij}) = n^{-1}\sum_{i \in s}\hat{P}_i(y), \qquad (6.12.32)$$
where $\hat{P}_i(y) = n^{-1}\sum_{j \in s}\Delta(y - y^*_{ij})$ is an estimator of $\Pr[y^* \le y \mid x^* = x_i]$ under the model $y_i = \beta x_i + v(x_i)\varepsilon_i$ for a given first phase sample.

Replacing $\hat{G}'$ and $\hat{G}^{-1}$ with $\hat{Q}'$ and $\hat{Q}^{-1}$, we obtain another estimator of $F(y)$ as
$$\hat{F}_{11}(y) = \hat{Q}' \circ \hat{Q}^{-1} \circ \hat{F}(y). \qquad (6.12.33)$$

Now we have the following theorem:

Theorem 6.12.8. The estimator $\hat{F}_{11}(y)$ is asymptotically normal with mean $F(y)$ and variance given by
$$V[\hat{F}_{11}(y)] = \left(\frac{1}{n'} - \frac{1}{N}\right)F(y)[1 - F(y)]$$
$$+ \left(\frac{1}{n} - \frac{1}{n'}\right)\left[F(y)\{1 - F(y)\} + G(x)\{1 - G(x)\} - 2\{H(x, y) - F(y)G(x)\}\right], \qquad (6.12.34)$$
where $x = G^{-1}(a)$ with $a = F(y)$, and $H(x, y)$ denotes the finite population joint distribution function of $x$ and $y$.

Proof. Its proof follows from the relation
$$\hat{F}_{11}(y) - F(y) \approx \{\hat{F}(y) - F(y)\} - \{\hat{G}(x) - G(x)\} + \{\hat{G}'(x) - G(x)\}. \qquad (6.12.35)$$

Chen and Qin (1993) suggested an interesting empirical likelihood method for quantile estimation in the presence of auxiliary information. Following their notation, they considered the problem of estimation of $\theta = E\{T(X, Y)\}$ subject to the constraint $0 = E\{w(X)\}$. An estimator of the population parameter $\theta$ based on the method of empirical likelihood is given by
$$\hat{\theta}_n = \frac{1}{n}\sum_{i \in s}T(x_i, y_i)/\{1 + \hat{\lambda}w(x_i)\} \qquad (6.12.36)$$
with $\hat{\lambda}$ satisfying $\sum_{i \in s}w(x_i)/\{1 + \hat{\lambda}w(x_i)\} = 0$ for some suitable weights $w(x_i)$, $i = 1, 2, ..., n$. A few special cases of the estimator (6.12.36) are as follows:

Case I. If $w(x_i) = x_i - \bar{x}'$, where $\bar{x} = n^{-1}\sum_{i \in s}x_i$ and $\bar{x}' = n'^{-1}\sum_{i \in s'}x_i$ are the second phase and first phase sample means, respectively, of the auxiliary character $X$, then the estimator (6.12.36) is an analogue of the Hartley and Rao (1968) estimator in double sampling.
Case II. If $w(x_i) = I[x_i \le \hat{M}'_x] - 0.5$ and $\theta$ is the median of $X$, then the estimator (6.12.36) leads to the position estimator in double sampling.
Case III. If $w(x_i) = x_i - \bar{x}'$, where $x_i$ is a 0--1 variable and $\bar{x}'$ is the estimator of the population proportion $\bar{X}$ (say) based on the first phase sample information, then the estimator (6.12.36) leads to an analogue of the post-stratified estimator of Silva and Skinner (1995) in double sampling.
Case IV. The ratio type estimator proposed by Garcia and Cebrian (1998) is also a special case of it.

Example 6.12.7.1. The amounts of the real and nonreal estate farm loans (in $000) during 1997 in the United States are presented in population 1 given in the Appendix. Suppose we selected an SRSWOR first phase sample of ten states to collect information on the nonreal estate farm loans only. From the given first phase sample of ten units, we selected a second phase sample of eight states on which both the real and nonreal estate farm loans were observed. Find the relative efficiency of the ratio estimator of the median, for estimating the median of the amount of the real estate farm loans during 1997 by using information on the nonreal estate farm loans during 1997 collected in the first phase sample, with respect to the usual estimator of the population median. Assume that both the real and nonreal estate farm loans follow independent normal distributions.

Solution. From the description of the population, we have $y_i$ = amount (in $000) of the real estate farm loans in different states during 1997, $x_i$ = amount (in $000) of the nonreal estate farm loans in different states during 1997, $N = 50$, $M_y = 322.305$, $M_x = 452.517$, $\mu_y = 555.434$, $\mu_x = 878.162$, $\sigma_x = 1073.776$, $\sigma_y = 578.948$, and $P_{11} = 0.42$.

Note that we are given
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}e^{-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2} \quad \text{and} \quad f(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}e^{-\frac{1}{2}\left(\frac{y - \mu_y}{\sigma_y}\right)^2},$$
which implies that
$$f(M_x) = \frac{e^{-\frac{1}{2}\left(\frac{452.517 - 878.162}{1073.776}\right)^2}}{\sqrt{2\pi} \times 1073.776} = 3.4345 \times 10^{-4}$$
and
$$f(M_y) = \frac{e^{-\frac{1}{2}\left(\frac{322.305 - 555.434}{578.948}\right)^2}}{\sqrt{2\pi} \times 578.948} = 6.3541 \times 10^{-4}.$$
Under single phase sampling we have
$$V(\hat{M}_y) = \left(\frac{1}{n} - \frac{1}{N}\right)\{f_y(M_y)\}^{-2}\frac{1}{4} = \left(\frac{1}{8} - \frac{1}{50}\right)\frac{(6.3541 \times 10^{-4})^{-2}}{4} = 65014.30.$$

Under two-phase sampling the ratio estimator of the median is given by
$$\hat{M}_{RD} = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)$$
with variance
$$V(\hat{M}_{RD}) \approx \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{\{f_y(M_y)\}^{-2}}{4} + \left(\frac{1}{n} - \frac{1}{n'}\right)\left[\frac{\{f_y(M_y)\}^{-2}}{4} + \left(\frac{M_y}{M_x}\right)^2\frac{\{f_x(M_x)\}^{-2}}{4} - 2\left(\frac{M_y}{M_x}\right)(P_{11} - 0.25)\{f_x(M_x)f_y(M_y)\}^{-1}\right]$$
$$= \left(\frac{1}{10} - \frac{1}{50}\right)\frac{(6.3541 \times 10^{-4})^{-2}}{4} + \left(\frac{1}{8} - \frac{1}{10}\right)\left[\frac{(6.3541 \times 10^{-4})^{-2}}{4} + \left(\frac{322.305}{452.517}\right)^2\frac{(3.4345 \times 10^{-4})^{-2}}{4} - 2\left(\frac{322.305}{452.517}\right)(0.42 - 0.25)\left(6.3541 \times 10^{-4} \times 3.4345 \times 10^{-4}\right)^{-1}\right]$$
$$= 49536.111 + 0.025[619201.38 + 1075170.28 - 1109669.104] = 64153.67.$$
Thus the percent relative efficiency (RE) of the ratio estimator $\hat{M}_{RD}$ with respect to the usual estimator $\hat{M}_y$ is given by
$$RE = \frac{V(\hat{M}_y)}{V(\hat{M}_{RD})} \times 100 = \frac{65014.30}{64153.67} \times 100 = 101.34\%,$$
which shows that the ratio estimator is more efficient than the usual estimator of the population median.
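The arithmetic of this example can be checked directly; the sketch below simply recomputes the densities, variances, and relative efficiency from the stated population figures:

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

N, n_first, n_second = 50, 10, 8
My, Mx = 322.305, 452.517
fy = normal_pdf(My, 555.434, 578.948)    # f_y(M_y), about 6.3541e-4
fx = normal_pdf(Mx, 878.162, 1073.776)   # f_x(M_x), about 3.4345e-4
p11 = 0.42

v_single = (1 / n_second - 1 / N) * fy ** -2 / 4
v_ratio = (1 / n_first - 1 / N) * fy ** -2 / 4 + (1 / n_second - 1 / n_first) * (
    fy ** -2 / 4
    + (My / Mx) ** 2 * fx ** -2 / 4
    - 2 * (My / Mx) * (p11 - 0.25) / (fx * fy))
re = 100 * v_single / v_ratio            # percent relative efficiency
```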

6.13 DISTRIBUTION FUNCTION WITH TWO-PHASE SAMPLING

Singh and Joarder (2002) consider an estimator of the distribution function in two-phase sampling as
$$\hat{F}_{RD}(t) = \hat{F}_y(t)\left(\frac{\hat{F}'_x(t)}{\hat{F}_x(t)}\right), \qquad (6.13.1)$$
where $\hat{F}'_x(t) = \frac{1}{n'}\sum_{i=1}^{n'}\Delta(t - x_i)$, $\hat{F}_x(t) = \frac{1}{n}\sum_{i=1}^{n}\Delta(t - x_i)$, and $\hat{F}_y(t) = \frac{1}{n}\sum_{i=1}^{n}\Delta(t - y_i)$. The preliminary large sample consists of $n'$ units selected by SRSWOR sampling, on which only the auxiliary variable $X$ is measured. In the second phase, a sub-sample of $n$ units is drawn from the preliminary sample of $n'$ units through SRSWOR sampling, and both the study variable $Y$ and the auxiliary variable $X$ are measured on the selected units. The following lemmas are needed to find the variance of the above estimator.
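Equation (6.13.1) is a direct ratio adjustment of empirical distribution functions; a transcription (the samples below are illustrative, with the second phase sample a subset of the first):

```python
def ecdf_at(sample, t):
    # empirical distribution function evaluated at t
    return sum(1 for v in sample if v <= t) / len(sample)

def f_rd(t, y2, x2, x1):
    # F_RD(t) = F_y(t) * F'_x(t) / F_x(t), eq. (6.13.1)
    # y2, x2: second phase samples; x1: first phase x sample
    fx2 = ecdf_at(x2, t)
    if fx2 == 0:
        return 0.0
    return ecdf_at(y2, t) * ecdf_at(x1, t) / fx2
```

When the first and second phase $x$ samples coincide, the ratio factor is one and $\hat{F}_{RD}$ reduces to $\hat{F}_y$.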

Lemma 6.13.1. The variance of $\hat{F}_y(t)$ is given by
$$V[\hat{F}_y(t)] = \frac{N - n}{nN^2}\left[\sum_{i=1}^{N}\{\Delta(t - y_i)\}^2 - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - y_i)\Delta(t - y_j)\right]. \qquad (6.13.2)$$

Proof. Let $E_1$ and $E_2$ denote the expected values over all possible first phase samples and for a given first phase sample, respectively. Also let the variances $V_1$ and $V_2$ be similarly defined. Then we have
$$V[\hat{F}_y(t)] = E_1 V_2\{\hat{F}_y(t)\} + V_1 E_2\{\hat{F}_y(t)\}$$
$$= E_1 V_2\left\{\frac{1}{n}\sum_{i=1}^{n}\Delta(t - y_i)\right\} + V_1 E_2\left\{\frac{1}{n}\sum_{i=1}^{n}\Delta(t - y_i)\right\}$$
$$= \frac{N - n}{nN^2}\left[\sum_{i=1}^{N}\{\Delta(t - y_i)\}^2 - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - y_i)\Delta(t - y_j)\right].$$
Hence the lemma.

Lemma 6.13.2. The variance of $\hat{F}_x(t)$ is given by
$$V[\hat{F}_x(t)] = \frac{N - n}{nN^2}\left[\sum_{i=1}^{N}\{\Delta(t - x_i)\}^2 - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - x_i)\Delta(t - x_j)\right]. \qquad (6.13.3)$$

Lemma 6.13.3. The variance of $\hat{F}'_x(t)$ is given by
$$V[\hat{F}'_x(t)] = \frac{N - n'}{n'N^2}\left[\sum_{i=1}^{N}\{\Delta(t - x_i)\}^2 - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - x_i)\Delta(t - x_j)\right]. \qquad (6.13.4)$$

Lemma 6.13.4. The covariance between $\hat{F}_y(t)$ and $\hat{F}_x(t)$ is given by
$$\mathrm{Cov}[\hat{F}_y(t), \hat{F}_x(t)] = \frac{N - n}{nN^2}\left[\sum_{i=1}^{N}\Delta(t - x_i)\Delta(t - y_i) - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - x_i)\Delta(t - y_j)\right]. \qquad (6.13.5)$$

Proof. We have
$$\mathrm{Cov}[\hat{F}_y(t), \hat{F}_x(t)] = E_1\left[C_2\{\hat{F}_y(t), \hat{F}_x(t)\}\right] + C_1\left[E_2\{\hat{F}_y(t)\}, E_2\{\hat{F}_x(t)\}\right]$$
$$= E_1\left[C_2\left\{\frac{1}{n}\sum_{i=1}^{n}\Delta(t - y_i), \frac{1}{n}\sum_{i=1}^{n}\Delta(t - x_i)\right\}\right] + C_1\left[\frac{1}{n'}\sum_{i=1}^{n'}\Delta(t - x_i), \frac{1}{n'}\sum_{i=1}^{n'}\Delta(t - y_i)\right],$$
which, on evaluating the conditional covariance for the SRSWOR sub-sample and the covariance over first phase samples, proves the lemma.

Lemma 6.13.5. The covariance between $\hat{F}_x(t)$ and $\hat{F}'_x(t)$ is given by
$$\mathrm{Cov}[\hat{F}_x(t), \hat{F}'_x(t)] = V[\hat{F}'_x(t)]. \qquad (6.13.6)$$
Proof. We have $E_2\{\hat{F}_x(t)\} = \hat{F}'_x(t)$ for a given first phase sample, so
$$\mathrm{Cov}[\hat{F}_x(t), \hat{F}'_x(t)] = E_1\left[\hat{F}'_x(t)E_2\{\hat{F}_x(t)\}\right] - E[\hat{F}_x(t)]E[\hat{F}'_x(t)] = V[\hat{F}'_x(t)].$$
Hence the lemma.

Lemma 6.13.6. The covariance between $\hat{F}_y(t)$ and $\hat{F}'_x(t)$ is given by
$$\mathrm{Cov}[\hat{F}_y(t), \hat{F}'_x(t)] = \frac{N - n'}{n'N^2}\left[\sum_{i=1}^{N}\Delta(t - y_i)\Delta(t - x_i) - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - y_i)\Delta(t - x_j)\right]. \qquad (6.13.7)$$
Proof. We have
$$\mathrm{Cov}[\hat{F}_y(t), \hat{F}'_x(t)] = E_1\left[C_2\{\hat{F}_y(t), \hat{F}'_x(t)\}\right] + C_1\left[E_2\{\hat{F}_y(t)\}, E_2\{\hat{F}'_x(t)\}\right]$$
$$= 0 + C_1\left[\frac{1}{n'}\sum_{i=1}^{n'}\Delta(t - y_i), \frac{1}{n'}\sum_{i=1}^{n'}\Delta(t - x_i)\right]$$
$$= \frac{N - n'}{n'N^2}\left[\sum_{i=1}^{N}\Delta(t - y_i)\Delta(t - x_i) - \frac{1}{N - 1}\sum_{i \ne j}^{N}\Delta(t - y_i)\Delta(t - x_j)\right],$$
which proves the lemma.

Thus we have the following theorem.

Theorem 6.13.1. The variance of the estimator $\hat{F}_{RD}(t)$ at (6.13.1) is approximately
$$V[\hat{F}_{RD}(t)] = V\{\hat{F}_y(t)\} + \left\{\frac{F_y(t)}{F_x(t)}\right\}^2\left[V\{\hat{F}'_x(t)\} + V\{\hat{F}_x(t)\} - 2\mathrm{Cov}\{\hat{F}_x(t), \hat{F}'_x(t)\}\right]$$
$$+ 2\left\{\frac{F_y(t)}{F_x(t)}\right\}\left[\mathrm{Cov}\{\hat{F}_y(t), \hat{F}'_x(t)\} - \mathrm{Cov}\{\hat{F}_y(t), \hat{F}_x(t)\}\right].$$
Proof. The proof of this theorem follows directly from elementary concepts. The algebraic expression for $V\{\hat{F}_{RD}(t)\}$ may be obtained by using the above lemmas.

6.14 IMPROVED VERSION OF TWO-PHASE CALIBRATION APPROACH

This section has been especially designed to improve the work of Hidiroglou and Sarndal (1995, 1998), and hence that of Singh (2000b), and is based on the Golden Jubilee Year 2003 celebration by Singh (2003c) of the traditional linear regression estimator owed to Hansen, Hurwitz, and Madow (1953) for its outstanding performance in the literature. It is shown that the chain regression estimator is unique in its class of estimators.

6.14.1 IMPROVED FIRST PHASE CALIBRATION

Singh (2003c) considers an improved calibrated estimator of the population total $X_2$ given by
$$\hat{X}_2^{\oplus} = \sum_{i \in s_1}\bar{d}_{1i}^{\oplus}x_{2i}, \qquad (6.14.1.1)$$
where the $\bar{d}_{1i}^{\oplus}$ are the calibrated plus weights obtained from the first phase sample information. Although these weights can be chosen in many ways, we discuss here only the simplest case. We choose the first phase calibrated weights $\bar{d}_{1i}^{\oplus}$ such that the chi square distance defined as
$$D_1^{\oplus} = \sum_{i \in s_1}\frac{(\bar{d}_{1i}^{\oplus} - d_{1i})^2}{q_{1i}^{\oplus}d_{1i}} \qquad (6.14.1.2)$$
is minimum subject to the two first phase calibration constraints defined as
$$\sum_{i \in s_1}\bar{d}_{1i}^{\oplus} = \sum_{i \in s_1}d_{1i} \qquad (6.14.1.3)$$
and
$$\sum_{i \in s_1}\bar{d}_{1i}^{\oplus}x_{1i} = X_1. \qquad (6.14.1.4)$$
Note that the condition (6.14.1.3) is a requirement of the chi square test owed to Sir R.A. Fisher and is ignored by all the followers of Deville and Sarndal (1992).

The choice of $q_{1i}^{\oplus}$ decides the form of the estimator. The improved first phase Lagrange function is then defined as
$$L_1^{\oplus} = \sum_{i \in s_1}\frac{(\bar{d}_{1i}^{\oplus} - d_{1i})^2}{q_{1i}^{\oplus}d_{1i}} - 2\lambda_{11}^{\oplus}\left[\sum_{i \in s_1}\bar{d}_{1i}^{\oplus}x_{1i} - X_1\right] - 2\lambda_{12}^{\oplus}\left[\sum_{i \in s_1}\bar{d}_{1i}^{\oplus} - \sum_{i \in s_1}d_{1i}\right], \qquad (6.14.1.5)$$
where $\lambda_{11}^{\oplus}$ and $\lambda_{12}^{\oplus}$ denote the first phase Lagrange multipliers. On differentiating (6.14.1.5) with respect to $\bar{d}_{1i}^{\oplus}$ and equating to zero we have
$$\bar{d}_{1i}^{\oplus} = d_{1i} + \lambda_{11}^{\oplus}q_{1i}^{\oplus}d_{1i}x_{1i} + \lambda_{12}^{\oplus}d_{1i}q_{1i}^{\oplus}. \qquad (6.14.1.6)$$
On substituting (6.14.1.6) in (6.14.1.3) we have
$$\lambda_{11}^{\oplus}\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right) + \lambda_{12}^{\oplus}\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}\right) = 0, \qquad (6.14.1.7)$$
and on substituting (6.14.1.6) in (6.14.1.4) we have
$$\lambda_{11}^{\oplus}\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}^2\right) + \lambda_{12}^{\oplus}\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right) = X_1 - \sum_{i \in s_1}d_{1i}x_{1i}. \qquad (6.14.1.8)$$
On solving (6.14.1.7) and (6.14.1.8) for $\lambda_{11}^{\oplus}$ and $\lambda_{12}^{\oplus}$, and substituting back in (6.14.1.6), we have the modified first phase calibrated weights as
$$\bar{d}_{1i}^{\oplus} = d_{1i} + \frac{(d_{1i}q_{1i}^{\oplus}x_{1i})\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}\right) - (d_{1i}q_{1i}^{\oplus})\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right)}{\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}\right)\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}^2\right) - \left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right)^2}\left[X_1 - \sum_{i \in s_1}d_{1i}x_{1i}\right]. \qquad (6.14.1.9)$$
Thus a modified first phase calibrated estimator of $X_2$ is given by
$$\hat{X}_2^{\oplus} = \sum_{i \in s_1}d_{1i}x_{2i} + \hat{\beta}_{1(ols)}\left[X_1 - \sum_{i \in s_1}d_{1i}x_{1i}\right], \qquad (6.14.1.10)$$
where
$$\hat{\beta}_{1(ols)} = \frac{\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}x_{2i}\right)\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}\right) - \left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right)\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{2i}\right)}{\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}\right)\left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}^2\right) - \left(\sum_{i \in s_1}d_{1i}q_{1i}^{\oplus}x_{1i}\right)^2}. \qquad (6.14.1.11)$$
Clearly the estimator (6.14.1.10) is a traditional linear regression estimator due to Hansen, Hurwitz, and Madow (1953), and hence is an improved version of the first phase calibrated estimator owed to Hidiroglou and Sarndal (1995, 1998) and hence that of Singh (2000b).

Note that there is no choice of $q_{1i}^{\oplus}$ which reduces the estimator (6.14.1.10) to the ratio or product method of estimation, which leads to the following theorem.

Theorem 6.14.1.1. The traditional first phase calibrated linear regression estimator of the population total $X_2$ is unique in its class of estimators.
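The first phase calibration can be verified numerically: the weights of (6.14.1.9) must reproduce both the sum of the design weights and the known total $X_1$, whatever $q^{\oplus}$ is chosen. A sketch (all data values are illustrative):

```python
def first_phase_weights(d, q, x1, X1_total):
    # Calibrated plus weights of (6.14.1.9): two constraints, chi-square distance.
    sdq = sum(di * qi for di, qi in zip(d, q))
    sdqx = sum(di * qi * xi for di, qi, xi in zip(d, q, x1))
    sdqx2 = sum(di * qi * xi * xi for di, qi, xi in zip(d, q, x1))
    gap = X1_total - sum(di * xi for di, xi in zip(d, x1))
    det = sdq * sdqx2 - sdqx ** 2
    return [di + (di * qi * xi * sdq - di * qi * sdqx) / det * gap
            for di, qi, xi in zip(d, q, x1)]

d = [5.0, 5.0, 5.0, 5.0]    # first phase design weights d_1i
q = [1.0, 1.0, 1.0, 1.0]    # chosen q weights
x1 = [3.0, 7.0, 4.0, 6.0]   # first auxiliary variable
X1_total = 108.0            # known total X_1
w = first_phase_weights(d, q, x1, X1_total)
```

Both constraints (6.14.1.3) and (6.14.1.4) hold exactly, which is what forces the estimator (6.14.1.10) into the regression form.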

6.14.2 IMPROVED SECOND PHASE CALIBRATION

Singh (2003c) considers an improved second phase calibrated estimator of the population total, $Y$, defined as
$$\hat{Y}_c^{\oplus} = \sum_{i \in s_2}\bar{d}_i^{\oplus}y_i, \qquad (6.14.2.1)$$
where the $\bar{d}_i^{\oplus}$ are called the second phase calibrated plus weights. Let us choose the second phase calibrated weights $\bar{d}_i^{\oplus}$ such that the chi square distance function defined as
$$D_2^{\oplus} = \sum_{i \in s_2}\frac{(\bar{d}_i^{\oplus} - d_{1i}d_{2i})^2}{d_{1i}d_{2i}q_{2i}^{\oplus}} \qquad (6.14.2.2)$$
is minimum subject to the two calibration constraints defined as
$$\sum_{i \in s_2}\bar{d}_i^{\oplus} = \sum_{i \in s_2}d_{1i}d_{2i} \qquad (6.14.2.3)$$
and
$$\sum_{i \in s_2}\bar{d}_i^{\oplus}x_{2i} = \hat{X}_2^{\oplus}, \qquad (6.14.2.4)$$
where $\hat{X}_2^{\oplus}$ is given by (6.14.1.1) after fixing the first phase calibration. The choice of $q_{2i}^{\oplus}$ yields different forms of estimators in two-phase sampling. The modified second phase Lagrange function is given by
$$L_2^{\oplus} = \sum_{i \in s_2}\frac{(\bar{d}_i^{\oplus} - d_{1i}d_{2i})^2}{d_{1i}d_{2i}q_{2i}^{\oplus}} - 2\lambda_{21}^{\oplus}\left[\sum_{i \in s_2}\bar{d}_i^{\oplus}x_{2i} - \hat{X}_2^{\oplus}\right] - 2\lambda_{22}^{\oplus}\left[\sum_{i \in s_2}\bar{d}_i^{\oplus} - \sum_{i \in s_2}d_{1i}d_{2i}\right], \qquad (6.14.2.5)$$
where $\lambda_{21}^{\oplus}$ and $\lambda_{22}^{\oplus}$ are Lagrange multipliers. On setting $\partial L_2^{\oplus}/\partial\bar{d}_i^{\oplus} = 0$ we have

$$\bar{d}_i^{\oplus} = d_{1i}d_{2i} + \lambda_{21}^{\oplus}q_{2i}^{\oplus}d_{1i}d_{2i}x_{2i} + \lambda_{22}^{\oplus}d_{1i}d_{2i}q_{2i}^{\oplus}. \qquad (6.14.2.6)$$
From (6.14.2.3) and (6.14.2.6) we have
$$\lambda_{21}^{\oplus}\left[\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}\right] + \lambda_{22}^{\oplus}\left[\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}\right] = 0, \qquad (6.14.2.7)$$
and from (6.14.2.4) and (6.14.2.6) we have
$$\lambda_{21}^{\oplus}\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}^2\right) + \lambda_{22}^{\oplus}\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}\right) = \hat{X}_2^{\oplus} - \sum_{i \in s_2}d_{1i}d_{2i}x_{2i}. \qquad (6.14.2.8)$$
On solving (6.14.2.7) and (6.14.2.8) for $\lambda_{21}^{\oplus}$ and $\lambda_{22}^{\oplus}$, and substituting them in (6.14.2.6), we obtain the improved second phase calibrated weights as
$$\bar{d}_i^{\oplus} = d_{1i}d_{2i} + \frac{(d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i})\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}\right) - (d_{1i}d_{2i}q_{2i}^{\oplus})\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}\right)}{\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}^2\right)\left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}\right) - \left(\sum_{i \in s_2}d_{1i}d_{2i}q_{2i}^{\oplus}x_{2i}\right)^2}\left[\hat{X}_2^{\oplus} - \sum_{i \in s_2}d_{1i}d_{2i}x_{2i}\right]. \qquad (6.14.2.9)$$
On substituting (6.14.2.9) in (6.14.2.1), the improved second phase calibrated estimator of the population total $Y$ is given by
$$\hat{Y}_c^{\oplus} = \sum_{i \in s_2}d_{1i}d_{2i}y_i + \hat{\beta}_{2(ols)}\left[\hat{X}_2^{\oplus} - \sum_{i \in s_2}d_{1i}d_{2i}x_{2i}\right], \qquad (6.14.2.10)$$
where $\hat{\beta}_{2(ols)}$ is the analogous weighted regression coefficient of $y$ on $x_2$ based on the second phase sample.

Note that if $q_{1i}^{\oplus} = 1$ and $q_{2i}^{\oplus} = 1$, then the resultant calibrated estimator (6.14.2.10) can be claimed as a traditional chain regression type estimator in survey sampling. Thus we have the following theorem:

Theorem 6.14.2.1. The traditional chain regression estimator is unique in its class of estimators.

Note that in the same way all the ten cases (see Exercise 6.27) considered by Estevao and Sarndal (2002) can be improved; we leave this to the reader as an exercise. The problem of estimation of variance with the help of the modified two-dimensional methodology discussed in Chapter 5 can also be extended to two-phase sampling, and will be discussed in the next volume of this book.

Exercise 6.1. Assume that a sample of size $m$ is selected using SRSWR sampling out of $N$ units to observe the variable $X$, while a sub-sample of size $n$ is selected out of the $m$ units to observe the study variable $Y$ and the auxiliary variable $X$ with the same sampling strategy. Suppose $\bar{y}_n$, $\bar{x}_n$ denote the second phase sample means, and $\bar{x}_m$ is the first phase sample mean.

( a ) Find the first order bias and mean squared error of the estimators of the population mean $\bar{Y}$ defined as
$$\bar{y}_B = \bar{y}_n\left(\frac{\bar{x}_m}{\bar{x}_n}\right), \quad \text{[Bose (1943)]}$$
$$\text{[Patterson (1950)]}$$
and
$$\bar{y}_{yr} = \bar{r}_n\bar{x}_m + \frac{n(m - 1)}{m(n - 1)}(\bar{y}_n - \bar{r}_n\bar{x}_n), \quad \text{[Yates (1949), Rao (1975)]}$$
where $\bar{r}_n = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}$.
( b ) Now suppose
$C_1$ = cost per unit of observation on the auxiliary variable $X$,
$C_2$ = cost per unit of observation on the study variable $Y$,
so that the total cost $C$ is
$$C = mC_1 + nC_2.$$
Find the minimum mean squared errors of the above estimators defined in part ( a ) for the fixed cost. Comment on the resultant mean squared errors.
Hint: Sukhatme (1962), Tukey (1956).

Exercise 6.2. Suppose two samples of sizes $n_1$ and $n_2$ are drawn independently from a finite population of size $N$. Let $\bar{x}_1$ denote the sample mean of the auxiliary variable $X$ for the first phase sample of size $n_1$, and let $\bar{y}_2$ and $\bar{x}_2$ denote the sample means of the study variable $Y$ and the auxiliary variable $X$, respectively, based on the second phase sample of size $n_2$. Assuming the two samples are drawn independently and the regression coefficient $\beta$ is known, consider the following two estimators of the population mean $\bar{Y}$:
$$\bar{y}_0 = \bar{y}_2 + \beta(\bar{x}_1 - \bar{x}_2)$$
and
$$\bar{y}_1 = \bar{y}_2 + \beta(\bar{x} - \bar{x}_2), \quad \text{where } \bar{x} = a\bar{x}_1 + b\bar{x}_2 \text{ with } a + b = 1.$$
Show that
$$V(\bar{y}_0) - V(\bar{y}_1) \ge 0.$$
Hint: Shah and Gupta (1986).

Exercise 6.3. Suppose a sample of size $m$ is selected using SRSWOR sampling out of $N$ units to observe the variates $X$ and $Z$, while a sub-sample of size $n$ is selected out of the $m$ units to observe the variates $Y$ and $X$ with the same sampling strategy. Suppose $\bar{y}_n$ and $\bar{x}_n$ denote the second phase sample means, and $\bar{x}_m$ and $\bar{z}_m$ are the first phase sample means for the associated variables. Assume that the population mean $\bar{Z}$ of the second auxiliary variable $Z$ is known. Find the first order bias and mean squared error of the estimators of the population mean $\bar{Y}$ defined as
$$\bar{y}_1 = \bar{y}_n\left[\frac{\bar{x}_m + b(\bar{Z} - \bar{z}_m)}{\bar{x}_n}\right], \quad \text{[Kiregyera (1980, 1984)]}$$
$$\bar{y}_2 = \bar{y}_n + b\left[\bar{x}_m\frac{\bar{Z}}{\bar{z}_m} - \bar{x}_n\right], \quad \text{[Kiregyera (1980, 1984)]}$$
$$\bar{y}_3 = \bar{y}_n + b[\bar{x}_{frd} - \bar{x}_n], \quad \text{[Singh and Singh (1991)]}$$
where
$$\bar{x}_{frd} = \bar{x}_m\left[\frac{(A + C)\bar{Z} + fB\bar{z}_m}{(A + fB)\bar{Z} + C\bar{z}_m}\right],$$
$$\bar{y}_4 = \bar{y}_n\left(\frac{\bar{x}_m}{\bar{x}_n}\right)\left[\frac{\bar{Z} + C_z}{\bar{z}_m + C_z}\right]^{\alpha}, \quad \text{[Singh and Upadhyaya (1995)]}$$
$$\bar{y}_5 = \bar{y}_n\left[\frac{\bar{x}_m + b(\bar{Z} - \bar{z}_m)}{\bar{x}_m + \lambda(\bar{x}_n - \bar{x}_m)}\right], \quad \text{[Upadhyaya, Kushwaha, and Singh (1990)]}$$
and
$$\bar{y}_6 = c_1\bar{y}_n + c_2\bar{x}_n + c_3\bar{x}_m + c_4\bar{z}_m + c_5\bar{z}_n + c_6\bar{Z},$$
where $c_j$, $j = 1, 2, 3, 4, 5, 6$, are suitably chosen constants such that the bias in $\bar{y}_6$ is equal to zero.
Hint: Mishra and Rout (1997).

Exercise 6.4. Using the concept of two-phase sampling, study the asymptotic properties of the following estimators of the population mean $\bar{Y}$ defined as
$$\bar{y}_1 = \bar{y}_n\left(\frac{\bar{x}_m}{\bar{x}_n}\right)^{\alpha} \quad \text{and} \quad \bar{y}_2 = \bar{y}_n\left[\frac{\bar{x}_m}{\alpha\bar{x}_n + (1 - \alpha)\bar{x}_m}\right].$$
Hint: Sahoo and Swain (1989).

Exercise 6.5. In a finite population n of size N, let the value of the variable
Yj , (j = 0,1,2) on the it" unit be Y j i. Let Yj and SJ , respectively, denote the
population mean and mean square error of the / , variable. Consider we are
interested in estimating the ratio and product of population means of the first two
variables, defined as R = Yo/~ and p = Yo~ , respectively. Suppose a preliminary
large sample of III units is drawn from the given population by SRSWOR scheme
and only the auxiliary variable Y2 is measured on it. The main sample or second
596 Advanced sampling theory with applications

phase sample of size II is drawn either from the preliminary sample of size
III (> n)or independently from the population using SRSWOR scheme and variables

Yj , (j = 0,1,2) are measured.


Suppose
$$\bar{y}_{nj} = n^{-1}\sum_{i=1}^{n} y_{ji} \quad \text{and} \quad s_j^2 = (n-1)^{-1}\sum_{i=1}^{n}\left(y_{ji} - \bar{y}_{nj}\right)^2$$
denote the second phase sample mean and sample variance of the $j$th variable.
Also let
$$\bar{y}_{m2} = m^{-1}\sum_{i=1}^{m} y_{2i} \quad \text{and} \quad s_{m2}^{*2} = (m-1)^{-1}\sum_{i=1}^{m}\left(y_{2i} - \bar{y}_{m2}\right)^2$$
denote the first phase sample mean and sample variance.

( a ) Study the asymptotic properties of the general classes of estimators of $R$ and $P$
defined as
$$\hat{R}_H = \hat{R}\,H(u,v) \quad \text{and} \quad \hat{P}_H = \hat{P}\,H(u,v),$$
where $\hat{R} = \bar{y}_{n0}/\bar{y}_{n1}$, $\hat{P} = \bar{y}_{n0}\bar{y}_{n1}$, $u = \bar{y}_{n2}/\bar{y}_{m2}$, $v = s_{n2}^2/s_{m2}^{*2}$, and $H(\cdot,\cdot)$ is a
parametric function such that $H(1,1) = 1$, satisfying certain regularity conditions.
Also find the optimum sample sizes for fixed costs.
Hint: Singh (1982b), Khare (1991), Singh, Singh, and Singh (1994), Ahmed
(1997).
( b ) Assuming that the population mean $\bar{Y}_3$ of another auxiliary variable $Y_3$ is known,
study the asymptotic properties of the estimator:
$$R^{*} = w\,\hat{R}\left(\frac{\bar{y}_{n2}}{\bar{y}_{m2}}\right) + (1-w)\,\hat{R}\left(\frac{\bar{y}_{n2}}{\bar{y}_{m2}}\right)\left(\frac{\bar{Y}_3}{\bar{y}_{m3}}\right).$$
Hint: Singh, Singh, and Singh (1994), Prasad, Singh, and Singh (1996).

Exercise 6.6. Show that the following are unbiased estimators of the population
mean $\bar{Y}$:
$$t_1 = \bar{r}\left(\bar{x}_1 - \bar{x}\right) + M\left(\bar{y} - \bar{r}\bar{x}\right) + \bar{y}, \qquad t_2 = \bar{r}\left(\bar{x}^{*} - \bar{x}\right) + aM\left(\bar{y} - \bar{r}\bar{x}\right) + \bar{y},$$
and
$$t_3 = \bar{r}\left(\bar{x}_w - \bar{x}\right) + M_w\left(\bar{y} - \bar{r}\bar{x}\right) + \bar{y},$$
where $M = (N-n)/\{N(n-1)\}$, $M_w = (W-n)/\{W(n-1)\}$, $m \le W < N$, $\bar{x}^{*} = a\bar{x}_1 + b\bar{x}$,
$a + b = 1$; $\bar{x}_1$ is the mean of $x$ based on the $m$ first phase units and $W$ is the number of distinct units
in the first and second phase samples; $\bar{y}$ and $\bar{x}$ are the means based on the $n$ second
phase units, and $\bar{x}_w$ is the mean based on the distinct units only.
( a ) Show that $t_1$ is a special case of $t_2$ for $a = 1$ and $b = 0$.
( b ) Compare these three estimators under a superpopulation model. Use
Quenouille's method to construct unbiased regression type estimators in two-phase
sampling.
Hint: Rao (1975), Singh, Katyar, and Gangwar (1996).
Chapter 6: Use of auxiliary information: Multi-Phase Sampling 597

Exercise 6.7. Show that an unbiased estimator of variance of the regression


estimator in two-phase sampling is given by

V(Yd)=(~_~J_l_I[(Yi-Y)-.BoIS(Xi-X)f + s; .
n m (n-3L=1 m
Hint: Tikkiwal (1960) .
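The variance formula above is easy to probe numerically. The sketch below is a minimal simulation under assumptions that are not part of the exercise (a hypothetical linear population and NumPy): it computes the two-phase regression estimator and a variance estimate of the residual-plus-first-phase form shown above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population: auxiliary x, study variable y correlated with x.
N = 2000
x = rng.gamma(4.0, 10.0, N)
y = 5.0 + 1.8 * x + rng.normal(0.0, 8.0, N)

m, n = 200, 50
first = rng.choice(N, m, replace=False)       # first-phase SRSWOR sample
second = rng.choice(first, n, replace=False)  # second-phase sub-sample

xm = x[first].mean()                          # first-phase mean of x
xs, ys = x[second], y[second]
beta = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean())**2)

# Two-phase (double sampling) regression estimator of the mean.
y_lrd = ys.mean() + beta * (xm - xs.mean())

# Variance estimator in the spirit of the exercise: residual sum of squares
# scaled by (1/n - 1/m)/(n - 3), plus s_y^2/m for the first-phase part.
resid = (ys - ys.mean()) - beta * (xs - xs.mean())
v_hat = (1/n - 1/m) * np.sum(resid**2) / (n - 3) + ys.var(ddof=1) / m
```

Repeating the draw many times and comparing `v_hat` with the empirical variance of `y_lrd` gives a quick sanity check of the formula.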

Exercise 6.8. Let a finite population of $N$ units be represented by the set of $(p+1)$
component vectors $\left(y_i, x_{1i}, x_{2i}, \ldots, x_{pi}\right)$, $i = 1,2,\ldots,N$. Let $\mu_y = N^{-1}\sum_{i=1}^{N} y_i$ and
$\mu_{x_j} = N^{-1}\sum_{i=1}^{N} x_{ji}$, $j = 1,2,\ldots,p$. Suppose a sample of $n$ units is taken by SRSWOR
sampling. Let $\bar{y} = n^{-1}\sum_{i=1}^{n} y_i$ and $\bar{x}_j = n^{-1}\sum_{i=1}^{n} x_{ji}$, $j = 1,2,\ldots,p$.
( a ) Show that for some real constants $a_j$ the estimator
$$\bar{y}_{ml} = \bar{y} - \sum_{j=1}^{p} a_j\left(\bar{x}_j - \mu_{x_j}\right)$$
is an unbiased estimator of the population mean.
( b ) Suppose that the sample elements are drawn one at a time, so that the sample is an
ordered set, the order being the order of draw. Let $Z_a$ denote the ordered
set of observations on the first $a$ sample elements, $1 < a < n$, and let $a_j(Z_a)$
denote functions of these observations. Let $Y(a)$, $X_j(a)$ denote the sums and $\bar{y}(a)$,
$\bar{x}_j(a)$, $j = 1,2,\ldots,p$, the means of the indicated observations on the first $a$
sample elements. Also let $E\{\cdot \mid Z_a\}$ denote the conditional expectation given $Z_a$.
Show that the following three estimators are unbiased estimators of the population
mean $\bar{Y}$, defined as
$$t_1 = \bar{y} - \sum_{j=1}^{p} a_j(Z_a)\left[\bar{x}_j - \mu_{x_j}\right] - \frac{a(N-n)}{(n-a)N}\left\{\bar{y}(a) - \bar{y} - \sum_{j=1}^{p} a_j(Z_a)\left(\bar{x}_j(a) - \bar{x}_j\right)\right\},$$
$$t_2 = \bar{y} - \sum_{j=1}^{p} \beta_j(Z_a)\left[\bar{x}_j - \mu_{x_j}\right] - \frac{a(N-n)}{(n-a)N}\left\{\bar{y}(a) - \bar{y} - \sum_{j=1}^{p} \beta_j(Z_a)\left(\bar{x}_j(a) - \bar{x}_j\right)\right\},$$
and
$$t_3 = \sum_{j=1}^{p} a_j(Z_a)\mu_{x_j} + \frac{(N-a)n}{(n-a)N}\left\{\bar{y} - \sum_{j=1}^{p} a_j(Z_a)\bar{x}_j\right\} - \frac{a(N-n)}{(n-a)N}\left[\bar{y}(a) - \sum_{j=1}^{p} a_j(Z_a)\bar{x}_j(a)\right].$$
( c ) For $p = 1$ show that the minimisation of
$$\sum_{i=1}^{a} w(x_i)\left[y_i - \frac{\mu_y}{\mu_x}x_i\right]^2,$$
where $w(x_i)$ is a suitably chosen weight used to form different kinds of estimators,
leads to the biased ratio estimator
$$\hat{\mu}_y = \mu_x\,a(Z_a)$$

where
$$a(Z_a) = \sum_{i=1}^{a} y_i x_i w(x_i)\Big/\sum_{i=1}^{a} x_i^2 w(x_i).$$


( d ) Show that if $w(x_i) = 1/x_i^2$, then $a(Z_a) = a^{-1}\sum_{i=1}^{a} y_i/x_i$ and the estimator $t_3$ becomes
$$t_3^{*} = \bar{r}\mu_x + \frac{(N-1)n}{N(n-1)}\left(\bar{y} - \bar{r}\bar{x}\right), \quad \text{with} \quad \bar{r} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}.$$
( e ) Show that if $w(x_i) = 1/x_i$, then the estimator $t_3$ becomes
$$t_3^{**} = \hat{R}_a\mu_x + \frac{(N-a)n}{N(n-a)}\left(\bar{y} - \hat{R}_a\bar{x}\right)$$
for $a = 1$ and $a = (n-1)$. Also show that if $a = 1$ the estimator $t_3^{**}$ reduces to $t_3^{*}$, and
if $a = (n-1)$ then
$$\hat{R}_{n-1} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{n\bar{x} - x_i}.$$
Hint: Mickey (1959).

Exercise 6.9. Suppose an initial sample $s_1$ of size $m$ is selected with replacement
with probabilities $p_i$ proportional to $z_i$ $(i = 1,2,\ldots,N)$, where $N$ is the population
size. Information on the variable $X$ is gathered on the first phase sample $s_1$. The
second phase sample $s_2$ of size $n$ is a sub-sample of $s_1$, selected by SRSWOR, in
which information on $Y$ and $X$ is collected. Find the variance of the unbiased
difference type estimator of the population total $Y$ given by
$$\hat{Y} = n^{-1}\sum_{i \in s_2}\frac{y_i}{p_i} + \hat{\beta}\left[m^{-1}\sum_{i \in s_1}\frac{x_i}{p_i} - n^{-1}\sum_{i \in s_2}\frac{x_i}{p_i}\right].$$
Hint: Srivenkataramana and Tracy (1989), Särndal and Swensson (1987), Raj
(1964, 1965b).

Exercise 6.10. Consider a sequence of finite populations such that its $t$th member
$U_t$ consists of $N_t$ units labelled $i = 1,2,\ldots,N_t$, $t = 1,2,\ldots$, and $N_t \to \infty$ as $t \to \infty$. Let
$\delta(t)$ be a quantity that tends to zero as $t \to \infty$. Let $w\,(\ge 0)$, $x$, $y$ denote
respectively a size measure, an auxiliary variable, and the variable of interest. Let
their respective values for the $i$th unit $(i = 1,2,\ldots,N_t)$ of $U_t$ $(t = 1,2,\ldots)$ be $w_{it}$
(known for each $i$), $x_{it}$ and $y_{it}$. The corresponding totals (means) over $U_t$ are
$W_t$, $X_t$, $Y_t$ $\left(\bar{W}_t, \bar{X}_t, \bar{Y}_t\right)$. Let
$$\mathbf{W}_t = \left(w_{1t},\ldots,w_{N_t t}\right), \quad \mathbf{X}_t = \left(x_{1t},\ldots,x_{N_t t}\right) \quad \text{and} \quad \mathbf{Y}_t = \left(y_{1t},\ldots,y_{N_t t}\right).$$
From $U_t$ an initial sample $s_{1t}$ (each possible $s_{1t}$ having a fixed number $n_{1t}$ of distinct
units) is drawn with a probability $p_{1t}(s_{1t})$ that may involve the $w_{it}$
for $i \in U_t$. For the units in $s_{1t}$ the $x$ values are ascertained. From $s_{1t}$
a sub-sample $s_{2t}$ (say, each possible $s_{2t}$ with a given number $n_{2t} < n_{1t}$ of distinct
units) is chosen with a conditional probability $p_{2t}(s_{2t}) = p_{2t}(s_{2t} \mid s_{1t})$ given that
$p_{1t}(s_{1t}) > 0$. These probabilities may involve the $w_{it}$ for $i \in U_t$ and also the $x_{it}$ for $i \in s_{1t}$.
The overall two-phase sample $s_t = (s_{1t}, s_{2t})$ has the selection probability
$p_t(s_t) = p_{1t}(s_{1t})\,p_{2t}(s_{2t} \mid s_{1t})$, where $p_t$, $p_{1t}$ and $p_{2t}$ denote respectively the designs
corresponding to the double, first phase, and second phase sampling plans. Define
$E_{p_t}$, $E_{p_{1t}}$ and $E_{p_{2t}}$ to be the design expectation operators with respect to $p_t$, $p_{1t}$, and
$p_{2t}$, where $E_{p_{2t}}$ is a conditional expectation for each fixed $s_{1t}$ with $p_{1t} > 0$;
$I_{kit} = 1$ if $i \in s_{kt}$, and $= 0$ otherwise, for $k = 1,2$. Let $\pi_{1it}$ and $\pi_{1ijt}$ be the first
and second order inclusion probabilities according to the design $p_{1t}$ and let
$r_{1it} = 1/\pi_{1it}$ be the design weight for every $s_{1t}$ with $p_{1t} > 0$. Let $\pi_{2it}$ and $\pi_{2ijt}$ be
the first and second order conditional inclusion probabilities with respect to the design
$p_{2t}(\cdot \mid s_{1t})$ and let the design weights be $r_{2it} = 1/\pi_{2it}$.
Consider two estimators such as
$$e_{1t}(w) = \sum_{i \in s_{1t}} w_{it}\,r_{1it} \quad \text{and} \quad e_{2t}(w) = \sum_{i \in s_{2t}} w_{it}\,r_{1it}\,r_{2it}.$$
Similarly define elt (x), e2t(x) and e2t (y) based on the information available from
survey data.
Find the bias and variance of the regression type estimator of the population mean
$\bar{Y}_t$ given by
$$\hat{\bar{Y}}_{rt} = N_t^{-1}\left[e_{2t}(y) - \hat{\beta}_{1t}\left\{e_{2t}(x) - \hat{X}_t\right\} - \hat{\beta}_{2t}\left\{e_{2t}(w) - W_t\right\}\right],$$
where $\hat{X}_t = e_{1t}(x) - \hat{\beta}_{3t}\left\{e_{1t}(w) - W_t\right\}$ and the $\hat{\beta}_{jt}$ $(j = 1,2)$ are quantities in terms of the known
$w$ and sampled $x$ and $y$ values.
Hint: Mukerjee and Chaudhuri (1990).

Exercise 6.11. Compare the following three estimators in two-phase sampling:

( a ) Ratio estimator: $\bar{y}_1 = \bar{y}_n\left(\bar{x}_m/\bar{x}_n\right)$;

( b ) Regression estimator: $\bar{y}_2 = \bar{y}_n + \hat{\beta}\left(\bar{x}_m - \bar{x}_n\right)$, where $\hat{\beta} = s_{xy}/s_x^2$; and

( c ) PPSWR sampling: $\bar{y}_3 = \dfrac{\bar{x}_m}{n}\sum\limits_{i=1}^{n}\dfrac{y_i}{x_i}$.
Hint: Prabhu--Ajgaonkar (1975).
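A small Monte Carlo sketch makes the comparison of ( a ) and ( b ) concrete. Everything below (the population model, sample sizes, and NumPy) is an assumption for illustration, not Prabhu--Ajgaonkar's setup; it estimates the empirical MSE of the two strategies over repeated two-phase draws, and strategy ( c ) could be added analogously.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical population with y roughly proportional to x, so both the
# ratio and the regression estimator are reasonable candidates.
N, m, n = 1000, 150, 40
x = rng.gamma(5.0, 8.0, N)
y = 2.0 * x + rng.normal(0.0, 15.0, N)

def one_draw():
    s1 = rng.choice(N, m, replace=False)      # first phase: SRSWOR
    s2 = rng.choice(s1, n, replace=False)     # second phase: SRSWOR sub-sample
    xm, xn, yn = x[s1].mean(), x[s2].mean(), y[s2].mean()
    ratio = yn * xm / xn                      # estimator (a)
    b = np.cov(x[s2], y[s2], ddof=1)[0, 1] / x[s2].var(ddof=1)
    regression = yn + b * (xm - xn)           # estimator (b)
    return ratio, regression

draws = np.array([one_draw() for _ in range(2000)])
mse = ((draws - y.mean())**2).mean(axis=0)    # empirical MSE of each strategy
```

With the intercept-free, homoscedastic population assumed here the two MSEs come out close, which is consistent with the usual large-sample comparison.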

Exercise 6.12. Find the asymptotic bias and variance of the chain ratio type
estimators of the finite population variance $S_y^2$ given by
$$( a )\ \hat{S}_1^2 = s_y^2\left(\frac{s_x^{*2}}{s_x^2}\right); \qquad ( b )\ \hat{S}_2^2 = s_y^2\left(\frac{s_x^{*2}}{s_x^2}\right)\left(\frac{S_z^2}{s_z^{*2}}\right); \qquad \text{and} \qquad ( c )\ \hat{S}_3^2 = s_y^2\left(\frac{s_x^{*2}}{s_x^2}\right)\left(\frac{s_z^{*2}}{S_z^2}\right),$$
where the starred sample variances are based on the first phase sample, the unstarred ones
on the second phase sample, and $S_z^2$ is the known population variance of the second
auxiliary variable.
Hint: Gupta, Singh, and Mangat (1992-93).

Exercise 6.13. Suppose a simple random sample $s_1$ of size $m$ is taken without
replacement from a population of $N$ elements and $x_i$ alone is observed for all
elements $i \in s_1$. A simple random sub-sample $s_2$ of size $n$ is then drawn without
replacement from $s_1$, and $y_i$, $x_i$ are observed for all $i \in s_2$. A ratio estimator of the
population mean $\bar{Y}$ is
$$\bar{y}_r = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right),$$
where $\bar{y}$ and $\bar{x}$ are the means for $s_2$ and $\bar{x}^{*}$ is the mean for $s_1$. Define
$$\bar{y}_r(j) = \bar{y}(j)\left\{\frac{\bar{x}^{*}(j)}{\bar{x}(j)}\right\},$$
where
$$\bar{x}(j) = \begin{cases}\dfrac{n\bar{x} - x_j}{n-1} & \text{if } j \in s_2,\\[2mm] \bar{x} & \text{if } j \in \left(s_1 - s_2\right),\end{cases}$$
and $\bar{x}^{*}(j) = \dfrac{m\bar{x}^{*} - x_j}{m-1}$ if $j \in s_1$. Show that the linear form of the modified Jackknife
estimator
$$v_J = \frac{m-1}{m}\sum_{j \in s_1}\left\{\bar{y}_r(j) - \bar{y}_r\right\}^2$$
is identical to the usual estimator of variance under mild conditions.
Hint: Rao and Sitter (1995) , Sitter and Rao (1997), Rao (1996b).
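The leave-one-out recipe above translates directly into code. The sketch below uses a hypothetical population (an assumption, not Rao and Sitter's example): it computes $\bar{y}_r$, the $m$ leave-one-out replicates over the first phase sample, and the modified Jackknife variance $v_J$.

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical population for illustration only.
N, m, n = 800, 60, 20
x = rng.gamma(4.0, 5.0, N)
y = 3.0 * x + rng.normal(0.0, 6.0, N)

s1 = rng.choice(N, m, replace=False)
s2 = rng.choice(s1, n, replace=False)
in_s2 = np.isin(s1, s2)

ybar, xbar, xstar = y[s2].mean(), x[s2].mean(), x[s1].mean()
yr = ybar * xstar / xbar                     # two-phase ratio estimator

# Leave-one-out replicates over the FIRST-phase sample: dropping j changes
# (ybar, xbar) only when j also belongs to the second-phase sub-sample.
vals = []
for k, j in enumerate(s1):
    xs_j = (m * xstar - x[j]) / (m - 1)
    if in_s2[k]:
        yb_j = (n * ybar - y[j]) / (n - 1)
        xb_j = (n * xbar - x[j]) / (n - 1)
    else:
        yb_j, xb_j = ybar, xbar
    vals.append(yb_j * xs_j / xb_j)

vals = np.array(vals)
v_jack = (m - 1) / m * np.sum((vals - yr)**2)  # modified Jackknife variance
```

Comparing `v_jack` with a linearisation-based variance estimate over repeated draws illustrates the equivalence the exercise asks for.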

Exercise 6.14. Suppose that two sample surveys have one variable in common, say
$Z$; that is, two agencies are collecting information on it. Suppose the population
total of this variable is unknown to both agencies. Let $X$ be the known auxiliary
variable and $Y$ the variable under study. Suppose $\hat{Z}_1 = \sum_{i \in s_1} z_i/\pi_{1i}$ and
$\hat{Z}_2 = \sum_{i \in s_2} z_i/\pi_{2i}$ are the two different estimators of the unknown total $Z$ of the common
variable obtained by the two independent survey agencies. Study the properties of the
regression type estimator of the population total $Y$ defined as
$$\hat{Y}_0(1) = \hat{Y}_1 + b_1\left(X - \hat{X}_1\right) + b_2\left(\hat{Z}_1 - \hat{Z}_2\right),$$
where $\hat{Y}_t = \sum_{i \in s_t} y_{ti}/\pi_{ti}$ and $\hat{X}_t = \sum_{i \in s_t} x_{ti}/\pi_{ti}$, and $t = 1,2$ denotes the $t$th agency.
Hint: Renssen and Nieuwenbroek (1997).

Exercise 6.15. Consider the problem of estimating the population mean Y of the
study variable, Y, from a finite population of size N. Suppose the information on
an auxiliary variable, X, highly correlated with Y, is not available, but values of
X are assumed known over a large random sample of m units . Suppose that
information on another auxiliary variable , Z, is available on all units of the
population with population mean $\bar{Z}$. Let $\left(y_i, x_i, z_i\right)$ for $i = 1,2,\ldots,n$ denote the
information collected on the second phase sample. Then study the asymptotic
properties of the following four estimators:
$$( a )\ t_1 = \bar{y}_n + b_{yx}\left[\left(\bar{x}_m - \bar{x}_n\right) - b_{xz}\left(\bar{z}_m - \bar{Z}\right)\right]; \qquad ( b )\ t_2 = \bar{y}_n + b_{yx}\left(\bar{x}_m - \bar{x}_n\right) + b_{yz}\left(\bar{Z} - \bar{z}_m\right);$$
$$( c )\ t_3 = \bar{y}_n + b_{yx}\left(\bar{x}_m - \bar{x}_n\right) + b_{yz}\left(\bar{Z} - \bar{z}_n\right);$$
and
$$( d )\ t_4 = \bar{y}_n + b_{yx}\left(\bar{x}_m - \bar{x}_n\right) + b_{yx}b_{xz}\left(\bar{Z} - \bar{z}_m\right) + b_{yz}\left(\bar{Z} - \bar{z}_n\right).$$
Hint: Ahmed (1998), Mukerjee, Rao, and Vijayan (2000), Sahoo and Sahoo
(1999a, 1999b), Pradhan (2001).
( e ) Develop a test statistic to test the hypothesis of whether the known population mean $\bar{Z}$
should be included in the regression type estimators using two-phase sampling.
Hint: Das and Bez (1995).
( f ) Study the asymptotic properties of these four estimators ( a ) to ( d ) under the
superpopulation models:
$$y_k = q_1 x_k + e_{1k}$$
with
$$E\left(e_{1k} \mid x_k\right) = 0,\ E\left(e_{1k}^2 \mid x_k\right) = a x_k^g,\ a > 0,\ g \ge 0,\ E\left(e_{1k}e_{1j} \mid x_k, x_j\right) = 0,\ k \ne j = 1,2,\ldots,N,$$
and the $e_{1k}$'s independent of $x$, and
$$x_k = q_2 z_k + e_{2k}$$
with
$$E\left(e_{2k} \mid z_k\right) = 0,\ E\left(e_{2k}^2 \mid z_k\right) = c z_k^h,\ c > 0,\ h \ge 0,\ E\left(e_{2k}e_{2j} \mid z_k, z_j\right) = 0,\ k \ne j = 1,2,\ldots,N,$$
and the $e_{2k}$'s independent of $z$. Assume $e_1$ and $e_2$ are also independent.
Hint: Sahoo and Sahoo (1999a, 1999b).

Exercise 6.16. Let $s = \left\{i_1, i_2, \ldots, i_n\right\}$, where $i_j \in \Omega$ $(j = 1,2,\ldots,n)$, denote a random
sample of $n$ units drawn from $\Omega$ using the sampling design
$$P = \left\{p_s : p_s \ge 0,\ \sum_{s \in S} p_s = 1\right\},$$
where $S$ denotes the set of all possible samples. Assume that paired observations
$\left\{\left(y_i, z_i\right),\ i \in s\right\}$ are available. Suppose $\bar{Z} = \bar{Y}$; that is, both variables have a common
population mean. Let
$$\bar{y}_{HT} = \frac{1}{N}\sum_{i \in s}\frac{y_i}{\pi_i} \quad \text{and} \quad \bar{z}_{HT} = \frac{1}{N}\sum_{i \in s}\frac{z_i}{\pi_i},$$
where $\pi_i = \sum_{s \in S(i)} p_s$ and $S(i)$ denotes the set of all samples containing the $i$th unit.

( a ) Find the bias and variance of the estimator
$$\hat{\mu}_1 = W\bar{y}_{HT} + (1 - W)\bar{z}_{HT}.$$
( b ) Also find the optimum value of $W$ so that the variance of the estimator $\hat{\mu}_1$ is
minimum.
Hint: Tripathi and Chaubey (1992).
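For part ( b ), writing $V(\hat{\mu}_1) = W^2 V(\bar{y}_{HT}) + (1-W)^2 V(\bar{z}_{HT}) + 2W(1-W)\,\mathrm{Cov}$ and setting the derivative to zero gives $W_{opt} = \{V(\bar{z}_{HT}) - \mathrm{Cov}\}/\{V(\bar{y}_{HT}) + V(\bar{z}_{HT}) - 2\,\mathrm{Cov}\}$. The sketch below checks this closed form numerically; the variance and covariance values are hypothetical stand-ins, not quantities from the exercise.

```python
# Two unbiased estimators of a common mean with variances v_y, v_z and
# covariance c (hypothetical values).  The combined estimator
#   mu1 = W * ybar_HT + (1 - W) * zbar_HT
# has variance W^2 v_y + (1-W)^2 v_z + 2 W (1-W) c, minimised at
#   W_opt = (v_z - c) / (v_y + v_z - 2c).
def optimal_weight(v_y, v_z, c):
    return (v_z - c) / (v_y + v_z - 2 * c)

def combined_variance(w, v_y, v_z, c):
    return w**2 * v_y + (1 - w)**2 * v_z + 2 * w * (1 - w) * c

v_y, v_z, c = 4.0, 9.0, 1.5
w = optimal_weight(v_y, v_z, c)
# At w the combined variance is no larger than either component variance.
```

A grid search over $W \in [0,1]$ confirms that no other weight does better, which is the content of the first-order condition.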

Exercise 6.17. In the first phase we select a preliminary large sample $s'$ of $n'$ units
by SRSWOR and only the auxiliary variable $X$ is measured. Let $\hat{M}'_x$ be the
estimator of the median $M_x$ of the auxiliary variable $X$ based on the first phase
sample. In the second phase, a sub-sample $s$ of $n$ units is drawn from the
preliminary large sample by SRSWOR and both the study variable $Y$ and the
auxiliary variable $X$ are measured. Let $\hat{M}_x$ and $\hat{M}_y$ denote the estimators of $M_x$
and $M_y$, respectively, based on the sample drawn at the second phase.
( a ) Study the asymptotic properties of a general class of estimators for estimating
the median $M_y$ given by
$$\hat{M}_H = H\left(\hat{M}_y, u\right),$$
where $u = \hat{M}_x/\hat{M}'_x$ and $H\left(\hat{M}_y, u\right)$ assumes values in a bounded closed convex subset
$R_2$ of two-dimensional real space containing the point $\left(M_y, 1\right)$ such that
$H\left(M_y, 1\right) = M_y$, and satisfies certain regularity conditions.
( b ) Show that the following estimators
$$\hat{M}_2 = \hat{M}_y\frac{\hat{M}'_x}{\hat{M}_x}, \quad \hat{M}_3 = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)^{\alpha}, \quad \text{and} \quad \hat{M}_4 = \hat{M}_y\left[\frac{\hat{M}'_x}{d\hat{M}_x + (1-d)\hat{M}'_x}\right]$$
are special cases of the general class of estimators.
Hint: Singh, Joarder, and Tracy (2001).

Exercise 6.18. Consider a two-phase sampling design in which a second phase
SRSWOR sample of $r$ units is selected from a first phase sample of $n$ units drawn
from an infinite population, and let the regression estimator of the mean of the
characteristic $y$ be
$$\hat{\mu}_y = \bar{y}_2 + \hat{\beta}_{ols}\left(\bar{x}_1 - \bar{x}_2\right),$$
where
$$\left(\bar{y}_2, \bar{x}_2\right) = r^{-1}\sum_{i=1}^{r}\left(y_i, x_i\right), \quad \bar{x}_1 = n^{-1}\sum_{i=1}^{n} x_i, \quad \hat{\beta}_{ols} = \sum_{i=1}^{r}\left(x_i - \bar{x}_2\right)\left(y_i - \bar{y}_2\right)\Big/\sum_{i=1}^{r}\left(x_i - \bar{x}_2\right)^2.$$
( a ) Then show that
$$V\left(\hat{\mu}_y\right) = \left[n^{-1}\rho_{xy}^2 + r^{-1}\left(1 - \rho_{xy}^2\right)\right]\sigma_y^2,$$
where $\rho_{xy}$ is the population correlation between $y$ and $x$ and $\sigma_y^2$ is the
population variance of $y$.
( b ) Also show that an estimator of this variance is given by
$$\hat{V}\left(\hat{\mu}_y\right) = n^{-1}s_{\hat{y}}^2 + r^{-1}s_e^2,$$
where $\hat{y}_i = \bar{y}_2 + \left(x_i - \bar{x}_2\right)\hat{\beta}_{ols}$ for $i = 1,2,\ldots,n$ and $\bar{\hat{y}}_n = n^{-1}\sum_{i=1}^{n}\hat{y}_i$, with
$s_{\hat{y}}^2$ the sample variance of the $\hat{y}_i$ and $s_e^2$ the sample variance of the regression residuals.
Hint: Kim (2001).

Exercise 6.19. Let the first phase sample $s_1$ of $n_1$ units be taken by SRSWOR
sampling, and the second phase sample $s_2$ of $n_2$ units be taken by PPSWR
sampling.

( a ) Differentiate between nested and non-nested two-phase sampling designs.

( b ) Consider an estimator of the population total $Y$ as:
$$\hat{Y} = \frac{N}{n_1 n_2}\sum_{s_2}\frac{y_i}{p_{1i}}.$$
Show that if the design is nested then
$$V\left(\hat{Y}\right) = \frac{N^2\left(1 - f_1\right)}{n_1}S_y^2 + \frac{V\left(\hat{Y}_p\right)}{n_2},$$
and if the design is non-nested then
$$V\left(\hat{Y}\right) = \frac{N^2\left(1 - f_1\right)}{n_1}R^2 S_x^2 + \frac{V\left(\hat{Y}_p\right)}{n_2}\left[1 + \frac{\left(1 - f_1\right)}{n_1}\frac{S_x^2}{\bar{X}^2}\right],$$
where
$$V\left(\hat{Y}_p\right) = \sum_{i=1}^{N} p_i\left(y_i/p_i - Y\right)^2, \quad f_1 = \frac{n_1}{N}, \quad p_{1i} = x_i\Big/\sum_{i \in s_1} x_i, \quad \text{and} \quad p_i = x_i\Big/\sum_{i \in \Omega} x_i.$$

( c ) Consider the ratio estimator of $Y$:
$$\hat{Y}_{Rat} = \frac{N}{n_1}\left(\sum_{i \in s_2} y_i\Big/\sum_{i \in s_2} x_i\right)\sum_{i \in s_1} x_i.$$
Show that if the design is nested then
$$V\left(\hat{Y}_{Rat}\right) = \frac{N^2\left(1 - f_1\right)}{n_1}\left(2R S_{xy} - R^2 S_x^2\right) + \frac{N^2\left(1 - f_2\right)}{n_2}\frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - R x_i\right)^2,$$
and if the design is non-nested then
$$V\left(\hat{Y}_{Rat}\right) = \frac{N^2\left(1 - f_1\right)}{n_1}R^2 S_x^2 + \frac{N^2\left(1 - f_2\right)}{n_2}\frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - R x_i\right)^2,$$
where $f_2 = n_2/N$.
Hint: Hidiroglou (2001).

Exercise 6.20. In the first phase sample $s_1$, consisting of $m$ units, measure the
values of the $p$ auxiliary variables $\left(x_{i1}, x_{i2}, \ldots, x_{ip}\right)$, $i = 1,2,\ldots,m$. In the second phase
sample $s_2 \subset s_1$, consisting of $n$ units, measure the study variable and auxiliary
variables as $\left(y_i, x_{i1}, x_{i2}, \ldots, x_{ip}\right)$, $i = 1,2,\ldots,n$. Let
$$\bar{x}_j^{*} = m^{-1}\sum_{i=1}^{m} x_{ij}, \quad j = 1,2,\ldots,p,$$
denote the means of the $p$ auxiliary variables based on the first phase sample. Also let
$$\bar{y} = n^{-1}\sum_{i=1}^{n} y_i \quad \text{and} \quad \bar{x}_j = n^{-1}\sum_{i=1}^{n} x_{ij}$$
denote the means of the study variable $y$ and the $p$ auxiliary variates in the second phase
sample. Based on the $j$th auxiliary variate, define a difference estimator of the
population mean in two-phase sampling as
$$\bar{y}_d(j) = \bar{y} + b_j\left(\bar{x}_j^{*} - \bar{x}_j\right).$$
Now consider a weighted estimator of the population mean $\bar{Y}$ as
$$\bar{y}_{wd} = \sum_{j=1}^{p} w_j\,\bar{y}_d(j), \quad \text{such that} \quad \sum_{j=1}^{p} w_j = 1,\ 0 < w_j < 1.$$

( a ) Show that $\bar{y}_{wd}$ is unbiased for the population mean $\bar{Y}$ with variance
$$V\left(\bar{y}_{wd}\right) = \mathbf{W}\mathbf{B}\mathbf{W}',$$
where
$$b_{jk} = \left(\frac{1}{m} - \frac{1}{N}\right)S_y^2 + \left(\frac{1}{n} - \frac{1}{m}\right)\left[b_j b_k S_{jk} - b_j S_{yj} - b_k S_{yk}\right],$$
$$S_y^2 = (N-1)^{-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2, \quad S_{jk} = (N-1)^{-1}\sum_{i=1}^{N}\left(x_{ji} - \bar{X}_j\right)\left(x_{ki} - \bar{X}_k\right),$$
$$\mathbf{W} = \left(w_1, w_2, \ldots, w_p\right), \quad \mathbf{B} = \left(b_{jk}\right)_{p \times p}, \quad \text{and} \quad S_{yj} = (N-1)^{-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)\left(x_{ji} - \bar{X}_j\right).$$
( b ) Also show that an unbiased estimator of $V\left(\bar{y}_{wd}\right)$ is given by
$$\hat{V}\left(\bar{y}_{wd}\right) = \left(\frac{1}{m} - \frac{1}{N}\right)s_y^2 + \left(\frac{1}{n} - \frac{1}{m}\right)\frac{1}{n-1}\sum_{i=1}^{n}\left[\left(y_i - \bar{y}\right) - \sum_{j=1}^{p} w_j b_j\left(x_{ij} - \bar{x}_j\right)\right]^2,$$
where $s_y^2 = (n-1)^{-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$.
Hint: Raj (1964).

Exercise 6.21. Consider the problem of estimating the population mean using two-phase
sampling. An initial first phase sample of $m$ units is selected at
random and information on the auxiliary variable $x$ is measured. A second phase
sample of $n$ units is selected by PPSWR sampling with probability proportional
to $x$. Find the variance of the unbiased estimator of the population total $Y$ given by
$$\hat{Y}_{uds} = \frac{N}{nm}\sum_{i=1}^{n}\frac{y_i}{p_i}, \quad \text{where} \quad p_i = x_i\Big/\sum_{i=1}^{m} x_i.$$
Hint: Raj (1964).

Exercise 6.22. Under a superpopulation model, consider $E_m\left(y_i \mid x_i\right) = \alpha + \beta x_i$,
$V_m\left(y_i \mid x_i\right) = a^2 x_i^g$ and $C_m\left(y_i, y_j \mid x_i, x_j\right) = 0$, $i \ne j$, and let the cost functions be
$C = nC_1$ for single phase SRSWOR sampling and $C = nC_1 + mC_2$ for two-phase
SRSWOR sampling, where $C_2 < C_1$.
( a ) Find the optimum first phase and second phase sample size such that the
variance of the usual ratio estimator in two-ph ase sampling under the
superpopulation model is minimum for the fixed cost of surveys .
( b ) Compare the results with single phase sampling under SRSWOR sampling for
the fixed cost.
Hint: Ajg aonkar (1975).
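For variances of the common two-phase form $V = A/n + B/m$ under a linear cost constraint, the Lagrange-multiplier solution has a closed form. The sketch below is a generic illustration with hypothetical components $A$ and $B$ (under the stated model these would involve $a^2$, $g$, $\beta$ and the moments of $x$); it is not the exercise's exact derivation, only the standard allocation it leads to.

```python
from math import sqrt

# For variance V = A/n + B/m and cost C = n*C1 + m*C2 (C2 < C1), the
# fixed-cost optimum from a Lagrange-multiplier argument is:
#   n_opt = C * sqrt(A/C1) / (sqrt(A*C1) + sqrt(B*C2))
#   m_opt = C * sqrt(B/C2) / (sqrt(A*C1) + sqrt(B*C2))
def optimal_sizes(A, B, C1, C2, C):
    denom = sqrt(A * C1) + sqrt(B * C2)
    n = C * sqrt(A / C1) / denom   # second-phase sample size
    m = C * sqrt(B / C2) / denom   # first-phase sample size
    return n, m

A, B, C1, C2, C = 50.0, 20.0, 10.0, 2.0, 1000.0   # hypothetical values
n_opt, m_opt = optimal_sizes(A, B, C1, C2, C)
```

In practice the fractional sizes would be rounded to integers; any feasible perturbation that keeps the cost fixed can only increase $A/n + B/m$.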

Exercise 6.23. In the first phase we select a preliminary large sample $s'$ of $n'$
units by SRSWOR and only the auxiliary variables $X$ and $Z$ are measured. Let
$\hat{M}'_x$ and $\hat{M}'_z$ be the estimators of the medians $M_x$ (unknown) and $M_z$ (known) of the
auxiliary variables $X$ and $Z$ based on the first phase sample. In the second phase, a
sub-sample $s$ of $n$ units is drawn from the preliminary large sample by SRSWOR
and both the study variable $Y$ and the auxiliary variable $X$ are measured. Let $\hat{M}_x$
and $\hat{M}_y$ denote the estimators of $M_x$ and $M_y$, respectively, based on the sample
drawn at the second phase.
( a ) Study the asymptotic properties of a general class of estimators for estimating
the median $M_y$ given by
$$\hat{M}_a = H\left(\hat{M}_y, U, V\right),$$
where $U = \hat{M}_x/\hat{M}'_x$, $V = \hat{M}'_z/M_z$, and $H\left(\hat{M}_y, U, V\right)$ assumes values in a
bounded closed convex subset $R_3$ of three-dimensional real space containing the
point $\left(M_y, 1, 1\right)$ such that $H\left(M_y, 1, 1\right) = M_y$, and satisfies certain regularity conditions.
( b ) Show that the following estimators
$$\hat{M}_2 = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)\left(\frac{M_z}{\hat{M}'_z}\right); \quad \hat{M}_3 = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)^{\alpha}\left(\frac{M_z}{\hat{M}'_z}\right)^{\beta};$$
and
$$\hat{M}_4 = \hat{M}_y\left[\frac{\hat{M}'_x}{d\hat{M}_x + (1-d)\hat{M}'_x}\right]\left[\frac{M_z}{g\hat{M}'_z + (1-g)M_z}\right]$$
are special cases of the general class of estimators.
Hint: Allen, Saxena, Singh, Singh, and Smarandache (2002).

Exercise 6.24. ( I ) Assume that a sample of size $m$ is selected using SRSWR
sampling out of $N$ units to observe the variable $X$, while a sub-sample of size $n$ is
selected out of the $m$ units to observe the study variable $Y$ and auxiliary variable $X$
with the same sampling strategy. Suppose $\bar{y}_n$, $\bar{x}_n$ denote the second phase sample
means, and $\bar{x}_m$ is the first phase sample mean.
( a ) Find the first order bias and mean squared error of the estimators of the population
mean $\bar{Y}$ defined as
$$\bar{y}_R = \bar{y}_n\left(\frac{\bar{x}_m}{\bar{x}_n}\right) \qquad \text{[Bose (1943)]}$$
and
$$\bar{y}_{HR} = \bar{r}\,\bar{x}_m + \frac{m-1}{m}\,s_{rx}, \qquad \text{[Sukhatme (1962)]}$$
where
$$r_i = \frac{y_i}{x_i}, \quad \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i, \quad \text{and} \quad s_{rx} = \frac{1}{n-1}\sum_{i=1}^{n} r_i\left(x_i - \bar{x}_n\right).$$
( II ) Consider another, cheaper auxiliary variable $Z$ that is available for all the units in
the population, so that $\bar{Z} = N^{-1}\sum_{i=1}^{N} z_i$ is known, and let $\bar{z}_m$ be its estimator obtained
from the first phase sample information.

( a ) Find the bias and mean square error of the following estimators:
$$\bar{y}_C = \bar{y}_n\left(\frac{\bar{x}_m}{\bar{x}_n}\right)\left(\frac{\bar{Z}}{\bar{z}_m}\right), \qquad \text{[Chand (1975)]}$$

$$\bar{y}_{ds} = \bar{r}\,\bar{g}\,\bar{Z}, \quad \text{where} \quad \bar{g} = \frac{1}{m}\sum_{i=1}^{m}\frac{x_i}{z_i};$$
and
$$\bar{y}_{dsu} = \left(\bar{r}\,\bar{g} + \frac{m-1}{m}\,s_{rg}\right)\bar{Z} + N^{-1}s_{hz},$$
where $s_{hz} = (n-1)^{-1}\sum_{i \in s} h_i\left(z_i - \bar{z}\right)$ and $s_{rg} = (n-1)^{-1}\sum_{i \in s} g_i\left(r_i - \bar{r}\right)$.

Hint: Dalabehera and Sahoo (2000).

Also find the bias and mean squared error of the estimator
$$\bar{y}_1 = \bar{y} + b_{yx}\left[\bar{x}_m\left(\frac{\bar{Z} - t_1}{\bar{z}_m - t_2}\right)^{\alpha} - \bar{x}_n\right],$$
where $\alpha$ is a suitably chosen constant, and $t_1$ and $t_2$ are suitably chosen statistics
such that their means exist, and
$$b_{yx} = \sum_{i=1}^{n}\left(y_i - \bar{y}_n\right)\left(x_i - \bar{x}_n\right)\Big/\sum_{i=1}^{n}\left(x_i - \bar{x}_n\right)^2.$$

( b ) Further show that the following estimators:
$$\bar{y}_1^{(1)} = \bar{y} + b_{yx}\left[\bar{x}_m\left(\frac{\bar{Z}}{\bar{z}_m}\right)^{\alpha} - \bar{x}_n\right]; \quad \bar{y}_1^{(2)} = \bar{y} + b_{yx}\left[\bar{x}_m\left(2 - \frac{\bar{z}_m}{\bar{Z}}\right) - \bar{x}_n\right];$$
$$\bar{y}_1^{(3)} = \bar{y} + b_{yx}\left[\bar{x}_m\left\{\frac{\bar{Z}}{\bar{Z} + a_1\left(\bar{z}_m - \bar{Z}\right)}\right\} - \bar{x}_n\right];$$
and
$$\bar{y}_1^{(4)} = \bar{y} + b_{yx}\left[\left\{a_1\bar{x}_m + \left(1 - a_1\right)\bar{x}_m\left(\frac{\bar{Z}}{\bar{z}_m}\right)^{a_2}\right\} - \bar{x}_n\right],$$
Hint: Das and Tripathi (1979), Singh, Singh, and Upadhyaya (2001) .

Exercise 6.25. Consider a population of $N$ identifiable units on which a study
variable $Y$ is associated with $k$ auxiliary variables $X_1, X_2, \ldots, X_k$. Let
$x_{hi}^{*}$, $h = 1,2,\ldots,m$, be the information collected on the $k$ auxiliary variables in the
first phase sample of $m$ units. Let $\left(y_h, x_{hi}\right)$, $h = 1,2,\ldots,n$, be the observed values of
the $(k+1)$ variables in an SRSWR second phase sample.
Consider the problem of estimating the finite population variance
$$\sigma_0^2 = \frac{1}{N}\sum_{h=1}^{N}\left(y_h - \bar{Y}\right)^2, \quad \text{where} \quad \bar{Y} = \frac{1}{N}\sum_{h=1}^{N} y_h.$$
Let
$$s_i^{*2} = (m-1)^{-1}\sum_{h=1}^{m}\left(x_{hi}^{*} - \bar{x}_{im}^{*}\right)^2, \quad \text{with} \quad \bar{x}_{im}^{*} = m^{-1}\sum_{h=1}^{m} x_{hi}^{*},$$
be an estimator of $\sigma_i^2$ based on the preliminary large sample of $m$ units. Further, let
$$s_i^2 = (n-1)^{-1}\sum_{h=1}^{n}\left(x_{hi} - \bar{x}_{in}\right)^2, \quad \text{with} \quad \bar{x}_{in} = n^{-1}\sum_{h=1}^{n} x_{hi},$$
and
$$s_0^2 = (n-1)^{-1}\sum_{h=1}^{n}\left(y_h - \bar{y}_n\right)^2, \quad \text{with} \quad \bar{y}_n = n^{-1}\sum_{h=1}^{n} y_h,$$
be the estimators of $\sigma_i^2$, $i = 0,1,2,\ldots,k$, based on the second phase sample. Find the bias
and variance of the following estimators of $\sigma_0^2$:
$$( a )\ \hat{\sigma}_{01}^2 = \sum_{i=1}^{k} w_i F_i s_i^{*2}, \quad \text{where} \quad F_i = \frac{s_0^2}{s_i^2},\ i = 1,2,\ldots,k,\ 0 < w_i < 1\ \text{and}\ \sum_{i=1}^{k} w_i = 1;$$
$$( b )\ \hat{\sigma}_{02}^2 = \sum_{i=1}^{k} w_i F_i s_i^{*2} + w_{k+1}s_0^2, \quad \text{where} \quad \sum_{i=1}^{k+1} w_i = 1.$$
Hint: Singh and Singh (2001).

Exercise 6.26. Consider a two-phase design in which the sampling scheme at each
phase is as follows:

( i ) The first phase sample $s^{*}$ of $m$ ($m < N$) units is drawn from the population
$\Omega$ to observe two auxiliary variables $x$ and $z$.
( ii ) The second phase sample $s$ of size $n$ ($n < m$) is drawn from $s^{*}$ to observe
$y$, $x$, and $z$.
Assume the population mean $\bar{Z}$ of the auxiliary variable $z$ is known.
Let
$$\bar{y} = n^{-1}\sum_{i=1}^{n} y_i, \quad \bar{x} = n^{-1}\sum_{i=1}^{n} x_i, \quad \bar{x}^{*} = m^{-1}\sum_{i=1}^{m} x_i, \quad \text{and} \quad \bar{z}^{*} = m^{-1}\sum_{i=1}^{m} z_i.$$

After decomposing the whole population $\Omega$ into three mutually exclusive domains
$s$, $r_2 = \bar{s} \cap s^{*}$ and $r_3 = \Omega - s^{*}$ of $n$, $(m-n)$ and $(N-m)$ units respectively,
where $\bar{s} = \Omega - s$, the population mean can be written as
$$\bar{Y} = \frac{1}{N}\left[\sum_{i \in s} y_i + \sum_{i \in r_2} y_i + \sum_{i \in r_3} y_i\right] = f\bar{y} + \left(f^{*} - f\right)\bar{Y}_2 + \left(1 - f^{*}\right)\bar{Y}_3,$$
where $f = n/N$ and $f^{*} = m/N$.

Study the properties of the following estimator of the population mean $\bar{Y}$, defined as
$$\hat{\bar{Y}} = f\bar{y} + \left(f^{*} - f\right)T_2 + \left(1 - f^{*}\right)T_1,$$
under the following situations:
$$( a )\ T_2 = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\ \text{and}\ T_1 = \bar{y}\left(\frac{\bar{Z}}{\bar{z}^{*}}\right); \qquad ( b )\ T_2 = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\ \text{and}\ T_1 = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\left(\frac{\bar{Z}}{\bar{z}^{*}}\right).$$
Hint: Sahoo and Sahoo (2001).

Exercise 6.27. Consider the problem of estimating the population total $Y$ in the
presence of two auxiliary variables $x$ and $z$. The auxiliary information is
available at three levels as follows:
( a ) at the level of the population $\Omega$ of size $N$: the total $\sum_{i \in \Omega} x_i$;

( b ) at the level of the first phase sample $s_1$ of size $m$: $x_i$ and $z_i$ are known for $i \in s_1$;

( c ) at the level of the second phase sample $s_2$ of size $n$: $y_i$, $x_i$ and $z_i$ are known for
every $i \in s_2$.
Study the following calibrated estimator of the population total $Y$ defined as
$$\hat{Y}_{ES} = \sum_{i \in s_2} w_{2i}\,y_i,$$
where the $w_{2i}$, $i = 1,2,\ldots,n$, are the second phase calibrated weights obtained under the
following situations:
Situation 1. Calibrate $x$ from $s_1$ to $\Omega$ to obtain the $w_{1i}$, then calibrate both $x$ and $z$ from $s_2$ to $s_1$ to obtain the $w_{2i}$: $\sum_{s_1} w_{1i}x_i = \sum_{i \in \Omega} x_i$, $\sum_{s_2} w_{2i}x_i = \sum_{i \in s_1} w_{1i}x_i$ and $\sum_{s_2} w_{2i}z_i = \sum_{i \in s_1} w_{1i}z_i$.

Situation 2. Calibrate $x$ from $s_1$ to $\Omega$ to obtain the $w_{1i}$, and then $x$ from $s_2$ to $s_1$ to obtain the $w_{2i}$: $\sum_{s_1} w_{1i}x_i = \sum_{i \in \Omega} x_i$, $\sum_{s_2} w_{2i}x_i = \sum_{i \in s_1} w_{1i}x_i$. Note: it is a special case of the first situation.

Situation 3. Calibrate $x$ from $s_1$ to $\Omega$ to obtain the $w_{1i}$, and then $z$ from $s_2$ to $s_1$ to find the $w_{2i}$: $\sum_{s_1} w_{1i}x_i = \sum_{i \in \Omega} x_i$, $\sum_{s_2} w_{2i}z_i = \sum_{i \in s_1} w_{1i}z_i$. Note: it is a special case of the first situation.

Situation 4. Calibrate $x$ from $s_1$ to $\Omega$ to obtain the $w_{1i}$, and take $w_{2i} = d_{2i}w_{1i}$, where the $d_{2i}$ are the known design weights at the second phase sample: $\sum_{s_1} w_{1i}x_i = \sum_{i \in \Omega} x_i$. Note: it is a special case of the first situation.

Situation 5. Calibrate on $x$ from $s_2$ to $\Omega$ and on $z$ from $s_2$ to $s_1$ to find the $w_{2i}$ directly in a single shot of calibration: $\sum_{s_2} w_{2i}x_i = \sum_{i \in \Omega} x_i$, $\sum_{s_2} w_{2i}z_i = \sum_{i \in s_1} d_{1i}z_i$, where the $d_{1i}$ are the known first phase design weights.

Situation 6. Calibrate on $x$ from $s_2$ to $\Omega$ to find the $w_{2i}$: $\sum_{s_2} w_{2i}x_i = \sum_{i \in \Omega} x_i$. Note: it is a special case of the fifth situation.

Situation 7. Calibrate both $x$ and $z$ from $s_2$ to $s_1$ to obtain the $w_{2i}$: $\sum_{s_2} w_{2i}x_i = \sum_{i \in s_1} d_{1i}x_i$, $\sum_{s_2} w_{2i}z_i = \sum_{i \in s_1} d_{1i}z_i$.

Situation 8. Calibrate on $x$ from $s_2$ to $s_1$ to find the $w_{2i}$: $\sum_{s_2} w_{2i}x_i = \sum_{i \in s_1} d_{1i}x_i$. Note: it is a special case of the seventh situation.

Situation 9. Calibrate on $z$ from $s_2$ to $s_1$ to find the $w_{2i}$: $\sum_{s_2} w_{2i}z_i = \sum_{i \in s_1} d_{1i}z_i$. Note: it is a special case of the seventh situation.

Situation 10. Calibrate from $s_2$ to $s_1$ to find the $w_{2i}$ with $\sum_{s_2} w_{2i}f\left(x_i, z_i\right) = \sum_{i \in s_1} d_{1i}f\left(x_i, z_i\right)$, where $f\left(x_i, z_i\right)$ is any joint function of both auxiliary variables, e.g. $f\left(x_i, z_i\right) = x_i^{\alpha}z_i^{\beta}$ with $\alpha$ and $\beta$ known. Note: it is a special case of the seventh situation.
Hint: Estevao and Särndal (2002).
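Under the chi square distance $\sum_i (w_{2i} - d_i)^2/d_i$, each single-constraint situation above (e.g. situation 6) has a closed-form solution. The sketch below is a minimal illustration with hypothetical weights and auxiliary values, not data from the book:

```python
import numpy as np

# Chi-square-distance calibration with one constraint: adjust the weights
# so that the weighted x-total matches a known total X.  The closed form of
#   min sum (w - d)^2 / d   subject to   sum w*x = X
# is  w_i = d_i + d_i * x_i * (X - sum d*x) / sum d*x^2.
def calibrate(d, x, X_total):
    adj = (X_total - np.sum(d * x)) / np.sum(d * x**2)
    return d + d * x * adj

d = np.array([10.0, 12.0, 8.0, 15.0, 9.0])   # hypothetical design weights
x = np.array([3.0, 5.0, 2.0, 7.0, 4.0])      # auxiliary values on s2
X_total = 260.0                              # known population x-total
w = calibrate(d, x, X_total)
```

The multi-constraint situations replace the scalar adjustment by a vector of Lagrange multipliers solved from a small linear system, but the structure is the same.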

Exercise 6.28. ( I ) Consider a population represented by $\Omega = \{1,2,\ldots,i,\ldots,N\}$. A
first phase probability sample $s_1$ ($s_1 \subset \Omega$) is drawn from the population $\Omega$ using a
sampling design that generates the selection probabilities $\pi_{1i}$. Given that the first
phase sample $s_1$ has been drawn, the second phase sample $s_2$ ($s_2 \subset s_1 \subset \Omega$) is
selected from $s_1$ with a sampling design with the selection probabilities $\pi_{2i} = \pi_{i \mid s_1}$.
Evidently the first phase and second phase sampling weights are defined as
$d_{1i} = 1/\pi_{1i}$ and $d_{2i} = 1/\pi_{2i}$, respectively. The overall sampling weight for the
$i$th unit selected in the second phase sample $s_2$ is $d_i^{*} = d_{1i}d_{2i}$. Consider the
problem of estimating general parameters of interest of the form
$$H_y = \sum_{i \in \Omega} H(y_i) \quad \text{and} \quad \bar{H}_y = N^{-1}\sum_{i \in \Omega} H(y_i)$$
for a specified function $H$.
( a ) Study the asymptotic properties of the estimator of $H_y$ in two-phase sampling,
defined as
$$\hat{H}_1 = \sum_{i=1}^{n} d_i^{*}\,H(y_i),$$
where the $d_i^{*}$ are the ultimate calibrated weights obtained by minimising a chi square
distance from the second phase design weights.

( b ) Show that the chain ratio type estimator of the population parameter $H_y$,
$$\hat{H}_R = \left\{\sum_{i=1}^{n} d_{1i}d_{2i}H(y_i)\right\}\left[\frac{\sum_{i=1}^{m} d_{1i}H(x_i)}{\sum_{i=1}^{n} d_{1i}d_{2i}H(x_i)}\right]\left[\frac{H_z}{\sum_{i=1}^{m} d_{1i}H(z_i)}\right],$$
and the chain regression type estimator,
$$\hat{H}_G = \sum_{i=1}^{n} d_{1i}d_{2i}H(y_i) + \hat{\beta}_1\left[\sum_{i=1}^{m} d_{1i}H(x_i) - \sum_{i=1}^{n} d_{1i}d_{2i}H(x_i)\right] + \hat{\beta}_2\left[H_z - \sum_{i=1}^{m} d_{1i}H(z_i)\right],$$
where
$$\hat{\beta}_1 = \sum_{i=1}^{n} d_{1i}d_{2i}H(x_i)H(y_i)\Big/\sum_{i=1}^{n} d_{1i}d_{2i}\left\{H(x_i)\right\}^2 \quad \text{and} \quad \hat{\beta}_2 = \hat{\beta}_1\left[\sum_{i=1}^{m} d_{1i}H(x_i)H(z_i)\Big/\sum_{i=1}^{m} d_{1i}\left\{H(z_i)\right\}^2\right],$$
are special cases of the calibrated estimator.
Hint: Singh and Puertas (2002).

( II ) Let $\hat{H}_G(j)$ be the calibrated estimator of the population distribution function
obtained by dropping the $j$th unit from the sample $s_1$ of $m$ units, following Singh
and Puertas (2002). Then the Jackknifed estimator of the population distribution
function in two-phase sampling can be written as
$$\hat{H}_G(j) = \begin{cases} \hat{H}_{2y}(j) + \hat{\beta}_2(j)\left\{\hat{H}_{1x}(j) - \hat{H}_{2x}(j)\right\} + \hat{\beta}_1(j)\hat{\beta}_2(j)\left\{H_z - \hat{H}_{1z}(j)\right\} & \text{if } j \in s_2, \\[1mm] \hat{H}_{2y} + \hat{\beta}_2\left\{\hat{H}_{1x}(j) - \hat{H}_{2x}\right\} + \hat{\beta}_1(j)\hat{\beta}_2\left\{H_z - \hat{H}_{1z}(j)\right\} & \text{if } j \in \left(s_1 - s_2\right), \end{cases}$$
where $s_2$ is the second phase sample of $n$ units,
$$\hat{H}_{1z}(j) = \hat{H}_{1z} + w_{1j}\left(1 - w_{1j}\right)^{-1}\left(\hat{H}_{1z} - H(z_j)\right), \quad \hat{H}_{1x}(j) = \hat{H}_{1x} + w_{1j}\left(1 - w_{1j}\right)^{-1}\left(\hat{H}_{1x} - H(x_j)\right),$$
$$\hat{H}_{2x}(j) = \hat{H}_{2x} + w_{2j}\left(1 - w_{2j}\right)^{-1}\left(\hat{H}_{2x} - H(x_j)\right), \quad \hat{H}_{2y}(j) = \hat{H}_{2y} + w_{2j}\left(1 - w_{2j}\right)^{-1}\left(\hat{H}_{2y} - H(y_j)\right),$$
and $\hat{\beta}_1$, $\hat{\beta}_1(j)$, $\hat{\beta}_2$, and $\hat{\beta}_2(j)$ have their usual meanings.
( a ) Show that
$$\hat{H}_G(j) - \hat{H}_G = \begin{cases} \varepsilon_2(j) + \hat{\beta}_2\varepsilon_1(j) + \hat{\beta}_2(j)d_2(j) + \hat{\beta}_2 O_2(j) & \text{if } j \in s_2, \\[1mm] \hat{\beta}_2\varepsilon_1(j) & \text{if } j \in \left(s_1 - s_2\right), \end{cases}$$
where
$$\varepsilon_2(j) = \left(\hat{H}_{2y}(j) - \hat{H}_{2y}\right) - \hat{\beta}_2(j)\left(\hat{H}_{2x}(j) - \hat{H}_{2x}\right) - \hat{\beta}_1(j)\hat{\beta}_2(j)\left(\hat{H}_{1z}(j) - H_z\right),$$
$$\varepsilon_1(j) = \left(\hat{H}_{1x}(j) - \hat{H}_{1x}\right) - \hat{\beta}_1(j)\left(\hat{H}_{1z}(j) - H_z\right), \quad d_2(j) = \left(\hat{H}_{1x}(j) - \hat{H}_{2x}(j)\right),$$
and
$$O_2(j) = \left(\hat{H}_{2x}(j) - \hat{H}_{1x}(j)\right) - \hat{\beta}_1(j)\left(H_z - \hat{H}_{1z}(j)\right) - \hat{\beta}_1\left(H_z - \hat{H}_{1z}\right).$$

( b ) Show that the modified Jackknife estimator of the variance of $\hat{H}_G$ given by
$$v_J\left(\hat{H}_G\right) = \frac{m-1}{m}\sum_{j \in s_1}\left\{\hat{H}_G(j) - \hat{H}_G\right\}^2$$
is approximately equivalent to the variance of the chain ratio and regression
estimators.
Hint: Singh (2003b).

Exercise 6.29. Under the concept of two-phase sampling, compare the mean
squared error of the ratio estimator $\hat{M}_R$ defined as
$$\hat{M}_R = \hat{M}_y\left(\frac{\hat{M}'_x}{\hat{M}_x}\right)$$
with that of the following estimator of the population median $M_y$ defined as
$$\hat{M}_y^{(\delta)} = \hat{M}_y\left(\frac{A - \hat{M}'_x}{A - \hat{M}_x}\right),$$
where $A$ is a suitably chosen real constant.
Hint: Singh, Singh, and Puertas (2003a).

Practical 6.1. Select a first phase sample of 15 units by SRSWOR sampling from
population 1 of the Appendix and record only the nonreal estate farm loans from the
units selected in the preliminary sample. Select a sub-sample of 5 units from the
preliminary large sample and note the real and nonreal estate farm loans. Estimate
the average real estate farm loans by using the ratio estimator. Construct the 95%
confidence intervals by estimating the variance of the ratio estimator with two
different estimators: ( a ) Jackknifing; ( b ) the method of moments.

Practical 6.2. A key bank in the United States of America is interested in the
average of real estate farm loans. The bank manager has information about nonreal
estate farm loans in 15 states selected by SRSWOR sampling. From these selected
states select a sub-sample of 5 states and note the real estate farm loans as well as
nonreal estate farm loans from population I given in the Appendix. Use the
following estimators to estimate the average real estate farm loans
$$\bar{y}_1 = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\left(\frac{s_x^{*2}}{s_x^2}\right) \quad \text{and} \quad \bar{y}_2 = \bar{y} + \alpha\left(\bar{x}^{*} - \bar{x}\right) + \beta\left(s_x^{*2} - s_x^2\right).$$
Suggest estimators for the optimum values of $\alpha$ and $\beta$. Construct the 95% confidence
interval in each case. Explain the difference in the estimates based on these two
estimators to the bank manager.

Practical 6.3. A private consultant selected first phase and second phase samples of
sizes 20 and 10 respectively. Discuss the relative efficiency of the general class of
estimators for estimating average amount of the real estate farm loans during 1997
by using information selected in the first phase sample only on the nonreal estate
farm loans during 1997, with respect to the regression estimator of population
mean.

Practical 6.4. People Bank has information about the real and nonreal estate farm
loans (in $000) during 1997 in the United States, as presented in population 1.
If the bank selects first phase and second phase samples of sizes 10 and 5
respectively, then:
( a ) Find the relative efficiency of the ratio estimator, for estimating the average
amount of the real estate farm loans during 1997 by using information selected in
the first phase sample only on the nonreal estate farm loans during 1997, with
respect to the usual estimator of population mean;
( b ) Suppose a budget of US$5000 is available to spend on the survey, out of
which, $2000 will be the overhead cost. Suppose selection, compilation, and
analysing of one unit in the first phase sample costs $50, while for the second
phase unit the cost is $500. Find the optimum values of the first phase and second
phase sample sizes. Also find the relative efficiency of the ratio estimator over the
sample mean for the fixed cost;
( c ) What will be the minimum cost for attaining a 20% relative standard deviation?
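Part ( c ) can be answered in closed form once the variance is written as $V = A/n + B/m$: the allocation ratio is the same as in part ( b ), and the minimum sampling cost for a target variance $V_0$ is $(\sqrt{AC_2} + \sqrt{BC_1})^2/V_0$, where $C_1$ and $C_2$ are the per-unit phase costs. The sketch below plugs in the stated costs with hypothetical variance components and population mean, all stand-ins for quantities that would be estimated from population 1:

```python
from math import sqrt

# Practical 6.4(c): minimum cost for a 20% relative standard deviation.
# Costs from the practical: $2000 overhead, $50 per first-phase unit (c1),
# $500 per second-phase unit (c2).  A, B and ybar are assumed values.
A, B = 400.0, 150.0            # hypothetical variance components
c1, c2 = 50.0, 500.0           # first- and second-phase unit costs
ybar = 55.0                    # hypothetical population mean
v_target = (0.20 * ybar)**2    # variance giving a 20% relative s.d.

# Duality with the fixed-cost problem gives the allocation and the minimum
# sampling cost C_min = (sqrt(A*c2) + sqrt(B*c1))**2 / v_target.
denom = sqrt(A * c2) + sqrt(B * c1)
n_opt = denom * sqrt(A / c2) / v_target   # second-phase size
m_opt = denom * sqrt(B / c1) / v_target   # first-phase size
cost = c2 * n_opt + c1 * m_opt + 2000.0   # add the $2000 overhead
```

The fractional optimum sizes would be rounded up in practice, which raises the cost slightly above the theoretical minimum.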

Practical 6.5. A private company XYZ selected first phase and second phase
samples of size 20 and 10, respectively, from population 1 given in the Appendix.
Find the relative efficiency of the regression estimator, for estimating average
amount of real estate farm loans during 1997 by using information selected in the
first phase sample only on the nonreal estate farm loans during 1997, with respect
to the ratio estimator of population mean.

Practical 6.6. Mr. Nelson wishes to make a future strategy of selection of estimator
while estimating the average real estate farm loans. Suppose he selected first phase
and second phase samples each of size 10 and 5 respectively. Suggest a few
estimators as a member of the general class of estimators. Find the relative
efficiency of these members for estimating average amount of real estate farm loans
during 1997 by using information selected in the first phase sample only on the
nonreal estate farm loans during 1997 with respect to the regression estimator in
two-phase sampling of the population mean. Has Mr. Nelson any hope in finding a
new estimator?

Practical 6.7. Ms. Stephanie Singh selects a preliminary large sample of 20 units
by PPSWOR sampling using the Midzuno--Sen sampling scheme, taking the
number of species groups during 1992 as an auxiliary variable, as given in
population 4 of the Appendix. Select a second-phase sample of 10 units from the
given first phase sample by using the Midzuno--Sen sampling scheme. Find the
calibration weights for the units selected in the ultimate sample by making use of
known information about the number of fish caught during 1994 as an auxiliary
variable. Use the chi-square distance function between the design weights and
calibration weights. Discuss three cases when these weights lead to the GREG,
ratio, and traditional linear regression estimator for estimating the total number of
fish caught during 1995. Deduce the estimates of the total number of fish in each
case.

Practical 6.8. Use the Midzuno--Sen sampling scheme to select a preliminary large
sample of 15 units by using the number of fish caught during 1992 in the United
States as a selection variable given in the population 4. Collect the information on
the number of fish caught during 1993 and 1994 from the units selected in the
sample. Assuming that the total number of fish caught during 1993 is known, derive
the first phase calibration weights and hence estimate the total number of fish
caught during 1994 in the United States. Select a second phase sample of 10 units
from the given first phase sample by using the Midzuno--Sen sampling scheme.
Collect the information on the number of fish caught during 1994 and 1995 for the
selected units in the second phase sample. Derive the second phase calibration
weights, and hence deduce the estimate of the number of fish caught during 1995
in the United States.
614 Advanced sampling theory with applications

Practical 6.9. Take a first phase sample of 15 units by SRSWOR sampling and
note only the nonreal estate farm loans from the units selected in the sample given
in population 1 of the Appendix. Select a sub-sample of 10 units from the given
preliminary large sample and note the real estate farm loans as well as the nonreal
estate farm loans. Estimate the average real estate farm loans by using the
regression estimator. Deduce the 95% confidence interval.

Practical 6.10. The real and nonreal estate farm loans (in $000) during 1997 in
the 50 states of the United States have been presented in population 1 of the
Appendix. Suppose we selected an SRSWOR first phase sample of ten states to
collect the information on nonreal estate farm loans only. From the given first phase
sample of ten units, we selected a second phase sample of seven units, and both the
real and nonreal estate farm loans are observed. Find the relative efficiency of the
ratio estimator of the median, for estimating the median amount of the real estate
farm loans during 1997 by using information on the nonreal estate farm loans during
1997 collected in the first phase sample, with respect to the usual estimator of the
population median. Assume that the real and nonreal estate farm loans follow
a bivariate normal distribution.

Hint: The joint p.d.f. of the bivariate normal distribution is given by
$$ f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho_{xy}^2}}\exp\left\{-\frac{1}{2(1-\rho_{xy}^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2+\left(\frac{y-\mu_y}{\sigma_y}\right)^2-2\rho_{xy}\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)\right]\right\}. $$
7. SYSTEMATIC SAMPLING

A sampling scheme in which only the first unit is selected at random, the rest being
automatically selected according to a predetermined pattern, is known as systematic
sampling. Systematic sampling provides a very simple sampling design in practice
to select a sample of size n from a population of size N. Systematic sampling is
both operationally convenient and efficient in sampling some natural populations,
such as forest areas, for estimating the volume of timber, hardwood seedlings, etc.

There are two possibilities:

( I ) When N = nk and k is an integer;   ( II ) When N ≠ nk.

Let us discuss each of these cases in detail.
Case I. If N = nk, the N units in the population can be arranged in n rows as
shown in Table 7.1.1 and named as a sequential list of population units.
Such a list can be prepared only if we have a finite number of units in the
population.

Table 7.1.1. Sequential list of population units.
1         2         3         ...  r         ...  k
k+1       k+2       k+3       ...  k+r       ...  2k
2k+1      2k+2      2k+3      ...  2k+r      ...  3k
...       ...       ...       ...  ...       ...  ...
(n-1)k+1  (n-1)k+2  (n-1)k+3  ...  (n-1)k+r  ...  nk = N
ȳ_1       ȳ_2       ȳ_3       ...  ȳ_r       ...  ȳ_k   (column means)

The first step is to select a random number from 1 to k, that is, in the range of
integers listed in the first row. Let the first selected random number be 2. Then the
first unit selected in the sample is number 2 in the sequential list. After selecting
this unit from the population, every k-th unit is automatically included in the
sample. Thus the units in the sample of size n are at the serial numbers
$$ 2,\; k+2,\; 2k+2,\; \ldots,\; (n-1)k+2. $$
The random number selected from 1 to k is called the random start. The number k
is called the sampling interval. Corresponding to each random number from 1 to k,
there is only one possible sample of size n. Thus in systematic sampling the total
number of samples will be k. If r denotes the random start, then the systematic
sample consists of the units at the serial numbers given by the sequence
$$ \{r+ik,\; i = 0, 1, 2, \ldots, (n-1)\}. $$
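The selection rule just described can be sketched in a few lines of Python (an illustrative sketch, not part of the book; the function name is ours):

```python
def systematic_sample(N, n, r):
    # Serial numbers (1-based) of a linear systematic sample with
    # random start r, assuming N = n * k for an integer skip k.
    k = N // n
    return [r + i * k for i in range(n)]

# Random start r = 2 with N = 50, n = 5 gives k = 10 and the
# serial numbers 2, 12, 22, 32, 42.
print(systematic_sample(50, 5, 2))
```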

S. Singh, Advanced Sampling Theory with Applications
© Kluwer Academic Publishers 2003

The sequential list of population units can either be made using known magnitudes
of auxiliary information or by just numbering the list of population units. Assume
$\bar{y}_r$ denotes the sample mean corresponding to random start r. Then
$$ \bar{y}_r = \frac{1}{n}\sum_{i=0}^{n-1} y_{r+ik}. \qquad (7.1.1) $$
Thus we have the following theorem.

Theorem 7.1.1. The sample mean $\bar{y}_r$ is an unbiased estimator of the population
mean $\bar{Y}$.
Proof. Note that we have k possible samples in the whole population, thus the
probability of selecting one sample is $1/k$. By the definition of expected value we
have
$$ E(\bar{y}_r) = \frac{1}{k}\sum_{r=1}^{k}\bar{y}_r = \frac{1}{k}\sum_{r=1}^{k}\left(\frac{1}{n}\sum_{i=0}^{n-1}y_{r+ik}\right) = \frac{1}{N}\sum_{i=1}^{N}Y_i = \bar{Y}. \qquad (7.1.2) $$
Hence the theorem.

Theorem 7.1.2. The variance between the k sample means is given by
$$ V(\bar{y}_r) = \sigma_b^2 = \frac{1}{k}\sum_{r=1}^{k}\left(\bar{y}_r-\bar{Y}\right)^2. \qquad (7.1.3) $$
Proof. By the definition of variance we have
$$ V(\bar{y}_r) = E(\bar{y}_r^2)-\{E(\bar{y}_r)\}^2 = \frac{1}{k}\sum_{r=1}^{k}\bar{y}_r^2-\bar{Y}^2 = \frac{1}{k}\sum_{r=1}^{k}\left(\bar{y}_r-\bar{Y}\right)^2. $$
Hence the theorem.

From (7.1.3) one can easily calculate the variance between the sample means if all
possible sample means and the population mean are known. Unfortunately, we
generally have knowledge of only one sample mean, and one sample mean alone
cannot be used to estimate the variance between all possible sample means. In
(7.1.3), $\sigma_b^2$ stands for the variance between all possible samples selected by using
systematic sampling. Till now we have discussed the variation between the
different column means of Table 7.1.1. A natural question arises, since there will
also be variation within the units of each column. Such variation is called within
sample variation. Suppose $y_{ri}$ denotes the $i$th ($i = 1, 2, 3, \ldots, n$) observation in the
$r$th sample. Then the $r$th sample mean can also be defined as
$$ \bar{y}_r = \frac{1}{n}\sum_{i=1}^{n}y_{ri}. \qquad (7.1.4) $$
The variation within the $r$th sample can be defined as
$$ \sigma_r^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_{ri}-\bar{y}_r\right)^2. \qquad (7.1.5) $$
Chapter 7: Systematic Sampling 617

Thus the overall measure of within sample variation can be defined as the average
of these values across all k samples as
$$ \sigma_w^2 = \frac{1}{k}\sum_{r=1}^{k}\sigma_r^2 = \frac{1}{k}\sum_{r=1}^{k}\frac{1}{n}\sum_{i=1}^{n}\left(y_{ri}-\bar{y}_r\right)^2 = \frac{1}{N}\sum_{i=1}^{n}\sum_{r=1}^{k}\left(y_{ri}-\bar{y}_r\right)^2. \qquad (7.1.6) $$

Now we have the following theorem.

Theorem 7.1.3. The total variance of the population is the sum of the between and
within sample variances, that is,
$$ \sigma^2 = \sigma_b^2+\sigma_w^2. \qquad (7.1.7) $$
Proof. Note that
$$ N\sigma^2 = \sum_{i=1}^{n}\sum_{r=1}^{k}\left(y_{ri}-\bar{Y}\right)^2 = \sum_{i=1}^{n}\sum_{r=1}^{k}\left\{\left(y_{ri}-\bar{y}_r\right)+\left(\bar{y}_r-\bar{Y}\right)\right\}^2 = N\sigma_w^2+N\sigma_b^2, $$
because the cross-product term vanishes on summing over i within each sample.
Hence the theorem.

Remark 7.1.1. In systematic sampling we have:

within sample variation
$$ \sigma_w^2 = \frac{1}{N}\sum_{i=1}^{n}\sum_{r=1}^{k}\left(y_{ri}-\bar{y}_r\right)^2, \qquad (7.1.8) $$
between sample variation
$$ \sigma_b^2 = \frac{n}{N}\sum_{r=1}^{k}\left(\bar{y}_r-\bar{Y}\right)^2, \qquad (7.1.9) $$
total variation
$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{n}\sum_{r=1}^{k}\left(y_{ri}-\bar{Y}\right)^2, \qquad (7.1.10) $$
so that
$$ \sigma^2 = \sigma_w^2+\sigma_b^2. \qquad (7.1.11) $$
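The decomposition in Remark 7.1.1 can be verified numerically. The sketch below (an illustration with an arbitrary toy population, not from the book) checks (7.1.11) by enumerating all k systematic samples:

```python
# Numerical check of sigma^2 = sigma_b^2 + sigma_w^2 (Theorem 7.1.3)
# on an arbitrary toy population with N = 6, n = 3, k = 2.
y = [3.0, 7.0, 4.0, 1.0, 8.0, 5.0]
N, n, k = 6, 3, 2

# The k systematic samples (the columns of Table 7.1.1).
samples = [[y[(r - 1) + i * k] for i in range(n)] for r in range(1, k + 1)]
Ybar = sum(y) / N
means = [sum(s) / n for s in samples]

sigma2 = sum((v - Ybar) ** 2 for v in y) / N                  # (7.1.10)
sigma_b2 = sum((m - Ybar) ** 2 for m in means) / k            # (7.1.9)
sigma_w2 = sum(sum((v - m) ** 2 for v in s)
               for s, m in zip(samples, means)) / N           # (7.1.8)
assert abs(sigma2 - (sigma_b2 + sigma_w2)) < 1e-9
```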
We now discuss the situations under which systematic sampling is better than
simple random sampling without replacement. The variance of the estimator of the
population mean under systematic sampling is
$$ V(\bar{y}_{sy}) = \sigma_b^2 = \sigma^2-\sigma_w^2 \qquad (7.1.12) $$
and that under SRSWOR is
$$ V(\bar{y}_{srs}) = \frac{(N-n)}{n(N-1)}\sigma^2. \qquad (7.1.13) $$

The design effect (DE), or relative efficiency of SRSWOR with respect to
systematic sampling, is defined as
$$ DE = \frac{V(\bar{y}_{sy})}{V(\bar{y}_{srs})} = \frac{n(N-1)}{(N-n)}\left(1-\frac{\sigma_w^2}{\sigma^2}\right). \qquad (7.1.14) $$
Now if DE = 1 then
$$ \sigma_w^2 = \left(1-\frac{1}{n}\right)\left(1-\frac{1}{N}\right)^{-1}\sigma^2. \qquad (7.1.15) $$
Evidently if DE < 1, i.e.,
$$ \sigma_w^2 > \left(1-\frac{1}{n}\right)\left(1-\frac{1}{N}\right)^{-1}\sigma^2, $$
then systematic sampling for a sample of size n remains better than an SRSWOR
sample of the same size. The greater the variation $\sigma_w^2$ within samples, the
greater will be the gain due to systematic sampling. It is obvious that if $\sigma_w^2$
increases then $\sigma_b^2$ will decrease, since the total variation $\sigma^2$ remains fixed.
Thus for a large gain in efficiency due to systematic sampling, the units within each
systematic sample should be as heterogeneous as possible.
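The closed form (7.1.14) can be checked against direct computation of the two variances; the sketch below uses arbitrary illustrative values, not data from the book:

```python
# Design effect (7.1.14) checked against direct computation of
# V(ybar_sy) and V(ybar_srs) on an arbitrary toy population.
y = [2.0, 9.0, 4.0, 11.0, 6.0, 13.0, 8.0, 15.0]   # N = 8
N, n = 8, 2
k = N // n                                         # k = 4
Ybar = sum(y) / N
sigma2 = sum((v - Ybar) ** 2 for v in y) / N

means = [sum(y[r + i * k] for i in range(n)) / n for r in range(k)]
V_sys = sum((m - Ybar) ** 2 for m in means) / k    # sigma_b^2, (7.1.12)
sigma_w2 = sigma2 - V_sys
V_srs = (N - n) / (n * (N - 1)) * sigma2           # SRSWOR, (7.1.13)

DE = n * (N - 1) / (N - n) * (1 - sigma_w2 / sigma2)
assert abs(DE - V_sys / V_srs) < 1e-9
print(DE)   # here DE > 1: systematic sampling loses on this population
```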

Case II. When N ≠ nk, i.e., it is not possible to find k such that k = N/n is an
integer. For example, when N = 5 and n = 2, k = N/n = 5/2 = 2.5 is not an integer.
In such situations, we can take either k = 2 or k = 3. Suppose the population
consists of the following units.

Table 7.1.2. A small population of N = 5 units.
Serial number : 1  2  3  4  5
Y value       : 2  3  2  5  6

Evidently the population mean $\bar{Y}$ is given by
$$ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N}Y_i = \frac{18}{5} = 3.6. $$
Now there are the following possibilities:
( a ) If we take k = 2 then we have to select a random number between 1 and 2 and
every second unit will be included in the sample. Suppose the random start is
r = 1. Then the units selected will be the first, third, and fifth; in other words, the
sample values will be {2, 2, 6}. Thus the first sample mean is
$$ \bar{y}_1 = \frac{2+2+6}{3} = 3.33. $$

If the selected random start is r = 2 then the units at serial numbers 2 and 4 will be
selected. The resultant sample will be {3, 5}. The mean of the second sample will
be
$$ \bar{y}_2 = \frac{3+5}{2} = 4. $$

While discussing the above example we observe the following difficulties:

( i ) We had a sample of 3 units, while we were interested in a sample of size 2.
( ii ) There are only two possible sample means,
$$ \bar{y}_1 = \frac{10}{3} \quad \text{and} \quad \bar{y}_2 = 4. $$
The mean of these sample means is given by
$$ \frac{1}{2}\left(\frac{10}{3}+4\right) = 3.667, $$
which is not equal to the true population mean $\bar{Y} = 3.6$.
( b ) If we take k = 3, then we have to select a random start r between 1 and 3.
Then we have the following possibilities:
( i ) If r = 1 then the sample is {2, 5} with mean $\bar{y}_1 = \frac{2+5}{2} = 3.5$;
( ii ) If r = 2 then the sample is {3, 6} with mean $\bar{y}_2 = \frac{3+6}{2} = 4.5$;
( iii ) If r = 3 then the sample is {2} with mean $\bar{y}_3 = \frac{2}{1} = 2.0$.
Again we have the mean of the three sample means as
$$ \text{Mean} = \frac{3.5+4.5+2.0}{3} = 3.3333 \ne 3.6 = \bar{Y}. $$
Thus we have the following corollaries for Case II.
Corollary 7.1.1. The sample size is a random variable, that is, the sample size is
not fixed.
Corollary 7.1.2. The sample mean is not an unbiased estimator of the population
mean.

Theorem 7.1.4. Let $Y_r$ denote the sample total corresponding to random start r,
the units being taken in the sample by skipping k of them. Then the estimator
$$ \bar{y}_u = \frac{k}{N}Y_r \qquad (7.1.16) $$
is unbiased for the population mean $\bar{Y}$.

Proof. We have
$$ E(\bar{y}_u) = E\left(\frac{k}{N}Y_r\right) = \sum_{r=1}^{k}\frac{1}{k}\cdot\frac{k}{N}Y_r = \frac{1}{N}\sum_{r=1}^{k}Y_r = \bar{Y}. \qquad (7.1.17) $$
Hence the theorem.

The above theorem can also be justified with the help of the following example.

Example 7.1.1. Case I. If k = 2, then the possible samples are {2, 2, 6} and {3, 5}.
Thus $Y_1 = 2+2+6 = 10$ and $Y_2 = 3+5 = 8$. Now the mean of $\bar{y}_u = \frac{k}{N}Y_r$,
r = 1, 2, will be
$$ E(\bar{y}_u) = \frac{1}{2}\sum_{r=1}^{2}\frac{2}{N}Y_r = \frac{1}{2}\left(\frac{2\times 10}{5}+\frac{2\times 8}{5}\right) = \frac{18}{5} = 3.6 = \bar{Y}. $$
Case II. If k = 3 then the possible samples are {2, 5}, {3, 6}, and {2}. Obviously
$Y_1 = 2+5 = 7$, $Y_2 = 3+6 = 9$, and $Y_3 = 2$. Again the mean value of
$\bar{y}_u = \frac{k}{N}Y_r$, r = 1, 2, 3, will be
$$ E(\bar{y}_u) = \frac{1}{3}\sum_{r=1}^{3}\frac{3}{N}Y_r = \frac{1}{3}\left(\frac{3\times 7}{5}+\frac{3\times 9}{5}+\frac{3\times 2}{5}\right) = \frac{1}{3}\times\frac{54}{5} = \frac{18}{5} = 3.6 = \bar{Y}. $$
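The unbiasedness of $\bar{y}_u$ shown in the example can be confirmed by the same enumeration in code (an illustrative sketch; names are ours):

```python
# Enumeration check of Theorem 7.1.4: ybar_u = (k/N) * Y_r is unbiased
# for the Table 7.1.2 population, whichever integer skip k is used.
y = [2, 3, 2, 5, 6]
N = len(y)

def expected_ybar_u(k):
    totals = [sum(y[i - 1] for i in range(r, N + 1, k))
              for r in range(1, k + 1)]
    # Each random start r = 1..k has probability 1/k.
    return sum((k / N) * t for t in totals) / k

assert abs(expected_ybar_u(2) - 3.6) < 1e-12
assert abs(expected_ybar_u(3) - 3.6) < 1e-12
```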
The sample means as such are not unbiased for the population mean. A sampling
scheme providing unbiased estimators is called Modified Systematic Sampling.

In the case of modified systematic sampling, instead of selecting a random start r
from 1 to k, we select a random number from 1 to N; then every k-th unit on the
right and left of it is included in the sample. Thus, for example, if k = 2 then
Table 7.2.1 shows the possible samples for the population given in Table 7.1.2 for
different random starts.

Table 7.2.1. Possible samples using modified systematic sampling.
Random start | Serial numbers of sampled units | Sampled values | Sample mean
1            | 1, 3, 5                         | 2, 2, 6        | 10/3
2            | 2, 4                            | 3, 5           | 4
3            | 1, 3, 5                         | 2, 2, 6        | 10/3
4            | 4, 2                            | 5, 3           | 4
5            | 5, 3, 1                         | 6, 2, 2        | 10/3

Evidently the expected value of the sample means is
$$ E(\bar{y}_{ms}) = \frac{1}{5}\left[\frac{10}{3}+4+\frac{10}{3}+4+\frac{10}{3}\right] = \frac{18}{5} = 3.6 = \bar{Y}. $$
Hence under modified systematic sampling, the sample mean $\bar{y}_{ms}$ is an
unbiased estimator of the population mean $\bar{Y}$, but the sample size is still a
random variable, as shown in Table 7.2.1.

This difficulty can also be removed by following the Circular Systematic Sampling
scheme, as discussed in the next section.

Murthy (1961), Sukhatme and Sukhatme (1970), and Konijn (1973) have suggested
using the circular systematic sampling (CSS) design in situations where N is not
a multiple of n. In this sampling scheme, the sequential list of the population units
is first prepared on a circle as shown in Figure 7.3.1.

Fig. 7.3.1 Circular Systematic Sampling (CSS).

The main steps involved in selecting a sample using the CSS scheme are as follows:
( a ) Select a random number from 1 to N and name it the 'random start';
( b ) Choose some integer value of k = N/n, rounded to the nearest integer, and
name it the skip;
( c ) Select in the sample all units with serial numbers
$$ r+jk \quad \text{if } r+jk \le N, \qquad (r+jk-N) \quad \text{if } r+jk > N; \quad j = 0, 1, 2, \ldots, (n-1). $$

Example 7.3.1. Suppose a population consists of 10 ministers sitting at a round
table. We wish to select a sample of 3 ministers, to study the corruption level in
their country, using the CSS scheme with skip $k = 10/3 \approx 3$. The sequential list
of the 10 ministers sitting at the round table can be prepared as shown in Figure
7.3.2. Now if the random start is r = 3, then the ministers included in the sample
would be those with serial numbers $3+0\times 3 = 3$, $3+1\times 3 = 6$, and
$3+2\times 3 = 9$. Similarly, if r = 8 then the ministers in the sample would be those
with serial numbers $8+0\times 3 = 8$, $8+1\times 3-10 = 1$, and $8+2\times 3-10 = 4$.

Fig. 7.3.2 Pictorial representation of two systematic samples, each of size 3, from a
population of 10 units (k = 3, N = 10) with two different random starts: with r = 3
the ministers at serial numbers 3, 6, and 9 are selected, and with r = 8 those at
serial numbers 8, 1, and 4.
Now we have the following theorem.
Theorem 7.3.1. Under the CSS scheme an unbiased estimator of the population
mean $\bar{Y}$ is given by
$$ \bar{y}_r = \frac{1}{n}\sum_{j=0}^{n-1}y_{r+jk}. \qquad (7.3.1) $$
Proof. Taking expected values on both sides of (7.3.1) we have
$$ E(\bar{y}_r) = E\left[\frac{1}{n}\sum_{j=0}^{n-1}y_{r+jk}\right] = \frac{1}{Nn}\sum_{r=1}^{N}\left(\sum_{j=0}^{n-1}y_{r+jk}\right) = \frac{1}{Nn}\times n\times\sum_{i=1}^{N}Y_i = \frac{1}{N}\sum_{i=1}^{N}Y_i = \bar{Y}. $$
Hence the theorem.
Difficulty in the CSS scheme: Sudakar (1978) pointed out that using the skip, or
span of sampling, as the integer nearest to N/n in CSS does not always draw a
sample of the desired size, as shown in the following example.
Example 7.3.2. Let N = 15, n = 6, k = 3, and the starting point be r = 3. The
sample in this case has only five (instead of six) distinct units, with serial numbers
3, 6, 9, 12, 15.
Sudakar (1978) has mentioned that if we take the span of sampling as the largest
integer smaller than or equal to N/n, then we do not encounter the above difficulty,
although this choice depends upon n.

Thus we have the following theorem.

Theorem 7.3.2. A necessary and sufficient condition for all units of a sample of
size n, selected by circular systematic sampling with random start r, to be distinct
for all $r \le N$ and $n \le N$ is that N and k are relatively prime.

Proof. Let N and k be integers with $k < N$, $r \le N$, and $n < N$. Also let the
sample s consist of the units with serial numbers $s = \{i_1, i_2, \ldots, i_n\}$, where
$i_{j+1} = (r+jk)\,\mathrm{mod}(N)$, $j = 0, 1, \ldots, (n-1)$.

Then the necessary and sufficient (n.s.) conditions are proved as follows:

( a ) Sufficiency: Assume N and k are relatively prime and that there exist a
random start r and sample size n such that two elements of the sample are the
same. Without loss of generality, suppose that $i_1$ and $i_{j+1}$ are the same. Now
$i_1 = r$ and $i_{j+1} = (r+jk)\,\mathrm{mod}(N)$. Thus $i_1 = i_{j+1}$ implies that jk is a
multiple of N, where $j < n < N$. This contradicts the assumption that k and N are
relatively prime. Hence the first part of the theorem.

( b ) Necessity: Assume that for all $r \le N$ and $n < N$ all elements of the sample
are distinct, and that N and k are not relatively prime. Let the greatest common
divisor (g.c.d.) of (k, N) be $a > 1$, with $k = b\,a$ and $N = c\,a$, where b and c are
both smaller than N. For any random start r, let us take $n \ge c+1$.

Then we have
$$ i_{c+1} = (r+ck)\,\mathrm{mod}(N) = (cb\,a+r)\,\mathrm{mod}(N) = (b\,N+r)\,\mathrm{mod}(N) = r = i_1, $$
which contradicts our assumption that all elements of the sample s are distinct.
Hence the theorem.
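Theorem 7.3.2 and Example 7.3.2 can be checked by brute force (an illustrative sketch, not from the book; `css_distinct` is our helper name):

```python
from math import gcd

def css_distinct(N, n, k, r):
    # True when the circular systematic sample has n distinct units.
    return len({(r + j * k) % N for j in range(n)}) == n

# gcd(15, 4) = 1: every random start yields 6 distinct units.
assert gcd(15, 4) == 1
assert all(css_distinct(15, 6, 4, r) for r in range(1, 16))
# gcd(15, 3) = 3: Example 7.3.2 (r = 3) yields only 5 distinct units.
assert not css_distinct(15, 6, 3, 3)
```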

Consider a finite population of N units with positive integer size-measures $X_i$ and
cumulative totals $C_i = X_1+X_2+\cdots+X_i$ ($i = 1, 2, \ldots, N$; $C_0 = 0$,
$C_N = X$). Then we have the following definition.

Definition 7.4.1. A PPS circular systematic sample (CSS) of size n is drawn by
selecting a random number r between 1 and X and choosing the units i
($i = 1, 2, \ldots, N$) for which $C_{i-1} < (r+jk)\,\mathrm{mod}(X) \le C_i$,
$j = 0, 1, 2, \ldots, (n-1)$.

Then we have the following theorem.



Theorem 7.4.1. A necessary condition for a sample selected according to
Definition 7.4.1 to contain all distinct units is that
$$ n \le \min\{l, [X/M]\} = n_0 \text{ (say)}, \qquad (7.4.1) $$
where $M = \max_{1\le i\le N}X_i$, $[x]$ = the largest integer contained in x, and l is
the smallest positive integer for which $(l\times k)/X$ is an integer.
Proof. Follows from Chaudhuri and Adhikary (1987).

Sengupta (1988) pointed out that condition (7.4.1) is not sufficient, and the
resultant sample may not contain all distinct units even if (7.4.1) holds. He
provided the following example.
Example 7.4.1. Suppose N = 10, X = 300, M = 65, and [X/M] = 4. The data are
given in the following table:

Table 7.4.1. Difficulty in PPS Circular Systematic Sampling.
Unit i : 1   2   3   4    5    6    7    8    9    10
X_i    : 12  14  24  65   13   16   29   31   46   50
C_i    : 12  26  50  115  128  144  173  204  250  300

Let k = 120, n = 3. Here l = 5, $n_0$ = 4, so that (7.4.1) holds. But for r = 112 the
sample contains only two distinct units, 4 and 9. Furthermore, Sengupta (1988)
gave the following theorem.
Theorem 7.4.2. Let $M = \max_{1\le i\le N}X_i$. Then a necessary and sufficient
condition for a PPS circular systematic sample of size n with sampling interval k
to always contain all distinct units is that $n \le n_1$, where $n_1$ is the smallest
positive integer j for which
$$ jk\,\mathrm{mod}(X) < M \quad \text{or} \quad > (X-M). $$
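Sengupta's counter-example can be reproduced directly from Definition 7.4.1 (an illustrative sketch using the Table 7.4.1 data; function names are ours):

```python
from itertools import accumulate

# Size measures X_i of Table 7.4.1; cumulative totals C_i, with C_N = X = 300.
X = [12, 14, 24, 65, 13, 16, 29, 31, 46, 50]
C = list(accumulate(X))
T = C[-1]

def pps_css_sample(n, k, r):
    # Unit i is chosen when C_{i-1} < (r + jk) mod X <= C_i (Definition 7.4.1).
    units = []
    for j in range(n):
        t = (r + j * k - 1) % T + 1          # point on the circle 1..X
        i = next(u for u in range(len(X)) if t <= C[u])
        units.append(i + 1)
    return units

print(pps_css_sample(3, 120, 112))   # [4, 9, 4]: only two distinct units
```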
Brewer (1963a) suggested a method of selecting a systematic sample with unequal
probability sampling. Some new circular systematic sampling (CSS) schemes have
also been studied by Uthayakumaran (1998). Hartley (1966) studied PPSWOR
systematic sampling. Although we have discussed many difficulties and their
solutions in systematic sampling in the preceding sections, a more serious
difficulty lies in the estimation of the variance of the estimator of the mean or total
under systematic sampling schemes.

We will discuss here certain methods to estimate the variance of the estimators of
the mean (total) under a systematic sampling scheme.

7.5.1 SUB-SAMPLING OR REPLICATED SUB-SAMPLING SCHEME

In this method, rather than choosing one systematic sample of size n, choose m
sub-samples, each of size n/m, by selecting m random starts from 1 to
$k = (Nm)/n$ using a without replacement sampling scheme. Compute the m
sample means $\bar{y}_j$, $j = 1, 2, \ldots, m$. Also compute the full sample mean
defined as
$$ \bar{y}_p = \frac{1}{m}\sum_{j=1}^{m}\bar{y}_j. \qquad (7.5.1) $$
An unbiased estimator of $V(\bar{y}_p)$ is
$$ v(\bar{y}_p) = \hat{\sigma}_b^2 = \frac{1-f}{m(m-1)}\sum_{j=1}^{m}\left(\bar{y}_j-\bar{y}_p\right)^2, \qquad (7.5.2) $$
where $f = n/N$. Such a method of estimating the variance under systematic
sampling is also called interpenetrating systematic sampling.

Example 7.5.1. Select three sub-samples, each consisting of 5 states, from
population 1 having 50 states as given in the Appendix, by using systematic
sampling. Collect the information on the real estate farm loans from the states
selected in the sample. Obtain a pooled estimate of the average real estate farm
loans in the United States. Use an appropriate method for estimating the variance
of the resultant pooled estimator.
Solution. We are given N = 50, m = 3, and $n = 5\times m = 5\times 3 = 15$ is the
total sample size. Select m = 3 distinct random numbers between 1 and
$$ k = \frac{N}{n/m} = \frac{50}{15/3} = 10. $$
Use these three distinct random numbers to form three different systematic
sub-samples with a skip of k = 10. We used the 6th and 7th columns of the
Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select three
distinct random numbers between 1 and 10. We observed the three distinct random
numbers as 07, 09, and 01. Then we have the following sub-samples (serial
numbers of the selected states):

Sub-sample 1 (r = 07): 07, 17, 27, 37, 47
Sub-sample 2 (r = 09): 09 (FL), 19 (ME), 29 (NH), 39 (RI), 49 (WI)
Sub-sample 3 (r = 01): 01, 11, 21, 31, 41

Thus the pooled sample mean estimate $\bar{y}_p$ is given by
$$ \bar{y}_p = \frac{\bar{y}_1+\bar{y}_2+\bar{y}_3}{3} = \frac{721.1464+414.401+175.4328}{3} = 436.993. $$
An unbiased estimate of $V(\bar{y}_p)$ is
$$ v(\bar{y}_p) = \hat{\sigma}_b^2 = \frac{1-f}{m(m-1)}\sum_{j=1}^{3}\left(\bar{y}_j-\bar{y}_p\right)^2 $$
$$ = \frac{1-0.3}{3(3-1)}\left[(721.146-436.993)^2+(414.401-436.993)^2+(175.433-436.993)^2\right] = 17461.184. $$
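The computation can be replicated in a few lines (an illustrative sketch using the three sub-sample means quoted above):

```python
# Pooled estimate (7.5.1) and its variance estimate (7.5.2) from the
# three sub-sample means of Example 7.5.1.
ybars = [721.1464, 414.401, 175.4328]
m, n, N = 3, 15, 50
f = n / N                                   # 0.3

ybar_p = sum(ybars) / m
v = (1 - f) / (m * (m - 1)) * sum((yb - ybar_p) ** 2 for yb in ybars)
print(round(ybar_p, 3), round(v, 3))        # about 436.993 and 17461.184
```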

7.5.2 SUCCESSIVE DIFFERENCES METHOD
If $y_i$ denotes the i-th unit selected in the sample, then an estimator of the
variance of the sample mean can also be obtained as
$$ \hat{\sigma}_b^2 = \frac{1-f}{2n(n-1)}\sum_{i=1}^{n-1}\left(y_i-y_{i+1}\right)^2. \qquad (7.5.3) $$
The assumption here is that each successive pair of units, i.e., $y_i$ and $y_{i+1}$, is
drawn using SRSWOR sampling from the 2k eligible units.
Example 7.5.2. Select a sample of 10 states from population 1, consisting of 50
states, by using the systematic sampling scheme. Collect the information on the real
estate farm loans from the states selected in the sample. Use an appropriate method
for estimating the variance of the estimator of the population mean.
Solution. We have N = 50 and n = 10, therefore k = N/n = 50/10 = 5. We used
the 8th column of the Pseudo-Random Numbers (PRN) given in Table 1 of the
Appendix to select a random number between 1 and 5. We observed random
number 2. Thus the systematic sample consists of the following 10 distinct units:
02, 07, 12, 17, 22, 27, 32, 37, 42, and 47.

Serial no. | State | $y_i - y_{i+1}$ | $(y_i - y_{i+1})^2$
02  | AK | -4.525    | 20.475
07  | CT | -46.623   | 2173.704
12  | ID | -991.353  | 982780.771
17  | KY | 722.078   | 521396.638
22  | MI | -1014.820 | 1029867.750
27  | NE | 1136.221  | 1290998.160
32  | NY | 86.732    | 7522.439
37  | OR | -438.367  | 192165.627
42  | TN | -547.479  | 299733.255
47  | WA | --        | --
Sum |    |           | 4326658.820

Thus an estimate of the variance of the estimator of the mean for systematic
sampling is
$$ \hat{\sigma}_b^2 = \frac{1-f}{2n(n-1)}\sum_{i=1}^{n-1}\left(y_i-y_{i+1}\right)^2 = \frac{1-0.2}{2\times 10(10-1)}\times 4326658.82 = 19229.595. $$

In the case of the CSS scheme, we have N possible samples instead of k samples,
and we can define
$$ \sigma_w^2 = \frac{1}{nN}\sum_{i=1}^{n}\sum_{r=1}^{N}\left(y_{r,i}-\bar{y}_r\right)^2, \quad \sigma_b^2 = V(\bar{y}_{css}) = \frac{1}{N}\sum_{r=1}^{N}\left(\bar{y}_r-\bar{Y}\right)^2, \quad \text{and} \quad \sigma^2 = \sigma_w^2+\sigma_b^2. $$

Ray and Das (1995, 1997) have proposed circular systematic sampling schemes
which provide an unbiased estimator of the population mean, and also an estimator
of the variance of the mean, without putting any restriction on the population size.

Quite often a situation arises when the study variable $Y_i$ is related to its serial
number i through a linear relationship, usually known as a linear trend. We discuss
below some results when such a trend is present.

In the following theorem, we show that the usual estimator of the population mean
under the systematic sampling scheme remains more efficient than simple random
sampling in the presence of a linear trend.

Theorem 7.6.1. If $Y_i$ has a linear relation with the serial number i, that is,
$$ Y_i = a+bi, \qquad (7.6.1.1) $$
under the systematic sampling strategy while drawing a sample of n units, and if
another sample of n units is drawn by SRSWOR or SRSWR, then for large N we
have
$$ V(\bar{y})_{SRS}\big/V(\bar{y})_{SYS} = n. \qquad (7.6.1.2) $$
Proof. Under SRSWR sampling we have
$$ V(\bar{y}_{WR}) = \frac{\sigma_y^2}{n}, \quad \text{where} \quad \sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i-\bar{Y}\right)^2 \quad \text{and} \quad \bar{Y} = \frac{1}{N}\sum_{i=1}^{N}Y_i. \qquad (7.6.1.3) $$
Under model (7.6.1.1) we have
$$ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N}Y_i = \frac{1}{N}\sum_{i=1}^{N}(a+bi) = \frac{1}{N}\left[Na+b\sum_{i=1}^{N}i\right] = a+\frac{b}{N}\cdot\frac{N(N+1)}{2} = a+\frac{b(N+1)}{2} \qquad (7.6.1.4) $$
and
$$ \sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i-\bar{Y}\right)^2 = \frac{b^2}{N}\sum_{i=1}^{N}\left[i-\frac{(N+1)}{2}\right]^2 = \frac{b^2}{N}\left[\sum_{i=1}^{N}i^2+N\left(\frac{N+1}{2}\right)^2-(N+1)\sum_{i=1}^{N}i\right] $$
$$ = \frac{b^2}{N}\left[\frac{N(N+1)(2N+1)}{6}+\frac{N(N+1)^2}{4}-\frac{N(N+1)^2}{2}\right] = \frac{b^2(N^2-1)}{12}. \qquad (7.6.1.5) $$
From (7.6.1.3) we have
$$ V(\bar{y}_{WR}) = \frac{\sigma_y^2}{n} = \frac{b^2(N^2-1)}{12n}. \qquad (7.6.1.6) $$
Under SRSWOR sampling we have
$$ V(\bar{y}_{WOR}) = \frac{(N-n)}{Nn}S_y^2 = \frac{(N-n)}{Nn}\cdot\frac{N}{(N-1)}\sigma_y^2 = \frac{(N-n)b^2(N+1)}{12n}. \qquad (7.6.1.7) $$
If a systematic sample of size n is selected with random start i and skip k, then
the units selected in the sample will be listed at serial numbers
$i,\; k+i,\; 2k+i,\; \ldots,\; (n-1)k+i$. Using (7.6.1.1) we have
$$ \bar{y}_i = \frac{1}{n}\sum_{j=0}^{n-1}y_{i+jk} = \frac{1}{n}\left[y_i+y_{k+i}+y_{2k+i}+\cdots+y_{(n-1)k+i}\right] = a+\frac{b}{n}\left[i+(k+i)+(2k+i)+\cdots+\{(n-1)k+i\}\right] $$
$$ = a+b\left[i+\frac{(n-1)k}{2}\right]. \qquad (7.6.1.8) $$
Thus we have
$$ E(\bar{y}_i) = \frac{1}{k}\sum_{i=1}^{k}\bar{y}_i = \frac{1}{k}\sum_{i=1}^{k}\left[a+b\left\{i+\frac{(n-1)k}{2}\right\}\right] = a+\frac{b}{2}\left[k+1+k(n-1)\right] = a+\frac{b}{2}(nk+1) = a+\frac{b(N+1)}{2} = \bar{Y}. \qquad (7.6.1.9) $$
Thus the sample mean is an unbiased estimator of the population mean under a
linear trend. The variance of the estimator under systematic sampling reduces to
$$ V(\bar{y}_i)_{SYS} = \frac{1}{k}\sum_{i=1}^{k}\left(\bar{y}_i-\bar{Y}\right)^2 = \frac{b^2}{k}\sum_{i=1}^{k}\left[i-\frac{(k+1)}{2}\right]^2 = \frac{b^2}{k}\left[\sum_{i=1}^{k}i^2+k\left(\frac{k+1}{2}\right)^2-(k+1)\sum_{i=1}^{k}i\right] $$
$$ = \frac{b^2}{k}\left[\frac{k(k+1)(2k+1)}{6}+\frac{k(k+1)^2}{4}-\frac{k(k+1)^2}{2}\right] = \frac{b^2(k^2-1)}{12}. \qquad (7.6.1.10) $$
From (7.6.1.6), for large N we have
$$ V(\bar{y}_i)_{WR} = \frac{b^2(N^2-1)}{12n} = \frac{b^2N^2\left(1-\frac{1}{N^2}\right)}{12n} \approx \frac{b^2n^2k^2}{12n} = \frac{nb^2k^2}{12}. \qquad (7.6.1.11) $$
Again from (7.6.1.7), for large N we have
$$ V(\bar{y}_i)_{WOR} = \frac{b^2(N-n)(N+1)}{12n} = \frac{b^2N^2\left(1-\frac{n}{N}\right)\left(1+\frac{1}{N}\right)}{12n} \approx \frac{b^2N^2}{12n} = \frac{b^2n^2k^2}{12n} = \frac{nb^2k^2}{12}. \qquad (7.6.1.12) $$
From (7.6.1.11) and (7.6.1.12) we have
$$ V(\bar{y}_i)_{WR} = V(\bar{y}_i)_{WOR} = V(\bar{y})_{SRS} = \frac{nb^2k^2}{12}. \qquad (7.6.1.13) $$

Now from (7.6.1.10) we have
$$ V(\bar{y}_i)_{SYS} = \frac{b^2(k^2-1)}{12} = \frac{b^2}{12}\left[\frac{N^2}{n^2}-1\right] = \frac{b^2}{12}\left[1-\left(\frac{n}{N}\right)^2\right]\frac{N^2}{n^2} \approx \frac{b^2k^2}{12}. \qquad (7.6.1.14) $$
Comparing (7.6.1.13) with (7.6.1.14) we have
$$ RE = \frac{V(\bar{y})_{SRS}}{V(\bar{y}_i)_{SYS}} = n, \qquad (7.6.1.15) $$
which proves the theorem.
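The exact result (7.6.1.10) and the large-N ratio (7.6.1.15) can be checked by enumerating all k systematic samples under a concrete linear trend (an illustrative sketch; the values a = 2, b = 3 are arbitrary):

```python
# Enumeration of all k systematic samples under the exact linear trend
# Y_i = a + b*i, checking (7.6.1.10) and the large-N ratio (7.6.1.15).
a, b = 2.0, 3.0
n, k = 10, 100
N = n * k
y = [a + b * i for i in range(1, N + 1)]
Ybar = sum(y) / N

means = [sum(y[(r - 1) + j * k] for j in range(n)) / n for r in range(1, k + 1)]
V_sys = sum((m - Ybar) ** 2 for m in means) / k
assert abs(V_sys - b * b * (k * k - 1) / 12) < 1e-6       # (7.6.1.10)

sigma2 = sum((v - Ybar) ** 2 for v in y) / N
V_srs = sigma2 / n                                        # SRSWR, (7.6.1.6)
print(V_srs / V_sys)                                      # close to n = 10
```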

In systematic sampling it is interesting to note that it is possible to adjust the
weights given to the sample observations in such a way that the variance of the
resultant estimator of the population mean is zero. Such an adjustment is called the
Yates (1948) end correction.

Here we adjust the weights given to the sampled observations in the estimator of
the population mean in such a way that the variance of the estimator is zero. The
estimator of the population mean with random start i and skip k in systematic
sampling is given by

$$ \bar{y}_i = \frac{1}{n}\sum_{j=0}^{n-1}y_{i+jk}. \qquad (7.6.2.1) $$
Let us change the weights to
$$ \left(\frac{1}{n}-x\right),\; \frac{1}{n},\; \ldots,\; \frac{1}{n},\; \left(\frac{1}{n}+x\right). \qquad (7.6.2.2) $$

Clearly we have only changed the weights for the first and the last unit in the
sample. The value of x is determined suitably such that the modified estimator
$\bar{y}_i^{*}$ (with the new weights) matches the population mean $\bar{Y}$ for all i.
The systematic sampling estimator with the new weights is given by
$$ \bar{y}_i^{*} = \left(\frac{1}{n}-x\right)y_i+\frac{1}{n}y_{k+i}+\frac{1}{n}y_{2k+i}+\cdots+\frac{1}{n}y_{(n-2)k+i}+\left(\frac{1}{n}+x\right)y_{(n-1)k+i}. \qquad (7.6.2.3) $$

Note that the linear trend is given by
$$ Y_i = a+bi. \qquad (7.6.2.4) $$
The estimator (7.6.2.3) becomes
$$ \bar{y}_i^{*} = \left(\frac{1}{n}-x\right)(a+bi)+\frac{1}{n}\sum_{j=1}^{n-2}\{a+b(jk+i)\}+\left(\frac{1}{n}+x\right)[a+b\{(n-1)k+i\}] $$
$$ = x\left[-a-bi+a+b\{(n-1)k+i\}\right]+\frac{1}{n}\sum_{j=0}^{n-1}\{a+b(jk+i)\} = b(n-1)kx+\frac{1}{n}\left[na+bni+\frac{bkn(n-1)}{2}\right]. $$
Now if $\bar{y}_i^{*} = \bar{Y}$ for all i, then $V(\bar{y}_i^{*}) = 0$; therefore we have
$$ b(n-1)kx+a+bi+\frac{bk(n-1)}{2} = a+\frac{b(N+1)}{2}, \quad \text{or} \quad (n-1)kx+i+\frac{k(n-1)}{2} = \frac{(N+1)}{2}, $$
which implies that
$$ x = \frac{1}{k(n-1)}\left[\frac{(N+1)}{2}-i-\frac{k(n-1)}{2}\right] = \frac{1}{k(n-1)}\left[\frac{k+1}{2}-i\right]. \qquad (7.6.2.5) $$
Thus if we use this value of x in the weights of the first and last unit in the sample,
then the variance of the usual estimator of the population mean in systematic
sampling reduces to zero. This type of adjustment or correction of weights is called
the Yates (1948) end correction. Bellhouse and Rao (1975) extended the Yates end
corrections to the case $N \ne nk$ for the CSS scheme. The performance of
systematic sampling in the presence of a linear trend can also be improved by
following Singh, Jindal and Garg (1968), who suggested choosing pairs of units
equidistant from the ends of the population. If i is the random start chosen from 1
to k, then the units selected will be $\{i+(j-1)k,\; nk-(j-1)k-i+1\}$,
$j = 1, 2, \ldots, n/2$, if n is even, and $\{i+(j-1)k,\; nk-(j-1)k-i+1\}$,
$j = 1, 2, \ldots, (n-1)/2$, together with the unit $i+(n-1)k/2$, if n is odd. For the
linear trend $Y_i = a+bi$, $i = 1, 2, \ldots, N$, the sample mean will be exactly equal
to the population mean when n is even. The accuracy of the estimates in systematic
sampling can also be increased by following either of the two methods below.
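The end correction can be verified for every random start under a concrete linear trend (an illustrative sketch, not from the book; `yates_corrected_mean` and the trend values are ours):

```python
def yates_corrected_mean(y, n, k, i):
    # Systematic-sample mean with the end-corrected weights (7.6.2.2),
    # x chosen from (7.6.2.5).
    x = ((k + 1) / 2 - i) / (k * (n - 1))
    w = [1 / n - x] + [1 / n] * (n - 2) + [1 / n + x]
    return sum(wj * y[(i - 1) + j * k] for j, wj in enumerate(w))

a, b, n, k = 5.0, 2.0, 4, 6          # illustrative trend and design
N = n * k
y = [a + b * t for t in range(1, N + 1)]
Ybar = sum(y) / N                    # 30.0 for these values
for i in range(1, k + 1):
    assert abs(yates_corrected_mean(y, n, k, i) - Ybar) < 1e-9
```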

Suppose we have to take a sample of n units from a population of N units such
that N = nk, where k is the sampling span. Then there are two cases:
( i ) k is odd;   ( ii ) k is even.

If k is even then the variance of the systematic sampling mean does not become
exactly zero, but if k is odd then it may be zero. We shall discuss each case
separately.
Case I. k is odd: We choose the $\frac{(k+1)}{2}$th unit as the random start. The value
of y which corresponds to the random start $\frac{(k+1)}{2}$ is $a+b\frac{(k+1)}{2}$. Thus
the first unit in the sample corresponds to serial number $\frac{(k+1)}{2}$, the second
unit corresponds to $k+\frac{(k+1)}{2}$, the third unit corresponds to
$2k+\frac{(k+1)}{2}$, and so on. Thus the sample mean corresponding to the random
start $\frac{(k+1)}{2}$ is given by
$$ \bar{y}_{\left(\frac{k+1}{2}\right)} = \frac{1}{n}\sum_{j=0}^{n-1}\left[a+b\left\{jk+\frac{(k+1)}{2}\right\}\right] = \frac{1}{n}\left[na+\frac{nb(k+1)}{2}+\frac{bkn(n-1)}{2}\right] $$
$$ = a+\frac{bk}{2}+\frac{b}{2}+\frac{nbk}{2}-\frac{bk}{2} = a+\frac{b(nk+1)}{2} = a+\frac{b(N+1)}{2} = \bar{Y} = \text{Population mean}. $$
Thus with $\frac{(k+1)}{2}$ as the random start we have the sample mean equal to the
population mean. Since no randomness is involved, the variance of the estimator
$\bar{y}_{\left(\frac{k+1}{2}\right)}$ is zero. A pictorial representation of centrally located
systematic sampling when k is odd is given in Figure 7.6.3.1.

Fig. 7.6.3.1 Centrally located systematic sampling for odd random start: the units
1, 2, ..., (k+1)/2, ..., k (pictured along the Indian Grand Trunk Road), with the
central unit (k+1)/2 selected.

Case II. k is even: In this situation we choose one of two random starts, as shown
in Figure 7.6.3.2.

Fig. 7.6.3.2 Centrally located systematic sampling for even random start: the two
central positions k/2 and k/2 + 1 among the units 1, 2, ..., k.

In this case there are two possibilities for choosing the random start, i.e., either
$\frac{k}{2}$ or $\left(\frac{k}{2}+1\right)$. Now, in order to decide which random start should
be considered, we perform an experiment with an unbiased coin, e.g., head for
$\frac{k}{2}$ and tail for $\left(\frac{k}{2}+1\right)$. If we take $\frac{k}{2}$ as the random start
then, from the linear trend, the y value corresponding to the first unit is
$a+\frac{bk}{2}$, to the second unit $a+b\left(k+\frac{k}{2}\right)$, and to the last unit
$a+b\left[(n-1)k+\frac{k}{2}\right]$. Thus the sample mean with random start
$\frac{k}{2}$ will be
$$ \bar{y}_{\left(\frac{k}{2}\right)} = \frac{1}{n}\sum_{j=0}^{n-1}\left[a+b\left(jk+\frac{k}{2}\right)\right] = \frac{1}{n}\left[na+\frac{nbk}{2}+\frac{nbk(n-1)}{2}\right] = a+\frac{bk}{2}+\frac{nbk}{2}-\frac{bk}{2} = a+\frac{Nb}{2}. \qquad (7.6.3.1) $$

Similarly, if the random start is (~+ 1) then we have

_
Y(k ) = - c: a+
1n~l[ b{k' (k +
g+--
2)}] 1[na+ nb(k + 2) + nbk(n -1)]
=-
- +1 n j=l 2 n 2 2
2

b(N +2)
=a+ 2 . (7.6.3.2)
Note that we are choosing each of the two random starts with probability 1/2, so
_
2 -
1
Y=-Y(k) +-Y(k
2 - +1
1
)=-21 [bN
a+-+a+
2
b(N+2)]
2
b(N+l)-
=a+---=Y.
2
2 2
Chapter 7: Systematic Sampling 633

Furthermore, since each random start deviates from $\bar{Y}$ by b/2, the variance of
the centrally located estimator is

$$
V(\bar{y}) = \frac{1}{2}\left(\bar{y}_{(k/2)} - \bar{Y}\right)^2
+ \frac{1}{2}\left(\bar{y}_{(k/2+1)} - \bar{Y}\right)^2
= \frac{1}{2}\left(\frac{b}{2}\right)^2 + \frac{1}{2}\left(\frac{b}{2}\right)^2
= \frac{b^2}{4}. \qquad (7.6.3.3)
$$
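As a numerical cross-check of the two cases above, the sketch below (hypothetical helper `centered_sample_means`; the population y_i = a + b i and the constants are illustrative, not from the book) enumerates the centrally located sample means for odd and even k.

```python
# Check of centrally located systematic sampling under a pure linear
# trend y_i = a + b*i, i = 1..N (a, b, n, k are illustrative values).

def centered_sample_means(a, b, n, k):
    """Sample mean(s) for centrally located systematic sampling.

    Odd k: the single start (k+1)/2.  Even k: the two candidate
    starts k/2 and k/2 + 1 (one of which is picked by a coin flip)."""
    N = n * k
    y = [a + b * i for i in range(1, N + 1)]
    starts = [(k + 1) // 2] if k % 2 == 1 else [k // 2, k // 2 + 1]
    return [sum(y[s - 1 + j * k] for j in range(n)) / n for s in starts]

# odd k: the single sample mean equals the population mean a + b(N+1)/2
print(centered_sample_means(3.0, 2.0, 6, 5))          # [34.0]; a + b*31/2 = 34.0

# even k: the two means straddle the population mean by b/2 each,
# so the coin-flip estimator has variance b^2/4
means = centered_sample_means(3.0, 2.0, 6, 4)          # N = 24, pop. mean 28.0
print(means, sum((m - 28.0) ** 2 for m in means) / 2)  # [27.0, 29.0] 1.0
```

With b = 2 the even-k variance b^2/4 = 1 appears exactly, matching (7.6.3.3).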

We assume that N = nk and n is even. Instead of taking the sampling span from 1
to k, we take the sampling span from 1 to 2k. In other words, instead of taking n
groups of k units each, we take n/2 groups of 2k units each, so that the sampling
span is 2k instead of k. A pictorial representation of Balanced Systematic
Sampling (BSS) is presented below:

[Indian Grand Trunk Road: units 1, 2, ..., i, ..., k, ..., 2k - i + 1, ..., 2k, with the paired units i and 2k - i + 1 selected from each block of 2k units.]

Fig. 7.6.4.1 Balanced Systematic sampling.
Furthermore, as shown in the above diagram, we select two random starting points
i and 2k - i + 1 (say), one between 1 and k and another between k and 2k, such
that both points are at equal distance from k. Then we have the following data in
the sample. The first unit is selected at serial number i, the second unit at
serial number 2k - i + 1, the third unit at serial number i + 2k, the fourth unit
at serial number 4k - i + 1, and so on. Hence using the linear trend we have
$$
\bar{y}_{BSS}
= \frac{1}{n}\sum_{j=0}^{n/2-1}\left[a + b(2kj + i)\right]
+ \frac{1}{n}\sum_{j=0}^{n/2-1}\left[a + b(2kj + 2k - i + 1)\right]
$$
$$
= \frac{1}{n}\sum_{j=0}^{n/2-1}\left[a + 2bkj + bi + a + 2bkj + 2bk - bi + b\right]
= \frac{1}{n}\sum_{j=0}^{n/2-1}\left[2a + 4bkj + 2bk + b\right]
$$
$$
= \frac{1}{n}\left[na + bkn\left(\frac{n}{2}-1\right) + nbk + \frac{nb}{2}\right]
= a + bk\left(\frac{n}{2}-1\right) + bk + \frac{b}{2}
$$
$$
= a + \frac{nbk}{2} - bk + bk + \frac{b}{2}
= a + \frac{Nb}{2} + \frac{b}{2}
= a + \frac{b(N+1)}{2} = \bar{Y}. \qquad (7.6.4.1)
$$

Thus, irrespective of the random start i, the balanced systematic sample (BSS)
mean is equal to the population mean, and equivalently V(\bar{y}_{BSS}) = 0.
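The zero-variance property can be checked directly; the sketch below (hypothetical helper `bss_mean`, illustrative constants) evaluates the BSS mean for every admissible random start under a linear trend.

```python
# Check of balanced systematic sampling (BSS) under the linear trend
# y_t = a + b*t: with N = nk and n even, any random start i in 1..k
# reproduces the population mean exactly.

def bss_mean(a, b, n, k, i):
    N = n * k
    y = [a + b * t for t in range(1, N + 1)]
    serials = []
    for j in range(n // 2):                        # blocks of 2k units
        serials += [2 * k * j + i, 2 * k * j + 2 * k - i + 1]
    return sum(y[t - 1] for t in serials) / n

a, b, n, k = 5.0, 1.5, 4, 3
pop_mean = a + b * (n * k + 1) / 2                 # = 14.75 here
print([bss_mean(a, b, n, k, i) for i in range(1, k + 1)], pop_mean)
```

Every start yields 14.75, the population mean, so the variance over starts is zero, matching (7.6.4.1).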

Singh and Garg (1979) proposed a sampling scheme called the Balanced Random
Sampling Scheme, which combines the advantages of simple random sampling and
systematic sampling in the sense that one part of the sampling variance depends
upon the arrangement of the units in the population while the other part is
independent of the arrangement. The resultant sampling scheme performs best for
populations showing a linear trend or periodicity.

Example 7.6.1. Select a sample of 10 states from population 1, consisting of 50
states, by using the systematic sampling scheme. Collect the information on the
real estate farm loans from the states selected in the sample. Assume that there
is a linear trend between the real estate farm loans and the serial numbers.
Apply the Yates end correction to obtain an unbiased estimate of the population
mean.
Solution. We have N = 50 and n = 10, therefore k = N/n = 50/10 = 5. We used the
8th column of the Pseudo-Random Numbers (PRN) given in Table 1 of the
Appendix to select a random number between 1 and 5. We observed random
number 2. Thus the systematic sample consists of the following 10 distinct units:
02, 07, 12, 17, 22, 27, 32, 37, 42 and 47.
According to the Yates end correction, the adjustment to the first and last
weights is given by

$$
x = \frac{1}{k(n-1)}\left[\frac{k+1}{2} - i\right]
= \frac{1}{5(10-1)}\left[\frac{5+1}{2} - 2\right] = \frac{1}{45}.
$$
Thus we have the following table:

Sr.No.  State   y_i        Weight   Corrected weight   Corrected product   Uncorrected product
02      AK      2.605      0.1      0.0778             0.202               0.261
07      CT      7.130      0.1      0.1000             0.713               0.713
12      ID      53.753     0.1      0.1000             5.375               5.375
17      KY      1045.106   0.1      0.1000             104.510             104.511
22      MI      323.028    0.1      0.1000             32.302              32.303
27      NE      1337.852   0.1      0.1000             133.785             133.785
32      NY      201.631    0.1      0.1000             20.163              20.163
37      OR      114.899    0.1      0.1000             11.489              11.489
42      TN      553.266    0.1      0.1000             55.326              55.326
47      WA      1100.745   0.1      0.1222             134.535             110.074

From the above calculations, the biased estimate of the average real estate farm
loans is 474.002, and the Yates end corrected (unbiased) estimate is 498.404.
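The arithmetic above can be reproduced in a few lines; the sketch below recomputes both estimates from the sampled loans, exploiting the fact that only the first and last weights change (by -x and +x respectively).

```python
# Reworking Example 7.6.1: Yates' end correction changes only the first
# and last weights, from 1/n to 1/n - x and 1/n + x respectively, with
# x = {(k+1)/2 - i} / {k(n-1)} for random start i.

y = [2.605, 7.130, 53.753, 1045.106, 323.028,
     1337.852, 201.631, 114.899, 553.266, 1100.745]   # sampled loans
n, k, i = 10, 5, 2
x = ((k + 1) / 2 - i) / (k * (n - 1))                 # = 1/45 here
biased = sum(y) / n
corrected = biased + x * (y[-1] - y[0])               # shift weight between ends
print(round(biased, 3), round(corrected, 3))
```

This gives about 474.002 and 498.405; the book's 498.404 differs only through rounding of the tabled products.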

Singh and Singh (1977) suggested a new systematic sampling scheme which
provides an unbiased estimator of the variance of the sample mean. Here we shall
discuss their technique briefly. Suppose a population consists of N distinct units
and a sample of size n has to be drawn. Let u(\le n) and d be two predetermined
integers which are chosen in such a way that: ( i ) every sample contains distinct
units; ( ii ) the inclusion probability for each pair of units is non-zero.
Starting with a random number r(\le N), select u units continuously, and
thereafter the remaining n - u = v (say) units with span d such that d \le u and
u + vd \le N. With these conditions, a sample of size n can be drawn in two or
more phases. The number of phases p (say) required for selecting a sample of n
units from the population of N units satisfies

$$
p \ge \frac{\log\{\log(N/2)\} - \log\{\log(n/2)\}}{\log(2)}. \qquad (7.7.1)
$$

The sample space corresponding to this selection scheme contains N possible
samples. Thus the probability of selecting each sample is 1/N and the probability
measure associated with the selection procedure is

$$
P = \left\{P(s_r) : P(s_r) = \frac{1}{N},\ r = 1, 2, 3, ..., N\right\}, \qquad (7.7.2)
$$

where the sample s_r corresponding to random start r is given by
s_r = (s_r', s_r''), where s_r' consists of units with indices r + m,
(m = 0, 1, 2, ..., u - 1) and s_r'' consists of units with indices
(r + u - 1) + td, (t = 1, 2, ..., v). Assume that the first and second order
inclusion probabilities \pi_i and \pi_{ij} are known. Then we have the following
results.

Result 7.7.1. The Horvitz and Thompson (1952) type estimator of the population
mean is given by

$$
\bar{y}_{sys} = \frac{1}{N}\sum_{i=1}^{n}\frac{y_i}{\pi_i}. \qquad (7.7.3)
$$

Result 7.7.2. The variance of the estimator \bar{y}_{sys} is given by

$$
V(\bar{y}_{sys}) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j>i}^{N}
\left(\pi_i\pi_j - \pi_{ij}\right)
\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2. \qquad (7.7.4)
$$

Result 7.7.3. An unbiased estimator of V(\bar{y}_{sys}) is given by

$$
\hat{v}(\bar{y}_{sys}) = \frac{1}{N^2}\sum_{i=1}^{n}\sum_{j>i}^{n}
\left(\frac{1}{\pi_{ij}} - \frac{N^2}{n^2}\right)
\left(y_i - y_j\right)^2. \qquad (7.7.5)
$$

Example 7.7.1. Select a sample of 10 states from population 1, consisting of 50
states, by using Singh and Singh's systematic sampling scheme. Calculate the
minimum number of phases needed to select a sample of 10 states. Collect the
information on the real estate farm loans and nonreal estate farm loans from the
states selected in the sample. Assume that information on nonreal estate farm
loans is known for the whole population. Estimate the total real estate farm
loans.
Given: X = 43908.12.
Solution. We have N = 50 and n = 10, therefore k = N/n = 50/10 = 5. Let us
decide u = 7 and d = 4. We used the 8th column of the Pseudo-Random Numbers
(PRN) given in Table 1 of the Appendix to select a random number between 1 and
5. We observed random number 2. The first seven units were selected with skip 5
starting from random start 2. The remaining 3 units were selected with skip 4.
Thus we have the following situation.

The minimum number of phases to select the sample is

$$
p \ge \frac{\log\{\log(N/2)\} - \log\{\log(n/2)\}}{\log(2)}
= \frac{\log\{\log(25)\} - \log\{\log(5)\}}{\log(2)} = 1.
$$
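The bound can also be evaluated numerically; the sketch below (hypothetical helper `min_phases_bound`) confirms the value for this example.

```python
# Evaluate the phase bound (7.7.1): p >= [log log(N/2) - log log(n/2)] / log 2.
import math

def min_phases_bound(N, n):
    return (math.log(math.log(N / 2)) - math.log(math.log(n / 2))) / math.log(2)

# For N = 50, n = 10 the bound is exactly 1: log(25) = 2 log(5) makes the
# numerator equal to log(2), so p = 1 phase suffices, as in the example.
print(min_phases_bound(50, 10))
```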

Sr.No.  State   y_i        x_i        P_i        \pi_i        y_i/\pi_i
02      AK      2.605      3.433      0.000078   0.00078186   3331.79
07      CT      7.130      4.373      0.000099   0.00099594   7159.04
12      ID      53.753     1006.036   0.022912   0.22912300   234.60
17      KY      1045.106   557.656    0.012701   0.12700521   8228.84
22      MI      323.028    440.518    0.010033   0.10032723   3219.74
27      NE      1337.852   3585.406   0.081657   0.81657015   1638.37
32      NY      201.631    426.274    0.009708   0.09708318   2076.89
36      OK      612.108    1716.087   0.039084   0.39083591   1566.15
40      SC      87.951     80.750     0.001839   0.01839068   4782.37
44      UT      56.908     197.244    0.004492   0.04492199   1266.82

where P_i = x_i / X and \pi_i = nP_i.

Thus an estimate of the average amount of the real estate farm loans in the United
States is

$$
\bar{y}_{sys} = \frac{1}{N}\sum_{i=1}^{n}\frac{y_i}{\pi_i}
= \frac{33504.64}{50} = 670.093.
$$

The values of (y_i - y_j)^2, i \ne j, are given below (rows i = 2, ..., 10,
columns j = 1, ..., 9, j < i):

i\j    1           2           3           4          5           6           7          8          9
2      20.48
3      2616.12     2173.70
4      1086808.34  1077394.18  982780.77
5      102670.90   99791.55    72509.03    521396.64
6      1782884.55  1770821.04  1648910.24  85700.22   1029867.75
7      39611.35    37830.64    21867.90    711450.08  14737.23    1290998.16
8      371493.91   365998.38   311760.31   187487.27  83567.25    526704.35   168491.37
9      7283.94     6532.03     1169.50     916145.69  55261.20    1562252.51  12923.14   274740.56
10     2948.82     2477.85     9.95        976535.29  70819.85    1640817.53  20944.75   308247.04  963.67

From the above table we have

$$
\sum_{i=1}^{n}\sum_{j>i}^{n}\left(y_i - y_j\right)^2 = 20258417.01.
$$

For simplicity, if we take \pi_{ij} = n(n-1)/\{N(N-1)\} then

$$
\left(\frac{1}{\pi_{ij}} - \frac{N^2}{n^2}\right)
= \frac{N(N-1)}{n(n-1)} - \frac{N^2}{n^2}
= \frac{50\times 49}{10\times 9} - \frac{50^2}{10^2} = 2.2222.
$$
Using the above information, the estimate of the variance of the estimator
\bar{y}_{sys} is

$$
\hat{v}(\bar{y}_{sys}) = \frac{1}{N^2}\sum_{i=1}^{n}\sum_{j>i}^{n}
\left(\frac{1}{\pi_{ij}} - \frac{N^2}{n^2}\right)\left(y_i - y_j\right)^2
= \frac{2.2222\times 20258417.01}{50^2} = 18007.48.
$$
Hence, using Table 2 from the Appendix, the (1 - \alpha)100% = 95% confidence
interval of the average amount of real estate farm loans in the United States is

$$
\bar{y}_{sys} \pm t_{\alpha/2}(df = n-1)\sqrt{\hat{v}(\bar{y}_{sys})},
\text{ or } 670.093 \pm 2.262\sqrt{18007.48},
\text{ or } [366.55,\ 973.64].
$$
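The whole example can be reproduced end to end; the sketch below recomputes the estimate (7.7.3), the simplified variance estimate (7.7.5), and the confidence interval (data transcribed from the tables above; 2.262 is t at the 0.025 level with 9 degrees of freedom).

```python
# End-to-end reworking of Example 7.7.1.
import math

y = [2.605, 7.130, 53.753, 1045.106, 323.028,
     1337.852, 201.631, 612.108, 87.951, 56.908]      # real estate loans
x = [3.433, 4.373, 1006.036, 557.656, 440.518,
     3585.406, 426.274, 1716.087, 80.750, 197.244]    # nonreal estate loans
N, n, X = 50, 10, 43908.12

pi = [n * xi / X for xi in x]                          # pi_i = n * P_i
y_sys = sum(yi / p for yi, p in zip(y, pi)) / N        # estimator (7.7.3)

c = N * (N - 1) / (n * (n - 1)) - N ** 2 / n ** 2      # 1/pi_ij - N^2/n^2
ss = sum((y[i] - y[j]) ** 2 for i in range(n) for j in range(i + 1, n))
v = c * ss / N ** 2                                    # estimator (7.7.5)

half = 2.262 * math.sqrt(v)                            # t_{0.025}, df = 9
print(round(y_sys, 2), round(v, 1),
      (round(y_sys - half, 2), round(y_sys + half, 2)))
```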

Given a finite population \Omega of N distinct and identifiable units with
variable of interest y, the Zinger (1980) sampling design exists if the following
conditions hold:
( a ) First a sample s of size m < N is selected from \Omega using design P_s.
The strategy (P_s, \bar{y}_s) is unbiased for the population mean \bar{Y};
( b ) Once the sample s is selected and held fixed, another sample r of size
n (< N - m) is selected from (\Omega - s) with probability design P_r. The
strategy \{(P_r, t_r) \mid s\ \text{fixed}\} is unbiased for
\bar{Y}_{(\Omega - s)}, and in turn (P_s, \bar{y}_{(\Omega - s)}) is unbiased for
the population mean \bar{Y}.
Espejo (1997) used the above defined Zinger design to propose a Zinger statistic
as

$$
t_E = \beta\bar{y}_s + (1-\beta)(t_r \mid s\ \text{fixed}), \qquad (7.8.1)
$$

where \beta is a non-zero real constant. Then we have the following theorems.

Theorem 7.8.1. The estimator t_E is unbiased for the population mean \bar{Y}.

Proof. We know that

$$
E(\bar{y}_s) = \bar{Y}, \quad\text{and}\quad
E\{E(t_r \mid s)\} = E(\bar{y}_{\Omega - s})
= E\left[\frac{N\bar{Y} - m\bar{y}_s}{N-m}\right] = \bar{Y}.
$$

Thus we have

$$
E(t_E) = \beta E(\bar{y}_s) + (1-\beta)E\{E(t_r \mid s)\}
= \beta\bar{Y} + (1-\beta)\bar{Y} = \bar{Y}.
$$

Hence the theorem.
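The theorem can be checked exhaustively on a toy population when both P_s and P_r are taken to be SRSWOR (one admissible choice; the data and beta below are made up for illustration):

```python
# Exhaustive check of the unbiasedness of t_E = beta*ybar_s + (1-beta)*t_r:
# enumerate every first-phase sample s and every second-phase sample r
# drawn from the remaining units, all equally likely under SRSWOR.
from itertools import combinations

pop = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]     # toy population, mean 5.0
m, n, beta = 2, 2, 0.3
Ybar = sum(pop) / len(pop)

vals = []
for s in combinations(range(len(pop)), m):
    ys = sum(pop[i] for i in s) / m
    rest = [i for i in range(len(pop)) if i not in s]
    for r in combinations(rest, n):
        tr = sum(pop[i] for i in r) / n
        vals.append(beta * ys + (1 - beta) * tr)   # t_E of (7.8.1)

print(round(sum(vals) / len(vals), 10), Ybar)      # both equal 5.0
```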

Theorem 7.8.2. The variance of the estimator t_E is given by

$$
V(t_E) = \left\{\beta - (1-\beta)\frac{m}{N-m}\right\}^2 V(\bar{y}_s)
+ (1-\beta)^2 E\{V(t_r \mid s)\}. \qquad (7.8.2)
$$

Proof. It follows from Espejo (1997).

Theorem 7.8.3. An estimator of V(\bar{y}_s), where
\bar{y}_s = m^{-1}\sum_{i=1}^{m} y_i, is given by

$$
\hat{V}(\bar{y}_s) = \left\{\beta - (1-\beta)\frac{m}{N-m}\right\}^{-2}
\left\{\hat{v}(t_E) - (1-\beta)^2\hat{v}(t_r \mid s\ \text{fixed})\right\}.
\qquad (7.8.3)
$$

Espejo (1997) has shown that the estimators proposed by Gautschi (1957), Heilbron
(1978), Wolter (1984), Wu (1984), Ruiz and Santos (1992), and Rana and Singh
(1989) are also special cases of the estimator (7.8.1). Systematic sampling has
also been discussed by Madow and Madow (1944), Madow (1949, 1953), Finney (1948,
1995), Sukhatme, Panse, and Sastri (1958), and Reddy (1980).

The precision of systematic sampling from populations having cyclic or periodic


trend depends upon the periodicity of the trend.

[Sine-curve population plotted against Period (i).]

Fig. 7.9.1 Population with regular cyclic trend.



The population may exhibit a periodic trend given by the sine curve

$$
y_i = a + \sin(\pi i/n + p) \qquad (7.9.1)
$$

where i varies from 0 to an integral multiple of 2n. In Figure 7.9.1 we have
taken a = 0, p = 0.1, n = 10, \pi = 22/7 and i = 0, 1, 2, 3, ..., 120. Here
2n = 20, which means successive sampling units repeat themselves after every
20th value. It can easily be observed that a 5% systematic sample from such a
population will have its sampling units drawn from the same position of each
cycle. An estimate from such a sample will be as good as a single value. On the
other hand, a 5% random sample will contain units from different parts of the
population, and estimates from such samples will be more precise given the
effect of a periodic trend. In Figure 7.9.1 the height of the curve is equal to
the value of the study variable y. Now if the skip k is equal to the period of
the sine curve, or to an integral multiple of it, then the units marked with
circles are in the sample. Since all circles are at the same height from the
x axis, every observation within the systematic sample is exactly the same. In
other words, the sample has the same information as any single unit of the
population. Thus this case can be considered the least favourable for systematic
sampling. In contrast, the most favourable situation happens when the span k is
an odd multiple of the half-period. In this case, every systematic sample mean
is equal to the population mean, which results in zero variance of the estimator
of population mean. Such situations of half-periodicity are indicated in Figure
7.9.1 with squares. The choice between these two cases depends upon the
relationship between k and the wavelength. Increasing the periodicity in cyclic
populations, while selecting the sample, increases the efficiency of the
estimates. Madow (1949), Finney (1948, 1950), and Milne (1959) have observed
linear and periodic type trends in natural populations.
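The least and most favourable cases just described can be illustrated numerically (same constants as the text: a = 0, p = 0.1, n = 10; the start value 3 is an arbitrary choice):

```python
# Periodic population y_i = sin(pi*i/n + p): a skip equal to the full
# period 2n gives a constant sample (least favourable), while a skip of
# one half-period n makes successive observations cancel in pairs.
import math

n, p = 10, 0.1
y = [math.sin(math.pi * i / n + p) for i in range(120)]

full = y[3::2 * n]            # skip = period: every value identical
half = y[3::n]                # skip = half-period: alternating signs
print(max(full) - min(full))              # ~0: no information beyond one unit
print(abs(sum(half) / len(half)))         # ~0: sample mean hits the sine's mean
```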

So far we have discussed only one-dimensional systematic sampling, in which the
population units are supposed to be in ascending or descending order on a line,
or on the boundary of a circle, based on the serial number or the magnitude of a
known auxiliary variable. A natural question arises: if there are two auxiliary
variables (say x_1 and x_2), what is the best method to use both auxiliary
variables at the selection stage in systematic sampling? In such situations it is
possible to arrange the population units on a plane or area instead of on a
line. Such a scheme is known as a two-dimensional systematic sampling scheme.
Quenouille (1949), Das (1950), Manwani and Singh (1978) and Singh and Chaudhary
(1986) have discussed the problem of two-dimensional systematic sampling. The
simplest extension of a linear systematic sampling scheme to two dimensions is
also known as the squared grid or aligned sampling scheme. To explain the
two-dimensional systematic sampling scheme, we consider a population consisting
of N square grid areas of equal size. Now a sample of n grid areas has to be
selected. Let there be r rows and s columns, such that rs = N, as shown in
Figure 7.10.1.

Fig. 7.10.1 Two-dimensional systematic sample.

The simplest way is to select a pair of random numbers (i, j) such that i \le r
and j \le s. Thus the random location of a grid is unique. For example, let us
suppose the population consists of N = 100 units and we wish to select a sample
of n = 9 units. The simplest way to create 100 grid areas is to form a plane
with r = 10 rows and s = 10 columns as shown in Figure 7.10.2.

[10 x 10 field of grid cells with the aligned sample cells marked.]

Fig. 7.10.2 Square grid or Aligned sample.
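A minimal sketch of the aligned (square-grid) selection follows: one random offset pair fixes the entire sample. The 12 x 12 field and steps of 4 are illustrative choices that keep the sample size constant (the book's 10 x 10, n = 9 example needs unequal steps).

```python
# Aligned (square-grid) two-dimensional systematic sampling: a single
# random offset pair (i, j) determines every selected cell.
import random

def aligned_sample(rows, cols, row_step, col_step, rng):
    i = rng.randrange(row_step)          # 0-based random row offset
    j = rng.randrange(col_step)          # 0-based random column offset
    return [(a, b) for a in range(i, rows, row_step)
                   for b in range(j, cols, col_step)]

cells = aligned_sample(12, 12, 4, 4, random.Random(7))
print(len(cells), sorted({a for a, b in cells}))   # 9 cells on 3 aligned rows
```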

Another method of selecting a sample from a two-dimensional space is called the
Unaligned Sampling Method. In this method the grids are arranged in the form of
r x l rows and c x m columns, and it is required to select a systematic sample
of n = rc grids. A pictorial representation of such a method is given in Figure
7.10.3.

[10 x 10 field of grid cells with the unaligned sample cells marked.]

Fig. 7.10.3 Unaligned systematic sampling.

Select r independent random numbers i_1, i_2, ..., i_r, each less than or equal
to l, and c independent random numbers j_1, j_2, ..., j_c, each less than or
equal to m. The grids included in the unaligned systematic sample are
\{i_{x+1} + xl,\ j_{y+1} + ym\} for x = 0, 1, 2, ..., (r - 1) and
y = 0, 1, 2, ..., (c - 1). Quenouille (1949) and Das (1950) have shown that
unaligned systematic sampling remains better than square grid or aligned
two-dimensional systematic sampling. Milne (1959) has suggested a central square
grid technique, for which the first sampling point starts from the centre of the
square. If more than two auxiliary variables are to be used for selecting the
sample using systematic sampling, such a scheme is called multi-dimensional
systematic sampling. Yates (1960) has considered a three-dimensional lattice
sampling strategy. Decent reviews on systematic sampling can be found in
Buckland (1951) and Iachan (1982).
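The selection rule just quoted can be sketched as follows (0-based offsets; the block sizes are illustrative, giving a 10 x 10 field and n = rc = 4 cells):

```python
# Unaligned two-dimensional systematic sampling per the quoted rule:
# r row offsets i_1..i_r (each < l) and c column offsets j_1..j_c (each
# < m); the cell (i_{x+1} + x*l, j_{y+1} + y*m) is taken from block (x, y).
import random

def unaligned_sample(r, l, c, m, rng):
    i = [rng.randrange(l) for _ in range(r)]
    j = [rng.randrange(m) for _ in range(c)]
    return [(i[x] + x * l, j[y] + y * m)
            for x in range(r) for y in range(c)]

cells = unaligned_sample(2, 5, 2, 5, random.Random(3))
print(cells)                       # 4 cells inside the 10 x 10 field
```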

EXERCISES

Exercise 7.1. Give the circumstances under which systematic sampling is preferred
over simple random sampling. Discuss the difficulties in estimating the variance
of the estimator of population mean under a systematic sampling design.

Exercise 7.2. What was the need for circular systematic sampling? Is it always
preferable to usual systematic sampling? Give a practical example of using
circular systematic sampling.

Exercise 7.3. Discuss a procedure of drawing a sample of n units from a
population of N units using ( a ) systematic sampling, ( b ) circular systematic
sampling, ( c ) any modified procedure free of the difficulties in parts ( a )
and ( b ).

Exercise 7.4. What is the Yates end correction in systematic sampling? How does
it affect the variance of the estimator of population mean under systematic
sampling? Is there any difficulty in implementing this method in actual
practice?

Exercise 7.5. Discuss the systematic sampling strategies under ( a ) a linear
trend, ( b ) a cyclic trend.

Exercise 7.6. What is Balanced Systematic Sampling (BSS)? Show that the variance
of the estimator of the population mean under BSS reduces to zero.

Exercise 7.7. Consider a population with quadratic trend y_i = i^2,
i = 1, 2, ..., 25, as shown in the following figure.

[Plot: quadratic trend y_i = i^2 for i = 1, ..., 25.]

Compare the value of V(\bar{y}_{sys}) = E(\bar{y}_{sys} - \bar{Y})^2 given by the
i th systematic sample of size 5 under the Yates end correction method and under
the usual method of systematic sampling.

Exercise 7.8. Mr. Bean observed that the alternating current follows a sine
curve as shown in Figure 7.9.1. Mr. Bean's interest was to estimate the average
amount of current. He used systematic sampling to select a sample of n = 4 units
with random start r = 3 from a population of N = 40 as shown below.
i:     3      13     23     33
y_i:  +0.86  -0.86  +0.86  -0.86
From the sample information, he concluded that the average current is zero. Do
you agree with him? If not, why?

Exercise 7.9. Write a short note on two-dimensional systematic sampling. Is it


possible to use three-dimensional systematic sampling? Can you suggest a practical
situation?

Exercise 7.10. Suppose the units in the population can be arranged in increasing
(or decreasing) order of magnitude of the study variable. Show that the value of
the intraclass correlation coefficient is negative (refer to Chapter 9). Also
show that both balanced systematic sampling and modified systematic sampling
remain superior to usual systematic sampling for estimating the population total
or mean.
Hint: Reddy (1980).

Exercise 7.11. Suggest an estimator of the correlation coefficient in systematic
sampling. Derive the expressions of its bias and variance.
Hint: Gupta and Singh (1990).

Exercise 7.12. Under the superpopulation model:

E_m(y_i \mid x_i) = \alpha + \beta x_i, \quad
V_m(y_i \mid x_i) = \sigma^2 x_i^g, \quad\text{and}\quad
Cov_m(y_i, y_j) = 0,\ i \ne j,

assuming that the units in the population are arranged in ascending order of the
auxiliary variable x, let v_srswr, v_srswor, v_sys, v_bss and v_mss denote the
variance of the sample mean estimator under SRSWR sampling, SRSWOR sampling,
systematic sampling, balanced systematic sampling, and modified systematic
sampling, respectively; then compare these variances.
Given that balanced systematic sampling is due to Murthy (1977) and modified
systematic sampling is due to Singh, Jindal, and Garg (1968).
Hint: Reddy (1980).
Exercise 7.13. Consider a superpopulation model:

$$
m:\ y_i = \alpha + \beta\left(i - \frac{N+1}{2}\right) + e_i,
\quad i = 1, 2, ..., N,
$$

where e_i is a random variable with E_m(e_i \mid x_i) = 0,
V_m(e_i \mid x_i) = \sigma^2, Cov_m(e_i, e_j \mid x_i, x_j) = 0, i \ne j. Let us
consider the case such that N = nk, where k is an integer and n is the sample
size. Let the expected value of the mean square error

of the estimator \bar{y}_s be defined as E_m E_p[\bar{y}_s - \bar{Y}]^2 for
s = sys, mss, css and bss over the design p. Show that:
( a ) For systematic sampling

$$
E_m[MSE(\bar{y}_{sys})] = \frac{\sigma^2(1-f)}{n} + \frac{\beta^2(k^2-1)}{12};
$$

( b ) For modified systematic sampling (Singh, Jindal, and Garg, 1968)

$$
E_m[MSE(\bar{y}_{mss})] =
\begin{cases}
\dfrac{\sigma^2(1-f)}{n} + \dfrac{\beta^2(k^2-1)}{12n^2} & \text{if } n \text{ is odd},\\[2mm]
\dfrac{\sigma^2(1-f)}{n} & \text{if } n \text{ is even};
\end{cases}
$$

( c ) For centred sampling

$$
E_m[MSE(\bar{y}_{css})] =
\begin{cases}
\dfrac{\sigma^2(1-f)}{n} + \dfrac{\beta^2}{4} & \text{if } k \text{ is even},\\[2mm]
\dfrac{\sigma^2(1-f)}{n} & \text{if } k \text{ is odd};
\end{cases}
$$

( d ) For balanced systematic sampling

$$
E_m[MSE(\bar{y}_{bss})] =
\begin{cases}
\dfrac{\sigma^2(1-f)}{n} + \dfrac{\beta^2(k^2-1)}{12n^2} & \text{if } n \text{ is odd},\\[2mm]
\dfrac{\sigma^2(1-f)}{n} & \text{if } n \text{ is even}.
\end{cases}
$$

Hint: Fountain and Pathak (1989).
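The trend term in part (a) can be verified for the noise-free case sigma = 0, where the model reduces to a pure linear trend and the MSE over the k random starts is an exact enumeration (constants illustrative):

```python
# With sigma = 0 the model is y_i = alpha + beta*i (up to centring), and
# E_m[MSE(y_sys)] reduces to the trend term beta^2 * (k^2 - 1) / 12.
alpha, beta, n, k = 2.0, 3.0, 4, 5
N = n * k
y = [alpha + beta * i for i in range(1, N + 1)]
Ybar = sum(y) / N

# MSE over the k equally likely systematic samples (0-based start s)
mse = sum((sum(y[s + j * k] for j in range(n)) / n - Ybar) ** 2
          for s in range(k)) / k
print(mse, beta ** 2 * (k ** 2 - 1) / 12)        # 18.0 18.0
```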

Exercise 7.14. Consider a superpopulation model:

$$
m:\ y_i = \alpha + \beta\left(i - \frac{N+1}{2}\right) + e_i,
\quad i = 1, 2, ..., N,
$$

where e_i is a random variable with

E_m(e_i \mid x_i) = 0, \quad V_m(e_i \mid x_i) = \sigma^2, \quad
Cov_m(e_i, e_j \mid x_i, x_j) = 0,\ i \ne j.

Let us consider the case such that N = nk, where k is an integer and n is the
sample size. Now consider the linear regression estimator defined as

$$
\bar{y}_{lr} = \bar{y} + \hat{\beta}\left[\frac{N+1}{2}
- \frac{1}{n}\sum_{j=1}^{n} i_j\right],
$$

where i_j denotes the serial number of the j th sampled unit.
( a ) Show that \bar{y}_{lr} is the best linear unbiased estimator of the
population mean \bar{Y}. Also show the following relations:
( b ) under SRSWOR sampling

$$
E_m[MSE(\bar{y}_{lr})_{srswor}] = \frac{\sigma^2(1-f)}{n}
+ \frac{\sigma^2(k-1)}{nk(n-1)};
$$

( c ) under systematic sampling

$$
E_m[MSE(\bar{y}_{lr})_{sys}] = \frac{\sigma^2(1-f)}{n}
+ \frac{\sigma^2(k-1)(k+1)}{2nk(n-1)(n+1)};
$$

if k is odd,

if k is even .

Exercise 7.15. ( a ) Under the model m: y_i = \alpha + \beta i + \gamma x_i + e_i,
where \alpha, \beta and \gamma are constants, e_i \sim N(0, \sigma^2), and x_i is
a periodic function of i with a period of 2f, and N = 2fQ = nk, compare the
expected variances given by E_m[v_{srswr}], E_m[v_{sys}], E_m[v_{srswor}],
E_m[v_{bss}], and E_m[v_{mss}].
Hint: Madow and Madow (1944).
( b ) After eliminating the linear trend from the model
m: y_i = \alpha + \beta i + \gamma x_i + e_i, show that the following relations
hold:
( i ) E_m[v_{sys}] \le E_m[v_{bss}];
( ii ) E_m[v_{bss}] = E_m[v_{sys}] \le E_m[v_{srswor}] when k/f and n are both
even;
( iii ) E_m[v_{sys}] \ge E_m[v_{srswor}] when k/f is odd;
and
( iv ) E_m[v_{srswor}] \ge E_m[v_{mss}] when k/f, k and n/2 are odd.
Hint: Bellhouse and Rao (1975).

Exercise 7.16. Under the model m: y_i = \rho y_{i-1} + e_i, i = 1, 2, ..., N,
where the e_i \sim (0, \sigma^2) are independent and identically distributed,
show that the centrally located systematic sample (css) is optimum.
Hint: Blight (1973).

Exercise 7.17. Let N (= kn, where n \le k) be the population size. The
population units U_1, U_2, ..., U_N are arranged in an n x k matrix M (say) and
the j th row of the matrix M is denoted by R_j, j = 1, 2, ..., n. Note that the
elements of the row R_j = \{u_{(j-1)k+i},\ i = 1, 2, ..., k\}. Consider a
sampling scheme that selects n units from a diagonal of the matrix M, so that
the selected units belong to different rows and columns. Let y_{ij} be the value
corresponding to the i th row and j th column; then the sample consists of the
observations s_r = \{y_{1r}, y_{2(r+1)}, ..., y_{n(r+n-1)}\}, for
r = 1, 2, ..., k. Note that if r + n - 1 > k then the column index has to be
reduced mod k.
( a ) In the hypothetical situation of linear trend Y_i = a + bX_i,
i = 1, 2, ..., N, show that the variances of the simple random sample mean
(\bar{y}_r), the systematic sample mean (\bar{y}_{sy}) and the diagonal
systematic sample mean (\bar{y}_{dsy}) are given by

$$
V(\bar{y}_r) = \frac{(k-1)(N+1)b^2}{12}, \quad
V(\bar{y}_{sy}) = \frac{(k-1)(k+1)b^2}{12}, \quad\text{and}\quad
V(\bar{y}_{dsy}) = \frac{(k-n)\{n(k-n)+2\}b^2}{12n}.
$$

( b ) Also show that

$$
V(\bar{y}_{dsy}) \le V(\bar{y}_{sy}) \le V(\bar{y}_r).
$$

( c ) If k = n, then V(\bar{y}_{dsy}) = 0; that is, diagonal systematic sampling
becomes a completely trend-free sampling scheme.
Hint: Subramani (2000), Mukerjee and Sengupta (1990).
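The expression for V(y_dsy) in part (a) can be checked by enumerating all k diagonal samples under a linear trend (the helper name and constants are illustrative):

```python
# Numerical check of V(y_dsy) under the linear trend y_i = a + b*i:
# units sit in an n x k matrix row by row, and sample r takes the
# "diagonal" cell in row j at column (r + j - 1), reduced mod k.
def dsy_variance(a, b, n, k):
    N = n * k
    y = [a + b * i for i in range(1, N + 1)]
    Ybar = sum(y) / N
    means = []
    for r in range(1, k + 1):
        cols = [(r + j - 1 - 1) % k + 1 for j in range(1, n + 1)]
        serials = [(j - 1) * k + c for j, c in zip(range(1, n + 1), cols)]
        means.append(sum(y[t - 1] for t in serials) / n)
    return sum((m - Ybar) ** 2 for m in means) / k

a, b, n, k = 0.0, 1.0, 2, 4
print(dsy_variance(a, b, n, k),
      (k - n) * (n * (k - n) + 2) * b ** 2 / (12 * n))   # 0.5 0.5
```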

Practical 7.1. Select a sample of 10 states from population 1, consisting of 50
states, by using the systematic sampling scheme. Collect the information on the
nonreal estate farm loans from the states selected in the sample. Assume that
there is a linear trend between the nonreal estate farm loans and the serial
numbers. Apply the Yates end correction to obtain an unbiased estimate of the
population mean.

Practical 7.2. Select a sample of 10 states from population 1 consisting of 50 states


by using systematic sampling scheme. Collect the information on the nonreal estate
farm loans from the states selected in the sample. Use an appropriate method for
estimating the variance of the estimator of population mean.

Practical 7.3. Use two-letter abbreviations to sort the 50 states listed in
population 1 of the Appendix, and then write them on a circle in the clockwise
direction as shown below:

Select a circular systematic sample of 10 states out of 50 states, and collect
the information from population 1 given in the Appendix on the real estate farm
loans from the states selected in the CS sample.

( a ) Estimate the average real estate farm loans in the United States.
Chapter 7: Systematic Sampling 647

( b ) Select another independent circular systematic sample of 10 states. Use
the information from both samples to estimate the variance of the estimator of
the average real estate farm loans based on circular systematic sampling.

Practical 7.4. Select a sample of 15 states from population 1, consisting of 50
states, by using Singh and Singh's systematic sampling scheme. Calculate the
minimum number of phases needed to select a sample of 15 states. Collect the
information on the real estate farm loans and nonreal estate farm loans from the
states selected in the sample. Assume that information on nonreal estate farm
loans is known for the whole population. Estimate the total real estate farm
loans and find an estimator of its variance.
Given: X = 43908.12.
Practical 7.5. Select three sub-samples each consisting of 5 states from population
1 with 50 states by using systematic sampling. Collect the information on the
nonreal estate farm loans from the states selected in the sample . Obtain a pooled
estimate of the average nonreal estate farm loans in the United States. Use an
appropriate method for estimating the variance of the resultant pooled estimator.

Practical 7.6. From population 1 given in the Appendix, apply the following
methods to select a sample of 10 units and compare the estimates.
( a ) Select a sample with random start 1 \le i \le k consisting of the units:

$$
s(n) =
\begin{cases}
\{i + jk,\ N - i - jk + 1;\ j = 0, 1, 2, ..., (n/2 - 1)\} & \text{if } n \text{ is even},\\
\{i + jk,\ N - i - jk + 1,\ i + (n-1)k/2;\ j = 0, 1, ..., \{(n-1)/2 - 1\}\} & \text{if } n \text{ is odd},
\end{cases}
$$

which is called modified systematic sampling.
Hint: Singh, Jindal and Garg (1968).
( b ) Select a sample with random start 1 \le i \le k consisting of the units:

$$
s(n) =
\begin{cases}
\{i + 2jk,\ 2(j+1)k - i + 1;\ j = 0, 1, 2, ..., (n/2 - 1)\} & \text{if } n \text{ is even},\\
\{i + 2jk,\ 2(j+1)k - i + 1,\ i + (n-1)k;\ j = 0, 1, ..., (n-3)/2\} & \text{if } n \text{ is odd},
\end{cases}
$$

which is also called balanced systematic sampling.
Hint: Murthy (1977), Sethi (1965) .
( c ) Apply the following formulae to estimate the variance in each situation
and construct 95% confidence intervals:

$$
(i)\quad v_y = \frac{1}{6n(n-2)}\sum_{j=3}^{n}
\left(y_{i,j} - 2y_{i,j-1} + y_{i,j-2}\right)^2,
\quad \text{called Yates' method};
$$

$$
(ii)\quad v_{sd} = \frac{1}{2n(n-1)}\sum_{j=2}^{n}
\left(y_{i,j} - y_{i,j-1}\right)^2,
\quad \text{called the method of successive differences}.
$$

Hint: Yates (1948).


( d ) Divide the sample into two parts, say s_1 and s_2 (each consisting of 5
units). Estimate the variance using the formula

$$
v_k = \frac{1}{4}\left(\bar{y}_{s_1} - \bar{y}_{s_2}\right)^2.
$$

Hint: Koop (1971).



Practical 7.7. Select all possible five samples, each of 10 units, with their
respective random starts 1 \le i \le 5 from population 1 given in the Appendix.
Calculate the following estimators of the variance, given by

$$
v_1 = \frac{(1-f)}{n}s_y^2, \quad\text{where}\quad
s_y^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(y_{i,j} - \bar{y}_i\right)^2
\quad\text{and}\quad \bar{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_{i,j},
$$

$$
v_2 = \frac{1-f}{2n(n-1)}\sum_{j=2}^{n}\left(y_{i,j} - y_{i,j-1}\right)^2, \quad
v_3 = \frac{1-f}{6n(n-2)}\sum_{j=3}^{n}
\left(y_{i,j} - 2y_{i,j-1} + y_{i,j-2}\right)^2,
$$

and

$$
v_4 = \frac{1-f}{3.5n(n-4)}\sum_{j=5}^{n}
\left(\frac{y_{i,j}}{2} - y_{i,j-1} + y_{i,j-2} - y_{i,j-3}
+ \frac{y_{i,j-4}}{2}\right)^2.
$$

Also find their empirically expected values, biases, and mean square errors.
Suggest an efficient estimator based on the empirical results.
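A sketch of the four estimators applied to one systematic sample follows (f = n/N is the sampling fraction); on an exactly linear sample the higher-order difference estimators v3 and v4 vanish, which is the kind of behaviour the empirical comparison is meant to expose.

```python
# The four variance estimators of Practical 7.7 for a single systematic
# sample y_1..y_n with sampling fraction f = n/N.
def var_estimators(y, f):
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    v1 = (1 - f) / n * s2
    v2 = (1 - f) / (2 * n * (n - 1)) * sum((y[j] - y[j-1]) ** 2
                                           for j in range(1, n))
    v3 = (1 - f) / (6 * n * (n - 2)) * sum((y[j] - 2*y[j-1] + y[j-2]) ** 2
                                           for j in range(2, n))
    v4 = (1 - f) / (3.5 * n * (n - 4)) * sum(
        (y[j]/2 - y[j-1] + y[j-2] - y[j-3] + y[j-4]/2) ** 2
        for j in range(4, n))
    return v1, v2, v3, v4

# on a pure linear trend the difference-based v3 and v4 are exactly zero
y = [2.0 * i for i in range(1, 11)]        # y_j = 2j, n = 10
print([round(v, 4) for v in var_estimators(y, 10 / 50)])
```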
Practical 7.8. Select t = 2 systematic samples, each of n = 10 units, with two
different random starts from population 1 consisting of N = 50 units. Let k = 5,
so that the condition N = nk = 10 x 5 = 50 is satisfied. Also note that the
sample size n = 10 is divisible by t = 2. Take an SRS of t = 2 units from the
first block of tk = 2 x 5 = 10 units and select every tk = 10th unit thereafter.
This method divides the population into tk samples, each of size
m = N/(tk) = n/t units, and selects t = 2 systematic samples (or clusters) by
SRSWOR sampling. From the selected t samples, each consisting of m units, find
the sample means \bar{y}_j, j = 1, 2, ..., t. Estimate the population mean by the
pooled estimator

$$
\bar{y}_{pooled} = \frac{1}{t}\sum_{j=1}^{t}\bar{y}_j,
$$

and estimate its variance as

$$
\hat{v} = \frac{(k-1)}{kt}\cdot\frac{1}{(t-1)}\sum_{j=1}^{t}
\left(\bar{y}_j - \bar{y}_{pooled}\right)^2.
$$

Derive a 95% confidence interval estimate.
Hint: Iachan (1982), Gautschi (1957), Shiue (1966).

Practical 7.9. The following data show nine trees on a portion of the Indian
Grand Trunk Road, with their heights (feet) and timber weights (kg):

Height: 10.5  11.5  12.2  8.4   15.9  12.7  14.3  11.7  10.5
Weight: 202   125   219   107   187   198   210   213   209
( a ) Select all the possible three samples each of three trees using systematic
sampling, and estimate the average height (and weight) from each sample.
( b ) Use these estimates to estimate the variance. (Rule: Use first column of the
Pseudo-Random Number Table 1 given in the Appendix .)
8. STRATIFIED AND POST-STRATIFIED SAMPLING

Stratified and post-stratified sampling schemes are useful survey techniques


commonly used by government agencies , private consultants, and applied
statisticians. Let us differentiate between stratified and post-stratified sampling.

Stratified Sampling: In this sampling scheme, the population of N units is
subdivided into L homogeneous sub-groups called strata, such that the h th
stratum consists of N_h units, where h = 1, 2, ..., L and
\sum_{h=1}^{L} N_h = N. From the h th population stratum consisting of N_h
units, a sample of size n_h is drawn using any sampling design such that
\sum_{h=1}^{L} n_h = n, the required sample size. A pictorial representation of
the stratified sampling scheme is given in Figure 8.0.1.
stratified sampling scheme is given in Figure 8.0.1.

[Population of N units divided into L homogeneous strata of sizes N_1, N_2, ..., N_L; from each stratum a sample of size n_1, n_2, ..., n_L is selected with design p.]
Fig. 8.0.1 Stratified Sampling Scheme .

Post-stratified sampling: In a post-stratified sampling scheme, a sample of n
units is first selected from the population of N units using any sampling
design. The population is stratified into L strata on the basis of some
available auxiliary information. In post-stratified sampling, the values of
N_h, where h = 1, 2, ..., L and \sum_{h=1}^{L} N_h = N, may or may not be known.
Each sample unit selected with the chosen design is then post-stratified, or
placed in the h th stratum, based on the auxiliary information associated with
each sampled unit, such that \sum_{h=1}^{L} n_h = n.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

A pictorial representation of the post-stratified sampling scheme is given in
Figure 8.0.2.

[Population of N units; the sampled n units are distributed among the L post-strata.]
Fig. 8.0.2. Post-stratified Sampling Scheme

Thus the difference between stratified and post-stratified sampling schemes is
that in stratified sampling the sub-sample size n_h is a fixed or predecided
number, whereas in post-stratified sampling it is a random variable.

Now we will discuss each case in detail.

The population of N units is first subdivided into L homogeneous subgroups
called strata, such that the h th stratum consists of N_h units, where
h = 1, 2, ..., L and \sum_{h=1}^{L} N_h = N. Let y_{hi} be the i th population
value of the study variable in the h th stratum, i = 1, 2, ..., N_h, such that
the h th stratum population mean is given by

$$
\bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} y_{hi},
\quad\text{for } h = 1, 2, ..., L.
$$

Obviously, using the concept of a weighted average, the true population mean of the
whole population can be written as:

Ȳ = (N_1 Ȳ_1 + N_2 Ȳ_2 + ... + N_L Ȳ_L) / (N_1 + N_2 + ... + N_L)
  = (N_1/N) Ȳ_1 + (N_2/N) Ȳ_2 + ... + (N_L/N) Ȳ_L
  = W_1 Ȳ_1 + W_2 Ȳ_2 + ... + W_L Ȳ_L
  = Σ_{h=1}^L W_h Ȳ_h.

Consider a sample of size n_h drawn using SRSWOR sampling from the h-th
population stratum consisting of N_h units such that Σ_{h=1}^L n_h = n, the required sample
size. Assume the value of the i-th unit of the study variable selected from the h-th
stratum is denoted by y_hi, where i = 1, 2, ..., n_h, and W_h = N_h/N is the known
proportion of population units falling in the h-th stratum.

Then we have the following theorems:


Theorem 8.1.1. An unbiased estimator of the population mean Ȳ is given by

ȳ_st = Σ_{h=1}^L W_h ȳ_h                                        (8.1.1)

where ȳ_h = n_h^{-1} Σ_{i=1}^{n_h} y_hi denotes the h-th stratum sample mean.

Proof. We have

E(ȳ_st) = E[Σ_{h=1}^L W_h ȳ_h] = Σ_{h=1}^L W_h E(ȳ_h) = Σ_{h=1}^L W_h Ȳ_h
        = Σ_{h=1}^L (N_h/N) [N_h^{-1} Σ_{i=1}^{N_h} Y_hi]
        = (1/N) Σ_{h=1}^L Σ_{i=1}^{N_h} Y_hi = Ȳ.
Hence the theorem.

Theorem 8.1.2. Under SRSWOR sampling, the variance of the estimator ȳ_st is
given by

V(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) S_hy^2                 (8.1.2)

where S_hy^2 = (N_h - 1)^{-1} Σ_{i=1}^{N_h} (Y_hi - Ȳ_h)^2 denotes the h-th stratum population
variance, Ȳ_h = N_h^{-1} Σ_{i=1}^{N_h} Y_hi denotes the h-th stratum population mean, and
f_h = n_h/N_h.

Proof. Note that the strata are independent and under SRSWOR sampling we have

V(ȳ_h) = ((1 - f_h)/n_h) S_hy^2.

Thus we have

V(ȳ_st) = V[Σ_{h=1}^L W_h ȳ_h] = Σ_{h=1}^L W_h^2 V(ȳ_h) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) S_hy^2.
Hence the theorem.

Theorem 8.1.3. An unbiased estimator of V(ȳ_st) under SRSWOR sampling is

v(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2,                (8.1.3)

where s_hy^2 = (n_h - 1)^{-1} Σ_{i=1}^{n_h} (y_hi - ȳ_h)^2 and ȳ_h = n_h^{-1} Σ_{i=1}^{n_h} y_hi.

Proof. Obvious on using E(s_hy^2) = S_hy^2.
Remark 8.1.1. The (1 - α)100% confidence interval estimate in stratified random
sampling will be given by

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st).                             (8.1.4)
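The estimator (8.1.1), its variance estimator (8.1.3), and the interval (8.1.4) can be collected into one short routine. The following is a minimal Python sketch (the function name and the convention of passing the t quantile in by hand are assumptions of this sketch, not notation from the text):

```python
import math

def stratified_estimate(strata, N, t_value):
    """Stratified sample mean, its estimated variance, and a CI.

    `strata` is a list of (N_h, sample_values) pairs covering all L strata;
    `t_value` is the t quantile with n - L degrees of freedom, looked up
    separately from a t table.
    """
    y_st, v_st = 0.0, 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        W_h = N_h / N                          # stratum weight
        ybar_h = sum(ys) / n_h                 # stratum sample mean
        s2_h = sum((y - ybar_h) ** 2 for y in ys) / (n_h - 1)
        f_h = n_h / N_h                        # sampling fraction
        y_st += W_h * ybar_h                   # eq. (8.1.1)
        v_st += W_h ** 2 * (1 - f_h) / n_h * s2_h   # eq. (8.1.3)
    half = t_value * math.sqrt(v_st)           # eq. (8.1.4)
    return y_st, v_st, (y_st - half, y_st + half)
```

Feeding in two strata of sample values reproduces the hand computations of the worked examples later in this section.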

Example 8.1.1. The following data show daily temperatures in the London and New
York cities in °F:

City     Temp (°F)
NY       48
London   54
NY       52
NY       47
London   57
NY       54
NY       49
London   59
NY       53
NY       50
NY       52
NY       57
London   55
NY       54
London   68
NY       49
NY       51
London   61
NY       55
NY       53
London   50
Chapter 8: Stratified and Post-Stratified Sampling 653

( a ) If we select an SRSWOR sample of 4 units, find the variance of the estimator of
the population mean.
( b ) If we stratify the population on the basis of location (cities), and then select
two units from each city, find the variance of the estimator of the mean in stratified
sampling.
( c ) Find the relative efficiency of stratified sampling over simple random
sampling.
( d ) Select an SRSWOR sample of four units from the above population and
construct a 95% confidence interval estimate of the population mean.
( e ) Select two units using SRSWOR sampling from London and two units from
New York using stratified random sampling, and construct a 95% confidence interval
estimate of the population mean.

Solution. ( a ) SRSWOR sampling: From the above data we are given N = 21,
n = 4, f = n/N = 4/21, and

City     Y_i   Y_i^2
NY       48    2304
London   54    2916
NY       52    2704
NY       47    2209
London   57    3249
NY       54    2916
NY       49    2401
London   59    3481
NY       53    2809
NY       50    2500
NY       52    2704
NY       57    3249
London   55    3025
NY       54    2916
London   68    4624
NY       49    2401
NY       51    2601
London   61    3721
NY       55    3025
NY       53    2809
London   50    2500

which implies

Σ_{i=1}^N Y_i = 1128,  Σ_{i=1}^N Y_i^2 = 61064

and

S_y^2 = (N - 1)^{-1} {Σ_{i=1}^N Y_i^2 - N^{-1} (Σ_{i=1}^N Y_i)^2}
      = (1/(21 - 1)) {61064 - (1128)^2/21} = 23.71.

Thus the variance of the sample mean estimator is given by

V_srswor(ȳ) = ((1 - f)/n) S_y^2 = ((1 - 4/21)/4) × 23.71 = 4.715.
( b ) Stratified sampling: We stratify the population into two strata based on
location as follows:

Stratum 1 (London)      Stratum 2 (New York)
Y_1i   Y_1i^2           Y_2i   Y_2i^2
54     2916             48     2304
57     3249             52     2704
59     3481             47     2209
55     3025             54     2916
68     4624             49     2401
61     3721             53     2809
50     2500             50     2500
                        52     2704
                        57     3249
                        54     2916
                        49     2401
                        51     2601
                        55     3025
                        53     2809

From Stratum 1 we have

N_1 = 7,  W_1 = N_1/N = 7/21 = 0.3333333,  f_1 = n_1/N_1 = 2/7,
Σ_{i=1}^{N_1} Y_1i = 404,  Σ_{i=1}^{N_1} Y_1i^2 = 23516

and

S_1y^2 = (N_1 - 1)^{-1} {Σ_{i=1}^{N_1} Y_1i^2 - N_1^{-1} (Σ_{i=1}^{N_1} Y_1i)^2}
       = {23516 - (404)^2/7}/(7 - 1) = 33.24.

From Stratum 2 we have

N_2 = 14,  W_2 = N_2/N = 14/21 = 0.6666667,  f_2 = n_2/N_2 = 2/14,
Σ_{i=1}^{N_2} Y_2i = 724,  Σ_{i=1}^{N_2} Y_2i^2 = 37548

and

S_2y^2 = (N_2 - 1)^{-1} {Σ_{i=1}^{N_2} Y_2i^2 - N_2^{-1} (Σ_{i=1}^{N_2} Y_2i)^2}
       = {37548 - (724)^2/14}/(14 - 1) = 8.22.

Thus the variance of the estimator of the population mean in stratified random
sampling is given by

V(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) S_hy^2
        = W_1^2 ((1 - f_1)/n_1) S_1y^2 + W_2^2 ((1 - f_2)/n_2) S_2y^2
        = (7/21)^2 ((1 - 2/7)/2) × 33.24 + (14/21)^2 ((1 - 2/14)/2) × 8.22
        = 1.319246 + 1.565714 = 2.88496.
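The stratum computations above can be checked numerically. The sketch below, assuming the 21 temperatures listed at the start of the example, recomputes S_1y^2, S_2y^2, and V(ȳ_st); small differences from the text's 2.88496 are due to its intermediate rounding of the stratum variances:

```python
london = [54, 57, 59, 55, 68, 61, 50]
new_york = [48, 52, 47, 54, 49, 53, 50, 52, 57, 54, 49, 51, 55, 53]

def s_squared(values):
    # stratum population variance S_hy^2, divisor N_h - 1
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

N = len(london) + len(new_york)            # 21
V_st = 0.0
for stratum, n_h in ((london, 2), (new_york, 2)):
    N_h = len(stratum)
    W_h = N_h / N
    V_st += W_h ** 2 * (1 - n_h / N_h) / n_h * s_squared(stratum)

print(round(s_squared(london), 2),          # 33.24
      round(s_squared(new_york), 2),        # 8.22
      round(V_st, 3))                       # 2.885
```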
( c ) Relative efficiency: The percent relative efficiency of stratified random
sampling over SRSWOR sampling is given by

RE = {V_srswor(ȳ)/V(ȳ_st)} × 100 = (4.715/2.88496) × 100 = 163.45.

( d ) 95% CI estimate using SRSWOR sampling: We started from the first two
columns of the Pseudo-Random Number Table 1 given in the Appendix to select 4
distinct random numbers between 01 and N = 21 (both inclusive).

Thus the sample mean estimate of the population mean based on SRSWOR is

ȳ = (1/n) Σ_{i=1}^n y_i = 204/4 = 51.

The sample variance is given by

s_y^2 = (n - 1)^{-1} {Σ_{i=1}^n y_i^2 - n^{-1} (Σ_{i=1}^n y_i)^2} = {10466 - (204)^2/4}/(4 - 1) = 20.66667.

An estimate of the variance of the sample mean is given by

v_srswor(ȳ) = ((1 - f)/n) s_y^2 = ((1 - 4/21)/4) × 20.66667 = 4.182.

A (1 - α)100% confidence interval estimate of the population mean is given by

ȳ ± t_{α/2}(df = n - 1) √v_srswor(ȳ).

Thus a 95% confidence interval estimate of the population mean is given by

51 ± t_{0.025}(df = 4 - 1) √4.182.

Using Table 2 from the Appendix we have
Using Table 2 from the Appendix we have

51 ± 3.182 √4.182, or [44.49, 57.51].

Note that the true population mean Ȳ = 53.71 lies in the 95% confidence interval.
The interpretation of this confidence interval estimate is that we are 95% sure that
the true population mean lies between 44.49 °F and 57.51 °F.
( e ) 95% CI estimate using stratified random sampling: We selected two units
from Stratum 1 and two units from Stratum 2 using lottery methods. The Stratum 1
sample was:

y_1i   y_1i^2
57     3249
55     3025

From sample stratum 1:

Σ_{i=1}^{n_1} y_1i = 112,  ȳ_1 = (1/n_1) Σ_{i=1}^{n_1} y_1i = 112/2 = 56,  Σ_{i=1}^{n_1} y_1i^2 = 6274

and

s_1y^2 = (n_1 - 1)^{-1} {Σ_{i=1}^{n_1} y_1i^2 - n_1^{-1} (Σ_{i=1}^{n_1} y_1i)^2} = {6274 - (112)^2/2}/(2 - 1) = 2.0.

From sample stratum 2:

Σ_{i=1}^{n_2} y_2i = 109,  ȳ_2 = (1/n_2) Σ_{i=1}^{n_2} y_2i = 109/2 = 54.5,  Σ_{i=1}^{n_2} y_2i^2 = 5953

and

s_2y^2 = (n_2 - 1)^{-1} {Σ_{i=1}^{n_2} y_2i^2 - n_2^{-1} (Σ_{i=1}^{n_2} y_2i)^2} = {5953 - (109)^2/2}/(2 - 1) = 12.5.

Thus an estimator of the variance of the estimator of the population mean in
stratified random sampling is

v(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2
        = W_1^2 ((1 - f_1)/n_1) s_1y^2 + W_2^2 ((1 - f_2)/n_2) s_2y^2
        = (7/21)^2 ((1 - 2/7)/2) × 2.0 + (14/21)^2 ((1 - 2/14)/2) × 12.5
        = 0.079365 + 2.380952 = 2.460317.

Also the stratified estimate of the population mean is given by

ȳ_st = Σ_{h=1}^L W_h ȳ_h = W_1 ȳ_1 + W_2 ȳ_2 = (7/21) × 56 + (14/21) × 54.5 = 55.
Now a (1 - α)100% confidence interval estimate of the population mean Ȳ using
stratified random sampling is given by

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st), or ȳ_st ± t_{0.025}(df = 4 - 2) √v(ȳ_st).

Using Table 2 from the Appendix the required 95% CI estimate is given by

55 ± 4.303 √2.460317, or 55 ± 4.303 × 1.56854, or [48.251, 61.749].

Again note that the true population mean Ȳ = 53.71 lies in the 95% confidence interval.
The interpretation of this confidence interval estimate is that we are 95% sure that
the true population mean lies between 48.251 °F and 61.749 °F.
Conclusion: The length of the CI estimate obtained using stratified sampling,
L_st = 61.749 - 48.251 = 13.498, is more than that for SRSWOR sampling,
L_srs = 57.51 - 44.49 = 13.02. Thus we may conclude that stratified sampling does not
perform better in this particular situation, perhaps because of the small sample size.

Example 8.1.2. Select two units by SRSWOR sampling from each stratum of
population 5 given in the Appendix. Collect the information on the yield/hectare of
the tobacco crop from the countries selected in the sample. Assuming that the total
number of countries in each continent is known, estimate the average yield/hectare
of the tobacco crop in the world. Construct the 95% confidence interval.
Solution. Using the Pseudo-Random Number (PRN) Table 1 given in the Appendix we
have the following sample information and some results.

Stratum  Columns used  Random no.  Yields       W_h     ȳ_h    s_hy^2
1        1             5, 6        2.03, 2.00   0.0566  2.015  0.00045
2        2             4, 2        1.99, 1.51   0.0566  1.750  0.11520
3        3             2, 8        3.69, 2.79   0.0755  3.240  0.40500
4        4 and 5       07, 08      1.36, 2.34   0.0943  1.850  0.48020
5        6 and 7       07, 02      2.05, 0.26   0.1132  1.155  1.60210
6        8             2, 1        1.61, 1.96   0.0377  1.785  0.06130
7        9 and 10      20, 12      2.10, 2.51   0.2830  2.305  0.08410
8        11 and 12     15, 06      1.33, 0.85   0.1604  1.090  0.11520
9        13 and 14     05, 07      1.33, 1.15   0.0944  1.240  0.01620
10       15            3, 2        2.58, 0.95   0.0283  1.765  1.32850

Then we have

Stratum  N_h  f_h   W_h     ȳ_h    W_h ȳ_h  s_hy^2   v_h
1        6    0.33  0.0566  2.015  0.1140   0.00045  0.000000
2        6    0.33  0.0566  1.750  0.0990   0.11520  0.000124
3        8    0.25  0.0754  3.240  0.2446   0.40500  0.000863
4        10   0.20  0.0943  1.850  0.1744   0.48020  0.001708
5        12   0.17  0.1132  1.155  0.1307   1.60210  0.008520
6        4    0.50  0.0377  1.785  0.0672   0.06130  0.000022
7        30   0.07  0.2830  2.305  0.6523   0.08410  0.003132
8        17   0.12  0.1603  1.090  0.1748   0.11520  0.001302
9        10   0.20  0.0943  1.240  0.1171   0.01620  0.000057
10       3    0.67  0.0283  1.765  0.0499   1.32850  0.000176

where v_h = W_h^2 ((1 - f_h)/n_h) s_hy^2.

Thus a stratified estimate of the average yield/hectare of the tobacco crop in the
world is

ȳ_st = Σ_{h=1}^L W_h ȳ_h = 1.8243

and an estimate of the variance of the stratified estimator is given by

v(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2 = 0.015905.

A (1 - α)100% confidence interval estimate of the average yield/hectare of the
tobacco crop during 1998 in the world is

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st).

Using Table 2 from the Appendix the 95% confidence interval estimate is

ȳ_st ± t_{0.025}(df = 20 - 10) √v(ȳ_st), or 1.8243 ± 2.228 √0.015905, or [1.5433, 2.1052].

We know now that the variance of the estimator of the population mean under stratified
sampling is given by the formula in (8.1.2). Furthermore, we know that the choice of
sample size n_h for stratified sampling has to be decided once the sample size n is
chosen. A natural question arises: What choice of n_h will make the variance of the
estimator ȳ_st a minimum? There are several ways to answer this question, but we
will discuss only a few of them here.

Below we discuss different methods of sample allocation, namely:

( a ) equal allocation; ( b ) proportional allocation; and ( c ) optimum allocation.

Let us discuss each one of the above methods in detail.

8.2.1 EQUAL ALLOCATION METHOD

As the name of the method suggests, the sub-sample sizes are equal, i.e., n_h = n/L.
Under this choice of sample allocation, the variance of the estimator ȳ_st reduces to

V(ȳ_st)_E = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) S_hy^2 = Σ_{h=1}^L W_h^2 ((N_h - n_h)/(N_h n_h)) S_hy^2
          = (1/(n N^2)) Σ_{h=1}^L N_h (L N_h - n) S_hy^2.          (8.2.1)

An unbiased estimator of V(ȳ_st)_E is given by

v(ȳ_st)_E = (1/(n N^2)) Σ_{h=1}^L N_h (L N_h - n) s_hy^2.          (8.2.2)

8.2.2 PROPORTIONAL ALLOCATION METHOD

Under this allocation the sub-sample size from each stratum is proportional to the
size of the subpopulation in the stratum. That is,

n_h ∝ N_h.                                                        (8.2.3)

To find the constant of proportionality we write

n_h = K N_h.                                                      (8.2.4)

Taking the sum on both sides of (8.2.4) over all the strata we have

Σ_{h=1}^L n_h = K Σ_{h=1}^L N_h, or n = K N,

which implies that

K = n/N.                                                          (8.2.5)

On substituting (8.2.5) in (8.2.4) we have the proportional allocation in the h-th
stratum as

n_h = (n/N) N_h = n (N_h/N) = n W_h.                              (8.2.6)

Under this allocation the variance of the estimator ȳ_st becomes

V(ȳ_st)_P = (1/N^2) Σ_{h=1}^L N_h ((N_h - n_h)/n_h) S_hy^2 = (1/N^2) Σ_{h=1}^L N_h ((N_h - n W_h)/(n W_h)) S_hy^2
          = (1/(n N)) Σ_{h=1}^L N_h (1 - n/N) S_hy^2 = ((1 - f)/n) Σ_{h=1}^L W_h S_hy^2.   (8.2.7)

If f = n/N is negligible then we have

V(ȳ_st)_P = (1/n) Σ_{h=1}^L W_h S_hy^2.                            (8.2.8)

The unbiased estimator of V(ȳ_st)_P can easily be obtained from (8.2.7) and (8.2.8) by
replacing S_hy^2 with its sample analogue s_hy^2.
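The allocation (8.2.6) rarely yields integers, so some rounding rule is needed. The sketch below uses largest-remainder rounding so that the n_h still sum to n; this is an assumption of the sketch, and the text's worked example (Example 8.2.1) rounds differently, so integer allocations can differ by a unit per stratum:

```python
def proportional_allocation(n, stratum_sizes):
    """Allocate n_h = n N_h / N (eq. 8.2.6), rounded so the n_h sum to n."""
    N = sum(stratum_sizes)
    exact = [n * N_h / N for N_h in stratum_sizes]
    n_h = [int(e) for e in exact]              # floor of each exact share
    shortfall = n - sum(n_h)
    # give the remaining units to the strata with the largest fractional parts
    order = sorted(range(len(exact)), key=lambda h: exact[h] - n_h[h], reverse=True)
    for h in order[:shortfall]:
        n_h[h] += 1
    return n_h

# Stratum sizes of population 5 used in the examples below
print(proportional_allocation(40, [6, 6, 8, 10, 12, 4, 30, 17, 10, 3]))
# [2, 2, 3, 4, 5, 2, 11, 6, 4, 1]
```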

Example 8.2.1. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the yield/hectare of the tobacco crop from the
selected countries. Estimate the average yield/hectare of the tobacco crop in the
world. Estimate the variance under the method of proportional allocation. Construct
a 95% confidence interval.
Solution. By the method of proportional allocation, the number of units to be
selected from the h-th stratum is given by n_h = n N_h/N. Thus we have

Stratum  1  2  3  4   5   6  7   8   9   10  Total
N_h      6  6  8  10  12  4  30  17  10  3   106
n_h      3  3  3  3   4   2  11  6   3   2   40

Using the Pseudo-Random Number (PRN) Table 1 given in the Appendix we collect
the following sample information.

Stratum  Columns used  Random no.  Country               Yield/Hectare
1        1             5           Nicaragua             2.03
                       6           Panama                2.00
                       2           El Salvador           1.79
2        2             4           Jamaica               1.99
                       2           Dominican Rep         1.51
                       1           Cuba                  0.63
3        3             2           Belgium--Lux          3.69
                       8           Spain                 2.79
                       1           Austria               1.90
4        4 and 5       07          Macedonia             1.36
                       08          Poland                2.34
                       02          Albania               0.64
5        6 and 7       07          Moldova               2.05
                       02          Armenia               0.26
                       10          Turkmenistan          2.36
                       09          Tajikistan            2.48
6        8             2           Libya                 1.61
                       1           Algeria               1.96
7        9 and 10      20          Nigeria               2.10
                       12          Kenya                 2.51
                       30          Zambia                1.29
                       15          Malawi                1.22
                       07          Central African Rep   0.87
                       26          Togo                  0.50
                       22          Zimbabwe              2.06
                       01          Angola                0.99
                       11          Cote d'Ivoire         0.26
                       05          Zaire                 1.11
                       06          Cameroon              1.62
8        11 and 12     15          Thailand              1.33
                       06          Indonesia             0.85
                       02          Burma                 1.22
                       10          Korea, South          2.02
                       11          Laos                  0.75
                       05          China                 1.75
9        13 and 14     05          Lebanon               1.33
                       07          Syria                 1.15
                       01          Cyprus                1.50
10       15            3           New Zealand           2.58
                       2           Solomon Islands       0.95
Thus we have

Stratum  ȳ_h     s_hy^2    N_h  n_h  f_h     W_h     W_h ȳ_h  W_h s_hy^2
1        1.9400  0.017100  6    3    0.5000  0.0566  0.1098   0.0009
2        1.3766  0.475733  6    3    0.5000  0.0566  0.0779   0.0269
3        2.7933  0.801033  8    3    0.3750  0.0754  0.2108   0.0604
4        1.4466  0.728133  10   3    0.3000  0.0943  0.1364   0.0686
5        1.7875  1.069825  12   4    0.3333  0.1132  0.2023   0.1211
6        1.7850  0.061250  4    2    0.5000  0.0377  0.0673   0.0023
7        1.3209  0.482449  30   11   0.3666  0.2830  0.3738   0.1365
8        1.3200  0.246160  17   6    0.3529  0.1603  0.2116   0.0394
9        1.3266  0.030633  10   3    0.3000  0.0943  0.1251   0.0028
10       1.7650  1.328450  3    2    0.6666  0.0283  0.0499   0.0375

Thus an estimate of the yield/hectare of the tobacco crop during 1998 in the world is

ȳ_st = Σ_{h=1}^L W_h ȳ_h = 1.5654

and

v(ȳ_st)_P = ((1 - f)/n) Σ_{h=1}^L W_h s_hy^2 = ((1 - 40/106)/40) × 0.4970 = 0.007736.

Using Table 2 from the Appendix the 95% confidence interval estimate of the
yield/hectare in the world during 1998 is

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st)_P, or ȳ_st ± t_{0.025}(df = 40 - 10) √v(ȳ_st)_P,

or 1.5654 ± 2.042 √0.007736, or [1.3857, 1.7450].

8.2.3 OPTIMUM ALLOCATION METHOD

This method is based on the cost aspect of the survey. Let C_h be the cost of
observing the variable y in the h-th stratum and let C_t be the total fixed cost of the
survey; then

C_t = C_0 + Σ_{h=1}^L n_h C_h                                      (8.2.9)

where C_0 stands for the known overhead cost. From (8.1.2), the variance of the
estimator ȳ_st is

V(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) S_hy^2.                  (8.2.10)
We now discuss two cases:

( i ) the total cost is fixed; ( ii ) the variance is fixed.

Case I. Total cost is fixed: Minimization of (8.2.10) subject to (8.2.9) leads to the
Lagrange function

L_1 = Σ_{h=1}^L (1/n_h - 1/N_h) W_h^2 S_hy^2 + λ [C_0 + Σ_{h=1}^L n_h C_h - C_t].   (8.2.11)

On differentiating (8.2.11) with respect to n_h and equating to zero we have

n_h = W_h S_hy / (√λ √C_h).                                        (8.2.12)

Noting that Σ_{h=1}^L n_h = n, from (8.2.12) we have

1/√λ = n / Σ_{h=1}^L (W_h S_hy / √C_h).                            (8.2.13)

On substituting (8.2.13) in (8.2.12) we have

n_h = n (W_h S_hy / √C_h) / Σ_{h=1}^L (W_h S_hy / √C_h).           (8.2.14)
In the particular case C_1 = C_2 = ... = C_L = C, that is, when the cost of sampling in
each stratum is the same, (8.2.14) becomes

n_h = n W_h S_hy / Σ_{h=1}^L W_h S_hy.                             (8.2.15)

In other words the optimum allocation reduces to the famous Neyman (1934)
allocation. On substituting (8.2.14) in (8.2.10), the variance of the estimator ȳ_st
under optimum allocation is given by

V(ȳ_st)_opt = (1/n) (Σ_{h=1}^L W_h S_hy √C_h)(Σ_{h=1}^L W_h S_hy / √C_h) - Σ_{h=1}^L W_h^2 S_hy^2 / N_h.   (8.2.16)

On substituting (8.2.15) in (8.2.10), the variance of the estimator ȳ_st under Neyman
allocation is given by

V(ȳ_st)_N = (1/n) (Σ_{h=1}^L W_h S_hy)^2 - Σ_{h=1}^L W_h^2 S_hy^2 / N_h.   (8.2.17)

It can easily be shown that if f_h is negligible then

V(ȳ_st)_opt = (1/n) (Σ_{h=1}^L W_h S_hy √C_h)(Σ_{h=1}^L W_h S_hy / √C_h)   (8.2.18)

and

V(ȳ_st)_N = (1/n) (Σ_{h=1}^L W_h S_hy)^2.                          (8.2.19)
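The allocations (8.2.14) and (8.2.15) are direct to compute. A minimal sketch (the function names and the two-stratum toy numbers are assumptions of this sketch):

```python
import math

def neyman_allocation(n, W, S):
    # eq. (8.2.15): n_h proportional to W_h S_hy
    total = sum(w * s for w, s in zip(W, S))
    return [n * w * s / total for w, s in zip(W, S)]

def optimum_allocation(n, W, S, C):
    # eq. (8.2.14): n_h proportional to W_h S_hy / sqrt(C_h)
    weights = [w * s / math.sqrt(c) for w, s, c in zip(W, S, C)]
    total = sum(weights)
    return [n * wt / total for wt in weights]

# Two strata of equal weight; the second is three times as variable,
# so Neyman gives it three times the sample.
print(neyman_allocation(8, [0.5, 0.5], [1.0, 3.0]))                # [2.0, 6.0]
# If sampling the second stratum costs four times as much, its share shrinks.
print(optimum_allocation(8, [0.5, 0.5], [1.0, 3.0], [1.0, 4.0]))   # [3.2, 4.8]
```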

Case II. Total variance is fixed: In this case we minimize the cost given by (8.2.9)
subject to the fixed variance V_0:

V(ȳ_st) = Σ_{h=1}^L (1/n_h - 1/N_h) W_h^2 S_hy^2 = V_0.            (8.2.20)

In such situations, the Lagrange function L_2 is given by

L_2 = C_0 + Σ_{h=1}^L n_h C_h + λ [Σ_{h=1}^L (1/n_h - 1/N_h) W_h^2 S_hy^2 - V_0].   (8.2.21)

On differentiating (8.2.21) with respect to n_h and equating to zero we have

n_h = W_h S_hy √λ / √C_h.                                          (8.2.22)

Now there are two possibilities.

( a ) Total sample size is fixed: Adding (8.2.22) over all the strata we have

Σ_{h=1}^L n_h = √λ Σ_{h=1}^L (W_h S_hy / √C_h), or √λ = n / Σ_{h=1}^L (W_h S_hy / √C_h).   (8.2.23)

Thus the optimum value of n_h is given by substituting (8.2.23) in (8.2.22) as

n_h = n (W_h S_hy / √C_h) / Σ_{h=1}^L (W_h S_hy / √C_h).           (8.2.24)

On substituting (8.2.24) in (8.2.9) the minimum cost is given by

C_min = C_0 + n [Σ_{h=1}^L W_h S_hy √C_h / Σ_{h=1}^L (W_h S_hy / √C_h)].   (8.2.25)



( b ) Minimum sample size for a fixed value of variance: From (8.2.20) we have

Σ_{h=1}^L W_h^2 S_hy^2 / n_h = V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h.   (8.2.26)

On substituting the value of n_h from (8.2.22) we have

√λ = Σ_{h=1}^L W_h S_hy √C_h / {V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h}.   (8.2.27)

On substituting this value of λ in (8.2.22) we have

n_h = (W_h S_hy / √C_h) Σ_{h=1}^L W_h S_hy √C_h / [V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h].   (8.2.28)

Summing over all the strata we obtain the minimum sample size n for the fixed
variance as

n = Σ_{h=1}^L n_h = (Σ_{h=1}^L W_h S_hy / √C_h)(Σ_{h=1}^L W_h S_hy √C_h) / [V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h].   (8.2.29)

Case III. Total variance and sample size are fixed: Now the Lagrange function is

L_3 = C_0 + Σ_{h=1}^L n_h C_h + λ_1 [Σ_{h=1}^L W_h^2 (1/n_h - 1/N_h) S_hy^2 - V_0] + λ_2 [Σ_{h=1}^L n_h - n].   (8.2.30)

On differentiating (8.2.30) with respect to n_h and equating to zero we obtain

n_h = √λ_1 W_h S_hy / √(C_h + λ_2).                                (8.2.31)

Noting that Σ_{h=1}^L n_h = n, this implies that

√λ_1 = n / Σ_{h=1}^L (W_h S_hy / √(C_h + λ_2)), and n_h = n (W_h S_hy / √(C_h + λ_2)) / Σ_{h=1}^L (W_h S_hy / √(C_h + λ_2)).   (8.2.32)

The value of λ_2 can be obtained by solving the following equation through an
iterative procedure:

(1/n) (Σ_{h=1}^L W_h S_hy √(C_h + λ_2))(Σ_{h=1}^L W_h S_hy / √(C_h + λ_2)) = V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h.   (8.2.33)

On substituting the value of λ_2 from (8.2.33) in (8.2.32) we obtain the allocation of
the given sample of n units into the different strata for the fixed level of variance.
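Equation (8.2.33) has no closed form in λ_2, but its left side decreases monotonically in λ_2 (toward the Neyman value (Σ W_h S_hy)^2 / n), so one possible iterative procedure is bisection. The sketch below is one such implementation under assumptions of mine: the function name, the bracket [0, 10^6], and the two-stratum toy inputs are not from the text, and V_0 must be attainable for the bracket to contain a root:

```python
import math

def allocation_fixed_n_and_variance(n, W, S, C, N_h, V0, lam2_hi=1e6):
    """Solve eq. (8.2.33) for lambda_2 by bisection, then allocate by (8.2.32)."""
    target = V0 + sum(w ** 2 * s ** 2 / Nh for w, s, Nh in zip(W, S, N_h))

    def lhs(lam2):
        a = sum(w * s * math.sqrt(c + lam2) for w, s, c in zip(W, S, C))
        b = sum(w * s / math.sqrt(c + lam2) for w, s, c in zip(W, S, C))
        return a * b / n

    lo, hi = 0.0, lam2_hi
    for _ in range(200):               # lhs is decreasing in lambda_2
        mid = (lo + hi) / 2
        if lhs(mid) > target:
            lo = mid                   # need a larger lambda_2
        else:
            hi = mid
    lam2 = (lo + hi) / 2
    d = sum(w * s / math.sqrt(c + lam2) for w, s, c in zip(W, S, C))
    return [n * w * s / (math.sqrt(c + lam2) * d) for w, s, c in zip(W, S, C)]
```

By construction the returned n_h sum to n, and the attained variance Σ W_h^2 S_hy^2 (1/n_h − 1/N_h) matches the prescribed V_0 up to the bisection tolerance.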

Example 8.2.2. Select a sample of 40 countries from population 5 using the
Neyman method of allocation. Record the yield/hectare of the tobacco crop from
the selected countries. Merge the strata if required. Estimate the average
yield/hectare of the tobacco crop in the world. Estimate the variance under the
Neyman allocation. Construct a 95% confidence interval estimate.
Solution. By the Neyman method of allocation, the number of units to be selected
from the h-th stratum is given by n_h = n W_h S_hy / Σ_{h=1}^L W_h S_hy. Thus we have
Stratum  N_h  S_hy^2   S_hy      N_h S_hy  n N_h S_hy / Σ N_h S_hy  n_h
1        6    0.02682  0.163788  0.98270   0.5673                   1
2        6    0.21809  0.467001  2.80200   1.6177                   1
3        8    0.34699  0.589065  4.71250   2.7208                   3
4        10   0.23456  0.484314  4.84310   2.7962                   3
5        12   0.58214  0.762981  9.15570   5.2861                   5
6        4    0.15310  0.391280  1.56510   0.9036                   1
7        30   0.34385  0.586387  17.59220  10.1567                  10
8        17   0.37855  0.615264  10.45950  6.0389                   6
9        10   2.01830  1.420669  14.20670  8.2023                   8
10       3    0.97460  0.987218  2.96160   1.7099                   2
Sum                              69.28074                           40

Using the Pseudo-Random Number (PRN) Table 1 given in the Appendix we
collect the following sample information.

Stratum  Columns used  Random no.  Country               Yield/Hectare
1        1             5           Nicaragua             2.03
2        2             4           Jamaica               1.99
3        3             2           Belgium--Lux          3.69
                       8           Spain                 2.79
                       1           Austria               1.90
4        4 and 5       07          Macedonia             1.36
                       08          Poland                2.34
                       02          Albania               0.64
5        6 and 7       07          Moldova               2.05
                       02          Armenia               0.26
                       10          Turkmenistan          2.36
                       09          Tajikistan            2.48
                       04          Georgia               1.63
6        8             2           Libya                 1.61
7        9 and 10      20          Nigeria               2.10
                       12          Kenya                 2.51
                       30          Zambia                1.29
                       15          Malawi                1.22
                       07          Central African Rep   0.87
                       26          Togo                  0.50
                       22          Zimbabwe              2.06
                       01          Angola                0.99
                       11          Cote d'Ivoire         0.26
                       05          Zaire                 1.11
8        11 and 12     15          Thailand              1.33
                       06          Indonesia             0.85
                       02          Burma                 1.22
                       10          Korea, South          2.02
                       11          Laos                  0.75
                       05          China                 1.75
9        13 and 14     05          Lebanon               1.33
                       07          Syria                 1.15
                       01          Cyprus                1.50
                       09          Turkey                0.91
                       06          Oman                  1.11
                       04          Jordan                1.29
                       03          Iraq                  1.09
                       02          Iran                  1.39
10       15            3           New Zealand           2.58
                       2           Solomon Islands       0.95
This leads to the pooled strata and related results given below:

Pooled strata  N_h  n_h  ȳ_h     s_hy^2  f_h     W_h      W_h ȳ_h  v_h
1, 2, 6        16   3    1.8766  0.0537  0.1875  0.15094  0.2832   0.0003313
3              8    3    2.7933  0.8010  0.3750  0.07547  0.2108   0.0009505
4              10   3    1.4467  0.7281  0.3000  0.09434  0.1364   0.0015120
5              12   5    1.7560  0.8073  0.4166  0.11320  0.1987   0.0012070
7              30   10   1.2910  0.5251  0.3333  0.28301  0.3654   0.0028040
8              17   6    1.3200  0.2462  0.3529  0.16037  0.2116   0.0006829
9              10   8    1.2213  0.0363  0.8000  0.09434  0.1152   0.0000081
10             3    2    1.7650  1.3285  0.6666  0.02830  0.0499   0.0001773
Sum                                                       1.5717   0.0076734

where v_h = W_h^2 ((1 - f_h)/n_h) s_hy^2.

Thus an estimate of the yield/hectare of the world tobacco crop during 1998 is

ȳ_st = Σ_{h=1}^L W_h ȳ_h = 1.5717, and v(ȳ_st)_N = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2 = 0.0076734.

Using Table 2 from the Appendix the 95% confidence interval estimate of the
yield/hectare in the world during 1998 is

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st)_N, or ȳ_st ± t_{0.025}(df = 40 - 8) √v(ȳ_st)_N,

or 1.5717 ± 2.037 √0.0076734, or [1.39306, 1.75014].

Example 8.2.3. Select a sample of 40 countries from population 5 using the method
of optimum allocation. Record the yield/hectare of the tobacco crop from the
selected countries. Estimate the average yield/hectare of the tobacco crop in the
world. Estimate the variance under the method of optimum allocation. Construct a
95% confidence interval.
Given: C_1 = $0.5, C_2 = $2.0, C_3 = $3.0, C_4 = $5.0, C_5 = $7.0, C_6 = $1.5, C_7 = $10.0,
C_8 = $5.0, C_9 = $5.0, and C_10 = $3.0.
Solution. By the method of optimum allocation, the number of units to be selected
from the h-th stratum is

n_h = n (W_h S_hy / √C_h) / Σ_{h=1}^L (W_h S_hy / √C_h).

Thus we have

Stratum  N_h  C_h  S_hy    N_h S_hy/√C_h  n_h (exact)  n_h
1        6    0.5  0.1637  1.3897         1.7760       ≈ 2
2        6    2.0  0.4670  1.9813         2.5320       ≈ 3
3        8    3.0  0.5890  2.7207         3.4770       ≈ 3
4        10   5.0  0.4843  2.1659         2.7679       ≈ 3
5        12   7.0  0.7629  3.4605         4.4224       ≈ 4
6        4    1.5  0.3912  1.2779         1.6331       ≈ 2
7        30   10   0.5863  5.5629         7.1091       ≈ 7
8        17   5.0  0.6152  4.6776         5.9777       ≈ 6
9        10   5.0  1.4206  6.3534         8.1193       ≈ 8
10       3    3.0  0.9872  1.7099         2.1851       ≈ 2

Using the Pseudo-Random Number (PRN) Table 1 from the Appendix we have:

Stratum  Columns used  Random no.  Country               Yield/Hectare
1        1             5           Nicaragua             2.03
                       6           Panama                2.00
2        2             4           Jamaica               1.99
                       2           Dominican Rep         1.51
                       1           Cuba                  0.63
3        3             2           Belgium--Lux          3.69
                       8           Spain                 2.79
                       1           Austria               1.90
4        4 and 5       07          Macedonia             1.36
                       08          Poland                2.34
                       02          Albania               0.64
5        6 and 7       07          Moldova               2.05
                       02          Armenia               0.26
                       10          Turkmenistan          2.36
                       09          Tajikistan            2.48
6        8             2           Libya                 1.61
                       1           Algeria               1.96
7        9 and 10      20          Nigeria               2.10
                       12          Kenya                 2.51
                       30          Zambia                1.29
                       15          Malawi                1.22
                       07          Central African Rep   0.87
                       26          Togo                  0.50
                       22          Zimbabwe              2.06
8        11 and 12     15          Thailand              1.33
                       06          Indonesia             0.85
                       02          Burma                 1.22
                       10          Korea, South          2.02
                       11          Laos                  0.75
                       05          China                 1.75
9        13 and 14     05          Lebanon               1.33
                       07          Syria                 1.15
                       01          Cyprus                1.50
                       09          Turkey                0.91
                       06          Oman                  1.11
                       04          Jordan                1.29
                       03          Iraq                  1.09
                       10          Yemen                 1.73
10       15            3           New Zealand           2.58
                       2           Solomon Islands       0.95
Thus we have the following results:

Stratum  ȳ_h    s_hy^2   N_h  n_h  f_h      W_h     W_h ȳ_h  v_h
1        2.015  0.00045  6    2    0.33333  0.0566  0.1140   4.80536E-07
2        1.376  0.47573  6    3    0.50000  0.0566  0.0778   0.000254005
3        2.793  0.80103  8    3    0.37500  0.0754  0.2106   0.000948747
4        1.447  0.72813  10   3    0.30000  0.0943  0.1364   0.001510807
5        1.787  1.06982  12   4    0.33333  0.1132  0.2023   0.002284833
6        1.785  0.06125  4    2    0.50000  0.0377  0.0673   2.17635E-05
7        1.507  0.53546  30   7    0.23333  0.2830  0.4265   0.004696889
8        1.320  0.24616  17   6    0.35294  0.1603  0.2116   0.000682147
9        1.263  0.06717  10   8    0.80000  0.0943  0.1191   1.49327E-05
10       1.765  1.32845  3    2    0.66666  0.0283  0.0499   0.000177327
Sum                                                 1.6155   0.010591932

where v_h = W_h^2 ((1 - f_h)/n_h) s_hy^2. Thus an estimate of the yield/hectare of the
tobacco crop during 1998 in the world is

ȳ_st = Σ_{h=1}^L W_h ȳ_h = 1.6155 and v(ȳ_st)_opt = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2 = 0.010591932.

Using Table 2 from the Appendix the 95% confidence interval estimate of the
yield/hectare in the world during 1998 is

ȳ_st ± t_{α/2}(df = n - L) √v(ȳ_st)_opt, or ȳ_st ± t_{0.025}(df = 40 - 10) √v(ȳ_st)_opt,

or 1.6155 ± 2.042 √0.010591932, or [1.4053, 1.8256].

Example 8.2.4. Find the minimum sample size from population 5 to obtain
estimates of the population mean with different levels of relative variance. Plot
relative variance versus sample size.
Given: C_1 = $0.5, C_2 = $2.0, C_3 = $3.0, C_4 = $5.0, C_5 = $7.0, C_6 = $1.5, C_7 = $10.0,
C_8 = $5.0, C_9 = $5.0, C_10 = $3.0 and Ȳ = 1.5507.
Solution. The minimum sample size n for the fixed variance V_0 is given by

n = (Σ_{h=1}^L W_h S_hy / √C_h)(Σ_{h=1}^L W_h S_hy √C_h) / [V_0 + Σ_{h=1}^L W_h^2 S_hy^2 / N_h].

Thus we have the following table:

Stratum  N_h  C_h  S_hy    W_h     W_h S_hy √C_h  W_h S_hy / √C_h  W_h^2 S_hy^2 / N_h
1        6    0.5  0.1637  0.0566  0.006552078    0.013104156      0.00001431
2        6    2.0  0.4670  0.0566  0.037383268    0.018691634      0.00011646
3        8    3.0  0.5890  0.0754  0.076994560    0.025664853      0.00024701
4        10   5.0  0.4843  0.0943  0.102162993    0.020432599      0.00020875
5        12   7.0  0.7629  0.1132  0.228503058    0.032643294      0.00062159
6        4    1.5  0.3912  0.0377  0.018080007    0.012053338      0.00005448
7        30   10   0.5863  0.2830  0.524729262    0.052472926      0.00091780
8        17   5.0  0.6152  0.1603  0.220619748    0.044123950      0.00057262
9        10   5.0  1.4206  0.0943  0.299675299    0.059935060      0.00179611
10       3    3.0  0.9872  0.0283  0.048392846    0.016130949      0.00026021
Sum                                1.563093       0.295253         0.00480934

The relative variance is defined as

RV = {V(ȳ_st)/Ȳ^2} × 100.

Therefore we have the following table.

RV (%)  V_0       n            n (rounded)
0.1     0.002405  63.97384098  ≈ 64
0.2     0.004809  47.98036859  ≈ 48
0.3     0.007214  38.38428905  ≈ 39
0.4     0.009619  31.98690430  ≈ 32
0.5     0.012023  27.41734456  ≈ 27
0.6     0.014428  23.99017519  ≈ 24
0.7     0.016833  21.32459927  ≈ 21
0.8     0.019237  19.19213870  ≈ 19
0.9     0.021642  17.44739833  ≈ 17
1.0     0.024047  15.99344810  ≈ 16
1.5     0.036070  11.28949194  ≈ 11
The graph in Figure 8.2.1 shows that as the desired level of relative
variance decreases the sample size increases, and vice versa. A more precise
estimate can be obtained with a large sample.

[Figure: plot of relative variance (vertical axis, 0 to 0.04) against sample size
(horizontal axis, 0 to 70), with relative variance decreasing as sample size grows.]

Fig. 8.2.1 Relative variance in stratified sampling.


Now we have the following theorem.

Theorem 8.2.1. Show that

V(ȳ) ≥ V(ȳ_st)_P ≥ V(ȳ_st)_N                                      (8.2.34)

where V(ȳ_st)_P and V(ȳ_st)_N denote the variance of the estimator of the population
mean in stratified sampling under proportional and Neyman allocation, respectively,
and V(ȳ) denotes the variance of the usual estimator under SRSWR sampling.

Proof. We know that:

In SRSWR sampling: V(ȳ) = σ_y^2 / n, where σ_y^2 = N^{-1} Σ_{i=1}^N (Y_i - Ȳ)^2.   (8.2.35)
In stratified random sampling:

( a ) Under proportional allocation (ignoring the f.p.c.)

V(ȳ_st)_P = (1/n) Σ_{h=1}^L W_h S_hy^2,                            (8.2.36)

and

( b ) Under Neyman allocation

V(ȳ_st)_N = (1/n) (Σ_{h=1}^L W_h S_hy)^2.                          (8.2.37)

Now we have to show that

V(ȳ) - V(ȳ_st)_P ≥ 0.                                              (8.2.38)

Clearly from (8.2.35) and (8.2.36) it suffices to show that

σ_y^2/n - (1/n) Σ_{h=1}^L W_h S_hy^2 ≥ 0, or σ_y^2 - Σ_{h=1}^L W_h S_hy^2 ≥ 0.   (8.2.39)

Now we have

σ_y^2 = N^{-1} Σ_{i=1}^N (Y_i - Ȳ)^2 = N^{-1} Σ_{h=1}^L Σ_{i=1}^{N_h} (Y_hi - Ȳ)^2
      = N^{-1} Σ_{h=1}^L Σ_{i=1}^{N_h} {(Y_hi - Ȳ_h) + (Ȳ_h - Ȳ)}^2,

which, on expanding (the cross-product term vanishes within each stratum) and
treating (N_h - 1)/N_h ≈ 1, gives

σ_y^2 = Σ_{h=1}^L W_h S_hy^2 + Σ_{h=1}^L W_h (Ȳ_h - Ȳ)^2.          (8.2.40)

Thus we have

σ_y^2 - Σ_{h=1}^L W_h S_hy^2 = Σ_{h=1}^L W_h (Ȳ_h - Ȳ)^2 ≥ 0.      (8.2.41)

This proves that the inequality (8.2.38) holds. If, however, Ȳ_h = Ȳ for all h, then
the two variances are equal.

Now we prove the second part of the inequality (8.2.34), that is,

V(ȳ_st)_P ≥ V(ȳ_st)_N,                                             (8.2.42)

or

(1/n) Σ_{h=1}^L W_h S_hy^2 - (1/n)(Σ_{h=1}^L W_h S_hy)^2 ≥ 0, or Σ_{h=1}^L W_h [S_hy - Σ_{h=1}^L W_h S_hy]^2 ≥ 0.   (8.2.43)

Clearly the inequality (8.2.43) holds because

Σ_{h=1}^L W_h [S_hy - Σ_{h=1}^L W_h S_hy]^2
 = Σ_{h=1}^L W_h {S_hy^2 + (Σ_{h=1}^L W_h S_hy)^2 - 2 S_hy (Σ_{h=1}^L W_h S_hy)}
 = Σ_{h=1}^L W_h S_hy^2 - (Σ_{h=1}^L W_h S_hy)^2 ≥ 0.              (8.2.44)

Thus the inequality (8.2.34) holds. Hence the theorem.
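The ordering (8.2.34) can be demonstrated numerically. Below is a minimal sketch on a toy population of my own invention (three internally homogeneous, well-separated strata, so stratification pays off strongly):

```python
import statistics as st

strata = [[1, 2, 3, 4], [10, 14, 18, 22], [30, 31, 32, 33]]
pop = [y for s in strata for y in s]
N, n = len(pop), 6

mean = sum(pop) / N
sigma2 = sum((y - mean) ** 2 for y in pop) / N      # sigma_y^2, divisor N
W = [len(s) / N for s in strata]
S = [st.stdev(s) for s in strata]                   # S_hy, divisor N_h - 1

V_srswr = sigma2 / n                                        # eq. (8.2.35)
V_prop = sum(w * s ** 2 for w, s in zip(W, S)) / n          # eq. (8.2.36)
V_neyman = sum(w * s for w, s in zip(W, S)) ** 2 / n        # eq. (8.2.37)

assert V_srswr >= V_prop >= V_neyman                        # eq. (8.2.34)
print(round(V_srswr, 3), round(V_prop, 3), round(V_neyman, 3))
# 24.648 1.667 1.111
```

Because the second stratum is much more variable than the other two, the Neyman variance falls strictly below the proportional one; with equal stratum variances the last two values would coincide.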

Theorem 8.2.2. Show that an estimator of the gain in efficiency (GE) owed to
stratification with respect to SRSWOR sampling is

GE = [((N - n)/(Nn)) Ŝ_y^2 - v(ȳ_st)] / v(ȳ_st) × 100%.            (8.2.45)

Proof. We have

v(ȳ_st) = Σ_{h=1}^L W_h^2 ((1 - f_h)/n_h) s_hy^2,                  (8.2.46)

and

V(ȳ) = ((1 - f)/n) S_y^2, where S_y^2 = (N - 1)^{-1} (Σ_{i=1}^N Y_i^2 - N Ȳ^2).   (8.2.47)

To find an estimator of V(ȳ), we need an estimator of S_y^2, which is given by

Ŝ_y^2 = (N - 1)^{-1} [Σ_{h=1}^L (N_h/n_h) Σ_{i=1}^{n_h} y_hi^2 - N {ȳ_st^2 - v(ȳ_st)}].   (8.2.48)

Note that

E[Σ_{h=1}^L (N_h/n_h) Σ_{i=1}^{n_h} y_hi^2] = Σ_{h=1}^L N_h E[(1/n_h) Σ_{i=1}^{n_h} y_hi^2]
 = Σ_{h=1}^L N_h (1/N_h) Σ_{i=1}^{N_h} Y_hi^2 = Σ_{h=1}^L Σ_{i=1}^{N_h} Y_hi^2

and

E[ȳ_st^2 - v(ȳ_st)] = E(ȳ_st^2) - E{v(ȳ_st)} = V(ȳ_st) + {E(ȳ_st)}^2 - V(ȳ_st) = Ȳ^2.   (8.2.49)

The percentage gain in efficiency (GE) owed to stratification can be defined as

GE = {V(ȳ) - V(ȳ_st)} / V(ȳ_st) × 100%.                            (8.2.50)

Therefore the theorem follows by using the method of moments.

Example 8.2.5. In Example 8.1.2 find an estimate of the percentage gain in
efficiency (GE) owed to stratification.
Solution. We have v(ȳ_st) = 0.015905 and v(ȳ) = ((N - n)/(Nn)) Ŝ_y^2 = 0.024405. Thus
an estimate of the percentage gain in efficiency (GE) is given by

GE = [((N - n)/(Nn)) Ŝ_y^2 - v(ȳ_st)] / v(ȳ_st) × 100
   = {(0.024405 - 0.015905)/0.015905} × 100 = 53.44%.              (8.2.51)
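The arithmetic of (8.2.51) is a one-liner. A minimal sketch (function name mine), fed with the two variance estimates of Example 8.2.5:

```python
def gain_in_efficiency(v_srswor, v_st):
    # eq. (8.2.51): estimated percent gain owed to stratification
    return (v_srswor - v_st) / v_st * 100

print(round(gain_in_efficiency(0.024405, 0.015905), 2))   # 53.44
```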

Reddy (1978b) has considered the case of a finite population of size N divided into
L strata of sizes N_h, h = 1, 2, ..., L. Let Y_hi denote the value of the variable Y for
the i-th unit in the h-th stratum. For estimating the population total Y = Σ_{h=1}^L Σ_{i=1}^{N_h} Y_hi
it is shown in several books (e.g., Cochran, 1963) that stratified random sampling with
proportional allocation is superior to unstratified random sampling provided the
finite population correction factor (f.p.c.) in each stratum is ignored. Reddy (1978b)
has shown that the above result is true even without ignoring the f.p.c. under
proportional allocation with a superpopulation model approach, if the variable Y
satisfies the following condition:

max_i (Y_hi) ≤ min_i (Y_{h+1,i}) for h = 1, 2, ..., L - 1.         (8.2.52)

Example 8.2.6. In a circus there are three types of elephants, viz., Light, Medium,
and Heavy in weight, and some information about them is listed in the following
table (the row labels follow from how each quantity is used in the solution below):

                                        Light   Medium  Heavy
Number of elephants, N_h                250     150     100
Sample size, n_h                        20      12      8
Sample mean weight, ȳ_h                 1800    3200    5200
Sample mean diet, x̄_h                   95      155     244.8
Population mean diet, X̄_h               100     150     250
Sample std. deviation of weight, s_hy   490     410     220
Sample std. deviation of diet, s_hx     45      25      30
Sample covariance, s_hxy                17200   8000    5028
( a ) How many elephants, N, are available in the circus?
( b ) Find W_1, W_2 and W_3.
( c ) Select an SRSWOR sample of n = 40 elephants using proportional allocation.
( d ) Estimate the average weight, Ȳ, of all elephants in the circus using the usual
estimator in stratified sampling.

( e ) Estimate the variance of ȳ_st.
( f ) Construct the 95% CI estimate using the usual estimator in stratified sampling.
( g ) Find the average diet of all the elephants in the circus.
( h ) Find the pooled sample mean ȳ from all the three strata.
( i ) Find the pooled sample mean x̄ from all the three strata.
( j ) Find the pooled sample variance s_y^2 from all the three strata.
( k ) Find the pooled sample variance s_x^2 from all the three strata.
( l ) Find the pooled sample covariance s_xy from all the three strata.
( m ) Apply the ratio estimator to estimate the average weight of the elephants.
( n ) Construct the 95% confidence interval estimate using the ratio estimator.
( o ) Compare the interval estimates.
Solution: ( a ) Total number of elephants
N = Σ_{h=1}^{L} N_h = 250 + 150 + 100 = 500.
( b ) Strata weights are given by
W_1 = N_1/N = 250/500 = 0.50, W_2 = N_2/N = 150/500 = 0.30, and W_3 = N_3/N = 100/500 = 0.20.
( c ) Proportional allocation of n = 40 units gives
n_1 = nW_1 = 40×0.50 = 20, n_2 = nW_2 = 40×0.30 = 12, and n_3 = nW_3 = 40×0.20 = 8.

( d ) Estimate of the average weight of all the elephants using stratified sampling is
ȳ_st = Σ_{h=1}^{L} W_h ȳ_h = 0.50×1800 + 0.30×3200 + 0.20×5200 = 2900.
( e ) Estimate of the variance of ȳ_st is given by
v(ȳ_st) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s²_hy
= 0.5² × ((1 − 0.08)/20) × 490² + 0.3² × ((1 − 0.08)/12) × 410² + 0.2² × ((1 − 0.08)/8) × 220²
= 4143.68.
Note that we used f_1 = n_1/N_1 = 20/250 = 0.08, f_2 = n_2/N_2 = 12/150 = 0.08, and
f_3 = n_3/N_3 = 8/100 = 0.08.

( f ) Using Table 2 from the Appendix, the 95% confidence interval estimate using
stratified sampling is given by
ȳ_st ± t_{α/2}(df = n − L)√v(ȳ_st), or 2900 ± t_{α/2}(df = 40 − 3)√4143.68,
or 2900 ± 2.026√4143.68, or [2769.72, 3030.27].
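The arithmetic in parts (d)–(f) can be checked with a short Python sketch (illustrative only; the stratum summaries and the tabled value t = 2.026 come from the example above, and the interval endpoints may differ from the text in the last decimal because the text rounds intermediate values):

```python
import math

# Stratum summaries from the elephant example (h = 1, 2, 3)
W = [0.50, 0.30, 0.20]        # stratum weights W_h = N_h / N
n = [20, 12, 8]               # sample sizes n_h
f = [0.08, 0.08, 0.08]        # sampling fractions f_h = n_h / N_h
ybar = [1800, 3200, 5200]     # stratum sample mean weights
s_y = [490, 410, 220]         # stratum sample standard deviations

# Stratified estimate of the mean: y_st = sum of W_h * ybar_h
y_st = sum(Wh * yh for Wh, yh in zip(W, ybar))

# Variance estimate: v(y_st) = sum of W_h^2 * (1 - f_h)/n_h * s_hy^2
v_st = sum(Wh**2 * (1 - fh) / nh * sh**2
           for Wh, fh, nh, sh in zip(W, f, n, s_y))

# 95% CI with df = n - L = 40 - 3 = 37, t = 2.026 (Table 2)
t = 2.026
half = t * math.sqrt(v_st)
ci = (y_st - half, y_st + half)
print(y_st, round(v_st, 2), [round(c, 2) for c in ci])
```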

( g ) The average diet of all the elephants in the circus is
X̄ = Σ_{h=1}^{L} W_h X̄_h = 0.50×100 + 0.30×150 + 0.20×250 = 145.
Chapter 8: Stratified and Post-Stratified Sampling 675

( h ) The pooled sample mean ȳ from all strata is given by
ȳ = (n_1ȳ_1 + n_2ȳ_2 + n_3ȳ_3)/n = (20×1800 + 12×3200 + 8×5200)/40 = 2900.
( i ) The pooled sample mean x̄ from all strata is given by
x̄ = (n_1x̄_1 + n_2x̄_2 + n_3x̄_3)/n = (20×95 + 12×155 + 8×244.8)/40 = 142.96.
( j ) The pooled sample variance s²_y is given by
s²_y = [Σ_{h=1}^{L}(n_h − 1)s²_hy + Σ_{h=1}^{L} n_h(ȳ_h − ȳ)²]/(n − 1)
= [(20−1)490² + (12−1)410² + (8−1)220² + 20(1800−2900)² + 12(3200−2900)² + 8(5200−2900)²]/(40−1)
= 1906405.128.
( k ) The pooled sample variance s²_x is given by
s²_x = [Σ_{h=1}^{L}(n_h − 1)s²_hx + Σ_{h=1}^{L} n_h(x̄_h − x̄)²]/(n − 1)
= [(20−1)45² + (12−1)25² + (8−1)30² + 20(95−142.96)² + 12(155−142.96)² + 8(244.8−142.96)²]/(40−1)
= 4675.996.
( l ) The pooled sample covariance s_xy is given by
s_xy = [Σ_{h=1}^{L}(n_h − 1)s_hxy + Σ_{h=1}^{L} n_h(x̄_h − x̄)(ȳ_h − ȳ)]/(n − 1)
= (1/(40−1))[(20−1)×17200 + (12−1)×8000 + (8−1)×5028
+ 20(1800−2900)(95−142.96) + 12(3200−2900)(155−142.96)
+ 8(5200−2900)(244.8−142.96)]
= 87751.692.
( m ) The ratio estimate of the average weight of elephants is given by
ȳ_R = ȳ(X̄/x̄) = 2900(145.00/142.96) = 2941.38 kg.
( n ) Confidence interval estimate:
Note that r = ȳ/x̄ = 2900/142.96 = 20.285 and f = n/N = 40/500 = 0.08.
Thus an estimate of the mean square error of the ratio estimator is given by
MSE(ȳ_R) = ((1 − f)/n)[s²_y + r²s²_x − 2r s_xy]
= ((1 − 0.08)/40)[1906405.128 + (20.285)²(4675.996) − 2(20.285)(87751.692)]
= 6219.281.
Using Table 2 from the Appendix, the 95% confidence interval estimate of the
average weight of elephants using the ratio estimator is given by
ȳ_R ± t_{0.025}(df = 40 − 1)√MSE(ȳ_R), or 2941.38 ± 2.023√6219.281,
or [2781.841, 3100.918].
( o ) Comment on the CI estimates: Although we lose an extra two degrees
of freedom in stratified sampling, the length of the confidence interval estimate
obtained through stratified sampling is still smaller than that of the ratio estimator at the
same level of confidence. Thus we conclude that stratified sampling performs better
than the ratio estimator in this particular situation.
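Parts (h)–(n) can be reproduced with a sketch that pools the stratum summaries and then applies the ratio-estimator MSE formula (the data are the elephant summaries above; small rounding differences from the text are expected because the text rounds r to 20.285):

```python
import math

# Stratum summaries (weight y, diet x) from the elephant example
n = [20, 12, 8]
ybar = [1800, 3200, 5200]; xbar = [95, 155, 244.8]
s2y = [490**2, 410**2, 220**2]; s2x = [45**2, 25**2, 30**2]
sxy = [17200, 8000, 5028]

N_tot, n_tot = 500, sum(n)                               # N = 500, n = 40
yb = sum(nh*y for nh, y in zip(n, ybar)) / n_tot          # pooled mean of y
xb = sum(nh*x for nh, x in zip(n, xbar)) / n_tot          # pooled mean of x

def pooled_var(stats, means, mean):
    """[sum (n_h-1) s_h^2 + sum n_h (mean_h - mean)^2] / (n - 1)."""
    within = sum((nh - 1) * s for nh, s in zip(n, stats))
    between = sum(nh * (m - mean)**2 for nh, m in zip(n, means))
    return (within + between) / (n_tot - 1)

s2y_p = pooled_var(s2y, ybar, yb)
s2x_p = pooled_var(s2x, xbar, xb)
sxy_p = (sum((nh - 1)*s for nh, s in zip(n, sxy))
         + sum(nh*(x - xb)*(y - yb)
               for nh, x, y in zip(n, xbar, ybar))) / (n_tot - 1)

Xbar = 145.0                       # known population mean diet
yR = yb * Xbar / xb                # ratio estimate of the mean weight
r, fpc = yb / xb, 1 - n_tot / N_tot
mse = fpc / n_tot * (s2y_p + r**2 * s2x_p - 2 * r * sxy_p)
print(round(yR, 2), round(mse, 1))
```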

The population of N units is first subdivided into L homogeneous sub-groups
called strata, such that the h-th stratum consists of N_h units, where h = 1, 2, ..., L and
Σ_{h=1}^{L} N_h = N. Assume that from the h-th population stratum consisting of N_h units, a sample
of size n_h is drawn using SRSWOR sampling, such that Σ_{h=1}^{L} n_h = n. Let the i-th
sample unit of the study variable and auxiliary variable in the h-th stratum be denoted
by y_hi and x_hi respectively. Let ȳ_h = n_h⁻¹ Σ_{i=1}^{n_h} y_hi and x̄_h = n_h⁻¹ Σ_{i=1}^{n_h} x_hi
denote the h-th stratum sample means for the study and auxiliary variable. Then for the h-th stratum
let us define
ε_h0 = ȳ_h/Ȳ_h − 1 and ε_h1 = x̄_h/X̄_h − 1,
such that
E(ε_h0) = E(ε_h1) = 0,
and
E(ε²_h0) = ((1 − f_h)/n_h) C²_hy, E(ε²_h1) = ((1 − f_h)/n_h) C²_hx,
E(ε_h0 ε_h1) = ((1 − f_h)/n_h) ρ_hxy C_hx C_hy,
where
Ȳ_h = N_h⁻¹ Σ_{i=1}^{N_h} y_hi, X̄_h = N_h⁻¹ Σ_{i=1}^{N_h} x_hi,
S²_hy = (N_h − 1)⁻¹ Σ_{i=1}^{N_h}(y_hi − Ȳ_h)², S²_hx = (N_h − 1)⁻¹ Σ_{i=1}^{N_h}(x_hi − X̄_h)²,
C²_hy = S²_hy/Ȳ²_h, C²_hx = S²_hx/X̄²_h, ρ_hxy = S_hxy/(S_hx S_hy), and f_h = n_h/N_h.

The separate ratio estimator ȳ_sr of population mean Ȳ is given by
ȳ_sr = Σ_{h=1}^{L} W_h ȳ_h (X̄_h/x̄_h),   (8.3.1.1)
where W_h = N_h/N.
Then we have the following theorems:
Then we have the following theorems:
Theorem 8.3.1.1. The bias in the estimator ȳ_sr, to the first order of approximation, is
B(ȳ_sr) = Σ_{h=1}^{L} W_h ((1 − f_h)/n_h) Ȳ_h [C²_hx − ρ_hxy C_hx C_hy].   (8.3.1.2)
Proof. The estimator ȳ_sr, in terms of ε_h0 and ε_h1, can easily be written as
ȳ_sr = Σ_{h=1}^{L} W_h Ȳ_h (1 + ε_h0)(1 + ε_h1)⁻¹ ≈ Σ_{h=1}^{L} W_h Ȳ_h [1 + ε_h0 − ε_h1 + ε²_h1 − ε_h0 ε_h1 + ...].   (8.3.1.3)
Taking expected values on both sides of (8.3.1.3) we have
E(ȳ_sr) = Σ_{h=1}^{L} W_h Ȳ_h [1 + ((1 − f_h)/n_h)(C²_hx − ρ_hxy C_hx C_hy)].   (8.3.1.4)
Taking the deviation of (8.3.1.4) from the population mean Ȳ = Σ_{h=1}^{L} W_h Ȳ_h we have
(8.3.1.2). Hence the theorem.
Theorem 8.3.1.2. The variance of the separate ratio estimator ȳ_sr, to the first order
of approximation, is
V(ȳ_sr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) Ȳ_h² [C²_hy + C²_hx − 2ρ_hxy C_hx C_hy].   (8.3.1.5)
Proof. Noting that the strata are independent, we have
V(ȳ_sr) = V[Σ_{h=1}^{L} W_h ȳ_h (X̄_h/x̄_h)] = Σ_{h=1}^{L} W_h² V[ȳ_h (X̄_h/x̄_h)].   (8.3.1.6)
Now we have
V[ȳ_h X̄_h/x̄_h] = E[ȳ_h X̄_h/x̄_h − E(ȳ_h X̄_h/x̄_h)]² ≈ E[Ȳ_h(ε_h0 − ε_h1)]²
= Ȳ_h² ((1 − f_h)/n_h)(C²_hy + C²_hx − 2ρ_hxy C_hx C_hy).   (8.3.1.7)
On substituting (8.3.1.7) in (8.3.1.6) we have the theorem.
Theorem 8.3.1.3. An estimator to estimate the variance of the separate ratio
estimator ȳ_sr is given by
v̂_1(ȳ_sr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) [s²_hy + r²_h s²_hx − 2r_h s_hxy],   (8.3.1.8)
where
s_hxy = (n_h − 1)⁻¹ Σ_{i=1}^{n_h} (y_hi − ȳ_h)(x_hi − x̄_h), and r_h = ȳ_h/x̄_h are the estimators of the respective
population parameters in the h-th stratum.
Another estimator to estimate the variance of the separate ratio estimator ȳ_sr is
v̂_2(ȳ_sr) = Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.1.9)
where e_hi = (y_hi − ȳ_h) − r_h(x_hi − x̄_h) denotes the i-th residual term in the h-th stratum.
The third improved estimator of the variance of the separate ratio estimator ȳ_sr,
due to Wu (1985), is
v̂_3(ȳ_sr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) (X̄_h/x̄_h)^{g_h} (1/(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.1.10)
where g_h denotes a suitably chosen constant in the h-th stratum such that the
variance of the estimator v̂_3(ȳ_sr) is minimum. It is shown by Wu (1985) that the
optimum value of g_h leads to an efficient estimator.
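As a concrete illustration, the separate ratio estimator (8.3.1.1) and its variance estimator v̂_1 of (8.3.1.8) can be sketched in Python, here re-using the stratum summaries of the elephant example (Example 8.2.6); the variable names are illustrative only:

```python
# Separate ratio estimator and the variance estimator v1 of (8.3.1.8),
# using the three-stratum summaries of the elephant example.
W = [0.50, 0.30, 0.20]; n = [20, 12, 8]; f = [0.08, 0.08, 0.08]
ybar = [1800, 3200, 5200]; xbar = [95, 155, 244.8]; Xbar = [100, 150, 250]
s2y = [490**2, 410**2, 220**2]; s2x = [45**2, 25**2, 30**2]
sxy = [17200, 8000, 5028]

# y_sr = sum of W_h * ybar_h * (Xbar_h / xbar_h)
y_sr = sum(Wh * yh * Xh / xh
           for Wh, yh, Xh, xh in zip(W, ybar, Xbar, xbar))

# v1(y_sr) = sum of W_h^2 (1-f_h)/n_h [s_hy^2 + r_h^2 s_hx^2 - 2 r_h s_hxy]
v1 = 0.0
for h in range(3):
    r_h = ybar[h] / xbar[h]                      # stratum ratio estimate
    v1 += W[h]**2 * (1 - f[h]) / n[h] * (
        s2y[h] + r_h**2 * s2x[h] - 2 * r_h * sxy[h])
print(round(y_sr, 2), round(v1, 1))
```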

Example 8.3.1. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the yield/hectare and area under the tobacco crop
from the countries selected in the sample. Apply ratio cum product estimator to
estimate the average yield/hectare of the tobacco crop in the world. Assuming that
the total area in each continent under the tobacco crop is known, estimate the
variance of the estimator used. Construct a 95% confidence interval.
Solution. By the method of proportional allocation, the number of units to be
selected from the h-th stratum is given by n_h = nN_h/N. Thus we have

Stratum No.   N_h   n_h   f_h     W_h
1             6     3     0.500   0.0566
2             6     3     0.500   0.0566
3             8     3     0.375   0.0755
4             10    3     0.300   0.0943
5             12    4     0.333   0.1132
6             4     2     0.500   0.0377
7             30    11    0.367   0.2830
8             17    6     0.353   0.1604
9             10    3     0.300   0.0943
10            3     2     0.667   0.0283

Using the Pseudo-Random Number (PRN) Table 1 given in the Appendix we have
the following sample information and some results.

Stratum   Column no. used    Random   Country                Yield/     Area
No.       from PRN Table     no.                             hectare    (hectares)
1         1                  5        Nicaragua              2.03       2240
                             6        Panama                 2.00       1094
                             2        El Salvador            1.79       580
2         2                  4        Jamaica                1.99       1175
                             2        Dominican Rep.         1.51       27050
                             1        Cuba                   0.63       59000
3         3                  2        Belgium-Lux            3.69       320
                             8        Spain                  2.79       15150
                             1        Austria                1.90       105
4         4 and 5            07       Macedonia              1.36       22000
                             08       Poland                 2.34       19100
                             02       Albania                0.64       24000
5         6 and 7            07       Moldova                2.05       18600
                             02       Armenia                0.26       4304
                             10       Turkmenistan           2.36       1100
                             09       Tajikistan             2.48       3228
6         8                  2        Libya                  1.61       900
                             1        Algeria                1.96       2700
7         9 and 10           20       Nigeria                2.10       10000
                             12       Kenya                  2.51       8805
                             30       Zambia                 1.29       4882
                             15       Malawi                 1.22       116700
                             07       Central African Rep.   0.87       750
                             26       Togo                   0.50       4000
                             22       Zimbabwe               2.06       103110
                             01       Angola                 0.99       3950
                             11       Cote d'Ivoire          0.26       10000
                             05       Zaire                  1.11       3700
                             06       Cameroon               1.62       3400
8         11 and 12          15       Thailand               1.33       51500
                             06       Indonesia              0.85       206625
                             02       Burma                  1.22       36000
                             10       Korea, South           2.02       25730
                             11       Laos                   0.75       4000
                             05       China                  1.75       1445000
9         13 and 14          05       Lebanon                1.33       3750
                             07       Syria                  1.15       15000
                             01       Cyprus                 1.50       161
10        15                 3        New Zealand            2.58       600
                             2        Solomon Islands        0.95       100
Thus we have:

Stratum   ȳ_h      x̄_h        X̄_h       r_h       s²_hy    s²_hx           s_hxy      ŷ_h      v_h
1         1.9400   1304.7     3194.5    0.00149   0.0171   722185.3        90.1       0.2688   0.000721
2         1.3767   29075.0    14660.0   4.7E-05   0.4757   839008125.0     -19863.0   0.1545   0.000246
3         2.7933   5191.7     18309.4   0.00054   0.8010   74387858.3      71.3       0.7434   0.026599
4         1.4467   21700.0    14923.5   6.7E-05   0.7281   6070000.0       -2102.0    0.1984   0.000983
5         1.7875   6808.0     5987.8    0.00026   1.0698   63572981.3      391.1      0.1779   0.011204
6         1.7850   1800.0     3450.0    0.00099   0.0613   1620000.0       315.0      0.1291   0.000365
7         1.3209   24481.5    11682.7   5.4E-05   0.4824   1801653230.0    6724.0     0.1784   0.023125
8         1.3200   294809.2   145162.3  4.5E-06   0.2462   3.227E+11       107376.5   0.1042   0.016126
9         1.3267   6303.7     33976.1   0.00021   0.0306   59939890.3      -1304.8    0.0232   0.004414
10        1.7650   350.0      1333.3    0.00504   1.3284   125000.0        407.5      0.1903   0.000053

where
ŷ_h = W_h ȳ_h (X̄_h/x̄_h) if s_hxy > 0, and ŷ_h = W_h ȳ_h (x̄_h/X̄_h) if s_hxy < 0,
and
v_h = W_h² ((1 − f_h)/n_h)[s²_hy + r²_h s²_hx − 2r_h s_hxy] if s_hxy > 0,
v_h = W_h² ((1 − f_h)/n_h)[s²_hy + r²_h s²_hx + 2r_h s_hxy] if s_hxy < 0.

We have to use ratio and product estimators in different strata because the sign of the
correlation between yield/hectare and area under the crop is uncertain. An
estimate of the average yield/hectare of the tobacco crop in the world is given by a
new separate ratio cum product estimator defined as
ȳ_srp = Σ_{h=1}^{L} ŷ_h = 2.1687.

An estimate of the variance of the estimator used for estimation is
v(ȳ_srp) = Σ_{h=1}^{L} v_h = 0.083673.

Using Table 2 from the Appendix the 95% confidence interval is
ȳ_srp ± t_{α/2}(df = n − L − 3)√v(ȳ_srp), or ȳ_srp ± t_{0.025}(df = 40 − 13)√v(ȳ_srp),
or 2.1687 ± 2.052√0.083673, or [1.5751, 2.7623].
Note that in three strata the estimates of correlation were negative, and in the
remaining seven strata the estimates of correlation were positive. Therefore for the
strata where the ratio estimator was used the loss of degrees of freedom was one, and
where the product estimator was used the loss of degrees of freedom was taken as
two.
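The sign rule used above can be written as a small helper function (a sketch; the function name and the toy numbers are hypothetical, not from the book):

```python
def stratum_estimate(W_h, ybar_h, xbar_h, Xbar_h, s_hxy):
    """Ratio-cum-product rule: use the ratio form when the sample
    covariance is positive, and the product form when it is negative."""
    if s_hxy > 0:
        return W_h * ybar_h * Xbar_h / xbar_h   # ratio estimator
    return W_h * ybar_h * xbar_h / Xbar_h       # product estimator

# Illustration with two hypothetical strata, one of each kind
up = stratum_estimate(0.5, 10.0, 4.0, 5.0, s_hxy=+2.0)    # ratio branch
down = stratum_estimate(0.5, 10.0, 4.0, 5.0, s_hxy=-2.0)  # product branch
print(up, down)
```

The overall ratio-cum-product estimate is then the sum of these per-stratum contributions.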

The separate linear regression estimator ȳ_slr of population mean Ȳ is given by
ȳ_slr = Σ_{h=1}^{L} W_h [ȳ_h + β̂_hxy(X̄_h − x̄_h)].   (8.3.2.1)
Then we have the following theorems:

Theorem 8.3.2.1. The bias in the separate regression estimator ȳ_slr, to the first
order of approximation, is
B(ȳ_slr) = Σ_{h=1}^{L} W_h ((1 − f_h)/n_h) β_hxy X̄_h C_hx (λ_h03 − λ_h12/ρ_hxy),   (8.3.2.2)
where
λ_hrs = μ_hrs/(μ_h20^{r/2} μ_h02^{s/2}) and μ_hrs = (N_h − 1)⁻¹ Σ_{i=1}^{N_h} (y_hi − Ȳ_h)^r (x_hi − X̄_h)^s
for h = 1, 2, ..., L.
Proof. Follows from the bias expression of the regression estimator in Chapter 3.

Theorem 8.3.2.2. The variance of the separate regression estimator ȳ_slr, to the first
order of approximation, is given by
V(ȳ_slr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy (1 − ρ²_hxy).   (8.3.2.3)
Proof. Noting that the strata are independent, we have
V(ȳ_slr) = V[Σ_{h=1}^{L} W_h {ȳ_h + β_hxy(X̄_h − x̄_h)}] = Σ_{h=1}^{L} W_h² V{ȳ_h + β_hxy(X̄_h − x̄_h)}.   (8.3.2.4)
Applying the concept of the usual linear regression estimator in each stratum, we have
V(ȳ_slr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[S²_hy + β²_hxy S²_hx − 2β_hxy S_hxy] = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy (1 − ρ²_hxy).
Hence the theorem.

Theorem 8.3.2.3. An estimator to estimate the variance of the separate regression
estimator ȳ_slr is
v̂_1(ȳ_slr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[s²_hy + β̂²_hxy s²_hx − 2β̂_hxy s_hxy],   (8.3.2.5)
where β̂_hxy = s_hxy/s²_hx denotes the estimator of the regression coefficient in the h-th
stratum.
Another estimator of the variance of the separate regression estimator ȳ_slr is
v̂_2(ȳ_slr) = Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.2.6)
where e_hi = (y_hi − ȳ_h) − β̂_hxy(x_hi − x̄_h) denotes the i-th residual term in the h-th stratum.
The third improved estimator of the variance of the separate regression estimator
ȳ_slr, proposed by Wu (1985), is given by
v̂_3(ȳ_slr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) (X̄_h/x̄_h)^{g_h} (1/(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.2.7)
where g_h denotes a suitably chosen constant in the h-th stratum such that the
variance of the estimator v̂_3(ȳ_slr) is a minimum. It is shown by Wu (1985) that the
optimum value of g_h leads to an efficient estimator of variance.

Example 8.3.2. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the yield/hectare and area under the tobacco crop
from the countries selected in the sample. Apply the regression estimator to
estimate the average yield/hectare of the tobacco crop in the world. Assuming that
the total area in each continent under the tobacco crop is known, estimate the
variance of the estimator used for estimation purposes. Construct a 95% confidence
interval.
Solution. Using the information from the previous Example 8.3.1, for the case of a
separate regression estimator we derive the following table.

Stratum   s²_hy    s²_hx           s_hxy       β̂_hxy      ρ̂_hxy       ŷ_h        v_h
1         0.0171   722185.3        90.12       0.000125   0.810959    0.1232     0.0000031
2         0.4757   839008125.0     -19863.00   -0.000024  -0.994250   0.0975     0.0000029
3         0.8010   74387858.0      71.32       0.000000   0.009239    0.2118     0.0009500
4         0.7281   6070000.0       -2102.00    -0.000350  -0.999870   0.357864   0.0000003
5         1.0698   63572981.0      391.09      0.000006   0.047423    0.201787   0.0022800
6         0.0613   1620000.0       315.00      0.000194   0.999592    0.079465   0.0000000
7         0.4824   1801653230.0    6724.01     0.000003   0.228080    0.360304   0.0021070
8         0.2462   3.23E+11        107376.40   0.000000   0.424896    0.020372   0.0005840
9         0.0306   59939890.0      -1304.80    -0.000022  -0.146320   0.068331   0.0000045
10        1.3284   125000.0        407.50      0.003260   1.000000    0.140679   0.0000000
Sum                                                                   1.6609     0.0059316

where
ŷ_h = W_h[ȳ_h + β̂_hxy(X̄_h − x̄_h)] and v_h = W_h² ((1 − f_h)/n_h) s²_hy [1 − ρ̂²_hxy].
Thus an estimate of the average yield/hectare of the tobacco crop in the world is
given by
ȳ_slr = Σ_{h=1}^{L} ŷ_h = 1.6609.
An estimate of the variance of the estimator used for estimation is
v̂(ȳ_slr) = Σ_{h=1}^{L} v_h = 0.0059316.
Using Table 2 from the Appendix the 95% confidence interval estimate is given by
ȳ_slr ± t_{α/2}(df = n − 2L)√v̂(ȳ_slr), or 1.6609 ± 2.086√0.0059316, or [1.5002, 1.8215].
Note that we are using a separate regression estimator in each stratum, so we are
losing two degrees of freedom in each stratum.
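The interval arithmetic above is easy to verify (a sketch; the point estimate, variance estimate, and t-value are taken from the example, with df = n − 2L = 40 − 20 = 20):

```python
import math

# 95% CI for the separate regression estimate of Example 8.3.2
y_slr, v_slr = 1.6609, 0.0059316
t = 2.086                        # t_{0.025} with 20 df (Table 2)
half = t * math.sqrt(v_slr)      # half-width of the interval
lo, hi = y_slr - half, y_slr + half
print(round(lo, 4), round(hi, 4))
```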

Sometimes it is not possible to know the population means X̄_h, h = 1, 2, ..., L of the
auxiliary variable in each stratum, but the combined population mean,
X̄ = Σ_{h=1}^{L} W_h X̄_h, is known. In such situations it is not possible to use separate ratio and
regression type estimators, but we can use a combined ratio or combined regression
estimator.
For deriving the expressions for the bias and variance of the combined ratio or
regression estimator in stratified sampling, let us define
δ_0 = ȳ_st/Ȳ − 1 and δ_1 = x̄_st/X̄ − 1,
where ȳ_st = Σ_{h=1}^{L} W_h ȳ_h and x̄_st = Σ_{h=1}^{L} W_h x̄_h are the unbiased estimators of the population
means Ȳ and X̄ respectively.
Obviously we have
E(δ_0) = E(δ_1) = 0.
Assuming that the strata are independent, we have
E(δ²_0) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy/Ȳ², E(δ²_1) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hx/X̄²,
and
E(δ_0 δ_1) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S_hxy/(X̄Ȳ).

The usual combined ratio estimator ȳ_cr of population mean Ȳ is defined as
ȳ_cr = ȳ_st (X̄/x̄_st).   (8.3.3.1)
Then we have the following theorems.
Theorem 8.3.3.1. The bias in the combined ratio estimator ȳ_cr, to the first order of
approximation, is
B(ȳ_cr) = (R/X̄) Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) {S²_hx − S_hxy/R},   (8.3.3.2)
where R = Ȳ/X̄.
Proof. The estimator Ycr in terms of 80 and 8 1 can be approximately written as

(8.3.3.3)

Then we have

B(Ycr
- ) = E (-)
Ycr - -y = -[
y 1 -=z IW
L
X h=1
hZ(I-fhJ
- - ShxZ-~
nh
1 IWL
X Y h=1
hZ(l-fhJ
- - Shxy]
nh
which on simplification reduces to (8.3.3.2). Hence the theorem .

Theorem 8.3.3.2. The variance of the combined ratio estimator ȳ_cr, to the first
order of approximation, is
V(ȳ_cr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[S²_hy + R²S²_hx − 2R S_hxy].   (8.3.3.4)
Proof. By the definition of variance we have
V(ȳ_cr) = E[ȳ_cr − E(ȳ_cr)]² ≈ E[Ȳ(1 + δ_0 − δ_1 + δ²_1 − δ_0δ_1 + ...) − Ȳ]²
= Ȳ² E[δ²_0 + δ²_1 − 2δ_0δ_1]
= Ȳ²[(1/Ȳ²) Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy + (1/X̄²) Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hx
− (2/(X̄Ȳ)) Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S_hxy]
= Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[S²_hy + R²S²_hx − 2R S_hxy].
Hence the theorem.

Theorem 8.3.3.3. An estimator to estimate the variance of the combined ratio
estimator ȳ_cr is given by
v̂_1(ȳ_cr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[s²_hy + r²s²_hx − 2r s_hxy],   (8.3.3.5)
where r = ȳ_st/x̄_st denotes the estimator of the population ratio R = Ȳ/X̄ across the L
strata. Another estimator of the variance of the combined ratio estimator ȳ_cr is
v̂_2(ȳ_cr) = Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.3.6)
where e_hi = (y_hi − ȳ_h) − r(x_hi − x̄_h) denotes the i-th residual term in the h-th stratum, but
using r obtained from all strata. The third improved estimator of the variance of
the combined ratio estimator ȳ_cr, due to Wu (1985), is given by
v̂_3(ȳ_cr) = (X̄/x̄_st)^g Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.3.7)
where g denotes a suitably chosen constant such that the variance of the
estimator v̂_3(ȳ_cr) is minimum. It has been shown by Wu (1985) that the optimum
value of g leads to an efficient estimator of variance.

Example 8.3.3. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the production and area of the tobacco crop from
the countries selected in the sample. Apply the combined ratio estimator to estimate
the average production of the tobacco crop in the world. Assuming that the total
area in the world under the tobacco crop is known, estimate the variance of the
estimator used. Construct a 95% confidence interval.
Given: Total area is 3,650,492.66 hectares.
Solution. By the method of proportional allocation, the number of units to be
selected from the h-th stratum is given by n_h = nN_h/N. Thus we have

Stratum No.   N_h   n_h   f_h     W_h
1             6     3     0.500   0.0566
2             6     3     0.500   0.0566
3             8     3     0.375   0.0755
4             10    3     0.300   0.0943
5             12    4     0.333   0.1132
6             4     2     0.500   0.0377
7             30    11    0.367   0.2830
8             17    6     0.353   0.1604
9             10    3     0.300   0.0943
10            3     2     0.667   0.0283
Sum           106   40            1.0000

Using the Pseudo-Random Number (PRN) Table 1 from the Appendix we have the
following sample information and some results.

St.   Column no. used    Random   Country                Production     Area
No.   from PRN Table     no.                             (Metric tons)  (Hectares)
1     1                  5        Nicaragua              4550           2240
                         6        Panama                 2188           1094
                         2        El Salvador            1038           580
2     2                  4        Jamaica                2339           1175
                         2        Dominican Rep.         40950          27050
                         1        Cuba                   37000          59000
3     3                  2        Belgium-Lux            1800           320
                         8        Spain                  42300          15150
                         1        Austria                199            105
4     4 and 5            07       Macedonia              30000          22000
                         08       Poland                 44700          19100
                         02       Albania                15000          24000
5     6 and 7            07       Moldova                38150          18600
                         02       Armenia                1100           4304
                         10       Turkmenistan           2600           1100
                         09       Tajikistan             8000           3228
6     8                  2        Libya                  1450           900
                         1        Algeria                5300           2700
7     9 and 10           20       Nigeria                21000          10000
                         12       Kenya                  22120          8805
                         30       Zambia                 6300           4882
                         15       Malawi                 142300         116700
                         07       Central African Rep.   650            750
                         26       Togo                   2000           4000
                         22       Zimbabwe               212050         103110
                         01       Angola                 3900           3950
                         11       Cote d'Ivoire          2600           10000
                         05       Zaire                  4110           3700
                         06       Cameroon               5500           3400
8     11 and 12          15       Thailand               68600          51500
                         06       Indonesia              175631         206625
                         02       Burma                  44000          36000
                         10       Korea, South           25000          25730
                         11       Laos                   3000           4000
                         05       China                  2524500        1445000
9     13 and 14          05       Lebanon                5000           3750
                         07       Syria                  17200          15000
                         01       Cyprus                 241            161
10    15                 3        New Zealand            1550           600
                         2        Solomon Islands        95             100

Thus we have the following two tables:

Stratum   ȳ_h         x̄_h         s²_hy           s²_hx           s_hxy
1         2592.00     1304.67     3205948.0       722185.3        1521312.0
2         26763.00    29075.00    451299457.0     839008125.0     479521575.0
3         14766.33    5191.67     569217900.3     74387858.3      205728126.7
4         29900.00    21700.00    220530000.0     6070000.0       -36360000.0
5         12462.50    6808.00     302045625.0     63572981.3      134543200.0
6         3375.00     1800.00     7411250.0       1620000.0       3465000.0
7         38411.82    24481.54    5002786596.0    1801653230.0    2864602080.0
8         473455.20   294809.20   1.013E+12       3.23E+11        5.7049E+11
9         7480.33     6303.67     76515960.3      59939890.3      67664108.7
10        822.50      350.00      1058512.5       125000.0        363750.0

Stratum   W_h ȳ_h      W_h x̄_h       v_h
1         146.7170     73.84925      109.08364
2         1514.8870    1645.75500    546967.57000
3         1114.4400    391.82420     125156.94000
4         2820.7550    2047.17000    727333.48000
5         1410.8490    770.71700     76564.22600
6         127.3585     67.92453      182.52455
7         10870.5400   6928.27000    2065671.30000
8         75931.4900   47280.72000   47255953.90000
9         705.6915     594.68580     25195.84800
10        23.2783      9.90566       29.68217
Sum       94666.01     59810.82      50823164.5

From the above table we have
ȳ_st = Σ_{h=1}^{L} W_h ȳ_h = 94666.01, x̄_st = Σ_{h=1}^{L} W_h x̄_h = 59810.82, and r = ȳ_st/x̄_st = 1.582757.
We are given the average area under the tobacco crop in the world, X̄ = 34438.61.
Thus an estimate of the average production of the tobacco crop based on the combined ratio
estimator is given by
ȳ_cr = ȳ_st (X̄/x̄_st) = 94666.01 × 34438.61/59810.82 = 54507.96,
and an estimator of its variance is given by

v̂(ȳ_cr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[s²_hy + r²s²_hx − 2r s_hxy] = Σ_{h=1}^{L} v_h = 50823164.5.
Using Table 2 from the Appendix the 95% confidence interval of the average
production of the tobacco crop in the world is given by
ȳ_cr ± t_{α/2}(df = n − 1)√v̂(ȳ_cr), or 54507.96 ± 2.023√50823164.5,
or [40085.92, 68930.00].
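The combined ratio computation can be reproduced from the three summary quantities (a sketch; the values and the t-value 2.023 come from the example above):

```python
import math

# Combined ratio estimate of Example 8.3.3
y_st, x_st = 94666.01, 59810.82   # stratified sample means
X_bar = 34438.61                  # known mean area (3650492.66 / 106)
v_cr = 50823164.5                 # variance estimate from the table

y_cr = y_st * X_bar / x_st        # combined ratio estimate
t = 2.023                         # t_{0.025}, df = n - 1 = 39
half = t * math.sqrt(v_cr)
print(round(y_cr, 2), round(y_cr - half, 2), round(y_cr + half, 2))
```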

The usual combined linear regression estimator ȳ_clr of population mean Ȳ is
defined as
ȳ_clr = ȳ_st + β̂(X̄ − x̄_st),   (8.3.4.1)
where β̂ = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s_hxy / Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s²_hx. Then we have the following
theorems:
Theorem 8.3.4.1. The asymptotic variance of the combined regression estimator
ȳ_clr is
V(ȳ_clr) = V(ȳ_st)(1 − ρ²_xy),   (8.3.4.2)
where ρ_xy = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S_hxy / √[(Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hx)(Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy)] denotes the
correlation coefficient in stratified sampling across all strata.
Proof. We have
V(ȳ_clr) = V[ȳ_st + β̂(X̄ − x̄_st)] ≈ V(ȳ_st) + β²V(x̄_st) − 2β Cov(ȳ_st, x̄_st),   (8.3.4.3)
where β = Cov(ȳ_st, x̄_st)/V(x̄_st). Hence the theorem.
Theorem 8.3.4.2. An estimator to estimate the variance of the combined regression
estimator ȳ_clr is
v̂_1(ȳ_clr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[s²_hy + β̂²s²_hx − 2β̂ s_hxy],   (8.3.4.4)
where β̂ = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s_hxy / Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s²_hx denotes the estimator of the
population regression coefficient β.
Another estimator of the variance of the combined regression estimator ȳ_clr is
v̂_2(ȳ_clr) = Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.4.5)

where e_hi = (y_hi − ȳ_h) − β̂(x_hi − x̄_h) denotes the i-th residual term in the h-th stratum, but
using β̂ obtained from all strata.
The third estimator of the variance of the combined regression estimator ȳ_clr, due
to Wu (1985), is
v̂_3(ȳ_clr) = (X̄/x̄_st)^g Σ_{h=1}^{L} W_h² (1 − f_h)/(n_h(n_h − 1)) Σ_{i=1}^{n_h} e²_hi,   (8.3.4.6)
where g denotes a suitably chosen constant such that the variance of the estimator
v̂_3(ȳ_clr) is minimum. It has been shown by Wu (1985) that the optimum value of g
leads to an efficient estimator of variance.

Example 8.3.4. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the production and area of the tobacco crop from
the countries selected in the sample. Apply the combined regression estimator to
estimate the average production of the tobacco crop in the world. Assuming that the
total area in the whole world under the tobacco crop is known, estimate the variance
of the estimator used. Construct a 95% confidence interval.
Given: Total area is 3,650,492.66 hectares.
Solution. Continuing with the information from Example 8.3.3, we have

Stratum   D_1h           D_2h           W_h ȳ_h    W_h x̄_h    v_h
1         812.4          385.6          146.72     73.85      45.88
2         256063.5       448028.5       1514.88    1645.76    731844.21
3         244129.7       88273.2        1114.44    391.82     88840.02
4         -75507.3       12605.3        2820.75    2047.17    763600.10
5         287383.1       135791.3       1410.85    770.717    53810.33
6         1233.5         576.7          127.36     67.92      81.01
7         13202257.5     8303383.5      10870.54   6928.27    2441804.40
8         1582437849.0   895944582.9    75931.49   47280.72   14947955.11
9         140515.2       124474.7       705.69     594.68     50504.06
10        48.5           16.7           23.28      9.91       21.91
Sum       1596494785.0   905058118.7    94666.01   59810.82   19078507.03

where D_1h = W_h² ((1 − f_h)/n_h) s_hxy and D_2h = W_h² ((1 − f_h)/n_h) s²_hx.

From the above table we have
ȳ_st = Σ_{h=1}^{L} W_h ȳ_h = 94666.01, x̄_st = Σ_{h=1}^{L} W_h x̄_h = 59810.82,
and
β̂ = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s_hxy / Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) s²_hx = 1596494785/905058118.7 = 1.7639.
We are given the average area under the tobacco crop in the world, X̄ = 34438.61.

Thus an estimate of the average production of the tobacco crop based on the combined
regression estimator is given by
ȳ_clr = ȳ_st + β̂(X̄ − x̄_st) = 94666.01 + 1.7639(34438.61 − 59810.82) = 49911.96,
and an estimate of its variance is given by
v̂(ȳ_clr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h)[s²_hy + β̂²s²_hx − 2β̂ s_hxy] = 19078507.03.
The (1 − α)100% confidence interval of the average production of the tobacco crop in
the world is given by
ȳ_clr ± t_{α/2}(df = n − 2)√v̂(ȳ_clr).
Using Table 2 from the Appendix the required 95% confidence interval estimate is
49911.96 ± 2.024√19078507.03, or [41071.34, 58752.57].
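The pooled slope and the combined regression estimate follow directly from the column totals (a sketch; carrying β̂ at full precision gives a point estimate a couple of units below the 49911.96 obtained with the rounded β̂ = 1.7639):

```python
# Combined regression estimate of Example 8.3.4
y_st, x_st, X_bar = 94666.01, 59810.82, 34438.61
num, den = 1596494785.0, 905058118.7   # sums of D_1h and D_2h from the table

beta = num / den                        # pooled slope estimate
y_clr = y_st + beta * (X_bar - x_st)    # combined regression estimate
print(round(beta, 4), round(y_clr, 2))
```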

Example 8.3.5. A private company ABC selects a sample of 40 countries, to
estimate the average production of the tobacco crop in the world, from population 5
using the method of proportional allocation as given in the following table.

Stratum   1   2   3   4    5    6   7    8    9    10   Total
N_h       6   6   8   10   12   4   30   17   10   3    106
n_h       3   3   3   3    4    2   11   6    3    2    40

( a ) Using full information from the description of the population given in the
Appendix, find the relative efficiency of the separate ratio estimator with respect to
the combined ratio estimator while estimating average production using the known area
under the crop.
( b ) Using full information from the description of the population given in the
Appendix, find the relative efficiency of the separate regression estimator with respect
to the combined regression estimator while estimating average production using the known
area under the crop.
Solution. From the description of the population 5 given in the Appendix we have

Stratum   f_h    W_h     Ȳ_h         X̄_h         R_h     S²_hy           S²_hx           S_hxy
1         0.50   0.057   6515.83     3194.50     2.040   51856391.4      10899652.8      23619714.4
2         0.50   0.057   13545.66    14660.00    0.924   390206386.6     584984730.0     420480877.0
3         0.38   0.075   43629.38    18309.37    2.383   3172330153.0    635958094.9     1387271318.0
4         0.30   0.094   21928.10    14923.50    1.469   569396673.9     209817189.2     319262930.2
5         0.33   0.113   11788.00    5987.83     1.969   153854582.5     27842810.4      62846173.1
6         0.50   0.038   4653.00     3450.00     1.349   7232769.3       5876666.6       6066866.6
7         0.37   0.283   16862.27    11682.73    1.443   2049296094.0    760238523.4     1190767859.0
8         0.35   0.160   227371.50   145162.30   1.566   3.72E+11        1.24E+11        2.14E+11
9         0.30   0.094   32854.10    33976.10    0.967   6802998890.0    8340765245.0    7529100326.0
10        0.67   0.028   3548.33     13333.33    0.266   22819758.3      2963333.3       8223083.3

( a ) Separate ratio and combined ratio estimators: From the above table we
have
R = Σ_{h=1}^{L} W_h Ȳ_h / Σ_{h=1}^{L} W_h X̄_h = 52444.56/34438.61 = 1.522838.
Let
V_h = W_h² ((1 − f_h)/n_h)[S²_hy + R²_h S²_hx − 2R_h S_hxy]
and
V_h(c) = W_h² ((1 − f_h)/n_h)[S²_hy + R²S²_hx − 2R S_hxy],
then we have the following table

Stratum   V_h           V_h(c)
1         453.24        2774.25
2         60128.59      248921.34
3         204079.04     500736.86
4         174795.28     173606.04
5         30581.64      57702.68
6         554.50        848.41
7         902438.84     856060.71
8         16174803.72   21597997.70
9         85250.55      6674800.88
10        2490.15       620.38

Thus the variance of the separate ratio estimator is given by
V(ȳ_sr) = Σ_{h=1}^{L} V_h = 17635575.55,
and that of the combined ratio estimator is given by
V(ȳ_cr) = Σ_{h=1}^{L} V_h(c) = 30114069.26.
Hence the relative efficiency of the separate ratio estimator with respect to the
combined ratio estimator is:
RE = [V(ȳ_cr)/V(ȳ_sr)] × 100 = (30114069.26/17635575.55) × 100 = 170.77%.
This particular example shows that the separate ratio estimator is better than the
combined ratio estimator in such situations.
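The two variance totals and the relative efficiency can be recomputed from the per-stratum columns above (a sketch; tiny last-digit differences from the printed totals are rounding artifacts):

```python
# Per-stratum variance contributions from the table above
V_h = [453.24, 60128.59, 204079.04, 174795.28, 30581.64,
       554.50, 902438.84, 16174803.72, 85250.55, 2490.15]
V_hc = [2774.25, 248921.34, 500736.86, 173606.04, 57702.68,
        848.41, 856060.71, 21597997.70, 6674800.88, 620.38]

V_sr, V_cr = sum(V_h), sum(V_hc)   # separate vs combined ratio variances
RE = V_cr / V_sr * 100             # relative efficiency in percent
print(round(V_sr, 2), round(V_cr, 2), round(RE, 2))
```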

( b ) Separate regression and combined regression estimators: Let us first


calculate the combined correlation coefficient across all strata to find the variance
of the combined regression estimator.

Stratum   W_h²((1−f_h)/n_h)S²_hy   W_h²((1−f_h)/n_h)S²_hx   W_h²((1−f_h)/n_h)S_hxy
1         27691.20                 5820.39                  12612.88
2         208369.37                312380.59                224535.89
3         3764483.39               754667.25                1646222.05
4         1182442.36               435718.62                663000.03
5         328632.07                59472.01                 134238.89
6         2574.86                  2092.08                  2159.79
7         9450943.18               3506067.82               5491582.89
8         1031861873.00            343953957.50             593597958.90
9         14127504.52              17320919.88              15635369.14
10        3046.43                  395.60                   1097.78
Sum       1060957560.00            366351491.80             617408778.30
We know that the correlation coefficient ρ_xy across all strata is given by
ρ_xy = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S_hxy / √[(Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy)(Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hx)]
= 617408778.3/(√1060957560 √366351491.8) = 0.9903.
Thus the variance of the combined regression estimator is given by
V(ȳ_clr) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy (1 − ρ²_xy) = 1060957560 × (1 − 0.9903²) = 20482751.17.
Let
V_h(l) = W_h² ((1 − f_h)/n_h) S²_hy (1 − ρ²_hxy), with ρ_hxy = S_hxy/(S_hx S_hy),
then we have the following table:

Stratum   S²_hy           S²_hx           S_hxy           ρ_hxy     V_h(l)
1         51856391.4      10899652.8      23619714.4      0.99350   358.90
2         390206386.6     584984730.0     420480877.0     0.88009   46975.34
3         3172330153.0    635958094.9     1387271318.0    0.97669   173434.47
4         569396673.9     209817189.2     319262930.2     0.92368   173605.41
5         153854582.5     27842810.4      62846173.1      0.96021   25631.04
6         7232769.3       5876666.7       6066866.7       0.93056   345.16
7         2049296094.0    760238523.4     1190767859.0    0.95400   849431.67
8         3.72E+11        1.24E+11        2.14E+11        0.99639   7426685.35
9         6802998890.0    8340765245.0    7529100326.0    0.99952   13660.12
10        22819758.3      2963333.3       8223083.3       0.99997   0.16

Thus the variance of the separate linear regression estimator is given by
V(ȳ_slr) = Σ_{h=1}^{L} V_h(l) = Σ_{h=1}^{L} W_h² ((1 − f_h)/n_h) S²_hy (1 − ρ²_hxy) = 8710127.62.
Thus the relative efficiency of the separate linear regression estimator with respect to the
combined linear regression estimator is given by
RE = [V(ȳ_clr)/V(ȳ_slr)] × 100% = (20482751.17/8710127.618) × 100% = 235.16%.
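The correlation, the combined-regression variance, and the relative efficiency can be recomputed from the column totals above (a sketch; carrying ρ_xy at full precision gives RE near 234.7% rather than 235.16%, because the book rounds ρ_xy to 0.9903 before squaring):

```python
import math

# Column totals from the tables above (part b)
sum_y = 1060957560.0    # sum of W_h^2 (1-f_h)/n_h * S_hy^2
sum_x = 366351491.8     # sum of W_h^2 (1-f_h)/n_h * S_hx^2
sum_xy = 617408778.3    # sum of W_h^2 (1-f_h)/n_h * S_hxy
V_slr = 8710127.62      # sum of the V_h(l) column

rho = sum_xy / math.sqrt(sum_y * sum_x)   # combined correlation coefficient
V_clr = sum_y * (1 - rho**2)              # variance of combined regression
RE = V_clr / V_slr * 100                  # relative efficiency in percent
print(round(rho, 4), round(V_clr, 2), round(RE, 2))
```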
Thus separate ratio and regression estimators remain better than combined ratio and
regression estimators in this situation. If the population means are known for each
stratum, then one should go for the separate ratio or regression estimator. If only the
overall population mean of the auxiliary variable is known, then we do not have a
choice and we have to go for the combined ratio or regression estimators.

When we stratify the given population into L homogeneous strata or groups each
with N_h units, then for each stratum we estimate the population mean Ȳ_h, h = 1, 2, ..., L,
as shown below:

Stratum No.                 1     2     3     ...   h     ...   L
Stratum Population Mean     Ȳ_1   Ȳ_2   Ȳ_3   ...   Ȳ_h   ...   Ȳ_L
Then we select n_h, h = 1, 2, ..., L units from the h-th stratum using SRSWOR sampling
(or any other sampling scheme). There are five cases:
Case I. When no model or auxiliary information is available, then we estimate the
population mean in each stratum independently, and every time we lose one degree
of freedom. Thus when we consider the simple pooled estimator of the population
mean Ȳ as
ȳ_st = Σ_{h=1}^{L} W_h ȳ_h,   (8.3.5.1)
then the degrees of freedom are df = (n − L).
Case II. When we consider the separate ratio estimator
ȳ_sr = Σ_{h=1}^{L} W_h ȳ_h (X̄_h/x̄_h)   (8.3.5.2)
as an estimator of the population mean Ȳ, then basically we are assuming a linear
relationship between the study and auxiliary variables in the h-th stratum through the
model
y_hi = R_h x_hi + ε_hi,   (8.3.5.3)
where i = 1, 2, ..., N_h and h = 1, 2, ..., L, such that E(ε_hi) = 0, E(ε²_hi) = σ²_h, and
E(ε_hi ε_hj) = 0 for i ≠ j.

Then on setting Σ_{i=1}^{N_h} ε_hi = 0 we are estimating only one ratio R_h = Ȳ_h/X̄_h in the h-th
stratum, and we are losing one degree of freedom in each stratum. Thus for the
separate ratio estimator, the degrees of freedom should again be df = (n − L),
because here we assume that the regression line in each stratum passes through the
origin, as shown in Figure 8.3.1.

Fig. 8.3.1 Separate ratio estimator (scatter plots for Stratum 1, Stratum 2, ..., Stratum L, each fitted line passing through the origin).

Case III. When we consider the separate regression estimator
ȳ_slr = Σ_{h=1}^{L} W_h[ȳ_h + β̂_hxy(X̄_h − x̄_h)]   (8.3.5.4)
as an estimator of the population mean Ȳ, then basically we are assuming a linear
relationship between the study and auxiliary variables in the h-th stratum through the
model
y_hi = α_h + β_h x_hi + ε_hi,   (8.3.5.5)
where i = 1, 2, ..., N_h and h = 1, 2, ..., L, such that E(ε_hi) = 0, E(ε²_hi) = σ²_h, and
E(ε_hi ε_hj) = 0 for i ≠ j. Then we are estimating two parameters in each stratum by
minimizing Σ_{i=1}^{N_h} ε²_hi in the h-th stratum, and hence we are losing two degrees of
freedom in each stratum. Thus for the separate regression estimator the degrees of
freedom should be taken as df = (n − 2L), because here we assume that the
regression line in each stratum does not pass through the origin, and in each stratum
we have two parameters, viz., intercept and slope, as shown in Figure 8.3.2.

Fig. 8.3.2 Separate regression estimator: a separate regression line with intercept
and slope in each of Stratum 1, Stratum 2, ..., Stratum L.
Chapter 8: Stratified and Post-Stratified Sampling 695

Case IV. When we consider the combined ratio estimator
   ȳ_cr = ȳ_st (X̄ / x̄_st)   (8.3.5.6)
as an estimator of the population mean Ȳ, then basically we are assuming a linear
relationship between the study and auxiliary variables across all the L strata
through the model
   y_hi = R x_hi + ε_hi ,   (8.3.5.7)
where i = 1, 2, ..., N_h and h = 1, 2, ..., L such that E(ε_hi) = 0, E(ε²_hi) = σ², and
E(ε_hi ε_hj) = 0 for i ≠ j. Then on setting Σ_{h=1}^{L} Σ_{i=1}^{N_h} ε_hi = 0 we are estimating only one
ratio R = Ȳ/X̄ across all the strata in the population, and we are losing only one
degree of freedom across all strata. Thus for the combined ratio estimator the
degrees of freedom should be taken as df = (n - 1). In practice one can think that if
we arrange all the strata in ascending order on the basis of auxiliary information,
the regression line passes across all the strata as shown in Figure 8.3.3, and
we have to estimate only one parameter across all strata.

Fig. 8.3.3 Combined ratio estimator: a single regression line through the origin
passing across Stratum 1, Stratum 2, ..., Stratum L.

Case V. When we consider the combined regression estimator
   ȳ_clr = ȳ_st + β̂ (X̄ - x̄_st)   (8.3.5.8)
as an estimator of the population mean Ȳ, then basically we are assuming a linear
relationship between the study and auxiliary variables across all the strata through
the model
   y_hi = α + β x_hi + ε_hi ,   (8.3.5.9)
where i = 1, 2, ..., N_h and h = 1, 2, ..., L such that E(ε_hi) = 0, E(ε²_hi) = σ², and
E(ε_hi ε_hj) = 0 for i ≠ j. Then we are estimating two parameters by minimizing
Σ_{h=1}^{L} Σ_{i=1}^{N_h} ε²_hi across all the strata in the population, and hence we are losing two
degrees of freedom across all strata. Thus for the combined regression estimator
the degrees of freedom should be taken as df = (n - 2). In practice one can think that
if we arrange all the strata in ascending order on the basis of auxiliary
information, the regression line passes across all the strata as shown in
Figure 8.3.4, and we have to estimate only two parameters across all strata.

Fig. 8.3.4 Combined regression estimator: a single regression line with intercept
and slope passing across Stratum 1, Stratum 2, ..., Stratum L.
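The five cases can be placed side by side numerically. The sketch below, with purely illustrative data (the stratum weights, known stratum means X̄_h, and sample values are all assumed), computes the pooled, separate ratio, separate regression, combined ratio, and combined regression estimates, with the degrees of freedom n - L, n - L, n - 2L, n - 1, and n - 2 noted alongside; the combined slope b here is a crude pooled OLS fit, one of several choices consistent with model (8.3.5.9).

```python
# Sketch of the five estimators from Cases I-V; all data are illustrative.
y = {1: [10.0, 12.0, 14.0], 2: [20.0, 23.0, 26.0, 29.0]}   # sample y by stratum
x = {1: [5.0, 6.0, 7.0],    2: [10.0, 11.5, 13.0, 14.5]}   # sample x by stratum
W = {1: 0.4, 2: 0.6}                                       # W_h = N_h / N
Xbar_h = {1: 6.2, 2: 12.4}                                 # known stratum means of x
Xbar = sum(W[h] * Xbar_h[h] for h in W)                    # known overall mean of x

def mean(v):
    return sum(v) / len(v)

def slope(ys, xs):
    """Ordinary least-squares slope of y on x."""
    my, mx = mean(ys), mean(xs)
    return sum((a - my) * (b - mx) for a, b in zip(ys, xs)) / \
           sum((b - mx) ** 2 for b in xs)

n = sum(len(v) for v in y.values())
L = len(y)
yb = {h: mean(y[h]) for h in y}
xb = {h: mean(x[h]) for h in x}
xst = sum(W[h] * xb[h] for h in W)

pooled = sum(W[h] * yb[h] for h in W)                             # Case I,  df = n - L
sep_ratio = sum(W[h] * yb[h] * Xbar_h[h] / xb[h] for h in W)      # Case II, df = n - L
sep_reg = sum(W[h] * (yb[h] + slope(y[h], x[h]) * (Xbar_h[h] - xb[h]))
              for h in W)                                         # Case III, df = n - 2L
comb_ratio = pooled * Xbar / xst                                  # Case IV, df = n - 1
b = slope(sum(y.values(), []), sum(x.values(), []))               # crude pooled OLS slope
comb_reg = pooled + b * (Xbar - xst)                              # Case V,  df = n - 2
```

With these exactly proportional toy data the four auxiliary-based estimates coincide; with real data they generally differ, while the degrees-of-freedom comparison stays as stated above.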

Singh, Horn, and Yu (1998) extended the higher level calibration approach to the
stratified sampling design as follows. Assume the population consists of L strata
with N_h units in the hth stratum, from which a simple random sample of size n_h is
taken without replacement, such that the total population size is N = Σ_{h=1}^{L} N_h and the
sample size is n = Σ_{h=1}^{L} n_h. Associated with the ith unit of the hth stratum there are two
values, y_hi and x_hi, with x_hi > 0 being the auxiliary variable. For the hth stratum, let
W_h = N_h/N be the stratum weight, f_h = n_h/N_h the sampling fraction, and ȳ_h, x̄_h;
Ȳ_h, X̄_h the sample and population means, respectively. Assume that X̄ = Σ_{h=1}^{L} W_h X̄_h
is known. The purpose is to estimate the population mean Ȳ = Σ_{h=1}^{L} W_h Ȳ_h by using the
auxiliary information X̄. The usual estimator of the population mean Ȳ is given by
   ȳ_st = Σ_{h=1}^{L} W_h ȳ_h .   (8.4.1)
A new estimator is given by
   ȳ*_st = Σ_{h=1}^{L} W*_h ȳ_h   (8.4.2)
with new weights W*_h which are chosen such that the chi square (CS) type of
distance
   Σ_{h=1}^{L} (W*_h - W_h)² / (W_h q_h)   (8.4.3)
is minimum, subject to the condition
   Σ_{h=1}^{L} W*_h x̄_h = X̄ .   (8.4.4)
Chapter 8: Stratified and Post-Stratified Sampling 697

Note that q_h in (8.4.3) is a suitably chosen weight which determines the form of
the estimator. Minimization of (8.4.3), subject to the calibration equation (8.4.4),
leads to the combined generalized regression estimator (GREG) given by
   ȳ*_st = Σ_{h=1}^{L} W_h ȳ_h + (Σ_{h=1}^{L} W_h q_h x̄_h ȳ_h / Σ_{h=1}^{L} W_h q_h x̄²_h)[X̄ - Σ_{h=1}^{L} W_h x̄_h]   (8.4.5)
for the optimum choice of weights
   W*_h = W_h + (W_h q_h x̄_h / Σ_{h=1}^{L} W_h q_h x̄²_h)(X̄ - Σ_{h=1}^{L} W_h x̄_h) .   (8.4.6)
If q_h = 1/x̄_h then the estimator (8.4.5) reduces to the well known combined ratio
estimator defined earlier in (8.3.3.1). Note that there is no choice of q_h such that
the estimator (8.4.5) reduces to the well known combined linear regression estimator in
stratified sampling as discussed in Section 8.3.4.
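The weight adjustment in (8.4.6) is a one-constraint chi-square calibration and is straightforward to compute. The sketch below, with illustrative stratum summaries, returns the calibrated weights and the GREG estimate of (8.4.5), and the calibration constraint Σ W*_h x̄_h = X̄ then holds by construction; choosing q_h = 1/x̄_h reproduces the combined ratio estimator, as noted above.

```python
# One-constraint chi-square calibration of stratum weights, eqs. (8.4.5)-(8.4.6);
# the stratum summaries below are illustrative, not from the book's data.
def calibrate(W, xbar, ybar, Xbar, q=None):
    """Calibrated weights per (8.4.6) and the GREG estimate per (8.4.5)."""
    L = len(W)
    q = q or [1.0] * L                       # q_h = 1 gives the GREG weights
    den = sum(W[h] * q[h] * xbar[h] ** 2 for h in range(L))
    b = sum(W[h] * q[h] * xbar[h] * ybar[h] for h in range(L)) / den
    gap = Xbar - sum(W[h] * xbar[h] for h in range(L))
    Wnew = [W[h] + W[h] * q[h] * xbar[h] * gap / den for h in range(L)]
    est = sum(W[h] * ybar[h] for h in range(L)) + b * gap
    return Wnew, est

W    = [0.3, 0.5, 0.2]      # original stratum weights (assumed)
xbar = [4.0, 9.0, 15.0]     # stratum sample means of x (assumed)
ybar = [8.0, 19.0, 31.0]    # stratum sample means of y (assumed)
Xbar = 9.0                  # known population mean of x (assumed)
Wg, greg = calibrate(W, xbar, ybar, Xbar)
```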
An estimator of the variance of the combined generalized regression estimator (GREG) is
given by
   v̂(ȳ*_st) = Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] s²_eh ,   (8.4.7)
where s²_eh = (n_h - 1)⁻¹ Σ_{i=1}^{n_h} e²_hi is the hth stratum sample variance and
e_hi = (y_hi - ȳ_h) - b(x_hi - x̄_h) with b = Σ_{h=1}^{L} W_h q_h ȳ_h x̄_h / Σ_{h=1}^{L} W_h q_h x̄²_h .
The lower level calibration approach yields an estimator of the variance of the
combined generalized regression estimator (GREG) as
   v̂(ȳ*_st)_greg = Σ_{h=1}^{L} (D_h W*²_h / W²_h) s²_eh ,   (8.4.8)
where D_h = W²_h (1 - f_h)/n_h and W*_h is given by (8.4.6).
If q_h = 1/x̄_h then (8.4.8) reduces to an estimator given by
   v̂(ȳ_st)_ratio = (X̄/x̄_st)² Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] s²_eh ,   (8.4.9)
which is a special case of a class of estimators for estimating the variance of the
combined ratio estimator given by Wu (1985) as
   v̂(ȳ_st)_w = (X̄/x̄_st)^g Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] s²_eh   (8.4.10)
for g = 2.
The properties of the estimators of the variance of the combined ratio estimator are
also studied by Saxena, Nigam, and Shukla (1995). Using higher level calibration,
a new estimator is given by
   v̂_ho(ȳ_st)_greg = Σ_{h=1}^{L} (Ω_h W*²_h / W²_h) s²_eh ,   (8.4.11)
where the Ω_h are suitably chosen weights such that the chi square distance function
   Σ_{h=1}^{L} (Ω_h - D_h)² (D_h Q_h)⁻¹   (8.4.12)
is minimum subject to the higher level calibration equation
   Σ_{h=1}^{L} Ω_h s²_hx = V(x̄_st) ,   (8.4.13)
where V(x̄_st) = Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] S²_hx is assumed to be known, and Q_h, like q_h, is again a
suitably chosen weight used to form different kinds of estimators. This procedure leads to
a new estimator for the variance of the combined generalized regression estimator
(GREG) as
   v̂_ho(ȳ*_st)_greg = v̂(ȳ*_st)_greg + β̂_st [V(x̄_st) - v̂(x̄_st)] ,   (8.4.14)
where
   β̂_st = Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] Q_h s²_hx s²_eh / Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] Q_h s⁴_hx
denotes the combined improved estimator of the regression coefficient in stratified
sampling and v̂(x̄_st) = Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] s²_hx is an unbiased estimator of V(x̄_st).
If q_h = 1/x̄_h and Q_h = 1/s²_hx, then the estimator (8.4.14) reduces to a new
estimator of the variance of the combined ratio estimator given by
   v̂_st(ȳ_ratio) = Σ_{h=1}^{L} [W²_h (1 - f_h)/n_h] s²_eh (X̄/x̄_st)² (V(x̄_st)/v̂(x̄_st)) ,   (8.4.15)
which is again a ratio type estimator proposed by Wu (1985) for estimating the
variance of the combined ratio estimator. Note that it makes use of the extra
knowledge of the known variance of the auxiliary variable at the estimation stage.
Several more new estimators can be constructed for new choices of the weights q_h
and Q_h. Defining u = X̄/Σ_{i∈s} d_i x_i and v = V(x̄_st)/v̂(x̄_st), Singh, Horn, and Yu (1998)
suggested a wider class of estimators as
   v̂_ss(ȳ_st)_greg = {(1/2) Σ_{i∈s} Σ_{j∈s} D_ij (d_i e_i - d_j e_j)²} H(u, v) ,   (8.4.16)
where H(u, v) is a parametric function of u and v such that H(1, 1) = 1, satisfying
certain regularity conditions. Then all estimators obtained from the functions
H(u,v) = u^α v^β, H(u,v) = {1 + α(u-1)}/{1 + β(v-1)}, H(u,v) = 1 + α(u-1) + β(v-1), and
H(u,v) = {1 + α(u-1) + β(v-1)}⁻¹ are special cases of the higher level calibration
approach, where α and β are unknown parameters involved in the function
H(u, v). Replacement of these parameters with their respective consistent
estimators in the class of estimators at (8.4.16) yields estimators which possess the
same asymptotic variance, as shown by Srivastava and Jhajj (1983a), Singh and
Singh (1984a), and Mahajan and Singh (1996).

Example 8.4.1. Select a sample of 40 countries from population 5 using the method
of proportional allocation. Record the production and area of the tobacco crop for
the countries selected in the sample. Assume that the area under the tobacco crop in
the different continents is known. Obtain the calibration weights which form the
combined ratio and combined GREG type estimators. Obtain the values of the
estimates.
Solution. Using proportional allocation, we have the same values of n_h as in
the previous example. Further, from the sample information, we have
Stratum no. | ȳ_h | x̄_h | W_h x̄_h | W_h^ratio | W_h x̄²_h | W_h^greg | W_h^ratio ȳ_h | W_h^greg ȳ_h
 1 |   2592.0 |   1304.7 |    73.84432 | 0.032587 |        96342.5 | 0.056468 |    84.46 |   146.36
 2 |  26763.0 |  29075.0 |  1645.64500 | 0.032587 |     47847128.4 | 0.053662 |   872.12 |  1436.15
 3 |  14766.0 |   5191.7 |   391.97108 | 0.043468 |      2034984.5 | 0.074800 |   641.85 |  1104.49
 4 |  29900.0 |  21700.0 |  2046.31000 | 0.054292 |     44404927.0 | 0.090646 |  1623.33 |  2710.33
 5 |  12463.0 |   6808.0 |   770.66560 | 0.065174 |      5246691.4 | 0.111824 |   812.26 |  1393.66
 6 |   3375.0 |   1800.0 |    67.86000 | 0.021705 |       122148.0 | 0.037579 |    73.26 |   126.82
 7 |  38411.8 |  24481.5 |  6928.27582 | 0.162934 |    169614861.6 | 0.270630 |  6258.61 | 10395.40
 8 | 473455.0 | 294809.0 | 47287.36360 | 0.092349 |  13940740376.0 | 0.075972 | 43722.94 | 35969.48
 9 |   7480.3 |   6303.7 |   594.43608 | 0.054292 |      3747128.9 | 0.093239 |   406.12 |   697.45
10 |    822.5 |    350.0 |     9.90500 | 0.016293 |         3466.7 | 0.028282 |    13.40 |    23.26
Sum |         |          | 59816.27651 |          | 14213858055.0 |          | 54508.36 | 54003.43

In general the calibration weights in stratified random sampling are given by
   W*_h = W_h + (W_h q_h x̄_h / Σ_{h=1}^{L} W_h q_h x̄²_h)(X̄ - Σ_{h=1}^{L} W_h x̄_h) .
If q_h = 1/x̄_h then the calibration weights which lead to the combined ratio estimator
are
   W_h^ratio = W_h (X̄ / Σ_{h=1}^{L} W_h x̄_h) ,
and an estimate of the average production of tobacco using the combined ratio
estimator is given by
   ȳ_r = Σ_{h=1}^{L} W_h^ratio ȳ_h = 54508.36 .
If q_h = 1 then the calibration weights leading to the combined GREG estimator are
   W_h^greg = W_h + (W_h x̄_h / Σ_{h=1}^{L} W_h x̄²_h)(X̄ - Σ_{h=1}^{L} W_h x̄_h) ,
and the combined GREG estimate of the average tobacco production is given by
   ȳ_greg = Σ_{h=1}^{L} W_h^greg ȳ_h = 54003.43 .
The true average production level of the tobacco crop in the world is 52444.56, as
shown in the description of population 5. Evidently the GREG estimate is closer to
the true value than the ratio estimate: for this particular sample the sampling error
is smaller for the GREG estimate than for the ratio estimate.

Singh (2003c) recently considered a new estimator of the population mean Ȳ in
stratified sampling as
   ȳ_st^® = Σ_{h=1}^{L} W_h^® ȳ_h ,   (8.4.1.1)
where the W_h^® are the calibrated weights such that the chi square distance function
   D^® = (1/2) Σ_{h=1}^{L} (W_h^® - W_h)² / (W_h Q_h)   (8.4.1.2)
is minimum subject to the two constraints defined as
   Σ_{h=1}^{L} W_h^® = Σ_{h=1}^{L} W_h   (8.4.1.3)
and
   Σ_{h=1}^{L} W_h^® x̄_h = X̄ ,   (8.4.1.4)
where the Q_h are some suitably chosen weights. Minimization of (8.4.1.2) with respect
to (8.4.1.3) and (8.4.1.4) leads to the new calibrated weights given by
   W_h^® = W_h + {(W_h Q_h x̄_h)(Σ_{h=1}^{L} W_h Q_h) - (W_h Q_h)(Σ_{h=1}^{L} W_h Q_h x̄_h)}
           / {(Σ_{h=1}^{L} W_h Q_h)(Σ_{h=1}^{L} W_h Q_h x̄²_h) - (Σ_{h=1}^{L} W_h Q_h x̄_h)²} × (X̄ - Σ_{h=1}^{L} W_h x̄_h) ,   (8.4.1.5)
and thus a new calibrated estimator of the population mean Ȳ becomes
   ȳ_st^® = Σ_{h=1}^{L} W_h ȳ_h + β̂_st(ols) (X̄ - Σ_{h=1}^{L} W_h x̄_h) ,   (8.4.1.6)
where
   β̂_st(ols) = {(Σ_{h=1}^{L} W_h Q_h x̄_h ȳ_h)(Σ_{h=1}^{L} W_h Q_h) - (Σ_{h=1}^{L} W_h Q_h ȳ_h)(Σ_{h=1}^{L} W_h Q_h x̄_h)}
              / {(Σ_{h=1}^{L} W_h Q_h)(Σ_{h=1}^{L} W_h Q_h x̄²_h) - (Σ_{h=1}^{L} W_h Q_h x̄_h)²} .   (8.4.1.7)
If Q_h = 1 then the estimator (8.4.1.6) reduces to the traditional combined stratified
linear regression estimator, and hence is better than the estimators developed by
Singh, Horn, and Yu (1998) and Tracy, Singh, and Arnab (2003), and leads to the
following theorem:

Theorem 8.4.1.1. The combined linear regression estimator in stratified random
sampling is unique in its class of estimators.

Note that more work related to this topic will be discussed in the next volume.
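The two-constraint calibration (8.4.1.5)-(8.4.1.7) can be sketched numerically. In the Python sketch below the stratum summaries are illustrative; with Q_h = 1 the calibrated weights satisfy both constraints exactly and the resulting estimate equals the combined regression estimate Σ W_h ȳ_h + β̂(X̄ - Σ W_h x̄_h).

```python
# Two-constraint chi-square calibration, eqs. (8.4.1.5)-(8.4.1.7);
# the stratum summaries are illustrative only.
W    = [0.3, 0.5, 0.2]          # original stratum weights W_h
xbar = [4.0, 9.0, 15.0]         # stratum sample means of x
ybar = [8.0, 19.0, 31.0]        # stratum sample means of y
Xbar = 9.0                      # known population mean of x
Q    = [1.0, 1.0, 1.0]          # Q_h = 1 recovers the combined regression estimator

s0  = sum(w * q         for w, q in zip(W, Q))
s1  = sum(w * q * v     for w, q, v in zip(W, Q, xbar))
s2  = sum(w * q * v * v for w, q, v in zip(W, Q, xbar))
sxy = sum(w * q * v * u for w, q, v, u in zip(W, Q, xbar, ybar))
sy  = sum(w * q * u     for w, q, u in zip(W, Q, ybar))
gap = Xbar - sum(w * v for w, v in zip(W, xbar))

det  = s0 * s2 - s1 ** 2
Wcal = [w + (w * q * v * s0 - w * q * s1) / det * gap
        for w, q, v in zip(W, Q, xbar)]          # eq. (8.4.1.5)
beta = (sxy * s0 - sy * s1) / det                # eq. (8.4.1.7)
est  = sum(w * u for w, u in zip(W, ybar)) + beta * gap   # eq. (8.4.1.6)
```

Both calibration equations hold by construction: sum(Wcal) equals sum(W), and sum of Wcal[h]·x̄_h equals X̄.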

Let us study how to construct the strata boundaries which divide the scale of the study or
auxiliary variable into strata. The range of the auxiliary variable X is a to b,
subject to the condition that (b - a) < ∞. For example, let x ∈ (a, b) and let the initial
rough four strata be given by a to B_1; B_1 to B_2; B_2 to B_3; and B_3 to b.
It is to be noted here that finding two boundary points B_1 and B_2 will result in three
strata, three boundary points B_1, B_2, and B_3 will result in four strata, and so on.
These boundary points can be obtained using information either on the study variable
Y or on the auxiliary variable X. While doing stratification on the basis of the study
variable Y we assume that:

( i ) The study variable Y has a continuous distribution;

( ii ) The first and second order derivatives of the probability density function f(y)
exist for all y in the range (a, b);
( iii ) The stratum boundary points y_1, y_2, ..., y_{L-1}, which will result in L strata
based on the study variable, take positions as shown in Figure 8.5.1.

a ----- y_1 ----- y_2 ----- y_3 ----- ... ----- y_{h-1} ----- y_h ----- ... ----- y_{L-1} ----- b
Fig. 8.5.1 Positions of the stratum boundary points.

( iv ) The hth stratum weight, mean, and variance are defined as
   W_h = ∫_{y_{h-1}}^{y_h} f(y)dy ,  μ_hy = (1/W_h) ∫_{y_{h-1}}^{y_h} y f(y)dy ,  and
   σ²_hy = (1/W_h) ∫_{y_{h-1}}^{y_h} y² f(y)dy - μ²_hy ;

( v ) Under assumption ( ii ) we have the following results:

On differentiating W_h with respect to y_h we get
   ∂W_h/∂y_h = f(y_h) ;   (8.5.1)
applying the basic rule of differentiation, for the ith stratum weight we have
   ∂W_i/∂y_h = -f(y_h) ,  i ≠ h .   (8.5.2)
On differentiating μ_hy with respect to y_h we get
   ∂μ_hy/∂y_h = [y_h f(y_h) W_h - (∫_{y_{h-1}}^{y_h} y f(y)dy) f(y_h)] / W²_h
             = [y_h f(y_h) - μ_hy f(y_h)] / W_h = (f(y_h)/W_h)[y_h - μ_hy] .
On differentiating σ²_hy with respect to y_h we have
   ∂σ²_hy/∂y_h = y²_h f(y_h)/W_h - (σ²_hy + μ²_hy) f(y_h)/W_h - 2μ_hy (f(y_h)/W_h)(y_h - μ_hy)
             = (f(y_h)/W_h)[y²_h - (σ²_hy + μ²_hy) - 2μ_hy y_h + 2μ²_hy]
             = (f(y_h)/W_h)[(y_h - μ_hy)² - σ²_hy] .   (8.5.3)

Again from the basic concept of differentiation we have
   ∂σ²_iy/∂y_h = -(f(y_h)/W_i)[(y_h - μ_iy)² - σ²_iy] .   (8.5.4)

Under proportional allocation, if the finite population correction factor is ignored, then
the variance of the estimator ȳ_st of the population mean Ȳ is given by
   V_p(ȳ_st) = (1/n) Σ_{h=1}^{L} W_h σ²_hy .   (8.5.1.1)
Assuming i = h + 1, then differentiating (8.5.1.1) with respect to y_h and equating to
zero, we have
   n ∂V_p(ȳ_st)/∂y_h = ∂(W_h σ²_hy)/∂y_h + ∂(W_i σ²_iy)/∂y_h
   = σ²_hy (∂W_h/∂y_h) + W_h (∂σ²_hy/∂y_h) + σ²_iy (∂W_i/∂y_h) + W_i (∂σ²_iy/∂y_h) = 0 .   (8.5.1.2)
On using the first order derivative results from the previous section in (8.5.1.2) we have
   σ²_hy f(y_h) + W_h (f(y_h)/W_h)[(y_h - μ_hy)² - σ²_hy] + σ²_iy {-f(y_h)}
   + W_i {-(f(y_h)/W_i)}[(y_h - μ_iy)² - σ²_iy] = 0 ,
or (y_h - μ_hy)² - (y_h - μ_iy)² = 0, or (μ_hy - μ_iy)[(μ_hy + μ_iy) - 2y_h] = 0 .   (8.5.1.3)

Note that μ_hy ≠ μ_iy, thus (8.5.1.3) implies that
   y_h = (μ_hy + μ_iy)/2 .   (8.5.1.4)

Thus we have the following theorem:

Theorem 8.5.1.1. Under proportional allocation, the strata boundary points are
given by
   y_h = (μ_hy + μ_iy)/2 ,  for h = 1, 2, ..., L - 1.

This set of equations is called the set of minimal equations. It is to be noted that
μ_hy and μ_iy are based on the strata boundaries; therefore it is not possible to obtain
y_h directly. In other words, under proportional allocation each stratum boundary is
the mean of the stratum means of the study variable in the preceding and following
strata.
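Since μ_hy and μ_iy themselves move when a boundary moves, the minimal equations are typically solved by iteration: start from rough boundaries, compute the stratum means, reset each boundary to the average of the two adjacent stratum means, and repeat until the boundaries stabilize. A minimal sketch on an assumed discrete population:

```python
# Iterative solution of y_h = (mu_hy + mu_iy)/2, i = h + 1, on a
# discrete illustrative population (proportional allocation case).
import random

random.seed(1)
pop = sorted(random.uniform(0.0, 100.0) for _ in range(2000))

def stratum_means(bounds):
    """Mean of pop within each stratum cut at the interior boundaries."""
    cuts = [pop[0]] + list(bounds) + [pop[-1] + 1e-9]
    means = []
    for lo, hi in zip(cuts, cuts[1:]):
        vals = [v for v in pop if lo <= v < hi]
        means.append(sum(vals) / len(vals) if vals else (lo + hi) / 2.0)
    return means

bounds = [25.0, 50.0, 75.0]          # rough starting boundaries for L = 4 strata
for _ in range(200):
    mu = stratum_means(bounds)
    new = [(mu[h] + mu[h + 1]) / 2.0 for h in range(len(bounds))]
    if max(abs(a - b) for a, b in zip(new, bounds)) < 1e-9:
        bounds = new
        break
    bounds = new
```

Once stratum membership stops changing between iterations the update map is constant and the boundaries converge exactly; for heavily skewed populations a more careful search may be needed.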

Under Neyman allocation, if the finite population correction factor is ignored, then
the variance of the estimator ȳ_st of the population mean Ȳ is given by
   V(ȳ_st)_N = (1/n) (Σ_{h=1}^{L} W_h σ_hy)² .   (8.5.2.1)
Now minimization of (8.5.2.1) is equivalent to minimization of φ = Σ_{h=1}^{L} W_h σ_hy. Thus
the minimal equations in this case will be
   ∂φ/∂y_h = ∂(W_h σ_hy)/∂y_h + ∂(W_i σ_iy)/∂y_h
   = σ_hy (∂W_h/∂y_h) + W_h (∂σ_hy/∂y_h) + σ_iy (∂W_i/∂y_h) + W_i (∂σ_iy/∂y_h) = 0 .   (8.5.2.2)

Now we have ∂σ²_hy/∂y_h = 2σ_hy (∂σ_hy/∂y_h), which implies that
   ∂σ_hy/∂y_h = (1/2σ_hy)(∂σ²_hy/∂y_h) = (f(y_h)/2σ_hy W_h)[(y_h - μ_hy)² - σ²_hy] .   (8.5.2.3)
Similarly we have
   ∂σ_iy/∂y_h = -(f(y_h)/2σ_iy W_i)[(y_h - μ_iy)² - σ²_iy] .   (8.5.2.4)
Thus the set of minimal equations (8.5.2.2) will reduce to
   σ_hy f(y_h) + W_h (f(y_h)/2σ_hy W_h)[(y_h - μ_hy)² - σ²_hy]
   - σ_iy f(y_h) - W_i (f(y_h)/2σ_iy W_i)[(y_h - μ_iy)² - σ²_iy] = 0 ,
or
   [(y_h - μ_hy)² + σ²_hy]/σ_hy = [(y_h - μ_iy)² + σ²_iy]/σ_iy .   (8.5.2.5)

Thus we have the following theorem.

Theorem 8.5.2.1. The strata boundary points for the Neyman allocation are given
by the solution to the set of minimal equations
   [(y_h - μ_hy)² + σ²_hy]/σ_hy = [(y_h - μ_iy)² + σ²_iy]/σ_iy ,
for i = h + 1 and h = 1, 2, ..., L - 1 .   (8.5.2.6)

Solving these equations is not easy, since μ_hy and σ_hy are dependent upon y_h.

The methods discussed so far for obtaining the strata boundary points are due to
Dalenius and Hodges (1957, 1959). In actual practice the following six steps are
needed for applying Dalenius and Hodges's method of finding strata boundaries.

Step 1. Sort the population by the auxiliary character X. The value of X is
known for each unit in the population and is (thought to be) highly correlated with Y.
Step 2. Divide the population into a large number of small and equal width
intermediate strata.
Step 3. Calculate the population frequency within each intermediate stratum.
Denote the observed frequency in the hth intermediate stratum by f_h.
Step 4. Obtain the list of intermediate strata and calculate the cumulative sum of
√f_h.
Step 5. Divide T = Σ_h √f_h by L, the total number of strata, and let k = T/L.
Step 6. The stratum boundaries are obtained by finding the values of the cumulative
sum Cum √f_h, h = 1, 2, ..., that are closest to the multiples of k, and using the right
boundary point of the intermediate stratum giving that value of Cum √f_h.
Sometimes it may be difficult to use intermediate strata of equal width, for
example, when stratifying by employment. The Dalenius-Hodges technique then proceeds as
before, except that √(w_h f_h) has to be used instead of √f_h, where w_h is the width of the
intermediate stratum h. The above method is also called the Cumulative Square
Root (CSR) method. Unnithan (1978) modified the Newton method used by
Shannon (1970), which seems to perform very satisfactorily when adapted for
minimizing the variance in the search for the best boundary points of stratification
for Neyman allocation. Shiledar-Baxi (1995) discussed a sequel to the modified
Kossack and Shiledar-Baxi (1971) procedure proposed to obtain an improved 'unit
stratified' design.

Example 8.5.1. Stratify the 50 states listed in population 1 (Appendix) using
nonreal estate farm loans as an auxiliary variable. Construct the stratum boundaries
for six strata using the cumulative square root method.
Solution. Sort the data in ascending order of the magnitude of the nonreal estate
farm loans as follows.

Intermediate stratum 1:  0.233 RI; 0.471 NH; 3.433 AK; 4.373 CT
Intermediate stratum 2:  16.710 NV; 19.363 VT; 27.508 NJ; 29.291 WV; 38.067 HI;
                         43.229 DE; 51.539 ME; 56.471 MA; 57.684 MD; 80.750 SC
Intermediate stratum 3:  188.477 VA; 197.244 UT
Intermediate stratum 4:  274.035 NM; 298.351 PA
Intermediate stratum 5:  348.334 AL; 386.479 WY; 388.869 TN
Intermediate stratum 6:  405.799 LA; 426.274 NY; 431.439 AZ; 440.518 MI;
                         464.516 FL; 494.730 NC
Intermediate stratum 7:  540.696 GA; 549.551 MS; 557.656 KY; 571.487 OR
Intermediate stratum 8:  635.774 OH; 722.034 MT; 848.317 AR; 906.281 CO
Intermediate stratum 9:  1006.036 ID; 1022.782 IN; 1228.607 WA; 1241.369 ND
Intermediate stratum 10: 1372.439 WI; 1519.944 MO; 1692.817 SD; 1716.087 OK
Intermediate stratum 11: 2466.892 MN; 2580.304 KS; 2610.572 IL
Intermediate stratum 12: 3520.361 TX; 3585.406 NE; 3909.738 IA; 3928.732 CA

From the above table we have

h  |   w_h   | f_h | w_h f_h  | √(w_h f_h) | Cum √(w_h f_h)
 1 |   4.140 |  4  |   16.560 |  4.069398  |   4.069398
 2 |  64.040 | 10  |  640.400 | 25.306130  |  29.375520
 3 |   8.767 |  2  |   17.534 |  4.187362  |  33.562890
 4 |  24.316 |  2  |   48.632 |  6.973665  |  40.536550
 5 |  40.535 |  3  |  121.605 | 11.027470  |  51.564020
 6 |  88.931 |  6  |  533.586 | 23.099480  |  74.663500
 7 |  30.791 |  4  |  123.164 | 11.097930  |  85.761420
 8 | 270.507 |  4  | 1082.028 | 32.894190  | 118.655600
 9 | 235.333 |  4  |  941.332 | 30.681130  | 149.336800
10 | 343.648 |  4  | 1374.592 | 37.075490  | 186.412200
11 | 143.680 |  3  |  431.040 | 20.761500  | 207.173700
12 | 408.371 |  4  | 1633.484 | 40.416380  | 247.590100

Note the use of f_h as the frequency. Also note that we need 6 strata; therefore divide
the sum Σ_{h=1}^{12} √(w_h f_h) = 247.59010 by 6, and we have the first boundary point at a
cumulative value of 41.26502. Thus the five boundary points (or six strata) are given in the
following table.
following table.

Table 8.5.1. Final list of states in different strata obtained by the cumulative square
root method.

Stratum | Lower | Upper | Intermediate strata | States | No. of states
1 |   0.00000 |  41.26502 | 1, 2, 3, 4 | RI, NH, AK, CT, NV, VT, NJ, WV, HI, DE, ME, MA, MD, SC, VA, UT, NM, PA | 18
2 |  41.26502 |  82.53004 | 5, 6       | AL, WY, TN, LA, NY, AZ, MI, FL, NC | 9
3 |  82.53004 | 123.79510 | 7, 8       | GA, MS, KY, OR, OH, MT, AR, CO | 8
4 | 123.79510 | 165.06010 | 9          | ID, IN, WA, ND | 4
5 | 165.06010 | 206.32510 | 10         | WI, MO, SD, OK | 4
6 | 206.32510 | 247.59010 | 11, 12     | MN, KS, IL, TX, NE, IA, CA | 7

Example 8.5.2. We wish to estimate the average death rate of persons living in
the United States on the basis of year 2000 projections of the death rate. The
projected 191443 persons are to be grouped into five strata on the basis of their age
at the time of death. The rough distribution of the death rate projected for the year
2000 in the United States has been listed in population 6 in 21 age groups with a
gap of four years. We wish to apply the Neyman method of sample allocation for
selecting the overall sample. Apply the Cumulative Square Root method to form 5
strata.
Solution. From the information given in population 6 we have

h  | Age group | Class interval |  f_h  |   √f_h     | Cum √f_h
 1 |   0--4    |   0.5--4.5     |   692 |  26.305890 |   26.30589
 2 |   5--9    |   4.5--9.5     |    14 |   3.741657 |   30.04755
 3 |  10--14   |   9.5--14.5    |    15 |   3.872983 |   33.92053
 4 |  15--19   |  14.5--19.5    |    57 |   7.549834 |   41.47037
 5 |  20--24   |  19.5--24.5    |    71 |   8.426150 |   49.89652
 6 |  25--29   |  24.5--29.5    |    68 |   8.246211 |   58.14273
 7 |  30--34   |  29.5--34.5    |    78 |   8.831761 |   66.97449
 8 |  35--39   |  34.5--39.5    |   111 |  10.535650 |   77.51014
 9 |  40--44   |  39.5--44.5    |   182 |  13.490740 |   91.00088
10 |  45--49   |  44.5--49.5    |   315 |  17.748240 |  108.74910
11 |  50--54   |  49.5--54.5    |   533 |  23.086790 |  131.83590
12 |  55--59   |  54.5--59.5    |   853 |  29.206160 |  161.04210
13 |  60--64   |  59.5--64.5    |  1357 |  36.837480 |  197.87960
14 |  65--69   |  64.5--69.5    |  2010 |  44.833020 |  242.71260
15 |  70--74   |  69.5--74.5    |  3022 |  54.972720 |  297.68530
16 |  75--79   |  74.5--79.5    |  4423 |  66.505640 |  364.19090
17 |  80--84   |  79.5--84.5    |  6921 |  83.192550 |  447.38350
18 |  85--89   |  84.5--89.5    | 10609 | 103.000000 |  550.38350
19 |  90--94   |  89.5--94.5    | 16915 | 130.057700 |  680.44120
20 |  95--99   |  94.5--99.5    | 27188 | 164.887800 |  845.32900
21 | 100--104  |  99.5--104.5   | 44053 | 209.888100 | 1055.21700
22 | 105--109  | 104.5--109.5   | 71956 | 268.246200 | 1323.46300

For making five strata, we need to know four boundary points, say B_1, B_2, B_3, and
B_4, by using linear interpolation between the class intervals and the Cum √f_h
values. On dividing 1323.4630 by 5 and taking the cumulative totals, we get the
four rough boundary points corresponding to the values 264.6926, 529.3853, 794.0779,
and 1058.771 in the 6th column of the above table. Thus the four boundary
points obtained through interpolation are given by
   B_1 = 69.5 + 4(264.6926 - 242.7126)/54.97272 = 71.09 ,
   B_2 = 84.5 + 4(529.3853 - 447.3835)/103.0000 = 87.68 ,
   B_3 = 94.5 + 4(794.0779 - 680.4412)/164.8878 = 97.25 ,
and
   B_4 = 104.5 + 4(1058.771 - 1055.217)/268.2462 = 104.55 .

This indicates that the persons with ages in the ranges
[0, 71.09], [71.09, 87.68], [87.68, 97.25], [97.25, 104.55], and [104.55, 109]
will form the first, second, third, fourth, and fifth stratum, respectively, according to
the cumulative square root method.
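The interpolation step above generalizes to B = lower class limit + width × (target - previous cumulative value)/(√f of the class). A sketch reproducing B_1 and B_4, with the constants read off the table above (the book multiplies by a width of 4, per the four-year age gap):

```python
# Linear interpolation of a stratum boundary inside a class interval,
# as used in Example 8.5.2: B = lower + width * (target - cum_prev) / root_f.
def interpolate_boundary(lower, width, target, cum_prev, root_f):
    return lower + width * (target - cum_prev) / root_f

# B1 falls in the class 69.5--74.5; B4 falls in the class 104.5--109.5.
B1 = interpolate_boundary(69.5, 4.0, 264.6926, 242.7126, 54.97272)
B4 = interpolate_boundary(104.5, 4.0, 1058.771, 1055.217, 268.2462)
```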

8.5.3 STRATIFICATION USING AUXILIARY INFORMATION

Serfling (1968) suggested the use of the Cum √f rule for stratification on the auxiliary
variable x when the regression of Y on X is linear with uncorrelated
homoscedastic errors and nearly perfect correlation. As the Cum √f rule was
primarily proposed for stratification on the study variable Y, it does not take into
account the regression of Y on X or the form of the conditional variance
V(Y|x). The Cum ∛f rule of Singh and Sukhatme (1969) and Singh (1971),
though it takes into account the regression of Y on X and also the form of the
conditional variance function V(Y|x), does not reduce to the rules recommended
for optimum stratification on Y when V(Y|x) = 0. While deriving the Cum ∛f rule,
Singh (1971) made the assumption that V(Y|x) > 0 for all x in the range (a, b) of x
with (b - a) < ∞. Singh (1975b) suggested an improvement in his Cum ∛f rule
such that the rule based on the conditional variance function V(Y|x) also reduces to a rule
for optimum stratification on Y when V(Y|x) = 0 for all x in the range (a, b).
Schneeberger (1979) commented on the necessary conditions of Dalenius and
Hodges (1957, 1959) for optimal stratification points with Neyman's allocation.
Bankier (1988) suggested a power allocation method, under which the sample size
in the hth stratum takes the optimum value
   n_h = n (S_h X_h^q / Ȳ_h) / Σ_{h=1}^{L} (S_h X_h^q / Ȳ_h) .   (8.5.3.1)
The value of q is called the power of this allocation, and it can take any value in
[0, 1]. The choice of q between 0 and 1 can be viewed as a compromise between
the Neyman allocation and the almost equal coefficient of variation allocation.
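Under the form of (8.5.3.1) as reconstructed above, power allocation is a one-line normalization. The sketch below is illustrative: the stratum size measures X_h, standard deviations S_h, and means Ȳ_h are all assumed, and it shows how q moves the allocation between the two extremes.

```python
# Power allocation (Bankier, 1988) as given in (8.5.3.1):
# n_h proportional to S_h * X_h**q / Ybar_h; all inputs are illustrative.
def power_allocation(n, X, S, Ybar, q):
    score = [s * x ** q / y for x, s, y in zip(X, S, Ybar)]
    total = sum(score)
    return [n * sc / total for sc in score]

X    = [100.0, 400.0, 900.0]     # stratum size measures (assumed)
S    = [3.0, 5.0, 8.0]           # stratum standard deviations (assumed)
Ybar = [10.0, 20.0, 40.0]        # stratum means (assumed)

n_q0 = power_allocation(120, X, S, Ybar, q=0.0)  # toward equal-CV allocation
n_q1 = power_allocation(120, X, S, Ybar, q=1.0)  # toward Neyman-type allocation
```

With q = 0 the allocation depends only on the stratum coefficients of variation S_h/Ȳ_h; with q = 1 the size measures X_h dominate.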
Mandowara and Gupta (1999) have obtained optimum points of stratification for
two or more stage designs with unequal first stage units and the subsequent units.
For numerical illustrations related to the Cum √f and Cum ∛f methods, one can also
refer to Singh and Mangat (1996). We would now like to discuss Singh's (1971) method
of stratification using auxiliary information. We have seen that an unbiased
estimator of the population mean Ȳ in stratified random sampling is given by
   ȳ_st = Σ_{h=1}^{L} W_h ȳ_h   (8.5.3.2)
with the approximate variance of the estimator given by
   V(ȳ_st) = Σ_{h=1}^{L} (W²_h/n_h) σ²_hy .   (8.5.3.3)

Obviously, for achieving the maximum precision, a stratification design with
minimum V(ȳ_st) is desired. From (8.5.3.3) it is clear that the problem of optimum
stratification involves the simultaneous determination of ( a ) the optimum strata
boundaries, ( b ) the optimum number of strata, and ( c ) the sample allocation n_h. Singh
(1971) considered the problem of optimum allocation of the sample to the different
strata when the total expected cost of the survey is fixed. Assume that the cost of
observing Y on a unit is a function of the value of the variable x for that unit. A
function C(x) is said to belong to O_r if the first r derivatives of C(x) exist for all
x in the range (a, b) of x. If C(x) {C(x) ∈ O_3, C(x) > 0} for all x in (a, b) is a cost
function, then the expected cost of observing n_h units in the hth stratum is n_h μ_hc,
where μ_hc is the expected value of C(x) in the hth stratum. The total fixed cost
function is then defined as
   C = C_0 + Σ_{h=1}^{L} n_h μ_hc + Ψ(L) ,   (8.5.3.4)
where C_0 is the overhead cost and Ψ(L) is the cost of constructing L strata, with
Ψ(1) = 0. Let the regression of y on x be given by
   y = λ(x) + e ,   (8.5.3.5)
where E(e|x) = 0 and V(e|x) = φ(x) > 0 for all x ∈ (a, b) and (b - a) < ∞. Assume that
λ(x) ∈ O_2 and φ(x) ∈ O_3. Under the regression model we have
   σ²_hy = σ²_hλ + μ_hφ .   (8.5.3.6)
Thus the variance of the estimator under stratified sampling becomes
   V(ȳ_st) = Σ_{h=1}^{L} (W²_h/n_h)(σ²_hλ + μ_hφ) .   (8.5.3.7)
For a given number of strata L the Lagrange function is given by
   Lg = Σ_{h=1}^{L} (W²_h/n_h)(σ²_hλ + μ_hφ) + K[C - C_0 - Σ_{h=1}^{L} n_h μ_hc - Ψ(L)] .   (8.5.3.8)

The optimum values of n_h are given by



   n_h = [C - C_0 - Ψ(L)] W_h √((σ²_hλ + μ_hφ)/μ_hc) / Σ_{h=1}^{L} W_h √(μ_hc (σ²_hλ + μ_hφ)) .   (8.5.3.9)

The minimum variance of the estimator ȳ_st under stratified sampling is given by
   Min. V(ȳ_st) = [Σ_{h=1}^{L} W_h √(μ_hc (σ²_hλ + μ_hφ))]² / [C - C_0 - Ψ(L)] .   (8.5.3.10)

If f(x) ∈ O_3 denotes the density function of x, then we have
   W_h = ∫_{x_{h-1}}^{x_h} f(x)dx ,  μ_hc = (1/W_h) ∫_{x_{h-1}}^{x_h} C(x)f(x)dx ,  and
   σ²_hλ = (1/W_h) ∫_{x_{h-1}}^{x_h} λ²(x) f(x)dx - μ²_hλ ,
where μ_hλ is the expected value of λ(x) in the hth stratum and (x_{h-1}, x_h) are the
stratum boundaries. On differentiating Σ_{h=1}^{L} W_h √(μ_hc (σ²_hλ + μ_hφ)) with respect to x_h

and equating to zero, we obtain the set of minimal equations given by
   [μ_hc (φ(x_h) + {λ(x_h) - μ_hλ}²) + C(x_h)(σ²_hλ + μ_hφ)] / √(μ_hc (σ²_hλ + μ_hφ))
   = [μ_ic (φ(x_h) + {λ(x_h) - μ_iλ}²) + C(x_h)(σ²_iλ + μ_iφ)] / √(μ_ic (σ²_iλ + μ_iφ)) ,   (8.5.3.11)
for i = h + 1, h = 1, 2, ..., L - 1. Again, the exact solution to these equations is not easy. An
approximate solution to these equations is
   K²_h ∫_{x_{h-1}}^{x_h} g(t)f(t)dt = Constant ,   (8.5.3.12)
where K_h = x_h - x_{h-1} and g(t) = [(φC' - Cφ')² + 4C²φλ'²]/(φC)². Following
Singh and Sukhatme (1969), an approximate solution to the minimal equations is
   ∫_{x_{h-1}}^{x_h} ∛(g(t)f(t)) dt = Constant .   (8.5.3.13)

Singh Cum ∛p(x) rule: If the function p(x) = g(x)f(x) is bounded and possesses its
first two derivatives for all x in (a, b), then for a given value of L, taking equal
intervals of Cum ∛p(x) yields approximately optimum strata boundaries.

Example 8.5.3. We wish to estimate the average death rate of persons living in
the United States on the basis of year 2000 projections of the death rate. The
projected 191443 persons are to be grouped into five strata on the basis of their age
at the time of death. The rough distribution of the death rate projected for the year
2000 in the United States has been listed in population 6 in 21 age groups with a
gap of four years. We wish to apply the method of proportional allocation for
selecting the overall sample required for estimation purposes. Apply the cumulative
cube root method to form 5 strata.
Solution. From the information given in population 6 we have

h  | Age group | Class interval |  f_h  |   ∛f_h    | Cum ∛f_h
 1 |   0--4    |   0.5--4.5     |   692 |  8.845085 |   8.845085
 2 |   5--9    |   4.5--9.5     |    14 |  2.410142 |  11.255230
 3 |  10--14   |   9.5--14.5    |    15 |  2.466212 |  13.721440
 4 |  15--19   |  14.5--19.5    |    57 |  3.848501 |  17.569940
 5 |  20--24   |  19.5--24.5    |    71 |  4.140818 |  21.710760
 6 |  25--29   |  24.5--29.5    |    68 |  4.081655 |  25.792410
 7 |  30--34   |  29.5--34.5    |    78 |  4.272659 |  30.065070
 8 |  35--39   |  34.5--39.5    |   111 |  4.805896 |  34.870970
 9 |  40--44   |  39.5--44.5    |   182 |  5.667051 |  40.538020
10 |  45--49   |  44.5--49.5    |   315 |  6.804092 |  47.342110
11 |  50--54   |  49.5--54.5    |   533 |  8.107913 |  55.450020
12 |  55--59   |  54.5--59.5    |   853 |  9.483814 |  64.933840
13 |  60--64   |  59.5--64.5    |  1357 | 11.071160 |  76.005000
14 |  65--69   |  64.5--69.5    |  2010 | 12.620170 |  88.625180
15 |  70--74   |  69.5--74.5    |  3022 | 14.457660 | 103.082800
16 |  75--79   |  74.5--79.5    |  4423 | 16.414930 | 119.497800
17 |  80--84   |  79.5--84.5    |  6921 | 19.057080 | 138.554800
18 |  85--89   |  84.5--89.5    | 10609 | 21.973110 | 160.528000
19 |  90--94   |  89.5--94.5    | 16915 | 25.669890 | 186.197800
20 |  95--99   |  94.5--99.5    | 27188 | 30.069470 | 216.267300
21 | 100--104  |  99.5--104.5   | 44053 | 35.317650 | 251.585000
22 | 105--109  | 104.5--109.5   | 71956 | 41.593200 | 293.178200

For making five strata, we need to know four boundary points, say B_1, B_2, B_3, and
B_4, by using linear interpolation between the class intervals and the Cum ∛f_h
values. On dividing 293.1782 by 5 and taking the cumulative totals, we have the
four rough boundary points corresponding to the values 58.6356, 117.2713, 175.9069,
and 234.5425 in the 6th column of the above table. Thus the four boundary points
obtained through interpolation are
   B_1 = 54.5 + 4(58.6356 - 55.4500)/9.4838 = 55.84 ,
   B_2 = 74.5 + 4(117.2713 - 103.0828)/16.41493 = 77.95 ,
   B_3 = 89.5 + 4(175.9069 - 160.5280)/25.6699 = 91.89 ,
and
   B_4 = 99.5 + 4(234.5425 - 216.2673)/35.31765 = 101.57 .

This indicates that the persons with ages in the ranges
[0, 55.84], [55.84, 77.95], [77.95, 91.89], [91.89, 101.57], and [101.57, 109]
will form the first, second, third, fourth, and fifth stratum, respectively, according to
the cumulative cube root method.

Dayal (1985) considered the superpopulation model, defined as


Yj=a+fJxj+ej' j=1,2, ...,N (8.6.1)
where
E\ejlxj)=O, V(ejlxj)=r2x;, Cov\ej' ej'lxjxj')=O, i» )'.
Rao (1968) has shown that under model (8.6.1) the optimum allocation is given by
n_h ∝ √(T_h² − D_h)   (8.6.2)
where T_h = Σ_{i=1}^{N_h} x_hi and D_h = N_h Σ_{i=1}^{N_h} (x_hi² − x_hi^g). If g = 2 then D_h = 0 and the optimum allocation reduces to allocation proportional to stratum totals. Then we have the following theorems.

Theorem 8.6.1. Under model (8.6.1) the Neyman optimum allocation reduces to the allocation given by
n_h ∝ W_h √( ρ_xy² S_y² S_x^{-2} S_hx² + (1 − ρ_xy²) S_y² X̄_h^g / X̄^g ).   (8.6.3)

Proof. We have
ȳ_st = Σ_{h=1}^{L} W_h ȳ_h = Σ_{h=1}^{L} W_h {(1/n_h) Σ_{i=1}^{n_h} y_hi} = Σ_{h=1}^{L} W_h {(1/n_h) Σ_{i=1}^{n_h} (α + βx_hi + e_hi)}
     = Σ_{h=1}^{L} W_h (α + βx̄_h + ē_h).   (8.6.4)
Thus we have
V(ȳ_st) = E₁V₂(ȳ_st | x) + V₁E₂(ȳ_st | x) = Σ_{h=1}^{L} (τ²/n_h) W_h² X̄_h^g + β² V(x̄_st)   (8.6.5)
where X̄_h^g = N_h^{-1} Σ_{i=1}^{N_h} x_hi^g and V(x̄_st) = Σ_{h=1}^{L} W_h² V(x̄_h).
Also
V(y_j) = E₁V₂(y_j | x) + V₁E₂(y_j | x),
which implies
S_y² = β² S_x² + τ² X̄^g,
where
Chapter 8: Stratified and Post-Stratified Sampling 713

X̄^g = N^{-1} Σ_{h=1}^{L} Σ_{i=1}^{N_h} x_hi^g and β² = ρ_xy² S_y² / S_x².
Using these results in (8.6.5) we have

V(ȳ_st) = ρ_xy² (S_y²/S_x²) V(x̄_st) + (1 − ρ_xy²)(S_y²/X̄^g) Σ_{h=1}^{L} W_h² X̄_h^g / n_h

        = ρ_xy² (S_y²/S_x²) {Σ_{h=1}^{L} (1/n_h − 1/N_h) W_h² S_hx²} + (1 − ρ_xy²)(S_y²/X̄^g) Σ_{h=1}^{L} W_h² X̄_h^g / n_h.   (8.6.6)

The Lagrange function, subject to the condition n = Σ_{h=1}^{L} n_h, is then given by

L_g = V(ȳ_st) + λ(n − Σ_{h=1}^{L} n_h).   (8.6.7)

On differentiating (8.6.7) with respect to n_h and equating to zero we have (8.6.3).

Hence the theorem.
Cassel, Sarndal, and Wretman (1977) discussed the correspondence between the Horvitz and Thompson (1952) strategy and the stratified random sampling strategy for a special case of the general superpopulation model. Rao (1983c) considered a more general model and discussed the correspondence between these strategies in a more formal way by following Hanurav (1965), Ramachandran and Rao (1974), and Rao (1968, 1977b).
We have seen that in systematic sampling two or more auxiliary variables can be utilised to select a sample in a systematic way. The same question arises here: if two or more auxiliary variables are available, can all of them be used at the stratification stage? To answer this question we turn to the next section.

Use of one-way stratification for energy data in agriculture has been discussed by Singh, Singh, Mittal, Pannu, and Bhangu (1994) and Singh, Singh, Pannu, Bhangoo, and Singh (1994). Moses (1978) demonstrated the use of the multi-way stratification technique in the validation of energy data. Chernick and Wright (1983) found that in data validation respondent surveys many variables are candidates for stratification, and since the relationship between these variables and the response is not well understood, two-way or multi-way stratification with proportional allocation along each variable seems appropriate. Frankel and Stock (1942) discussed the use of multiple stratification techniques in gathering data related to unemployment. They considered the possibility of using sample designs in which the Latin square principle can be used to reduce the number of sample units necessary to represent all strata. For example, suppose two criteria for stratification are used, say A and B, such that p strata can be constructed from the A characteristic, and, within each of these, p from the B characteristic. If one

relates the resulting pattern to a single treatment of a p × p Latin square, it is obvious that in a sample of p sample units each of the p A-strata will be represented, and likewise each of the p B-strata. Tepping, Hurwitz, and Deming (1943) discussed such designs in some detail and compared their variances with the variances of single stratification sampling. Such a stratification is also called Deep Stratification.
Following Bryant, Hartley, and Jessen (1960) the population of size N is stratified along two variables using R rows and C columns. The R × C sections formed are called cells or strata, the (i, j)-th cell being a typical cell for i = 1, 2, ..., R and j = 1, 2, ..., C. Let N_ij be the number of population units in the (i, j)-th cell, Y_ijk be the value of the k-th unit of the study variable in the (i, j)-th cell for k = 1, 2, ..., N_ij, P_ij = N_ij/N be the population proportion of units in the (i, j)-th cell, Ȳ_ij = N_ij^{-1} Σ_{k=1}^{N_ij} Y_ijk be the population mean for the (i, j)-th cell, S_ij² = (N_ij − 1)^{-1} Σ_{k=1}^{N_ij} (Y_ijk − Ȳ_ij)² be the population mean square for the (i, j)-th cell, n_ij be the number of units in the sample belonging to the (i, j)-th cell, and ȳ_ij = n_ij^{-1} Σ_{k=1}^{n_ij} y_ijk be the sample mean in the (i, j)-th cell. Further, P_i. = Σ_{j=1}^{C} P_ij, P_.j = Σ_{i=1}^{R} P_ij, Ȳ = Σ_{i=1}^{R} Σ_{j=1}^{C} P_ij Ȳ_ij, and n_i. = Σ_{j=1}^{C} n_ij and n_.j = Σ_{i=1}^{R} n_ij are the marginal totals.


The marginal mean values are given by
Ȳ_i. = Σ_{j=1}^{C} P_ij Ȳ_ij / Σ_{j=1}^{C} P_ij and Ȳ_.j = Σ_{i=1}^{R} P_ij Ȳ_ij / Σ_{i=1}^{R} P_ij.

Table 8.7.1. Two-way stratification based on R rows and C columns.



For simplicity, and to achieve the greatest benefit of two-way stratification, we assume independence between the variables of stratification, that is, P_ij = P_i. P_.j. It is also assumed that max(R, C) ≤ n ≤ RC. For proportional allocation, the total sample of n units satisfies the conditions n_i. = nP_i. and n_.j = nP_.j, where n_i. and n_.j are rounded to the nearest integers such that the sum of the n_i. as well as of the n_.j equals n. The values of n_i. and n_.j are fixed, while n_ij is a random variable.
Let us take a hypothetical situation of six villages having four different types of farmers. The table consists of R = 6 rows corresponding to the villages and C = 4 columns corresponding to the different types of farmers, as shown in Table 8.7.2. Assume that we require a sample of size n = 15, which satisfies the condition max(R, C) ≤ n ≤ RC.

Table 8.7.2. Yield in kg/ha of wheat crop in six different regions of the Punjab state in India by different types of farmers and their proportions.

Village | Type 1 | Type 2 | Type 3 | Type 4 | P_i.  | nP_i. → n_i.
1       | 0.01   | 0.04   | 0.05   | 0.02   | 0.12  | 1.80 → 2
        | 2831   | 3095   | 3762   | --     |       | Ȳ_1. = 3458.58
2       | 0.03   | 0.02   | 0.09   | 0.05   | 0.19  | 2.85 → 3
        | 2568   | 3265   | 3520   | --     |       | Ȳ_2. = 3374.16
3       | 0.02   | 0.03   | 0.07   | 0.04   | 0.16  | 2.40 → 2
        | 2672   | 3891   | 3410   | --     |       | Ȳ_3. = 3455.44
4       | 0.02   | 0.05   | 0.08   | 0.05   | 0.20  | 3.00 → 3
        | 2512   | 3762   | 3268   | --     |       | Ȳ_4. = 3398.40
5       | 0.02   | 0.03   | 0.07   | 0.04   | 0.16  | 2.40 → 2
        | 2381   | 3520   | 3297   | --     |       | Ȳ_5. = 3331.06
6       | 0.02   | 0.01   | 0.06   | 0.08   | 0.17  | 2.55 → 3
        | 2381   | 3210   | 3542   | --     |       | Ȳ_6. = 3527.06
P_.j    | 0.12   | 0.18   | 0.42   | 0.28   | 1.00  | n = 15

Source: Singh, Pannu, Singh, Singh, and Kaur (1996).

After n, n_i., and n_.j have been determined, construct a square matrix having n sub-rows (s = 1, 2, ..., n) and n sub-columns (t = 1, 2, ..., n), forming n² sub-cells, by following Bryant, Hartley, and Jessen (1960). By combining n_i. adjacent sub-rows for i = 1, 2, ..., R, form the R rows, and by combining n_.j adjacent sub-columns for j = 1, 2, ..., C, form the C columns. Following these notations, the intersection of the i-th row and j-th column will contain a single cell consisting of n_i. n_.j squares or sub-cells. This method of allocation is called random allocation.

Table 8.7.3. A 15 x 15 grid for allocation of total sample size in a two-way layout.

Steps for two-way random allocation: The following steps are needed for allocating a sample of n units into a two-way stratification:

Step 1. Construct the n × n grid of sub-cells as above;
Step 2. In the first sub-row, one sub-column is selected at random and marked;
Step 3. In the second sub-row, one of the remaining n − 1 sub-columns is selected at random and marked, and so on;
Step 4. At the end of the marking process, each sub-row and each sub-column will contain exactly one mark. This completes the allocation of the n observations to the RC cells;
Step 5. The number of marks within the boundaries of the (i, j)-th cell represents the number of observations to be randomly selected from that cell, and n_ij denotes the number of observations in it.
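The marking procedure in Steps 1--5 amounts to drawing one random permutation of the n sub-columns. The following is a small sketch (my own, not the book's code) under the notation above; the column marginals (2, 3, 6, 4) used in the example call are assumed roundings of the column proportions (0.12, 0.18, 0.42, 0.28) times n = 15.

```python
# Bryant--Hartley--Jessen random allocation: each sub-row gets one mark in a
# distinct sub-column (a random permutation), and n_ij counts the marks
# falling inside the (i, j)-th cell.
import random
from itertools import accumulate

def random_allocation(row_marginals, col_marginals, rng=random):
    n = sum(row_marginals)
    assert sum(col_marginals) == n
    perm = list(range(n))
    rng.shuffle(perm)            # sub-row s is marked in sub-column perm[s]
    row_edges = list(accumulate(row_marginals))
    col_edges = list(accumulate(col_marginals))
    def block(idx, edges):       # which row/column block contains this index
        return next(k for k, edge in enumerate(edges) if idx < edge)
    n_ij = [[0] * len(col_marginals) for _ in row_marginals]
    for s in range(n):
        n_ij[block(s, row_edges)][block(perm[s], col_edges)] += 1
    return n_ij

# n_i. = (2, 3, 2, 3, 2, 3) from Table 8.7.2; n_.j = (2, 3, 6, 4) assumed
alloc = random_allocation([2, 3, 2, 3, 2, 3], [2, 3, 6, 4])
print(alloc)
```

By construction every realized allocation has row sums n_i. and column sums n_.j, which is property (b) below.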

Properties of random allocation method: The random allocation method has the following properties:

(a) Every sub-cell has an equal chance of 1/n of receiving a mark;
(b) The values of n_ij so constructed satisfy n_i. = nP_i. and n_.j = nP_.j;
(c) E(n_ij) = nP_i. P_.j = nP_ij.

In such situations we have the following theorem:


Theorem 8.7.1. An unbiased estimator of the population mean Ȳ under two-way stratification is given by
ȳ_twst = (1/n) Σ_{i=1}^{R} Σ_{j=1}^{C} n_ij ȳ_ij   (8.7.1)

with variance given by

V(ȳ_twst) = Σ_{i=1}^{R} Σ_{j=1}^{C} [n_i. n_.j {n(n−1)N_ij − (n−n_i.)(n−n_.j) − (n−1)n_i. n_.j} / {n²(n−1)N_ij}] S_ij²
 + Σ_{i=1}^{R} Σ_{j=1}^{C} [n_i. n_.j (n−n_i.)(n−n_.j) / {n²(n−1)}] Ȳ_ij²
 + Σ_{i=1}^{R} Σ_{j=1}^{C} Σ_{j'≠j}^{C} [n_i. n_.j n_.j' (n_i. − n) / {n²(n−1)}] Ȳ_ij Ȳ_ij'
 + Σ_{i=1}^{R} Σ_{i'≠i}^{R} Σ_{j=1}^{C} [n_i. n_i'. n_.j (n_.j − n) / {n²(n−1)}] Ȳ_ij Ȳ_i'j
 + Σ_{i=1}^{R} Σ_{i'≠i}^{R} Σ_{j=1}^{C} Σ_{j'≠j}^{C} [n_i. n_.j n_i'. n_.j' / {n²(n−1)}] Ȳ_ij Ȳ_i'j'.
   (8.7.2)

Raghunandanan and Bryant (1971) have considered the problem of multi-way stratification for the random sample allocation mechanism. Chernick and Wright (1983) have proposed a technique of systematically allocating a sample to the strata formed by double stratification. The method can proportionally allocate the sample along each variable of stratification. If there are R and C strata for the first and second variable of stratification, respectively, the technique requires that the total sample size be at least as large as max(R, C). They suggested an unbiased estimator of the population mean and obtained the variance expression. The technique has been found to be comparable with the random allocation technique proposed by Bryant, Hartley, and Jessen (1960).

8.8 STRATUM BOUNDARIES FOR MULTI-VARIATE POPULATIONS
Sadasivan and Aggarwal (1978) considered the problem of optimum points of stratification with two study variates, X and Y (say). The exact equations given by Dalenius (1950) for the univariate case have been extended to the bivariate case. Also, for minimizing the generalized variance (of the variance--covariance matrix) under Neyman allocation, a set of equations giving optimum points of stratification is discussed for two cases:
(a) when the correlation coefficient between the two variables is constant from stratum to stratum;
(b) when the correlation coefficient varies from stratum to stratum.

While extending the univariate procedure of Dalenius (1950) and Dalenius and Gurney (1951, 1957) to the two-dimensional situation the following assumptions are made:
(a) the variates X and Y have a continuous joint p.d.f. f(x, y) in the finite range x_0 ≤ x ≤ x_{L1}; y_0 ≤ y ≤ y_{L2};
(b) the population is finite;
(c) the population is divided into L1 × L2 strata by determining the strata boundaries x_1, x_2, ..., x_{L1−1} for X and y_1, y_2, ..., y_{L2−1} for Y such that the generalized variance of the means of the variables for this stratified sample under Neyman's allocation is minimum;
(d) the generalized variance can be set out as
GV = |σ_x²  σ_xy; σ_xy  σ_y²| = σ_x² σ_y² − σ_xy²,   (8.8.1)
where σ_x², σ_y² and σ_xy denote the variances of X and Y and the covariance, respectively, under Neyman allocation with fixed cost per unit.

Before proceeding further let us first define some mathematical notation as follows.

Joint p.d.f. of two study variables X and Y: f(x, y).

Area under the (h, k)-th cell:
W_hk = ∫_{x_{k−1}}^{x_k} ∫_{y_{h−1}}^{y_h} f(x, y) dy dx.

Marginal density at x_k corresponding to the h-th cell of y:
f_hk(x_k) = ∫_{y_{h−1}}^{y_h} f(x_k, y) dy.

Mean value of x in the (h, k)-th cell:
μ_hkx = (1/W_hk) ∫_{x_{k−1}}^{x_k} ∫_{y_{h−1}}^{y_h} x f(x, y) dy dx.

Variance of x in the (h, k)-th cell:
σ_hkx² = (1/W_hk) ∫_{x_{k−1}}^{x_k} ∫_{y_{h−1}}^{y_h} x² f(x, y) dy dx − μ_hkx².

Conditional mean of x in the (h, k)-th cell at a point y_h:
μ_hkx|y_h = (1/f_hk(y_h)) ∫_{x_{k−1}}^{x_k} x f(x, y_h) dx, where f_hk(y_h) = ∫_{x_{k−1}}^{x_k} f(x, y_h) dx.

Covariance between x and y in the (h, k)-th cell:
σ_hkxy = (1/W_hk) ∫_{x_{k−1}}^{x_k} ∫_{y_{h−1}}^{y_h} xy f(x, y) dy dx − μ_hkx μ_hky.

Second moment of x in the (h, k)-th cell at a point y_h:
μ_2hkx|y_h = (1/f_hk(y_h)) ∫_{x_{k−1}}^{x_k} x² f(x, y_h) dx.

Approximate variance of x under the Neyman allocation with cost per unit constant:
σ_x² = (1/n)(Σ_{h=1}^{L2} Σ_{k=1}^{L1} W_hk σ_hkx)².

Approximate variance of y under the Neyman allocation with cost per unit constant:
σ_y² = (1/n)(Σ_{h=1}^{L2} Σ_{k=1}^{L1} W_hk σ_hky)².

Approximate covariance between x and y under the Neyman allocation with cost per unit constant:
σ_xy = (1/n)(Σ_{h=1}^{L2} Σ_{k=1}^{L1} W_hk √σ_hkxy)².

Thus, as in the case of univariate strata boundaries, we have the following differentiation results for bivariate distributions to find the set of minimal equations:
∂W_hk/∂x_k = ∫_{y_{h−1}}^{y_h} f(x_k, y) dy = f_hk(x_k);
∂σ_hkx/∂x_k = [f_hk(x_k)/(2σ_hkx W_hk)] [(x_k − μ_hkx)² − σ_hkx²].

Thus
∂σ_x/∂x_k = Ψ(x, k, k) − Ψ(x, k, i),
where
Ψ(x, k, l) = (1/(2√n)) Σ_h [f_hk(x_k)/σ_hlx] [(x_k − μ_hlx)² + σ_hlx²], for l = i or k, with i = k + 1.
Similarly
∂σ_y/∂x_k = Ψ₁(x, k, k) − Ψ₁(x, k, i),
where
Ψ₁(x, k, l) = (1/(2√n)) Σ_h [f_hk(x_k)/σ_hly] [μ_2hky|x_k + μ_hly² + σ_hly² − 2μ_hly μ_hky|x_k],
and
∂σ_xy/∂x_k = Ψ₂(x, k, k) − Ψ₂(x, k, i),
where
Ψ₂(x, k, l) = (√σ_xy/√n) Σ_h [f_hk(x_k)/√σ_hlxy] [x_k(μ_hky|x_k − μ_hly) + μ_hly μ_hlx + σ_hlxy − μ_hlx μ_hky|x_k].

Now we have the following theorem.

Theorem 8.8.1. The set of minimal equations for minimizing the generalized variance in (8.8.1) is given by
σ_y² (∂σ_x²/∂x_k) + σ_x² (∂σ_y²/∂x_k) − 2σ_xy (∂σ_xy/∂x_k) = 0, k = 1, 2, ..., L1 − 1,   (8.8.2)
and
σ_y² (∂σ_x²/∂y_h) + σ_x² (∂σ_y²/∂y_h) − 2σ_xy (∂σ_xy/∂y_h) = 0, h = 1, 2, ..., L2 − 1.   (8.8.3)

Using the above results (8.8.2) and (8.8.3) we have the set of minimal equations:
φ(x, k, k) = φ(x, k, i), for i = k + 1; k = 1, 2, ..., L1 − 1 and h = 1, 2, ..., L2 − 1,   (8.8.4)
where
φ(x, k, l) = σ_x² σ_y Σ_h [f_hk(x_k)/σ_hly] [μ_2hky|x_k + μ_hly² + σ_hly² − 2μ_hky|x_k μ_hly]
 + σ_y² σ_x Σ_h [f_hk(x_k)/σ_hlx] [(x_k − μ_hlx)² + σ_hlx²]
 − 2(σ_xy)^{3/2} Σ_h [f_hk(x_k)/√σ_hlxy] [x_k(μ_hky|x_k − μ_hly) + μ_hlx μ_hly + σ_hlxy − μ_hlx μ_hky|x_k],
and (8.8.4) implies that



φ(y, h, h) = φ(y, h, j), for j = h + 1, h = 1, 2, ..., L2 − 1 and k = 1, 2, ..., L1 − 1,   (8.8.5)
where
φ(y, h, l) = σ_y² σ_x Σ_k [f_hk(y_h)/σ_lky] [(y_h − μ_lky)² + σ_lky²]
 + σ_x² σ_y Σ_k [f_hk(y_h)/σ_lkx] [μ_2hkx|y_h + μ_lkx² + σ_lkx² − 2μ_lkx μ_hkx|y_h]
 − 2(σ_xy)^{3/2} Σ_k [f_hk(y_h)/√σ_lkxy] [y_h(μ_hkx|y_h − μ_lkx) + μ_lkx μ_lky + σ_lkxy − μ_lky μ_hkx|y_h].

Theorem 8.8.2. If the correlation coefficient is assumed to be constant from stratum to stratum, then minimization of the generalized variance is equivalent to the minimization of the covariance, and the set of minimal equations reduces to
g(x, k, k) = g(x, k, i), i = k + 1; k = 1, 2, ..., L1 − 1,   (8.8.6)
g(y, h, h) = g(y, h, j), j = h + 1; h = 1, 2, ..., L2 − 1,   (8.8.7)
where
g(x, k, l) = Σ_h [f_hk(x_k)/√σ_hlxy] [x_k(μ_hky|x_k − μ_hly) + μ_hlx μ_hly + σ_hlxy − μ_hlx μ_hky|x_k]
and
g(y, h, l) = Σ_k [f_hk(y_h)/√σ_lkxy] [y_h(μ_hkx|y_h − μ_lkx) + μ_lkx μ_lky + σ_lkxy − μ_lky μ_hkx|y_h].

The approximate solution to the set of minimal equations given above can also be obtained by following the Ekman (1959) method, which reduces to the set of simplified minimal equations given by
Σ_h f_hk(x_k) √((x_k − x_{k−1})(y_h − y_{h−1})) = constant
and
Σ_k f_hk(y_h) √((x_k − x_{k−1})(y_h − y_{h−1})) = constant.

Rizvi, Gupta, and Singh (2000) also studied stratification based on two auxiliary variables.

Example 8.8.1. The death rates in the United States over the period 1990 to 2065 for nine years and twenty-two age groups are shown in a 22 × 9 contingency table given as population 6 of the Appendix. Construct three homogeneous columns from the nine columns and five homogeneous rows from the 22 rows.

Solution. From population 6 the frequency distribution is given as shown in the following table.

Age group | 1990  | 1995  | 2000  | 2010  | 2020  | 2030  | 2040  | 2050  | 2065  | Row total | Cum. √(row total)
0--4      |   967 |   818 |   692 |   496 |   355 |   255 |   183 |   131 |    80 |     3977 |   63.06
5--9      |    19 |    16 |    14 |    10 |     7 |     5 |     4 |     3 |     2 |       80 |   72.00
10--14    |    20 |    17 |    15 |    11 |     8 |     6 |     4 |     3 |     2 |       86 |   81.28
15--19    |    67 |    62 |    57 |    48 |    40 |    34 |    28 |    24 |    18 |      378 |  100.72
20--24    |    86 |    78 |    71 |    58 |    48 |    40 |    33 |    27 |    20 |      461 |  122.19
25--29    |    84 |    75 |    68 |    54 |    44 |    35 |    28 |    23 |    16 |      427 |  142.85
30--34    |    97 |    87 |    78 |    62 |    50 |    40 |    32 |    25 |    18 |      489 |  164.97
35--39    |   138 |   124 |   111 |    90 |    72 |    58 |    47 |    38 |    27 |      705 |  191.52
40--44    |   221 |   201 |   182 |   150 |   124 |   102 |    84 |    69 |    52 |     1185 |  225.94
45--49    |   370 |   341 |   315 |   267 |   227 |   193 |   164 |   139 |   109 |     2125 |  272.04
50--54    |   613 |   572 |   533 |   464 |   403 |   351 |   305 |   265 |   215 |     3721 |  333.04
55--59    |   965 |   907 |   853 |   754 |   666 |   589 |   520 |   460 |   382 |     6096 |  411.12
60--64    |  1511 |  1432 |  1357 |  1218 |  1094 |   982 |   882 |   792 |   674 |     9942 |  510.83
65--69    |  2233 |  2119 |  2010 |  1810 |  1629 |  1466 |  1320 |  1188 |  1015 |    14790 |  632.44
70--74    |  3361 |  3187 |  3022 |  2718 |  2444 |  2198 |  1976 |  1777 |  1515 |    22198 |  781.43
75--79    |  4979 |  4693 |  4423 |  3930 |  3491 |  3102 |  2756 |  2448 |  2050 |    31872 |  959.96
80--84    |  7748 |  7323 |  6921 |  6182 |  5523 |  4933 |  4407 |  3936 |  3323 |    50296 | 1184.23
85--89    | 12267 | 11687 | 10609 | 10108 |  9177 |  8331 |  7564 |  6868 |  5942 |    82553 | 1471.55
90--94    | 19099 | 18341 | 16915 | 16246 | 14987 | 13827 | 12758 | 11774 | 10439 |   134386 | 1838.138
95--99    | 29744 | 28864 | 27188 | 26390 | 24869 | 23442 | 22102 | 20844 | 19095 |   222538 | 2309.877
100--104  | 46334 | 45554 | 44053 | 43329 | 41933 | 40600 | 39325 | 38104 | 36364 |   375596 | 2922.736
105--109  | 72195 | 72100 | 71956 | 71906 | 71845 | 71831 | 71861 | 71930 | 72097 |   647721 | 3727.547
Col. total| 203118| 198598| 191443| 186301| 179036| 172420| 166383| 160868| 153455|          |
Cum. √(col. total) | 450.68 | 896.32 | 1333.84 | 1765.45 | 2188.62 | 2603.86 | 3011.76 | 3412.84 | 3804.57

From the above table Σ_r √f_r = 3727.547 and Σ_c √g_c = 3804.57, where f_r and g_c denote the row and column totals, respectively.

We would like to make a 5 × 3 contingency table of homogeneous cells. Divide Σ_r √f_r by the number of rows required (R = 5) and Σ_c √g_c by the number of columns required (C = 3).

Then the five points for grouping the 22 rows into five homogeneous groups or rows on the basis of cum √f are given by
745.50, 1491.0, 2236.5, 2982.0 and 3727.5.

The three points for grouping the nine columns into three homogeneous groups or columns on the basis of cum √g are given by
1268.19, 2536.38 and 3804.57.

The resultant 5 × 3 contingency table of homogeneous strata is

Age group | Years 1990 and 1995 | Years 2000, 2010 and 2020 | Years 2030, 2040, 2050 and 2065
0--69     |  14240 |  16615 |  13607
70--89    |  55245 |  68548 |  63126
90--99    |  96048 | 126595 | 134281
100--104  |  91888 | 129315 | 154393
105--109  | 144295 | 215707 | 287719
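The column grouping in Example 8.8.1 can be checked mechanically. The sketch below (mine, not the book's code) applies the cumulative square-root rule to the nine column totals of the table above: a column joins group k while its cumulative √(total) stays below the k-th multiple of Σ_c √g_c / 3.

```python
# Cumulative square-root grouping of the nine year columns of Example 8.8.1.
import math
from itertools import accumulate

def cum_root_groups(totals, n_groups):
    cum = list(accumulate(math.sqrt(t) for t in totals))
    cut = cum[-1] / n_groups             # 3804.57 / 3 = 1268.19 here
    return [min(int(c / cut), n_groups - 1) for c in cum]

years = [1990, 1995, 2000, 2010, 2020, 2030, 2040, 2050, 2065]
totals = [203118, 198598, 191443, 186301, 179036,
          172420, 166383, 160868, 153455]          # column totals above

groups = cum_root_groups(totals, 3)
print(groups)   # [0, 0, 1, 1, 1, 2, 2, 2, 2]
for g in range(3):
    print(g, [y for y, gr in zip(years, groups) if gr == g])
```

The printed grouping {1990, 1995}, {2000, 2010, 2020}, {2030, 2040, 2050, 2065} agrees with the column groups of the resultant 5 × 3 table.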

Ahsan and Khan (1982) considered the problem of allocation to minimize the total budgetary cost of a survey subject to desired precisions assigned to the posterior variances of the population means when the sampling is multipurpose, by assuming that the overhead cost is proportional to the number of individuals contacted in a stratum. Consider a population divided into L strata and let p variables be defined on each unit of the population under study. Let Ȳ_hj (h = 1, 2, ..., L; j = 1, 2, ..., p) be the unknown population means of the observations on the j-th variable within the h-th stratum. Let W_h (h = 1, 2, ..., L) be the proportion of the population elements falling in the h-th stratum, and let ȳ_hj be the sample mean for the j-th variable in the h-th stratum. A simple way of representing the case under consideration is given below.




Assume W and Ȳ_j are the row vectors of the values W_h and Ȳ_hj, respectively, defined as W = (W_1, W_2, ..., W_L) and Ȳ_j = (Ȳ_1j, Ȳ_2j, ..., Ȳ_Lj). The overall population mean Ȳ_j for the j-th variable is given by

Ȳ_j = W Ȳ_j^t,   (8.9.1)

where t stands for the transpose of the vector. A stratified sample is determined by a set of numbers (n_1, n_2, ..., n_L), where n_h > 0 (h = 1, 2, ..., L) is the number of observations drawn independently from the h-th stratum. Let σ_hj² be the known variance of the j-th variable in the h-th stratum; then, for a given population mean Ȳ_j, the sample mean ȳ_j of the j-th variable has a conditional L-variate normal distribution with mean vector Ȳ_j and diagonal variance--covariance matrix M_j (j = 1, 2, ..., p) given by
M_j = diag(σ_1j²/n_1, σ_2j²/n_2, ..., σ_Lj²/n_L).   (8.9.2)

A priori information about the within-stratum population means Ȳ_hj is assumed to be available in terms of an L-variate normal distribution of Ȳ_j with known mean vector μ_j and a non-singular diagonal variance--covariance matrix
A_j = diag(a_11, a_22, ..., a_LL).   (8.9.3)

Raiffa and Schlaifer (1961) have shown that the posterior distribution of the population means Ȳ_j for a given stratified sample with n_h > 0 and observed sample means ȳ_j is L-variate normal with mean vector

μ*_j = [ȳ_j W_j + μ_j V_j][W_j + V_j]^{-1}   (8.9.4)

and variance--covariance matrix
Σ_j = [W_j + V_j]^{-1},   (8.9.5)

where W_j = M_j^{-1} and V_j = A_j^{-1}.

The overall population mean of the j-th variable, Ȳ_j, is a linear combination of the Ȳ_1j, ..., Ȳ_Lj and has a univariate normal prior distribution with mean b_j = W μ_j^t and variance a_j = W A_j W^t, and a univariate normal posterior distribution with mean b*_j = W μ*_j^t and variance a*_j = W Σ_j W^t.

Let c_h be the overhead cost of approaching an individual of the h-th stratum for measurements and c_hj be the cost associated with the measurement of the j-th variable on an individual in the h-th stratum; then the total cost of the survey is
C = Σ_{h=1}^{L} c_h n_h + Σ_{h=1}^{L} Σ_{j=1}^{p} c_hj n_hj,   (8.9.6)

where n_h = max_j(n_hj), h = 1, 2, ..., L; j = 1, 2, ..., p. Our aim is to draw a sample which attains the desired precision assigned to the posterior variances of Ȳ_j. For this purpose we require
a*_j ≤ w_j, j = 1, 2, ..., p,   (8.9.7)

where the values of w_j are the required upper limits on the posterior variances of Ȳ_j, j = 1, 2, ..., p. Note that we have
a*_j = Σ_{h=1}^{L} W_h² / {v_j(h,h) + n_hj/σ_hj²},   (8.9.8)

where v_j(h,h) are the diagonal elements of V_j. Using (8.9.8) in (8.9.7) we have
Σ_{h=1}^{L} W_h² / {v_j(h,h) + n_hj/σ_hj²} ≤ w_j, j = 1, 2, ..., p.   (8.9.9)
Thus the problem reduces to the following question:


( a ) Determine the values of nhj > 0 which minimize
L L P
c= L.chnh+L. L. Chjnhj (8.9.10)
h=l h=l j=l
subject to
f wl IL(h,h) + (nhj I O'lj )}~ Wj;
h=l
(8.9.11)

( b ) Under the transformations given by

nhj = [Y~j - vj(hh) ]O'lj and nh = mIx[ {Y~j - vj(hh) }O'lj]


the values of Chj which minimizes

L p Chj L
L = L. L. - + L. L.Ch - - vj(hh)
h=l j=l Y hj h=l jelj Ylif
P [I }2hi : (8.9.12)

where J_j is such that the maximum of n_hj for each h ∈ J_j is attained for the j-th variable, subject to
Σ_{h=1}^{L} W_h² y_hj ≤ w_j, j = 1, 2, ..., p.
Solutions to these questions can be obtained by following Kuhn and Tucker (1952) and Kokan and Khan (1967). Sekkappan (1981) also considered a problem of optimum allocation in stratified sampling from a finite population with p study variables, using the superpopulation approach put forth by Ericson (1969). He also studied allocation at the second phase using information obtained from the first phase. The results obtained by Khan (1976) and Draper and Guttman (1968a, 1968b) are shown to be special cases.

We have seen that in stratified sampling the population has to be divided into L strata which are homogeneous within themselves and whose means are widely different. The strata weights W_h can be used to obtain unbiased estimates of the population mean or total. Sometimes these strata weights are not known; then the technique of two-phase sampling can be used to obtain estimates of these weights. Following the two-phase sampling scheme, we select a preliminary large sample of m units by SRSWOR sampling to estimate the strata weights. The m units in the first phase sample are then stratified into L strata with m_h units in the h-th stratum, such that Σ_{h=1}^{L} m_h = m. A second phase sample of size n_h < m_h is selected from the h-th stratum by SRSWOR sampling such that Σ_{h=1}^{L} n_h = n. Let ŵ_h = m_h/m be an estimator of the original unknown weight W_h = N_h/N. In such situations an estimator of the population mean Ȳ is given by
ȳ_std = Σ_{h=1}^{L} ŵ_h ȳ_h.   (8.10.1)
Then we have the following theorems:

Theorem 8.10.1. The estimator ȳ_std is an unbiased estimator of the population mean Ȳ, provided that n_h does not depend on ŵ_h.
Proof. We have
E(ȳ_std) = E₁[E₂{Σ_{h=1}^{L} ŵ_h ȳ_h | ŵ_h}] = E₁[Σ_{h=1}^{L} ŵ_h Ȳ_h] = Σ_{h=1}^{L} W_h Ȳ_h = Ȳ.
Hence the theorem.
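As a small numerical illustration (mine, with made-up stratum values), the estimator (8.10.1) and the large-m variance estimator (8.10.5), given below, are direct weighted sums over the strata:

```python
# Double sampling for stratification: estimated weights w_h = m_h / m from
# the first-phase counts, combined with second-phase stratum means.

def ybar_std(m_h, ybar_h):
    m = sum(m_h)
    return sum(mh / m * yh for mh, yh in zip(m_h, ybar_h))

def v_u(m_h, s2_h, n_h):
    # variance estimator (8.10.5), valid when the n_h are small relative to m
    m = sum(m_h)
    return sum((mh / m) ** 2 * s2 / n for mh, s2, n in zip(m_h, s2_h, n_h))

m_h = [40, 35, 25]            # hypothetical first-phase stratum counts
ybar_h = [2.1, 1.6, 1.2]      # hypothetical second-phase stratum means
print(ybar_std(m_h, ybar_h))  # 0.40*2.1 + 0.35*1.6 + 0.25*1.2 = 1.70
print(v_u(m_h, [0.5, 0.8, 0.3], [8, 7, 5]))
```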

Theorem 8.10.2. The variance of the estimator ȳ_std is given by
V(ȳ_std) = Σ_{h=1}^{L} ((1 − f_h)/n_h) W_h² S_hy² + (N/(N − 1))(1/m − 1/N) {Σ_{h=1}^{L} ((1 − f_h)(1 − W_h)/n_h) S_hy² + Σ_{h=1}^{L} W_h (Ȳ_h − Ȳ)²}.   (8.10.2)

Proof. Note that the random sample sizes m_h follow a multinomial distribution with parameters (m, W_1, W_2, ..., W_L); applying E₁V₂ + V₁E₂ over the two phases and using the moments of the multinomial distribution yields the result.

Corollary 8.10.1. For large N and n_h = nŵ_h, the variance V(ȳ_std) reduces to
V(ȳ_std) ≈ (1/n) Σ_{h=1}^{L} W_h S_hy² + (1/m) Σ_{h=1}^{L} W_h (Ȳ_h − Ȳ)².   (8.10.3)

Corollary 8.10.2. An unbiased estimator of V(ȳ_std) is given by
v_u(ȳ_std) = (m/(m − 1)) Σ_{h=1}^{L} [{ŵ_h − ((N − m)/(N − 1))(ŵ_h/m)} (s_hy²/n_h) + ((N − m)/(N − 1))(ŵ_h/m)(ȳ_h − ȳ_std)²].   (8.10.4)

For more details one can refer to Raj (1968).



Corollary 8.10.3. If the n_h are small compared to the size of the preliminary large sample m, then an unbiased estimator of V(ȳ_std) is given by
v_u(ȳ_std) = Σ_{h=1}^{L} ŵ_h² s_hy² / n_h.   (8.10.5)
Now we have the following theorem.

Theorem 8.10.3. The minimum variance of the estimator ȳ_std under optimum allocation is given by
Min.V(ȳ_std) = (√(C₁V₂) + √(C₂V₁))² / (C − C₀),   (8.10.6)

where V₁ = (Σ_{h=1}^{L} W_h S_hy)² and V₂ = Σ_{h=1}^{L} W_h (Ȳ_h − Ȳ)².

Proof. The cost function for double sampling, as defined in Chapter 6, is given by
C = C₀ + mC₁ + nC₂.   (8.10.7)

The variance of the estimator ȳ_std can be approximated by
V(ȳ_std) = V₁/n + V₂/m.   (8.10.8)

Thus the Lagrange function is given by
L = V₁/n + V₂/m + λ(C − C₀ − mC₁ − nC₂).

On differentiating L with respect to m and n, respectively, and equating to zero, the optimum sample sizes are
n = √V₁ (C − C₀) / [√C₂ (√(C₁V₂) + √(C₂V₁))] and m = √V₂ (C − C₀) / [√C₁ (√(C₁V₂) + √(C₂V₁))].   (8.10.9)

On substituting (8.10.9) in (8.10.8) it will lead to the minimum variance as
Min.V(ȳ_std) = √(C₂V₁)(√(C₁V₂) + √(C₂V₁))/(C − C₀) + √(C₁V₂)(√(C₁V₂) + √(C₂V₁))/(C − C₀) = (√(C₁V₂) + √(C₂V₁))²/(C − C₀).
Hence the theorem.

Example 8.10.1. We wish to estimate the yield/hectare of the tobacco crop in the world during 2001. The number of countries on each continent growing this crop during 2001 is unknown. Population 5 in the Appendix shows the yield/hectare of the tobacco crop along with the number of countries on each continent growing this crop. We wish to apply stratified random sampling to estimate the average yield/hectare of the tobacco crop in the world. Using information from 1998 about this crop in different countries, estimate the first phase and second phase sample sizes for stratification.
Given: C = $1500, C₀ = $1000, C₁ = $2 and C₂ = $10.

Solution. Using information from population 5 we have the following table, where Ȳ = 1.5507:

N_h | W_h      | Ȳ_h   | S_hy²     | S_hy     | W_h S_hy | W_h(Ȳ_h − Ȳ)²
 6  | 0.056604 | 1.913 | 0.0268266 | 0.163788 | 0.009271 | 0.007429884
 6  | 0.056604 | 1.388 | 0.2180900 | 0.467001 | 0.026434 | 0.001498375
 8  | 0.075472 | 2.556 | 0.3469980 | 0.589065 | 0.044458 | 0.076273818
10  | 0.094340 | 1.549 | 0.2345600 | 0.484314 | 0.045690 | 0.000000273
12  | 0.113208 | 1.832 | 0.5821400 | 0.762981 | 0.086375 | 0.008958078
 4  | 0.037736 | 1.470 | 0.1531000 | 0.391280 | 0.014765 | 0.000245754
30  | 0.283019 | 1.115 | 0.3438500 | 0.586387 | 0.165959 | 0.053726742
17  | 0.160377 | 1.381 | 0.3785500 | 0.615264 | 0.098674 | 0.004618562
10  | 0.094340 | 1.721 | 2.0183000 | 1.420669 | 0.134025 | 0.002736046
 3  | 0.028302 | 2.086 | 0.9746000 | 0.987218 | 0.027940 | 0.008109795
Sum |          |       |           |          | 0.653592 | 0.163597328

From the above table we have
Σ_{h=1}^{L} W_h S_hy = 0.653592 and Σ_{h=1}^{L} W_h (Ȳ_h − Ȳ)² = 0.163597328.
Thus we have
V₁ = (0.653592)² = 0.427182 and V₂ = 0.163597328.

Hence the optimum second and first phase sample sizes, respectively, are given by
n = √V₁ (C − C₀) / [√C₂ (√(C₁V₂) + √(C₂V₁))] = √0.427182 (1500 − 1000) / [√10 (√(2 × 0.16359) + √(10 × 0.42718))] = 39.16 ≈ 40
and
m = √V₂ (C − C₀) / [√C₁ (√(C₁V₂) + √(C₂V₁))] = √0.163597 (1500 − 1000) / [√2 (√(2 × 0.16359) + √(10 × 0.42718))] = 54.19 ≈ 55.
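The arithmetic can be verified with a short script (mine, not the book's) implementing the optimum-size expressions in (8.10.9):

```python
# Numerical check of the optimum first- and second-phase sample sizes of
# Example 8.10.1.
import math

def optimum_sizes(V1, V2, C, C0, C1, C2):
    # n, m minimizing V1/n + V2/m subject to C = C0 + m*C1 + n*C2
    denom = math.sqrt(C1 * V2) + math.sqrt(C2 * V1)
    n = math.sqrt(V1) * (C - C0) / (math.sqrt(C2) * denom)
    m = math.sqrt(V2) * (C - C0) / (math.sqrt(C1) * denom)
    return n, m

n, m = optimum_sizes(0.427182, 0.163597328, 1500, 1000, 2, 10)
print(round(n, 2), round(m, 2))           # 39.16 54.19
# the optimum pair exhausts the budget: C0 + m*C1 + n*C2 = C
print(round(1000 + 2 * m + 10 * n, 2))    # 1500.0
```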

Holt and Smith (1979) showed that post-stratification is potentially more efficient than stratification, since in order to maximise the gain in precision the stratification factors can be chosen after sampling, in different ways for different sets of variables, e.g., age, sex, etc. This technique is found to be most practicable for surveys where individual responses may be expected to vary with age, sex, occupation, education, state, country, race, etc. Usually none of these variables is available for stratification at the individual level prior to sampling. When the analysis is carried out for the realized post-stratum sample sizes, the situation is called conditional post-stratification. In some situations, censuses may provide information on all of these variables at the aggregate level; in other situations, this aggregate level information may not be available, and the post-stratum sample sizes are treated as random; this is called unconditional post-stratification. We will discuss both situations.

8.11.1 CONDITIONAL POST-STRATIFICATION

Conditional post-stratification can be defined as a sampling technique in which we select a simple random sample with or without replacement. Once we know the strata sample sizes n_h ≥ 2 (h = 1, 2, ..., L) such that n = Σ_{h=1}^{L} n_h, we use the traditional stratified estimator. If any n_h < 2, then we merge the h-th stratum with another, j-th, stratum such that the composition is homogeneous. Now we have the following theorems.

Theorem 8.11.1. An unbiased estimator of the population mean, Ȳ, is given by
ȳ_pst = Σ_{h=1}^{L} W_h ȳ_h.   (8.11.1)
Proof. We have
E(ȳ_pst) = E₁E₂[ȳ_pst | n_h] = E₁E₂[Σ_{h=1}^{L} W_h ȳ_h | n_h] = E₁[Σ_{h=1}^{L} W_h E₂(ȳ_h | n_h)] = E₁[Σ_{h=1}^{L} W_h Ȳ_h] = E₁(Ȳ) = Ȳ.
Hence the theorem.

Theorem 8.11.2. The conditional variance of the estimator ȳ_pst for fixed n_h is given by
V(ȳ_pst | n_h) = Σ_{h=1}^{L} W_h² (1/n_h − 1/N_h) S_hy².   (8.11.2)
Proof. Conditionally on the realized n_h, the ȳ_h are independent SRSWOR means, so
V(ȳ_pst | n_h) = Σ_{h=1}^{L} W_h² V(ȳ_h | n_h) = Σ_{h=1}^{L} W_h² {1/n_h − 1/N_h} S_hy².   (8.11.3)
Hence the theorem.
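A small sketch (mine, with hypothetical numbers) of the conditional post-stratified estimator (8.11.1) and the conditional variance in (8.11.2), using sample variances s_hy² in place of the unknown S_hy²:

```python
# Conditional post-stratification: known weights W_h = N_h / N, realized
# post-stratum sample sizes n_h.

def post_stratified(W, ybar, n_h, N_h, s2):
    est = sum(w * y for w, y in zip(W, ybar))
    var = sum(w * w * (1 / n - 1 / N) * s
              for w, n, N, s in zip(W, n_h, N_h, s2))
    return est, var

est, var = post_stratified(
    W=[0.5, 0.3, 0.2],        # stratum weights N_h / N (hypothetical)
    ybar=[10.0, 12.0, 8.0],   # post-stratum sample means
    n_h=[20, 12, 8],          # realized post-stratum sample sizes
    N_h=[500, 300, 200],      # stratum population sizes
    s2=[4.0, 9.0, 1.0],       # stratum sample variances s_hy^2
)
print(est, var)               # 10.2 and about 0.118
```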

Espejo and Pineda (1997) proposed an estimator of the variance of ȳ_pst as
v̂(ȳ_pst) = ((N − 1)/(N − n)) v̂(ȳ) − Σ_{h=1}^{L} (W_h/n_h) Σ_{i=1}^{n_h} y_hi² + ȳ_pst²,   (8.11.4)
where
v̂(ȳ) = Σ_{h=1}^{L} (W_h/n_h) Σ_{i=1}^{n_h} y_hi² − ȳ_pst² + Σ_{h=1}^{L} W_h² s_hy²/n_h.

8.11.2 UNCONDITIONAL POST-STRATIFICATION

Unconditional post-stratification is a mechanism for the analysis of the exact variance of the post-stratification estimators when we do not know the stratum sample sizes n_h ≥ 2, h = 1, 2, ..., L, with n = Σ_{h=1}^{L} n_h being the sample size from a finite population. We assume that if n_h < 2 we collapse the h-th stratum with another, j-th, stratum of homogeneous composition. Thus the n_h are random before selecting the sample, and we have the following theorem:
Theorem 8.11.3. The unconditional variance of the post-stratified estimator ȳ_pst under SRSWOR is
V(ȳ_pst) ≈ (1/n − 1/N) Σ_{h=1}^{L} W_h S_hy² + (1/n²) Σ_{h=1}^{L} (1 − W_h) S_hy².   (8.11.5)

Proof. We have
V(ȳ_pst) = E₁V₂(ȳ_pst | n_h) + V₁E₂(ȳ_pst | n_h) = E₁[Σ_{h=1}^{L} W_h² ((N_h − n_h)/(N_h n_h)) S_hy²] + V₁(Ȳ)
         = E₁[Σ_{h=1}^{L} W_h² (1/n_h − 1/N_h) S_hy²] = Σ_{h=1}^{L} W_h² {E₁(1/n_h) − 1/N_h} S_hy².   (8.11.6)

Following Stephan (1945) we have
E₁(1/n_h) ≈ 1/(nW_h) + (1 − W_h)/(n²W_h²).   (8.11.7)
On substituting (8.11.7) in (8.11.6) we have the theorem.
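Stephan's approximation (8.11.7) can be checked by simulation. The following is a quick sketch (mine, not from the book), treating n_h as a binomial count with success probability W_h and discarding the negligibly rare empty post-strata:

```python
# Monte Carlo check of Stephan's approximation for E(1/n_h).
import random

def mc_mean_inverse(n, W_h, reps=20000, rng=random.Random(1)):
    total, kept = 0.0, 0
    for _ in range(reps):
        n_h = sum(rng.random() < W_h for _ in range(n))  # binomial(n, W_h)
        if n_h > 0:               # 1/n_h is undefined for empty post-strata
            total += 1.0 / n_h
            kept += 1
    return total / kept

n, W_h = 100, 0.3
approx = 1 / (n * W_h) + (1 - W_h) / (n * W_h) ** 2   # right side of (8.11.7)
print(mc_mean_inverse(n, W_h), approx)  # the two values agree closely
```

With n = 100 and W_h = 0.3 both quantities are near 0.0341, illustrating how accurate the two-term expansion already is for moderate n.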
From (8.11.5) we observe that the variance of ȳ_pst consists of two components. The first component is the variance of the usual stratified estimator under proportional allocation, and the second component is the additional variance due to post-stratification, since we could not allocate the sample in advance owing to lack of information. For large n the second component is small compared to the first component. Thus for large samples the post-stratification technique results in estimates which are nearly as good as those under proportional allocation. It is to be noted here that the unconditional formula of variance is useful when planning a survey (before the sample is selected), or when evaluating one methodology against another, whereas the conditional formula of variance under post-stratified sampling schemes is useful when constructing confidence intervals or testing hypotheses using survey results. More work related to post-stratification sampling can also be seen in Shukla and Dubey (2001), Agarwal and Panda (1993), Doss, Hartley, and Somayajulu (1979), Fuller (1966), Jagers (1986) and Jagers, Oden, and Trulsson (1985). Karlheinz (1990) has discussed the use of two-phase sampling for doing stratification. Chang, Han, and Hawkins (1999) have considered the problem of multiple inverse sampling as a sequential sampling procedure in a stratified population such that random samples are taken population-wide and continuously until a specified minimum number of observations is obtained in each stratum. A
732 Advanced sampling theory with applications

calibration approach for post-stratified sampling has been considered by Lundstom


and Sarndal (1999). The multiple inverse sampling procedure can be used in small
sample post-stratification to solve the empty post-strata problem. Estimation of
population mean for a deeply stratified population under post-stratified random
sampling has been discussed by Shukla and Trivedi (2001).

Example 8.11.1. We wish to estimate the average yield/hectare of the tobacco crop in the world. Select an SRSWOR sample of 40 countries from population 5. Record the yield and area under the crop. Treat each continent as a different stratum, with the number of countries in each continent known. Post-stratify the countries selected in the SRSWOR sample into the different continents. Merge the post-strata if required, using total area under the crop as an auxiliary variable. Obtain a 95% confidence interval for the average yield of the tobacco crop in the world.
Solution. After applying the remainder approach on the first three columns (N = 106) of the Pseudo-Random Number (PRN) Table 1 given in the Appendix we end up with the following 40 distinct random numbers:
038, 058, 071, 019, 077, 014, 061, 024, 096, 036, 094, 008, 041, 049, 002, 092, 053, 044, 030, 046, 075, 078, 015, 085, 034, 063, 066, 091, 009, 039, 086, 021, 074, 006, 101, 035, 090, 079, 054, and 065.
Thus the countries corresponding to these population unit numbers will be included
in the sample. Using information about population unit numbers from population 5,
we post-stratify the above sampled units as follows:

SRS of 40 countries across the


worm '"":; ··
038,058,071 ,019,077,014,061,024,096,036,094,008,
041,049,002,092,053,044,030,046,075,078,015,085,
034,063,066,091,009,039,086,021,074,006,101,035,
090,079,054,065

Stratum 9 10
no.
Units in 002, 008, 019, 024, 038,036 044,058,071 ,061 ,077,092 096,
the 006 009 014, 030, 041,034 046 049,053,075, 078,085 094
Sample 015 021 039,035 063,066,074 ,091,086 101
054,065 090 ,079
Post-strata
Fig. 8.11.1 Example of post-stratification.
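Mechanically, post-stratification is just bucketing the sampled unit labels by known stratum membership. A small Python sketch (the stratum upper bounds below are hypothetical, not the actual continent boundaries of population 5, where membership comes from the country list itself):

```python
from bisect import bisect_right

# Hypothetical cumulative upper bounds of unit labels for four strata.
upper_bounds = [10, 35, 80, 106]

def post_stratify(sample_ids, bounds):
    """Return a dict: stratum index (1-based) -> sorted sampled labels."""
    strata = {h: [] for h in range(1, len(bounds) + 1)}
    for uid in sample_ids:
        # index of the first bound >= uid gives the stratum
        h = bisect_right(bounds, uid - 1) + 1
        strata[h].append(uid)
    return {h: sorted(v) for h, v in strata.items()}

sample = [38, 58, 71, 19, 77, 14, 61, 24, 96, 36, 94, 8]
print(post_stratify(sample, upper_bounds))
```

Empty entries in the returned dict are exactly the empty post-strata that the merging step below must deal with.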
Chapter 8: Stratified and Post-Stratified Sampling 733

We observe after post-stratification that the 10th stratum remains empty. There are several ways of merging this particular empty stratum with another one. We shall use the information on the total area under the tobacco crop in the ten continents or strata, as follows:

Stratum No. | Total Area (hectares) | N_h | n_h
 1 |   19167.00 |  6 |  2
 2 |   87960.00 |  6 |  2
 3 |  146474.96 |  8 |  3
 4 |  149235.00 | 10 |  3
 5 |   65866.13 | 12 |  6
 6 |   13800.00 |  4 |  2
 7 |  350481.90 | 30 | 11
 8 | 2467759.10 | 17 |  8
 9 |  339761.00 | 10 |  3
10 |    3999.99 |  3 |  0

We also observe that each of the strata 1, 2, and 6 has only two units after post-stratification, which is the minimum requirement for variance estimation. We would prefer to merge stratum 10 with any one of these three. We make use of the known information on the total area under the tobacco crop in the different continents or strata. The total area in the 10th stratum is 3999.99 hectares, which is closest to the total area of 13800 hectares in the 6th stratum. Thus we prefer to merge the 10th (empty) post-stratum with the 6th post-stratum. After merging these post-strata, we have the following situation.
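The merging rule used here, attaching the empty post-stratum to the stratum whose known auxiliary total is closest, can be sketched in a few lines (the area figures are taken from the table above):

```python
# Known total area (hectares) under the crop in each stratum.
area = {1: 19167.00, 2: 87960.00, 3: 146474.96, 4: 149235.00, 5: 65866.13,
        6: 13800.00, 7: 350481.90, 8: 2467759.10, 9: 339761.00, 10: 3999.99}

def merge_target(empty_h, areas):
    """Pick the stratum whose auxiliary total is nearest to that of the
    empty post-stratum."""
    candidates = {h: a for h, a in areas.items() if h != empty_h}
    return min(candidates, key=lambda h: abs(candidates[h] - areas[empty_h]))

print(merge_target(10, area))
```

For stratum 10 this selects stratum 6, agreeing with the choice made in the text (13800 ha is the closest total to 3999.99 ha).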

New stratum no. | N_h | n_h | W_h | f_h | y_h values | ȳ_h | Σ y_hi^2 | s_h^2
1 |  6 |  2 | 0.0566 | 0.3333 | 1.79, 2.00 | 1.895 | 7.2041 | 0.0221
2 |  6 |  2 | 0.0566 | 0.3333 | 1.51, 1.29 | 1.400 | 3.9442 | 0.0242
3 |  8 |  3 | 0.0755 | 0.3750 | 2.14, 3.69, 2.80 | 2.876 | 26.0357 | 0.6050
4 | 10 |  3 | 0.0943 | 0.3000 | 1.64, 1.17, 2.09 | 1.633 | 8.4266 | 0.2116
5 | 12 |  6 | 0.1132 | 0.5000 | 0.92, 1.82, 0.84, 1.63, 2.48, 2.50 | 1.698 | 19.9217 | 0.5231
6 |  7 |  2 | 0.0660 | 0.2857 | 1.61, 1.18 | 1.395 | 3.9845 | 0.0924
7 | 30 | 11 | 0.2830 | 0.3666 | 2.51, 0.00, 1.22, 1.00, 0.87, 1.00, 1.63, 2.10, 0.96, 2.00, 0.93 | 1.293 | 23.3988 | 0.5016
8 | 17 |  8 | 0.1604 | 0.4706 | 0.88, 2.50, 1.22, 1.25, 1.33, 2.02, 1.80, 0.56 | 1.445 | 19.4782 | 0.3963
9 | 10 |  3 | 0.0943 | 0.3000 | 1.09, 1.50, 5.71 | 2.767 | 36.0422 | 6.5390

Stratum no. | W_h ȳ_h | (W_h/n_h) Σ y_hi^2 | W_h^2 s_h^2 / n_h
1 | 0.107257 | 0.203876030 | 0.000035399
2 | 0.079240 | 0.111620860 | 0.000038760
3 | 0.217138 | 0.655231783 | 0.001149550
4 | 0.153992 | 0.264876127 | 0.000627217
5 | 0.192214 | 0.375856073 | 0.001117188
6 | 0.092070 | 0.131488500 | 0.000201247
7 | 0.365919 | 0.601987309 | 0.003652058
8 | 0.231778 | 0.390537910 | 0.001274509
9 | 0.260928 | 1.132926487 | 0.019382664

Thus an estimate of the average yield/hectare of the tobacco crop in the world is given by

ȳ_pst = Σ_{h=1}^L W_h ȳ_h ≈ 1.7005.

Now we have

v(ȳ) = Σ_{h=1}^L (W_h/n_h) Σ_{i=1}^{n_h} y_hi^2 − ȳ_pst^2 + Σ_{h=1}^L W_h^2 s_h^2/n_h
= 3.868401079 − 1.7005^2 + 0.027478596 = 1.004179425.

Thus an estimate of the variance of ȳ_pst is given by

v(ȳ_pst) = ((N − 1)/(N − n)) v(ȳ) − Σ_{h=1}^L (W_h/n_h) Σ_{i=1}^{n_h} y_hi^2 + ȳ_pst^2
= ((106 − 1)/(106 − 40)) × 1.004179425 − 3.868401079 + 1.7005^2 = 0.62089.

Thus a 95% confidence interval for the average yield/hectare of the world tobacco crop is

ȳ_pst ± 1.96 √v(ȳ_pst), or 1.7005 ± 1.96 √0.6209, or [0.15608, 3.24491].

In order to find the percent relative efficiency (PRE) of post-stratified random sampling with respect to SRSWOR sampling, we have

V(ȳ_srswor) = ((N − n)/(Nn)) S_y^2 = ((106 − 40)/(106 × 40)) × 0.6323 = 0.0098424.

After post-stratification we have:

Stratum no. | N_h | W_h | n_h | S_hy^2 | W_h S_hy^2 | (1 − W_h) S_hy^2
1 |  6 | 0.0566 |  2 | 0.0268266 | 0.001518 | 0.0253080
2 |  6 | 0.0566 |  2 | 0.2180900 | 0.012344 | 0.2057460
3 |  8 | 0.0755 |  3 | 0.3469980 | 0.026198 | 0.3207900
4 | 10 | 0.0943 |  3 | 0.2345600 | 0.022116 | 0.2124409
5 | 12 | 0.1132 |  6 | 0.5821400 | 0.065898 | 0.5164210
6 |  7 | 0.0660 |  2 | 0.5100900 | 0.033666 | 0.4764240
7 | 30 | 0.2830 | 11 | 0.3438500 | 0.097309 | 0.2465400
8 | 17 | 0.1604 |  8 | 0.3785500 | 0.060719 | 0.3178300
9 | 10 | 0.0943 |  3 | 2.0183000 | 0.190326 | 1.8279740

Under post-stratified sampling the unconditional variance (8.11.5) of the estimator of the population mean is given by

V(ȳ_pst) ≈ (1/n − 1/N) Σ_{h=1}^L W_h S_hy^2 + (1/n^2) Σ_{h=1}^L (1 − W_h) S_hy^2
≈ (1/40 − 1/106) × 0.510094 + 4.149474/40^2 = 0.0105336.

Thus, the relative efficiency of post-stratified sampling with respect to SRSWOR sampling (the design effect) is given by

RE = (V(ȳ_srswor)/V(ȳ_pst)) × 100 = (0.0098424/0.0105336) × 100 = 93.44%.

In this particular example post-stratified sampling remains less efficient than simple random sampling.
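The whole computation of Example 8.11.1 — point estimate, variance estimate, confidence interval, and the unconditional-variance comparison — can be reproduced in a few lines of Python from the tables above (small rounding differences from the typeset figures are expected, since the tables were filled in with rounded intermediate values):

```python
import math

# (W_h, sample y-values) for the nine merged post-strata, from the tables above.
strata = [
    (0.0566, [1.79, 2.00]),
    (0.0566, [1.51, 1.29]),
    (0.0755, [2.14, 3.69, 2.80]),
    (0.0943, [1.64, 1.17, 2.09]),
    (0.1132, [0.92, 1.82, 0.84, 1.63, 2.48, 2.50]),
    (0.0660, [1.61, 1.18]),
    (0.2830, [2.51, 0.00, 1.22, 1.00, 0.87, 1.00, 1.63, 2.10, 0.96, 2.00, 0.93]),
    (0.1604, [0.88, 2.50, 1.22, 1.25, 1.33, 2.02, 1.80, 0.56]),
    (0.0943, [1.09, 1.50, 5.71]),
]
N, n = 106, 40

def s2(y):                                   # sample variance s_h^2
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / (len(y) - 1)

ypst = sum(W * sum(y) / len(y) for W, y in strata)            # point estimate
term1 = sum(W / len(y) * sum(v * v for v in y) for W, y in strata)
term2 = sum(W ** 2 * s2(y) / len(y) for W, y in strata)
v_bar = term1 - ypst ** 2 + term2
v_pst = (N - 1) / (N - n) * v_bar - term1 + ypst ** 2         # variance estimate
ci = (ypst - 1.96 * math.sqrt(v_pst), ypst + 1.96 * math.sqrt(v_pst))

# Unconditional variance (8.11.5) from the known stratum variances S_hy^2.
S2hy = [0.0268266, 0.2180900, 0.3469980, 0.2345600, 0.5821400,
        0.5100900, 0.3438500, 0.3785500, 2.0183000]
W = [w for w, _ in strata]
V_uncond = ((1 / n - 1 / N) * sum(w * s for w, s in zip(W, S2hy))
            + sum((1 - w) * s for w, s in zip(W, S2hy)) / n ** 2)
RE = 0.0098424 / V_uncond * 100
print(round(ypst, 4), round(v_pst, 4), ci, round(RE, 2))
```

The script reproduces ȳ_pst ≈ 1.7005, v(ȳ_pst) ≈ 0.621, and RE ≈ 93.4%.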

Define an indicator variable

y_hi = 1 if the i-th unit in the h-th stratum belongs to group A, and y_hi = 0 otherwise.

Then

Ȳ_h = (1/N_h) Σ_{i=1}^{N_h} y_hi = N_ha/N_h = P_h   (8.12.1)

denotes the proportion of units in the h-th stratum possessing the attribute A, where N_ha is the number of such units. Obviously the proportion of the units belonging to group A in the whole population can be written as
P = Σ_{h=1}^L W_h P_h.   (8.12.2)

If n_ha is the number of units belonging to group A in an SRSWOR sample of n_h units from the h-th stratum, then obviously p̂_h = n_ha/n_h is an unbiased estimator of the h-th stratum population proportion P_h.

Thus we have the following theorems:


Theorem 8.12.1. An unbiased estimator of the overall population proportion P possessing the attribute A is given by

p_st = Σ_{h=1}^L W_h p̂_h.   (8.12.3)

Proof. Taking expected values on both sides of (8.12.3) we have

E(p_st) = Σ_{h=1}^L W_h E(p̂_h) = Σ_{h=1}^L W_h P_h = P,

which proves the theorem.

Theorem 8.12.2. The variance of the estimator of the population proportion in stratified random sampling is given by

V(p_st) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h)(N_h/(N_h − 1)) P_h Q_h,   (8.12.4)

where Q_h = 1 − P_h and f_h = n_h/N_h.
Proof. It follows by using results from Chapter 2, because the strata are independent, so

V(p_st) = V( Σ_{h=1}^L W_h p̂_h ) = Σ_{h=1}^L W_h^2 V(p̂_h) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h)(N_h/(N_h − 1)) P_h (1 − P_h).

Hence the theorem.

Theorem 8.12.3. An unbiased estimator of the variance of the estimator of the population proportion in stratified random sampling is given by

v(p_st) = Σ_{h=1}^L W_h^2 ((1 − f_h)/(n_h − 1)) p̂_h q̂_h,   (8.12.5)

where q̂_h = 1 − p̂_h.
Proof. Again it follows by using results from Chapter 2 because the strata are independent. Hence the theorem.

Note that the methods of equal, proportional, and optimum allocation remain valid while estimating the population proportion, and we are considering SRSWOR sampling.
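These results translate directly into code; a minimal Python sketch (the stratum sizes, sample sizes, and proportions below are illustrative values, not from the text):

```python
def var_pst(Nh, nh, Ph):
    """True variance (8.12.4) of p_st under stratified SRSWOR."""
    N = sum(Nh)
    V = 0.0
    for Nk, nk, Pk in zip(Nh, nh, Ph):
        Wk, fk = Nk / N, nk / Nk
        V += Wk ** 2 * (1 - fk) / nk * (Nk / (Nk - 1)) * Pk * (1 - Pk)
    return V

def est_var_pst(Nh, nh, ph):
    """Unbiased variance estimator (8.12.5) from sample proportions ph."""
    N = sum(Nh)
    return sum((Nk / N) ** 2 * (1 - nk / Nk) / (nk - 1) * pk * (1 - pk)
               for Nk, nk, pk in zip(Nh, nh, ph))

# Illustrative two-stratum population
Nh, nh, Ph = [400, 600], [20, 30], [0.25, 0.40]
print(var_pst(Nh, nh, Ph), est_var_pst(Nh, nh, Ph))
```

The second function simply replaces P_h Q_h/n_h in (8.12.4) by p̂_h q̂_h/(n_h − 1), the unbiased sample analogue.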

Example 8.12.1. A gardener has 60,000 mango trees in two orchards, and over an experience of 40 years it has been observed that each tree produces about 100 mangoes. The gardener wishes to estimate the proportion of trees producing more than 100 mangoes out of all 60,000 trees in his two orchards.
( a ) If 10% of the trees are producing more than 100 mangoes, find the variance of the estimator of the population proportion under SRSWOR sampling based on a sample of 50 trees.
( b ) Later a statistician found that in the first orchard of 40,000 trees 10% of the trees are producing more than 100 mangoes, and in the second orchard of 20,000 trees also 10% of the trees are producing more than 100 mangoes, so he suggested applying stratified random sampling by selecting 25 units from each orchard.
( c ) Do you agree with the statistician?
Solution. ( a ) Under SRSWOR sampling we have
N = 60,000, n = 50 and P = 0.1,
so we have

V(p) = ((1 − f)/n)(N/(N − 1)) P(1 − P) = ((1 − 50/60,000)/50)(60,000/(60,000 − 1)) × 0.1 × (1 − 0.1)
= 0.0017985.

( b ) Under stratified random sampling we have

V(p_st) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h)(N_h/(N_h − 1)) P_h (1 − P_h)
= W_1^2 ((1 − f_1)/n_1)(N_1/(N_1 − 1)) P_1 (1 − P_1) + W_2^2 ((1 − f_2)/n_2)(N_2/(N_2 − 1)) P_2 (1 − P_2)
= (40,000/60,000)^2 × ((1 − 25/40,000)/25) × (40,000/(40,000 − 1)) × 0.1 × (1 − 0.1)
+ (20,000/60,000)^2 × ((1 − 25/20,000)/25) × (20,000/(20,000 − 1)) × 0.1 × (1 − 0.1)
= 0.0015990 + 0.0003995 = 0.0019985.

( c ) The relative efficiency of stratified random sampling over SRSWOR sampling will be

RE = (V(p)/V(p_st)) × 100 = (0.0017985/0.0019985) × 100 = 89.99%.

Since the proportion of trees producing more than 100 mangoes is the same (10%) in both orchards, stratification brings no reduction in variance here, and the equal allocation of 25 units per orchard is in fact slightly less efficient than a simple random sample of 50 trees. No, we do not agree with the statistician.
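Both variances in this example can be recomputed directly from (8.12.4) and the SRSWOR formula; a minimal sketch:

```python
def var_srs_prop(N, n, P):
    """Variance of p under SRSWOR (finite-population form)."""
    return (1 - n / N) / n * (N / (N - 1)) * P * (1 - P)

def var_st_prop(Nh, nh, Ph):
    """Variance (8.12.4) of p_st under stratified SRSWOR."""
    N = sum(Nh)
    return sum((Nk / N) ** 2 * (1 - nk / Nk) / nk * (Nk / (Nk - 1)) * Pk * (1 - Pk)
               for Nk, nk, Pk in zip(Nh, nh, Ph))

V_srs = var_srs_prop(60_000, 50, 0.1)
V_st = var_st_prop([40_000, 20_000], [25, 25], [0.1, 0.1])
RE = V_srs / V_st * 100
print(V_srs, V_st, RE)
```

With P_h = 0.1 in both orchards, note that the W_h^2 factors in (8.12.4) matter: the stratified design with equal allocation gives a relative efficiency of about 90%, slightly below simple random sampling.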

Exercise 8.1. In some practical situations the correlation of the auxiliary variable x with the study variable y is positive on some units and negative on others. Evidently, such an auxiliary variable cannot be used directly in the ratio and product methods of estimation, since the population mean and/or the sample mean of the variate may be close to zero, and these may occur in the denominator of the estimator. Can you resolve this difficulty by using the technique of ( a ) stratified sampling, ( b ) post-stratified sampling, ( c ) the change of scale method? Which method will you prefer, and why?
Hint: Srivenkataramana and Tracy (1984).

Exercise 8.2. For stratified random sampling compare the estimator

ȳ*_st = Σ_{h=1}^L ĉ_vh ȳ_h,

where ĉ_vh = s_hy/ȳ_h denotes the estimator of the coefficient of variation in the h-th stratum, with the usual estimator

ȳ_st = Σ_{h=1}^L W_h ȳ_h

under ( a ) proportional allocation, ( b ) optimum allocation.
Hint: Bennett (1983).

Exercise 8.3. In the case of stratified random sampling, assuming a common fixed sample size n and neglecting the finite population correction factor, the relative precision (RP) of proportional allocation to optimum allocation is given by

RP = ( Σ_{h=1}^L W_h √(P_h Q_h) )^2 / Σ_{h=1}^L W_h P_h Q_h,

where P_h = N_h^{-1} Σ_{i=1}^{N_h} y_hi with y_hi = 1 if the i-th unit in the h-th stratum possesses the attribute A and y_hi = 0 otherwise, Q_h = 1 − P_h, and A denotes the attribute of interest.
Hint: Bennett and Islam (1983).

Exercise 8.4. Under the superpopulation model

M : y_j = α + β x_j + e_j, j = 1, 2, ..., N,

where

E(e_j | x_j) = 0, V(e_j | x_j) = γ^2 x_j^g, and Cov(e_j, e_j' | x_j, x_j') = 0, j ≠ j',

the stratum sample size for optimum allocation in the case of the combined ratio estimator

ȳ_cr = ȳ_st (X̄ / x̄_st)

is given by

n_h ∝ W_h [ {ρ_xy − CV(x)/CV(y)}^2 S_x^2 X̄_h^2 + (1 − ρ_xy^2) γ^2 X̄_h^g ]^{1/2},

and that for the combined regression estimator

ȳ_lr = ȳ_st + β̂ (X̄ − x̄_st)

is given by

n_h ∝ W_h (X̄_h^g)^{1/2}.

Hint: Dayal (1985).

Exercise 8.5. Suppose there are three study variables satisfying the following conditions:
( a ) the variates x, y, and z have probability density function f(x, y, z) in the finite range x_0 ≤ x ≤ x_{L1}, y_0 ≤ y ≤ y_{L2}, and z_0 ≤ z ≤ z_{L3};
( b ) the population is finite;
( c ) divide the population into L1 × L2 × L3 strata by determining the strata boundaries x_1, x_2, ..., x_{(L1−1)} for x; y_1, y_2, ..., y_{(L2−1)} for y; and z_1, z_2, ..., z_{(L3−1)} for z such that the generalized variance of the means of the variables for this stratified sample under the Neyman allocation is minimum; and
( d ) the generalized variance can be set out as

GV = | σ_x^2  σ_xy   σ_xz  |
     | σ_xy   σ_y^2  σ_yz  |
     | σ_xz   σ_yz   σ_z^2 |,

where σ_x^2, σ_y^2, σ_z^2, σ_xy, σ_xz, and σ_yz denote the approximate variances and covariances of x, y, and z under Neyman allocation with cost per unit constant. Find the optimum points of stratification in this tri-variate population.

Exercise 8.6. If m is the size of the first phase sample from which ŵ_h, an estimate of W_h, has been obtained, and if n is the size of the second phase sample, then show that

ȳ_st = Σ_{h=1}^L ŵ_h ȳ_h

is an unbiased estimator of the population mean Ȳ with variance

V(ȳ_st) = Σ_{h=1}^L [ {W_h^2 + g W_h (1 − W_h)/m} ((1 − f_h)/n_h) S_hy^2 + (g W_h/m)(Ȳ_h − Ȳ)^2 ],

where g = (N − m)/(N − 1), f_h = n_h/N_h, and the value of n_h is independent of ŵ_h.
Hint: Dayal (1979).

Exercise 8.7. For a given sample s_h ∈ Ω_h (h = 1, 2, ..., L) let (t_hy, t_hx) assume values in a closed convex subspace R_2 of the two-dimensional real space containing the point (Ȳ_h, X̄_h). Find the bias and variance of the class of estimators defined as

t_h = g_h(t_hy, t_hx),

where g_h(t_hy, t_hx) is a known function of t_hy and t_hx satisfying certain regularity conditions such that:
( a ) g_h(Ȳ_h, X̄_h) = Ȳ_h;
( b ) the function g_h is continuous in R_2;
( c ) the first and second order derivatives of g_h exist and are continuous in R_2.
Hint: Dalabehara and Sahoo (1997).

Exercise 8.8. Let p_ij = x_ij/X_j be the selection probability for the i-th unit of the j-th stratum.
( a ) Show that an unbiased estimator of the population total Y is given by

Ŷ_pps = Σ_{j=1}^L (1/n_j) Σ_{i=1}^{n_j} y_ij/p_ij

with variance

V(Ŷ_pps) = Σ_{j=1}^L (1/n_j) [ Σ_{i=1}^{N_j} y_ij^2/p_ij − Y_j^2 ].

( b ) Also derive the expression for the optimum sample size n_j^opt in the j-th stratum.
Hint: Gupta and Rao (1997), Rao (1984).

Exercise 8.9. For the h-th stratum define x̄*_h = (N_h X̄_h − n_h x̄_h)/(N_h − n_h). Find the bias and variance of the dual to separate ratio estimator of the population mean Ȳ in stratified sampling, defined as

ȳ_dst = Σ_{h=1}^L W_h ȳ_h (x̄*_h / X̄_h).

Hint: Singh and Singh (1995).

Exercise 8.10. Show that the difference between the variances of the combined and the separate ratio estimators of the population total in stratified sampling is

V(Ŷ_rc) − V(Ŷ_rs) = Σ_{h=1}^L (N_h^2 (1 − f_h)/n_h)(R − R_h)[(R + R_h) S_hx^2 − 2 S_hxy].

Discuss the effect of the choice of R_h and R from stratum to stratum and across strata.

Exercise 8.11. Consider a population of size N stratified into L strata, the size of the h-th stratum being N_h such that Σ_{h=1}^L N_h = N. A simple random sample of size n is drawn from the population and is classified among the L strata such that n_h (h = 1, 2, ..., L) is the number of units in the sample that fall in the h-th stratum, n_h varying from sample to sample.
( a ) Compare the usual unbiased estimator

ȳ_pst = Σ_{h=1}^L W_h ȳ_h,

where W_h = N_h/N, with the estimator defined as

ȳ*_pst = Σ_{h=1}^L W_ha ȳ_h,

where W_ha = a ŵ_h + (1 − a) W_h for ŵ_h = n_h/n.
( b ) Discuss the different choices of the real constant a.
Hint: Agrawal and Panda (1993).

Exercise 8.12. Divide the population of size N into L strata, such that the h-th stratum has N_h units. From the h-th stratum, draw a preliminary large sample of m_h units with SRSWOR sampling and measure the auxiliary variable x_hi, i = 1, 2, ..., m_h. Out of the m_h units selected in the preliminary large sample from the h-th stratum, select a second phase sample of n_h units using SRSWOR sampling and measure both the study variable y_hi and the auxiliary variable x_hi, i = 1, 2, ..., n_h.
( a ) Study the bias and variance properties of ratio and regression type estimators ȳ_1, ȳ_2, ȳ_3, and ȳ_4 of the population mean Ȳ built from

W_h = N_h/N, x̄_hm = m_h^{-1} Σ_{i=1}^{m_h} x_hi, x̄_hn = n_h^{-1} Σ_{i=1}^{n_h} x_hi, and ȳ_hn = n_h^{-1} Σ_{i=1}^{n_h} y_hi,

the first phase and second phase sample means.
( b ) Find the real constants for which the variances of ȳ_2 and ȳ_4 are minimum.
Hint: Lindley and Deely (1993).

Exercise 8.13. Suppose we draw a preliminary large sample of m units from the population of N units with SRSWOR sampling. The first step is to post-stratify the selected units into the L strata so that the h-th stratum contains m_h units, and to note the associated auxiliary variable x_hi, i = 1, 2, ..., m_h. From the post-stratified sample with m_h units in the h-th stratum, select n_h units with SRSWOR sampling and measure the study variable y_hi and the auxiliary variable x_hi, i = 1, 2, ..., n_h.
Study the asymptotic properties of the estimator of the population mean Ȳ defined as

ȳ_pst = Σ_{h=1}^L (m_h/m) ȳ_hn (x̄_hm / x̄_hn).
Exercise 8.14. Let P_h be the proportion of units in the h-th stratum possessing an attribute of interest, say A. Evidently P = Σ_{h=1}^L W_h P_h is the proportion of units in the population possessing the attribute A. Assume n_h units are selected from the h-th stratum using SRSWOR sampling. Suggest an unbiased estimator of P. Compare the efficiency of optimum allocation with proportional allocation.

Exercise 8.15. Write a short note on stratification and post-stratification.

Exercise 8.16. Under post-stratification, study the asymptotic properties of the estimator of the population mean Ȳ given by

ȳ_d = [ Σ_{h=1}^L a_h W_h ȳ_h / E(a_h) ] / [ Σ_{h=1}^L a_h W_h / E(a_h) ],

where a_h indicates whether the h-th post-stratum is non-empty and

E(a_h) = 1 − C(N − N_h, n)/C(N, n) if N is finite, and E(a_h) = 1 − (1 − W_h)^n if N is infinite.

Hint: Doss, Hartley, and Somayajulu (1979).

Exercise 8.17. Assume Ȳ = Σ_{h=1}^L W_h Ȳ_h, W_h = N_h/N and ȳ_h = (1/n_h) Σ_{j=1}^{n_h} y_hj, h = 1, 2, ..., L. Also let ȳ = (1/n) Σ_{i=1}^n y_i and ȳ_p = Σ_{h=1}^L W_h ȳ_h denote the estimators of Ȳ obtained from simple random sampling and post-stratification (n_h ≥ 1, h = 1, 2, ..., L), respectively. It should be noted that the sample size n of the first-stage simple random sampling is a fixed number, while the classified sample sizes n_h thereafter are random variables. In the case of small samples a commonly used estimation procedure for Ȳ is to collapse the empty post-strata, if there are any, with neighbouring strata. If ȳ_c(L) is the estimator of Ȳ under such a procedure, then define ȳ_c(L) for L = 2, 3, 4.
Hint: Chang, Liu, and Han (1998), Chang, Han, and Hawkins (1999).

Exercise 8.18. Let the survey population Ω = {i : i = 1, ..., N} be stratified into L strata Ω_j, j = 1, 2, ..., L. In the superpopulation model m(y_i, θ) = f_i(y_i) − a_i(θ) let f_i(y_i) = y_i and a_i(θ) = θ_j x_i for all i ∈ Ω_j and j = 1, 2, ..., L, with x_i, i = 1, 2, ..., N being a covariate and θ_j some scalar parameter specific to the stratum Ω_j, j = 1, ..., L.
( a ) Show that the estimating function

Ŷ_G = Σ_{i∈s} {f_i(y_i) − a_i(θ)}/π_i + Σ_{i=1}^N a_i(θ)

can be written as

Ŷ_G = Σ_{j=1}^L Σ_{i∈s_j} (y_i − θ_j x_i)/π_i + Σ_{j=1}^L θ_j X_j,

where s_j denotes the sample from the stratum Ω_j and X_j = Σ_{i∈Ω_j} x_i.
( b ) Also show that under an SRSWOR sample of fixed size n the estimating function reduces to

Ŷ_G = (N/n) Σ_{j=1}^L n_j (ȳ_j − θ_j x̄_j) + Σ_{j=1}^L θ_j X_j,

where ȳ_j and x̄_j are the means of y and x over the sample s_j.
( c ) If the value of θ_j = ȳ_j/x̄_j, j = 1, 2, ..., L, then the estimating function reduces to the separate ratio estimator given by

Ŷ_R = Σ_{j=1}^L (ȳ_j/x̄_j) X_j.

Hint: Godambe (1995).

Exercise 8.19. Suppose a stratified sampling strategy is defined as follows:
( i ) the population is stratified by type, e.g., industry, description, or geographic area;
( ii ) the type strata are further stratified by size;
( iii ) the sample is allocated over all strata in such a fashion as to ensure that the required degree of accuracy will be achieved for each estimate and that, subject to that requirement, the number of sample units is minimized;
( iv ) the number of sample units allocated to each particular type-size stratum is selected by using SRSWOR.
Under the superpopulation model

Y_i = β x_i + e_i

with

E(e_i) = 0 and E(e_i e_j) = σ_i^2 if i = j, and 0 if i ≠ j:

( a ) Show that a sensible predictor of the population total Y is given by

Ŷ = Σ_{i∈s} y_i + β̂ Σ_{i∉s} x_i,

where β̂ is a sample estimator of β.
( b ) If we take β̂ = Σ_{i∈s} w_i y_i / Σ_{i∈s} w_i x_i, where the weights w_i are bounded but otherwise completely arbitrary, then show that the optimum value of these weights w_i over the repeated sampling design is given by

w_i = a ( n^{-1} σ_i^{-1} Σ_{j=1}^N σ_j − 1 )

with a = Σ_{j=1}^N π_j w_j x_j.
( c ) Also show that it leads to an analogue of the Neyman allocation given by

π_i = n σ_i / Σ_{j=1}^N σ_j.

Hint: Brewer (1979).


Exercise 8.20. If the cost function is of the form C = C_0 + Σ_{h=1}^L C_h n_h^2, then show that the optimum sample sizes n_h for the minimum variance V(ȳ_st) are given by

n_h = { W_h^2 S_h^2 / (k C_h) }^{1/3} = n (W_h^2 S_h^2 / C_h)^{1/3} / Σ_{h=1}^L (W_h^2 S_h^2 / C_h)^{1/3},

where k is a constant.

Exercise 8.21. Consider a population Ω of N units divided into L independent strata of sizes N_1, N_2, ..., N_L such that Σ_{h=1}^L N_h = N. Consider a sampling design that selects n_h units from the h-th stratum employing selection probabilities proportional to size within the h-th stratum. Define an estimator of the population total Y as

Ŷ_st = Σ_{h=1}^L Σ_{i∈s_h} y_hi/π_hi,

where y_hi denotes the y value of the i-th unit in the h-th stratum, s_h is the sample from the h-th stratum, 1 ≤ h ≤ L, and π_hi = n_h x_hi/X_h, 1 ≤ i ≤ N_h, with

X_h = Σ_{i=1}^{N_h} x_hi.

Show that the estimator Ŷ_st is unbiased for the population total Y, and its variance is given by

V(Ŷ_st) = −(1/2) Σ_{h=1}^L Σ_{i=1}^{N_h} Σ_{j≠i=1}^{N_h} (π_hij − π_hi π_hj)(y_hi/π_hi − y_hj/π_hj)^2.

Hint: Padmawar (1998).

Exercise 8.22. Assuming that the reliability ratio δ is known, a necessary and sufficient condition for the estimator

Ŷ = [ w_{0s_1} + Σ_{j∈s_1} w_{js_1} y_{1j}, ..., w_{0s_L} + Σ_{j∈s_L} w_{js_L} y_{Lj} ]'

to be admissible in the class of linear estimators is that there exist a and a_i satisfying f_i/(1 + δ^{-1}) ≤ a_i ≤ 1 and n_i (a_i − f_i (1 + δ^{-1})^{-1}) a_i^{-1} = n_j (a_j − f_j (1 + δ^{-1})^{-1}) a_j^{-1} for 1 ≤ i ≠ j ≤ L, such that w_{js_i} = a_i/n_i for all j ∈ s_i and w_{0s_i} = (a_i − 1) a for i = 1, 2, ..., L.
Hint: Zou and Wan (2000).

Exercise 8.23. ( I ) Single phase sampling: ( a ) Consider an estimator in stratified sampling of the form

ȳ_st(new) = Σ_{h=1}^L Ω_h ȳ_h,

where the weights Ω_h are chosen such that the chi-square distance function

Σ_{h=1}^L (Ω_h − W_h)^2 / (W_h Q_h),

where Q_h denotes suitable weights, is minimized subject to the following two calibration constraints:

Σ_{h=1}^L Ω_h x̄_h = Σ_{h=1}^L W_h X̄_h and Σ_{h=1}^L Ω_h s_hx^2 = Σ_{h=1}^L W_h S_hx^2,

where

S_hx^2 = (N_h − 1)^{-1} Σ_{i=1}^{N_h} (x_hi − X̄_h)^2, s_hx^2 = (n_h − 1)^{-1} Σ_{i=1}^{n_h} (x_hi − x̄_h)^2, x̄_h = n_h^{-1} Σ_{i=1}^{n_h} x_hi, and X̄_h = N_h^{-1} Σ_{i=1}^{N_h} x_hi

have their usual meanings. Show that the resultant estimator of the population mean is given by

ȳ_st(new) = Σ_{h=1}^L W_h [ ȳ_h + β̂_1 (X̄_h − x̄_h) + β̂_2 (S_hx^2 − s_hx^2) ],

where β̂_1 and β̂_2 are the regression-type coefficients obtained from the minimization.
( b ) The minimum variance of the estimator ȳ_st(new) is given by

V(ȳ_st(new)) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h) S_hy^2 { 1 − λ_h11^2 − (λ_h11 λ_h03 − λ_h12)^2 / (λ_h04 − 1 − λ_h03^2) },

where

λ_hrs = μ_hrs / (μ_h20^{r/2} μ_h02^{s/2}) and μ_hrs = (N_h − 1)^{-1} Σ_{i=1}^{N_h} (y_hi − Ȳ_h)^r (x_hi − X̄_h)^s.

( c ) Show that the variance V(ȳ_st(new)) can be written as

V(ȳ_st(new)) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h) (N_h − 1)^{-1} Σ_{i=1}^{N_h} E_hi^2,

where

E_hi = (y_hi − Ȳ_h) − β_1 (x_hi − X̄_h) − β_2 { (x_hi − X̄_h)^2 − σ_hx^2 } with σ_hx^2 = N_h^{-1} Σ_{i=1}^{N_h} (x_hi − X̄_h)^2.

( d ) Consider an estimator for estimating the variance V(ȳ_st(new)) given by

v_0(ȳ_st(new)) = Σ_{h=1}^L W_h^2 ((1 − f_h)/n_h) (n_h − 3)^{-1} Σ_{i=1}^{n_h} e_hi^2,

where

e_hi = (y_hi − ȳ_h) − β̂_1 (x_hi − x̄_h) − β̂_2 { (x_hi − x̄_h)^2 − σ̂_hx^2 },

with σ̂_hx^2 = n_h^{-1} Σ_{i=1}^{n_h} (x_hi − x̄_h)^2 being the maximum likelihood estimator of σ_hx^2, and compare it with a calibrated estimator of the variance defined as

v_1(ȳ_st(new)) = Σ_{h=1}^L Ω_h^2 ((1 − f_h)/n_h) (n_h − 3)^{-1} Σ_{i=1}^{n_h} e_hi^2.

( II ) Double sampling: Consider a population of N units consisting of L strata such that the h-th stratum contains N_h units and Σ_{h=1}^L N_h = N. From the h-th stratum of N_h units, draw a preliminary large sample of m_h units by SRSWOR sampling and measure the auxiliary variable x_hi only. Select a sub-sample of n_h units from this preliminary large sample of m_h units by SRSWOR sampling and measure both the study variable y_hi and the auxiliary variable x_hi.
Let x̄*_h = m_h^{-1} Σ_{i=1}^{m_h} x_hi and s*_hx^2 = (m_h − 1)^{-1} Σ_{i=1}^{m_h} (x_hi − x̄*_h)^2 denote the first phase sample mean and variance. Also let x̄_h = n_h^{-1} Σ_{i=1}^{n_h} x_hi, ȳ_h = n_h^{-1} Σ_{i=1}^{n_h} y_hi, and s_hy^2 = (n_h − 1)^{-1} Σ_{i=1}^{n_h} (y_hi − ȳ_h)^2 denote the second phase sample means and variance for the auxiliary and study variables, respectively.
( a ) Consider an estimator of the population mean in stratified double sampling as

ȳ_st(d) = Σ_{h=1}^L W*_h ȳ_h,

where the W*_h are the calibrated weights such that the chi-square distance

Σ_{h=1}^L (W*_h − W_h)^2 / (W_h Q_h),

where Q_h are prearranged weights, is minimized subject to the constraints

Σ_{h=1}^L W*_h x̄_h = Σ_{h=1}^L W_h x̄*_h and Σ_{h=1}^L W*_h s_hx^2 = Σ_{h=1}^L W_h s*_hx^2,

where the W_h = N_h/N are known stratum weights. Show that a calibrated estimator of the population mean in stratified double sampling is given by

ȳ_st(d) = Σ_{h=1}^L W_h ȳ_h + β̂*_1 [ Σ_{h=1}^L W_h (x̄*_h − x̄_h) ] + β̂*_2 [ Σ_{h=1}^L W_h (s*_hx^2 − s_hx^2) ],

where β̂*_1 and β̂*_2 have their usual meanings, and work them out.
( b ) Show that the conditional variance of the stratified double sampling estimator ȳ_st(d) = Σ_{h=1}^L W*_h ȳ_h is

V[ȳ_st(d) | W*_h] = Σ_{h=1}^L W*_h^2 (1/n_h − 1/N_h) S_hy^2,

where S_hy^2 = (N_h − 1)^{-1} Σ_{i=1}^{N_h} (y_hi − Ȳ_h)^2.
( c ) Show that a conditionally unbiased estimator of V[ȳ_st(d) | W*_h] is

v[ȳ_st(d) | W*_h] = Σ_{h=1}^L W*_h^2 (1/n_h − 1/N_h) s_hy^2,

where s_hy^2 = (n_h − 1)^{-1} Σ_{i=1}^{n_h} (y_hi − ȳ_h)^2.
( d ) Show that the minimum variance of the stratified double sampling estimator ȳ_st(d), to the first order of approximation, is given by

V(ȳ_st(d)) = Σ_{h=1}^L W_h^2 [ (1/m_h − 1/N_h) S_hy^2 + (1/n_h − 1/m_h) S_hy^2 { 1 − λ_h11^2 − (λ_h11 λ_h03 − λ_h12)^2 / (λ_h04 − 1 − λ_h03^2) } ].

( e ) Show that the variance of the stratified double sampling estimator ȳ_st(d) can be expressed in terms of the residuals

E_hi = (y_hi − Ȳ_h) − β_1 (x_hi − X̄_h) − β_2 { (x_hi − X̄_h)^2 − σ_hx^2 }.

( f ) Consider an estimator of the variance V(ȳ_st(d)) given by

v(ȳ_st(d)) = Σ_{h=1}^L W_h^2 [ (1/m_h − 1/N_h) s_hy^2 + (1/n_h − 1/m_h) n_h^{-1} Σ_{i=1}^{n_h} e_hi^2 ],

where e_hi = (y_hi − ȳ_h) − β̂_1 (x_hi − x̄_h) − β̂_2 { (x_hi − x̄_h)^2 − σ̂_hx^2 } denotes the estimated residual and σ̂_hx^2 = n_h^{-1} Σ_{i=1}^{n_h} (x_hi − x̄_h)^2 denotes the maximum likelihood estimator of σ_hx^2. Study an estimator of variance in stratified double sampling given by

v(ȳ_st(d))_n = Σ_{h=1}^L W*_h^2 [ (1/m_h − 1/N_h) s_hy^2 + (1/n_h − 1/m_h) (n_h − 1)^{-1} Σ_{i=1}^{n_h} e_hi^2 ].

Hint: Tracy, Singh, and Arnab (2003).
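The constrained minimization in part I ( a ) has a closed-form Lagrange solution: with distance metric diag(W_h Q_h) and two linear constraints, the calibrated weights solve a 2 × 2 linear system. A Python sketch of this general mechanism (all numbers in the usage lines are hypothetical):

```python
def calibrate(W, Q, X1, t1, X2, t2):
    """Chi-square-distance calibration of stratum weights W subject to
    sum(Om*X1) = t1 and sum(Om*X2) = t2 (closed-form Lagrange solution)."""
    d = [w * q for w, q in zip(W, Q)]            # diagonal of the metric
    a11 = sum(di * x * x for di, x in zip(d, X1))
    a12 = sum(di * x * z for di, x, z in zip(d, X1, X2))
    a22 = sum(di * z * z for di, z in zip(d, X2))
    r1 = t1 - sum(w * x for w, x in zip(W, X1))  # constraint residuals
    r2 = t2 - sum(w * z for w, z in zip(W, X2))
    det = a11 * a22 - a12 * a12
    l1 = (r1 * a22 - r2 * a12) / det             # Lagrange multipliers
    l2 = (r2 * a11 - r1 * a12) / det
    return [w + di * (l1 * x + l2 * z)
            for w, di, x, z in zip(W, d, X1, X2)]

# Hypothetical strata: weights, Q_h, the two calibration variables, targets.
Om = calibrate([0.3, 0.5, 0.2], [1.0, 1.0, 1.0],
               [2.0, 3.0, 4.0], 3.1, [1.5, 2.5, 0.5], 1.9)
print([round(o, 4) for o in Om])
```

In the exercise's notation, X1 would hold the stratum means x̄_h with target Σ W_h X̄_h, and X2 the variances s_hx^2 with target Σ W_h S_hx^2; the returned Ω_h satisfy both constraints exactly by construction.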

Exercise 8.24. Under stratified random sampling, study the asymptotic properties of the following estimators of the population mean Ȳ, defined as

( a ) ȳ_stSD = ȳ_st [ Σ_{h=1}^L W_h (X̄_h + C_xh) ] / [ Σ_{h=1}^L W_h (x̄_h + C_xh) ],

( b ) ȳ_stSK = ȳ_st [ Σ_{h=1}^L W_h (X̄_h + β_2h(x)) ] / [ Σ_{h=1}^L W_h (x̄_h + β_2h(x)) ],

( c ) ȳ_stUS = ȳ_st [ Σ_{h=1}^L W_h (X̄_h β_2h(x) + C_xh) ] / [ Σ_{h=1}^L W_h (x̄_h β_2h(x) + C_xh) ],

where ȳ_st = Σ_{h=1}^L W_h ȳ_h has its usual meaning. Discuss the special case when there is only one stratum (L = 1) in the population.
Hint: Kadilar and Cingi (2003).

Practical 8.1. A World Bank manager selects two units by SRSWOR sampling from each stratum of population 5 given in the Appendix. Later on a statistician collects the information on the production of the tobacco crop from the countries selected in the sample. The total number of countries in each continent is known; discuss the estimate of the average production of the tobacco crop in the world. Derive the 95% confidence interval. Also find an estimate of the percentage gain in efficiency (GE) due to stratification.

Practical 8.2. A statistician suggested to the World Bank to select a sample of 40


countries from population 5 given in the Appendix using the method of
proportional allocation. Following his instructions, record the production of tobacco
crop from the selected countries. Estimate the average production of the tobacco
crop in the world. Estimate the variance under the method of proportional
allocation. Construct a 95% confidence interval.

Practical 8.3. The variation in the production of tobacco crops in different continents of the world is known, as listed in the descriptive statistics of population 5 given in the Appendix. Use this information to select a sample of 40 countries from population 5 using the Neyman method of allocation. Record the production of the tobacco crop from the selected countries. Merge the strata if required. Estimate the average production of the tobacco crop in the world. Estimate the variance under the Neyman method of allocation. Construct a 95% confidence interval.

Practical 8.4. Owing to budget constraints the management of the World Bank has suggested a condition on spending money in different continents while selecting a sample. Keeping the instructions of the management in mind, select a sample of 40 countries from population 5 given in the Appendix using the method of optimum allocation. Record the production of the tobacco crop from the selected countries. Estimate the average production of the tobacco crop in the world. Estimate the variance under the method of optimum allocation. Construct a 95% confidence interval.
Given: C_1 = $0.5, C_2 = 2, C_3 = 3, C_4 = 5, C_5 = 7, C_6 = 1.5, C_7 = 10, C_8 = 5, C_9 = 5, and C_10 = 3.

Practical 8.5. Find the minimum sample size from population 5 given in the Appendix to obtain estimates of the population mean with different levels of relative standard deviation. Plot relative standard deviation versus sample size.
Given: C_1 = $0.5, C_2 = 2, C_3 = 3, C_4 = 5, C_5 = 7, C_6 = 1.5, C_7 = 10, C_8 = 5, C_9 = 5, C_10 = 3 and Ȳ = 1.5507.

Practical 8.6. Stratify the 50 states listed in population 1 given in the Appendix using the real estate farm loans as an auxiliary variable. Form the stratum boundaries for six strata using the cumulative square root method and the cumulative cube root method.
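The cumulative square root (and cube root) method mentioned here accumulates the root of the class frequencies and cuts at equal intervals of the cumulative total; a Python sketch (the frequency table is invented for illustration, not taken from population 1):

```python
def cum_root_boundaries(freqs, n_strata, root=2):
    """Return indices of the last class in each stratum (except the final
    stratum) under the cumulative f**(1/root) rule."""
    cum, total = [], 0.0
    for f in freqs:
        total += f ** (1.0 / root)
        cum.append(total)
    step = total / n_strata          # equal slices of cumulative root-f
    cuts, k = [], 1
    for i, c in enumerate(cum):
        if c >= k * step and k < n_strata:
            cuts.append(i)
            k += 1
    return cuts

freqs = [10, 25, 60, 90, 55, 30, 12, 8, 4, 2]   # hypothetical class counts
print(cum_root_boundaries(freqs, 4))            # square root rule
print(cum_root_boundaries(freqs, 4, root=3))    # cube root rule
```

Passing root=3 gives the cumulative cube root variant; the two rules generally place the cut points at different classes.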

Practical 8.7. A team of doctors wishes to estimate the average death rate of persons living in the United States on the basis of year 2000 projections. The projected 191,443 persons are to be grouped into five strata on the basis of their age at the time of death. The rough distribution of the death rate projected for the year 2000 in the United States has been listed in population 6 of the Appendix in 21 age groups with a gap of four years. Apply the Neyman method of sample allocation for selecting the overall sample and the cumulative square root method to form the strata.

Practical 8.8. Suppose there are eight big cities in a particular country. The
following table lists the number of persons and standard deviation of their income
in each city with different age groups. We wish to allocate a sample of 1000
persons to collect information for social survey . Suggest the number of persons to
be selected from each city with the help of Proportional, Neyman and Optimum
allocation.
Standard':'
Devi liH()l{
9500 76.44
25--34 29500 25--34 29500 45 .89
35--44 19500 35--44 32500 18.32
45--54 4500 45--54 22500 26.58
55--64 6500 55--64 20500 24.88
65+ 15500 65+ 19500 19.14
<24 4500 <24 4500 44.63
" 25--34 20500 25--34 27500 13.65
35--44 27500 35--44 41500 18.58
45--54 19500 45--54 26500 27.99
55--64 21500 55--64 16500 12.44
65+ 21500 65+ 23500 15.07
<24 6500 <24 5500 4.62
25--34 12500 25--34 30500 11.02
35--44 28500 35--44 33500 16.58
45--54 29500 45--54 21500 23.91
55--64 29500 55--64 14500 23.03
65+ 31500 65+ 14500 21.82
<24 3500 <24 3500 36.94
25--34 23500 25--34 20500 8.74
35--44 26500 35--44 13500 12.25
45--54 16500 45--54 9500 14.24
55--64 19500 55--64 10500 22.09
65+ 20500 65+ 16500 9.97
Which method will you prefer, and why?
Given: The cost of processing one unit in the i-th city is C1 = $2, C2 = $4, C3 = $2.5,
C4 = $3, C5 = $4, C6 = $4.3, C7 = $2.8 and C8 = $3.2.
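Mechanically, the three allocations differ only in what n_h is made proportional to: N_h (proportional), N_h S_h (Neyman), or N_h S_h / sqrt(C_h) (optimum, for a fixed total n). A sketch with invented stratum sizes, standard deviations, and costs (the table above is partly garbled in the scan, so these numbers are illustrative only):

```python
import math

def allocations(N, S, C, n):
    """Return (proportional, Neyman, optimum) allocations of a sample of
    size n over strata with sizes N, SDs S, and per-unit costs C."""
    prop = [n * Nh / sum(N) for Nh in N]                 # n_h proportional to N_h
    w = [Nh * Sh for Nh, Sh in zip(N, S)]
    ney = [n * wh / sum(w) for wh in w]                  # n_h proportional to N_h*S_h
    v = [wh / math.sqrt(Ch) for wh, Ch in zip(w, C)]
    opt = [n * vh / sum(v) for vh in v]                  # n_h proportional to N_h*S_h/sqrt(C_h)
    return prop, ney, opt

# Illustrative values only (three strata).
prop, ney, opt = allocations(N=[9500, 29500, 19500],
                             S=[76.44, 45.89, 18.32],
                             C=[2.0, 4.0, 2.5], n=1000)
print([round(x) for x in prop])   # -> [162, 504, 333]
print([round(x) for x in ney])    # -> [298, 555, 147]
```

Note how Neyman allocation moves sample towards the small but highly variable first stratum, which is exactly the trade-off the practical asks you to discuss.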

Practical 8.9. Consider a couple in your class consisting of a husband and a wife.
Assume that the life of every couple consists of good and bad events. Ask both of
them to prepare a separate list of good and bad events in their life. Estimate the
proportion of good events for each one of them. Also obtain a pooled estimate of
good events among the families, assuming 50% weight for each of them and that
the number of couples is infinitely large.
Chapter 8: Stratified and Post-Stratified Sampling 751

Practical 8.10. An estimate of the average death rate of persons living in the
United States has been found to be useful for making future strategies. The
projected 191443 persons are to be grouped into five strata on the basis of their age
at the time of death. The rough distribution of the death rate projected for the year
2000 in the United States has been listed in population 6 of the Appendix in 21 age
groups with a gap of four years. Apply the method of proportional allocation for
selecting the overall sample required for estimation purpose and apply the
cumulative cube root method to form 4 strata.

Practical 8.11. The estimation of total production of tobacco crop in the world is an
important issue to the health departments. Select an SRSWOR sample of 45
countries from the population 5 given in the Appendix and record the production
and area under the crop. Assume each continent is a different stratum and that the
number of countries in each continent is known. Post-stratify the countries selected
in the SRSWOR sample into different continents. Merge the post-strata if required using
total area under crop as an auxiliary variable. Obtain 95% confidence interval for
the total production.

Practical 8.12. The following map shows the rank of average temperature during
December 2001 in different states of the United States of America.

December Statewide Ranks
National Climatic Data Center/NESDIS/NOAA

[Map legend: Record Coldest, Much Below Normal, Below Normal, Near Normal,
Above Normal, Much Above Normal, Record Warmest.]

Source: Printed with permission from NOAA/National Climate Data Centre.



( 1 ) SRSWOR sampling

( a ) Count the number of states whose rank of temperature is shown in the map.
( b ) Compute the average rank temperature in the USA.
( c ) List the names of the states in alphabetic order A to Z (Rule: Use two letter
abbreviations, e.g., use NY for New York).
( d ) Select an SRSWOR sample of 15 states (Rule: Start with the first two columns
of the Pseudo-Random Number Table 1 given in the Appendix).
( e ) Construct a 95% confidence interval estimate of the average rank temperature
based on the SRSWOR sample information.
( f ) Find the population mean square S_y^2.
( g ) Find the variance of the estimator of the average rank of temperature based on
SRSWOR sample of 15 units.
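Steps (d)-(g) follow one recipe: draw the without-replacement sample, apply the finite-population-corrected variance v(ybar) = (1 - n/N) s^2 / n, and form the normal-theory interval ybar +/- 1.96 sqrt(v(ybar)). A sketch on invented rank data (not the actual map ranks):

```python
import math
import random
import statistics

def srswor_mean_ci(population, n, z=1.96, seed=1):
    """Draw an SRSWOR sample of size n and return the sample mean with a
    normal-theory 95% CI using the finite population correction."""
    N = len(population)
    sample = random.Random(seed).sample(population, n)  # without replacement
    ybar = statistics.mean(sample)
    s2 = statistics.variance(sample)                    # divisor n - 1
    half = z * math.sqrt((1 - n / N) * s2 / n)
    return ybar, (ybar - half, ybar + half)

ranks = list(range(1, 49))            # hypothetical ranks for 48 states
ybar, (lo, hi) = srswor_mean_ci(ranks, 15)
print(round(ybar, 2), round(lo, 2), round(hi, 2))
```

With the full population known, as in (f)-(g), the exact variance V(ybar) = (1 - n/N) S_y^2 / n replaces its sample estimate.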

( 2 ) Stratified sampling: Equal allocation

( h ) Divide the population into three strata based on the map information such that
stratum 1 consists of all states having temperature Near Normal, stratum 2 consists
of all states having temperature Above Normal, and stratum 3 consists of all states
having temperature Much Above Normal or Record Warmest. Construct three
lists of the states falling in the three strata in alphabetic order A to Z. (Use
two letter abbreviations, e.g., NY for New York).
( i ) Select sub-samples of five units from each stratum using SRSWOR sampling.
(Rule: Always start with the first row and first column of the Pseudo-Random
Number Table 1 given in the Appendix, and make sure the states in each stratum
are in alphabetic order).
(j ) Construct 95% confidence interval estimate of the average rank temperature
based on stratified random sample of 15 units.
( k ) Find the population mean squares for each one of the three strata, S_hy^2, h = 1,2,3.
( l ) Find the variance of the estimator of the average rank of temperature based on
stratified sampling, when five units are selected from each stratum.
( m ) Find the relative efficiency of the stratified sampling over the SRSWOR
sampling.

( 3 ) Stratified sampling: Proportional allocation

( n ) Select a Stratified Random Sample of 15 units using proportional allocation.


(Rule: Always start with first row and first column from the Pseudo-Random
Number Table 1 given in the Appendix)
( 0 ) Construct a 95% confidence interval estimate of the average rank of
temperature.
( p ) Find the variance of the estimator of the average rank of temperature using
proportional allocation .

( q ) Find the relative efficiency of proportional allocation with respect to SRSWOR


sampling .
( r ) Find the relative efficiency of proportional allocation with respect to equal
allocation sampling.

( 4) Stratified sampling: Neyman allocation

( s ) Assuming that the population mean squares S_hy^2, h = 1,2,3, for the three strata
are known, select a stratified random sample using Neyman allocation. (Rule:
Always start with the first row and first column from the Pseudo-Random Number
Table 1 given in the Appendix, and make sure the states in each stratum are in
alphabetic order).
( t ) Construct the 95% confidence interval estimate of the average rank of
temperature.
( u ) Find the variance of the estimator of the average rank of temperature using
Neyman allocation .
( v ) Find the relative efficiency of the Neyman allocation with respect to SRSWOR
sampling.
( w ) Find the relative efficiency of the Neyman allocation with respect to equal
allocation sampling .
( x ) Find the relative efficiency of the Neyman allocation with respect to the
proportional allocation sampling.

( 5 ) Stratified sampling: Optimum allocation

( y ) Let C1 = $4.0, C2 = $9.0, and C3 = $25.0 be the costs of collecting
information on a unit in the first, second, and third stratum, respectively. Again
assuming that the population mean squares S_hy^2, h = 1,2,3, for the three strata are
known, select a stratified random sample of 15 units using Optimum Allocation.
(Rule: Always start with the first row and first column from the Pseudo-Random
Number Table 1 given in the Appendix, and make sure states in each stratum are in
alphabetic order)
( z ) Construct a 95% confidence interval estimate of the average rank of
temperature.
( aa ) Find the variance of the estimator of the average rank of temperature using
Optimum Allocation.
( bb ) Find the relative efficiency of the Optimum Allocation with respect to
SRSWOR sampling.
( cc ) Find the relative efficiency of the Optimum Allocation with respect to equal
allocation sampling .
( dd) Find the relative efficiency of the Optimum Allocation with respect to
proportional allocation .
( ee ) Find the relative efficiency of the Optimum Allocation with respect to
Neyman allocation .

( 6 ) Post-stratified sampling

( ff ) Collect information for each state included in the SRSWOR sample of 15
units from the whole population as in part (1) and note their status as: Near
Normal, Above Normal, etc. (Rule: Start with the first two columns of the Pseudo-
Random Number Table 1 given in the Appendix, and make sure the states are listed
in alphabetic order A to Z).
( gg ) Post-stratify the sample of 15 units into three strata, viz., Stratum 1 with Near
Normal, Stratum 2 with Above Normal, and Stratum 3 with Much Above Normal
or Record Warmest temperatures. Is there any empty post-stratum? If so,
suggest a possible solution to the issue.
( hh ) Estimate the average rank temperature using post-stratification, and deduce
95% CI estimate .
( ii ) Find the relative efficiency of the post-stratified sampling with respect to
SRSWOR sampling .
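The post-stratified estimate in (hh) weights each post-stratum sample mean by its known population share W_h = N_h/N, and the empty-stratum problem raised in (gg) is commonly handled by collapsing the empty stratum with a neighbour. A minimal sketch with invented weights and data (the collapsing rule here, folding an empty stratum's weight into the next non-empty one, is just one simple choice):

```python
def post_stratified_mean(Wh, samples):
    """ybar_pst = sum_h W_h * ybar_h over the post-strata.
    An empty post-stratum's weight is folded into the next non-empty one."""
    means, weights, carry = [], [], 0.0
    for W, s in zip(Wh, samples):
        if not s:                    # empty post-stratum: carry its weight forward
            carry += W
            continue
        weights.append(W + carry)
        carry = 0.0
        means.append(sum(s) / len(s))
    if carry:                        # a trailing empty stratum folds backwards
        weights[-1] += carry
    return sum(w * m for w, m in zip(weights, means))

Wh = [0.5, 0.3, 0.2]                           # known shares N_h / N
samples = [[10, 12, 14], [], [20, 22]]         # stratum 2 came up empty
print(post_stratified_mean(Wh, samples))       # -> 16.5
```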

( 7 ) Collection of auxiliary information

Experience has shown that the statewide rank of temperature during Dec. 2001 is
correlated with the rank of statewide precipitation during Dec. 2000, and
complete information on the precipitation is known. Collect the statewise
information on the precipitation using the following map:

December Statewide Ranks
National Climatic Data Center/NESDIS/NOAA

[Map legend: Record Driest, Much Below Normal, Below Normal, Near Normal,
Above Normal, Much Above Normal, Record Wettest.]

Source: Printed with permission from NOAA/National Climate Data Centre.



( 8) Ratio estimator under SRSWOR sampling

( jj ) Collect information on precipitation for each state included in the SRSWOR
sample of 15 units from the whole population as in part (1). Also calculate the
overall average precipitation in the USA based on information from the 48 states
listed above.
( kk ) Obtain the ratio estimate of average rank temperature.
( ll ) Construct a 95% confidence interval estimate of average rank temperature
using the estimate of the mean square error of the ratio estimator based on an
SRSWOR sample of 15 units. Interpret it.
(mm) Find mean square error of the ratio estimator.
( nn ) Find the relative efficiency of the ratio estimator with respect to the sample
mean estimator.

( 9 ) Stratified sampling: Separate ratio estimator

Use the same stratification as used in the previous section and collect information
on the precipitation from each stratum. Assume that the information of precipitation
in each stratum is known as auxiliary variable.
( 00 ) Construct a 95% confidence interval estimate of the average rank temperature
using separate ratio estimate under equal allocation and proportional allocation.
Discuss the comparison among each other.
( pp ) Find the relative efficiency of the separate ratio estimator with respect to
stratified sampling estimator under equal and proportional allocation.

( 10) Stratified sampling: Combined ratio estimator

( qq ) Construct a 95% confidence interval estimate of the average rank temperature


using combined ratio estimate under equal allocation and proportional allocation.
Discuss the interval estimates.
( rr ) Find the relative efficiency of the combined ratio estimator with respect to
stratified sampling estimator under equal and proportional allocation.

( 11 ) Regression estimator under SRSWOR sampling

( ss ) Obtain the regression estimate of average rank temperature from the same
information as in ( 8 ).
( tt ) Construct a 95% confidence interval estimate of average rank temperature
using estimate of mean square error of the regression estimator based on an
SRSWOR sample of 15 units. Interpret it.
( uu ) Find mean square error of the regression estimator.
( vv ) Find the relative efficiency of the regression estimator with respect to the
sample mean estimator.
( ww ) Find the relative efficiency of the regression estimator with respect to the
ratio estimator.
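The estimators compared in parts (8) and (11) differ only in how the known auxiliary mean enters: ybar_R = (ybar/xbar) Xbar versus ybar_lr = ybar + b (Xbar - xbar). A sketch on fabricated (x, y) pairs shows why the regression estimator wins when the line relating y to x has a non-zero intercept:

```python
import statistics

def ratio_estimate(y, x, Xbar):
    """Usual ratio estimator: ybar_R = (sum y / sum x) * Xbar."""
    return (sum(y) / sum(x)) * Xbar

def regression_estimate(y, x, Xbar):
    """Usual regression estimator: ybar_lr = ybar + b * (Xbar - xbar),
    with b the least-squares slope of y on x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return ybar + (sxy / sxx) * (Xbar - xbar)

x = [2.0, 4.0, 6.0, 8.0]
y = [5.0, 9.0, 13.0, 17.0]    # exactly y = 2x + 1
Xbar = 6.0                    # "known" population mean of x (invented)
print(ratio_estimate(y, x, Xbar), regression_estimate(y, x, Xbar))
```

Here y = 2x + 1 exactly, so the regression estimate recovers the line's value at Xbar = 6 (13.0), while the ratio estimate (13.2) is pulled off target by the intercept; with a zero intercept the two coincide.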

(12) Stratified sampling: Separate regression estimator


( xx ) Construct a 95% confidence interval estimate of the average rank temperature
using separate regression estimate under equal allocation and proportional
allocation. Discuss the interval estimates.
( yy ) Find the relative efficiency of the separate regression estimator with respect
to stratified sampling estimator under equal and proportional allocation.

(13) Stratified sampling: Combined regression estimator

( zz ) Construct a 95% confidence interval estimate of the average rank temperature


using combined regression estimate under equal allocation and proportional
allocation. Discuss the interval estimates .
(aaa) Find the relative efficiency of the combined regression estimator with respect
to stratified sampling estimator under equal and proportional allocation.

( 14) Stratified sampling: Miscellaneous

( bbb ) Select a sample of 15 units using Neyman allocation for separate ratio
estimate and construct a 95% confidence interval estimate, and find its relative
efficiency with respect to usual ratio estimator without stratification.
( ccc ) Select a sample of 15 units using Neyman allocation for separate regression
estima te and construct a 95% confidence interval estimate, and also find its relative
efficiency with respect to usual regression estimator without stratification.
( ddd ) Select a sample of 15 units using Neyman allocation for combined ratio
estimate and construct a 95% confidence interval estimate, and find its relative
efficiency with respect to usual ratio estimator without stratification.
( eee ) Select a sample of 15 units using Neyman allocation for combined regression
estimate and construct a 95% confidence interval estimate, and also find its relative
efficiency with respect to usual regression estimator without stratification.
( fff) Select a sample of 15 units using optimum allocation with same costs defined
earlier for separate ratio estimate and construct a 95% confidence interval estimate,
and find its relative efficiency with respect to usual ratio estimator without
stratification.
( ggg ) Select a sample of 15 units using optimum allocation with same costs for
separate regression estimate and construct a 95% confidence interval estimate, and
also find its relative efficiency with respect to usual regression estimator without
stratification.
( hhh ) Select a sample of 15 units using optimum allocation with same costs for
combined ratio estimate and construct a 95% confidence interval estimate, and find
its relative efficiency with respect to usual ratio estimator without stratification.
( iii) Select a sample of 15 units using optimum allocation for combined regression
estimate and construct 95% confidence interval estimate , and also find its relative
efficiency with respect to usual regression estimator without stratification.
( jjj ) Which estimator is best among all the above estimators based on
stratification? Give your views.

(15) Post-stratified sampling: Miscellaneous

( kkk ) Apply the separate ratio estimator for the post-stratification of sample of 15
units in ( 6 ) to construct 95% confidence interval estimate , and find its relative
efficiency with respect to post-stratified estimator without using auxiliary
information.
( lll ) Apply the separate regression estimator for the post-stratification of sample of
15 units in ( 6 ) to construct 95% confidence interval estimate, and find its relative
efficiency with respect to post-stratified estimator without using auxiliary
information.
( mmm ) Apply the combined ratio estimator for the post stratification of sample of
15 units in ( 6 ) to construct 95% confidence interval estimate, and find its relative
efficiency with respect to post-stratified estimator without using auxiliary
information.
( nnn ) Apply the combined regression estimator for the post stratification of sample
of 15 units in ( 6 ) to construct 95% confidence interval estimate, and find its
relative efficiency with respect to post stratified estimator without using auxiliary
information.
( 000 ) Which estimator is best among all the above estimators based on post-
stratification? Give your views.

( 16) Calibrated weights for stratified sampling: Miscellaneous

( ppp ) Derive the calibrated weights using the known precipitation rank and deduce
estimates of average rank temperature based on them using combined ratio,
combined GREG and combined linear regression estimators.
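With a chi-square distance and a single auxiliary variable, the calibrated weights asked for in (ppp) have a closed form: w_i = d_i (1 + lambda x_i), with lambda chosen so the weighted auxiliary total matches its known value; the resulting estimator is the GREG. A sketch with invented design weights and auxiliary values:

```python
def calibrate_chi2(d, x, X_total):
    """Chi-square-distance calibration on one auxiliary total:
    w_i = d_i * (1 + lam * x_i), lam chosen so sum(w_i * x_i) = X_total."""
    t = sum(di * xi for di, xi in zip(d, x))                 # Horvitz-Thompson total
    lam = (X_total - t) / sum(di * xi * xi for di, xi in zip(d, x))
    return [di * (1 + lam * xi) for di, xi in zip(d, x)]

d = [10.0, 10.0, 10.0]            # design weights (e.g., N/n), invented
x = [2.0, 3.0, 5.0]               # auxiliary values in the sample, invented
w = calibrate_chi2(d, x, X_total=120.0)
print(sum(wi * xi for wi, xi in zip(w, x)))   # reproduces the known total 120
```

Other distance functions change how far the w_i may drift from the d_i, but keep the same "match the known total" constraint.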

Practical 8.13. Use SRSWOR sampling to select a sample of 40 countries from


population 5 given in the Appendix by using the method of proportional allocation.
Record the production and area under the tobacco crop from the countries selected
in the sample . Assuming that the total area in each continent under the tobacco crop
is known, estimate the variance of the estimator used for estimation purpose.
Construct a 95% confidence interval in each case.
( a ) Apply separate ratio estimator to estimate the average production of the
tobacco crop in the world.
( b ) Apply separate regression estimator to estimate the average production of the
tobacco crop in the world .
( c ) Apply combined ratio estimator to estimate the average production of the
tobacco crop in the world.
( d ) Apply the combined regression estimator to estimate the average production of
the tobacco crop in the world.
( e) Using full information from the description of the population, find the relative
efficiency of separate ratio estimator with respect to combined ratio estimator.

( f) Using full information from the description of the population, find the relative
efficiency of separate regression estimator with respect to combined regression
estimator .

St. No.    1   2   3   4        5        6   7         8          9          10
Nh         6   6   8   10       12       4   30        17         10         3
nh         3   3   3   3        4        2   11        6          3          2
PRN Col.   1   2   3   4 and 5  6 and 7  8   9 and 10  11 and 12  13 and 14  15

Practical 8.14. Suppose you are a circus statistician, and in the circus there are
three types of elephants, viz., Niko, Sambo, and Jumbo, which are light, medium,
and heavy in weight. The following table lists the number of elephants in each
category along with their average weights (kg) and population standard deviations.

Number of elephants (Nh):   150    100
Average weight (kg):        3500   5000
Standard deviation:         400    200

Give the following information to the owner of the circus.

( a ) How many elephants , N , are available in the circus?


( b ) What is the average weight, \bar{Y}, of all elephants in the circus?
( c ) Compute the population mean square of all the elephants using the formula

S_y^2 = \frac{\sum_{h=1}^{L}(N_h - 1)S_{hy}^2 + \sum_{h=1}^{L} N_h(\bar{Y}_h - \bar{Y})^2}{N - 1}.

( d ) If we select an SRSWOR sample of n = 40 elephants from the whole circus,
then what will be the variance of the estimator of the average weight of all
elephants in the circus?
( e ) How many elephants will you select from these strata using proportional
allocation to get a sample of n = 40 units?
( f) What will be variance of the estimator of the population mean under
proportional allocation?
( g ) What will be the relative efficiency of proportional allocation with respect to
SRSWOR sampling?
( h ) How many elephants will you select from these strata using the Neyman
allocation to get a sample of n = 40 units?
(i ) What will be variance of the estimator of the population mean under the
Neyman allocation?

( j ) What will be the relative efficiency of the Neyman allocation with respect to
SRSWOR sampling?
( k ) What will be the relative efficiency of the Neyman allocation with respect to
proportional allocation sampling?
( l ) Let C1 = $4, C2 = $9, and C3 = $25 be the cost of weighing an elephant in the
first, second, and third strata. Find the optimum allocation of a sample of
n = 40 units over three strata.
( m ) What will be the variance of the estimator of the population mean under
optimum allocation?
( n ) What will be the relative efficiency of optimum allocation with respect to
SRSWOR sampling?
( o ) What will be the relative efficiency of optimum allocation with respect to
proportional allocation?
( p ) What will be the relative efficiency of optimum allocation with respect to the
Neyman allocation?
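The pooled mean square in (c) is a within-plus-between decomposition, so it can be checked against a direct computation over the combined data; the strata below are invented:

```python
import statistics

def pooled_S2(strata):
    """S_y^2 = [sum_h (N_h - 1) S_hy^2 + sum_h N_h (Ybar_h - Ybar)^2] / (N - 1)."""
    N = sum(len(s) for s in strata)
    Ybar = statistics.mean([v for s in strata for v in s])
    within = sum((len(s) - 1) * statistics.variance(s) for s in strata)
    between = sum(len(s) * (statistics.mean(s) - Ybar) ** 2 for s in strata)
    return (within + between) / (N - 1)

# Invented elephant weights split into three strata.
strata = [[3400, 3600], [4900, 5000, 5100], [7800, 8200]]
direct = statistics.variance([v for s in strata for v in s])
print(abs(pooled_S2(strata) - direct) < 1e-6)   # the identity holds
```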

Practical 8.15. In a circus there are three types of elephants, viz., Light, Medium,
and Heavy in weight. The following table lists the number of elephants in each
category along with their average weights (kg) and average weight of food (kg/day)
along with other information.

Number of elephants (Nh):    150    100
Average weight, kg:          3500   5000
Average food, kg/day:        100    150    250
S.D. of weight (Shy):        500    400    200
S.D. of food (Shx):          50     30     20
Covariance (Shxy):           17500  8400   2800

( a ) How many elephants, N, are available in the circus?


( b ) What is the average weight, Y, of all elephants in the circus?
( c ) What is the average diet, X, of all elephants in the circus?
( d ) Compute the population parameters of all the elephants using the formulae

S_y^2 = \frac{\sum_{h=1}^{L}(N_h - 1)S_{hy}^2 + \sum_{h=1}^{L} N_h(\bar{Y}_h - \bar{Y})^2}{N - 1},

S_x^2 = \frac{\sum_{h=1}^{L}(N_h - 1)S_{hx}^2 + \sum_{h=1}^{L} N_h(\bar{X}_h - \bar{X})^2}{N - 1},

and

( e ) If we select an SRSWOR sample of n = 40 elephants from the whole circus,


then what will be the mean square error of the ratio estimator of the average weight
of all elephants in the circus?
( f ) How many elephants will you select from these strata using proportional
allocation to get a sample of n = 40 units?
( g ) What will be the mean square error of the separate ratio estimator of the
population mean under proportional allocation?
( h ) What will be the relative efficiency of the separate ratio estimator under
proportional allocation with respect to (w.r.t.) usual ratio estimator under SRSWOR
sampling?
( i ) How many elephants will you select from these strata using Neyman
allocation to get a sample of n = 40 units?
(j ) What will be the mean square error of the separate ratio estimator of the
average weight of elephants under Neyman allocation?
( k ) What will be the relative efficiency of the separate ratio estimator under the
Neyman allocation with respect to the usual ratio estimator under SRSWOR
sampling?
( I ) What will be the relative efficiency of the separate ratio estimator under the
Neyman allocation with respect to the separate ratio estimator under proportional
allocation?
( m ) Let C1 = $4, C2 = $9, and C3 = $25 be the cost of weighing an elephant in the
first, second, and third strata . Find the optimum allocation of a sample of
n = 40 units over three strata.

( n ) What will be variance of the separate ratio estimator of the average weight of
the elephants under optimum allocation ?
( 0 ) What will be relative efficiency of the separate ratio estimator under optimum
allocation with respect to the usual ratio estimator under SRSWOR sampling?
(p) What will be relative efficiency of the separate ratio estimator under optimum
allocation with respect to separate ratio estimator under proportional allocation?
( q) What will be relative efficiency of the separate ratio estimator under optimum
allocation with respect to the separate ratio estimator under the Neyman allocation?
( r ) What will be mean square error of the combined ratio estimator of the
population mean under proportional allocation?
( s ) What will be the relative efficiency of the combined ratio estimator under
proportional allocation with respect to the usual ratio estimator under SRSWOR
sampling?
( t ) What will be the mean square error of the combined ratio estimator of the
average weight of elephants under the Neyman allocation ?
(u ) What is the relative efficiency of the combined ratio estimator under the
Neyman allocation with respect to the usual ratio estimator under SRSWOR
sampling?

( v ) What is the relative efficiency of the combined ratio estimator under the
Neyman allocation with respect to the combined ratio estimator under proportional
allocation?
( w) What will be variance of the combined ratio estimator of the average weight of
the elephants under optimum allocation?
( x ) What will be the relative efficiency of the combined ratio estimator under
optimum allocation with respect to the usual ratio estimator under SRSWOR
sampling?
( y ) What will be the relative efficiency of the combined ratio estimator under
optimum allocation with respect to combined ratio estimator under proportional
allocation?
( z ) What will be the relative efficiency of the combined estimator under optimum
allocation with respect to the combined ratio estimator under the Neyman
allocation?
( aa ) If we select an SRSWOR sample of n = 40 elephants from the whole circus,
then what will be the mean square error of the regression estimator of the average
weight of all elephants in the circus?
( bb ) What will be the mean square error of the separate regression estimator of the
population mean under proportional allocation?
( cc ) What will be the relative efficiency of the separate regression estimator under
proportional allocation with respect to the usual regression estimator under
SRSWOR sampling ?
( dd ) What will be the mean square error of the separate regression estimator of the
average weight of elephants under Neyman allocation?
( ee ) What will be the relative efficiency of the separate regression estimator under
the Neyman allocation with respect to the usual regression estimator under
SRSWOR sampling ?
( ff) What is the relative efficiency of the separate regression estimator under the
Neyman allocation with respect to the separate regression estimator for proportional
allocation?
( gg ) Let C1 = $4, C2 = $9, and C3 = $25 be the cost of weighing an elephant in
the first, second , and third strata. Find the optimum allocation of a sample of
n = 40 units over three strata.
( hh ) What will be the variance of the separate regression estimator of the average
weight of the elephants under optimum allocation?
( ii ) What will be the relative efficiency of the separate regression estimator under
optimum allocation with respect to the usual regression estimator under SRSWOR
sampling?
(jj ) What will be the relative efficiency of the separate regression estimator under
optimum allocation with respect to the separate regression estimator under
proportional allocation ?
( kk ) What will be relative efficiency of the separate regression estimator under
optimum allocation with respect to the separate regression estimator for Neyman
allocation ?
( 11 ) What will be mean square error of the combined regression estimator of the
average weight of elephants under proportional allocation?

( mm ) What is the relative efficiency of the combined regression estimator under


proportional allocation with respect to the usual regression estimator under
SRSWOR sampling?
( nn ) What will be mean square error of the combined regression estimator of the
average weight of elephants under the Neyman allocation?
( 00 ) What will be the relative efficiency of the combined regression estimator
under the Neyman allocation with respect to the regression estimator under
SRSWOR sampling?
( pp ) What will be the relative efficiency of the combined regression estimator
under the Neyman allocation with respect to the combined regression estimator
under proportional allocation?
( qq ) What will be variance of the combined regression estimator of the average
weight of the elephants under optimum allocation?
( rr ) What will be relative efficiency of the combined regression estimator under
optimum allocation with respect to the usual regression estimator under SRSWOR
sampling?
( ss ) What will be relative efficiency of the combined regression estimator under
optimum allocation with respect to the combined regression estimator under
proportional allocation?
( tt ) What will be the relative efficiency of the combined regression estimator
under optimum allocation with respect to the combined regression estimator under
Neyman allocation?

Practical 8.16. In a circus there are three types of elephants, viz., Light, Medium,
and Heavy in weight and some information about them is listed below:

100
8
5200
244.8
250
490 410 220
45 25 30
17200 8000 5500

( a ) How many elephants, N, are available in the circus?


( b ) Select an SRSWOR sample of 40 elephants using proportional allocation.
( c ) Estimate the average weight, Y, of all elephants in the circus using the usual
estimator in stratified sampling

\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h

and derive a 95% confidence interval estimate.


( d ) Assuming that the average food Xh , h = 1,2,3 are known , estimate the average
weight, Y, of all elephants in the circus using separate ratio estimator in stratified
sampling and derive a 95% confidence interval estimate.
( e ) Assuming that the average food X is known, estimate the average weight , Y ,
of all elephants in the circus using combined ratio estimator in stratified sampling
and derive a 95% confidence interval estimate.
( f) Assuming that the average food Xh , h = 1,2,3 are known, estimate the average
weight, Y, of all elephants in the circus using separate regression estimator in
stratified sampling and derive a 95% confidence interval estimate.
( g ) Assuming that the average food X is known, estimate the average weight, Y,
of all elephants in the circus using combined regression estimator in stratified
sampling and derive a 95% confidence interval estimate.
( h ) Assume that the sample of n = 40 elephants is an SRSWOR sample across the
whole circus, and then find the following

\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n}

and

s_{xy} = \frac{\sum_{h=1}^{3}(n_h - 1)s_{hxy} + \sum_{h=1}^{3} n_h(\bar{x}_h - \bar{x})(\bar{y}_h - \bar{y})}{n - 1}

( i ) Apply the usual ratio estimator

\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)

to estimate the average weight of the elephants and derive a 95% CI estimate.
( j ) Apply the usual regression estimator

\bar{y}_{LR} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x})

to estimate the average weight of the elephants and derive a 95% CI estimate.
( k ) Compare all your confidence interval estimates and decide which one is
better, and why.
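The pooled covariance in (h) obeys the same within-plus-between pattern as the pooled variances; the sketch below (toy (x, y) pairs, not the circus data) checks it against a direct computation over the combined sample:

```python
def pooled_sxy(strata):
    """s_xy = [sum_h (n_h - 1) s_hxy + sum_h n_h (xbar_h - xbar)(ybar_h - ybar)] / (n - 1)."""
    pairs = [p for s in strata for p in s]
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    total = 0.0
    for s in strata:
        nh = len(s)
        xh = sum(x for x, _ in s) / nh
        yh = sum(y for _, y in s) / nh
        shxy = sum((x - xh) * (y - yh) for x, y in s) / (nh - 1)  # within-stratum covariance
        total += (nh - 1) * shxy + nh * (xh - xbar) * (yh - ybar)
    return total / (n - 1)

def direct_sxy(pairs):
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    return sum((x - xbar) * (y - ybar) for x, y in pairs) / (n - 1)

strata = [[(40, 480), (50, 500)], [(25, 400), (35, 420), (30, 410)]]
print(abs(pooled_sxy(strata) - direct_sxy([p for s in strata for p in s])) < 1e-9)
```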

Practical 8.17. Michael believes that when a soul is born it automatically falls into
one of the religious categories, say, Sikh, Hindu, Muslim, Christian etc., existing in
the world. Is it an example of stratification or post-stratification? Comment.

Practical 8.18. Ms. Stephanie Singh selects a sample of 50 countries from


population 5 given in the Appendix using the method of proportional allocation and
records the production and area of the tobacco crop from the countries selected in
the sample. Assume that the area under the tobacco crop in different continents is
known . Obtain the calibration weights which result in ( a ) combined ratio, ( b )
combined GREG, (c) combined linear regression estimators . Deduce the values of
the estimates.

Practical 8.19. Divide your class into two groups based on gender. Decide to
select a sample of reasonable size with the suggestion of the class instructor. Select
a sample of the required size using proportional allocation, and collect information
on the GPA of the students selected in the sample from both strata.
( a ) Estimate the average GPA of the class using the usual formula in stratified
sampling. Construct the 95% confidence interval estimate.
( b) Collect information about the number of classes attended by the students from
the register of the instructor if he/she permits. Use this information to improve your
estimates in ( a ) and comment.
9. NON-OVERLAPPING, OVERLAPPING, POST, AND
ADAPTIVE CLUSTER SAMPLING

In survey sampling the basic assumption is that the population consists of a finite
number of distinct and identifiable units. A group of such units is called a cluster.
If, instead of randomly selecting a unit for the sample, a group of units is selected as a
single unit in the sample, it is called cluster sampling. If the entire area containing
the population under study is divided into smaller segments, and if each unit of the
population belongs to only one segment, the procedure is called area sampling or
non-overlapping cluster sampling. If one or a few units appear in more than one
segment or cluster, then such a procedure is called overlapping cluster sampling.
The main purpose of cluster sampling is to divide the population into small groups
with each group serving as a sample unit. Clusters are generally made up of
neighbouring elements; therefore the elements within a cluster tend to be
homogeneous. However, at some stage in the research we become interested in
heterogeneous clusters rather than homogeneous ones. More broadly, the concept of
forming strata in the previous chapter was to form homogeneous groups, whereas in
this chapter the concept of forming clusters will be to form groups of a
heterogeneous nature. After dividing the population into clusters the sample of
clusters can be selected with either equal or unequal probability. The concept of
unequal probability may be based on the size of the cluster; that is, the larger the
cluster, the larger the probability of its being selected in the sample . All the units in
the selected cluster will be enumerated. As a simple rule the number of units in a
cluster should be small and the number of clusters should be large. The main
advantage of cluster sampling is that it is cheaper, since the collection of data for
neighbouring units is easier and faster. It is also useful when the frame for selecting
the sample is not available at the unit level. For example, a list of persons may not
be available, whereas a list at a dwelling level may be available. For a given sample
size cluster sampling is less efficient than simple random sampling . However, in
most situations the loss in efficiency can be balanced by the reduction in cost. Any
sampling procedure, for example simple random sampling , stratified sampling , or
systematic sampling , may be applied to cluster sampling by using the clusters
themselves as sampling units. Smith (1938) and Hansen and Hurwitz (1942) have
discussed the efficiency of cluster sampling.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003
766 Advanced sampling theory with applications

[Figure: the population divided into Cluster 1, Cluster 2, ..., Cluster N; the clusters are heterogeneous within.]

Fig. 9.0.1. Pictorial representation of cluster sampling.

The well known examples of clusters and their units are given below:

Cluster : School    Dwelling   Day
Units   : Students  Persons    Hours

There are two possibilities: ( a ) clusters of equal size or (b) clusters of different
sizes. It is obvious that if every cluster has the same number of units, then the
chance of selection of each cluster in the sample will be the same. On the other
hand, if the number of units in the clusters is different and known, then different
probabilities proportional to the number of units in the clusters can be assigned
before taking the sample. We will discuss both these situations in detail below.

Let N denote the number of clusters in the population and M the number of units
in each cluster. Evidently, the total number of units in the population is NM. We
select a sample of n clusters and hence we select nM units from the population. Let
us define
$Y_{ij}$ = the value of the study variable $y$ for the $j$th population element $(j = 1, 2, \ldots, M)$ in the $i$th cluster $(i = 1, 2, \ldots, N)$;

$Y_{i.} = \sum_{j=1}^{M} Y_{ij}$, the $i$th cluster total of the population $y$ values;

$Y_{..} = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}$, the grand total or population total;

$\bar{\bar{Y}} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij} = \frac{Y_{..}}{NM}$, the population mean per unit, which we want to estimate from the sample information;

$\bar{Y} = Y_{..}/N$, the population mean per cluster;

$\bar{Y}_{i.} = \frac{Y_{i.}}{M} = \frac{1}{M}\sum_{j=1}^{M} Y_{ij}$, the mean of the $M$ population elements in the $i$th cluster;

$y_{..} = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}$, the sample total of the study variable $y$;

$\bar{y} = y_{..}/n$, the sample mean per cluster;

$\bar{\bar{y}} = \frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij} = \frac{\bar{y}}{M}$, the sample mean per unit;

$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2$, the variance between the means of different clusters;

$S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M}\left(Y_{ij} - \bar{Y}_{i.}\right)^2$, the variance within the $i$th cluster;

$\bar{S}_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2 = \frac{1}{N(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{Y}_{i.}\right)^2$, the mean value of the within-cluster variances;

$S^2 = \frac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2$, the total population variance;

and

$\rho = \dfrac{\sum_{i=1}^{N}\sum_{j \ne j'}\left(Y_{ij} - \bar{\bar{Y}}\right)\left(Y_{ij'} - \bar{\bar{Y}}\right)}{NM(M-1)S^2}$, the intra-cluster correlation coefficient.

Then we have the following theorems:

Theorem 9.1.1. The sample mean per unit
$$\bar{\bar{y}} = \frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij} \qquad (9.1.1)$$
is unbiased for the population mean $\bar{\bar{Y}}$.
Proof. We have
$$E(\bar{\bar{y}}) = E\left(\frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}\right) = \frac{1}{M}E\left(\frac{1}{n}\sum_{i=1}^{n} y_{i.}\right) = \frac{1}{M}\left(\frac{1}{N}\sum_{i=1}^{N} Y_{i.}\right) = \frac{\bar{Y}}{M} = \bar{\bar{Y}},$$
since under SRSWOR of clusters the sample mean of the cluster totals is unbiased for their population mean.
Hence the theorem.

Theorem 9.1.2. The variance of the estimator $\bar{\bar{y}}$ is given by
$$V(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2. \qquad (9.1.2)$$
Proof. Since $\bar{\bar{y}} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i.}$ is the mean of an SRSWOR sample of $n$ cluster means, we have
$$V(\bar{\bar{y}}) = V\left(\frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i.}\right) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2 = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2.$$
Hence the theorem.

Theorem 9.1.3. An unbiased estimator of $V(\bar{\bar{y}})$ is given by
$$\hat{v}(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 \qquad (9.1.3)$$
where $s_b^2 = (n-1)^{-1}\sum_{i=1}^{n}\left(\bar{y}_{i.} - \bar{\bar{y}}\right)^2$.

Proof. Obvious by using $E(s_b^2) = S_b^2$.

Remark 9.1.1. A $(1-\alpha)100\%$ confidence interval estimate of the population mean $\bar{\bar{Y}}$ will be given by
$$\bar{\bar{y}} \pm t_{\alpha/2}\{df = n(M-1)\}\sqrt{\hat{v}(\bar{\bar{y}})}.$$
Example 9.1.1. Using the map of the United States of America, construct 10
clusters from the 50 states, each cluster consisting of 5 states.

Solution. There are several possible ways to construct 10 clusters based on their
locations. The number of possibilities may be decreased if we have some additional
information, for example, distance to be travelled between states to collect data . In
such a situation, we may decide to minimise the distance to be travelled by the data
collectors.


Looking at the MAP of the USA we arrive at the following situation.



Table 9.1.1. Clusters of equal size.

1  | WA, OR, MT, NV, ID
2  | CA, AZ, NM, AK, HI
3  | UT, WY, CO, SD, NE
4  | TX, OK, KS, LA, AR
5  | MO, IA, MN, ND, WI
6  | IL, IN, MI, OH, PA
7  | MS, AL, GA, FL, SC
8  | TN, KY, WV, VA, NC
9  | NY, VT, NH, ME, MA
10 | RI, CT, NJ, DE, MD

Example 9.1.2. Select 6 clusters from Table 9.1.1 using SRSWOR sampling.
Record the values of the real estate farm loans for the selected states from
population I given in the Appendix. Estimate the average real estate farm loans in
the United States using the cluster sampling. Estimate the variance of the estimator
used for estimating the average real estate farm loans . Construct a 95% confidence
interval.

Solution. Starting with the first two columns of the Pseudo-Random Number
(PRN) Table 1 given in the Appendix, we select the required 6 distinct random
numbers as 01, 04, 05, 03, 06 and 07. Thus the following clusters are included in an
SRSWOR sample of 6 units.

Cluster | States             | Real estate farm loans                          | Total
1       | WA, OR, MT, NV, ID | 1100.745, 114.899, 292.965, 5.860, 53.753       | 1568.222
3       | UT, WY, CO, SD, NE | 56.908, 100.964, 315.809, 413.777, 1337.852     | 2225.310
4       | TX, OK, KS, LA, AR | 1248.761, 612.108, 1049.834, 282.565, 907.700   | 4100.968
5       | MO, IA, MN, ND, WI | 1579.686, 2327.025, 1354.768, 449.099, 1229.752 | 6940.330
6       | IL, IN, MI, OH, PA | 2131.048, 1213.024, 323.028, 870.720, 756.169   | 5293.989
7       | MS, AL, GA, FL, SC | 627.013, 408.978, 939.460, 825.748, 87.951      | 2889.150
Grand total |               |                                                  | 23017.970

Note that $M = 5$ and $n = 6$. Thus an estimate of the average real estate farm loans
in the United States is given by
$$\bar{\bar{y}} = \frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij} = \frac{1}{6\times 5}\times 23017.97 = 767.2656.$$

Now we have

Cluster | $\bar{y}_{i.}$ | $(\bar{y}_{i.} - \bar{\bar{y}})^2$
1       | 313.644        | 205772.56
2       | 445.062        | 103815.16
3       | 820.193        | 2801.31
4       | 1388.066       | 385393.14
5       | 1058.798       | 84991.14
6       | 577.830        | 35885.85
Total   |                | 818659.16

Thus we have
$$s_b^2 = \frac{\sum_{i=1}^{n}\left(\bar{y}_{i.} - \bar{\bar{y}}\right)^2}{n-1} = \frac{818659.16}{6-1} = 163731.832.$$
Thus an estimate of $V(\bar{\bar{y}})$ is given by
$$\hat{v}(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 = \left(\frac{1}{6} - \frac{1}{10}\right)\times 163731.832 = 10915.455.$$
A $(1-\alpha)100\%$ confidence interval of the population mean $\bar{\bar{Y}}$ is given by
$$\bar{\bar{y}} \pm t_{\alpha/2}\{df = n(M-1)\}\sqrt{\hat{v}(\bar{\bar{y}})}.$$
Thus the 95% confidence interval for the average real estate farm loans in the United States is given by
$$\bar{\bar{y}} \pm t_{0.025}(df = 6\times 4)\sqrt{\hat{v}(\bar{\bar{y}})}, \quad\text{or}\quad \bar{\bar{y}} \pm t_{0.025}(df = 24)\sqrt{\hat{v}(\bar{\bar{y}})},$$
that is,
$$767.2656 \pm 2.064\sqrt{10915.45}, \quad\text{or}\quad [551.62,\ 982.91].$$

Theorem 9.1.4. The relative efficiency of cluster sampling with respect to SRS is
$$RE = S^2/(M S_b^2). \qquad (9.1.4)$$
Proof. We know that the variance of the estimator $\bar{\bar{y}}$ under cluster sampling is
$$V(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 \qquad (9.1.5)$$
and that of $\bar{\bar{y}}$ under SRSWOR sampling of $nM$ units is
$$V(\bar{\bar{y}})_{SRS} = \left(\frac{1}{nM} - \frac{1}{NM}\right)S^2 = \frac{1}{M}\left(\frac{1}{n} - \frac{1}{N}\right)S^2. \qquad (9.1.6)$$
The relative efficiency of the estimator in cluster sampling with respect to the one
in SRSWOR sampling is
$$RE = V(\bar{\bar{y}})_{SRS}\big/V(\bar{\bar{y}}) = \left(\frac{1}{nM} - \frac{1}{NM}\right)S^2\Big/\left(\frac{1}{n} - \frac{1}{N}\right)S_b^2 = S^2/(M S_b^2). \qquad (9.1.7)$$
Hence the theorem.

Example 9.1.3. Suppose the United States has been divided into 10 neighbouring
clusters, each consisting of 5 states, as shown below.

Cluster Number | Names of the states
1  | WA, OR, MT, NV, ID
2  | CA, AZ, NM, AK, HI
3  | UT, WY, CO, SD, NE
4  | TX, OK, KS, LA, AR
5  | MO, IA, MN, ND, WI
6  | IL, IN, MI, OH, PA
7  | MS, AL, GA, FL, SC
8  | TN, KY, WV, VA, NC
9  | NY, VT, NH, ME, MA
10 | RI, CT, NJ, DE, MD

We are interested in estimating the average real estate farm loans in the United
States. Is there any gain in efficiency due to clustering as opposed to simple random
sampling?
Solution. We have

Cluster No. | Values of the real estate farm loans            | $\bar{Y}_{i.}$ | $(\bar{Y}_{i.} - \bar{\bar{Y}})^2$
1  | 1100.745, 114.899, 292.965, 5.860, 53.753       | 313.644  | 58462.46
2  | 1343.461, 54.633, 140.582, 2.605, 40.775        | 316.411  | 57132.15
3  | 56.908, 100.964, 315.809, 413.777, 1337.852     | 445.062  | 12182.09
4  | 1248.761, 612.108, 1049.834, 282.565, 907.700   | 820.193  | 70097.37
5  | 1579.686, 2327.025, 1354.768, 449.099, 1229.752 | 1388.066 | 693275.20
6  | 2131.048, 1213.024, 323.028, 870.720, 756.169   | 1058.798 | 253374.60
7  | 627.013, 408.978, 939.460, 825.748, 87.951      | 577.830  | 501.56
8  | 553.266, 1045.106, 99.277, 321.583, 639.571     | 531.760  | 560.45
9  | 201.631, 57.747, 6.044, 8.849, 7.590            | 56.372   | 249063.20
10 | 1.611, 7.130, 39.860, 42.808, 139.628           | 46.207   | 259312.30
Total |                                              | 5554.345 | 1653961.00

From the above table we have
$$\bar{\bar{Y}} = \frac{5554.345}{10} = 555.4345.$$
Thus we have
$$S_b^2 = (N-1)^{-1}\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2 = \frac{1653961}{10-1} = 183773.444.$$
Also we have
$$S^2 = 342021.5.$$

The relative efficiency of the estimator in cluster sampling with respect to the one in
SRSWOR sampling is given by
$$RE = \frac{S^2}{M S_b^2} = \frac{342021.5}{5\times 183773.444} = 0.3722.$$
In this case, cluster sampling is less efficient than simple random sampling. Thus
there is no gain in efficiency through cluster sampling in this example. Let us try to
find the reason.
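One way to see the numbers is to recompute the population quantities directly. The sketch below, in plain Python with names of our own choosing, evaluates $S^2$, $S_b^2$, and $RE = S^2/(M S_b^2)$ of Theorem 9.1.4 from a full population of equal-sized clusters; applied to the ten clusters above it returns an RE close to 0.3722.

```python
def cluster_population_re(population):
    """Relative efficiency RE = S^2 / (M * S_b^2) of cluster sampling
    versus SRSWOR, computed from a full population of equal-sized clusters.

    population : list of N clusters, each a list of M y-values
    """
    N = len(population)
    M = len(population[0])
    grand_mean = sum(sum(c) for c in population) / (N * M)
    cluster_means = [sum(c) / M for c in population]
    # S_b^2: variance between the cluster means
    S_b2 = sum((m - grand_mean) ** 2 for m in cluster_means) / (N - 1)
    # S^2: total population variance
    S2 = sum((y - grand_mean) ** 2
             for c in population for y in c) / (N * M - 1)
    return S2 / (M * S_b2)
```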

Theorem 9.1.5. The relative efficiency at (9.1.7) in terms of the intraclass correlation coefficient $\rho$ is given by
$$RE = [1 + (M-1)\rho]^{-1}. \qquad (9.1.8)$$
Thus cluster sampling is more efficient than SRS if $\rho < 0$.
Proof. To find the relative efficiency in terms of the intraclass correlation coefficient $\rho$, let us first express
$$\sum_{i=1}^{N}\sum_{j \ne j'}\left(Y_{ij} - \bar{\bar{Y}}\right)\left(Y_{ij'} - \bar{\bar{Y}}\right)$$
in terms of $S^2$ and $S_b^2$. To do so we have
$$\sum_{i=1}^{N}\left[\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)\right]^2 = \sum_{i=1}^{N}\left(Y_{i.} - M\bar{\bar{Y}}\right)^2 = \sum_{i=1}^{N}\left(M\bar{Y}_{i.} - M\bar{\bar{Y}}\right)^2 = M^2\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2 = M^2(N-1)S_b^2. \qquad (9.1.9)$$
Also we have
$$\sum_{i=1}^{N}\left[\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)\right]^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2 + \sum_{i=1}^{N}\sum_{j \ne j'}\left(Y_{ij} - \bar{\bar{Y}}\right)\left(Y_{ij'} - \bar{\bar{Y}}\right). \qquad (9.1.10)$$
Now
$$S^2 = \frac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2 \quad\text{and}\quad \rho = \frac{\sum_{i=1}^{N}\sum_{j \ne j'}\left(Y_{ij} - \bar{\bar{Y}}\right)\left(Y_{ij'} - \bar{\bar{Y}}\right)}{NM(M-1)S^2},$$
which implies that
$$\sum_{i=1}^{N}\left[\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)\right]^2 = (NM-1)S^2 + NM(M-1)\rho S^2. \qquad (9.1.11)$$
On comparing (9.1.9) and (9.1.11) we have
$$(NM-1)S^2 + NM(M-1)\rho S^2 = M^2(N-1)S_b^2$$
or
$$S^2/(M S_b^2) = M(N-1)/\{(NM-1) + NM(M-1)\rho\}.$$
Note that $M$ and 1 may both be neglected in comparison to $NM$, so the relative efficiency becomes
$$RE = \frac{S^2}{M S_b^2} = \frac{M(N-1)}{(NM-1) + NM(M-1)\rho} \approx \frac{1}{1 + (M-1)\rho} = [1 + (M-1)\rho]^{-1}, \qquad (9.1.12)$$
which proves the first part of the theorem.
To prove the second part we have $RE > 1$, which implies that
$$[1 + (M-1)\rho]^{-1} > 1, \quad\text{or}\quad 1 + (M-1)\rho < 1, \qquad (9.1.13)$$
which is possible only if $\rho < 0$.
Hence the theorem.

Remark 9.1.2. Cluster sampling is more efficient than SRS if the intraclass correlation
coefficient $\rho < 0$.
Remark 9.1.3. If $\rho = 0$, then cluster sampling and SRS are equally efficient.
Remark 9.1.4. In practice units which are near one another are more similar than
units which are far apart; therefore $\rho$ is positive, and hence in general the efficiency of
cluster sampling is less than that of SRS. We will observe later on that cluster
sampling can be more efficient than SRS for the fixed cost of a given survey.

Example 9.1.4. In Example 9.1.3, find the value of the intraclass correlation
coefficient. Use it to find the relative efficiency of cluster sampling over simple
random sampling.
Solution. Using information from Example 9.1.3 we have

Cluster No. | Deviations $(Y_{ij} - \bar{\bar{Y}})$
1  | 545.3105, -440.5360, -262.4700, -549.5750, -501.6820
2  | 788.0265, -500.8020, -414.8530, -552.8300, -514.6600
3  | -498.5270, -454.4710, -239.6260, -141.6580, 782.4175
4  | 693.3265, 56.6735, 494.3995, -272.8700, 352.2655
5  | 1024.2510, 1771.5900, 799.3335, -106.3360, 674.3175
6  | 1575.6130, 657.5895, -232.4070, 315.2855, 200.7345
7  | 71.5785, -146.4570, 384.0255, 270.3135, -467.4840
8  | -2.1685, 489.6715, -456.1580, -233.8520, 84.1365
9  | -353.8040, -497.6880, -549.3910, -546.5860, -547.8450
10 | -553.8240, -548.3050, -515.5750, -512.6270, -415.8070

For computing the intra-class correlation coefficient, the ten pairwise cross products
$(Y_{ij} - \bar{\bar{Y}})(Y_{ij'} - \bar{\bar{Y}})$, $j < j'$, are formed within each cluster. Their sums,
cluster by cluster, are given in the following table.

Cluster No. | Sum of within-cluster cross products
1  | 173759.80
2  | -423798.80
3  | -420092.04
4  | 412769.87
5  | 6019655.30
6  | 1612835.67
7  | -226560.20
8  | -247807.58
9  | 2476495.50
10 | 2586974.80
Total | 11964232.30

Thus, doubling the total (each unordered pair $j < j'$ is counted once in the table), we have
$$\sum_{i=1}^{N}\sum_{j \ne j'}\left(Y_{ij} - \bar{\bar{Y}}\right)\left(Y_{ij'} - \bar{\bar{Y}}\right) = 2\times 11964232.30 = 23928464.6.$$
Thus the value of the intraclass correlation coefficient is given by
$$\rho = \frac{23928464.6}{10\times 5\times(5-1)\times 342021.5} = 0.349809.$$

Note that the value of the intraclass correlation coefficient is positive, and hence
cluster sampling is less efficient in this particular situation.
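The intraclass correlation can also be recomputed mechanically. The plain-Python sketch below (function name ours) uses the within-cluster identity $\sum_{j \ne j'} d_j d_{j'} = (\sum_j d_j)^2 - \sum_j d_j^2$, with $d_j = Y_{ij} - \bar{\bar{Y}}$, to avoid looping over pairs. Run on the ten clusters of Example 9.1.3 it gives $\rho \approx 0.36$, agreeing in sign and magnitude with the hand computation above (small numerical differences trace to rounding and arithmetic in the intermediate cross-product tables); $[1+(M-1)\rho]^{-1}$ then gives the approximate RE of (9.1.12).

```python
def intraclass_rho(population):
    """Intra-cluster correlation coefficient
    rho = sum_i sum_{j != j'} (Y_ij - mean)(Y_ij' - mean) / {N M (M-1) S^2}.
    """
    N = len(population)
    M = len(population[0])
    grand_mean = sum(sum(c) for c in population) / (N * M)
    S2 = sum((y - grand_mean) ** 2
             for c in population for y in c) / (N * M - 1)
    cross = 0.0
    for c in population:
        d = [y - grand_mean for y in c]
        # sum over ordered pairs j != j' of d_j * d_j'
        cross += sum(d) ** 2 - sum(x * x for x in d)
    return cross / (N * M * (M - 1) * S2)
```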

Effect of cluster size ( $M$ ) on the relative efficiency: The relative efficiency of
cluster sampling with respect to SRS is defined as
$$RE = S^2/(M S_b^2). \qquad (9.1.14)$$
Following Smith (1938) we assume the relation $S_b^2 = S^2/M^g$ with
$$0 < g < 1. \qquad (9.1.15)$$
Thus (9.1.14) implies that
$$RE = (S^2 M^g)/(M S^2) = M^{g-1}. \qquad (9.1.16)$$
Thus we have the following graph, which shows that an increase in the cluster size $M$
decreases the efficiency of cluster sampling provided that $(1-g) > 0$.
[Figure: effect on RE of an increase in cluster size for given values of $g$
($g$ = 0.1, 0.3, 0.5, 0.7, 0.9), with RE $= M^{g-1}$ plotted against the cluster size.]
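The decay pictured in the graph can be tabulated directly. A minimal plain-Python sketch (the values of $g$ are chosen purely for illustration):

```python
def relative_efficiency(M, g):
    """RE = M**(g - 1) under Smith's (1938) relation S_b^2 = S^2 / M**g."""
    return M ** (g - 1)

# For any 0 < g < 1, RE falls as the cluster size M grows.
for g in (0.1, 0.5, 0.9):
    row = [round(relative_efficiency(M, g), 3) for M in (1, 2, 5, 10, 50)]
    print("g =", g, "->", row)
```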

Theorem 9.1.6. An estimator of the relative efficiency from the sample information, for
a large number of clusters, is given by
$$Est(RE) = \{s_b^2 + (1 - M^{-1})\bar{s}_w^2\}/(M s_b^2). \qquad (9.1.17)$$
Proof. In cluster analysis there are three sources of variation, namely
( a ) total variation,
( b ) between cluster variation,
and
( c ) within cluster variation.
Thus cluster analysis may be expressed in an Analysis of Variance (ANOVA) table
form as below:

Cluster | Observations                                              | Cluster total | Within-cluster variance
1       | $Y_{11}, Y_{12}, \ldots, Y_{1j}, \ldots, Y_{1M}$          | $Y_{1.}$      | $S_1^2$
2       | $Y_{21}, Y_{22}, \ldots, Y_{2j}, \ldots, Y_{2M}$          | $Y_{2.}$      | $S_2^2$
...     | ...                                                       | ...           | ...
$i$     | $Y_{i1}, Y_{i2}, \ldots, Y_{ij}, \ldots, Y_{iM}$          | $Y_{i.}$      | $S_i^2$
...     | ...                                                       | ...           | ...
$N$     | $Y_{N1}, Y_{N2}, \ldots, Y_{Nj}, \ldots, Y_{NM}$          | $Y_{N.}$      | $S_N^2$

Thus from the population information we have:

Population Grand Total $= Y_{..} = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}$.

Thus the population Correction Factor is:
$$CF = \frac{(\text{Population Grand Total})^2}{\text{Total No. of Observations}} = \frac{(Y_{..})^2}{NM}.$$
The corrected population total sum of squares ( SST ) is given by
$$SST = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - CF = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - \frac{(Y_{..})^2}{NM} = \sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2 = (NM-1)S^2.$$
The corrected between-cluster sum of squares ( SSB ), based on the cluster totals, is
$$SSB = \sum_{i=1}^{N}\frac{Y_{i.}^2}{M} - CF = \sum_{i=1}^{N}\frac{Y_{i.}^2}{M} - \frac{(Y_{..})^2}{NM} = M\sum_{i=1}^{N}\bar{Y}_{i.}^2 - NM\bar{\bar{Y}}^2 = M\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2 = M(N-1)S_b^2.$$
The corrected within-cluster sum of squares ( SSW ) is
$$SSW = SST - SSB = \left[\sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - CF\right] - \left[\sum_{i=1}^{N}\frac{Y_{i.}^2}{M} - CF\right] = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - M\sum_{i=1}^{N}\bar{Y}_{i.}^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{Y}_{i.}\right)^2$$
$$= (M-1)\sum_{i=1}^{N}\frac{1}{M-1}\sum_{j=1}^{M}\left(Y_{ij} - \bar{Y}_{i.}\right)^2 = (M-1)\sum_{i=1}^{N} S_i^2 = N(M-1)\bar{S}_w^2.$$

Source of variation | df       | Sum of squares                                                      | Mean sum of squares
Between clusters    | $N-1$    | $SSB = M\sum_{i=1}^{N}(\bar{Y}_{i.} - \bar{\bar{Y}})^2$             | $MSB = \frac{M}{N-1}\sum_{i=1}^{N}(\bar{Y}_{i.} - \bar{\bar{Y}})^2 = M S_b^2$
Within clusters     | $N(M-1)$ | $SSW = \sum_{i=1}^{N}\sum_{j=1}^{M}(Y_{ij} - \bar{Y}_{i.})^2$       | $MSW = \frac{SSW}{N(M-1)} = \bar{S}_w^2$
Total               | $NM-1$   | $SST = \sum_{i=1}^{N}\sum_{j=1}^{M}(Y_{ij} - \bar{\bar{Y}})^2$      | $MST = \frac{SST}{NM-1} = S^2$

with $F = MSB/MSW$.

As mentioned earlier, cluster sampling performs better when the clusters resemble one
another and the units within each cluster are heterogeneous. In other words, the variation
of units or elements within the clusters ( SSW ) should be large, but the variation
from cluster to cluster ( SSB ) should be small, for cluster sampling to be more
efficient. Thus if the value of the population F ratio is large, then cluster sampling
will not be efficient. Similarly, as we saw earlier, if the mean total sum of squares
$MST = S^2$, that is, the variation among all elements in the population, is smaller than
the between-cluster mean sum of squares $MSB = M S_b^2$, then cluster sampling will be less
efficient. Thus for cluster sampling to be more efficient than SRSWOR sampling, we
need $S_b^2 < S^2/M$. Although we have already discussed one method to find the value of the
intraclass (or intra-cluster) correlation coefficient $\rho$, the ANOVA method can
also be used as
$$\rho = 1 - \left(\frac{M}{M-1}\right)\frac{SSW}{SST}.$$
Note that $0 \le \frac{SSW}{SST} \le 1$; therefore the value of the intraclass correlation coefficient lies
in $-\frac{1}{M-1} \le \rho \le 1$. Again note the conditions of efficiency in terms of the intraclass
correlation coefficient. Further note that the intra-class correlation coefficient can
be easily defined only for clusters of equal size. Similarly, the sample analogue of
the population table looks as given below:

Cluster | Observations                                              | Cluster total | Within-cluster variance
1       | $y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1M}$          | $y_{1.}$      | $s_1^2$
2       | $y_{21}, y_{22}, \ldots, y_{2j}, \ldots, y_{2M}$          | $y_{2.}$      | $s_2^2$
...     | ...                                                       | ...           | ...
$i$     | $y_{i1}, y_{i2}, \ldots, y_{ij}, \ldots, y_{iM}$          | $y_{i.}$      | $s_i^2$
...     | ...                                                       | ...           | ...
$n$     | $y_{n1}, y_{n2}, \ldots, y_{nj}, \ldots, y_{nM}$          | $y_{n.}$      | $s_n^2$

Thus from the sample information we have

Sample Grand Total $= y_{..} = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}$.

Thus the sample correction factor is
$$cf = \frac{(\text{Sample Grand Total})^2}{\text{Total No. of Observations}} = \frac{(y_{..})^2}{nM}.$$
The corrected sample total sum of squares ( sst ) is given by
$$sst = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}^2 - cf = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}^2 - \frac{(y_{..})^2}{nM} = \sum_{i=1}^{n}\sum_{j=1}^{M}\left(y_{ij} - \bar{\bar{y}}\right)^2 = (nM-1)s^2.$$
The corrected between-cluster sample sum of squares ( ssb ) is
$$ssb = \sum_{i=1}^{n}\frac{y_{i.}^2}{M} - cf = M\sum_{i=1}^{n}\left(\bar{y}_{i.} - \bar{\bar{y}}\right)^2 = M(n-1)s_b^2.$$
The corrected within-cluster sample sum of squares ( ssw ) is given by
$$ssw = sst - ssb = \left[\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}^2 - cf\right] - \left[\sum_{i=1}^{n}\frac{y_{i.}^2}{M} - cf\right] = \sum_{i=1}^{n}\sum_{j=1}^{M}\left(y_{ij} - \bar{y}_{i.}\right)^2 = (M-1)\sum_{i=1}^{n} s_i^2 = n(M-1)\bar{s}_w^2.$$
Thus the sample ANOVA table is:

Source of variation | df       | Sum of squares                                                      | Mean sum of squares
Between clusters    | $n-1$    | $ssb = M\sum_{i=1}^{n}(\bar{y}_{i.} - \bar{\bar{y}})^2$             | $msb = \frac{M}{n-1}\sum_{i=1}^{n}(\bar{y}_{i.} - \bar{\bar{y}})^2 = M s_b^2$
Within clusters     | $n(M-1)$ | $ssw = \sum_{i=1}^{n}\sum_{j=1}^{M}(y_{ij} - \bar{y}_{i.})^2$       | $msw = \frac{ssw}{n(M-1)} = \bar{s}_w^2$
Total               | $nM-1$   | $sst = \sum_{i=1}^{n}\sum_{j=1}^{M}(y_{ij} - \bar{\bar{y}})^2$      | $mst = \frac{sst}{nM-1} = s^2$

with $F = msb/msw$.

Thus if the value of the sample F ratio is large, then we can guess that cluster
sampling may not be efficient. Equivalently, for cluster sampling to be more efficient
than SRSWOR sampling we need $s_b^2 < s^2/M$. Obviously, an estimate of the intra-class
(or intra-cluster) correlation coefficient $\rho$ can be had from the sample
ANOVA as:
$$\hat{\rho} = 1 - \left(\frac{M}{M-1}\right)\frac{ssw}{sst}.$$
Note that $0 \le \frac{ssw}{sst} \le 1$; therefore the estimate of the intraclass correlation coefficient lies
in $-\frac{1}{M-1} \le \hat{\rho} \le 1$.
We know that $s_b^2$ and $\bar{s}_w^2$ are unbiased estimators of $S_b^2$ and $\bar{S}_w^2$, respectively. But
$$s^2 = \frac{1}{nM-1}\sum_{i=1}^{n}\sum_{j=1}^{M}\left(y_{ij} - \bar{\bar{y}}\right)^2$$
is not an unbiased estimator of
$$S^2 = \frac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2.$$
Let us try to put $S^2$ in terms of $S_b^2$ and $\bar{S}_w^2$. From the ANOVA for the population we have
$$SST = SSB + SSW,$$
which implies that
$$(NM-1)S^2 = M(N-1)S_b^2 + N(M-1)\bar{S}_w^2$$
or
$$S^2 = \frac{N(M-1)\bar{S}_w^2 + M(N-1)S_b^2}{NM-1}. \qquad (9.1.18)$$
In (9.1.18) the denominator $(NM-1)$ is a constant quantity, so an unbiased estimator
of $S^2$ is given by
$$Est(S^2) = \frac{N(M-1)\bar{s}_w^2 + M(N-1)s_b^2}{NM-1}. \qquad (9.1.19)$$
The formation of the ANOVA helps us to find the multipliers of $s_b^2$ and $\bar{s}_w^2$, which, in
fact, are $M(N-1)$ and $N(M-1)$, respectively. Evidently an estimator of the relative
efficiency RE is given by
$$Est(RE) = \frac{Est(S^2)}{M s_b^2} = \frac{M(N-1)s_b^2 + N(M-1)\bar{s}_w^2}{M(NM-1)s_b^2} = \frac{MN(1-N^{-1})s_b^2 + MN(1-M^{-1})\bar{s}_w^2}{M^2 N\{1-(NM)^{-1}\}s_b^2} \approx \frac{s_b^2 + (1-M^{-1})\bar{s}_w^2}{M s_b^2} \qquad (9.1.20)$$
for a large number of clusters $N$.
Hence the theorem.
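Equations (9.1.19) and (9.1.20) combine into a two-line estimator. A minimal plain-Python sketch (names ours):

```python
def est_S2_and_RE(s_b2, sw2_bar, N, M):
    """Unbiased estimator of S^2 (eq. 9.1.19) and the corresponding
    estimator of the relative efficiency (eq. 9.1.20)."""
    est_S2 = (N * (M - 1) * sw2_bar + M * (N - 1) * s_b2) / (N * M - 1)
    est_RE = est_S2 / (M * s_b2)
    return est_S2, est_RE
```

With the sample values used in Example 9.1.6 below ($s_b^2 = 163731.75$, $\bar{s}_w^2 = 446426.32$, $N = 10$, $M = 5$) this returns an Est(RE) of about 0.629.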

Example 9.1.5. Mr. Stephen Horn and Mr. Ken Brewer were asked to construct
three clusters of the 9 regions of the USA using the following two maps:

[Map 1: U.S. Standard Regions for Temperature and Precipitation, National Climatic Data Center, NOAA.]

[Map 2: August-October 2002 Regional Ranks, Precipitation, National Climatic Data Center/NESDIS/NOAA. Legend: Record Driest, Much Below Normal, Below Normal, Near Normal, Above Normal, Much Above Normal, Record Wettest.]

Source: Printed with permission from NOAA.



Mr. Stephen Horn suggested the following three clusters:

Cluster I   | North West, West North Central, East North Central
Cluster II  | West, South West, South
Cluster III | North East, Central, South East

Mr. Ken Brewer suggested the following three clusters:

Cluster I   | North West, West, South West
Cluster II  | West North Central, East North Central, North East
Cluster III | Central, South East, South

( A ) Whose clustering plan is more efficient and why? Apply the ANOVA method.
( B ) Find the relative efficiency of cluster sampling with respect to SRSWOR in
each case.
( C ) Select two clusters out of Mr. Stephen Horn's clusters, and construct a 95%
confidence interval estimate for the population mean. Apply the sample ANOVA
approach to estimate the intraclass correlation. (Rule: Use the first row and first column of
the Pseudo-Random Number (PRN) Table 1 given in the Appendix.)
( D ) Select two clusters out of Mr. Ken Brewer's clusters, and construct a 95%
confidence interval estimate for the population mean. Apply the sample ANOVA
approach to estimate the intraclass correlation. (Rule: Use the first row and first column of
the Pseudo-Random Number (PRN) Table 1 given in the Appendix.)
( E ) Do both confidence interval estimates include the true average precipitation of
the USA?
Solution ( A ): ( I ) Mr. Stephen Horn's clustering: Using information from the
second map, we have

Cluster          | Values     | Total
Cluster I        | 4, 72, 93  | 169
Cluster II       | 3, 63, 104 | 170
Cluster III      | 64, 75, 90 | 229
Grand Total (GT) |            | 568

( a ) Population correction factor (CF):

$$CF = \frac{(GT)^2}{NM} = \frac{568^2}{3\times 3} = 35847.11.$$

( b ) Population total corrected sum of squares:
$$SST = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - CF = 4^2 + 72^2 + 93^2 + 3^2 + 63^2 + 104^2 + 64^2 + 75^2 + 90^2 - 35847.11 = 10616.89.$$
( c ) Population between-cluster corrected sum of squares:
$$SSB = \sum_{i=1}^{N}\frac{Y_{i.}^2}{M} - CF = \frac{169^2 + 170^2 + 229^2}{3} - 35847.11 = 786.89.$$
( d ) Population within-cluster corrected sum of squares:
$$SSW = SST - SSB = 10616.89 - 786.89 = 9830.$$

Thus we have the following ANOVA table:

Source           | df | SS       | MS      | F
Between clusters | 2  | 786.89   | 393.44  | 0.2401
Within clusters  | 6  | 9830.00  | 1638.33 |
Total            | 8  | 10616.89 | 1327.11 |

The value of the intraclass correlation coefficient for Mr. Stephen Horn's clustering is
given by
$$\rho_{Stephen} = 1 - \left(\frac{M}{M-1}\right)\frac{SSW}{SST} = 1 - \left(\frac{3}{3-1}\right)\frac{9830}{10616.89} = -0.388 \text{ (which is negative).}$$

( II ) Mr. Ken Brewer's clustering: Using information from the second map we
have

Cluster          | Values      | Total
Cluster I        | 4, 3, 63    | 70
Cluster II       | 93, 64, 72  | 229
Cluster III      | 75, 90, 104 | 269
Grand Total (GT) |             | 568

( a ) Population correction factor (CF):
$$CF = \frac{(GT)^2}{NM} = \frac{568^2}{3\times 3} = 35847.11.$$

( b ) Population total corrected sum of squares:
$$SST = \sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij}^2 - CF = 4^2 + 72^2 + 93^2 + 3^2 + 63^2 + 104^2 + 64^2 + 75^2 + 90^2 - 35847.11 = 10616.89.$$
( c ) Population between-cluster corrected sum of squares:
$$SSB = \sum_{i=1}^{N}\frac{Y_{i.}^2}{M} - CF = \frac{70^2 + 229^2 + 269^2}{3} - 35847.11 = 7386.89.$$
( d ) Population within-cluster corrected sum of squares:
$$SSW = SST - SSB = 10616.89 - 7386.89 = 3230.$$

Thus we have the following ANOVA table:

Source           | df | SS       | MS      | F
Between clusters | 2  | 7386.89  | 3693.45 | 6.8609
Within clusters  | 6  | 3230.00  | 538.33  |
Total            | 8  | 10616.89 | 1327.11 |

The value of the intraclass correlation coefficient for Mr. Ken Brewer's clustering is
given by
$$\rho_{Ken} = 1 - \left(\frac{M}{M-1}\right)\frac{SSW}{SST} = 1 - \left(\frac{3}{3-1}\right)\frac{3230}{10616.89} = 0.5436 \text{ (which is positive).}$$

The value of the intraclass correlation coefficient for Mr. Stephen Horn's clustering
plan is negative, whereas that for Mr. Ken Brewer's plan is positive, which
indicates that Mr. Horn's clustering plan will perform better than Mr. Brewer's
clustering. The reason is that Mr. Horn's clusters have more variation within each
cluster and less variation between the clusters.

( B ) The relative efficiency of Mr. Stephen Horn's cluster sampling over SRSWOR
sampling will be
$$RE(\text{Mr. Horn}) = \frac{S^2}{M S_b^2}\times 100 = \frac{MST}{MSB}\times 100 = \frac{1327.11}{393.44}\times 100 = 337.31\%$$
and the relative efficiency of Mr. Ken Brewer's cluster sampling over SRSWOR
sampling will be
$$RE(\text{Mr. Brewer}) = \frac{S^2}{M S_b^2}\times 100 = \frac{MST}{MSB}\times 100 = \frac{1327.11}{3693.45}\times 100 = 35.93\%.$$

( C ) Sample from Mr. Stephen Horn's clustering: Now using the information
from the second map, the selected clusters are:

Cluster          | Values     | Total
Cluster II       | 3, 63, 104 | 170
Cluster III      | 64, 75, 90 | 229
Grand Total (gt) |            | 399

( a ) Sample correction factor ( cf ):
$$cf = \frac{(gt)^2}{nM} = \frac{399^2}{2\times 3} = 26533.5.$$
( b ) Sample total corrected sum of squares:
$$sst = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}^2 - cf = 3^2 + 63^2 + 104^2 + 64^2 + 75^2 + 90^2 - 26533.51 = 6081.49.$$
( c ) Sample between-cluster corrected sum of squares:
$$ssb = \sum_{i=1}^{n}\frac{y_{i.}^2}{M} - cf = \frac{170^2 + 229^2}{3} - 26533.51 = 580.16.$$
( d ) Sample within-cluster corrected sum of squares:
$$ssw = sst - ssb = 6081.49 - 580.16 = 5501.33.$$
Thus we have the following ANOVA table:

Source           | df | SS      | MS      | F
Between clusters | 1  | 580.16  | 580.16  | 0.4218
Within clusters  | 4  | 5501.33 | 1375.33 |
Total            | 5  | 6081.49 | 1216.29 |

Thus an estimate of the value of the intraclass correlation coefficient for Mr. Stephen
Horn's cluster sampling is given by
$$\hat{\rho}_{Stephen} = 1 - \left(\frac{M}{M-1}\right)\frac{ssw}{sst} = 1 - \left(\frac{3}{3-1}\right)\frac{5501.33}{6081.49} = -0.356 \text{ (which is again negative).}$$
The estimate of the average precipitation in the USA is given by
$$\bar{\bar{y}} = \frac{gt}{nM} = \frac{399}{2\times 3} = 66.5.$$
From the above sample ANOVA we have $msb = M s_b^2 = 580.16$, which implies
$$s_b^2 = \frac{580.16}{3} = 193.38.$$
Thus we have
$$\hat{v}(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 = \left(\frac{1}{2} - \frac{1}{3}\right)\times 193.38 = 32.23.$$
A $(1-\alpha)100\%$ confidence interval estimate of the population mean $\bar{\bar{Y}}$ is given by
$$\bar{\bar{y}} \pm t_{\alpha/2}\{df = n(M-1)\}\sqrt{\hat{v}(\bar{\bar{y}})}.$$
Thus the 95% confidence interval estimate for the average precipitation in the USA is
given by
$$66.5 \pm t_{0.025}(df = 4)\sqrt{32.23}.$$
Using Table 2 from the Appendix we have
$$66.5 \pm 2.776\times 5.69, \quad\text{or}\quad [50.72,\ 82.28].$$

( D ) Sample from Mr. Ken Brewer's clustering: Now using the information
from the second map, the selected clusters are:

Cluster          | Values      | Total
Cluster II       | 93, 64, 72  | 229
Cluster III      | 75, 90, 104 | 269
Grand Total (gt) |             | 498

( a ) Sample correction factor ( cf ):
$$cf = \frac{(gt)^2}{nM} = \frac{498^2}{2\times 3} = 41334.$$
( b ) Sample total corrected sum of squares:
$$sst = \sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij}^2 - cf = 72^2 + 93^2 + 64^2 + 75^2 + 90^2 + 104^2 - 41334 = 1136.$$
( c ) Sample between-cluster corrected sum of squares:
$$ssb = \sum_{i=1}^{n}\frac{y_{i.}^2}{M} - cf = \frac{229^2 + 269^2}{3} - 41334 = 266.67.$$

( d ) Sample within-cluster corrected sum of squares:
$$ssw = sst - ssb = 1136 - 266.67 = 869.33.$$
Thus we have the following ANOVA table:

Source           | df | SS      | MS     | F
Between clusters | 1  | 266.67  | 266.67 | 1.227
Within clusters  | 4  | 869.33  | 217.33 |
Total            | 5  | 1136.00 | 227.20 |

Thus an estimate of the value of the intraclass correlation coefficient is given by
$$\hat{\rho}_{Ken} = 1 - \left(\frac{M}{M-1}\right)\frac{ssw}{sst} = 1 - \left(\frac{3}{3-1}\right)\frac{869.33}{1136} = -0.147 \text{ (which is slightly negative).}$$
The estimate of the average precipitation in the USA is given by
$$\bar{\bar{y}} = \frac{gt}{nM} = \frac{498}{2\times 3} = 83.$$
From the above sample ANOVA we have $msb = M s_b^2 = 266.67$, which implies
$$s_b^2 = \frac{266.67}{3} = 88.89.$$
Thus we have
$$\hat{v}(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2 = \left(\frac{1}{2} - \frac{1}{3}\right)\times 88.89 = 14.82.$$
A $(1-\alpha)100\%$ confidence interval estimate of the population mean $\bar{\bar{Y}}$ is given by
$$\bar{\bar{y}} \pm t_{\alpha/2}\{df = n(M-1)\}\sqrt{\hat{v}(\bar{\bar{y}})}.$$
Thus the 95% confidence interval estimate for the average precipitation in the USA is
given by
$$83 \pm t_{0.025}(df = 4)\sqrt{14.82}$$
or, using Table 2 from the Appendix,
$$83 \pm 2.776\times 3.85, \quad\text{or}\quad [72.31,\ 93.68].$$

( E ) The true average precipitation in the USA is given by
$$\bar{\bar{Y}} = \frac{GT}{NM} = \frac{568}{3\times 3} = 63.11.$$
Clearly the true average precipitation in the USA lies in the 95% confidence
interval estimate from Mr. Stephen Horn's clustering, [50.72, 82.28], but
not in the 95% confidence interval estimate from Mr. Ken Brewer's clustering,
[72.31, 93.68]. Although the estimate of the intraclass correlation coefficient
from Mr. Ken Brewer's sampled clusters is slightly negative, his plan does not perform
well because the true intraclass correlation coefficient under his clustering of the
population is positive.

Theorem 9.1.7. Under the superpopulation model approach, the relative efficiency
is free from sample information and is inversely related to the size of the clusters.
Proof. We know that under the superpopulation model proposed by Smith (1938)
the relative efficiency can be written as
$$RE = S^2/(M S_b^2) \qquad (9.1.21)$$
where $S_b^2 = (N-1)^{-1}\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{\bar{Y}}\right)^2$ and $S^2 = (NM-1)^{-1}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{\bar{Y}}\right)^2$.

Evidently if the cluster size changes then $S_b^2$ changes but $S^2$ does not, because the $Y_{ij}$,
and hence $\bar{\bar{Y}}$, remain the same. Note that $NM$ remains constant: $M$ increases if $N$
decreases, and vice versa. Thus with a change in cluster size only $S_b^2$ gets altered,
because $\bar{Y}_{i.}$ changes with cluster size. Smith (1938) suggested an empirical relation
between $S_b^2$ and $S^2$ through a superpopulation model parameter $g$ given by
$$S_b^2 = S^2/M^g. \qquad (9.1.22)$$
Thus the expression for RE reduces to
$$RE = \frac{S^2}{M S_b^2} = \frac{S^2 M^g}{M S^2} = M^{g-1}. \qquad (9.1.23)$$
Now to find the value of $g$ we make use of data selected for one particular
value of the cluster size $M$. Thus (9.1.23) implies that RE is a constant $K$ (say),
and its estimator is
$$Est(RE) = \{M(N-1)s_b^2 + N(M-1)\bar{s}_w^2\}/\{M(NM-1)s_b^2\}. \qquad (9.1.24)$$
Now we equate the value of $Est(RE)$ from the sampled data for a fixed cluster
size $M$ with the value of $K$. For the fixed value of $M$ we can find the value of $g$ from
$$Est(RE) = M^{g-1}. \qquad (9.1.25)$$
Taking logs on both sides we have
$$g = 1 + \log[Est(RE)]/\log(M). \qquad (9.1.26)$$

Example 9.1.6. Select 6 clusters from Table 9.1.1 using SRSWOR sampling.
Record the values of the real estate farm loans for the selected states from
population 1 given in the Appendix. Estimate the relative efficiency using the ANOVA
approach. From this calculate the value of parameter g.

Solution. Starting with the first two columns of the Pseudo-Random Number
(PRN) Table 1 given in the Appendix, we select the required 6 distinct random
numbers as 01, 04, 05, 03, 06 and 07. The following clusters are included in an
SRSWOR sample of 6 units.

Cluster | $y_{i.}$ | $\bar{y}_{i.}$ | $(\bar{y}_{i.} - \bar{\bar{y}})^2$
1       | 1568.222 | 313.644        | 205772.22
3       | 2225.310 | 445.062        | 103815.18
4       | 4100.968 | 820.194        | 2801.37
5       | 6940.330 | 1388.066       | 385393.10
6       | 5293.989 | 1058.798       | 84991.00
7       | 2889.150 | 577.830        | 35885.86
Total   |          |                | 818658.73

From the above table we have
$$\bar{\bar{y}} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_{i.} = 767.2656$$
and
$$\sum_{i=1}^{n}\left(\bar{y}_{i.} - \bar{\bar{y}}\right)^2 = 818658.73.$$

Now we have the squared deviations $(y_{ij} - \bar{\bar{y}})^2$ for the sampled units:

Cluster | $(y_{ij} - \bar{\bar{y}})^2$
1 | 111208.490, 425582.220, 224961.090, 579738.540, 509100.278
3 | 504607.970, 443957.870, 203813.090, 124954.210, 325568.802
4 | 231837.790, 24073.891, 79844.882, 234934.700, 19721.811
5 | 660026.850, 2432849.300, 345159.030, 101230.010, 213893.639
6 | 1859902.300, 198700.520, 197347.070, 10702.806, 123.135
7 | 19670.801, 128370.030, 29650.900, 3420.187, 461468.371

From the above table we have
$$\sum_{i=1}^{n}\sum_{j=1}^{M}\left(y_{ij} - \bar{\bar{y}}\right)^2 = 10{,}706{,}420.62.$$

Also we have
790 Advanced sampling theory with applications

The values of (y_ij − ȳ_i)² are:

 429920.200    109007.610    192898.400    153122.733
 582604.910    517291.220    165174.450    267970.219
  19405.883    602110.820   1222132.500    230751.494
 271324.520   1608400.200    371732.630     29225.338
2412486.200    403471.420     85784.552     31804.798
   2418.967     28510.998     61463.335    239981.435

From the above table, we have
Σ_{i=1}^{n} Σ_{j=1}^{M} (y_ij − ȳ_i)² = 10,714,231.64.

Also, M Σ_{i=1}^{n} (ȳ_i − ȳ)² = 4093293.70. In summary,

Σ_{i=1}^{n} Σ_{j=1}^{M} (y_ij − ȳ_i)² = 10714231.64, so s_w² = 10714231.64/{n(M − 1)} = 446426.32,

Σ_{i=1}^{n} Σ_{j=1}^{M} (y_ij − ȳ)² = 10706420.62, so s² = 10706420.62/(nM − 1) = 369186.92.

Thus an estimate of RE is given by
Est.(RE) = {M(N − 1)s_b² + N(M − 1)s_w²}/{M(NM − 1)s_b²}
         = {(10 − 1) × 818658.74 + 10(5 − 1) × 446426.32}/{(10 × 5 − 1) × 818658.74}
         = 0.6288,
where M s_b² = M Σ_{i=1}^{n} (ȳ_i − ȳ)²/(n − 1) = 818658.74, since here n − 1 = M = 5.
Thus the value of g is given by
g = 1 + log[Est.(RE)]/log(M) = 1 + log(0.6288)/log(5) = 0.7117.
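The two-step recipe above, estimating RE at the observed cluster size and then backing out g, can be written out directly. A short sketch using the sample quantities from this example (variable names are ours):

```python
import math

# Example 9.1.6: N = 10 clusters of M = 5 states; n = 6 clusters sampled.
N, M, n = 10, 5, 6
ss_between = 818658.74         # sum_i (ybar_i - ybar)^2 from the first table
s_b2 = ss_between / (n - 1)    # between-cluster mean square
s_w2 = 446426.32               # within-cluster mean square

# Estimated relative efficiency, equation (9.1.24)
est_re = (M * (N - 1) * s_b2 + N * (M - 1) * s_w2) / (M * (N * M - 1) * s_b2)
# Superpopulation parameter g, equation (9.1.26)
g = 1 + math.log(est_re) / math.log(M)
print(round(est_re, 4), round(g, 4))
```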

We have seen that for a given sample size, the sampling variance increases with the
cluster size and decreases with the number of clusters. It may be noted
here that the cost of the survey decreases with an increase in cluster size and
increases with an increase in the number of clusters. Thus a need arises to compromise
between the cluster size and the number of clusters in the sample so that the
sampling variance is minimum for a fixed cost of the survey, or the cost is
minimum for a fixed variance. Mahalanobis (1940, 1942) has considered the
problem of determining the optimum cluster size from the point of view of both
variance and cost. Singh (1956) studied cluster sizes in cluster sampling and sub-sampling
procedures. The best source of discussion of the construction of cost
functions in cluster sampling is that of Hansen, Hurwitz, and Madow (1953). In
their analysis, they postulate that the cost of the survey using cluster sampling
consists of two components in addition to the overhead cost:

( a ) Cost of enumerating the elements in the sample and travelling within the
cluster, which is proportional to the number of units in the sample.
( b ) Cost of travelling between clusters, which is proportional to the distance to be
travelled between clusters. Empirical studies show that the expected value of the
minimum distance between n points located at random is proportional to √n.

Thus in general the cost of the survey can be written as
C = C₁nM + C₂√n,                                                   (9.2.1)
where C₁ is the cost of enumerating a unit within the cluster and C₂ is the cost of
travelling per unit distance between clusters.

The variance of the estimator of population mean in cluster sampling is
V(ȳ) = (1/n − 1/N) S_b² = {(1 − f)/n} S_b².                        (9.2.2)

Several researchers, including Jessen (1942), Mahalanobis (1944, 1946) and
Hendricks (1944), have shown that there is a log linear relationship between S_w²
and cluster size M given by
S_w² = aM^b,                                                       (9.2.3)
where a and b are positive constants to be determined from the survey
data. From the ANOVA for the population we have
S_b² = {(NM − 1)S² − N(M − 1)S_w²}/{M(N − 1)} ≈ S² − (M − 1)aM^(b−1) for large N.   (9.2.4)
Ignoring f in (9.2.2), the variance of the estimator of population mean in cluster
sampling is given by
V(ȳ) = (1/n){S² − (M − 1)aM^(b−1)}.                                (9.2.5)

Case I. Cost is fixed: In this case the Lagrange function L₁ is given by
L₁ = (1/n){S² − (M − 1)aM^(b−1)} + λ₁{C − C₁nM − C₂√n},            (9.2.6)
where λ₁ is a Lagrange multiplier. Differentiating (9.2.6) with respect to n, M and
λ₁ and equating to zero in each case gives three equations; one has to solve the
resultant equations for the optimum values of n and M.
792 Advanced sampling theory with applications

Case II. Variance is fixed: In this case the Lagrange function L₂ is given by
L₂ = C₁nM + C₂√n + λ₂[V₀ − (1/n){S² − (M − 1)aM^(b−1)}],           (9.2.7)
where λ₂ is a Lagrange multiplier. Differentiating (9.2.7) with respect to n, M and
λ₂ and equating to zero in each case gives three equations; one solves the
resultant equations for the optimum values of n and M.
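In practice the two Lagrange conditions can be replaced by a direct numerical search: fix the budget, scan the feasible (n, M) pairs, and keep the pair with the smallest variance (9.2.5). The sketch below illustrates Case I; all constants (S², a, b, C₁, C₂, C) are illustrative assumptions, not values from the text, and the fixed-variance case (Case II) simply swaps the objective and the constraint.

```python
import math

# Case I sketch: minimize V = (1/n)[S2 - (M-1) a M^(b-1)]   (cf. 9.2.5)
# subject to the cost constraint C1*n*M + C2*sqrt(n) <= C   (cf. 9.2.1).
S2, a, b = 500.0, 2.0, 0.3        # assumed superpopulation constants
C1, C2, C = 1.0, 4.0, 400.0       # assumed unit costs and total budget

def variance(n, M):
    return (S2 - (M - 1) * a * M ** (b - 1)) / n

def cost(n, M):
    return C1 * n * M + C2 * math.sqrt(n)

feasible = [(n, M) for n in range(2, 201) for M in range(2, 51)
            if cost(n, M) <= C]
n_opt, M_opt = min(feasible, key=lambda nm: variance(*nm))
print(n_opt, M_opt)
```

With these assumed constants the budget favours many small clusters; a steeper variance decay in M (larger b) shifts the optimum toward larger clusters.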

Consider an investigator interested in estimating the proportion of units
belonging to a specified class A (for example) when the population consists of N
clusters, each of size M. An SRSWOR sample of n clusters is selected. Suppose the
elements in the clusters can be classified into two classes as follows:
y_ij = 1 if the jth element of the ith cluster belongs to class A, and y_ij = 0 otherwise.   (9.3.1)
Let a_i = Σ_{j=1}^{M} y_ij be the number of such units in the ith cluster. Then the
investigator's interest will be to estimate the population proportion P defined as
P = {1/(MN)} Σ_{i=1}^{N} a_i.                                      (9.3.2)

Then we have the following theorems:

Theorem 9.3.1. An unbiased estimator of the population proportion P is given by
p_cl = (1/n) Σ_{i=1}^{n} p_i,                                      (9.3.3)
where p_i = a_i/M is the proportion of units belonging to class A in the ith cluster of the sample.

Theorem 9.3.2. The variance of the estimator p_cl is given by
V(p_cl) = {(1 − f)P(1 − P)/(n(N − 1))}[1 + (M − 1)ρ],              (9.3.4)
where ρ is the intra-cluster correlation coefficient defined in (9.3.7) below.
Proof. We have
V(p_cl) = {(1 − f)/n} S_b²,                                        (9.3.5)
where S_b² = (N − 1)⁻¹ Σ_{i=1}^{N} (P_i − P)² is the variance between cluster proportions. Note that
(1/N) Σ_{i=1}^{N} (P_i − P)² = P(1 − P) − (1/N) Σ_{i=1}^{N} P_i(1 − P_i).   (9.3.6)
For the population of binary values we have
S² = NMP(1 − P)/(NM − 1)
and
S_w² = {M/(N(M − 1))} Σ_{i=1}^{N} P_i(1 − P_i).
Note that we have
(N − 1)M S_b² = (NM − 1)S² − N(M − 1)S_w²,
or
S_b² = {(NM − 1)S² − N(M − 1)S_w²}/{M(N − 1)}
     = {NP(1 − P)/(N − 1)} − {1/(N − 1)} Σ_{i=1}^{N} P_i(1 − P_i)
     = P(1 − P)[N/(N − 1) − {1/(N − 1)} Σ_{i=1}^{N} {P_i(1 − P_i)/P(1 − P)}]
     = P(1 − P)[{N − (N − 1)}/(N − 1) + 1 − {1/(N − 1)} Σ_{i=1}^{N} {P_i(1 − P_i)/P(1 − P)}]
     = P(1 − P)[1/(N − 1) + {(M − 1)/(M − 1)}{1 − {1/(N − 1)} Σ_{i=1}^{N} {P_i(1 − P_i)/P(1 − P)}}]
     = {P(1 − P)/(N − 1)}[1 + (M − 1)ρ].
Thus the value of the intra-cluster correlation coefficient ρ can easily be written as
ρ = {(N − 1)/(M − 1)}[1 − {1/(N − 1)} Σ_{i=1}^{N} {P_i(1 − P_i)/P(1 − P)}].   (9.3.7)
Substituting this S_b² in (9.3.5) proves the theorem.
Theorem 9.3.3. An unbiased estimator of V(p_cl) is given by
v(p_cl) = {(1 − f)/n} s_b² = {(1 − f)/(n(n − 1))} Σ_{i=1}^{n} (p_i − p̄)²,   (9.3.8)
where p̄ = (1/n) Σ_{i=1}^{n} p_i has its usual meaning.

Theorem 9.3.4. The relative efficiency of cluster sampling as compared to
SRSWOR sampling is given by
RE = V(p_srs)/V(p_cl) = {(N − 1)NP(1 − P)}/[(NM − 1){NP(1 − P) − Σ_{i=1}^{N} P_i(1 − P_i)}].   (9.3.9)

Example 9.3.1. Select 6 clusters from Table 9.1.1 using SRSWOR sampling.
Record the values of the real estate farm loans for the selected states from
population 1 given in the Appendix. Estimate the proportion of states having real
estate farm loans of more than $555.4345 in the United States using cluster
sampling. Estimate the variance of the estimator used for estimating the required
proportion. Construct a 95% confidence interval.

Solution. Starting with the first two columns of the Pseudo-Random Number
(PRN) Table 1 given in the Appendix, we select the required 6 distinct random
numbers as 01, 04, 05, 03, 06 and 07. Thus the following clusters are included in an
SRSWOR sample of 6 units:

WA, OR, MT, NV, ID   1100.745, 114.899, 292.965, 5.860, 53.753      1568.222
UT, WY, CO, SD, NE   56.908, 100.964, 315.809, 413.777, 1337.852    2225.310
TX, OK, KS, LA, AR   1248.761, 612.108, 1049.834, 282.565, 907.700  4100.968
MO, IA, MN, ND, WI   1579.686, 2327.025, 1354.768, 449.099, 1229.752  6940.330
IL, IN, MI, OH, PA   2131.048, 1213.024, 323.028, 870.720, 756.169  5293.989
MS, AL, GA, FL, SC   627.013, 408.978, 939.460, 825.748, 87.951     2889.150

Here M = 5 and n = 6. Distinguish the states having real estate farm loans of more
than $555.4345 in each one of the selected clusters.

Cluster   Indicators   a_i   p_i   (p_i − p̄)²
1         1 0 0 0 0     1    0.2    0.1344
2         0 0 0 0 1     1    0.2    0.1344
3         1 1 1 0 1     4    0.8    0.0544
4         1 1 1 0 1     4    0.8    0.0544
5         1 1 0 1 1     4    0.8    0.0544
6         1 0 1 1 0     3    0.6    0.0011

Using cluster sampling an estimate of the population proportion P is given by
p_cl = (1/n) Σ_{i=1}^{n} p_i = 3.4/6 = 0.5667,
where p_i = a_i/M.
Note that
p̄ = (1/n) Σ_{i=1}^{n} p_i = 0.5667
and, with f = n/N = 6/10, an estimate of V(p_cl) is given by
v(p_cl) = {(1 − f)/(n(n − 1))} Σ_{i=1}^{n} (p_i − p̄)² = {(1 − 6/10)/(6 × 5)} × 0.4333 = 0.005778.
A (1 − α)100% confidence interval of the population proportion, P, is given by
p_cl ± t_{α/2}(df = n(M − 1)) √v(p_cl).

Using Table 2 from the Appendix, the 95% confidence interval for the proportion of
states having real estate farm loans of more than $555.4345 in the United States is
given by
p_cl ± t_{0.025}(df = 24) √v(p_cl) = 0.5667 ± 2.064√0.005778, or [0.4098, 0.7236].
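As a cross-check, the whole calculation can be reproduced from the per-cluster counts a_i alone. Note that cluster 6 has a_6 = 3, so p_6 = 3/5 = 0.6, and the sampling fraction of clusters is f = n/N (a minimal sketch; variable names are ours):

```python
# Example 9.3.1 recomputed: six SRSWOR-selected clusters of M = 5 states,
# a_i = number of states in cluster i with loans above $555.4345.
N, M, n = 10, 5, 6
a = [1, 1, 4, 4, 4, 3]
p = [ai / M for ai in a]                       # p_i = a_i / M
p_cl = sum(p) / n                              # estimator (9.3.3)
f = n / N                                      # cluster sampling fraction
v = (1 - f) / (n * (n - 1)) * sum((pi - p_cl) ** 2 for pi in p)   # (9.3.8)
t = 2.064                                      # t_{0.025}, df = n(M - 1) = 24
ci = (p_cl - t * v ** 0.5, p_cl + t * v ** 0.5)
print(round(p_cl, 4), round(v, 6), [round(x, 4) for x in ci])
```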

Example 9.3.2. Suppose the United States has been divided into 10 neighbouring
clusters, each consisting of 5 states, as shown earlier. We are interested in the proportion
of states having real estate farm loans of more than $555.4345. Is there any
expectation of a gain in efficiency due to clustering as opposed to simple random
sampling?
Solution. Let us test the gain due to cluster sampling through the concept of the
intraclass correlation coefficient.

1    1100.745   114.899   292.965     5.860    53.753   1 0 0 0 0   1   0.20   0.16
2    1343.461    54.633   140.582     2.605    40.775   1 0 0 0 0   1   0.20   0.16
3      56.908   100.964   315.809   413.777  1337.852   0 0 0 0 1   1   0.20   0.16
4    1248.761   612.108  1049.834   282.565   907.700   1 1 1 0 1   4   0.80   0.16
5    1579.686  2327.025  1354.768   449.099  1229.752   1 1 1 0 1   4   0.80   0.16
6    2131.048  1213.024   323.028   870.720   756.169   1 1 0 1 1   4   0.80   0.16
7     627.013   408.978   939.460   825.748    87.951   1 0 1 1 0   3   0.60   0.24
8     553.266  1045.106    99.277   321.583   639.571   0 1 0 0 1   2   0.40   0.24
9     201.631    57.747     6.044     8.849     7.590   0 0 0 0 0   0   0.00   0.00
10      1.611     7.130    39.860    42.808   139.628   0 0 0 0 0   0   0.00   0.00

where the limit L = $555.4345 (note that 553.266 < L, so cluster 8 contributes a_8 = 2).

From the above table we have
P = Σ_{i=1}^{N} a_i/(NM) = 20/50 = 0.40
and
P(1 − P) = 0.40 × (1 − 0.40) = 0.24.
The value of the intraclass correlation coefficient is given by
ρ = {(N − 1)/(M − 1)}[1 − {1/(N − 1)} Σ_{i=1}^{N} {P_i(1 − P_i)/P(1 − P)}]
  = {(10 − 1)/(5 − 1)}[1 − {1/(10 − 1)} × (1.44/0.24)] = 0.75,
which again shows that cluster sampling is less efficient than simple random
sampling. The relative efficiency of cluster sampling over simple random sampling
is given by
RE = {(10 − 1) × 10 × 0.4(1 − 0.4)}/[(10 × 5 − 1){10 × 0.4(1 − 0.4) − 1.44}] = 21.6/47.04 = 0.4592.
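The same quantities follow directly from the ten cluster counts a_i; since 553.266 falls below the cutoff $555.4345, the eighth cluster contributes a_8 = 2 (a short verification sketch):

```python
# Example 9.3.2 recomputed from the counts a_i of states above the limit L.
N, M = 10, 5
a = [1, 1, 1, 4, 4, 4, 3, 2, 0, 0]
P = sum(a) / (N * M)                                   # (9.3.2): 20/50
Pi = [ai / M for ai in a]
S = sum(p * (1 - p) for p in Pi)                       # sum of P_i(1 - P_i)
rho = (N - 1) / (M - 1) * (1 - S / ((N - 1) * P * (1 - P)))            # (9.3.7)
RE = (N - 1) * N * P * (1 - P) / ((N * M - 1) * (N * P * (1 - P) - S))  # (9.3.9)
print(P, round(rho, 4), round(RE, 4))
```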
We have assumed that all clusters are of equal size, which restricts the applicability
of cluster sampling in actual practice. For example, villages or suburbs (groups
of households), households (groups of persons), or schools (groups of students and
teachers) are usually taken as clusters for operational convenience. Thus unequal
cluster sampling is the more practical situation. We will now discuss the cluster
sampling scheme with clusters of different sizes.

Consider that the ith cluster consists of M_i units, i = 1, 2, ..., N, and M₀ = Σ_{i=1}^{N} M_i is the
total number of units in the population. The population mean Ȳ can be defined as
Ȳ = (1/M₀) Σ_{i=1}^{N} Σ_{j=1}^{M_i} Y_ij = (1/M₀) Σ_{i=1}^{N} M_i Ȳ_i,   (9.4.1)
where Ȳ_i = (1/M_i) Σ_{j=1}^{M_i} Y_ij denotes the ith cluster mean. Suppose n clusters of unequal
size are selected with SRSWOR sampling and y_ij denotes the value of the jth unit of
the variable under study in the ith cluster. The following three estimators of
population mean can be suggested:
ȳ_n = (1/n) Σ_{i=1}^{n} ȳ_i,                                       (9.4.2)
where ȳ_i = (1/M_i) Σ_{j=1}^{M_i} y_ij;
ȳ_n* = (1/m₀) Σ_{i=1}^{n} M_i ȳ_i,                                 (9.4.3)
where m₀ = Σ_{i=1}^{n} M_i; and
ȳ_n** = {1/(nM̄)} Σ_{i=1}^{n} M_i ȳ_i,                             (9.4.4)
where M̄ = (1/N) Σ_{i=1}^{N} M_i = M₀/N.
Theorem 9.4.1. The simple arithmetic mean estimator
ȳ_n = (1/n) Σ_{i=1}^{n} ȳ_i,                                       (9.4.5)
where ȳ_i = (1/M_i) Σ_{j=1}^{M_i} y_ij, is a biased estimator of the population mean.
Proof. Taking the expected value on both sides of (9.4.5) we have
E(ȳ_n) = E[(1/n) Σ_{i=1}^{n} ȳ_i] = (1/n) Σ_{i=1}^{n} E(ȳ_i) = (1/N) Σ_{i=1}^{N} Ȳ_i = Ȳ_N ≠ Ȳ.

Thus the estimator ȳ_n is a biased estimator and the bias can be written as
B(ȳ_n) = E(ȳ_n) − Ȳ = (1/N) Σ_{i=1}^{N} Ȳ_i − {1/(NM̄)} Σ_{i=1}^{N} M_i Ȳ_i
       = −{1/(NM̄)} Σ_{i=1}^{N} (M_i − M̄) Ȳ_i
       = −(1/M̄) Cov(Ȳ_i, M_i).                                    (9.4.6)

The expression (9.4.6) indicates that the estimator ȳ_n is a biased estimator of the
population mean Ȳ. It is to be noted that if Ȳ_i and M_i are uncorrelated, the bias in
the estimator (9.4.5) will be zero.

Theorem 9.4.2. The variance of the estimator ȳ_n is given by
V(ȳ_n) = {(1 − f)/n} S_b²,                                         (9.4.7)
where S_b² = {1/(N − 1)} Σ_{i=1}^{N} (Ȳ_i − Ȳ_N)².
Proof. We have
V(ȳ_n) = E[ȳ_n − E(ȳ_n)]² = E[ȳ_n − Ȳ_N]² = {(1 − f)/n}{1/(N − 1)} Σ_{i=1}^{N} (Ȳ_i − Ȳ_N)².
Hence the theorem.

Corollary 9.4.1. An unbiased estimator of V(ȳ_n) is given by
v(ȳ_n) = {(1 − f)/n} s_b²,                                         (9.4.8)
where s_b² = (n − 1)⁻¹ Σ_{i=1}^{n} (ȳ_i − ȳ_n)².

Theorem 9.4.3. The bias and variance of the second estimator of the population mean,
ȳ_n* = (1/m₀) Σ_{i=1}^{n} M_i ȳ_i,                                 (9.4.9)
where m₀ = Σ_{i=1}^{n} M_i, are respectively given by
B(ȳ_n*) ≈ {1/(nM̄²)}[Ȳ V(m₀) − Cov{m₀, Σ_{i=1}^{n} M_i ȳ_i}]       (9.4.10)
and
V(ȳ_n*) = {(1 − f)/n} S_b*²,                                       (9.4.11)
where S_b*² = {1/(M̄²(N − 1))} Σ_{i=1}^{N} M_i² (Ȳ_i − Ȳ)².

Proof. In the estimator ȳ_n*, both the numerator Σ_{i=1}^{n} M_i ȳ_i and the denominator m₀ are
random variables, and hence ȳ_n* is a ratio of two random variables. From the
standard ratio method of estimation, the asymptotic bias in ȳ_n* is given by
B(ȳ_n*) ≈ {1/(nM̄²)}[Ȳ V(m₀) − Cov{m₀, Σ_{i=1}^{n} M_i ȳ_i}].      (9.4.12)
Clearly B(ȳ_n*) → 0 as n → ∞, so ȳ_n* is an asymptotically unbiased estimator of the
population mean. One can easily see that the variance of the estimator ȳ_n* is
V(ȳ_n*) = E[ȳ_n* − Ȳ]² = {(1 − f)/n} S_b*²,                        (9.4.13)
where S_b*² = {1/(M̄²(N − 1))} Σ_{i=1}^{N} M_i² (Ȳ_i − Ȳ)².

Corollary 9.4.2. An estimator of the variance of ȳ_n* is given by
v(ȳ_n*) = {(1 − f)/n} s_b*²,                                       (9.4.14)
where s_b*² = {1/(m̄₀²(n − 1))} Σ_{i=1}^{n} M_i² (ȳ_i − ȳ_n*)², with m̄₀ = (1/n) Σ_{i=1}^{n} M_i.

Theorem 9.4.4. The third estimator ȳ_n** of the population mean is unbiased and its
variance is given by
V(ȳ_n**) = {(1 − f)/n} S_b**²,                                     (9.4.15)
where S_b**² = (N − 1)⁻¹ Σ_{i=1}^{N} (M_i Ȳ_i/M̄ − Ȳ)².
Proof. We have
E(ȳ_n**) = E[{1/(nM̄)} Σ_{i=1}^{n} M_i ȳ_i] = {1/(nM̄)} Σ_{i=1}^{n} E(M_i ȳ_i) = Ȳ.
Hence ȳ_n** is an unbiased estimator of the population mean. By the definition of
variance we have
V(ȳ_n**) = E[ȳ_n** − E(ȳ_n**)]² = E[{1/(nM̄)} Σ_{i=1}^{n} M_i ȳ_i − Ȳ]² = {(1 − f)/n} S_b**².
Hence the theorem.

Corollary 9.4.3. An unbiased estimator of V(ȳ_n**) is given by
v(ȳ_n**) = {(1 − f)/n} s_b**²,                                     (9.4.16)
where s_b**² = (n − 1)⁻¹ Σ_{i=1}^{n} (M_i ȳ_i/M̄ − ȳ_n**)².

Theorem 9.4.5. The efficiency of the estimator ȳ_n** with respect to SRSWOR
sampling is
RE = S²/(M̄ S_b**²).                                               (9.4.17)
Proof. In unequal cluster sampling, the total number of units in the sample will be
m₀ = Σ_{i=1}^{n} M_i. The expected value of m₀ is
E(m₀) = n E(M_i) = (n/N) Σ_{i=1}^{N} M_i = nM̄.
If an equivalent sample of nM̄ units is directly selected from the population of NM̄
elements, then the variance of the sample mean estimator under SRSWOR will be
V_srs(ȳ) = {1/(nM̄) − 1/(NM̄)} S²
and that of ȳ_n** is
V(ȳ_n**) = (1/n − 1/N) S_b**².
By the definition of relative efficiency we have
RE = V_srs(ȳ)/V(ȳ_n**) = S²/(M̄ S_b**²).                           (9.4.18)
Hence the theorem.

Example 9.4.1. We wish to estimate the average yield/hectare of the world tobacco
crop. Select four continents from population 5 by SRSWOR sampling. Collect
information about the yield/hectare from all the countries in the selected continents.
Estimate the average yield/hectare in the world using the three different estimators.
Also construct the 95% confidence interval in each situation.
Solution. We used the first two columns of the Pseudo-Random Numbers (PRN)
Table 1 given in the Appendix to select the four distinct random numbers 01, 04,
05 and 03. The continents 'Central America', 'Western & Eastern Europe', 'FSU-12'
and 'European Union' will be included in the sample. Thus we have the
following sample information.

01  Central America   1   1   Costa Rica
                      2   2   El Salvador
                      3   3   Guatemala
                      4   4   Honduras
                      5   5   Nicaragua
                      6   6   Panama
                                              Average  1.973

03  European Union    1  13   Austria          1.90
                      2  14   Belgium-Lux      3.69
                      3  15   France           2.80
                      4  16   Germany          2.40
                      5  17   Greece           1.96
                      6  18   Italy            2.77
                      7  19   Portugal         2.14
                      8  20   Spain            2.79
                                              Average  2.556

04  Western and       1  21   Switzerland      2.09
    Eastern Europe    2  22   Albania          0.63
                      3  23   Bulgaria         1.61
                      4  24   Czech Republic   1.64
                      5  25   Croatia          1.67
                      6  26   Hungary          1.75
                      7  27   Macedonia        1.36
                      8  28   Poland           2.34
                      9  29   Romania          1.23
                     10  30   Serbia/Montenegro  1.17
                                              Average  1.549


05  FSU-12            1  31   Azerbaijan       2.33
                      2  32   Armenia          0.26
                      3  33   Byelarus         2.42
                      4  34   Georgia          1.63
                      5  35   Kyrgyzstan       2.50
                      6  36   Kazakhstan       1.82
                      7  37   Moldova          2.05
                      8  38   Russia           0.92
                      9  39   Tajikistan       2.48
                     10  40   Turkmenistan     2.36
                     11  41   Ukraine          0.84
                     12  42   Uzbekistan       2.37
                                              Average  1.832

Estimator 1. Here we have

M_i   ȳ_i     (ȳ_i − ȳ_n)²
6     1.973   0.000020
8     2.556   0.334662
10    1.549   0.183612
12    1.832   0.021170
Sum           0.539465
Thus the first estimate of the average yield/hectare of the world tobacco crop is
ȳ_n = (1/n) Σ_{i=1}^{n} ȳ_i = 7.91/4 = 1.9775.
Note that
s_b² = (n − 1)⁻¹ Σ_{i=1}^{n} (ȳ_i − ȳ_n)² = 0.539465/3 = 0.179822.
Thus the estimator of the variance of the estimator ȳ_n is given by
v(ȳ_n) = {(1 − f)/n} s_b² = {(1 − 4/10)/4} × 0.179822 = 0.0269733.
A (1 − α)100% CI of the average yield/hectare of the world tobacco crop is given by
ȳ_n ± t_{α/2}(df = Σ_{i=1}^{n} M_i − n) √v(ȳ_n).
Using Table 2 from the Appendix the 95% confidence interval is given by
1.9775 ± 2.037√0.0269733, or [1.6429, 2.3120].

Estimator 2. We have

i   M_i   ȳ_i     M_i ȳ_i   M_i²(ȳ_i − ȳ_n*)²
1   6     1.973   11.838     0.0446
2   8     2.556   20.448    24.4590
3   10    1.549   15.490    15.1160
4   12    1.832   21.984     1.6120
Sum              69.760    41.2320

Thus
m̄₀ = (1/n) Σ_{i=1}^{n} M_i = 36/4 = 9
and an estimate of the average yield/hectare of the world tobacco crop is given by
ȳ_n* = (1/m₀) Σ_{i=1}^{n} M_i ȳ_i = 69.76/36 = 1.9378.
Note that
s_b*² = {1/(m̄₀²(n − 1))} Σ_{i=1}^{n} M_i²(ȳ_i − ȳ_n*)² = 41.232/{9²(4 − 1)} = 0.1697.
Thus an estimate of the variance of the estimator of population mean is
v(ȳ_n*) = {(1 − f)/n} s_b*² = {(1 − 0.4)/4} × 0.1697 = 0.02545.
A (1 − α)100% confidence interval of the average yield/hectare of the world tobacco
crop is given by
ȳ_n* ± t_{α/2}(df = Σ_{i=1}^{n} M_i − n) √v(ȳ_n*).
Using Table 2 from the Appendix the 95% confidence interval is given by
1.9378 ± 2.037√0.02545, or [1.6128, 2.2627].

Estimator 3. Note that M̄ = 10.6, and an estimate of the average yield/hectare of
the world tobacco crop is given by
ȳ_n** = {1/(nM̄)} Σ_{i=1}^{n} M_i ȳ_i = 69.76/(4 × 10.6) = 1.6453.
Now we have

M_i ȳ_i   (M_i ȳ_i/M̄ − ȳ_n**)²
11.838    0.2793202
20.448    0.0805178
15.490    0.0338484
21.984    0.1837514
69.760    0.5774378

Thus we have
s_b**² = (n − 1)⁻¹ Σ_{i=1}^{n} (M_i ȳ_i/M̄ − ȳ_n**)² = 0.5774378/3 = 0.1924793
and an estimate of the variance is
v(ȳ_n**) = {(1 − f)/n} s_b**² = {(1 − 0.4)/4} × 0.1924793 = 0.0288719.
A (1 − α)100% confidence interval of the average yield/hectare of the world tobacco
crop is given by
ȳ_n** ± t_{α/2}(df = Σ_{i=1}^{n} M_i − n) √v(ȳ_n**).
Using Table 2 from the Appendix the 95% confidence interval is given by
1.6453 ± 2.037√0.0288719, or [1.2992, 1.9914].
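All three point estimates can be reproduced from the cluster sizes and cluster means alone (a small sketch using the sample summaries above; variable names are ours):

```python
# Example 9.4.1: the three estimators (9.4.2)-(9.4.4) from (M_i, ybar_i).
N, n, M0 = 10, 4, 106
Mi = [6, 8, 10, 12]
ybar = [1.973, 2.556, 1.549, 1.832]
Mbar = M0 / N                                  # 10.6
tot = sum(m * y for m, y in zip(Mi, ybar))     # sum of M_i * ybar_i
y1 = sum(ybar) / n                             # (9.4.2): simple mean
y2 = tot / sum(Mi)                             # (9.4.3): ratio-type
y3 = tot / (n * Mbar)                          # (9.4.4): unbiased
print(round(y1, 4), round(y2, 4), round(y3, 4))
```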

Now we would like to discuss how the ANOVA works for unequal cluster sampling.

Cluster   Values                               Cluster total
1         Y₁₁  Y₁₂  ...  Y₁ⱼ  ...  Y₁M₁        Y₁.
2         Y₂₁  Y₂₂  ...  Y₂ⱼ  ...  Y₂M₂        Y₂.
...
i         Y_i1  Y_i2  ...  Y_ij  ...  Y_iM_i   Y_i.
...
N         Y_N1  Y_N2  ...  Y_Nj  ...  Y_NM_N   Y_N.
Population Grand Total (GT)                    Y..

( a) The population correction factor (CF) is given by


2
CF= Y••
M.
( b ) The population corrected total sum of squares (SST) is given by
N u, N u,
SST=LLYJ-CF=LLy;J--!!-=LL Y;rY =(M.-I)S2.
}:2 N u, ( =)2
i=lj=1 i=lj=1 M. i=lj=\

( C) The population corrected between sum of squares (SSB) is given by

SSB = I Y;; - CF I Y;; - Y.~


i=I M i
=
i=I M i M.
= IM i(¥;;)- M'y2= M.{IP;¥;; _y2}
i=1 i=1

=f:-M
N =)2 = (N-I)Sb/2, where P; = MdM.
i Y;.-Y
(_

1=1
( d ) The population corrected within sum of squares (SSW) is given by

SSW = SST -SSW = L i YJ - L-i!..


NM ·
i=!j=1
N y;2
= L i (Yij - ¥;.f = (M. -1)S2
N[M '
i =I M i
]
i=1 j=1
.

Source     df       SS                                              MS                                 F
Between    N − 1    SSB = Σ_{i=1}^{N} M_i (Ȳ_i − Ȳ)²                MSB = SSB/(N − 1) = S_b'²          F = MSB/MSW
Within     M₀ − N   SSW = Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ_i)²   MSW = SSW/(M₀ − N)
Total      M₀ − 1   SST = Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ)²     MST = SST/(M₀ − 1) = S²

Remark 9.4.1. If clusters are not of equal size, then an alternative measure is the
adjusted value of the coefficient of determination given by
R_a² = 1 − MSW/MST.
Further note that the value of the adjusted R_a² becomes negative if cluster sampling
is efficient. Zasepa (1962) has also shown how the value of the intraclass
correlation coefficient for unequal clusters can be found.

Example 9.4.2. Mr. Ken Brewer learned from Mr. Stephen Horn that the clusters
are supposed to be heterogeneous and suggested the following clustering plan:

Cluster 1: West North Central, East North Central
Cluster 2: South West, South
Cluster 3: South East

Verify whether Mr. Brewer's new clustering plan may result in efficient estimates.
Apply the ANOVA approach.
Solution. From the map of Dec. 2002 precipitation (refer to Example 9.1.5) we
have the following information:

Cluster   M_i   Values            Cluster total
1         4     4   72   93   64   233
2         3     3   63  104        170
3         2    75   90             165
Population Grand Total (GT)        568
( a ) The population correction factor is given by
CF = Y..²/M₀ = 568²/9 = 35847.11.

( b ) The population corrected total sum of squares (SST) is given by
SST = Σ_{i=1}^{N} Σ_{j=1}^{M_i} Y_ij² − CF
    = 4² + 72² + 93² + 64² + 3² + 63² + 104² + 75² + 90² − 35847.11
    = 10616.89.

( c ) The population corrected between sum of squares is given by
SSB = Σ_{i=1}^{N} Y_i.²/M_i − CF = 233²/4 + 170²/3 + 165²/2 − 35847.11 = 970.97.

( d ) The population corrected within sum of squares is given by
SSW = SST − SSB = 10616.89 − 970.97 = 9645.92.
Thus the ANOVA for Mr. Ken Brewer's new clustering plan will be:

Source    df   SS         MS
Between   2      970.97    485.49
Within    6     9645.92   1607.65
Total     8    10616.89   1327.11

Note that if the clusters are not of equal size then an alternative measure is the
adjusted value of the coefficient of determination given by
R_a² = 1 − MSW/MST = 1 − 1607.65/1327.11 = −0.211,
which is negative, and it shows that Mr. Ken Brewer's new clustering plan will
perform better than SRSWOR sampling. Ken! Well done!
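The ANOVA quantities above can be verified in a few lines from the three precipitation clusters (a recomputation sketch; variable names are ours):

```python
# Example 9.4.2: ANOVA for three unequal clusters.
clusters = [[4, 72, 93, 64], [3, 63, 104], [75, 90]]
M0 = sum(len(c) for c in clusters)                   # 9 units in all
Nc = len(clusters)                                   # 3 clusters
CF = sum(sum(c) for c in clusters) ** 2 / M0         # correction factor
SST = sum(y * y for c in clusters for y in c) - CF
SSB = sum(sum(c) ** 2 / len(c) for c in clusters) - CF
SSW = SST - SSB
MSW = SSW / (M0 - Nc)
MST = SST / (M0 - 1)
R2_adj = 1 - MSW / MST                               # Remark 9.4.1
print(round(SST, 2), round(SSB, 2), round(R2_adj, 3))
```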

In most practical situations the study variable is found to be positively and
highly correlated with the cluster size. Under such circumstances, it is
recommended to select the sample with probability proportional to the size of the
clusters. Let P_i = Z_i/Z be the probability of selecting the ith cluster in the sample.
Following Hansen and Hurwitz (1943), let us define a transformed variable
u_ij = M_i y_ij/(M₀ P_i),  j = 1, 2, ..., M_i;  i = 1, 2, ..., N,
where y_ij denotes the value of y for the jth unit in the ith cluster. Assuming that n clusters are selected
by the PPSWR sampling scheme, define
ū_i = M_i ȳ_i/(M₀ P_i)  for i = 1, 2, ..., n.

Then we have the following theorem:

Theorem 9.5.1. An unbiased estimator of the population mean is given by
ȳ_pps = (1/n) Σ_{i=1}^{n} ū_i = (1/n) Σ_{i=1}^{n} M_i ȳ_i/(M₀ P_i)   (9.5.1)
and its variance is given by
V(ȳ_pps) = (1/n) Σ_{i=1}^{N} P_i (Ū_i − Ȳ)²,                       (9.5.2)
where Ū_i = M_i Ȳ_i/(M₀ P_i).
Proof. Taking expected values on both sides of (9.5.1) we have
E(ȳ_pps) = E[(1/n) Σ_{i=1}^{n} ū_i] = (1/n) Σ_{i=1}^{n} E(ū_i) = Σ_{i=1}^{N} P_i M_i Ȳ_i/(M₀ P_i) = (1/M₀) Σ_{i=1}^{N} M_i Ȳ_i = Ȳ.
Following Hansen and Hurwitz (1943) the variance of ȳ_pps is given by
V(ȳ_pps) = E(ȳ_pps − Ȳ)² = (1/n) Σ_{i=1}^{N} P_i (Ū_i − Ȳ)².
Hence the theorem.
Theorem 9.5.2. If P_i = M_i/M₀, where M₀ = Σ_{i=1}^{N} M_i, then an unbiased estimator of the
population mean Ȳ is
ȳ_pps(cl) = (1/n) Σ_{i=1}^{n} ȳ_i.                                 (9.5.3)
Proof. We have
E(ȳ_pps(cl)) = E[(1/n) Σ_{i=1}^{n} ȳ_i] = (1/n) Σ_{i=1}^{n} [Σ_{i=1}^{N} P_i Ȳ_i]
            = Σ_{i=1}^{N} (M_i/M₀)(1/M_i) Σ_{j=1}^{M_i} Y_ij = (1/M₀) Σ_{i=1}^{N} Σ_{j=1}^{M_i} Y_ij = Ȳ.
Hence the theorem.
Theorem 9.5.3. The variance of the estimator
ȳ_pps(cl) = (1/n) Σ_{i=1}^{n} ȳ_i
is given by
V(ȳ_pps(cl)) = (1/n) Σ_{i=1}^{N} (M_i/M₀)(Ȳ_i − Ȳ)².               (9.5.4)
Proof. We can write
ȳ_pps(cl) = (1/n) Σ_{i=1}^{n} ȳ_i = (1/n) Σ_{i=1}^{n} P_i ȳ_i/P_i.
Thus, by the Hansen and Hurwitz result,
V(ȳ_pps(cl)) = (1/n) Σ_{i=1}^{N} P_i (Ȳ_i − Ȳ)² = (1/n) Σ_{i=1}^{N} (M_i/M₀)(Ȳ_i − Ȳ)².
Hence the theorem.
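Theorem 9.5.2 is easy to check by simulation on a toy population; the cluster values below are illustrative, not from the text:

```python
import random
random.seed(1)

# PPSWR check: with P_i = M_i/M0, the plain mean of sampled cluster means
# is unbiased for the overall mean Ybar (Theorem 9.5.2).
clusters = [[3.0, 5.0], [1.0, 2.0, 3.0], [4.0], [2.0, 2.0, 2.0, 6.0]]
M0 = sum(len(c) for c in clusters)
Ybar = sum(sum(c) for c in clusters) / M0            # overall mean (3.0)
P = [len(c) / M0 for c in clusters]                  # P_i = M_i / M0
means = [sum(c) / len(c) for c in clusters]          # cluster means

n, reps, acc = 3, 20000, 0.0
for _ in range(reps):
    draws = random.choices(range(len(clusters)), weights=P, k=n)  # PPSWR
    acc += sum(means[i] for i in draws) / n          # ybar_pps(cl), (9.5.3)
est = acc / reps
print(round(est, 3))  # close to Ybar
```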

Theorem 9.5.4. The relative efficiency of the estimator ȳ_pps(cl) with respect to
SRSWR sampling is given by
RE = [M̄(1 − σ_w²/σ²)]⁻¹,
where
σ² = (1/M₀) Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ)²                  (9.5.5)
and σ_w² = Σ_{i=1}^{N} M_i σ_i²/M₀, with σ_i² = (1/M_i) Σ_{j=1}^{M_i} (Y_ij − Ȳ_i)².
Proof. Under SRSWR sampling of nM̄ elements we have
V_srs(ȳ) = σ²/(nM̄) = {1/(nM̄ M₀)} Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ)²
         = {1/(nM̄ M₀)} Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ_i + Ȳ_i − Ȳ)²
         = {1/(nM̄ M₀)} Σ_{i=1}^{N} [Σ_{j=1}^{M_i} (Y_ij − Ȳ_i)² + M_i (Ȳ_i − Ȳ)²]
         = {1/(nM̄)} [Σ_{i=1}^{N} M_i σ_i²/M₀ + Σ_{i=1}^{N} (M_i/M₀)(Ȳ_i − Ȳ)²]
         = {1/(nM̄)} [σ_w² + n V(ȳ_pps(cl))].
This implies
V(ȳ_pps(cl)) = M̄ V_srs(ȳ) − σ_w²/n = (1/n)(σ² − σ_w²).
Thus the relative efficiency of cluster sampling is given by
RE = V_srs(ȳ)/V(ȳ_pps(cl)) = {σ²/(nM̄)}/{(σ² − σ_w²)/n} = [M̄(1 − σ_w²/σ²)]⁻¹.
Hence the theorem.

Example 9.5.1. We wish to estimate the production of the world tobacco crop.
Assume we form 10 clusters of the countries in the world based on the continents
listed in population 5. We wish to apply PPSWR sampling for selecting the clusters.
Should we expect any gain in efficiency over simple random sampling?
Solution. We have

M_i    σ_i²       M_i σ_i²
6      0.022356    0.134133
6      0.181742    1.090450
8      0.303623    2.428986
10     0.211104    2.111040
12     0.533628    6.403540
4      0.114825    0.459300
30     0.332388    9.971650
17     0.356282    6.056800
10     1.816470   18.164700
3      0.649733    1.949200

Thus we have
σ_w² = Σ_{i=1}^{N} M_i σ_i²/M₀ = 48.7698/106 = 0.46009.
Also we have
σ² = (1/M₀) Σ_{i=1}^{N} Σ_{j=1}^{M_i} (Y_ij − Ȳ)² = {(M₀ − 1)/M₀} S² = {(106 − 1)/106} × 0.6323 = 0.626335.
Thus the relative efficiency of PPS cluster sampling over simple random sampling
is given by
RE = [M̄(1 − σ_w²/σ²)]⁻¹ = [10.6 × (1 − 0.46009/0.626335)]⁻¹ = 0.3554.
Thus for this case the PPS cluster sampling will be less efficient than simple
random sampling.
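The efficiency computation can be reproduced from the table of cluster sizes and within-cluster variances (a recomputation sketch; variable names are ours):

```python
# Example 9.5.1: relative efficiency of PPS cluster sampling.
M0, N = 106, 10
Mi = [6, 6, 8, 10, 12, 4, 30, 17, 10, 3]
s2 = [0.022356, 0.181742, 0.303623, 0.211104, 0.533628,
      0.114825, 0.332388, 0.356282, 1.816470, 0.649733]
Mbar = M0 / N                                        # 10.6
sig2_w = sum(m * s for m, s in zip(Mi, s2)) / M0     # within component
S2 = 0.6323                                          # S_y^2 for population 5
sig2 = (M0 - 1) / M0 * S2                            # element variance sigma^2
RE = 1 / (Mbar * (1 - sig2_w / sig2))
print(round(sig2_w, 5), round(RE, 4))
```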

Raj (1954, 1958), Zarcovic (1960), and Foreman and Brewer (1971) have
considered the concept of a superpopulation model for comparison purposes. It is to
be remarked here that other sampling schemes like PPSWOR sampling, systematic
sampling, two-phase sampling and stratified sampling can also be used to construct
estimation strategies under cluster sampling. Madow (1949), Sukhatme (1954),
and Sampford (1962) have also suggested estimation strategies under cluster
sampling. Singh and Singh (1999) suggested an unbiased class of estimators in
cluster sampling.

Following Royall (1992), Tam (1995) has assumed that the finite population of
interest Y = (Y₁, Y₂, ..., Y_N)' is a realization of a random vector that is related to
the design matrix X = (X₁, X₂, ..., X_N)' via the superpopulation model
Y = Xβ + ε,                                                        (9.6.1)
where X is assumed to be of full rank p, ε has a null mean vector and covariance
matrix σ²V, with V a symmetric positive definite matrix such that the
variances of the derived statistics are positive, and σ and β are unknown
quantities. We are interested in estimating the population total defined as
T = 1'Y,                                                           (9.6.2)

where 1 is an N × 1 vector of ones. Let s correspond to the n sampled units and r to
the remaining (N − n) population units. Without loss of generality, the
population information can be partitioned as
1 = (1_s', 1_r')',   Y = (Y_s', Y_r')',   X = (X_s', X_r')',   V = [ V_s   V_sr ; V_rs   V_r ].

Then we have the following theorem:

Theorem 9.6.1. The best linear unbiased predictor (BLUP) of the population total T is
given by
T̂(V, Y) = 1_s'Y_s + 1_r'X_r β̂ + 1_r'V_rs V_s⁻¹(Y_s − X_s β̂),     (9.6.3)
where
β̂ = (X_s'V_s⁻¹X_s)⁻¹ X_s'V_s⁻¹Y_s
is the best linear estimator of β. The error variance is given by
V[T̂(V, Y) − T] = σ²(1_r'X_r − 1_r'V_rs V_s⁻¹X_s) H_s⁻¹ (1_r'X_r − 1_r'V_rs V_s⁻¹X_s)'
               + σ² 1_r'(V_r − V_rs V_s⁻¹V_sr)1_r,                 (9.6.4)
where
H_s = X_s'V_s⁻¹X_s  and  K = (V_s, V_sr).
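Equation (9.6.3) is easy to exercise numerically. The sketch below uses small toy matrices (all values, and the working choice V = I, are our assumptions for illustration; with V = I the GLS fit reduces to OLS and the covariance adjustment term vanishes):

```python
import numpy as np

# BLUP of the total T = 1'Y under Y = X beta + eps, Cov(eps) = sigma^2 V.
X = np.array([[1.0, 1], [1, 2], [1, 3], [1, 4], [1, 5]])   # N = 5, p = 2
V = np.eye(5)                                              # working covariance
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
s, r = [0, 1, 2], [3, 4]                                   # sample / remainder

Xs, Xr, Ys = X[s], X[r], Y[s]
Vs_inv = np.linalg.inv(V[np.ix_(s, s)])
Vrs = V[np.ix_(r, s)]

# GLS estimate beta-hat = (Xs' Vs^-1 Xs)^-1 Xs' Vs^-1 Ys
beta = np.linalg.solve(Xs.T @ Vs_inv @ Xs, Xs.T @ Vs_inv @ Ys)
# BLUP (9.6.3): observed part + model prediction + covariance adjustment
T_hat = Ys.sum() + (Xr @ beta).sum() + (Vrs @ Vs_inv @ (Ys - Xs @ beta)).sum()
print(round(float(T_hat), 4))
```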
Corollary 9.6.1. If V1 = Xλ₁ for some vector λ₁, then the best linear unbiased
predictor (BLUP) of the population total T is

(9.6.5)

with variance given by

(9.6.6)

For studying the optimal sampling strategies, Tam (1995) has considered a
variance-covariance matrix of the form
V = [ V_A   0 ; 0   V_Z ],                                         (9.6.7)
where V_A is an (N − z) × (N − z) non-diagonal matrix that has the same correlation
coefficient ρ in the off-diagonal elements, V_Z is a z × z diagonal matrix, and
0 ≤ z ≤ N. In other words, the matrix V has elements
V_ii = v_i²  and  V_ij = ρ δ_i δ_j v_i v_j  for i ≠ j,             (9.6.8)
where
δ_k = 1 if there is correlation ρ, and δ_k = 0 otherwise.          (9.6.9)
The matrix V can also be expressed as
V = Diag(V₂) + ρ V₃V₃',                                            (9.6.10)
where
V₂ = ((1 − δ₁ρ)v₁², (1 − δ₂ρ)v₂², ..., (1 − δ_Nρ)v_N²)'  and  V₃ = (δ₁v₁, δ₂v₂, ..., δ_Nv_N)'.

Letting z_s be the number of zeros in V_3s, where V₃ = (V_3s', V_3r')', and defining
n_s = z_s + (n − z_s)(1 − ρ)/{1 + (n − z_s − 1)ρ},
Tam (1995) extended Royall's (1992) results to cluster sampling as given in the following theorems:

Theorem 9.6.2. If V1 = Xλ₁ and V₃ = Xλ₂ for some λ₁ and λ₂ under the model
defined by (9.6.1) and (9.6.10), we have
var{T̂(V, Y) − T} ≥ {(Σ_{j=1}^{N} √(1 − δ_jρ) v_j)²/n_opt − 1'V1}σ²   (9.6.11)
for all samples s, where
n_opt = n  if z ≥ n,  and  n_opt = z + (n − z)(1 − ρ)/{1 + (n − z − 1)ρ}  if z < n.   (9.6.12)
The lower bound is attained by the estimator
T̂(V, Y) = {(1/n_opt) Σ_{j=1}^{N} √(1 − δ_jρ) v_j} Σ_{j∈s} Y_j/{√(1 − δ_jρ) v_j},   (9.6.13)

where z_opt is defined according as z ≥ n or z < n, if and only if the sample satisfies
the corresponding balance condition on the variance-covariance matrix, with V₃
partitioned as V₃ = (V_3s', V_3r')'.


Theorem 9.6.3. If V1 = Xλ₁ and V₃ = Xλ₂ for some λ₁ and λ₂, and δ_i = 1 for
i = 1, 2, ..., N under the model, then
var[T̂(V, Y) − T] ≥ σ²[(1 − ρ)(Σ_{j=1}^{N} v_j)²/n_opt − 1'V1]      (9.6.14)
for all samples s.

The estimator given by
T̂(V, Y) = (1/n)(Σ_{j=1}^{N} v_j)(Σ_{j∈s} Y_j/v_j)                  (9.6.15)
attains the lower bound of the variance if and only if
n⁻¹(Σ_{j=1}^{N} v_j)(v₁⁻¹, ..., v_n⁻¹)(V_s, V_sr) = 1'V.           (9.6.16)

Owing to operational convenience and reduction in cost, cluster sampling has
sometimes been found to be useful even if all elements of the population are
identifiable. Clusters are formed either before selecting the sample (CBS) or after
selecting the sample (CAS). In both situations the clusters may be non-overlapping
or overlapping. For non-overlapping cluster sampling the theory is straightforward,
as we have discussed in the previous sections of this chapter. However, there may
be many practical sampling situations where one gets overlapping clusters. For
example, overlapping clusters may exist in some regional epidemiological survey
for a contagious disease like mycobacterium tuberculosis (T.B.), which is becoming
very prevalent with the spread of AIDS (Gifford-Jones, 1993). For overlapping
cluster sampling, one can refer to Goel and Singh (1977), Agarwal and Singh
(1982), Amdekar (1985), Singh (1988), and Tracy and Osahan (1994b). Under the
overlapping cluster sampling strategies, the population size may or may not be
known.

Let the population under consideration consist of N distinct and identifiable units.
Assume that these N units are expressible in the form of K overlapping clusters
with N_i, (i = 1, 2, ..., K) units in the ith cluster and Σ_{i=1}^{K} N_i = M ≥ N. The equality sign
will hold only for non-overlapping clusters. In an overlapping cluster situation, a
population unit may be included in more than one cluster; let F_j be the
frequency of the jth (j = 1, 2, ..., N) population unit occurring in the K clusters. Let y be
the variable of interest and suppose we are interested in estimating the population mean
Ȳ = N⁻¹ Σ_{j=1}^{N} Y_j.

Define
z_ij = M y_ij/(N F_j),  i = 1, 2, ..., K and j = 1, 2, ..., N_i,
where y_ij denotes the value of y for the jth unit in the ith cluster. Then we have the
following schemes:

Scheme 1. ( a ) Select k clusters out of K clusters by SRSWR sampling.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.

Under such a sampling scheme we have the following theorems:

Theorem 9.7.1.1. A biased estimator of the population mean \bar Y is given by

\bar z_{SRS} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i} z_{ij}.   (9.7.1.1)
Chapter 9: Non-overlapping, overlapping, post, and adaptive cluster sampling 813

Proof. Let E_2 denote the conditional expectation for a given sample of clusters and E_1 denote the expectation over all such samples; then we have

E(\bar z_{SRS}) = E_1 E_2(\bar z_{SRS}) = E_1\left[\frac{1}{k}\sum_{i=1}^{k}\bar Z_i\right] = \frac{1}{K}\sum_{i=1}^{K}\bar Z_i = \bar Z_K,

which in general differs from \bar Y.
Hence the theorem.

Theorem 9.7.1.2. The variance of the estimator \bar z_{SRS} is given by

V(\bar z_{SRS}) = \frac{\sigma_{bz}^2}{k} + \frac{1}{kK}\sum_{i=1}^{K}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2   (9.7.1.2)

where \sigma_{bz}^2 = K^{-1}\sum_{i=1}^{K}\left(\bar Z_i - \bar Z_K\right)^2 and S_{iz}^2 = (N_i - 1)^{-1}\sum_{j=1}^{N_i}\left(z_{ij} - \bar Z_i\right)^2.

Proof. Let V_2 denote the conditional variance for a given sample of clusters and V_1 denote the variance over all such samples; then we have

V(\bar z_{SRS}) = V_1 E_2(\bar z_{SRS}) + E_1 V_2(\bar z_{SRS}) = V_1\left[\frac{1}{k}\sum_{i=1}^{k} E_2(\bar z_i)\right] + E_1\left[\frac{1}{k^2}\sum_{i=1}^{k} V_2(\bar z_i)\right]

= V_1\left[\frac{1}{k}\sum_{i=1}^{k}\bar Z_i\right] + E_1\left[\frac{1}{k^2}\sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2\right]

= \frac{1}{kK}\sum_{i=1}^{K}\left(\bar Z_i - \bar Z_K\right)^2 + \frac{1}{kK}\sum_{i=1}^{K}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2.

Hence the theorem.
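The estimator (9.7.1.1) and the variance formula (9.7.1.2) can be sketched in a few lines of Python. The population below is hypothetical (unit 3 belongs to all three clusters, units 2 and 5 to two of them), so this is only an illustrative sketch, not the book's worked data:

```python
import random

# A small hypothetical overlapping population: unit 3 lies in all three
# clusters and units 2 and 5 lie in two, so F[j] counts the clusters
# containing unit j and M = sum of the cluster sizes exceeds N.
y = {1: 10.0, 2: 20.0, 3: 60.0, 4: 30.0, 5: 40.0}
clusters = [[1, 2, 3], [3, 4, 5], [2, 3, 5]]

N = len(y)
M = sum(len(c) for c in clusters)
F = {j: sum(j in c for c in clusters) for j in y}
Z = [[M * y[j] / (N * F[j]) for j in c] for c in clusters]   # z_ij = M y_ij/(N F_j)

K = len(clusters)
zbar = [sum(zi) / len(zi) for zi in Z]          # cluster means Z-bar_i
zbar_K = sum(zbar) / K

def _s2(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def var_zsrs(k, n):
    """Evaluate (9.7.1.2): k clusters by SRSWR, n units by SRSWOR in each."""
    sigma2_bz = sum((zb - zbar_K) ** 2 for zb in zbar) / K
    within = sum((1 / n - 1 / len(zi)) * _s2(zi) for zi in Z)
    return sigma2_bz / k + within / (k * K)

def zbar_srs(k, n, rng):
    """One realisation of the scheme-1 estimator (9.7.1.1)."""
    picks = [Z[rng.randrange(K)] for _ in range(k)]          # SRSWR over clusters
    return sum(sum(rng.sample(zi, n)) / n for zi in picks) / k
```

For this toy population, `var_zsrs(2, 2)` evaluates both the between-cluster and within-cluster components of (9.7.1.2) exactly, while `zbar_srs` draws one realisation of the two-stage scheme.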

Scheme 2. ( a ) Select k clusters out of K clusters by PPSWR sampling, with P_i = N_i/M.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.

Then we have the following theorems:

Theorem 9.7.1.3. An unbiased estimator of the population mean \bar Y is given by

\bar z_{pps} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i} z_{ij}.   (9.7.1.3)

Proof. We have

E(\bar z_{pps}) = E_1 E_2\left(\frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i} z_{ij}\right) = E_1\left[\frac{1}{k}\sum_{i=1}^{k} E_2(\bar z_i)\right] = E_1\left(\frac{1}{k}\sum_{i=1}^{k}\bar Z_i\right) = \sum_{i=1}^{K}\frac{N_i}{M}\bar Z_i

= \frac{1}{N}\sum_{i=1}^{K}\sum_{j=1}^{N_i}\frac{y_{ij}}{F_j} = \bar Y.

Hence the theorem.
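The first-stage part of this argument is easy to check numerically: since SRSWOR within a cluster is unbiased for \bar Z_i, the expectation reduces to \sum_i P_i \bar Z_i, which must equal \bar Y. A quick sketch on a hypothetical overlapping population:

```python
# The same kind of hypothetical overlap: unit 3 in all three clusters,
# units 2 and 5 in two of them.
y = {1: 10.0, 2: 20.0, 3: 60.0, 4: 30.0, 5: 40.0}
clusters = [[1, 2, 3], [3, 4, 5], [2, 3, 5]]

N, M = len(y), sum(len(c) for c in clusters)
F = {j: sum(j in c for c in clusters) for j in y}

# SRSWOR within a cluster is unbiased for Z-bar_i, so at the first stage
# E(zbar_pps) = sum_i P_i * Z-bar_i with P_i = N_i / M.
expected = 0.0
for c in clusters:
    zbar_i = sum(M * y[j] / (N * F[j]) for j in c) / len(c)
    expected += (len(c) / M) * zbar_i

ybar = sum(y.values()) / N
print(expected, ybar)   # the two coincide, as Theorem 9.7.1.3 asserts
```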

Theorem 9.7.1.4. The variance of the estimator \bar z_{pps} is given by

V(\bar z_{pps}) = \frac{\sigma_{bz}^2}{k} + \frac{1}{k}\sum_{i=1}^{K} P_i\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2   (9.7.1.4)

where \sigma_{bz}^2 = \sum_{i=1}^{K} P_i\left(\bar Z_i - \bar Y\right)^2.

Proof. We have

V(\bar z_{pps}) = V_1 E_2(\bar z_{pps}) + E_1 V_2(\bar z_{pps}) = V_1\left[\frac{1}{k}\sum_{i=1}^{k} E_2(\bar z_i)\right] + E_1\left[\frac{1}{k^2}\sum_{i=1}^{k} V_2(\bar z_i)\right]

= V_1\left[\frac{1}{k}\sum_{i=1}^{k}\bar Z_i\right] + E_1\left[\frac{1}{k^2}\sum_{i=1}^{k}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2\right]

= \frac{1}{k}\sum_{i=1}^{K} P_i\left(\bar Z_i - \bar Y\right)^2 + \frac{1}{k}\sum_{i=1}^{K} P_i\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_{iz}^2.

Hence the theorem.

It is possible that the population of N units under consideration is expressible in the form of K overlapping clusters with N_i units in the i-th cluster such that

\sum_{i=1}^{K} N_i = M \ge N

is satisfied, but the population size N is unknown. When cluster-wise data on units are available on the computer, the values of these frequencies for overlapping clusters may be easily available. Under such situations, define

z_{ij} = y_{ij}/F_{ij} and w_{ij} = 1/F_{ij} for i = 1, 2, ..., K; j = 1, 2, ..., N_i,

where y_{ij} is the value of y for the j-th unit in the i-th cluster and F_{ij} is its frequency of occurrence in the K clusters.
Then again we have two schemes, as discussed below:
Scheme 1. ( a ) Select k clusters out of K clusters by SRSWR sampling.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.
Under such a sampling scheme, we have the following theorem:
Theorem 9.7.2.1. The ratio estimator under scheme 1, given by

\bar z_{RS} = \left(\frac{K}{k}\sum_{i=1}^{k} N_i\bar z_i\right)\bigg/\left(\frac{K}{k}\sum_{i=1}^{k} N_i\bar w_i\right),   (9.7.2.1)

has relative bias, to the first order of approximation, given by

RB(\bar z_{RS}) \approx \frac{K}{kN^2}\left[K\left(\sigma_{bw}^2 - \frac{\sigma_{bzw}}{\bar Y}\right) + \sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S_{iw}^2 - \frac{S_{izw}}{\bar Y}\right)\right]   (9.7.2.2)

and mean squared error, to the first order of approximation, given by

MSE(\bar z_{RS}) \approx \frac{K}{kN^2}\sum_{i=1}^{K} N_i^2\left[\left(\bar Z_i - \bar Y\bar W_i\right)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S_{iz}^2 + \bar Y^2 S_{iw}^2 - 2\bar Y S_{izw}\right)\right]

where

\sigma_{bw}^2 = K^{-1}\sum_{i=1}^{K}\left(N_i\bar W_i - K^{-1}N\right)^2; \quad \sigma_{bzw} = K^{-1}\sum_{i=1}^{K}\left(N_i\bar Z_i - K^{-1}Y\right)\left(N_i\bar W_i - K^{-1}N\right);

S_{izw} = (N_i - 1)^{-1}\sum_{j=1}^{N_i}\left(z_{ij} - \bar Z_i\right)\left(w_{ij} - \bar W_i\right);

\bar Z_i = N_i^{-1}\sum_{j=1}^{N_i} z_{ij}, \quad \bar z_i = n_i^{-1}\sum_{j=1}^{n_i} z_{ij}, \quad \bar W_i = N_i^{-1}\sum_{j=1}^{N_i} w_{ij}, \quad \bar w_i = n_i^{-1}\sum_{j=1}^{n_i} w_{ij}, etc.,

have their usual meanings.

Proof. Follows by the standard method of moments.
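The reason a ratio form is needed here is that z and w aggregate to the total Y and the unknown size N respectively, so their ratio recovers \bar Y. This can be checked on a hypothetical overlapping structure when complete clusters are observed:

```python
# Hypothetical overlapping clusters with N unknown to the sampler; F[j] is
# the number of clusters containing unit j (F_ij in the text).
y = {1: 10.0, 2: 20.0, 3: 60.0, 4: 30.0, 5: 40.0}
clusters = [[1, 2, 3], [3, 4, 5], [2, 3, 5]]
F = {j: sum(j in c for c in clusters) for j in y}

# Over complete clusters the z's add up to the total Y and the w's to N,
# so the ratio recovers the population mean without knowing N.
z_total = sum(y[j] / F[j] for c in clusters for j in c)
w_total = sum(1 / F[j] for c in clusters for j in c)
print(z_total, w_total, z_total / w_total)
```

Each unit j appears in F_j cluster listings with weight 1/F_j, so the z's sum to Y = 160 and the w's to N = 5 for this toy population.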

Scheme 2. ( a ) Select k clusters out of K clusters by PPSWR sampling, with P_i = N_i/M.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.

Then we have the following theorem:

Theorem 9.7.2.2. A ratio type estimator under scheme 2, given by

\bar z_{RP} = \left(\frac{M}{k}\sum_{i=1}^{k}\bar z_i\right)\bigg/\left(\frac{M}{k}\sum_{i=1}^{k}\bar w_i\right),   (9.7.2.3)
has relative bias, to the first order of approximation, given by

RB(\bar z_{RP}) \approx \frac{M^2}{kN^2}\left[\left(\sigma_{bw}'^2 - \frac{\sigma_{bzw}'}{\bar Y}\right) + \sum_{i=1}^{K} P_i\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S_{iw}^2 - \frac{S_{izw}}{\bar Y}\right)\right]   (9.7.2.4)

and has asymptotic mean squared error

MSE(\bar z_{RP}) \approx \frac{M}{kN^2}\sum_{i=1}^{K} N_i\left[\left(\bar Z_i - \bar Y\bar W_i\right)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S_{iz}^2 + \bar Y^2 S_{iw}^2 - 2\bar Y S_{izw}\right)\right]   (9.7.2.5)

where \sigma_{bzw}' = \sum_{i=1}^{K} P_i\left(\bar Z_i - M^{-1}Y\right)\left(\bar W_i - M^{-1}N\right), etc., have usual meanings due to Hansen and Hurwitz (1943).

Example 9.7.1. Suppose there are three plots as shown in the figure below and nine partners who are the owners of these plots. A few partners have shares in only one plot, a few have shares in two plots, and others have shares in all three plots.

Fig. 9.7.2.1 Three overlapping clusters (Clusters I, II, and III).

The income of the nine partners is given below.

Partner:   1     2     3     4     5     6     7     8     9
Income:  1000  2000  6000  3000  3000  4000  2000  3000  1000

We wish to estimate the average income of all partners using plots as overlapping
clusters. Which sampling scheme would you prefer and why?
Solution. We observe that partner 3 is the owner of all three plots, partner 6 is the owner of two plots, and the other partners are owners of single plots only. This means these plots, viewed as clusters of partners, overlap with each other. Thus we shall use the concept of overlapping cluster sampling. Let us compare here the two methods suggested by Tracy and Osahan (1994b).

When clusters are selected with SRSWR sampling, then we have

MSE(\bar z_{RS}) \approx \frac{K}{kN^2}\sum_{i=1}^{K} N_i^2\left[\left(\bar Z_i - \bar Y\bar W_i\right)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right)D_i^2\right]

= \frac{3}{2(9)^2}\left[3880211.11 + 15388643.24 + 14404805.50\right] = 623586.29.

When clusters are selected with PPSWR sampling, then we have

MSE(\bar z_{RP}) \approx \frac{M}{kN^2}\sum_{i=1}^{K} N_i\left[\left(\bar Z_i - \bar Y\bar W_i\right)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right)D_i^2\right]

= 148546.77, the cluster-wise components being 1293403.70, 3847160.81, and 2880961.10.

We used the following information in the above formulae:

Cluster:                   I                  II                      III
N_i:                       3                  4                       5
y_{ij}:                    6000, 3000, 2000   1000, 6000, 3000, 2000  2000, 6000, 4000, 3000, 1000
F_{ij}:                    3, 1, 2            1, 3, 1, 1              1, 3, 2, 1, 1
z_{ij} = y_{ij}/F_{ij}:    2000, 3000, 1000   1000, 2000, 3000, 2000  2000, 2000, 2000, 3000, 1000
\bar Z_i:                  2000               2000                    2000
\bar W_i:                  0.6111             0.8333333               0.766666
S_{iz}^2:                  1000000            666666.67               500000
S_{iw}^2:                  0.1203704          0.111111                0.105555
S_{izw}:                   250                0.0000                  0.0000
                           646412.35          1840277.78              1114924.69
                           193.0015           501720.764              241714.81

where D_i^2 = S_{iz}^2 + \bar Y^2 S_{iw}^2 - 2\bar Y S_{izw}, assuming k = 2, and we are given K = 3.
Thus the relative efficiency of the PPSWR sampling strategy with respect to the SRSWR sampling strategy is

RE = \frac{MSE(\bar z_{RS})}{MSE(\bar z_{RP})} \times 100 = \frac{623586.29}{148546.77} \times 100 = 419.79\%.

Thus we shall prefer to use PPSWR sampling for selecting overlapping clusters in the sample.
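The per-cluster summaries in the table can be recomputed directly from the printed z_{ij} and F_{ij} values; the short sketch below reproduces \bar Z_i, \bar W_i, S_{iz}^2, S_{iw}^2, and S_{izw} for each plot:

```python
# Cluster-level summaries of Example 9.7.1 from the printed z_ij and F_ij.
z = [[2000, 3000, 1000], [1000, 2000, 3000, 2000], [2000, 2000, 2000, 3000, 1000]]
F = [[3, 1, 2], [1, 3, 1, 1], [1, 3, 2, 1, 1]]

def mean(v):
    return sum(v) / len(v)

def s2(v):
    """Sample mean square with divisor (N_i - 1)."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def s_cross(a, b):
    """Sample covariance S_izw with divisor (N_i - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (u - mb) for x, u in zip(a, b)) / (len(a) - 1)

for zi, fi in zip(z, F):
    wi = [1 / f for f in fi]          # w_ij = 1/F_ij
    print(mean(zi), mean(wi), s2(zi), s2(wi), s_cross(zi, wi))
```

Each printed line matches one column of the table: \bar Z_i = 2000 throughout, \bar W_i = 0.6111, 0.8333, 0.7667, and so on.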

In cluster sampling the lack of information relating to the composition of clusters poses a serious problem. In such situations, the technique of 'post-cluster' sampling introduced by Dalenius (1957) can be adopted. Ghosh (1963) has developed a stochastic model for the analysis of post-cluster sampling. The technique of post-cluster sampling differs from two-phase or multi-phase sampling in the sense that it uses a hierarchy of sampling units, and differs from ordinary sub-sampling in that the sampling units at the second stage are larger than the sampling units at the first stage. Ghosh (1963) proposed a ratio estimate based on post-cluster sampling considering the ratio of the expected value of the variable under study to the expected value of the number of observations. Sadasivan and Srinath (1975)

have considered a ratio of the expected value of the variable under study to the
expected value of the auxiliary variable. Consider a finite population consisting of
N distinct units. Let Y be the variable under study and let it take the values
Y_1, Y_2, ..., Y_N in the population. Let X be the auxiliary variable. The N elements in the population may, in principle, be grouped into M clusters, the i-th cluster consisting of N_i elements. The population mean, \bar X = N^{-1}\sum_{i=1}^{M}\sum_{j=1}^{N_i} X_{ij}, is assumed to be known. The ratio estimator to estimate the population mean \bar Y under post-cluster sampling is given by

\bar y_R = \left(\frac{M}{nm}\sum_{i=1}^{M}\alpha_i\sum_{j=1}^{N_i}\beta_{ij}y_{ij}\right)\bigg/\left(\frac{M}{nm}\sum_{i=1}^{M}\alpha_i\sum_{j=1}^{N_i}\beta_{ij}x_{ij}\right)\,\bar X = \frac{\bar y}{\bar x}\,\bar X   (9.8.1)

where n and m denote the size of the initial random sample and the number of post-clusters selected, and \alpha_i and \beta_{ij} are the random variables used for the selection of post-clusters and of elements from the population. Then we have the following corollary:
Corollary 9.8.1. The covariance between \bar y and \bar x is given by

Cov(\bar y, \bar x) = \frac{M(N-n)}{mn(N-1)}V_{xy} + \frac{M^2(M-m)(n-1)}{nmN(N-1)(M-1)}V_{bxy} + \frac{(M-m)(N-n)}{mn(N-1)}\bar X\bar Y,   (9.8.2)

where V_{xy} = N^{-1}\sum_{i=1}^{M}\sum_{j=1}^{N_i} X_{ij}Y_{ij} - \bar X\bar Y is the covariance between Y and X in the population, and V_{bxy} = M^{-1}\sum_{i=1}^{M}\left(X_i - X/M\right)\left(Y_i - Y/M\right) is the covariance between cluster totals for X and Y in the population. From (9.8.2), one can easily derive V(\bar y) and V(\bar x). Following the standard ratio method of estimation, we have the following theorems:

Theorem 9.8.1. The relative bias of the ratio estimator \bar y_R, to the first order of approximation, is given by

RB(\bar y_R) = \frac{M(N-n)}{mn(N-1)}\left(C_x^2 - C_{xy}\right) + \frac{N(M-m)(n-1)}{mn(N-1)(M-1)}\left(C_{x_i}^2 - C_{x_i y_i}\right)   (9.8.3)

where
C_x^2 = V_x/\bar X^2 = square of the coefficient of variation of X,
C_{xy} = V_{xy}/(\bar Y\bar X) = coefficient of co-variation between X and Y,
C_{x_i}^2 = V_{bx}/(X/M)^2 = square of the coefficient of variation of the values of X_i,
and
C_{x_i y_i} = V_{bxy}/\{(X/M)(Y/M)\} = coefficient of co-variation between X_i and Y_i.

Theorem 9.8.2. The variance of the ratio estimator \bar y_R is given by

V(\bar y_R) = \bar Y^2\left[\frac{M(N-n)}{mn(N-1)}\left(C_y^2 + C_x^2 - 2C_{xy}\right) + \frac{N(M-m)(n-1)}{mn(N-1)(M-1)}\left(C_{y_i}^2 + C_{x_i}^2 - 2C_{x_i y_i}\right)\right].   (9.8.4)

Proof. Follows by the ratio method of estimation.
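The indicator form of (9.8.1) can be mimicked loosely in code. The population below is hypothetical, and the element-level selection is simplified to an independent keep/drop draw rather than the exact initial SRS of the design, so this is only a sketch of the estimator's shape:

```python
import random

# A hypothetical population grouped into M = 3 post-clusters; x is an
# auxiliary variable whose population mean Xbar is known.
clusters_x = [[2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 2.0, 2.0]]
clusters_y = [[20.0, 31.0], [39.0, 52.0, 58.0], [12.0, 19.0, 22.0]]

M = len(clusters_x)
Xbar = sum(x for c in clusters_x for x in c) / sum(len(c) for c in clusters_x)

def ybar_ratio(m, keep_prob, rng):
    """Rough analogue of (9.8.1): alpha_i picks m of the M post-clusters;
    beta_ij keeps an element with probability keep_prob (a simplification
    of the initial element-level SRS); the y/x ratio is scaled by Xbar."""
    chosen = rng.sample(range(M), m)          # alpha_i = 1 for these clusters
    num = den = 0.0
    for i in chosen:
        for xv, yv in zip(clusters_x[i], clusters_y[i]):
            if rng.random() < keep_prob:      # beta_ij = 1
                num += yv
                den += xv
    return Xbar * num / den if den > 0 else float("nan")

print(ybar_ratio(2, 1.0, random.Random(1)))
```

Because every element ratio y/x here lies between 9.5 and 12, any realisation falls between roughly 9.5 Xbar and 12 Xbar, close to the true mean 31.625 of this toy population.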

Thompson (1990) reported that the use of adaptive cluster sampling for patchy populations is an efficient design. Brown (1996) pointed out that adaptive cluster sampling is efficient only for very patchy populations, but can be highly inefficient for other, less aggregated populations. Christman (1997) also supported Brown's views about adaptive cluster sampling in the case of non-patchy populations. In general, adaptive cluster sampling is done in two steps. In the first step a preliminary sample of n units is selected; for example, a study area is divided into N evenly sized quadrants and a random sample of n quadrants is selected. In the second step, for any quadrant in the initial sample for which the variable of interest, y, for example the number of plants in the quadrant, has a value at least as large as a predefined critical value, the neighbourhood is sampled. The neighbourhood can be defined in several ways, for example, the four surrounding quadrants, that is, those on the east, west, north, and south sides. If any of the quadrants in the neighbourhood has a value at least as large as the critical value, the neighbourhood of that quadrant is sampled, and so on. The sampling of units continues until all the neighbourhoods of sampled quadrants are sampled, similar to inverse sampling. The difference is that here the initial sample size remains fixed, but the final sample size is variable and depends on the number of networks selected in the sample. The group of adjacent quadrants whose values are all at least as large as the critical value is known as a network. If we divide a population into K networks, then the Horvitz and Thompson (1952) estimator of the population mean in adaptive cluster sampling becomes

\hat{\bar Y}_{adcl} = \frac{1}{N}\sum_{i=1}^{K}\frac{y_i I_i}{\pi_i}   (9.9.1)

where
y_i = total of the y values in the i-th network;
I_i = 1 if any quadrant in the i-th network is in the initial sample, and 0 otherwise;
and \pi_i = 1 - \binom{N - x_i}{n}\bigg/\binom{N}{n} is the probability that the i-th network, containing x_i quadrants, is intercepted by the initial sample.
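The intersection probabilities \pi_i and the estimator (9.9.1) take only a few lines with exact binomial coefficients; the network sizes and y-totals below are hypothetical:

```python
from math import comb

# Hypothetical layout: a region of N = 20 quadrants partitioned into K = 5
# networks with x_i quadrants each and network y-totals y_i; the initial
# sample is an SRSWOR of n = 5 quadrants.
N, n = 20, 5
x = [1, 1, 3, 4, 11]             # quadrants per network; sum(x) == N
y = [0.0, 2.0, 9.0, 14.0, 0.0]   # y-totals per network

def pi_i(xi):
    """Probability that a network of x_i quadrants meets the initial sample."""
    return 1 - comb(N - xi, n) / comb(N, n)

def ht_mean(hit):
    """(9.9.1): sum of y_i * I_i / pi_i over intercepted networks, over N."""
    return sum(y[i] / pi_i(x[i]) for i in hit) / N

print([pi_i(v) for v in x])
print(ht_mean([1, 2, 3]))        # networks 1, 2 and 3 were intercepted
```

A single-quadrant network has \pi_i = 1 - \binom{19}{5}/\binom{20}{5} = 1 - 15/20 = 0.25, which the first printed value confirms.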

Brown (1999) extended the concept of adaptive cluster sampling to the stratified sampling design. He considered its application to ecological data and showed a slight gain in efficiency using the stratified design over the adaptive cluster sampling design. He also showed that the two-phase adaptive sampling scheme introduced by Francis (1984) may lead to biased estimates. Francis (1984) considered the application of a two-phase adaptive sampling design for the estimation of fish biomass from trawl surveys. Jolly and Hampton (1990) also suggested a two-phase adaptive cluster sampling design with the additional advantage that its bias can be estimated very easily. Thompson and Seber (1996) compared the different estimation strategies available in the literature under adaptive cluster sampling.

Exercise 9.1. Suppose a population consists of N clusters, M_i being the size of the i-th cluster (i = 1, 2, ..., N). Clusters are selected with PPSWR sampling with P_i = M_i/M. The cluster selected at the (r+1)-th draw is rejected if the number of distinct clusters selected in the first r draws equals a pre-assigned number n. Let the i-th cluster occur r_i times in a sample of r draws, r_i = 0, 1, 2, ...; i = 1, 2, ..., N. If \bar Y_i is the mean of the i-th cluster, then show that

\bar y_s = r^{-1}\sum_{i=1}^{n} r_i\bar Y_i

is an unbiased estimator of the population mean.
Hint: Sampford (1962).

Exercise 9.2. Let Y_{ijk} be the value of Y for the k-th element of the i-th cluster in the j-th stratum. Define

\bar Y_{i..} = M_{i.}^{-1}\sum_{j=1}^{K}\sum_{k=1}^{M_{ij}} Y_{ijk} = the per element cluster mean;

\hat{\bar Y}_{.j.} = \hat M_{.j}^{-1}\sum_{i=1}^{n}\sum_{k=1}^{M_{ij}} Y_{ijk} = the per element (large) sample mean in the j-th stratum,

where \hat M_{.j} = \sum_{i=1}^{n} M_{ij} denotes the number of elements in the j-th stratum in the sample;

\bar y_{.j.} = m_{.j}^{-1}\sum_{i=1}^{n}\sum_{k=1}^{m_{ij}} y_{ijk} = the per element sample mean in the j-th stratum,

where m_{.j} = \sum_{i=1}^{n} m_{ij} denotes the number of sample elements in the j-th stratum. Study the estimator \hat R_{AS} of the population ratio R = \sum_{i=1}^{N} M_i\bar Y_i\Big/\sum_{i=1}^{N} M_i\bar X_i, defined as

\hat R_{AS} = \sum_{j=1}^{K}\hat M_{.j}\bar y_{.j.}\Big/\sum_{j=1}^{K}\hat M_{.j}\bar x_{.j.}.

Hint: Akar and Sedransk (1979).



Exercise 9.3. Draw an initial random sample of n elements from the N elements in the population with SRSWOR sampling. On the basis of this sample, form M' sub-clusters. Then take a sample of m' sub-clusters from the M' with SRSWOR sampling. Again form M super-clusters on the basis of these m' sub-clusters, and finally draw a sample of m super-clusters from the M with SRSWOR sampling. Let c_{ijk}, for k = 1, 2, ..., N_{ij}, j = 1, 2, ..., M_i', i = 1, 2, ..., M, be a random variable such that it takes the value 1 if y_{ijk} is selected and 0 otherwise. Let b_{ij} be another random variable, taking the value 1 if the j-th sub-cluster of the i-th super-cluster is selected and 0 otherwise. Let a_i be another random variable taking the value 1 if the i-th super-cluster gets selected and 0 otherwise. Then show that an unbiased estimator of the population mean is given by

\bar y_u = \frac{M M'}{m m' n}\sum_{i=1}^{M} a_i\sum_{j=1}^{M_i'} b_{ij}\sum_{k=1}^{N_{ij}} c_{ijk}\,y_{ijk}.

Hint: Sadasivan and Srinath (1975).

Exercise 9.4. A finite population of M_0 elements is divided into N clusters, with the i-th cluster having M_i elements. An SRSWOR sample of m elements is selected from the total population elements, and the sample elements are grouped according to the clusters to which they belong. Show that \bar y_G = (N/mn)\,y, where y is the sample total based on the values of the sample elements in the n selected clusters, is an unbiased estimator of the population mean.
Hint: Ghosh (1963).

Exercise 9.5. Suppose the overlapping clusters are selected such that if U_i is selected then (M_i - 1) units are associated with it to form a cluster of M_i units. Then show that an unbiased estimator of the population mean \bar Y is given by

\bar y_a = \frac{1}{n}\sum_{i=1}^{n}\sum_{j \in i}\frac{y_j}{M_j}.

Hint: Amdekar (1985).
Exercise 9.6. From a population consisting of N clusters each containing M elements, a simple random sample of n clusters is selected to estimate the population mean per element. Let S_w^2 = aM^g, with g < 0, be the variance within clusters under the superpopulation model. Find the optimum cluster size such that the variance of the estimator of the population mean is minimum for the fixed cost given by C = nMc_1 + c_2\sqrt{n}.

Exercise 9.7. If NM elements in a population are grouped at random into N clusters each of size M, then show that an estimator based on an SRSWOR sample of n clusters has the same efficiency as an estimator based on nM elements selected by the usual SRSWOR sampling.

Exercise 9.8. Suppose a sample of n clusters, from N clusters each containing M elements, is taken systematically for estimating the population mean, such that N is a multiple of the sample size n. Suggest an estimator of the population mean and find its variance in terms of the intraclass correlation coefficient \rho_c between pairs of elements in the clusters and that, \rho_c', between pairs of elements in the samples.
Hint: Madow (1949).

Exercise 9.9. In order to examine the efficiency of sampling households, instead of persons, for estimating the proportion of males in a given area, it is assumed that:
( a ) each household consists of four persons, viz., husband, wife, and two children;
( b ) the numbers of boys (or girls) in a family are binomially distributed.
Show that:
( a ) the value of the intraclass correlation coefficient \rho_c is -1/6;
( b ) the efficiency of sampling households compared to that of sampling persons is 200%, that is, the efficiency of cluster sampling is twice that of random sampling.
Hint: Sukhatme (1954).
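Both parts can be checked by brute-force enumeration. Note the sign: with clusters of size 4, an efficiency of 200% corresponds to a design effect of 1 + 3\rho_c = 1/2, i.e., \rho_c = -1/6. A short sketch over the four equally likely boy/girl outcomes for the two children:

```python
from itertools import product

# Each household: husband (male = 1), wife (female = 0), two children whose
# sexes are independent fair draws -- the four outcomes are equally likely.
households = [(1, 0, c1, c2) for c1, c2 in product((0, 1), repeat=2)]

mu = sum(sum(h) for h in households) / (4 * len(households))   # P(male) = 1/2
pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]    # 6 pairs per household
e_prod = sum(h[a] * h[b] for h in households for a, b in pairs) / (len(pairs) * len(households))

rho = (e_prod - mu ** 2) / (mu * (1 - mu))   # intraclass correlation
deff = 1 + (4 - 1) * rho                     # design effect for clusters of size 4
print(rho, 1 / deff)                         # about -1/6 and 2 (i.e., 200%)
```

The negative intraclass correlation is what makes household sampling more efficient than person sampling here: each household mixes males and females, so within-cluster variability is high.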

Exercise 9.10. Consider a large library with N distinct titles, each title being present in one or more volumes. A sample of n books is to be drawn to estimate a proportion \pi (say), for instance the proportion of Canadian books. Because sampling the books directly proves physically difficult, the librarian develops a scheme for choosing an SRS from the card catalogue. Since the catalogue contains a card for each volume, not for each title, this is a sampling of titles with probabilities proportional to size, the 'size' of the i-th title being the number of volumes M_i. It is assumed that the M_i are known at least for the sample units. Define

z_i = M_i/M, \quad i = 1, 2, ..., N, where M = \sum_{i=1}^{N} M_i; \quad y_i = 1 if the i-th title is Canadian, and 0 otherwise.

Our interest is to estimate the population proportion \pi = N^{-1}\sum_{i=1}^{N} y_i. Let u_i = y_i/(N z_i), v_i = 1/(N z_i), i = 1, 2, ..., n, \bar u = n^{-1}\sum_{i=1}^{n} u_i and \bar v = n^{-1}\sum_{i=1}^{n} v_i. Find the bias and variance of the following three estimators of the population proportion, defined as

\hat\pi_1 = \bar u; \quad \hat\pi_2 = \bar u + (1 - \bar v); \quad \hat\pi_3 = \bar u/\bar v.

Discuss their relative efficiencies with respect to one another.
Hint: Alalouf (1996), Mayor (2002).

Exercise 9.11. Discuss balanced cluster sampling.


Hint: Tallis (1991).

Practical 9.1. Using the map of the USA, construct 10 clusters of the 50 states, each cluster consisting of 5 states listed in population 1, based on their locations.

Practical 9.2. Select 6 clusters from the list of clusters you made in Practical 9.1 by using SRSWOR sampling. Record the values of the nonreal estate farm loans for the selected states from population 1 given in the Appendix. Estimate the average nonreal estate farm loans in the United States using cluster sampling. Estimate the variance of the estimator used for estimating the average nonreal estate farm loans. Construct a 95% confidence interval.

Practical 9.3. Suppose the United States has been divided into 10 clusters of neighbouring states, each consisting of 5 states, as shown below.

1 ME, NH, VT, MA, RI


2 NY, CT, NJ, DE, PA
3 MD,WV,VA, KY, NC
4 WI, MI, IL, IN, OH
5 TN, SC,GA,AL,FL
6 ND, MN, SD, NE, IA
7 MO, OK, AR, MS,LA
8 MT, WY, CO, NM,KS
9 CA, AZ, TX , AK , HI
10 WA, OR, ID, NV, UT
We are interested in estimating the average nonreal estate farm loans in the United States. Is there any gain in efficiency due to clustering rather than simple random sampling?
Hint: S_w^2 = 609,964.83 and S^2 = 1,176,526.14.

Practical 9.4. In Practical 9.3, find the value of the intraclass correlation coefficient. Use it to find the relative efficiency of cluster sampling over simple random sampling.
Answer: The value of the intraclass correlation coefficient is \rho = 0.3451.
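One common ANOVA-based relation (an assumed approach here, since the practical leaves the method open) expresses the efficiency through the design effect 1 + (M - 1)\rho:

```python
# For n clusters of equal size M, the design effect of cluster sampling
# relative to an element SRS of the same total size nM is 1 + (M - 1)*rho,
# so the relative efficiency is 100 / (1 + (M - 1)*rho) percent.
rho = 0.3451   # intraclass correlation from Practical 9.4
M = 5          # states per cluster
deff = 1 + (M - 1) * rho
print(round(deff, 4), round(100 / deff, 2))   # 2.3804 and 42.01 (percent)
```

A positive \rho thus makes cluster sampling less efficient than simple random sampling for this population, consistent with the discussion in Practical 9.3.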

Practical 9.5. Select 6 clusters from the table given in Practical 9.3 by using
SRSWOR sampling. Record the values of the nonreal estate farm loans for the
selected states from population 1 given in the Appendix. Estimate the relative
efficiency using the ANOVA approach. Hence, deduce the value of parameter g.

Practical 9.6. Consider the problem of estimation of proportion using cluster


sampling. Select 3 clusters from the table given in Practical 9.3 by using SRSWOR
sampling. Record the values of the nonreal estate farm loans for the selected states
from population 1 given in the Appendix. Estimate the proportion of states having
nonreal estate farm loans of more than $878.16 in the United States using cluster
sampling. Estimate the variance of the estimator used for estimating the required
proportion. Construct a 95% confidence interval.

Practical 9.7. Divide the United States into 10 clusters of neighbouring states, each consisting of 5 states, as shown in Practical 9.3. Consider the problem of estimating the proportion of states having nonreal estate farm loans of more than $878.16 in the United States. Is there any expectation of gain in efficiency due to clustering rather than simple random sampling?

Practical 9.8. A world level team of doctors believes that the estimation of total
tobacco use in the world is an important factor in the persistence of health
problems. Select four continents from population 5 by SRSWOR sampling. Collect
information about the yield/hectare from all the countries in the selected continents.
Estimate the yield/hectare in the world using three different estimators and report
your 95% confidence interval to the doctors .
Practical 9.9. A tobacco farmer wishes to find a strategy which provides a better estimate of the production of tobacco crops in the world. He divides the tobacco-growing countries of the world into 10 clusters as listed in population 5 of the Appendix. He wishes to apply PPSWR sampling for selecting the clusters. Can he expect any gain in efficiency over simple random sampling?
Practical 9.10. For the purpose of comparison in overlapping cluster sampling,
suppose the clusters are formed and selected by following two different sampling
schemes :
Scheme 1. ( a ) Select k clusters out of K clusters by SRSWR sampling.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.
Scheme 2. ( a ) Select k clusters out of K clusters by PPSWR sampling, with P_i = N_i/M.
( b ) From the i-th selected cluster of size N_i (i = 1, 2, ..., k), select n_i units by SRSWOR sampling.
Assume that the population size is unknown and three possible clusters are formed
as shown in the figure below .

Overlapping clusters
Show that the relative efficiency of the estimator of the population mean based on
scheme 2 with respect to that based on scheme 1 is 114.21%.
Hint: Tracy and Osahan (1994b).

Practical 9.11. Ms. Stephanie Singh and Ms. Renee Hom were asked to construct three clusters of the 9 regions of the USA using the following two maps:

Nov 2001 - Oct 2002 Regional Ranks (National Climatic Data Center/NESDIS/NOAA), with regions shaded from Record Driest, Much Below Normal, Below Normal, Near Normal, Above Normal, Much Above Normal, to Record Wettest.

U.S. Standard Regions for Temperature & Precipitation (National Climatic Data Center, NOAA).

Source: Printed with permission from National Climatic Data Center/NESDIS/NOAA.

Ms. Stephanie Singh suggested the following three clusters:

Cluster 1: North West, West North Central, East North Central
Cluster 2: West, South West, South
Cluster 3: North East, Central, South East

Ms. Renee Hom suggested the following three clusters:

( a) Whose clustering plan is more efficient and why?

( b ) Select two clusters using SRSWOR sampling from Ms. Stephanie Singh's clusters and construct a 95% confidence interval for the average precipitation. (Rule: always start from the first row and first column of the Pseudo-Random Number Table 1 given in the Appendix.)
( c ) Select two clusters using SRSWOR sampling from Ms. Renee Hom's clusters and construct a 95% confidence interval for the average precipitation. (Rule: always start from the first row and first column of the Pseudo-Random Number Table 1 given in the Appendix.)
Hint: Apply ANOVA approach.

Practical 9.12. Beta Corporation is considering 15 suppliers to secure the large amount of steel pipes that it uses from five different states, viz., NY, VT, NH, ME, and MA of the USA. Among the 15 suppliers, a few belong to more than one state, as shown in the diagram on the next page.

Depending upon their quality and standard, the selling prices of one pipe from these 15 suppliers are given below:

We wish to estimate the average selling price of the pipe across the different suppliers. Instead of asking all the suppliers, we wish to select all the suppliers working in a few states. At the first stage select two states, and contact all the suppliers within the selected states to collect information about their selling prices. Use the concept of overlapping cluster sampling for estimating the average selling price of a pipe.

Distribution of 15 suppliers in five different states.

Practical 9.13. Consider a hypothetical situation of a survey on AIDS patients in New York City, where residential dwellings were considered as clusters and apartments within dwellings were considered as elements. Nine dwellings were selected by simple random without replacement sampling out of the total of 1000 dwellings available on a list, and the number of AIDS patients in each apartment was noted, as listed below:

Dwelling | Number of AIDS patients per apartment
1 | 2, 0
2 | 4, 5, 1
3 | 2, 4, 2, 0
4 | 1, 2, 4, 0, 3
5 | 2, 4, 0, 0, 0, 4
6 | 0, 0, 0, 3, 2
7 | 0, 0, 0, 0
8 | 2, 2, 1
9 | 1, 2

( a ) Estimate the total number of AIDS patients in New York City.



( b ) Estimate the relative efficiency of cluster sampling with respect to simple random without replacement sampling.
( c ) Confirm your result in ( b ) based on the value of the coefficient of determination.
Hint: Apply the ANOVA approach with unequal numbers of units in each cluster.

Practical 9.14. The population of interest is the students in your class today. Each row of students in the class will form a cluster.

( I ) We wish to estimate the proportion of students in the class who visited a theatre last week.

( a ) Using the Pseudo-Random Number Table 1 given in the Appendix, select two rows and estimate the proportion of students who visited a theatre last week. Also construct a 95% confidence interval estimate.
( b ) Now ask the same question of everyone in the class and find the true proportion of students who visited a theatre last week. Does the true proportion lie in your confidence interval estimate? If yes, it is fine; if not, then suggest a suitable reason for it.

( II ) Consider the problem of estimating the average GPA of the class.

( a ) Using the Pseudo-Random Number Table 1 given in the Appendix, select one row and estimate the average GPA of the class. Also construct a 95% confidence interval estimate for the same.
( b ) Now ask the GPA question of everyone in the class and find the true average GPA. Does the true average value lie in your confidence interval estimate? Give your opinion.
10. MULTI-STAGE, SUCCESSIVE, AND RE-SAMPLING STRATEGIES

10.0 INTRODUCTION

The meaning of multi-stage sampling is clear from its name: the sample is selected in several stages. In fact it is an extension of the concept of cluster sampling. As in cluster sampling, we first divide the population into M clusters or heterogeneous groups. We select m clusters and use the estimates of cluster means or totals to form the population estimate. For example, in two-stage sampling we divide our population into M groups and select a sample of groups, which forms the first stage sample. The units so selected are called first stage units (FSU).

Fig. 10.0.1 Two-stage sampling scheme: a first stage sample of districts, and a second stage sample consisting of a list of selected villages from each district.
First stage units are sometimes also called preliminary stage units (PSU). From each group selected at the first stage, we select a sample of the population units forming the group. This sub-sample of the first stage sample is called the second stage sample, and the units so selected are called the second stage units (SSU). Similarly one can think of three-stage or four-stage sampling, and hence multi-stage sampling. A pictorial representation of two-stage sampling is given in Figure 10.0.1. Thus, a scheme of the type where at each stage we have a selection and we go on selecting smaller and smaller units is called a multi-stage sampling scheme.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

Let us define a few symbols, which will remain useful for understanding the concept of multi-stage sampling, as follows:

N = total number of first stage units (FSUs) in the population;
n = number of FSUs selected in the sample;
M_i = number of second stage units (SSUs) in the i-th FSU, (i = 1, 2, ..., N);
m_i = number of SSUs selected from the i-th FSU in the sample of n FSUs, (i = 1, 2, ..., n);
T_{ij} = total number of third stage units (TSUs) in the j-th SSU of the i-th FSU;
t_{ij} = number of TSUs selected in the sample from the j-th SSU of the i-th FSU, i = 1, 2, ..., n and j = 1, 2, ..., m_i;
Y_{ijk} = value of the study variable Y for the k-th TSU of the j-th SSU of the i-th FSU in the population, i = 1, 2, ..., N; j = 1, 2, ..., M_i; k = 1, 2, ..., T_{ij};
y_{ijk} = value of Y for the k-th TSU of the j-th SSU of the i-th FSU in the sample, i = 1, 2, ..., n; j = 1, 2, ..., m_i; k = 1, 2, ..., t_{ij};
Y_{ij.} = \sum_{k=1}^{T_{ij}} Y_{ijk}, the total over all the TSUs in the j-th SSU of the i-th FSU in the population;
y_{ij.} = \sum_{k=1}^{t_{ij}} y_{ijk}, the total over all TSUs in the j-th SSU of the i-th FSU in the sample;
S_{ij}^2 = (T_{ij} - 1)^{-1}\sum_{k=1}^{T_{ij}}\left(Y_{ijk} - Y_{ij.}/T_{ij}\right)^2, the population mean square between the T_{ij} units of the j-th SSU of the i-th FSU;
s_{ij}^2 = (t_{ij} - 1)^{-1}\sum_{k=1}^{t_{ij}}\left(y_{ijk} - y_{ij.}/t_{ij}\right)^2, the sample mean square between the t_{ij} units of the j-th SSU of the i-th FSU;
S_i^2 = (M_i - 1)^{-1}\left[\sum_{j=1}^{M_i} Y_{ij.}^2 - M_i^{-1}\left(\sum_{j=1}^{M_i} Y_{ij.}\right)^2\right], the population mean square error between the M_i units of the i-th FSU;
s_i^2 = (m_i - 1)^{-1}\left[\sum_{j=1}^{m_i} y_{ij.}^2 - m_i^{-1}\left(\sum_{j=1}^{m_i} y_{ij.}\right)^2\right], the sample mean square error between the m_i units of the i-th FSU;
Y_{i..} and y_{i..}, the corresponding totals for the i-th FSU in the population and sample, respectively;
S_b^2 = (N - 1)^{-1}\left[\sum_{i=1}^{N} Y_{i..}^2 - N^{-1}\left(\sum_{i=1}^{N} Y_{i..}\right)^2\right], the population mean square error between the totals of the FSUs;
and
s_b^2 = (n - 1)^{-1}\left[\sum_{i=1}^{n} y_{i..}^2 - n^{-1}\left(\sum_{i=1}^{n} y_{i..}\right)^2\right], the sample mean square error between the totals of the FSUs.

Suppose we would like to estimate the population total Y. Let \hat Y denote its unbiased estimator. At the first stage of sampling, we have N units in the population; therefore an estimate per first stage unit (FSU) is given by \hat Y/N. Noting that M_i, i = 1, 2, ..., N, units are available at the second stage within the i-th first stage unit (FSU), the total number of units at the second stage of sampling is given by M = \sum_{i=1}^{N} M_i, and hence the required unbiased estimator per second stage unit (SSU) is \hat Y/M. Now let the total number of units at the third stage be T = \sum_{i=1}^{N}\sum_{j=1}^{M_i} T_{ij}. Then an unbiased estimator per third stage unit (TSU) is \hat Y/T. Our objective is to find an estimator of the population total. For this purpose we proceed as follows:

2 • n
1,2,...,ml 1,2, ...,m2 • 1,2, ,mn
tl j' t2j ' , t mlj tl j,t2 j, ·.., t m2j • tIj ' t2j ' , t mnj
j=I .2, , ml j = 1.2•...•m2 j = 1,2, .m;

Then we have the following corollaries.

Corollary 10.2.1. Suppose SRS has been used at each stage of selection of the sample. Then if the $Y_i$ values are known for the FSUs, an estimator of the population total $Y$ is given by
$$\hat{Y}_{ms} = \frac{N}{n}\sum_{i=1}^{n} Y_i\,. \quad (10.2.1)$$
In the case in which the $Y_i$ values are not known, an estimator of the population total can be obtained by replacing them by their estimators
$$\hat{Y}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij}\,.$$
Thus
$$\hat{Y}_{ms} = \frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij}\,. \quad (10.2.2)$$
In the case in which the values of $Y_{ij}$ are not known, they can be estimated as
$$\hat{Y}_{ij} = \frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}$$
and the estimator of the population total $Y$ becomes
$$\hat{Y}_{ms} = \frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i}\frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}\,. \quad (10.2.3)$$

Corollary 10.2.2. Suppose the FSUs are selected using PPSWR sampling with $p_i$ being the probability of selecting the $i$th unit, and the $Y_i$ values are known for the FSUs; then an estimator of the population total $Y$ is given by
$$\hat{Y}_{ms} = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{p_i}\,. \quad (10.2.4)$$
In the case in which the $Y_i$ values are not known and the SSUs are selected with SRS, they can be replaced by their estimators
$$\hat{Y}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij}\,.$$
Then an estimator of the population total $Y$ is given by
$$\hat{Y}_{ms} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{p_i}\frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij}\,. \quad (10.2.5)$$
In the case in which the values of $Y_{ij}$ are not known and the TSUs are selected using SRS, then $Y_{ij}$ can be estimated as
$$\hat{Y}_{ij} = \frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}$$
and the estimator of the population total $Y$ becomes
$$\hat{Y}_{ms} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{p_i}\frac{M_i}{m_i}\sum_{j=1}^{m_i}\frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}\,. \quad (10.2.6)$$

Corollary 10.2.3. If the SSUs are selected with PPSWR sampling, whereas the first and third stage units are selected with SRS, then an estimator of the population total is given by
$$\hat{Y}_{ms} = \frac{N}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i}\frac{1}{p_{ij}}\frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}\,. \quad (10.2.7)$$

In the same fashion, the estimator of the population total can be obtained under any finite number of stages. It is worth remarking that the smaller the number of stages, the more accurate the estimators will be for a given sample size, though this may not hold for a fixed cost.
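The chain of substitutions leading to (10.2.3) can be sketched in code. The following is an illustrative sketch, not from the text: the nested data layout, the function name, and the numbers are hypothetical, and SRS at every stage is assumed as in Corollary 10.2.1.

```python
# Sketch of the three-stage expansion estimator (10.2.3) under SRS at each
# stage. Each loop expands the totals of one stage up to the next.

def three_stage_total(N, sample):
    """sample: list of (M_i, [(T_ij, [y_ijk, ...]), ...]) for the n sampled FSUs."""
    n = len(sample)
    total = 0.0
    for M_i, ssus in sample:
        m_i = len(ssus)
        fsu_total = 0.0
        for T_ij, y in ssus:
            t_ij = len(y)
            fsu_total += (T_ij / t_ij) * sum(y)   # expand the SSU total from its TSUs
        total += (M_i / m_i) * fsu_total          # expand the FSU total from its SSUs
    return (N / n) * total                        # expand the population total from FSUs

# hypothetical data: N = 4 FSUs in the population, n = 2 sampled
sample = [(3, [(4, [2.0, 3.0]), (5, [1.0, 2.0])]),
          (2, [(6, [4.0, 1.0])])]
print(three_stage_total(4, sample))
```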

10.3 METHOD FOR CALCULATING THE VARIANCE OF THE ESTIMATORS

Let us first discuss simple cases to find the variance of the estimators of the population total under multi-stage sampling. Consider first the situation of three-stage sampling, assuming that SRSWOR sampling has been used at each stage of selection of the sample. Evidently an unbiased estimator of the population total is
$$\hat{Y}_{3s} = \frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i}\frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}\,. \quad (10.3.1)$$
To find its variance, let $E_3$ and $V_3$ denote the conditional expectation and variance for fixed first and second stage samples; $E_2$ and $V_2$ denote the conditional expectation and variance for fixed first stage samples; and $E_1$ and $V_1$ denote the expected value and variance over all possible first stage samples. Then we have
$$V(\hat{Y}_{3s}) = E_1E_2V_3(\hat{Y}_{3s}) + E_1V_2E_3(\hat{Y}_{3s}) + V_1E_2E_3(\hat{Y}_{3s}) = I_1 + I_2 + I_3 \ \text{(say)}. \quad (10.3.2)$$
Noting that the sampling has been done independently within each FSU and SSU, we have
$$I_1 = E_1E_2V_3\left[\frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i}\frac{T_{ij}}{t_{ij}}\sum_{k=1}^{t_{ij}} y_{ijk}\right] = E_1E_2V_3\left[\frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} T_{ij}\bar{y}_{ij.}\right]$$
$$= E_1E_2\left[\frac{N^2}{n^2}\sum_{i=1}^{n}\frac{M_i^2}{m_i^2}\sum_{j=1}^{m_i} T_{ij}^2\left(\frac{1}{t_{ij}} - \frac{1}{T_{ij}}\right)S_{ij}^2\right] = \frac{N}{n}\sum_{i=1}^{N}\frac{M_i}{m_i}\sum_{j=1}^{M_i}\frac{T_{ij}\left(T_{ij}-t_{ij}\right)}{t_{ij}}S_{ij}^2\,, \quad (10.3.3)$$
and
$$I_2 = E_1V_2E_3(\hat{Y}_{3s}) = E_1V_2\left[\frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij.}\right] = \frac{N}{n}\sum_{i=1}^{N}\frac{M_i\left(M_i-m_i\right)}{m_i}S_i^2\,. \quad (10.3.4)$$
Also
$$I_3 = V_1E_2E_3(\hat{Y}_{3s}) = V_1E_2\left[\frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij.}\right] = V_1\left[\frac{N}{n}\sum_{i=1}^{n} Y_{i..}\right] = \frac{N(N-n)}{n}S_b^2\,. \quad (10.3.5)$$
Thus, using (10.3.3), (10.3.4), and (10.3.5) in (10.3.2), we have the resultant variance of the estimator in three-stage sampling, given in the following theorem:

Theorem 10.3.1. If three-stage sampling is performed using SRSWOR sampling at each stage, then the variance of the resultant unbiased estimator of the population total is given by
$$V(\hat{Y}_{3s}) = \frac{N}{n}\sum_{i=1}^{N}\frac{M_i}{m_i}\sum_{j=1}^{M_i}\frac{T_{ij}\left(T_{ij}-t_{ij}\right)}{t_{ij}}S_{ij}^2 + \frac{N}{n}\sum_{i=1}^{N}\frac{M_i\left(M_i-m_i\right)}{m_i}S_i^2 + \frac{N(N-n)}{n}S_b^2\,. \quad (10.3.6)$$
The expression (10.3.6) shows that the variance in three-stage sampling consists of three components, each corresponding to the variation at one stage.
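As a numerical illustration of (10.3.6), the sketch below is hypothetical code, not from the text: it assumes equal sub-sample sizes $m_i = m$ and $t_{ij} = t$, uses a tiny artificial population, and evaluates the three components directly from complete population data.

```python
# Sketch: evaluating the three variance components of (10.3.6) for a small
# artificial population, with m SSUs and t TSUs taken from every selected unit.
from statistics import variance  # mean square with divisor (k - 1)

def three_stage_variance(pop, n, m, t):
    """pop: list over FSUs; each FSU is a list over SSUs of lists of TSU values."""
    N = len(pop)
    ssu_totals = [[sum(ssu) for ssu in fsu] for fsu in pop]
    fsu_totals = [sum(tots) for tots in ssu_totals]
    # third-stage component: (N/n) sum_i (M_i/m) sum_j T_ij (T_ij - t) S_ij^2 / t
    c3 = (N / n) * sum((len(fsu) / m) *
                       sum(len(ssu) * (len(ssu) - t) / t * variance(ssu)
                           for ssu in fsu)
                       for fsu in pop)
    # second-stage component: (N/n) sum_i M_i (M_i - m) S_i^2 / m over SSU totals
    c2 = (N / n) * sum(len(fsu) * (len(fsu) - m) / m * variance(tots)
                       for fsu, tots in zip(pop, ssu_totals))
    # first-stage component: N (N - n) S_b^2 / n over FSU totals
    c1 = N * (N - n) / n * variance(fsu_totals)
    return c1 + c2 + c3

pop = [[[1.0, 3.0], [2.0, 2.0]], [[0.0, 4.0], [5.0, 1.0]]]
print(three_stage_variance(pop, n=1, m=1, t=1))
```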

An estimator of the population total in two-stage sampling, assuming SRSWOR sampling at each stage, is given by
$$\hat{Y}_{2s} = \frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij} \quad (10.3.7)$$
with variance
$$V(\hat{Y}_{2s}) = \frac{N(N-n)}{n}S_b^2 + \frac{N}{n}\sum_{i=1}^{N}\frac{M_i\left(M_i-m_i\right)}{m_i}S_i^2\,, \quad (10.3.8)$$
and an unbiased estimator of $V(\hat{Y}_{2s})$ is given by
$$\hat{v}(\hat{Y}_{2s}) = \frac{N(N-n)}{n}s_b^2 + \frac{N}{n}\sum_{i=1}^{n}\frac{M_i\left(M_i-m_i\right)}{m_i}s_i^2\,, \quad (10.3.9)$$
where
$$s_b^2 = (n-1)^{-1}\left[\sum_{i=1}^{n}\left(\frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij}\right)^2 - n\left(\hat{Y}_{2s}/N\right)^2\right].$$

Example 10.3.1. Out of the 10 clusters, or continents, listed in Population 5, select 4 continents with SRSWOR sampling as first stage units (FSUs). From each of the selected continents select two countries by SRSWOR sampling and record the yield/hectare of the tobacco crop. Estimate the total yield/hectare in the world. Also estimate the standard error of the estimator used for estimating the total yield/hectare in the world.

Solution. For selecting the first stage units, we used the 7th and 8th columns of the Pseudo-Random Numbers (PRN) Table 1 from the Appendix to select four distinct random numbers between 1 and 10, namely 07, 09, 01, and 02. Thus the four continents selected in an SRSWOR first-stage sample of four units are Other Africa, Middle East, Central America, and Caribbean.

Note that there are 30 countries in the continent Other Africa. To select two countries out of these 30, we selected two distinct random numbers between 1 and 30 using the first two columns of the PRN Table 1 from the Appendix, namely 01 and 23. Thus at the second stage the countries Angola and South Africa will be included in the sample.

Further note that there are 10 countries in the continent of the Middle East. To select two countries out of these 10, we selected two distinct random numbers between 1 and 10 using the third and fourth columns of the PRN Table 1 from the Appendix, namely 06 and 07. Thus at the second stage the countries Oman and Syria will be included in the sample.

Similarly, the second stage countries selected from Central America are Honduras and Nicaragua, whereas those from the Caribbean are Haiti and Jamaica, respectively.

The structure of the sample information collected takes the following form. Here we are given $N = 10$ and $n = 4$.

 i   Continent          $M_i$   $m_i$   $\frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij}$   $s_i^2$
 1   Other Africa         30      2                 44.85                          0.4203
 2   Middle East          10      2                 11.30                          0.0320
 3   Central America       6      2                 11.43                          0.3750
 4   Caribbean             6      2                  9.84                          2.9400

An estimate of the total yield/hectare in the world is given by
$$\hat{Y}_{2s} = \frac{N}{n}\sum_{i=1}^{n}\frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij} = \frac{10}{4}\times 77.42 = 193.55\,.$$
Now we have
$$s_b^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}\left(\frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij}\right)^2 - n\left(\hat{Y}_{2s}/N\right)^2\right] = \frac{4495.3923}{3} = 1498.4641\,.$$
An unbiased estimate of $V(\hat{Y}_{2s})$ is given by
$$\hat{v}(\hat{Y}_{2s}) = \frac{N(N-n)}{n}s_b^2 + \frac{N}{n}\sum_{i=1}^{n}\frac{M_i\left(M_i-m_i\right)}{m_i}s_i^2 = \frac{10(10-4)}{4}\times 1498.4641 + \frac{10}{4}\times 217.568 = 23020.8815\,.$$
A $(1-\alpha)100\%$ confidence interval for the population total $Y$ is given by
$$\hat{Y}_{2s} \pm t_{\alpha/2}\{df = n(m-1)\}\sqrt{\hat{v}(\hat{Y}_{2s})}\,.$$
Using Table 2 from the Appendix, the 95% confidence interval for the total yield/hectare of the tobacco crop in the world is given by
$$193.55 \pm t_{0.025}\{df = 4(2-1)\}\sqrt{23020.8815}\,, \ \text{or} \ 193.55 \pm 2.776\sqrt{23020.8815}\,,$$
or $[-227.64,\ 614.74]$, or $[0.00,\ 614.74]$.
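The arithmetic of this example can be re-run mechanically from the printed aggregates; the minimal sketch below recomputes $\hat{v}(\hat{Y}_{2s})$ and the 95% confidence interval.

```python
# Sketch: Example 10.3.1 recomputed from the printed aggregates.
import math

N, n = 10, 4
Y_hat = (N / n) * 77.42                # = 193.55
s_b2 = 4495.3923 / 3                   # = 1498.4641
within = 217.568                       # sum of M_i (M_i - m_i) s_i^2 / m_i
v_hat = N * (N - n) / n * s_b2 + (N / n) * within
half = 2.776 * math.sqrt(v_hat)        # t_{0.025}(4 df) = 2.776
print(round(v_hat, 4), round(Y_hat - half, 2), round(Y_hat + half, 2))
```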

Suppose $n$ first stage units are selected by SRSWOR sampling out of the $N$ units in the population. At the second stage, $m$ units are selected from each of the selected first-stage units, each consisting of $M$ units. If $y_{ij}$ denotes the value of the sample unit corresponding to the $i$th first stage unit and $j$th second stage unit, then an unbiased estimator of the population mean is given by
$$\bar{y}_{eq} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m}\sum_{j=1}^{m} y_{ij} \quad (10.3.10)$$
with variance given by
$$V(\bar{y}_{eq}) = \frac{N-n}{Nn}S_b^2 + \frac{M-m}{Mnm}S_w^2\,, \quad (10.3.11)$$
where
$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{Y}_{i.} - \bar{Y}\right)^2 \ \text{ and } \ S_w^2 = \frac{1}{N(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(Y_{ij} - \bar{Y}_{i.}\right)^2\,.$$
Let $C_1$ be the cost of selecting a first stage unit in the sample, $C_2$ the cost of selecting a second stage unit, and $C_0$ the overhead cost; then the simplest cost function in two-stage sampling can be written as
$$C = C_0 + nC_1 + mnC_2\,. \quad (10.3.12)$$
One can easily observe that the optimum first stage and second stage sample sizes for a fixed cost of survey are given by
$$n = \frac{C - C_0}{C_1 + mC_2} \quad (10.3.13)$$
and
$$m = S_w\sqrt{C_1M}\Big/\sqrt{C_2\left(MS_b^2 - S_w^2\right)}\,. \quad (10.3.14)$$
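A small sketch of the allocation (10.3.13)-(10.3.14); the function name and the test values are hypothetical, and non-integer optimum sizes would be rounded in practice.

```python
# Sketch: optimum two-stage sample sizes for the cost function
# C = C0 + n*C1 + n*m*C2, per (10.3.13)-(10.3.14).
import math

def optimum_two_stage(C, C0, C1, C2, M, Sb2, Sw2):
    m = math.sqrt(Sw2 * C1 * M / (C2 * (M * Sb2 - Sw2)))  # (10.3.14)
    n = (C - C0) / (C1 + m * C2)                          # (10.3.13)
    return n, m

n_opt, m_opt = optimum_two_stage(C=1000, C0=100, C1=10, C2=1,
                                 M=50, Sb2=4.0, Sw2=25.0)
print(round(n_opt, 2), round(m_opt, 2))
```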


For simplicity let us assume that $m$ is the number of SSUs selected from each of the selected $n$ FSUs and $p$ is the number of TSUs selected from each of the selected $m$ SSUs. As we saw earlier, in three-stage sampling the variance of the estimator consists of three components. Let us assume that it is given by
$$V(\hat{Y}_{3s}) = \frac{\sigma_b^2}{n} + \frac{\sigma_w^2}{nm} + \frac{\bar{\sigma}_w^2}{nmp}\,, \quad (10.4.1)$$
where
$$\sigma_b^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{Y}_{i..} - \bar{Y}\right)^2, \quad \sigma_w^2 = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\bar{Y}_{ij.} - \bar{Y}_{i..}\right)^2, \quad \text{and} \quad \bar{\sigma}_w^2 = \frac{1}{NMP}\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k=1}^{P}\left(Y_{ijk} - \bar{Y}_{ij.}\right)^2$$
have their usual meanings.

Suppose $C_0$ is the overhead fixed cost for the survey and $C_1$, $C_2$, $C_3$ are the costs of enumerating a unit at the first, second, and third stages of sampling. Obviously the simplest cost function is
$$C = C_0 + nC_1 + nmC_2 + nmpC_3\,. \quad (10.4.2)$$
The Lagrange function is then given by
$$L = \frac{\sigma_b^2}{n} + \frac{\sigma_w^2}{nm} + \frac{\bar{\sigma}_w^2}{nmp} + \lambda\left[C_0 + nC_1 + nmC_2 + nmpC_3 - C\right]. \quad (10.4.3)$$
On differentiating (10.4.3) with respect to $n$, $(mn)$, and $(mnp)$ and equating to zero in each case, we have
$$\frac{\partial L}{\partial n} = -\frac{\sigma_b^2}{n^2} + \lambda C_1 = 0 \ \Rightarrow \ n = \frac{\sigma_b}{\sqrt{\lambda C_1}}\,, \quad (10.4.4)$$
$$\frac{\partial L}{\partial(mn)} = -\frac{\sigma_w^2}{(mn)^2} + \lambda C_2 = 0 \ \Rightarrow \ (mn) = \frac{\sigma_w}{\sqrt{\lambda C_2}}\,, \quad (10.4.5)$$
and
$$\frac{\partial L}{\partial(mnp)} = -\frac{\bar{\sigma}_w^2}{(mnp)^2} + \lambda C_3 = 0 \ \Rightarrow \ (mnp) = \frac{\bar{\sigma}_w}{\sqrt{\lambda C_3}}\,. \quad (10.4.6)$$
From (10.4.2), (10.4.4), (10.4.5), and (10.4.6) we have
$$C - C_0 = \frac{1}{\sqrt{\lambda}}\left[\sigma_b\sqrt{C_1} + \sigma_w\sqrt{C_2} + \bar{\sigma}_w\sqrt{C_3}\right] \ \Rightarrow \ \frac{1}{\sqrt{\lambda}} = \frac{C - C_0}{\sigma_b\sqrt{C_1} + \sigma_w\sqrt{C_2} + \bar{\sigma}_w\sqrt{C_3}}\,. \quad (10.4.7)$$
On substituting (10.4.7) into (10.4.4), (10.4.5), and (10.4.6), and solving for $n$, $m$, and $p$, we obtain the optimum sample sizes as
$$n_{opt} = \frac{\sigma_b\left(C - C_0\right)}{\sqrt{C_1}\left[\sigma_b\sqrt{C_1} + \sigma_w\sqrt{C_2} + \bar{\sigma}_w\sqrt{C_3}\right]}\,, \quad m_{opt} = \frac{\sigma_w}{\sigma_b}\sqrt{\frac{C_1}{C_2}}\,, \ \text{and} \ p_{opt} = \frac{\bar{\sigma}_w}{\sigma_w}\sqrt{\frac{C_2}{C_3}}\,. \quad (10.4.8)$$
The minimum variance can easily be obtained by substituting these optimum values in (10.4.1). The three-stage sampling considered so far is called traditional three-stage sampling. If the FSUs consist of districts, the SSUs of villages, and the TSUs of bank accounts of customers, then from each selected village the customers holding bank accounts are required to be sampled separately and independently across the villages. Raj (1968) and Rao (1975) have discussed the estimation strategies in greater detail.
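The optimum sizes (10.4.8) can be evaluated in the same way as in the two-stage case; the function name and the inputs below are hypothetical.

```python
# Sketch: optimum three-stage sample sizes per (10.4.8) for the cost function
# C = C0 + n*C1 + n*m*C2 + n*m*p*C3.
import math

def optimum_three_stage(C, C0, C1, C2, C3, sb, sw, sww):
    denom = sb * math.sqrt(C1) + sw * math.sqrt(C2) + sww * math.sqrt(C3)
    n = sb * (C - C0) / (math.sqrt(C1) * denom)   # n_opt
    m = (sw / sb) * math.sqrt(C1 / C2)            # m_opt
    p = (sww / sw) * math.sqrt(C2 / C3)           # p_opt
    return n, m, p

n_opt, m_opt, p_opt = optimum_three_stage(C=2000, C0=200, C1=16, C2=4, C3=1,
                                          sb=3.0, sw=2.0, sww=1.0)
print(round(n_opt, 2), round(m_opt, 2), round(p_opt, 2))
```

As a sanity check, these sizes exhaust the budget: $n(C_1 + mC_2 + mpC_3) = C - C_0$.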

Chaudhuri, Adhikary, and Dihidar (2000) considered a homogeneous linear unbiased function of the sampled first stage unit (FSU) values as an estimator of a survey population total, whose sampling variance is expressed as a homogeneous quadratic function of the FSU values. When the FSU values are not ascertainable, but unbiased estimators for them are separately available through sampling at later stages and are substituted into the estimator, Raj (1968) gave a simple variance estimator formula for this multi-stage estimator of the population total. He requires that the variances of the estimated FSU values in sampling at later stages, and their unbiased estimators, be available in certain simple forms. Chaudhuri, Adhikary, and Dihidar (2000) illustrate a particular three-stage sampling strategy and present a simulation based numerical exercise showing the relative efficiencies of two alternative variance estimators.

To draw a sample of $n$ FSUs, Chaudhuri (1997) used the well known Rao, Hartley, and Cochran (1962) scheme of sampling with normed size measures $p_i$. First we divide the population $\Omega$ at random into $n$ groups, taking $N_i$ FSUs in the $i$th group such that $\sum_{i=1}^{n} N_i = N$, the population size. From each group one unit is selected with probability proportional to $p_i$, and selection is done independently across the groups. Let $Q_i$ be the sum of the probabilities $p_i$ falling in the $i$th group. The $m_i$ SSUs are also selected using the RHC scheme from the $i$th FSU of size $M_i$. We split the $M_i$ SSUs into $m_i$ groups at random, taking $M_{ij}$ SSUs in the $j$th group such that $\sum_{j=1}^{m_i} M_{ij} = M_i$ is satisfied. From every group, say the $j$th, one unit is chosen with probability proportional to $p_{ij}$, and this is repeated independently across the groups. Let $Q_{ij}$ denote the sum of the $p_{ij}$ values falling in the $j$th group. Let $s_i$ denote the set of SSUs chosen from the $i$th selected FSU, $A(s_i)$ be the set of bank accounts of customers with dwelling addresses in $s_i$, and $L_i(s_i)$ be the cardinality of $A(s_i)$. Also let $s_i^*$ be an SRSWOR sample of $t_i(s_i)$ TSUs chosen from $A(s_i)$, the selection being repeated independently across the selected FSUs. Thus we have the following theorem:

Theorem 10.5.1. An unbiased estimator of the population total $Y$ is given by
$$\hat{Y}_{m3s} = \sum_{i=1}^{n}\frac{Q_i}{p_i}\sum_{j=1}^{m_i}\frac{Q_{ij}}{p_{ij}}\left(\frac{L_i(s_i)}{t_i(s_i)}\sum_{k\in s_i^*} y_{ijk}\right). \quad (10.5.1)$$
Proof. The estimator (10.5.1) can be written as
$$\hat{Y}_{m3s} = \sum_{i=1}^{n}\frac{Q_i}{p_i}\sum_{j=1}^{m_i}\frac{Q_{ij}}{p_{ij}}w_{ij}\,,$$
where
$$w_{ij} = \frac{L_i(s_i)}{t_i(s_i)}\sum_{k\in s_i^*} y_{ijk} = L_i(s_i)\,\bar{y}_{ij}\,.$$
Taking expected values on both sides we have
$$E(\hat{Y}_{m3s}) = E_1E_2E_3\left[\sum_{i=1}^{n}\frac{Q_i}{p_i}\sum_{j=1}^{m_i}\frac{Q_{ij}}{p_{ij}}w_{ij}\right] = E_1\left[\sum_{i=1}^{n}\frac{Q_i}{p_i}E_2\left\{\sum_{j=1}^{m_i}\frac{Q_{ij}}{p_{ij}}E_3(w_{ij})\right\}\right]$$
$$= E_1\left[\sum_{i=1}^{n}\frac{Q_i}{p_i}\sum_{j=1}^{M_i} Y_{ij}\right] = E_1\left[\sum_{i=1}^{n}\frac{Q_i}{p_i}Y_{i.}\right] = Y\,.$$
Hence the theorem.
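The first-stage RHC step above can be sketched in a small simulation. This is illustrative code only, not from the text: the grouping rule, the seed, and the data are hypothetical. It forms $n$ random groups, draws one unit per group with probability $p_i/Q_i$, and averages the single-stage estimator $\sum_i Q_iy_i/p_i$ over repeated draws to show that it is centred at the true total.

```python
# Sketch: one draw of the Rao-Hartley-Cochran scheme at the first stage.
import random

def rhc_estimate(y, p, n, rng):
    """y: FSU totals; p: normed size measures (summing to 1); n: groups."""
    N = len(y)
    idx = list(range(N))
    rng.shuffle(idx)
    groups = [idx[g::n] for g in range(n)]  # random split into n groups
    est = 0.0
    for g in groups:
        Q = sum(p[i] for i in g)            # Q_i: the group's share of the p's
        r, cum = rng.random() * Q, 0.0
        chosen = g[-1]                      # fallback guards rounding at the edge
        for i in g:                         # draw one unit with prob p_i / Q
            cum += p[i]
            if cum >= r:
                chosen = i
                break
        est += Q * y[chosen] / p[chosen]
    return est

rng = random.Random(7)
y = [12.0, 5.0, 9.0, 20.0, 3.0, 11.0]
p = [0.2, 0.1, 0.15, 0.3, 0.05, 0.2]
draws = [rhc_estimate(y, p, n=2, rng=rng) for _ in range(20000)]
mean_est = sum(draws) / len(draws)
print(mean_est)   # close to the true total 60.0
```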

Let $Y_i$ and $X_i$ be the totals of the $i$th FSU for the study variable $Y$ and the auxiliary variable $X$, respectively. Assume that the population total $X = \sum_{i=1}^{N} X_i$ of the auxiliary variable is known and we are interested in the estimation of the population total $Y = \sum_{i=1}^{N} Y_i$ of the study variable. A sample $s$ of $n$ FSUs is selected according to any design with $\pi_i$ and $\pi_{ij}$ as the known first and second order inclusion probabilities, which are in fact functions of the first stage sample size $n$. The selected FSUs are sub-sampled independently with suitable selection probabilities at the second stage. When the $i$th unit is selected, it is assumed that from sampling at the second stage, estimators $t_{iy}$ and $t_{ix}$ for $Y_i$ and $X_i$, respectively, are available such that
$$V_2(t_{iy}) = \sigma_{iy}^2\,, \quad V_2(t_{ix}) = \sigma_{ix}^2\,, \quad C_2(t_{iy}, t_{ix}) = \sigma_{ixy}\,, \quad E_2(t_{iy}) = Y_i\,, \ \text{and} \ E_2(t_{ix}) = X_i\,.$$
Thus we have the following theorem:


Theorem 10.6.1. Unbiased estimators of $Y$ and $X$ are, respectively, given by
$$t_y = \sum_{i=1}^{n}\frac{t_{iy}}{\pi_i} \quad \text{and} \quad t_x = \sum_{i=1}^{n}\frac{t_{ix}}{\pi_i} \quad (10.6.1)$$
with
$$\mathrm{Cov}(t_y, t_x) = \sigma_{xy} + \sum_{i=1}^{N}\frac{\sigma_{ixy}}{\pi_i}\,, \quad (10.6.2)$$
where
$$\sigma_{xy} = \frac{1}{2}\sum_{i\ne j}^{N}\left(\pi_i\pi_j - \pi_{ij}\right)\left(X_i/\pi_i - X_j/\pi_j\right)\left(Y_i/\pi_i - Y_j/\pi_j\right).$$
Proof. By the definition of covariance we have
$$\mathrm{Cov}(t_y, t_x) = E_1C_2(t_y, t_x) + C_1\left[E_2(t_y),\ E_2(t_x)\right]$$
$$= E_1\left[\sum_{i=1}^{n}\frac{\sigma_{ixy}}{\pi_i^2}\right] + C_1\left[\sum_{i=1}^{n}\frac{Y_i}{\pi_i},\ \sum_{i=1}^{n}\frac{X_i}{\pi_i}\right] = \sigma_{xy} + \sum_{i=1}^{N}\frac{\sigma_{ixy}}{\pi_i}\,.$$
Hence the theorem.
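The estimators (10.6.1) are one-line computations once the per-FSU estimated totals and inclusion probabilities are available; a minimal sketch with hypothetical inputs:

```python
# Sketch of (10.6.1): Horvitz-Thompson-type totals built from per-FSU
# estimated totals t_iy, t_ix and inclusion probabilities pi_i.

def ht_pair(t_iy, t_ix, pi):
    t_y = sum(ty / p for ty, p in zip(t_iy, pi))
    t_x = sum(tx / p for tx, p in zip(t_ix, pi))
    return t_y, t_x

# hypothetical sample of n = 3 FSUs
t_y, t_x = ht_pair([10.0, 4.0, 6.0], [20.0, 8.0, 12.0], [0.5, 0.2, 0.3])
print(t_y, t_x)
```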

Corollary 10.6.1. From the above theorem the following results are obvious:
$$V(t_y) = \sigma_y^2 + \sum_{i=1}^{N}\frac{\sigma_{iy}^2}{\pi_i}\,, \quad (10.6.3)$$
where
$$\sigma_y^2 = \frac{1}{2}\sum_{i\ne j}^{N}\left(\pi_i\pi_j - \pi_{ij}\right)\left(Y_i/\pi_i - Y_j/\pi_j\right)^2\,,$$
and
$$V(t_x) = \sigma_x^2 + \sum_{i=1}^{N}\frac{\sigma_{ix}^2}{\pi_i}\,, \quad (10.6.4)$$
where
$$\sigma_x^2 = \frac{1}{2}\sum_{i\ne j}^{N}\left(\pi_i\pi_j - \pi_{ij}\right)\left(X_i/\pi_i - X_j/\pi_j\right)^2\,.$$

Sahoo and Panda (1997) first defined a sub-class of estimators of the $i$th unit population total $Y_i$ (say) as
$$t_i = h_i(t_{iy}, t_{ix})\,, \quad (10.6.5)$$
where $h_i(t_{iy}, t_{ix})$ is a function of $t_{iy}$ and $t_{ix}$ such that $h_i(Y_i, X_i) = Y_i$, satisfying certain regularity conditions analogous to those defined by Srivastava (1980). Evidently an estimator of $Y$ can be defined as
$$t_y^* = \sum_{i=1}^{n}\frac{t_i}{\pi_i} = \sum_{i=1}^{n}\frac{h_i(t_{iy}, t_{ix})}{\pi_i}\,, \quad (10.6.6)$$
and one of $X$ can be defined as
$$t_x^* = \sum_{i=1}^{n}\frac{t_{ix}}{\pi_i}\,. \quad (10.6.7)$$
Using (10.6.6) and (10.6.7), Sahoo and Panda (1997) defined a main class of estimators of the population total in two-stage sampling as
$$\hat{Y}_c = f_1(t_y^*, t_x^*)\,, \quad (10.6.8)$$
where $f_1(t_y^*, t_x^*)$ denotes a function of $t_y^*$ and $t_x^*$ satisfying certain regularity conditions. Then we have the following theorem:

Theorem 10.6.2. The minimum variance of the main class of estimators $\hat{Y}_c$ is
$$\min V(\hat{Y}_c) = \sigma_y^2\left(1 - \rho_{xy}^2\right) + \sum_{i=1}^{N}\frac{\sigma_{iy}^2\left(1 - \rho_{ixy}^2\right)}{\pi_i}\,. \quad (10.6.9)$$
Proof. Note that
$$f_1(Y, X) = Y \quad \text{and} \quad \frac{\partial f_1}{\partial t_y^*}\bigg|_{(Y,X)} = 1\,.$$
Using a first order Taylor series, the class of estimators $\hat{Y}_c$ can easily be expanded around the point $(Y, X)$ as
$$\hat{Y}_c = f_1\left[Y + \left(t_y^* - Y\right),\ X + \left(t_x^* - X\right)\right] = f_1(Y,X) + \left(t_y^* - Y\right)\frac{\partial f_1}{\partial t_y^*}\bigg|_{(Y,X)} + \left(t_x^* - X\right)\frac{\partial f_1}{\partial t_x^*}\bigg|_{(Y,X)} + \cdots$$
$$= t_y^* + \left(t_x^* - X\right)f_{12}(Y,X) + \cdots = \sum_{i=1}^{n}\frac{h_i(t_{iy}, t_{ix})}{\pi_i} + \left(t_x^* - X\right)f_{12}(Y,X) + \cdots\,.$$
Expanding each $h_i$ around $(Y_i, X_i)$ in the same way, and using $h_i(Y_i, X_i) = Y_i$ together with $\partial h_i/\partial t_{iy}\big|_{(Y_i,X_i)} = 1$ from the regularity conditions, we obtain
$$\hat{Y}_c = \sum_{i=1}^{n}\frac{t_{iy} + \left(t_{ix} - X_i\right)h_{2i}(Y_i, X_i)}{\pi_i} + \left(t_x^* - X\right)f_{12}(Y,X) + \cdots\,, \quad (10.6.10)$$
where $f_{12}(Y,X)$ and $h_{2i}(Y_i, X_i)$ denote the known first order partial derivatives, with respect to the second argument, of the respective functions used in the construction of the estimator. Then using the definition of variance we have
$$V(\hat{Y}_c) = \sigma_y^2 + 2f_{12}(Y,X)\sigma_{xy} + f_{12}^2(Y,X)\sigma_x^2 + \sum_{i=1}^{N}\frac{\sigma_{iy}^2 + 2h_{2i}(Y_i,X_i)\sigma_{ixy} + h_{2i}^2(Y_i,X_i)\sigma_{ix}^2}{\pi_i}\,. \quad (10.6.11)$$
The variance $V(\hat{Y}_c)$ is found to be minimum for
$$f_{12}(Y,X) = -\frac{\sigma_{xy}}{\sigma_x^2} \quad \text{and} \quad h_{2i}(Y_i,X_i) = -\frac{\sigma_{ixy}}{\sigma_{ix}^2}\,.$$
On substituting these values in (10.6.11) we obtain (10.6.9). Hence the theorem.

Sahoo and Panda (1999a, 1999b) extended these results to the situation in which two auxiliary variables $X$ and $Z$ are available to estimate the population mean of the study variable $Y$ under two-stage sampling.

Bellhouse and Rao (1986) pointed out that, from the efficiency point of view, the prediction estimator may be only marginally better than the classical estimator in PPS sampling under a two-stage sampling design. Following them, consider a population of $N$ units consisting of $L$ FSUs with $M_i$ units in the $i$th FSU $(i = 1,2,\ldots,L)$ such that $N = \sum_{i=1}^{L} M_i$. Let $Y_{ij}$ be the value of the $j$th unit in the $i$th FSU, and suppose we are interested in estimating the population mean
$$\bar{Y} = N^{-1}\sum_{i=1}^{L}\sum_{j=1}^{M_i} Y_{ij}$$
by selecting a sample, $s$, of the FSUs and then choosing, for each $i \in s$, a sample $s_i$ of units from the $i$th FSU. The probability $p(s)$ with which the composite sample $s = \bigcup_{i\in s} s_i$ is chosen is called a sampling design. The design is said to be of fixed size if the number of distinct FSUs in $s$ is $\nu$ and the number of distinct units in $s_i$ is $m_i$, where $\nu$ and the $m_i$ are pre-specified. The first order inclusion probability for the $i$th FSU is $\pi_i = \sum_{s\ni i} p(s)$ and that for the $j$th unit in the $i$th FSU is $\pi_{ij} = \sum_{s\ni(i,j)} p(s)$. Then we have the following theorem:

Theorem 10.7.1. A linear homogeneous estimator of the population mean $\bar{Y}$ is
$$\bar{y}_{bs} = \sum_{i\in s}\sum_{j\in s_i} b_{sij}Y_{ij} = N^{-1}\left[\sum_{i\in s}\sum_{j\in s_i} Y_{ij} + \sum_{i\in s}\sum_{j\in s_i} d_{sij}Y_{ij}\right], \quad (10.7.1)$$
where $d_{sij} = Nb_{sij} - 1$ and the $b_{sij}$ are defined for all $s$ and all $(i,j) \in s$.

Scott and Smith (1969) defined a model by assuming that the $Y_{ij}$ are uncorrelated, for each $i$, with mean $\mu_i$ and fixed variance $\sigma_i^2$, and that the $\mu_i$ are uncorrelated with mean $\mu$ and variance $\delta^2$. They suggested the model
$$Y_{ij} = \mu + e_{ij} \quad (10.7.2)$$
with
$$E_m(e_{ij}) = 0\,, \quad E_m(e_{ij}^2) = \delta^2 + \sigma_i^2\,, \quad E_m(e_{ij}e_{ij'}) = \delta^2\,, \ j \ne j'\,, \ \text{and} \ E_m(e_{ij}e_{i'j'}) = 0\,, \ i \ne i'\,.$$

Then we have the following theorem:

Theorem 10.7.2. If $m_i = m$ and $\sigma_i^2 = \sigma^2$ for all $i$ in the above model, then a prediction estimator in two-stage sampling for a fixed $(\nu, m)$ design is given by
$$\bar{y}_{pr1} = \psi\,\bar{y}_a + \left(1 - \psi\right)\bar{y}_b\,, \quad (10.7.3)$$
where $\bar{y}_a = \sum_{i\in s} M_i\bar{y}_i\big/\sum_{i\in s} M_i$ with $\bar{y}_i = \sum_{j\in s_i} Y_{ij}/m$, $\bar{y}_b = \nu^{-1}\sum_{i\in s}\bar{y}_i$, and $\psi$ is a constant such that the variance of the estimator $\bar{y}_{pr1}$ is minimum. It can easily be shown that $V(\bar{y}_{pr1})$ is minimum if the optimum value of $\psi$ is given by
$$\psi_{opt} = \frac{\lambda}{N}\sum_{i\in s} M_i\,, \quad \text{with} \quad \lambda = \delta^2\left[\delta^2 + \frac{\sigma^2}{m}\right]^{-1}\,. \quad (10.7.4)$$

Theorem 10.7.3. In the case of stratified random sampling with $\nu = L$, the prediction estimator becomes
$$\bar{y}_{pr2} = \frac{1}{N}\sum_{i=1}^{L} M_i\left[\lambda_i\left(\frac{1}{m}\sum_{j\in s_i} Y_{ij}\right) + \left(1 - \lambda_i\right)\left(\sum_{i=1}^{L}\lambda_i\sum_{j\in s_i} Y_{ij}\Big/ m\sum_{i=1}^{L}\lambda_i\right)\right], \quad (10.7.5)$$
where $\lambda_i = \delta^2\left[\delta^2 + \dfrac{\sigma_i^2}{m}\right]^{-1}$.
Bellhouse and Rao (1986) proposed a two-stage permutation model:
$$Y_{ij} = \bar{Y} + e_{ij}\,, \quad (10.7.6)$$
where $E_m(e_{ij}) = 0$, $E_m(e_{ij}^2) = \sigma_b^2 + \sigma_w^2$, $E_m(e_{ij}e_{ij'}) = \sigma_b^2 - (M_i - 1)^{-1}\sigma_w^2$, $j \ne j'$, and $E_m(e_{ij}e_{i'j'}) = -\sigma_b^2/(L-1)$, $i \ne i'$.

Then we have the following theorem:

Theorem 10.7.4. The MSE of the estimator $\bar{y}_{bs}$ under model (10.7.6) is
$$E_m\left[\mathrm{MSE}(\bar{y}_{bs})\right] = \bar{Y}^2\left[\sum_{i\in s}\sum_{j\in s_i} b_{sij} - 1\right]^2 + \frac{\sigma_b^2}{L}\left[L\sum_{i\in s}\left(\sum_{j\in s_i} b_{sij}\right)^2 - \left(\sum_{i\in s}\sum_{j\in s_i} b_{sij}\right)^2 - 2L\sum_{i\in s}\left(\frac{LM_i}{N} - 1\right)\sum_{j\in s_i} b_{sij}\right]$$
$$+ \sigma_w^2\left[\sum_{i\in s}\frac{M_i}{M_i - 1}\sum_{j\in s_i} b_{sij}^2 - \sum_{i\in s}\frac{1}{M_i - 1}\left(\sum_{j\in s_i} b_{sij}\right)^2\right] + E_m\left(\bar{E}^2\right), \quad (10.7.7)$$
where $\bar{E}$ is the population mean of the $e_{ij}$.

Theorem 10.7.5. For any fixed size design, the optimal model unbiased predictor of the population mean $\bar{Y}$ is
$$\bar{y}_{opt} = \left\{1 - \frac{\lambda}{L-1}N^{-1}\sum_{i\in s} a_iM_i\right\}\frac{\sum_{i\in s} a_i\sum_{j\in s_i} Y_{ij}}{m\sum_{i\in s} a_i} + \left\{\frac{\lambda}{L-1}N^{-1}\sum_{i\in s} a_iM_i\right\}\frac{\sum_{i\in s} a_iM_i\sum_{j\in s_i} Y_{ij}}{m\sum_{i\in s} a_iM_i}\,, \quad (10.7.8)$$
where $r = \sigma_b^2/\sigma_w^2$ and $a_i = m_i\left(\dfrac{Lm_i}{L-1} + \dfrac{M_i - m_i}{M_i - 1}\right)^{-1}$.

Royall (1986) considered the problem of robust variance estimation, following Royall and Cumberland (1978, 1981a, 1981b), for some classes of estimators which are useful for estimating population totals and ratios of totals from cluster sampling. In other words, he considered the problem of estimation of the population totals
$$Y = \sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{ij}\,, \quad Z = \sum_{i=1}^{N}\sum_{j=1}^{M_i} Z_{ij}\,,$$
and the ratio
$$R = Z/Y\,,$$
where $(Y_{ij}, Z_{ij})$ are the numbers associated with the $j$th unit in the $i$th cluster. The estimators are generally of the form
$$\hat{Y} = \frac{N}{n}\sum_{i\in s} u_i\hat{Y}_i\,, \quad \hat{Z} = \frac{N}{n}\sum_{i\in s} u_i\hat{Z}_i\,, \ \text{and} \ \hat{R} = \hat{Z}/\hat{Y}\,,$$
where
$$\hat{Y}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} Y_{ij} \quad \text{and} \quad \hat{Z}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} Z_{ij}$$
are the simple expansion estimators of the $i$th cluster totals for $Y$ and $Z$, respectively. The coefficient $u_i$ is free from the $y$ or $z$ values. Royall (1986) does not guarantee that the model considered by him is correct, but he adopts it as a tool to use in planning and inference. Now we have the following theorem:

Theorem 10.8.1. If variables in different clusters are uncorrelated, then the error variance of the estimator under the two-stage sampling design is given by
$$V(\hat{Y} - Y) = \frac{N}{f}B_1 - 2NB_2 + NB_3\,, \quad (10.8.1)$$
where
$$B_1 = \frac{1}{n}\sum_{i=1}^{N} u_i^2M_i^2V(\bar{y}_{si})\,, \quad \bar{y}_{si} = m_i^{-1}\sum_{j=1}^{m_i} Y_{ij}\,, \quad B_2 = \frac{1}{n}\sum_{i=1}^{N} u_iM_i^2\,\mathrm{Cov}(\bar{y}_{si}, \bar{Y}_i)\,, \quad \bar{Y}_i = Y_i/M_i\,,$$
and $B_3$ is the corresponding term involving the variances of the cluster means $\bar{Y}_i$.

Clearly $B_2$ and $B_3$ depend on the variance-covariance structure of units that are not observed, but $B_1$ depends only on the variances of sample means. Obviously it is possible to produce an unbiased or consistent estimator $\hat{B}_1$ of $B_1$ under models much more general than the one required for estimating $B_2$ and $B_3$. The first term of the variance, $\frac{N}{f}B_1$, is dominant when $f \to 0$, and it is possible to produce an unbiased or consistent estimator of it under fairly general models.

Now we have the following corollary:

Corollary 10.8.1. The variance of a linear function of independent random variables, given by $\sum_i l_i^2V(X_i)$, can be estimated by $\sum_i l_i^2\hat{V}(X_i) + k$, where $k$ stands for an appropriate bias correction term.

Royall (1986) considered both situations, when the cluster sizes $M_i$ are known and when they are unknown, for estimating the variance. We will discuss only the situation in which the cluster sizes are known. Royall (1986) considered the following model, which depends upon three unknown parameters $\mu$, $\rho$, and $\sigma^2$:
$$E(Y_{ij}) = \mu\,, \quad \mathrm{Cov}(Y_{ij}, Y_{kl}) = \begin{cases}\sigma^2 & \text{if } i = k,\ j = l,\\ \rho\sigma^2 & \text{if } i = k,\ j \ne l,\\ 0 & \text{if } i \ne k.\end{cases} \quad (10.8.2)$$
This model has also been used by Scott and Smith (1969) and Royall (1976). Burdick and Sielken (1979) studied the same variance estimation problem under the same model but focused on obtaining unbiased variance estimators having chi-squared distributions, whereas Royall (1986) aimed at robustness in the sense of consistency under broad conditions. In fact, this model describes populations in which the cluster total $Y_i$ is roughly proportional to the cluster size $M_i$, that is, $E(Y_i) = \mu M_i$, with the $Y_{ij}$ correlated within clusters.

Thus we have the following corollary.

Corollary 10.8.2. Under model (10.8.2), the following results hold:
$$V(\hat{Y}_i) = \frac{M_i^2\sigma^2}{m_i}\left[1 - \rho + m_i\rho\right] \quad (10.8.3)$$
and
$$\mathrm{Cov}(\hat{Y}_i, Y_i) = M_i\sigma^2\left[1 - \rho + M_i\rho\right]. \quad (10.8.4)$$
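A minimal sketch of (10.8.3) and (10.8.4), showing how a positive intracluster correlation $\rho$ inflates the variance relative to the uncorrelated case; the parameter values are hypothetical.

```python
# Sketch of (10.8.3)-(10.8.4) under model (10.8.2).

def var_cluster_estimate(M_i, m_i, sigma2, rho):
    return (M_i ** 2) * sigma2 * (1 - rho + m_i * rho) / m_i     # (10.8.3)

def cov_with_cluster_total(M_i, sigma2, rho):
    return M_i * sigma2 * (1 - rho + M_i * rho)                  # (10.8.4)

# with rho = 0 the variance reduces to the familiar M_i^2 sigma^2 / m_i
v0 = var_cluster_estimate(M_i=20, m_i=5, sigma2=2.0, rho=0.0)
v1 = var_cluster_estimate(M_i=20, m_i=5, sigma2=2.0, rho=0.1)
print(v0, v1)
```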

Then we have the following corollary:

Corollary 10.8.3. An estimator of the variance $V(\hat{Y})$ can be obtained as a linear function of the two parameters $\theta_1 = \sigma^2(1-\rho)$ and $\theta_2 = \rho\sigma^2$, which can be estimated by replacing these expressions by their unbiased estimates. Unfortunately, such estimates of variance are not robust.

Royall (1986) suggested the following steps to estimate the three different components of the variance
$$V(\hat{Y} - Y) = \frac{N}{f}B_1 - 2NB_2 + NB_3\,.$$
( a ) Estimate the third component under the model defined as
$$M_1: E(Y_{ij}) = \mu\,, \quad M_1: \mathrm{Cov}(Y_{ij}, Y_{kl}) = \begin{cases}\sigma^2 & \text{if } i = k,\ j = l,\\ \rho\sigma^2 & \text{if } i = k,\ j \ne l,\\ 0 & \text{if } i \ne k.\end{cases} \quad (10.8.5)$$
( b ) Estimate the second component under the model defined as
$$M_2: E(Y_{ij}) = \mu\,, \ i = 1,2,\ldots,N;\ j = 1,2,\ldots,M_i\,, \quad \text{and} \quad M_2: \mathrm{Cov}(Y_{ij}, Y_{kl}) = 0\,, \ i \ne k\,. \quad (10.8.6)$$
( c ) Estimate the first component of variance under the model defined as
$$M_3: E(Y_{ij}) = \mu\,, \quad M_3: \mathrm{Cov}(Y_{ij}, Y_{kl}) = \begin{cases}\sigma_i^2 & \text{if } i = k,\ j = l,\\ \rho_i\sigma_i^2 & \text{if } i = k,\ j \ne l,\\ 0 & \text{otherwise},\end{cases} \quad (10.8.7)$$
for $i = 1,2,\ldots,N$; $j = 1,2,\ldots,M_i$.

Then we have the following theorem:

Theorem 10.8.2. A robust estimator of the variance of the estimator of the population total in two-stage cluster sampling is given by
$$\hat{v}_{Robust} = \frac{N^2}{f^2}\sum_{i\in s} u_i\left(u_i - \frac{1}{f}\right)\hat{V}_i + \frac{N}{f}\sum_{i\in s} u_iM_i\left(1 - f_i\right)\hat{\theta}_1 + \left[\sum_{i=1}^{N} M_i^2 - \frac{N}{f}\sum_{i\in s} u_iM_i^2\right]\hat{B}_2\,, \quad (10.8.8)$$
where $\hat{B}_2 = \sum_{i\in s}\left(\hat{V}_i - M_is_i^2/f_i\right)\Big/\sum_{i\in s} M_i^2$, $f_i = n_i/N_i$, and $f = n/N$, etc., have their usual meanings. A general method for estimating the variance in multi-stage sampling has also been discussed by Srinath and Hidiroglou (1980).

The theory of two-stage sampling in which the FSUs selected on one occasion are partially replaced on subsequent occasions, while the same set of SSUs is observed within a retained FSU from one occasion to the next, was first developed by Jessen (1942), Yates (1960), Patterson (1950), Eckler (1955), and Tikkiwal (1953, 1958, 1965). Singh (1968) considered the theory of two-stage successive sampling and built linear unbiased estimators of the population mean on the second and third occasions separately, under certain restrictive assumptions, when the partial replacement of units is made among the FSUs only. Kathuria and Singh (1971a, 1971b) showed that under certain circumstances partial retention of SSUs may be better than partial retention of FSUs. Agarwal and Tikkiwal (1980) mentioned the following possibilities:
( a ) replacement among FSUs only;
( b ) replacement among SSUs only;
( c ) replacement among both FSUs and SSUs.
Surveys often have to be repeated on many occasions (over years, seasons, or months) for estimating the same characteristic at different points of time. The information collected on previous occasions can be used to study the change, or the total value, for the most recent occasion. An investigator, or the owner of a cold drinks business, may be interested in the following types of problems:
( a ) the average or total sale of cold drinks for the current season;
( b ) the change in average sale of cold drinks between two different seasons.
Note that sampling on successive occasions done according to a specific rule, with partial replacement of units, is called Rotation Sampling. Hansen, Hurwitz, and Madow (1953) and Rao and Graham (1964) considered the problems of rotation designs for successive sampling.

Let us discuss here a few successive sampling strategies suggested by various researchers.

To discuss Arnab's (1979a) successive sampling scheme, let us first define some notation. Let

$y_{ti}$ = value of the study variable for the $i$th unit of a finite population $\Omega$ of size $N$ at the $t$th occasion, for $i = 1,2,\ldots,N$ and $t = 1,2,\ldots,T$.

The following table will be helpful in understanding the estimators proposed by Arnab (1979a).

Table 10.9.1. Description of the sampling procedure on different occasions.

Further, let $Y_T = \sum_{i=1}^{N} y_{Ti}$ be the population total of the study variable $Y$ at time $T$, and let $P_i$ be the probability of selecting the $i$th unit on any occasion. Assume
$$\sum_{i=1}^{N} P_i\left(y_{ti}/P_i - Y_t\right)\left(y_{t'i}/P_i - Y_{t'}\right) = \delta^{|t-t'|}V_0$$
for every $t, t' = 1,2,\ldots,T$, where $\delta$ and $V_0 = \sum_{i=1}^{N} P_i\left(y_{ti}/P_i - Y_t\right)^2$ are known quantities.
Arnab's Strategy I. The main steps of Arnab's first strategy are:

( 1 ) On the first occasion, select a sample $s_{11}$ of size $m_{11}$ with PPSWR sampling.

( 2 ) On the second occasion:
( a ) select a sub-sample $s_{21}$ of $m_{21}$ units from the first sample $s_{11}$ of $m_{11}$ units by SRSWOR sampling;
( b ) select an independent sample $s_{22}$ of $m_{22}$ units, which in fact equals $(m_{11} - m_{21})$ units, by PPSWR sampling from $\Omega$.

( 3 ) The total sample size $m_{11}$ on each occasion is supposed to be fixed and determined from cost considerations.

Thus in general, on the $t$th occasion a sample $s_{tt'}$ of size $m_{tt'}$ is selected by SRSWOR sampling from $s_{(t-1)t'}$, independently for each $t' = 1,2,\ldots,(t-1)$. Finally, another sample of size $m_{tt} = m_{11} - \sum_{t'=1}^{t-1} m_{tt'}$ is selected by the PPSWR method from the whole population, for $t = 2,3,\ldots,T$.

Thus we have the following theorems:
Thus we have the following theorems :

Theorem 10.9.1. An unbiased estimator of the population total $Y_T$ on occasion $T$ is given by
$$\hat{Y}_T = \sum_{t=1}^{T} C_t\hat{Y}_T(t)\,, \quad (10.9.1)$$
where
$$\hat{Y}_T(t) = \begin{cases}\dfrac{1}{m_{Tt}}\sum\limits_{s_{Tt}}\dfrac{y_{Ti}}{P_i} - \beta_T(t)\left\{\dfrac{1}{m_{Tt}}\sum\limits_{s_{Tt}}\dfrac{y_{(T-1)i}}{P_i} - \hat{Y}_{T-1}(t)\right\}, & t = 1,2,\ldots,T-1,\\[3mm] \dfrac{1}{m_{TT}}\sum\limits_{s_{TT}}\dfrac{y_{Ti}}{P_i}\,, & t = T,\end{cases} \quad (10.9.2)$$
with $\beta_T(t)$ and $C_t$ being constants such that the estimator $\hat{Y}_T$ is unbiased and has minimum variance. It can be verified by the method of induction on different occasions that the minimum variance of the estimator $\hat{Y}_T$ is attained for $\beta_T(t) = \delta$ for all $t$.
Proof. For $t = 1$,
$$\hat{Y}_T(1) = \frac{1}{m_{T1}}\sum_{s_{T1}}\frac{y_{Ti}}{P_i} - \beta_T(1)\left\{\frac{1}{m_{T1}}\sum_{s_{T1}}\frac{y_{(T-1)i}}{P_i} - \hat{Y}_{T-1}(1)\right\}.$$
For $T = 1$,
$$\hat{Y}_1(1) = \frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i}$$
and
$$E(\hat{Y}_1) = E\left[\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i}\right] = \sum_{i=1}^{N} y_{1i} = Y_1\,.$$
Now for $T = 2$,
$$\hat{Y}_2(1) = \frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{2i}}{P_i} - \beta_2(1)\left\{\frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{1i}}{P_i} - \hat{Y}_1(1)\right\}.$$
Note that $s_{21}$ is selected from $s_{11}$ by SRSWOR sampling; therefore we have
$$E\{\hat{Y}_2(1)\} = E_1E_2\{\hat{Y}_2(1)\mid s_{11}\} = E_1E_2\left[\frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{2i}}{P_i} - \beta_2(1)\left\{\frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{1i}}{P_i} - \hat{Y}_1(1)\right\}\right],$$
where $E_1$ and $E_2$ have their usual meanings. Thus we have
$$E\{\hat{Y}_2(1)\} = E_1\left[\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{2i}}{P_i} - \beta_2(1)\left\{\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i} - \hat{Y}_1(1)\right\}\right] = Y_2 - \beta_2(1)\left\{Y_1 - Y_1\right\} = Y_2\,.$$
Thus, by induction, we may assume
$$E\left(\hat{Y}_{T-1}(t)\right) = Y_{T-1}\,.$$
So we have
$$E\left(\hat{Y}_T(t)\right) = E_1E_2\left\{\hat{Y}_T(t)\mid s_{T-1,t}\right\} = E_1\left[\frac{1}{m_{T-1,t}}\sum_{s_{T-1,t}}\frac{y_{Ti}}{P_i} - \beta_T(t)\left\{\frac{1}{m_{T-1,t}}\sum_{s_{T-1,t}}\frac{y_{(T-1)i}}{P_i} - \hat{Y}_{T-1}(t)\right\}\right].$$
Note that
$$E\left(\frac{1}{m_{T-1,t}}\sum_{s_{T-1,t}}\frac{y_{Ti}}{P_i}\right) = E\left(\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{Ti}}{P_i}\right) = Y_T\,.$$
Thus we have
$$E(\hat{Y}_T) = \sum_{t=1}^{T} C_tY_T = Y_T$$
whenever $\sum_{t=1}^{T} C_t = 1$.
Hence the theorem.
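For $T = 2$, the matched and unmatched components of (10.9.1)-(10.9.2) can be sketched as follows. This is illustrative only: the data, the weights $C_1 = 0.6$ and $C_2 = 0.4$ (chosen arbitrarily with $C_1 + C_2 = 1$), and $\delta = 0.8$ are hypothetical.

```python
# Sketch (T = 2): the matched-part estimator Y_hat_2(1) of (10.9.2) with
# beta_2(1) = delta, combined with the unmatched part Y_hat_2(2).

def matched_estimate(y2, y1, P, Y1_hat_prev, delta):
    """y2, y1, P: current/previous values and draw probabilities on s_21."""
    m = len(y2)
    a = sum(v / p for v, p in zip(y2, P)) / m
    b = sum(v / p for v, p in zip(y1, P)) / m
    return a - delta * (b - Y1_hat_prev)

def unmatched_estimate(y2, P):
    m = len(y2)
    return sum(v / p for v, p in zip(y2, P)) / m

est1 = matched_estimate([8.0, 6.0], [7.0, 5.0], [0.1, 0.1], 58.0, delta=0.8)
est2 = unmatched_estimate([5.5, 6.5], [0.1, 0.1])
combined = 0.6 * est1 + 0.4 * est2   # C_1 = 0.6, C_2 = 0.4, C_1 + C_2 = 1
print(combined)
```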
Theorem 10.9.2. The variance $V[\hat{Y}_T(t)]$, with $\beta_T(t) = \delta$ for every $t = 1,2,\ldots,(T-1)$, is given by
$$V[\hat{Y}_T(t)] = \begin{cases}\left[\dfrac{1}{m_{T1}} - \sum\limits_{t'=1}^{T-1}\left(\dfrac{1}{m_{T+1-t',1}} - \dfrac{1}{m_{T-t',1}}\right)\delta^{2t'}\right]V_0\,, & t = 1,2,\ldots,(T-1),\\[3mm] \dfrac{V_0}{m_{TT}}\,, & t = T.\end{cases} \quad (10.9.3)$$

Proof. Consider
$$\hat{Y}_T(t) = \frac{1}{m_{Tt}}\sum_{s_{Tt}}\frac{y_{Ti}}{P_i} - \beta_T(t)\left\{\frac{1}{m_{Tt}}\sum_{s_{Tt}}\frac{y_{(T-1)i}}{P_i} - \hat{Y}_{T-1}(t)\right\}.$$
For $T = 1$ we have
$$V(\hat{Y}_1(1)) = V\left(\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i}\right) = \frac{1}{m_{11}}\left(\sum_{i=1}^{N}\frac{y_{1i}^2}{P_i} - Y_1^2\right) = \frac{V_0}{m_{11}}\,,$$
which proves the theorem for $T = 1$. Now let us consider
$$\hat{Y}_2(1) = \frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{2i}}{P_i} - \beta_2(1)\left\{\frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{1i}}{P_i} - \hat{Y}_1(1)\right\}.$$
Then
$$V(\hat{Y}_2(1)) = E_1V_2\left[\hat{Y}_2(1)\mid s_{11}\right] + V_1E_2\left[\hat{Y}_2(1)\mid s_{11}\right]$$
$$= E_1V_2\left[\frac{1}{m_{21}}\sum_{s_{21}}\frac{y_{2i} - \beta_2(1)y_{1i}}{P_i}\right] + V_1\left[\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{2i}}{P_i} - \beta_2(1)\left\{\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i} - \hat{Y}_1(1)\right\}\right].$$
Note that
$$\hat{Y}_1(1) = \frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{1i}}{P_i}\,;$$
therefore we have
$$V(\hat{Y}_2(1)) = E_1\left[\left(\frac{1}{m_{21}} - \frac{1}{m_{11}}\right)\frac{1}{m_{11}-1}\sum_{s_{11}}\left(\frac{y_{2i} - \beta_2(1)y_{1i}}{P_i} - \frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{2i} - \beta_2(1)y_{1i}}{P_i}\right)^2\right] + V_1\left[\frac{1}{m_{11}}\sum_{s_{11}}\frac{y_{2i}}{P_i}\right]$$
$$= \left(\frac{1}{m_{21}} - \frac{1}{m_{11}}\right)\left[\sum_{i=1}^{N}\frac{\left(y_{2i} - \beta_2(1)y_{1i}\right)^2}{P_i} - \left(Y_2 - \beta_2(1)Y_1\right)^2\right] + \frac{V_0}{m_{11}}$$
$$= \left(\frac{1}{m_{21}} - \frac{1}{m_{11}}\right)\left[V_0 + \beta_2^2(1)V_0 - 2\beta_2(1)\delta V_0\right] + \frac{V_0}{m_{11}}\,.$$
Now setting
$$\frac{\partial V(\hat{Y}_2(1))}{\partial\beta_2(1)} = 0$$
we have

$$\beta_2(1)=\delta,$$
which leads to the following relation:
$$V(\hat{Y}_2(1))\big|_{\beta_2(1)=\delta}=\left(\frac{1}{m_{21}}-\frac{1}{m_{11}}\right)V_0\left(1+\delta^2-2\delta^2\right)+\frac{V_0}{m_{11}}=\left(\frac{1}{m_{21}}-\frac{1}{m_{11}}\right)V_0\left(1-\delta^2\right)+\frac{V_0}{m_{11}}=\left[\frac{1}{m_{21}}-\delta^2\left(\frac{1}{m_{21}}-\frac{1}{m_{11}}\right)\right]V_0.$$

Let us suppose that
$$\beta_j(t)=\delta\quad\text{for } j=1,2,\ldots,(T-1)$$
and
$$V[\hat{Y}_j(t)]=\left[\frac{1}{m_{jt}}-\sum_{t'=1}^{j-1}\left(\frac{1}{m_{j+1-t',t}}-\frac{1}{m_{j-t',t}}\right)\delta^{2t'}\right]V_0\quad\text{for } j=1,2,\ldots,(T-1).$$
Now
$$V(\hat{Y}_T(1))=V\left[\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}-\beta_T(1)\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}\right]$$
$$=V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\right)+\beta_T^2(1)\,V\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}-2\beta_T(1)\operatorname{Cov}\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i},\ \frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}.\qquad(10.9.4)$$
To complete the above theorem we need the following results:

Result 10.9.1. $V\left(\dfrac{1}{m_{T1}}\sum\limits_{i\in s_{T1}}\dfrac{y_{Ti}}{p_i}\right)=\dfrac{V_0}{m_{T1}}$.

Proof. We have
$$V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\right)=E_1\left\{V_2\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\ \Big|\ s_{11}\right)\right\}+V_1\left\{E_2\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\ \Big|\ s_{11}\right)\right\}.$$
Now noting that $s_{T1}$ may be treated as an SRSWOR sample from $s_{11}$, we have
$$V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\right)=\left(\frac{1}{m_{T1}}-\frac{1}{m_{11}}\right)V_0+\frac{V_0}{m_{11}}=\frac{V_0}{m_{T1}}.$$

Result 10.9.2. $\operatorname{Cov}\left(\dfrac{1}{m_{T1}}\sum\limits_{i\in s_{T1}}\dfrac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\right)=V(\hat{Y}_{T-1}(1))$ for $T=2,3,\ldots$.

Proof. For $T=2$,
$$\operatorname{Cov}\left(\frac{1}{m_{21}}\sum_{i\in s_{21}}\frac{y_{1i}}{p_i},\ \hat{Y}_1(1)\right)=\operatorname{Cov}\left\{\frac{1}{m_{21}}\sum_{i\in s_{21}}\frac{y_{1i}}{p_i},\ \frac{1}{m_{11}}\sum_{i\in s_{11}}\frac{y_{1i}}{p_i}\right\}$$
$$=E_1\left[C_2\left\{\frac{1}{m_{21}}\sum_{i\in s_{21}}\frac{y_{1i}}{p_i},\ \frac{1}{m_{11}}\sum_{i\in s_{11}}\frac{y_{1i}}{p_i}\ \Big|\ s_{11}\right\}\right]+C_1\left[E_2\left(\frac{1}{m_{21}}\sum_{i\in s_{21}}\frac{y_{1i}}{p_i}\ \Big|\ s_{11}\right),\ E_2\left(\frac{1}{m_{11}}\sum_{i\in s_{11}}\frac{y_{1i}}{p_i}\ \Big|\ s_{11}\right)\right]$$
$$=0+C_1\left[\frac{1}{m_{11}}\sum_{i\in s_{11}}\frac{y_{1i}}{p_i},\ \frac{1}{m_{11}}\sum_{i\in s_{11}}\frac{y_{1i}}{p_i}\right]=V(\hat{Y}_1(1)).$$
Suppose
$$\operatorname{Cov}\left(\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-2,i}}{p_i},\ \hat{Y}_{T-2}(1)\right)=V(\hat{Y}_{T-2}(1)).$$
Then we have
$$\operatorname{Cov}\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\right\}$$
$$=C_1\left[E_2\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}\ \Big|\ s_{T-1,1}\right),\ E_2\left(\hat{Y}_{T-1}(1)\ \Big|\ s_{T-1,1}\right)\right]+E_1\left[C_2\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\ \Big|\ s_{T-1,1}\right)\right]$$
$$=C_1\left[\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i},\ \frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i}-\frac{\delta}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-2,i}}{p_i}+\delta\,\hat{Y}_{T-2}(1)\right]+0$$
$$=V\left[\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i}\right]-\delta^2\,V\left[\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i}\right]+\delta\operatorname{Cov}\left(\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-2,i}}{p_i},\ \hat{Y}_{T-2}(1)\right),$$
noting that
$$\operatorname{Cov}\left(\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i},\ \frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-2,i}}{p_i}\right)=\delta\,V\left(\frac{1}{m_{T-1,1}}\sum_{i\in s_{T-1,1}}\frac{y_{T-1,i}}{p_i}\right),$$
and the result follows on applying the induction hypothesis.

Result 10.9.3. $V\left\{\dfrac{1}{m_{T1}}\sum\limits_{i\in s_{T1}}\dfrac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}=\dfrac{V_0}{m_{T1}}-V\{\hat{Y}_{T-1}(1)\}$.

Proof. Note that
$$\operatorname{Cov}\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\right)=V(\hat{Y}_{T-1}(1)).$$
We have
$$V\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}=V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}\right)+V(\hat{Y}_{T-1}(1))-2\operatorname{Cov}\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\right)$$
$$=V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}\right)+V(\hat{Y}_{T-1}(1))-2V(\hat{Y}_{T-1}(1))=\frac{V_0}{m_{T1}}-V\{\hat{Y}_{T-1}(1)\}.$$

Result 10.9.4. $\operatorname{Cov}\left\{\dfrac{1}{m_{T1}}\sum\limits_{i\in s_{T1}}\dfrac{y_{Ti}}{p_i},\ \dfrac{1}{m_{T1}}\sum\limits_{i\in s_{T1}}\dfrac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}=\delta\left[\dfrac{V_0}{m_{T1}}-V\{\hat{Y}_{T-1}(1)\}\right]$.

Proof. We have
$$\operatorname{Cov}\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i},\ \frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}-\hat{Y}_{T-1}(1)\right\}$$
$$=\operatorname{Cov}\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i},\ \frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i}\right)-\operatorname{Cov}\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i},\ \hat{Y}_{T-1}(1)\right)$$
$$=\delta\,V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\right)-\delta\operatorname{Cov}\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-1,i}}{p_i},\ \hat{Y}_{T-1}(1)\right).$$
Noting that
$$\operatorname{Cov}\left\{\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i},\ \frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{T-k,i}}{p_i}\right\}=\delta^{k}\,V\left(\frac{1}{m_{T1}}\sum_{i\in s_{T1}}\frac{y_{Ti}}{p_i}\right)$$
and using Result 10.9.2, we have Result 10.9.4.
Now using these results we note that
$$V(\hat{Y}_T(1))=\frac{V_0}{m_{T1}}+\beta_T^2(1)\left[\frac{V_0}{m_{T1}}-V(\hat{Y}_{T-1}(1))\right]-2\beta_T(1)\,\delta\left[\frac{V_0}{m_{T1}}-V(\hat{Y}_{T-1}(1))\right].$$
On differentiating it with respect to $\beta_T(1)$ and equating to zero we have
$$\frac{\partial V(\hat{Y}_T(1))}{\partial\beta_T(1)}=0,$$
which implies $\beta_T(1)=\delta$ and
$$V(\hat{Y}_T(1))\big|_{\beta_T(1)=\delta}=\frac{V_0}{m_{T1}}-\delta^2\left[\frac{V_0}{m_{T1}}-V(\hat{Y}_{T-1}(1))\right]=\left[\frac{1}{m_{T1}}-\sum_{t'=1}^{T-1}\left(\frac{1}{m_{T+1-t',1}}-\frac{1}{m_{T-t',1}}\right)\delta^{2t'}\right]V_0.$$
Now we have
$$\hat{Y}_T=\sum_{t=1}^{T}C_t\,\hat{Y}_T(t).$$
We note that the $\hat{Y}_T(t)$ are independent due to the sampling scheme shown in Table 10.9.1, and $E\{\hat{Y}_T(t)\}=Y_T$ for every $t=1,2,\ldots,T$; thus the optimum value of $C_t\propto\{V(\hat{Y}_T(t))\}^{-1}$ and
$$V_{opt}(\hat{Y}_T)=\left[\sum_{t=1}^{T}\frac{1}{V(\hat{Y}_T(t))}\right]^{-1}=\left[\sum_{t=1}^{T}\left\{\frac{1}{m_{Tt}}-\sum_{t'=1}^{T-1}\left(\frac{1}{m_{T+1-t',t}}-\frac{1}{m_{T-t',t}}\right)\delta^{2t'}\right\}^{-1}\right]^{-1}V_0.$$
Now writing $m_{Tt}=\lambda_{Tt}\,m_{11}$ we have
$$V_{opt}(\hat{Y}_T)=\frac{V_0}{m_{11}}\left[\sum_{t=1}^{T}\frac{1}{\phi(t)}\right]^{-1},$$
where
$$\phi(t)=\frac{1}{\lambda_{Tt}}-\sum_{t'=1}^{T-1}\left(\frac{1}{\lambda_{T+1-t',t}}-\frac{1}{\lambda_{T-t',t}}\right)\delta^{2t'}\quad\text{for } t=1,2,\ldots,(T-1)$$
and
$$\phi(T)=\frac{1}{\lambda_{TT}}\quad\text{for } t=T.$$
Now equating
$$\frac{\partial V_{opt}(\hat{Y}_T)}{\partial\lambda_{Tt}}=0$$
for $t=1,2,\ldots,T$ and solving the resulting equations, we get the optimum values of the $\lambda_{Tt}$.
Hence the theorem.
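For the two-occasion case derived above, the recursion is easy to evaluate numerically. The sketch below (Python; the values of $m_{11}$, $m_{21}$, $m_{22}$, $\delta$, and $V_0$ are illustrative, not taken from the text) computes $V(\hat{Y}_2(1))$ with $\beta_2(1)=\delta$, $V(\hat{Y}_2(2))=V_0/m_{22}$, and the optimally combined variance with $C_t\propto\{V(\hat{Y}_2(t))\}^{-1}$:

```python
# Sketch: optimal combination of the matched and unmatched estimators on
# the second occasion, following the derivation above. All numeric inputs
# are illustrative, not from the text.
def var_matched(m11, m21, delta, V0):
    # V(Y2(1)) with beta_2(1) = delta: [1/m21 - delta^2 (1/m21 - 1/m11)] V0
    return (1.0 / m21 - delta**2 * (1.0 / m21 - 1.0 / m11)) * V0

def var_unmatched(m22, V0):
    # V(Y2(2)) = V0 / m22 (independent fresh sample)
    return V0 / m22

def optimal_combination(m11, m21, m22, delta, V0):
    v1 = var_matched(m11, m21, delta, V0)
    v2 = var_unmatched(m22, V0)
    # C_t proportional to 1/V(Y2(t)), so V_opt = (1/v1 + 1/v2)^(-1)
    c1 = (1.0 / v1) / (1.0 / v1 + 1.0 / v2)
    v_opt = 1.0 / (1.0 / v1 + 1.0 / v2)
    return c1, v_opt

c1, v_opt = optimal_combination(m11=100, m21=40, m22=60, delta=0.8, V0=1.0)
```

As expected, the combined variance is smaller than that of either elementary estimator alone, and a higher correlation $\delta$ shifts more weight onto the matched portion.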

Following the proofs of the above theorems, the results given in the following theorems can easily be proved; the proofs are omitted to save space.

Theorem 10.9.3. If $\beta_T(t)=\delta$, $\forall\,t=1,2,\ldots,(T-1)$, then the optimum value of $V(\hat{Y}_T)$ with variation of the values of $C_t$, $t=1,2,\ldots,T$, is
$$V_{opt}(\hat{Y}_T)=\frac{V_0}{m_{11}}\left[\sum_{t=1}^{T}\frac{1}{\phi_T(t)}\right]^{-1},\qquad(10.9.5)$$
where $\lambda_{Tt}=\dfrac{m_{Tt}}{m_{11}}$ and
$$\phi_T(t)=\begin{cases}\dfrac{1}{\lambda_{Tt}}-\sum\limits_{t'=1}^{T-1}\left(\dfrac{1}{\lambda_{T+1-t',t}}-\dfrac{1}{\lambda_{T-t',t}}\right)\delta^{2t'}, & t=1,2,\ldots,(T-1),\\[2mm]\dfrac{1}{\lambda_{TT}}, & t=T.\end{cases}$$
Theorem 10.9.4. The optimum values of $\lambda_{Tt}$ (and hence $m_{Tt}$) are given by
$$\lambda_{Tt(opt)}=\begin{cases}\dfrac{\sqrt{1-\delta^{2}}\;g_{(t-1)opt}}{1+\left(1-\sqrt{1-\delta^{2}}\right)g_{(t-1)opt}}, & t=1,2,\ldots,(T-1),\\[3mm]\dfrac{1}{\left(1+\sqrt{1-\delta^{2}}\right)g_{(T-1)opt}}, & t=T,\end{cases}\qquad(10.9.6)$$
where $g_{(1)opt}=1$ and
$$g_{(h)opt}=\frac{1}{\left(1-\sqrt{1-\delta^{2}}\right)+\left(1+\sqrt{1-\delta^{2}}\right)g_{(h-1)opt}}.$$

Theorem 10.9.5. The minimum variance of the estimator of the population total on the $T$th occasion is given by
$$V_{Min}(\hat{Y}_T)=\frac{V_0}{m_{11}}\,g_{(T)opt}.\qquad(10.9.7)$$

The optimum values of $C_t$ are given by
$$C_{t(opt)}=\frac{\{V(\hat{Y}_T(t))\}^{-1}}{\sum_{t=1}^{T}\{V(\hat{Y}_T(t))\}^{-1}}.\qquad(10.9.8)$$

The optimum proportion of matched units on the $T$th occasion is given by
$$\lambda_{T(opt)}=\frac{1}{\left(1+\sqrt{1-\delta^{2}}\right)g_{(T-1)opt}}.\qquad(10.9.9)$$

It can easily be shown that the strategies proposed by Tripathi and Srivastava (1979), Ghangurde and Rao (1969), Chotai (1974), and Chaudhuri and Arnab (1977) are topics related to Arnab's first strategy in successive sampling.

Arnab's Strategy II. The main steps of Arnab's second strategy are:

( 1 ) On the first occasion select a sample $s_{11}$ of size $m_{11}$ with the RHC scheme with size measures $p_i$.

( 2 ) On the second occasion:

( a ) Select a sub-sample $s_{21}$ of $m_{21}$ units from the first sample $s_{11}$ of $m_{11}$ units by the RHC scheme using size measures ${}_1Q_i(2)$, where ${}_1Q_i(2)$ is the sum of the $p_i$ values of the group containing the $i$th unit formed in selecting $s_{11}$.
( b ) Select an independent sample $s_{22}$ of $m_{22}$ units, which in fact equals $(m_{11}-m_{21})$ units, by the RHC scheme from the whole population $\Omega$ with $p_i$ size measures.
( 3 ) In general, on the $t$th occasion a sample $s_{tt'}$ of size $m_{tt'}$ is selected from $s_{(t-1)t'}$ independently for $t'=1,2,\ldots,(t-1)$ following the RHC scheme using the normed size measures ${}_{t'}Q_i(t)$, which are in fact the sums of the ${}_{t'}Q_i(t-1)$ values of the group containing the $i$th unit occurring in $s_{(t-1)t'}$ for $t'=1,2,\ldots,(t-1)$; another independent sample $s_{tt}$ is selected from the entire population using $p_i$ as normed size measures.
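The Rao–Hartley–Cochran (RHC) selection used in these steps can be sketched in code. The helper below is an illustrative implementation (near-equal random group sizes, a common choice), not the book's own algorithm; it also returns the group totals that play the role of the $Q_i$ measures above:

```python
import random

def rhc_select(units, p, n_groups, rng=random):
    """Rao-Hartley-Cochran selection sketch: split the population at random
    into n_groups groups, then draw one unit per group with probability
    proportional to p_i within the group. Returns (unit, Q_g) pairs, where
    Q_g is the sum of the p_i values of the unit's group."""
    order = list(range(len(units)))
    rng.shuffle(order)                      # random grouping
    base, extra = divmod(len(units), n_groups)
    sample, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        group = order[start:start + size]
        start += size
        Qg = sum(p[i] for i in group)
        # draw one unit from the group with probability p_i / Q_g
        r, cum = rng.random() * Qg, 0.0
        for i in group:
            cum += p[i]
            if r <= cum:
                sample.append((units[i], Qg))
                break
    return sample
```

Because the groups partition the population, the group totals $Q_g$ of a draw always sum to the total of the normed size measures.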

Then we have the following theorems:

Theorem 10.9.6. An estimator of the population total $Y_T$ on the $T$th occasion is
$$\hat{Y}_T^{*}=\sum_{t=1}^{T}C_t^{*}\,\hat{Y}_T^{*}(t)\qquad(10.9.10)$$
with
$$\hat{Y}_T^{*}(t)=\begin{cases}\sum\limits_{i\in s_{Tt}}\dfrac{y_{Ti}\,\{{}_tQ_i(T+1)\}}{p_i}-\beta_T(t)\left[\sum\limits_{i\in s_{Tt}}\dfrac{y_{T-1,i}\,\{{}_tQ_i(T+1)\}}{p_i}-\hat{Y}_{T-1}^{*}(t)\right], & t=1,2,\ldots,(T-1),\\[3mm]\sum\limits_{i\in s_{TT}}\dfrac{y_{Ti}\,\{{}_TQ_i(T+1)\}}{p_i}, & t=T,\end{cases}$$
where $C_t^{*}$ and $\beta_T(t)$ are constants chosen such that the estimator $\hat{Y}_T^{*}$ is unbiased and has minimum variance.

Theorem 10.9.7. The variance of the estimator $\hat{Y}_T^{*}(t)$ is minimum if $\beta_T(t)=\delta$, and the minimum variance of the estimator $\hat{Y}_T^{*}$ is given by
$$V_{min}(\hat{Y}_T^{*})=\frac{NV_0}{N-1}\left[\frac{1}{m_{11}\lambda_{TT(opt)}}-\frac{1}{N}\right]\left[1-\sqrt{1-\delta^{2}}\left(1-\frac{1}{\lambda_{TT(opt)}}\right)\right]^{-1}\qquad(10.9.11)$$
for the optimum values of $C_t^{*}$ given by
$$C_{t(opt)}^{*}=\begin{cases}\dfrac{1}{\phi_T(t)_{opt}}\left[\dfrac{1}{m_{11}\lambda_{TT(opt)}}-\dfrac{1}{N}\right]\left[1-\sqrt{1-\delta^{2}}\left(1-\dfrac{1}{\lambda_{TT(opt)}}\right)\right]^{-1}, & t=1,2,\ldots,(T-1),\\[3mm]\left[1-\sqrt{1-\delta^{2}}\left(1-\dfrac{1}{\lambda_{TT(opt)}}\right)\right]^{-1}, & t=T,\end{cases}$$
and
$$\lambda_{TT(opt)}=\frac{1}{m_{11}\left(1-\sqrt{1-\delta^{2}}\,B(T-1)\right)}\left[m_{11}-N\sqrt{1-\delta^{2}}\left\{1-\sqrt{1-\delta^{2}}\right\}B(T-1)\right]$$
with
$$B(T-1)=\sum_{t=1}^{T-1}\left[2N\phi_{T-1}(t)-\left(1-\delta^{2}\right)\right]^{-1}.$$

Arnab's Strategy III. The main steps of Arnab's third strategy are:

( 1 ) On the first occasion, select a sample $s_{11}$ of size $m_{11}$ with the RHC scheme with size measures $p_i$.

( 2 ) On the second occasion:

( a ) Select a sub-sample $s_{21}$ of $m_{21}$ units from the first sample $s_{11}$ of $m_{11}$ units by SRSWOR sampling;
( b ) Select an independent sample $s_{22}$ of $m_{22}$ units, which in fact equals $(m_{11}-m_{21})$ units, by the RHC scheme from the whole population $\Omega$ with $p_i$ size measures.

( 3 ) In general, on the $t$th occasion sub-samples $s_{tt'}$ of size $m_{tt'}$ are selected by SRSWOR sampling from $s_{(t-1)t'}$, $(t'=1,2,\ldots,(t-1))$, and an independent sample of size $m_{tt}\left(=m_{11}-\sum_{t'=1}^{t-1}m_{tt'}\right)$ is selected from the entire population by the RHC method using $p_i$.

Under such a sampling scheme we have the following theorem:

Theorem 10.9.8. An estimator to estimate the population total $Y_T$, on occasion $T$, is given by
$$\hat{Y}_T^{**}=\sum_{t=1}^{T}C_t^{**}\,\hat{Y}_T^{**}(t)\qquad(10.9.12)$$
with
$$\hat{Y}_T^{**}(t)=\begin{cases}\dfrac{m_{11}}{m_{Tt}}\sum\limits_{i\in s_{Tt}}\dfrac{y_{Ti}\,\{{}_tQ_i(t+1)\}}{p_i}-\beta_T(t)\left\{\dfrac{m_{11}}{m_{Tt}}\sum\limits_{i\in s_{Tt}}\dfrac{y_{T-1,i}\,\{{}_tQ_i(t+1)\}}{p_i}-\hat{Y}_{T-1}^{**}(t)\right\}, & t=1,2,\ldots,(T-1),\\[3mm]\dfrac{m_{11}}{m_{TT}}\sum\limits_{i\in s_{TT}}\dfrac{y_{Ti}\,\{{}_TQ_i(t+1)\}}{p_i}, & t=T.\end{cases}$$

Arnab (1979a) has shown that Strategy II always remains superior to Strategy I from the minimum variance point of view. Comparing Strategies II and III, the efficiency comparison depends upon the values of the parameters involved in the variance expressions. Sekkappan and Thompson (1994) have given thought to multi-phase and successive sampling for a stratified population with unknown stratum sizes. Arnab (1998) has suggested two new sampling strategies for estimating the most recent occasion total on the basis of two samples selected with varying probability at two different occasions. Empirical results show that the Arnab (1998) strategy remains more efficient than the one described by Prasad and Graham (1994).

Example 10.9.1. The data in population 9 of the Appendix relate to the number of immigrants coming to 51 states in the United States during 1994–1996. Select a sample ($s_1$) of 10 states by the PPSWR method using the number of immigrants in 1994 ($z$) as the measure of size. From the selected sample $s_1$, select a sub-sample $s_m$ of size $m=4$ by the SRSWOR method, assuming all elements of $s_1$ are distinct. Finally select an independent sample $s_u$ of size $u=(10-m)=6$ with PPSWR using $x$ as the size measure. Estimate the total number of immigrants in 1996 using the composite estimator.
Solution. We want to estimate the total number of immigrants in 1996 ($Y$) using the composite estimator
$$\hat{Y}=\hat{\phi}\,\hat{Y}_m+(1-\hat{\phi})\hat{Y}_u,$$
where an estimate of $Y$ based on the matched sample is given by
$$\hat{Y}_m=\frac{1}{m}\sum_{i\in s_m}\frac{y_i}{p_i}-\hat{g}\left[\frac{1}{m}\sum_{i\in s_m}\frac{x_i}{p_i}-\frac{1}{n}\sum_{i\in s_1}\frac{x_i}{p_i}\right]$$
with
$$\hat{g}=\frac{\sum\limits_{s_m}\left(\dfrac{y_i}{p_i}-\dfrac{1}{m}\sum\limits_{s_m}\dfrac{y_i}{p_i}\right)\left(\dfrac{x_i}{p_i}-\dfrac{1}{m}\sum\limits_{s_m}\dfrac{x_i}{p_i}\right)}{\sqrt{\sum\limits_{s_m}\left(\dfrac{y_i}{p_i}-\dfrac{1}{m}\sum\limits_{s_m}\dfrac{y_i}{p_i}\right)^2\ \sum\limits_{s_m}\left(\dfrac{x_i}{p_i}-\dfrac{1}{m}\sum\limits_{s_m}\dfrac{x_i}{p_i}\right)^2}}$$
and an estimate of $Y$ based on unmatched units is given by
$$\hat{Y}_u=\frac{1}{u}\sum_{i\in s_u}\frac{y_i}{p_i}.$$
( a ) Selection of sample $s_1$ and unmatched sample $s_u$: We use the cumulative total method to select a PPSWR sample of $n=10$ units. We used the first six columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 10 random numbers $R_1$ (say) between 1 and 789767. The second independent random sample (called the unmatched sample $s_u$) of six states from the population was selected by using columns 7th to 12th of the Pseudo-Random Numbers, $R_2$ (say). The data on the number of immigrants during 1994, the cumulative totals CT(z), and the states selected are shown in the following table.

1 AL 1837 1937
2 AK 1129 3066
3 AZ 9141 12207
4 AR 1031 13238 014737, AR,
049819 AR
5 CA 208498 221736
6 CO 6825 228561
7 CT 9537 238098 236263 CT
8 DE 984 239082
9 DC 3204 242286
10 FL 58093 300379
11 GA 10032 310411
12 HI 7746 318157
13 ID 1559 319716 339922 ID
14 IL 42400 362116
Continued ... .. .

15 IN 3725 365841
16 IA 2163 368004
17 KS 2902 370906
18 KY 2036 372942
19 LA 3366 376308
20 ME 829 377137
21 MD 15937 393074
22 MA 22882 415956
23 MI 12728 428684
24 MN 7098 435782
25 MS 815 436597
26 MO 4362 440959
27 MT 447 441406
28 NE 1595 443001
29 NV 4051 447052
30 NH 1144 448196 465225 NH
31 NJ 44083 492279
32 NM 2936 495215 588183, NM, 622048, NM,
601448, NM, 534272, NM,
549171, NM, 513080 NM
626895 NM
33 NY 144354 639569 644818 NY
34 NC 6204 645773
35 ND 635 646408
36 OH 9184 655592
37 OK 2728 658320
38 OR 6784 665104 675048 OR
39 PA 15971 681075
40 RI 2907 683982
41 SC 2110 686092
42 SD 570 686662
43 TN 3608 690270 697856 TN
44 TX 56158 746428
45 UT 2951 749379
46 VT 658 750037
47 VA 15342 765379 771280 VA
48 WA 18180 783559
49 WV 663 784222
50 WI 5328 789550
51 WY 217 789767
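The table lookup above is the cumulative-total method: each random number $R$ between 1 and $Z$ selects the first state whose cumulative total CT(z) reaches $R$. A minimal sketch (using Python's `bisect`; the toy values in the usage note are illustrative):

```python
from bisect import bisect_left
from itertools import accumulate

def ppswr_cumulative(z, random_numbers):
    """Cumulative-total method for PPSWR: for each random number R in
    1..sum(z), return the 0-based index of the unit whose cumulative
    total of z first reaches R."""
    ct = list(accumulate(z))          # cumulative totals CT(z)
    return [bisect_left(ct, R) for R in random_numbers]

# e.g. in the table above, R = 236263 first reaches the cumulative total of
# the 7th state (CT, with CT(z) = 238098), so Connecticut is selected.
```

With size measures $z=(2,3,5)$, for instance, the selection intervals are 1–2, 3–5, and 6–10 for units 0, 1, and 2 respectively.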

Thus the sample $s_1$ of $n=10$ units is given by

Sr. No. | State | $z_i$ (1994) | $x_i$ (1995) | $p_i=z_i/Z$ | $x_i/p_i$
1 | AR | 1031 | 934 | 0.001305 | 715463.0242
2 | AR | 1031 | 934 | 0.001305 | 715463.0242
3 | CT | 9537 | 9240 | 0.012076 | 765172.1799
4 | NH | 1144 | 1186 | 0.001449 | 818761.9423
5 | NM | 2936 | 2758 | 0.003718 | 741886.0307
6 | NM | 2936 | 2758 | 0.003718 | 741886.0307
7 | NM | 2936 | 2758 | 0.003718 | 741886.0307
8 | NM | 2936 | 2758 | 0.003718 | 741886.0307
9 | NY | 144354 | 128406 | 0.182780 | 702514.7997
10 | TN | 3608 | 3392 | 0.004568 | 742486.0488
Sum | | | | | 7427405.14

Given: $Z=789767$,

and the unmatched sample $s_u$ of $u=6$ units is given by

Sr. No. | State | $z_i$ (1994) | $x_i$ (1995) | $y_i$ (1996)
1 | ID | 1559 | 1612 | 1825
2 | NM | 2936 | 2758 | 5780
3 | NM | 2936 | 2758 | 5780
4 | NM | 2936 | 2758 | 5780
5 | OR | 6784 | 4923 | 7554
6 | VA | 15342 | 16319 | 21375

( b ) Selection of a matched sample Sm

We used the first two columns of the Pseudo-Random Numbers to select a matched
sample Sm of m = 4 units from the given sample Sl of n = 10 units by using
SRSWOR sampling. The sample Sm selected is given by

Sr. No. | No. in $s_1$ | State | $z_i$ (1994) | $x_i$ (1995) | $y_i$ (1996)
1 | 01 | AR | 1031 | 934 | 1494
2 | 03 | CT | 9537 | 9240 | 10874
3 | 04 | NH | 1144 | 1186 | 1512
4 | 05 | NM | 2936 | 2758 | 5780

( c ) Calculation of the estimates:

From the matched sample information $s_m$ we have

State | $x_i/p_i$ | $y_i/p_i$ | $\left(\frac{x_i}{p_i}-\frac{1}{m}\sum_{s_m}\frac{x_i}{p_i}\right)^2$ | $\left(\frac{y_i}{p_i}-\frac{1}{m}\sum_{s_m}\frac{y_i}{p_i}\right)^2$ | cross product
AR | 715463.0 | 1144434.0 | 2012219533.00 | 270489496.7 | 737756225.7
CT | 765172.2 | 900485.1 | 23535942.68 | 67806024636.0 | -1263280930.0
NH | 818761.9 | 1043818.0 | 3415367782.00 | 13703762934.0 | -6841300346.0
NM | 741886.0 | 1554787.0 | 339840510.10 | 1.55162E+11 | -7261555471.0
Sum | 3041283.0 | 4643524.0 | 5790963768.00 | 2.36942E+11 | -14628380521.0

Thus we have
$$\hat{g}=\frac{\sum\limits_{s_m}\left(\frac{y_i}{p_i}-\frac{1}{m}\sum\limits_{s_m}\frac{y_i}{p_i}\right)\left(\frac{x_i}{p_i}-\frac{1}{m}\sum\limits_{s_m}\frac{x_i}{p_i}\right)}{\sqrt{\sum\limits_{s_m}\left(\frac{y_i}{p_i}-\frac{1}{m}\sum\limits_{s_m}\frac{y_i}{p_i}\right)^2\ \sum\limits_{s_m}\left(\frac{x_i}{p_i}-\frac{1}{m}\sum\limits_{s_m}\frac{x_i}{p_i}\right)^2}}=\frac{-14628380521}{\sqrt{5790963768\times 2.36942\times 10^{11}}}=-0.3949,$$

and an estimate of the required total is given by
$$\hat{Y}_m=\frac{1}{m}\sum_{i\in s_m}\frac{y_i}{p_i}-\hat{g}\left[\frac{1}{m}\sum_{i\in s_m}\frac{x_i}{p_i}-\frac{1}{n}\sum_{i\in s_1}\frac{x_i}{p_i}\right]=\frac{4643524}{4}+\frac{0.3949\times 3041283}{4}-\frac{0.3949\times 7427405.14}{10}=1167823.44.$$

From the unmatched sample information, we have

Sr. No. | State | $z_i$ (1994) | $y_i$ (1996) | $p_i$ | $y_i/p_i$
1 | ID | 1559 | 1825 | 0.001974 | 924518.7781
2 | NM | 2936 | 5780 | 0.003718 | 1554786.5330
3 | NM | 2936 | 5780 | 0.003718 | 1554786.5330
4 | NM | 2936 | 5780 | 0.003718 | 1554786.5330
5 | OR | 6784 | 7554 | 0.008590 | 879407.4172
6 | VA | 15342 | 21375 | 0.019426 | 1100330.4410
Sum | | | | | 7568616.2340

An estimate of $Y$ based on unmatched units is given by
$$\hat{Y}_u=\frac{1}{u}\sum_{i\in s_u}\frac{y_i}{p_i}=\frac{7568616.234}{6}=1261436.04.$$

Also we have
$$\hat{\phi}=\frac{\left[\left(\frac{1}{m}-\frac{1}{n}\right)\left(1-\hat{g}^2\right)+\frac{1}{n}\right]^{-1}}{\left[\left(\frac{1}{m}-\frac{1}{n}\right)\left(1-\hat{g}^2\right)+\frac{1}{n}\right]^{-1}+u}=\frac{\left[\left(\frac{1}{4}-\frac{1}{10}\right)\left(1-0.3949^2\right)+\frac{1}{10}\right]^{-1}}{\left[\left(\frac{1}{4}-\frac{1}{10}\right)\left(1-0.3949^2\right)+\frac{1}{10}\right]^{-1}+6}=\frac{4.412905}{10.412905}=0.4238.$$

Hence the composite estimate of the total immigration $Y$ during 1996 in the United States is given by
$$\hat{Y}=\hat{\phi}\,\hat{Y}_m+(1-\hat{\phi})\hat{Y}_u=0.4238\times 1167823.44+(1-0.4238)\times 1261436.04=1221763.02.$$
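The arithmetic of this example can be checked end to end. The sketch below recomputes $\hat{g}$, $\hat{Y}_m$, $\hat{Y}_u$, $\hat{\phi}$, and the composite estimate from the tabulated $x_i/p_i$ and $y_i/p_i$ values; it is a verification aid, not part of the original text:

```python
from math import sqrt

# matched sample: (x_i/p_i, y_i/p_i) for AR, CT, NH, NM, as tabulated above
xp = [715463.0, 765172.2, 818761.9, 741886.0]
yp = [1144434.0, 900485.1, 1043818.0, 1554787.0]
m, n, u = 4, 10, 6
sum_xp_s1 = 7427405.14          # sum of x_i/p_i over the full sample s1
sum_yp_su = 7568616.234         # sum of y_i/p_i over the unmatched sample

xbar, ybar = sum(xp) / m, sum(yp) / m
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xp, yp))
sxx = sum((x - xbar) ** 2 for x in xp)
syy = sum((y - ybar) ** 2 for y in yp)
g = sxy / sqrt(sxx * syy)                      # correlation form, ~ -0.3949

y_m = ybar - g * (xbar - sum_xp_s1 / n)        # estimate from matched units
y_u = sum_yp_su / u                            # estimate from unmatched units

v_m = (1 / m - 1 / n) * (1 - g ** 2) + 1 / n   # relative variance, matched part
phi = (1 / v_m) / (1 / v_m + u)                # ~ 0.4238
y_hat = phi * y_m + (1 - phi) * y_u            # composite estimate
```

Running this reproduces the book's figures to within rounding of the tabulated intermediate values.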

Note that for the estimation of parameters in successive sampling or repeated surveys, inference can be made in three different ways:

( a ) Extracting a new sample on each occasion (repeated sampling);
( b ) Using the same sample on every occasion (panel sampling);
( c ) Performing a partial replacement of the units from one occasion to another (sampling on successive occasions, or rotation sampling).

The last possibility has been discussed extensively by several research workers for estimating the population mean on the second occasion using information from the first occasion, including Jessen (1942), Tikkiwal (1950, 1955), Rao and Mudholkar (1967), Das (1982), and Artes and Garcia (2000a, 2000b). The ratio estimator in successive sampling was first studied by Avdhani (1968) and his followers Sen, Seller, and Smith (1975), and Artes and Garcia (2001a, 2001b). Gupta (1970) and Artes, Rueda, and Arcos (1998) studied the product estimator in successive sampling. The problem of estimation of the ratio of two population means in successive sampling over two occasions has been studied by Rao and Mudholkar (1967), Okafor and Arnab (1987), Okafor (1992), and Artes and Garcia (2001c, 2001d). Artes and Garcia (2001e) developed an unbiased ratio cum product estimator of the population mean in successive sampling. Singh and Yadav (1992) have discussed a generalized estimation procedure for successive sampling. Sud, Srivastava, and Sharma (2001a, 2001b) have paid attention to estimating the population variance over repeated surveys. Okafor (2001) discusses some successive sampling estimation strategies in the presence of random non-response.

The basic idea of using least squares to incorporate information from a previous occasion into the estimate of the current occasion is that of Jessen (1942). Patterson (1950) considered the use of information from rotating samples. His ingenious idea spawned a vast literature, owed to Eckler (1955), Rao and Graham (1964), Raj (1965b), Smith (1978), Wolter (1979), Jones (1980), Huang and Ernst (1981), Kumar and Lee (1983), Breau and Ernst (1983), Singh (1996), and Yansaneh and Fuller (1998). Duncan and Kalton (1987), Schreuder, Gregoire, and Wood (1993), Fuller (1990), and Lent, Miller, and Cantwell (1996) discussed different kinds of repeated surveys and their objectives. Kasprzyk, Duncan, Kalton, and Singh (1989) have discussed various aspects of panel surveys in an excellent way. A rotation survey is one in which a unit is observed for a partial set of time points and is not observed for the remaining set of time points in the study. The Canadian Labour Force Survey and the U.S. Current Population Survey are well-known examples of rotation surveys. The National Resources Inventory (NRI) is nearly a pure panel survey of certain land areas with a five-year observation interval. A pure panel survey can be defined as a survey in which the same units are observed at each time point of a survey conducted at more than one time point. A longitudinal survey can be defined as a survey conducted at more than two points in time with multiple observations on some units planned as part of the survey design. Fuller and Breidt (1999) introduced generalized least squares estimators for such surveys. Following them, consider a simple three-period survey in which one fourth of the units are observed in all three periods and each of the remaining three sets of one fourth of the units is observed in exactly one of the three periods. In other words, if $n$ is the total sample size, then $0.5n$ of the units are observed at each time point. Let $(Y_1,Y_2,Y_3)$ denote the values of a characteristic observed at times one, two, and three respectively. Assume that the correlation between observations at time $i$ and time $j$ on the same element is $\rho_{|i-j|}$. For simplicity assume SRS for the selection of all samples. Let $(\bar{y}_{11},\bar{y}_{21},\bar{y}_{31})'$ denote the estimated means at times one, two, and three of the sample elements that are observed in all three periods. Let $(\bar{y}_{12},\bar{y}_{23},\bar{y}_{34})'$ denote the sample means for the three periods of the sample elements that are observed once. These six estimators can be named elementary estimators. We wish to estimate the population means $\mu=(\mu_1,\mu_2,\mu_3)'$ for the three periods.

The above six elementary estimates can be modelled as
$$\bar{y}=X\mu+e.\qquad(10.10.1)$$
The covariance matrix of $e$ is
$$\Omega=4n^{-1}\sigma^{2}\begin{bmatrix}1&\rho_1&\rho_2&0&0&0\\ \rho_1&1&\rho_1&0&0&0\\ \rho_2&\rho_1&1&0&0&0\\ 0&0&0&1&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&1\end{bmatrix}.\qquad(10.10.2)$$
It is to be noted that the term $4n^{-1}\sigma^{2}$ is the variance of the mean of $n/4$ observations. By applying the weighted least squares method, the best linear unbiased estimator of $\mu=(\mu_1,\mu_2,\mu_3)'$ is given by
$$\hat{\mu}=(\hat{\mu}_1,\hat{\mu}_2,\hat{\mu}_3)'=\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}\bar{y},\qquad(10.10.3)$$
and
$$V(\hat{\mu})=\left(X'\Omega^{-1}X\right)^{-1}.\qquad(10.10.4)$$
Fuller and Breidt (1999) considered the comparison of the variance–covariance matrix in (10.10.4) with the variance–covariance matrix of a pure panel survey in which the same $n/2$ units are observed on all three periods. The best linear unbiased estimator of $\mu$ under the pure panel design is
$$\hat{\mu}_{panel}=(\bar{y}_1,\bar{y}_2,\bar{y}_3)'.\qquad(10.10.5)$$
The variance–covariance matrix for the pure panel design is given by
$$V(\hat{\mu}_{panel})=2n^{-1}\sigma^{2}\begin{bmatrix}1&\rho_1&\rho_2\\ \rho_1&1&\rho_1\\ \rho_2&\rho_1&1\end{bmatrix}.$$
Fuller and Breidt (1999) found that in most practical situations their generalized least squares estimator remains better than the pure panel survey.
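The GLS computation in (10.10.3)–(10.10.4) can be sketched with NumPy. In the sketch below the design matrix $X$ stacks the two sets of elementary means, and the $\sigma^2$, $n$, and $\rho$ values used are illustrative:

```python
import numpy as np

def gls_rotation_variance(rho1, rho2, sigma2=1.0, n=4.0):
    """V(mu-hat) = (X' O^{-1} X)^{-1} for the three-period rotation design."""
    # six elementary means, two per period
    X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    corr = np.array([[1, rho1, rho2], [rho1, 1, rho1], [rho2, rho1, 1]])
    # (10.10.2): correlated block for the always-observed quarter,
    # identity block for the three once-observed quarters
    O = 4 / n * sigma2 * np.block([[corr, np.zeros((3, 3))],
                                   [np.zeros((3, 3)), np.eye(3)]])
    return np.linalg.inv(X.T @ np.linalg.inv(O) @ X)     # (10.10.4)

def panel_variance(rho1, rho2, sigma2=1.0, n=4.0):
    """Pure panel of n/2 units observed in all three periods (10.10.5)."""
    corr = np.array([[1, rho1, rho2], [rho1, 1, rho1], [rho2, rho1, 1]])
    return 2 / n * sigma2 * corr
```

With $\rho_1=\rho_2=0$ the two designs coincide: each period mean is based on $n/2$ independent units, giving variance $2n^{-1}\sigma^2$ per period.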

The major benefit of re-sampling methods is that a single standard error formula can be used for all statistics, unlike the linearization method, which requires the derivation of a separate formula for each statistic. Sometimes the linearization method becomes too cumbersome to handle certain situations, especially post-stratification and non-response adjustments. Such situations can be handled very easily through re-sampling methods. Re-sampling methods are found to be most applicable to stratified random sampling or stratified multi-stage sampling. Establishment surveys are examples of stratified random sampling, whereas large socio-economic surveys are examples of stratified multi-stage sampling. In the case of stratified multi-stage sampling, re-sampling methods are valid if the sample clusters are selected with replacement or the first-stage sampling fraction is negligible. Here we consider a stratified multi-stage design with a large number of strata, $L$, and relatively few primary sampling units or clusters, $m_h\ge 2$, sampled within each stratum, $h=1,2,\ldots,L$. Assume that sub-sampling within sampled clusters $i=1,2,\ldots,m_h$ is performed to ensure unbiased estimation of cluster totals.

The basic design weights $w_{hik}$ attached to the sample elements $hik$ are adjusted for post-stratification and unit non-response to get adjusted weights $w'_{hik}$. The weights $w'_{hik}$ may also be calibrated weights. As we saw from Godambe's work, many parameters of interest, such as the mean, total, median, and variance, can be derived as a solution to the census equation
$$S(\theta)=\sum_{(hik)\in\Omega}U(y_{hik},\theta)=0.\qquad(10.11.1)$$
Obviously the GREG estimator of $S(\theta)$ is
$$\hat{S}(\theta)=\sum_{(hik)\in s}w'_{hik}\,U(y_{hik},\theta).\qquad(10.11.2)$$
In this section we would like to discuss different methods of variance estimation for the estimator $\hat{S}(\theta)$, viz. the Jackknife, Balanced Repeated Replication (BRR), and Balanced Half Sample (BHS) methods. Ahmad (1997) also suggested a re-sampling technique for complex survey data. Willson, Kimos, Gallagher, and Wanger (2002) considered variance estimation from calibrated samples, which improves techniques adopted by statistical packages such as SUDAAN, SAS, and STATA.

At the first step, keep all the sampled units and solve the equation
$$\hat{S}(\theta)=\sum_{(hik)\in s}w'_{hik}\,U(y_{hik},\theta)=0\qquad(10.11.1.1)$$
to obtain an estimator $\hat{\theta}$ of $\theta$. Delete one sampled cluster, say $(lj)$, and adjust the weights $w'_{hik}$ according to the following two steps.

Step I. Change the basic weights $w_{hik}$ to the Jackknife weights
$$w_{hik}(lj)=w_{hik}\,b_{lj}\quad\text{where}\quad b_{lj}=\begin{cases}0 & \text{if } (hi)=(lj),\\[1mm] \dfrac{m_l}{m_l-1} & \text{if } h=l \text{ and } i\neq j,\\[1mm] 1 & \text{if } h\neq l.\end{cases}$$

Step II. Replace $w_{hik}$ by $w_{hik}(lj)$ in the post-stratification process to get adjusted Jackknife weights $w'_{hik}(lj)$.

Now solve the equation
$$\hat{S}_{(lj)}(\theta)=\sum_{\substack{(hik)\in s\\ (hi)\neq(lj)}}w'_{hik}(lj)\,U(y_{hik},\theta)=0$$
to obtain the estimator $\hat{\theta}_{(lj)}$ when the sampled cluster $(lj)$ is deleted.
Then a Jackknife variance estimator of the estimator $\hat{\theta}$ is given by
$$v_J(\hat{\theta})=\sum_{l=1}^{L}\frac{m_l-1}{m_l}\sum_{j=1}^{m_l}\left(\hat{\theta}_{(lj)}-\hat{\theta}\right)^2.\qquad(10.11.1.2)$$

The Jackknife variance estimator given in (10.11.1.2) is quite general and is applicable in many practical situations.
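The two-step weight adjustment and (10.11.1.2) can be sketched for the simple case where $\hat{\theta}$ is a weighted total with no post-stratification step; the data layout below is hypothetical:

```python
def jackknife_variance(strata):
    """strata: list of strata; each stratum is a list of clusters; each
    cluster is a list of (weight, y) pairs. The statistic is the weighted
    total. Implements the delete-one-cluster jackknife (10.11.1.2)."""
    def total(skip=None):
        t = 0.0
        for h, clusters in enumerate(strata):
            m_h = len(clusters)
            for i, cluster in enumerate(clusters):
                if skip == (h, i):
                    continue                      # b_lj = 0 for the deleted cluster
                # b_lj = m_l/(m_l-1) inside the stratum of the deleted cluster
                b = m_h / (m_h - 1) if skip is not None and skip[0] == h else 1.0
                t += sum(b * w * y for w, y in cluster)
        return t

    theta = total()
    v = 0.0
    for h, clusters in enumerate(strata):
        m_h = len(clusters)
        v += (m_h - 1) / m_h * sum((total(skip=(h, i)) - theta) ** 2
                                   for i in range(m_h))
    return theta, v
```

For a single stratum with two clusters this reduces to the familiar two-PSU estimator $(t_1-t_2)^2$, where $t_i$ are the weighted cluster totals.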

Example 10.11.1. Divide the United States of America into four independent strata: Northeast, Midwest, South, and West. Suppose the first stratum, Northeast, consists of two clusters, New England and Mid Atlantic; the second stratum, Midwest, consists of two clusters, East North Central and West North Central; the third stratum, South, consists of three clusters, South Atlantic, East South Central, and West South Central; and the fourth stratum, West, consists of two clusters, Mountain and Pacific, as given in population 7 in the Appendix. From each stratum (or region) select two clusters (or divisions) by SRSWOR sampling, and within each selected cluster (or division) select two units (or states) and collect the information on the projected population counts during 1995.
( a ) Suggest an estimator for estimating the population total in the US using the stratified multi-stage design.
( b ) Estimate the variance of the estimator of the population total by using the technique of the Jackknife.
( c ) Find the 95% confidence interval for the total projected population during 1995 in the USA.

Solution. Let $y_{hik}$ be the value of the $k$th unit (projected count for a state during 1995) of the study variable in the $i$th cluster (division) of the $h$th stratum (region).

Stratum, $h$ | $N_h$ | $n_h$ | Cluster, $i$ | $M_{hi}$ | $m_{hi}$
1 | 2 | 2 | 1 | $M_{11}=6$ | 2
  |   |   | 2 | $M_{12}=3$ | 2
2 | 2 | 2 | 1 | $M_{21}=5$ | 2
  |   |   | 2 | $M_{22}=7$ | 2
3 | 3 | 2 | 1 | $M_{31}=9$ | 2
  |   |   | 2 | $M_{32}=4$ | 2
  |   |   | 3 | $M_{33}=4$ | --
4 | 2 | 2 | 1 | $M_{41}=8$ | 2
  |   |   | 2 | $M_{42}=5$ | 2
10. Multi-stage, successive, and re-samp ling strategies 869

Let $L$ be the total number of strata, $N_h$ the total number of clusters in the $h$th stratum, $n_h$ the number of clusters selected from the $h$th stratum, $M_{hi}$ the total number of units in the $i$th cluster of the $h$th stratum, and $m_{hi}$ the number of units selected from the $i$th cluster of the $h$th stratum.

The following table provides the information collected in the sample.

Stratum | Cluster | State | Projected 1995 count
Northeast | New England | New Hampshire | 1276
Northeast | New England | Massachusetts | 6032
Northeast | Mid Atlantic | New York | 17909
Northeast | Mid Atlantic | New Jersey | 8100
Midwest | East North Central | Ohio | 10958
Midwest | East North Central | Michigan | 9364
Midwest | West North Central | Minnesota | 4501
Midwest | West North Central | North Dakota | 631
South | South Atlantic | Maryland | 5180
South | South Atlantic | North Carolina | 7197
South | East South Central | Kentucky | 3740
South | East South Central | Alabama | 4282
West | Mountain | Idaho | 1018
West | Mountain | Colorado | 3407
West | Pacific | California | 31749
West | Pacific | Hawaii | 1253

( a ) Estimation of population total: Using this notation, an unbiased estimator of the population total $Y$ in the United States is
$$\hat{Y}=\sum_{h=1}^{L}\frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{M_{hi}}{m_{hi}}\sum_{k=1}^{m_{hi}}y_{hik}.$$
This estimator can be written as
$$\hat{Y}=\frac{2}{2}\sum_{i=1}^{2}\frac{M_{1i}}{m_{1i}}\sum_{k=1}^{m_{1i}}y_{1ik}+\frac{2}{2}\sum_{i=1}^{2}\frac{M_{2i}}{m_{2i}}\sum_{k=1}^{m_{2i}}y_{2ik}+\frac{3}{2}\sum_{i=1}^{2}\frac{M_{3i}}{m_{3i}}\sum_{k=1}^{m_{3i}}y_{3ik}+\frac{2}{2}\sum_{i=1}^{2}\frac{M_{4i}}{m_{4i}}\sum_{k=1}^{m_{4i}}y_{4ik}$$
$$=\frac{2}{2}\left[\frac{6}{2}(1276+6032)+\frac{3}{2}(17909+8100)\right]+\frac{2}{2}\left[\frac{5}{2}(10958+9364)+\frac{7}{2}(4501+631)\right]$$
$$+\frac{3}{2}\left[\frac{9}{2}(5180+7197)+\frac{4}{2}(3740+4282)\right]+\frac{2}{2}\left[\frac{8}{2}(1018+3407)+\frac{5}{2}(31749+1253)\right]$$
$$=\frac{2}{2}[21924+39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{2}[17700+82505]$$
$$=337520.25,$$

which is an estimate of the projected population of the United States during 1995 as
shown in the appendix .

( b ) Estimation of variance using the Jackknife estimator: The estimates of the population total, $\hat{Y}(h,i)$ (say), obtained by dropping the $i$th cluster ($i=1,2$) from the $h$th stratum ($h=1,2,3,4$) are given by
$$\hat{Y}(1,1)=\frac{2}{1}[39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{2}[17700+82505]=354609.75,$$
$$\hat{Y}(1,2)=\frac{2}{1}[21924]+\frac{2}{2}[50805+17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{2}[17700+82505]=320430.75,$$
$$\hat{Y}(2,1)=\frac{2}{2}[21924+39013.5]+\frac{2}{1}[17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{2}[17700+82505]=304677.25,$$
$$\hat{Y}(2,2)=\frac{2}{2}[21924+39013.5]+\frac{2}{1}[50805]+\frac{3}{2}[55696.5+16044]+\frac{2}{2}[17700+82505]=370363.25,$$
$$\hat{Y}(3,1)=\frac{2}{2}[21924+39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{1}[16044]+\frac{2}{2}[17700+82505]=278041.5,$$
$$\hat{Y}(3,2)=\frac{2}{2}[21924+39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{1}[55696.5]+\frac{2}{2}[17700+82505]=396999,$$
$$\hat{Y}(4,1)=\frac{2}{2}[21924+39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{1}[82505]=402325.25,$$
and
$$\hat{Y}(4,2)=\frac{2}{2}[21924+39013.5]+\frac{2}{2}[50805+17962]+\frac{3}{2}[55696.5+16044]+\frac{2}{1}[17700]=272715.25.$$

Thus, using the concept of the Jackknife, an estimate of the variance $v(\hat{Y})$ is given by
$$v_J(\hat{Y})=\sum_{h=1}^{4}\frac{n_h-1}{n_h}\sum_{i=1}^{2}\left\{\hat{Y}(h,i)-\hat{Y}\right\}^2$$
$$=\frac{2-1}{2}\left[\{354609.75-337520.25\}^2+\{320430.75-337520.25\}^2\right]+\frac{2-1}{2}\left[\{304677.25-337520.25\}^2+\{370363.25-337520.25\}^2\right]$$
$$+\frac{2-1}{2}\left[\{278041.5-337520.25\}^2+\{396999-337520.25\}^2\right]+\frac{2-1}{2}\left[\{402325.25-337520.25\}^2+\{272715.25-337520.25\}^2\right]$$
$$=9108123386.$$
( c ) Using Table 2 from the Appendix, the $(1-\alpha)100\%$, and hence 95%, confidence interval of the population total is given by
$$\hat{Y}\pm t_{\alpha/2}\left(df=n-\sum_{h=1}^{L}n_h\right)\sqrt{v_J(\hat{Y})},$$
or $\hat{Y}\pm t_{0.025}(df=16-8)\sqrt{v_J(\hat{Y})}$, or $337520.25\pm 2.306\sqrt{9108123386}$, or $[117443.69,\ 557596.80]$.
The true count of the projected population during 1995 is $Y=259{,}593$, which clearly lies in the above 95% confidence interval.
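The whole of Example 10.11.1 can be recomputed programmatically; the sketch below is a verification aid built from the sampled values tabulated above:

```python
# each stratum: (N_h/n_h, list of clusters); each cluster: (M_hi/m_hi, [y values])
strata = [
    (2/2, [(6/2, [1276, 6032]),  (3/2, [17909, 8100])]),   # Northeast
    (2/2, [(5/2, [10958, 9364]), (7/2, [4501, 631])]),     # Midwest
    (3/2, [(9/2, [5180, 7197]),  (4/2, [3740, 4282])]),    # South
    (2/2, [(8/2, [1018, 3407]),  (5/2, [31749, 1253])]),   # West
]

def estimate(strata, drop=None):
    """Stratified multi-stage total; drop=(h, i) deletes one cluster."""
    total = 0.0
    for h, (f_h, clusters) in enumerate(strata):
        kept = [c for i, c in enumerate(clusters) if (h, i) != drop]
        # when a cluster is dropped, n_h falls from 2 to 1, doubling N_h/n_h
        f = f_h * len(clusters) / len(kept)
        total += f * sum(g * sum(ys) for g, ys in kept)
    return total

y_hat = estimate(strata)                       # 337520.25
v_jack = sum((2 - 1) / 2 * sum((estimate(strata, drop=(h, i)) - y_hat) ** 2
                               for i in range(2))
             for h in range(4))                # ~ 9.108e9
```

Running this reproduces the point estimate exactly and the jackknife variance up to the rounding of the intermediate drop-one estimates.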

A half sample can be formed by selecting one of the two first-stage sample clusters from each stratum. A set of $R$ half samples can be defined by an $R\times L$ matrix $H$ with the $(r,h)$th element defined as
$$\varepsilon_{rh}=\begin{cases}+1 & \text{if the cluster of the } h\text{th stratum in the } r\text{th half sample is the first of the two first-stage sample clusters},\\ -1 & \text{if the cluster of the } h\text{th stratum in the } r\text{th half sample is the second of the two first-stage sample clusters}.\end{cases}$$
Then a set of $R$ half samples is said to be balanced if $\sum_{r=1}^{R}\varepsilon_{rh}\,\varepsilon_{rh'}=0$ for all $h\neq h'$. A balanced $H$ can be made from an $R\times R$ Hadamard matrix by choosing any $L$ columns excluding the column of all values of $+1$, where $L+1\le R\le L+4$. Let $\hat{\theta}^{(r)}$ be the survey estimator of $\theta$ obtained by treating the $r$th half sample as the original data set after adjusting the weights $w_{hik}$ as
$$w_{hik}^{(r)}=\begin{cases}2w_{hik} & \text{if } \varepsilon_{rh}=+1,\\ 0 & \text{if } \varepsilon_{rh}=-1.\end{cases}\qquad(10.11.2.1)$$

Using these weights in place of the Jackknife weights, a standard Balanced Half Sample (BHS) variance estimator of $\hat{\theta}$ is given by
$$v_{BHS}(\hat{\theta})=\frac{1}{R}\sum_{r=1}^{R}\left[\hat{\theta}^{(r)}-\hat{\theta}\right]\left[\hat{\theta}^{(r)}-\hat{\theta}\right]'.\qquad(10.11.2.2)$$
A limitation of the BHS method is that some of the replicate estimators may become extreme because only a half sample is used. In other words, the weights $w_{hik}$ are sharply perturbed. One well-known method to solve this problem is to change the weights $w_{hik}$ as
$$w_{hik}^{(r)}=\begin{cases}1.5\,w_{hik} & \text{if } \varepsilon_{rh}=+1,\\ 0.5\,w_{hik} & \text{if } \varepsilon_{rh}=-1.\end{cases}$$
In such situations the full sample information is utilized to obtain the estimators $\hat{\theta}^{(r)}$, $r=1,2,\ldots,R$, of $\theta$, and the BHS method of variance estimation becomes
$$v_{BHS}(\hat{\theta})=\frac{4}{R}\sum_{r=1}^{R}\left[\hat{\theta}^{(r)}-\hat{\theta}\right]\left[\hat{\theta}^{(r)}-\hat{\theta}\right]'.\qquad(10.11.2.3)$$

For practical use, we need a BRR with the number of replications ($R$) as small as possible. A balance can be struck between the desire for small $R$ and the need for a reasonably stable variance estimator by following Wolter (1985). It may not be trivial to find a BRR, but some methods have been suggested by Gupta and Nigam (1987), Gurney and Jewett (1975), Sitter (1993), and Wu (1981). One may note that the standard BHS variance estimator can also use the weights
$$w_{hik}^{(r)}=\begin{cases}2w_{hik} & \text{if cluster } (h,i) \text{ is in replicate } r,\\ 0 & \text{if cluster } (h,i) \text{ is not in replicate } r.\end{cases}$$
The original BHS was proposed by McCarthy (1969) for $m_h=2$ and was further studied by Krewski and Rao (1979). It is difficult to construct balanced replicates for an arbitrary value of $m_h$, and its limitations can be had from Valliant (1996). Fortunately there is another solution to deal with an arbitrary value of $m_h$: the method of Bootstrap.
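A balanced set of half samples can be built from a Hadamard matrix as described above. The sketch below uses the Sylvester construction, which exists only for orders that are powers of two (the text's condition $L+1\le R\le L+4$ allows general Hadamard orders); it checks the balance condition and evaluates (10.11.2.2) for a supplied set of replicate estimates:

```python
import numpy as np

def hadamard(R):
    """Sylvester construction of an R x R Hadamard matrix (R a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < R:
        H = np.block([[H, H], [H, -H]])
    return H

def balanced_half_sample_signs(L, R):
    """Take L columns of an R x R Hadamard matrix, excluding the all +1
    column; row r holds the eps_rh signs defining the r-th half sample."""
    E = hadamard(R)[:, 1:L + 1]
    # balance condition: sum_r eps_rh * eps_rh' = 0 for all h != h'
    assert np.allclose(E.T @ E, R * np.eye(L))
    return E

def v_bhs(theta_reps, theta_full):
    """Standard BHS variance estimator (10.11.2.2) for a scalar statistic."""
    d = np.asarray(theta_reps, dtype=float) - theta_full
    return float(np.mean(d ** 2))
```

For $L=3$ strata, for example, $R=4$ replicates suffice, consistent with $L+1\le R$.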
10. Multi-stage, successive, and re-sampling strategies 873

10.11.3 BOOTSTRAP VARIANCE ESTIMATOR

A bootstrap sample can be obtained by drawing $(m_h - 1)$ clusters with replacement from the $m_h$ sample clusters in each stratum independently. Select a large number, $R$, of bootstrap samples independently, which can be represented in terms of the bootstrap frequencies $m_{hi}(r) =$ number of times the $(h, i)$th sample cluster is selected in the $r$th bootstrap sample, $r = 1, 2, \ldots, R$. Then the bootstrap design weights are given by
$$ w_{hik}(r) = w_{hik}\left[\frac{m_h}{m_h - 1}\right] m_{hi}(r). \qquad (10.11.3.1) $$
Using these weights, obtain the bootstrap estimators $\hat{\theta}(r)$, $r = 1, 2, \ldots, R$, of $\theta$. Then a bootstrap variance estimator of $\hat{\theta}$ is given by
$$ v_{boot}\left(\hat{\theta}\right) = \frac{1}{R}\sum_{r=1}^{R}\left[\hat{\theta}(r) - \hat{\theta}\right]\left[\hat{\theta}(r) - \hat{\theta}\right]' . \qquad (10.11.3.2) $$

Several research workers, including Kott and Stukel (1998), Lohr and Rao (1998), Rao and Shao (1992), Rao (1996b), Rao and Shao (1999), Shao, Chen, and Chen (1998), and Shao and Chen (1999), have considered applications of the above standard techniques, with intelligent modifications, to estimate the variance in the presence of non-response or imputed data through different techniques such as hot deck, cold deck, ratio imputation, and regression imputation, which we shall discuss in detail in Chapter 12.
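The $(m_h - 1)$ with-replacement bootstrap above can be sketched as follows; the data and stratum sizes are invented for illustration, and the replicate weights follow (10.11.3.1):

```python
import numpy as np

# Invented stratified cluster sample: strata[h] holds the m_h cluster-level
# estimates for stratum h, with base design weights equal to one.
rng = np.random.default_rng(3)
strata = [rng.normal(100.0, 20.0, size=m) for m in (4, 5, 3)]

def bootstrap_total(strata, rng):
    """One bootstrap replicate: resample (m_h - 1) clusters with replacement
    in each stratum and rescale the weights as in (10.11.3.1)."""
    total = 0.0
    for y in strata:
        m = len(y)
        freq = np.bincount(rng.integers(0, m, size=m - 1), minlength=m)
        total += (m / (m - 1) * freq * y).sum()
    return total

theta_hat = sum(y.sum() for y in strata)
R = 2000
reps = np.array([bootstrap_total(strata, rng) for _ in range(R)])
v_boot = np.mean((reps - theta_hat) ** 2)      # (10.11.3.2)
```

For large $R$ this agrees with the linearization estimator $\sum_h \frac{m_h}{m_h-1}\sum_i (y_{hi} - \bar{y}_h)^2$ for a total, which is what makes the $(m_h - 1)$ resample size and the rescaling factor work together.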

EXERCISES
Exercise 10.1. A sample of size $n$ is to be taken from a population of size $N$ to estimate the total $Y$ of the values $Y_i$ ($i = 1, 2, \ldots, N$) of a variable $y$ when 'normed' size measures $p_i$ are available. Suppose the population is divided into $n$ random groups of sizes $N_g$ ($g = 1, 2, \ldots, n$) such that $\sum_{g=1}^{n} N_g = N$. If the selected $i$th unit consists of $M_i$ second stage units (SSU), one may estimate $Y_i$ by taking an SRSWOR sample of $m_i$ units instead of ascertaining it. Suggest an estimator of the population total on the basis of the above two-stage sampling scheme and hence deduce an estimator of its variance.
Hint: Chaudhuri, Adhikary, and Seal (1997).

Exercise 10.2. Let $Y_i$ and $X_i$ be the totals of the $i$th FSU for the study variable $y$ and auxiliary variable $x$, respectively. Assume that the population total $X = \sum_{i=1}^{N} X_i$ of the auxiliary variable is known and we are interested in the estimation of the population total of the study variable, $Y = \sum_{i=1}^{N} Y_i$. A sample $s$ of $n$ FSUs is selected according to any design with $\pi_i$ and $\pi_{ij}$ as the known first and second order inclusion probabilities, which are in fact functions of the first stage sample size $n$. The selected FSUs are sub-sampled independently with suitable selection probabilities at the second stage. When the $i$th unit is selected, it is assumed that from sampling at the second stage estimators $t_{iy}$ and $t_{ix}$ of $Y_i$ and $X_i$ are available such that $E_2(t_{iy}) = Y_i$ and $E_2(t_{ix}) = X_i$. Find the asymptotic properties of the following estimators:

(a) $t_1 = \sum_{i=1}^{n}\frac{N}{n}\left[t_{iy} - \beta_{ixy}\left(t_{ix} - X_i\right)\right] - \beta_{xy}\left(\sum_{i=1}^{n}\frac{N}{n}\,t_{ix} - X\right)$  [Sahoo (1987)]

when the first and second stage units are selected with SRSWOR sampling.

(b) $t_2 = \sum_{i=1}^{n}\frac{1}{\pi_i}\left[t_{iy} - \beta_{ixy}\left(t_{ix} - X_i\right)\right] - \beta_{xy}\left(\sum_{i=1}^{n}\frac{t_{ix}}{\pi_i} - X\right)$  [Sahoo and Panda (1997)]

when the first stage units are selected with PPSWOR sampling and the second stage units are selected with SRSWOR sampling.

(c) $t_3 = \left(\sum_{i=1}^{n} M_i\bar{y}_i\right)\Big(X\Big/\sum_{i=1}^{n} X_i\Big)$  [Smith (1969)]

(d) $t_4 = \sum_{i=1}^{n}\frac{\bar{y}_i}{\bar{x}_i}\,x_i\Big(X\Big/\sum_{i=1}^{n} x_i\Big)$  [Murthy (1967)]

(e) $t_5 = \left[\frac{N}{n}\sum_{i=1}^{n}\frac{\left(M_i\bar{y}_i\right)\left(M_i\bar{x}_i\right)}{X_i}\right]\left[\frac{N}{n}\sum_{i=1}^{n} x_i\right]\Big/X$  [Sahoo and Swain (1986)]

(f) $t_6 = n^{-1}\sum_{i=1}^{n}\left[\bar{y}_i - a_i\left(\bar{x}_i - \bar{X}_i\right) - r_i\left(s_{ix}^2 - S_{ix}^2\right)\right] - d\left(\bar{x} - \bar{X}\right)$  [Mahajan and Singh (1996)]

(g) $t_7 = \frac{N}{n}\sum_{i=1}^{n}\left[M_i\bar{y}_i + d_i\left(M_i\bar{x}_i - X_i\right)\right]\left(\bar{X}\big/\bar{x}\right)^{g}$  [Sahoo and Panda (1997)]

Hint: Panda and Sahoo (1999).

Exercise 10.3. Show that the Hartley and Ross (1954) unbiased ratio type estimator of the population mean in two-stage sampling is
$$ t_u = \bar{r}\,\bar{X} + \frac{N-1}{n}\left(s_{brx} - \theta\, s_{brz}\right) + \frac{1}{Nn}\sum_{i=1}^{n} a_i\left[\frac{M_i - 1}{M_i} - \frac{N-1}{N}\,\frac{M_i - m_i}{M_i m_i}\right]\left(s_{irx} - \theta\, s_{irz}\right), $$
where
$$ a_i = N M_i\Big/\sum_{i=1}^{N} M_i, \qquad \theta = \bar{X}\big/\bar{Z}, \qquad r_i = \sum_{j=1}^{m_i} r_{ij}\big/m_i, \qquad \bar{r} = \sum_{i=1}^{n} a_i r_i\big/n, $$
$$ s_{brz} = (n-1)^{-1}\sum_{i=1}^{n}\left(a_i r_i - \bar{r}\right)\left(a_i z_i - \bar{z}\right) \qquad\text{and}\qquad s_{irz} = (m_i - 1)^{-1}\sum_{j=1}^{m_i}\left(r_{ij} - r_i\right)\left(z_{ij} - \bar{z}_i\right), $$
with $s_{brx}$ and $s_{irx}$ defined analogously.
Hint: Sahoo (1991).

Exercise 10.4. In two-stage sampling, $n$ FSUs are selected with PPSWR sampling. If the $i$th FSU occurs $r_i$ times in the sample, one of the following procedures may be adopted for the second stage:
(a) $r_i m_i$ SSUs are selected with SRSWOR sampling;
(b) $r_i$ independent samples of $m_i$ SSUs each are selected with SRSWOR sampling;
(c) $m_i$ units are selected WOR and the observations are weighted by $r_i$.
Propose unbiased estimators of the population total $Y$ and derive the expressions for the variance in each situation. If $V_a$, $V_b$, and $V_c$ denote the variances of the estimators under (a), (b), and (c), respectively, then show that the inequality $V_a \le V_b \le V_c$ holds.
Hint: Rao (1961).

Exercise 10.5. A sample of $n$ FSUs is selected with SRSWOR sampling, and from each selected FSU a constant fraction $f$ of the SSUs is selected. If $r_i$ out of the $m_i$ sampled SSUs in the $i$th FSU possess an attribute, show that the ratio-to-size estimator $p = \sum_{i=1}^{n} r_i\Big/\sum_{i=1}^{n} m_i$ estimates the population proportion possessing the attribute. Find the variance of the estimator $p$ and suggest an estimator of the variance.
Hint: Cochran (1977).
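As a quick numerical illustration of the ratio-to-size estimator in Exercise 10.5 (the counts below are invented):

```python
# r[i] of the m[i] sampled SSUs in FSU i possess the attribute.
m = [40, 25, 60, 35]
r = [12, 5, 21, 14]
p_hat = sum(r) / sum(m)    # pooled ratio-to-size estimate of the proportion
```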

Exercise 10.6. (a) Suppose $n$ FSUs are selected with PPSWR sampling in two-stage sampling. From each sampled FSU, $m$ SSUs are selected with SRSWOR sampling. Find the bias and variance of the estimator of the population total defined as $\hat{Y}_R = \sum_{i=1}^{n} a_i\hat{Y}_i$, where the $a_i$ are real numbers and $\hat{Y}_i$ is an unbiased estimator of the total of the $i$th FSU. Deduce an unbiased estimator of the variance.
Hint: Raj (1966).

(b) In two-stage sampling, $n$ FSUs are selected from the $N$ FSUs in the population by the Midzuno--Sen sampling scheme, and $m_i$ SSUs are selected from the $i$th selected FSU by SRSWOR sampling. Derive an unbiased estimator of the population total, an expression for its variance, and an estimator of the variance.
Hint: Raj (1954a, 1954b).

Exercise 10.7. Consider $n$ FSUs selected with PPSWR sampling. From the $i$th selected FSU of size $M_i$, suppose $m_i$ SSUs are selected with SRSWR sampling. For estimating the population total $Y$ the sub-sampling numbers $m_i$ are to be fixed in two ways:
(a) the expected value of $m_i$ is $m$;
(b) the total number of SSUs is $m_0$.
Find the optimum value of $m_i$ in each situation such that the variance of the estimator of the population total is minimum. Hence deduce the efficiency comparison.
Hint: Rangarajan (1957).

Exercise 10.8. Show that in the following cases the sample mean is an unbiased
estimator of population mean.
( a ) Divide the population into N clusters each of size M units. n FSUs are
selected with SRSWR sampling. From each selected FSU, m SSUs are selected
with SRSWR sampling.
( b ) Divide the population into clusters each of m' units. Select a sample of n'
clusters with SRSWR sampling.
( c ) Deduce the condition such that the efficiency of both schemes will be the same.
Hint: Singh (1956).

Exercise 10.9. Discuss three-stage sampling by proposing an unbiased estimator of the population total; derive its variance and an estimator of that variance. Also discuss a method of optimum allocation in three-stage sampling.

Exercise 10.10. Suppose $n$ units are selected with SRSWR sampling on the first occasion, and a sub-sample of $n\pi$ units is selected by a similar method and is retained for the second occasion. A supplementary sample of size $n(1 - \pi)$ units is selected independently by the same method. Find the constants $\alpha_i$, $\beta_i$ ($i = 1, 2$) and $\pi$ such that the variance of the estimator of the population mean on the second occasion, defined as a linear combination with these constants, is a minimum.
Hint: Hansen, Hurwitz, and Madow (1953).

Exercise 10.11. From a population of $N$ units, a large sample of $m$ units is drawn by the SRSWR sampling scheme to gather information on the auxiliary variable $x$ (say); this is called the first phase sample. In the second phase, a sub-sample of $m_1$ units is drawn from the first phase sample, a sample of $m_2$ units is taken independently from the whole population by SRSWR sampling, and the study variable is noted.
( a ) Compare the following two estimators of population mean:

(i) the unbiased estimator based on the direct sample of $m_2$ units;
(ii) the regression estimator based on the two-phase sample of $m_1$ and $m_2$ units.
(b) Derive the optimum values of $m_1$ and $m_2$ for a fixed cost.
(c) Obtain the best linear combination of the estimators in (i) and (ii) and derive its variance expression.
(d) Find the optimum value of the parameter forming the linear variety of estimators from (i) and (ii), such that the resulting variance is minimum.
Hint: Patterson (1950), Cochran (1963).

Exercise 10.12. Consider a population $\Omega$ consisting of $N$ units, where the $i$th FSU consists of $M_i$ SSUs. On the first occasion a sample of $n$ FSUs is drawn from the population $\Omega$ with PPSWR sampling. Within each selected FSU a sample of $m$ SSUs is drawn with SRSWR sampling, and the order of drawing FSUs and SSUs is specified. Assume $\lambda$ is the proportion of FSUs matched. Reject $n(1 - \lambda)$ units and retain the remaining $n\lambda$ units in the sample as FSUs. Supplement these retained FSUs with a set of $n(1 - \lambda)$ new FSUs selected from the population with PPSWR sampling. Suggest an estimator of the population mean. Extend the procedure so that one can estimate the population mean on the $h$th occasion.
Hint: Kathuria (1975).

Exercise 10.13. In multi-occasion sampling, let the first occasion consist of $n_1$ FSUs selected by SRSWOR sampling from a population of $N$ units. For the $i$th ($i \ge 2$) occasion a sample of $n_i$ FSUs consists of the following parts:
(i) the $n_i'$ FSUs which were also observed for the same variate at least on the previous occasion;
(ii) the $n_i^{*}$ FSUs from the population FSUs not selected so far in the sample, such that the condition $n_i = n_i' + n_i^{*}$ is satisfied;
(iii) each FSU in the sample is studied on different occasions by selecting $m$ SSUs out of its $M$ SSUs by SRSWOR sampling, again once for all.
Suggest an estimator of the population mean based on the above sampling mechanism and derive its variance.
Hint: Agarwal and Tikkiwal (1980).

Exercise 10.14. Let $p_j$, $j = 1, 2, \ldots, N$, be the probability of selecting a unit $U_j$ based on a variable $z$, $N$ being the number of units in the population. On the first occasion select a sample $s_1$ of $n$ units with probabilities $p_j$ such that $\sum_{j=1}^{N} p_j = 1$ using PPSWR sampling, and observe the study variable $y$. On the $h$th occasion ($h > 1$) the sample $s_h$ of $n$ units consists of a sample $s_{hm}$ of $m_h = n\lambda_h$ units (matched part), selected with SRSWOR sampling from the $n$ units obtained on the $(h-1)$th occasion, and a sample $s_{hu}$ of $u_h = n - m_h$ units (unmatched part), selected directly from the population with probabilities $p_j$ using PPSWR sampling. Suggest an estimator of the population mean on the $h$th occasion.
Hint: Tripathi and Srivastava (1979).

Exercise 10.15. Consider a sample $s_1$ of size $n$ selected on the first occasion by some suitable sampling design $p_1$. Let $y_{1i}$ ($y_{2i}$) be the value of the variate $y$ under study for the $i$th unit on the first (second) occasion. On the second occasion a matched sample, say a sub-sample $s_m$ of size $m\,(= n\gamma)$ with $0 < \gamma \le 1$, is selected from the sample $s_1$ selected on the first occasion by some suitable sampling design $p_m$. Assume the sub-sample $s_m$ is supplemented by another unmatched sample $s_u$ of size $u$, drawn either from the entire population or from the units not sampled on the first occasion by any suitable sampling design $p_u$. Assuming that the initial sample $s_1$ is selected by using the Rao--Hartley--Cochran (RHC) scheme, compare the following two estimators:
$$ \hat{\bar{Y}}_2 = \phi\,\bar{y}_{2m} + (1 - \phi)\,\bar{y}_{2u} \qquad\text{and}\qquad \bar{y}_{2m}^{o} = \sum_{i\in s_m}\left(y_{2i}^{o}\big/p_i^{o}\right)P_i^{o}, $$
where
$$ \bar{y}_{2m} = \sum_{i\in s_m}\left(y_{2i}\big/p_i^{*}\right)P_i^{*}, \qquad \bar{y}_{2u} = \sum_{i\in s_u}\left(y_{2i}\big/p_i\right)P_i', \qquad y_{2i}^{o} = y_{2i}\,\mu_i\big/p_i, \qquad P_i^{o} = \sum_{G_m} p_i, \qquad P_i' = \sum_{G_u} p_i, $$
with $G_m$ and $G_u$ being the groups of those units that belong to the random group from which the $i$th unit was selected in $s_m$ and $s_u$, respectively. In addition
$$ p_i = z_i\big/Z \qquad\text{and}\qquad p_i^{*} = \left(y_{1i}\,\mu_i\big/p_i\right)\Big/\Big(\sum_{i\in s_1} y_{1i}\,\mu_i\big/p_i\Big) $$
for the $i$th unit in the first occasion sample $s_1$, where $\mu_i$ is the sum of the $p_i$ values for the group containing the $i$th unit.
Hint: Arnab (1998).

Exercise 10.16. Study the bias and variance properties of the following estimators of the population total in two-stage sampling:
$$ t_1 = \sum_{i=1}^{n}\frac{1}{\pi_i}\left[t_{iy} - \beta_{ixy}\left(t_{ix} - X_i\right)\right] - \beta_{xy}\left(\sum_{i=1}^{n}\frac{t_{ix}}{\pi_i} - X\right), \qquad t_2 = \sum_{i=1}^{n} M_i\bar{y}_i\left(X\Big/\sum_{i=1}^{n} X_i\right), $$
$$ t_3 = \sum_{i=1}^{n}\frac{\bar{y}_i}{\bar{x}_i}\,x_i\left(X\Big/\sum_{i=1}^{n} x_i\right), \qquad t_4 = \left[\frac{N}{n}\sum_{i=1}^{n}\frac{\left(M_i\bar{y}_i\right)\left(M_i\bar{x}_i\right)}{X_i}\right]\left[\frac{N}{n}\sum_{i=1}^{n} x_i\right]\Big/X, $$
and
$$ t_5 = \frac{N}{n}\sum_{i=1}^{n}\left[M_i\bar{y}_i + d_i\left(M_i\bar{x}_i - X_i\right)\right]\left(\bar{X}\big/\bar{x}\right)^{g}. $$
Hint: Sahoo and Panda (1997).

Exercise 10.17. Consider the problem of estimating the change in total over two successive occasions. For a fixed survey population $\Omega = \{i : i = 1, 2, \ldots, N\}$, let the population vector on the present occasion be $y = (y_1, y_2, \ldots, y_N)$ and on some previous occasion be $y' = (y_1', y_2', \ldots, y_N')$. Let the sample data $\{(i, y_i, y_i') : i \in s\}$ be available. Show that the optimal estimate of the change in total, $\sum_{i\in\Omega} y_i - \sum_{i\in\Omega} y_i'$, is given by
$$ \sum_{i\in s}\left(y_i - y_i'\right)\big/\pi_i. $$
Hint: Godambe (1995).
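The optimal change estimator in Exercise 10.17 is simply a Horvitz--Thompson estimator applied to the unit-level differences; a tiny numerical sketch with invented values:

```python
# Change-in-total estimate: sum over the sample of (y_i - y'_i)/pi_i.
y_now  = [12.0, 8.5, 20.0]   # present-occasion values for the sampled units
y_prev = [10.0, 9.0, 16.0]   # previous-occasion values for the same units
pi     = [0.2, 0.5, 0.4]     # first order inclusion probabilities
delta_hat = sum((a - b) / p for a, b, p in zip(y_now, y_prev, pi))
```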

Exercise 10.18. Let a general linear unbiased estimator of the population total $Y$ under a multi-stage design be defined as
$$ \hat{Y}_{ms} = \sum_{i\in s} t_{is}\hat{Y}_i = \sum_{i=1}^{N} t_{is}^{*}\hat{Y}_i, $$
where
$$ t_{is}^{*} = \begin{cases} t_{is} & \text{if } i \in s,\\ 0 & \text{otherwise,} \end{cases} \qquad\text{and}\qquad E\left(t_{is}^{*}\right) = \sum_{s \ni i} t_{is}\,P(s) = 1. $$
Then deduce the following results:

(a) An unbiased estimator of the variance of $\hat{Y}_{ms}$ is given by
$$ v_1\left(\hat{Y}_{ms}\right) = \sum_{i\in s} a_{is}\hat{Y}_i^2 + \mathop{\sum\sum}_{i \ne j \in s} b_{ijs}\hat{Y}_i\hat{Y}_j + \sum_{i\in s} t_{is}\, v_2\left(\hat{Y}_i\right), $$
where $a_{is}$, $b_{ijs}$ and $t_{is}$ are predetermined real numbers for every $s$, and $v_2(\hat{Y}_i)$ is an estimator of the conditional variance of $\hat{Y}_i$ given $s$, say $V_{2s}(\hat{Y}_i)$.
Hint: Rao (1966c).
(b) If $V_{2s}(\hat{Y}_i)$ is not independent of $s$, then an estimator of the variance of $\hat{Y}_{ms}$ can be written as
$$ v_2\left(\hat{Y}_{ms}\right) = \sum_{i\in s} a_{is}\hat{Y}_i^2 + \mathop{\sum\sum}_{\substack{i, j\in s\\ i\ne j}} b_{ijs}\hat{Y}_i\hat{Y}_j + \sum_{i\in s}\left(t_{is}^2 - a_{is}\right)v_2\left(\hat{Y}_i\right). $$
Hint: Rao (1975).
(c) The following estimators of variance are special cases of the estimator $v_2(\hat{Y}_{ms})$:

(i) Considering $\hat{Y}_{ms}$ as the Horvitz and Thompson estimator $\hat{Y}_{HT}$ of the total in multi-stage sampling, the Sen--Yates--Grundy form of the estimator of variance is given by
$$ v_{syg}\left(\hat{Y}_{HT}\right) = \sum_{i<j\in s} D_{ij}\left(\frac{\hat{Y}_i}{\pi_i} - \frac{\hat{Y}_j}{\pi_j}\right)^2 + \sum_{i<j\in s}\left\{\frac{1}{n-1} - D_{ij}\right\}\left[\frac{v_{2s}\left(\hat{Y}_i\right)}{\pi_i} + \frac{v_{2s}\left(\hat{Y}_j\right)}{\pi_j}\right]. $$

(ii) In the case of ordered sampling the well known Raj estimator for $n = 2$ becomes
$$ v_{d1} = \frac{1}{4}\left[\left(1 - p_1\right)\left(\frac{\hat{Y}_1}{p_1} - \frac{\hat{Y}_2}{p_2}\right)^2 - \left(1 - p_1\right)\left\{\frac{v_{2s}\left(\hat{Y}_1\right)}{p_1} + \frac{v_{2s}\left(\hat{Y}_2\right)}{p_2}\right\} + \left(1 + p_1\right)^2\frac{v_{2s}\left(\hat{Y}_1\right)}{p_1^2} + \left(1 - p_1\right)^2\frac{v_{2s}\left(\hat{Y}_2\right)}{p_2^2}\right], $$
and in the case of unordered sampling it reduces to
$$ v_{d2} = \frac{1}{4}\left[\left(1 - p_1\right)\left(1 - p_2\right)\left(\frac{\hat{Y}_1}{p_1} - \frac{\hat{Y}_2}{p_2}\right)^2 - \left\{\frac{v_{2s}\left(\hat{Y}_1\right)}{p_1} + \frac{v_{2s}\left(\hat{Y}_2\right)}{p_2}\right\} + \left(1 + R\right)^2\frac{v_{2s}\left(\hat{Y}_1\right)}{p_1^2} + \left(1 - R\right)^2\frac{v_{2s}\left(\hat{Y}_2\right)}{p_2^2}\right]. $$

(iii) Under the Rao--Hartley--Cochran (RHC) scheme,
$$ v_{RHC} = \frac{\sum_{i=1}^{n} N_i^2 - N}{N^2 - \sum_{i=1}^{n} N_i^2}\sum_{i=1}^{n}\Pi_i\left(\frac{\hat{Y}_i}{p_i} - \hat{Y}_{RHC}\right)^2, $$
where $N_i$ denotes the size of the $i$th random group and $\Pi_i$ the total of the $p_i$ values in that group.

(iv) Murthy's estimator of variance for $n = 2$ becomes
$$ v_m = \frac{\left(1 - p_1\right)\left(1 - p_2\right)\left(1 - p_1 - p_2\right)}{\left(2 - p_1 - p_2\right)^2}\left[\left(\frac{\hat{Y}_1}{p_1} - \frac{\hat{Y}_2}{p_2}\right)^2 - \left\{\frac{v_{2s}\left(\hat{Y}_1\right)}{p_1} + \frac{v_{2s}\left(\hat{Y}_2\right)}{p_2}\right\}\right] + \frac{\left(1 - p_1\right)^2\left(1 - p_2\right)^2}{\left(2 - p_1 - p_2\right)^2}\left\{\frac{v_{2s}\left(\hat{Y}_1\right)}{\left(1 - p_1\right)^2 p_1^2} + \frac{v_{2s}\left(\hat{Y}_2\right)}{\left(1 - p_2\right)^2 p_2^2}\right\}. $$
Hints: Brewer and Hanif (1970), Arnab (1988).

Exercise 10.19. Consider a population $\Omega$ consisting of $N$ first stage units such that the $i$th first stage unit $\Omega_i$ (say) contains $M_i$ second stage units, with $M = \sum_{i=1}^{N} M_i$. Let $Y_i$, $X_i$ and $Z_i$ be the totals of $\Omega_i$ for the study variable $y$ and two auxiliary variables $x$ and $z$, respectively, with corresponding population totals $Y = \sum_{i=1}^{N} Y_i$, $X = \sum_{i=1}^{N} X_i$, and $Z = \sum_{i=1}^{N} Z_i$. At the first stage a sample $s \subset \Omega$ of $n$ first stage units is drawn from $\Omega$ according to any design with $\pi_i$ and $\pi_{ij}$ as the known and positive first and second order inclusion probabilities. For every $i \in s$, a sample $s_i$ of $m_i$ second stage units is drawn from $\Omega_i$ with suitable selection probabilities at the second stage. Let $\hat{Y}_i$, $\hat{X}_i$ and $\hat{Z}_i$ be the unbiased estimates of $Y_i$, $X_i$ and $Z_i$, respectively, such that
$$ V_2\left(\hat{Y}_i\right) = \sigma_{iy}^2, \quad V_2\left(\hat{X}_i\right) = \sigma_{ix}^2, \quad V_2\left(\hat{Z}_i\right) = \sigma_{iz}^2, \quad C_2\left(\hat{Y}_i, \hat{Z}_i\right) = \sigma_{iyz}, \quad C_2\left(\hat{Y}_i, \hat{X}_i\right) = \sigma_{iyx}, \quad C_2\left(\hat{X}_i, \hat{Z}_i\right) = \sigma_{ixz}. $$
From the first stage sample information define
$$ \hat{Y} = \sum_{i\in s}\frac{\hat{Y}_i}{\pi_i}, \qquad \hat{X} = \sum_{i\in s}\frac{\hat{X}_i}{\pi_i}, \qquad\text{and}\qquad \hat{Z} = \sum_{i\in s}\frac{\hat{Z}_i}{\pi_i}, $$
such that $E(\hat{Y}) = Y$, $E(\hat{X}) = X$, and $E(\hat{Z}) = Z$. Also
$$ V\left(\hat{Y}\right) = \sigma_y^2 + \sum_{i=1}^{N}\frac{\sigma_{iy}^2}{\pi_i}, \qquad C\left(\hat{Y}, \hat{X}\right) = \sigma_{xy} + \sum_{i=1}^{N}\frac{\sigma_{ixy}}{\pi_i}, \qquad C\left(\hat{X}, \hat{Z}\right) = \sigma_{xz} + \sum_{i=1}^{N}\frac{\sigma_{ixz}}{\pi_i}, $$
where
$$ \sigma_{xz} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j(\ne i)}^{N}\left(\pi_i\pi_j - \pi_{ij}\right)\left(\frac{X_i}{\pi_i} - \frac{X_j}{\pi_j}\right)\left(\frac{Z_i}{\pi_i} - \frac{Z_j}{\pi_j}\right), $$
and similarly for the other population terms.
(a) Study the asymptotic properties of the following class of estimators of the population total in two-stage sampling:
$$ \hat{Y}_g = G\left[\hat{Y}_a, \hat{Z}\right], \qquad\text{where}\qquad \hat{Y}_a = \sum_{i\in s}\frac{g_i\left[\hat{Y}_i, \hat{X}_i\right]}{\pi_i}. $$
(b) Show that the estimator
$$ \hat{Y}_{zs} = \sum_{i\in s}\left(\hat{Y}_i - a_i\left(\hat{X}_i - X_i\right)\right)\big/\pi_i - a\left(\hat{Z} - Z\right) $$
is a special case of the class of estimators in (a).
(c) Defining
$$ \hat{Z}_a = \sum_{i\in s}\frac{\hat{z}_i}{\pi_i}, \qquad\text{where}\qquad \hat{z}_i = h_i\left(\hat{Z}_i, \hat{X}_i\right), $$
study the class of estimators
$$ \hat{Y}_{g1} = H\left[\hat{Y}_a, \hat{Z}_a\right], $$
where $G(\cdot,\cdot)$, $H(\cdot,\cdot)$, $h_i(\cdot,\cdot)$ and $g_i(\cdot,\cdot)$ are parametric functions as defined by Srivastava (1980).
Hint: Sahoo and Panda (1999a, 1999b).

Exercise 10.20. Let $K_h$ be the number of primary clusters in the $h$th stratum of the finite population. At the first stage of sampling, $k_h = c_h(K_h)$ clusters are sampled from stratum $h$ by SRSWOR, where $c_h(1) = 1$ and $c_h(t) \ge 2$ for $t \ge 2$. At the second stage of sampling, $n_{hi} = g_h(N_{hi})$ observations are taken from each sampled cluster, where $N_{hi}$ is the number of population units in this cluster. Let $\bar{y}_{hi}$ be the mean of these sampled observations.
(a) Show that an unbiased estimator of the population mean $\mu$ is given by
$$ \bar{y} = \sum_{h}\frac{K_h}{k_h}\sum_{i=1}^{k_h} N_{hi}\bar{y}_{hi}\Bigg/\sum_{h}\frac{K_h}{k_h}\sum_{i=1}^{k_h} N_{hi}. $$
(b) Also show that, if the f.p.c. is ignored, then an estimator of the variance of the estimator $\bar{y}$ under the concept of repeated sampling is given by
$$ v\left(\bar{y}\right) = \sum_{h}\frac{K_h^2}{k_h\left(k_h - 1\right)}\sum_{i=1}^{k_h}\left[N_{hi}\left(\bar{y}_{hi} - \bar{y}\right) - \frac{1}{k_h}\sum_{j=1}^{k_h} N_{hj}\left(\bar{y}_{hj} - \bar{y}\right)\right]^2\Bigg/\left(\sum_{h}\frac{K_h}{k_h}\sum_{i=1}^{k_h} N_{hi}\right)^2. $$
Hint: Korn and Graubard (1998).
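The estimator in part (a) of Exercise 10.20 is a ratio of two expansion sums; a small numerical sketch with invented cluster counts:

```python
# Per stratum: K[h] population clusters, and for each sampled cluster a pair
# (N_hi, ybar_hi) of cluster size and mean of its sampled observations.
K = [6, 4]
samp = [
    [(30, 2.0), (50, 3.0)],   # stratum 1: k_1 = 2 sampled clusters
    [(20, 1.0), (40, 2.5)],   # stratum 2: k_2 = 2 sampled clusters
]
num = sum(Kh / len(cl) * sum(N * yb for N, yb in cl) for Kh, cl in zip(K, samp))
den = sum(Kh / len(cl) * sum(N for N, _ in cl) for Kh, cl in zip(K, samp))
ybar = num / den      # estimate of the population mean mu
```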

Exercise 10.21. In Exercise 10.20 select the first stage $k_h$ clusters with PPSWOR sampling, using a 'size' cluster level variable ($Z$, say) for constructing the selection probabilities. At the second stage of sampling, $n_{hi} = g_h(N_{hi}) \ge 2$ observations are sampled by SRSWOR from the $i$th sampled cluster of the $h$th stratum. Show that the estimator of the variance of the estimator of the population mean defined as
$$ \bar{y}_1 = \sum_{h}\sum_{i=1}^{k_h}\frac{N_{hi}\bar{y}_{hi}}{\pi_{hi}}\Bigg/\sum_{h}\sum_{i=1}^{k_h}\frac{N_{hi}}{\pi_{hi}} $$
is given by
$$ v\left(\bar{y}_1\right) = \sum_{h}\frac{k_h}{k_h - 1}\sum_{i=1}^{k_h}\left[\left(\frac{N_{hi}\bar{y}_{hi}}{\pi_{hi}} - \bar{y}_1\frac{N_{hi}}{\pi_{hi}}\right) - \frac{1}{k_h}\sum_{j=1}^{k_h}\left(\frac{N_{hj}\bar{y}_{hj}}{\pi_{hj}} - \bar{y}_1\frac{N_{hj}}{\pi_{hj}}\right)\right]^2\Bigg/\left(\sum_{h}\sum_{i=1}^{k_h}\frac{N_{hi}}{\pi_{hi}}\right)^2. $$
Hint: Korn and Graubard (1998).

Exercise 10.22. Consider a population consisting of $N$ FSUs, where the $i$th FSU contains $M_i$ SSUs. Select a sample of $n$ FSUs by PPSWR (and by PPSWOR) sampling. From the $i$th selected FSU select $m_i$ SSUs by SRSWOR sampling. Derive an expression in each case for the optimum allocation of second stage units under the constraint that the total number of sampled SSUs remains fixed.
Hint: Rangarajan (1957), Chaudhuri and Arnab (1982).

Exercise 10.23. Consider $s_1$ an SRSWOR sample of $n$ units, and $s_2 = s_{2m} \cup s_{2u}$, where $s_{2m}$ is an SRSWOR sample of $m$ matched units from the sample $s_1$ and $s_{2u}$ is an independent SRSWOR sample of $u$ units from the population $\Omega$. Then show that the following estimators are unbiased estimators of the population mean:
$$ \bar{y}_1 = \phi\,\bar{y}_2^{*} + \left(1 - \phi\right)\left[\bar{y}_2' + b\left(\bar{y}_1 - \bar{y}_1'\right)\right], \qquad\text{where } 0 < \phi < 1, $$
and
$$ \bar{y}_2 = \psi\,\frac{m_2\bar{y}_2' + u_2\bar{y}_2^{*}}{m_2 + u_2} + \left(1 - \psi\right)\left[\bar{y}_2^{*} + b\left(\bar{y}_1 - \bar{y}_1^{*}\right)\right], $$
where $u_2 = u - m_2$ is the number of units in a sample $s_{2u}$ drawn from unmatched units in the population.
Hint: Pathak and Rao (1967).

Exercise 10.24. Consider sampling on three occasions in which a constant number $m = n\lambda$ of sampled units is retained from each occasion to the next, and a fresh sample of $u$ units is selected from the units not used up to that occasion. Assume that sampling is by SRSWOR and the total sample size on each occasion is $n$. Show that an unbiased estimator of the population mean $\bar{Y}$ at the third occasion is given by
$$ \bar{y}_s = a_1\left(\bar{y}_1' - \bar{y}_1^{*}\right) + a_2\left(\bar{y}_2' - \bar{y}_2^{*}\right) + a_3\left(\bar{y}_3' - \bar{y}_3^{*}\right). $$
Find the optimum values of the $a_i$, $i = 1, 2, 3$, by assuming that $S_1^2 = S_2^2$ and $\rho_{ij} = \rho$.
Hint: Singh (1968).

Exercise 10.25. Find the value of $\theta$ such that the following estimator based on two occasions is an unbiased estimator of the population mean:
$$ \bar{y}_s = \theta\left[\left(N - m\right)\bar{y}_{2u} + m\,\bar{y}_2^{*}\right] + \left(1 - \theta\right)\left[\bar{y}_2^{*} + b\left(\bar{y}_1 - \bar{y}^{*}\right)\right]. $$
Hint: Singh (1972).

Exercise 10.26. Derive an estimator of the parameter $\Gamma = \sum_{i=1}^{2} a_i\bar{Y}_i$, where the $a_i$, $i = 1, 2$, are constants. Assume that $s_1$ is an SRSWOR sample of $n$ units from the population $\Omega$, $s_{2m}$ is an SRSWOR sample of $m$ matched units, and $s_{2u}$ is an SRSWOR sample of $n - m$ unmatched units from the rest of the population $\Omega - s_1$. Obtain the optimum values of the sample sizes $n$ and $m$ such that the variance of the estimator you suggest is minimum for the fixed cost of the survey given by
$$ C = C_0 + nC_1 + mC_2 + C_3\left(n - m\right). $$
Hint: Kulldorff (1963).

Exercise 10.27. Find the bias and variance of the following estimator of the population mean $\bar{Y}$, defined as
$$ \bar{y}^{*} = \left[a\,\frac{\bar{y}_{m-r}}{\bar{z}_{m-r}} + \left(1 - a\right)\left\{\frac{\bar{y}_r}{\bar{z}_r} + \beta_{21}\left(\frac{\bar{x}_n}{\bar{z}_n} - \frac{\bar{x}_r}{\bar{z}_r}\right)\right\}\right]\bar{Z}, $$
where $0 \le a \le 1$, $\bar{y}_r$ ($\bar{x}_r$) denotes the mean of the matched sub-sample on the second (first) occasion, $\bar{z}_r$ is the mean of the auxiliary variable over the matched sub-sample, and
$$ \beta_{21} = \sum_{i=1}^{n}\left(y_i - R z_i\right)\left(x_i - R z_i\right)\Big/\sum_{i=1}^{n}\left(x_i - R z_i\right)^2 \qquad\text{with}\qquad R = \bar{y}\big/\bar{z}. $$
Hint: Feng and Zou (1997).

Exercise 10.28. Let $(x_i, y_i : i = 1, 2, \ldots, n)$ be the observed values of the variables on the first and second occasions, respectively, and let the corresponding true values be $(\eta_i, \omega_i : i = 1, 2, \ldots, N)$. An SRS sample of $n$ units is obtained on the first occasion. A random sub-sample of $m = n\lambda$, $0 < \lambda < 1$, units is retained for use on the second occasion. An independent unmatched SRS sample of $u = n - m = n\mu$ units is selected. Consider measurement errors defined as $\varepsilon_i = x_i - \eta_i$ and $e_i = y_i - \omega_i$ such that $E_m(\varepsilon_i \mid i) = 0$, $E_m(e_i \mid i) = 0$, $E_m(\varepsilon_i^2 \mid i) = \sigma_{i1}^2$ and $E_m(e_i^2 \mid i) = \sigma_{i2}^2$, where $E_m$ denotes expectation over the model. Let a single prime indicate units common to the two occasions and a double prime the units selected independently.

Consider the following estimator of the population mean:
$$ \bar{y}_{ss} = a\,\bar{y}'' + b\,\bar{y}' + c\,\bar{x}' + d\,\bar{x}''. $$
(a) Find the conditions such that $E(\bar{y}_{ss}) = \bar{Y}$, and show that under such conditions the minimum variance linear unbiased estimator of the population mean is given by
$$ \bar{y}_{mvlue} = d\left(\bar{x}'' - \bar{x}'\right) + \left(1 - b\right)\bar{y}'' + b\,\bar{y}'. $$
(b) Find the minimum variance of $\bar{y}_{mvlue}$ for the optimum values of the parameters involved in it.
Hint: Sud and Srivastava (2000).
(c) Let $\sigma_x^2$ and $\sigma_y^2$ be the population variances on the first and second occasions, respectively. Find the bias and variance of the following estimator of $\sigma_y^2$:
$$ \hat{\sigma}_y^2 = \frac{a}{u}\sum_{k=1}^{u} x_k^2 + \frac{b}{u(u-1)}\mathop{\sum\sum}_{k\ne j} x_k x_j + \frac{c}{m}\sum_{k=1}^{m} x_k^2 + \frac{d}{m(m-1)}\mathop{\sum\sum}_{k\ne j} x_k x_j + \frac{e}{m}\sum_{k=1}^{m} y_k^2 + \frac{f}{m(m-1)}\mathop{\sum\sum}_{k\ne j} y_k y_j + \frac{g}{u}\sum_{k=1}^{u} y_k^2 + \frac{h}{u(u-1)}\mathop{\sum\sum}_{k\ne j} y_k y_j, $$
where $a, b, c, d, e, f, g$ and $h$ are suitably chosen constants.
Hint: Sud, Srivastava, and Sharma (2001a, 2001b).

Exercise 10.29. Consider an SRS $s$ of $n$ units selected on the first occasion from a universe $\Omega$ of $N$ units, with measurements taken on two variables $y$ and $x$ on each of the two occasions in a bivariate normal population. While selecting the second sample, we assume that $m = pn$ $(0 < p < 1)$ of the units of the sample selected on the first occasion are retained for the second occasion (matched sample), and the remaining $u = n - m = nq$ $(q = 1 - p)$ units are replaced by a new selection from the universe of $(N - m)$ units left after omitting the $m$ units. Let $x_i$ ($y_i$) denote the $x$ ($y$) variables on the $i$th occasion, $i = 1, 2$. Consider the problem of estimating the ratio of two population means. Let $R_i = \bar{Y}_i/\bar{X}_i$, $i = 1, 2$, be the population ratio on the $i$th occasion, and let $\hat{R}_i = \bar{y}_i/\bar{x}_i$, $i = 1, 2$, be the estimator of the ratio on the $i$th occasion. Further let $\hat{R}_{im}$ and $\hat{R}_{iu}$ be the estimates of the ratio on the $i$th ($i = 1, 2$) occasion based on the matched and unmatched units, respectively. Consider the estimator $\hat{R}_2^{*}$ of the population ratio on the second occasion, $R_2$, given by
$$ \hat{R}_2^{*} = a\,\hat{R}_{1u} + b\,\hat{R}_{1m} + c\,\hat{R}_{2m} + d\,\hat{R}_{2u}. $$
(a) Show that the conditions on $a$, $b$, $c$ and $d$ for the estimator $\hat{R}_2^{*}$ to be unbiased are $a + b = 0$ and $c + d = 1$, so that the above estimator reduces to
$$ \hat{R}_2^{*} = a\left(\hat{R}_{1u} - \hat{R}_{1m}\right) + c\,\hat{R}_{2m} + \left(1 - c\right)\hat{R}_{2u}. $$
(b) The variance of $\hat{R}_2^{*}$ is given by
$$ V\left(\hat{R}_2^{*}\right) = a^2\left(\frac{1}{q} + \frac{1}{p}\right)\frac{A}{n\bar{X}_1^2} + c^2\,\frac{B}{pn\bar{X}_2^2} + \left(1 - c\right)^2\frac{B}{qn\bar{X}_2^2} - 2ac\,\mathrm{Cov}\left(\hat{R}_{1m}, \hat{R}_{2m}\right), $$
where
$$ A = S_{y_1}^2 + R_1^2 S_{x_1}^2 - 2R_1\mathrm{Cov}\left(y_1, x_1\right), \qquad B = S_{y_2}^2 + R_2^2 S_{x_2}^2 - 2R_2\mathrm{Cov}\left(y_2, x_2\right), $$
and
$$ \mathrm{Cov}\left(\hat{R}_{1m}, \hat{R}_{2m}\right) = \frac{1}{np\bar{X}_1\bar{X}_2}\left[\mathrm{Cov}\left(y_1, y_2\right) - R_1\mathrm{Cov}\left(y_2, x_1\right) - R_2\mathrm{Cov}\left(y_1, x_2\right) + R_1 R_2\mathrm{Cov}\left(x_1, x_2\right)\right]. $$
(c) Find the optimum values of $a$ and $c$ which minimize the variance $V(\hat{R}_2^{*})$, and find the minimum variance.
Hint: Artes and Garcia (2001e).

Exercise 10.30. Consider that the size $N$ of a population remains the same over two occasions, but the values of the units change over occasions. Let an SRSWOR sample of $n_1$ units be selected on the first occasion. Out of this sample, $n_2'$ units are retained on the second occasion, while a fresh sample of size $n_2''$ is drawn on the second occasion from the remaining $(N - n_1)$ units of the population, so that the total sample size on the second occasion becomes $n_2 = n_2' + n_2''$. Assume that information on an auxiliary variable $x$, which is positively correlated with $y$, is available on the second occasion.

Let
$\bar{Y}_i$ : the population mean of the study variable $y$ on the $i$th occasion ($i = 1, 2$);
$S_{iy}^2$ : the population mean square of $y$ on the $i$th occasion;
$\bar{y}_1$ : the sample mean based on the $n_1$ units drawn on the first occasion;
$\bar{y}_2'$ : the sample mean based on the $n_2'$ units observed on the second occasion and common with the first occasion;
$\bar{y}_2''$ : the sample mean based on the $n_2''$ units drawn afresh on the second occasion;
$\bar{y}_1'$ : the sample mean based on the $n_2'$ units common to both occasions and observed on the first occasion;
$\bar{X}_2$ : the population mean of the auxiliary variable $x$ on the second occasion;
$S_{2x}^2$ : the population mean square of $x$ on the second occasion;
$\bar{x}_2'$ : the sample mean of $x$ based on the $n_2'$ units common to both occasions and observed on the second occasion; and
$\bar{x}_2''$ : the sample mean of $x$ based on the $n_2''$ units drawn afresh on the second occasion.

Study the asymptotic properties of the estimator of the population mean on the second occasion, $\bar{Y}_2$, given by
$$ \hat{\bar{Y}}_2 = \varphi\,\bar{y}_2^{**} + \left(1 - \varphi\right)\bar{y}_2^{*}, $$
where $\varphi$ is a constant to be determined,
$$ \bar{y}_2^{**} = \bar{y}_2'' + \beta_{2xy}\left(\bar{X}_2 - \bar{x}_2''\right), \qquad \bar{y}_2^{*} = \bar{y}_2' + \beta_{21}\left(\bar{y}_1 - \bar{y}_1'\right) + \beta_{2xy}\left(\bar{X}_2 - \bar{x}_2'\right), $$
$\beta_{2xy}$ is the regression coefficient of $y$ on $x$ on the second occasion, and $\beta_{21}$ stands for the regression coefficient of the variable $y$ of the second occasion on the same variate of the first occasion; both $\beta_{21}$ and $\beta_{2xy}$ are known.
Hint: Singh and Singh (2001).

Exercise 10.31. A sample $s_1$ of size $n$ is selected from a population $\Omega$ by the RHC scheme using $p_i = x_i/X$; the second sample is $s_2 = s_{2m} \cup s_{2u}$, with $s_{2m}$ an SRSWOR sample of $m$ units taken from $s_1$, and $s_{2u}$ an independent sample drawn from the population $\Omega$ by the RHC scheme using the same $p_i$ measures.
(I) Study the bias and variance of the following estimator:
$$ \hat{Y} = \phi\,\hat{Y}_{2m} + \left(1 - \phi\right)\hat{Y}_{2u}, $$
where
$$ \hat{Y}_{2m} = \sum_{i\in s_{2m}}\frac{\left(y_{2i} - y_{1i}\right)\tau_{1i}}{(m/n)\,p_i} + \sum_{i\in s_1}\frac{y_{1i}\,\tau_{1i}}{p_i} \qquad\text{and}\qquad \hat{Y}_{2u} = \sum_{i\in s_{2u}}\frac{y_{2i}\,\tau_{2i}}{p_i}, $$
with $\tau_{1i}$ and $\tau_{2i}$ the totals of the $p_i$ values included in the $i$th group while selecting $s_1$ and $s_{2u}$ using the RHC scheme.
Hint: Ghangurde and Rao (1969).
(II) Select $s_1$ and $s_{2u}$ in the same way as above, but select $s_{2m}$ from $s_1$ following the RHC method using $\tau_i$ in place of $p_i$. Let $\hat{\tau}_i$ denote the total of the $\tau_i$ over the RHC groups while selecting the $s_{2m}$ sample. Under this scheme find the bias and variance of the estimator
$$ \hat{Y}_c = \phi_c\,\hat{Y}_{2m}^{c} + \left(1 - \phi_c\right)\hat{Y}_{2u}^{c}, $$
where
$$ \hat{Y}_{2m}^{c} = \sum_{i\in s_{2m}}\frac{\left(y_{2i} - y_{1i}\right)\hat{\tau}_i}{(m/n)\,\tau_i} + \sum_{i\in s_1}\frac{y_{1i}\,\tau_{1i}}{p_i} \qquad\text{and}\qquad \hat{Y}_{2u}^{c} = \sum_{i\in s_{2u}}\frac{y_{2i}\,\tau_{2i}}{p_i}. $$
Hint: Chotai (1974).

Practical 10.1. Out of 10 clusters or continents listed in population 5 of the


Appendix select 4 continents with SRSWOR sampling as first stage units (FSU) .
From each one of the selected continents select two countries by SRSWOR
sampling and record the production of the tobacco crop. Estimate the total
production in the world. Also estimate the standard error of the estimator used for
estimating the total production in the world.
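A sketch of the computation behind Practical 10.1, with invented counts and production figures in place of the Appendix data; the expansion estimator assumed is the standard two-stage SRSWOR one, $\hat{Y} = (N/n)\sum_{i\in s}(M_i/m_i)\sum_j y_{ij}$:

```python
import numpy as np

# N = 10 continents; n = 4 sampled by SRSWOR.  Within sampled continent i,
# m_i = 2 of its M_i countries are sampled by SRSWOR.
rng = np.random.default_rng(1)
N, n, m_i = 10, 4, 2
M = np.array([6, 9, 4, 7])                           # countries per sampled continent
y = [rng.uniform(10.0, 100.0, size=m_i) for _ in M]  # tobacco production figures

Y_hat = N / n * sum(Mi / m_i * yi.sum() for Mi, yi in zip(M, y))
```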

Practical 10.2. Divide the United States of America into four independent strata as
Northeast, Midwest, South and West. Suppose the first stratum Northeast consists
of two clusters New England and Mid Atlantic, the second stratum Midwest
consists of two clusters East North Central and West North Central, the third
stratum South consists of three clusters South Atlantic, East South Central, and
West South Central, and the fourth stratum West consists of two clusters Mountain and Pacific, as given in population 7 in the Appendix. From each stratum (or region) select two clusters (or divisions) by SRSWOR sampling, and within each selected cluster (or division) select two units (or states) and collect information on the projected population counts during 2000.

( a ) Suggest an estimator for estimating population total in the US using stratified


multi-stage design .
( b ) Estimate the variance of the estimator of population total by using the
technique of the Jackknife.
( c) Find the 95% confidence interval for total projected population during 2000 in
the USA .

Practical 10.3. The data in population 9 of the Appendix relate to the number of immigrants coming to 51 states in the United States during 1994--1996. Select a sample $s_1$ of 20 states by the PPSWR method using the number of immigrants in 1994 ($z$) as the measure of size. From the selected sample $s_1$, select a sub-sample $s_m$ of size $m = 12$ by the SRSWOR method, assuming all elements of $s_1$ are distinct. Finally select an independent sample $s_u$ of size $u = (20 - 12) = 8$ with PPSWR using $x$ as the size measure. Estimate the total number of immigrants in 1996 using the composite estimator. Compare your estimate with the estimate obtained in Example 10.9.1 and comment on it.

Practical 10.4. The data in population 9 of the Appendix relate to the number of immigrants coming to 51 states in the United States during 1994--1996. Consider sampling on three occasions in which a constant number $m = 5$ of sampled units is retained from each occasion to the next, and a fresh sample of $u = 5$ units is selected from the units not used up to that occasion, so that the total sample size on each occasion is $n = m + u = 2 \times 5 = 10$. Suppose that sampling is by SRSWOR. Estimate the population mean $\bar{Y}$ at the third occasion by using the estimator
$$ \bar{y}_s = a_1\left(\bar{y}_1' - \bar{y}_1^{*}\right) + a_2\left(\bar{y}_2' - \bar{y}_2^{*}\right) + a_3\left(\bar{y}_3' - \bar{y}_3^{*}\right), $$
where the sample means $\bar{y}_j'$, $\bar{y}_j^{*}$ and the real constants $a_j$ for $j = 1, 2, 3$ have their usual meanings.
Hint: Singh (1968).
11. RANDOMIZED RESPONSE SAMPLING: TOOLS FOR
SOCIAL SURVEYS

The randomized response technique (RRT) is useful for reducing response error problems when potentially sensitive questions such as the illegal use of drugs, sexual practice, illegal earnings, or the incidence of acts of domestic violence are included in surveys of human populations. Direct questioning of respondents about sensitive issues often results in either refusal or falsification of the answers. Social stigma and fear of reprisals sometimes result in untruthful, exaggerated, or misleading responses by respondents when approached with conventional survey methods. Warner (1965) was the first to suggest an ingenious method of counteracting such fears in responses to sensitive questions.

It is widely understood that direct questions on sensitive issues often result in


deliberately incorrect answers and/or refusal to respond. The real essence of Warner's (1965) randomized response technique is to reduce the response error of potentially 'sensitive questions'. Warner, while estimating $\pi$, the proportion of the population possessing a stigmatized character (say) $A$, used two randomized questions inquiring about the respondent's status in relation to the sensitive character only:
(a) Are you a member of group $A$?
(b) Are you not a member of group $A$?
Warner (1965) used a spinner (say) whose background consists of two different shapes, viz., circular and heart, representing proportions $P$ and $(1 - P)$ of the total area, as shown in Figure 11.1.1.

Fig 11.1.1 Spinner as a device.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003
890 Advanced sampling theory with applications

Every interviewee selected in the sample is requested to rotate the spinner


unobserved by the interviewer. If the pointer stops in the circular region and the interviewee possesses the character A, then he/she is requested to report 'Yes', and if the pointer stops in the heart region he/she is requested to report 'No'. Similarly, if the spinner stops in the heart region and the interviewee does not possess the character A, he/she is again requested to report 'Yes', otherwise 'No'. Note that
the response 'Yes' comes from two possibilities and also the response 'No' comes
from two possibilities, so the respondent's privacy is maintained. One can see
from the following table:

Response from Warner's model:

Status of respondent      | Pointer in circle (prob. P) | Pointer in heart (prob. 1 − P)
Member of group A         | Yes                         | No
Not a member of group A   | No                          | Yes

Evidently the probability of a 'Yes' answer is given by

θ_w = Pπ + (1 − P)(1 − π).     (11.1.1)
Let n₁ be the number of 'Yes' responses out of an SRSWR sample of n respondents. Let Y_i be a discrete random variable such that

Y_i = 1 if the i-th selected person reported 'Yes',
    = 0 otherwise.

Each Y_i is a Bernoulli variable with success probability θ_w, so the number of 'Yes' responses n₁ = Σ Y_i follows a Binomial distribution with parameters n and θ_w, that is, n₁ ~ Binomial(n, θ_w). Note that n₁ responses are 'Yes' and (n − n₁) responses are 'No'; thus the likelihood function is given by

L(θ_w) = ⁿC_{n₁} (θ_w)^{n₁} (1 − θ_w)^{n−n₁}.     (11.1.2)

Taking logarithms on both sides of (11.1.2) we have

log{L(θ_w)} = log{ⁿC_{n₁}} + n₁ log(θ_w) + (n − n₁) log(1 − θ_w).

On setting

∂ log{L(θ_w)}/∂θ_w = 0,

we get the maximum likelihood estimate of θ_w as

θ̂_w = n₁/n.     (11.1.3)

Thus we have the following theorem:


Chapter 11.: Randomized response sampling : Tools for social surveys 891

Theorem 11.1.1. An unbiased estimator of the population proportion π is given by

π̂_w = [θ̂_w − (1 − P)]/(2P − 1),  P ≠ 0.5,     (11.1.4)

where θ̂_w is the observed proportion of 'Yes' answers in the sample of n units drawn by SRSWR sampling.
Proof. Note that n θ̂_w ~ B(n, θ_w); taking expected values on both sides of (11.1.4) we have

E(π̂_w) = [E(θ̂_w) − (1 − P)]/(2P − 1) = [θ_w − (1 − P)]/(2P − 1) = π.

Hence the theorem.

Theorem 11.1.2. The variance of the estimator π̂_w is given by

V(π̂_w) = π(1 − π)/n + P(1 − P)/{n(2P − 1)²}.     (11.1.5)

Proof. Note that n θ̂_w ~ B(n, θ_w), so we have

V(π̂_w) = V{[θ̂_w − (1 − P)]/(2P − 1)} = V(θ̂_w)/(2P − 1)² = θ_w(1 − θ_w)/{n(2P − 1)²},

which on simplification reduces to (11.1.5). Hence the theorem.

Theorem 11.1.3. An unbiased estimator of V(π̂_w) is given by

v̂(π̂_w) = θ̂_w(1 − θ̂_w)/{(n − 1)(2P − 1)²}.     (11.1.6)

Proof. Since E{θ̂_w(1 − θ̂_w)} = (n − 1)θ_w(1 − θ_w)/n, taking expected values on both sides of (11.1.6) and comparing with (11.1.5) proves the result.

Example 11.1.1. Michael believes that 'churches' and 'academic institutes' were made for honest people, but after watching the movie 'Wonder Boys' (2000), he felt that, due to the increase of unscrupulous people in the world, there may be academic research cheaters (or academic thieves) in the institutes. He selected an SRSWR sample of 20,000 researchers across the world from different institutes by using a randomization device, like the spinner, producing 70% of the statements, 'Have you ever stolen your colleagues' research papers or books?' along with 30% of the statements, 'Have you never stolen your colleagues' research papers or books?' Out of the 20,000 selected researchers, 6,060 reported 'Yes' through the above randomization device. Estimate the proportion of 'academic thieves' in the world, and also construct a 95% confidence interval estimate.
Solution. Here P = 0.7, n = 20,000 and n₁ = 6,060. Thus the observed proportion of 'Yes' answers is given by

θ̂_w = 6060/20000 = 0.303.

Using Warner's (1965) estimator, an estimate of the true proportion of academic thieves is given by

π̂_w = [θ̂_w − (1 − P)]/(2P − 1) = [0.303 − (1 − 0.7)]/(1.4 − 1) = 0.003/0.4 = 0.0075.

An unbiased estimate of V(π̂_w) is given by

v̂(π̂_w) = θ̂_w(1 − θ̂_w)/{(n − 1)(2P − 1)²} = 0.303(1 − 0.303)/{19999 × (0.4)²} = 0.211191/3199.84 = 0.000066.

Thus a 95% confidence interval estimate of the true proportion of 'academic thieves' is given by

π̂_w ± Z_{α/2}√v̂(π̂_w), or 0.0075 ± 1.96√0.000066, or [−0.00842, 0.02342],

which, since a proportion cannot be negative, may be reported as [0, 0.02342]. Thus Michael claims with 95% confidence that at most about 2.3% of the researchers in this world are 'academic thieves', and he is pleased that this proportion is very negligible.
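The mechanics of Warner's device, the estimator (11.1.4), and the variance formula (11.1.5) can be checked with a small Monte Carlo sketch (the true proportion, sample sizes, and seed below are illustrative assumptions, not from the text):

```python
import random

def warner_simulate(pi_true, P, n, rng):
    """Simulate one Warner randomized-response survey and return pi-hat (11.1.4)."""
    yes = 0
    for _ in range(n):
        member = rng.random() < pi_true      # respondent's true (hidden) status
        direct = rng.random() < P            # spinner lands on "Are you in A?"
        yes += member if direct else (not member)
    theta_hat = yes / n                      # observed 'Yes' proportion
    return (theta_hat - (1 - P)) / (2 * P - 1)

rng = random.Random(42)
pi_true, P, n = 0.10, 0.7, 2000
estimates = [warner_simulate(pi_true, P, n, rng) for _ in range(500)]
mean_est = sum(estimates) / len(estimates)
emp_var = sum((e - mean_est) ** 2 for e in estimates) / (len(estimates) - 1)
# Theoretical variance (11.1.5): ordinary binomial part plus the device part.
theo_var = pi_true * (1 - pi_true) / n + P * (1 - P) / (n * (2 * P - 1) ** 2)
print(round(mean_est, 3), round(emp_var, 6), round(theo_var, 6))
```

The empirical mean of the estimates should sit near the true π, and the empirical variance near (11.1.5), illustrating how most of the variance comes from the randomizing device rather than from the sampling itself.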

Let π represent the proportion of individuals in a population who belong to group A (e.g., the proportion of women who have had an abortion). Franklin (1989a, 1989b) considers that a with replacement simple random sample of n > 1 individuals is chosen. This is usually the case and eliminates concerns about finite population correction factors. A total of k ≥ 1 trials per respondent are conducted. For respondent i on trial j, random values are drawn from the densities g_ij and h_ij (where we assume independence of all densities). The respondent, but not the interviewer, sees both values and is asked to report the value from g_ij if he/she belongs to group A and the value from h_ij otherwise.
,
Response from Franklin's model:

Status of the respondent | Belongs to group A      | Does not belong to group A
Reported value           | value drawn from g_ij   | value drawn from h_ij

The interviewer knows the exact form of g_ij and h_ij but sees only the value reported by the respondent. This value we denote by Z_ij. The interviewer does not know for certain from which of the two densities it comes. From the total of kn observations of Z_ij (i = 1, 2, ..., n; j = 1, 2, ..., k) inference can be made about π. The conditional density of a k-tuple (z_i1, z_i2, ..., z_ik), representing a random observation of the i-th respondent, given π, is

π ∏_{j=1}^{k} g_ij(z_ij) + (1 − π) ∏_{j=1}^{k} h_ij(z_ij).

Thus (Z_i1, Z_i2, ..., Z_ik) can be represented as the mixture of random variables

(Z_i1, Z_i2, ..., Z_ik) = P(X_i1, X_i2, ..., X_ik) + (1 − P)(Y_i1, Y_i2, ..., Y_ik),     (11.2.1)

where P is Bernoulli(π) and X_ij ~ g_ij and Y_ij ~ h_ij. The model can be specialised by having g_ij = g_j and h_ij = h_j for all i. Thus all respondents are observing the same distributions, which allows us to perceive the k-tuple responses (Z_i1, Z_i2, ..., Z_ik) as independent, identically distributed (i.i.d.) random vectors of the form

(Z_1, Z_2, ..., Z_k) = P(X_1, X_2, ..., X_k) + (1 − P)(Y_1, Y_2, ..., Y_k)     (11.2.2)
from the density

JTng)Z;)+(I-JT)nhj(zJ (11.2.3)
j=l j=l

Although any density may be used for g_j or h_j, the Franklin (1989a, 1989b) model considers both as normal densities with known means μ_1j and μ_2j and known standard deviations σ_1j and σ_2j respectively. The choice of g_j and h_j as Bernoulli with h_j(z) = 1 − g_j(z) and k = 1 reduces this model to Warner's (1965) original model, in which the densities h_j and g_j are dependent. Also, if g_j denotes the distribution generated by the first deck of cards with known proportion θ₁ of red cards and h_j denotes the distribution generated by the second deck of cards with known proportion θ₂ (≠ θ₁) of red cards, then Franklin's model reduces to the model suggested by Kuk (1990). It is a well known result that estimators derived from the method of moments (MM) are usually not preferred over maximum likelihood (ML) estimates, but under certain circumstances they might be. Franklin has shown that there is one such possible instance, where MM estimators and their associated variances can be found analytically while those of the ML estimators cannot be obtained. This allows confidence intervals and tests of hypotheses to be constructed for these MM estimators. Two obvious MM estimators can be derived, one by concentrating on the row averages of Z_ij and the other on its column averages.
Let

Z̄_i· = (1/k)[Z_i1 + Z_i2 + ... + Z_ik] and Z̄_·j = (1/n)[Z_1j + Z_2j + ... + Z_nj],     (11.2.4)

for i = 1, 2, ..., n and j = 1, 2, ..., k, represent the row and column averages respectively. Concentrating on the row averages gives

E(Z̄_i·) = (1/k) Σ_{j=1}^{k} E(Z_j),     (11.2.5)

since the k-tuple responses are independently and identically distributed (i.i.d.). Now

E(Z_j) = π μ_1j + (1 − π) μ_2j for j = 1, 2, ..., k.     (11.2.6)

Thus

E(Z̄_i·) = (1/k) Σ_{j=1}^{k} [π μ_1j + (1 − π) μ_2j] = (1/k)[π m₁ + (1 − π) m₂],     (11.2.7)

where m₁ = Σ_{j=1}^{k} μ_1j and m₂ = Σ_{j=1}^{k} μ_2j. From these the standard MM approach gives

π̂₁ = (k Z̄ − m₂)/(m₁ − m₂) = (Σ_{j=1}^{k} Z̄_·j − m₂)/(m₁ − m₂),     (11.2.8)

where Z̄ denotes the grand mean of all kn observations. Since the k-tuple responses are i.i.d., it can easily be shown by (11.2.6) that

E(π̂₁) = π,

with the variance of π̂₁ given by

V(π̂₁) = k² V(Z̄)/(m₁ − m₂)² = V(Σ_{j=1}^{k} Z_j)/{n(m₁ − m₂)²},     (11.2.9)

since V(Z̄) = V(Z̄_i·)/n.

Now the observations Z₁, Z₂, ..., Z_k from a single respondent are not independent and need not even be identically distributed. In fact, for example, the joint density of (z_i, z_j) of two trials from a single respondent is given by

f_ij(z_i, z_j) = π g_i(z_i) g_j(z_j) + (1 − π) h_i(z_i) h_j(z_j).     (11.2.10)

It can be easily seen that

V(Z_i) = π(1 − π)(μ_1i − μ_2i)² + π σ²_1i + (1 − π) σ²_2i,     (11.2.11)

and

Cov(Z_i, Z_j) = π(1 − π)(μ_1i − μ_2i)(μ_1j − μ_2j).     (11.2.12)

The correlation between Z_i and Z_j is given by

Corr(Z_i, Z_j) = π(1 − π)(μ_1i − μ_2i)(μ_1j − μ_2j) / √[{π(1 − π)(μ_1i − μ_2i)² + π σ²_1i + (1 − π) σ²_2i}{π(1 − π)(μ_1j − μ_2j)² + π σ²_1j + (1 − π) σ²_2j}].     (11.2.13)

Thus as σ_1i, σ_1j, σ_2i, σ_2j all → 0, then Corr(Z_i, Z_j) → 1. Also, as (μ_1i − μ_2i) and (μ_1j − μ_2j) → ∞, then Corr(Z_i, Z_j) → 1. This is apparent since knowledge of Z_i for a particular respondent under these circumstances gives perfect predictability for Z_j given by the same respondent.
Equations (11.2.9), (11.2.11), and (11.2.12) give the variance of π̂₁ as

V(π̂₁) = [ Σ_{j=1}^{k} V(Z_j) + 2 Σ_{i<j} Cov(Z_i, Z_j) ]/{n(m₁ − m₂)²}
       = π(1 − π)/n + [ π Σ_{j=1}^{k} σ²_1j + (1 − π) Σ_{j=1}^{k} σ²_2j ]/{n(m₁ − m₂)²}.     (11.2.14)

Thus the variance of π̂₁ can be thought of as the variance from the ordinary sampling plus an additional term due to the randomizing effect.
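A minimal simulation of Franklin's row-average estimator π̂₁ of (11.2.8) under assumed normal densities g_j and h_j (all means, standard deviations, sample sizes, and the seed below are hypothetical choices for illustration):

```python
import random

# Hypothetical design: k = 3 trials, normal g_j and h_j with the means/sds below.
MU1 = [3.0, 4.0, 5.0]   # means mu_1j of g_j (group-A densities)
MU2 = [0.0, 0.5, 1.0]   # means mu_2j of h_j
SD1 = [1.0, 1.0, 1.0]
SD2 = [1.0, 1.0, 1.0]

def franklin_sample(pi, n, rng):
    """Each respondent reports one value per trial: from g_j if in A, else h_j."""
    data = []
    for _ in range(n):
        in_A = rng.random() < pi
        row = [rng.gauss(MU1[j] if in_A else MU2[j], SD1[j] if in_A else SD2[j])
               for j in range(len(MU1))]
        data.append(row)
    return data

def pi_hat_1(data):
    """Row-average MM estimator (11.2.8): (k * grand_mean - m2) / (m1 - m2)."""
    k = len(MU1)
    m1, m2 = sum(MU1), sum(MU2)
    grand = sum(sum(r) for r in data) / (len(data) * k)
    return (k * grand - m2) / (m1 - m2)

rng = random.Random(7)
est = [pi_hat_1(franklin_sample(0.3, 1000, rng)) for _ in range(200)]
print(round(sum(est) / len(est), 3))
```

Repeating the survey many times, the average of the π̂₁ values should land near the true π, in line with the unbiasedness shown above.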

Now concentrating upon the column averages gives

E(Z̄_·j) = (1/n) Σ_{i=1}^{n} E(Z_ij) = E(Z_j) = π μ_1j + (1 − π) μ_2j     (11.2.15)

for j = 1, 2, ..., k.

Hence a second type of MM estimator can be formed from the average of the j-th column of observations by setting Z̄_·j = E(Z̄_·j) and solving for π to have

π̂_·j = (Z̄_·j − μ_2j)/(μ_1j − μ_2j).     (11.2.16)

Thus a total of k such estimators (not independent of one another) for j = 1, 2, ..., k can be formed. Each of these estimators is an unbiased estimator of π. The variance of π̂_·j is given, since observations are i.i.d., by

V(π̂_·j) = V(Z̄_·j)/(μ_1j − μ_2j)² = V(Z_j)/{n(μ_1j − μ_2j)²}.     (11.2.17)

Using (11.2.11) and (11.2.17) we have

V(π̂_·j) = π(1 − π)/n + [π σ²_1j + (1 − π) σ²_2j]/{n(μ_1j − μ_2j)²}.     (11.2.18)

Furthermore

Cov(π̂_·i, π̂_·j) = Cov(Z̄_·i, Z̄_·j)/{(μ_1i − μ_2i)(μ_1j − μ_2j)}.     (11.2.19)

Note that Z_pi and Z_mj are independent if p ≠ m, that is, responses from different respondents are independent. But since the responses are i.i.d. we have from (11.2.12)

Cov(π̂_·i, π̂_·j) = n Cov(Z_i, Z_j)/{n²(μ_1i − μ_2i)(μ_1j − μ_2j)} = π(1 − π)/n.     (11.2.20)

Any weighted average of the k estimators π̂_·1, π̂_·2, ..., π̂_·k will also be an unbiased estimator for π. An estimator of this type, which (11.2.16) suggests, is one where the weights are proportional to the differences in the means of g_j and h_j. Specifically, let

w_j = |μ_1j − μ_2j| / Σ_{j=1}^{k} |μ_1j − μ_2j| = D_j/D,     (11.2.21)

where D_j = |μ_1j − μ_2j| and D = Σ_{j=1}^{k} D_j for j = 1, 2, ..., k.

Also let us take the second method of moments estimator π̂₂ given by

π̂₂ = Σ_{j=1}^{k} w_j π̂_·j.     (11.2.22)

Thus by (11.2.18), (11.2.19), and (11.2.21), we have

V(π̂₂) = Σ_{j=1}^{k} w_j² V(π̂_·j) + 2 Σ_{i<j} w_i w_j Cov(π̂_·i, π̂_·j)
       = π(1 − π)/n + [ π Σ_{j=1}^{k} σ²_1j + (1 − π) Σ_{j=1}^{k} σ²_2j ]/(n D²).     (11.2.23)

Thus this second MM estimator has a variance similar to the first MM estimator. A comparison of the variances of the two MM estimators π̂₁ and π̂₂ from (11.2.14) and (11.2.23) reveals that they will be identical if and only if all the terms (μ_1j − μ_2j) have the same sign for j = 1, 2, ..., k, for then

(m₁ − m₂)² = { Σ_{j=1}^{k} (μ_1j − μ_2j) }² = D².     (11.2.24)

In any other case the denominator of the right hand side of (11.2.14) will contain both positive and negative terms and will result in V(π̂₁) ≥ V(π̂₂). In fact, when all (μ_1j − μ_2j) for j = 1, 2, ..., k have the same sign we have π̂₁ = π̂₂. Thus, to conclude, the two MM estimators will be identical if all terms (μ_1j − μ_2j) for all j have the same sign. In any other case, π̂₂ will have the smaller variance. This aspect of keeping the sign of all (μ_1j − μ_2j) the same is a particularly important design consideration in using the π̂₁ estimator. It is also seen that (m₁ − m₂) must not be equal to or near zero; if this condition is violated the variance of π̂₁ could be extremely large. On the other hand, for the estimator π̂₂ it is seen that the variance
can be decreased by making at least one absolute difference large. Singh and Singh
(1992a, 1993d) have suggested some improvements over Franklin's model. Arnab (1996) developed a unified setup for Singh and Singh's models. More details about randomized response sampling can be had from Chaudhuri and Mukherjee (1988), Scheers (1992), and Tracy and Mangat (1996a).
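The design point about signs can be illustrated with a sketch: with k = 2 and mean differences of opposite sign, the weighted column estimator π̂₂ of (11.2.22) should show a much smaller empirical variance than the row estimator π̂₁ (all numeric settings below are hypothetical):

```python
import random

# Hypothetical k = 2 design with opposite-sign mean differences, where the
# weighted column estimator pi2 should beat the row estimator pi1.
MU1 = [3.0, 0.0]; MU2 = [0.0, 2.5]; SD = 1.0   # (mu11-mu21)>0, (mu12-mu22)<0

def respond(pi, rng):
    in_A = rng.random() < pi
    mus = MU1 if in_A else MU2
    return [rng.gauss(m, SD) for m in mus]

def estimators(data):
    k, n = 2, len(data)
    m1, m2 = sum(MU1), sum(MU2)
    col = [sum(r[j] for r in data) / n for j in range(k)]
    pi1 = (sum(col) - m2) / (m1 - m2)                    # (11.2.8)
    d = [MU1[j] - MU2[j] for j in range(k)]
    w = [abs(dj) / sum(abs(x) for x in d) for dj in d]   # weights (11.2.21)
    pij = [(col[j] - MU2[j]) / d[j] for j in range(k)]   # (11.2.16)
    pi2 = sum(wj * pj for wj, pj in zip(w, pij))         # (11.2.22)
    return pi1, pi2

rng = random.Random(1)
reps = [estimators([respond(0.3, rng) for _ in range(400)]) for _ in range(300)]
var = lambda xs: sum((x - sum(xs) / len(xs)) ** 2 for x in xs) / (len(xs) - 1)
v1, v2 = var([r[0] for r in reps]), var([r[1] for r in reps])
print(v1 > v2)   # mixed signs: pi1 should have the larger variance
```

Here m₁ − m₂ = 0.5 is near zero while D = 5.5, so (11.2.14) and (11.2.23) predict a large gap between the two variances, which the simulation makes visible.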

In Warner's model both questions refer to the same sensitive character A or its complement Aᶜ. Greenberg, Abul-Ela, Simmons, and Horvitz (1969) felt that to protect the privacy of the respondents, it is desirable that the two questions be unrelated, and suggested an unrelated question model. In Greenberg et al.'s pioneering unrelated question model the data gathering randomization device consists of two questions or statements:
( i ) Are you a member of group A?
( ii ) Are you a member of group Y?
where the character Y, or its complement, is innocuous and unrelated to A. For example, in estimating the proportion of persons having extra marital relations in a certain community the two questions may be:

( a ) Are you having extra marital relations?


(b) Were you born in the month of March?

Clearly the second question has nothing to do with extra marital relations. An SRSWR sample of n units is drawn from the population. Greenberg, Abul-Ela, Simmons, and Horvitz (1969), in their theoretical development, dealt with the two situations of π_y (the proportion of the unrelated character Y, say) being known and unknown.

11.3.1 WHEN THE PROPORTION OF THE UNRELATED CHARACTER IS KNOWN

Each interviewee selected in the sample chooses questions ( i ) and ( ii ) with probabilities P and (1 − P) respectively. Assuming that the proportion π_y of the unrelated character is known, the probability θ of a 'Yes' answer is given by

θ = Pπ + (1 − P)π_y.     (11.3.1.1)

Let n₁ be the number of observed 'Yes' answers in a sample of n units, so that

θ̂ = n₁/n.

Then we have the following theorems:

Theorem 11.3.1. An unbiased estimator of π is given by

π̂_g = [θ̂ − (1 − P)π_y]/P.     (11.3.1.2)

Proof. It follows because n θ̂ ~ B(n, θ). Hence the theorem.

Theorem 11.3.2. The variance of the estimator π̂_g is given by

V(π̂_g) = θ(1 − θ)/(nP²).     (11.3.1.3)

Proof. Note that n θ̂ ~ B(n, θ), therefore

V(θ̂) = θ(1 − θ)/n.

Hence the theorem.
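A quick simulation sketch of the unrelated question device with known π_y, using the estimator (11.3.1.2) (the true proportions, device probability, sample sizes, and seed are illustrative assumptions):

```python
import random

def greenberg_known(pi, pi_y, P, n, rng):
    """Unrelated-question device with known pi_y: estimate pi via (11.3.1.2)."""
    yes = 0
    for _ in range(n):
        in_A = rng.random() < pi        # hidden sensitive status
        in_Y = rng.random() < pi_y      # hidden innocuous status
        asked_A = rng.random() < P      # device picks the sensitive question
        yes += in_A if asked_A else in_Y
    theta_hat = yes / n
    return (theta_hat - (1 - P) * pi_y) / P

rng = random.Random(3)
ests = [greenberg_known(0.1, 0.2, 0.7, 2000, rng) for _ in range(300)]
print(round(sum(ests) / len(ests), 3))
```

Averaged over many repetitions the estimates should center on the true π, as Theorem 11.3.1 asserts.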

11.3.2 WHEN THE PROPORTION OF THE UNRELATED CHARACTER IS UNKNOWN

Here π_y, the proportion of the neutral character (say) Y in the population, is unknown. In such a situation, Greenberg, Abul-Ela, Simmons, and Horvitz (1969) suggested taking two independent SRSWR samples of sizes n₁ and n₂ from the population. In the i-th sample, P_i and (1 − P_i) denote the probabilities of selecting the statements regarding the possession of the sensitive characteristic A and the non-sensitive characteristic Y in the randomized response device R_i used for the respondents in the i-th sample, i = 1, 2, so that the probability of a 'Yes' answer in the i-th sample is given by

θ_i = P_i π + (1 − P_i) π_y,  i = 1, 2.     (11.3.2.1)

Then we have the following theorems:

Theorem 11.3.2.1. An unbiased estimator of π is given by

π̂₀ = [(1 − P₂)θ̂₁ − (1 − P₁)θ̂₂]/(P₁ − P₂),     (11.3.2.2)

where θ̂₁ and θ̂₂ are the observed proportions of 'Yes' answers in the first and second sample, respectively.
Proof. Solve the two equations given by (11.3.2.1) for π and use the method of moments. The unbiasedness is clear from the fact that n_i θ̂_i ~ B(n_i, θ_i).

Theorem 11.3.2.2. The variance of the estimator π̂₀ is given by

V(π̂₀) = [ (1 − P₂)² θ₁(1 − θ₁)/n₁ + (1 − P₁)² θ₂(1 − θ₂)/n₂ ]/(P₁ − P₂)²,     (11.3.2.3)

where θ_i = P_i π + (1 − P_i)π_y, i = 1, 2, denotes the probability of a 'Yes' answer in the first and second sample, respectively.
Proof. It follows from n_i θ̂_i ~ B(n_i, θ_i), so that

V(θ̂_i) = θ_i(1 − θ_i)/n_i,  i = 1, 2.

Hence the theorem.
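The two-sample case can be sketched the same way: each sample is run with its own P_i, and (11.3.2.2) combines the two observed 'Yes' proportions to eliminate the unknown π_y (all numeric settings below are illustrative assumptions):

```python
import random

def rr_sample(pi, pi_y, P, n, rng):
    """Observed 'Yes' proportion from one unrelated-question sample."""
    yes = sum((rng.random() < pi) if (rng.random() < P) else (rng.random() < pi_y)
              for _ in range(n))
    return yes / n

def pi_hat_two_sample(t1, t2, P1, P2):
    """Greenberg et al. (1969) estimator (11.3.2.2) when pi_y is unknown."""
    return ((1 - P2) * t1 - (1 - P1) * t2) / (P1 - P2)

rng = random.Random(11)
ests = []
for _ in range(300):
    t1 = rr_sample(0.1, 0.2, 0.7, 1500, rng)   # first sample, P1 = 0.7
    t2 = rr_sample(0.1, 0.2, 0.3, 1500, rng)   # second sample, P2 = 0.3
    ests.append(pi_hat_two_sample(t1, t2, 0.7, 0.3))
print(round(sum(ests) / len(ests), 3))
```

Note that π_y never appears in the estimator itself; it is absorbed by the difference of the two weighted 'Yes' proportions.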

For the optimum choice of n₁ and n₂ (obtained by setting ∂V(π̂₀)/∂n₁ = 0 subject to n₁ + n₂ = n), the minimum variance of the estimator π̂₀ is

Min.V(π̂₀) = [ (1 − P₂)√{θ₁(1 − θ₁)} + (1 − P₁)√{θ₂(1 − θ₂)} ]² / {n(P₁ − P₂)²},     (11.3.2.4)

where n = n₁ + n₂ and n₁ and n₂ are the Greenberg, Abul-Ela, Simmons, and Horvitz (1969) optimum sample sizes. They suggested that one of the optimal choices of P_i, i = 1, 2, should be close to one and the other close to zero. The value π_y should be chosen close to zero or one according as π < 0.5 or π > 0.5 respectively.
Moors (1971) claimed that if one of the values of P_i (say, P₂) is chosen to be zero then the model of Greenberg, Abul-Ela, Simmons, and Horvitz (1969) becomes optimal so far as the choice of P_i is concerned. The optimal choice P₂ = 0 (i.e., Moors' model) discloses the privacy of the sensitive character whenever any respondent appears in both the independent samples and reports 'Yes' in the first sample and 'No' in the second sample. The probability of repetition of the respondents in both independent samples is quite high when large samples are required for gaining efficiency in the randomized response techniques. This difficulty has been pointed out by Mangat, Singh, and Singh (1997). The probability of a respondent being selected in both samples depends upon the size of the stratum or the level of the estimates required. Some organizations require estimates at country, state, district, village, or school level. As the stratum size changes from country to school level, the probability of the respondents being selected in both the samples increases. Another practical situation where the respondent gets selected in both samples is the well-known overlapping cluster sampling. If there is no population list, then in such situations, although an individual respondent might not be detected by the interviewer, the interviewee might think that in the first sample he replied 'Yes' out of a choice of two questions, and now another interviewer is asking him a direct non-sensitive question. If he understands RRT then he will never respond, as he becomes suspicious. In this case the problem arises from the interviewee's side. Mahmood, Singh, and Horn (1998) observed that the method suggested by Mangat, Singh, and Singh (1997) needs splitting of the population of N units into two random groups. Sometimes it is difficult to do so owing to the population being large or the non-availability of a list of individual units at the population level. Moreover, the variance expression of the estimator proposed by them is too complicated to implement in actual practice. Mahmood, Singh, and Horn (1998) have provided three simple and alternative survey techniques which parallel the optimality conditions suggested by Moors (1971) and are free from the difficulties in the Moors (1971) as well as in the Mangat, Singh, and Singh (1997) model.

Technique I. Mahmood, Singh, and Horn (1998) suggest a device consisting of three types of statements:
( i ) 'I belong to group A'; ( ii ) 'I belong to group Yᶜ'; and ( iii ) 'I belong to group Y',
with probabilities P₁, P₃ and P₄ respectively, such that P₁ + P₃ + P₄ = 1.

Then the probability of a 'Yes' answer in the first sample of n₁ respondents is

θ₃ = P₁π + P₃(1 − π_y) + P₄π_y.     (11.3.2.5)

In the second independent sample of n₂ respondents, the question is asked only on the unrelated character Y, to estimate the proportion π_y of the unrelated character as discussed in Moors' model. Obviously this technique guarantees the respondents' privacy, which is lacking in Moors' model.

By the method of moments, an unbiased estimator of π is given by

π̂₁ = [θ̂₃ − P₃(1 − π̂_y) − P₄π̂_y]/P₁,     (11.3.2.6)

where θ̂₃ and π̂_y are the observed proportions of 'Yes' answers in the first and second independent samples.
Thus we have the following theorem:

Theorem 11.3.2.3. The minimum variance of the estimator π̂₁ is given by

Min.V(π̂₁) = [ √{θ₃(1 − θ₃)} + |P₃ − P₄|√{π_y(1 − π_y)} ]² / (nP₁²).     (11.3.2.7)

Proof. Note that n₁θ̂₃ ~ B(n₁, θ₃), n₂π̂_y ~ B(n₂, π_y), and the two samples are independent, so that

V(π̂₁) = [V(θ̂₃) + (P₃ − P₄)² V(π̂_y)]/P₁² = [θ₃(1 − θ₃)/n₁ + (P₃ − P₄)² π_y(1 − π_y)/n₂]/P₁².     (11.3.2.8)

On differentiating (11.3.2.8) with respect to n₁, subject to n₁ + n₂ = n, and equating to zero we have

n₁/n₂ = √{θ₃(1 − θ₃)} / {|P₃ − P₄|√(π_y(1 − π_y))}.     (11.3.2.9)

Hence the minimum variance of the estimator π̂₁ is given by (11.3.2.7).



Technique II. The randomized response device here differs from Technique I in the sense that the second statement 'I belong to Yᶜ' is simply replaced by the statement 'Try once again.' If this statement appears on the second trial, then the respondent is requested to report 'Yes' irrespective of his actual status. Then the probability of a 'Yes' answer in the first sample of n₁ respondents is given by

θ₄ = (1 + P₃)(P₁π + P₄π_y) + P₃².     (11.3.2.10)

The second independent sample of n₂ respondents is asked the direct question on Y by following Moors (1971). By the method of moments an unbiased estimator of π is given by

π̂₂ = [θ̂₄ − P₃² − P₄(1 + P₃)π̂_y]/{P₁(1 + P₃)}.     (11.3.2.11)

Thus we have the following theorem:

Theorem 11.3.2.4. The minimum variance of the estimator π̂₂ is given by

Min.V(π̂₂) = [ √{θ₄(1 − θ₄)} + P₄(1 + P₃)√{π_y(1 − π_y)} ]² / {n(P₁(1 + P₃))²}.     (11.3.2.12)

Proof. Similar to the proof of Theorem 11.3.2.3.

Technique III. The randomized response device differs from Technique I in the sense that the second statement 'I belong to Yᶜ' is simply replaced by the statement 'I belong to Aᶜ'. Then the probability of a 'Yes' answer in the first sample of n₁ respondents is given by

θ₅ = P₁π + P₃(1 − π) + P₄π_y.     (11.3.2.13)

The second independent sample of n₂ respondents is treated by following Moors (1971). By the method of moments, an unbiased estimator of π is given by

π̂₃ = [θ̂₅ − P₃ − P₄π̂_y]/(P₁ − P₃).     (11.3.2.14)

Thus we have the following theorem:

Theorem 11.3.2.5. The minimum variance of the estimator π̂₃ is given by

Min.V(π̂₃) = [ √{θ₅(1 − θ₅)} + P₄√{π_y(1 − π_y)} ]² / {n(P₁ − P₃)²}.     (11.3.2.15)

Proof. It is similar to the proof of Theorem 11.3.2.3.


Mahmood, Singh, and Horn (1998) have shown that the estimator proposed at (11.3.2.6) remains better than the other estimators. A few more strategies in this direction have been suggested by Tracy and Mangat (1995, 1996b), Singh, Singh, and Mangat (2000), and Singh (1999). Note that, unfortunately, the variance expressions in Singh (1999) and Singh, Singh, and Mangat (2000) are missing small terms, but after making those corrections the suggested strategies still remain more efficient than their competitors.

Example 11.3.1. Assume the true proportion of drug users in a city is 0.1. Consider that we used a randomization device to collect information from a large sample of persons bearing two types of statements:
( i ) 'Are you a drug user?' with probability 0.70;
( ii ) 'Were you born in spring?' with probability 0.30.
Assume that the proportion of persons born in spring is 0.2 (in actual practice it is unknown). Find the relative efficiencies of the above three techniques with respect to the Greenberg, Abul-Ela, Simmons, and Horvitz (1969) technique.
Solution. We are given π = 0.1, π_y = 0.2, P₁ = 0.7 and P₂ = 0.3. Thus we have

θ₁ = P₁π + (1 − P₁)π_y = 0.7 × 0.1 + (1 − 0.7) × 0.2 = 0.13,

and

θ₂ = P₂π + (1 − P₂)π_y = 0.3 × 0.1 + (1 − 0.3) × 0.2 = 0.17.

Technique I. According to Technique I let us split P₂ = 0.3 into two parts as P₃ = 0.15 and P₄ = 0.15. Thus we have

θ₃ = P₁π + P₃(1 − π_y) + P₄π_y = 0.7 × 0.1 + 0.15(1 − 0.2) + 0.15 × 0.2 = 0.22.

Thus the relative efficiency of the estimator π̂₁ with respect to π̂_G is given by

RE = [ (1 − P₂)√{θ₁(1 − θ₁)} + (1 − P₁)√{θ₂(1 − θ₂)} ]² P₁² / ( [ √{θ₃(1 − θ₃)} + (P₃ − P₄)√{π_y(1 − π_y)} ]² (P₁ − P₂)² )

   = [ (1 − 0.3)√{0.13(1 − 0.13)} + (1 − 0.7)√{0.17(1 − 0.17)} ]² × 0.7² / ( [ √{0.22(1 − 0.22)} + (0.15 − 0.15)√{0.2(1 − 0.2)} ]² × (0.7 − 0.3)² ) = 2.1626.

Technique II. Here we have

θ₄ = (1 + P₃)(P₁π + P₄π_y) + P₃² = (1 + 0.15)(0.7 × 0.1 + 0.15 × 0.2) + 0.15² = 0.1375.

Thus the relative efficiency of the estimator π̂₂ with respect to π̂_G is given by

RE = [ (1 − P₂)√{θ₁(1 − θ₁)} + (1 − P₁)√{θ₂(1 − θ₂)} ]² (P₁(1 + P₃))² / ( [ √{θ₄(1 − θ₄)} + P₄(1 + P₃)√{π_y(1 − π_y)} ]² (P₁ − P₂)² )

   = [ (1 − 0.3)√{0.13(1 − 0.13)} + (1 − 0.7)√{0.17(1 − 0.17)} ]² × (0.7(1 + 0.15))² / ( [ √{0.1375(1 − 0.1375)} + 0.15(1 + 0.15)√{0.2(1 − 0.2)} ]² × (0.7 − 0.3)² ) = 2.8721.

Technique III. Here we have

θ₅ = P₁π + P₃(1 − π) + P₄π_y = 0.7 × 0.1 + 0.15 × (1 − 0.1) + 0.15 × 0.2 = 0.235.

The relative efficiency of the estimator π̂₃ with respect to π̂_G is given by

RE = [ (1 − P₂)√{θ₁(1 − θ₁)} + (1 − P₁)√{θ₂(1 − θ₂)} ]² (P₁ − P₃)² / ( [ √{θ₅(1 − θ₅)} + P₄√{π_y(1 − π_y)} ]² (P₁ − P₂)² )

   = [ (1 − 0.3)√{0.13(1 − 0.13)} + (1 − 0.7)√{0.17(1 − 0.17)} ]² × (0.7 − 0.15)² / ( [ √{0.235(1 − 0.235)} + 0.15√{0.2(1 − 0.2)} ]² × (0.7 − 0.3)² ) = 0.978.
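The three relative efficiencies can be reproduced directly from the minimum-variance formulas, since the common sample size n cancels in each ratio; a sketch assuming the example's inputs:

```python
from math import sqrt

# Transcription of the minimum-variance formulas (11.3.2.4), (11.3.2.7),
# (11.3.2.12), (11.3.2.15) using the inputs of Example 11.3.1 (n cancels).
pi, pi_y, P1, P2, P3, P4 = 0.1, 0.2, 0.7, 0.3, 0.15, 0.15
q = lambda t: sqrt(t * (1 - t))

th1 = P1 * pi + (1 - P1) * pi_y
th2 = P2 * pi + (1 - P2) * pi_y
num = ((1 - P2) * q(th1) + (1 - P1) * q(th2)) ** 2 / (P1 - P2) ** 2  # Greenberg

th3 = P1 * pi + P3 * (1 - pi_y) + P4 * pi_y
th4 = (1 + P3) * (P1 * pi + P4 * pi_y) + P3 ** 2
th5 = P1 * pi + P3 * (1 - pi) + P4 * pi_y
v1 = (q(th3) + (P3 - P4) * q(pi_y)) ** 2 / P1 ** 2                   # Technique I
v2 = (q(th4) + P4 * (1 + P3) * q(pi_y)) ** 2 / (P1 * (1 + P3)) ** 2  # Technique II
v3 = (q(th5) + P4 * q(pi_y)) ** 2 / (P1 - P3) ** 2                   # Technique III

print([round(num / v, 4) for v in (v1, v2, v3)])
```

Only Technique III falls (slightly) below relative efficiency 1 at these settings; the first two techniques gain over the Greenberg et al. (1969) scheme.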

Singh, Joarder, and King (1996) considered the problem of estimation of regression coefficients in the traditional regression model. They assumed that the variable of interest Y_i is related to k non-stochastic regressors via the classical linear regression model

Y = Xβ + e,     (11.4.1)

where Y is an n × 1 vector of Y_i values, X is an n × k matrix of regressors, β is k × 1, and e is an n × 1 disturbance term such that e ~ N(0, σ²I_n). Here Y_i is a
Here Y; is a
sensitive variable whose observations have to be obtained by survey methods.
Because some respondents are unlikely to respond truthfully to questions about
their behaviour which is immoral, unpopular, or unlawful, Eichhorn and Hayre's
(1983) scrambling response approach is applied as follows: Each respondent is
requested to report the product YjSj where S, is the value of the scrambling
variable drawn by the l h respondent. The privacy of the respondent is protected by
the fact that S.I is not known to the interviewer but its distribution, and in particular
its mean E(Sj) = 0 and variance V(Sj) = yZ are known. The scrambling device may
be a deck of cards, a spinner, etc., following some suitable distribution, e.g.,
Normal, Weibull, or any discrete distribution. The Y;Sj value obtained from the lh
respondent can be standardized as Z, = Y;S;/O after collection. Singh, Joarder, and
King (1996) considered the estimation and testing of p under the model
Z=XP + 1/, (11.4.2)
where Z is the n x 1 vector of Z, values and 1/ is an n x 1 vector of errors whose
distribution is unknown.

Theorem 11.4.1. E(η) = 0, and the unbiased OLS estimator of β is given by

β* = (X′X)⁻¹X′Z.     (11.4.3)

Proof. This theorem follows immediately from the property

E(Z) = E_M E_R(Z) = E_M(Y) = Xβ,

where E_M and E_R denote the expected values with respect to the model (11.4.1) and the distribution of the randomization device providing the S_i values, respectively. To find the variance of the OLS estimator β* we need the following lemma.

Lemma 11.4.1. If V_R denotes the variance--covariance matrix of Z over the distribution of S, then we have

V_R(Z) = C²_γ diag(Y₁², Y₂², ..., Y_n²),     (11.4.4)

where C_γ = γ/θ is the coefficient of variation of the scrambling device.

Theorem 11.4.2. The variance--covariance matrix of the estimator β* is given by

V(β*) = σ²[(X′X)⁻¹ + C²_γ(X′X)⁻¹] + C²_γ(X′X)⁻¹X′ diag[(Σ_{j=1}^{k} x_{1j}β_j)², (Σ_{j=1}^{k} x_{2j}β_j)², ..., (Σ_{j=1}^{k} x_{nj}β_j)²] X(X′X)⁻¹.     (11.4.5)

Proof. Let V_M denote the variance--covariance matrix over the model (11.4.1). Then we have

V(β*) = E_M V_R(β*) + V_M E_R(β*) = E_M V_R[(X′X)⁻¹X′Z] + V_M E_R[(X′X)⁻¹X′Z]
      = E_M[(X′X)⁻¹X′ C²_γ diag(Y₁², Y₂², ..., Y_n²) X(X′X)⁻¹] + V_M[(X′X)⁻¹X′Y]
      = σ²[(X′X)⁻¹ + C²_γ(X′X)⁻¹] + C²_γ(X′X)⁻¹X′ diag[(Σ_{j=1}^{k} x_{1j}β_j)², (Σ_{j=1}^{k} x_{2j}β_j)², ..., (Σ_{j=1}^{k} x_{nj}β_j)²] X(X′X)⁻¹.

Hence the theorem.
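A scalar (k = 1) sketch of the scrambled-response OLS estimator (11.4.3): each reported value is Y_iS_i/θ, and the usual OLS formula applied to Z remains unbiased for β. The scrambling distribution and model settings below are illustrative assumptions:

```python
import random

# Single-regressor sketch (k = 1) of beta* = (X'X)^(-1) X'Z from (11.4.3).
THETA, GAMMA = 1.0, 0.5   # scrambling mean E(S) and sd, so C_gamma = 0.5
BETA, SIGMA = 2.0, 1.0    # true slope and model error sd (assumed)

def beta_star(n, rng):
    x = [rng.uniform(1, 3) for _ in range(n)]
    sxz = sxx = 0.0
    for xi in x:
        y = BETA * xi + rng.gauss(0, SIGMA)   # model (11.4.1)
        s = rng.gauss(THETA, GAMMA)           # scrambling draw S_i
        z = y * s / THETA                     # standardized report Z_i
        sxz += xi * z
        sxx += xi * xi
    return sxz / sxx                          # OLS of Z on x

rng = random.Random(5)
ests = [beta_star(500, rng) for _ in range(200)]
print(round(sum(ests) / len(ests), 2))
```

The average of the estimates should land near the true slope, while their spread exceeds the direct-response OLS variance by the C²_γ terms of (11.4.5).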

Theorem 11.4.3. An estimator of σ² is given by

σ̂*² = [ η̂′η̂ − C²_γ Σ_{i=1}^{n} (Σ_{j=1}^{k} β*_j x_{ij})² M_{ii} ] / ( n − k + C²_γ Σ_{i=1}^{n} M_{ii} ),     (11.4.6)

where η̂ = Z − Xβ* is the OLS residual vector from (11.4.3) and M_{ii} denotes the i-th diagonal element of the matrix

M = I − X(X′X)⁻¹X′.

Proof. We have

E(η̂′η̂) = ( n − k + C²_γ Σ_{i=1}^{n} M_{ii} ) σ² + C²_γ Σ_{i=1}^{n} (Σ_{j=1}^{k} β_j x_{ij})² M_{ii}.     (11.4.7)

By the method of moments, (11.4.7) with the unknown β_j values replaced by their estimates gives (11.4.6). Hence the theorem.

Theorem 11.4.4. The Wald test statistic to test the null hypothesis H₀: β = β₀ against the alternative hypothesis H_a: β ≠ β₀ for the scrambled response model is given by

(β* − β₀)′[v̂(β*)]⁻¹(β* − β₀) ~ χ²(k)

under H₀, assuming that β* ~ N(β, V(β*)), where an estimator of V(β*) is given by

v̂(β*) = σ̂*²[(X′X)⁻¹ + C²_γ(X′X)⁻¹] + C²_γ(X′X)⁻¹X′ diag[(Σ_{j=1}^{k} x_{1j}β*_j)², (Σ_{j=1}^{k} x_{2j}β*_j)², ..., (Σ_{j=1}^{k} x_{nj}β*_j)²] X(X′X)⁻¹.
A detailed empirical study has been carried out to study the nature of estimates of regression coefficients under different levels of respondents lying in direct surveying. Strachan, King, and Singh (1998) have considered likelihood based estimation of the regression model with scrambled responses, where they compared a Bayesian estimator achieved through a Markov Chain Monte Carlo (MCMC) sampling scheme with a classical maximum likelihood estimator and the estimator proposed by Singh, Joarder, and King (1996); the coefficient of determination was later studied by Singh and King (1999).

A more serious problem arises when a character under study is sensitive in nature. For example, researchers may be interested in estimating average income using expenditure and assets of households as independent variables. Obviously high correlation is expected between expenditure and assets of a household, and this will cause the matrix (X′X) to become ill conditioned. Singh and Tracy (1999) considered the ridge regression estimator of the regression coefficient under scrambled responses and found it more efficient than the ordinary least squares estimator.

Singh and Tracy (1999) considered a ridge regression estimator under scrambled responses as

β̂_R(s) = (X′X + R_c I)⁻¹X′Z,     (11.4.1.1)

where R_c denotes the ridge constant.

Lemma 11.4.1.1. The variance V_R(Z) is given by

V_R(Z) = C²_γ diag(Y₁², Y₂², ..., Y_n²).     (11.4.1.2)

Theorem 11.4.1.1. The bias in the estimator β̂_R(s) is given by (A − I)β, where A = (X′X + R_c I)⁻¹X′X.
Proof. Let E_M and E_R denote, respectively, the expected value with respect to the model and the distribution of the randomization device providing the S_i values. Then

E(β̂_R(s)) = E_M E_R(β̂_R(s)) = (X′X + R_c I)⁻¹X′Xβ = Aβ.

Using

B(β̂_R(s)) = E(β̂_R(s)) − β,

we get the theorem.

Theorem 11.4.1.2. The variance--covariance matrix of the proposed estimator β̂_R(s) is given by

V(β̂_R(s)) = σ²A(X′X)⁻¹A′ + C²_γ(X′X + R_c I)⁻¹X′(D + σ²I)X(X′X + R_c I)⁻¹,     (11.4.1.3)

where D = diag[(Σ_{j=1}^{k} x_{1j}β_j)², ..., (Σ_{j=1}^{k} x_{nj}β_j)²]; the total variance is the trace of the right hand side of (11.4.1.3).
Proof. Let V_M and V_R denote, respectively, the variance--covariance matrix over the model and the distribution of the randomization device. Thus we have

V(β̂_R(s)) = E_M V_R(β̂_R(s)) + V_M E_R(β̂_R(s))
           = E_M V_R[(X′X + R_c I)⁻¹X′Z] + V_M E_R[(X′X + R_c I)⁻¹X′Z]
           = E_M[(X′X + R_c I)⁻¹X′ C²_γ diag(Y₁², Y₂², ..., Y_n²) X(X′X + R_c I)⁻¹] + V_M[(X′X + R_c I)⁻¹X′Y]
           = σ²A(X′X)⁻¹A′ + C²_γ(X′X + R_c I)⁻¹X′(D + σ²I)X(X′X + R_c I)⁻¹.

Hence the theorem.

Theorem 11.4.1.3. The mean square error of the estimator β̂_R(s) is given by

MSE(β̂_R(s)) = V(β̂_R(s)) + (A − I)ββ′(A − I)′.     (11.4.1.4)

Proof. Trivial from the definition of mean square error; the total MSE is obtained by taking the trace.

The ridge regression estimator p~) under scrambled respon ses is more efficient
c

pes) sue to Singh, Joarder, and King ( 1996) if


than the estimator

V~(s)] -MSE~~~] >0 (11.4.1.5)

or if
Chapter 11.: Randomized response sampling : Tools for social surveys 907

$$0 < R_c < \frac{2\sigma^2(1 + C_\gamma^2)}{\beta'\beta}, \quad (11.4.1.6)$$
which can be derived by taking the trace of (11.4.1.5). It is interesting to note that the range of the ridge constant $R_c$ is wider in the case of scrambled responses than in the case of a direct question survey because $C_\gamma > 0$. Also (11.4.1.6) shows that the admissible range of the ridge constant $R_c$ increases with $C_\gamma$.
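A brief numerical sketch of (11.4.1.1) follows. All data are simulated, and the scrambling setup (a multiplicative scrambling variable with mean one) is only an illustrative assumption; with $R_c = 0$ the estimator reduces to ordinary least squares on the scrambled responses.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Scrambled responses z_i = y_i * S_i / theta with E(S) = theta = 1 (assumed)
S = rng.normal(loc=1.0, scale=0.2, size=n)
z = y * S

def ridge_scrambled(X, z, Rc):
    """Ridge estimator (11.4.1.1): (X'X + Rc I)^{-1} X'z."""
    return np.linalg.solve(X.T @ X + Rc * np.eye(X.shape[1]), X.T @ z)

beta_ridge = ridge_scrambled(X, z, Rc=1.0)
```

Increasing $R_c$ shrinks the coefficient vector toward zero, which is exactly the bias-variance trade-off summarized by (11.4.1.4).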

Singh, Horn, and Chowdhury (1998) introduced a new and interesting model in survey sampling for the social sciences. Suppose someone is interested in estimating the proportionate strength (or size) and the average of a stigmatizing quantitative character of a particular hidden gang $G$ (defined by that character or otherwise) in the finite population $\Omega$. People are unwilling to admit membership of gang $G$ for fear of administrative retribution or social embarrassment. Singh, Horn, and Chowdhury (1998) attempted a possible solution to the following types of problems in survey sampling.
( 1 ) Estimation of the proportion of persons in the population $\Omega$ having income greater than or equal to \$60,000 (say, gang $G_1$) along with their average income.
( 2 ) Estimation of the proportion of persons having extra-marital relations (say, gang $G_2$) in the whole population and their average income.
( 3 ) Estimation of the proportion of politically active persons in the country (say, gang $G_3$) and their average income [or the average number of murders committed by them].
( 4 ) More generally, estimation of the proportion of persons involved in a particular crime (say, gang $G$) along with the average value, $\mu_x$, of any stigmatized quantitative character (say, $X$) of the same gang.

In the first method Singh, Horn, and Chowdhury (1998) suggested drawing two independent samples of sizes $n_1$ and $n_2$ from the population using SRSWR. The persons appearing in the first sample are provided with a randomization device $R_1$ (say). The device may be a spinner, a deck of cards, or a computer grid which generates a random variable taking a value greater than or equal to $\psi_0$ (say) from a given probability distribution. Each respondent selected in the sample is requested to draw one random number (which is obviously greater than or equal to $\psi_0$) from the randomization device $R_1$, without showing it to the interviewer. He is then requested to report the actual value of the stigmatized quantitative variable, say $X$, if and only if he belongs to gang $G$; otherwise he is requested to report the random number as drawn. The value of $\psi_0$ depends upon the problem under consideration.

For example, in problem 1 the value of $\psi_0$ is \$60,000; in problems 2 and 3 the value of $\psi_0$ can be taken as zero or any other suitable value. The choice of $\psi_0$ can be made such that the respondents' privacy will not be jeopardized if they respond honestly. It is assumed that the distribution of the randomization device, $R_1$, is known to the interviewer, but not the number drawn by the respondent. Let $\theta_1$ and $\sigma_1^2$ denote the known mean and variance of the randomization device $R_1$. Let $\pi$ denote the proportion of persons belonging to the gang $G$ in the population $\Omega$. If the respondents report 100\% truthfully, then the distribution of the $i$th response, say $Z_{1i}$, is given by
$$Z_{1i} = \begin{cases} X_i & \text{with probability } \pi, \\ R_{1i} & \text{with probability } (1-\pi), \end{cases} \quad (11.5.1.1)$$
li

where $X_i$ and $R_{1i}$ denote, respectively, the actual value of the stigmatized quantitative character and the random variable drawn by the $i$th respondent in the sample. The expected value of the random variable $Z_{1i}$ is given by
$$E(Z_{1i}) = \pi E(X_i) + (1-\pi)E(R_{1i}) = \pi\mu_x + (1-\pi)\theta_1, \quad (11.5.1.2)$$
where $\mu_x$ denotes the average value of the stigmatized quantitative variable of the hidden gang $G$ in the finite population $\Omega$. Now in equation (11.5.1.2) we have two unknown parameters of interest, $\pi$ and $\mu_x$, satisfying the single equation. This difficulty can be removed by taking another independent sample of size $n_2$ from $\Omega$ and using a similar randomization device $R_2$ with a different known distribution. Let $\theta_2$ ($\neq \theta_1$) and $\sigma_2^2$ denote the known mean and variance of the second randomization device $R_2$. Then the distribution of the $i$th response, say $Z_{2i}$, in the second sample is given by
$$Z_{2i} = \begin{cases} X_i & \text{with probability } \pi, \\ R_{2i} & \text{with probability } (1-\pi), \end{cases} \quad (11.5.1.3)$$
where $X_i$ and $R_{2i}$ denote, respectively, the actual value of the stigmatized quantitative character and the random value drawn by the $i$th respondent in the second sub-sample. Again the expected value of the random variable $Z_{2i}$ is
$$E(Z_{2i}) = \pi\mu_x + (1-\pi)\theta_2. \quad (11.5.1.4)$$
We now have the following theorems:

Theorem 11.5.1.1. An unbiased estimator of $\pi$ is given by
$$\hat{\pi}_s = 1 - \frac{\bar{Z}_1 - \bar{Z}_2}{\theta_1 - \theta_2}, \quad (11.5.1.5)$$
where $\bar{Z}_1 = n_1^{-1}\sum_{i=1}^{n_1} Z_{1i}$ and $\bar{Z}_2 = n_2^{-1}\sum_{i=1}^{n_2} Z_{2i}$ denote the response means in the first and second sub-samples respectively.
Proof. Follows on solving (11.5.1.2) and (11.5.1.4) for $\pi$.

Theorem 11.5.1.2. The variance of the estimator $\hat{\pi}_s$ is given by
$$V(\hat{\pi}_s) = \frac{1}{(\theta_1 - \theta_2)^2}\left[\frac{\sigma_{z1}^2}{n_1} + \frac{\sigma_{z2}^2}{n_2}\right], \quad (11.5.1.6)$$
where
$$\sigma_{z1}^2 = \pi\sigma_x^2 + (1-\pi)\sigma_1^2 + \pi(1-\pi)(\mu_x - \theta_1)^2 \quad (11.5.1.7)$$
and
$$\sigma_{z2}^2 = \pi\sigma_x^2 + (1-\pi)\sigma_2^2 + \pi(1-\pi)(\mu_x - \theta_2)^2. \quad (11.5.1.8)$$
Proof. Follows from the independence of the samples.

Theorem 11.5.1.3. An estimator of $\mu_x$ is given by
$$\hat{\mu}_x = \frac{\bar{Z}_2\theta_1 - \bar{Z}_1\theta_2}{(\bar{Z}_2 - \theta_2) - (\bar{Z}_1 - \theta_1)}. \quad (11.5.1.9)$$
Proof. Substitute the value of $\pi$ obtained from (11.5.1.2) in (11.5.1.4); then (11.5.1.9) follows by the method of moments.
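A small simulation, with entirely hypothetical population values, illustrates how the two-sample estimators (11.5.1.5) and (11.5.1.9) recover $\pi$ and $\mu_x$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 30% belong to gang G; members' incomes average 70000
N = 100_000
member = rng.random(N) < 0.3
x = rng.normal(70_000, 5_000, N)          # sensitive values (used only for members)

theta1, theta2 = 60_000.0, 80_000.0       # known device means, theta1 != theta2
n1 = n2 = 4_000

def responses(idx, theta):
    """Members report X_i; non-members report a draw from the device."""
    draws = rng.normal(theta, 5_000, idx.size)
    return np.where(member[idx], x[idx], draws)

z1 = responses(rng.integers(0, N, n1), theta1)   # first SRSWR sample, device R1
z2 = responses(rng.integers(0, N, n2), theta2)   # second SRSWR sample, device R2

pi_hat = 1 - (z1.mean() - z2.mean()) / (theta1 - theta2)                 # (11.5.1.5)
mu_hat = (z2.mean() * theta1 - z1.mean() * theta2) / (
    (z2.mean() - theta2) - (z1.mean() - theta1))                         # (11.5.1.9)
```

With samples of this size both estimates land close to the true $\pi = 0.3$ and $\mu_x \approx 70{,}000$.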

Theorem 11.5.1.4. An expression for the variance of the estimator $\hat{\mu}_x$ is given by
$$V(\hat{\mu}_x) = \frac{1}{\pi^2(\theta_2 - \theta_1)^2}\left[\frac{\sigma_{z1}^2(\theta_2 - \mu_x)^2}{n_1} + \frac{\sigma_{z2}^2(\theta_1 - \mu_x)^2}{n_2}\right]. \quad (11.5.1.10)$$

Proof. The estimator (11.5.1.9) can be written as
$$\hat{\mu}_x = \hat{d}_1/\hat{d}_2,$$
where
$$\hat{d}_1 = \bar{Z}_2\theta_1 - \theta_2\bar{Z}_1 \quad \text{and} \quad \hat{d}_2 = (\bar{Z}_2 - \theta_2) - (\bar{Z}_1 - \theta_1).$$
By the ratio method of estimation, the variance of the estimator $\hat{\mu}_x$ is given by
$$V(\hat{\mu}_x) \approx \frac{1}{\{E(\hat{d}_2)\}^2}\left[V(\hat{d}_1) - 2\mu_x\,\mathrm{Cov}(\hat{d}_1, \hat{d}_2) + \mu_x^2 V(\hat{d}_2)\right]. \quad (11.5.1.11)$$
One can easily show that
$$V(\hat{d}_1) = \frac{\theta_1^2\sigma_{z2}^2}{n_2} + \frac{\theta_2^2\sigma_{z1}^2}{n_1}, \quad V(\hat{d}_2) = \frac{\sigma_{z1}^2}{n_1} + \frac{\sigma_{z2}^2}{n_2}, \quad \mathrm{Cov}(\hat{d}_1, \hat{d}_2) = \frac{\theta_1\sigma_{z2}^2}{n_2} + \frac{\theta_2\sigma_{z1}^2}{n_1}. \quad (11.5.1.12)$$
On using (11.5.1.12) in (11.5.1.11), and noting $E(\hat{d}_2) = \pi(\theta_1 - \theta_2)$, we have (11.5.1.10). Hence the theorem.
The next section discusses strategies for using Method 1 in actual survey sampling.

Decision Making Strategies: The decision maker may have any of the following objectives:
Case I. When the investigator is interested in obtaining a more precise estimate of the population proportion $\pi$ than of the mean $\mu_x$, he will seek to minimize the variance of the estimator $\hat{\pi}_s$ at the expense of the variance of $\hat{\mu}_x$. The variance of the estimator $\hat{\pi}_s$ will be minimal if the values of the sample sizes $n_1$ and $n_2$ are given by
$$n_1 = n\sigma_{z1}/(\sigma_{z1} + \sigma_{z2}) \quad \text{and} \quad n_2 = n\sigma_{z2}/(\sigma_{z1} + \sigma_{z2}).$$
The minimum variance of the proposed estimator $\hat{\pi}_s$ is then given by
$$\mathrm{Min.}V(\hat{\pi}_s) = (\sigma_{z1} + \sigma_{z2})^2/\{n(\theta_1 - \theta_2)^2\}.$$


Case II. When the investigator is interested in estimating $\mu_x$ more precisely than the size (or proportion) of the gang, then he will be interested in the minimum variance of the estimator $\hat{\mu}_x$ rather than the minimum variance of the estimator $\hat{\pi}_s$. The variance of the estimator $\hat{\mu}_x$ will be minimal if the values of the sample sizes $n_1$ and $n_2$ are given by
$$n_1 = \frac{n\sigma_{z1}|\theta_2 - \mu_x|}{\sigma_{z1}|\theta_2 - \mu_x| + \sigma_{z2}|\theta_1 - \mu_x|} \quad \text{and} \quad n_2 = \frac{n\sigma_{z2}|\theta_1 - \mu_x|}{\sigma_{z1}|\theta_2 - \mu_x| + \sigma_{z2}|\theta_1 - \mu_x|}.$$
The minimum variance of the proposed estimator $\hat{\mu}_x$ is given by
$$\mathrm{Min.}V(\hat{\mu}_x) = \frac{\left[\sigma_{z1}|\theta_2 - \mu_x| + \sigma_{z2}|\theta_1 - \mu_x|\right]^2}{n\pi^2(\theta_2 - \theta_1)^2}.$$

Case III. Let $\sigma_{z1}^2 = \sigma_{z2}^2$, as there is no prior reason to treat the two samples asymmetrically. Then for a given $\delta = |\theta_1 - \theta_2|$ we can minimize $V(\hat{\pi}_s)$ and $V(\hat{\mu}_x)$ simultaneously by letting $\theta_1 = \mu_x - \frac{1}{2}\delta$, $\theta_2 = \mu_x + \frac{1}{2}\delta$ and $n_1 = n_2 = \frac{n}{2}$.
The proof is obvious: $n_1 = n_2 = \frac{n}{2}$ minimizes $V(\hat{\pi}_s)$ when $\sigma_{z1}^2 = \sigma_{z2}^2$, and the variance then becomes
$$\mathrm{Min.}V(\hat{\mu}_x) = \frac{\sigma_{z1}^2\left\{|\theta_2 - \mu_x| + |\theta_1 - \mu_x|\right\}^2}{n\pi^2(\theta_2 - \theta_1)^2} \geq \frac{\sigma_{z1}^2}{n\pi^2},$$
and the lower bound is achieved whenever $\theta_1 \leq \mu_x \leq \theta_2$, and in particular if $\theta_1 = \mu_x - 0.5\delta$ and $\theta_2 = \mu_x + 0.5\delta$. For this choice of $\theta_1$ and $\theta_2$ it follows that $n_1 = n_2 = n/2$ minimizes $V(\hat{\mu}_x)$.

With $\sigma_{z1}^2 = \sigma_{z2}^2$ the choice $\theta_1 = \mu_x - 0.5\delta$, $\theta_2 = \mu_x + 0.5\delta$ and $n_1 = n_2 = n/2$ therefore minimizes $V(\hat{\pi}_s)$ and $V(\hat{\mu}_x)$ simultaneously for a given $\delta = |\theta_1 - \theta_2|$. We should make $|\theta_1 - \theta_2|$ as large as possible without threatening the privacy of the respondents.
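The Case I allocation can be sketched as a small helper, a direct transcription of the formulas above (all inputs are hypothetical values supplied by the analyst):

```python
def case1_allocation(n, sigma_z1, sigma_z2, theta1, theta2):
    """Neyman-type split n1 = n*s1/(s1+s2), n2 = n - n1,
    and the resulting Min.V(pi_hat) = (s1+s2)^2 / {n (theta1-theta2)^2}."""
    n1 = n * sigma_z1 / (sigma_z1 + sigma_z2)
    n2 = n - n1
    min_var = (sigma_z1 + sigma_z2) ** 2 / (n * (theta1 - theta2) ** 2)
    return n1, n2, min_var

n1, n2, v = case1_allocation(1000, 6.0, 4.0, 50.0, 70.0)   # n1 = 600.0, n2 = 400.0
```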

Singh, Horn, and Chowdhury (1998) proposed another method, according to which a single sample of $n$ respondents, instead of the $n_1$ and $n_2$ respondents of the previous method, is selected by SRSWR. Each respondent selected in the sample is asked two questions. The first question, about the value of $X_i$, is asked by using the randomization device $R_1$ as discussed earlier. The second question is asked about membership of the gang using the Warner (1965) model. For this situation we define a random variable $t_i$ such that
$$t_i = \begin{cases} 1 & \text{with probability } P\pi + (1-P)(1-\pi), \\ 0 & \text{with probability } P(1-\pi) + (1-P)\pi. \end{cases} \quad (11.5.2.1)$$

Then we have the following theorem from Warner (1965).


Theorem 11.5.2.1. An unbiased estimator of $\pi$ is given by
$$\hat{\pi}_w = \frac{\bar{t} - (1-P)}{2P - 1}, \quad P \neq 0.5, \quad (11.5.2.2)$$
where $\bar{t} = n^{-1}\sum_{i=1}^{n} t_i$, and its variance is given by
$$V(\hat{\pi}_w) = \frac{\pi(1-\pi)}{n} + \frac{P(1-P)}{n(2P-1)^2}. \quad (11.5.2.3)$$
Using this estimator of $\pi$ in (11.5.1.2) we obtain an estimator of $\mu_x$. This estimator is presented in the following theorem.

Theorem 11.5.2.2. An estimator of $\mu_x$ is given by
$$\hat{\mu}_x^* = \frac{\bar{Z} - (1 - \hat{\pi}_w)\theta_1}{\hat{\pi}_w}, \quad (11.5.2.4)$$
where $\bar{Z} = n^{-1}\sum_{i=1}^{n} Z_{1i}$ denotes the observed mean of the $n$ responses. In order to find the variance of the estimator $\hat{\mu}_x^*$ we need the following lemmas.

Lemma 11.5.2.1. The variance of $\bar{Z}$ is given by
$$V(\bar{Z}) = n^{-1}\sigma_{z1}^2 = n^{-1}\left[\pi\sigma_x^2 + (1-\pi)\sigma_1^2 + \pi(1-\pi)(\mu_x - \theta_1)^2\right]. \quad (11.5.2.5)$$

Lemma 11.5.2.2. The estimators $\bar{Z}$ and $\hat{\pi}_w$ are correlated.

Proof. By definition we have
$$\mathrm{Cov}(\bar{Z}, \hat{\pi}_w) = \frac{1}{2P-1}\mathrm{Cov}(\bar{Z}, \bar{t}) = \frac{1}{n(2P-1)}\left[E(Zt) - E(Z)E(t)\right],$$
where $Z$ and $t$ denote a single respondent's pair of responses, and
$$E(Zt) = E\{E(Zt \mid t)\} = E\{tE(Z \mid t)\} = P(t=1)E(Z \mid t=1) = E(t)E(Z \mid t=1).$$

It follows that
$$\mathrm{Cov}(Z, t) = E(t)\left[E(Z \mid t=1) - E(Z)\right] \neq 0$$
because
$$E(Z \mid t=1) = P(G \mid t=1)\mu_x + P(\bar{G} \mid t=1)\theta_1 \neq E(Z),$$
since
$$P(G \mid t=1) = \frac{P\pi}{P\pi + (1-P)(1-\pi)} \neq \pi.$$
Hence the lemma.
We now have the following theorem, the proof of which is easily obtained by proceeding along the lines of obtaining the variance of the ratio estimator.

Theorem 11.5.2.3. The variance of the estimator $\hat{\mu}_x^*$, to the first order of approximation, is given by
$$V(\hat{\mu}_x^*) \approx \frac{1}{\pi^2}\left[V(\bar{Z}) + (\mu_x - \theta_1)^2 V(\hat{\pi}_w) - 2(\mu_x - \theta_1)\mathrm{Cov}(\bar{Z}, \hat{\pi}_w)\right]. \quad (11.5.2.6)$$
The result of Theorem 11.5.2.3 can also be expressed as follows:

Theorem 11.5.2.4. From (11.5.2.6), it follows that
$$V(\hat{\mu}_x^*) = \frac{\sigma_{z1}^2}{n\pi^2} \quad \text{if } \theta_1 = \mu_x,$$
and
$$V(\hat{\mu}_x^*) = \frac{\sigma_{z1}^2}{n\pi^2}(1 - \rho^2) \quad \text{if } \mu_x - \theta_1 = \frac{\mathrm{Cov}(\bar{Z}, \hat{\pi}_w)}{V(\hat{\pi}_w)},$$
where
$$\rho = \frac{\mathrm{Cov}(\bar{Z}, \hat{\pi}_w)}{\sqrt{V(\bar{Z})V(\hat{\pi}_w)}}.$$
Comparing (11.5.2.6) with (11.5.1.10) we can see that $\hat{\mu}_x^*$ is capable of achieving a variance which is smaller than, or at most the same as, that of $\hat{\mu}_x$. In this method it is also possible to use a well known unrelated question model in place of Warner's (1965) model.
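The single-sample method can be sketched as follows. All population values are simulated, and the Warner device is imitated by a random 'card draw' that presents the direct statement with probability $P$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, pi_true, mu_true = 200_000, 0.25, 50_000.0
member = rng.random(N) < pi_true
x = rng.normal(mu_true, 4_000, N)

n, P, theta1 = 5_000, 0.8, 40_000.0
s = rng.integers(0, N, n)                     # SRSWR sample

# Question 1: quantitative response through device R1 (known mean theta1)
z = np.where(member[s], x[s], rng.normal(theta1, 4_000, n))
# Question 2: Warner (1965) device; 'Yes' if the drawn statement matches the status
card = rng.random(n) < P
yes = np.where(card, member[s], ~member[s])

pi_w = (yes.mean() - (1 - P)) / (2 * P - 1)          # (11.5.2.2)
mu_star = (z.mean() - (1 - pi_w) * theta1) / pi_w    # (11.5.2.4)
```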

We now wish to estimate the proportion of persons involved in a particular crime (say, gang $G$) along with the average values, $\mu_x$ and $\mu_y$, and the correlation coefficient $\rho_{xy}$ of any two stigmatized quantitative characters (say, $X$ and $Y$) of the same gang. The estimator of the correlation coefficient may be useful to study the relationship between the two characters.

Let
$$\pi = \frac{N_G}{N}$$
be the proportion of persons in the finite population $\Omega$ belonging to the hidden gang $G$. Let $Y$ and $X$ be the two quantitative sensitive characters of the hidden gang $G$. We wish to estimate the population means
$$\mu_x = \frac{1}{N_G}\sum_{i=1}^{N_G} X_i \quad \text{and} \quad \mu_y = \frac{1}{N_G}\sum_{i=1}^{N_G} Y_i,$$
the spread of the hidden gang $\pi$, and $\rho_{xy} = S_{xy}/(S_x S_y)$, where
$$S_x^2 = (N_G - 1)^{-1}\sum_{i=1}^{N_G}(X_i - \mu_x)^2, \quad S_y^2 = (N_G - 1)^{-1}\sum_{i=1}^{N_G}(Y_i - \mu_y)^2,$$
and $S_{xy} = (N_G - 1)^{-1}\sum_{i=1}^{N_G}(X_i - \mu_x)(Y_i - \mu_y)$.

Here we select a sample of $n$ units by the SRSWR scheme. Each respondent is provided with three randomization devices, $R_x$, $R_y$, and $R_w$ (say). The randomization device $R_x$ is provided to each respondent such that if he belongs to the hidden gang $G$ then he is requested to report the actual value of the quantitative character $X$; otherwise he is supposed to report a random number $T_x$ generated through the device $R_x$. It is assumed that the random variable $T_x$ takes values in the range of the actual $X$ values to maintain privacy, and its mean $\theta_x = E(T_x)$ and variance $\gamma_x^2 = V(T_x)$ are assumed to be known. Here $\pi$ is the probability that he will report the value $X$ and $(1-\pi)$ is the probability that he will report the value $T_x$. Thus the distribution of the first response of the $i$th respondent is
$$Z_{1i} = \begin{cases} X & \text{with probability } \pi, \\ T_x & \text{with probability } (1-\pi). \end{cases} \quad (11.5.3.1)$$
Then the expected value of the $i$th response is given by
$$E(Z_{1i}) = \pi\mu_x + (1-\pi)\theta_x. \quad (11.5.3.2)$$
The randomization device $R_y$ is provided to each respondent such that if he belongs to the hidden gang $G$ then he is requested to report the actual value of the quantitative character $Y$; otherwise he is supposed to report a random number $T_y$ generated through the device $R_y$. It is assumed that the random variable $T_y$ takes values in the range of the actual $Y$ values to maintain privacy, and its mean $\theta_y = E(T_y)$ and variance $\gamma_y^2 = V(T_y)$ are assumed to be known. Since $\pi$ is the probability that he will report the value $Y$ and $(1-\pi)$ is the probability that he will report the value $T_y$, the distribution of the second response of the $i$th respondent is given by

$$Z_{2i} = \begin{cases} Y & \text{with probability } \pi, \\ T_y & \text{with probability } (1-\pi). \end{cases} \quad (11.5.3.3)$$
Thus the expected value of the $i$th response is given by
$$E(Z_{2i}) = \pi\mu_y + (1-\pi)\theta_y. \quad (11.5.3.4)$$

The third device $R_w$ is the same as that invented by Warner (1965); that is, it consists of two outcomes. The statement 'Are you a member of gang $G$?' occurs with probability $P$ and its complement 'Are you not a member of gang $G$?' occurs with probability $(1-P)$. Each respondent is also requested to use the device $R_w$ and report 'Yes' or 'No' according to his status and the statement drawn by him from the device; this, in fact, constitutes the third response of the $i$th respondent. Thus the probability of a 'Yes' answer is given by
$$\theta = P\pi + (1-P)(1-\pi). \quad (11.5.3.5)$$

Thus we have the following theorems:

Theorem 11.5.3.1. An unbiased estimator of $\pi$ is given by
$$\hat{\pi}_w = \frac{\hat{\theta} - (1-P)}{2P - 1}, \quad P \neq 0.5, \quad (11.5.3.6)$$
where $\hat{\theta} = n_1/n$ is the observed proportion of 'Yes' answers in the sample.
Proof. Obvious from (11.5.3.5).

Theorem 11.5.3.2. An estimator of the mean $\mu_x$ of the quantitative character $X$ of the hidden gang $G$ is given by
$$\hat{\mu}_x = \frac{\bar{Z}_1 - \theta_x}{\hat{\pi}_w} + \theta_x, \quad (11.5.3.7)$$
where $\bar{Z}_1 = n^{-1}\sum_{i=1}^{n} Z_{1i}$.
Proof. Obtained by the method of moments after solving (11.5.3.2) and (11.5.3.5) for $\mu_x$.

Theorem 11.5.3.3. An estimator of the mean $\mu_y$ of the quantitative character $Y$ of the hidden gang $G$ is given by
$$\hat{\mu}_y = \frac{\bar{Z}_2 - \theta_y}{\hat{\pi}_w} + \theta_y, \quad (11.5.3.8)$$
where $\bar{Z}_2 = n^{-1}\sum_{i=1}^{n} Z_{2i}$.
Proof. Obtained by the method of moments.
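The three-device estimators (11.5.3.6) through (11.5.3.8) can be sketched as follows, with all population values simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, pi_true, P = 100_000, 0.3, 0.75
member = rng.random(N) < pi_true
x = rng.normal(20.0, 3.0, N)      # sensitive character X of gang members
y = rng.normal(30.0, 3.0, N)      # sensitive character Y

theta_x, theta_y, n = 20.0, 30.0, 8_000
s = rng.integers(0, N, n)         # SRSWR sample
z1 = np.where(member[s], x[s], rng.normal(theta_x, 2.0, n))   # device R_x
z2 = np.where(member[s], y[s], rng.normal(theta_y, 2.0, n))   # device R_y
yes = np.where(rng.random(n) < P, member[s], ~member[s])      # Warner device R_w

pi_w = (yes.mean() - (1 - P)) / (2 * P - 1)       # (11.5.3.6)
mu_x = (z1.mean() - theta_x) / pi_w + theta_x     # (11.5.3.7)
mu_y = (z2.mean() - theta_y) / pi_w + theta_y     # (11.5.3.8)
```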

Theorem 11.5.3.4. An estimator of the correlation coefficient $\rho_{xy}$ between the two quantitative characters $X$ and $Y$ of the hidden gang $G$ is given by
$$\hat{\rho}_{xy} = \frac{s_{z1z2}}{\hat{\pi}_w\sqrt{s_{z1}^2 - (1-\hat{\pi}_w)\{\hat{\pi}_w(\theta_x - \hat{\mu}_x)^2 + \gamma_x^2\}}\sqrt{s_{z2}^2 - (1-\hat{\pi}_w)\{\hat{\pi}_w(\theta_y - \hat{\mu}_y)^2 + \gamma_y^2\}}}, \quad (11.5.3.9)$$
where $s_{z1}^2$, $s_{z2}^2$, and $s_{z1z2}$ denote the sample variances and covariance of the responses $Z_{1i}$ and $Z_{2i}$.

Proof. The distribution of $Z_{1i}^2$ is given by
$$Z_{1i}^2 = \begin{cases} X^2 & \text{with probability } \pi, \\ T_x^2 & \text{with probability } (1-\pi). \end{cases} \quad (11.5.3.10)$$
Thus the expected value of $Z_{1i}^2$ is given by
$$E(Z_{1i}^2) = \pi E(X^2) + (1-\pi)E(T_x^2), \quad (11.5.3.11)$$
which implies that
$$E(X^2) = \frac{E(Z_{1i}^2) - (1-\pi)(\gamma_x^2 + \theta_x^2)}{\pi}. \quad (11.5.3.12)$$
Thus we have
$$V(X) = E(X^2) - \{E(X)\}^2 = \frac{\sigma_{z1}^2 - (1-\pi)\{\pi(\theta_x - \mu_x)^2 + \gamma_x^2\}}{\pi}. \quad (11.5.3.13)$$
Similarly
$$V(Y) = \frac{\sigma_{z2}^2 - (1-\pi)\{\pi(\theta_y - \mu_y)^2 + \gamma_y^2\}}{\pi}. \quad (11.5.3.14)$$

Again we have
$$E(Z_{1i}Z_{2i}) = \pi^2 E(XY) + \pi(1-\pi)E(X)E(T_y) + \pi(1-\pi)E(Y)E(T_x) + (1-\pi)^2 E(T_x)E(T_y), \quad (11.5.3.15)$$
which implies that
$$E(XY) = \frac{E(Z_{1i}Z_{2i}) - (1-\pi)\{\pi\mu_x\theta_y + \pi\mu_y\theta_x + (1-\pi)\theta_x\theta_y\}}{\pi^2}. \quad (11.5.3.16)$$
Thus we have
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = \frac{\sigma_{z1z2}}{\pi^2}. \quad (11.5.3.17)$$
From (11.5.3.13), (11.5.3.14), and (11.5.3.17) the correlation between the two characters $X$ and $Y$ of the hidden gang $G$ is given by
$$\rho_{xy} = \frac{\sigma_{z1z2}}{\pi\sqrt{\sigma_{z1}^2 - (1-\pi)\{\pi(\theta_x - \mu_x)^2 + \gamma_x^2\}}\sqrt{\sigma_{z2}^2 - (1-\pi)\{\pi(\theta_y - \mu_y)^2 + \gamma_y^2\}}}. \quad (11.5.3.18)$$
By the method of moments this proves the theorem.

The difference between the estimator of the correlation coefficient developed at (11.5.3.18) and those developed by Bellhouse (1995) and Singh (1991b) is that this formula estimates the correlation between two sensitive characters of a particular hidden gang, whereas Bellhouse's technique estimates the correlation between two sensitive characters of the whole population.

Let $P$ be a finite population of size $N$ (known) and let $N_G$ (unknown) be the total number of persons belonging to some sensitive group $G$. Let $X_i$ be the value of a stigmatized quantitative variable, $X$ say, for a person $i$ in the sensitive group (gang) $G$. Independent samples $s_k$, $k = 1, 2$, of sizes $n_k$ are selected from the population $P$ following the sampling designs $p_k$. Let $\pi_{i(k)}$ and $\pi_{ij(k)}$ be the first and second order inclusion probabilities for the sampling design $p_k$. If the respondent labelled $i$ is selected in the sample $s_k$, then he/she has to disclose the true response $X_i$ if he/she belongs to the group $G$; otherwise he/she has to produce a randomized response $R_{ki}$ following a certain randomization design. In other words, the persons appearing in the $k$th sample are provided with a randomization device $R_k$ (say). The device may be a spinner, a deck of cards, or a computer grid which generates a random variable taking a value greater than or equal to $\psi_0$ (say) from a given probability distribution (Normal, Weibull, or Gamma, etc.). Each respondent $i$ selected in the sample is requested to draw one random number (which is obviously greater than or equal to $\psi_0$) from the randomization device $R_k$, without showing it to the interviewer. It is to be noted that a randomization device $R_k$ whose range is almost the same as that of $X_i$ is better for collecting sample data, especially in a sensitive survey. Now he/she is requested to report the actual value of the stigmatized quantitative variable, say $X_i$, if and only if he/she belongs to gang $G$; otherwise he/she is requested to report the random number $R_{ki}$, say, as drawn. The value of $\psi_0$ depends upon the problem under consideration. For example, under problem 1, the value of $\psi_0$ is \$60,000; under problems 2 and 3, the value of $\psi_0$ can be taken as zero or any other suitable value. The choice of $\psi_0$ can be made such that the respondents' privacy will not be jeopardised if they respond honestly. It is assumed that the distribution of the randomization device, $R_k$, is known to the interviewer but not the number drawn by the respondent. Let $\theta_k$ and $\sigma_k^2$ denote the known mean and variance of the randomization device $R_k$. Let $\pi$ denote the proportion of persons belonging to the gang $G$ in the population. Thus for the $i$th unit (if it is included in the sample $s_k$), we obtain a response
$$Z_{ki} = \begin{cases} X_i & \text{if } i \in G, \\ R_{ki} & \text{if } i \notin G, \end{cases} \quad (11.6.1)$$
where $X_i$ and $R_{ki}$ denote, respectively, the actual value of the stigmatized quantitative character and the random variable drawn by the $i$th respondent in the $k$th sample, $k = 1, 2$. Arnab and Singh (2002b) consider the problem of estimation of the population proportion, given by
$$\pi = N_G/N, \quad (11.6.2)$$

and the population mean of the desired sub-group or hidden gang given by
$$\mu_x = \frac{1}{N_G}\sum_{i \in G} X_i, \quad (11.6.3)$$
where $X_i$ is the value of the quantitative characteristic associated with the $i$th respondent. Let $G$ be the population of the gang, let $\bar{G} = P - G$, and let
$$I_i = \begin{cases} 1 & \text{if } i \in G, \\ 0 & \text{if } i \notin G. \end{cases}$$
The randomized response from the $i$th unit for the sample $s_k$ as given in (11.6.1) can therefore be written as
$$Z_{ki} = X_i I_i + (1 - I_i)R_{ki} = X_i I_i + R_{ki}I_i^c, \quad (11.6.4)$$
where $I_i^c = 1 - I_i$. Denoting the expectation and variance with respect to the randomization device by $E_R$ and $V_R$, respectively, we have
$$E_R(Z_{ki}) = X_i I_i + E_R(R_{ki})I_i^c = X_i I_i + \theta_k I_i^c = Y_{i(k)} \text{ (say)} \quad (11.6.5)$$
and
$$V_R(Z_{ki}) = I_i^c V_R(R_{ki}) = I_i^c\sigma_k^2. \quad (11.6.6)$$

Now consider the following linear homogeneous unbiased estimator for $E_R(\bar{Z}_k)$, where $\bar{Z}_k = \frac{1}{N}\sum_{i=1}^{N} Z_{ki}$, based on the sampling design $p_k$:
$$T_k = \frac{1}{N}\sum_{i \in s_k} b_{s_k i}Z_{ki}, \quad (11.6.7)$$
with the $b_{s_k i}$ being known constants satisfying the design unbiasedness ($p_k$-unbiasedness) condition given by
$$\sum_{s_k \ni i} b_{s_k i}\,p_k(s_k) = 1 \quad \forall\, i. \quad (11.6.8)$$
Denoting the expectation and variance with respect to the sampling design $p_k$ by $E_p$ and $V_p$, respectively, we have the following theorems:

Theorem 11.6.1. The estimator $T_k$ is an unbiased estimator for
$$E_R(\bar{Z}_k) = \pi\mu_x + (1-\pi)\theta_k. \quad (11.6.9)$$

Proof. We have
$$E(T_k) = E_p E_R(T_k) = \frac{1}{N}E_p\sum_{i \in s_k} b_{s_k i}Y_{i(k)} = \frac{1}{N}\sum_{i=1}^{N} Y_{i(k)} = \frac{1}{N}\sum_{i=1}^{N}\left\{X_i I_i + I_i^c\theta_k\right\}$$
$$= \frac{N_G}{N}\cdot\frac{1}{N_G}\sum_{i \in G} X_i + \theta_k\frac{(N - N_G)}{N} = \pi\mu_x + (1-\pi)\theta_k.$$
Hence the theorem.

Theorem 11.6.2. The variance of the estimator $T_k$ is given by
$$V_k = V(T_k) = \frac{1}{N^2}\left[\sigma_k^2\sum_i a_i(k)I_i^c + \left\{\sum_i Y_{i(k)}^2(a_i(k) - 1) + \sum_{i \neq j}\sum Y_{i(k)}Y_{j(k)}(a_{ij}(k) - 1)\right\}\right], \quad (11.6.10)$$
where $a_i(k) = \sum_{s_k \ni i} b_{s_k i}^2\,p_k(s_k)$ and $a_{ij}(k) = \sum_{s_k \ni i,j} b_{s_k i}b_{s_k j}\,p_k(s_k)$.
Proof. We have
$$N^2 V(T_k) = N^2\left[E_p\{V_R(T_k)\} + V_p\{E_R(T_k)\}\right] = E_p\left[\sum_{i \in s_k} b_{s_k i}^2\sigma_k^2 I_i^c\right] + V_p\left[\sum_{i \in s_k} b_{s_k i}Y_{i(k)}\right]$$
$$= \sigma_k^2\sum_i a_i(k)I_i^c + \sum_i Y_{i(k)}^2 a_i(k) + \sum_{i \neq j}\sum Y_{i(k)}Y_{j(k)}a_{ij}(k) - \left(\sum_{i=1}^{N} Y_{i(k)}\right)^2$$
$$= \sigma_k^2\sum_i a_i(k)I_i^c + \left\{\sum_i Y_{i(k)}^2(a_i(k) - 1) + \sum_{i \neq j}\sum Y_{i(k)}Y_{j(k)}(a_{ij}(k) - 1)\right\}.$$
Hence the theorem.

Theorem 11.6.3. An unbiased estimator of the variance $V(T_k)$ is given by
$$\hat{V}_k' = \hat{V}_k + \frac{\sigma_k^2}{N}\left(1 - \hat{\pi}(k)\right),$$
where
$$\hat{V}_k = \frac{1}{N^2}\left[\sum_{i \in s_k}\frac{z_{ki}^2(a_i(k) - 1)}{\pi_{i(k)}} + \sum_{i \neq j \in s_k}\sum\frac{z_{ki}z_{kj}(a_{ij}(k) - 1)}{\pi_{ij(k)}}\right] \quad \text{and} \quad \hat{\pi}(k) = \frac{1}{N}\sum_{i \in s_k}\frac{I_i}{\pi_{i(k)}}.$$
Proof. Consider
$$E(\hat{V}_k) = \frac{1}{N^2}\left[\sum_i\left\{Y_{i(k)}^2 + \sigma_k^2 I_i^c\right\}(a_i(k) - 1) + \sum_{i \neq j}\sum Y_{i(k)}Y_{j(k)}(a_{ij}(k) - 1)\right]$$
$$= V(T_k) - \frac{\sigma_k^2}{N^2}\sum_i I_i^c = V(T_k) - \frac{\sigma_k^2}{N}(1 - \pi).$$
Hence the theorem.

Corollary 11.6.1. An alternative unbiased estimator for estimating $V(T_k)$ is
$$\hat{V}' = \hat{v}_k + \frac{\sigma_k^2}{N}\left(1 - \hat{\pi}(k)\right), \quad (11.6.11)$$
where
$$\hat{v}_k = \frac{1}{N^2}\sum_{i < j \in s_k}\sum\left(\frac{\pi_{i(k)}\pi_{j(k)} - \pi_{ij(k)}}{\pi_{ij(k)}}\right)\left(\frac{z_{ki}}{\pi_{i(k)}} - \frac{z_{kj}}{\pi_{j(k)}}\right)^2 + \frac{\sigma_k^2\left(1 - \hat{\pi}(k)\right)}{N},$$

assuming each Sk (k = 1, 2) with positive P(Sk) to have a ' common' number of


distinct units.

Theorem 11.6.4. An unbiased estimator of $\pi$ is given by
$$\hat{\pi} = 1 - \frac{T_1 - T_2}{\theta_1 - \theta_2} \quad (11.6.12)$$
with variance
$$V(\hat{\pi}) = \frac{V_1 + V_2}{(\theta_1 - \theta_2)^2}, \quad (11.6.13)$$
and an unbiased estimator of $V(\hat{\pi})$ is given by
$$v(\hat{\pi}) = \frac{\hat{V}_1 + \hat{V}_2}{(\theta_1 - \theta_2)^2}. \quad (11.6.14)$$
Proof. Straightforward from the previous section.

The estimator of the proportion in (11.6.12) becomes non-functional if $\theta_1 - \theta_2 = 0$. It is therefore important, while constructing a pair of randomization devices, to ensure that $\theta_1 - \theta_2 \neq 0$. The expression (11.6.13) indicates that the variance of the estimator of the proportion is inversely proportional to $(\theta_1 - \theta_2)^2$. Thus one should choose $\theta_1$ and $\theta_2$ such that the difference $|\theta_1 - \theta_2|$ is as large as possible without threatening the privacy of the respondents.

We have
$$E(T_1) = \pi\mu_x + (1-\pi)\theta_1, \quad \text{so that} \quad \mu_x = \frac{E(T_1) - (1-\pi)\theta_1}{\pi}. \quad (11.6.15)$$
This implies that
$$\hat{\mu}_x = \frac{T_1 - (1-\hat{\pi})\theta_1}{\hat{\pi}} = \frac{T_1 - \dfrac{(T_1 - T_2)\theta_1}{\theta_1 - \theta_2}}{1 - \dfrac{T_1 - T_2}{\theta_1 - \theta_2}} = \frac{T_2\theta_1 - T_1\theta_2}{(T_2 - \theta_2) - (T_1 - \theta_1)}. \quad (11.6.16)$$
Thus an estimator of $\mu_x$ is given by
$$\hat{\mu}_x = \frac{\hat{d}_1}{\hat{d}_2},$$
where $\hat{d}_1 = T_2\theta_1 - T_1\theta_2$ and $\hat{d}_2 = (T_2 - \theta_2) - (T_1 - \theta_1)$. Arnab and Singh (2002c) studied its special cases under different sampling designs, such as the Horvitz and Thompson (1952) and the Rao, Hartley, and Cochran (1962) strategies, and applied them to real data.
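A sketch under SRSWOR: with $b_{s_k i} = N/n_k$ (the Horvitz-Thompson choice, which satisfies (11.6.8) since $\pi_{i(k)} = n_k/N$), $T_k$ is simply the sample mean of the responses, and (11.6.12) and (11.6.16) follow directly. All population values below are simulated:

```python
import numpy as np

rng = np.random.default_rng(9)
N, pi_true, mu_true = 50_000, 0.2, 100.0
member = rng.random(N) < pi_true
x = rng.normal(mu_true, 10.0, N)

theta = {1: 80.0, 2: 120.0}       # known device means, theta_1 != theta_2
n = 2_500
T = {}
for k in (1, 2):
    s = rng.choice(N, size=n, replace=False)     # SRSWOR, pi_i(k) = n/N
    z = np.where(member[s], x[s], rng.normal(theta[k], 10.0, n))
    T[k] = z.mean()                # (1/N) sum (N/n) z_ki, i.e. b_{s_k i} = N/n

pi_hat = 1 - (T[1] - T[2]) / (theta[1] - theta[2])                    # (11.6.12)
mu_hat = (T[2] * theta[1] - T[1] * theta[2]) / (
    (T[2] - theta[2]) - (T[1] - theta[1]))                            # (11.6.16)
```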

Eichhorn and Hayre (1983) proposed an ingenious method to collect information on quantitative rather than qualitative characters. According to this model each respondent selected in the sample is requested to give the scrambled response $Y_i = X_i S$, where the $X_i$ are the real values of the quantitative variable and $S$ is the scrambling variable. It is reasonably assumed that the first two moments of $S$ are known, i.e., $E(S) = \theta$ and $V(S) = \gamma^2$. They suggested an unbiased estimator of the population mean
$$\bar{y}_0 = \frac{1}{n\theta}\sum_{i=1}^{n} Y_i \quad (11.7.1)$$
with variance
$$V(\bar{y}_0) = \frac{1}{n}\left[\sigma_x^2 + \left(\frac{\gamma}{\theta}\right)^2(\sigma_x^2 + \mu_x^2)\right], \quad (11.7.2)$$

where $Y_1, Y_2, \ldots, Y_n$ are the scrambled responses obtained from the $n$ sampled respondents, and $\mu_x$ and $\sigma_x^2$ denote the mean and the variance of the sensitive quantitative variable $x$ under study. Mangat (1991) and Mangat and Singh (1994) have proposed an optional randomized response technique for qualitative characters and have shown empirically that the proposed optional randomized response technique is superior to Warner's (1965) technique in certain situations. Singh and Joarder (1997a) proposed an optional randomized response technique for estimating the population mean of a quantitative sensitive variable. It is interesting to note that the proposed optional randomized response technique always remains more efficient than the well known technique of Eichhorn and Hayre (1983). In the optional RRT, each respondent selected in the SRSWR sample has either of the following two options:

( i ) He can report his actual value of $X_i$; or
( ii ) He can report the scrambled response $t_i = (X_i S)/\theta$,

without revealing to the interviewer which mode has been followed in giving the response. Here we should stress the fact that $t_i$ should have the same support as $X_i$, otherwise one immediately knows whether the answer comes from option ( i ) or ( ii ). Let $W$ be the probability that the respondent selects option ( i ), so that $(1-W)$ is the probability that he selects option ( ii ).

The response $Z_i$ then has the distribution
$$Z_i = \begin{cases} X_i & \text{with probability } W, \\ t_i & \text{with probability } (1-W). \end{cases} \quad (11.7.3)$$
Thus we have the following theorem:

Theorem 11.7.1. An unbiased estimator of $\mu_x$ is given by
$$\bar{y}_1 = \frac{1}{n}\sum_{i=1}^{n} Z_i. \quad (11.7.4)$$
Proof. By the definition of the expected value we have
$$E(Z_i) = WE(X_i) + (1-W)E(t_i) = \mu_x. \quad (11.7.5)$$
Hence the theorem.

In order to find the variance of the estimator in (11.7.4) we have the following lemma.

Lemma 11.7.1. The distribution of $Z_i^2$ is given as follows:
$$Z_i^2 = \begin{cases} X_i^2 & \text{with probability } W, \\ t_i^2 & \text{with probability } (1-W), \end{cases} \quad (11.7.6)$$
and the expected value of $Z_i^2$ is given by
$$E(Z_i^2) = WE(X_i^2) + (1-W)E(t_i^2) = W(\sigma_x^2 + \mu_x^2) + (1-W)\left(1 + \frac{\gamma^2}{\theta^2}\right)(\sigma_x^2 + \mu_x^2). \quad (11.7.7)$$

Thus we have the following theorem:

Theorem 11.7.2. The variance of the estimator $\bar{y}_1$ is given by
$$V(\bar{y}_1) = \frac{1}{n}\left[\sigma_x^2 + \left(\frac{\gamma}{\theta}\right)^2(\sigma_x^2 + \mu_x^2)(1-W)\right]. \quad (11.7.8)$$
Proof. We have
$$V(\bar{y}_1) = V\left(\frac{1}{n}\sum_{i=1}^{n} Z_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} V(Z_i) = \frac{\sigma_z^2}{n}. \quad (11.7.9)$$
Note that $\sigma_z^2 = E(Z_i^2) - \{E(Z_i)\}^2$; on using (11.7.5) and (11.7.7) we have
$$\sigma_z^2 = \sigma_x^2 + \left(\frac{\gamma}{\theta}\right)^2(\sigma_x^2 + \mu_x^2)(1-W). \quad (11.7.10)$$
On putting the value of $\sigma_z^2$ from (11.7.10) in (11.7.9) we have (11.7.8). Hence the theorem.
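A simulation sketch of the optional RRT (all parameter values hypothetical) shows $\bar{y}_1$ tracking $\mu_x$ while $s_z^2/n$ estimates its variance; here (11.7.10) gives the theoretical $\sigma_z^2 = 36 + 0.0625 \times 1636 \times 0.6 \approx 97.35$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, mu_x, sd_x = 10_000, 40.0, 6.0
theta, gamma = 1.0, 0.25        # scrambling variable S with E(S)=1, V(S)=0.0625
W = 0.4                          # chance that a respondent answers directly

x = rng.normal(mu_x, sd_x, n)
S = rng.normal(theta, gamma, n)
direct = rng.random(n) < W
z = np.where(direct, x, x * S / theta)    # responses as in (11.7.3)

y1 = z.mean()                    # (11.7.4), unbiased for mu_x
v_hat = z.var(ddof=1) / n        # s_z^2 / n, estimated variance of y1
```

Note that neither $\bar{y}_1$ nor $s_z^2/n$ uses the value of $W$, in line with Remark 11.7.1 below.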

Theorem 11.7.3. An unbiased estimator of the variance of $\bar{y}_1$ is given by $s_z^2/n$, where
$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{y}_1)^2. \quad (11.7.11)$$
Proof. The estimator $s_z^2$ can easily be written as
$$s_z^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} Z_i^2 + n\bar{y}_1^2 - 2\bar{y}_1\sum_{i=1}^{n} Z_i\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} Z_i^2 - n\bar{y}_1^2\right]. \quad (11.7.12)$$

Taking expected values on both sides of (11.7.12) we have
$$E(s_z^2) = \sigma_z^2, \quad \text{so that} \quad E(s_z^2/n) = V(\bar{y}_1).$$
Hence the theorem.

The percent relative efficiency (PRE) of the estimator $\bar{y}_1$ with respect to $\bar{y}_0$ is
$$\mathrm{PRE} = \{V(\bar{y}_0)/V(\bar{y}_1)\}\times 100\%. \quad (11.7.13)$$
On using (11.7.2) and (11.7.8) we have
$$\mathrm{PRE} = \frac{C_x^2 + C_\gamma^2(1 + C_x^2)}{C_x^2 + C_\gamma^2(1 + C_x^2)(1-W)}\times 100\%, \quad (11.7.14)$$
where $C_x = \sigma_x/\mu_x$ and $C_\gamma = \gamma/\theta$ have their usual meanings. Note that $0 \leq W \leq 1$; therefore (11.7.14) shows that the PRE is always at least 100\%.

Thus we have the following theorem:

Theorem 11.7.4. The estimator $\bar{y}_1$ is always more efficient than the estimator $\bar{y}_0$.
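The claim follows at once from (11.7.14); a direct transcription makes it easy to tabulate:

```python
def pre(W, cx, cg):
    """Percent relative efficiency (11.7.14) of y1 relative to y0."""
    top = cx ** 2 + cg ** 2 * (1 + cx ** 2)
    bottom = cx ** 2 + cg ** 2 * (1 + cx ** 2) * (1 - W)
    return 100.0 * top / bottom

values = [pre(w, 0.1, 0.5) for w in (0.0, 0.5, 0.9)]   # increasing in W
```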
Remark 11.7.1. The exact value of $W$ can never be known. In practice, the investigator/interviewer does not need the actual value of $W$: the estimator $\bar{y}_1 = \frac{1}{n}\sum_{i=1}^{n} Z_i$ at (11.7.4) and the estimator of its variance, $s_z^2/n$ with $s_z^2 = (n-1)^{-1}\sum_{i=1}^{n}(Z_i - \bar{y}_1)^2$ at (11.7.11), are free from the value of $W$.

Remark 11.7.2. The percent relative efficiency (PRE) defined at (11.7.14) is an increasing function of $W$: as $W$ approaches one, the PRE increases. Thus if the interviewer is interested in knowing how much was gained by using the optional model, a rough guess can be made by using an unbiased estimator of $W$. Each respondent selected in the sample can be provided with another randomization device $R_2$ (say) with the following questions:

( a ) Have you reported the actual value? (with probability $T$);
( b ) Have you reported the scrambled response? (with probability $(1-T)$).

If $\hat{\theta}$ is the observed proportion of 'Yes' answers, then following Warner (1965) an unbiased estimator of $W$ is given by
$$\hat{W} = \frac{\hat{\theta} - (1-T)}{2T - 1}. \quad (11.7.15)$$
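For instance, (11.7.15) as a one-liner (the numbers are purely illustrative):

```python
def w_hat(theta_hat, T):
    """Warner-type estimator (11.7.15) of the proportion W of direct answers."""
    return (theta_hat - (1 - T)) / (2 * T - 1)

estimate = w_hat(0.62, 0.7)   # 62% 'Yes' answers with T = 0.7 gives W_hat ~ 0.8
```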

Remark 11.7.3. A rough guess can also be made about $W$ from a past survey or a pilot survey. For example, if out of 100 persons 20 would report the actual value truthfully, then $W$ can be taken as 0.2.

Remark 11.7.4. We have assumed $W$ to be the same for all the respondents in the sample, which is somewhat restrictive. It is worth mentioning here that the case of an unequal $W$ for each respondent can be handled by a hierarchical (or empirical) Bayes approach.

Example 11.7.1. Show graphically that the percent relative efficiency (PRE) of the optional randomized response technique is an increasing function of the proportion of respondents giving direct answers.
Given: $C_x = 0.1$ and $0.7$.
Solution. The graphical representation of the PRE of the optional randomized response technique is as follows:

[Figure: PRE plotted against $W$ for $C_\gamma = 0.1$, $0.5$, and $0.9$, in two panels with $C_x = 0.1$ (top) and $C_x = 0.7$ (bottom); in every case the PRE increases with $W$.]

Fig. 11.7.1 Percent relative efficiency.



Following Chaudhuri and Mukherjee (1988), Chaudhuri and Roy (1997b) assumed that randomized response devices are available to produce a response $r_i$ from the $i$th respondent in the sample such that
$$E_R(r_i) = Y_i \quad \text{and} \quad V_R(r_i) = V_i = \alpha_i Y_i^2 + \tau_i Y_i + \theta_i, \quad (11.8.1)$$
where $\alpha_i$, $\tau_i$, and $\theta_i$ in (11.8.1) are known constants. For example, consider a practicable randomization device proposed by Chaudhuri and Adhikary (1990). According to this device the $i$th respondent in the sample is required to choose independently at random two tickets numbered $a_i$ and $b_i$ out of boxes proposed by the investigator containing the tickets numbered ( i ) $A_1, A_2, \ldots, A_m$ with known mean $\bar{A}$ and known variance $\sigma_A^2$, and ( ii ) $B_1, B_2, \ldots, B_l$ with known mean $\bar{B}$ and variance $\sigma_B^2$.

The respondent is required to report the response as
$$Z_i = a_i Y_i + b_i.$$
Thus
$$E_R(Z_i) = \bar{A}Y_i + \bar{B}, \quad r_i = \frac{Z_i - \bar{B}}{\bar{A}}, \quad \text{and} \quad V_R(r_i) = V_i = \frac{\sigma_A^2 Y_i^2 + \sigma_B^2}{\bar{A}^2},$$
where $E_R$ and $V_R$ denote the expected value and variance corresponding to the randomization device. Thus on comparing (11.8.1) with the above randomization device, we get $\alpha_i = \sigma_A^2/\bar{A}^2$, $\tau_i = 0$, and $\theta_i = \sigma_B^2/\bar{A}^2$. Chaudhuri and Mukherjee (1988) have also given an estimator for $V_i$ as
$$\hat{V}_i = \frac{1}{1 + \alpha_i}\left(\alpha_i r_i^2 + \tau_i r_i + \theta_i\right) \quad (11.8.2)$$
satisfying $E_R(\hat{V}_i) = V_i$, $i \in s$, provided $1 + \alpha_i \neq 0$. Consider an auxiliary variable $x$ with known positive values $x_i$ and known total $X = \sum_{i=1}^{N} x_i$. Assume the superpopulation model is given by
$$Y_i = \beta x_i + e_i, \quad i = 1, 2, \ldots, N, \quad (11.8.3)$$
where $E_m(e_i) = 0$, $V_m(e_i) = \sigma_i^2$, and $C_m(e_i, e_j) = \sigma_{ij}$, $i \neq j$.

Then we have the following theorem:

Theorem 11.8.1. The regression predictor for estimating the population total $Y$, under randomized response sampling, is
$$t_r = X\hat{\beta}_Q(r) + \sum_{i \in s} R_i\left[r_i - x_i\hat{\beta}_Q(r)\right], \quad (11.8.4)$$
where the $Q_i$ and $R_i$ are chosen subject to the condition
$$(1 - R_i\pi_i)/(Q_i\pi_i x_i) = \text{a constant} \quad \forall\, i \in \Omega,$$
and
$$\hat{\beta}_Q(r) = \frac{\sum_{i \in s} r_i x_i Q_i}{\sum_{i \in s} x_i^2 Q_i} \quad \text{and} \quad e_i(r) = r_i - x_i\hat{\beta}_Q(r).$$
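A numerical sketch of the predictor (11.8.4) under the stated model follows; the particular choices of $Q_i$ and $R_i$ below (and the SRSWOR design) are illustrative assumptions, not the optimal ones:

```python
import numpy as np

rng = np.random.default_rng(11)
N, n, beta = 1_000, 100, 2.0
x = rng.uniform(1.0, 5.0, N)
y = beta * x + rng.normal(0.0, 0.5, N)     # superpopulation model (11.8.3)
X_total = x.sum()

s = rng.choice(N, n, replace=False)        # SRSWOR, pi_i = n/N
r = y[s] + rng.normal(0.0, 0.3, n)         # randomized responses, E_R(r_i) = y_i

Q = 1.0 / x[s]                              # an illustrative choice of Q_i
R = np.full(n, N / n)                       # expansion weights as R_i

beta_q = np.sum(r * x[s] * Q) / np.sum(x[s] ** 2 * Q)       # beta_hat_Q(r)
t_r = X_total * beta_q + np.sum(R * (r - x[s] * beta_q))    # (11.8.4)
```

The predictor combines a model-based projection $X\hat{\beta}_Q(r)$ with a design-weighted correction from the sampled residuals, which is why it tracks the population total $Y = \sum_i y_i$ closely.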

Following Chaudhuri and Roy (1997b) we have the following theorem:

Theorem 11.8.2. Two estimators for estimating the variance $V(t_r)$ are given by
$$v_k(t_r) = \frac{1}{2}\sum_{i \neq j \in s}\sum d_{ij}c_{ij}\left[\left(d_i a_{ki}e_i(r) - d_j a_{kj}e_j(r)\right)^2 - \left\{(d_i a_{ki})^2\hat{F}_{ii} + (d_j a_{kj})^2\hat{F}_{jj} - 2a_{ki}a_{kj}d_i d_j\hat{F}_{ij}\right\}\right] + \sum_{i \in s}(d_i a_{ki})^2\hat{V}_i,$$
where $k = 1, 2$, the $\hat{F}_{ij}$ are the sample analogues of the parameters $F_{ij} = E_R[e_i(r) - e_i][e_j(r) - e_j]$, $d_i = 1/\pi_i$, $d_{ij} = 1/\pi_{ij}$, $c_{ij} = \pi_i\pi_j - \pi_{ij}$, $a_{1i} = 1$, and
$$a_{2i} = 1 + \left(X - \sum_{i \in s}\frac{x_i}{\pi_i}\right)\frac{x_i Q_i\pi_i}{\sum_{i \in s} x_i^2 Q_i}.$$

The problem of estimation of the variance of the linear regression estimator under
randomized response sampling has also been considered by Chaudhuri (1993), and
Chaudhuri and Maiti (1994). Tracy and Singh (2000) considered the problem of
estimation of the variance of the general linear regression estimator under scrambled
responses using the lower and higher level calibration approaches.
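The scrambled responses r_i used throughout this section come from ticket devices of the kind described at the start of the section; a minimal simulation of such a device (the ticket values and the true y are illustrative assumptions, not taken from the text):

```python
import random

def scrambled_response(y, a_box, b_box):
    """Report z = a*y + b, with a and b drawn at random from the two boxes."""
    return random.choice(a_box) * y + random.choice(b_box)

def unscramble(z, A_bar, B_bar):
    """Invert E_R(z) = A_bar*y + B_bar to get r = (z - B_bar)/A_bar."""
    return (z - B_bar) / A_bar

random.seed(1)
a_box = [1.0, 2.0, 3.0]      # tickets A_1..A_m, known mean A_bar = 2
b_box = [5.0, 10.0, 15.0]    # tickets B_1..B_t, known mean B_bar = 10
y_true = 7.0
# E_R(r) = y, so averaging many unscrambled reports recovers y_true
r = [unscramble(scrambled_response(y_true, a_box, b_box), 2.0, 10.0)
     for _ in range(200000)]
y_hat = sum(r) / len(r)
```

The interviewer never sees which tickets were drawn, only z; unbiasedness of r comes entirely from the known box means.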

Tracy and Singh (2000) consider an estimator of the population total Y in randomized
response sampling as

\hat{Y}_s = \sum_{i=1}^{n} w_i r_i.   (11.8.1.1)
Then we have the following theorems:

Theorem 11.8.1.1. The conditional bias in the estimator \hat{Y}_s is given by

B(\hat{Y}_s \mid s) = \sum_{i=1}^{N}(w_i\pi_i - 1)Y_i.   (11.8.1.2)

Proof. Let E_D denote the expected value over the design D. Then

E(\hat{Y}_s) = E_D E_R(\hat{Y}_s) = E_D E_R\left(\sum_{i=1}^{n} w_i r_i\right) = E_D\left(\sum_{i=1}^{n} w_i Y_i\right) = \sum_{i=1}^{N} w_i Y_i\pi_i.

Taking its difference from the true value we have the theorem.

Theorem 11.8.1.2. The conditional variance of the estimator \hat{Y}_s is given by

V(\hat{Y}_s \mid s) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j(\neq i)=1}^{N}(\pi_i\pi_j - \pi_{ij})(w_i Y_i - w_j Y_j)^2 + \sum_{i=1}^{N} w_i^2 V_i\pi_i.   (11.8.1.3)

Proof. Let V_D denote the variance over the design D; then we have
V(\hat{Y}_s \mid s) = V_D E_R(\hat{Y}_s \mid s) + E_D V_R(\hat{Y}_s \mid s).
Hence the theorem.

Theorem 11.8.1.3. A conditionally unbiased estimator of V(\hat{Y}_s) is given by

\hat{v}(\hat{Y}_s \mid s) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} d_{ij}\Omega_{ij}\frac{(w_i r_i - w_j r_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \sum_{i=1}^{n} w_i^2\hat{V}_i,   (11.8.1.4)

where \sqrt{(1+\alpha_i)(1+\alpha_j)} denotes the bias adjustment factor due to the randomized
experiments in estimating the variance component. Several estimators are shown to
be special cases of the estimator in (11.8.1.4). For simplicity, minimization of the chi
square distance function

\sum_{i=1}^{n}\frac{(w_i - d_i)^2}{d_i q_i}   (11.8.1.5)

subject to the condition \sum_{i=1}^{n} w_i x_i = T_x yields w_i = d_i(1 + q_i x_i\lambda), where \lambda denotes
the Lagrange multiplier and the values of q_i are suitably chosen weights which
result in different forms of the estimators. The resulting estimator \hat{Y}_s at (11.8.1.1)
becomes

\hat{Y}_G = \sum_{i=1}^{n} d_i r_i + \hat{\beta}_s\left(T_x - \sum_{i=1}^{n} d_i x_i\right),   (11.8.1.6)

where \hat{\beta}_s = \left(\sum_{i=1}^{n} d_i q_i x_i^2\right)^{-1}\sum_{i=1}^{n} d_i q_i x_i r_i denotes the weighted estimator of the
multiple regression coefficient in the presence of scrambled responses. This type of


estimator has also been studied by Chaudhuri and Roy (1997b). The properties of
the estimator of the regression coefficient \hat{\beta}_s under the general linear regression model
have been studied by Singh, Joarder, and King (1996). Thus the general linear
regression estimator (GREG) under scrambled responses is a special case of the
calibration technique. Following the model assisted survey sampling approach of
Chaudhuri and Roy (1997b) for scrambled responses, an estimator for the variance
of the regression estimator under scrambled responses can be expressed in the form

\hat{v}(\hat{Y}_G \mid s) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} d_{ij}\Omega_{ij}\frac{(w_i\eta_i - w_j\eta_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \sum_{i=1}^{n} w_i^2\hat{V}_i,   (11.8.1.7)

where \eta_i = r_i - \hat{\beta}_s x_i denotes the residual term under scrambled responses studied by
Singh, Joarder, and King (1996). We have the following cases:

Case I. If w_i = 1/\pi_i then we have

\hat{v}(\hat{Y}_s) = \frac{1}{2}\sum_{i \neq j=1}^{n} d_{ij}\Omega_{ij}\frac{(r_i/\pi_i - r_j/\pi_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \sum_{i=1}^{n}\frac{\hat{V}_i}{\pi_i^2},   (11.8.1.8)

leading to the estimator of the variance of the usual Horvitz and Thompson (1952)
estimator under RR sampling.

Case II. Under the simple random sampling without replacement (SRSWOR)
scheme, \pi_i = \pi_j = n/N and \pi_{ij} = n(n-1)/\{N(N-1)\}, the estimator of variance reduces to

\hat{v}(\hat{Y}_G \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)} + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i,   (11.8.1.9)

where f = n/N. Thus (11.8.1.9) denotes the usual estimator of variance of the
regression estimator under scrambled responses.
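The calibration step behind (11.8.1.6) — minimizing the chi square distance subject to \sum_i w_i x_i = T_x, which gives w_i = d_i(1 + q_i x_i \lambda) — admits a closed-form Lagrange multiplier. A minimal sketch; the design weights, x values, and T_x are illustrative assumptions:

```python
def calibrate(d, q, x, T_x):
    """Chi-square calibration: w_i = d_i*(1 + q_i*x_i*lam), with lam chosen so
    that the calibrated weights reproduce the known total T_x exactly."""
    lam = (T_x - sum(di * xi for di, xi in zip(d, x))) / \
          sum(di * qi * xi * xi for di, qi, xi in zip(d, q, x))
    return [di * (1 + qi * xi * lam) for di, qi, xi in zip(d, q, x)]

d = [2.0, 2.0, 2.0, 2.0]   # design weights d_i = 1/pi_i (illustrative)
x = [1.0, 2.0, 3.0, 4.0]   # auxiliary values
q = [1.0, 1.0, 1.0, 1.0]   # q_i = 1 gives the usual GREG-type weights
w = calibrate(d, q, x, T_x=25.0)
calibrated_total = sum(wi * xi for wi, xi in zip(w, x))
```

Different choices of q_i (e.g. q_i = 1/x_i) reproduce the ratio-type estimators discussed in the cases below.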

Case III. If we choose q_i = 1/x_i the strategy reduces to the usual ratio estimator of
the total under scrambled responses, say \hat{Y}_R. Under SRSWOR sampling,

\hat{Y}_R = N\bar{r}\left(\frac{\bar{X}}{\bar{x}}\right),

where \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i, \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i, and the estimator of variance takes
the form given by

\hat{v}(\hat{Y}_R \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)}\left(\frac{X}{\hat{X}}\right)^2 + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i,   (11.8.1.10)

where \eta_i = r_i - (\bar{r}/\bar{x})x_i and \hat{X} = \frac{N}{n}\sum_{i=1}^{n} x_i.

Thus we have the following theorems:

Theorem 11.8.1.4. A class of estimators for estimating the variance of the ratio
estimator of the population total Y is given by

\hat{v}_g(\hat{Y}_R \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)}\left(\frac{X}{\hat{X}}\right)^g + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i,   (11.8.1.11)

where g is a suitably chosen constant such that the variance of the estimator of
variance is minimum.

Theorem 11.8.1.5. A general class of estimators for estimating the variance of the
general regression estimator under scrambled responses is given by

\hat{v}_g(\hat{Y}_G \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)}\left(\frac{X}{\hat{X}}\right)^g + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i,   (11.8.1.12)

where g is a suitably chosen constant.


In order to study the efficiency of the above estimators of variance, Tracy and
Singh (2000) introduce two classes of estimators for estimating the variance:

\hat{v}_1(\hat{Y}_G \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)} H\left(\frac{X}{\hat{X}}\right) + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i   (11.8.1.13)

and

\hat{v}_2(\hat{Y}_G \mid s) = \left[\frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)} + \frac{N}{n}\sum_{i=1}^{n}\hat{V}_i\right] H\left(\frac{X}{\hat{X}}\right),   (11.8.1.14)

where H(\bullet) is a parametric function such that H(1) = 1, satisfying certain regularity
conditions.

Following Singh, Horn, and Yu (1998) (refer to Chapter 5 for details) consider
here another estimator of variance

\hat{v}_N(\hat{Y}_G \mid s) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n}\Omega_{ij}^{*}\frac{(w_i\eta_i - w_j\eta_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \sum_{i=1}^{n}\phi_i\hat{V}_i,   (11.8.2.1)

where \Omega_{ij}^{*} and \phi_i are new weights such that the distance between \Omega_{ij}^{*} and
D_{ij} = d_{ij}\Omega_{ij}, as well as that between \phi_i and w_i^2, respectively, are minimum. Tracy
and Singh (2000) considered two chi square type distance functions

D_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n}(\Omega_{ij}^{*} - D_{ij})^2(D_{ij}q_{ij})^{-1}   (11.8.2.2)

and

D_2 = \frac{1}{2}\sum_{i=1}^{n}(\phi_i - w_i^2)^2(\delta_i w_i^2)^{-1}.   (11.8.2.3)

They assume that in many situations the variance of the estimator \hat{X}_{HT} of the
population total X, given by

V_{SYG}(\hat{X}_{HT}) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j(\neq i)=1}^{N}\Omega_{ij}(d_i x_i - d_j x_j)^2,

is known either from past surveys or can be calculated. The weights \Omega_{ij}^{*} are
chosen such that the chi square distance (11.8.2.2) is minimum subject to the
second order calibration constraint

\frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n}\Omega_{ij}^{*}(d_i x_i - d_j x_j)^2 = V_{SYG}(\hat{X}_{HT}).   (11.8.2.4)

Now minimization of (11.8.2.2), subject to (11.8.2.4), leads to the new optimal
weights

\Omega_{ij}^{*} = D_{ij} + \frac{D_{ij}q_{ij}(d_i x_i - d_j x_j)^2}{\frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} D_{ij}q_{ij}(d_i x_i - d_j x_j)^4}\left[V_{SYG}(\hat{X}_{HT}) - \hat{v}_{SYG}(\hat{X}_{HT})\right],   (11.8.2.5)

where

\hat{v}_{SYG}(\hat{X}_{HT}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} D_{ij}(d_i x_i - d_j x_j)^2.
Tracy and Singh (2000) introduced a new calibration constraint

\sum_{i=1}^{n}\phi_i[\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3] = \sum_{i=1}^{N}[u_1 x_i^2 + u_2 x_i + u_3] = Q_x,   (11.8.2.6)

where Q_x = \sum_{i=1}^{N}[u_1 x_i^2 + u_2 x_i + u_3] denotes the known quadratic expression or
variance of the auxiliary variable and the \upsilon_j, j = 1, 2, 3, are random variables such
that E[\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3] = Q_x. Minimization of (11.8.2.3), subject to (11.8.2.6), leads
to the new calibration weights given by

\phi_i = w_i^2 + \frac{\delta_i w_i^2(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)}{\sum_{i=1}^{n}\delta_i w_i^2(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)^2}\left[Q_x - \sum_{i=1}^{n} w_i^2(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)\right].   (11.8.2.7)

Substitution of (11.8.2.5) and (11.8.2.7) in (11.8.2.1) leads to the calibrated
estimator of variance

\hat{v}_N(\hat{Y}_G \mid s) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} D_{ij}\frac{(w_i\eta_i - w_j\eta_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \hat{B}_1\left[V_{SYG}(\hat{X}_{HT}) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} D_{ij}\left(\frac{x_i}{\pi_i} - \frac{x_j}{\pi_j}\right)^2\right] + \sum_{i=1}^{n} w_i^2\hat{v}_i + \hat{B}_2\left[Q_x - \sum_{i=1}^{n} w_i^2(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)\right],   (11.8.2.8)

where \hat{B}_1 denotes the regression-type coefficient arising from the second order
calibrated weights (11.8.2.5), and

\hat{B}_2 = \frac{\sum_{i=1}^{n}\frac{\delta_i w_i^2}{1+\alpha_i}(\alpha_i r_i^2 + \beta_i r_i + \theta_i)(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)}{\sum_{i=1}^{n}\delta_i w_i^2(\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)^2}.

A large number of estimators can be shown to be special cases of the estimator
(11.8.2.8).

Case I. Under SRSWOR sampling, let q_i = 1/x_i, q_{ij} = (d_i x_i - d_j x_j)^{-2}, and
\delta_i = (\upsilon_1 x_i^2 + \upsilon_2 x_i + \upsilon_3)^{-1}. Also, for simplicity, if we take \upsilon_1 = 1 and \upsilon_2 = \upsilon_3 = 0, then
an estimator for estimating the variance of the ratio estimator under scrambled
responses is given by

\hat{v}_N(\hat{Y}_R \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)}\left(\frac{X}{\hat{X}}\right)^2\left(\frac{S_x^2}{s_x^2}\right) + \frac{N}{n}\sum_{i=1}^{n}\hat{v}_i\left(\frac{X}{\hat{X}}\right)^2,   (11.8.2.9)

where s_x^2 = (n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 is an unbiased estimator of S_x^2 = (N-1)^{-1}\sum_{i=1}^{N}(x_i - \bar{X})^2.
Case II. Under SRSWOR sampling, a class of estimators for estimating the
variance of the GREG under scrambled responses can be defined as follows:

\hat{v}(\hat{Y}_G \mid s) = \frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)} F\left(\frac{X}{\hat{X}}, \frac{S_x^2}{s_x^2}\right) + \frac{N}{n}\sum_{i=1}^{n}\hat{v}_i\, G\left(\frac{X}{\hat{X}}\right),   (11.8.2.10)

where F(\bullet, \bullet) and G(\bullet) are parametric functions such that F(1,1) = 1 and G(1) = 1,
satisfying certain regularity conditions. Under SRSWOR sampling a more general
class of estimators has been suggested as

\hat{v}_c(\hat{Y}_G \mid s) = \left[\frac{N^2(1-f)}{n(n-1)}\sum_{i=1}^{n}\frac{\eta_i^2}{(1+\alpha_i)} + \frac{N}{n}\sum_{i=1}^{n}\hat{v}_i\right] F\left(\frac{X}{\hat{X}}, \frac{S_x^2}{s_x^2}\right).   (11.8.2.11)

Let u = X/\hat{X}_{HT} and v = V(\hat{X}_{HT})/\hat{v}(\hat{X}_{HT}); then a wider class of estimators is

\hat{v}_N(\hat{Y}_G \mid s) = \left[\frac{1}{2}\sum_{i=1}^{n}\sum_{j(\neq i)=1}^{n} D_{ij}\frac{(w_i\eta_i - w_j\eta_j)^2}{\sqrt{(1+\alpha_i)(1+\alpha_j)}} + \sum_{i=1}^{n} w_i^2\hat{v}_i\right] H(u, v),   (11.8.3.1)

where H(u, v) is a parametric function of u and v such that H(1, 1) = 1, satisfying
certain regularity conditions. All estimators obtained from the functions

H(u,v) = u^{\alpha}v^{\beta},  H(u,v) = \frac{1+\alpha(u-1)}{1+\beta(v-1)},  H(u,v) = 1+\alpha(u-1)+\beta(v-1),  and
H(u,v) = \{1+\alpha(u-1)+\beta(v-1)\}^{-1}

are special cases of the higher level calibration approach, with \alpha and \beta unknown
parameters in the function H(u, v).

Comparisons between different randomized response (RR) models have been
performed by several research workers, for example Greenberg, Abul-Ela,
Simmons, and Horvitz (1969), Moors (1971), and Dowling and Shachtman (1975).
All these comparisons have concentrated solely on variances. The emphasis on
variances, however, amounts to considering matters from the statistician's point of
view only. In the general public's opinion a much more central question would be:
To what extent do the different methods protect the privacy of the interviewees?
Since the degree of privacy protection is an essential part of the RR procedures, and
since greater privacy, in general, results in more cost in terms of the variance of the
estimate, one obvious basis for comparing RR models is to compare variances only
if the degree of privacy protection is held constant. First we would like to discuss
two such measures for qualitative characters proposed by Leysieffer and Warner
(1976), and Lanke (1975, 1976).

A population is divided into two complementary groups, a sensitive group A and its
complement A^c, with unknown proportions \pi and (1-\pi), respectively. Let us consider
a dichotomous response model where a typical response R is 'Yes' (say, Y) or 'No'
(say, N). The conditional probabilities that a response R comes from an individual of
groups A and A^c are P(R \mid A) and P(R \mid A^c), respectively. These quantities are at the
investigator's disposal and are called design probabilities. The posterior
probabilities that a respondent belongs to group A or A^c, respectively, when he/
she reports R, are, say, P(A \mid R) and P(A^c \mid R), and are called revealing probabilities
following Lanke (1975) and Anderson (1976, 1977). Using Bayes' theorem,

P(A \mid R) = \frac{\pi P(R \mid A)}{\pi P(R \mid A) + (1-\pi)P(R \mid A^c)}   (11.9.1.1)

and

P(A^c \mid R) = 1 - P(A \mid R).   (11.9.1.2)
According to Leysieffer and Warner (1976), the response R is regarded as
jeopardizing with respect to A or A^c if

P(A \mid R) > \pi  or  P(A^c \mid R) > 1-\pi,   (11.9.1.3)

respectively. Thus we have

\frac{P(A \mid R)}{P(A^c \mid R)}\cdot\frac{(1-\pi)}{\pi} = \frac{P(R \mid A)}{P(R \mid A^c)}.   (11.9.1.4)

It follows that if the right hand side of (11.9.1.4) is greater (less) than unity, then R
is jeopardizing with respect to A (A^c) in the sense that, with this response, a
respondent genuinely of group A (A^c) rather than of group A^c (A) thus tilts the
scale against himself/herself if A (A^c) is stigmatizing. From (11.9.1.3) and
(11.9.1.4), Leysieffer and Warner (1976) proposed the natural measures of the
jeopardy carried by R about A and A^c, respectively, as follows:

g(R \mid A) = P(R \mid A)/P(R \mid A^c)  and  g(R \mid A^c) = 1/g(R \mid A).   (11.9.1.5)
These functions are called jeopardy functions. The response R is non-jeopardizing
if and only if

g(R \mid A) = 1.   (11.9.1.6)

Clearly the probability of a 'Yes' response is given by

\lambda = P(Y \mid A)\pi + (1-\pi)P(Y \mid A^c) = [P(Y \mid A) - P(Y \mid A^c)]\pi + P(Y \mid A^c).   (11.9.1.7)

If an SRSWR sample of size n is taken and \hat{\lambda} is the sample proportion of 'Yes'
answers, then following Warner (1965) an unbiased estimator of \pi is given by

\hat{\pi} = \frac{\hat{\lambda} - P(Y \mid A^c)}{P(Y \mid A) - P(Y \mid A^c)},   (11.9.1.8)

which is defined if and only if

P(Y \mid A) - P(Y \mid A^c) \neq 0,   (11.9.1.9)

that is, if and only if the condition g(R \mid A) = 1 is violated. In fact, we have proved
that the existence of an unbiased estimator for \pi necessarily makes a response
jeopardizing with respect to either A or A^c.
The variance of \hat{\pi} becomes

V(\hat{\pi}) = \frac{\pi(1-\pi)}{n} + \frac{\pi g(Y \mid A) + (1-\pi)g(N \mid A^c)}{n\{g(Y \mid A) - 1\}\{g(N \mid A^c) - 1\}}.   (11.9.1.10)

Assume without loss of generality that P(Y \mid A) > P(Y \mid A^c), so that g(Y \mid A) > 1 and
g(N \mid A^c) > 1, and hence 'Y' and 'N' are jeopardizing for A and A^c, respectively.
On differentiating (11.9.1.10) partially with respect to g(Y \mid A) and g(N \mid A^c) we have

\frac{\partial V(\hat{\pi})}{\partial g(Y \mid A)} < 0  and  \frac{\partial V(\hat{\pi})}{\partial g(N \mid A^c)} < 0,

respectively. This indicates that for the sake of efficiency one needs as large magnitudes
as possible for g(Y \mid A) and g(N \mid A^c), both above unity. From a practical point of view,
regarding protection of privacy, one can fix some maximal allowable levels of
g(Y \mid A) and g(N \mid A^c) (say, k_1 and k_2), respectively. After fixing g(Y \mid A) and
g(N \mid A^c) at k_1 and k_2, the optimal choice of the design parameters for the particular
RR model can be worked out. In this way one can work out the variance
expressions for each RR model by substituting the values of the design parameters;
the models can then be compared at the same level of protection of privacy.
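As a concrete check on the estimator (11.9.1.8), Warner's device can be simulated with P(Y | A) = P and P(Y | A^c) = 1 - P; the values of \pi, P, and n below are illustrative assumptions:

```python
import random

def warner_answer(is_A, P):
    """One Warner response: the card 'I belong to A' is drawn with probability P,
    and the respondent says 'Yes' iff the card matches his/her true status."""
    card_says_A = random.random() < P
    return is_A == card_says_A

def warner_estimate(lam_hat, P):
    """pi_hat = (lam_hat - (1 - P)) / (2P - 1), defined for P != 1/2."""
    return (lam_hat - (1 - P)) / (2 * P - 1)

random.seed(2)
pi_true, P, n = 0.30, 0.7, 100000
yes = sum(warner_answer(random.random() < pi_true, P) for _ in range(n))
pi_hat = warner_estimate(yes / n, P)
```

With P close to 1/2 the denominator 2P - 1 shrinks and the variance blows up, which is exactly the privacy-versus-efficiency trade-off discussed above.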

Lanke (1976) assumed that a member of A may hesitate to reveal which group
he/she belongs to. On the other hand, a person who belongs to A^c is supposed to be
quite willing to acknowledge the fact. It means that membership in A may be
embarrassing while membership in A^c cannot be considered so. 'Embarrassment'
must mean 'suspicion of belonging to A'. A reasonable conclusion is, then, that
the larger the conditional probability of belonging to A given a certain answer, the
greater the embarrassment caused by giving that response.

The conditional probability

Pr[interviewee belongs to group A \mid interviewee has given the response 'Yes'],

denoted by P(A \mid Y), can be defined as

P(A \mid Y) = \frac{\pi P(Y \mid A)}{\pi P(Y \mid A) + (1-\pi)P(Y \mid A^c)}.   (11.9.2.1)

Similarly, when the response is 'No', then

P(A \mid N) = \frac{\pi P(N \mid A)}{\pi P(N \mid A) + (1-\pi)P(N \mid A^c)}.   (11.9.2.2)
Thus for randomized interviews one method may be considered to be more
protective than the other if

\psi = \max[P(A \mid Y), P(A \mid N)]   (11.9.2.3)

is smaller for the former method than for the latter. The procedure of comparing
two strategies (S_1 and S_2, say) at the same level of protection of the privacy can be
summarised as follows. First derive the conditional probabilities P(A \mid Y) and
P(A \mid N) using the design probabilities, and also check whether P(A \mid Y) > P(A \mid N)
or vice versa for both strategies. The two strategies will then be considered
equivalent from the protection of privacy point of view if P_{S_1}(A \mid Y) = P_{S_2}(A \mid Y). On
equating these two conditional probabilities a relationship is obtained between the
design parameter (say, P_1) of one strategy and the design parameter (say, P_2) of the
other strategy. Substituting this value of P_1 into the variance expression of the first
strategy and comparing it with the variance of the second strategy, we are in a
position to assess the efficiency of the first strategy with respect to the other strategy
at the same level of protection of privacy. Using the above two measures, Bhargava
(1996) and Bhargava and Singh (2002) considered the comparison of the Mangat and
Singh (1990) and Mangat (1994) strategies with the pioneering model of Warner
(1965).
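The revealing probabilities (11.9.2.1)-(11.9.2.2) and the measure (11.9.2.3) follow directly from Bayes' theorem; a small sketch for a Warner-type design with P(Y | A) = P_1 and P(Y | A^c) = 1 - P_1, using \pi = 0.3 and P_1 = 0.7 as illustrative values:

```python
def revealing_probs(pi, p_y_A, p_y_Ac):
    """Return (P(A|Y), P(A|N)) for a dichotomous RR design via Bayes' theorem."""
    p_yes = pi * p_y_A + (1 - pi) * p_y_Ac
    p_A_given_Y = pi * p_y_A / p_yes
    p_A_given_N = pi * (1 - p_y_A) / (1 - p_yes)
    return p_A_given_Y, p_A_given_N

pi, P1 = 0.3, 0.7
pAY, pAN = revealing_probs(pi, P1, 1 - P1)
psi = max(pAY, pAN)   # Lanke's measure (11.9.2.3)
```

For these numbers pAY = 0.5 exceeds pAN, illustrating the P_1 > 0.5 case of the theorems that follow.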

In the Mangat and Singh (1990) model, each interviewee in an SRSWR sample of n
respondents is provided with two randomization devices R_1 and R_2.

The randomization device R_1 consists of two statements, namely,

( i ) I belong to sensitive group A,

( ii ) Go to randomization device R_2,

represented with probabilities T and (1-T), respectively.

The randomization device R_2, which uses two statements, namely,

( i ) I belong to sensitive group A,

( ii ) I belong to group A^c,

with known probabilities P_2 and (1-P_2) respectively, is exactly the same as that used
by Warner (1965). The respondent is instructed to experience the randomization
device R_1 first. He/she is to use R_2 only if directed by the outcome of
R_1. The respondent is required to answer 'Yes' if the outcome points to the
attribute he/she possesses and 'No' if the complement of his/her status is
pointed out by the outcome. The whole procedure is completed by the respondent,
unobserved by the interviewer.

Then \lambda_2, the probability of a 'Yes' answer, is given by

\lambda_2 = T\pi + (1-T)\{P_2\pi + (1-P_2)(1-\pi)\} = \{2P_2 - 1 + 2T(1-P_2)\}\pi + (1-T)(1-P_2).   (11.9.3.1)

A flow chart of Mangat and Singh's two-stage model is given below:

[Flow chart: the respondent first uses device R_1, in which the statement 'I belong
to group A' occurs with probability T and the statement 'Go to the second
randomization device R_2' occurs with probability (1-T); in R_2 the statement
'I belong to group A' occurs with probability P_2 and the statement 'I do not belong
to group A' with probability (1-P_2); the respondent answers 'Yes' or 'No'
according to whether the selected statement matches his/her true status.]

Fig. 11.9.3.1 Mangat and Singh's two-stage model.

Thus we have the following theorems:

Theorem 11.9.3.1. An unbiased estimator of \pi is given by

\hat{\pi}_1 = \frac{\hat{\lambda}_2 - (1-T)(1-P_2)}{2P_2 - 1 + 2T(1-P_2)},  T \neq \frac{1-2P_2}{2(1-P_2)}.   (11.9.3.2)

Proof. Note that \hat{\lambda}_2 is the observed proportion of 'Yes' answers in the sample and it
follows a binomial distribution B(n, \lambda_2), which proves the theorem.

Theorem 11.9.3.2. The variance of the estimator \hat{\pi}_1 is given by

V(\hat{\pi}_1) = \frac{\pi(1-\pi)}{n} + \frac{(1-T)(1-P_2)\{1-(1-T)(1-P_2)\}}{n\{2P_2 - 1 + 2T(1-P_2)\}^2}.   (11.9.3.3)

Proof. Note that \hat{\lambda}_2 is the proportion of 'Yes' answers in the sample and it follows
a binomial distribution B(n, \lambda_2); therefore

V(\hat{\pi}_1) = V\left[\frac{\hat{\lambda}_2 - (1-T)(1-P_2)}{2P_2 - 1 + 2T(1-P_2)}\right] = \frac{\lambda_2(1-\lambda_2)}{n\{2P_2 - 1 + 2T(1-P_2)\}^2}.

On substituting the value of \lambda_2 and simplifying, we obtain (11.9.3.3). Hence
the theorem.

Corollary 11.9.3.1. For P_2 = P_1 (say) and T = 0 the Mangat and Singh (1990)
model reduces to the Warner (1965) model, or W model.

Corollary 11.9.3.2. We have V(\hat{\pi}_1) < V(\hat{\pi}_w) if T > (1-2P_1)/(1-P_1), which shows
that the estimator \hat{\pi}_1 can always be made more efficient than the usual Warner
estimator, \hat{\pi}_w (say), by suitably choosing the value of T for any practicable value of P_1.
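The two-stage device and the estimator (11.9.3.2) can be checked by simulation; the values of \pi, T, P_2, and n below are illustrative assumptions:

```python
import random

def ms_answer(is_A, T, P2):
    """Mangat--Singh two-stage device: with probability T the statement is
    'I belong to A'; otherwise Warner's device with probability P2 is used."""
    if random.random() < T:
        return is_A
    card_says_A = random.random() < P2
    return is_A == card_says_A

def ms_estimate(lam_hat, T, P2):
    """pi_hat1 = (lam2_hat - (1-T)(1-P2)) / (2*P2 - 1 + 2*T*(1-P2))."""
    return (lam_hat - (1 - T) * (1 - P2)) / (2 * P2 - 1 + 2 * T * (1 - P2))

random.seed(3)
pi_true, T, P2, n = 0.25, 0.4, 0.7, 100000
yes = sum(ms_answer(random.random() < pi_true, T, P2) for _ in range(n))
pi_hat = ms_estimate(yes / n, T, P2)
```

Setting T = 0 reduces the simulation to the plain Warner device, matching Corollary 11.9.3.1.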

Here we consider two different measures as follows.

( a ) Leysieffer and Warner's measure: For the Warner (1965) model the design
probabilities are given by

P(Y \mid A) = P(N \mid A^c) = P_1  and  P(N \mid A) = P(Y \mid A^c) = 1 - P_1.   (11.9.4.1)

It can be easily checked that the condition

P(Y \mid A) > P(Y \mid A^c)   (11.9.4.2)

holds if P_1 > 0.5.

The jeopardy functions are given by

g_w(Y \mid A) = \frac{P(Y \mid A)}{P(Y \mid A^c)} = \frac{P_1}{1-P_1}  and  g_w(N \mid A^c) = \frac{P(N \mid A^c)}{P(N \mid A)} = \frac{P_1}{1-P_1},   (11.9.4.3)

where the suffix w stands for Warner's model. Let k_1 and k_2 be the maximal
allowable values of g_w(Y \mid A) and g_w(N \mid A^c). If k_1 = k_2 = k (say), then maximization
of g_w(Y \mid A) and g_w(N \mid A^c) leads to a design with P_1/(1-P_1) = k, that is,
P_1 = k/(1+k). If, however, k_1 \neq k_2, then as g_w(Y \mid A) = g_w(N \mid A^c), different upper
bounds for them cannot be attained simultaneously. In that case, if without loss of
generality k_1 < k_2, one should choose a design such that

P_1 = k_1/(k_1 + 1).   (11.9.4.4)
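The optimal Warner design at a prescribed jeopardy level follows immediately from P_1/(1-P_1) = k; a one-line check, with k = 3 as an illustrative allowance:

```python
def optimal_P1(k):
    """Warner design attaining the maximal allowable jeopardy k: P1 = k/(1+k)."""
    return k / (1 + k)

def warner_jeopardy(P1):
    """g_w(Y|A) = g_w(N|A^c) = P1/(1-P1) for the Warner model."""
    return P1 / (1 - P1)

k = 3.0
P1 = optimal_P1(k)       # the design parameter that exhausts the allowance
g = warner_jeopardy(P1)  # recovers the jeopardy level k
```

Larger allowances k push P_1 toward 1, trading privacy for efficiency, in line with the derivative argument of Section 11.9.1.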
For the Mangat and Singh (1990) model we have the design probabilities

P(Y \mid A) = P(N \mid A^c) = T + (1-T)P_2  and  P(N \mid A) = P(Y \mid A^c) = (1-T)(1-P_2).   (11.9.4.5)

Then we have the following theorem.

Theorem 11.9.4.1. The inequality P(Y \mid A) > P(Y \mid A^c) holds if P_2 > (1-2T)/\{2(1-T)\}.
Proof. On substituting the values of P(Y \mid A) and P(Y \mid A^c) in P(Y \mid A) > P(Y \mid A^c) we
have

T + (1-T)P_2 > (1-T)(1-P_2),  or  1 - 2P_2 < \frac{T}{1-T},  or  1 - 2T < 2P_2(1-T),  or  P_2 > \frac{1-2T}{2(1-T)}.

Hence the theorem.

Now the jeopardy functions for the Mangat and Singh model are given by

g_{ms}(Y \mid A) = \frac{T + (1-T)P_2}{(1-T)(1-P_2)}  and  g_{ms}(N \mid A^c) = \frac{T + (1-T)P_2}{(1-T)(1-P_2)}.   (11.9.4.6)

Let k_1 and k_2 be the maximum allowable values of g_{ms}(Y \mid A) and g_{ms}(N \mid A^c). If
k_1 = k_2 = k (say), maximization of these jeopardy functions leads to a design with

\frac{T + (1-T)P_2}{(1-T)(1-P_2)} = k,  or  T + (1-T)P_2 = k(1-T)(1-P_2),  or  P_2 = \frac{k(1-T) - T}{(1-T)(1+k)}.   (11.9.4.7)

Thus P_1 = k_1/(k_1+1) and P_2 = \{k_1(1-T) - T\}/\{(1-T)(1+k_1)\} are the optimal choices
for the design parameters of the W model and the MS model, respectively.

Now we have the following theorems.

Theorem 11.9.4.2. With the optimal choice P_1 = k_1/(k_1+1) of the design parameter,
the variance of the unbiased estimator \hat{\pi}_w is given by

V(\hat{\pi}_w) = \frac{\pi(1-\pi)}{n} + \frac{k_1(k_1-1)^{-2}}{n}.   (11.9.4.8)

Proof. On substituting the value of P_1 in the expression for the variance of the W model,
we have

V(\hat{\pi}_w) = \frac{\pi(1-\pi)}{n} + \frac{P_1(1-P_1)}{n(2P_1-1)^2} = \frac{\pi(1-\pi)}{n} + \frac{\frac{k_1}{k_1+1}\left(1-\frac{k_1}{k_1+1}\right)}{n\left(\frac{2k_1}{k_1+1}-1\right)^2} = \frac{\pi(1-\pi)}{n} + \frac{k_1(k_1-1)^{-2}}{n}.

Hence the theorem.

Theorem 11.9.4.3. With the optimal choice of design parameter

P_2 = \{k_1(1-T) - T\}/\{(1-T)(1+k_1)\},

the variance of the unbiased estimator \hat{\pi}_1 under the MS model is given by

V(\hat{\pi}_1) = \frac{\pi(1-\pi)}{n} + \frac{k_1(k_1-1)^{-2}}{n}.   (11.9.4.9)

Proof. It follows on substituting the value of P_2 in

V(\hat{\pi}_1) = \frac{\pi(1-\pi)}{n} + \frac{(1-T)(1-P_2)\{1-(1-T)(1-P_2)\}}{n\{2P_2-1+2T(1-P_2)\}^2}.

Hence the theorem.

Thus for the optimum choices of P_1 and P_2 we have V(\hat{\pi}_1) = V(\hat{\pi}_w); hence we
conclude that the MS model and the W model are equally efficient at the same level of
protection of privacy.

( b ) Lanke's measure: According to this measure, and considering the Warner (1965)
model, the revealing probabilities are given by

P_w(A \mid Y) = \frac{\pi P_1}{\pi P_1 + (1-\pi)(1-P_1)}   (11.9.4.10)

and

P_w(A \mid N) = \frac{\pi(1-P_1)}{\pi(1-P_1) + (1-\pi)P_1}.   (11.9.4.11)

To find \max[P_w(A \mid Y), P_w(A \mid N)], we have the following theorem.

Theorem 11.9.4.4. The revealing probabilities satisfy the inequalities:
( i ) P_w(A \mid Y) > P_w(A \mid N) if P_1 > 0.5; ( ii ) P_w(A \mid N) > P_w(A \mid Y) if P_1 < 0.5.
Proof. Let us prove the first inequality; the second is then obvious. On substituting the
values of P_w(A \mid Y) and P_w(A \mid N) in the first inequality, we have

\frac{\pi P_1}{\pi P_1 + (1-\pi)(1-P_1)} > \frac{\pi(1-P_1)}{\pi(1-P_1) + (1-\pi)P_1},  or  P_1^2 > (1-P_1)^2,  or  P_1 > 1-P_1,  or  P_1 > 0.5.

Hence the theorem.

Thus a measure of protection of privacy in the W model is given by

\psi_w = P_w(A \mid Y) if P_1 > 0.5, and \psi_w = P_w(A \mid N) if P_1 < 0.5.   (11.9.4.12)

For the MS model the revealing probabilities are given by

P_{ms}(A \mid Y) = \frac{\pi\{T + (1-T)P_2\}}{\pi\{T + (1-T)P_2\} + (1-\pi)(1-T)(1-P_2)}   (11.9.4.13)

and

P_{ms}(A \mid N) = \frac{\pi(1-T)(1-P_2)}{\pi(1-T)(1-P_2) + (1-\pi)\{T + (1-T)P_2\}}.   (11.9.4.14)

Theorem 11.9.4.5. The \max[P_{ms}(A \mid Y), P_{ms}(A \mid N)] is determined by the following
inequalities:
( i ) P_{ms}(A \mid Y) > P_{ms}(A \mid N) if P_2 > (1-2T)/\{2(1-T)\};   (11.9.4.15)
( ii ) P_{ms}(A \mid N) > P_{ms}(A \mid Y) if P_2 < (1-2T)/\{2(1-T)\}.   (11.9.4.16)
Proof. On using P_{ms}(A \mid Y) and P_{ms}(A \mid N) in the inequality P_{ms}(A \mid Y) > P_{ms}(A \mid N),
we have

\frac{\pi[T + (1-T)P_2]}{\pi\{T + (1-T)P_2\} + (1-\pi)(1-T)(1-P_2)} > \frac{\pi(1-T)(1-P_2)}{\pi(1-T)(1-P_2) + (1-\pi)\{T + (1-T)P_2\}},

or

\pi(1-\pi)\{T + (1-T)P_2\}^2 > \pi(1-\pi)(1-T)^2(1-P_2)^2,

or

[T + (1-T)P_2] > (1-T)(1-P_2),

or

P_2 > \frac{1-2T}{2(1-T)},

which proves the first part of the theorem. The second part of the theorem can be
similarly proved. Hence the theorem. Therefore the measure of protection of
privacy in the MS model is given by

\psi_{ms} = P_{ms}(A \mid Y) if P_2 > (1-2T)/\{2(1-T)\}, and \psi_{ms} = P_{ms}(A \mid N) if P_2 < (1-2T)/\{2(1-T)\}.   (11.9.4.17)

Thus there are two measures of protection of privacy in each strategy. We shall
consider the following four cases for comparing these two strategies.

Case I. On setting \psi_w = \psi_{ms} we have

(1-P_1)(1-\pi)[T + (1-T)P_2] = (1-\pi)P_1(1-T)(1-P_2),

and in turn we have

P_1 = T + (1-T)P_2.

For such a choice of P_1 the variance of the W model reduces to that of the MS model.

Case II. On setting \psi_w = \psi_{ms} we have

P_1(1-\pi)[T + (1-T)P_2] = (1-\pi)(1-T)(1-P_1)(1-P_2),

which in turn reduces to P_1 = (1-T)(1-P_2).

For such a choice of P_1 the variance of the W model reduces to that of the MS model.

Case III. Again, on equating \psi_w = \psi_{ms}, we have

(1-\pi)(1-T)(1-P_1)(1-P_2) = P_1(1-\pi)[T + (1-T)P_2],

which again results in

P_1 = (1-T)(1-P_2).

Case IV. On equating

(1-P_1)(1-\pi)[T + (1-T)P_2] = (1-\pi)P_1(1-T)(1-P_2),

which on solving reduces to

P_1 = T + (1-T)P_2.

Thus the variance expression of the MS model and the W model under Leysieffer
and Warner's measure remains the same. Hence both the MS model (or two-stage
model) and the W model are equally efficient at the same level of protection of the
respondents . Nayak (1994) has also reported a similar conclusion .
Remark 11.9.1. Although the MS model and the W model are theoretically
performing the same, but are psychologically different to the respondents in a
sample, and thus from psychological point of view one may expect more co-
operation in the MS model than with the W model.

Mangat (1994) proposed a method in which each of the n respondents selected
through SRSWR sampling is instructed to say 'Yes' if he/she belongs to the sensitive
group A. If he/she does not belong to group A, the respondent is required to use
the Warner (1965) randomization device, which consists of two statements:

( i ) I belong to sensitive group A; ( ii ) I belong to group A^c,

represented with probabilities P_3 and (1-P_3) respectively. The respondent is
required to report 'Yes' or 'No' according to the outcome of this device and the
actual status that he/she has with respect to the sensitive group A. The whole
procedure is completed by the respondent, unobserved by the interviewer. Since a
'Yes' answer may come from respondents in group A and group A^c, the
probability of a 'Yes' answer for this M model (say) is given by

\lambda_3 = \pi + (1-\pi)(1-P_3).   (11.9.5.1)

Then we have the following theorems:

Theorem 11.9.5.1. An unbiased estimator of \pi is given by

\hat{\pi}_2 = \frac{\hat{\lambda}_3 - (1-P_3)}{P_3}.   (11.9.5.2)

Proof. Note that \hat{\lambda}_3 is the observed proportion of 'Yes' answers obtained from the
selected sample, with E(\hat{\lambda}_3) = \lambda_3, which proves the theorem.

Theorem 11.9.5.2. The variance of the estimator \hat{\pi}_2 is given by

V(\hat{\pi}_2) = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)(1-P_3)}{nP_3}.   (11.9.5.3)

Proof. Note that \hat{\lambda}_3 \sim B(n, \lambda_3); thus we have

V(\hat{\pi}_2) = V\left[\frac{\hat{\lambda}_3 - (1-P_3)}{P_3}\right] = \frac{V(\hat{\lambda}_3)}{P_3^2} = \frac{\lambda_3(1-\lambda_3)}{nP_3^2},

which on substituting the value of \lambda_3 proves the theorem.
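The M model and its estimator (11.9.5.2) can likewise be simulated; \pi, P_3, and n below are illustrative assumptions:

```python
import random

def m_answer(is_A, P3):
    """Mangat (1994) device: a member of A always says 'Yes'; a member of A^c
    says 'Yes' only when the Warner card reads 'I belong to A^c' (prob 1-P3)."""
    if is_A:
        return True
    return random.random() >= P3

def m_estimate(lam_hat, P3):
    """pi_hat2 = (lam3_hat - (1 - P3)) / P3."""
    return (lam_hat - (1 - P3)) / P3

random.seed(5)
pi_true, P3, n = 0.20, 0.7, 100000
yes = sum(m_answer(random.random() < pi_true, P3) for _ in range(n))
pi_hat = m_estimate(yes / n, P3)
```

Every 'No' answer unambiguously reveals membership in A^c, which is exactly why P_m(A | N) = 0 in Lanke's measure below.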

Again we shall consider here two measures for comparing the M model and the W model.

( a ) Leysieffer and Warner's measure: For the M model we have the design
probabilities as follows:

P(Y \mid A) = 1,  P(N \mid A) = 0,  P(Y \mid A^c) = 1 - P_3,  and  P(N \mid A^c) = P_3.

Clearly the condition P(Y \mid A) > P(Y \mid A^c) is true for P_3 > 0. Now the jeopardy
functions are given by

g_m(Y \mid A) = 1/(1-P_3)  and  g_m(N \mid A^c) = \infty.   (11.9.6.1)

Noting that the second jeopardy function is infinite, we can take the maximal allowable
limit for g_m(Y \mid A) as k_1, that is,

\frac{1}{1-P_3} = k_1,  or  P_3 = (k_1 - 1)/k_1,   (11.9.6.2)

which is the optimal choice of the design parameter for the M model. Then we have
the following theorem.

Theorem 11.9.6.1. With the optimal choice of design parameter P_3 = (k_1-1)/k_1 the
variance of the unbiased estimator \hat{\pi}_2 is given by

V(\hat{\pi}_2) = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)(k_1-1)^{-1}}{n}.   (11.9.6.3)

Proof. On substituting P_3 = (k_1-1)/k_1 in V(\hat{\pi}_2) we have

V(\hat{\pi}_2) = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)(1-P_3)}{nP_3} = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)\left(1 - \frac{k_1-1}{k_1}\right)}{n\left(\frac{k_1-1}{k_1}\right)} = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)(k_1-1)^{-1}}{n}.

Hence the theorem.

The variance expression given in the above theorem is different from that for the W
model under the optimum choice of parameters. Thus we have the following
theorem:

Theorem 11.9.6.2. The M model is always more efficient than the W model at an equal
level of protection of the respondents.
Proof. For the optimum choice of parameters we have

V(\hat{\pi}_w) = \frac{\pi(1-\pi)}{n} + \frac{k_1(k_1-1)^{-2}}{n},

and

V(\hat{\pi}_2) = \frac{\pi(1-\pi)}{n} + \frac{(1-\pi)(k_1-1)^{-1}}{n}.
Chapter 11.: Randomized response sampling: Tools for social surveys 941

Now V(\hat{\pi}_2) < V(\hat{\pi}_w) if

(1-\pi)(k_1-1)^{-1} < k_1(k_1-1)^{-2},

which reduces to

(1-\pi)(k_1-1) < k_1.   (11.9.6.4)

Note that k_1 is always greater than one; therefore the above inequality will always
hold. This completes the proof.
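The comparison can also be checked numerically; the sketch below evaluates (11.9.4.8) and (11.9.6.3) at the illustrative values \pi = 0.2, n = 500, k_1 = 3:

```python
def var_warner_opt(pi, n, k1):
    """V(pi_w) = pi(1-pi)/n + k1*(k1-1)^(-2)/n at the optimal P1 = k1/(k1+1)."""
    return pi * (1 - pi) / n + k1 / (n * (k1 - 1) ** 2)

def var_m_opt(pi, n, k1):
    """V(pi_2) = pi(1-pi)/n + (1-pi)*(k1-1)^(-1)/n at the optimal P3 = (k1-1)/k1."""
    return pi * (1 - pi) / n + (1 - pi) / (n * (k1 - 1))

pi, n, k1 = 0.2, 500, 3.0
v_w = var_warner_opt(pi, n, k1)
v_m = var_m_opt(pi, n, k1)
# (1-pi)(k1-1) < k1 holds for every k1 > 1, so the M model always wins
```

The gap widens as \pi grows, since the (1-\pi) factor shrinks the extra variance of the M model.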

( b ) Lanke's measure: The revealing probabilities for the M model are given by

P_m(A \mid Y) = \frac{\pi}{\pi + (1-\pi)(1-P_3)}  and  P_m(A \mid N) = 0.   (11.9.6.5)

Clearly P_m(A \mid Y) > P_m(A \mid N); hence the measure of protection of privacy to be used
for this strategy is given by

\psi_m = P_m(A \mid Y).   (11.9.6.6)

On equating \psi_w = P_w(A \mid Y) with \psi_m we have

P_1(1-\pi)(1-P_3) = (1-P_1)(1-\pi),

or

P_1 = \frac{1}{2-P_3}.   (11.9.6.7)
Thus we have the following theorem.

Theorem 11.9.6.3. The variance of the estimator i w for PI = 1/(2 - P3) is given by
v(.lZ"w )= lZ"(I-lZ") + 1- Fj
2' (11.9 .6.8)
n nFj
Proof. Obvious from the variance expression of the W model.

Theorem 11.9.6.4 . The M model is always more efficient than the W model at an
equal level of protection of the respondents.
Proof. We have
V(i)= lZ"(I-lZ") + I-Fj
n np;l

Now V(i2) <V(iw ) if


lZ"(I-lZ") + (I-lZ"XI-Fj) < lZ"(I-lZ") + (I-~3), or (I-lZ") <1/Fj
n nP3 n nFj
since the right hand side of the above inequality is always greater than one. Hence
the theorem.

Let us first introduce a few RR models for quantitative data.

11.10.1 UNRELATED QUESTION MODEL FOR QUANTITATIVE DATA

Greenberg, Kuebler, Abernathy, and Horvitz (1971) suggested an extension of the
unrelated question model (U model) to quantitative characters. They considered a
sensitive variable X which is supposed to be continuous with true density g(\bullet). The
treatment remains similar for a discrete X, if one replaces everywhere a density
function by the corresponding probability mass function. The unrelated character is
Y with density h(\bullet). The problem is to estimate \mu_x, the unknown population
mean of X. They considered both the case when \mu_y, the population mean of Y, is
known and the case when it is not known. Here we will consider the case when \mu_y is
known. For this purpose an SRSWR sample of size n is drawn. Each respondent
reports his/her X value with probability P and Y value with probability
Q = 1-P. Then a randomized response (RR), say Z, has the density

f(Z) = P g(Z) + Q h(Z).   (11.10.1.1)

Therefore

E(Z) = P E(X) + Q E(Y)  and  E(Z^2) = P E(X^2) + Q E(Y^2),   (11.10.1.2)

and consequently the population mean \mu_z and the population variance \sigma_z^2 of Z
are

\mu_z = P\mu_x + Q\mu_y   (11.10.1.3)

and

\sigma_z^2 = E(Z^2) - \{E(Z)\}^2 = P\sigma_x^2 + Q\sigma_y^2 + PQ(\mu_x - \mu_y)^2,   (11.10.1.4)

where \sigma_x^2 and \sigma_y^2 are the population variances of X and Y respectively. If the
available responses are Z_1, Z_2, ..., Z_n then we have the sample mean and variance

\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i  and  s_z^2 = (n-1)^{-1}\sum_{i=1}^{n}(Z_i - \bar{Z})^2,

respectively, such that

E(\bar{Z}) = \mu_z  and  E(s_z^2) = \sigma_z^2.
Thus we have the following theorem:

Theorem 11.10.1. An unbiased estimator of \mu_x is

\hat{\mu}_x = (\bar{Z} - Q\mu_y)/P

with

V(\hat{\mu}_x) = V\left(\frac{\bar{Z}}{P}\right) = \frac{\sigma_z^2}{nP^2} = \frac{1}{nP^2}\left[P\sigma_x^2 + Q\sigma_y^2 + PQ(\mu_x - \mu_y)^2\right],

and an unbiased estimator of V(\hat{\mu}_x) is

\hat{V}(\hat{\mu}_x) = s_z^2/(nP^2).

Proof. Obvious using the above results.
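A simulation of the unrelated question model for quantitative data; the distributions of X and Y below are illustrative assumptions (any known Y distribution works):

```python
import random

def u_response(x, P, mu_y, sigma_y):
    """Report the sensitive X with probability P, else the unrelated Y."""
    if random.random() < P:
        return x
    return random.gauss(mu_y, sigma_y)

random.seed(6)
P, mu_y = 0.7, 5.0
# true sensitive distribution: X ~ N(20, 4^2), so mu_x = 20 (illustrative)
zs = [u_response(random.gauss(20.0, 4.0), P, mu_y, 1.0) for _ in range(100000)]
z_bar = sum(zs) / len(zs)
mu_x_hat = (z_bar - (1 - P) * mu_y) / P   # (Z_bar - Q*mu_y)/P with Q = 1-P
```

The PQ(\mu_x - \mu_y)^2 term in (11.10.1.4) suggests choosing the unrelated Y with mean near the anticipated \mu_x to keep the variance down.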

11.10.2 ADDITIVE MODEL

Pollock and Bek (1976) considered the additive model. In this model each respondent in an SRSWR sample is asked to add to his/her sensitive attribute (X) a random value (Y) taken from a known distribution. The observed response, denoted by Z, is
$$ Z = X + Y. \quad (11.10.2.1) $$
Here the random variable Y is distributed independently of the sensitive attribute X. Then the mean and variance of Z are
$$ \mu_z = \mu_x + \mu_y \quad\text{and}\quad \sigma^2_z = \sigma^2_x + \sigma^2_y. $$
If $Z_1, Z_2, \ldots, Z_n$ are the observed responses in an SRSWR sample of size n, then an unbiased estimator of $\mu_x$ is obtained as
$$ \hat{\mu}_x = \bar{Z} - \mu_y, \quad (11.10.2.2) $$
where $\bar{Z}$ denotes the sample mean of the observed responses. The variance of the estimator $\hat{\mu}_x$ is given by
$$ V(\hat{\mu}_x) = \frac{\sigma^2_z}{n} = \frac{\sigma^2_x + \sigma^2_y}{n}. \quad (11.10.2.3) $$
It is easy to obtain an unbiased estimator of $V(\hat{\mu}_x)$ by replacing $\sigma^2_z$ with $s^2_z$.
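As a quick illustration (not from the text, and assuming a normal scrambler purely for concreteness), the additive estimator (11.10.2.2) and its estimated variance $s^2_z/n$ can be sketched as:

```python
import random

def additive_rr_estimate(xs, mu_y, sd_y, rng):
    """Each respondent adds a draw of Y (known distribution; normal here is an
    illustrative assumption) to the true X and reports Z = X + Y."""
    z = [x + rng.gauss(mu_y, sd_y) for x in xs]
    n = len(z)
    zbar = sum(z) / n
    s2_z = sum((zi - zbar) ** 2 for zi in z) / (n - 1)
    return zbar - mu_y, s2_z / n      # (11.10.2.2) and its estimated variance

rng = random.Random(7)
true_xs = [rng.gauss(20.0, 4.0) for _ in range(500)]   # hypothetical sensitive values
mu_hat, v_hat = additive_rr_estimate(true_xs, 10.0, 3.0, rng)
```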

11.10.3 MULTIPLICATIVE MODEL

Pollock and Bek (1976) also considered the multiplicative model. This model was further considered by Eichhorn and Hayre (1983). According to them, each respondent in an SRSWR sample of n units is asked to multiply his/her X value by a random number Y taken from a known distribution, thus giving a scrambled response to the interviewer. They also referred to this model as the scrambled RR method. The observed response (Z) is
$$ Z = XY. \quad (11.10.3.1) $$
Here also the random variable Y is independent of X. The mean and variance of Z are
$$ \mu_z = \mu_x\mu_y \quad\text{and}\quad \sigma^2_z = \mu^2_y\sigma^2_x + \sigma^2_y(\mu^2_x + \sigma^2_x). $$
Then we have the following theorem.

Theorem 11.10.2. An unbiased estimator of $\mu_x$ is given by
$$ \hat{\mu}_x = \bar{Z}/\mu_y, \quad (11.10.3.2) $$
where $\bar{Z}$ is the mean of the observed scrambled responses. The variance of $\hat{\mu}_x$ is given by
$$ V(\hat{\mu}_x) = \frac{\sigma^2_z}{n\mu^2_y} = \frac{1}{n}\left[ \sigma^2_x + \frac{\sigma^2_y(\mu^2_x + \sigma^2_x)}{\mu^2_y} \right]. \quad (11.10.3.3) $$
An unbiased estimator of $V(\hat{\mu}_x)$ can easily be obtained by replacing $\sigma^2_z$ with $s^2_z$.

Proof. Obvious from the above results.
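A sketch of the scrambled-response estimator of Theorem 11.10.2, assuming (for illustration only) a uniform scrambler on (0.8, 1.2), so that $\mu_y = 1$:

```python
import random

def scrambled_rr_estimate(xs, a, b, rng):
    """Multiplicative model Z = XY with Y ~ Uniform(a, b), an illustrative
    choice of known scrambling distribution (so mu_y = (a + b)/2)."""
    mu_y = (a + b) / 2.0
    z = [x * rng.uniform(a, b) for x in xs]
    n = len(z)
    zbar = sum(z) / n
    s2_z = sum((zi - zbar) ** 2 for zi in z) / (n - 1)
    mu_hat = zbar / mu_y                  # (11.10.3.2)
    v_hat = s2_z / (n * mu_y ** 2)        # s_z^2 replacing sigma_z^2 in (11.10.3.3)
    return mu_hat, v_hat

rng = random.Random(5)
true_xs = [rng.gauss(40.0, 8.0) for _ in range(1000)]   # hypothetical X values
mu_hat, v_hat = scrambled_rr_estimate(true_xs, 0.8, 1.2, rng)
```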

Anderson (1977) developed a measure for privacy protection in the case of quantitative characters. Following him, consider a population of individuals with some sensitive attribute $X \in \Omega_x$, distributed according to the unknown distribution $F_x(x)$. Our aim is to estimate the mean of X, or some proportion or distribution function of X. For this purpose a random sample of respondents to be interviewed is taken. However, for privacy reasons the sampled individuals may not be interviewed directly. Thus we have to ask questions in such a way that, from the answers, the population parameters can be estimated but the values of X for the respondents cannot be inferred. This can be attained by introducing a randomization device known to the interviewer and producing an answer Z with a probability distribution dependent on the X value of the respondent. More precisely, we have the following general RR model. An individual with $X = x$ answers according to the known probability density $h_z(z \mid x)$, $x \in \Omega_x$. The densities $h_z(z \mid x)$, $x \in \Omega_x$, are called response densities and they hold with respect to some measure $d\lambda(z)$, thus including continuous as well as discrete distributions. In general, the family of response densities $\{h_z(z \mid x),\ x \in \Omega_x\}$ is chosen by the interviewer, and nearly every family can be realized by choosing an appropriate randomization device. The interviewer can only observe the answer Z from a randomly chosen individual. It has the density
$$ f_z(z) = \int h_z(z \mid x)\, dF_x(x) = E[h_z(z \mid X)], \quad (11.10.4.1) $$
which is a mixture of the different response densities with $F_x(x)$ as mixing distribution. The population parameters of X may be inferred from observations of Z, whereas the single X value may not be determined if the densities $h_z(z \mid x)$, $x \in \Omega_x$, overlap. The family of conditional densities of X given $Z = z$, $z \in \Omega_z$, is
$$ h_x(x \mid z) = h_z(z \mid x) f_x(x)/f_z(z), \quad z \in \Omega_z, \quad (11.10.4.2) $$
where $\Omega_z$ is the set of possible RR answers. In the following models this family will be called the family of revealing densities. It depends both on the response distributions and on the unknown population distribution of X. Further, the joint density of the random variable $(X, Z)$ is given by
$$ f_{x,z}(x,z) = h_z(z \mid x) f_x(x). \quad (11.10.4.3) $$
Using (11.10.4.2) and (11.10.4.3) the revealing densities can be defined as

$$ h_x(x \mid z) = f_{x,z}(x,z)/f_z(z). \quad (11.10.4.4) $$


The importance of the revealing densities for the degree of protection of the interviewee's privacy is obvious. Therefore we come to the privacy aspect. The choice of the family of response distributions determines the degree of protection of the interviewee's privacy as well as the possibility of obtaining good estimates of the population parameters. Before the interview, the density $f_x(x)$ provides knowledge about the variable X of a randomly sampled individual. Assume that he answers $Z = z$. Then the revealing density $h_x(x \mid z)$ yields information about X, and the discrepancy between $f_x(x)$ and $h_x(x \mid z)$ reflects the invasion of privacy caused by the answer $Z = z$. If $h_x(x \mid z)$ is very concentrated around some value x, the protection provided by the answer $Z = z$ is very small; conversely, if $h_x(x \mid z)$ is widely spread, privacy is little affected. Therefore we take $V(X \mid Z = z)$ as a general measure of the privacy protection associated with the answer $Z = z$, and the average $E[V(X \mid Z)]$ as an overall measure of the privacy protection provided by an RR model. In the following section, we shall compare the additive and multiplicative models. Chang and Huang (2001b) also estimated the proportion and sensitivity level of a quantitative character.

Bhargava (1996) considered the comparison of the additive and multiplicative models when both the sensitive and scrambling random variables are beta distributed.

( a ) Multiplicative model: Let us first consider the multiplicative model
$$ Z = XY. $$
Let us assume that the sensitive variable X is beta distributed with parameters $(\alpha_1, \beta_1)$. The interviewees are asked to multiply a random number Y, from a known beta distribution with parameters $(\alpha_2, \beta_2)$, with their X values. The interviewer thus receives the randomized (scrambled) response Z. Let us also assume that X and Y are independently distributed. The corresponding density functions are
$$ f_1(x) = [B(\alpha_1, \beta_1)]^{-1} x^{\alpha_1-1}(1-x)^{\beta_1-1}, \quad 0 < x < 1, \quad (11.10.5.1) $$
$$ f_2(y) = [B(\alpha_2, \beta_2)]^{-1} y^{\alpha_2-1}(1-y)^{\beta_2-1}, \quad 0 < y < 1. \quad (11.10.5.2) $$
Since x and y are independent, their joint distribution is given by
$$ f(x,y) = [B(\alpha_1, \beta_1)]^{-1}[B(\alpha_2, \beta_2)]^{-1} x^{\alpha_1-1}(1-x)^{\beta_1-1} y^{\alpha_2-1}(1-y)^{\beta_2-1}, \quad (11.10.5.3) $$
where $0 < x < 1$ and $0 < y < 1$. Now make the transformations
$$ u = xy \quad\text{and}\quad v = x. \quad (11.10.5.4) $$
Solving for x and y we have
$$ x = v \quad\text{and}\quad y = u/v. \quad (11.10.5.5) $$
The Jacobian of the transformation is
$$ J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} = \begin{vmatrix} 0 & 1 \\ 1/v & -u/v^2 \end{vmatrix} = -\frac{1}{v}, $$
which implies that $|J| = 1/v$. The joint distribution of u and v is given by
$$ g(u,v) = f(x,y)\,|J| = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1} v^{\alpha_1-1}(1-v)^{\beta_1-1}\left(\frac{u}{v}\right)^{\alpha_2-1}\left(1-\frac{u}{v}\right)^{\beta_2-1}\frac{1}{v} $$
$$ = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1} v^{\alpha_1-(\alpha_2+\beta_2)}(1-v)^{\beta_1-1} u^{\alpha_2-1}(v-u)^{\beta_2-1}. \quad (11.10.5.6) $$
If we assume
$$ \alpha_1 = \alpha_2 + \beta_2, $$
we obtain
$$ g(u,v) = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1}(1-v)^{\beta_1-1} u^{\alpha_2-1}(v-u)^{\beta_2-1}. \quad (11.10.5.7) $$
The region $(0 < x < 1,\ 0 < y < 1)$ in the $(x,y)$ plane transforms to the region $(u < v < 1,\ 0 < u < 1)$ in the $(u,v)$ plane. Following Rao (1973), on integrating $g(u,v)$ over v from u to 1, the density of u is also beta, with parameters $(\alpha_2, \beta_1+\beta_2)$, provided that the condition $\alpha_1 = \alpha_2 + \beta_2$ is satisfied. Thus the marginal density of u is given by
$$ h(u) = [B(\alpha_2, \beta_1+\beta_2)]^{-1} u^{\alpha_2-1}(1-u)^{\beta_1+\beta_2-1}, \quad 0 < u < 1, \quad (11.10.5.8) $$
which reduces to
$$ h(u) = B_1\, u^{\alpha_2-1}(1-u)^{\beta_1+\beta_2-1}, \quad 0 < u < 1, \quad (11.10.5.9) $$
where
$$ B_1 = \frac{\Gamma(\alpha_1+\beta_1)}{\Gamma(\alpha_2)\Gamma(\beta_1+\beta_2)}. $$
r P2qpI+P2)
Thus the revealing density function becomes
$$ h_x(x \mid z) = h^*(v \mid u) = \frac{g(u,v)}{h(u)} = \frac{[B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1}(1-v)^{\beta_1-1} u^{\alpha_2-1}(v-u)^{\beta_2-1}}{B_1\, u^{\alpha_2-1}(1-u)^{\beta_1+\beta_2-1}} $$
$$ = \frac{\Gamma(\alpha_1+\beta_1)\Gamma(\alpha_2+\beta_2)}{\Gamma(\alpha_1)\Gamma(\beta_1)\Gamma(\alpha_2)\Gamma(\beta_2)} \cdot \frac{\Gamma(\alpha_2)\Gamma(\beta_1+\beta_2)}{\Gamma(\alpha_1+\beta_1)} \cdot \frac{(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}} $$
$$ = \frac{\Gamma(\beta_1+\beta_2)}{\Gamma(\beta_1)\Gamma(\beta_2)} \cdot \frac{(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}} \quad (\text{since } \alpha_1 = \alpha_2+\beta_2) $$
$$ = [B(\beta_1,\beta_2)]^{-1}\frac{(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}} = B_2\,\frac{(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}}, \quad (11.10.5.10) $$
where $B_2 = [B(\beta_1,\beta_2)]^{-1}$, $u < v < 1$ and $0 < u < 1$.

The general rule for working out the measure of privacy protection is given by
$$ E(X \mid Z) = E(V \mid U) = \int_u^1 v\, h^*(v \mid u)\, dv, \quad (11.10.5.11) $$
$$ \mathrm{Var}(X \mid Z) = \mathrm{Var}(V \mid U) = E\left[\{V - E(V \mid U)\}^2 \mid U\right] = \int_u^1 \{v - E(V \mid U)\}^2 h^*(v \mid u)\, dv, \quad (11.10.5.12) $$
and the overall measure of protection is given by
$$ E[\mathrm{Var}(X \mid Z)] = E[\mathrm{Var}(V \mid U)] = \int \mathrm{Var}(V \mid U)\, h(u)\, du. $$
Therefore we have
$$ E(V \mid U) = B_2\int_u^1 \frac{v(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}}\, dv \quad (11.10.5.13) $$
and
$$ \mathrm{Var}(V \mid U) = B_2\int_u^1 \{v - E(V \mid U)\}^2\,\frac{(1-v)^{\beta_1-1}(v-u)^{\beta_2-1}}{(1-u)^{\beta_1+\beta_2-1}}\, dv, \quad (11.10.5.14) $$
and the measure of privacy protection is
$$ E[\mathrm{Var}(V \mid U)] = B_1\int_0^1 \mathrm{Var}(V \mid u)\, u^{\alpha_2-1}(1-u)^{\beta_1+\beta_2-1}\, du. \quad (11.10.5.15) $$
Note that the integral expressions (11.10.5.13) to (11.10.5.15) cannot be evaluated exactly; therefore it is not possible to derive a simple expression for the efficiency of the multiplicative model. Bhargava (1996) resolved this issue through numerical illustrations with different sets of parameters.
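The double integral (11.10.5.15) is straightforward to evaluate numerically. The sketch below (not from the text) uses a simple midpoint rule; with $(\alpha_1, \beta_1, \alpha_2, \beta_2) = (4, 2, 2, 2)$, which satisfies $\alpha_1 = \alpha_2 + \beta_2$, the result matches the first tabulated multiplicative value, 0.0238, in the table given later in this section.

```python
import math

def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def privacy_measure_multiplicative(a1, b1, a2, b2, m=400):
    """Midpoint-rule evaluation of E[Var(V|U)] in (11.10.5.15) for the
    multiplicative model; requires the condition a1 = a2 + b2."""
    assert abs(a1 - (a2 + b2)) < 1e-9
    B1 = 1.0 / beta_fn(a2, b1 + b2)   # constant of the marginal density h(u)
    B2 = 1.0 / beta_fn(b1, b2)        # constant of the revealing density (11.10.5.10)

    def cond_var(u):
        # Var(V | U = u) under the revealing density on (u, 1); the midpoint
        # rule avoids evaluating the density at the interval endpoints.
        h = (1.0 - u) / m
        ev = ev2 = 0.0
        for k in range(m):
            v = u + (k + 0.5) * h
            d = B2 * (1 - v) ** (b1 - 1) * (v - u) ** (b2 - 1) \
                / (1 - u) ** (b1 + b2 - 1)
            ev += v * d * h
            ev2 += v * v * d * h
        return ev2 - ev * ev

    step = 1.0 / m
    total = 0.0
    for k in range(m):
        u = (k + 0.5) * step
        weight = B1 * u ** (a2 - 1) * (1 - u) ** (b1 + b2 - 1)
        total += cond_var(u) * weight * step
    return total

measure = privacy_measure_multiplicative(4, 2, 2, 2)
```

For these parameters the measure comes out near 0.0238; other parameter sets from the table can be checked the same way.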

( b ) Additive model: The additive model is given by
$$ Z = X + Y. \quad (11.10.5.16) $$
Assume that the sensitive variable X is beta distributed with parameters $(\alpha_1, \beta_1)$ and each interviewee in the sample is asked to add a random number Y to his/her X value. The random number Y is from a known beta distribution with parameters $(\alpha_2, \beta_2)$ and is distributed independently of the sensitive variable X. The interviewer receives the randomized response Z. The corresponding density functions of X and Y and their joint distribution are given by
$$ f_1(x) = [B(\alpha_1,\beta_1)]^{-1} x^{\alpha_1-1}(1-x)^{\beta_1-1}, \quad 0 < x < 1, \quad (11.10.5.17) $$
$$ f_2(y) = [B(\alpha_2,\beta_2)]^{-1} y^{\alpha_2-1}(1-y)^{\beta_2-1}, \quad 0 < y < 1, \quad (11.10.5.18) $$
and
$$ f(x,y) = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1} x^{\alpha_1-1}(1-x)^{\beta_1-1} y^{\alpha_2-1}(1-y)^{\beta_2-1}, \quad (11.10.5.19) $$
where $0 < x < 1$ and $0 < y < 1$, respectively. Again let us make the transformations
$$ u = x + y \quad\text{and}\quad v = x. \quad (11.10.5.20) $$
Solving for x and y we have
$$ x = v \quad\text{and}\quad y = u - v. \quad (11.10.5.21) $$
The Jacobian of the transformation is
$$ J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} = \begin{vmatrix} 0 & 1 \\ 1 & -1 \end{vmatrix} = -1, $$
therefore $|J| = 1$. Using (11.10.5.21) in (11.10.5.19) together with $|J| = 1$, we have the following joint distribution of u and v:
$$ g(u,v) = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1} v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}. \quad (11.10.5.22) $$
The region $(0 < x < 1,\ 0 < y < 1)$ in the $(x,y)$ plane transforms to the region $(0 < v < u,\ 0 < u < 1$ and $u-1 < v < 1,\ 1 < u < 2)$ in the $(u,v)$ plane. Hence, integrating $g(u,v)$ over v, first from 0 to u and then from $(u-1)$ to 1, we obtain the density of u as
$$ h_1(u) = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1}\int_0^u v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}\, dv $$
for $0 < u < 1$, and
$$ h_2(u) = [B(\alpha_1,\beta_1)]^{-1}[B(\alpha_2,\beta_2)]^{-1}\int_{u-1}^1 v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}\, dv $$
for $1 < u < 2$.

The revealing density becomes
$$ h_1^*(v \mid u) = \frac{v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}}{\int_0^u v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}\, dv}, \quad 0 < u < 1, $$
and
$$ h_2^*(v \mid u) = \frac{v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}}{\int_{u-1}^1 v^{\alpha_1-1}(1-v)^{\beta_1-1}(u-v)^{\alpha_2-1}(1-u+v)^{\beta_2-1}\, dv}, \quad 1 < u < 2. $$

Thus we have the following results:
$$ E_1(V \mid U) = \int_0^u v\, h_1^*(v \mid u)\, dv, \quad 0 < u < 1, \qquad E_2(V \mid U) = \int_{u-1}^1 v\, h_2^*(v \mid u)\, dv, \quad 1 < u < 2, $$
and
$$ \mathrm{Var}_1(V \mid U) = \int_0^u \{v - E_1(V \mid U)\}^2 h_1^*(v \mid u)\, dv, \quad 0 < u < 1, $$
$$ \mathrm{Var}_2(V \mid U) = \int_{u-1}^1 \{v - E_2(V \mid U)\}^2 h_2^*(v \mid u)\, dv, \quad 1 < u < 2. $$
The measure of privacy protection is then given by
$$ E[\mathrm{Var}(V \mid U)] = \int_0^1 \mathrm{Var}_1(V \mid U)\, h_1(u)\, du + \int_1^2 \mathrm{Var}_2(V \mid U)\, h_2(u)\, du. $$

Again these expressions cannot be evaluated exactly. Thus Bhargava (1996) obtained the numerical measures of the level of respondents' protection. The following table provides the measure of privacy protection $E[\mathrm{Var}(V \mid U)]$ for beta distributions in the case of the multiplicative and additive models (the parameter columns read $\beta_1$, $\beta_2$, $\alpha_1$, $\alpha_2$, with $\alpha_1 = \alpha_2 + \beta_2$ throughout):

 β1    β2    α1    α2    Multiplicative   Additive
 2     2     4     2        0.0238         0.0133
 2     2     5     3        0.0178         0.0155
 2     2     3     1        0.0333         0.0217
 1     1     3     2        0.0250         0.0222
 1     1     4     3        0.0083         0.0155
 1     1     2     1        0.0417         0.0321
 0.5   0.5   2.5   2        0.0215         0.0191
 0.5   0.5   3.5   3        0.0139         0.0117
 0.5   0.5   1.5   1        0.0412         0.0352

The above table shows that if the sensitive variable follows a beta distribution, the multiplicative model remains more protective than the additive model in most cases.

Assume π is the proportion of people belonging to the sensitive group A and $(1-\pi)$ is the proportion of persons belonging to the non-sensitive group $A^c$, such that $A \cup A^c = \Omega$. Owing to the sensitive nature of the group A, people do not like to disclose their status to the interviewers. An SRSWR sample of n respondents will be taken and each respondent will be given two random devices $R_1$ and $R_2$. Each of the random devices will have two statements:

( i ) I belong to group A;
( ii ) I do not belong to group A.

Using different probability mechanisms, under each device the respondent chooses statement ( i ) or ( ii ) with probability P or $(1-P)$ and simply answers 'Yes' or 'No' depending upon his/her actual status. The responses under the two devices are assumed to be independent. Let L be the probability that a person gives an untruthful answer, whether he/she belongs to A or $A^c$. Thus we have
$$ L = \Pr(\text{Untruthful answer} \mid A) = \Pr(\text{Untruthful answer} \mid A^c). \quad (11.11.1) $$
Let
$$ X_i = \begin{cases} 1 & \text{if the } i\text{th respondent answers 'Yes' with device } R_1, \\ 0 & \text{otherwise}, \end{cases} $$
and
$$ Y_i = \begin{cases} 1 & \text{if the } i\text{th respondent answers 'Yes' with device } R_2, \\ 0 & \text{otherwise}. \end{cases} $$
Then
$$ \Pr(X_i = 1, Y_i = 0) = \Pr(X_i = 0, Y_i = 1) = [PL + (1-P)(1-L)][(1-P)L + P(1-L)] = P(1-P) + L(1-L)(1-2P)^2 = \theta_L. $$
Let $n_{11}$, $n_{10}$, $n_{01}$ and $n_{00}$ be the numbers of respondents who answered (Yes, Yes), (Yes, No), (No, Yes), and (No, No), respectively, with devices $R_1$ and $R_2$.

Instead of assuming
$$ L = \Pr(\text{Untruthful answer} \mid A) = \Pr(\text{Untruthful answer} \mid A^c), $$
one can assume
$$ \Pr(\text{Untruthful answer} \mid A) = L \quad\text{and}\quad \Pr(\text{Untruthful answer} \mid A^c) = 0. $$
Lakshmi and Raghavarao (1992) considered a p-variate normal distribution $N_p(\mu, \Sigma)$ with mean vector $\mu$ and positive definite dispersion matrix $\Sigma$, and denoted by $1_n$ a column vector of n ones. Lakshmi and Raghavarao (1992) considered the problem of testing the null hypothesis
$$ H_0 : L = 0 $$
against the alternative hypothesis
$$ H_1 : L > 0. $$
Following Rao (1973), the asymptotic distribution of the bivariate random vector
$$ \sqrt{n}\left[\left(\frac{n_{10}}{n}, \frac{n_{01}}{n}\right) - \theta_L 1_2'\right] $$
is $N_2(0_2, \Sigma_L)$, where $0_2$ is a vector of dimension 2 and
$$ \Sigma_L = \begin{bmatrix} \theta_L(1-\theta_L) & -\theta_L^2 \\ -\theta_L^2 & \theta_L(1-\theta_L) \end{bmatrix}. $$
Under the null hypothesis $H_0 : L = 0$, the statistic
$$ T = n\left(\frac{n_{10}}{n} - \theta_0,\; \frac{n_{01}}{n} - \theta_0\right)\Sigma_0^{-1}\left(\frac{n_{10}}{n} - \theta_0,\; \frac{n_{01}}{n} - \theta_0\right)' \quad (11.11.2) $$
is distributed as a central chi-square with two degrees of freedom, where $\theta_0 = P(1-P)$ and
$$ \Sigma_0 = \begin{bmatrix} \theta_0(1-\theta_0) & -\theta_0^2 \\ -\theta_0^2 & \theta_0(1-\theta_0) \end{bmatrix}. $$
Lakshmi and Raghavarao (1992) developed the following result: a critical region for an α-level test of $H_0 : L = 0$ against the alternative hypothesis $H_1 : L > 0$ is
$$ T > \chi^2_{2,\alpha}, \quad (11.11.3) $$
where $\chi^2_{2,\alpha}$ denotes the upper α point of the chi-square distribution with two degrees of freedom.
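The statistic (11.11.2) involves only a 2×2 matrix inverse, and the chi-square quantile with two degrees of freedom has the closed form $-2\ln\alpha$, so the test is easy to code directly. A minimal sketch (function and variable names are ours, not from the text; the counts used below are illustrative):

```python
import math

def untruthfulness_test(n10, n01, n, p, alpha=0.05):
    """Compute the statistic of (11.11.2) with theta0 = P(1-P) and carry out
    the alpha-level test using the chi-square(2) upper quantile -2*ln(alpha)."""
    theta0 = p * (1 - p)
    a = theta0 * (1 - theta0)       # diagonal entries of Sigma_0
    b = -theta0 ** 2                # off-diagonal entries of Sigma_0
    det = a * a - b * b
    d1, d2 = n10 / n - theta0, n01 / n - theta0
    # quadratic form d' Sigma_0^{-1} d, using the explicit 2x2 inverse
    T = n * (a * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / det
    crit = -2.0 * math.log(alpha)   # upper alpha point of chi-square with 2 df
    return T, T > crit

T, reject = untruthfulness_test(n10=300, n01=150, n=1000, p=0.3)
```

With these illustrative counts the statistic greatly exceeds the 5% critical value 5.99, so the hypothesis of no untruthful answering would be rejected.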

Singh (2002c) suggested another procedure which may result in a greater sense of response confidentiality among the sampled individuals. The procedure can be used in surveys where the respondents selected in the sample assemble at a common place for the conduct of the survey. This could be a situation of collecting data from a small town, community or organization. The procedure invokes K decks of cards (named a stochastic randomization device) with different proportions of cards carrying the statement, 'I belong to group A'. After explaining to the respondents how the randomization device provides confidentiality to their responses, the investigator asks one of the assembled respondents to randomly select one deck of cards from the box containing the K decks of cards. The deck is then used to collect information on the sensitive attribute from the respondents. Every sampled respondent draws one card from the selected deck of cards and reads the statement on it. In this procedure every respondent is provided with two identical slips of paper with 'Yes' or 'No' printed on them. According to his status in relation to the statement printed on the card drawn, each respondent is requested to put one of the two slips of paper into an empty box. After the survey is completed, the number of 'Yes' answers is counted from the box and the proportion $p^*$ for the deck used in the survey is noted. Random selection of one randomization device from several such devices may help in increasing the sense of confidentiality among the respondents. The choice of the values of p for preparing the K decks of cards for the survey is important in this procedure. These K values of p could either be purposively selected by the investigator or they could be taken as a random sample from a known discrete or continuous density function. Let this density function be denoted by f(p). The value of p corresponding to the deck used in the survey will be selected from this random sample of p values with equal probabilities. Thus the value of $p^*$ used in the survey is a random variable with f(p) as its probability density function. When f(p) is a one-point distribution, this procedure reduces to Warner (1965). Assume $n_1$ persons in the sample answered 'Yes' and $(n - n_1)$ answered 'No'. Note that the probability of a 'Yes' answer for a particular choice of $p^*$ is given by
$$ \theta = p^*\pi + (1-\pi)(1-p^*). \quad (11.12.1) $$
Consider the following estimator of π:
$$ \hat{\pi}_R = \frac{\hat{\theta} - (1-p^*)}{2p^* - 1}, \quad (11.12.2) $$
where $\hat{\theta} = n_1/n$ is the proportion of 'Yes' answers in the sample.

Thus we have the following theorems:

Theorem 11.12.1. The estimator $\hat{\pi}_R$ is unbiased for the population proportion π.

Proof. We have
$$ E(\hat{\pi}_R) = E_1 E_2(\hat{\pi}_R) = E_1 E_2\left\{\frac{\hat{\theta} - (1-p^*)}{2p^* - 1}\right\} = E_1(\pi) = \pi, $$
where $E_2$ denotes the expected value for the fixed value $p^*$ of p and $E_1$ denotes the expected value over all values of p generated by its distribution. This completes the proof of the theorem.

Theorem 11.12.2. The variance of the estimator $\hat{\pi}_R$ is given by
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} + \frac{1}{n}\int_a^b \frac{p(1-p)}{(2p-1)^2}\, f(p)\, dp, \quad a \le p \le b,\ p \ne 0.5, \quad (11.12.3) $$
where f(p) denotes the probability density function (p.d.f.) of p.

Proof. We have
$$ V(\hat{\pi}_R) = E_1 V_2(\hat{\pi}_R) + V_1 E_2(\hat{\pi}_R), $$
where $E_1$ and $E_2$ have been defined earlier. Similarly, $V_2$ and $V_1$, respectively, denote the variance for a fixed value $p^*$ of p and over all values of p generated by its distribution. Since $E_2(\hat{\pi}_R) = \pi$ does not depend on $p^*$, the second term vanishes, and simplifying $E_1 V_2(\hat{\pi}_R)$ leads to (11.12.3).

We shall now try to choose the p.d.f. f(p) in such a way that the estimator $\hat{\pi}_R$ becomes more efficient than the estimator $\hat{\pi}_w$. Although there could be many choices for f(p), consider the following two probability density functions:
$$ \text{(i)}\quad f(p) = \frac{1}{b-a}, \quad a \le p \le b,\ p \ne 0.5; \quad (11.12.4) $$
$$ \text{(ii)}\quad f(p) = C(2p-1)^2\, p^{\alpha-1}(1-p)^{\beta-1}, \quad 0 < p < 1,\ \alpha > 0,\ \beta > 0; \quad (11.12.5) $$
where
$$ C = \frac{\Gamma(\alpha+\beta+2)}{\Gamma(\alpha)\Gamma(\beta+2) + \Gamma(\beta)\Gamma(\alpha+2) - 2\Gamma(\alpha+1)\Gamma(\beta+1)}. $$
We now show that the stochastic randomized response procedure, in addition to providing a greater sense of response confidentiality, is also more efficient than Warner's (1965) procedure.
Theorem 11.12.3. The variance $V(\hat{\pi}_R)$ of the estimator $\hat{\pi}_R$ with f(p) defined in (11.12.4) becomes
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} - \frac{1}{4n} + \frac{1}{4n(2b-1)(2a-1)}. \quad (11.12.6) $$
Proof. On substituting the value of f(p) from (11.12.4) in (11.12.3) we have
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} - \frac{1}{4n} + \frac{1}{4n}\int_a^b \frac{dp}{(2p-1)^2(b-a)}, $$
which proves the theorem.

Corollary 11.12.1. For $a = P_0$ and $b = P_0 + g(1-P_0)$, with $0 < g < 1$, the estimator $\hat{\pi}_R$ always remains more efficient than Warner's estimator. Here it is possible to find more acceptable choices for a and b.

Theorem 11.12.4. The variance of the estimator $\hat{\pi}_R$ with f(p) defined in (11.12.5) is given by
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} + \frac{\alpha\beta}{n\{(\alpha-\beta)^2 + \alpha + \beta\}}. \quad (11.12.7) $$

Proof. After replacing the value of f(p) from (11.12.5) in (11.12.3) we have
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} + \frac{C}{n}\int_0^1 p^{(\alpha+1)-1}(1-p)^{(\beta+1)-1}\, dp, $$
which proves the theorem.

Corollary 11.12.2. If we choose $\alpha = P_0$ and $\beta = 1 - P_0$, where $P_0$ denotes the value of p used in Warner's model, the variance $V(\hat{\pi}_R)$ under this distribution of p becomes
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} + \frac{P_0(1-P_0)}{n\{(2P_0-1)^2 + 1\}}, $$
which is smaller than the variance $V(\hat{\pi}_w)$, making the stochastic procedure more efficient than Warner's (1965) method. Hence one could also possibly find other efficient and more acceptable values of α and β.
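The gain in Corollary 11.12.2 is visible directly from the two variance formulas: they share the numerator $P_0(1-P_0)$, and the stochastic procedure's denominator carries the extra '+1'. A small check (Warner's variance formula is the standard one, stated here only for comparison):

```python
def var_warner(pi, n, p):
    # Warner (1965): V = pi(1-pi)/n + p(1-p) / (n (2p-1)^2)
    return pi * (1 - pi) / n + p * (1 - p) / (n * (2 * p - 1) ** 2)

def var_stochastic(pi, n, p0):
    # Corollary 11.12.2: alpha = p0, beta = 1 - p0 in (11.12.7)
    return pi * (1 - pi) / n + p0 * (1 - p0) / (n * ((2 * p0 - 1) ** 2 + 1))

gains = {p0: var_warner(0.3, 500, p0) - var_stochastic(0.3, 500, p0)
         for p0 in (0.6, 0.7, 0.8)}
```

Every gain is positive, and it is largest when $P_0$ is close to 0.5, where Warner's variance blows up.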

Corollary 11.12.3. If we choose $\alpha = (1-P_0)(1-T_0)$ and $\beta = 1 - \alpha$, where $P_0$ and $T_0$ denote the values of p and T used in Mangat and Singh (1990), the variance $V(\hat{\pi}_R)$ under this distribution becomes
$$ V(\hat{\pi}_R) = \frac{\pi(1-\pi)}{n} + \frac{(1-P_0)(1-T_0)\{1-(1-P_0)(1-T_0)\}}{n[\{(2P_0-1)+2T_0(1-P_0)\}^2 + 1]}, $$
which is smaller than the variance $V(\hat{\pi}_{ms})$ given by
$$ V(\hat{\pi}_{ms}) = \frac{\pi(1-\pi)}{n} + \frac{(1-P_0)(1-T_0)\{1-(1-P_0)(1-T_0)\}}{n\{(2P_0-1)+2T_0(1-P_0)\}^2}, $$
making the proposed procedure more efficient than the Mangat and Singh (1990) method. Hence in this case as well one could possibly find other efficient and more acceptable values of α and β.

Exercise 11.1. ( I ) Consider a social survey that has been conducted by an investigator using randomized responses from a device consisting of two devices $R_1$ and $R_2$ (say). The device $R_1$ consists of the following two statements:

( 1 ) 'Do you possess the sensitive attribute A?', with probability T.

( 2 ) 'Go to the second randomization device $R_2$', with probability $(1-T)$.

The second randomization device $R_2$ is the same as defined by Warner (1965), as follows:
( 1 ) 'Do you possess the sensitive attribute A?' with probability P.
( 2 ) 'Do you not possess the sensitive attribute A?' with probability $(1-P)$.
( a ) Show that an unbiased estimator of the population proportion π is given by
$$ \hat{\pi}_1 = \frac{\hat{\theta}_1 - (1-P)(1-T)}{2P - 1 + 2T(1-P)}. $$
( b ) Show that the estimator $\hat{\pi}_1$ reduces to the estimator proposed by Warner (1965) for $T = 0$.
( c ) Show that the estimator $\hat{\pi}_1$ is more efficient than $\hat{\pi}_w$ if
$$ T > (1-2P)/(1-P). $$
( d ) Study the properties of the estimator $\hat{\pi}_1$ under the SRSWR and the SRSWOR sampling designs.
Hint: Mangat and Singh (1990, 1991).

( II ) Modify the second randomization device $R_2$ of the Mangat and Singh (1990) model with statements:
( 1 ) 'Do you possess the sensitive attribute A?' with probability P;
( 2 ) 'Say forcibly Yes' with probability $(1-P)W$;
( 3 ) 'Say forcibly No' with probability $(1-P)(1-W)$;
where $W \in [0, 1]$ is a suitably chosen constant. Obviously the probability of a 'Yes' answer is given by
$$ \theta = T\pi + (1-T)[P\pi + (1-P)W]. $$
Develop an estimator of π and discuss its properties.
Hint: Singh, Singh, Mangat, and Tracy (1995), Tracy and Osahan (1999).

( III ) Modify the second randomization device $R_2$ of the Mangat and Singh (1990) model with the statements:
( 1 ) 'Do you possess the sensitive attribute A?' with probability P;
( 2 ) 'Do you possess the unrelated attribute Y?' with probability $(1-P)$.
Obviously the probability of a 'Yes' answer is given by
$$ \theta = T\pi + (1-T)[P\pi + (1-P)\pi_y]. $$
Develop an estimator of π and discuss its properties: ( a ) when $\pi_y$ is known; ( b ) when $\pi_y$ is unknown.
Hint: Mangat (1992), Mangat, Singh, Singh, and Singh (1993).

( IV ) Compare the Mangat and Singh (1990) model with the Warner (1965) model at an equal level of protection of the respondents. Discuss your views.
Hint: Nayak (1994), Bhargava (1996), Moors (1997).

Exercise 11.2. Consider two mutually exclusive groups $G_1$ and $G_2$. In the population of interest, the proportion of persons in $G_1$, which we would like to estimate, is π. We select n persons at random from this population. It is convenient to create a random variable $T_i$, so that $T_i = 1$ if the $i$th individual is in $G_1$ and $T_i = 2$ if it is in $G_2$. We confront each person with two urns. There are red balls and black balls in each urn. We ask each person to select k balls from each urn (WR sampling), mentally noting the number of red balls obtained from each. We do not observe the drawing of the balls, and the subjects understand that we will not know the results of the separate draws. Then each person is told to reveal the number of red balls obtained from the urn corresponding to his/her group. Persons from $G_1$ will tell the number of red balls obtained from urn 1, and persons from $G_2$ will tell us the number of red balls obtained from urn 2. By such a mechanism, confidentiality of individuals is preserved. Let $\theta_1$ and $\theta_2$ be the proportions of red balls in the two urns. We control the values of $\theta_1$ and $\theta_2$, and we would never consider the case $\theta_1 = \theta_2$. Let r be the observed proportion of 'Yes' answers in a sample of n respondents; then, assuming that $k = 1$, show that an unbiased estimator of the population proportion π is given by
$$ \hat{\pi}_1 = \frac{r - \theta_2}{\theta_1 - \theta_2}. $$
Find its variance and discuss the results and your views.
Hint: Kuk (1990), Chatterjee and Simon (1993).
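For $k = 1$ the urn scheme of Exercise 11.2 reduces to a single Bernoulli report per respondent, and the unbiasedness of $\hat{\pi}_1$ is easy to see in simulation (all numbers below are illustrative choices, not from the text):

```python
import random

def urn_rr_estimate(pi, theta1, theta2, n, rng):
    """Each respondent draws one ball (k = 1) from the urn of his/her own
    group and reports only whether it was red; the interviewer never learns
    which urn was used."""
    red_reports = sum(
        rng.random() < (theta1 if rng.random() < pi else theta2)
        for _ in range(n))
    r = red_reports / n
    return (r - theta2) / (theta1 - theta2)

rng = random.Random(11)
estimates = [urn_rr_estimate(0.25, 0.8, 0.3, 400, rng) for _ in range(2000)]
mean_est = sum(estimates) / len(estimates)
```

The replicate mean settles near the true π = 0.25, since $E(r) = \pi\theta_1 + (1-\pi)\theta_2$ makes the estimator unbiased.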

Exercise 11.3. Let the value $Y_i$ of a sensitive variable y, defined on a finite survey population of N identifiable and labelled persons, be supposed unavailable through a direct response survey when one intends to estimate the population total Y on choosing a sample s from the population with probability p(s) according to a design p. Instead, let a randomized response $r_i$ be available, in an independent manner from the respective persons i on request if sampled, in such a way that the expectation, variance, and covariance operators $(E_R, V_R, C_R)$ respectively satisfy
$$ E_R(r_i) = Y_i, \quad V_R(r_i) = \alpha_i Y_i^2 + \beta_i Y_i + \theta_i = \phi_i \ (\text{say}), \quad C_R(r_i, r_j) = 0 \text{ for } i \ne j, $$
such that $\alpha_i > 0$, and $\beta_i$ and $\theta_i$ are known for every unit in the population.

( a ) Show that for PPSWR sampling an unbiased estimator of the population total Y is
$$ \hat{Y}_{wr} = \frac{1}{n}\sum_{i=1}^{n}\frac{r_i}{p_i}. $$
Find its variance and an estimator of its variance.
Hint: Chaudhuri and Adhikari (1990), Arnab (1990).
( b ) Under a multi-character survey and PPSWR sampling design, six estimators of the population total Y are given by
$$ \hat{Y}_{mc} = \frac{1}{n}\sum_{i=1}^{n}\frac{r_i}{P_i^{(k)}}, \quad\text{for } k = 0, 1, 2, 3, 4, 5, $$
where the transformed selection probabilities include
$$ \left(\frac{1}{N}\right)^{1-\rho_{xy}} P_i^{\rho_{xy}}, \quad \left[N(1-\rho_{xy}) + \frac{\rho_{xy}}{P_i}\right]^{-1}, \quad\text{and}\quad \frac{1-\rho_{xy}}{N} + \rho_{xy} P_i, $$
with $\rho_{xy}$ being the known correlation coefficient. Find the bias and variance of each of the estimators.
Hint: Bansal, Singh, and Singh (1994), Grewal, Bansal, and Singh (1999).
( c ) Under PPSWOR sampling, an unbiased estimator of the population total is
$$ \hat{Y}_{WOR} = \sum_{i=1}^{n}\frac{r_i}{\pi_i}. $$
Find its variance and suggest an estimator of variance.
Hint: Godambe (1980b), Arnab (1994).
( d ) If the population total X of an auxiliary character is known, then find the variance of the generalized regression estimator (GREG) of the population total, defined as
$$ \hat{Y}_{GREG} = \sum_{i=1}^{n}\frac{r_i}{\pi_i} + \hat{\beta}_{ds}\left[X - \sum_{i=1}^{n}\frac{x_i}{\pi_i}\right]. $$
Suggest at least two estimators of its variance.
Hint: Chaudhuri, Maiti, and Roy (1996), Tracy and Singh (2000).
( e ) Assuming that $Y_i$ is a qualitative variable, show that a linear homogeneous unbiased estimator of the population proportion is given by
$$ \hat{Y}_p = \sum_{i \in s} b_{si} r_i, $$
where the $b_{si}$ are constants free from the $Y_i$ values, and satisfy the condition
$$ \sum_{s \ni i} b_{si}\, p(s) = N^{-1}. $$
Hint: Arnab (1996).

Exercise 11.4. Consider a finite population of N first stage units (FSUs) and let the $i$th, $i = 1, 2, \ldots, N$, FSU consist of $M_i$ second stage units (SSUs). Let a sample s of n FSUs be selected with probability $p_1(s)$ following some sampling design $p_1$, and if the $i$th FSU is selected in the sample, we take a sub-sample $s_i$ of $m_i$ SSUs from the $M_i$ SSUs of the $i$th FSU with probability $p_2(s_i)$ following a sampling design $p_2$. The sub-samples $s_i$, $i \in s$, are selected independently. The overall sampling design for selection of the sample $(s_i, i \in s)$ is denoted by p. Let $E_p(V_p)$, $E_1(V_1)$, and $E_2(V_2)$ denote the expectation (variance) operators over the sampling designs p, $p_1$, and $p_2$, respectively. Let $\pi_i$, $\pi_{ij}$, etc., denote the first and second order inclusion probabilities of the FSUs. Let $Y_{ij}$ be the value of the character under study for the $j$th SSU of the $i$th FSU $(j = 1, 2, \ldots, M_i;\ i = 1, 2, \ldots, N)$. Let $r_{ij}$ denote the standardized randomized response, such that
$$ E_R(r_{ij}) = Y_{ij}, \quad V_R(r_{ij}) = \sigma^2_{ij} = \theta_{ij} Y_{ij}^2 + \beta_{ij} Y_{ij} + \delta_{ij}, \quad\text{and}\quad C_R(r_{ij}, r_{kl}) = 0 \text{ for } (i,j) \ne (k,l). $$
Find the variance of the estimator of the population total Y defined as
$$ \hat{Y}(r) = \sum_{i \in s} b_i(s)\,\hat{Y}_i(r), $$
where $\hat{Y}_i(r) = \sum_{j \in s_i} b_j(s_i)\, r_{ij}$, and $b_i(s)$ and $b_j(s_i)$ are constants free from the $r_{ij}$ values such that $\sum_{s_i \ni j} b_j(s_i)\, p_2(s_i) = 1$ and $\sum_{s \ni i} b_i(s)\, p_1(s) = 1$. Suggest an estimator of its variance.
Hint: Arnab (1992b).

Exercise 11.5. ( a ) Consider a randomization device as defined by Warner (1965), as follows:
( i ) 'Do you possess the sensitive attribute A?' with probability P;
( ii ) 'Do you not possess the sensitive attribute A?' with probability $(1-P)$;
which has been used to conduct a social survey. Show that an unbiased estimator of the population proportion π is
$$ \hat{\pi}_1 = \frac{\hat{\theta}_1 - (1-P)}{2P-1} $$
under the SRSWR and SRSWOR sampling designs. Compare the relative efficiency of the above estimator under the two different designs.
Hint: Singh and Kathuria (1995).
( b ) Assume the randomized response device consists of a deck having three types of cards, as follows:
( i ) 'Do you possess the sensitive attribute A?' with probability $P_1$;
( ii ) 'Do you not possess the sensitive attribute A?' with probability $P_2$ $(P_2 \ne P_1)$;
( iii ) 'Forcibly No' with probability $P_3$;
such that $P_1 + P_2 + P_3 = 1$. Obviously the probability of a 'Yes' answer is given by
$$ \theta = P_1\pi + P_2(1-\pi). $$
Develop an estimator of π and find its variance expression.
Hint: Mangat, Singh, and Singh (1995).
( c ) Replace the 'Forcibly No' statement in ( b ) with a 'Forcibly Yes' statement, and construct a new estimator of π. Compare its variance expression with ( b ) at equal and unequal levels of protection.
Hint: Bhargava and Singh (2000).

Exercise 11.6. Consider a randomization device as defined by Warner (1965), as follows:
( i ) 'Do you possess the sensitive attribute A?' with probability P;
( ii ) 'Do you not possess the sensitive attribute A?' with probability $(1-P)$;
which has been used to conduct a social survey.
( a ) Assume that the interviewer goes on selecting or collecting responses from the respondents of the population until m respondents (a pre-specified number) have reported 'Yes'. Then show that an unbiased estimator of the population proportion π is given by
$$ \hat{\pi}_1 = \frac{\hat{\theta}_1 - (1-P)}{2P-1}, $$
where $\hat{\theta}_1 = m/n$ follows an Inverse Binomial distribution. Is it possible to list situations where Inverse Binomial Randomized Response (IBRR) can be implemented in actual practice?
Hint: Mangat and Singh (1991b, 1995).

( b ) Let $\hat{\theta}_1$ be the proportion of observed 'Yes' answers in a sample of n units drawn with SRSWR sampling, and let it follow a Binomial distribution with the same parameters as in Warner's pioneering model. Suppose an estimator for estimating the population proportion π is defined as
$$ \hat{\pi}_0 = a\hat{\theta}_1 + b. $$
Find the values of a and b such that the estimator $\hat{\pi}_0$ reduces to the estimators defined as
$$ \hat{\pi}_1 = \frac{\hat{\theta}_1 - (1-P)}{2P-1} \quad\text{and}\quad \hat{\pi}_2 = \frac{\hat{\lambda} - (1-P)}{2P-1}, $$
where $\hat{\lambda}$ denotes the observed proportion of 'Yes' answers under the inverse sampling scheme of part ( a ). Also find the optimum values of a and b such that the mean squared error of the estimator $\hat{\pi}_0$ is minimum.
Hint: Sampath, Uthayakumaran, and Tracy (1995), Mangat, Singh, and Singh (1991), Singh and Singh (1992b).

Exercise 11.7. Consider a population \Omega consisting of n(\Omega) individuals, and let G denote
the group of individuals in \Omega possessing a sensitive quantitative variable (e.g., income \ge
$10,000 per month, or number of abortions \ge 2, or number of murders committed
\ge 1, etc.). Our problem is to estimate the proportion n(G)/n(\Omega),
where n(G) denotes the number of individuals in the group G. Simultaneously, we
are also interested in estimating the average value of the sensitive quantitative
variable for the n(G) individuals of group G. Because of fear of the law or of
embarrassment in society, the respondents from group G are not likely to reveal
their membership of the group, or the value of the sensitive variable,
through a direct survey. Let \pi denote the proportion of respondents possessing the
sensitive variable in group G (i.e., \pi = n(G)/n(\Omega)) and \mu_x be the average value of
the sensitive quantitative variable (X) for the n(G) respondents. Thus we would like
to estimate \pi and \mu_x simultaneously. In this randomized response model, two
independent sub-samples of sizes n_1 and n_2 are drawn from \Omega by using SRSWR
such that n_1 + n_2 = n, the total sample size required. Each respondent selected in
the i-th sub-sample, i = 1, 2, is requested to generate a random number S_i using
randomization device R_i and report only the scrambled response Z_i = X S_i if he
belongs to group G; otherwise he is required to report a random number Z_3 = S_3
960 Advanced sampling theory with applications

generated by another randomization device R_3. The whole process is performed by
the interviewee unobserved by the interviewer. Let E(S_i) = \theta_i and V(S_i) = E(S_i - \theta_i)^2
= \sigma_i^2, i = 1, 2, 3, denote the known mean and variance of the scrambling variable
S_i. Also let \sigma_x^2 denote the variance of the sensitive quantitative variable (X) in
group G. Without loss of generality we have
E(\bar{Z}_1) = \pi\theta_1\mu_x + (1-\pi)\theta_3 and E(\bar{Z}_2) = \pi\theta_2\mu_x + (1-\pi)\theta_3 ,
where \bar{Z}_1 and \bar{Z}_2 are the means of the responses in the two sub-samples. Then show that
an estimator of \pi is
\hat{\pi}_s = \frac{\theta_2\bar{Z}_1 - \theta_1\bar{Z}_2 - \theta_3(\theta_2-\theta_1)}{\theta_3(\theta_1-\theta_2)} and that for \mu_x is
\hat{\mu}_x = \frac{\theta_3(\bar{Z}_1 - \bar{Z}_2)}{\theta_2\bar{Z}_1 - \theta_1\bar{Z}_2 - \theta_3(\theta_2-\theta_1)} .
Find the bias and variance of both estimators. Find the values of n_1 and n_2 such
that the value of the linear combination of variances of both estimators, that is,
LC = \alpha V(\hat{\pi}_s) + (1-\alpha)V(\hat{\mu}_x), is minimum.
Hint: Singh, Mangat, and Singh (1997).
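A Monte Carlo check of the two estimators above is straightforward. In the sketch below the scrambling variables S_1, S_2, S_3 are taken, purely for illustration, to be uniform with known means \theta_1 = 1.5, \theta_2 = 3, \theta_3 = 40, and X is normal with \mu_x = 50; none of these choices come from the source:

```python
import random

def simulate_once(rng, n1=4000, n2=4000, pi=0.25, mu_x=50.0,
                  t1=1.5, t2=3.0, t3=40.0):
    # One survey: scrambling variables are uniform with known means t1, t2, t3.
    def response(ti):
        if rng.random() < pi:                         # respondent belongs to group G
            x = rng.gauss(mu_x, 10.0)                 # sensitive value X
            return x * rng.uniform(0.5 * ti, 1.5 * ti)  # scrambled Z_i = X * S_i
        return rng.uniform(0.5 * t3, 1.5 * t3)        # otherwise reports S_3 only
    zbar1 = sum(response(t1) for _ in range(n1)) / n1
    zbar2 = sum(response(t2) for _ in range(n2)) / n2
    num = t2 * zbar1 - t1 * zbar2 - t3 * (t2 - t1)
    pi_hat = num / (t3 * (t1 - t2))                   # estimator of pi
    mu_hat = t3 * (zbar1 - zbar2) / num               # estimator of mu_x
    return pi_hat, mu_hat

rng = random.Random(7)
reps = [simulate_once(rng) for _ in range(100)]
pi_mean = sum(p for p, _ in reps) / len(reps)
mu_mean = sum(m for _, m in reps) / len(reps)
print(round(pi_mean, 2), round(mu_mean, 1))  # both close to (0.25, 50)
```

The ratio form of \hat{\mu}_x makes it slightly biased in small samples, which is exactly the point of the bias question in the exercise.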

Exercise 11.8. Consider a respondent belonging to group A who is requested to repeat
the trial with the Warner (1965) randomization device if in the first trial he does not
receive the statement corresponding to his status. The repetition of the trial is known to
the interviewee but remains unknown to the interviewer. Assuming completely
truthful reporting by the respondents, the probability of a 'Yes' answer is given by
\theta = \pi\{P + (1-P)P\} + (1-\pi)(1-P) .
Develop an unbiased estimator of \pi and find its
variance.
Hint: Singh and Joarder (1997b).
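The two-trial device is easy to simulate. Inverting \theta = \pi\{P+(1-P)P\} + (1-\pi)(1-P) gives the moment estimator used in the sketch below; the parameter values are illustrative, and the estimator is simply the algebraic inversion, not taken verbatim from the cited paper:

```python
import random

def answer(in_A, P, rng):
    # A member of A repeats the draw once if the first card does not match his status.
    if in_A:
        if rng.random() < P:                  # first card: "I belong to A" -> 'Yes'
            return 1
        return 1 if rng.random() < P else 0   # repeated draw; truthful 'No' otherwise
    # Non-members answer the first card truthfully, with no repetition.
    return 0 if rng.random() < P else 1

pi, P, n = 0.3, 0.7, 200000
rng = random.Random(3)
yes = sum(answer(rng.random() < pi, P, rng) for _ in range(n))
theta_hat = yes / n
# P('Yes') = pi*(P + (1-P)*P) + (1-pi)*(1-P); inverting gives the moment estimator:
denom = P + (1 - P) * P - (1 - P)
pi_hat = (theta_hat - (1 - P)) / denom
print(round(pi_hat, 2))  # close to the true pi = 0.3
```
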

Exercise 11.9. In the first phase select a preliminary large sample of m units from
the population of N units by using SRSWOR, and measure only auxiliary information X
on these m units as X_1, X_2, ..., X_m. In the second phase a sub-sample
of n units is drawn from the preliminary large sample of m units using PPSWR,
and then the scrambled responses r_i are measured through a randomization device.
Study the asymptotic properties of the estimator of the population mean
\bar{y}_p = \frac{1}{mn}\sum_{i=1}^{n}\frac{r_i}{p_i^{*}} ,
where p_i^{*} = X_i/\sum_{j=1}^{m}X_j
denotes the probability of selecting the i-th unit from the given first phase sample.
Hint: Grewal, Bansal, and Singh (2002).

Exercise 11.10. In the direct question survey methods, if there are u distinct units
in an SRSWR sample of size n and k_i is the frequency with which the i-th distinct
unit occurs in the sample, then we have E(\bar{y}_u) = \bar{Y} and V(\bar{y}_u) \le V(\bar{y}_n), where
\bar{y}_u = u^{-1}\sum_{i=1}^{u}y_i and \bar{y}_n = n^{-1}\sum_{i=1}^{u}k_i y_i. Show that this inequality oscillates in the case of
Chapter II .: Randomized response sampling : Tools for social surveys 961

randomized response with replacement sampling. Deduce the results for qualitative
characters also.
Hint: Arnab (1995), Mangat, Singh, Singh, Bellhouse, and Kashani (1995), Singh,
Mahmood, and Tracy (2001).

Exercise 11.11. Consider an investigator interested in estimating the proportions
of people who individually belong to two different sensitive groups, called group A
and group B. The investigator proposed a randomization device with which the
respondent replies 'Yes' or 'No' to whichever of the two statements occurs by chance:
( i ) I am a member of group A. ( ii ) I am not a member of group B. In order to
estimate the proportions of those sensitive groups, two independent non-overlapping
samples of sizes n_1 and n_2 are selected. Let P_i denote the probability of the first
sensitive question being selected by respondents in the i-th sample (i = 1, 2). Let n_i'
be the total number of respondents who reply 'Yes' in the i-th sample. Let \pi_a and \pi_b
be the true proportions of persons with attributes A and B, respectively. Obviously,
the probability of a 'Yes' answer in the i-th sample is
\theta_i = P_i\pi_a + (1-P_i)(1-\pi_b), i = 1, 2.
Deduce the estimators of \pi_a and \pi_b and derive their minimum variances subject to
the total sample size remaining fixed at n = n_1 + n_2.
Hint: Chang and Liang (1996).
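Since \theta_1 and \theta_2 are linear in (\pi_a, \pi_b), the moment estimators solve a 2x2 linear system. The sketch below (the helper solve_props and the true values \pi_a = 0.2, \pi_b = 0.4 are hypothetical illustrations, not from the source) applies Cramer's rule to the observed 'Yes' proportions:

```python
import random

def solve_props(th1, th2, P1, P2):
    # Solve th_i = P_i*pi_a + (1 - P_i)*(1 - pi_b), i = 1, 2, by Cramer's rule.
    b1, b2 = th1 - (1 - P1), th2 - (1 - P2)
    det = P2 - P1                              # requires P1 != P2
    pi_a = (-(1 - P2) * b1 + (1 - P1) * b2) / det
    pi_b = (P1 * b2 - P2 * b1) / det
    return pi_a, pi_b

pi_a, pi_b, P1, P2 = 0.2, 0.4, 0.8, 0.3        # hypothetical true proportions
rng = random.Random(11)

def yes_prop(P, n):
    th = P * pi_a + (1 - P) * (1 - pi_b)       # probability of a 'Yes' answer
    return sum(1 for _ in range(n) if rng.random() < th) / n

pa, pb = solve_props(yes_prop(P1, 100000), yes_prop(P2, 100000), P1, P2)
print(round(pa, 2), round(pb, 2))  # close to (0.2, 0.4)
```
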

Exercise 11.12. Let \pi be the proportion of respondents in the population that
belong to the sub-group (say) A, each member of which has a value of the sensitive
variable X that satisfies the inequality X \ge W, where W is a known fixed constant.
Further let Y denote the random variable that is generated by a randomization device
(say R_1). The distribution of the variable Y is known, and it takes values equal to or
greater than W. Each respondent selected in an SRSWR sample of n respondents is
instructed to report the actual value of X if he belongs to sub-group A; otherwise, he is
instructed to report the random number Y generated through the randomization device
R_1 provided to him. The mode followed by the respondents for giving their responses
is not revealed to the investigator. Thus the response Z_i from the i-th respondent will
take the value X_i with probability \pi and a random value Y_i with probability (1-\pi). Thus
we have
E(Z_i) = \pi E(X_i) + (1-\pi)E(Y_i) = \pi\mu_x + (1-\pi)\mu_y ,
where \mu_x and \mu_y respectively denote the means of the sensitive variable X in
the sub-group and of the variable Y in the randomization device R_1. In addition to
the above, each respondent in the sample is also provided with the usual Warner's
(1965) randomization device R_2 to estimate \pi. This device may consist of a deck
of cards having two types of statements:
( i ) 'Do you belong to the sub-group A?' with probability P;
and
( ii ) 'Do you belong to the sub-group not A?' with probability (1-P).

Show that an unbiased estimator of \pi is
\hat{\pi} = \{n_1/n - (1-P)\}/(2P-1) ,
where n_1 denotes the number of observed 'Yes' answers from the sample when the
randomization device R_2 is used. Also deduce an estimator \hat{\mu}_x of \mu_x and show that,
to the first order of approximation, its MSE is
MSE(\hat{\mu}_x) = \frac{1}{n\pi^2}\left[\pi\sigma_x^2 + (1-\pi)\sigma_y^2 + \frac{P(1-P)}{(2P-1)^2}(\mu_x-\mu_y)^2\right] .
Hint: Singh, Singh, and Mangat (1996).

Exercise 11.13. Suppose X denotes the sensitive variable. Let S be the scrambling
variable, independent of X and having finite mean and variance. For simplicity,
assume that X \ge 0 and S > 0. The respondent generates S by using some
specified randomization device. Each respondent scrambles the response on X by
multiplying it with the value taken by the scrambling variable S in his/her case.
Only the scrambled result Y = XS is revealed to the interviewer. Note that the
particular value taken by S is not known to the interviewer; thus the respondent's
privacy is not disclosed. Consider a sample of size n drawn using simple random
with replacement sampling from a population of size N. Let Y_i denote the value
of the scrambled variable Y for the i-th respondent of the sample, i = 1, 2, ..., n. Show
that the square of the coefficient of variation of the sensitive character X is given by
C_x^2 = (C_y^2 - C_s^2)/(1 + C_s^2) ,
where C_x^2 = \sigma_x^2/\mu_x^2 and C_s^2 = \sigma_s^2/\theta^2 (with \theta = E(S)) are the squares of the
coefficients of variation of the sensitive character X and the scrambling variable S,
respectively. Suggest an estimator of C_x^2 and study its asymptotic properties.
Hint: Singh, Singh, and Mangat (1998).
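The relation C_x^2 = (C_y^2 - C_s^2)/(1 + C_s^2) suggests the plug-in estimator sketched below, in which C_y^2 is replaced by its sample analogue computed from the observed scrambled responses; the distributions of X and S are chosen arbitrarily for illustration:

```python
import random

rng = random.Random(5)
n = 100000
# Known scrambling distribution: S ~ Uniform(0.5, 1.5), mean 1, so Cs2 = (1/12)/1.
Cs2 = (1.0 / 12.0) / 1.0 ** 2
y = []
for _ in range(n):
    x = rng.uniform(20.0, 80.0)        # sensitive X (never seen by the interviewer)
    s = rng.uniform(0.5, 1.5)          # scrambling draw
    y.append(x * s)                    # only the product is reported
ybar = sum(y) / n
s2y = sum((v - ybar) ** 2 for v in y) / (n - 1)
Cy2 = s2y / ybar ** 2                  # sample CV^2 of the scrambled responses
Cx2_hat = (Cy2 - Cs2) / (1.0 + Cs2)    # plug-in estimator of Cx^2
print(round(Cx2_hat, 2))               # true Cx^2 = 300/2500 = 0.12 here
```
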

Exercise 11.14. In Franklin's (1989a, 1989b) model, k \ge 1 responses are obtained from
each respondent of a simple random with replacement sample of size n. The response
Z_{ij}, i = 1, 2, ..., n; j = 1, 2, ..., k, is a random number drawn from the density g_{ij} if the
respondent belongs to the sensitive group A; otherwise, it is drawn from the density
h_{ij}. The interviewer does not know the density used by the respondent for drawing the
random number. The model can be specialized by having g_{ij} = g_j and h_{ij} = h_j for all
i = 1, 2, ..., n. The densities g_j and h_j, respectively, have known means \mu_{1j} and \mu_{2j}
and known variances \sigma_{1j}^2 and \sigma_{2j}^2. Suppose an investigator modifies the procedure
suggested by Franklin (1989b) by using the known proportion of an unrelated character
\pi_y in the population and by suitably choosing known parameters of the proposed
randomization device as follows:
If a respondent,

( i ) belongs to A but not to Y, he is instructed to use the density g_{1ij};
( ii ) belongs to neither A nor Y, he is instructed to use the density g_{2ij};
( iii ) belongs to both A and Y, he is instructed to use the density g_{3ij};
( iv ) belongs to Y but not to A, he is instructed to use the density g_{4ij}.
On the basis of the above information, suggest an estimator of the finite population
proportion of interest. Derive its variance expression.
Hint: Singh (1994).

Exercise 11.15. A sample of n respondents is drawn from a finite population of N
units using SRSWR, but the information from the d distinct units in the sample,
1 \le d \le n, is used in the construction of an estimator of the population proportion based on
Warner's model. Let d' denote the number of respondents reporting a 'Yes' answer in the
interview conducted with Warner's RR device; then study the properties of the
following estimators of \pi defined as
\hat{\pi}_1 = \frac{d'/d - (1-p)}{2p-1} and \hat{\pi}_2 = \frac{d'/E(d) - (1-p)}{2p-1} .
Also suggest an estimator of \pi in the case of the unrelated question model based on the
information collected from the distinct units.
Hint: Tracy and Mangat (1998), Arnab (1999).

Exercise 11.16. Develop an estimator of the proportion \pi of the persons possessing
the attribute A in the population with the help of the following randomization
devices:
( a ) Every respondent selected in a sample of size n drawn with SRSWR
sampling is requested to report 'Yes' if he belongs to the sensitive group A; otherwise
he is requested to use Warner's (1965) device. Obviously the probability of a 'Yes'
answer is \theta_1 = \pi + (1-\pi)(1-P).
Hint: Mangat (1994).
( b ) Each respondent in a sample of n respondents is instructed to say 'Yes' if he
belongs to the sensitive group A, and to report his membership of Y in terms of 'Yes'
or 'No' if he does not belong to group A. Obviously the probability of a 'Yes'
answer is given by \theta = \pi + (1-\pi)\pi_y. Can this randomization device be used in mail
surveys?
Hint: Mangat, Singh, and Singh (1993), Singh, Mangat, and Singh (1993).
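For part ( a ), inverting \theta_1 = \pi + (1-\pi)(1-P) = \pi P + (1-P) gives \hat{\pi} = (\hat{\theta}_1 - (1-P))/P with variance \theta_1(1-\theta_1)/(nP^2), which can be compared numerically against Warner's variance. The parameter values below are illustrative only:

```python
def var_warner(pi, P, n):
    # Warner (1965): theta = (2P-1)*pi + (1-P).
    th = (2 * P - 1) * pi + (1 - P)
    return th * (1 - th) / (n * (2 * P - 1) ** 2)

def var_device_a(pi, P, n):
    # Device (a): theta1 = pi + (1-pi)*(1-P) = pi*P + (1-P).
    th = pi * P + (1 - P)
    return th * (1 - th) / (n * P ** 2)

pi, P, n = 0.3, 0.7, 1000
vw, va = var_warner(pi, P, n), var_device_a(pi, P, n)
print(vw > va)  # → True: device (a) is more efficient at these settings
```

The comparison illustrates why asking members of A to answer directly, and randomizing only the rest, tends to reduce the variance inflation of the Warner design.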

Exercise 11.17. Consider a population under study consisting of N units. We assume
that the population can be thought of as consisting of two strata: the first stratum
has N_1 respondents who will return the completed questionnaire without waiting
for any further communication from the investigator, while the N_2 members of the
second stratum (defined as the non-response stratum) will not do so. Thus N_1 + N_2 = N.
Let \mu_1 and \mu_2 denote the population means of the sensitive character X in the first
and second strata. Let the sensitive character X in the first and second strata of the
population be denoted by X_1 and X_2 respectively, so that
E(X_1) = \mu_1 and E(X_2) = \mu_2 .
Then the actual population mean of the sensitive variable X is given by
\mu = \sum_{i=1}^{2} W_i\mu_i ,
where W_i = N_i/N, i = 1, 2. Assume we select a sample of size n using the SRSWOR
method. Questionnaires will then be mailed to each of the selected respondents in the
sample with the request that they should be returned after completion. The respondents
will also be required to scramble their response on the sensitive character X. For
scrambling the sensitive character X, each respondent is instructed to select K
random natural numbers S_j (j = 1, 2, ..., K) out of the random sequence of the first N_0
natural numbers sent with the questionnaire, by SRSWOR. Then each respondent is
instructed to calculate the mean, \bar{S} = K^{-1}\sum_{j=1}^{K}S_j, of the K selected natural numbers and
record only the scrambled response Y = \bar{S}X. The values of K and N_0 are the same for
each respondent. Here S_j is a random variable with 1 \le S_j \le N_0. Using
\sum_{j=1}^{N_0}S_j = N_0(N_0+1)/2
and \sum_{j=1}^{N_0}S_j^2 = N_0(N_0+1)(2N_0+1)/6, it can easily be seen that
E(\bar{S}) = \frac{N_0+1}{2} and V(\bar{S}) = \left\{\frac{N_0-K}{K}\right\}\frac{N_0+1}{12} .
Let n_1 denote the number of respondents
who return the completed questionnaires, so that n_2 = (n - n_1) is the number of
respondents who do not return the questionnaire. We select a sub-sample of h_2
respondents with SRSWOR from the n_2 respondents of the non-response stratum such that
n_2 = h_2 g, (g \ge 1). These h_2 respondents will then be interviewed personally. Let
Y_{1i} = X_{1i}\bar{S}_1 (i = 1, 2, ..., n_1) and Y_{2i} = X_{2i}\bar{S}_2 (i = 1, 2, ..., h_2) denote the scrambled
responses given by the respondents at the first and second efforts, respectively. Note
that X_{1i} and X_{2i} are independent of \bar{S}_1 and \bar{S}_2 respectively; therefore
E(X_{1i}) = E(Y_{1i})/E(\bar{S}_1) = \mu_1 and E(X_{2i}) = E(Y_{2i})/E(\bar{S}_2) = \mu_2. On using the sample analogue of
the scrambled responses, unbiased estimators of \mu_1 and \mu_2 are \hat{\mu}_1 = 2\bar{y}_1/(N_0+1) and
\hat{\mu}_2 = 2\bar{y}_2/(N_0+1), where \bar{y}_1 = n_1^{-1}\sum_{i=1}^{n_1}Y_{1i} and \bar{y}_2 = h_2^{-1}\sum_{i=1}^{h_2}Y_{2i}. Consider an estimator of
the population mean \mu as
\hat{\mu}_w = \frac{n_1\hat{\mu}_1 + n_2\hat{\mu}_2}{n} .
Show that \hat{\mu}_w is unbiased for \mu and find its variance for the fixed cost.
Hint: Singh, Singh, and Mangat (1995).

Exercise 11.18. Study the properties of the ridge estimator
\hat{\beta}_{R_c}^{(s)} = (X'X + R_c I)^{-1}X'z
under scrambled responses. Show that the variance of \hat{\beta}_{R_c}^{(s)} is
V(\hat{\beta}_{R_c}^{(s)}) = \sigma^2 A(X'X)^{-1}A' + C_r^2(X'X + R_c I)^{-1}X'(D + \sigma^2 I)X(X'X + R_c I)^{-1} ,
where C_r = \gamma/\theta and D = Diag\{(x_1'\beta)^2, ..., (x_n'\beta)^2\} .
Hint: Singh and Tracy (1999).

Exercise 11.19. Assume a simple random sample of n people is drawn with
replacement from the population. Each interviewee in the sample is then provided
with three random devices, say R_0, R_1 and R_2, each consisting of two mutually
exclusive outcomes. The random device R_0 is the decision device, and the
interviewee uses R_0 in the first instance. The random device R_0 has two types of
statements, namely: ( i ) Use the random device R_1; ( ii ) Use the random device R_2.
The probabilities with which these two statements occur in the device are T and
(1-T), respectively. If the interviewee selects the card from the device R_0 having
the statement ( i ) 'Use the random device R_1', then the interviewee is instructed to
use the density g_{ij}^{*}(\cdot) if he/she belongs to the sensitive group A; otherwise he/she
uses the density h_{ij}^{*}(\cdot). On the other hand, if the interviewee selects the card from
the device R_0 having the statement ( ii ) 'Use the random device R_2', then the
interviewee will use the density g_{ij}(\cdot) if he/she belongs to the sensitive group A;
otherwise he/she uses the density h_{ij}(\cdot). Form an unbiased estimator of the
proportion of persons, say \pi, in the sensitive group A based on the information
collected through the above randomization device.
Hint: Singh and Singh (1992a, 1993d).

Exercise 11.20. Divide the population into (k+1) groups, of which the first
group is the responsive group and the remaining k groups belong to non-responsive
groups of varying degrees; each of these non-responsive groups supplies the
a_1, a_2, ..., a_k levels of responses. Let \pi_1, \pi_2, ..., \pi_{k+1} be the true probabilities of
responses in these (k+1) groups and p_1, p_2, ..., p_{k+1} be the probabilities that the
spinner points to the responsive group. The value of the response variable is 1,
a_1, a_2, ..., a_k; that is, the responsive group supplies the full information while the
remaining k groups reveal only the a_1, a_2, ..., a_k levels of information.
Thus
P(x_i = 1) = \pi_1 p_1 + (1-p_1)(1-\pi_1) ,
P(x_i = j) = \pi_j p_j + (1-p_j)(1-\pi_j), j = 2, 3, ..., (k+1).
The likelihood function of the sample is
L = \prod_{j=1}^{k+1}[p_j\pi_j + (1-p_j)(1-\pi_j)]^{n_j} ,
where
n = \sum_{j=1}^{k+1}n_j .
Show that the maximum likelihood estimate of \pi_j, j = 1, 2, ..., (k+1), is the solution to
the set of singular simultaneous equations
[ n_1(2p_1-1)  n_2(1-p_2)  ...  n_{k+1}(1-p_{k+1}) ] [ \{\pi_1(2p_1-1)+(1-p_1)\}^{-1} ]   [ 0 ]
[ n_1(1-p_1)   n_2(2p_2-1) ...  n_{k+1}(1-p_{k+1}) ] [ \{\pi_2(2p_2-1)+(1-p_2)\}^{-1} ] = [ 0 ]
[ n_1(1-p_1)   n_2(1-p_2)  ...  n_{k+1}(2p_{k+1}-1) ] [ \{\pi_{k+1}(2p_{k+1}-1)+(1-p_{k+1})\}^{-1} ]   [ 0 ]
Hint: Mishra and Sinha (1999), Bourke (1981), Eriksson (1973).

Exercise 11.21. ( a ) Assume two independent samples of sizes n_i, i = 1, 2, are drawn
from the whole population by using an SRSWR method such that n_1 + n_2 = n, the total
sample size required. Provide a randomization device S_i to the respondents in the i-th
sample, i = 1, 2, with two statements: ( i ) 'I belong to group A' and ( ii ) 'I belong to
group Y', represented with probabilities p_i and (1-p_i), i = 1, 2, respectively. Then
for this model \theta_i = p_i\pi + (1-p_i)\pi_y is the probability of obtaining a 'Yes' answer from a
person in the i-th sample. If \hat{\theta}_i denotes the proportion of 'Yes' answers obtained from
the respondents in the i-th sample, show that an unbiased estimator of \pi is
\hat{\pi}_G = \frac{(1-p_2)\hat{\theta}_1 - (1-p_1)\hat{\theta}_2}{p_1 - p_2}
with variance
V(\hat{\pi}_G) = [(1-p_2)^2\theta_1(1-\theta_1)/n_1 + (1-p_1)^2\theta_2(1-\theta_2)/n_2]/(p_1-p_2)^2 .
Show that the optimal choice of one of the p_i, i = 1, 2, is close to one and the other close to
zero. Show that the optimal choice of the value \pi_y (unknown) is close to zero or 1 according as
\pi < 0.5 or \pi > 0.5, and that if \pi = 0.5, the minimum variance occurs at the tails of \pi_y.
( b ) In the above part ( a ) select the respondents in the independent sub-samples by
using SRSWOR sampling. Show that the variance of the estimator \hat{\pi}_G is given by
V_f(\hat{\pi}_G) = V(\hat{\pi}_G) - \frac{(1-p_2)^2(n_1-1)\{p_1^2\pi(1-\pi) + (1-p_1)^2\pi_y(1-\pi_y)\}}{n_1(N-1)(p_1-p_2)^2}
- \frac{(1-p_1)^2(n_2-1)\{p_2^2\pi(1-\pi) + (1-p_2)^2\pi_y(1-\pi_y)\}}{n_2(N-1)(p_1-p_2)^2} .
Discuss the optimal choice of p_i and \pi_y under SRSWOR sampling.
( c ) The choice of p_2 = 0 under both the SRSWR and SRSWOR sampling schemes
discloses the privacy of a certain kind of respondents.
Hint: Kim (1978), Singh, Singh, and Mangat (2000).

Exercise 11.22. Let \pi be the true proportion of persons belonging to group A. Let
p_j, j = 1, 2, 3, be known parameters of a randomization device used for eliciting
information in randomized response surveys. Let y_i be the binary response of the
i-th respondent in the sample consisting of n respondents; then the probabilities for
the randomized response to the sensitive question are given by
Pr(y_i = 1) = p_1 + (1-p_1-p_2)\pi_{rr} and Pr(y_i = 0) = p_2 + (1-p_1-p_2)(1-\pi_{rr}), where \pi_{rr}
denotes the proportion of observed 'Yes' answers through the randomization device.
Let x_i be a vector of explanatory variables and \beta a column vector of unknown
parameters, so that the probability for the i-th respondent is given by
\pi = e^{\beta'x_i}/(1 + e^{\beta'x_i}) .
Show that an estimate of \beta can be obtained by maximizing
the likelihood function
L = \prod_{i:y_i=1}\left[p_1 + (1-p_1-p_2)\frac{e^{\beta'x_i}}{1+e^{\beta'x_i}}\right]\prod_{i:y_i=0}\left[p_2 + \frac{(1-p_1-p_2)}{1+e^{\beta'x_i}}\right] .
Hint: Kerkvliet (1994).
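The likelihood above can be maximized numerically. The sketch below uses a crude grid search over a single scalar \beta with one covariate and no intercept; this simplification, and all parameter values, are assumptions made for illustration rather than part of the exercise:

```python
import math, random

def loglik(beta, data, p1, p2):
    # Log-likelihood of the randomized responses under the logistic RR model.
    ll = 0.0
    for x, y in data:
        pi = 1.0 / (1.0 + math.exp(-beta * x))
        pr = p1 + (1 - p1 - p2) * pi if y == 1 else p2 + (1 - p1 - p2) * (1 - pi)
        ll += math.log(pr)
    return ll

p1, p2, beta_true = 0.1, 0.1, 1.0
rng = random.Random(2)
data = []
for _ in range(5000):
    x = rng.uniform(-2.0, 2.0)
    pi = 1.0 / (1.0 + math.exp(-beta_true * x))
    y = 1 if rng.random() < p1 + (1 - p1 - p2) * pi else 0
    data.append((x, y))

grid = [i / 20.0 for i in range(-60, 61)]   # candidate beta values in [-3, 3]
beta_hat = max(grid, key=lambda b: loglik(b, data, p1, p2))
print(round(beta_hat, 1))  # close to beta_true = 1.0
```

In practice one would use a proper optimizer (Newton-Raphson or similar) on the full vector \beta; the grid search just makes the shape of the problem visible.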

Exercise 11.23. Consider that each interviewee in a given study is subjected to a
randomization procedure, say R_0, which is similar to the one used by Kuk (1990).
The proportions of red cards in the two decks are represented by \theta_1 and \theta_2. The
values of the probabilities \theta_1 and \theta_2 in a particular investigation are determined
randomly and depend on the joint distribution of \theta_1 and \theta_2. Assume \bar{r} denotes the
proportion of red cards reported by a random sample of n respondents. Then for the
given values of \theta_1 and \theta_2 we have
\hat{\pi}_s = \frac{\bar{r} - \theta_2^{*}}{\theta_1^{*} - \theta_2^{*}} , \theta_1^{*} \ne \theta_2^{*} ,
where \theta_1^{*} and \theta_2^{*} are fixed for the given study and their values depend upon the
joint distribution of \theta_1 and \theta_2.
( a ) Show that the estimator \hat{\pi}_s is unbiased for the population proportion, \pi.
( b ) The variance V(\hat{\pi}_s) of the estimator \hat{\pi}_s is given by
V(\hat{\pi}_s) = \frac{\pi(1-\pi)}{n} + \frac{1}{n}\int_a^b\int_c^d \frac{(1-\pi)\theta_2(1-\theta_2) + \pi\theta_1(1-\theta_1)}{(\theta_1-\theta_2)^2} f(\theta_1,\theta_2)\,d\theta_1 d\theta_2 ,
where f(\theta_1,\theta_2) denotes the joint p.d.f. of \theta_1 and \theta_2 and 0 \le a < \theta_1 < b \le 1,
0 \le c < \theta_2 < d \le 1.
Hint: Singh (2002c).

Exercise 11.24. Consider an unrelated question model with two non-sensitive
characters Y_1 and Y_2. Two independent SRSWR samples of sizes n_1
and n_2 are drawn. In each sample, the respondents answer a direct question about a
non-sensitive character; they also answer one of the two questions, about A or about a
non-sensitive character, selected with probabilities P and (1-P)
respectively through a randomization device. In other words, in the first sample
each selected respondent answers, through the randomization device, a question regarding
A and Y_1, while the question on Y_2 is answered without the randomization device. In the second
sample each respondent responds on A and Y_2 through the randomization device,
while they respond to the first neutral question Y_1 directly. Let \theta_{1r} and \theta_{2r} be the
probabilities of obtaining a 'Yes' answer through the randomization device in the
first and second samples. Also let \theta_{1d} and \theta_{2d} be the proportions of 'Yes' answers to the
direct questions obtained in the second and first samples on the neutral characters Y_1 and Y_2,
respectively.
( a ) Show that the following two estimators of \pi are possible:
\hat{\pi}_1 = \frac{\hat{\theta}_{1r} - (1-P)\hat{\theta}_{2d}}{P} and \hat{\pi}_2 = \frac{\hat{\theta}_{2r} - (1-P)\hat{\theta}_{1d}}{P} .
( b ) Consider a linear combination of the above two estimators as
\hat{\pi}_0 = \sum_{i=1}^{2} w_i\hat{\pi}_i
such that \sum_{i=1}^{2} w_i = 1. Show that \hat{\pi}_0 is unbiased and find the optimum weights such
that the variance of \hat{\pi}_0 is minimum.
Hint: Folsom, Greenberg, Horvitz, and Abernathy (1973).

Exercise 11.25. Consider that in the Warner (1965) model the persons belonging to the
sensitive group do not all report truthfully, but only a proportion \Delta of them are
honest. Assume that those who are not members of the group A are honest and
report truthfully. Show that the probability of a 'Yes' answer becomes
\theta = \pi\Delta(2P-1) + (1-P). Find the variance of the unbiased estimator of \pi defined as
\hat{\pi}_u = \frac{\hat{\theta} - (1-P)}{\Delta(2P-1)} , where P \ne 0.5 .
Hint: Bourke and Dalenius (1976).

Exercise 11.26. Show that an optimal estimator of the population total, Y, under
scrambled responses is given by
e(s,r) = a_s + \sum_{i\in s} b_{si} r_i ,
where r_i denotes the scrambled responses, and a_s and b_{si} have their usual meanings.
Hint: Arnab (2002).

Exercise 11.27. Consider an unrelated question model (or U model) for the
situation when \pi_y is known. Each sampled respondent is provided with a random
device consisting of two statements: ( i ) I belong to the sensitive group A; and ( ii ) I
belong to the non-sensitive group Y; represented with probabilities p_1 and (1-p_1),
respectively. The respondent selects randomly one of these two statements,
unobserved by the interviewer, and reports 'Yes' or 'No' with respect to his/her
actual status. The probability of a 'Yes' answer is \theta_1 = p_1\pi + (1-p_1)\pi_y. Thus, an
unbiased estimator of the population proportion \pi is given by
\hat{\pi}_G = \frac{\hat{\theta}_1 - (1-p_1)\pi_y}{p_1} .
Consider a two-stage randomized response unrelated question model in which each
interviewee is provided with two randomization devices R_1 and R_2. The
randomization device R_1 consists of two statements, namely: ( i ) I belong to the
sensitive group A; and ( ii ) Go to randomization device R_2; represented with
probabilities T and (1-T), respectively. The randomization device R_2 is the same
as used in the U model, with statements represented with probabilities p_2 and (1-p_2), respectively.
The probability of a 'Yes' answer is given by \theta_2 = T\pi + (1-T)[p_2\pi + (1-p_2)\pi_y], and
an unbiased estimator of \pi is given by
\hat{\pi}_m = \frac{\hat{\theta}_2 - (1-p_2)(1-T)\pi_y}{T + p_2(1-T)} .
( a ) Find the variances of \hat{\pi}_G and \hat{\pi}_m.
( b ) Show that at an equal level of protection of the respondents, V(\hat{\pi}_G) = V(\hat{\pi}_m).
Hint: Mangat (1992), Mangat, Singh, and Singh (1992), Bhargava and Singh
(2001).
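Part ( a ) follows from the binomial variance of the observed 'Yes' proportions: V(\hat{\pi}_G) = \theta_1(1-\theta_1)/(n p_1^2) and V(\hat{\pi}_m) = \theta_2(1-\theta_2)/\{n(T + p_2(1-T))^2\}. A quick numerical comparison, with all parameter values chosen purely for illustration:

```python
def var_U(pi, piy, p1, n):
    # Unrelated-question (U) model: theta1 = p1*pi + (1-p1)*piy.
    th1 = p1 * pi + (1 - p1) * piy
    return th1 * (1 - th1) / (n * p1 ** 2)

def var_two_stage(pi, piy, p2, T, n):
    # Two-stage model: theta2 = T*pi + (1-T)*(p2*pi + (1-p2)*piy).
    th2 = T * pi + (1 - T) * (p2 * pi + (1 - p2) * piy)
    return th2 * (1 - th2) / (n * (T + p2 * (1 - T)) ** 2)

pi, piy, n = 0.2, 0.4, 1000
vU = var_U(pi, piy, 0.7, n)
v2 = var_two_stage(pi, piy, 0.7, 0.3, n)
print(vU > v2)  # → True: the two-stage device wins at these settings
```

Note that the two devices offer different levels of privacy protection at these settings; part ( b ) asks for the comparison when the protection levels are equalized.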

Exercise 11.28. Let X denote the response to the first sensitive question (e.g.,
income) and Y denote the response to the second sensitive question (e.g.,
expenditure). Further assume S_1 and S_2 are two scrambling random variables,
each independent of X and Y and having finite means and variances. For simplicity
also assume that X \ge 0, Y \ge 0, S_1 > 0 and S_2 > 0.
Consider the following two cases:
( a ) The respondent generates S_1 using some specified method, while S_2 is
generated by using the linear relation S_2 = \alpha S_1 + \beta, where \alpha and \beta are known
constants; therefore S_1 and S_2 are dependent.
( b ) S_1 and S_2 are random variables following known distributions. The particular
values of S_1 and S_2 to be used by any respondent are obtained from two separate
random devices. This way S_1 and S_2 become independent.
The interviewee multiplies his response X to the first sensitive question by S_1 and
the response Y to the second sensitive question by S_2. The interviewer thus
receives two scrambled answers Z_1 = XS_1 and Z_2 = YS_2. The particular values of
S_1 and S_2 are not known to the interviewer, but their joint distribution is known.
In this way the respondent's privacy is not violated. Let E(S_1) = \theta_1, E(S_2) = \theta_2,
V(S_1) = \gamma_{20}, V(S_2) = \gamma_{02}, Cov(S_1,S_2) = \gamma_{11}, E(X) = \mu_1, E(Y) = \mu_2, V(X) = \sigma_x^2 = m_{20},
V(Y) = \sigma_y^2 = m_{02}, \gamma_{rs} = E[S_1-\theta_1]^r[S_2-\theta_2]^s and m_{rs} = E[X-\mu_1]^r[Y-\mu_2]^s, where \theta_1,
\theta_2, \gamma_{20}, \gamma_{02}, \gamma_{11}, and \gamma_{rs} are known to the interviewer but \mu_1, \mu_2, \sigma_x^2, \sigma_y^2 and
m_{rs} are unknown. Also let \sigma_{z_1}^2, \sigma_{z_2}^2 and \sigma_{z_1z_2} denote the variances and covariance
of Z_1 and Z_2, respectively.
( a ) Show that the correlation coefficient between the two sensitive variables X and
Y is then given by
\rho_{xy} = \frac{(\sigma_{z_1z_2} - \gamma_{11}\mu_1\mu_2)\sqrt{\gamma_{20}+\theta_1^2}\sqrt{\gamma_{02}+\theta_2^2}}{(\gamma_{11}+\theta_1\theta_2)\sqrt{\sigma_{z_1}^2-\gamma_{20}\mu_1^2}\sqrt{\sigma_{z_2}^2-\gamma_{02}\mu_2^2}} .
( b ) Develop an estimator of the correlation coefficient using scrambled responses.
Hint: Singh (1991b), Bellhouse (1995).
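The formula in ( a ) can be turned into a moment estimator by replacing \sigma_{z_1z_2}, \sigma_{z_1}^2, \sigma_{z_2}^2 with sample moments and \mu_1, \mu_2 with \bar{z}_1/\theta_1, \bar{z}_2/\theta_2. A simulation sketch for the independent case ( b ), where \gamma_{11} = 0; all distributional choices here are illustrative assumptions:

```python
import random

rng = random.Random(9)
n = 50000
t1, t2 = 1.0, 2.0                                     # known means of S1, S2
g20, g02, g11 = 0.4 ** 2 / 12, 0.8 ** 2 / 12, 0.0     # known (co)variances; S's independent
z1, z2 = [], []
for _ in range(n):
    x = rng.gauss(50.0, 10.0)
    y = 50.0 + 0.8 * (x - 50.0) + rng.gauss(0.0, 6.0)  # corr(X, Y) = 0.8 by construction
    z1.append(x * rng.uniform(0.8, 1.2))               # Z1 = X * S1
    z2.append(y * rng.uniform(1.6, 2.4))               # Z2 = Y * S2
mz1, mz2 = sum(z1) / n, sum(z2) / n
v1 = sum((a - mz1) ** 2 for a in z1) / n
v2 = sum((b - mz2) ** 2 for b in z2) / n
c12 = sum((a - mz1) * (b - mz2) for a, b in zip(z1, z2)) / n
mu1, mu2 = mz1 / t1, mz2 / t2        # means of X, Y recovered via E(Z_i) = theta_i * mu_i
num = (c12 - g11 * mu1 * mu2) * ((g20 + t1 ** 2) * (g02 + t2 ** 2)) ** 0.5
den = (g11 + t1 * t2) * ((v1 - g20 * mu1 ** 2) * (v2 - g02 * mu2 ** 2)) ** 0.5
rho_hat = num / den
print(round(rho_hat, 1))  # close to the true correlation 0.8
```
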

Exercise 11.29. A randomization device (say, a deck of cards) consists of three types
of cards bearing the statements: ( i ) I belong to group A; ( ii ) I belong to group Y;
and ( iii ) Draw one more card. The statements are represented in proportions P,
P_1, and P_2, respectively. Note that the characters A and Y are uncorrelated. The
respondent is required to draw one card randomly from the above deck and give an
answer in terms of 'Yes' or 'No' according to his/her actual status if statement
( i ) or ( ii ) is drawn. However, if statement ( iii ) is drawn, the respondent is
required to repeat the above process without replacing that card. If statement
( iii ) is drawn in the second phase, the respondent is directed to report 'No'. If m
is the total number of cards in the deck, then the probability of a 'Yes' answer is
\theta = [\pi P + P_1\pi_y][1 + P_2 m/(m-1)] .
Construct an unbiased estimator of \pi and study its properties for different values
of the parameters involved in it.
Hint: Singh, Singh, Mangat, and Tracy (1994).

Exercise 11.30. Consider a procedure in which the randomized response (RR)
device and the method of sampling remain the same as in the Warner (1965) model.
However, it differs in the sense that the respondent is free to give an answer in terms of
'Yes' or 'No' either by using the RR device or without using it, without revealing to the
interviewer which mode has been followed for giving the answer. If T is the
probability that a respondent gives an answer without using the RR device, then the
probability of a 'Yes' answer is given by
\theta_1 = \pi T + (1-T)\theta ,
where \theta = P\pi + (1-P)(1-\pi). If n_1 is the number of 'Yes' answers in an SRSWR
sample of n respondents, then study the bias and mean square error of the
estimator of \pi defined as
\hat{\pi}_{ms} = \frac{n_1}{n} .
Hint: Mangat and Singh (1994).

Exercise 11.31. Consider a sensitive variable X which is supposed to be
continuous with true density g(\cdot), and let Y be an unrelated variable with density h(\cdot).
Consider the problem of estimation of the population mean \mu_x of the sensitive variable
X using the known mean \mu_y of the unrelated character. Each respondent selected
in the sample is provided with two randomization devices. The first randomization
device consists of a deck of cards having two types of statements: ( a ) Report the
value of the sensitive variable X; ( b ) Go to the second randomization device; with
probabilities W and (1-W), respectively. In the second randomization device, each
interviewee reports his/her X value with probability P and Y value with
probability Q = (1-P). Obviously, the observed randomized response, say Z, has
the density function
f(z) = [P + W(1-P)]g(z) + Q(1-W)h(z) ,
and the population mean of the randomized response Z is given by
\mu_z = \mu_x[P + W(1-P)] + Q(1-W)\mu_y .
Construct an estimator of \mu_x and study its properties.
Hint: Singh, Mangat, and Singh (1994).

Exercise 11.32. Consider a randomization device consisting of three types of cards
bearing the statements: ( i ) I belong to the sensitive category A; ( ii ) I belong to category
Y; and ( iii ) blank cards; with probabilities P_1, P_2 and P_3 such that \sum_{i=1}^{3}P_i = 1.
In case a blank card is drawn by the respondent, he/she will report 'No'. The
rest of the procedure remains as usual. The probability of a 'Yes' answer will now be
\theta_1 = P_1\pi + P_2\pi_y .
Study the properties of the estimator of the proportion \pi of the sensitive attribute
given by
\hat{\pi}_1 = \{\hat{\theta}_1 - P_2\pi_y\}/P_1 ,
where \hat{\theta}_1 is the proportion of 'Yes' answers obtained from the n sampled
respondents.
Hint: Singh, Horn, Singh, and Mangat (2003).

Exercise 11.33. Consider the problem of estimation of the population proportion \pi of
individuals possessing a certain sensitive characteristic. We select an SRSWR
sample of n respondents. Each sampled person is provided with a random device
which produces the integers 1, 2, ..., L with probabilities p_1, p_2, ..., p_L respectively. For
example, the device might be a deck of M cards with exactly Mp_j of those cards
showing the integer j, j = 1, 2, ..., L, or perhaps the device is a die with L faces,
each one showing one of the integers 1, 2, ..., L with probabilities p_1, p_2, ..., p_L
respectively. Using the random device in the absence of the interviewer, the
individual produces one of these numbers, and he/she reports how far away this
number is from L+1 if he/she has the characteristic, or from 0 if he/she does not
have it. Let x_i take the value L+1 if the i-th individual possesses the characteristic
and the value 0 if not. Obviously, P(x_i = L+1) = \pi and P(x_i = 0) = 1-\pi. Let y_i be
the integer produced by the i-th individual using the random device. Obviously, the
reported random number is d_i = |x_i - y_i| such that P(d_i = k) = (1-\pi)p_k + \pi p_{L+1-k},
k = 1, 2, ..., L. Study the bias and variance of the estimator of \pi given by
\hat{\pi}_{cc} = \frac{\bar{d} - E(y)}{L + 1 - 2E(y)} ,
where E(y) = \sum_{k=1}^{L}kp_k and \bar{d} = n^{-1}\sum_{i=1}^{n}d_i .
Hint: Christofides (2003), Chaudhuri (2001).
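Since E(d_i) = E(y) + \pi\{L+1-2E(y)\}, the estimator \hat{\pi}_{cc} is unbiased whenever L+1 \ne 2E(y). A simulation sketch of the device, with an illustrative (skewed) distribution p = (0.3, 0.25, 0.2, 0.15, 0.1) so that E(y) = 2.5:

```python
import random

L = 5
p = [0.3, 0.25, 0.2, 0.15, 0.1]                     # P(device shows k), k = 1..L
Ey = sum(k * pk for k, pk in enumerate(p, start=1))  # E(y) = 2.5 here

def report(has_attr, rng):
    # The respondent draws k from the device and reports only the distance |x - k|.
    u = rng.random()
    for j, pk in enumerate(p, start=1):
        u -= pk
        if u < 0:
            k = j
            break
    else:
        k = L                                        # guard against float round-off
    x = L + 1 if has_attr else 0
    return abs(x - k)

pi, n = 0.35, 100000
rng = random.Random(4)
dbar = sum(report(rng.random() < pi, rng) for _ in range(n)) / n
pi_hat = (dbar - Ey) / (L + 1 - 2 * Ey)
print(round(pi_hat, 2))  # close to the true pi = 0.35
```

A symmetric device (E(y) = (L+1)/2) would make the denominator vanish, which is why a skewed p is needed.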

Practical 11.1. Michael visited libraries of certain developing countries, such as
India, China, etc., and found that about 20 years ago most of the top-class scientific
journals were available on the shelves of the libraries of the universities of those
countries, but he felt that the availability of top-class scientific journals on the
shelves of libraries of developing countries is decreasing, and there may be
some hidden reason behind it. One obvious reason may be that the editors of the
journals may not be reviewing papers honestly, and scientists from developing
countries may have lost interest in the top-class journals. Michael decided to do a
survey to estimate the proportion of editors and associate editors of scientific
journals who review articles honestly. Michael selected 5000 editors and associate
editors and asked the question using a randomization device producing
80% statements, 'Are you reviewing every article honestly?' along with 20%
statements, 'Are you not reviewing every article honestly?' Out of the 5000 selected
persons, 3000 reported 'Yes' through the above randomization device. Find
Michael's estimate of the proportion of editors reviewing articles honestly. Also
construct a 95% confidence interval estimate. Comment!
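This is a direct application of Warner's model with \hat{\theta} = 3000/5000 = 0.6 and P = 0.8, so \hat{\pi} = (\hat{\theta} - 0.2)/0.6. A sketch of the computation, using the usual normal-approximation confidence interval:

```python
n, yes, p = 5000, 3000, 0.8
theta_hat = yes / n                                   # observed 'Yes' proportion = 0.6
pi_hat = (theta_hat - (1 - p)) / (2 * p - 1)          # Warner's estimator
se = (theta_hat * (1 - theta_hat) / n) ** 0.5 / (2 * p - 1)
lo, hi = pi_hat - 1.96 * se, pi_hat + 1.96 * se
print(round(pi_hat, 3), round(lo, 3), round(hi, 3))   # → 0.667 0.644 0.689
```

So roughly two-thirds of the editors claim to review honestly, with a 95% interval of about (0.644, 0.689), which is the point the "Comment!" invites discussion of.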

Practical 11.2. Assume the true proportion of the extramarital relations in the world
is 0.3. Suppose we used a randomization device to collect information from a large
sample of persons using a randomization device with two types of statements:
Chapter 11: Randomized response sampling: Tools for social surveys 973

( i ) 'Are you having an extramarital relation?' with probability 0.80;
( ii ) 'Were you born in spring?' with probability 0.20.
Assuming that the proportion of persons born in spring is 0.4, find the relative
efficiencies of the three different techniques of Mahmood, Singh, and Horn (1998)
discussed in this chapter with respect to the Greenberg, Abul-Ela, Simmons, and
Horvitz (1969) technique.

Practical 11.3. Ms. Poonam Singh wishes to estimate the proportion of persons
having extramarital relations in the world. Suppose she selected an SRSWR sample
of 10,000 persons across the world, and took responses through a randomization
device producing 80% statements, 'Are you having an extramarital relation?' along
with 20% statements, 'Are you having no extramarital relation?' Out of the 10,000
selected persons, 3,000 reported 'Yes' through the above randomization device.
Find her estimate of the proportion of extramarital relations in the world. Also
construct a 95% confidence interval estimate.

Practical 11.4. A social society of a particular community wishes to study the
virginity status of couples before their marriage. In certain communities, the status
of virginity before marriage is an important issue. A researcher selected an
SRSWR sample of n couples (each made of one male and one female).

( I ) Each male in the sample was asked to respond to one of the two outcomes
using randomization device R1 as follows:

Q1: 'Were you virgin before you got married?' with probability P1;
Q2: 'Were you born during the first 6 months of the year?' with probability (1 − P1).

If π_m denotes the true proportion of virgin males before marriage, and φ_m is the
probability of males born during the first 6 months of the year, then the probability
of a 'Yes' answer for a male in the couple is given by
θ_m = π_m P1 + (1 − P1)φ_m.

( II ) Each female in the sample was asked to respond to one of the two outcomes
using randomization device R2 as follows:

Q1: 'Were you virgin before you got married?' with probability P2;
Q2: 'Were you born during the first 6 months of the year?' with probability (1 − P2).

If π_f denotes the true proportion of virgin females before marriage, and φ_f is the
probability of females born during the first 6 months of the year, then the probability
of a 'Yes' answer for a female in the couple is given by
θ_f = π_f P2 + (1 − P2)φ_f.
974 Advanced sampling theory with applications

( III ) Assume φ_m = φ_f = φ (say), that is, the proportions of males and females born
during the first six months of a year are the same. Then every couple was asked the
following question directly:
'Was either one (not both) of you born during the first 6 months of the year?'
Obviously the probability of a 'Yes' answer from a couple is given by
θ_c = φ(1 − φ) + (1 − φ)φ = 2φ(1 − φ).

( IV ) Using information from ( I ) to ( III ), estimate the proportion of the males and
females who were virgin before they were married. Also obtain a pooled estimate.
Given: n = 5000, n_m = 2000, n_f = 4000, n_c = 2500 and P1 = P2 = 0.80, where n_m,
n_f, and n_c denote the numbers of observed 'Yes' responses from males, females,
and couples, respectively.
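A sketch of one possible method-of-moments solution for part ( IV ), under the setup of parts ( I ) to ( III ) (the solution route, the choice of root for φ, and the simple pooling rule are ours, not the book's):

```python
import math

# Given counts of 'Yes' responses (see part IV)
n, n_m, n_f, n_c = 5000, 2000, 4000, 2500
P1 = P2 = 0.80

theta_m, theta_f, theta_c = n_m / n, n_f / n, n_c / n

# Part (III): theta_c = 2*phi*(1 - phi); take the root with phi <= 0.5
phi = (1 - math.sqrt(max(0.0, 1 - 2 * theta_c))) / 2

# Parts (I)-(II): invert theta = pi*P + (1 - P)*phi for each sex
pi_m = (theta_m - (1 - P1) * phi) / P1
pi_f = (theta_f - (1 - P2) * phi) / P2

# One simple pooled estimate: the average (equal sample sizes for both sexes)
pi_pooled = (pi_m + pi_f) / 2
print(phi, pi_m, pi_f, pi_pooled)
```

With the given counts this route yields φ̂ = 0.5, π̂_m = 0.375 and π̂_f = 0.875.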

Practical 11.5. Michael visited hospitals and found that AIDS is a very serious
problem these days. He selected an SRSWOR sample of 70,000,000 persons
across the world and each respondent was asked to use the following two
randomization devices in sequence.
The randomization device R1 consists of two statements, viz.:
Statement ( i ): 'Are you suffering from AIDS?' with probability 0.8;
Statement ( ii ): 'Use the second randomization device, R2' with probability 0.2.
The second randomization device R2 consists of the following two statements:
Statement ( i ): 'Are you suffering from AIDS?' with probability 0.7;
Statement ( ii ): 'Are you not suffering from AIDS?' with probability 0.3.
Out of the 70,000,000 sampled persons he received 10,360,000 'Yes'
answers.
( a ) Make a flow chart of the randomization device.
( b ) Estimate the proportion of AIDS patients, and derive 95% confidence interval
estimate.
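A sketch of the estimation step for part ( b ), assuming the overall 'Yes' probability follows from the two-stage device as P(Yes) = 0.8π + 0.2[0.7π + 0.3(1 − π)] = 0.88π + 0.06 (our own algebra; the normal-approximation interval below ignores the finite population correction):

```python
import math

n, n_yes = 70_000_000, 10_360_000
theta_hat = n_yes / n                 # observed 'Yes' proportion

# Invert P(Yes) = 0.88*pi + 0.06
pi_hat = (theta_hat - 0.06) / 0.88

# Normal-approximation 95% interval
se = math.sqrt(theta_hat * (1 - theta_hat) / n) / 0.88
print(pi_hat, (pi_hat - 1.96 * se, pi_hat + 1.96 * se))
```

The point estimate comes out to about 10% of persons suffering from AIDS under these assumptions.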
12. NON-RESPONSE AND ITS TREATMENTS

Incompleteness or non-response in the form of absence, censoring, or grouping is a


troubling issue of many data sets. Statisticians have recognized for some time that
failure to account for the stochastic nature of incompleteness or non-response can
spoil the nature of data. There are several factors which affect the non-response rate
in any particular inquiry. Some of these factors are the type of information being
collected, the official status of the surveying agency, the extent of publicity, the
legal obligations of the respondents, the time of visit by the enumerator, and the length
of the schedule, etc. Hansen and Hurwitz (1946) were the first to deal with the
problem of incomplete samples in mail surveys. It is a well known fact that mail
surveys or telephone surveys are most commonly used by most of the bureaucratic
or business organisations because of their low cost. Rubin (1976) has defined three
key concepts: Missing at random (MAR), Observed at random (OAR), and
parameter distinctness (PD).

Missing at random (MAR): The data are MAR if the probability of the observed
absence pattern given the observed and unobserved data, does not depend on the
values of the unobserved data. Under this heading we shall put all cases where the data are
missing only owing to chance factors. It will therefore include cases where the
enumerator is not able to contact the respondents only by chance and, had contact
been made, the data would have been collected. For example, when the
information is kept on punched cards the non-response owed to the accidental loss
of one or more cards is of this category. This type of non-response is called random
non-response.

Observed at random (OAR): The data are OAR if for every possible value of the
missing data the probability of the observed absence pattern, given the observed
and unobserved data, does not depend on the values of the observed data.

Parameter distinctness (PD): PD holds if there are no a priori ties between the
parameters of the absence model and those of the data model. In other words, we
can possibly classify the non-response with respect to its nature into two broad
categories.

Deliberate non-response (DNR): If the respondents are not willing to reveal their
response, such cases of non-response will be classified as deliberate non-response.
For example, the non-response in surveys in which information is being collected
on personal income or on certain personal habits such as drinking, gambling, etc.,
will come under this category.

S. Singh, Advanced Sampling Theory with Applications. © Kluwer Academic Publishers 2003

Unit and total non-response: Kalton and Kasprzyk (1986) say that it is a
common practice to distinguish between total (and unit) non-response, when none
of the survey responses are available for a sampled element, and item non-response,
when some but not all of the responses are available. Total non-response arises
because of refusals, inability to participate, not-at-homes, and untraced elements.
Item non-response arises because of item refusals, 'do not know' answers, omissions, and
answers deleted in editing. For more details about the forms of non-response the
reader is referred to Rubin (1978). We will discuss here a few basic models
followed by their recent developments.

Hansen and Hurwitz (1946) proposed a model for mail survey designs to provide
unbiased estimators of the population mean or total. Their pioneer model consists of the
following steps:
( a ) Select a sample of respondents and mail a questionnaire to all of them;
( b ) After the deadline is over, identify the non-respondents and select a sub-
sample of the non-respondents;
( c ) Collect data from the non-respondents in the sub-sample by interview;
( d ) Combine the data from the two parts of the survey to estimate the population
parameters of interest.

Consider a population consisting of N units that can be divided into two classes:
( a ) those who will respond at the first attempt, forming the response class;
( b ) those who will not respond at the first attempt, forming the non-response class.

Assume that N_1 and N_2 are the numbers of units in the population that belong to the
response and non-response class, respectively. We may regard the sample of n_1
respondents as a simple random sample from the response class and the sample of
n_2 as a simple random sample from the non-response class. Let h_2 denote the size
of the sub-sample from the n_2 non-respondents to be interviewed, so that n_2 = h_2 g,
where g > 1.

Then we have the following lemma:

Lemma 12.1.1. The unbiased estimators of N_1 and N_2 are, respectively,
N̂_1 = (n_1/n)N and N̂_2 = (n_2/n)N.

We have the following theorems:


Theorem 12.1.1. An unbiased estimator of the population mean Ȳ is given by
ȳ_hh = (n_1ȳ_1 + n_2ȳ_{h_2})/n,   (12.1.1)
where ȳ_{h_2} denotes the mean of the h_2 observations from the sub-sample of non-responding units.
Proof. We have
E(ȳ_hh) = E_1[E_2(ȳ_hh | n_1, n_2)] = E_1[(n_1ȳ_1 + n_2ȳ_2)/n] = E_1(ȳ_n) = Ȳ.
Hence the theorem.

Theorem 12.1.2. The variance of the estimator ȳ_hh is given by
V(ȳ_hh) = (1/n − 1/N)S_y² + ((g − 1)/n)(N_2/N)S_{2y}²,   (12.1.2)
where S_{2y}² is the population mean square error for the non-response class.
Proof. By definition we have
V(ȳ_hh) = V_1[E_2(ȳ_hh | n_1, n_2)] + E_1[V_2(ȳ_hh | n_1, n_2)]
= V_1(ȳ_n) + E_1[(n_2/n)² V_2(ȳ_{h_2})] = V_1(ȳ_n) + E_1[(n_2/n)²(1/h_2 − 1/n_2)s_{2y}²]
= V_1(ȳ_n) + (1/n²)E_1[(g − 1)n_2 s_{2y}²]
= V_1(ȳ_n) + ((g − 1)/n²)E_1(n_2)E_1(s_{2y}²), because g is constant,
= (1/n − 1/N)S_y² + ((g − 1)/n)(N_2/N)S_{2y}².

Hence the theorem.

Now we have the following corollary :

Corollary 12.1.1. The second term in the expression of V(ȳ_hh) will vanish if g = 1.
In fact, this happens if it is possible to interview every non-respondent to collect
information.

The cost function for this model has been found to be made of three components:

( a ) Cost of including a sample unit in the initial survey = C_0 (say);
( b ) Cost of collecting, editing, and processing per unit in the response class = C_1 (say);
( c ) Cost of interviewing and processing information per unit in the non-response class = C_2 (say).
Thus it is reasonable to consider the cost function given by
C* = nC_0 + n_1C_1 + h_2C_2.   (12.1.3)
Note that C* varies from sample to sample, thus it is recommended to use the
expected cost function. Taking expected values on both sides of (12.1.3), we have
E(C*) = E[nC_0 + n_1C_1 + h_2C_2] = n[C_0 + E(n_1/n)C_1 + (1/g)E(n_2/n)C_2]
= n[C_0 + (N_1/N)C_1 + (N_2/(gN))C_2]
or
C = (n/N)[NC_0 + N_1C_1 + (N_2/g)C_2].   (12.1.4)

Then we have the following theorem:

Theorem 12.1.3. The optimum values of g and n for the minimum expected cost
are, respectively, given by
g = √[ C_2{S_y² − (N_2/N)S_{2y}²} / {S_{2y}²(C_0 + (N_1/N)C_1)} ]   (12.1.5)
and
n = {S_y² + (N_2/N)(g − 1)S_{2y}²} / {V_0 + S_y²/N}.   (12.1.6)
Proof. Let the variance V(ȳ_hh) be fixed as V_0, that is, V(ȳ_hh) = V_0. Then the
Lagrange function is given by
L = (n/N)[NC_0 + N_1C_1 + (N_2/g)C_2] + λ[V(ȳ_hh) − V_0],   (12.1.7)
where λ is a Lagrange multiplier. Differentiating (12.1.7) with respect to n and
equating to zero we have
∂L/∂n = (1/N)[NC_0 + N_1C_1 + (N_2/g)C_2] − (λ/n²){S_y² + (g − 1)(N_2/N)S_{2y}²} = 0,
which implies
n² = λN{S_y² + (g − 1)(N_2/N)S_{2y}²} / {NC_0 + N_1C_1 + (N_2/g)C_2}   (12.1.8)
or
n = √λ √[ N{S_y² + (g − 1)(N_2/N)S_{2y}²} / {NC_0 + N_1C_1 + (N_2/g)C_2} ].   (12.1.9)

Note that
V(ȳ_hh) = (1/n − 1/N)S_y² + ((g − 1)/n)(N_2/N)S_{2y}² = V_0.
On using (12.1.9) we have
√λ = √[{NC_0 + N_1C_1 + (N_2/g)C_2}/N] √{S_y² + (g − 1)(N_2/N)S_{2y}²} / {V_0 + S_y²/N}.   (12.1.10)
On substituting (12.1.10) in (12.1.9) we have
n = {S_y² + (g − 1)(N_2/N)S_{2y}²} / {V_0 + S_y²/N},   (12.1.11)
which is the required optimum value of the sample size n. Now differentiating (12.1.7)
with respect to g and equating to zero we have
∂L/∂g = −nN_2C_2/(Ng²) + λ(N_2/(nN))S_{2y}² = 0,   (12.1.12)
which implies
g² = n²C_2/(λS_{2y}²).   (12.1.13)
Using (12.1.8) in (12.1.13) we have
g² = C_2N{S_y² + (g − 1)(N_2/N)S_{2y}²} / [S_{2y}²{NC_0 + N_1C_1 + (N_2/g)C_2}],
which on simplification yields
g = √[ C_2{S_y² − (N_2/N)S_{2y}²} / {S_{2y}²(C_0 + (N_1/N)C_1)} ],   (12.1.14)
which is the required optimum value of g. Hence the theorem.
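The optimum g and n can be computed and checked numerically. The sketch below uses made-up inputs (the population sizes, variances, unit costs, and target variance V_0 are all illustrative, not from the text) and verifies that the resulting design attains the target variance exactly:

```python
import math

# Illustrative inputs (all values assumed, not from the text)
N, N1, N2 = 1000, 700, 300
S2 = 250.0          # S_y^2, overall population mean square
S2_2 = 150.0        # S_{2y}^2, non-response class mean square
C0, C1, C2 = 1.0, 2.0, 8.0
V0 = 2.0            # target variance

W1, W2 = N1 / N, N2 / N

# Equations (12.1.5) and (12.1.6)
g = math.sqrt(C2 * (S2 - W2 * S2_2) / (S2_2 * (C0 + W1 * C1)))
n = (S2 + W2 * (g - 1) * S2_2) / (V0 + S2 / N)

# Check: the Hansen-Hurwitz variance (12.1.2) evaluated at (n, g) hits V0
V = (1 / n - 1 / N) * S2 + (g - 1) * W2 * S2_2 / n
print(g, n, V)
```

The final check holds for any admissible inputs, since (12.1.6) is exactly the sample size that makes the variance (12.1.2) equal to V_0.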

Example 12.1.1. Consider a city consisting of 1000 persons. We wish to estimate the
average income of the persons living in the city. We selected an SRSWOR sample
of 100 persons and mailed a questionnaire to each of them regarding their annual
income at the beginning of a particular month. Out of 100 questionnaires mailed we
received a reply from 70 people. The average annual income from the 70 responses
was $35,000. At the end of the month we selected 10 people out of 30 who did not
respond through the mail survey and contacted them through personal interviews
for collecting information regarding their income. The average income obtained
through personal interview survey was $38,000.

( a ) If the questionnaire had been mailed to all the 1000 persons in the city, then
estimate the number of persons who would be expected to respond to it as well as
the number who would not.
( b ) Apply the Hansen and Hurwitz technique to estimate the average income of the
persons living in the particular city.

Solution. ( a ) Here N = 1000 is the total number of persons living in the particular
city. Let N_1 and N_2 be the unknown numbers of persons belonging to the
responding and non-responding classes, respectively. Then an estimator of N_1 is
N̂_1 = (n_1/n)N = (70/100) × 1000 = 700, and that of N_2 is N̂_2 = (n_2/n)N = (30/100) × 1000 = 300.

( b ) According to the Hansen and Hurwitz model, an estimate of the average annual
income of the persons in the city is given by
ȳ_hh = (n_1ȳ_1 + n_2ȳ_{h_2})/n = (70 × 35000 + 30 × 38000)/100 = $35,900.
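The same computation as a short script (a minimal sketch; the variable names are ours):

```python
# Hansen-Hurwitz estimate for Example 12.1.1
N, n = 1000, 100
n1, n2 = 70, 30
ybar1 = 35_000.0      # mean income of the n1 mail respondents
ybar_h2 = 38_000.0    # mean income of the h2 = 10 interviewed non-respondents

N1_hat = n1 / n * N   # estimated size of the response class
N2_hat = n2 / n * N   # estimated size of the non-response class
ybar_hh = (n1 * ybar1 + n2 * ybar_h2) / n
print(N1_hat, N2_hat, ybar_hh)
```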
Several other research workers, including Dalenius (1955), Hendricks (1949), Kish
and Hess (1959), Bartholomew (1961), Rao (1968b), and Srinath (1971), discussed
different strategies to improve the efficiency of the pioneer model. Deming (1953)
has shown that successive recalls are helpful to increase the efficiency in mail-survey
designs. Elliott, Little, and Lewitzky (2000) have shown that sub-sampling
call-backs seems to be cost-effective; however, their analysis suggests that
subsampling should begin at the telephone rather than the face to face interviews.
Okafor and Lee (2000) applied the Hansen and Hurwitz (1946) technique to ratio and
regression estimators. Politz and Simmons (1950) developed a model to reduce the
bias due to an incomplete sample, without successive call-backs. Their technique is
popular whenever call-backs are not feasible owing to shortage of time or money;
however, its efficiency depends upon the circumstances.

Following the Politz and Simmons (1950) model, the interviewer makes only one call
during a specific time (such as morning) on six weekdays. The time of calls has been
considered as random within interviewing hours. If the respondent is at home the
desired information is collected and he is asked how many times in the preceding
five days he was at home at the time of the visit. The information so obtained is
used to estimate the probability of the respondent's availability. Assume n
households have been selected by SRSWR and ψ_i denotes the probability that the i-th
household is available at the time of the first ring. Then we have the following
theorem:

Theorem 12.2.1. An estimator of the population mean Ȳ is given by
ȳ_sp = (1/n) Σ_{i=1}^{n} y_i/ψ_i.   (12.2.1)

Assuming ψ̂_i = (d + 1)/D, where d denotes the number of times the respondent was
at home during the last D days (or hours, or months, etc.), d = 0, 1, 2, ..., D − 1, show
that the bias in the estimator ȳ_sp is given by
B(ȳ_sp) = −(1/N) Σ_{i=1}^{N} Y_i(1 − ψ_i)^D   (12.2.2)
and the variance by
V(ȳ_sp) = (1/n)[ (D/N) Σ_{i=1}^{N} a_iY_i²/ψ_i − {(1/N) Σ_{i=1}^{N} Y_i(1 − (1 − ψ_i)^D)}² ].   (12.2.3)

Proof. Let ψ_{id}, i = 1, 2, ..., n and d = 0, 1, 2, ..., D − 1, denote the probability that the i-th
respondent will be at home d times out of the D − 1 attempts or calls fixed by the
investigator for collecting information. Assuming ψ_i remains the same for every day,
the probability that the i-th person will be available on d days out of the D − 1 calls
is given by the binomial distribution
P(i = d) = C(D−1, d) ψ_i^d (1 − ψ_i)^{D−1−d},   (12.2.4)
where C(a, b) denotes the binomial coefficient. Under this distribution we have
E(y_i/ψ̂_i | i) = Σ_{d=0}^{D−1} {Y_iD/(d + 1)} C(D−1, d) ψ_i^d (1 − ψ_i)^{D−1−d}
= (Y_i/ψ_i) Σ_{d=0}^{D−1} C(D, d+1) ψ_i^{d+1} (1 − ψ_i)^{D−(1+d)}.
Note that
Σ_{d=0}^{D−1} C(D, d+1) ψ_i^{d+1} (1 − ψ_i)^{D−(1+d)} + (1 − ψ_i)^D = 1,
which implies that
E(y_i/ψ̂_i | i) = (Y_i/ψ_i)[1 − (1 − ψ_i)^D].
Note that the probability of the i-th person being selected and found at home is
ψ_i/N; thus we have
E(ȳ_sp) = E_1E_2(ȳ_sp) = (1/n) Σ_{i=1}^{n} Σ_{t=1}^{N} (ψ_t/N)(Y_t/ψ_t)[1 − (1 − ψ_t)^D]
= Ȳ − (1/N) Σ_{i=1}^{N} Y_i(1 − ψ_i)^D.
Hence the bias in ȳ_sp is given by
B(ȳ_sp) = −(1/N) Σ_{i=1}^{N} Y_i(1 − ψ_i)^D,
which proves the first part of the theorem.

To find the variance of the estimator we have
E[(y_i/ψ̂_i)² | i] = Σ_{d=0}^{D−1} {Y_iD/(d + 1)}² C(D−1, d) ψ_i^d (1 − ψ_i)^{D−1−d}
= (Y_i²D/ψ_i) Σ_{d=0}^{D−1} {1/(d + 1)} C(D, d+1) ψ_i^{d+1} (1 − ψ_i)^{D−(1+d)} = Y_i²Da_i/ψ_i,
where
a_i = Σ_{d=0}^{D−1} {1/(d + 1)} C(D, d+1) ψ_i^{d+1} (1 − ψ_i)^{D−(1+d)},
which implies
E[(y_i/ψ̂_i)²] = (1/N) Σ_{i=1}^{N} Y_i²Da_i/ψ_i = (D/N) Σ_{i=1}^{N} a_iY_i²/ψ_i.
Thus the variance of the estimator ȳ_sp is given by
V(ȳ_sp) = V[(1/n) Σ_{i=1}^{n} y_i/ψ̂_i] = (1/n)[E(y_i/ψ̂_i)² − {E(y_i/ψ̂_i)}²]
= (1/n)[ (D/N) Σ_{i=1}^{N} a_iY_i²/ψ_i − {(1/N) Σ_{i=1}^{N} Y_i(1 − (1 − ψ_i)^D)}² ].

Hence the theorem.


Assuming ψ_i is the probability of selecting the i-th unit under probability
proportional to size and with replacement sampling, an estimator of the
variance of the estimator ȳ_sp can be developed as
v(ȳ_sp) = {1/(n(n − 1))} Σ_{i=1}^{n} (y_i/ψ_i − ȳ_sp)².   (12.2.5)

The Politz and Simmons (1950) estimator adjusts the non-response bias owed to the
non-availability of the selected respondents at home during the period of the survey by
classifying the available respondents into six groups according to their availability
at home during the previous week (D = 6) and employing appropriate weighting
procedures. Thus this estimator is based on the premise that the selected
respondents available at home necessarily co-operate with the enumerator, which
however may not be true. Sharma and Sil (1996) studied the non-response bias in
the Politz–Simmons estimator taking into account the possible non-co-operation
from the selected respondents who, though available at home, may be busy otherwise.

Example 12.2.1. Select an SRSWOR sample of sixteen units from population 4
given in the Appendix. Assume that we made 7 attempts to collect information
about the different species groups selected in the sample. Information about the
number of fish caught of different species is not available every time you contact
the fishermen. We collected the information on the number of fish during 1995 in
seven visits. It was also noted, out of the seven visits, how many times (d) the
information about these species was available. Estimate the average number of fish
caught by marine recreational fishermen at the Atlantic and Gulf coasts during 1995.
Construct a 95% confidence interval for the average number of fish in the United
States.
Solution. The population size N = 69. Thus we used the first two columns of the
Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 16 random
numbers between 1 and 69. The random numbers so selected are 58, 60, 54, 01, 69,
62, 23, 64, 46, 04, 32, 47, 57, 56, 33, and 05. The following information was
collected.

Sr. No. | Species | y_i | d_i | ψ̂_i = (d_i + 1)/D | y_i/ψ̂_i | (y_i/ψ̂_i − ȳ_sp)²
01 Sharks, other 2016 6 1.000000 2016.00 34479613.30
04 Eels 152 2 0.428571 354.67 56750122.37
05 Herrings 30027 6 1.000000 30027.00 490138226.70
23 Blue runner 2319 6 1.000000 2319.00 31013030.07
32 Yellowtail snapper 1334 6 1.000000 1334.00 42954055.79
33 Snappers , others 492 6 1.000000 492.00 54699845.28
46 Spot 11567 3 0.571429 20242.25 152629114.60
47 King fish 4333 5 0.857143 5055.17 8024572.89
54 Tautog 3816 5 0.857143 4452.00 11805645.03
56 Wrasses, other 185 6 1.000000 185.00 59335197.99
57 Little tunny/Atl. bonito 782 6 1.000000 782.00 50494303.34
58 Atlantic mackerel 4008 6 1.000000 4008.00 15053890.75
60 Spanish mackerel 2568 4 0.714286 3595.20 18427568.41
62 Summer flounder 16238 6 1.000000 16238.00 69723595.94
64 Southern flounder 1446 6 1.000000 1446.00 41498518.49
69 Other fish 14426 2 0.428571 33660.67 664233729.80

Given: D = 7.

Thus an estimate of the average number of fish during 1995 is
ȳ_sp = (1/n) Σ_{i=1}^{n} y_i/ψ̂_i = 126207/16 = 7887.934.

An estimate of the variance of the estimator ȳ_sp is
v(ȳ_sp) = {1/(n(n − 1))} Σ_{i=1}^{n} (y_i/ψ̂_i − ȳ_sp)² = 1801261031/(16 × 15) = 7505254.295.

A (1 − α)100% confidence interval for the average number of fish caught during
1995 by marine recreational fishermen in the United States is given by
ȳ_sp ± t_{α/2}(df = n − 1)√v(ȳ_sp).

Using Table 2 given in the Appendix, the 95% confidence interval of the population
mean Ȳ is given by
ȳ_sp ± t_{0.025}(df = 16 − 1)√v(ȳ_sp)
or
7887.934 ± 2.131 × √7505254.295, or [2049.907, 13725.962].
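The computations of Example 12.2.1 can be reproduced directly from the table (the pairing of each y_i with its d_i below is read off the table above; the script is our own sketch):

```python
import math

# (y_i, d_i): catch in 1995 and number of times information was available,
# with psi_hat_i = (d_i + 1)/D and D = 7
data = [(2016, 6), (152, 2), (30027, 6), (2319, 6), (1334, 6), (492, 6),
        (11567, 3), (4333, 5), (3816, 5), (185, 6), (782, 6), (4008, 6),
        (2568, 4), (16238, 6), (1446, 6), (14426, 2)]
D, n = 7, len(data)

ratios = [y * D / (d + 1) for y, d in data]       # y_i / psi_hat_i
ybar_sp = sum(ratios) / n
v_sp = sum((r - ybar_sp) ** 2 for r in ratios) / (n * (n - 1))

t = 2.131                                         # t_{0.025} with 15 d.f.
half = t * math.sqrt(v_sp)
print(ybar_sp, v_sp, (ybar_sp - half, ybar_sp + half))
```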

Let there be a population of N units from which a sample of size n is to be drawn.
Let the study variable be denoted by y and the auxiliary variable be denoted by x.
The selection probabilities p_i of the population units are taken to be proportional to
the corresponding x values. Let r, r = 1, 2, ..., (n − 1), be the number of units
(including repetitions in the case of PPSWR sampling) on which the information on
y could not be collected. Singh and Singh (1979) assumed that the value of r is
supposed to be less than or equal to (n − 2) when estimation of the variance of the
estimator of the population total is concerned. Then we have the following theorem:

Theorem 12.3.1. An unbiased estimator of the population total Y is given by
Ŷ_HTR = {n/(n − r)} Σ_{i=1}^{n−r} y_i/π_i,   (12.3.1)
where π_i = np_i denotes the first order inclusion probability.

Proof. Let E_3 be the expected value for a given sample of n units from which the
(n − r) units (on which response has been received) can be treated as an SRSWOR
sample, E_2 be the expected value over all such samples for a given r, and E_1 be
the expectation over all possible values of r.

Then we have
E(Ŷ_HTR) = E_1E_2E_3(Ŷ_HTR) = E_1E_2(Ŷ_HT) = E_1(Y) = Y.

Hence the theorem.

Theorem 12.3.2. The variance of the estimator Ŷ_HTR is given by
V(Ŷ_HTR) = V(Ŷ_HT) + [ Σ_{i=1}^{N} Y_i²/π_i − {1/(n − 1)} Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j ] E_1{r/(n − r)},   (12.3.2)
where
Ŷ_HT = Σ_{i=1}^{n} y_i/π_i.

Proof. Defining the variances V_1, V_2, and V_3 in the same manner as the expectations
E_1, E_2, and E_3, we have
V(Ŷ_HTR) = E_1E_2V_3(Ŷ_HTR) + E_1V_2E_3(Ŷ_HTR) + V_1E_2E_3(Ŷ_HTR)
= E_1[{ Σ_{i=1}^{N} Y_i²/π_i − {1/(n − 1)} Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j }{r/(n − r)}] + V(Ŷ_HT)
= [ Σ_{i=1}^{N} Y_i²/π_i − {1/(n − 1)} Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j ] E_1{r/(n − r)} + V(Ŷ_HT),
which proves the theorem.

Theorem 12.3.3. An unbiased estimator of the variance V(Ŷ_HTR) is given by
v(Ŷ_HTR) = {n/(n − r)²}[ Σ_{i=1}^{n−r} {n − (n − r)π_i} y_i²/π_i²   (12.3.3)
+ {1/(n − r − 1)} Σ_{i=1}^{n−r} Σ_{j≠i=1}^{n−r} {n(n − r − 1)π_{ij} − (n − r)(n − 1)π_iπ_j} y_iy_j/(π_{ij}π_iπ_j) ].

Proof. We have
E[v(Ŷ_HTR)] = E_1E_2E_3[v(Ŷ_HTR)]
= E_1E_2[ {1/(n − r)} Σ_{i=1}^{n} {n − (n − r)π_i} y_i²/π_i²
+ {1/((n − r)(n − 1))} Σ_{i=1}^{n} Σ_{j≠i} {n(n − r − 1)π_{ij} − (n − r)(n − 1)π_iπ_j} y_iy_j/(π_{ij}π_iπ_j) ]
= E_1[ {1/(n − r)} Σ_{i=1}^{N} {n − (n − r)π_i} Y_i²/π_i
+ {1/((n − r)(n − 1))} Σ_{i≠j=1}^{N} {n(n − r − 1)π_{ij} − (n − r)(n − 1)π_iπ_j} Y_iY_j/(π_iπ_j) ]
= E_1[ Σ_{i=1}^{N} Y_i²/π_i + Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j − Y²
+ {r/(n − r)}{ Σ_{i=1}^{N} Y_i²/π_i − {1/(n − 1)} Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j } ]
= V(Ŷ_HT) + [ Σ_{i=1}^{N} Y_i²/π_i − {1/(n − 1)} Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j ] E_1{r/(n − r)},
since V(Ŷ_HT) = Σ_{i=1}^{N} Y_i²/π_i + Σ_{i≠j=1}^{N} (π_{ij}/(π_iπ_j))Y_iY_j − Y².
Hence the theorem.

Remark 12.3.1. The value of
E{r/(n − r)} ≈ E(r)/n + E(r²)/n²,
where E(r) and E(r²) can easily be obtained if the distribution of r is known. In
practice r may follow a Binomial or Negative Binomial distribution.

Tracy and Osahan (1994c) studied the effect of random non-response on the usual
ratio estimator of the population mean in two situations: ( i ) non-response in the
study as well as the auxiliary variable, and ( ii ) non-response in the study variable
only. Singh, Joarder, and Tracy (2000) suggested three regression type estimators,
which were further studied by Singh and Tracy (2001), in the presence of random
non-response in different situations under the assumption that the number of
sampling units on which information cannot be obtained owing to random non-response
follows some distribution.

Let Ω = (v_1, v_2, ..., v_N) denote the population of N units from which an SRSWOR
sample of size n is drawn. If r (r = 0, 1, ..., (n − 2)) denotes the number of sampling
units on which information could not be obtained owing to random non-response,
then the remaining (n − r) units in the sample can be treated as an SRSWOR sample
from Ω. Note that we are considering the regression type estimators, therefore we
are assuming that r should be less than (n − 2). We assume that if p denotes the
probability of non-response among the (n − 2) possible values of responses, then r
has the discrete distribution
p(r) = {(n − r)/(nq + 2p)} C(n−2, r) p^r q^{n−2−r},   (12.4.1.1)
where q = 1 − p and r = 0, 1, 2, ..., (n − 2).
Let us define
ε = ȳ_{n−r}/Ȳ − 1, δ = x̄_{n−r}/X̄ − 1, and η = x̄/X̄ − 1,
where ȳ_{n−r} = (n − r)^{−1} Σ_{i=1}^{n−r} y_i, x̄_{n−r} = (n − r)^{−1} Σ_{i=1}^{n−r} x_i, and x̄ = n^{−1} Σ_{i=1}^{n} x_i
have their usual meanings (Tracy and Osahan, 1994c). The probability model defined at (12.4.1.1)
is free from actual data values, hence it can be considered as a model suitable for
the MAR situation. Then under the probability model given by (12.4.1.1), we have the
following results:

E(ε) = E(δ) = E(η) = 0,
E(ε²) = [1/(nq + 2p) − 1/N]C_y², E(δ²) = [1/(nq + 2p) − 1/N]C_x², E(η²) = (1/n − 1/N)C_x²,
E(εδ) = [1/(nq + 2p) − 1/N]ρ_{xy}C_yC_x, E(εη) = (1/n − 1/N)ρ_{xy}C_yC_x, and E(δη) = (1/n − 1/N)C_x²,
where C_y = S_y/Ȳ, C_x = S_x/X̄ and ρ_{xy} = S_{xy}/(S_xS_y). It is interesting to note that
under the model (12.4.1.1) the above expected values are exact and hence make
valid comparisons with the estimators in the absence of non-response possible. The logic and
the practical importance of the distribution defined at (12.4.1.1) have been discussed
by Singh and Joarder (1998). In place of (12.4.1.1) we can also use another
distribution (e.g., truncated binomial), but such distributions add an extra
approximation in the expected values.
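As a quick sanity check (our own, not from the text), the probabilities in (12.4.1.1) do sum to one, since Σ_r (n − r) C(n−2, r) p^r q^{n−2−r} = n − (n − 2)p = nq + 2p:

```python
from math import comb

def pmf(r, n, p):
    """Random non-response model (12.4.1.1)."""
    q = 1 - p
    return (n - r) * comb(n - 2, r) * p**r * q**(n - 2 - r) / (n * q + 2 * p)

n, p = 20, 0.35
total = sum(pmf(r, n, p) for r in range(n - 1))   # r = 0, 1, ..., n - 2
print(total)   # 1.0 up to floating point error
```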

12.4.2 ESTIMATION OF POPULATION MEAN

Strategy I. Consider the situation when random non-response exists on both the
study variable y and the auxiliary variable x, and the population mean X̄ of the
auxiliary variable is known. Thus we consider a regression type estimator as
ȳ_1 = ȳ_{n−r} + α_1(X̄ − x̄_{n−r}).   (12.4.2.1)

The minimum variance of the estimator at (12.4.2.1) is given by
Min.V(ȳ_1) = [1/(nq + 2p) − 1/N]S_y²(1 − ρ_{xy}²)   (12.4.2.2)
for the optimum value of α_1 given by
α_1 = S_{xy}/S_x².   (12.4.2.3)

If p = 0 then the variance in (12.4.2.2) reduces to the variance of the usual linear
regression estimator. In this situation, we can estimate α_1 by α̂_1 = s*_{xy}/s*_x², where
s*_{xy} = (n − r − 1)^{−1} Σ_{i=1}^{n−r} (x_i − x̄_{n−r})(y_i − ȳ_{n−r}) and s*_x² = (n − r − 1)^{−1} Σ_{i=1}^{n−r} (x_i − x̄_{n−r})²
are the estimators of S_{xy} and S_x², respectively. Thus we have the following theorems:

Theorem 12.4.2.1. The regression estimator, if random non-response exists on
both the study and auxiliary variables, is given by
ȳ_{1lr} = ȳ_{n−r} + (s*_{xy}/s*_x²)(X̄ − x̄_{n−r}).   (12.4.2.4)

Theorem 12.4.2.2. An estimator for estimating the variance of the estimator ȳ_1 is
v(ȳ_1) = [1/(nq̂ + 2p̂) − 1/N] s*_y² (1 − r*_{xy}²),   (12.4.2.5)
where
s*_y² = (n − r − 1)^{−1} Σ_{j=1}^{n−r} (y_j − ȳ_{n−r})² and p̂ = [(n − 1 + r) − √{(n − 1 + r)² − 4nr(n − 3)/(n − 2)}]/{2(n − 3)}
is a maximum likelihood estimator of p obtained from the distribution given by
(12.4.1.1), q̂ = 1 − p̂, and r*_{xy}² = s*_{xy}²/(s*_x² s*_y²). If r = 0 then p̂ = 0, and if r = (n − 2) then
p̂ = 1, i.e., p̂ is an admissible estimator of the non-response probability p. The behaviour
of the estimator p̂ for different values of r and n is given in the following table.
of the estimator p for different values of rand n is given in the following table.

Table 12.4.2.1. The values of p̂ for different choices of n and r.

n\r | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18
20 0.11687 0.23355 0.34993 0.46586 0.58107 0.69496 0.80624 0.91135 1.00000
40 0.05397 0.10794 0.16190 0.21584 0.26978 0.32369 0.37758 0.43144 0.48527
60 0.03506 0.07013 0.10519 0.14025 0.17531 0.21037 0.24542 0.28047 0.31552
100 0.02061 0.04122 0.06184 0.08245 0.10306 0.12368 0.14429 0.16490 0.18552
1000 0.00200 0.00401 0.00601 0.00802 0.01003 0.01203 0.01404 0.01604 0.01805
2000 0.00100 0.00200 0.00300 0.00400 0.00500 0.00600 0.00701 0.00801 0.00901
5000 0.00040 0.00080 0.00120 0.00160 0.00200 0.00240 0.00280 0.00320 0.00360
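The MLE p̂ can be tabulated directly; the sketch below reproduces a few entries of Table 12.4.2.1 (reading its columns as r = 2, 4, ..., 18, which is how the tabled values match the formula):

```python
import math

def p_hat(n, r):
    """MLE of the non-response probability p under model (12.4.1.1)."""
    a = n - 1 + r
    disc = a * a - 4 * n * r * (n - 3) / (n - 2)
    return (a - math.sqrt(disc)) / (2 * (n - 3))

print(p_hat(20, 6))    # close to the tabled 0.34993
print(p_hat(20, 18))   # exactly 1.0 at r = n - 2
print(p_hat(40, 2))    # close to the tabled 0.05397
```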

Strategy II. Consider the situation in which information on the variable y cannot be
obtained for r units while information on the variable x is available for all n units, and
the population mean X̄ of the auxiliary variable is known. Under such situations,
consider the following estimator
ȳ_2 = ȳ_{n−r} + α_2(X̄ − x̄).   (12.4.2.6)

The minimum variance of the estimator ȳ_2 is given by
Min.V(ȳ_2) = [1/(nq + 2p) − 1/n]S_y² + (1/n − 1/N)S_y²(1 − ρ_{xy}²)   (12.4.2.7)
for the optimum value of α_2 given by
α_2 = S_{xy}/S_x².   (12.4.2.8)

If p = 0 the variance in (12.4.2.7) reduces to the variance of the usual linear
regression estimator. Here α_2 can be estimated by α̂_2 = s*_{xy}/s_x², where
s_x² = (n − 1)^{−1} Σ_{i=1}^{n} (x_i − x̄)², which leads to the following theorems:

Theorem 12.4.2.3. The regression estimator, if non-response exists only on the
study variable, is given by
ȳ_{2lr} = ȳ_{n−r} + (s*_{xy}/s_x²)(X̄ − x̄).   (12.4.2.9)

Theorem 12.4.2.4. An estimator for estimating the minimum variance of ȳ_2 is
given by
v(ȳ_2) = [1/(nq̂ + 2p̂) − 1/n] s*_y² + (1/n − 1/N) s*_y² (1 − r*_{xy}²).   (12.4.2.10)

Strategy III. Consider the situation when information on the variable y cannot be
obtained for r units while information on the variable x is obtained for all the
sample units. The difference from the previous case is that the population mean X̄
of the auxiliary variable is not known. In this case consider another regression type
estimator
ȳ_3 = ȳ_{n−r} + α_3(x̄ − x̄_{n−r}).   (12.4.2.11)

The minimum variance of ȳ_3 is given by
Min.V(ȳ_3) = [1/(nq + 2p) − 1/n]S_y²(1 − ρ_{xy}²) + (1/n − 1/N)S_y²   (12.4.2.12)
for the optimum value of α_3 = S_{xy}/S_x². Again α_3 can be estimated by α̂_2. Thus we
have the following theorems:

Theorem 12.4.2.5. The regression estimator in the absence of the population mean X̄,
but in the presence of non-response, is given by
ȳ_{3lr} = ȳ_{n−r} + (s*_{xy}/s_x²)(x̄ − x̄_{n−r}).   (12.4.2.13)

Theorem 12.4.2.6. An estimator of the minimum variance of the estimator ȳ_3 is
given by
v(ȳ_3) = [1/(nq̂ + 2p̂) − 1/n] s*_y² (1 − r*_{xy}²) + (1/n − 1/N) s*_y².   (12.4.2.14)

Example 12.4.2.1. We selected an SRSWOR sample of twenty states from the
population 1 given in the Appendix. We collected information about the real estate
farm loans and nonreal estate farm loans from the selected states, but unfortunately
the information on the real estate farm loans was not available for six states, as
marked in the table below.

State | x | y | Sr. No. | State | x | y
AL | 348.334 | 408.978 | 27 | NE | 3585.406 | 1337.852
AZ | 431.439 | 54.633 | 29 | NH | 0.471 | 6.044
AR | 848.317 | 907.700 | 32 | NY | 426.274 | 201.631
CA | 3928.732 | 1343.461 | 33 | NC | 494.730 | 639.571
CO | 906.281 | 315.809 | 36 | OK | 1716.087 | Missing
CT | 4.373 | 7.130 | 38 | PA | 298.351 | Missing
IN | 1022.782 | 1213.024 | 40 | SC | 80.750 | Missing
ME | 51.539 | 8.849 | 42 | TN | 388.869 | Missing
MI | 440.518 | 323.028 | 46 | VA | 188.477 | Missing
MN | 2466.892 | 1354.768 | 47 | WA | 1228.607 | Missing
x -- Nonreal estate farm loans; y -- Real estate farm loans.

Apply the following three estimators of the population mean,

\bar{y}_{1lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{*2})(\bar{X} - \bar{x}_{n-r}), \qquad \bar{y}_{2lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{2})(\bar{X} - \bar{x}_n),

and $\bar{y}_{3lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{2})(\bar{x}_{n-r} - \bar{x}_n)$, for estimating the average real estate farm loans, assuming that the average nonreal estate farm loans in the United States is known and equal to \$878.16. Construct 95% confidence interval estimates.

Solution. We are given $N = 50$ and $\bar{X} = 878.16$. From the above table we have $n = 20$, $\bar{x}_n = 942.8615$, and $s_x^2 = 1307911.82$.
From the responding states we have
12. Non-response and its treatments 991

State  x_i       y_i       (x_i-\bar{x}_{n-r})^2  (y_i-\bar{y}_{n-r})^2  (x_i-\bar{x}_{n-r})(y_i-\bar{y}_{n-r})
AL     348.334   408.978   518339.52    29309.09    123256.0896
AZ     431.439   54.633    405581.74    276196.49   334694.2730
AR     848.317   907.700   48389.00     107271.31   -72046.8719
CA     3928.732  1343.461  8182116.94   582602.46   2183328.0850
CO     906.281   315.809   26247.56     69890.43    42830.5240
CT     4.373     7.130     1131923.63   328382.86   609675.5912
IN     1022.782  1213.024  2071.16      400495.32   -28800.8669
ME     51.539    8.849     1033786.66   326415.68   580899.4580
MI     440.518   323.028   394100.19    66125.60    161431.4563
MN     2466.892  1354.768  1956081.96   599991.21   1083342.9730
NE     3585.406  1337.852  6335862.89   574071.40   1907154.3500
NH     0.471     6.044     1140241.68   329628.70   613071.2742
NY     426.274   201.631   412187.11    143297.07   243033.3458
NC     494.730   639.571   328973.36    3527.64     -34066.1414

Thus we have

\bar{x}_{n-r} = 1068.292, \quad \bar{y}_{n-r} = 580.177, \quad s_x^{*2} = 1685838.731, \quad s_y^{*2} = 295169.6416,

s_{xy}^{*} = 595984.887, \quad \text{and} \quad r_{xy}^{*} = \frac{595984.887}{\sqrt{1685838.731 \times 295169.6416}} = 0.84487.

From Table 12.4.2.1, for $n = 20$ and $r = 6$, an estimate of the probability of non-response is $\hat{p} = 0.34993 \approx 0.35$ and $\hat{q} = 1 - \hat{p} = 0.65$.

Estimator 1. We have

\bar{y}_{1lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{*2})(\bar{X} - \bar{x}_{n-r}) = 580.177 + \frac{595984.887}{1685838.731}(878.16 - 1068.292) = 512.96

and

\hat{v}(\bar{y}_{1lr}) = \Big[\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{N}\Big]s_y^{*2}(1 - r_{xy}^{*2}) = \Big[\frac{1}{20\times0.65+2\times0.35} - \frac{1}{50}\Big] \times 295169.6416 \times (1 - 0.84487^2) = 4476.61.

A $(1-\alpha)100\%$ confidence interval for the true population mean $\bar{Y}$ is given by

\bar{y}_{1lr} \pm t_{\alpha/2}(df = n-r-2)\sqrt{\hat{v}(\bar{y}_{1lr})}.

Using Table 2 from the Appendix, the 95% confidence interval for the average real estate farm loans is given by

512.96 \pm t_{0.05/2}(df = 20-6-2)\sqrt{4476.61},

or $512.96 \pm 2.179\sqrt{4476.61}$, or [367.17, 658.75].

Estimator 2. We have

\bar{y}_{2lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{2})(\bar{X} - \bar{x}_n) = 580.177 + \frac{595984.887}{1307911.82}(878.16 - 942.861) = 550.694

and

\hat{v}(\bar{y}_{2lr}) = \Big[\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{n}\Big]s_y^{*2} + \Big(\frac{1}{n} - \frac{1}{N}\Big)s_y^{*2}(1 - r_{xy}^{*2})

= \Big[\frac{1}{20\times0.65+2\times0.35} - \frac{1}{20}\Big] \times 295169.6416 + \Big(\frac{1}{20} - \frac{1}{50}\Big) \times 295169.6416 \times (1 - 0.84487^2)

= 6786.74 + 2534.27 = 9321.02.

A $(1-\alpha)100\%$ confidence interval for the true population mean $\bar{Y}$ is given by

\bar{y}_{2lr} \pm t_{\alpha/2}(df = n-r-2)\sqrt{\hat{v}(\bar{y}_{2lr})}.

Using Table 2 from the Appendix, the 95% confidence interval for the average real estate farm loans is

550.694 \pm t_{0.05/2}(df = 20-6-2)\sqrt{9321.02},

or $550.694 \pm 2.179\sqrt{9321.02}$, or [340.32, 761.066].

Estimator 3. We have

\bar{y}_{3lr} = \bar{y}_{n-r} + (s_{xy}^{*}/s_x^{2})(\bar{x}_{n-r} - \bar{x}_n) = 580.177 + \frac{595984.887}{1307911.82}(1068.292 - 942.8615) = 637.33

and

\hat{v}(\bar{y}_{3lr}) = \Big[\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{n}\Big]s_y^{*2}(1 - r_{xy}^{*2}) + \Big(\frac{1}{n} - \frac{1}{N}\Big)s_y^{*2}

= \Big[\frac{1}{20\times0.65+2\times0.35} - \frac{1}{20}\Big] \times 295169.6416 \times (1 - 0.84487^2) + \Big(\frac{1}{20} - \frac{1}{50}\Big) \times 295169.6416

= 1942.33 + 8855.09 = 10797.42.

A $(1-\alpha)100\%$ confidence interval for the true population mean $\bar{Y}$ is given by

\bar{y}_{3lr} \pm t_{\alpha/2}(df = n-r-2)\sqrt{\hat{v}(\bar{y}_{3lr})}.

Using Table 2 from the Appendix, the 95% confidence interval for the average real estate farm loans is

637.33 \pm t_{0.05/2}(df = 20-6-2)\sqrt{10797.42},

or $637.33 \pm 2.179\sqrt{10797.42}$, or [410.90, 863.75].
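The three point estimates, variance estimates, and the first confidence interval can be reproduced with a short script. This is a sketch using the summary statistics exactly as printed in the example, so small rounding differences against the text are expected.

```python
import math

# Summary statistics from Example 12.4.2.1 (as printed in the text).
n, N, r = 20, 50, 6
X_bar = 878.16            # known population mean of x
x_bar_n = 942.8615        # full-sample mean of x
s_x2 = 1307911.82         # full-sample variance of x
y_bar_r = 580.177         # mean of y over the 14 responding states
x_bar_r = 1068.292        # mean of x over the responding states
s_x2_star = 1685838.731   # respondent variance of x
s_y2_star = 295169.6416   # respondent variance of y
s_xy_star = 595984.887    # respondent covariance of x and y
p, q = 0.35, 0.65         # estimated non-response / response probabilities

r_xy = s_xy_star / math.sqrt(s_x2_star * s_y2_star)   # correlation, about 0.84487
f = 1.0 / (n * q + 2 * p)                             # factor 1/(nq + 2p)

# Estimator 1: respondent-only regression on the known population mean X.
y1 = y_bar_r + (s_xy_star / s_x2_star) * (X_bar - x_bar_r)
v1 = (f - 1 / N) * s_y2_star * (1 - r_xy ** 2)

# Estimator 2, equations (12.4.2.9)-(12.4.2.10).
y2 = y_bar_r + (s_xy_star / s_x2) * (X_bar - x_bar_n)
v2 = (f - 1 / n) * s_y2_star + (1 / n - 1 / N) * s_y2_star * (1 - r_xy ** 2)

# Estimator 3, equations (12.4.2.13)-(12.4.2.14): population mean X not needed.
y3 = y_bar_r + (s_xy_star / s_x2) * (x_bar_r - x_bar_n)
v3 = (f - 1 / n) * s_y2_star * (1 - r_xy ** 2) + (1 / n - 1 / N) * s_y2_star

t = 2.179   # t-value for df = n - r - 2 = 12 at the 95% level, from Table 2
ci1 = (y1 - t * math.sqrt(v1), y1 + t * math.sqrt(v1))
```

Running it gives 512.96, 550.69, and 637.33 for the three point estimates, matching the worked solution to rounding.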



Singh and Joarder (1998) considered the problem of estimation of the finite population variance in the presence of non-response in survey sampling. Let us define

\varepsilon = (s_y^{*2}/S_y^2) - 1, \qquad \delta = (s_x^{*2}/S_x^2) - 1, \qquad \text{and} \qquad \eta = (s_x^2/S_x^2) - 1,

where $s_y^{*2} = (n-r-1)^{-1}\sum (y_i - \bar{y}^{*})^2$ and $s_x^{*2} = (n-r-1)^{-1}\sum (x_i - \bar{x}^{*})^2$, with sums taken over the responding units, are conditionally unbiased estimators of $S_y^2$ and $S_x^2$, respectively, and $\bar{y}^{*} = (n-r)^{-1}\sum y_i$ and $\bar{x}^{*} = (n-r)^{-1}\sum x_i$. Thus under the probability model given by (12.4.1.1) we have the following results.

E(\varepsilon) = E(\delta) = E(\eta) = 0,

E(\varepsilon^2) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big](\lambda_{40} - 1), \qquad E(\delta^2) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big](\lambda_{04} - 1),

E(\eta^2) = \Big[\frac{1}{n} - \frac{1}{N}\Big](\lambda_{04} - 1), \qquad E(\varepsilon\delta) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big](\lambda_{22} - 1),

E(\varepsilon\eta) = \Big[\frac{1}{n} - \frac{1}{N}\Big](\lambda_{22} - 1), \qquad \text{and} \qquad E(\delta\eta) = \Big[\frac{1}{n} - \frac{1}{N}\Big](\lambda_{04} - 1).

It may be noted that if $p = 0$, i.e., if there is no non-response, the above expected values coincide with the usual results discussed in Chapter 3.

Strategy I. Consider the situation when random non-response exists on both the study variable $y$ and the auxiliary variable $x$, and the population variance $S_x^2$ of the auxiliary character is known. An estimator of the finite population variance is

\hat{v} = s_y^{*2}(S_x^2/s_x^{*2}).  (12.4.3.1)

Then we have the following theorems:

Theorem 12.4.3.1. The bias in the estimator $\hat{v}$, up to terms of order $n^{-1}$, is

B(\hat{v}) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big]S_y^2(\lambda_{04} - \lambda_{22}).  (12.4.3.2)

Proof. The estimator $\hat{v}$ in terms of $\varepsilon$ and $\delta$ can be written as

\hat{v} = S_y^2(1 + \varepsilon - \delta + \delta^2 - \varepsilon\delta + \cdots).  (12.4.3.3)

Taking expected values on both sides of (12.4.3.3) and using the results on the expectations from the previous section, we get (12.4.3.2). Hence the theorem.

Theorem 12.4.3.2. The mean square error, up to terms of order $n^{-1}$, of the estimator $\hat{v}$ is given by

MSE(\hat{v}) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big]S_y^4(\lambda_{40} + \lambda_{04} - 2\lambda_{22}).  (12.4.3.4)

Proof. It is easy to check that

MSE(\hat{v}) = E(\hat{v} - S_y^2)^2 = S_y^4 E(\varepsilon - \delta + \delta^2 - \varepsilon\delta)^2 \approx S_y^4 E(\varepsilon^2 + \delta^2 - 2\varepsilon\delta) = \Big[\frac{1}{nq+2p} - \frac{1}{N}\Big]S_y^4(\lambda_{40} + \lambda_{04} - 2\lambda_{22}).

Theorem 12.4.3.3. An estimator of the $MSE(\hat{v})$ is given by

\widehat{MSE}(\hat{v}) = \Big[\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{N}\Big]s_y^{*4}(\hat{\lambda}_{40}^{*} + \hat{\lambda}_{04}^{*} - 2\hat{\lambda}_{22}^{*}),  (12.4.3.5)

where $\hat{\lambda}_{ls}^{*} = \hat{\mu}_{ls}^{*}/\{(\hat{\mu}_{20}^{*})^{l/2}(\hat{\mu}_{02}^{*})^{s/2}\}$ and $\hat{\mu}_{ls}^{*} = (n-r-1)^{-1}\sum (y_i - \bar{y}^{*})^l(x_i - \bar{x}^{*})^s$.

Strategy II. Consider the situation when information on the variable $y$ could not be obtained for $r$ units while information on the variable $x$ is available for all sample units, and the population variance $S_x^2$ of the auxiliary variable is known; then we have the estimator

\hat{v}_1 = s_y^{*2}(S_x^2/s_x^2).  (12.4.3.6)

Thus we have the following theorems, whose proofs are obvious.

Theorem 12.4.3.4. The bias in the estimator $\hat{v}_1$, up to terms of order $n^{-1}$, is

B(\hat{v}_1) = \Big[\frac{1}{n} - \frac{1}{N}\Big]S_y^2(\lambda_{04} - \lambda_{22}).  (12.4.3.7)

Theorem 12.4.3.5. The mean square error, up to terms of order $n^{-1}$, is given by

MSE(\hat{v}_1) = MSE(\hat{s}_y^2) + \Big[\frac{1}{nq+2p} - \frac{1}{n}\Big]S_y^4(\lambda_{40} - 1),  (12.4.3.8)

where $MSE(\hat{s}_y^2)$ denotes the MSE of the ratio type estimator of variance proposed by Isaki (1983) as discussed in Chapter 3.

Theorem 12.4.3.6. An estimator of the mean square error of $\hat{v}_1$ is given by

\widehat{MSE}(\hat{v}_1) = \Big(\frac{1}{n} - \frac{1}{N}\Big)s_y^{*4}(\hat{\lambda}_{40}^{*} + \hat{\lambda}_{04} - 2\hat{\lambda}_{22}^{*}) + \Big[\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{n}\Big]s_y^{*4}(\hat{\lambda}_{40}^{*} - 1),  (12.4.3.9)

where $\hat{\lambda}_{04} = \hat{\mu}_{04}/\hat{\mu}_{02}^2$ and $\hat{\mu}_{0s} = (n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^s$.

If information on $x$ is available for all the $n$ units then we can obtain both $s_x^2$ and $s_x^{*2}$. Using this information consider another estimator,

\hat{v}_2 = s_y^{*2}(S_x^2/s_x^2) + \alpha(s_x^{*2}/S_x^2 - 1),  (12.4.3.10)

where $\alpha$ is a suitably chosen constant such that the mean square error of $\hat{v}_2$ is minimum. Thus we have the following two theorems:

Theorem 12.4.3.7. The bias in the estimator $\hat{v}_2$ is the same as that of the estimator $\hat{v}_1$.

Theorem 12.4.3.8. The minimum mean square error of the estimator $\hat{v}_2$ is given by

\mathrm{Min.}MSE(\hat{v}_2) = MSE(\hat{v}_1) - \frac{S_y^4\Big\{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{22}-1) - \Big(\frac{1}{n}-\frac{1}{N}\Big)(\lambda_{04}-1)\Big\}^2}{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{04}-1)}.  (12.4.3.11)

Proof. We have

MSE(\hat{v}_2) = E(\hat{v}_2 - S_y^2)^2 = E[S_y^2(1 + \varepsilon - \eta + \eta^2 - \varepsilon\eta) + \alpha\delta - S_y^2]^2

= MSE(\hat{v}_1) + \alpha^2\Big\{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{04}-1)\Big\} + 2\alpha S_y^2\Big\{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{22}-1) - \Big(\frac{1}{n}-\frac{1}{N}\Big)(\lambda_{04}-1)\Big\}.  (12.4.3.12)

On differentiating (12.4.3.12) with respect to $\alpha$ and equating to zero we have

\alpha = -\frac{\Big\{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{22}-1) - \Big(\frac{1}{n}-\frac{1}{N}\Big)(\lambda_{04}-1)\Big\}S_y^2}{\Big[\frac{1}{nq+2p}-\frac{1}{N}\Big](\lambda_{04}-1)},  (12.4.3.13)

and then putting the optimum value of $\alpha$ in (12.4.3.12) we have (12.4.3.11). Hence the theorem.

Theorem 12.4.3.9. An estimator of the $\mathrm{Min.}MSE(\hat{v}_2)$ is given by

\widehat{\mathrm{Min.}MSE}(\hat{v}_2) = \widehat{MSE}(\hat{v}_1) - \frac{s_y^{*4}\Big\{\Big[\frac{1}{n\hat{q}+2\hat{p}}-\frac{1}{N}\Big](\hat{\lambda}_{22}^{*}-1) - \Big(\frac{1}{n}-\frac{1}{N}\Big)(\hat{\lambda}_{04}-1)\Big\}^2}{\Big[\frac{1}{n\hat{q}+2\hat{p}}-\frac{1}{N}\Big](\hat{\lambda}_{04}-1)}.  (12.4.3.14)

If the optimum value of $\alpha$ is not known, then it is advisable to replace it with its estimator $\hat{\alpha}$, and then we have the following estimator:

\hat{v}_3 = s_y^{*2}(S_x^2/s_x^2) + \hat{\alpha}(s_x^{*2}/S_x^2 - 1),  (12.4.3.15)

where

\hat{\alpha} = -s_y^{*2}\Big\{\Big[\frac{1}{n\hat{q}+2\hat{p}}-\frac{1}{N}\Big](\hat{\lambda}_{22}^{*}-1) - \Big(\frac{1}{n}-\frac{1}{N}\Big)(\hat{\lambda}_{04}-1)\Big\}\Big/\Big\{\Big[\frac{1}{n\hat{q}+2\hat{p}}-\frac{1}{N}\Big](\hat{\lambda}_{04}-1)\Big\}

denotes a consistent estimator of $\alpha$. To find the mean square error of the estimator $\hat{v}_3$, let us define $\kappa = (\hat{\alpha}/\alpha) - 1$, where $E(\kappa) = O(n^{-1})$; then the MSE of $\hat{v}_3$ is given by

MSE(\hat{v}_3) = E(\hat{v}_3 - S_y^2)^2 = E[S_y^2(\varepsilon - \eta + \eta^2 - \varepsilon\eta) + \alpha(1+\kappa)\delta]^2.  (12.4.3.16)

This is approximately the same as $MSE(\hat{v}_2)$. It may be noted here that the estimators $\hat{v}_2$ and $\hat{v}_3$ may take an inadmissible value, i.e., a negative value. Thus an equally efficient alternative estimator is given by

\hat{v}_4 = s_y^{*2}(s_x^2/s_x^{*2})^{\alpha}  (12.4.3.17)

for $\alpha \ne 1$. If $\alpha = 1$ then it leads to the following strategy.

Strategy III. Consider the situation when information on the variable $y$ could not be obtained for $r$ units while information on the variable $x$ is obtained for all the sample units, but the difference is that the population variance $S_x^2$ of the auxiliary variable is not known. In this case consider another ratio estimator

\hat{v}_5 = s_y^{*2}(s_x^2/s_x^{*2}).  (12.4.3.18)

Thus we have the following theorems:

Theorem 12.4.3.10. The approximate bias in the estimator $\hat{v}_5$, up to terms of order $n^{-1}$, is given by

B(\hat{v}_5) = \Big[\frac{1}{n} - \frac{1}{nq+2p}\Big]S_y^2(\lambda_{22} - \lambda_{04}).  (12.4.3.19)

Theorem 12.4.3.11. The asymptotic mean square error of the estimator $\hat{v}_5$, up to terms of order $n^{-1}$, is

MSE(\hat{v}_5) = \Big[\Big\{\frac{1}{nq+2p} - \frac{1}{N}\Big\}(\lambda_{40} - 1) + \Big\{\frac{1}{nq+2p} - \frac{1}{n}\Big\}(\lambda_{04} + 1 - 2\lambda_{22})\Big]S_y^4.  (12.4.3.20)

Theorem 12.4.3.12. An estimator of the mean square error of $\hat{v}_5$ is given by

\widehat{MSE}(\hat{v}_5) = \Big[\Big\{\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{N}\Big\}(\hat{\lambda}_{40}^{*} - 1) + \Big\{\frac{1}{n\hat{q}+2\hat{p}} - \frac{1}{n}\Big\}(\hat{\lambda}_{04} + 1 - 2\hat{\lambda}_{22}^{*})\Big]s_y^{*4}.  (12.4.3.21)

Example 12.4.3.1. A Government organisation selected an SRSWOR sample of twenty states from the population 1 given in the Appendix and collected information about the real estate farm loans and the nonreal estate farm loans, but unfortunately the information on the real estate farm loans was not available for four states, as marked in the table below.

No.  State  x         y        | No.  State  x         y
05   CA     3928.732  1343.461 | 34   ND     1241.369  Missing
07   CT     4.373     7.130    | 36   OK     1716.087  612.108
09   FL     464.516   825.748  | 40   SC     80.750    87.951
13   IL     2610.572  2131.048 | 42   TN     388.869   553.266
19   ME     51.539    Missing  | 43   TX     3520.361  Missing
24   MS     549.551   627.013  | 44   UT     197.244   56.908
25   MO     1519.994  1579.686 | 46   VA     188.477   Missing
27   NE     3585.406  1337.852 | 47   WA     1228.607  1100.745
30   NJ     27.508    39.860   | 48   WV     29.291    99.277
31   NM     274.035   140.582  | 49   WI     1372.439  1229.752

x = Nonreal estate farm loans; y = Real estate farm loans.

Apply the ratio type estimator $\hat{v}_5 = s_y^{*2}(s_x^2/s_x^{*2})$ for estimating the finite population variance of the real estate farm loans in the United States. Construct a 75% confidence interval.

Solution. From Table 12.4.2.1, for $n = 20$ and $r = 4$, we have $\hat{p} = 0.23355$ and $\hat{q} = 1 - \hat{p} = 1 - 0.23355 = 0.76645$. From the responding units in the sample we have

No.  x_i       y_i       (x_i-\bar{x}_{n-r})^2  (y_i-\bar{y}_{n-r})^2  (x_i-\bar{x}_{n-r})^2(y_i-\bar{y}_{n-r})^2  (y_i-\bar{y}_{n-r})^4
05   3928.732  1343.461  7868634.0   369283.3   2.90575E+12   1.36370E+11
07   4.373     7.130     1252721.0   530922.4   6.65098E+11   2.81879E+11
09   464.516   825.748   434422.5    8095.3     3516775046    65533670.5
13   2610.572  2131.048  2211016.0   1946789.0  4.30438E+12   3.78999E+12
24   549.551   627.013   329559.1    11829.0    3898353147    139925144.2
25   1519.994  1579.686  157109.7    712187.1   1.11891E+11   5.07211E+11
27   3585.406  1337.852  6060374.0   362497.7   2.19687E+12   1.31405E+11
30   27.508    39.860    1201469.0   484296.6   5.81867E+11   2.34543E+11
31   274.035   140.582   721800.4    354253.7   2.55700E+11   1.25496E+11
36   1716.087  612.108   351013.1    15293.3    5368158462    233885818.0
40   80.750    87.951    1087585.0   419674.9   4.56432E+11   1.76127E+11
42   388.869   553.266   539864.0    33309.2    17982458456   1109505370.0
44   197.244   56.908    858178.7    460859.3   3.95500E+11   2.12391E+11
47   1228.607  1100.745  11021.6     133203.7   1468112708    17743224089.0
48   29.291    99.277    1197563.0   405128.7   4.85167E+11   1.64129E+11
49   1372.439  1229.752  61909.2     244014.1   15106720154   59542870868.0

\bar{x}_{n-r} = \frac{\sum_{i=1}^{n-r}x_i}{n-r} = \frac{17977.97}{20-4} = 1123.62, \qquad \bar{y}_{n-r} = \frac{\sum_{i=1}^{n-r}y_i}{n-r} = \frac{11772.39}{20-4} = 735.77,

\hat{\mu}_{20}^{*} = s_y^{*2} = \frac{\sum(y_i-\bar{y}_{n-r})^2}{n-r-1} = \frac{6491637}{20-4-1} = 432778.2,

\hat{\mu}_{02}^{*} = s_x^{*2} = \frac{\sum(x_i-\bar{x}_{n-r})^2}{n-r-1} = \frac{24344241}{20-4-1} = 1622949.4,

\hat{\mu}_{22}^{*} = \frac{\sum(y_i-\bar{y}_{n-r})^2(x_i-\bar{x}_{n-r})^2}{n-r-1} = \frac{1.2406\times10^{13}}{20-4-1} = 8.270667\times10^{11},

and

\hat{\mu}_{40}^{*} = \frac{\sum(y_i-\bar{y}_{n-r})^4}{n-r-1} = \frac{5.83837\times10^{12}}{20-4-1} = 3.8922\times10^{11}.

From the complete information on the auxiliary variable we have

No.  State  x_i       (x_i-\bar{x})^2  (x_i-\bar{x})^4
05   CA     3928.732  7726987.825   5.97063E+13
07   CT     4.373     1310138.920   1.71646E+12
09   FL     464.516   468499.180    2.19491E+11
13   IL     2610.572  2136233.635   4.56349E+12
19   ME     51.539    1204389.918   1.45056E+12
24   MS     549.551   359322.319    1.29113E+11
25   MO     1519.994  137646.936    18946679008
27   NE     3585.406  5936142.416   3.52378E+13
30   NJ     27.508    1257712.904   1.58184E+12
31   NM     274.035   765539.252    5.86050E+11
34   ND     1241.369  8534.618      72839716.17
36   OK     1716.087  321603.544    1.03429E+11
40   SC     80.750    1141128.152   1.30217E+12
42   TN     388.869   577777.854    3.33827E+11
43   TX     3520.361  5623419.391   3.16228E+13
44   UT     197.244   905812.835    8.20497E+11
46   VA     188.477   922577.539    8.51149E+11
47   WA     1228.607  6339.504      40189306.41
48   WV     29.291    1253716.893   1.57181E+12
49   WI     1372.439  49931.243     2493129048
Sum         22979.720 32113454.88   1.41818E+14

\bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} = \frac{22979.72}{20} = 1148.986, \qquad \hat{\mu}_{02} = s_x^2 = \frac{\sum(x_i-\bar{x})^2}{n-1} = \frac{32113454.88}{20-1} = 1690181.84,

\hat{\mu}_{04} = \frac{\sum(x_i-\bar{x})^4}{n-1} = \frac{1.41818\times10^{14}}{20-1} = 7.46411\times10^{12},

\hat{\lambda}_{40}^{*} = \frac{\hat{\mu}_{40}^{*}}{\hat{\mu}_{20}^{*2}} = \frac{3.8922\times10^{11}}{432778.2^2} = 2.0781, \qquad \hat{\lambda}_{04} = \frac{\hat{\mu}_{04}}{\hat{\mu}_{02}^2} = \frac{7.46411\times10^{12}}{1690181.84^2} = 2.6128,

and

\hat{\lambda}_{22}^{*} = \frac{\hat{\mu}_{22}^{*}}{\hat{\mu}_{20}^{*}\hat{\mu}_{02}^{*}} = \frac{8.270667\times10^{11}}{432778.2\times1622949.4} = 1.1775.

Thus an estimate of the finite population variance of the real estate farm loans is given by

\hat{v}_5 = s_y^{*2}(s_x^2/s_x^{*2}) = 432778.2 \times \frac{1690181.84}{1622949.4} = 450706.50.

An estimate of the $MSE(\hat{v}_5)$ is given by

\widehat{MSE}(\hat{v}_5) = \Big[\Big\{\frac{1}{20\times0.76645+2\times0.23355} - \frac{1}{50}\Big\}(2.0781-1) + \Big\{\frac{1}{20\times0.23355+2\times0.76645} - \frac{1}{20}\Big\}(2.61281+1-2\times1.1775)\Big]\times(432778.2)^2 = 3.4938\times10^{10}.

The $(1-\alpha)100\%$ confidence interval for the finite population variance is given by $\hat{v}_5 \pm t_{\alpha/2}(df = n-r-2)\sqrt{\widehat{MSE}(\hat{v}_5)}$. Using Table 2 from the Appendix, the 75% confidence interval of the finite population variance of the real estate farm loans is given by

\hat{v}_5 \pm t_{0.25/2}(df = 20-4-2)\sqrt{\widehat{MSE}(\hat{v}_5)},

or $450706.50 \pm 1.200\sqrt{3.4938\times10^{10}}$, or [226404.16, 675008.83].
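The plug-in computation above can be checked with a few lines. This sketch evaluates the two leading factors exactly as they appear in the worked solution (note the second factor there is written with $20\times0.23355+2\times0.76645$ in the denominator), so it reproduces the printed figures to rounding.

```python
n, N, r = 20, 50, 4
p, q = 0.23355, 0.76645
s_y2_star = 432778.2      # respondent variance of y  (mu*_20)
s_x2_star = 1622949.4     # respondent variance of x  (mu*_02)
s_x2 = 1690181.84         # full-sample variance of x (mu_02)
lam40_star, lam04, lam22_star = 2.0781, 2.61281, 1.1775

# Ratio-type estimate of the finite population variance, (12.4.3.18).
v5 = s_y2_star * s_x2 / s_x2_star

# MSE estimate with the factors exactly as printed in the worked solution.
term1 = (1 / (n * q + 2 * p) - 1 / N) * (lam40_star - 1)
term2 = (1 / (n * p + 2 * q) - 1 / n) * (lam04 + 1 - 2 * lam22_star)
mse = (term1 + term2) * s_y2_star ** 2

t = 1.200                 # t-value for df = n - r - 2 = 14 at the 75% level
ci = (v5 - t * mse ** 0.5, v5 + t * mse ** 0.5)
```

The result is about 450707 for the point estimate and roughly [226404, 675009] for the interval, matching the text.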

The next section is devoted to a brief discussion of a few imputation techniques for handling non-response in survey sampling.

Important note: Please keep in mind that in the sections that follow, the notation $r$ and $(n-r)$ has a different meaning than in the preceding sections.

In almost all large scale surveys non-response is unavoidable. Several methods are available for handling non-response problems in survey sampling; details are given by Rubin (1987). We consider here the problem of variance estimation in the presence of non-response in a unified setup. Several researchers have suggested methods of estimating the variance of the estimators of total or mean in the presence of non-response in survey sampling. In some cases the probability of response of the $i$-th individual may be known or may be estimated through a logistic regression model. Consider a finite population $\Omega = \{1, 2, \ldots, i, \ldots, N\}$ from which a sample $s$ of size $n$ is selected with probability $p(s)$ according to some sampling design $p$. Let us denote the inclusion probabilities for the $i$-th and the $(i,j)$-th $(i \ne j)$ units by $\pi_i$ and $\pi_{ij}$, respectively. Let $s_r$ be the set of respondent units in $s$. Here we assume that the response sample $s_r(\subset s)$ is selected from $s$ by nature with probability $q(s_r)$. Let $S(r)$ denote the collection of all possible samples of size $r$ that can be selected from $s$; that is, $S(r)$ consists of $\binom{n}{r}$ different samples. Let $\bar{q} = q(s_r)\big[\sum_{s_r\in S(r)} q(s_r)\big]^{-1}$; then $\bar{q}$ forms a sampling design. Let $\psi_i(s,r)$ and $\psi_{ij}(s,r)$ be the inclusion probabilities for the $i$-th and the $(i$-th, $j$-th$)$, $i \ne j$, units. Obviously $\psi_i(s,r)$ and $\psi_{ij}(s,r)$ will depend upon $s$ and $r$, the number of respondents in the sample, for the sampling design $\bar{q}$ defined on $S(r)$. Arnab and Singh (2001) considered the problem of estimation of the population total $Y$ and its variance under two different situations, viz., (i) absence of auxiliary information, and (ii) presence of auxiliary information.

Let $y_i$ be the value of the variable of interest, $y$, for the $i$-th population unit. The well known Horvitz and Thompson (1952) estimator of the population total, $Y$, in the presence of non-response is given by

\hat{Y}_1 = \sum_{i\in s_r}\frac{y_i}{\pi_i\psi_{irs}} = \sum_{i\in s_r}\frac{z_i}{\psi_{irs}},  (12.5.1.1)

where $\psi_{irs} = \psi_i(r,s)$ and $z_i = y_i/\pi_i$. Let $E_s(V_s)$, $E_{r|s}(V_{r|s})$, and $E_{s_r|r,s}(V_{s_r|r,s})$ denote respectively the unconditional expectation (variance) over the initial sample $s$, the conditional expectation (variance) over $r$ given $s$, and the conditional expectation (variance) over $s_r$ given $r$ and $s$. Similarly $E_r(V_r)$ denotes the unconditional expectation (variance) of $r$, the size of the respondent set. Then we have the following theorems.

Theorem 12.5.1.1. The estimator $\hat{Y}_1$ is unbiased for the population total $Y$.

Proof. We have

E(\hat{Y}_1) = E_sE_{r|s}E_{s_r|r,s}\Big[\sum_{i\in s_r}\frac{z_i}{\psi_{irs}}\Big] = E_sE_{r|s}\Big[\sum_{i\in s} z_i\Big] = E_s\Big[\sum_{i\in s}\frac{y_i}{\pi_i}\Big] = Y.

Hence the theorem.
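As a numerical sanity check on the structure of $\hat{Y}_1$, consider the simplest special case with made-up numbers: under SRSWOR, $\pi_i = n/N$, and if the respondents behave like an SRSWOR subsample of $s$, $\psi_{irs} = r/n$, so the estimator collapses to $(N/r)\sum_{i\in s_r} y_i$.

```python
# Hypothetical toy data: N = 10, n = 5, and r = 3 respondents with these y-values.
N, n = 10, 5
y_resp = [3.0, 7.0, 5.0]
r = len(y_resp)

pi = n / N        # first-order inclusion probability under SRSWOR
psi = r / n       # response-inclusion probability for an SRSWOR respondent subsample

# Horvitz-Thompson type estimator (12.5.1.1): sum of y_i / (pi_i * psi_irs).
Y1 = sum(y / (pi * psi) for y in y_resp)
# In this special case Y1 equals (N/r) * sum(y_resp).
```

Here `Y1` equals $10/3 \times 15 = 50$, the familiar expansion estimator based on the respondents alone.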

Theorem 12.5.1.2. The variance of $\hat{Y}_1$ is given by

V(\hat{Y}_1) = E_r\Big[\sum_{i\in\Omega}\frac{y_i^2}{\pi_i^2}\alpha_{ir} + \frac{1}{2}\sum_{i\ne j\in\Omega}\delta_{ijr}\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2\Big] + \frac{1}{2}\sum_{i\ne j\in\Omega}\delta_{ij}\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2,

where $\alpha_{ir} = \delta_{ir} - \sum_{j(\ne i)}\delta_{ijr}$, $\delta_{ir} = \sum_{s\ni i}\big(\frac{1}{\psi_{irs}} - 1\big)p(s)$, $\delta_{ijr} = \sum_{s\ni i,j}\big(1 - \frac{\psi_{ijrs}}{\psi_{irs}\psi_{jrs}}\big)p(s)$, and $\delta_{ij} = (\pi_i\pi_j - \pi_{ij})$.

Proof. Writing $E_{r,s_r|s}(V_{r,s_r|s})$ for the overall expectation (variance) for variation over $r$ and $s_r$ when $s$ is fixed, we have

V(\hat{Y}_1) = E_s[V_{r,s_r|s}(\hat{Y}_1|s)] + V_s[E_{r,s_r|s}(\hat{Y}_1|s)].  (12.5.1.2)

Now

V_s[E_{r,s_r|s}(\hat{Y}_1|s)] = V_s\Big(\sum_{i\in s}\frac{y_i}{\pi_i}\Big) = \frac{1}{2}\sum_{i\ne j\in\Omega}\delta_{ij}\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2  (12.5.1.3)

and

V_{r,s_r|s}(\hat{Y}_1|s) = E_{r|s}V_{s_r|r,s}(\hat{Y}_1|s) + V_{r|s}E_{s_r|r,s}(\hat{Y}_1|s) = E_{r|s}V_{s_r|r,s}\Big(\sum_{i\in s_r}\frac{z_i}{\psi_{irs}}\Big) + V_{r|s}\Big(\sum_{i\in s}\frac{y_i}{\pi_i}\Big)

= E_{r|s}\Big[\sum_{i\in s}z_i^2\Big(\frac{1}{\psi_{irs}}-1\Big) + \sum_{i\ne j\in s}z_iz_j\Big(\frac{\psi_{ijrs}}{\psi_{irs}\psi_{jrs}}-1\Big)\Big],  (12.5.1.4)

noting that $V_{r|s}\big(\sum_{i\in s} y_i/\pi_i\big) = 0$. From (12.5.1.4) we have

E_s[V_{r,s_r|s}(\hat{Y}_1|s)] = E_sE_{r|s}\Big[\sum_{i\in s}z_i^2\Big(\frac{1}{\psi_{irs}}-1\Big) + \sum_{i\ne j\in s}z_iz_j\Big(\frac{\psi_{ijrs}}{\psi_{irs}\psi_{jrs}}-1\Big)\Big]

= E_rE_{s|r}\Big[\sum_{i\in s}z_i^2\Big(\frac{1}{\psi_{irs}}-1\Big) + \sum_{i\ne j\in s}z_iz_j\Big(\frac{\psi_{ijrs}}{\psi_{irs}\psi_{jrs}}-1\Big)\Big].

Now noting that (i) $E_sE_{r|s} = E_rE_{s|r}$ is valid because the $y_i$ values are not involved in the selection of samples, and (ii) $E_{s|r} = E_s$, we obtain

E_s[V_{r,s_r|s}(\hat{Y}_1|s)] = E_r\Big[\sum_{i\in\Omega}\frac{y_i^2}{\pi_i^2}\alpha_{ir} + \frac{1}{2}\sum_{i\ne j\in\Omega}\delta_{ijr}\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2\Big],  (12.5.1.5)

since

\sum_{i\ne j\in\Omega}z_iz_j\delta_{ijr} = -\frac{1}{2}\sum_{i\ne j}\{(z_i-z_j)^2 - (z_i^2+z_j^2)\}\delta_{ijr} = -\frac{1}{2}\sum_{i\ne j}(z_i-z_j)^2\delta_{ijr} + \sum_i z_i^2\sum_{j(\ne i)}\delta_{ijr}.  (12.5.1.6)

From (12.5.1.3) and (12.5.1.5), we have the theorem.

The estimator (12.5.1.1) can be written as

\hat{Y}_2 = \sum_{i\in s_r}d_iy_i, \qquad \text{where } d_i = 1/(\pi_i\psi_i) \text{ (writing } \psi_{irs} = \psi_i).  (12.5.2.1)

Note that here $d_i$ is slightly different from that defined in Chapter 5. Let us assume that an auxiliary variable $x_i$ is available and is positive for every $i$; then we can estimate $\psi_i$ through $x_i$. So, in the presence of auxiliary information, we can choose weights in different stages as follows:

y_1   x_1       1/\pi_1      1/\psi_1      d_1      d_1y_1   d_1x_1   x_1/\pi_1
y_2   x_2       1/\pi_2      1/\psi_2      d_2      d_2y_2   d_2x_2   x_2/\pi_2
...
y_r   x_r       1/\pi_r      1/\psi_r      d_r      d_ry_r   d_rx_r   x_r/\pi_r
      x_{r+1}   1/\pi_{r+1}  1/\psi_{r+1}  d_{r+1}                    x_{r+1}/\pi_{r+1}
...
      x_n       1/\pi_n      1/\psi_n      d_n                        x_n/\pi_n

where $d_i = 1/(\pi_i\psi_i)$.

When the auxiliary variable is available we can propose a calibrated estimator for (12.5.2.1) as

\hat{Y}_g = \sum_{i\in s_r}w_iy_i.  (12.5.2.2)

The calibrated weights $w_i$ are such that the chi square distance function owed to Deville and Sarndal (1992), defined as

\sum_{i\in s_r}\frac{(w_i-d_i)^2}{d_iq_i},  (12.5.2.3)

is minimum subject to the constraint

\sum_{i\in s_r}w_ix_i = \sum_{i\in s}\frac{x_i}{\pi_i} = \hat{X} \ \text{(say)},  (12.5.2.4)

where the $q_i$ are suitably chosen weights. The calibration equation (12.5.2.4) is similar to that used by Dupont (1995) and Hidiroglou and Sarndal (1995, 1998) for two-phase sampling. The choice of $q_i$ leads to different forms of estimators of the population total. Minimization of (12.5.2.3) subject to (12.5.2.4) leads to the calibrated weights

w_i = d_i + \frac{d_iq_ix_i}{\sum_{i\in s_r}d_iq_ix_i^2}\Big[\hat{X} - \sum_{i\in s_r}d_ix_i\Big].  (12.5.2.5)

On substituting (12.5.2.5) in (12.5.2.2) we obtain a general linear regression estimator (GREG) in the presence of non-response given by

\hat{Y}_g = \sum_{i\in s_r}w_iy_i = \sum_{i\in s_r}d_ie_i + \hat{b}\hat{X},  (12.5.2.6)

where $\hat{b} = \sum_{i\in s_r}d_iq_ix_iy_i\big/\sum_{i\in s_r}d_iq_ix_i^2$ and $e_i = y_i - \hat{b}x_i$. Thus an estimator of $V(\hat{Y}_g)$ is given by

\hat{v}(\hat{Y}_g)_{stg=0} = \sum_{i\in s_r}b_{iis}(r)\frac{e_i^2}{\pi_i^2} + \frac{1}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2.  (12.5.2.7)
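A minimal sketch of the calibration step (12.5.2.5), with made-up design weights $d_i$, auxiliary values $x_i$, $q_i = 1$, and an assumed benchmark $\hat{X}$; by construction the calibrated weights reproduce the constraint $\sum w_i x_i = \hat{X}$ exactly.

```python
# Hypothetical respondent data: design weights d_i and auxiliary values x_i.
d = [2.0, 3.0, 1.5, 2.5]
x = [4.0, 1.0, 3.0, 2.0]
q = [1.0] * len(d)   # the simplest choice of the tuning weights q_i
X_hat = 25.0         # assumed calibration target sum_{i in s} x_i / pi_i

t_x = sum(di * xi for di, xi in zip(d, x))                      # sum d_i x_i
denom = sum(di * qi * xi * xi for di, qi, xi in zip(d, q, x))   # sum d_i q_i x_i^2

# Chi-square-distance calibrated weights, equation (12.5.2.5).
w = [di + di * qi * xi * (X_hat - t_x) / denom
     for di, qi, xi in zip(d, q, x)]

calibrated_total = sum(wi * xi for wi, xi in zip(w, x))   # equals X_hat exactly
```

The same adjustment, with other choices of `q`, yields the ratio and regression special cases discussed later in the section.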

Replacing $1/\pi_i$ with $w_i\psi_i$, we consider a first stage calibrated estimator of variance in the presence of non-response as

\hat{v}(\hat{Y}_g)_{stg=1} = \sum_{i\in s_r}b_{iis}(r)w_i^2\psi_i^2e_i^2 + \frac{1}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)(w_i\psi_ie_i - w_j\psi_je_j)^2.  (12.5.3.1)

We consider a second stage calibrated estimator of variance of the GREG as

\hat{v}(\hat{Y}_g)_{stg=2} = \sum_{i\in s_r}w_{iis}(r)w_i^2\psi_i^2e_i^2 + \frac{1}{2}\sum_{i\ne j\in s_r}w_{ijs}(r)(w_i\psi_ie_i - w_j\psi_je_j)^2.  (12.5.3.2)

The second stage weights $w_{ijs}(r)$ are obtained such that the distance function

\sum_{i\in s_r}\frac{(w_{iis}(r)-b_{iis}(r))^2}{b_{iis}(r)Q_{iisr}} + \frac{1}{2}\sum_{i\ne j\in s_r}\frac{(w_{ijs}(r)-b_{ijs}(r))^2}{b_{ijs}(r)Q_{ijsr}}  (12.5.3.3)

is minimum subject to

\sum_{i\in s_r}w_{iis}(r)\frac{x_i^2}{\pi_i^2} + \frac{1}{2}\sum_{i\ne j\in s_r}w_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2 = \hat{v}_2(x),  (12.5.3.4)

where

\hat{v}_2(x) = \sum_{i\in s}\frac{\alpha_{ir}x_i^2}{\pi_i^3} + \frac{1}{2}\sum_{i\ne j\in s}\frac{\delta_{ijr}+\delta_{ij}}{\pi_{ij}}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2.  (12.5.3.5)

Now we have the following theorem.

Theorem 12.5.3.1. The second stage calibrated weights obtained by minimizing (12.5.3.3) subject to (12.5.3.4) are

w_{iis}(r) = b_{iis}(r) + \frac{\hat{v}_2(x)-\hat{v}_1(x)}{R(x)}\,b_{iis}(r)Q_{iisr}\frac{x_i^2}{\pi_i^2}  (12.5.3.6)

and

w_{ijs}(r) = b_{ijs}(r) + \frac{\hat{v}_2(x)-\hat{v}_1(x)}{R(x)}\,b_{ijs}(r)Q_{ijsr}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2,  (12.5.3.7)

with

\hat{v}_1(x) = \sum_{i\in s_r}b_{iis}(r)\frac{x_i^2}{\pi_i^2} + \frac{1}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2

and

R(x) = \sum_{i\in s_r}b_{iis}(r)Q_{iisr}\frac{x_i^4}{\pi_i^4} + \frac{1}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)Q_{ijsr}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^4.

Proof. Let

\Phi = \sum_{i\in s_r}\frac{(w_{iis}(r)-b_{iis}(r))^2}{b_{iis}(r)Q_{iisr}} + \frac{1}{2}\sum_{i\ne j\in s_r}\frac{(w_{ijs}(r)-b_{ijs}(r))^2}{b_{ijs}(r)Q_{ijsr}} - 2\lambda\Big[\sum_{i\in s_r}w_{iis}(r)\frac{x_i^2}{\pi_i^2} + \frac{1}{2}\sum_{i\ne j\in s_r}w_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2 - \hat{v}_2(x)\Big].  (12.5.3.8)

Now $\partial\Phi/\partial w_{iis}(r) = 0$ implies that

w_{iis}(r) = b_{iis}(r) + \lambda b_{iis}(r)Q_{iisr}\frac{x_i^2}{\pi_i^2},

which implies that

\sum_{i\in s_r}w_{iis}(r)\frac{x_i^2}{\pi_i^2} = \lambda\sum_{i\in s_r}b_{iis}(r)Q_{iisr}\frac{x_i^4}{\pi_i^4} + \sum_{i\in s_r}b_{iis}(r)\frac{x_i^2}{\pi_i^2}.  (12.5.3.9)

Again $\partial\Phi/\partial w_{ijs}(r) = 0$ implies that

w_{ijs}(r) = b_{ijs}(r) + \lambda b_{ijs}(r)Q_{ijsr}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2,  (12.5.3.10)

which implies that

\frac{1}{2}\sum_{i\ne j\in s_r}w_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2 = \frac{\lambda}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)Q_{ijsr}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^4 + \frac{1}{2}\sum_{i\ne j\in s_r}b_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2.

Adding (12.5.3.9) and (12.5.3.10) and using the constraint (12.5.3.4), we have $\lambda = \{\hat{v}_2(x)-\hat{v}_1(x)\}/R(x)$, which proves the theorem.

Now we discuss special cases of the above strategy as follows:

If the sub-sample $s_r$ is selected by SRSWOR sampling, that is, $\psi_{irs} = r/n$ and $\psi_{ijrs} = \{r(r-1)\}/\{n(n-1)\}$, then $\delta_{ijr} = \{(n-r)/(r(n-1))\}\pi_{ij}$, $\delta_{ir} = \{(n-r)/r\}\pi_i$, $\alpha_{ir} = 0$,

b_{iis}(r) = 0, \qquad b_{ijs}(r) = \frac{n(n-1)}{r(r-1)}\Big\{\frac{n-r}{r(n-1)} + \frac{\delta_{ij}}{\pi_{ij}}\Big\} = \Delta_{ij} \ \text{(say)},

\hat{v}_1(x) = \frac{1}{2}\sum_{i\ne j\in s_r}\Delta_{ij}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2, \qquad \hat{v}_2(x) = \frac{r(r-1)}{2n(n-1)}\sum_{i\ne j\in s}\Delta_{ij}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2,

and

R(x) = \frac{1}{2}\sum_{i\ne j\in s_r}\Delta_{ij}Q_{ijs}(r)\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^4.

Case I. If $w_i = d_i$, then $e_i = y_i$ and

\hat{Y}_g = \hat{Y}_{HT} = \frac{n}{r}\sum_{i\in s_r}\frac{y_i}{\pi_i}.

Then the three estimators of variance are

\hat{v}(\hat{Y}_{HT})_{stg=0} = \frac{1}{2}\sum_{i\ne j\in s_r}\Delta_{ij}\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2, \qquad \hat{v}(\hat{Y}_{HT})_{stg=1} = \hat{v}(\hat{Y}_{HT})_{stg=0},

and

\hat{v}(\hat{Y}_{HT})_{stg=2} = \frac{1}{2}\sum_{i\ne j\in s_r}w_{ijs}(r)\Big(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\Big)^2,

where

w_{ijs}(r) = \Delta_{ij}\Big(1 + \frac{\hat{v}_2(x)-\hat{v}_1(x)}{R(x)}Q_{ijsr}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2\Big).

Case II. If $q_i = x_i^{-1}$ and $Q_{ij} = \big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\big)^{-2}$ we have

w_i = \frac{1}{\pi_i}\Big\{\sum_{i\in s_r}\frac{x_i}{\pi_i}\Big\}^{-1}\Big\{\sum_{i\in s}\frac{x_i}{\pi_i}\Big\}, \qquad e_i = y_i - \Big\{\sum_{i\in s_r}\frac{y_i}{\pi_i}\Big\}\Big\{\sum_{i\in s_r}\frac{x_i}{\pi_i}\Big\}^{-1}x_i,

and

w_{ijs}(r) = \Delta_{ij}\,\frac{r(r-1)}{n(n-1)}\sum_{i\ne j\in s}\Delta_{ij}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2\Big\{\sum_{i\ne j\in s_r}\Delta_{ij}\Big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\Big)^2\Big\}^{-1}.

In this case $\hat{Y}_g$ becomes an extension of the Horvitz and Thompson (1952) estimator,

\hat{Y}_{HTR} = \Big(\frac{n}{r}\sum_{i\in s_r}\frac{y_i}{\pi_i}\Big)\Big(\frac{n}{r}\sum_{i\in s_r}\frac{x_i}{\pi_i}\Big)^{-1}\Big(\sum_{i\in s}\frac{x_i}{\pi_i}\Big).

The three estimators of variance of the ratio estimator $\hat{Y}_{HTR}$ are given by

\hat{v}(\hat{Y}_{HTR})_{stg=0} = \frac{1}{2}\sum_{i\ne j\in s_r}\Delta_{ij}\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2,

with the stage one and stage two versions following the same pattern as in Case I.

Case III. If $q_i = Q_{ij} = 1$ then

w_i = \frac{n}{r}\frac{1}{\pi_i} + \frac{x_i}{\pi_i}\Big\{\sum_{i\in s_r}\frac{x_i^2}{\pi_i}\Big\}^{-1}\Big[\sum_{i\in s}\frac{x_i}{\pi_i} - \frac{n}{r}\sum_{i\in s_r}\frac{x_i}{\pi_i}\Big],

\hat{b} = \sum_{i\in s_r}\frac{x_iy_i}{\pi_i}\Big\{\sum_{i\in s_r}\frac{x_i^2}{\pi_i}\Big\}^{-1}, \quad \text{and} \quad e_i = y_i - \hat{b}x_i.

In this case $\hat{Y}_g$ becomes the regression estimator of the following form

\hat{Y}_{HT(lr)} = \frac{n}{r}\sum_{i\in s_r}\frac{y_i}{\pi_i} + \hat{b}\Big[\sum_{i\in s}\frac{x_i}{\pi_i} - \frac{n}{r}\sum_{i\in s_r}\frac{x_i}{\pi_i}\Big].

The three estimators of variance of the regression estimator are given by

\hat{v}(\hat{Y}_{HT(lr)})_{stg=0} = \frac{1}{2}\sum_{i\ne j\in s_r}\Delta_{ij}\Big(\frac{e_i}{\pi_i}-\frac{e_j}{\pi_j}\Big)^2,

\hat{v}(\hat{Y}_{HT(lr)})_{stg=1} = \frac{1}{2}\Big(\frac{r}{n}\Big)^2\sum_{i\ne j\in s_r}\Delta_{ij}(w_ie_i - w_je_j)^2,

and the stage two version obtained by replacing $\Delta_{ij}$ with the calibrated weights $w_{ijs}(r)$.

Case I. If $w_i = d_i$ then

\hat{Y}_g = \frac{N}{r}\sum_{i\in s_r}y_i = \hat{Y}_U

and

\hat{v}(\hat{Y}_U)_{stg=0} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)s_{yr}^2 = \hat{v}(\hat{Y}_U)_{stg=1}.

Now

\hat{v}(\hat{Y}_U)_{stg=2} = \frac{N^2}{2n^2}\sum_{i\ne j\in s_r}w_{ijs}(r)(y_i - y_j)^2.

If $Q_{ijs}(r) = \big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\big)^{-2}$ then we have

w_{ijs}(r) = \frac{n^2}{r(r-1)}\Big(\frac{1}{r}-\frac{1}{N}\Big)\frac{s_{xn}^2}{s_{xr}^2},

and hence

\hat{v}(\hat{Y}_U)_{stg=2} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)s_{yr}^2\frac{s_{xn}^2}{s_{xr}^2},

which is a ratio type of estimator similar to those studied by Isaki (1983), Garcia and Cebrian (1996), and Singh and Joarder (1998).

In the case $Q_{ij} = 1$ we then have

\hat{v}(\hat{Y}_U)_{stg=2} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)\big[s_{yr}^2 + \hat{B}(s_{xn}^2 - s_{xr}^2)\big],

where $\hat{B} = \sum_{i\ne j\in s_r}(x_i-x_j)^2(y_i-y_j)^2\big/\sum_{i\ne j\in s_r}(x_i-x_j)^4$, which is again a regression type estimator similar to those studied by Isaki (1983), Garcia and Cebrian (1996), and Singh and Joarder (1998).

Case II. If we take $q_i = x_i^{-1}$, then $w_i = \dfrac{N\bar{x}_n}{r\bar{x}_r}$ and

\hat{Y}_g = \hat{Y}_R = N\bar{y}_r\frac{\bar{x}_n}{\bar{x}_r},

which is a ratio estimator of the population total. In this case $e_i = y_i - (\bar{y}_r/\bar{x}_r)x_i$ and we get three estimators of variance, the first of which is

\hat{v}(\hat{Y}_R)_{stg=0} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)\frac{1}{r-1}\sum_{i\in s_r}e_i^2,

with the stage one and stage two versions obtained as before.

Case III. If we take $q_i = 1$ then

w_i = \frac{N}{r} + \frac{Nx_i(\bar{x}_n-\bar{x}_r)}{\sum_{i\in s_r}x_i^2}, \qquad \hat{b} = \sum_{i\in s_r}x_iy_i\Big/\sum_{i\in s_r}x_i^2, \qquad e_i = y_i - \hat{b}x_i,

and

\hat{Y}_g = \hat{Y}_{lr} = N[\bar{y}_r + \hat{b}(\bar{x}_n - \bar{x}_r)]

is a regression type estimator for the finite population total studied by Singh, Joarder, and Tracy (2000). In this situation a few estimators of variance are

\hat{v}(\hat{Y}_{lr})_{stg=0} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)\frac{1}{r-1}\sum_{i\in s_r}(e_i-\bar{e})^2, \quad \text{where } \bar{e} = \frac{1}{r}\sum_{i\in s_r}e_i,

and

\hat{v}(\hat{Y}_{lr})_{stg=1} = N^2\Big(\frac{1}{r}-\frac{1}{N}\Big)\Big[\frac{1}{r-1}\sum_{i\in s_r}(e_i-\bar{e})^2 + \hat{\gamma}_1(\bar{x}_n-\bar{x}_r)^2 + \hat{\gamma}_2(\bar{x}_n-\bar{x}_r)\Big],

where $\hat{\gamma}_1$ and $\hat{\gamma}_2$ are constants depending on $\big(\sum_{i\in s}x_i\big)^2$, $(r-1)$, and $\sum_{i\in s_r}x_i^2$.

If $Q_{ijs}(r) = \big(\frac{x_i}{\pi_i}-\frac{x_j}{\pi_j}\big)^{-2} = \frac{n^2}{N^2}(x_i-x_j)^{-2}$, then the corresponding second stage estimator follows on substituting this choice in (12.5.3.7).

Consider $\bar{Y} = N^{-1}\sum_{i=1}^{N}y_i$, the mean of the finite population $\Omega = \{1, 2, \ldots, i, \ldots, N\}$. An SRSWOR sample $s$ of size $n$ is drawn from $\Omega$ to estimate $\bar{Y}$. Let $r$ be the number of responding units out of the $n$ sampled units. Let the set of responding units be denoted by $A$ and that of non-responding units by $\bar{A}$, such that $s = A \cup \bar{A}$. For every unit $i \in A$ the value $y_i$ is observed. However, for the units $i \in \bar{A}$ the $y_i$ values are missing and imputed values are derived. If $\hat{y}_i$ denotes the imputed value for the $i$-th non-responding unit, then the general method of imputation can be expressed as:

y_{.i} = \begin{cases} y_i & \text{if } i\in A, \\ \hat{y}_i & \text{if } i\in\bar{A}. \end{cases}  (12.6.1)

The general point estimator of the population mean takes the form

\bar{y}_s = \frac{1}{n}\Big[\sum_{i=1}^{r}y_i + \sum_{i=1}^{n-r}\hat{y}_i\Big],  (12.6.2)

where the value of $\hat{y}_i$ is different for each imputation method.
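Whatever imputation rule supplies the $\hat{y}_i$, the estimator (12.6.2) is just the sample mean over observed and imputed values together; a one-function sketch:

```python
def point_estimate(y_observed, y_imputed):
    """Post-imputation point estimator (12.6.2): mean of the r observed
    values and the n - r imputed values taken together."""
    n = len(y_observed) + len(y_imputed)
    return (sum(y_observed) + sum(y_imputed)) / n
```

Any of the methods that follow (ratio, mean, hot deck, nearest neighbour, regression) only changes what is passed in as `y_imputed`.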



In the ratio method of imputation, we assume that imputation is carried out with the aid of an auxiliary variable $x$, such that $x_i$, the value of $x$ for unit $i$, is known and positive for every $i \in s$. In other words, the data $x_s = \{x_i : i \in s\}$ are known. Following the notation of Lee, Rancourt, and Sarndal (1994, 1995), in the case of single value imputation, if the $i$-th unit requires imputation, the value $\hat{b}x_i$ is imputed, where $\hat{b} = \sum_{i\in A}y_i\big/\sum_{i\in A}x_i$. The data after imputation become

y_{.i} = \begin{cases} y_i & \text{if } i\in A, \\ \hat{b}x_i & \text{if } i\in\bar{A}. \end{cases}  (12.6.3)

This method of imputation is called the ratio method of imputation. Thus we have the following theorem:

Theorem 12.6.1. Under the ratio method of imputation the point estimator (12.6.2) of the population mean becomes

\bar{y}_{RAT} = \bar{y}_r\Big(\frac{\bar{x}_n}{\bar{x}_r}\Big),  (12.6.4)

where $\bar{x}_n = n^{-1}\sum_{i=1}^{n}x_i$, $\bar{x}_r = r^{-1}\sum_{i=1}^{r}x_i$, and $\bar{y}_r = r^{-1}\sum_{i=1}^{r}y_i$.

12.6.2 MEAN METHOD OF IMPUTATION

Under the mean method of imputation the data after imputation become

y_{.i} = \begin{cases} y_i & \text{if } i\in A, \\ \bar{y}_r & \text{if } i\in\bar{A}. \end{cases}  (12.6.5)

Thus we have the following theorem:

Theorem 12.6.2. Under the mean method of imputation the point estimator (12.6.2) of the population mean is

\bar{y}_m = \frac{1}{r}\sum_{i=1}^{r}y_i = \bar{y}_r.  (12.6.6)

Under the HD (hot deck) method the data after imputation take the form

y_{.i} = \begin{cases} y_i & \text{if } i\in A, \\ y_{g(i)} & \text{if } i\in\bar{A}, \end{cases}  (12.6.7)

where $y_{g(i)}$ is the $y$ value given by the donor unit $g(i) \in A$, drawn at random (with replacement) from the $r$ responding units. Thus we have the following theorem:

Theorem 12.6.3. Under the HD method of imputation the point estimator (12.6.2) of the population mean becomes

\bar{y}_{HD} = \frac{1}{n}\Big[\sum_{i=1}^{r}y_i + \sum_{i=1}^{n-r}y_{g(i)}\Big].  (12.6.8)

Under the NN (nearest neighbour) method the data after imputation become

y_{.i} = \begin{cases} y_i & \text{if } i\in A, \\ y_{g(i)} & \text{if } i\in\bar{A}, \end{cases}  (12.6.9)

where $y_{g(i)}$ is the $y$ value given by the donor unit $g(i)$ such that $\min_{g\in A}|x_g - x_i|$ occurs for $g = g(i)$. If this results in more than one unit, a donor is randomly selected from among them. More detail can be had from Chen and Shao (2001). Thus we have the following theorem:

Theorem 12.6.4. Under the NN method of imputation the point estimator (12.6.2) of the population mean becomes

\bar{y}_{NN} = \frac{1}{n}\Big[\sum_{i=1}^{r}y_i + \sum_{i=1}^{n-r}y_{g(i)}\Big].  (12.6.10)

Example 12.6.1. A bank committee selected an SRSWOR sample of twenty states from the population 1 given in the Appendix and collected information about the real and nonreal estate farm loans. Unfortunately the information on the real estate farm loans was not available for six states, as marked in the table below.

No.  State  x_i       y_i      | No.  State  x_i       y_i
02   AK     3.433     2.605    | 27   NE     3585.406  1337.852
07   CT     4.373     Missing  | 28   NV     16.710    Missing
08   DE     43.229    42.808   | 30   NJ     27.508    Missing
11   HI     38.067    40.775   | 31   NM     274.035   140.582
18   LA     405.799   282.565  | 33   NC     494.730   639.571
19   ME     51.539    Missing  | 37   OR     571.487   114.899
20   MD     57.684    139.628  | 42   TN     388.869   553.266
22   MI     440.518   323.028  | 47   WA     1228.607  Missing
24   MS     549.551   627.013  | 48   WV     29.291    99.277
25   MO     1519.994  Missing  | 49   WI     1372.439  1229.752

x = Nonreal estate farm loans; y = Real estate farm loans.

There is 30% non-response on the real estate farm loans. Impute the missing values using the different methods of imputation.

Solution. (a) Ratio method of imputation: We have

\hat{b} = \sum_{i\in A}y_i\Big/\sum_{i\in A}x_i = \frac{5573.621}{8254.538} = 0.6752,

thus the imputed values are given by

State  x_i       Imputed value \hat{b}x_i
CT     4.373     2.95265
ME     51.539    34.79913
MO     1519.994  1026.30000
NV     16.710    11.28259
NJ     27.508    18.57340
WA     1228.607  829.55540

(b) Mean method of imputation: Here we assume that no auxiliary information is available. The missing values are replaced with the mean of the responding units on the study variable only.

State  Imputed value \bar{y}_r
CT     398.11
ME     398.11
MO     398.11
NV     398.11
NJ     398.11
WA     398.11
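The ratio and mean imputations of parts (a) and (b) can be reproduced directly from the respondent data:

```python
# Respondent (x, y) pairs from Example 12.6.1 and the x-values of the six
# non-responding states.
resp = [(3.433, 2.605), (43.229, 42.808), (38.067, 40.775), (405.799, 282.565),
        (57.684, 139.628), (440.518, 323.028), (549.551, 627.013),
        (3585.406, 1337.852), (274.035, 140.582), (494.730, 639.571),
        (571.487, 114.899), (388.869, 553.266), (29.291, 99.277),
        (1372.439, 1229.752)]
x_missing = {'CT': 4.373, 'ME': 51.539, 'MO': 1519.994,
             'NV': 16.710, 'NJ': 27.508, 'WA': 1228.607}

# Ratio imputation (12.6.3): b_hat = sum of respondent y over sum of respondent x.
b_hat = sum(y for _, y in resp) / sum(x for x, _ in resp)   # about 0.6752
ratio_imputed = {st: b_hat * xv for st, xv in x_missing.items()}

# Mean imputation (12.6.5): every missing value becomes the respondent mean of y.
y_bar_r = sum(y for _, y in resp) / len(resp)               # about 398.11
mean_imputed = {st: y_bar_r for st in x_missing}
```

The ratio-imputed value for CT, for instance, comes out as roughly 2.953, matching the table to rounding.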

(c) Hot deck method of imputation: Here again no auxiliary information is used for imputation. We have to select 6 values from the responding units by using SRSWOR sampling. There are in total 14 responses available on the study variable and 6 responses are missing. We used the first two columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 6 distinct random numbers between 1 and 14. The distinct random numbers are 01, 04, 05, 03, 14, and 06. Thus the imputed values are given as

State  Imputed value
CT     2.605
ME     282.565
MO     139.628
NV     40.775
NJ     1229.752
WA     323.028
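A sketch of the hot-deck step: the text draws its 6 distinct donors from a pseudo-random-number table, so a seeded generator is used here instead and the particular donors differ from the text's; every imputed value is still one of the 14 respondent $y$-values.

```python
import random

# Respondent y-values from Example 12.6.1 (14 responding states, in table order).
y_resp = [2.605, 42.808, 40.775, 282.565, 139.628, 323.028, 627.013,
          1337.852, 140.582, 639.571, 114.899, 553.266, 99.277, 1229.752]
missing_states = ['CT', 'ME', 'MO', 'NV', 'NJ', 'WA']

# Draw 6 distinct donor positions (SRSWOR among respondents, as in the example;
# the general HD scheme (12.6.7) would draw with replacement).
rng = random.Random(42)   # fixed seed so the sketch is reproducible
donors = rng.sample(range(len(y_resp)), len(missing_states))
hd_imputed = {st: y_resp[i] for st, i in zip(missing_states, donors)}
```

With a different seed (or the PRN table of the Appendix) the donors, and hence the imputed values, change; only their distribution is controlled.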

(d) Nearest neighbour method: Here we use the minimum absolute difference between the values of the nonreal estate farm loans of two states to impute the missing real estate farm loan value of the other state.

State  x_i       Imputed value
CT     4.373     2.605
ME     51.539    139.628
MO     1519.994  1229.752
NV     16.710    39.860
NJ     27.508    99.277
WA     1228.607  1229.752
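The nearest-neighbour donors can be found mechanically. The helper below returns, for a missing unit's $x$-value, the $y$-value of the respondent whose $x$ is closest; note that a strict search over respondents only can differ from a printed entry when the nearest sampled state is itself a non-respondent, and ties would need the random tie-break described above.

```python
# Respondent (x, y) pairs from Example 12.6.1.
resp = [(3.433, 2.605), (43.229, 42.808), (38.067, 40.775), (405.799, 282.565),
        (57.684, 139.628), (440.518, 323.028), (549.551, 627.013),
        (3585.406, 1337.852), (274.035, 140.582), (494.730, 639.571),
        (571.487, 114.899), (388.869, 553.266), (29.291, 99.277),
        (1372.439, 1229.752)]

def nn_impute(x_val):
    """Nearest-neighbour donor (12.6.9): y-value of the respondent whose
    x is closest in absolute difference to x_val."""
    return min(resp, key=lambda pair: abs(pair[0] - x_val))[1]
```

For example, ME ($x = 51.539$) is closest to MD ($x = 57.684$), so it receives MD's $y$-value 139.628, as in the table.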

( e ) Regression method of imputation: Here we use the model $y_i = a + b x_i + e_i$. The values of a and b are estimated by the method of least squares. In this case the fitted line on the responding units is given by
$$\hat{y}_i = 76.89265 + 0.071471\, x_i$$

and the imputed values are given in the following table

State   x_i        Imputed value
CT      4.373      77.2051
ME      51.539     80.5761
MO      1519.994   185.5260
NV      16.710     78.0869
NJ      27.508     78.8586
WA      1228.607   164.7010
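Regression imputation can be sketched as below: fit the least-squares line on the respondents, then predict each missing y from its x. The data in the test are illustrative, not the book's.

```python
# Regression imputation: fit y = a + b*x by ordinary least squares on the
# respondents, then impute missing y as a + b*x_i.

def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b                    # (intercept a, slope b)

def regression_impute(x, y):
    resp = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = ols_fit([p[0] for p in resp], [p[1] for p in resp])
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
```

With a perfect linear relation in the respondents, the imputed value lies exactly on the fitted line.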

The imputation of missing data appeals to one or more model assumptions. The imputed values are assumed, on average, to be good substitutes for the missing values. The difference between the unobserved true value $y_i$ and its imputed value $\hat{y}_i$ is assumed, on average, to be zero. These assumptions are used to find the estimator of the variance under different imputation mechanisms. Thus we have three different model components here:

( a ) the response mechanism;


( b ) the regression relationship;
and
( c ) the variance structure for the regression.

Assuming that the response mechanism is uniform, we have considered the following model
$$M:\; y_i = \beta x_i + \varepsilon_i. \qquad (12.7.1)$$
The variance of the estimator of the population total based on the data obtained after imputation consists of three components. An estimator of the population total Y based on the imputed data is given by
$$\hat{Y}_{\bullet s} = \frac{N}{n}\sum_{i=1}^{n} y_{\bullet i} = \frac{N}{n}\left[\sum_{i=1}^{r} y_i + \sum_{i=1}^{n-r} \hat{y}_i\right] \qquad (12.7.2)$$

and that based on full information is given by
$$\hat{Y}_s = \frac{N}{n}\sum_{i=1}^{n} y_i. \qquad (12.7.3)$$

Now we have
$$\hat{Y}_{\bullet s} - Y = \left(\hat{Y}_s - Y\right) + \left(\hat{Y}_{\bullet s} - \hat{Y}_s\right). \qquad (12.7.4)$$
Squaring both sides of (12.7.4) and taking expected values we have
$$E\left[\hat{Y}_{\bullet s} - Y\right]^2 = E_p E_q\left[\hat{Y}_s - Y\right]^2 + E_p E_q\left[\hat{Y}_{\bullet s} - \hat{Y}_s\right]^2 + 2E_p E_q\left[\hat{Y}_s - Y\right]\left[\hat{Y}_{\bullet s} - \hat{Y}_s\right], \qquad (12.7.5)$$

where E p and E q are the expectation operators with respect to the sampling design
p and the response mechanism q respectively . Thus we can say

$$V_{TOT}\left(\hat{Y}_{\bullet s}\right) = V_{SAM} + V_{IMP} + V_{MIX} = V_{ORD} + V_{DIF} + V_{IMP} + V_{MIX}, \qquad (12.7.6)$$
where
$$V_{SAM} = E_p\left\{\hat{Y}_s - Y\right\}^2,\quad V_{IMP} = E_p E_q\left\{\hat{Y}_{\bullet s} - \hat{Y}_s\right\}^2,\quad \text{and}\quad V_{MIX} = E_p E_q\left\{\hat{Y}_{\bullet s} - \hat{Y}_s\right\}\left\{\hat{Y}_s - Y\right\}.$$
Thus an estimator of $V\left(\hat{Y}_{\bullet s}\right)$ can easily be obtained as
$$v\left(\hat{Y}_{\bullet s}\right) = v_{ORD} + v_{DIF} + v_{IMP} + v_{MIX}, \qquad (12.7.7)$$
where $v_{ORD} = \dfrac{N^2(1-f)}{n}\, s_{y\bullet s}^2$ with $s_{y\bullet s}^2 = (n-1)^{-1}\sum_{i=1}^{n}\left(y_{\bullet i} - n^{-1}\sum_{i=1}^{n} y_{\bullet i}\right)^2$.
Note that an estimator of the sampling variance is also given by
$$v_{SAM} = v_{ORD} + v_{DIF}, \qquad (12.7.8)$$
where the correction term $v_{DIF}$ should be constructed such that
$$E_m\left\{v_{DIF}\right\} = \frac{N^2(1-f)}{n}\, E_m\left[s_{ys}^2 - s_{y\bullet s}^2\right]. \qquad (12.7.9)$$

We shall now present the three components $V_{DIF}$, $V_{IMP}$ and $V_{MIX}$ for each of the four types of imputation methods. Let us define $s_d = d \cap s$, $r_d = d \cap R$, and $l_d = d \cap R^c$. Then we have the following lemmas:

Lemma 12.7.1. For the ratio method of imputation, under model M the different
components of the variance estimators are given by

Lemma 12.7.2. For the mean method of imputation, $x_i = 1\ \forall i$; then under model M, the different components of the variance estimators are given by
$$v_{IMP} = \frac{N^2}{n^2}\, l_d \left\{\frac{l_d}{m} + 1\right\}\hat{\sigma}^2,$$
and
$$v_{MIX} = \frac{N^2(1-f)}{n^2}\, l_d \left\{\frac{l_d}{m} - 1\right\}\hat{\sigma}^2,$$
with $\hat{\sigma}^2 = \dfrac{m-1}{m}\, s_{yr}^2$ and $s_{yr}^2 = (m-1)^{-1}\sum_{i \in R}\left(y_i - \bar{y}_r\right)^2$, which, in fact, is a special case of the ratio method of imputation for $x_i = 1$.

Lemma 12.7.3. Under the NN method of imputation and the model M we have

and

Lemma 12.7.4. Under the Hot Deck method of imputation and the model M we have
$$v_{DIF} = 0,$$
$$v_{IMP} = \frac{N^2}{n^2}\, l_d \left(\frac{l_d}{m} + 2\right)\hat{\sigma}^2,$$
and
$$v_{MIX} = \frac{N^2(1-f)}{n^2}\, l_d \left\{\frac{l_d}{m} - 1\right\}\hat{\sigma}^2,$$
with $\hat{\sigma}^2 = \dfrac{m-1}{m}\, s_{yr}^2$ and $s_{yr}^2 = (m-1)^{-1}\sum_{i \in R}\left(y_i - \bar{y}_r\right)^2$, which, in fact, is a special case of the NN method of imputation for $x_i = 1$.

We know that the principle of the Jackknife technique is to re-calculate the estimator after deleting a unit from the sample. The variance between the re-computed estimators is used to obtain an estimator of the variance of the estimator calculated using the complete sample. After the deletion of the j-th unit, the estimator for the whole population is given by
$$\hat{Y}_{\bullet s}^{(j)} = \frac{N}{n-1}\sum_{i \ne j,\, i \in s} y_{\bullet i}, \qquad (12.8.1)$$
where the superscript (j) denotes that the j-th unit was deleted. This is performed for all units $j \in s$. The modified Jackknife estimator of variance is
$$v_J = \frac{n-1}{n}\sum_{j \in s}\left\{\hat{Y}_{\bullet s}^{(j)} - \hat{Y}_{\bullet s}\right\}^2. \qquad (12.8.2)$$

For data sets containing imputed values this estimator does not take the imputation into account. Rao and Shao (1992) proposed a Jackknife variance estimator that corrects the estimator by adjusting the imputed values when the deleted unit is in the response set. For some imputation methods the adjusted values are the re-imputed values based on the reduced response set after deletion of the j-th unit. If the j-th unit deleted is a non-respondent the imputed values are unchanged. The data set after adjusting the imputed values is

$$y_{\bullet i}^{(j)} = \begin{cases} y_i & \text{if } i \in R, \\ \hat{y}_i + a_i^{(j)} & \text{if } i \in R^c,\ j \in R, \\ \hat{y}_i & \text{if } i \in R^c,\ j \in R^c, \end{cases} \qquad (12.8.3)$$
where $y_{\bullet i}^{(j)}$ is the adjusted imputed value and $a_i^{(j)}$ is called the adjustment. The Jackknife variance estimator is then

$$v_J = \frac{n-1}{n}\sum_{j \in s}\left\{\hat{Y}_{\bullet s}^{a(j)} - \hat{Y}_{\bullet s}^{a}\right\}^2, \qquad (12.8.4)$$
where $\hat{Y}_{\bullet s}^{a(j)} = \dfrac{N}{n-1}\sum_{i \ne j,\, i \in s} y_{\bullet i}^{(j)}$ and $\hat{Y}_{\bullet s}^{a} = \dfrac{N}{n}\sum_{i \in s} y_{\bullet i}$ are the estimators of the total obtained after dropping the j-th unit and from all the sample information, respectively. In (12.8.3), the value of the adjustment factor $a_i^{(j)}$ changes from one method of imputation to the other. For example, for the Ratio and NN methods of imputation $a_i^{(j)} = \left[\bar{y}_r^{(j)}/\bar{x}_r^{(j)} - \bar{y}_r/\bar{x}_r\right]x_i$; for the Mean and the HD methods of imputation $a_i^{(j)} = \bar{y}_r^{(j)} - \bar{y}_r$. Estimation of variance with the Jackknife method for data with imputed values has also been discussed by Lee, Rancourt, and Särndal (1995a, 1995b).
Some advanced techniques to estimate the variance of an estimator of total in multi-stage designs are also available from Rao and Shao (1999), Shao, Chen, and Chen (1998), Rao (1996b), Chen, Rao, and Sitter (2000), and Lee and Kim (2002).
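The adjusted jackknife described above can be sketched for the simplest case, mean imputation under SRS. This is an illustrative reconstruction (helper names are assumed, not the book's code): deleting a respondent shifts each imputed value by the change in the donor mean, while deleting a non-respondent leaves the imputed values unchanged.

```python
# Rao-Shao (1992) adjusted jackknife, sketched for mean imputation under SRS.

def adjusted_jackknife_variance(y, N):
    """y: list of (value, is_respondent) pairs after mean imputation."""
    n = len(y)
    resp = [v for v, r in y if r]
    ybar_r = sum(resp) / len(resp)

    def total_without(j):
        # adjustment when the deleted unit j is a respondent: re-impute with
        # the mean of the remaining respondents
        if y[j][1]:
            adj = (sum(resp) - y[j][0]) / (len(resp) - 1) - ybar_r
        else:
            adj = 0.0
        vals = [(v if r else v + adj) for k, (v, r) in enumerate(y) if k != j]
        return N / (n - 1) * sum(vals)

    totals = [total_without(j) for j in range(n)]
    t_full = N / n * sum(v for v, _ in y)
    return (n - 1) / n * sum((t - t_full) ** 2 for t in totals)
```

The function returns the variance estimate of (12.8.4) for this special case.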

Rao and Shao (1992) considered the problem of missing data in a multi-stage survey design. They consider the situation in which the first stage units or clusters are selected with replacement, or so treated, and in which independent sub-samples are taken within those clusters selected more than once, and followed the Krewski and Rao (1981) asymptotic set-up in studying the consistency of the adjusted Jackknife variance estimator. Assume $n_h$ clusters are selected with probabilities $p_{hi}$ and with replacement independently from the h-th stratum. Let
$$\hat{Y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} \frac{\hat{Y}_{hi}}{p_{hi}} \qquad (12.9.1)$$
be an unbiased estimator of the stratum total $Y_h$, where $\hat{Y}_{hi}$ is a linear unbiased estimator of the total $Y_{hi}$ for a selected cluster based on sampling at the second and subsequent stages. If $s_n$ denotes the total sample of n units, then if there is no non-response an unbiased estimator of the population total Y is given by
$$\hat{Y} = \sum_{(hik) \in s_n} w_{hik}\, y_{hik} \qquad (12.9.2)$$

where $y_{hik}$ and $w_{hik}$ $(k = 1,2,\ldots,n_{hi};\ i = 1,2,\ldots,n_h;\ h = 1,2,\ldots,L)$ denote the value of the variable under study and the design weights, respectively. Let $y_{hik}^{*}$ be the imputed values for the non-respondents, $s_m$, using a hot deck single imputation class mechanism. Then the imputed estimator of the population total Y is given by
$$\hat{Y}_I = \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \sum_{(hik) \in s_m} w_{hik}\, y_{hik}^{*}, \qquad (12.9.3)$$
where $s_r$ denotes the sample of respondents. Under SRS sampling, if we select the imputed values $y_{hik}^{*}$ as $y_{gjl}\, w_{gjl}/w_{hik}$, where $(gjl) \in s_r$ denotes the selected donor, then following Platek and Gray (1983) the estimator $\hat{Y}_I$ becomes unbiased; otherwise it remains biased. It is more appropriate for quantitative data, but may not provide good estimates for qualitative data. Rao and Shao (1992) suggested a simple method to select the donors $(gjl) \in s_r$ with replacement with probabilities $w_{gjl}\big/\sum_{s_r} w_{gjl}$ and use $y_{hik}^{*} = y_{gjl}$.

Then we have the following theorem:

Theorem 12.9.1. If $E_*$ denotes the expected value with respect to HD imputation, then
$$E_*\left(\hat{Y}_I\right) = \frac{\hat{S}\,\hat{U}}{\hat{T}}, \qquad (12.9.4)$$
where $\hat{S} = \sum_{(hik) \in s_r} w_{hik}\, y_{hik}$, $\hat{T} = \sum_{(hik) \in s_r} w_{hik}$, and $\hat{U} = \sum_{(hik) \in s_n} w_{hik}$.
Proof. Taking expected values on both sides of
$$\hat{Y}_I = \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \sum_{(hik) \in s_m} w_{hik}\, y_{hik}^{*}$$
we have
$$E_*\left(\hat{Y}_I\right) = \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \sum_{(hik) \in s_m} w_{hik}\, E_*\left(y_{hik}^{*}\right)$$
$$= \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \left(\sum_{(hik) \in s_m} w_{hik}\right)\frac{\sum_{(hik) \in s_r} w_{hik}\, y_{hik}}{\sum_{(hik) \in s_r} w_{hik}}$$
$$= \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \left(\sum_{(hik) \in s_n} w_{hik} - \sum_{(hik) \in s_r} w_{hik}\right)\frac{\sum_{(hik) \in s_r} w_{hik}\, y_{hik}}{\sum_{(hik) \in s_r} w_{hik}}$$
$$= \left(\sum_{(hik) \in s_n} w_{hik}\right)\frac{\sum_{(hik) \in s_r} w_{hik}\, y_{hik}}{\sum_{(hik) \in s_r} w_{hik}} = \frac{\hat{S}\,\hat{U}}{\hat{T}}.$$
Hence the theorem.

Theorem 12.9.2. Under the uniform response mechanism, prove that
$$E(\hat{S}) = pY,\quad E(\hat{T}) = pN,\quad \text{and}\quad E(\hat{U}) = N.$$
Proof. Define a random variable
$$t_{hik} = \begin{cases} 1 & \text{if } (hik) \in s_r, \\ 0 & \text{if } (hik) \in s_m, \end{cases} \qquad (12.9.5)$$
such that $E_2(t_{hik}) = p$ under the uniform response mechanism, where $E_2$ denotes the expected value within a given sample.
Now
$$\hat{S} = \sum_{(hik) \in s_r} w_{hik}\, y_{hik}$$
can be written as
$$\hat{S} = \sum_{(hik) \in s_r} w_{hik}\, y_{hik} = \sum_{(hik) \in s_n} t_{hik}\, w_{hik}\, y_{hik}.$$
Let $E_1$ denote the expected value over all possible samples; then we have
$$E(\hat{S}) = E_1 E_2\left[\sum_{(hik) \in s_n} t_{hik}\, w_{hik}\, y_{hik}\right] = E_1\left[\sum_{(hik) \in s_n} E_2(t_{hik})\, w_{hik}\, y_{hik}\right] = E_1\left[\sum_{(hik) \in s_n} p\, w_{hik}\, y_{hik}\right] = p E_1\left[\sum_{(hik) \in s_n} w_{hik}\, y_{hik}\right] = p E_1(\hat{Y}) = pY.$$
Now the estimator
$$\hat{T} = \sum_{(hik) \in s_r} w_{hik}$$
can be written as
$$\hat{T} = \sum_{(hik) \in s_r} w_{hik} = \sum_{(hik) \in s_n} t_{hik}\, w_{hik}.$$
Taking expected values on both sides, we have
$$E(\hat{T}) = E_1 E_2\left[\sum_{(hik) \in s_n} t_{hik}\, w_{hik}\right] = E_1\left[\sum_{(hik) \in s_n} E_2(t_{hik})\, w_{hik}\right] = p E_1\left[\sum_{(hik) \in s_n} w_{hik}\right] = pN.$$
Now taking expected values on both sides of
$$\hat{U} = \sum_{(hik) \in s_n} w_{hik}$$
we have
$$E(\hat{U}) = E\left[\sum_{(hik) \in s_n} w_{hik}\right] = N.$$
Hence the theorem.

Defining
$$\varepsilon_0 = \frac{\hat{S}}{pY} - 1,\quad \varepsilon_1 = \frac{\hat{T}}{pN} - 1,\quad \text{and}\quad \varepsilon_2 = \frac{\hat{U}}{N} - 1,$$
such that $E(\varepsilon_i) = 0$ for $i = 0, 1, 2$, we have the following theorem.

Theorem 12.9.3. An approximately unbiased estimator of the population total Y is given by
$$\hat{Y}_I = \sum_{(hik) \in s_r} w_{hik}\, y_{hik} + \sum_{(hik) \in s_m} w_{hik}\, y_{hik}^{*}.$$
Proof. Taking expected values we have
$$E\left(\hat{Y}_I\right) = E\left[E_*\left(\hat{Y}_I\right)\right] = E\left[\frac{\hat{S}\,\hat{U}}{\hat{T}}\right] = E\left[\frac{pY(1+\varepsilon_0)\, N(1+\varepsilon_2)}{pN(1+\varepsilon_1)}\right] = Y E\left[(1+\varepsilon_0)(1+\varepsilon_2)(1+\varepsilon_1)^{-1}\right]$$
$$\approx Y E\left[(1+\varepsilon_0)(1+\varepsilon_2)\left(1-\varepsilon_1+\varepsilon_1^2-\cdots\right)\right] = Y E\left[(1+\varepsilon_0+\varepsilon_2+\varepsilon_0\varepsilon_2)\left(1-\varepsilon_1+\varepsilon_1^2-\cdots\right)\right] \approx Y E\left[1+\varepsilon_0+\varepsilon_2-\varepsilon_1+\cdots\right] = Y.$$
Hence the theorem.

Let $n_g$ be the number of units in the g-th stratum. Then the estimators corresponding to $\hat{S}$, $\hat{T}$ and $\hat{U}$ obtained by deleting the j-th cluster from the g-th stratum can be defined as
$$\hat{S}_{-gj} = \sum_{(hik) \in s_r,\, h \ne g} w_{hik}\, y_{hik} + \frac{n_g}{n_g - 1}\sum_{(gik) \in s_r,\, i \ne j} w_{gik}\, y_{gik}, \qquad (12.9.6)$$
$$\hat{T}_{-gj} = \sum_{(hik) \in s_r,\, h \ne g} w_{hik} + \frac{n_g}{n_g - 1}\sum_{(gik) \in s_r,\, i \ne j} w_{gik}, \qquad (12.9.7)$$
and
$$\hat{U}_{-gj} = \sum_{(hik) \in s_n,\, h \ne g} w_{hik} + \frac{n_g}{n_g - 1}\sum_{(gik) \in s_n,\, i \ne j} w_{gik}, \qquad (12.9.8)$$
such that $E(\hat{S}_{-gj}) = pY$, $E(\hat{T}_{-gj}) = pN$ and $E(\hat{U}_{-gj}) = N$. Also, for $(hik) \in s_m$ and $(hi) \ne (gj)$, adjust the imputed values as
$$\tilde{y}_{hik}^{(gj)} = y_{hik}^{*} + \hat{S}_{-gj}\big/\hat{T}_{-gj} - \hat{S}\big/\hat{T}, \qquad (12.9.9)$$
which are called 'synthetic' imputed values, such that
$$E_*\left(\tilde{y}_{hik}^{(gj)}\right) = \hat{S}_{-gj}\big/\hat{T}_{-gj}$$
because $E_*\left(y_{hik}^{*}\right) = \hat{S}/\hat{T}$. Rao and Shao (1992) used the synthetic imputed values to construct an approximately unbiased estimator $\hat{Y}_{Ia(-gj)}$ of the population total Y as
(12.9.10)

Theorem 12.9.4. The estimator $\hat{Y}_{Ia(-gj)}$ is approximately unbiased for the population total Y.
Proof. Follows on taking the expected value of $\hat{Y}_{Ia(-gj)}$. Hence the theorem.

Then a Jackknife estimator of the variance of $\hat{Y}_I$ is given by
$$v_J = \sum_{g=1}^{L}\frac{n_g - 1}{n_g}\sum_{j=1}^{n_g}\left\{\hat{Y}_{Ia(-gj)} - \hat{Y}_I\right\}^2. \qquad (12.9.11)$$

Rao and Shao (1992) have also considered the situation of multiple imputation classes. They showed that $\left(\hat{Y}_I - Y\right)\big/\sqrt{v_J} \to N(0,1)$. Yung and Rao (2000) have considered the post-stratified design for estimating the variance using the Jackknife estimator of variance.

Rubin (1978) introduced multiple imputation (MI) to account for the inflation in the variance owed to imputation. It requires the construction of $M (\ge 2)$ complete data sets by replacing each missing value by M imputed values using the same imputation procedure. Let $\bar{y}_{I1}, \bar{y}_{I2}, \ldots, \bar{y}_{IM}$ denote the M imputed estimators of the population mean $\bar{Y}$ under SRSWR sampling. The 'final' imputed estimator of the population mean $\bar{Y}$ is then given by
$$\bar{y}_{I\bullet} = \frac{1}{M}\sum_{J=1}^{M}\bar{y}_{IJ} \qquad (12.10.1)$$
with estimated variance
$$v\left(\bar{y}_{I\bullet}\right) = \frac{1}{M}\left(\frac{1}{n} - \frac{1}{N}\right)\sum_{J=1}^{M} s_{IJ}^2 + \frac{M+1}{M}\left\{\frac{1}{M-1}\sum_{J=1}^{M}\left(\bar{y}_{IJ} - \bar{y}_{I\bullet}\right)^2\right\}, \qquad (12.10.2)$$
where $s_{IJ}^2$ denotes the sample variance for the J-th completed data set and n is the sample size. Rubin and Schenker (1986) pointed out that the variance estimator (12.10.2) leads to valid inference at least when the number of imputations M is large and the imputations are 'proper' in the sense that the imputed values are drawn from the posterior distribution of the non-observed values given the respondent values. The traditional simple random hot deck imputation is not proper in this sense and may lead to an underestimate of the true variance of $\bar{y}_I$; thus Rubin and Schenker (1986) suggested a new method of variance estimation, called the Approximate Bayesian Bootstrap (ABB) method. Rao (1996b) considered the stratified random sampling design for estimating the variance with multiply imputed data. Rubin's

method has been found to be applicable in situations in which the fraction of missing data is large and the user and imputer are the same individual who chooses multiple imputation because of its convenience; for example, see Little and Yao (1996), Paik (1997), Taylor, Muñoz, Bass, Sah, Chmiel, Kingsley et al. (1990), Tu, Meng, and Pagano (1993), and Clayton, Dunn, Pickles, and Spiegelhalter (1998). On the other hand, Fay (1992, 1994, 1996), Meng (1994) and Rubin (1996) have shown that the estimator of variance proposed by Rubin is upwardly biased and inconsistent in certain cases. Robins and Wang (2000) derived a general formula for the large sample bias in Rubin's estimator of variance, which not only confirms the findings of Fay (1992, 1994, 1996), Meng (1994) and Rubin (1996), but also indicates that there are other scenarios under which Rubin's estimator of variance is downwardly biased. Robins and Wang (2000) provided an interesting formula which overcomes the deficiencies of Rubin's estimator of variance, and they provided a consistent estimator of variance, unlike Rubin's estimator, when the imputation and analysis models are mis-specified and incompatible with one another. Schafer and Schenker (2000) used imputed conditional means for drawing inference. Let us explain the concept of multiple imputation with the help of a simple example.

Example 12.10.1. Select an SRSWOR sample of twelve units from the population 1 given in the Appendix. Record the values of the real estate farm loans for the states selected in the sample. Observe the random non-response in the selected sample. Impute the missing values three times with the help of the hot deck method of imputation. Apply the concept of multiple imputation for estimating the population mean and construct the 95% confidence interval.
Solution. We apply the remainder approach on the first two columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select a sample of 12 states from the population 1. The first 12 distinct random numbers between 1 and 50 were selected as: 49, 08, 10, 04, 42, 01, 19, 37, 12, 23, 38 and 44. The information so collected from the selected units in the sample is given below:

01 AL 408.978
04 AR 907.700
08 DE 42.808
10 GA 939.460
12 ID Missing
19 ME 8.849
23 MN 1354.768
37 OR 114.899
38 PA 756.169
42 TN Missing
44 UT 56.908
49 WI Missing

We observe that data are missing for the three states ID, TN and WI. In the following table we impute these missing values three times with the help of the hot deck method of imputation. On the first occasion we used the 3rd column of the PRN table to select three random numbers between 1 and 9 to impute the missing values as shown in the fourth column of Table 12.10.1. The first time, the three distinct random numbers came in the sequence 2, 8 and 1.

Sr. No.  State   Third time imputed data (J = 3)
1        AL      408.978
2        AR      907.700
3        DE      42.808
4        GA      939.460
5        ME      8.849
6        MN      1354.768
7        OR      114.899
8        PA      756.169
9        UT      56.908
10       ID      939.460
11       TN      756.169
12       WI      408.978

We use the 7th column of PRN to select three random numbers between 1 and 9 to impute the missing data a second time. We find the three distinct random numbers as: 6, 7 and 9. The corresponding imputed data are shown in the fifth column of Table 12.10.1. We use the 11th column of PRN to select three random numbers between 1 and 9 to impute the missing data a third time. We find the three distinct random numbers as: 4, 8 and 1. The corresponding imputed data are shown in the sixth column of Table 12.10.1. Here $M = 3$, $n = 12$ and $N = 50$. An imputed estimate of the average real estate farm loans during 1997 in the United States is given by

$$\bar{y}_{I\bullet} = \frac{1}{M}\sum_{J=1}^{M}\bar{y}_{IJ} = \frac{555.282 + 509.759 + 557.929}{3} = 540.99.$$
An estimate of variance is given by
$$v\left(\bar{y}_{I\bullet}\right) = \frac{1}{M}\left(\frac{1}{n} - \frac{1}{N}\right)\sum_{J=1}^{M} s_{IJ}^2 + \frac{M+1}{M}\left\{\frac{1}{M-1}\sum_{J=1}^{M}\left(\bar{y}_{IJ} - \bar{y}_{I\bullet}\right)^2\right\}$$

$$= \frac{1}{3}\left(\frac{1}{12} - \frac{1}{50}\right)\left(196565.530 + 275716.920 + 198684.640\right)$$
$$+ \frac{3+1}{3}\left[\frac{1}{3-1}\left\{(555.282 - 540.99)^2 + (509.759 - 540.99)^2 + (557.929 - 540.99)^2\right\}\right]$$
$$= 14164.86 + 977.71 = 15142.57.$$
A $(1-\alpha)100\%$ confidence interval, assuming a large sample size, is
$$\bar{y}_{I\bullet} \pm t_{\alpha/2}(df = n-1)\sqrt{v\left(\bar{y}_{I\bullet}\right)}.$$
Using Table 2 from the Appendix, the 95% confidence interval of the average real estate farm loans is
$$540.99 \pm 2.201\sqrt{15142.57}, \quad \text{or} \quad [270.145,\ 811.834].$$
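The combining step of this example can be sketched as below, using the three imputed means and within-imputation variances quoted above (an illustrative helper following the book's formula (12.10.2), including its finite-population correction).

```python
# Multiple-imputation combining rules as in (12.10.1)-(12.10.2).

def mi_combine(means, variances, n, N):
    """means: \\bar{y}_{IJ}; variances: s_{IJ}^2 for the M completed data sets."""
    M = len(means)
    ybar = sum(means) / M
    within = (1 / M) * (1 / n - 1 / N) * sum(variances)
    between = ((M + 1) / M) * sum((m - ybar) ** 2 for m in means) / (M - 1)
    return ybar, within + between

means = [555.282, 509.759, 557.929]                 # the three imputed estimates
variances = [196565.530, 275716.920, 198684.640]    # s_{IJ}^2 from the example
ybar, v = mi_combine(means, variances, n=12, N=50)
```

The returned values reproduce the point estimate 540.99 and the variance 15142.57 of the worked example.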

Multiple imputation is becoming very popular, and standard techniques for properly dealing with missing data are becoming readily accessible through statistical software. Let Y be the complete data set, and let $Y_{obs}$ and $Y_{mis}$ denote the observed and missing components of the complete data, that is, $Y = (Y_{obs}, Y_{mis})$. Assume that with complete data, valid inference about a k-component quantity $\beta$, possibly a model parameter or a finite population characteristic, follows the standard large sample statement
$$\left(\hat{\beta} - \beta\right) \sim N(0, V), \qquad (12.10.3)$$
where $\hat{\beta} = \hat{\beta}(Y)$ is an estimator of $\beta$ and $V = V(Y)$ is its associated variance. The basic idea of multiple imputation is to fill in the missing data multiple times with values drawn from some distribution that predicts the missing values given the observed data and other available information. Each draw of $Y_{mis}$ is an imputation. Thus in the case of multiple imputation we have m imputations $Y_{mis}^{(1)}, Y_{mis}^{(2)}, \ldots, Y_{mis}^{(m)}$ as repeated independent draws of $Y_{mis}$ from a Bayesian prediction model. In general we do multiple imputation in three steps:
Step I. Compute the complete data statistics $\hat{\beta}_{\bullet l} = \hat{\beta}\left(Y^{(l)}\right)$ and $V_{\bullet l} = V\left(Y^{(l)}\right)$ for each of the m completed data sets $Y^{(l)} = \left(Y_{obs}, Y_{mis}^{(l)}\right)$ for $l = 1, 2, \ldots, m$.
Step II. Compute an estimate of $\beta$ as $\bar{\beta}_m = \frac{1}{m}\sum_{l=1}^{m}\hat{\beta}_{\bullet l}$ and an estimate of variance as
$$\hat{V}_m = \bar{V}_m + \left(1 + \frac{1}{m}\right)B_m,$$
where $\bar{V}_m = \frac{1}{m}\sum_{l=1}^{m} V_{\bullet l}$ and $B_m = \frac{1}{m-1}\sum_{l=1}^{m}\left(\hat{\beta}_{\bullet l} - \bar{\beta}_m\right)\left(\hat{\beta}_{\bullet l} - \bar{\beta}_m\right)'$.

Step III. The $(1-\alpha)100\%$ confidence intervals for $\beta$ are formed on the basis of the k-component Student's t distribution given by
$$\left(\beta - \bar{\beta}_m\right)\big/\sqrt{\hat{V}_m} \sim t_{\nu_m}, \qquad (12.10.4)$$
where
$$\nu_m = (m-1)\,\hat{\gamma}_m^{-2} \quad \text{with} \quad \hat{\gamma}_m = \left(1 + m^{-1}\right)\operatorname{tr}\left(B_m \hat{V}_m^{-1}\right)\big/k. \qquad (12.10.5)$$
Barnard and Rubin (1999) pointed out that in small data sets, however, it can be unsatisfactory to set the degrees of freedom to infinity, especially when there is little missing information, because $\nu_m$ can then be many times the degrees of freedom available if there were no missing data. Thus Barnard and Rubin (1999) saw a need for a new expression for the multiple imputation degrees of freedom that does not rely on a large complete data sample. They provided an adjusted degrees of freedom
$$\tilde{\nu}_m = \left[\nu_m^{-1} + \hat{\nu}_{obs}^{-1}\right]^{-1}, \qquad (12.10.6)$$
where $A(\nu) = (\nu+1)/(\nu+3)$ and $\hat{\nu}_{obs} = A(\nu_{com})\,\nu_{com}\left(1 - \hat{\gamma}_m\right)$ denotes the observed degrees of freedom, with $\nu_{com}$ being the complete-data degrees of freedom.
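The degrees-of-freedom formulas above can be sketched for the scalar case ($k = 1$). The adjusted value follows the Barnard-Rubin expression as reconstructed here; parameter values in the test are illustrative.

```python
# Degrees of freedom for multiple-imputation inference, scalar case (k = 1).

def mi_degrees_of_freedom(m, gamma_m, nu_com):
    """gamma_m: estimated fraction of missing information; nu_com: complete-data df."""
    nu_m = (m - 1) / gamma_m ** 2                  # large-sample df, (12.10.5)
    A = lambda nu: (nu + 1.0) / (nu + 3.0)
    nu_obs = A(nu_com) * nu_com * (1.0 - gamma_m)  # observed-data df
    return 1.0 / (1.0 / nu_m + 1.0 / nu_obs)       # adjusted df, (12.10.6)
```

Note that the adjusted value never exceeds either component, which is exactly the small-sample correction Barnard and Rubin argue for.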

Singh and Horn (2000) suggested a compromised imputation, in which the data after imputation become
$$y_{\bullet i} = \begin{cases} a\,(n/r)\,y_i + (1-a)\,\hat{b}\,x_i & \text{if } i \in A, \\ (1-a)\,\hat{b}\,x_i & \text{if } i \in \bar{A}, \end{cases} \qquad (12.11.1)$$
where a is a suitably chosen constant such that the variance of the resultant estimator is minimum. Note that Meeden (2000) has also suggested the idea of adjusting responding values in addition to non-responding values while doing imputation, using information from imputed values for the responding units in addition to the non-responding units. Thus we have the following theorem:

Theorem 12.11.1. The point estimator of the population mean $\bar{Y}$ under the compromised method of imputation becomes
$$\bar{y}_{comp} = a\,\bar{y}_r + (1-a)\,\bar{y}_r\left(\frac{\bar{x}_n}{\bar{x}_r}\right). \qquad (12.11.2)$$
Proof. We have
$$\bar{y}_{comp} = \frac{1}{n}\sum_{i \in s} y_{\bullet i} = \frac{1}{n}\left[\sum_{i \in A} y_{\bullet i} + \sum_{i \in \bar{A}} y_{\bullet i}\right], \qquad (12.11.3)$$
and using (12.11.1) we have (12.11.2). Hence the theorem.

The estimator at (12.11.2) is an analogue of the well known estimator of population mean considered by Chakrabarty (1968), Vos (1980) and Adhvaryu and Gupta (1983), as discussed in Chapter 3, and given by

$$\bar{y}_{cvag} = a\,\bar{y}_n + (1-a)\,\bar{y}_n\left(\frac{\bar{X}}{\bar{x}_n}\right). \qquad (12.11.4)$$
Defining
$$\varepsilon = \frac{\bar{y}_r}{\bar{Y}} - 1,\quad \delta = \frac{\bar{x}_r}{\bar{X}} - 1,\quad \text{and}\quad \eta = \frac{\bar{x}_n}{\bar{X}} - 1,$$
and using the concept of two-phase sampling following Rao and Sitter (1995) and the mechanism of missing completely at random (MCAR), for given r and n we have
$$E(\varepsilon) = E(\delta) = E(\eta) = 0$$
and
$$E(\varepsilon^2) = \left(\frac{1}{r} - \frac{1}{N}\right)C_y^2,\quad E(\delta^2) = \left(\frac{1}{r} - \frac{1}{N}\right)C_x^2,\quad E(\varepsilon\delta) = \left(\frac{1}{r} - \frac{1}{N}\right)\rho_{xy}C_yC_x,$$
$$E(\eta^2) = \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2,\quad E(\delta\eta) = \left(\frac{1}{n} - \frac{1}{N}\right)C_x^2,\quad E(\varepsilon\eta) = \left(\frac{1}{n} - \frac{1}{N}\right)\rho_{xy}C_yC_x,$$
where $C_y^2 = S_y^2/\bar{Y}^2$, $C_x^2 = S_x^2/\bar{X}^2$, $\rho_{xy} = S_{xy}/(S_xS_y)$; $S_y^2$, $S_x^2$ and $S_{xy}$ have their usual meanings. Now we have the following theorems.

Theorem 12.11.2. The conditional bias in the estimator $\bar{y}_{comp}$ at (12.11.2) is
$$B\left(\bar{y}_{comp}\right) \approx (1-a)\left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}\left(C_x^2 - \rho_{xy}C_yC_x\right). \qquad (12.11.5)$$
Proof. The estimator $\bar{y}_{comp}$ can be written in terms of $\varepsilon$, $\delta$ and $\eta$ as
$$\bar{y}_{comp} \approx a\bar{Y}(1+\varepsilon) + (1-a)\bar{Y}\left(1 + \varepsilon + \eta - \delta + \varepsilon\eta - \varepsilon\delta - \delta\eta + \delta^2 + O(\varepsilon^2)\right). \qquad (12.11.6)$$
Taking expected values on both sides of (12.11.6) and subtracting the actual mean, we obtain (12.11.5). Hence the theorem.

Theorem 12.11.3. The minimum MSE of the estimator $\bar{y}_{comp}$ is given by
$$\operatorname{Min.MSE}\left(\bar{y}_{comp}\right) \approx \operatorname{MSE}\left(\bar{y}_{ratio}\right) - \left(\frac{1}{r} - \frac{1}{n}\right)a^2\bar{Y}^2C_x^2 \qquad (12.11.7)$$
for the optimum value of a given by
$$a = 1 - \rho_{xy}C_y/C_x. \qquad (12.11.8)$$
Proof. We have
$$\bar{y}_{comp} \approx \bar{Y} + \bar{Y}\varepsilon + (1-a)\bar{Y}(\eta - \delta) + O(\varepsilon^2),$$
where $O(\varepsilon^2)$ indicates terms of higher order in $\varepsilon$, $\delta$, $\eta$, etc. Thus
$$\operatorname{MSE}\left(\bar{y}_{comp}\right) = E\left[\bar{y}_{comp} - \bar{Y}\right]^2 \approx E\left[\bar{Y}\varepsilon + (1-a)\bar{Y}(\eta - \delta)\right]^2$$
$$= \left(\frac{1}{r} - \frac{1}{N}\right)\bar{Y}^2C_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\bar{Y}^2\left[(1-a)^2C_x^2 - 2(1-a)\rho_{xy}C_yC_x\right]. \qquad (12.11.9)$$
On differentiating (12.11.9) with respect to $(1-a)$ and equating to zero, we obtain $a = 1 - \rho_{xy}C_y/C_x$, and on substituting it in (12.11.9) we have the theorem.
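A quick numeric check of the optimum in (12.11.8) follows: minimise the MSE expression (12.11.9) over a grid of a-values and compare with $1 - \rho_{xy}C_y/C_x$. The parameter values are illustrative, not taken from the book.

```python
# Grid check that the MSE (12.11.9) is minimised at a = 1 - rho*Cy/Cx.

def mse_comp(a, r, n, N, Y, Cy, Cx, rho):
    t1 = (1 / r - 1 / N) * Y**2 * Cy**2
    t2 = (1 / r - 1 / n) * Y**2 * ((1 - a)**2 * Cx**2
                                   - 2 * (1 - a) * rho * Cy * Cx)
    return t1 + t2

r, n, N, Y, Cy, Cx, rho = 15, 20, 50, 500.0, 1.2, 1.5, 0.8
grid = [i / 1000 for i in range(1001)]
a_best = min(grid, key=lambda a: mse_comp(a, r, n, N, Y, Cy, Cx, rho))
a_opt = 1 - rho * Cy / Cx        # the analytic optimum of (12.11.8)
```

Since the MSE is quadratic in (1 - a), the grid minimiser lands exactly on the analytic optimum (0.36 for these parameters).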

The main difficulty in using the compromised imputation procedure is the choice of a. It is important to note that the optimum value of a depends only upon the well known parameter $K = \rho_{xy}C_y/C_x$. The value of K is quite stable in repeated surveys, as shown by Reddy (1978a). Thus if the value of K is known then the compromised imputation method can be easily implemented in actual surveys. Sometimes the value of K is not known. In these situations, Singh and Horn (2000) have suggested two estimators of a. The first estimator is given by
$$\hat{a}_1 = 1 - \frac{\bar{x}_r\, s_{xy}^{*}}{\bar{y}_r\, s_x^{*2}}, \qquad (12.11.10)$$
where $s_x^{*2} = (r-1)^{-1}\sum_{i=1}^{r}\left(x_i - \bar{x}_r\right)^2$ and $s_{xy}^{*} = (r-1)^{-1}\sum_{i=1}^{r}\left(x_i - \bar{x}_r\right)\left(y_i - \bar{y}_r\right)$. The second estimator is given by
$$\hat{a}_2 = 1 - \frac{\bar{x}_n\, s_{xy}^{*}}{\bar{y}_r\, s_x^{2}}, \qquad (12.11.11)$$
where $s_x^{2} = (n-1)^{-1}\sum_{i=1}^{n}\left(x_i - \bar{x}_n\right)^2$. The choice between $\hat{a}_1$ and $\hat{a}_2$ is not very important for infinite populations, because the asymptotic mean squared error of the resultant estimators of the mean remains the same, following Sampath (1989). Singh and Horn (2000) have shown that the compromised imputation technique remains better than the ratio or mean methods of imputation.

It is interesting to note that if a strong imputation variable $x_i$ is available, then $y_i = bx_i$ very nearly holds for all i. Then, to a very close approximation, $\hat{b} = b$ and $\hat{b}x_i = y_i$; that is, the imputed value is near perfect. The imputation rule (12.11.1) then reduces to
$$y_{\bullet i} = \begin{cases} y_i\left[1 + a(n/r - 1)\right] & \text{if } i \in A, \\ (1-a)\,y_i & \text{if } i \in \bar{A}. \end{cases} \qquad (12.11.12)$$
Under such situations $\rho_{xy} \to 1$ and $C_y \approx C_x$, so the optimum value of a, and hence its estimators $\hat{a}_i$, $i = 1, 2$, will tend to zero. In other words, the imputed values obtained using the compromised technique then remain close to the true values in $\bar{A}$, and the actual values $y_i$ in A are hardly affected by the imputation. It is remarkable that a bad guess of a may lead to bad results in compromised imputation. Since compromised imputation provides a better estimator of the population mean, it is recommended for use in future surveys.

Singh and Horn (2000) pointed out that this type of compromise can also be made between other types of imputation methods. For example, a compromise between the Hot deck and Cold deck methods of imputation may lead to the 'Warm Deck' method of imputation, defined as
$$\bar{y}_{WD} = a\,\bar{y}_{CD} + (1-a)\,\bar{y}_{HD}. \qquad (12.11.13)$$
The correlation between estimates obtained via the cold deck and hot deck methods is expected to be high, and hence the resultant estimator (12.11.13), named the 'Warm Deck' method of imputation, is expected to be efficient for the optimum value of a given by
$$a_0 = \frac{V\left(\bar{y}_{HD}\right) - \operatorname{Cov}\left(\bar{y}_{HD}, \bar{y}_{CD}\right)}{V\left(\bar{y}_{HD}\right) + V\left(\bar{y}_{CD}\right) - 2\operatorname{Cov}\left(\bar{y}_{HD}, \bar{y}_{CD}\right)}. \qquad (12.11.14)$$
A consistent estimator of $a_0$ at (12.11.14) can be obtained by replacing the variance-covariance terms by their sample analogues.

A compromised method of imputation obtained by pooling the Mean method and the Nearest Neighbourhood (NN) method of imputation, given by
$$\bar{y}_{MN} = a\,\bar{y}_r + (1-a)\,\bar{y}_{NN}, \qquad (12.11.15)$$
can be named the 'Mean cum NN' method of imputation. In the same fashion, a linear combination of any two or more imputation procedures can be used to make a compromised imputation procedure. Chen and Shao (2001) have discussed variance estimation techniques for nearest neighbourhood imputation.

Example 12.11.1. Select an SRSWOR sample of twenty units from the population 1 given in the Appendix. Record the values of the real and nonreal estate farm loans for the states selected in the sample. Observe the random non-response in the selected sample. Impute the missing values with the following two methods:
( a ) Ratio method of imputation; and ( b ) Compromised method of imputation.
Estimate the average real estate farm loans with each method and comment on your results.

Solution. We apply the remainder approach on the 3rd and 4th columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select a sample of 20 states from the population 1. The first 20 distinct random numbers between 1 and 50 were selected as: 29, 31, 14, 41, 05, 47, 28, 22, 18, 12, 42, 23, 48, 02, 06, 07, 11, 21, 25, and 39. The information so collected from the selected units in the sample is given below.

02 AK 3.433     2.605       23 MN 2466.892  1354.768
05 CA 3928.732  1343.461    25 MO 1519.994  1579.686
06 CO 906.281   Missing     28 NV 16.710    5.860
07 CT 4.373     7.130       29 NH 0.471     6.044
11 HI 38.067    40.775      31 NM 274.035   140.582
12 ID 1006.036  53.753      39 RI 0.233     Missing
14 IN 1022.782  1213.024    41 SD 1692.817  413.777
18 LA 405.799   282.565     42 TN 388.869   553.266
21 MA 56.471    Missing     47 WA 1228.607  Missing
22 MI 440.518   323.028     48 WV 29.291    Missing

From the above table $\bar{x}_r = 880.6352$ and $\bar{y}_r = 488.0216$, which imply
$$\hat{b} = \bar{y}_r/\bar{x}_r = 0.55417.$$
Also $s_x^{*2} = 1264908.658$ and $s_{xy}^{*} = 505875.559$, which imply
$$\hat{a}_1 = 1 - \frac{\bar{x}_r\, s_{xy}^{*}}{\bar{y}_r\, s_x^{*2}} = 1 - \frac{880.6352 \times 505875.559}{488.0216 \times 1264908.658} = 0.27832522.$$

The imputed values for the ratio method are then given by
$$\hat{y}_i(r) = \hat{b}\,x_i = 0.55417\,x_i$$
if the i-th state selected in the sample is not responding to the value of real estate farm loans.

On the other hand, the imputed values for the compromised imputation method are
$$\hat{y}_i(C) = \begin{cases} (na/r)\,y_i + (1-a)\hat{b}\,x_i = 0.3711\,y_i + 0.39993\,x_i & \text{if the } i\text{-th state is responding,} \\ (1-a)\hat{b}\,x_i = 0.39993\,x_i & \text{if the } i\text{-th state is not responding.} \end{cases}$$

Thus by the ratio and compromised methods of imputation the data take the form shown in the 4th and 5th columns, respectively, of the following table.

~Sr.
No. ~
I ~~irffi~ndtj .~eal es~~;}.I !i,~Ratio.
tory 'Gf '· J oans" y
. C?1Jl~m)1~~
Imp!1tatlOn , Imputatlon~
2 AK 2.605 2.60500 2.339678
5 CA 1343.461 1343.46100 2069 .779000
6 CO 502.23370 362.449400
.'~,
Missingi~
7 CT 7.130 7.13000 4.394841
II HI 40.775 40.77500 30.355770
12 rn 53.753 53.75300 422 .292200
14 IN 12 13.024 1213.02400 859.195300
18 LA 282.565 282.56500 267.151400
21 MA ,"';'iJ,'ri: ,;l,;~i~ .• M:issirfg ;~ 31.29453 22.584480
22 MI 323.028 323.02800 296.052400
23 MN 1354.768 1354.76800 1489.340000
25 MO 1579.686 1579.68600 1194.114000
28 NV 5.860 5.86000 8.859487
29 NH 6.044 6.04400 2.431297
31 NM 140.582 140.58200 161.765000
39 RI "'w ~;; MissiIi'g"1i' 0.12912 0.093184
41 SD 413.777 413.77700 830.561900
42 TN 553.266 553.26600 360.837800
47 WA Missiilg;,,· 680.85710 491.357400
48 WV ..-f'
'".',J, Missing '" 16.23219 11.714360
"''' \r;,~jf~ .
',;
f
' ff •
Sum": 1l',,8551.07,J00 · 8887.666700""

Thus an estimate of the average of the real estate farm loans in the United States during 1997 by the ratio method of imputation is given by
$$\bar{y}_{ratio} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i(r) = \frac{8551.071}{20} = 427.55,$$
and that by the compromised imputation method is given by
$$\bar{y}_{comp} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i(C) = \frac{8887.6667}{20} = 444.38.$$
From the description of the population 1 given in the Appendix, the true average real estate farm loans is given by $\bar{Y} = 555.43$. One can see that the estimate based on the compromised method of imputation is closer to the true value of the population mean than the estimate based on the ratio method of imputation.

Consider a sample s of size n selected by a given design p(s), and assume that a set r of $n_r$ units responds and a set $r^c$ of $(n - n_r)$ units does not. The information collected in the sample can be represented as $\{(k, I_k, y_k, x_k): k \in s\}$, where $I_k$ is an indicator variable such that $E(I_k) = p_k$ and $p_k$ denotes the individual response probability. Cassel, Särndal, and Wretman (1979) have shown that it is always possible to eliminate the non-response bias of estimators of the population total or mean by weighting with the response probability $p_k$, provided that $p_k$ is positive for all k. The response probability $p_k$ is unknown, but it can be estimated. It is generally assumed that the response probability takes the form
$$p_k = F(\theta, x_k), \qquad (12.12.1)$$
where $\theta$ denotes an unknown parameter (or vector of parameters) and $F(\cdot,\cdot)$ is a functional form to be explicitly specified, for example $p_k = \exp(-\theta x_k)$ or $p_k = \left[1 + \exp(-\theta x_k)\right]^{-1}$, etc. Estimates of $p_k$ are then obtained by replacing in (12.12.1) an estimated value $\hat{\theta}$ of the parameter $\theta$. Specification of the functional form as well as estimation of its parameters may be cumbersome. Thus if the auxiliary information consists of a grouping or stratification variable(s) and it is reasonable to assume the response propensity is homogeneous within strata, a way to avoid this difficulty is to assume the response probabilities constant within strata; thus if $h = 1, 2, \ldots, L$ denote the stratum levels, then (12.12.1) becomes
$$p_k = \theta_h. \qquad (12.12.2)$$
It is natural to use the response rate in each stratum as the maximum likelihood estimate of $p_k$, that is,
$$\hat{p}_k = \hat{\theta}_h = n_{rh}/n_h, \qquad (12.12.3)$$
where $n_{rh}$ and $n_h$ are the number of respondents and the sample size in the h-th stratum, respectively. Sometimes, however, no clearly identifiable strata or other groups exist; this may often occur when the auxiliary information is represented by the $x_k$ values of a continuous variable, and this method, although still easily applicable, may then present some drawbacks. In such situations it seems natural to turn the idea of homogeneous response behaviour within strata into the idea that similar or near $x_k$ are connected with similar $p_k$. However, the latter does not fit well the strata structure introduced above, where groups are non-overlapping. In fact, $x_k$ values some distance apart in the same group give the same estimates of response probabilities, whereas $x_k$ values near to each other but in different groups may give very different estimates. Giommi (1984) has given the idea of centring at each unit, for estimating the response probabilities, using a technique ideally related to the concept of a moving average. Thus the response probability of the k-th unit can be estimated by the response rate in the group centred at the k-th unit or, equivalently,
1032 Advanced sampling theory with applications

in a convenient interval of $x$ values centred at $x_k$, $(k\in s)$. Let $2l_k$ be the length of the interval centred at $x_k$; then $p_k$ can be estimated as
$$\hat p_k = \frac{\sum_{j\in s} i_j\, D(x_k - x_j)}{\sum_{j\in s} D(x_k - x_j)},\quad\text{where}\quad D(z) = \begin{cases}1 & \text{if } |z|\le l_k,\\ 0 & \text{if } |z| > l_k,\end{cases} \qquad(12.12.4)$$
and $i_k$ is the realised value of $I_k$, that is, $i_k = 1$ if $k\in r$ and $0$ otherwise. Using


estimates of response probabilities, Cassel, Sarndal, and Wretman (1979) proposed
the following estimator of population total:

$$\hat Y_c = \sum_{k\in r}\frac{y_k}{\pi_k\hat p_k} + \hat\beta_{ds}\left[X - \sum_{k\in r}\frac{x_k}{\pi_k\hat p_k}\right] \qquad(12.12.5)$$
where
$$\hat\beta_{ds} = \left[\sum_{k\in r}\frac{y_kx_k}{v_k\pi_k\hat p_k}\right]\left[\sum_{k\in r}\frac{x_k^2}{v_k\pi_k\hat p_k}\right]^{-1}$$
and $\pi_k$ denotes the probability of including the $k$th unit in the sample.

Example 12.12.1. We wish to estimate the average sleep duration (in minutes) of the aged persons living in a small village of the United States as shown in population 2 of the Appendix. Select an SRSWOR sample of eight persons from population 2. Collect the information about the duration of sleep and the age of the persons selected in the sample. Suppose the response probability of the $i$th person is inversely proportional to the age of the person selected, assuming that an old person will respond less readily than a mature young person. Estimate the average sleep duration of the old persons in the particular village.
Solution. Population 2 consists of 30 old persons living in a small village. We used the first two columns of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select eight distinct random numbers between 1 and 30 as: 01, 23, 04, 05, 22, 29, 03, 27.

Person ($i$)   Age ($x_i$)   Sleep duration ($y_i$)   $\hat p_i$   $y_i/\hat p_i$
01             60            492                      0.036708     13403.07
03             55            408                      0.040045     10188.53
04             56            465                      0.039330     11823.04
05             82            312                      0.026859     11616.22
22             71            360                      0.031021     11605.04
23             63            390                      0.034960     11155.61
27             60            390                      0.036708     10624.38
29             66            390                      0.033371     11686.79
12. Non-response and its treatments 1033

We assume the response probability
$$\hat p_i = \frac{1/x_i}{\sum_{i=1}^{30}(1/x_i)},$$
where $\sum_{i=1}^{30}1/x_i = 0.454$ is known.

Then an estimate of the average sleep duration is
$$\bar y_{est} = \frac{1}{Nn}\sum_{i=1}^{n}\frac{y_i}{\hat p_i} = \frac{92102.68}{30\times 8} = 383.76 \text{ minutes}.$$
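A quick numerical check of this solution in Python, using the tabulated $\hat p_i$ values, reproduces the estimate:

```python
# Check of Example 12.12.1 using the tabulated response probabilities:
# sum the y_i / p_i column and divide by N*n to get the mean sleep duration.
N, n = 30, 8
sleep = [492, 408, 465, 312, 360, 390, 390, 390]
p_hat = [0.036708, 0.040045, 0.039330, 0.026859,
         0.031021, 0.034960, 0.036708, 0.033371]
total = sum(y / p for y, p in zip(sleep, p_hat))   # close to 92102.68
ybar = total / (N * n)                             # close to 383.76 minutes
print(round(total, 2), round(ybar, 2))
```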

In almost all large scale surveys non-response is inevitable. Several methods for handling non-response problems in sample surveys are available in the literature. Good details are given by Rubin (1987). Sarndal (1992) developed a method of estimation of the population total and its variance when a single imputation is used for estimating unobserved values under the superpopulation model described below.

Let $\Omega = \{1,2,\dots,N\}$ be a finite population of $N$ identifiable units and $y_i$ be the value of the variate under study for the $i$th unit of the population $\Omega$. It is assumed that the vector $\underline{Y} = (y_1,\dots,y_i,\dots,y_N)$ is a random sample from a superpopulation $\xi$ having the following distribution: $E_\xi(y_i) = \beta x_i$, $V_\xi(y_i) = \sigma^2x_i^g$, and $C_\xi(y_i, y_j) = 0$ for $i\ne j$, $i,j = 1,\dots,N$, where $E_\xi$, $V_\xi$ and $C_\xi$ denote respectively the expectation, variance and covariance operators with respect to the model $\xi$; $\beta$, $\sigma^2(>0)$, and $g(>0)$ are unknown model parameters, and $x_i(>0)$ is the value of the auxiliary variable for the $i$th unit. The objective is to estimate the finite population total $Y = \sum_{i\in\Omega}y_i$ on the basis of a sample $s$, of size $n$, selected with probability $p(s)$ according to a sampling design $p$. Let $\pi_i$ and $\pi_{ij}$ be the inclusion probabilities of the $i$th unit and of the $i$th and $j$th ($i\ne j$) units of $\Omega$. Let $s_r(\subset s)$ be the set of respondent units of size $m$ from which the responses $y_i$ are observed, and let the complement $s - s_r$ (of size $n-m$) be the set of non-responding units. Let $\hat y_j$ for $j\in s-s_r$ be the imputed value of the $j$th unit computed according to a certain rule depending on the superpopulation model $\xi$ (for details, see Sarndal (1992)).
Let
$$t = \sum_{i\in s}w_iy_i$$
be an unbiased estimator used for estimating $Y$ in the case of 100% response (i.e., $m = n$), where the $w_i$ are suitably chosen weights independent of the $y_i$ values. In the presence of non-response ($m < n$), Sarndal (1992) modified the estimator $t$ as
$$\hat t = \sum_{i\in s}w_iy_{i0},$$
where

$$y_{i0} = \begin{cases} y_i & \text{for } i\in s_r,\\ \hat y_i & \text{for } i\in s-s_r.\end{cases}$$

Denoting by $E_p$ ($V_p$) and $E_r$ ($V_r$) the expectation (variance) with respect to the sampling design $p$ and the response mechanism, respectively, Sarndal (1992) derived the overall variance of $\hat t$ as
$$V_{tot} = E_\xi E_pE_r(\hat t - Y)^2 = E_\xi V_p(t) + E_pE_rV_\xi(\hat t) + 2E_\xi E_p\left[(t-Y)\,E_\xi\{(\hat t - t)\mid s\}\right],$$
where $V_p(t) = E_p(t-Y)^2$ is the variance of $t$ with respect to the sampling design $p$ and $V_\xi(\hat t) = E_\xi\{(\hat t - t)^2\mid s, s_r\}$. Now writing $E_\xi V_p(t) = V_{sam}$, the sampling variance, and $E_pE_rV_\xi(\hat t) = V_{imp}$, the imputation variance, and assuming that $E_\xi E_p[(t-Y)E_\xi\{(\hat t - t)\mid s\}]$ is close to zero, Sarndal (1992) derived an approximate expression for $V_{tot}$ as
$$V_{tot} \cong V_{sam} + V_{imp}. \qquad(12.13.1)$$
Consider the situation where an SRSWOR sample $s$ of size $n$ is selected and $m$ is the size of the response sample $s_r$. Then, under the following superpopulation model,
$$E_\xi(y_i) = \beta,\quad V_\xi(y_i) = \sigma^2,\quad\text{and}\quad C_\xi(y_i,y_j) = 0 \text{ for } i\ne j, \qquad(12.13.2)$$
the unobserved values of $y_i$ may be imputed by using $\hat y_i = \hat\beta = \bar y_r = \sum_{i\in s_r}y_i/m$ for $i\in s-s_r$, i.e., $y_{i0} = y_i$ for $i\in s_r$ and $y_{i0} = \bar y_r$ for $i\in s-s_r$.
Sarndal (1992) proposed an estimator for the population total $Y$ as
$$\hat t_{01} = \frac{N}{n}\sum_{i\in s}y_{i0} = N\bar y_r. \qquad(12.13.3)$$
He proposed the estimator of the variance of $\hat t_{01}$ as
$$\hat v_0 = N^2\left(\frac{1}{m}-\frac{1}{N}\right)s_{yr}^2\ \text{(say)},\quad\text{where}\quad (m-1)s_{yr}^2 = \sum_{i\in s_r}\left(y_i - \bar y_r\right)^2. \qquad(12.13.4)$$
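A minimal Python sketch of the mean-imputation point estimator (12.13.3) and its variance estimate (12.13.4), on assumed data:

```python
import numpy as np

# Mean imputation under SRSWOR on assumed data: t01 = N * ybar_r as in
# (12.13.3), with variance estimate N^2 * (1/m - 1/N) * s_yr^2 as in (12.13.4).
N, n = 100, 10
y_resp = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])       # m responding values
m = y_resp.size
y0 = np.concatenate([y_resp, np.full(n - m, y_resp.mean())])  # imputed sample
t01 = N / n * y0.sum()                                        # equals N * ybar_r
v0 = N**2 * (1 / m - 1 / N) * y_resp.var(ddof=1)
print(t01, round(v0, 2))
```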
For the superpopulation model
$$E_\xi(y_i) = \beta x_i,\quad V_\xi(y_i) = \sigma^2x_i,\quad\text{and}\quad C_\xi(y_i,y_j) = 0, \qquad(12.13.5)$$
Sarndal (1992) used
$$y_{i0} = \begin{cases}y_i & \text{for } i\in s_r,\\ \hat Bx_i & \text{for } i\in s-s_r,\end{cases}\quad\text{with}\quad \hat B = \sum_{i\in s_r}y_i\Big/\sum_{i\in s_r}x_i,$$
and proposed an estimator for $Y$ under SRSWOR as
$$\hat t_{02} = N\bar x_s\frac{\bar y_r}{\bar x_r}, \qquad(12.13.6)$$
where $\bar x_s = n^{-1}\sum_{i\in s}x_i$, $\bar x_r = m^{-1}\sum_{i\in s_r}x_i$, and $\bar y_r = m^{-1}\sum_{i\in s_r}y_i$.
Sarndal (1992) estimated the variance of $\hat t_{02}$ by
$$\hat v_{02} = \hat V_{sam}(2) + \hat V_{imp}(2), \qquad(12.13.7)$$
where
where

$$\hat V_{sam}(2) = N^2\left(\frac{1}{n}-\frac{1}{N}\right)\left\{s_{y_0s}^2 + c_0\hat\sigma^2\right\}\quad\text{and}\quad \hat V_{imp}(2) = N^2\left(\frac{1}{n}-\frac{1}{N}\right)c_1\hat\sigma^2,$$
where
$$s_{y_0s}^2 = (n-1)^{-1}\sum_{i\in s}\left(y_{i0}-\bar y_{s0}\right)^2,\quad \bar y_{s0} = \sum_{i\in s}y_{i0}/n,\quad \hat\sigma^2 = \sum_{i\in s_r}\left(y_i - \hat Bx_i\right)^2\Big/\left\{(m-1)\bar x_r\left(1 - c\,\bar v_{xr}/m\right)\right\},$$
and $c$, $c_0$, $c_1$, and $\bar v_{xr}$ are constants computed from the $x$ values in $s_r$ and $s$.

Finally, in order to compare the relative efficiency of the estimator $\hat t_{02}$, Sarndal (1992) conducted a Monte Carlo study with 100,000 repeated response sets $s_r$, $N = 100$, $n = 30$. Three different response mechanisms were used:

Mechanism 1: Response probability $\theta_i = \exp(-a_1y_i)$ decreases with $y_i$;
Mechanism 2: Response probability $\theta_i = 1-\exp(-a_2y_i)$ increases with $y_i$; and
Mechanism 3: Response probability $\theta_i = 0.7$ is a constant.

The constants $a_1$ and $a_2$ are positive and are so chosen as to make the average of the response probabilities $\sum_i\theta_i/N = 0.7$. Note that Sarndal's (1992) estimators cannot be used gainfully when the response probabilities $\theta_k$ are known or can be estimated from the available data. So Arnab and Singh (2002d) proposed some alternative estimation procedures assuming that the response probability $\theta_i$ for the $i$th unit is either known or estimated through the log linear models:
$$(\text{i})\ \log(\theta_i) = 1 - c_1p_i;\quad\text{and}\quad (\text{ii})\ \log(\theta_i) = c_2p_i,$$
respectively, where $p_i = x_i/2$ and $c_1$, $c_2$ are unknown positive constants, which are appropriate when the response probability decreases or increases with $x$.

Arnab and Singh (2002d) assumed that the response probabilities $\theta_i$ are independent. Under this assumption, they consider that the response sample $s_r$ (formed by the set of respondent units) is a sub-sample from $s$ selected by nature according to the Poisson sampling scheme with inclusion probabilities $\pi_{i|s} = \theta_i$ and $\pi_{ij|s} = \theta_{ij} = \theta_i\theta_j$ for $i\ne j$. The Horvitz and Thompson (1952) type estimator for the total $Y$ is
$$\hat Y_{ht} = \sum_{i\in s_r}\frac{y_i}{\pi_i\theta_i}. \qquad(12.13.1.1)$$

It can be easily checked that $\hat Y_{ht}$ is unbiased for the total $Y$. The expression for the variance of $\hat Y_{ht}$ is given in the following theorem:

Theorem 12.13.1. The variance of $\hat Y_{ht}$ is $V(\hat Y_{ht}) = V_1 + V_2$, where
$$V_1 = \frac{1}{2}\sum_{i\ne j}\sum\left(\pi_i\pi_j - \pi_{ij}\right)\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 \qquad(12.13.1.2)$$
and
$$V_2 = \sum_i\frac{y_i^2}{\pi_i}\left(\frac{1}{\theta_i}-1\right). \qquad(12.13.1.3)$$
Proof. We have $V(\hat Y_{ht}) = V_pE_r(\hat Y_{ht}\mid s) + E_pV_r(\hat Y_{ht}\mid s)$. Now $E_r(\hat Y_{ht}\mid s) = \sum_{i\in s}y_i/\pi_i$, whose design variance is $V_1$, while under the Poisson response mechanism $V_r(\hat Y_{ht}\mid s) = \sum_{i\in s}z_i^2\theta_i^{-1}(1-\theta_i)$ (where $z_i = y_i/\pi_i$), whose design expectation is $V_2$.
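The unbiasedness of this type of estimator is easy to check by simulation; the following Python sketch (with arbitrary assumed data and known response probabilities) draws SRSWOR samples, thins them by Poisson response, and compares the average of the estimates with the true total:

```python
import numpy as np

# Monte Carlo sketch (assumed data): under SRSWOR followed by Poisson
# response with known probabilities theta_i, the estimator
# sum over respondents of y_i / (pi_i * theta_i) is unbiased for Y.
rng = np.random.default_rng(7)
N, n = 40, 12
y = rng.uniform(10, 30, size=N)
theta = rng.uniform(0.5, 0.9, size=N)
pi = n / N
est = []
for _ in range(20000):
    s = rng.choice(N, size=n, replace=False)
    r = s[rng.random(n) < theta[s]]          # Poisson response within s
    est.append((y[r] / (pi * theta[r])).sum())
rel_err = abs(np.mean(est) - y.sum()) / y.sum()
print(rel_err < 0.01)
```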

Theorem 12.13.2. If the sample size is large enough to ensure $\Pr\{m\ge 2\}\cong 1$, then the following two estimators
$$\hat v_{ht}(1) = \hat V_{11} + \hat V_2\quad\text{and}\quad \hat v_{ht}(2) = \hat V_{12} + \hat V_2$$
are unbiased for $V(\hat Y_{ht})$, where
$$\hat V_{11} = \frac{1}{2}\sum_{i\ne j\in s_r}\sum\frac{\left(\pi_i\pi_j-\pi_{ij}\right)}{\pi_{ij}\theta_i\theta_j}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2,\qquad \hat V_2 = \sum_{i\in s_r}\frac{y_i^2}{\pi_i^2\theta_i}\left(\frac{1}{\theta_i}-1\right),$$
and
$$\hat V_{12} = \sum_{i\in s_r}\frac{y_i^2}{\pi_i\theta_i}\left(\frac{1}{\pi_i}-1\right) - \sum_{i\ne j\in s_r}\sum\frac{\left(\pi_i\pi_j-\pi_{ij}\right)}{\pi_{ij}\pi_i\pi_j}\frac{y_iy_j}{\theta_i\theta_j}.$$
Proof. Noting
$$V_1 = \frac{1}{2}\sum_{i\ne j}\sum\left(\pi_i\pi_j-\pi_{ij}\right)\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^2 = \sum_iy_i^2\left(\frac{1}{\pi_i}-1\right) + \sum_{i\ne j}\sum\left(\pi_{ij}-\pi_i\pi_j\right)\frac{y_iy_j}{\pi_i\pi_j},$$
we can verify that both the estimators $\hat V_{11}$ and $\hat V_{12}$ are unbiased for $V_1$. It can be easily checked that $\hat V_2$ is unbiased for $V_2$.

Case I. Consider an SRSWOR design, where $\pi_i = \frac{n}{N} = \pi_{i0}$ and $\pi_{ij} = \frac{n}{N}\frac{(n-1)}{(N-1)} = \pi_{ij0}$. In this case we have
$$\hat Y_{ht} = \frac{N}{n}\sum_{i\in s_r}\frac{y_i}{\theta_i}.$$
The variance of $\hat Y_{ht}$ can be estimated by putting $\pi_i = \pi_{i0}$ and $\pi_{ij} = \pi_{ij0}$ in the expressions of the variance estimators given in Theorem 12.13.2.

Case II. Consider an SRSWOR design where $\pi_i = \pi_{i0}$, $\pi_{ij} = \pi_{ij0}$, and the response probabilities $\theta_i$ are equal to $\theta$ for every $i$. Further, if we estimate $\theta$ by $\hat\theta = m/n$, then
$$\hat Y_{ht} = \frac{N}{m}\sum_{i\in s_r}y_i = N\bar y_r = \hat t_{01}. \qquad(12.13.1.4)$$
The estimator $\hat t_{01}$ was proposed by Sarndal (1992) when the method of single imputation is used, as described in (12.13.3). Putting $\theta_i = \hat\theta = m/n$, $\pi_i = \pi_{i0}$, $\pi_{ij} = \pi_{ij0}$ in Theorem 12.13.2, we have two approximate variance estimators for $\hat t_{01}$ as
$$\hat v_{01} = \frac{N(N-n)(m-1)}{m(n-1)}s_{yr}^2 + \frac{N^2(n-m)}{nm^2}\sum_{i\in s_r}y_i^2,\quad\text{and}\quad \hat v_{02} = \hat v_{01} + \frac{N(N-n)(n-m)}{n(n-1)m^2}\sum_{i\in s_r}y_i^2.$$

Finally, noting that $\hat v_0 = N^2\left(\frac{1}{m}-\frac{1}{N}\right)s_{yr}^2$ is the estimate of the variance of $\hat t_{01}$ given by Sarndal in (12.13.4), we have the following theorem relating the magnitudes of the three variance estimators of $\hat t_{01}$.

Theorem 12.13.3. ( i ) $\hat v_{02}\ge\hat v_{01}$ for all the $y_i$ values, and ( ii ) $\hat v_{02}\ge\hat v_0$ whenever all the $y_i$ values are positive.
Proof. Straightforward and hence omitted.
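As a numerical sketch (Python, with assumed data), the three SRSWOR variance estimators of $\hat t_{01}$ can be compared, writing each formula out explicitly in the code:

```python
import numpy as np

# Numerical check of Theorem 12.13.3 on assumed data: v02 >= v01 for any
# y-values, and v02 >= v0 when all y_i > 0 (SRSWOR, theta estimated by m/n).
N, n = 100, 30
y_r = np.array([8.0, 12.0, 10.0, 9.0, 11.0, 14.0, 7.0, 13.0])
m = y_r.size
s2 = y_r.var(ddof=1)
sum_y2 = (y_r**2).sum()

v0 = N**2 * (1 / m - 1 / N) * s2
v01 = N * (N - n) * (m - 1) / (m * (n - 1)) * s2 + N**2 * (n - m) / (n * m**2) * sum_y2
v02 = v01 + N * (N - n) * (n - m) / (n * (n - 1) * m**2) * sum_y2
print(v02 >= v01, v02 >= v0)
```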

We shall discuss estimators of the total and its variance under the calibrated response probability mechanism in the following sections.

Consider the case where the $x_i$, $i\in s$, are known. Then, following Deville and Sarndal (1992), a calibrated estimator of the population total is given by
$$\hat Y_c = \sum_{i\in s_r}\frac{w_iy_i}{\pi_i}, \qquad(12.13.2.1)$$
where the $w_i$ are the calibrated weights obtained by minimizing the chi square type distance function
$$D = \sum_{i\in s_r}\frac{\left(w_i - 1/\theta_i\right)^2}{\pi_iq_i}$$
subject to the calibration constraint
$$\sum_{i\in s_r}\frac{w_ix_i}{\pi_i} = \sum_{i\in s}\frac{x_i}{\pi_i}.$$
Here the $q_i$ are suitably chosen weights. Minimization of $D$ leads to the calibrated weights
$$w_i = \frac{1}{\theta_i} + \frac{q_ix_i}{\sum_{j\in s_r}q_jx_j^2/\pi_j}\left(\sum_{i\in s}\frac{x_i}{\pi_i} - \sum_{i\in s_r}\frac{x_i}{\pi_i\theta_i}\right). \qquad(12.13.2.2)$$
On putting (12.13.2.2) in (12.13.2.1), the calibrated estimator $\hat Y_c$ becomes
$$\hat Y_c = \hat Y_{ht} + \hat B_r\left(\sum_{i\in s}\frac{x_i}{\pi_i} - \hat X_{ht}\right), \qquad(12.13.2.3)$$
where $\hat X_{ht} = \sum_{i\in s_r}\frac{x_i}{\pi_i\theta_i}$ and
$$\hat B_r = \left(\sum_{i\in s_r}\frac{y_ix_iq_i}{\pi_i}\right)\Big/\left(\sum_{i\in s_r}\frac{x_i^2q_i}{\pi_i}\right). \qquad(12.13.2.4)$$
Now writing
$$B_r = \left[E\left(\sum_{i\in s_r}\frac{y_ix_iq_i}{\pi_i}\right)\right]\Big/\left[E\left(\sum_{i\in s_r}\frac{x_i^2q_i}{\pi_i}\right)\right] = \left(\sum_{i\in\Omega}y_ix_iq_i\theta_i\right)\Big/\left(\sum_{i\in\Omega}x_i^2q_i\theta_i\right)$$
and $E_i = y_i - B_rx_i$, an approximate expression for the variance of $\hat Y_c$ is obtained as
$$V(\hat Y_c) \cong \frac{1}{2}\sum_{i\ne j}\sum\left(\pi_i\pi_j-\pi_{ij}\right)\left(\frac{E_i}{\pi_i}-\frac{E_j}{\pi_j}\right)^2 + \sum_i\frac{E_i^2}{\pi_i}\left(\frac{1}{\theta_i}-1\right).$$
It can be easily checked that

$$\hat Q(r) = \sum_{i\in s_r}\frac{e_i^2}{\pi_i^2\theta_i}\left(\frac{1}{\theta_i}-1\right) \qquad(12.13.2.5)$$
is an approximately unbiased estimator of $\sum_i\frac{E_i^2}{\pi_i}\left(\frac{1}{\theta_i}-1\right)$, where $e_i = y_i - \hat B_rx_i$ and $\hat B_r$ is as given in (12.13.2.4). Hence, using Theorem 12.13.2, we obtain the following two design consistent estimators for the variance of $\hat Y_c$ as
$$\hat v_{C(1)} = \hat Q(r) + \hat V_{11} \qquad(12.13.2.6)$$
and
$$\hat v_{C(2)} = \hat Q(r) + \hat V_{12}, \qquad(12.13.2.7)$$
where $\hat Q(r)$ is given in (12.13.2.5), and $\hat V_{11}$, $\hat V_{12}$ are as defined in Theorem 12.13.2. Following Deville and Sarndal (1992), we obtain two alternative model based as well as design based consistent estimators for the variance of $\hat Y_c$ when we replace $1/\theta_i$ by the calibrated weight $w_i$ in the expressions of $\hat v_{C(1)}$ and $\hat v_{C(2)}$. The estimators are as follows:
$$\hat v_{C(3)} = \hat Q(r) + \hat V_{13} \qquad(12.13.2.8)$$
and
$$\hat v_{C(4)} = \hat Q(r) + \hat V_{14}. \qquad(12.13.2.9)$$

Case I. Consider an SRSWOR design, where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$, and $q_i = 1/x_i$; then
$$\hat Y_c = \frac{N}{n}\sum_{i\in s_r}\frac{y_i}{\theta_i} + b_r\left(\frac{N}{n}\sum_{i\in s}x_i - \frac{N}{n}\sum_{i\in s_r}\frac{x_i}{\theta_i}\right), \qquad(12.13.2.10)$$
where $b_r = \bar y_r/\bar x_r$. Denoting $\bar x_s = \sum_{i\in s}x_i/n$, $\bar x_r = \sum_{i\in s_r}x_i/m$, and $\bar y_r = \sum_{i\in s_r}y_i/m$, and, in particular, if we assume $\theta_i = \theta$ for every $i = 1,2,\dots,N$, then we obtain $w_i = \frac{n}{m}\frac{\bar x_s}{\bar x_r} = w$ (say) and
$$\hat Y_c = N\bar x_s\frac{\bar y_r}{\bar x_r} = \hat t_{02}. \qquad(12.13.2.11)$$
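The reduction to $\hat t_{02}$ can be checked numerically. In this Python sketch (assumed data), a common calibrated weight $w = (n/m)(\bar x_s/\bar x_r)$, applied together with the design weight $N/n$, reproduces $N\bar x_s\bar y_r/\bar x_r$:

```python
import numpy as np

# Sketch on assumed data: with q_i = 1/x_i and a common response probability
# estimated by m/n, each respondent's calibration weight is
# w = (n/m) * (xbar_s / xbar_r), and (N/n) * w * sum of respondent y equals
# t02 = N * xbar_s * ybar_r / xbar_r, which is free of the response probability.
N, n = 50, 10
x_s = np.array([3., 5., 4., 6., 2., 7., 5., 4., 6., 8.])
resp = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
y_r = np.array([6., 9., 12., 15., 10., 13., 16.])    # y for the m respondents
m = resp.sum()
xbar_s, xbar_r = x_s.mean(), x_s[resp].mean()
w = (n / m) * xbar_s / xbar_r
t_c = (N / n) * w * y_r.sum()
t02 = N * xbar_s * y_r.mean() / xbar_r
print(np.isclose(t_c, t02))
```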

Remark 12.13.2.1. It is important to note that the estimator (12.13.2.11) is independent of the response probability $\theta$. Hence one can use $\hat t_{02}$ without estimating the response probability $\theta$. The estimator (12.13.2.11) was obtained by Sarndal (1992) with the single imputation method. Now, replacing $\theta$ by its estimate $\hat\theta = m/n$ and putting $\pi_i = n/N$, $\pi_{ij} = n(n-1)/\{N(N-1)\}$ and $w_i = w$ in the expressions of $\hat v_{C(j)}$, $j = 1,2,3,4$, we have the following variance estimators for $\hat t_{02}$:

$$\hat v_{0C(1)} = \frac{N^2(n-m)(m-1)}{nm^2}s_{er}^2 + \frac{N(N-n)(m-1)}{m(n-1)}s_{yr}^2,$$
$$\hat v_{0C(2)} = \frac{N^2(n-m)(m-1)}{nm^2}s_{er}^2 + \frac{N(N-n)}{m(n-1)}\left\{(m-1)s_{yr}^2 + \left(\frac{1}{m}-\frac{1}{n}\right)\sum_{i\in s_r}y_i^2\right\},$$
$$\hat v_{0C(3)} = \frac{N^2(m-1)}{nm^2}\left(\frac{\bar x_s}{\bar x_r}\right)\left(\frac{n\bar x_s - m\bar x_r}{\bar x_r}\right)s_{er}^2 + \frac{N(N-n)(m-1)}{m(n-1)}\left(\frac{\bar x_s}{\bar x_r}\right)^2s_{yr}^2,$$
and
$$\hat v_{0C(4)} = \frac{N^2(m-1)}{nm^2}\left(\frac{\bar x_s}{\bar x_r}\right)\left(\frac{n\bar x_s - m\bar x_r}{\bar x_r}\right)s_{er}^2 + \frac{N(N-n)}{nm}\frac{\bar x_s}{\bar x_r}\left(1-\frac{\bar x_s}{\bar x_r}\right)\sum_{i\in s_r}y_i^2 + \left(\frac{\bar x_s}{\bar x_r}\right)^2\frac{N(N-n)}{m(n-1)}\left\{(m-1)s_{yr}^2 + \left(\frac{1}{m}-\frac{1}{n}\right)\sum_{i\in s_r}y_i^2\right\},$$
where $(m-1)s_{er}^2 = \sum_{i\in s_r}\left(y_i - b_rx_i\right)^2$.

Case II. Consider an SRSWOR sampling design, $q_i = 1$ and $\theta_i = \theta$ for every $i = 1,2,\dots,N$. If we replace $\theta$ by its estimate $\hat\theta = m/n$ in (12.13.2.2), we obtain
$$w_i = w_{i0} = \frac{n}{m} + n\left(\sum_{i\in s_r}x_i^2\right)^{-1}\left(\bar x_s - \bar x_r\right)x_i,$$
and $w_i = w_{i0}$ yields
$$\hat Y_c = N\left[\bar y_r + \hat b\left(\bar x_s - \bar x_r\right)\right] = \hat t_{03}\ \text{(say)}, \qquad(12.13.2.12)$$
where $\hat b = \sum_{i\in s_r}y_ix_i\Big/\sum_{i\in s_r}x_i^2$.
Now writing $(m-1)\bar s_{er}^2 = \sum_{i\in s_r}\left(y_i - \hat bx_i\right)^2$ and $e_i = y_i - \hat bx_i$, we have the following estimators for the variance of (12.13.2.12):
$$\hat{\bar v}_{0C(1)} = \frac{N^2(n-m)(m-1)}{nm^2}\bar s_{er}^2 + \frac{N(N-n)(m-1)}{m(n-1)}s_{yr}^2,$$
$$\hat{\bar v}_{0C(2)} = \frac{N^2(n-m)(m-1)}{nm^2}\bar s_{er}^2 + \frac{N(N-n)}{m(n-1)}\left[(m-1)s_{yr}^2 + \left(\frac{1}{m}-\frac{1}{n}\right)\sum_{i\in s_r}y_i^2\right],$$
$$\hat{\bar v}_{0C(3)} = \frac{N^2}{n^2}\sum_{i\in s_r}w_{i0}\left(w_{i0}-1\right)e_i^2 + \frac{N(N-n)}{2n^2(n-1)}\sum_{i\ne j\in s_r}\sum w_{i0}w_{j0}\left(y_i - y_j\right)^2,$$
and
$$\hat{\bar v}_{0C(4)} = \frac{N^2}{n^2}\sum_{i\in s_r}w_{i0}\left(w_{i0}-1\right)e_i^2 + \frac{N(N-n)}{n^2}\sum_{i\in s_r}w_{i0}y_i^2 - \frac{N(N-n)}{n^2(n-1)}\sum_{i\ne j\in s_r}\sum w_{i0}w_{j0}y_iy_j.$$

One may also refer to Lundstrom (1997) and Lundstrom and Sarndal (1999), who suggested that calibration can be taken as a standard method for the treatment of non-response in survey sampling.

Exercise 12.1. Distinguish between 'missing at random' and 'missing completely at random'. Give examples of random non-response and deliberate non-response in interview surveys.

Exercise 12.2. Let $t = t(s) = \sum_{i\in s}b_{si}y_i$ be an estimator of the population total $Y = \sum_{i=1}^{N}y_i$ in the case of complete response. Let $t^*(s^*) = \sum_{i\in s^*}\frac{b_{si}y_i}{\theta_i}$ be an estimator of the population total in the presence of non-response, where $\theta_i$ is the probability of response. Compare the variance of the estimator $t^*(s^*)$ with that of $t(s)$.
Hint: Arnab (1979b).

Exercise 12.3. Let there be a population of $N$ units from which a sample of size $n$ is to be drawn. Let the study variable be denoted by $y$ and the auxiliary variable by $x$. The selection probabilities $p_i$ of the population units are taken to be proportional to the corresponding $x$ values. Let $r$ ($r = 0,1,2,\dots,(n-1)$) be the number of units (including repetitions in the case of PPSWR sampling) on which the information on $y$ could not be collected. The value of $r$ is supposed to be less than or equal to $(n-2)$ when estimation of the variance of the estimator of the population total is concerned. Then show that an unbiased estimator of the population total $Y$ is given by
$$\hat Y_{HTR} = \frac{n}{n-r}\sum_{i=1}^{n-r}d_iy_i$$
and the variance of the estimator $\hat Y_{HTR}$ is
$$V\left(\hat Y_{HTR}\right) = \sum_{i<j}^{N}\left(\pi_i\pi_j - \pi_{ij}\right)\left(d_iy_i - d_jy_j\right)^2 + \frac{1}{n-1}\sum_{i<j}^{N}\pi_{ij}\left(d_iy_i - d_jy_j\right)^2E\left(\frac{r}{n-r}\right),$$
where $d_i = \pi_i^{-1}$.
Hint: Singh and Singh (1979), Bouza (1994).
1042 Advanced sampling theory with applications

Exercise 12.4. Study the properties of estimators of the finite population variance $\sigma_y^2$ in the presence of random non-response, given by
$$\hat V_1 = s_y^{*2} + \hat\beta^{*2}\left(S_x^2 - s_x^{*2}\right),\qquad \hat V_2 = s_y^{*2} + k_1\left(S_x^2 - s_x^{*2}\right) + k_2\left(S_z^2 - s_z^{*2}\right),$$
where $\hat\beta^* = s_{xy}^*/s_x^{*2}$, and $k_1$ and $k_2$ are real constants.
Hint: Singh and Joarder (1998), Singh, Chandra, and Singh (2003).

Exercise 12.5. Let $(y_i, x_i)$ be the values on the $i$th unit of the study variable $y$ and the auxiliary variable $x$ for a population of size $N$ with population means $\bar Y$ and $\bar X$, respectively. Let $n$ be the size of an SRSWOR sample drawn from it. Consider that only $n_1$ selected persons respond and $n_2$ do not, such that $n_1 + n_2 = n$. From the $n_2$ non-respondents, let $r = n_2/k$, $k > 1$, units be selected by making extra efforts. An estimator of the population mean $\bar Y$ is given by
$$\bar y' = \frac{n_1}{n}\bar y_1 + \frac{n_2}{n}\bar y_2',$$
where $\bar y_1$ and $\bar y_2'$ are the sample means based on $n_1$ and $r$ units, respectively, for the study character. Assuming that the population mean $\bar X$ of the auxiliary variable is known, study the asymptotic properties of the following estimators:

( a ) $m_1 = \bar y'\left(\dfrac{\bar X}{\bar x'}\right)$, where $\bar x' = \dfrac{n_1}{n}\bar x_1 + \dfrac{n_2}{n}\bar x_2'$, $\bar x_1$ and $\bar x_2'$ being the sample means based on $n_1$ and $r$ units, respectively, for the auxiliary variable;

( b ) $m_2 = \bar y'\left(\dfrac{\bar X}{\bar x}\right)$, where $\bar x = n^{-1}\sum_{i=1}^{n}x_i$ denotes the sample mean based on all $n$ units for the auxiliary variable;

( c ) $m_3 = \bar y'\left(\dfrac{\bar X + C}{\bar x' + C}\right)$ and $m_4 = \bar y'\left(\dfrac{\bar X + C_1}{\bar x + C_1}\right)$,

with $C$ and $C_1$ suitably chosen constants such that the variances of $m_3$ and $m_4$ are minimum. Compare $m_3$ and $m_4$ with the estimator of the population mean in the presence of full response, defined as
$$\bar y_{sd} = \bar y\left(\frac{\bar X + C_x}{\bar x + C_x}\right).$$
Hint: Khare and Srivastava (1997).

Exercise 12.6. Let $Y_{ij}$, $j = 1,2,\dots,N_i$, be the value of the study variable for the $j$th unit in the $i$th stratum, $i = 1,2,\dots,L$. The population of each stratum is divided into two classes: those who will respond at the first attempt and those who will not, thus creating the problem of an incomplete sample in the mail survey. In order to preserve the advantage of the mail survey, an investigator decided to apply the following strategy:

( a ) Select a random sample from each stratum;
( b ) Send a mail questionnaire to all of the selected units in each stratum;
( c ) After the deadline is over, identify the non-respondents in each stratum;
( d ) Collect data from the selected non-respondents in the sub-sample by interview, and combine data from the two parts of the survey in each stratum to provide an unbiased estimate of the population mean.

Let $\bar y_i$ and $\bar y_{ri}$ be the sample means of the respondent groups in the $i$th stratum, obtained at the first attempt and the second attempt (mail survey, say), respectively.

( I ) Show that an unbiased estimator of the population mean is given by
$$\bar y_{st} = \sum_{i=1}^{L}W_i\left[\frac{n_{i1}\bar y_i + n_{i2}\bar y_{ri}}{n_i}\right]$$
with $W_i = N_i/N$. Find the variance and suggest an estimator of its variance. Compare the strategies under:

( a ) Optimal allocation; ( b ) Proportional allocation; and ( c ) Equal allocation.

( II ) If the population mean $\bar X$ of the auxiliary variable is known, then repeat the above exercise for the estimators defined as
$$\bar y_1 = \sum_{i=1}^{L}W_i\left(\frac{n_{i1}\bar y_i + n_{i2}\bar y_{ri}}{n_i}\right)\left[\bar X\Big/\sum_{i=1}^{L}W_i\left(\frac{n_{i1}\bar x_i + n_{i2}\bar x_{ri}}{n_i}\right)\right]$$
and
$$\bar y_3 = \sum_{i=1}^{L}W_i\left(\frac{n_{i1}\bar y_i + n_{i2}\bar y_{ri}}{n_i}\right)\left[\bar X\Big/\left(\frac{n_{i1}\bar x_i + n_{i2}\bar x_{ri}}{n_i}\right)\right].$$
Hint: Khare (1987), Lindley and Deely (1993).

Exercise 12.7. Discuss the different methods of imputation and suggest estimators
of variance.
Hint: Lee, Rancourt, and Sarndal (1994, 1995a, 1995b).

Exercise 12.8. Discuss the Rao–Shao adjustments for estimating the variance of the estimator of the population total using jackknifing under different methods of imputation.

Exercise 12.9. Consider a finite population of size $N$. Let $y_i$ be the value of the variable under study, $y$, for the $i$th unit. Each unit of the population has a definite probability of providing the necessary information with respect to the variable of interest under the particular given field method. Let $p_i$ denote this probability of obtaining the required information from the $i$th unit and let $q_i = 1 - p_i$. Consider that we selected a sample of $n$ units by SRSWOR sampling and only $m$ units provided the required information. In repeated samples, the value of $m$ will be a random variable taking the values $0,1,2,\dots,n$. Show that an unbiased estimator of the population total is given by
$$\hat Y_1 = \begin{cases}0 & \text{if } m = 0,\\[4pt] \dfrac{N}{n}\sum\limits_{i=1}^{N}\dfrac{y_iV_ie_i}{p_i} & \text{if } m > 0,\end{cases}$$
where $V_i = 1$ or $0$ according as the $i$th unit is or is not in the sample, and $e_i = 1$ or $0$ according as the $i$th unit does or does not provide the required information. Find its variance.
Hint: Singh and Narain (1989).

Exercise 12.10. Let a finite population consist of $N$ units. To every unit there is attached a characteristic $y$. The characteristics are assumed to be measured on a given scale with distinct points $y_1, y_2,\dots,y_T$. Let $N_l$ be the number of units associated with the scale point $y_l$, with $N = \sum_lN_l$. A simple random sample of size $n$ is drawn. Using the likelihood function of $(N_1, N_2,\dots,N_T)$ and assuming $N_ln/N$ to be an integer, a maximum likelihood estimate of the population mean $\bar Y$ is given by $\bar y_{ml} = \frac{1}{n}\sum_ln_ly_l$, where $n_l$ is the number of $y_l$ values observed in the sample. When non-response is observed, assume that response is obtained from $n(r)$ units in the sample and non-response from $n(\bar r)$ units, such that $n(r) + n(\bar r) = n$. Show that, under the likelihood function defined as
$$L\left[N_{r1}, N_{r2},\dots,N_{rT(r)}; N(\bar r)\right] = \binom{N_{r1}}{n_{r1}}\binom{N_{r2}}{n_{r2}}\cdots\binom{N_{rT(r)}}{n_{rT(r)}}\binom{N(\bar r)}{n(\bar r)}\Big/\binom{N}{n},$$
a maximum likelihood estimator of the population mean is given by
$$\bar y_r = \frac{1}{n(r)}\sum_ln_{rl}y_{rl}.$$
Hint: Laake (1986).

Exercise 12.11. ( a ) Adjust the inclusion probabilities in the Horvitz and Thompson (1952) estimator of the population mean (or total), defined as
$$\bar t_{HT} = N^{-1}\sum_{k\in s}\left(y_k/\pi_k\right),$$
where only $r$ out of $n$ units ($r < n$) are responding. Let $\pi_k^{(j)}$ denote the $j$th adjusted inclusion probability; then study the asymptotic properties of the following five estimators
$$\bar t_{cc(j)} = N^{-1}\sum_{k\in r}y_k/\pi_k^{(j)},\quad j = 1,2,3,4,5,$$
where
$$\pi_k^{(1)} = \left(\pi_k\right)^{r/n},\quad \pi_k^{(2)} = \frac{r}{n}\pi_k,\quad \pi_k^{(3)} = \pi_k\frac{\Pr\left(k\in U_R\mid I_k = 1\right)}{\Pr\left(k\in U_R\right)},$$
with $U_R$ being a particular set of units in the population that would respond to the survey question, and $I_k$ is an indicator function defined as
$$I_k = \begin{cases}1 & \text{if } k\in s,\\ 0 & \text{if } k\notin s,\end{cases}\quad\text{such that}\quad \pi_k = \Pr\left[I_k = 1\mid k\in U_R\right];$$
$$\pi_k^{(4)} = \left(1-r_{xy}\right)\left(\pi_k\right)^{r/n} + r_{xy}\frac{r}{n}\pi_k,$$
where $r_{xy}$ is an estimator of the correlation coefficient $\rho_{xy}$ between $x$ and $y$; and
$$\pi_k^{(5)} = \frac{r}{n}\left[\left(1-r_{xy}\right)\frac{n}{N} + r_{xy}\pi_k\right].$$
Study the bias and variance properties of these five estimators.
Hint: Chaubey and Crisalli (1995).
( b ) Consider here an estimator of the population total $Y$ as
$$\hat Y_{w\Omega} = \sum_{i\in r}d_iv_{si}y_i,$$
where the $v_{si}$ are new calibrated response weights obtained by minimizing a new penalized chi square distance function, defined as
$$D_1 = \frac{1}{2}\sum_{i\in r}\frac{\left(v_{si}-1\right)^2}{q_i} + \frac{1}{2}\sum_{i\in r}\frac{\Phi_iv_{si}^2}{q_i},$$
where the $q_i$ are suitably chosen weights to form different types of estimators, and $\Phi_i$ is a relaxed penalty which can take any value in the range $(-1,\infty)$.

Under such assumptions study the following situations:

( i ) Show that, when no auxiliary information is available, the minimization of $D_1$ leads to calibrated weights given by
$$v_{si} = 1/\left(1+\Phi_i\right).$$
( ii ) Show that the resultant estimator of the population total, $Y$, becomes
$$\hat Y_{w\Omega}(0) = \sum_{i\in r}\frac{d_iy_i}{1+\Phi_i}.$$
Note that the estimator $\hat Y_{w\Omega}(0)$ is free from $q_i$, and thus comment on its choice. Further verify the following cases for different choices of $\Phi_i$:

Case I. If $\Phi_i = V\left(\hat Y_{w\Omega}\right)/Y^2$ and is constant for each respondent in the sample, then $\hat Y_{w\Omega}(0)$ reduces to the Searls (1964) estimator in the presence of non-response.

Case II. If $\Phi_i = \pi_i^{(r/n)-1} - 1$, then the estimator $\hat Y_{w\Omega}(0)$ becomes
$$\hat Y_{w\Omega}(0)_{cc(1)} = \sum_{i\in r}\frac{y_i}{\pi_i^{r/n}},$$
which is the first estimator considered by Chaubey and Crisalli (1995).

Case III. If $\Phi_i = \frac{r}{n} - 1$, then the estimator $\hat Y_{w\Omega}(0)$ becomes
$$\hat Y_{w\Omega}(0)_{cc(2)} = \frac{n}{r}\sum_{i\in r}d_iy_i,$$
which is the second estimator considered by Chaubey and Crisalli (1995).

Case IV. If $\Phi_i = \dfrac{\Pr\left(i\in U_R\mid I_i = 1\right) - \Pr\left(i\in U_R\right)}{\Pr\left(i\in U_R\right)}$, where $U_R$ represents the set of units in the population that would respond to the survey question and $I_i$ is an indicator function such that
$$I_i = \begin{cases}1 & \text{if } i\in s,\\ 0 & \text{if } i\notin s,\end{cases}\quad\text{and}\quad \pi_i = \Pr\left[I_i = 1\mid i\in U_R\right],$$
then $\hat Y_{w\Omega}(0)$ becomes
$$\hat Y_{w\Omega}(0)_{cc(3)} = \sum_{i\in r}\frac{\Pr\left(i\in U_R\right)}{\Pr\left(i\in U_R\mid I_i = 1\right)}d_iy_i,$$
which is the third estimator considered by Chaubey and Crisalli (1995).

Case V. If $\Phi_i = \left(1-r_{xy}\right)\pi_i^{(r/n)-1} + r_{xy}(r/n) - 1$, then the estimator $\hat Y_{w\Omega}(0)$ becomes
$$\hat Y_{w\Omega}(0)_{cc(4)} = \sum_{i\in r}\frac{y_i}{\left(1-r_{xy}\right)\pi_i^{r/n} + r_{xy}\pi_i(r/n)},$$
which is the fourth estimator considered by Chaubey and Crisalli (1995).

Case VI. If $\Phi_i = \left(1-r_{xy}\right)\dfrac{r}{N}\dfrac{1}{\pi_i} + r_{xy}(r/n) - 1$, then the estimator $\hat Y_{w\Omega}(0)$ becomes
$$\hat Y_{w\Omega}(0)_{cc(5)} = \frac{n}{r}\sum_{i\in r}\frac{y_i}{\left(1-r_{xy}\right)(n/N) + r_{xy}\pi_i},$$
which is the fifth estimator considered by Chaubey and Crisalli (1995).

( i ) Minimize $D_1$ subject to the calibration constraint
$$\sum_{i\in r}d_iv_{si}x_i = \sum_{i\in s}x_i$$
and study the resultant estimators.

( ii ) Suggest a few new calibration constraints and study the resultant estimators.
Hint: Singh and Arnab (2003).
12. Non-response and its treatments 1047

Exercise 12.12. ( I ) Assume the data after imputation take the form
$$y_{\bullet i} = \begin{cases}\dfrac{(n-1)}{(r-1)}\,y_i & \text{if } i\in A,\\[8pt] (r-1)^{-1}\left\{(n-1)x_i - n\bar x_n\right\}z_r & \text{if } i\in\bar A,\end{cases}$$
where $A$ and $\bar A$ denote the responding and non-responding sets of the sample $s$ such that $s = A\cup\bar A$.

( a ) Show that the point estimator
$$\hat e_2 = \frac{1}{n}\sum_{i=1}^{n}y_{\bullet i}$$
of the population mean $\bar Y$ becomes
$$\hat e_2 = z_r\bar x_n + \frac{r(n-1)}{n(r-1)}\left(\bar y_r - z_r\bar x_r\right),$$
which is unbiased for the population mean, where $\bar y_r = r^{-1}\sum_{i\in A}y_i$, $\bar x_r = r^{-1}\sum_{i\in A}x_i$, $\bar x_n = n^{-1}\sum_{i\in s}x_i$, and $z_r = r^{-1}\sum_{i\in A}\dfrac{y_i}{x_i}$ have the usual meanings.

( b ) Find the variance of the estimator $\hat e_2$ and estimate it by the jackknife technique.

( II ) Consider the data after imputation take the form
$$y_{\bullet i} = \begin{cases}\dfrac{r(n-r+1)}{n}\,y_i - \dfrac{(n-r)(r-1)}{n}\,\bar y_r(i)\,\dfrac{\bar x_n}{\bar x_r(i)} & \text{if } i\in A,\\[10pt] \dfrac{r(n-r+1)}{n}\left[\sum_{i\in A}y_i\Big/\sum_{i=1}^{n}x_i\right]x_i & \text{if } i\in\bar A,\end{cases}$$
where $\bar y_r(i) = \dfrac{r\bar y_r - y_i}{r-1}$ and $\bar x_r(i) = \dfrac{r\bar x_r - x_i}{r-1}$.

( a ) Show that the point estimator $\hat e_3 = \frac{1}{n}\sum_{i=1}^{n}y_{\bullet i}$ of the population mean $\bar Y$ under this method of imputation becomes
$$\hat e_3 = \frac{r(n-r+1)}{n}\,\bar y_r\,\frac{\bar x_n}{\bar x_r} - \frac{(n-r)(r-1)}{rn}\sum_{i=1}^{r}\bar y_r(i)\,\frac{\bar x_n}{\bar x_r(i)}.$$
( b ) Show that the estimator $\hat e_3$ is an unbiased estimator of the population mean.
( c ) Show that the variance of the estimator $\hat e_3$, to the first order of approximation, is equivalent to that of the ratio estimator in two-phase sampling.
Hint: Singh, Horn, and Tracy (2001).

Exercise 12.13. Consider a population consisting of $N$ units. We draw an SRSWOR sample of $n$ units and measure the study variable $y$ and the auxiliary variable $x$. We are interested in estimating the population ratio $R = \bar Y/\bar X$ of the population means $\bar Y$ and $\bar X$. Assume that after taking the sample only a set of $(n-p-q)$ complete observations $(y_i, x_i)$, $i = 1,2,\dots,(n-p-q)$, is available. Assume that $p$ observations $x_1^*, x_2^*,\dots,x_p^*$ in the sample are available for the $x$ variable but the corresponding observations of the $y$ variable are missing. Similarly, assume that we have $q$ observations $y_1^{**}, y_2^{**},\dots,y_q^{**}$ on the $y$ variable in the sample, but the associated values of the auxiliary variable $x$ are missing. Defining
$$\bar x = (n-p-q)^{-1}\sum_{i=1}^{n-p-q}x_i,\quad \bar y = (n-p-q)^{-1}\sum_{i=1}^{n-p-q}y_i,\quad \bar x^* = p^{-1}\sum_{i=1}^{p}x_i^*,\quad\text{and}\quad \bar y^{**} = q^{-1}\sum_{i=1}^{q}y_i^{**},$$
find the bias and variance expressions of the following estimators of $R$, defined as
$$r_1 = \frac{\bar y}{\bar x},\quad r_2 = \frac{(n-q)\bar y}{(n-p-q)\bar x + p\bar x^*},\quad r_3 = \frac{(n-p-q)\bar y + q\bar y^{**}}{(n-p)\bar x},$$
$$r_4 = \left(\frac{n-q}{n-p}\right)\frac{(n-p-q)\bar y + q\bar y^{**}}{(n-p-q)\bar x + p\bar x^*},\quad\text{and}\quad r_5 = a_0\left(a_1\bar y + a_2\bar y^{**}\right)\Big/\left(a_3\bar x + a_4\bar x^*\right).$$
Show that the estimators $r_i$, $i = 1,2,3,4$, are special cases of the estimator $r_5$ for different choices of $a_j$, $j = 0,1,2,3$, and $4$.
Hint: Toutenburg and Srivastava (1998).

Exercise 12.14. Consider that the population of interest has been stratified into $L$ strata with $N_h$ clusters in the $h$th stratum. At the first stage of sampling, let $n_h\ge 2$ clusters be selected from the $h$th stratum. Let $p_{hi}$, $i = 1,2,\dots,N_h$, $h = 1,2,\dots,L$, denote the probability of selecting the $i$th cluster from the $h$th stratum. Assume that these clusters are selected independently across strata and without replacement, and that the overall sampling fraction $\sum_hn_h/\sum_hN_h$ is negligible. Let $y_{hik}$ be the ultimate population unit values, where the indices $(h,i,k)$, $k = 1,\dots,N_{hi}$; $i = 1,\dots,N_h$; $h = 1,\dots,L$, have their usual meanings. We wish to estimate the population total
$$Y = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\sum_{k=1}^{N_{hi}}y_{hik}.$$
If there is no non-response and the $w_{hik}$ are the design weights, then show that an unbiased estimator of the population total $Y$ is given by
$$\hat Y = \sum_{(hik)\in s}w_{hik}y_{hik}.$$
In a stratified multi-stage sampling design, consider for the ratio method of imputation that the adjusted values are given by
$$\tilde y_{hik}^{(r)}(\varepsilon) = \begin{cases}\hat\beta^{(r)}(\varepsilon)\,x_{hik} & \text{if } y_{hik} \text{ is missing},\\ y_{hik} & \text{if } y_{hik} \text{ is observed},\end{cases}$$
with
$$\hat\beta^{(r)}(\varepsilon) = \sum_{(hik)\in s}\delta_{hik}w_{hik}^{(r)}(\varepsilon)y_{hik}\Big/\sum_{(hik)\in s}\delta_{hik}w_{hik}^{(r)}(\varepsilon)x_{hik},$$
where $\delta_{hik} = 1$ if $y_{hik}$ is observed and $0$ otherwise. Then the balanced repeated replication, $BRR(\varepsilon)$, method of variance estimation, for a fixed $\varepsilon$, is
$$v_{BRR}(\varepsilon) = \frac{1}{R}\sum_{r=1}^{R}\left\{\hat\theta^{(r)}(\varepsilon) - \hat\theta\right\}^2,$$
where $\hat\theta$ is an estimator of any parameter of interest, say $\theta$, and $\hat\theta^{(r)}(\varepsilon)$ is computed using the same formula as for $\hat\theta$ but with the original weights $w_{hik}$ replaced by the new weights, given by
$$w_{hik}^{(r)}(\varepsilon) = \begin{cases}(2-\varepsilon)\,w_{hik} & \text{if the } (h,i)\text{th cluster is selected in the } r\text{th replicate},\\ \varepsilon\,w_{hik} & \text{otherwise}.\end{cases}$$
Discuss your views by taking $\hat\theta = \hat Y$ and $\varepsilon = 0.5$. Also show that $v_{BRR}(\varepsilon)/\operatorname{var}(\hat\theta)\to 1$.
Hint: Rao and Shao (1999).

Exercise 12.15. Show that under a multi-stage post-stratified sampling design an estimator of the population total $Y$ in the presence of unit non-response is
$$\hat Y_{N,ps} = \sum_{g}\sum_{p}\sum_{(hik)\in s}d_pe_gw_{hik}a_{hik}\delta_{ghik}\eta_{phik}y_{hik},$$
where $M_g$ is the $g$th post-stratum count, $y_{hik}$ denotes the value of the $k$th unit of the study variable in the $i$th cluster and $h$th stratum,
$$a_{hik} = \begin{cases}1 & \text{if } (hik) \text{ responds},\\ 0 & \text{otherwise},\end{cases}\qquad d_p = \frac{\sum_{(hik)\in s}w_{hik}\eta_{phik}}{\sum_{(hik)\in s}w_{hik}a_{hik}\eta_{phik}},$$
$$e_g = \frac{M_g}{\sum_p\sum_{(hik)\in s}d_pw_{hik}a_{hik}\delta_{ghik}\eta_{phik}},\qquad \eta_{phik} = \begin{cases}1 & \text{if } (hik)\in p\text{th weighting class},\\ 0 & \text{otherwise},\end{cases}$$
$\delta_{ghik}$ indicates membership of the unit $(hik)$ in the $g$th post-stratum, and $w_{hik}(>0)$ denotes the design weights. Suggest a method to estimate the variance of the estimator $\hat Y_{N,ps}$ using the concept of the jackknife.
Hint: Yung and Rao (2000).

Exercise 12.16. Consider that we took a sample $s_n$ of $n$ units using SRSWOR sampling from a population $\Omega$ of $N$ units. Let $y_i$, $i = 1,2,\dots,n$, denote the value of the $i$th unit selected in the sample. Unfortunately there is random non-response, that is, only $r < n$ units selected in the sample are responding. Let $y_i^*$, $i = r+1, r+2,\dots,n$, denote the imputed values. The situation is as shown in the following table.

Unit ($i$)   Observed value   Imputed value   Adjusted value
1            $y_1$                            $a_1$
2            $y_2$                            $a_2$
3            $y_3$                            $a_3$
...          ...                              ...
$r$          $y_r$                            $a_r$
$r+1$                         $y_{r+1}^*$     $a_{r+1}$
$r+2$                         $y_{r+2}^*$     $a_{r+2}$
...                           ...             ...
$n$                           $y_n^*$         $a_n$

Let $\bar y'$ and $\operatorname{var}(y')$ denote the sample mean and sample variance of the sample which comprises the values making up $y'$, the observed values together with the imputed values. Let $m$ and $u > 0$ be fixed real numbers. Let $a = (a_1, a_2,\dots,a_n)$ be a vector of real numbers. Show that the minimization of $\sum_{i=1}^{n}\left(a_i - y_i'\right)^2$ subject to the two constraints
$$\bar a = \frac{1}{n}\sum_{i=1}^{n}a_i = m\quad\text{and}\quad \operatorname{var}(a) = \frac{1}{n-1}\sum_{i=1}^{n}\left(a_i - m\right)^2 = u$$
leads to the adjusted data set given by
$$a_i^0 = m + \left\{\frac{u}{\operatorname{var}(y')}\right\}^{0.5}\left(y_i' - \bar y'\right),\quad i = 1,2,\dots,n.$$
Show that the mean method of imputation is a special case of it. Extend the results for the ratio method of imputation. Let $v_i$ be another variable corresponding to the auxiliary variable. Show that the minimization of $\sum_{i=1}^{n}\left(a_i - y_i'\right)^2$ subject to the two constraints $\bar a = \frac{1}{n}\sum_{i=1}^{n}a_i = m$ and
$$\operatorname{var}(a) = \frac{1}{n-1}\sum_{i=1}^{n}\left(a_i - v_i\right)^2 = u$$
leads to adjusted data similar to the ratio method of imputation.
Hint: Meeden (2000).
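The mean-and-variance constrained adjustment in this exercise is easy to illustrate numerically; in this Python sketch the data and the target values $m$ and $u$ are hypothetical:

```python
import numpy as np

# Illustration of the adjustment in Exercise 12.16 (hypothetical data):
# shift and rescale the completed values y' so that the adjusted values
# a_i have prescribed mean m and variance u.
yprime = np.array([4.0, 7.0, 5.0, 9.0, 6.0, 8.0, 3.0, 10.0])
m_target, u_target = 12.0, 2.5
scale = np.sqrt(u_target / yprime.var(ddof=1))
a = m_target + scale * (yprime - yprime.mean())
print(round(a.mean(), 6), round(a.var(ddof=1), 6))
```

By construction the adjusted values hit the targets exactly, up to floating-point error.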

Exercise 12.17. Consider that a census of the population has been undertaken and that all units can be classified, by the call-back index number on which they responded, into L strata. Each call-back stratum h is of size N_h, and the total population size is N = Σ_h N_h. Let y_hi be the (fixed) value of a variable of interest for the i-th unit in the h-th call-back stratum. Defining

Ȳ = N^{-1} Σ_{h=1}^{L} Σ_{i=1}^{N_h} y_hi,   σ² = N^{-1} Σ_{h=1}^{L} Σ_{i=1}^{N_h} (y_hi − Ȳ)²,
Ȳ_h = N_h^{-1} Σ_{i=1}^{N_h} y_hi,   and   σ_h² = N_h^{-1} Σ_{i=1}^{N_h} (y_hi − Ȳ_h)²,

show that under the sub-sampling strategy a an unbiased estimator of Ȳ is given by ȳ(a) = Σ_h w_h^a ȳ_h, where ȳ_h = (1/n_h^a) Σ_{i∈s_h} y_hi for the set s_h of interviewed units in stratum h, n_h^a is the number of interviews obtained in the h-th call-back stratum under the a sub-sampling strategy, and

w_h^a = (n_h^a / α_h) / Σ_h (n_h^a / α_h),   with α_h = 1 if h < m and α_h = α if h ≥ m.

Show that the conditional variance for a fixed number of attempts to be made is given by

V(ȳ(a)) ≈ (1/n_a) (Σ_h W_h α_h) { Σ_h (W_h / α_h) [σ_h² + (Ȳ_h − Ȳ)²] },   where W_h = N_h / N.


Hint: Elliott, Little, and Lewitzky (2000).

Exercise 12.18. Consider that, for estimating the mean of a finite population, a random sample of n units is distributed equally among m enumerators chosen randomly from an infinite population of enumerators. Assume that y_ij, the value of y on unit i as enumerated by interviewer j, is given by the model y_ij = Y_i + a_j + e_ijk, where

E(e_ijk | i, j) = 0,   V(e_ijk | i, j) = S²,   and   Cov(e_ijk, e_ijk') = 0 for all k ≠ k'.

Find the expected value and variance of the estimator defined as

ȳ_s = (1/mn) Σ_{j=1}^{m} Σ_{i=1}^{n} y_ij.

Hint: Sukhatme (1953).

Exercise 12.19. Assume the data after imputation take the form

y_•i = y_i,   if i ∈ A,
y_•i = ȳ_r {(n − r) x̄_n + r a (x̄_n − x̄_r)} x_i / [{a x̄_r + (1 − a) x̄_n} Σ_{j∈Ā} x_j],   if i ∈ Ā,

where a is a suitably chosen constant such that the variance of the resultant estimator is minimum. The sets A (Ā) denote the sets of responding (non-responding) units in the sample s such that s = A ∪ Ā.

( a ) Show that the point estimator

Ȳ_p = (1/n) Σ_{i=1}^{n} y_•i

of the population mean Ȳ under this method of imputation becomes

Ȳ_p = ȳ_r x̄_n / {a x̄_r + (1 − a) x̄_n},

where ȳ_r = r^{-1} Σ_{i∈A} y_i, x̄_r = r^{-1} Σ_{i∈A} x_i, and x̄_n = n^{-1} Σ_{i∈s} x_i.
Study the asymptotic bias and variance properties of the estimator Ȳ_p.
Hint: Singh (2000c).

Exercise 12.20. ( I ) Consider that the data after imputation become

y_•i = y_i if i ∈ s,   and   y_•i = b x_i if i ∈ s′ − s,

where b = Σ_{i∈s} y_i / Σ_{i∈s} x_i.

( a ) Show that the point estimator of the population mean given by

ȳ_s = (1/n′) Σ_{i∈s′} y_•i

becomes

ȳ_R = ȳ_n (x̄_{n′} / x̄_n)

with variance

V(ȳ_R) = (1/n′ − 1/N) S_y² + (1/n − 1/n′) S_d²,

where

S_y² = (N − 1)^{-1} Σ_{i∈Ω} (Y_i − Ȳ)²,   S_d² = (N − 1)^{-1} Σ_{i∈Ω} d_i²,   Ȳ = N^{-1} Σ_{i∈Ω} Y_i,
d_i = Y_i − R X_i,   and   R = Σ_{i∈Ω} Y_i / Σ_{i∈Ω} X_i.
( b ) Show that the estimator ȳ_R is design consistent for the population mean Ȳ.
( c ) A design consistent linearization estimator of the variance of ȳ_R is given by the standard formula

v_0 = (1/n − 1/n′) s_d² + (1/n′ − 1/N) s_y²,

with s_d² = (n − 1)^{-1} Σ_{i∈s} d̂_i² and s_y² = (n − 1)^{-1} Σ_{i∈s} (y_i − ȳ)², where d̂_i = (y_i − ȳ) − R̂ (x_i − x̄).
( d ) Show that s_y² can be written as

s_y² = s_d² + 2 R̂ s_dx + R̂² s_x²,

which is an estimator of S_y².
( e ) Develop a new estimator of S_y² by replacing s_x² with s_x′² as

ŝ_y² = s_d² + 2 R̂ s_dx + R̂² s_x′².

Use it in estimating the variance of the ratio estimator.


( f ) Study the following two new estimators of the variance of ȳ_R:

v_1 = (1/n − 1/N) s_d² + 2 (1/n′ − 1/N) R̂ s_dx + (1/n′ − 1/N) R̂² s_x′²

and

v_2 = (1/n − 1/N) s_d² + 2 (1/n′ − 1/N) R̂ s_dx + (1/n′ − 1/N) R̂² s_x².
( g ) Consider a Jackknife variance estimator for ȳ_R obtained by recalculating ȳ_R with the j-th element removed for each j ∈ s′ and then using the variance of these n′ Jackknife values ȳ_R(j). Define

ȳ_R(j) = ȳ(j) x̄′(j) / x̄(j)   for all j ∈ s′,

where

x̄(j) = (n x̄_n − x_j)/(n − 1) if j ∈ s,   and   x̄(j) = x̄_n if j ∈ s′ − s,
ȳ(j) = (n ȳ_n − y_j)/(n − 1) if j ∈ s,   and   ȳ(j) = ȳ_n if j ∈ s′ − s,

and

x̄′(j) = (n′ x̄_{n′} − x_j)/(n′ − 1)   for j ∈ s′.

Show that the linearization of the Jackknife estimator of the variance of ȳ_R,

v_J = {(n′ − 1)/n′} Σ_{j∈s′} {ȳ_R(j) − ȳ_R}²,

is approximately equal to the usual estimators of variance.
Hint : Rao and Sitter (1995).
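Part ( g ) is easy to explore numerically. The sketch below computes the delete-one jackknife variance of the two-phase ratio estimator using exactly the leave-one-out means defined above. The data, the function name, and the convention that the second-phase sample occupies the first n positions of the first-phase sample are illustrative assumptions, not part of the exercise:

```python
def jackknife_variance(y, x, x_prime):
    """Delete-one jackknife for ybar_R = ybar_n * xbar_n' / xbar_n.
    y, x: second-phase sample values (size n); x_prime: first-phase
    x values (size n'), assumed to contain x as its first n entries."""
    n, n_p = len(y), len(x_prime)
    ybar, xbar = sum(y) / n, sum(x) / n
    xbar_p = sum(x_prime) / n_p
    ybar_r = ybar * xbar_p / xbar
    pseudo = []
    for j in range(n_p):
        xp_j = (n_p * xbar_p - x_prime[j]) / (n_p - 1)
        if j < n:                          # unit j is in the second-phase sample s
            yb = (n * ybar - y[j]) / (n - 1)
            xb = (n * xbar - x[j]) / (n - 1)
        else:                              # unit j is in s' - s
            yb, xb = ybar, xbar
        pseudo.append(yb * xp_j / xb)
    return (n_p - 1) / n_p * sum((t - ybar_r) ** 2 for t in pseudo)

v_j = jackknife_variance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0])
```

With y exactly proportional to x, as in this toy data, the within-s replicates leave the ratio unchanged and the jackknife variance reflects only the first-phase variation in x̄′(j).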

( II ) Consider a modified Jackknife estimator of the variance,

v_J* = {(n′ − 1)/n′} Σ_{j=1}^{n′} {ȳ_R**(j) − ȳ_R}²,

where

ȳ_R**(j) = ȳ_R*(j) x̄′(j) / x̄*(j),   with x̄′(j) = (n′ x̄_{n′} − x_j)/(n′ − 1) if j ∈ s′,

ȳ_R*(j) = (n ȳ_n − y_j)/(n − 1) if j ∈ s,   ȳ_R*(j) = (n ȳ_n − b x_j)/(n − 1) if j ∈ s′ − s,

and

x̄*(j) = (n x̄_n − x_j)/(n − 1)   for j ∈ s′.

( a ) Show that the modified Jackknife method of variance estimation gives

ȳ_R**(j) − ȳ_R = −{x̄′(j)/x̄(j)} (y_j − R̂ x_j)/(n − 1) + R̂ {x̄′(j) − x̄_{n′}} + {n/(n − 1)} R̂ {x̄_{n′}/x̄(j)} (x̄_{n′} − x̄_n),   if j ∈ s,
ȳ_R**(j) − ȳ_R = R̂ (x̄′(j) − x̄_{n′}),   if j ∈ s′ − s.

( b ) Show that the modified Jackknife estimator of variance reduces to

v_J* = (x̄_{n′}/x̄_n)² s_d²/n + 2 (x̄_{n′}/x̄_n) R̂ s_dx/n′ + R̂² s_x²/n′ + {n² (n′ − 1)/(n′ (n − 1)²)} R̂² (x̄_{n′}/x̄_n)² (x̄_{n′} − x̄_n)².

Hint: Arnab and Singh (2002e).

Exercise 12.21. For estimating the mean of a finite population, a random sample of n units is distributed equally among m enumerators chosen randomly from an infinite population of enumerators. Assume that y_ij, the value of y on the i-th unit as enumerated by the j-th interviewer, is

y_ij = Y_i + a_j + e_ijk,   where E(e_ijk | i, j) = 0, V(e_ijk | i, j) = σ_e², and Cov(e_ijk, e_ijk' | i, j) = 0.

( a ) Find the expected value and variance of the sample mean.
( b ) Assuming that the cost function is of the form

C = n C_1 + m C_2 + √(nm) C_3,

find the optimum number of enumerators for which the variance of the sample mean will be minimum for the fixed cost.
Hint: Sukhatme (1953).

Exercise 12.22. ( I ) Consider that the data after imputation take the form

y_•i = y_i if i ∈ A,   and   y_•i = {ȳ_r/(n − r)} {n (x̄_n/x̄_r)^a − r} if i ∈ Ā,

where a is a suitably chosen constant such that the variance of the resultant estimator is minimum. The sets A (Ā) denote the sets of responding (non-responding) units in the sample s such that s = A ∪ Ā.

( a ) Show that the point estimator ȳ_s = (1/n) Σ_{i∈s} y_•i of the population mean Ȳ becomes

ȳ_s = ȳ_r (x̄_n/x̄_r)^a,

where ȳ_r = r^{-1} Σ_{i∈A} y_i, x̄_r = r^{-1} Σ_{i∈A} x_i, and x̄_n = n^{-1} Σ_{i∈s} x_i.
( b ) Study the asymptotic bias and variance properties of the resultant estimator ȳ_s.
( c ) Show that a = 1 and a = −1 lead to the ratio and product methods of imputation.

( II ) Let X = {x_ij}, (i = 1, 2, ..., n; j = 1, 2, ..., p), be the n × p matrix of the p auxiliary variables associated with the study variable y. It is assumed that full information is available on the auxiliary variables, but responses are missing only for the study variable. Consider a method of imputation given by

y_•i = y_i if i ∈ A,   and   y_•i = {ȳ_r/(n − r)} {n Π_{j=1}^{p} (x̄_{nj}/x̄_{rj})^{a_j} − r} if i ∈ Ā,

where Π_{j=1}^{p} denotes the product of the p terms.

( a ) Show that the point estimator ȳ_s = n^{-1} Σ_{i∈s} y_•i becomes

ȳ_mult = ȳ_r Π_{j=1}^{p} (x̄_{nj}/x̄_{rj})^{a_j}.

( b ) Show that the minimum variance of the estimator ȳ_mult is given by

V(ȳ_mult) ≈ (1/n − 1/N) S_y² + (1/r − 1/n) S_y² (1 − R²_{y·x_1 x_2 ... x_p}),

where R_{y·x_1 x_2 ... x_p} denotes the multiple correlation coefficient.
Hint: Singh and Deo (2002).

Exercise 12.23. Consider that the data after imputation take the form

y_•i = y_i if i ∈ A,   and   y_•i = a + w x_i if i ∈ Ā,

where a and w are suitably chosen constants such that the variance of the resultant estimator is minimum and the method of imputation becomes optimum. The sets A (Ā) denote the sets of responding (non-responding) units in the sample s such that s = A ∪ Ā.

( a ) Show that the point estimator ȳ_s = (1/n) Σ_{i=1}^{n} y_•i of the population mean Ȳ under the above method of imputation becomes

ȳ_s = p ȳ_r + (1 − p) a + w (x̄_n − p x̄_r),

where p = r/n, ȳ_r = r^{-1} Σ_{i∈A} y_i, x̄_r = r^{-1} Σ_{i∈A} x_i, and x̄_n = n^{-1} Σ_{i∈s} x_i.

( b ) Show that the estimator ȳ_s is unbiased if either a = (Ȳ − w X̄) for any real value of w, or w = (Ȳ − a)/X̄ for any real value of a.


( c ) Discuss choices of a and w such that the above optimum method of imputation reduces to the ratio, mean, and regression methods of imputation.
( d ) Show that the minimum mean square error of the estimator ȳ_s, for the optimum values of a and w, is given by

Min.MSE(ȳ_s) = p² S_y² (1/r − 1/N) − S_y² ρ² {(1/n − 1/N) − p (1/r − 1/N)}² / {(1/n − 1/N) + p² (1/r − 1/N) − 2 p (1/n − 1/N)},

where ρ denotes the correlation coefficient between y and x.

( e ) Discuss difficulties in using the above optimum method of imputation and suggest some possible solutions.
Hint: Singh and Valdes (2003).

Exercise 12.24. Michael works in the private sector, and his boss Harold Mantel considers an imputation technique based on the mechanism of the ratio method of imputation while estimating the ratio R = Ȳ/X̄ of two population means. Harold Mantel shows data to Michael on a spreadsheet, as in the first two columns ('data before imputation') of the following table, and suggests that he use the two ratios given by

b_yx = Σ_{i=1}^{n−p−q} y_i / Σ_{i=1}^{n−p−q} x_i   and   b_xy = Σ_{i=1}^{n−p−q} x_i / Σ_{i=1}^{n−p−q} y_i

for imputing the missing y variable and the missing x variable, respectively, as shown in the last two columns of the following table.

Data before imputation            Data after imputation
y variate      x variate          y variate       x variate
y_1            x_1                y_1             x_1
y_2            x_2                y_2             x_2
...            ...                ...             ...
y_{n-p-q}      x_{n-p-q}          y_{n-p-q}       x_{n-p-q}
Missing        x_1*               b_yx x_1*       x_1*
Missing        x_2*               b_yx x_2*       x_2*
...            ...                ...             ...
Missing        x_p*               b_yx x_p*       x_p*
y_1*           Missing            y_1*            b_xy y_1*
y_2*           Missing            y_2*            b_xy y_2*
...            ...                ...             ...
y_q*           Missing            y_q*            b_xy y_q*

( a ) After imputation Michael found the sums y_s and x_s, took the ratio of these two as R̂ = y_s/x_s, and reported back to Harold Mantel that his imputation method is not going to work. Justify Michael's claim by showing that

R̂ = ȳ_{n−p−q} / x̄_{n−p−q} = ratio of the observed responses.

Hint: Use

y_s = (n − q) ȳ_{n−p−q} (x̄_{n−q}/x̄_{n−p−q}) + q ȳ_q   and   x_s = (n − p) x̄_{n−p−q} (ȳ_{n−p}/ȳ_{n−p−q}) + p x̄_p.

( b ) Michael suggests to his boss Harold Mantel the following efficient class of estimators of the population ratio:

R̂_W = [Ψ_1 ȳ_{n−p−q} H(x̄_{n−q}/x̄_{n−p−q}) + Ψ_2 ȳ_q] / [Ψ_3 x̄_{n−p−q} G(ȳ_{n−p}/ȳ_{n−p−q}) + Ψ_4 x̄_p],

where Ψ_k, k = 1, 2, 3, 4, are real constants, and H(·) and G(·) are parametric functions satisfying the following assumptions:
( i ) H(1) = 1 and G(1) = 1;
( ii ) the first order (H_1 and G_1, say) and second order (H_11 and G_11) derivatives of H and G exist and are known constants.

Justify Michael's claim by comparing the mean square errors of R̂ and R̂_W.
Hint: Singh, Singh, Tailor, and Allen (2002).
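Michael's finding in part ( a ) can be verified numerically: whatever rows are filled in by b_yx and b_xy, the ratio of the imputed sums collapses back to the ratio of the fully observed pairs. A short sketch (the data and the function name are illustrative):

```python
def imputed_ratio(pairs):
    """pairs: list of (y, x) tuples, with None marking a missing entry.
    Impute missing y by b_yx * x and missing x by b_xy * y,
    then return the ratio of the imputed sums."""
    both = [(y, x) for y, x in pairs if y is not None and x is not None]
    sy = sum(y for y, _ in both)          # sum of y over fully observed pairs
    sx = sum(x for _, x in both)          # sum of x over fully observed pairs
    b_yx, b_xy = sy / sx, sx / sy
    ys = [y if y is not None else b_yx * x for y, x in pairs]
    xs = [x if x is not None else b_xy * y for y, x in pairs]
    return sum(ys) / sum(xs)

r_hat = imputed_ratio([(2.0, 1.0), (4.0, 2.0), (None, 3.0), (5.0, None)])
```

Here the fully observed pairs give 6/3 = 2, and r_hat equals exactly that value regardless of the imputed rows, which is why Michael concluded the method adds nothing for ratio estimation.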

Practical 12.1. John selected an SRSWOR sample of twenty states from population 1. He collected information about the real estate farm loans and nonreal estate farm loans from the selected states, but unfortunately the information on the real estate farm loans was not available for ten states, as marked in the table below.

Sr.No.  State   x_i        y_i        |  Sr.No.  State   x_i        y_i
01      AL      348.334    408.978    |  27      NE      3585.406   Missing
03      AZ      431.439    54.633     |  29      NH      0.471      Missing
04      AR      848.317    907.700    |  32      NY      426.274    Missing
05      CA      3928.732   1343.461   |  33      NC      494.730    Missing
06      CO      906.281    315.809    |  36      OK      1716.087   Missing
07      CT      4.373      7.130      |  38      PA      298.351    Missing
14      IN      1022.782   1213.024   |  40      SC      80.750     Missing
19      ME      51.539     8.849      |  42      TN      388.869    Missing
22      MI      440.518    323.028    |  46      VA      188.477    Missing
23      MN      2466.892   1354.768   |  47      WA      1228.607   Missing
x - Nonreal estate farm loans; y - Real estate farm loans.

Apply the following three estimators of the population mean:

ȳ_1lr = ȳ_{n−r} + (s*_xy/s*_x²)(X̄ − x̄_{n−r}),
ȳ_2lr = ȳ_{n−r} + (s*_xy/s_x²)(X̄ − x̄_r),
and
ȳ_3lr = ȳ_{n−r} + (s*_xy/s*_x²)(x̄_{n−r} − x̄_n)

for estimating the average real estate farm loans, assuming that the average nonreal estate farm loans in the United States is known and equal to $878.16. Construct 95% confidence intervals.

Practical 12.2. Select an SRSWOR sample of sixteen units from population 4 given in the Appendix. Suppose you made 7 visits to collect information about the different species groups selected in the sample, and that information about the number of fish caught in the different species during 1994 was not available every time you contacted the fishermen. It was also noted, for each species, how many times (D) out of the seven visits the information was available. Estimate the average number of fish caught by marine recreational fishermen on the Atlantic and Gulf coasts during 1994, and construct a 95% confidence interval for the average number of fish in the United States.

Practical 12.3. Michael selected an SRSWOR sample of twenty states from population 1, and tried to collect information about the real estate farm loans and nonreal estate farm loans from the selected states, but unfortunately the information on the real estate farm loans was not available for nine states, as marked in the table below.

Sr.No.  State   x_i        y_i        |  Sr.No.  State   x_i        y_i
01      AL      348.334    408.978    |  27      NE      3585.406   1337.852
03      AZ      431.439    54.633     |  29      NH      0.471      6.044
04      AR      848.317    907.700    |  32      NY      426.274    Missing
05      CA      3928.732   1343.461   |  33      NC      494.730    Missing
06      CO      906.281    Missing    |  36      OK      1716.087   Missing
07      CT      4.373      7.130      |  38      PA      298.351    Missing
14      IN      1022.782   1213.024   |  40      SC      80.750     Missing
19      ME      51.539     8.849      |  42      TN      388.869    Missing
22      MI      440.518    323.028    |  46      VA      188.477    Missing
23      MN      2466.892   1354.768   |  47      WA      1228.607   Missing
x - Nonreal estate farm loans; y - Real estate farm loans.

Apply the ratio type estimator v̂_s = s*_y² (s_x²/s*_x²) for estimating the finite population variance of the real estate farm loans, and construct the 95% confidence interval.

Practical 12.4. Santa Singh and Banta Singh were appointed to select two candidates from a list of four candidates Ω = {Anokha, Banto, Channa, Didar} with scores 25, 35, 40, and 45, respectively. In the first phase the administration suggested that Santa Singh and Banta Singh select three candidates for a telephone interview, and in the second phase they decided to select two candidates for a face to face interview. Santa Singh likes everyone whereas Banta Singh likes Banto, so they suggested the following first phase sampling plans:

First phase sample                   Santa Singh      Banta Singh
s_1' = {Anokha, Banto, Channa}       p(s_1') = 1/4    p(s_1') = 1/3
s_2' = {Anokha, Banto, Didar}        p(s_2') = 1/4    p(s_2') = 1/3
s_3' = {Anokha, Channa, Didar}       p(s_3') = 1/4    p(s_3') = 0.0
s_4' = {Banto, Channa, Didar}        p(s_4') = 1/4    p(s_4') = 1/3

( a ) Construct the first order and second order inclusion probabilities for the first
phase telephone interview.

In the second phase the administration decided to select two candidates for face to
face interview out of the selected three candidates during telephone interview.
Again Santa Singh and Banta Singh suggested the following possibilities

Given    Second phase sample       Santa Singh    Banta Singh
s_1'     s_1 = {Anokha, Banto}     1/3            0.5
s_1'     s_2 = {Anokha, Channa}    1/3            0.0
s_1'     s_3 = {Banto, Channa}     1/3            0.5
s_2'     s_1 = {Anokha, Banto}     1/3            0.5
s_2'     s_2 = {Anokha, Didar}     1/3            0.0
s_2'     s_3 = {Banto, Didar}      1/3            0.5
s_3'     s_1 = {Anokha, Channa}    1/3            0.0
s_3'     s_2 = {Anokha, Didar}     1/3            0.0
s_3'     s_3 = {Channa, Didar}     1/3            0.0
s_4'     s_1 = {Banto, Channa}     1/3            0.5
s_4'     s_2 = {Banto, Didar}      1/3            0.5
s_4'     s_3 = {Channa, Didar}     1/3            0.0

( b ) Construct the first order and second order inclusion probabilities for the second phase face to face interview. Point out the difficulties in Banta Singh's sampling scheme.
( c ) Estimate the total score from each one of the second phase samples for the given first phase sample (except s_3' for Banta Singh's scheme).
( d ) Find the bias and variance of Santa Singh's and Banta Singh's selection schemes by using the definitions of bias and variance.
( e ) Discuss the relative efficiency of Banta Singh's sampling scheme over Santa Singh's sampling scheme and comment.

Practical 12.5. Consider the problem of estimating the number of cattle on farms using known information about the sizes of the farms. A preliminary large SRSWOR sample of 70,000 farms out of a total of 14,000,000 farms gave the estimated average area per farm as 5.00 hectares. A second phase SRSWOR sample of 3000 farms gave the following information:

Σ_{i=1}^{n} x_i = 18867.089,   Σ_{i=1}^{n} y_i = 12525.246,   Σ_{i=1}^{n} x_i² = 48780336.98,
Σ_{i=1}^{n} y_i² = 18001501.38,   and   Σ_{i=1}^{n} x_i y_i = 26591710.56.

( a ) Estimate the average number of cattle per farm and derive a 95% confidence interval estimate using the two-phase ratio estimator.
( b ) Estimate the average number of cattle per farm and derive a 95% confidence interval estimate using the two-phase regression estimator.
( c ) Comment on the confidence interval estimates obtained.
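The point estimate in ( a ) follows directly from the reported totals. The sketch below computes only the estimate; the confidence interval additionally requires the two-phase variance formula developed earlier in the book (variable names are ours):

```python
# Totals reported in Practical 12.5 for the second-phase sample of n = 3000 farms.
n = 3000
sum_x, sum_y = 18867.089, 12525.246
sum_x2, sum_y2, sum_xy = 48780336.98, 18001501.38, 26591710.56  # needed for (b)
xbar1 = 5.00                      # first-phase estimated average area per farm

xbar = sum_x / n                  # second-phase mean area per farm
ybar = sum_y / n                  # second-phase mean cattle per farm
r_hat = ybar / xbar               # estimated cattle per hectare
y_ratio = r_hat * xbar1           # two-phase ratio estimate of mean cattle per farm
```

This gives roughly 3.32 cattle per farm; the regression estimator of ( b ) replaces r_hat with the least squares slope computed from the sums of squares and products.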

Practical 12.6. John and Michael were appointed to select two candidates from a list of five candidates Ω = {Amy, Bob, Chris, Don, Eric} with scores 125, 126, 128, 90, and 127, respectively. In the first phase, the administration suggested that John and Michael select three candidates for a telephone interview, and in the second phase they decided to select two candidates for a face to face interview. Both John and Michael suggested the following first-phase sampling plan for the telephone interview:

First phase sample (John and Michael)     Probability
s_1' = {Amy, Bob, Chris}                  p(s_1') = 1/4
s_2' = {Amy, Chris, Don}                  p(s_2') = 1/4
s_3' = {Amy, Don, Eric}                   p(s_3') = 1/4
s_4' = {Amy, Chris, Eric}                 p(s_4') = 1/4

( a ) Construct the first order and second order inclusion probabilities for the first
phase telephone interview .

In the second phase the administration decided to select two candidates for a face to face interview out of the three candidates selected during the telephone interview. John likes everyone whereas Michael likes Amy, so they suggested the following possibilities:

Given    Second phase sample      John     Michael
s_1'     s_1 = {Amy, Bob}         1/3      0.5
s_1'     s_2 = {Amy, Chris}       1/3      0.5
s_1'     s_3 = {Bob, Chris}       1/3      0.0
s_2'     s_1 = {Amy, Chris}       1/3      0.5
s_2'     s_2 = {Amy, Don}         1/3      0.5
s_2'     s_3 = {Chris, Don}       1/3      0.0
s_3'     s_1 = {Amy, Eric}        1/3      0.5
s_3'     s_2 = {Amy, Don}         1/3      0.5
s_3'     s_3 = {Don, Eric}        1/3      0.0
s_4'     s_1 = {Amy, Chris}       1/3      0.5
s_4'     s_2 = {Amy, Eric}        1/3      0.5
s_4'     s_3 = {Chris, Eric}      1/3      0.0

( b ) Construct the first order and second order inclusion probabilities for the second phase face to face interview for both sampling schemes.
( c ) Estimate the total score from each one of the second phase samples for the given first phase sample for both sampling schemes.
( d ) Find the bias and variance of John's and Michael's selection schemes by using the definitions of bias and variance.
( e ) Discuss the relative efficiency of Michael's sampling scheme over John's sampling scheme and comment.

Practical 12.7. Select an SRSWOR sample of twenty units from population 1 given in the Appendix. Record the values of the real estate farm loans for the states selected in the sample. Assume 5% random non-response in the selected sample. Impute the missing values four times with the help of the hot deck method of imputation. Apply the concept of multiple imputation for estimating the population mean and construct the 95% confidence interval.

Practical 12.8. Select an SRSWOR sample of twenty five units from population 1 given in the Appendix. Record the values of the real and nonreal estate farm loans for the states selected in the sample. Assume 5% random non-response in the selected sample. Impute the missing values with the following two methods:

( a ) ratio method of imputation, and ( b ) compromised method of imputation.


Estimate the average real estate farm loans with each method and comment on your
results.

Practical 12.9. Professor Forgetful (refer, e.g., to the film 'The Nutty Professor' directed by Jerry Lewis) believes that the percentage of marks of students in an examination depends upon the number of classes attended by them and the number of marks in the assignments. Professor Forgetful misplaced the mid-term exams of 4 students, but has information about the number of classes attended and the marks in the assignments.

Name                    Exam (%)    Classes attended    Assignment marks
Alabi, Aioke            59.29       30                  66
Anderson, Benjamin      72.51       36                  79
Anderson, Rebecca       77.66       35                  93
Barland, Deborah        Missing     9                   85
Beaver, Eric            75.26       34                  89
Bistodeau, Jill         78.81       36                  83
Brendel, Stephanie      86.67       40                  84
Cherne, Tara            73.87       34                  85
Danzl, Jennifer         87.20       39                  93
Houck, Jennifer         92.55       40                  94
Izu, Yohei              55.43       25                  75
Kimura, Shigeru         51.17       20                  70
Kieseth, Katie          74.51       32                  76
Kladek, William         47.37       20                  73
Laliberte, Joseph       Missing     26                  54
Montana, Larissa        Missing     33                  79
Olson, Cher             Missing     32                  89
Rief, Abby                          15                  74
Ring, Paula                         16                  78
Taylor, Abby                        33                  94
Todd, Andrew                        38                  81
Tuladhar, Tsering                   39                  95

( a ) Find the average marks in the class (excluding missing exams).


( b ) Impute the missing marks with the mean method of imputation, and again find
the average marks in the class .

( c ) Assuming the information about the number of classes attended by the students
is known, impute the missing marks with the ratio method of imputation and again
find the average marks in the class.
( d ) Impute the missing marks with the following method of imputation:

y_•i = {r/(r − 1)} y_i,   if i ∈ A,
y_•i = (r − 1)^{-1} {(n − 1) x_i − n x̄_n},   if i ∈ Ā,

where y_i and x_i denote, respectively, the marks and the number of classes attended by the i-th student, and A and Ā denote the responding and non-responding sets of the sample s such that s = A ∪ Ā. Again find the average marks in the class after imputation.
( e ) Repeat ( c ) and ( d ) using the known marks in the assignments.
( f ) Suggest a new method of imputation that uses both of the variables, viz., the number of classes attended and the marks in the assignments, to impute the missing marks in the exam.
( g ) Give your views on the methods of imputation used by Professor Forgetful, and on your suggestion in ( f ).
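Parts ( b ) and ( c ) rest on the two basic single-value imputation rules. A minimal sketch of both (function names and the toy data are illustrative, not Professor Forgetful's class list):

```python
def mean_impute(y):
    """Replace missing responses (None) by the respondent mean."""
    obs = [v for v in y if v is not None]
    ybar_r = sum(obs) / len(obs)
    return [ybar_r if v is None else v for v in y]

def ratio_impute(y, x):
    """Replace missing y_i by b * x_i, with b = (sum of responding y) /
    (sum of x over the responding units)."""
    resp = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    b = sum(yi for yi, _ in resp) / sum(xi for _, xi in resp)
    return [b * xi if yi is None else yi for yi, xi in zip(y, x)]

marks = [60.0, None, 80.0]        # exam marks, one misplaced
classes = [30, 20, 40]            # classes attended (auxiliary variable)
```

Mean imputation fills the gap with (60 + 80)/2 = 70, while ratio imputation uses b = 140/70 = 2 and fills in 2 x 20 = 40, pulling the imputed mark down for a student with low attendance, which is the behaviour ( c ) is meant to illustrate.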
13. MISCELLANEOUS TOPICS

13.0. INTRODUCTION

The main purpose of this chapter is to keep this book open to new topics that have come up in recent years or that have not been touched upon by the author in the present version of the book. In this chapter we shall introduce a few miscellaneous topics, namely:

( a ) Estimation of Measurement Errors; ( b ) Raking Ratio Estimators;
( c ) Continuous Populations; and ( d ) Small Area Estimation.

13.1 ESTIMATION OF MEASUREMENT ERRORS

In many econometric, social, agricultural, and medical science applications the reported values of variables are often not exact, but are obtained only with some uncertainty, herein called measurement error. With knowledge of the measurement error, it is possible to adjust estimates determined from measurements that contain measurement error (see Fuller (1987)).

In other situations, notably engineering applications where measuring systems (tools) are used to measure the characteristics of population units and the tools are not perfect, it is the determination of the measurement error itself that is important. Grubbs (1948) estimated the measurement error of three clocks measuring the fuse burning time of gun shells. Jaech (1981) estimated the measurement error of a device that determined the percentage of an isotope of plutonium in plutonium oxide powder. Morrison, Mangat, Desjardins, and Bhatia (2000) used this methodology to estimate the measurement error of a pipeline inspection tool used to measure corrosion pit depths, and of a field tool that obtained data after some portions of the pipeline were excavated for rehabilitation purposes.

The estimation of measurement error is also important where calibration of a tool is required as part of its development procedure. As no instrument can claim to be fully accurate, the instrument used for calibration might also contain some measurement error. The estimation of the measurement precision of both tools, the one that is being calibrated and the one being used as the standard, is necessary to ensure that calibration is helpful in improving the tools under development.

The statistical methods used for estimating the measurement errors of the tools individually are studied below.

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003

Grubbs (1948) was the first to suggest a sampling theory methodology for estimating the variance of the measurement error of any number of tools separately.

In the two tool case, for a population of N units the model is

Y_i = Ψ_i + U_i   and   X_i = Ψ_i + V_i,   (i = 1, 2, ..., N)   (13.1.1.1)

where, for the i-th unit, Y_i and X_i are the measurements reported by Tool 1 and Tool 2, Ψ_i is the true value, and U_i and V_i are the random measurement errors in measuring Ψ_i. It is assumed that the errors U_i and V_i are random, normally distributed, and have mean zero. They are independent of each other, among themselves, and of the true values Ψ_i. The population variances of Ψ_i, U_i, and V_i, (i = 1, 2, ..., N), are denoted by σ_Ψ², σ_U², and σ_V².

For the sample units all variables, including the variables used in the model, will be denoted by lower case letters. Assuming the n units are drawn randomly using equal probability with replacement sampling from a population of N units, the usual unbiased estimators of the variances and covariance of the measurements of Tool 1 and Tool 2 are

s_y² = (n − 1)^{-1} (Σ_{i=1}^{n} y_i² − n ȳ²),   s_x² = (n − 1)^{-1} (Σ_{i=1}^{n} x_i² − n x̄²),   and
s_yx = (n − 1)^{-1} (Σ_{i=1}^{n} y_i x_i − n ȳ x̄),

where ȳ = n^{-1} Σ_{i=1}^{n} y_i and x̄ = n^{-1} Σ_{i=1}^{n} x_i are the respective sample means.
Theorem 13.1.1.1. The variance of the measurement error of Tool 1 is estimated by

σ̂_U² = s_y² − s_yx.

Proof. If E is the expected value over the units, then

E(s_y²) = E[(n − 1)^{-1} {Σ_{i=1}^{n} y_i² − n (n^{-1} Σ_{i=1}^{n} y_i)²}] = E[n^{-1} Σ_{i=1}^{n} y_i² − {n(n − 1)}^{-1} Σ_{i≠j} y_i y_j].

Under model (13.1.1.1) the expected value of s_y² becomes

E(s_y²) = E[n^{-1} Σ_{i=1}^{n} (ψ_i² + u_i² + 2 ψ_i u_i) − {n(n − 1)}^{-1} Σ_{i≠j} (ψ_i + u_i)(ψ_j + u_j)]
        = N^{-1} Σ_{i=1}^{N} Ψ_i² + N^{-1} Σ_{i=1}^{N} U_i² − Ψ̄² = σ_Ψ² + σ_U²,

because Ū = 0.

Similarly

E(s_yx) = E[(n − 1)^{-1} {Σ_{i=1}^{n} y_i x_i − n (n^{-1} Σ_{i=1}^{n} y_i)(n^{-1} Σ_{i=1}^{n} x_i)}] = E[n^{-1} Σ_{i=1}^{n} y_i x_i − {n(n − 1)}^{-1} Σ_{i≠j} y_i x_j].

Under model (13.1.1.1) the expected value of s_yx becomes

E(s_yx) = E[n^{-1} Σ_{i=1}^{n} (ψ_i + u_i)(ψ_i + v_i) − {n(n − 1)}^{-1} Σ_{i≠j} (ψ_i + u_i)(ψ_j + v_j)]
        = N^{-1} Σ_{i=1}^{N} Ψ_i² − Ψ̄² = σ_Ψ².

Using the values of E(s_y²) and E(s_yx) we have

E[σ̂_U²] = σ_Ψ² + σ_U² − σ_Ψ² = σ_U².

Hence the proof.

Theorem 13.1.1.2. The estimator of the variance of the measurement error of Tool 2 is

σ̂_V² = s_x² − s_yx.

Proof. The proof is similar to the one given above for Theorem 13.1.1.1.

Since the variances of the measurement errors of the tools are estimated from the measurements of sample units, σ̂_U² and σ̂_V² may differ from sample to sample. Under certain conditions of normality

V(σ̂_U²) = 2σ_U⁴/(n − 1) + (σ_Ψ² σ_U² + σ_Ψ² σ_V² + σ_U² σ_V²)/(n − 1).

To determine V(σ̂_V²), the σ_U⁴ in the first term on the right hand side is replaced by σ_V⁴. For practical purposes the unknown parameters in the above expression are replaced by their respective sample estimates.

The relationship between total scatter, represented by the root mean square differential/error, and the estimates of the variance of measurement error for an individual tool, including bias if present, is illustrated in Morrison, Mangat, Carroll, and Riznic (2003).

Example 13.1.1.1. The hypothetical set of 30 true values of a characteristic of a unit given in Table 13.1.1.1 was selected at random from a normal distribution with mean 40 and standard deviation 15. To these true values, measurement errors drawn from normal distributions with mean zero and standard deviations of 10 and 6 were added to generate measurements for Tool 1 and Tool 2, respectively. Using the Tool 1 and Tool 2 data, estimate the standard deviation (SD) of the measurement errors (ME) of Tool 1 and Tool 2 separately.

Solution. Denoting the measurements reported by Tool 1 and Tool 2 by y and x respectively, one has n = 30, s_y² = 291.4, s_x² = 212.1, and s_yx = 188.8.

Table 13.1.1.1. Hypothetical data set for two tools.

Sr.No.  True Depth  Tool 1  Tool 2  |  Sr.No.  True Depth  Tool 1  Tool 2
1       40.80       38.00   28.90   |  16      9.00        17.60   8.50
2       40.20       49.20   37.40   |  17      54.60       54.50   57.00
3       20.70       5.00    30.70   |  18      41.50       37.10   45.50
4       57.50       60.30   56.70   |  19      32.30       28.50   29.00
5       39.40       50.70   35.90   |  20      46.30       52.90   32.40
6       29.20       40.60   20.10   |  21      22.00       14.00   22.10
7       32.10       39.20   37.60   |  22      45.10       62.90   43.40
8       58.40       65.10   53.10   |  23      62.40       73.70   72.00
9       21.10       17.40   18.30   |  24      40.50       43.50   30.50
10      67.80       80.20   60.30   |  25      36.50       38.00   36.50
11      20.00       27.40   18.90   |  26      42.70       43.00   59.70
12      24.40       40.60   32.70   |  27      38.60       41.90   33.50
13      47.80       52.20   52.50   |  28      34.60       41.00   26.50
14      35.80       48.70   28.90   |  29      42.80       34.20   34.60
15      38.20       57.70   42.30   |  30      30.30       36.10   36.80

These values yield estimates of the standard deviations of the measurement errors as:

SD of ME of Tool 1 = σ̂_U = √(s_y² − s_yx) = √(291.4 − 188.8) = 10.1,
SD of ME of Tool 2 = σ̂_V = √(s_x² − s_yx) = √(212.1 − 188.8) = 4.8.
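The arithmetic of this solution can be reproduced directly from the three sample moments (the helper name is ours; the numbers are those of Example 13.1.1.1):

```python
def grubbs_sd(s_y2, s_x2, s_yx):
    """Grubbs (1948) estimates of the measurement-error standard deviations
    of two tools, from the sample variances and covariance of paired readings."""
    sd_tool1 = (s_y2 - s_yx) ** 0.5   # sqrt(s_y^2 - s_yx)
    sd_tool2 = (s_x2 - s_yx) ** 0.5   # sqrt(s_x^2 - s_yx)
    return sd_tool1, sd_tool2

sd1, sd2 = grubbs_sd(291.4, 212.1, 188.8)
```

Both estimates (about 10.1 and 4.8) are close to the error standard deviations 10 and 6 used to generate the data: since s_yx estimates the variance of the common true values, subtracting it from each tool's variance isolates that tool's error variance.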

The Grubbs model is based on a single measurement made per unit by each instrument. Consider the case where it is possible to measure the characteristic of a unit more than once by each tool. Bhatia, Mangat, and Morrison (1998) presented estimators for this situation using varying probability and equal probability sampling, with and without replacement. Here the results are presented for equal probability with replacement sampling only.

Let each unit be measured repeatedly using Tool 1 and Tool 2. Using t as a subscript for the t-th measurement and i for the i-th unit, the model is written as

Y_it = Ψ_i + U_it   and   X_it = Ψ_i + V_it,   (i = 1, 2, ..., N)   (13.1.2.1)

where

Y_it = observation, subject to measurement error, recorded by Tool 1,
X_it = observation, subject to measurement error, recorded by Tool 2,
Ψ_i = true value for the i-th unit,
U_it = error in observing Ψ_i by Tool 1, and
V_it = error in observing Ψ_i by Tool 2.

( i ) The errors U_it and V_it are independent of each other, among themselves, and are independent of the true value of the unit.
( ii ) Denoting the sample values by lower case letters y_it, x_it, ψ_i, u_it, v_it, the conditional expectations E_2 over the measurements for the i-th sample unit are

E_2(u_it) = u_i,   E_2(v_it) = 0,   E_2(u_it²) = u_i^(2),   E_2(y_it) = ψ_i + u_i,   E_2(x_it) = ψ_i,

where u_i represents the measurement bias of Tool 1 over measurements for measuring the i-th sample unit selected, but Cov(Ψ_i, U_i) = 0.

The variances σ_Ψ² and σ_U² are defined as

σ_Ψ² = N^{-1} Σ_{i=1}^{N} Ψ_i² − Ψ̄²   and   σ_U² = N^{-1} Σ_{i=1}^{N} U_i² − Ū²,

where Ψ̄ is the mean of the true values for the population units, Ū is the mean over measurements and units in the population, U_i² represents the mean of squares of the measurement errors over the infinitely large repeated measurements for the i-th population unit, and σ_U² is the variance of the measurement errors.

The actual numbers of measurements, r_1 and r_2, made respectively by Tool 1 and Tool 2 on each unit are considered as independent selections from their respective infinitely large populations of repeated measurements.

Let us define notation for the sample units as:

ȳ_i = r_1^{-1} Σ_{t=1}^{r_1} y_it,   x̄_i = r_2^{-1} Σ_{t=1}^{r_2} x_it,   u_it = y_it − x̄_i,   ū_i = r_1^{-1} Σ_{t=1}^{r_1} u_it = ȳ_i − x̄_i,
ȳ_i^(2) = r_1^{-1} Σ_{t=1}^{r_1} y_it²,   ū_i^(2) = r_1^{-1} Σ_{t=1}^{r_1} u_it²,

s_ya² = n^{-1} Σ_{i=1}^{n} ȳ_i^(2) − (n^{-1} Σ_{i=1}^{n} ȳ_i)²,   s_ȳ² = (n − 1)^{-1} [Σ_{i=1}^{n} ȳ_i² − n^{-1} (Σ_{i=1}^{n} ȳ_i)²],

s_y² = (n − 1)^{-1} [Σ_{i=1}^{n} y_i² − n^{-1} (Σ_{i=1}^{n} y_i)²],   s_ȳx̄ = (n − 1)^{-1} [Σ_{i=1}^{n} ȳ_i x̄_i − n^{-1} (Σ_{i=1}^{n} ȳ_i)(Σ_{i=1}^{n} x̄_i)].

Expressions for s_ua², s_ū², and s_u² are obtained by substituting u for y in the expressions for s_ya², s_ȳ², and s_y².

Five different situations, as given below, are considered:

Case I. r_1 ≥ 2 and r_2 is sufficiently large, so that the average of the r_2 measurements can be taken as the true value for the i-th unit.
Case II. r_1 ≥ 2 and r_2 ≥ 1, but r_2 is not so large that the average of the r_2 measurements can be taken as the true value for the i-th unit.
Case III. r_1 = 1 and r_2 is sufficiently large, so that the average of the r_2 measurements can be taken as the true value for the i-th unit.
Case IV. r_1 = 1 and r_2 ≥ 2, but r_2 is not so large that the average of the r_2 measurements can be treated as the true value for the i-th unit.
Case V. r_1 = 1 and r_2 = 1.

Note that Case V is equal to the Grubbs estimators for two inspection tools. For equal probability with replacement sampling, the Case II estimator of the variance of the measurement error σ_U² of Tool 1 is

σ̂_U² = s_ya² + s_ȳ²/n − s_ȳx̄;

the estimators for the remaining cases are constructed along the same lines, with Case V reducing to the Grubbs estimator σ̂_U² = s_y² − s_yx.

Theorem 13.1.2.1. For equal probability with replacement sampling the Case II
estimator of variance of the measurement error of Tool I is given by
A 2
2 2 Sji
au = Sya + - -Sjix '
n
Proof. Let $E_2$ be the conditional expectation over measurements for a given $i$th unit and $E_1$ the expectation over units. Then
$$E\left[s_{ya}^2 + \frac{s_{\bar y}^2}{n}\right] = E_1E_2\left[s_{ya}^2 + \frac{s_{\bar y}^2}{n}\right]$$
13. Miscellaneous topics 1071

$$= E_1E_2\left[\frac{1}{n}\sum_{i=1}^{n}\overline{y_i^2} - \left(\frac{1}{n}\sum_{i=1}^{n}\bar y_i\right)^2 + \frac{1}{n(n-1)}\left\{\sum_{i=1}^{n}\bar y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}\bar y_i\right)^2\right\}\right] = E_1E_2\left[\frac{1}{n}\sum_{i=1}^{n}\overline{y_i^2} - \frac{1}{n(n-1)}\sum_{i\neq j}^{n}\bar y_i\bar y_j\right].$$
Now, writing $y_{it} = \Psi_i + u_{it}$, where $\Psi_i$ denotes the true value of the $i$th unit,
$$E_2\left[\overline{y_i^2}\right] = E_2\left(\frac{1}{r_1}\sum_{t=1}^{r_1}y_{it}^2\right) = E_2\left[\frac{1}{r_1}\sum_{t=1}^{r_1}\left(\Psi_i + u_{it}\right)^2\right] = \Psi_i^2 + \overline{u_i^2} + 2\Psi_i\bar u_i,$$
$$E_2\left[\bar y_i\bar y_j\right] = E_2\left(\bar y_i\right)E_2\left(\bar y_j\right) = \left(\Psi_i + \bar u_i\right)\left(\Psi_j + \bar u_j\right).$$
Substituting the conditional expected values obtained above, one gets
$$E_1\left[s_{ya}^2 + \frac{s_{\bar y}^2}{n}\right] = E_1\left[\frac{1}{n}\sum_{i=1}^{n}\left(\Psi_i^2 + \overline{u_i^2} + 2\Psi_i\bar u_i\right) - \frac{1}{n(n-1)}\sum_{i\neq j}^{n}\left(\Psi_i + \bar u_i\right)\left(\Psi_j + \bar u_j\right)\right] = \sigma_\Psi^2 + \sigma_u^2,$$
because $\mathrm{Cov}\left(\Psi_i, \bar u_i\right) = 0$. Now
$$E\left[s_{\bar y\bar x}\right] = E_1\left[\frac{1}{n-1}\left\{\sum_{i=1}^{n}\left(\Psi_i + \bar u_i\right)\Psi_i - \frac{1}{n}\sum_{i=1}^{n}\left(\Psi_i + \bar u_i\right)\sum_{i=1}^{n}\Psi_i\right\}\right]$$
$$= E_1\left[\frac{1}{n-1}\left\{\sum_{i=1}^{n}\Psi_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}\Psi_i\right)^2\right\} + \frac{1}{n-1}\left\{\sum_{i=1}^{n}\Psi_i\bar u_i - \frac{1}{n}\sum_{i=1}^{n}\bar u_i\sum_{i=1}^{n}\Psi_i\right\}\right] = \sigma_\Psi^2,$$
since $E_1\left\{\mathrm{Cov}\left(\Psi_i, \bar u_i\right)\right\} = 0$. Thus
$$E\left[s_{ya}^2 + \frac{s_{\bar y}^2}{n} - s_{\bar y\bar x}\right] = \sigma_\Psi^2 + \sigma_u^2 - \sigma_\Psi^2 = \sigma_u^2,$$
which proves the theorem.

Corollary 13.1.2.1. Assuming that the average of repeated measurements of Tool 1 does not yield the true value of the units, all other assumptions remaining the same, the BMM Case II estimator for Tool 2 can be written in the same way as
$$\hat\sigma_v^2 = s_{xa}^2 + \frac{s_{\bar x}^2}{n} - s_{\bar x\bar y},$$
where $s_{xa}^2$ and $s_{\bar x}^2$ can be determined by replacing $y$ with $x$ in the formulae for $s_{ya}^2$ and $s_{\bar y}^2$, and $s_{\bar x\bar y} = s_{\bar y\bar x}$.

Example 13.1.2.1. Thirty true values were selected at random from a normal distribution with mean 40 and standard deviation 15. To these values, simulated measurement errors with standard deviations 10 and 2 were added three times for Tool 1 and three times for Tool 2, respectively, to generate three measurements for each tool. The data are given in Table 13.1.2.1. Using these data, estimate the standard deviations of the measurement errors of Tool 1 and Tool 2 separately, assuming the average of neither of the two tools is the true value of the unit.
Table 13.1.2.1. Hypothetical replicated data for two tools.

Sr.No. | True Depth | Tool 1, r1 | Tool 1, r2 | Tool 1, r3 | Tool 2, r1 | Tool 2, r2 | Tool 2, r3
1 31.29 15.21 25.35 38.14 33.90 29.93 30.63


2 9.89 13.02 13.10 6.84 12.06 8.04 9.20
3 62.40 71.27 49.71 45.83 60.82 61.13 62.28
4 43.68 33.64 33.58 51.04 40.38 44.61 41.66
5 20.37 18.68 27.45 11.98 18.49 19.56 17.95
6 30.24 24.89 11.79 34.26 29.95 31.62 31.10
7 13.81 5.00 32.57 18.61 12.44 12.73 15.38
8 53.25 54.89 49.13 72.18 50.55 52.87 51.65
9 49.65 45.76 36.93 68.38 48.75 52.36 50.16
10 41.17 39.28 38.04 60.11 37.46 44.93 42.64
11 35.04 21.33 22.42 47.47 30.62 33.53 35.23
12 67.61 62.87 55.50 68.14 68.38 67.76 66.67
13 40.70 34.21 49.61 50.09 41.18 39.07 40.30
14 74.77 84.53 71.98 81.88 74.95 76.61 75.50
15 36.56 32.58 38.00 26.39 37.45 36.96 35.55
16 38.63 45.36 42.32 34.87 37.76 37.65 35.30
17 30.17 12.51 31.57 19.99 30.64 28.44 30.29
18 37.68 33.47 29.94 40.02 40.93 36.81 37.56
19 56.79 53.98 56.95 55.76 56.36 55.72 53.23
20 17.36 22.50 7.46 35.98 19.04 21.43 17.98
21 21.92 20.72 5.33 18.63 21.98 24.05 20.22
22 61.95 45.64 73.45 61.77 60.30 60.69 61.24
23 17.19 12.15 40.27 27.27 14.91 18.90 16.60
24 25.38 30.32 7.24 42.56 24.28 26.38 26.06
25 28.82 33.61 28.57 15.90 32.48 27.40 25.31
26 39.40 24.64 38.58 46.49 38.06 38.97 38.97
27 57.37 74.93 44.83 42.40 54.23 60.07 59.31
28 34.29 32.43 44.84 38.70 33.06 37.08 35.23
29 48.17 56.44 52.25 44.10 53.36 48.23 46.27
30 41.26 44.78 46.68 36.75 41.48 39.43 40.96
Solution. The BMM Case II estimator is used. Here $n = 30$, $r_1 = 3$, $r_2 = 3$, and one can easily obtain the following:
$$\frac{1}{n}\sum_{i=1}^{n}\overline{y_i^2} = 1811.8,\quad \frac{1}{n}\sum_{i=1}^{n}\bar y_i = 38.3,\quad \frac{1}{n}\sum_{i=1}^{n}\overline{x_i^2} = 1757.5,\quad \frac{1}{n}\sum_{i=1}^{n}\bar x_i = 38.7,$$
$$s_{\bar y}^2 = 273.4,\quad s_{\bar x}^2 = 269.2,\quad s_{\bar y\bar x} = 260.9.$$
This yields
$$s_{ya}^2 = 1811.8 - 38.3^2 = 344.9 \quad\text{and}\quad s_{xa}^2 = 1757.5 - 38.7^2 = 259.8.$$
Using the Case II formula one gets:

Standard deviation of measurement error of Tool 1:
$$\hat\sigma_u = \sqrt{344.9 + \frac{273.4}{30} - 260.9} = \pm 9.6.$$
Standard deviation of measurement error of Tool 2:
$$\hat\sigma_v = \sqrt{259.8 + \frac{269.2}{30} - 260.9} = \pm 2.8.$$
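The Case II computations above can be scripted directly from replicated data. The following sketch generates hypothetical data in the spirit of Table 13.1.2.1 (the function name `bmm_case2` is our own, not the book's) and computes the BMM Case II variance estimates for both tools.

```python
import numpy as np

# Simulate replicated data in the spirit of Table 13.1.2.1:
# 30 true values, r1 = 3 readings from Tool 1 (sd 10), r2 = 3 from Tool 2 (sd 2).
rng = np.random.default_rng(123)
true = rng.normal(40, 15, size=30)
tool1 = true[:, None] + rng.normal(0, 10, size=(30, 3))
tool2 = true[:, None] + rng.normal(0, 2, size=(30, 3))

def bmm_case2(y, x):
    """BMM Case II estimate s_ya^2 + s_ybar^2 / n - s_ybar_xbar of the
    measurement-error variance of the tool that produced y."""
    n = y.shape[0]
    ybar, xbar = y.mean(axis=1), x.mean(axis=1)        # per-unit means
    s_ya2 = (y ** 2).mean(axis=1).mean() - ybar.mean() ** 2
    s_ybar2 = ybar.var(ddof=1)                         # variance of the unit means
    s_ybar_xbar = np.cov(ybar, xbar, ddof=1)[0, 1]     # covariance of the unit means
    return s_ya2 + s_ybar2 / n - s_ybar_xbar

sigma_u2 = bmm_case2(tool1, tool2)   # Tool 1 error-variance estimate (true value 100)
sigma_v2 = bmm_case2(tool2, tool1)   # Tool 2 error-variance estimate (true value 4)
print(sigma_u2, sigma_v2)
```

As noted in Remark 13.1.2.1(b), such an estimate can come out negative in practice; the sketch reports the raw value rather than truncating it at zero.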

Remark 13.1.2.1. ( a ) Estimation of measurement errors for multiple tools: The Grubbs method and the BMM estimators can be extended to situations with more than two tools. For this, readers are referred to Grubbs (1948) and Bhatia, Mangat, and Morrison (1998).
( b ) Remedy for negative measurement error estimates: Since the estimates of measurement error are based on samples of values obtained from the measurements by two tools, and the objective is to split the total scatter into components to be associated with each tool, it is possible for the estimate of measurement error for one of the tools to be negative. This happens when the variance of the measurements obtained by one tool is less than the covariance between the two tools' measurements. It can be rectified using either Jaech's (1981, 1985) method or the Bayesian approach suggested by Worthingham, Morrison, Mangat, and Desjardins (2002) and Morrison, Mangat, Carrol, and Riznic (2003).

13.2 RAKING RATIO USING CONTINGENCY TABLES

The use of the raking ratio estimator in survey sampling is quite old; it was first introduced by Deming and Stephan (1940). The procedure uses an iterative method of adjusting a two-way contingency table so that the row and column sums add to certain preassigned values. The concept is basically similar to the calibration approach. Let $\Omega$ be a population of units cross-classified into an $R\times C$ table, and let $\Omega_{rc}$ be the set of the $N_{rc}$ units in the $(r,c)$th cell. We draw a sample $s$, with $s_{rc} = s\cap\Omega_{rc}$, and let $d_i$ be the survey weight attached to the $i$th unit. Assume that the variable of interest $Y$ takes the value $y_i$ for the $i$th unit and that $\sum_{i\in s} d_i y_i$ is a consistent or unbiased estimator of $Y = \sum_{i\in\Omega} y_i$. In the presence of non-response, let $\Omega^*\subseteq\Omega$ be the set of units who would respond if they were to be sampled, let $\Omega^*_{rc} = \Omega_{rc}\cap\Omega^*$ be of size $N^*_{rc}$, let $s^*_{rc} = s_{rc}\cap\Omega^*$ be of size $n^*_{rc}$, and let $s^* = s\cap\Omega^*$. Let $X$ be an auxiliary variable associated with the study variable $Y$. Define $Y_{rc} = \sum_{i\in\Omega_{rc}} y_i$, $X_{rc} = \sum_{i\in\Omega_{rc}} x_i$, $Y^*_{rc} = \sum_{i\in\Omega^*_{rc}} y_i$, and $X^*_{rc} = \sum_{i\in\Omega^*_{rc}} x_i$ for $r = 1,2,\ldots,R$ and $c = 1,2,\ldots,C$. The raking ratio method adjusts the weight of each observation so that the resulting estimators of the auxiliary totals $X_{r\cdot}$ and $X_{\cdot c}$ ($r = 1,2,\ldots,R$; $c = 1,2,\ldots,C$) correspond to their population values. For a better understanding, let us consider the following $R\times C$ contingency table.

Rows | Column 1 | Column 2 | ... | Column C | Totals
1 | $Y_{11}, Y^*_{11}, X_{11}, X^*_{11}$ | $Y_{12}, Y^*_{12}, X_{12}, X^*_{12}$ | ... | $Y_{1C}, Y^*_{1C}, X_{1C}, X^*_{1C}$ | $Y_{1\cdot}, X_{1\cdot}$
2 | $Y_{21}, Y^*_{21}, X_{21}, X^*_{21}$ | $Y_{22}, Y^*_{22}, X_{22}, X^*_{22}$ | ... | $Y_{2C}, Y^*_{2C}, X_{2C}, X^*_{2C}$ | $Y_{2\cdot}, X_{2\cdot}$
... | ... | ... | ... | ... | ...
R | $Y_{R1}, Y^*_{R1}, X_{R1}, X^*_{R1}$ | $Y_{R2}, Y^*_{R2}, X_{R2}, X^*_{R2}$ | ... | $Y_{RC}, Y^*_{RC}, X_{RC}, X^*_{RC}$ | $Y_{R\cdot}, X_{R\cdot}$
Sum | $Y_{\cdot 1}, X_{\cdot 1}$ | $Y_{\cdot 2}, X_{\cdot 2}$ | ... | $Y_{\cdot C}, X_{\cdot C}$ | $Y_{\cdot\cdot}, X_{\cdot\cdot}$
Let $\hat Y^{(t)}_{rc}$ be the raking ratio estimator of $Y_{rc}$ after the $t$th iteration; we have
$$\hat Y^{(t)}_{rc} = \begin{cases} \displaystyle\sum_{i\in s^*_{rc}} d_i y_i & \text{if } t = 0,\\[2mm] \hat Y^{(t-1)}_{rc}\left(X_{r\cdot}\Big/\hat X^{(t-1)}_{r\cdot}\right) & \text{if } t \text{ is odd},\\[2mm] \hat Y^{(t-1)}_{rc}\left(X_{\cdot c}\Big/\hat X^{(t-1)}_{\cdot c}\right) & \text{if } t > 0 \text{ is even}. \end{cases} \qquad (13.2.1)$$
Formulas for the asymptotic variance of the raking ratio estimator are given by Brackstone and Rao (1979) for up to four iterations. Following them, we have the following relations.

If $t = 0$, $E\left(\hat Y^{(0)}_{rc}\right) = E\left[\sum_{i\in s^*_{rc}} d_i y_i\right] = Y^*_{rc}$; if $t = 2$, $E\left(\hat Y^{(2)}_{rc}\right) = E\left[\hat Y^{(1)}_{rc}\left(X_{\cdot c}\Big/\hat X^{(1)}_{\cdot c}\right)\right] \approx Y^{(1)}_{rc}\left(X_{\cdot c}\Big/X^{(1)}_{\cdot c}\right)$. In general, if $t$ is odd then
$$E\left(\hat Y^{(t)}_{rc}\right) = E\left[\hat Y^{(t-1)}_{rc}\left(\frac{X_{r\cdot}}{\hat X^{(t-1)}_{r\cdot}}\right)\right] \approx Y^{(t-1)}_{rc}\left(\frac{X_{r\cdot}}{X^{(t-1)}_{r\cdot}}\right), \qquad (13.2.2)$$
and if $t$ is even then
$$E\left(\hat Y^{(t)}_{rc}\right) = E\left[\hat Y^{(t-1)}_{rc}\left(\frac{X_{\cdot c}}{\hat X^{(t-1)}_{\cdot c}}\right)\right] \approx Y^{(t-1)}_{rc}\left(\frac{X_{\cdot c}}{X^{(t-1)}_{\cdot c}}\right). \qquad (13.2.3)$$

Now we use the result that the bias of the usual ratio estimator
$$\bar y_R = \bar y\left(\frac{\bar X}{\bar x}\right)$$
is given by
$$B\left(\bar y_R\right) = \frac{1}{\bar X}\left[\frac{\bar Y}{\bar X}V(\bar x) - \mathrm{Cov}(\bar y, \bar x)\right]. \qquad (13.2.4)$$
Thus we have the following theorem:

Theorem 13.2.1. The bias of the raking ratio estimator, to the first order of approximation, is given by
$$B = \begin{cases}\dfrac{1}{X^{(t-1)}_{r\cdot}}\left[\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{r\cdot}}V\left(\hat X^{(t-1)}_{r\cdot}\right) - \mathrm{Cov}\left(\hat X^{(t-1)}_{r\cdot}, \hat Y^{(t-1)}_{rc}\right)\right] & \text{if } t \text{ is odd},\\[3mm] \dfrac{1}{X^{(t-1)}_{\cdot c}}\left[\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{\cdot c}}V\left(\hat X^{(t-1)}_{\cdot c}\right) - \mathrm{Cov}\left(\hat X^{(t-1)}_{\cdot c}, \hat Y^{(t-1)}_{rc}\right)\right] & \text{if } t \text{ is even}. \end{cases} \qquad (13.2.5)$$
Similarly, using the result that the variance of the usual ratio estimator $\bar y_R$ is
$$V\left(\bar y_R\right) = V(\bar y) + \left(\frac{\bar Y}{\bar X}\right)^2 V(\bar x) - 2\left(\frac{\bar Y}{\bar X}\right)\mathrm{Cov}(\bar y, \bar x), \qquad (13.2.6)$$
we have the following theorem:

Theorem 13.2.2. The variance of the raking ratio estimator is given by
$$V = \begin{cases} V\left(\hat Y^{(t-1)}_{rc}\right) + \left(\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{r\cdot}}\right)^2 V\left(\hat X^{(t-1)}_{r\cdot}\right) - 2\left(\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{r\cdot}}\right)\mathrm{Cov}\left(\hat X^{(t-1)}_{r\cdot}, \hat Y^{(t-1)}_{rc}\right) & \text{if } t \text{ is odd},\\[3mm] V\left(\hat Y^{(t-1)}_{rc}\right) + \left(\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{\cdot c}}\right)^2 V\left(\hat X^{(t-1)}_{\cdot c}\right) - 2\left(\dfrac{Y^{(t-1)}_{rc}}{X^{(t-1)}_{\cdot c}}\right)\mathrm{Cov}\left(\hat X^{(t-1)}_{\cdot c}, \hat Y^{(t-1)}_{rc}\right) & \text{if } t \text{ is even}. \end{cases} \qquad (13.2.7)$$

Deming and Stephan (1940) were the first to use the raking ratio method of estimation for estimating the cell probabilities $\pi_{rc}$ in the $R\times C$ contingency table for which the marginal probabilities $\pi_{r\cdot}$ and $\pi_{\cdot c}$ are known. Later, various iterative procedures were developed by several researchers, including Smith (1947), Friedlander (1961), Ireland and Kullback (1968), Fienberg (1970), and Causey (1972), to find solutions to this problem. The method originally proposed by Deming and Stephan (1940) is called the Iterative Proportional Fitting Procedure (IPFP), and it minimises the modified distance function
$$\chi^2 = \sum_{r=1}^{R}\sum_{c=1}^{C}\frac{\left(n_{rc} - n\pi_{rc}\right)^2}{n_{rc}}, \qquad (13.2.8)$$
where $n_{rc} > 0$ denotes the sample size in the $(r,c)$th cell and
$$n = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc}.$$
Ireland and Kullback (1968) proved that the IPFP minimises a discrimination information function defined as
$$I(\pi, p) = \sum_{r=1}^{R}\sum_{c=1}^{C}\pi_{rc}\ln\left(\frac{\pi_{rc}}{p_{rc}}\right), \qquad (13.2.9)$$
where $p_{rc} = n_{rc}/n$.

Konijn (1981) derived biases, variances, and covariances for the estimators of the cell and marginal totals and of the corresponding marginal averages in the $R\times C$ contingency table. We shall now explain the raking ratio method with the help of a numerical example given by Binder and Theberge (1988), as follows.

Example 13.2.1. Consider $R = C = 2$ and suppose we have drawn a simple random sample of 80 units from a population of 700 units, with 70 respondents. We obtained the following respondent counts:

$n^*_{11} = 10$ | $n^*_{12} = 15$
$n^*_{21} = 20$ | $n^*_{22} = 25$

Use $d_i = 700/80 = 8.75$ and take $y_i = x_i = 1$. For $t = 0$ we have
$$\hat Y^{(0)}_{rc} = \sum_{i\in s^*_{rc}} d_i y_i = \sum_{i\in s^*_{rc}} 8.75 = 8.75\times n^*_{rc} = \hat N^*_{rc} \text{ (say)}. \qquad (13.2.10)$$
Thus at the 0th iteration we have the following table:

8.75 × 10 = 87.50 | 8.75 × 15 = 131.25
8.75 × 20 = 175.00 | 8.75 × 25 = 218.75
For $t = 1$ (odd) the raking ratio estimator takes the form
$$\hat Y^{(1)}_{rc} = \hat Y^{(0)}_{rc}\frac{X_{r\cdot}}{\hat X^{(0)}_{r\cdot}}.$$
Evidently
$$\frac{X_{1\cdot}}{\hat X^{(0)}_{1\cdot}} = \frac{300}{218.75} = 1.371429 \quad\text{and}\quad \frac{X_{2\cdot}}{\hat X^{(0)}_{2\cdot}} = \frac{400}{393.75} = 1.015873.$$
Thus we have the following table:

87.50 × 1.371429 = 120.00 | 131.25 × 1.371429 = 180.00 | 300.00
175.00 × 1.015873 = 177.78 | 218.75 × 1.015873 = 222.20 | 399.98
297.78 | 402.20 | 699.98

For $t = 2$ (even) the raking ratio estimator takes the form
$$\hat Y^{(2)}_{rc} = \hat Y^{(1)}_{rc}\frac{X_{\cdot c}}{\hat X^{(1)}_{\cdot c}}.$$
Evidently
$$\frac{X_{\cdot 1}}{\hat X^{(1)}_{\cdot 1}} = 1.17568 \quad\text{and}\quad \frac{X_{\cdot 2}}{\hat X^{(1)}_{\cdot 2}} = \frac{350}{402.20} = 0.870214.$$
Thus we have the following table:

120.00 × 1.17568 = 141.08 | 180.00 × 0.870214 = 156.63
177.78 × 1.17568 = 208.96 | 222.20 × 0.870214 = 193.37
350.04 | 350.00

and so on, until convergence is achieved. Such a method of improving the estimates is called the raking ratio method of estimation.
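The alternating odd (row) and even (column) adjustments of (13.2.1) are easy to carry out programmatically. In the sketch below, the helper `rake` is our own naming, and the column margins of 350 each are an assumption consistent with the row margins 300 and 400 summing to the population size of 700.

```python
import numpy as np

def rake(table, row_targets, col_targets, max_iter=100, tol=1e-8):
    """Deming-Stephan iterative proportional fitting: rescale the table
    until its row and column sums match the given targets."""
    t = table.astype(float).copy()
    for _ in range(max_iter):
        t *= (row_targets / t.sum(axis=1))[:, None]  # odd step: fix row sums
        t *= col_targets / t.sum(axis=0)             # even step: fix column sums
        if np.allclose(t.sum(axis=1), row_targets, rtol=tol):
            break                                    # rows survived the column step
    return t

# Start from the t = 0 table 8.75 * n*_rc of Example 13.2.1.
start = 8.75 * np.array([[10.0, 15.0], [20.0, 25.0]])
raked = rake(start, row_targets=np.array([300.0, 400.0]),
             col_targets=np.array([350.0, 350.0]))
print(raked.round(2))
```

After convergence, both sets of margins are matched simultaneously, which is the point at which the hand iteration of the example stops.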

13.3 CONTINUOUS POPULATIONS

The literature on the survey sampling of finite populations mainly deals with populations that consist of sets of discrete units. Here we shall discuss populations that may be considered as one continuous entity or a continuum of points. Examples of natural continuous populations are air, soil temperature over space and time, pollutant levels in a volume of material such as a river, percent chemical contents along a strip of soil, inches of rainfall over a region, and commodity prices over time. Thus a continuous population exists within a support medium such as time or space. Let the function $Y_a$ assign the value of a characteristic of interest $y$ to point $a$ of the support region. In geo-statistics this function is called a regionalized variable. In survey sampling, the entire set of labelled pairs $y_P = \{(a, Y_a)\}$, which is the subject of inference after sampling, is called the population parameter, and any real function of $y_P$ is called a parametric function. If $P$ denotes the support region, then we are interested in estimating the total value of the $y$ characteristic, defined as
$$Y = \int_P Y_a\, da. \qquad (13.3.1)$$

In fixed population sampling strategies, the parameter $\{(a, y_a)\}$ is regarded as fixed but unknown. In the superpopulation model approach, the parameter is treated as a realisation of a random vector or function $\{(a, Y_a)\}$ whose stochastic distribution $\xi$ is specified or partially specified. Let $E_\xi$ denote the expected value with respect to the distribution $\xi$. In the fixed population approach, the inference is based on a sample $s$ of $n$ units drawn from the population with probability $p(s)$, called the sampling design. Let $E_p$ denote the expected value with respect to the sampling design.

Design unbiasedness: An estimator $\hat Q$ of the parametric function $Q(y_P)$ is said to be $p$-unbiased if
$$E_p\left(\hat Q\right) = \sum_s \hat Q\, p(s) = Q\left(y_P\right).$$

Model and design unbiasedness: If
$$E_\xi\left(\hat Q\right) = Q\left(Y_P\right),$$
then $\hat Q$ is called $\xi$-unbiased. If
$$E_p E_\xi\left(\hat Q\right) = Q\left(y_P\right),$$
then it is called $p\xi$-unbiased.

Predictive mean squared error: The predictive MSE of $\hat Q$ is defined as
$$E_\xi\left[\hat Q - Q\left(Y_P\right)\right]^2.$$
The superpopulation value $Y_a$ can be partitioned into two components, viz.,
( a ) The first is a trend component, whose values usually depend on the location $a$ of the population unit in the support region, defined as
$$f(a) = E_\xi\left(Y_a\right) = \sum_{k=0}^{p}\beta_k c_k(a), \quad a\in P,$$
where $c_0, c_1, \ldots, c_p$ are $p+1$ independent known functions;
( b ) The second component represents the local random fluctuations, denoted as
$$Z(a) = Y_a - f(a),$$
which is, in fact, the vector of residual terms.
The general superpopulation model can be defined as
$$Y_a = f(a) + Z(a) \qquad (13.3.2)$$
with $E_\xi[Z(a)] = 0$, $a\in P$.

We shall now consider a one-dimensional support region; the labelling vector $a$ will now be replaced by a scalar $t$ (say).

Then we have the following assumptions. The error structure $Z(t)$, $t\in P$, has
( a ) mean zero;
( b ) finite variance, $\sigma_z^2 = V_\xi\{Z(t)\}$;
( c ) covariance $V_z(t_i, t_j) = \mathrm{Cov}_\xi\{Z(t_i), Z(t_j)\}$;
( d ) second order stationarity in one dimension, so that the covariance can be written as a function of the single variable $h = t_j - t_i$: $V_z(h) = \mathrm{Cov}_\xi\{Z(t), Z(t+h)\}$;
and
( e ) linear operations of integration and expectation can be interchanged.

Then, following Bartlett (1986), our objective is to estimate the population total
$$t_c = \int_0^T y_t\, dt \qquad (13.3.3)$$
over the support region $P = \{t : 0\le t\le T\}$. Thus $t_c$ can be regarded as a realisation of
$$T_c = \int_0^T Y(t)\, dt \qquad (13.3.4)$$
under the superpopulation model (13.3.2). Since we have only a finite number of observations over the support region $P$, our sample will consist of the set $s = \{t_i : i = 1,2,\ldots,n\}$.

Thus if $y_{t_1}, y_{t_2}, \ldots, y_{t_n}$ are the observed values, then we will choose our estimator to be a linear combination
$$\hat t_c = \sum_{i=1}^{n} w_{t_i} y_{t_i}, \qquad (13.3.5)$$
which is a realisation of
$$\hat T_c = \sum_{i=1}^{n} w_{t_i} Y_{t_i}, \qquad (13.3.6)$$
where the $w_{t_i}$ are weights depending upon the population points $t_i$ selected.
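As a concrete illustration of the linear estimator (13.3.5), the sketch below uses equal weights $w_{t_i} = T/n$ with uniformly selected sample points — the continuous analogue of simple random sampling — to estimate the total of a made-up regionalized variable; the function `y` and all the numbers are hypothetical.

```python
import math
import random

def y(t):
    # A made-up regionalized variable: a linear trend plus a local fluctuation.
    return 5.0 + 0.3 * t + math.sin(t)

T, n = 10.0, 10_000
random.seed(1)
sample = [random.uniform(0.0, T) for _ in range(n)]

# Linear estimator (13.3.5) with equal weights w_t = T / n.
t_hat = (T / n) * sum(y(t) for t in sample)

# Closed-form total for comparison: 5T + 0.15 T^2 + (1 - cos T).
exact = 5.0 * T + 0.15 * T ** 2 + (1.0 - math.cos(T))
print(t_hat, exact)
```

With $n$ this large, the estimate falls close to the closed-form integral; the criteria discussed next concern how to choose the weights and sample locations more cleverly than this equal-weight default.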
Bartlett (1986) has discussed the following four criteria to find these weights:

( a ) Determine both the weights and the sample locations such that the estimator (13.3.5) is model unbiased and has minimum predictive mean square error;
( b ) Determine the weights such that, for any sample $s$ actually obtained, the estimator (13.3.5) is model unbiased and has minimum predictive mean square error;
( c ) Determine, for preset weights, the sample $s$ for which the estimator is model unbiased and has minimum predictive mean square error;
( d ) Determine either the weights or the sample or both such that the estimator has minimum model bias for a given class of trend functions.

Padmawar (1996) considered the problem of estimating the population total of continuous populations using the following sampling designs:

( i ) SRS: simple random sampling, for which $p(t)\propto 1$;
( ii ) PPX: a design for which $p(t)\propto \prod_{i=1}^{n} t_i$;
( iii ) PM: the analogue of the Midzuno–Sen sampling design, with
$$p(t) = \frac{1}{n\mu}\sum_{i=1}^{n} t_i,$$
where $\mu = E(t) = \int_0^T t f(t)\, dt$;
( iv ) $P_g$: a sampling design with
$$p(t) = A_n\prod_{i=1}^{n} t_i\sum_{i=1}^{n} t_i^{-g};$$
and
( v ) RHC: the continuous analogue of the RHC strategy.

Padmawar (1996) compared the following four estimators of the population total of a continuous distribution, defined as:

( i ) $\hat Y_R$: the ratio estimator
$$T\sum_{i=1}^{n} y(t_i)\Big/\sum_{i=1}^{n} t_i;$$
( ii ) $\hat Y_{HT}$: the HT estimator
$$\sum_{i=1}^{n} y(t_i) f(t_i)/\pi(t_i),$$
where $\pi(t_i) = \sum_{j=1}^{n} q_j(t_i)$, $\pi(t_i)$ is assumed to be positive for $t > 0$, and $q_i(t_i) = \int q(t)\prod_{j\ne i,\, j=1}^{n} dt_j$, $1\le i\le n$;
( iii ) $\hat Y_g = \left(T\Big/\sum_{i=1}^{n} t_i^{1-g}\right)\sum_{i=1}^{n} y(t_i)\, t_i^{1-g}$, $g\in[0,2]$;
and
( iv ) $\hat t_{RHC} = \sum_{h=1}^{n} W_h\left[\dfrac{y(t_h)}{t_h/T_h}\right]$, where $W_h = \int_{G_h} f(t)\, dt$.

Padmawar (1996) has shown that the RHC strategy also remains design unbiased under continuous populations.

13.4 SMALL AREA ESTIMATION

Small area estimation is an important technique in survey sampling, owing to a heavy demand for reliable small area statistics from the public as well as private sectors. Direct survey estimates for a small area are likely to yield unacceptably large standard errors owing to the smallness of the sample sizes in the area. Thus it makes sense to use information from related areas to find accurate estimates for a given area or, simultaneously, for several areas. There are several methods, such as sample size dependent, synthetic, empirical best linear unbiased prediction, empirical Bayes, and hierarchical Bayes estimation. It is clear from the name 'small area' or 'local area' that we are interested in the estimates of the population total and its standard error in a particular small geographical area, such as a village, district, county, or census division. Sometimes we may be interested in 'small domains' rather than a 'small area or local area'. For example, we may be interested in the small domain formed by the sex, age, and race of a group of persons living in a large geographical area. Such a mechanism is also called interchangeable. Brackstone (1987) considered small area estimation issues based either on a census or on complete enumeration from administrative records. No doubt sample survey data can be used to obtain reliable estimators of means and totals for large areas or domains, but they may yield unacceptably large standard errors due to the small sample size in the area of interest. Sample sizes for small areas are typically small because the overall sample size in a survey is usually determined to provide specific accuracy at a much higher level of aggregation than that of small areas. Rao (2003), Ghosh and Rao (1994), and Chaudhuri (1992) have given a complete review of small area estimation strategies in survey sampling. Here we would like to discuss a few estimation strategies.

These techniques make use of data from administrative registers in addition to related data from the latest census. This approach covers many methods, such as: ( a ) the Vital Rates Method (VRM); ( b ) the Census Component Method (CCM); and ( c ) the Housing Unit Method (HUM). We first discuss these methods in brief; for more details one can refer to Purcell and Kish (1980).

Following Bogue (1950), the VRM makes use of only birth and death data. These data are used as symptomatic variables rather than as components of population change. In the first step, the numbers of births $B_t$ and deaths $D_t$ in a given $t$th year are determined for a local area. Let $B_{l0}$ denote the crude birth rate for the local area in the latest census year ($t = 0$), $B_{1t}$ denote the crude birth rate in the current year for a larger area containing the local area, and $B_{10}$ denote the crude birth rate in the census year for the larger area containing the local area. Then an estimator of the crude birth rate in the current year is given by
$$\hat B_{lt} = B_{l0}\left(B_{1t}/B_{10}\right). \qquad (13.4.2.1)$$
Similarly, if $b_{l0}$ denotes the crude death rate for the local area in the latest census year ($t = 0$), $D_{1t}$ denotes the crude death rate in the current year for the larger area containing the local area, and $D_{10}$ denotes the crude death rate in the census year for the larger area containing the local area, then an estimator of the crude death rate in the current year is given by
$$\hat b_{lt} = b_{l0}\left(D_{1t}/D_{10}\right). \qquad (13.4.2.2)$$
Then an estimator of the total population $P_t$ for the local area in year $t$ is given by
$$\hat P_t = \frac{1}{2}\left[B_t/\hat B_{lt} + D_t/\hat b_{lt}\right]. \qquad (13.4.2.3)$$
It is to be noted here that the VRM works well if $\left(B_{lt}/B_{l0}\right) \approx \left(B_{1t}/B_{10}\right)$ and $\left(b_{lt}/b_{l0}\right) \approx \left(D_{1t}/D_{10}\right)$; sometimes this assumption may be violated, and the VRM then becomes non-functional.
Example 13.4.2.1. In a college of a university, the total number of new students recruited during 2000 was 450, and the number of students who left the college was 250. The recruitment and departure rates according to the 1999 records were 2.5% and 1.5%, respectively, for both the college and the university. The recruitment and departure rates of the students during 2000 for the university are 2.6% and 1.7%, respectively. Estimate the total number of students in the college during 2000 by using the VRM, assuming that there are a large number of colleges in the university and that a particular college is small in comparison to the whole university.
Solution. An estimate of the current recruitment rate is
$$\hat B_{lt} = B_{l0}\left(B_{1t}/B_{10}\right) = 0.025\times(26/25) = 0.026,$$
and an estimate of the current departure rate is given by
$$\hat b_{lt} = b_{l0}\left(D_{1t}/D_{10}\right) = 0.015\times(17/15) = 0.017.$$
Thus an estimate of the total number of students in the college during 2000 is
$$\hat P_t = \frac{1}{2}\left[B_t/\hat B_{lt} + D_t/\hat b_{lt}\right] = \frac{1}{2}\left[450/0.026 + 250/0.017\right] = 16006.78 \approx 16007.$$
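The arithmetic of (13.4.2.1)–(13.4.2.3) is easy to package as a small routine. The function below is our own naming, wired with the figures of Example 13.4.2.1.

```python
def vrm_population(B_t, D_t, b0_local, d0_local,
                   B_large_t, B_large_0, D_large_t, D_large_0):
    """Vital Rates Method: update the local census-year rates by the
    larger area's rate change, then average the two implied totals."""
    b_t = b0_local * (B_large_t / B_large_0)   # (13.4.2.1) current birth rate
    d_t = d0_local * (D_large_t / D_large_0)   # (13.4.2.2) current death rate
    return 0.5 * (B_t / b_t + D_t / d_t)       # (13.4.2.3)

P_hat = vrm_population(B_t=450, D_t=250, b0_local=0.025, d0_local=0.015,
                       B_large_t=0.026, B_large_0=0.025,
                       D_large_t=0.017, D_large_0=0.015)
print(round(P_hat))  # 16007, as in Example 13.4.2.1
```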

This method takes into account an important factor, migration, while estimating the total population in a local area. If $M_t$ denotes the net migration in the local area since the last census, then an estimator of the total population $P_t$ is given by
$$\hat P_t = P_0 + B_t - D_t + M_t, \qquad (13.4.3.1)$$
where $P_0$ denotes the population count of the local area in the census year $t = 0$. Migration can be estimated in several ways. For example, net migration can be subdivided into civilian and military migration. Evidently, military migration can be taken from administrative records, and civilian migration can be taken from school enrolments. If the net migration is estimated from records for individuals, as opposed to collective units like schools, then such a method is called the administrative records method, and it can be used for producing local area estimates.

Example 13.4.3.1. In a university, let the total number of students be 15000 during 1999. During 2000, 1500 students were recruited according to schedule, and later 50 students left the university and stopped their studies. There was a net migration of +10 students (say, 20 students migrated to other universities and 30 students came from other universities during the academic year 2000). Apply the Census Component Method (CCM) to estimate the total number of students in the university during 2000.
Solution. We have
$$\hat P_t = P_0 + B_t - D_t + M_t = 15000 + 1500 - 50 + 10 = 16460.$$

If $\hat H_t$ denotes the estimator of the number of occupied housing units at time $t$, $\widehat{PPH}_t$ is the estimator of the average number of persons per housing unit at time $t$, and $\widehat{GQ}_t$ is the estimator of the number of persons in group quarters at time $t$, then an estimator of the total population $P_t$ is
$$\hat P_t = \hat H_t\times\widehat{PPH}_t + \widehat{GQ}_t. \qquad (13.4.4.1)$$
Smith and Lewis (1980) have discussed several methods for obtaining the estimates $\hat H_t$, $\widehat{PPH}_t$, and $\widehat{GQ}_t$ of the respective parameters. Marker (1983) came up with the interesting idea that most of the above methods can be derived as special cases of the multiple linear regression model. We will not discuss this method here, as it is straightforward following Marker (1983) or the basic techniques of regression analysis; instead, we will discuss some different methods which make use of the sampling techniques discussed in the previous chapters of this book.
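The HUM formula (13.4.4.1) is a single multiply-and-add; a minimal sketch with made-up component estimates:

```python
# Housing Unit Method (13.4.4.1) with hypothetical component estimates.
H_t = 5200     # estimated occupied housing units
PPH_t = 2.6    # estimated average persons per housing unit
GQ_t = 340     # estimated persons in group quarters

P_t = H_t * PPH_t + GQ_t   # estimated total population, about 13860
print(P_t)
```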

Before discussing the synthetic estimator, let us try to understand the structure of a population under the small area estimation process. Let $Y_{ij}$ and $X_{ij}$ be the totals of the study variable and the auxiliary variable for the $i$th area of interest within the $j$th group, for $j = 1,2,\ldots,K$. Assume that the population totals $X_{ij}$, based on $N_{ij}$ units, for the auxiliary variable $X$ are accurately known. Our interest is to draw inference about the study variable $Y$ in a particular area of interest by making use of the maximum information. The structure of the population and the mechanism of the synthetic estimation technique can easily be seen from Table 13.4.5.1. Following Gonzalez (1973), "An unbiased estimate is obtained from a sample survey for a large area; when this estimate is used to derive estimates for sub-areas under the assumption that the small areas have the same characteristics as the large area, we identify these estimates as synthetic estimates". Based on this definition, a general synthetic estimator of the small area total $Y_i$ is defined as
$$\hat Y_i^S = \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)\hat Y_{\cdot j}, \qquad (13.4.5.1)$$
where $\hat Y_{\cdot j}$ may be a ratio estimator, a difference-type estimator, or any other direct estimator. As shown in Table 13.4.5.1, the direct estimator $\hat Y_{\cdot j}$ as a ratio estimator has been discussed by Ghosh and Rao (1994), and a difference estimator has been discussed by Singh, Stukel, and Pfeffermann (1998); it is further investigated by Gershunskaya, Eltinge, and Huff (2002).
Table 13.4.5.1. An easy way to understand the synthetic estimation mechanism.

Group | Population values | Ratio-type direct estimator | Difference-type direct estimator
1 | $Y_{11},\ldots,Y_{1N_1}$; $X_{11},\ldots,X_{1N_1}$; $N_{11},\ldots,N_{1N_1}$ | $\hat Y_{\cdot 1} = \hat y_{\cdot 1}\,X_{\cdot 1}/\hat x_{\cdot 1}$ | $\hat Y_{\cdot 1} = \hat y_{\cdot 1} + \hat\beta\left(X_{\cdot 1} - \hat x_{\cdot 1}\right)$
2 | $Y_{21},\ldots,Y_{2N_2}$; $X_{21},\ldots,X_{2N_2}$; $N_{21},\ldots,N_{2N_2}$ | $\hat Y_{\cdot 2} = \hat y_{\cdot 2}\,X_{\cdot 2}/\hat x_{\cdot 2}$ | $\hat Y_{\cdot 2} = \hat y_{\cdot 2} + \hat\beta\left(X_{\cdot 2} - \hat x_{\cdot 2}\right)$
... | ... | ... | ...
K | $Y_{K1},\ldots,Y_{KN_K}$; $X_{K1},\ldots,X_{KN_K}$; $N_{K1},\ldots,N_{KN_K}$ | $\hat Y_{\cdot K} = \hat y_{\cdot K}\,X_{\cdot K}/\hat x_{\cdot K}$ | $\hat Y_{\cdot K} = \hat y_{\cdot K} + \hat\beta\left(X_{\cdot K} - \hat x_{\cdot K}\right)$

Now we have the following theorems:

Theorem 13.4.5.1. Assuming the direct method of estimation is a ratio estimator, show that the synthetic estimator for the $i$th area of interest reduces to
$$\hat Y_i^S = \sum_{j=1}^{K}\hat y_{\cdot j}\frac{X_{ij}}{\hat x_{\cdot j}}. \qquad (13.4.5.2)$$
Proof. Substituting $\hat Y_{\cdot j} = \hat y_{\cdot j}\left(X_{\cdot j}/\hat x_{\cdot j}\right)$, $j = 1,2,\ldots,K$, into $\hat Y_i^S$ in (13.4.5.1), the known totals $X_{\cdot j}$ cancel and we have (13.4.5.2). Hence the theorem.
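The synthetic ratio estimator (13.4.5.2) is a one-line computation once the group-level direct estimates are in hand. The sketch below uses entirely hypothetical figures for one small area across three groups.

```python
import numpy as np

# Hypothetical inputs for a single small area i across K = 3 groups:
X_ij = np.array([120.0, 80.0, 50.0])      # known auxiliary totals of area i per group
y_hat = np.array([900.0, 640.0, 300.0])   # direct estimates yhat_.j of group y-totals
x_hat = np.array([1500.0, 800.0, 600.0])  # direct estimates xhat_.j of group x-totals

# Synthetic ratio estimator (13.4.5.2): sum over j of yhat_.j * X_ij / xhat_.j.
Y_i_syn = float(np.sum(y_hat * X_ij / x_hat))
print(Y_i_syn)  # 72 + 64 + 25 = 161.0
```

Each term applies the group's estimated ratio $\hat y_{\cdot j}/\hat x_{\cdot j}$ to the area's known auxiliary total, which is exactly the "borrowing strength" described by Gonzalez (1973).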

Theorem 13.4.5.2. Assuming the direct method of estimation is a difference-type estimator, show that the synthetic estimator for the $i$th area of interest reduces to
$$\hat Y_i^S = \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)\hat y_{\cdot j} + \hat\beta\sum_{j=1}^{K}\left(X_{\cdot j} - \hat x_{\cdot j}\right)\left(X_{ij}/X_{\cdot j}\right), \qquad (13.4.5.3)$$
where $\hat\beta$ is an estimator of the fixed regression coefficient in the linear regression of $y$ on $x$.
Proof. It follows by substituting the direct estimator $\hat Y_{\cdot j} = \hat y_{\cdot j} + \hat\beta\left(X_{\cdot j} - \hat x_{\cdot j}\right)$ into the general synthetic estimator $\hat Y_i^S$ in (13.4.5.1). Hence the theorem.

If the direct estimator $\hat Y_{\cdot j}$ (whether ratio-type or difference-type) is approximately design unbiased, then we have the following theorem.

Theorem 13.4.5.3. The bias of the general synthetic estimator $\hat Y_i^S$ is given by
$$B\left(\hat Y_i^S\right) \approx \sum_{j=1}^{K} X_{ij}\left(Y_{\cdot j}/X_{\cdot j} - Y_{ij}/X_{ij}\right). \qquad (13.4.5.4)$$
Proof. We know that the true total for the $i$th small area is given by $Y_i = \sum_{j=1}^{K} Y_{ij}$. Also, taking the design-based expected value on both sides of the general synthetic estimator
$$\hat Y_i^S = \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)\hat Y_{\cdot j},$$
we have
$$E\left(\hat Y_i^S\right) = E\left[\sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)\hat Y_{\cdot j}\right] = \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)E\left(\hat Y_{\cdot j}\right) \approx \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)Y_{\cdot j}.$$
Thus, by the definition of bias, we have
$$B\left(\hat Y_i^S\right) \approx \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)Y_{\cdot j} - \sum_{j=1}^{K} Y_{ij} = \sum_{j=1}^{K}\left(X_{ij}/X_{\cdot j}\right)Y_{\cdot j} - \sum_{j=1}^{K}\left(Y_{ij}X_{ij}/X_{ij}\right) = \sum_{j=1}^{K} X_{ij}\left(Y_{\cdot j}/X_{\cdot j} - Y_{ij}/X_{ij}\right).$$
Hence the theorem.

Theorem 13.4.5.4. Under the assumption $\mathrm{Cov}\left(\hat Y_i^S, \hat Y_i\right) \approx 0$, where $\hat Y_i$ denotes the direct estimator of $Y_i$, an approximately unbiased estimator of $\mathrm{MSE}\left(\hat Y_i^S\right)$ is given by
$$\widehat{\mathrm{MSE}}\left(\hat Y_i^S\right) = \left(\hat Y_i^S - \hat Y_i\right)^2 + \hat v\left(\hat Y_i\right), \qquad (13.4.5.5)$$
where $\hat v\left(\hat Y_i\right)$ is a design unbiased estimator of the variance of $\hat Y_i$.
Proof. The mean squared error of the general synthetic estimator $\hat Y_i^S$ is given by
$$\mathrm{MSE}\left(\hat Y_i^S\right) = E\left[\hat Y_i^S - Y_i\right]^2 = E\left[\hat Y_i^S - \hat Y_i + \hat Y_i - Y_i\right]^2 = E\left(\hat Y_i^S - \hat Y_i\right)^2 + E\left(\hat Y_i - Y_i\right)^2 + 2E\left(\hat Y_i^S - \hat Y_i\right)\left(\hat Y_i - Y_i\right).$$
Under the assumption
$$E\left(\hat Y_i^S - \hat Y_i\right)\left(\hat Y_i - Y_i\right) \approx 0,$$
we have
$$\mathrm{MSE}\left(\hat Y_i^S\right) \approx E\left(\hat Y_i^S - \hat Y_i\right)^2 + E\left(\hat Y_i - Y_i\right)^2 \approx E\left(\hat Y_i^S - \hat Y_i\right)^2 + V\left(\hat Y_i\right).$$
Hence the theorem, by the method of moments.

It is to be noted here that the condition $\mathrm{Cov}\left(\hat Y_i^S, \hat Y_i\right) \approx 0$ may be realistic in practice, since the synthetic estimator is much more stable than the direct estimator in the small area estimation process.

A class of estimators can be defined by combining the direct estimator and the synthetic estimator as
$$\hat Y_i^C = \gamma_i\hat Y_i + \left(1 - \gamma_i\right)\hat Y_i^S, \qquad (13.4.6.1)$$
where $\gamma_i$ is called a shrinkage factor between 0 and 1. The estimator $\hat Y_i^C$ is expected to have a small prediction mean squared error if $\gamma_i$ provides a suitable trade-off between the instability of the direct estimator and the bias of the synthetic estimator. Singh, Stukel, and Pfeffermann (1998) reported that most of the work on small area estimation is on the determination of the shrinkage coefficient $\gamma_i$.

Theorem 13.4.6.1. The optimum value of $\gamma_i$ which minimises $\mathrm{MSE}\left(\hat Y_i^C\right)$ is
$$\gamma_i = \mathrm{MSE}\left(\hat Y_i^S\right)\Big/\left\{\mathrm{MSE}\left(\hat Y_i^S\right) + V\left(\hat Y_i\right)\right\}. \qquad (13.4.6.2)$$
Proof. We have
$$\mathrm{MSE}\left(\hat Y_i^C\right) = E\left[\hat Y_i^C - Y_i\right]^2 = E\left[\gamma_i\hat Y_i + \left(1 - \gamma_i\right)\hat Y_i^S - Y_i\right]^2 = E\left[\gamma_i\left(\hat Y_i - Y_i\right) + \left(1 - \gamma_i\right)\left(\hat Y_i^S - Y_i\right)\right]^2$$
$$= \gamma_i^2 E\left(\hat Y_i - Y_i\right)^2 + \left(1 - \gamma_i\right)^2 E\left(\hat Y_i^S - Y_i\right)^2 + 2\gamma_i\left(1 - \gamma_i\right)E\left(\hat Y_i^S - Y_i\right)\left(\hat Y_i - Y_i\right)$$
$$= \gamma_i^2 V\left(\hat Y_i\right) + \left(1 - \gamma_i\right)^2\mathrm{MSE}\left(\hat Y_i^S\right),$$
the cross-product term vanishing under the assumption of Theorem 13.4.5.4. Now setting
$$\partial\,\mathrm{MSE}\left(\hat Y_i^C\right)/\partial\gamma_i = 0, \text{ we get } \gamma_i V\left(\hat Y_i\right) - \left(1 - \gamma_i\right)\mathrm{MSE}\left(\hat Y_i^S\right) = 0,$$
which proves the theorem.

The main difficulty in using $\gamma_i$ is that it depends upon unknown population parameters. This difficulty was overcome by Purcell and Kish (1979) by suggesting an estimator of $\gamma_i$ as

(13.4.6.3)

Drew, Singh, and Choudhry (1982) suggested an interesting method to obtain the shrinkage coefficient $\gamma_i$ as
$$\hat\gamma_i = \begin{cases} 1 & \text{if } \hat N_i \ge \delta N_i,\\ \hat N_i/\left(\delta N_i\right) & \text{otherwise}, \end{cases} \qquad (13.4.6.4)$$
where $\hat N_i$ is the direct, unbiased estimator of the known domain population size $N_i$, and $\delta$ is subjectively chosen to control the contribution of the synthetic estimator. Sarndal and Hidiroglou (1989) proposed an alternative method as
$$\hat\gamma_i = \begin{cases} 1 & \text{if } \hat N_i \ge \delta N_i,\\ \left(\hat N_i/\left(\delta N_i\right)\right)^{h-1} & \text{otherwise}, \end{cases} \qquad (13.4.6.5)$$
where $h$ is a subjectively chosen constant. It is to be noted that for $\delta = 1$, $h = 2$ the Sarndal and Hidiroglou (1989) weights reduce to the Drew, Singh, and Choudhry (1982) weights.
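Combining (13.4.6.1) with the Drew–Singh–Choudhry weight (13.4.6.4) gives a complete recipe. The sketch below uses hypothetical numbers and our own function names.

```python
def dsc_gamma(N_hat, N, delta=1.0):
    """Drew-Singh-Choudhry shrinkage weight (13.4.6.4)."""
    return 1.0 if N_hat >= delta * N else N_hat / (delta * N)

def composite(direct, synthetic, N_hat, N, delta=1.0):
    """Composite estimator (13.4.6.1): gamma * direct + (1 - gamma) * synthetic."""
    g = dsc_gamma(N_hat, N, delta)
    return g * direct + (1.0 - g) * synthetic

# An area whose estimated size N_hat falls short of its known size N
# borrows strength from the synthetic estimate (gamma = 0.8 here):
est = composite(direct=2.10, synthetic=1.60, N_hat=8.0, N=10.0)
print(est)  # approximately 2.00
```

When the direct estimate of the domain size reaches (or exceeds) the known size, the weight hits 1 and the composite estimator falls back entirely on the direct estimator, as (13.4.6.4) prescribes.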

Example 13.4.6.1. The number of countries in each continent, the average area in hectares, and the number of countries to be selected from each of the 10 continents are given in the following table.

Continent | No. of countries $N_i$ | Average area (hectares) $\bar X_i$ | Sample size $n_i$
1 | 6 | 3194.50 | 2
2 | 6 | 14660.00 | 2
3 | 8 | 18309.37 | 3
4 | 10 | 14923.50 | 4
5 | 12 | 5987.83 | 4
6 | 4 | 3450.00 | 2
7 | 30 | 11682.73 | 11
8 | 17 | 145162.30 | 6
9 | 10 | 33976.10 | 4
10 | 3 | 1333.33 | 2

Select a sample of 40 countries from the different continents listed in population 5 of the Appendix. Estimate the average yield in each of the continents using the synthetic ratio (SR) estimator and the composite estimator (CE).

Solution. We selected sub-samples of the required sizes from each continent by SRSWOR and collected the information on the area and yield of the tobacco crop, as shown below.

Continent | $n_i$ | $x_{ij}$ | $\bar x_i$ | $y_{ij}$ | $\bar y_i$
1 2 9024 7090.50 2.21 1.995
5157 1.78
2 2 27050 14 112.50 1.51 1.750
1175 1.99
3 3 9260 24003.30 2.80 2.780
47600 2.77
15150 2.79
4 4 24000 14025.00 0.63 1.450
7000 1.67
19100 2.34
6000 1.17
5 4 4500 8226.00 2.33 1.615
4304 0.26
5500 1.82
18600 2.05
6 2 2700 4700.00 1.96 1.570
6700 1.18
7 11 705 21939.09 1.00 1.010
4000 0.45
3400 1.62
750 0.87
10000 0.26
10 1.00
116700 1.22
655 1.63
103110 2.06
0 0.00
2000 1.00
8 6 36000 325593.16 1.22 1.450
12165 0.74
1445000 1.75
445000 1.43
11000 1.05
4394 2.50
9 4 18000 78800.00 1.39 1.175
2100 1.29
1800 1.11
293300 0.91
10 2 3300 1700.00 2.73 1.840
100 0.95

The overall sample means are given by $\bar{x} = 68157.725$ and $\bar{y} = 1.486$.
Thus we have the following two cases:

Synthetic Ratio (SR) estimators: The synthetic ratio estimates of the average yield in
the different continents, $\hat{\bar{Y}}_i = (\bar{y}/\bar{x})\bar{X}_i$, are

Continent $i$   Synthetic estimator $\hat{\bar{Y}}_i = (\bar{y}/\bar{x})\bar{X}_i$
 1   (1.486/68157.725) × 3194.50 = 0.06947
 2   (1.486/68157.725) × 14660.00 = 0.31962
 3   (1.486/68157.725) × 18309.37 = 0.39918
 4   (1.486/68157.725) × 14923.50 = 0.32536
 5   (1.486/68157.725) × 5987.83 = 0.13054
 6   (1.486/68157.725) × 3450.00 = 0.07522
 7   (1.486/68157.725) × 11682.73 = 0.25471
 8   (1.486/68157.725) × 145162.30 = 3.16488
 9   (1.486/68157.725) × 33976.10 = 0.74076
10   (1.486/68157.725) × 1333.33 = 0.02906

Composite estimator (CE): The composite estimates of the average yield are

Continent $i$   $W_i$   $w_i$   Comparison   Composite estimate
 1   0.050   0.0566   $W_1 < w_1$    1.695
 2   0.050   0.0566   $W_2 < w_2$    1.594
 3   0.075   0.0755   $W_3 < w_3$    2.393
 4   0.100   0.0933   $W_4 > w_4$    1.469
 5   0.100   0.1130   $W_5 < w_5$    1.401
 6   0.050   0.0370   $W_6 > w_6$    1.543
 7   0.275   0.2830   $W_7 < w_7$    0.771
 8   0.150   0.1600   $W_8 < w_8$   -2.131
 9   0.100   0.0940   $W_9 > w_9$    0.197
10   0.050   0.0280   $W_{10} > w_{10}$   1.832

The estimate of the average yield in the eighth continent is negative, which is not
possible, and hence it can be taken as zero. It appears that this estimator needs
improvement, perhaps because of the weak correlation between yield and area under the crop.
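The synthetic ratio computation of Example 13.4.6.1 is easy to reproduce. The sketch below uses the sample means and known continent areas quoted in the example; variable names are ours, not the book's.

```python
# Sketch (values from Example 13.4.6.1): the synthetic ratio estimate for
# continent i is (y_bar / x_bar) * X_bar_i, where y_bar and x_bar are the
# overall sample means and X_bar_i the known average area of continent i.

x_bar, y_bar = 68157.725, 1.486
X_bar = [3194.50, 14660.00, 18309.37, 14923.50, 5987.83,
         3450.00, 11682.73, 145162.30, 33976.10, 1333.33]

sr = [y_bar / x_bar * X for X in X_bar]   # synthetic ratio estimates
for i, est in enumerate(sr, start=1):
    print(i, round(est, 5))
```

The eighth continent's very large area (145162.30 ha) drives its synthetic estimate above 3, another sign that the implicit proportionality assumption fails there.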

The small area estimation techniques are model dependent, and to understand these
techniques one should know the Henderson (1975) model. We first provide
here a complete solution to it.

We give here a complete solution to the general Henderson (1975) model defined as
$$Y = X\beta + Zu + e, \qquad (13.4.7.1)$$
where $Y = \underset{1\le i\le t}{\mathrm{col}}\;\underset{1\le j\le n_i}{\mathrm{col}}(y_{ij})$ is an $n\times 1$ vector, $X = (x_{ijl})$ is an $n\times k$ matrix, $u = \underset{1\le i\le t}{\mathrm{col}}(u_i)$,
$e = \underset{1\le i\le t}{\mathrm{col}}\;\underset{1\le j\le n_i}{\mathrm{col}}(e_{ij})$, and $Z = (z_{ij})$ is an $n\times k$ matrix. Let us partition a matrix as
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}. \qquad (13.4.7.2)$$

Let $D$ be a non-singular matrix; then by block multiplication we have
$$AP + BR = I, \qquad (13.4.7.3)$$
$$AQ + BS = 0, \qquad (13.4.7.4)$$
$$CP + DR = 0, \qquad (13.4.7.5)$$
and
$$CQ + DS = I. \qquad (13.4.7.6)$$
From (13.4.7.5) we have
$$R = -D^{-1}CP. \qquad (13.4.7.7)$$
From (13.4.7.6),
$$S = D^{-1}(I - CQ). \qquad (13.4.7.8)$$
From (13.4.7.3) and (13.4.7.7),
$$AP - BD^{-1}CP = I \quad\text{or}\quad (A - BD^{-1}C)P = I,$$
which implies
$$P = (A - BD^{-1}C)^{-1}. \qquad (13.4.7.9)$$
From (13.4.7.7) and (13.4.7.9),
$$R = -D^{-1}C(A - BD^{-1}C)^{-1}. \qquad (13.4.7.10)$$
From (13.4.7.4) and (13.4.7.8),
$$AQ + BD^{-1}(I - CQ) = 0 \quad\text{or}\quad AQ + BD^{-1} - BD^{-1}CQ = 0 \quad\text{or}\quad (A - BD^{-1}C)Q = -BD^{-1},$$
which implies
$$Q = -(A - BD^{-1}C)^{-1}BD^{-1}. \qquad (13.4.7.11)$$
Substituting this in (13.4.7.8) we have
$$S = D^{-1}\left[I + C(A - BD^{-1}C)^{-1}BD^{-1}\right]. \qquad (13.4.7.12)$$
From (13.4.7.9), (13.4.7.10), (13.4.7.11), and (13.4.7.12), the inverse of the matrix
$\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ is given by
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1}\left[I + C(A - BD^{-1}C)^{-1}BD^{-1}\right] \end{bmatrix}.$$
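The block-inverse formulas (13.4.7.9)–(13.4.7.12) can be checked numerically in the simplest case, where each block is a scalar, against the familiar $2\times 2$ inverse. This sketch is ours; any invertible matrix with $d \ne 0$ would do.

```python
# Sketch: check the block-inverse formulas (13.4.7.9)-(13.4.7.12) when
# A, B, C, D are 1x1 blocks (scalars), against the usual 2x2 inverse.

a, b, c, d = 4.0, 1.0, 2.0, 3.0           # hypothetical entries, d != 0

p = 1.0 / (a - b * c / d)                 # P = (A - B D^-1 C)^-1
q = -p * b / d                            # Q = -(A - B D^-1 C)^-1 B D^-1
r = -(c / d) * p                          # R = -D^-1 C (A - B D^-1 C)^-1
s = (1.0 / d) * (1.0 + (c * p * b) / d)   # S = D^-1 [I + C P B D^-1]

det = a * d - b * c                       # direct 2x2 inverse for comparison
inv = [[d / det, -b / det], [-c / det, a / det]]
print([[p, q], [r, s]])
print(inv)
```

Both computations give the same matrix, so the partitioned formulas agree with the direct inverse in this case.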
Now in the Henderson (1975) model, the partition of the matrix is given by
$$A = X'R^{-1}X, \quad B = X'R^{-1}Z, \quad C = Z'R^{-1}X, \quad\text{and}\quad D = Z'R^{-1}Z + G^{-1}. \qquad (13.4.7.13)$$
Thus the least squares estimates of $\beta$ and $u$ in Henderson's model
$$y = X\beta + Zu + e, \qquad (13.4.7.14)$$
where
$$V\begin{bmatrix} u \\ e \end{bmatrix} = \sigma^2\begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}, \qquad (13.4.7.15)$$
are given by
$$\hat{\beta} = (A - BD^{-1}C)^{-1}X'R^{-1}y - (A - BD^{-1}C)^{-1}BD^{-1}Z'R^{-1}y,$$
or
$$\hat{\beta} = (A - BD^{-1}C)^{-1}(X' - BD^{-1}Z')R^{-1}y, \qquad (13.4.7.16)$$
and
$$\hat{u} = -D^{-1}C(A - BD^{-1}C)^{-1}X'R^{-1}y + D^{-1}\left[I + C(A - BD^{-1}C)^{-1}BD^{-1}\right]Z'R^{-1}y$$
$$= D^{-1}\left[Z' + C(A - BD^{-1}C)^{-1}(BD^{-1}Z' - X')\right]R^{-1}y. \qquad (13.4.7.17)$$

To simplify these estimates of $\hat{\beta}$ and $\hat{u}$ we have
$$(A - BD^{-1}C)^{-1} = \left[(X'R^{-1}X) - (X'R^{-1}Z)(Z'R^{-1}Z + G^{-1})^{-1}(Z'R^{-1}X)\right]^{-1}$$
$$= \left[X'\left(R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}\right)X\right]^{-1}$$
$$= (X'V^{-1}X)^{-1}, \qquad (13.4.7.18)$$
where $R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1} = V^{-1}$, as shown by Henderson (1973).
Now
$$(X' - BD^{-1}Z')R^{-1} = \left[X' - X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'\right]R^{-1}$$
$$= X'\left[R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}\right] = X'V^{-1}. \qquad (13.4.7.19)$$
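The identity used in (13.4.7.18), $R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1} = V^{-1}$ with $V = R + ZGZ'$, is easiest to see in the scalar case. The numbers below are hypothetical.

```python
# Sketch: check the identity of (13.4.7.18), with V = R + Z G Z',
# in the scalar case where R, Z and G are all 1x1.

r, z, g = 2.0, 3.0, 0.5                # hypothetical scalar "matrices"

v = r + z * g * z                      # V = R + Z G Z'
lhs = 1.0 / r - (z / r) * (1.0 / (z * z / r + 1.0 / g)) * (z / r)
rhs = 1.0 / v
print(lhs, rhs)
```

This is the scalar form of the matrix identity often attributed to Woodbury; Henderson (1973) uses it to collapse the mixed model equations.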

On using (13.4.7.18) and (13.4.7.19) in (13.4.7.16) we obtain
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y. \qquad (13.4.7.20)$$
Now $\hat{u}$ can be written as
$$\hat{u} = D^{-1}\left[Z' + C(A - BD^{-1}C)^{-1}(BD^{-1}Z' - X')\right]R^{-1}y$$
$$= (Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}y$$
$$\quad + (Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}X\left[X'R^{-1}X - X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}X\right]^{-1}$$
$$\quad \times\left[X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}y - X'R^{-1}y\right]$$
$$= GZ'V^{-1}y + GZ'V^{-1}X\left[(X'R^{-1}X) - X'R^{-1}ZGZ'V^{-1}X\right]^{-1}\left[X'R^{-1}ZGZ'V^{-1}y - X'R^{-1}y\right],$$
using the result that $(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1} = GZ'V^{-1}$. Now using $X'V^{-1}y = X'V^{-1}X\hat{\beta}$ in the
above expression,
$$\hat{u} = GZ'V^{-1}y - GZ'V^{-1}X\left[X'(R^{-1} - R^{-1}ZGZ'V^{-1})X\right]^{-1}\left[X'(R^{-1} - R^{-1}ZGZ'V^{-1})X\hat{\beta}\right]$$
$$= GZ'V^{-1}y - GZ'V^{-1}X\hat{\beta} = GZ'V^{-1}(y - X\hat{\beta}). \qquad (13.4.7.21)$$

Thus we have the following theorem:

Theorem 13.4.7.1. The best linear unbiased predictor (BLUP) of $\xi'\beta + \zeta'u$ is
$$\xi'\hat{\beta} + \zeta'\hat{u} = \xi'\hat{\beta} + \zeta'GZ'V^{-1}(y - X\hat{\beta}). \qquad (13.4.7.22)$$
Proof. Replacing $\beta$ by $\hat{\beta}$ and $u$ by $\hat{u}$ in $\xi'\beta + \zeta'u$ we have the theorem.

Theorem 13.4.7.2. The variance of the best linear unbiased predictor (BLUP),
$\hat{u}_p = \xi'\hat{\beta} + \zeta'\hat{u}$, is
$$V(\hat{u}_p) = \sigma^2\Big[\xi'(X'V^{-1}X)^{-1}\xi - \zeta'(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}X(X'V^{-1}X)^{-1}\xi$$
$$\quad - \xi'(X'V^{-1}X)^{-1}X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}\zeta$$
$$\quad + \zeta'(Z'R^{-1}Z + G^{-1})^{-1}\left[I + Z'R^{-1}X(X'V^{-1}X)^{-1}X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}\right]\zeta\Big]. \qquad (13.4.7.23)$$
Proof. Following Henderson (1975), let
$$\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}$$
be a generalized inverse of the matrix
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix};$$
then
$$V(\hat{u}_p) = V(\xi'\hat{\beta} + \zeta'\hat{u}) = \sigma^2\begin{bmatrix} \xi' & \zeta' \end{bmatrix}\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}\begin{bmatrix} \xi \\ \zeta \end{bmatrix}$$
$$= \sigma^2\left[\xi'C_{11}\xi + \zeta'C_{12}'\xi + \xi'C_{12}\zeta + \zeta'C_{22}\zeta\right],$$
where $C_{11} = (X'V^{-1}X)^{-1}$, $C_{12} = -(X'V^{-1}X)^{-1}X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}$, and
$$C_{22} = (Z'R^{-1}Z + G^{-1})^{-1}\left[I + Z'R^{-1}X(X'V^{-1}X)^{-1}X'R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}\right].$$

Remark 13.4.7.1. Note that the BLUP can also be written as
$$\xi'\hat{\beta} + \zeta'\hat{u} = \xi'\hat{\beta} + \zeta'GZ'V^{-1}(y - X\hat{\beta})$$
$$= \xi'(X'V^{-1}X)^{-1}X'V^{-1}y + \zeta'GZ'V^{-1}\left[y - X(X'V^{-1}X)^{-1}X'V^{-1}y\right]$$
$$= \left[\xi'(X'V^{-1}X)^{-1}X'V^{-1} + \zeta'GZ'V^{-1}\left(I - X(X'V^{-1}X)^{-1}X'V^{-1}\right)\right]y$$
$$= Wy,$$
where
$$W = \xi'(X'V^{-1}X)^{-1}X'V^{-1} + \zeta'GZ'V^{-1}\left(I - X(X'V^{-1}X)^{-1}X'V^{-1}\right).$$
We first discuss three well-known models useful for small area estimation, as
follows: ( a ) the nested error regression model; ( b ) the random regression
coefficient model; and ( c ) the Fay and Herriot model.

Battese, Harter, and Fuller (1988) suggested a nested error regression model in the
context of estimating the mean acreage under a crop for counties (small areas) in Iowa,
using Landsat satellite data in conjunction with survey data, with the model
$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad i = 1,2,\ldots,t \ \text{ and } \ j = 1,2,\ldots,n_i,$$
where $y = \underset{1\le i\le t}{\mathrm{col}}\;\underset{1\le j\le n_i}{\mathrm{col}}(y_{ij})$ is the vector of the sampled units, $X$ is an $n\times k$ matrix,
$\beta = (\beta_1,\beta_2,\ldots,\beta_k)'$, and $n = \sum_{i=1}^{t} n_i$.
So the nested error regression model can be written as
$$\begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1n_1} \\ \vdots \\ y_{t1} \\ y_{t2} \\ \vdots \\ y_{tn_t} \end{bmatrix} = \begin{bmatrix} x_{111} & x_{112} & \cdots & x_{11k} \\ x_{121} & x_{122} & \cdots & x_{12k} \\ \vdots & & & \vdots \\ x_{1n_11} & x_{1n_12} & \cdots & x_{1n_1k} \\ \vdots & & & \vdots \\ x_{t11} & x_{t12} & \cdots & x_{t1k} \\ x_{t21} & x_{t22} & \cdots & x_{t2k} \\ \vdots & & & \vdots \\ x_{tn_t1} & x_{tn_t2} & \cdots & x_{tn_tk} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} v_1 \\ v_1 \\ \vdots \\ v_1 \\ \vdots \\ v_t \\ v_t \\ \vdots \\ v_t \end{bmatrix} + \begin{bmatrix} e_{11} \\ e_{12} \\ \vdots \\ e_{1n_1} \\ \vdots \\ e_{t1} \\ e_{t2} \\ \vdots \\ e_{tn_t} \end{bmatrix}. \qquad (13.4.7.24)$$

To find the variance under the nested error regression model let us choose
$$Z = \underset{1\le i\le t}{\mathrm{diag}}\;\underset{1\le j\le n_i}{\mathrm{col}}(1) = I_t \otimes 1_{n_i},$$
where $\otimes$ denotes the Kronecker product, which implies that
$$V = R + ZGZ' = \underset{1\le i\le t}{\mathrm{diag}}(V_i) = \underset{1\le i\le t}{\mathrm{diag}}\left(\sigma_e^2 I_{n_i} + \sigma_v^2 J_{n_i}\right)$$
and
$$V_i^{-1} = \sigma_e^{-2}\left[I_{n_i} - \frac{\gamma_i}{n_i}J_{n_i}\right] \quad\text{with}\quad \gamma_i = \frac{\sigma_v^2}{\sigma_v^2 + n_i^{-1}\sigma_e^2},$$
where $1_{n_i} = \underset{1\le j\le n_i}{\mathrm{col}}(1)$. The $V_i^{-1}$ can be obtained by using the standard result
$$(I_a + \alpha J_a)^{-1} = I_a - \frac{\alpha}{1 + a\alpha}J_a.$$
One can easily observe that with $\zeta' = (0,0,\ldots,1,\ldots,0)$ we have
$$\zeta'GZ'V^{-1} = \left(\frac{\gamma_i}{n_i}, \frac{\gamma_i}{n_i}, \ldots, \frac{\gamma_i}{n_i}\right),$$
which further implies that
$$\zeta'GZ'V^{-1}(y - X\hat{\beta}) = \left(\frac{\gamma_i}{n_i}, \ldots, \frac{\gamma_i}{n_i}\right)\begin{bmatrix} y_{i1} - x_{i1}'\hat{\beta} \\ y_{i2} - x_{i2}'\hat{\beta} \\ \vdots \\ y_{in_i} - x_{in_i}'\hat{\beta} \end{bmatrix} = \gamma_i\left(\bar{y}_i - \bar{x}_i'\hat{\beta}\right).$$
Again taking $\xi = (0,0,\ldots,\bar{X}_i,\ldots,0)'$, the best linear unbiased predictor (BLUP) under
the nested error regression model becomes
$$\hat{\mu}(\text{nested}) = \xi'\hat{\beta} + \zeta'GZ'V^{-1}(y - X\hat{\beta}) = \bar{X}_i'\hat{\beta} + \gamma_i\left(\bar{y}_i - \bar{x}_i'\hat{\beta}\right). \qquad (13.4.7.25)$$

Then we have the following theorem:

Theorem 13.4.7.2.1. The mean squared error under the nested error regression
model is given by
$$MSE[\hat{\mu}(\text{nested})] = (\bar{X}_i - \gamma_i\bar{x}_i)'(X'V^{-1}X)^{-1}(\bar{X}_i - \gamma_i\bar{x}_i) + \sigma_v^2(1 - \gamma_i). \qquad (13.4.7.26)$$
Proof. The true mean under the nested error regression model in the $i$th area is given
by $\mu_i = \bar{X}_i'\beta + v_i$; therefore the mean squared error (MSE) is given by
$$MSE[\hat{\mu}(\text{nested})] = E[\hat{\mu}(\text{nested}) - \mu_i]^2 = E\left[(\bar{X}_i - \gamma_i\bar{x}_i)'\hat{\beta} + \gamma_i\bar{y}_i - \mu_i\right]^2$$
$$= (\bar{X}_i - \gamma_i\bar{x}_i)'V(\hat{\beta})(\bar{X}_i - \gamma_i\bar{x}_i) + (1 - \gamma_i)^2\sigma_v^2 + \frac{\gamma_i^2\sigma_e^2}{n_i}.$$
Note that $\gamma_i = \sigma_v^2/(\sigma_v^2 + n_i^{-1}\sigma_e^2)$, which implies $\gamma_i^2\sigma_e^2/n_i = \gamma_i(1 - \gamma_i)\sigma_v^2$, and thus
$$MSE[\hat{\mu}(\text{nested})] = (\bar{X}_i - \gamma_i\bar{x}_i)'(X'V^{-1}X)^{-1}(\bar{X}_i - \gamma_i\bar{x}_i) + (1 - \gamma_i)^2\sigma_v^2 + \gamma_i(1 - \gamma_i)\sigma_v^2$$
$$= (\bar{X}_i - \gamma_i\bar{x}_i)'(X'V^{-1}X)^{-1}(\bar{X}_i - \gamma_i\bar{x}_i) + \sigma_v^2(1 - \gamma_i).$$
Hence the theorem.
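Numerically, the BLUP of (13.4.7.25) is a simple shrinkage. The sketch below, with hypothetical variance components and a single covariate so that every quantity is a scalar, shows the computation; the values are ours, not from the text.

```python
# Sketch (hypothetical numbers): the shrinkage factor gamma_i and the BLUP
# of the small area mean under the nested error regression model,
# eq. (13.4.7.25), for a single covariate (scalar beta_hat and means).

def gamma_i(sigma_v2, sigma_e2, n_i):
    """gamma_i = sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_i)."""
    return sigma_v2 / (sigma_v2 + sigma_e2 / n_i)

def mu_nested(Xbar_i, xbar_i, ybar_i, beta_hat, g):
    """mu_hat(nested) = Xbar_i * beta_hat + gamma_i * (ybar_i - xbar_i * beta_hat)."""
    return Xbar_i * beta_hat + g * (ybar_i - xbar_i * beta_hat)

g = gamma_i(sigma_v2=4.0, sigma_e2=16.0, n_i=4)          # equal variances -> 0.5
est = mu_nested(Xbar_i=10.0, xbar_i=9.0, ybar_i=7.0, beta_hat=0.6, g=g)
print(g, est)
```

As $n_i$ grows, $\gamma_i \to 1$ and the predictor leans on the area's own data; for tiny $n_i$, $\gamma_i \to 0$ and it falls back on the synthetic regression prediction $\bar{X}_i'\hat{\beta}$.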

Dempster, Rubin, and Tsutakawa (1981) proposed a more general model with a
random regression coefficient $\beta$, which in the case of a single concomitant variable $x$
and regression through the origin takes the form
$$y_{ij} = \beta_i x_{ij} + e_{ij} = \beta x_{ij} + v_i x_{ij} + e_{ij}, \quad i = 1,2,\ldots,t; \ j = 1,2,\ldots,n_i, \qquad (13.4.7.27)$$
where $\beta_i = \beta + v_i$ and $v_i$ and $e_{ij}$ are independent. The mean of the $i$th area is given by
$$\mu_i = \bar{X}_i\beta_i = \bar{X}_i\beta + \bar{X}_i v_i, \qquad (13.4.7.28)$$
a linear combination of the fixed effect $\beta$ and the realized value of the random effect $v_i$.
Under the random regression model with $k = 1$, $z_i = x_i$,
$$\gamma_i = \sigma_v^2\left\{\sigma_v^2 + \sigma_e^2\Big/\sum_{j=1}^{n_i}x_{ij}^2\right\}^{-1}, \quad\text{which implies}\quad (1 - \gamma_i) = \frac{\gamma_i\,\sigma_e^2}{\sigma_v^2\sum_{j=1}^{n_i}x_{ij}^2},$$
and we have

$$\zeta'GZ'V^{-1}(y - X\hat{\beta}) = \left(\frac{\sigma_v^2}{\sigma_e^2}\right)(1 - \gamma_i)\sum_{j=1}^{n_i}x_{ij}\left(y_{ij} - \hat{\beta}x_{ij}\right) = \gamma_i\left(\sum_{j=1}^{n_i}x_{ij}^2\right)^{-1}\sum_{j=1}^{n_i}x_{ij}\left(y_{ij} - \hat{\beta}x_{ij}\right)$$
$$= \gamma_i\left[\hat{\beta}_i - \hat{\beta}\right] = \gamma_i\left[\sum_{j=1}^{n_i}x_{ij}y_{ij}\Big/\sum_{j=1}^{n_i}x_{ij}^2 - \hat{\beta}\right]. \qquad (13.4.7.29)$$
Further, if $\xi = (0,0,\ldots,\bar{X}_i,\ldots,0)'$, then the best linear unbiased predictor (BLUP) of
$\xi'\beta + \zeta'u$ is given by
$$\hat{\mu}(\text{random}) = \bar{X}_i\hat{\beta} + \bar{X}_i\gamma_i\left(\hat{\beta}_i - \hat{\beta}\right). \qquad (13.4.7.30)$$

Then we have the following theorem:

Theorem 13.4.7.3.1. The mean squared error of the BLUP under the random
regression coefficient model with one auxiliary variable is given by
$$MSE[\hat{\mu}(\text{random})] = \bar{X}_i^2\left[(1 - \gamma_i)^2\left(\sigma_v^2\Big/\sum_{i=1}^{t}\gamma_i\right) + \sigma_v^2(1 - \gamma_i)\right]. \qquad (13.4.7.31)$$
Proof. Note that with one auxiliary variable ($k = 1$) we have
$$V(\hat{\beta}) = (X'V^{-1}X)^{-1} = \sigma_v^2\Big/\sum_{i=1}^{t}\gamma_i$$
and
$$\hat{\beta}_i = \sum_{j=1}^{n_i}x_{ij}y_{ij}\Big/\sum_{j=1}^{n_i}x_{ij}^2 = \sum_{j=1}^{n_i}x_{ij}\left(\beta x_{ij} + v_i x_{ij} + e_{ij}\right)\Big/\sum_{j=1}^{n_i}x_{ij}^2 = \beta + v_i + \sum_{j=1}^{n_i}x_{ij}e_{ij}\Big/\sum_{j=1}^{n_i}x_{ij}^2,$$
which implies
$$E\left(\hat{\beta}_i - \beta\right)^2 = E\left[v_i + \sum_{j=1}^{n_i}x_{ij}e_{ij}\Big/\sum_{j=1}^{n_i}x_{ij}^2\right]^2 = \sigma_v^2 + \sigma_e^2\Big/\sum_{j=1}^{n_i}x_{ij}^2 = \frac{\sigma_v^2}{\gamma_i}.$$
Also
$$E\left[v_i\left(\hat{\beta}_i - \beta\right)\right] = E\left[v_i^2 + v_i\sum_{j=1}^{n_i}x_{ij}e_{ij}\Big/\sum_{j=1}^{n_i}x_{ij}^2\right] = \sigma_v^2.$$
Thus we have
$$MSE[\hat{\mu}(\text{random})] = E[\hat{\mu}(\text{random}) - \mu_i]^2 = E\left[\bar{X}_i\hat{\beta} + \gamma_i\bar{X}_i\left(\hat{\beta}_i - \hat{\beta}\right) - \bar{X}_i\beta - \bar{X}_i v_i\right]^2$$
$$= \bar{X}_i^2\left[(1 - \gamma_i)^2\left(\sigma_v^2\Big/\sum_{i=1}^{t}\gamma_i\right) + \sigma_v^2(1 - \gamma_i)\right].$$
Hence the theorem.
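The predictor of (13.4.7.29)–(13.4.7.30) shrinks the area-specific slope toward the overall slope. The sketch below uses hypothetical data; `beta_hat` and `Xbar_i` are assumed values standing in for the overall estimate and the known area mean of $x$.

```python
# Sketch (hypothetical data): the BLUP under the random regression
# coefficient model, eqs. (13.4.7.29)-(13.4.7.30).  The area slope
# beta_hat_i = sum x_ij y_ij / sum x_ij^2 is shrunk toward the overall
# slope beta_hat with factor gamma_i.

x = [1.0, 2.0, 3.0]
y = [1.1, 2.3, 2.8]

sxx = sum(v * v for v in x)
beta_hat_i = sum(a * b for a, b in zip(x, y)) / sxx   # area-specific slope

sigma_v2, sigma_e2 = 0.04, 0.14
gamma_i = sigma_v2 / (sigma_v2 + sigma_e2 / sxx)      # shrinkage factor

beta_hat = 1.0                                        # overall slope (assumed)
Xbar_i = 2.0                                          # area mean of x (assumed)
mu_random = Xbar_i * beta_hat + Xbar_i * gamma_i * (beta_hat_i - beta_hat)
print(beta_hat_i, gamma_i, mu_random)
```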

For estimating per capita income for small areas (population less than 1000), Fay
and Herriot (1979) assumed that a $k$-vector of benchmark variables
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})'$, related to $\mu_i$, is available for each area $i$, and that the $\mu_i$ are
independent $N(x_i'\beta, A)$, where $\beta$ is a $k$-vector of unknown parameters. The
sample mean vector $\bar{y} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_t)' = \underset{1\le i\le t}{\mathrm{col}}(\bar{y}_i)$, given $\mu = \underset{1\le i\le t}{\mathrm{col}}(\mu_i)$, is $N(\mu, D)$,
where $D = \underset{1\le i\le t}{\mathrm{diag}}(D_i)$ with known diagonal elements. The model can be stated as a
linear model
$$\bar{y}_i = \mu_i + e_i \quad\text{and}\quad \mu_i = x_i'\beta + v_i, \quad i = 1,2,\ldots,t, \qquad (13.4.7.32)$$
where $e = (e_1, e_2, \ldots, e_t)'$ and $v = (v_1, v_2, \ldots, v_t)'$ are distributed independently as $N(0, D)$
and $N(0, AI)$, respectively. For the Fay and Herriot model, the best linear unbiased
predictor (BLUP) of $\mu_i$ is obtained as
$$\hat{\mu}(\text{FH}) = x_i'\hat{\beta} + \left(\frac{A}{A + D_i}\right)\left(\bar{y}_i - x_i'\hat{\beta}\right), \qquad (13.4.7.33)$$
where $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}\bar{y}$ with $V = \mathrm{diag}(A + D_1, A + D_2, \ldots, A + D_t)$ and
$X = \underset{1\le i\le t}{\mathrm{col}}(x_i')$. Under normality the estimator (13.4.7.33) is also a Bayes estimator, as
shown by Fay and Herriot (1979). Note that $\hat{\mu}(\text{FH}) \to \bar{y}_i$ if $D_i/(A + D_i) \to 0$, and
$\hat{\mu}(\text{FH}) \to x_i'\hat{\beta}$ if $A/(A + D_i) \to 0$.
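The Fay–Herriot BLUP of (13.4.7.33) is easy to sketch for an intercept-only model, where $\hat{\beta}$ reduces to a precision-weighted mean. The numbers below are hypothetical; $A$ and the $D_i$ are treated as known, as in the text.

```python
# Sketch (hypothetical values): the Fay--Herriot BLUP of eq. (13.4.7.33)
# for an intercept-only model.  A and the D_i are assumed known.

A = 2.0
D = [1.0, 2.0, 6.0]                 # known sampling variances D_i
ybar = [10.0, 12.0, 20.0]           # direct estimates ybar_i

w = [1.0 / (A + d) for d in D]      # V = diag(A + D_i), so weights 1/(A+D_i)
beta_hat = sum(wi * yi for wi, yi in zip(w, ybar)) / sum(w)

mu_fh = [beta_hat + A / (A + d) * (yi - beta_hat) for d, yi in zip(D, ybar)]
print(beta_hat, mu_fh)
```

Each $\hat{\mu}_i(\text{FH})$ lands between the direct estimate $\bar{y}_i$ and the synthetic value $\hat{\beta}$, moving closer to $\hat{\beta}$ as the sampling variance $D_i$ grows.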

Let $y_{ijk}$ denote the value of the $k$th unit in the cell $(i, j)$, where the $\mu_j$ are fixed
effects and the error terms $e_{ijk}$ are uncorrelated with zero means and variance $\sigma_e^2$.
Holt, Smith, and Tomberlin (1979) obtained a best linear unbiased prediction (BLUP)
estimator of $Y_i$ under the linear model for the finite population
$$y_{ijk} = \mu_j + e_{ijk}, \quad k = 1,\ldots,N_{ij}; \ i = 1,2,\ldots,m, \qquad (13.4.8.1)$$
where $j$ indexes the large domains. Further, $N_{ij}$ denotes the number of population
elements in the large domain $j$ that belong to the small area $i$. Let $n_{ij}$ elements in a
sample of size $n$ fall in the cell $(i, j)$, and let $\bar{y}_{ij}$ and $\bar{y}_{\cdot j}$ denote the sample means for
cell $(i, j)$ and domain $j$, respectively. The best linear unbiased estimator of $\mu_j$ under
the model (13.4.8.1) is $\hat{\mu}_j = \bar{y}_{\cdot j}$, which in turn leads to the BLUP estimator of $Y_i$
given by
$$\hat{Y}_i^{B} = \sum_{j}\hat{Y}_{ij}^{C}, \qquad (13.4.8.2)$$
where $\hat{Y}_{ij}^{C}$ is a composite estimator of the total $Y_{ij}$ with shrinkage coefficient
$\gamma_{ij} = n_{ij}/N_{ij}$ attached to the direct estimator and its complement
$(1 - \gamma_{ij}) = (N_{ij} - n_{ij})/N_{ij}$ to the synthetic estimator. Thus in this situation, if $n_{ij}/N_{ij} \to 0$
for all $i$, irrespective of the size of the between-area variation relative to the within-area
variation, the BLUP estimator of $Y_i$ reduces to the synthetic estimator. More
general models in the presence of auxiliary information may be considered, such as
$$\bar{y}_i = \bar{x}_i'\beta + v_i z_i, \quad i = 1,2,\ldots,m, \qquad (13.4.8.3)$$
where $\bar{x}_i = (\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{ip})'$ denotes the set of known auxiliary information, $\beta$ is
the vector of regression parameters, and the $v_i$ are independent and identically
distributed random variables with $E(v_i) = 0$ and $V(v_i) = \sigma_v^2$. It is also possible to
consider a nested error regression model, for example
$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad j = 1,2,\ldots,N_i, \ i = 1,2,\ldots,m, \qquad (13.4.8.4)$$
where $e_{ij} = \tilde{e}_{ij}k_{ij}$ and the $\tilde{e}_{ij}$ are independent and identically distributed random
variables, independent of the $v_i$, with $E(\tilde{e}_{ij}) = 0$ and $V(\tilde{e}_{ij}) = \sigma^2$, the $k_{ij}$ being known
constants and $N_i$ the number of elements in the $i$th area. More details about
small area estimation issues can be found in Singh, Stukel, and Pfeffermann (1998)
and Rao (1999c), where they discuss Bayesian versus frequentist measures of
error in small area estimation, in addition to some recent advances in model-based
small area estimation. Datta and Ghosh (1991), Datta and Lahiri (1995), Lahiri and
Rao (1995), Datta, Day, and Maiti (1998), Datta, Day, and Basawa (1999), and Datta,
Lahiri, Maiti, and Lu (1999) have suggested robust hierarchical Bayes estimation of
small area characteristics in the presence of covariates and outliers. Datta and
Lahiri (2000) suggested a unified measure of uncertainty of estimated best linear
unbiased predictors. Moura and Holt (1999) have considered the problem of small
area estimation using multilevel models. Agarwal and Roy (1999) have also
suggested some efficient estimators for small domains. Sisodia and Singh (2001)
have shown that the application of small area statistics in agriculture is becoming more
and more important in the context of the Agro Climate Regional Planning process
initiated in India. You and Rao (2000a, 2000b), Stukel and Rao (1999), Prasad and
Rao (1999), Rao (1999c), Lahiri and Rao (1995), and Rao and Yu (1994) are also
interesting contributions to small area estimation.

The use of estimates of proportions and their accuracy is well known in different
disciplines such as economics, criminology, etc., and a practitioner always looks for
improved methodology. Fay and Herriot (1979) and their followers, as reviewed
by Ghosh and Rao (1994), consider the problem of estimating the mean in small
areas of a continuous study variable, such as income, tax return, or yield of a crop,
based on the standard normal theory distribution. Unfortunately such a theory is not
appropriate for estimating the proportion of an attribute in small areas. MacGibbon and
Tomberlin (1989), following classical logistic regression theory, consider that it is
the logit transform of the proportion, not the proportion itself, that has to be
modelled in a linear way while estimating proportions in small areas. Dempster and
Tomberlin (1980) were the first to consider the problem of inference from relatively
thinly spread, complex, multi-stage surveys to small areas or domains not necessarily
included in the survey by using a model based approach. They estimated proportions
for small areas and the associated uncertainty by making use of a random effects
multiple logistic regression model and empirical Bayes techniques. This explicitly
model based method differs substantially from the implicitly model based
approach of the synthetic estimation techniques of Gonzalez and Hoza (1976).

Farrell, MacGibbon, and Tomberlin (1994) consider the problem of estimating
small area rates and binomial parameters using empirical and hierarchical Bayes
approaches, following the pioneering empirical Bayes model of Dempster and
Tomberlin (1980). Stroud (1994) studied hierarchical Bayes models for univariate
natural exponential families with quadratic variance functions. Malec, Sedransk,
Moriarity, and LeClere (1997) used a fully Bayes approach to estimate proportions
using data from the National Health Interview Survey.

Farrell (1997, 2000) and Farrell, MacGibbon, and Tomberlin (1994, 1997a, 1997b,
1997c) pointed out that the importance of small area estimation as a facet of survey
sampling cannot be over emphasised. Of late there has been an increasing demand
for small area statistics in both the public and private sectors. They considered the
problem of the gain in power of a Bayesian approach by borrowing strength from an
ensemble, while simultaneously obtaining desirable frequentist operating
characteristics, and adopted empirical Bayes methodology for the said purpose.
Farrell, MacGibbon, and Tomberlin studied various adjustments to the empirical
Bayes interval estimates of small area proportions by following different kinds of
bootstrap methodology developed by Laird and Louis (1987) and modifications
suggested by Carlin and Gelfand (1990, 1991).

The probabilities associated with individuals in the population as a function of
categorical variables, continuous covariates, and sample characteristics can, in
particular, be modelled as
$$\mathrm{logit}(\pi_{ij}) = \theta + X_{ij}\beta + \phi_i, \qquad (13.4.9.1)$$
where $\pi_{ij}$ denotes the probability of a 'response' for the $j$th unit in the $i$th area, the
subscript $i$ refers to a set of categorical variable covariates, and the subscript $j$
refers to a set of nested sampling characteristics, indicating primary stage units
(PSU), second stage units (SSU), SSU within PSU, and so on. The parameter $\theta$
represents a sum of fixed classification effects, the parameter $\phi_i$ represents a sum of
random effects associated with sampling characteristics, the vector $X_{ij}$ represents a
vector of quantitative covariates, and the parameter $\beta$ is a vector of fixed logistic
linear regression parameters. The random effects parameters are assumed to have
some parametric distribution. The probabilities $\pi_{ij}$ are obtained by inverting (13.4.9.1):
$$\pi_{ij} = \left[1 + \exp\{-(\theta + X_{ij}\beta + \phi_i)\}\right]^{-1}. \qquad (13.4.9.2)$$
The Bayes estimate for the $i$th small area is given by
$$\hat{p}_i = \frac{\sum_{j}\hat{\pi}_{ij}}{N_i}, \qquad (13.4.9.3)$$
where
$$\hat{\pi}_{ij} = \left[1 + \exp\{-(\hat{\theta} + X_{ij}\hat{\beta} + \hat{\phi}_i)\}\right]^{-1} \qquad (13.4.9.4)$$
such that
$$\sum_{i,j}y_{ij}X_{ij} = \sum_{i,j}\hat{\pi}_{ij}X_{ij}, \qquad (13.4.9.5)$$
$$\sum_{i,j}y_{ij} = \sum_{i,j}\hat{\pi}_{ij}, \qquad (13.4.9.6)$$
and
$$\sum_{j}\left(y_{ij} - \hat{\pi}_{ij}\right) - \hat{\phi}_i/\sigma^2 = 0. \qquad (13.4.9.7)$$
The equations (13.4.9.5), (13.4.9.6), and (13.4.9.7) can be solved by the Newton--
Raphson algorithm. If $Z_{ij}$ represents a vector of predictor variables, both
quantitative and qualitative, associated with the $(i,j)$th individual and $\Gamma$ represents a
vector of the parameters of the model
$$Z_{ij}'\Gamma = \theta + X_{ij}\beta + \phi_i, \qquad (13.4.9.8)$$
then
$$V(\hat{p}_i) \approx \frac{1}{N_i^2}\left[\sum_{j}Z_{ij}'\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})\right]\Sigma\left[\sum_{j}Z_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})\right], \qquad (13.4.9.9)$$
where
$$\Sigma = \begin{bmatrix} \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})x_{ij}^2 & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})x_{ij} & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})x_{ij} \\ \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})x_{ij} & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij}) & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij}) \\ \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij})x_{ij} & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij}) & \sum_{ij}\hat{\pi}_{ij}(1 - \hat{\pi}_{ij}) - 1/\sigma^2 \end{bmatrix}^{-1} \qquad (13.4.9.10)$$
and $\sigma^2$ is known. If $\sigma^2$ is unknown, then it can be estimated by the empirical
Bayes EM algorithm of Dempster, Laird, and Rubin (1977).
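The link inversion of (13.4.9.1)–(13.4.9.2) can be sketched directly. The parameter values below are hypothetical fitted values, used only to show that applying the inverse logit to the linear predictor recovers the probability.

```python
# Sketch: inverting the logit link of eqs. (13.4.9.1)-(13.4.9.2).
# theta, beta and phi_i are hypothetical fitted values.

import math

def logit(p):
    """log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """pi = 1 / (1 + exp(-eta)), the inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-eta))

theta, beta, phi_i = -1.0, 0.5, 0.2    # assumed parameter values
x_ij = 1.6

eta = theta + x_ij * beta + phi_i      # linear predictor, eq. (13.4.9.1)
pi_ij = inv_logit(eta)                 # probability, eq. (13.4.9.2)
print(pi_ij)
```

Because the logit and its inverse are mutually inverse, `logit(inv_logit(eta))` returns the linear predictor, which is why the probability, not the logit, must be reconstructed after fitting.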

Exercise 13.1. What is the raking ratio estimation technique? Explain with the help
of a 3 x 3 contingency table.

Exercise 13.2. Write a short note on the estimation of continuous populations in
survey sampling. Give three examples where continuous populations exist in actual
surveys.

Exercise 13.3. Show that the RHC strategy remains design unbiased for estimating
the population total of a continuous population.
Hint: Padmawar (1996).

Exercise 13.4. Discuss different models for the estimation of measurement errors
using the ANOVA method. Discuss the estimators of bias and measurement errors
under different sampling schemes.
Hint: Bhatia, Mangat, and Morrison (1998).

Exercise 13.5. Discuss the Vital Rate Method and Housing Unit Method techniques
for small area estimation. Discuss two methods of choosing the optimal weights $\gamma_i$
in the case of the composite estimator $\hat{Y}_i^C$ of small area estimation. Show that if
$\gamma_i \to 0$, the composite estimator reduces to the synthetic estimator, and if $\gamma_i \to 1$,
then the composite estimator reduces to the direct estimator.
Hint: Ghosh and Rao (1994).

Exercise 13.6. Estimate the variance of the estimator of the correlation coefficient
using ( a ) jackknifing, ( b ) bootstrapping, and ( c ) the balanced half sample
method. Discuss the effect of non-response on the estimator of the correlation
coefficient between two variables.
Exercise 13.7. Find the bias and variance of the ratio estimator
$$\bar{y}_R = \bar{y}\left(\frac{\bar{X}}{\bar{x}}\right)$$
and of a weighted estimator
$$\bar{y}_\theta = \theta\bar{y} + (1 - \theta)\bar{y}_R$$
when both variables $y$ and $x$ are subject to error.
Hint: Manisha and Singh (2001).

Practical 13.1. Consider $R = C = 2$. We have drawn a simple random sample of 80
units with 74 respondents and obtained the data given in the following table:

n;. = 21
Explain the raking ratio method of estimation with a minimum of four iterations.

Practical 13.2. The average area in hectares, the number of countries in each continent,
and the number of countries to be selected from each of the 10 continents are as
given in the following table.

Continent   $N_i$   Average area   $n_i$
 1    6    3194.50     3
 2    6   14660.00     3
 3    8   18309.37     4
 4   10   14923.50     5
 5   12    5987.83     5
 6    4    3450.00     3
 7   30   11682.73    12
 8   17  145162.30     8
 9   10   33976.10     5
10    3    1333.33     2

Select a sample of 50 countries from the different continents listed in population 5
of the Appendix. Estimate the average production in each continent using the
synthetic ratio (SR) estimator and the composite estimator (CE).

Practical 13.3. In a college of a university, the total recruitment of new students
during 2000 was 500 and the number of students who left the college was 150. The
recruitment and departure rates according to the 1999 records were 2.6% and 1.4%,
respectively, for both the college and the university. The recruitment and departure
rates of the students during 2000 for the university were 2.7% and 1.5%, respectively.
Estimate the total number of students in the college during 2000 by using the VRM
method.

Practical 13.4. In a university, let the total number of students be 20,000 during 1999.
During 2000 there was a recruitment of 2,500 students according to the schedule, and
later 150 students left the university and stopped their studies. There was a net
migration of -10 students (say, 30 students migrated to other universities and 20
students came from other universities during the academic year 2000). Apply the
Census Component Method (CCM) to estimate the total number of students in the
university during 2000.

Practical 13.5. The following figure shows a plant growing area near the main road
of a city. The upward arrow shows a good plant, whereas the two-headed arrow
shows a defective plant. The crossing block arrow shows the possibility
of footpaths across the plant growing area.

( a ) Select a sample of 20 plants across the whole plant growing area.
Now divide the plant growing area into four small areas using the four directions of
the block arrow. Estimate the proportion of defective plants in each small area.
( b ) Consider that the information about the distance of each plant row from the
road is known. Suggest whether this can be used to improve the estimator of proportion
developed above. Note that the growth is less near the footpaths.
( c ) Can you think of any other information which could improve the estimates?
APPENDIX

Table 1. Pseudo-Random Numbers (PRN).

992954 622048 053688 165744 123095 171617 163786 973077 023072


588183 771280 343591 663332 873618 367912 762966 513710 787776
601448 917515 745302 624833 911823 506723 175641 935065 159406
549171 675048 304073 568609 016631 504253 004851 644863 147565
925567 534272 555153 624063 527207 981904 537885 222215 138108
014737 513080 467607 629610 769730 876103 241252 779302 185096
697856 339922 075453 855050 869127 664305 730911 734215 915553
872233 075720 139834 841453 427838 991093 436525 957477 714145
626895 627161 190747 313897 674490 313503 678962 323374 640289
236263 696957 899853 676831 158751 050072 314718 963637 342154
884270 767387 845366 506386 135152 567334 465516 862593 594974
942383 485006 074816 835762 500252 186142 938443 604162 732781
644818 498059 442486 661170 736683 504168 426933 163153 853715
465225 881559 869629 297738 442775 556942 471196 118548 401922
049819 370781 854896 088106 729287 527940 403024 444583 113867
320636 728960 250949 803613 139839 092083 637599 118688 765193
940767 638199 700635 335070 286080 644274 137531 794310 607639
477870 219452 582000 528396 161651 317053 486002 971898 595671
574258 333150 688670 751336 873796 713099 239312 691403 450906
566179 259758 478480 706027 892073 059970 128034 418305 385466
772153 769327 827373 198803 080205 929413 870198 464626 537843
576891 092665 670905 288755 011477 648964 429105 972735 527691
817574 433974 903417 118562 905366 151992 978704 763841 904615
973142 806267 731170 680235 502065 537687 092620 884410 876879
608932 942025 790592 721489 210155 914388 671936 282358 711834
333138 622638 491253 070561 076166 121776 234911 601944 794614
053651 688769 490430 927370 127476 703745 163014 878087 080662
721677 155734 669360 312921 489308 846283 776593 018163 692605
226505 427675 933935 279406 816254 486369 208397 265881 615237
882750 936294 647940 978020 793783 418755 537679 672050 300690
381070 295889 943669 589360 119593 024805 787483 159494 712930
651853 012286 247554 378621 009225 897628 236542 620496 219186
702670 495896 644947 487506 061240 437818 563445 681374 869224
786451 528864 640408 529911 262859 408046 834394 862338 051535
884886 630173 347278 228907 876472 040491 665615 329032 161707
Continued.

290260 323502 707834 244659 874655 548426 995549 040993 816920


409083 099078 571932 984411 330590 412815 196353 978115 984771
465730 211163 476164 995712 226350 188281 106244 185970 055161
539896 027534 900914 944872 268155 442362 646597 613553 104275
039365 768948 676363 467448 894496 793214 362738 995656 554078
616705 724810 642912 412883 803841 564324 554305 991424 397772
975581 896491 426965 326618 159848 602917 211953 876408 368676
604472 336311 014143 482110 571612 234134 309728 794506 347737
367344 390037 705539 831722 094714 106243 601509 274297 420157
960001 354823 001221 767362 860098 626945 561950 113412 591427
737209 517446 531783 064831 678795 071352 704580 128848 117604
275811 727153 680330 628827 518951 229341 877061 556992 583115
565430 260577 313905 698626 010685 541472 776828 370699 429592
196602 529352 801239 821149 582828 531195 149357 344186 155977
291222 833343 431066 054285 592207 794992 259361 736578 509032
584975 998455 937964 376363 227579 432919 091869 680558 314357
685465 039510 363667 535141 926647 075276 051822 154758 404150
Source: Generated from Uniform distribution.
Appendix 1107

Table 2. Critical values based on t distribution.


Level of significance (two-tailed):   0.50   0.30   0.25   0.20   0.10   0.05   0.02   0.01
Level of significance (one-tailed):   0.25   0.15   0.125  0.10   0.05   0.025  0.01   0.005
df
01 1.000 1.963 2.414 3.078 6.314 12.71 31.82 63.65
02 0.816 1.386 1.604 1.886 2.920 4.303 6.965 9.925
03 0.765 1.250 1.423 1.638 2.353 3.182 4.541 5.841
04 0.741 1.190 1.344 1.533 2.132 2.776 3.747 4.604
05 0.727 1.156 1.301 1.476 2.015 2.571 3.365 4.032
06 0.718 1.134 1.273 1.440 1.943 2.447 3.143 3.707
07 0.711 1.119 1.254 1.415 1.895 2.365 2.998 3.499
08 0.706 1.108 1.240 1.397 1.860 2.306 2.896 3.355
09 0.703 1.100 1.230 1.383 1.833 2.262 2.821 3.250
10 0.700 1.093 1.221 1.372 1.812 2.228 2.764 3.169
11 0.697 1.088 1.214 1.363 1.796 2.201 2.718 3.106
12 0.695 1.083 1.209 1.356 1.782 2.179 2.681 3.055
13 0.694 1.079 1.204 1.350 1.771 2.160 2.650 3.012
14 0.692 1.076 1.200 1.345 1.761 2.145 2.624 2.977
15 0.691 1.074 1.197 1.341 1.753 2.131 2.602 2.947
16 0.690 1.071 1.194 1.337 1.746 2.120 2.583 2.921
17 0.689 1.069 1.191 1.333 1.740 2.110 2.567 2.898
18 0.688 1.067 1.189 1.330 1.734 2.101 2.552 2.878
19 0.688 1.066 1.187 1.328 1.729 2.093 2.539 2.861
20 0.687 1.064 1.185 1.325 1.725 2.086 2.528 2.845
21 0.686 1.063 1.183 1.323 1.721 2.080 2.518 2.831
22 0.686 1.061 1.182 1.321 1.717 2.074 2.508 2.819
23 0.685 1.060 1.180 1.319 1.714 2.069 2.500 2.807
24 0.685 1.059 1.179 1.318 1.711 2.064 2.492 2.797
25 0.684 1.058 1.178 1.316 1.708 2.060 2.485 2.787
26 0.684 1.058 1.177 1.315 1.706 2.056 2.479 2.779
27 0.684 1.057 1.176 1.314 1.703 2.052 2.473 2.771
28 0.683 1.056 1.175 1.313 1.701 2.048 2.467 2.763
29 0.683 1.055 1.174 1.311 1.699 2.045 2.462 2.756
30 0.683 1.055 1.173 1.310 1.697 2.042 2.457 2.750
31 0.682 1.054 1.172 1.309 1.696 2.040 2.453 2.744
32 0.682 1.054 1.172 1.309 1.694 2.037 2.449 2.738
Continued.

33 0.682 1.053 1.171 1.308 1.692 2.035 2.445 2.733


34 0.682 1.052 1.170 1.307 1.691 2.032 2.441 2.728
35 0.682 1.052 1.170 1.306 1.690 2.030 2.438 2.724
36 0.681 1.052 1.169 1.306 1.688 2.028 2.434 2.719
37 0.681 1.051 1.169 1.305 1.687 2.026 2.431 2.715
38 0.681 1.051 1.168 1.304 1.686 2.024 2.429 2.712
39 0.681 1.050 1.168 1.304 1.685 2.023 2.426 2.708
40 0.681 1.050 1.167 1.303 1.684 2.021 2.423 2.704
Source: Generated with EXCEL using the TINV(α, df) function.

Table 3. Area under the standard normal curve for (0 ≤ z < ∞).

 z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1   0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2   0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3   0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4   0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5   0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6   0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7   0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8   0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9   0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0   0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1   0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2   0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3   0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4   0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5   0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6   0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7   0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8   0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9   0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0   0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1   0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2   0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3   0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4   0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5   0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6   0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7   0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8   0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9   0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0   0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1   0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2   0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3   0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4   0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5   0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
3.6   0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
∞     0.5000
Source: Generated in Excel using the NORMSDIST(x) function.
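The entries above can also be reproduced without Excel; a minimal Python sketch using the standard library's error function (each printed cell is Φ(z) − 0.5, rounded to four decimal places):

```python
import math

def area_0_to_z(z: float) -> float:
    """P(0 <= Z < z) for a standard normal Z, i.e. Phi(z) - 0.5,
    where Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

# Spot-check a few table cells (row label plus column offset gives z):
assert round(area_0_to_z(1.96), 4) == 0.4750  # row 1.9, column 0.06
assert round(area_0_to_z(0.50), 4) == 0.1915  # row 0.5, column 0.00
assert round(area_0_to_z(3.00), 4) == 0.4987  # row 3.0, column 0.00
```

For example, the table gives P(0 ≤ Z < 1.96) = 0.4750, so a two-sided interval of ±1.96 standard deviations covers about 95% of the standard normal distribution.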
Appendix 1111

Population 1. All operating banks: Amount (in $000) of agricultural
loans outstanding in different states in 1997.

1 AL 348.334 408.978
2 AK 3.433 2.605
3 AZ 431.439 54.633
4 AR 848.317 907.700
5 CA 3928.732 1343.461
6 CO 906.281 315.809
7 CT 4.373 7.130
8 DE 43.229 42.808
9 FL 464.516 825.748
10 GA 540.696 939.460
11 HI 38.067 40.775
12 ID 1006.036 53.753
13 IL 2610.572 2131.048
14 IN 1022.782 1213.024
15 IA 3909.738 2327.025
16 KS 2580.304 1049.834
17 KY 557.656 1045.106
18 LA 405.799 282.565
19 ME 51.539 8.849
20 MD 57.684 139.628
21 MA 56.471 7.590
22 MI 440.518 323.028
23 MN 2466.892 1354.768
24 MS 549.551 627.013
25 MO 1519.994 1579.686
26 MT 722.034 292.965
27 NE 3585.406 1337.852
28 NV 16.710 5.860
29 NH 0.471 6.044
30 NJ 27.508 39.860
31 NM 274.035 140.582
32 NY 426.274 201.631
33 NC 494.730 639.571
34 ND 1241.369 449.099
35 OH 635.774 870.720
36 OK 1716.087 612.108
37 OR 571.487 114.899
38 PA 298.351 756.169
Continued...
1112 Advanced sampling theory with applications

39 RI 0.233 1.611
40 SC 80.750 87.951
41 SD 1692.817 413.777
42 TN 388.869 553.266
43 TX 3520.361 1248.761
44 UT 197.244 56.908
45 VT 19.363 57.747
46 VA 188.477 321.583
47 WA 1228.607 1100.745
48 WV 29.291 99.277
49 WI 1372.439 1229.752
50 WY 386.479 100.964
Source: Agricultural Statistics (1999), Washington, US.

Population 2. Hypothetical situation of a small village having
only 30 old persons (age more than 50 years): Approximate duration
of sleep (in minutes) and age (in years) of the persons.

1 60 492
2 72 384
3 55 408
4 56 465
5 82 312
6 78 315
7 67 420
8 74 381
9 84 276
10 56 465
11 68 420
12 70 360
13 59 435
14 64 405
15 53 510
16 66 420
17 78 345
18 63 405
19 77 330
20 73 285
21 55 438
22 71 360
23 63 390
24 87 270
25 61 375
26 58 375
27 60 390
28 70 360
29 66 390
30 72 345

Population 3. Apples, commercial crop: Season average price (in $)
per pound, by States, 1994--1996.

1 AZ 0.078 0.071 0.122


2 AR 0.164 0.143 0.180
3 CA 0.133 0.183 0.160
4 CO 0.157 0.145 0.223
5 CT 0.283 0.276 0.292
6 DE 0.168 0.125 0.173
7 GA 0.139 0.164 0.175
8 ID 0.101 0.174 0.157
9 IL 0.209 0.210 0.324
10 IN 0.200 0.197 0.246
11 IA 0.244 0.303 0.312
12 KS 0.206 0.305 0.279
13 KY 0.216 0.255 0.316
14 ME 0.174 0.179 0.185
15 MD 0.173 0.131 0.158
16 MA 0.226 0.208 0.221
17 MI 0.086 0.099 0.125
18 MN 0.332 0.403 0.452
19 MO 0.198 0.160 0.228
20 NH 0.217 0.203 0.198
21 NJ 0.157 0.159 0.145
22 NM 0.219 0.298 0.306
23 NY 0.118 0.121 0.130
24 NC 0.088 0.084 0.117
25 OH 0.181 0.200 0.264
26 OR 0.107 0.116 0.126
27 PA 0.104 0.095 0.133
28 RI 0.310 0.301 0.314
29 SC 0.130 0.126 0.136
30 TN 0.195 0.215 0.235
31 UT 0.121 0.188 0.142
32 VT 0.165 0.181 0.194
33 VA 0.090 0.099 0.101
34 WA 0.138 0.215 0.204
35 WV 0.095 0.110 0.111
36 WI 0.230 0.241 0.133
Source: Agricultural Statistics (1998), Washington, US.

Population 4. Fish caught: Estimated number of fish caught by marine
recreational fishermen by species group and year, Atlantic and Gulf
coasts, 1992--95.

1 Sharks, other 1467 1385 2001 2016
2 Sharks, dogfish 1039 1031 993 833
3 Skates/Rays 2152 1981 2939 2353
4 Eels 138 222 186 152
5 Herrings 28933 34060 38007 30027
6 Freshwater catfish 1100 1091 1377 666
7 Saltwater catfish 13466 12690 14441 13859
8 Toadfish 1784 2676 1781 1632
9 Atlantic cod 850 2693 1861 1942
10 Pollock 168 397 862 832
11 Red hake 559 216 369 184
12 Codfish/hakes, other 73 124 130 266
13 Searobins 4768 7726 4707 4793
14 Sculpins 54 698 136 71
15 White perch 3669 5281 4648 3489
16 Striped bass 3840 4799 8521 10758
17 Temperate bass, other 5 35 32 23
18 Black sea bass 11759 12758 11892 17723
19 Groupers 4661 4236 4583 4923
20 Sea bass, other 2797 2690 2138 2068
21 Bluefish 11990 10301 12405 10940
22 Crevalle jack 3542 2569 2978 3951
23 Blue runner 2371 3800 5692 2319
24 Greater amber jack 692 1141 332 164
25 Florida pompano 498 641 425 644
26 Jacks, other 4463 3802 1878 1625
27 Dolphins 1484 1926 2449 2613
28 Gray snapper 5363 5154 4845 4552
29 Red snapper 2024 2546 2011 1608
30 Lane snapper 919 1079 1088 859
31 Vermilion snapper 950 1228 826 1200
32 Yellowtail snapper 1649 2061 1247 1334
33 Snappers, others 746 861 462 492
34 Pigfish 2955 2691 4918 4199
35 White grunt 5593 5356 5784 5678
36 Grunts, other 3039 3521 3186 3379
Continued...

37 Scup 10078 7077 5662 3688


38 Pinfish 13055 13043 16063 16855
39 Sheepshead 5933 5593 4383 5118
40 Red porgy 207 166 166 230
41 Porgies, other 545 445 434 484
42 Spotted seatrout 22304 21538 22181 24615
43 Weakfish 1668 2219 4929 5739
44 Sandseatrout 3780 4068 5665 4355
45 Silver perch 1198 1034 1729 2146
46 Spot 14974 14263 18491 11567
47 Kingfish 3778 3304 4805 4333
48 Atlantic croaker 16953 21016 26671 17753
49 Black drum 1405 1534 1125 1595
50 Red drum 8682 7649 7609 9236
51 Drums, other 1365 1165 1622 1354
52 Mullet 5571 4186 4386 4657
53 Barracudas 800 788 906 908
54 Tautog 4195 4215 2653 3816
55 Cunner 1931 1876 1255 1375
56 Wrasses, other 275 240 227 185
57 Little tunny/Atl. bonito 996 925 982 782
58 Atlantic mackerel 1045 2307 4860 4008
59 King mackerel 1289 1023 1148 1252
60 Spanish mackerel 5575 3653 3850 2568
61 Tunas/mackerels, other 1190 794 1018 1029
62 Summer flounder 11918 22919 17741 16238
63 Gulf flounder 216 189 776 163
64 Southern flounder 1148 1083 1369 1446
65 Winter flounder 1544 3582 2300 2324
66 Flounders, other 1252 2149 2173 1284
67 Triggerfish / file fish 1103 999 918 897
68 Puffers 2100 1323 1141 935
69 Other fish 12249 14953 20488 14426
Source: Agricultural Statistics (1999), Washington, D.C.

Population 5. Tobacco: Area (hectares), yield and production
(metric tons) in specified countries during 1998.

1 Central    1 Costa Rica 1072 2.03 2180
  America    2 El Salvador 580 1.79 1038
3 Guatemala 9024 2.21 19962
4 Honduras 5157 1.78 9177
5 Nicaragua 2240 2.03 4550
6 Panama 1094 2.00 2188
2 Caribbean 1 Cuba 59000 0.63 37000
2 Dominican Rep 27050 1.51 40950
3 Haiti 565 1.29 730
4 Jamaica 1175 1.99 2339
5 Trinidad and Tobago 100 1.70 170
6 St. Vincent 70 1.21 85
3 European 1 Austria 105 1.90 199
Union 2 Belgium-Lux 320 3.69 1180
3 France 9260 2.80 25930
4 Germany 3831 2.40 9200
5 Greece 67300 1.96 132000
6 Italy 47600 2.77 132000
7 Portugal 2909 2.14 6226
8 Spain 15150 2.79 42300
4 West 1 Switzerland 635 2.09 1325
and 2 Albania 24000 0.63 15000
Eastern 3 Bulgaria 49000 1.61 79000
Europe 4 Czech Republic 7500 1.64 12306
5 Croatia 7000 1.67 11700
6 Hungary 2000 1.75 3500
7 Macedonia 22000 1.36 30000
8 Poland 19100 2.34 44700
9 Romania 12000 1.23 14750
10 Serbia/Montenegro 6000 1.17 7000
5 FSU 12 1 Azerbaijan 4500 2.33 10500
2 Armenia 4304 0.26 1100
3 Belarus 1076 2.42 2606
4 Georgia 5400 1.63 8800
5 Kyrgyzstan 12000 2.50 30000
6 Kazakhstan 5500 1.82 10000
7 Moldova 18600 2.05 38150
8 Russia 650 0.92 600
Continued...

9 Tajikistan 3228 2.48 8000


10 Turkmenistan 1100 2.36 2600
11 Ukraine 5000 0.84 4200
12 Uzbekistan 10496 2.37 24900
6 North 1 Algeria 2700 1.96 5300
Africa 2 Libya 900 1.61 1450,
3 Morocco 3500 1.13 3962
4 Tunisia 6700 1.18 7900
7 Other 1 Angola 3950 0.99 3900
Africa 2 Burundi 705 1.00 705
3 Chad 200 1.00 200
4 Congo 4000 0.45 1800
5 Zaire 3700 1.11 4110
6 Cameroon 3400 1.62 5500
7 Central African Republic 750 0.87 650
8 Benin 200 2.00 400
9 Ethiopia 3000 1.17 3500
10 Ghana 3950 0.38 1500
11 Cote d'Ivoire 10000 0.26 2600
12 Kenya 8805 2.51 22120
13 Liberia 10 1.00 10
14 Madagascar 5900 0.93 5500
15 Malawi 116700 1.22 142300
16 Mali 1000 0.55 550
17 Mauritius 655 1.63 1065
18 Mozambique 2700 1.07 2900
19 Niger 1000 0.93 930
20 Nigeria 10000 2.10 21000
21 Reunion 200 1.00 200
22 Zimbabwe 103110 2.06 212050
23 South Africa 15500 2.00 31000
24 Sierra Leone 540 1.11 600
25 Somalia 0 0.00 0
26 Togo 4000 0.50 2000
27 Tanzania 33900 0.74 25080
28 Uganda 7525 0.96 7198
29 Swaziland 200 1.00 200
30 Zambia 4882 1.29 6300
8 Other Asia 1 Bangladesh 50263 0.88 44000
2 Burma 36000 1.22 44000
3 Cambodia 9000 0.56 5000
4 Sri Lanka 12165 0.74 9000
5 China 144500 1.75 2524500
Continued...

6 Indonesia 206625 0.85 175631


7 India 445000 1.43 635000
8 Japan 26214 2.56 67100
9 Korea, North 20000 1.25 25000
10 Korea, South 25730 2.02 52040
11 Laos 4000 0.75 3000
12 Malaysia 11000 1.05 11505
13 Pakistan 47300 1.91 90450
14 Philippines 37568 1.80 67500
15 Thailand 51500 1.33 68600
16 Taiwan 4394 2.50 10990
17 Vietnam 36000 0.89 32000
9 Middle 1 Cyprus 161 1.50 241
East 2 Iran 18000 1.39 25000
3 Iraq 2000 1.09 2180
4 Jordan 2100 1.29 2700
5 Lebanon 3750 1.33 5000
6 Oman 1800 1.11 2000
7 Syria 15000 1.15 17200
8 United Arab Emirates 350 5.71 2000
9 Turkey 293300 0.91 266500
10 Yemen 3300 1.73 5720
10 Oceania 1 Australia 3300 2.73 9000
2 Solomon Islands 100 0.95 95
3 New Zealand 600 2.58 1550

Population 6. Age-specific death rates from 1990 to 2065
(number per 100,000 births).

0--4 967 818 692 496 355 255 183 131 80


5--9 19 16 14 10 7 5 4 3 2
10--14 20 17 15 11 8 6 4 3 2
15--19 67 62 57 48 40 34 28 24 18
20--24 86 78 71 58 48 40 33 27 20
25--29 84 75 68 54 44 35 28 23 16
30--34 97 87 78 62 50 40 32 25 18
35--39 138 124 111 90 72 58 47 38 27
40--44 221 201 182 150 124 102 84 69 52
45--49 370 341 315 267 227 193 164 139 109
50--54 613 572 533 464 403 351 305 265 215
55--59 965 907 853 754 666 589 520 460 382
60--64 1511 1432 1357 1218 1094 982 882 792 674
65--69 2233 2119 2010 1810 1629 1466 1320 1188 1015
70--74 3361 3187 3022 2718 2444 2198 1976 1777 1515
75--79 4979 4693 4423 3930 3491 3102 2756 2448 2050
80--84 7748 7323 6921 6182 5523 4933 4407 3936 3323
85--89 12267 11687 10609 10108 9177 8331 7564 6868 5942
90--94 19099 18341 16915 16246 14987 13827 12758 11774 10439
95--99 29744 28864 27188 26390 24869 23442 22102 20844 19095
100--104 46334 45554 44053 43329 41933 40600 39325 38104 36364
105--109 72195 72100 71956 71906 71845 71831 71861 71930 72097
Source: Statistical Abstracts of the United States (1993).

Population 7. State population projections, 1995 and 2000
(Numbers in thousands).

Region      Division      State            Year 1995   Year 2000

Northeast New Maine 1295 1344


England New Hampshire 1276 1410
Vermont 597 619
Massachusetts 6032 6159
Rhode Island 1021 1048
Connecticut 3354 3422
Mid New York 17909 17966
Atlantic New Jersey 8100 8382
Pennsylvania 12080 12069
Midwest East North Ohio 10958 10930
Central Indiana 5688 5696
Illinois 11759 11722
Michigan 9364 9365
Wisconsin 4908 4844
West North Minnesota 4501 4566
Central Iowa 2703 2549
Missouri 5353 5473
North Dakota 631 596
South Dakota 719 715
Nebraska 1582 1539
Kansas 2546 2534
South South Delaware 739 802
Atlantic Maryland 5180 5608
District of Columbia 592 595
Virginia 6758 7275
West Virginia 1749 1651
North Carolina 7197 7717
South Carolina 3772 3962
Georgia 7288 8005
Florida 14583 16315
East South Kentucky 3740 3689
Central Tennessee 5239 5424
Alabama 4282 4358
Mississippi 2717 2772
West South Arkansas 2473 2509
Central Louisiana 4247 4141
Oklahoma 3072 2924
Texas 17572 17828
West Mountain Montana 774 744
Idaho 1018 1008
Wyoming 439 409
Continued...

Colorado 3407 3424


New Mexico 1639 1735
Arizona 4149 4633
Utah 1807 1845
Nevada 1283 1409
Pacific Washington 5052 5191
Oregon 2887 2903
California 31749 33963
Alaska 560 599
Hawaii 1253 1362
Source: Statistical Abstracts of the United States (1993).

Population 8. Projected vital statistics by country or area during 2000.

1 United States 14.2  8.8   6.2 76.3 2.07
2 Afghanistan   41.6 16.6 137.5 47.8 5.87
3 Algeria       26.5  5.4  42.2 69.6 3.16
4 Angola        42.6 15.9 125.9 48.9 6.05
5 Argentina 19.9 7.6 17.8 75.0 2.64
6 Australia 13.0 6.9 5.0 80.4 1.80
7 Austria 10.3 10.3 5.8 77.3 1.52
8 Azerbaijan 21.1 8.6 72.3 65.4 2.61
9 Bangladesh 27.4 10.1 93.0 57.5 3.08
10 Belarus 14.0 12.9 12.0 69.4 1.92
11 Belgium 11.2 10.4 6.1 77.7 1.69
12 Bolivia 30.0 9.3 60.2 62.0 3.81
13 Brazil 19.2 10.1 47.7 60.9 2.14
14 Bulgaria 12.5 13.6 14.8 71.6 1.74
15 Burkina Faso 44.9 21.4 112.8 39.8 6.48
16 Burma 28.1 10.7 71.8 58.1 3.57
17 Burundi 41.0 15.3 95.2 48.1 6.25
18 Cambodia 41.0 14.3 100.5 51.5 5.81
19 Cameroon 41.4 13.9 74.1 51.3 5.73
20 Canada 12.3 7.2 5.5 80.0 1.80
21 Chad 42.8 16.3 113.6 48.9 5.64
22 Chile 15.9 5.7 11.9 75.5 2.00
23 China 15.0 6.8 32.6 71.1 1.81
24 Colombia 19.0 4.6 21.3 74.2 2.17
25 Congo (Kinshasa) 46.5 15.6 98.9 48.1 6.39
26 Cote d'Ivoire 41.3 17.6 94.9 43.7 5.80
27 Cuba 12.7 7.5 8.6 75.6 1.60
28 Czech Republic 13.3 11.0 8.0 74.2 1.74
29 Dominican Republic 21.1 5.4 40.8 70.4 2.42
30 Ecuador 23.0 5.1 29.3 72.5 2.59
31 Egypt 26.2 8.1 65.7 62.7 3.24
32 Ethiopia 44.3 17.6 117.7 46.0 6.75
33 France 12.1 9.0 5.6 79 .1 1.73
34 Germany 10.5 10.8 5.7 76.7 1.56
35 Ghana 30.8 10.2 74.8 57.5 3.95
36 Greece 10.7 9.5 6.6 79.0 1.52
37 Guatemala 31.2 6.5 44.6 66.9 4.00
38 Guinea 40.0 16.9 123.7 47.0 5.46
39 Haiti 32.3 14.7 98.4 50.2 4.50
Continued...

40 Hong Kong 10.2 5.6 4.5 82.8 1.42


41 Hungary 13.1 14.8 11.7 69.6 1.78
42 India 23.5 8.9 63.5 61.5 2.90
43 Indonesia 22.4 8.1 55.4 63.4 2.53
44 Iran 29.2 5.8 45.1 69.1 3.90
45 Iraq 40.8 5.6 50.1 68.7 5.81
46 Italy 11.4 10.1 6.4 78.6 1.55
47 Japan 10.7 8.4 4.3 80.0 1.50
48 Kazakhstan 18.8 9.8 59.6 64.7 2.29
49 Kenya 29.5 12.6 54.9 51.1 3.69
50 Korea, North 19.9 5.4 22.2 71.5 2.21
51 Korea, South 15.8 5.7 7.4 74.7 1.80
52 Madagascar 41.2 13.3 87.6 53.6 5.64
53 Malawi 38.7 27.1 135.8 33.0 5.33
54 Malaysia 24.1 5.2 20.9 71.0 3.10
55 Mali 49.3 17.5 95.6 48.8 6.91
56 Mexico 24.3 4.4 20.7 75.0 2.79
57 Morocco 25.1 5.1 32.9 71.8 3.13
58 Mozambique 42.0 16.8 114.9 46.4 5.76
59 Nepal 35.6 11.2 70.1 55.9 4.68
60 Netherlands 11.5 8.7 4.7 78.2 1.53
61 Niger 51.6 22.2 111.2 42.4 7.17
62 Nigeria 41.5 11.7 63.7 55.6 5.95
63 Pakistan 32.6 10.2 90.3 59.7 4.56
64 Peru 21.8 5.8 44.4 70.8 2.66
65 Philippines 27.3 6.4 33.2 66.8 3.38
66 Poland 14.2 10.2 11.8 72.6 1.93
67 Portugal 11.9 10.3 7.1 76.1 1.53
68 Romania 14.3 12.7 21.7 70.0 1.83
69 Russia 14.5 14.8 21.6 65.4 1.95
70 Rwanda 38.5 23.5 118.8 36.5 5.73
71 Saudi Arabia 37.2 4.7 36.3 71.1 6.30
72 Senegal 43.4 10.4 58.4 58.3 6.04
73 Serbia 14.8 10.5 21.3 72.5 2.10
74 Somalia 41.7 12.0 111.6 57.0 6.53
75 South Africa 25.3 14.4 55.4 51.9 3.02
76 Spain 12.0 9.1 5.7 79.1 1.51
77 Sri Lanka 16.9 5.9 18.7 73.2 1.95
78 Sudan 38.8 10.3 69.2 56.8 5.47
79 Sweden 10.5 11.1 4.4 78.5 1.60
80 Switzerland 11.2 9.5 5.2 78.1 1.59
81 Syria 36.1 5.3 35.2 68.4 5.19
82 Taiwan 14.4 5.7 6.3 77.3 1.75
83 Tajikistan 34.5 8.3 104.1 65.1 4.47
Continued...

84 Tanzania 39.8 20.9 101.4 40.0 5.31
85 Thailand 16.2 7.2 28.3 69.4 1.80
86 Tunisia 22.5 5.1 30.1 73.6 2.68
87 Turkey 20.4 5.2 33.3 73.8 2.35
88 Uganda 43.1 21.8 95.6 38.1 6.24
89 Ukraine 13.2 14.9 21.2 67.8 1.87
90 United Kingdom 12.1 10.9 6.0 77.1 1.79
91 Uzbekistan 28.9 7.7 77.5 65.2 3.56
92 Venezuela 21.5 4.9 25.5 73.3 2.53
93 Vietnam 20.0 6.5 33.7 68.5 2.31
94 Yemen 43.3 7.9 57.9 62.6 6.86
95 Zambia 43.8 25.8 97.6 33.7 6.28
96 Zimbabwe 29.3 21.6 72.2 38.2 3.50
Source: Statistical Abstracts of the United States (1997).

Correlation matrix of the five variables (lower triangle):
 1
 0.525913  1
 0.882218  0.660721  1
-0.842360 -0.828100 -0.910310  1
 0.985520  0.549261  0.852536 -0.815260  1
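A matrix like this can be computed directly from the table's columns. Below is a minimal pure-Python sketch of the Pearson correlation coefficient; the column names and the use of only the first four countries are illustrative assumptions, not a reproduction of the printed matrix:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# First four rows of Population 8, for illustration only (the printed
# matrix presumably uses every country in the table):
birth_rate = [14.2, 41.6, 26.5, 42.6]  # United States, Afghanistan, Algeria, Angola
life_exp   = [76.3, 47.8, 69.6, 48.9]

r = pearson(birth_rate, life_exp)
# Higher birth rates pair with lower life expectancy, so r is negative.
assert -1.0 <= r <= 1.0 and r < 0
```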

Population 9. Number of immigrants admitted in the USA.


Sr. No.   State     1994      1995      1996
1 AL 1837 1900 1782
2 AK 1129 1049 1280
3 AZ 9141 7700 8900
4 AR 1031 934 1494
5 CA 208498 166482 201529
6 CO 6825 7713 8895
7 CT 9537 9240 10874
8 DE 984 1051 1377
9 DC 3204 3047 3784
10 FL 58093 62023 79461
11 GA 10032 12381 12608
12 HI 7746 7537 8436
13 ID 1559 1612 1825
14 IL 42400 33898 42517
15 IN 3725 3590 4692
16 IA 2163 2260 3037
17 KS 2902 2434 4303
18 KY 2036 1857 2019
19 LA 3366 3000 4092
20 ME 829 814 1028
21 MD 15937 15055 20732
22 MA 22882 20523 23085
23 MI 12728 14135 17253
24 MN 7098 8111 8977
25 MS 815 757 1073
26 MO 4362 3990 5690
27 MT 447 409 449
28 NE 1595 1831 2150
29 NV 4051 4306 5874
30 NH 1144 1186 1512
31 NJ 44083 39729 63303
32 NM 2936 2758 5780
33 NY 144354 128406 154095
34 NC 6204 5617 7011
35 ND 635 483 606
36 OH 9184 8585 10237
37 OK 2728 2792 3511
38 OR 6784 4923 7554
39 PA 15971 15065 16938
40 RI 2907 2609 3098
Continued...

41 SC 2110 2165 2151


42 SD 570 495 519
43 TN 3608 3392 4343
44 TX 56158 49963 83385
45 UT 2951 2831 4250
46 VT 658 535 654
47 VA 15342 16319 21375
48 WA 18180 15862 18833
49 WV 663 540 583
50 WI 5328 4919 3607
51 WY 217 252 280
Source: Statistical Abstracts of the United States (1999).
BIBLIOGRAPHY

[1] Adhvaryu, D. (1978). Successive sampling using multi-auxiliary information. Sankhyā, C, 40, 167--173.

[2] Adhvaryu, D. and Gupta, P.C. (1983). On some alternative sampling strategies using auxiliary information. Metrika, 30, 217--226.

[3] Agarwal, C.L. and Tikkiwal, B.D. (1980). Two stage sampling on successive occasions. Sankhyā, C, 42, 31--44.

[4] Agarwal, O.K. and Singh, P. (1982). On cluster sampling strategies using ancillary information. Sankhyā, B, 44, 184--192.

[5] Agarwal, M.C. and Jain, N. (1989). A new predictive product estimator. Biometrika, 76, 822--823.

[6] Agarwal, M.C. and Panda, K.B. (1993). An efficient estimator in post stratification. Metron, 179--188.

[7] Agarwal, M.C. and Roy, D.C. (1999). Efficient estimators for small domains. J. Indian Soc. Agric. Statist., 52(3), 327--337.

[8] Agarwal, M.C. and Sthapit, A.B. (1996). Model assisted selection of a product strategy. J. Indian Soc. Agric. Statist., 48(2), 207--215.

[9] Agarwal, S.K. (1980). Two auxiliary variates in ratio method of estimation. Biom. J., 22(7), 569--573.

[10] Agarwal, S.K. and Kumar, P. (1980). Combination of ratio and PPS estimator. J. Indian Soc. Agric. Statist., 32, 81--86.

[11] Agarwal, S.K., Sharma, U.K. and Kashyap, S. (1997). A new approach to use multivariate auxiliary information in sample surveys. J. Statist. Planning Infer., 60, 261--267.

[12] Agarwal, S.K., Singh, M. and Goel, B.B.P.S. (1979). Use of p-auxiliary variates in PPS sampling. Biom. J., 21(8), 781--785.

[13] Ahmed, M.S. (1997). The general class of chain estimators for the ratio of two means using double sampling. Commun. Statist.--Theory Meth., 26(9), 2247--2254.

[14] Ahmed, M.S. (1998). A note on regression type estimators using multiple auxiliary information. Austral. & New Zealand J. Statist., 40(3), 373--376.

[15] Ahmad, T. (1997). A resampling technique for complex survey data. J. Indian Soc. Agric. Statist., 50(3), 364--379.

[16] Ahsan, M.J. and Khan, S.U. (1982). Optimum allocation in multivariate stratified random sampling with overhead cost. Metrika, 29, 71--78.

[17] Aires, N. (2000). Comparisons between conditional Poisson sampling and Pareto πps sampling designs. J. Statist. Planning Infer., 88, 133--147.

[18] Ajgaonkar, S.G.P. (1975). The efficient use of supplementary information in double sampling procedure. Sankhyā, C, 37, 181--189.

[19] Akar, I. and Sedransk, J. (1979). Post-stratified cluster sampling. Sankhyā, C, 41, 76--83.

[20] Alalouf, I.S. (1996). The estimation of a proportion in cluster sampling. Commun. Statist.--Theory Meth., 25(2), 325--343.

[21] Allen, J., Saxena, S., Singh, H.P., Singh, S. and Smarandache, F. (2002). Randomness and optimal estimation in data sampling. American Research Press, 26--43.

[22] Amahia, G.N., Chaubey, Y.P. and Rao, T.J. (1989). Efficiency of a new estimator in PPS sampling for multiple characteristics. J. Statist. Planning Infer., 21, 75--84.

[23] Amdekar, S.J. (1985). An unbiased estimator in overlapping clusters. Calcutta Statist. Assoc. Bull., 15, 231--232.

[24] Anderson, H. (1976). Estimation of a proportion through randomized response. Int. Statist. Rev., 44, 213--217.

[25] Anderson, H. (1977). Efficiency versus protection in the general randomized response model. Scand. J. Statist., 4, 11--19.

[26] Andreatta, G. and Kaufman, G.M. (1986). Estimates of finite population when sampling is without replacement and proportional to magnitude. J. Amer. Statist. Assoc., 81, 657--666.

[27] Anscombe, F.J. (1948). The validity of comparative experiments. J.R. Statist. Soc., A, 61, 181--211.

[28] Arnab, R. (1979a). On strategies of sampling finite populations on successive occasions with varying probabilities. Sankhyā, C, 41, 141--155.

[29] Arnab, R. (1979b). An addendum to Singh and Singh's paper on random nonresponse in unequal probability sampling. Sankhyā, C, 41, 138--140.

[30] Arnab, R. (1988). Variance estimation in multi-stage sampling. Aust. J. Statist., 30, 107--110.

[31] Arnab, R. (1990). On commutativity of design and model expectations in randomized response surveys. Commun. Statist.--Theory Meth., 19(10), 3751--3757.

[32] Arnab, R. (1992a). Estimation of a finite population mean under superpopulation models. Commun. Statist.--Theory Meth., 21(6), 1717--1724.

[33] Arnab, R. (1992b). Unbiased estimation of variance in multi-stage sampling under randomized response surveys. Statistics, 23, 357--364.

[34] Arnab, R. (1994). Nonnegative variance estimation in randomized response surveys. Commun. Statist.--Theory Meth., 23(6), 1743--1752.

[35] Arnab, R. (1995). On admissibility and optimality of sampling strategies in randomized response surveys. Sankhyā, B, 57, 385--390.

[36] Arnab, R. (1996). Randomized response trials: A unified approach for qualitative data. Commun. Statist.--Theory Meth., 25(6), 1173--1183.

[37] Arnab, R. (1998). Sampling on two occasions: Estimation of population total. Survey Methodology, 24, 185--192.

[38] Arnab, R. (1999). On use of distinct respondents in randomized response surveys. Biom. J., 41(4), 507--513.

[39] Arnab, R. (2001). Estimation of a finite population total in varying probability sampling for multi-character surveys. Metrika, 54(2), 159--177.

[40] Arnab, R. (2002). Optimum sampling strategies under randomized response surveys. Biom. J., 44(4), 490--495.

[41] Arnab, R. and Singh, S. (2001). On the estimation of population total and variance in the presence of non-response. Presented at the JSM--2001 conference, Atlanta, USA.

[42] Arnab, R. and Singh, S. (2002a). Calibration for variance estimator of generalized regression predictor. Submitted for possible presentation at JSM--2003, California, USA.

[43] Arnab, R. and Singh, S. (2002b). Estimation of the size and mean value of a stigmatized characteristic of a hidden gang in a finite population: a unified approach. Ann. Inst. Math. Stat., 54(3), 659--666.

[44] Arnab, R. and Singh, S. (2002c). On the estimation of size and mean value of a stigmatized characteristic of a hidden gang in finite populations. Recent Advances in Statistical Methods--Proceedings of Statistics 2001 Concordia University Conference, 1--11.

[45] Arnab, R. and Singh, S. (2002d). Estimation of variance from missing data. Presented at the Statistical Society of Canada Conference at Hamilton, Canada.

[46] Arnab, R. and Singh, S. (2002e). Jackknifing the imputed data in addition to observed data while estimating variance of the ratio estimator. Working paper.

[47] Arnholt, A.T. and Hebert, J.L. (1995). Estimating the mean with known coefficient of variation. American Statistician, 49(4), 367--369.

[48] Artes, E. and Garcia, A. (2000a). A note on successive sampling using auxiliary information. Proceedings of the International Workshop on Statistical Modelling, 376--379.

[49] Artes, E. and Garcia, A. (2000b). Sobre muestreo en ocasiones sucesivas [On sampling on successive occasions]. Actas del IX Congreso sobre Ensenanza y Aprendizaje de las Matematicas, 153--155.

[50] Artes, E. and Garcia, A. (2001a). Metodo diferencia multivariate en muestreo en dos ocasiones [Multivariate difference method in sampling on two occasions]. VIII Conferencia Espanola de Biometria, 199--200.

[51] Artes, E. and Garcia, A. (2001b). Successive sampling for the ratio of population parameters. Journal of the Portuguese National Statistical Institute, in press.

[52] Artes, E. and Garcia, A. (2001c). Estimating the current mean in successive sampling using a product estimate. Conference on Agricultural and Environmental Statist. Application in Rome, XLIII-1--XLIII-2.

[53] Artes, E. and Garcia, A. (2001d). An almost unbiased ratio-cum-product estimator on two occasions. X International Symposium on Applied Stochastic Models and Data Analysis, 130--135.

[54] Artes, E. and Garcia, A. (2001e). Estimation of current population ratio in successive sampling. J. Indian Soc. Agric. Statist., 54(3), 342--354.

[55] Artes, E., Rueda, M. and Arcos, A. (1998). Successive sampling using a product estimate. Applied Sciences and the Environment, Computational Mechanics Publications, 85--90.

[56] Asok, C. (1974). Contribution to the theory of unequal probability sampling without replacement. Unpublished Ph.D. Thesis, Iowa State University, Ames, Iowa.

[57] Asok, C. (1980). A note on the comparison between simple mean and mean based on distinct units in sampling with replacement. American Statistician, 34, 158.

[58] Asok, C. and Sukhatme, B.V. (1975). Unequal probability sampling with random stratification. Proc. Amer. Statist. Assoc., 283--288.

[59] Asok, C. and Sukhatme, B.V. (1976a). On the efficiency comparison of two πps sampling strategies. Proc. Amer. Statist. Assoc., 161--166.

[60] Asok, C. and Sukhatme, B.V. (1976b). On Sampford's procedure of unequal probability sampling without replacement. J. Amer. Statist. Assoc., 71, 912--918.

[61] Asok, C. and Sukhatme, B.V. (1978). A note on Midzuno scheme of sampling. Paper presented at the 32nd Annual Conference of the Indian Soc. Agricul. Statist., New Delhi, India.

[62] Avdhani, M.S. (1968). Contribution to the theory of sampling from finite population and its application. Ph.D. thesis, Delhi University.

[63] Bahadur, R.R. (1954). Sufficiency and statistical decision functions. Ann. Math. Statist., 25, 423--462.

[64] Bandyopadhyay, S. (1980). Improved ratio and product estimators. Sankhyā, C, 42, 45--49.

[65] Bandyopadhyay, S., Chattopadhyaya, A.K. and Kundu, S. (1977). On estimation of population total. Sankhyā, C, 39, 28--42.

[66] Bankier, M.D. (1986). Estimators based on several stratified samples with application to multiple-frame surveys. J. Amer. Statist. Assoc., 81, 1074--1079.

[67] Bankier, M.D. (1988). Power allocations: Determining sample sizes for sub-national areas. American Statist., 42(3), 174--178.

[68] Bansal, M.L. and Singh, R. (1985). An alternative estimator for multiple characteristics in PPS sampling. J. Statist. Planning Infer., 11, 313--320.

[69] Bansal, M.L. and Singh, R. (1986). On the generalization of Rao, Hartley and Cochran's scheme. Metrika, 33, 307--314.

[70] Bansal, M.L. and Singh, R. (1989). An alternative estimator for multiple characteristics corresponding to Horvitz and Thompson estimator in probability proportional to size and without replacement sampling. Statistica, anno. XLIX, 3, 447--452.

[71] Bansal, M.L. and Singh, R. (1990). An alternative estimator for multiple characteristics in RHC sampling scheme. Commun. Statist.--Theory Meth., 19(5), 1777--1784.

[72] Bansal, M.L., Singh, S. and Singh, R. (1994). Multi-character survey using randomized response technique. Commun. Statist.--Theory Meth., 23(6), 1705--1715.

[73] Barnard, J. and Rubin, D.B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948--955.

[74] Bartholomew, D.J. (1961). A method of allowing for 'not at home' bias in sample surveys. Applied Statist., 10, 52--59.

[75] Bartlett, R.F. (1986). Estimating the total of a continuous population. J. Statist. Planning Infer., 13, 51--66.

[76] Basawa, I.V., Godambe, V.P. and Taylor, R.L. (1997). Selected proceedings of the symposium on estimating functions. Lecture Notes--Monograph Series, Institute of Mathematical Statistics, Hayward, California.

[77] Basu, D. (1958). On sampling with and without replacement. Sankhyā, 20, 287--294.

[78] Basu, D. (1971). An essay on the logical foundations of survey sampling. Part one. In: V.P. Godambe and D.A. Sprott (eds.) Foundations of statistical inferences. Holt, Rinehart and Winston, Toronto, 203--242.

[79] Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error components model for prediction of county crop areas using surveys and satellite data. J. Amer. Statist. Assoc., 83, 28--36.

[80] Bayless, D.L. and Rao, J.N.K. (1970). An empirical study of stabilities of estimators and variance estimators in unequal probability sampling (n=3 or 4). J. Amer. Statist. Assoc., 65, 1645--1667.

[81] Beale, E.M.L. (1962). Some use of computers in operational research. Industrielle Organ., 31, 27--28.

[82] Bedi, P.K. (1995). An alternative estimator in Midzuno scheme for multiple characteristics. Commun. Statist.--Simula., 17--30.

[83] Bedi, P.K. and Agarwal, S.K. (1999). Modified Midzuno scheme of sampling. J. Statist. Planning Infer., 76, 203--214.

[84] Bedi, P.K. and Rao, T.J. (1996). Probability proportional to revised sizes with replacement scheme. Metron, 67--82.

[85] Bedi, P.K. and Rao, T.J. (2001). PPS method of estimation under a transformation. J. Indian Soc. Agric. Statist., 54(2), 184--195.

[86] Bellhouse, D.R. (1984a). Optimal randomization for experiments in which autocorrelation is present. Biometrika, 71, 155--160.

[87] Bellhouse, D.R. (1984b). A review of optimal designs in survey sampling. Canadian J. Statist., 12, 53--65.

[88] Bellhouse, D.R. (1995). Estimation of correlation in randomized response. Survey Methodology, 21, 13--19.

[89] Bellhouse, D.R. and Rao, J.N.K. (1975). Systematic sampling in the presence of a trend. Biometrika, 62, 694--697.

[90] Bellhouse, D.R. and Rao, J.N.K. (1986). On the efficiency of prediction estimators in two-stage sampling. J. Statist. Planning Infer., 13, 269--281.

[91] Bennett, B.M. (1983). Alternate estimates in stratified sampling. Metron, 77--82.

[92] Bennett, B.M. and Islam, M.A. (1983). On relative precision in stratified sampling for proportions. Metron, 19--22.

[93] Bethlehem, J.G. and Keller, W.J. (1987). Linear weighting of sample survey data. J. Official Statist., 141--153.

[94] Bhargava, M. (1996). An investigation into the efficiencies of certain randomized response strategies. Unpublished Ph.D. thesis submitted to Punjab Agricultural University, Ludhiana, India.

[95] Bhargava, M. and Singh, R. (2000). A modified randomization device for Warner's model. Statistica, 60, 315--321.

[96] Bhargava, M. and Singh, R. (2001). Efficiency comparison of certain randomized response schemes with U-model. J. Indian Soc. Agric. Statist., 54(1), 19--28.

[97] Bhargava, M. and Singh, R. (2002). On the efficiency comparison of certain randomized response strategies. Metrika, 55(3), 191--197.

[98] Bhargava, N.K. (1978). On some applications of the technique of combined unordering. Sankhyā, C, 40, 74--83.

[99] Bhatia, A., Mangat, N.S. and Morrison, T. (1998). Estimation of measurement errors. Proceedings of the International Pipeline Conference 1998, Calgary, Canada. American Society of Mechanical Engineers, 1, 315--325.

[100] Bhatnagar, S. (1996). Improved product estimators. Sankhyā, B, 58, 84--89.

[101] Bhave, S.Y. (1987). Godambe's paradox and the ancillarity principle. Statist. Prob. Lett., 5, 243--246.

[102] Binder, D.A. and Theberge, A. (1988). Estimating the variance of raking ratio estimators. Canadian J. Statist., 16, 47--55.

[103] Binder, D.A. and Patak, Z. (1994). Use of estimating functions for estimation from complex surveys. J. Amer. Statist. Assoc., 89, 1035--1043.

[104] Biradar, R.S. and Singh, H.P. (1992a). A class of estimators for finite population correlation coefficient using auxiliary information. J. Indian Soc. Agric. Statist., 44, 271--285.

[105] Biradar, R.S. and Singh, H.P. (1992b). A note on an almost unbiased ratio cum product estimator. Metron, 249--255.

[106] Biradar, R.S. and Singh, H.P. (1997-98). A class of estimators for population parameter using supplementary information. Aligarh J. Statist., 17/18, 54--71.

[107] Biradar, R.S. and Singh, H.P. (1998). Predictive estimation of finite population variance. Calcutta Statist. Assoc. Bull., 48, 229--235.

[108] Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. Ann. Math. Statist., 18, 105--110.

[109] Blackwell, D. (1951). Comparison of experiments. Proc. 2nd Berkeley Symp. Math. Statist. Prob., 93--102.

[110] Blight, B.J.N. (1973). Sampling from an autocorrelated finite population. Biometrika, 60, 375--385.

[111] Bose, C. (1943). Note on the sampling error in the method of double sampling. Sankhyā, 6, 330.

[112] Bogue, D.J. (1950). A technique for making extensive postcensus estimates. J. Amer. Statist. Assoc., 45, 149--163.
Bibliography 1137

[113] Bourke, P.D. (1981). On the analysis of some multivariate randomized response designs for categorical data. J. Statist. Planning Infer., 5, 165--170.

[114] Bourke, P.D. and Dalenius, T. (1976). Some new ideas in the realm of randomized enquiries. Int. Statist. Rev., 44, 219--221.

[115] Bouza, C. (1994). The use of auxiliary information for solving non-response problems. Test, 3, 113--122.

[116] Brackstone, G.J. (1987). Small area data: policy issues and technical challenges. In Small Area Statistics (R. Platek, J.N.K. Rao, C.E. Sarndal and M.P. Singh, eds.), 3--20, Wiley, New York.

[117] Brackstone, G.J. and Rao, J.N.K. (1979). An investigation of raking ratio estimators. Sankhyā, C, 41, 97--114.

[118] Bratley, P., Fox, B.L. and Schrage, L.E. (1983). A Guide to Simulation. NY: Springer-Verlag.

[119] Breau, P. and Ernst, L.R. (1983). Alternative estimators to the current composite estimator. Proc. of the Section on Survey Research Methods, Amer. Statist. Assoc., 397--402.

[120] Breidt, F.J. and Opsomer, J.D. (2000). Local polynomial regression estimators in survey sampling. Ann. Statist., 28(4), 1026--1053.

[121] Brewer, K.R.W. (1963a). Ratio estimation and finite populations: Some results deducible from the assumption of an underlying stochastic process. Austral. J. Statist., 5, 93--105.

[122] Brewer, K.R.W. (1963b). A model of systematic sampling with unequal probabilities. Austral. J. Statist., 5, 5--13.

[123] Brewer, K.R.W. (1967). A note on Fellegi's method of sampling without replacement with probabilities proportional to size. J. Amer. Statist. Assoc., 62, 79--85.

[124] Brewer, K.R.W. (1975). A simple procedure for sampling πps wor. Austral. J. Statist., 17, 166--172.

[125] Brewer, K.R.W. (1979). A class of robust sampling designs for large scale surveys. J. Amer. Statist. Assoc., 74, 911--915.

[126] Brewer, K.R.W. (1994). Survey sampling inference: Some past perspectives and present prospects. Pak. J. Statist., A, 10, 213--233.

[127] Brewer, K.R.W. (1995). Combining design-based and model-based inference. Chapter 30 in Business Survey Methods (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge and P.S. Kott). New York: Wiley, 589--606.

[128] Brewer, K.R.W. (1999a). Cosmetic calibration with unequal probability sampling. Survey Methodology, 25(2), 205--212.

[129] Brewer, K.R.W. (1999b). Design-based or prediction-based inference? Stratified random vs. stratified balanced sampling. Int. Statist. Rev., 67, 35--47.

[130] Brewer, K.R.W. (2002). Combined survey sampling inference. Arnold.

[131] Brewer, K.R.W. and Hanif, M. (1970). Durbin's new multistage variance estimator. J. R. Statist. Soc., B, 32, 302--311.

[132] Brewer, K.R.W. and Hanif, M. (1983). Sampling with unequal probabilities. New York: Springer-Verlag.

[133] Brewer, K.R.W. and Undy, G.C. (1962). Samples of two units drawn with unequal probabilities without replacement. Austral. J. Statist., 4, 89--100.

[134] Brewer, K.R.W., Early, L.J. and Hanif, M. (1984). Poisson, modified Poisson and collocated sampling. J. Statist. Planning Infer., 10, 15--30.

[135] Brillinger, D.R., Jones, L.V. and Tukey, J.W. (1978). Report of the statistical task force for the weather modification advisory board. The Management of Western Resources, Vol. II: The Role of Statistics on Weather Resources Management. Government Printing Office, Washington, DC.

[136] Brown, B.M., Hall, P. and Young, G.A. (2001). The smoothed median and the bootstrap. Biometrika, 88(2), 519--534.

[137] Brown, J.A. (1996). The relative efficiency of adaptive cluster sampling for ecological surveys. Mathematical and Information Sciences Reports, Series B, 96/08, Massey University.

[138] Brown, J.A. (1999). A comparison of two adaptive sampling designs. Austral. & New Zealand J. Statist., 41(4), 395--403.

[139] Bryant, E.C., Hartley, H.O. and Jessen, R.J. (1960). Design and estimation in two-way stratification. J. Amer. Statist. Assoc., 55, 105--124.

[140] Buckland, W.R. (1951). A review of the literature of systematic sampling. J. R. Statist. Soc., B, 13, 208--215.

[141] Burdick, R.K. and Sielken, R.L. (1979). Variance estimation based on a superpopulation model in two-stage sampling. J. Amer. Statist. Assoc., 74, 438--440.

[142] Carlin, B.P. and Gelfand, A.E. (1990). Approaches for empirical Bayes confidence intervals. J. Amer. Statist. Assoc., 85, 105--114.

[143] Carlin, B.P. and Gelfand, A.E. (1991). A sample re-use method for accurate parametric empirical Bayes confidence intervals. J. R. Statist. Soc., B, 53, 189--200.

[144] Casady, R.J. and Lepkowski, J.M. (1993). Stratified telephone survey designs. Survey Methodology, 19, 103--113.

[145] Cassel, C.M. and Sarndal, C.E. (1974). Evaluation of some sampling strategies using a continuous variable framework. Commun. Statist.--Theory Meth., 3, 373--390.

[146] Cassel, C.M., Sarndal, C.E. and Wretman, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63, 615--620.

[147] Cassel, C.M., Sarndal, C.E. and Wretman, J.H. (1977). Foundations of Inference in Survey Sampling. John Wiley and Sons, New York.

[148] Cassel, C.M., Sarndal, C.E. and Wretman, J.H. (1979). Some uses of statistical models in connection with the non-response problem. Symposium on Incomplete Data, Preliminary Proc., Washington, D.C.

[149] Causey, B.D. (1972). Sensitivity of raked contingency table totals to changes in problem conditions. Ann. Math. Statist., 43, 656--658.

[150] Causeur, D. (1999). Exact distribution of the regression estimator in double sampling. Statistics, 32, 297--315.

[151] Cebrian, A.A. and Garcia, M.R. (1997). Variance estimation using auxiliary information: An almost unbiased multivariate ratio estimator. Metrika, 45, 171--178.

[152] Ceccon, C., Diana, G. and Salvan, A. (1991). Approccio classico al campionamento da popolazioni finite: alcuni risultati recenti. CLEUP, Padova.

[153] Chakrabarty, M.C. (1963). On the use of incidence matrices in sampling from finite populations. J. Indian Statist. Assoc., 1, 78--85.

[154] Chakrabarty, R.P. (1968). Contribution to the theory of ratio type estimators. Ph.D. Thesis, Texas A and M University.

[155] Chakrabarty, R.P. (1979). Some ratio type estimators. J. Indian Soc. Agric. Statist., 31, 49--62.

[156] Chand, L. (1975). Some ratio type estimators based on two or more auxiliary variables. Ph.D. thesis submitted to Iowa State University, Ames, Iowa.

[157] Chang, H.J. and Liang, D.H. (1996). A randomized response procedure for two unrelated sensitive questions. J. Information & Optimization Sci., 17(1), 185--198.

[158] Chang, H.J. and Huang, K.C. (2001a). On construction of almost unbiased estimators of finite population mean using transformed auxiliary variable. Statistical Papers, 42(4), 505--515.

[159] Chang, H.J. and Huang, K.C. (2001b). Estimation of proportion and sensitivity of a qualitative character. Metrika, 53(2), 269--280.

[160] Chang, K.C., Han, C.P. and Hawkins, D.L. (1999). Truncated multiple inverse sampling in post-stratification. J. Statist. Planning Infer., 76, 215--234.

[161] Chang, K.C., Liu, J.F. and Han, C.P. (1998). Multiple inverse sampling in post-stratification. J. Statist. Planning Infer., 69, 209--227.

[162] Chao, M.T. (1982). A general purpose unequal probability sampling plan. Biometrika, 69, 653--656.

[163] Chatterjee, S. and Simon, G. (1993). Confidentiality guaranteed: A non-invasive procedure for collecting sensitive information. Commun. Statist.--Theory Meth., 22(6), 1629--1651.

[164] Chaubey, Y.P. and Crisalli, A.N. (1995). Adjustment of the inclusion probabilities in case of non-response. Statistical Society of Canada, Proceedings of the Survey Methods Section, 75--79.

[165] Chaudhuri, A. (1974). On some properties of sampling scheme due to Midzuno. Calcutta Statist. Assoc. Bull., 23, 1--9.

[166] Chaudhuri, A. (1975a). A simple method of sampling without replacement with inclusion probabilities exactly proportional to size. Metrika, 22, 147--152.

[167] Chaudhuri, A. (1975b). Some results concerning Horvitz and Thompson's T1-class of estimators. Metrika, 217--223.

[168] Chaudhuri, A. (1976). A non-negativity criterion for a certain variance estimator. Metrika, 23, 201--205.

[169] Chaudhuri, A. (1977). On some properties of the Horvitz and Thompson estimator based on Midzuno's πps sampling scheme. J. Indian Soc. Agric. Statist., 47--52.

[170] Chaudhuri, A. (1981). On non-negative variance estimation. Metrika, 28, 1--12.

[171] Chaudhuri, A. (1992). Small domain statistics: a review. Technical Report ASC/92/2, Indian Statistical Institute, Calcutta.

[172] Chaudhuri, A. (1993). Mean square error estimation in randomized response surveys. Pak. J. Statist., A, 9, 101--104.

[173] Chaudhuri, A. (1997). On a pragmatic modification of survey sampling in three stages. Commun. Statist.--Theory Meth., 26(7), 1805--1810.

[174] Chaudhuri, A. (2001). Using randomized response from a complex survey to estimate a sensitive proportion in a dichotomous finite population. J. Statist. Planning Infer., 94, 37--42.

[175] Chaudhuri, A. and Adhikary, A.K. (1983). On optimality of double sampling strategies with varying probabilities. J. Statist. Planning Infer., 8, 257--265.

[176] Chaudhuri, A. and Adhikary, A.K. (1985). Some results on admissibility and uniform admissibility in double sampling. J. Statist. Planning Infer., 12, 199--202.

[177] Chaudhuri, A. and Adhikary, A.K. (1987). Circular systematic sampling with varying probabilities. Calcutta Statist. Assoc. Bull., 36, 193--195.

[178] Chaudhuri, A. and Adhikary, A.K. (1990). Variance estimation with randomized response. Commun. Statist.--Theory Meth., 19(3), 1119--1125.

[179] Chaudhuri, A., Adhikary, A.K. and Dihidar, S. (2000). Mean square error estimation in multi-stage sampling. Metrika, 52(2), 115--131.

[180] Chaudhuri, A., Adhikary, A.K. and Seal, A.K. (1997). Small domain estimation by empirical Bayes and Kalman filtering procedures -- A case study. Commun. Statist.--Theory Meth., 26(7), 1613--1621.

[181] Chaudhuri, A. and Arnab, R. (1977). On the relative efficiencies of a few strategies of sampling with varying probabilities on two occasions. Calcutta Statist. Assoc. Bull., 26, 25--38.

[182] Chaudhuri, A. and Arnab, R. (1978). On the role of sample size in determining efficiency of Horvitz and Thompson estimators. Sankhyā, C, 40, 104--109.

[183] Chaudhuri, A. and Arnab, R. (1979). On the relative efficiencies of sampling strategies under a superpopulation model. Sankhyā, C, 41, 40--43.

[184] Chaudhuri, A. and Arnab, R. (1982). On unbiased variance estimators with various multi-stage sampling strategies. Sankhyā, B, 44, 92--101.

[185] Chaudhuri, A. and Maiti, T. (1994). Variance estimation in model assisted survey sampling. Commun. Statist.--Theory Meth., 23(4), 1203--1214.

[186] Chaudhuri, A., Maiti, T. and Roy, D. (1996). A note on competing variance estimators in randomized response surveys. Austral. J. Statist., 38(1), 35--42.

[187] Chaudhuri, A. and Mitra, J. (1992). A note on two variance estimators for Rao--Hartley--Cochran estimator. Commun. Statist.--Theory Meth., 21(12), 3535--3543.

[188] Chaudhuri, A. and Mukerjee, R. (1988). Randomized response: Theory and techniques. Marcel Dekker, New York.

[189] Chaudhuri, A. and Roy, D. (1997a). Optimal variance estimation for generalized regression predictor. J. Statist. Planning Infer., 60, 139--151.

[190] Chaudhuri, A. and Roy, D. (1997b). Model assisted survey sampling strategies with randomized response. J. Statist. Planning Infer., 60, 61--68.

[191] Chaudhuri, A. and Vos, J.W.E. (1988). Unified theory and strategies of survey sampling. North-Holland.

[192] Chen, J. and Shao, J. (2001). Jackknife variance estimation for nearest neighbour imputation. J. Amer. Statist. Assoc., 96, 260--269.

[193] Chen, J. and Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80, 107--116.

[194] Chen, J., Rao, J.N.K. and Sitter, R.R. (2000). Efficient random imputation for missing data in complex surveys. Statistica Sinica, 10(4), 1153--1169.

[195] Chen, J., Sitter, R.R. and Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89, 230--237.

[196] Chen, S.X. (1998). Weighted polynomial models and weighted sampling schemes for finite population. Annals of Statistics, 26(5), 1894--1915.

[197] Chernick, M.R. and Wright, T. (1983). Estimation of population mean with two-way stratification using a systematic allocation scheme. J. Statist. Planning Infer., 7, 219--231.

[198] Chotai, J. (1974). A note on Rao--Hartley--Cochran method for PPS sampling over two occasions. Sankhyā, C, 36, 173--180.

[199] Choudhry, G.H. and Singh, M.P. (1979). Sampling with unequal probabilities and without replacement -- A rejective method. Survey Methodology, 5(2), 162--177.

[200] Christman, M. (1997). Efficiency of some sampling designs for spatially clustered populations. Environmetrics, 8, 145--166.

[201] Christofides, T.C. (2003). A generalized randomized response technique. Metrika, 57, 195--200.

[202] Chromy, J.R. (1974). Pairwise probabilities in probability non-replacement sampling. Presented at ASA meeting at St. Louis, Missouri, USA.

[203] Clayton, D., Dunn, G., Pickles, A. and Spiegelhalter, D. (1998). Analysis of longitudinal binary data from multi-phase sampling (with discussion). J. R. Statist. Soc., B, 60, 71--80.

[204] Cochran, W.G. (1940). Some properties of estimators based on sampling scheme with varying probabilities. Austral. J. Statist., 17, 22--28.

[205] Cochran, W.G. (1963). Sampling Techniques. John Wiley and Sons: New York.

[206] Cochran, W.G. (1968). Errors of measurement in statistics. Technometrics, 10(4), 637--666.

[207] Cochran, W.G. (1977). Sampling Techniques. 3rd Ed. John Wiley & Sons, New York.

[208] Conti, P.L. (1995). A note on the estimation of a proportion in sampling finite populations. Metron, 35--41.

[209] Cox, D.R. (1958). The Planning of Experiments. Wiley, New York.

[210] Cox, D.R. (1971). Discussion of Royall (1971). Foundations of Statistical Inference (V.P. Godambe and D.A. Sprott, eds). Holt, Rinehart & Winston, Toronto, 275.

[211] Cox, D.R. (1984). Present position and potential developments: some personal views, design of experiments and regression. J. R. Statist. Soc., A, 147, 306--315.

[212] Dalabehera, M. and Sahoo, L.N. (1995). Efficiency of six almost unbiased ratio estimators under a particular model. Statistical Hefte, 36, 61--67.

[213] Dalabehera, M. and Sahoo, L.N. (1997). A class of estimators in stratified sampling with two auxiliary variables. J. Indian Soc. Agric. Statist., 50(2), 144--149.

[214] Dalabehera, M. and Sahoo, L.N. (2000). An unbiased estimator in two-phase sampling using two auxiliary variables. J. Indian Soc. Agric. Statist., 53(2), 134--140.

[215] Dalenius, T. (1950). The problem of optimum stratification. Skand. Akt., 33, 203--213.

[216] Dalenius, T. (1955). The problem of not-at-homes. Statistisk Tidskrift, 4, 208--211.

[217] Dalenius, T. (1957). Sampling in Sweden. Almquist and Wiksell, Stockholm.

[218] Dalenius, T. and Gurney, M. (1951). The problem of optimum stratification II. Skand. Akt., 34, 133--148.

[219] Dalenius, T. and Gurney, M. (1957). The choice of stratification points. Skand. Akt., 40, 198--203.

[220] Dalenius, T. and Hodges, J.L. (1957). The choice of stratification points. Skandinavisk Aktuarietidskrift.

[221] Dalenius, T. and Hodges, J.L. (1959). Minimum variance stratification. J. Amer. Statist. Assoc., 54.

[222] Das, A.C. (1950). Two dimensional systematic sampling. Sankhyā, 10, 95--108.

[223] Das, A.C. (1951). On two-phase sampling and sampling with varying probabilities. Bull. Int. Statist. Inst., 33(2), 105--112.

[224] Das, A.K. (1982). On the use of auxiliary information in estimating proportions. J. Indian Statist. Assoc., 20, 99--108.

[225] Das, A.K. and Tripathi, T.P. (1978). Use of auxiliary information in estimating finite population variance. Sankhyā, C, 40, 139--148.

[226] Das, A.K. and Tripathi, T.P. (1979). A class of estimators for population mean when mean of an auxiliary character is known. Math. Tech. Report No. 22/79, ISI, Calcutta.

[227] Das, A.K. and Tripathi, T.P. (1980). Sampling strategies for population mean when the coefficient of variation of an auxiliary character is known. Sankhyā, C, 42, 76--86.

[228] Das, G. and Bez, K. (1995). Preliminary test estimators in double sampling with two auxiliary variables. Commun. Statist.--Theory Meth., 24(5), 1211--1226.

[229] Das, K. (1982). Estimation of population ratio on two occasions. J. Indian Soc. Agric. Statist., 34(2), 1--9.

[230] Datta, G.S., Day, B. and Basawa, I.V. (1999). Empirical best linear unbiased and empirical Bayes prediction in multivariate small area estimation. J. Statist. Planning Infer., 75, 269--279.

[231] Datta, G.S., Day, B. and Maiti, T. (1998). A nested error regression model for multivariate hierarchical Bayes estimation of small area means. Sankhyā, A, 60, 344--362.

[232] Datta, G.S. and Ghosh, M. (1991). Bayesian prediction in linear models: applications to small area estimation. Annals of Statistics, 19, 1748--1770.

[233] Datta, G.S. and Lahiri, P. (1995). Robust hierarchical Bayes estimation of small area characteristics in presence of covariates and outliers. J. Multivariate Analysis, 54, 310--328.

[234] Datta, G.S. and Lahiri, P. (2000). A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Statistica Sinica, 10, 613--627.

[235] Datta, G.S., Lahiri, P., Maiti, T. and Lu, K.L. (1999). Hierarchical Bayes estimation of unemployment rates for the states of the U.S. J. Amer. Statist. Assoc., 94, 1074--1082.

[236] David, I.P. and Sukhatme, B.V. (1974). On the bias and mean square error of the ratio estimator. J. Amer. Statist. Assoc., 69, 464--466.

[237] Dayal, S. (1979). Use of estimates of proportions of stratum sizes and standard deviations in allocation of sample to different strata under stratified random sampling. Sankhyā, C, 41, 159--175.

[238] Dayal, S. (1985). Allocation of sample using values of auxiliary characteristic. J. Statist. Planning Infer., 11, 321--328.

[239] Deming, W.E. (1953). On a probability mechanism to attain an economic balance between the resulting error of response and bias of non-response. J. Amer. Statist. Assoc., 48, 743--772.

[240] Deming, W.E. and Stephan, F.F. (1940). On a least square adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist., 11, 427--444.

[241] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., B, 39, 1--38.

[242] Dempster, A.P., Rubin, D.B. and Tsutakawa, R.K. (1981). Estimation in covariance component models. J. Amer. Statist. Assoc., 76, 341--353.

[243] Dempster, A.P. and Tomberlin, T.J. (1980). The analysis of census undercount from a post-enumeration survey. Proceedings of the Conference on Census Undercount, 88--94.

[244] Deng, L.Y. and Wu, C.F.J. (1987). Estimation of variance of the regression estimator. J. Amer. Statist. Assoc., 82, 568--575.

[245] Deshpande, M.N. (1978). A new sampling procedure with varying probabilities. J. Indian Soc. Agric. Statist., 30, 110--114.

[246] Deshpande, M.N. (1980). A note on the comparison between simple random sampling with and without replacement. Metrika, 27, 277--279.

[247] Deshpande, M.N. and Ajgaonkar, S.G.P. (1977). On multitrial sampling methods. Biometrika, 64, 422--424.

[248] Deshpande, M.N. and Ajgaonkar, S.G.P. (1987). A generalization of Midzuno sampling procedure. Austral. J. Statist., 29, 188--192.

[249] Deville, J.C. and Goga, C. (2002). The Horvitz--Thompson theory for two samples. International Conference on Improving Surveys, Copenhagen.

[250] Deville, J.C. and Sarndal, C.E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc., 87, 376--382.

[251] Deville, J.C. and Tille, Y. (1998). Unequal probability sampling without replacement through a splitting method. Biometrika, 85(1), 89--101.

[252] Deville, J.C. and Tille, Y. (2000). Selection of several unequal probability samples from the same population. J. Statist. Planning Infer., 86, 215--227.

[253] Dey, A. and Srivastava, A.K. (1987). A sampling procedure with inclusion probabilities proportional to size. Survey Methodology, 13(1), 85--92.

[254] Diana, G. (1992). A study of kth order approximation of some ratio type strategies. Metron, 19--32.

[255] Dorfman, A.H. and Hall, P. (1993). Estimators of the finite population distribution function using non-parametric regression. Annals of Statistics, 21(3), 1452--1475.

[256] Doss, D.C., Hartley, H.O. and Somayajulu, G.R. (1979). An exact small sample theory for post-stratification. J. Statist. Planning Infer., 3, 235--248.

[257] Dowling, T.A. and Shachtman, R.H. (1975). On the relative efficiency of randomized response models. J. Amer. Statist. Assoc., 70, 84--87.

[258] Draper, N.R. and Guttman, I. (1968a). Some Bayesian stratified two-phase sampling results. Biometrika, 55, 131--139.

[259] Draper, N.R. and Guttman, I. (1968b). Bayesian stratified two-phase sampling results: k characteristics. Biometrika, 55, 587--589.

[260] Drew, D., Singh, M.P. and Choudhry, G.H. (1982). Evaluation of small area estimation techniques for the Canadian labour force surveys. Survey Methodology, 8, 17--47.

[261] Dubey, V. (1993). An almost unbiased product estimator. J. Indian Soc. Agric. Statist., 45, 226--229.

[262] Dubey, V. and Singh, S.K. (2001). An improved regression estimator for estimating population mean. J. Indian Soc. Agric. Statist., 54(2), 179--183.

[263] Duncan, G.J. and Kalton, G. (1987). Issues of design and analysis of surveys across time. Int. Statist. Rev., 55, 97--117.

[264] Dupont, F. (1995). Alternative adjustments where there are several levels of auxiliary information. Survey Methodology, 21, 125--135.

[265] Durbin, J. (1959). A note on the application of Quenouille's method of bias reduction to the estimation of ratios. Biometrika, 46, 477--480.

[266] Durbin, J. (1967). Design of multi-stage survey for the estimation of sampling error. Applied Statist., 16, 152--164.

[267] Eckler, A.R. (1955). Rotation sampling. Ann. Math. Statist., 26, 664--685.

[268] Eichhorn, B.H. and Hayre, L.S. (1983). Scrambled randomized response methods for obtaining sensitive quantitative data. J. Statist. Planning Infer., 7, 307--316.

[269] Ekman, G. (1959). An approximation useful in univariate stratification. Ann. Math. Statist., 30, 219--229.

[270] Elliott, M.R., Little, R.J.A. and Lewitzky, S. (2000). Subsampling callbacks to improve survey efficiency. J. Amer. Statist. Assoc., 95, 730--738.

[271] Eltinge, J.L. (1999). Accounting for non-Gaussian measurement error in complex survey estimators of distribution functions and quantiles. Statist. Sinica, 9, 425--450.

[272] Ericson, W.A. (1969). Subjective Bayesian models in sampling finite populations. J. R. Statist. Soc., 55, 587--589.

[273] Eriksson, S.A. (1973). A new model for randomized response. Int. Statist. Rev., 41, 101--103.

[274] Espejo, M.R. (1997). Uniqueness of the Zinger strategy with estimable variance: Rana--Singh estimator. Sankhyā, B, 59, 76--83.

[275] Espejo, M.R. and Pineda, M.D. (1997). On variance estimation for poststratification: a review. Metron, 209--220.

[276] Espejo, M.R., Pineda, M.D. and Nadarajah, S. (2003). Estimation of finite population parameters with several realizations. Statistical Papers, 44(2), 267--278.

[277] Estevao, V.M. (1994). Calibration of g weights under calibration and bound constraints. Report, Statistics Canada.

[278] Estevao, V.M. and Sarndal, C.E. (2000). A functional form approach to calibration. J. Official Statist., 16(4), 379--399.

[279] Estevao, V.M. and Sarndal, C.E. (2002). The ten cases of auxiliary information for calibration in two-phase sampling. J. Official Statist., 18(2), 233--255.

[280] Farrell, P.J. (1997). Empirical Bayes estimation of small area proportions based on ordinal outcome variables. Survey Methodology, 23, 119--126.

[281] Farrell, P.J. (2000). Bayesian inference for small area proportions. Sankhyā, B, 62, 402--416.

[282] Farrell, P.J., MacGibbon, B. and Tomberlin, T.J. (1994). Protection against outliers in empirical Bayes estimation. Canad. J. Statist., 22, 365--376.

[283] Farrell, P.J., MacGibbon, B. and Tomberlin, T.J. (1997a). Bootstrap adjustments for empirical Bayes interval estimates of small-area proportions. Canad. J. Statist., 25(1), 75--89.

[284] Farrell, P.J., MacGibbon, B. and Tomberlin, T.J. (1997b). Empirical Bayes estimators of small area proportions in multistage designs. Statistica Sinica, 7, 1065--1083.

[285] Farrell, P.J., MacGibbon, B. and Tomberlin, T.J. (1997c). Empirical Bayes small area estimation using logistic regression models and summary statistics. J. Business and Econ. Statist., 15, 101--108.

[286] Farrell, P.J. and Singh, S. (2002a). Recalibration of higher order calibration weights. Presented at the Conference of the Statistical Society of Canada, Hamilton, Canada.

[287] Farrell, P.J. and Singh, S. (2002b). Penalized chi square distance function in survey sampling. Joint Statistical Meetings, NY -- Section on Survey Research Methods, pp. 963--968.

[288] Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist., 21, 196--216.

[289] Fay, R. (1992). When are inferences from multiple imputation valid? Proc. Survey Res. Meth. Sect., Amer. Statist. Assoc., 227--232.

[290] Fay, R. (1994). Discussion of paper by X.L. Meng. Statist. Sci., 9, 558--560.

[291] Fay, R. (1996). Alternative paradigms for the analysis of imputed survey data. J. Amer. Statist. Assoc., 91, 490--498.

[292] Fay, R.E. and Herriot, R.A. (1979). Estimates of income for small places: An application of James--Stein procedures to census data. J. Amer. Statist. Assoc., 74, 269--277.

[293] Fellegi, I.P. (1963). Sampling with varying probabilities without replacement: rotating and non-rotating samples. J. Amer. Statist. Assoc., 58, 183--201.

[294] Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic editing and imputation. J. Amer. Statist. Assoc., 71, 17--35.

[295] Fellegi, I.P. and Sunter, A.B. (1974). Balance between different sources of survey errors -- some Canadian experiences. Sankhyā, C, 36, 119--142.

[296] Feller, W. (1957). An introduction to probability theory and its applications. Vol. I, John Wiley and Sons, New York.

[297] Feng, S. and Zou, G. (1997). Sample rotation method with auxiliary variable. Commun. Statist.--Theory Meth., 26(6), 1497--1509.

[298] Fienberg, S.E. (1970). An iterative procedure for estimation in contingency tables. Ann. Math. Statist., 41, 907--917.

[299] Fienberg, S.E. and Tanur, J.M. (1987). Experimental and sampling structures: parallels diverging and meeting. Int. Statist. Rev., 55, 75--96.

[300] Finney, D.J. (1948). Random and systematic sampling in timber surveys. Forestry, 22, 1--36.

[301] Finney, D.J. (1950). An example of periodic variation in forest sampling. Forestry, 23, 96--111.

[302] Fisher, R.A. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices R. Astr. Soc., 80, 759--770.

[303] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Phil. Trans. R. Soc. Lond., A, 222, 309--368.

[304] Fisher, R.A. (1925). Statistical Methods for Research Workers. 1st Edition. Oliver and Boyd, Edinburgh.

[305] Folsom, R.E., Greenberg, B.G., Horvitz, D.G. and Abernathy, J.R. (1973). The two alternate questions randomized response model for human surveys. J. Amer. Statist. Assoc., 68, 525--530.

[306] Foreman, E.K. and Brewer, K.R.W. (1971). The efficient use of supplementary information in standard sampling procedures. J. R. Statist. Soc., B, 33, 391--400.

[307] Fountain, R.L. and Pathak, P.K. (1989). Systematic and non-random sampling in the presence of linear trends. Commun. Statist.--Theory Meth., 18, 2511--2526.

[308] Francis, R.I.C.C. (1984). An adaptive strategy for stratified random trawl surveys. New Zealand J. Mar. Freshw. Res., 18, 59--71.

[309] Francisco, C.A. and Fuller, W.A. (1991). Quantile estimation with a complex survey design. Ann. Statist., 19, 454--469.

[310] Frankel, L.R. and Stock, J.S. (1942). On the sample survey of unemployment. J. Amer. Statist. Assoc., 37, 77--80.

[311] Franklin, L.A. (1989a). A comparison of estimators for randomized response sampling with continuous distributions from a dichotomous population. Commun. Statist.--Theory Meth., 18, 489--505.

[312] Franklin, L.A. (1989b). Randomized response sampling from dichotomous populations with continuous randomization. Survey Methodology, 15, 225--235.

[313] Freedman, D., Pisani, R. and Purves, R. (1978). Statistics. Norton, New York.

[314] Freund, J.E. (2000). Mathematical Statistics. Fifth Edition. Prentice Hall of India, New Delhi.

[315] Friedlander, D. (1961). A technique for estimating a contingency table, given the marginal totals and some supplementary data. J. R. Statist. Soc., A, 124, 412--420.

[316] Fuller, W.A. (1966). Estimation employing post strata. J. Amer. Statist. Assoc., 61, 1172--1183.

[317] Fuller, W.A. (1970). Sampling with random stratum boundaries. J. R. Statist. Soc., 32, 209--226.

[318] Fuller, W.A. (1971). A procedure for selecting non-replacement unequal probability samples. Unpublished Manuscript, Department of Statistics, Iowa State University, Ames, Iowa.

[319] Fuller, W.A. (1987). Measurement error models. John Wiley and Sons, Inc., New York.

[320] Fuller, W.A. (1995). Estimation in the presence of measurement error. Int. Statist. Rev., 63, 121--147.

[321] Fuller, W.A. (1998). Replication variance estimation for two-phase samples. Statistica Sinica, 8, 1153--1164.

[322] Fuller, W.A. and Breidt, F.J. (1999). Estimation for supplemented panels. Sankhyā, 51, 58--70.

[323] Fuller, W.A. and Burmeister, L.F. (1972). Estimators for samples selected from two overlapping frames. Proceedings of the Social Statistics Section, American Statistical Association, 245--249.

[324] Gabler, S. (1981). A comparison of Sampford's sampling procedure versus unequal probability sampling with replacement. Biometrika, 68, 725--727.

[ 325 ] Gabler , S. (1984) . On unequal probability sampling: sufficient conditions for the superiori ty of
sampling without replacement. Biometrika. 71, 171--175.

[326] Gabler, S. and Horst, S. (1995). Improving the RHC--strategy . Statistical Hefte, 36, 327--336 .

[ 327 ] Garcia, M.R. and Cebrian, A.A. (1996). Repeated substitution method : The ratio estimator for the
population variance. Metrika, 43, 101--105.

[328] Garcia, M.R. and Cebrian, A.A. (1998). Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika, 47, 203--213.

[329] Garcia, M.R. and Cebrian, A.A. (2001). On estimating the median from survey data using multi-auxiliary information. Metrika, 54(1), 59--76.

[330] Gautschi, W. (1957). Some remarks on systematic sampling. Ann. Math. Statist., 28, 385--394.

[331] Gershunskaya, J., Eltinge, J.L. and Huff, L. (2002). Use of auxiliary information to evaluate a synthetic estimator in the U.S. Current Employment Statistics program. Joint Statistical Meetings, NY -- Section on Survey Research Methods, 1149--1154.

[332] Ghangurde, P.D. and Rao, J.N.K. (1969). Some results on sampling over two occasions. Sankhyā, A, 31, 463--472.

[333] Ghosh, M. and Meeden, G. (1997). Bayesian methods for finite population sampling. Chapman and Hall.

[334] Ghosh, M. and Pathak, P.K. (1992). Current Issues in Statistical Inference: Essays in Honor of D. Basu. Lecture Notes -- Monograph Series, Institute of Mathematical Statistics, Hayward, California.

[335] Ghosh, M. and Rao, J.N.K. (1994). Small area estimation: An appraisal. Statistical Science, 9(1), 55--93.

[336] Ghosh, S. (1998). The Horvitz--Thompson vs. Sen--Yates--Grundy variance estimators: Issues in finite population sampling. J. Indian Soc. Agril. Statist., 50(2&3), 343--348.

[337] Ghosh, S.P. (1963). Post-cluster sampling. Ann. Math. Statist., 34, 587--597.

[338] Giffard-Jones, W. (1993). The doctor game. The Windsor Star, April 15, 1993.

[339] Giommi, A. (1984). A simple method for estimating individual response probabilities in sampling from finite populations. Metrika, 185--200.

[340] Godambe, V.P. (1995a). Estimation of parameters in survey sampling: Optimality. Canad. J. Statist., 23(3), 227--243.

[341] Godambe, V.P. (1955b). A unified theory of sampling from finite populations. J. R. Statist. Soc., B, 17, 269--278.

[342] Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31, 1208--1211.

[343] Godambe, V.P. (1969). Admissibility and Bayes estimation in sampling finite populations -- V. Ann. Math. Statist., 40, 672--676.

[344] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277--284.

[345] Godambe, V.P. (1980a). On the sufficiency and ancillarity in the presence of nuisance parameters. Biometrika, 67, 269--276.

[346] Godambe, V.P. (1980b). Estimation in randomized response trials. Int. Statist. Rev., 48, 29--32.

[347] Godambe, V.P. (1984). On ancillarity and Fisher information in the presence of a nuisance parameter. Biometrika, 71, 626--629.

[348] Godambe, V.P. (1987). Resolution of Godambe's paradox. Statist. Probab. Lett., 5, 239--239.

[349] Godambe, V.P. (1989). Estimation of cumulative distribution of survey population. Technical Report STAT: 89--117, University of Waterloo.

[350] Godambe, V.P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika, 78, 143--151.

[351] Godambe, V.P. (1995). Estimation of parameters in survey sampling: Optimality. Canad. J. Statist., 23(3), 227--243.

[352] Godambe, V.P. (1998). Estimation of parameters in survey sampling. J. Indian Soc. Agril. Statist., 51(2--3), 315--330.

[353] Godambe, V.P. (1999). Linear Bayes and optimal estimation. Ann. Inst. Statist. Math., 51(2), 201--215.

[354] Godambe, V.P. and Heyde, C.C. (1987). Quasi likelihood and optimal estimation. Int. Statist. Rev., 55, 231--244.

[355] Godambe, V.P. and Joshi, V.M. (1965). Admissibility and Bayes estimation in sampling finite populations -- I. Ann. Math. Statist., 36, 1707--1722.

[356] Godambe, V.P. and Kale, B.K. (1991). Estimating functions: an overview. In Estimating Functions (V.P. Godambe, ed.), Clarendon Press, Oxford, 3--20.

[357] Godambe, V.P. and Thompson, M.E. (1984). Robust estimation through estimating equations. Biometrika, 71, 115--125.

[358] Godambe, V.P. and Thompson, M.E. (1986). Parameters of superpopulation and survey population, their relationship and estimation. Int. Statist. Rev., 54, 127--138.

[359] Godambe, V.P. and Thompson, M.E. (1989). An extension of quasi-likelihood estimation (with discussion). J. Statist. Planning Infer., 22, 137--172.

[360] Godambe, V.P. and Thompson, M.E. (1996--97). Optimal estimation in a causal framework. J. Indian Soc. Agril. Statist., 49, 21--46.

[361] Godambe, V.P. and Thompson, M.E. (1999). A new look at confidence intervals in survey sampling. Survey Methodology, 25, 161--173.

[362] Goel, B.B.P.S. and Singh, D. (1977). On the formation of clusters. J. Indian Soc. Agril. Statist., 29, 53--68.

[363] Gonzalez, M.E. (1973). Use and evaluation of synthetic estimators. Proceedings of the Amer. Statist. Assoc., Social Statistics Section, 33--36.

[364] Gonzalez, M.E. and Hoza, C. (1976). Small area estimation of unemployment. Proceedings of the Section on Social Statistics, American Statistical Association, 437--443.

[365] Gonzalez, M.E. and Hoza, C. (1978). Small area estimation with application to unemployment and housing estimates. J. Amer. Statist. Assoc., 73, 7--15.

[366] Goodman, L.A. and Hartley, H.O. (1958). The precision of unbiased ratio type estimators. J. Amer. Statist. Assoc., 53, 491--508.

[367] Graf, M. (2002). Assessing the accuracy of the median in a stratified double stage cluster sampling by means of a nonparametric confidence interval: Application to the Swiss earnings structure survey. Proc. Joint Statistical Meetings, NY -- Section on Government Statistics, 1223--1228.

[368] Greenberg, B.G., Abul-Ela, A.L.A., Simmons, W.R. and Horvitz, D.G. (1969). The unrelated question randomized response model -- theoretical framework. J. Amer. Statist. Assoc., 64, 520--539.

[369] Greenberg, B.G., Kuebler, R.R., Abernathy, J.R. and Horvitz, D.G. (1971). Application of the randomized response technique in obtaining quantitative data. J. Amer. Statist. Assoc., 66, 243--250.

[370] Grewal, I.S., Bansal, M.L. and Singh, S. (1999). An alternative estimator for multiple characteristics using randomized response technique in PPS sampling. Aligarh J. Statist., 51--65.

[371] Grewal, I.S., Bansal, M.L. and Singh, S. (2002). Estimation of population mean of a stigmatized quantitative variable using double sampling. Statistica (accepted).

[372] Gross, S.T. (1980). Median estimation in sample surveys. Proc. Surv. Res. Meth. Sect., Amer. Statist. Assoc., 181--184.

[373] Groves, R.M. (1996). Non-sampling error in surveys: the journey toward relevance in practice. Proc. Statist. Can. Symp., 96, 7--14.

[374] Groves, R.M. and Lepkowski, J.M. (1986). An experimental implementation of a dual frame telephone sample design. Proc. Sec. Survey Res. Meth., American Statistical Association, 340--345.

[375] Grubbs, F.E. (1948). On estimating precision of measuring instruments and product variability. J. Amer. Statist. Assoc., 43, 243--264.

[376] Gujarati, D. (1978). Basic econometrics (International Student Edition). McGraw-Hill International Book Company, Tokyo.

[377] Gupta, B.K. and Rao, T.J. (1997). Stratified PPS sampling and allocation of sample size. J. Indian Soc. Agril. Statist., 50(2), 199--208.

[378] Gupta, J.P. (2002). Estimation of the correlation coefficient in probability proportional to size with replacement sampling. Statistical Papers, 43(4), 525--536.

[379] Gupta, J.P. and Singh, R. (1990). A note on usual correlation coefficient in systematic sampling. Statistica, 50, 255--259.

[380] Gupta, J.P., Singh, R. and Kashani, H.B. (1993). An estimator of the correlation coefficient in probability proportional to size with replacement sampling. Metron, 165--177.

[381] Gupta, J.P., Singh, R. and Lal, B. (1978). On the estimation of the finite population correlation coefficient -- I. Sankhyā, C, 41, 38--59.

[382] Gupta, J.P., Singh, R. and Lal, B. (1979). On the estimation of the finite population correlation coefficient -- II. Sankhyā, C, 42, 1--39.

[383] Gupta, P.C. (1970). Some estimation problems in sampling using auxiliary information. Unpublished Ph.D. thesis submitted to IARS, New Delhi.

[384] Gupta, P.C. (1978). On some quadratic and higher degree ratio and product estimator. J. Indian Soc. Agril. Statist., 30, 71--80.

[385] Gupta, P.C. and Kothwala, N.H. (1990). A study of second order approximation for some product type estimators. J. Indian Soc. Agril. Statist., 42, 171--185.

[386] Gupta, R.K., Singh, S. and Mangat, N.S. (1992--93). Some chain ratio type estimators for estimating finite population variance. Aligarh J. Statist., 12&13, 65--69.

[387] Gupta, V.K. and Nigam, A.K. (1987). Mixed orthogonal arrays for variance estimation with unequal number of primary selections per stratum. Biometrika, 74, 735--742.

[388] Gupta, V.K., Nigam, A.K. and Kumar, P. (1982). On a family of sampling schemes with inclusion probability proportional to size. Biometrika, 69, 191--196.

[389] Gurney, M. and Jewett, R.S. (1975). Constructing orthogonal replications for standard errors. J. Amer. Statist. Assoc., 70, 819--821.

[390] Hajek, J. (1958). Some contribution to the theory of probability sampling. Bull. Int. Statist. Inst., 36, 127--134.

[391] Hajek, J. (1959). Optimum strategies and other problems in probability sampling. Casopis Pest. Mat., 84, 387--423.

[392] Hajek, J. (1964). Asymptotic theory of rejective sampling with varying probability from a finite population. Ann. Math. Stat., 35, 1491--1525.

[393] Halmos, P.R. and Perlman, M.D. (1974). On the existence of a minimal sufficient sub-field. Ann. Statist., 2, 1049--1055.

[394] Hanif, M., Mukhopadhyay, P. and Bhattacharyya, S. (1993). On estimating the variance of the Horvitz and Thompson estimator. Pak. J. Statist., A, 9, 123--136.

[395] Hansen, M.H. and Hurwitz, W.N. (1942). Relative efficiencies of various sampling units in population enquiries. J. Amer. Statist. Assoc., 37, 89--94.

[396] Hansen, M.H. and Hurwitz, W.N. (1943). On the theory of sampling from finite populations. Ann. Math. Stat., 14, 333--362.

[397] Hansen, M.H. and Hurwitz, W.N. (1946). The problem of non-response in sample surveys. J. Amer. Statist. Assoc., 41, 517--529.

[398] Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1953). Sample survey methods and theory. John Wiley and Sons, New York, 456--464.

[399] Hanurav, T.V. (1965). Optimum sampling strategies and some related problems. Ph.D. Thesis, Indian Statistical Institute.

[400] Hanurav, T.V. (1966). Some aspects of unified sampling theory. Sankhyā, A, 28, 175--204.

[401] Hanurav, T.V. (1967). Optimum utilization of auxiliary information: πps sampling of two units from a stratum. J. R. Statist. Soc., B, 29, 374--391.

[402] Hartigan, J.A. (1969). Linear Bayesian methods. J. R. Statist. Soc., B, 31, 440--454.

[403] Hartley, H.O. (1962). Multiple frame surveys. Proc. of the Social Statistics Section, American Statistical Association, 203--206.

[404] Hartley, H.O. (1966). Systematic sampling with unequal probability and without replacement. J. Amer. Statist. Assoc., 61, 739--748.

[405] Hartley, H.O. (1974). Multiple frame methodology and selected applications. Sankhyā, C, 36, 99--118.

[406] Hartley, H.O. and Biemer, P.P. (1978). The estimation of non-sampling variances in current surveys. Proc. Sec. Survey Res. Meth., American Statistical Association, 257--262.

[407] Hartley, H.O. and Rao, J.N.K. (1962). Sampling with unequal probabilities and without replacement. Ann. Math. Statist., 33, 350--374.

[408] Hartley, H.O. and Rao, J.N.K. (1968). A new estimation theory for sample surveys. Biometrika, 55, 547--557.

[409] Hartley, H.O., Rao, J.N.K. and Kiefer, G. (1969). Variance estimation with one unit per stratum. J. Amer. Statist. Assoc., 64, 841--851.

[410] Hartley, H.O. and Ross, A. (1954). Unbiased ratio estimators. Nature, 174, 270--271.

[411] Hedayat, A.S., Rao, C.R. and Stufken, J. (1988). Sampling plans excluding contiguous units. J. Statist. Planning Infer., 19, 159--170.

[412] Heilbron, D.C. (1978). Comparison of estimators of the variance of systematic sampling. Biometrika, 65, 429--433.

[413] Henderson, C.R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31, 423--447.

[414] Hendricks, W.A. (1944). The relative efficiencies of groups of farms as sampling units. J. Amer. Statist. Assoc., 39, 366--376.

[415] Hendricks, W.A. (1949). Adjustment for bias caused by non-response in mailed surveys. Agric. Econo. Res., 1, 52--56.

[416] Herzel, A. (1986). Sampling without replacement with unequal probabilities: sample designs with pre-assigned joint inclusion probabilities of any order. Metron, 49--68.

[417] Herzel, A. (1993). Maximin πps designs. Metron, 5--25.

[418] Hidiroglou, M.A. (1995). Sampling and estimation for stage one of the Canadian survey of employment, payrolls and hours survey redesign. Statistical Society of Canada, Proc. of the Survey Methods Section, 123--128.

[419] Hidiroglou, M.A. (2001). Double sampling. Survey Methodology, 27, 143--154.

[420] Hidiroglou, M.A. and Sarndal, C.E. (1995). Use of auxiliary information for two-phase sampling. Proc. Sec. Survey Res. Meth., Amer. Statist. Assoc., Vol. II, 873--878.

[421] Hidiroglou, M.A. and Sarndal, C.E. (1998). Use of auxiliary information for two-phase sampling. Survey Methodology, 24(1), 11--20.

[422] Hodges, J.L. and Lehmann, E. (1970). Basic Concepts of Probability and Statistics. 2nd ed. Holden-Day, San Francisco.

[423] Holt, D. and Smith, T.M.F. (1979). Post-stratification. J. R. Statist. Soc., A, 142, 33--46.

[424] Holt, D., Smith, T.M.F. and Tomberlin, T.J. (1979). A model based approach to estimation for small subgroups of a population. J. Amer. Statist. Assoc., 74, 405--410.

[425] Horvitz, D.G. and Thompson, D.J. (1952). A generalisation of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47, 663--685.

[426] Huang, L.R. and Ernst, L.R. (1981). Comparison of an alternative estimator to the current composite estimator in the Current Population Surveys. Proc. of the Amer. Statist. Assoc., Section on Survey Research Methods, 303--308.

[427] Hutchison, M.C. (1971). A Monte Carlo comparison of some ratio estimators. Biometrika, 58, 313--321.

[428] Iachan, R. (1982). Systematic sampling: A critical review. Int. Stat. Rev., 50, 293--303.

[429] Ireland, C.T. and Kullback, S. (1968). Contingency table with given marginals. Biometrika, 55, 179--188.

[430] Isaki, C.T. (1983). Variance estimation using auxiliary information. J. Amer. Statist. Assoc., 78, 117--123.

[431] Isaki, C.T. and Fuller, W.A. (1982). Survey design under a regression superpopulation model. J. Amer. Statist. Assoc., 77, 89--96.

[432] Jaech, J.L. (1981). Constrained expected likelihood estimates of precisions using Grubbs' technique for two dimensional methods. Nuclear Materials Management Journal, X(2), 34--39.

[433] Jaech, J.L. (1985). Statistical analysis of measurement errors. John Wiley and Sons, New York.

[434] Jagers, P. (1986). Post-stratification against bias in sampling. Int. Statist. Rev., 54, 159--167.

[435] Jagers, P., Oden, A. and Trulsson, L. (1985). Post-stratification and ratio estimation: usages of auxiliary information in survey sampling and opinion polls. Int. Statist. Rev., 53, 221--238.

[436] Jain, R.K. (1987). Properties of estimators in simple random sampling using auxiliary variable. Metron, 265--271.

[437] Jessen, R.J. (1942). Statistical investigation of a sample survey for obtaining farm facts. Iowa Agricultural Experiment Station Research Bulletin, 104.

[438] Jhajj, H.S. and Srivastava, S.K. (1983). A class of PPS estimators of population mean using auxiliary information. J. Indian Soc. Agril. Statist., 35, 57--61.

[439] John, S. (1969). On multivariate ratio and product estimators. Biometrika, 56, 533--536.

[440] Jolly, G.M. and Hampton, I. (1990). A stratified random transect design for acoustic surveys of fish stocks. Canad. J. Fish. Aquat. Sci., 47, 1282--1291.

[441] Jones, R.G. (1980). Best linear unbiased estimators for repeated surveys. J. R. Statist. Soc., B, 42, 221--226.

[442] Joshi, V.M. (1966). Admissibility and Bayes estimation in sampling finite populations -- IV. Ann. Math. Statist., 37, 1658--1678.

[443] Joshi, V.M. (1970). Note on the admissibility of the Sen--Yates--Grundy estimator and Murthy's estimator and its variance estimator for samples of size two. Sankhyā, A, 32, 431--438.

[444] Kadilar, C. and Cingi, H. (2003). Ratio estimators in stratified random sampling. Biom. J., 45(2), 218--225.

[445] Kalton, G. and Anderson, D.W. (1986). Sampling rare populations. J. R. Statist. Soc., A, 65--82.

[446] Kalton, G. and Kasprzyk, J.R. (1986). The treatment of missing data. Survey Methodology, 105--110.

[447] Kapadia, S.B. and Gupta, P.C. (1984). A quadratic and higher degree ratio, product estimators in sampling with varying probabilities. J. Statist. Res., 18, 1--18.

[448] Karlheinz, F. (1990). Stratified sampling using double sampling. Statist. Hefte, 31, 55--63.

[449] Kasprzyk, D., Duncan, G.J., Kalton, G. and Singh, M.P. Panel Surveys. Wiley, New York.

[450] Kathuria, O.P. (1975). Some estimators in two-stage sampling on successive occasions with partial matching at both stages. Sankhyā, C, 37, 147--162.

[451] Kathuria, O.P. and Singh, D. (1971a). Comparison of estimates in two-stage sampling on successive occasions. J. Indian Soc. Agril. Statist., 23, 31--51.

[452] Kathuria, O.P. and Singh, D. (1971b). Relative efficiencies of some alternative replacement procedures in two-stage sampling on successive occasions. J. Indian Soc. Agril. Statist., 23, 101--114.

[453] Kaur, P. and Singh, O. (1982). A note on estimating variance in a finite population. J. Statist. Res., 16(1&2), 51--54.

[454] Kempthorne, O. (1952). The Design and Analysis of Experiments. Wiley, New York.

[455] Kerkvliet, J. (1994). Estimating a logit model with randomized data: The case of cocaine use. Austral. J. Statist., 36, 9--20.

[456] Khan, S.U. and Tripathi, T.P. (1967). The use of multi-auxiliary information in double sampling. J. Indian Statist. Assoc., 5, 42--48.

[457] Khan, Z. (1976). Optimum allocation in Bayesian stratified two-phase sampling. J. Indian Soc. Agril. Statist., 14, 65--74.

[458] Khare, B.B. (1987). Allocation in stratified sampling in presence of non-response. Metron, 213--221.

[459] Khare, B.B. (1991). Determination of sample sizes for a class of two-phase sampling estimators for ratio and product of two population means using auxiliary character. Metron, 185--197.

[460] Khare, B.B. and Srivastava, S. (1981). A generalized regression ratio estimator for the population mean using two auxiliary variables. Aligarh J. Statist., 1(1), 43--51.

[461] Khare, B.B. and Srivastava, S. (1997). Transformed ratio type estimators for the population mean in the presence of nonresponse. Commun. Statist. -- Theory Meth., 26(7), 1779--1791.

[462] Khare, B.B. and Srivastava, S.R. (1998). Combined generalised chain estimators for ratio and product of two population means using auxiliary characters. Metron, 56, 109--116.

[463] Kim, J. (1978). Randomized response techniques for surveying human populations. Unpublished Ph.D. dissertation, Temple University, Philadelphia, USA.

[464] Kim, J.K. (2001). Variance estimation after imputation. Survey Methodology, 27, 75--83.

[465] Kiregyera, B. (1980). A chain ratio type estimator in finite population: double sampling using two auxiliary variables. Metrika, 27, 217--223.

[466] Kiregyera, B. (1984). Regression type estimators using two auxiliary variables and the model of double sampling from finite populations. Metrika, 31, 215--226.

[467] Kish, L. and Hess, I. (1959). A replacement procedure for reducing the bias of non-response. American Statistician, 13, 17--19.

[468] Kokan, A.R. and Khan, S.U. (1967). Optimum allocation in multivariate surveys: An analytical solution. J. R. Statist. Soc., B, 2, 115--125.

[469] Kolmogorov, A.N. (1942). Sur l'estimation statistique des paramètres de la loi de Gauss. Izv. Akad. Nauk SSSR Ser. Mat., 6, 3--32.

[470] Konijn, H.S. (1973). Statistical Theory of Sample Survey Design and Analysis. North-Holland Publishing Company.

[471] Konijn, H.S. (1979). Model free evaluation of the bias and the mean square error of the regression estimator. Sankhyā, C, 41, 69--75.

[472] Konijn, H.S. (1981). Biases, variances and co-variances of raking ratio estimators for marginal and cell totals and averages of observed characteristics. Metrika, 28, 109--121.

[473] Koop, J.C. (1967). Replicated (or interpenetrating) samples of unequal sizes. Ann. Math. Statist., 38, 1142--1147.

[474] Koop, J.C. (1971). On splitting a systematic sample for variance estimation. Ann. Math. Statist., 42(3), 1084--1087.

[475] Korn, E.L. and Graubard, B.I. (1998). Variance estimation for superpopulation parameters. Statistica Sinica, 8, 1131--1151.

[476] Kossack, C.F. and Shiledar-Bax, H.R. (1971). On designing of a unit-stratified survey design for discrete set of observations. Int. Statist. Rev., 39, 46--56.

[477] Kothwala, N.H. and Gupta, P.C. (1989). Estimation of population mean with knowledge of coefficient of variation with p-auxiliary variables. Metron, 107--119.

[478] Kott, P.S. (1988). Model based finite population correction for the Horvitz and Thompson estimator. Biometrika, 75, 797--799.

[479] Kott, P.S. and Stukel, D.M. (1997). Can the Jackknife be used with a two-phase sample? Survey Methodology, 23, 81--89.

[480] Kowar, R.M. (1996). One pass selection of a sample with probability proportional to aggregate size. Sankhyā, B, 58, 80--83.

[481] Krewski, D. and Chakrabarty, R.P. (1981). On the stability of the Jackknife variance estimator in ratio estimation. J. Statist. Planning Infer., 5, 71--78.

[482] Krewski, D. and Rao, J.N.K. (1981). Inference from stratified samples: properties of the linearization, Jackknife and balanced repeated replication methods. Ann. Statist., 9, 1010--1019.

[483] Kuhn, H.W. and Tucker, A.W. (1952). Non-linear programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.

[484] Kuk, A.Y.C. (1990). Asking sensitive questions indirectly. Biometrika, 77(2), 436--438.

[485] Kuk, A.Y.C. and Mak, T.K. (1989). Median estimation in the presence of auxiliary information. J. R. Statist. Soc., B, 51, 261--269.

[486] Kuk, A.Y.C. and Mak, T.K. (1994). A functional approach to estimating finite population distribution functions. Commun. Statist. -- Theory Meth., 23(3), 883--896.

[487] Kulldorff, G. (1963). Some problems of optimum allocation for sampling on two occasions. Rev. Inter. Statist. Inst., 31, 24--57.

[488] Kumar, E.V. and Srivenkataramana, T. (1994). A generalization of Midzuno--Sen sampling scheme for finite populations. Commun. Statist. -- Theory Meth., 23(9), 2541--2559.

[489] Kumar, E.V., Srivenkataramana, T. and Srinath, K.P. (1996). Use of ranks in probability proportional to size sampling. Commun. Statist. -- Theory Meth., 25(12), 3195--3215.

[490] Kumar, P. and Agarwal, S.K. (1997). Alternative estimators for the population totals in multiple characteristic survey. Commun. Statist. -- Theory Meth., 26(10), 2527--2537.

[491] Kumar, P., Gupta, V.K. and Nigam, A.K. (1985). On inclusion probability proportional to size sampling scheme. J. Statist. Planning Infer., 12, 127--131.

[492] Kumar, P. and Herzel, A. (1988). Estimating population totals in surveys involving multi-characters. Metron, 33--47.

[493] Kumar, P., Srivastava, A.K. and Agarwal, S.K. (1986). A general class of unequal probability sampling schemes. Statistica, 46, 67--74.

[494] Kumar, S. and Lee, H. (1983). Evaluation of composite estimation for the Canadian Labour Force Survey. Survey Methodology, 9, 1--24.

[495] Laake, P. (1986). Optimal estimates and optimal predictors of finite population characteristics in the presence of non-response. Metrika, 33, 69--77.

[496] Lahiri, D.B. (1951). A method for sample selection providing unbiased ratio estimates. Bull. Int. Statist. Inst., 33(2), 133--140.

[497] Lahiri, P. and Rao, J.N.K. (1995). Robust estimation of mean squared error of small area estimators. J. Amer. Statist. Assoc., 90, 758--766.

[498] Laird, N.M. and Louis, T.A. (1987). Empirical Bayes confidence intervals based on bootstrap samples. J. Amer. Statist. Assoc., 82, 739--750.

[499] Lakshmi, D.V. and Raghavarao, D. (1992). A test for detecting untruthful answering in randomized response procedure. J. Statist. Planning Infer., 31, 387--390.

[500] Lanke, J. (1975). On the choice of the unrelated question in Simmons' version of randomised response. J. Amer. Statist. Assoc., 70, 80--83.

[501] Lanke, J. (1976). On the degree of protection in randomized interviews. Int. Statist. Rev., 44, 197--203.

[502] Lee, H. and Kim, J.K. (2002). Jackknife variance estimation for two-phase samples with high sampling fractions. Joint Statistical Meetings, NY -- Section on Survey Research Methods, 2024--2028.

[503] Lee, H., Rancourt, E. and Sarndal, C.E. (1994). Experiments with variance estimation from survey data with imputed values. J. Official Statist., 10(3), 231--243.

[504] Lee, H., Rancourt, E. and Sarndal, C.E. (1995a). Variance estimation in the presence of imputed data for the generalized estimation system. Proc. of the American Statist. Assoc. (Social Survey Research Methods Section), 384--389.

[505] Lee, H., Rancourt, E. and Sarndal, C.E. (1995b). Jackknife variance estimation for data with imputed values. Statistical Society of Canada, Proceedings of the Survey Methods Section, 111--115.

[506] Lent, J., Miller, S.M. and Cantwell, P.J. (1996). Effect of composite weight on some estimates from the current population surveys. Proc. of the Section on Survey Research Methods, Amer. Statist. Assoc., 130--139.

[507] Leysieffer, F.W. and Warner, S.L. (1976). Respondent jeopardy and optimal designs in randomized response models. J. Amer. Statist. Assoc., 71, 649--656.

[508] Linacre, S.J. and Trewin, D.J. (1993). Total survey design application to a collection of the construction industry. J. Official Statist., 9, 611--621.

[509] Lindley, D.V. and Deely, J.J. (1993). Optimum allocation in stratified sampling with partial information. Test, 2(1), 147--160.

[510] Little, R.J.A. and Yao, L. (1996). Intent to treat analysis for longitudinal studies with drop outs. Biometrics, 52, 1324--1333.

[511] Lohr, S.L. and Rao, J.N.K. (1998). Jackknife variance estimation in dual frame surveys. Tech. Rep., Laboratory for Research in Statistics and Probability, Carleton University.

[512] Lohr, S.L. and Rao, J.N.K. (2000). Inference from dual frame surveys. J. Amer. Statist. Assoc., 95, 271--280.

[513] Lundstrom, S. (1997). Calibration as a standard method for treatment of non-response. Ph.D. Thesis.

[514] Lundstrom, S. and Sarndal, C.E. (1999). Calibration as a standard method for treatment of non-response. J. Official Statist., 15(2), 305--327.

[515] MacGibbon, B. and Tomberlin, T.J. (1989). Small area estimates of proportion via empirical Bayes techniques. Survey Methodology, 15(2), 237--252.

[516] Madow, W.G. (1949). On the theory of systematic sampling -- II. Ann. Math. Statist., 20, 333--354.

[517] Madow, W.G. (1953). On the theory of systematic sampling -- III. Ann. Math. Statist., 24, 101--106.

[518] Madow, W.G. and Madow, L.H. (1944). On the theory of systematic sampling -- I. Ann. Math. Statist., 15, 1--24.

[519] Mahajan, P.K. and Singh, S. (1996). On estimation of total in two stage sampling. J. Statist. Res., 30, 127--131.

[520] Mahajan, P.K. and Singh, S. (1997). Almost unbiased ratio and product type estimators: A new approach. Biom. J., 39(3), 509--516.

[521] Mahalanobis, P.C. (1940). A sample survey of acreage under jute in Bengal. Sankhyā, 4, 511--530.

[522] Mahalanobis, P.C. (1942). General report on the sample census of area under jute in Bengal. Indian Central Jute Committee.

[ 523 ] Mahalanobis, P.C. (1944). On large scale sample surveys. Phil. Transac. Roy. Soc., London, B, 231, 324--351.

[ 524 ] Mahalanobis, P.C. (1946). Recent developments in statistical sampling in the Indian Statistical Institute. J. R. Statist. Soc., 109, 326--378.

[ 525 ] Mahmood, M., Singh, S. and Horn, S. (1998). On the confidentiality guaranteed under randomized response sampling: A comparison with several new techniques. Biom. J., 40(2), 237--242.

[ 526 ] Mak, T.K. and Kuk, A.Y.C. (1993). A new method for estimating finite population quantiles using auxiliary information. Canad. J. Statist., 21(1), 29--38.

[ 527 ] Malec, D., Sedransk, J., Moriarity, C.L. and LeClere, F.B. (1997). Small area inference for binary variables in the National Health Interview Survey. J. Amer. Statist. Assoc., 92, 815--826.

[ 528 ] Mandowara, V.L. and Gupta, P.C. (1999). Contribution to optimum points of stratification for multi-stage designs. Metron, 57, 51--66.

[ 529 ] Mangat, N.S. (1991). An optional randomized response sampling technique using non-stigmatized attribute. Statistica, LI, 595--602.

[ 530 ] Mangat, N.S. (1992). Two stage randomized response sampling procedure using unrelated question. J. Indian Soc. Agril. Statist., 44(1), 82--87.

[ 531 ] Mangat, N.S. (1993). Estimation of population total using an alternative estimator for RHC scheme. Statistica, 53, 251--259.

[ 532 ] Mangat, N.S. (1994). An improved randomized response strategy. J. R. Statist. Soc., B, 56, 93--95.

[ 533 ] Mangat, N.S. and Singh, R. (1990). An alternative randomized response procedure. Biometrika, 77, 439--442.

[ 534 ] Mangat, N.S. and Singh, R. (1991a). An alternative randomized response procedure for sampling without replacement. J. Indian Statist. Assoc., 29(2), 127--131.

[ 535 ] Mangat, N.S. and Singh, R. (1991b). An alternative approach to randomized response survey. Statistica, 51(3), 327--332.

[ 536 ] Mangat, N.S. and Singh, R. (1992--93). Sampling with varying probabilities without replacement: A review. Aligarh J. Statist., 12 & 13, 75--106.

[ 537 ] Mangat, N.S. and Singh, R. (1995). A note on the inverse binomial randomized response procedure. J. Indian Soc. Agril. Statist., 47(1), 21--25.

[ 538 ] Mangat, N.S., Singh, R. and Singh, S. (1991). Alternative estimators in randomised response technique. Aligarh J. Statist., 11, 75--80.

[ 539 ] Mangat, N.S., Singh, R. and Singh, S. (1992). An improved unrelated question randomized response strategy. Calcutta Statist. Assoc. Bull., 42, 277--281.

[ 540 ] Mangat, N.S., Singh, R. and Singh, S. (1995). Unrelated question randomised response model without randomization device. Estadistica, 47, 59--68.

[ 541 ] Mangat, N.S., Singh, R. and Singh, S. (1997). Violation of respondent's privacy in Moors' model -- its rectification through a random group strategy. Commun. Statist. -- Theory Meth., 26(3), 743--754.
Bibliography 1159

[ 542 ] Mangat, N.S., Singh, R., Singh, S., Bellhouse, D.R. and Kashani, H.B. (1995). On efficiency of estimator using distinct respondents in randomised response survey. Survey Methodology, 21(1), 21--23.

[ 543 ] Mangat, N.S., Singh, R., Singh, S. and Singh, B. (1993). On Moors' randomised response model. Biom. J., 35(6), 727--732.

[ 544 ] Mangat, N.S. and Singh, S. (1994). An optional randomized response sampling technique. J. Indian Statist. Assoc., 32, 71--75.

[ 545 ] Mangat, N.S., Singh, S. and Singh, R. (1993). On the use of a modified randomization device in randomised response inquiries. Metron, 51(1), 211--216.

[ 546 ] Mangat, N.S., Singh, S. and Singh, R. (1995). On use of a modified randomization device in Warner's model. J. Indian Soc. Statist. Opers. Res., 16, 65--69.

[ 547 ] Manisha and Singh, R.K. (2001). An estimation of population mean in the presence of measurement errors. J. Indian Soc. Agric. Statist., 54(1), 13--18.

[ 548 ] Manwani, A.H. and Singh, K.B. (1978). Studies in systematic sampling for two-dimensional finite population with special reference to survey for estimation of guavas. J. Indian Soc. Agric. Statist., 30(1), 82--93.

[ 549 ] Marker, D.A. (1983). Organization of small area estimators. Proc. Survey Research Method Section, Amer. Statist. Assoc., Washington, D.C., 409--414.

[ 550 ] Mayor, I.A. (2002). Optimum cluster selection probabilities to estimate the finite population distribution function under PPS cluster sampling. Test, 11(1), 73--88.

[ 551 ] McCarthy, M.D. (1939). On the application of the z-test to randomized blocks. Ann. Math. Statist., 10, 337.

[ 552 ] McCarthy, P.J. (1969). Pseudo replication: Half samples. Rev. Int. Statist. Inst., 37, 239--264.

[ 553 ] McLeod, A.I. and Bellhouse, D.R. (1983). A convenient algorithm for drawing a simple random sample. Appl. Statist., 32, 182--184.

[ 554 ] Meeden, G. (1992). Basu's contribution to the foundations of sample survey. Current issues in statistical inference: Essays in Honor of D. Basu (eds. Ghosh and Pathak). Lecture Notes -- Monograph Series, Institute of Mathematical Statistics, Hayward, California, 17, 178--186.

[ 555 ] Meeden, G. (2000). A decision theoretic approach to imputation in finite population sampling. J. Amer. Statist. Assoc., 95, 586--595.

[ 556 ] Meeden, G. and Ghosh, M. (1981). Admissibility in finite problems. Ann. Statist., 9, 846--852.

[ 557 ] Meeden, G. and Ghosh, M. (1983). Choosing between experiments: applications to finite population sampling. Ann. Statist., 11, 296--305.

[ 558 ] Meng, X.L. (1994). Multiple imputation inferences with uncongenial sources of input (with discussion). Statist. Sci., 9, 538--573.

[ 559 ] Mickey, M.R. (1959). Some finite population unbiased ratio and regression estimators. J. Amer. Statist. Assoc., 54, 594--612.

[ 560 ] Midha, C.K. (1980). Contribution to survey sampling and design of experiments. Unpublished Ph.D. thesis, Iowa State University, Ames, Iowa.

[ 561 ] Midzuno, H. (1952). On the sampling system with probability proportional to sum of sizes. Ann. Inst. Statist. Math., 3, 99--107.

[ 562 ] Miller, R.G. (1974). An unbalanced jackknife. Ann. Statist., 2, 880--891.

[ 563 ] Milne, A. (1959). The centric systematic area sample treated as a random sample. Biometrics, 15, 270--297.

[ 564 ] Mishra, G. and Rout, K. (1997). A regression estimator in two-phase sampling in presence of two auxiliary variables. Metron, 55, 177--186.

[ 565 ] Mishra, R.N. and Sinha, J.N. (1999). Randomized response procedure with multiple statement. Aligarh J. Statist., 19, 1--9.

[ 566 ] Mohanty, S. (1977). Sampling with repeated units. Sankhyā, C, 39, 43--46.

[ 567 ] Mohanty, S. and Sahoo, L.N. (1987). A class of estimators based on mean per unit ratio estimators. Statistica, 47, 473--477.

[ 568 ] Mohanty, S. and Sahoo, J. (1995). A note on improving the ratio method of estimation through linear transformation using certain known population parameters. Sankhyā, B, 57, 93--102.

[ 569 ] Mohanty, S. and Pattanaik, L.M. (1984). Alternative multivariate ratio estimators using geometric and harmonic means. J. Indian Soc. Agric. Statist., 36, 100--118.

[ 570 ] Montanari, G.E. (1998). On regression estimation of finite population means. Survey Methodology, 24(1), 69--77.

[ 571 ] Montanari, G.E. (1999). A study on the conditional properties of finite population mean estimators. Metron, 57, 21--35.

[ 572 ] Moors, J.J.A. (1971). Optimization of the unrelated question randomized response model. J. Amer. Statist. Assoc., 66, 627--629.

[ 573 ] Moors, J.J.A. (1997). A critical evaluation of Mangat's two-step procedure in randomized response. Discussion paper at Center for Economic Research, Tilburg University, The Netherlands.

[ 574 ] Moors, J.J.A., Smeets, R. and Boekema, F.W.M. (1998). Sampling with probabilities proportional to the variable of interest. Statistica Neerlandica, 52, 129--140.

[ 575 ] Morrison, T., Mangat, N.S., Desjardins, G. and Bhatia, A. (2000). Validation of an in-line inspection metal loss tool. Proceedings of the International Pipeline Conference 2000, Calgary, Alberta, Canada. The American Society of Mechanical Engineers, New York, Vol. 2, 839--844.

[ 576 ] Morrison, T., Mangat, N.S., Carroll, L.B. and Riznic, J. (2002). Statistical estimation of flaw size measurement errors for steam generator inspection tools. Proceedings of the 4th International Steam Generator Conference, Canadian Nuclear Society, Toronto, Ontario, May 5--8.

[ 577 ] Morrison, T., Mangat, N.S., Carroll, L.B. and Riznic, J. (2003). Statistical estimation of flaw size measurement errors for steam generator tube inspection tools. Submitted to Nuclear Engineering and Design.

[ 578 ] Moses, L.E. (1978). Energy information validation -- A status report. Proc. of the 1978 DOE Statistical Symposium, 33--49.

[ 579 ] Moura, F.A.S. and Holt, D. (1999). Small area estimation using multilevel models. Survey Methodology, 25(1), 73--80.

[ 580 ] Mukerjee, R. and Chaudhuri, A. (1990). Asymptotic optimality of double sampling plans employing generalized regression estimators. J. Statist. Planning Infer., 26, 173--183.

[ 581 ] Mukerjee, R. and Sengupta, S. (1990). Optimal estimation of a finite population mean in the presence of linear trend. Biometrika, 77, 625--630.

[ 582 ] Mukerjee, R., Rao, T.J. and Vijayan, K. (1987). Regression type estimator using multiple auxiliary information. Austral. J. Statist., 29(3), 244--254.

[ 583 ] Mukerjee, R., Rao, T.J. and Vijayan, K. (2000). Rejoinder to Ahmed, M.S. (1998): A note on regression type estimators using multiple auxiliary information. Austral. & New Zealand J. Statist., 42(2), 245.

[ 584 ] Mukhopadhyay, P. (1977). Further studies in sampling theory. Unpublished Ph.D. thesis submitted to the University of Calcutta.

[ 585 ] Mukhopadhyay, P. (1982). Optimum strategies for estimating the variance of a finite population under a superpopulation model. Metrika, 29, 143--158.

[ 586 ] Mukhopadhyay, P. (1994). Prediction in finite population under error in variables superpopulation models. J. Statist. Planning Infer., 41, 151--161.

[ 587 ] Mukhopadhyay, P. and Bhattacharyya, S. (1990--91). Estimating a finite population variance under some general linear models with exchangeable errors. Calcutta Statist. Assoc. Bull., 40, 219--228.

[ 588 ] Murthy, M.N. (1957). Ordered and unordered estimators in sampling without replacement. Sankhyā, 18, 379--390.

[ 589 ] Murthy, M.N. (1961). Introduction to sampling theory: Lecture Notes. Indian Statistical Institute.

[ 590 ] Murthy, M.N. (1962). Almost unbiased estimators based on interpenetrating sub-samples. Sankhyā, 303--314.

[ 591 ] Murthy, M.N. (1963). Generalized unbiased estimation for finite populations. Sankhyā, B, 25, 245--261.

[ 592 ] Murthy, M.N. (1964). Product method of estimation. Sankhyā, A, 26, 69--74.

[ 593 ] Murthy, M.N. (1967). Sampling theory and methods. Statistical Publishing Society, Calcutta.

[ 594 ] Murthy, M.N. (1977). Sampling theory and methods. Second edition, Statistical Publishing Society, Calcutta.

[ 595 ] Murthy, M.N. and Nanjamma, N.S. (1959). Almost unbiased ratio estimates based on interpenetrating sub-sample estimates. Sankhyā, 21, 381--392.

[ 596 ] Murthy, M.N. and Singh, M.P. (1969). On the concepts of best and admissible estimators in sampling theory. Sankhyā, A, 31, 343--354.

[ 597 ] Nanjamma, N.S., Murthy, M.N. and Sethi, V.K. (1959). Some sampling systems providing unbiased ratio estimators. Sankhyā, 21, 299--314.

[ 598 ] Narain, R.D. (1951). On sampling without replacement with varying probabilities. J. Indian Soc. Agril. Statist., 3, 169--174.

[ 599 ] Nayak, T.K. (1994). On randomized response surveys for estimating a proportion. Commun. Statist. -- Theory Meth., 23(1), 3303--3321.

[ 600 ] Nelson, D. and Meeden, G. (1998). Using prior information about population quantiles in finite population sampling. Sankhyā, A, 60, 426--445.

[ 601 ] Neyman, J. (1923). On the application of probability theory to agricultural experiments: essay on principles. Statistical Science, 5, 465--480.

[ 602 ] Neyman, J. (1934). On two different aspects of the representative methods, the method of stratified sampling and the method of purposive selection. J. R. Statist. Soc., 97, 558--606.

[ 603 ] Neyman, J. (1938). Contribution to the theory of sampling human populations. J. Amer. Statist. Assoc., 33, 101--116.

[ 604 ] Neyman, J. (1971). Discussion of Royall (1971): Foundations of Statistical Inference (V.P. Godambe and D.A. Sprott, eds). Holt, Rinehart & Winston, Toronto, 276--278.

[ 605 ] Nieto de Pascual, J. (1961). Unbiased ratio estimates in stratified sampling. J. Amer. Statist. Assoc., 56, 70--87.

[ 606 ] Ogus, J.K. and Clark, D.F. (1971). The annual survey of manufacturers: A report on methodology. Technical Report No. 24, U.S. Bureau of Census, Washington, D.C.

[ 607 ] Ohlsson, E. (1989). Variance estimation in Rao--Hartley--Cochran procedure. Sankhyā, B, 51, 348--361.

[ 608 ] Okafor, F.C. (1992). The theory and application of sampling over two occasions for the estimation of current population ratio. Statistica, 1, 137--147.

[ 609 ] Okafor, F.C. (2001). Treatment of non-response in successive sampling. Statistica, 1, 195--204.

[ 610 ] Okafor, F.C. and Arnab, R. (1987). Some strategies of two-stage sampling for estimating population ratios over two occasions. Austral. J. Statist., 29(2), 128--142.

[ 611 ] Okafor, F.C. and Lee, H. (2000). Double sampling for ratio and regression estimation with sub-sampling the non-respondents. Survey Methodology, 26, 183--188.

[ 612 ] Olkin, I. (1958). Multivariate ratio estimation for finite population. Biometrika, 43, 154--163.

[ 613 ] Padmawar, V.R. (1994). Strategies admitting non-negative unbiased variance estimators. J. Statist. Planning Infer., 40, 81--95.

[ 614 ] Padmawar, V.R. (1996). Rao--Hartley--Cochran strategy in survey sampling of continuous populations. Sankhyā, B, 58, 90--104.

[ 615 ] Padmawar, V.R. (1998a). On estimating non-negative definite quadratic forms. Metrika, 49, 231--244.

[ 616 ] Padmawar, V.R. (1998b). On πPS designs and stratification. J. Indian Statist. Assoc., 36, 99--104.

[ 617 ] Paik, M.C. (1997). The generalized estimating equation approach when data are not missing completely at random. J. Amer. Statist. Assoc., 92, 1320--1329.

[ 618 ] Panda, P. and Sahoo, L.N. (1999). Predictive estimation of finite population mean using a product estimator for two-stage sampling. Biom. J., 41(1), 93--97.

[ 619 ] Pandey, B.N. and Dubey, V. (1989). On almost unbiased estimators. Metron, 333--338.

[ 620 ] Pandey, S.K. and Singh, R.K. (1984). On combination of ratio and PPS estimators. Biom. J., 26(3), 333--336.

[ 621 ] Patel, H.C. and Dharmadhikari, S.W. (1978). Admissibility of Murthy and Midzuno's estimators within the class of linear unbiased estimators of finite population totals. Sankhyā, C, 40, 21--28.

[ 622 ] Pathak, P.K. (1961). On the evaluation of moments of distinct units in a sample. Sankhyā, A, 23, 415--420.

[ 623 ] Pathak, P.K. (1962). On simple random sampling with replacement. Sankhyā, A, 24, 287--302.

[ 624 ] Pathak, P.K. (1966). An estimator in PPS sampling for multiple characteristics. Sankhyā, A, 28, 35--40.

[ 625 ] Pathak, P.K. (1967a). Asymptotic efficiency of Des Raj's strategy -- I. Sankhyā, A, 29, 283--298.

[ 626 ] Pathak, P.K. (1967b). Asymptotic efficiency of Des Raj's strategy -- II. Sankhyā, A, 29, 299--304.

[ 627 ] Pathak, P.K. and Rao, T.J. (1967). Inadmissibility of customary estimators in sampling over two occasions. Sankhyā, A, 29, 49--54.

[ 628 ] Patterson, H.D. (1950). Sampling on successive occasions with partial replacement of units. J. R. Statist. Soc., 241--255.

[ 629 ] Pearson, K. (1896). Mathematical contribution to the theory of evolution -- III -- Regression, heredity and panmixia. Phil. Trans. (A), Royal Soc. London, 187, 253--318.

[ 630 ] Pedgaonkar, A.M. and Prabhu--Ajgaonkar, S.G. (1978). Comparison of sampling strategies. Metrika, 25, 149--153.

[ 631 ] Pfeffermann, D. (1984). A note on large sample properties of balanced samples. J. R. Statist. Soc., B, 46, 38--41.

[ 632 ] Pitman, E.J.G. (1937). Significance tests which can be applied to samples from any population -- III: The analysis of variance test. Biometrika, 29, 322--335.

[ 633 ] Platek, R. and Grey, G.B. (1983). Imputation methodology: total survey error. In Incomplete data in sample surveys, 2, Ed. W.G. Madow, I. Olkin, and D.B. Rubin, 249--333. New York: Academic Press.

[ 634 ] Pokropp, F. (2001). Imposed linear structures in conventional sampling theory. Allgemeines Statist. Archiv, 86, 333--352.

[ 635 ] Politz, A. and Simmons, W. (1950). Note on an attempt to get not at home into the sample without callbacks. J. Amer. Statist. Assoc., 45, 136--137.

[ 636 ] Pollock, K.H. and Bek, Y. (1976). A comparison of three randomized response models for quantitative data. J. Amer. Statist. Assoc., 71, 884--886.

[ 637 ] Prabhu--Ajgaonkar, S.G. (1975). The efficient use of supplementary information in double sampling procedures. Sankhyā, C, 37, 181--189.

[ 638 ] Pradhan, B.K. (2001). Modified chain regression estimators using multi auxiliary information. Statistica, no. 1, 249--258.

[ 639 ] Prasad, B. (1989). Some improved ratio type estimators of population mean and ratio in finite population sample surveys. Commun. Statist. -- Theory Meth., 18(1), 379--392.

[ 640 ] Prasad, B. and Singh, H.P. (1990). Some improved ratio type estimators of finite population variance in sample surveys. Commun. Statist. -- Theory Meth., 19, 1127--1139.

[ 641 ] Prasad, B. and Singh, H.P. (1992). Unbiased estimators of finite population variance using auxiliary information in sample surveys. Commun. Statist. -- Theory Meth., 21(5), 1367--1376.

[ 642 ] Prasad, B., Singh, R.S. and Singh, H.P. (1996). Some chain ratio type estimators for ratio of two population means using two auxiliary characters in two phase sampling. Metron, 95--113.

[ 643 ] Prasad, N.G.N. and Graham, J.B. (1994). PPS sampling over two occasions. Survey Methodology, 20, 59--64.

[ 644 ] Prasad, N.G.N. and Rao, J.N.K. (1999). On robust small area estimation using a simple random effects model. Survey Methodology, 25, 67--72.

[ 645 ] Prasad, N.G.N. and Srivenkataramana, T.A. (1980). A modification to Horvitz--Thompson estimator under Midzuno sampling scheme. Biometrika, 67, 709--711.

[ 646 ] Purcell, N.J. and Kish, L. (1979). Estimation for small domain. Biometrics, 35, 365--384.

[ 647 ] Purcell, N.J. and Kish, L. (1980). Postcensal estimates for local areas (or domains). Int. Statist. Rev., 48, 3--18.

[ 648 ] Quenouille, M.H. (1949). Problems in plane sampling. Ann. Math. Statist., 20, 355--375.

[ 649 ] Quenouille, M.H. (1956). Notes on bias in estimation. Biometrika, 43, 353--360.

[ 650 ] Raghavarao, D. (1971). Constructions and combinatorial problems in design of experiments. Wiley, New York.

[ 651 ] Raghunandanan, K. and Bryant, E.C. (1971). Variance in multi-way stratification. Sankhyā, A, 33, 221--226.

[ 652 ] Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Boston.

[ 653 ] Raj, D. (1954a). On sampling probabilities proportional to size. Ganita, 52, 175--182.

[ 654 ] Raj, D. (1954b). Ratio estimator in sampling with equal and unequal probabilities. J. Indian Soc. Agril. Statist., 6, 127--138.

[ 655 ] Raj, D. (1956). Some estimators in sampling with varying probabilities without replacement. J. Amer. Statist. Assoc., 51, 269--284.

[ 656 ] Raj, D. (1958). On the relative accuracy of some sampling techniques. J. Amer. Statist. Assoc., 53, 98--101.

[ 657 ] Raj, D. (1964). On double sampling for pps estimation. Ann. Math. Statist., 35, 900--902.

[ 658 ] Raj, D. (1965a). On a method of using multi-auxiliary information in sample surveys. J. Amer. Statist. Assoc., 60, 270--277.

[ 659 ] Raj, D. (1965b). On sampling over two occasions with probability proportionate to size. Ann. Math. Statist., 36, 327--330.

[ 660 ] Raj, D. (1966). Some remarks on a simple procedure of sampling without replacement. J. Amer. Statist. Assoc., 61, 391--397.

[ 661 ] Raj, D. (1968). Sampling Theory. McGraw-Hill, New York.

[ 662 ] Raj, D. and Khamis, S.H. (1958). Some remarks on sampling with replacement. Ann. Math. Statist., 29, 550--557.

[ 663 ] Ramachandran, G. (1982). Horvitz and Thompson estimator and generalized πPS designs. J. Statist. Planning Infer., 7, 151--153.

[ 664 ] Ramachandran, G. and Rao, T.J. (1974). Allocation to strata and relative efficiencies of stratified and unstratified πps sampling schemes. J. R. Statist. Soc., 97, 558--606.

[ 665 ] Ramachandran, V. and Pillai, S.S. (1976). Multivariate unbiased ratio type estimation for finite sampling. J. Indian Soc. Agril. Statist., 28, 71--80.

[ 666 ] Ramakrishnan, M.K. (1969). Some remarks on the comparison of sampling with and without replacement. Sankhyā, A, 31, 333--342.

[ 667 ] Ramakrishnan, M.K. (1975a). A generalisation of the Yates--Grundy variance estimator. Sankhyā, C, 37, 204--206.

[ 668 ] Ramakrishnan, M.K. (1975b). Choice of an optimum sampling strategy -- I. Ann. Statist., 3, 669--679.

[ 669 ] Ramakrishnan, M.K. and Rao, V.V.B. (1975). On the sample mean in simple random sampling without replacement. Sankhyā, C, 37, 207--210.

[ 670 ] Rana, R.S. (1989). Concise estimator of bias and variance of the finite population correlation coefficient. J. Indian Soc. Agril. Statist., 41, 69--76.

[ 671 ] Rana, R.S. and Singh, R. (1989). Note on systematic sampling with supplementary observations. Sankhyā, B, 51, 205--211.

[ 672 ] Rangarajan, R. (1957). A note on two stage sampling. Sankhyā, 17, 373--376.

[ 673 ] Rao, C.R. (1945). Information and accuracy obtainable in an estimation of a statistical parameter. Bull. Calcutta Math. Soc., 37, 81.

[ 674 ] Rao, C.R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.

[ 675 ] Rao, C.R. (1975). Some problems of sample surveys. Proceedings of the conference on directions for mathematical statistics. University of Alberta, Edmonton, Canada.

[ 676 ] Rao, C.R. (1987). Strategies of data analyst. Proceedings of 46th Session of the International Statistical Institute, Tokyo.

[ 677 ] Rao, J.N.K. (1961). On sampling with varying probabilities in sub-sampling designs. J. Indian Soc. Agric. Statist., 13, 211--217.

[ 678 ] Rao, J.N.K. (1963a). On two systems of unequal probability sampling without replacement. Ann. Inst. Statist. Math., 15, 67--72.

[ 679 ] Rao, J.N.K. (1963b). On three procedures of unequal probability sampling without replacement. J. Amer. Statist. Assoc., 58, 202--215.

[ 680 ] Rao, J.N.K. (1965a). A note on estimation of ratios by Quenouille's method. Biometrika, 52, 647--649.

[ 681 ] Rao, J.N.K. (1965b). On two simple schemes of unequal probability sampling without replacement. J. Indian Soc. Agric. Statist., 3, 169--174.

[ 682 ] Rao, J.N.K. (1966a). Alternative estimators in PPSWR sampling for multiple characteristics. Sankhyā, A, 28, 47--60.

[ 683 ] Rao, J.N.K. (1966b). On the relative efficiency of some estimators in PPS sampling for multiple characteristics. Sankhyā, A, 28, 61--70.

[ 684 ] Rao, J.N.K. (1966c). On the comparison of sampling with and without replacement. Rev. Int. Statist. Inst., 34, 125--138.

[ 685 ] Rao, J.N.K. (1967). The precision of Mickey's unbiased ratio estimator. Biometrika, 54, 93--108.

[ 686 ] Rao, J.N.K. (1968a). Some small sample results in ratio and regression estimation. J. Indian Statist. Assoc., 6, 160--168.

[ 687 ] Rao, J.N.K. (1968b). Some non-response sampling theory when the frame contains an unknown amount of duplication. J. Amer. Statist. Assoc., 63, 87--90.

[ 688 ] Rao, J.N.K. (1969). Ratio and regression estimators. In New developments in survey sampling, eds. N.L. Johnson and H. Smith, New York: John Wiley.

[ 689 ] Rao, J.N.K. (1975). Unbiased variance estimation for multi-stage designs. Sankhyā, C, 37, 133--139.

[ 690 ] Rao, J.N.K. (1979). On deriving the mean square errors and their non-negative unbiased estimators in finite population sampling. J. Indian Soc. Agric. Statist., 17, 125--136.

[ 691 ] Rao, J.N.K. (1985). Conditional inference in survey sampling. Survey Methodology, 11, 15--31.

[ 692 ] Rao, J.N.K. (1989). A note on Narain's necessary condition in sampling. J. Indian Soc. Agric. Statist., 41, 316--317.

[ 693 ] Rao, J.N.K. (1994). Estimating totals and distribution functions using auxiliary information at the estimation stage. J. Official Statist., 10(2), 153--165.

[ 694 ] Rao, J.N.K. (1996a). Some current topics in sample survey theory. J. Indian Soc. Agril. Statist., 50, 244--263.

[ 695 ] Rao, J.N.K. (1996b). On variance estimation with imputed survey data. J. Amer. Statist. Assoc., 91, 499--506.

[ 696 ] Rao, J.N.K. (1997). Developments in sample survey theory: an appraisal. Canad. J. Statist., 25, 1--21.

[ 697 ] Rao, J.N.K. (1999a). Some current trends in sample survey theory and methods. Sankhyā, B, 61, 1--57.

[ 698 ] Rao, J.N.K. (1999b). Reply to comments on 'some current trends in sample survey theory and methods'. Sankhyā, B, 61, 53--57.

[ 699 ] Rao, J.N.K. (1999c). Some recent advances in model based small area estimation. Survey Methodology, 25, 175--186.

[ 700 ] Rao, J.N.K. (2000). Conditional inference for large and small areas. J. Indian Statist. Assoc., 38(2), 383--398.

[ 701 ] Rao, J.N.K. (2002). Discussion of 'Exact linear unbiased estimation in survey sampling'. J. Statist. Planning Infer., 102, 39--40.

[ 702 ] Rao, J.N.K. (2003). Small area estimation. John Wiley and Sons, NY.

[ 703 ] Rao, J.N.K. and Bayless, D.L. (1969). An empirical study of the stabilities of estimators and variance estimators in unequal probability sampling of two units per stratum. J. Amer. Statist. Assoc., 64, 540--559.

[ 704 ] Rao, J.N.K. and Beegle, L.D. (1967). A Monte Carlo study of some ratio estimators. Sankhyā, B, 29, 47--56.

[ 705 ] Rao, J.N.K. and Bellhouse, D.R. (1978). Optimal estimation of a finite population mean under generalized random permutation models. J. Statist. Planning Infer., 2, 125--141.

[ 706 ] Rao, J.N.K. and Graham, J.B. (1964). Rotation designs for sampling on repeated occasions. J. Amer. Statist. Assoc., 59, 492--509.

[ 707 ] Rao, J.N.K., Hartley, H.O. and Cochran, W.G. (1962). A simple procedure of unequal probability sampling without replacement. J. R. Statist. Soc., B, 24, 482--491.

[ 708 ] Rao, J.N.K. and Lanke, J. (1984). Simplified unbiased variance estimation for multistage designs. Biometrika, 71, 387--395.

[ 709 ] Rao, J.N.K. and Rao, P.S.R.S. (1971). Small sample results for ratio estimators. Biometrika, 58, 625--630.

[ 710 ] Rao, J.N.K. and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811--822.

[ 711 ] Rao, J.N.K. and Shao, J. (1999). Modified balanced repeated replication for complex survey data. Biometrika, 86, 403--415.

[ 712 ] Rao, J.N.K. and Singh, M.P. (1973). On the choice of estimator in survey sampling. Austral. J. Statist., 15(2), 95--104.

[ 713 ] Rao, J.N.K. and Sitter, R.R. (1995). Variance estimation under two-phase sampling with application to imputation for missing data. Biometrika, 82, 453--460.

[ 714 ] Rao, J.N.K. and Vijayan, K. (1977). On estimating the variance in sampling with probability proportional to aggregate size. J. Amer. Statist. Assoc., 72, 579--584.

[ 715 ] Rao, J.N.K. and Webster, J.T. (1966). On two methods of bias reduction in the estimation of ratios. Biometrika, 53, 571--577.

[ 716 ] Rao, J.N.K. and Wu, C.F.J. (1985). Inference from stratified samples: second order analysis of three methods for non-linear statistics. J. Amer. Statist. Assoc., 80, 620--630.

[ 717 ] Rao, J.N.K. and Yu, M. (1994). Small area estimation by combining time series and cross sectional data. Canad. J. Statist., 22, 511--528.

[ 718 ] Rao, P.S.R.S. (1969). Comparison of four ratio type estimates under a model. J. Amer. Statist. Assoc., 64, 574--580.

[ 719 ] Rao, P.S.R.S. (1972). On two phase regression estimator. Sankhyā, A, 34, 373--476.

[ 720 ] Rao, P.S.R.S. (1974). Jackknifing the ratio estimator. Sankhyā, 36, 84--97.

[ 721 ] Rao, P.S.R.S. (1975). Hartley--Ross type estimators with two-phase sampling. Sankhyā, 37, 140--146.

[ 722 ] Rao, P.S.R.S. (1979). On applying the jackknife procedure to the ratio estimator. Sankhyā, 41, 115--126.

[ 723 ] Rao, P.S.R.S. and Mudholkar, G.S. (1967). Generalized multivariate estimators for the mean of finite population parameters. J. Indian Soc. Agric. Statist., 62, 1008--1012.

[ 724 ] Rao, T.J. (1966). On certain unbiased ratio estimators. Ann. Inst. Statist. Math., 18, 117--121.

[ 725 ] Rao, T.J. (1967). Contributions to the theory of sampling strategies. Ph.D. Thesis, I.S.I. Calcutta.

[ 726 ] Rao, T.J. (1968). On the allocation of sample size in stratified sampling. Ann. Inst. Statist. Math., 20, 159--166.

[ 727 ] Rao, T.J. (1971). πps sampling designs and Horvitz and Thompson estimator. J. Amer. Statist. Assoc., 66, 872--875.

[ 728 ] Rao, T.J. (1972). Horvitz and Thompson and Des Raj estimator revisited. Austral. J. Statist., 14, 227--230.

[ 729 ] Rao, T.J. (1977a). Estimating the variance of the ratio estimator for the Midzuno--Sen sampling scheme. Metrika, 24, 203--208.

[ 730 ] Rao, T.J. (1977b). Optimum allocation of sample size and prior distributions: a review. Int. Statist. Rev., 45, 173--179.

[ 731 ] Rao, T.J. (1981). On a class of almost unbiased ratio estimators. Ann. Inst. Statist. Math., A, 33, 225--231.

[ 732 ] Rao, T.J. (1993a). On certain problems of sampling design and estimation for multiple characteristics. Sankhyā, B, 55, 372--381.

[ 733 ] Rao, T.J. (1993b). On certain alternative estimators for multiple characteristics in varying probability sampling. J. Indian Soc. Agril. Statist., 45(3), 307--318.

[ 734 ] Rao, T.J. (1983c). Horvitz--Thompson strategy vs. stratified random sampling strategy. J. Statist. Planning Infer., 8, 43--50.

[ 735 ] Rao, T.J. (1984). Allocation of sample size to strata and related problems. Biom. J., 26, 517--526.

[ 736 ] Rao, T.J., Sengupta, S. and Sinha, B.K. (1991). Some order relations between selection and inclusion probabilities for PPSWOR sampling scheme. Metrika, 38, 335--343.

[ 737 ] Ray, S. and Das, M.N. (1995). On systematic sampling allowing estimation of variance of mean. J. Indian Soc. Agril. Statist., 47(2), 192--196.

[ 738 ] Ray, S. and Das, M.N. (1997). Circular systematic sampling with drawback. J. Indian Soc. Agric. Statist., 50(1), 70--74.

[ 739 ] Ray, S.K. and Sahai, A. (1979). A note on ratio and product type estimators. Ann. Inst. Statist. Math., 31, 141--144.

[ 740 ] Ray, S.K. and Singh, K. (1981). Difference cum ratio type estimators. J. Indian Statist. Assoc., 19, 147--151.

[ 741 ] Reddy, V.N. (1973). On ratio and product method of estimation. Sankhyā, B, 35, 307--316.

[ 742 ] Reddy, V.N. (1974). On a transformed ratio method of estimation. Sankhyā, C, 36, 59--70.

[ 743 ] Reddy, V.N. (1978a). A study on the use of prior knowledge on certain population parameters in estimation. Sankhyā, C, 40, 29--37.

[ 744 ] Reddy, V.N. (1978b). A comparison between stratified and unstratified random sampling. Sankhyā, C, 40, 99--103.

[ 745 ] Reddy, V.N. (1980). Systematic sampling in monotone populations. Sankhyā, C, 42, 97--108.

[ 746 ] Reddy, V.N. and Rao, T.J. (1977). Modified PPS method of estimation. Sankhyā, C, 39, 185--197.

[ 747 ] Reddy, V.N. and Rao, T.J. (1990). On estimation of the population total of bottom (top) P percentiles of a finite population. Metron, 309--320.

[ 748 ] Ren, R. (2000). Estimation de la fonction de répartition et des fractiles d'une population finie. VIIèmes Journées de Méthodologie Statistique, Paris.

[ 749 ] Renssen, R.H. and Nieuwenbroek, N.J. (1997). Aligning estimates for common variables in two or more sample surveys. J. Amer. Statist. Assoc., 92, 368--374.

[ 750 ] Richardson, S.C. (1989). One pass selection of a sample with probability proportional to size. Appl. Statist., 38, 517--520.

[ 751 ] Rizvi, S.E.H., Gupta, J.P. and Singh, R. (2000). Approximately optimum stratification for two study variables using auxiliary information. J. Indian Soc. Agric. Statist., 53(3), 287--298.

[ 752 ] Robins, J.M. and Wang, N. (2000). Inference for imputation estimators. Biometrika, 87, 113--124.

[ 753 ] Robson, D.S. (1957). Applications of multivariate polykays to the theory of unbiased ratio type estimation. J. Amer. Statist. Assoc., 52, 511--522.

[ 754 ] Rosen, B. (1997a). On sampling with probability proportional to size. J. Statist. Planning Infer., 62, 159--191.

[ 755 ] Rosen, B. (1997b). Asymptotic theory for order sampling. J. Statist. Planning Infer., 62, 135--158.

[ 756 ] Rosen, B. (1998). On inclusion probabilities for order sampling. R and D Report, Research Methods, Development, 2, 1--23.
1170 Advanced sampling theory with applications

[757] Roy, J. and Chakravarti, I.M. (1960). Estimating the mean of a finite population. Ann. Math.
Statist., 31, 392--398.

[758] Royall, R.M. (1970a). On finite population sampling theory under certain linear regression
models. Biometrika, 57, 377--387.

[759] Royall, R.M. (1970b). Finite population sampling: on labels in estimation. Ann. Math. Statist., 41,
1774--1779.

[760] Royall, R.M. (1970c). On finite population sampling theory under certain linear regression
models. Biometrika, 57, 377--387.

[761] Royall, R.M. (1971). Linear regression models in finite population sampling theory. Foundations
of Statistical Inference (V.P. Godambe and D.A. Sprott, eds.). Holt, Rinehart & Winston, Toronto,
259--274.

[762] Royall, R.M. (1976). The linear least squares prediction approach to two-stage sampling. J. Amer.
Statist. Assoc., 71, 657--664.

[763] Royall, R.M. (1986). The prediction approach to robust variance estimation in two-stage cluster
sampling. J. Amer. Statist. Assoc., 81, 119--123.

[764] Royall, R.M. (1992). Robustness and optimal design under prediction models in finite population
sampling. Survey Methodology, 2, 179--195.

[765] Royall, R.M. and Cumberland, W.G. (1978). Variance estimation in finite population sampling. J.
Amer. Statist. Assoc., 73, 351--358.

[766] Royall, R.M. and Cumberland, W.G. (1981a). An empirical study of the ratio estimator and
estimators of its variance. J. Amer. Statist. Assoc., 76, 66--77.

[767] Royall, R.M. and Cumberland, W.G. (1981b). The finite population linear regression estimator
and estimators of its variance -- An empirical study. J. Amer. Statist. Assoc., 76, 924--930.

[768] Royall, R.M. and Cumberland, W.G. (1985). Conditional coverage properties of finite population
confidence intervals. J. Amer. Statist. Assoc., 80, 355--359.

[769] Royall, R.M. and Eberhardt, K.R. (1975). Variance estimates for the ratio estimator. Sankhyā, C,
37, 43--52.

[770] Royall, R.M. and Herson, J. (1973a). Robust estimation in finite populations -- I. J. Amer. Statist.
Assoc., 68, 880--889.

[771] Royall, R.M. and Herson, J. (1973b). Robust estimation in finite populations -- II: Stratification
on a size variable. J. Amer. Statist. Assoc., 68, 890--893.

[772] Royall, R.M. and Pfeffermann, D. (1982). Balanced samples and robust Bayesian inference in
finite population sampling. Biometrika, 69, 401--409.

[773] Rubin, D.B. (1976). Inference and missing data. Biometrika, 63, 581--592.

[774] Rubin, D.B. (1978). Multiple imputation in sample surveys -- a phenomenological Bayesian
approach to nonresponse. In Proc. Sect. Survey Res. Meth., pp. 20--34. Washington, D.C.: American
Statistical Association.

[775] Rubin, D.B. (1987). Multiple imputation for non-response in surveys. John Wiley, New York.
Bibliography 1171

[776] Rubin, D.B. (1996). Multiple imputation after 18+ years. J. Amer. Statist. Assoc., 91, 473--490.

[777] Rubin, D.B. and Schenker, N. (1986). Multiple imputation for interval estimation from simple
random samples with ignorable non-response. J. Amer. Statist. Assoc., 81, 366--374.

[778] Rueda, M. and Arcos, A. (2002). The use of quantiles of auxiliary variable to estimate medians.
Biom. J., 44(5), 619--632.

[779] Rueda, M., Arcos, A. and Artes, E. (1998). Quantile interval estimation in finite population using
a multivariate ratio estimator. Metrika, 47, 203--213.

[780] Ruiz, M. and Santos, J. (1990). Sampling design providing unbiased new product estimator.
Statistica, 50, 285--288.

[781] Ruiz, M. and Santos, J. (1992). Variance estimation with systematic sampling. Rev. Acad. Cienc.
Zaragoza, 47, 121--124.

[782] Sadasivan, G. and Aggarwal, R. (1978). Optimum points of stratification in bivariate populations.
Sankhyā, C, 40, 84--97.

[783] Sadasivan, G. and Srinath, M. (1975). Some contributions to post-cluster sampling. Sankhyā, C,
37, 171--180.

[784] Sahai, A. and Ray, S.K. (1980). An efficient estimator using auxiliary information. Metrika, 27,
271--275.

[785] Sahai, A. and Sahai, A. (1985). On efficient use of auxiliary information. J. Statist. Planning
Infer., 12, 203--212.

[786] Sahoo, J. and Sahoo, L.N. (1999a). A comparative study of some regression type estimators in
double sampling procedures. Aligarh J. Statist., 19, 67--76.

[787] Sahoo, J. and Sahoo, L.N. (1999b). An alternative class of estimators in double sampling
procedures. Calcutta Statist. Assoc. Bull., 49, 79--83.

[788] Sahoo, J., Sahoo, L.N. and Mohanty, S. (1994). Unequal probability sampling using a transformed
auxiliary variable. Metron, 71--83.

[789] Sahoo, J., Sahoo, L.N. and Wywial, J. (1997). Some thoughts of reduction of estimation bias
using auxiliary information in sample. Statistics in Transition, 3(2), 383--401.

[790] Sahoo, L.N. (1983). On a method of bias reduction in ratio estimation. J. Statist. Res., 17, 1--6.

[791] Sahoo, L.N. (1986). On a ratio method of estimation using a transformed auxiliary variable.
Statistica, 46, 409--413.

[792] Sahoo, L.N. (1987). A regression type estimator in two-stage sampling. Calcutta Statist. Assoc.
Bull., 36, 97--100.

[793] Sahoo, L.N. (1991). An unbiased ratio cum product estimator in two-stage sampling. Metron,
213--217.

[794] Sahoo, L.N. (1994). Some estimation problems in finite population sampling using auxiliary
information. Ph.D. Thesis, Utkal University, Bhubaneswar, India.

[795] Sahoo, L.N. and Panda, P. (1997). A class of estimators in two-stage sampling with varying
probabilities. South African Statist. J., 31, 151--160.

[796] Sahoo, L.N. and Panda, P. (1999a). A class of estimators using auxiliary information in two-stage
sampling. Austral. & New Zealand J. Statist., 41(4), 405--410.

[797] Sahoo, L.N. and Panda, P. (1999b). A predictive regression type estimator in two-stage sampling.
J. Indian Soc. Agric. Statist., 52(3), 303--308.

[798] Sahoo, L.N., Sahoo, J. and Espejo, M.R. (1998). On some strategies using auxiliary information
for estimating finite population mean. Qüestiió, 22, 243--252.

[799] Sahoo, L.N., Sahoo, J. and Mohanty, S. (1995a). Empirical comparison of some regression and
regression type strategies. Statistische Hefte, 36, 337--347.

[800] Sahoo, L.N., Sahoo, J. and Mohanty, S. (1995b). A new predictive ratio estimator. J. Indian Soc.
Agric. Statist., 47(3), 240--242.

[801] Sahoo, L.N. and Sahoo, R.K. (2001). Predictive estimation of finite population mean in two-phase
sampling using two auxiliary variables. J. Indian Soc. Agric. Statist., 54(2), 258--264.

[802] Sahoo, L.N. and Swain, A.K.P.C. (1980). Unbiased ratio cum product estimator. Sankhyā, C, 42,
56--62.

[803] Sahoo, L.N. and Swain, A.K.P.C. (1986). Chain product estimators. Aligarh J. Statist., 6, 53--58.

[804] Sahoo, L.N. and Swain, A.K.P.C. (1987). Some modified ratio estimators. Metron, 285--293.

[805] Sahoo, L.N. and Swain, A.K.P.C. (1989). On two modified ratio estimators in two-phase
sampling. Metron, 261--266.

[806] Samiuddin, M. and Kattan, A.K.A. (1991). A procedure of unequal probability sampling. Pak. J.
Statist., A, 7, 1--7.

[807] Samiuddin, M., Kattan, A.K.A., Hanif, M. and Asad, H. (1992). Some remarks on models,
sampling schemes and estimators in unequal probability sampling. Pak. J. Statist., A, 8, 1--18.

[808] Sampath, S. (1989). On the optimal choice of unknowns in ratio type estimators. J. Indian Soc.
Agric. Statist., 41, 166--172.

[809] Sampath, S. and Chandra, S.K. (1990). General class of estimators for the population total under
unequal probability sampling schemes. Metron, 409--419.

[810] Sampath, S., Uthayakumaran, N. and Tracy, D.S. (1995). On an alternative estimator for
randomized response technique. J. Indian Soc. Agric. Statist., 47(3), 243--248.

[811] Sampford, M.R. (1962). Methods of cluster sampling with and without replacement for clusters of
unequal sizes. Biometrika, 49, 27--40.

[812] Sampford, M.R. (1967). On sampling without replacement with unequal probabilities of selection.
Biometrika, 54, 499--513.

[813] Sarndal, C.E. (1980a). On π-inverse weighting versus best linear unbiased weighting in
probability sampling. Biometrika, 67, 639--650.

[814] Sarndal, C.E. (1980b). Two model based inference arguments in survey sampling. Austral. J.
Statist., 22, 341--348.

[815] Sarndal, C.E. (1982). Implications of survey designs for generalized regression estimators of
linear functions. J. Statist. Planning Infer., 7, 155--170.

[816] Sarndal, C.E. (1992). Methods for estimating the precision of survey estimates when imputation is
used. Survey Methodology, 18, 241--252.

[817] Sarndal, C.E. (1996). Efficient estimators with simple variance in unequal probability sampling.
J. Amer. Statist. Assoc., 91, 1289--1300.

[818] Sarndal, C.E. and Hidiroglou, M.A. (1989). Small domain estimation: a conditional analysis. J.
Amer. Statist. Assoc., 84, 266--275.

[819] Sarndal, C.E. and Wright, R.L. (1984). Cosmetic form of estimators in survey sampling. Scand. J.
Statist., 11, 146--156.

[820] Sarndal, C.E. and Swensson, B. (1987). A general view of estimation for two phases of selection
with application to two-phase sampling and non-response. Int. Statist. Rev., 55, 279--294.

[821] Sarndal, C.E., Swensson, B. and Wretman, J.H. (1989). The weighted residual technique for
estimating the variance of the general regression estimator of the finite population total. Biometrika,
76(3), 527--537.

[822] Sarndal, C.E., Swensson, B. and Wretman, J.H. (1992). Model assisted survey sampling.
New York: Springer-Verlag.

[823] Saxena, R.R., Singh, P. and Srivastava, A.K. (1986). An unequal probability sampling scheme.
Biometrika, 73(3), 761--763.

[824] Saxena, S.K., Nigam, A.K. and Shukla, N.D. (1995). Variance estimation for combined ratio
estimator. Sankhyā, B, 57, 85--92.

[825] Schafer, J.L. and Schenker, N. (2000). Inference with imputed conditional means. J. Amer. Statist.
Assoc., 95, 141--154.

[826] Schneeberger, H. (1979). Saddle points of the variance of the sample mean in stratified sampling.
Sankhyā, C, 41, 92--96.

[827] Schreuder, H.T., Gregoire, T.G. and Wood, G.B. (1993). Sampling methods for multi-resource
forest inventory. Wiley, New York.

[828] Schucany, W.R., Gray, H.L. and Owen, D.B. (1971). Bias reduction in estimation. J. Amer.
Statist. Assoc., 66, 524--533.

[829] Scott, A. and Smith, T.M.F. (1969). Estimation in multi-stage surveys. J. Amer. Statist. Assoc., 64,
830--840.

[830] Scott, A.J., Brewer, K.W. and Ho, E.W. (1978). Finite population sampling and robust estimation.
J. Amer. Statist. Assoc., 73, 359--361.

[831] Searls, D.T. (1964). The utilization of a known coefficient of variation in the estimation
procedure. J. Amer. Statist. Assoc., 59, 1225--1226.

[832] Searls, D.T. (1967). A note on the use of an approximately known co-efficient of variation.
American Statistician, 21(2), 20--21.

[833] Sedransk, J. and Meyer, J. (1978). Confidence intervals for quantiles of a finite population: simple
random and stratified simple random sampling. J. R. Statist. Soc., B, 40, 239--252.

[834] Sekkappan, R.M. (1973). Bayes estimation and uniform admissibility for sampling from finite
population. Ph.D. Thesis, University of Waterloo.

[835] Sekkappan, R.M. (1981). Subjective Bayesian multivariate stratified sampling from finite
populations. Metrika, 28, 123--132.

[836] Sekkappan, R.M. and Thompson, M.E. (1975). On a class of uniformly admissible estimators for
finite populations. Ann. Statist., 3, 492--499.

[837] Sekkappan, R.M. and Thompson, M.E. (1994). Multi-phase and successive sampling for a
stratified population with unknown stratum sizes. Pak. J. Statist., A, 10, 131--142.

[838] Sen, A.R. (1952). Present status of probability sampling and its use in estimation of farm
characteristics. Econometrica, 27, 130.

[839] Sen, A.R. (1953). On the estimate of the variance in sampling with varying probabilities. J. Indian
Soc. Agric. Statist., 5, 119--127.

[840] Sen, A.R., Sellers, S. and Smith, D.N. (1973). The use of ratio estimate in successive sampling.
Biometrics, 31, 673--683.

[841] Sengupta, S. (1980). On the admissibility of the symmetrized Des Raj estimator for PPSWOR
sample of size two. Calcutta Statist. Assoc. Bull., 29, 35--44.

[842] Sengupta, S. (1981a). On interpenetrating samples of equal and unequal sizes. Calcutta Statist.
Assoc. Bull., 30, 187--197.

[843] Sengupta, S. (1981b). Jackknifing the ratio and the product estimators in double sampling.
Metrika, 28, 245--256.

[844] Sengupta, S. (1982a). On interpenetrating samples of unequal sizes. Metrika, 29, 175--188.

[845] Sengupta, S. (1982b). Admissibility of the symmetrized Des Raj estimator for fixed size sampling
designs of size two. Calcutta Statist. Assoc. Bull., 31, 201--205.

[846] Sengupta, S. (1983). Admissibility of unbiased estimators in finite population sampling for
samples of size at most two. Calcutta Statist. Assoc. Bull., 32, 91--102.

[847] Sengupta, S. (1986). A comparison between PPSWR and Brewer's πps WOR procedures.
Calcutta Statist. Assoc. Bull., 35, 207--210.

[848] Sengupta, S. (1988). A note on PPS circular systematic sampling. Calcutta Statist. Assoc. Bull.,
37, 111--112.

[849] Serfling, R.J. (1968). Approximate optimum stratification. J. Amer. Statist. Assoc., 63, 1298--
1309.

[850] Seth, G.R. and Rao, J.N.K. (1964). On the comparison between simple random sampling with and
without replacement. Sankhyā, A, 26, 85--86.

[851] Sethi, V.K. (1965). On optimum pairing of units. Sankhyā, B, 27, 315--320.

[852] Shah, D.N. and Shah, S.M. (1979). Unbiased product type estimators. Gujarat Statist. Rev., 6(2),
34--43.

[853] Shah, D.N. and Gupta, M.R. (1986). Comparison of double sampling estimators. Metron, 417--
419.

[854] Shah, D.N. and Patel, P.A. (1996). Asymptotic properties of a generalized regression type
predictor of a finite population variance in probability sampling. Canad. J. Statist., 24, 373--384.

[855] Shanno, D.F. (1970). Parameter selection for modified Newton methods for function
minimisation. SIAM J. Numerical Analysis, 7, 102--109.

[856] Shao, J. and Chen, Y. (1999). Approximate balanced half sample and related replication methods
for imputed survey data. Sankhyā, 61, 187--201.

[857] Shao, J., Chen, Y. and Chen, Y. (1998). Balanced repeated replication for stratified multistage
survey data under imputation. J. Amer. Statist. Assoc., 819--831.

[858] Sharma, S.D. and Sil, A. (1996). A study of Politz--Simmons estimator under non-cooperation. J.
Indian Soc. Statist., 48(2), 171--184.

[859] Sharma, S.S. (1970). On an estimation in T3 class of linear estimators in sampling with varying
probabilities from a finite population. Ann. Inst. Statist. Math., 22, 495--500.

[860] Sharma, Y.K., Singh, R., Rai, A. and Verma, S.S. (2000). Regression estimators from survey data
for small sample sizes. J. Indian Soc. Agric. Statist., 53(2), 115--124.

[861] Sharot, T. (1976). Sharpening the Jackknife. Biometrika, 63, 315--321.

[862] Scheers, N. (1992). A review of randomized response techniques. Measurement and Evaluation in
Counselling and Development, 25, 27--41.

[863] Shiledar-Baxi, H.R. (1995). Approximately optimum stratified design for a finite population -- II.
Sankhyā, 57, 391--404.

[864] Shiue, C.J. (1966). Systematic sampling with multiple random starts. Forest Science, 6, 142--
150.

[865] Shukla, G.K. (1966). An alternative multivariate ratio estimate for finite population. Bull.
Calcutta Statist. Assoc., 15, 127--134.

[866] Shukla, D. and Dubey, J. (2001). Estimation in mail surveys under PSNR sampling scheme. J.
Indian Soc. Agric. Statist., 54(3), 288--302.

[867] Shukla, D. and Trivedi, M. (2001). Mean estimation in deeply stratified population under post-
stratification. J. Indian Soc. Agric. Statist., 54(2), 221--235.

[868] Silva, P.L.D.N. and Skinner, C.J. (1995). Estimating distribution functions with auxiliary
information using poststratification. J. Official Statist., 11, 277--294.

[869] Silverman, B.W. (1986). Density estimation for statistics and data analysis. London: Chapman
and Hall.

[870] Singh, A.C. (1996). Combining information in survey sampling by modified regression. Proc. of
the Section on Survey Research Methods, American Statistical Association, 120--129.

[871] Singh, A.C. and Mohl, C.A. (1996). Understanding calibration estimators in survey sampling.
Survey Methodology, 22, 107--115.

[872] Singh, A.C., Stukel, D.M. and Pfeffermann, D. (1998). Bayesian versus frequentist measures of
error in small area estimation. J. R. Statist. Soc., B, 60, 377--396.

[873] Singh, A.K. and Singh, H.P. (1997). A note on the efficiencies of three product type estimators
under a linear model. J. Indian Soc. Agric. Statist., 50(2), 130--134.

[874] Singh, A.K., Singh, H.P. and Upadhyaya, L.N. (2001). A generalized chain estimator for finite
population mean in two-phase sampling. J. Indian Soc. Agric. Statist., 370--375.

[875] Singh, D. (1956). On efficiency of cluster sampling. J. Indian Soc. Agric. Statist., 8, 45--55.

[876] Singh, D. (1968). Estimation in successive sampling using a multi-stage design. J. Amer. Statist.
Assoc., 63, 99--112.

[877] Singh, D. and Chaudhary, F.S. (1986). Theory and analysis of sample survey designs. Wiley
Eastern Limited.

[878] Singh, D., Jindal, K.K. and Garg, J.N. (1968). On modified systematic sampling. Biometrika, 55,
541--546.

[879] Singh, D. and Singh, B.D. (1965). Some contribution to two-phase sampling. Austral. J. Statist.,
2, 45--67.

[880] Singh, D. and Singh, P. (1977). New systematic sampling. J. Statist. Planning Infer., 1, 163--177.

[881] Singh, G.N. and Upadhyaya, L.N. (1995). A class of modified chain type estimators using two
auxiliary variables in two-phase sampling. Metron, 117--125.

[882] Singh, G.N. and Singh, V.K. (2001). On the use of auxiliary information in successive sampling.
J. Indian Soc. Agric. Statist., 54(1), 1--12.

[883] Singh, H.P. (1988). An improved class of estimators of population mean using auxiliary
information. J. Indian Soc. Agric. Statist., 96--104.

[884] Singh, H.P. (1989). A class of unbiased estimators of product of population means. J. Indian Soc.
Agric. Statist., 40, 113--118.

[885] Singh, H.P. and Biradar, R.S. (1992). Almost unbiased ratio cum product estimators for the finite
population mean. Test, 1, 19--29.

[886] Singh, H.P., Chandra, P. and Singh, S. (2003). Variance estimation using multi-auxiliary
information for random non-response in survey sampling. Statistica (Accepted).

[887] Singh, H.P. and Gangele, R.K. (1995). Almost separation of bias precipitates in the estimator of
'Inverse of population mean' with known coefficient of variation. J. Indian Soc. Agric. Statist., 47, 212--
218.

[888] Singh, H.P. and Gangele, R.K. (1997). An approach for almost separation of bias precipitates. J.
Indian Soc. Agric. Statist., 50, 11--17.

[889] Singh, H.P. and Kakran, M.S. (1993). A modified ratio estimator using known coefficient of
kurtosis of an auxiliary character. (Unpublished manuscript).

[890] Singh, H.P., Katyar, N.P. and Gangwar, O.K. (1996). A class of almost unbiased regression type
estimators in two phase sampling applying Quenouille's method. J. Indian Soc. Agric. Statist., 48(1), 98--
104.

[891] Singh, H.P. and Sahoo, L.N. (1989). A class of almost unbiased estimators for population ratio
and product. Calcutta Statist. Assoc. Bull., 38, 241--243.

[892] Singh, H.P. and Singh, R. (2001). Improved ratio type estimator for variance using auxiliary
information. J. Indian Soc. Agric. Statist., 54(3), 276--287.

[893] Singh, H.P. and Singh, S. (2002). Estimation of median of the study variable using known
interquartile range of the auxiliary variable. Working paper.

[894] Singh, H.P., Singh, S. and Joarder, A.H. (2003). Estimation of median using known mode of the
auxiliary variable. J. Statist. Research (To appear).

[895] Singh, H.P., Singh, S. and Puertas, S.M. (2003a). Ratio type estimators for the median of finite
populations. Allgemeines Statistisches Archiv (In press).

[896] Singh, H.P., Singh, S. and Puertas, S.M. (2003b). Estimation of interquartile range of the study
variable using known interquartile range of the auxiliary variable. Working paper.

[897] Singh, H.P., Singh, S. and Puertas, S.M. (2003c). Estimation of median using three known
quartiles of the auxiliary variable. Working paper.

[898] Singh, H.P. and Singh, V.P. (1993). A general class of unbiased estimators of a parameter.
Calcutta Statist. Assoc. Bull., 43, 169--170.

[899] Singh, H.P. and Singh, V.P. (1995). A class of unbiased dual to ratio estimator in stratified
sampling. J. Indian Soc. Agric. Statist., 47(2), 168--175.

[900] Singh, H.P. and Tracy, D.S. (2001). Estimation of population mean in presence of random non-
response in sample surveys. Statistica, LXI(2), 231--248.

[901] Singh, H.P. and Upadhyaya, L.N. (1986). On a class of estimators of the population mean in
sampling using auxiliary information. J. Indian Soc. Agric. Statist., 38, 100--104.

[902] Singh, M. (1979). On the reduction of bias of ratio estimator to a desired degree. Biom. J., 21(7),
645--647.

[903] Singh, M., Kumar, P. and Chandak, R. (1983). Use of multi-auxiliary variables as a condensed
auxiliary variable in selecting a sample. Commun. Statist. -- Theory Meth., 12, 1685--1697.

[904] Singh, M.P. (1967a). Ratio cum product method of estimation. Metrika, 12, 34--42.

[905] Singh, M.P. (1967b). Multivariate product method of estimation for finite populations. J. Indian
Soc. Agric. Statist., 19, 1--10.

[906] Singh, M.P. (1969). Comparison of some ratio-cum-product estimators. Sankhyā, B, 31, 375--
378.

[907] Singh, M.P., Gambino, J. and Mantel, H.J. (1994). Issues and strategies for small area data.
Survey Methodology, 20, 3--22.

[908] Singh, P. (1978). A sampling scheme with inclusion probability proportional to size. Sankhyā,
C, 40, 122--128.

[909] Singh, P. and Garg, J.N. (1979). On balanced random sampling. Sankhyā, C, 41, 60--68.

[910] Singh, P. and Srivastava, A.K. (1980). Sampling scheme providing unbiased regression
estimators. Biometrika, 67, 205--209.

[911] Singh, P. and Yadav, R.J. (1992). Generalized estimation under successive sampling. J. Indian
Soc. Agric. Statist., 44, 27--36.

[912] Singh, R. (1971). Approximately optimum stratification on the auxiliary variable. J. Amer. Statist.
Assoc., 66, 829--833.

[913] Singh, R. (1972). A note on successive sampling over two occasions. Aust. J. Statist., 14(2), 120--
122.

[914] Singh, R. (1975a). A note on the efficiency of ratio estimate with Midzuno's scheme of sampling.
Sankhyā, C, 37, 211--214.

[915] Singh, R. (1975b). An alternative method of stratification on the auxiliary variable. Sankhyā, C,
37, 100--108.

[916] Singh, R. (1984). Double sampling for two auxiliary characters. Calcutta Statist. Assoc. Bull., 33,
193--197.

[917] Singh, R. and Bansal, M.L. (1975). On the efficiency of interpenetrating sub-samples in simple
random sampling. Sankhyā, C, 37, 190--198.

[918] Singh, R. and Bansal, M.L. (1978). A note on the efficiency of interpenetrating sub-samples in
simple random sampling. Sankhyā, C, 40, 174--176.

[919] Singh, R. and Kathuria, O.P. (1995). Sampling without replacement in qualitative randomized
response model. J. Indian Soc. Agric. Statist., 47(2), 134--141.

[920] Singh, R. and Kishore, L. (1975). On Rao, Hartley and Cochran's method of sampling. Sankhyā,
37, 88--94.

[921] Singh, R. and Lal, M. (1978). On the construction of random groups in the RHC scheme.
Sankhyā, C, 40, 129--135.

[922] Singh, R. and Mangat, N.S. (1996). Elements of survey sampling. Kluwer Academic Publishers,
The Netherlands.

[923] Singh, R., Mangat, N.S. and Singh, S. (1993). A mail survey design for sensitive character
without using randomization device. Commun. Statist. -- Theory Meth., 22(9), 2661--2668.

[924] Singh, R. and Narain, P. (1989). Method of estimation from samples with random sample sizes. J.
Statist. Planning Infer., 23, 217--225.

[925] Singh, R. and Singh, B. (1974). On replicated samples drawn with Rao, Hartley and Cochran's
scheme. Sankhyā, C, 36, 147--150.

[926] Singh, R. and Singh, H.P. (1993). A Hartley--Ross type estimator for finite population mean when
the variables are negatively correlated. Metron, 205--216.

[927] Singh, R. and Singh, H.P. (1999). A class of unbiased estimators in cluster sampling. J. Indian
Soc. Agric. Statist., 52(3), 299--302.

[928] Singh, R., Singh, H.P. and Espejo, M.R. (1998). The efficiency of an alternative to ratio estimator
under a super population model. J. Statist. Planning Infer., 71, 287--301.

[929] Singh, R., Singh, S. and Mangat, N.S. (1995). Mail survey design for sensitive quantitative
variable. Metron, 53, 43--54.

[930] Singh, R., Singh, S., Mangat, N.S. and Tracy, D.S. (1995). An improved two stage randomized
response strategy. Statistical Papers, 36, 265--271.

[931] Singh, R. and Sukhatme, B.V. (1969). Optimum stratification. Ann. Inst. Statist. Math., 21, 515--
528.

[932] Singh, R.K. (1982a). On estimating ratio and product of population parameters. Calcutta Statist.
Assoc. Bull., 31, 69--76.

[933] Singh, R.K. (1982b). Generalized double sampling estimators for the ratio and product of
population parameters. J. Indian Statist. Assoc., 20, 39--49.

[934] Singh, R.K. and Ray, S.K. (1981). Product cum difference method of estimation using two
auxiliary variables. Biom. J., 23(6), 563--571.

[935] Singh, R.K. and Singh, G. (1984a). A class of estimators with estimated optimum values in
sample surveys. Statist. Prob. Lett., 2, 319--321.

[936] Singh, R.K. and Singh, G. (1984b). Improved generalized ratio cum product estimation. Biom. J.,
26(1), 57--61.

[937] Singh, R.K. and Zaidi, S.M.H. (2000). On estimating square of population mean and population
variance. J. Indian Soc. Agric. Statist., 53(3), 243--256.

[938] Singh, S. (1988). Estimation in overlapping clusters. Commun. Statist. -- Theory Meth., 17(2),
613--621.

[939] Singh, S. (1991a). Estimation of finite population variance using double sample. Aligarh J.
Statist., 11, 53--56.

[940] Singh, S. (1991b). On improved strategies in survey sampling. Unpublished Ph.D. thesis
submitted to Punjab Agricultural University, Ludhiana, India.

[941] Singh, S. (1994). Unrelated question randomized response sampling using continuous
distributions. J. Indian Soc. Agric. Statist., 46(3), 349--361. (Gold Medalist presentation).

[942] Singh, S. (1999). An addendum to the confidentiality guaranteed under randomised response
sampling by Mahmood, Singh and Horn. Biom. J., 41(8), 955--966.

[943] Singh, S. (2000a). Estimation of parametric functions in two-dimensional space in survey
sampling. South African Statist. J., 34, 51--71.

[944] Singh, S. (2000b). Estimation of variance of regression estimator in two phase sampling. Calcutta
Statist. Assoc. Bull., 50, 49--63.

[945] Singh, S. (2000c). A new method of imputation in survey sampling. Working paper.

[946] Singh, S. (2001). Generalized calibration approach for estimating variance in survey sampling.
Ann. Inst. Statist. Math., 53(2), 404--417.

[947] Singh, S. (2002a). Estimation of median of the study variable using 99 known percentiles of the
auxiliary variable. Working paper at St. Cloud State University, MN, USA.

[948] Singh, S. (2002b). On Farrell and Singh's penalized chi square distance functions in survey.
Presented at the SSC 2003 conference at Halifax, Canada.

[949] Singh, S. (2002c). A new stochastic randomized response technique. Metrika, 56, 131--142.

[950] Singh, S. (2003a). Short note on the linear regression estimator in survey sampling. Working
paper at St. Cloud State University, St. Cloud, MN, USA.

[951] Singh, S. (2003b). On Jackknifing the two-phase calibration weights for estimating the variance
of estimator of distribution function using two auxiliary variables. Working paper at St. Cloud State
University, St. Cloud, MN, USA.

[952] Singh, S. (2003c). Golden Jubilee Year-2003 of the linear regression estimator. Working paper
at St. Cloud State University, St. Cloud, MN, USA.

[953] Singh, S. and Arnab, R. (2003). Penalized chi square distance function for non-response
adjustments. Working paper at St. Cloud State University and University of Durban-Westville (Submitted
for presentation at JSM 2003, California, USA).

[954] Singh, S. and Deo, B. (2002). Imputing with power transformation. Statistical Papers (To appear).

[955] Singh, S., Grewal, I.S. and Joarder, A.H. (2002). General class of estimators in multi-character
surveys. Statistical Papers (To appear).

[956] Singh, S. and Horn, S. (1998). An alternative estimator for multi-character surveys. Metrika, 48,
99--107.

[957] Singh, S. and Horn, S. (1999). An improved estimator of the variance of the regression estimator.
Biom. J., 41(3), 359--369.

[958] Singh, S. and Horn, S. (2000). Compromised imputation in survey sampling. Metrika, 51, 267--
276.

[959] Singh, S., Horn, S. and Chowdhury, S. (1998). Estimation of stigmatized characteristics of a
hidden gang in finite population. Austral. & New Zealand J. Statist., 40(3), 291--297.

[960] Singh, S., Horn, S., Chowdhury, S. and Yu, F. (1999). Calibration of the estimators of variance.
Austral. & New Zealand J. Statist., 40(2), 199--212.

[961] Singh, S., Horn, S. and Tracy, D.S. (2001). Hybrid of calibration and imputation: estimation of
mean in survey sampling. Statistica, LXI(1), 27--41.

[962] Singh, S., Horn, S. and Yu, F. (1998). Estimation of variance of general regression estimator:
Higher level calibration approach. Survey Methodology, 24(1), 41--50.

[963] Singh, S., Horn, S., Singh, R. and Mangat, N.S. (2003). On the use of modified randomization
device for estimating the prevalence of a sensitive attribute. Statistics in Transition (To appear).

[964] Singh, S. and Joarder, A.H. (1997a). Optional randomized response technique for quantitative
sensitive character. Metron, LV, 151--157.

[965] Singh, S. and Joarder, A.H. (1997b). Unknown repeated trials in randomized response sampling.
J. Indian Soc. Agric. Statist., 50(1), 103--105.

[966] Singh, S. and Joarder, A. (1998). Estimation of finite population variance using random
nonresponse in survey sampling. Metrika, 241--249.

[967] Singh, S. and Joarder, A.H. (2002). Estimation of distribution function and median in two-phase
sampling. Pak. J. Statist., 18(2), 301--319.

[968] Singh, S., Joarder, A.H. and King, M.L. (1996). Regression analysis using scrambled responses.
Austral. J. Statist., 38(2), 201--211.

[969] Singh, S., Joarder, A.H. and Tracy, D.S. (2000). Regression type estimators for random non-response in survey sampling. Statistica, LX, 39--44.

[970] Singh, S., Joarder, A.H. and Tracy, D.S. (2001). Median estimation using double sampling. Austral. & New Zealand J. Statist., 43(1), 33--46.

[971] Singh, S. and Kataria, P. (1990). An estimator of finite population variance. J. Indian Soc. Agril. Statist., 42, 186--188.

[972] Singh, S. and King, M.L. (1999). Estimation of coefficient of determination using scrambled responses. J. Indian Soc. Agric. Statist., 52(3), 338--343.

[973] Singh, S., Mahmood, M. and Tracy, D.S. (2001). On the estimation of mean and variance of a sensitive character using distinct units. Statistical Papers, 42, 403--411.

[974] Singh, S., Mangat, N.S. and Gupta, J.P. (1996). Improved estimator of finite population correlation coefficient. J. Indian Soc. Agric. Statist., 48, 141--149.

[975] Singh, S., Mangat, N.S. and Mahajan, P.K. (1995). General class of estimators. J. Indian Soc. Agril. Statist., 47(2), 129--133.

[976] Singh, S., Mangat, N.S. and Singh, R. (1994). On estimation of mean/total of stigmatised quantitative variable. Statistica, 54(3), 383--386.

[977] Singh, S., Mangat, N.S. and Singh, R. (1997). Estimation of size and mean of a sensitive quantitative variable for a sub-group of a population. Commun. Statist. -- Theory Meth., 26(7), 1793--1804.

[978] Singh, S., Pannu, C.J.S., Singh, S., Singh, J.P. and Kaur, S. (1996). Energy in Punjab Agriculture. Department of Farm Power and Machinery, Punjab Agricultural University, Ludhiana, India.

[979] Singh, S. and Puetas, S.M. (2002). On the estimation of total, mean and distribution function using two-phase sampling: calibration approach. J. Indian Soc. Agric. Statist. (Revised submitted).

[980] Singh, S., Singh, H.P., Tailor, R. and Allen, J. (2002). General class of estimators for estimating ratio of two population means in the presence of random non-response. Working paper.

[981] Singh, S. and Singh, R. (1979). On random non-response in unequal probability sampling. Sankhyā, C, 41, 127--137.

[982] Singh, S. and Singh, R. (1991). Almost bias precipitate filtration: A new technique. Aligarh J. Statist., 11, 5--8.

[983] Singh, S. and Singh, R. (1992a). Improved Franklin's model for randomized response sampling. J. Indian Statist. Assoc., 30, 109--122.

[984] Singh, S. and Singh, R. (1992b). An alternative estimator for randomised response technique. J. Indian Soc. Agric. Statist., 44(2), 149--154.

[985] Singh, S. and Singh, R. (1993a). Almost filtration of bias precipitates: a new approach. J. Indian Soc. Agril. Statist., 45, 214--218.

[986] Singh, S. and Singh, R. (1993b). A new method: almost separation of bias precipitates in sample surveys. J. Indian Statist. Assoc., 31, 99--105.

[987] Singh, S. and Singh, R. (1993c). A class of almost unbiased ratio and product type estimators. J. Indian Soc. Statist. Opers. Res., 14, 35--39.

[988] Singh, S. and Singh, R. (1993d). Generalised Franklin's model for randomised response sampling. Commun. Statist. -- Theory Meth., 22(3), 741--755.

[989] Singh, S., Singh, R. and Mangat, N.S. (1996). Estimation of mean of a stigmatized quantitative variable for a sub-group of the population. Metron, 54(3-4), 83--91.

[990] Singh, S., Singh, R. and Mangat, N.S. (1998). Estimation of coefficient of variation of a sensitive character. Metron, 55, 59--67.

[991] Singh, S., Singh, R. and Mangat, N.S. (2000). Some alternative strategies to Moors' model in randomized response sampling. J. Statist. Planning Infer., 83, 243--255.

[992] Singh, S., Singh, R., Mangat, N.S. and Tracy, D.S. (1994). An alternative device for randomised responses. Statistica, 54(2), 233--243.

[993] Singh, S. and Singh, S. (1988). Improved estimators of K and B in finite populations. J. Indian Soc. Agric. Statist., 40(2), 121--126.

[994] Singh, S., Singh, S., Mittal, J.P., Pannu, C.J.S. and Bhangoo, B.S. (1994). Energy inputs and crop yield relationships for rice in Punjab. Energy -- International Journal, 19(10), 1061--1065.

[995] Singh, S., Singh, S., Pannu, C.J.S., Bhangoo, B.S. and Singh, M.P. (1994). Energy inputs and crop yield relationships for wheat in Punjab. Energy Convers. Mgmt., 35(6), 493--499.

[996] Singh, S. and Tracy, D.S. (1999). Ridge regression using scrambled responses. Metron, 57, 147--157.

[997] Singh, S. and Valdes, S.R. (2003). Optimum method of imputation in survey sampling. Working paper at St. Cloud State University, St. Cloud, MN, USA.

[998] Singh, V.K. and Shukla, D. (1993). An efficient one parameter family of factor type estimators in sample surveys. Metron, 139--159.

[999] Singh, V.K. and Singh, G.N. (1991). Chain type regression estimators with two auxiliary variables under double sampling scheme. Metron, 279--289.

[1000] Singh, V.K., Singh, H.P. and Singh, H.P. (1994). Estimation of ratio and product of two finite population means in two-phase sampling. J. Statist. Planning Infer., 41, 163--171.

[1001] Singh, V.P. and Singh, H.P. (1997--98). Chain estimators for population ratio in double sampling. Aligarh J. Statist., 17/18, 85--100.

[1002] Sinha, B.K. (1973). On sampling schemes to realize pre-assigned sets of inclusion probabilities of first two orders. Calcutta Statist. Assoc. Bull., 22, 89--100.

[1003] Sisodia, B.V.S. and Dwivedi, V.K. (1981). A modified ratio estimator using coefficient of variation of auxiliary variable. J. Indian Soc. Agric. Statist., 33, 13--18.

[1004] Sisodia, B.V.S. and Singh, A. (2001). On small area estimation -- An empirical study. J. Indian Soc. Agric. Statist., 54(3), 303--306.

[1005] Sitter, R.R. (1993). Balanced repeated replications based on orthogonal multi-arrays. Biometrika, 80, 211--221.

[1006] Sitter, R.R. (1997). Variance estimation for the regression estimator in two-phase sampling. J. Amer. Statist. Assoc., 92, 780--787.

[1007] Sitter, R.R. and Rao, J.N.K. (1997). Imputation for missing values and corresponding variance estimation. Canad. J. Statist., 25(1), 61--73.

[1008] Sitter, R.R. and Wu, C. (2002). Efficient estimation of quadratic finite population functions. J. Amer. Statist. Assoc., 97, 535--543.

[1009] Skinner, C.J. (1991). On the efficiency of raking ratio estimation for multiple frame surveys. J. Amer. Statist. Assoc., 86, 779--784.

[1010] Skinner, C.J., Holt, D. and Smith, T.M.F. (1989). Analysis of complex surveys. Wiley, New York.

[1011] Skinner, C.J. and Rao, J.N.K. (1996). Estimation in dual frame surveys with complex designs. J. Amer. Statist. Assoc., 91, 349--356.

[1012] Smith, H.F. (1938). An experimental law describing heterogeneity in the yields of agricultural crops. J. Amer. Statist. Assoc., 28, 1--23.

[1013] Smith, J.H. (1947). Estimation of linear functions of cell proportions. Ann. Math. Statist., 18, 231--254.

[1014] Smith, P. and Sedransk, J. (1983). Lower bounds for confidence coefficients for confidence intervals for finite population quantiles. Commun. Statist. -- Theory Meth., 12, 1329--1344.

[1015] Smith, S.K. and Lewis, B.B. (1980). Some new techniques for applying the housing unit method of local population estimation. Demography, 17, 323--340.

[1016] Smith, T.M.F. (1969). A note on ratio estimates in multi-stage sampling. J. R. Statist. Soc., A, 132, 426--430.

[1017] Smith, T.M.F. (1978). Principles and problems in the analysis of repeated surveys. In N. Krishnan Namboodiri, ed., Survey Sampling and Measurement, Acad. Press, NY.

[1018] Smith, T.M.F. (1984). Present position and potential developments: some personal views, sample surveys. J. R. Statist. Soc., A, 147, 208--221.

[1019] Smith, T.M.F. and Sugden, R.A. (1985). Inference and the ignorability of selection for experiments and surveys. Bull. Int. Statist. Inst., 44th Session, Book II, 10.2-1 to 10.2-12.

[1020] Smith, T.M.F. (1995). Problems of resource allocation. Proc. Statist. Can. Symp., 95, Statistics Canada, 107--114.

[1021] Srikantan, K.S. (1963). A note on interpenetrating sub-samples of unequal sizes. Sankhyā, B, 25, 345--350.

[1022] Srinath, K.P. (1971). Multiphase sampling in non-response problems. J. Amer. Statist. Assoc., 66, 583--586.

[1023] Srinath, K.P. and Hidiroglou, M.A. (1980). Estimation of variance in multi-stage sampling. Metrika, 27, 121--125.

[1024] Srivastava, J. and Ouyang, Z. (1992). Studies on a general estimator in sampling, utilizing extraneous information through a sample weight function. J. Statist. Planning Infer., 31, 199--218.

[1025] Srivastava, J.N. and Saleh, F. (1985). Need of t design in sampling theory. Utilitas Mathematica, 25, 5--17.

[1026] Srivastava, S.K. (1965). An estimator of the mean of a finite population using several auxiliary variables. J. Indian Statist. Assoc., 3, 189--194.

[1027] Srivastava, S.K. (1967). An estimator using auxiliary information in sample surveys. Calcutta Statist. Assoc. Bull., 16, 121--132.

[1028] Srivastava, S.K. (1971). A generalized estimator for the mean of a finite population using multi-auxiliary information. J. Amer. Statist. Assoc., 66, 404--407.

[1029] Srivastava, S.K. (1980). A class of estimators using auxiliary information in sample surveys. Canad. J. Statist., 8(2), 253--254.

[1030] Srivastava, S.K. (1981a). A generalized two-phase sampling estimator. J. Indian Soc. Agric. Statist., 33, 38--46.

[1031] Srivastava, S.K. (1981b). A note on generalized RPO estimator in double sampling. J. Indian Soc. Agric. Statist., 33, 89--93.

[1032] Srivastava, S.K. (1983). Predictive estimation of finite population mean using product estimator. Metrika, 30, 93--99.

[1033] Srivastava, S.K. (1992). A note on improving classes of estimators in sample surveys. J. Indian Soc. Agric. Statist., 44, 267--270.

[1034] Srivastava, S.K. and Jhajj, H.S. (1980). A class of estimators using auxiliary information for estimating finite population variance. Sankhyā, C, 42, 87--96.

[1035] Srivastava, S.K. and Jhajj, H.S. (1981). A class of estimators of the population mean in survey sampling using auxiliary information. Biometrika, 68, 341--343.

[1036] Srivastava, S.K. and Jhajj, H.S. (1983a). A class of estimators of the population mean using multi-auxiliary information. Calcutta Statist. Assoc. Bull., 32, 47--56.

[1037] Srivastava, S.K. and Jhajj, H.S. (1983b). Class of estimators of mean and variance using auxiliary information when correlation coefficient is also known. Siam J., 25(4), 401--409.

[1038] Srivastava, S.K. and Jhajj, H.S. (1986). On the estimation of finite population correlation coefficient. J. Indian Soc. Agric. Statist., 38, 82--91.

[1039] Srivastava, S.K. and Jhajj, H.S. (1987). Improved estimation in two-phase and successive sampling. J. Indian Statist. Assoc., 25, 71--75.

[1040] Srivastava, S.K. and Jhajj, H.S. (1995). Classes of estimators of finite population mean and variance using auxiliary information. J. Indian Soc. Agril. Statist., 47, 119--128.

[1041] Srivastava, S.K., Jhajj, H.S. and Sharma, M.K. (1986). Comparison of some estimators of K and B in finite populations. J. Indian Soc. Agric. Statist., 38(2), 230--236.

[1042] Srivastava, S.R., Khare, B.B. and Srivastava, S.R. (1990). A generalized chain ratio estimator for mean of finite population. J. Indian Soc. Agric. Statist., 42, 108--117.

[1043] Srivastava, V.K. and Bhatnagar, S. (1981). Ratio and product methods of estimation when X is not known. J. Statist. Res., 15, 29--39.

[1044] Srivastava, V.K., Dwivedi, T.D., Chaubey, Y.P. and Bhatnagar, S. (1983). Finite sample properties of Beale's ratio estimator. Commun. Statist. -- Theory Meth., 12(15), 1795--1805.

[1045] Srivenkataramana, T. (1978). Change of origin and scale in ratio and difference methods of estimation in sampling. Canad. J. Statist., 6, 79--86.

[1046] Srivenkataramana, T. (1980). A dual to ratio estimator in sample surveys. Biometrika, 67, 199--204.

[1047] Srivenkataramana, T. and Tracy, D.S. (1979). On ratio and product methods of estimation in sampling. Statistica Neerlandica, 33, 37--49.

[1048] Srivenkataramana, T. and Tracy, D.S. (1980). An alternative to ratio method in sample surveys. Ann. Inst. Statist. Math., A, 32, 111--120.

[1049] Srivenkataramana, T. and Tracy, D.S. (1981). Extending product method of estimation to positive correlation case in surveys. Austral. J. Statist., 23, 95--100.

[1050] Srivenkataramana, T. and Tracy, D.S. (1983). Interchangeability of the ratio and product methods in sample surveys. Commun. Statist. -- Theory Meth., 12(18), 2143--2150.

[1051] Srivenkataramana, T. and Tracy, D.S. (1984). Positive and negative valued auxiliary variates in surveys. Metron, 207--319.

[1052] Srivenkataramana, T. and Tracy, D.S. (1986). Transformations after sampling. Statistics, 17, 597--608.

[1053] Srivenkataramana, T. and Tracy, D.S. (1989). Two-phase sampling for selection with probability proportional to size in sample surveys. Biometrika, 76(4), 818--821.

[1054] Strachan, R., King, M.L. and Singh, S. (1998). Likelihood-based estimation of the regression model with scrambled responses. Austral. & New Zealand J. Statist., 40(3), 279--290.

[1055] Strauss, I. (1982). On the admissibility of estimators for the finite population variance. Metrika, 29, 195--202.

[1056] Stephan, F.F. (1945). The expected value and variance of the reciprocal and other negative powers of a positive Bernoulli variate. Ann. Math. Statist., 16, 50--61.

[1057] Stroud, T.W.F. (1994). Bayesian analysis of binary survey data. Canad. J. Statist., 22, 33--45.

[1058] Stukel, D.M., Hidiroglou, M.A. and Sarndal, C.E. (1996). Variance estimation for calibration estimators: A comparison of jackknifing versus Taylor linearization. Survey Methodology, 22, 107--115.

[1059] Stukel, D.M. and Rao, J.N.K. (1997). Estimation of regression model with nested error structure and unequal error variance under two and three stage cluster sampling. Statist. Prob. Lett., 35, 401--407.

[1060] Stukel, D.M. and Rao, J.N.K. (1999). On small-area estimation under two-fold nested error regression models. J. Statist. Planning Infer., 78, 131--147.

[1061] Subramani, J. (2000). Diagonal systematic sampling scheme for finite populations. J. Indian Soc. Agric. Statist., 53(2), 187--195.

[1062] Subramani, J. and Tracy, D.S. (1993). Determinant sampling scheme for finite populations. Working paper at University of Windsor, Canada.

[1063] Sud, U.C. and Srivastava, A.K. (2000). Estimation of population mean in repeated surveys in the presence of measurement errors. J. Indian Soc. Agric. Statist., 53(2), 125--133.

[1064] Sud, U.C., Srivastava, A.K. and Sharma, D.P. (2001a). On a biased estimator in repeated surveys. J. Indian Soc. Agric. Statist., 54(1), 29--42.

[1065] Sud, U.C., Srivastava, A.K. and Sharma, D.P. (2001b). On the estimation of population variance in repeated surveys. J. Indian Soc. Agric. Statist., 54(2), 355--369.

[1066] Sudakar, K. (1978). A note on 'Circular Systematic Sampling Design'. Sankhyā, C, 40, 72--73.

[1067] Sukhatme, B.V. (1962). Some ratio type estimators in two-phase sampling. J. Amer. Statist. Assoc., 57, 628--632.

[1068] Sukhatme, P.V. (1944). Moments and product moments of moment statistics for samples of the finite and infinite populations. Sankhyā, 6, 363--382.

[1069] Sukhatme, P.V. (1953). Sampling theory of surveys with applications. Iowa State College Press, Ames, Iowa.

[1070] Sukhatme, P.V. (1954). Sampling theory of surveys with applications. Indian Society of Agricultural Statistics, New Delhi.

[1071] Sukhatme, P.V., Panse, V.G. and Sastri, K.V.R. (1958). Sampling techniques for estimating the catch of sea fish in India. Biometrics, 14, 78--96.

[1072] Sukhatme, P.V. and Sukhatme, B.V. (1970). Sampling theory of surveys with applications. Second Edition, Asia Publishing House, Bombay, India.

[1073] Sukhatme, P.V., Sukhatme, B.V., Sukhatme, S. and Asok, C. (1984). Sampling theory of surveys with applications. Iowa State University Press and Indian Society of Agricultural Statistics, New Delhi.

[1074] Sunter, A.B. (1977). List sequential sampling with equal or unequal probabilities without replacement. Applied Statist., 26, 261--268.

[1075] Swain, A.K.P.C. and Mishra, G. (1992). Unbiased estimators of finite population variance using auxiliary information. Metron, 201--215.

[1076] Swain, A.K.P.C. and Mishra, G. (1994). Limiting distribution of ratio estimator of finite population variance. Sankhyā, B, 56, 11--17.

[1077] Swain, A.K.P.C. and Sahoo, L.N. (1982). Comparison of three almost unbiased ratio estimators in a survey for quantitative characteristics. Statistica, 42(3), 397--401.

[1078] Tallis, G.M. (1978). Note on robust estimation in finite populations. Sankhyā, C, 40, 136--138.

[1079] Tallis, G.M. (1991). A note on balanced cluster sampling. Statist. Prob. Lett., 11, 169--172.

[1080] Tam, S.M. (1984). Optimal estimation in survey sampling under a regression superpopulation model. Biometrika, 71, 645--647.

[1081] Tam, S.M. (1986). Characterization of best model based predictors in survey sampling. Biometrika, 73, 232--235.

[1082] Tam, S.M. (1995). Optimal and robust strategies for cluster sampling. J. Amer. Statist. Assoc., 90, 379--382.

[1083] Taylor, J.M., Munoz, A., Bass, S.M., Sah, A.J., Chmiel, J., Kingsley, L. et al. (1990). Estimating the distribution of times from HIV seroconversion to AIDS using multiple imputation. Statist. Med., 9, 505--514.

[1084] Tepping, B.J., Hurwitz, W.N. and Deming, W.E. (1943). On the efficiency of deep stratification in block sampling. J. Amer. Statist. Assoc., 38, 93--100.

[1085] Thompson, M.E. (1997). Theory of sample surveys. Chapman & Hall, London, U.K.

[1086] Thompson, S.K. (1990). Adaptive cluster sampling. J. Amer. Statist. Assoc., 85, 1050--1059.

[1087] Thompson, S.K. and Seber, G.A.F. (1996). Adaptive sampling. New York: Wiley and Sons.

[1088] Tikkiwal, B.D. (1953). Optimum allocation in successive sampling. J. Indian Soc. Agric. Statist., 5, 100--102.

[1089] Tikkiwal, B.D. (1955). Multiphase sampling on successive occasions. Unpublished Ph.D. thesis submitted to North Carolina State University.

[1090] Tikkiwal, B.D. (1958). Theory of successive two-stage sampling. (Abstract). Ann. Math. Statist., 29, 1291.

[1091] Tikkiwal, B.D. (1960). On the theory of classical regression and double sampling estimation. J. R. Statist. Soc., B, 22, 131--138.

[1092] Tikkiwal, B.D. (1965). The theory of two-stage sampling on successive occasions. J. Indian Soc. Agril. Statist., 125--136.

[1093] Tikkiwal, B.D. (1979). Successive sampling -- a review. Bull. Int. Statist. Inst., 48, 367--383.

[1094] Tillé, Y. (1998). Estimation in surveys using conditional inclusion probabilities: Simple random sampling. Int. Statist. Rev., 66, 303--322.

[1095] Tin, M. (1965). Comparison of some ratio estimators. J. Amer. Statist. Assoc., 60, 294--307.

[1096] Toutenburg, H. and Srivastava, V.K. (1998). Estimation of ratio of population means in survey sampling when some observations are missing. Metrika, 48, 177--187.

[1097] Tracy, D.S. (1984). Moments of sample moments. Commun. Statist. -- Theory Meth., 3(5), 553--562.

[1098] Tracy, D.S. and Mangat, N.S. (1995). Respondent's privacy hazards in Moors' randomized response model -- A remedial strategy. Int. J. Math. & Statist. Sci., 4(1), 121--130.

[1099] Tracy, D.S. and Mangat, N.S. (1996a). Some developments in randomized response sampling during the last decade -- A follow up of review by Chaudhuri and Mukerjee. J. Applied Statist. Sci., 4(2/3), 147--158.

[1100] Tracy, D.S. and Mangat, N.S. (1996b). On respondent's jeopardy in two alternate question randomized response model. J. Statist. Planning Infer., 55(1), 107--114.

[1101] Tracy, D.S. and Mangat, N.S. (1998). Comparisons of distinct units based estimators in unrelated question randomized response model. Internat. J. Math. & Statist. Sci., 7, 229--240.

[1102] Tracy, D.S. and Osahan, S.S. (1994a). Determinant sampling versus some conventional sampling schemes. Pak. J. Statist., 10(1), 99--121.

[1103] Tracy, D.S. and Osahan, S.S. (1994b). Estimation in overlapping clusters with unknown population size. Survey Methodology, 20(1), 53--57.

[1104] Tracy, D.S. and Osahan, S.S. (1994c). Random nonresponse on study variable versus on study as well as auxiliary variables. Statistica, 54, 163--168.

[1105] Tracy, D.S. and Osahan, S.S. (1999). A partial randomized response strategy. Test, 4(2), 315--322.

[1106] Tracy, D.S. and Singh, H.P. (1998). A modified ratio cum product estimator. Int. J. Math. & Statist. Sci., 7, 201--212.

[1107] Tracy, D.S. and Singh, H.P. (1999). A general class of chain regression estimators in two-phase sampling. J. Appl. Statist. Sci., 8, 205--216.

[1108] Tracy, D.S., Singh, H.P. and Singh, R. (1996). An alternative to ratio cum product estimator in sample surveys. J. Statist. Planning Infer., 53, 375--387.

[1109] Tracy, D.S., Singh, H.P. and Singh, R. (1998). A class of unbiased estimators alternative to ratio cum product estimator in sample surveys. Prisankhyan Samikkha, 5, 43--50.

[1110] Tracy, D.S., Singh, H.P. and Singh, S. (2001). An investigation on the bias reduction in linear variety of ratio cum product estimator. Allgemeines Statist. Archive, 85, 323--332.

[1111] Tracy, D.S. and Singh, S. (2000). Calibration estimators in randomized response surveys. Metron, 57, 47--68.

[1112] Tracy, D.S., Singh, S. and Arnab, R. (2003). Note on calibration in stratified and double sampling. Survey Methodology, June issue (To appear).

[1113] Tripathi, T.P. (1970). Contributions to the sampling theory using multivariate information. Ph.D. thesis submitted to Punjabi University, Patiala, India.

[1114] Tripathi, T.P. (1976). On double sampling for multivariate ratio and difference methods of estimation. J. Indian Soc. Agric. Statist., 33, 33--54.

[1115] Tripathi, T.P. (1980). A general class of estimators for population ratio. Sankhyā, C, 42, 63--75.

[1116] Tripathi, T.P. (1987). A class of estimators for population mean using multi-variate auxiliary information under general sampling designs. Aligarh J. Statist., 7, 49--62.

[1117] Tripathi, T.P. and Ahmed, M.S. (1995). A class of estimators for a finite population mean based on multivariate information and general two-phase sampling. Calcutta Statist. Assoc. Bull., 45, 203--218.

[1118] Tripathi, T.P. and Chaubey, Y.P. (1992). Improved estimation of a finite population mean based on paired observations. Commun. Statist. -- Theory Meth., 21, 3327--3333.

[1119] Tripathi, T.P. and Singh, H.P. (1992). A class of unbiased product type estimators for the mean suitable for positive and negative correlation situations. Commun. Statist. -- Theory Meth., 21(2), 507--518.

[1120] Tripathi, T.P. and Srivastava, O.P. (1979). Estimation on successive occasions using PPSWR sampling. Sankhyā, C, 41, 84--91.

[1121] Tu, X.M., Meng, X.L. and Pagano, M. (1993). The AIDS epidemic: Estimating survival after AIDS diagnosis from surveillance data. J. Amer. Statist. Assoc., 88, 26--36.

[1122] Tukey, J.W. (1956). Keeping moments like sampling computations simple. Ann. Math. Statist., 27, 37--54.

[1123] Tukey, J.W. (1958). Bias and confidence in not quite large samples. Ann. Math. Statist. (Abstract), 29, 614.

[1124] Tuteja, R.K. and Bahl, S. (1991). Multivariate product estimators. Calcutta Statist. Assoc. Bull., 42, 109--115.

[1125] Unam, I. (1995). Estimating the population mean using supplementary dephased information. Commun. Statist. -- Simula., 24, 733--743.

[1126] Unnithan, V.K.G. (1978). The minimum variance boundary points of stratification. Sankhyā, C, 40, 60--72.

[1127] Upadhyaya, L.N., Kushwaha, K.S. and Singh, H.P. (1990). A modified chain ratio type estimator in two-phase sampling using multi-auxiliary information. Metron, 381--393.

[1128] Upadhyaya, L.N. and Singh, H.P. (1999). Use of transformed auxiliary variable in estimating the finite population mean. Biom. J., 41(5), 627--636.

[1129] Upadhyaya, L.N., Singh, H.P. and Singh, S. (2003). A family of almost unbiased estimators for negatively correlated variables using jackknife technique. Statistica (Accepted).

[1130] Uthayakumaran, N. (1998). Additional circular systematic sampling methods. Biom. J., 40(4), 467--474.

[1131] Valliant, R. (1996). Limitations of balanced half-sampling. J. Official Statist., 12(3), 225--240.

[1132] Valliant, R. (2002). Variance estimation for the general regression estimator. Survey Methodology, 28(1), 103--114.

[1133] Vardeman, S. and Meeden, G. (1983). Admissible estimators in finite population sampling employing various types of prior information. J. Statist. Planning Infer., 7, 329--341.

[1134] Vijayan, K. (1968). An exact πps scheme -- generalization of a method of Hanurav. J. R. Statist. Soc., B, 30, 556--566.

[1135] Vijayan, K. (1975). On estimating variance in unequal probability sampling. J. Amer. Statist. Assoc., 70, 713--716.

[1136] Vos, J.W.E. (1980). Mixing of direct, ratio and product method estimators. Statistica Neerlandica, 34, 209--213.

[1137] Wakimoto, K. (1971). Stratified random sampling (III): Estimation of the correlation coefficient. Ann. Inst. Statist. Math., 23, 339--355.

[1138] Walsh, J.E. (1970). Generalization of ratio estimator for population total. Sankhyā, A, 32, 99--103.

[1139] Warner, S.L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc., 60, 63--69.

[1140] Welch, B.L. (1937). On the z test in randomized blocks and Latin squares. Biometrika, 29, 21--52.

[1141] Williams, W.H. (1958). Unbiased regression estimator. Unpublished Ph.D. Dissertation, Iowa State University, Ames, Iowa.

[1142] Williams, W.H. (1961). Generating unbiased ratio and regression estimators. Biometrics, 17, 267--274.

[1143] Williams, W.H. (1963). The precision of some unbiased regression estimators. Biometrics, 19, 352--361.

[1144] Willson, D., Kirnos, P., Gallagher, J. and Wagner, A. (2002). Variance estimation from calibrated samples. Joint Statistical Meetings, NY -- Section on survey research methods, 3727--3731.

[1145] Wolter, K.M. (1979). Composite estimation in finite populations. J. Amer. Statist. Assoc., 74, 604--613.

[1146] Wolter, K.M. (1984). An investigation of some estimators of variance for systematic sampling. J. Amer. Statist. Assoc., 79, 781--790.

[1147] Wolter, K.M. (1985). Introduction to variance estimation. New York, Springer-Verlag.

[1148] Worthingham, R., Morrison, T., Mangat, N.S. and Desjardins, G. (2002). Bayesian estimates of measurement error for in-line inspection and field tools. Paper IPC2002-27263, International Pipeline Conference 2002, Calgary, Alberta, Canada. The American Society of Mechanical Engineers, New York. Proceedings are on a CD-ROM.

[1149] Wretman, J.H. (1995). Split questionnaires. Presented at the Conference on Methodological Issues in Official Statistics, Stockholm.

[1150] Wright, R.L. (1983). Finite population sampling with multivariate auxiliary information. J. Amer. Statist. Assoc., 78, 879--883.

[1151] Wright, T. (1990). Probability proportional to size (πps) sampling using ranks. Commun. Statist. -- Theory Meth., 19(1), 347--362.

[1152] Wu, C. (2001). Empirical likelihood method for finite populations. Proceedings of Statistics 2001 Canada: The 4th Conference in Applied Statistics, 339--350.

[1153] Wu, C. and Sitter, R.R. (2001). A model calibration approach to using complete auxiliary information from survey data. J. Amer. Statist. Assoc., 96, 185--193.

[1154] Wu, C. and Sitter, R.R. (2001). Variance estimation for the finite population distribution function with complete auxiliary information. Canad. J. Statist., 29(2), 289--307.

[1155] Wu, C.F.J. (1981). Balanced repeated replications based on mixed orthogonal arrays. Biometrika, 78, 181--188.

[1156] Wu, C.F.J. (1982). Estimation of variance of the ratio estimator. Biometrika, 69, 183--189.

[1157] Wu, C.F.J. (1984). Estimation in systematic sampling with supplementary observations. Sankhyā, B, 46, 306--315.

[1158] Wu, C.F.J. (1985). Variance estimation for combined ratio and combined regression estimators. J. R. Statist. Soc., B, 47, 147--154.

[1159] Wynn, H.P. (1977a). Minimax purposive survey sampling design. J. Amer. Statist. Assoc., 72, 655--657.

[1160] Wynn, H.P. (1977b). Optimum designs for finite populations sampling. Statistical Decision Theory and Related Topics (S.S. Gupta and D.S. Moore, eds.), Academic Press, New York, 471--492.

[1161] Wywial, J. (1999). Generalization of Singh and Srivastava's sampling scheme providing unbiased regression estimators. Statistics in Transition, 4(2), 259--281.

[1162] Wywial, J. (2000). On precision of Horvitz--Thompson strategies. Statistics in Transition, 4(5), 779--798.

[1163] Yamada, S. and Morimoto, H. (1992). Sufficiency. Current Issues in Statistical Inference: Essays in Honor of D. Basu, ed. Ghosh and Pathak. Lecture Notes -- Monograph Series, Institute of Mathematical Statistics, Hayward, California, 17, 86--98.

[1164] Yansaneh, I.S. and Fuller, W.A. (1998). Optimal recursive estimation for repeated surveys. Survey Methodology, 24, 31--40.

[1165] Yates, F. (1948). Systematic sampling. Phil. Trans. R. Soc., A, 241, 345--377.

[1166] Yates, F. (1949). Sampling methods for censuses and surveys. London: Charles Griffin and Co.

[1167] Yates, F. (1960). Sampling methods for censuses and surveys. Charles Griffin & Co., London.

[1168] Yates, F. and Grundy, P.M. (1953). Selection without replacement from within strata with probability proportional to size. J. R. Statist. Soc., B, 15, 253--261.

[1169] You, Y. and Rao, J.N.K. (2000a). Small area estimation with unmatched sampling and linking models. Proceedings of Survey Methods Section, 191--196.

[1170] You, Y. and Rao, J.N.K. (2000b). Hierarchical Bayes estimation of small area means using multi-level models. Survey Methodology, 26(2), 173--181.

[1171] Yung, W. and Rao, J.N.K. (2000). Jackknife variance estimation under imputation for estimators using poststratification information. J. Amer. Statist. Assoc., 95, 903--915.

[1172] Zarcovic, S.S. (1960). On the efficiency of sampling with various probabilities and the selection of units with replacement. Metrika, 3, 53--60.

[1173] Zasepa, R. (1962). Badania Statystyczne Metodą Reprezentacyjną (Statistical Surveys by the Representative Method). Warszawa.

[1174] Zinger, A. (1980). Variance estimation in partially systematic sampling. J. Amer. Statist. Assoc., 75, 89--97.

[1175] Zou, G. (1997). Admissible estimation for finite population under the Linex loss function. J. Statist. Planning Infer., 61, 373--384.

[1176] Zou, G. (1999). Variance estimation for unequal probability sampling. Metrika, 50, 71--82.

[1177] Zou, G. and Liang, H. (1997). Admissibility of the usual estimators under error in variables superpopulation model. Statist. Prob. Lett., 32, 301--309.

[1178] Zou, G. and Wan, A.T.K. (2000). Simultaneous estimation of several stratum means under error-in-variables superpopulation models. Ann. Inst. Statist. Math., 52(2), 380--396.

[1179] Zyskind, G. (1967). On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Ann. Math. Statist., 38, 1092--1109.
AUTHOR INDEX

Akar, I. 820, 1132

Alalouf, I.S. 822, 1132


Abernathy, J.R. 942, 966, 1146, 1150
Allen, J. 605, 1058, 1132, 1181
Abul-Ela, A.L.A. 897, 898, 899, 902, 930, 1150

Amahia, G.N. 335, 336, 342, 1132

Adhvaryu, D. 257, 267, 552, 1025, 1131

Amdekar, S.J. 812, 821, 1132

Anderson, D.W. 578, 1154

Adhikary, A.K. 563, 624, 838, 873, 924, 956, 1140

Anderson, H. 931, 944, 1132

Agarwal, C.L. 847, 877, 1131

Agarwal, D.K. 812, 1031 Anscombe, FJ. 126, 1132

Agarwal, M.C. 259, 731, 741, 1098, 1131

Arcos, A. 277, 278, 430, 516, 864, 1133, 1171

Agarwal, S.K. 265, 270, 317, 320, 342, 343, 391, 392, 1131, 1135, 1156

Arnab, R. 342, 422, 432, 436, 440, 442, 511, 512, 700, 748, 848, 857, 859, 864, 878, 880, 882, 896, 916, 919, 956, 957, 958, 961, 963, 968, 1000, 1035, 1041, 1046, 1054, 1132, 1133, 1140, 1162, 1180, 1188

Aggarwal, R. 718, 1171

Ahmed, M.S. 563, 596, 601, 1131, 1188
Arnholt, A.T. 105, 1133
Ahmed, T. 867,1131
Artes, E. 278, 864 , 885, 1133, 1171
Ahsan, M.J. 723, 1131
Asad , H. 498 , 1172
Aires, N. 386 , 1131
Asok, C. 269, 378, 389, 391, 1134, 1186

Ajgaonkar, S.G.P. 391, 516, 605, 1131, 1143, 1144
Avadhani, M.S. 864, 1134
1194 Advanced sampling theory with applications

Bethlehem, J.G. 400, 1135

Bennett, B.M. 738, 1135


Bahadur, R.R . 492, 1134
Bez, K. 601, 1042
Bahal , S. 264, 1189
Bhangoo, B.S . 713,1182
Bandyopadhyay, S. 270,280,422,
511 , 1134 Bhargava,M.933,945,947,949,
955,958,969,1136
Bankier, M.D. 578, 708 , 1134
Bhargava, N.K. 448, 1136
Bansal, M.L. 177,326,335,458,
509,510,957,960, 1134, 1150, Bhatnagar, S. 258, 260, 1136, 1184
1178
Bhattacharyya, S. 498,512,1151,
Bartlett, R.F. 1079, 1135 1161

Bartholomew, D. J. 980, 1134 Bhatia, A. 1065, 1068, 1073, 1101,


1136, 1160
Barnard, J. 1025, 1134
Bhave, S.V. 472 , 1136
Basawa, I.V. 478, 1089, 1135, 1143
Biemer, P.P. 495, 1152
Bass , S.M. 1022, 1187
Binder, D.A. 465, 1076, 1136
Basu, D. 106, 117,280,370,371,
372,373,448,490,518,1135 Biradar, R.S. 209, 248, 265, 271,
274,1136, 1176
Battese, G.E . 1093, 1135
Blackwell, D. 106,492, 1136
Bayless, D.L. 379, 392, 395, 484 ,
1135, 1167 Blight, B.J.N. 645, 1136

Beale, E.M.L. 186, 187, 223, 317, Bogue, D.J. 1081, 1136
1135
Boekema, F.W.M. 494, 1160
Bedi, P.K. 340,391,509,518,1135
Bose, C. 594, 606, 1136
Beegle, L.D. 141, 1167
Bourke, P.D. 966, 968, 1137
Bek, Y. 943 , 1163
Bouza, C. 1041, 1137
Bellhouse, D.R. 126,395,497,508,
563,630,645,842,843,915,961, Brackstone, GJ. 1074, 1081, 1137
970, 1135, 1159, 1167
Author Index 1195

Bratley, P. 5, 1137 Causey, RD. 1075, 1138

Breau, P. 865, 1137 Cebrian, A.A. 261, 275, 420, 586,


1007, 1008, 1139, 1147, 1148
Breidt, F.J. 501, 503, 504, 516, 865,
866, 1137, 1147 Ceccon, C. 267, 1139

Brewer, K.R.W. 214, 312, 370, 373, Chakrabarty, M.C. 126, 127, 1139
377,378,379,380,381 ,384,387,
388,389,390,394,443,494,497, Chakrabarty, R.P. 227, 257, 267,
498,500,624,744,808,880, 274,569,1025,1139,1155
1137,1138,1146,1173
Chakravorty, I.M. 490, 1170
Brillinger, D.R. 126, 1138
Chand,L. 552, 554,606, 1139
Brown, R.M. 257, 1138
Chandak, R. 340, 1177
Brown, J.A. 819, 1138
Chang, H.J. 272, 945, 961, 1139
Bryant, E.C. 714, 715, 717, 1138,
1164 Chang, K.C. 731, 742,1139

Buckland, W.R. 641,1138 Chandra, P. 1042, 1176

Burdick, R.K. 845, 1138 Chandra, S.K. 374, 507, 508, 1172

Burmeister, L.F. 578, 641, 1147 Chao, M.T. 395, 1139

Chatterjee, S. 956, 1139

Chattopadhyaya, A.K. 422, 511,


1134
Cantwell, P.J. 865, 1157
Chaubey, Y.P. 258, 335, 336, 342,
Carlin, B.P. 1099, 1138
1184,1188
Carroll, L.B. 1067, 1073, 1160
Chaudhuri , A. 385, 394, 395, 414,
Casady, R.J. 496, 1138 422,432,433,435,436,458,497,
511,512,563,565,599,624,838,
Cassel, C.M. 374, 373, 378, 394, 857,873,882,896,924,925,926,
395,498,515,713,1031,1032, 956,957,972, 1081, 1139, 1140,
1138 1141, 1161

Causeur, D. 540, 1139 Chaudhary, F.S. 639, 1176



Chen, J. 519, 586,1011 ,1017,1028, Dalenius, T. 704, 708, 718, 817, 968,
1141 980,1137,1142

Chen, S.X. 394,464, 1141 Das,A.C. 449, 639, 641, 1142

Chen, Y. 873,1017,1175 Das, A.K. 199,271 ,373 ,420,607,


1142
Chernick, M.R. 713, 717, 1141
Das, G. 601,1142
Chmiel, J. 1022, 1187
Das, K. 864, 1143
Chotai, J. 857, 887, 1141
Das, M.N. 627, 1169
Choudhry, G.H. 394, 1087, 1141,
1144 Datta, G.S. 1098, 1143

Chowdhury, S. 409, 907, 911, 1180 David, I.P. 267, 1143

Christman, M. 819, 1141 Day, B. 1098, 1143

Christofides, T.C. 972, 1141 Dayal, S. 712, 739, 1143

Chromy, J.R. 389, 1141 Deely, J.1. 741, 1043, 1157

Cingi, H. 748, 1153 Deming, W.E. 714, 980, 1073, 1075,


1076, 1143, 1187
Clark, D.F. 500, 1162
Dempster, A.P. 1095, 1099, 1100,
Clayton, D. 1022, 1141 1143

Cochran, W.G. 138,329,349,373, Deng,L.Y.217,220,317,414,421 ,


376,401,414,453,509,512,673, 425, 1143
838,875,877,919, 1141, 1167
Deo, B. 1056, 1180
Conti, P.L. 129, 1142
Deshpande, M.N. 117, 391, 515,
Cox,D.R. 126,378, 1142 516,1143, 1144

Crisalli, A.N. 1045, 1046, 1139 Desjardins, G. 1065, 1073, 1160,


1190
Cumberland, W.G. 220, 844, 1170
Deville, J.C. 220, 221, 394, 400, 401,
413,421 ,425,504,505,509,516,
518,546,578,591,1003,1038,
Dalabehara, M. 268, 563, 600, 740, 1039,1144
1142

Dey, A. 392, 1144 Ekman, G. 721, 1145

Dharmadhikari, S.W. 448, 486, 1163 Elliott, M.R. 980, 1051, 1145

Diana, G. 267, 1139, 1144 Eltinge, J.L. 495, 1084, 1145, 1148

Dihidar, S. 838, 1140 Ericson, W.A. 726, 1145

Dorfman, A.H. 429, 1144 Eriksson, S.A. 966, 1145

Doss, D.C. 731, 742,1144 Ernst, L.R. 864, 1137, 1153

Dowling, T.A. 930, 1144 Espejo, M.R. 171,272,343,637,


638, 730, 1145, 1172, 1178
Draper, N.R. 726, 1144
Estevao, V.M. 409, 421, 509, 545,
Drew, D. 1087, 1144 593,609,1145

Dubey,J. 731,1175

Dubey, V. 260, 261, 263,1144, 1163

Duncan, G.J. 865, 1144, 1154 Farrell, P.J. 424, 427, 504, 517,
1099,1145, 1146
Dunn, G. 1022, 1141
Fan, J. 503, 1146
Dupont, F. 545, 563, 1003, 1144
Fay, R. 1022, 1146
Durbin,J.191,389,393,1144
Fay,R.E. 1097, 1099, 1146
Dwivedi, T.D. 258, 1184
Fellegi, I.P. 385,387,495,496, 1146
Dwivedi, V.K. 257, 258, 280, 1182
Feller, W. 107, 108, 1146

Feng,S.884,1146

Fienberg, S.E. 126, 1075, 1146


Early, L.J. 500, 1138
Finney, D.J. 638, 639, 1146
Eberhardt, K.R. 220, 1170
Fisher, R.A. 126,492, 1146
Eckler, A.R. 847,865, 1144
Folsom, R.E. 968, 1146
Eichhorn, B.H. 903,920,943, 1144
Foreman, E.K. 312, 808, 1146

Fountain, R.L. 644,645, 1147 Gelfand, A.E. 1099, 1138

Fox, B. L. 5, 1137 Gershunskaya , J. 1084, 1148

Francis, R.LC. 820, 1147 Ghangurde, P.D. 857, 887, 1148

Francisco, C.A. 250,251, 1147 Ghosh, M. 373,492,563, 1081,


1084,1098,1099, 1101, 1143,
Frankel, L.R. 713, 1147 1148,1159

Franklin, L.A. 892, 893,962, 1147 Ghosh, S. 368, 370, 525, 1148

Freedman, D. 126, 1147 Ghosh, S.P. 817,821, 1148

Freund, lE. 66, 1147 Giffard--Jones, W. 812, 1148

Friedlander, D. 126, 1075, 1147 Giommi, A. 1031, 1148

Fuller, W.A. 250, 251, 377, 387, Godambe, V.P. 384, 395,427,428,
410,411,495,569,578,731 ,865, 465,466,467,468,469,470,472,
866, 1065, 1093, 1135, 1147, 476,477,478,480,482,484,513,
1153, 1191 563,743,879,957,1135,1148,
1149

Goel, B.B.P.S. 343, 812,1131,1149

Goga, C. 578, 1144


Gabler, S. 395, 458, 1147
Gonzalez, M.E. 1083, 1099, 1149
Gallagher, J. 867, 1190
Goodman, L.A. 270, 1149
Gambino, J. 495, 1177
Graf, M. 257, 1150
Gangele, R.K. 262, 1176
Graham, J.B. 847,859,865, 1164,
Gangwar, D.K. 596, 1176 1167

Garcia, A. 864, 885, 1133 Graubard, B.L 882, 1155

Garcia, M.R. 261,275,420,586, Gray,H.L.I77, 191, 1173


1007,1008,1139,1147,1148
Greenberg, B.G. 897,898,899,902,
Garg, J.N. 630, 634, 643, 644, 647,

1176, 1177
Gregoire, T.G. 865, 1173
Gautschi, W. 638, 648, 1148

Grewal, I.S. 335, 343, 957, 960, Han, C.P. 731, 742,1139
1150,1180
Hanif, M. 389, 394,498,500,512,
Grey, G.B. 1018, 1163

Gross, S.T. 250, 1150 Hansen, M.H. 150,306,309,324, .


325,327,328,336,340,346,389,
Groves, R.M. 495, 496, 1150 504,505,590,592,765,791,805,
806,815,847,876,975,976,980,
Grubbs, F.E. 1065, 1066, 1073, 1150

Grundy, P.M. 354, 356, 357, 385, Hanurav, T.V. 385, 387, 395,489,
387,390,391,409,411,413,419, 713, 1151
427,428,473,1191
Harter, R.M. 1093, 1135
Gujarati, D. 246, 1150
Hartigan, J.A. 476, 1151
Gupta, B.K. 740, 1150
Hartley, H.O. 130, 180, 182, 266,
Gupta, J.P. 209, 262, 340, 341, 643, 270,291 ,349,389,453,458,495,
721,1150,1169, 1181 509,512,578,586,624,714,715,
717,731 ,742,838,919,1138,
Gupta, M.R. 594, 1174 1144,1149,1151 ,1152,1167

Gupta, P.C. 189,257,264,267,373, Hawkins, D.L. 731, 742, 1139


508,709,864,1025,1131 ,1150,
1154, 1155, 1158 Hayre, L.S. 903, 920, 943, 1145

Gupta, R.K. 600, 1151 Hebert, J.L. 105, 1133

Gupta, V.K. 391, 872,1151,1156 Hedayat, A.S. 128, 1152

Gurney, M. 718, 872, 1142, 1151 Heilborn, D.C. 638, 1152

Guttman, I. 726, 1144 Henderson, C.R. 1090, 1091, 1092,


1152

Hendricks, W.A. 791, 980, 1152

Hajek, J. 395,499,515,516, 1151 Herson, J. 378, 379, 380, 381, 498,


1170
Halmos, P.R. 492, 1151
Herriot, R.A. 1097, 1099, 1146
Hall, P. 257,429,1138, 1144
Herzel, A. 335, 395, 1152, 1156
Hampton, I. 820, 1153

Hess, I. 980, 1155 504,505,590,592,714,765,791,


805,806,815,847,876,975,976,
Heyde, C.C. 465, 1149 980, 1151, 1187

Hidiroglou, M.A. 411, 545, 561, 563, Hutchison, M.C. 191, 1153
590,592,603,847,1003,1087,
1152, 1173, 1183, 1185

Ho, E.W. 378, 379, 380, 381, 498, 1173
Hodges, J.L. 126, 704, 708, 1142, Ireland, C.T. 1075, 1153
1152
Isaki, C.T. 192, 197,261,279,377,
Holt, D. 466, 496, 729, 1097, 1098, 410,412,420,994,1007,1008,
1146,1152, 1160, 1183 1153
Horn, S. 336, 337, 344, 346, 401, Islam, M.A. 738, 1135
409,414,419,420,421,425,426,
474,503,519,696,698,700,899,
900,901,907,911,928,971,
1025, 1027, 1028, 1047, 1158,
1180
Jaech, J.L. 1065, 1073, 1153
Horst, S. 458, 1147
Jain, N. 259, 1131
Horvitz, D.G. 349, 351, 352, 353,
355,356,357,368,371,372,383, Jain, R.K. 263, 1153
384,387,389,400,412,428,431,
436,439,440,464,473,484,485, Jagers, P. 731,1153
490,491,497,499,501,509,511,
512,545,577,635,713,819,897, Jessen, RJ. 379, 714, 715, 717, 791,
898,899,902,919,926,930,942, 847,864,865, 1138, 1153
966,1000, 1006, 1035, 1146,
1150,1153 Jewett, R.S. 872, 1151

Hoza, C. 1099, 1149 Jhajj, H.S. 138, 167, 169, 171, 199,
203,209,259,261,317,318,319,
Huang, K.C. 272, 945, 1139 320,373,420,421,541,578,698,
1153,1183
Huang, L.R. 865, 1153
Jindal, K.K. 630, 643, 644, 647,
Huff, L. 1084, 1148 1176

Hurwitz, W.N. 150,306,309,324, Joarder, A.H. 266, 269, 292, 335,


325,327,328,336,340,346,389, 343,578,588,602,903,905,906,

920,926,960,986,987,993,1007, Katyar, N.P. 596, 1176


1008,1042,1177,1180,1181
Kaur, P. 199, 1154
John, S. 279, 1153
Kaufman, G.M. 449, 1132
Jolly, G.M. 820, 1153
Kaur, S. 715, 1181
Jones, L.V. 126, 1138
Keller, W.J. 400 , 1135
Jones,R.G. 865,1153
Kempthorne, O. 126, 1154
Joshi, V.M. 427, 448, 482, 484 , 486,
Kerkvliet, J. 967, 1154

Khamis, S.H. 107, 117, 132, 1165

Khan, S.U. 552, 723 , 726, 1131,


1154,1155
Kadilar, C. 748 , 1153
Khan, Z. 726, 1054
Kakran, M.S. 280, 1176
Khare , B.B. 265 , 552, 555, 596,
Kale , B.K . 465, 1149 1042,1043,1054,1184

Kalton, G. 578, 865,976, 1144, 1154 Kiefer, G. 458 , 1152

Kapadia, S.B . 508, 1154 Kim, J. 967 , 1054

Karlheinz, F. 731 , 1154 Kim, 1.K. 603, 1017, 1054, 1156

Kashani, H.B. 340, 341 , 961 , 1150, Kireg yera , B. 552 , 595 , 1054, 1055
1159
King, M.L. 903, 905, 906, 926, 1180,
Kashyap, S. 270, 1131 1181, 1185

Kasprzyk, D. 865, 1154 Kingsley, L. 1022, 1187

Kasprzyk, J.R. 976 , 1154 Kimos, P. 867, 1190

Kataria, P. 248 , 1181 Kish, L. 980, 1081, 1086, 1155,


1164
Kathuria, O.P. 847, 877 , 958 , 1154,
1178 Kishore, L. 458, 511 , 1178

Kattan ,A.K.A.498,1172 Kokan , A.R. 726 , 1155



Kolmogorov, A.N. 492, 1155

Konijn, H.S. 512, 621, 1076, 1155 Lahiri, P. 1098, 1143, 1156

Koop, J.C. 177,647, 1155 Laird, N.M. 1099, 1100, 1143, 1156

Korn, E.L. 882, 1155

Kossack, C.F. 705, 1155 Lal, B. 209, 1150

Kothwala, N.H. 189,264, 1150, Lal, M. 511, 1178


1155
Lanke,J.392,930,931,932,1156,
Kott, P.S. 431,514,873, 1155 1167

Kowar, R.M. 497, 1155 LeClere, F.B. 1099, 1158

Krewski, D. 227,274,569,872, Lee, H. 865, 980, 1010, 10 17, 1043,


1017,1155 1156, 1157, 1162

Kuebler, R.R. 942, 1150 Lehmann, E. 126, 1152

Kuhn, H.W. 726, 1155 Lent, J. 865, 1157

Kuk, AY.C. 250,251,257,277, Lepkowski, J.M. 496, 1138, 1150


474,475,579,581,582,584,893,
956,967, 1155, 1156, 1158 Lewis, B.B. 1083, 1183

Kullback, S. 1075, 1153 Lewi~ky,S.980, 1051, 1145

Kulldoff, G. 883, 1156 Leysieffer, F.W. 930, 931,1157

Kumar, E.V. 391,498,1156 Liang, D.H. 961, 1139

Kumar, P. 317, 320, 335, 340, 342, Liang, H. 514, 1191


391, 392, 1131, 1151, 1156, 1177
Linacare, S.J. 495, 1157
Kumar, S. 865, 1156
Lindley, D.V. 741, 1043, 1157
Kundu,S.422,511,1134
Little, R.J.A. 980, 1022, 1051, 1145,
Kushwaha, K.S. 595, 1189 1157

Liu, J.F. 742, 1139

Laake,P. 1044, 1156 Lohr, S.L. 578, 873, 1157



Louis, T.A. 1099, 1156 1179, 1180, 1181, 1182, 1187,


1190
Lu, K.L. 1098, 1143
Mantel, H.J. 495, 1177
Lundstrom, S. 732, 1041, 1157
Manwani, A.H. 639, 1159

Marker, D.A. 1083, 1159

Manisha 1101, 1159


MacGibbon, B. 1099, 1145, 1157
Mayor, J.A. 822, 1159
Madow, L.H. 638, 645, 1157
McCarthy, M.D. 126, 1159
Madow, W.G. 150,504,505,590,
592,638,639,645,791,808,822, McCarthy, P.J. 872, 1159
847,876,1151,1157
McLeod, A.I. 497, 1159
Mahajan, P.K. 245, 345, 420, 698,
874, 1157, 1181 Meeden, G. 257, 373, 563, 1025,
1050,1148,1159,1162,1189
Mahalanobis, P.C. 495, 790, 791,
1157,1158 Meng, X.L. 1022, 1159, 1188

Mahmood, M. 899,900,901 ,961 , Meyer, J. 250, 1173


1158, 1181
Mickey, M.R. 191, 276, 598, 1159
Maiti, T. 512,925,957, 1098, 1140,
1143 Midha, C.K. 389, 1159

Mak, T.K. 250, 251, 257, 277, 474, Midzuno, H. 390, 391, 393, 511,
475,579,581 ,582,584,1156, 512, 1160
1158
Miller, R.G. 227, 1160
Malec, D. 1099, 1158
Miller, S.M. 865, 1157
Mandowara, V.L. 708, 1158
Milne, A. 639, 641, 1160
Mangat, N.S. 129,209,245,262,
335,420,458,509,600,709,896, Mishra, G. 193,420,464, 510, 595,
899,901,920,933,935,939,954, 1160,1186
955,958,959,960,961 ,962,963 ,
964,967,969,970,971 ,1065, Mishra, R.N. 966, 1160
1067, 1068, 1073, 1101, 1136,
1151,1158,1159,1160,1178, Mitra, J. 414, 512,1140

Mittal, J.P. 713, 1182 Narain, P. 1044, 1178

Mohanty, S. 131,260,267,268,279, Narain, R.D. 385, 390, 1161


324,325,336,342,346,1160,
Nayak, T.K. 939, 955, 1162

Mohl, e.A. 409, 1175 Nelson, D. 257, 1162

Montanari, G.E. 273, 513, 1160 Neyman,J. 123,378,529,662, 1162

Moors, J.J.A. 494, 899, 901, 930, Nieto de Pascual, J. 191, 1162
955, 1160
Nieuwenbroek, N.J. 577,600, 1169
Moriarity, e.L. 1099, 1158
Nigam, A.K. 391, 697, 872,1151,
Morimoto, H. 492, 1191 1156, 1173

Morrison, T. 1065, 1067, 1068,


1073, 1101, 1136, 1160, 1190

Moses, L.E. 713, 1160


Oden, A. 731,1153
Moura, F.A.S. 1098, 1160
Ogus, J.K. 500, 1162
Mudholkar, G.S. 864, 1168
Ohlsson, E. 512, 1162
Mukherjee, R. 552,565,599,601,
646,89~924, 1141, 1161 Okafor, F.e. 864, 980, 1162
Mukhopadhyay, P. 448, 449, 498, Olkin, 1. 230, 270, 1162
512, 1151, 1161
Opsomer, J.D. 501, 503, 504, 516,
Murthy, M.N. 145, 174, 191,261, 1137
449,450,464,479,482,484,486,
489,621,643,647,874,1161 Osahan,S.S.128,812,816,824,
955,986,987, 1187, 1188
Munoz, A. 1022, 1187
Ouyang,Z.498,1183

Owen, D.B. 177, 191, 1173

Nadarajah, S. 343, 1145
Padmawar, V.R. 422, 458, 745,
Nanjamma, N.S. 191, 464, 1161

Pagano, M. 1022, 1188

Paik, M.C. 1022, 1162

Panda, K.R. 731, 741, 1131

Panda,P. 840, 841, 842, 874,878, Pokropp,F.223,1163


882, 1163, 1171, 1172
Politz, A. 980, 982, 1163
Pandey, R.N. 261, 1163
Pollock, K.H. 943, 1163
Pandey, S.K. 342, 1163
Prabhu--Ajgaonkar, S.G. 497, 511,
Pannu, C.J.S. 713, 715,1181,1182 599,1163

Panse, V.G. 638,1186 Pradhan, RK. 601, 1164

Patak, Z. 465, 1136 Prasad, R 258,261 ,271 ,317,519,


596,1164
Patel, H.C. 448, 486, 1163
Prasad, N.G.N. 391, 859,1098, 1164
Patel, P.A. 420, 425, 1175
Puertas, S.M. 268, 269, 279, 610, 611, 1177, 1181

Pathak, P.K. 108, 109, 111, 114, 117, 130, 131, 266, 335, 448, 492, 644, 645, 883, 1147, 1148, 1163
645,883, 1147, 1148, 1163 Purcell, N.J. 1081, 1086, 1164

Pattanaik, L.M. 279, 1160

Patterson , H.D. 594, 847, 865, 877,


1163

Pearson, K. 209, 1163

Pedgaonkar, A.M. 497, 1163


639,641, 1164
Perlman, M.D. 492, 1151

Pfeffermann , D. 498, 1084, 1086,


1098,1163,1170,1175

Pickles, A. 1022, 1141


Raghavarao, D. 391, 950, 951, 1156,
Pillai, S.S. 270, 1165

Pineda, M.D. 343, 730, 1145



Rai, A. 517,1175 340,342,343,422,449,497,499,


518,552,601,712,713,740,883,
Raiffa, H. 724, 1164 1132, 1135, 1150, 1161, 1163,
1165, 1168, 1169
Raj,D.107, 117, 132,231,239,270,
275,344,379,447,572,598,604, Rao, V.V.B. 130, 1165
727,808,838,865,875,1164,
1165 Ray, S. 627, 1169

Ramachandran, G. 497,713, 1165 Ray,S.K. 265,267, 373, 1169, 1171,


1179
Ramachandran, V. 270, 1165
Reddy, V.N. 105, 129,261,267,
Ramakrishnan, M.K. 117, 130, 411, 339,340,343,373,499,638,643,
563, 1165 673,1027,1169

Rana,R.S. 209, 638, 1165 Ren, R. 257, 1169

Rancourt,E. 1010, 1017, 1043, Renssen, R.H. 577, 600, 1169


1156,1157
Richardson, S.C. 497, 1169
Rangarajan, R. 876, 882, 1165
Riznic, J. 1067, 1073, 1160
Rao,C.R. 106, 128,946,950, 1152,
1165 Rizvi, S.E.H. 721, 1169

Rao,J.N.K. 117, 130, 141, 174, 191, Robins, J.M. 1022, 1169
220,227,327,330,335,336,344,
349,370,376,378,389,390,391, Robson, D.S. 181 , 187,260,317,
392,395,413,422,428,453,458, 1169
484,494,495,497,509,512,520,
563,578,586,600,630,645,838, Rosen,B.386,448,1169
842,843,847,857,864,865,872,
873,875,879,887,919,980, Ross, A. 180, 182, 291, 266, 1152
1016, 1017, 1018, 1020, 1021,
1026, 1049, 1053, 1074, 1081, Rout, K. 595, 1160
1084,1098,1099, 1101, 1135,
1137,1141,1148, 1152, 1155, Roy,D. 422,432, 433,435,436,
1156, 1157, 1164, 1165, 1166, 924,925,926,957, 1140, 1141
1167, 1168, 1174, 1183, 1185,
1191 Roy, D.C. 1098, 1131

Rao, P.S.R.S. 191, 227, 273, 276, 540, 594, 596, 1167, 1168

Roy, J. 490, 1170

Rao, TJ. 191,268,335,336,339,



Royall, R.M. 214,220,376,377, Sampford, M.R. 389, 395, 808, 820,


378,379,380,381 ,467,497,498, 1172
508,808,810,844,845,846,1170
Santos, J.511,638, 1171
Rubin, D.B. 975, 976, 1000, 1021,
Sarndal, C.E. 220, 221, 374, 377,
1134, 1143, 1170, 1171 378,384,394,395,400,401,411,
413,415,418,421,425,431,443,
Rueda,M.277,278,430,516,864, 497,498,500,503,504,505,508,
1133,1171 509,515,516,5 18,545,546,561,
563,590,591,592,593,598,609,
Ruiz,M.511,638,1171 713,732,1003,1010,1017,1031 ,
1032, 1033, 1034, 1035, 1037,
1038,1039,1040, 1041, 1043,
1087,1138,1144,1145,1152,
1156,1157,1172,1173,1185

Sadasivan, G. 718, 817, 821, 1171 Sastri, K.V.R. 638, 1186

Sah, A.J. 1022, 1187

Sahai, A. 257, 267,1169,1171 Saxena, S. 605, 1132

Sahoo, J. 171, 191,260,324,325, Saxena, S.K. 697, 1173


336,342,346,601,1160,1171,
1172 Schafer, J.L. 499, 1022, 1173

Sahoo, L.N. 171,185,186,187,191, Scharge, L.E. 6, 1137


258,260,263,265,267,268,317,
324,325,336,342,346,563,595, Schenker, N. 499, 1021, 1022, 1171,
601,606,608,740,840,841,842, 1173
874,875,878,882, 1142, 1160,
Schlaifer, R. 724, 1164

Sahoo,R.K.268,608,1172 Schneeberger, H. 708, 1173

Salvan, A. 267, 1139 Schreuder, H.T. 865, 1173

Saleh, F. 515,1183 Schucany, VV.R. 177, 191, 1173

Samiuddin, M. 498, 1172 Scott,A.214,842,845,1173

Sampath, S. 209, 374, 507, 508, 959, Scott, A.J. 387, 379, 380, 381,498,
1027,1172 1173

Seal, A.K. 873, 1140 Shanna,S.S.422,1175

Searls, D.T. 103, 105,261,262,519, Sharma, U.K. 270, 1131


1045,1173
Sharma, Y.K. 517,1175
Seber, G.A.F. 820, 1187
Sharot, T. 191, 1175
Sedransk,J.250, 820, 1099, 1132,
1158,1173,1183 Sheers, N. 896, 1175

Sekkappan, R.M. 563, 726, 859,


1173,1174
Shiue, G.J. 648, 1175
Seller, S. 864, 1174
Shukla, D. 264, 731, 732, 1175, 1182
Sen,A.R.354,356,357,387,390,
391,409,411,413,419,428,473, Shukla, G.K. 279, 1175
511,512,864,1174
Shukla, N.D. 697, 1173
Sengupta, S. 177,395,448,449,
569,624,646,1161,1168,1174 Sielken, FLL. 845, 1138

Serfling, R.J. 708, 1174 Sil, A. 982, 1175

Seth, G.R. 117, 1174

Sethi, V.K. 464, 647,1161,1174 Silverman, B.W. 580, 1175

Shachtman, R.H. 930, 1144

Shah, D.N. 261,420,425,594, Simmons, W.980,982, 1163


1174,1175
Simmons, W. R. 897, 898, 899, 902,
Shah, S.M. 261, 1174

Shannon, D.F. 705, 1175

Shao, J. 873,1011,1016,1017, Singh, A.C. 409, 865, 1084, 1086,


1018, 1020, 1021, 1028, 1049, 1098,1175
1141, 1167, 1175
Singh, A.K. 268, 607, 1176
Sharma, D.P. 864, 884, 1186
Singh, B. 177,955, 1159, 1178
Sharma, M.K. 203, 261, 1184
Singh, B.D. 572, 1176
Sharma, S.D. 982, 1175

Singh, D. 572, 630, 635, 639, 643, Singh, R.K. 203, 250, 265, 342, 596,
644,647,791,812,847,876,883, 698,1101,1159,1163,1179
888,1149,1154,1176
Singh, R.S. 596, 1164
Singh, G. 199,203,265,698, 1154,
Singh, S. 129, 187, 188, 207, 209,
245,248,259,261,262,266,268,
Singh, G.N. 595, 886, 1176, 1182 269,276,277,279,280,281,292,
335,336,337,343,344,345,346,
Singh,H.P.174, 185, 186,209,248, 373,401,409,414,419,420,421,
260,261,262,265,268,269,270, 424,425,426,427,428,432,436,
271,272,274,277,279,289,317, 440,442,473,474,475,503,504,
555,563,595,596,605,607,611, 505,517,519,520,544,561,578,
740,808, 1042, 1058, 1132, 1136, 588,590,592,600,602,605,610,
1164, 1176, 1177, 1178, 1181, 611,696,698,700,713,715,748,
1182, 1188, 1189 812,874,896,899,900,901,903,
905,906,907,911,915,916,919,
Singh, J.P. 715,1181 920,925,926,927,928,951,955,
957,958,959,960,961,962,963,
Singh, K. 373, 1169 964,965,967,969,970,971,984,
986,987,993,1000,1007,1008,
Singh, K.B. 639, 1159 1025,1027,1028,1035, 1041,
1042, 1046, 1047, 1052, 1054,
Singh,M.174,340,343, 1131, 1177 1056, 1058, 1132, 1133, 1134,
1145, 1146, 1150, 1151, 1157,
Singh, M.P. 264, 265, 272, 277, 317, 1158,1159,1176,1177,1178,
370,394,479,482,495,552,713, 1179,1180,1181,1182,1185,
865,1087,1141, 1144, 1154, 1161, 1188,1189
1167, 1177, 1182
Singh, S.K. 263, 265, 1144
Singh, P. 278, 393, 394,420,464,
510,634,635,864,1131,1173, Singh, V.K. 264, 595, 596, 886,
1176, 1177 1176, 1182

Singh,R. 129, 177, 187, 188,209, Singh, V.P. 174,563, 740, 1177,
259,260,262,263,272,276,279, 1182
326,335,340,341,373,458,509,
510,511,512,517,549,607,638, Sinha, B.K. 395,449, 1168, 1182
643,708,709,710,721,808,896,
899,901,933,935,954,955,957, Sinha, J.N. 966, 1160
958,959,960,961,962,963,964,
965,967,969,970,971,984,1041, Sisodia, B.V.S. 257, 258, 280, 1098,
1044, 1134, 1136, 1150, 1158, 1182
1159,1165,1169,1175,1177,
1178, 1179, 1180, 1181, 1182, Sitter, R.R. 411, 425, 430, 504, 519,
1188 568,569,600,872,1017,1026,

1053, 1141, 1167, 1182, 1183, Srivastava, S.K. 138, 160, 164, 166,
1190 167,169,171,199,203,209,239,
248,259,261,266,268,276,280,
Skinner, C.J. 428, 466, 578, 586, 317,318,319,320,373,414,420,
1175,1183 421,541,578,580,698,840,882,
1153,1184
Smarandache, F. 605, 1132
Srivastava, S.R. 552, 555, 1054,
Smeets, R. 494, 1160 1184

Smith, D.N. 864, 1174 Srivastava, V.K. 258, 260, 1048,


1184,1187
Smith, H.F. 379, 765, 775, 788, 1183
Srivenkataramana, T. 161,258,267,
Smith, J.H. 1075, 1183 271,273,280,324,373,391,498,
598, 738, 1156, 1164, 1185
Smith, P. 250, 1183
Stephan, F.F. 121, 1073, 1075, 1076,
Smith, S.K. 1083, 1183 1143,1185

Smith, T.M.F. 126,214,466,495, Sthapit, A.B. 259, 1131


729,842,845,865,874,1097,
1152, 1173, 1183 Strachan, R. 905, 1185

Somayajulu, G.R. 731, 742,1144 Strauss.L 191, 1185

Spiegelhalter, D. 1022, 1141 Stroud, T.W.F. 1099, 1185

Srikantan, K.S. 177, 1183 Stock, J.S. 713, 1147

Srinath, K.P. 498, 847, 980, 1156, Stufken, J. 128, 1152


1183
Stukel, D.M. 411,873, 1084, 1086,
Srinath, M. 817, 821, 1171 1098,1155,1175,1185

Srivastava, A.K. 278, 392, 393, 420, Subramani, J. 127,646, 1185


464,510,864,884, 1144, 1156,
1173,1177,1185,1186 Sud, U.C.864,884, 1185, 1186

Srivastava, J. 498, 1183 Sudakar, K. 622, 1186

Srivastava, J.N. 515, 1183 Sugden, R.A. 126, 1183

Srivastava, o.r. 857, 878, 1188 Sukhatme, B.V. 138,267,269,387,


389,391,594,606,621,708,710,
Srivastava, S. 265, 1042, 1054 1143,1179,1186

Sukhatme, P.V. 138, 269, 621, 638, 808, 822, 1051, 1055, 1134, 1186

Tikkiwal, B.D. 597, 847, 864, 877, 1131, 1187

Sukhatme, S. 269, 1186 Tille, Y. 394,493,514, 1144, 1187

Sunter, A.B. 386,495 ,1146,1186 Tin, M. 186, 187,268,317, 1187

Swain, A.K.P.C. 193,258,263,265, Tomberlin, TJ. 1097, 1099, 1143,


420,464,510,595,874,1172, 1145,1152, 1157
1186
Toutenburg, H. 1048, 1187
Swensson, B. 411,413,415,418,
497,503,561,598, 1173 Tracy, D.S. 127, 128, 138, 161,258,
265,271 ,272,273,277,373,555,
578,598,602,700,738,748,812,
816,824,896,901,905,925,927,
928,955,957,959,961,963,965,
Tallis, G.M. 498, 822, 1186
1177,1179,1180,1181,1182,
Tam, S.M. 498, 808, 809, 810, 1186 1185,1187,1188

Tanur, J.M. 126, 1146

Tailor, R. 1058, 1181 Tripathi, T.P. 199,267,270,373,


420,552,563,602,607,857,878,
Taylor, J.M. 1022, 1187

Taylor, R.L. 478, 1135 Trivedi, M. 732, 1175

Tepping, B.J. 714, 1187

Theberge, A. 1076, 1136 Tsutakawa, R.K. 1095, 1143

Thompson, D.J. 349, 351, 352, 353,
355,356,357,368,371 ,372,383,
384,387,389,400,412,428,431, Tucker, A.W. 726, 1155
436,439,440,464,473,484,485,
Tukey, J.W. 126, 191, 594, 1138,
512,545,577,635,713,819,919, 1189
926,1000,1006,1035,1153
Tuteja, R.K. 264, 1189
Thompson, M.E. 370, 373,465,466,
468,469,470,472,476,477,478,
563,859,1149,1174,1187

Thompson, S.K. 819, 820, 1187 Unam, I. 320, 1189



Undy, G.C. 390, 1138

Unnithan, V.K.G. 705, 1189 Williams, W.H. 191, 1190

Upadhyaya, L.N. 248,270, 280, 595, Willson, D. 867, 1190


607,1176,1177,1189
Wolter, K.M. 638, 865, 872,1190
Uthayakumaran, N. 624,959, 1172,
1189 Wood, G.B. 865, 1173

Worthingham, R. 1073, 1190

Valdes, S.R. 1056, 1182 Wretman, J.H. 374, 377, 394, 395,
411, 413, 415, 418, 496, 497, 498,

Valliant, R. 512, 872, 1189 503,515,561,713, 1031, 1032,


1138, 1173, 1190
Vardeman, S. 563, 1189
Wright, R.L. 377, 384, 500, 1173,
Verma, S.S. 517, 1175

Vijayan, K. 387,422,552,601, Wright, T. 325,498, 713, 717, 1141,


1161, 1167, 1189 1190

Vos, J.W.E. 257, 373, 395, 1025, Wu, C. 411, 425, 430, 504, 519,
1141, 1189 1141, 1183, 1190

Wu,C.F.J.217,220,262,317,413,
414,421,425,426,561,638,678,
682,685,689,697,698,872,
1143, 1167, 1190
Wakimoto, K. 209, 1189
Wynn, H.P. 378, 1190, 1191
Walsh, J.E. 257, 267, 320, 580, 1189
Wywial, J. 191,464, 1171, 1191
Wan, A.T.K. 745,1191

Wang, N. 1022, 1169

Wanger, A. 867, 1190

Warner, S.L. 889, 892, 893, 911, 912,
914,920,930,931,933,935,937,
939,952,953,954,955,958,963, Yamada, S. 492, 1190
968,970, 1157, 1189
Yansaneh, I.S. 865, 1191
Webster, J.T. 191, 1167
Yao, L. 1022, 1157

Yates, F. 354,356,357,358,387,
390,391,409,410,413,419,427,
428,473,629,630,641,647,847,
1191

You, Y. 1098, 1191

Young, G.A. 257, 1138

Yu, F. 401,409,414,419,420,421,
425,426,474,503,594,696,698,
700, 928, 1098, 1180, 1191

Yu, M. 1098, 1168

Yung, W. 1021, 1049, 1191

Zaidi, S.M.H. 203, 1179

Zarcovic, S.S. 808, 1191

Zasepa, R. 804, 1191

Zinger, A. 637,1191

Zou,G.514, 745,884,1146,1191

Zyskind, G. 498, 1191


HANDY SUBJECT INDEX

Cluster sampling 765

Adaptive cluster sampling 819 Combined ratio estimator 684

Amy and Michael 360 Combined regression estimator 688


700 '
Anticipated variance 444, 500
Correlation coefficient 50, 209
Auxiliary information 50, 138
Conditional inclusion probabilities
493

Confidence interval 35
Balanced half sample 871
Controlled sampling 125
Balanced sample 380
Cosmetic calibration 500
Basic concepts 1
Cumulative total method 300
Beale's estimator 223
Current topics 494
Best estimator 479

Bootstrap 873
Determinant sampling 127

Distinct units 106

Distribution function 13,428, 588


Calibration 399
Domain estimation 118
Calibrated estimators of variance 409
Dual frames 576
Circular systematic sampling 621
Dual to ratio estimator 161
Chain ratio type estimators 554

Class of estimators 164, 166, 167,


169
Equal allocation 659

Filtration of bias 187 Jackknife 30, 223, 567, 1016

Franklin's model 892

Lahiri's method 303

Godambe strategy 465 Linear trend 627

GREG 399

Grubbs' estimator 1066 Measurement errors 1065

Median 57, 250, 578

Hansen and Hurwitz's estimator 306 Midzuno--Sen method 390

Model assisted calibration 436


Hartley and Ross' unbiased estimator
180 Model based estimators 375, 494

Henderson's model 1090 Multi-auxiliary information 231

Hidden gangs 907 Multi-character surveys 326

Horvitz and Thompson's estimator Multi-dimensional systematic


351 sampling 639

Multi-phase sampling 529

Multi-stage sampling 829


Imputation 1009, 1010
Multi-way stratification 713
Inclusion probabilities 349
Multiple imputation 1121
IPPS unbiased estimation 462

Interpenetrating sampling 175


Negatively correlated variables 324
Inverse sampling 123
Non-parametric models 501

Non-response 975 Quantitative variables 11

Quenouille's method 173

Optimum allocation 659

Optimal designs 426 Random number table method 5

Ordered estimators 445 Randomized response 889

Overlapping cluster sampling 812 Rao--Hartley--Cochran strategy 452

Ratio estimator 138

Periodic trends 638 Recalibration 424

Poisson sampling 499 Regression analysis 903

Politz and Simmons model 980 Regression coefficient 203

Post-cluster sampling 817 Regression estimator 149,506

Post-stratified sampling 649, 729 Regression predictor 431

Power transformation estimator 160 Re-sampling methods 866

PPS circular systematic sampling Respondent's protection 942


623
Response probabilities 1031
PPSWR295
Ridge regression 905
PPSWOR 349
Royall's technique 846
Prediction variance 444

Product estimator 145


Sampling distribution 32
Proportion 39, 94
Searls estimator 103
Proportional allocation 659
Sen--Yates--Grundy variance 354

Separate ratio estimator 677


Qualitative variables 11
Separate regression estimator 681

Small area estimation 1081

SRSWR 4

Variance estimation 30, 36, 409

SRSWOR 5

Strata boundaries 701
Warner's model 889
Stratified sampling 649

Stochastic randomization 951


Zinger strategy 657
Successive sampling 847

Superpopulation model 214, 377

Supplemented panels 865

Synthetic estimator 1087

Systematic sampling 615

Two-phase sampling 529

Two-stage sampling 829

Unbiased estimators 24

Unbiased estimators of ratio and


product 173

Unified theory 479

Unordered estimators 449

Unrelated question model 897


ADDITIONAL INFORMATION

1 Barnett, V. (2002). Sample Survey Principles and Methods. 3rd Ed., Arnold, London.
2 Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N. and Sudman, S. (1992). Measurement Errors in Surveys. Wiley.
3 Brewer, K. (2002). Combined Survey Sampling Inference. Arnold.
4 Cassel, C., Särndal, C.E. and Wretman, J.H. (1992). Foundations of Inference in Survey Sampling. Krieger Publishing Company.
5 Chaudhuri, A. and Vos, J.W.E. (1988). Unified Theory and Strategies of Survey Sampling. North-Holland.
6 Chaudhuri, A. and Stenger, H. (1992). Survey Sampling: Theory and Methods. Marcel Dekker, NY.
7 Cingi, H. (1994). Sampling Theory. Hacettepe University Press.
8 Cochran, W.G. (1977). Sampling Techniques. 3rd Ed., Wiley.
9 Coleman, P.B. (1993). Practical Sampling Techniques for Infrared Analysis. Marcel Dekker, NY.
10 Cox, B.G., College, M., Binder, D., Kott, P.S. and Christianson, A. (1995). Business Survey Methods. Wiley.
11 Deming, W.E. (1950). Some Theory of Sampling. Wiley.
12 Groves, R.M. (1988). Telephone Survey Methodology. Wiley.
13 Groves, R.M. (1989). Survey Errors and Survey Costs. Wiley.
14 Foreman, E.K. (1991). Survey Sampling Principles. Marcel Dekker.
15 Hajek, J. (1981). Sampling from a Finite Population. Marcel Dekker.
16 Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1953). Sample Survey Methods and Theory. Wiley.
17 Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1993). Sample Survey Methods and Theory. Wiley.
18 Hedayat, A.S. and Sinha, B.K. (1991). Design and Inference in Finite Population Sampling. Wiley.
19 Jessen, R.J. (1978). Statistical Survey Techniques. Wiley.
20 Kish, L. (1995). Survey Sampling. Wiley.
21 Lohr, S.L. (1999). Sampling: Design and Analysis. Duxbury Press.
22 Mendenhall, W. (1979). Elementary Survey Sampling. Duxbury.
23 Moser, C.A. and Kalton, G. (1971). Survey Methods in Social Investigation. London: Heinemann.
24 Mukhopadhyay, P. (1998). Small Area Estimation in Survey Sampling. Narosa Publishing House.
25 Mukhopadhyay, P. (1998). Theory and Methods of Survey Sampling. Prentice-Hall of India.
26 Mukhopadhyay, P. (2002). Topics in Survey Sampling. Springer-Verlag, NY.
27 Murthy, M.N. (1967). Sampling Theory and Methods. Statistical Publishing Society, Calcutta.
28 Platek, R., Rao, J.N.K., Särndal, C.E. and Singh, M.P. (1987). Small Area Statistics. Wiley.
29 Raj, D. (1968). Survey Sampling. McGraw Hill.
30 Rao, J.N.K. (2003). Small Area Estimation: Methods and Applications. Wiley.
31 Rossi, P.H., Wright, J.D. and Anderson, A.S. (1983). Handbook of Survey Research. Academic Press.
32 Särndal, C.E., Swensson, B. and Wretman, J.H. (1992). Model Assisted Survey Sampling. Springer.
33 Scheaffer, R.L., Mendenhall, W. and Ott, R.L. (1996). Elementary Survey Sampling. Duxbury.
34 Seber, G.A.F. (1981). Estimation of Animal Abundance and Related Parameters. Griffin, London.
35 Singh, R. and Mangat, N.S. (1996). Elements of Survey Sampling. Kluwer Academic Publishers.
36 Som, R.K. (1995). Practical Sampling Techniques. Marcel Dekker, NY.
37 Stuart, A. (1984). The Ideas of Sampling. Griffin, London.
38 Sudman, S. (1976). Applied Sampling. Academic Press.
39 Sukhatme, P.V., Sukhatme, B.V., Sukhatme, S. and Asok, C. (1984). Sampling Theory of Surveys with Applications. 3rd Ed., Iowa State University Press, Ames, Iowa.
40 Thompson, M.E. (1997). Theory of Sample Surveys. Chapman & Hall.
41 Thompson, S.K. (1992). Sampling. Wiley.
42 Tryfos, P. (1996). Sampling Methods for Applied Research. Wiley.
43 Valliant, R., Dorfman, A.H. and Royall, R.M. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley.
44 Williams, B. (1978). A Sampler on Sampling. Wiley.
45 Wolter, K.M. (1985). Introduction to Variance Estimation. Springer.
46 Yates, F. (1960). Sampling Methods for Censuses and Surveys. Griffin, London.

Multi-purpose: Textbook for teachers/students. Reference manual for researchers. Practical guide for statisticians.

Highlights: Basic concepts to advanced technology, SRSWR, SRSWOR, General class of estimators, Ratio and regression type estimators, Bias filtration, Median estimation, PPSWR sampling, Multi-character survey, PPSWOR sampling, Rao--Hartley--Cochran strategy, Calibration of estimators of total, variance, and distribution function, etc., Multi-phase sampling, Systematic sampling, Stratified and Post-stratified sampling, Cluster sampling, Multi-stage sampling, Randomized response sampling, Hidden gangs, Imputation, Measurement errors, Small area estimation, many more topics, and fun.
