
Survey Sampling e Content

This document provides an overview of sample surveys and key survey concepts. It discusses the difference between a sample and a census, with a sample collecting information from a subset of the population. The key concepts covered include sampling units, populations, sampling frames, random and non-random sampling. It notes that random sampling allows for representative samples where all population features are present. The advantages of sampling over a complete census are also summarized as reduced cost and scope, easier organization of work, greater accuracy possible, ability to obtain urgent information quickly, and increased feasibility when units may be destroyed.

Uploaded by

avirarao8

BSc II Year

Paper Code: 295

Survey Sampling

Unit 1 and Unit 2


UNIFIED SYLLABUS OF STATISTICS
B.Sc. Part- II

Paper II : Survey Sampling


UNIT – I
Sampling Method : Concept of population, sample, parameter and statistic, sampling
versus census, advantages of sampling methods, role of sampling theory, sampling and
non-sampling errors, bias and its effects, probability sampling.

UNIT-II
Simple Random sampling with and without replacement, use of random number tables in
selection of simple random sample, estimation of population mean and proportion.
Derivation of expression for variance of these estimates. Estimates of variance. Sample
size determination.

UNIT-III
Stratified random sampling. Problem of allocation, proportional allocation, optimum
allocation. Derivation of the expression for the standard errors of the usual estimators
when these allocation are used. Gain in precision due to stratification.

UNIT-IV
Systematic sampling : estimation of population mean and population total, standard errors
of these estimators. Cluster sampling with equal clusters. Estimation of population mean
and their mean square error.
Notes on Sample Survey
Chapter 1
Introduction

Statistics is the science of data.

Data are the numerical values containing some information.

Statistical tools can be used on a data set to draw statistical inferences, which are in turn used
for various purposes. For example, governments use such data to formulate policy for the welfare of
the people, and marketing companies use data from consumer surveys to improve the company and to
provide better services to customers. Such data are obtained through sample surveys. Sample surveys
are conducted throughout the world by governmental as well as non-governmental agencies. For example,
the “National Sample Survey Organization (NSSO)” conducts surveys in India, “Statistics Canada”
conducts surveys in Canada, and agencies of the United Nations such as the “World Health Organization
(WHO)” and the “Food and Agriculture Organization (FAO)” conduct surveys in different countries.

Sampling theory provides the tools and techniques for data collection keeping in mind the objectives to
be fulfilled and nature of population.

There are two ways of obtaining the information


1. Sample surveys
2. Complete enumeration or census

Sample surveys collect information on a fraction of the total population, whereas a census collects
information on the whole population. Some surveys, e.g., economic surveys, agricultural surveys etc.,
are conducted regularly. Some surveys are need based and are conducted when some need arises, e.g.,
consumer satisfaction surveys at a newly opened shopping mall to see the satisfaction level with the
amenities provided in the mall.

Sampling unit:
An element or a group of elements on which the observations can be taken is called a sampling unit.
The objective of the survey helps in determining the definition of sampling unit.

For example, if the objective is to determine the total income of all the persons in the household, then
the sampling unit is household. If the objective is to determine the income of any particular person in
the household, then the sampling unit is the income of the particular person in the household. So the
definition of sampling unit depends and varies as per the objective of the survey. Similarly, in another
example, if the objective is to study the blood sugar level, then the sampling unit is the value of blood
sugar level of a person. On the other hand, if the objective is to study the health conditions, then the
sampling unit is the person on whom the readings on the blood sugar level, blood pressure and other
factors will be obtained. These values will together classify the person as healthy or unhealthy.

Population:
The collection of all the sampling units in a given region at a particular point of time or during a
particular period is called the population. For example, if the medical facilities in a hospital are
to be surveyed through the patients, then all the patients registered in the hospital during the time
period of the survey constitute the population. Similarly, if the production of wheat in a district is
to be studied, then all the fields cultivating wheat in that district constitute the population. The
total number of sampling units in the population is the population size, generally denoted by N. The
population size can be finite or infinite (N is very large).

Census:
The complete count of the population is called a census. The observations on all the sampling units in
the population are collected in the census. For example, in India, the census is conducted every ten
years, and observations on all the persons staying in India are collected.

Sample:
One or more sampling units are selected from the population according to some specified procedure.
A sample consists only of a portion of the population units. Such a collection of units is called the
sample.

In the context of sample surveys, a collection of units like households, people, cities, countries etc. is
called a finite population.
A census is a 100% sample and it is a complete count of the population.

Representative sample:
When all the salient features of the population are present in the sample, then it is called a
representative sample.
In what follows, every sample is assumed to be a representative sample.

For example, if a population has 30% males and 70% females, then we also expect the sample to have
nearly 30% males and 70% females.

In another example, if we take out a handful of wheat from a 100 Kg. bag of wheat, we expect the
same quality of wheat in hand as inside the bag. Similarly, it is expected that a drop of blood will give
the same information as all the blood in the body.

Sampling frame:
The list of all the units of the population to be surveyed constitutes the sampling frame. All the
sampling units in the sampling frame have identification particulars. For example, all the students in a
particular university listed along with their roll numbers constitute the sampling frame. Similarly, the
list of households with the name of head of family or house address constitutes the sampling frame. In
another example, the residents of a city area may be listed in more than one frame - as per automobile
registration as well as the listing in the telephone directory.

Ways to ensure representativeness:


There are two possible ways to ensure that the selected sample is representative.

1. Random sample or probability sample:


The selection of units in the sample from a population is governed by the laws of chance or probability.
The probability of selection of a unit can be equal as well as unequal.

2. Non-random sample or purposive sample:
The selection of units in the sample from population is not governed by the probability laws.

For example, the units are selected on the basis of the personal judgment of the surveyor. Persons
volunteering to take some medical test or to drink a new type of coffee also constitute a non-random
sample.

Another type of sampling is quota sampling. The survey in this case is continued until a
predetermined number of units with the characteristic under study are picked up.

For example, in order to conduct an experiment on a rare type of disease, the survey is continued till
the required number of patients with the disease are found.

Advantages of sampling over complete enumeration:


1. Reduced cost and enlarged scope.
Sampling involves the collection of data on smaller number of units in comparison to the
complete enumeration, so the cost involved in the collection of information is reduced. Further,
additional information can be obtained at little cost in comparison to conducting another
separate survey. For example, when an interviewer is collecting information on health
conditions, then he/she can also ask some questions on health practices. This will provide
additional information on health practices and the cost involved will be much less than
conducting an entirely new survey on health practices.

2. Organization of work:
It is easier to manage the organization of collection of smaller number of units than all the units
in a census. For example, in order to draw a representative sample from a state, it is easier to
manage to draw small samples from every city than drawing the sample from the whole state at
a time. This ultimately results in more accuracy in the statistical inferences because better
organization provides better data and in turn, improved statistical inferences are obtained.

3. Greater accuracy:
The persons involved in the collection of data are trained personnel. They can collect the data
more accurately if they have to cover a smaller number of units rather than a large number of units.

4. Urgent information required:


The data from a sample can be quickly summarized.
For example, crop production can be forecast more quickly on the basis of a sample of data than by
first collecting all the observations.

5. Feasibility:
Conducting the experiment on a smaller number of units, particularly when the units are
destroyed, is more feasible. For example, in determining the life of bulbs, it is more feasible to
burn out the minimum number of bulbs. Similarly, in any medical experiment, it is more feasible to
use fewer animals.

Type of surveys:
There are various types of surveys which are conducted on the basis of the objectives to be fulfilled.

1. Demographic surveys:
These surveys are conducted to collect the demographic data, e.g., household surveys, family size,
number of males in families, etc. Such surveys are useful in the policy formulation for any city, state or
country for the welfare of the people.

2. Educational surveys:
These surveys are conducted to collect the educational data, e.g., how many children go to school, how
many persons are graduates, etc. Such surveys are conducted to examine the educational programs in
schools and colleges. Generally, schools are selected first and then the students from each school
constitute the sample.

3. Economic surveys:
These surveys are conducted to collect the economic data, e.g., data related to export and import of
goods, industrial production, consumer expenditure etc. Such data is helpful in constructing the indices
indicating the growth in a particular sector of economy or even the overall economic growth of the
country.

4. Employment surveys:
These surveys are conducted to collect the employment related data, e.g., employment rate, labour
conditions, wages, etc. in a city, state or country. Such data helps in constructing various indices to
know the employment conditions among the people.

5. Health and nutrition surveys:


These surveys are conducted to collect the data related to health and nutrition issues, e.g., number of
visits to doctors, food given to children, nutritional value etc. Such surveys are conducted in cities,
states as well as countries by the national and international organizations like UNICEF, WHO etc.

6. Agricultural surveys:
These surveys are conducted to collect the agriculture related data to estimate, e.g., the acreage and
production of crops, livestock numbers, use of fertilizers, use of pesticides and other related topics. The
government bases its planning related to the food issues for the people based on such surveys.

7. Marketing surveys:
These surveys are conducted to collect the data related to marketing. They are conducted by major
companies, manufacturers or those who provide services to consumer etc. Such data is used for
knowing the satisfaction and opinion of consumers as well as in developing the sales, purchase and
promotional activities etc.

8. Election surveys:
These surveys are conducted to study the outcome of an election or a poll. For example, such polls are
conducted in democratic countries to have the opinions of people about any candidate who is contesting
the election.

9. Public polls and surveys:
These surveys are conducted to collect the public opinion on any particular issue. Such surveys are
generally conducted by the news media and the agencies which conduct polls and surveys on the
current topics of interest to public.

10. Campus surveys:


These surveys are conducted on the students of any educational institution to study about the
educational programs, living facilities, dining facilities, sports activities, etc.

Principal steps in a sample survey:


The broad steps to conduct any sample surveys are as follows:

1. Objective of the survey:


The objective of the survey has to be clearly defined and well understood by the person planning to
conduct it. It is expected from the statistician to be well versed with the issues to be addressed in
consultation with the person who wants to get the survey conducted. In complex surveys, sometimes the
objective is forgotten and data is collected on those issues which are far away from the objectives.

2. Population to be sampled:
Based on the objectives of the survey, decide the population from which the information can be
obtained. For example, population of farmers is to be sampled for an agricultural survey whereas the
population of patients has to be sampled for determining the medical facilities in a hospital.

3. Data to be collected:
It is important to decide which data are relevant for fulfilling the objectives of the survey and to
ensure that no essential data are omitted. Sometimes, too many questions are asked and some of their
outcomes are never utilized. This lowers the quality of the responses and in turn results in lower
efficiency in the statistical inferences.

4. Degree of precision required:
The results of any sample survey are always subject to some uncertainty. Such uncertainty can be
reduced by taking larger samples or using superior instruments. This involves more cost and more time.
So it is very important to decide on the required degree of precision in the data. This needs to be
conveyed to the surveyor also.

5. Method of measurement:
The choice of measuring instrument and the method of measuring the data from the population need to
be specified clearly. For example, the data may be collected through interview, questionnaire,
personal visit, or a combination of these approaches. The forms in which the data are to be
recorded also need to be prepared accordingly, so that the data can be transferred to mechanical
equipment for easily creating the data summary etc.

6. The frame:
The sampling frame has to be clearly specified. The population is divided into sampling units such that
the units cover the whole population and every sampling unit is tagged with identification. The list of
all sampling units is called the frame. The frame must cover the whole population and the units must
not overlap each other in the sense that every element in the population must belong to one and only
one unit. For example, the sampling unit can be an individual member in the family or the whole
family.

7. Selection of sample:
The size of the sample needs to be specified for the given sampling plan. This helps in determining and
comparing the relative cost and time of different sampling plans. The method and plan adopted for
drawing a representative sample should also be detailed.

8. The Pre-test:
It is advised to try the questionnaire and field methods on a small scale. This may reveal some troubles
and problems beforehand which the surveyor may face in the field in large scale surveys.

9. Organization of the field work:
How to conduct the survey, how to handle business administrative issues, providing proper training to
surveyors, procedures, plans for handling the non-response and missing observations etc. are some of
the issues which need to be addressed for organizing the survey work in the fields. The procedure for
early checking of the quality of return should be prescribed. It should be clarified how to handle the
situation when the respondent is not available.

10. Summary and analysis of data:


It is to be noted that, based on the objectives of the survey, a suitable statistical tool is decided
which can answer the relevant questions. In order to use the statistical tool, a valid data set is
required, and this dictates the choice of responses to be obtained for the questions in the
questionnaire, e.g., whether the data have to be qualitative, quantitative, nominal, ordinal etc. After
getting the completed questionnaire back, it needs to be edited to amend the recording errors and
delete the erroneous data. The tabulating procedures, methods of estimation and tolerable amount of
error in the estimation need to be decided before the start of the survey. Different methods of
estimation may be available to get the answer to the same query from the same data set. So the data
collected need to be compatible with the chosen estimation procedure.

11. Information gained for future surveys:


Completed surveys work as a guide for improved sample surveys in the future. Besides this, they also
supply various types of prior information required to use various statistical tools, e.g., mean,
variance, nature of variability, cost involved etc. It is generally seen that things do not always go
as planned in a complex survey. The precautions and alerts noted from a completed survey help in
avoiding mistakes in the execution of future surveys.

Variability control in sample surveys:
The variability control is an important issue in any statistical analysis. A general objective is to draw
statistical inferences with minimum variability. There are various types of sampling schemes which are
adopted in different conditions. These schemes help in controlling the variability at different stages.
Such sampling schemes can be classified in the following way.

1. Before selection of sampling units


• Stratified sampling
• Cluster sampling
• Two stage sampling
• Double sampling etc.

2. At the time of selection of sampling units


• Systematic sampling
• Varying probability sampling

3. After the selection of sampling units


• Ratio method of estimation
• Regression method of estimation
Note that the ratio and regression methods are methods of estimation and not methods of
drawing samples.

Methods of data collection


There are various ways of collecting data. Some of them are as follows:

1. Physical observations and measurements:


The surveyor contacts the respondent personally through the meeting. He observes the sampling unit
and records the data. The surveyor can always use his prior experience to collect the data in a better
way. For example, a young man telling his age as 60 years can easily be observed and corrected by the
surveyor.

2. Personal interview:
The surveyor is supplied with a well prepared questionnaire. The surveyor goes to the respondents and
asks the same questions mentioned in the questionnaire. The data in the questionnaire is then filled up
accordingly based on the responses from the respondents.

3. Mail enquiry:
The well prepared questionnaire is sent to the respondents through postal mail, e-mail, etc. The
respondents are requested to fill in the questionnaires and send them back. In case of postal mail,
the questionnaires are often accompanied by a self addressed envelope with postage stamps to avoid
any non-response due to the cost of postage.

4. Web based enquiry:


The survey is conducted online through internet based web pages. There are various websites which
provide such a facility. The questionnaire has to be prepared in the format of the chosen website, and
the link is sent to the respondents through email. By clicking on the link, the respondent is brought
to the concerned website and the answers are given online. These answers are recorded, and the
responses as well as their summary statistics are sent to the surveyor. The respondents need an
internet connection for data collection with this procedure.

5. Registration:
The respondent is required to register the data at some designated place. For example, the number of
births and deaths, along with the details provided by the family members, are recorded at the city
municipal office.

6. Transcription from records:


The sample of data is collected from the already recorded information. For example, the details of the
number of persons in different families or number of births/deaths in a city can be obtained from the
city municipal office directly.

The methods in (1) to (5) provide primary data which means collecting the data directly from the
source. The method in (6) provides the secondary data which means getting the data from the primary
sources.

Chapter -2
Simple Random Sampling

Simple random sampling (SRS) is a method of selecting a sample of n sampling units out of a
population of N sampling units such that every sampling unit has an equal chance of being chosen.

The samples can be drawn in two possible ways.


• The sampling units are chosen without replacement in the sense that the units once chosen
are not placed back in the population .
• The sampling units are chosen with replacement in the sense that the chosen units are
placed back in the population.

1. Simple random sampling without replacement (SRSWOR):

SRSWOR is a method of selecting n units out of the N units one by one such that, at any stage of
selection, each of the remaining units has the same chance of being selected. At the kth draw this
conditional chance is 1/(N − k + 1); the unconditional chance that a given unit is selected at any
particular draw is 1/N.

2. Simple random sampling with replacement (SRSWR):

SRSWR is a method of selecting n units out of the N units one by one such that, at each stage of
selection, each unit has an equal chance of being selected, i.e., 1/N.

Procedure of selection of a random sample:


The procedure of selection of a random sample follows these steps:
1. Identify the N units in the population with the numbers 1 to N.
2. Choose an arbitrary starting point in the random number table and start reading numbers
from there.
3. Choose the sampling unit whose serial number corresponds to the random number drawn
from the table of random numbers.
4. In case of SRSWR, all the random numbers are accepted even if repeated more than once.
In case of SRSWOR, if any random number is repeated, then it is ignored and more
numbers are drawn.

Such a process can be implemented through programming using the discrete uniform distribution.
Any number between 1 and N can be generated from this distribution, and the corresponding unit can be
selected into the sample by associating an index with each sampling unit. Many statistical software
packages, like R, SAS, etc., have built-in functions for drawing a sample using SRSWOR or SRSWR.
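
As an illustration only, the selection procedure can be sketched in Python with the standard `random` module. The function names `srswor` and `srswr`, the seed and the population of 20 labelled units are assumptions made for this example and are not part of the notes.

```python
import random

def srswor(units, n, rng):
    """Simple random sampling without replacement: n distinct units."""
    return rng.sample(units, n)

def srswr(units, n, rng):
    """Simple random sampling with replacement: a unit may repeat."""
    # randrange draws an index from the discrete uniform distribution on 0..N-1
    return [units[rng.randrange(len(units))] for _ in range(n)]

rng = random.Random(2024)          # fixed seed (arbitrary) so the run is repeatable
population = list(range(1, 21))    # hypothetical units labelled 1..N with N = 20

wor_sample = srswor(population, 5, rng)
wr_sample = srswr(population, 5, rng)
print("SRSWOR:", wor_sample)
print("SRSWR: ", wr_sample)
```

The SRSWOR sample never contains a repeated label, whereas the SRSWR sample may.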

Notations:
The following notations will be used in further notes:

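N: Number of sampling units in the population (population size)
n: Number of sampling units in the sample (sample size)
Y: The characteristic under consideration
$Y_i$: Value of the characteristic for the ith unit of the population
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$: sample mean, where $y_i$ is the value of the characteristic for the ith unit of the sample
$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$: population mean

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right)$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N}\left(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right)$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)$$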
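
The two forms given for each of $S^2$, $\sigma^2$ and $s^2$ can be checked numerically. The following is a minimal sketch using exact fractions; the five population values and the two sampled values are hypothetical numbers chosen for the illustration.

```python
from fractions import Fraction

# Hypothetical population values Y_1, ..., Y_N (not from the notes)
Y = [Fraction(v) for v in (2, 4, 6, 8, 10)]
N = len(Y)
Y_bar = sum(Y) / N

# Definition form and computational form of S^2
S2 = sum((v - Y_bar) ** 2 for v in Y) / (N - 1)
S2_alt = (sum(v * v for v in Y) - N * Y_bar ** 2) / (N - 1)

# sigma^2 uses the divisor N instead of N - 1
sigma2 = sum((v - Y_bar) ** 2 for v in Y) / N

# Hypothetical sample of n = 2 units
y = [Fraction(4), Fraction(10)]
n = len(y)
y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
s2_alt = (sum(v * v for v in y) - n * y_bar ** 2) / (n - 1)

print(Y_bar, S2, sigma2, y_bar, s2)   # 6 10 8 7 18
```

The relation $\sigma^2 = \frac{N-1}{N} S^2$, used later in the variance derivation, follows directly from the two definitions.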

Probability of drawing a sample :


1. SRSWOR:
If n units are selected by SRSWOR, the total number of possible samples is $\binom{N}{n}$.
So the probability of selecting any one of these samples is $\dfrac{1}{\binom{N}{n}}$.

Note that a unit can be selected at any one of the n draws. Let $u_i$ be the ith unit selected in the
sample. This unit can be selected in the sample either at the first draw, second draw, …, or nth draw.
Let $P_j(i)$ denote the probability of selection of $u_i$ at the jth draw, j = 1, 2, ..., n. Then the
probability that $u_i$ is included in the sample is

$$P_1(i) + P_2(i) + \cdots + P_n(i) = \underbrace{\frac{1}{N} + \frac{1}{N} + \cdots + \frac{1}{N}}_{n \text{ times}} = \frac{n}{N}.$$

Now if $u_1, u_2, \ldots, u_n$ are the n units selected in the sample, then the probability of their selection is

$$P(u_1, u_2, \ldots, u_n) = P(u_1)\,P(u_2)\cdots P(u_n).$$

Note that when the second unit is to be selected, then there are (n – 1) units left to be selected in the
sample from the population of (N – 1) units. Similarly, when the third unit is to be selected, then
there are (n – 2) units left to be selected in the sample from the population of (N – 2) units and so on.
If $P(u_1) = \dfrac{n}{N}$, then

$$P(u_2) = \frac{n-1}{N-1},\ \ldots,\ P(u_n) = \frac{1}{N-n+1}.$$

Thus

$$P(u_1, u_2, \ldots, u_n) = \frac{n}{N}\cdot\frac{n-1}{N-1}\cdot\frac{n-2}{N-2}\cdots\frac{1}{N-n+1} = \frac{1}{\binom{N}{n}}.$$

Alternative approach:
The probability of drawing a sample in SRSWOR can alternatively be found as follows:

Let $u_{i(k)}$ denote the ith unit drawn at the kth draw. Note that the ith unit can be any unit out of
the N units. Then $s_o = (u_{i(1)}, u_{i(2)}, \ldots, u_{i(n)})$ is an ordered sample in which the order
in which the units are drawn, i.e., $u_{i(1)}$ drawn at the first draw, $u_{i(2)}$ drawn at the second
draw and so on, is also considered. The probability of selection of such an ordered sample is

$$P(s_o) = P(u_{i(1)})\,P(u_{i(2)} \mid u_{i(1)})\,P(u_{i(3)} \mid u_{i(1)} u_{i(2)}) \cdots P(u_{i(n)} \mid u_{i(1)} u_{i(2)} \cdots u_{i(n-1)}).$$

Here $P(u_{i(k)} \mid u_{i(1)} u_{i(2)} \cdots u_{i(k-1)})$ is the probability of drawing $u_{i(k)}$ at the
kth draw given that $u_{i(1)}, u_{i(2)}, \ldots, u_{i(k-1)}$ have already been drawn in the first (k – 1) draws.

Such probability is obtained as

$$P(u_{i(k)} \mid u_{i(1)} u_{i(2)} \cdots u_{i(k-1)}) = \frac{1}{N-k+1}.$$

So

$$P(s_o) = \prod_{k=1}^{n} \frac{1}{N-k+1} = \frac{(N-n)!}{N!}.$$

The number of orders in which the n units of a sample can be drawn = n!

Probability of drawing a sample in a given order = $\dfrac{(N-n)!}{N!}$

So the probability of drawing a sample in which the order of units in which they are drawn is
irrelevant = $n!\,\dfrac{(N-n)!}{N!} = \dfrac{1}{\binom{N}{n}}$.
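
The ordered-sample argument can be checked by brute force for a small case. The sketch below assumes N = 5 and n = 3 (arbitrary values chosen for the illustration), assigns every ordered sample the probability $\prod_{k=1}^{n} 1/(N-k+1)$, and adds up the probabilities of the orderings that contain the same set of units.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial

N, n = 5, 3   # hypothetical small population and sample size

# Probability of one ordered sample: product of 1/(N-k+1) over k = 1..n
p_ordered = Fraction(1)
for k in range(1, n + 1):
    p_ordered *= Fraction(1, N - k + 1)

# Add up the probabilities of all orderings that contain the same units
prob_by_set = {}
for ordered in permutations(range(1, N + 1), n):
    key = frozenset(ordered)
    prob_by_set[key] = prob_by_set.get(key, 0) + p_ordered

print(p_ordered)                    # 1/60, i.e. (N-n)!/N!
print(len(prob_by_set))             # 10 unordered samples
print(set(prob_by_set.values()))    # every one has probability 1/10
```

Each unordered sample collects n! = 6 orderings, so its probability is 6/60 = 1/10 = $1/\binom{5}{3}$.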

2. SRSWR
When n units are selected with SRSWR, the total number of possible samples is $N^n$. The
probability of drawing a sample is $\dfrac{1}{N^n}$.

Alternatively, let $u_i$ be the ith unit selected in the sample. This unit can be selected in the sample
either at the first draw, second draw, …, or nth draw. At any stage, there are always N units in the
population in case of SRSWR, so the probability of selection of $u_i$ at any stage is 1/N for all
i = 1, 2, …, n. Then the probability of selection of n units $u_1, u_2, \ldots, u_n$ in the sample is

$$P(u_1, u_2, \ldots, u_n) = P(u_1)\,P(u_2)\cdots P(u_n) = \frac{1}{N}\cdot\frac{1}{N}\cdots\frac{1}{N} = \frac{1}{N^n}.$$

Probability of drawing a unit
1. SRSWOR
Let $A_\ell$ denote the event that a particular unit $u_j$ is not selected at the $\ell$th draw, and let
$\bar{A}_\ell$ denote its complement. The probability of selecting, say, the jth unit at the kth draw is

$$P(\text{selection of } u_j \text{ at the kth draw}) = P(A_1 \cap A_2 \cap \cdots \cap A_{k-1} \cap \bar{A}_k)$$

$$= P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 A_2) \cdots P(A_{k-1} \mid A_1 A_2 \cdots A_{k-2})\,P(\bar{A}_k \mid A_1 A_2 \cdots A_{k-1})$$

$$= \left(1 - \frac{1}{N}\right)\left(1 - \frac{1}{N-1}\right)\left(1 - \frac{1}{N-2}\right)\cdots\left(1 - \frac{1}{N-k+2}\right)\frac{1}{N-k+1}$$

$$= \frac{N-1}{N}\cdot\frac{N-2}{N-1}\cdots\frac{N-k+1}{N-k+2}\cdot\frac{1}{N-k+1} = \frac{1}{N}.$$

2. SRSWR

$$P[\text{selection of } u_j \text{ at the kth draw}] = \frac{1}{N}.$$
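
The telescoping product in the SRSWOR case can be evaluated exactly. The helper name `prob_selected_at_draw` and the value N = 10 are assumptions made for this sketch only.

```python
from fractions import Fraction

def prob_selected_at_draw(N, k):
    """P(a given unit is first selected at the kth draw) under SRSWOR."""
    p = Fraction(1)
    for l in range(1, k):
        # unit not selected at draw l, when N - l + 1 units remain
        p *= 1 - Fraction(1, N - l + 1)
    # unit selected at draw k, when N - k + 1 units remain
    return p * Fraction(1, N - k + 1)

N = 10   # hypothetical population size
probs = [prob_selected_at_draw(N, k) for k in range(1, N + 1)]
print(probs)   # every entry equals 1/10
```

The result does not depend on k, which is why the unconditional selection probability at any draw is 1/N under SRSWOR as well as SRSWR.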

Estimation of population mean and population variance


One of the main objectives after the selection of a sample is to know about the tendency of the data
to cluster around the central value and the scatteredness of the data around the central value. Among
various indicators of central tendency and dispersion, the popular choices are the arithmetic mean and
variance. So the population mean and population variability are generally measured by the arithmetic
mean (or weighted arithmetic mean) and variance, respectively. There are various popular estimators
for estimating the population mean and population variance. Among them, the sample arithmetic mean
and sample variance are more popular than other estimators. One of the reasons to use these
estimators is that they possess nice statistical properties. Moreover, they are also obtained through
well established statistical estimation procedures like maximum likelihood estimation, least squares
estimation, method of moments etc. under several standard statistical distributions. One may also
consider other indicators like median, mode, geometric mean, harmonic mean for measuring the
central tendency and mean deviation, absolute deviation, Pitman nearness etc. for measuring the
dispersion. The properties of such estimators can be studied by numerical procedures like
bootstrapping.
1. Estimation of population mean
Let us consider the sample arithmetic mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ as an estimator of the
population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ and verify that $\bar{y}$ is an unbiased estimator of
$\bar{Y}$ under the two cases.

SRSWOR
Let $t_s = \sum_{i=1}^{n} y_i$ denote the total of the sampled values in the sth of the $\binom{N}{n}$
possible samples. Then

$$E(\bar{y}) = \frac{1}{n} E\left(\sum_{i=1}^{n} y_i\right) = \frac{1}{n} E(t_s) = \frac{1}{n}\left[\frac{1}{\binom{N}{n}} \sum_{s=1}^{\binom{N}{n}} t_s\right] = \frac{1}{n}\,\frac{1}{\binom{N}{n}} \sum_{s=1}^{\binom{N}{n}} \left(\sum_{i=1}^{n} y_i\right)_s.$$

When n units are sampled from N units without replacement, each unit of the population
can occur with other units selected out of the remaining (N − 1) units in the population, and each
unit occurs in $\binom{N-1}{n-1}$ of the $\binom{N}{n}$ possible samples. So

$$\sum_{s=1}^{\binom{N}{n}} \left(\sum_{i=1}^{n} y_i\right)_s = \binom{N-1}{n-1} \sum_{i=1}^{N} Y_i.$$

Now

$$E(\bar{y}) = \frac{(N-1)!}{(n-1)!\,(N-n)!}\cdot\frac{n!\,(N-n)!}{n\,N!} \sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}.$$

Thus $\bar{y}$ is an unbiased estimator of $\bar{Y}$. Alternatively, the following approach can also be
adopted to show the unbiasedness property.

$$E(\bar{y}) = \frac{1}{n}\sum_{j=1}^{n} E(y_j) = \frac{1}{n}\sum_{j=1}^{n}\left[\sum_{i=1}^{N} Y_i P_j(i)\right] = \frac{1}{n}\sum_{j=1}^{n}\left[\sum_{i=1}^{N} Y_i \cdot \frac{1}{N}\right] = \frac{1}{n}\sum_{j=1}^{n} \bar{Y} = \bar{Y}$$

where $P_j(i)$ denotes the probability of selection of the ith unit at the jth stage.
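
Unbiasedness under SRSWOR can be illustrated by listing every possible sample of a small population. The five population values below are hypothetical and chosen only for this sketch; exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical population values (not from the notes)
Y = [Fraction(v) for v in (3, 7, 8, 12, 20)]
N, n = len(Y), 2
Y_bar = sum(Y) / N

# Under SRSWOR all C(N, n) unordered samples are equally likely
samples = list(combinations(range(N), n))
sample_means = [sum(Y[i] for i in s) / n for s in samples]
expected_mean = sum(sample_means) / len(samples)

# Each unit appears in C(N-1, n-1) of the samples
appearances = [sum(1 for s in samples if i in s) for i in range(N)]

print(Y_bar, expected_mean)   # 10 10
print(appearances)            # [4, 4, 4, 4, 4]
```

The average of the sample means over all samples equals the population mean, and the count of samples containing each unit matches $\binom{N-1}{n-1}$.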

SRSWR
$$E(\bar{y}) = \frac{1}{n} E\left(\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{1}{n}\sum_{i=1}^{n} (Y_1 P_1 + \cdots + Y_N P_N) = \frac{1}{n}\sum_{i=1}^{n} \bar{Y} = \bar{Y}$$

where $P_i = \frac{1}{N}$ for all i = 1, 2, ..., N is the probability of selection of a unit. Thus
$\bar{y}$ is an unbiased estimator of the population mean under SRSWR also.

Variance of the estimate
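Assume that each observation has some variance $\sigma^2$. Then

$$\begin{aligned}
V(\bar{y}) &= E(\bar{y} - \bar{Y})^2 \\
&= E\left[\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{Y})\right]^2 \\
&= E\left[\frac{1}{n^2}\sum_{i=1}^{n} (y_i - \bar{Y})^2 + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j \neq i}^{n} (y_i - \bar{Y})(y_j - \bar{Y})\right] \\
&= \frac{1}{n^2}\sum_{i=1}^{n} E(y_i - \bar{Y})^2 + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j \neq i}^{n} E(y_i - \bar{Y})(y_j - \bar{Y}) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 + \frac{K}{n^2} \\
&= \frac{N-1}{Nn} S^2 + \frac{K}{n^2}
\end{aligned}$$

where $K = \sum_{i=1}^{n}\sum_{j \neq i}^{n} E(y_i - \bar{Y})(y_j - \bar{Y})$, and the last step uses
$\sigma^2 = \frac{N-1}{N} S^2$. Now we find K under the setups of SRSWR and SRSWOR.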

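SRSWOR
$$
K = \sum_{i \ne j}^{n}\sum^{n} E(y_i - \bar{Y})(y_j - \bar{Y}).
$$
Consider
$$
E(y_i - \bar{Y})(y_j - \bar{Y}) = \frac{1}{N(N-1)} \sum_{k \ne \ell}^{N}\sum^{N} (y_k - \bar{Y})(y_\ell - \bar{Y}).
$$
Since
$$
\left[\sum_{k=1}^{N} (y_k - \bar{Y})\right]^2 = \sum_{k=1}^{N} (y_k - \bar{Y})^2 + \sum_{k \ne \ell}^{N}\sum^{N} (y_k - \bar{Y})(y_\ell - \bar{Y}),
$$
we have
$$
0 = (N-1) S^2 + \sum_{k \ne \ell}^{N}\sum^{N} (y_k - \bar{Y})(y_\ell - \bar{Y})
$$
and so
$$
\frac{1}{N(N-1)} \sum_{k \ne \ell}^{N}\sum^{N} (y_k - \bar{Y})(y_\ell - \bar{Y}) = \frac{1}{N(N-1)}\left[-(N-1) S^2\right] = -\frac{S^2}{N}.
$$
Thus $K = -n(n-1)\frac{S^2}{N}$ and so substituting the value of K, the variance of $\bar{y}$ under SRSWOR is
$$
\begin{aligned}
V(\bar{y}_{WOR}) &= \frac{N-1}{Nn} S^2 - \frac{1}{n^2}\, n(n-1) \frac{S^2}{N} \\
&= \frac{N-n}{Nn} S^2 .
\end{aligned}
$$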

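SRSWR
$$
\begin{aligned}
K &= \sum_{i \ne j}^{n}\sum^{n} E(y_i - \bar{Y})(y_j - \bar{Y}) \\
&= \sum_{i \ne j}^{n}\sum^{n} E(y_i - \bar{Y})\, E(y_j - \bar{Y}) \\
&= 0
\end{aligned}
$$
because the $i$th and $j$th draws $(i \ne j)$ are independent.
Thus the variance of $\bar{y}$ under SRSWR is
$$
V(\bar{y}_{WR}) = \frac{N-1}{Nn} S^2 .
$$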
It is to be noted that if N is infinite (large enough), then
$$
V(\bar{y}) = \frac{S^2}{n}
$$
in both the cases of SRSWOR and SRSWR. So the factor $\frac{N-n}{N}$ is responsible for changing the
variance of $\bar{y}$ when the sample is drawn from a finite population in comparison to an infinite
population. This is why $\frac{N-n}{N}$ is called the finite population correction (fpc). It may be noted that
$\frac{N-n}{N} = 1 - \frac{n}{N}$, so $\frac{N-n}{N}$ is close to 1 if the ratio of sample size to population size, $\frac{n}{N}$, is very small or
negligible. The term $\frac{n}{N}$ is called the sampling fraction. In practice, fpc can be ignored whenever
$\frac{n}{N} < 5\%$ and for many purposes even if it is as high as 10%. Ignoring fpc will result in the
overestimation of variance of $\bar{y}$.
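As a numerical check of the two variance formulas, the sketch below enumerates every possible sample from a small hypothetical population (the values and n = 2 are assumptions chosen for illustration) and compares the exact variance of the sample mean with $\frac{N-n}{Nn}S^2$ and $\frac{N-1}{Nn}S^2$.

```python
from fractions import Fraction
from itertools import combinations, product


def variance_of_sample_mean(population, n, with_replacement):
    """Exact variance of the sample mean over all equally likely samples."""
    N = len(population)
    pop_mean = Fraction(sum(population), N)
    if with_replacement:
        samples = product(population, repeat=n)   # SRSWR
    else:
        samples = combinations(population, n)     # SRSWOR
    sq_devs = [(Fraction(sum(s), n) - pop_mean) ** 2 for s in samples]
    return sum(sq_devs) / len(sq_devs)


# Hypothetical population (illustration only)
population = [3, 7, 8, 12, 20]
N, n = len(population), 2
pop_mean = Fraction(sum(population), N)
# S^2 uses the divisor N - 1, as in the notes
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)

v_wor = variance_of_sample_mean(population, n, with_replacement=False)
v_wr = variance_of_sample_mean(population, n, with_replacement=True)
fpc = Fraction(N - n, N)  # finite population correction
print(v_wor, v_wr, fpc)
```

Here the sampling fraction is 2/5, so the fpc is far from 1 and the two variances differ noticeably; with n/N below 5% they would nearly coincide.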

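Efficiency of $\bar{y}$ under SRSWOR over SRSWR
$$
\begin{aligned}
V(\bar{y}_{WOR}) &= \frac{N-n}{Nn} S^2 \\
V(\bar{y}_{WR}) &= \frac{N-1}{Nn} S^2 \\
&= \frac{N-n}{Nn} S^2 + \frac{n-1}{Nn} S^2 \\
&= V(\bar{y}_{WOR}) + \text{a positive quantity (for } n > 1\text{)}.
\end{aligned}
$$
Thus
$$
V(\bar{y}_{WR}) > V(\bar{y}_{WOR})
$$
and so, SRSWOR is more efficient than SRSWR.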

Estimation of variance from a sample

Since the expressions of variances of sample mean involve $S^2$, which is based on population values,
these expressions cannot be used in real life applications. In order to estimate the variance of $\bar{y}$
on the basis of a sample, an estimator of $S^2$ (or equivalently $\sigma^2$) is needed. Consider $s^2$ as an
estimator of $S^2$ (or $\sigma^2$) and we investigate its biasedness for $S^2$ in the cases of SRSWOR and
SRSWR.

Consider
$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n} \left[(y_i - \bar{Y}) - (\bar{y} - \bar{Y})\right]^2 \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} (y_i - \bar{Y})^2 - n(\bar{y} - \bar{Y})^2\right]
\end{aligned}
$$
$$
\begin{aligned}
E(s^2) &= \frac{1}{n-1}\left[\sum_{i=1}^{n} E(y_i - \bar{Y})^2 - nE(\bar{y} - \bar{Y})^2\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} Var(y_i) - n\,Var(\bar{y})\right] \\
&= \frac{1}{n-1}\left[n\sigma^2 - n\,Var(\bar{y})\right].
\end{aligned}
$$

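In case of SRSWOR
$$
V(\bar{y}_{WOR}) = \frac{N-n}{Nn} S^2
$$
and so
$$
\begin{aligned}
E(s^2) &= \frac{n}{n-1}\left[\sigma^2 - \frac{N-n}{Nn} S^2\right] \\
&= \frac{n}{n-1}\left[\frac{N-1}{N} S^2 - \frac{N-n}{Nn} S^2\right] \\
&= S^2 .
\end{aligned}
$$
In case of SRSWR
$$
V(\bar{y}_{WR}) = \frac{N-1}{Nn} S^2
$$
and so
$$
\begin{aligned}
E(s^2) &= \frac{n}{n-1}\left[\sigma^2 - \frac{N-1}{Nn} S^2\right] \\
&= \frac{n}{n-1}\left[\frac{N-1}{N} S^2 - \frac{N-1}{Nn} S^2\right] \\
&= \frac{N-1}{N} S^2 \\
&= \sigma^2 .
\end{aligned}
$$
Hence
$$
E(s^2) =
\begin{cases}
S^2 & \text{in SRSWOR} \\
\sigma^2 & \text{in SRSWR.}
\end{cases}
$$
An unbiased estimate of $Var(\bar{y})$ is
$$
\hat{V}(\bar{y}_{WOR}) = \frac{N-n}{Nn} s^2 \quad \text{in case of SRSWOR and}
$$
$$
\hat{V}(\bar{y}_{WR}) = \frac{N-1}{Nn}\cdot\frac{N}{N-1}\, s^2 = \frac{s^2}{n} \quad \text{in case of SRSWR.}
$$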
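The two expectations of $s^2$ can be confirmed by enumeration. This is a minimal sketch on a hypothetical population of five values with n = 3 (both assumed for illustration): it averages $s^2$ over all samples and compares the result with $S^2$ and $\sigma^2 = \frac{N-1}{N}S^2$.

```python
from fractions import Fraction
from itertools import combinations, product


def sample_variance(sample):
    """s^2 with divisor n - 1, computed exactly."""
    n = len(sample)
    ybar = Fraction(sum(sample), n)
    return sum((y - ybar) ** 2 for y in sample) / (n - 1)


def expected_s2(population, n, with_replacement):
    """Average s^2 over every equally likely sample of size n."""
    if with_replacement:
        samples = list(product(population, repeat=n))
    else:
        samples = list(combinations(population, n))
    return sum(sample_variance(s) for s in samples) / len(samples)


# Hypothetical population (illustration only)
population = [3, 7, 8, 12, 20]
N, n = len(population), 3
pop_mean = Fraction(sum(population), N)
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)
sigma2 = Fraction(N - 1, N) * S2

e_s2_wor = expected_s2(population, n, with_replacement=False)
e_s2_wr = expected_s2(population, n, with_replacement=True)
print(e_s2_wor, e_s2_wr)
```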

Standard errors
The standard error of $\bar{y}$ is defined as $\sqrt{Var(\bar{y})}$.
In order to estimate the standard error, one simple option is to consider the square root of the estimate of
variance of the sample mean.

• under SRSWOR, a possible estimator is $\hat{\sigma}(\bar{y}) = \sqrt{\frac{N-n}{Nn}}\; s$.

• under SRSWR, a possible estimator is $\hat{\sigma}(\bar{y}) = \frac{s}{\sqrt{n}}$.

It is to be noted that this estimator does not possess the same properties as those of $\widehat{Var}(\bar{y})$.
The reason is that if $\hat{\theta}$ is an unbiased estimator of $\theta$, then $\sqrt{\hat{\theta}}$ is not necessarily an unbiased estimator of $\sqrt{\theta}$.
In fact, $\hat{\sigma}(\bar{y})$ is a negatively biased estimator under SRSWOR.

The approximate expressions for large N case are as follows:


(Reference: Sampling Theory of Surveys with Applications, P.V. Sukhatme, B.V. Sukhatme, S.
Sukhatme, C. Asok, Iowa State University Press and Indian Society of Agricultural Statistics,
1984, India)

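Consider s as an estimator of S.
Let
$$
s^2 = S^2 + \varepsilon \quad \text{with } E(\varepsilon) = 0, \; E(\varepsilon^2) = Var(s^2).
$$
Write
$$
\begin{aligned}
s &= (S^2 + \varepsilon)^{1/2} \\
&= S\left(1 + \frac{\varepsilon}{S^2}\right)^{1/2} \\
&= S\left(1 + \frac{\varepsilon}{2S^2} - \frac{\varepsilon^2}{8S^4} + \dots\right)
\end{aligned}
$$
assuming $\varepsilon$ will be small as compared to $S^2$; as n becomes large, the probability of such an
event approaches one. Neglecting the powers of $\varepsilon$ higher than two and taking expectation, we have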

$$
E(s) = S\left[1 - \frac{Var(s^2)}{8S^4}\right]
$$
where
$$
Var(s^2) = \frac{2S^4}{n-1}\left[1 + \left(\frac{n-1}{2n}\right)(\beta_2 - 3)\right] \quad \text{for large } N,
$$
$$
\mu_j = \frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^j ,
$$
$$
\beta_2 = \frac{\mu_4}{S^4} : \text{coefficient of kurtosis.}
$$
Thus
$$
E(s) = S\left[1 - \frac{1}{4(n-1)} - \frac{\beta_2 - 3}{8n}\right]
$$
$$
\begin{aligned}
Var(s) &= S^2 - S^2\left[1 - \frac{1}{8}\frac{Var(s^2)}{S^4}\right]^2 \\
&= \frac{Var(s^2)}{4S^2} \\
&= \frac{S^2}{2(n-1)}\left[1 + \left(\frac{n-1}{2n}\right)(\beta_2 - 3)\right],
\end{aligned}
$$
where terms of higher order in $Var(s^2)$ are neglected in the second step.
Note that for a normal distribution, $\beta_2 = 3$ and we obtain
$$
Var(s) = \frac{S^2}{2(n-1)} .
$$
Both $Var(s)$ and $Var(s^2)$ are inflated due to nonnormality to the same extent, by the inflation factor
$$
\left[1 + \left(\frac{n-1}{2n}\right)(\beta_2 - 3)\right]
$$
and this does not depend on the coefficient of skewness.

This is an important result to be kept in mind while determining the sample size in which it is
assumed that S 2 is known. If inflation factor is ignored and population is non-normal, then the
reliability on s 2 may be misleading.

Alternative approach:
The results for the unbiasedness property and the variance of sample mean can also be proved in an
alternative way as follows:

(i) SRSWOR
With the $i$th unit of the population, we associate a random variable $a_i$ defined as follows:
$$
a_i =
\begin{cases}
1, & \text{if the } i\text{th unit occurs in the sample} \\
0, & \text{if the } i\text{th unit does not occur in the sample}
\end{cases}
\qquad (i = 1, 2, \dots, N).
$$
Then,
$$
E(a_i) = 1 \times \text{Probability that the } i\text{th unit is included in the sample} = \frac{n}{N}, \quad i = 1, 2, \dots, N,
$$
$$
E(a_i^2) = 1 \times \text{Probability that the } i\text{th unit is included in the sample} = \frac{n}{N}, \quad i = 1, 2, \dots, N,
$$
$$
E(a_i a_j) = 1 \times \text{Probability that the } i\text{th and } j\text{th units are included in the sample} = \frac{n(n-1)}{N(N-1)}, \quad i \ne j = 1, 2, \dots, N.
$$
From these results, we can obtain
$$
Var(a_i) = E(a_i^2) - \left(E(a_i)\right)^2 = \frac{n(N-n)}{N^2}, \quad i = 1, 2, \dots, N,
$$
$$
Cov(a_i, a_j) = E(a_i a_j) - E(a_i)E(a_j) = -\frac{n(N-n)}{N^2(N-1)}, \quad i \ne j = 1, 2, \dots, N.
$$
We can rewrite the sample mean as
$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{N} a_i y_i .
$$
Then
$$
E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{N} E(a_i)\, y_i = \bar{Y}
$$
and
$$
Var(\bar{y}) = \frac{1}{n^2} Var\Big(\sum_{i=1}^{N} a_i y_i\Big) = \frac{1}{n^2}\left[\sum_{i=1}^{N} Var(a_i)\, y_i^2 + \sum_{i \ne j}^{N}\sum^{N} Cov(a_i, a_j)\, y_i y_j\right].
$$

Substituting the values of $Var(a_i)$ and $Cov(a_i, a_j)$ in the expression of $Var(\bar{y})$ and simplifying, we
get
$$
Var(\bar{y}) = \frac{N-n}{Nn} S^2 .
$$
To show that $E(s^2) = S^2$, consider
$$
s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{N} a_i y_i^2 - n\bar{y}^2\right].
$$
Hence, taking expectation, we get
$$
E(s^2) = \frac{1}{n-1}\left[\sum_{i=1}^{N} E(a_i)\, y_i^2 - n\left\{Var(\bar{y}) + \bar{Y}^2\right\}\right].
$$
Substituting the values of $E(a_i)$ and $Var(\bar{y})$ in this expression and simplifying, we get $E(s^2) = S^2$.

(ii) SRSWR
Let a random variable $a_i$ associated with the $i$th unit of the population denote the number of times
the $i$th unit occurs in the sample, $i = 1, 2, \dots, N$. So $a_i$ assumes values 0, 1, 2, …, n. The joint
distribution of $a_1, a_2, \dots, a_N$ is the multinomial distribution given by
$$
P(a_1, a_2, \dots, a_N) = \frac{n!}{\prod_{i=1}^{N} a_i!}\cdot\frac{1}{N^n}
$$
where $\sum_{i=1}^{N} a_i = n$. For this multinomial distribution, we have
$$
E(a_i) = \frac{n}{N},
$$
$$
Var(a_i) = \frac{n(N-1)}{N^2}, \quad i = 1, 2, \dots, N,
$$
$$
Cov(a_i, a_j) = -\frac{n}{N^2}, \quad i \ne j = 1, 2, \dots, N.
$$
We rewrite the sample mean as
$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{N} a_i y_i .
$$
Hence, taking expectation of $\bar{y}$ and substituting the value of $E(a_i) = n/N$, we obtain that
$$
E(\bar{y}) = \bar{Y}.
$$

Further,
$$
Var(\bar{y}) = \frac{1}{n^2}\left[\sum_{i=1}^{N} Var(a_i)\, y_i^2 + \sum_{i \ne j}^{N}\sum^{N} Cov(a_i, a_j)\, y_i y_j\right].
$$
Substituting the values of $Var(a_i) = n(N-1)/N^2$ and $Cov(a_i, a_j) = -n/N^2$ and simplifying, we get
$$
Var(\bar{y}) = \frac{N-1}{Nn} S^2 .
$$
To prove that $E(s^2) = \frac{N-1}{N} S^2 = \sigma^2$ in SRSWR, consider
$$
(n-1)s^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \sum_{i=1}^{N} a_i y_i^2 - n\bar{y}^2 ,
$$
$$
\begin{aligned}
(n-1)E(s^2) &= \sum_{i=1}^{N} E(a_i)\, y_i^2 - n\left\{Var(\bar{y}) + \bar{Y}^2\right\} \\
&= \frac{n}{N}\sum_{i=1}^{N} y_i^2 - n\cdot\frac{N-1}{nN} S^2 - n\bar{Y}^2 \\
&= \frac{(n-1)(N-1)}{N} S^2
\end{aligned}
$$
$$
E(s^2) = \frac{N-1}{N} S^2 = \sigma^2 .
$$

Estimator of population total:

Sometimes, it is also of interest to estimate the population total, e.g. total household income, total
expenditures etc. Let
$$
Y_T = \sum_{i=1}^{N} Y_i = N\bar{Y}
$$
denote the population total, which can be estimated by
$$
\hat{Y}_T = N\hat{\bar{Y}} = N\bar{y}.
$$
Obviously
$$
E(\hat{Y}_T) = N E(\bar{y}) = N\bar{Y}
$$
$$
Var(\hat{Y}_T) = N^2\, Var(\bar{y}) =
\begin{cases}
N^2\left(\dfrac{N-n}{Nn}\right) S^2 = \dfrac{N(N-n)}{n} S^2 & \text{for SRSWOR} \\[2ex]
N^2\left(\dfrac{N-1}{Nn}\right) S^2 = \dfrac{N(N-1)}{n} S^2 & \text{for SRSWR}
\end{cases}
$$
and the estimates of variance of $\hat{Y}_T$ are
$$
\widehat{Var}(\hat{Y}_T) =
\begin{cases}
\dfrac{N(N-n)}{n} s^2 & \text{for SRSWOR} \\[2ex]
\dfrac{N^2}{n} s^2 & \text{for SRSWR.}
\end{cases}
$$

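Confidence limits for the population mean

Now we construct the $100(1-\alpha)\%$ confidence interval for the population mean. Assume that the
population is normally distributed $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$. Then $\frac{\bar{y} - \bar{Y}}{\sqrt{Var(\bar{y})}}$
follows $N(0,1)$ when $\sigma^2$ is known. If $\sigma^2$ is unknown and is estimated from the sample, then
$\frac{\bar{y} - \bar{Y}}{\sqrt{\widehat{Var}(\bar{y})}}$ follows a t-distribution with $(n-1)$ degrees of freedom. When $\sigma^2$ is known, the
$100(1-\alpha)\%$ confidence interval is given by
$$
P\left[-Z_{\alpha/2} \le \frac{\bar{y} - \bar{Y}}{\sqrt{Var(\bar{y})}} \le Z_{\alpha/2}\right] = 1 - \alpha
$$
$$
\text{or} \quad P\left[\bar{y} - Z_{\alpha/2}\sqrt{Var(\bar{y})} \le \bar{Y} \le \bar{y} + Z_{\alpha/2}\sqrt{Var(\bar{y})}\right] = 1 - \alpha
$$
and the confidence limits are
$$
\left(\bar{y} - Z_{\alpha/2}\sqrt{Var(\bar{y})}, \;\; \bar{y} + Z_{\alpha/2}\sqrt{Var(\bar{y})}\right)
$$
where $Z_{\alpha/2}$ denotes the upper $100(\alpha/2)\%$ point of the $N(0,1)$ distribution. Similarly, when $\sigma^2$ is unknown,
the $100(1-\alpha)\%$ confidence interval is
$$
P\left[-t_{\alpha/2} \le \frac{\bar{y} - \bar{Y}}{\sqrt{\widehat{Var}(\bar{y})}} \le t_{\alpha/2}\right] = 1 - \alpha
$$
$$
\text{or} \quad P\left[\bar{y} - t_{\alpha/2}\sqrt{\widehat{Var}(\bar{y})} \le \bar{Y} \le \bar{y} + t_{\alpha/2}\sqrt{\widehat{Var}(\bar{y})}\right] = 1 - \alpha
$$
and the confidence limits are
$$
\left(\bar{y} - t_{\alpha/2}\sqrt{\widehat{Var}(\bar{y})}, \;\; \bar{y} + t_{\alpha/2}\sqrt{\widehat{Var}(\bar{y})}\right)
$$
where $t_{\alpha/2}$ denotes the upper $100(\alpha/2)\%$ point of the t-distribution with $(n-1)$ degrees of freedom.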
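The known-variance interval under SRSWOR can be sketched as follows. The function name and the numbers in the usage lines are hypothetical; only the normal case is shown because the standard library provides the normal quantile, while the t quantile for the unknown-variance case would need an external package.

```python
from math import sqrt
from statistics import NormalDist


def ci_mean_known_variance(ybar, S2, n, N, alpha=0.05):
    """100(1 - alpha)% confidence limits for the population mean under SRSWOR.

    Assumes S2 (population S^2) is known, so Var(ybar) = (N - n) / (N n) * S2
    and the normal quantile Z_{alpha/2} applies.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    se = sqrt((N - n) / (N * n) * S2)         # standard error of ybar
    return ybar - z * se, ybar + z * se


# Hypothetical inputs: sample mean 50, S^2 = 100, n = 25 drawn from N = 1000
lower, upper = ci_mean_known_variance(50.0, 100.0, 25, 1000)
print(lower, upper)
```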

Determination of sample size

The size of the sample is needed before the survey starts and goes into operation. One point to be
kept in mind is that when the sample size increases, the variance of estimators decreases but the cost
of survey increases, and vice versa. So there has to be a balance between the two aspects. The
sample size can be determined on the basis of prescribed values of standard error of sample mean,
error of estimation, width of the confidence interval, coefficient of variation of sample mean,
relative error of sample mean or total cost, among several others.

An important requirement in determining the sample size is that information regarding the
population standard deviation S should be known for these criteria. The reason and need for this
will be clear when we derive the sample size in the next section. A question arises about how to
have information about S beforehand. One possible solution is to conduct a pilot
survey, collect a preliminary sample of small size, estimate S and use the estimate as the known value
of S. Alternatively, such information can also be collected from past data, past experience, long
association of experimenter with the experiment, prior information etc.

Now we find the sample size under different criteria assuming that the samples have been drawn
using SRSWOR. The case for SRSWR can be derived similarly.

1. Prespecified variance
The sample size is to be determined such that the variance of $\bar{y}$ should not exceed a given value, say
V. In this case, find n such that
$$
\begin{aligned}
& Var(\bar{y}) \le V \\
\text{or} \quad & \frac{N-n}{Nn} S^2 \le V \\
\text{or} \quad & \frac{1}{n} - \frac{1}{N} \le \frac{V}{S^2} \\
\text{or} \quad & \frac{1}{n} - \frac{1}{N} \le \frac{1}{n_e} \\
\text{or} \quad & n \ge \frac{n_e}{1 + \dfrac{n_e}{N}}
\end{aligned}
$$
where $n_e = \frac{S^2}{V}$.
It may be noted here that $n_e$ can be known only when $S^2$ is known. This is the reason why it is assumed
that S is known. The same reason will also be seen in other cases.
The smallest sample size needed in this case is
$$
n_{smallest} = \frac{n_e}{1 + \dfrac{n_e}{N}} .
$$
If N is large, then the required n is
$$
n \ge n_e \quad \text{and} \quad n_{smallest} = n_e .
$$
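A small sketch of this rule, with hypothetical inputs ($S^2 = 400$, $V = 4$, $N = 1000$) and a function name chosen here for illustration. Because n must be an integer, the bound is rounded up.

```python
from math import ceil


def n_for_variance(S2, V, N=None):
    """Smallest integer n with Var(ybar) <= V under SRSWOR.

    n_e = S2 / V; pass N=None to use the large-N rule n >= n_e.
    """
    n_e = S2 / V
    if N is None:
        return ceil(n_e)
    return ceil(n_e / (1 + n_e / N))


# Hypothetical planning values (illustration only)
n_finite = n_for_variance(400, 4, N=1000)   # finite population
n_large = n_for_variance(400, 4)            # N treated as large
print(n_finite, n_large)
```

With these values, n = 91 gives a variance just under 4, while n = 90 gives a variance just over 4, so rounding up is what keeps the requirement satisfied.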

2. Pre-specified estimation error

It may be possible to have some prior knowledge of population mean $\bar{Y}$ and it may be required that
the sample mean $\bar{y}$ should not differ from it by more than a specified amount of absolute
estimation error, say e, which is a small quantity. Such a requirement can be satisfied by associating a
probability $(1-\alpha)$ with it and can be expressed as
$$
P\left[\,|\bar{y} - \bar{Y}| \le e\,\right] = 1 - \alpha .
$$
Since $\bar{y}$ follows $N\left(\bar{Y}, \frac{N-n}{Nn} S^2\right)$ assuming the normal distribution for the population, we can write
$$
P\left[\frac{|\bar{y} - \bar{Y}|}{\sqrt{Var(\bar{y})}} \le \frac{e}{\sqrt{Var(\bar{y})}}\right] = 1 - \alpha
$$
which implies that
$$
\begin{aligned}
& \frac{e}{\sqrt{Var(\bar{y})}} = Z_{\alpha/2} \\
\text{or} \quad & Z_{\alpha/2}^2\, Var(\bar{y}) = e^2 \\
\text{or} \quad & Z_{\alpha/2}^2\, \frac{N-n}{Nn} S^2 = e^2 \\
\text{or} \quad & n = \frac{\left(\dfrac{Z_{\alpha/2} S}{e}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{Z_{\alpha/2} S}{e}\right)^2}
\end{aligned}
$$
which is the required sample size. If N is large then
$$
n = \left(\frac{Z_{\alpha/2} S}{e}\right)^2 .
$$

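3. Prespecified width of confidence interval

If the requirement is that the width of the confidence interval of $\bar{Y}$ with confidence coefficient
$(1-\alpha)$ should not exceed a prespecified amount W, then the sample size n is determined such that
$$
2 Z_{\alpha/2} \sqrt{Var(\bar{y})} \le W
$$
assuming $\sigma^2$ is known and population is normally distributed. This can be expressed as
$$
\begin{aligned}
& 2 Z_{\alpha/2} \sqrt{\frac{N-n}{Nn}}\; S \le W \\
\text{or} \quad & 4 Z_{\alpha/2}^2 \left(\frac{1}{n} - \frac{1}{N}\right) S^2 \le W^2 \\
\text{or} \quad & \frac{1}{n} \le \frac{1}{N} + \frac{W^2}{4 Z_{\alpha/2}^2 S^2} \\
\text{or} \quad & n \ge \frac{\dfrac{4 Z_{\alpha/2}^2 S^2}{W^2}}{1 + \dfrac{4 Z_{\alpha/2}^2 S^2}{N W^2}} .
\end{aligned}
$$
The minimum sample size required is
$$
n_{smallest} = \frac{\dfrac{4 Z_{\alpha/2}^2 S^2}{W^2}}{1 + \dfrac{4 Z_{\alpha/2}^2 S^2}{N W^2}} .
$$
If N is large then
$$
n \ge \frac{4 Z_{\alpha/2}^2 S^2}{W^2}
$$
and the minimum sample size needed is
$$
n_{smallest} = \frac{4 Z_{\alpha/2}^2 S^2}{W^2} .
$$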

4. Prespecified coefficient of variation

The coefficient of variation (CV) is defined as the ratio of standard error (or standard deviation)
and mean. The knowledge of coefficient of variation has played an important role in the sampling
theory as this information has helped in deriving efficient estimators.

If it is desired that the coefficient of variation of $\bar{y}$ should not exceed a given or prespecified
value of coefficient of variation, say $C_0$, then the required sample size n is to be determined such
that
$$
\begin{aligned}
& CV(\bar{y}) \le C_0 \\
\text{or} \quad & \frac{\sqrt{Var(\bar{y})}}{\bar{Y}} \le C_0 \\
\text{or} \quad & \frac{\dfrac{N-n}{Nn} S^2}{\bar{Y}^2} \le C_0^2 \\
\text{or} \quad & \frac{1}{n} - \frac{1}{N} \le \frac{C_0^2}{C^2} \\
\text{or} \quad & n \ge \frac{\dfrac{C^2}{C_0^2}}{1 + \dfrac{C^2}{N C_0^2}}
\end{aligned}
$$
is the required sample size, where $C = \frac{S}{\bar{Y}}$ is the population coefficient of variation.
The smallest sample size needed in this case is
$$
n_{smallest} = \frac{\dfrac{C^2}{C_0^2}}{1 + \dfrac{C^2}{N C_0^2}} .
$$
If N is large, then
$$
n \ge \frac{C^2}{C_0^2} \quad \text{and} \quad n_{smallest} = \frac{C^2}{C_0^2} .
$$

5. Prespecified relative error

When $\bar{y}$ is used for estimating the population mean $\bar{Y}$, then the relative estimation error is defined
as $\frac{\bar{y} - \bar{Y}}{\bar{Y}}$. If it is required that such relative estimation error should not exceed a prespecified value
R with probability $(1-\alpha)$, then such a requirement can be satisfied by expressing it as
$$
P\left[\frac{|\bar{y} - \bar{Y}|}{\sqrt{Var(\bar{y})}} \le \frac{R\bar{Y}}{\sqrt{Var(\bar{y})}}\right] = 1 - \alpha .
$$
Assuming the population to be normally distributed, $\bar{y}$ follows $N\left(\bar{Y}, \frac{N-n}{Nn} S^2\right)$.
So it can be written that
$$
\begin{aligned}
& \frac{R\bar{Y}}{\sqrt{Var(\bar{y})}} = Z_{\alpha/2} \\
\text{or} \quad & Z_{\alpha/2}^2 \left(\frac{N-n}{Nn}\right) S^2 = R^2 \bar{Y}^2 \\
\text{or} \quad & \left(\frac{1}{n} - \frac{1}{N}\right) = \frac{R^2}{C^2 Z_{\alpha/2}^2} \\
\text{or} \quad & n = \frac{\left(\dfrac{Z_{\alpha/2} C}{R}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{Z_{\alpha/2} C}{R}\right)^2}
\end{aligned}
$$
where $C = \frac{S}{\bar{Y}}$ is the population coefficient of variation and should be known.
If N is large, then
$$
n = \left(\frac{Z_{\alpha/2} C}{R}\right)^2 .
$$

6. Prespecified cost
Let an amount of money C be designated for the sample survey to collect n observations, $C_0$ be
the overhead cost and $C_1$ be the cost of collection of one unit in the sample. Then the total cost C
can be expressed as
$$
C = C_0 + nC_1
$$
$$
\text{or} \quad n = \frac{C - C_0}{C_1}
$$
is the required sample size.
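The sample-size rules for criteria 2 to 5 above share one pattern: compute a large-population value $n_0$ and then apply the adjustment $n_0 / (1 + n_0/N)$. The sketch below makes that pattern explicit. All function names and input values are hypothetical, and results are rounded up to the next integer, which is an assumption of this sketch rather than something stated in the notes.

```python
from math import ceil
from statistics import NormalDist


def finite_population_adjust(n0, N=None):
    """Round n0 / (1 + n0 / N) up to an integer; N=None means N is large."""
    return ceil(n0 if N is None else n0 / (1 + n0 / N))


def z_upper(alpha):
    """Upper alpha/2 point of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def n_for_error(S, e, alpha=0.05, N=None):
    """Criterion 2: absolute estimation error at most e."""
    return finite_population_adjust((z_upper(alpha) * S / e) ** 2, N)


def n_for_width(S, W, alpha=0.05, N=None):
    """Criterion 3: confidence interval width at most W."""
    return finite_population_adjust((2 * z_upper(alpha) * S / W) ** 2, N)


def n_for_cv(C, C0, N=None):
    """Criterion 4: coefficient of variation of the sample mean at most C0."""
    return finite_population_adjust((C / C0) ** 2, N)


def n_for_relative_error(C, R, alpha=0.05, N=None):
    """Criterion 5: relative estimation error at most R."""
    return finite_population_adjust((z_upper(alpha) * C / R) ** 2, N)


# Hypothetical planning values (illustration only)
print(n_for_error(10, 2), n_for_error(10, 2, N=1000))
print(n_for_width(10, 4), n_for_cv(1.0, 0.25), n_for_relative_error(0.5, 0.1))
```

A width requirement W is the same as an error requirement e = W/2, which is why the first and third calls agree.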

Chapter 3
Sampling For Proportions and Percentages
In many situations, the characteristic under study on which the observations are collected are
qualitative in nature. For example, the responses of customers in many marketing surveys are
based on replies like ‘yes’ or ‘no’ , ‘agree’ or ‘disagree’ etc. Sometimes the respondents are
asked to arrange several options in the order like first choice, second choice etc. Sometimes
the objective of the survey is to estimate the proportion or the percentage of brown eyed
persons, unemployed persons, graduate persons or persons favoring a proposal, etc. In such
situations, the first question arises how to do the sampling and secondly how to estimate the
population parameters like population mean, population variance, etc.

Sampling procedure:
The same sampling procedures that are used for drawing a sample in case of quantitative
characteristics can also be used for drawing a sample for qualitative characteristic. So, the
sampling procedures remain same irrespective of the nature of characteristic under study -
either qualitative or quantitative. For example, the SRSWOR and SRSWR procedures for
drawing the samples remain the same for qualitative and quantitative characteristics. Similarly,
other sampling schemes like stratified sampling, two stage sampling etc. also remain same.

Estimation of population proportion:


The population proportion in case of qualitative characteristic can be estimated in a similar
way as the estimation of population mean in case of quantitative characteristic.

Consider a qualitative characteristic based on which the population can be divided into two
mutually exclusive classes, say C and C*. For example, if C is the part of population of
persons saying ‘yes’ or ‘agreeing’ with the proposal then C* is the part of population of persons
saying ‘no’ or ‘disagreeing’ with the proposal. Let A be the number of units in C and (N - A)
be the number of units in C* in a population of size N. Then the proportion of units in C is
$$
P = \frac{A}{N}
$$
and the proportion of units in C* is
$$
Q = \frac{N - A}{N} = 1 - P .
$$

An indicator variable Y can be associated with the characteristic under study and then for i =
1, 2, …, N
$$
Y_i =
\begin{cases}
1 & \text{if the } i\text{th unit belongs to } C \\
0 & \text{if the } i\text{th unit belongs to } C^* .
\end{cases}
$$
Now the population total is
$$
Y_{TOTAL} = \sum_{i=1}^{N} Y_i = A
$$
and population mean is
$$
\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{A}{N} = P .
$$

Suppose a sample of size n is drawn from a population of size N by simple random sampling.

Let a be the number of units in the sample which fall into class C and $(n - a)$ units fall in class
C*, then the sample proportion of units in C is
$$
p = \frac{a}{n}
$$
which can be written as
$$
p = \frac{a}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y} .
$$
Since $\sum_{i=1}^{N} Y_i^2 = A = NP$, we can write $S^2$ and $s^2$ in terms of P and Q as follows:
$$
\begin{aligned}
S^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \\
&= \frac{1}{N-1}\Big(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\Big) \\
&= \frac{1}{N-1}(NP - NP^2) \\
&= \frac{N}{N-1} PQ .
\end{aligned}
$$
Similarly, $\sum_{i=1}^{n} y_i^2 = a = np$ and, with $q = 1 - p$,
$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 \\
&= \frac{1}{n-1}\Big(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\Big) \\
&= \frac{1}{n-1}(np - np^2) \\
&= \frac{n}{n-1} pq .
\end{aligned}
$$
Note that the quantities y , Y , s 2 and S 2 have been expressed as functions of sample and
population proportions. Since the sample has been drawn by simple random sampling and
sample proportion is same as the sample mean, so the properties of sample proportion in
SRSWOR and SRSWR can be derived using the properties of sample mean directly.

(i) SRSWOR
Since the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$, i.e. $E(\bar{y}) = \bar{Y}$ in case of
SRSWOR, so
$$
E(p) = E(\bar{y}) = \bar{Y} = P
$$
and p is an unbiased estimator of P.

Using the expression of $Var(\bar{y})$, the variance of p can be derived as
$$
\begin{aligned}
Var(p) = Var(\bar{y}) &= \frac{N-n}{Nn} S^2 \\
&= \frac{N-n}{Nn}\cdot\frac{N}{N-1} PQ \\
&= \frac{N-n}{N-1}\cdot\frac{PQ}{n} .
\end{aligned}
$$
Similarly, using the estimate of $Var(\bar{y})$, the estimate of variance can be derived as
$$
\begin{aligned}
\widehat{Var}(p) = \widehat{Var}(\bar{y}) &= \frac{N-n}{Nn} s^2 \\
&= \frac{N-n}{Nn}\cdot\frac{n}{n-1} pq \\
&= \frac{N-n}{N(n-1)} pq .
\end{aligned}
$$
(ii) SRSWR
Since the sample mean $\bar{y}$ is an unbiased estimator of population mean $\bar{Y}$ in case of SRSWR,
so the sample proportion satisfies
$$
E(p) = E(\bar{y}) = \bar{Y} = P,
$$
i.e., p is an unbiased estimator of P.
Using the expression of variance of $\bar{y}$ and its estimate in case of SRSWR, the variance of p and
its estimate can be derived as follows:
$$
\begin{aligned}
Var(p) = Var(\bar{y}) &= \frac{N-1}{Nn} S^2 \\
&= \frac{N-1}{Nn}\cdot\frac{N}{N-1} PQ \\
&= \frac{PQ}{n}
\end{aligned}
$$
$$
\widehat{Var}(p) = \frac{n}{n-1}\cdot\frac{pq}{n} = \frac{pq}{n-1} .
$$
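The estimates for a proportion reduce to a few lines of arithmetic. The sketch below uses hypothetical counts (a = 30 units in class C out of n = 100 sampled from N = 1000) and a function name chosen for illustration; exact fractions are used so the results can be compared with hand calculation.

```python
from fractions import Fraction


def proportion_estimates(a, n, N):
    """Sample proportion p and its estimated variance under SRSWOR and SRSWR."""
    p = Fraction(a, n)
    q = 1 - p
    var_wor = Fraction(N - n, N * (n - 1)) * p * q   # (N - n) pq / (N (n - 1))
    var_wr = p * q / (n - 1)                         # pq / (n - 1)
    return p, var_wor, var_wr


# Hypothetical survey counts (illustration only)
p, var_wor, var_wr = proportion_estimates(a=30, n=100, N=1000)
print(p, var_wor, var_wr)
```

The SRSWOR estimate equals the SRSWR estimate multiplied by the finite population correction $(N-n)/N$.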

Estimation of population total or total number of count

It is easy to see that an estimate of population total A (or total number of count) is
$$
\hat{A} = Np = \frac{Na}{n},
$$
its variance is
$$
Var(\hat{A}) = N^2\, Var(p)
$$
and the estimate of variance is
$$
\widehat{Var}(\hat{A}) = N^2\, \widehat{Var}(p) .
$$

Confidence interval estimation of P

If N and n are large then $\frac{p - P}{\sqrt{Var(p)}}$ approximately follows N(0,1). With this approximation, we
can write
$$
P\left[-Z_{\alpha/2} \le \frac{p - P}{\sqrt{Var(p)}} \le Z_{\alpha/2}\right] = 1 - \alpha
$$
and the $100(1-\alpha)\%$ confidence interval of P is
$$
\left(p - Z_{\alpha/2}\sqrt{Var(p)}, \;\; p + Z_{\alpha/2}\sqrt{Var(p)}\right).
$$

It may be noted that in this case, a discrete random variable is being approximated by a
continuous random variable, so a continuity correction $\frac{1}{2n}$ can be introduced in the confidence
limits and the limits become
$$
\left(p - Z_{\alpha/2}\sqrt{Var(p)} - \frac{1}{2n}, \;\; p + Z_{\alpha/2}\sqrt{Var(p)} + \frac{1}{2n}\right).
$$

Use of Hypergeometric distribution:

When SRS is applied for the sampling of a qualitative characteristic, the methodology is to
draw the units one-by-one and so the probability of selection of every unit remains the same at
every step. If n sampling units are selected together from N units, then the probability of
selection of units does not remain the same as in the case of SRS.

Consider a situation in which the sampling units in a population are divided into two mutually
exclusive classes. Let P and Q be the proportions of sampling units in the population belonging
to classes ‘1’ and ‘2’ respectively. Then NP and NQ are the total number of sampling units in
the population belonging to class ‘1’ and ‘2’, respectively and so NP + NQ = N. The
probability that a sample of n units selected out of N units by SRS contains $n_1$ units
belonging to class ‘1’ and $n_2$ units belonging to class ‘2’ is governed by the
hypergeometric distribution and
$$
P(n_1) = \frac{\binom{NP}{n_1}\binom{NQ}{n_2}}{\binom{N}{n}} .
$$
As N grows large, the hypergeometric distribution tends to the Binomial distribution and $P(n_1)$ is
approximated by
$$
P(n_1) = \binom{n}{n_1} P^{n_1} (1 - P)^{n_2} .
$$
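The quality of the binomial approximation can be seen numerically. The sketch below is an illustration under assumed values (P = 0.3, n = 10, and two population sizes); it evaluates both probability functions with `math.comb` and compares them at one point.

```python
from math import comb


def hypergeom_pmf(n1, n, NP, N):
    """P(n1): n1 units of class '1' in a sample of n, with NP such units among N."""
    NQ = N - NP
    return comb(NP, n1) * comb(NQ, n - n1) / comb(N, n)


def binom_pmf(n1, n, P):
    """Binomial approximation to P(n1) for large N."""
    return comb(n, n1) * P ** n1 * (1 - P) ** (n - n1)


# Hypothetical setting: P = 0.3, n = 10, evaluated at n1 = 3
b = binom_pmf(3, 10, 0.3)
h_small = hypergeom_pmf(3, 10, 30, 100)        # N = 100
h_large = hypergeom_pmf(3, 10, 3000, 10000)    # N = 10000
total = sum(hypergeom_pmf(k, 10, 30, 100) for k in range(11))
print(b, h_small, h_large, total)
```

The hypergeometric value for the larger population should lie closer to the binomial value, reflecting the limiting statement above.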

Inverse sampling
In general, it is understood in the SRS methodology for qualitative characteristic that the
attribute under study is not a rare attribute. If the attribute is rare, then the procedure of
estimating the population proportion P by the sample proportion a/n is not suitable. Some such
situations are, e.g., estimation of frequency of rare type of genes, proportion of some rare type
of cancer cells in a biopsy, proportion of rare type of blood cells affecting the red blood cells
etc. In such cases, the methodology of inverse sampling can be used.

In the methodology of inverse sampling, the sampling is continued until a predetermined
number of units possessing the attribute under study occur in the sample, which is useful for
estimating the population proportion. The sampling units are drawn one-by-one with equal
probability and without replacement. The sampling is discontinued as soon as the number of
units in the sample possessing the characteristic or attribute equals a predetermined number.

Let m denote the predetermined number indicating the number of units possessing the
characteristic. The sampling is continued till m such units are obtained. Therefore, the
sample size n required to attain m becomes a random variable.

Probability distribution function of n

In order to find the probability distribution function of n, consider the stage of drawing of
samples t such that at t = n, the sample size n completes the m units with attribute. Thus the first
(t - 1) draws would contain (m - 1) units in the sample possessing the characteristic out of NP
units. Equivalently, there are (t - m) units which do not possess the characteristic out of NQ
such units in the population. Note that the last draw must ensure that the unit selected possesses
the characteristic.

So the probability distribution function of n can be expressed as
$$
\begin{aligned}
P(n) &= P\left(\begin{array}{c}\text{In a sample of } (n-1) \text{ units drawn from } N, \\ (m-1) \text{ units will possess the attribute}\end{array}\right) \times P\left(\begin{array}{c}\text{The unit drawn at the } n\text{th draw} \\ \text{will possess the attribute}\end{array}\right) \\
&= \left[\frac{\binom{NP}{m-1}\binom{NQ}{n-m}}{\binom{N}{n-1}}\right]\left(\frac{NP - m + 1}{N - n + 1}\right), \quad n = m, m+1, \dots, m + NQ.
\end{aligned}
$$
Note that the first term (in square brackets) is derived using the hypergeometric distribution as the
probability of drawing a sample of size (n – 1) in which (m – 1) units are from NP units and
(n – m) units are from NQ units. The second term $\frac{NP - m + 1}{N - n + 1}$ is the probability associated
with the last draw where it is assumed that we get the unit possessing the characteristic.
Note that $\sum_{n=m}^{m+NQ} P(n) = 1$.

Estimate of population proportion
Consider the expectation of $\frac{m-1}{n-1}$.
$$
\begin{aligned}
E\left(\frac{m-1}{n-1}\right) &= \sum_{n=m}^{m+NQ} \left(\frac{m-1}{n-1}\right) P(n) \\
&= \sum_{n=m}^{m+NQ} \left(\frac{m-1}{n-1}\right) \frac{\binom{NP}{m-1}\binom{NQ}{n-m}}{\binom{N}{n-1}} \cdot \frac{NP - m + 1}{N - n + 1} \\
&= P \sum_{n=m}^{m+NQ} \left(\frac{NP - m + 1}{N - n + 1}\right) \frac{\binom{NP-1}{m-2}\binom{NQ}{n-m}}{\binom{N-1}{n-2}}
\end{aligned}
$$
where each term of the last sum is obtained from P(n) by replacing NP by NP – 1, N by N – 1, m by (m – 1) and n by (n - 1),
so the sum equals 1. Thus
$$
E\left(\frac{m-1}{n-1}\right) = P .
$$
So $\hat{P} = \frac{m-1}{n-1}$ is an unbiased estimator of P.
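Both claims, that P(n) sums to one and that $(m-1)/(n-1)$ is unbiased for P, can be verified exactly for a small case. The population size, number of units with the attribute, and m used below are hypothetical values chosen for illustration.

```python
from fractions import Fraction
from math import comb


def inverse_sampling_pmf(n, m, NP, N):
    """P(n): probability that the m-th unit with the attribute appears at draw n."""
    NQ = N - NP
    first = Fraction(comb(NP, m - 1) * comb(NQ, n - m), comb(N, n - 1))
    last = Fraction(NP - m + 1, N - n + 1)   # the n-th draw has the attribute
    return first * last


# Hypothetical population: N = 20 units, NP = 6 with the attribute, stop at m = 3
N, NP, m = 20, 6, 3
support = range(m, m + (N - NP) + 1)          # n = m, ..., m + NQ
pmf = {n: inverse_sampling_pmf(n, m, NP, N) for n in support}

total = sum(pmf.values())
expected_p_hat = sum(Fraction(m - 1, n - 1) * prob for n, prob in pmf.items())
print(total, expected_p_hat)
```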

Estimate of variance of $\hat{P}$
Now we derive an estimate of variance of $\hat{P}$. By definition
$$
Var(\hat{P}) = E(\hat{P}^2) - \left[E(\hat{P})\right]^2 = E(\hat{P}^2) - P^2 .
$$
Thus
$$
\widehat{Var}(\hat{P}) = \hat{P}^2 - \text{Estimate of } P^2 .
$$
In order to obtain an estimate of $P^2$, consider the expectation of $\frac{(m-1)(m-2)}{(n-1)(n-2)}$, i.e.,
$$
\begin{aligned}
E\left[\frac{(m-1)(m-2)}{(n-1)(n-2)}\right] &= \sum_{n \ge m} \left[\frac{(m-1)(m-2)}{(n-1)(n-2)}\right] P(n) \\
&= \frac{P(NP-1)}{N-1} \sum_{n \ge m} \left(\frac{NP - m + 1}{N - n + 1}\right)\left[\frac{\binom{NP-2}{m-3}\binom{NQ}{n-m}}{\binom{N-2}{n-3}}\right]
\end{aligned}
$$
where the last term inside the square bracket is obtained by replacing NP by $(NP - 2)$, N by
$(N - 2)$, n by $(n - 2)$ and m by (m - 2) in the probability distribution function of n.
This solves further to
$$
E\left[\frac{(m-1)(m-2)}{(n-1)(n-2)}\right] = \frac{NP^2}{N-1} - \frac{P}{N-1} .
$$
Thus an unbiased estimate of $P^2$ is
$$
\begin{aligned}
\text{Estimate of } P^2 &= \left(\frac{N-1}{N}\right)\frac{(m-1)(m-2)}{(n-1)(n-2)} + \frac{\hat{P}}{N} \\
&= \left(\frac{N-1}{N}\right)\frac{(m-1)(m-2)}{(n-1)(n-2)} + \frac{1}{N}\cdot\frac{m-1}{n-1} .
\end{aligned}
$$
Finally, an estimate of variance of $\hat{P}$ is
$$
\begin{aligned}
\widehat{Var}(\hat{P}) &= \hat{P}^2 - \text{Estimate of } P^2 \\
&= \left(\frac{m-1}{n-1}\right)^2 - \left[\frac{N-1}{N}\cdot\frac{(m-1)(m-2)}{(n-1)(n-2)} + \frac{1}{N}\left(\frac{m-1}{n-1}\right)\right] \\
&= \left(\frac{m-1}{n-1}\right)\left[\left(\frac{m-1}{n-1}\right) - \frac{1}{N}\left(1 + \frac{(N-1)(m-2)}{n-2}\right)\right].
\end{aligned}
$$
For large N, the probability distribution of n tends to the negative Binomial distribution with
probability mass function $\binom{n-1}{m-1} P^m Q^{n-m}$. So
$$
\hat{P} = \frac{m-1}{n-1}
$$
and
$$
\widehat{Var}(\hat{P}) = \frac{(m-1)(n-m)}{(n-1)^2 (n-2)} = \frac{\hat{P}(1 - \hat{P})}{n-2} .
$$
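As a check on the algebra, the sketch below computes the variance estimate as $\hat{P}^2$ minus the unbiased estimate of $P^2$ and confirms, by exact enumeration over n, that its expectation equals the true variance of $\hat{P}$. The values N = 20, NP = 6 and m = 3 are hypothetical, and the check relies on m being at least 3 so that the estimate of $P^2$ is defined for every n.

```python
from fractions import Fraction
from math import comb


def pmf(n, m, NP, N):
    """P(n) for inverse sampling without replacement, as defined in the notes."""
    NQ = N - NP
    return (Fraction(comb(NP, m - 1) * comb(NQ, n - m), comb(N, n - 1))
            * Fraction(NP - m + 1, N - n + 1))


def var_estimate(n, m, N):
    """P_hat**2 minus the unbiased estimate of P**2."""
    p_hat = Fraction(m - 1, n - 1)
    p2_hat = (Fraction(N - 1, N) * Fraction((m - 1) * (m - 2), (n - 1) * (n - 2))
              + p_hat / N)
    return p_hat ** 2 - p2_hat


# Hypothetical population (illustration only)
N, NP, m = 20, 6, 3
P = Fraction(NP, N)
support = range(m, m + (N - NP) + 1)

mean_of_estimate = sum(var_estimate(n, m, N) * pmf(n, m, NP, N) for n in support)
true_variance = sum(Fraction(m - 1, n - 1) ** 2 * pmf(n, m, NP, N)
                    for n in support) - P ** 2
print(mean_of_estimate, true_variance)
```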

Estimation of proportion for more than two classes
We have assumed up to now that there are only two classes in which the population can be
divided based on a qualitative characteristic. There can be situations when the population is to
be divided into more than two classes. For example, the taste of a coffee can be divided into
four categories: very strong, strong, mild and very mild. Similarly, in another example the
damage to crop due to storm can be classified into categories like heavily damaged, damaged,
minor damage and no damage etc.

These types of situations can be represented by dividing the population of size N into, say k,
mutually exclusive classes $C_1, C_2, \dots, C_k$. Corresponding to these classes, let
$P_1 = \frac{C_1}{N}, P_2 = \frac{C_2}{N}, \dots, P_k = \frac{C_k}{N}$ be the proportions of units in the classes $C_1, C_2, \dots, C_k$
respectively.

Let a sample of size n be observed such that $c_1, c_2, \dots, c_k$ units have been drawn from
$C_1, C_2, \dots, C_k$ respectively. Then the probability of observing $c_1, c_2, \dots, c_k$ is
$$
P(c_1, c_2, \dots, c_k) = \frac{\binom{C_1}{c_1}\binom{C_2}{c_2}\cdots\binom{C_k}{c_k}}{\binom{N}{n}} .
$$
The population proportions $P_i$ can be estimated by $p_i = \frac{c_i}{n}$, $i = 1, 2, \dots, k$.
It can be easily shown that, with $Q_i = 1 - P_i$ and $q_i = 1 - p_i$,
$$
E(p_i) = P_i, \quad i = 1, 2, \dots, k,
$$
$$
Var(p_i) = \frac{N-n}{N-1}\cdot\frac{P_i Q_i}{n}
$$
and
$$
\widehat{Var}(p_i) = \frac{N-n}{N}\cdot\frac{p_i q_i}{n-1} .
$$
For estimating the number of units in the $i$th class,
$$
\hat{C}_i = N p_i ,
$$
$$
Var(\hat{C}_i) = N^2\, Var(p_i)
$$
and
$$
\widehat{Var}(\hat{C}_i) = N^2\, \widehat{Var}(p_i) .
$$

The confidence intervals can be obtained based on single $p_i$ as in the case of two classes.

If N is large, then the probability of observing $c_1, c_2, \dots, c_k$ can be approximated by the multinomial
distribution given by
$$
P(c_1, c_2, \dots, c_k) = \frac{n!}{c_1!\, c_2! \cdots c_k!}\, P_1^{c_1} P_2^{c_2} \cdots P_k^{c_k} .
$$
For this distribution
$$
E(p_i) = P_i, \quad i = 1, 2, \dots, k,
$$
$$
Var(p_i) = \frac{P_i (1 - P_i)}{n}
$$
and
$$
\widehat{Var}(p_i) = \frac{p_i (1 - p_i)}{n} .
$$
