Sampling Unit 6
Regression analysis is a statistical technique for investigating and modeling the
relationship between variables. It is applied in almost every field of human endeavor.
Assume a regression model describing a chance relationship between the auxiliary variable and
the variable of interest for the study. Then a simple regression model is given by:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope of the model; both are unknown constants (called
regression coefficients). $X$ is the independent variable, $Y$ is the dependent variable, and $\varepsilon$ is a
random error component. The random errors are assumed to have mean zero and unknown
variance $\sigma^2$, and to be uncorrelated.
The objective of sampling is the ‘prediction’ of some characteristic of the y-values of the
population such as population mean and total or the y-value of a single unit not yet in the sample.
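When $b = 0$ or $b = \bar{y}/\bar{x}$, the linear regression estimate $\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x})$ reduces to $\bar{y}$ (mean
per unit) or $\hat{\bar{Y}}_R$ (ratio estimate), respectively. That is, if $b = 0$,
$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) = \bar{y},$$
and if $b = \bar{y}/\bar{x}$,
$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) = \bar{y} + \frac{\bar{y}}{\bar{x}}(\bar{X} - \bar{x}) = \bar{y} + \frac{\bar{y}}{\bar{x}}\bar{X} - \bar{y} = \frac{\bar{y}}{\bar{x}}\bar{X} = \hat{\bar{Y}}_R.$$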
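The two special cases can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function name `regression_estimate` and the sample values are hypothetical and chosen only for illustration.

```python
def regression_estimate(y_bar, x_bar, X_bar, b):
    """Linear regression estimate of the population mean: y_bar + b * (X_bar - x_bar)."""
    return y_bar + b * (X_bar - x_bar)

# Hypothetical sample means and a known population mean of x (illustrative values only).
y_bar, x_bar, X_bar = 50.0, 20.0, 22.0

# b = 0 gives the mean per unit.
mean_per_unit = regression_estimate(y_bar, x_bar, X_bar, b=0.0)

# b = y_bar / x_bar gives the ratio estimate (y_bar / x_bar) * X_bar.
ratio_estimate = regression_estimate(y_bar, x_bar, X_bar, b=y_bar / x_bar)

print(mean_per_unit, ratio_estimate)
```

Any other choice of b gives an estimate that adjusts the sample mean of y by the observed gap between the population and sample means of x.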
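Theorem 6.1: In simple random sampling, in which the regression coefficient is constant and
known (say $b_0$), the linear regression estimate $\bar{y}_{lr} = \bar{y} + b_0(\bar{X} - \bar{x})$ is unbiased, with variance
$$V(\bar{y}_{lr}) = \frac{1-f}{n}\cdot\frac{\sum_{i=1}^{N}\left[(y_i - \bar{Y}) - b_0(x_i - \bar{X})\right]^2}{N-1} = \frac{1-f}{n}\left(S_y^2 - 2b_0 S_{yx} + b_0^2 S_x^2\right)$$
where $f = n/N$ is the sampling fraction. (prove this theorem)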
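One possible line of argument for Theorem 6.1 is sketched below. It is not taken from the original notes, and it assumes the standard simple random sampling results that a sample mean is unbiased for the population mean and has variance (1 - f)/n times the population variance. The auxiliary variable u_i is introduced only for this sketch.

```latex
\begin{align*}
u_i &= y_i - b_0 (x_i - \bar{X}), \qquad
  \bar{u} = \bar{y} - b_0(\bar{x} - \bar{X}) = \bar{y}_{lr}, \\
E(\bar{y}_{lr}) &= E(\bar{u}) = \bar{U} = \bar{Y} - b_0(\bar{X} - \bar{X}) = \bar{Y}, \\
V(\bar{y}_{lr}) &= V(\bar{u}) = \frac{1-f}{n}\, S_u^2
  = \frac{1-f}{n}\cdot\frac{\sum_{i=1}^{N}\left[(y_i - \bar{Y}) - b_0(x_i - \bar{X})\right]^2}{N-1} \\
  &= \frac{1-f}{n}\left(S_y^2 - 2 b_0 S_{yx} + b_0^2 S_x^2\right).
\end{align*}
```

The last step expands the square and uses the definitions of the population variances and covariance with divisor N - 1.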
Corollary: An unbiased sample estimate of $V(\bar{y}_{lr})$ is
$$v(\bar{y}_{lr}) = \frac{1-f}{n}\cdot\frac{\sum_{i=1}^{n}\left[(y_i - \bar{y}) - b_0(x_i - \bar{x})\right]^2}{n-1} = \frac{1-f}{n}\left(s_y^2 - 2b_0 s_{yx} + b_0^2 s_x^2\right).$$
It is then of interest to find the value of $b_0$ that minimizes $V(\bar{y}_{lr})$; it is given by the following theorem.
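The agreement between the two forms of the variance estimate in the corollary can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function name `var_est_known_b0` and the data are hypothetical.

```python
from statistics import mean

def var_est_known_b0(y, x, b0, N):
    """Return v(y_lr) computed two ways for a pre-assigned slope b0.

    y, x : values from a simple random sample drawn without replacement
    N    : population size, so the sampling fraction is f = n / N
    """
    n = len(y)
    f = n / N
    y_bar, x_bar = mean(y), mean(x)

    # Residual form: squared deviations of y after adjusting for x.
    resid_ss = sum(((yi - y_bar) - b0 * (xi - x_bar)) ** 2
                   for yi, xi in zip(y, x))
    v_resid = (1 - f) / n * resid_ss / (n - 1)

    # Moment form: s_y^2 - 2 b0 s_yx + b0^2 s_x^2.
    s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (n - 1)
    v_moment = (1 - f) / n * (s_y2 - 2 * b0 * s_yx + b0 ** 2 * s_x2)
    return v_resid, v_moment

# Hypothetical data, for illustration only.
y = [12.0, 15.0, 11.0, 18.0, 14.0]
x = [3.0, 4.0, 2.5, 5.0, 3.5]
v_resid, v_moment = var_est_known_b0(y, x, b0=2.0, N=50)
print(v_resid, v_moment)  # the two forms agree up to rounding error
```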
Theorem 6.2: The value of $b_0$ that minimizes $V(\bar{y}_{lr})$ is
$$b_0 = B = \frac{S_{yx}}{S_x^2} = \frac{\sum_{i=1}^{N}(y_i - \bar{Y})(x_i - \bar{X})}{\sum_{i=1}^{N}(x_i - \bar{X})^2}$$
(prove this theorem)
The resulting minimum variance is
$$V(\bar{y}_{lr})_{min} = \frac{1-f}{n}S_y^2(1 - \rho^2),$$
which is obtained by substituting $b_0 = B = S_{yx}/S_x^2 = \rho\,S_y/S_x$ in $V(\bar{y}_{lr})$, where $\rho$ is the population correlation
coefficient between y and x. Note that $B$ does not depend on the properties of any sample.
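A possible route to Theorem 6.2 and the minimum variance is to complete the square in the pre-assigned slope. The derivation below is a sketch added for illustration, not the proof intended by the original notes.

```latex
\begin{align*}
S_y^2 - 2 b_0 S_{yx} + b_0^2 S_x^2
  &= S_x^2\left(b_0 - \frac{S_{yx}}{S_x^2}\right)^2 + S_y^2 - \frac{S_{yx}^2}{S_x^2}, \\
\text{so the minimum is attained at}\quad b_0 &= B = \frac{S_{yx}}{S_x^2}, \\
V(\bar{y}_{lr})_{min}
  &= \frac{1-f}{n}\left(S_y^2 - \frac{S_{yx}^2}{S_x^2}\right)
   = \frac{1-f}{n}\, S_y^2\,(1 - \rho^2),
  \qquad \rho = \frac{S_{yx}}{S_y S_x}.
\end{align*}
```

The squared term is non-negative and vanishes only at the stated slope, which is why no other pre-assigned value can give a smaller variance.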
In most applications, $B$ is unknown and must be estimated from the sample. It is known that
the BLUE (best linear unbiased estimator) of $B$ is
$$b = \frac{s_{yx}}{s_x^2} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
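Computing the sample slope and plugging it into the regression estimate can be sketched as follows. This is an illustrative Python sketch, not part of the original notes; the function names `sample_slope` and `regression_estimate_from_sample` are hypothetical, and the data are constructed to lie exactly on the line y = 2x + 1 so the result is easy to verify by hand.

```python
from statistics import mean

def sample_slope(y, x):
    """Least squares slope b = s_yx / s_x^2 (the n - 1 divisors cancel)."""
    y_bar, x_bar = mean(y), mean(x)
    sp_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    return sp_yx / ss_x

def regression_estimate_from_sample(y, x, X_bar):
    """Regression estimate of the population mean of y, with b estimated from the sample."""
    b = sample_slope(y, x)
    return mean(y) + b * (X_bar - mean(x))

# Hypothetical sample lying exactly on y = 2x + 1.
x = [1.0, 2.0, 4.0, 7.0]
y = [2 * xi + 1 for xi in x]

b = sample_slope(y, x)
est = regression_estimate_from_sample(y, x, X_bar=5.0)
print(b, est)  # slope 2 and estimate 2 * 5 + 1 = 11 for this exactly linear data
```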
Example: A medical student was given an assignment to estimate the average systolic blood
pressure (BP) of teachers between 30 and 60 years of age in a certain university. The objective
was to compare it with the systolic BP of a part of the population engaged in manual work. A
simple random sample of 24 teachers was drawn without replacement from a frame
consisting of 961 teachers. Age of the teachers was taken as the auxiliary variable. The following
data are available to estimate the average systolic BP:
Average age for the population of teachers = 42.7 years; regression coefficient from an earlier
study = 0.8952.
Measurements from the sample are: $\sum_{i=1}^{n} y_i = 3247$, $\sum_{i=1}^{n} y_i^2 = 442652$, $\sum_{i=1}^{n} x_i = 1059$,
$\sum_{i=1}^{n} x_i^2 = 48778$, where $y_i$ and $x_i$ represent the BP and age of teachers respectively. From these data:
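i) Average systolic BP: $\bar{y}_{lr} = \bar{y} + b_0(\bar{X} - \bar{x}) = 135.292 + 0.8952(42.7 - 44.125) = 134.016$
$$v(\bar{y}_{lr}) = \frac{1-f}{n}\left(s_y^2 + b_0^2 s_x^2 - 2b_0 s_{yx}\right) = \frac{1 - 24/961}{24}\left[146.085 + (0.8952)^2(89.114) - 2(0.8952)(82.875)\right]$$
$v(\bar{y}_{lr}) = 2.808$, $\text{s.e.}(\bar{y}_{lr}) = 1.6757$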
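The arithmetic of this example can be retraced with a short Python sketch. It is an illustration only: the variable names are hypothetical, and the sample covariance 82.875 is taken as given from the worked calculation because the cross-product sum is not listed among the sample measurements.

```python
from math import sqrt

# Quantities stated in the example.
N, n = 961, 24            # frame size and sample size
X_bar = 42.7              # population mean age
b0 = 0.8952               # regression coefficient from the earlier study
sum_y, sum_y2 = 3247, 442652
sum_x, sum_x2 = 1059, 48778
s_yx = 82.875             # sample covariance, taken as given from the worked calculation

# Sample means and variances from the sums.
y_bar = sum_y / n
x_bar = sum_x / n
s_y2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
s_x2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Regression estimate with pre-assigned slope, its variance estimate and standard error.
y_lr = y_bar + b0 * (X_bar - x_bar)
f = n / N
v_lr = (1 - f) / n * (s_y2 + b0 ** 2 * s_x2 - 2 * b0 * s_yx)
se_lr = sqrt(v_lr)

print(round(y_lr, 3), round(v_lr, 3), round(se_lr, 4))
```

The rounded results should agree with the values 134.016, 2.808 and 1.6757 reported in the example.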
To make comparisons, the sample size n must be large enough for the approximate formulas for the
variances of the ratio and regression estimates to apply. For the estimated population mean $\bar{Y}$, we have
the following three variances of the estimate of the mean:
$$V(\bar{y}_{lr})_{min} = \frac{1-f}{n}S_y^2(1 - \rho^2) \quad \text{for regression,}$$
$$V(\hat{\bar{Y}}_R) = \frac{1-f}{n}\left(S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x\right) \quad \text{for ratio,}$$
$$V(\bar{y}) = \frac{1-f}{n}S_y^2 \quad \text{for the mean per unit.}$$
The regression estimate is more precise than the ratio estimate when $V(\bar{y}_{lr}) \le V(\hat{\bar{Y}}_R)$, i.e.,
$$\frac{1-f}{n}S_y^2(1 - \rho^2) \le \frac{1-f}{n}\left(S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x\right).$$
After rearranging this inequality, we get
$$(\rho S_y - R S_x)^2 \ge 0, \qquad \left(\rho\frac{S_y}{S_x} - R\right)^2 \ge 0, \qquad (B - R)^2 \ge 0.$$
If $B = R$ they are equally efficient, and this equality holds when there is a linear
relationship between y and x with a straight line through the origin. If $B \ne R$, then the regression
estimate is more precise.
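The comparison can be illustrated numerically. The sketch below is not part of the original notes: the function name `three_variances` and all population values are hypothetical, and R is taken to be the population ratio of the mean of y to the mean of x. It also checks that the gap between the ratio and regression variances equals (1 - f)/n times S_x^2 (B - R)^2, which follows from the rearrangement above.

```python
def three_variances(S_y, S_x, rho, R, n, N):
    """Large-sample variances of the mean per unit, ratio and regression estimates."""
    c = (1 - n / N) / n
    v_mean = c * S_y ** 2
    v_ratio = c * (S_y ** 2 + R ** 2 * S_x ** 2 - 2 * R * rho * S_y * S_x)
    v_reg = c * S_y ** 2 * (1 - rho ** 2)
    return v_mean, v_ratio, v_reg

# Hypothetical population values, for illustration only.
S_y, S_x, rho, R, n, N = 10.0, 4.0, 0.8, 1.5, 30, 600
v_mean, v_ratio, v_reg = three_variances(S_y, S_x, rho, R, n, N)

# Population regression coefficient and the predicted gap between ratio and regression.
B = rho * S_y / S_x
gap = (1 - n / N) / n * S_x ** 2 * (B - R) ** 2

print(v_mean, v_ratio, v_reg, gap)

# When R equals B the ratio and regression variances coincide.
_, v_ratio_eq, v_reg_eq = three_variances(S_y, S_x, rho, B, n, N)
```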
Like the ratio estimate, two types of regression estimate can be made in stratified random
sampling: separate and combined regression estimates.
Separate regression estimate: For each stratum, a regression estimate can be given as
$\bar{y}_{lrh} = \bar{y}_h + b_h(\bar{X}_h - \bar{x}_h)$, where $b_h$ is a pre-assigned value of the regression
coefficient for the hth stratum. Its variance is
$$V(\bar{y}_{lrh}) = \frac{1-f_h}{n_h}\left(S_{yh}^2 + b_h^2 S_{xh}^2 - 2b_h S_{yxh}\right).$$
Then the separate regression estimate is
$$\bar{y}_{lrs} = \sum_{h=1}^{L} W_h\,\bar{y}_{lrh},$$
which is an unbiased estimate of $\bar{Y}$. Assuming independent sampling in each stratum,
$$V(\bar{y}_{lrs}) = \sum_{h=1}^{L} W_h^2\,V(\bar{y}_{lrh}) = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\left(S_{yh}^2 + b_h^2 S_{xh}^2 - 2b_h S_{yxh}\right).$$
$V(\bar{y}_{lrs})$ is minimized when $b_h = B_h$, the true regression coefficient in stratum h, and the
minimum variance is
$$V(\bar{y}_{lrs})_{min} = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\,S_{yh}^2(1 - \rho_h^2).$$
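The variance of the separate regression estimate, and the fact that it is smallest at the true stratum slopes, can be illustrated with a short Python sketch. It is not part of the original notes; the function name `separate_variance`, the dictionary keys and the stratum values are all hypothetical.

```python
def separate_variance(strata, slopes):
    """V(y_lrs) for pre-assigned stratum slopes b_h, assuming independent sampling per stratum."""
    total = 0.0
    for s, b in zip(strata, slopes):
        c = s["W"] ** 2 * (1 - s["f"]) / s["n"]
        total += c * (s["S_y2"] + b ** 2 * s["S_x2"] - 2 * b * s["S_yx"])
    return total

# Hypothetical stratum-level population quantities (W = weight, f = sampling fraction).
strata = [
    {"W": 0.6, "f": 0.10, "n": 20, "S_y2": 25.0, "S_x2": 9.0, "S_yx": 12.0},
    {"W": 0.4, "f": 0.05, "n": 10, "S_y2": 16.0, "S_x2": 4.0, "S_yx": 4.0},
]

# True stratum regression coefficients B_h = S_yxh / S_xh^2.
B = [s["S_yx"] / s["S_x2"] for s in strata]

v_min = separate_variance(strata, B)
v_other = separate_variance(strata, [1.0, 1.5])  # some other pre-assigned slopes

# Closed form of the minimum: sum of W_h^2 (1 - f_h)/n_h S_yh^2 (1 - rho_h^2).
v_formula = sum(
    s["W"] ** 2 * (1 - s["f"]) / s["n"] * s["S_y2"]
    * (1 - s["S_yx"] ** 2 / (s["S_y2"] * s["S_x2"]))
    for s in strata
)
print(v_min, v_other, v_formula)
```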
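Sample estimate: For the separate regression estimate, the sample estimate of the stratum regression coefficient is given as
$$\hat{b}_h = \frac{\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)}{\sum_{i=1}^{n_h}(x_{hi} - \bar{x}_h)^2} = \hat{\rho}_h\frac{s_{yh}}{s_{xh}},$$
where $\hat{\rho}_h$ is the sample correlation coefficient in stratum h, and it is a least squares estimate of $B_h$.
The sample estimate of $V(\bar{y}_{lrs}) = \sum_h W_h^2\,\frac{1-f_h}{n_h}\,S_{yh}^2(1 - \rho_h^2)$ is
$$v(\bar{y}_{lrs}) = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\,s_h^2,$$
where
$$s_h^2 = \frac{\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2 - \hat{b}_h^2\sum_{i=1}^{n_h}(x_{hi} - \bar{x}_h)^2}{n_h - 2},$$
and $s_h^2$ is an unbiased estimate of $S_{yh}^2(1 - \rho_h^2)$ if
the sample size is large in all strata and the regression is linear.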
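The separate regression estimate with slopes estimated from the sample can be sketched as follows. This Python sketch is illustrative and not part of the original notes: the function names, the dictionary keys (`N` for stratum size, `X_bar` for the known stratum mean of x) and the data are hypothetical, and the stratum weights are taken as W_h = N_h / N.

```python
from statistics import mean

def stratum_stats(y, x):
    """Return (y_bar, x_bar, b_hat, s2) for one stratum, with s2 using divisor n - 2."""
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    sp_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    b_hat = sp_yx / ss_x
    s2 = (ss_y - b_hat ** 2 * ss_x) / (n - 2)
    return y_bar, x_bar, b_hat, s2

def separate_regression(strata):
    """Separate regression estimate of the population mean and its variance estimate."""
    N = sum(s["N"] for s in strata)
    est, var = 0.0, 0.0
    for s in strata:
        n_h = len(s["y"])
        W_h = s["N"] / N
        f_h = n_h / s["N"]
        y_bar, x_bar, b_hat, s2 = stratum_stats(s["y"], s["x"])
        est += W_h * (y_bar + b_hat * (s["X_bar"] - x_bar))
        var += W_h ** 2 * (1 - f_h) / n_h * s2
    return est, var

# Hypothetical stratified sample, for illustration only.
strata = [
    {"N": 40, "X_bar": 5.0, "x": [1.0, 2.0, 4.0, 7.0], "y": [3.2, 4.9, 9.1, 14.8]},
    {"N": 60, "X_bar": 4.0, "x": [2.0, 3.0, 5.0, 6.0], "y": [6.3, 8.8, 15.1, 17.9]},
]
est, var = separate_regression(strata)
print(est, var)
```

Each stratum needs at least three sampled units here, since the residual mean square divides by n_h - 2.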
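Assume $B_h$ is the same in all strata. Then the combined regression estimate is
$$\bar{y}_{lrc} = \bar{y}_{st} + b(\bar{X} - \bar{x}_{st}),$$
where $\bar{y}_{st} = \sum_{h=1}^{L} W_h\bar{y}_h$, $\bar{x}_{st} = \sum_{h=1}^{L} W_h\bar{x}_h$, and $b$ is a pre-assigned value. $\bar{y}_{lrc}$ is an unbiased
estimate of $\bar{Y}$, with variance
$$V(\bar{y}_{lrc}) = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\left(S_{yh}^2 + b^2 S_{xh}^2 - 2bS_{yxh}\right).$$
The value of $b$ which minimizes this variance is
$$b_c = \frac{\mathrm{Cov}(\bar{y}_{st}, \bar{x}_{st})}{V(\bar{x}_{st})} = \frac{\sum_h W_h^2\,\frac{1-f_h}{n_h}\,S_{yxh}}{\sum_h W_h^2\,\frac{1-f_h}{n_h}\,S_{xh}^2} = \frac{\sum_h a_h B_h}{\sum_h a_h},$$
which is a weighted mean of the stratum
regression coefficients $B_h$, with weights $a_h = W_h^2\,\frac{1-f_h}{n_h}\,S_{xh}^2$. Then the minimum variance is
$$V(\bar{y}_{lrc})_{min} = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\left(S_{yh}^2 + b_c^2 S_{xh}^2 - 2b_c S_{yxh}\right).$$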
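That the minimizing combined slope is a weighted mean of the stratum regression coefficients can be checked numerically. The sketch below is illustrative and not part of the original notes; the function names, dictionary keys and stratum values are hypothetical.

```python
def combined_slope_ratio(strata):
    """b_c as Cov(y_st, x_st) / V(x_st), from stratum-level population quantities."""
    num = sum(s["W"] ** 2 * (1 - s["f"]) / s["n"] * s["S_yx"] for s in strata)
    den = sum(s["W"] ** 2 * (1 - s["f"]) / s["n"] * s["S_x2"] for s in strata)
    return num / den

def combined_slope_weighted(strata):
    """b_c as the weighted mean of B_h with weights a_h = W_h^2 (1 - f_h)/n_h * S_xh^2."""
    a = [s["W"] ** 2 * (1 - s["f"]) / s["n"] * s["S_x2"] for s in strata]
    B = [s["S_yx"] / s["S_x2"] for s in strata]
    return sum(ah * Bh for ah, Bh in zip(a, B)) / sum(a)

# Hypothetical stratum-level population quantities, for illustration only.
strata = [
    {"W": 0.6, "f": 0.10, "n": 20, "S_x2": 9.0, "S_yx": 12.0},
    {"W": 0.4, "f": 0.05, "n": 10, "S_x2": 4.0, "S_yx": 4.0},
]
b_ratio = combined_slope_ratio(strata)
b_weighted = combined_slope_weighted(strata)
B_values = [s["S_yx"] / s["S_x2"] for s in strata]
print(b_ratio, b_weighted, B_values)  # b_c lies between the stratum coefficients
```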
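An estimate of $V(\bar{y}_{lrc})$ is given as:
$$v(\bar{y}_{lrc}) = \sum_{h=1}^{L} W_h^2\,\frac{1-f_h}{n_h}\left(s_{yh}^2 + \hat{b}_c^2 s_{xh}^2 - 2\hat{b}_c s_{yxh}\right),$$
where
$$\hat{b}_c = \frac{\sum_h W_h^2\,\frac{1-f_h}{n_h}\,s_{yxh}}{\sum_h W_h^2\,\frac{1-f_h}{n_h}\,s_{xh}^2} = \frac{\sum_h W_h^2\,\frac{1-f_h}{n_h(n_h-1)}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)}{\sum_h W_h^2\,\frac{1-f_h}{n_h(n_h-1)}\sum_{i=1}^{n_h}(x_{hi} - \bar{x}_h)^2}.$$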
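The combined regression estimate with the slope estimated from the sample can be sketched in Python as follows. The sketch is illustrative and not part of the original notes: the function name `combined_regression`, the dictionary keys and the data are hypothetical, and the stratum weights are taken as W_h = N_h / N.

```python
from statistics import mean

def combined_regression(strata, X_bar):
    """Return (estimate, b_c_hat, variance estimate) for the combined regression estimate."""
    N = sum(s["N"] for s in strata)
    y_st = x_st = 0.0
    num = den = 0.0
    parts = []
    for s in strata:
        y, x = s["y"], s["x"]
        n_h = len(y)
        W_h = s["N"] / N
        c_h = W_h ** 2 * (1 - n_h / s["N"]) / n_h
        y_bar, x_bar = mean(y), mean(x)
        s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n_h - 1)
        s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n_h - 1)
        s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (n_h - 1)
        y_st += W_h * y_bar
        x_st += W_h * x_bar
        num += c_h * s_yx
        den += c_h * s_x2
        parts.append((c_h, s_y2, s_x2, s_yx))
    b_c = num / den
    est = y_st + b_c * (X_bar - x_st)
    var = sum(c * (sy2 + b_c ** 2 * sx2 - 2 * b_c * syx)
              for c, sy2, sx2, syx in parts)
    return est, b_c, var

# Hypothetical stratified sample and known overall population mean of x.
strata = [
    {"N": 40, "x": [1.0, 2.0, 4.0, 7.0], "y": [3.2, 4.9, 9.1, 14.8]},
    {"N": 60, "x": [2.0, 3.0, 5.0, 6.0], "y": [5.1, 6.8, 11.2, 12.9]},
]
est, b_c, var = combined_regression(strata, X_bar=4.0)
print(est, b_c, var)
```

Unlike the separate estimate, only the overall population mean of x is needed here, not the mean of x within each stratum.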