Sampling and Estimation

LESSON SIX
Sampling and Estimation
- Sampling techniques
- Central limit theorem
- Sampling distribution of statistical parameters
- Test of hypothesis
6.1 Methods of Sampling

a. Random (probability) sampling methods
These include
i. Simple random sampling
ii. Stratified sampling
iii. Systematic sampling
iv. Multi stage sampling
b. Non-random (non-probability) sampling methods
These consist of
i. Judgment sampling
ii. Quota sampling
iii. Cluster sampling
Simple Random Sampling

This refers to the sampling technique in which each and every item of the population is given an
equal chance of being included in the sample. Since selection of items in the sample depends
entirely on chance, this method is also called chance selection or representative sampling.
It is assumed that if the sample is chosen at random and if the size of the sample is sufficiently
large, it will represent all groups in the population.
Random sampling is of two types: sampling with replacement and sampling without replacement.
Sampling is said to be with replacement when, from a finite population, a sampling unit is drawn,
observed and then returned to the population before another unit is drawn. The population in
this case remains the same and a sampling unit might be selected more than once.
If, on the other hand, a sampling unit is chosen and not returned to the population after it has
been observed, the sampling is said to be without replacement.
Random samples may be selected with the help of the lottery method or a table of random numbers
(such as Tippett's table of random numbers, Fisher and Yates numbers or Kendall and Babington
Smith numbers).
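The difference between the two types of random sampling can be illustrated with a minimal Python sketch. The population of 20 numbered units, the sample size of 5 and the seed are assumptions made only for illustration; a pseudo-random generator stands in for the lottery method or a table of random numbers.

```python
import random

population = list(range(1, 21))   # hypothetical population of 20 numbered units
rng = random.Random(42)           # fixed seed so the draw can be repeated

# Without replacement: a selected unit is not returned, so it appears at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: each unit is returned before the next draw and may recur.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```

In the first list every unit is distinct; in the second the same unit may appear more than once.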
Stratified sampling
In this case the population is divided into groups in such a way that units within each group are
as similar as possible, in a process called stratification. The groups are called strata. Simple
random samples from each of the strata are collected and combined into a single sample. This
technique of collecting a sample from a population is called stratified sampling. Stratification
may be by age, occupation, income group, etc.
Systematic Sampling
In systematic sampling the units of the population are first arranged in some order (for example
ascending or descending) and the sample is then drawn according to a predetermined rule. Suppose a
population consists of 1000 units; then every 10th, 20th or 50th item is selected. This method is
very easy and economical. It also saves a lot of time.
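A minimal sketch of selecting every 10th item from a population of 1000 units is given below. The function name and the random starting position within the first interval are assumptions added for illustration; the text itself only specifies the fixed interval.

```python
import random

def systematic_sample(units, k, rng=None):
    """Return every k-th unit, starting at a random position among the first k."""
    rng = rng or random.Random()
    start = rng.randrange(k)      # assumed random start within the first interval
    return units[start::k]

population = list(range(1, 1001))  # 1000 ordered units
sample = systematic_sample(population, k=10, rng=random.Random(1))
print(len(sample), sample[:3])
```

With 1000 units and an interval of 10 the sample always contains 100 units, each 10 positions apart.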
Multistage sampling
This is similar to stratified sampling except division is done on geographical/location basis, e.g. a
country can be divided into provinces and then survey is done in 4 towns in each province. This
helps to cut traveling costs for a surveyor.
Cluster Sampling
This is where a few geographical regions e.g. a location, town or village are selected at random
and say every single household or shop in that area is interviewed. This again cuts on costs.
Judgment Sampling
Here the interviewer selects whom to interview, believing that their view is more fundamental
since they might be directly affected, e.g. to find out the effects of public transport one may choose to
interview only people who don't own cars and travel frequently to work.
6.2 THE CENTRAL LIMIT THEOREM

The theorem was introduced by De Moivre. According to it, if we select a large number of
simple random samples, each of size n, from any population and determine the mean of each
sample, the distribution of these sample means will tend to be described by the normal
probability distribution with mean µ and variance σ²/n. This is true even if the population
itself is not normally distributed. In other words, the sampling distribution of sample means
approaches a normal distribution irrespective of the distribution of the population from which the
samples are taken, and the approximation becomes increasingly close as the sample size
increases.
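The theorem can be checked empirically with a small simulation. The sketch below is only an illustration under assumed settings: an exponential population (mean 1, standard deviation 1, clearly not normal), samples of size 40 and 2000 repetitions.

```python
import random
import statistics

rng = random.Random(0)
n = 40               # size of each sample (assumed)
num_samples = 2000   # number of repeated samples (assumed)

# Exponential population with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # expected to be close to mu = 1
sd_of_means = statistics.stdev(sample_means)    # expected to be close to sigma / sqrt(n)
print(mean_of_means, sd_of_means, 1 / n ** 0.5)
```

The mean of the sample means should lie close to the population mean, and their standard deviation close to σ/√n (about 0.158 here).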
Types of distribution
Population distribution
It refers to the distribution of the individual values of population. Its mean is denoted by ‘µ’
Sample distribution
It is the distribution of the individual values of a single sample. Its mean is generally written as
$\bar{x}$. It is not usually the same as µ.
Distribution of Sample Means or sampling distribution

A sample of size n is taken from the parent population and mean of the sample is calculated.
This is repeated for a number of samples so that we have a distribution of sample means, which
approaches a normal distribution.
Standard errors of the mean

The series of sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ is normally distributed or nearly so (according
to the central limit theorem). It can be described by its mean and its standard deviation. This
standard deviation is known as the standard error.
Standard error of the mean: $S_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
Note: this formula is satisfactory for large samples drawn from a large population, i.e. n > 30 and
n < 5% of N.
- The word ‘error’ is in place of ‘deviation’ to emphasize that variation among sample means
is due to sampling errors.
- The smaller the standard error, the greater the precision of the sample value.
6.3 Statistical inference

It is the process of drawing conclusions about attributes of a population based upon information
contained in a sample (taken from the population).
It is divided into estimation of parameters and testing of hypotheses. The symbols for sample
statistics and population parameters are as follows.

                      Sample statistic    Population parameter
Arithmetic mean       $\bar{x}$           µ
Standard deviation    s                   σ
Number of items       n                   N
Statistical estimation
It is the procedure of using a sample statistic to estimate a population parameter.
It is divided into point estimation (where an estimate of a population parameter is given by a
single number) and interval estimation (where an estimate of a population parameter is given by a
range in which the parameter may be considered to lie).
For example, a bus meant to take a class of 100 students (population, N = 100) on a trip has a limit
of 600kg on the maximum weight it can carry. The teacher has to find out the weight of the class,
but without enough time to weigh everyone he picks 25 students at random (sample, n = 25). These
students are weighed and their average weight is recorded as 64kg ($\bar{x}$, the mean of the sample)
with a standard deviation (s). Using these sample statistics, the teacher intends to estimate the
average weight of the whole class (µ, the population mean).
Characteristic of a good estimator

(i) Unbiased: where the expected value of the statistic is equal to the population
parameter e.g. if the expected mean of a sample is equal to the population mean
(ii) Consistency: where an estimator yields values more closely approaching the
population parameter as the sample increases
(iii) Efficiency: where the estimator has smaller variance on repeated sampling.
(iv) Sufficiency: where an estimator uses all the information available in the data
concerning a parameter
Confidence Interval
The interval estimate or a ‘confidence interval’ consists of a range (an upper confidence limit and
lower confidence limit) within which we are confident that a population parameter lies and we
assign a probability that this interval contains the true population value
The confidence limits are the outer limits to a confidence interval. The confidence interval is the
interval between the confidence limits. The higher the confidence level, the wider the
confidence interval. For example, a normal distribution has the following characteristics
i. Mean ± 1.960 σ includes 95% of the population
ii. Mean ± 2.575 σ includes 99% of the population
1. LARGE SAMPLES
These are samples that contain a sample size greater than 30(i.e. n>30)
(a) Estimation of population mean

Here we assume that if we take a large sample from a population then the mean of the
population is very close to the mean of the sample
The steps to follow to estimate the population mean are
i. Take a random sample of n items, where n > 30
ii. Compute the sample mean ($\bar{x}$) and standard deviation (s)
iii. Compute the standard error of the mean by using the following formula
$S_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
where $S_{\bar{x}}$ = standard error of the mean
s = standard deviation of the sample
n = sample size
iv. Choose a confidence level, e.g. 95% or 99%
v. Estimate the population mean as follows
Population mean $\mu = \bar{x} \pm Z \times S_{\bar{x}}$
Here Z is the number of standard errors that corresponds to the chosen confidence level, e.g.
Z = 1.96 at the 95% confidence level. It is obtained from the normal tables.
Example
The quality department of a wire manufacturing company periodically selects a sample of wire
specimens in order to test for breaking strength. Past experience has shown that the breaking
strengths of a certain type of wire are normally distributed with standard deviation of 200 kg. A
random sample of 64 specimens gave a mean of 6200 kgs. Find out the population mean at 95%
level of confidence
Solution
Population mean = $\bar{x} \pm 1.96\, S_{\bar{x}}$
Note that the sample size is already n > 30, and s and $\bar{x}$ are given, so steps i), ii) and iv) are
already provided.
Here $\bar{x}$ = 6200 kg
$S_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{200}{\sqrt{64}} = 25$
Population mean = 6200 ± 1.96(25)
= 6200 ± 49
= 6151 to 6249
At 95% level of confidence, population mean will be in between 6151 and 6249
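The five estimation steps can be sketched in Python as follows. The function name is hypothetical, and the Z value is computed with the standard library's `statistics.NormalDist` rather than read from printed normal tables.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence):
    """Large-sample interval: sample mean +/- Z * (sd / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    se = sd / sqrt(n)                                # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se

# Wire breaking strength example: mean 6200 kg, sd 200 kg, n = 64.
low, high = mean_confidence_interval(6200, 200, 64, 0.95)
print(round(low), round(high))
```

For the wire example this reproduces the limits of about 6151 and 6249 kg.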
FINITE POPULATION CORRECTION FACTOR (FPCF)

If a given population is relatively small and the sample size is more than 5% of the
population, then the standard error should be adjusted by multiplying it by the finite population
correction factor.
FPCF is given by $\sqrt{\dfrac{N - n}{N - 1}}$
where N = population size
n = sample size
Example
A manager wants an estimate of sales of salesmen in his company. A random sample of 100 out of
500 salesmen is selected and average sales are found to be Shs. 75,000. If the sample standard
deviation is Shs. 15,000, find the population mean at the 99% level of confidence.
Solution
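Here N = 500, n = 100, $\bar{x}$ = 75000 and s = 15000
Standard error of the mean
$S_{\bar{x}} = \dfrac{s}{\sqrt{n}} \times \sqrt{\dfrac{N - n}{N - 1}}$
$= \dfrac{15000}{\sqrt{100}} \times \sqrt{\dfrac{500 - 100}{500 - 1}}$
$= \dfrac{15000}{10} \times \sqrt{\dfrac{400}{499}}$
$= \dfrac{15000}{10} \times 0.895$
$S_{\bar{x}} = 1342.50$
At the 99% level of confidence
Population mean = $\bar{x} \pm 2.58\, S_{\bar{x}}$
= Shs 75000 ± 2.58(1342.50)
= Shs 75000 ± 3464
= Shs 71536 to Shs 78464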
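The corrected standard error in this example can be verified with a short sketch. The function name is an assumption; Z = 2.58 is taken from the worked solution.

```python
from math import sqrt

def standard_error_fpc(sd, n, N):
    """Standard error of the mean multiplied by sqrt((N - n) / (N - 1))."""
    return (sd / sqrt(n)) * sqrt((N - n) / (N - 1))

se = standard_error_fpc(15000, 100, 500)
z = 2.58                                   # 99% confidence, from the normal tables
low, high = 75000 - z * se, 75000 + z * se
print(round(se, 2), round(low), round(high))
```

Carrying full precision gives a standard error of about 1342.98 rather than 1342.50, because the worked solution rounds the correction factor to 0.895; the resulting limits differ from those above by about one shilling.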
b) Estimation of difference between two means

We know that the standard error of a sample mean is given by the standard deviation
divided by the square root of the number of items in the sample ($\sqrt{n}$).
When given two samples A and B, the standard error of the difference between their means is given by
$S_{\bar{x}_A - \bar{x}_B} = \sqrt{\dfrac{S_A^2}{n_A} + \dfrac{S_B^2}{n_B}}$
Also note that we estimate the interval not from a single mean but from the difference between
the two sample means, i.e. $\bar{x}_A - \bar{x}_B$.
The appropriate number (Z) for a given confidence level does not change.
Thus the confidence interval is given by
$(\bar{x}_A - \bar{x}_B) \pm Z \, S_{\bar{x}_A - \bar{x}_B}$
Example
Given two samples A and B of 100 and 400 items respectively, with means $\bar{x}_1$ = 7 and
$\bar{x}_2$ = 10 and standard deviations of 2 and 3 respectively, construct a confidence interval for the
difference between the means at the 70% confidence level.
Solution
Sample                A                    B
Mean                  $\bar{x}_1$ = 7      $\bar{x}_2$ = 10
Sample size           n1 = 100             n2 = 400
Standard deviation    S1 = 2               S2 = 3
The standard error of the difference between the means of samples A and B is given by
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{4}{100} + \dfrac{9}{400}} = \sqrt{\dfrac{25}{400}} = \dfrac{5}{20} = 0.25$
At the 70% confidence level, the appropriate number is 1.04 (as read from the normal
tables, where a z value of 1.04 corresponds to a central area of about 0.7).
$|\bar{x}_1 - \bar{x}_2|$ = |7 – 10| = |–3| = 3
We take the absolute value of the difference between the means, i.e. its positive value.
The confidence interval is therefore given by
= 3 ± 1.04 (0.25)
= 3 ± 0.26
= 2.74 to 3.26
Thus 2.74 ≤ X ≤ 3.26, where X is the difference between the two population means.
Example 2
A comparison of the wearing out quality of two types of tyres was obtained by road testing.
Samples of 100 tyres of each type were collected. The miles traveled until wear out were recorded
and the results were as follows
Tyres       T1                               T2
Mean        $\bar{x}_1$ = 26400 miles        $\bar{x}_2$ = 25000 miles
Variance    $S_1^2$ = 1440000 miles²         $S_2^2$ = 1960000 miles²
Find a confidence interval for the difference between the means at the 70% confidence level.
Solution
$\bar{x}_1$ = 26400
$\bar{x}_2$ = 25000
Difference between the two means
$\bar{x}_1 - \bar{x}_2$ = (26400 – 25000)
= 1,400
Again we take the absolute value of the difference between the two means.
We calculate the standard error as follows
$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}} = \sqrt{\dfrac{1440000}{100} + \dfrac{1960000}{100}} = 184.4$
The appropriate number at the 70% confidence level is read from the normal tables as Z = 1.04.
Thus the confidence interval is calculated as follows
= 1400 ± (1.04) (184.4)
= 1400 ± 191.77
or (1400 – 191.77) to (1400 + 191.77)
1,208.23 ≤ X ≤ 1,591.77
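The interval for the difference between two means can be computed with a short sketch such as the one below. The function name and argument order are assumptions; Z = 1.04 is the tabulated value used in the examples.

```python
from math import sqrt

def diff_means_interval(mean_a, var_a, n_a, mean_b, var_b, n_b, z):
    """Interval for |mean_a - mean_b| using the two-sample standard error."""
    se = sqrt(var_a / n_a + var_b / n_b)
    diff = abs(mean_a - mean_b)
    return diff - z * se, diff + z * se

# Tyre example: variances 1,440,000 and 1,960,000, 100 tyres of each type.
low, high = diff_means_interval(26400, 1_440_000, 100, 25000, 1_960_000, 100, z=1.04)
print(round(low, 2), round(high, 2))
```

Note that the function takes variances, not standard deviations, so the standard deviations 2 and 3 of the first example would be passed as 4 and 9.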
c) Estimation of population proportions

This type of estimation applies when information cannot be given as a mean or as a
measure but only as a fraction or percentage.
Sampling theory stipulates that if repeated large random samples are taken from a
population, the sample proportion p will be normally distributed with mean equal to the
population proportion and standard error equal to
$S_p = \sqrt{\dfrac{pq}{n}}$ = standard error for sampling of population proportions
where n is the sample size and q = 1 – p.
The procedure for estimating a proportion is similar to that for estimating a mean; we only have
a different formula for calculating the standard error.
Example 1
In a sample of 800 candidates, 560 were male. Estimate the population proportion at 95%
confidence level.
Solution
Here
Sample proportion p = $\dfrac{560}{800}$ = 0.70
q = 1 – p = 1 – 0.70 = 0.30
n = 800
$S_p = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{(0.70)(0.30)}{800}} = 0.016$
Population proportion
= p ± 1.96 Sp, where 1.96 = Z
= 0.70 ± 1.96 (0.016)
= 0.70 ± 0.03
= 0.67 to 0.73
= between 67% and 73%
Example 2
A sample of 600 accounts was taken to test the accuracy of posting and balancing of accounts
where in 45 mistakes were found. Find out the population proportion. Use 99% level of
confidence
Solution
Here
n = 600; p = $\dfrac{45}{600}$ = 0.075
q = 1 – 0.075 = 0.925
$S_p = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{(0.075)(0.925)}{600}} = 0.011$
Population proportion
= p ± 2.58 (Sp)
= 0.075 ± 2.58 (0.011)
= 0.075 ± 0.028
= 0.047 to 0.103
= between 4.7% and 10.3%
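Both proportion examples follow the same arithmetic, which can be sketched as follows. The function name is hypothetical and the Z values are those used in the worked solutions.

```python
from math import sqrt

def proportion_interval(successes, n, z):
    """Interval p +/- z * sqrt(p * q / n) for a population proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)     # standard error of the sample proportion
    return p - z * se, p + z * se

low1, high1 = proportion_interval(560, 800, z=1.96)   # candidates example, 95%
low2, high2 = proportion_interval(45, 600, z=2.58)    # accounts example, 99%
print(round(low1, 3), round(high1, 3))
print(round(low2, 3), round(high2, 3))
```

Small differences in the last decimal place compared with the hand calculations arise because the worked solutions round the standard error before multiplying.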
d) Estimation of difference between population proportions

Let the two sample proportions be P1 and P2 respectively.
Then the (absolute) difference between the two proportions is given by (P1 – P2).
The standard error is given by
$S_{(P_1 - P_2)} = \sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}$ where $p = \dfrac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$ and q = 1 – p
Then, given the confidence level, the confidence interval for the difference between the two
population proportions is given by
$(P_1 - P_2) \pm Z \, S_{(P_1 - P_2)} = (P_1 - P_2) \pm Z \sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}$
Always remember to combine P1 and P2 into the pooled proportion p defined above.
2. SMALL SAMPLES
(a) Estimation of population mean
If the sample size is small (n < 30), the arithmetic means of small samples are not normally
distributed. In such circumstances, Student's t distribution must be used to estimate the
population mean.
In this case
Population mean $\mu = \bar{x} \pm t \, S_{\bar{x}}$
$\bar{x}$ = sample mean
$S_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
s = standard deviation of the sample = $\sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ for small samples
n = sample size
v = n – 1 degrees of freedom
The value of t is obtained from Student's t distribution tables for the required confidence level.
Example
A random sample of 12 items is taken and is found to have a mean weight of 50 grams and a
standard deviation of 9 grams
What is the mean weight of population
a) with 95% confidence
b) with 99% confidence
Solution
$\bar{x}$ = 50; s = 9; v = n – 1 = 12 – 1 = 11; $S_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{9}{\sqrt{12}}$
$\mu = \bar{x} \pm t \, S_{\bar{x}}$
At the 95% confidence level (t = 2.201 for v = 11)
$\mu = 50 \pm 2.201 \left( \dfrac{9}{\sqrt{12}} \right)$
= 50 ± 5.72 grams
Therefore we can state with 95% confidence that the population mean is between 44.28 and
55.72 grams.
At the 99% confidence level (t = 3.106 for v = 11)
$\mu = 50 \pm 3.106 \left( \dfrac{9}{\sqrt{12}} \right)$
= 50 ± 8.07 grams
Therefore we can state with 99% confidence that the population mean is between 41.93 and
58.07 grams.
Note: To use the t distribution tables it is important to find the degrees of freedom (v = n – 1).
In the example above v = 12 – 1 = 11.
From the tables we find that at the 95% confidence level, against 11 degrees of freedom and under
0.05, the value of t = 2.201.
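The small-sample interval can be sketched in Python as below. The standard library has no t distribution, so the t value is passed in as read from tables; 2.201 and 3.106 are the tabulated two-sided values for 11 degrees of freedom at 95% and 99%. The function name is an assumption.

```python
from math import sqrt

def t_interval(sample_mean, s, n, t_value):
    """Small-sample interval: mean +/- t * s / sqrt(n), with t read from tables."""
    se = s / sqrt(n)
    return sample_mean - t_value * se, sample_mean + t_value * se

low95, high95 = t_interval(50, 9, 12, t_value=2.201)   # v = 11, 95% confidence
low99, high99 = t_interval(50, 9, 12, t_value=3.106)   # v = 11, 99% confidence
print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

These values give margins of 5.72 and 8.07 grams, matching the intervals stated in the solution.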
6.4 Hypothesis Testing
Definition
- A hypothesis is a claim or an opinion about an item or issue. It has to be tested
statistically in order to establish whether it is correct or not.
- Whenever testing a hypothesis, one must fully understand the two basic hypotheses to be tested,
namely
i. The null hypothesis (H0)
ii. The alternative hypothesis(H1)
The null hypothesis

This is the hypothesis being tested: the belief that a population has a certain characteristic. For
example, the Kenya Bureau of Standards (KBS) may visit a sugar making company with the intention of
confirming that the 2kg bags of sugar produced actually weigh 2kg and not less. They conduct a
hypothesis test with the null hypothesis being H0: each bag weighs 2kg. The test will set out to
confirm this or to refute it.
The alternative hypothesis

While formulating a null hypothesis we also consider the fact that the belief might be found to
be untrue hence we will reject it. We therefore formulate an alternative hypothesis which is a
contradiction to the null hypothesis, thus when we reject the null hypothesis we accept the
alternative hypothesis.
In our example the alternative hypothesis would be
H1: each bag does not weigh 2kg
Acceptance and rejection regions

The possible values of a test statistic fall into two regions: those consistent with the null
hypothesis (the acceptance region) and those that lead to the rejection of the null hypothesis
(the rejection region or critical region).
The values which separate the rejection region from the acceptance region are called critical
values.
Type I and type II errors

While testing hypothesis (H0) and deciding to either accept or reject a null hypothesis, there are
four possible occurrences.
a) Acceptance of a true hypothesis (correct decision) – accepting the null hypothesis when it
happens to be the correct decision. Note that statistics does not give absolute information,
thus its conclusion could be wrong; it is only that the probability of it being right is high.
b) Rejection of a false hypothesis (correct decision).
c) Rejection of a true hypothesis – (incorrect decision) – this is called type I error, with
probability = α.
d) Acceptance of a false hypothesis – (incorrect decision) – this is called type II error, with
probability = β.
Levels of significance
A level of significance is a probability value which is used when conducting tests of hypothesis.
It is the probability of making an incorrect decision by rejecting a null hypothesis that is in
fact true (a type I error). The probabilities used are usually very small, e.g. 1% or 5%.
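[Figure: normal curve at the 1% level of significance, showing areas 0.5000 and 0.4900 and a 1%
provision for errors beyond the critical value.]
[Figure: normal curve for a one tailed (left tail) test at the 5% level of significance. The critical
region of area 0.05 lies to the left of the critical value Z = –1.65, and the area between the critical
value and the mean is 0.45.]
NB: If the standardized value of the mean is less than –1.65 we reject the null hypothesis (H0)
and accept the alternative hypothesis (H1), but if the standardized value of the mean is more
than –1.65 we accept the null hypothesis and reject the alternative hypothesis.
The above sketch and level of significance are applicable when the sample mean is less than the
population mean.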
The following is used when the sample mean is greater than the population mean.
[Figure: normal curve for a one tailed (right tail) test at the 5% level of significance. The acceptance
region lies to the left of the critical value Z = 1.65 and the critical (rejection) region of area 0.05 lies
to its right.]
NB: If the standardized value of the sample mean is less than 1.65, we accept the null hypothesis and
reject the alternative. If it is greater than 1.65 we reject the null hypothesis and accept the
alternative hypothesis.
For a two tailed test at the 1% level of significance:
[Figure: normal curve for a two tailed test. The acceptance region (area 0.495 on each side of the
mean) lies between the critical values –2.58 and +2.58; each tail beyond them is a rejection region of
area 0.5% = 0.005.]
NB: If the standardized value of the sample mean is between –2.58 and +2.58, accept the null
hypothesis; otherwise reject it and therefore accept the alternative hypothesis.
TWO TAILED TESTS

A two tailed test is normally used in statistical work (tests of significance) when deviations in
either direction matter, e.g. if a complaint lodged by a client is about a product not meeting
certain specifications, i.e. the item will generate a complaint if its measurements are below the
lower tolerance limit or above the upper tolerance limit.
[Figure: normal curve with the region of acceptance for H0 between the tolerance limits 15cm and
17 ½ cm, and a critical region in each tail.]
NB: The null hypothesis is rejected if the sample mean lies beyond the tolerance limits (15cm and
17 ½ cm).
ONE TAILED TEST

This is a test where the alternative hypothesis (H1) is only concerned with one of the tails of the
distribution, e.g. to test a business complaint where the complaint is about the measurements of an
item being shorter than is required.
E.g. a manufacturer of a given brand of bread may state that the average weight of the bread is
500 gms but if a consumer takes a sample and weighs each of the pieces of bread and happens
to have a mean of 450 gms he will definitely complain about the bread which is underweight.
The statistical analysis to be done will concentrate on the left tail of the normal distribution in
which one will have to establish whether 450 gms being less than 500g is statistically significant.
Such a test therefore is referred to as one tailed test.
On the other hand the test may concentrate on the right hand tail of the normal distribution.
When this happens the major complaint is likely to have to do with oversize items bought. The
test is known as one tailed because the focus is on one end of the normal distribution.
Number of standard errors

                            Two tailed test    One tailed test
5% level of significance    1.96               1.65
1% level of significance    2.58               2.33
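The critical values in this table can be reproduced from the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the function name is an assumption.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed):
    """Number of standard errors that cuts off the rejection region(s)."""
    tail = alpha / 2 if two_tailed else alpha   # area in a single tail
    return NormalDist().inv_cdf(1 - tail)

for alpha in (0.05, 0.01):
    print(alpha, round(critical_z(alpha, True), 3), round(critical_z(alpha, False), 3))
```

The one tailed value at 5% is 1.645 to three decimal places, which printed tables round to 1.65 (or sometimes 1.64).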
HYPOTHESIS TESTING PROCEDURE

Whenever a business complaint comes up there is a recommended procedure for conducting a
statistical test. The purpose of such a test is to establish whether the null hypothesis or
alternative hypothesis is to be accepted.
The following are steps normally adopted
1. Statement of the null and alternative hypothesis
2. Statement of the level of significance to be used.
3. Statement about the test statistic i.e. what is to be tested e.g. the sample mean, sample
proportion, difference between sample means or sample proportions
4. Type of test whether two tailed or one tailed.
5. Statement on critical values using the appropriate level of significance
6. Standardizing the test statistic
7. Conclusion showing whether to accept or reject the null hypothesis
STANDARD HYPOTHESIS TESTS

In principle, we can test the significance of any statistic related to any probability distribution.
However we will be interested in a few standard cases. The sample statistics mean, proportion
and variance are related to the normal, t, F and chi squared distributions.
Thus
1. Normal test
Tests a sample mean ($\bar{x}$) against a population mean (µ) (where sample size n > 30 and
population variance σ² is known) and a sample proportion, P (where np > 5 and nq > 5,
since in this case the normal distribution can be used to approximate the binomial
distribution).
2. t test
Tests a sample mean ( X ) against a population mean and especially where the population
variance is unknown and n < 30.
3. Variance ratio test or F test

It is used to compare population variances and it is used with samples of any size drawn
from normal populations.
4. Chi squared test

It can be used to test the association between attributes or the goodness of fit of an
observed frequency distribution to a standard distribution
Example 1
A certain NGO carried out a survey in a certain community in order to establish the average age at
which girls are married. The results of the survey indicated that the marriage age for girls
is 19 years.
In order to establish the validity of this mean marital age, a sample of 50 women was interviewed
and the average age at which they got married was found to be 16 years. The ages at which they
were married had a standard deviation of 2.1 years.
The sample data indicates that the marital age is less than 19 years. Is this conclusion true or not?
Required
Conduct a statistical test to either support or refute the above conclusion drawn from the sample
statistics, i.e. that the marriage age is less than 19 years. Use a level of significance of 5%.
Solution
1. Null hypothesis
H0: μ (mean marital age) = 19 years
Alternative hypothesis H1: μ (mean marital age) < 19 years
2. The level of significance is 5%
3. The test statistic is the sample mean age, $\bar{x}$ = 16 years
4. The critical value of the one tailed test (one tailed because the alternative hypothesis is
a strict inequality, µ < 19) at the 5% level of significance is –1.65
[Figure: normal curve with the rejection region to the left of –1.65 and the acceptance region to its
right.]
5. The standardized value of the sample mean is
$Z = \dfrac{\bar{x} - \mu}{S_{\bar{x}}}$ where $S_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
where $\bar{x}$ = sample mean
µ = population mean
s = sample standard deviation
n = sample size
Z = standardized value (as per computation)
The standardized value Z must fall within the acceptance region for us to accept the null
hypothesis. Thus it must be greater than –1.65; otherwise we reject the null hypothesis and accept
the alternative hypothesis.
$Z = \dfrac{16 - 19}{2.1 / \sqrt{50}} = -10.1$
6. Since –10.1 < –1.65, we reject the null hypothesis and accept the alternative hypothesis
at the 5% level of significance, i.e. the marriage age in this community is significantly lower
than 19 years.
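The standardization and decision steps of this example can be sketched as follows. The function name is hypothetical and the critical value –1.65 is the tabulated one tailed value at the 5% level.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sd, n):
    """Standardized value of a sample mean under the null hypothesis mean mu0."""
    return (sample_mean - mu0) / (sd / sqrt(n))

z = z_statistic(16, 19, 2.1, 50)   # marriage age example
reject_h0 = z < -1.65              # left tail test at the 5% level
print(round(z, 2), reject_h0)
```

The same function applies to any one-sample test of a mean; only the comparison with the critical value changes with the type of test.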
Example 2
A foreign company which manufactures electric bulbs has assured its customers that the lifespan
of the bulbs is 28 months with a standard deviation of 4 months.
Recently the company embarked on a quality improvement research for their product. After the
research using new technology, a sample of 70 bulbs was tested and they gave a mean lifespan of
30.2 months
Does this justify the research undertaken? Use 1% level of significance to conduct a statistical
test in order to establish the truth about the above question.
Testing procedure
1. Null hypothesis H0: µ = 28
Alternative hypothesis H1: µ > 28
2. The level of significance is 1% (one tailed test)
3. The test statistic is the sample mean lifespan, $\bar{x}$ = 30.2 months
4. The critical value of the one tailed test at the 1% level of significance is +2.33
[Figure: normal curve with area 0.4900 between the mean and the critical value 2.33, and a rejection
region of area 1% = 0.01 to its right.]
5. The standardized value of the sample mean is
$Z = \dfrac{\bar{x} - \mu}{S_{\bar{x}}} = \dfrac{30.2 - 28}{4 / \sqrt{70}} = 4.6$
6. Since 4.6 > 2.33, we reject the null hypothesis and accept the alternative hypothesis at the
1% level of significance, i.e. the new sample mean lifespan is significantly higher than the
population mean.
Therefore the research undertaken was worthwhile, i.e. justified.
Example 3
A construction firm has placed an order that they require a consignment of wires which have a
mean length of 10.5 meters with a standard deviation of 1.7 m
The company which produces the wires delivered 90 wires, which had a mean length of 9.2 m.,
The construction company rejected the consignment on the grounds that they were different
from the order placed.
Required
Conduct a statistical test to indicate whether you support or not support the action taken by the
construction company at 5% level of significance.
Solution
Null hypothesis H0: µ = 10.5 m
Alternative hypothesis H1: µ ≠ 10.5 m
The level of significance is 5%
The test statistic is the sample mean, $\bar{x}$ = 9.2 m
The critical values of the two tailed test at the 5% level of significance are ±1.96.
[Figure: normal curve with the acceptance region between –1.96 and +1.96 and a rejection region in
each tail.]
The standardized value of the test statistic is
$Z = \dfrac{\bar{x} - \mu}{S_{\bar{x}}} = \dfrac{9.2 - 10.5}{1.7 / \sqrt{90}} = -7.25$
Since –7.25 < –1.96, reject the null hypothesis and accept the alternative hypothesis at the 5% level
of significance, i.e. the sample mean is statistically different from the mean length ordered by
the construction company. Therefore support the action taken by the construction company.
TESTING THE DIFFERENCE BETWEEN TWO SAMPLE MEANS (LARGE SAMPLES)

A large sample is defined as one which contains 30 or more items (n ≥ 30), where n is the sample
size.
Those involved in a business constantly watch the standards or specifications of the items
which they sell. For example, a trader may receive a batch of items at one time and another batch
at a later time, and may conclude that the two samples differ in certain specifications, e.g. mean
weight, mean lifespan, mean length, etc. It may then become necessary to establish whether the
observed differences are statistically significant or not. If the differences are statistically
significant, they must be explained, i.e. there are identifiable causes. If they are not
statistically significant, the differences observed have no identifiable causes and are mainly due
to chance.
If the differences are established to be statistically significant, then the complaints
which necessitated that kind of test are justified.
Let X1 and X2 be any two samples whose sizes are n1 and n2, with means $\bar{x}_1$ and $\bar{x}_2$ and standard
deviations S1 and S2 respectively. In order to test the difference between the two sample means,
we apply the following formulas
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}}$ where $S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}$
Example 1
An agronomist was interested in the particular fertilizer yield output. He planted maize on 50
equal pieces of land and the mean harvest obtained later was 60 bags per plot with a standard
deviation of 1.5 bags. The crops grew under natural circumstances and conditions without the
soil being treated with any fertilizer. The same agronomist carried out an alternative experiment
where he picked 60 plots in the same area and planted the same plant of maize but a fertilizer
was applied on these plots. After the harvest it was established that the mean harvest was 63
bags per plot with a standard deviation of 1.3 bags
Required
Conduct a statistical test in order to establish whether there was a significant difference between
the mean harvest under the two types of field conditions. Use 5% level of significance.
Solution
H0 : µ1 = µ2
H1 : µ1 ≠ µ2
The critical values of the two tailed test at the 5% level of significance are ±1.96.
The standardized value of the difference between sample means is given by Z where
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}}$ where $S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{1.5^2}{50} + \dfrac{1.3^2}{60}}$
$Z = \dfrac{60 - 63}{\sqrt{0.045 + 0.028}} = -11.1$
[Figure: normal curve with the acceptance region between –1.96 and +1.96.]
Since –11.1 < –1.96, we reject the null hypothesis and accept the alternative hypothesis at the 5%
level of significance, i.e. the difference between the sample mean harvests is statistically significant.
This implies that the fertilizer had a positive effect on the harvest of maize.
Note: You don't have to illustrate your solution with a diagram.
Example 2
An observation was made about the reading abilities of males and females. The observation led to a
conclusion that females are faster readers than males. The observation was based on the times
taken by both females and males when reading out a list of names during graduation ceremonies.
In order to investigate into the observation and the consequent conclusion a sample of 200 men
were given lists to read. On average each man took 63 seconds with a standard deviation of 4
seconds
A sample of 250 women were also taken and asked to read the same list of names. It was found
that they on average took 62 seconds with a standard deviation of 1 second.
Required
By conducting a statistical hypothesis testing at 1% level of significance establish whether the
sample data obtained does support earlier observation or not
Solution
H0: µ1 = µ2
H1: µ1 ≠ µ2
The critical values of the two tailed test at the 1% level of significance are ±2.58.
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}}$
$Z = \dfrac{63 - 62}{\sqrt{\dfrac{4^2}{200} + \dfrac{1^2}{250}}} = 3.45$
[Figure: normal curve with the acceptance region between –2.58 and +2.58; the computed value +3.45
falls in the right hand rejection region.]
Since 3.45 > 2.58, reject the null hypothesis and accept the alternative hypothesis at the 1% level
of significance, i.e. there is a significant difference between the reading speeds of males and
females. Since the women's mean time is lower, females are the faster readers.
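The two worked examples on the difference between sample means can be checked with a short sketch. The function and variable names are assumptions made for illustration.

```python
from math import sqrt

def two_sample_z(mean1, s1, n1, mean2, s2, n2):
    """Standardized difference between two large-sample means."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # standard error of the difference
    return (mean1 - mean2) / se

z_fertilizer = two_sample_z(60, 1.5, 50, 63, 1.3, 60)   # maize harvest example
z_reading = two_sample_z(63, 4, 200, 62, 1, 250)        # reading time example
print(round(z_fertilizer, 2), round(z_reading, 2))
```

The sign of Z depends only on which sample is taken first; for a two tailed test it is the absolute value that is compared with the critical value (1.96 at 5%, 2.58 at 1%).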
TEST OF HYPOTHESIS ON PROPORTIONS

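This follows a similar method to the one for means, except that the standard error used in this
case is
$S_p = \sqrt{\dfrac{pq}{n}}$
The Z score is calculated as $Z = \dfrac{P - \pi}{S_p}$
where P = proportion found in the sample
π = the hypothetical proportion
When testing, p and q in the standard error are taken from the hypothetical proportion, i.e. p = π
and q = 1 – π.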
Example
A member of parliament (MP) claims that in his constituency only 50% of the total youth population lacks university education. A local media company wanted to ascertain that claim, so it conducted a survey of a sample of 400 youths, of whom 54% lacked university education.
Required:
At the 5% level of significance, test whether the MP’s claim is wrong.
Solution.
Note: This is a two-tailed test, since we wish to test whether the proportion is different (≠) from the claimed value, and not against a specific one-sided alternative such as < (less than) or > (more than).
H0 : π = 50% of all youth in the constituency lack university education.

H1 : π ≠ 50% of all youth in the constituency lack university education.
$$S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.5 \times 0.5}{400}} = 0.025$$
$$Z = \frac{0.54 - 0.50}{0.025} = 1.6$$
At the 5% level of significance the critical value for a two-tailed test is 1.96. Since the calculated Z value is less than the tabulated value, i.e. 1.6 < 1.96, we accept the null hypothesis.
Thus the sample gives no evidence that the MP’s claim is wrong.
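As a quick check of the arithmetic, the test on a single proportion can be sketched as below. This is an illustration under the assumption, taken from the worked solution, that the standard error is computed with the hypothetical proportion; the name `one_proportion_z` is made up for this sketch.

```python
import math


def one_proportion_z(p_sample, pi0, n):
    """Z statistic for a sample proportion against a hypothetical proportion pi0."""
    # Standard error evaluated under H0, i.e. with p = pi0 and q = 1 - pi0
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_sample - pi0) / se


# MP example: 54% of 400 sampled youths, claimed proportion 50%
z = one_proportion_z(0.54, 0.50, 400)
reject = abs(z) > 1.96  # two-tailed critical value at the 5% level, from the text
print(f"Z = {z:.2f}, reject H0: {reject}")
```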
HYPOTHESIS TESTING OF THE DIFFERENCE BETWEEN PROPORTIONS
Example
Ken industrial manufacturers have produced a perfume known as “fianchetto.” In order to test its popularity in the market, the manufacturer carried out a random survey in Back rank city, where 10,000 consumers were interviewed, of whom 7,200 showed preference. The manufacturer also moved to Rook town, where he interviewed 12,000 consumers, of whom 10,000 showed preference for the product.
Required
Design a statistical test and hence use it to advise the manufacturer regarding the differences in
the proportion, at 5% level of significance.
Solution
H0 : π1 = π2
H1 : π1 ≠ π2
The critical value for this two tailed test at 5% level of significance = 1.96.
Now
$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{S_{(P_1 - P_2)}}$$
But since the null hypothesis is π1 = π2, the second part of the numerator disappears, i.e. π1 − π2 = 0, which will always be the case under this null hypothesis.
Then
$$Z = \frac{P_1 - P_2}{S_{(P_1 - P_2)}}$$
Where:
                                    Sample 1        Sample 2
Sample size                         n1 = 10,000     n2 = 12,000
Sample proportion of success        P1 = 0.72       P2 = 0.83
Population proportion of success    π1              π2
Now
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$$
Where
$$p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} \quad \text{and} \quad q = 1 - p$$
∴ in our case
$$p = \frac{10{,}000(0.72) + 12{,}000(0.83)}{10{,}000 + 12{,}000} = \frac{17{,}160}{22{,}000} = 0.78$$
∴ q = 0.22
$$S_{(P_1 - P_2)} = \sqrt{\frac{0.78 \times 0.22}{10{,}000} + \frac{0.78 \times 0.22}{12{,}000}} = 0.00561$$
$$Z = \frac{0.72 - 0.83}{0.00561} = -19.6$$
Since |Z| = 19.6 > 1.96, we reject the null hypothesis and accept the alternative: the difference between the proportions is statistically significant. This implies that the perfume is much more popular in Rook town than in Back rank city.
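The pooled-proportion calculation is easy to get wrong by hand, so a short numerical check may help. The sketch below assumes the rounded sample proportions used in the solution (0.72 and 0.83); the function name `two_proportion_z` is illustrative only.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Z statistic for the difference between two sample proportions."""
    # Pooled proportion p = (p1*n1 + p2*n2) / (n1 + n2), with q = 1 - p
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # sqrt(pq/n1 + pq/n2), written with pq factored out
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, pooled, se


# Perfume example: Back rank city (sample 1) versus Rook town (sample 2)
z, pooled, se = two_proportion_z(0.72, 10_000, 0.83, 12_000)
reject = abs(z) > 1.96  # two-tailed critical value at the 5% level, from the text
print(f"pooled p = {pooled:.2f}, standard error = {se:.5f}, Z = {z:.2f}, reject H0: {reject}")
```

The sign of Z only reflects which sample is labelled 1; the decision depends on its absolute value.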
HYPOTHESIS TESTING ABOUT THE DIFFERENCE BETWEEN TWO PROPORTIONS
This test is used to test the difference between the proportions of a given attribute found in two random samples.
The null hypothesis is that there is no difference between the population proportions, which means that the two samples are from the same population.
Hence
H0 : π1 = π2
The best estimate of the standard error of the difference between P1 and P2 is given by pooling the samples and finding the pooled sample proportion (p), thus
$$p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$$
Standard error of the difference between proportions:
$$S_{(p_1 - p_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} \quad \text{where} \quad q = 1 - p$$
And
$$Z = \frac{P_1 - P_2}{S_{(p_1 - p_2)}}$$
Example
In a random sample of 100 persons taken from village A, 60 are found to be consuming tea. In another sample of 200 persons taken from village B, 100 persons are found to be consuming tea. Do the data reveal a significant difference between the two villages so far as the habit of taking tea is concerned?
Solution
Let us take the hypothesis that there is no significant difference between the two villages as far as the habit of taking tea is concerned, i.e. π1 = π2
We are given
P1 = 0.6; n1 = 100
P2 = 0.5; n2 = 200
The appropriate statistic to be used here is given by
$$p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \frac{(0.6)(100) + (0.5)(200)}{100 + 200} = \frac{60 + 100}{300} = 0.53$$
q = 1 − 0.53 = 0.47
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} = \sqrt{\frac{(0.53)(0.47)}{100} + \frac{(0.53)(0.47)}{200}} = 0.0611$$
$$Z = \frac{0.6 - 0.5}{0.0611} = 1.64$$
Since the computed value of Z is less than the critical value of Z = 1.96 at the 5% level of significance, we accept the hypothesis and conclude that there is no significant difference in the habit of taking tea between villages A and B.
t distribution (Student’s t distribution) tests of hypothesis (tests for small samples, n < 30)
For small samples (n < 30), the method used in hypothesis testing is similar to the one for large samples, except that t values from the t distribution at a given number of degrees of freedom v are used instead of the Z score. The standard error statistic used is also different.
Note that v = n – 1 for a single sample and v = n1 + n2 – 2 where two samples are involved.
a) Test of hypothesis about the population mean

When the population standard deviation is not known and the sample standard deviation (S) is used in its place, the t statistic is defined as
$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}} \quad \text{where} \quad S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
which follows the Student’s t distribution with (n − 1) d.f., where
$\bar{X}$ = sample mean
μ = hypothesised population mean
n = sample size
and S is the standard deviation of the sample, calculated by the formula
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \quad \text{for } n < 30$$
If the absolute value of the calculated t exceeds the table value of t at the specified level of significance, the null hypothesis is rejected.
Example
Ten oil tins are taken at random from an automatic filling machine. The mean weight of the tins is 15.8 kg and the standard deviation is 0.5 kg. Does the sample mean differ significantly from the intended weight of 16 kg? Use the 5% level of significance.
Solution
Given that n = 10; $\bar{X}$ = 15.8; S = 0.50; μ = 16; v = 9
H0 : μ = 16
H1 : μ ≠ 16
$$S_{\bar{X}} = \frac{0.5}{\sqrt{10}} = 0.16$$
$$t = \frac{15.8 - 16}{0.5 / \sqrt{10}} = \frac{-0.2}{0.16} = -1.25$$
The table value of t for 9 d.f. at the 5% level of significance is 2.26. The absolute value of the computed t (1.25) is smaller than the table value of t. Therefore the difference is insignificant and the null hypothesis is accepted.
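The single-sample t test can be sketched in a few lines. This is an illustrative check only: the function name `one_sample_t` is an assumption, and the table value 2.26 is copied from the text rather than computed.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic and degrees of freedom for a small-sample test of a mean."""
    se = sample_sd / math.sqrt(n)  # standard error S / sqrt(n)
    return (sample_mean - mu0) / se, n - 1


# Oil tins example: n = 10, mean 15.8 kg, S = 0.5 kg, intended weight 16 kg
t, df = one_sample_t(15.8, 0.5, 10, 16)
table_t = 2.26  # two-tailed table value for 9 d.f. at the 5% level, from the text
reject = abs(t) > table_t
print(f"t = {t:.2f} with {df} d.f., reject H0: {reject}")
```

Without rounding the standard error to 0.16, t comes out nearer −1.26 than −1.25; the conclusion is unaffected.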
b) Test of hypothesis about the difference between two means

The t test can be used when testing hypotheses concerning the difference between two means under two assumptions: that the two populations are normally distributed (or nearly so), and that their standard deviations are the same, or at any rate not significantly different.
The appropriate test statistic to be used is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{at } (n_1 + n_2 - 2) \text{ d.f.}$$
The standard deviation is obtained by pooling the two sample standard deviations as shown below.
$$S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}$$
Where S1 and S2 are the standard deviations of samples 1 and 2 respectively.
Now
$$S_{\bar{X}_1} = \frac{S_p}{\sqrt{n_1}} \quad \text{and} \quad S_{\bar{X}_2} = \frac{S_p}{\sqrt{n_2}}$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{S_{\bar{X}_1}^2 + S_{\bar{X}_2}^2}$$
Alternatively
$$S_{(\bar{X}_1 - \bar{X}_2)} = S_p \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
Example
Two different drugs, A and B, were tried on certain patients for increasing weight. Five persons were given drug A and seven persons were given drug B. The increase in weight (in pounds) is given below.
Drug A    8   12   13    9    3
Drug B   10    8   12   15    6    8   11
Do the two drugs differ significantly with regard to their effect in increasing weight? (Given that v = 10; t0.05 = 2.23)
Solution
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}}$$
Calculate $\bar{X}_1$, $\bar{X}_2$ and S.

X1    (X1 – X̄1)    (X1 – X̄1)²    X2    (X2 – X̄2)    (X2 – X̄2)²

 8       -1            1          10        0             0
12       +3            9           8       -2             4
13       +4           16          12       +2             4
 9        0            0          15       +5            25
 3       -6           36           6       -4            16
                                   8       -2             4
                                  11       +1             1
ΣX1 = 45   Σ(X1 – X̄1) = 0   Σ(X1 – X̄1)² = 62   ΣX2 = 70   Σ(X2 – X̄2) = 0   Σ(X2 – X̄2)² = 54

$$\bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{45}{5} = 9 \qquad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{70}{7} = 10$$
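$$S_1 = \sqrt{\frac{62}{4}} = 3.94 \qquad S_2 = \sqrt{\frac{54}{6}} = 3$$
$$S_p = \sqrt{\frac{4(15.5) + 6(9)}{10}} = \sqrt{11.6} = 3.406$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{11.6}{5} + \frac{11.6}{7}} \quad \text{or} \quad 3.406 \sqrt{\frac{7 + 5}{5 \times 7}} = 1.99$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} = \frac{9 - 10}{1.99} = -0.50$$
Now t0.05 (at v = 10) = 2.23 > |t| = 0.50
Thus we accept the null hypothesis.
Hence there is no significant difference in the efficacy of the two drugs in the matter of increasing weight.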
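The pooled-variance calculation can be reproduced from the raw observations. The sketch below is an illustration, not the lesson's own method of working: `pooled_t` is a made-up name, the drug A values are those used in the working table (8, 12, 13, 9, 3), and `statistics.stdev` is used because it divides by n − 1, matching the formula for S.

```python
import math
from statistics import mean, stdev


def pooled_t(sample1, sample2):
    """Pooled two-sample t statistic; returns (t, degrees of freedom, Sp, standard error)."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = stdev(sample1), stdev(sample2)  # sample standard deviations (n - 1 divisor)
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    se = sp * math.sqrt((n1 + n2) / (n1 * n2))
    return (mean(sample1) - mean(sample2)) / se, df, sp, se


drug_a = [8, 12, 13, 9, 3]
drug_b = [10, 8, 12, 15, 6, 8, 11]
t, df, sp, se = pooled_t(drug_a, drug_b)
reject = abs(t) > 2.23  # table value for 10 d.f. at the 5% level, as given in the question
print(f"Sp = {sp:.3f}, standard error = {se:.2f}, t = {t:.2f}, d.f. = {df}, reject H0: {reject}")
```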
Example
Two salesmen A and B are working in a certain district. From a survey conducted by the head
office, the following results were obtained. State whether there is any significant difference in the
average sales between the two salesmen at 5% level of significance.
A B
No. of sales 20 18
Average sales in shs 170 205
Standard deviation in shs 20 25
Solution
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
Where
$$S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = S_p \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
Where: $\bar{X}_1$ = 170, $\bar{X}_2$ = 205, n1 = 20, n2 = 18, S1 = 20, S2 = 25, v = 36
$$S_p = \sqrt{\frac{19(20^2) + 17(25^2)}{20 + 18 - 2}} = 22.5$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = 22.5 \sqrt{\frac{38}{360}} = 7.31$$
$$t = \frac{170 - 205}{7.31} = -4.79$$
t0.05(36) = 1.96 (since d.f. > 30 we use the normal tables)
When d.f. > 30 the t distribution is approximately the same as the normal distribution, so the table value at the 5% level of significance for 36 d.f. is taken as 1.96. Since the absolute value of the computed t (4.79) is more than the table value, we reject the null hypothesis. Thus, we conclude that there is a significant difference in the average sales between the two salesmen.
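When only summary figures are available, as in the salesmen example, the same pooled statistic can be computed directly from the means, standard deviations and sample sizes. A minimal sketch, with the illustrative name `pooled_t_from_summary`:

```python
import math


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled two-sample t statistic computed from summary statistics."""
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    se = sp * math.sqrt((n1 + n2) / (n1 * n2))
    return (mean1 - mean2) / se, df


# Salesman A (sample 1) versus salesman B (sample 2)
t, df = pooled_t_from_summary(170, 20, 20, 205, 25, 18)
reject = abs(t) > 1.96  # normal-table value used in the text because d.f. > 30
print(f"t = {t:.2f} with {df} d.f., reject H0: {reject}")
```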
Testing the hypothesis of equality of two variances

The test for equality of two population variances is based on the variances of two independently selected random samples drawn from two normal populations.
The statistic is
$$F = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$$
Under the null hypothesis H0 : $\sigma_1^2 = \sigma_2^2$ it follows that
$$F = \frac{S_1^2}{S_2^2}$$
which is the test statistic.
It follows the F distribution with V1 and V2 degrees of freedom. The larger sample variance is placed in the numerator and the smaller one in the denominator.
If the computed value of F exceeds the table value of F, we reject the null hypothesis, i.e. the alternative hypothesis is accepted.
Example
In one sample of 10 observations the sum of the squares of the deviations of the sample values from the sample mean was 120, and in another sample of 12 observations it was 314. Test whether the difference is significant at the 5% level of significance.
Solution
Given that n1 = 10, n2 = 12, Σ(x1 – X̄1)² = 120, Σ(x2 – X̄2)² = 314
Let us take the null hypothesis that the two samples are drawn from the same normal population of equal variance
H0 : $\sigma_1^2 = \sigma_2^2$
H1 : $\sigma_1^2 \neq \sigma_2^2$
Applying the F test, i.e.
$$F = \frac{S_1^2}{S_2^2} = \frac{\sum (X_1 - \bar{X}_1)^2 / (n_1 - 1)}{\sum (X_2 - \bar{X}_2)^2 / (n_2 - 1)} = \frac{120 / 9}{314 / 11} = \frac{13.33}{28.55}$$
Since the numerator should be greater than the denominator,
$$F = \frac{28.55}{13.33} = 2.1$$
With the larger variance in the numerator, the degrees of freedom are V1 = 11 and V2 = 9, for which the table value of F at the 5% level of significance is about 3.10. Since the calculated value of F is less than the table value, we accept the hypothesis. The samples may have been drawn from two populations having the same variance.
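The variance-ratio calculation, including the rule that the larger variance goes in the numerator, can be sketched as follows. The function name `variance_ratio` is illustrative; the code only computes the ratio and its degrees of freedom, and the table value must still be looked up in F tables.

```python
def variance_ratio(sum_sq1, n1, sum_sq2, n2):
    """F ratio with the larger sample variance in the numerator.

    Returns (F, numerator d.f., denominator d.f.).
    """
    var1 = sum_sq1 / (n1 - 1)  # sample variance = sum of squared deviations / (n - 1)
    var2 = sum_sq2 / (n2 - 1)
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1


# Example: sums of squared deviations 120 (n = 10) and 314 (n = 12)
f, v1, v2 = variance_ratio(120, 10, 314, 12)
print(f"F = {f:.2f} with V1 = {v1} and V2 = {v2} degrees of freedom")
```

Swapping the variances also swaps the degrees of freedom, which matters when reading the table value.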
Chi square hypothesis tests (Non-parametric test) (χ²)

They include amongst others
i. Test for goodness of fit
ii. Test for independence of attributes
iii. Test of homogeneity
iv. Test for population variance
The Chi square test (χ²) is used when comparing an actual (observed) distribution with a hypothesized, or expected, distribution.
It is given by
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where O = Observed frequency
E = Expected frequency
The computed value of χ2 is compared with that of tabulated χ2 for a given significance level and
degrees of freedom.
i. Test for goodness of fit

This test is used when we want to determine whether an actual sample distribution matches a known theoretical distribution.
The null hypothesis usually states that the sample is drawn from the theoretical population distribution, and the alternative hypothesis usually states that it is not.
Example
Mr. Nguku carried out a survey of 320 families in Ateka district. Each family had 5 children, and the survey revealed the following distribution.
No. of boys        5    4    3    2    1    0
No. of girls       0    1    2    3    4    5
No. of families   14   56  110   88   40   12
Is the result consistent with the hypothesis that male and female births are equally probable, at the 5% level of significance?
Solution
If male and female births are equally probable, then the number of boys in a family conforms to a binomial distribution with probability p = ½.
Therefore
H0: The observed number of boys conforms to a binomial distribution with p = ½
H1: The observations do not conform to a binomial distribution with p = ½.
On the assumption that male and female births are equally probable, the probability of a male birth is p = ½. The expected number of families can be calculated by use of the binomial distribution. The probability of x male births in a family of 5 is given by
$$P(x) = {}^{5}C_{x} \, p^{x} q^{5 - x} \quad (x = 0, 1, 2, 3, 4, 5)$$
$$= {}^{5}C_{x} \left(\tfrac{1}{2}\right)^{5} \quad (\text{since } p = q = \tfrac{1}{2})$$
To get the expected frequencies, multiply P(x) by the total number N = 320. The calculations are shown in the table below.
x    P(x)                     Expected frequency = N·P(x)

0    5C0 (½)^5 = 1/32         320 × 1/32 = 10
1    5C1 (½)^5 = 5/32         320 × 5/32 = 50
2    5C2 (½)^5 = 10/32        320 × 10/32 = 100
3    5C3 (½)^5 = 10/32        320 × 10/32 = 100
4    5C4 (½)^5 = 5/32         320 × 5/32 = 50
5    5C5 (½)^5 = 1/32         320 × 1/32 = 10
Arranging the observed and expected frequencies in the following table and calculating χ²:
O      E      (O – E)²    (O – E)²/E
14     10      16          1.60
56     50      36          0.72
110    100     100         1.00
88     100     144         1.44
40     50      100         2.00
12     10      4           0.40
                  Σ(O – E)²/E = 7.16
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 7.16$$
The table value of χ² for v = 6 – 1 = 5 at the 5% level of significance is 11.07. The computed value of χ² = 7.16 is less than the table value. Therefore the hypothesis is accepted. Thus the data are consistent with male and female births being equally probable.
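The expected frequencies and the χ² statistic for this example can be reproduced with a short script. This is a sketch under the example's own assumptions (five children per family, p = ½, N = 320); `binomial_expected` and `chi_square` are illustrative names.

```python
from math import comb


def binomial_expected(n_trials, p, total):
    """Expected frequencies N * P(x) for x = 0 .. n_trials under a binomial model."""
    return [
        total * comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
        for x in range(n_trials + 1)
    ]


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed numbers of families with 0, 1, 2, 3, 4 and 5 boys
observed = [12, 40, 88, 110, 56, 14]
expected = binomial_expected(5, 0.5, 320)
chi2 = chi_square(observed, expected)
reject = chi2 > 11.07  # table value for 5 d.f. at the 5% level, from the text
print(f"expected = {expected}, chi-square = {chi2:.2f}, reject H0: {reject}")
```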
ii) Test of independence of attributes

This test discloses whether or not there is any association or relationship between two or more attributes. The following steps are required to perform the test of hypothesis.
1. The null and alternative hypotheses are set as follows
H0: No association exists between the attributes
H1: An association exists between the attributes
2. Under H0 an expected frequency E corresponding to each cell in the contingency table is found by using the formula
$$E = \frac{R \times C}{n}$$
Where R = a row total, C = a column total and n = sample size
3. Based upon the observed values and the corresponding expected frequencies, the χ² statistic is obtained using the formula
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
4. The characteristics of this distribution are defined by the number of degrees of freedom (d.f.), which is given by
d.f. = (r-1) (c-1),
where r is the number of rows and c is the number of columns. Corresponding to a chosen level of significance, the critical value is found from the chi squared table.
5. The calculated value of χ² is compared with the tabulated value of χ² for (r-1) (c-1) degrees of freedom at the chosen level of significance. If the computed value of χ² is greater than the tabulated value, the null hypothesis of independence is rejected. Otherwise we accept it.
Example
In a sample of 200 people suffering from a particular disease, 100 were given a drug and the others were not given any drug. The results are as follows
Drug No drug Total
Cured 65 55 120
Not cured 35 45 80
Total 100 100 200
Test whether the drug will be effective or not, at 5% level of significance.
Solution
Let us take the null hypothesis that the drug is not effective in curing the disease.
Applying the χ² test
The expected cell frequencies are computed as follows
$$E_{11} = \frac{R_1 C_1}{n} = \frac{120 \times 100}{200} = 60 \qquad E_{12} = \frac{R_1 C_2}{n} = \frac{120 \times 100}{200} = 60$$
$$E_{21} = \frac{R_2 C_1}{n} = \frac{80 \times 100}{200} = 40 \qquad E_{22} = \frac{R_2 C_2}{n} = \frac{80 \times 100}{200} = 40$$
The table of expected frequencies is as follows

 60     60    120
 40     40     80
100    100    200
Arranging the observed frequencies with their corresponding expected frequencies in the following table we get
O      E      (O – E)²    (O – E)²/E
65     60      25          0.417
55     60      25          0.417
35     40      25          0.625
45     40      25          0.625
                  Σ(O – E)²/E = 2.084
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 2.084$$
v = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1; tabulated χ² (0.05) = 3.841
The calculated value of χ² is less than the table value, so the null hypothesis is accepted. Hence the data do not show the drug to be effective in curing the disease.
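The five steps of the test of independence can be sketched for a contingency table of any size. The function name `chi_square_independence` is illustrative; the table is passed as a list of rows of observed frequencies, without the totals.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E = R * C / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Drug example: rows are cured / not cured, columns are drug / no drug
chi2, df = chi_square_independence([[65, 55], [35, 45]])
reject = chi2 > 3.841  # table value for 1 d.f. at the 5% level, from the text
print(f"chi-square = {chi2:.3f} with {df} d.f., reject H0: {reject}")
```

Because the test of homogeneity uses the same analytical procedure, the same function could be applied to a larger table such as a 3 × 3 one.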
Test of homogeneity
It is concerned with the proposition that several populations are homogeneous with respect to some characteristic of interest, e.g. one may be interested in knowing whether the raw materials available from several retailers are homogeneous. A random sample is drawn from each of the populations and the number in each sample falling into each category is determined. The sample data are displayed in a contingency table.
The analytical procedure is the same as that discussed for the test of independence.
Example
A random sample of 400 persons was selected from three age groups, and each person was asked to specify which type of TV program he or she preferred. The results are shown in the following table.
Type of program
Age group A B C Total
Under 30 120 30 50 200
30 – 44 10 75 15 100
45 and above 10 30 60 100
Total 140 135 125 400
Test the hypothesis that the populations are homogenous with respect to the types of television
program they prefer, at 5% level of significance.
Solution
Let us take hypothesis that the populations are homogenous with respect to different types of
television programs they prefer
Applying χ2 test
The expected frequencies are computed as E = RC/n, e.g. for the Under 30 age group and program A, E = (200 × 140)/400 = 70.
O      E        (O – E)²     (O – E)²/E
120    70.00    2500.00      35.7143
10     35.00    625.00       17.8571
10     35.00    625.00       17.8571
30     67.50    1406.25      20.8333
75     33.75    1701.56      50.4166
30     33.75    14.06        0.4166
50     62.50    156.25       2.5000
15     31.25    264.06       8.4499
60     31.25    826.56       26.4499
                   Σ(O – E)²/E = 180.4948
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 180.49$$
The table value of χ² for (3 – 1)(3 – 1) = 4 d.f. at the 5% level of significance is 9.488.

The calculated value of χ² is greater than the table value. We reject the hypothesis and conclude that the populations are not homogeneous with respect to the type of TV programs preferred; thus the different age groups vary in their choice of TV programs.
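SUMMARY OF FORMULAE IN HYPOTHESIS TESTING
(a) Hypothesis testing of mean
For n > 30
$$Z = \frac{\bar{X} - \mu}{S_{\bar{X}}} \quad \text{where} \quad S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
at α level of significance.
For n < 30
$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}} \quad \text{where} \quad S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
at n – 1 d.f. and α level of significance.
(b) Difference between means (Independent samples)

For n > 30
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{where} \quad S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
at α level of significance.
For n < 30
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{at } n_1 + n_2 - 2 \text{ d.f.}$$
$$\text{where} \quad S_{(\bar{X}_1 - \bar{X}_2)} = S_p \sqrt{\frac{n_1 + n_2}{n_1 n_2}} \quad \text{and} \quad S_p = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}$$
(c) Hypothesis testing of proportions

$$Z = \frac{p - \pi}{S_p} \quad \text{where} \quad S_p = \sqrt{\frac{pq}{n}}$$
p = Proportion found in sample
q = 1 – p
π = hypothetical proportion
(d) Difference between proportions
$$Z = \frac{P_1 - P_2}{S_{(P_1 - P_2)}}$$
Where:
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$$
$$p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$$
q = 1 – p
(e) Chi-square test
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where O = observed frequency
$$E = \frac{\text{Column total} \times \text{Row total}}{\text{Sample size}} = \text{expected frequency}$$
(f) F – test (variance test)
$$F = \frac{S_1^2}{S_2^2}$$
Here the larger of the two sample variances forms the numerator.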
