Sampling Theory: Determining The Distribution of Sample Statistics
The population is unobserved (unless all observations
in the population have been observed).
A histogram computed from the observations
x1, x2, x3, … , xn
gives an estimate of the population distribution.
A statistic computed from the observations
x1, x2, x3, … , xn
is also a random variable prior to observation of the
sample.
A statistic is also a numerical quantity whose value is
determined by the outcome of a random experiment
(the choosing of a random sample from the
population).
The probability distribution of a statistic computed
from the observations
x1, x2, x3, … , xn
is called its sampling distribution.
This distribution describes the random behaviour of
the statistic.
It is important to determine the sampling distribution
of a statistic.
It describes the sampling behaviour of the statistic.
The sampling distribution will be used to assess the
accuracy of the statistic when used for the purpose of
estimation.
Sampling theory is the area of Mathematical Statistics
that is concerned with determining the sampling
distribution of various statistics.
Many statistics have a Normal distribution.
This is often true if the population is Normal.
It is also sometimes true if the sample size is
reasonably large (the reason is the Central Limit Theorem,
to be mentioned later).
Combining Random Variables
Important question
What is the distribution of the new random variable?
Example 1: Suppose that one performs two
independent tasks (A and B):
X = time to perform task A (Normal with mean 25
minutes and standard deviation 3 minutes).
Y = time to perform task B (Normal with mean 15
minutes and standard deviation 2 minutes).
Let T = X + Y = total time to perform the two tasks.
[Figure: Normal densities of the test scores. Z (Social Studies): \(\mu = 70\), \(\sigma = 7\). Y (English Literature): \(\mu = 60\), \(\sigma = 10\).]
Suppose that after the tests have been written an overall
score, S, will be computed as follows:
S = 0.50 X + 0.30 Y + 0.20 Z + 10
Determine the distribution of S.
S has a Normal distribution with mean
\[\mu_S = 0.50\,\mu_X + 0.30\,\mu_Y + 0.20\,\mu_Z + 10\]
\[= 0.50(90) + 0.30(60) + 0.20(70) + 10\]
\[= 45 + 18 + 14 + 10 = 87\]
[Figure: distribution of S = 0.50 X + 0.30 Y + 0.20 Z + 10]
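If L = aX + bY + … + c, where X, Y, … are independent Normal random variables,
then L is Normal with mean
\[\mu_L = a\mu_X + b\mu_Y + \cdots + c\]
and standard deviation
\[\sigma_L = \sqrt{a^2\sigma_X^2 + b^2\sigma_Y^2 + \cdots}\]
In particular:
X + Y is Normal with mean \(\mu_X + \mu_Y\) and
standard deviation \(\sqrt{\sigma_X^2 + \sigma_Y^2}\).
X – Y is Normal with mean \(\mu_X - \mu_Y\) and
standard deviation \(\sqrt{\sigma_X^2 + \sigma_Y^2}\).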
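The rule for linear combinations can be checked numerically on Example 1 (the two tasks). The following is a minimal sketch, assuming X and Y are independent; the helper name `linear_combination` and the 45-minute question are illustrative additions, not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist


def linear_combination(terms, constant=0.0):
    """Return (mean, sd) of L = sum(coef * V) + constant.

    `terms` is a list of (coefficient, mean, sd) triples, one per
    independent Normal variable V.
    """
    mean = sum(a * mu for a, mu, _ in terms) + constant
    sd = sqrt(sum((a * sigma) ** 2 for a, _, sigma in terms))
    return mean, sd


# Example 1: X ~ Normal(25, 3), Y ~ Normal(15, 2), T = X + Y.
mean_T, sd_T = linear_combination([(1, 25, 3), (1, 15, 2)])
T = NormalDist(mean_T, sd_T)

# Illustrative question (an assumption): chance that both tasks
# together take at most 45 minutes.
p_under_45 = T.cdf(45)

print(mean_T, round(sd_T, 4), round(p_under_45, 4))
```

Note that the standard deviations do not add: the variances 9 and 4 add, so the standard deviation of T is the square root of 13 rather than 5.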
The distribution of the sample mean
The distribution of averages (the mean)
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n\]
What is the distribution of \(\bar{x}\)?
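The distribution of averages (the mean)
Because the mean is a “linear combination”
\[\bar{x} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n\]
its mean is
\[\mu_{\bar{x}} = \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu = n\left(\frac{1}{n}\right)\mu = \mu\]
and its variance is
\[\sigma_{\bar{x}}^2 = \left(\frac{1}{n}\right)^2\sigma_{x_1}^2 + \left(\frac{1}{n}\right)^2\sigma_{x_2}^2 + \cdots + \left(\frac{1}{n}\right)^2\sigma_{x_n}^2\]
\[= \left(\frac{1}{n}\right)^2\sigma^2 + \left(\frac{1}{n}\right)^2\sigma^2 + \cdots + \left(\frac{1}{n}\right)^2\sigma^2 = n\left(\frac{1}{n^2}\right)\sigma^2 = \frac{\sigma^2}{n}\]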
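Thus if x1, x2, … , xn denote n independent random
variables each coming from the same Normal
distribution with mean \(\mu\) and standard deviation \(\sigma\),
then
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n\]
has a Normal distribution with
mean \(\mu_{\bar{x}} = \mu\),
variance \(\sigma_{\bar{x}}^2 = \sigma^2/n\), and
standard deviation \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\).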
Graphs
[Figure: the probability distribution of the mean (standard deviation \(\sigma/\sqrt{n}\)) compared with the probability distribution of individual observations]
Summary
• The distribution of the sample mean \(\bar{x}\) is Normal.
• The distribution of the sample mean has exactly the
same mean as the population (\(\mu\)).
• The distribution of the sample mean has a smaller
standard deviation than the population:
\(\sigma/\sqrt{n}\) compared to \(\sigma\).
• Averaging tends to decrease variability.
• An Excel file illustrating the distribution of the
sample mean \(\bar{x}\)
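The summary points can be illustrated by simulation. This is a minimal sketch, not the Excel illustration mentioned above: it draws many samples from a Normal population and compares the spread of the sample means with \(\sigma/\sqrt{n}\). The population values, the number of repetitions and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)            # fixed seed so the run is repeatable

mu, sigma = 220, 17       # illustrative population parameters (assumed)
n = 10                    # sample size
reps = 20000              # number of simulated samples (assumed)

# Draw `reps` samples of size n and record each sample mean.
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(reps)
]

observed_mean = mean(sample_means)   # should be close to mu
observed_sd = stdev(sample_means)    # should be close to sigma / sqrt(n)
theoretical_sd = sigma / sqrt(n)

print(round(observed_mean, 2), round(observed_sd, 3), round(theoretical_sd, 3))
```

Increasing `n` in the sketch shrinks the spread of the simulated means, which is the sense in which averaging decreases variability.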
Example
• Suppose we are measuring the cholesterol level of
men aged 60-65.
• This measurement has a Normal distribution with
mean \(\mu = 220\) and standard deviation \(\sigma = 17\).
• A sample of n = 10 males aged 60-65 is selected and
the cholesterol level is measured for those 10 males.
• x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 are those 10
measurements.
Find the probability distribution of \(\bar{x}\).
Compute the probability that \(\bar{x}\) is between 215 and 225.
Solution
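One way to carry out the calculation is sketched below, assuming (as established earlier for Normal populations) that \(\bar{x}\) is Normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 220, 17, 10

# Sampling distribution of the sample mean: Normal(mu, sigma / sqrt(n)).
sd_xbar = sigma / sqrt(n)
xbar = NormalDist(mu, sd_xbar)

# P(215 <= xbar <= 225)
p_between = xbar.cdf(225) - xbar.cdf(215)

print(round(sd_xbar, 3), round(p_between, 4))
```

The standard deviation of \(\bar{x}\) is about 5.38, so the interval 215 to 225 reaches a little less than one standard deviation on either side of the mean.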
[Figure: distribution of \(\bar{x}\) for n = 10 and for n = 30]
Implications of the Central Limit Theorem
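Find:
1) \(P(45 \le \bar{x} \le 60)\)
2) \(P(\bar{x} \le 47.5)\)
The sampling distribution of \(\bar{x}\) has:
1) \(\mu_{\bar{x}} = \mu = 50\)
2) \(\sigma_{\bar{x}} = \sigma/\sqrt{n} = 15/\sqrt{9} = 15/3 = 5\)
Example
1) \(P(45 \le \bar{x} \le 60) = P(-1.00 \le z \le 2.00)\)
2) \(P(\bar{x} \le 47.5) = P(z \le -0.50) = 0.3085\)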
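Both probabilities can be evaluated directly from the Normal distribution of \(\bar{x}\). A minimal sketch, assuming the Normal approximation given by the Central Limit Theorem and the values \(\mu = 50\), \(\sigma = 15\), n = 9 used in this example:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 15, 9
xbar = NormalDist(mu, sigma / sqrt(n))   # Normal(50, 5)

p1 = xbar.cdf(60) - xbar.cdf(45)   # P(45 <= xbar <= 60) = P(-1 <= z <= 2)
p2 = xbar.cdf(47.5)                # P(xbar <= 47.5)     = P(z <= -0.5)

print(round(p1, 4), round(p2, 4))
```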
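[Figure: distribution of \(\bar{x}\) centred at 109; \(\bar{x} = 105\) corresponds to \(z = -1.41\), with tail area 0.0793]
• Consider how far out in the tail of the distribution of the sample
mean $120 is:
\[z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}; \quad P(\bar{x} \ge 120) = P\left(z \ge \frac{120 - 109}{2.83}\right)\]
\[= P(z \ge 3.89)\]
\[= 1.0000 - 0.9999 = 0.0001\]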
[Figure: sampling distribution of \(\hat{p}\), with mean \(\mu_{\hat{p}} = p\) and standard deviation \(\sigma_{\hat{p}} = \sqrt{p(1-p)/n}\)]
Example Sample Proportion Favoring a
Candidate
Suppose 20% of all voters favor Candidate A.
Pollsters take a sample of n = 600 voters. Then
the sample proportion who favor A will have
approximately a Normal distribution with
mean \(\mu_{\hat{p}} = p = 0.20\) and standard deviation
\[\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.20(0.80)}{600}} = 0.01633\]
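The standard deviation above, and a probability based on it, can be sketched as follows. The interval 0.18 to 0.22 is an illustrative assumption (the slides do not state which probability was asked for), and the Normal approximation is taken as given.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.20, 600

# Approximate sampling distribution of the sample proportion.
sd_phat = sqrt(p * (1 - p) / n)
phat = NormalDist(p, sd_phat)

# Illustrative interval (an assumption): sample proportion within
# 2 percentage points of the true proportion.
p_within = phat.cdf(0.22) - phat.cdf(0.18)

print(round(sd_phat, 5), round(p_within, 4))
```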
[Figure: sampling distribution of \(\hat{p}\)]
Using the sampling distribution:
Suppose 20% of all voters favor Candidate A. Pollsters
take a sample of n = 600 voters.
\[\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.20(0.80)}{600}} = 0.01633\]
Comparing Means
Situation
• We have two normal populations (1 and 2)
• Let \(\mu_1\) and \(\sigma_1\) denote the mean and standard deviation of
population 1.
• Let \(\mu_2\) and \(\sigma_2\) denote the mean and standard deviation of
population 2.
• Let x1, x2, x3 , … , xn denote a sample from a normal
population 1.
• Let y1, y2, y3 , … , ym denote a sample from a normal
population 2.
• Objective is to compare the two population means
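We know that:
\(\bar{x}\) is Normal with mean \(\mu_{\bar{x}} = \mu_1\) and standard deviation \(\sigma_{\bar{x}} = \sigma_1/\sqrt{n}\)
and
\(\bar{y}\) is Normal with mean \(\mu_{\bar{y}} = \mu_2\) and standard deviation \(\sigma_{\bar{y}} = \sigma_2/\sqrt{m}\).
Thus, the two samples being independent, \(\bar{x} - \bar{y}\) is Normal with mean \(\mu_1 - \mu_2\) and standard deviation
\[\sigma_{\bar{x}-\bar{y}} = \sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2} = \sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}}\]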
Example
Consider measuring heart rate two minutes after a
twenty-minute exercise program.
Situation
• Suppose we observe the heart rate of n = 15 subjects on
program A.
• Let x1, x2, x3 , … , x15 denote these observations.
• We also observe the heart rate of m = 20 subjects on
program B.
• Let y1, y2, y3 , … , y20 denote these observations.
• What is the probability that the sample mean heart rate for
Program A is at least 8 units higher than the sample mean
heart rate for Program B?
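We know that:
\(\bar{x}\) is Normal with mean \(\mu_{\bar{x}} = 110\) and standard deviation \(\sigma_{\bar{x}} = 7.3/\sqrt{15}\)
and
\(\bar{y}\) is Normal with mean \(\mu_{\bar{y}} = 95\) and standard deviation \(\sigma_{\bar{y}} = 4.5/\sqrt{20}\).
Thus \(D = \bar{x} - \bar{y}\) is Normal with mean \(110 - 95 = 15\) and standard deviation
\[\sigma_{\bar{x}-\bar{y}} = \sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2} = \sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}} = \sqrt{\frac{7.3^2}{15} + \frac{4.5^2}{20}} = 2.1366\]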
[Figure: distributions of the sample mean for program A and for program B]
• What is the probability that the sample mean heart rate for
Program A is at least 8 units higher than the sample mean
heart rate for Program B?
Solution
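We want
\[P[\bar{x} \ge \bar{y} + 8] = P[\bar{x} - \bar{y} \ge 8] = P[D \ge 8]\]
\[= P\left[\frac{D - 15}{2.1366} \ge \frac{8 - 15}{2.1366}\right] = P[z \ge -3.28]\]
\[= 1 - 0.0005 = 0.9995\]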
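The same calculation can be sketched in code, using the population values that appear in this example (means 110 and 95, standard deviations 7.3 and 4.5) and assuming the two samples are independent. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_a, sigma_a, n = 110, 7.3, 15   # program A
mu_b, sigma_b, m = 95, 4.5, 20    # program B

# D = xbar - ybar is Normal with these parameters.
mean_d = mu_a - mu_b
sd_d = sqrt(sigma_a ** 2 / n + sigma_b ** 2 / m)
D = NormalDist(mean_d, sd_d)

# P(D >= 8): program A's sample mean at least 8 units above program B's.
p_at_least_8 = 1 - D.cdf(8)

print(mean_d, round(sd_d, 4), round(p_at_least_8, 4))
```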
Sampling distribution of a difference in two sample proportions
Comparing Proportions
Situation
• Suppose we have two Success-Failure experiments
• Let p1 = the probability of success for experiment 1.
• Let p2 = the probability of success for experiment 2.
• Suppose that experiment 1 is repeated n1 times and
experiment 2 is repeated n2 times.
• Let x1 = the number of successes in the n1 repetitions of
experiment 1, and x2 = the number of successes in the n2 repetitions
of experiment 2.
\[\hat{p}_1 = \frac{x_1}{n_1} \quad\text{and}\quad \hat{p}_2 = \frac{x_2}{n_2}\]
What is the distribution of \(D = \hat{p}_1 - \hat{p}_2 = \dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}\)?
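We know that:
\(\hat{p}_1 = \dfrac{x_1}{n_1}\) is approximately Normal with mean \(\mu_{\hat{p}_1} = p_1\)
and standard deviation \(\sigma_{\hat{p}_1} = \sqrt{\dfrac{p_1(1-p_1)}{n_1}}\).
Also \(\hat{p}_2 = \dfrac{x_2}{n_2}\) is approximately Normal with mean \(\mu_{\hat{p}_2} = p_2\)
and standard deviation \(\sigma_{\hat{p}_2} = \sqrt{\dfrac{p_2(1-p_2)}{n_2}}\).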
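Thus \(D = \hat{p}_1 - \hat{p}_2\) is Normal with mean
\[\mu_{\hat{p}_1-\hat{p}_2} = \mu_{\hat{p}_1} - \mu_{\hat{p}_2} = p_1 - p_2\]
and standard deviation
\[\sigma_{\hat{p}_1-\hat{p}_2} = \sqrt{\sigma_{\hat{p}_1}^2 + \sigma_{\hat{p}_2}^2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]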
Example
The Globe and Mail carried out a survey to investigate
the “State of the Baby Boomers”. (June 2006)
What is \(P[\hat{p}_1 - \hat{p}_2 \ge 0.15]\)?
Solution:
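\(\hat{p}_1 = \dfrac{x_1}{n_1}\) is Normal with mean \(\mu_{\hat{p}_1} = p_1 = 0.40\)
and standard deviation
\[\sigma_{\hat{p}_1} = \sqrt{\frac{p_1(1-p_1)}{n_1}} = \sqrt{\frac{0.40(1-0.40)}{664}} = 0.019012\]
Also \(\hat{p}_2 = \dfrac{x_2}{n_2}\) is Normal with mean \(\mu_{\hat{p}_2} = p_2 = 0.20\)
and standard deviation
\[\sigma_{\hat{p}_2} = \sqrt{\frac{p_2(1-p_2)}{n_2}} = \sqrt{\frac{0.20(1-0.20)}{342}} = 0.02163\]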
[Figure: distributions of the sample proportion for Gen X and for Baby Boomers]
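Now \(D = \hat{p}_1 - \hat{p}_2\) is Normal with mean \(\mu_D = p_1 - p_2 = 0.40 - 0.20 = 0.20\)
and standard deviation
\[\sigma_D = \sqrt{\sigma_{\hat{p}_1}^2 + \sigma_{\hat{p}_2}^2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]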
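The requested probability can be sketched numerically from the values in this example (\(p_1 = 0.40\), \(n_1 = 664\), \(p_2 = 0.20\), \(n_2 = 342\)). This is a minimal sketch assuming independent samples and the Normal approximation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.40, 664
p2, n2 = 0.20, 342

# D = phat1 - phat2 is approximately Normal with these parameters.
mean_d = p1 - p2
sd_d = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
D = NormalDist(mean_d, sd_d)

# P(phat1 - phat2 >= 0.15)
p_at_least_015 = 1 - D.cdf(0.15)

print(round(mean_d, 2), round(sd_d, 4), round(p_at_least_015, 4))
```

Because 0.15 lies below the mean difference of 0.20, the probability is well above one half.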
Summary
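Distribution of a difference in two sample means
\[\sigma_{\bar{x}-\bar{y}} = \sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2} = \sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}}\]
Distribution of a difference in two sample proportions
\[\mu_{\hat{p}_1-\hat{p}_2} = \mu_{\hat{p}_1} - \mu_{\hat{p}_2} = p_1 - p_2\]
\[\sigma_{\hat{p}_1-\hat{p}_2} = \sqrt{\sigma_{\hat{p}_1}^2 + \sigma_{\hat{p}_2}^2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]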
The Chi-square (\(\chi^2\)) distribution
The Chi-squared distribution with \(\nu\) degrees of freedom
[Figure: density of the Chi-squared distribution with \(\nu\) degrees of freedom]
[Figure: Chi-squared densities with 2, 3 and 4 degrees of freedom]
Statistics that have the Chi-squared
distribution:
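1. \[\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{j=1}^{c}\sum_{i=1}^{r} r_{ij}^2\]
2. \[U = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}\]
where
\[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\]
is the sample standard deviation.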
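Find \(P[10 \le s \le 20]\).
Note
\[U = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2} = \frac{9s^2}{15^2}\]
has a chi-square distribution with d.f. = n – 1 = 9.
\[P[10 \le s \le 20] = P\left[\frac{9(100)}{15^2} \le \frac{9s^2}{15^2} \le \frac{9(400)}{15^2}\right] = P[4 \le U \le 16]\]
With CHIDIST(x, 9) = \(P[x \le U]\),
\(P[4 \le U \le 16]\) = CHIDIST(4,9) - CHIDIST(16,9)
= 0.91141 - 0.06688 = 0.84453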
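The spreadsheet result can be cross-checked without a chi-square table by integrating the chi-square density numerically. The sketch below uses composite Simpson's rule, which is a swapped-in technique chosen to keep the example dependency-free, not the method used in the slides; the function names `chi2_pdf` and `simpson` are illustrative.

```python
from math import exp, gamma


def chi2_pdf(x, df):
    """Density of the chi-square distribution with `df` degrees of freedom."""
    k = df / 2
    return x ** (k - 1) * exp(-x / 2) / (2 ** k * gamma(k))


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


n, sigma = 10, 15

# Convert the bounds on s into bounds on U = (n - 1) s^2 / sigma^2.
lower = (n - 1) * 10 ** 2 / sigma ** 2
upper = (n - 1) * 20 ** 2 / sigma ** 2

# P(lower <= U <= upper) with n - 1 degrees of freedom.
prob = simpson(lambda u: chi2_pdf(u, n - 1), lower, upper)

print(lower, upper, round(prob, 5))
```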
Statistical Inference