Lecture 06 Estimation of Differences of Means

The document discusses methods for comparing population means and proportions between two groups. It covers estimating the difference between two population means when variances are equal or unequal, as well as estimating the difference between two population proportions. Examples are provided to illustrate calculating confidence intervals for comparing population means.
CCN2311

FOUNDATIONS OF DATA SCIENCE


LECTURE 6
Estimation of Differences of Means
Estimation of Differences of Proportions
Estimation of Ratios of Variances
Topics
Estimation of Differences
• Differences of Means
• Ratios of Variances
• Differences of Proportions

CCN2311 Foundations of Data Science Page 2


Motivations
In many situations, we need to compare a parameter (such as the mean or a
proportion) of two different populations.
Examples
• Comparison of average ages of male and female
populations
• Comparison of proportions of smokers in Hong Kong
and China



Types of Comparisons
• To compare parameters of two populations there are
two approaches
– Differences of the parameters in two different populations
– Ratios of the parameters in two different populations
• The choice of differences vs ratios depends on the
sampling distribution of the statistics used



Comparing Population Means
Similar to the one-sample problem in Lecture 4, there are a number of
cases to consider. The following are two of the important situations.

Case 1: Normal populations with unknown but equal variances
Case 2: Normal populations with unknown and unequal variances

We will also see how to deal with non-normal populations at the end of
the section.



Comparing Population Means
Case 1 – Assumptions
Assumptions:
• $X_1, X_2, \ldots, X_n$ are iid $N(\mu_x, \sigma_x^2)$ [Sample from the 1st population]
• $Y_1, Y_2, \ldots, Y_m$ are iid $N(\mu_y, \sigma_y^2)$ [Sample from the 2nd population]
• No requirements on $m$ and $n$, i.e. large samples are not needed
• $X_i$, $i = 1, \ldots, n$ and $Y_j$, $j = 1, \ldots, m$ are all independent of each other
• $\mu_x$ and $\mu_y$ are unknown. $\sigma_x^2$ and $\sigma_y^2$ are unknown but equal, that is, $\sigma_x^2 = \sigma_y^2 = \sigma^2$



Comparing Population Means
Case 1 – Sampling Distribution
• If $\bar{X}$ and $\bar{Y}$ are the sample means of the two samples, then
$$\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y,\ \frac{\sigma^2}{n} + \frac{\sigma^2}{m}\right)$$
• $S_x^2$ and $S_y^2$ are the sample variances of the two samples
$$S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad S_y^2 = \frac{\sum_{i=1}^{m} (Y_i - \bar{Y})^2}{m - 1}$$
• $S_p^2 = \dfrac{(n-1) S_x^2 + (m-1) S_y^2}{n + m - 2}$ is the pooled sample variance for estimating the common variance $\sigma^2$
– $E(S_p^2) = \sigma^2$
– $\dfrac{(n+m-2) S_p^2}{\sigma^2} \sim \chi^2(n + m - 2)$



Comparing Population Means
Case 1 – Sampling Distribution
• $T = \dfrac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$ has a t-distribution with $(n + m - 2)$ degrees of freedom
• The confidence interval of $\mu_x - \mu_y$ is derived from the distribution of $T$



Comparing Population Means
Case 1 – Estimation
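• Point estimate of $\mu_x - \mu_y$ = $\bar{x} - \bar{y}$, where $\bar{x}$ and $\bar{y}$ are the observed sample means of the two samples
• Standard error = $s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$, where $s_p$ is the pooled sample standard deviation (the square root of the pooled sample variance $s_p^2$)

$(1 - \alpha)100\%$ confidence interval of $\mu_x - \mu_y$

• $\bar{x} - \bar{y} \pm t_{\alpha/2;(n+m-2)}\, s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$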



Interpretation of CI of the Difference of
Population Means
Interpretation of the CI
• If “0” is within the confidence interval, zero is a plausible value for the
difference of the two population means (i.e. the data are consistent with
equal means)
– We will say “We do not have enough evidence to conclude that the two
population means are not equal”
• If “0” is outside the confidence interval, zero is not a plausible value for
the difference at the chosen confidence level (i.e. the means appear to be
unequal)
– We will say “We have enough evidence to conclude that the two population
means are not equal”
• Note: We will revisit this when we talk about two-sample
hypothesis tests in upcoming lectures.
Example 1 – Comparison of Population
Means (Case 1)
Suppose the weights (in kg) of a sample of 5 boys and a sample of 6
girls are listed below.
Boys: 50, 52, 55, 60, 48
Girls: 46, 48, 50, 52, 54, 50
It is known that weights of boys and weights of girls are both
normally distributed with equal variances.

1. Give a point estimate for the difference of the population mean
weights of boys and girls, and calculate its standard error.
2. Find a 95% confidence interval for the difference of the
population mean weights of boys and girls.



Example 1 – Comparison of Population
Means (Case 1)
Solution:
For boys, $n = 5$, $\bar{x} = 53$, $s_x^2 = 22$
For girls, $m = 6$, $\bar{y} = 50$, $s_y^2 = 8$
Pooled variance: $s_p^2 = \dfrac{(n-1) s_x^2 + (m-1) s_y^2}{n + m - 2} = \dfrac{(5-1)(22) + (6-1)(8)}{5 + 6 - 2} = 14.2222$

Point estimate of $\mu_x - \mu_y$ = $53 - 50 = 3$

Standard error = $s_p \sqrt{\frac{1}{n} + \frac{1}{m}} = \sqrt{14.2222} \times \sqrt{\frac{1}{5} + \frac{1}{6}} = 2.2836$



Example 1 – Comparison of Population
Means (Case 1)
Solution:
For a 95% confidence interval, $t_{\alpha/2;(n+m-2)} = t_{0.025;(9)} = 2.262$
The 95% confidence interval is given by
$$\bar{x} - \bar{y} \pm t_{\alpha/2;(n+m-2)}\, s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$$
$$53 - 50 \pm 2.262 \sqrt{14.2222} \sqrt{\frac{1}{5} + \frac{1}{6}}$$
i.e. $(-2.1655,\ 8.1655)$
Therefore, we are 95% confident that the difference of the population mean
weights of boys and girls is between −2.1655 kg and 8.1655 kg. Since 0 is
within the confidence interval, we do not have enough evidence to conclude
that the two population means are unequal.
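
The arithmetic of Example 1 can be checked with a short Python sketch. This is an illustration only: the function name `pooled_ci` is hypothetical, only the standard library is used, and the critical value 2.262 is passed in from the t table as quoted on the slide rather than computed.

```python
import math
import statistics


def pooled_ci(x, y, t_crit):
    """Case 1 interval for mu_x - mu_y (normal populations, equal variances).

    t_crit is the table value t_{alpha/2;(n+m-2)}, supplied by the caller.
    """
    n, m = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n - 1) * statistics.variance(x)
           + (m - 1) * statistics.variance(y)) / (n + m - 2)
    diff = statistics.mean(x) - statistics.mean(y)
    se = math.sqrt(sp2) * math.sqrt(1 / n + 1 / m)
    return diff, sp2, se, (diff - t_crit * se, diff + t_crit * se)


boys = [50, 52, 55, 60, 48]
girls = [46, 48, 50, 52, 54, 50]
# 2.262 is t_{0.025;(9)} as quoted in the solution above.
diff, sp2, se, (lower, upper) = pooled_ci(boys, girls, t_crit=2.262)
print(f"estimate={diff}, sp2={sp2:.4f}, se={se:.4f}, CI=({lower:.4f}, {upper:.4f})")
```

The printed values should agree with the slide to four decimal places.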



Comparing Population Means
Case 2 – Assumptions
Assumptions:
• $X_1, X_2, \ldots, X_n$ are iid $N(\mu_x, \sigma_x^2)$ [Sample from the 1st population]
• $Y_1, Y_2, \ldots, Y_m$ are iid $N(\mu_y, \sigma_y^2)$ [Sample from the 2nd population]
• No requirements on $m$ and $n$, i.e. large samples are not needed
• $X_i$, $i = 1, \ldots, n$ and $Y_j$, $j = 1, \ldots, m$ are all independent of each other
• $\mu_x$ and $\mu_y$ are unknown. $\sigma_x^2$ and $\sigma_y^2$ are unknown but NOT equal, that is, $\sigma_x^2 \neq \sigma_y^2$
• If we are uncertain whether $\sigma_x^2$ and $\sigma_y^2$ are equal, it is preferable to use Case 2



Comparing Population Means
Case 2 – Sampling Distribution
• If $\bar{X}$ and $\bar{Y}$ are the sample means of the two samples, then
$$\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y,\ \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}\right)$$
• $S_x^2$ and $S_y^2$ are the sample variances of the two samples
$$S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad S_y^2 = \frac{\sum_{i=1}^{m} (Y_i - \bar{Y})^2}{m - 1}$$
• Since $\sigma_x^2$ and $\sigma_y^2$ are unequal, they are estimated separately by $S_x^2$ and $S_y^2$
– $E(S_x^2) = \sigma_x^2$ and $E(S_y^2) = \sigma_y^2$
– $\dfrac{(n-1) S_x^2}{\sigma_x^2} \sim \chi^2(n - 1)$ and $\dfrac{(m-1) S_y^2}{\sigma_y^2} \sim \chi^2(m - 1)$

• There is no pooled variance in this case because $\sigma_x^2 \neq \sigma_y^2$



Comparing Population Means
Case 2 – Sampling Distribution
• $T = \dfrac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}}}$ has an approximate t-distribution with $r$ degrees of freedom, where
$$r = \frac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2 + \dfrac{1}{m-1}\left(\dfrac{s_y^2}{m}\right)^2}$$
• The confidence interval of $\mu_x - \mu_y$ is derived from the distribution of $T$



Comparing Population Means
Case 2 – Estimation
Point Estimation of $\mu_x - \mu_y$
• Point estimate = $\bar{x} - \bar{y}$, where $\bar{x}$ and $\bar{y}$ are the observed sample means of the two samples
• Standard error = $\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}$

$(1 - \alpha)100\%$ confidence interval of $\mu_x - \mu_y$

• $\bar{x} - \bar{y} \pm t_{\alpha/2;(r)} \sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}$



Example 2 – Comparison of Population
Means (Case 2)
Two samples of primary school students are selected from City A
and City B.
For City A, 10 students are selected; the mean and standard
deviation of their heights are 140 cm and 5 cm respectively.
For City B, 12 students are selected; the mean and standard
deviation of their heights are 135 cm and 4 cm respectively.

Suppose the heights of primary school students in City A and City B
are normally distributed as $N(\mu_A, \sigma_A^2)$ and $N(\mu_B, \sigma_B^2)$
respectively. Estimate the difference $\mu_A - \mu_B$ and find a 90%
confidence interval for the difference.



Example 2 – Comparison of Population
Means (Case 2)
Solution:
$n = 10$, $\bar{x}_A = 140$, $s_A^2 = 5^2$ and $m = 12$, $\bar{x}_B = 135$, $s_B^2 = 4^2$
• Populations are normal
• $\sigma_A^2$ and $\sigma_B^2$ are unknown and not assumed equal
• Sample sizes are 10 and 12 (i.e. not large)

Point estimate of $\mu_A - \mu_B$ = $\bar{x}_A - \bar{x}_B = 140 - 135 = 5$

Standard error = $\sqrt{\dfrac{s_A^2}{n} + \dfrac{s_B^2}{m}} = \sqrt{\dfrac{5^2}{10} + \dfrac{4^2}{12}} = 1.9579$



Example 2 – Comparison of Population
Means (Case 2)
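Solution:
Degrees of freedom for the CI:
$$r = \frac{\left(\dfrac{s_A^2}{n} + \dfrac{s_B^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_A^2}{n}\right)^2 + \dfrac{1}{m-1}\left(\dfrac{s_B^2}{m}\right)^2} = \frac{\left(\dfrac{5^2}{10} + \dfrac{4^2}{12}\right)^2}{\dfrac{1}{9}\left(\dfrac{5^2}{10}\right)^2 + \dfrac{1}{11}\left(\dfrac{4^2}{12}\right)^2} = 17.1652 \approx 17$$
Table value is $t_{\alpha/2;(r)} = t_{0.05;(17)} = 1.740$
The confidence interval is given by
$$\bar{x}_A - \bar{x}_B \pm t_{\alpha/2;(r)} \sqrt{\frac{s_A^2}{n} + \frac{s_B^2}{m}}$$
$$140 - 135 \pm 1.740 \sqrt{\frac{5^2}{10} + \frac{4^2}{12}}$$
i.e. $(1.5933,\ 8.4067)$
We are 90% confident that the difference in average heights of primary school
students in City A and City B is between 1.5933 cm and 8.4067 cm.
Since 0 is not inside the confidence interval, we have enough evidence to
conclude that the two population means are unequal.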
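
As a rough check of Example 2, the Case 2 formulas can be coded directly from the summary statistics. The sketch below is illustrative: `unequal_variance_ci` is a hypothetical name, and the critical value 1.740 is the table value quoted above, looked up after rounding the degrees of freedom to 17.

```python
import math


def unequal_variance_ci(mean_x, s_x, n, mean_y, s_y, m, t_crit):
    """Case 2 interval for mu_x - mu_y from summary statistics.

    Returns the approximate degrees of freedom r, the standard error,
    and the interval built with the caller-supplied table value t_crit.
    """
    vx, vy = s_x ** 2 / n, s_y ** 2 / m  # per-sample variance of each mean
    se = math.sqrt(vx + vy)
    r = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    diff = mean_x - mean_y
    return r, se, (diff - t_crit * se, diff + t_crit * se)


# City A: n=10, mean 140, sd 5; City B: m=12, mean 135, sd 4.
r, se, (lower, upper) = unequal_variance_ci(140, 5, 10, 135, 4, 12, t_crit=1.740)
print(f"r={r:.4f}, se={se:.4f}, CI=({lower:.4f}, {upper:.4f})")
```

Because `r` is generally not an integer, it has to be rounded before the table lookup; the slides round 17.1652 to 17.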



Example 3 – Comparison of Population
Means (Case 2)
A doctor was trying to study the difference in heart rates (in beats per minute) of
smokers and non-smokers.
For a random sample of 8 smokers, the mean and standard deviation of their
heart rates were 85 and 5 respectively.
For a random sample of 16 non-smokers, the mean and standard deviation of
their heart rates were 81 and 7 respectively.
It was known that heart rates of smokers and non-smokers are both normally
distributed but their variances may be different.

Estimate the difference in average heart rates of smokers and non-smokers and
find a 99% confidence interval for the difference.

Example 3 – Comparison of Population
Means (Case 2)
Solution:
$n = 8$, $\bar{x} = 85$, $s_x^2 = 5^2$ and $m = 16$, $\bar{y} = 81$, $s_y^2 = 7^2$
• Populations are normal
• $\sigma_x^2$ and $\sigma_y^2$ are unknown and not assumed equal
• Sample sizes are 8 and 16 (i.e. not large)

Point estimate of $\mu_x - \mu_y$ = $\bar{x} - \bar{y} = 85 - 81 = 4$

Standard error = $\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}} = \sqrt{\dfrac{5^2}{8} + \dfrac{7^2}{16}} = 2.4875$

Degrees of freedom for the CI:
$$r = \frac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2 + \dfrac{1}{m-1}\left(\dfrac{s_y^2}{m}\right)^2} = \frac{\left(\dfrac{5^2}{8} + \dfrac{7^2}{16}\right)^2}{\dfrac{1}{7}\left(\dfrac{5^2}{8}\right)^2 + \dfrac{1}{15}\left(\dfrac{7^2}{16}\right)^2} = 18.9498 \approx 19$$
Table value is $t_{\alpha/2;(r)} = t_{0.005;(19)} = 2.861$
The confidence interval is given by
$$\bar{x} - \bar{y} \pm t_{\alpha/2;(r)} \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$$
$$85 - 81 \pm 2.861 \sqrt{\frac{5^2}{8} + \frac{7^2}{16}}$$
i.e. $(-3.1166,\ 11.1166)$
Therefore, we are 99% confident that the difference in average heart rates
of smokers and non-smokers is between −3.1166 and 11.1166 beats per minute.
Since 0 is within the confidence interval, we do not have enough
evidence to conclude that the two population means are unequal.



Comparison of Population Means -
Other Situations
• Cases 1 and 2 only cover situations where the populations are
normal. When the populations are not normal, the confidence
interval formulas can still be used if both sample sizes
are large (say $n > 30$ and $m > 30$)

• Recall that as the degrees of freedom of a t-distribution go to
infinity, it approaches the standard normal distribution. Therefore,
for large $n$ and $m$, $t_{\alpha/2}$ in the confidence interval formula can be
replaced by $z_{\alpha/2}$.
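
To see how far the small-sample t values sit from their normal limit, the sketch below computes $z_{\alpha/2}$ with Python's standard-library `statistics.NormalDist` and prints it next to the t table values quoted in Examples 1 to 3. The helper name `z_critical` is an assumption made for this illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Upper alpha/2 point of the standard normal, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


# t table values quoted in Examples 1-3 of this lecture.
t_table = {
    0.95: ("t_{0.025;(9)}", 2.262),
    0.90: ("t_{0.05;(17)}", 1.740),
    0.99: ("t_{0.005;(19)}", 2.861),
}

for confidence, (label, t_value) in t_table.items():
    z_value = z_critical(confidence)
    print(f"{confidence:.0%}: {label} = {t_value:.3f}, z = {z_value:.3f}")
```

In each row the t value is larger than the corresponding z value, so replacing t by z with small samples would make the interval too narrow; the replacement is only reasonable when both samples are large.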



Questions?



Comparison of Population Variances
• Why do we need to compare population variances?
• This is usually done before the comparison of
population means because the equality of variances will
affect the choice of estimation methods for comparing
means



Comparison of Population Variances -
Assumptions
Assumptions:
• $X_1, X_2, \ldots, X_n$ are iid $N(\mu_x, \sigma_x^2)$ [Sample from the 1st population]
• $Y_1, Y_2, \ldots, Y_m$ are iid $N(\mu_y, \sigma_y^2)$ [Sample from the 2nd population]
• No requirements on $m$ and $n$, i.e. large samples are not needed
• $X_i$, $i = 1, \ldots, n$ and $Y_j$, $j = 1, \ldots, m$ are all independent of each other
• $\mu_x$ and $\mu_y$ are unknown
• $\sigma_x^2$ and $\sigma_y^2$ are unknown
• The means and the variances of the two populations may be unequal (neither $\mu_x = \mu_y$ nor $\sigma_x^2 = \sigma_y^2$ is assumed)



Comparison of Population Variances –
Sampling Distributions
• $S_x^2$ and $S_y^2$ are the sample variances of the two samples
$$S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad S_y^2 = \frac{\sum_{i=1}^{m} (Y_i - \bar{Y})^2}{m - 1}$$
• After scaling, the sample variances are both chi-square distributed
$$\frac{(n-1) S_x^2}{\sigma_x^2} \sim \chi^2(n - 1), \qquad \frac{(m-1) S_y^2}{\sigma_y^2} \sim \chi^2(m - 1)$$

• The comparison of variances will rely on the following ratio
$$\frac{\dfrac{(n-1) S_x^2}{\sigma_x^2} \Big/ (n-1)}{\dfrac{(m-1) S_y^2}{\sigma_y^2} \Big/ (m-1)} = \frac{\sigma_y^2}{\sigma_x^2} \cdot \frac{S_x^2}{S_y^2} \sim F(n - 1,\ m - 1)$$



Comparison of Population Variances –
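Estimation
• Point estimate of $\dfrac{\sigma_x^2}{\sigma_y^2}$ = $\dfrac{s_x^2}{s_y^2}$

• $(1 - \alpha)100\%$ confidence interval of $\dfrac{\sigma_x^2}{\sigma_y^2}$ is given by
$$\frac{1}{F_{\alpha/2;(n-1,\,m-1)}} \cdot \frac{s_x^2}{s_y^2} < \frac{\sigma_x^2}{\sigma_y^2} < F_{\alpha/2;(m-1,\,n-1)} \cdot \frac{s_x^2}{s_y^2}$$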



Comparison of Population Variances –
Interpretation of the CI
• If “1” is within the CI, 1 is a plausible value for the ratio $\sigma_x^2/\sigma_y^2$,
i.e. the data are consistent with equal variances
– We will say “We do not have enough evidence to conclude
that the two population variances are not equal”
• If “1” is not inside the CI, 1 is not a plausible value for the ratio
$\sigma_x^2/\sigma_y^2$ at the chosen confidence level, i.e. the two variances appear
to be unequal
– We will say “We have enough evidence to conclude that the
two population variances are not equal”



Example 4 – Comparison of Population
Variances
Suppose a sample of 10 boys and a sample of 8 girls were
selected. The sample variances of their IQs are 30 and 20
respectively.
If IQs of boys and girls are both normally distributed with
variances $\sigma_B^2$ and $\sigma_G^2$, find a 95% confidence interval for
the variance ratio $\sigma_B^2/\sigma_G^2$.



Example 4 – Comparison of Population
Variances
Solution:
$n = 10$, $s_B^2 = 30$, $m = 8$ and $s_G^2 = 20$
• The two populations are normal (important)
• Means and variances of the two populations may be unequal
• Sample sizes are not large
Table values for the 95% confidence interval
• $F_{\alpha/2;(n-1,\,m-1)} = F_{0.025;(9,\,7)} = 4.82$
• $F_{\alpha/2;(m-1,\,n-1)} = F_{0.025;(7,\,9)} = 4.20$

The 95% confidence interval of $\sigma_B^2/\sigma_G^2$ is given by
$$\frac{1}{F_{\alpha/2;(n-1,\,m-1)}} \cdot \frac{s_B^2}{s_G^2} < \frac{\sigma_B^2}{\sigma_G^2} < F_{\alpha/2;(m-1,\,n-1)} \cdot \frac{s_B^2}{s_G^2}$$
CCN2311 Foundations of Data Science Page 32


Example 4 – Comparison of Population
Variances
Solution:
The 95% confidence interval of $\sigma_B^2/\sigma_G^2$ is given by
$$\frac{1}{F_{\alpha/2;(n-1,\,m-1)}} \cdot \frac{s_B^2}{s_G^2} < \frac{\sigma_B^2}{\sigma_G^2} < F_{\alpha/2;(m-1,\,n-1)} \cdot \frac{s_B^2}{s_G^2}$$
$$\frac{1}{4.82} \cdot \frac{30}{20} < \frac{\sigma_B^2}{\sigma_G^2} < 4.20 \cdot \frac{30}{20}$$
$$0.3112 < \frac{\sigma_B^2}{\sigma_G^2} < 6.3$$
Therefore, we are 95% confident that $\sigma_B^2/\sigma_G^2$ is between 0.3112 and 6.3. Since the
value 1 is within the confidence interval, we don’t have enough evidence to
conclude that the two variances are different.
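
A minimal sketch of the variance-ratio interval, using the numbers of Example 4. The function name `variance_ratio_ci` and its parameter names are hypothetical; the two F values are the table values quoted in the solution and are passed in rather than computed.

```python
def variance_ratio_ci(s2_x, s2_y, f_n_m, f_m_n):
    """Interval for sigma_x^2 / sigma_y^2.

    f_n_m is the table value F_{alpha/2;(n-1, m-1)} (divides the ratio),
    f_m_n is the table value F_{alpha/2;(m-1, n-1)} (multiplies the ratio).
    """
    ratio = s2_x / s2_y
    return ratio, (ratio / f_n_m, ratio * f_m_n)


# Boys: s^2 = 30 (n = 10); girls: s^2 = 20 (m = 8).
ratio, (lower, upper) = variance_ratio_ci(30, 20, f_n_m=4.82, f_m_n=4.20)
contains_one = lower < 1 < upper
print(f"ratio={ratio}, CI=({lower:.4f}, {upper:.4f}), contains 1: {contains_one}")
```

Note that the two F values use the degrees of freedom in opposite order, which is why the interval is not symmetric around the point estimate 1.5.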



Example 5 – Comparison of Population
Variances
A study was carried out to compare the average salaries of male and female
workers in the IT industry.
13 male IT workers were selected and their average salary was $26000 with a
standard deviation of $1500.
10 female IT workers were also selected and their average salary was $28000
with a standard deviation of $1000.
It is known that male IT workers’ salaries are $N(\mu_m, \sigma_m^2)$ and female IT workers’
salaries are $N(\mu_f, \sigma_f^2)$.
(a) Construct a 95% confidence interval for $\sigma_m^2/\sigma_f^2$.
(b) Based on the result of (a), choose a suitable method to construct a 95%
confidence interval for the difference in average salaries of male and
female IT workers.

Example 5 – Comparison of Population
Variances
Solution:
$n_m = 13$, $\bar{x}_m = 26000$, $s_m^2 = 1500^2$
$n_f = 10$, $\bar{x}_f = 28000$, $s_f^2 = 1000^2$
(a) 95% CI for $\dfrac{\sigma_m^2}{\sigma_f^2}$
$$\frac{1}{F_{0.025;(12,\,9)}} \cdot \frac{1500^2}{1000^2} < \frac{\sigma_m^2}{\sigma_f^2} < F_{0.025;(9,\,12)} \cdot \frac{1500^2}{1000^2}$$
$$\frac{1}{3.87} \cdot \frac{1500^2}{1000^2} < \frac{\sigma_m^2}{\sigma_f^2} < 3.44 \cdot \frac{1500^2}{1000^2}$$
$$0.5814 < \frac{\sigma_m^2}{\sigma_f^2} < 7.74$$
We are 95% confident that the variance ratio $\sigma_m^2/\sigma_f^2$ is between 0.5814 and 7.74.
Since “1” is within the confidence interval, there is not enough evidence to
conclude that the two population variances are unequal.



Example 5 – Comparison of Population
Variances
Solution:
(b) Based on the conclusion in part (a), we will assume that the variances of the two populations are
equal (Case 1) when constructing the 95% confidence interval for the difference in population means.
Since we assume equality of variances, we need to find the pooled sample variance.
$$s_p^2 = \frac{(n_m - 1) s_m^2 + (n_f - 1) s_f^2}{n_m + n_f - 2} = \frac{(13 - 1)\,1500^2 + (10 - 1)\,1000^2}{13 + 10 - 2} = 1714285.7142$$
The 95% confidence interval for $\mu_m - \mu_f$ is given by
$$\bar{x}_m - \bar{x}_f \pm t_{\alpha/2;(n_m+n_f-2)}\, s_p \sqrt{\frac{1}{n_m} + \frac{1}{n_f}}$$
$$26000 - 28000 \pm t_{0.025;(21)} \sqrt{1714285.7142} \sqrt{\frac{1}{13} + \frac{1}{10}}$$
i.e. $(-3145.5055,\ -854.4945)$
We are 95% confident that $\mu_m - \mu_f$ is between −$3145.5055 and −$854.4945, i.e. the average salary of
female IT workers exceeds that of male IT workers by between $854.4945 and $3145.5055. Since “0” is
not inside the CI, we have enough evidence to conclude that the average salaries of male and female
IT workers are unequal.
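
The two-step procedure of Example 5 (first the variance ratio, then the choice of method for the means) can be sketched as follows. The F values 3.87 and 3.44 are the table values quoted in part (a). The value 2.080 for $t_{0.025;(21)}$ is not printed on the slide; it is the standard table value and is assumed here because it reproduces the slide's interval. All variable names are illustrative.

```python
import math

n_m, mean_m, s_m = 13, 26000, 1500  # male sample
n_f, mean_f, s_f = 10, 28000, 1000  # female sample

# (a) 95% CI for sigma_m^2 / sigma_f^2, using the slide's F table values.
ratio = s_m ** 2 / s_f ** 2
ci_ratio = (ratio / 3.87, ratio * 3.44)
equal_variances_plausible = ci_ratio[0] < 1 < ci_ratio[1]

# (b) 1 lies inside the interval, so the pooled-variance method (Case 1) is used.
T_CRIT = 2.080  # assumed table value t_{0.025;(21)}, not printed on the slide
sp2 = ((n_m - 1) * s_m ** 2 + (n_f - 1) * s_f ** 2) / (n_m + n_f - 2)
se = math.sqrt(sp2 * (1 / n_m + 1 / n_f))
diff = mean_m - mean_f  # male minus female
ci_diff = (diff - T_CRIT * se, diff + T_CRIT * se)

print(f"ratio CI = ({ci_ratio[0]:.4f}, {ci_ratio[1]:.2f})")
print(f"sp2 = {sp2:.4f}, CI for mu_m - mu_f = ({ci_diff[0]:.4f}, {ci_diff[1]:.4f})")
```

Taking the difference as male minus female gives a negative interval; the interval for female minus male is its mirror image, roughly (854.49, 3145.51).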



Questions?



Comparison of Population Proportions
• If we are interested in comparing the opinions (yes/no
responses) of people in two populations, we can do
that by comparing proportions of the opinion (say yes)
of the two populations
• For example,
– Proportion of unemployment in male vs female populations
– Proportion of childless families in Hong Kong vs China
– Proportion of students with iPad in secondary school vs
college



Comparison of Population Proportions -
Assumptions
Assumptions:
• $X_1, X_2, \ldots, X_n$ are iid Bernoulli random variables with parameter $p_x$ [Sample
from the 1st population]
• $Y_1, Y_2, \ldots, Y_m$ are iid Bernoulli random variables with parameter $p_y$ [Sample
from the 2nd population]
• Large sample sizes are required, i.e. $n > 30$, $m > 30$, $n\hat{p}_x > 5$, $n(1 - \hat{p}_x) > 5$,
$m\hat{p}_y > 5$, $m(1 - \hat{p}_y) > 5$
• $X_i$, $i = 1, \ldots, n$ and $Y_j$, $j = 1, \ldots, m$ are all independent of each other
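
The sample-size conditions above are easy to check mechanically before building an interval. In the sketch below, `large_sample_ok` is a hypothetical helper; it uses the fact that $n\hat{p}$ is simply the number of "yes" responses and $n(1 - \hat{p})$ the number of "no" responses. The first two calls use the counts that appear in Example 6 later in this lecture, and the last call uses invented counts to show a failing case.

```python
def large_sample_ok(successes, size, min_size=30, min_count=5):
    """Check size > min_size, and that both the success count (n * p_hat)
    and the failure count (n * (1 - p_hat)) exceed min_count."""
    failures = size - successes
    return size > min_size and successes > min_count and failures > min_count


print(large_sample_ok(10, 50))  # Example 6, males: 10 unemployed out of 50
print(large_sample_ok(20, 80))  # Example 6, females: 20 unemployed out of 80
print(large_sample_ok(3, 40))   # hypothetical counts: only 3 successes
```

The check must pass for both samples before the normal approximation is used.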



Comparison of Population Proportions -
Sampling Distributions
• Sample proportions
$\hat{p}_x$ is approximately $N\left(p_x,\ \dfrac{p_x(1 - p_x)}{n}\right)$ and $\hat{p}_y$ is approximately $N\left(p_y,\ \dfrac{p_y(1 - p_y)}{m}\right)$
• Difference of sample proportions
$\hat{p}_x - \hat{p}_y$ is approximately $N\left(p_x - p_y,\ \dfrac{p_x(1 - p_x)}{n} + \dfrac{p_y(1 - p_y)}{m}\right)$
• Confidence interval of $p_x - p_y$ is based on
$$Z = \frac{(\hat{p}_x - \hat{p}_y) - (p_x - p_y)}{\sqrt{\dfrac{\hat{p}_x(1 - \hat{p}_x)}{n} + \dfrac{\hat{p}_y(1 - \hat{p}_y)}{m}}}$$
$Z$ is approximately $N(0, 1)$ when sample sizes are large.



Comparison of Population Proportions -
Estimation
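• Point estimate of $p_x - p_y$ = $\hat{p}_x - \hat{p}_y$

• Standard error = $\sqrt{\dfrac{\hat{p}_x(1 - \hat{p}_x)}{n} + \dfrac{\hat{p}_y(1 - \hat{p}_y)}{m}}$
• $(1 - \alpha)100\%$ confidence interval
$$\hat{p}_x - \hat{p}_y \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_x(1 - \hat{p}_x)}{n} + \frac{\hat{p}_y(1 - \hat{p}_y)}{m}}$$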



Example 6 – Comparison of Population
Proportions
An economist would like to find out whether the proportion of unemployed
males ($p_m$) and the proportion of unemployed females ($p_f$) are different
in a town.
Among a random sample of 50 males, 10 of them are unemployed.
Among a random sample of 80 females, 20 of them are
unemployed.
Estimate the difference in the unemployment proportions $p_m - p_f$
and find a 99% confidence interval for the difference.



Example 6 – Comparison of Population
Proportions
Solution:
$n = 50$, $\hat{p}_m = \dfrac{10}{50} = 0.2$ and $m = 80$, $\hat{p}_f = \dfrac{20}{80} = 0.25$
Point estimate of $p_m - p_f$ = $\hat{p}_m - \hat{p}_f = 0.2 - 0.25 = -0.05$

Standard error = $\sqrt{\dfrac{\hat{p}_m(1 - \hat{p}_m)}{n} + \dfrac{\hat{p}_f(1 - \hat{p}_f)}{m}} = \sqrt{\dfrac{0.2(1 - 0.2)}{50} + \dfrac{0.25(1 - 0.25)}{80}} = 0.0745$
The 99% confidence interval is
$$\hat{p}_m - \hat{p}_f \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_m(1 - \hat{p}_m)}{n} + \frac{\hat{p}_f(1 - \hat{p}_f)}{m}}$$
$$0.2 - 0.25 \pm 2.575\,(0.0745)$$
i.e. $(-0.2417,\ 0.1417)$
We are 99% confident that the difference in unemployment rates of males and females is
between −0.2417 and 0.1417. Since “0” is within the confidence interval, we do not have
enough evidence to conclude that the unemployment rates of males and females are
different.



Example 7 – Comparison of Population
Proportions
In order to study the effect of smoking on lung problems, a doctor interviewed a
random sample of 100 smokers and a random sample of 80 non-smokers. The
doctor asked if they had any lung problems in the last 3 months. The collected
information is summarized in the table below.

Smoking status   Had lung problems in the last 3 months   No lung problems in the last 3 months   Total
Smoking          45                                        55                                      100
Non-smoking      20                                        60                                      80

Let $p_s$ and $p_n$ be the proportions of smokers and non-smokers who had
lung problems in the last 3 months. Construct a 95% confidence interval for
$p_s - p_n$.

Example 7 – Comparison of Population
Proportions
Solution:
$n = 100$, $\hat{p}_s = \dfrac{45}{100} = 0.45$, $m = 80$, $\hat{p}_n = \dfrac{20}{80} = 0.25$
The 95% confidence interval is
$$\hat{p}_s - \hat{p}_n \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_s(1 - \hat{p}_s)}{n} + \frac{\hat{p}_n(1 - \hat{p}_n)}{m}}$$
$$0.45 - 0.25 \pm 1.96 \sqrt{\frac{0.45(1 - 0.45)}{100} + \frac{0.25(1 - 0.25)}{80}}$$
i.e. $(0.0639,\ 0.3361)$

We are 95% confident that the difference in proportions of people who had
lung problems in the last 3 months between smokers and non-smokers is
between 0.0639 and 0.3361. Since “0” is outside the confidence interval, we have
enough evidence to conclude that the two proportions are unequal.
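
The proportion intervals in Examples 6 and 7 follow the same recipe, so one small function can reproduce both. This is a sketch under stated assumptions: `two_proportion_ci` is a hypothetical name, and the critical values 2.575 and 1.96 are the table values used on the slides, passed in by the caller.

```python
import math


def two_proportion_ci(x_count, n, y_count, m, z_crit):
    """Large-sample interval for p_x - p_y from success counts.

    z_crit is the table value z_{alpha/2}, supplied by the caller.
    """
    p_x, p_y = x_count / n, y_count / m
    se = math.sqrt(p_x * (1 - p_x) / n + p_y * (1 - p_y) / m)
    diff = p_x - p_y
    return diff, se, (diff - z_crit * se, diff + z_crit * se)


# Example 6: 10 of 50 males and 20 of 80 females unemployed, 99% level.
d6, se6, ci6 = two_proportion_ci(10, 50, 20, 80, z_crit=2.575)
# Example 7: 45 of 100 smokers and 20 of 80 non-smokers, 95% level.
d7, se7, ci7 = two_proportion_ci(45, 100, 20, 80, z_crit=1.96)

print(f"Example 6: se={se6:.4f}, CI=({ci6[0]:.4f}, {ci6[1]:.4f})")
print(f"Example 7: se={se7:.4f}, CI=({ci7[0]:.4f}, {ci7[1]:.4f})")
```

Using a more precise 99% critical value (about 2.5758 instead of 2.575) would shift the Example 6 endpoints in the fourth decimal place without changing the conclusion.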



Final Words – Comparison of
DEPENDENT Samples
• In some situations, we need to compare population means using
dependent samples.
• For example, we want to know the change in the average weight of
a group of obese children before and after a diet. It is typically
done in the following way.
– A random sample of obese children is selected (only 1 sample)
– The weights of the i-th child before and after the diet are $X_i$ and $Y_i$
– However, $X_i$ and $Y_i$ do not satisfy the independence assumption required,
because they are measured on the same child!
– To compare the difference, we need to look at the random variable
$D_i = X_i - Y_i$
– Methods in Lecture 5 (one sample problem) can be used to estimate the
mean difference by looking at $D_i$
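
A minimal sketch of the paired approach, assuming the one-sample method referred to above is the usual t interval for a mean applied to the differences (which also assumes the differences are normally distributed). The weights below are invented purely for illustration, the function name `paired_ci` is hypothetical, and 2.776 is the standard table value $t_{0.025;(4)}$ for a 95% interval with 5 pairs.

```python
import math
import statistics


def paired_ci(before, after, t_crit):
    """One-sample t interval applied to the differences D_i = X_i - Y_i."""
    d = [x - y for x, y in zip(before, after)]
    n = len(d)
    d_bar = statistics.mean(d)
    se = statistics.stdev(d) / math.sqrt(n)  # standard error of the mean difference
    return d_bar, se, (d_bar - t_crit * se, d_bar + t_crit * se)


# Hypothetical weights (kg) of 5 children before and after a diet.
before = [62, 70, 65, 68, 75]
after = [60, 66, 64, 65, 70]
d_bar, se, (lower, upper) = paired_ci(before, after, t_crit=2.776)
print(f"mean difference={d_bar}, se={se:.4f}, CI=({lower:.4f}, {upper:.4f})")
```

Only one sample of differences is analysed, so the degrees of freedom are the number of pairs minus one, not $n + m - 2$.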



Next Lecture
• In the next lecture, we will look at how we can use data to
make decisions by means of hypothesis testing.
