Lesson 6 - Statistics For Data Science - II
A hypothesis test is a formal statistical procedure used to decide whether a hypothesis about a population should be rejected or not.
It is used to generalize a result obtained from sample data to the larger population.
The testing methodology depends on the data used and the reason for the analysis.
Types of Hypothesis Test
• Alternative hypothesis test
• Statistical hypothesis test
• Parametric test
• Non-parametric test
You have already learned about simple, complex, null, alternative, and statistical
hypotheses in the previous lesson. This lesson will focus on discussing parametric and
non-parametric tests.
Topic 2—Parametric Test
What Is a Parametric Test?
In a parametric test, inferences are based on assumptions made about the nature of the population distribution. These tests are typically used when the data are approximately normally distributed.
Types of Parametric Tests
• Z-test and t-test
• Analysis of Variance (ANOVA) test
There are many tests that are parametric. We will limit our attention to the tests
mentioned above.
Z-TEST
The formula to calculate z is:
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
Where,
n: Sample size
X̄: Sample mean from a sample X1, X2, …, Xn
μ: Population mean
σ: Population standard deviation
EXAMPLE IN R
Problem statement: The test scores of an entrance exam fit a normal distribution with a mean test score of 72 and a standard deviation of 15.2. Compute the percentage of students scoring 84 or more.
Calculation on R: Use the pnorm function (the cumulative distribution function of the normal distribution) to find the required percentage of students. Take the upper tail of the normal distribution, since the criterion is a score of 84 or more.
pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)
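The following is a minimal sketch of the entrance exam calculation. It runs the pnorm call from the example and, as a cross-check, repeats it through the z-score. The variable names (mu, sigma, score) are illustrative and not part of the original slides.

```r
# Values taken from the problem statement of the entrance exam example.
mu    <- 72    # mean test score
sigma <- 15.2  # standard deviation of the test scores
score <- 84    # threshold score

# Upper-tail probability P(X >= 84), computed directly.
p_direct <- pnorm(score, mean = mu, sd = sigma, lower.tail = FALSE)

# Same probability via the z-score and the standard normal distribution.
z   <- (score - mu) / sigma
p_z <- pnorm(z, lower.tail = FALSE)

# Express the result as a percentage, rounded to one decimal place.
pct <- round(100 * p_direct, 1)
print(pct)
```

Under these inputs, roughly 21.5% of students score 84 or more.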
T-TEST
The t-test is performed in cases where the test statistic is t, the population standard deviation σ is unknown, the sample standard deviation s is known, and the population is normal.
The formula to calculate t is:
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Where,
n: Sample size
X̄: Sample mean from a sample X1, X2, …, Xn
μ: Population mean
s: Sample standard deviation
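As a rough illustration of the t formula, the sketch below computes t by hand and compares it with R's built-in t.test function. The sample x and the hypothesized mean mu0 are made-up values chosen only for demonstration.

```r
# Hypothetical sample and hypothesized population mean (illustrative only).
x   <- c(71, 68, 75, 80, 66, 73, 77, 70)
mu0 <- 72

# t statistic from the formula: (sample mean - mu0) / (s / sqrt(n)).
n        <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Built-in one-sample t-test for comparison.
res       <- t.test(x, mu = mu0)
t_builtin <- unname(res$statistic)
dof       <- unname(res$parameter)  # degrees of freedom, n - 1

print(c(t_manual = t_manual, t_builtin = t_builtin, dof = dof))
```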
EXAMPLE IN R
Problem statement: Find the 2.5th and 97.5th percentiles of the Student's t-distribution, assuming 5 degrees of freedom.
Solution: The required 2.5th and 97.5th percentiles are -2.5706 and 2.5706, respectively.
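The percentiles quoted in this example can be reproduced with R's quantile function for the t-distribution, qt. This is a minimal sketch; the variable name q is illustrative.

```r
# 2.5th and 97.5th percentiles of Student's t-distribution with 5 degrees of freedom.
q <- qt(c(0.025, 0.975), df = 5)

# Round to four decimal places for comparison with the stated solution.
print(round(q, 4))
```

Because the t-distribution is symmetric around zero, the two percentiles differ only in sign.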
ANOVA
The ANOVA test is used for hypothesis tests that compare the means of two or more groups.
For example, consider the following cases:
• An environmentalist wants to know if the average amount of pollution varies in several bodies of water.
• A sociologist wants to find out if a person's income varies according to his or her upbringing.
TYPES
• One-way ANOVA
• Two-way ANOVA
One-way ANOVA:
• Uses variances to determine whether a statistically significant difference exists among several group means
• Tests H0: μ1 = μ2 = μ3 = ... = μk (where μ = group mean and k = number of groups)
For one-way ANOVA, the ratio of the between-group variability to the within-group variability follows an F-distribution when the null hypothesis is true.
ASSUMPTIONS (one-way ANOVA)
• All samples are random and independent
• The factor is a categorical variable
• The result (response) is a numerical variable
EXAMPLE 1
Problem statement: Find out if there is a difference in the mean grades among the sororities, assuming μ1, μ2, μ3, and μ4 are the population means of the sororities.
Test:
• H0: μ1 = μ2 = μ3 = μ4
• H1: Not all of the means μ1, μ2, μ3, and μ4 are equal
• Distribution for the test: F3,16
o df(num) = k – 1 = 4 – 1 = 3
o df(denom) = n – k = 20 – 4 = 16
• Calculate the test statistic: F = 2.23
• Define probability statement: p-value = P(F > 2.23) = 0.1241
• Compare α and the p-value:
o α = 0.01
o p-value = 0.1241
o α < p-value
• Decide: Since α < p-value, you cannot reject H0.
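The p-value in this test can be checked with R's F-distribution function pf. The sketch below uses only the values stated in the steps (k = 4 groups, n = 20 observations, F = 2.23, α = 0.01); the variable names are illustrative.

```r
# Values stated in the worked example.
k      <- 4     # number of groups
n      <- 20    # total number of observations
f_stat <- 2.23  # reported test statistic
alpha  <- 0.01  # significance level

# Degrees of freedom for the F-distribution.
df_num <- k - 1
df_den <- n - k

# p-value = P(F > 2.23), the upper tail of the F(3, 16) distribution.
p_value <- pf(f_stat, df1 = df_num, df2 = df_den, lower.tail = FALSE)

# Reject H0 only if the p-value falls below alpha.
reject_h0 <- p_value < alpha
print(round(p_value, 3))
print(reject_h0)
```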
EXAMPLE 2
Problem statement: A fast food chain wants to test market three of its new menu items. To analyze whether they are equally popular, consider:
• 18 random restaurants for the study
• 6 of the restaurants to test market the first menu item, another 6 for the second one, and the remaining 6 for the last one
The table below shows the sales figures of the menu items in the 18 restaurants. At the .05 level of significance, test whether the mean sales volumes for these menu items are equal.
Item 1   Item 2   Item 3
22       52       16
42       33       24
44       8        19
52       47       18
45       43       34
37       32       39
Calculation on R:
1. Copy and paste the sales figures into a table file "fastfood-1.txt" using a text editor.
2. Load the file into a data frame df1 using the read.table function.
df1 = read.table("fastfood-1.txt", header = TRUE); df1
  Item1 Item2 Item3
1    22    52    16
2    42    33    24
3    44     8    19
4    52    47    18
5    45    43    34
6    37    32    39
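One possible way to complete the calculation is sketched below. It assumes the analysis is a one-way ANOVA run with R's stack and aov functions; the slides do not show these steps, so treat the approach as an assumption. The data frame is built inline rather than read from "fastfood-1.txt" so that the example is self-contained.

```r
# Sales figures from the table in the problem statement, entered inline.
df1 <- data.frame(
  Item1 = c(22, 42, 44, 52, 45, 37),
  Item2 = c(52, 33, 8, 47, 43, 32),
  Item3 = c(16, 24, 19, 18, 34, 39)
)

# Reshape to long format: 'values' holds the sales, 'ind' the menu item.
long <- stack(df1)

# One-way ANOVA of sales volume by menu item.
fit <- aov(values ~ ind, data = long)
tab <- summary(fit)[[1]]

# Extract the F statistic and p-value for the menu item factor.
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
print(round(c(F = f_value, p = p_value), 2))
```

The p-value obtained this way can then be compared with the .05 significance level.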
Solution: The p-value of 0.11 is greater than the .05 significance level, so do not reject H0. There is not enough evidence to conclude that the mean sales volumes of the new menu items differ.
F-Distribution
The F-ratio is the ratio of two estimates of the population variance σ², as described below:
o Variance between samples (SSbetween): An estimate of σ² equal to the variance of the sample means multiplied by n, when the sample sizes are the same. When the sizes are different, the variance is weighted to account for the different sample sizes.
o Variance within samples (SSwithin): An estimate of σ² equal to the average of the sample variances. When the sizes are different, the variance within samples is weighted.
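To make the two variance estimates concrete, the sketch below computes the F-ratio by hand for equal-sized groups, using the fast food sales figures from Example 2. The variable names are illustrative, and the shortcut applies only when all groups have the same size.

```r
# Sales figures for the three menu items (six restaurants each).
groups <- list(
  Item1 = c(22, 42, 44, 52, 45, 37),
  Item2 = c(52, 33, 8, 47, 43, 32),
  Item3 = c(16, 24, 19, 18, 34, 39)
)
n <- length(groups[[1]])  # common group size

# Variance between samples: variance of the sample means multiplied by n.
var_between <- n * var(sapply(groups, mean))

# Variance within samples: average of the sample variances.
var_within <- mean(sapply(groups, var))

# F-ratio of the two estimates of the population variance.
f_ratio <- var_between / var_within
print(round(f_ratio, 2))
```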
Two-way ANOVA:
For example:
A company classifies its sales data by two factors: sales by salesman and sales by region.
ASSUMPTIONS (two-way ANOVA)
1. Normal distribution of the population from which the sample is drawn
2. Measurement of the dependent variable at a continuous level
3. Categorical independent groups that have the same size
4. Independence of observations
5. Homogeneity of the variance of the population
https://keydifferences.com/difference-between-one-way-and-two-way-anova.html
Topic 3—Non-Parametric Test
What Is a Non-Parametric Test?
The term "non-parametric" refers to a null category, since virtually all statistical tests assume one thing or another about the properties of the source population(s).
http://www.statisticshowto.com/parametric-and-non-parametric-data/
Types of Non-Parametric Tests
• Chi-square test
The chi-square test is one of the most commonly used non-parametric tests. We will limit our scope to the chi-square test in this course.
What Is Chi-square Test?
https://www.chegg.com/homework-help/definitions/chi-square-test-14
Types of Chi-square Test
• Chi-square test for goodness of fit
• Chi-square test for independence of two variables
USE CASES
The chi-square test is used to identify the relation between two attributes, as in the cases below:
• Creditworthiness of borrowers based on their age groups and personal loans
• Relation between the performance of salesmen and training received
• Return on a single stock and on stocks of a sector like pharmaceutical or banking
Chi-square test for independence of two variables
It is used to check whether the variables are independent of each other or not. The chi-square test statistic (χ²) is
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
with (r − 1)(c − 1) degrees of freedom, where Oi is the observed count, Ei is the expected count, r is the number of rows, and c is the number of columns.
Two random variables are called independent if the probability distribution of one variable is not affected by the other.
https://www.chegg.com/homework-help/definitions/chi-square-test-14
EXAMPLE
Problem statement:
• A random sample of 100 customers is asked whether the service was excellent, good, or poor.
• The salaries of the people waiting are then categorized as low, medium, and high.
The findings are shown in the table below:
                    Salary
Service      Low   Medium   High   Total
Excellent      9       10      7      26
Good          11        9     31      51
Poor          12        8      3      23
Total         32       27     41     100
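A possible R calculation for this table is sketched below. It computes the chi-square statistic by hand from the observed and expected counts and compares it with R's chisq.test. The column labels Low, Medium, and High are an assumption about the column order. The order affects only the labels, not the statistic.

```r
# Observed counts: rows are service ratings, columns are salary categories.
# The column order (Low, Medium, High) is assumed from the problem statement.
counts <- matrix(
  c( 9, 10,  7,
    11,  9, 31,
    12,  8,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Service = c("Excellent", "Good", "Poor"),
    Salary  = c("Low", "Medium", "High")
  )
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic and its degrees of freedom, (r - 1)(c - 1).
chi_manual <- sum((counts - expected)^2 / expected)
dof        <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_manual   <- pchisq(chi_manual, df = dof, lower.tail = FALSE)

# Built-in test for comparison.
res <- chisq.test(counts)
print(round(c(chi_sq = chi_manual, dof = dof, p = p_manual), 4))
```

A small p-value here would point toward a relation between service rating and salary category; the significance level to compare against is not given in the problem statement.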
EXAMPLE IN R
Solution: As the p-value is greater than the significance level, H0 is not rejected. The data are consistent with the smoking habits of students being independent of their exercise levels.
Hypothesis Test around Mean, Variance, and Proportion
Hypothesis tests about population means compare the population mean of interest with a specified value.
Hypothesis Tests about Population Means
ASSUMPTION
X1, X2, …, Xn is a sample of size n from a normal population with mean μ and variance σ². The sample mean X̄ is normally distributed with mean μ and variance σ²/n (X̄ ~ N(μ, σ²/n)).
If n is large, X̄ is approximately normally distributed in the same way, even if the sample is from a non-normal population.
Therefore, for large samples, the standard normal variable corresponding to X̄ is Z (as calculated in the Z-test).
WHEN POPULATION VARIANCE IS KNOWN
Test the hypothesis that the sample with mean X̄ has been drawn from a population whose mean μ equals a specified value μ0, that is:
• H0 : μ = μ0
against one of the alternatives:
• H1 : μ ≠ μ0
• H1 : μ > μ0
• H1 : μ < μ0
Under the null hypothesis, Z = (X̄ – μ0)/S.E.(X̄), where S.E.(X̄) = σ/√n, follows the standard normal distribution approximately.
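A minimal sketch of this test in R follows. The helper z_test and its arguments are hypothetical names introduced here for illustration, and the input values are made up.

```r
# Hypothetical helper: one-sample z-test with known population sigma.
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  # Z = (sample mean - mu0) / standard error of the mean.
  z <- (xbar - mu0) / (sigma / sqrt(n))
  # p-value depends on which alternative hypothesis H1 is tested.
  p <- switch(alternative,
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE),
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z)
  )
  list(z = z, p_value = p)
}

# Illustrative inputs: sample mean 74.5 from n = 100, sigma = 15.2, mu0 = 72.
res_two   <- z_test(xbar = 74.5, mu0 = 72, sigma = 15.2, n = 100)
res_upper <- z_test(xbar = 74.5, mu0 = 72, sigma = 15.2, n = 100,
                    alternative = "greater")
print(res_two)
```

For a symmetric statistic like Z, the two-sided p-value is twice the one-sided p-value in the direction of the observed difference.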
For the two-sided test:
• H0 : μ = μ0
• H1 : μ ≠ μ0
If μ0 falls within the (1 − α) confidence interval for μ, the test result is "failing to reject the null hypothesis"; if not, the result is "reject the null hypothesis."
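The confidence interval rule can be sketched in R as follows, assuming a known population standard deviation and made-up input values. All variable names are illustrative.

```r
# Illustrative inputs (not from the slides).
xbar  <- 74.5  # sample mean
sigma <- 15.2  # known population standard deviation
n     <- 100   # sample size
mu0   <- 72    # hypothesized population mean
alpha <- 0.05  # significance level

# (1 - alpha) confidence interval for the population mean.
half_width <- qnorm(1 - alpha / 2) * sigma / sqrt(n)
ci <- c(lower = xbar - half_width, upper = xbar + half_width)

# Reject H0 only if mu0 lies outside the interval.
reject_h0 <- mu0 < ci[["lower"]] || mu0 > ci[["upper"]]
print(round(ci, 2))
print(reject_h0)
```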
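Consider the case where the data consist of a simple random sample drawn from a normally distributed population. The test statistic for testing hypotheses about a single population variance is calculated as:
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2}$$
where s² is the sample variance and σ0² is the hypothesized population variance. Under the null hypothesis, this statistic follows a chi-square distribution with n − 1 degrees of freedom.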
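A rough R sketch of a test about a single population variance is shown below. It assumes the standard chi-square statistic (n − 1)s²/σ0² with n − 1 degrees of freedom. The helper var_chisq_test, the sample x, and the hypothesized variance are illustrative and not taken from the slides.

```r
# Hypothetical helper: chi-square test statistic for a single variance.
var_chisq_test <- function(x, sigma0_sq) {
  n    <- length(x)
  stat <- (n - 1) * var(x) / sigma0_sq  # (n - 1) * s^2 / sigma0^2
  list(
    statistic = stat,
    dof       = n - 1,
    # Upper-tail p-value, for the alternative "variance > sigma0_sq".
    p_upper   = pchisq(stat, df = n - 1, lower.tail = FALSE)
  )
}

# Illustrative sample with sample variance 32/7, tested against sigma0^2 = 2.
x   <- c(2, 4, 4, 4, 5, 5, 7, 9)
res <- var_chisq_test(x, sigma0_sq = 2)
print(res)
```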
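Consider a random sample of size n in which the proportion of members with a certain attribute is p. You need to test the hypothesis that the proportion P in the population has a specified value P0, that is:
• H0 : P = P0
against one of the alternatives:
• H1 : P ≠ P0
• H1 : P > P0
• H1 : P < P0
For large n, under the null hypothesis, the test statistic
$$Z = \frac{p - P_0}{\sqrt{P_0(1 - P_0)/n}}$$
follows the standard normal distribution approximately.
Where,
p = X/n = Number of successes in sample/Sample size
P0 = Hypothesized proportion of successes in the population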
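The proportion test can be sketched in R as below, assuming the usual large-sample z statistic (p − P0)/sqrt(P0(1 − P0)/n). The helper prop_z_test and the input values are hypothetical.

```r
# Hypothetical helper: large-sample z-test for a single proportion.
prop_z_test <- function(x, n, p0) {
  p_hat <- x / n                                  # sample proportion p = X/n
  z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n) # standardize under H0
  list(
    p_hat   = p_hat,
    z       = z,
    p_value = 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided H1: P != P0
  )
}

# Illustrative inputs: 60 successes in a sample of 100, tested against P0 = 0.5.
res <- prop_z_test(x = 60, n = 100, p0 = 0.5)
print(res)
```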
Key Takeaways
The Z-test is performed in cases where the test statistic is z and σ is known.
The T-test is performed in cases where the test statistic is t and σ is unknown.
The degrees of freedom are the number of independent variates that make up the statistic.
The ANOVA test is used for hypothesis tests that compare the means of two or more groups.