Lesson 6 - Statistics For Data Science - II

The document discusses hypothesis testing and different types of parametric and non-parametric tests. It explains concepts like z-test, t-test, ANOVA test and provides examples of their usage in R.

Uploaded by

rimbrahim

Data Science with R

Lesson 6— Statistics for Data Science – II

© Simplilearn. All rights reserved.


Learning Objectives

Discuss Hypothesis Test

Explain Parametric test and its types

Explain Non-Parametric test and its types

Perform Hypothesis Tests on Population Means

Perform Hypothesis Tests on Population Variance

Perform Hypothesis Tests on Population Proportions


Statistics for Data Science – II
Topic 1—Hypothesis Test
What Is Hypothesis Test?

A hypothesis test is a formal statistical procedure used to decide whether sample data provide
enough evidence to reject a hypothesis.

It is used to generalize results obtained from sample data to a larger population.

The testing methodology depends on the data used and the reason for the analysis.
Types of Hypothesis Test

• Simple Hypothesis Test
• Complex Hypothesis Test
• Null Hypothesis Test
• Alternative Hypothesis Test
• Statistical Hypothesis Test
• Parametric Test
• Non-Parametric Test

You have already learned about simple, complex, null, alternative, and statistical
hypotheses in the previous lesson. This lesson will focus on discussing parametric and
non-parametric tests.
Topic 2—Parametric Test
What Is a Parametric Test?

A parametric statistical test is one that makes assumptions about the parameters (defining
properties) of the population distribution(s) from which one's data is drawn.

In these tests, inferences are based on the assumptions made about the nature of the population
distribution. The tests are typically used when the data is normally distributed.
Types of Parametric Tests

• Z-Test and T-Test: Two population means or proportions are compared and tested.
• Analysis of Variance (ANOVA) Test: Equality of several population means is tested.

There are many tests that are parametric. We will limit our attention to the tests
mentioned above.
Z-TEST

The Z-test is performed in cases where the test statistic is z, σ is known, the population is
normal, and the sample size is at least 30.

The formula to calculate z (the standard statistic) is:

$$ z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$

Where,
n: Sample size
X̄: Sample mean from a sample X1, X2, …, Xn
μ: Population mean
σ: Population standard deviation
EXAMPLE IN R

Problem statement

The test scores of an entrance exam fit a normal distribution with a mean test score of 72 and
a standard deviation of 15.2. Compute the percentage of students scoring 84 or more.
Calculation in R

Let's use the pnorm (normal distribution probability) function to find the required percentage
of students. Use the upper tail of the normal distribution, since the criterion is a score of 84
or more.

pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)
[1] 0.21492

Note: lower.tail = TRUE returns the probability of values no larger than the given value q,
P(X ≤ q), whereas lower.tail = FALSE returns the probability of values larger than q, P(X > q).
For a continuous distribution such as the normal, P(X > q) equals P(X ≥ q).
Solution

The required percentage is 21.5%.
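The pnorm result above can be cross-checked by standardizing the score first. This is a minimal sketch using the figures from the problem statement; the variable names are illustrative. Because a single score is standardized here (not a sample mean), the denominator is σ itself.

```r
# Figures taken from the problem statement
mu    <- 72     # mean test score
sigma <- 15.2   # standard deviation of test scores
score <- 84     # cutoff score

# Standardize the cutoff: z = (x - mu) / sigma
z <- (score - mu) / sigma

# Upper-tail probability on the standard normal scale
p_standardized <- pnorm(z, lower.tail = FALSE)

# Same probability computed directly on the original scale
p_direct <- pnorm(score, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(c(z = z, standardized = p_standardized, direct = p_direct), 4))
```

Both routes should give the same upper-tail probability, which is why the z formula and the direct pnorm call are interchangeable for this problem.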
T-TEST

The T-test is performed in cases where the test statistic is t, σ is unknown, the sample standard
deviation is known, and the population is normal.

The formula to calculate t is:

$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$

Where,
n: Sample size
X̄: Sample mean from a sample X1, X2, …, Xn
μ: Population mean
s: Sample standard deviation
EXAMPLE IN R

Problem statement

Find the 2.5th and 97.5th percentiles of the Student's t-distribution, assuming 5 degrees of
freedom.
Calculation in R

Let's apply the quantile function "qt" (used to compute percentiles) to the decimal values
0.025 and 0.975.

qt(c(.025, .975), df = 5) # 5 degrees of freedom
[1] -2.5706 2.5706

Note: Degrees of freedom refers to the number of values in the final calculation of a test
statistic that are free to vary. For a single sample, it is calculated using the formula
df = N - 1 (where N is the number of values in the dataset).
Solution

The required 2.5th and 97.5th percentiles are -2.5706 and 2.5706, respectively.
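As a quick check on these percentiles, the sketch below maps them back to probabilities with pt, the distribution function that inverts qt. The variable names are illustrative only.

```r
dof <- 5  # degrees of freedom from the problem statement

# Percentiles (quantiles) of the t-distribution
q <- qt(c(0.025, 0.975), df = dof)

# pt maps each quantile back to its cumulative probability
p <- pt(q, df = dof)

# Probability mass lying between the two percentiles
central <- diff(p)

print(round(q, 4))
print(p)
print(central)
```

The interval between the two percentiles covers the central 95% of the distribution, which is why these values serve as critical values for a two-sided test at the .05 level.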
Types of Parametric Tests

The ANOVA test is used for hypothesis tests that compare the averages of two or more groups.

For example, consider the following statements:

• An environmentalist wants to know if the average amount of pollution varies in several
bodies of water.
• A sociologist wants to find out if a person's income varies according to his/her upbringing.
TYPES OF ANOVA

• One-way ANOVA
• Two-way ANOVA
ONE-WAY ANOVA

• Uses variances to determine whether a statistically significant difference exists among
several group means
• Tests H0: μ1 = μ2 = μ3 = ... = μk (where μ = group mean and k = number of groups)

For one-way ANOVA, the ratio of the between-group variability to the within-group variability
follows an F-distribution when the null hypothesis is true.
ASSUMPTIONS

1. All samples are random and independent
2. Each population is normal
3. The factor is a categorical variable
4. The populations have equal standard deviations
5. The result is a numerical variable
EXAMPLE 1

Problem statement

Find out if there is a difference in the mean grades among the sororities, assuming μ1, μ2, μ3,
and μ4 are the population means of the sororities.
Calculation

Test:
• H0: μ1 = μ2 = μ3 = μ4
• H1: Not all of the means μ1, μ2, μ3, and μ4 are equal
• Distribution for the test: F3,16
   o df(num) = k - 1 = 4 - 1 = 3
   o df(denom) = n - k = 20 - 4 = 16
• Calculate the test statistic: F = 2.23
• Define probability statement: p-value = P(F > 2.23) = 0.1241
• Compare α and the p-value:
   o α = 0.01
   o p-value = 0.1241
   o α < p-value
• Decide: Since α < p-value, you cannot reject H0.
Solution

There is not sufficient evidence to conclude that there is a difference among the mean grades
for the sororities.
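The p-value quoted in Example 1 can be reproduced from the F statistic and the two degrees of freedom. The sketch below assumes only the figures stated in the example (k = 4, n = 20, F = 2.23, α = 0.01); the raw grades are not given in the lesson, so the F statistic is taken as an input rather than computed.

```r
k      <- 4     # number of groups (sororities)
n      <- 20    # total number of observations
f_stat <- 2.23  # test statistic stated in the example
alpha  <- 0.01  # significance level stated in the example

df_num   <- k - 1   # numerator degrees of freedom
df_denom <- n - k   # denominator degrees of freedom

# Upper-tail probability of the F-distribution: P(F > f_stat)
p_value <- pf(f_stat, df_num, df_denom, lower.tail = FALSE)

# Reject H0 only when the p-value falls below alpha
reject_h0 <- p_value < alpha

print(round(p_value, 4))
print(reject_h0)
```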
EXAMPLE 2

Problem statement

A fast food chain wants to test and market three of its new menu items. To analyze if they are
equally popular, consider:

• 18 random restaurants for the study
• 6 of the restaurants to test market the first menu item, another 6 for the second one, and
the remaining 6 for the last one

The table below shows the sales figures of the menu items in the 18 restaurants. At .05 level of
significance, test whether the mean sales volumes for these menu items are equal.

Item 1   Item 2   Item 3
  22       52       16
  42       33       24
  44        8       19
  52       47       18
  45       43       34
  37       32       39
Calculation in R

1. Copy and paste the sales figures into a table file "fastfood-1.txt" using a text editor.

2. Load the file into a data frame df1 using the read.table function.

df1 = read.table("fastfood-1.txt", header = TRUE); df1
  Item1 Item2 Item3
1    22    52    16
2    42    33    24
3    44     8    19
4    52    47    18
5    45    43    34
6    37    32    39
3. Concatenate the data rows of df1 into a single vector r.

r = c(t(as.matrix(df1))) # response data
r
[1] 22 52 16 42 33 ...

4. Assign new variables for the treatment levels and number of observations.

f = c("Item1", "Item2", "Item3") # treatment levels
k = 3 # number of treatment levels
n = 6 # observations per treatment
5. Create a vector of treatment factors, corresponding to each element of r in step 3, using the
gl function.

tm = gl(k, 1, n*k, factor(f)) # matching treatments
tm
[1] Item1 Item2 Item3 Item1 Item2 ...

6. Apply the function aov to a formula that describes the response r by the treatment factor tm.

av = aov(r ~ tm)

7. Print out the ANOVA table with the summary function.

summary(av)
            Df Sum Sq Mean Sq F value Pr(>F)
tm           2    745     373    2.54   0.11
Residuals   15   2200     147
Solution

The p-value of 0.11 is greater than the .05 significance level, so do not reject H0. There is
not enough evidence to conclude that the mean sales volumes of the new menu items differ.
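The steps of Example 2 can also be run without the intermediate text file. The sketch below is one possible self-contained variant, not the lesson's own procedure: it types the sales figures in directly and uses stack (an assumption of this sketch, in place of the gl step) to build the response and treatment columns.

```r
# Sales figures from the table in the problem statement
sales <- data.frame(
  Item1 = c(22, 42, 44, 52, 45, 37),
  Item2 = c(52, 33, 8, 47, 43, 32),
  Item3 = c(16, 24, 19, 18, 34, 39)
)

# Reshape to long form: 'values' holds the sales, 'ind' the menu item (a factor)
long <- stack(sales)

# One-way ANOVA of sales by menu item
fit <- aov(values ~ ind, data = long)
tab <- summary(fit)[[1]]

f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

print(tab)
```

The F value and p-value extracted from the table should agree with the summary(av) output shown in the calculation steps.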
F-Distribution

The F-distribution, or Fisher–Snedecor distribution, is a continuous probability distribution that
arises frequently as the null distribution of a test statistic, most notably in the analysis of
variance (ANOVA).

The F-ratio is the ratio of two estimates of the population variance σ², as described below:

o Variance between samples (based on SSbetween): An estimate of σ² equal to n times the variance
of the sample means, when the sample sizes are the same. When the sizes are different, the
variance is weighted to account for the different sample sizes.

o Variance within samples (based on SSwithin): An estimate of σ² equal to the average of the
sample variances. When the sizes are different, the variance within samples is weighted.
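To make the two variance estimates concrete, the sketch below computes the F-ratio by hand for the equal-sample-size case, reusing the menu item sales figures from Example 2. The variable names are illustrative.

```r
# Three samples of equal size (sales figures from Example 2)
groups <- list(
  Item1 = c(22, 42, 44, 52, 45, 37),
  Item2 = c(52, 33, 8, 47, 43, 32),
  Item3 = c(16, 24, 19, 18, 34, 39)
)
n <- 6  # observations per sample

group_means <- sapply(groups, mean)
group_vars  <- sapply(groups, var)

# Variance between samples: n times the variance of the sample means
var_between <- n * var(group_means)

# Variance within samples: average of the sample variances
var_within <- mean(group_vars)

# F-ratio: between-sample estimate divided by within-sample estimate
f_ratio <- var_between / var_within

print(round(c(between = var_between, within = var_within, F = f_ratio), 2))
```

With equal sample sizes this hand computation gives the same F value as the ANOVA table in Example 2. The simple formulas used here do not apply to unequal sample sizes, where the weighted versions are needed.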
TWO-WAY ANOVA

Two-way ANOVA refers to a hypothesis test where the classification of data is based on two
independent variables.

For example:
A company classifies its sales both by salesman and by region.
ASSUMPTIONS

1. Normal distribution of the sampled population
2. Measurement of the dependent variable at a continuous level
3. Categorical independent groups that have the same size
4. Independence of observations
5. Homogeneity of the variance of the population

https://keydifferences.com/difference-between-one-way-and-two-way-anova.html
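The sales-by-salesman-and-region example can be sketched in R with aov. The figures below are hypothetical and exist only to make the sketch runnable; the column names salesman, region, and amount are likewise assumptions. With one observation per salesman and region pair, only the two main effects are fitted (no interaction term).

```r
# Hypothetical sales amounts: 3 salesmen x 3 regions, one observation per cell
sales <- data.frame(
  salesman = factor(rep(c("A", "B", "C"), each = 3)),
  region   = factor(rep(c("North", "South", "West"), times = 3)),
  amount   = c(12, 15, 11, 14, 18, 13, 10, 13, 9)
)

# Two-way ANOVA: sales classified by two independent variables
fit2 <- aov(amount ~ salesman + region, data = sales)
tab2 <- summary(fit2)[[1]]

print(tab2)
```

The table has one row per factor plus a residual row, so each factor gets its own F value and p-value.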
Topic 3—Non-Parametric Test
What Is a Non-Parametric Test?

A non-parametric test (sometimes called a distribution-free test) does not assume anything about
the underlying distribution. It is used when the data is not normally distributed.

Strictly speaking, "non-parametric" is a null category, since virtually all statistical tests
assume one thing or another about the properties of the source population(s).

http://www.statisticshowto.com/parametric-and-non-parametric-data/
Types of Non-Parametric Tests

• Kruskal-Wallis test (alternative to the one-way ANOVA)

• Mann-Whitney test (alternative to the two-sample t-test)

• Chi-square test

Chi-square test is the most commonly used non-parametric test. We will limit our
scope to learning chi-square test in this course.
What Is Chi-square Test?

Chi-square test is a non-parametric test used to compare two or more variables for randomly
selected data.
Chi-Square Test
FEATURES

1. Considers the square of a standard normal variate
2. Evaluates if frequencies observed in different categories vary significantly from the
frequencies expected under a specified set of assumptions
3. Determines how well an assumed distribution fits the data
4. Uses contingency tables (in market research, these tables are called cross-tabs)
5. Supports nominal-level measurements
Types of Chi-square Test

1. Chi-square test for goodness of fit


2. Chi-square test for independence of two variables
CHI-SQUARE TEST FOR GOODNESS OF FIT

It is used to assess how closely a sample matches a population. The Chi-square test statistic
(χ²) is

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$

with k-1 degrees of freedom.

Where Oi is the observed count, k is the number of categories, and Ei is the expected count.

Goodness of fit of a statistical model refers to how well the model fits a set of observations.

https://www.chegg.com/homework-help/definitions/chi-square-test-14
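A minimal goodness-of-fit sketch in R, assuming hypothetical counts from 100 rolls of a die and a hypothesized fair distribution. The data are invented for illustration. The statistic is computed from the formula and then compared with R's built-in chisq.test.

```r
# Hypothetical observed counts for die faces 1 to 6 (100 rolls in total)
observed   <- c(18, 22, 16, 14, 12, 18)
expected_p <- rep(1 / 6, 6)            # hypothesized fair die

# Expected counts under the hypothesized distribution
expected <- sum(observed) * expected_p

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)
dof    <- length(observed) - 1         # k - 1 degrees of freedom

# Upper-tail p-value from the chi-square distribution
p_manual <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Built-in test for comparison
res <- chisq.test(observed, p = expected_p)

print(round(c(chi_sq = chi_sq, p_value = p_manual), 4))
```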
USE CASES

Goodness of fit test is used to identify the relation between two attributes, as in the cases
below:

• Creditworthiness of borrowers based on their age groups and personal loans

• Relation between the performance of salesmen and training received

• Return on a single stock and on stocks of a sector like pharmaceutical or banking

• Category of viewers and impact of a TV campaign


CHI-SQUARE TEST FOR INDEPENDENCE OF TWO VARIABLES

It is used to check whether the variables are independent of each other or not. The Chi-square
test statistic (χ²) is

$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} $$

with (r-1)(c-1) degrees of freedom.

Where Oi is the observed count, r is the number of rows, c is the number of columns, and Ei is
the expected count. The sum runs over all cells of the table.

Two random variables are called independent if the probability distribution of one variable is
not affected by the other.

https://www.chegg.com/homework-help/definitions/chi-square-test-14
USE CASES

Test of independence is suitable for the following situations:

• There is one categorical variable.

• There are two categorical variables, and you will need to determine the relation between
them.

• There are cross-tabulations, and relation between two categorical variables needs to be
found.

• There are non-quantifiable variables (For example, answers to questions like, do employees in
different age groups choose different types of health plans?)
EXAMPLE

Problem statement

The manager of a restaurant wants to find the relation between customer satisfaction and the
salaries of the people waiting tables.

• She takes a random sample of 100 customers, asking if the service was excellent, good, or
poor.
• She then categorizes the salaries of the people waiting tables as low, medium, and high.

Her findings are shown in the table below:

                     Salary
Service      Low   Medium   High   Total
Excellent      9       10      7      26
Good          11        9     31      51
Poor          12        8      3      23
Total         32       27     41     100
Calculation

Assume the level of significance is 0.05. Here, H0 and H1 denote the independence and
dependence, respectively, of the service quality on the salaries of people waiting tables.

Test: DF = (3-1)(3-1) = 4

• Under H0, expected frequencies are:
   o E11 = (26 × 32)/100 = 8.32, E12 = 7.02, E13 = 10.66
   o E21 = 16.32, E22 = 13.77, E23 = 20.91
   o E31 = 7.36, E32 = 6.21, E33 = 9.43

Therefore,

$$
\chi^2_{\text{calculated}} = \frac{(9-8.32)^2}{8.32} + \frac{(10-7.02)^2}{7.02} + \frac{(7-10.66)^2}{10.66} + \frac{(11-16.32)^2}{16.32} + \frac{(9-13.77)^2}{13.77} + \frac{(31-20.91)^2}{20.91} + \frac{(12-7.36)^2}{7.36} + \frac{(8-6.21)^2}{6.21} + \frac{(3-9.43)^2}{9.43} = 18.658
$$

• χ²(0.05, 4) = 9.48773
• χ²(calculated) > χ²(tabulated)
• Reject H0, accept H1.
Solution

Service quality is dependent on the salaries of the people waiting tables.
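The hand calculation for the restaurant example can be checked in R. This sketch enters the observed counts from the table as a matrix (the object names are illustrative) and compares the test statistic with the tabulated critical value.

```r
# Observed counts from the restaurant example (rows: service, columns: salary)
service <- matrix(
  c( 9, 10,  7,
    11,  9, 31,
    12,  8,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Service = c("Excellent", "Good", "Poor"),
    Salary  = c("Low", "Medium", "High")
  )
)

# Chi-square test of independence
res <- chisq.test(service)

# Critical value at the 0.05 level with (3-1)(3-1) = 4 degrees of freedom
critical <- qchisq(0.95, df = 4)

print(round(res$expected, 2))   # expected frequencies under H0
print(res$statistic)            # calculated chi-square
print(critical)                 # tabulated chi-square
```

The expected frequencies printed here correspond to E11 through E33 in the calculation, and the calculated statistic exceeds the critical value, which matches the decision to reject H0.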
EXAMPLE IN R

Problem statement

To perform this test in R, let's consider a table that is the result of a survey conducted among
students about their smoking habits.

This table has:

• The "Smoke" variable, which records the smoking habits of students (Allowed values: "Heavy,"
"Regul," "Occas," and "Never")
• The "Exer" variable, which records the exercise levels of students (Allowed values: "Freq,"
"Some," and "None")

Assuming .05 as the significance level, test the hypothesis that the smoking habits of students
are independent of their exercise levels.
Calculation in R

Let's build the contingency table in R:

library(MASS) # load the MASS package
head(survey)
tbl = table(survey$Smoke, survey$Exer)
tbl
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7

Let's use the chisq.test function on the contingency table and find the p-value (calculated
probability).

chisq.test(tbl)

Output:
data: tbl
X-squared = 5.4885, df = 6, p-value = 0.4828
Solution

As the p-value is greater than the significance level, H0 is not rejected. There is not enough
evidence to conclude that the smoking habits of students depend on their exercise levels.
Hypothesis Test around Mean, Variance, and Proportion

Both parametric and non-parametric hypothesis tests are used to check whether the mean, variance,
and proportion of the population have pre-determined values or if the values need to be defined.

Let’s discuss them in detail.


Topic 4—Hypothesis Tests about Population Means
Hypothesis Tests about Population Means

Hypothesis tests about population means involve testing the hypothesis that
compares the population mean of interest with a specified value.
Hypothesis Tests about Population Means
ASSUMPTION

X1, X2, …, Xn is a sample of size n from a normal population with mean μ and variance σ². The
sample mean X̄ is normally distributed with mean μ and variance σ²/n (X̄ ~ N(μ, σ²/n)).
If n is large, X̄ is approximately normally distributed in the same way, even if the sample is
from a non-normal population. Therefore, for large samples, the standard normal variable
corresponding to X̄ is Z (as calculated in the Z-test).
Hypothesis Tests about Population Means
WHEN POPULATION VARIANCE IS KNOWN

Consider a large random sample of size n, with a sample mean X̄.

Test the hypothesis that the sample has been drawn from a population whose mean μ has a
specified value μ0, that is:

• H0 : μ = μ0

against one of the following alternatives:

• H1 : μ ≠ μ0
• H1 : μ > μ0
• H1 : μ < μ0

Under the null hypothesis, Z = (X̄ - μ0)/S.E.(X̄) approximately follows the standard normal
distribution.

When population variance is known, the Z-test is used.
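A minimal sketch of this test in R, under stated assumptions: the sample mean, sample size, and hypothesized mean below are hypothetical, σ is treated as known, and S.E.(X̄) is taken as σ/√n. The three p-values correspond to the three alternative hypotheses listed above.

```r
# Hypothetical inputs (not from the lesson)
x_bar <- 74     # sample mean
mu0   <- 72     # hypothesized population mean
sigma <- 15.2   # known population standard deviation
n     <- 64     # sample size

# Standard error of the sample mean and the Z statistic
se <- sigma / sqrt(n)
z  <- (x_bar - mu0) / se

# p-values for the three alternatives
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)  # H1: mu != mu0
p_greater   <- pnorm(z, lower.tail = FALSE)           # H1: mu >  mu0
p_less      <- pnorm(z)                               # H1: mu <  mu0

print(round(c(z = z, two_sided = p_two_sided,
              greater = p_greater, less = p_less), 4))
```

H0 is rejected at significance level α when the p-value for the chosen alternative is smaller than α.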


Hypothesis Tests about Population Means
WHEN POPULATION VARIANCE IS UNKNOWN

Consider the following hypothesis formation:

• H0 : μ = μ0
• H1 : μ ≠ μ0

If μ0 falls in the confidence interval, the test result is “failing to reject the null hypothesis”; if
not, the result is “reject the null hypothesis.”

When population variance is unknown, T test is used.
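The confidence-interval rule can be sketched with R's t.test function. The sample reuses the Item 1 sales figures from the ANOVA example purely as convenient numbers, and the hypothesized mean of 40 is an assumption made for illustration.

```r
# Sample values (Item 1 sales figures, reused here only as example data)
x   <- c(22, 42, 44, 52, 45, 37)
mu0 <- 40   # hypothetical value of the population mean under H0

# One-sample t-test with a 95% confidence interval
res <- t.test(x, mu = mu0, conf.level = 0.95)
ci  <- res$conf.int

# Decision rule: fail to reject H0 if mu0 lies inside the interval
in_interval <- ci[1] <= mu0 && mu0 <= ci[2]

# The same t statistic computed from the formula (X-bar - mu0) / (s / sqrt(n))
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

print(round(as.numeric(ci), 2))
print(in_interval)
print(round(t_manual, 4))
```

When μ0 lies inside the 95% interval, the two-sided p-value is above .05, so both views lead to the same decision.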


Topic 5—Hypothesis Tests about Population Variance
Hypothesis Tests about Population Variance

Variance is the expected squared deviation of a random variable from its mean. It measures how
far a set of (random) numbers is spread out from their average value. A hypothesis test about
population variance compares the population variance with a specified value.
Hypothesis Tests about Population Variance
FORMULA

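Consider the case where data consists of a simple random sample drawn from a normally
distributed population. The test statistic for testing hypotheses about a single population
variance is calculated as:

$$ \chi^2 = \frac{(n-1)\,s^2}{\sigma_0^2} $$

with n-1 degrees of freedom, where n is the sample size, s² is the sample variance, and σ0² is
the hypothesized population variance.

Chi-square test is used in hypothesis tests of population variance.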
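A minimal sketch of a chi-square test for a single variance, assuming the standard statistic (n - 1)s²/σ0² with n - 1 degrees of freedom. The sample reuses the Item 3 sales figures from the ANOVA example, and the hypothesized variance of 100 is an assumption made for illustration.

```r
# Sample values (Item 3 sales figures, reused here only as example data)
x         <- c(16, 24, 19, 18, 34, 39)
sigma0_sq <- 100   # hypothetical population variance under H0

n    <- length(x)
s_sq <- var(x)     # sample variance
dof  <- n - 1

# Chi-square statistic for a single variance (assumed standard form)
chi_sq <- (n - 1) * s_sq / sigma0_sq

# Tail probabilities and a two-sided p-value
p_lower     <- pchisq(chi_sq, df = dof)
p_upper     <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
p_two_sided <- 2 * min(p_lower, p_upper)

print(round(c(s_sq = s_sq, chi_sq = chi_sq, p_two_sided = p_two_sided), 4))
```

This test relies on the normality assumption stated above and is sensitive to departures from it.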


Topic 6—Hypothesis Tests about Population Proportions
Hypothesis Tests about Population Proportions

A population proportion is the ratio of the number of values in a subset S to the number of
values in a set R. Hypothesis tests about population proportions compare this ratio with a
specified value.
Hypothesis Tests about Population Proportions
FORMULA

Consider a random sample of size n in which the proportion of members with a certain attribute is p.

You need to test the hypothesis that the proportion P in the population has a specified value P0,
that is:

• H0 : P = P0
• H1 : P ≠ P0
• H1 : P > P0
• H1 : P < P0

For a large sample, Z = (p - P0)/S.E.(p) ~ N (0,1) (under H0)

Where,
p = X/n = Number of successes in sample/Sample size
P0 = Hypothesized proportion of successes in the population
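A minimal sketch of the large-sample proportion test in R. The counts are hypothetical, and the standard error is assumed to take the usual form under H0, S.E.(p) = sqrt(P0(1 - P0)/n), which the formula above does not spell out.

```r
# Hypothetical inputs (not from the lesson)
x  <- 58    # number of successes in the sample
n  <- 100   # sample size
P0 <- 0.5   # hypothesized population proportion

# Sample proportion and its standard error under H0 (assumed standard form)
p_hat <- x / n
se    <- sqrt(P0 * (1 - P0) / n)

# Z statistic and p-value for H1: P != P0
z           <- (p_hat - P0) / se
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(round(c(p_hat = p_hat, z = z, p_value = p_two_sided), 4))
```

For the one-sided alternatives, the p-value is pnorm(z, lower.tail = FALSE) for H1: P > P0 and pnorm(z) for H1: P < P0.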
Key Takeaways

A hypothesis test is a formal statistical procedure used to decide whether sample data provide
enough evidence to reject a hypothesis.

The Z-test is performed in cases where the test statistic is z and σ is known.

The T-test is performed in cases where the test statistic is t and σ is unknown.

The degree of freedom is the number of independent variates that make up the
statistic.

The Chi-Square Test considers the square of a standard normal variate.

The ANOVA test is used for such hypothesis tests that compare
the averages of two or more groups.

Both parametric and non-parametric tests are used to check whether the mean, variance, and
proportion of the population have pre-determined values or if the values need to be defined.
