
R & Business Analytics

Masters in Big Data and Business Analytics

Chapter 1: Confidence Interval Estimation

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
What do you expect from this course?
By the end of this course, you should know better:
• How to perform statistical tests in various business scenarios
• When and how to perform an ANOVA analysis
• What non-parametric tests are and when we should use them
• What regression analyses are and how exactly they are done
• What kind of model diagnostics we usually need to perform
• What logistic regression is and when we need it
• What the difference is between the frequentist and Bayesian views on probability
• What Bayes' Theorem is and how it can relate to regression analyses
• How we can use R to perform all the analyses above
How will you be assessed?
We have ten lectures in total:
• seven lectures on theoretical and mathematical material
• three lectures in R

You will have the following assessments:
• two written assignments (30%)
• three in-class exercises in R (30%)
• one final written exam (40%)
Fundamentals

Basic vocabulary of statistics
POPULATION
A population consists of all the items or individuals
about which you want to draw a conclusion.

SAMPLE
A sample is the portion of a population selected for
analysis. The sample is the “small group”.

PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a
characteristic of a sample.
Types of variables

Variables are either:
• Categorical (Qualitative): defined categories. Examples: marital status, political party, eye color.
• Numerical (Quantitative), which is further divided into:
  • Discrete: counted items. Examples: number of children, defects per hour.
  • Continuous: measured characteristics. Examples: weight, voltage.
Levels of measurement

Level     Ordered?  Distance?  Meaningful zero?  Examples
Nominal   No        No         No                Gender; nationality
Ordinal   Yes       No         No                Review ratings (Very good, good, …); letter grades (A+, A, A-, B+, …)
Interval  Yes       Yes        No                SAT score (200-800); credit score (300-850)
Ratio     Yes       Yes        Yes               Distance; weight
Recovery period (days) from a disease
(Figure: distribution of recovery periods.) What measures can we use to characterize such a data distribution?

Source: Paul, S., Lorin, E. Estimation of COVID-19 recovery and decease periods in Canada using delay model. Sci Rep 11, 23763 (2021).
Measures of central tendency

• Arithmetic mean
• Median: middle value in the ordered array
• Mode: most frequently observed value
• Geometric mean: rate of change of a variable over time (e.g. Compound Annual Growth Rate (CAGR)):

  X_G = (X₁ × X₂ × ⋯ × X_n)^(1/n)
Measures of variation

Measures of variation give information on the spread, variability, or dispersion of the data values. Two data sets can have the same center but different variation.

• Range
• Variance
• Standard deviation
• Coefficient of variation
Shape of a distribution: skewness
Skewness describes the amount of asymmetry in a distribution (symmetric or skewed):

• Left-skewed: mean < median, skewness < 0
• Symmetric: mean = median, skewness = 0
• Right-skewed: median < mean, skewness > 0
Shape of a distribution: kurtosis
Kurtosis describes the relative concentration of values in the center as compared to the tails:

• Flatter than bell-shaped: kurtosis < 3
• Bell-shaped: kurtosis = 3
• Sharper peak than bell-shaped: kurtosis > 3
Quartile measures
The data series should be sorted from low to high.
Quartiles split the ranked data into four segments with an equal number of values per segment.

• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% larger.
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger).
• Only 25% of the observations are greater than Q3.

Five number summary
The five numbers (Xsmallest, Q1, Median, Q3, Xlargest) help describe the center, spread and shape of data. The relative spacing of Q1, Q2, and Q3 indicates whether a distribution is left-skewed, symmetric, or right-skewed.
Probability Distributions

What are the common probability distributions?
(Figure: chart of relationships among common discrete and continuous distributions.)

Song, W. T. (2005). Relationships among some univariate distributions. IIE Transactions, 37(7), 651–656.
Probability distributions
(Figure: densities of the standard normal (µ = 0, σ = 1), a normal with µ = 1, σ = 2, and Student's t with ν = 5 and ν = 2.)
Locating Extreme Outliers: Z-Score

Z-Score:

Z = (X − X̄) / S

where X represents the data value, X̄ is the sample mean, and S is the sample standard deviation.
Locating Extreme Outliers: Z-Score
The Z-score is the number of standard deviations that a data value is from the mean.

A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater
than +3.0.

The larger the absolute value of the Z-score, the farther the data value is from the mean.

The Empirical Rule
The empirical rule approximates the variation of data in a bell-shaped distribution.
Approximately 68% of the data in a bell-shaped distribution is within one standard deviation of the mean, or µ ± 1σ.
The Empirical Rule
Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or µ ± 2σ.
Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or µ ± 3σ.
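These percentages come directly from the standard normal distribution; a quick check in R:

# Empirical-rule percentages from the standard normal CDF
k <- 1:3
round(pnorm(k) - pnorm(-k), 4)   # 0.6827 0.9545 0.9973 -> about 68%, 95%, 99.7%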
Probability distributions
(Figure: densities of Student's t with ν = 5, F with ν1 = 1, ν2 = 5, and chi-square with ν = 5.)
Probability distributions
(Figure: probability mass function of the binomial distribution with n = 30, p = 0.6.)
Confidence Interval Estimation

Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size

Point Estimates
We can estimate a population parameter with a sample statistic (a point estimate):

Population Parameter    Sample Statistic (Point Estimate)
Mean μ                  X̄
Proportion π            p
Point and Interval Estimates
A point estimate is a single number, while an interval estimate provides additional information about the variability of the estimate. The interval extends from a lower confidence limit to an upper confidence limit around the point estimate; the distance between the limits is the width of the confidence interval.
Confidence Intervals
How much uncertainty is associated with a point estimate of a population parameter?

An interval estimate provides more information about a population characteristic than does a
point estimate.

Such interval estimates are called confidence intervals.

Confidence Interval Estimate
An interval gives a range of values:
• Takes into consideration variation in sample statistics from sample to sample.
• Based on observations from one sample.
• Gives information about closeness to unknown population parameters.
• Stated in terms of level of confidence:
• e.g. 95% confident, 99% confident
• Can never be 100% confident

Estimation Process
Draw a random sample from a population whose mean μ is unknown. If the sample mean is X̄ = 50, we can make a statement such as: "I am 95% confident that μ is between 40 & 60."
Confidence Interval General Formula
The general formula for all confidence intervals is:

Point Estimate ± (Critical Value)(Standard Error)

where:
• Point Estimate is the sample statistic estimating the population parameter of interest.
• Critical Value is a table value based on the sampling distribution of the point estimate and the desired confidence level.
• Standard Error is the standard deviation of the point estimate, e.g. σ/√n for the sample mean.
Confidence Level (1-α)
Suppose confidence level = 95%.
Also written (1 - α) = 0.95, so α = 0.05.

A specific interval either will contain or will not contain the true parameter.

A relative frequency interpretation:


• 95% of all the confidence intervals that can be constructed will contain the unknown true
parameter.

Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size

Confidence Intervals
Confidence intervals are constructed for the population mean (with σ known or σ unknown) and for the population proportion.
Confidence Interval for Mean (σ Known)
Assumptions:
• Population standard deviation σ is known.
• Population is normally distributed.

Confidence interval estimate:

X̄ ± Zα/2 · σ/√n

where X̄ is the point estimate, Zα/2 is the normal distribution critical value for a probability of α/2 in each tail, and σ/√n is the standard error.
Finding the Critical Value, Zα/2
Consider a 95% confidence interval: 1 − α = 0.95, so α = 0.05 and α/2 = 0.025 in each tail.
The critical values are Zα/2 = ±1.96. In X units, the interval extends from the lower confidence limit, through the point estimate, to the upper confidence limit.
Common Levels of Confidence
Commonly used confidence levels are 90%, 95%, and 99%.

Confidence Level    Confidence Coefficient (1 − α)    Zα/2 value
80%                 0.80                              1.28
90%                 0.90                              1.65
95%                 0.95                              1.96
98%                 0.98                              2.33
99%                 0.99                              2.58
99.8%               0.998                             3.08
99.9%               0.999                             3.27
Intervals and Level of Confidence
From the sampling distribution of the mean, intervals extend from X̄ − Zα/2 · σ/√n to X̄ + Zα/2 · σ/√n. Of all the intervals constructed this way, (1 − α)×100% contain μ and (α)×100% do not. (In the figure, one highlighted interval belongs to the α×100%, e.g. 5%, of intervals that do not contain the mean.)
Example: Z-dist. Confidence Interval
A sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms. We know from past testing that the population standard deviation is 0.35 ohms.
Determine a 95% confidence interval for the true mean resistance of the population.

X̄ ± Zα/2 · σ/√n = 2.20 ± 1.96 · (0.35/√11) = 2.20 ± 0.2068

1.9932 ≤ µ ≤ 2.4068

At a 99% confidence level, the interval estimate would widen.
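A minimal R sketch of this interval calculation, using the numbers from the example:

# 95% Z confidence interval for the circuit example (sigma known)
x_bar <- 2.20   # sample mean resistance (ohms)
sigma <- 0.35   # known population standard deviation
n     <- 11
z     <- qnorm(1 - 0.05/2)                  # critical value, about 1.96
me    <- z * sigma / sqrt(n)                # margin of error, about 0.2068
c(lower = x_bar - me, upper = x_bar + me)   # 1.9932 2.4068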
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size

Confidence Intervals
We now turn to the confidence interval for the population mean when σ is unknown.
Do You Ever Truly Know σ?
Probably not!
In virtually all real world business situations, σ is not known.

If the population standard deviation σ is unknown, we can substitute the sample standard
deviation, S.

This introduces extra uncertainty, since S is variable from sample to sample.

So we use the t distribution instead of the normal distribution.

Confidence Interval for Mean (σ Unknown)
Assumptions:
• Population standard deviation is unknown.
• Use Student's t distribution.

Since σ is unknown, S is used instead of σ and t is used instead of Z.

Confidence interval estimate:

X̄ ± tα/2 · S/√n

where tα/2 is the critical value of the t distribution with n − 1 degrees of freedom and an area of α/2 in each tail.
Student's t Distribution
The t is a family of distributions. The tα/2 value depends on the degrees of freedom (d.f.): the number of observations that are free to vary after the sample mean has been calculated,

d.f. = n − 1
Degrees of Freedom (d.f.)
Idea: the number of observations that are free to vary after the sample mean has been calculated.

Example: Suppose the mean of three numbers is 8.0. How many numbers are free to vary?
Let X1 = 7 and X2 = 8. If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary).

Here, n = 3, so degrees of freedom = n − 1 = 3 − 1 = 2.
(Two values can be any numbers, but the third is not free to vary for a given mean.)
t-dist. vs z-dist.
t-distributions are bell-shaped and symmetric, but have "fatter" tails than the normal. Note: t → Z as n increases; the standard normal is the t distribution with df = ∞. (Figure compares the standard normal with t at df = 13 and df = 5.)
Selected t-dist. Values
With comparison to the Z value:

Confidence Level    t (10 d.f.)    t (20 d.f.)    t (30 d.f.)    Z (∞ d.f.)
0.80                1.372          1.325          1.310          1.28
0.90                1.812          1.725          1.697          1.645
0.95                2.228          2.086          2.042          1.96
0.99                3.169          2.845          2.750          2.58

Note: t → Z as n increases.
Example: t-dist. Confidence Interval
A random sample of n = 25 has X̄ = 50 and S = 8. Form a 95% confidence interval for μ.

With d.f. = n − 1 = 24, we get tα/2 = t0.025 = 2.0639.

The confidence interval is:

X̄ ± tα/2 · S/√n = 50 ± 2.0639 · (8/√25)

46.698 ≤ µ ≤ 53.302
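The corresponding R sketch, again using only the example's summary numbers:

# 95% t confidence interval (sigma unknown)
x_bar <- 50; s <- 8; n <- 25
t_crit <- qt(1 - 0.05/2, df = n - 1)        # 2.0639
me <- t_crit * s / sqrt(n)
c(lower = x_bar - me, upper = x_bar + me)   # 46.698 53.302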
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size

Confidence Intervals
We now turn to the confidence interval for the population proportion.
Confidence Interval Estimate
An interval estimate for the population proportion (π) can be calculated by adding an allowance for uncertainty to the sample proportion (p).

Confidence interval estimate:

p ± Zα/2 · √(p(1 − p)/n)

where Zα/2 is the standard normal value for the level of confidence desired, p is the sample proportion, and n is the sample size.
Note: must have np > 5 and n(1 − p) > 5.
Example: Confidence Intervals for Population Proportion
A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers.

p ± Zα/2 · √(p(1 − p)/n) = 25/100 ± 1.96 · √(0.25 × 0.75/100) = 0.25 ± 1.96 × 0.0433

0.1651 ≤ π ≤ 0.3349

• We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.
• Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of intervals formed from samples of size 100 in this manner will contain the true proportion.
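A minimal R sketch of the same normal-approximation interval:

# 95% confidence interval for the proportion of left-handers
p <- 25/100; n <- 100
z <- qnorm(0.975)
me <- z * sqrt(p * (1 - p) / n)
c(lower = p - me, upper = p + me)   # 0.1651 0.3349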
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size

Determining Sample Size
The required sample size can be determined for the mean and for the proportion.
Sampling Error
The required sample size can be found to reach a desired margin of error (e) with a specified level
of confidence (1 - α).

The margin of error is also called sampling error.


• The amount of imprecision in the estimate of the population parameter.

Determining Sample Size for Mean
The sampling error (margin of error) of the interval X̄ ± Zα/2 · σ/√n is:

e = Zα/2 · σ/√n

Now solve for n to get:

n = Zα/2² σ² / e²

Note: if the desired interval is very narrow (small e), the required sample size will be very big.
Determining Sample Size for Mean
To determine the required sample size for the mean, you must know:

• The desired level of confidence (1 - α), which determines the critical value, Zα/2

• The acceptable sampling error, e

• The standard deviation, σ

Example:
If σ = 45, what sample size is needed to estimate the mean within ±5 with 90% confidence?

n = Zα/2² σ² / e² = (1.645)² (45)² / 5² = 219.19

So the required sample size is n = 220 (always round up).

If σ is unknown:
• Select a pilot sample and estimate σ with the sample standard deviation, S: pick a small random sample, calculate S, and use it in place of σ.
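In R, a short sketch of this sample-size calculation:

# Required sample size for the mean: sigma = 45, e = 5, 90% confidence
sigma <- 45; e <- 5
z <- qnorm(1 - 0.10/2)       # about 1.645
ceiling((z * sigma / e)^2)   # 220 (always round up)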
Determining Sample Size for Proportion
The sampling error of the interval for the proportion is:

e = Zα/2 · √(π(1 − π)/n)

Now solve for n to get:

n = Zα/2² π(1 − π) / e²
Determining Sample Size for Proportion
To determine the required sample size for the proportion, you must know:

• The desired level of confidence (1 - α), which determines the critical value, Zα/2

• The acceptable sampling error, e

• The true proportion of events of interest, π


• π can be estimated with a pilot sample if necessary (or conservatively use 0.5 as an estimate
of π)

Example:
How large a sample would be necessary to estimate the true proportion defective in a large population within ±3%, with 95% confidence?
(Assume a pilot sample yields p = 0.12.)

For 95% confidence, use Zα/2 = 1.96, e = 0.03, and p = 0.12 (used to estimate π):

n = Zα/2² π(1 − π) / e² = (1.96)² (0.12)(1 − 0.12) / (0.03)² = 450.74

So use n = 451.
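And the analogous R sketch for the proportion case:

# Required sample size for the proportion: pilot p = 0.12, e = 0.03, 95% confidence
p_pilot <- 0.12; e <- 0.03
z <- qnorm(0.975)                              # about 1.96
ceiling(z^2 * p_pilot * (1 - p_pilot) / e^2)   # 451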
Ethical Issues
A confidence interval estimate (reflecting sampling error) should always be included when
reporting a point estimate.

The level of confidence should always be reported.

The sample size should be reported.

An interpretation of the confidence interval estimate should also be provided.

Thank you!

R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 3: Hypothesis Testing

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
One-sample Tests

Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions

What is a Hypothesis?
A hypothesis is a claim (assertion) about a population parameter.

• Population mean

Example: The mean monthly cell phone bill in this city is μ = $42

• Population proportion

Example: The proportion of adults in this city with cell phones is π = 0.68

The Null Hypothesis, H0
States the claim or assertion to be tested.
Example: The average diameter of a manufactured bolt is 30mm. (H0: μ = 30)

Always about a population parameter, NOT about a sample statistic:

H0: μ = 30 (correct)    H0: X̄ = 30 (incorrect)
The Null and Alternative Hypotheses

The null hypothesis, H0:
• Begin with the assumption that the null hypothesis is true; this is similar to the notion of innocent until proven guilty.
• Always contains the "=", "≤", or "≥" sign.
• May or may not be rejected.

The alternative hypothesis, H1:
• Is the opposite of the null hypothesis, e.g., the average diameter of a manufactured bolt is not equal to 30mm (H1: μ ≠ 30).
• Challenges the status quo.
• NEVER contains the "=", "≤", or "≥" sign.
The Hypothesis Testing Process
Claim: The population mean age is 50.
H0: μ = 50, H1: μ ≠ 50

Sample the population and find the sample mean.
The Hypothesis Testing Process
Suppose the sample mean age was X̄ = 20.

This is significantly lower than the claimed mean population age of 50.

If the null hypothesis were true, the probability of getting such a different sample mean would be
very small, so you reject the null hypothesis.

In other words, getting a sample mean of 20 is so unlikely if the population mean was 50, you
conclude that the population mean must NOT be 50.

The Hypothesis Testing Process
If H0 is true, the sampling distribution of X̄ is centered at the claimed population mean, μ = 50. A sample mean of 20 lies far out in the tail of that distribution: it is very unlikely that you would get a sample mean of this value if the population mean really were 50, so you reject the null hypothesis that μ = 50.
The Hypothesis Testing Process
If the sample mean is close to the stated population mean, the null hypothesis is NOT rejected.
If the sample mean is far from the stated population mean, the null hypothesis is rejected.

How far is “far enough” to reject H0?

The critical value of a test statistic creates a “line in the sand” for decision making -- it answers
the question of how far is far enough.

The Test Statistic and Critical Values
In the sampling distribution of the test statistic, the critical values separate the region of non-rejection around the center from the regions of rejection in the tails: values "too far away" from the mean of the sampling distribution.
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions

Possible Errors in Hypothesis Test

Type I Error
• Reject a true null hypothesis.
• Considered a serious type of error.
• The probability of a Type I error is α (the level of significance).
• Set by the researcher in advance.

Type II Error
• Failure to reject a false null hypothesis.
• The probability of a Type II error is β.

Type I and Type II errors cannot occur at the same time.
Level of Significance and the Rejection Region
The level of significance α is the total tail area in the rejection region; for a two-tail test, α/2 lies in each tail beyond the critical values.
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions

Hypothesis Tests for the Mean
Hypothesis tests for µ use a Z test when σ is known and a t test when σ is unknown.
Z Test of Hypothesis for the Mean (σ Known)
When σ is known, the test statistic is:

Z_STAT = (X̄ − µ) / (σ/√n)
Two-Tail Tests
H0: μ = μ0
H1: μ ≠ μ0

With level of significance α, there is α/2 in each tail. The lower and upper critical values are −Zα/2 and +Zα/2: reject H0 if ZSTAT < −Zα/2 or ZSTAT > +Zα/2, otherwise do not reject H0.
Critical Value Approach
For a two-tail test for the mean, σ known:
1. Convert sample statistic (X̄) to test statistic (Zstat)
2. Determine the critical Z values for a specified level of significance α from a table or computer

Decision Rule:
If the test statistic falls in the rejection region, reject H0 ; otherwise do not reject H0

Critical Value Approach: 6 Steps
1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Determine the critical values that divide the rejection and non-rejection regions
5. Collect data and compute the value of the test statistic
6. Make the statistical decision and state the managerial conclusion.
• If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0.
• If the test statistic falls into the rejection region, reject the null hypothesis.
• Express the managerial conclusion in the context of the problem

Example: Manufactured Bolt
Test the claim that the true mean diameter of a manufactured bolt is 30mm, given the sample
statistic is 29.84mm. The sample size is 100 and σ = 0.8. (Assume α = 0.05)

1. State the appropriate null and alternative hypotheses


H0: μ = 30 H1: μ ≠ 30 (This is a two-tail test)

2. Specify the desired level of significance and the sample size


Suppose that α = 0.05 and n = 100 are chosen for this test

Example: Manufactured Bolt
3. Determine the appropriate technique
σ is assumed known, so this is a Z test.

4. Determine the critical values
For α = 0.05 the critical Z values are ±1.96.

5. Collect the data and compute the test statistic
With n = 100, X̄ = 29.84, and σ = 0.8 assumed known:

Z_STAT = (X̄ − µ) / (σ/√n) = (29.84 − 30) / (0.8/√100) = −0.16/0.08 = −2.0
Example: Manufactured Bolt
6. Is the test statistic in the rejection region?

Reject H0 if ZSTAT < −1.96 or ZSTAT > 1.96; otherwise do not reject H0.

Here, ZSTAT = −2.0 < −1.96.
Example: Manufactured Bolt
6 (continued). Reach a decision and interpret the result.

Since ZSTAT = −2.0 < −1.96, reject the null hypothesis and conclude that there is sufficient evidence that the mean diameter of a manufactured bolt is not equal to 30mm.
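A minimal R sketch of this test using the example's summary data:

# Two-tail Z test for the bolt example
x_bar <- 29.84; mu0 <- 30; sigma <- 0.8; n <- 100; alpha <- 0.05
z_stat <- (x_bar - mu0) / (sigma / sqrt(n))   # -2.0
z_crit <- qnorm(1 - alpha/2)                  # 1.96
abs(z_stat) > z_crit                          # TRUE -> reject H0
2 * pnorm(-abs(z_stat))                       # two-tail p-value, about 0.046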
P-value Approach
P-value: Probability of obtaining a value equal to or more extreme than the observed sample
statistic given H0 is true.
• The p-value is also called the observed level of significance.
• It is the smallest value of α for which H0 can be rejected.

Decision rule: compare the p-value with α


• If p-value < α , reject H0
• If p-value ≥ α , do not reject H0

Remember: If the p-value is low, H0 must go.

P-value Approach: 5 Steps
1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Collect data and compute the value of the test statistic and the p-value
5. Make the statistical decision and state the managerial conclusion.
• If the p-value < α then reject H0, otherwise do not reject H0.
• State the managerial conclusion in the context of the problem.

Example: Manufactured Bolt
Test the claim that the true mean diameter of a manufactured bolt is 30mm, given the sample
statistic is 29.84mm. The sample size is 100 and σ = 0.8. (Assume α = 0.05)

1. State the appropriate null and alternative hypotheses


H0: μ = 30 H1: μ ≠ 30 (This is a two-tail test)

2. Specify the desired level of significance and the sample size


Suppose that α = 0.05 and n = 100 are chosen for this test

Example: Manufactured Bolt
3. Determine the appropriate technique
σ is assumed known, so this is a Z test.

4. Collect the data, compute the test statistic and the p-value
With n = 100, X̄ = 29.84, and σ = 0.8 assumed known, the test statistic is:

Z_STAT = (X̄ − µ) / (σ/√n) = (29.84 − 30) / (0.8/√100) = −0.16/0.08 = −2.0
Example: Manufactured Bolt
5. Is the p-value < α?
Since p-value = 0.0456 < α = 0.05, reject H0.

5 (continued). State the managerial conclusion in the context of the situation.
There is sufficient evidence to conclude that the average diameter of a manufactured bolt is not equal to 30mm.
Tests of population mean: σ known

Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions

Hypothesis Testing: σ Unknown
If the population standard deviation is unknown, you use the sample standard deviation S.

Because of this change, you use the t distribution instead of the Z distribution to test the null
hypothesis about the mean.

All other steps, concepts, and conclusions are the same.

T-test of Hypothesis for the Mean (σ Unknown)
When σ is unknown, the test statistic is:

t_STAT = (X̄ − µ) / (S/√n)
Example: Hotel Cost
The average cost of a hotel room in New York is said to be $168 per night.
To determine if this is true, a random sample of 25 hotels is taken, resulting in X̄ = $172.50 and S = $15.40. Test the appropriate hypotheses at α = 0.05.

H0: μ = 168
H1: μ ≠ 168
Example: Hotel Cost
α = 0.05, n = 25, d.f. = 25 − 1 = 24.
σ is unknown, so use a t statistic:

t_STAT = (X̄ − µ) / (S/√n) = (172.50 − 168) / (15.40/√25) = 1.46

Critical values: ±t0.025, 24 = ±2.0639.

Since −2.0639 ≤ 1.46 ≤ 2.0639, do not reject H0: there is insufficient evidence that the true mean cost is different from $168.
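A minimal R sketch from the summary statistics; with the raw nightly rates one would simply call t.test(x, mu = 168):

# Two-tail t test for the hotel-cost example
x_bar <- 172.50; mu0 <- 168; s <- 15.40; n <- 25; alpha <- 0.05
t_stat <- (x_bar - mu0) / (s / sqrt(n))   # 1.46
t_crit <- qt(1 - alpha/2, df = n - 1)     # 2.0639
abs(t_stat) > t_crit                      # FALSE -> do not reject H0
2 * pt(-abs(t_stat), df = n - 1)          # two-tail p-value, about 0.16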
One-Tail Tests
In many cases, the alternative hypothesis focuses on a particular direction.

H0: μ ≥ 3, H1: μ < 3. This is a lower-tail test, since the alternative hypothesis is focused on the lower tail below the mean of 3.

H0: μ ≤ 3, H1: μ > 3. This is an upper-tail test, since the alternative hypothesis is focused on the upper tail above the mean of 3.
Lower-Tail Tests
H0: μ ≥ 3
H1: μ < 3

There is only one critical value (−Zα or −tα), since the rejection region is in only one tail: reject H0 to the left of the critical value.
Upper-Tail Tests
H0: μ ≤ 3
H1: μ > 3

The single critical value is Zα or tα: reject H0 to the right of the critical value.
Example: Cell Phone Bills
A phone industry manager thinks that customer monthly cell phone bills have increased, and
now average over $52 per month.
The company gathered a sample with n=25, X̄ =53.1 and S=10. (Assume α = 0.10)

H0: μ ≤ 52 the average is not over $52 per month

H1: μ > 52 the average is greater than $52 per month

Example: Cell Phone Bills
Obtain the sample and compute the test statistic.

Suppose a sample is taken with the following results: n = 25, X̄ = 53.1, and S = 10.
Then the test statistic is:

t_STAT = (X̄ − µ) / (S/√n) = (53.1 − 52) / (10/√25) = 0.55

Given α = 0.10 and d.f. = n − 1 = 24, the critical value is t = 1.318.
Example: Cell Phone Bills
Reach a decision and interpret the result:

Do not reject H0, since tSTAT = 0.55 ≤ 1.318.
There is not sufficient evidence that the mean bill is over $52.
Example: Cell Phone Bills
Or, using the p-value approach, calculate the p-value and compare it to α:

p-value = .2937

Do not reject H0, since p-value = .2937 > α = .10.
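A minimal R sketch of this upper-tail test:

# Upper-tail t test for the cell-phone example
x_bar <- 53.1; mu0 <- 52; s <- 10; n <- 25; alpha <- 0.10
t_stat <- (x_bar - mu0) / (s / sqrt(n))      # 0.55
qt(1 - alpha, df = n - 1)                    # critical value, about 1.318
pt(t_stat, df = n - 1, lower.tail = FALSE)   # p-value, about 0.294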
Tests of population mean: σ unknown

Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions

Hypothesis Tests for Proportions
Involves categorical variables, typically with two possible outcomes
• Possesses characteristic of interest.
• Does not possess characteristic of interest.

Fraction or proportion of the population in the category of interest is denoted by π.

Proportions
The sample proportion in the category of interest is denoted by p:

p = X/n = (number in category of interest in sample) / (sample size)

When both nπ and n(1 − π) are at least 5, p can be approximated by a normal distribution with mean and standard deviation:

µ_p = π,  σ_p = √(π(1 − π)/n)
Hypothesis Tests for Proportions
When nπ ≥ 5 and n(1 − π) ≥ 5, the sampling distribution of p is approximately normal, so the test statistic is a ZSTAT value:

Z_STAT = (p − π) / √(π(1 − π)/n)

(The case nπ < 5 or n(1 − π) < 5 is not discussed in this chapter.)
Proportion Test wrt. Number in Category of Interest
An equivalent form to the last slide, but in terms of the number in the category of interest, X (valid when X ≥ 5 and n − X ≥ 5):

Z_STAT = (X − nπ) / √(nπ(1 − π))

(The case X < 5 or n − X < 5 is not discussed in this chapter.)
Example: Mailing
A marketing company claims that it receives 8% responses from its mailing.
To test this claim, a random sample of 500 people were surveyed, yielding 25 responses.
Test at the α = 0.05 significance level.

Check:
nπ = (500)(.08) = 40
n(1 − π) = (500)(.92) = 460

p = 25/500 = 0.05
Example: Critical Value Solution
H0: π = 0.08, H1: π ≠ 0.08
α = 0.05, n = 500, p = 0.05
Critical values: ±1.96

Test statistic:

Z_STAT = (p − π) / √(π(1 − π)/n) = (.05 − .08) / √(.08(1 − .08)/500) = −2.47

Decision: since ZSTAT = −2.47 < −1.96, reject H0 at α = 0.05.
Conclusion: there is sufficient evidence to reject the company's claim of an 8% response rate.
Example: P-value Solution
Calculate the p-value and compare to α.
(For a two-tail test the p-value is always two-tail.)

p-value = P(Z ≤ −2.47) + P(Z ≥ 2.47) = 2(0.0068) = 0.0136

Reject H0 since p-value = 0.0136 < α = 0.05.
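A minimal R sketch of this proportion test; prop.test() is the built-in equivalent (it reports the chi-square statistic, which is the square of ZSTAT):

# Z test for the mailing-response proportion
x <- 25; n <- 500; pi0 <- 0.08
p_hat <- x / n
z_stat <- (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)   # about -2.47
2 * pnorm(-abs(z_stat))                               # two-tail p-value, about 0.014
# Equivalently: prop.test(x, n, p = pi0, correct = FALSE)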
Tests of Population Proportion

Potential Pitfalls and Ethical Considerations
Use randomly collected data to reduce selection biases.

Choose the level of significance, α, and the type of test (one-tail or two-tail) before data collection.

Do NOT practice “data cleansing” to hide observations that do not support a stated hypothesis.

Report all pertinent findings including both statistical significance and practical importance.

Two-sample Tests

Two-Sample Tests

• Population means, independent samples. Example: Group 1 vs. Group 2.
• Population means, related samples. Example: same group before vs. after treatment.
• Population proportions. Example: Proportion 1 vs. Proportion 2.
• Population variances. Example: Variance 1 vs. Variance 2.
Outline
Hypothesis Tests for Two Means (Independent Populations)
Hypothesis Tests for Two Means (Related Populations)
Hypothesis Tests for Two Proportions (not covered)
Hypothesis Tests for Two Variances (not covered)

Two Means: Independent Populations
Goal: test a hypothesis or form a confidence interval for the difference between two population means, μ1 − μ2.

The point estimate for the difference is X̄1 − X̄2.

Two cases are considered:
• σ1 and σ2 unknown, assumed equal (σ1 = σ2 = σ)
• σ1 and σ2 unknown, not assumed equal
Two Means: Independent Populations
The two samples come from different data sources: they are unrelated and independent, meaning the sample selected from one population has no effect on the sample selected from the other population.

• σ1 and σ2 unknown, assumed equal: compute the pooled estimate Sp and use it to estimate the unknown σ. Use a pooled-variance t test.
• σ1 and σ2 unknown, not assumed equal: use S1 and S2 to estimate the unknown σ1 and σ2. Use a separate-variance t test.
Tests for Two Independent Population Means
Two population means, independent samples:

Lower-tail test: H0: μ1 − μ2 ≥ 0, H1: μ1 − μ2 < 0 (i.e., H0: μ1 ≥ μ2, H1: μ1 < μ2). Reject H0 if tSTAT < −tα.
Upper-tail test: H0: μ1 − μ2 ≤ 0, H1: μ1 − μ2 > 0 (i.e., H0: μ1 ≤ μ2, H1: μ1 > μ2). Reject H0 if tSTAT > tα.
Two-tail test: H0: μ1 − μ2 = 0, H1: μ1 − μ2 ≠ 0 (i.e., H0: μ1 = μ2, H1: μ1 ≠ μ2). Reject H0 if tSTAT < −tα/2 or tSTAT > tα/2.
When σ1 and σ2 Unknown and Assumed Equal…
Assumptions:
• Samples are randomly and independently drawn.
• Populations are normally distributed or both sample sizes are at least 30.
• Population variances are unknown but assumed equal (σ1 = σ2 = σ).

The pooled variance is:

S²p = [(n1 − 1)S1² + (n2 − 1)S2²] / [(n1 − 1) + (n2 − 1)]

The test statistic is:

t_STAT = [(X̄1 − X̄2) − (µ1 − µ2)] / √(S²p (1/n1 + 1/n2))

where tSTAT has d.f. = n1 + n2 − 2. (The hypothesized difference µ1 − µ2 can be any value; in this case we take it to be 0.)
Example: Dividend Yield
You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ? You collect the following data:

                  NYSE    NASDAQ
Count             21      25
Sample mean       3.27    2.53
Sample std dev    1.30    1.16

Assuming both populations are approximately normal with equal variances, is there a difference in mean yield (α = 0.05)?
Example: Dividend Yield
H0: μ1 − μ2 = 0, i.e. μ1 = μ2
H1: μ1 − μ2 ≠ 0, i.e. μ1 ≠ μ2

The pooled variance is:

S²p = [(21 − 1)(1.30)² + (25 − 1)(1.16)²] / [(21 − 1) + (25 − 1)] = 1.5021

The test statistic is:

t_STAT = [(3.27 − 2.53) − 0] / √(1.5021 × (1/21 + 1/25)) = 2.040
Example: Dividend Yield
α = 0.05, d.f. = 21 + 25 − 2 = 44.
Critical values: t = ±2.0154 (to get the critical value, we need α/2 and the d.f.).

Decision: since tSTAT = 2.040 > 2.0154, reject H0 at α = 0.05.
Conclusion: there is evidence of a difference in means.
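A minimal R sketch of the pooled-variance test from the summary statistics:

# Pooled-variance t test for the dividend-yield example
n1 <- 21; x1 <- 3.27; s1 <- 1.30   # NYSE
n2 <- 25; x2 <- 2.53; s2 <- 1.16   # NASDAQ
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)   # 1.5021
t_stat <- (x1 - x2) / sqrt(sp2 * (1/n1 + 1/n2))              # 2.040
qt(0.975, df = n1 + n2 - 2)                                  # critical value, 2.0154
# With raw data, t.test(x, y, var.equal = TRUE) performs the same test.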
When σ1 and σ2 Unknown and NOT Assumed Equal…
Assumptions:
• Samples are randomly and independently drawn.
• Populations are normally distributed or both sample sizes are at least 30.
• Population variances are unknown but NOT assumed equal.
Outline
Hypothesis Tests for Two Means (Independent Populations)
Hypothesis Tests for Two Means (Related Populations)
Hypothesis Tests for Two Proportions (not covered)
Hypothesis Tests for Two Variances (not covered)

Tests for Two Related Population Means
Tests the means of two related populations:
• Paired or matched samples
• Repeated measures (before/after)
• Use the difference between paired values: Di = X1i − X2i

Assumption: the differences are normally distributed.
Tests for Two Related Population Means
The i-th paired difference is Di = X1i − X2i.

The sample mean of the paired differences is:

D̄ = (Σ Di) / n

The sample standard deviation of the differences, where n is the number of pairs in the paired sample, is:

S_D = √[ Σ (Di − D̄)² / (n − 1) ]
Tests for Two Related Population Means
The test statistic for μD is:

t_STAT = (D̄ − µD) / (S_D/√n)

where tSTAT has d.f. = n − 1.
The Paired Difference Test: Possible Hypotheses
Paired samples:

Lower-tail test: H0: μD ≥ 0, H1: μD < 0. Reject H0 if tSTAT < −tα.
Upper-tail test: H0: μD ≤ 0, H1: μD > 0. Reject H0 if tSTAT > tα.
Two-tail test: H0: μD = 0, H1: μD ≠ 0. Reject H0 if tSTAT < −tα/2 or tSTAT > tα/2.

where tSTAT has n − 1 d.f.
Example:
Assume you send your salespeople to a "customer service" training workshop. Has the training made a difference in the number of complaints? You collect the following data:

Number of complaints:
Salesperson    Before (1)    After (2)    Difference Di = (2) − (1)
C.B.           6             4            −2
T.F.           20            6            −14
M.H.           3             2            −1
R.K.           0             0            0
M.O.           4             0            −4
                                          Σ Di = −21

D̄ = Σ Di / n = −21/5 = −4.2

S_D = √[ Σ (Di − D̄)² / (n − 1) ] = 5.67
Example:
Has the training made a difference in the number of complaints (at the 0.01 level)?

H0: μD = 0
H1: μD ≠ 0

α = .01, d.f. = n − 1 = 4, critical values: t0.005 = ±4.604.
(We use the t distribution instead of the Z distribution because n is small.)

Test statistic:

t_STAT = (D̄ − µD) / (S_D/√n) = (−4.2 − 0) / (5.67/√5) = −1.66

Decision: do not reject H0 (tSTAT is not in the rejection region).
Conclusion: there is NOT a significant change in the number of complaints.
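Since the raw before/after counts are on the slide, R's built-in paired t test can be used directly:

# Paired t test on the complaint data
before <- c(6, 20, 3, 0, 4)   # complaints before training
after  <- c(4, 6, 2, 0, 0)    # complaints after training
t.test(after, before, paired = TRUE, conf.level = 0.99)
# t = -1.66, df = 4, two-tail p-value about 0.17 -> do not reject H0 at alpha = 0.01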
Thank you!

R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 4: ANOVA, Goodness of Fit, and Test of Independence

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
ANOVA

ANOVA Definition
Analysis of Variance (ANOVA):
A technique used to test simultaneously whether the means of several populations are equal. It
uses the F distribution as the distribution of the test statistic.

Assumptions:
For each population, the response variable is normally distributed.
The observations must be independent.

ANOVA Definition
Suppose there are three populations with means μ1, μ2 and μ3. We test:

H0: μ1 = μ2 = μ3
H1: not all of the population means are equal

Our objective is to determine whether the observed differences in the three sample means are large enough to reject H0.

In other words:
If the variability among the sample means is "small", we do not reject H0;
If the variability among the sample means is "large", we reject H0.
Graphical Illustration
Suppose H0 is true: there is only one sampling distribution, and all the sample means come from it.
Graphical Illustration
Suppose H0 is false: the sample means come from different sampling distributions.
Conceptual Overview
The logic behind ANOVA is based on the development of two independent estimates of the common population variance σ²:

1. One estimate of σ² is based on the variability among the sample means themselves;
2. The other estimate of σ² is based on the variability of the data within each sample.

By comparing these two estimates of σ², we will be able to determine whether the population means are equal.

So, what are these two estimates?
Between-treatments Estimate
Sum of squares due to treatments (SSTR):

SSTR = Σj nj (X̄j − X̿)²

Mean square due to treatments (MSTR):

MSTR = SSTR / (k − 1)

where k is the number of populations, nj is the number of elements in the j-th sample, X̄j is the mean of the j-th sample, and X̿ is the overall sample mean.
Within-treatments Estimate
Sum of squares due to error (SSE), where Xij is the i-th observation of sample j:

SSE = Σj Σi (Xij − X̄j)² = Σj (nj − 1) Sj²

Mean square due to error (MSE):

MSE = SSE / (nT − k)

where k is the number of populations, nj is the number of elements in the j-th sample, and nT is the total number of observations.
Total Sum of Squares
Total sum of squares (SST):

SST = Σj Σi (Xij − X̿)²

or, equivalently:

SST = SSTR + SSE
ANOVA Test
The test statistic is F = MSTR/MSE. Since F is always positive, the ANOVA F test is always an upper-tail test.
Example: Assembly Line
One company developed a new filtration system for municipal water supplies.
The industrial engineering group is responsible for determining the best assembly method for the
new filtration system.
The group narrows the alternatives to three: method A, method B, and method C.
Three groups of workers are randomly selected to assemble the system using three different
methods respectively.

The manager would like to know whether the mean number of units produced per week is the
same for all three populations (methods) at α = 0.05.

Example: Assembly Line
Number of units produced by the 15 workers from the three groups, together with each sample mean X̄j and sample variance Sj² (data table in the original slides). The three sample means are 62, 66, and 52, with nj = 5 workers per method.
Example: Assembly Line
The overall sample mean is X̿ = (62×5 + 66×5 + 52×5)/15 = 60.

The between-treatments estimate of variance:
SSTR = 5(62 − 60)² + 5(66 − 60)² + 5(52 − 60)² = 520, so MSTR = 520/(3 − 1) = 260.

The within-treatments estimate of variance:
MSE = SSE/(nT − k) = 340/(15 − 3) = 28.33.

(As a check, SST = SSTR + SSE.)
Comparing the Estimates: F test
The test statistic for the F test is F = MSTR/MSE, with k − 1 numerator and nT − k denominator degrees of freedom.
Example: Assembly Line
The value of the test statistic is F = MSTR/MSE = 260/28.33 = 9.18.

By checking the F distribution table with 2 numerator and 12 denominator degrees of freedom, we see that the α = .01 critical value is 6.93.
Example: Assembly Line
Because F = 9.18 is greater than 6.93, the area in the upper tail at F = 9.18 is less than .01.
With p-value ≤ α = .05, H0 is rejected.

The test provides sufficient evidence to conclude that the means of the three populations are
not equal.

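A minimal R sketch using the summary quantities above; in the aov() comment, units and method are hypothetical column names for a raw-data version:

# F test for the assembly-line example
mstr <- 260; mse <- 28.33
f_stat <- mstr / mse                                # 9.18
pf(f_stat, df1 = 2, df2 = 12, lower.tail = FALSE)   # p-value, about 0.004
# With raw data: summary(aov(units ~ method, data = d)) builds the full ANOVA table.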
Probability distributions
(Figure: densities of Student's t with ν = 5, F with ν1 = 1, ν2 = 5, and chi-square with ν = 5. Note the relationship t²ν = F(1, ν).)
ANOVA Table
The results of the preceding calculations can be displayed conveniently in a table referred to as the analysis of variance or ANOVA table:

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Treatments             SSTR              k − 1                 MSTR           MSTR/MSE
Error                  SSE               nT − k                MSE
Total                  SST               nT − 1
Example: Assembly Line
ANOVA table for the Assembly Line example:

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Treatments             520               2                     260            9.18
Error                  340               12                    28.33
Total                  860               14
Goodness of Fit Test

Example: Market Share Study
Based on a market share study, last year the market shares stabilized at: 30% for company A, 50%
for company B, and 20% for company C.
Recently company C developed a “brand-new and improved” product.

We need to conduct a sample survey and compute the proportion of customers preferring each
company’s product.

A hypothesis test will then be conducted to see whether the new product caused a change in
market shares at α = 0.05.

Example: Market Share Study
Assume the observed frequencies from a survey of 200 customers are: 48 prefer company A, 98 prefer company B, and 54 prefer company C.

We need to test the difference between the observed frequencies and the expected frequencies.

Which hypothesis test should we use? A goodness of fit test.
Example: Market Share Study
The null and alternative hypotheses:

H0: πA = 0.30, πB = 0.50, πC = 0.20
H1: the population proportions are not πA = 0.30, πB = 0.50, and πC = 0.20

If we assume 200 customers in the study, the expected frequencies are 200 × 0.30 = 60 (A), 200 × 0.50 = 100 (B), and 200 × 0.20 = 40 (C).
Test Statistic
We define the test statistic for goodness of fit:

χ² = Σi (fi − ei)² / ei

where fi is the observed frequency and ei is the expected frequency for category i.

The test statistic has a chi-square distribution with k − 1 degrees of freedom, provided that the expected frequencies are 5 or more for all categories.
Example: Market Share Study
Following the calculation of the test statistic:

χ² = (48 − 60)²/60 + (98 − 100)²/100 + (54 − 40)²/40 = 2.40 + 0.04 + 4.90 = 7.34
Example: Market Share Study
By checking the chi-square table with k − 1 = 2 degrees of freedom, we see χ²0.05 = 5.991 and χ²0.025 = 7.378.

The test statistic 7.34 is between 5.991 and 7.378. Thus, the corresponding upper-tail area, or p-value, must be between .05 and .025.

With p-value less than .05, we reject H0 and conclude that the introduction of the new product by company C will alter the current market share structure.
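A minimal R sketch with chisq.test(); the observed counts are the ones assumed above:

# Goodness of fit test for the market-share example
observed <- c(A = 48, B = 98, C = 54)
chisq.test(observed, p = c(0.30, 0.50, 0.20))
# X-squared = 7.34, df = 2, p-value about 0.025 -> reject H0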
Goodness of fit test
1. State the null and alternative hypotheses:
H0: the observed frequencies are equivalent to the expected frequencies.
H1: the observed frequencies are different from the expected frequencies.

2. Assume the null hypothesis is true and determine the expected frequency ei in each category by multiplying the category probability by the sample size.

3. Compute the value of the test statistic:

χ² = Σi (fi − ei)² / ei

4. Decision rule: reject H0 if the test statistic exceeds the upper-tail critical value of the chi-square distribution with k − 1 degrees of freedom (equivalently, if the p-value ≤ α).
Probability distributions
F is the ratio of two chi-squared statistics, each divided by its degrees of freedom:

F(ν1, ν2) ≡ (χ²ν1 / ν1) / (χ²ν2 / ν2)

(Figure: Student's t with ν = 5, F with ν1 = 1, ν2 = 5, and chi-square with ν = 5.)
Test of Independence

Example: Beer Drinkers
A manufacturer distributes three types of beer: light, regular and dark.
The firm's market research group raises the question of whether preferences for the three beers differ among male and female beer drinkers.
A test of independence can address the question of whether the beer preference (light, regular, or dark) is independent of the gender of the beer drinker (male, female) at α = 0.05.

A simple random sample of 150 beer drinkers is selected. The data are summarized using the following contingency table:

Observed Frequency Table
          Light    Regular    Dark    Total
Male      20       40         20      80
Female    30       30         10      70
Total     50       70         30      150
Example: Beer Drinkers
The hypotheses are:
H0: Beer preference is independent of the gender of the beer drinker
H1: Beer preference is not independent of the gender of the beer drinker

Thought process:
1. Determine the expected frequencies under the assumption of independence between beer
preference and gender of the beer drinker
2. Use the goodness of fit test to determine whether there is a significant difference between
observed and expected frequencies.

Will this work?


Yes!

Example: Beer Drinkers
In the entire sample of 150 beer drinkers, we have:
• 50/150 = 1/3 prefer light beer
• 70/150 = 7/15 prefer regular beer
• 30/150 = 1/5 prefer dark beer
If the independence assumption is valid, we argue that these fractions must be applicable to both male and female beer drinkers. Multiplying each row total by these fractions (e.g., 80 × 1/3 = 26.67 for male light-beer drinkers) gives the expected frequencies:

Expected Frequency Table
          Light    Regular    Dark     Total
Male      26.67    37.33      16.00    80
Female    23.33    32.67      14.00    70
Example: Beer Drinkers
The test procedure for comparing the observed frequencies with the expected frequencies is similar to the goodness of fit calculations.
We define the test statistic for independence:

χ² = Σi Σj (fij − eij)² / eij

With n rows and m columns in the contingency table, the test statistic has a chi-square distribution with (n − 1)(m − 1) degrees of freedom.
Example: Beer Drinkers
Following the calculation of the test statistic gives χ² = 6.12.
Beer drinkers: calculation
By checking the chi-square table with (2 − 1)(3 − 1) = 2 degrees of freedom, we see χ²0.05 = 5.991 and χ²0.025 = 7.378.

The test statistic 6.12 is between 5.991 and 7.378. Thus, the corresponding upper-tail area, or p-value, must be between .05 and .025.

With p-value less than .05, we reject H0 and conclude that beer preference is not independent of the gender of the beer drinker.
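A minimal R sketch with chisq.test(); the cell counts are the ones assumed in the observed frequency table above:

# Test of independence for the beer example
beer <- matrix(c(20, 40, 20,
                 30, 30, 10),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Male", "Female"),
                               c("Light", "Regular", "Dark")))
chisq.test(beer)
# X-squared about 6.12, df = 2, p-value about 0.047 -> reject H0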
Test of Independence
1. State the null and alternative hypotheses:
H0: the column variable is independent of the row variable.
H1: the column variable is not independent of the row variable.

2. Assume the null hypothesis is true and compute the expected frequency for each cell in the contingency table:

eij = (row i total × column j total) / sample size

3. Compute the value of the test statistic:

χ² = Σi Σj (fij − eij)² / eij

4. Decision rule: reject H0 if the test statistic exceeds the upper-tail critical value of the chi-square distribution with (n − 1)(m − 1) degrees of freedom (equivalently, if the p-value ≤ α).
Thank you!

Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.

R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 5: Non-parametric Tests

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
Parametric vs. Non-parametric

Parametric methods:
• The statistical methods for inference presented previously are generally known as parametric methods.
• These methods begin with an assumption about the probability distribution of the population, often that the population has a normal distribution.
• Most parametric methods require quantitative data.

Non-parametric methods:
• Non-parametric methods can be used to make inferences about a population without requiring an assumption about the specific form of the population's probability distribution.
• Non-parametric methods are appropriate when data are measured on an ordinal scale of measurement.
Sign Test

Sign Test
The sign test is a versatile non-parametric method for hypothesis testing that uses the binomial distribution with p = 0.5 as the sampling distribution.
It does not require an assumption about the distribution of the population.
Its main application: tests about a population median.

The median is the measure of central tendency that divides the population so that 50% of the values are greater than the median and 50% of the values are less than the median.
When a population distribution is skewed, the median is often preferred over the mean as the best measure of central location for the population.
Example: Grocery Store
The manager estimates that the median sales of the new potato chip should be $450 per week on a per-store basis.

After carrying the product for three months, Lawler's management requested the following hypothesis test about the population median weekly sales:

H0: median = 450
H1: median ≠ 450
Example: Grocery Store
Data showing one-week sales at 10 randomly selected stores are provided (sales table in the original slides).
Transforming Data
If the observation is greater than the hypothesized value, we record a plus sign “+”.

If the observation is less than the hypothesized value, we record a minus sign “−”.

If an observation is exactly equal to the hypothesized value, the observation is eliminated from
the sample and the analysis proceeds with the smaller sample size.

Example: Grocery Store
Recording a sign for each store's sales relative to the hypothesized median of 450, we find 7 plus signs and 3 minus signs.
Binomial probability function
In a binomial experiment, our interest is in the number of successes occurring in the n trials. Letting x denote the number of successes occurring in the n trials, the probability function is:

f(x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)
Example: Grocery Store
Consider the binomial probabilities for the number of plus signs under the assumption that H0 is true (n = 10, p = 0.5).

Since the observed number of plus signs, 7, is in the upper tail of the binomial distribution, we begin by computing the probability of obtaining 7 or more plus signs (the probability of 7, 8, 9 or 10).
Example: Grocery Store
Adding these probabilities we have:

.1172 + .0439 + .0098 + .0010 = .1719

Using R to calculate the individual probabilities:
dbinom(7, size = 10, prob = 0.5) = 0.1172
dbinom(8, size = 10, prob = 0.5) = 0.0439

Since we are using a two-tailed hypothesis test, this upper-tail probability is doubled to obtain the p-value = 2(.1719) = .3438.

With p-value > α, we cannot reject H0: we cannot reject the hypothesis that the population median is $450.
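The whole test in a minimal R sketch:

# Sign test for the grocery-store example (7 plus signs out of 10)
p_upper <- sum(dbinom(7:10, size = 10, prob = 0.5))   # 0.1719
2 * p_upper                                           # two-tail p-value, 0.3438
# Equivalently: binom.test(7, 10, p = 0.5)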
Approximation: Normal Distribution
With larger sample sizes, we rely on the normal distribution approximation of the binomial distribution to compute the p-value, using mean µ = 0.50n and standard deviation σ = √(0.25n).
Example: Home Sales
One year ago the median price of a new home was $236,000.
However, a current downturn in the economy emerges.
Real estate firms use sample data on recent home sales to determine if the population median price of a new home today is lower than it was a year ago.

The hypothesis test about the population median price:

H0: median ≥ 236,000
H1: median < 236,000
Example: Home Sales
A random sample of 61 recent new home sales found 22 homes sold for more than $236,000, 38
homes sold for less than $236,000, and one home sold for $236,000.
The one home that sold for the hypothesized median price of $236,000 should be deleted from
the sample.

The sample result showing 22 plus signs is in the lower tail of the binomial distribution.
The p-value is the probability of 22 or fewer plus signs.

Mean and Standard Deviation
With n = 60 usable observations, the sampling distribution of the number of plus signs can be approximated by a normal distribution with:

µ = 0.50n = 0.50(60) = 30
σ = √(0.25n) = √(0.25 × 60) = 3.873

Note that the binomial probability distribution is discrete and the normal probability distribution is continuous.

How do we calculate the probability of 22 or fewer plus signs?
Continuity Correction Factor
The continuity correction factor is applied when a continuous probability distribution is used
for approximating a discrete probability distribution.

The general continuity correction factor table:


• If P(X=m) use P(m – 0.5 < X < m + 0.5)
• If P(X<m) use P(X < m – 0.5)
• If P(X≤m) use P(X < m + 0.5)
• If P(X>m) use P(X > m + 0.5)
• If P(X≥m) use P(X > m – 0.5)

Example: Home Sales
To compute the p-value for 22 or fewer plus signs we use the normal distribution with μ=30 and
σ=3.873 to compute the probability that the normal random variable has a value ≤ 22.5.

Example: Home Sales
Using this normal distribution, we compute the p-value as follows:

p-value = P(X ≤ 22.5) = P(Z ≤ (22.5 − 30)/3.873) = P(Z ≤ −1.94) = .0262

With p-value = .0262 < .05, we reject the null hypothesis and conclude that the median price of a new home is less than the $236,000 median price a year ago.
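A minimal R sketch of this normal-approximation computation:

# Large-sample sign test for the home-sales example (22 plus signs, n = 60)
n <- 60; plus <- 22
mu <- 0.50 * n               # 30
sigma <- sqrt(0.25 * n)      # 3.873
pnorm(plus + 0.5, mean = mu, sd = sigma)   # P(X <= 22.5), about 0.026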
Wilcoxon Signed-rank Test

Wilcoxon Signed-rank Test
The Wilcoxon signed-rank test is a nonparametric procedure for analyzing data from a matched-
sample experiment.

The test uses quantitative data but does not require the assumption that the differences between
the paired observations are normally distributed.

It only requires the assumption that the differences between the paired observations have a
symmetric distribution.

If the data has significant outliers, the Wilcoxon signed rank test would be a more robust option.

Example: Manufacturing
A manufacturing firm is attempting to determine whether two production methods differ in terms of task completion time.
Using a matched-samples experimental design, 11 randomly selected workers completed the production task twice, once using method A and once using method B (completion-time table in the original slides).

Do these data indicate that the two production methods differ significantly in terms of completion times?

(We do not have a normal distribution assumption for the time differences.)
Example: Manufacturing
Do these data indicate that the two production methods differ significantly in terms of
completion times?

The hypotheses are as follows:


H0: the median of the time differences is 0;
H1: the median of the time differences is not 0.

If we assume that the differences have a symmetric distribution but not necessarily a normal
distribution, we should apply the Wilcoxon signed-rank test.

Example: Manufacturing
The steps for the Wilcoxon signed-rank test are:

1. Discard the difference of zero for worker 8 and then compute the absolute value of the differences for the remaining 10 workers.

2. Rank these absolute differences from lowest to highest. The tied absolute differences of .4 for workers 3 and 5 are assigned the average rank of 3.5; similarly for workers 4 and 10.

3. Each rank is given the sign of the original difference for the worker.

To conduct the Wilcoxon signed-rank test, we use the sum of the positive signed ranks, T⁺, as the test statistic.
Approximation: Normal Distribution
If the number of matched pairs is 10 or more, the sampling distribution of T⁺ can be approximated by a normal distribution with:

µ_T⁺ = n(n + 1)/4
σ_T⁺ = √(n(n + 1)(2n + 1)/24)
Example: Manufacturing
Based on the calculation, we have T⁺ = 49.5, with µ_T⁺ = 10(11)/4 = 27.5 and σ_T⁺ = √(10(11)(21)/24) = 9.81.
Example: Manufacturing
The probability that T⁺ ≥ 49.5 is approximated, with the continuity correction, by:

P(T⁺ ≥ 49.5) ≈ P(Z > (49.0 − 27.5)/9.81) = P(Z > 2.19) = 1 − .9857 = .0143

The two-tailed p-value = 2(1 − .9857) = .0286.

With the p-value ≤ .05, we reject H0 and conclude that the median completion times for the two production methods are not equal.
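A minimal R sketch of the approximation; with the raw paired times, wilcox.test(a, b, paired = TRUE) does this directly:

# Normal approximation for the signed-rank test (n = 10 pairs, T+ = 49.5)
n <- 10; t_plus <- 49.5
mu <- n * (n + 1) / 4                           # 27.5
sigma <- sqrt(n * (n + 1) * (2 * n + 1) / 24)   # about 9.81
2 * pnorm((t_plus - 0.5 - mu) / sigma, lower.tail = FALSE)   # about 0.028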
Mann-Whitney-Wilcoxon (MWW) test

Mann-Whitney-Wilcoxon (MWW) test
Mann-Whitney-Wilcoxon (MWW) test is a nonparametric test for the difference between two
populations based on two independent samples.

Advantages of this nonparametric procedure are that it can be used with either ordinal data or
quantitative data and it does not require the assumption that the populations have a normal
distribution.

Example: Employee Performance
During an employee performance review at a theater, the theater manager rated all 35 part-time employees from best (ranked 1) to worst (ranked 35) in the theater's annual report.
The part-time employees were primarily college and high school students. The manager asked if there was evidence of a significant difference in performance for college students compared to high school students.

The hypotheses:
H0: The performances of college and high school students are identical.
H1: The performances of college and high school students are not identical.
Example: Employee Performance
We begin by selecting a random sample of four college students and a random sample of five high school students (ranks shown in the original slides).
Example: Employee Performance
Rank the combined samples from low to high, then sum the ranks for each sample.

The sum of ranks for the first sample (the college students) is the test statistic: W = 14.

31
Example: Employee Performance
Let C denote a college student and H denote a high school student.

Suppose the nine ranks had the order C C C C H H H H H, i.e., the four college students hold the four lowest ranks. The sum of ranks for the college students is then 1 + 2 + 3 + 4 = 10.

Now consider a ranking where the four college students have the four highest ranks. The sum of ranks for the college students is 6 + 7 + 8 + 9 = 30.

Thus, we see that the sum of the ranks for the college students must be between 10 and 30.

If the two populations are identical, we expect the sum of ranks W to be close to (10 + 30)/2 = 20.

32
Example: Employee Performance
We used a computer program to compute all possible orderings (rankings) for the nine students:

33
Example: Employee Performance
The two-tailed p-value = 2(.0952) = .1904.

With α = .05 as the level of significance, the p-value > .05.

The MWW test conclusion is that we cannot reject the null hypothesis that the populations of college and high school students are identical.
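In R, the MWW test is also run with wilcox.test, this time without pairing (a sketch with hypothetical rank data, not the slide's):

```r
# Hypothetical ranks of the sampled students within the combined ranking.
college    <- c(1, 2, 4, 7)
highschool <- c(3, 5, 6, 8, 9)

# Without paired = TRUE, wilcox.test performs the rank-sum (MWW) test.
wilcox.test(college, highschool)
```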

34
Normal Distribution Approximation
When both sample sizes are 7 or more, a normal approximation of the sampling distribution of W can be used, with mean μ(W) = n1(n1 + n2 + 1)/2 and standard deviation σ(W) = √( n1 n2 (n1 + n2 + 1)/12 ), where n1 and n2 are the two sample sizes.

35
Kruskal-Wallis (KW) test

36
Kruskal-Wallis Test
We extend the nonparametric procedures to hypothesis tests involving three or more
populations.

The nonparametric Kruskal-Wallis test is based on the analysis of independent random samples
from each of k populations.

This procedure can be used with either ordinal data or quantitative data and does not require the
assumption that the populations have normal distributions.

37
Example: Annual Performance Report
One company hires employees for its management staff from three different colleges.

The personnel director began reviewing the annual performance reports for the management
staff in an attempt to determine whether there are differences in the performance ratings among
the managers who graduated from the three colleges.

The independent samples include 7 managers from college A, 6 from college B, and 7 from
college C.

38
Example: Annual Performance Report
Rank the combined samples from low to high.
Sum the ranks for each sample.

39
Kruskal-Wallis Test statistic
The Kruskal-Wallis test statistic uses the sum of the ranks for the three samples and is computed as follows:

H = [ 12 / ( nT(nT + 1) ) ] Σ ( Ri² / ni ) − 3(nT + 1)

where the sum runs over the k samples, ni is the number of observations in sample i, nT = Σ ni is the total number of observations, and Ri is the sum of the ranks for sample i.

The Kruskal-Wallis test is always expressed as an upper-tail test.

40
Example: Annual Performance Report
The sample sizes are nA = 7, nB = 6, and nC = 7, so nT = 20.

The value of the Kruskal-Wallis test statistic is H = 8.92.

41
Example: Annual Performance Report
Under the null hypothesis assumption of identical populations, the sampling distribution of H
can be approximated by a chi-square distribution with (k-1) degrees of freedom.

With H = 8.92, we can conclude that the area in the upper tail of the chi-square distribution is between .01 and .025. (You may use a chi-square distribution table or R functions.)

Because the p-value ≤ α = .05, we reject H0 and conclude that the three populations are not all the same.

42
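In R (a sketch with hypothetical performance ratings, not the slide's data):

```r
# kruskal.test() accepts a list of independent samples
# (or a formula with a grouping factor).
collegeA <- c(25, 70, 60, 85, 95, 90, 80)
collegeB <- c(60, 20, 30, 15, 40, 35)
collegeC <- c(50, 70, 60, 80, 90, 70, 75)

kruskal.test(list(collegeA, collegeB, collegeC))
```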
Rank Correlation

43
Rank Correlation
The Pearson correlation coefficient is a measure of the linear association between two variables using quantitative data.

The Spearman rank-correlation coefficient is a measure of association between two variables when ordinal or rank-ordered data are available.

We discussed the test for the Pearson correlation coefficient in a previous session. Now we will discuss the test for the Spearman rank-correlation coefficient.

44
Spearman Rank-correlation Coefficient
The Spearman rank-correlation coefficient,

rs = 1 − 6 Σ di² / ( n(n² − 1) )

where n is the number of observations and di is the difference between the two ranks of observation i, ranges from −1.0 to +1.0, and its interpretation is similar to the Pearson correlation coefficient for quantitative data.
45
Example: Sales Potential
A personnel director reviewed the performance of 10 current members of the sales force.

After the review, the director ranked the 10 individuals in terms of their potential for success and
assigned the individual who had the most potential the rank of 1.

Data were then collected on the actual sales for each individual during their first two years of
employment.

A second ranking of the 10 individuals based on sales performance was obtained.

46
Example: Sales Potential
The ranks based on potential as well as the ranks based on the actual performance are shown
below.

47
Example: Sales Potential
The computations of Spearman rank-correlation coefficient are summarized.

48
Rank Correlation Test Hypotheses
We can use the sample rank-correlation coefficient rs to make an inference about the population rank-correlation coefficient ρs.

We test the following hypotheses:

H0: ρs = 0
H1: ρs ≠ 0

49
Example: Sales Potential
The following sampling distribution of rs can be used to conduct the test: under H0,

μ(rs) = 0 and σ(rs) = √( 1/(n − 1) )

The sample rank-correlation coefficient for sales potential and sales performance is rs = .733. Then we have μ(rs) = 0 and σ(rs) = √( 1/(10 − 1) ) = .333.

50
Example: Sales Potential
The test statistic is

z = ( rs − μ(rs) ) / σ(rs) = ( .733 − 0 ) / .333 = 2.20

Using the standard normal probability table and z = 2.20, we find the two-tailed p-value = 2(1 − .9861) = .0278.

With a .05 level of significance, p-value ≤ α. Thus, we reject the null hypothesis that the
population rank-correlation coefficient is zero.
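In R, the same test is available through cor.test (a sketch; potential and performance are hypothetical rank vectors for the 10 salespeople):

```r
# method = "spearman" computes r_s and tests H0: rho_s = 0.
potential   <- c(2, 4, 7, 1, 6, 3, 10, 9, 8, 5)
performance <- c(1, 3, 5, 6, 7, 2, 10, 8, 9, 4)

cor.test(potential, performance, method = "spearman")
```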

51
Thank you!

Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.

52
R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 7: Linear Regression Analysis

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
Simple Linear Regression Model

2
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to predict or explain


Independent variable: the variable used to predict or explain the dependent variable

3
Simple Linear Regression Model
Only one independent variable, X.
Relationship between X and Y is described by a linear function.
Changes in Y are assumed to be related to changes in X.

Yi = β0 + β1Xi + εi

where Yi is the dependent variable, Xi the independent variable, β0 the intercept, and β1 the slope; β0 + β1Xi is the linear component and εi is the random error term.

4
Simple Linear Regression Model

[Graph: the regression line Yi = β0 + β1Xi + εi through a scatter of points, marking the observed value of Y for Xi, the predicted value of Y for Xi, the random error εi for that Xi value, the intercept β0, and the slope β1 = ΔYi/ΔXi.]

5
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line.

Ŷi = b0 + b1Xi

where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and Xi is the value of X for observation i.

6
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Yi and Ŷi:

min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1Xi))²

The coefficients b0 and b1, and other regression results in this chapter, will be found using R.

• b0 is the estimated average value of Y when the value of X is zero.
• b1 is the estimated change in the average value of Y as a result of a one-unit increase in X.

7
Example: House Price
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).

A random sample of 10 houses is selected.
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet

House Price in $1000s (Y) | Square Feet (X)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700

8
Example: House Price

[Scatter plot: house price ($1000s) on the y-axis against square feet on the x-axis for the 10 sampled houses.]

9
Example: House Price
lm: Fitting Linear Model

The regression equation is: house price = 98.24833 + 0.10977 (square feet)
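This fit can be reproduced with R's lm function using the sample data above (a minimal sketch):

```r
# The 10 sampled houses from the table above.
price <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)  # $1000s
sqft  <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)

fit <- lm(price ~ sqft)  # least squares fit
summary(fit)             # coefficients: b0 = 98.24833, b1 = 0.10977
```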

10
Example: House Price
Scatter Plot and Prediction Line

[Scatter plot with the fitted line house price = 98.24833 + 0.10977 (square feet); intercept = 98.248, slope = 0.10977.]

11
Example: House Price
b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of
observed X values).

Because a house cannot have a square footage of 0, b0 has no practical application.

house price = 98.24833 + 0.10977 (square feet)

12
Example: House Price
b1 estimates the change in the average value of Y as a result of a one-unit increase in X.
Here, b1 = 0.10977 tells us that the mean value of a house increases by .10977($1000) = $109.77 on
average, for each additional one square foot of size.

house price = 98.24833 + 0.10977 (square feet)

13
Example: House Price
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.

14
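In R, the same prediction follows from the fitted model (assuming fit from the earlier sketch):

```r
# predict() evaluates the regression equation at new X values.
predict(fit, newdata = data.frame(sqft = 2000))
# about 317.8 ($1000s); the slide's 317.85 uses rounded coefficients
```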
Measures of Variation

15
Measures of Variation
Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares: SST = Σ (Yi − Ȳ)²
Regression Sum of Squares: SSR = Σ (Ŷi − Ȳ)²
Error Sum of Squares: SSE = Σ (Yi − Ŷi)²

where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value

16
Measures of Variation
SST = total sum of squares (Total Variation)
• Measures the variation of the Yi values around their mean Ȳ

SSR = regression sum of squares (Explained Variation)


• Variation attributable to the relationship between X and Y

SSE = error sum of squares (Unexplained Variation)


• Variation in Y attributable to factors other than X

17
Measures of Variation

[Graph: at a given Xi, the deviations that make up SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², and SSE = Σ(Yi − Ŷi)², shown around the fitted line and the mean Ȳ.]

18
Coefficient of Determination, r²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called r-squared and is denoted as r².

r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1

19
R-squared

[Graph: scatter plot illustrating r² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)², with 0 ≤ r² ≤ 1.]

20
Examples of Approximate r² Values

r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.

[Two plots: all points lying exactly on an upward-sloping and a downward-sloping line.]

21
Examples of Approximate r² Values

0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X.

[Two plots: points scattered around upward- and downward-sloping trend lines.]

22
Examples of Approximate r² Values

r² = 0: no linear relationship between X and Y; the value of Y does not depend on X (none of the variation in Y is explained by variation in X).

[Plot: points scattered with no trend.]

23
Example: House Price
Summary report of a fitted linear model:

58.08% of the variation in house prices is


explained by variation in square feet

24
Adjusted r²
The use of adjusted r² is an attempt to account for the phenomenon of the r² spuriously increasing when extra explanatory variables are added to the model.

There are different ways of adjusting; the most common one (also used in R) is:

R²adj = 1 − (1 − R²) · (n − 1) / (n − p − 1)

where n is the total number of observations and p is the number of explanatory variables.

25
r² vs Sample Correlation Coefficient
If the sample correlation coefficient is rxy, we have

r² = rxy²

More specifically,

rxy = (sign of b1) √r²

Only true for simple linear regression!

26
Example: House Price
Standard error of residuals in R:

SYX = 41.33032

27
Standard Error of Residuals
The standard error of residuals is the standard deviation of the variation of observations around the regression line:

SYX = √( SSE / (n − 2) ) = √( Σ (Yi − Ŷi)² / (n − 2) )

Why n − 2? Two parameters are estimated.

where
SSE = error sum of squares
n = sample size

28
Comparing Standard Errors
SYX is a measure of the variation of observed Y values from the regression line.

[Two scatter plots: one with points tightly clustered around the line (small SYX), one with points widely dispersed (large SYX).]

The magnitude of SYX should always be judged relative to the size of the Y values in the sample data.

29
Regression Slope

30
Inferences about the Slope
The standard error of the regression slope coefficient (b1) is estimated by

Sb1 = SYX / √( Σ (Xi − X̄)² )

where:
Sb1 = estimate of the standard error of the slope
SYX = √( SSE / (n − 2) ) = standard error of the estimate

31
Inferences about the Slope: T-test
T test for a population slope
• Is there a linear relationship between X and Y?

State the hypotheses


• H0: β1 = 0 (no linear relationship)
• H1: β1 ≠ 0 (linear relationship does exist)

Test statistic:

tSTAT = (b1 − β1) / Sb1, with d.f. = n − 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

32
Example: House Price

Estimated regression equation (from the 10-house sample shown earlier):

house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.

Is there a linear relationship between the square footage of the house and its sales price?

33
Example: House Price

H0: β1 = 0
H1: β1 ≠ 0

From the R output: b1 = 0.10977 and Sb1 = 0.03297.

tSTAT = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938

34
Example: House Price

H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329, with d.f. = 10 − 2 = 8.

[Two-tailed rejection regions with α/2 = .025 in each tail: reject H0 if t < −2.3060 or t > 2.3060; here tSTAT = 3.329 falls in the upper rejection region.]

Decision: Reject H0. There is sufficient evidence that square footage affects house price.

35
House Price: T-test Example

From the R output we read the p-value for the slope directly.

Decision: Reject H0, since p-value < α.

There is sufficient evidence that square footage affects house price.

36
Confidence Interval Estimate for the Slope

b1 ± tα/2 · Sb1, with d.f. = n − 2

R output:

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).

37
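In R, this interval comes straight from confint (assuming fit from the earlier house-price sketch):

```r
# 95% confidence intervals for the intercept and slope.
confint(fit, level = 0.95)
# the sqft row is approximately (0.0337, 0.1858)
```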
Confidence Interval Estimate for the Slope

Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size.

This 95% confidence interval does not include 0.


Conclusion: There is a significant relationship between house price
and square feet at the .05 level of significance

38
Multiple Linear Regression

39
Multiple Regression Model
Multiple regression analysis is the study of how a dependent variable y is related to two or more independent variables.
The equation that describes how the dependent variable y is related to the independent variables x1, x2, …, xp and an error term ε is called the multiple regression model:

y = β0 + β1x1 + β2x2 + … + βpxp + ε

40
Multiple Regression Equation
One of the assumptions is that the mean or expected value of ε is zero.

The equation that describes how the mean value of y is related to x1, x2, …, xp is called the multiple regression equation:

E(y) = β0 + β1x1 + β2x2 + … + βpxp

41
Estimated Multiple Regression Equations
A simple random sample is used to compute sample statistics b0, b1, b2, …, bp that are used as the point estimators of the parameters β0, β1, β2, …, βp:

ŷ = b0 + b1x1 + b2x2 + … + bpxp

42
Least Squares Method
The same least squares method is used to develop the estimated multiple regression equation.

The least squares method uses sample data to provide the values of b0, b1, b2, …, bp that make the sum of squared residuals (SSE) a minimum.

43
Example: Trucking Company
To develop better work schedules, the managers at a trucking company want to estimate the total
daily travel time for their drivers.
Initially the managers believed that the total daily travel time would be closely related to the
number of miles traveled in making the daily deliveries.

44
Example: Trucking Company
[Scatter diagram: total daily travel time against the number of miles traveled.]

45
Example: Trucking Company

The estimated regression equation:

The relationship between the total travel


time and the number of miles traveled is
significant.

This finding is fairly good, but the


managers might want to consider adding a
second independent variable.

46
Example: Trucking Company
The managers felt that the number of deliveries could also contribute to the total travel time.

47
Example: Trucking Company

The estimated regression equation is

But is this regression model better than


the simple linear regression model?

48
Multiple Coefficient of Determination
Similar to the conception in simple linear regression, the term multiple coefficient of
determination indicates that we are measuring the goodness of fit for the estimated multiple
regression equation.

49
Multiple Coefficient of Determination
Relationship among SST, SSR, and SSE: SST = SSR + SSE, and the multiple coefficient of determination is R² = SSR/SST.

50
Example: Trucking Company
The r² value of the simple linear regression is 0.664. In comparison, the R² value for the estimated regression equation with two independent variables is 0.904.

The goodness of fit is definitely increased!

In general, R² always increases as independent variables are added to the model. Many analysts prefer the adjusted R², which also increased (from 0.622 to 0.876).

51
Testing for Significance
In simple linear regression, the significance test we used was the t test.

In multiple regression, we use both the t test and the F test, and they have different purposes:
1. The t test is used to determine whether each of the individual independent variables is significant.
2. The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

52
F test

The overall F test examines H0: β1 = β2 = … = βp = 0 against the alternative that at least one coefficient is nonzero, using

F = MSR / MSE = ( SSR / p ) / ( SSE / (n − p − 1) )

53
Example: Trucking Company
F statistic and p-value are exported in model summary output:

54
Categorical Independent Variables

55
Categorical Independent Variables
So far, the examples we have considered involved quantitative independent variables.

In many situations, however, we must work with categorical independent variables such as
gender (male, female), payment method (cash, credit card, check), and so on.

56
Example: System Maintenance
A water-filtration system maintenance service provider is trying to estimate the repair time
necessary for each maintenance request.

Repair time is believed to be related to two factors, the number of months since the last
maintenance service and the type of repair problem (mechanical or electrical).

57
Example: System Maintenance
Data for a sample of 10 service calls are reported.

58
Example: System Maintenance
To incorporate the type of repair into the regression model, we define the following variable: x2 = 0 if the type of repair is mechanical, and x2 = 1 if the type of repair is electrical.

In regression analysis, x2 is called a dummy or indicator variable.

The multiple regression model is:

y = β0 + β1x1 + β2x2 + ε

where x1 denotes the number of months since the last maintenance service.

59
Example: System Maintenance
The estimated result is:

60
Interpreting dummy variable
Comparing the model when the dummy variable equals 0 or 1:

When x2 = 0: E(y) = β0 + β1x1
When x2 = 1: E(y) = (β0 + β2) + β1x1

The interpretation of β2 is that it indicates the difference between the mean repair time for an electrical repair and the mean repair time for a mechanical repair.

61
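In R, a dummy variable does not need to be coded by hand: passing a factor to lm creates the 0/1 indicator automatically (a sketch with hypothetical data; the column names time, months, and type are assumptions):

```r
# Hypothetical maintenance records; type is "mechanical" or "electrical".
repairs <- data.frame(
  time   = c(2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5),
  months = c(2, 6, 8, 3, 2, 7, 9, 8, 4, 6),
  type   = factor(c("mechanical", "electrical", "electrical", "mechanical",
                    "electrical", "electrical", "mechanical", "mechanical",
                    "electrical", "electrical"))
)

# lm() encodes the two-level factor as a single dummy variable.
fit_repair <- lm(time ~ months + type, data = repairs)
summary(fit_repair)
```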
Multi-level Categorical Variables
The categorical variable for the previous example had two levels (mechanical and electrical).

Oftentimes, we will encounter a categorical variable with more than two levels.

If a categorical variable has k levels, how many dummy variables will we need?

62
Example: Copy Machine
Suppose a manufacturer of copy machines organized the sales territories for a particular state into
three regions: A, B, and C.
The managers want to use regression analysis to help predict the number of copiers sold per
week.
The managers believe sales region is an important factor in predicting the number of copiers sold.
Because sales region is a categorical variable with three levels, A, B and C, we will need 3 − 1 = 2
dummy variables to represent the sales region.

63
Example: Copy Machine
The three regions are encoded with two dummy variables, for example: region A: x1 = 0, x2 = 0; region B: x1 = 1, x2 = 0; region C: x1 = 0, x2 = 1.

The regression equation is:

E(y) = β0 + β1x1 + β2x2

64
Example: Copy Machine
Consider the following three variations of the regression equation:

E(y | region A) = β0
E(y | region B) = β0 + β1
E(y | region C) = β0 + β2

β1 would be interpreted as the mean difference between regions A and B; β2 is the mean difference between regions C and A.

When a categorical variable has k levels, k − 1 dummy variables are required.

65
Thank you!

Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.

66
R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 8: Model Diagnostics and Regularization

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
Generalized Linear Model

2
Generalized Linear Model
The major issues in model building are finding the proper functional form of the relationship and selecting the independent variables to be included in the model.

As a general framework for developing more complex relationships among the independent variables, we introduce the concept of a generalized linear model:

y = β0 + β1z1 + β2z2 + … + βkzk + ε

Each of the independent variables zj (where j = 1, 2, …, k) is a function of x1, x2, …, xp, the variables for which data are collected.

3
Example: Industrial Scales
A manufacturer of industrial scales and laboratory equipment want to investigate the relationship
between length of employment of their salespeople and the number of electronic laboratory scales
sold.
The number of scales sold by 15 randomly selected salespeople for the most recent sales period and the number of months each salesperson has been employed by the firm are plotted in a scatter diagram.

4
Example: Industrial Scales
The estimated regression is

5
Example: Industrial Scales
The standardized residual plot suggests that a curvilinear relationship is needed.

6
Example: Industrial Scales
To account for the curvilinear relationship, a second-order model with one predictor variable, E(y) = β0 + β1x + β2x², is developed.

7
Example: Industrial Scales
The new standardized residual plot shows that the previous curvilinear pattern has been
removed.

8
Transformation of Dependent Variable
In showing how the general linear model can be used to model a variety of possible relationships
between the independent variables and the dependent variable, we have focused attention on
transformations involving one or more of the independent variables.

It is worth considering transformations involving the dependent variable y.

9
Example: Automobile
A dataset shows the miles-per-gallon ratings and weights for 12 automobiles.

10
Example: Automobile
The estimated regression is

11
Example: Automobile
The standardized residual plot is indicative of a nonconstant variance.

12
Example: Automobile
Often the problem of nonconstant variance can be corrected by transforming the dependent
variable to a different scale.

13
Example: Automobile
The wedge-shaped pattern has now disappeared.

14
Model Diagnostics

15
Residual plot
A residual plot is a scatterplot of the residuals
against the predicted values.

With a residual plot, we can check two things:
• Linearity: whether the residuals distribution exhibits a curvature-like shape
• Constant variance: whether the distribution indicates a funnel shape
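In R, this plot is the first of lm's built-in diagnostic plots (assuming a fitted model object fit, e.g., the house-price model from the earlier sketch):

```r
# which = 1 draws residuals against fitted values; look for
# curvature (non-linearity) and funnel shapes (non-constant variance).
plot(fit, which = 1)
```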

16
Box-Cox method
If we see a curvature-like shape, the Box-Cox method can be applied.
The Box-Cox method can be used to find the best power transformation of the response variable.
In an actual application, it would be better to interpret this number and choose a power that makes sense to you.

[Plot: profile log-likelihood over the power λ, maximized here at λ = −1.52.]

17
Box-Cox method
Then what do we do with λ?

Simple rule of thumb: round λ to a convenient, interpretable power (e.g., λ ≈ −1 → 1/y, λ ≈ 0 → log y, λ ≈ 0.5 → √y).

More formally, the original form of the Box-Cox transformation is

y(λ) = (y^λ − 1)/λ if λ ≠ 0; y(λ) = log(y) if λ = 0

But it cannot deal with negative values of y. Instead:

y(λ) = ( (y + λ2)^λ1 − 1 )/λ1 if λ1 ≠ 0; y(λ) = log(y + λ2) if λ1 = 0

where λ2 is a constant chosen such that y + λ2 > 0.

18
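In R, the MASS package performs this search (assuming the fitted model fit has a strictly positive response):

```r
library(MASS)

# boxcox() plots the profile log-likelihood over a grid of lambda values.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood
```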
Heteroskedasticity
Linear regression contains an assumption that residuals are identically distributed across every X variable.
If that holds, the error terms are homoskedastic, meaning the errors have the same scatter regardless of the value of X.
When the errors vary depending on the value of one or more Xs, the error terms are heteroskedastic.

19
19
Heteroskedasticity
Use the Scale-Location plot to identify heteroskedasticity.

[Plot: fitted values on the x-axis against the square root of the standardized residuals on the y-axis, with a red trend line.]

It displays the fitted values of a regression model along the x-axis and the square root of the standardized residuals along the y-axis.
1. Verify that the red line is roughly horizontal across the plot.
2. Verify that there is no clear pattern among the residuals.

20
Heteroskedasticity
The White test is a method to identify whether or not the error variances are all equal. (In this example the test's p-value is 0.032, so at the .05 level we reject equal variances.)

How do we address the heteroskedasticity issue?
• Transform the dependent variable (normally a log-transformation)
• Re-specify the model (e.g., including omitted variables)
• Try other methods (robust standard errors or weighted least squares regression, etc.)
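A White-style check can be run with the lmtest package's Breusch-Pagan test by supplying an auxiliary formula with the squared fitted values (a sketch, assuming the model fit; the classic White test would also include cross-products of the regressors):

```r
library(lmtest)

# bptest() regresses squared residuals on the given terms;
# a small p-value indicates heteroskedasticity.
bptest(fit, ~ fitted(fit) + I(fitted(fit)^2))
```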

21
Check for normality
Use the Normal Q-Q plot, which plots the standardized residuals against the theoretical quantiles of the Normal distribution.

Graphs: Normal Q-Q plot. Normality tests: Shapiro-Wilk test.

22
Frequency distribution

By Matthew E. Clapham from UC Santa Cruz


23
Normal Q-Q plot

Note: the slope of the reference line is the standard deviation of the observations.
By Matthew E. Clapham from UC Santa Cruz
24
Shapiro-Wilk test

H0: the data is normally distributed.


H1: the data is not normally distributed.
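Both checks are one line each in R (assuming the fitted model fit):

```r
plot(fit, which = 2)          # Normal Q-Q plot of standardized residuals
shapiro.test(residuals(fit))  # a small p-value rejects normality
```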

25
Multicollinearity
Multicollinearity refers to a situation in which two or more
variables in a regression model are highly correlated with
each other.
Variance Inflation Factors (VIF) measure how much the
variance of the estimated coefficients are increased over the
case of no correlation among the X variables.
• If all VIFs are 1, no two X variables are correlated.
• If the VIF for one of the variables is around or greater than 5, there is collinearity associated with that variable.
• If two or more variables have high VIFs, one or more of these variables must be dropped. They can be dropped one at a time, selecting the regression equation with the higher R-squared. (Note: when one variable is dropped, VIFs for all the remaining variables need to be re-calculated.)
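In R, VIFs come from the car package (assuming a model with at least two predictors, e.g., the hypothetical fit_repair from the earlier sketch):

```r
library(car)

# One VIF per predictor; values near 1 indicate little collinearity.
vif(fit_repair)
```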

26
Influential observations
Cook's Distance (or Cook's D) is calculated by comparing the regression model with and without a particular observation.
It measures how much the estimated coefficients change when that observation is removed from the dataset.

An observation is considered influential if its Cook's distance is larger than 1 or 4/n (whichever is smaller), where n is the number of data points.

[Plot: Cook's distance for each observation index.]

27
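In R (again assuming the fitted model fit):

```r
cd <- cooks.distance(fit)           # one value per observation
which(cd > min(1, 4 / length(cd)))  # flag potentially influential points
plot(fit, which = 4)                # built-in Cook's distance plot
```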
Influential observations
Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
High-leverage points are potential outliers with respect to the independent variables.

We can also incorporate Cook's distance and leverage in one plot: if any point in this plot falls outside of Cook's distance = 1 (the red dashed lines), it is considered an influential observation.

28
Influential observations
How do we deal with influential observations?
1. Verify that the observation is not an error.
• Before you take any action, you should first verify that the influential observation(s) are not a result of a
data entry error or some other odd occurrence.
2. Remove the influential observations.
• You may decide to simply remove the influential observations if the model you specified seems to fit the
data well except for the one or two influential observations.
3. Attempt to fit another regression model.
• Influential observations could indicate that the model you specified does not provide a good fit to the
data.

29
Case Study: Sales Prediction

By Darek Kane
30
Grape juice
Your supermarket SuperMart is selling a new type of
grape juice in some of its stores for pilot testing.
The marketing team wants to develop a model with the following variables to predict the sales of this new type of grape juice.

31
Data summary
From the summary table, we can roughly know the basic statistics of each numeric variable.

For example, the mean value of sales is 216.7 units, the min value is 131, and the max value is 335.

We can further explore the distribution of the sales data by visualizing it in graphical form. The sales data distribution is roughly normal.

32
Question #1
To predict the sales of grape juice in a store, what statistical analysis technique should we use?
• A multiple linear regression model is suitable.
• Here, "sales" is the dependent variable and the others are independent variables.

Let's investigate the correlation between the sales and the other variables by displaying the correlation coefficients in pairs.
• The correlation coefficients between sales and price, ad type, price apple, and price cookies are 0.85, 0.58, 0.37, and 0.37 respectively, which means they all might have some influence on the sales.

33
Multiple Linear Regression analysis
We can first try to add all of the independent variables into the regression model:
• The p-values for Price, Ad Type, and Price Cookies are much less than 0.05. They are significant in explaining the sales, so we are confident to include these variables in the model.
• The p-value of Price Apple is a bit larger than 0.05, so there seems to be no strong evidence for the apple juice price to explain the sales.
• However, according to our real-life experience, we know that when the apple juice price is lower, consumers are likely to buy more apple juice, and then the sales of other fruit juices may decrease. So we still keep it in the model to explain the grape juice sales.
• The Adjusted R-squared is 0.881, which indicates a reasonable goodness of fit: 88% of the variation in sales can be explained by the four variables.
34
Question #2
What model diagnostics do we need to perform?
• The residual plot shows that the residuals scatter around the
fitted line with no obvious pattern.
• From the Scale-Location plot, we see acceptable constant
variance on standardized residuals.
• The Normal Q-Q graph shows that basically the residuals are
normally distributed.
• The VIF test value for each variable is close to 1, which means
the multicollinearity is very low among these variables.
The final model:
• Sales = 774.81 - 51.24 * Price + 29.74 * Ad Type + 22.1 * Price
Apple - 25.28 * Price Cookies

35
Pitfalls of R-squared

36
R-squared

r² = SSR / SST = regression sum of squares / total sum of squares, with 0 ≤ r² ≤ 1

[Graph: recap illustration of r² on a scatter plot.]

37
Checking R-squared
Is it really bad to have low R-squared?
Not really.

Sometimes, if your R-squared value is low but you have statistically


significant predictors, you could still draw important conclusions about
how changes in the predictor values are associated with changes in the
response value.

38
Pitfalls of R-squared
1. R-squared can be arbitrarily low when the model is completely correct.
• We simulate data points by adding normally distributed noise (error) to the dependent variable.
• By making σ larger, we drive R-squared towards 0, even when every assumption of the simple linear regression model is correct.

[R demo: adding random noise with increasing sigma.]

39
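A minimal reproduction of this demo in R:

```r
# The true model y = 2 + 0.5x is correct in every run;
# only the noise level sigma changes, yet R-squared collapses.
set.seed(1)
x <- seq(1, 10, length.out = 100)
for (sigma in c(0.5, 2, 8)) {
  y <- 2 + 0.5 * x + rnorm(100, sd = sigma)
  cat("sigma =", sigma, "  R-squared =",
      round(summary(lm(y ~ x))$r.squared, 3), "\n")
}
```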
Pitfalls of R-squared
2. R-squared can be arbitrarily close to 1 when the model is totally wrong.
• Especially when we have non-linear data.

[R demo: data simulated from a non-linear relationship; a straight-line fit still yields a high R-squared.]

40
Pitfalls of R-squared
3. R-squared says nothing about prediction error, even with σ² exactly the same and no change in the coefficients.
• We're better off using Mean Square Error (MSE) or Root Mean Square Error (RMSE) as a measure of prediction error.

[R demo: when the X values are shrunk, R-squared drops, but the prediction errors remain the same.]

41
Regularization

42
Bias & Variance
Let's first discuss where model errors come from.

Error due to Bias
• The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.

Error due to Variance
• The error due to variance is taken as the variability of a model prediction for a given data point.

[Figure: high bias (underspecified/underfitting) versus high variance (overspecified/overfitting).]

43
Bias & Variance
There is a tradeoff between a model’s ability to minimize bias and variance.
Understanding these two types of error can help us diagnose model results and avoid the mistake
of over- or under-fitting.

[Figure: the bias-variance tradeoff as a function of model complexity.]
44
Regularization
Normally, we can increase explanatory variables or bring in non-linearity to reduce bias.
But how do we reduce the variance?

One common technique is regularization.

45
Ridge Regression
Linear Regression uses Ordinary Least Squares (OLS) to learn model coefficients:

min Σ (yi − ŷi)²   (OLS loss)

Ridge Regression incorporates one type of regularization technique (the L2 norm):

min Σ (yi − ŷi)² + λ Σ βj²   (OLS loss + L2 norm)

It is an alternative to dropping variables for solving the multicollinearity problem. It reduces variance and introduces bias in a way that does not completely drop variables.

46
LASSO Regression
Ridge regression cannot perform variable selection to reduce the model variance. LASSO regression has such an ability.
LASSO is an acronym for "Least Absolute Shrinkage and Selection Operator":

min Σ (yi − ŷi)² + λ Σ |βj|   (OLS loss + L1 norm)

The lasso is very competitive with ridge regression in regards to prediction error.

47
Shrinkage parameter
Why is LASSO regression able to perform variable selection but not Ridge regression?

Let's look at the trace plots of these two models:
• For Ridge regression, as λ increases, each coefficient βj approaches zero (but never equals it);
• For LASSO regression, as λ increases, some βj reach exactly zero.

How do we determine the shrinkage parameter λ? (It is commonly chosen by cross-validation.)

[Figure: coefficient trace plots for both models with λ increasing.]

48
Elastic Net
Is it possible to combine these two regression models and have both the characteristics and advantages?

Elastic Net reduces the impact of different features while not eliminating all of the features:

min Σ (yi − ŷi)² + λ2 Σ βj² + λ1 Σ |βj|   (OLS loss + L2 norm + L1 norm)

49
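All three penalized models are available in R through the glmnet package (a self-contained sketch on simulated data):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)  # five simulated predictors
y <- X[, 1] - 2 * X[, 2] + rnorm(100)    # only two of them matter

ridge <- glmnet(X, y, alpha = 0)  # alpha = 0 -> ridge (L2 penalty)
lasso <- glmnet(X, y, alpha = 1)  # alpha = 1 -> LASSO (L1 penalty)
                                  # 0 < alpha < 1 -> elastic net

cv <- cv.glmnet(X, y, alpha = 1)  # choose lambda by cross-validation
coef(cv, s = "lambda.min")        # LASSO zeroes out irrelevant predictors
```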
Thank you!

50
R & Business Analytics
Masters in Big Data and Business Analytics

Chapter 9: Logistic Regression and Model Building

Dr. Hao (Howard) ZHONG


Assistant Professor
Information & Operations Management
ESCP Business School

This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
Logistic Regression Model

2
Recall: Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, …, xp is called the multiple regression equation:

E(y) = β0 + β1x1 + β2x2 + … + βpxp

3
Logistic Regression
What if the dependent variable is categorical?

For instance, a bank might like to develop an estimated regression equation for predicting
whether a person will be approved for a credit card.
The dependent variable can be coded as y=1 if the bank approves the request for a credit card and
y=0 if the bank rejects the request for a credit card.

We can use logistic regression to estimate the probability that the bank will approve one credit
card application.

4
Logistic regression
One popular model is Logistic Regression, which is extended from Linear Regression:

zi = β0 + β1X1i + β2X2i + … + εi

Yi = e^zi / (1 + e^zi), or equivalently, Yi = 1 / (1 + e^−zi)

The output of logistic regression models is a probability between 0 and 1.

5
Maximum Likelihood Estimation
Recall that linear regression models are inferred using the Ordinary Least Squares (OLS) method; for logistic regression, we use the method called Maximum Likelihood Estimation (MLE).
What is likelihood? It is somewhat similar to probability, but not the same!

Typically, it requires many points


to calculate the likelihood…

6
Maximum Likelihood Estimation
Given the data points, Maximum Likelihood Estimation (MLE) finds the specific model which gives the highest likelihood.

7
Model inference
For logistic regression, MLE finds the coefficients which provide the estimates best fitting the logit curve.

[Figure: data points with the fitted logistic curve, shown on the X scale and on the linear-predictor Z scale.]

8
Direct Mail Promotion
An expensive four-color sales catalog has been designed and printed for direct mail promotion.
The managers would like to send it to only those customers who have the highest probability of using the coupon.
They think that annual spending at the stores and whether a customer has a store credit card are two variables that might be helpful in prediction.

The company sent the catalog to the selected 100 customers and noted whether the customer used
the coupon.

9
Direct Mail Promotion
The sample data for the first 10 catalog recipients are shown:

In the Coupon column, a 1 is recorded if the sampled customer used the coupon and 0 if not.

10
Direct Mail Promotion: Model Estimation
The nonlinear form of the logistic regression equation makes the method of computing estimates more complex. We will see later how to use R to provide the estimates.
We define the dependent and independent variables: y = 1 if the customer used the coupon (0 otherwise); x1 = annual spending at the stores ($1000s); x2 = 1 if the customer has a store credit card (0 otherwise).

The logistic regression equation is:

E(y) = e^(β0 + β1x1 + β2x2) / ( 1 + e^(β0 + β1x1 + β2x2) )

11
Direct Mail Promotion: Model Estimation
With the help of computer software, we generate the equation estimation.

By plugging in the values of x1 and x2, the estimated ŷ is the predicted probability.

12
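In R, logistic regression is fit with glm (a sketch on hypothetical data; the data frame promo and its columns coupon, spending, and card are assumptions, not the slide's actual sample):

```r
# Hypothetical sample: coupon = 1 if used, spending in $1000s,
# card = 1 if the customer holds the store credit card.
promo <- data.frame(
  coupon   = c(1, 0, 1, 0, 1, 1, 0, 0, 1, 0),
  spending = c(2.1, 3.4, 4.2, 1.5, 5.0, 2.8, 4.6, 1.9, 3.8, 2.5),
  card     = c(1, 1, 0, 0, 1, 0, 0, 1, 1, 0)
)

# family = binomial selects the logit link; coefficients come from MLE.
fit_logit <- glm(coupon ~ spending + card, data = promo, family = binomial)
summary(fit_logit)

# Predicted probability for a customer spending $2000 (x1 = 2) with a card.
predict(fit_logit, newdata = data.frame(spending = 2, card = 1),
        type = "response")

# Odds ratio for each variable: e^(coefficient).
exp(coef(fit_logit))
```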
Odds Ratio
The odds in favor of an event occurring is defined as the probability the event will occur divided by the probability the event will not occur:

odds = P(event) / ( 1 − P(event) )

The odds ratio measures the impact on the odds of a one-unit increase in only one of the independent variables:

Odds ratio = odds(x1, x2, …, xk + 1, …, xp) / odds(x1, x2, …, xk, …, xp)

13
Direct Mail Promotion: Odds Ratio
Suppose we want to compare the odds of using the coupon for customers who spend $2000 annually and have a Simmons credit card (x1 = 2 and x2 = 1) to the odds for customers who spend $2000 annually and do not have a Simmons credit card (x1 = 2 and x2 = 0).

The estimated odds of using the coupon for customers who have a Simmons credit card are 3 times the estimated odds for customers who do not have a Simmons credit card.

14
Direct Mail Promotion: Interprete Odds Ratio
The odds ratio for each independent variable is computed while holding all the other independent variables constant.

A unique relationship exists between the odds ratio for a variable and its corresponding regression coefficient: the odds ratio for xk equals e raised to its coefficient, e^(bk).

15
Model Evaluation

16
Model Evaluation
Logistic regression is also one example of a binary classification model, in which the target variable has two possible categorical outcomes.

How could we evaluate how well such a model performs?

A straightforward option is classification accuracy. Accuracy is a common evaluation metric


because it reduces classifier performance to a single number and it is very easy to measure.

17
Problems with Unbalanced Classes
Unfortunately, accuracy is simple but has some widely recognized problems.

[Example: a population of loan applicants of which only 1% are fraudulent.]

How do we develop a simple model to achieve a very high level of accuracy?

A model that simply classifies every applicant as "not fraudulent" is 99% accurate!

18
Confusion Matrix
To evaluate a logistic regression model (or any classification model), it is important to understand the notion of the confusion matrix.

[Figure: instances sorted by predicted score; a cutoff splits them into predicted positive and predicted negative, and together with the actual labels this yields the four confusion-matrix cells.]

TP = True Positive
FP = False Positive
FN = False Negative
TN = True Negative

19
A Better Solution: ROC Analysis
A better solution is to use a method that can accommodate uncertainty by showing the entire
space of performance possibilities.

A Receiver Operating Characteristic (ROC) graph is a two-dimensional plot of a classifier with the false positive rate on the x-axis against the true positive rate on the y-axis:

True positive rate (TPR) = TP / (TP + FN)
False positive rate (FPR) = FP / (FP + TN)

20
Interpreting ROC curve
Conceptually, we may imagine sorting the instances by prediction score and varying a threshold from −∞ to +∞ while tracing a curve through ROC space.

[Figure: the sorted score list with the threshold sweeping downwards; each step updates (FPR, TPR) and traces out the ROC curve.]

21
Points on ROC Graph

Point (0, 0) represents the strategy of never issuing a


positive classification.

Point (1,1) represents the strategy of unconditionally


issuing positive classifications.

Point (0, 1) represents perfect classification.

Informally, one point in ROC space is better than


another if it is closer to the upper-left corner.

22
Random Classifier

Note that a classifier in the lower-right triangle of a ROC graph has worse performance than a random classifier (whose curve is the main diagonal).

23
Area Under the ROC Curve (AUC)
An advantage of ROC graphs is that they decouple classifier performance from the conditions
under which the classifiers will be used.

Specifically, they are independent of the class proportions as well as the costs and benefits.

Do we have one numeric metric to summarize the curve?

Yes! The Area Under the ROC Curve (AUC).

24
Area Under the ROC Curve (AUC)
The area under the ROC curve (AUC) is simply the area under a classifier’s curve expressed as a
fraction of the unit square.
Its value ranges from zero to one.

Though a ROC curve provides more information than its area, the AUC is useful when a single
number is needed to summarize performance, or when nothing is known about the operating
conditions.

25
Optimal Cutoff Point
Is there an optimal cutoff point? How to find it?

Youden's index gives the optimal cutoff probability when sensitivity and specificity are treated as equally important:

Youden's index = max( sensitivity + specificity − 1 )
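In R, the pROC package handles the ROC curve, its AUC, and this optimal cutoff (a self-contained sketch on simulated outcomes and scores):

```r
library(pROC)

# Simulated 0/1 outcomes and predicted probabilities.
set.seed(1)
actual <- rbinom(200, 1, 0.4)
scores <- ifelse(actual == 1, rnorm(200, 0.6, 0.2), rnorm(200, 0.4, 0.2))

roc_obj <- roc(actual, scores)  # build the ROC curve
plot(roc_obj)                   # draw it
auc(roc_obj)                    # area under the curve

# Optimal cutoff by Youden's index (pROC's default best.method).
coords(roc_obj, x = "best", best.method = "youden")
```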

26
Model Building

27
When to Add or Delete Variables
We need to test whether it is advantageous to add one or more independent variables to a
multiple regression model.

This test is based on a determination of the amount of reduction in the error sum of squares
resulting from adding one or more independent variables to the model.

28
Recall: Trucking Company
Recall the trucking company example.

29
Trucking Company
The simple linear regression model uses x1, the number of miles traveled, as the only independent variable.

When x2, the number of deliveries, was added as a second independent variable, we obtained a new estimated regression equation.

Does adding the variable x2 lead to a significant reduction in SSE?

30
Trucking Company: Model Comparison
The ANOVA tables of the first and second model show that adding x2 resulted in a reduction of SSE from 8.029 to 2.299.

31
Trucking Company: Reduction of SSE
The reduction in SSE resulting from adding x2 to the model involving just x1 is 8.029 − 2.299 = 5.730.

Is this reduction of SSE significant? We use the F test.

32
F test
The hypotheses are H0: β2 = 0 versus H1: β2 ≠ 0, i.e., that the added variable contributes nothing beyond the variables already in the model.

F statistic:

F = [ ( SSE(reduced) − SSE(full) ) / (number of added variables) ] / [ SSE(full) / (n − p − 1) ]

where the full model has p independent variables.

33
Trucking Company: F test
The F statistic for testing whether the addition of x2 is statistically significant is

F = ( 5.730 / 1 ) / ( 2.299 / (10 − 2 − 1) ) = 17.45

We find that for a level of significance of α = .05, F.05 = 5.59.

Because F = 17.45 > F.05 = 5.59, we reject the null hypothesis and conclude that the addition of x2 is statistically significant.

34
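In R, this comparison is one call to anova (a sketch; the data frame trucking and its columns time, miles, and deliveries are hypothetical stand-ins for the example's data):

```r
# Hypothetical data frame in the spirit of the trucking example.
trucking <- data.frame(
  time       = c(9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1),
  miles      = c(100, 50, 100, 100, 50, 80, 75, 65, 90, 90),
  deliveries = c(4, 3, 4, 2, 2, 2, 3, 4, 3, 2)
)

fit1 <- lm(time ~ miles, data = trucking)               # reduced model
fit2 <- lm(time ~ miles + deliveries, data = trucking)  # full model
anova(fit1, fit2)  # partial F test for adding deliveries
```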
Variable selection procedure
Four types of variable selection procedure:
1. Forward selection
2. Backward elimination
3. Stepwise regression
4. Best-subsets regression

35
Forward selection
The forward selection procedure starts with no
independent variables.

It adds variables one at a time using the same


procedure as stepwise regression for determining
whether an independent variable should be entered
into the model.

The forward selection procedure does not allow a


variable to be removed from the model.

36
Backward elimination
The backward elimination procedure begins with a
model that includes all the independent variables.

It then deletes one independent variable at a time


using the same procedure as stepwise regression.

The backward elimination procedure does not


permit an independent variable to be reentered
once it has been removed.

37
Stepwise regression
The stepwise regression procedure begins each step by
determining whether any of the variables already in the
model should be removed.

If no independent variable can be removed from the


model, the procedure attempts to enter another
independent variable into the model.

Both decisions are made using the F test; a related R procedure is sketched below.
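R's built-in step() performs a comparable search, though it adds and drops terms by AIC rather than the F test (a sketch, assuming a full model such as fit2 from the previous sketch):

```r
# Stepwise search in both directions, starting from the full model.
fit_step <- step(fit2, direction = "both")
summary(fit_step)
```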

38
Best-subsets regression
None of stepwise regression, forward selection, and backward
elimination guarantees that the best model for a given number of
variables will be found.

The best-subsets regression enables the user to find, given a specified


number of independent variables, the best regression model.

The criterion used in determining which estimated regression equations


are best for any number of predictors could be R-squared or other
relevant metric.

39
Thank you!

40
