Statistics ESCP
This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
What do you expect from this course?
By the end of this course, you should have a better understanding of:
• How to perform statistical tests in various business scenarios
• When and how to perform an ANOVA analysis
• What non-parametric tests are and when we should use them
• What regression analyses are and how exactly they are done
• What kinds of model diagnostics we usually need to perform
• What logistic regression is and when we need it
• What the difference between the frequentist and Bayesian views on probabilities is
• What Bayes' Theorem is and how it relates to regression analyses
• How we can use R to perform all the analyses above
How will you be assessed?
We have ten lectures in total:
• seven lectures on theoretical and mathematical material
• three lectures on R
Fundamentals
Basic vocabulary of statistics
POPULATION
A population consists of all the items or individuals
about which you want to draw a conclusion.
SAMPLE
A sample is the portion of a population selected for
analysis. The sample is the “small group”.
PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.
STATISTIC
A statistic is a numerical measure that describes a
characteristic of a sample.
Types of variables
Variables fall into two broad groups:
• Categorical (Qualitative) — defined categories. Examples: marital status, political party, eye color.
• Numerical (Quantitative), further divided into:
  • Discrete — counted items. Examples: number of children, defects per hour.
  • Continuous — measured characteristics. Examples: weight, voltage.
Levels of measurement
• Nominal — categories with no natural order. Examples: gender, nationality.
• Ratio — numerical values with a true zero point. Examples: distance, weight.
Recovery period (days) from a disease
Source: Paul, S., Lorin, E. Estimation of COVID-19 recovery and decease periods in Canada using delay model. Sci Rep 11, 23763 (2021).
Measures of central tendency
[Figure: common measures of central tendency; the formula shown is the geometric mean, $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$]
Variation
[Figure: common measures of variation]
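As a quick illustration in R (the data vector is purely hypothetical), the usual measures of central tendency can be computed as follows; the geometric mean uses the log-mean identity, since base R has no built-in function for it:

x <- c(12, 15, 11, 18, 14, 13, 90)   # hypothetical data with one large value
mean(x)                 # arithmetic mean, pulled upward by the large value
median(x)               # median, robust to the large value
exp(mean(log(x)))       # geometric mean: (x1 * x2 * ... * xn)^(1/n)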
Shape of a distribution: skewness
Describes the amount of asymmetry in a distribution (symmetric or skewed).
Shape of a distribution: kurtosis
Describes the relative concentration of values in the center as compared to the tails.
Quartile measures
The data series should be sorted from low to high.
Quartiles split the ranked data into four segments with an equal number of values per segment.
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% larger.
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger).
• Only 25% of the observations are greater than Q3.
Five number summary
The five numbers can help describe the center, spread and shape of data.
[Figure: three distribution shapes, each marked with Q1, Q2, and Q3]
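A minimal sketch in R (the data vector is hypothetical): quantile() returns the quartiles, fivenum() the five-number summary, and summary() both plus the mean:

sales <- c(11, 12, 13, 16, 16, 17, 18, 21, 22)   # hypothetical sorted data
quantile(sales, probs = c(0.25, 0.50, 0.75))     # Q1, Q2 (median), Q3
fivenum(sales)                                   # min, Q1, median, Q3, max
summary(sales)                                   # five numbers plus the mean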
Probability Distributions
What are the common probability distributions?
Discrete Distributions
Continuous Distributions
Song, W. T. (2005). Relationships among some univariate distributions. IIE Transactions, 37(7), 651–656.
Probability distributions
[Figure: example density curves]
• Standard Normal: μ = 0, σ = 1
• Student's t: ν = 5
• Normal: μ = 1, σ = 2
• Student's t: ν = 2
Locating Extreme Outliers: Z-Score
Z-Score:
$Z = \dfrac{X - \bar{X}}{S}$
where X represents the data value, X̄ is the sample mean, and S is the sample standard deviation.
Locating Extreme Outliers: Z-Score
The Z-score is the number of standard deviations that a data value is from the mean.
A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater
than +3.0.
The larger the absolute value of the Z-score, the farther the data value is from the mean.
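A sketch in R (hypothetical data): compute the z-scores from the formula above and flag values with |Z| > 3 as extreme outliers:

set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 2), 95)  # hypothetical data plus one wild value
z <- (x - mean(x)) / sd(x)                 # Z = (X - Xbar) / S
x[abs(z) > 3]                              # values more than 3 SDs from the mean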
The Empirical Rule
The empirical rule approximates the variation of data in a bell-shaped distribution.
Approximately 68% of the data in a bell-shaped distribution is within one standard deviation of the mean, i.e. within μ ± 1σ.
The Empirical Rule
Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations
of the mean, or µ ± 2σ.
Approximately 99.7% of the data in a bell-shaped distribution lies within three standard
deviations of the mean, or µ ± 3σ.
Probability distributions
[Figure: example density curves]
• Student's t: ν = 5
• F: ν₁ = 1, ν₂ = 5
• Chi-square: ν = 5
Probability distributions
[Figure: example of a discrete distribution]
• Binomial: n = 30, p = 0.6
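In R, each of the distributions above has a d-prefixed density (or probability mass) function; a small sketch evaluating them at a point:

dnorm(0, mean = 0, sd = 1)         # standard normal density at 0
dt(0, df = 5)                      # Student's t density, nu = 5
dchisq(3, df = 5)                  # chi-square density, nu = 5
df(1, df1 = 1, df2 = 5)            # F density, nu1 = 1, nu2 = 5
dbinom(18, size = 30, prob = 0.6)  # binomial P(X = 18), n = 30, p = 0.6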
Confidence Interval Estimation
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size
Point Estimates
We estimate a population parameter with a sample statistic (point estimate):
• Mean: μ is estimated by X̄
• Proportion: π is estimated by p
Point and Interval Estimates
A point estimate is a single number, while an interval estimate provides additional information
about the variability of the estimate.
A confidence interval stretches from the lower confidence limit, through the point estimate, to the upper confidence limit; the distance between the limits is the width of the confidence interval.
Confidence Intervals
How much uncertainty is associated with a point estimate of a population parameter?
An interval estimate provides more information about a population characteristic than does a
point estimate.
Confidence Interval Estimate
An interval gives a range of values:
• Takes into consideration variation in sample statistics from sample to sample.
• Based on observations from one sample.
• Gives information about closeness to unknown population parameters.
• Stated in terms of level of confidence:
• e.g. 95% confident, 99% confident
• Can never be 100% confident
Estimation Process
From a population whose mean μ is unknown, we draw a random sample and compute X̄ = 50. We can then state, for example: "I am 95% confident that μ is between 40 and 60."
Confidence Interval General Formula
The general formula for all confidence intervals is:
Point Estimate ± (Critical Value)(Standard Error)
where:
• Point Estimate is the sample statistic estimating the population parameter of interest.
• Critical Value is a table value based on the sampling distribution of the point estimate and the desired confidence level.
• Standard Error is the standard deviation of the point estimate, e.g. σ/√n for the sample mean.
Confidence Level (1-α)
Suppose confidence level = 95%.
Also written (1 - α) = 0.95, so α = 0.05.
A specific interval either will contain or will not contain the true parameter.
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size
Confidence Intervals
Confidence intervals cover two cases:
• Population mean: σ known, or σ unknown
• Population proportion
Confidence Interval for mean (σ Known)
Assumptions
• Population standard deviation σ is known.
• Population is normally distributed.
Finding the Critical Value, Zα/2
Consider a 95% confidence interval: 1 − α = 0.95, so α = 0.05 and α/2 = 0.025 in each tail, giving Zα/2 = ±1.96.

Confidence Level | Confidence Coefficient (1 − α) | Zα/2 value
80% | 0.80 | 1.28
90% | 0.90 | 1.65
95% | 0.95 | 1.96
98% | 0.98 | 2.33
99% | 0.99 | 2.58
99.8% | 0.998 | 3.08
99.9% | 0.999 | 3.27
Intervals and Level of Confidence
Sampling distribution of the mean: the distribution of X̄ is centered at μX̄ = μ, with area α/2 in each tail beyond the interval limits.
Intervals extend from
$\bar{X} - Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$ to $\bar{X} + Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$
(1 − α)×100% of intervals constructed this way contain μ; (α)×100% do not. At the 95% level, any single interval that misses μ is one of the 5% that fail to contain it.
Example: Z-dist. Confidence Interval
A sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms. We know from past testing that the population standard deviation is 0.35 ohms.
Determine a 95% confidence interval for the true mean resistance of the population.
$\bar{X} \pm Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96\,(0.35/\sqrt{11}) = 2.20 \pm 0.2068$
1.9932 ≤ μ ≤ 2.4068
At a 99% confidence level, the interval estimate would widen.
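A sketch of the circuit example in R, computing the interval directly from the formula (qnorm supplies the critical value):

xbar <- 2.20;  sigma <- 0.35;  n <- 11
z <- qnorm(0.975)                      # Z_{alpha/2} for 95% confidence, ~1.96
xbar + c(-1, 1) * z * sigma / sqrt(n)  # 1.9932 to 2.4068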
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size
Confidence Intervals
Confidence intervals cover two cases:
• Population mean: σ known, or σ unknown
• Population proportion
Do You Ever Truly Know σ?
Probably not!
In virtually all real-world business situations, σ is not known.
If the population standard deviation σ is unknown, we can substitute the sample standard deviation, S.
Confidence Interval for Mean (σ Unknown)
Assumptions:
• Population standard deviation is unknown.
• Use Student’s t Distribution
Student’s t Distribution
The t distribution is a family of distributions.
The tα/2 value depends on the degrees of freedom (d.f.): d.f. = n − 1.
Degrees of Freedom (d.f.)
Idea: Number of observations that are free to vary after sample mean has been calculated.
t-dist. vs z-dist.
Note: t → Z as n increases.
t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal: t with df = 5 has fatter tails than t with df = 13, and the standard normal is the limiting case (t with df = ∞).
Selected t-dist. Values
With comparison to the Z value:

Confidence Level | t (10 d.f.) | t (20 d.f.) | t (30 d.f.) | Z (∞ d.f.)
0.80 | 1.372 | 1.325 | 1.310 | 1.28
0.90 | 1.812 | 1.725 | 1.697 | 1.645
0.95 | 2.228 | 2.086 | 2.042 | 1.96
0.99 | 3.169 | 2.845 | 2.750 | 2.58

Note: t → Z as n increases.
Example: t-dist. Confidence Interval
A random sample of n = 25 has X̄ = 50 and S = 8. Form a 95% confidence interval for μ.
$\bar{X} \pm t_{\alpha/2}\dfrac{S}{\sqrt{n}} = 50 \pm 2.0639\,(8/\sqrt{25})$
46.698 ≤ μ ≤ 53.302
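The same calculation with σ unknown uses qt with n − 1 degrees of freedom; a sketch for the example above:

xbar <- 50;  s <- 8;  n <- 25
t <- qt(0.975, df = n - 1)         # t_{alpha/2} with 24 d.f., ~2.0639
xbar + c(-1, 1) * t * s / sqrt(n)  # 46.698 to 53.302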
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size
Confidence Intervals
Confidence intervals cover two cases:
• Population mean: σ known, or σ unknown
• Population proportion
Confidence Interval Estimate
An interval estimate for the population proportion (π) can be calculated by adding an allowance for uncertainty to the sample proportion (p).
Confidence Interval Estimate:
$p \pm Z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$
where
• Zα/2 is the standard normal value for the level of confidence desired
• p is the sample proportion
• n is the sample size
Note: we must have np > 5 and n(1 − p) > 5.
Example: Confidence Intervals for Population Proportion
A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers.
$p \pm Z_{\alpha/2}\sqrt{p(1-p)/n} = 25/100 \pm 1.96\sqrt{0.25(0.75)/100} = 0.25 \pm 1.96\,(0.0433)$
0.1651 ≤ π ≤ 0.3349
• We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.
• Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of intervals formed from samples of size 100 in this manner will contain the true proportion.
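A sketch of the left-hander example in R; a related interval can also be obtained from prop.test, which uses slightly different internals:

p <- 25 / 100;  n <- 100
z <- qnorm(0.975)
p + c(-1, 1) * z * sqrt(p * (1 - p) / n)  # 0.1651 to 0.3349
# prop.test(25, 100) gives a Wilson-type interval with continuity correction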
Outline
Confidence Intervals and Confidence Levels
Confidence Intervals for Population Mean (σ known)
Confidence Intervals for Population Mean (σ unknown)
Confidence Intervals for Population Proportion
Determining Sample Size
Determining Sample Size
Sampling Error
The required sample size can be found to reach a desired margin of error (e) with a specified level
of confidence (1 - α).
Determining Sample Size for Mean
Starting from the confidence interval $\bar{X} \pm Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$, the sampling error (margin of error) is
$e = Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$
Determining Sample Size for Mean
Solving the margin-of-error formula for n gives the required sample size:
$n = \dfrac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}}$
Determining Sample Size for Mean
To determine the required sample size for the mean, you must know:
• The desired level of confidence (1 − α), which determines the critical value, Zα/2
• The acceptable sampling error, e
• The population standard deviation, σ
Example:
If σ = 45, what sample size is needed to estimate the mean within ±5 with 90% confidence?
$n = \dfrac{Z^{2}\sigma^{2}}{e^{2}} = \dfrac{(1.645)^{2}(45)^{2}}{5^{2}} = 219.19$
So use n = 220 (always round up).
If σ is unknown:
• Select a pilot sample and estimate σ with the sample standard deviation, S. That is, pick a small random sample, calculate S, and use it in place of σ.
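The sample-size calculation in R (note qnorm(0.95) = 1.645 for 90% confidence, and the result is rounded up):

sigma <- 45;  e <- 5
z <- qnorm(0.95)        # 1.645 for a 90% confidence level
n <- (z * sigma / e)^2  # 219.19
ceiling(n)              # round up: 220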
Determining Sample Size for Proportion
Determining Sample Size for Proportion
To determine the required sample size for the proportion, you must know:
• The desired level of confidence (1 − α), which determines the critical value, Zα/2
• The acceptable sampling error, e
• An estimate of the population proportion, π (e.g. from a pilot sample)
Solving for n:
$n = \dfrac{Z_{\alpha/2}^{2}\,\pi(1-\pi)}{e^{2}}$
Example:
How large a sample would be necessary to estimate the true proportion defective in a large population within ±3%, with 95% confidence?
(Assume a pilot sample yields p = 0.12.)
For 95% confidence, use Zα/2 = 1.96, e = 0.03, p = 0.12 (used to estimate π):
$n = \dfrac{Z_{\alpha/2}^{2}\,p(1-p)}{e^{2}} = \dfrac{(1.96)^{2}(0.12)(0.88)}{(0.03)^{2}} = 450.74$
So use n = 451 (always round up).
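The same calculation for the proportion example in R:

p <- 0.12;  e <- 0.03
z <- qnorm(0.975)             # 1.96 for 95% confidence
n <- z^2 * p * (1 - p) / e^2  # 450.74
ceiling(n)                    # round up: 451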
Ethical Issues
A confidence interval estimate (reflecting sampling error) should always be included when
reporting a point estimate.
Thank you!
R & Business Analytics
Masters in Big Data and Business Analytics
One-sample Tests
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions
What is a Hypothesis?
A hypothesis is a claim (assertion) about a population parameter.
• Population mean
Example: The mean monthly cell phone bill in this city is μ = $42
• Population proportion
Example: The proportion of adults in this city with cell phones is π = 0.68
The Null Hypothesis, H0
States the claim or assertion to be tested.
Example: The average diameter of a manufactured bolt is 30 mm. (H0: μ = 30)
The null hypothesis is always about a population parameter, not a sample statistic: H0: μ = 30, not H0: X̄ = 30.
The Null and Alternate Hypotheses
The Hypothesis Testing Process
Claim: The population mean age is 50.
H0: μ = 50, H1: μ ≠ 50
The Hypothesis Testing Process
Suppose the sample mean age was X̄ = 20.
This is significantly lower than the claimed mean population age of 50.
If the null hypothesis were true, the probability of getting such a different sample mean would be
very small, so you reject the null hypothesis.
In other words, getting a sample mean of 20 is so unlikely if the population mean was 50, you
conclude that the population mean must NOT be 50.
The Hypothesis Testing Process
[Figure: sampling distribution of X̄ centered at μ = 50, with the observed X̄ = 20 far in the left tail]
The Hypothesis Testing Process
If the sample mean is close to the stated population mean, the null hypothesis is NOT rejected.
If the sample mean is far from the stated population mean, the null hypothesis is rejected.
The critical value of a test statistic creates a “line in the sand” for decision making -- it answers
the question of how far is far enough.
The Test Statistic and Critical Values
[Figure: sampling distribution with a central region of non-rejection bounded by the critical values, and regions of rejection in both tails]
Possible Errors in Hypothesis Test
Type I Error
• Reject a true null hypothesis
Type II Error
• Failing to reject a false null hypothesis
Level of Significance and the Rejection Region
For a two-tail test, the level of significance α is split between the two tails (α/2 in each); the critical values mark the boundaries of the rejection regions.
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions
Hypothesis Tests for the Mean
Hypothesis tests for μ come in two forms: σ known (Z test) and σ unknown (t test).
Z Test of Hypothesis for the Mean (σ Known)
When σ is known, we use the Z test. The test statistic is:
$Z_{STAT} = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
Two-Tail Tests
H0: μ = μ0
H1: μ ≠ μ0
With level of significance α, the rejection regions are the two tails, each with area α/2, beyond the critical values −Zα/2 and +Zα/2.
Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not reject H0.
Critical Value Approach: 6 Steps
1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Determine the critical values that divide the rejection and non-rejection regions
5. Collect data and compute the value of the test statistic
6. Make the statistical decision and state the managerial conclusion.
• If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0.
• If the test statistic falls into the rejection region, reject the null hypothesis.
• Express the managerial conclusion in the context of the problem
Example: Manufactured Bolt
Test the claim that the true mean diameter of a manufactured bolt is 30mm, given the sample
statistic is 29.84mm. The sample size is 100 and σ = 0.8. (Assume α = 0.05)
Example: Manufactured Bolt
3. Determine the appropriate technique: σ is assumed known, so this is a Z test.
5. Collect data and compute the test statistic: with X̄ = 29.84, n = 100 and σ = 0.8,
$Z_{STAT} = \dfrac{29.84 - 30}{0.8/\sqrt{100}} = -2.0$
6. Is the test statistic in the rejection region?
Example: Manufactured Bolt
6 (continued). Reach a decision and interpret the result: with α/2 = 0.025 in each tail, the critical values are ±1.96.
Since ZSTAT = −2.0 < −1.96, reject the null hypothesis and conclude that there is sufficient evidence that the mean diameter of a manufactured bolt is not equal to 30 mm.
P-value Approach
P-value: Probability of obtaining a value equal to or more extreme than the observed sample
statistic given H0 is true.
• The p-value is also called the observed level of significance.
• It is the smallest value of α for which H0 can be rejected.
P-value Approach: 5 Steps
1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Collect data and compute the value of the test statistic and the p-value
5. Make the statistical decision and state the managerial conclusion.
• If the p-value < α then reject H0, otherwise do not reject H0.
• State the managerial conclusion in the context of the problem.
Example: Manufactured Bolt
Test the claim that the true mean diameter of a manufactured bolt is 30mm, given the sample
statistic is 29.84mm. The sample size is 100 and σ = 0.8. (Assume α = 0.05)
Example: Manufactured Bolt
3. Determine the appropriate technique: σ is assumed known, so this is a Z test.
4. Collect the data, compute the test statistic and the p-value:
n = 100, X̄ = 29.84 (σ = 0.8 is assumed known), so the test statistic is
$Z_{STAT} = \dfrac{29.84 - 30}{0.8/\sqrt{100}} = -2.0$
Example: Manufactured Bolt
5. Is the p-value < α?
The p-value is the probability of a Z value at least as extreme as ±2.0: p-value = 2 P(Z < −2.0) = 0.0456.
Since p-value = 0.0456 < α = 0.05, reject H0.
Tests of population mean: σ known
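Base R has no dedicated z-test function for this case; a minimal sketch for the bolt example computes the statistic and p-value directly:

xbar <- 29.84;  mu0 <- 30;  sigma <- 0.8;  n <- 100
z <- (xbar - mu0) / (sigma / sqrt(n))  # -2.0
2 * pnorm(-abs(z))                     # two-tailed p-value, ~0.0455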
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions
Hypothesis Testing: σ Unknown
If the population standard deviation is unknown, you use the sample standard deviation S.
Because of this change, you use the t distribution instead of the Z distribution to test the null
hypothesis about the mean.
t Test of Hypothesis for the Mean (σ Unknown)
When σ is unknown, we use the t test. The test statistic is:
$t_{STAT} = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$
Example: Hotel Cost
The average cost of a hotel room in New York is said to be $168 per night.
To determine if this is true, a random sample of 25 hotels is taken and resulted in an X̄ of $172.50
and S = $15.40. Test the appropriate hypotheses at α = 0.05.
H0: μ = 168
H1: μ ≠ 168
Example: Hotel Cost
α = 0.05, n = 25, df = 25 − 1 = 24, so the critical values are ±2.0639.
$t_{STAT} = \dfrac{172.50 - 168}{15.40/\sqrt{25}} = 1.46$
Since −2.0639 < 1.46 < 2.0639, do not reject H0: there is insufficient evidence that the true mean cost is different from $168.
One-Tail Tests
In many cases, the alternative hypothesis focuses on a particular direction.
Lower-Tail Tests
H0: μ ≥ 3
H1: μ < 3
There is only one critical value, since the rejection region is in only one tail (here, the lower tail).
Upper-Tail Tests
H0: μ ≤ 3
H1: μ > 3
The rejection region is in the upper tail, beyond the critical value Zα (or tα).
Example: Cell Phone Bills
A phone industry manager thinks that customer monthly cell phone bills have increased, and
now average over $52 per month.
The company gathered a sample with n=25, X̄ =53.1 and S=10. (Assume α = 0.10)
Example: Cell Phone Bills
This is an upper-tail test: H0: μ ≤ 52, H1: μ > 52.
Obtain the sample and compute the test statistic:
$t_{STAT} = \dfrac{53.1 - 52}{10/\sqrt{25}} = 0.55$
Reach a decision and interpret the result: with α = 0.10 and df = 24, the critical value is 1.318. Since tSTAT = 0.55 < 1.318, do not reject H0.
Example: Cell Phone Bills
Or, using the p-value approach, calculate the p-value and compare it to α:
p-value = P(t24 > 0.55) = 0.2937
Since p-value = 0.2937 > α = 0.10, do not reject H0: there is insufficient evidence that monthly bills average over $52.
Tests of population mean: σ unknown
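With raw data, t.test does all of this in one call; from summary statistics alone (as in the hotel example), the statistic and p-value can be computed directly. A sketch:

xbar <- 172.50;  mu0 <- 168;  s <- 15.40;  n <- 25
t <- (xbar - mu0) / (s / sqrt(n))  # 1.46
2 * pt(-abs(t), df = n - 1)        # two-tailed p-value, ~0.157
# with raw data: t.test(x, mu = 168, alternative = "two.sided")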
Outline
Hypothesis and Hypothesis Testing Process
Errors in Hypothesis Test
Hypothesis Tests for the Mean (σ known)
Hypothesis Tests for the Mean (σ unknown)
Hypothesis Tests for Proportions
Hypothesis Tests for Proportions
Involves categorical variables, typically with two possible outcomes
• Possesses characteristic of interest.
• Does not possess characteristic of interest.
Proportions
The sample proportion in the category of interest is denoted by p:
$p = \dfrac{X}{n} = \dfrac{\text{number in category of interest}}{\text{sample size}}$
When both nπ and n(1 − π) are at least 5, p can be approximated by a normal distribution with mean and standard deviation
$\mu_p = \pi, \qquad \sigma_p = \sqrt{\dfrac{\pi(1-\pi)}{n}}$
Hypothesis Tests for Proportions
When nπ ≥ 5 and n(1 − π) ≥ 5, the test statistic is
$Z_{STAT} = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$
When nπ < 5 or n(1 − π) < 5, the required exact test is not discussed in this chapter.
Proportion Test wrt. Number in Category of Interest
An equivalent form uses X, the number in the category of interest. When X ≥ 5 and n − X ≥ 5, the test statistic is
$Z_{STAT} = \dfrac{X - n\pi}{\sqrt{n\pi(1-\pi)}}$
When X < 5 or n − X < 5, the required test is not discussed in this chapter.
Example: Mailing
A marketing company claims that it receives 8% responses from its mailing.
To test this claim, a random sample of 500 people were surveyed, with 25 responses.
Test at the α = 0.05 significance level.
Check: nπ = (500)(0.08) = 40 and n(1 − π) = (500)(0.92) = 460, so the normal approximation applies. ✓
p = 25/500 = 0.05
Example: Critical Value Solution
H0: π = 0.08, H1: π ≠ 0.08
α = 0.05, n = 500, p = 0.05
Test Statistic:
$Z_{STAT} = \dfrac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} = \dfrac{0.05 - 0.08}{\sqrt{\dfrac{0.08(1 - 0.08)}{500}}} = -2.47$
Critical Values: ±1.96 (with α/2 = .025 in each tail)
Decision: Since ZSTAT = −2.47 < −1.96, reject H0 at α = 0.05.
Conclusion: There is sufficient evidence to reject the company's claim of an 8% response rate.
Example: P-value Solution
Calculate the p-value and compare it to α. (For a two-tail test the p-value is always two-tail.)
p-value = P(Z < −2.47) + P(Z > 2.47) = 2(0.0068) = 0.0136
Since p-value = 0.0136 < α = 0.05, reject H0.
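A sketch of the mailing example in R; prop.test without continuity correction reproduces the chi-square equivalent of this z test:

p <- 25/500;  pi0 <- 0.08;  n <- 500
z <- (p - pi0) / sqrt(pi0 * (1 - pi0) / n)  # -2.47
2 * pnorm(-abs(z))                          # two-tailed p-value, ~0.0134
# prop.test(25, 500, p = 0.08, correct = FALSE) reports X-squared = z^2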
Potential Pitfalls and Ethical Considerations
Use randomly collected data to reduce selection biases.
Choose the level of significance, α, and the type of test (one-tail or two-tail) before data collection.
Do NOT practice “data cleansing” to hide observations that do not support a stated hypothesis.
Report all pertinent findings including both statistical significance and practical importance.
Two-sample Tests
Two-Sample Tests
Two-sample tests cover four situations:
• Population means, independent samples — e.g. Group 1 vs. Group 2
• Population means, related samples — e.g. same group before vs. after treatment
• Population proportions — e.g. Proportion 1 vs. Proportion 2
• Population variances — e.g. Variance 1 vs. Variance 2
Outline
Hypothesis Tests for Two Means (Independent Populations)
Hypothesis Tests for Two Means (Related Populations)
Hypothesis Tests for Two Proportions (not covered)
Hypothesis Tests for Two Variances (not covered)
Two Means: Independent Populations
When σ1 and σ2 are unknown but assumed equal (σ1 = σ2 = σ), the point estimate for the difference μ1 − μ2 is X̄1 − X̄2, and the two sample variances are pooled into a common estimate:
$S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$
Tests for Two Independent Population Means
Two population means, independent samples:
• Lower-tail test: H0: μ1 − μ2 ≥ 0 vs. H1: μ1 − μ2 < 0 (rejection region of area α in the lower tail)
• Upper-tail test: H0: μ1 − μ2 ≤ 0 vs. H1: μ1 − μ2 > 0 (rejection region of area α in the upper tail)
• Two-tail test: H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 ≠ 0 (rejection regions of area α/2 in each tail)
With σ1 and σ2 unknown but assumed equal, the test statistic is
$t_{STAT} = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
where tSTAT has d.f. = n1 + n2 − 2. The hypothesized difference μ1 − μ2 can be any value; in this case we are taking it to be 0.
Example: Dividend Yield
You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ? You collect the following data:

 | NYSE | NASDAQ
Count | 21 | 25
Sample mean | 3.27 | 2.53
Sample std dev | 1.30 | 1.16

Assuming both populations are approximately normal with equal variances, is there a difference in mean yield (α = 0.05)?
Example: Dividend Yield
$S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)} = \dfrac{(21 - 1)(1.30)^2 + (25 - 1)(1.16)^2}{(21 - 1) + (25 - 1)} = 1.5021$
$t_{STAT} = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{(3.27 - 2.53) - 0}{\sqrt{1.5021\left(\dfrac{1}{21} + \dfrac{1}{25}\right)}} = 2.040$
Example: Dividend Yield
H0: μ1 − μ2 = 0, i.e. μ1 = μ2 (σ1 and σ2 unknown, assumed equal)
With d.f. = 44 and α = 0.05, the critical values are ±2.0154. Since tSTAT = 2.040 > 2.0154, reject H0: there is evidence of a difference in mean dividend yield.
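From the summary statistics alone, the pooled-variance test can be sketched in R (with raw data, t.test(x, y, var.equal = TRUE) is equivalent):

n1 <- 21;  m1 <- 3.27;  s1 <- 1.30
n2 <- 25;  m2 <- 2.53;  s2 <- 1.16
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # 1.5021
t   <- (m1 - m2) / sqrt(sp2 * (1/n1 + 1/n2))                # 2.040
2 * pt(-abs(t), df = n1 + n2 - 2)                           # p-value, ~0.047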
Tests for Two Related Population Means
Tests the means of two related populations:
• Paired or matched samples
• Repeated measures (before/after)
Use the difference between paired values: Di = X1i − X2i
Assumption: the differences are normally distributed.
Tests for Two Related Population Means
The i-th paired difference is Di = X1i − X2i.
The paired-difference sample mean is
$\bar{D} = \dfrac{\sum_{i=1}^{n} D_i}{n}$
The Paired Difference Test: Possible Hypotheses
Paired samples can be tested against H0: μD = 0 (two-tail), μD ≥ 0 (lower-tail), or μD ≤ 0 (upper-tail).
The sample standard deviation of the differences is
$S_D = \sqrt{\dfrac{\sum (D_i - \bar{D})^2}{n - 1}}$
Example (complaints before and after training; e.g. salesperson R.K.: 0 before, 0 after, difference 0; M.O.: 4 before, 0 after, difference −4): for n = 5 salespeople, ΣDi = −21, so D̄ = −4.2 and SD = 5.67.
Example:
Has the training made a difference in the number of complaints (at the 0.01 level)?
H0: μD = 0; H1: μD ≠ 0
α = .01, d.f. = n − 1 = 4, so the critical values are ±4.604.
Test Statistic:
$t_{STAT} = \dfrac{\bar{D} - \mu_D}{S_D/\sqrt{n}} = \dfrac{-4.2 - 0}{5.67/\sqrt{5}} = -1.66$
Decision: Do not reject H0 (tSTAT is not in the rejection region).
Conclusion: There is NOT a significant change in the number of complaints.
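With the before/after columns as vectors, the paired test is a one-liner in R. The counts below are an assumption reconstructed to match the D̄ = −4.2 and SD = 5.67 reported above:

before <- c(6, 20, 3, 0, 4)  # assumed complaint counts before training, n = 5
after  <- c(4, 6, 2, 0, 0)   # assumed counts after training
t.test(after, before, paired = TRUE, conf.level = 0.99)  # t = -1.66
# equivalent to a one-sample t test on D = after - before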
Thank you!
R & Business Analytics
Masters in Big Data and Business Analytics
ANOVA
ANOVA Definition
Analysis of Variance (ANOVA):
A technique used to test simultaneously whether the means of several populations are equal. It
uses the F distribution as the distribution of the test statistic.
Assumptions:
For each population, the response variable is normally distributed.
The observations must be independent.
ANOVA Definition
Suppose there are three populations with means μ1, μ2 and μ3. We test:
H0: μ1 = μ2 = μ3
H1: Not all population means are equal
Our objective is to determine whether the observed differences in the three sample means are large enough to reject H0.
In other words:
If the variability among the sample means is "small", we do not reject H0;
If the variability among the sample means is "large", we reject H0.
Graphical Illustration
Suppose H0 is true: the sample means all come from a single sampling distribution.
Graphical Illustration
Suppose H0 is false: the sample means come from different sampling distributions.
Conceptual Overview
The logic behind ANOVA is based on the development of two independent estimates of the common population variance σ²:
1. One estimate of σ² is based on the variability among the sample means themselves;
2. The other estimate of σ² is based on the variability of the data within each sample.
By comparing these two estimates of σ², we will be able to determine whether the population means are equal.
Between-treatments Estimate
Sum of squares due to treatments (SSTR), where nj is the size and X̄j the mean of the j-th sample, and X̄ is the overall sample mean:
$SSTR = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2, \qquad MSTR = \dfrac{SSTR}{k - 1}$
Within-treatments Estimate
Sum of squares due to error (SSE), where Xij is the i-th observation of sample j:
$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2, \qquad MSE = \dfrac{SSE}{n_T - k}$
k is the number of populations, nj is the number of elements in the j-th sample, and nT is the total number of observations.
Total Sum of Squares
Total sum of squares (SST):
$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2$
or equivalently SST = SSTR + SSE.
ANOVA Test
The test statistic is F = MSTR/MSE. F is always positive, so the ANOVA test is always an upper-tail test.
Example: Assembly Line
One company developed a new filtration system for municipal water supplies.
The industrial engineering group is responsible for determining the best assembly method for the
new filtration system.
The group narrows the alternatives to three: method A, method B, and method C.
Three groups of workers are randomly selected to assemble the system using three different
methods respectively.
The manager would like to know whether the mean number of units produced per week is the
same for all three populations (methods) at α = 0.05.
Example: Assembly Line
[Table: number of units produced by the 15 workers in the three groups, with each group's sample mean X̄j and sample variance Sj²]
Example: Assembly Line
The between-treatments (SSTR, MSTR) and within-treatments (SSE, MSE) estimates of variance are computed from the table; as a check, SST = SSTR + SSE.
Comparing the Estimates: F test
The test statistic for the F test is
$F = \dfrac{MSTR}{MSE}$
with k − 1 numerator and nT − k denominator degrees of freedom.
Example: Assembly Line
The value of the test statistic is F = 9.18.
Example: Assembly Line
Because F = 9.18 is greater than the critical value F.01 = 6.93, the area in the upper tail at F = 9.18 is less than .01.
With p-value ≤ α = .05, H0 is rejected.
The test provides sufficient evidence to conclude that the means of the three populations are not equal.
Probability distributions
[Figure: example density curves]
• Student's t: ν = 5 (note the relation $t^2_\nu = F_{1,\nu}$)
• F: ν₁ = 1, ν₂ = 5
• Chi-square: ν = 5
ANOVA Table
The results of the preceding calculations can be displayed conveniently in a table referred to as
the analysis of variance or ANOVA table.
Example: Assembly Line
ANOVA table for Assembly Line example:
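A sketch of a one-way ANOVA in R with aov. The 15 observations below are an assumption reconstructed to be consistent with the F = 9.18 reported above:

units  <- c(58, 64, 55, 66, 67,   # assumed method A
            58, 69, 71, 64, 68,   # assumed method B
            48, 57, 59, 47, 49)   # assumed method C
method <- factor(rep(c("A", "B", "C"), each = 5))
fit <- aov(units ~ method)
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)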
Goodness of Fit Test
Example: Market Share Study
Based on a market share study, last year the market shares stabilized at: 30% for company A, 50%
for company B, and 20% for company C.
Recently company C developed a “brand-new and improved” product.
We need to conduct a sample survey and compute the proportion of customers preferring each
company’s product.
A hypothesis test will then be conducted to see whether the new product caused a change in
market shares at α = 0.05.
Example: Market Share Study
Assume our observed frequencies are as given in the survey results table.
We need to test the difference between the observed frequencies and the expected frequencies.
Example: Market Share Study
The null and alternative hypotheses:
H0: pA = 0.30, pB = 0.50, pC = 0.20
H1: The population proportions are not pA = 0.30, pB = 0.50, pC = 0.20
Test Statistic
We define the test statistic for goodness of fit:
$\chi^2 = \sum_{i=1}^{k} \dfrac{(f_i - e_i)^2}{e_i}$
where fi is the observed and ei the expected frequency for category i. The test statistic has a chi-square distribution with k − 1 degrees of freedom, provided that the expected frequencies are 5 or more for all categories.
Example: Market Share Study
Carrying out the calculation of the test statistic gives χ² = 7.34.
Example: Market Share Study
By checking the chi-square table (with k − 1 = 2 degrees of freedom), we see that the test statistic 7.34 is between 5.991 and 7.378. Thus, the corresponding upper-tail area, or p-value, must be between .05 and .025.
With a p-value less than .05, we reject H0 and conclude that the introduction of the new product by company C will alter the current market share structure.
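In R, chisq.test performs the goodness of fit test directly. The counts below are an assumption chosen to be consistent with the χ² = 7.34 reported above (n = 200 surveyed customers):

observed <- c(A = 48, B = 98, C = 54)          # assumed survey counts, n = 200
chisq.test(observed, p = c(0.30, 0.50, 0.20))  # X-squared = 7.34, df = 2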
Goodness of fit test
1. State the null and alternative hypotheses:
H0: the observed frequencies are equivalent to the expected frequencies.
H1: the observed frequencies are different from the expected frequencies.
2. Assume the null hypothesis is true and determine the expected frequency ei in each category by multiplying the category probability by the sample size.
3. Compute the value of the test statistic, $\chi^2 = \sum_i (f_i - e_i)^2/e_i$.
4. Decision rule: reject H0 if the p-value ≤ α (equivalently, if χ² exceeds the critical value with k − 1 degrees of freedom).
Probability distributions
The F distribution is the ratio of two independent chi-square variables, each divided by its degrees of freedom:
$F_{\nu_1,\nu_2} \equiv \dfrac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2}$
[Figure: example density curves]
• F: ν₁ = 1, ν₂ = 5
• Chi-square: ν = 5
Test of Independence
Example: Beer Drinkers
A manufacturer distributes three types of beer: light, regular and dark.
The firm's market research group raises the question of whether preferences for the three beers differ among male and female beer drinkers.
A test of independence can address the question of whether the beer preference (light, regular, or dark) is independent of the gender of the beer drinker (male, female) at α = 0.05.
A simple random sample of 150 beer drinkers is selected. The data is summarized using a contingency table.
[Table: observed frequencies by gender and beer type]
Example: Beer Drinkers
The hypotheses are:
H0: Beer preference is independent of the gender of the beer drinker
H1: Beer preference is not independent of the gender of the beer drinker
Thought process:
1. Determine the expected frequencies under the assumption of independence between beer
preference and gender of the beer drinker
2. Use the goodness of fit test to determine whether there is a significant difference between
observed and expected frequencies.
Example: Beer Drinkers
In the entire sample of 150 beer drinkers, we have:
• 50/150 = 1/3 prefer light beer
• 70/150 = 7/15 prefer regular beer
• 30/150 = 1/5 prefer dark beer
If the independence assumption is valid, we argue that these fractions must be applicable to both
male and female beer drinkers.
Then the expected frequencies are computed cell by cell; for example, for a row of 80 drinkers, the expected light-beer count is 80 × 1/3.
[Table: expected frequencies by gender and beer type]
Example: Beer Drinkers
The test procedure for comparing the observed frequencies with the expected frequencies is similar to the goodness of fit calculations.
We define the test statistic for independence:
$\chi^2 = \sum_i \sum_j \dfrac{(f_{ij} - e_{ij})^2}{e_{ij}}$
With n rows and m columns in the contingency table, the test statistic has a chi-square distribution with (n − 1)(m − 1) degrees of freedom.
Example: Beer Drinkers
Carrying out the calculation of the test statistic gives χ² = 6.12.
Beer drinkers: calculation
By checking the chi-square table (with (2 − 1)(3 − 1) = 2 degrees of freedom), we see that the test statistic 6.12 is between 5.991 and 7.378. Thus, the corresponding upper-tail area, or p-value, must be between .05 and .025.
With a p-value less than .05, we reject H0 and conclude that beer preference is not independent of the gender of the beer drinker.
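In R, chisq.test applied to a contingency table performs the test of independence. The counts below are an assumption reconstructed to be consistent with the preference fractions and the χ² = 6.12 above:

beer <- matrix(c(20, 40, 20,    # assumed male counts: light, regular, dark
                 30, 30, 10),   # assumed female counts
               nrow = 2, byrow = TRUE,
               dimnames = list(c("male", "female"),
                               c("light", "regular", "dark")))
chisq.test(beer)  # X-squared = 6.12, df = 2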
Test of Independence
1. State the null and alternative hypotheses:
H0: the column variable is independent of the row variable.
H1: the column variable is not independent of the row variable.
2. Assume the null hypothesis is true and compute the expected frequency for each cell in the contingency table.
3. Compute the value of the test statistic, $\chi^2 = \sum_i \sum_j (f_{ij} - e_{ij})^2/e_{ij}$.
4. Decision rule: reject H0 if the p-value ≤ α (equivalently, if χ² exceeds the critical value with (n − 1)(m − 1) degrees of freedom).
Thank you!
Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.
R & Business Analytics
Masters in Big Data and Business Analytics
Parametric vs. Non-parametric
Parametric methods begin with an assumption about the probability distribution of the population, often that the population has a normal distribution. Most statistical methods referred to as parametric require quantitative data, while non-parametric methods are appropriate when data are measured on an ordinal scale of measurement.
Sign Test
Sign Test
The sign test is a versatile non-parametric method for hypothesis testing that uses the binomial distribution with p = 0.5 as the sampling distribution.
It does not require an assumption about the distribution of the population.
Its main application is a test about a population median.
The median is the measure of central tendency that divides the population so that 50% of the values are greater than the median and 50% of the values are less than the median.
When a population distribution is skewed, the median is often preferred over the mean as the best measure of central location for the population.
Example: Grocery Store
The manager of Lawler's Grocery Store estimates that median sales of a new potato chip should be $450 per week on a per-store basis.
After carrying the product for three months, Lawler's management requested the following hypothesis test about the population median weekly sales:
H0: median = 450; H1: median ≠ 450
Example: Grocery Store
Data showing one-week sales at 10 randomly selected stores are provided:
Transforming Data
If the observation is greater than the hypothesized value, we record a plus sign “+”.
If the observation is less than the hypothesized value, we record a minus sign “−”.
If an observation is exactly equal to the hypothesized value, the observation is eliminated from
the sample and the analysis proceeds with the smaller sample size.
Example: Grocery Store
According to the hypothesized median 450, we have:
Binomial probability function
In a binomial experiment, our interest is in the number of successes occurring in the n trials. If x denotes the number of successes in n trials, the probability of exactly x successes is
$f(x) = \binom{n}{x} p^x (1-p)^{n-x}$
Example: Grocery Store
The binomial probabilities for the number of plus signs, under the assumption that H0 is true (n = 10, p = 0.5):
Example: Grocery Store
The observed number of plus signs is 7. Using R to calculate the upper-tail probabilities:
dbinom(7, size=10, prob=0.5) = 0.1172
dbinom(8, size=10, prob=0.5) = 0.0439
…
Adding these probabilities we have .1172 + .0439 + .0098 + .0010 = .1719.
Since we are using a two-tailed hypothesis test, this upper-tail probability is doubled to obtain the p-value = 2(.1719) = .3438.
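The sign test is just a binomial test on the number of plus signs, so one call to binom.test suffices for the grocery example (7 plus signs out of 10 non-tied stores):

binom.test(7, 10, p = 0.5)  # two-sided p-value = 0.3438, as above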
Approximation: Normal Distribution
With larger sample sizes, we rely on the normal distribution approximation of the binomial
distribution to compute the p-value.
Example: Home Sales
One year ago the median price of a new home was $236,000.
However, a current downturn in the economy emerges.
Real estate firms use sample data on recent home sales to determine if the population median
price of a new home today is lower than it was a year ago.
Example: Home Sales
A random sample of 61 recent new home sales found 22 homes sold for more than $236,000, 38
homes sold for less than $236,000, and one home sold for $236,000.
The one home that sold for the hypothesized median price of $236,000 should be deleted from
the sample.
The sample result showing 22 plus signs is in the lower tail of the binomial distribution.
The p-value is the probability of 22 or fewer plus signs.
Mean and Standard Deviation
The sampling distribution of the number of plus signs can be approximated by a normal distribution with
$\mu = 0.5n, \qquad \sigma = \sqrt{0.25n}$
Note that the binomial probability distribution is discrete while the normal probability distribution is continuous.
Continuity Correction Factor
The continuity correction factor is applied when a continuous probability distribution is used
for approximating a discrete probability distribution.
Example: Home Sales
To compute the p-value for 22 or fewer plus signs we use the normal distribution with μ=30 and
σ=3.873 to compute the probability that the normal random variable has a value ≤ 22.5.
Example: Home Sales
Using this normal distribution, we compute the p-value as follows:
p-value = P(X ≤ 22.5) = P(Z ≤ (22.5 − 30)/3.873) = P(Z ≤ −1.94) = .0262
With p-value = .0262 < .05, we reject the null hypothesis and conclude that the median price of a new home is less than the $236,000 median price a year ago.
Wilcoxon Signed-rank Test
Wilcoxon Signed-rank Test
The Wilcoxon signed-rank test is a nonparametric procedure for analyzing data from a matched-
sample experiment.
The test uses quantitative data but does not require the assumption that the differences between
the paired observations are normally distributed.
It only requires the assumption that the differences between the paired observations have a
symmetric distribution.
If the data has significant outliers, the Wilcoxon signed rank test would be a more robust option.
Example: Manufacturing
A manufacturing firm is attempting to determine whether two production methods differ in terms of task completion time.
Using a matched-samples experimental design, 11 randomly selected workers completed the production task twice, once using method A and once using method B.
Example: Manufacturing
Do these data indicate that the two production methods differ significantly in terms of
completion times?
If we assume that the differences have a symmetric distribution but not necessarily a normal
distribution, we should apply the Wilcoxon signed-rank test.
Example: Manufacturing
The steps for the Wilcoxon signed-rank test are: compute the paired differences, discard zero differences, rank the absolute differences, and attach the sign of each difference to its rank.
To conduct the Wilcoxon signed-rank test, we use T⁺, the sum of the positive signed ranks, as the test statistic.
Approximation: Normal Distribution
If the number of matched pairs is 10 or more, the sampling distribution of T⁺ can be approximated by a normal distribution with
$\mu_{T^+} = \dfrac{n(n+1)}{4}, \qquad \sigma_{T^+} = \sqrt{\dfrac{n(n+1)(2n+1)}{24}}$
Example: Manufacturing
Based on the calculation, the test statistic is T⁺ = 49.5.
The probability that T⁺ ≥ 49.5 is approximated using the normal distribution above.
With the resulting p-value ≤ .05, we reject H0 and conclude that the median completion times for the two production methods are not equal.
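In R, wilcox.test with paired = TRUE runs the signed-rank test on the raw columns; a minimal sketch with hypothetical completion times (in minutes):

methodA <- c(10.2, 9.6, 9.2, 10.6, 9.9, 10.2, 10.6, 10.0, 11.2, 10.7, 10.6)
methodB <- c( 9.5, 9.8, 8.8, 10.1, 9.7,  9.8, 10.5,  9.5, 10.2,  9.8,  9.8)
wilcox.test(methodA, methodB, paired = TRUE)  # Wilcoxon signed-rank test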
Mann-Whitney-Wilcoxon (MWW) test
Mann-Whitney-Wilcoxon (MWW) test
Mann-Whitney-Wilcoxon (MWW) test is a nonparametric test for the difference between two
populations based on two independent samples.
Advantages of this nonparametric procedure are that it can be used with either ordinal data or
quantitative data and it does not require the assumption that the populations have a normal
distribution.
Example: Employee Performance
During an employee performance review at a theater, the theater manager rated all 35 part-time
employees from best (ranked 1) to worst (ranked 35) in the theater’s annual report.
The part-time employees were primarily college and high school students. The manager asked if
there was evidence of a significant difference in performance for college students compared to
high school students.
The hypotheses:
H0: The performances of the two populations are identical.
H1: The performances of the two populations are not identical.
Example: Employee Performance
We begin by selecting a random sample of four college students and a random sample of five
high school students.
Example: Employee Performance
Rank the combined samples from low to high.
Sum the ranks for each sample.
The sum of ranks for the first sample will be the test statistic W = 14.
Example: Employee Performance
Letting C denote a college student and H denote a high school student.
Suppose the ranks of the nine students had the following order:
Example: Employee Performance
We used a computer program to compute all possible orderings (rankings) for the nine students:
Example: Employee Performance
The two-tailed p-value = 2(.0952) = 0.1904.
The MWW test conclusion is that we cannot reject the null hypothesis that the populations of
college and high school students are identical.
Normal Distribution Approximation
When both sample sizes are 7 or more, a normal approximation of the sampling distribution of W can be used.
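In R, wilcox.test on two independent samples performs the MWW test; a sketch with hypothetical ranks (ordinal data treated as numeric, re-ranked internally):

college    <- c(15, 3, 23, 8)       # hypothetical ranks of 4 college students
highschool <- c(18, 20, 1, 14, 12)  # hypothetical ranks of 5 high-school students
wilcox.test(college, highschool)    # Mann-Whitney-Wilcoxon test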
Kruskal-Wallis (KW) test
Kruskal-Wallis Test
We extend the nonparametric procedures to hypothesis tests involving three or more
populations.
The nonparametric Kruskal-Wallis test is based on the analysis of independent random samples
from each of k populations.
This procedure can be used with either ordinal data or quantitative data and does not require the
assumption that the populations have normal distributions.
Example: Annual Performance Report
One company hires employees for its management staff from three different colleges.
The personnel director began reviewing the annual performance reports for the management
staff in an attempt to determine whether there are differences in the performance ratings among
the managers who graduated from the three colleges.
The independent samples include 7 managers from college A, 6 from college B, and 7 from
college C.
Example: Annual Performance Report
Rank the combined samples from low to high.
Sum the ranks for each sample.
Kruskal-Wallis Test statistic
The Kruskal-Wallis test statistic uses the sum of the ranks Ri for each of the k samples and is computed as
$H = \left[\dfrac{12}{n_T(n_T+1)} \sum_{i=1}^{k} \dfrac{R_i^2}{n_i}\right] - 3(n_T + 1)$
where ni is the size of sample i and nT is the total number of observations.
Example: Annual Performance Report
Under the null hypothesis assumption of identical populations, the sampling distribution of H can be approximated by a chi-square distribution with k − 1 degrees of freedom.
With H = 8.92 and k − 1 = 2 degrees of freedom, the area in the upper tail of the chi-square distribution is between .025 and .01. (You may use a chi-square distribution table or R functions.)
Because the p-value ≤ α = .05, we reject H0 and conclude that the three populations are not all the same.
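In R, kruskal.test accepts a list of sample vectors. The ratings below are an assumption; note that kruskal.test applies a tie correction, so its statistic can differ slightly from a hand calculation:

collegeA <- c(25, 70, 60, 85, 95, 90, 80)  # assumed ratings, 7 managers
collegeB <- c(60, 20, 30, 15, 40, 35)      # assumed ratings, 6 managers
collegeC <- c(50, 70, 60, 80, 90, 70, 75)  # assumed ratings, 7 managers
kruskal.test(list(collegeA, collegeB, collegeC))  # H statistic and p-value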
Rank Correlation
Rank Correlation
The Pearson correlation coefficient is a measure of the linear association between two variables
using quantitative data.
We discussed the test for the Pearson correlation coefficient in a previous session. Now we will discuss the test for the Spearman rank-correlation coefficient.
Spearman Rank-correlation Coefficient
The Spearman rank-correlation coefficient ranges from -1.0 to +1.0 and its interpretation is
similar to the Pearson correlation coefficient for quantitative data.
Example: Sales Potential
A personnel director reviewed the performance of 10 current members of the sales force.
After the review, the director ranked the 10 individuals in terms of their potential for success and
assigned the individual who had the most potential the rank of 1.
Data were then collected on the actual sales for each individual during their first two years of
employment.
Example: Sales Potential
The ranks based on potential as well as the ranks based on the actual performance are shown
below.
Example: Sales Potential
The computation of the Spearman rank-correlation coefficient is summarized by
$r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2-1)}$
where di is the difference between the potential rank and the performance rank for individual i.
Rank Correlation Test Hypotheses
We can use the sample rank correlation rs to make an inference about the population rank correlation coefficient ρs:
H0: ρs = 0; H1: ρs ≠ 0
Example: Sales Potential
The following sampling distribution of rs can be used to conduct the test:
$\mu_{r_s} = 0, \qquad \sigma_{r_s} = \sqrt{\dfrac{1}{n-1}}$
The sample rank-correlation coefficient for sales potential and sales performance is rs = .733.
Then we have μrs = 0 and σrs = √(1/(10 − 1)) = .333.
Example: Sales Potential
The test statistic is
$z = \dfrac{r_s - \mu_{r_s}}{\sigma_{r_s}} = \dfrac{.733 - 0}{.333} = 2.20$
Using the standard normal probability table and z = 2.20, we find the two-tailed p-value = 2(1 − .9861) = .0278.
With a .05 level of significance, p-value ≤ α. Thus, we reject the null hypothesis that the population rank-correlation coefficient is zero.
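In R, cor.test with method = "spearman" computes rs and its test in one call; a sketch with hypothetical rank vectors:

potential   <- c(2, 4, 7, 1, 6, 3, 10, 9, 8, 5)  # hypothetical potential ranks
performance <- c(4, 1, 9, 3, 6, 2, 10, 7, 8, 5)  # hypothetical sales ranks
cor.test(potential, performance, method = "spearman")  # rho and p-value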
Thank you!
Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.
R & Business Analytics
Masters in Big Data and Business Analytics
Simple Linear Regression Model
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
Simple Linear Regression Model
Only one independent variable, X.
The relationship between X and Y is described by a linear function.
Changes in Y are assumed to be related to changes in X.
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
where β0 is the intercept, β1 is the slope, Xi is the independent variable, Yi is the dependent variable, and εi is the random error term; β0 + β1Xi is the linear component and εi the random error component.
[Figure: observed value of Y for Xi vs. predicted value of Y for Xi; the vertical gap is the random error εi for this Xi value; the slope β1 = ΔYi/ΔXi and the intercept β0 are marked]
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line:
$\hat{Y}_i = b_0 + b_1 X_i$
where Ŷi is the estimated (predicted) Y value for observation i.
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:
$\min \sum (Y_i - \hat{Y}_i)^2$
The coefficients b0 and b1, and other regression results in this chapter, will be found using R.
Example: House Price
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).
A random sample of 10 houses is selected:
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet

House Price in $1000s (Y) | Square Feet (X)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700
Example: House Price
[Figure: scatter plot of house price ($1000s, y-axis, 0–500) against square feet (x-axis, 0–2600)]
Example: House Price
Fitting the linear model in R with lm:
The regression equation is: house price = 98.24833 + 0.10977 (square feet)
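The house-price regression in R, using the ten observations from the table above:

price <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)  # $1000s
sqft  <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)
model <- lm(price ~ sqft)
summary(model)                                     # b0 = 98.248, b1 = 0.10977
predict(model, newdata = data.frame(sqft = 2000))  # ~317.8 ($1000s)
confint(model)                                     # 95% CIs for b0 and b1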
Example: House Price
[Figure: scatter plot with the prediction line; intercept = 98.248, slope = 0.10977]
Example: House Price
b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of
observed X values).
Example: House Price
b1 estimates the change in the average value of Y as a result of a one-unit increase in X.
Here, b1 = 0.10977 tells us that the mean value of a house increases by .10977($1000) = $109.77 on
average, for each additional one square foot of size.
Example: House Price
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.
Measures of Variation
Measures of Variation
Total variation is made up of two parts:
$SST = SSR + SSE$
$SST = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$
where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value
Measures of Variation
SST = total sum of squares (total variation): measures the variation of the Yi values around their mean Ȳ.
SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y.
SSE = error sum of squares (unexplained variation): variation attributable to factors other than the relationship between X and Y.
Measures of Variation
[Figure: scatter plot illustrating, for a single point (Xi, Yi), the decomposition SST = Σ(Yi − Ȳ)² into SSE = Σ(Yi − Ŷi)² and SSR = Σ(Ŷi − Ȳ)²]
Coefficient of Determination, r²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
It is also called r-squared and is denoted r²:
$r^2 = \dfrac{SSR}{SST}$, with 0 ≤ r² ≤ 1.
Examples of Approximate r² Values
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
0 < r² < 1: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X.
r² = 0: no linear relationship between X and Y; none of the variation in Y is explained by variation in X.
Example: House Price
Summary report of a fitted linear model:
Adjusted r²
The use of adjusted r² is an attempt to account for the phenomenon of r² spuriously increasing when extra explanatory variables are added to the model.
There are different ways of adjusting; the most common one (also used in R) is
$R^2_{adj} = 1 - (1 - R^2)\dfrac{n-1}{n-p-1}$
where n is the total number of observations and p is the number of explanatory variables.
r² vs Sample Correlation Coefficient
If the sample correlation coefficient is rxy, we have
$r^2 = r_{xy}^2$
More specifically,
$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$
Example: House Price
The standard error of the residuals reported by R is SYX = 41.33032.
Standard Error of Residuals
The standard error of the residuals is the standard deviation of the variation of observations around the regression line:
$S_{YX} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$
where SSE is the error sum of squares and n the sample size. Why n − 2? Two parameters (b0 and b1) are estimated.
Comparing Standard Errors
SYX is a measure of the variation of observed Y values around the regression line: the tighter the scatter, the smaller SYX.
The magnitude of SYX should always be judged relative to the size of the Y values in the sample data.
Regression Slope
Inferences about the Slope
The standard error of the regression slope coefficient (b1) is estimated by
$S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$
where Sb1 is the estimate of the standard error of the slope.
Test statistic:
$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}}, \qquad d.f. = n - 2$
where b1 is the regression slope coefficient, β1 the hypothesized slope, and Sb1 the standard error of the slope.
Example: House Price
Estimated regression equation (data as in the table above): house price = 98.25 + 0.1098 (sq. ft.)
The slope of this model is 0.1098.
Is there a linear relationship between the square footage of the house and its sales price?
Example: House Price
H0: β1 = 0
H1: β1 ≠ 0
From the R output, b1 = 0.10977 and Sb1 = 0.03297, so
$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.10977 - 0}{0.03297} = 3.32938$
Example: House Price
With d.f. = 8 and α/2 = .025, the critical values are ±2.3060. Since tSTAT = 3.329 > 2.3060, reject H0.
Decision: Reject H0. There is sufficient evidence that square footage affects house price.
House Price: t-test Example
The R output reports the same conclusion via the p-value for the slope coefficient: p-value = 0.0104 < 0.05, so reject H0.
Confidence Interval Estimate for the Slope
$b_1 \pm t_{\alpha/2} S_{b_1}, \qquad d.f. = n - 2$
From the R output: at the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
Confidence Interval Estimate for the Slope
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size.
Multiple Linear Regression
Multiple Regression Model
Multiple regression analysis is the study of how a dependent variable y is related to two or more independent variables.
The equation that describes how the dependent variable y is related to the independent variables x1, x2, …, xp and an error term ε is called the multiple regression model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
Multiple Regression Equation
One of the assumptions is that the mean or expected value of ε is zero.
The equation that describes how the mean value of y is related to x1, x2, …, xp is called the multiple regression equation:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics b0, b1, b2, …, bp that are used as the point estimators of the parameters β0, β1, β2, …, βp:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
Least Squares Method
The same least squares method is used to develop the estimated multiple regression equation.
The least squares method uses sample data to provide the values of b0, b1, b2, …, bp that make the sum of squared residuals (SSE) a minimum.
Example: Trucking Company
To develop better work schedules, the managers at a trucking company want to estimate the total
daily travel time for their drivers.
Initially the managers believed that the total daily travel time would be closely related to the
number of miles traveled in making the daily deliveries.
Example: Trucking Company
[Figure: scatter diagram of total travel time against miles traveled]
Example: Trucking Company
Example: Trucking Company
The managers felt that the number of deliveries could also contribute to the total travel time.
Example: Trucking Company
Multiple Coefficient of Determination
Similar to the conception in simple linear regression, the term multiple coefficient of
determination indicates that we are measuring the goodness of fit for the estimated multiple
regression equation.
Multiple Coefficient of Determination
Relationship among SST, SSR, and SSE: SST = SSR + SSE, and
$R^2 = \dfrac{SSR}{SST}$
Example: Trucking Company
The r² value of the simple linear regression is 0.664. In comparison, the R² value for the estimated regression equation with two independent variables is 0.904.
In general, R² always increases as independent variables are added to the model. Many analysts therefore prefer the adjusted R², which also increased here (from 0.622 to 0.876).
Testing for Significance
In simple linear regression, the significance test we used was the t test.
In multiple regression, we use both the t test and the F test, and they have different purposes:
1. The t test is used to determine whether each individual independent variable is significant.
2. The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
F test
Example: Trucking Company
The F statistic and its p-value are reported in the model summary output.
Categorical Independent Variables
Categorical Independent Variables
So far, the examples we have considered involved quantitative independent variables.
In many situations, however, we must work with categorical independent variables such as
gender (male, female), payment method (cash, credit card, check), and so on.
Example: System Maintenance
A water-filtration system maintenance service provider is trying to estimate the repair time
necessary for each maintenance request.
Repair time is believed to be related to two factors, the number of months since the last
maintenance service and the type of repair problem (mechanical or electrical).
Example: System Maintenance
Data for a sample of 10 service calls are reported.
Example: System Maintenance
To incorporate the type of repair into the regression model, we define the following dummy variable:
x2 = 0 if the type of repair is mechanical, and x2 = 1 if the type of repair is electrical.
Example: System Maintenance
The estimated result is:
Interpreting the dummy variable
Compare the model with the dummy variable equal to 0 or 1:
When x2 = 0: E(y) = β0 + β1x1
When x2 = 1: E(y) = (β0 + β2) + β1x1
The interpretation of β2 is that it indicates the difference between the mean repair time for an electrical repair and the mean repair time for a mechanical repair.
Multi-level Categorical Variables
The categorical variable for the previous example had two levels (mechanical and electrical).
Oftentimes, we will encounter a categorical variable with more than two levels.
If a categorical variable has k levels, how many dummy variables will we need?
Example: Copy Machine
Suppose a manufacturer of copy machines organized the sales territories for a particular state into
three regions: A, B, and C.
The managers want to use regression analysis to help predict the number of copiers sold per
week.
The managers believe sales region is an important factor in predicting the number of copiers sold.
Because sales region is a categorical variable with three levels, A, B and C, we will need 3 − 1 = 2
dummy variables to represent the sales region.
Example: Copy Machine
The three regions are encoded with two dummy variables, for example:
Region A: x1 = 0, x2 = 0; Region B: x1 = 1, x2 = 0; Region C: x1 = 0, x2 = 1.
Substituting each encoding into the regression equation gives three variations of it: β1 and β2 then measure the difference in mean sales between regions B and A, and between regions C and A, respectively.
When a categorical variable has k levels, k − 1 dummy variables are required.
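In R, declaring a variable as a factor makes lm create the k − 1 dummy variables automatically; a minimal sketch with hypothetical data:

sales  <- c(30, 35, 28, 50, 52, 48, 40, 41, 39)  # hypothetical weekly sales
region <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
fit <- lm(sales ~ region)
summary(fit)  # coefficients regionB and regionC are differences from region A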
Thank you!
Reference:
Anderson, D., Sweeney, D., Williams, T. (2010). Statistics for Business and Economics (11th ed). Cengage Learning.
R & Business Analytics
Masters in Big Data and Business Analytics
Generalized Linear Model
Generalized Linear Model
The major issues in model building are finding the proper functional form of the relationship and
selecting the independent variables to be included in the model.
As a general framework for developing more complex relationships among the independent
variables, we introduce the concept of a generalized linear model.
In compact notation: y = β0 + β1z1 + β2z2 + ⋯ + βpzp + ε, where each zj (j = 1, 2, …, p) is a
function of the predictor variables x1, x2, …, xk.
3
Example: Industrial Scales
A manufacturer of industrial scales and laboratory equipment wants to investigate the relationship
between length of employment of their salespeople and the number of electronic laboratory scales
sold.
The number of scales sold by 15 randomly selected salespeople for the most recent sales period
and the number of months each salesperson has been employed by the firm are plotted in a scatter diagram.
4
Example: Industrial Scales
The estimated regression is
5
Example: Industrial Scales
The standardized residual plot suggests that a curvilinear relationship is needed.
6
Example: Industrial Scales
To account for the curvilinear relationship, a second-order model with one predictor variable is
developed: E(y) = β0 + β1x + β2x².
7
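In R, the squared term is added with I(); a sketch assuming a hypothetical data frame sales with columns scales_sold and months_employed:

# First-order model
fit1 <- lm(scales_sold ~ months_employed, data = sales)

# Second-order model: E(y) = b0 + b1*x + b2*x^2
fit2 <- lm(scales_sold ~ months_employed + I(months_employed^2), data = sales)

# Standardized residual plot, to check that the curvature has been removed
plot(fitted(fit2), rstandard(fit2))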
Example: Industrial Scales
The new standardized residual plot shows that the previous curvilinear pattern has been
removed.
8
Transformation of Dependent Variable
In showing how the general linear model can be used to model a variety of possible relationships
between the independent variables and the dependent variable, we have focused attention on
transformations involving one or more of the independent variables.
9
Example: Automobile
A dataset shows the miles-per-gallon ratings and weights for 12 automobiles.
10
Example: Automobile
The estimated regression is
11
Example: Automobile
The standardized residual plot is indicative of a nonconstant variance.
12
Example: Automobile
Often the problem of nonconstant variance can be corrected by transforming the dependent
variable to a different scale, for example by taking logarithms.
13
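A sketch of the log transformation in R, with hypothetical column names mpg and weight in a data frame cars_df:

fit_raw <- lm(mpg ~ weight, data = cars_df)        # residuals show a wedge shape
fit_log <- lm(log(mpg) ~ weight, data = cars_df)   # variance stabilized on the log scale

plot(fitted(fit_log), rstandard(fit_log))          # re-check the residual plot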
Example: Automobile
The wedge-shaped pattern has now disappeared.
14
Model Diagnostics
15
Residual plot
A residual plot is a scatterplot of the residuals
against the predicted values.
16
Box-Cox method
If we see a curvature-like shape, the Box-Cox method
can be applied.
The Box-Cox method can be used to find the best
power transformation for the response variable.
In an actual application, it is better to interpret
this number and choose a power that makes
practical sense for your data.
λ = −1.52
17
Box-Cox method
Then what do we do with λ?
Simple rule of thumb: round λ to a convenient, interpretable power (e.g. −1, −0.5, 0, 0.5, 1) and
transform the response accordingly.
More formally, the original form of the Box-Cox transformation is:
y(λ) = (y^λ − 1) / λ if λ ≠ 0, and y(λ) = log(y) if λ = 0.
18
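In R, the Box-Cox profile comes from the MASS package; a sketch assuming a fitted lm object model with a strictly positive response:

library(MASS)

bc <- boxcox(model)               # plots the log-likelihood against lambda
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood
# Round lambda to an interpretable power (e.g. -1, -0.5, 0, 0.5, 1) before transforming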
Heteroskedasticity
Linear regression assumes that the residuals are identically distributed across all values of the X
variables.
If that holds, the error terms are homoskedastic, meaning the errors have the same scatter
regardless of the value of X.
When the errors vary depending on the values of one or more Xs, the error terms are
heteroskedastic.
19
Heteroskedasticity
Use Scale-Location plot to identify heteroskedasticity.
It displays the fitted values of the regression model along the x-axis and the square root of the
standardized residuals along the y-axis.
1. Verify that the red line is roughly horizontal across the plot.
2. Verify that there is no clear pattern among the residuals.
20
Heteroskedasticity
The White test is a method to identify whether or not the error variances are all equal.
(In the example output shown, the p-value is 0.032, so equal error variances are rejected at the 5% level.)
21
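Base R has no dedicated White test; one common route, sketched here under the assumption of a fitted lm object model, is lmtest::bptest() with the fitted values and their squares as auxiliary regressors, which reproduces a White-style test:

library(lmtest)

# Breusch-Pagan test using fitted values and their squares as auxiliary
# regressors, i.e. a White-style test for heteroskedasticity
bptest(model, ~ fitted(model) + I(fitted(model)^2))
# A small p-value (e.g. 0.032) rejects the null of equal error variances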
Check for normality
Use a Normal Q-Q plot to compare the standardized residuals against the theoretical quantiles of
the Normal distribution.
Shapiro-Wilk test
22
Frequency distribution
Note: the slope of the reference line is the standard deviation of the observations.
(Figure by Matthew E. Clapham, UC Santa Cruz)
24
Shapiro-Wilk test
25
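Both normality checks are one-liners in base R, assuming a fitted lm object model:

# Q-Q plot of the standardized residuals against Normal quantiles
qqnorm(rstandard(model))
qqline(rstandard(model))

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
shapiro.test(residuals(model))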
Multicollinearity
Multicollinearity refers to a situation in which two or more
variables in a regression model are highly correlated with
each other.
Variance Inflation Factors (VIF) measure how much the
variance of each estimated coefficient is increased over
the case of no correlation among the X variables.
• If all VIFs are 1, no two X variables are correlated.
• If the VIF for one of the variables is around or greater than 5, there is
collinearity associated with that variable.
• If two or more variables have high VIFs, one or more of these variables
should be dropped. Drop them one at a time and select the
regression equation with the higher R-squared. (Note: when one variable is
dropped, the VIFs for all the remaining variables need to be recalculated.)
26
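VIFs can be computed with the car package; a sketch assuming a fitted lm object model:

library(car)   # provides vif()

vif(model)     # values near 1: little collinearity; around 5 or above: investigate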
Influential observations
Cook’s Distance (or Cook’s D) is calculated by
comparing the regression model with and
without a particular observation.
It measures how much the estimated coefficients
change when that observation is removed from
the dataset.
A common rule of thumb flags observations with a
Cook's D greater than 4/n as influential points.
27
Influential observations
Leverage is a measure of how far away the independent
variable values of an observation are from those of the
other observations.
High-leverage points are potential outliers with respect to
the independent variables.
28
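Base R exposes both influence measures directly, assuming a fitted lm object model:

cooks.distance(model)    # Cook's D for every observation
hatvalues(model)         # leverage for every observation

plot(model, which = 4)   # Cook's distance plot
plot(model, which = 5)   # residuals vs leverage, with Cook's D contours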
Influential observations
How do we deal with influential observations?
1. Verify that the observation is not an error.
• Before you take any action, you should first verify that the influential observation(s) are not a result of a
data entry error or some other odd occurrence.
2. Remove the influential observations.
• You may decide to simply remove the influential observations if the model you specified seems to fit the
data well except for the one or two influential observations.
3. Attempt to fit another regression model.
• Influential observations could indicate that the model you specified does not provide a good fit to the
data.
29
Case Study: Sales Prediction
By Darek Kane
30
Grape juice
Your supermarket SuperMart is selling a new type of
grape juice in some of its stores for pilot testing.
The marketing team wants to develop a model with
the following variables to predict the sales of this new
type of grape juice.
31
Data summary
From the summary table, we can roughly know the basic statistics of each numeric variable. For
example, the mean value of sales is 216.7 units, the min value is 131, and the max value is 335.
We can further explore the distribution of the sales data by visualizing it in graphical form.
The sales data distribution is roughly normal.
32
Question #1
To predict the sales of grape juice in a store, what
statistical analysis technique should we use?
• A multiple linear regression model is suitable.
• Here, "sales" is the dependent variable and the others
are independent variables.
Let's investigate the correlation between the sales and other
variables by displaying the correlation coefficients in pairs.
• The correlation coefficients between sales and price, ad type,
price apple, and price cookies are 0.85, 0.58, 0.37, and 0.37
respectively, which means they all might have some influence on
the sales.
33
Multiple Linear Regression analysis
We can first try to add all of the independent variables into the
regression model:
• The p-values for Price, Ad Type, and Price Cookies are much less than
0.05. They are significant in explaining the sales. We are confident
in including these variables in the model.
• The p-value of Price Apple is a bit larger than 0.05, so there seems to be
no strong evidence that the apple juice price explains the sales.
• However, according to our real-life experience, we know that when the
apple juice price is lower, consumers are likely to buy more apple juice,
and the sales of other fruit juices may then decrease. So we still keep
it in the model to explain the grape juice sales.
• The Adjusted R-squared is 0.881, which indicates a reasonable
goodness of fit: 88% of the variation in sales can be explained by
the four variables.
34
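A sketch of this fit in R, assuming a data frame juice with hypothetical column names sales, price, ad_type, price_apple, and price_cookies:

fit <- lm(sales ~ price + ad_type + price_apple + price_cookies, data = juice)

summary(fit)   # per-variable t tests, overall F test, and Adjusted R-squared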
Question #2
What model diagnostics do we need to perform?
• The residual plot shows that the residuals scatter around the
fitted line with no obvious pattern.
• From the Scale-Location plot, we see acceptable constant
variance on standardized residuals.
• The Normal Q-Q graph shows that basically the residuals are
normally distributed.
• The VIF test value for each variable is close to 1, which means
the multicollinearity is very low among these variables.
The final model:
• Sales = 774.81 - 51.24 * Price + 29.74 * Ad Type + 22.1 * Price
Apple - 25.28 * Price Cookies
35
Pitfalls of R-squared
36
R-squared
r² = SSR / SST = (regression sum of squares) / (total sum of squares) = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²
0 ≤ r² ≤ 1
37
Checking R-squared
Is it really bad to have low R-squared?
Not really.
38
Pitfalls of R-squared
1. R-squared can be arbitrarily low when the model is completely correct.
• We simulate data points by adding normally distributed noise (error) to the dependent variable.
• By making σ larger, we drive R-squared towards 0, even when every assumption of the simple linear
regression model is correct.
(Figure: scatterplots with increasing σ)
39
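This pitfall is easy to reproduce by simulation; a self-contained R sketch in which only the noise level sigma changes:

set.seed(1)
x <- seq(1, 10, length.out = 100)

r2_for_sigma <- function(sigma) {
  y <- 2 + 3 * x + rnorm(length(x), sd = sigma)  # the model is exactly correct
  summary(lm(y ~ x))$r.squared
}

sapply(c(0.5, 2, 10, 50), r2_for_sigma)  # R-squared falls towards 0 as sigma grows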
Pitfalls of R-squared
2. R-squared can be arbitrarily close to 1 when the model is totally wrong.
• This happens especially when the data are non-linear.
(Figure: a simulated non-linear distribution for which R-squared is nevertheless high)
40
Pitfalls of R-squared
3. R-squared says nothing about prediction error, even with exactly the same σ² and no change
in the coefficients.
• We’re better off using Mean Square Error (MSE) or Root Mean Square Error (RMSE) as a measure of
prediction error.
(Figure: when the X values are shrunk, R-squared drops even though σ² and the coefficients are unchanged)
41
Regularization
42
Bias & Variance
Let’s first discuss where model errors come from.
(Figure: bias, variance, and total error as a function of model complexity)
44
Regularization
Normally, we can add explanatory variables or bring in non-linearity to reduce bias.
But how do we reduce the variance?
45
Ridge Regression
Linear Regression uses Ordinary Least Squares (OLS) to learn model coefficients by minimizing
the OLS loss: Σ(yi − ŷi)².
Ridge Regression adds one type of regularization (the L2 norm) to this loss: Σ(yi − ŷi)² + λ Σ βj².
It is an alternative to dropping variables for solving the multicollinearity problem. It reduces variance
and introduces bias in a way that does not completely drop variables.
46
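Ridge regression is available in the glmnet package, where alpha = 0 selects the L2 penalty; a self-contained sketch on simulated data:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)                  # 5 predictors
y <- as.numeric(x %*% c(2, -1, 0, 0, 1) + rnorm(100))  # linear signal plus noise

cv  <- cv.glmnet(x, y, alpha = 0)                # cross-validate the penalty lambda
fit <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
coef(fit)                                        # coefficients shrunk towards (not to) zero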
LASSO Regression
Ridge regression cannot perform variable selection to reduce
the model variance. LASSO regression has such an ability.
LASSO is an acronym for "Least Absolute Shrinkage and
Selection Operator".
47
Shrinkage parameter
Why is LASSO regression able to perform variable
selection but not Ridge regression?
LASSO penalizes the absolute values of the coefficients (λ Σ |βj|), so as λ increases some
coefficients shrink exactly to zero; Ridge penalizes the squared coefficients (λ Σ βj²), which
shrinks them towards zero without ever reaching it.
48
Elastic Net
Is it possible to combine these two regression models and retain the characteristics and
advantages of both?
Elastic Net reduces the impact of different features while not eliminating all of the features.
49
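Both LASSO and Elastic Net come from the same glmnet interface: alpha = 1 gives LASSO, and 0 < alpha < 1 mixes the L1 and L2 penalties. A sketch reusing the simulated x and y from the ridge example above:

lasso <- cv.glmnet(x, y, alpha = 1)      # LASSO: some coefficients become exactly 0
enet  <- cv.glmnet(x, y, alpha = 0.5)    # Elastic Net: equal mix of L1 and L2

coef(lasso, s = "lambda.min")            # note the zeroed-out coefficients
coef(enet,  s = "lambda.min")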
Thank you!
50
R & Business Analytics
Masters in Big Data and Business Analytics
This document belongs to ESCP Business School. It cannot be modified nor distributed without the author’s consent.
Logistic Regression Model
2
Recall: Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, …, xp is called the
multiple regression equation: E(y) = β0 + β1x1 + β2x2 + ⋯ + βpxp.
3
Logistic Regression
What if the dependent variable is categorical?
For instance, a bank might like to develop an estimated regression equation for predicting
whether a person will be approved for a credit card.
The dependent variable can be coded as y=1 if the bank approves the request for a credit card and
y=0 if the bank rejects the request for a credit card.
We can use logistic regression to estimate the probability that the bank will approve one credit
card application.
4
Logistic regression
One popular model is Logistic Regression, which is
extended from Linear Regression.
z = β0 + β1x1 + β2x2 + … + βpxp
p = e^z / (1 + e^z), or equivalently, p = 1 / (1 + e^(−z))
6
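In R, logistic regression is fitted with glm() and the binomial family; a sketch assuming a hypothetical data frame applications with a 0/1 outcome approved:

fit <- glm(approved ~ income + has_card, data = applications, family = binomial)

summary(fit)                    # coefficients on the log-odds scale
predict(fit, type = "response") # estimated probabilities p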
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) means finding, given the data points, the specific model
under which the observed data have the highest likelihood.
7
Model inference
For logistic regression, MLE finds the coefficients that provide the estimates best fitting the
logistic curve.
8
Direct Mail Promotion
An expensive four-color sales catalog has been designed and printed for direct mail promotion.
The managers would like to send them to only those customers who have the highest probability
of using the coupon.
They think that annual spending at the stores and whether a customer has a store credit card are
two variables that might be helpful in prediction.
The company sent the catalog to the selected 100 customers and noted whether the customer used
the coupon.
9
Direct Mail Promotion
The sample data for the first 10 catalog recipients are shown:
In the Coupon column, a 1 is recorded if the sampled customer used the coupon and 0 if not.
10
Direct Mail Promotion: Model Estimation
The nonlinear form of the logistic regression equation makes the method of computing estimates
more complex. We will see later how to use R to provide the estimates.
We define the dependent and independent variables:
y = 1 if the customer used the coupon, and 0 otherwise;
x1 = annual spending at the stores (in $1000s);
x2 = 1 if the customer has a store credit card, and 0 otherwise.
11
Direct Mail Promotion: Model Estimation
With the help of computer software, we generate the equation estimation:
12
Odds Ratio
The odds in favor of an event occurring is defined as the probability the event will occur divided
by the probability the event will not occur: odds = P(event) / (1 − P(event)).
The odds ratio measures the impact on the odds of a one-unit increase in only one of the
independent variables:
Odds ratio = odds(x1, x2, …, xk + 1, …, xp) / odds(x1, x2, …, xk, …, xp)
13
Direct Mail Promotion: Odds Ratio
Suppose we want to compare the odds of using the coupon for customers who spend $2000 annually and
have a Simmons credit card (x1 = 2 and x2 = 1) to the odds of using the coupon for customers who spend
$2000 annually and do not have a Simmons credit card (x1 = 2 and x2 = 0).
The estimated odds of using the coupon for customers who have a Simmons credit card are 3 times the
estimated odds of using the coupon for customers who do not have a Simmons credit card.
14
Direct Mail Promotion: Interpreting the Odds Ratio
The odds ratio for each independent variable is computed while holding all the other independent
variables constant.
A unique relationship exists between the odds ratio for a variable and its corresponding regression
coefficient: the odds ratio for variable j equals e^(βj).
15
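That relationship makes the odds ratios a one-liner in R, assuming the glm fit object fit from the earlier sketch:

exp(coef(fit))      # odds ratio for a one-unit increase in each variable
exp(confint(fit))   # confidence intervals expressed on the odds-ratio scale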
Model Evaluation
16
Model Evaluation
Logistic regression is one example of a binary classification model, in which the
target variable has two possible categorical outcomes.
17
Problems with Unbalanced Classes
Unfortunately, while accuracy is simple, it has some widely recognized problems.
For example, if only 1% of loan applicants are fraudulent, a classifier that always predicts
"not fraudulent" is 99% accurate while detecting no fraud at all.
18
Confusion Matrix
To evaluate a logistic regression model (or any classification model), it is important to understand
the notion of a confusion matrix.
Instances are ranked by their predicted score; a cutoff turns the scores into predicted classes,
which are compared against the actual classes in a 2×2 confusion matrix:
• TP = True Positive (actual +, predicted +)
• FP = False Positive (actual −, predicted +)
• FN = False Negative (actual +, predicted −)
• TN = True Negative (actual −, predicted −)
19
A Better Solution: ROC Analysis
A better solution is to use a method that can accommodate uncertainty by showing the entire
space of performance possibilities.
As the cutoff slides through the ranked scores, each position yields a (FPR, TPR) pair; plotting
these pairs traces out the ROC curve.
(Figure: ROC curve with TPR on the y-axis and FPR on the x-axis)
21
Points on ROC Graph
22
Random Classifier
23
Area Under the ROC Curve (AUC)
An advantage of ROC graphs is that they decouple classifier performance from the conditions
under which the classifiers will be used.
Specifically, they are independent of the class proportions as well as the costs and benefits.
24
Area Under the ROC Curve (AUC)
The area under the ROC curve (AUC) is simply the area under a classifier’s curve expressed as a
fraction of the unit square.
Its value ranges from zero to one.
Though a ROC curve provides more information than its area, the AUC is useful when a single
number is needed to summarize performance, or when nothing is known about the operating
conditions.
25
Optimal Cutoff Point
Is there an optimal cutoff point? How to find it?
26
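One common answer, sketched here with the roc_obj from the previous example, is the cutoff maximizing Youden's J statistic (TPR − FPR), which pROC returns via coords():

# Cutoff that maximizes TPR - FPR, with the corresponding sensitivity/specificity
coords(roc_obj, "best", best.method = "youden")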
Model Building
27
When to Add or Delete Variables
We need to test whether it is advantageous to add one or more independent variables to a
multiple regression model.
This test is based on a determination of the amount of reduction in the error sum of squares
resulting from adding one or more independent variables to the model.
28
Recall: Trucking Company
Recall the trucking company example.
29
Trucking Company
The simple linear regression model is:
When x2, the number of deliveries, was added as a second independent variable, we obtained the
following estimated regression equation:
30
𝑥
Trucking Company: Model Comparison
The ANOVA tables of the first and the second model:
31
Trucking Company: Reduction of SSE
The reduction in SSE resulting from adding x2 to the model involving just x1 is:
SSE(x1) − SSE(x1, x2)
32
F test
The hypotheses: H0: the coefficients of the added variables are all zero; Ha: at least one of them is nonzero.
F statistic: F = [ (SSE(reduced) − SSE(full)) / (number of added variables) ] / [ SSE(full) / (n − p − 1) ]
33
Trucking Company: F test
The F statistic is computed to test whether the addition of x2 is statistically significant; it is
compared against the critical value Fα (or, equivalently, its p-value is compared against α).
34
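In R this partial F test is exactly what anova() reports when comparing the nested models, again with the hypothetical trucking column names:

fit1 <- lm(travel_time ~ miles, data = trucking)               # reduced model
fit2 <- lm(travel_time ~ miles + deliveries, data = trucking)  # full model

anova(fit1, fit2)   # F statistic and p-value for the reduction in SSE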
Variable selection procedure
Four types of variable selection procedure:
1. Forward selection
2. Backward elimination
3. Stepwise regression
4. Best-subsets regression
35
Forward selection
The forward selection procedure starts with no
independent variables.
36
Backward elimination
The backward elimination procedure begins with a
model that includes all the independent variables.
37
Stepwise regression
The stepwise regression procedure begins each step by
determining whether any of the variables already in the
model should be removed.
38
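Base R's step() illustrates these procedures, though it selects by AIC rather than by F tests; a sketch assuming a full lm fit full_model on a data frame d:

step(full_model, direction = "both")       # stepwise: add or drop at each step
step(full_model, direction = "backward")   # backward elimination
# Forward selection also needs an empty starting model and a scope:
# step(lm(y ~ 1, data = d), scope = formula(full_model), direction = "forward")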
Best-subsets regression
None of stepwise regression, forward selection, or backward
elimination guarantees that the best model for a given number of
variables will be found.
39
Thank you!
40