MATH 121 (Chapter 10) - Correlation & Regression
November 17 – 2, 2020
Chapter
10 Correlation and Regression Analysis
Learning Objectives
After completing this chapter, the students will be able to:
o Distinguish and explain the difference between independent and dependent
variables.
o Draw a scatter plot for a set of ordered pairs.
o Calculate and interpret the coefficient of correlation using Pearson Product
Moment Correlation Coefficient (PMCC).
o Evaluate the significance of the correlation coefficient.
o Conduct a test of hypothesis on the correlation coefficient.
o Conduct a test for Spearman Rank Correlation and explain its applications.
o Calculate and interpret the linear regression equation.
o Calculate and interpret the coefficient of determination and the standard error of
estimate.
o Determine the confidence interval and the prediction interval in a linear
regression equation.
Chapter Outline
10.1 Introduction
10.2 Scatter Diagram
10.3 Pearson Product-Moment Correlation
Correlation Coefficient
Test of Significance
10.4 Spearman Rank Correlation
10.5 Simple Linear Regression Analysis
Estimating the Coefficient
Sum of Squares for Error
The Standard Error of Estimate
Coefficient of Determination
Confidence Interval and Prediction Interval
10.1 Introduction
This chapter covers the correlation coefficient and regression analysis. The early part of
the chapter discusses the scatter diagram and its importance. It is followed by
correlation, which is a statistical method used to determine whether a relationship
between variables exists. A variable here is a characteristic of the population being
observed or measured. For instance, the variables of interest might be advertising expense
and sales. The sample then consists of random observations of the variables describing a
given population.
The second part of the chapter is regression analysis, which is a statistical method
used to describe the nature of the relationship between variables, that is, whether it is
positive or negative, linear or nonlinear. There are two types of relationships: simple and
multiple. In a simple relationship, there are two variables: an independent variable (or
explanatory variable or predictor variable) and a dependent variable (or response
variable). In a multiple relationship (or multiple regression), on the other hand, two or
more independent variables are used to predict the dependent variable.
A simple linear relationship can be positive or negative. A positive relationship exists
when both variables increase at the same time or both decrease at the same time. In a
negative relationship, by contrast, as one variable increases, the other variable
decreases, or vice versa. This text is limited to the discussion of simple linear
regression analysis.
A scatter diagram is a useful tool for checking the assumptions in a regression analysis.
It can be viewed during an initial screening run of the analysis or after the analysis. The
benefit of looking at the scatter diagram and residuals in the beginning stages of an
analysis is that it may save a researcher time. If the assumptions are not met, further
screening must be applied before the analysis can be completed, and the data may require
cleansing and transformation. In this way, the researcher is not running the analysis
haphazardly. If the assumptions are met, the regression is ready to be run and the
researcher has increased confidence that the chances of making a Type I or Type II error
are reduced, ultimately improving the accuracy of any research results.
$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$ (Formula 10-1)
A. Correlation Coefficient
Pearson’s product-moment correlation coefficient, or simply the correlation
coefficient (or Pearson’s r), is a measure of the strength of the linear association
between two variables. It was developed by Karl Pearson. The value of the correlation
coefficient varies between +1 and -1. When the value of the correlation coefficient is
exactly +1 or -1, there is said to be a perfect degree of association between the two
variables. As the value of the correlation coefficient goes closer to zero, the
relationship between the two variables becomes weaker. This information is summarized
in the charts below.
[Charts: scatter plots of Y variables against X variables for different values of r]
$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$ (Formula 10-2)
Where: t = t test for correlation coefficient
r = correlation coefficient
N = number of paired sample
Assumptions in Pearson Product-Moment Correlation test:
1. Subjects are randomly selected.
2. Both populations are normally distributed.
Procedure in Pearson Product-Moment Correlation test:
1. Set up the hypotheses:
H0: ρ = 0 (The correlation in the population is zero.)
H1: ρ ≠ 0, ρ > 0, or ρ < 0 (The correlation in the population is different from, greater than, or less than zero.)
where: ρ = correlation in the population.
2. Set the level of significance
3. Calculate the degrees of freedom (dF = N-2) and determine the critical value of t.
4. Calculate the value of Pearson’s r, by using Formula 10-1.
5. Calculate the value of t (by using Formula 10-2) and determine the statistical
decision for hypothesis testing:
If tcomputed < tcritical, do not reject H0.
When the null hypothesis has been rejected for a specific significance level, there are
several possible relationships between the X and Y variables:
1. There is a direct cause-and-effect relationship between the two variables.
2. There is a reverse cause-and-effect relationship between the two variables.
3. The relationship between the two variables may be caused by the third variable.
4. There may be a complexity of interrelationship among many variables.
5. The relationship between the two variables may be coincidental.
Example: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample
of 12 days is selected with the results given as follows:
Day                  1    2    3    4    5    6    7    8    9    10   11   12
Temperature (ºF)     79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (Units)  147  143  147  168  206  155  192  211  209  187  200  150
Plot the data on a scatter diagram. Does it appear that there is a relationship between
atmospheric temperature and sales? Compute the coefficient of correlation. Determine at
the 0.05 significance level whether the correlation in the population is greater than zero.
Solution:
Step 1: Graph the scatter plot.
[Scatter plot: Sales (Y) against Temperature (X)]
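$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$
$$= \frac{12(183{,}222) - (1{,}029)(2{,}115)}{\sqrt{[12(88{,}733) - (1{,}029)^2][12(380{,}887) - (2{,}115)^2]}}$$
$$= \frac{2{,}198{,}664 - 2{,}176{,}335}{\sqrt{[1{,}064{,}796 - 1{,}058{,}841][4{,}570{,}644 - 4{,}473{,}225]}}$$
$$= \frac{22{,}329}{\sqrt{[5{,}955][97{,}419]}}$$
= 0.9270572554
= 0.93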
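The arithmetic for r can be checked with a short script. This is a minimal sketch in Python that applies Formula 10-1 to the example data; the function name `pearson_r` and the variable names are illustrative and do not come from the text.

```python
import math

# Data from the fruit shake example (temperature in °F, sales in units).
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]


def pearson_r(x, y):
    """Pearson's r computed from raw sums, following Formula 10-1."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(temperature, sales)
print(round(r, 4))  # 0.9271
```

Rounded to two decimal places, this agrees with the value r = 0.93 used in the rest of the solution.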
$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} = \frac{0.93\sqrt{12-2}}{\sqrt{1-(0.93)^2}} = 8.00$$
Since the computed t-value of 8.00 is greater than the tabular value of
2.228 at level of significance of 0.05, we would need to reject the null hypothesis.
Step 7:Conclusion
Since the null hypothesis has been rejected, we can conclude that there is evidence
that shows significant association between the atmospheric temperature and the total
sales of fruit shake.
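The test statistic and the decision in this example can be reproduced as follows. This is a minimal sketch, assuming the rounded r = 0.93 and the tabular value 2.228 quoted in the solution; `t_for_correlation` is a hypothetical helper name.

```python
import math


def t_for_correlation(r, n):
    """t statistic for a correlation coefficient, following Formula 10-2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_computed = t_for_correlation(0.93, 12)
t_critical = 2.228  # tabular value quoted in the solution (df = 12 - 2 = 10)

# Reject H0 when the computed t exceeds the tabular value.
reject_h0 = t_computed > t_critical
print(round(t_computed, 2), reject_h0)
```

Because the computed value is far above the tabular value, the decision does not depend on the rounding of r.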
Spearman rank correlation (or Spearman’s rho) is a nonparametric test that is used
to measure the degree of association between two variables. The Spearman rank
correlation test does not make any assumptions about the distribution of the data. It is
used when the Pearson test gives misleading results. Spearman rank correlation was named
after Charles Edward Spearman, who developed it, and is often denoted by ρ (rho) or rs.
Spearman rank correlation is the nonparametric counterpart of the Pearson Product
Moment Correlation in parametric statistics. It is calculated by converting each variable
to ranks and calculating the Pearson Product Moment Correlation between the two sets of
ranks. For small sample sizes, the observed correlation coefficient is compared to what
would result if the ranks of the X-values and Y-values were random permutations of the
integers 1 to n (sample size).
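The rank-then-correlate description above can be sketched in Python. This is an illustration only: giving tied values the average of their ranks is an assumption (the text does not say how ties are handled), and the names `average_ranks` and `pearson_r` are hypothetical. The data are the fruit shake temperatures and sales.

```python
import math


def average_ranks(values):
    """Rank values from smallest (1) to largest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]

rank_x = average_ranks(temperature)
rank_y = average_ranks(sales)
rho = pearson_r(rank_x, rank_y)

print(rank_y[:3])     # [2.5, 1.0, 2.5]  (the two 147s share ranks 2 and 3)
print(round(rho, 4))  # 0.9702
```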
Example: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample
of 12 days is selected with the results given as follows.
Day                  1    2    3    4    5    6    7    8    9    10   11   12
Temperature (ºF)     79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (Units)  147  143  147  168  206  155  192  211  209  187  200  150
[Scatter plot: Rank of Y against Rank of X]
Day  X  Y  Rx  Ry  D  $D^2$
Step 7:Conclusion.
Since the null hypothesis has been rejected, we can conclude that there is
evidence that shows significant correlation between the atmospheric temperature
and the total sales of fruit shake.
The least squares method determines the coefficients that minimize the sum of
the squared deviations between the points and the line defined by the coefficients. This
sum is called the sum of squares for error (or unexplained variation).
The variation due to chance, denoted by $\sum (Y_i - \hat{Y}_i)^2$, is called the unexplained
variation. This variation cannot be attributed to the relationship. If the value of r is
close to +1 or -1, then the unexplained variation is small. Alternatively, the variation
obtained from the relationship is $\sum (\hat{Y}_i - \bar{Y})^2$ and is called the explained
variation. Most of the variation can be explained by the relationship. The closer the value
of r is to +1 or -1, the better the points fit the line and the closer
$\sum (\hat{Y}_i - \bar{Y})^2$ is to $\sum (Y_i - \bar{Y})^2$. If all points fall on the
regression line, $\sum (\hat{Y}_i - \bar{Y})^2$ will be equal to $\sum (Y_i - \bar{Y})^2$.
[Figure 10.2: regression line $\hat{Y} = b_1 X + b_0$, showing the total sum of squares and the explained sum of squares at $X_i$]
Figure 10.2 shows the different measures of variation in a simple linear regression
equation. The formula for the sums of squares is:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$ (Formula 10-8)
or SST = SSR + SSE
where: $\hat{Y}$ = predicted or fitted value of Y.
$Y$ = the value of any particular observation of the dependent variable.
$\bar{Y}$ = mean of the dependent variable.
$\sum (Y_i - \bar{Y})^2$ = total variation in Y (or total sum of squares).
$\sum (Y_i - \hat{Y}_i)^2$ = unexplained variation in Y (or error sum of squares).
$\sum (\hat{Y}_i - \bar{Y})^2$ = explained variation in Y (or regression sum of squares).
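The identity SST = SSR + SSE can be checked numerically on the fruit shake data from the correlation example. This is a minimal sketch: the least squares coefficients are computed with the standard deviation-form formulas (an assumption, although they are algebraically equivalent to the raw-sum form), and all variable names are illustrative.

```python
# Data from the fruit shake example (temperature in °F, sales in units).
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Least squares slope and intercept (deviation form, assumed here).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
s_xx = sum((x - mean_x) ** 2 for x in temperature)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * x for x in temperature]

sst = sum((y - mean_y) ** 2 for y in sales)              # total variation
ssr = sum((f - mean_y) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))   # unexplained variation

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The last two printed lines show that the explained and unexplained parts add back up to the total variation, apart from floating-point rounding.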
A small standard error of estimate indicates that the independent variable is a
good predictor of the dependent variable.
C. Standard Error of Estimate
The standard error of estimate is the standard deviation of the observed Y values
about the predicted $\hat{Y}$ values. The formula for the standard error of estimate is
$$S_E = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-2}}$$ (Formula 10-9)
D. Coefficient of Determination
The coefficient of determination is the proportion of the variance of the dependent
variable that is explained by the regression line and the independent variable. It is the
ratio of the explained variation to the total variation and is denoted by $r^2$. The
coefficient of determination can also be computed by squaring the value of Pearson’s r. It
may range between 0 and 1, or 0% and 100%. That is,
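$$r^2 = \frac{\text{total variation} - \text{unexplained variation}}{\text{total variation}} = \frac{\text{explained variation}}{\text{total variation}}$$
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$ (Formula 10-10)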
where: $r^2$ = coefficient of determination.
SSE = error sum of squares.
SSR = regression sum of squares.
SST = total sum of squares.
On the other hand, the coefficient of non-determination is the proportion of the
variation in the dependent variable that is left unexplained by the independent variable,
determined by $1 - r^2$.
E. Confidence Interval and Prediction Interval
A confidence interval is a closed interval that, at a stated level of confidence, is
likely to contain the population value being estimated. For example, a 95% confidence
interval with a lower limit of $\bar{X} - z$ and an upper limit of $\bar{X} + z$ means that, in
repeated sampling, 95% of the intervals built this way contain the population value. In
the remaining 5% of cases, the population value lies below $\bar{X} - z$ in 2.5% and above
$\bar{X} + z$ in 2.5%. This section discusses confidence intervals used in simple linear
regression analysis. To determine the confidence interval for the mean value of Y for a
given X, the formula is:
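$$\text{Confidence Interval} = \hat{Y} \pm t_{\alpha/2}(s_E)\sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$ (Formula 10-11)
$$\text{Prediction Interval} = \hat{Y} \pm t_{\alpha/2}(s_E)\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$ (Formula 10-12)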
Solution:
$\sum X$ = 1,029
$\sum Y$ = 2,115
$\sum X^2$ = 88,733
$\sum Y^2$ = 380,887
$\sum XY$ = 183,222
Step 2: Compute for slope of the simple linear regression.
$$b_1 = \frac{N(\sum XY) - (\sum X)(\sum Y)}{N(\sum X^2) - (\sum X)^2}$$ (Formula 10-6)
$$\bar{X} = \frac{\sum X}{N} = \frac{1{,}029}{12} = 85.75$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{2{,}115}{12} = 176.25$$
Step 5: Substitute the slope and intercept in the general simple linear regression equation.
$\hat{Y} = b_1 X + b_0$   General Equation for Simple Linear Regression (Formula 10-5)
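The slope and intercept for the fruit shake data can be reproduced from the sums listed in the solution. This is a minimal sketch: the slope follows Formula 10-6, while the intercept uses the standard relation b0 = Ȳ − b1·X̄, which is assumed here because the intercept formula is not shown in this part of the text. Variable names are illustrative.

```python
# Sums from the solution for the fruit shake data (N = 12 days).
n = 12
sum_x, sum_y = 1029, 2115
sum_x2, sum_xy = 88733, 183222

# Slope, following Formula 10-6.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

mean_x = sum_x / n  # 85.75
mean_y = sum_y / n  # 176.25

# Intercept from the assumed standard relation b0 = mean_y - b1 * mean_x.
b0 = mean_y - b1 * mean_x

# Rounding the slope to four decimals before computing the intercept
# gives the intercept printed in the fitted equation.
b0_text = mean_y - round(b1, 4) * mean_x

print(round(b1, 4))       # 3.7496
print(round(b0_text, 4))  # -145.2782
```

With the unrounded slope the intercept is about -145.2801, so the small difference from -145.2782 comes only from rounding the slope first.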
[Figure: fitted regression line $\hat{Y} = 3.7496X - 145.2782$, plotted with Temperature (ºF) and Sales on the axes]
Now let us compute for the sum of squares for error, standard error, and
coefficient of determination.
Day  $Y$  $Y - \bar{Y}$  $(Y - \bar{Y})^2$  $\hat{Y}$  $Y - \hat{Y}$  $(Y - \hat{Y})^2$
$$S_E = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-2}} = \sqrt{\frac{1{,}141.1406}{12-2}} = 10.6824$$
Because SE = 10.6824 and $\bar{Y}$ = 176.25, we would have to admit that the standard
error of estimate is small. However, because there is no predefined upper limit on SE, it
is difficult to assess the regression model with it alone. In summary, the standard error
of estimate cannot serve as a fixed measure of the utility of the regression model.
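The standard error computation can be verified in a couple of lines. This is a minimal sketch that takes the error sum of squares 1,141.1406 and n = 12 from the solution; `standard_error_of_estimate` is a hypothetical helper name.

```python
import math


def standard_error_of_estimate(sse, n):
    """S_E = sqrt(SSE / (n - 2)), following Formula 10-9."""
    return math.sqrt(sse / (n - 2))


s_e = standard_error_of_estimate(1141.1406, 12)
print(round(s_e, 4))  # 10.6824
```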
Computation of coefficient of Determination
We found that $r^2$ is equal to 0.8594. This statistic denotes that 85.94% of the variation
in the fruit shake sales is explained by the variation in the temperature reading in degrees
Fahrenheit.
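One way to arrive at this value is sketched below, using Formula 10-10 with the sums from the solution. The shortcut $SST = \sum Y^2 - (\sum Y)^2 / n$ is a standard identity that is assumed here; it is not stated in the text.

$$SST = \sum Y^2 - \frac{(\sum Y)^2}{n} = 380{,}887 - \frac{(2{,}115)^2}{12} = 380{,}887 - 372{,}768.75 = 8{,}118.25$$
$$r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1{,}141.1406}{8{,}118.25} \approx 1 - 0.1406 = 0.8594$$

Squaring the unrounded correlation coefficient gives the same result: $(0.9270572554)^2 \approx 0.8594$. The coefficient of non-determination is then $1 - r^2 \approx 0.1406$.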
Computation of Confidence Interval
Step 1: Determine the estimated number of fruit shakes sold when the temperature is 95
degrees Fahrenheit. It is 210.9338, found by
$$\hat{Y} = 3.7496X - 145.2782 = 3.7496(95) - 145.2782 = 210.9338$$
Step 2: Identify the t value associated with the 95% level of confidence.
Step 3: Determine the degrees of freedom and the critical values of t. (Refer to table B)
df = n - 2 = 12 - 2 = 10 and t = ±2.228
Step 4: Compute for the confidence interval.
Recall the following:
$\hat{Y}$ = 210.9338, SE = 10.6824, n = 12,
$\sum X$ = 1,029, $\sum X^2$ = 88,733, $\bar{X} = \frac{\sum X}{N} = \frac{1{,}029}{12} = 85.75$
$$\text{Confidence Interval} = \hat{Y} \pm t(S_E)\sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$
$$= 210.9338 \pm (2.228)(10.6824)\sqrt{\frac{1}{12} + \frac{(95 - 85.75)^2}{88{,}733 - \frac{(1{,}029)^2}{12}}}$$
= 210.9338 ± 12.0363
= 198.8975 up to 222.9701
So the 95% confidence interval for the number of fruit shakes sold when the
temperature is 95 degrees Fahrenheit is 210.9338 ± 12.0363, or 198.8975 fruit shakes up to
222.9701 fruit shakes.
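The confidence interval arithmetic can be checked with a short script. This is a minimal sketch of Formula 10-11 using the rounded inputs from the solution; the function name `mean_response_interval` and its parameter names are illustrative only.

```python
import math


def mean_response_interval(x0, y_hat, s_e, t_value, n, sum_x, sum_x2):
    """Confidence interval for the mean of Y at X = x0 (Formula 10-11)."""
    mean_x = sum_x / n
    s_xx = sum_x2 - sum_x ** 2 / n  # sum of X^2 minus (sum of X)^2 / n
    margin = t_value * s_e * math.sqrt(1 / n + (x0 - mean_x) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin


# Inputs taken from the worked solution.
ci_lower, ci_upper = mean_response_interval(
    x0=95, y_hat=210.9338, s_e=10.6824, t_value=2.228,
    n=12, sum_x=1029, sum_x2=88733,
)
print(f"{ci_lower:.2f} up to {ci_upper:.2f}")  # 198.90 up to 222.97
```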
Computation of Prediction Interval
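$$\text{Prediction Interval} = \hat{Y} \pm t_{\alpha/2}(S_E)\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$
$$= 210.9338 \pm (2.228)(10.6824)\sqrt{1 + \frac{1}{12} + \frac{(95 - 85.75)^2}{88{,}733 - \frac{(1{,}029)^2}{12}}}$$
= 210.9338 ± 26.6707 or
= 184.2631 up to 237.6045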
We conclude that the probability is 0.95 that the number of fruit shakes sold is
between 184.2631 fruit shakes and 237.6045 fruit shakes. This interval is quite large.
The most important distinction between a confidence interval and a prediction
interval is that a confidence interval refers to all cases with the given value of X
(independent variable), while a prediction interval refers to a particular case for a given
value of X (independent variable).
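The difference between the two intervals comes down to the extra 1 under the square root in Formula 10-12. The sketch below computes both margins from the rounded inputs of the worked example so they can be compared side by side; the function name `interval_margin` and the `for_single_case` flag are hypothetical.

```python
import math


def interval_margin(x0, s_e, t_value, n, sum_x, sum_x2, for_single_case):
    """Half-width of the interval at X = x0.

    for_single_case=False gives the confidence interval margin (Formula 10-11);
    for_single_case=True adds 1 under the root (Formula 10-12).
    """
    mean_x = sum_x / n
    s_xx = sum_x2 - sum_x ** 2 / n
    inside = 1 / n + (x0 - mean_x) ** 2 / s_xx
    if for_single_case:
        inside += 1
    return t_value * s_e * math.sqrt(inside)


y_hat = 210.9338
common = dict(x0=95, s_e=10.6824, t_value=2.228, n=12, sum_x=1029, sum_x2=88733)

ci_margin = interval_margin(for_single_case=False, **common)
pi_margin = interval_margin(for_single_case=True, **common)

pi_lower, pi_upper = y_hat - pi_margin, y_hat + pi_margin
print(f"confidence margin: {ci_margin:.4f}")
print(f"prediction interval: {pi_lower:.2f} up to {pi_upper:.2f}")  # 184.26 up to 237.60
```

The prediction margin is always the larger of the two, which is why the prediction interval in the example is noticeably wider than the confidence interval.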
Find the coefficient of correlation. Determine at the 0.01 significance level whether the
correlation in the population is greater than zero.
Step 1: State the hypotheses.
Ho: ______________________________________________________________
H1: ______________________________________________________________
Step 2: The level of significance and critical region. α = _______ and tcritical = ______________
Step 3: Complete the table and compute for the value of r and t.
Customer  X  Y  $X^2$  $Y^2$  XY
1
2
3
4
5
6
7
8
$\sum X$ = ______
$\sum X^2$ = ______
r = ____________ tcomputed =________________
Step 4: Decision rule: ______________________________________________________
__________________________________________________________________
Step 5: Conclusion: _______________________________________________________
__________________________________________________________________________________________________
Month Jan Feb Mar Apr May June July Aug Sept
Advertising Expense 10 14 12 9 13 15 8 13 16
Sales Revenue 180 170 190 220 235 208 215 175 250
Find the coefficient of correlation. Determine at the 0.10 significance level whether the
correlation in the population is greater than zero.
Step 2: The level of significance and critical region. α= _______ and t critical= _______.
Step 3: Complete the table and compute for the value of r and t.
Month  X  Y  $X^2$  $Y^2$  XY
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sept
No. Of Employee 2 4 6 8 5 12 10 9
Production (units) 20 15 25 28 22 30 35 40
Determine the coefficient of correlation using Spearman rank correlation and test
the significance at 0.05.
Solution:
Step 1: State the hypotheses
H 0 :_____________________________________________________
_________________________________________________________
H 1: _____________________________________________________
_________________________________________________________
Step 4: Complete the table and compute for the value of ρ and the t-test.
Pair  X  Y  RX  RY  D  $D^2$
1
2
3
4
5
6
7
8
Car 1 2 3 4 5 6 7 8 9
Age (years) 3 4 8 10 6 7 5.5 4 2.5
Selling Price (₱000) 350 320 170 120 280 190 250 300 400
Determine the coefficient of correlation using Spearman rank correlation and test the
significance at 0.01.
Solution:
Step 1: State the hypotheses
H 0 :_____________________________________________________
_________________________________________________________
H 1: _____________________________________________________
_________________________________________________________
Step 4: Complete the table and compute for the value of ρ and the t-test.
Pair  X  Y  RX  RY  D  $D^2$
1
2
3
4
5
6
7
8
Determine the regression equation. Solve for the sum of squares for error, standard error,
and coefficient of determination. Find the 90% confidence interval for all years for the
12th year. Determine a 95% prediction interval for the 12th year.
1 13
2 15
3 17
4 18
5 17
6 19
7 22
8 20
9 23
10 26
$\sum X$ = __________  $\sum Y$ = __________  $\bar{X}$ = __________  $\bar{Y}$ = __________
Determine the regression equation. Solve for the sum of squares for error,
standard error, and coefficient of determination. Find the 99% confidence interval for
all months for the 10th month. Determine a 99% prediction interval for the 10th month.
Step 1: Complete the table.
$X$  $Y$  $X^2$  $Y^2$  $XY$  $(Y - \bar{Y})^2$  $\hat{Y}$  $(Y - \hat{Y})^2$
13 20
6 14
8 16
10 19
20 21
7 12
9 13
15 21