MATH 121 (Chapter 10) - Correlation & Regression

MODULE 3.
1
November 17 – 2, 2020
Chapter 10 Correlation and Regression Analysis
Learning Objectives
After completing this chapter, the students will be able to:
o Distinguish and explain the difference between independent and dependent
variables.
o Draw a scatter plot for a set of ordered pairs.
o Calculate and interpret the coefficient of correlation using Pearson Product
Moment Correlation Coefficient (PMCC).
o Evaluate the significance of the correlation coefficient.
o Conduct a test of hypothesis on the correlation coefficient.
o Conduct a test for Spearman Rank Correlation and explain its applications.
o Calculate and interpret the linear regression equation.
o Calculate and interpret the coefficient of determination and the standard error of
estimate.
o Determine the confidence interval and the prediction interval in a linear
regression equation.
Chapter Outline
10.1 Introduction
10.2 Scatter Diagram
10.3 Pearson Product-Moment Correlation
Correlation Coefficient
Test of Significance
10.4 Spearman Rank Correlation
10.5 Simple Linear Regression Analysis
Estimating the Coefficient
Sum of Squares for Error
The Standard Error of Estimate
Coefficient of Determination
Confidence Interval and Prediction Interval
The primary duty of science is to establish relationships between variables.
-Anonymous
10.1 Introduction
This chapter covers correlation and regression analysis. The first part of the chapter
discusses the scatter diagram and its importance. This is followed by correlation, a
statistical method used to determine whether a relationship between variables exists. A
variable here is a characteristic of the population being observed or measured. For
instance, the variables of interest might be advertising expense and sales. The sample
then consists of random observations of the variables describing a given population.
The second part of the chapter covers regression analysis, a statistical method used to
describe the nature of the relationship between variables, that is, whether it is positive
or negative, linear or nonlinear. There are two types of relationships: simple and multiple.
In a simple relationship, there are two variables: an independent variable (or explanatory
variable or predictor variable) and a dependent variable (or response variable). In a
multiple relationship (or multiple regression), two or more independent variables are used
to predict the dependent variable.
A simple linear relationship can be positive or negative. A positive relationship exists
when both variables increase together or both decrease together. In a negative
relationship, as one variable increases, the other variable decreases, and vice versa. This
text is limited to simple linear regression analysis.
10.2 Scatter Diagram
A scatter diagram is a useful tool for checking the assumptions of a regression analysis.
It can be viewed during an initial screening run of the analysis or after the analysis. The
benefit of looking at scatter diagrams of residuals in the beginning stages of an analysis
is that it may save the researcher time. If the assumptions are not met, further screening
must be applied before the analysis can be completed, and the data may require cleansing
and transformation; this way the researcher is not running the analysis haphazardly. If the
assumptions are met, the regression is ready to be run, and the researcher has increased
confidence that the chances of making a Type I or Type II error are reduced, ultimately
improving the accuracy of the research results.
10.3 Pearson Product-Moment Correlation
The Pearson product-moment correlation is the statistic most widely used to measure the
degree of relationship between linearly related variables. The Pearson r correlation
requires both variables to be normally distributed. Correlation refers to the departure of
two random variables from independence. For example, in the stock market, if we want to
measure how two products are related to each other, the Pearson r correlation is used to
measure the degree of relationship between the two products.
The correlation coefficient is defined as the covariance of the two variables divided by
the product of their standard deviations. The following formulas are used to calculate the
Pearson r correlation:
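$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{N}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{N}\right]}}$$
$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$$ (Formula 10-1)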
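As a minimal sketch of the computational form of the Pearson formula (Formula 10-1), the Python function below builds r from the five sums the formula uses. The function name pearson_r and the small data set are hypothetical and chosen only for illustration; they do not come from the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from the sums N, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired samples with at least two pairs")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has no variation")
    return numerator / denominator


# Hypothetical illustrative data (not from the text).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
print(round(r, 4))
```

For data that fall exactly on a rising straight line, such as X = 1, 2, 3 and Y = 2, 4, 6, the same function returns 1.0, which matches the perfect positive case described in the next subsection.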
A. Correlation Coefficient
Pearson’s product-moment correlation coefficient, or simply the correlation
coefficient (or Pearson’s r), is a measure of the strength of the linear association
between two variables. It was developed by Karl Pearson. The value of the correlation
coefficient varies between +1 and -1. When the value of the correlation coefficient is
±1, there is said to be a perfect degree of association between the two variables. As
the value of the correlation coefficient moves closer to zero, the relationship between
the two variables becomes weaker. This information is summarized in the charts below.
[Charts: six scatter plots of the Y variable against the X variable, titled Perfect
Positive Correlation (r = 1.00), Perfect Negative Correlation (r = -1.00), Positive
Correlation (r = 0.80), Negative Correlation (r = -0.80), Zero Correlation (r = 0.00),
and Non-linear Correlation.]
The following summarizes the correlation coefficient and strength of relationships.

0.00 – no correlation, no relationship
±0.01 to ±0.20 – very low correlation, almost negligible relationship
±0.21 to ±0.40 – slight correlation, definite but small relationship
±0.41 to ±0.70 – moderate correlation, substantial relationship
±0.71 to ±0.90 – high correlation, marked relationship
±0.91 to ±0.99 – very high correlation, very dependable relationship
±1.00 – perfect correlation, perfect relationship
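The table above can be read as a simple lookup. The sketch below is one hedged way to encode it in Python; the function name describe_strength is hypothetical, and rounding r to two decimal places before the lookup is an assumption made here because the table is stated to two decimals and leaves gaps between its ranges (for example between 0.20 and 0.21).

```python
def describe_strength(r):
    """Return the verbal label from the strength table for a coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = round(abs(r), 2)  # assumption: classify on the two-decimal value
    if size == 0.0:
        return "no correlation"
    if size <= 0.20:
        return "very low correlation"
    if size <= 0.40:
        return "slight correlation"
    if size <= 0.70:
        return "moderate correlation"
    if size <= 0.90:
        return "high correlation"
    if size < 1.00:
        return "very high correlation"
    return "perfect correlation"


print(describe_strength(0.93))
print(describe_strength(-0.80))
```

The sign of r is ignored for the label because the table applies the same wording to positive and negative values; the direction of the relationship has to be reported separately.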
B. Test of Significance
A test of significance for the coefficient of correlation may be used to find out
whether the computed Pearson’s r could have occurred by chance in a population in which
the two variables are not related. The test statistic follows the t distribution with
N – 2 degrees of freedom. The significance is computed using the formula of t as shown
in Formula 10-2.
$$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$$ (Formula 10-2)
Where: t = t test for correlation coefficient
r = correlation coefficient
N = number of paired samples
Assumptions in Pearson Product-Moment Correlation test:
1. Subjects are randomly selected.
2. Both populations are normally distributed.
Procedure in Pearson Product-Moment Correlation test:
1. Set up the hypotheses:
H0: ρ = 0 (The correlation in the population is zero.)
H1: ρ ≠ 0, ρ > 0, or ρ < 0 (The correlation in the population is different from zero,
greater than zero, or less than zero.)
where: ρ = correlation in the population.
2. Set the level of significance.
3. Calculate the degrees of freedom (df = N – 2) and determine the critical value of t.
4. Calculate the value of Pearson’s r by using Formula 10-1.
5. Calculate the value of t (by using Formula 10-2) and determine the statistical
decision for hypothesis testing. For a two-tailed test, compare the absolute value of
the computed t with the critical t:
If |t computed| < t critical, do not reject H0.
If |t computed| ≥ t critical, reject H0.
6. State the conclusion.
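Steps 3 to 5 of the procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the values r = 0.60 and N = 27 are made up for the example, and the critical value 2.060 (two-tailed, α = 0.05, 25 degrees of freedom) is typed in from a standard t table rather than computed.

```python
from math import sqrt


def t_for_correlation(r, n):
    """t statistic of Formula 10-2 for a correlation r computed from n pairs."""
    if n <= 2:
        raise ValueError("at least three pairs are needed (df = n - 2)")
    if abs(r) >= 1:
        raise ValueError("t is undefined when r is exactly +1 or -1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def two_tailed_decision(t_computed, t_critical):
    """Apply the two-tailed decision rule from step 5."""
    return "reject H0" if abs(t_computed) >= t_critical else "do not reject H0"


# Hypothetical inputs for illustration only.
r, n = 0.60, 27
t = t_for_correlation(r, n)            # 0.60 * 5 / 0.8 = 3.75
decision = two_tailed_decision(t, 2.060)
print(round(t, 2), decision)
```

Taking the absolute value of t in the decision function lets the same rule handle negative correlations, whose computed t is negative.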

The test for the correlation coefficient is two-tailed; the rejection region is divided
into two equal parts (i.e., we divide 0.05 into two equal parts of 0.025 each). Figure 10.1
illustrates the rejection and nonrejection regions of the test of hypothesis for the
correlation coefficient.
Figure 10.1: Testing the Hypothesis of Correlation Coefficient at 0.05 Significance Level
[Figure: t distribution on the scale of t, centered at 0, with a nonrejection region
(there is no correlation) between -tα/2 and +tα/2 and rejection regions (there is
correlation) in the two tails.]
When the null hypothesis has been rejected at a specific significance level, there are
several possible relationships between the X and Y variables:
1. There is a direct cause-and-effect relationship between the two variables.
2. There is a reverse cause-and-effect relationship between the two variables.
3. The relationship between the two variables may be caused by the third variable.
4. There may be a complexity of interrelationship among many variables.
5. The relationship between the two variables may be coincidental.
Example: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample
of 12 days is selected with the results given as follows:
Day                     1    2    3    4    5    6    7    8    9   10   11   12
Temperature (ºF)       79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (Units)   147  143  147  168  206  155  192  211  209  187  200  150
Plot the data on a scatter diagram. Does it appear that there is a relationship between
atmospheric temperature and sales? Compute the coefficient of correlation. Determine at
the 0.05 significance level whether the correlation in the population is greater than zero.
Solution:
Step 1: Graph the scatter plot.
[Scatter plot of Sales (Y), axis from 120 to 220, against Temperature (X), axis from 70 to 100.]
Step 2: State the hypotheses.
H0: ρ = 0
(There is no correlation between atmospheric temperature and total sales of fruit
shake.)
H1: ρ ≠ 0
(There is a correlation between atmospheric temperature and total sales of fruit
shake.)
Step 3: The level of significance is α = 0.05.
Step 4: Determine the degrees of freedom and the critical value of t. (Refer to Table B)
df = N – 2 = 12 – 2 = 10 and t = ±2.228
Step 5: Compute for the value of r (Pearson Product-Moment Correlation Coefficient).
Day        X       Y       X²        Y²        XY
1         79     147    6,241    21,609    11,613
2         76     143    5,776    20,449    10,868
3         78     147    6,084    21,609    11,466
4         84     168    7,056    28,224    14,112
5         90     206    8,100    42,436    18,540
6         83     155    6,889    24,025    12,865
7         93     192    8,649    36,864    17,856
8         94     211    8,836    44,521    19,834
9         97     209    9,409    43,681    20,273
10        85     187    7,225    34,969    15,895
11        88     200    7,744    40,000    17,600
12        82     150    6,724    22,500    12,300
Total  1,029   2,115   88,733   380,887   183,222
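∑X = 1,029   ∑Y = 2,115   ∑X² = 88,733   ∑Y² = 380,887   ∑XY = 183,222
$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$$
$$= \frac{12(183{,}222) - (1{,}029)(2{,}115)}{\sqrt{\left[12(88{,}733) - (1{,}029)^2\right]\left[12(380{,}887) - (2{,}115)^2\right]}}$$
$$= \frac{2{,}198{,}664 - 2{,}176{,}335}{\sqrt{[1{,}064{,}796 - 1{,}058{,}841][4{,}570{,}644 - 4{,}473{,}225]}}$$
$$= \frac{22{,}329}{\sqrt{[5{,}955][97{,}419]}}$$
= 0.9270572554
= 0.93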
The coefficient of correlation, r = 0.93, between the atmospheric temperature and
total sales indicates a very high positive correlation (very dependable relationship);
that is, an increase in atmospheric temperature is highly associated with an increase
in total sales of fruit shake.
Step 6: Decision rule:
In order to make a decision on the significance of the relationship, we need to
determine the value of t.
$$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}} = \frac{0.93 \sqrt{12 - 2}}{\sqrt{1 - (0.93)^2}} = 8.00$$
Since the computed t-value of 8.00 is greater than the tabular value of 2.228 at the
0.05 level of significance, we reject the null hypothesis.
Step 7:Conclusion
Since the null hypothesis has been rejected, we can conclude that there is evidence
that shows significant association between the atmospheric temperature and the total
sales of fruit shake.
10.4 Spearman Rank Correlation
Spearman rank correlation (or Spearman’s rho) is a nonparametric test that is used
to measure the degree of association between two variables. The Spearman rank
correlation test makes no assumptions about the distribution of the data. It is used
when the Pearson test gives misleading results. Spearman rank correlation is named
after Charles Edward Spearman, who developed it, and is often denoted by ρ (rho) or rs.
Spearman rank correlation is the nonparametric counterpart of the Pearson Product
Moment Correlation in parametric statistics. It is calculated by converting each
variable to ranks and calculating the Pearson Product Moment Correlation between the
two sets of ranks. For small sample sizes, the observed correlation coefficient is
compared to what would result if the ranks of the X-values and Y-values were random
permutations of the integers 1 to n (the sample size).
The following formula is used to calculate the Spearman rank correlation.
$$\rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$ (Formula 10-3)
Where: ρ = Spearman rank correlation.
D = the difference between the ranks of corresponding values of X and Y.
N = number of values in each data set.
After obtaining the value of ρ, we interpret it the same way we interpret the Pearson
Product Moment Correlation Coefficient (PMCC). We then test its significance using
Formula 10-4, shown below, with N – 2 degrees of freedom.
$$t = \frac{\rho \sqrt{N - 2}}{\sqrt{1 - \rho^2}}$$ (Formula 10-4)
where: t = Student’s t-test
ρ = Spearman’s rho
N = number of paired samples
Assumptions in Spearman Rank Correlation test:
1. Subjects are randomly selected.

2. Observations must be at least ordinal level.
Procedure for Spearman Rank Correlation test:
1. Set up the hypotheses:
H0: ρ = 0 (The correlation in the population is zero.)
H1: ρ ≠ 0, ρ > 0, or ρ < 0 (The correlation in the population is different from zero,
greater than zero, or less than zero.)
where: ρ = correlation in the population.
2. Set the level of significance.
3. Calculate the degrees of freedom (df = N – 2) and determine the critical value of t.
4. Calculate the value of Spearman’s rho (ρ) by using Formula 10-3.
5. Calculate the value of t (by using Formula 10-4) and determine the statistical
decision for hypothesis testing. For a two-tailed test, compare the absolute value of
the computed t with the critical t:
If |t computed| < t critical, do not reject H0.
If |t computed| ≥ t critical, reject H0.
6. State the conclusion.
Example: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample
of 12 days is selected with the results given as follows.
Day                     1    2    3    4    5    6    7    8    9   10   11   12
Temperature (ºF)       79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (Units)   147  143  147  168  206  155  192  211  209  187  200  150
At the 0.05 significance level, can we conclude that there is a significant correlation
between the atmospheric temperature and total sales of fruit shakes?
Solution:
Step 1: Graph the scatter diagram.
[Scatter diagram of Rank of Y, axis from 0 to 12, against Rank of X, axis from 0 to 14.]
Step 2:State the hypotheses.

H0: ρ = 0
(There is no correlation between atmospheric temperature and total sales of fruit shake.)
H1: ρ ≠ 0
(There is a correlation between atmospheric temperature and total sales of fruit shake.)
Step 3: The level of significance is α = 0.05.

Step 4: Determine the critical region (Refer to Table B)
df = N – 2 = 12 – 2 = 10 and t = ±2.228
Step 5: Compute for the value of ρ (Spearman rank order correlation).
Before we can compute Spearman’s rho (ρ), we have to rank the values of X and the
values of Y, where rank 1 is the highest value and rank 12 is the lowest. In case of a
tie, as happens in variable Y, which contains two values of 147, we add the ranks the
tied values would occupy (rank 10 and rank 11) and divide the sum by the number of tied
values, e.g. (10 + 11)/2 = 10.5. Therefore, both values of 147 carry the rank 10.5.
Obtain the value of D (the difference between Rx and Ry) and square it, then get the sum.
Day      X      Y     Rx     Ry      D     D²
1       79    147     10   10.5   -0.5   0.25
2       76    143     12     12      0      0
3       78    147     11   10.5    0.5   0.25
4       84    168      7      7      0      0
5       90    206      4      3      1      1
6       83    155      8      8      0      0
7       93    192      3      5     -2      4
8       94    211      2      1      1      1
9       97    209      1      2     -1      1
10      85    187      6      6      0      0
11      88    200      5      4      1      1
12      82    150      9      9      0      0
Total                                0    8.5
∑D² = 8.5
Apply Formula 10-3 to obtain the Spearman rank correlation coefficient.
$$\rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6(8.5)}{12(12^2 - 1)} = 1 - \frac{51}{12(143)} = 1 - 0.03 = 0.97$$
The coefficient of correlation, ρ = 0.97, between the atmospheric temperature and
total sales indicates a very high positive correlation; that is, an increase in atmospheric
temperature is highly associated with an increase in total sales.
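The ranking rule and the rank-difference formula for ρ can be checked with a short Python sketch. The function names are hypothetical, and the code follows the text's convention that rank 1 goes to the highest value and tied values share the average of the ranks they occupy. Note that when ties are present this formula is only an approximation to the Pearson correlation of the ranks.

```python
def average_ranks_descending(values):
    """Rank values so the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the tied rank positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Return (rho, sum of squared rank differences) using 1 - 6*sum(D^2)/(N(N^2-1))."""
    n = len(x)
    rx = average_ranks_descending(x)
    ry = average_ranks_descending(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


# Fruit shake data from the example.
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]
rho, sum_d2 = spearman_rho(temperature, sales)
print(sum_d2, round(rho, 2))
```

Run on the example data, the sketch gives a sum of squared rank differences of 8.5 and ρ of about 0.97, in agreement with the hand computation.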
Step 6: Decision rule:
Apply the t-ratio formula to test the hypothesis.
$$t = \frac{\rho \sqrt{N - 2}}{\sqrt{1 - \rho^2}} = \frac{0.97 \sqrt{12 - 2}}{\sqrt{1 - (0.97)^2}} = 12.62$$
Since the computed t-value of 12.62 is greater than the critical value of 2.228 at the
0.05 level of significance, we reject the null hypothesis.
Step 7:Conclusion.
Since the null hypothesis has been rejected, we can conclude that there is
evidence that shows significant correlation between the atmospheric temperature
and the total sales of fruit shake.
10.5 Simple Linear Regression
Regression analysis is a statistical tool used to model the dependence of a variable
on one (or more) explanatory variables. This functional relationship may then be
formally stated as an equation, with associated statistical values that describe how well
this equation fits the data.
A simple linear regression is the least squares estimator of a linear regression model
with a single predictor (or one independent variable). The least squares method
determines a regression equation by minimizing the sum of the squares of the vertical
distances between the actual Y values and the predicted values of Y. That is, simple
linear regression fits a straight line through the set of n points in such a way that the
sum of squared residuals of the model is as small as possible. This method gives what is
generally known as the “best-fitting” line.
Assumptions of Linear Regression Equation

1. Linearity – The mean of each error component is zero.
2. Independence of Error Terms – The errors are independent of each other.
3. Normally Distributed Error Terms – Each error component (random variable) follows
an approximate normal distribution.
4. Homoscedasticity – The variance of the error components is the same for each value of
the independent variable.
A. Estimating the Coefficient

$$\hat{Y} = b_1 X + b_0$$ (Formula 10-5)
$$b_1 = \frac{N(\sum XY) - (\sum X)(\sum Y)}{N(\sum X^2) - (\sum X)^2}$$ (Formula 10-6)
$$b_0 = \bar{Y} - b_1 \bar{X}$$ (Formula 10-7)
where: $\hat{Y}$ = predicted or fitted value of Y.
X = the value of any particular observation of the independent variable.
Y = the value of any particular observation of the dependent variable.
$b_1$ = slope of the regression line.
$b_0$ = intercept of the regression line.
$\bar{X}$ = mean of the independent variable.
$\bar{Y}$ = mean of the dependent variable.
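The slope and intercept formulas translate directly into code. The following Python sketch (the function name fit_line is hypothetical) applies them to the fruit shake data from Section 10.3, purely as an illustration of the arithmetic.

```python
def fit_line(x, y):
    """Least squares slope b1 and intercept b0 for the line Y-hat = b1*X + b0."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)  # mean of Y minus b1 times mean of X
    return b1, b0


# Fruit shake data from Section 10.3.
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]
b1, b0 = fit_line(temperature, sales)
print(round(b1, 4), round(b0, 2))
```

Because the code keeps full precision for b1, its intercept (about -145.28) differs in the third decimal place from a hand computation that first rounds the slope to 3.7496.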
B. Sum of Squares for Error
The least squares method determines the coefficients that minimize the sum of
the squared deviations between the points and the line defined by the coefficients; this
sum is called the sum of squares for error (or unexplained variation).
The variation due to chance, denoted by $\sum (Y_i - \hat{Y}_i)^2$, is called the
unexplained variation. This variation cannot be attributed to the relationship. If the
value of r is close to +1 or -1, then the unexplained variation is small. Alternatively,
the variation obtained from the relationship is $\sum (\hat{Y}_i - \bar{Y})^2$ and is
called the explained variation. The closer the value of r is to +1 or -1, the better the
points fit the line, the more of the variation is explained by the relationship, and the
closer $\sum (\hat{Y}_i - \bar{Y})^2$ is to $\sum (Y_i - \bar{Y})^2$. If all points fall
on the regression line, $\sum (\hat{Y}_i - \bar{Y})^2$ will be equal to
$\sum (Y_i - \bar{Y})^2$.
Figure 10.2: Measures of Variance in Simple Linear Regression Analysis
[Figure: regression line $\hat{Y} = b_1 X + b_0$ drawn on X and Y axes, marking at a point
Xi the total sum of squares, the explained sum of squares, and the unexplained sum of
squares.]
Figure 10.2 shows the different measures of variation in a simple linear regression
equation. The formula for the sums of squares is:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$ (Formula 10-8)
or SST = SSR + SSE
where: $\hat{Y}$ = predicted or fitted value of Y.
$\bar{Y}$ = mean of the dependent variable.
$\sum (Y_i - \bar{Y})^2$ = total variation in Y (or total sum of squares, SST).
$\sum (Y_i - \hat{Y}_i)^2$ = unexplained variation in Y (or error sum of squares, SSE).
$\sum (\hat{Y}_i - \bar{Y})^2$ = explained variation in Y (or regression sum of squares, SSR).
A small standard error of estimate indicates that the independent variable is a
good predictor of the dependent variable.
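The identity SST = SSR + SSE can be verified numerically. The sketch below is an illustration only: the function name is hypothetical, it uses the fruit shake data from Section 10.3, and it computes the slope in deviation form, which is algebraically equivalent to the sum-based slope formula given earlier.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least squares line fitted to x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx                       # slope in deviation form
    b0 = mean_y - b1 * mean_x            # intercept
    fitted = [b1 * a + b0 for a in x]
    sst = sum((b - mean_y) ** 2 for b in y)              # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)         # explained variation
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))   # unexplained variation
    return sst, ssr, sse


temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]
sst, ssr, sse = sums_of_squares(temperature, sales)
print(round(sst, 2), round(ssr, 2), round(sse, 2))
```

The explained and unexplained parts add up to the total variation, up to floating-point rounding.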
C. Standard Error of Estimate
The standard error of estimate is the standard deviation of the observed Y values
about the predicted $\hat{Y}$ values. The formula for the standard error of estimate is
$$S_E = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$ (Formula 10-9)
where: $S_E$ = standard error of estimate
$b_0$ = intercept of the regression line.
$b_1$ = slope of the regression line.
$\hat{Y}$ = predicted or fitted value of Y ($\hat{Y} = b_1 X + b_0$)
n = the sample size
The standard error of estimate is comparable to the standard deviation; however,
the mean is not used. As can be seen in Formula 10-9, the standard error of estimate is
the square root of the quotient of the unexplained variation (that is, the variation due
to the difference between the observed values and the predicted values) and n – 2. This
means that the closer the observed values are to the predicted values, the smaller the
standard error of estimate will be.
The smallest value that the standard error of estimate (SE) can assume is 0, which
occurs when SSE = 0, that is, when all points fall on the regression line. Thus, when the
standard error of estimate is small, the fit is excellent, and the linear model is likely
to be an effective analytical and forecasting tool. If the standard error of estimate is
large, the linear model is unreliable, and the researcher should improve it or reject it.
D. Coefficient of Determination
The coefficient of determination is the proportion of the variance of the dependent
variable that is explained by the regression line and the independent variable. It is the
ratio of the explained variation to the total variation and is denoted by $r^2$. The
coefficient of determination can also be computed by squaring the value of Pearson’s r.
It may range between 0 and 1, or 0% and 100%. That is,
$$r^2 = \frac{\text{total variation} - \text{unexplained variation}}{\text{total variation}} = \frac{\text{explained variation}}{\text{total variation}}$$
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$ (Formula 10-10)
where: $r^2$ = coefficient of determination.
SSE = error sum of squares.
SSR = regression sum of squares.
SST = total sum of squares.
On the other hand, the coefficient of non-determination is the proportion of the
variation in the dependent variable that is left unexplained by the independent variable,
determined by $1 - r^2$.
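As a small numerical illustration, the sketch below computes the coefficient of determination from an error sum of squares and a total sum of squares, and compares it with the square of Pearson's r. The function name is hypothetical; the input values (SSE = 1,141.1406, SST = 8,118.25, r = 0.9270572554) are the ones this chapter reports for its fruit shake example.

```python
def coefficient_of_determination(sse, sst):
    """r squared as 1 - SSE/SST."""
    if sst <= 0:
        raise ValueError("SST must be positive")
    return 1 - sse / sst


# Values reported in the chapter's fruit shake example.
r_squared = coefficient_of_determination(1141.1406, 8118.25)
non_determination = 1 - r_squared          # share left unexplained
pearson_r_squared = 0.9270572554 ** 2      # squaring Pearson's r

print(round(r_squared, 4), round(non_determination, 4), round(pearson_r_squared, 4))
```

Both routes give about 0.8594, so roughly 14% of the variation is left unexplained by the independent variable.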
E. Confidence Interval and Prediction Interval
A confidence interval is a closed interval that is likely, at a stated level of
confidence, to contain the population value being estimated. For example, a 95%
confidence interval with a lower limit of $\bar{X} - z$ and an upper limit of
$\bar{X} + z$ expresses 95% confidence that the value lies between $\bar{X} - z$ and
$\bar{X} + z$. Of the remaining 5%, 2.5% lies below $\bar{X} - z$ and 2.5% lies above
$\bar{X} + z$. This section discusses the confidence intervals used in simple linear
regression analysis. To determine the confidence interval for the mean value of Y for a
given X, the formula is:
$$\text{Confidence Interval} = \hat{Y} \pm t_{\alpha/2}\,(S_E)\sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$ (Formula 10-11)
A prediction interval is an estimate of an interval in which a future observation will
fall, with a certain probability, given what has already been observed. Prediction
intervals are often used in regression analysis. The prediction interval is determined
as follows:
$$\text{Prediction Interval} = \hat{Y} \pm t_{\alpha/2}\,(S_E)\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$ (Formula 10-12)
where: $\hat{Y}$ = predicted or fitted value of Y.
$S_E$ = standard error of estimate.
t = critical value of t with n – 2 degrees of freedom.
n = the sample size.
$\bar{X}$ = mean of the independent variable.
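The two interval formulas differ only by the extra 1 under the square root, which the following Python sketch makes explicit. The function name is hypothetical, and the standard error of estimate (10.6824) and the critical t (2.228, for 10 degrees of freedom at the 95% level) are taken as given inputs from this chapter's fruit shake example rather than computed here.

```python
from math import sqrt


def interval_half_widths(x0, x, s_e, t_crit):
    """Half-widths of the confidence and prediction intervals at X = x0."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n   # sum X^2 - (sum X)^2 / n
    core = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = t_crit * s_e * sqrt(core)             # interval for the mean of Y
    pi_half = t_crit * s_e * sqrt(1 + core)         # interval for a single Y
    return ci_half, pi_half


temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
ci_half, pi_half = interval_half_widths(95, temperature, 10.6824, 2.228)
print(round(ci_half, 4), round(pi_half, 4))
```

The prediction interval is always wider than the confidence interval at the same X, because it also accounts for the scatter of an individual observation around the regression line.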
Example: Refer to the example involving atmospheric temperature and sales in Section
10.3. Determine the regression equation, plot the regression line, and interpret it. Solve
for the sum of squares for error, the standard error of estimate, and the coefficient of
determination. Find the 95% confidence interval for all days with a temperature of 95
degrees Fahrenheit. Determine a 95% prediction interval for a particular day with a
temperature of 95 degrees Fahrenheit.
Solution:
Computation of the Simple Linear Regression Equation
Step 1: Obtain the sums of X, Y, X², Y², and XY (see the table in Step 5 of the example
in Section 10.3).
∑X = 1,029
∑Y = 2,115
∑X² = 88,733
∑Y² = 380,887
∑XY = 183,222
Step 2: Compute for the slope of the simple linear regression (Formula 10-6).
$$b_1 = \frac{N(\sum XY) - (\sum X)(\sum Y)}{N(\sum X^2) - (\sum X)^2} = \frac{12(183{,}222) - (1{,}029)(2{,}115)}{12(88{,}733) - (1{,}029)^2}$$
$$= \frac{2{,}198{,}664 - 2{,}176{,}335}{1{,}064{,}796 - 1{,}058{,}841} = \frac{22{,}329}{5{,}955} = 3.7496221661 \approx 3.7496$$
Step 3: Compute for the mean values of X and Y.
$$\bar{X} = \frac{\sum X}{N} = \frac{1{,}029}{12} = 85.75$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{2{,}115}{12} = 176.25$$
Step 4: Compute for the intercept of the simple linear regression (Formula 10-7).
$$b_0 = \bar{Y} - b_1 \bar{X} = 176.25 - 3.7496(85.75) = 176.25 - 321.5282 = -145.2782$$
Step 5: Substitute the slope and intercept in the general equation for simple linear
regression (Formula 10-5), $\hat{Y} = b_1 X + b_0$.
The simple linear regression equation is
$$\hat{Y} = 3.7496X - 145.2782$$
Step 6: Graph the least squares regression line.
[Graph of the regression line $\hat{Y} = 3.7496X - 145.2782$, with axes Temperature (F),
from 75 to 100, and Sales, from 140 to 220.]
Thus, the regression equation is $\hat{Y} = 3.7496X - 145.2782$. The $b_1$ of 3.7496
indicates that for each additional degree Fahrenheit, sales are expected to increase by
3.7496 units. The $b_0$ value of –145.2782 indicates that the intercept with the Y-axis
is below the origin. A literal interpretation is that if the temperature were zero degrees
Fahrenheit, negative 145.2782 units would be sold. This has no practical meaning, because
zero lies far outside the observed temperatures of 76 to 97 degrees.
Now let us compute for the sum of squares for error, standard error, and
coefficient of determination.
Day      Y     Y − Ȳ     (Y − Ȳ)²         Ŷ     Y − Ŷ    (Y − Ŷ)²
1      147    -29.25     855.5625    150.94    -3.94     15.5227
2      143    -33.25   1,105.5625    139.69     3.31     10.9493
3      147    -29.25     855.5625    147.19    -0.19      0.0362
4      168     -8.25      68.0625    169.69    -1.69      2.8493
5      206     29.75     885.0625    192.19    13.81    190.8349
6      155    -21.25     451.5625    165.94   -10.94    119.6477
7      192     15.75     248.0625    203.43   -11.43    130.7492
8      211     34.75   1,207.5625    207.18     3.82     14.5605
9      209     32.75   1,072.5625    218.43    -9.43     88.9822
10     187     10.75     115.5625    173.44    13.56    183.9387
11     200     23.75     564.0625    184.69    15.31    234.5045
12     150    -26.25     689.0625    162.19   -12.19    148.5654
Total 2,115     0.00   8,118.2500  2,115.00     0.00  1,141.1406
Total sum of squares
$$\sum (Y - \bar{Y})^2 = 8{,}118.25$$
Sum of squares for error
$$\sum (Y - \hat{Y})^2 = 1{,}141.1406$$
Computation of Standard Error of Estimate
$$S_E = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{1{,}141.1406}{12 - 2}} = 10.6824$$
Because SE = 10.6824 is small relative to Ȳ = 176.25, we can regard the standard
error of estimate as small. However, because there is no predefined upper limit on SE, it
is difficult to assess the regression model from this value alone. In summary, the
standard error of estimate cannot serve as a fixed measure of the regression model’s
utility.
Computation of Coefficient of Determination
$$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{1{,}141.1406}{8{,}118.25} = 0.8594$$
We found that $r^2$ is equal to 0.8594. This statistic denotes that 85.94% of the
variation in the fruit shake sales is explained by the variation in the temperature reading
in degrees Fahrenheit.
Computation of Confidence Interval
Step 1: Determine the estimated number of fruit shakes sold when the temperature is 95
degrees Fahrenheit. It is 210.9338, found by
$$\hat{Y} = 3.7496X - 145.2782 = 3.7496(95) - 145.2782 = 210.9338$$
Step 2: Use the t value associated with the 95% level of confidence.
Step 3: Determine the degrees of freedom and the critical value of t. (Refer to Table B)
df = n – 2 = 12 – 2 = 10 and t = ±2.228
Step 4: Compute for the confidence interval.
Recall the following:
Ŷ = 210.9338, SE = 10.6824, n = 12, ∑X = 1,029, ∑X² = 88,733, X̄ = ∑X/N = 1,029/12 = 85.75
$$\text{Confidence Interval} = \hat{Y} \pm t_{\alpha/2}\,(S_E)\sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$
$$= 210.9338 \pm (2.228)(10.6824)\sqrt{\frac{1}{12} + \frac{(95 - 85.75)^2}{88{,}733 - \frac{(1{,}029)^2}{12}}}$$
= 210.9338 ± 12.0363
= 198.8975 up to 222.9701
So the 95% confidence interval for the mean number of fruit shakes sold when the
temperature is 95 degrees Fahrenheit is 210.9338 ± 12.0363, or 198.8975 fruit shakes up to
222.9701 fruit shakes.
Computation of Prediction Interval
$$\text{Prediction Interval} = \hat{Y} \pm t_{\alpha/2}\,(S_E)\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum X^2 - \frac{(\sum X)^2}{n}}}$$
$$= 210.9338 \pm (2.228)(10.6824)\sqrt{1 + \frac{1}{12} + \frac{(95 - 85.75)^2}{88{,}733 - \frac{(1{,}029)^2}{12}}}$$
= 210.9338 ± 26.6707 or
= 184.2631 up to 237.6045
We conclude that the probability is 0.95 that the number of fruit shakes sold on a
particular day with a temperature of 95 degrees Fahrenheit is between 184.2631 fruit
shakes and 237.6045 fruit shakes. This interval is quite large.
The most important distinction between a confidence interval and a prediction
interval is that a confidence interval refers to the mean of all cases with the given
value of X (independent variable), while a prediction interval refers to a particular case
for a given value of X (independent variable).
Name:_____________________________ Date:________________ Score:___________

Section Exercise 10.1
State for each case whether you would expect a positive correlation, a negative
correlation, or no correlation.
1.Cholesterol level and incidence of coronary heart disease. __________________
2.Outside temperature and layers of clothing needed. __________________
3.Amount of time spent studying and examination scores. __________________
4.IQ level and Grade Point Average. __________________
5.Singing ability and writing ability. __________________
6.Mileage a car has been driven and selling price. __________________
7.Age and incidence of Alzheimer’s disease. __________________
8.Weight and height of people. __________________
9.Shirt size and sense of humor. __________________
10.Income and educational attainment. __________________
11.Shoe size and IQ level. __________________
12.The number of hours a bowler practices and the bowler’s scores. __________________
13.Ages of husbands and wives. __________________
14.Height and shoe size. __________________
15.Ice cream sales and outside temperature. __________________
16.Population and number of fast food chains in cities. __________________
17.Number of cigarettes smoked and lung cancer incidence. __________________
18.Board examination performance and weight of a person. __________________
19.Income and expenditure. __________________
20.Weight and educational attainment. __________________
21.IQ and age of extremely deprived children. __________________
22.Reading achievement and body built of a person. __________________
23.The number of sunny days in July in Manila and the
attendance at the Manila Zoo __________________
24.The number of television commercials broadcast and
sales of this product __________________
25.The number of campaign materials used by a senatorial
candidate and the number of votes earned. __________________
26.Number of persons getting flu shots and the number of
persons catching flu. __________________

1.) AUS Rural Bank is studying the relationship between the mean account balance for
individual accounts and the number of transactions per month. A sample of eight accounts
revealed the following:
Customer                 1    2    3    4    5    6    7    8
Mean balance (₱000s)    50   32   53   24   15   10   16   27
No. of Transactions      5    2    7    9   10    2    4   11
Find the coefficient of correlation. Determine at the 0.01 significance level whether the
correlation in the population is greater than zero.
Step 1: State the hypotheses.
Ho: ______________________________________________________________
H1: ______________________________________________________________
Step 2: The level of significance and critical region. α = _______ and tcritical = ______________
Step 3: Complete the table and compute for the value of r and t.
Customer   X   Y   X²   Y²   XY
1
2
3
4
5
6
7
8
∑X = __________ ∑Y = __________
∑X² = _________ ∑Y² = _________ ∑XY = _________
r = ____________ tcomputed =________________
Step 4: Decision rule: ______________________________________________________
__________________________________________________________________
Step 5: Conclusion: _______________________________________________________
__________________________________________________________________________________________________

2. SJS Company has been selling to retail customers in the Metro Manila area. They
advertise extensively on radio, in print ads, and on the internet. The owner would like to
review the relationship between the amount spent on advertising (in ₱000s) and
sales (in ₱000s). Below is information on advertising expense and sales for the last 9
months.
Month                  Jan   Feb   Mar   Apr   May  June  July   Aug  Sept
Advertising Expense     10    14    12     9    13    15     8    13    16
Sales Revenue          180   170   190   220   235   208   215   175   250
Find the coefficient of correlation. Determine at the 0.10 significance level whether the
correlation in the population is greater than zero.
Step 1: State the hypothesis

H 0 :_____________________________________________________
H 1: _____________________________________________________
Step 2: The level of significance and critical region. α= _______ and t critical= _______.
Step 3: Complete the table and compute for the value of r and t.
2 2
Month X Y X Y XY
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sept
∑X = __________ ∑Y = __________
∑X² = _________ ∑Y² = _________ ∑XY = _________
r = __________ t computed = _________
Step 4: Decision Rule. _____________________________________________________

_________________________________________________________________
Step 5: Conclusion. _______________________________________________________
_________________________________________________________________
Name:_____________________________ Date:________________ Score:___________

1. The production department of WSS Electronics wants to explore the relationship
between the number of employees who assemble a certain product and the
number of units produced per hour. The complete set of paired observations
follows:
No. of Employees     2   4   6   8   5  12  10   9
Production (units)  20  15  25  28  22  30  35  40
Determine the coefficient of correlation using the Spearman rank correlation and test
its significance at the 0.05 level.
Solution:
Step 1: State the hypotheses
H 0 :_____________________________________________________
_________________________________________________________
H 1: _____________________________________________________
_________________________________________________________
Step 2: The level of significance is _________.
Step 3: Determine the critical region. df= _________ t critical= _________
Step 4: Complete the table and compute for the value of ρ and t.
Pair X Y RX RY D D²
1
2
3
4
5
6
7
8
ρ = ____________ t computed= __________
Step 5: Decision rule. ____________________________________________________
_______________________________________________________________________
Step 6: Conclusion. _______________________________________________________
_______________________________________________________________________
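The ranking and D² columns of a Spearman worksheet can be verified with a few lines of code. The following is a minimal sketch, assuming the formula ρ = 1 − 6∑D²/(n(n² − 1)), average ranks for tied values, and the test statistic t = ρ√((n − 2)/(1 − ρ²)) with df = n − 2. The helper names average_ranks, spearman_rho, and spearman_t are hypothetical. Ranks here run from smallest (1) to largest; ranking both variables in the opposite direction gives the same ρ.

```python
import math

def average_ranks(values):
    """Rank values from smallest (1) to largest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho from the sum of squared rank differences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def spearman_t(rho, n):
    """Assumed t statistic for testing rho, with df = n - 2."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))

# Data copied from Problem 1 (WSS Electronics).
employees = [2, 4, 6, 8, 5, 12, 10, 9]
production = [20, 15, 25, 28, 22, 30, 35, 40]

rho = spearman_rho(employees, production)
print(f"rho = {rho:.4f}, t computed = {spearman_t(rho, len(employees)):.4f}")
```

Printing average_ranks(employees) and average_ranks(production) side by side gives the RX and RY columns, which makes it easy to spot a ranking slip before the D² column is summed.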
Name:_____________________________ Date:________________ Score:___________

2. A used car depot wants to study the relationship between the age of a car and its
selling price. Listed is a random sample of 9 cars sold at the depot during the last
3 months.
Car                     1    2    3    4    5    6    7    8    9
Age (years)             3    4    8   10    6    7  5.5    4  2.5
Selling Price (₱000)  350  320  170  120  280  190  250  300  400
Determine the coefficient of correlation using the Spearman rank correlation and test
its significance at the 0.01 level.
Solution:
Step 1: State the hypotheses
H 0 :_____________________________________________________
_________________________________________________________
H 1: _____________________________________________________
_________________________________________________________
Step 2: The level of significance is _________.
Step 3: Determine the critical region. df= _________ t critical= _________
Step 4: Complete the table and compute for the value of ρ and t.
Pair X Y RX RY D D²
1
2
3
4
5
6
7
8
9
ρ = ____________ t computed = __________
Step 5: Decision rule. ____________________________________________________
_______________________________________________________________________
Step 6: Conclusion. _______________________________________________________
_______________________________________________________________________
Name:_____________________________ Date:________________ Score:___________
1. Sales, in million pesos, of a certain company are shown here.
Year (X) 1 2 3 4 5 6 7 8 9 10
Sales (Y) ₱12 ₱15 ₱17 ₱18 ₱16 ₱19 ₱21 ₱20 ₱22 ₱24
Determine the regression equation. Solve for the sum of squares for error, the standard
error of estimate, and the coefficient of determination. Find the 90% confidence interval
for the mean sales in the 12th year. Determine a 95% prediction interval for the 12th year.
Step 1: Complete the table.

X Y X² Y² XY (Y − Ȳ)² Ŷ (Y − Ŷ)²
1 13
2 15
3 17
4 18
5 17
6 19
7 22
8 20
9 23
10 26
Step 2: Solve for the following:
∑X = __________ ∑Y = __________ X̄ = __________ Ȳ = __________
∑X² = _________ ∑Y² = _________ ∑XY = _________

∑(Y − Ȳ)² = _____________ ∑(Y − Ŷ)² = ____________
Slope (b₁): ___________ Intercept (b₀): _____________
Simple Linear Regression Equation (Ŷ = b₁X + b₀): _______________
Standard Error of Estimate (sE): _______ Coefficient of Determination (r²): ________
Confidence Interval: ___________ Prediction Interval: ___________

Name:_____________________________ Date:_________________ Score:__________
2. A random sample of 8 office staff hired within the last year was selected from a
large corporation. For each selected staff member, his or her experience (in months)
at the time of hire and starting salary were recorded. The data are given in the table
below.
Experience (in months)       13   6   8  10  20   7   9  15
Starting salary (in ₱000s)   20  14  16  19  21  12  13  21
Determine the regression equation. Solve for the sum of squares for error, the
standard error of estimate, and the coefficient of determination. Find the 99% confidence
interval for the mean starting salary at 10 months of experience. Determine a 99%
prediction interval for 10 months of experience.
Step 1: Complete the table.
X Y X² Y² XY (Y − Ȳ)² Ŷ (Y − Ŷ)²
13 20
6 14
8 16
10 19
20 21
7 12
9 13
15 21
Step 2: Solve for the following.

∑X = ___________ ∑Y = ___________ X̄ = ___________ Ȳ = ___________
∑X² = ___________ ∑Y² = ___________ ∑XY = ___________
∑(Y − Ȳ)² = ________ ∑(Y − Ŷ)² = ___________
Slope (b₁): __________ Intercept (b₀): ________
Simple Linear Regression Equation (Ŷ = b₁X + b₀): __________
Standard Error of Estimate (sE): ________ Coefficient of Determination (r²): ________
Confidence Interval: ________________ Prediction Interval: _______________
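
The quantities requested in a regression worksheet like this one can be checked numerically. The sketch below is an illustration under stated assumptions, not the chapter's own procedure: it uses the least-squares slope and intercept, sE = √(SSE/(n − 2)), r² = 1 − SSE/SST, and the standard interval half-widths t·sE·√(1/n + (x₀ − X̄)²/Sxx) for the confidence interval and t·sE·√(1 + 1/n + (x₀ − X̄)²/Sxx) for the prediction interval. The names fit_line and intervals are hypothetical, and the critical value 3.707 (99%, two-tailed, df = 6) is supplied by hand and should be confirmed against your own t table.

```python
import math

def fit_line(x, y):
    """Least-squares fit of Y-hat = b1*X + b0 with the usual summary measures."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    return {
        "n": n, "x_bar": x_bar, "sxx": sxx, "b1": b1, "b0": b0,
        "sse": sse, "s_e": math.sqrt(sse / (n - 2)), "r2": 1 - sse / sst,
    }

def intervals(model, x0, t_crit):
    """Return (confidence interval, prediction interval) at X = x0."""
    y_hat = model["b0"] + model["b1"] * x0
    core = 1 / model["n"] + (x0 - model["x_bar"]) ** 2 / model["sxx"]
    ci_half = t_crit * model["s_e"] * math.sqrt(core)
    pi_half = t_crit * model["s_e"] * math.sqrt(1 + core)
    return (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)

# Data copied from Problem 2 (experience in months, starting salary in thousand pesos).
experience = [13, 6, 8, 10, 20, 7, 9, 15]
salary = [20, 14, 16, 19, 21, 12, 13, 21]

model = fit_line(experience, salary)
# Assumption: 3.707 is the two-tailed 99% critical t for df = n - 2 = 6.
ci, pi = intervals(model, 10, 3.707)
print(f"b1 = {model['b1']:.4f}, b0 = {model['b0']:.4f}")
print(f"SSE = {model['sse']:.4f}, sE = {model['s_e']:.4f}, r2 = {model['r2']:.4f}")
print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Passing the critical value as an argument keeps the script free of third-party libraries and mirrors the worksheet, where t critical is looked up in a table rather than computed.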
