Correlation and Regression Analysis
1. Regression analysis is a statistical method that uses the relationship between two
or more quantitative variables so that one variable, called the dependent variable or
response variable, can be predicted from the values of the other variable or variables,
called the independent or explanatory variables.
2. A mathematical equation that allows us to predict values of one dependent variable from
known values of one or more independent variables is called a regression equation.
5. This chapter focuses on the problem of estimating or predicting a value of a dependent
variable Y on the basis of a known measurement of an independent variable X.
7. A linear relationship between two variables is one in which the relationship can be most
accurately represented by a straight line.
8. In this section, the problem of estimating or predicting the value of a dependent variable on
the basis of a known measurement of an independent variable will be given consideration.
9. Although a graphic solution is sometimes used for prediction, it is much more common to
predict Y from the equation of the straight line. The general form of the equation, called
the linear regression line equation or simple linear regression, is
Y = a + bX
10. For each X, the equation Y = a + bX will predict a value of Y. The estimated regression
line is defined by the equation
Y = a + bX
where: Y = the predicted value of the dependent variable
a = Y-intercept (value of Y when X = 0)
b = slope of the line
a and b are the estimates of the regression parameters, calculated from the available
sample points.
Remark: Through the estimated regression line equation we can now predict any Y value
just by knowing the corresponding X value.
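The estimates a and b are commonly obtained by the method of least squares. As a sketch, the computational form below is the one that the arithmetic of the worked example in this section follows; the summation notation is standard, and n denotes the number of sample points.

```latex
b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},
\qquad
a = \bar{y} - b\,\bar{x}
```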
$\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$ and $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ are the means of the sample values.
12. a is the estimate of the population Y-intercept $\beta_0$ and b is the estimate of the
population slope coefficient $\beta_1$.
14. Example: Given the data in Table 10.1.1, do the following:
a. Find the equation of the regression line.
b. Draw the scatter diagram.
c. Find the point estimate of Y when X = 113.
Solution:
$n = 12$, $\sum_{i=1}^{12} x_i = 110 + 112 + \dots + 138 = 1{,}503$, $\sum_{i=1}^{12} x_i^2 = 110^2 + 112^2 + \dots + 138^2 = 189{,}187$
$\sum_{i=1}^{12} y_i = 50 + 56 + \dots + 68 = 706$, $\sum_{i=1}^{12} y_i^2 = 50^2 + 56^2 + \dots + 68^2 = 41{,}862$
$\sum_{i=1}^{12} x_i y_i = 110(50) + 112(56) + \dots + 138(68) = 88{,}857$
$\bar{x} = 1{,}503 / 12 = 125.25$, $\bar{y} = 706 / 12 = 58.8333$
$b = \dfrac{12(88{,}857) - (1{,}503)(706)}{12(189{,}187) - (1{,}503)^2} = \dfrac{5{,}166}{11{,}235} = 0.4598$
$a = \bar{y} - b\bar{x} = 58.8333 - 0.459813(125.25) = 1.2417$
a. $Y = 1.2417 + 0.4598X$
b. Scatter plot
[Figure: scatter plot of SCORE (vertical axis, 40 to 70) against IQ (horizontal axis, 100 to 140)]
c. $Y = 1.2417 + 0.4598(113) = 53.20$
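The computation in parts a and c can be checked with a short script. This is a minimal sketch that works from the summary sums quoted in the solution, since the raw data of Table 10.1.1 are not reproduced here; the function names `fit_line` and `predict` are illustrative, not part of any library.

```python
# Summary statistics taken from the worked example (n = 12 students).
n = 12
sum_x = 1503.0      # sum of the IQ scores (X)
sum_y = 706.0       # sum of the midterm scores (Y)
sum_xy = 88857.0    # sum of the products x_i * y_i
sum_x2 = 189187.0   # sum of the squared IQ scores


def fit_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the least-squares line Y = a + bX (hypothetical helper)."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)   # a = y_bar - b * x_bar
    return a, b


def predict(a, b, x):
    """Point estimate of Y at a given value of X (hypothetical helper)."""
    return a + b * x


a, b = fit_line(n, sum_x, sum_y, sum_xy, sum_x2)
print(f"b = {b:.4f}, a = {a:.4f}")
print(f"Predicted Y at X = 113: {predict(a, b, 113):.2f}")
```

Run as written, this gives a slope of about 0.4598, an intercept of about 1.2417 (the value that also appears in the PHStat2 output reported in this chapter), and a point estimate of about 53.20 at X = 113.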
16. Is there a linear dependency of Y on X in the example above? Test at the 0.05 level of
significance.
Solution: Step 1. $H_o: \beta_1 = 0$
$H_1: \beta_1 > 0$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: t test
Step 4. Reject $H_o$ if $t_c > t_{0.05,\,df}$, that is, $t_c > 1.812$ (with $df = n - 2 = 10$)
Using the PHStat2 output for succeeding steps, we have
            Coefficients   Standard Error   t Stat        P-value
Intercept   1.241744548    14.66505476      0.084673707   0.934191994
IQ Score    0.459813084    0.116796188      3.936884361   0.002788923
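The t Stat column can be read as the coefficient divided by its standard error. The sketch below recomputes it for the IQ Score row and applies the rejection rule from Step 4. The numbers are copied from the output above, and the variable names are illustrative.

```python
# Values copied from the IQ Score row of the PHStat2 output.
slope = 0.459813084
std_error = 0.116796188
t_critical = 1.812  # t(0.05, df = 10), the critical value stated in Step 4

# Test statistic for H0: beta_1 = 0 is the estimate divided by its standard error.
t_computed = slope / std_error
reject_h0 = t_computed > t_critical

print(f"t_c = {t_computed:.4f}")
print(f"Reject H0 at the 0.05 level: {reject_h0}")
```

Under these inputs the computed statistic is about 3.94, which exceeds 1.812, so the rule in Step 4 leads to rejecting $H_o$. This is consistent with the reported P-value of about 0.0028 being below 0.05.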
17. Correlation analysis attempts to measure the strength of the relationship between two random
variables by means of a single number called the correlation coefficient. It is concerned only
with the strength of the relationship; no causal effect is implied.
18. The Pearson Correlation Coefficient ($\rho$) measures the strength of the linear relationship
between two variables X and Y. The estimated sample correlation coefficient, denoted by
$r$, is given by:
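$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$
where n is the sample size.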
[Figure: five scatter plots of Y against X illustrating r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]
21. The Sample Coefficient of Determination, $r^2$, is a number that measures the proportion of
the total variation in the values of variable Y that can be accounted for or explained by the
linear relationship with the values of the variable X.
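As an illustration, r and $r^2$ can be computed from the summary sums of the IQ and midterm score example. This is a sketch under one stated assumption: the sum of squared Y values is taken as 41,862, the value that reproduces the r = 0.7796 used in the significance test later in this chapter. The function name `pearson_r` is illustrative.

```python
import math

# Summary sums from the IQ (X) and midterm score (Y) example, n = 12.
n = 12
sum_x, sum_y = 1503.0, 706.0
sum_xy = 88857.0
sum_x2 = 189187.0
sum_y2 = 41862.0  # ASSUMPTION: value consistent with r = 0.7796 in this chapter


def pearson_r(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Sample correlation coefficient from summary sums (hypothetical helper)."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
r_squared = r ** 2
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

With these inputs r is about 0.7796 and $r^2$ about 0.608, which would be read as roughly 61% of the variation in the midterm scores being explained by the linear relationship with IQ.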
22. Example: For the example above, find the sample correlation coefficient and the sample
coefficient of determination, and interpret the results.
24. Using the above example, is there evidence of a linear relationship between the students'
Math 15 midterm scores and IQ scores at the 0.05 level of significance?
Step 4. Reject $H_o$ if $t_c > t_{0.025,\,10}$ or $t_c < -t_{0.025,\,10}$,
that is, $t_c > 2.228$ or $t_c < -2.228$
Step 5. Computation: $t_c = \dfrac{0.7796 - 0}{\sqrt{\dfrac{1 - 0.7796^2}{12 - 2}}} = 3.94$
Step 6. Reject $H_o$ since $t_c > 2.228$.
Step 7. There is sufficient sample evidence that there is a significant
linear relationship between students' IQ scores and their Math 15
midterm scores.
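The test statistic for the correlation coefficient can be evaluated directly from r and n. The sketch below uses r = 0.7796, n = 12, and the critical value 2.228 from this example; the variable names are illustrative.

```python
import math

r = 0.7796          # sample correlation coefficient from the example
n = 12              # sample size
t_critical = 2.228  # t(0.025, df = 10), two-tailed test at the 0.05 level

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_computed = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
reject_h0 = abs(t_computed) > t_critical

print(f"t_c = {t_computed:.2f}")
print(f"Reject H0 at the 0.05 level: {reject_h0}")
```

Evaluated this way, the statistic is about 3.94, which agrees with the t Stat reported for the IQ Score slope in the PHStat2 output and exceeds 2.228, so $H_o$ is rejected.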
10.2. Problems/Exercises
1. Once you have computed the linear regression line equation, the intercept is completely
determined.
2. The slope of the regression line can be a negative value.
3. If the Y-intercept is -2.5, the X-score must be 1.00.
4. All things being equal, the higher the correlation, the more accurate the prediction.
5. In the regression equation $\hat{Y} = a + bX$, Y is used to predict the value of X.
6. A direct proportion line indicates a positive correlation.
7. The correlation value ranges from –1 to +1.
8. A correlation of +0.45 will have the same standard deviation as –0.45.
9. When the value of r = 1, it denotes a perfect positive correlation.
10. The sum of all the errors in the regression line will always add to zero.
11. When the correlation coefficient r is squared, it is called the coefficient of
determination.
12. If the correlation of two variables is close to zero, it indicates that no relationship exists
between the two variables.
13. In testing the significance of r using the t-distribution, the test is dependent only on the
variables X and Y.
II. Solve what is indicated in the problem. Show your solutions legibly.
1. The Kryplium Junior School Board is trying to anticipate building needs on the basis of
past student enrollment. From previous years, they have collected and recorded the data
for enrollment and the community population. The data are presented below:
Questions:
3. Give your best estimate of the enrollment when the community population
was 18,000.
4. Give your best estimate for the year in which the population was 18,000.
2. Listed below are the IQ scores (X) and the final grades (Y) for 10 students:
X 90 90 100 100 100 110 110 115 120 120
Y 2.0 3.0 2.5 3.0 3.5 3.0 4.0 3.5 3.5 4.0
Answer the following:
a. Plot a scatter plot.
b. Solve for $r$ and $r^2$.
c. Test $H_o: \beta_1 = 0$ using the 0.05 level of significance.
d. Test $H_o: \rho = 0$ using the 0.05 level of significance.
e. Find the linear regression line equation.
3. Listed below are the undergraduate grade point average, GPA (X), and the first-semester
graduate GPA (Y) of 10 senior students:
X 90 90 100 100 100 110 110 115 120 120
Y 2.0 3.0 2.5 3.0 3.5 3.0 4.0 3.5 3.5 4.0
Answer the following:
a. Plot a scatter plot.
b. Solve for $r$ and $r^2$.
c. Test $H_o: \beta_1 = 0$ using the 0.05 level of significance.
d. Test $H_o: \rho = 0$ using the 0.05 level of significance.
e. What is the estimated graduate GPA for a student whose undergraduate GPA
is 3.5?