SCM Session 6: Correlation and Regression Analysis
The product moment correlation, r, summarizes the strength of
association between two metric (interval or ratio scaled)
variables, say X and Y.
Division of the numerator and denominator by (n - 1) gives

$$ r = \frac{\displaystyle\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \,/\, (n-1)}{\sqrt{\displaystyle\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)}\ \sqrt{\displaystyle\sum_{i=1}^{n} (Y_i - \bar{Y})^2 / (n-1)}} = \frac{\mathrm{COV}_{xy}}{S_x S_y} $$
r varies between -1.0 and +1.0.
Table 1.1

Respondent   Attitude Toward   Duration of      Importance Attached
No.          City (Y)          Residence (X)    to Weather
1            6                 10               3
2            9                 12               11
3            8                 12               4
4            3                 4                1
5            10                12               11
6            4                 6                1
7            5                 8                7
8            2                 2                4
9            11                18               8
10           9                 9                10
11           10                17               8
12           2                 2                5
The correlation coefficient may be calculated as
follows:
$$ \bar{X} = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2)/12 = 9.333 $$
$$ \bar{Y} = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2)/12 = 6.583 $$

$$ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = (10-9.33)(6-6.58) + (12-9.33)(9-6.58) + (12-9.33)(8-6.58) + (4-9.33)(3-6.58) + (12-9.33)(10-6.58) + (6-9.33)(4-6.58) + (8-9.33)(5-6.58) + (2-9.33)(2-6.58) + (18-9.33)(11-6.58) + (9-9.33)(9-6.58) + (17-9.33)(10-6.58) + (2-9.33)(2-6.58) $$
$$ = -0.3886 + 6.4614 + 3.7914 + 19.0814 + 9.1314 + 8.5914 + 2.1014 + 33.5714 + 38.3214 - 0.7986 + 26.2314 + 33.5714 = 179.6668 $$
$$ \sum_{i=1}^{n} (X_i - \bar{X})^2 = (10-9.33)^2 + (12-9.33)^2 + (12-9.33)^2 + (4-9.33)^2 + (12-9.33)^2 + (6-9.33)^2 + (8-9.33)^2 + (2-9.33)^2 + (18-9.33)^2 + (9-9.33)^2 + (17-9.33)^2 + (2-9.33)^2 $$
$$ = 0.4489 + 7.1289 + 7.1289 + 28.4089 + 7.1289 + 11.0889 + 1.7689 + 53.7289 + 75.1689 + 0.1089 + 58.8289 + 53.7289 = 304.6668 $$
$$ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = (6-6.58)^2 + (9-6.58)^2 + (8-6.58)^2 + (3-6.58)^2 + (10-6.58)^2 + (4-6.58)^2 + (5-6.58)^2 + (2-6.58)^2 + (11-6.58)^2 + (9-6.58)^2 + (10-6.58)^2 + (2-6.58)^2 $$
$$ = 0.3364 + 5.8564 + 2.0164 + 12.8164 + 11.6964 + 6.6564 + 2.4964 + 20.9764 + 19.5364 + 5.8564 + 11.6964 + 20.9764 = 120.9168 $$
Thus,

$$ r = \frac{179.6668}{\sqrt{(304.6668)(120.9168)}} = 0.9361 $$
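As a check, the same computation can be reproduced in a few lines of Python (a minimal sketch, assuming NumPy is installed and using the Table 1.1 data):

```python
import numpy as np

X = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])  # duration of residence
Y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])     # attitude toward city

sxy = np.sum((X - X.mean()) * (Y - Y.mean()))  # sum of cross-products, ~179.67
sxx = np.sum((X - X.mean()) ** 2)              # ~304.67
syy = np.sum((Y - Y.mean()) ** 2)              # ~120.92

r = sxy / np.sqrt(sxx * syy)
print(round(r, 4))  # 0.9361
```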
$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SS_x}{SS_y} = \frac{\text{Total variation} - \text{Error variation}}{\text{Total variation}} = \frac{SS_y - SS_{error}}{SS_y} $$
When it is computed for a population rather than a sample, the product moment correlation is denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ. The statistical significance of the relationship between X and Y is tested by the hypotheses:

$$ H_0\!: \rho = 0 \qquad H_1\!: \rho \neq 0 $$
The test statistic is:
$$ t = r \left[ \frac{n-2}{1-r^2} \right]^{1/2} $$
which has a t distribution with n - 2 degrees of freedom.
For the correlation coefficient calculated based on the
data given in Table 1.1,
$$ t = 0.9361 \left[ \frac{12-2}{1-(0.9361)^2} \right]^{1/2} = 8.414 $$
and the degrees of freedom = 12-2 = 10. From the t distribution
table the critical value of t for a two-tailed test and α = 0.05 is
2.228. Hence, the null hypothesis of no relationship between X and
Y is rejected.
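The same decision can be verified numerically. A sketch assuming SciPy is available, with the critical value taken from scipy.stats.t.ppf rather than a printed table:

```python
from scipy import stats

r, n = 0.9361, 12
t_stat = r * ((n - 2) / (1 - r ** 2)) ** 0.5   # ~8.41
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # 2.228 for alpha = 0.05, df = 10
print(t_stat > t_crit)                         # True: reject H0
```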
A partial correlation coefficient measures the
association between two variables after
controlling for, or adjusting for, the effects
of one or more additional variables.
$$ r_{xy.z} = \frac{r_{xy} - (r_{xz})(r_{yz})}{\sqrt{1 - r_{xz}^2}\ \sqrt{1 - r_{yz}^2}} $$
Partial correlations have an order associated
with them. The order indicates how many
variables are being adjusted or controlled.
The simple correlation coefficient, r, has a
zero-order, as it does not control for any
additional variables while measuring the
association between two variables.
The coefficient rxy.z is a first-order partial
correlation coefficient, as it controls for the
effect of one additional variable, Z.
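The first-order formula translates directly into code. A minimal sketch; the three pairwise correlations passed in below are hypothetical values chosen only to illustrate the call:

```python
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy.z from the three pairwise r's."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical inputs, for illustration only
print(round(partial_corr(0.9361, 0.28, 0.35), 4))  # ~0.932
```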
The bivariate regression model is:

$$ Y_i = \beta_0 + \beta_1 X_i + e_i $$

where
Y = dependent or criterion variable
X = independent or predictor variable
β0 = intercept of the line
β1 = slope of the line
and ei is the error term associated with the i-th observation.
[Figure: Plot of attitude (Y) against duration of residence (X). The fitted line β0 + β1X runs through the scatter of points; for a given observation Yj, the vertical distance ej between the point and the line is the error.]
In most cases, β0 and β1 are unknown and are estimated from the sample observations using the equation

$$ \hat{Y}_i = a + b X_i $$

where Ŷi is the estimated or predicted value of Yi, and a and b are estimators of β0 and β1, respectively.
$$ b = \frac{\mathrm{COV}_{xy}}{S_x^2} = \frac{\displaystyle\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\displaystyle\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\displaystyle\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\displaystyle\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} $$
The intercept, a, may then be calculated using:
$$ a = \bar{Y} - b\bar{X} $$
For the data in Table 1.1, the estimation of parameters
may be illustrated as follows:
$$ \sum_{i=1}^{12} X_i Y_i = (10)(6) + (12)(9) + (12)(8) + (4)(3) + (12)(10) + (6)(4) + (8)(5) + (2)(2) + (18)(11) + (9)(9) + (17)(10) + (2)(2) = 917 $$

$$ \sum_{i=1}^{12} X_i^2 = 10^2 + 12^2 + 12^2 + 4^2 + 12^2 + 6^2 + 8^2 + 2^2 + 18^2 + 9^2 + 17^2 + 2^2 = 1350 $$
It may be recalled from earlier calculations of the simple
correlation that:
$$ \bar{X} = 9.333 \qquad \bar{Y} = 6.583 $$
Given n = 12, b can be calculated as:
$$ b = \frac{917 - (12)(9.333)(6.583)}{1350 - (12)(9.333)^2} = 0.5897 $$
$$ a = \bar{Y} - b\bar{X} = 6.583 - (0.5897)(9.333) = 1.0793 $$
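The estimates can be confirmed with the same formulas in NumPy (a sketch, again assuming the Table 1.1 data):

```python
import numpy as np

X = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
Y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])
n = len(X)

# b = (sum(XiYi) - n*Xbar*Ybar) / (sum(Xi^2) - n*Xbar^2)
b = (np.sum(X * Y) - n * X.mean() * Y.mean()) / (np.sum(X ** 2) - n * X.mean() ** 2)
a = Y.mean() - b * X.mean()
print(round(b, 4), round(a, 4))  # 0.5897 1.0793
```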
Standardization is the process by which the raw data are transformed into new variables that have a mean of 0 and a variance of 1.
When the data are standardized, the intercept assumes a
value of 0.
The term beta coefficient or beta weight is used to
denote the standardized regression coefficient.
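In the bivariate case the beta weight works out to be the simple correlation r itself. A quick numerical check (a sketch assuming NumPy, reusing the slope estimated above):

```python
import numpy as np

X = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
Y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])

b = 0.5897                                # non-standardized slope from above
beta = b * X.std(ddof=1) / Y.std(ddof=1)  # beta = b * (Sx / Sy)
print(round(beta, 4))                     # ~0.9361, i.e. equal to r
```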
The total variation in Y may be decomposed as $SS_y = SS_{reg} + SS_{res}$, where

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 $$
$$ SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 $$
$$ SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$
[Figure: Decomposition of the total variation. Vertical distances from the data points to the regression line show the residual variation (SSres); distances from the line to the mean Ȳ show the explained variation (SSreg).]
The strength of association may then be calculated as
follows:
$$ r^2 = \frac{SS_{reg}}{SS_y} = \frac{SS_y - SS_{res}}{SS_y} $$
The predicted values are $\hat{Y}_i = 1.0793 + 0.5897 X_i$, so

$$ SS_{reg} = \sum_{i=1}^{12} (\hat{Y}_i - \bar{Y})^2 = (6.9763-6.5833)^2 + (8.1557-6.5833)^2 + (8.1557-6.5833)^2 + (3.4381-6.5833)^2 + (8.1557-6.5833)^2 + (4.6175-6.5833)^2 + (5.7969-6.5833)^2 + (2.2587-6.5833)^2 + (11.6939-6.5833)^2 + (6.3866-6.5833)^2 + (11.1042-6.5833)^2 + (2.2587-6.5833)^2 $$
$$ = 0.1544 + 2.4724 + 2.4724 + 9.8922 + 2.4724 + 3.8643 + 0.6184 + 18.7021 + 26.1182 + 0.0387 + 20.4385 + 18.7021 = 105.9524 $$
$$ SS_{res} = \sum_{i=1}^{12} (Y_i - \hat{Y}_i)^2 = (6-6.9763)^2 + (9-8.1557)^2 + (8-8.1557)^2 + \cdots = 14.9644 $$
$$ r^2 = SS_{reg}/SS_y = 105.9524/120.9168 = 0.8762 $$
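The decomposition is easy to verify numerically (a sketch using the fitted values from the estimated equation above):

```python
import numpy as np

X = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
Y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])

Y_hat = 1.0793 + 0.5897 * X               # fitted values from a and b above
ss_y = np.sum((Y - Y.mean()) ** 2)        # total variation, ~120.92
ss_reg = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation, ~105.95
ss_res = np.sum((Y - Y_hat) ** 2)         # residual variation, ~14.96
print(round(ss_reg / ss_y, 4))            # ~0.8762
```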
Another, equivalent test for examining the significance of the
linear relationship between X and Y (significance of b) is the test
for the significance of the coefficient of determination. The
hypotheses in this case are:
$$ H_0\!: R^2_{pop} = 0 \qquad H_1\!: R^2_{pop} > 0 $$

The appropriate test statistic is:

$$ F = \frac{SS_{reg}}{SS_{res}/(n-2)} $$
which has an F distribution with 1 and n - 2 degrees of freedom. The F test is a generalized form of the t test. If a random variable is t distributed with n degrees of freedom, then t² is F distributed with 1 and n degrees of freedom. Hence, the F test for testing the significance of the coefficient of determination is equivalent to testing the following hypotheses:
$$ H_0\!: \beta_1 = 0 \quad H_1\!: \beta_1 \neq 0 $$
or
$$ H_0\!: \rho = 0 \quad H_1\!: \rho \neq 0 $$
From Table 1.2, it can be seen that:
$$ r^2 = \frac{105.9522}{105.9522 + 14.9644} = 0.8762 $$

which is the same as the value calculated earlier. The value of the F statistic is:

$$ F = \frac{105.9522}{14.9644/10} = 70.8027 $$
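A sketch of the same F test with SciPy, also confirming the F = t² relationship noted above:

```python
from scipy import stats

ss_reg, ss_res, n = 105.9522, 14.9644, 12
F = ss_reg / (ss_res / (n - 2))               # ~70.80
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # ~4.96 for alpha = 0.05
print(F > F_crit)                             # True: reject H0
print(round(8.414 ** 2, 1))                   # ~70.8, i.e. F = t^2
```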
ANALYSIS OF VARIANCE

Source        df    Sum of Squares    Mean Square
Regression     1        105.9522        105.9522
Residual      10         14.9644          1.4964
Total         11        120.9168
$$ SEE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}} $$

or

$$ SEE = \sqrt{\frac{SS_{res}}{n-2}} $$

or, more generally, with k independent variables:

$$ SEE = \sqrt{\frac{SS_{res}}{n-k-1}} $$

For the data given in Table 1.2, the SEE is estimated as follows:

$$ SEE = \sqrt{14.9644/(12-2)} = 1.22329 $$
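The same figure in two lines (a trivial sketch):

```python
ss_res, n = 14.9644, 12
print(round((ss_res / (n - 2)) ** 0.5, 5))  # 1.22329
```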
Assumptions. The error term is normally distributed: for each fixed value of X, the distribution of Y is normal.
The general form of the multiple regression model is:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + e $$

which is estimated by the following equation:

$$ \hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k $$
F test. The F test is used to test the null hypothesis that the coefficient of multiple determination in the population, $R^2_{pop}$, is zero. This is equivalent to testing the null hypothesis $H_0\!: \beta_1 = \beta_2 = \cdots = \beta_k = 0$. The test statistic has an F distribution with k and (n - k - 1) degrees of freedom.
Partial F test. The significance of a partial regression coefficient, βi, of Xi may be tested using an incremental F statistic. The incremental F statistic is based on the
increment in the explained sum of squares resulting from
the addition of the independent variable Xi to the
regression equation after all the other independent
variables have been included.
$$ Y = a + b_1 X_1 + b_2 X_2 $$

Suppose one was to remove the effect of X2 from X1. This could be done by running a regression of X1 on X2. In other words, one would estimate the equation $\hat{X}_1 = a + b X_2$ and calculate the residual $X_r = X_1 - \hat{X}_1$. The partial regression coefficient, b1, is then equal to the bivariate regression coefficient, br, obtained from the equation $Y = a + b_r X_r$.
Extension to the case of k variables is straightforward. The partial
regression coefficient, b1, represents the expected change in Y when X1
is changed by one unit and X2 through Xk are held constant. It can also
be interpreted as the bivariate regression coefficient, b, for the
regression of Y on the residuals of X1, when the effect of X2 through Xk
has been removed from X1.
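This residual interpretation can be demonstrated on simulated data. A sketch assuming NumPy; the data-generating coefficients below are arbitrary, chosen only to make X1 and X2 correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)              # X1 correlated with X2
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)

# Multiple regression of Y on X1 and X2 (intercept included)
A = np.column_stack([np.ones(n), X1, X2])
b0, b1, b2 = np.linalg.lstsq(A, Y, rcond=None)[0]

# Remove the effect of X2 from X1, then regress Y on the residuals
B = np.column_stack([np.ones(n), X2])
Xr = X1 - B @ np.linalg.lstsq(B, X1, rcond=None)[0]
br = np.linalg.lstsq(np.column_stack([np.ones(n), Xr]), Y, rcond=None)[0][1]

print(np.isclose(b1, br))  # True: b1 equals the coefficient on the residuals
```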
The relationship of the standardized to the non-standardized coefficients remains the same as before:

$$ B_1 = b_1 (S_{x_1}/S_y) $$
$$ \vdots $$
$$ B_k = b_k (S_{x_k}/S_y) $$
ANALYSIS OF VARIANCE

Source        df            Sum of Squares    Mean Square
Regression    k                 SSreg           SSreg/k
Residual      n - k - 1         SSres           SSres/(n - k - 1)
Total         n - 1             SSy

where

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 $$
$$ SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 $$
$$ SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$
The strength of association is measured by the square of the multiple
correlation coefficient, R2, which is also called the coefficient of
multiple determination.
$$ R^2 = \frac{SS_{reg}}{SS_y} $$
The overall significance of the regression is tested with the hypothesis:

$$ H_0\!: \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_k = 0 $$

using the test statistic:

$$ F = \frac{SS_{reg}/k}{SS_{res}/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)} $$
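Given R², n, and k, the F statistic is a one-liner; the inputs below are illustrative values only, not from Table 1.1:

```python
def f_statistic(r2, n, k):
    """Overall F for H0: beta_1 = ... = beta_k = 0, computed from R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Illustrative only: R^2 = 0.80 with n = 30 cases and k = 3 predictors
print(round(f_statistic(0.80, 30, 3), 2))  # 34.67
```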
The significance of an individual partial regression coefficient may be tested using:

$$ t = \frac{b}{SE_b} $$

which has a t distribution with n - k - 1 degrees of freedom.
[Figure: Residual plot, with the residuals plotted against the predicted Y values, used to examine the regression assumptions.]
The purpose of stepwise regression is to select, from a large
number of predictor variables, a small subset of variables that
account for most of the variation in the dependent or criterion
variable. In this procedure, the predictor variables enter or are
removed from the regression equation one at a time. There are
several approaches to stepwise regression.
Forward inclusion. Initially, there are no predictor variables in the regression equation. Predictor variables are entered one at a time, only if they meet certain criteria specified in terms of the F ratio (see the sketch after these descriptions). The order in which the variables are included is based on their contribution to the explained variance.
Backward elimination. Initially, all the predictor variables are
included in the regression equation. Predictors are then removed
one at a time based on the F ratio for removal.
Stepwise solution. Forward inclusion is combined with the
removal of predictors that no longer meet the specified criterion
at each step.
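A rough sketch of forward inclusion in plain NumPy; the entry threshold f_in and the simulated data are hypothetical choices for illustration:

```python
import numpy as np

def r_squared(cols, y):
    """R^2 from regressing y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + cols)
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_inclusion(X, y, f_in=4.0):
    n, k_all = X.shape
    selected, remaining = [], list(range(k_all))
    while remaining:
        current = r_squared([X[:, j] for j in selected], y)
        # Candidate that most increases R^2 at this step
        best_j = max(remaining,
                     key=lambda j: r_squared([X[:, m] for m in selected + [j]], y))
        best = r_squared([X[:, m] for m in selected + [best_j]], y)
        k_new = len(selected) + 1
        # Incremental F ratio for the variable being added
        f_inc = (best - current) / ((1 - best) / (n - k_new - 1))
        if f_inc < f_in:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=50)
print(forward_inclusion(X, y))  # typically selects columns 0 and 2
```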
Multicollinearity arises when intercorrelations among the
predictors are very high.
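One common diagnostic is the variance inflation factor, VIF_i = 1/(1 - R_i²), where R_i² comes from regressing predictor X_i on all the other predictors. A self-contained sketch with deliberately collinear simulated data; the rule of thumb that VIF values above roughly 10 signal trouble is a convention, not a hard rule:

```python
import numpy as np

def r_squared(cols, y):
    """R^2 from regressing y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + cols)
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
z = rng.normal(size=60)
X = np.column_stack([z + 0.05 * rng.normal(size=60),  # X1, X2 nearly collinear
                     z + 0.05 * rng.normal(size=60),
                     rng.normal(size=60)])            # X3 independent

for j in range(X.shape[1]):
    others = [X[:, m] for m in range(X.shape[1]) if m != j]
    print(f"VIF for X{j + 1}: {1 / (1 - r_squared(others, X[:, j])):.1f}")
```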