Correlation and Regression
Prepared by: B. S. Parajuli
Correlation
• Correlation is a statistical tool used to measure the degree of association between two or more variables.
• Correlation is concerned only with the strength (and direction) of the relationship.
• Correlation does not imply any cause-and-effect relationship.
The following are the types of correlation:
i. Positive and Negative Correlation
ii. Linear and Non-linear Correlation
iii. Simple, Partial and Multiple Correlation
Positive Correlation:
The relationship between two variables is such that as one variable’s
values tend to increase, the other variable’s values also tend to increase,
or if one variable’s values tend to decrease, the other variable’s values
also tend to decrease.
That is, positive correlation indicates that the two variables' values move in the same direction.
Example: height and weight of children up to a certain age, demand and supply, income and expenditure of a person, etc.
Negative Correlation:
The relationship between two variables is such that as one variable’s
values tend to increase, the other variable’s values also tend to decrease,
or if one variable’s values tend to decrease, the other variable’s values
also tend to increase.
That is, negative correlation indicates that the two variables' values move in opposite directions.
Example: altitude and temperature, price and demand, etc.
Linear Correlation and Non-linear Correlation
The distinction between linear and non-linear correlation is based upon the constancy of the ratio of change between the variables.
Linear Correlation:
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, then the correlation is said to be linear; if the ratio of change is not constant, the correlation is non-linear (curvilinear).
Example:
X:  10   20   30   40   50
Y:  70  140  210  280  350
Methods of Studying Correlation
Graphical method:
i. Scatter diagram
Algebraic methods:
i. Karl Pearson's correlation coefficient
ii. Spearman's rank correlation coefficient
iii. Correlation in a bivariate frequency table
Scatter Diagram:
[Figure: scatter plots illustrating different degrees and directions of correlation between X and Y.]
Karl Pearson’s Correlation Coefficient
Prof. Karl Pearson developed this method. Karl Pearson's correlation coefficient between two variables X and Y, denoted by 'r', is the ratio of the covariance between X and Y to the product of the standard deviations of X and Y.
Thus 'r' is a numerical measure of the linear relationship between the two variables: it is a number that indicates to what extent they are related.
$$ r = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X,Y)}{\sigma_x\,\sigma_y} $$
where
$$ \operatorname{Cov}(X,Y) = \frac{1}{n}\sum (x-\bar{x})(y-\bar{y}) $$
so that
$$ r = \frac{\frac{1}{n}\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\frac{1}{n}\sum (x-\bar{x})^2}\,\sqrt{\frac{1}{n}\sum (y-\bar{y})^2}} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2\,\sum (y-\bar{y})^2}} $$
The value of r ranges between ( -1) and ( +1) i.e. -1≤ r≤1
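As a quick computational illustration, here is a minimal Python sketch (standard library only) of the last formula above. The function name pearson_r and the sample data are illustrative; the data happen to be the X, Y values used in the regression example later in these slides.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's correlation coefficient for paired samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # covariance and variances (the 1/n factors cancel in the ratio)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [0, 2, 3, 5, 6]
Y = [5, 7, 8, 10, 12]
print(round(pearson_r(X, Y), 3))   # about 0.992
```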
Goodness of fit measure in terms of correlation coefficient
The square of the correlation coefficient, r², is called the coefficient of determination: it gives the proportion of the total variation in the dependent variable that is explained by the variation in the independent variable.
Case I: When actual ranks are given
Rank correlation coefficient:
$$ R = 1 - \frac{6\sum d^2}{n(n^2-1)} $$
where d = R1 − R2 and ∑d = 0 (always),
R1 = rank of the items with respect to the first variable,
R2 = rank of the items with respect to the second variable.
The limits of R are −1 to +1.
Case II: When ranks are not given and values are not repeated
In this case we first assign ranks to the data, from largest to smallest or from smallest to largest, as 1, 2 and so on. The procedure is then the same:
$$ R = 1 - \frac{6\sum d^2}{n(n^2-1)} $$
Case I: When actual ranks are given
Example: The rankings of ten students in Statistics and Microprocessor are as follows. Calculate the coefficient of rank correlation.

Statistics (X):      5  3  4  10  8  7  2  1   6  9
Microprocessor (Y):  6  4  9   8  1  2  3  10  5  7

Rank of X (R1)   Rank of Y (R2)   d = R1 − R2   d²
5                6                −1            1
3                4                −1            1
4                9                −5            25
10               8                 2            4
8                1                 7            49
7                2                 5            25
2                3                −1            1
1                10               −9            81
6                5                 1            1
9                7                 2            4
                                  ∑d = 0        ∑d² = 192

Now, rank correlation:
$$ R = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6\times 192}{10(10^2-1)} = 1 - \frac{1152}{990} = \frac{-162}{990} = -0.164 $$
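A minimal Python check of this calculation; the helper spearman_r below is illustrative and assumes the two rank lists contain no ties.

```python
def spearman_r(r1, r2):
    """Spearman's rank correlation coefficient from two lists of (untied) ranks."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

stats_rank = [5, 3, 4, 10, 8, 7, 2, 1, 6, 9]
micro_rank = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]
print(round(spearman_r(stats_rank, micro_rank), 3))   # -0.164
```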
Case II: Ranks are not given and not repeated
Example: The scores of 8 students in an examination in two subjects X and Y are as follows. Find the rank correlation coefficient.

X:  48  89  92  50  29  60  55  35
Y:  60  49  69  45  50  51  70  75

X    Rank of X (R1)   Y    Rank of Y (R2)   d = R1 − R2   d²
48   6                60   4                 2            4
89   2                49   7                −5            25
92   1                69   3                −2            4
50   5                45   8                −3            9
29   8                50   6                 2            4
60   3                51   5                −2            4
55   4                70   2                 2            4
35   7                75   1                 6            36
                                            ∑d = 0        ∑d² = 90

Now,
$$ R = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6\times 90}{8(8^2-1)} = 1 - \frac{540}{504} = \frac{-36}{504} = -0.071 $$
Case III: Repeated ranks are given (tied ranks)
Sometimes two or more of the values are equal. In such a case it is necessary to rank the tied items equally: each of them is given the average of the ranks they would otherwise occupy (a common rank).
For example, if the 2nd and 3rd items have the same value, then each of them gets the rank (2+3)/2 = 2.5, and the next item gets rank 4.
In this case an adjustment is made in Spearman's formula: the correction factor m(m²−1)/12 is added to ∑d² for each group of tied values, giving
$$ R = 1 - \frac{6\left\{\sum d^2 + \frac{m_1(m_1^2-1)}{12} + \frac{m_2(m_2^2-1)}{12} + \dots\right\}}{n(n^2-1)} $$
where m1, m2, … are the numbers of times an item is repeated.
Case III: Repeated ranks are given (tied ranks)
Example: From the following data, compute the coefficient of rank correlation between X and Y.

X:  33  56  50  65  44  38  44  50  15  26
Y:  50  35  70  25  35  58  75  60  55  26
Calculation
Score (X)   Rank of X (R1)   Score (Y)   Rank of Y (R2)   d = R1 − R2   d²
33          8                50          6                 2            4.00
56          2                35          7.5              −5.5          30.25
50          3.5              70          2                 1.5          2.25
65          1                25          10               −9            81.00
44          5.5              35          7.5              −2            4.00
38          7                58          4                 3            9.00
44          5.5              75          1                 4.5          20.25
50          3.5              60          3                 0.5          0.25
15          10               55          5                 5            25.00
26          9                26          9                 0            0.00
                                                          ∑d = 0        ∑d² = 176
Here, n = 10.
For series X: the value 50 is repeated m1 = 2 times, and the value 44 is repeated m2 = 2 times.
For series Y: the value 35 is repeated m3 = 2 times.
$$ R = 1 - \frac{6\left\{\sum d^2 + \frac{m_1(m_1^2-1)}{12} + \frac{m_2(m_2^2-1)}{12} + \frac{m_3(m_3^2-1)}{12}\right\}}{n(n^2-1)} $$
$$ R = 1 - \frac{6\left[176 + \frac{2(2^2-1)}{12} + \frac{2(2^2-1)}{12} + \frac{2(2^2-1)}{12}\right]}{10(10^2-1)} = 1 - \frac{6(176 + 1.5)}{990} = 1 - \frac{1065}{990} \approx -0.076 $$
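As a sketch of the tie-corrected calculation in Python, the helpers below (avg_ranks, spearman_tied) are illustrative names, not part of the slides; they rank in descending order, average tied ranks, and add one m(m²−1)/12 term per group of ties.

```python
from collections import Counter

def avg_ranks(values):
    """Rank values in descending order, giving tied values their average rank."""
    order = sorted(values, reverse=True)
    positions = {}
    for i, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(i)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_tied(xs, ys):
    """Spearman's R with the m(m^2 - 1)/12 correction for tied ranks."""
    n = len(xs)
    r1, r2 = avg_ranks(xs), avg_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    # one correction term per group of repeated values in either series
    cf = sum(m * (m ** 2 - 1) / 12
             for counts in (Counter(xs), Counter(ys))
             for m in counts.values() if m > 1)
    return 1 - 6 * (d2 + cf) / (n * (n ** 2 - 1))

X = [33, 56, 50, 65, 44, 38, 44, 50, 15, 26]
Y = [50, 35, 70, 25, 35, 58, 75, 60, 55, 26]
print(round(spearman_tied(X, Y), 3))   # about -0.076
```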
Regression Analysis
Linear component: a line fitted to a set of data points to estimate the relationship between two variables is called a regression line (y = a + bx).
[Figure: the simple linear regression model Yi = β0 + β1Xi + εi, showing the observed value of Y for Xi, the predicted value of Y for Xi, the random error εi for that Xi value, the slope β1 and the intercept β0.]
The two lines of regression
[Figure: the regression line of Y on X (Y = a + bX) and the regression line of X on Y (X = a + bY), intersecting at the point of means (X̄, Ȳ).]
Least square method
Let Y = a + bX ….. (1) be the linear regression equation of Y on X,
where Y = dependent variable and X = independent variable,
a = constant or Y-intercept (the value of Y when X = 0),
b = byx = regression coefficient of Y on X; it measures the average change in the dependent variable Y corresponding to a unit change in the independent variable X.
By the principle of least squares, the two normal equations of regression equation (1) are:
∑Y = na + b∑X ………….(2)
∑XY = a∑X + b∑X² …….(3)
Solving equations (2) and (3) gives the values of 'a' and 'b'; substituting them in equation (1) gives the required fitted regression line.
Alternatively,
$$ b = b_{yx} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad \text{also } b_{yx} = r\,\frac{\sigma_y}{\sigma_x} $$
$$ b = b_{yx} = \frac{\sum xy}{\sum x^2}, \quad \text{where } x = X-\bar{X} \text{ and } y = Y-\bar{Y} \text{ are deviations measured from the means.} $$
The computational formula for the y-intercept 'a' is:
$$ a = \bar{Y} - b\bar{X} = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$
After finding the values of 'a' and 'b', we get the required fitted linear regression model of Y on X as:
$$ \hat{Y} = a + bX $$
Alternatively, we can also use Y − Ȳ = byx (X − X̄).
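As a small sketch of these computational formulas in Python (the function name fit_y_on_x is illustrative; the sample data are the X, Y values of the worked example later in the slides):

```python
def fit_y_on_x(xs, ys):
    """Least-squares fit of Y = a + bX using the sum formulas above."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # b_yx
    a = sy / n - b * sx / n                          # a = Y-bar - b * X-bar
    return a, b

a, b = fit_y_on_x([0, 2, 3, 5, 6], [5, 7, 8, 10, 12])
print(round(a, 2), round(b, 2))   # about 4.81 and 1.12 (cf. the worked example below)
```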
Let X = a + bY ….. (1) be the linear regression equation of X on Y,
where X = dependent variable and Y = independent variable,
a = constant or X-intercept (the value of X when Y = 0),
b = bxy = regression coefficient of X on Y; it measures the average change in the dependent variable X corresponding to a unit change in the independent variable Y.
By the principle of least squares, the two normal equations of regression equation (1) are:
∑X = na + b∑Y ………….(2)
∑XY = a∑Y + b∑Y² …….(3)
Solving equations (2) and (3) gives the values of 'a' and 'b'; substituting them in equation (1) gives the required fitted regression line.
Alternatively,
$$ b = b_{xy} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}, \qquad \text{also } b_{xy} = r\,\frac{\sigma_x}{\sigma_y} $$
$$ b = b_{xy} = \frac{\sum xy}{\sum y^2}, \quad \text{where } x = X-\bar{X} \text{ and } y = Y-\bar{Y} \text{ are deviations measured from the means.} $$
The computational formula for the x-intercept 'a' is:
$$ a = \bar{X} - b\bar{Y} = \frac{\sum X}{n} - b\,\frac{\sum Y}{n} $$
After finding the values of 'a' and 'b', we get the required fitted linear regression model of X on Y as:
$$ \hat{X} = a + bY $$
Alternatively, we can also use X − X̄ = bxy (Y − Ȳ).
$$ \text{Correlation coefficient } r = \pm\sqrt{b_{yx}\, b_{xy}} $$
If both regression coefficients are positive we take the positive sign, and if both are negative we take the negative sign.
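This identity follows directly from the σ-forms of the two regression coefficients given above:
$$ b_{yx}\, b_{xy} = \left(r\,\frac{\sigma_y}{\sigma_x}\right)\left(r\,\frac{\sigma_x}{\sigma_y}\right) = r^2 \quad\Longrightarrow\quad r = \pm\sqrt{b_{yx}\, b_{xy}} $$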
Example
From the following data, (i) compute the correlation coefficient, (ii) find the coefficient of determination, (iii) fit the regression equation of Y on X, (iv) estimate Y when X = 4, and (v) find the standard error of estimate.

X:  0  2  3  5  6
Y:  5  7  8  10  12
Solution

X    Y    X²   Y²    X·Y
0    5    0    25    0
2    7    4    49    14
3    8    9    64    24
5    10   25   100   50
6    12   36   144   72
∑X = 16   ∑Y = 42   ∑X² = 74   ∑Y² = 382   ∑XY = 160

$$ \bar{Y} = \frac{\sum Y}{n} = \frac{42}{5} = 8.4 \qquad \bar{X} = \frac{\sum X}{n} = \frac{16}{5} = 3.2 $$
(i)
$$ r = \frac{n\sum XY - \sum X\sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} = \frac{5\times 160 - 16\times 42}{\sqrt{5\times 74 - 16^2}\,\sqrt{5\times 382 - 42^2}} = \frac{128}{\sqrt{114\times 146}} = \frac{128}{129.01} = 0.992 $$
(ii) r² = (0.992)² = 0.984, i.e. 98.4% of the total variation in Y is explained by the variation in X.
(iii) Let Y = a + bX ….(i) be the regression equation of Y on X.
$$ b = \frac{n\sum XY - \sum X\sum Y}{n\sum X^2 - (\sum X)^2} = \frac{128}{114} = 1.12 \qquad a = \bar{Y} - b\bar{X} = 8.4 - 1.12\times 3.2 = 4.8 $$
Hence the fitted regression equation of Y on X is
$$ \hat{Y} = 4.8 + 1.12X $$
Interpretation of the regression coefficient:
Since b = 1.12, if X increases by 1 unit then, on average, Y increases by 1.12 units.
(iv) The value of Y when X = 4 is
$$ \hat{Y} = 4.8 + 1.12\times 4 = 9.28 $$
(v) Standard error of estimate:
$$ S_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} = \sqrt{\frac{382 - 4.8\times 42 - 1.12\times 160}{5-2}} = 0.63 $$
(using the rounded values a = 4.8 and b = 1.12; with unrounded coefficients Se ≈ 0.39).
Hence the average variability of the observations around the fitted regression line is about 0.63. Since Se ≠ 0, the regression line is not perfect for estimating the dependent variable.
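A compact Python sketch that reproduces parts (i)–(v) of this example from the raw data (variable names are illustrative; the small differences from the slide values come from rounding a and b):

```python
import math

X = [0, 2, 3, 5, 6]
Y = [5, 7, 8, 10, 12]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)
sxy = sum(x * y for x, y in zip(X, Y))

# (i) correlation coefficient and (ii) coefficient of determination
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# (iii) regression of Y on X
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# (iv) prediction at X = 4 and (v) standard error of estimate
y_hat_4 = a + b * 4
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se = math.sqrt(sse / (n - 2))

print(round(r, 3), round(r ** 2, 3))   # 0.992, 0.984
print(round(a, 2), round(b, 2))        # about 4.81, 1.12
print(round(y_hat_4, 2))               # about 9.3 (9.28 with the rounded coefficients)
print(round(se, 2))                    # about 0.39 (0.63 with the rounded coefficients)
```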
Correlation Analysis vs. Regression Analysis
1. Correlation analysis is the statistical tool used to describe the degree to which two variables are linearly related, whereas regression analysis is a measure expressing the average relationship between the two variables, whether they are linearly or non-linearly related.
2. The correlation coefficient (r) between two variables X and Y measures the direction and degree of the linear relationship between them, whereas regression analysis is concerned with establishing the functional relationship between the two variables and is used to predict or estimate the value of the dependent variable for any given value of the independent variable.
3. Correlation coefficients are symmetric, i.e. rXY = rYX, whereas regression coefficients are not symmetric, i.e. bYX ≠ bXY.
4. Correlation need not imply a cause-and-effect relationship between the variables under study, whereas regression clearly indicates a cause-and-effect relationship: the variables corresponding to cause and effect are the independent and dependent variables respectively.
5. The correlation coefficient (r) is a pure number, independent of the units of measurement, whereas regression coefficients are not pure numbers; they carry the units of measurement.
6. Correlation analysis has limited applications in comparison with regression analysis, which has much wider applications.
Measures of Variation
Total sum of squares = sum of squares due to regression + sum of squares due to error
TSS = SSR + SSE
For Y = a + bX:
Total sum of squares: the measure of variation of the values of the dependent variable (Y) around their mean value (Ȳ).
$$ TSS = \sum (Y - \bar{Y})^2 = \sum Y^2 - n\bar{Y}^2 $$
Regression sum of squares: the sum of the squared differences between the predicted values of Y and the mean value of Y.
$$ SSR = \sum (\hat{Y} - \bar{Y})^2 = a\sum Y + b\sum XY - n\bar{Y}^2 $$
Error sum of squares: the sum of the squared differences between the observed values of Y and the predicted values of Y.
$$ SSE = \sum (Y - \hat{Y})^2 = \sum Y^2 - a\sum Y - b\sum XY $$
Thus the coefficient of determination is r² = SSR/TSS, and since TSS = SSR + SSE, we also have SSR = TSS − SSE.
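A short Python check of these identities on the example data above (a sketch, not part of the slides):

```python
X = [0, 2, 3, 5, 6]
Y = [5, 7, 8, 10, 12]
n = len(X)

# coefficients of the fitted line Y-hat = a + b*X (as in the earlier example)
b = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / (n * sum(x * x for x in X) - sum(X) ** 2)
a = sum(Y) / n - b * sum(X) / n

y_bar = sum(Y) / n
y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))

print(round(tss, 3), round(ssr + sse, 3))   # equal: TSS = SSR + SSE
print(round(ssr / tss, 3))                  # about 0.984 = r^2
```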
Standard Error of Estimate
The standard error of estimate of Y on X, denoted by Se, measures the average variation or scatter of the observed data points around the regression line.
It is used to measure the reliability of the regression equation. It is calculated by the following relation:
$$ S_e = \sqrt{MSE} = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-2}} \qquad \text{(for simple regression, } k = 1\text{)} $$
$$ S_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} $$
where n = number of pairs of observations and k = number of independent variables.
If Se = 0, there is no variation of the observed data around the regression line; in that case the regression line is perfect for estimating the dependent variable.
Residual (error):
The residual or error term for a given X in the model is the difference between the actual value of Y and the estimated or predicted value Ŷ; it is denoted by e:
$$ e = Y - \hat{Y} $$
Example: A computer operator is interested in knowing how the data rate of internet users depends upon the bandwidth. The following results were gathered by the operator.

Bandwidth:  17  35  41  19  25  20  10  15
Data rate:  47  64  68  50  60  55  30  33

a. Is there any association between bandwidth and data rate?
b. Fit a regression model to describe the given data and also interpret the estimated regression coefficient.
c. What percentage of the variation in data rate is explained by the variation in bandwidth?
d. Estimate the data rate when the bandwidth is 22.
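A sketch of how parts (a)–(d) could be worked through in Python, reusing the sum formulas given earlier (variable names are illustrative; no answers are stated on the slide, so the printed values are simply what the formulas give for this data):

```python
import math

bw   = [17, 35, 41, 19, 25, 20, 10, 15]   # bandwidth (X)
rate = [47, 64, 68, 50, 60, 55, 30, 33]   # data rate (Y)
n = len(bw)

sx, sy = sum(bw), sum(rate)
sxx = sum(x * x for x in bw)
syy = sum(y * y for y in rate)
sxy = sum(x * y for x, y in zip(bw, rate))

# (a) association between bandwidth and data rate: Pearson's r
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# (b) regression of data rate on bandwidth: Y = a + bX
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# (c) percentage of variation explained, (d) estimated data rate at bandwidth 22
print(round(r, 3), round(r ** 2 * 100, 1), round(a, 2), round(b, 2), round(a + b * 22, 1))
```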
Example: You are given the following data:
X̄ = 20, Ȳ = 20, σX = 4, σY = 3, r = 0.7. Obtain the two regression equations and find the most likely value of Y when X = 24.
Solution hint: The regression line of Y on X is
$$ Y - \bar{Y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X - \bar{X}) $$
(iii) Finding σy: given byx = 4/5, r = 6/10 and σx² = 9 (so σx = 3), the relation byx = r·σy/σx gives 4/5 = (6/10)(σy/3), hence σy = 4.
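For completeness, a sketch of the solution to the example above (these steps are not shown on the slide):
$$ b_{yx} = r\,\frac{\sigma_Y}{\sigma_X} = 0.7\times\frac{3}{4} = 0.525, \qquad Y - 20 = 0.525\,(X-20) \;\Rightarrow\; Y = 9.5 + 0.525X $$
$$ b_{xy} = r\,\frac{\sigma_X}{\sigma_Y} = 0.7\times\frac{4}{3} \approx 0.933, \qquad X - 20 = 0.933\,(Y-20) \;\Rightarrow\; X \approx 1.33 + 0.933Y $$
$$ \text{At } X = 24:\quad Y = 9.5 + 0.525\times 24 = 22.1 $$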
Example: For fifty files transmitted, the regression equation of time taken (Y) on file size (X) is 4Y − 5X − 8 = 0. The average size of the transmitted files is 40 GB, and the ratio of the standard deviations σY : σX is 5 : 2. Find the average time taken to transmit a file and the coefficient of correlation between the time and the size of the file.
Solution: Here we are given the number of files transmitted, n = 50; the regression line of time taken (Y) on file size (X), 4Y − 5X − 8 = 0; X̄ = 40; and σY/σX = 5/2.
We have to obtain Ȳ and rXY.
We know that the regression equation passes through the point of means (X̄, Ȳ), so
$$ 4\bar{Y} - 5\bar{X} = 8 \;\Rightarrow\; 4\bar{Y} - 5\times 40 = 8 \;\Rightarrow\; 4\bar{Y} = 208 \;\Rightarrow\; \bar{Y} = 52 $$
∴ The average time taken to transmit a file is Ȳ = 52.
Since the given line is Y on X,
$$ 4Y = 8 + 5X \;\Rightarrow\; Y = 2 + \frac{5}{4}X, $$
so the regression coefficient of Y on X is bYX = 5/4.
Again,
$$ b_{YX} = r\,\frac{\sigma_Y}{\sigma_X} \;\Rightarrow\; \frac{5}{4} = r\times\frac{5}{2} \;\Rightarrow\; r_{XY} = \frac{1}{2} = 0.5 $$