Linear Regression Analysis and Least Square Methods
THE SCHEME
ORIGIN
SCATTER DIAGRAM AND REGRESSION
LEAST SQUARE METHODS
STANDARD ERROR ESTIMATES
CORRELATION ANALYSIS
EXAMPLES
LIMITATIONS, ERRORS AND CAVEATS
PRACTICAL APPLICATIONS OF REGRESSION ANALYSIS
Epidemiology - Early evidence relating tobacco
smoking to mortality and morbidity came from observational
studies employing regression analysis.
Finance - The capital asset pricing model uses linear
regression for analyzing and quantifying the systematic risk of
an investment.
Economics - Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand and labor supply.
Environmental Science - Linear regression is applied to a wide range of problems in environmental science.
INTRODUCTION TO REGRESSION ANALYSIS
How do we determine the relationship between variables?
Regression analysis is used to:
Predict the value of a dependent variable based on the
value of at least one independent variable
Explain the impact of changes in an independent variable
on the dependent variable
TYPES OF RELATIONSHIPS

Direct Relationship: as the independent variable increases, the dependent variable also increases (a positive linear relationship).

Inverse Relationship: the dependent variable decreases as the independent variable increases (a negative linear relationship).
Curvilinear relationships

[Figure: scatter plots of y against x showing curvilinear patterns]

Weak relationships

[Figure: scatter plots of y against x showing weak linear relationships]
THE REGRESSION EQUATION

Y = a + bX

where Y is the dependent variable, X the independent variable, a the Y intercept and b the slope of the line.
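As a minimal sketch (the coefficient values and the function name predict are illustrative, not taken from the slides), the equation is evaluated like this:

```python
# Illustrative values only: a is the Y intercept, b the slope of the line.
a = 2.0
b = 0.5

def predict(x):
    """Predicted value of the dependent variable, Y = a + bX."""
    return a + b * x

print(predict(10))   # 2.0 + 0.5 * 10 = 7.0
```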
LINEAR REGRESSION ASSUMPTIONS
Error values (e) are statistically independent
Error values are normally distributed for any given value of x
The probability distribution of the errors has constant variance
The underlying relationship between the x variable and the y variable is linear
LINEAR REGRESSION

[Figure: fitted line Y = a + bx with intercept a and slope b, showing for a given xi the observed value of y, the predicted value of y, and the random error between them]
GOOD FIT

Consider two different lines (Graph 1 and Graph 2) fitted to the same data, with individual errors 1, 2 and -3 for Graph 1 and 4, -2 and -2 for Graph 2.

ALGEBRAIC SUM
Graph 1: 1 + 2 - 3 = 0
Graph 2: 4 - 2 - 2 = 0
For two different lines the algebraic sum of the errors is zero in both cases, so it cannot tell us which line fits better; the absolute sum seems to represent the relation between the variables better.

ABSOLUTE SUM
Graph 1: 1 + 2 + 3 = 6
Graph 2: 4 + 2 + 2 = 8

LEAST SQUARES
Graph 1: (1)² + (2)² + (-3)² = 14
Graph 2: (4)² + (2)² + (2)² = 24

Both the absolute sum and the least squares criterion pick Graph 1 as the better fit here.
But before we reach any conclusion, let us look at a peculiar situation.

Data set: {(2, 4), (7, 6), (10, 2)}
ABSOLUTE SUM
Graph 1: 0 + 0 + 3 = 3
Graph 2: 1 + 2 + 1.5 = 4.5

Graph 1 ignores the middle point but still has the lower absolute error. Intuitively, Graph 2 should have given a better fit for the complete data. So what is the problem?
For the same data set, let us now use the least square method: we square the individual errors before we add them.

Data set: {(2, 4), (7, 6), (10, 2)}

LEAST SQUARES
Graph 1: (0)² + (0)² + (3)² = 9
Graph 2: (1)² + (2)² + (1.5)² = 7.25

Graph 2, which intuitively gave the better fit to the data sample, now has the smaller sum of squared errors and is therefore judged the better fit, ahead of Graph 1.
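To make the comparison concrete, here is a small Python sketch (not part of the original slides) that totals the absolute and squared errors of the two candidate lines, using the per-point errors quoted above (the line equations themselves are not given on the slides):

```python
# Per-point errors of the two candidate lines on the data set {(2,4), (7,6), (10,2)}.
errors = {
    "Graph 1": [0, 0, 3],
    "Graph 2": [1, 2, 1.5],
}

for name, e in errors.items():
    abs_sum = sum(abs(v) for v in e)   # absolute-sum criterion
    sq_sum = sum(v ** 2 for v in e)    # least-squares criterion
    print(name, abs_sum, sq_sum)

# Graph 1: absolute sum 3,   squared sum 9
# Graph 2: absolute sum 4.5, squared sum 7.25
# The absolute criterion prefers Graph 1; least squares prefers Graph 2.
```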
LEAST SQUARE ESTIMATES

The least square method minimizes the sum of squared errors

Σe² = Σ(y - ŷ)² = Σ(y - (a + bx))²

which gives the slope

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

algebraic equivalent:

b = (Σxy - (Σx · Σy) / n) / (Σx² - (Σx)² / n)

and

a = ȳ - b x̄
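As a minimal sketch (the helper name fit_line and the use of plain Python lists are my assumptions, not from the slides), the algebraic-equivalent formulas can be computed as:

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x,
    computed with the algebraic-equivalent formulas above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # a = y-bar - b * x-bar
    return a, b
```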
INTERPRETATION OF THE SLOPE AND THE INTERCEPT

a is the estimated average value of y when the value of x is zero
b is the estimated change in the average value of y for a one-unit increase in x
ERRORS
How to Check Accuracy of Estimated Line
How to Check Reliability of Estimated Line
[Table: expenditure (lakhs) data used for the worked error-checking example]
CHECKING ACCURACY

The estimated line should follow the path of the data, and the individual errors should cancel each other (their algebraic sum should be close to zero).

[Figure: scatter of observations with the fitted line Y = a + bx]
RELIABILITY

[Figure: two scatter plots with fitted lines - points clustered tightly around the line give a more reliable estimate; widely scattered points give a less reliable one]
STANDARD ERROR OF ESTIMATE

Se = √( SSE / (n - 2) )

Where
SSE = sum of squares error = Σ(Y - Ŷ)²
n = sample size
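A minimal sketch of this formula (the function name is mine; it assumes equal-length lists of observed and predicted values):

```python
import math

def standard_error_of_estimate(y, y_hat):
    """Se = sqrt(SSE / (n - 2)), where SSE is the sum of squared errors."""
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(sse / (len(y) - 2))
```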
Roughly 68.2% of the observed points lie within ±1 Se of the regression line, and roughly 95.5% lie within ±2 Se.
INTERPRETING Se

[Figure: the regression line Y = a + bx drawn with parallel bands at Y = a + bx ± 1 Se and Y = a + bx ± 2 Se]
For the expenditure (lakhs) example, Se = 1.88 lakhs.
CORRELATION ANALYSIS

Describes the degree to which one variable is linearly related to another

Used in conjunction with regression analysis to explain how well the regression line explains the variation of the dependent variable
Coefficient of Determination
Coefficient of Correlation
COEFFICIENT OF DETERMINATION

Measures strength of association

Developed from two variations of the observed Y values:
about the fitted regression line: Σ(Y - Ŷ)²
about their own mean: Σ(Y - Ȳ)²

R² = 1 - Σ(Y - Ŷ)² / Σ(Y - Ȳ)²

Varies between 0 and 1

Interpreting R²: it is the proportion of the variation in Y that is explained by the regression line.
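A short sketch of the same computation (the function name and list inputs are assumptions, not from the slides):

```python
def coefficient_of_determination(y, y_hat):
    """R^2 = 1 - SSE/SST: the share of the variation in y explained by the line."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # variation about the fitted line
    sst = sum((yi - y_bar) ** 2 for yi in y)                # variation about the mean
    return 1 - sse / sst
```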
COEFFICIENT OF CORRELATION

Another measure of association

R = ±√R², taking the sign of the slope b

Varies between -1 and 1
DATA PROVIDED

OVERHEADS   UNITS PRODUCED
191         40
170         42
272         53
155         35
280         56
173         39
234         48
116         30
153         37
178         40
The slope and intercept are computed with the least square formulas given earlier:

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², with algebraic equivalent b = (Σxy - (Σx · Σy)/n) / (Σx² - (Σx)²/n), and a = ȳ - b x̄
OVERHEAD (y)   UNITS (x)   y²       x²      xy
191            40          36481    1600    7640
170            42          28900    1764    7140
272            53          73984    2809    14416
155            35          24025    1225    5425
280            56          78400    3136    15680
173            39          29929    1521    6747
234            48          54756    2304    11232
116            30          13456    900     3480
153            37          23409    1369    5661
178            40          31684    1600    7120

Sums:  Σy = 1922, Σx = 420, Σy² = 395024, Σx² = 18228, Σxy = 84541
Means: ȳ = 192.2, x̄ = 42
SUBSTITUTING IN THE FORMULAE

b = (Σxy - (Σx · Σy) / n) / (Σx² - (Σx)² / n)
  = (84541 - (420)(1922)/10) / (18228 - (420)²/10)
  = 3817 / 588
  = 6.4915

a = ȳ - b x̄ = 192.2 - (6.4915)(42) = -80.4430

The estimated regression line is therefore

ŷ = -80.4430 + 6.4915x

For a production level of 50 units:

ŷ = -80.4430 + 6.4915(50) = 244.1320

The predicted overhead for 50 units is 244.1320.
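The same numbers can be reproduced from the raw data; the following Python sketch (not part of the slides) uses the overhead and units columns from the table above:

```python
y = [191, 170, 272, 155, 280, 173, 234, 116, 153, 178]   # overhead
x = [40, 42, 53, 35, 56, 39, 48, 30, 37, 40]             # units produced
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # approx. 6.4915
a = sum_y / n - b * sum_x / n                                  # approx. -80.44
print(round(b, 4), round(a, 4))
print(round(a + b * 50, 2))   # predicted overhead for 50 units, approx. 244.13
```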
SUBSTITUTING IN THE FORMULA

Se = √( (Σy² - aΣy - bΣxy) / (n - 2) )

With Σy² = 395024, Σy = 1922, Σxy = 84541, a = -80.4430, b = 6.4915 and n = 10:

Se = √( (395024 - (-80.4430)(1922) - (6.4915)(84541)) / (10 - 2) ) = 10.2320
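A quick check of this value in Python (the literal sums and coefficients below are copied from the slides):

```python
import math

sum_y2, sum_y, sum_xy = 395024, 1922, 84541
a, b, n = -80.4430, 6.4915, 10

se = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(round(se, 4))   # approx. 10.232
```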
GRAPHICAL PRESENTATION

[Figure: scatter plot of overhead against units produced with the regression line ŷ = -80.4430 + 6.4915x]
CALCULATING THE CORRELATION COEFFICIENT

Sample correlation coefficient:

r = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² · Σ(y - ȳ)² )

algebraic equivalent:

r = (nΣxy - ΣxΣy) / √( (nΣx² - (Σx)²)(nΣy² - (Σy)²) )

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
CALCULATION EXAMPLE

Tree Height (y)   Trunk Diameter (x)   xy     y²      x²
35                8                    280    1225    64
49                9                    441    2401    81
27                7                    189    729     49
33                6                    198    1089    36
60                13                   780    3600    169
21                7                    147    441     49
45                11                   495    2025    121
51                12                   612    2601    144

Sums: Σy = 321, Σx = 73, Σxy = 3142, Σy² = 14111, Σx² = 713
r = (nΣxy - ΣxΣy) / √( (nΣx² - (Σx)²)(nΣy² - (Σy)²) )
  = (8(3142) - (73)(321)) / √( (8(713) - (73)²)(8(14111) - (321)²) )
  = 0.886

[Figure: scatter plot of tree height (y) against trunk diameter (x) with the fitted line; r = 0.886 indicates a relatively strong positive linear association between trunk diameter and tree height]
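For a runnable check (variable names are mine), the same coefficient can be computed directly from the tree data:

```python
import math

y = [35, 49, 27, 33, 60, 21, 45, 51]   # tree height
x = [8, 9, 7, 6, 13, 7, 11, 12]        # trunk diameter
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))   # 0.886
```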