Linear Regression
Disclaimer: This material is protected under the Copyright Act, AnalytixLabs ©, 2011. Unauthorized use and/or duplication of this material or any part of this material, including data, in any
form without explicit written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.
Learn to Evolve
Introduction to Linear Regression
Business Problem
• This was a case of prediction. How about doing root cause analysis?
Regression modeling
Establishing a functional relationship between a set of Explanatory or
Independent variables X1, X2, …, Xp and the Response or Dependent variable Y.
Question
- Should we grant him/her the card?
Nature of Explanatory & Dependent variables
An Explanatory variable could be
Numerical
Discrete: e.g. Number of satisfactory trades
Continuous: e.g. Highest Credit Line
Categorical
Ordinal: e.g. Income Group (High/Medium/Low)
Nominal: e.g. Gender (Male/Female)
A Dependent variable could be
Continuous: Y
Binary: 0/1
Count: Y ≥ 0, i.e. {0, 1, 2, 3, …}

[Figure: example scatter plots of relationship shapes between X and Y, including logarithmic and polynomial]
What is OLS REGRESSION ANALYSIS?
OLS regression tries to draw the best-fit regression line: a line such that the sum of the squared vertical distances of
all the points to the line is minimized.
[Figure: scatter plot with the "best fit" line y = a + βx (intercept = a, slope = β); the error or residual term is the vertical distance from each point to the line, plotted against the independent variable (x)]
Ordinary Least Squares (OLS) linear regression assumes that the underlying relationship between two variables can
best be described by a line.
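As a concrete illustration, here is a minimal base-R sketch of such a fit, using the five (income, revenue) points tabulated in Step 1 below (variable names are illustrative):

# OLS fit of company revenue on customer income (data from the Step-1 table)
> income  <- c(10, 44, 60, 62, 70)
> revenue <- c(180, 460, 300, 425, 400)
> fit <- lm(revenue ~ income)     # lm() minimizes the sum of squared residuals
> coef(fit)                       # intercept a and slope beta of the best-fit line
> sum(residuals(fit)^2)           # the minimized sum of squared deviations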
Regression-Step-0
Step-0:
Once we have selected the dependent variable we wish to predict, the first step before running a
regression is to identify what independent variables might influence the magnitude of the dependent
variable and why.
Regression-Step-1
COLLECTING AND GRAPHING THE DATA
The first step is to collect the necessary information and to enter it in a format that allows the user to graph
and later "regress" the data.
Company revenue (Y)    Customer income (X)
180                    10
460                    44
300                    60
425                    62
400                    70

[Figure: scatter plot of company revenue (0-500) against customer income (0-80), suggesting a positive relationship]
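In R, such a graph can be produced with a single call, reusing the vectors from the OLS sketch above:

# Graph the data before "regressing" it
> plot(income, revenue, xlab="Customer income", ylab="Company revenue")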
Regression-Step-2
The way linear regression "works" is to start by naively fitting a horizontal, no-slope (slope = 0) line to the data. The
y-intercept of this line is simply the arithmetic average of the collected values of the dependent variable.

[Figure: scatter plot of all ten data points with the horizontal line Y = 295 (the average Y value); the residuals e1 … e10 are the vertical distances from each point to the line. The sum of the squared residuals, S_no-slope, gives us a measure of how well the horizontal line fits the data.]
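A short base-R sketch of this no-slope baseline, reusing the revenue vector from the OLS example above (the slide's own figures come from the full ten-point dataset):

# Horizontal (no-slope) baseline: intercept = average of the dependent variable
> y_bar <- mean(revenue)                     # arithmetic average of Y
> S_no_slope <- sum((revenue - y_bar)^2)     # sum of squared residuals about the mean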
[Figure: the same scatter plot with the "best fit" sloped line, Revenue = 144 + 4.1 x (income), slope = 4.1; the residuals (e.g. e9 = 27, e10 = -31) are much smaller than under the horizontal line.]

Two statistics summarize how good the fit is:
• T-statistics
• R2-statistics
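In R, this sloped line can be overlaid on the earlier scatter plot with one call, using the fit object from the OLS sketch:

# Draw the fitted line over the data
> abline(fit)     # adds the line with lm()'s intercept and slope to the current plot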
INTERPRETING THE COEFFICIENT – SIGN TEST
The coefficient of the independent variable represents our best estimate for the change in the
dependent variable given a one-unit change in the independent variable. For the fitted line
Revenue = 144 + 4.1 x (income), each additional unit of customer income adds 4.1 to predicted
revenue; the sign test asks whether the sign of this coefficient matches intuition.
INTERPRETING THE INTERCEPT
Similarly, the intercept represents our best estimate for the value of the dependent variable when
the value of the independent variable is zero. Here the intercept of 144 is the predicted revenue
at zero income.

Each estimate comes with a standard error, and the t-stat is the estimate divided by its standard
error. With a standard error of 50, the intercept's t-stat is 144/50 ≈ 2.8; as a rule of thumb,
|t| > 2 indicates a coefficient reliably different from zero.
The R2-statistic is the percent reduction in the sum of squared residuals from using
our best-fit sloped line vs. a horizontal line:

R2 = (S_no-slope - S_slope) / S_no-slope

Here S_no-slope = 121,523 and S_slope = 49,230, so:

R2 = (121,523 - 49,230) / 121,523 = 0.59
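The same computation in R, assuming the fit and S_no_slope objects from the sketches above (the slide's figures come from the full dataset):

# R2 = percent reduction in squared residuals vs. the horizontal line
> S_slope <- sum(residuals(fit)^2)        # squared residuals of the sloped line
> (S_no_slope - S_slope) / S_no_slope     # slide's numbers: (121523 - 49230)/121523 = 0.59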
If the independent variable does not drive (i.e., is not correlated with) the dependent variable in
any way, we would expect no consistent change in "y" with consistently changing "x." This is true
when the slope is zero, so S_slope = S_no-slope, which makes R2 = 0.
MULTIPLE REGRESSION
Multiple regression allows you to determine the estimated effect of multiple independent variables on the
dependent variable.
Dependent variable: Y
Independent variables: X1, X2, X3, . . . , Xn
Relationship: Y = a0 + a1X1 + a2X2 + a3X3 + . . . + anXn

Multiple regression programs will calculate the value of all the coefficients (a0 to an) and give the measures of variability for each coefficient (i.e., R2 and t-statistic).

Tests for multiple regressions (sketched in R below):
• Sign test – check signs of coefficients for the hypothesized change in the dependent variable to determine fit
• T-statistic – check the t-stat for each coefficient to establish if t > 2 (for a "good fit")
• R2, adjusted R2 – R2 values increase with the number of variables; therefore check the adjusted R2 value to establish a good fit (adjusted R2 close to 1)
MULTIPLE REGRESSION
If you can dream up multiple independent variables or "drivers" of a dependent variable, you may want to use
multiple regression.
Dependent variable: y
Independent variables: x1, x2, . . . , xi
Slopes: a1, a2, . . . , ai; intercept: b

y = a1x1 + a2x2 + . . . + aixi + b

Multiple regression notes:
• Having more independent variables always makes the fit better – even if it is not a statistically significant improvement. So:
1. Do the sign check for all slopes and the intercept
2. Check the t-stats (should be >2) for all slopes and the intercept
3. Use the adjusted R2, which takes into account the false improvement due to multiple variables
Multiple regression – 4 primary issues
What is it?
• Multicollinearity – high correlation among two or more of the independent variables
• Serial correlation / Autocorrelation – residual terms are correlated with one another; it occurs most often with time series data
• Heteroscedasticity – the variance of the residual term increases as the value of the independent variable increases
• Outliers – some values are markedly different from the majority of the values
▪ Enter method: to get the coefficient of each and every variable in the regression
▪ Backward method: when the model is exploratory, start with all the variables and then remove the
insignificant ones
▪ Forward method: sequentially add variables one at a time based on the strength of their squared semi-partial
correlations (or the simple bivariate correlation in the case of the first variable to be entered into the equation)
▪ Stepwise method: a combination of forward and backward; at each step one variable can be entered (on the basis
of the greatest improvement in R2), but one may also be removed if the change (reduction) in R2 is not significant
(in the Bordens and Abbott text this term appears to mean forward regression). These methods are sketched below.
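A hedged sketch of the backward, forward, and stepwise methods using stepAIC() from the MASS package (the model y ~ x1 + x2 + x3 and data frame mydata are illustrative; note stepAIC selects on AIC rather than on R2 directly):

# Entry methods with MASS::stepAIC
> library(MASS)
> full <- lm(y ~ x1 + x2 + x3, data = mydata)
> null <- lm(y ~ 1, data = mydata)                # intercept-only model
> stepAIC(full, direction = "backward")           # start full, drop weak terms
> stepAIC(null, direction = "forward",
+         scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3))   # add one at a time
> stepAIC(null, direction = "both",
+         scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3))   # stepwise: add or remove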
Steps in Regression Model building
Rank ordering:
- Order the data in descending order of predicted values
- Break it into 10 groups (deciles)
- Check whether the average of the actuals falls in the same order as the average of the predicted values (sketched below)
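A minimal base-R sketch of this decile check (fit and a data frame mydata with response y are illustrative names):

# Rank-ordering check: average actual vs. average predicted, by decile
> pred <- predict(fit)                        # predicted values
> ord  <- order(pred, decreasing = TRUE)      # descending order of predictions
> grp  <- cut(seq_along(ord), breaks = 10, labels = FALSE)   # 10 groups
> tapply(pred[ord], grp, mean)                # average predicted per decile
> tapply(mydata$y[ord], grp, mean)            # average actual: should fall in the same order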
Validation
> summary(myModel)
Call:
lm(formula = Learning ~ Pre1 + Pre2 + Pre3 + Pre4)
Residuals:
Min 1Q Median 3Q Max
-0.40518 -0.08460 0.01707 0.09170 0.29074
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22037 0.11536 -1.910 0.061055 .
Pre1 1.05299 0.12636 8.333 1.70e-11 ***
Pre2 0.41298 0.10926 3.780 0.000373 ***
Pre3 0.07339 0.07653 0.959 0.341541
Pre4 -0.18457 0.11318 -1.631 0.108369
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Regression Diagnostics: Code
# Influential Observations
# added variable plots
> library(car)
> avPlots(fit, ask=FALSE)
# Cook's D plot
# identify D values > 4/(n-k-1)
> cutoff <- 4/((nrow(mydata)-length(fit$coefficients)-2))
> plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
> influencePlot(fit, main="Influence Plot",
+ sub="Circle size is proportional to Cook's Distance")
# Evaluate Nonlinearity
# component + residual plot
> crPlots(fit, ask=FALSE)
# Ceres plots
> ceresPlots(fit, ask=FALSE)
# Normality of Residuals
# qq plot for studentized resid
> qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
> library(MASS)
> sresid <- studres(fit)
> hist(sresid, freq=FALSE,
+ main="Distribution of Studentized Residuals")
# overlay a standard normal density for comparison
> xfit <- seq(min(sresid), max(sresid), length=40)
> yfit <- dnorm(xfit)
> lines(xfit, yfit)
# Evaluate homoscedasticity
# non-constant error variance test
> ncvTest(fit)
# plot studentized residuals vs. fitted values
> spreadLevelPlot(fit)
# Evaluate Collinearity
> vif(fit) # variance inflation factors
> sqrt(vif(fit)) > 2 # problem?
Non-independence of Errors
# Test for Autocorrelated Errors
> durbinWatsonTest(fit)
# compare nested models with an F-test
> fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
> fit2 <- lm(y ~ x1 + x2, data=mydata)
> anova(fit1, fit2)
Cross Validation
• You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package
# K-fold cross-validation
> library(DAAG)
> cv.lm(data=mydata, form.lm=fit, m=3) # 3 fold cross-validation
• Sum the squared prediction errors across the folds, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate
• You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the bootstrap package, do the following:
# Assessing R2 shrinkage using 10-Fold Cross-Validation
> fit <- lm(y~x1+x2+x3,data=mydata)
> library(bootstrap)
# define functions
> theta.fit <- function(x,y){lsfit(x,y)}
> theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
# matrix of predictors
> X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed values
> y <- as.matrix(mydata[c("y")])
# run the 10-fold cross-validation and compare raw vs. cross-validated R2
> results <- crossval(X, y, theta.fit, theta.predict, ngroup=10)
> cor(y, fit$fitted.values)**2   # raw R2
> cor(y, results$cv.fit)**2      # cross-validated R2; the drop is the shrinkage
Variable Selection
# Stepwise Regression
> library(MASS)
> fit <- lm(y~x1+x2+x3,data=mydata)
> step <- stepAIC(fit, direction="both")
> step$anova # display results

• Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package
• In the following code nbest indicates the number of subsets of each size to report
• Here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.)
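A minimal sketch of that call (the predictor columns x1-x3 and response y reuse the illustrative names above; method="adjr2" is one of the criteria leaps() accepts):

# All-subsets regression: ten best models per subset size
> library(leaps)
> X <- as.matrix(mydata[c("x1","x2","x3")])
> y <- mydata$y
> allsub <- leaps(x=X, y=y, nbest=10, method="adjr2")
> allsub$which[which.max(allsub$adjr2), ]   # variables in the best model by adjusted R2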
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/