Module 01.1: Linear Regression


Linear Regression

Prof. Sayak Roychowdhury


Linear Regression
• The most widely used dependence technique and statistical data model
• Used for both prediction and explanation
• Applications range from research questions on the relation between factor(s) and an outcome, to business forecasting, econometric models, marketing, etc.
• A model's capability for explanation is one of its more important utilities compared to complicated black-box models (such as neural nets)
• Multiple regression and its variants provide a framework for in-depth understanding of the process being investigated
• Extensively used in academic research (knowledge creation) and for managerial insights (impact of potential factors)
Glimpse of Linear Regression
• Minitab: Stat > Regression > Fit Regression Model > select the Response and the Factors > OK

Data Point   Body height   Flight Time
    1            1.5           1.91
    2            1.4           1.83
    3            2.7           0.86
    4            1.1           1.72
    5            0.9           1.28
    6            0.8           1.09
    7            2.9           0.79
    8            2.2           1.10
    9            3.3           0.81
   10            1.8           1.67
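A minimal R sketch of the same fit for the data above (the vector names are mine, not from the slides):

height <- c(1.5, 1.4, 2.7, 1.1, 0.9, 0.8, 2.9, 2.2, 3.3, 1.8)            # Body height
ftime  <- c(1.91, 1.83, 0.86, 1.72, 1.28, 1.09, 0.79, 1.10, 0.81, 1.67)  # Flight Time
fit <- lm(ftime ~ height)   # simple linear regression of flight time on height
summary(fit)                # coefficients, R-squared, t- and F-tests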
Glimpse of Linear Regression
lm1 <- lm(mpg ~ hp, data = mtcars)  # fit mpg on horsepower (built-in mtcars data)
summary(lm1)                        # coefficients, t-tests, R-squared, overall F-test
anova(lm1)                          # ANOVA table for the fit
Normal Probability Plot of Residuals
[Figure: normal probability plot of residuals (response: Flight Time); axes: Percent vs. Residual. The points should fall roughly along a straight line.]

Residual: $e_i = y_i - \hat{y}_i$

Data layout for multiple regression with $k$ predictors and $n$ runs:

$$\begin{array}{cccc|c}
x_1 & x_2 & \cdots & x_k & y \\ \hline
x_{11} & x_{12} & \cdots & x_{1k} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2k} & y_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & y_n
\end{array}$$
Matrix Plot and Correlation
[Figure: scatterplot matrix of the predictors and response, with pairwise correlations.]
Questions to Ask
• Are any of the predictors important in predicting the response?
  Ans: ANOVA (overall F-test)
• Which of the predictors are important?
  Ans: All-subsets or best-subsets regression
• How well does the model fit the data?
  Ans: $R^2$, adjusted $R^2$
• Given a set of predictors, how accurate is the prediction?
  Ans: $R^2_{pred}$ (PRESS), cross-validation, etc.
• What is the effect of individual observations on the model?
  Ans: Leverage, Cook's distance
Assumptions for Linear Regression
• The primary assumptions for linear regression are:
1. Linearity of the observed phenomenon
2. Constant variance of the error terms
3. Normality of the error term distribution
4. Independence of the error terms
• Adherence to the assumptions is tested through graphical methods such as residual plots and the normal probability plot of residuals (a quick R check follows)
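In R, plot() on a fitted lm object produces a residuals-vs-fitted plot (linearity, constant variance) and a normal Q-Q plot of residuals (normality), among other diagnostics; e.g. for the lm1 fit above:

par(mfrow = c(2, 2))  # arrange the four default diagnostic plots
plot(lm1)             # residuals vs. fitted, Q-Q, scale-location, leverage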
Steps to do Regression
• Step 1. Create a flat file (ready to load into software when done).
• Step 2. Start with a first-order model (usually).
• Step 3. Fit the current model form.
• Step 4. Perform model diagnostics. If defensible, stop. Otherwise, try a different form, possibly adding or removing factors, and return to Step 3.
• Step 5. (Sometimes optional) t-test the coefficients and/or make decisions.

• Comment: The process involves a degree of subjectivity and intuition about the physical system, what model form makes sense, and what helps answer the relevant questions.
Estimation of Coefficients
Model form: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$

$\hat{y}$ = estimate of the response

$\hat{\beta}_j$ = estimate of the coefficient $\beta_j$
Estimation of Coefficients (SLR)
• $SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
• Minimize the above function with respect to the $\beta$'s.
• How do we do this?
• Set $\partial SSE/\partial \beta_j = 0$ for $j = 0, 1$; solving gives
  $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
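The closed-form SLR estimates can be checked directly in R against lm(), reusing the height/flight-time vectors sketched earlier:

Sxx <- sum((height - mean(height))^2)
Sxy <- sum((height - mean(height)) * (ftime - mean(ftime)))
b1  <- Sxy / Sxx                        # slope estimate
b0  <- mean(ftime) - b1 * mean(height)  # intercept estimate
c(b0, b1)                               # should match coef(lm(ftime ~ height))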



Estimation of Coefficients
• $SSE = \sum_{i=1}^{n} e_i^2$
• Minimize the above function with respect to the $\beta$'s: set $\partial SSE/\partial \beta_j = 0$ for $j = 0, 1, \dots, k$.
• In matrix form: $SSE = (y - X\beta)'(y - X\beta)$ (minimize w.r.t. $\beta$)
• Setting the derivative to zero gives the normal equations $X'X\hat{\beta} = X'y$, hence $\hat{\beta} = (X'X)^{-1}X'y$
• Predicted response: $\hat{y} = X\hat{\beta}$; Residual: $e = y - \hat{y}$
• For a $2^k$ factorial design in coded units, $X'X$ is diagonal, so the coefficient estimates decouple (see Example #2 below).
Example #2
A $2^2$ experimental design with a center point:

  x1   x2   y
  -1   -1   4
   1   -1   3
  -1    1   1
   1    1   0
   0    0   4

Model equations: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$

Design matrix and response vector:

$$X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$$
Example #2: Estimation

$$X'X = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}, \qquad (X'X)^{-1} = \begin{bmatrix} 0.20 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{bmatrix}$$

$$\hat{\beta} = (X'X)^{-1}X'y = \begin{bmatrix} 2.4 \\ -0.5 \\ -1.5 \end{bmatrix}$$

Prediction equation: $\hat{y} = 2.4 - 0.5\,x_1 - 1.5\,x_2$
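The same estimate, computed directly from the matrix formula in R:

X <- cbind(1, c(-1, 1, -1, 1, 0), c(-1, -1, 1, 1, 0))  # columns: intercept, x1, x2
y <- c(4, 3, 1, 0, 4)                                   # response vector
solve(t(X) %*% X) %*% t(X) %*% y                        # returns 2.4, -0.5, -1.5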
Example #2: New Model, Same Array
The same $2^2$-with-center-point data, now fit with a functional form that adds a quadratic term in $x_2$:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i2}^2 + \epsilon_i$

$$X = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$$

A different design matrix for the same experimental array.
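A short R sketch for fitting the expanded form, reusing X and y from the previous example:

X2 <- cbind(X, X[, 3]^2)              # append the x2^2 column
solve(t(X2) %*% X2) %*% t(X2) %*% y   # gives (4, -0.5, -1.5, -2)
# With four coefficients this form reproduces all five observations exactly.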
Hypothesis Testing in Multiple Regression
• $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one $j$
• Test statistic: $F_0 = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$
• Compare $F_0$ with $F_{\alpha,\,k,\,n-k-1}$ to determine significance
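In R, the overall F-test is part of the lm summary; e.g. for the lm1 fit above:

summary(lm1)$fstatistic   # F0 together with its numerator and denominator df
anova(lm1)                # the full ANOVA table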
DOE vs. On-hand Data
[Table comparing designed experiments (DOE) with on-hand (observational) data; legend: √ = guaranteed, ✗ = loss of credibility, ? = unclear.]


$R^2$: How much of the variation in the response can be explained by the model?

• $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
• $SST = y'y - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = y'y - \dfrac{(\mathbf{1}'y)^2}{n}$
  $\mathbf{1}$: vector of 1's, $n$ = # of runs, $y$: response vector
• $R^2_{adj} = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}$ ($k$ = # of predictors)
• $R^2_{pred} = 1 - \dfrac{PRESS}{SST}$
• $PRESS = \sum_{i=1}^{n} \left(\dfrac{e_i}{1 - h_{ii}}\right)^2$, where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$ (leave-one-out cross-validation; an R sketch follows)
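A small R sketch for PRESS and predicted R-squared from the hat values of a fit (the helper names are mine):

press <- function(m) sum((residuals(m) / (1 - hatvalues(m)))^2)      # leave-one-out SSE
sst   <- function(m) { yv <- m$model[[1]]; sum((yv - mean(yv))^2) }  # total sum of squares
r2pred <- function(m) 1 - press(m) / sst(m)
r2pred(lm1)   # predicted R-squared for the mtcars fit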


Regression Diagnostics
• Summary Statistics: Measure goodness of fit. $R^2$ describes the fraction of the variation that is explainable from the data; a sufficiently high $R^2$ is acceptable.

• Variance Inflation Factors (VIFs): Numbers that assess the severity of multicollinearity in the model. Common rule: VIF < 10.
  $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ value for regressing variable $j$ on the other covariates (a sketch follows).

• Normal Plot of Residuals: Indicates whether hypothesis tests on estimates can be trusted (points should roughly adhere to a straight line, with no apparent pattern or curvature).
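VIFs can be computed straight from the defining formula; the sketch below assumes a data frame `dat` whose columns are the predictors (the car package's vif() computes the same thing from a fitted model):

vif_j <- function(dat, j) {
  r2j <- summary(lm(dat[[j]] ~ ., data = dat[-j]))$r.squared  # Rj^2 on the other covariates
  1 / (1 - r2j)
}
sapply(seq_along(dat), vif_j, dat = dat)  # one VIF per predictor column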
Hypothesis Testing for Individual Coefficients (Wald Test)
• $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$
• Test statistic: $t_0 = \dfrac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}}$
• $C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to $\hat{\beta}_j$
• If $|t_0| > t_{\alpha/2,\,n-k-1}$, reject the null hypothesis
• CI (95%): $\hat{\beta}_j \pm t_{0.025,\,n-k-1}\sqrt{\hat{\sigma}^2 C_{jj}}$
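In R, the Wald t-statistics are in the coefficient table, and confint() gives the intervals:

summary(lm1)$coefficients  # estimate, std. error, t value, p-value per coefficient
confint(lm1, level = 0.95) # 95% CIs on the coefficients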

Confidence Interval on the Mean Response
• Suppose we want the mean response at a point $x_0 = (1, x_{01}, x_{02}, \dots, x_{0k})'$
• The mean response at the point is $\mu_{y|x_0} = x_0'\beta$
• The estimated mean response is $\hat{y}(x_0) = x_0'\hat{\beta}$
• The variance of $\hat{y}(x_0)$ is $\sigma^2 x_0'(X'X)^{-1}x_0$
• The confidence interval on the mean response at $x_0$ is
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}$
Prediction Interval for a New Response
• Suppose we want to predict the actual response at a point $x_0$
• The point estimate of a future observation is the same as the estimated mean response: $\hat{y}(x_0) = x_0'\hat{\beta}$
• A new observation also carries the error variance $\sigma^2$, so the prediction interval is wider than the confidence interval:
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\sqrt{\hat{\sigma}^2\left(1 + x_0'(X'X)^{-1}x_0\right)}$
(an R example follows)
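Both intervals are available from predict() on a fitted lm; here for lm1 at a hypothetical new horsepower value:

newdat <- data.frame(hp = 150)
predict(lm1, newdata = newdat, interval = "confidence")  # CI on the mean response
predict(lm1, newdata = newdat, interval = "prediction")  # PI on a new response (wider)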
Multicollinearity
• Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.
• Structural Multicollinearity: a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor $x^2$ from the predictor $x$.
• Dataset Multicollinearity: a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
(source: https://online.stat.psu.edu/stat462/node/177/)
Multicollinearity
• When predictor variables are correlated:
• The estimated regression coefficient of any one variable
depends on which other predictor variables are included in the
model.
• The precision of the estimated regression coefficients decreases
as more predictor variables are added to the model.
• The marginal contribution of any one predictor variable in
reducing the error sum of squares varies depending on which
other variables are already in the model.
• Hypothesis tests for $\beta_k = 0$ may yield different conclusions depending on which predictor variables are in the model. (This effect is a direct consequence of the three previous effects.)
Regression Flow Chart
[Flow chart for the regression procedure; key step: check VIFs, $R^2$, p-values, etc.]
Model Selection: Example 1
[Figure omitted]
Model Selection: Example 2
[Figure omitted]
Forward Selection
• Step 1: Fit a null (intercept-only) model m1
• Step 2: Add the candidate variables one at a time (each giving a simple linear regression model)
• Step 3: Pick the variable whose model has the lowest RSS and add it to m1
• Step 4: With the remaining variables, add to m1 one at a time and pick the model that gives the best RSS
• Step 5: Continue until some stopping criterion is satisfied (a sketch follows this list)
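A forward-selection sketch in R using step() with AIC as the stopping criterion (mtcars is assumed here as an example dataset):

null <- lm(mpg ~ 1, data = mtcars)   # Step 1: intercept-only model
full <- lm(mpg ~ ., data = mtcars)   # scope: all candidate predictors
fwd  <- step(null, scope = formula(full), direction = "forward")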
Backward Selection
• Step 1: Start with all the variables in the model
• Step 2: Remove the variable that is least significant (largest p-value)
• Step 3: Refit with the remaining variables
• Step 4: Continue dropping variables until some stopping criterion is met (e.g., a threshold on the p-value); a sketch follows
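Backward elimination with the same tool; note that step() drops terms by AIC rather than by p-value, so a strict p-value threshold would need a manual drop-and-refit loop:

bwd <- step(full, direction = "backward")  # start from the full model, drop terms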
Other methods
• Mallows' Cp
• AIC
• BIC
• Cross-validation
• Adjusted $R^2$
(R one-liners for AIC/BIC below)
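AIC and BIC are one-liners on a fitted model, which makes comparing candidate models straightforward:

AIC(lm1)  # Akaike information criterion
BIC(lm1)  # Bayesian information criterion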
