Module 01.1: Linear Regression


Linear Regression

Prof. Sayak Roychowdhury


Linear Regression
• The most widely used dependence technique and statistical data model
• Used for both prediction and explanation
• Applications range from research questions on the relation between factor(s) and an outcome, to business forecasting, econometric models, marketing, etc.
• A model's capability for explanation is one of its more important utilities compared to complicated black-box models (such as neural nets)
• Multiple regression and its variants provide a framework for in-depth understanding of the process being investigated
• Extensively used in academic research (knowledge creation) and for managerial insights (impact of potential factors)
Glimpse of Linear Regression
• Minitab: Stat > Regression > Fit Regression Model > select the Response and the Factors > OK

Data Point   Body height   Flight Time
    1            1.5           1.91
    2            1.4           1.83
    3            2.7           0.86
    4            1.1           1.72
    5            0.9           1.28
    6            0.8           1.09
    7            2.9           0.79
    8            2.2           1.10
    9            3.3           0.81
   10            1.8           1.67
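A minimal R sketch of the same fit for the data above (the vector names are mine, not from the slides):

height <- c(1.5, 1.4, 2.7, 1.1, 0.9, 0.8, 2.9, 2.2, 3.3, 1.8)            # Body height
ftime  <- c(1.91, 1.83, 0.86, 1.72, 1.28, 1.09, 0.79, 1.10, 0.81, 1.67)  # Flight Time
fit <- lm(ftime ~ height)   # simple linear regression of flight time on height
summary(fit)                # coefficients, R-squared, t- and F-tests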
Glimpse of Linear Regression
lm1 <- lm(mpg ~ hp, data = mtcars)  # fit mpg on horsepower (built-in mtcars data)
summary(lm1)                        # coefficients, t-tests, R-squared, overall F-test
anova(lm1)                          # ANOVA table for the fit
Normal Probability Plot of Residuals
[Figure: normal probability plot of residuals (response: Flight Time); axes: Percent vs. Residual. The points should fall roughly along a straight line.]

Residual: $e_i = y_i - \hat{y}_i$

Data layout for multiple regression with $k$ predictors and $n$ runs:

$$\begin{array}{cccc|c}
x_1 & x_2 & \cdots & x_k & y \\ \hline
x_{11} & x_{12} & \cdots & x_{1k} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2k} & y_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & y_n
\end{array}$$
Matrix Plot and Correlation
[Figure: scatterplot matrix of the predictors and response, with pairwise correlations.]
Questions to Ask
• Are any of the predictors important in predicting the response?
  Ans: ANOVA (overall F-test)
• Which of the predictors are important?
  Ans: All-subsets or best-subsets regression
• How well does the model fit the data?
  Ans: $R^2$, adjusted $R^2$
• Given a set of predictors, how accurate is the prediction?
  Ans: $R^2_{pred}$ (PRESS), cross-validation, etc.
• What is the effect of individual observations on the model?
  Ans: Leverage, Cook's distance
Assumptions for Linear Regression
• The primary assumptions for linear regression are:
1. Linearity of the observed phenomenon
2. Constant variance of the error terms
3. Normality of the error term distribution
4. Independence of the error terms
• Adherence to the assumptions is tested through graphical methods such as residual plots and the normal probability plot of residuals (a quick R check follows)
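In R, plot() on a fitted lm object produces a residuals-vs-fitted plot (linearity, constant variance) and a normal Q-Q plot of residuals (normality), among other diagnostics; e.g. for the lm1 fit above:

par(mfrow = c(2, 2))  # arrange the four default diagnostic plots
plot(lm1)             # residuals vs. fitted, Q-Q, scale-location, leverage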
Steps to do Regression
• Step 1. Create a flat file (ready to load into software when done).
• Step 2. Start with a first-order model (usually).
• Step 3. Fit the current model form.
• Step 4. Perform model diagnostics. If defensible, stop. Otherwise, try a different form, possibly adding or removing factors, and return to Step 3.
• Step 5. (Sometimes optional) t-test the coefficients and/or make decisions.

• Comment: The process involves a degree of subjectivity and intuition about the physical system, what model form makes sense, and what helps answer the relevant questions.
Estimation of Coefficients
Model form: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$

$\hat{y}$ = estimate of the response

$\hat{\beta}_j$ = estimate of the coefficient $\beta_j$
Estimation of Coefficients (SLR)
• $SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
• Minimize the above function with respect to the $\beta$'s.
• How do we do this?
• Set $\partial SSE/\partial \beta_j = 0$ for $j = 0, 1$; solving gives
  $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
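The closed-form SLR estimates can be checked directly in R against lm(), reusing the height/flight-time vectors sketched earlier:

Sxx <- sum((height - mean(height))^2)
Sxy <- sum((height - mean(height)) * (ftime - mean(ftime)))
b1  <- Sxy / Sxx                        # slope estimate
b0  <- mean(ftime) - b1 * mean(height)  # intercept estimate
c(b0, b1)                               # should match coef(lm(ftime ~ height))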



Estimation of Coefficients
• $SSE = \sum_{i=1}^{n} e_i^2$
• Minimize the above function with respect to the $\beta$'s: set $\partial SSE/\partial \beta_j = 0$ for $j = 0, 1, \dots, k$.
• In matrix form: $SSE = (y - X\beta)'(y - X\beta)$ (minimize w.r.t. $\beta$)
• Setting the derivative to zero gives the normal equations $X'X\hat{\beta} = X'y$, hence $\hat{\beta} = (X'X)^{-1}X'y$
• Predicted response: $\hat{y} = X\hat{\beta}$; Residual: $e = y - \hat{y}$
• For a $2^k$ factorial design in coded units, $X'X$ is diagonal, so the coefficient estimates decouple (see Example #2 below).
Example #2
A $2^2$ experimental design with a center point:

  x1   x2   y
  -1   -1   4
   1   -1   3
  -1    1   1
   1    1   0
   0    0   4

Model equations: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$

Design matrix and response vector:

$$X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$$
Example #2: Estimation

$$X'X = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}, \qquad (X'X)^{-1} = \begin{bmatrix} 0.20 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{bmatrix}$$

$$\hat{\beta} = (X'X)^{-1}X'y = \begin{bmatrix} 2.4 \\ -0.5 \\ -1.5 \end{bmatrix}$$

Prediction equation: $\hat{y} = 2.4 - 0.5\,x_1 - 1.5\,x_2$
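The same estimate, computed directly from the matrix formula in R:

X <- cbind(1, c(-1, 1, -1, 1, 0), c(-1, -1, 1, 1, 0))  # columns: intercept, x1, x2
y <- c(4, 3, 1, 0, 4)                                   # response vector
solve(t(X) %*% X) %*% t(X) %*% y                        # returns 2.4, -0.5, -1.5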
Example #2: New Model, Same Array
The same $2^2$-with-center-point data, now fit with a functional form that adds a quadratic term in $x_2$:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i2}^2 + \epsilon_i$

$$X = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$$

A different design matrix for the same experimental array.
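A short R sketch for fitting the expanded form, reusing X and y from the previous example:

X2 <- cbind(X, X[, 3]^2)              # append the x2^2 column
solve(t(X2) %*% X2) %*% t(X2) %*% y   # gives (4, -0.5, -1.5, -2)
# With four coefficients this form reproduces all five observations exactly.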
Hypothesis Testing in Multiple Regression
• $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one $j$
• Test statistic: $F_0 = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$
• Compare $F_0$ with $F_{\alpha,\,k,\,n-k-1}$ to determine significance
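In R, the overall F-test is part of the lm summary; e.g. for the lm1 fit above:

summary(lm1)$fstatistic   # F0 together with its numerator and denominator df
anova(lm1)                # the full ANOVA table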
DOE vs. On-hand Data
[Table comparing designed experiments (DOE) with on-hand (observational) data; legend: √ = guaranteed, ✗ = loss of credibility, ? = unclear.]


$R^2$: How much of the variation in the response can be explained by the model?

• $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
• $SST = y'y - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = y'y - \dfrac{(\mathbf{1}'y)^2}{n}$
  $\mathbf{1}$: vector of 1's, $n$ = # of runs, $y$: response vector
• $R^2_{adj} = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}$ ($k$ = # of predictors)
• $R^2_{pred} = 1 - \dfrac{PRESS}{SST}$
• $PRESS = \sum_{i=1}^{n} \left(\dfrac{e_i}{1 - h_{ii}}\right)^2$, where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$ (leave-one-out cross-validation; an R sketch follows)
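A small R sketch for PRESS and predicted R-squared from the hat values of a fit (the helper names are mine):

press <- function(m) sum((residuals(m) / (1 - hatvalues(m)))^2)      # leave-one-out SSE
sst   <- function(m) { yv <- m$model[[1]]; sum((yv - mean(yv))^2) }  # total sum of squares
r2pred <- function(m) 1 - press(m) / sst(m)
r2pred(lm1)   # predicted R-squared for the mtcars fit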


Regression Diagnostics
• Summary Statistics: Measure goodness of fit. $R^2$ describes the fraction of the variation that is explainable from the data; a sufficiently high $R^2$ is acceptable.

• Variance Inflation Factors (VIFs): Numbers that assess the severity of multicollinearity in the model. Common rule: VIF < 10.
  $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ value for regressing variable $j$ on the other covariates (a sketch follows).

• Normal Plot of Residuals: Indicates whether hypothesis tests on estimates can be trusted (points should roughly adhere to a straight line, with no apparent pattern or curvature).
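VIFs can be computed straight from the defining formula; the sketch below assumes a data frame `dat` whose columns are the predictors (the car package's vif() computes the same thing from a fitted model):

vif_j <- function(dat, j) {
  r2j <- summary(lm(dat[[j]] ~ ., data = dat[-j]))$r.squared  # Rj^2 on the other covariates
  1 / (1 - r2j)
}
sapply(seq_along(dat), vif_j, dat = dat)  # one VIF per predictor column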
Hypothesis Testing for Individual Coefficients (Wald Test)
• $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$
• Test statistic: $t_0 = \dfrac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}}$
• $C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to $\hat{\beta}_j$
• If $|t_0| > t_{\alpha/2,\,n-k-1}$, reject the null hypothesis
• CI (95%): $\hat{\beta}_j \pm t_{0.025,\,n-k-1}\sqrt{\hat{\sigma}^2 C_{jj}}$
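In R, the Wald t-statistics are in the coefficient table, and confint() gives the intervals:

summary(lm1)$coefficients  # estimate, std. error, t value, p-value per coefficient
confint(lm1, level = 0.95) # 95% CIs on the coefficients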

Confidence Interval on the Mean Response
• Suppose we want the mean response at a point $x_0 = (1, x_{01}, x_{02}, \dots, x_{0k})'$
• The mean response at the point is $\mu_{y|x_0} = x_0'\beta$
• The estimated mean response is $\hat{y}(x_0) = x_0'\hat{\beta}$
• The variance of $\hat{y}(x_0)$ is $\sigma^2 x_0'(X'X)^{-1}x_0$
• The confidence interval on the mean response at $x_0$ is
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}$
Prediction Interval for a New Response
• Suppose we want to predict the actual response at a point $x_0$
• The point estimate of a future observation is the same as the estimated mean response: $\hat{y}(x_0) = x_0'\hat{\beta}$
• A new observation also carries the error variance $\sigma^2$, so the prediction interval is wider than the confidence interval:
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\sqrt{\hat{\sigma}^2\left(1 + x_0'(X'X)^{-1}x_0\right)}$
(an R example follows)
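Both intervals are available from predict() on a fitted lm; here for lm1 at a hypothetical new horsepower value:

newdat <- data.frame(hp = 150)
predict(lm1, newdata = newdat, interval = "confidence")  # CI on the mean response
predict(lm1, newdata = newdat, interval = "prediction")  # PI on a new response (wider)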
Multicollinearity
• Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.
• Structural Multicollinearity: a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor $x^2$ from the predictor $x$.
• Dataset Multicollinearity: a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
(source: https://online.stat.psu.edu/stat462/node/177/)
Multicollinearity
• When predictor variables are correlated:
• The estimated regression coefficient of any one variable
depends on which other predictor variables are included in the
model.
• The precision of the estimated regression coefficients decreases
as more predictor variables are added to the model.
• The marginal contribution of any one predictor variable in
reducing the error sum of squares varies depending on which
other variables are already in the model.
• Hypothesis tests for $\beta_k = 0$ may yield different conclusions depending on which predictor variables are in the model. (This effect is a direct consequence of the three previous effects.)
Regression Flow Chart
[Flow chart for the regression procedure; key step: check VIFs, $R^2$, p-values, etc.]
Model Selection: Example 1
[Figure omitted]
Model Selection: Example 2
[Figure omitted]
Forward Selection
• Step 1: Fit a null (intercept-only) model m1
• Step 2: Add the candidate variables one at a time (each giving a simple linear regression model)
• Step 3: Pick the variable whose model has the lowest RSS and add it to m1
• Step 4: With the remaining variables, add to m1 one at a time and pick the model that gives the best RSS
• Step 5: Continue until some stopping criterion is satisfied (a sketch follows this list)
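A forward-selection sketch in R using step() with AIC as the stopping criterion (mtcars is assumed here as an example dataset):

null <- lm(mpg ~ 1, data = mtcars)   # Step 1: intercept-only model
full <- lm(mpg ~ ., data = mtcars)   # scope: all candidate predictors
fwd  <- step(null, scope = formula(full), direction = "forward")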
Backward Selection
• Step 1: Start with all the variables in the model
• Step 2: Remove the variable that is least significant (largest p-value)
• Step 3: Refit with the remaining variables
• Step 4: Continue dropping variables until some stopping criterion is met (e.g., a threshold on the p-value); a sketch follows
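Backward elimination with the same tool; note that step() drops terms by AIC rather than by p-value, so a strict p-value threshold would need a manual drop-and-refit loop:

bwd <- step(full, direction = "backward")  # start from the full model, drop terms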
Other methods
• Mallows' Cp
• AIC
• BIC
• Cross-validation
• Adjusted $R^2$
(R one-liners for AIC/BIC below)
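AIC and BIC are one-liners on a fitted model, which makes comparing candidate models straightforward:

AIC(lm1)  # Akaike information criterion
BIC(lm1)  # Bayesian information criterion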
