
4. Residual analysis for simple linear regression

4.1. Residuals
(a) Observed error
  $\varepsilon_i = y_i - E(y_i)$
• Assumptions for the regression model
o ε_i are independent normal random variables
o Mean 0
o Constant variance σ²
(b) Residuals / errors of fit
• The residual is defined as
  $e_i = y_i - \hat{y}_i$
o Part of Y not explained by the model
• Sample mean of the e_i
  $$\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i
            = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)
            = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\bar{y} - b_1 \bar{x}) - b_1 x_i\right)
            = \frac{1}{n}\left[\sum_{i=1}^{n}(y_i - \bar{y}) + b_1 \sum_{i=1}^{n}(\bar{x} - x_i)\right]
            = 0$$
(c) Standardized residual
• The e_i are not independent
o $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$
• Sample variance of the n residuals
  $$\frac{\sum_i (e_i - \bar{e})^2}{n-2} = \frac{\sum_i e_i^2}{n-2} = \frac{SSE}{n-2} = MSE$$
o If the model is appropriate, MSE is an unbiased estimator of σ²
• The standardized residual is defined as
  $$\frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$$
o Used at times in residual analysis
o Identifying outlying observations
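A small numerical sketch of the definitions above, using hypothetical data (the x and y values below are made up): it fits the least squares line, checks the identities ∑ e_i = 0 and ∑ x_i e_i = 0, and forms MSE and the standardized residuals e_i / √MSE.

```python
import numpy as np

# Hypothetical data, not taken from the notes
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2])
n = len(x)

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Residuals e_i = y_i - yhat_i
y_hat = b0 + b1 * x
e = y - y_hat

# Both sums are zero up to rounding error
print(e.sum(), np.sum(x * e))

# MSE = SSE / (n - 2), unbiased for sigma^2 when the model is appropriate
MSE = np.sum(e ** 2) / (n - 2)

# Standardized residuals e_i / sqrt(MSE)
print(e / np.sqrt(MSE))
```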
(d) Studentized residuals
• Recall
  $$Var(y_i) = Var(\beta_0 + \beta_1 x_i + \varepsilon_i) = Var(\varepsilon_i) = \sigma^2$$
o For the mean response at x₀
  $$Var(\hat{y}_0) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)$$
• We have (Exercise)
  $$Var(e_i) = Var(y_i - \hat{y}_i) = Var(y_i) + Var(\hat{y}_i) - 2\,Cov(y_i, \hat{y}_i)
             = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right)\right]$$

• The studentized residual is defined as
  $$\frac{e_i}{\widehat{SD}(e_i)} = \frac{e_i}{s\sqrt{1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{XX}}\right)}}$$
  where s = √MSE
o Identifying outlying observations
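Continuing the same hypothetical sketch, the studentized residuals divide each e_i by its own estimated standard deviation, substituting s² = MSE for σ² in the variance formula above.

```python
import numpy as np

# Same hypothetical data as in the previous sketch
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
MSE = np.sum(e ** 2) / (n - 2)

# h_i = 1/n + (x_i - xbar)^2 / Sxx, so Var(e_i) = sigma^2 * (1 - h_i)
h = 1.0 / n + (x - x.mean()) ** 2 / Sxx

# Studentized residuals: e_i / (s * sqrt(1 - h_i)), with s = sqrt(MSE)
print(e / np.sqrt(MSE * (1.0 - h)))
```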

4.2. Residual analysis


• Check for departures from the linear regression model with normal errors
o The regression function is not linear
o The error terms do not have constant variance
o The error terms are not independent
o The model fits all but one or a few outliers
o The error terms are not normally distributed
o One or several important independent variables have been omitted from the model

• Residual plots
o Residuals vs independent variable
o Residuals vs fitted values
o Residuals vs time
o Residuals vs omitted independent variable
o Box plot
o Normal probability plot
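A minimal plotting sketch for the diagnostics listed above, again on hypothetical data; the time order is taken to be the row order, and the residuals-vs-omitted-variable plot is left out because this toy fit has no extra variable.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data and least squares fit
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 13.0, 15.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2, 12.8, 14.1, 16.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].scatter(x, e)
axes[0, 0].axhline(0, color="grey")
axes[0, 0].set_title("Residuals vs X")

axes[0, 1].scatter(y_hat, e)
axes[0, 1].axhline(0, color="grey")
axes[0, 1].set_title("Residuals vs fitted values")

axes[0, 2].plot(np.arange(1, len(e) + 1), e, marker="o")
axes[0, 2].axhline(0, color="grey")
axes[0, 2].set_title("Residuals vs time order")

axes[1, 0].boxplot(e)
axes[1, 0].set_title("Box plot of residuals")

stats.probplot(e, dist="norm", plot=axes[1, 1])  # normal probability (Q-Q) plot
axes[1, 1].set_title("Normal Q-Q plot")

axes[1, 2].axis("off")  # spare panel (residuals vs an omitted variable would go here)

plt.tight_layout()
plt.show()
```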

Example

Westwood company data


[Figure: residual plots for the Westwood company data: unstandardized residuals against lot size (X), against the unstandardized predicted value, against production run (time plot), a box plot of the residuals (n = 9), and a normal Q-Q plot of the residuals (observed value against expected normal).]

(a) Nonlinearity
• Whether a linear regression function is appropriate for the data being analyzed can be studied from
o Residual vs independent variable
o Residual vs fitted values
o Scatter plot
• Linear model is appropriate
o Residuals fall within a horizontal band centered around 0
• Departure from the linear regression model
o Residuals indicate a trend suggesting a curvilinear regression function

[Figure: sketch residual plots against X: residuals within a horizontal band around 0 (linear model appropriate) versus residuals following a curved pattern (curvilinear function needed).]
Example

Transit data
• A study of the relation between the amount of transit information and bus ridership in eight comparable test cities
• 8 observations are collected
• Number of bus transit maps distributed free to residents of the city at the beginning of the test
(X)
• Increase during the test period in average daily bus ridership during nonpeak hours (Y)

(a) Simple linear regression model


• Y = β0 + β1 X + ε
• b0 = -1.82
o SE = 1.052
o | t | = |-1.727| < 2.447 = t0.025,6
o β0 is insignificant at 5% level of significance

• b1 = 0.0435
o SE = 0.007
o | t | = 6.484 > 2.447 = t0.025,6
o β1 is significant at 5% level of significance
• R2 = 0.875
o The model fits the data well
• ANOVA table

  Source        Sum of Squares   df   Mean Square      F
  Regression        31.7637       1      31.7637    42.0388
  Residual           4.5335       6       0.7556
  Total             36.2972       7

o F-value = 42.04 > 8.81 = F0.05,1,6
  ▪ p-value = 0.0006
o Significant linear trend at 5% level of significance (a code sketch of this type of fit follows at the end of part (a))

• Scatter plot (observed and fitted Y against X)


o The linear model captures the increasing trend, but there appears to be more than a simple linear relationship between X and Y
[Figure: scatter plot of increase in ridership (Y) against maps distributed (X) with the fitted line.]

• Standardized residual against X and ŷ

o Lack of fit of the linear regression function


o Residuals depart from 0 in a systematic fashion
  ▪ Negative for small ŷ (or X)
  ▪ Positive for medium ŷ (or X)
  ▪ Negative for large ŷ (or X)
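The estimates, t statistics, R² and ANOVA F of part (a) can be obtained directly from statsmodels. The eight (X, Y) pairs are not listed in the notes, so the sketch below uses hypothetical stand-in data with a similar curved shape; its output will not reproduce the values quoted above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the transit data (8 test cities)
maps_distributed = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
ridership_gain = np.array([0.6, 2.4, 4.0, 5.1, 5.8, 6.2, 6.3, 6.1])

X = sm.add_constant(maps_distributed)        # intercept column plus X
fit = sm.OLS(ridership_gain, X).fit()

print(fit.params)                 # b0, b1
print(fit.bse)                    # standard errors
print(fit.tvalues)                # t statistics, compared with t(0.025, n-2)
print(fit.rsquared)               # R^2
print(fit.fvalue, fit.f_pvalue)   # ANOVA F statistic and its p-value
print(fit.resid)                  # residuals, for residual plots
```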

(b) Quadratic trend model
• Y = β0 + β1 X + β2 X² + ε
• The model captures the relationship nicely (see the sketch at the end of this part)
[Figure: scatter plot of increase in ridership against maps distributed with the fitted quadratic curve.]

• Standardized residual against predicted values of Y


o No particular pattern is observed
[Figure: standardized residuals against the unstandardized predicted value for the quadratic model; no particular pattern.]
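The quadratic fit of part (b) amounts to adding an X² column to the design matrix. The data below are the same hypothetical stand-ins used in the previous sketch, not the actual transit data.

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical stand-in data as before
maps_distributed = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
ridership_gain = np.array([0.6, 2.4, 4.0, 5.1, 5.8, 6.2, 6.3, 6.1])

# Design matrix with intercept, X and X^2 columns
X_quad = sm.add_constant(np.column_stack([maps_distributed, maps_distributed ** 2]))
fit_quad = sm.OLS(ridership_gain, X_quad).fit()

print(fit_quad.params)      # b0, b1, b2
print(fit_quad.rsquared)

# Residuals against fitted values: with the quadratic term, the systematic
# negative / positive / negative pattern of the linear fit should disappear
for f, r in sorted(zip(fit_quad.fittedvalues, fit_quad.resid)):
    print(round(f, 2), round(r, 3))
```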

(b) Nonconstancy of error variance


• The linear model assumes that the error term (ε) has constant variance (σ²)
o Residuals vs independent variable
o Residuals vs fitted values

[Figure: sketch residual plots against X: constant error variance (horizontal band) versus error variance increasing with X (funnel shape).]
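To show how the funnel shape arises, the sketch below simulates a linear relationship whose error standard deviation grows with X and plots the residuals from an ordinary least squares fit; the parameters are made up and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Error standard deviation proportional to X
x = np.linspace(1, 100, 80)
y = 2.0 + 0.5 * x + rng.normal(scale=0.05 * x)

# Ordinary least squares fit and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# The residuals fan out as X increases (nonconstant error variance)
plt.scatter(x, e)
plt.axhline(0, color="grey")
plt.xlabel("X")
plt.ylabel("Residual")
plt.title("Residuals fan out when error variance increases with X")
plt.show()
```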

Example

Hormone data
• The data are from the results of two assay experiments for a certain hormone.
• In the experiment, the old (or reference, X) method is compared to a new (or test, Y) method.
• There are 85 measurements for each of the two methods.

• Y = β0 + β1 X + ε
• b0 = 0.08486
o SE = 0.51175
o | t | = 0.166 < 1.989 = t0.025,83
o β0 is insignificant at 5% level of significance
• b1 = 0.95201
o SE = 0.03177
o | t | = 29.970 > 1.989 = t0.025,83
o β1 is significant at 5% level of significance
• R2=0.9154
o The linear trend model fits the data very well
• Scatter plot of observed and predicted values
o The linear model captures the increasing linear trend very well

• Standardized residual against predicted values


o The variance of the residuals is not constant
o The larger the fitted value (and hence the regressor value), the more spread out the residuals are
o The relation between test method and the reference method is positive
o The error variance is larger for larger hormone values than for smaller ones

(c) Outliers
• Outliers are extreme observations
• (Standardized) residual vs independent variable or fitted value
o Outliers are points lying far beyond the scatter of the remaining residuals
• Outliers can create great difficulty
o Model fitting can be distorted by the outlying cases
• Possible reasons
o Resulted from a mistake or other extraneous effect
  ▪ Discard
  ▪ Under the least squares method, the fitted line may be pulled disproportionately toward an outlying observation
o Convey significant information
  ▪ e.g. an interaction with another independent variable omitted from the model

Example

Consider 21 cases of X and Y values


[Figure: scatter plot of y against x for the 21 cases, labelled by case number.]

(a) Full data set


• Y = β0 + β1 X + ε
• b0 = 1.254
o SE = 0.395
o | t | = 3.17 > 2.093 = t0.025,19
o β0 is significant at 5% level of significance
• b1 = 0.0629
o SE = 0.0315
o | t | = 2.00 < 2.093 = t0.025,19
o β1 is insignificant at 5% level of significance
• R2 = 0.1737
o The linear trend model fits the data poorly
• Studentized residual against predicted value
o Case 21 is an outlier
[Figure: studentized residuals against the predicted value of y for the full data set; case 21 lies near −3, well below the scatter of the remaining cases.]

(b) Reduced data set
• Case 21 removed
• 20 cases remained
• No more outlying case

[Figure: scatter plot of y against x for the 20 remaining cases, labelled by case number.]

• b0 = 0.967
o SE = 0.291
o | t | = 3.33 > 2.101 = t0.025,18
o β0 is significant at 5% level of significance
• b1 = 0.102
o SE = 0.024
o | t | = 4.20 > 2.101 = t0.025,18
o β1 is significant at 5% level of significance
• R2=0.4952
o The linear trend model fits the data fairly well
• Inclusion or exclusion of the outlier affects both the significance of the linear model and the model fit (illustrated in the sketch at the end of this part)
o Does an outlier always have such an effect?
• Studentized residual against predicted value
o No more outliers

[Figure: studentized residuals against the predicted value of y for the reduced data set; no case stands apart from the rest.]
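The 21 (x, y) values are not listed in the notes, so the sketch below uses a hypothetical data set containing one outlying case to illustrate the comparison above: the slope, its t statistic and R² all change noticeably once the outlying case is dropped.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# 20 well-behaved hypothetical cases plus one outlying case (the last one)
x = np.concatenate([rng.uniform(1, 18, size=20), [30.0]])
y = np.concatenate([1.0 + 0.1 * x[:20] + rng.normal(scale=0.3, size=20), [0.2]])

def fit_summary(xv, yv):
    res = sm.OLS(yv, sm.add_constant(xv)).fit()
    return res.params, res.tvalues, res.rsquared

print("full data set:   ", fit_summary(x, y))
print("outlier removed: ", fit_summary(x[:-1], y[:-1]))
```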

(d) Nonindependence
• The linear model assumes all error terms (and hence, all observations) are independent
• Whenever data are obtained in a time sequence, it is a good idea to prepare a time plot of the residuals
o Check for any correlation between the error terms over time
• Independent error terms
o Residuals fluctuate in a random pattern around the base line 0

• Lack of randomness
o Too much alternation or too little alternation
• Correlation between error terms
o Some effect connected with time (but not included in the regression model) was present
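As a simple numerical companion to the time plot, the lag-1 sample autocorrelation of the residuals (taken in time order) can be computed; the residual values below are hypothetical.

```python
import numpy as np

# Residuals in time (observation) order; hypothetical values
e = np.array([0.8, 0.6, 0.7, 0.3, 0.1, -0.2, -0.4, -0.5, -0.3, 0.0,
              0.2, 0.5, 0.6, 0.4, 0.1, -0.1, -0.3, -0.6, -0.4, -0.1])

# Lag-1 sample autocorrelation (residuals already average to 0):
# near 0 suggests independence, strongly positive means too little alternation,
# strongly negative means too much alternation
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)
print(round(r1, 3))
```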


Example

Ice-cream data
• The data give the icecream consumption over 30 four-week periods from 18 March 1951 to 11 July 1953. There are 30 observations on 3 variables.
• Period
o The week of the study
• Consumption (Y)
o The icecream consumption (in pints per capita)
• Temp (X)
o The mean temperature (in degrees F)

• b0 = 0.2069
o SE= 0.0247
o | t | = 8.375 > 2.048 = t0.025,28
o β0 is significant at 5% level of significance
• b1 = 0.0031
o SE = 0.0005
o | t | = 6.502 > 2.048 = t0.025,28
o β1 is significant at 5% level of significance
• R2 = 0.602
o The model fits quite well
• Scatter plot of observed and predicted values

• Standardized residual against time (period)
o The error terms are correlated over time
o Too little alternation
o The points follow a systematic curve

(e) Nonnormality
• Significance tests (F or t-tests) are based on normal assumption (of the error term)
o Small departures from normality create no serious problems
o Major departures are of concern
• Distribution plots
o Boxplot, histogram, stem-and-leaf plot
o Check whether gross departures from normality are shown by such a plot
• Comparison of frequencies
o Actual frequencies of the residuals vs expected frequencies under normality
o Using standard normal distribution
  ▪ ~68% of standardized residuals between -1 and +1
  ▪ ~90% between -1.645 and +1.645
o Using t distribution when sample size is small
• Normal probability plot / Normal Q-Q plot
o (Standardized) Residual vs expected quantile under normality
o The expected quantile of the i-th smallest standardized residual is
  $$z\left(\frac{i - 0.375}{n + 0.25}\right)$$
  ▪ z(q) = the (q × 100)th percentile of the standard normal distribution
o A nearly linear plot suggests agreement with normality
  ▪ Substantial departure from linearity suggests a non-normal distribution
o Typical Q-Q residual plot
  ▪ Observed residual against expected quantile under normality

• Significance tests for normality
o Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Cramér-von Mises test
o Do not reject normality when the p-value > significance level
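A sketch of these checks, assuming a vector of standardized residuals is available (standard normal draws are used here as a stand-in): the frequency comparison, the expected Q-Q quantiles z((i − 0.375)/(n + 0.25)), and the formal tests via scipy.stats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
std_resid = rng.normal(size=54)   # stand-in for the standardized residuals

# Comparison of frequencies with the standard normal benchmarks
print(np.mean(np.abs(std_resid) <= 1.0))     # about 0.683 expected
print(np.mean(np.abs(std_resid) <= 1.645))   # about 0.90 expected

# Normal Q-Q plot coordinates: i-th smallest residual vs z((i - 0.375)/(n + 0.25))
n = len(std_resid)
ordered = np.sort(std_resid)
expected_z = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
# a near-linear relation between `ordered` and `expected_z` supports normality

# Formal tests (reject normality when the p-value is below the significance level)
print(stats.shapiro(std_resid))                 # Shapiro-Wilk
print(stats.kstest(std_resid, "norm"))          # Kolmogorov-Smirnov vs N(0, 1)
print(stats.anderson(std_resid, dist="norm"))   # Anderson-Darling (critical values)
print(stats.cramervonmises(std_resid, "norm"))  # Cramér-von Mises
```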

Example

Market data
• The data give the Market rate (Y) and the account rate (X)
• 54 cases

• b0 = 0.848
o SE = 1.976
o | t | = 0.429 < 2.007 = t0.025,52
o β0 is insignificant at 5% level of significance
• b1 = 0.610
o SE = 0.143
o | t | = 4.263 > 2.007 = t0.025,52
o β1 is significant at 5% level of significance
• R2=0.259
o The model does not fit well
• For the standardized residuals
o Skewness = 1.034
  ▪ Positively skewed
o Kurtosis = 0.376
  ▪ A little leptokurtic
  ▪ Slightly heavier tails than normal
o 72.2% between −1 and +1
  ▪ 68.3% expected based on the standard normal
  ▪ 67.8% expected based on t52
o 92.6% between −2 and +2
  ▪ 95.5% expected based on the standard normal
  ▪ 94.9% expected based on t52

• Histogram
o Positively skewed
o Heavier tails than normal
[Figure: histogram of the residuals (percent against residual value), showing positive skewness.]

• Normal Q-Q plot


o Order the residuals
o Calculate the quantiles from standard normal

   x        y       ŷ       Res    STD Res   Order (i)   Expected z
  6.42    -1.63    4.77    -6.40    -1.27        1          -2.27
  6.78    -1.34    4.99    -6.33    -1.26        2          -1.88
 18.45     5.86   12.11    -6.25    -1.24        3          -1.66
 15.74     4.37   10.45    -6.08    -1.21        4          -1.50
 32.58    14.73   20.73    -6.00    -1.19        5          -1.37
  8.19     0.24    5.85    -5.61    -1.11        6          -1.26
 12.02     3.11    8.18    -5.07    -1.01        7          -1.16

o The points do not fall along the diagonal line (the Expected z values are checked numerically after this example)


[Figure: normal Q-Q plot of the residuals against normal quantiles; the points deviate from the diagonal line.]

• Normality tests

  Test                  Statistic    P-value
  Shapiro-Wilk          0.897964     0.0002
  Kolmogorov-Smirnov    0.191956     <0.0100
  Cramér-von Mises      0.329436     <0.0050
  Anderson-Darling      1.879455     <0.0050

o All p-values < 0.05


o The distribution of the error term is significantly different from normal
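A quick check of the Expected z column in the Q-Q table above: with n = 54 residuals, the expected quantile of the i-th smallest standardized residual is z((i − 0.375)/(n + 0.25)).

```python
from scipy import stats

n = 54
for i in range(1, 8):
    print(i, round(stats.norm.ppf((i - 0.375) / (n + 0.25)), 2))
# gives -2.27, -1.88, -1.66, -1.50, -1.37, -1.26, -1.16 (to 2 d.p.),
# matching the Expected z column in the table
```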

(f) Omission of important independent variables


• Residuals should be plotted against variables omitted from the model that might have
important effects on the response
o e.g. Time variable
o Determine whether there are any other key independent variables that could provide
important additional descriptive and predictive power to the model
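A minimal sketch of this check: group (or plot) the residuals by a candidate omitted variable and see whether they separate systematically, as happens with gender in the example that follows. The values below are hypothetical.

```python
import numpy as np

# Residuals from a fitted model and an omitted categorical variable recorded
# for the same observations; hypothetical values
resid = np.array([ 2.1, -3.0,  1.5, -2.2,  3.0, -1.8,  2.4, -2.6,
                   1.9, -3.4,  2.8, -1.5,  0.9, -2.0,  1.1, -2.9])
gender = np.array(["F", "M"] * 8)

# If the omitted variable matters, the group means of the residuals differ
for g in np.unique(gender):
    print(g, round(resid[gender == g].mean(), 2))
```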

Example

Fat data
• The data come from a study investigating a new method of measuring body composition.
• The body fat percentage, age and gender are given for 18 adults aged between 23 and 61.
• There are 18 observations on three variables.
• Age (X)
o The age of the subject in (completed) years
• Percent (Y)
o The body fat percentage of the subject
• Gender
o The gender of the subject

• b0 = 3.221
o SE = 5.076
o | t | = 0.635 < 2.120 = t0.025,16
o β0 is insignificant at 5% level of significance
• b1 = 0.548
o SE = 0.106
o | t | = 5.191 > 2.120 = t0.025,16
o β1 is significant at 5% level of significance
• R2=0.627
o The model fits quite well

• Scatter plot
o The regression model captures the linear relationship

• Standardized residual plot and Q-Q plot show that the model assumptions are valid

• Consider an extra effect

o Standardized residuals against gender
o Residuals for males are negative
o Gender has a definite effect on body fat percentage
o The model is still appropriate despite the omission of gender
o Inclusion of gender improves the model

• A modified regression model with different gender effects
