
4. Residual analysis for simple linear regression

4.1. Residuals
(a) Observed error
  $\varepsilon_i = y_i - E(y_i)$
• Assumptions for the regression model
o ε_i are independent normal random variables
o Mean 0
o Constant variance σ²
(b) Residuals / errors of fit
• The residual is defined as
  $e_i = y_i - \hat{y}_i$
o Part of Y not explained by the model
• Sample mean of the e_i
  $$\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i
            = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)
            = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\bar{y} - b_1 \bar{x}) - b_1 x_i\right)
            = \frac{1}{n}\left[\sum_{i=1}^{n}(y_i - \bar{y}) + b_1 \sum_{i=1}^{n}(\bar{x} - x_i)\right]
            = 0$$
(c) Standardized residual
• The e_i are not independent
o $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$
• Sample variance of the n residuals
  $$\frac{\sum_i (e_i - \bar{e})^2}{n-2} = \frac{\sum_i e_i^2}{n-2} = \frac{SSE}{n-2} = MSE$$
o If the model is appropriate, MSE is an unbiased estimator of σ²
• The standardized residual is defined as
  $$\frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$$
o Used at times in residual analysis
o Identifying outlying observations
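A small numerical sketch of the definitions above, using hypothetical data (the x and y values below are made up): it fits the least squares line, checks the identities ∑ e_i = 0 and ∑ x_i e_i = 0, and forms MSE and the standardized residuals e_i / √MSE.

```python
import numpy as np

# Hypothetical data, not taken from the notes
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2])
n = len(x)

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Residuals e_i = y_i - yhat_i
y_hat = b0 + b1 * x
e = y - y_hat

# Both sums are zero up to rounding error
print(e.sum(), np.sum(x * e))

# MSE = SSE / (n - 2), unbiased for sigma^2 when the model is appropriate
MSE = np.sum(e ** 2) / (n - 2)

# Standardized residuals e_i / sqrt(MSE)
print(e / np.sqrt(MSE))
```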
(d) Studentized residuals
• Recall
  $$Var(y_i) = Var(\beta_0 + \beta_1 x_i + \varepsilon_i) = Var(\varepsilon_i) = \sigma^2$$
o For the mean response at x₀
  $$Var(\hat{y}_0) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)$$
• We have (Exercise)
  $$Var(e_i) = Var(y_i - \hat{y}_i) = Var(y_i) + Var(\hat{y}_i) - 2\,Cov(y_i, \hat{y}_i)
             = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right)\right]$$

• The studentized residual is defined as
  $$\frac{e_i}{\widehat{SD}(e_i)} = \frac{e_i}{s\sqrt{1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{XX}}\right)}}$$
  where s = √MSE
o Identifying outlying observations
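Continuing the same hypothetical sketch, the studentized residuals divide each e_i by its own estimated standard deviation, substituting s² = MSE for σ² in the variance formula above.

```python
import numpy as np

# Same hypothetical data as in the previous sketch
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
MSE = np.sum(e ** 2) / (n - 2)

# h_i = 1/n + (x_i - xbar)^2 / Sxx, so Var(e_i) = sigma^2 * (1 - h_i)
h = 1.0 / n + (x - x.mean()) ** 2 / Sxx

# Studentized residuals: e_i / (s * sqrt(1 - h_i)), with s = sqrt(MSE)
print(e / np.sqrt(MSE * (1.0 - h)))
```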

4.2. Residual analysis


• Check for departures from the linear regression model with normal errors
o The regression function is not linear
o The error terms do not have constant variance
o The error terms are not independent
o The model fits all but one or a few outliers
o The error terms are not normally distributed
o One or several important independent variables have been omitted from the model

• Residual plots
o Residuals vs independent variable
o Residuals vs fitted values
o Residuals vs time
o Residuals vs omitted independent variable
o Box plot
o Normal probability plot
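A minimal plotting sketch for the diagnostics listed above, again on hypothetical data; the time order is taken to be the row order, and the residuals-vs-omitted-variable plot is left out because this toy fit has no extra variable.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data and least squares fit
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 13.0, 15.0])
y = np.array([3.1, 5.2, 5.8, 8.1, 8.9, 11.2, 12.8, 14.1, 16.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
e = y - y_hat

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].scatter(x, e)
axes[0, 0].axhline(0, color="grey")
axes[0, 0].set_title("Residuals vs X")

axes[0, 1].scatter(y_hat, e)
axes[0, 1].axhline(0, color="grey")
axes[0, 1].set_title("Residuals vs fitted values")

axes[0, 2].plot(np.arange(1, len(e) + 1), e, marker="o")
axes[0, 2].axhline(0, color="grey")
axes[0, 2].set_title("Residuals vs time order")

axes[1, 0].boxplot(e)
axes[1, 0].set_title("Box plot of residuals")

stats.probplot(e, dist="norm", plot=axes[1, 1])  # normal probability (Q-Q) plot
axes[1, 1].set_title("Normal Q-Q plot")

axes[1, 2].axis("off")  # spare panel (residuals vs an omitted variable would go here)

plt.tight_layout()
plt.show()
```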

Example

Westwood company data


[Figure: residual plots for the Westwood company data: unstandardized residuals against lot size (X), against the unstandardized predicted value, against production run (time plot), a box plot of the residuals (n = 9), and a normal Q-Q plot of the residuals (observed value against expected normal).]

(a) Nonlinearity
• Whether a linear regression function is appropriate for the data being analyzed can be studied from
o Residual vs independent variable
o Residual vs fitted values
o Scatter plot
• Linear model is appropriate
o Residuals fall within a horizontal band centered around 0
• Departure from the linear regression model
o Residuals indicate a trend suggesting a curvilinear regression function

[Figure: sketch residual plots against X: residuals within a horizontal band around 0 (linear model appropriate) versus residuals following a curved pattern (curvilinear function needed).]
Example

Transit data
• A study of the relation between the amount of transit information and bus ridership in eight comparable test cities
• 8 observations are collected
• Number of bus transit maps distributed free to residents of the city at the beginning of the test
(X)
• Increase during the test period in average daily bus ridership during nonpeak hours (Y)

(a) Simple linear regression model


• Y = β0 + β1 X + ε
• b0 = -1.82
o SE = 1.052
o | t | = |-1.727| < 2.447 = t0.025,6
o β0 is insignificant at 5% level of significance

• b1 = 0.0435
o SE = 0.007
o | t | = 6.484 > 2.447 = t0.025,6
o β1 is significant at 5% level of significance
• R2 = 0.875
o The model fits the data well
• ANOVA table

  Source        Sum of Squares   df   Mean Square      F
  Regression        31.7637       1      31.7637    42.0388
  Residual           4.5335       6       0.7556
  Total             36.2972       7

o F-value = 42.04 > 8.81 = F0.05,1,6
  ▪ p-value = 0.0006
o Significant linear trend at 5% level of significance (a code sketch of this type of fit follows at the end of part (a))

• Scatter plot (observed and fitted Y against X)


o The linear model captures the increasing trend, but there appears to be more than a simple linear relationship between X and Y
[Figure: scatter plot of increase in ridership (Y) against maps distributed (X) with the fitted line.]

• Standardized residual against X and ŷ

o Lack of fit of the linear regression function


o Residuals depart from 0 in a systematic fashion
  ▪ Negative for small ŷ (or X)
  ▪ Positive for medium ŷ (or X)
  ▪ Negative for large ŷ (or X)
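The estimates, t statistics, R² and ANOVA F of part (a) can be obtained directly from statsmodels. The eight (X, Y) pairs are not listed in the notes, so the sketch below uses hypothetical stand-in data with a similar curved shape; its output will not reproduce the values quoted above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the transit data (8 test cities)
maps_distributed = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
ridership_gain = np.array([0.6, 2.4, 4.0, 5.1, 5.8, 6.2, 6.3, 6.1])

X = sm.add_constant(maps_distributed)        # intercept column plus X
fit = sm.OLS(ridership_gain, X).fit()

print(fit.params)                 # b0, b1
print(fit.bse)                    # standard errors
print(fit.tvalues)                # t statistics, compared with t(0.025, n-2)
print(fit.rsquared)               # R^2
print(fit.fvalue, fit.f_pvalue)   # ANOVA F statistic and its p-value
print(fit.resid)                  # residuals, for residual plots
```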

(b) Quadratic trend model
• Y = β0 + β1 X + β2 X² + ε
• The model captures the relationship nicely (see the sketch at the end of this part)
[Figure: scatter plot of increase in ridership against maps distributed with the fitted quadratic curve.]

• Standardized residual against predicted values of Y


o No particular pattern is observed
[Figure: standardized residuals against the unstandardized predicted value for the quadratic model; no particular pattern.]
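The quadratic fit of part (b) amounts to adding an X² column to the design matrix. The data below are the same hypothetical stand-ins used in the previous sketch, not the actual transit data.

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical stand-in data as before
maps_distributed = np.array([80, 100, 120, 140, 160, 180, 200, 220], dtype=float)
ridership_gain = np.array([0.6, 2.4, 4.0, 5.1, 5.8, 6.2, 6.3, 6.1])

# Design matrix with intercept, X and X^2 columns
X_quad = sm.add_constant(np.column_stack([maps_distributed, maps_distributed ** 2]))
fit_quad = sm.OLS(ridership_gain, X_quad).fit()

print(fit_quad.params)      # b0, b1, b2
print(fit_quad.rsquared)

# Residuals against fitted values: with the quadratic term, the systematic
# negative / positive / negative pattern of the linear fit should disappear
for f, r in sorted(zip(fit_quad.fittedvalues, fit_quad.resid)):
    print(round(f, 2), round(r, 3))
```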

(b) Nonconstancy of error variance


• The linear model assumes that the error term (ε) has constant variance (σ²)
o Residuals vs independent variable
o Residuals vs fitted values

[Figure: sketch residual plots against X: constant error variance (horizontal band) versus error variance increasing with X (funnel shape).]
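To show how the funnel shape arises, the sketch below simulates a linear relationship whose error standard deviation grows with X and plots the residuals from an ordinary least squares fit; the parameters are made up and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Error standard deviation proportional to X
x = np.linspace(1, 100, 80)
y = 2.0 + 0.5 * x + rng.normal(scale=0.05 * x)

# Ordinary least squares fit and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# The residuals fan out as X increases (nonconstant error variance)
plt.scatter(x, e)
plt.axhline(0, color="grey")
plt.xlabel("X")
plt.ylabel("Residual")
plt.title("Residuals fan out when error variance increases with X")
plt.show()
```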

Example

Hormone data
• The data are from the results of two assay experiments for a certain hormone.
• In the experiment, the old (or reference, X) method is compared to a new (or test, Y) method.
• There are 85 measurements for each of the two methods.

• Y = β0 + β1 X + ε
• b0 = 0.08486
o SE = 0.51175
o | t | = 0.166 < 1.989 = t0.025,83
o β0 is insignificant at 5% level of significance
• b1 = 0.95201
o SE = 0.03177
o | t | = 29.970 > 1.989 = t0.025,83
o β1 is significant at 5% level of significance
• R2=0.9154
o The linear trend model fits the data very well
• Scatter plot of observed and predicted values
o The linear model captures the increasing linear trend very well

• Standardized residual against predicted values


o The variance of the residuals is not constant
o The larger the fitted value (and hence the regressor value), the more spread out the residuals are
o The relation between test method and the reference method is positive
o The error variance is larger for larger hormone values than for smaller ones

(c) Outliers
• Outliers are extreme observations
• (Standardized) residual vs independent variable or fitted value
o Outliers are points lying far beyond the scatter of the remaining residuals
• Outliers can create great difficulty
o Model fitting can be distorted by the outlying cases
• Possible reasons
o Resulted from a mistake or other extraneous effect
  ▪ Discard
  ▪ Under the least squares method, the fitted line may be pulled disproportionately toward an outlying observation
o Convey significant information
  ▪ e.g. an interaction with another independent variable omitted from the model

Example

Consider 21 cases of X and Y values


[Figure: scatter plot of y against x for the 21 cases, labelled by case number.]

(a) Full data set


• Y = β0 + β1 X + ε
• b0 = 1.254
o SE = 0.395
o | t | = 3.17 > 2.093 = t0.025,19
o β0 is significant at 5% level of significance
• b1 = 0.0629
o SE = 0.0315
o | t | = 2.00 < 2.093 = t0.025,19
o β1 is insignificant at 5% level of significance
• R2 = 0.1737
o The linear trend model fits the data poorly
• Studentized residual against predicted value
o Case 21 is an outlier
[Figure: studentized residuals against the predicted value of y for the full data set; case 21 lies near −3, well below the scatter of the remaining cases.]

(b) Reduced data set
• Case 21 removed
• 20 cases remained
• No more outlying case

[Figure: scatter plot of y against x for the 20 remaining cases, labelled by case number.]

• b0 = 0.967
o SE = 0.291
o | t | = 3.33 > 2.101 = t0.025,18
o β0 is significant at 5% level of significance
• b1 = 0.102
o SE = 0.024
o | t | = 4.20 > 2.101 = t0.025,18
o β1 is significant at 5% level of significance
• R2=0.4952
o The linear trend model fits the data fairly well
• Inclusion or exclusion of the outlier affects both the significance of the linear model and the model fit (illustrated in the sketch at the end of this part)
o Does an outlier always have such an effect?
• Studentized residual against predicted value
o No more outliers

[Figure: studentized residuals against the predicted value of y for the reduced data set; no case stands apart from the rest.]
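The 21 (x, y) values are not listed in the notes, so the sketch below uses a hypothetical data set containing one outlying case to illustrate the comparison above: the slope, its t statistic and R² all change noticeably once the outlying case is dropped.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# 20 well-behaved hypothetical cases plus one outlying case (the last one)
x = np.concatenate([rng.uniform(1, 18, size=20), [30.0]])
y = np.concatenate([1.0 + 0.1 * x[:20] + rng.normal(scale=0.3, size=20), [0.2]])

def fit_summary(xv, yv):
    res = sm.OLS(yv, sm.add_constant(xv)).fit()
    return res.params, res.tvalues, res.rsquared

print("full data set:   ", fit_summary(x, y))
print("outlier removed: ", fit_summary(x[:-1], y[:-1]))
```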

(d) Nonindependence
• The linear model assumes all error terms (and hence, all observations) are independent
• Whenever data are obtained in a time sequence, it is a good idea to prepare a time plot of the residuals
o Check for any correlation between the error terms over time
• Independent error terms
o Residuals fluctuate in a random pattern around the base line 0

• Lack of randomness
o Too much alternation or too little alternation
• Correlation between error terms
o Some effect connected with time (but not included in the regression model) was present
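As a simple numerical companion to the time plot, the lag-1 sample autocorrelation of the residuals (taken in time order) can be computed; the residual values below are hypothetical.

```python
import numpy as np

# Residuals in time (observation) order; hypothetical values
e = np.array([0.8, 0.6, 0.7, 0.3, 0.1, -0.2, -0.4, -0.5, -0.3, 0.0,
              0.2, 0.5, 0.6, 0.4, 0.1, -0.1, -0.3, -0.6, -0.4, -0.1])

# Lag-1 sample autocorrelation (residuals already average to 0):
# near 0 suggests independence, strongly positive means too little alternation,
# strongly negative means too much alternation
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)
print(round(r1, 3))
```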


Example

Ice-cream data
• The data give the icecream consumption over 30 four-week periods from 18 March 1951 to 11 July 1953. There are 30 observations on 3 variables.
• Period
o The week of the study
• Consumption (Y)
o The icecream consumption (in pints per capita)
• Temp (X)
o The mean temperature (in degrees F)

• b0 = 0.2069
o SE= 0.0247
o | t | = 8.375 > 2.048 = t0.025,28
o β0 is significant at 5% level of significance
• b1 = 0.0031
o SE = 0.0005
o | t | = 6.502 > 2.048 = t0.025,28
o β1 is significant at 5% level of significance
• R2 = 0.602
o The model fits quite well
• Scatter plot of observed and predicted values

• Standardized residual against time (period)
o The error terms are correlated over time
o Too little alternation
o The points follow a systematic curve

(e) Nonnormality
• Significance tests (F or t-tests) are based on normal assumption (of the error term)
o Small departures from normality create no serious problems
o Major departures are of concern
• Distribution plots
o Boxplot, histogram, stem-and-leaf plot
o Check whether gross departures from normality are shown by such a plot
• Comparison of frequencies
o Actual frequencies of the residuals vs expected frequencies under normality
o Using standard normal distribution
  ▪ ~68% of standardized residuals between -1 and +1
  ▪ ~90% between -1.645 and +1.645
o Using t distribution when sample size is small
• Normal probability plot / Normal Q-Q plot
o (Standardized) Residual vs expected quantile under normality
o The expected quantile of the i-th smallest standardized residual is
  $$z\left(\frac{i - 0.375}{n + 0.25}\right)$$
  ▪ z(q) = the (q × 100)th percentile of the standard normal distribution
o A nearly linear plot suggests agreement with normality
  ▪ Substantial departure from linearity suggests a non-normal distribution
o Typical Q-Q residual plot
  ▪ Observed residual against expected quantile under normality

• Significance tests for normality
o Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Cramér-von Mises test
o Do not reject normality when the p-value > significance level
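A sketch of these checks, assuming a vector of standardized residuals is available (standard normal draws are used here as a stand-in): the frequency comparison, the expected Q-Q quantiles z((i − 0.375)/(n + 0.25)), and the formal tests via scipy.stats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
std_resid = rng.normal(size=54)   # stand-in for the standardized residuals

# Comparison of frequencies with the standard normal benchmarks
print(np.mean(np.abs(std_resid) <= 1.0))     # about 0.683 expected
print(np.mean(np.abs(std_resid) <= 1.645))   # about 0.90 expected

# Normal Q-Q plot coordinates: i-th smallest residual vs z((i - 0.375)/(n + 0.25))
n = len(std_resid)
ordered = np.sort(std_resid)
expected_z = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
# a near-linear relation between `ordered` and `expected_z` supports normality

# Formal tests (reject normality when the p-value is below the significance level)
print(stats.shapiro(std_resid))                 # Shapiro-Wilk
print(stats.kstest(std_resid, "norm"))          # Kolmogorov-Smirnov vs N(0, 1)
print(stats.anderson(std_resid, dist="norm"))   # Anderson-Darling (critical values)
print(stats.cramervonmises(std_resid, "norm"))  # Cramér-von Mises
```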

Example

Market data
• The data give the Market rate (Y) and the account rate (X)
• 54 cases

• b0 = 0.848
o SE = 1.976
o | t | = 0.429 < 2.007 = t0.025,52
o β0 is insignificant at 5% level of significance
• b1 = 0.610
o SE = 0.143
o | t | = 4.263 > 2.007 = t0.025,52
o β1 is significant at 5% level of significance
• R2=0.259
o The model does not fit well
• For the standardized residuals
o Skewness = 1.034
  ▪ Positively skewed
o Kurtosis = 0.376
  ▪ A little leptokurtic
  ▪ Slightly heavier tails than normal
o 72.2% between −1 and +1
  ▪ 68.3% expected based on the standard normal
  ▪ 67.8% expected based on t52
o 92.6% between −2 and +2
  ▪ 95.5% expected based on the standard normal
  ▪ 94.9% expected based on t52

• Histogram
o Positively skewed
o Heavier tails than normal
[Figure: histogram of the residuals (percent against residual value), showing positive skewness.]

• Normal Q-Q plot


o Order the residuals
o Calculate the quantiles from standard normal

   x        y       ŷ       Res    STD Res   Order (i)   Expected z
  6.42    -1.63    4.77    -6.40    -1.27        1          -2.27
  6.78    -1.34    4.99    -6.33    -1.26        2          -1.88
 18.45     5.86   12.11    -6.25    -1.24        3          -1.66
 15.74     4.37   10.45    -6.08    -1.21        4          -1.50
 32.58    14.73   20.73    -6.00    -1.19        5          -1.37
  8.19     0.24    5.85    -5.61    -1.11        6          -1.26
 12.02     3.11    8.18    -5.07    -1.01        7          -1.16

o The points do not fall along the diagonal line (the Expected z values are checked numerically after this example)


[Figure: normal Q-Q plot of the residuals against normal quantiles; the points deviate from the diagonal line.]

• Normality tests

  Test                  Statistic    P-value
  Shapiro-Wilk          0.897964     0.0002
  Kolmogorov-Smirnov    0.191956     <0.0100
  Cramér-von Mises      0.329436     <0.0050
  Anderson-Darling      1.879455     <0.0050

o All p-values < 0.05


o The distribution of the error term is significantly different from normal
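A quick check of the Expected z column in the Q-Q table above: with n = 54 residuals, the expected quantile of the i-th smallest standardized residual is z((i − 0.375)/(n + 0.25)).

```python
from scipy import stats

n = 54
for i in range(1, 8):
    print(i, round(stats.norm.ppf((i - 0.375) / (n + 0.25)), 2))
# gives -2.27, -1.88, -1.66, -1.50, -1.37, -1.26, -1.16 (to 2 d.p.),
# matching the Expected z column in the table
```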

(f) Omission of important independent variables


• Residuals should be plotted against variables omitted from the model that might have
important effects on the response
o e.g. Time variable
o Determine whether there are any other key independent variables that could provide
important additional descriptive and predictive power to the model
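A minimal sketch of this check: group (or plot) the residuals by a candidate omitted variable and see whether they separate systematically, as happens with gender in the example that follows. The values below are hypothetical.

```python
import numpy as np

# Residuals from a fitted model and an omitted categorical variable recorded
# for the same observations; hypothetical values
resid = np.array([ 2.1, -3.0,  1.5, -2.2,  3.0, -1.8,  2.4, -2.6,
                   1.9, -3.4,  2.8, -1.5,  0.9, -2.0,  1.1, -2.9])
gender = np.array(["F", "M"] * 8)

# If the omitted variable matters, the group means of the residuals differ
for g in np.unique(gender):
    print(g, round(resid[gender == g].mean(), 2))
```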

Example

Fat data
• The data come from a study investigating a new method of measuring body composition.
• The body fat percentage, age and gender are given for 18 adults aged between 23 and 61.
• There are 18 observations on three variables.
• Age (X)
o The age of the subject in (completed) years
• Percent (Y)
o The body fat percentage of the subject
• Gender
o The gender of the subject

• b0 = 3.221
o SE = 5.076
o | t | = 0.635 < 2.120 = t0.025,16
o β0 is insignificant at 5% level of significance
• b1 = 0.548
o SE = 0.106
o | t | = 5.191 > 2.120 = t0.025,16
o β1 is significant at 5% level of significance
• R2=0.627
o The model fits quite well

• Scatter plot
o The regression model captures the linear relationship

• Standardized residual plot and Q-Q plot show that the model assumptions are valid

• Consider an extra effect

o Standardized residuals against gender
o Residuals for males are negative
o Gender has a definite effect on body fat percentage
o The model is still appropriate despite the omission of gender
o Inclusion of gender improves the model

• A modified regression model with different gender effects
