Residual Analysis For Simple Linear Regression
4.1. Residuals
(a) Observed error
$$\varepsilon_i = y_i - E(y_i)$$
• Assumptions for regression model
o $\varepsilon_i$ are independent normal random variables
o Mean 0
o Constant variance $\sigma^2$
(b) Residual / errors of fit
• Residual is defined as
$$e_i = y_i - \hat{y}_i$$
o Part of Y not explained by the model
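• Sample mean of $e_i$
$$
\begin{aligned}
\bar{e} &= \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\
&= \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (\bar{y} - b_1 \bar{x}) - b_1 x_i\bigr) \\
&= \frac{1}{n}\left[\sum_{i=1}^{n} (y_i - \bar{y}) + b_1 \sum_{i=1}^{n} (\bar{x} - x_i)\right] \\
&= 0
\end{aligned}
$$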
(c) Standardized residual
• The residuals $e_i$ are not independent
o $\sum e_i = 0$ and $\sum x_i e_i = 0$
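The two constraints on the residuals can be checked numerically. The following is a minimal sketch in plain Python; the data values are hypothetical and chosen only for illustration, and the variable names are not from the notes.

```python
# Hypothetical data, chosen only for illustration (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates: b1 = S_XY / S_XX, b0 = y_bar - b1 * x_bar.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - (b0 + b1 * x_i).
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(e)                                 # zero up to rounding error
sum_xe = sum(xi * ei for xi, ei in zip(x, e))  # zero up to rounding error
print(f"sum e_i = {sum_e:.2e}, sum x_i e_i = {sum_xe:.2e}")
```

Because the n residuals satisfy two linear constraints, only n − 2 of them are free to vary, which is why n − 2 appears as the divisor in the residual variance.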
• Sample variance of the n residuals
$$\frac{\sum (e_i - \bar{e})^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{n-2} = MSE$$
o If model is appropriate, MSE is unbiased for $\sigma^2$
• Standardized residual is defined as
$$\frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$$
o Used at times in residual analysis
o Identifying outlying observations
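As a rough illustration of the definition above, the sketch below computes MSE and the standardized residuals $e_i / \sqrt{MSE}$ for a small hypothetical data set. The data and the variable names are assumptions made for the example.

```python
import math

# Hypothetical data, chosen only for illustration (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals and their sample variance MSE = SSE / (n - 2).
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(ei ** 2 for ei in e)
mse = sse / (n - 2)

# Standardized residuals: each residual divided by sqrt(MSE).
std_res = [ei / math.sqrt(mse) for ei in e]
for i, r in enumerate(std_res, start=1):
    print(f"case {i}: standardized residual = {r:+.3f}")
```

By construction the squared standardized residuals sum to n − 2, so a case whose value is large relative to the others stands out on a common scale.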
(d) Studentized residuals
• Recall
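$$\mathrm{Var}(y_i) = \mathrm{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2$$
o For mean response at $x = x_0$
$$\mathrm{Var}(\hat{y}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)$$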
• We have (Exercise)
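$$
\begin{aligned}
\mathrm{Var}(e_i) &= \mathrm{Var}(y_i - \hat{y}_i) \\
&= \mathrm{Var}(y_i) + \mathrm{Var}(\hat{y}_i) - 2\,\mathrm{Cov}(y_i, \hat{y}_i) \\
&= \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right)\right]
\end{aligned}
$$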
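One possible route through the exercise is sketched here. The shorthand $h_i$ is introduced only for this sketch and is not notation from the notes. The key step is writing the fitted value as a linear combination of the independent observations $y_j$.

```latex
% Shorthand (an assumption of this sketch): h_i = 1/n + (x_i - \bar{x})^2 / S_{XX}
\begin{aligned}
\hat{y}_i &= \bar{y} + b_1 (x_i - \bar{x})
           = \sum_{j=1}^{n} \left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{XX}}\right] y_j \\
\mathrm{Cov}(y_i, \hat{y}_i)
          &= \left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right] \mathrm{Var}(y_i)
           = \sigma^2 h_i
           \qquad \text{(only the } j = i \text{ term survives)} \\
\mathrm{Var}(\hat{y}_i) &= \sigma^2 h_i \\
\mathrm{Var}(e_i) &= \sigma^2 + \sigma^2 h_i - 2\sigma^2 h_i = \sigma^2 (1 - h_i)
\end{aligned}
```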
• Studentized residual is defined as
$$\frac{e_i}{SD(e_i)} = \frac{e_i}{s\sqrt{1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{XX}}\right)}}$$
o Identifying outlying observations
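The studentized residual can be computed directly from its definition. The sketch below assumes $s = \sqrt{MSE}$ and uses the name `h` for the term $1/n + (x_i - \bar{x})^2 / S_{XX}$. Both the name and the data are choices made for this example.

```python
import math

# Hypothetical data, chosen only for illustration (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))  # s = sqrt(MSE), assumed here

# h_i = 1/n + (x_i - x_bar)^2 / S_XX, so that Var(e_i) = sigma^2 * (1 - h_i).
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Studentized residual: e_i divided by its estimated standard deviation.
studentized = [ei / (s * math.sqrt(1.0 - hi)) for ei, hi in zip(e, h)]
for i, (ei, ri) in enumerate(zip(e, studentized), start=1):
    print(f"case {i}: e = {ei:+.3f}, studentized = {ri:+.3f}")
```

Cases with $x_i$ far from $\bar{x}$ have a larger `h` and therefore a smaller residual variance, so the studentized value inflates their residuals more than the standardized value does.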
• Residual plots
o Residuals vs independent variable
o Residuals vs fitted values
o Residuals vs time
o Residuals vs omitted independent variable
o Box plot
o Normal probability plot
Example
[Figure: unstandardized residual against lot size; unstandardized residual against unstandardized predicted value; normal Q-Q plot of unstandardized residuals (expected normal value against observed value)]
(a) Nonlinearity
• Whether a linear regression function is appropriate for the data being analyzed can be studied from
o Residual vs independent variable
o Residual vs fitted values
o Scatter plot
• Linear model is appropriate
o Residuals fall within a horizontal band centered around 0
• Departure from the linear regression model
o Residuals show a systematic trend, indicating a curvilinear regression function
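To see what such a departure looks like, the sketch below fits a straight line to hypothetical data that follow an exact quadratic curve. The data are invented for illustration. The residuals are positive at both ends and negative in the middle rather than scattering in a horizontal band.

```python
# Hypothetical curved data: y = x^2 on x = 0, 1, ..., 6 (not from the notes).
x = [float(v) for v in range(7)]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals from the (misspecified) straight-line fit.
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
for xi, ei in zip(x, e):
    print(f"x = {xi:.0f}: residual = {ei:+.2f}")
```

A systematic U-shaped (or inverted U-shaped) pattern of this kind in the plot of residuals against X is the signal that a curvilinear regression function may be needed.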
[Figure: prototype plots of residual e against X]
Example
Transit data
• A study of relation between amount of transit information and bus ridership in eight comparable
test cities
• 8 observations are collected
• Number of bus transit maps distributed free to residents of the city at the beginning of the test
(X)
• Increase during the test period in average daily bus ridership during nonpeak hours (Y)
• $b_1 = 0.0435$
o SE = 0.007
o $|t| = 6.484 > 2.447 = t_{0.025,6}$
o $\beta_1$ is significant at the 5% level of significance
• $R^2 = 0.875$
o The model fits the data well
• ANOVA table
Source       Sum of Squares   df   Mean Square   F
Regression   31.7637          1    31.7637       42.0388
Residual     4.5335           6    0.7556
Total        36.2972          7
o F-value = 42.04 > 5.99 = $F_{0.05,1,6}$
o p-value = 0.0006
o Significant linear trend at the 5% level of significance
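The entries of the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the sums of squares and degrees of freedom quoted above. The variable names are chosen for this example, and small differences in the last digit come from rounding in the table.

```python
# Sums of squares and degrees of freedom as quoted in the ANOVA table.
ss_regression = 31.7637
ss_residual = 4.5335
df_regression = 1
df_residual = 6

ss_total = ss_regression + ss_residual      # total sum of squares
ms_regression = ss_regression / df_regression
mse = ss_residual / df_residual             # mean square error
f_stat = ms_regression / mse                # F = MSR / MSE
r_squared = ss_regression / ss_total        # R^2 = SSR / SST

print(f"SST = {ss_total:.4f}")
print(f"MSE = {mse:.4f}")
print(f"F   = {f_stat:.2f}")
print(f"R^2 = {r_squared:.3f}")
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, which agrees with the values quoted: 6.484 squared is approximately 42.04.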
[Figure: scatter plot of increase in ridership against maps distributed]
(b) Quadratic trend model
• $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$
• The model captures the relationship nicely
[Figure: increase in ridership against maps distributed for the quadratic trend model; standardized residual plot]
Example
Hormone data
• The data are from the results of two assay experiments for a certain hormone.
• In the experiment, the old (or reference, X) method is compared to a new (or test, Y) method.
• There are 85 measurements for each of the two methods.
• $Y = \beta_0 + \beta_1 X + \varepsilon$
• $b_0 = 0.08486$
o SE = 0.51175
o $|t| = 0.166 < 1.989 = t_{0.025,83}$
o $\beta_0$ is not significant at the 5% level of significance
• $b_1 = 0.95201$
o SE = 0.03177
o $|t| = 29.970 > 1.989 = t_{0.025,83}$
o $\beta_1$ is significant at the 5% level of significance
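The t statistics above are each estimate divided by its standard error, compared with the critical value. The sketch below redoes that arithmetic with the figures quoted for the hormone data. The dictionary layout and names are choices made for this example.

```python
# Estimates and standard errors as quoted for the hormone data.
estimates = {
    "b0": (0.08486, 0.51175),
    "b1": (0.95201, 0.03177),
}
t_crit = 1.989  # t_{0.025,83}, as quoted in the notes

t_stats = {}
for name, (coef, se) in estimates.items():
    t_stats[name] = abs(coef / se)  # |t| = |estimate| / SE
    verdict = "significant" if t_stats[name] > t_crit else "not significant"
    print(f"{name}: |t| = {t_stats[name]:.3f} ({verdict} at the 5% level)")
```

An intercept that is not significantly different from 0 together with a slope close to 1 is what one would hope to see if the new assay method agrees with the reference method.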
• $R^2 = 0.9154$
o The linear trend model fits the data very well
• Scatter plot of observed and predicted values
o The linear model captures the increasing linear trend very well
(c) Outliers
• Outliers are extreme observations
• (Standardized) residual vs independent variable or fitted value
o Outliers are points lying far beyond the scatter of the remaining residuals
• Outliers can create great difficulty
o The fitted model may be distorted by the outlying cases
• Possible reasons
o Resulted from a mistake or other extraneous effect: discard the case, since under the least squares method the fitted line may be pulled disproportionately toward an outlying observation
o Convey significant information, e.g. an interaction with another independent variable omitted from the model
Example
[Figure: scatter plot of y against x with case numbers (21 cases); residual against predicted value of y, with case 21 lying far below the other residuals]
(b) Reduced data set
• Case 21 removed
• 20 cases remain
• No outlying cases remain
[Figure: scatter plot of y against x for the reduced data set, with case numbers]
• $b_0 = 0.967$
o SE = 0.291
o $|t| = 3.33 > 2.101 = t_{0.025,18}$
o $\beta_0$ is significant at the 5% level of significance
• $b_1 = 0.102$
o SE = 0.024
o $|t| = 4.20 > 2.101 = t_{0.025,18}$
o $\beta_1$ is significant at the 5% level of significance
• $R^2 = 0.4952$
o The linear trend model fits the data fairly well
• Including or excluding the outlier affects both the significance of the linear model and the goodness of fit
o Does an outlier always have an effect?
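The size of the effect depends on where the outlying case sits. The sketch below uses hypothetical data lying exactly on a line (not the data of this example) and adds one outlying case in two different positions. The helper name `fit_line` is an assumption of this sketch.

```python
def fit_line(x, y):
    """Return (b0, b1) from least squares for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data lying exactly on y = 1 + 0.1 x.
x = [float(v) for v in range(11)]
y = [1.0 + 0.1 * xi for xi in x]

_, slope_clean = fit_line(x, y)
# Outlying case far from the mean of x: the slope is pulled toward it.
_, slope_far = fit_line(x + [30.0], y + [0.0])
# Outlying case at the mean of x (x_bar = 5): the slope is unchanged.
intercept_mid, slope_mid = fit_line(x + [5.0], y + [-3.0])

print(f"slope without outlier:         {slope_clean:.4f}")
print(f"slope with outlier at x = 30:  {slope_far:.4f}")
print(f"slope with outlier at x = 5:   {slope_mid:.4f}")
print(f"intercept with outlier at x=5: {intercept_mid:.4f}")
```

In this sketch an outlying case at the mean of x shifts only the intercept, while one at an extreme x value changes the slope substantially, so an outlier does not always affect every part of the fit to the same degree.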
• Studentized residual against predicted value
o No outliers remain
[Figure: studentized residual against predicted value of y for the reduced data set, with case numbers]
(d) Nonindependence
• The linear model assumes all error terms (and hence, all observations) are independent
• Whenever data are obtained in a time sequence, it is a good idea to prepare a time plot of the residuals
o Check whether the error terms are correlated over time
• Independent error terms
o Residuals fluctuate in a random pattern around the base line 0
• Lack of randomness
o Too much alternation or too little alternation
• Correlation between error terms
o Some effect connected with time (but not included in the regression model) was present
[Figure: residuals plotted against time]
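The amount of alternation in a time-ordered residual sequence can be summarized by counting sign changes. This is a minimal sketch with hypothetical residuals, offered as an informal aid to reading the time plot rather than a formal test. The function name is an assumption of this example.

```python
def count_sign_changes(residuals):
    """Count how often consecutive (non-zero) residuals switch sign."""
    signs = [1 if r > 0 else -1 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical residuals in time order (not from the notes).
trending = [-1.2, -0.8, -0.5, -0.1, 0.3, 0.7, 1.0, 0.6]     # too little alternation
alternating = [0.9, -1.1, 0.8, -0.7, 1.0, -0.9, 0.6, -0.8]  # too much alternation

print("sign changes (trending):   ", count_sign_changes(trending))
print("sign changes (alternating):", count_sign_changes(alternating))
```

With 8 residuals there are 7 consecutive pairs. Signs that were independent and equally likely to be positive or negative would change in about half of them, so counts near 0 or near 7 both point to a lack of randomness.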
Example
Ice-cream data
• The data give the ice cream consumption over 30 four-week periods from 18 March 1951 to 11 July 1953. There are 30 observations on 3 variables.
• Period
o The week of the study
• Consumption (Y)
o The ice cream consumption (in pints per capita)
• Temp (X)
o The mean temperature (in degrees F)
• $b_0 = 0.2069$
o SE = 0.0247
o $|t| = 8.375 > 2.048 = t_{0.025,28}$
o $\beta_0$ is significant at the 5% level of significance
• $b_1 = 0.0031$
o SE = 0.0005
o $|t| = 6.502 > 2.048 = t_{0.025,28}$
o $\beta_1$ is significant at the 5% level of significance
• $R^2 = 0.602$
o The model fits quite well
• Scatter plot of observed and predicted values
• Standardized residual against time (period)
o The error terms are correlated over time
o Too little alternation
o The points follow a distinct curve
(e) Nonnormality
• Significance tests (F or t-tests) are based on the normality assumption (for the error term)
o Small departures from normality create no serious problems
o Major departures are of concern
• Distribution plots
o Boxplot, histogram, stem-and-leaf plot
o Check whether such a plot shows gross departures from normality
• Comparison of frequencies
o Actual frequencies of the residuals vs expected frequencies under normality
o Using standard normal distribution: ~68% of standardized residuals between −1 and +1, ~90% between −1.645 and +1.645
o Using t distribution when sample size is small
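The comparison of frequencies can be sketched with the standard library's `statistics.NormalDist`. The standardized residuals below are hypothetical, and the helper names are chosen for this example.

```python
from statistics import NormalDist

def share_within(values, bound):
    """Fraction of values lying in the interval [-bound, +bound]."""
    return sum(1 for v in values if -bound <= v <= bound) / len(values)

def normal_share(bound):
    """Probability that a standard normal variable lies in [-bound, +bound]."""
    std_normal = NormalDist()
    return std_normal.cdf(bound) - std_normal.cdf(-bound)

# Hypothetical standardized residuals (not from the notes).
std_res = [-1.8, -1.1, -0.7, -0.4, -0.2, 0.1, 0.3, 0.6, 0.9, 2.3]

for bound in (1.0, 1.645):
    actual = share_within(std_res, bound)
    expected = normal_share(bound)
    print(f"within +/-{bound}: actual {actual:.1%}, expected {expected:.1%}")
```

Large gaps between the actual and expected shares suggest a departure from normality. With a sample this small, modest gaps are to be expected.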
• Normal probability plot / Normal Q-Q plot
o (Standardized) Residual vs expected quantile under normality
o Expected quantile of the i-th smallest standardized residual is
$$z\!\left(\frac{i - 0.375}{n + 0.25}\right)$$
where $z(q)$ is the $(q \times 100)$th percentile of the standard normal distribution
o A near-linear plot suggests agreement with normality; a substantial departure from linearity suggests a non-normal distribution
o Typical Q-Q plots: observed residual against expected value
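The expected quantiles in the formula above can be computed with `statistics.NormalDist().inv_cdf`. The sketch below pairs them with sorted residuals to form the points of a Q-Q plot. The residual values are hypothetical and the function name is an assumption of this example.

```python
from statistics import NormalDist

def expected_normal_quantiles(n):
    """z((i - 0.375) / (n + 0.25)) for i = 1, ..., n."""
    z = NormalDist().inv_cdf  # percentile function of the standard normal
    return [z((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Hypothetical standardized residuals (not from the notes).
std_res = [0.4, -1.3, 0.1, 1.6, -0.5, -0.2, 0.8, -0.9, 0.3]

quantiles = expected_normal_quantiles(len(std_res))
# Each Q-Q point pairs the i-th smallest residual with the i-th expected quantile.
qq_points = list(zip(sorted(std_res), quantiles))
for observed, expected in qq_points:
    print(f"observed {observed:+.2f}  expected {expected:+.3f}")
```

If the points fall close to a straight line when plotted, the residuals are consistent with normality.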
• Significance test for normality
o Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Cramér-von Mises test
o Do not reject normality when p-value > significance level
Example
Market data
• The data give the Market rate (Y) and the account rate (X)
• 54 cases
• $b_0 = 0.848$
o SE = 1.976
o $|t| = 0.429 < 2.007 = t_{0.025,52}$
o $\beta_0$ is not significant at the 5% level of significance
• $b_1 = 0.610$
o SE = 0.143
o $|t| = 4.263 > 2.007 = t_{0.025,52}$
o $\beta_1$ is significant at the 5% level of significance
• $R^2 = 0.259$
o The model does not fit well
• For the standardized residual
o Skewness = 1.034: positively skewed
o Kurtosis = 0.376: a little leptokurtic, heavier tails than normal
o 72.2% between −1 and +1 (68.3% based on standard normal; 67.8% based on $t_{52}$)
o 92.6% between −2 and +2 (95.5% based on standard normal; 94.9% based on $t_{52}$)
• Histogram
o Positively skewed
o Heavier tails
[Figure: histogram of the residuals (percent against residual); normal Q-Q plot (residual against normal quantiles)]
• Normality tests
Example
Fat data
• The data come from a study investigating a new method of measuring body composition.
• The body fat percentage, age and gender are given for 18 adults aged between 23 and 61.
• There are 18 observations on three variables.
• Age (X)
o The age of the subject in (completed) years
• Percent (Y)
o The body fat percentage of the subject
• Gender
o The gender of the subject
• $b_0 = 3.221$
o SE = 5.076
o $|t| = 0.635 < 2.120 = t_{0.025,16}$
o $\beta_0$ is not significant at the 5% level of significance
• $b_1 = 0.548$
o SE = 0.106
o $|t| = 5.191 > 2.120 = t_{0.025,16}$
o $\beta_1$ is significant at the 5% level of significance
• $R^2 = 0.627$
o The model fits quite well
• Scatter plot
o The regression model captures the linear relationship
• Standardized residual plot and Q-Q plot show that the model assumptions are valid
• A modified regression model with different gender effects