Appendix Robust Regression
Abstract
Linear least-squares regression can be very sensitive to unusual data. In this appendix to
Fox and Weisberg (2011), we describe how to fit several alternative robust-regression estimators,
which attempt to down-weight or ignore unusual data: M-estimators; bounded-influence
estimators; MM-estimators; and quantile-regression estimators, including 𝐿1 regression.
All estimation methods rely on assumptions for their validity. We say that an estimator or
statistical procedure is robust if it provides useful information even if some of the assumptions used to
justify the estimation method are not applicable. Most of this appendix concerns robust regression:
estimation methods, typically for the linear regression model, that are insensitive to outliers and
possibly to high-leverage points. Other types of robustness, for example to model misspecification,
are not discussed here. These methods were developed between the mid-1960s and the mid-1980s.
With the exception of the 𝐿1 methods described in Section 5, they are not widely used today.
2 M-Estimation
In linear regression, the breakdown of the ordinary least squares (OLS) estimator is analogous to
the breakdown of the sample mean: A few extreme observations can largely determine the value of
the OLS estimator. In Fox and Weisberg (2011, Chap. 6), methods for detecting potentially
influential points are presented, and these provide one approach to dealing with such points. Another
approach is to replace ordinary least squares with an estimation method that is less affected by
the outlying and influential points and can therefore produce useful results by accommodating the
non-conforming data.
The most common general method of robust regression is M-estimation, introduced by
Huber (1964). This class of estimators can be regarded as a generalization of maximum-likelihood
estimation, hence the “M.”
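We consider only the linear model
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$$
for the 𝑖th of 𝑛 independent observations. We assume that the model itself is not at issue, so
E(𝑦∣x) = x′𝑖 𝜷, but the distribution of the errors may be heavy-tailed, producing occasional outliers.
Given an estimator b for 𝜷, the fitted model is
$$\hat{y}_i = \mathbf{x}_i'\mathbf{b}$$
and the residuals are 𝑒𝑖 = 𝑦𝑖 − x′𝑖 b. The M-estimate b minimizes the objective function
$$\sum_{i=1}^{n} \rho(e_i) = \sum_{i=1}^{n} \rho(y_i - \mathbf{x}_i'\mathbf{b})$$
over all b, where the function 𝜌 gives the contribution of each residual to the objective.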
symmetric, 𝜌(𝑒) = 𝜌(−𝑒), although in some problems one might argue that symmetry is
undesirable; and
For example, the least-squares 𝜌-function 𝜌(𝑒𝑖 ) = 𝑒2𝑖 satisfies these requirements, as do many other
functions.
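Differentiating the sum of the 𝜌(𝑒𝑖 ) with respect to b and setting the partial derivatives to zero
produces the estimating equations
$$\sum_{i=1}^{n} \psi(y_i - \mathbf{x}_i'\mathbf{b})\,\mathbf{x}_i' = \mathbf{0}$$
where the influence curve 𝜓 is defined to be the derivative of 𝜌 with respect to its argument.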
To facilitate computing, we would like to make this last equation similar to the estimating
equations for a familiar problem like weighted least squares. To this end, define the weight function
𝑤𝑖 = 𝑤(𝑒𝑖 ) = 𝜓(𝑒𝑖 )/𝑒𝑖 . The estimating equations may then be written as
$$\sum_{i=1}^{n} w_i (y_i - \mathbf{x}_i'\mathbf{b})\,\mathbf{x}_i' = \mathbf{0}$$
Solving these estimating equations is equivalent to a weighted least-squares problem, minimizing
$\sum w_i e_i^2$ with the weights treated as fixed. The weights, however, depend upon the residuals,
the residuals depend upon the estimated coefficients, and the estimated coefficients depend upon
the weights. An iterative solution (called iteratively reweighted least squares, IRLS) is therefore required.
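The circular dependence among weights, residuals, and coefficients can be illustrated with a short base-R sketch. It is a minimal illustration under stated assumptions, not the algorithm used by any particular package: it assumes OLS starting values, Huber weights with tuning constant 1.345 applied to scaled residuals, a scale estimate recomputed at each step as the median absolute residual divided by 0.6745, and a simple convergence tolerance. The function names huber_weight and irls_huber and the simulated data are hypothetical.

```r
# Hypothetical helper: Huber weight for a scaled residual u with tuning constant k.
huber_weight <- function(u, k = 1.345) {
  ifelse(abs(u) <= k, 1, k / abs(u))
}

# Minimal IRLS sketch. X is a model matrix (including the constant), y the response.
irls_huber <- function(X, y, k = 1.345, tol = 1e-8, max_iter = 100) {
  b <- qr.solve(X, y)                       # assumed start: OLS coefficients
  for (iter in seq_len(max_iter)) {
    e <- drop(y - X %*% b)                  # residuals from the current fit
    s <- median(abs(e)) / 0.6745            # robust scale estimate
    w <- huber_weight(e / s, k)             # weights from the scaled residuals
    # weighted least squares: solve (X'WX) b = X'Wy
    b_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * y)))
    converged <- max(abs(b_new - b)) < tol * (1 + max(abs(b)))
    b <- b_new
    if (converged) break
  }
  list(coefficients = b, weights = w, iterations = iter)
}

# Simulated illustration: a clean linear model with five shifted responses.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1:5] <- y[1:5] + 20
X <- cbind(1, x)
fit_ols <- qr.solve(X, y)
fit_irls <- irls_huber(X, y)
```

In this simulation the five shifted observations should receive small final weights, and the robust intercept should sit closer to the generating value of 1 than the OLS intercept does.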
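The asymptotic covariance matrix of b is
$$\mathcal{V}(\mathbf{b}) = \frac{E(\psi^2)}{[E(\psi')]^2}\,(\mathbf{X}'\mathbf{X})^{-1}$$
Using $\sum[\psi(e_i)]^2/n$ to estimate $E(\psi^2)$, and $[\sum \psi'(e_i)/n]^2$ to estimate $[E(\psi')]^2$,
produces the estimated asymptotic covariance matrix, $\hat{\mathcal{V}}(\mathbf{b})$ (which is not reliable in small samples).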
[Figure 1 panels: rows OLS, Huber, bisquare; columns 𝜌(𝑒), 𝜓(𝑒), and 𝑤(𝑒), each plotted against 𝑒 from −6 to 6.]
Figure 1: Objective (left), 𝜓 (center), and weight (right) functions for the least-squares (top), Huber
(middle), and bisquare (bottom) estimators. The tuning constants for these graphs are 𝑘 = 1.345
for the Huber estimator and 𝑘 = 4.685 for the bisquare. (One way to think about this scaling is
that the standard deviation of the errors, 𝜎, is taken as 1.)
Method          Objective Function                                                              Weight Function
Least-Squares   $\rho_{LS}(e) = e^2$                                                            $w_{LS}(e) = 1$
Huber           $\rho_H(e) = \frac{1}{2}e^2$ for $|e| \le k$                                    $w_H(e) = 1$ for $|e| \le k$
                $\rho_H(e) = k|e| - \frac{1}{2}k^2$ for $|e| > k$                               $w_H(e) = k/|e|$ for $|e| > k$
Bisquare        $\rho_B(e) = \frac{k^2}{6}\{1 - [1 - (e/k)^2]^3\}$ for $|e| \le k$              $w_B(e) = [1 - (e/k)^2]^2$ for $|e| \le k$
                $\rho_B(e) = k^2/6$ for $|e| > k$                                               $w_B(e) = 0$ for $|e| > k$
Table 1: Objective functions and weight functions for least-squares, Huber, and bisquare estimators.
The tuning constants are generally chosen to give reasonably high efficiency in
the normal case; in particular, 𝑘 = 1.345𝜎 for the Huber and 𝑘 = 4.685𝜎 for the bisquare (where 𝜎
is the standard deviation of the errors) produce 95-percent efficiency when the errors are normal,
and still offer protection against outliers.
In an application, we need an estimate of the standard deviation of the errors to use these
results. Usually a robust measure of spread is employed in preference to the standard deviation of
the residuals. For example, a common approach is to take 𝜎 ˆ = MAR/0.6745, where MAR is the
median absolute residual.
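The entries of Table 1 and the scale estimate just described can be written as small R functions. The sketch below is illustrative only: the function names are hypothetical, the tuning constants are the values quoted above with 𝜎 taken as 1, and the numerical derivative is simply a way to check that each weight function equals 𝜓(𝑒)/𝑒, with 𝜓 the derivative of 𝜌.

```r
# Objective and weight functions from Table 1 (hypothetical function names).
rho_huber <- function(e, k = 1.345) {
  ifelse(abs(e) <= k, e^2 / 2, k * abs(e) - k^2 / 2)
}
w_huber <- function(e, k = 1.345) {
  ifelse(abs(e) <= k, 1, k / abs(e))
}
rho_bisq <- function(e, k = 4.685) {
  ifelse(abs(e) <= k, (k^2 / 6) * (1 - (1 - (e / k)^2)^3), k^2 / 6)
}
w_bisq <- function(e, k = 4.685) {
  ifelse(abs(e) <= k, (1 - (e / k)^2)^2, 0)
}

# psi(e) approximated by a central difference of rho.
num_psi <- function(rho, e, h = 1e-6) (rho(e + h) - rho(e - h)) / (2 * h)

# Compare psi(e)/e with w(e) at a few residual values away from 0 and from k.
e <- c(-6, -3, -1, -0.5, 0.5, 1, 3, 6)
check_huber <- max(abs(num_psi(rho_huber, e) / e - w_huber(e)))
check_bisq  <- max(abs(num_psi(rho_bisq, e) / e - w_bisq(e)))

# Robust scale estimate: median absolute residual divided by 0.6745.
sigma_hat <- function(res) median(abs(res)) / 0.6745
```

Both check values should be numerically negligible. In an application the residuals would be divided by sigma_hat before the weight functions are applied, which is equivalent to using 𝑘𝜎 as the tuning constant.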
3 Bounded-Influence Regression
Under certain circumstances, M -estimators can be vulnerable to high-leverage observations. Very-
high-breakdown bounded-influence estimators for regression have been proposed and R functions
for them are presented here. Very-high-breakdown estimates should be avoided, however, unless
we have faith that the model we are fitting is correct, because these estimates do not allow for
diagnosis of model misspecification (Cook et al., 1992).
One bounded-influence estimator is least-trimmed squares (LTS) regression. Order the squared
residuals from smallest to largest:
$$(e^2)_{(1)}, (e^2)_{(2)}, \ldots, (e^2)_{(n)}$$
The LTS estimator chooses the regression coefficients b to minimize the sum of the smallest 𝑚 of
the squared residuals,
$$\mathrm{LTS}(\mathbf{b}) = \sum_{i=1}^{m} (e^2)_{(i)}$$
where, typically, 𝑚 = ⌊𝑛/2⌋ + ⌊(𝑘 + 2)/2⌋, a little more than half of the observations, and the floor
brackets, ⌊ ⌋, denote rounding down to the next smallest integer.
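The LTS criterion itself is simple to evaluate for any candidate coefficient vector, even though minimizing it is hard. The base-R sketch below only evaluates the criterion and does not attempt the minimization. The function name lts_criterion and the simulated data are hypothetical, and the sketch assumes that 𝑘 in the formula for 𝑚 counts the regressors other than the constant.

```r
# Evaluate the LTS criterion for coefficients b: sum of the m smallest squared residuals.
lts_criterion <- function(b, X, y, m) {
  e2 <- sort(drop(y - X %*% b)^2)    # squared residuals, smallest to largest
  sum(e2[seq_len(m)])
}

# Simulated illustration: 50 observations, the first 10 shifted upward.
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
y[1:10] <- y[1:10] + 15
X <- cbind(1, x)

k <- 1                                   # assumed: one regressor besides the constant
m <- floor(n / 2) + floor((k + 2) / 2)   # a little more than half of the observations

b_generating <- c(1, 2)                  # coefficients used to simulate the clean data
b_ols <- qr.solve(X, y)                  # OLS fit to all of the data
lts_generating <- lts_criterion(b_generating, X, y, m)
lts_ols <- lts_criterion(b_ols, X, y, m)
```

Because the criterion ignores the largest squared residuals, the coefficients that generated the clean data should score far better here than the OLS fit, which is pulled toward the shifted observations.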
While the LTS criterion is easily described, the mechanics of fitting the LTS estimator are
complicated (Rousseeuw and Leroy, 1987). Moreover, bounded-influence estimators can produce
unreasonable results in certain circumstances (Stefanski, 1991), and there is no simple formula for
coefficient standard errors.¹
One application of bounded-influence estimators is to provide starting values for M-estimation.
This procedure, along with using the bounded-influence estimate of the error variance, produces
the so-called MM-estimator. The MM-estimator retains the high breakdown point of the
bounded-influence estimator and shares the relatively high efficiency under normality of the traditional
M-estimator. MM-estimators are especially attractive when paired with redescending 𝜓-functions
such as the bisquare, which can be sensitive to starting values.
¹ Statistical inference for the LTS estimator can be performed by bootstrapping, however. See Section 4.3.7 in the
text and the Appendix on bootstrapping.
Call:
lm(formula = prestige ~ income + education, data = Duncan)
Residuals:
Min 1Q Median 3Q Max
-29.54 -6.42 0.65 6.61 34.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.0647 4.2719 -1.42 0.16
income 0.5987 0.1197 5.00 1.1e-05
education 0.5458 0.0983 5.56 1.7e-06
Recall from the discussion of Duncan’s data in Fox and Weisberg (2011) that two observations,
ministers and railroad conductors, serve to decrease the income coefficient and to increase the
education coefficient, as we may verify by omitting these two observations from the regression:
Call:
lm(formula = prestige ~ income + education, data = Duncan, subset = -c(6,
16))
Residuals:
Min 1Q Median 3Q Max
-28.61 -5.90 1.94 5.62 21.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.4090 3.6526 -1.75 0.0870
income 0.8674 0.1220 7.11 1.3e-08
education 0.3322 0.0987 3.36 0.0017
Residual standard error: 11.4 on 40 degrees of freedom
Multiple R-squared: 0.876, Adjusted R-squared: 0.87
F-statistic: 141 on 2 and 40 DF, p-value: <2e-16
Alternatively, let us compute the Huber M-estimator for Duncan’s regression model, using the
rlm (robust linear model) function in the MASS package:
> library(MASS)
> mod.huber <- rlm(prestige ~ income + education, data=Duncan)
> summary(mod.huber)
Coefficients:
Value Std. Error t value
(Intercept) -7.111 3.881 -1.832
income 0.701 0.109 6.452
education 0.485 0.089 5.438
The Huber regression coefficients are between those produced by the least-squares fit to the full
data set and the least-squares fit eliminating the occupations minister and conductor.
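It is instructive to extract and plot (in Figure 2) the final weights used in the robust fit. The
showLabels function from car is employed to label all observations with weights less than 0.8:
> plot(mod.huber$w, ylab="Huber Weight")
> showLabels(1:45, mod.huber$w, rownames(Duncan),
+ id.method= which(mod.huber$w < 0.8), cex.=0.6)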
Ministers and conductors are among the observations that receive the smallest weight.
The function rlm can also fit the bisquare estimator. As we explained, starting values for the
IRLS procedure are potentially more critical for the bisquare estimator; specifying the argument
method="MM" to rlm requests bisquare estimates with start values determined by a preliminary
bounded-influence regression:
> mod.bisq <- rlm(prestige ~ income + education, data=Duncan, method="MM")
> summary(mod.bisq)
[Figure 2 plot: Huber weight versus observation index. Labeled observations: factory.owner, mail.carrier, store.clerk, machinist, contractor, conductor, insurance.agent, reporter, minister.]
Figure 2: Weights from the robust Huber estimator for the regression of prestige on income and
education.
Coefficients:
Value Std. Error t value
(Intercept) -7.389 3.908 -1.891
income 0.783 0.109 7.149
education 0.423 0.090 4.710
Compared to the Huber estimates, the bisquare estimate of the income coefficient is larger, and
the estimate of the education coefficient is smaller. Figure 3 shows a graph of the weights from
the bisquare fit, identifying the observations with the smallest weights:
> plot(mod.bisq$w, ylab="Bisquare Weight")
> showLabels(1:45, mod.bisq$w, rownames(Duncan),
+ id.method= which(mod.bisq$w < 0.8), cex.=0.6)
Finally, the ltsreg function in the MASS package is used to fit Duncan’s model by LTS regression:²
> (mod.lts <- ltsreg(prestige ~ income + education, data=Duncan))
² LTS regression is also the default method for the lqs function, which additionally can fit other bounded-influence
estimators.
[Figure 3 plot: bisquare weight versus observation index. Labeled observations: streetcar.motorman, factory.owner, mail.carrier, store.clerk, machinist, contractor, insurance.agent, conductor, reporter, minister.]
Figure 3: Weights from the robust bisquare estimator for the regression of prestige on income
and education.
Call:
lqs.formula(formula = prestige ~ income + education, data = Duncan,
method = "lts")
Coefficients:
(Intercept) income education
-5.503 0.768 0.432
We begin with a simple simulated example with 𝑁1 “good” observations and 𝑁2 “bad” ones:
> set.seed(10131986) # for reproducibility
> library(MASS) # for mvrnorm
> library(quantreg) # for rq
> l1.data <- function(n1=100, n2=20){
+     # generate n1 'good' observations
+     data <- mvrnorm(n=n1, mu=c(0, 0),
+         Sigma=matrix(c(1, .9, .9, 1), ncol=2))
+     # generate n2 'bad' observations
+     data <- rbind(data, mvrnorm(n=n2,
+         mu=c(1.5, -1.5), Sigma=.2*diag(c(1, 1))))
+     data <- data.frame(data)
+     names(data) <- c("X", "Y")
+     ind <- c(rep(1, n1), rep(2, n2))
+     plot(Y ~ X, data, pch=c("x", "o")[ind],
+         col=c("black", "red")[ind],
+         main=substitute(list(N[1] == n1, N[2] == n2), list(n1=n1, n2=n2)))
+     r1 <- rq(Y ~ X, data=data, tau=0.5) # L1 (median) regression on all the data
+     abline(r1, lwd=2)
+     abline(lm(Y ~ X, data), lty=2, lwd=2, col="red") # OLS on all the data
+     abline(lm(Y ~ X, data, subset=1:n1), lty=3, lwd=2, col="blue") # OLS on good data
+     legend("topleft", c("L1", "OLS", "OLS on good"),
+         inset=0.02, lty=1:3, lwd=2, col=c("black", "red", "blue"),
+         cex=.9)}
> par(mfrow=c(2, 2))
> l1.data(100, 20)
> l1.data(100, 30)
> l1.data(100, 75)
> l1.data(100, 100)
In Figure 4, all four panels have 𝑁1 = 100 observations sampled from a bivariate normal distribution
with means (0, 0)′, variances (1, 1)′, and correlation 0.9. In addition, 𝑁2 observations are sampled
from a bivariate normal distribution with means (1.5, −1.5)′ and covariance matrix 0.2I2, as in the
code above. The value of 𝑁2 varies from panel to panel. In each panel, three regression lines are
shown: OLS fit to the 𝑁1 good data points; OLS fit to all the data; and median regression fit to all
the data. If the goal is to match, more or less, the OLS regression fit to the good data, then the
median regression does a respectable job for 𝑁2 ≤ 30, but it does no better than OLS on all the data
for larger 𝑁2. Of course, in these latter cases the distinction between “good” and “bad” data is hard
to justify.
[Figure 4 plots: four panels of Y versus X (the first two titled N1 = 100, N2 = 20 and N1 = 100, N2 = 30), with "x" marking good points, "o" marking bad points, and a legend for the L1, OLS, and OLS-on-good lines.]
Figure 4: Simulated data with an increasing number of “bad” or “outlying” observations. Three
lines are shown in each panel: the 𝐿1 regression (solid black line); OLS fit to all of the data (broken
red line); OLS fit to the “good” data points (dotted blue line).
[Figure 5 plot: objective functions 𝜌(𝑥) for 𝐿1 and 𝐿2, for 𝑥 from −2 to 2.]
5.2 𝐿1 Facts
1. The 𝐿1 estimator is the MLE if the errors are independent with a double-exponential
distribution.
2. In Equation 2 (page 9) if x consists only of the constant regressor (1), then the 𝐿1 estimator
is the median.
3. Computations are not nearly as easy as for least squares, because a linear programming
solution is required for 𝐿1 regression.
4. If the 𝑛 × (𝑝 + 1) model matrix X = (x1 , . . . , x𝑛 )′ is of full column rank 𝑝 + 1, and if h is a set that
indexes exactly 𝑝 + 1 of the rows of X, then there is always an h such that the 𝐿1 estimate
$\tilde{\boldsymbol{\beta}}$ fits these 𝑝 + 1 points exactly, so
$\tilde{\boldsymbol{\beta}} = (\mathbf{X}_h'\mathbf{X}_h)^{-1}\mathbf{X}_h'\mathbf{y}_h = \mathbf{X}_h^{-1}\mathbf{y}_h$. Of course the number of
potential subsets is large, so this may not help much in the computations.
7. In general 𝐿1 regression estimates the median of 𝑦∣x, not the conditional mean.
8. Suppose we have Equation 2 (page 9) with the errors independent and identically distributed
from a distribution 𝐹 with density 𝑓. The population median is $\xi_\tau = F^{-1}(\tau)$ with 𝜏 = 0.5,
and the sample median is $\hat{\xi}_{.5} = \hat{F}^{-1}(\tau)$. We assume a standardized version of 𝑓 so
$f(u) = (1/\sigma) f_0(u/\sigma)$. Write $\mathbf{Q}_n = n^{-1}\sum \mathbf{x}_i\mathbf{x}_i'$, and suppose that in large samples
$\mathbf{Q}_n \to \mathbf{Q}_0$, a fixed matrix. We will then have
$$\sqrt{n}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) \sim \mathrm{N}(\mathbf{0}, \omega\mathbf{Q}_0^{-1})$$
where $\omega = \sigma^2\tau(1-\tau)/\{f_0[F_0^{-1}(\tau)]\}^2$ and 𝜏 = 0.50. For example, if 𝑓 is the standard normal
density, $f[F_0^{-1}(\tau)] = 1/\sqrt{2\pi} = 0.399$, and $\sqrt{\omega} = 0.5\sigma/0.399 = 1.25\sigma$, so in the normal case
the standard deviations of the 𝐿1 estimators are 25% larger than the standard deviations of
the OLS estimators.
9. If 𝑓 were known, asymptotic Wald tests and confidence intervals could be based on percentiles
of the normal distribution. In practice, $f[F^{-1}(\tau)]$ must be estimated. One standard method
due to Siddiqui is to estimate its reciprocal by the difference quotient
$$\frac{1}{\hat{f}[F^{-1}(\tau)]} = \frac{\hat{F}^{-1}(\tau + h) - \hat{F}^{-1}(\tau - h)}{2h}$$
for some bandwidth parameter ℎ. This approach is closely related to density estimation, and
so the value of ℎ used in practice is selected by a method appropriate for density estimation.
Alternatively, $f[F^{-1}(\tau)]$ can be estimated using a bootstrap procedure.
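10. For errors that are not independent and identically distributed, suppose that 𝜉𝑖 (𝜏 ) is the
𝜏-quantile for the distribution of the 𝑖th error. One can show that
$$\sqrt{n}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) \sim \mathrm{N}\left[\mathbf{0}, \tau(1-\tau)\mathbf{H}^{-1}\mathbf{Q}_0\mathbf{H}^{-1}\right]$$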
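Several of the facts above can be checked numerically in base R. The sketch below is illustrative only: the simulated sample, the variable names, and the bandwidth ℎ = 0.1 are arbitrary assumptions, and optimize is used as a generic one-dimensional minimizer rather than the linear-programming method that 𝐿1 regression actually requires.

```r
# Fact 2: with only a constant regressor, the L1 criterion is minimized at the median.
set.seed(3)
y <- rexp(101)                                   # skewed sample, odd n so the median is unique
l1_loss <- function(m) sum(abs(y - m))
m_hat <- optimize(l1_loss, interval = range(y), tol = 1e-8)$minimum

# Fact 8: for standard normal errors and tau = 0.5,
# sqrt(omega)/sigma = sqrt(tau * (1 - tau)) / f0(F0^{-1}(tau)).
tau <- 0.5
ratio_normal <- sqrt(tau * (1 - tau)) / dnorm(qnorm(tau))

# Fact 9: a Siddiqui-type difference quotient of sample quantiles,
# which estimates the reciprocal of the density at the tau-quantile.
h <- 0.1                                         # assumed bandwidth, chosen arbitrarily
sparsity_hat <- unname(quantile(y, tau + h) - quantile(y, tau - h)) / (2 * h)
```

Here m_hat should agree with median(y) up to the optimizer's tolerance, and ratio_normal evaluates to about 1.25, the inflation of the 𝐿1 standard deviations relative to OLS under normal errors.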
6 Quantile regression
6.1 Sample and Population Quantiles
For a sample 𝑥1 , . . . , 𝑥𝑛 , for any 0 < 𝜏 < 1 the 𝜏 th sample quantile is the smallest value that
exceeds 𝜏 × 100% of the data. In a population with distribution 𝐹 , we define the 𝜏 th population
quantile to be the solution to
$$\xi_\tau = F^{-1}(\tau)$$
𝐿1 is a special case of quantile regression in which we minimize the 𝜏 = .50-quantile, but a similar
calculation can be performed for any 0 < 𝜏 < 1, where the objective function 𝜌𝜏 (𝑢) is called in this
instance a check function,
𝜌𝜏 (𝑢) = 𝑢 × [𝜏 − 𝐼(𝑢 < 0)] (4)
where 𝐼 is the indicator function (more on check functions later). Figure 6 shows the check function
in Equation 4 for 𝜏 ∈ {.25, .5, .9}:
[Figure 6 plot: check function 𝜌𝜏 (𝑥) for 𝜏 = .25, .5, and .9, for 𝑥 from −2 to 2.]
Figure 6: Check function for three values of 𝜏 for quantile regression. For 𝜏 = 0.5, positive and
negative errors are treated symmetrically, but for the other values of 𝜏 , positive and negative errors
are treated asymmetrically.
Quantile regression is just like 𝐿1 regression with 𝜌𝜏 replacing 𝜌.5 in Equation 3 (page 9), and with
𝜏 replacing 0.5 in the asymptotics.
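The check function in Equation 4 is easy to code, and minimizing its sum over a single location parameter recovers a sample quantile, just as minimizing the sum of absolute deviations recovers the median. The base-R sketch below is illustrative: the function name check_fn and the simulated sample are hypothetical, and optimize stands in for the linear-programming methods used in practice.

```r
# Check function from Equation 4: rho_tau(u) = u * [tau - I(u < 0)].
check_fn <- function(u, tau) u * (tau - (u < 0))

# Positive and negative errors are weighted asymmetrically unless tau = 0.5.
loss_pos <- check_fn(2, tau = 0.25)     # 2 * 0.25
loss_neg <- check_fn(-2, tau = 0.25)    # -2 * (0.25 - 1)

# Minimizing the summed check function over a constant gives the tau-th sample quantile.
set.seed(4)
x <- rnorm(201)
tau <- 0.25
objective <- function(q) sum(check_fn(x - q, tau))
q_hat <- optimize(objective, interval = range(x), tol = 1e-8)$minimum

# Reference value: the inverse of the empirical distribution function (quantile type 1).
q_ref <- unname(quantile(x, tau, type = 1))
```

With 201 observations and 𝜏 = 0.25 the minimizer is a single order statistic, so q_hat should match q_ref up to the optimizer's tolerance.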
> plot(MaxSalary ~ Score, data=salarygov,
+ xlim=c(100, 1000), ylim=c(1000, 10000),
+ pch=c(2, 16)[fdom + 1], col=c("black", "green")[fdom + 1])
> mods <- rq(MaxSalary ~ bs(Score, 5), tau=c(.1, .5, .9),
+ data=salarygov[!fdom, ])
> mods
Call:
rq(formula = MaxSalary ~ bs(Score, 5), tau = c(0.1, 0.5, 0.9),
data = salarygov[!fdom, ])
Coefficients:
tau= 0.1 tau= 0.5 tau= 0.9
(Intercept) 1207.0 1507.3 1466.5
bs(Score, 5)1 -100.9 -151.9 437.4
bs(Score, 5)2 779.9 974.1 1300.7
bs(Score, 5)3 2010.6 2255.9 3176.0
bs(Score, 5)4 3724.4 3822.2 5010.0
bs(Score, 5)5 5122.0 6147.1 5733.8
We begin by defining an indicator variable for the female-dominated job classes, and a vector for the
𝜏s. We will graph the non-female-dominated classes in black and the female-dominated classes in
green. The quantile regression is fit using the rq function in the quantreg package. Its arguments
are similar to those for lm except for a new argument for setting tau; the default is tau=0.5 for
𝐿1 regression, and here we specify three values of 𝜏 . The fitted coefficients for the B-splines are
then displayed, and although these are not easily interpretable, the important point is that they
are different for each value of 𝜏 . The predict function returns a matrix with three columns, one
for each 𝜏 , and we use these values to add fitted regression lines to the graph. We fit the model to
the non-female-dominated occupations only, as is common in gender-discrimination studies.
The quantile regressions are of interest here to describe the variation in the relationship between
salary and score in the non-female-dominated job classes. Most of the female-dominated classes
fall below the median line and many below the 0.1-quantile. For extreme values of Score the more
extreme quantiles are very poorly estimated, which accounts for the crossing of the median and the
0.9 estimated quantiles for large values of Score.
[Figure plot: MaxSalary versus Score, with fitted 0.1, 0.5, and 0.9 quantile curves; points distinguish non-female-dominated and female-dominated job classes.]
Coefficients:
coefficients lower bd upper bd
(Intercept) -6.4083 -12.4955 -3.6003
income 0.7477 0.4719 0.9117
education 0.4587 0.2195 0.6610
The summary method for rq objects reports 95-percent confidence intervals for the regression
coefficients; it is also possible to obtain coefficient standard errors (see ?summary.rq). The 𝐿1 estimates
here are very similar to the M-estimates based on Huber’s weight function. Table 2 summarizes
the various estimators that we applied to Duncan’s regression.
7 Complementary Reading and References
Robust regression is described in Fox (2008, Chap. 19). Koenker (2005) provides an extensive
treatment of quantile regression. A recent mathematical treatment of robust regression is given by
Huber and Ronchetti (2009). Andersen (2007) provides an introduction to the topic.
References
Andersen, R. (2007). Modern Methods for Robust Regression. Sage, Thousand Oaks, CA.
Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992). Comparison of model misspecification
diagnostics using residuals from least mean of squares and least median of squares fits. Journal
of the American Statistical Association, 87(418):419–424.
Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage, Thousand Oaks,
CA, second edition.
Fox, J. and Weisberg, S. (2011). An R Companion to Applied Regression. Sage, Thousand Oaks,
CA, second edition.
Huber, P. and Ronchetti, E. M. (2009). Robust Statistics. Wiley, Hoboken, NJ, second edition.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection. Wiley, Hoboken,
NJ.