LARS Based on S-Estimators
Claudio Agostinelli¹ and Matias Salibian-Barrera²
¹ Dipartimento di Statistica, Ca' Foscari University, Venice, Italy, claudio@unive.it
² Department of Statistics, The University of British Columbia, Vancouver, BC, Canada, matias@stat.ubc.ca
Abstract. We consider the problem of selecting a parsimonious subset of explanatory variables from a potentially large collection of covariates. We are concerned
with the case when data quality may be unreliable (e.g. there might be outliers
among the observations). When the number of available covariates is moderately
large, fitting all possible subsets is not a feasible option. Sequential methods like
forward or backward selection are generally greedy and may fail to include important predictors when these are correlated. To avoid this problem, Efron et al.
(2004) proposed the Least Angle Regression algorithm to produce an ordered list
of the available covariates (sequencing) according to their relevance. We introduce
outlier robust versions of the LARS algorithm based on S-estimators for regression
(Rousseeuw and Yohai (1984)). This algorithm is computationally efficient and suitable even when the number of variables exceeds the sample size. Simulation studies
show that it is also robust to the presence of outliers in the data and compares
favourably to previous proposals in the literature.
Introduction
model assumptions and we are interested in predicting the non-outlying observations. Therefore, we consider model selection methods for linear models
based on robust methods.
As is the case with point estimation and other inference procedures, likelihood-type model selection methods (e.g. AIC (Akaike (1970)), Mallows' Cp (Mallows (1973)) and BIC (Schwarz (1978))) may be severely affected
by a small proportion of atypical observations in the data. These outliers
may not necessarily consist of large values, but might not follow the model
that applies to the majority of the data. Model selection procedures that are
resistant to the presence of outliers in the sample have only recently started
to receive some attention in the literature. Seminal papers include Hampel
(1983), Ronchetti (1985, 1997) and Ronchetti and Staudte (1994). Other proposals include Sommer and Staudte (1995), Ronchetti, Field and Blanchart
Qian and Künsch (1998), Agostinelli (2002a, 2002b), Agostinelli and
Markatou (2005), Morgenthaler, Welsch and Zenide (2003). See also the recent book by Maronna, Martin and Yohai (2006). These proposals are based
on robustified versions of classical selection criteria (e.g. robust Cp , robust
final prediction error, etc.). More recently, Müller and Welsh (2005) proposed
a model selection criterion that combines a measure of goodness-of-fit, a
penalty term to avoid over-fitting, and the expected prediction error conditional on the data. Salibian-Barrera and Van Aelst (2008) use the fast and robust bootstrap of Salibian-Barrera and Zamar (2002) to obtain a faster bootstrap-based model selection method that is feasible to compute for larger numbers of covariates. Although less expensive from a computational point of view than the stratified bootstrap of Müller and Welsh (2005), this method, like the previous ones, needs to compute the estimator on the full model.
A different approach to variable selection that is attractive when the number of explanatory variables is large is based on ordering the covariates according to their estimated importance in the full model. Forward stepwise
and backward elimination procedures are examples of this approach, whereby
in each step of the procedure a variable may enter or leave the linear model
(see, e.g. Weisberg (1985) or Miller (2002)). With backward elimination one
starts with the full model and then finds the best possible submodel with
one less covariate in it. This procedure is repeated until we fit a model with
a single covariate or a criterion is reached. A similar procedure is forward
stepwise, where we first select the covariate (say x1 ) with the highest absolute correlation with the response variable y. We take the residuals of the
regression of y on x1 as our new response, project all covariates orthogonally to x1 and add the variable with the highest absolute correlation to the
model. At the same step, variables in the model may be deleted according to
a criterion. These steps are repeated until no variables are added or deleted.
Unfortunately, when p is large (p = 100, for example), these procedures become infeasible for highly robust estimators; furthermore, these algorithms
are known to be greedy and may relegate important covariates if they are
correlated with those selected earlier in the sequence.
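A minimal (non-robust) sketch of this classical forward stepwise idea, with function and variable names of our own choosing, is the following: at each step the covariate most correlated with the current residuals is added, and the remaining covariates are projected orthogonally to it.

```python
import numpy as np

def forward_stepwise(X, y, max_steps):
    """Greedy forward selection: repeatedly add the covariate most
    correlated (in absolute value) with the current residuals, then
    orthogonalize the remaining covariates with respect to it."""
    n, p = X.shape
    active, resid, Xw = [], y.astype(float), X.astype(float).copy()
    for _ in range(min(max_steps, p)):
        norms = np.linalg.norm(Xw, axis=0)
        norms[norms < 1e-12] = np.inf            # ignore (near) collinear leftovers
        cors = (Xw.T @ resid) / norms            # correlations up to a common factor
        cors[active] = 0.0                       # skip variables already selected
        j = int(np.argmax(np.abs(cors)))
        active.append(j)
        xj = Xw[:, j] / np.linalg.norm(Xw[:, j])
        resid = resid - (xj @ resid) * xj        # residuals after regressing on x_j
        Xw = Xw - np.outer(xj, xj @ Xw)          # project covariates orthogonally to x_j
    return active
```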
The Least Angle Regression (LARS) of Efron et al. (2004) is a generalization of stepwise methods, where the length of the step is selected so as
to strike a balance between fast-but-greedy and slow-but-conservative
alternatives, such as those in stagewise selection (see, e.g. Hastie, Tibshirani and
Friedman (2001)). It is easy to verify that this method is not robust to the
presence of even a small proportion of atypical observations. McCann and Welsch
(2007) proposed to add an indicator variable for each observation and then
run the usual LARS on the extended set of covariates. When high-leverage
outliers are possible, they suggest building models from randomly drawn
subsamples of the data, and then selecting the best of them based on their
(robustly estimated) prediction error. Khan, Van Aelst and Zamar (2007b)
showed that the LARS algorithm can be expressed in terms of the pairwise
sample correlations between covariates and the response variable, and proposed to apply this algorithm using robust correlation estimates. This is a
plug-in proposal in the sense that it takes a method derived using least
squares or L2 estimators and replaces the required point estimates by robust
counterparts.
In this paper we derive an algorithm based on LARS, but using an S-regression estimator (Rousseeuw and Yohai (1984)). Section 2 contains a brief
description of the LARS algorithm, while Section 3 describes our proposal.
Simulation results are discussed in Section 4 and concluding remarks can be
found in Section 5.
The LARS algorithm

Consider the linear regression model $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} + \epsilon_i$, $i = 1, \dots, n$, and assume, without loss of generality, that the response and the covariates have been standardized:
$$\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{i,j} = 0, \qquad \sum_{i=1}^{n} x_{i,j}^2 = 1, \qquad \text{for } 1 \le j \le p,$$
so that the linear model above does not contain the intercept term.
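A minimal sketch of this standardization step (the helper name is ours):

```python
import numpy as np

def standardize(X, y):
    """Centre y and centre/scale the columns of X so that
    sum_i y_i = 0, sum_i x_ij = 0 and sum_i x_ij**2 = 1."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    Xc = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    return Xc, yc
```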
The Least Angle Regression algorithm (LARS) is a generalization of the
Forward Stagewise procedure. The latter is an iterative technique that starts
with the predictor vector $\hat{\mu} = 0 \in \mathbb{R}^n$, and at each step sets
$$\hat{\mu} \leftarrow \hat{\mu} + \epsilon \,\mathrm{sign}(c_j)\, x_{(j)},$$
where $\epsilon > 0$ is a small constant, $c_j$ denotes the current correlation between the $j$-th covariate and the residual vector $y - \hat{\mu}$, and $j$ is the index attaining the largest $|c_j|$. The LARS algorithm also starts from $\hat{\mu} = 0$. Let $\hat{\mu}_A$ be the current predictor and let
$$c = X'(y - \hat{\mu}_A),$$
where $X \in \mathbb{R}^{n \times p}$ denotes the design matrix. In other words, $c$ is the vector of current correlations $c_j$, $j = 1, \dots, p$. Let $A$ denote the active set, which corresponds to those covariates with largest absolute correlations: $C = \max_j \{|c_j|\}$ and $A = \{j : |c_j| = C\}$. Assume, without loss of generality, that $A = \{1, \dots, m\}$. Let $s_j = \mathrm{sign}(c_j)$ for $j \in A$, and let $X_A \in \mathbb{R}^{n \times m}$ be the matrix formed by the corresponding signed columns of the design matrix $X$, $s_j x_{(j)}$. Note that the vector $u_A = v_A / \|v_A\|$, where
$$v_A = X_A (X_A' X_A)^{-1} 1_A,$$
satisfies
$$X_A' u_A = A_A 1_A, \qquad (1)$$
where $1_A$ denotes an $m$-vector of ones and $A_A = 1/\|v_A\|$. The LARS update of the current predictor is
$$\hat{\mu}_A \leftarrow \hat{\mu}_A + \hat{\gamma}\, u_A,$$
where $\hat{\gamma}$ is taken to be the smallest positive value such that a new covariate joins the active set $A$ of explanatory variables with largest absolute correlation. More specifically, note that, if for each $\gamma$ we let $\mu(\gamma) = \hat{\mu}_A + \gamma\, u_A$, then for each $j = 1, \dots, p$ we have
$$c_j(\gamma) = \mathrm{cor}\bigl(y - \mu(\gamma), x_{(j)}\bigr) = x_{(j)}'\bigl(y - \mu(\gamma)\bigr) = c_j - \gamma\, a_j,$$
where $a_j = x_{(j)}' u_A$. For $j \in A$, equation (1) implies that
$$|c_j(\gamma)| = C - \gamma\, A_A,$$
so all maximal current correlations decrease at a constant rate along this direction. We then determine the smallest positive value of $\gamma$ that makes the correlation between the current active covariates and the residuals equal to that of another covariate $x_{(k)}$ not in the active set $A$. This variable enters the model, the active set becomes
$$A \leftarrow A \cup \{k\},$$
and the maximal correlation is updated to $C \leftarrow C - \hat{\gamma}\, A_A$. We refer the interested
reader to Efron et al. (2004) for more details.
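As an illustrative sketch only (our code, not the authors'), a single LARS step can be written as follows; it assumes a standardized design matrix, a non-empty active set and at least one inactive covariate, and uses the step-length rule of Efron et al. (2004).

```python
import numpy as np

def lars_step(X, y, mu, active):
    """One LARS step: compute the equiangular direction u_A, the step
    length gamma at which an inactive covariate ties the maximal
    correlation, and return the updated predictor and active set."""
    p = X.shape[1]
    c = X.T @ (y - mu)                              # current correlations
    C = np.max(np.abs(c))
    s = np.sign(c[active])
    XA = X[:, active] * s                           # signed active columns
    vA = XA @ np.linalg.solve(XA.T @ XA, np.ones(len(active)))
    AA = 1.0 / np.linalg.norm(vA)                   # A_A in equation (1)
    uA = vA * AA                                    # equiangular direction
    a = X.T @ uA                                    # a_j = x_j' u_A
    candidates = []                                 # smallest positive step
    for j in range(p):
        if j in active:
            continue
        for g in ((C - c[j]) / (AA - a[j]), (C + c[j]) / (AA + a[j])):
            if np.isfinite(g) and g > 1e-12:
                candidates.append((g, j))
    gamma, k = min(candidates)
    return mu + gamma * uA, active + [k]
```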
LARS based on S-estimators

S-regression estimators (Rousseeuw and Yohai (1984)) are defined as $\hat{\beta} = \arg\min_{\beta} \hat{\sigma}(\beta)$, where the residual scale $\hat{\sigma}(\beta)$ satisfies
$$\frac{1}{n} \sum_{i=1}^{n} \rho\!\left( \frac{r_i(\beta)}{\hat{\sigma}(\beta)} \right) = b,$$
$r_i(\beta) = y_i - x_i'\beta$, $i = 1, \dots, n$, are the regression residuals, $\rho : \mathbb{R} \to \mathbb{R}^{+}$ is a symmetric, bounded, non-decreasing and continuous function, and $b \in (0, 1)$ is a fixed constant. The choice $b = E_{F_0}(\rho)$ ensures that the resulting estimator is consistent when the errors have distribution function $F_0$.
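To make the definition concrete, the following sketch computes the scale $\hat{\sigma}$ solving the equation above for a fixed residual vector; Tukey's bisquare $\rho$ (rescaled to a maximum of one) and $b = 0.5$ are illustrative choices of ours, not prescribed by the text, and the equation is solved by simple bisection.

```python
import numpy as np

def rho_bisquare(t, c=1.547):
    """Tukey's bisquare rho, rescaled to a maximum of 1; with c = 1.547
    and b = 0.5 the scale has a 50% breakdown point under normal errors."""
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u ** 2) ** 3

def m_scale(r, b=0.5, tol=1e-8):
    """Solve (1/n) sum_i rho(r_i / sigma) = b for sigma by bisection."""
    lo, hi = 1e-12, np.max(np.abs(r)) * 1e3   # mean rho ~ 1 at lo, ~ 0 at hi
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if np.mean(rho_bisquare(r / mid)) > b:
            lo = mid          # average still too large: sigma must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)
```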
For a given active set $A$ of $k$ covariates let $\hat{\beta}_A$, $\hat{\beta}_{0,A}$, $\hat{\sigma}_A$ be the S-estimators of regressing the current residuals on the $k$ active variables with indices in $A$. Consider the parameter vector $\theta = (\beta, \beta_0, \sigma)$ that satisfies
$$\frac{1}{n - k - 1} \sum_{i=1}^{n} \rho\!\left( \frac{r_i - x_{i,k}'(\beta - \hat{\beta}_A) - \beta_0}{\sigma} \right) = b.$$
For each candidate covariate define
$$\mathrm{cov}_j(\theta) = \sum_{i=1}^{n} \rho'\!\left( \frac{r_i - x_{i,k}'(\beta - \hat{\beta}_A) - \beta_0}{\sigma} \right) x_{ij},$$
and the corresponding correlation is
$$\mathrm{corr}_j(\theta) = \mathrm{cov}_j(\theta) \Big/ \left[ \sum_{i=1}^{n} \rho'\!\left( \frac{r_i - x_{i,k}'(\beta - \hat{\beta}_A) - \beta_0}{\sigma} \right)^{2} \right]^{1/2}.$$
The covariate that joins the active set is the one attaining $\max_{1 \le j \le p} |\mathrm{corr}_j(\theta_0)|$, where $\theta_0 = (\hat{\beta}_A, \hat{\beta}_{0,A}, \hat{\sigma}_A)$ denotes the current S-estimates.
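To convey the overall structure of the sequencing loop, here is a heavily simplified sketch: at each step an S-regression is fitted on the active covariates (through `s_regression`, a hypothetical placeholder for any S-estimator implementation) and the inactive covariate with the largest bounded-influence correlation with the standardized residuals joins the active set. The MAD scale, the weight function and the scoring rule below are our simplifications and do not reproduce the exact step-length and correlation formulas of the proposal.

```python
import numpy as np

def psi_bisquare(t, c=1.547):
    """Derivative of Tukey's bisquare rho (up to a constant factor)."""
    u = t / c
    return np.where(np.abs(u) < 1.0, t * (1.0 - u ** 2) ** 2, 0.0)

def robust_sequence(X, y, s_regression, n_steps):
    """Order covariates by repeatedly adding the one with the largest
    robust correlation with the current residuals.  `s_regression` is a
    hypothetical helper returning S-estimates (beta, beta0) of regressing
    its second argument on the columns of its first."""
    n, p = X.shape
    active, resid = [], y - np.median(y)
    for _ in range(n_steps):
        sigma = 1.4826 * np.median(np.abs(resid))   # simple robust scale
        w = psi_bisquare(resid / sigma)             # bounded residual weights
        scores = np.abs(X.T @ w)                    # robust correlation scores
        scores[active] = -np.inf                    # skip covariates already in
        active.append(int(np.argmax(scores)))
        beta, beta0 = s_regression(X[:, active], y) # refit on the active set
        resid = y - beta0 - X[:, active] @ beta
    return active
```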
Simulation results
The proposed algorithm was compared by simulation with previous robust sequencing proposals. The results, summarized in the figures below, show that it is robust to the presence of outliers in the data and compares favourably to existing methods.

[Figures: three panels reporting the simulation results against MODEL SIZE (horizontal axis, 10 to 50; vertical axis, 0 to 5).]
Conclusion
References
AGOSTINELLI, C. (2002a): Robust model selection in regression via weighted likelihood methodology. Statistics and Probability Letters 56, 289-300.
AGOSTINELLI, C. (2002b): Robust stepwise regression. Journal of Applied Statistics 29(6), 825-840.
AGOSTINELLI, C. and MARKATOU, M. (2005): Robust model selection by cross-validation via weighted likelihood. Unpublished manuscript.
AKAIKE, H. (1970): Statistical predictor identification. Annals of the Institute of Statistical Mathematics 22, 203-217.
EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004): Least
angle regression. The Annals of Statistics 32(2), 407-499.
HAMPEL, F.R. (1983): Some aspects of model choice in robust statistics. In: Proceedings of the 44th Session of the ISI, volume 2, 767-771. Madrid.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer-Verlag, New York.
KHAN, J.A., VAN AELST, S., and ZAMAR, R.H. (2007a): Building a robust
linear model with forward selection and stepwise procedures. Computational
Statistics and Data Analysis 52, 239-248.
KHAN, J.A., VAN AELST, S., and ZAMAR, R.H. (2007b): Robust Linear Model
Selection Based on Least Angle Regression. Journal of the American Statistical
Association 102, 1289-1299.
MALLOWS, C.L. (1973): Some comments on Cp . Technometrics 15, 661-675.
MARONNA, R.A., MARTIN, R.D. and YOHAI, V.J. (2006): Robust Statistics: Theory and Methods. Wiley, New York.
McCANN, L. and WELSCH, R.E. (2007): Robust variable selection using least
angle regression and elemental set sampling. Computational Statistics and Data Analysis 52, 249-257.
MILLER, A.J. (2002): Subset selection in regression. Chapman-Hall, New York.
MORGENTHALER, S., WELSCH, R.E. and ZENIDE, A. (2003): Algorithms for robust model selection in linear regression. In: M. Hubert, G. Pison, A. Struyf and S. Van Aelst (Eds.): Theory and Applications of Recent Robust Methods. Birkhäuser, Basel, 195-206.
MÜLLER, S. and WELSH, A.H. (2005): Outlier robust model selection in linear regression. Journal of the American Statistical Association 100, 1297-1310.