SUMMARY
Partial Least Squares Regression (PLSR) is a linear regression technique developed to deal
with high-dimensional regressors and one or several response variables. In this paper we introduce robustified versions of the SIMPLS algorithm, which is the leading PLSR algorithm because of its speed and efficiency. Because SIMPLS is based on the empirical cross-covariance
matrix between the response variables and the regressors and on linear least squares re-
gression, the results are affected by abnormal observations in the data set. Two robust
methods, RSIMCD and RSIMPLS, are constructed from a robust covariance matrix for
high-dimensional data and robust linear regression. We introduce robust RMSECV and
RMSEP values for model calibration and model validation. Diagnostic plots are constructed
to visualize and classify the outliers. Several simulation results and the analysis of real data
sets show the effectiveness and the robustness of the new approaches. Because RSIMPLS is roughly twice as fast as RSIMCD, it is the method we recommend.
KEY WORDS : Partial Least Squares Regression; SIMPLS; Principal Component Analysis; Ro-
bust Regression.
∗ Correspondence to: M. Hubert, Assistant Professor, Department of Mathematics, Katholieke Universiteit Leuven, W. de Croylaan 54, B-3001 Leuven, Belgium, mia.hubert@wis.kuleuven.ac.be, tel. +32/16322023, fax +32/16322831
1 Introduction
The PLS (NIPALS) regression technique was originally developed for econometrics by Wold
([29], [30]), but has become a very popular algorithm in other fields such as chemometrics, the social sciences and the food industry. It is used to model the linear relation between a set of
regressors and a set of response variables, which can then be used to predict the value of the
response variables for a new sample. A typical example is multivariate calibration where the
x-variables are spectra and the y-variables are the concentrations of certain constituents.
Throughout the paper we will print column vectors in bold and denote the transpose of a vector v or a matrix V as v′ or V′. Sometimes, the dimension of a matrix will be indicated by a subscript, e.g. Xn,p stands for an (n × p)-dimensional matrix. We use the notation (x1 , . . . , xn )′ = Xn,p for the regressors and (y1 , . . . , yn )′ = Yn,q for the response variables. The merged data set (Xn,p , Yn,q ) will be denoted as Zn,m , with m = p + q.
The linear regression model we consider is
$$y_i = \beta_0 + B_{q,p}'\,x_i + e_i, \qquad (1)$$
where the error terms ei satisfy E(ei ) = 0 and Cov(ei ) = Σe of size q. The unknown q-dimensional intercept is denoted as β0 = (β01 , . . . , β0q )′ and Bp,q represents the unknown slope matrix. Typically in chemometrics, the number of observations n is very small (some tens), whereas the number of regressors p is very large (some hundreds or thousands). The number of response variables q is in general limited to at most five.
Because multicollinearity is present, the classical multiple linear regression (MLR) estimates have a very large variance; hence biased estimation procedures, such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR) [13], are then
performed. In this paper, we will focus on PLSR. We will use the notation PLS1 when there
is only one response variable and the notation PLS2 otherwise.
It is well known that the popular algorithms for PLSR (NIPALS [29] and SIMPLS [2]) are
very sensitive to outliers in the data set. This will also be clearly shown in our simulation
study in Section 6. In this paper, we introduce several robust methods for PLSR which
are resistant to outlying observations. Some robustified versions of the PLS1 and PLS2
algorithms have already been proposed in the past. A first algorithm [28] has been developed
by replacing the different univariate regression steps in the PLS2 algorithm by some robust
alternatives. Iteratively reweighted algorithms have been obtained in [1] and [16]. These
algorithms are only valid for a one-dimensional response variable and they are not resistant
to leverage points. In [5], a robust PLS1 method is obtained by robustifying the sample
covariance matrix of the x-variables and the sample cross-covariance matrix between the
x- and y-variables. For this, the highly robust Stahel-Donoho estimator ([23],[3]) is used, but unfortunately it cannot be applied to high-dimensional regressors (n ≪ p) because
the subsampling scheme used to compute the estimator starts by drawing subsets of size
p + 2. Moreover, the method cannot be extended to PLS2. Recently, a robust method for
PCR which also applies to high-dimensional x-variables and multiple y-variables has been
introduced in [9].
In this paper we present several robustifications of the SIMPLS algorithm. SIMPLS is
very popular because it is faster than PLS1 and PLS2 as implemented using NIPALS, and
the results are easier to interpret. Moreover, if there is only one response variable (q = 1),
SIMPLS and PLS1 yield the same results. An outline of the SIMPLS algorithm is presented
in Section 2. We recall that SIMPLS depends on the sample cross-covariance matrix between
the x- and y-variables, and on linear least squares regression. In Section 3, we introduce
several robust methods which are obtained by using a robust covariance matrix for high-
dimensional data sets ([8]), and a robust regression method. The proposed algorithms are
fast compared to previously developed robust methods and they can handle cases where n ≪ p
and q ≥ 1. Section 4 discusses the selection of the number of components and the model
validation. In Section 5 we introduce several diagnostic plots which can help us to identify
the outliers and classify them in several types. The robustness of the proposed algorithms
is demonstrated in Section 6 with several simulations. In Section 7 we apply one of the new
methods on a real data set. We finish with some conclusions in Section 8.
SIMPLS is based on the bilinear model
$$x_i = \bar x + P_{p,k}\,\tilde t_i + g_i \qquad (2)$$
$$y_i = \bar y + A_{q,k}'\,\tilde t_i + f_i. \qquad (3)$$
In this model, x̄ and ȳ denote the mean of the x- and the y-variables. The t̃i are called the scores, which are k-dimensional with k ≪ p, whereas Pp,k is the matrix of x-loadings. The residuals of each equation are represented by the gi and fi respectively. The matrix Ak,q represents the slope matrix in the regression of yi on t̃i . Note that in the PLS literature the matrix Ak,q is usually denoted as Q′k,q and the columns of Qq,k by the y-loadings qa . We prefer to use another notation because qa will be used for the PLS weight vector, see (4).
The bilinear structure (2) and (3) implies a two-step algorithm. After mean-centering
the data, SIMPLS will first construct k latent variables T̃n,k = (t̃1 , . . . , t̃n )0 and secondly, the
responses will be regressed onto these k variables. We will refer to the columns of T̃n,k as
the components.
Consider first the construction of the components. In contrast with PCR, the k com-
ponents are not solely determined by the x-variables. They are obtained as a linear combination of the x-variables which has maximum covariance with a certain linear com-
bination of the y-variables. More precisely, let X̃n,p and Ỹn,q denote the mean-centered data
matrices, with x̃i = xi − x̄ and ỹi = yi − ȳ. The normalized PLS weight vectors ra and qa
(kra k = kqa k = 1) are then defined as the vectors that maximize for each a = 1, . . . , k
$$\operatorname{cov}(\tilde Y_{n,q}\,q_a,\ \tilde X_{n,p}\,r_a) \;=\; q_a'\,\frac{\tilde Y_{q,n}'\,\tilde X_{n,p}}{n-1}\,r_a \;=\; q_a'\,S_{yx}\,r_a \qquad (4)$$
where Syx = S′xy , with Sxy = X̃′p,n Ỹn,q /(n − 1) the empirical cross-covariance matrix between the x- and
y-variables. The elements of the scores t̃i are then defined as linear combinations of the
mean-centered data: t̃ia = x̃0i ra , or equivalently T̃n,k = X̃n,p Rp,k with Rp,k = (r1 , . . . , rk ).
The maximization problem of (4) has one straightforward solution: r1 and q1 are the first
left and right singular vectors of Sxy . This implies that q1 is the dominant eigenvector of
Syx Sxy and r1 = Sxy q1 . To obtain more than one solution, the components X̃rj are required
to be orthogonal:
$$r_j'\,\tilde X'\tilde X\,r_a \;=\; \sum_{i=1}^{n}\tilde t_{ij}\,\tilde t_{ia} \;=\; 0, \qquad a > j. \qquad (5)$$
To satisfy this condition we first introduce the x-loading pj that describes the linear relation
between the x-variables and the jth component X̃rj . It is computed as
$$p_j = (r_j'\,S_x\,r_j)^{-1}\,S_x\,r_j, \qquad (6)$$
with Sx the empirical covariance matrix of the x-variables. This definition implies that (5) is fulfilled when p′j ra = 0 for a > j. The PLS weight vector ra thus has to be orthogonal
to all previous x-loadings Pa−1 = [p1 , . . . , pa−1 ]. Consequently, ra and qa are computed
as the first left and right singular vectors of Sxy projected on a subspace orthogonal to
Pa−1 . This projection is performed by constructing an orthonormal base {v1 , . . . , va−1 } of
{p1 , . . . , pa−1 }. Next, Sxy is deflated:
$$S^{a}_{xy} = S^{a-1}_{xy} - v_{a-1}\big(v_{a-1}'\,S^{a-1}_{xy}\big) \qquad (7)$$
and ra and qa are the first left and right singular vectors of $S^{a}_{xy}$. We start this iterative algorithm with $S^{1}_{xy} = S_{xy}$ and repeat this process until k components are obtained. The
choice of the number of components k will be discussed in Section 4.
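As an illustration of this first stage, the sketch below extracts the weight vectors ra and the scores with plain NumPy, following the SVD-and-deflation scheme just described. The function name and the use of modified Gram-Schmidt to build the orthonormal base {v1 , . . . , va−1 } are our own choices, not part of the original algorithm description.

```python
import numpy as np

def simpls_scores(X, Y, k):
    """Sketch of the first SIMPLS stage: weight vectors R = (r_1, ..., r_k)
    and scores T = X_centered @ R, following (4)-(7)."""
    n = X.shape[0]
    Y = np.asarray(Y, dtype=float).reshape(n, -1)
    Xc = X - X.mean(axis=0)                 # mean-centered regressors
    Yc = Y - Y.mean(axis=0)                 # mean-centered responses
    S = Xc.T @ Yc / (n - 1)                 # S_xy, empirical cross-covariance
    R, V = [], []
    for _ in range(k):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                         # first left singular vector r_a
        t = Xc @ r                          # a-th component X~ r_a
        p = Xc.T @ t / (t @ t)              # x-loading p_a, cf. (6)
        v = p.copy()
        for v_prev in V:                    # orthonormalize against previous v's
            v -= (v_prev @ v) * v_prev
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)          # deflation of S_xy, cf. (7)
        R.append(r)
        V.append(v)
    R = np.column_stack(R)
    return R, Xc @ R                        # R_{p,k} and the scores T~_{n,k}
```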
In the second stage of the algorithm, the responses are regressed onto these k components.
The formal regression model under consideration is thus
$$y_i = \alpha_0 + A_{q,k}'\,\tilde t_i + f_i, \qquad (8)$$
where E(fi ) = 0 and Cov(fi ) = Σf . Multiple linear regression (MLR) [11] provides the estimates
$$\hat A_{k,q} = S_t^{-1}\,S_{t,y} \qquad (9)$$
$$\hat\alpha_0 = \bar y - \hat A_{q,k}'\,\bar{\tilde t} \qquad (10)$$
$$\hat\Sigma_f = S_y - \hat A'\,S_t\,\hat A \qquad (11)$$
where Sy and St stand for the empirical covariance matrices of the y- and t-variables, and St,y for their cross-covariance matrix. Note
that here MLR denotes the classical least squares regression with multiple x-variables, and
when q > 1, with multiple y-variables (also known as multivariate multiple linear regression).
Because ¯t̃ = 0, the intercept α0 is thus estimated by ȳ. By plugging t̃i = R′k,p (xi − x̄) into (3), we obtain estimates for the parameters in the original model (1), i.e.
$$\hat B_{p,q} = R_{p,k}\,\hat A_{k,q} \qquad (12)$$
$$\hat\beta_0 = \hat\alpha_0 - \hat B_{q,p}'\,\bar x \qquad (13)$$
$$S_e = S_y - \hat B'\,S_x\,\hat B. \qquad (14)$$
Note that for a univariate response variable (q = 1), the parameter estimate B̂p,1 can be rewritten as the vector β̂, whereas the estimate of the error variance Se simplifies to σ̂e² = s²e .
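As a minimal sketch of this second stage, the regression in the score space and the back-transformation (12)-(13) can be written as below; `np.linalg.lstsq` stands in for the covariance-based formulas (9)-(11) (they coincide for centered scores and responses), and the function name is ours.

```python
import numpy as np

def simpls_regression(X, Y, R):
    """Second SIMPLS stage: regress the responses on the scores T and map
    the coefficients back to the original model (1)."""
    xbar = X.mean(axis=0)
    ybar = Y.mean(axis=0)
    T = (X - xbar) @ R                                # scores T~ = Xc R
    A, *_ = np.linalg.lstsq(T, Y - ybar, rcond=None)  # A_hat; alpha0_hat = ybar since mean(T~) = 0
    B = R @ A                                         # B_hat = R A_hat, cf. (12)
    beta0 = ybar - B.T @ xbar                         # intercept, cf. (13)
    return B, beta0
```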
It has been shown in [5] and [6] that three components are sufficient to perform the PLS
regression. So, if we carry out the SIMPLS algorithm with k = 3, we obtain the regression
diagnostic plot shown in Figure 2(a). On the horizontal axis this plot displays the Mahalanobis distance of a data point in the t-space, which we therefore call its score distance SDi(k) . It is defined by
$$SD_{i(k)}^2 = \tilde t_i'\,S_t^{-1}\,\tilde t_i. \qquad (15)$$
Note that this distance SDi(k) depends on k, because the scores t̃i = t̃i(k) are obtained from a PLS model with k components. On the vertical axis, the standardized concentration residuals ri(k) /se with ri(k) = yi − β̂0 − β̂′xi are displayed. We define outlying data points in the t-space as those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$ (because the squared Mahalanobis distances of normally distributed scores are χ2k -distributed). Regression outliers have an absolute standardized residual which exceeds $\sqrt{\chi^2_{1,0.975}} = 2.24$. The SIMPLS
diagnostic plot suggests that observations 43, 44 and 45 are outlying in the t-space, whereas
observations 1 and 43 can be classified as regression outliers. Their standardized residual is
however not that large, so they are rather borderline cases.
Figure 2(b) shows the robust diagnostic plot, based on the robust PLS method that will be
described in Section 3. For a precise definition of this plot, we refer to Section 5. Here, we
can clearly identify the observations with outlying spectrum (1, 12, 39, 40, 41, 43, 44 and
45). Moreover, the robust PLS method finds several regression outliers which can not be
seen on the SIMPLS plot.
the Minimum Covariance Determinant (MCD) estimator of location and scatter [18]. The
principle of the MCD method is to minimize the determinant of the sample covariance matrix
of h observations with ⌈n/2⌉ < h < n, for which a fast algorithm (FAST-MCD) exists [21].
The center of the zi is then estimated by the mean z̄h and their scatter by the empirical
covariance matrix Sh of the optimal h-subset (multiplied with a consistency factor). To
increase the finite-sample efficiency, a reweighting step can be added as well. An observation
receives zero weight if its robust squared distance (zi − z̄h )0 Sh−1 (zi − z̄h ) exceeds χ2m,0.975 .
The reweighted MCD estimator is then defined as the classical mean and covariance matrix
of those observations with weight equal to 1.
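The reweighting step can be sketched as follows, assuming the raw h-subset estimates (denoted `center_h` and `scatter_h` here, placeholders for the FAST-MCD output) have already been computed:

```python
import numpy as np
from scipy.stats import chi2

def reweight_mcd(Z, center_h, scatter_h, alpha=0.975):
    """One reweighting step: zero weight for observations whose squared
    robust distance exceeds the chi-square quantile, then the classical
    mean and covariance of the remaining observations."""
    m = Z.shape[1]
    diff = Z - center_h
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(scatter_h), diff)
    w = d2 <= chi2.ppf(alpha, df=m)          # hard 0/1 weights
    return Z[w].mean(axis=0), np.cov(Z[w], rowvar=False), w
```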
However, when m > n, the MCD estimator is not applicable anymore because the covari-
ance matrix of h < m data points is always singular. For such high-dimensional data sets,
projection pursuit algorithms have been developed [12], [7]. ROBPCA combines the two
approaches. Using projection pursuit ideas as in Donoho [3] and Stahel [23], it computes the
outlyingness of every data point and then considers the empirical covariance matrix of the h
data points with smallest outlyingness. The data are then projected onto the subspace K0
spanned by the k0 ≪ m dominant eigenvectors of this covariance matrix. Next, the MCD
method is applied to estimate the center and the scatter of the data in this low-dimensional
subspace. Finally, these estimates are transformed back to the original space, yielding a robust estimate µ̂z of the center of Zn,m and a robust estimate Σ̂z of its scatter. This scatter matrix can be decomposed as
$$\hat\Sigma_z = P^z\,L^z\,(P^z)' \qquad (16)$$
with robust Z-eigenvectors $P^z_{m,k_0}$ and Z-eigenvalues diag($L^z_{k_0,k_0}$). Note that the diagonal matrix $L^z$ contains the k0 largest eigenvalues of Σ̂z in decreasing order. Z-scores can then be obtained from $T^z_{n,k_0} = (Z - 1_n\,\hat\mu_z')\,P^z$. For all details about ROBPCA, we refer to [8].
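A heavily simplified sketch of this combination is given below: random directions through pairs of data points provide the outlyingness, the h least outlying points give a covariance matrix, and its k0 dominant eigenvectors span the subspace. The real ROBPCA additionally applies MCD within that subspace, reweights, and uses consistency factors, so this is only meant to convey the idea; the function name, the number of directions and the MAD scaling are our own choices.

```python
import numpy as np

def robpca_sketch(Z, k0, h, n_dirs=250, random_state=0):
    """Simplified ROBPCA-style estimate of a robust center, loadings and scores."""
    rng = np.random.default_rng(random_state)
    n, m = Z.shape
    idx = rng.integers(0, n, size=(n_dirs, 2))        # directions through pairs of points
    D = Z[idx[:, 0]] - Z[idx[:, 1]]
    D = D[np.linalg.norm(D, axis=1) > 0]
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    proj = Z @ D.T                                    # projections on each direction
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0) + 1e-12
    outl = np.max(np.abs(proj - med) / mad, axis=1)   # Stahel-Donoho-type outlyingness
    keep = np.argsort(outl)[:h]                       # h least outlying points
    mu = Z[keep].mean(axis=0)
    S = np.cov(Z[keep], rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:k0]
    P = vecs[:, order]                                # robust Z-eigenvectors
    T = (Z - mu) @ P                                  # robust Z-scores
    return mu, P, vals[order], T
```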
To obtain robust scores we first apply ROBPCA on Zn,m = (Xn,p , Yn,q ). This yields a robust estimate of the center of Z, µ̂z = (µ̂′x , µ̂′y )′, and an estimate of its shape, Σ̂z , which can be split into
$$\hat\Sigma_z = \begin{pmatrix} \hat\Sigma_x & \hat\Sigma_{xy} \\ \hat\Sigma_{yx} & \hat\Sigma_y \end{pmatrix}. \qquad (17)$$
We estimate the cross-covariance matrix Σxy by Σ̂xy and compute the PLS weight vectors ra as in the SIMPLS algorithm, but now starting with Σ̂xy instead of Sxy . In analogy with (6) the x-loadings pj are defined as pj = (r′j Σ̂x rj )−1 Σ̂x rj . Then the deflation of the scatter matrix $\hat\Sigma^a_{xy}$ is performed as in SIMPLS. In each step, the robust scores are calculated as
$$t_{ia} = (x_i - \hat\mu_x)'\,r_a, \qquad (18)$$
or equivalently ti = R′k,p (xi − µ̂x ).
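Because only Σ̂xy and Σ̂x enter this construction, the SIMPLS weight vectors can be computed directly from any pair of (robust) covariance estimates. A sketch, reusing the deflation loop of Section 2 (function name ours):

```python
import numpy as np

def simpls_weights_from_cov(Sxy, Sx, k):
    """PLS weight vectors r_1, ..., r_k from a (robust) cross-covariance
    matrix Sxy and x-covariance Sx, via the SVD-and-deflation scheme (6)-(7)."""
    S = Sxy.copy()
    R, V = [], []
    for _ in range(k):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                        # first left singular vector
        p = Sx @ r / (r @ Sx @ r)          # x-loading, cf. (6)
        v = p.copy()
        for v_prev in V:                   # orthonormalize against previous v's
            v -= (v_prev @ v) * v_prev
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)         # deflation, cf. (7)
        R.append(r)
        V.append(v)
    return np.column_stack(R)
```

Feeding the ROBPCA estimates Σ̂xy and Σ̂x into this routine and projecting xi − µ̂x onto the resulting R yields the robust scores of (18).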
Once the scores are derived, a robust linear regression is performed. The regression model is the same as in (8), but now based on the robust scores ti :
$$y_i = \alpha_0 + A_{q,k}'\,t_i + f_i. \qquad (19)$$
Note that when q = 1, well-known robust methods such as the LTS regression [18] could be
used. This approach is followed in [9]. Here, we propose two methods that can be used for
regression with one or multiple response variables. Throughout we only use the notation for
the multivariate setting, but both approaches apply as well when yi is a scalar instead of a
vector. The first multivariate regression that we discuss is the MCD regression method [22].
The second one uses additional information from the previous ROBPCA step, and hence
will be called the ROBPCA regression.
MCD regression
The classical MLR estimates for the regression model presented in (19) can be written in
terms of the covariance Σ and the center µ of the joint variables (t, y):
à ! à !
µt Σt Σty
µ= , Σ= . (20)
µy Σyt Σy
If the center µ is estimated by the sample mean (t̄′, ȳ′)′ and the covariance Σ by the sample
covariance matrix of (t, y), the classical estimates satisfy equations (9)-(11) if we replace ¯t̃
in (10) by t̄. Robust regression estimates are obtained by replacing the classical mean and
covariance matrix of (t, y) by the reweighted MCD estimates of center and scatter [22]. It
is moreover recommended to reweigh these initial regression estimates in order to improve
the finite-sample efficiency. Let ri(k) be the residual of the ith observation based on the
initial estimates that were calculated with k components. If Σ̂f˘ is the initial estimate for
the covariance matrix of the errors, then we define the robust distance of the residuals as:
$$RD_{i(k)} = \big(r_{i(k)}'\,\hat\Sigma_{\breve f}^{-1}\,r_{i(k)}\big)^{1/2}. \qquad (21)$$
The weights ci(k) are computed as
$$c_{i(k)} = I\big(RD_{i(k)}^2 \le \chi^2_{q,0.975}\big) \qquad (22)$$
with I the indicator function. The final regression estimates are then calculated as in classical
MLR, but only based on those observations with weight ci(k) equal to 1. The robust residual
distances RDi(k) are recomputed as in (21) and also the weights ci(k) are adapted.
Analogously to (12)-(14), robust parameters for the original model (1) are then given by:
$$\hat B_{p,q} = R_{p,k}\,\hat A_{k,q} \qquad (23)$$
$$\hat\beta_0 = \hat\alpha_0 - \hat B_{q,p}'\,\hat\mu_x \qquad (24)$$
$$\hat\Sigma_e = \hat\Sigma_{\breve f}. \qquad (25)$$
The resulting method is called RSIMCD.
Remark 3 Both MCD and ROBPCA assume that the data set contains at least h good
observations. We will therefore use the same value for h in the two steps of our algorithm,
although this is not necessary. To perform MCD on (t, y) it is required that h > k + q. With kmax = 10 and h > ⌈n/2⌉, this condition is certainly fulfilled if ⌈n/2⌉ ≥ 10 + q. This is usually not a problem because q is very small.
The value for h influences the robustness of our estimates. It should be larger than [(n + k0 + 1)/2] in the ROBPCA step and larger than [(n + k + q + 1)/2] in the MCD regression [19]. In our Matlab implementation, we have therefore set h = max([αn], [(n + 10 + q + 1)/2]), with α = 0.75 as
default value. It is possible to increase or decrease the value of α where (1 − α) represents
the fraction of outliers the algorithm should be able to resist.
ROBPCA regression
The simulation study in Section 6 shows that RSIMCD is highly robust to many types of
outliers. Its computation time is mainly determined by applying ROBPCA on the (x, y)-
variables and MCD on the (t, y)-variables. Now, we introduce a second robust SIMPLS
algorithm which avoids the computation of the MCD on (t, y) by using additional information
from the ROBPCA step.
The MCD regression method starts by applying the reweighted MCD estimator on (t, y)
to obtain robust estimates of their center µ and scatter Σ. This reweighted MCD corresponds
to the mean and the covariance matrix of those observations which are considered not to be
outlying in the (k + q)-dimensional (t, y) space.
To obtain the robust scores ti , we first applied ROBPCA to the (x, y)-variables, and
obtained a k0 -dimensional subspace K0 which represented these (x, y)-variables well. Because
the scores were then constructed to summarize the most important information given in the
x-variables, we might expect that outliers with respect to this k0 -dimensional subspace are
often also outlying in the (t, y) space. Hence, we will estimate the center µ and the scatter
Σ of the (t, y)-variables as the weighted mean and covariance matrix of those (ti , yi ) whose
corresponding (xi , yi ) are not outlying to K0 :
à ! n
à ! n
µ̂t X ti X
µ̂ = = wi /( wi ) (26)
µ̂y i=1
yi i=1
à ! à !
Σ̂t Σ̂ty
n
X ti ³ ´ Xn
Σ̂ = = wi t0i y 0i /( wi − 1) (27)
Σ̂yt Σ̂y i=1
yi i=1
We set wi = 0 if
$$OD_i > \sqrt{\hat\mu_{OD^2} + \hat\sigma_{OD^2}\;z_{0.975}}, \qquad (28)$$
and wi = 1 otherwise. Here ODi denotes the orthogonal distance of (xi , yi ) to the subspace K0 , µ̂OD² and σ̂OD² are estimates of the center and the scale of the squared orthogonal distances, and z0.975 = Φ−1 (0.975) is the 97.5% quantile of the Gaussian distribution. Another approximation is explained in [8]. One can of course also plot the orthogonal distances to see
whether some of them are much larger than the others. We recommend this last approach when an interactive analysis of a particular data set is possible.
Having identified the observations with weight 1, we thus compute µ̂ and Σ̂ from (26)
and (27). Then, we proceed as in the MCD regression method. We plug these estimates
in (9) to (11), compute residual distances as in (21) and perform a reweighted MLR. This
reweighting step has the advantage that it might again include observations with wi = 0
which are not regression outliers. We will refer to this algorithm as the RSIMPLS method.
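In code, the RSIMPLS regression step reduces to a weighted mean and covariance of (ti , yi ) followed by the plug-in formulas. A sketch, assuming the ROBPCA weights wi of (28) are supplied as a 0/1 vector and that (9)-(11) take the covariance-based form given in Section 2:

```python
import numpy as np

def robpca_regression_step(T, Y, w):
    """Weighted center/scatter of (t, y) as in (26)-(27), then the
    MLR formulas (9)-(11) evaluated at these robust estimates."""
    TY = np.hstack([T, Y])
    k = T.shape[1]
    wsum = w.sum()
    mu = (w[:, None] * TY).sum(axis=0) / wsum              # (26)
    D = TY - mu
    Sigma = (w[:, None] * D).T @ D / (wsum - 1)            # (27)
    St, Sty = Sigma[:k, :k], Sigma[:k, k:]
    Sy = Sigma[k:, k:]
    A = np.linalg.solve(St, Sty)                           # A_hat, cf. (9)
    alpha0 = mu[k:] - A.T @ mu[:k]                         # alpha0_hat, cf. (10)
    Sigma_f = Sy - A.T @ St @ A                            # Sigma_f_hat, cf. (11)
    return A, alpha0, Sigma_f
```

A reweighting step based on the residual distances (21)-(22) then follows, exactly as in the MCD regression.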
Remark 4 Both proposed robust PLS algorithms have several equivariance properties. The
ROBPCA method is orthogonally equivariant, which means that orthogonal transformations
of the data (rotations, reflections) transform the loadings appropriately and leave the scores
unchanged. Consequently, it can easily be derived that RSIMCD and RSIMPLS are equiv-
ariant for translations and orthogonal transformations in x and y. More precisely, let u ∈ Rp , v ∈ Rq , C be any p-dimensional orthogonal matrix and D any q-dimensional orthogonal matrix. If (β̂0 , B̂) denotes the estimates obtained by running RSIMCD or RSIMPLS on the original data (xi , yi ), then for the transformed data (Cxi + u, Dyi + v) it holds that:
$$\hat B(Cx_i + u,\; Dy_i + v) = C\,\hat B\,D'$$
$$\hat\beta_0(Cx_i + u,\; Dy_i + v) = D\,\hat\beta_0 + v - D\,\hat B'\,C'\,u.$$
Remark 5 Instead of using hard-rejection rules for defining outliers, we could also apply
continuous weights between 0 and 1. But this introduces additional choices (weight functions,
cut-off values) and the results are not necessarily improved. For a more detailed discussion,
see [9].
Remark 6 We have also developed a robust PLS1 algorithm following the approach de-
veloped in [5] but by replacing the Stahel-Donoho estimator with the ROBPCA covariance
matrix. However, the results were in general not better than those obtained with RSIMPLS
or RSIMCD. Hence we prefer RSIMPLS because it can also be applied when q ≥ 1.
3.3 Comparison
In Section 6, we present the results of a thorough simulation study where we compare
the performance of these two different robust methods. These simulations indicate that
RSIMCD and RSIMPLS are very comparable, but a comparison of the mean CPU-times in seconds over five runs for different situations, computed on a Pentium IV with 1.60 GHz (see Table 1), shows that RSIMPLS is roughly twice as fast as RSIMCD. This is explained by the fact that we apply FAST-MCD in the second stage of RSIMCD. Hence, in the following
sections, we will mainly concentrate on RSIMPLS.
When an independent test set is available, a popular measure of predictive power is the root mean squared error
$$\mathrm{RMSE}_k = \sqrt{\frac{1}{n_T\,q}\sum_{i=1}^{n_T}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{ij(k)}\big)^2}, \qquad (29)$$
where nT denotes the size of the test set. Here, the predicted values ŷij(k) of observation i in the test set are based on the parameter estimates that are obtained from the training set using a PLS method with k components. One then chooses kopt as the k-value which gives the smallest, or a sufficiently small, value of RMSEk .
This RMSEk statistic can however attain unreliable values if the test set contains outliers,
even if the fitted values are based on a robust PLS algorithm. Such outlying observations
increase the RMSEk because they fit the model badly. Consequently, our decision about
kopt might be wrong. Therefore, we propose to remove the outliers from the test set before
computing RMSEk . Formally, let ri(k) be the residual for the ith observation in the test set and $c_{i(k)} = I\big(r_{i(k)}'\,\hat\Sigma_e^{-1}\,r_{i(k)} < \chi^2_{q,0.975}\big)$. The weight ci(k) thus tells whether or not the ith observation is outlying with respect to the PLS model with k components. Then, we select the test data points which are not outlying in any of the models by computing ci = mink ci(k) . Let Gt denote the set of points for which ci = 1, and let nt be its size: |Gt | = nt . Finally, for each k, we define the robust RMSEk value as
$$\text{R-RMSE}_k = \sqrt{\frac{1}{n_t\,q}\sum_{i\in G_t}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{ij(k)}\big)^2}. \qquad (30)$$
This approach is fast because we only need to run the PLS algorithm once for each k. But
an independent test set is only exceptionally available. This can be solved by splitting the
original data into a training and a test set. However, the data sets we consider generally
have a limited number of observations and it is preferable that the number of observations
in the training step is at least 6 to 10 times the number of variables. That is why we
concentrate on the cross-validated RMSEk , which will be denoted by RMSECVk [24], [9].
Usually, the one-fold or leave-one-sample-out statistic is obtained as in (29) but now the index
i runs over the set of all the observations, and the predicted values ŷij(k) are based on the
PLS estimates obtained by removing the ith observation from the data set. The optimal
number of components is then again taken as the value kopt for which RMSECVk is minimal
or sufficiently small.
However, as we have argued for RMSEk , the RMSECVk statistic is also vulnerable to outliers, so we also remove the outlying observations. Let B̂−i , β̂0,−i and Σ̂e,−i denote the parameter estimates based on the data set without the ith observation, $r_{-i(k)} = y_i - \hat\beta_{0,-i} - \hat B_{-i}'\,x_i$ the ith cross-validated residual, and $RD_{-i(k)}^2 = r_{-i(k)}'\,\hat\Sigma_{e,-i}^{-1}\,r_{-i(k)}$ the squared cross-validated residual distance as in (21). Then, analogously to (22), the cross-validated weight assigned to the ith observation is defined as
$$c_{-i(k)} = I\big(RD_{-i(k)}^2 \le \chi^2_{q,0.975}\big).$$
If c−i(k) = 0, observation (xi , yi ) is recognized as a regression outlier in the PLS model with
k components. Several PLS models are constructed for k = 1, . . . , ktot components with
ktot the total or maximal number of components under consideration. Because we want to
compare these ktot different models, we should evaluate their predictive power on the same
set of observations. Hence, we could eliminate those observations which are outlying in any
of the models by defining for each observation
$$c_{-i} = \min_k\, c_{-i(k)}. \qquad (31)$$
Let Gc denote the subset of observations for which c−i = 1 with |Gc | = nc , then an obser-
vation belongs to the set Gc when it is observed as a regular sample in each of the ktot PLS
models. For each k, we then define the robust RMSECVk value as
$$\text{R-RMSECV}_k = \sqrt{\frac{1}{n_c\,q}\sum_{i\in G_c}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{-ij(k)}\big)^2} \qquad (32)$$
with the cross-validated fitted value $\hat y_{-i(k)} = \hat\beta_{0,-i} + \hat B_{-i}'\,x_i$. This approach has the advantage
that any suspicious observation is discarded in the R-RMSECVk statistic. It is also followed
in [17] to construct robust stepwise regression by means of a bounded-influence estimator
for prediction. On the other hand, when the number of observations is small, increasing ktot
can lead to sets Gc for which the number of observations nc is small compared to the total
number of observations n. Let us e.g. consider the Fish data set. When we choose ktot = 5,
nc = 30 (out of n = 45), but with ktot = 9, nc = 23, which is only half of the observations.
To avoid such small calibration sets, we alternatively define
$$c_{-i} = \operatorname{med}_k\, c_{-i(k)}. \qquad (33)$$
With this definition, only those data points that are outlying with respect to most of the PLS
models under consideration are removed. Note that when ktot is even, we take the low-median of the c−i(k) in order to obtain a weight c−i which is always exactly zero or one. For the Fish
data set, we then obtain nc = 34 when ktot = 9. Figure 3 displays the R-RMSECV curve
for the Fish data with the two different weight functions. We see that the two curves do not differ much, and they both indicate that three components should be selected in the regression model,
which is similar to the conclusion of the analysis in [5] and [6]. We also superimposed the
R-RMSECV curves for the SIMPLS algorithm based on the same subsets Gc of observations
as RSIMPLS and again conclude that three components are sufficient.
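A sketch of the two weighting rules and the resulting R-RMSECV values is given below, assuming the cross-validated fitted values and the 0/1 weights c−i(k) have already been obtained for k = 1, . . . , ktot ; the function name and argument layout are our own.

```python
import numpy as np

def r_rmsecv(Y, Yhat_cv, c, rule="median"):
    """Robust cross-validated RMSE per number of components, cf. (31)-(33).

    Y       : (n, q) responses
    Yhat_cv : (ktot, n, q) cross-validated fitted values for k = 1..ktot
    c       : (ktot, n) 0/1 cross-validated weights c_{-i(k)}
    rule    : "min" keeps points that are regular in every model (31);
              "median" keeps points regular in most models (33, low median)
    """
    if rule == "min":
        keep = c.min(axis=0).astype(bool)
    else:
        ktot = c.shape[0]
        keep = c.sum(axis=0) > ktot / 2     # low median of 0/1 weights equals 1
    nc, q = int(keep.sum()), Y.shape[1]
    resid = Y[None, keep, :] - Yhat_cv[:, keep, :]
    return np.sqrt((resid ** 2).sum(axis=(1, 2)) / (nc * q)), keep
```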
The drawback of cross-validation is its computation time, because for each k the PLS
algorithm has to be run n times. To speed up the computations, we therefore fix k0 , the
number of principal components that are selected in ROBPCA to obtain the robust center µ̂z
and scatter Σ̂z in (17). If we then increase k, we only need to compute one extra component by deflating $S^a_{xy}$ once more, as explained in (7). Fixing k0 has the additional advantage
that the weights wi that are needed in the ROBPCA regression do not change with k. To
determine k0 , we first compute ktot ≤ min{p, kmax = 10} that will be used as a maximal value
for k in the regression. The total number of parameters to be estimated equals kq for the
slope matrix Â, q for the intercept α̂0 and q(q − 1)/2 (or 1 when q = 1) for Σ̂e . To avoid
overfitting, we then require that
$$k_{tot}\,q + q + \frac{q(q-1)}{2} < h \qquad (34)$$
where h stands for the size of the subset that is used in the ROBPCA, or MCD regression,
and which should be a lower bound for the number of regular observations out of n. Note
that if q = 1, we have only one scale estimate σ̂e , hence we require that
ktot + 2 < h.
Having determined ktot , we then set k0 = ktot +q. For the Fish data, this implies that ktot = 9
and k0 = 10.
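These rules are easy to automate. The sketch below assumes that [·] denotes the integer part in the expressions above; the function name is ours.

```python
def choose_ktot_k0(n, p, q, alpha=0.75, kmax=10):
    """Largest ktot satisfying (34) (or ktot + 2 < h when q = 1), with h as
    in Remark 3, and the corresponding k0 = ktot + q."""
    h = max(int(alpha * n), (n + 10 + q + 1) // 2)
    ktot = min(p, kmax)
    while ktot > 1:
        n_par = ktot + 2 if q == 1 else ktot * q + q + q * (q - 1) // 2
        if n_par < h:                        # condition (34)
            break
        ktot -= 1
    return ktot, ktot + q
```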
Remark 7 In [9], a robust $R_k^2$ value is also defined to determine the optimal number of components. For q = 1, it is defined by
$$R_k^2 = 1 - \frac{\sum_{i\in G_t} r_{i(k)}^2}{\sum_{i\in G_t}(y_i - \bar y_c)^2}$$
with $\bar y_c = \sum_{i\in G_t} y_i / n_t$ and Gt defined as in (30) with the test set being equal to the full data set. In the multivariate case (q > 1), this is generalized to
$$R_k^2 = 1 - \frac{\sum_{i\in G_t}\sum_{j=1}^{q} r_{ij(k)}^2}{\sum_{i\in G_t}\sum_{j=1}^{q}(y_{ij} - \bar y_j)^2}$$
where $\bar y_j = \sum_{i\in G_t} y_{ij}/n_t$. The optimal number of components kopt is then chosen as the
smallest value k for which Rk2 attains e.g. 80% or the Rk2 curve becomes nearly flat. This
approach is fast because it avoids cross-validation, but merely measures the variance of the
residuals instead of the prediction error.
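A direct transcription of this robust $R_k^2$, assuming the fitted values for k components and a boolean mask for membership of Gt are available (names ours):

```python
import numpy as np

def robust_r2(Y, Yhat_k, in_Gt):
    """Robust R^2_k of Remark 7, computed on the subset G_t only."""
    Yg = Y[in_Gt]
    resid = Yg - Yhat_k[in_Gt]
    ybar = Yg.mean(axis=0)                    # column means over G_t
    return 1.0 - (resid ** 2).sum() / ((Yg - ybar) ** 2).sum()
```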
The robust prediction error is then estimated by
$$\text{R-RMSEP}_{k_{opt}} = \sqrt{\frac{1}{n_p\,q}\sum_{i\in G_p}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{-ij(k_{opt})}\big)^2}, \qquad (35)$$
where Gp is now the subset of observations with non-zero weight c−i(kopt ) in the PLS model
with kopt components, and |Gp | = np . The fitted values are obtained with k0 = kopt + q in
ROBPCA. Note that using this definition, we include all the regular observations for the
model with kopt components, which is more precise than the set Gc that is used in (32) and
which depends on ktot . Hence, in general R-RMSEPkopt will be different from R-RMSECVkopt .
For the Fish data set and RSIMPLS, we obtain R-RMSEP3 = 0.51 based on np = 33
observations. If we perform SIMPLS, and calculate the R-RMSEP3 value on the same set
of observations, we obtain 0.82, hence the robust fit yields a smaller prediction error. To
test whether this difference is significant, we applied the following bootstrap procedure. We
have drawn 150 bootstrap samples of size np = 33 from the (xi , yi ) with c−i(3) = 1. For each
bootstrap sample we have refitted the model with three components, and we have computed
the R-RMSEP3 as in (35) with the set Gp being fixed during all the computations. The
standard deviation of these 150 R-RMSEP values was equal to σ̂R-RMSEP = 0.09 and can be
used as an approximation to the true standard deviation of the R-RMSEP statistic. Because
0.82 > 0.51 + 2.5 ∗ 0.09 = 0.735 we conclude that the R-RMSEP3 based on RSIMPLS is
significantly different from R-RMSEP3 obtained with SIMPLS (at the 1% level).
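The bootstrap assessment can be sketched as follows. Here `fit_predict` is a placeholder for the robust fit being evaluated (a three-component RSIMPLS fit in the example above), `Gp_idx` holds the integer indices of Gp, which stays fixed throughout, and the cross-validation inside each bootstrap replicate is omitted for brevity.

```python
import numpy as np

def bootstrap_rmsep_sd(X, Y, Gp_idx, fit_predict, n_boot=150, random_state=0):
    """Bootstrap standard deviation of R-RMSEP over the fixed subset G_p."""
    rng = np.random.default_rng(random_state)
    n_p, q = len(Gp_idx), Y.shape[1]
    vals = np.empty(n_boot)
    for b in range(n_boot):
        boot = rng.choice(Gp_idx, size=n_p, replace=True)   # resample G_p with replacement
        Yhat = fit_predict(X[boot], Y[boot], X[Gp_idx])      # refit, predict on fixed G_p
        vals[b] = np.sqrt(((Y[Gp_idx] - Yhat) ** 2).sum() / (n_p * q))
    return vals.std(ddof=1)
```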
5 Outlier Detection
5.1 Regression diagnostic plot
To identify the outlying observations, we will first construct a regression diagnostic plot as
in [20], [22] and [9]. Its goal is to identify outlying observations with respect to the regression
model (19). In our two robust PLS methods, we perform a regression of the q-dimensional yi
on the k-dimensional ti = ti(k) assuming k = kopt . We can distinguish three types of outliers.
These different types are represented in Figure 4(a) for the case of simple regression, so when
q = 1 and k = 1. Good leverage points lie in the direction of the fitted line or subspace, but
have outlying t-values. This is also the case for bad leverage points, that moreover do not
fit the model well. Vertical outliers are only outlying in the y-space. The latter two types of
outliers are known to be very influential for the classical least squares regression fit, because
they cause the slope to be tilted in order to accommodate the outliers.
To measure the outlyingness of a point in the t-space, we consider its robustified Mahalanobis distance, which we now call the score distance SDi(k) , defined by
$$SD_{i(k)}^2 = (t_i - \hat\mu_t)'\,\hat\Sigma_t^{-1}\,(t_i - \hat\mu_t)$$
where µ̂t and Σ̂t are derived in the regression step, see e.g. (26) and (27). When we perform SIMPLS, this score distance reduces to (15) because µ̂t̃ = 0. This score distance is put on the horizontal axis of the regression diagnostic plot and exposes the good and the bad leverage points. By analogy with [20], [22] and [9], leverage points are those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$. On the vertical axis, we put the residual distance RDi(k) :
$$RD_{i(k)}^2 = r_{i(k)}'\,\hat\Sigma_e^{-1}\,r_{i(k)}$$
with ri(k) = yi − β̂0 − B̂′xi being the residual of the ith observation. For univariate response variables, this residual distance simplifies to the standardized residual RDi(k) = ri(k) /σ̂e . Vertical outliers and bad leverage points are now observations whose residual distance exceeds $\sqrt{\chi^2_{q,0.975}}$.
[Figure 4 about here]
If the regression parameters are well estimated, i.e. if they are not influenced by the outliers,
this diagnostic plot should thus look as in Figure 4(b) for q = 1, and as in Figure 4(c) for
q > 1. Let us look again at Figure 2(b) which shows the regression diagnostic plot of the Fish
data with RSIMPLS. On this plot we see six clear bad leverage points (1, 12, 41, 43, 44, 45),
two vertical outliers (3, 10), two good leverage points (39, 40) and three borderline cases. The
diagnostic plot from SIMPLS is however not very informative. Some bad leverage points
(44, 45) are converted into good leverage points which illustrates that the least squares
regression is tilted to accommodate the outliers.
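The classification behind this plot can be sketched as follows; the function and variable names are ours, and `residuals` holds ri(k) = yi − β̂0 − B̂′xi .

```python
import numpy as np
from scipy.stats import chi2

def classify_regression_outliers(T, residuals, mu_t, Sigma_t, Sigma_e, alpha=0.975):
    """Score distances SD_{i(k)}, residual distances RD_{i(k)} and the
    resulting outlier labels of the regression diagnostic plot."""
    k, q = T.shape[1], residuals.shape[1]
    dt = T - mu_t
    SD = np.sqrt(np.einsum("ij,jk,ik->i", dt, np.linalg.inv(Sigma_t), dt))
    RD = np.sqrt(np.einsum("ij,jk,ik->i", residuals,
                           np.linalg.inv(Sigma_e), residuals))
    lev = SD > np.sqrt(chi2.ppf(alpha, k))      # leverage points
    out = RD > np.sqrt(chi2.ppf(alpha, q))      # large residual distance
    labels = np.where(lev & out, "bad leverage",
             np.where(lev, "good leverage",
             np.where(out, "vertical outlier", "regular")))
    return SD, RD, labels
```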
This allows us again to identify three types of outliers. Bad PCA-leverage points have
outlying SDi(k) and ODi(k) , good PCA-leverage points have only outlying SDi(k) , whereas
orthogonal outliers have only outlying ODi(k) . The latter ones are not yet visible on the
regression diagnostic plot. They have the property that they lie far from the t-space, but
they become regular observations after projection in the t-space. Hence, they will not badly
influence the computation of the regression parameters, but they might influence the load-
ings.
For the Fish data set this diagnostic plot is presented in Figure 5(a) for SIMPLS and in
Figure 5(b) for RSIMPLS. The horizontal line is computed as in (28). The outliers detected
in the regression diagnostic plot (Figure 2(b)) are all recognized as leverage points in this
score diagnostic plot. Furthermore we detect observation 42 as an orthogonal outlier. We
also detect two other orthogonal outliers (10, 28). For SIMPLS, this score diagnostic plot
(Figure 5(a)) also discovers sample 42 as an orthogonal outlier, but observations 43–45 are
classified as good leverage points.
Note that the regression and the score diagnostic plot can also be combined into one
three-dimensional figure exposing (SDi(k) , ODi(k) , RDi(k) ), see also [9].
6 Simulation Study
To compare the different algorithms, we have performed several simulations with low- and
high-dimensional data sets. For each situation, we generated 1000 data sets. First, we
consider the case without contamination. The data sets were then generated according to
the bilinear model (2) and (3), with:
Here, (Ik,p )i,j = 1 for i = j and 0 elsewhere. These simulation settings imply that kopt = k.
Next, we introduced different types of outliers by randomly replacing nε of the n observations, with ε = 10%. The conclusions obtained for ε = 20% contamination were similar to those for ε = 10% and are therefore not included. If Tε , Xε and Yε denote the contaminated data, the bad leverage regression points were constructed as:
$$T_\varepsilon \sim N_k(10_k,\, \Sigma_t) \qquad (36)$$
$$X_\varepsilon = T_\varepsilon\, I_{k,p} + N_p(0_p,\, 0.1\,I_p), \qquad (37)$$
whereas the y-variables were not changed. The vertical outliers were generated with the
uncontaminated t-variables, but adjusted y-variables:
the parameters in the model, whereas the intercept is mostly influenced by vertical outliers.
The orthogonal outliers are most influential for the low-dimensional data sets.
In contrast with SIMPLS, the values for the robust algorithms do not change very much.
In almost every setting, the differences between RSIMCD and RSIMPLS are very small.
Both robust PLS methods are thus comparable, but as mentioned in Section 3.3 we prefer
RSIMPLS because it is computationally more attractive than RSIMCD.
We then performed RSIMPLS with kopt = 3 and k0 = 6, and obtained the robust diagnostic
plot in Figure 7(b). Observation 21 stands out as a clear outlier with a very large robust
residual distance around 60. Observation 23 is also recognized as a bad leverage point having
the largest score distance. Further, we distinguish three bad leverage points (7, 20, 24) with
merely large score distances, and one vertical outlier (22) with a somewhat larger residual
distance. There are also some borderline cases (20, 33). With SIMPLS, we obtain the
regression diagnostic plot in Figure 7(a). The three most extreme outliers (21, 23, 7) seen
from the robust analysis, are still detected, but their distances have changed enormously.
Observation 21 now has a residual distance RD21(3) = 5.91, and observation 23 has score distance SD23(3) = 4.23. Observation 23 is almost turned into a good leverage point, whereas case 7 is a boundary case because its residual distance is only 3.71, which does not lie very far from $\sqrt{\chi^2_{3,0.975}} = 3.06$.
The RSIMPLS score diagnostic plot is shown in Figure 8(b). Observations 7, 20, 21, 23
and 24 are detected as bad PCA-leverage points. The score diagnostic plot for SIMPLS in Figure 8(a) only indicates 23 as a good PCA-leverage point.
The robust prediction error R-RMSEP3 = 0.53. If we compute R-RMSEP3 with the fitted
values obtained with SIMPLS and Gp as in RSIMPLS, we obtain 0.70 for the prediction error.
This shows that RSIMPLS yields a lower prediction error than SIMPLS, evaluated at the
same subset of observations. To know whether this difference is significant, we applied the
same bootstrap procedure as explained in Section 4.2, from which we derived the standard
deviation σ̂R-RMSEP = 0.12. This yields a difference that is significant only at approximately the 15% level.
To finish this example, we illustrate that for this data set, it is worthwhile to consider
the multivariate regression model where the three y-variables are simultaneously modelled
instead of performing three univariate calibrations. First, we computed the univariate pre-
diction errors based on the multivariate estimates. So we computed R-RMSEP3 for each
response variable separately (j = 1, . . . , 3):
$$\text{R-RMSEP}_3 = \sqrt{\frac{1}{n_p}\sum_{i\in G_p}\big(y_{ij} - \hat y_{-ij(3)}\big)^2}$$
where ŷ−ij(3) are the fitted values from the multivariate regression and Gp is the subset of
observations retained in the multivariate regression. We obtained R-RMSEP(flour) = 0.37,
R-RMSEP(sucrose) = 0.82 and R-RMSEP(water) = 0.19. Then, we have applied RSIMPLS
for the three concentrations separately. It turned out that three components were satisfactory
for every response. These three univariate regressions resulted in R-RMSEP(flour) = 0.40,
R-RMSEP(sucrose) = 0.95 and R-RMSEP(water) = 0.18. Also these latter prediction errors
are based on the same subset Gp from the multivariate approach. For flour and sucrose we
thus obtain a higher prediction accuracy with the multivariate regression, whereas only water
is slightly better fitted by its own model.
8 Conclusion
In this paper we have proposed two new robust PLSR algorithms based on the SIMPLS
algorithm. RSIMCD and RSIMPLS can be applied to low- and high-dimensional regressor
variables, and to one or multiple response variables. First, robust scores are constructed,
and then the analysis is followed by a robust regression step. Simulations have shown that
they are resistant to many types of contamination, while their performance on uncontaminated data sets remains good. We recommend RSIMPLS because it is roughly twice
as fast as RSIMCD. A Matlab implementation of RSIMPLS is available at the web site
www.wis.kuleuven.ac.be/stat/robust.html as part of the Matlab toolbox for Robust
Calibration [27].
We have also proposed robust RMSECV curves to select the number of components, and
a robust estimate of the prediction error. Diagnostic plots are introduced to discover the
different types of outliers in the data and are illustrated on some real data sets. The advantage of the multivariate approach has also been illustrated.
In [4], a comparative study is made between RSIMPLS and RPCR with emphasis on
the predictive ability and the goodness-of-fit of these methods when varying the number of
components k. Currently, we are developing faster algorithms to compute the R-RMSECV
values to allow fast and robust model selection in multivariate calibration.
References
[1] Cummins DJ, Andrews CW. Iteratively reweighted partial least squares: a performance
analysis by Monte Carlo simulation. J. Chemometrics 1995; 9:489–507.
[2] de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemo-
metrics Intell. Lab. Syst. 1993; 18:251–263.
[3] Donoho DL. Breakdown Properties of Multivariate Location Estimators, Ph.D. Qualify-
ing paper, Harvard University, 1982.
[4] Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and ro-
bust PLS: a comparative study; To appear in Theory and Applications of Recent
Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst,
Series: Statistics for Industry and Technology, Birkhauser, Basel. Available at
http://www.wis.kuleuven.ac.be/stat/robust.html.
[5] Gil JA, Romera R. On robust partial least squares (PLS) methods. J. Chemometrics
1998; 12:365–378.
[6] Hardy AJ, MacLaurin P, Haswell SJ, de Jong S, Vandeginste BG. Double-case diagnostic
for outliers identification. Chemometrics Intell. Lab. Syst. 1996; 34:117–129.
[7] Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components
with applications to chemometrics. Chemometrics Intell. Lab. Syst. 2002; 60:101–111.
[8] Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust
principal component analysis; submitted to Technometrics. Under revision. Available
at http://www.wis.kuleuven.ac.be/stat.
[10] Hubert M, Rousseeuw PJ, Verboven S. Robust PCA for high-dimensional data In: Dut-
ter R, Filzmoser P, Gather U, Rousseeuw PJ (eds.), Developments in Robust Statistics,
2003; Physika Verlag, Heidelberg, 169–179.
[11] Johnson R, Wichern D. Applied Multivariate Statistical Analysis (4th edn). Prenctice
Hall: New Jersey, 1998.
[14] Naes T. Multivariate calibration when the error covariance matrix is structured. Tech-
nometrics 1985; 27:301–311.
[15] Osborne BG, Fearn T, Miller AR, Douglas S. Application of near infrared reglectance
spectroscopy to the compositional analysis of biscuits and biscuit dough. J. Scient. Food
Agric. 1984; 35:99–105.
[16] Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical
techniques. Chemometrics Intell. Lab. Syst. 2000; 52:87–104.
[18] Rousseeuw PJ. Least median of squares regression. J. Am. Statist. Assoc. 1984; 79:871–
880.
[19] Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley: New York,
1987.
[20] Rousseeuw PJ, van Zomeren BC. Unmasking multivariate outliers and leverage points.
J. Am. Statist. Assoc. 1990; 85:633-651.
[21] Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determi-
nant estimator. Technometrics 1999; 41:212–223.
[22] Rousseeuw PJ, Van Aelst S, Van Driessen K, Agulló J. Robust multivari-
ate regression 2002; submitted to Technometrics. Under revision. Available at
http://win-www.uia.ac.be/u/statis.
[23] Stahel WA. Robust Estimation: Infinitesimal Optimality and Covariance Matrix Esti-
mators, Ph.D. thesis, ETH, Zürich, 1981.
[24] Tenenhaus M. La Régression PLS: Théorie et Pratique. Éditions Technip: Paris, 1998.
[25] Tucker LR. An inter-battery method of factor analysis. Psychometrika 1958; 23: 111–
136.
[27] Verboven S, Hubert M. A Matlab toolbox for robust calibration. In preparation, 2003.
[28] Wakeling IN, Macfie HJH. A robust PLS procedure. J. Chemometrics 1992; 6:189–198.
[29] Wold H. Estimation of principal components and related models by iterative least
squares. Multivariate Analysis, Academic Press, New York, 1966, 391–420.
[30] Wold H. Soft modelling by latent variables: the non-linear iterative partial least squares
(NIPALS) approach. Perspectives in Probability and Statistics (papers in honour of
M.S. Bartlett on the occasion of his 65th birthday) 1975; 117-142. Applied Probability
Trust, Univ. Sheffield, Sheffield.
Figure 2: Regression diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.
Figure 3: The R-RMSECV curves for the Fish data set: R-RMSECV curve for RSIMPLS
based on (31) (solid line and •), for RSIMPLS based on (33) (solid line and ¨), for SIMPLS
based on (31) (dashed line and •) and for SIMPLS based on (33) (dashed line and ¨).
Figure 4: Different types of outliers in regression: (a) scatterplot in simple regression; (b)
regression diagnostic plot for univariate response variable; (c) regression diagnostic plot for
multivariate response variables.
Figure 5: Score diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.
Figure 6: The R-RMSECV curve for the Biscuit dough data set.
Figure 7: Regression diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b)
RSIMPLS.
Figure 8: Score diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIM-
PLS.
Table 1: The mean CPU-time in seconds over five runs of RSIMCD and RSIMPLS for several set-ups.

                   q = 1                          q = 5
  n    p      k    RSIMCD   RSIMPLS       k    RSIMCD   RSIMPLS
  50   100    5    10.59    6.42          5    13.64    7.45
              10   14.87    7.90          10   18.22    9.03
       500    5    10.60    6.47          5    13.81    7.68
              15   14.80    8.03          15   18.62    9.41
  100  5      1    7.64     5.65          5    14.38    8.08
              5    11.00    6.77          10   14.43    8.07
       500    5    11.64    7.39          5    15.11    8.80
              15   16.01    9.00          15   19.93    10.57
Table 2: The simulation settings.

  Table   q   n    p    k   Σt                       Σe
  3       1   100  5    2   diag(4, 2)               1
  4       1   50   100  5   diag(7, 5, 3.5, 2.5, 1)  1
  5       5   100  10   3   diag(4, 2, 1)            Iq
  5       5   50   100  5   diag(7, 5, 3.5, 2.5, 1)  Iq
Table 3: Simulation results for low-dimensional regressors (p = 5) and one response variable
(q = 1).
Table 4: Simulation results for high-dimensional regressors (p = 100) and one response
variable (q = 1).