SUMMARY
Partial Least Squares Regression (PLSR) is a linear regression technique developed to deal
with high-dimensional regressors and one or several response variables. In this paper we introduce robustified versions of the SIMPLS algorithm, which is the leading PLSR algorithm because of its speed and efficiency. Because SIMPLS is based on the empirical cross-covariance
matrix between the response variables and the regressors and on linear least squares re-
gression, the results are affected by abnormal observations in the data set. Two robust
methods, RSIMCD and RSIMPLS, are constructed from a robust covariance matrix for
high-dimensional data and robust linear regression. We introduce robust RMSECV and
RMSEP values for model calibration and model validation. Diagnostic plots are constructed
to visualize and classify the outliers. Several simulation results and the analysis of real data
sets show the effectiveness and the robustness of the new approaches. Because RSIMPLS is roughly twice as fast as RSIMCD, it is the method we recommend.
KEY WORDS : Partial Least Squares Regression; SIMPLS; Principal Component Analysis; Ro-
bust Regression.
∗ Correspondence to: M. Hubert, Assistant Professor, Department of Mathematics, Katholieke Universiteit Leuven, W. de Croylaan 54, B-3001 Leuven, Belgium, mia.hubert@wis.kuleuven.ac.be, tel. +32/16322023, fax +32/16322831
1 Introduction
The PLS (NIPALS) regression technique was originally developed for econometrics by Wold
([29], [30]), but has become a very popular algorithm in other fields such as chemometrics, the social sciences and the food industry. It is used to model the linear relation between a set of
regressors and a set of response variables, which can then be used to predict the value of the
response variables for a new sample. A typical example is multivariate calibration where the
x-variables are spectra and the y-variables are the concentrations of certain constituents.
Throughout the paper we will print column vectors in bold and denote the transpose of a vector v or a matrix V as v′ or V′. Sometimes, the dimension of a matrix will be indicated by a subscript, e.g. Xn,p stands for an (n × p)-dimensional matrix. We use the notation (x1 , . . . , xn )′ = Xn,p for the regressors and (y1 , . . . , yn )′ = Yn,q for the response variables. The merged data set (Xn,p , Yn,q ) will be denoted as Zn,m , with m = p + q.
The linear regression model we consider is
$$y_i = \beta_0 + B_{q,p}'\,x_i + e_i, \qquad (1)$$
where the error terms ei satisfy E(ei ) = 0 and Cov(ei ) = Σe of size q. The unknown q-dimensional intercept is denoted as β0 = (β01 , . . . , β0q )′ and Bp,q represents the unknown slope matrix. Typically in chemometrics, the number of observations n is very small (some tens), whereas the number of regressors p is very large (some hundreds or thousands). The number of response variables q is in general limited to at most five.
Because multicollinearity is present, the classical multiple linear regression (MLR) estimates have a very large variance; hence biased estimation procedures, such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR) [13], are then
performed. In this paper, we will focus on PLSR. We will use the notation PLS1 when there
is only one response variable and the notation PLS2 otherwise.
It is well known that the popular algorithms for PLSR (NIPALS [29] and SIMPLS [2]) are
very sensitive to outliers in the data set. This will also be clearly shown in our simulation
study in Section 6. In this paper, we introduce several robust methods for PLSR which
are resistant to outlying observations. Some robustified versions of the PLS1 and PLS2
algorithms have already been proposed in the past. A first algorithm [28] has been developed
by replacing the different univariate regression steps in the PLS2 algorithm by some robust
alternatives. Iteratively reweighted algorithms have been obtained in [1] and [16]. These
algorithms are only valid for a one-dimensional response variable and they are not resistant
to leverage points. In [5], a robust PLS1 method is obtained by robustifying the sample
covariance matrix of the x-variables and the sample cross-covariance matrix between the
x- and y-variables. For this, the highly robust Stahel-Donoho estimator ([23],[3]) is used, but unfortunately it cannot be applied to high-dimensional regressors (n ≪ p) because
the subsampling scheme used to compute the estimator starts by drawing subsets of size
p + 2. Moreover, the method cannot be extended to PLS2. Recently, a robust method for
PCR which also applies to high-dimensional x-variables and multiple y-variables has been
introduced in [9].
In this paper we present several robustifications of the SIMPLS algorithm. SIMPLS is
very popular because it is faster than PLS1 and PLS2 as implemented using NIPALS, and
the results are easier to interpret. Moreover, if there is only one response variable (q = 1),
SIMPLS and PLS1 yield the same results. An outline of the SIMPLS algorithm is presented
in Section 2. We recall that SIMPLS depends on the sample cross-covariance matrix between
the x- and y-variables, and on linear least squares regression. In Section 3, we introduce
several robust methods which are obtained by using a robust covariance matrix for high-
dimensional data sets ([8]), and a robust regression method. The proposed algorithms are
fast compared to previously developed robust methods and they can handle cases where n ≪ p
and q ≥ 1. Section 4 discusses the selection of the number of components and the model
validation. In Section 5 we introduce several diagnostic plots which can help us to identify
the outliers and classify them in several types. The robustness of the proposed algorithms
is demonstrated in Section 6 with several simulations. In Section 7 we apply one of the new
methods on a real data set. We finish with some conclusions in Section 8.
SIMPLS is based on the bilinear model
$$x_i = \bar x + P_{p,k}\,\tilde t_i + g_i \qquad (2)$$
$$y_i = \bar y + A_{q,k}'\,\tilde t_i + f_i. \qquad (3)$$
In this model, x̄ and ȳ denote the mean of the x- and the y-variables. The t̃i are called the scores, which are k-dimensional with k ≪ p, whereas Pp,k is the matrix of x-loadings. The residuals of each equation are represented by the gi and fi respectively. The matrix Ak,q represents the slope matrix in the regression of yi on t̃i . Note that in the PLS literature the matrix Ak,q is usually denoted as Q′k,q and the columns of Qq,k by the y-loadings qa . We prefer to use another notation because qa will be used for the PLS weight vector, see (4).
The bilinear structure (2) and (3) implies a two-step algorithm. After mean-centering
the data, SIMPLS will first construct k latent variables T̃n,k = (t̃1 , . . . , t̃n )0 and secondly, the
responses will be regressed onto these k variables. We will refer to the columns of T̃n,k as
the components.
Consider first the construction of the components. In contrast with PCR, the k com-
ponents are not solely determined by the x-variables. They are obtained as a linear combination of the x-variables which has maximum covariance with a certain linear com-
bination of the y-variables. More precisely, let X̃n,p and Ỹn,q denote the mean-centered data
matrices, with x̃i = xi − x̄ and ỹi = yi − ȳ. The normalized PLS weight vectors ra and qa
(kra k = kqa k = 1) are then defined as the vectors that maximize for each a = 1, . . . , k
$$\operatorname{cov}(\tilde Y_{n,q}\,q_a,\ \tilde X_{n,p}\,r_a) \;=\; q_a'\,\frac{\tilde Y_{q,n}'\,\tilde X_{n,p}}{n-1}\,r_a \;=\; q_a'\,S_{yx}\,r_a \qquad (4)$$
where Syx = S′xy , with Sxy = X̃′p,n Ỹn,q /(n − 1) the empirical cross-covariance matrix between the x- and
y-variables. The elements of the scores t̃i are then defined as linear combinations of the
mean-centered data: t̃ia = x̃0i ra , or equivalently T̃n,k = X̃n,p Rp,k with Rp,k = (r1 , . . . , rk ).
The maximization problem of (4) has one straightforward solution: r1 and q1 are the first
left and right singular vectors of Sxy . This implies that q1 is the dominant eigenvector of
Syx Sxy and r1 = Sxy q1 . To obtain more than one solution, the components X̃rj are required
to be orthogonal:
$$r_j'\,\tilde X'\tilde X\,r_a \;=\; \sum_{i=1}^{n}\tilde t_{ij}\,\tilde t_{ia} \;=\; 0, \qquad a > j. \qquad (5)$$
To satisfy this condition we first introduce the x-loading pj that describes the linear relation
between the x-variables and the jth component X̃rj . It is computed as
$$p_j = (r_j'\,S_x\,r_j)^{-1}\,S_x\,r_j, \qquad (6)$$
with Sx the empirical covariance matrix of the x-variables. This definition implies that (5) is fulfilled when p′j ra = 0 for a > j. The PLS weight vector ra thus has to be orthogonal
to all previous x-loadings Pa−1 = [p1 , . . . , pa−1 ]. Consequently, ra and qa are computed
as the first left and right singular vectors of Sxy projected on a subspace orthogonal to
Pa−1 . This projection is performed by constructing an orthonormal base {v1 , . . . , va−1 } of
{p1 , . . . , pa−1 }. Next, Sxy is deflated:
$$S^{a}_{xy} = S^{a-1}_{xy} - v_{a-1}\big(v_{a-1}'\,S^{a-1}_{xy}\big) \qquad (7)$$
and ra and qa are the first left and right singular vectors of $S^{a}_{xy}$. We start this iterative algorithm with $S^{1}_{xy} = S_{xy}$ and repeat this process until k components are obtained. The
choice of the number of components k will be discussed in Section 4.
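As an illustration of this first stage, the sketch below extracts the weight vectors ra and the scores with plain NumPy, following the SVD-and-deflation scheme just described. The function name and the use of modified Gram-Schmidt to build the orthonormal base {v1 , . . . , va−1 } are our own choices, not part of the original algorithm description.

```python
import numpy as np

def simpls_scores(X, Y, k):
    """Sketch of the first SIMPLS stage: weight vectors R = (r_1, ..., r_k)
    and scores T = X_centered @ R, following (4)-(7)."""
    n = X.shape[0]
    Y = np.asarray(Y, dtype=float).reshape(n, -1)
    Xc = X - X.mean(axis=0)                 # mean-centered regressors
    Yc = Y - Y.mean(axis=0)                 # mean-centered responses
    S = Xc.T @ Yc / (n - 1)                 # S_xy, empirical cross-covariance
    R, V = [], []
    for _ in range(k):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                         # first left singular vector r_a
        t = Xc @ r                          # a-th component X~ r_a
        p = Xc.T @ t / (t @ t)              # x-loading p_a, cf. (6)
        v = p.copy()
        for v_prev in V:                    # orthonormalize against previous v's
            v -= (v_prev @ v) * v_prev
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)          # deflation of S_xy, cf. (7)
        R.append(r)
        V.append(v)
    R = np.column_stack(R)
    return R, Xc @ R                        # R_{p,k} and the scores T~_{n,k}
```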
In the second stage of the algorithm, the responses are regressed onto these k components.
The formal regression model under consideration is thus
$$y_i = \alpha_0 + A_{q,k}'\,\tilde t_i + f_i, \qquad (8)$$
where E(fi ) = 0 and Cov(fi ) = Σf . Multiple linear regression (MLR) [11] provides the estimates
$$\hat A_{k,q} = S_t^{-1}\,S_{t,y} \qquad (9)$$
$$\hat\alpha_0 = \bar y - \hat A_{q,k}'\,\bar{\tilde t} \qquad (10)$$
$$\hat\Sigma_f = S_y - \hat A'\,S_t\,\hat A \qquad (11)$$
where Sy and St stand for the empirical covariance matrices of the y- and t-variables, and St,y for their cross-covariance matrix. Note
that here MLR denotes the classical least squares regression with multiple x-variables, and
when q > 1, with multiple y-variables (also known as multivariate multiple linear regression).
Because ¯t̃ = 0, the intercept α0 is thus estimated by ȳ. By plugging t̃i = R′k,p (xi − x̄) into (3), we obtain estimates for the parameters in the original model (1), i.e.
$$\hat B_{p,q} = R_{p,k}\,\hat A_{k,q} \qquad (12)$$
$$\hat\beta_0 = \hat\alpha_0 - \hat B_{q,p}'\,\bar x \qquad (13)$$
$$S_e = S_y - \hat B'\,S_x\,\hat B. \qquad (14)$$
Note that for a univariate response variable (q = 1), the parameter estimate B̂p,1 can be rewritten as the vector β̂, whereas the estimate of the error variance Se simplifies to σ̂e² = s²e .
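As a minimal sketch of this second stage, the regression in the score space and the back-transformation (12)-(13) can be written as below; `np.linalg.lstsq` stands in for the covariance-based formulas (9)-(11) (they coincide for centered scores and responses), and the function name is ours.

```python
import numpy as np

def simpls_regression(X, Y, R):
    """Second SIMPLS stage: regress the responses on the scores T and map
    the coefficients back to the original model (1)."""
    xbar = X.mean(axis=0)
    ybar = Y.mean(axis=0)
    T = (X - xbar) @ R                                # scores T~ = Xc R
    A, *_ = np.linalg.lstsq(T, Y - ybar, rcond=None)  # A_hat; alpha0_hat = ybar since mean(T~) = 0
    B = R @ A                                         # B_hat = R A_hat, cf. (12)
    beta0 = ybar - B.T @ xbar                         # intercept, cf. (13)
    return B, beta0
```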
It has been shown in [5] and [6] that three components are sufficient to perform the PLS
regression. So, if we carry out the SIMPLS algorithm with k = 3, we obtain the regression
diagnostic plot shown in Figure 2(a). On the horizontal axis this plot displays the Mahalanobis distance of a data point in the t-space, which we therefore call its score distance SDi(k) . It is defined by
$$SD_{i(k)}^2 = \tilde t_i'\,S_t^{-1}\,\tilde t_i. \qquad (15)$$
Note that this distance SDi(k) depends on k, because the scores t̃i = t̃i(k) are obtained from a PLS model with k components. On the vertical axis, the standardized concentration residuals ri(k) /se with ri(k) = yi − β̂0 − β̂′xi are displayed. We define outlying data points in the t-space as those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$ (because the squared Mahalanobis distances of normally distributed scores are χ2k -distributed). Regression outliers have an absolute standardized residual which exceeds $\sqrt{\chi^2_{1,0.975}} = 2.24$. The SIMPLS
diagnostic plot suggests that observations 43, 44 and 45 are outlying in the t-space, whereas
observations 1 and 43 can be classified as regression outliers. Their standardized residual is
however not that large, so they are rather borderline cases.
Figure 2(b) shows the robust diagnostic plot, based on the robust PLS method that will be
described in Section 3. For a precise definition of this plot, we refer to Section 5. Here, we
can clearly identify the observations with outlying spectrum (1, 12, 39, 40, 41, 43, 44 and
45). Moreover, the robust PLS method finds several regression outliers which can not be
seen on the SIMPLS plot.
the Minimum Covariance Determinant (MCD) estimator of location and scatter [18]. The
principle of the MCD method is to minimize the determinant of the sample covariance matrix
of h observations with ⌈n/2⌉ < h < n, for which a fast algorithm (FAST-MCD) exists [21].
The center of the zi is then estimated by the mean z̄h and their scatter by the empirical
covariance matrix Sh of the optimal h-subset (multiplied with a consistency factor). To
increase the finite-sample efficiency, a reweighting step can be added as well. An observation
receives zero weight if its robust squared distance (zi − z̄h )0 Sh−1 (zi − z̄h ) exceeds χ2m,0.975 .
The reweighted MCD estimator is then defined as the classical mean and covariance matrix
of those observations with weight equal to 1.
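The reweighting step can be sketched as follows, assuming the raw h-subset estimates (denoted `center_h` and `scatter_h` here, placeholders for the FAST-MCD output) have already been computed:

```python
import numpy as np
from scipy.stats import chi2

def reweight_mcd(Z, center_h, scatter_h, alpha=0.975):
    """One reweighting step: zero weight for observations whose squared
    robust distance exceeds the chi-square quantile, then the classical
    mean and covariance of the remaining observations."""
    m = Z.shape[1]
    diff = Z - center_h
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(scatter_h), diff)
    w = d2 <= chi2.ppf(alpha, df=m)          # hard 0/1 weights
    return Z[w].mean(axis=0), np.cov(Z[w], rowvar=False), w
```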
However, when m > n, the MCD estimator is not applicable anymore because the covari-
ance matrix of h < m data points is always singular. For such high-dimensional data sets,
projection pursuit algorithms have been developed [12], [7]. ROBPCA combines the two
approaches. Using projection pursuit ideas as in Donoho [3] and Stahel [23], it computes the
outlyingness of every data point and then considers the empirical covariance matrix of the h
data points with smallest outlyingness. The data are then projected onto the subspace K0
spanned by the k0 ≪ m dominant eigenvectors of this covariance matrix. Next, the MCD
method is applied to estimate the center and the scatter of the data in this low-dimensional
subspace. Finally, these estimates are transformed back to the original space, yielding a robust estimate µ̂z of the center of Zn,m and a robust estimate Σ̂z of its scatter. This scatter matrix can be decomposed as
$$\hat\Sigma_z = P^z\,L^z\,(P^z)' \qquad (16)$$
with robust Z-eigenvectors $P^z_{m,k_0}$ and Z-eigenvalues diag($L^z_{k_0,k_0}$). Note that the diagonal matrix $L^z$ contains the k0 largest eigenvalues of Σ̂z in decreasing order. Z-scores can then be obtained from $T^z_{n,k_0} = (Z - 1_n\,\hat\mu_z')\,P^z$. For all details about ROBPCA, we refer to [8].
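A heavily simplified sketch of this combination is given below: random directions through pairs of data points provide the outlyingness, the h least outlying points give a covariance matrix, and its k0 dominant eigenvectors span the subspace. The real ROBPCA additionally applies MCD within that subspace, reweights, and uses consistency factors, so this is only meant to convey the idea; the function name, the number of directions and the MAD scaling are our own choices.

```python
import numpy as np

def robpca_sketch(Z, k0, h, n_dirs=250, random_state=0):
    """Simplified ROBPCA-style estimate of a robust center, loadings and scores."""
    rng = np.random.default_rng(random_state)
    n, m = Z.shape
    idx = rng.integers(0, n, size=(n_dirs, 2))        # directions through pairs of points
    D = Z[idx[:, 0]] - Z[idx[:, 1]]
    D = D[np.linalg.norm(D, axis=1) > 0]
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    proj = Z @ D.T                                    # projections on each direction
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0) + 1e-12
    outl = np.max(np.abs(proj - med) / mad, axis=1)   # Stahel-Donoho-type outlyingness
    keep = np.argsort(outl)[:h]                       # h least outlying points
    mu = Z[keep].mean(axis=0)
    S = np.cov(Z[keep], rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:k0]
    P = vecs[:, order]                                # robust Z-eigenvectors
    T = (Z - mu) @ P                                  # robust Z-scores
    return mu, P, vals[order], T
```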
To obtain robust scores we first apply ROBPCA on Zn,m = (Xn,p , Yn,q ). This yields a robust estimate of the center of Z, µ̂z = (µ̂′x , µ̂′y )′, and an estimate of its shape, Σ̂z , which can be split into
$$\hat\Sigma_z = \begin{pmatrix} \hat\Sigma_x & \hat\Sigma_{xy} \\ \hat\Sigma_{yx} & \hat\Sigma_y \end{pmatrix}. \qquad (17)$$
We estimate the cross-covariance matrix Σxy by Σ̂xy and compute the PLS weight vectors ra as in the SIMPLS algorithm, but now starting with Σ̂xy instead of Sxy . In analogy with (6) the x-loadings pj are defined as pj = (r′j Σ̂x rj )−1 Σ̂x rj . Then the deflation of the scatter matrix $\hat\Sigma^a_{xy}$ is performed as in SIMPLS. In each step, the robust scores are calculated as
$$t_{ia} = (x_i - \hat\mu_x)'\,r_a, \qquad (18)$$
or equivalently ti = R′k,p (xi − µ̂x ).
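Because only Σ̂xy and Σ̂x enter this construction, the SIMPLS weight vectors can be computed directly from any pair of (robust) covariance estimates. A sketch, reusing the deflation loop of Section 2 (function name ours):

```python
import numpy as np

def simpls_weights_from_cov(Sxy, Sx, k):
    """PLS weight vectors r_1, ..., r_k from a (robust) cross-covariance
    matrix Sxy and x-covariance Sx, via the SVD-and-deflation scheme (6)-(7)."""
    S = Sxy.copy()
    R, V = [], []
    for _ in range(k):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                        # first left singular vector
        p = Sx @ r / (r @ Sx @ r)          # x-loading, cf. (6)
        v = p.copy()
        for v_prev in V:                   # orthonormalize against previous v's
            v -= (v_prev @ v) * v_prev
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)         # deflation, cf. (7)
        R.append(r)
        V.append(v)
    return np.column_stack(R)
```

Feeding the ROBPCA estimates Σ̂xy and Σ̂x into this routine and projecting xi − µ̂x onto the resulting R yields the robust scores of (18).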
Once the scores are derived, a robust linear regression is performed. The regression model is the same as in (8), but now based on the robust scores ti :
$$y_i = \alpha_0 + A_{q,k}'\,t_i + f_i. \qquad (19)$$
Note that when q = 1, well-known robust methods such as the LTS regression [18] could be
used. This approach is followed in [9]. Here, we propose two methods that can be used for
regression with one or multiple response variables. Throughout we only use the notation for
the multivariate setting, but both approaches apply as well when yi is a scalar instead of a
vector. The first multivariate regression that we discuss is the MCD regression method [22].
The second one uses additional information from the previous ROBPCA step, and hence
will be called the ROBPCA regression.
MCD regression
The classical MLR estimates for the regression model presented in (19) can be written in
terms of the covariance Σ and the center µ of the joint variables (t, y):
à ! à !
µt Σt Σty
µ= , Σ= . (20)
µy Σyt Σy
If the center µ is estimated by the sample mean (t̄′, ȳ′)′ and the covariance Σ by the sample
covariance matrix of (t, y), the classical estimates satisfy equations (9)-(11) if we replace ¯t̃
in (10) by t̄. Robust regression estimates are obtained by replacing the classical mean and
covariance matrix of (t, y) by the reweighted MCD estimates of center and scatter [22]. It
is moreover recommended to reweigh these initial regression estimates in order to improve
the finite-sample efficiency. Let ri(k) be the residual of the ith observation based on the
initial estimates that were calculated with k components. If Σ̂f˘ is the initial estimate for
the covariance matrix of the errors, then we define the robust distance of the residuals as:
$$RD_{i(k)} = \big(r_{i(k)}'\,\hat\Sigma_{\breve f}^{-1}\,r_{i(k)}\big)^{1/2}. \qquad (21)$$
The weights ci(k) are computed as
$$c_{i(k)} = I\big(RD_{i(k)}^2 \le \chi^2_{q,0.975}\big) \qquad (22)$$
with I the indicator function. The final regression estimates are then calculated as in classical
MLR, but only based on those observations with weight ci(k) equal to 1. The robust residual
distances RDi(k) are recomputed as in (21) and also the weights ci(k) are adapted.
Analogously to (12)-(14), robust parameters for the original model (1) are then given by:
$$\hat B_{p,q} = R_{p,k}\,\hat A_{k,q} \qquad (23)$$
$$\hat\beta_0 = \hat\alpha_0 - \hat B_{q,p}'\,\hat\mu_x \qquad (24)$$
$$\hat\Sigma_e = \hat\Sigma_{\breve f}. \qquad (25)$$
The resulting method is called RSIMCD.
Remark 3 Both MCD and ROBPCA assume that the data set contains at least h good
observations. We will therefore use the same value for h in the two steps of our algorithm,
although this is not necessary. To perform MCD on (t, y) it is required that h > k + q. With kmax = 10 and h > ⌈n/2⌉, this condition is certainly fulfilled if ⌈n/2⌉ ≥ 10 + q. This is usually not a problem because q is very small.
The value for h influences the robustness of our estimates. It should be larger than [(n + k0 + 1)/2] in the ROBPCA step and larger than [(n + k + q + 1)/2] in the MCD regression [19]. In our Matlab implementation, we have therefore set h = max([αn], [(n + 10 + q + 1)/2]), with α = 0.75 as
default value. It is possible to increase or decrease the value of α where (1 − α) represents
the fraction of outliers the algorithm should be able to resist.
ROBPCA regression
The simulation study in Section 6 shows that RSIMCD is highly robust to many types of
outliers. Its computation time is mainly determined by applying ROBPCA on the (x, y)-
variables and MCD on the (t, y)-variables. Now, we introduce a second robust SIMPLS
algorithm which avoids the computation of the MCD on (t, y) by using additional information
from the ROBPCA step.
The MCD regression method starts by applying the reweighted MCD estimator on (t, y)
to obtain robust estimates of their center µ and scatter Σ. This reweighted MCD corresponds
to the mean and the covariance matrix of those observations which are considered not to be
outlying in the (k + q)-dimensional (t, y) space.
To obtain the robust scores ti , we first applied ROBPCA to the (x, y)-variables, and
obtained a k0 -dimensional subspace K0 which represented these (x, y)-variables well. Because
the scores were then constructed to summarize the most important information given in the
x-variables, we might expect that outliers with respect to this k0 -dimensional subspace are
often also outlying in the (t, y) space. Hence, we will estimate the center µ and the scatter
Σ of the (t, y)-variables as the weighted mean and covariance matrix of those (ti , yi ) whose
corresponding (xi , yi ) are not outlying to K0 :
à ! n
à ! n
µ̂t X ti X
µ̂ = = wi /( wi ) (26)
µ̂y i=1
yi i=1
à ! à !
Σ̂t Σ̂ty
n
X ti ³ ´ Xn
Σ̂ = = wi t0i y 0i /( wi − 1) (27)
Σ̂yt Σ̂y i=1
yi i=1
We set wi = 0 if
$$OD_i > \sqrt{\hat\mu_{OD^2} + \hat\sigma_{OD^2}\;z_{0.975}}, \qquad (28)$$
and wi = 1 otherwise. Here ODi denotes the orthogonal distance of (xi , yi ) to the subspace K0 , µ̂OD² and σ̂OD² are estimates of the center and the scale of the squared orthogonal distances, and z0.975 = Φ−1 (0.975) is the 97.5% quantile of the Gaussian distribution. Another approximation is explained in [8]. One can of course also plot the orthogonal distances to see
whether some of them are much larger than the others. We recommend this last approach when an interactive analysis of a particular data set is possible.
Having identified the observations with weight 1, we thus compute µ̂ and Σ̂ from (26)
and (27). Then, we proceed as in the MCD regression method. We plug these estimates
in (9) to (11), compute residual distances as in (21) and perform a reweighted MLR. This
reweighting step has the advantage that it might again include observations with wi = 0
which are not regression outliers. We will refer to this algorithm as the RSIMPLS method.
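In code, the RSIMPLS regression step reduces to a weighted mean and covariance of (ti , yi ) followed by the plug-in formulas. A sketch, assuming the ROBPCA weights wi of (28) are supplied as a 0/1 vector and that (9)-(11) take the covariance-based form given in Section 2:

```python
import numpy as np

def robpca_regression_step(T, Y, w):
    """Weighted center/scatter of (t, y) as in (26)-(27), then the
    MLR formulas (9)-(11) evaluated at these robust estimates."""
    TY = np.hstack([T, Y])
    k = T.shape[1]
    wsum = w.sum()
    mu = (w[:, None] * TY).sum(axis=0) / wsum              # (26)
    D = TY - mu
    Sigma = (w[:, None] * D).T @ D / (wsum - 1)            # (27)
    St, Sty = Sigma[:k, :k], Sigma[:k, k:]
    Sy = Sigma[k:, k:]
    A = np.linalg.solve(St, Sty)                           # A_hat, cf. (9)
    alpha0 = mu[k:] - A.T @ mu[:k]                         # alpha0_hat, cf. (10)
    Sigma_f = Sy - A.T @ St @ A                            # Sigma_f_hat, cf. (11)
    return A, alpha0, Sigma_f
```

A reweighting step based on the residual distances (21)-(22) then follows, exactly as in the MCD regression.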
Remark 4 Both proposed robust PLS algorithms have several equivariance properties. The
ROBPCA method is orthogonally equivariant, which means that orthogonal transformations
of the data (rotations, reflections) transform the loadings appropriately and leave the scores
unchanged. Consequently, it can easily be derived that RSIMCD and RSIMPLS are equiv-
ariant for translations and orthogonal transformations in x and y. More precisely, let u ∈ Rp , v ∈ Rq , C be any p-dimensional orthogonal matrix and D any q-dimensional orthogonal matrix. If (β̂0 , B̂) denotes the estimates obtained by running RSIMCD or RSIMPLS on the original data (xi , yi ), then for the transformed data (Cxi + u, Dyi + v) it holds that:
$$\hat B(Cx_i + u,\; Dy_i + v) = C\,\hat B\,D'$$
$$\hat\beta_0(Cx_i + u,\; Dy_i + v) = D\,\hat\beta_0 + v - D\,\hat B'\,C'\,u.$$
Remark 5 Instead of using hard-rejection rules for defining outliers, we could also apply
continuous weights between 0 and 1. But this introduces additional choices (weight functions,
cut-off values) and the results are not necessarily improved. For a more detailed discussion,
see [9].
Remark 6 We have also developed a robust PLS1 algorithm following the approach de-
veloped in [5] but by replacing the Stahel-Donoho estimator with the ROBPCA covariance
matrix. However, the results were in general not better than those obtained with RSIMPLS
or RSIMCD. Hence we prefer RSIMPLS because it can also be applied when q ≥ 1.
3.3 Comparison
In Section 6, we present the results of a thorough simulation study where we compare
the performance of these two different robust methods. These simulations indicate that
RSIMCD and RSIMPLS are very comparable, but a comparison of the mean CPU-times in seconds over five runs for different situations, computed on a Pentium IV with 1.60 GHz (see Table 1), shows that RSIMPLS is roughly twice as fast as RSIMCD. This is explained by the fact that we apply FAST-MCD in the second stage of RSIMCD. Hence, in the following
sections, we will mainly concentrate on RSIMPLS.
When an independent test set is available, a popular measure of predictive power is the root mean squared error
$$\mathrm{RMSE}_k = \sqrt{\frac{1}{n_T\,q}\sum_{i=1}^{n_T}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{ij(k)}\big)^2}, \qquad (29)$$
where nT denotes the size of the test set. Here, the predicted values ŷij(k) of observation i in the test set are based on the parameter estimates that are obtained from the training set using a PLS method with k components. One then chooses kopt as the k-value which gives the smallest, or a sufficiently small, value of RMSEk .
This RMSEk statistic can however attain unreliable values if the test set contains outliers,
even if the fitted values are based on a robust PLS algorithm. Such outlying observations
increase the RMSEk because they fit the model badly. Consequently, our decision about
kopt might be wrong. Therefore, we propose to remove the outliers from the test set before
computing RMSEk . Formally, let ri(k) be the residual for the ith observation in the test set and $c_{i(k)} = I\big(r_{i(k)}'\,\hat\Sigma_e^{-1}\,r_{i(k)} < \chi^2_{q,0.975}\big)$. The weight ci(k) thus tells whether or not the ith observation is outlying with respect to the PLS model with k components. Then, we select the test data points which are not outlying in any of the models by computing ci = mink ci(k) . Let Gt denote the set of points for which ci = 1, and let nt be its size: |Gt | = nt . Finally, for each k, we define the robust RMSEk value as
$$\text{R-RMSE}_k = \sqrt{\frac{1}{n_t\,q}\sum_{i\in G_t}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{ij(k)}\big)^2}. \qquad (30)$$
This approach is fast because we only need to run the PLS algorithm once for each k. But
an independent test set is only exceptionally available. This can be solved by splitting the
original data into a training and a test set. However, the data sets we consider generally
have a limited number of observations and it is preferable that the number of observations
in the training step is at least 6 to 10 times the number of variables. That is why we
concentrate on the cross-validated RMSEk , which will be denoted by RMSECVk [24], [9].
Usually, the one-fold or leave-one-sample-out statistic is obtained as in (29) but now the index
i runs over the set of all the observations, and the predicted values ŷij(k) are based on the
PLS estimates obtained by removing the ith observation from the data set. The optimal
number of components is then again taken as the value kopt for which RMSECVk is minimal
or sufficiently small.
However, as we have argued for RMSEk , the RMSECVk statistic is also vulnerable to outliers, so we also remove the outlying observations. Let B̂−i , β̂0,−i and Σ̂e,−i denote the parameter estimates based on the data set without the ith observation, $r_{-i(k)} = y_i - \hat\beta_{0,-i} - \hat B_{-i}'\,x_i$ the ith cross-validated residual, and $RD_{-i(k)}^2 = r_{-i(k)}'\,\hat\Sigma_{e,-i}^{-1}\,r_{-i(k)}$ the squared cross-validated residual distance as in (21). Then, analogously to (22), the cross-validated weight assigned to the ith observation is defined as
$$c_{-i(k)} = I\big(RD_{-i(k)}^2 \le \chi^2_{q,0.975}\big).$$
If c−i(k) = 0, observation (xi , yi ) is recognized as a regression outlier in the PLS model with
k components. Several PLS models are constructed for k = 1, . . . , ktot components with
ktot the total or maximal number of components under consideration. Because we want to
compare these ktot different models, we should evaluate their predictive power on the same
set of observations. Hence, we could eliminate those observations which are outlying in any
of the models by defining for each observation
$$c_{-i} = \min_k\, c_{-i(k)}. \qquad (31)$$
Let Gc denote the subset of observations for which c−i = 1 with |Gc | = nc , then an obser-
vation belongs to the set Gc when it is observed as a regular sample in each of the ktot PLS
models. For each k, we then define the robust RMSECVk value as
$$\text{R-RMSECV}_k = \sqrt{\frac{1}{n_c\,q}\sum_{i\in G_c}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{-ij(k)}\big)^2} \qquad (32)$$
with the cross-validated fitted value $\hat y_{-i(k)} = \hat\beta_{0,-i} + \hat B_{-i}'\,x_i$. This approach has the advantage
that any suspicious observation is discarded in the R-RMSECVk statistic. It is also followed
in [17] to construct robust stepwise regression by means of a bounded-influence estimator
for prediction. On the other hand, when the number of observations is small, increasing ktot
can lead to sets Gc for which the number of observations nc is small compared to the total
number of observations n. Let us e.g. consider the Fish data set. When we choose ktot = 5,
nc = 30 (out of n = 45), but with ktot = 9, nc = 23, which is only half of the observations.
To avoid such small calibration sets, we alternatively define
$$c_{-i} = \operatorname{med}_k\, c_{-i(k)}. \qquad (33)$$
With this definition, only those data points that are outlying with respect to most of the PLS
models under consideration are removed. Note that when ktot is even, we take the low-median of the c−i(k) in order to obtain a weight c−i which is always exactly zero or one. For the Fish
data set, we then obtain nc = 34 when ktot = 9. Figure 3 displays the R-RMSECV curve
for the Fish data with the two different weight functions. We see that the two curves do not differ much, and they both indicate that three components should be selected in the regression model,
which is similar to the conclusion of the analysis in [5] and [6]. We also superimposed the
R-RMSECV curves for the SIMPLS algorithm based on the same subsets Gc of observations
as RSIMPLS and again conclude that three components are sufficient.
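A sketch of the two weighting rules and the resulting R-RMSECV values is given below, assuming the cross-validated fitted values and the 0/1 weights c−i(k) have already been obtained for k = 1, . . . , ktot ; the function name and argument layout are our own.

```python
import numpy as np

def r_rmsecv(Y, Yhat_cv, c, rule="median"):
    """Robust cross-validated RMSE per number of components, cf. (31)-(33).

    Y       : (n, q) responses
    Yhat_cv : (ktot, n, q) cross-validated fitted values for k = 1..ktot
    c       : (ktot, n) 0/1 cross-validated weights c_{-i(k)}
    rule    : "min" keeps points that are regular in every model (31);
              "median" keeps points regular in most models (33, low median)
    """
    if rule == "min":
        keep = c.min(axis=0).astype(bool)
    else:
        ktot = c.shape[0]
        keep = c.sum(axis=0) > ktot / 2     # low median of 0/1 weights equals 1
    nc, q = int(keep.sum()), Y.shape[1]
    resid = Y[None, keep, :] - Yhat_cv[:, keep, :]
    return np.sqrt((resid ** 2).sum(axis=(1, 2)) / (nc * q)), keep
```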
The drawback of cross-validation is its computation time, because for each k the PLS
algorithm has to be run n times. To speed up the computations, we therefore fix k0 , the
number of principal components that are selected in ROBPCA to obtain the robust center µ̂z
and scatter Σ̂z in (17). If we then increase k, we only need to compute one extra component by deflating $S^a_{xy}$ once more, as explained in (7). Fixing k0 has the additional advantage
that the weights wi that are needed in the ROBPCA regression do not change with k. To
determine k0 , we first compute ktot ≤ min{p, kmax = 10} that will be used as a maximal value
for k in the regression. The total number of parameters to be estimated equals kq for the
slope matrix Â, q for the intercept α̂0 and q(q − 1)/2 (or 1 when q = 1) for Σ̂e . To avoid
overfitting, we then require that
$$k_{tot}\,q + q + \frac{q(q-1)}{2} < h \qquad (34)$$
where h stands for the size of the subset that is used in the ROBPCA, or MCD regression,
and which should be a lower bound for the number of regular observations out of n. Note
that if q = 1, we have only one scale estimate σ̂e , hence we require that
ktot + 2 < h.
Having determined ktot , we then set k0 = ktot +q. For the Fish data, this implies that ktot = 9
and k0 = 10.
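These rules are easy to automate. The sketch below assumes that [·] denotes the integer part in the expressions above; the function name is ours.

```python
def choose_ktot_k0(n, p, q, alpha=0.75, kmax=10):
    """Largest ktot satisfying (34) (or ktot + 2 < h when q = 1), with h as
    in Remark 3, and the corresponding k0 = ktot + q."""
    h = max(int(alpha * n), (n + 10 + q + 1) // 2)
    ktot = min(p, kmax)
    while ktot > 1:
        n_par = ktot + 2 if q == 1 else ktot * q + q + q * (q - 1) // 2
        if n_par < h:                        # condition (34)
            break
        ktot -= 1
    return ktot, ktot + q
```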
Remark 7 In [9], a robust $R_k^2$ value is also defined to determine the optimal number of components. For q = 1, it is defined by
$$R_k^2 = 1 - \frac{\sum_{i\in G_t} r_{i(k)}^2}{\sum_{i\in G_t}(y_i - \bar y_c)^2}$$
with $\bar y_c = \sum_{i\in G_t} y_i / n_t$ and Gt defined as in (30) with the test set being equal to the full data set. In the multivariate case (q > 1), this is generalized to
$$R_k^2 = 1 - \frac{\sum_{i\in G_t}\sum_{j=1}^{q} r_{ij(k)}^2}{\sum_{i\in G_t}\sum_{j=1}^{q}(y_{ij} - \bar y_j)^2}$$
where $\bar y_j = \sum_{i\in G_t} y_{ij}/n_t$. The optimal number of components kopt is then chosen as the
smallest value k for which Rk2 attains e.g. 80% or the Rk2 curve becomes nearly flat. This
approach is fast because it avoids cross-validation, but merely measures the variance of the
residuals instead of the prediction error.
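A direct transcription of this robust $R_k^2$, assuming the fitted values for k components and a boolean mask for membership of Gt are available (names ours):

```python
import numpy as np

def robust_r2(Y, Yhat_k, in_Gt):
    """Robust R^2_k of Remark 7, computed on the subset G_t only."""
    Yg = Y[in_Gt]
    resid = Yg - Yhat_k[in_Gt]
    ybar = Yg.mean(axis=0)                    # column means over G_t
    return 1.0 - (resid ** 2).sum() / ((Yg - ybar) ** 2).sum()
```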
The robust prediction error is then estimated by
$$\text{R-RMSEP}_{k_{opt}} = \sqrt{\frac{1}{n_p\,q}\sum_{i\in G_p}\sum_{j=1}^{q}\big(y_{ij} - \hat y_{-ij(k_{opt})}\big)^2}, \qquad (35)$$
where Gp is now the subset of observations with non-zero weight c−i(kopt ) in the PLS model
with kopt components, and |Gp | = np . The fitted values are obtained with k0 = kopt + q in
ROBPCA. Note that using this definition, we include all the regular observations for the
model with kopt components, which is more precise than the set Gc that is used in (32) and
which depends on ktot . Hence, in general R-RMSEPkopt will be different from R-RMSECVkopt .
For the Fish data set and RSIMPLS, we obtain R-RMSEP3 = 0.51 based on np = 33
observations. If we perform SIMPLS, and calculate the R-RMSEP3 value on the same set
of observations, we obtain 0.82, hence the robust fit yields a smaller prediction error. To
test whether this difference is significant, we applied the following bootstrap procedure. We
have drawn 150 bootstrap samples of size np = 33 from the (xi , yi ) with c−i(3) = 1. For each
bootstrap sample we have refitted the model with three components, and we have computed
the R-RMSEP3 as in (35) with the set Gp being fixed during all the computations. The
standard deviation of these 150 R-RMSEP values was equal to σ̂R-RMSEP = 0.09 and can be
used as an approximation to the true standard deviation of the R-RMSEP statistic. Because
0.82 > 0.51 + 2.5 ∗ 0.09 = 0.735 we conclude that the R-RMSEP3 based on RSIMPLS is
significantly different from R-RMSEP3 obtained with SIMPLS (at the 1% level).
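The bootstrap assessment can be sketched as follows. Here `fit_predict` is a placeholder for the robust fit being evaluated (a three-component RSIMPLS fit in the example above), `Gp_idx` holds the integer indices of Gp, which stays fixed throughout, and the cross-validation inside each bootstrap replicate is omitted for brevity.

```python
import numpy as np

def bootstrap_rmsep_sd(X, Y, Gp_idx, fit_predict, n_boot=150, random_state=0):
    """Bootstrap standard deviation of R-RMSEP over the fixed subset G_p."""
    rng = np.random.default_rng(random_state)
    n_p, q = len(Gp_idx), Y.shape[1]
    vals = np.empty(n_boot)
    for b in range(n_boot):
        boot = rng.choice(Gp_idx, size=n_p, replace=True)   # resample G_p with replacement
        Yhat = fit_predict(X[boot], Y[boot], X[Gp_idx])      # refit, predict on fixed G_p
        vals[b] = np.sqrt(((Y[Gp_idx] - Yhat) ** 2).sum() / (n_p * q))
    return vals.std(ddof=1)
```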
5 Outlier Detection
5.1 Regression diagnostic plot
To identify the outlying observations, we will first construct a regression diagnostic plot as
in [20], [22] and [9]. Its goal is to identify outlying observations with respect to the regression
model (19). In our two robust PLS methods, we perform a regression of the q-dimensional yi
on the k-dimensional ti = ti(k) assuming k = kopt . We can distinguish three types of outliers.
These different types are represented in Figure 4(a) for the case of simple regression, so when
q = 1 and k = 1. Good leverage points lie in the direction of the fitted line or subspace, but
have outlying t-values. This is also the case for bad leverage points, that moreover do not
fit the model well. Vertical outliers are only outlying in the y-space. The latter two types of
outliers are known to be very influential for the classical least squares regression fit, because
they cause the slope to be tilted in order to accommodate the outliers.
To measure the outlyingness of a point in the t-space, we consider its robustified Mahalanobis distance, which we now call the score distance SDi(k) , defined by
$$SD_{i(k)}^2 = (t_i - \hat\mu_t)'\,\hat\Sigma_t^{-1}\,(t_i - \hat\mu_t)$$
where µ̂t and Σ̂t are derived in the regression step, see e.g. (26) and (27). When we perform SIMPLS, this score distance reduces to (15) because µ̂t̃ = 0. This score distance is put on the horizontal axis of the regression diagnostic plot and exposes the good and the bad leverage points. By analogy with [20], [22] and [9], leverage points are those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$. On the vertical axis, we put the residual distance RDi(k) :
$$RD_{i(k)}^2 = r_{i(k)}'\,\hat\Sigma_e^{-1}\,r_{i(k)}$$
with ri(k) = yi − β̂0 − B̂′xi being the residual of the ith observation. For univariate response variables, this residual distance simplifies to the standardized residual RDi(k) = ri(k) /σ̂e . Vertical outliers and bad leverage points are now observations whose residual distance exceeds $\sqrt{\chi^2_{q,0.975}}$.
[Figure 4 about here]
If the regression parameters are well estimated, i.e. if they are not influenced by the outliers,
this diagnostic plot should thus look as in Figure 4(b) for q = 1, and as in Figure 4(c) for
q > 1. Let us look again at Figure 2(b) which shows the regression diagnostic plot of the Fish
data with RSIMPLS. On this plot we see six clear bad leverage points (1, 12, 41, 43, 44, 45),
two vertical outliers (3, 10), two good leverage points (39, 40) and three borderline cases. The
diagnostic plot from SIMPLS is however not very informative. Some bad leverage points
(44, 45) are converted into good leverage points which illustrates that the least squares
regression is tilted to accommodate the outliers.
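The classification behind this plot can be sketched as follows; the function and variable names are ours, and `residuals` holds ri(k) = yi − β̂0 − B̂′xi .

```python
import numpy as np
from scipy.stats import chi2

def classify_regression_outliers(T, residuals, mu_t, Sigma_t, Sigma_e, alpha=0.975):
    """Score distances SD_{i(k)}, residual distances RD_{i(k)} and the
    resulting outlier labels of the regression diagnostic plot."""
    k, q = T.shape[1], residuals.shape[1]
    dt = T - mu_t
    SD = np.sqrt(np.einsum("ij,jk,ik->i", dt, np.linalg.inv(Sigma_t), dt))
    RD = np.sqrt(np.einsum("ij,jk,ik->i", residuals,
                           np.linalg.inv(Sigma_e), residuals))
    lev = SD > np.sqrt(chi2.ppf(alpha, k))      # leverage points
    out = RD > np.sqrt(chi2.ppf(alpha, q))      # large residual distance
    labels = np.where(lev & out, "bad leverage",
             np.where(lev, "good leverage",
             np.where(out, "vertical outlier", "regular")))
    return SD, RD, labels
```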
This allows us again to identify three types of outliers. Bad PCA-leverage points have
outlying SDi(k) and ODi(k) , good PCA-leverage points have only outlying SDi(k) , whereas
orthogonal outliers have only outlying ODi(k) . The latter ones are not yet visible on the
regression diagnostic plot. They have the property that they lie far from the t-space, but
they become regular observations after projection in the t-space. Hence, they will not badly
influence the computation of the regression parameters, but they might influence the load-
ings.
For the Fish data set this diagnostic plot is presented in Figure 5(a) for SIMPLS and in
Figure 5(b) for RSIMPLS. The horizontal line is computed as in (28). The outliers detected
in the regression diagnostic plot (Figure 2(b)) are all recognized as leverage points in this
score diagnostic plot. Furthermore we detect observation 42 as an orthogonal outlier. We
also detect two other orthogonal outliers (10, 28). For SIMPLS, this score diagnostic plot
(Figure 5(a)) also discovers sample 42 as an orthogonal outlier, but observations 43–45 are
classified as good leverage points.
Note that the regression and the score diagnostic plot can also be combined into one
three-dimensional figure exposing (SDi(k) , ODi(k) , RDi(k) ), see also [9].
6 Simulation Study
To compare the different algorithms, we have performed several simulations with low- and
high-dimensional data sets. For each situation, we generated 1000 data sets. First, we
consider the case without contamination. The data sets were then generated according to
the bilinear model (2) and (3), with:
Here, (Ik,p )i,j = 1 for i = j and 0 elsewhere. These simulation settings imply that kopt = k.
Next, we introduced different types of outliers by randomly replacing nε of the n observations, with ε = 10%. The conclusions obtained for ε = 20% contamination were similar to those for ε = 10% and are therefore not included. If Tε , Xε and Yε denote the contaminated data, the bad leverage regression points were constructed as:
$$T_\varepsilon \sim N_k(10_k,\, \Sigma_t) \qquad (36)$$
$$X_\varepsilon = T_\varepsilon\, I_{k,p} + N_p(0_p,\, 0.1\,I_p), \qquad (37)$$
whereas the y-variables were not changed. The vertical outliers were generated with the
uncontaminated t-variables, but adjusted y-variables:
the parameters in the model, whereas the intercept is mostly influenced by vertical outliers.
The orthogonal outliers are most influential for the low-dimensional data sets.
In contrast with SIMPLS, the values for the robust algorithms do not change very much.
In almost every setting, the differences between RSIMCD and RSIMPLS are very small.
Both robust PLS methods are thus comparable, but as mentioned in Section 3.3 we prefer
RSIMPLS because it is computationally more attractive than RSIMCD.
We then performed RSIMPLS with kopt = 3 and k0 = 6, and obtained the robust diagnostic
plot in Figure 7(b). Observation 21 stands out as a clear outlier with a very large robust
residual distance around 60. Observation 23 is also recognized as a bad leverage point having
the largest score distance. Further, we distinguish three bad leverage points (7, 20, 24) with
merely large score distances, and one vertical outlier (22) with a somewhat larger residual
distance. There are also some borderline cases (20, 33). With SIMPLS, we obtain the
regression diagnostic plot in Figure 7(a). The three most extreme outliers (21, 23, 7) seen
from the robust analysis, are still detected, but their distances have changed enormously.
Observation 21 now has a residual distance RD21(3) = 5.91, and observation 23 has score distance SD23(3) = 4.23. Observation 23 is almost turned into a good leverage point, whereas case 7 is a boundary case because its residual distance is only 3.71, which does not lie very far from $\sqrt{\chi^2_{3,0.975}} = 3.06$.
The RSIMPLS score diagnostic plot is shown in Figure 8(b). Observations 7, 20, 21, 23
and 24 are detected as bad PCA-leverage points. The score diagnostic plot for SIMPLS in Figure 8(a) only indicates 23 as a good PCA-leverage point.
The robust prediction error R-RMSEP3 = 0.53. If we compute R-RMSEP3 with the fitted
values obtained with SIMPLS and Gp as in RSIMPLS, we obtain 0.70 for the prediction error.
This shows that RSIMPLS yields a lower prediction error than SIMPLS, evaluated at the
same subset of observations. To know whether this difference is significant, we applied the
same bootstrap procedure as explained in Section 4.2, from which we derived the standard
deviation σ̂R-RMSEP = 0.12. This yields a difference that is significant only at approximately the 15% level.
To finish this example, we illustrate that for this data set, it is worthwhile to consider
the multivariate regression model where the three y-variables are simultaneously modelled
instead of performing three univariate calibrations. First, we computed the univariate pre-
diction errors based on the multivariate estimates. So we computed R-RMSEP3 for each
response variable separately (j = 1, . . . , 3):
$$\text{R-RMSEP}_3 = \sqrt{\frac{1}{n_p}\sum_{i\in G_p}\big(y_{ij} - \hat y_{-ij(3)}\big)^2}$$
where ŷ−ij(3) are the fitted values from the multivariate regression and Gp is the subset of
observations retained in the multivariate regression. We obtained R-RMSEP(flour) = 0.37,
R-RMSEP(sucrose) = 0.82 and R-RMSEP(water) = 0.19. Then, we have applied RSIMPLS
for the three concentrations separately. It turned out that three components were satisfactory
for every response. These three univariate regressions resulted in R-RMSEP(flour) = 0.40,
R-RMSEP(sucrose) = 0.95 and R-RMSEP(water) = 0.18. Also these latter prediction errors
are based on the same subset Gp from the multivariate approach. For flour and sucrose we
thus obtain a higher prediction accuracy with the multivariate regression, whereas only water
is slightly better fitted by its own model.
8 Conclusion
In this paper we have proposed two new robust PLSR algorithms based on the SIMPLS
algorithm. RSIMCD and RSIMPLS can be applied to low- and high-dimensional regressor
variables, and to one or multiple response variables. First, robust scores are constructed,
and then the analysis is followed by a robust regression step. Simulations have shown that
they are resistant to many types of contamination, while their performance on uncontaminated data sets remains good. We recommend RSIMPLS because it is roughly twice
as fast as RSIMCD. A Matlab implementation of RSIMPLS is available at the web site
www.wis.kuleuven.ac.be/stat/robust.html as part of the Matlab toolbox for Robust
Calibration [27].
We have also proposed robust RMSECV curves to select the number of components, and
a robust estimate of the prediction error. Diagnostic plots are introduced to discover the
different types of outliers in the data and are illustrated on some real data sets. The advantage of the multivariate approach has also been illustrated.
In [4], a comparative study is made between RSIMPLS and RPCR with emphasis on
the predictive ability and the goodness-of-fit of these methods when varying the number of
components k. Currently, we are developing faster algorithms to compute the R-RMSECV
values to allow fast and robust model selection in multivariate calibration.
References
[1] Cummins DJ, Andrews CW. Iteratively reweighted partial least squares: a performance
analysis by Monte Carlo simulation. J. Chemometrics 1995; 9:489–507.
[2] de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemo-
metrics Intell. Lab. Syst. 1993; 18:251–263.
[3] Donoho DL. Breakdown Properties of Multivariate Location Estimators, Ph.D. Qualify-
ing paper, Harvard University, 1982.
[4] Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and ro-
bust PLS: a comparative study; To appear in Theory and Applications of Recent
Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst,
Series: Statistics for Industry and Technology, Birkhauser, Basel. Available at
http://www.wis.kuleuven.ac.be/stat/robust.html.
[5] Gil JA, Romera R. On robust partial least squares (PLS) methods. J. Chemometrics
1998; 12:365–378.
[6] Hardy AJ, MacLaurin P, Haswell SJ, de Jong S, Vandeginste BG. Double-case diagnostic
for outliers identification. Chemometrics Intell. Lab. Syst. 1996; 34:117–129.
[7] Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components
with applications to chemometrics. Chemometrics Intell. Lab. Syst. 2002; 60:101–111.
[8] Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust
principal component analysis; submitted to Technometrics. Under revision. Available
at http://www.wis.kuleuven.ac.be/stat.
[10] Hubert M, Rousseeuw PJ, Verboven S. Robust PCA for high-dimensional data In: Dut-
ter R, Filzmoser P, Gather U, Rousseeuw PJ (eds.), Developments in Robust Statistics,
2003; Physika Verlag, Heidelberg, 169–179.
[11] Johnson R, Wichern D. Applied Multivariate Statistical Analysis (4th edn). Prenctice
Hall: New Jersey, 1998.
[14] Naes T. Multivariate calibration when the error covariance matrix is structured. Tech-
nometrics 1985; 27:301–311.
[15] Osborne BG, Fearn T, Miller AR, Douglas S. Application of near infrared reglectance
spectroscopy to the compositional analysis of biscuits and biscuit dough. J. Scient. Food
Agric. 1984; 35:99–105.
[16] Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical
techniques. Chemometrics Intell. Lab. Syst. 2000; 52:87–104.
[18] Rousseeuw PJ. Least median of squares regression. J. Am. Statist. Assoc. 1984; 79:871–
880.
[19] Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley: New York,
1987.
[20] Rousseeuw PJ, van Zomeren BC. Unmasking multivariate outliers and leverage points.
J. Am. Statist. Assoc. 1990; 85:633-651.
[21] Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determi-
nant estimator. Technometrics 1999; 41:212–223.
[22] Rousseeuw PJ, Van Aelst S, Van Driessen K, Agulló J. Robust multivari-
ate regression 2002; submitted to Technometrics. Under revision. Available at
http://win-www.uia.ac.be/u/statis.
[23] Stahel WA. Robust Estimation: Infinitesimal Optimality and Covariance Matrix Esti-
mators, Ph.D. thesis, ETH, Zürich, 1981.
[24] Tenenhaus M. La Régression PLS: Théorie et Pratique. Éditions Technip: Paris, 1998.
[25] Tucker LR. An inter-battery method of factor analysis. Psychometrika 1958; 23: 111–
136.
[27] Verboven S, Hubert M. A Matlab toolbox for robust calibration. In preparation, 2003.
[28] Wakeling IN, Macfie HJH. A robust PLS procedure. J. Chemometrics 1992; 6:189–198.
[29] Wold H. Estimation of principal components and related models by iterative least
squares. Multivariate Analysis, Academic Press, New York, 1966, 391–420.
[30] Wold H. Soft modelling by latent variables: the non-linear iterative partial least squares
(NIPALS) approach. Perspectives in Probability and Statistics (papers in honour of
M.S. Bartlett on the occasion of his 65th birthday) 1975; 117-142. Applied Probability
Trust, Univ. Sheffield, Sheffield.
Figure 2: Regression diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.
Figure 3: The R-RMSECV curves for the Fish data set: R-RMSECV curve for RSIMPLS
based on (31) (solid line and •), for RSIMPLS based on (33) (solid line and ¨), for SIMPLS
based on (31) (dashed line and •) and for SIMPLS based on (33) (dashed line and ¨).
Figure 4: Different types of outliers in regression: (a) scatterplot in simple regression; (b)
regression diagnostic plot for univariate response variable; (c) regression diagnostic plot for
multivariate response variables.
Figure 5: Score diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.
Figure 6: The R-RMSECV curve for the Biscuit dough data set.
Figure 7: Regression diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b)
RSIMPLS.
Figure 8: Score diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIM-
PLS.
Table 1: The mean CPU-time in seconds over five runs of RSIMCD and RSIMPLS for several set-ups.

                   q = 1                          q = 5
  n    p      k    RSIMCD   RSIMPLS       k    RSIMCD   RSIMPLS
  50   100    5    10.59    6.42          5    13.64    7.45
              10   14.87    7.90          10   18.22    9.03
       500    5    10.60    6.47          5    13.81    7.68
              15   14.80    8.03          15   18.62    9.41
  100  5      1    7.64     5.65          5    14.38    8.08
              5    11.00    6.77          10   14.43    8.07
       500    5    11.64    7.39          5    15.11    8.80
              15   16.01    9.00          15   19.93    10.57
Table 2: The simulation settings.

  Table   q   n    p    k   Σt                       Σe
  3       1   100  5    2   diag(4, 2)               1
  4       1   50   100  5   diag(7, 5, 3.5, 2.5, 1)  1
  5       5   100  10   3   diag(4, 2, 1)            Iq
  5       5   50   100  5   diag(7, 5, 3.5, 2.5, 1)  Iq
Table 3: Simulation results for low-dimensional regressors (p = 5) and one response variable
(q = 1).
Table 4: Simulation results for high-dimensional regressors (p = 100) and one response
variable (q = 1).