Regression Shrinkage and Selection via the Lasso
By ROBERT TIBSHIRANI†
University of Toronto, Canada
SUMMARY
We propose a new method for estimation in linear models. The 'lasso' minimizes the
residual sum of squares subject to the sum of the absolute value of the coefficients being less
than a constant. Because of the nature of this constraint it tends to produce some
coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies
suggest that the lasso enjoys some of the favourable properties of both subset selection and
ridge regression. It produces interpretable models like subset selection and exhibits the
stability of ridge regression. There is also an interesting relationship with recent work in
adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and
can be applied in a variety of statistical models: extensions to generalized regression models
and tree-based models are briefly described.
Keywords: QUADRATIC PROGRAMMING; REGRESSION; SHRINKAGE; SUBSET SELECTION
1. INTRODUCTION
Consider the usual regression situation: we have data $(x^i, y_i)$, $i = 1, 2, \ldots, N$, where
$x^i = (x_{i1}, \ldots, x_{ip})^{\mathrm T}$ and $y_i$ are the regressors and response for the $i$th observation.
The ordinary least squares (OLS) estimates are obtained by minimizing the residual
squared error. There are two reasons why the data analyst is often not satisfied with
the OLS estimates. The first is prediction accuracy: the OLS estimates often have low
bias but large variance; prediction accuracy can sometimes be improved by shrinking
or setting to 0 some coefficients. By doing so we sacrifice a little bias to reduce the
variance of the predicted values and hence may improve the overall prediction
accuracy. The second reason is interpretation. With a large number of predictors, we
often would like to determine a smaller subset that exhibits the strongest effects.
The two standard techniques for improving the OLS estimates, subset selection
and ridge regression, both have drawbacks. Subset selection provides interpretable
models but can be extremely variable because it is a discrete process - regressors are
either retained or dropped from the model. Small changes in the data can result in
very different models being selected and this can reduce its prediction accuracy.
Ridge regression is a continuous process that shrinks coefficients and hence is more
stable: however, it does not set any coefficients to 0 and hence does not give an easily
interpretable model.
We propose a new technique, called the lasso, for 'least absolute shrinkage and
selection operator'. It shrinks some coefficients and sets others to 0, and hence tries to
retain the good features of both subset selection and ridge regression.
†Address for correspondence: Department of Preventive Medicine and Biostatistics, and Department of Statistics,
University of Toronto, 12 Queen's Park Crescent West, Toronto, Ontario, M5S 1A8, Canada.
E-mail: tibs@utstat.toronto.edu
2. THE LASSO
2.1. Definition
Suppose that we have data $(x^i, y_i)$, $i = 1, 2, \ldots, N$, where $x^i = (x_{i1}, \ldots, x_{ip})^{\mathrm T}$ are
the predictor variables and $y_i$ are the responses. As in the usual regression set-up, we
assume either that the observations are independent or that the $y_i$s are conditionally
independent given the $x_{ij}$s. We assume that the $x_{ij}$ are standardized so that
$\sum_i x_{ij}/N = 0$ and $\sum_i x_{ij}^2/N = 1$.

Letting $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^{\mathrm T}$, the lasso estimate $(\hat\alpha, \hat\beta)$ is defined by

$$(\hat\alpha, \hat\beta) = \arg\min \Big\{ \sum_{i=1}^N \Big( y_i - \alpha - \sum_j \beta_j x_{ij} \Big)^2 \Big\} \quad \text{subject to} \quad \sum_j |\beta_j| \le t. \qquad (1)$$

Here $t \ge 0$ is a tuning parameter. Now, for all $t$, the solution for $\alpha$ is $\hat\alpha = \bar y$. We can
assume without loss of generality that $\bar y = 0$ and hence omit $\alpha$.
Computation of the solution to equation (1) is a quadratic programming problem
with linear inequality constraints. We describe some efficient and stable algorithms
for this problem in Section 6.
The parameter $t \ge 0$ controls the amount of shrinkage that is applied to the
estimates. Let $\hat\beta_j^o$ be the full least squares estimates and let $t_0 = \sum_j |\hat\beta_j^o|$. Values of
$t < t_0$ will cause shrinkage of the solutions towards 0, and some coefficients may be
exactly equal to 0. For example, if $t = t_0/2$, the effect will be roughly similar to
finding the best subset of size p/2. Note also that the design matrix need not be of full
rank. In Section 4 we give some data-based methods for estimation of t.
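As a concrete illustration (mine, not part of the original paper), the bound-form problem (1) can be handed directly to a modern convex solver. The sketch below assumes the cvxpy package; the simulated data, the coefficients used to generate the response and the bound t are arbitrary choices.

```python
# A minimal sketch of the lasso in its bound form (1), solved with cvxpy.
# The data X, y and the bound t are illustrative assumptions, not from the paper.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p = 50, 8
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: mean 0, unit variance
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.standard_normal(N)
y = y - y.mean()                               # centre y, so the intercept can be omitted

t = 3.0                                        # the lasso bound
beta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(y - X @ beta))
constraints = [cp.norm1(beta) <= t]
cp.Problem(objective, constraints).solve()

print(np.round(beta.value, 3))                 # several entries are (numerically) zero
```

Solving the problem this way mirrors definition (1) exactly; Section 6 describes the purpose-built algorithms used in the paper.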
The motivation for the lasso came from an interesting proposal of Breiman (1993).
Breiman's non-negative garotte minimizes
$$\sum_{i=1}^N \Big( y_i - \alpha - \sum_j c_j \hat\beta_j^o x_{ij} \Big)^2 \quad \text{subject to} \quad c_j \ge 0, \quad \sum_j c_j \le t. \qquad (2)$$
The garotte starts with the OLS estimates and shrinks them by non-negative factors
whose sum is constrained. In extensive simulation studies, Breiman showed that the
garotte has consistently lower prediction error than subset selection and is
competitive with ridge regression except when the true model has many small non-
zero coefficients.
A drawback of the garotte is that its solution depends on both the sign and the
magnitude of the OLS estimates. In overfit or highly correlated settings where the
OLS estimates behave poorly, the garotte may suffer as a result. In contrast, the lasso
avoids the explicit use of the OLS estimates.
Frank and Friedman (1993) proposed using a bound on the Lq-norm of the
parameters, where q is some number greater than or equal to 0; the lasso corresponds
to q = 1. We discuss this briefly in Section 10.
2.2. Orthonormal Design Case
Insight about the nature of the shrinkage can be gleaned from the orthonormal design case. Let $X$ be the $N \times p$ design matrix with $ij$th entry $x_{ij}$, and suppose that $X^{\mathrm T}X = I$, the identity matrix. In this case the solution to equation (1) is

$$\hat\beta_j = \mathrm{sign}(\hat\beta_j^o)\,(|\hat\beta_j^o| - \gamma)^+ , \qquad (3)$$

where $\gamma$ is determined by the condition $\sum|\hat\beta_j| = t$. Interestingly, this has exactly the
same form as the soft shrinkage proposals of Donoho and Johnstone (1994) and
Donoho et al. (1995), applied to wavelet coefficients in the context of function
estimation. The connection between soft shrinkage and a minimum L1-norm penalty
was also pointed out by Donoho et al. (1992) for non-negative parameters in the
context of signal or image recovery. We elaborate more on this connection in Section
10.
In the orthonormal design case, best subset selection of size $k$ reduces to choosing
the $k$ largest coefficients in absolute value and setting the rest to 0. For some choice
of $\lambda$ this is equivalent to setting $\hat\beta_j = \hat\beta_j^o$ if $|\hat\beta_j^o| > \lambda$ and to 0 otherwise. Ridge
regression minimizes

$$\sum_{i=1}^N \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j \beta_j^2 \le t. \qquad (4)$$

The ridge solutions are
$$\frac{\hat\beta_j^o}{1 + \gamma},$$
where $\gamma$ depends on $\lambda$ or $t$. The garotte estimates are
$$\Big( 1 - \frac{\gamma}{\hat\beta_j^{o\,2}} \Big)^+ \hat\beta_j^o .$$
Fig. 1 shows the form of these functions. Ridge regression scales the coefficients by
a constant factor, whereas the lasso translates by a constant factor, truncating at O.
The garotte function is very similar to the lasso, with less shrinkage for larger
coefficients. As our simulations will show, the differences between the lasso and
garotte can be large when the design is not orthogonal.
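To make the forms in Fig. 1 concrete, here is a small sketch (not part of the original paper) of the four shrinkage rules applied to least squares coefficients under an orthonormal design; the threshold and shrinkage values used below are arbitrary illustrations.

```python
# Shrinkage of least squares coefficients b under an orthonormal design.
# The parameters lam (subset threshold) and gamma (shrinkage) are illustrative.
import numpy as np

def subset(b, lam):
    """Best subset: keep b if |b| exceeds the threshold, otherwise set it to 0."""
    return np.where(np.abs(b) > lam, b, 0.0)

def ridge(b, gamma):
    """Ridge: scale every coefficient towards 0 by the same factor."""
    return b / (1.0 + gamma)

def lasso(b, gamma):
    """Lasso: translate |b| towards 0 by gamma, truncating at 0 (soft threshold)."""
    return np.sign(b) * np.maximum(np.abs(b) - gamma, 0.0)

def garotte(b, gamma):
    """Non-negative garotte: shrink by a factor that is milder for large |b|."""
    return np.maximum(1.0 - gamma / b**2, 0.0) * b

b = np.linspace(0.01, 5.0, 6)
print(subset(b, 2.0), ridge(b, 1.0), lasso(b, 2.0), garotte(b, 2.0), sep="\n")
```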
2.3. Geometry of Lasso
It is clear from Fig. 1 why the lasso will often produce coefficients that are exactly
0. Why does this happen in the general (non-orthogonal) setting? And why does it
not occur with ridge regression, which uses the constraint $\sum \beta_j^2 \le t$ rather than
$\sum |\beta_j| \le t$? Fig. 2 provides some insight for the case $p = 2$.
The criterion $\sum_{i=1}^N (y_i - \sum_j \beta_j x_{ij})^2$ equals the quadratic function $(\beta - \hat\beta^o)^{\mathrm T} X^{\mathrm T} X (\beta - \hat\beta^o)$
(plus a constant). The elliptical contours of this function are shown by the full curves
in Fig. 2(a); they are centred at the OLS estimates; the constraint region is the rotated
square. The lasso solution is the first place that the contours touch the square, and
this will sometimes occur at a corner, corresponding to a zero coefficient. The picture
for ridge regression is shown in Fig. 2(b): there are no corners for the contours to hit
and hence zero solutions will rarely result.
An interesting question emerges from this picture: can the signs of the lasso
estimates be different from those of the least squares estimates $\hat\beta_j^o$? Since the variables
are standardized, when $p = 2$ the principal axes of the contours are at $\pm 45°$ to the
co-ordinate axes, and we can show that the contours must contact the square in the
same quadrant that contains $\hat\beta^o$. However, when $p > 2$ and there is at least moderate
correlation in the data, this need not be true. Fig. 3 shows an example in three
dimensions. The view in Fig. 3(b) confirms that the ellipse touches the constraint
region in an octant different from the octant in which its centre lies.
Fig. 1. (a) Subset regression, (b) ridge regression, (c) the lasso and (d) the garotte: form of
coefficient shrinkage in the orthonormal design case, with the 45°-line shown for reference
Fig. 2. Estimation picture for (a) the lasso and (b) ridge regression
Fig. 3. (a) Example in which the lasso estimate falls in an octant different from the overall least
squares estimate; (b) overhead view
Whereas the garotte retains the sign of each $\hat\beta_j^o$, the lasso can change signs. Even in cases
where the lasso estimate has the same sign vector as the garotte, the presence of the OLS
estimates in the garotte can make it behave differently. The model $\sum_j c_j \hat\beta_j^o x_{ij}$ with con-
straint $\sum_j c_j \le t$ can be written as $\sum_j \beta_j x_{ij}$ with constraint $\sum_j \beta_j/\hat\beta_j^o \le t$. If, for example,
$p = 2$ and $\hat\beta_1^o > \hat\beta_2^o > 0$, then the effect would be to stretch the square in Fig. 2(a)
horizontally. As a result, larger values of $\beta_1$ and smaller values of $\beta_2$ will be favoured
by the garotte.
Suppose now that $p = 2$ and that the least squares estimates $\hat\beta_j^o$ are both positive. Then we can show that the lasso estimates are
Fig. 4. Lasso (full curve) and ridge regression (broken curves) for the two-predictor example: the curves show the
$(\hat\beta_1, \hat\beta_2)$ pairs as the bound on the lasso or ridge parameters is varied; starting with the bottom broken
curve and moving upwards, the correlation $\rho$ is 0, 0.23, 0.45, 0.68 and 0.90
$$\hat\beta_j = (\hat\beta_j^o - \gamma)^+ , \qquad (5)$$

where $\gamma$ is chosen so that $\hat\beta_1 + \hat\beta_2 = t$. This formula holds for $t \le \hat\beta_1^o + \hat\beta_2^o$ and is valid
even if the predictors are correlated. Solving for $\gamma$ yields

$$\hat\beta_1 = \min\Big\{ \frac{t}{2} + \frac{\hat\beta_1^o - \hat\beta_2^o}{2},\ t \Big\}, \qquad \hat\beta_2 = \Big\{ \frac{t}{2} - \frac{\hat\beta_1^o - \hat\beta_2^o}{2} \Big\}^+ \qquad (6)$$

(taking $\hat\beta_1^o \ge \hat\beta_2^o$ without loss of generality).
In contrast, the form of ridge regression shrinkage depends on the correlation of
the predictors. Fig. 4 shows an example. We generated 100 data points from the
model $y = 6x_1 + 3x_2$ with no noise. Here $x_1$ and $x_2$ are standard normal variates with
correlation $\rho$. The curves in Fig. 4 show the ridge and lasso estimates as the bounds
on $\beta_1^2 + \beta_2^2$ and $|\beta_1| + |\beta_2|$ are varied. For all values of $\rho$ the lasso estimates follow the
full curve. The ridge estimates (broken curves) depend on p. When p = 0 ridge
regression does proportional shrinkage. However, for larger values of p the ridge
estimates are shrunken differentially and can even increase a little as the bound is
decreased. As pointed out by Jerome Friedman, this is due to the tendency of ridge
regression to try to make the coefficients equal to minimize their squared norm.
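A short sketch (mine, not the paper's) of formula (5) for the two-predictor case with both least squares estimates positive: the shrinkage $\gamma$ is located numerically by bisection so that the shrunken coefficients sum to the bound $t$. The OLS values 6 and 3 echo the noise-free example above but are otherwise illustrative.

```python
# Two-predictor lasso via formula (5): beta_j = (beta_j_ols - gamma)^+,
# with gamma chosen by bisection so that beta_1 + beta_2 = t.
import numpy as np

def two_predictor_lasso(b1_ols, b2_ols, t, tol=1e-10):
    """Both OLS estimates assumed positive and t <= b1_ols + b2_ols."""
    lo, hi = 0.0, max(b1_ols, b2_ols)           # gamma lies in this interval
    while hi - lo > tol:
        gamma = 0.5 * (lo + hi)
        total = max(b1_ols - gamma, 0.0) + max(b2_ols - gamma, 0.0)
        if total > t:
            lo = gamma                          # not enough shrinkage yet
        else:
            hi = gamma
    gamma = 0.5 * (lo + hi)
    return max(b1_ols - gamma, 0.0), max(b2_ols - gamma, 0.0)

for t in (1.0, 3.0, 6.0, 9.0):
    print(t, np.round(two_predictor_lasso(6.0, 3.0, t), 3))
```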
The lasso estimate can be approximated by writing the penalty $\sum|\beta_j|$ as $\sum \beta_j^2/|\beta_j|$, which leads to a ridge-type expression for the fit and suggests approximating the covariance matrix of the estimates by

$$(X^{\mathrm T}X + \lambda W^-)^{-1} X^{\mathrm T}X\,(X^{\mathrm T}X + \lambda W^-)^{-1} \hat\sigma^2 , \qquad (7)$$

where $W = \mathrm{diag}(|\hat\beta_j|)$, $W^-$ is a generalized inverse of $W$ and $\hat\sigma^2$ is an estimate of the error variance. A difficulty with this formula is that it
gives an estimated variance of 0 for predictors with $\hat\beta_j = 0$.
This approximation also suggests an iterated ridge regression algorithm for
computing the lasso estimate itself, but this turns out to be quite inefficient. However,
it is useful for selection of the lasso parameter t (Section 4).
Fig. 5. Lasso shrinkage of coefficients in the prostate cancer example: each curve represents a
coefficient (labelled on the right) as a function of the (scaled) lasso parameter $s = t/\sum|\hat\beta_j^o|$ (the intercept
is not plotted); the broken line represents the model for $s = 0.44$, selected by generalized cross-validation
Fig. 5 shows the lasso estimates as a function of the standardized bound $s = t/\sum|\hat\beta_j^o|$.
Notice that the absolute value of each coefficient tends to 0 as $s$ goes to 0. In this
example, the curves decrease in a monotone fashion to 0, but this does not always
happen in general. This lack of monotonicity is shared by ridge regression and subset
regression, where for example the best subset of size 5 may not contain the best
subset of size 4. The vertical broken line represents the model for $s = 0.44$, the
optimal value selected by generalized cross-validation. Roughly, this corresponds to
keeping just under half of the predictors.
Table 1 shows the results for the full least squares, best subset and lasso
procedures. Section 7.1 gives the details of the best subset procedure that was used.
The lasso gave non-zero coefficients to lcavol, lweight and svi; subset selection chose
the same three predictors. Notice that the coefficients and Z-scores for the selected
predictors from subset selection tend to be larger than the full model values: this is
common with positively correlated predictors. However, the lasso shows the opposite
effect, as it shrinks the coefficients and Z-scores from their full model values.
The standard errors in the penultimate column were estimated by bootstrap
resampling of residuals from the full least squares fit. The standard errors were
computed by fixing s at its optimal value 0.44 for the original data set. Table 2
TABLE 1
Results for the prostate cancer example

Predictor     Least squares               Subset selection            Lasso
              Coef.   SE      Z           Coef.   SE      Z           Coef.   SE      Z
1 intcpt       2.48   0.07   34.46         2.48   0.07   34.05         2.48   0.07   35.43
2 lcavol       0.69   0.10    6.68         0.65   0.09    7.39         0.56   0.09    6.22
3 lweight      0.23   0.08    2.67         0.25   0.07    3.39         0.10   0.07    1.43
4 age         -0.15   0.08   -1.76         0.00   0.00    0.00         0.00   0.01    0.00
5 lbph         0.16   0.08    1.83         0.00   0.00    0.00         0.00   0.04    0.00
6 svi          0.32   0.10    3.14         0.28   0.09    3.18         0.16   0.09    1.78
7 lcp         -0.15   0.13   -1.16         0.00   0.00    0.00         0.00   0.03    0.00
8 gleason      0.03   0.11    0.29         0.00   0.00    0.00         0.00   0.02    0.00
9 pgg45        0.13   0.12    1.02         0.00   0.00    0.00         0.00   0.03    0.00
TABLE 2
Standard error estimates for the prostate cancer example
Fig. 6. Box plots of 200 bootstrap values of the lasso coefficient estimates for the eight predictors
(lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45) in the prostate cancer example
compares the ridge approximation formula (7) with the fixed t bootstrap, and the
bootstrap in which t was re-estimated for each sample. The ridge formula gives a
fairly good approximation to the fixed t bootstrap, except for the zero coefficients.
Allowing t to vary incorporates an additional source of variation and hence gives
larger standard error estimates. Fig. 6 shows box plots of 200 bootstrap replications
of the lasso estimates, with $s$ fixed at the estimated value 0.44. The predictors whose
estimated coefficient is 0 exhibit skewed bootstrap distributions. The central 90%
percentile intervals (fifth and 95th percentiles of the bootstrap distributions) all
contained the value 0, with the exceptions of those for lcavol and svi.
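A sketch of the residual bootstrap idea used above (my code, not the original): residuals from the full least squares fit are resampled, the lasso is refitted to each pseudo-data set with its tuning parameter held fixed, and the spread of the refitted coefficients gives the standard errors. Here the penalty parameter of scikit-learn's Lasso is held fixed, which plays the role of fixing $s$; the data and penalty value are illustrative.

```python
# Residual-bootstrap standard errors for lasso coefficients, with the tuning
# parameter held fixed across replications (a sketch; the paper fixes the scaled
# bound s, whereas the equivalent penalized form is fixed here instead).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, p = 60, 8
X = rng.standard_normal((N, p))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + 3.0 * rng.standard_normal(N)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # full least squares fit
fitted = X @ beta_ols
resid = y - fitted

alpha_fixed = 0.2                                   # illustrative penalty, held fixed
B = 200
boot = np.empty((B, p))
for b in range(B):
    y_star = fitted + rng.choice(resid, size=N, replace=True)   # resample residuals
    boot[b] = Lasso(alpha=alpha_fixed, fit_intercept=False).fit(X, y_star).coef_

print(np.round(boot.std(axis=0), 3))                # bootstrap standard errors
```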
The mean-squared error of an estimate $\hat\eta(X)$ of the true regression function $\eta(X)$ is defined by
$$\mathrm{ME} = E\{\hat\eta(X) - \eta(X)\}^2 ,$$
the expected value taken over the joint distribution of $X$ and $Y$, with $\hat\eta(X)$ fixed. A
similar measure is the prediction error of $\hat\eta(X)$, given by $\mathrm{PE} = E\{Y - \hat\eta(X)\}^2 = \mathrm{ME} + \sigma^2$.
We estimate the prediction error for the lasso procedure by fivefold cross-
validation as described (for example) in chapter 17 of Efron and Tibshirani (1993).
The lasso is indexed in terms of the normalized parameter $s = t/\sum|\hat\beta_j^o|$, and the
prediction error is estimated over a grid of values of $s$ from 0 to 1 inclusive. The value
$\hat s$ yielding the lowest estimated PE is selected.
Simulation results are reported in terms of ME rather than PE. For the linear
models $\eta(X) = X\beta$ considered in this paper, the mean-squared error has the simple
form

$$\mathrm{ME} = (\hat\beta - \beta)^{\mathrm T} V (\hat\beta - \beta), \qquad (9)$$

where $V$ is the population covariance matrix of the predictors.

Writing the lasso constraint $\sum|\beta_j| \le t$ as $\sum \beta_j^2/|\beta_j| \le t$, the constrained fit may be approximated by the ridge-type form $(X^{\mathrm T}X + \lambda W^-)^{-1}X^{\mathrm T}y$,
where $W = \mathrm{diag}(|\hat\beta_j|)$ and $W^-$ denotes a generalized inverse. Therefore the number of
effective parameters in the constrained fit $\hat\beta$ may be approximated by
$$p(t) = \mathrm{tr}\{ X (X^{\mathrm T}X + \lambda W^-)^{-1} X^{\mathrm T} \}.$$
Letting $\mathrm{rss}(t)$ be the residual sum of squares for the constrained fit with constraint
$t$, we construct the generalized cross-validation style statistic

$$\mathrm{GCV}(t) = \frac{1}{N}\,\frac{\mathrm{rss}(t)}{\{1 - p(t)/N\}^2}. \qquad (11)$$
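As a rough illustration (my code, not the paper's), the statistic (11) can be computed over a grid of tuning values. For simplicity the effective number of parameters $p(t)$ is taken below to be the number of non-zero coefficients, a stand-in for the trace formula above, and the penalized parameterization of scikit-learn's Lasso is used in place of the bound $t$.

```python
# Generalized cross-validation for the lasso over a grid of penalty values.
# p(t) is approximated by the count of non-zero coefficients (a simplification
# of the trace formula in the text); data and grid are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 60, 8
X = rng.standard_normal((N, p))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + 3.0 * rng.standard_normal(N)

best = None
for alpha in np.linspace(0.01, 1.5, 30):
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    rss = np.sum((y - X @ beta) ** 2)
    df = np.count_nonzero(beta)                   # stand-in for p(t)
    gcv = rss / (N * (1.0 - df / N) ** 2)         # statistic (11)
    if best is None or gcv < best[0]:
        best = (gcv, alpha, beta)

print("selected penalty:", round(best[1], 3))
print("coefficients:", np.round(best[2], 2))
```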
We may apply this result to the lasso estimator (3). Denote the estimated standard
error of $\hat\beta_j^o$ by $\hat\tau = \hat\sigma/\surd N$, where $\hat\sigma^2 = \sum(y_i - \hat y_i)^2/(N - p)$. Then the $\hat\beta_j^o/\hat\tau$ are (condi-
tionally on $X$) approximately independent standard normal variates, and from
this result we may derive a formula for the estimate $\hat t$ of the lasso parameter.
Although the derivation of $\hat t$ assumes an orthogonal design, we may still try to use
it in the usual non-orthogonal setting. Since the predictors have been standardized,
the optimal value of t is roughly a function of the overall signal-to-noise ratio in the
data, and it should be relatively insensitive to the covariance of X. (In contrast, the
form of the lasso estimator is sensitive to the covariance and we need to account for it
properly.)
The simulated examples in Section 7.2 suggest that this method gives a useful
estimate of t. But we can offer only a heuristic argument in favour of it. Suppose that
$X^{\mathrm T}X = V$ and let $Z = XV^{-1/2}$, $\theta = V^{1/2}\beta$. Since the columns of $X$ are standardized,
the region $\sum|\theta_j| \le t$ differs from the region $\sum|\beta_j| \le t$ in shape but has roughly the
same-sized marginal projections. Therefore the optimal value of $t$ should be about
the same in each instance.
Finally, note that the Stein method enjoys a significant computational advantage
over the cross-validation-based estimate of t. In our experiments we optimized over a
grid of 15 values of the lasso parameter t and used fivefold cross-validation. As a
result, the cross-validation approach required 75 applications of the model optim-
ization procedure of Section 6 whereas the Stein method required only one. The
requirements of the generalized cross-validation approach are intermediate between
the two, requiring one application of the optimization procedure per grid point.
The lasso estimate can also be interpreted as the posterior mode when the $\beta_j$ have independent double-exponential prior distributions with density $(1/2\tau)\exp(-|\beta_j|/\tau)$, with $\tau = 1/\lambda$.
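To spell out this correspondence (a standard argument included here for illustration, not quoted from the paper): with Gaussian errors and independent double-exponential priors on the $\beta_j$, the negative log-posterior is, up to an additive constant,
$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\sum_{i=1}^N\Big(y_i - \sum_j \beta_j x_{ij}\Big)^2 + \frac{1}{\tau}\sum_j |\beta_j| ,$$
so the posterior mode solves a penalized least squares problem with an absolute value penalty, that is, a lasso problem in its Lagrangian form with penalty parameter proportional to $1/\tau$.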
Fig. 7 shows the double-exponential density (full curve) and the normal density
(broken curve); the latter is the implicit prior used by ridge regression. Notice how
the double-exponential density puts more mass near 0 and in the tails. This reflects
the greater tendency of the lasso to produce estimates that are either large or 0.
Fig. 7. Double-exponential density (full curve) and normal density (broken curve): the former is the implicit
prior used by the lasso; the latter by ridge regression
Lawson and Hansen (1974) provided the ingredients for a procedure which solves the
linear least squares problem subject to a general linear inequality constraint $G\beta \le h$.
Here $G$ is an $m \times p$ matrix, corresponding to $m$ linear inequality constraints on the $p$-
vector $\beta$. For our problem, however, $m = 2^p$ may be very large so that direct
application of this procedure is not practical. However, the problem can be solved by
introducing the inequality constraints sequentially, seeking a feasible solution
satisfying the so-called Kuhn-Tucker conditions (Lawson and Hansen, 1974). We
outline the procedure below.
Let $g(\beta) = \sum_{i=1}^N (y_i - \sum_j \beta_j x_{ij})^2$, and let $\delta_i$, $i = 1, 2, \ldots, 2^p$, be the $p$-tuples of the
form $(\pm 1, \pm 1, \ldots, \pm 1)$. Then the condition $\sum|\beta_j| \le t$ is equivalent to $\delta_i^{\mathrm T}\beta \le t$
for all $i$. For a given $\beta$, let $E = \{i: \delta_i^{\mathrm T}\beta = t\}$ and $S = \{i: \delta_i^{\mathrm T}\beta < t\}$. The set $E$ is the
equality set, corresponding to those constraints which are exactly met, whereas $S$ is
the slack set, corresponding to those constraints for which equality does not hold.
Denote by $G_E$ the matrix whose rows are $\delta_i$ for $i \in E$. Let $\mathbf 1$ be a vector of 1s of length
equal to the number of rows of $G_E$.
The following algorithm starts with $E = \{i_0\}$, where $\delta_{i_0} = \mathrm{sign}(\hat\beta^o)$, $\hat\beta^o$ being the
overall least squares estimate. It solves the least squares problem subject to $\delta_{i_0}^{\mathrm T}\beta \le t$
and then checks whether $\sum|\hat\beta_j| \le t$. If so, the computation is complete; if not, the
violated constraint is added to $E$ and the process is continued until $\sum|\hat\beta_j| \le t$.
Here is an outline of the algorithm.
(a) Start with $E = \{i_0\}$, where $\delta_{i_0} = \mathrm{sign}(\hat\beta^o)$, $\hat\beta^o$ being the overall least squares
estimate.
(b) Find $\hat\beta$ to minimize $g(\beta)$ subject to $G_E\beta \le t\mathbf 1$.
(c) While $\sum|\hat\beta_j| > t$,
(d) add $i$ to the set $E$, where $\delta_i = \mathrm{sign}(\hat\beta)$. Find $\hat\beta$ to minimize $g(\beta)$ subject to
$G_E\beta \le t\mathbf 1$.
This procedure must always converge in a finite number of steps since one element
is added to the set $E$ at each step, and there is a total of $2^p$ elements. The final iterate
TABLE 3
Results for example 1

Method                                  Median mean-squared   Average no. of    Average ŝ
                                        error                 0 coefficients
Least squares                           2.79 (0.12)           0.0
Lasso (cross-validation)                2.43 (0.14)           3.3               0.63 (0.01)
Lasso (Stein)                           2.07 (0.10)           2.6               0.69 (0.02)
Lasso (generalized cross-validation)    1.93 (0.09)           2.4               0.73 (0.01)
Garotte                                 2.29 (0.16)           3.9
Best subset selection                   2.44 (0.16)           4.8
Ridge regression                        3.21 (0.12)           0.0
is a solution to the original problem since the Kuhn-Tucker conditions are satisfied
for the sets E and S at convergence.
A modification of this procedure removes elements from E in step (d) for which the
equality constraint is not satisfied. This is more efficient but it is not clear how to
establish its convergence.
The fact that the algorithm must stop after at most $2^p$ iterations is of little comfort
if $p$ is large. In practice we have found that the average number of iterations required
is in the range $(0.5p, 0.75p)$ and is therefore quite acceptable for practical purposes.
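The outline above can be turned into a workable sketch with a general-purpose solver; the version below is my illustration, with scipy's SLSQP routine standing in for the Lawson and Hansen least squares routines mentioned in the text, and with illustrative data.

```python
# Sketch of the sequential constraint-introduction algorithm: start from the sign
# pattern of the OLS fit, solve least squares subject to the current constraints
# delta^T beta <= t, and add the sign pattern of each violating solution until
# sum |beta_j| <= t.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, p, t = 40, 5, 2.0
X = rng.standard_normal((N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.standard_normal(N)

def g(beta):                                       # residual sum of squares
    r = y - X @ beta
    return r @ r

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
G_E = [np.sign(beta_ols)]                          # constraint rows in the set E
beta = beta_ols

while np.sum(np.abs(beta)) > t + 1e-8:
    cons = [{"type": "ineq", "fun": lambda b, d=d: t - d @ b} for d in G_E]
    res = minimize(g, beta_ols, method="SLSQP", constraints=cons)
    beta = res.x
    G_E.append(np.sign(beta))                      # add the newly violated sign pattern

print(np.round(beta, 3), round(float(np.sum(np.abs(beta))), 3))
```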
A completely different algorithm for this problem was suggested by David Gay.
We write each $\beta_j$ as $\beta_j^+ - \beta_j^-$, where $\beta_j^+$ and $\beta_j^-$ are non-negative. Then we solve the
least squares problem with the constraints $\beta_j^+ \ge 0$, $\beta_j^- \ge 0$ and $\sum_j \beta_j^+ + \sum_j \beta_j^- \le t$. In
this way we transform the original problem ($p$ variables, $2^p$ constraints) to a new
problem with more variables ($2p$) but fewer constraints ($2p + 1$). One can show that
this new problem has the same solution as the original problem.
Standard quadratic programming techniques can be applied, with the convergence
assured in 2p + 1 steps. We have not extensively compared these two algorithms but
in examples have found that the second algorithm is usually (but not always) a little
faster than the first.
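A sketch of this reformulation (again my illustration, not the original implementation): with $\beta = \beta^+ - \beta^-$ the problem becomes a smooth quadratic programme in $2p$ non-negative variables with a single extra linear constraint, which a general solver handles directly. The data are illustrative and scipy's SLSQP stands in for a dedicated quadratic programming routine.

```python
# Lasso via the split beta = beta_plus - beta_minus: 2p non-negative variables,
# bounds beta^+ >= 0, beta^- >= 0 and one linear constraint sum(beta^+ + beta^-) <= t.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p, t = 40, 5, 2.0
X = rng.standard_normal((N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.standard_normal(N)

def rss(z):
    beta = z[:p] - z[p:]                       # recover beta from the split variables
    r = y - X @ beta
    return r @ r

res = minimize(
    rss,
    x0=np.zeros(2 * p),
    method="SLSQP",
    bounds=[(0.0, None)] * (2 * p),            # beta^+ >= 0, beta^- >= 0
    constraints=[{"type": "ineq", "fun": lambda z: t - z.sum()}],  # sum <= t
)
beta_hat = res.x[:p] - res.x[p:]
print(np.round(beta_hat, 3), round(float(np.abs(beta_hat).sum()), 3))
```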
7. SIMULATIONS
7.1. Outline
In the following examples, we compare the full least squares estimates with the
lasso, the non-negative garotte, best subset selection and ridge regression. We used
fivefold cross-validation to estimate the regularization parameter in each case. For
best subset selection, we used the 'leaps' procedure in the S language, with fivefold
cross-validation to estimate the best subset size. This procedure is described and
studied in Breiman and Spector (1992) who recommended fivefold or tenfold cross-
validation for use in practice.
For completeness, here are the details of the cross-validation procedure. The best
subsets of each size are first found for the original data set: call these $S_0, S_1, \ldots, S_p$.
($S_0$ represents the null model; since $\bar y = 0$ the fitted values are 0 for this model.)
Denote the full training set by $T$, and the cross-validation training and test sets by
$T - T^v$ and $T^v$, for $v = 1, 2, \ldots, 5$. For each cross-validation fold $v$, we find the best
TABLE 4
Most frequent models selected by the lasso
(generalized cross-validation) in example 1

Model                    Proportion
1 2 4 5 6 7 8            0.055
1 2 3 4 5 6              0.050
1 2 5 8                  0.045
1 2 4 5                  0.045
13 others
1 2 5 (and 5 others)     0.025
subsets of each size for the data $T - T^v$: call these $S_0^v, S_1^v, \ldots, S_p^v$. Let $\mathrm{PE}^v(J)$ be the
prediction error when $S_J^v$ is applied to the test data $T^v$, and form the estimate

$$\widehat{\mathrm{PE}}(J) = \frac{1}{5}\sum_{v=1}^5 \mathrm{PE}^v(J). \qquad (12)$$
We find the $\hat J$ that minimizes $\widehat{\mathrm{PE}}(J)$, and our selected model is $S_{\hat J}$. This is not the same
as estimating the prediction error of the fixed models $S_0, S_1, \ldots, S_p$ and then
choosing the one with the smallest prediction error. This latter procedure is described
in Zhang (1993) and Shao (1992), and can lead to inconsistent model selection unless
the cross-validation test set TV grows at an appropriate asymptotic rate.
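For concreteness, here is a small sketch of this cross-validation scheme (my code, not the 'leaps' routine used in the paper): exhaustive best subsets are found on the full data and on each training fold, and the subset size is chosen by the estimate (12). The data are illustrative and $p$ is kept small so that exhaustive search is feasible.

```python
# Fivefold cross-validation for choosing the best subset size, as described above.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N, p = 50, 6
X = rng.standard_normal((N, p))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0]) + 3.0 * rng.standard_normal(N)
y = y - y.mean()                                   # centre y, as in the text

def best_subset(X, y, size):
    """Index set of the given size with the smallest residual sum of squares."""
    if size == 0:
        return []                                  # the null model
    best, best_rss = None, np.inf
    for S in combinations(range(X.shape[1]), size):
        S = list(S)
        b, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        rss = np.sum((y - X[:, S] @ b) ** 2)
        if rss < best_rss:
            best, best_rss = S, rss
    return best

folds = np.array_split(rng.permutation(N), 5)      # fivefold split
pe_hat = np.zeros(p + 1)
for test in folds:
    train = np.setdiff1d(np.arange(N), test)
    for J in range(p + 1):
        S = best_subset(X[train], y[train], J)     # best subset of size J on T - T^v
        if S:
            b, *_ = np.linalg.lstsq(X[train][:, S], y[train], rcond=None)
            pred = X[test][:, S] @ b
        else:
            pred = np.zeros(len(test))             # null model predicts 0
        pe_hat[J] += np.mean((y[test] - pred) ** 2) / 5   # estimate (12)

J_hat = int(np.argmin(pe_hat))
print("selected subset size:", J_hat, "predictors:", best_subset(X, y, J_hat))
```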
7.2. Example 1
In this example we simulated 50 data sets consisting of 20 observations from the
model
$$y = \beta^{\mathrm T} x + \sigma\epsilon,$$
where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^{\mathrm T}$ and $\epsilon$ is standard normal. The correlation
between $x_i$ and $x_j$ was $\rho^{|i-j|}$ with $\rho = 0.5$. We set $\sigma = 3$, and this gave a signal-to-noise
ratio of approximately 5.7. Table 3 shows the mean-squared errors over 200
simulations from this model. The lasso performs the best, followed by the garotte
and ridge regression.
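A sketch of how one data set from this design could be generated (my code, not the original simulation scripts); the exponential correlation structure $\rho^{|i-j|}$ is built directly into the covariance matrix.

```python
# Generate one data set from the design of example 1: 20 observations, 8 predictors
# with correlation rho^{|i-j|}, beta = (3, 1.5, 0, 0, 2, 0, 0, 0) and sigma = 3.
import numpy as np

rng = np.random.default_rng(6)
N, p, rho, sigma = 20, 8, 0.5, 3.0
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # correlation rho^{|i-j|}
X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
y = X @ beta + sigma * rng.standard_normal(N)

# empirical check that neighbouring predictors are correlated at about 0.5
print(round(float(np.corrcoef(X[:, 0], X[:, 1])[0, 1]), 2))
```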
Estimation of the lasso parameter by generalized cross-validation seems to per-
form best, a trend that we find is consistent through all our examples. Subset
TABLE 5
Most frequent models selected by all-subsets
regression in example 1

Model        Proportion
1 2 5        0.240
1 5          0.200
1            0.095
1 2 5 7      0.040
Fig. 8. Estimates for the eight coefficients in example 1 (full least squares, lasso, garotte, best subset
and ridge), excluding the intercept; the true coefficients are shown for reference
TABLE 6
Results for example 2
selection picks approximately the correct number of zero coefficients (5), but suffers
from too much variability as shown in the box plots of Fig. 8.
Table 4 shows the five most frequent models (non-zero coefficients) selected by the
lasso (with generalized cross-validation): although the correct model (1, 2, 5) was
chosen only 2.5% of the time, the selected model contained (1, 2, 5) 95.5% of the
time. The most frequent models selected by subset regression are shown in Table 5.
The correct model is chosen more often (24% of the time), but subset selection can
also underfit: the selected model contained (1, 2, 5) only 53.5% of the time.
7.3. Example 2
The second example is the same as example 1, but with $\beta_j = 0.85$ for all $j$ and $\sigma = 3$; the
signal-to-noise ratio was approximately 1.8. The results in Table 6 show that ridge
regression does the best by a good margin, with the lasso being the only other
method to outperform the full least squares estimate.
7.4. Example 3
For example 3 we chose a set-up that should be well suited for subset selection.
The model is the same as example 1, but with $\beta = (5, 0, 0, 0, 0, 0, 0, 0)^{\mathrm T}$ and $\sigma = 2$ so
that the signal-to-noise ratio was about 7.
The results in Table 7 show that the garotte and subset selection perform the best,
TABLE 7
Results for example 3

Method                                  Median mean-squared   Average no. of    Average ŝ
                                        error                 0 coefficients
Least squares                           2.89 (0.04)           0.0
Lasso (cross-validation)                0.89 (0.01)           3.0               0.50 (0.03)
Lasso (Stein)                           1.26 (0.02)           2.6               0.70 (0.01)
Lasso (generalized cross-validation)    1.02 (0.02)           3.9               0.63 (0.04)
Garotte                                 0.52 (0.01)           5.5
Subset selection                        0.64 (0.02)           6.3
Ridge regression                        3.53 (0.05)           0.0
followed closely by the lasso. Ridge regression does poorly and has a higher mean-
squared error than do the full least squares estimates.
7.5. Example 4
In this example we examine the performance of the lasso in a bigger model. We
simulated 50 data sets each having 100 observations and 40 variables (note that best
subsets regression is generally considered impractical for $p > 30$). We defined
predictors $x_{ij} = z_{ij} + z_i$, where $z_{ij}$ and $z_i$ are independent standard normal variates.
This induced a pairwise correlation of 0.5 among the predictors. The coefficient
vector was $\beta = (0, 0, \ldots, 0, 2, 2, \ldots, 2, 0, 0, \ldots, 0, 2, 2, \ldots, 2)^{\mathrm T}$, there being 10
repeats in each block. Finally we defined $y = \beta^{\mathrm T} x + 15\epsilon$, where $\epsilon$ was standard
normal. This produced a signal-to-noise ratio of roughly 9. The results in Table 8
show that the ridge regression performs the best, with the lasso (generalized cross-
validation) a close second.
The average values of the lasso coefficients in each of the four blocks of 10 were
0.50 (0.06), 0.92 (0.07), 1.56 (0.08) and 2.33 (0.09). Although the lasso only produced
14.4 zero coefficients on average, the average value of s (0.55) was close to the true
proportion of Os (0.5).
In the orthonormal design case discussed in Section 2, the lasso estimate takes the form

$$\hat\beta_j = \mathrm{sign}(\hat\beta_j^o)\,(|\hat\beta_j^o| - \gamma)^+ . \qquad (13)$$
This is called a 'soft threshold' estimator by Donoho and Johnstone (1994); they
applied this estimator to the coefficients of a wavelet transform of a function
measured with noise. They then backtransformed to obtain a smooth estimate of the
function. Donoho and Johnstone proved many optimality results for the soft
threshold estimator and then translated these results into optimality results for
function estimation.
Our interest here is not in function estimation but in the coefficients themselves.
We give one of Donoho and Johnstone's results here. It shows that asymptotically
the soft threshold estimator (lasso) comes as close as subset selection to the
performance of an ideal subset selector- one that uses information about the actual
parameters.
Suppose that
$$y_i = \beta^{\mathrm T} x_i + \epsilon_i ,$$
where $\epsilon_i \sim N(0, \sigma^2)$ and the design matrix is orthonormal. Then we can write
$$\hat\beta_j^o = \beta_j + z_j , \qquad (14)$$
where $z_j \sim N(0, \sigma^2)$.
We consider estimation of $\beta$ under squared error loss, with risk $E\|\hat\beta - \beta\|^2$.
11. DISCUSSION
In this paper we have proposed a new method (the lasso) for shrinkage and
selection for regression and generalized regression problems. The lasso does not
focus on subsets but rather defines a continuous shrinking operation that can
produce coefficients that are exactly O. We have presented some evidence in this
paper that suggests that the lasso is a worthy competitor to subset selection and ridge
regression. We examined the relative merits of the methods in three different
scenarios:
(a) small number of large effects-subset selection does best here, the lasso not
quite as well and ridge regression does quite poorly;
(b) small to moderate number of moderate-sized effects-the lasso does best,
followed by ridge regression and then subset selection;
(c) large number of small effects-ridge regression does best by a good margin,
followed by the lasso and then subset selection.
Breiman's garotte does a little better than the lasso in the first scenario, and a little
worse in the second two scenarios. These results refer to prediction accuracy. Subset
selection, the lasso and the garotte have the further advantage (compared with ridge
regression) of producing interpretable submodels.
There are many other ways to carry out subset selection or regularization in least
squares regression. The literature is increasing far too fast to attempt to summarize it
in this short space so we mention only a few recent developments. Computational
advances have led to some interesting proposals, such as the Gibbs sampling
approach of George and McCulloch (1993). They set up a hierarchical Bayes model
and then used the Gibbs sampler to simulate a large collection of subset models from
the posterior distribution. This allows the data analyst to examine the subset models
with highest posterior probability and can be carried out in large problems.
Frank and Friedman (1993) discuss a generalization of ridge regression and subset
selection, through the addition of a penalty of the form $\lambda \sum_j |\beta_j|^q$ to the residual sum
of squares. This is equivalent to a constraint of the form $\sum_j |\beta_j|^q \le t$; they called this
the 'bridge'. The lasso corresponds to $q = 1$. They suggested that joint estimation of
the $\beta_j$s and $q$ might be an effective strategy but do not report any results.
Fig. 9 depicts the situation in two dimensions. Subset selection corresponds to
$q \to 0$. The value $q = 1$ has the advantage of being closer to subset selection than is
ridge regression ($q = 2$) and is also the smallest value of $q$ giving a convex region.
Furthermore, the linear boundaries for $q = 1$ are convenient for optimization.
The encouraging results reported here suggest that absolute value constraints
might prove to be useful in a wide variety of statistical estimation problems. Further
study is needed to investigate these possibilities.
Fig. 9. Contours of constant value of $\sum_j |\beta_j|^q$ for given values of $q$: (a) $q = 4$; (b) $q = 2$; (c) $q = 1$;
(d) $q = 0.5$; (e) $q = 0.1$
12. SOFTWARE
Public domain and S-PLUS language functions for the lasso are available at the
Statlib archive at Carnegie Mellon University. There are functions for linear models,
generalized linear models and the proportional hazards model. To obtain them, use
file transfer protocol to lib.stat.cmu.edu and retrieve the file S/lasso, or send an
electronic mail message to statlib@lib.stat.cmu.edu with the message: send lasso from S.
ACKNOWLEDGEMENTS
I would like to thank Leo Breiman for sharing his garotte paper with me before
publication, Michael Carter for assistance with the algorithm of Section 6 and David
Andrews for producing Fig. 3 in MATHEMATICA. I would also like to ack-
nowledge enjoyable and fruitful discussions with David Andrews, Shaobing Chen,
Jerome Friedman, David Gay, Trevor Hastie, Geoff Hinton, Iain Johnstone,
Stephanie Land, Michael Leblanc, Brenda MacGibbon, Stephen Stigler and
Margaret Wright. Comments by the Editor and a referee led to substantial
improvements in the manuscript. This work was supported by a grant from the
Natural Sciences and Engineering Research Council of Canada.
REFERENCES
Breiman, L. (1993) Better subset selection using the non-negative garotte. Technical Report. University
of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984) Classification and Regression Trees.
Belmont: Wadsworth.
Breiman, L. and Spector, P. (1992) Submodel selection and evaluation in regression: the x-random case.
Int. Statist. Rev., 60, 291-319.
Chen, S. and Donoho, D. (1994) Basis pursuit. In Proc. 28th Asilomar Conf. Signals, Systems and Computers,
Asilomar.
Donoho, D. and Johnstone, I. (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81,
425-455.
Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992) Maximum entropy and the nearly
black object (with discussion). J. R. Statist. Soc. B, 54, 41-81.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995) Wavelet shrinkage:
asymptopia? J. R. Statist. Soc. B, 57, 301-337.
Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. London: Chapman and Hall.
Frank, I. and Friedman, J. (1993) A statistical view of some chemometrics regression tools (with
discussion). Technometrics, 35, 109-148.
Friedman, J. (1991) Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1-141.
George, E. and McCulloch, R. (1993) Variable selection via Gibbs sampling. J. Am. Statist. Ass., 88,
884-889.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. New York: Chapman and Hall.
Lawson, C. and Hansen, R. (1974) Solving Least Squares Problems. Englewood Cliffs: Prentice Hall.
LeBlanc, M. and Tibshirani, R. (1994) Monotone shrinkage of trees. Technical Report. University of
Toronto, Toronto.
Murray, W., Gill, P. and Wright, M. (1981) Practical Optimization. New York: Academic Press.
Shao, J. (1992) Linear model selection by cross-validation. J. Am. Statist. Ass., 88, 486-494.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E. and Yang, N. (1989)
Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate, II: radical
prostatectomy treated patients. J. Urol., 16, 1076-1083.
Stein, C. (1981) Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9, 1135-
1151.
Tibshirani, R. (1994) A proposal for variable selection in the Cox model. Technical Report. University of
Toronto, Toronto.
Zhang, P. (1993) Model selection via multifold cross-validation. Ann. Statist., 21, 299-311.