Statistical Practice: Estimators of Relative Importance in Linear Regression Based On Variance Decomposition
Ulrike Grömping
1. INTRODUCTION
In many linear regression applications, a main goal of analysis
is the determination of a ranking of the regressors or an explicit
quantification of the relative importance of each regressor for the
response. This type of application is often encountered in disciplines that rely on observational studies such as psychology,
biology, ecology, economics, and so forth (see, e.g., application
areas in the references). If all regressors are uncorrelated, there
is a simple and unique answer to the relative importance question. However, it is the very nature of observational data that
regressors are typically correlated. In this case, assignment of
relative importance becomes a challenging task, for which the
standard output from linear regression models is not particularly
well suited. This article focuses on relative importance assessment based on variance decomposition for linear regression with
random regressor variables. Thus, metrics such as "level importance" [the product of the unstandardized regression coefficient with the regressor's mean, advocated, e.g., by Achen (1982)] are not considered here.
Ulrike Grömping is Professor for Business Mathematics and Statistics at Department II (Mathematics, Physics, and Chemistry), TFH Berlin, University of
Applied Sciences, Luxemburger Strasse 10, 13353 Berlin, Germany (E-mail:
groemping@tfh-berlin.de). The author thanks Norman Fickel, Barry Feldman,
Ursula Garczarek, and Sabine Landau for useful discussions and comments. The
constructive comments of both referees and the associate editor have also greatly
improved the article.
2. THE FRAMEWORK FOR VARIANCE DECOMPOSITION IN LINEAR REGRESSION

2.1 The Model and Notation

The linear regression model with random regressors is

$$Y = \beta_0 + X_1\beta_1 + \dots + X_p\beta_p + \varepsilon, \qquad (1)$$

with error variance $\operatorname{var}(\varepsilon) = \sigma^2$. Writing $v_j = \operatorname{var}(X_j)$ and $\rho_{jk} = \operatorname{corr}(X_j, X_k)$, the variance of the response decomposes as

$$\operatorname{var}(Y) = \sum_{j=1}^{p}\beta_j^2 v_j + 2\sum_{j=1}^{p-1}\sum_{k=j+1}^{p}\beta_j\beta_k\sqrt{v_j v_k}\,\rho_{jk} + \sigma^2. \qquad (2)$$

For a set $S \subseteq \{1,\dots,p\}$ of regressor indices, the variance of $Y$ explained by the regressors in $S$ is

$$\operatorname{evar}(S) = \operatorname{var}(Y) - \operatorname{var}(Y \mid X_j,\ j \in S), \qquad (3)$$

and the sequential variance of the regressors in a set $M$, given the regressors in a set $S$, is

$$\operatorname{svar}(M \mid S) = \operatorname{evar}(M \cup S) - \operatorname{evar}(S). \qquad (4)$$
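The definitions (3) and (4) translate directly into code. The following is a minimal Python sketch, assuming the joint covariance matrix of (Y, X1, . . . , Xp) is given; the function names and the example parameter values are illustrative choices, not part of the original article or of any established package.

```python
import numpy as np

def evar(S, cov):
    """evar(S): variance of Y explained by the regressors X_j, j in S.

    cov is the (p+1) x (p+1) covariance matrix of (Y, X_1, ..., X_p),
    with Y in row/column 0; S is an iterable of regressor indices (1-based).
    evar(S) equals c' Sigma_S^{-1} c, the variance of the best linear
    prediction of Y from the X_j, j in S.
    """
    S = list(S)
    if not S:
        return 0.0
    c = cov[0, S]                    # cov(Y, X_j), j in S
    Sigma_S = cov[np.ix_(S, S)]      # covariance matrix of the X_j, j in S
    return float(c @ np.linalg.solve(Sigma_S, c))

def svar(M, S, cov):
    """Sequential variance svar(M | S) = evar(M united with S) - evar(S)."""
    return evar(set(M) | set(S), cov) - evar(S, cov)

# Illustrative two-regressor example: beta = (1, 2), v1 = v2 = 1, rho12 = 0.5, sigma^2 = 1
beta = np.array([1.0, 2.0])
Sigma_X = np.array([[1.0, 0.5], [0.5, 1.0]])
cov = np.zeros((3, 3))
cov[1:, 1:] = Sigma_X
cov[0, 1:] = cov[1:, 0] = Sigma_X @ beta        # cov(Y, X_j)
cov[0, 0] = beta @ Sigma_X @ beta + 1.0         # var(Y) = explained variance + sigma^2
print(svar([1], [], cov), svar([1], [2], cov))  # svar({1}|empty set) and svar({1}|{2})
```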
2.2 Desirability Criteria
Figure 1. Two simple causal models that can lead to the linear regression model E(Y | X1, X2, X3) = β0 + X1β1 + X2β2 + X3β3.
A variance decomposition metric should ideally satisfy the following criteria:
(a) Proper decomposition: the model variance is to be decomposed into shares, that is, the sum of all shares has to be the model variance.
(b) Non-negativity: all shares have to be non-negative.
(c) Exclusion: the share allocated to a regressor Xj with βj = 0 should be 0.
(d) Inclusion: a regressor Xj with βj ≠ 0 should receive a nonzero share.
The list could be extended by further reasonable requests; the criteria listed here are the ones most relevant for comparing LMG and PMVD with each other and with other relative importance metrics. Criteria (a) to (d) are requested by various authors [see, e.g., Feldman (2005, all four), Darlington (1968, (a), (b)), Theil (1971,
(b)), Johnson and Lebreton (2004, (a), (b)), Cox (1985, slightly
different context, (c), (d))], and are cited here for ease of reference. Feldman (2005) postulated these four criteria in the sense
of strict admissibility criteria and showed that PMVD is admissible in this sense while LMG is not. The present author agrees
that criteria (a) and (b) are indispensable, so that the method by
Hoffman (1960) as justified by Pratt (1987) is not further discussed because of its violation of criterion (b). Criterion (d) is
fulfilled by both LMG and PMVD, and it is conjectured that it
will be fulfilled by any nontrivial metric that fulfills both (a) and
(b).
Exclusion (criterion (c)) is fulfilled for all relative importance
metrics mentioned in this article, even the simple ones that have been severely criticized in the literature, as long as all regressors are uncorrelated. Feldman (2005) generally requested exclusion since he considered a regressor with a zero coefficient to be spurious. When thinking of predictive relevance, a regressor with a 0 coefficient in the equation indeed does not contribute
anything useful, given that all regressors with nonzero coefficients are available, so that exclusion is a reasonable request. If
the relative importance question is asked with a causal interpretation in mind (as is, e.g., the case when aiming at prioritizing
intervention options) and regressors are correlated, exclusion is
a less convincing requirement: Figure 1 shows graphs of two
causal models, with directed arrows indicating a direct causal relation, that both (assuming linearity of all relations) imply the
same linear regression model with p = 3 correlated regressors.
In model I, regressor X1 directly influences both other regressors and the response. If the shaded arrow is deleted from the
graph, the coefficient β1 becomes zero, since in the presence of
X2 and X3 there is no additional explanatory value in X1 . Nevertheless, X1 obviously exerts an influence on Y via the other
two regressors, and there is no reason to request that it should be
allocated a share of zero. In model II, X2 and X1 have swapped
roles. Again, if the shaded arrow is deleted from the graph, the
coefficient β1 becomes zero. Now, it appears far more reasonable
that X1 should be allocated a share of zero.
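To make this point concrete, here is a small simulated example (a hypothetical sketch; the effect sizes 0.8 and 1.0 are arbitrary choices) of a model of type I with the shaded arrow deleted: X1 acts on Y only through X2 and X3, so its coefficient in the joint regression is essentially zero, although X1 alone clearly carries information about Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Model I without the direct X1 -> Y arrow (all effect sizes arbitrary):
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)             # X1 -> X2
x3 = 0.8 * x1 + rng.normal(size=n)             # X1 -> X3
y = x2 + x3 + rng.normal(size=n)               # Y depends on X2 and X3 only

X = np.column_stack([np.ones(n), x1, x2, x3])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print("beta_1 in the joint regression:", round(coef[1], 3))   # approximately 0

# Marginal explanatory value of X1 alone (squared correlation with Y):
print("R^2 of Y on X1 alone:", round(np.corrcoef(x1, y)[0, 1] ** 2, 3))  # clearly > 0
```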
As the linear regression model (1) is generally compatible with
many different causal models, among them also those models
for which exclusion is clearly unreasonable, exclusion does not
appear to be a reasonable requirement for relative importance
considerations whenever causal considerations motivate the analysis.
Criteria (a) to (d) refer to properties of the theoretical quantities estimated and can also be applied to the estimated quantities by replacing all theoretical values with their empirical counterparts. In addition to these criteria, a reasonably low variability of the estimators in cases of moderate multicollinearity is also an important aspect in assessing a method's performance.
3. WHAT DO RELATIVE IMPORTANCE METRICS
ESTIMATE?
3.1 LMG

For the case of two regressors (p = 2), the variance decomposition (2) simplifies to

$$\operatorname{var}(Y) = \beta_1^2 v_1 + 2\beta_1\beta_2\sqrt{v_1 v_2}\,\rho_{12} + \beta_2^2 v_2 + \sigma^2. \qquad (2^*)$$
With n independent observations from the common distribution of Y, X1, and X2, let y denote the n × 1 vector of centered responses, x1 the n × 1 vector of centered values for regressor X1, and let the superscript T denote transposition and −1 inversion. Then the model SS for X1 in the role of the first and only regressor is

$$\mathbf{y}^T\mathbf{x}_1\bigl(\mathbf{x}_1^T\mathbf{x}_1\bigr)^{-1}\mathbf{x}_1^T\mathbf{y} = \bigl(\mathbf{x}_1^T\mathbf{y}\bigr)^T\bigl(\mathbf{x}_1^T\mathbf{x}_1\bigr)^{-1}\mathbf{x}_1^T\mathbf{y},$$
which, when divided by n, is by simple considerations consistent
for
$$\operatorname{svar}(\{1\}\mid\emptyset) = \frac{\operatorname{cov}(Y,X_1)^2}{\operatorname{var}(X_1)} = \frac{\bigl(\beta_1 v_1 + \beta_2\sqrt{v_1 v_2}\,\rho_{12}\bigr)^2}{v_1} = \beta_1^2 v_1 + 2\beta_1\beta_2\sqrt{v_1 v_2}\,\rho_{12} + \beta_2^2 v_2 \rho_{12}^2. \qquad (5)$$
Note that, when alone in the model, the first regressor captures the full mixed term of the variance (2*) plus some of the unique contribution of the other regressor (the summand β2²v2ρ12² in (5)). Conversely, when entered last, X1 only receives

$$\operatorname{svar}(\{1\}\mid\{2\}) = \operatorname{evar}(\{1,2\}) - \operatorname{evar}(\{2\}) = \beta_1^2 v_1\bigl(1 - \rho_{12}^2\bigr).$$

Averaging these two sequential variances yields the LMG share of X1 for two regressors,

$$\mathrm{LMG}(1) = \beta_1^2 v_1 + \beta_1\beta_2\sqrt{v_1 v_2}\,\rho_{12} + 0.5\,\bigl(\beta_2^2 v_2 - \beta_1^2 v_1\bigr)\rho_{12}^2. \qquad (6)$$
Each regressor receives half of the mixed term in (2*). In addition, for ρ12 ≠ 0 the regressor with larger βj²vj donates part of its contribution to the regressor with smaller βj²vj. This third summand of (6) creates an equalization between correlated regressors with unequal βj²vj. In the light of the discussion of Figure 1, this can be seen as a precaution that takes care of the uncertainty regarding the underlying model structure. The third summand of (6) also causes LMG's violation of the exclusion criterion for correlated regressors: if β1 = 0, β2 ≠ 0, and ρ12 ≠ 0, there will be a nonzero share allocated to X1.
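Formulas (5) and (6) are easy to check numerically. The sketch below (parameter values chosen arbitrarily for illustration) simulates two correlated regressors, estimates the sequential variances from the model sums of squares divided by n, and compares the average over the two orders with the closed form (6).

```python
import numpy as np

rng = np.random.default_rng(2)
n, b1, b2, v1, v2, rho, sigma = 200_000, 1.0, 2.0, 1.0, 1.0, 0.5, 1.0

cov_X = np.array([[v1, rho * np.sqrt(v1 * v2)],
                  [rho * np.sqrt(v1 * v2), v2]])
X = rng.multivariate_normal([0.0, 0.0], cov_X, size=n)
y = b1 * X[:, 0] + b2 * X[:, 1] + sigma * rng.normal(size=n)
yc, Xc = y - y.mean(), X - X.mean(axis=0)

def model_ss(cols):
    """Model sum of squares of the regression of centered y on the centered columns."""
    Z = Xc[:, cols]
    fitted = Z @ np.linalg.lstsq(Z, yc, rcond=None)[0]
    return float(fitted @ fitted)

svar_1_first = model_ss([0]) / n                        # estimates (5)
svar_1_last = (model_ss([0, 1]) - model_ss([1])) / n    # X1 entered after X2
lmg_1 = 0.5 * (svar_1_first + svar_1_last)              # two-regressor LMG share

theory_5 = (b1 * v1 + b2 * np.sqrt(v1 * v2) * rho) ** 2 / v1
theory_6 = (b1**2 * v1 + b1 * b2 * np.sqrt(v1 * v2) * rho
            + 0.5 * (b2**2 * v2 - b1**2 * v1) * rho**2)
print(round(svar_1_first, 3), round(theory_5, 3))  # both close to 4.0
print(round(lmg_1, 3), round(theory_6, 3))         # both close to 2.375
```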
For p regressors, the LMG share allocated to X1 is given as

$$\mathrm{LMG}(1) = \frac{1}{p!}\sum_{r\ \mathrm{permutation}} \operatorname{svar}\bigl(\{1\}\mid S_1(r)\bigr) = \frac{1}{p!}\sum_{S\subseteq\{2,\dots,p\}} n(S)!\,\bigl(p-n(S)-1\bigr)!\ \operatorname{svar}\bigl(\{1\}\mid S\bigr), \qquad (7)$$

where the first sum runs over all p! orders r of the regressors, S1(r) denotes the set of regressors entered before X1 in the order r, and n(S) denotes the number of elements of S. All orders with the same S1(r) can be summarized into one summand. Thus, the computational burden is reduced from the calculation of p! summands to the calculation of 2^(p−1) summands, which are based on the 2^p quantities evar(S) and evar(S ∪ {1}), S ⊆ {2, . . . , p}.
Some readers may find it more intuitive to think of LMG(1) as the average over model sizes i of average improvements in explained variance when adding regressor X1 to a model of size i without X1 (see Christensen 1992), that is,

$$\mathrm{LMG}(1) = \frac{1}{p}\sum_{i=0}^{p-1}\ \sum_{\substack{S\subseteq\{2,\dots,p\}\\ n(S)=i}} \frac{\operatorname{svar}(\{1\}\mid S)}{\binom{p-1}{i}}. \qquad (7^*)$$
We have already seen in the two-regressor case that LMG violates the exclusion criterion. The other three desirability criteria
are satisfied, as can be easily verified from (7), noticing that the
method averages non-negative contributions that sum to the total
variance. Since (7) can be calculated for various scenarios, investigations into the behavior of the estimand are possible without
simulation (see Section 3.4).
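A brute-force implementation of (7) is straightforward once the all-subsets explained variances are available. The following minimal Python sketch (function names and example values are illustrative assumptions) computes the LMG share of each regressor from the joint covariance matrix and confirms that the shares add up to the model variance (criterion (a)).

```python
import math
from itertools import combinations
import numpy as np

def evar(S, cov):
    """evar(S) from the covariance matrix of (Y, X_1, ..., X_p), with Y in row/column 0."""
    S = list(S)
    if not S:
        return 0.0
    c = cov[0, S]
    return float(c @ np.linalg.solve(cov[np.ix_(S, S)], c))

def lmg_share(j, p, cov):
    """LMG share of regressor j according to (7): factorial weights over subsets S."""
    others = [k for k in range(1, p + 1) if k != j]
    share = 0.0
    for size in range(p):                     # n(S) = 0, ..., p-1
        weight = math.factorial(size) * math.factorial(p - size - 1) / math.factorial(p)
        for S in combinations(others, size):
            share += weight * (evar(set(S) | {j}, cov) - evar(S, cov))  # svar({j}|S)
    return share

# Illustrative example: p = 3, v_j = 1, equicorrelation 0.5, beta = (5, 4, 3), sigma^2 = 1
p, rho = 3, 0.5
beta = np.array([5.0, 4.0, 3.0])
Sigma_X = np.full((p, p), rho)
np.fill_diagonal(Sigma_X, 1.0)
cov = np.zeros((p + 1, p + 1))
cov[1:, 1:] = Sigma_X
cov[0, 1:] = cov[1:, 0] = Sigma_X @ beta
cov[0, 0] = beta @ Sigma_X @ beta + 1.0
shares = [lmg_share(j, p, cov) for j in range(1, p + 1)]
print([round(s, 3) for s in shares])
print(round(sum(shares), 3), round(evar(range(1, p + 1), cov), 3))  # equal: proper decomposition
```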
3.2 PMVD

Like LMG, PMVD allocates to X1 a weighted average of sequential variances over the orders of the regressors,

$$\mathrm{PMVD}(1) = \sum_{r\ \mathrm{permutation}} p(r)\,\operatorname{svar}\bigl(\{1\}\mid S_1(r)\bigr), \qquad (8)$$

but with data-dependent weights p(r) that are proportional to

$$L(r) = \prod_{i=1}^{p-1} \operatorname{svar}\bigl(\{r_{i+1},\dots,r_p\}\mid\{r_1,\dots,r_i\}\bigr)^{-1}, \qquad (9)$$

that is, the weights are p(r) = L(r) / Σr L(r), where summation in the denominator is over all possible permutations r.
The factors in product (9) are increasing in size from i = 1 to i = p − 1. Weights are large if the first regressor already captures a large portion of the explained variance (so that (evar({1, . . . , p}) − evar({r1}))^(−1) is already relatively large).
Also, if a set of regressors has a low explanatory value conditional on all other regressors, weights are large if all regressors from this set occur after the other regressors in the order. If
some coefficients are zero, limiting considerations (see Feldman
2002) show that weights become positive for orderings with all
0-coefficient variables last, while any other ordering receives a
weight of 0; in fact, the results for data with one or more coefficients estimated as 0 are identical to the results from models
with the 0-coefficient variables omitted and their shares fixed at
0. Thus, PMVD weights guarantee exclusion, as they were designed to do. In addition, like any approach that can be written
as an average over orderings, PMVD also guarantees the other
three desirability criteria, using the same reasoning as for LMG.
For illustration of PMVD, let us apply (8) and (9) to a scenario with two regressors X1 and X2 and nonzero coefficients. The L(r) consist of one factor only, with L((1, 2)) = svar({2}|{1})^(−1), so that the weight p((1, 2)) becomes

$$p((1,2)) = \frac{\operatorname{svar}(\{1\}\mid\{2\})}{\operatorname{svar}(\{1\}\mid\{2\}) + \operatorname{svar}(\{2\}\mid\{1\})} = \frac{\beta_1^2 v_1}{\beta_1^2 v_1 + \beta_2^2 v_2}.$$
With p((1, 2)) and p((2, 1)) inserted in (8), using the sequential variances calculated in Section 3.1, the variance allocated to X1 simplifies to

$$\mathrm{PMVD}(1) = \beta_1^2 v_1 + \frac{\beta_1^2 v_1}{\beta_1^2 v_1 + \beta_2^2 v_2}\; 2\beta_1\beta_2\sqrt{v_1 v_2}\,\rho_{12}. \qquad (10)$$
This result for two regressors has several specific properties, none of which generalizes to p > 2: the share of the mixed term
of which generalizes to p > 2: the share of the mixed term
that a regressor receives is proportional to its individual term
in the model. Also, the weight for order (1, 2) coincides with
the proportion of R 2 allocated to X1 , and the weights do not
depend on the correlation between the Xs. For more than two
regressors, the scenario investigations in Section 3.4 will shed
further light on the behavior of PMVD.
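For small p, the PMVD weights (9) and the resulting shares (8) can be computed by brute force over all p! orderings. The sketch below (function names hypothetical; it assumes all coefficients are nonzero, so no limiting argument is needed) uses the same covariance-matrix input as the LMG sketch above; for β = (4, 1, 0.3), vj = 1, and uncorrelated regressors it gives a weight of about 0.86 to the order (1, 2, 3), cf. Table 1.

```python
from itertools import permutations
import numpy as np

def evar(S, cov):
    S = list(S)
    if not S:
        return 0.0
    c = cov[0, S]
    return float(c @ np.linalg.solve(cov[np.ix_(S, S)], c))

def pmvd(cov, p):
    """PMVD weights p(r) from (9) and shares from (8), brute force over all p! orders.

    Assumes all svar factors are nonzero (no zero coefficients)."""
    full = evar(range(1, p + 1), cov)
    L = {}
    for r in permutations(range(1, p + 1)):
        prod = 1.0
        for i in range(1, p):                        # factors i = 1, ..., p-1
            prod *= 1.0 / (full - evar(r[:i], cov))  # svar({r_{i+1},...,r_p}|{r_1,...,r_i})^{-1}
        L[r] = prod
    total = sum(L.values())
    weights = {r: Lr / total for r, Lr in L.items()}
    shares = np.zeros(p)
    for r, w in weights.items():
        for pos, j in enumerate(r):                  # sequential contribution of X_j in order r
            shares[j - 1] += w * (evar(r[:pos + 1], cov) - evar(r[:pos], cov))
    return weights, shares

# Illustrative example: beta = (4, 1, 0.3), v_j = 1, uncorrelated regressors, sigma^2 = 1
p = 3
beta = np.array([4.0, 1.0, 0.3])
cov = np.zeros((p + 1, p + 1))
cov[1:, 1:] = np.eye(p)
cov[0, 1:] = cov[1:, 0] = beta
cov[0, 0] = beta @ beta + 1.0
weights, shares = pmvd(cov, p)
print(round(weights[(1, 2, 3)], 3))    # approximately 0.86, cf. Table 1
print([round(s, 3) for s in shares])   # shares sum to evar({1, 2, 3})
```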
3.3 The Weights of LMG and PMVD

Table 1. PMVD weights p(r) for the six orderings r of three regressors (vj = var(Xj) = 1, corr(Xj, Xk) = ρ^|j−k|).

             β1 = 1, β2 = 1, β3 = 1        β1 = 5, β2 = 4, β3 = 3        β1 = 4, β2 = 1, β3 = 0.3
Order r     ρ = −0.5  ρ = 0   ρ = 0.5     ρ = −0.5  ρ = 0   ρ = 0.5     ρ = −0.5  ρ = 0   ρ = 0.5
(1, 2, 3)     0.174    0.167    0.129       0.344    0.320    0.257       0.855    0.859    0.828
(2, 1, 3)     0.109    0.167    0.210       0.154    0.235    0.296       0.043    0.058    0.073
(2, 3, 1)     0.109    0.167    0.210       0.056    0.085    0.107       0.000    0.000    0.000
(1, 3, 2)     0.217    0.167    0.161       0.242    0.180    0.181       0.096    0.077    0.093
(3, 1, 2)     0.217    0.167    0.161       0.135    0.110    0.105       0.005    0.005    0.005
(3, 2, 1)     0.174    0.167    0.129       0.069    0.070    0.054       0.000    0.000    0.000
As mentioned before, LMG simply gives each order of regressors the same weight, that is, the weights are data-independent.
If a predetermined order of variables can be specified, we have
another situation of data-independent weights, for which one
order has weight one, all others weight 0. Applied researchers
facing a relative importance question often use an automated forward or backward selection of variables (e.g., Janz et al. 2001) to
determine the order of regressors. This is an extreme case of data-dependent weights (1 for one order, 0 for all other orders), which
guarantees exclusion in case of backward selection, but can deliver quite arbitrary allocations as is obvious from comparing
svar({1}|∅) to svar({1}|{2}) for the two-regressor case (see (5)
and subsequent calculations). Selection-based approaches have
been criticized, for example, by Bring (1996) and references
therein. The PMVD weights p(r) can be seen as a compromise
between this extreme form of data-dependent weights and equal
weights for all regressors. They are concentrated on a few orders
in case of very unequal βs and are more balanced between orders for constant or very similar βs. Table 1 shows a few examples
for three regressors.
3.4 Scenario Investigations
Trivially, LMG and PMVD coincide for uncorrelated regressors, since variance allocation is order-independent. Furthermore, symmetry considerations imply that both methods allocate
equal shares for all regressors in case of equi-correlated regressors if all βj√vj are identical. This is true for LMG because all orders generally receive the same weight. For PMVD, identical weights for all orders are guaranteed by Feldman's (2005) anonymity axiom in this situation. Combining these two situations, LMG and PMVD also coincide for several uncorrelated groups of equi-correlated regressors with group-wise constant βj√vj. For any other situation, the two methods will yield more
or less different results, as is exemplified in Figure 2.
Figure 2 depicts allocated shares according to LMG (thick
lines) and PMVD (thin lines) for a group of four regressors for
four different scenarios (a) to (d) (vj = var(Xj ) fixed at 1). As
was pointed out above, LMG and PMVD coincide for scenario (a) (identical βs with constant inter-regressor correlations). For scenario (b) (identical βs, corr(Xj, Xk) = ρ^|j−k|), both methods show a moderate dependence on the correlation parameter ρ.
Figure 2. Proportions of R² allocated to each regressor for LMG (thick line) and PMVD (thin line). Parameter vectors: (a) and (b): β = (1, 1, 1, 1)ᵀ; (c) and (d): β = (4, 1, 1, 0.3)ᵀ. Correlations among X1, . . . , X4: (a) and (c): corr(Xj, Xk) = ρ, j ≠ k; (b) and (d): corr(Xj, Xk) = ρ^|j−k|.
4. SIMULATION STUDY OF THE ESTIMATORS

Table 2. Simulation settings.

Factor                                        Levels
Correlation structure of (X1, . . . , X4)*)    corr(Xj, Xk) = ρ for j ≠ k;  corr(Xj, Xk) = ρ^|j−k|
Distribution of (X1, . . . , X4)**)            multivariate normal;  exponential
Coefficient vectors (β1, . . . , β4)ᵀ          β1 = (4, 1, 1, 0.3)ᵀ;  β2 = (1, 1, 1, 0.3)ᵀ;  β3 = (4, 1, 0, 0)ᵀ;  β4 = (1, 1, 1, 1)ᵀ;  β5 = (1.2, 1, 1, 0.3)ᵀ;  β6 = (1, 1, 1, 0)ᵀ;  β7 = (4, 3.5, 3, 2.5)ᵀ
Sample sizes                                   n = 100 independent observations;  n = 1000 independent observations
True R² (controlled through σ²)                0.25;  0.5;  0.9

*) Observations on different units are independent. The correlation structure refers to the four regressors within each independently observed unit.
**) Note that the regressors in the exponential case have nonzero expectation, which is irrelevant, since it can be subsumed in an estimate of the intercept and does not affect the variances and covariances.
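As an illustration of how one cell of Table 2 can be set up (a hypothetical sketch; the original simulation code is not reproduced here, and the particular ρ value is an arbitrary choice), the error variance is chosen so that the population R² = evar/(evar + σ²) equals the target value, and the regressors are drawn with the required correlation structure.

```python
import numpy as np

rng = np.random.default_rng(3)

# One setting (illustrative choice): normal regressors, corr(Xj, Xk) = rho^|j-k|,
# coefficient vector beta_1 = (4, 1, 1, 0.3)', n = 100, true R^2 = 0.5.
n, rho, target_r2 = 100, 0.5, 0.5
beta = np.array([4.0, 1.0, 1.0, 0.3])
p = len(beta)
Sigma_X = np.array([[rho ** abs(j - k) for k in range(p)] for j in range(p)])

# True R^2 = evar / (evar + sigma^2) with evar = beta' Sigma_X beta,
# hence sigma^2 = evar * (1 - R^2) / R^2.
evar_full = float(beta @ Sigma_X @ beta)
sigma2 = evar_full * (1 - target_r2) / target_r2

X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
# X and y would then be handed to the LMG and PMVD estimators, and the whole
# procedure repeated (e.g., 500 times per setting) to obtain interquartile
# ranges of the allocated shares as shown in Figure 3.
print(X.shape, y.shape, round(sigma2, 2))
```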
4.1 Variability of the Estimated Shares

For moderate negative correlations, PMVD is slightly less variable than LMG in most cases. For β-vectors with some zero elements (β3, β6), PMVD variability is very low for the respective shares
(with a distinct advantage over LMG for scenarios with high R 2 ,
not shown). This is in line with PMVD's property of satisfying the exclusion criterion. Also, PMVD shows its lowest variability where true differences between coefficients are large, with some coefficients relatively close to 0, while its variability disadvantage versus LMG is larger for β-vectors with relatively similar coefficients. Overall, since variability differences in favor of PMVD
are typically much smaller than those in favor of LMG, LMG is
preferable in terms of variation.
5. DISCUSSION
This article has investigated two ways of decomposing
R 2 in linear regression, LMG and PMVD. LMG has been
reinvented numerous times by various researchers (see Section
1) and is based on the heuristic approach of averaging over
all orders. Feldman (2005) criticized LMG for violating the exclusion criterion and designed PMVD specifically to satisfy it, by employing a special set
of data-dependent weights. While Feldman saw satisfaction
of the exclusion criterion as so desirable that it was worth the
price of increased computation efforts and increased variability
of estimates, it has been pointed out in Section 2.2 of this
article that exclusion is not a desirable criterion under all
circumstances. If exclusion is considered an indispensable
criterion for an application, PMVD must be used in spite of
its larger variation and higher implementation effort. On the
other hand, if a causal interpretation of the variance allocations
is intended, LMG's equalizing behavior must be seen as a
natural result of model uncertainty, and LMG is to be preferred. Luckily, in many (not all) applications the two methods lead to similar conclusions.
Figure 3. Interquartile ranges from 500 allocated proportions of R² for LMG (thick line) and PMVD (thin line). Scenario: R² = 0.25, n = 100, normally distributed Xs. Correlations among X1, . . . , X4: corr(Xj, Xk) = ρ^|j−k| (ρ = −0.9 to 0.9 shown on the horizontal axis for each β, dashed vertical lines indicate ρ = 0).
REFERENCES
Achen, C.H. (1982), Interpreting and Using Regression, Beverly Hills, CA: Sage.
Azen, R. (2003), Dominance Analysis SAS Macros. Available online at http://www.uwm.edu/~azen/damacro.html.
Azen, R., and Budescu, D. V. (2003), "The Dominance Analysis Approach for Comparing Predictors in Multiple Regression," Psychological Methods, 8, 129-148.
Bring, J. (1996), "A Geometric Approach to Compare Variables in a Regression Model," The American Statistician, 50, 57-62.
Budescu, D. V. (1993), "Dominance Analysis: A New Approach to the Problem of Relative Importance in Multiple Regression," Psychological Bulletin, 114, 542-551.