Subset ARMA Selection Via The Adaptive Lasso: Kun Chen and Kung-Sik Chan
{y_t}, driven by an autoregressive moving-average (ARMA) model:

(1)   \sum_{j=0}^{p^*} \alpha_j^* y_{t-j} = \sum_{j=0}^{q^*} \beta_j^* \epsilon_{t-j},

where (p^*, q^*) are the AR and MA orders, the α_j^*'s and β_j^*'s are the ARMA parameters with α_0^* = β_0^* = 1, and the ε_t's are the innovations of zero mean, uncorrelated over time and of constant variance σ² > 0. Here, for simplicity, the data are mean-corrected.

* The authors gratefully thank the US National Science Foundation (NSF–0934617) for partial financial support.
† Corresponding author.

[7] showed that minimizing the approximate BIC leads to consistent estimation of the ARMA orders, under suitable regularity conditions.

Order determination is related to the more general problem of identifying the nonzero components in a subset ARMA model. A subset ARMA model is an ARMA model with a subset of its coefficients being nonzero, which is a useful and parsimonious way of modeling high-order ARMA processes, e.g. seasonal time series. For ARMA processes of high order, finding a subset ARMA model that adequately approximates the underlying process is more important from a practical standpoint than simply determining the ARMA orders. [2] demonstrated that the method of [7] for estimating the ARMA orders can be extended to solving the problem of finding an optimal subset ARMA model, in which
maximum likelihood estimation is avoided by adopting the aforementioned long AR(n) approximation. Specifically, their method consists of (i) fitting all subset regression models of y_t on its own lags 1 to p and lags 1 to q of the residuals from a long autoregression, where p and q are some known upper bounds of the true ARMA orders; and (ii) selecting an optimal subset model from the pool of all subset regression models, according to some information criterion, e.g. BIC. However, this method still relies on exhaustive subset model selection, which requires fitting a large number of subset ARMA(p, q) models (2^{p+q} of them!) and may be computationally intensive, and even impractical, when (p, q) are large.

In recent years, there has been extensive research on automatic variable selection methods via regularization, e.g. the Lasso [12, 16] and SCAD [5]. Some main advantages of these methods include computational efficiency and the capability of conducting simultaneous parameter estimation and variable selection. The Lasso method is one of the well-developed automatic model selection approaches for linear regression problems. However, the consistency of the Lasso may hold only under some conditions; see [20]. In contrast, as shown by [21] and [10], with appropriate data-driven, parameter-specific weighted regularization, the adaptive Lasso approach achieves the oracle properties, i.e. asymptotic normality and model selection consistency. More recently, regularization approaches have been applied to time series analysis, mainly for autoregressive models. For example, [17] considered shrinkage estimation of regression and autoregressive coefficients, and [8] and [14] considered penalized order selection for vector autoregressive models. However, to our knowledge, model selection methods based on regularization have not been applied to the more general ARMA model selection problem, mainly due to the difficulty that the innovations ε_t in the ARMA representation are unobservable.

Motivated by [3, 7] and [2], we propose to find an optimal subset ARMA model by fitting an adaptive Lasso regression of the time series y_t on its own lags and those of the residuals that are obtained from fitting a long autoregression to the y_t's. Besides avoiding troublesome maximum likelihood estimation of ARMA models, the proposed approach also dramatically reduces the computational cost of subset selection to the same order as that of an ordinary least squares fit. We show that under mild regularity conditions, the proposed method achieves the oracle properties, namely, it identifies the correct subset ARMA model with probability tending to one as the sample size increases to infinity, and the estimators of the nonzero coefficients are asymptotically normal with the same limiting distribution as when the zero coefficients are known a priori.

2. ADAPTIVE LASSO PROCEDURE FOR SUBSET ARMA MODEL SELECTION

Throughout this section we assume that {y_t} is generated according to model (1), and that the underlying true ARMA orders satisfy p^* ≤ p and q^* ≤ q, where p and q are known upper bounds of the true orders. Let y = (y_m, ..., y_T)^T, ε = (ε_m, ..., ε_T)^T, τ^* = (−α_1^*, ..., −α_p^*, β_1^*, ..., β_q^*)^T and

X = (x_1, \ldots, x_{p+q}) = \begin{pmatrix} y_{m-1} & \cdots & y_{m-p} & \epsilon_{m-1} & \cdots & \epsilon_{m-q} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ y_{T-1} & \cdots & y_{T-p} & \epsilon_{T-1} & \cdots & \epsilon_{T-q} \end{pmatrix}.

Then model (1) can be written in matrix form as

y = X\tau^* + \epsilon.

It is assumed that only a subset of the (structural) parameters τ_j^* (j = 1, ..., p + q) are nonzero.

Our main goal here is to identify the correct subset of nonzero components in the above subset ARMA model. It has been shown that in linear regression models, the adaptive Lasso method can achieve model selection consistency and produce asymptotically unbiased estimators for the nonzero coefficients. However, the adaptive Lasso method does not directly apply here, due to the difficulty that the design matrix X involves the latent innovation terms ε_t (t = m − 1, ..., T − 1). Motivated by [3, 7] and [2], a long AR(n) process is first fitted to the data to obtain the residuals ε̂_t, whose expression is given in (2). Let X̂ denote the approximate design matrix obtained with the entries ε_t replaced by ε̂_t (t = m − 1, ..., T − 1). We then propose to select the optimal subset ARMA model by an adaptive Lasso regression of y on X̂. The adaptive Lasso estimator of τ^* is given by

(3)   \hat{\tau}^{(T)} = \arg\min_{\tau} \Big\{ \| y - \hat{X}\tau \|^2 + \lambda_T \sum_{j=1}^{p+q} \hat{w}_j |\tau_j| \Big\},

where λ_T is the tuning parameter controlling the degree of penalization, and ŵ = (ŵ_1, ..., ŵ_{p+q})^T consists of p + q data-driven weights. (The Lasso corresponds to the case of equal weights, i.e. w_i ≡ 1.) Following [21], the weights can be chosen as

\hat{w} = |\tilde{\tau}|^{-\eta},

where τ̃ = (X̂^T X̂)^{−1} X̂^T y is the least squares estimator of τ^* based on X̂, and η is a prespecified nonnegative parameter; here, the absolute value and the power operators apply component-wise. Based on simulations and as suggested by [21], we use η = 2 in all numerical studies reported below. Note that the weights can also be constructed based on a ridge regression estimator if the sample size is small and multicollinearity is a problem. Yet another alternative approach is to construct the weights from an initial Lasso fit of y on X̂.
In summary, the proposed procedure consists of the following steps:

(i) Construct adaptive weights ŵ by least squares (alternatively, ridge or Lasso) regression of y on X̂.

(ii) Find the solution path of the adaptive Lasso regression.

(iii) The optimal λ_T is the minimizer of some criterion, e.g. the BIC.
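Continuing the sketch above, steps (i)-(iii) might be assembled as follows. The use of glmnet's penalty.factor argument for the weighted L1 penalty and the particular BIC-type formula for choosing λ_T are illustrative choices, not prescriptions taken from the paper; only the least squares initial weights with η = 2 and selection of λ_T by an information criterion come from the text.

```r
library(glmnet)

## Adaptive Lasso subset ARMA selection: a sketch of steps (i)-(iii).
## `y` and `X` are the outputs of build_design() above; eta = 2 as in the paper.
adalasso_arma <- function(y, X, eta = 2) {
  T_eff <- length(y)
  ## (i) data-driven weights from an ordinary least squares fit (assumed full rank)
  tau_tilde <- coef(lm(y ~ X - 1))
  w <- abs(tau_tilde)^(-eta)
  ## (ii) solution path of the weighted (adaptive) Lasso
  fit <- glmnet(X, y, alpha = 1, intercept = FALSE,
                standardize = FALSE, penalty.factor = w)
  ## (iii) choose lambda_T by a BIC-type criterion (an illustrative form)
  rss <- colSums((y - predict(fit, newx = X))^2)
  bic <- T_eff * log(rss / T_eff) + log(T_eff) * fit$df
  coef(fit, s = fit$lambda[which.min(bic)])
}
```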
Lemma 3.1. Under Assumptions A1–A3, almost surely,

(5)   \max_{1 \le j \le n} \Big| \frac{1}{T} \sum_{t=1}^{T} \epsilon_t y_{t-j} \Big| = O(Q(T)),

(6)   \max_{1 \le j \le n} |\hat{a}_j - a_j| = O(Q(T)).

  = \sqrt{T} \sum_{u=0}^{n} (\hat{a}_u - a_u) \frac{1}{T}\sum_{t=m}^{T} \epsilon_t y_{t-j-u}
    - \sqrt{T} \sum_{u=n+1}^{\infty} a_u \frac{1}{T}\sum_{t=m}^{T} \epsilon_t y_{t-j-u}
  = \sqrt{T} \cdot n \cdot O(Q(T)) \cdot O(Q(T)) + \sqrt{T} \cdot o(T^{-1}) \cdot O(Q(T))
  = O\Big( \frac{n \log\log T}{\sqrt{T}} \Big).

Here we have used (4) and the uniform convergence results in (5) and (6) of Lemma 3.1. Now consider (10),

  \frac{1}{\sqrt{T}} \sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t)(\hat{\epsilon}_{t-j} - \epsilon_{t-j})
  = \sqrt{T} \sum_{u=0}^{n} (\hat{a}_u - a_u) \sum_{v=0}^{n} (\hat{a}_v - a_v) \frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}
    - \sqrt{T} \sum_{u=0}^{n} (\hat{a}_u - a_u) \sum_{v=n+1}^{\infty} a_v \frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}
    - \sqrt{T} \sum_{u=n+1}^{\infty} a_u \frac{1}{T}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t) y_{t-j-u}.

By (4) and (6), it then suffices to show

  \sqrt{T} \sum_{u=0}^{n} \sum_{v=0}^{n} (\hat{a}_u - a_u)(\hat{a}_v - a_v) \frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}
  = \sqrt{T} \sum_{u=0}^{n} \sum_{v=0}^{n} (\hat{a}_u - a_u)(\hat{a}_v - a_v) \big( \gamma_{v-j-u} + O(Q(T)) \big).

Then

  \frac{1}{T}\hat{X}^{T}\hat{X}
  = \begin{pmatrix} \frac{1}{T} Y^{T}Y & \frac{1}{T} Y^{T}\hat{E} \\ \frac{1}{T}\hat{E}^{T}Y & \frac{1}{T}\hat{E}^{T}\hat{E} \end{pmatrix}
  = \begin{pmatrix} \hat{X}_{11} & \hat{X}_{12} \\ \hat{X}_{21} & \hat{X}_{22} \end{pmatrix},

where Y and Ê denote the sub-matrices of X̂ formed by the lagged responses and the lagged residuals, respectively. Consider X̂_{12} and X̂_{21}, with a typical entry given by

  \frac{1}{T}\sum_{t=m}^{T} y_{t-j}\,\hat{\epsilon}_{t-k} = E(y_{t-j}\epsilon_{t-k}) + O(Q(T)),

by Lemma 3.2. Now consider X̂_{22}, a typical entry of which is, for some j, k = 1, ..., q, equal to

  \frac{1}{T}\sum_{t=m}^{T} \hat{\epsilon}_{t-j}\,\hat{\epsilon}_{t-k}
  = E\Big( \sum_{u=0}^{n} a_u y_{t-k-u}\,\epsilon_{t-j} \Big) + O(nQ(T))
  = E(\epsilon_{t-k}\epsilon_{t-j}) + O\Big( \frac{(\log\log T)^{1/2}(\log T)^{b}}{\sqrt{T}} \Big).

Here we have used (4) and the uniform convergence results in Lemmas 3.1 and 3.2. Therefore, we have shown that (1/T)X̂^T X̂ has the same limit as (1/T)X^T X, which converges to a nonsingular constant matrix, almost surely, by ergodicity and the fact that the innovation variance is positive. This completes the proof.

Lemma 3.5. Let ε̃ = y − X̂τ^*. Under Assumptions A1–A3, X̂^T ε̃/√T has the same limiting distribution as X^T ε/√T, i.e. X̂^T ε̃/√T →_d W and X^T ε/√T →_d W, where W ∼ N(0, σ²C).

Proof. We decompose X̂^T ε̃/√T into four parts:

(11)   \frac{\hat{X}^{T}\tilde{\epsilon}}{\sqrt{T}}
  = \frac{X^{T}\epsilon}{\sqrt{T}} + \frac{(\hat{X}-X)^{T}\epsilon}{\sqrt{T}}
    - \frac{X^{T}(\hat{X}-X)\tau^*}{\sqrt{T}} - \frac{(\hat{X}-X)^{T}(\hat{X}-X)\tau^*}{\sqrt{T}}.

It suffices to show that the second, third and fourth terms on the right side of (11) are o(1), which follows easily from Lemma 3.3. This completes the proof.
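As a quick check of the rate manipulations used in the proofs above: the displayed bounds are consistent with taking Q(T) = {(log log T)/T}^{1/2} and a long-AR order n growing like (log T)^b. The exact definitions of Q(T), n and condition (4) appear in a part of the paper not reproduced here, so the following is only a hedged verification under that assumption.

```latex
% Assuming Q(T) = \{(\log\log T)/T\}^{1/2} and n = O((\log T)^{b}) (not stated
% explicitly in this excerpt), the two rates used above follow by direct algebra:
\sqrt{T}\cdot n\cdot O\!\big(Q(T)\big)\cdot O\!\big(Q(T)\big)
   \;=\; \sqrt{T}\cdot n\cdot O\!\Big(\frac{\log\log T}{T}\Big)
   \;=\; O\!\Big(\frac{n\log\log T}{\sqrt{T}}\Big),
\qquad
O\!\big(nQ(T)\big)
   \;=\; O\!\Big(\frac{n\,(\log\log T)^{1/2}}{\sqrt{T}}\Big)
   \;=\; O\!\Big(\frac{(\log\log T)^{1/2}(\log T)^{b}}{\sqrt{T}}\Big).
```

Both rates vanish as T → ∞, matching the displayed bounds.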
Lemma 3.6. Recall that τ̃ = (X̂^T X̂)^{−1} X̂^T y is the least squares estimator of τ^* based on X̂. Under Assumptions A1–A3,

  \sqrt{T}(\tilde{\tau} - \tau^*) \to_d N(0, \sigma^2 C^{-1}).

Proof. We decompose √T(τ̃ − τ^*) as follows:

  \sqrt{T}(\tilde{\tau} - \tau^*)
  = \Big( \frac{1}{T}\hat{X}^{T}\hat{X} \Big)^{-1}
    \Big[ \frac{1}{\sqrt{T}} X^{T}\epsilon + \frac{1}{\sqrt{T}}(\hat{X}-X)^{T}\epsilon
      - \frac{1}{\sqrt{T}} X^{T}(\hat{X}-X)\tau^* - \frac{1}{\sqrt{T}}(\hat{X}-X)^{T}(\hat{X}-X)\tau^* \Big].

By Lemma 3.3 and following the same argument used in proving Lemma 3.5, √T(τ̃ − τ^*) = ((1/T)X̂^T X̂)^{−1} (1/√T) X^T ε + o(1). The claimed limiting distribution then follows from Lemmas 3.4 and 3.5.

We first define some notation. Let A = {j : τ_j^* ≠ 0} and A^c = {j : τ_j^* = 0}. Similarly, let Â_T = {j : τ̂_j^{(T)} ≠ 0} and Â_T^c = {j : τ̂_j^{(T)} = 0}. Suppose Z is an m × n matrix, and A and B are subsets of the collection of row and column indices of Z, respectively. We let Z_{AB} denote the sub-matrix of Z whose rows and columns are chosen from Z according to the index sets A and B, respectively. For simplicity, we may write Z_{AA} = Z_A when Z is a square matrix, Z_{AB} = Z_{·B} (Z_{A·}) when A (B) consists of all the row (column) indices, and Z_{A·} = Z_A when Z is a vector.

Theorem 3.7 (Oracle Properties). Suppose A1–A3 hold, and assume λ_T T^{η/2}/√T → ∞ and λ_T/√T → 0. Then

(i) Asymptotic normality: √T(τ̂_A^{(T)} − τ^*_A) →_d N(0, σ² C_A^{−1}) as T → ∞.

(ii) Selection consistency: lim_{T→∞} P(Â_T = A) = 1.

Proof. The proof is similar in structure to the proof of the main result in [21]. Let τ = τ^* + u/√T,

  \Psi_T(u) = \Big\| y - \hat{X}\Big(\tau^* + \frac{u}{\sqrt{T}}\Big) \Big\|^2
              + \lambda_T \sum_{j=1}^{p+q} \hat{w}_j \Big| \tau_j^* + \frac{u_j}{\sqrt{T}} \Big|,

and û^{(T)} = arg min_u Ψ_T(u). Then û^{(T)} = √T(τ̂^{(T)} − τ^*). Let V_T(u) = Ψ_T(u) − Ψ_T(0). Then we have

  V_T(u) = u^{T}\Big( \frac{1}{T}\hat{X}^{T}\hat{X} \Big) u
           - 2 u^{T} \frac{\hat{X}^{T}\tilde{\epsilon}}{\sqrt{T}}
           + \frac{\lambda_T}{\sqrt{T}} \sum_{j=1}^{p+q} \hat{w}_j \sqrt{T}\Big( \Big| \tau_j^* + \frac{u_j}{\sqrt{T}} \Big| - |\tau_j^*| \Big).

By Lemmas 3.4–3.6 and following [21], we have V_T(u) →_d V(u) for every u, where

  V(u) = \begin{cases} u_A^{T} C_A u_A - 2 u_A^{T} W_A & \text{if } u_j = 0 \ \forall j \notin A, \\ \infty & \text{otherwise.} \end{cases}

V(u) is convex and has a unique minimum. Following [12], we have

(12)   \hat{u}_A^{(T)} \to_d C_A^{-1} W_A \quad \text{and} \quad \hat{u}_{A^c}^{(T)} \to_d 0.

Finally, upon recalling W_A ∼ N(0, σ²C_A), the asymptotic normality result follows.

Next, we show the consistency part. For all j ∈ A, the asymptotic normality result indicates that τ̂_j^{(T)} →_p τ_j^*; it follows that P(j ∈ Â_T) → 1. It thus suffices to show that for all j ∉ A, P(j ∈ Â_T) → 0. Consider the event that j ∉ A and j ∈ Â_T. By the Karush–Kuhn–Tucker (KKT) optimality conditions, we have

(13)   \frac{2\hat{x}_j^{T}(y - \hat{X}\hat{\tau}^{(T)})}{\sqrt{T}} = \frac{\lambda_T \hat{w}_j}{\sqrt{T}}.

Note that λ_T ŵ_j/√T = (λ_T/√T) T^{η/2} |√T τ̃_j|^{−η} → ∞. Consider the left side of (13),

  \frac{2\hat{x}_j^{T}(y - \hat{X}\hat{\tau}^{(T)})}{\sqrt{T}}
  = \frac{2\hat{x}_j^{T}\tilde{\epsilon}}{\sqrt{T}} + \frac{2\hat{x}_j^{T}\hat{X}}{T}\,\sqrt{T}(\tau^* - \hat{\tau}^{(T)}).

By Lemma 3.5, 2x̂_j^T ε̃/√T = O_p(1). By Lemma 3.4 and (12), (2x̂_j^T X̂/T) √T(τ^* − τ̂^{(T)}) = O_p(1). Thus

  P(j \in \hat{A}_T) \le P\Big( \frac{2\hat{x}_j^{T}(y - \hat{X}\hat{\tau}^{(T)})}{\sqrt{T}} = \frac{\lambda_T \hat{w}_j}{\sqrt{T}} \Big) \to 0.

This completes the proof.

4. EMPIRICAL PERFORMANCE

We study the empirical performance of the proposed subset model selection method by simulations. Four Gaussian ARMA models are considered:

Model I: (1 − 0.8B)(1 − 0.7B^6) y_t = ε_t;

Model II: (1 − 0.8B)(1 − 0.7B^6) y_t = (1 + 0.8B)(1 + 0.7B^6) ε_t;
Model III: y_t = (1 + 0.8B)(1 + 0.7B^6) ε_t;

Model IV: y_t = (1 − 0.6B − 0.8B^{12}) ε_t,

where B is the backshift operator, so that B^k y_t = y_{t−k}, and the {ε_t} are independent standard normal random variables.
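For concreteness, the four designs can be simulated in R with arima.sim by expanding the multiplicative polynomials by hand; the sketch below (using T = 240, one of the sample sizes studied in the paper) is illustrative and not the authors' code.

```r
## Simulating Models I-IV; the coefficient vectors are the expanded products
## of the multiplicative polynomials stated above.
set.seed(1)
T_len <- 240

ar_reg_seas <- c(0.8, 0, 0, 0, 0, 0.7, -0.56)  # (1-0.8B)(1-0.7B^6) = 1-0.8B-0.7B^6+0.56B^7
ma_reg_seas <- c(0.8, 0, 0, 0, 0, 0.7,  0.56)  # (1+0.8B)(1+0.7B^6) = 1+0.8B+0.7B^6+0.56B^7

y_I   <- arima.sim(list(ar = ar_reg_seas),                   n = T_len)
y_II  <- arima.sim(list(ar = ar_reg_seas, ma = ma_reg_seas), n = T_len)
y_III <- arima.sim(list(ma = ma_reg_seas),                   n = T_len)
y_IV  <- arima.sim(list(ma = c(-0.6, rep(0, 10), -0.8)),     n = T_len)  # 1 - 0.6B - 0.8B^12
```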
The first three models are multiplicative seasonal models with seasonal period 6, whereas the last model is a non-multiplicative seasonal model with seasonal period 12. For the long autoregressive fits, the AR order was chosen by AIC, with the maximum order set to be 10 log10(T). (We have also experimented with fixing the long AR order at the preceding maximum order, but obtained similar results.) As mentioned earlier, for the proposed adaptive Lasso method, there are several ways of determining the weights. (Following [21], the power η in the weights was set to 2 in all experiments.) The simplest method is ordinary least squares regression (LS), but LS suffers from large variability in the case of a low signal-to-noise ratio. In many applications involving, say, monthly data, the sample size may range from 120 (10 years) to 360 (30 years). In our simulation experiments for Models I to III, we set the sample size to be either 120, 240 or 360, with the maximum AR and MA lags both equal to 14. Consequently, the number of data points per parameter is at most slightly higher than 10 for these sample sizes. For such cases, LS is very variable, and the simulation results reported below show that the adaptive Lasso, with weights determined by LS, performed poorly, except for Model I; see Table 1.

The poor performance of the LS-weighted adaptive Lasso may be partly attributed to multicollinearity, which may be somewhat alleviated by using ridge regression, i.e. by minimizing the penalized sum of squares with the penalty equal to the product of a (non-negative) tuning parameter times the L2-norm of the regression coefficients. The tuning parameter may be determined by minimizing the generalized cross-validation (GCV) criterion; see [18]. In our simulations, we implemented the ridge regression via the magic function of the mgcv library [18] in the R platform [13]. Yet another method is to use the Lasso to derive the initial weights. The Lasso is known to be consistent under some regularity conditions; see [20]. We have experimented with both AIC and BIC in determining the Lasso tuning parameter. For determining the initial weights, AIC and BIC yielded similar results (unreported); hence, we only report results with the initial weights determined by (i) Lasso with the tuning parameter obtained by minimizing BIC, (ii) ridge regression with the tuning parameter obtained by minimizing GCV, and (iii) LS.

Table 1 shows the model selection results of the adaptive Lasso method under the three weighting schemes and different sample sizes. Each experiment was replicated 1,000 times. For each experiment, Table 1 provides 4 statistics: (i) the relative frequency of including all significant variables, (ii) the relative frequency of picking the true model,
selected model with the selected variables shaded in dark, and the models are ordered from top to bottom according to their BICs. According to Figure 1, the optimal subset model contains lags 1 and 12 of the response variable and lags 9, 11 and 12 of the errors (residuals from the long autoregression, being proxies for the latent innovations). The presence of the error lags 11 and 12 suggests a multiplicative model. Indeed, lag 1 of the error process appears in some of the selected models shown in Figure 1. The presence of the error lag 9 is harder to interpret. Nevertheless, we fitted a subset ARIMA(1,1,9)×(1,1,1)_{12} model with the coefficients of error lags 2 to 8 fixed at zero. The model fit (unreported) suggested that the regular AR(1) and seasonal AR(1) coefficients are non-significant, so these are dropped from a second fitted model:

  (1 - B)(1 - B^{12})\, y_t = (1 - 0.64_{(0.09)} B - 0.26_{(0.08)} B^{9})(1 - 0.81_{(0.1)} B^{12})\, \hat{\epsilon}_t,

where the numbers in parentheses are the standard errors. This fitted model appears to fit the data well, as the residuals were found to be approximately white.
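The constrained refit described above can be reproduced along the following lines in R; `y` is a placeholder for the series analysed in the paper (not included in this excerpt), and the call is a sketch of one way to fix the intermediate MA coefficients at zero, not the authors' exact code.

```r
## Subset ARIMA(0,1,9)x(0,1,1)_12 fit with MA lags 2-8 constrained to zero,
## i.e. (1-B)(1-B^12) y_t = (1 + theta_1 B + theta_9 B^9)(1 + Theta_1 B^12) e_t.
## `y` is a placeholder for the monthly series used in the paper.
fit <- arima(y,
             order          = c(0, 1, 9),
             seasonal       = list(order = c(0, 1, 1), period = 12),
             fixed          = c(NA, rep(0, 7), NA, NA),  # ma1 free, ma2-ma8 = 0, ma9 and sma1 free
             transform.pars = FALSE)
fit
Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 3)  # residual whiteness check
```

The `fixed` argument pins selected coefficients at zero while leaving the NA entries free; note that R's arima() writes the MA polynomials in the (1 + θB) convention, so the reported coefficients would appear with the signs shown in the fitted equation above.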
6. CONCLUSION

The numerical studies reported in the preceding two sections illustrate the efficacy of the proposed subset ARMA selection method. There are other time series modeling tasks that require ARMA order selection, e.g. VARMA models and transfer-function (dynamic regression) models. It is interesting to extend the proposed method to these settings. While we have derived the oracle properties of the proposed method using LS-based weights, it is worthwhile to investigate the asymptotics for the case of Lasso-based weights, especially in view of the much better empirical properties of the adaptive Lasso selection method using Lasso-based weights. Given the fact that the coefficients in an ARMA model naturally form an AR group and an MA group, it would be interesting to explore bi-level selection penalty forms such as the group bridge penalty [11].

While we focus on subset ARMA models, a similar problem occurs in nonparametric stochastic regression models [19]. A challenging problem consists of lifting some of the automatic model selection methods to the nonparametric setting; see [9] for some recent works in the additive framework.

Received 9 October 2010

REFERENCES

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 716–723. MR0423716
[2] Cryer, J. D. and Chan, K.-S. (2008). Time Series Analysis: With Applications in R, 2nd ed. Springer, New York.
[3] Durbin, J. (1960). The fitting of time series models. Int. Statist. Rev. 28 233–244.
[4] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32(2) 407–499. MR2060166