
Statistics and Its Interface Volume 4 (2011) 197–205

Subset ARMA selection via the adaptive Lasso


Kun Chen∗ and Kung-Sik Chan∗†

Model selection is a critical aspect of subset autoregressive moving-average (ARMA) modelling. This is commonly done by subset selection methods, which may be computationally intensive and even impractical when the true ARMA orders of the underlying model are high. On the other hand, automatic variable selection methods based on regularization do not directly apply to this problem because the innovation process is latent. To solve this problem, we propose to identify the optimal subset ARMA model by fitting an adaptive Lasso regression of the time series on its lags and the lags of the residuals from a long autoregression fitted to the time series data, where the residuals serve as proxies for the innovations. We show that, under some mild regularity conditions, the proposed method enjoys the oracle properties, so that it identifies the correct subset model with probability approaching 1 as the sample size increases, and the estimators of the nonzero coefficients are asymptotically normal, with the same limiting distribution as when the true zero coefficients are known a priori. We illustrate the new method with simulations and a real application.

AMS 2000 subject classifications: Primary 62M10; secondary 62J12.

Keywords and phrases: Least squares regression, Oracle properties, Ridge regression, Seasonal ARIMA models, Sparsity.

* The authors gratefully thank the US National Science Foundation (NSF-0934617) for partial financial support.
† Corresponding author.

1. INTRODUCTION

Consider a discrete-time, stationary and ergodic process, $\{y_t\}$, driven by an autoregressive moving-average (ARMA) model:

(1)   $\sum_{j=0}^{p^*} \alpha_j^* y_{t-j} = \sum_{j=0}^{q^*} \beta_j^* \epsilon_{t-j},$

where $(p^*, q^*)$ are the AR and MA orders, the $\alpha_j^*$'s and $\beta_j^*$'s are the ARMA parameters with $\alpha_0^* = \beta_0^* = 1$, and the $\epsilon_t$'s are the innovations, of zero mean, uncorrelated over time and of constant variance $\sigma^2 > 0$. Here, for simplicity, the data are assumed to be mean-corrected.

In fitting an ARMA model, besides estimating the structural parameters $\alpha_j^*$ ($j = 1, \ldots, p^*$), $\beta_j^*$ ($j = 1, \ldots, q^*$) and the variance $\sigma^2$, the AR order $p$ and the MA order $q$ must also be determined from the observations $y_t$ ($t = 1, \ldots, T$). A commonly used approach to determining the ARMA orders is to select the model that minimizes some information criterion, e.g. AIC [1] or BIC [15]. Such methods generally require carrying out maximum likelihood estimation for a large number of ARMA models of different orders. However, maximum likelihood estimation of an ARMA model is prone to numerical problems, owing to multimodality of the likelihood function and to overfitting when the ARMA orders exceed their true values.

[7] proposed an interesting and practical solution to the order determination problem. First, a long AR(n) model is fitted to the data, with the residuals then serving as proxies for the unobserved innovations $\epsilon_t$ [3]:

(2)   $\hat{\epsilon}_t = \sum_{j=0}^{n} \hat{a}_j y_{t-j}, \quad \hat{a}_0 = 1, \quad t = n+1, \ldots, T,$

where the $\hat{a}_j$ ($j = 1, \ldots, n$) are the autoregressive coefficients estimated by solving the Yule-Walker equations (or by least squares). The AR order $n$ can be determined by minimizing the AIC criterion $\log\hat{\sigma}_n^2 + 2n/T$, where $\hat{\sigma}_n^2$ is the corresponding estimator of the innovation variance. In the second step, the ARMA parameters for various $(p, q)$ orders are estimated by regressing $y_t$ on $y_{t-j}$, for $j = 1, \ldots, p$, and $\hat{\epsilon}_{t-j}$, for $j = 1, \ldots, q$, where $t = m, \ldots, T$ with $m = n + \max(p, q) + 1$; the innovation variance of the model with ARMA orders $(p, q)$ is then estimated by the residual mean square error, denoted by $\tilde{\sigma}_{p,q}^2$. The corresponding BIC values are approximated by $\log\tilde{\sigma}_{p,q}^2 + (p+q)\log T/T$. [7] showed that minimizing this approximate BIC leads to consistent estimation of the ARMA orders, under suitable regularity conditions.
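As a concrete illustration of this two-step scheme, here is a minimal R sketch (R is also the platform used later in the paper). The simulated series, the choice of maximum AR order and the candidate orders (p, q) = (1, 1) are illustrative assumptions, not part of the original derivation.

    set.seed(1)
    T <- 240
    y <- as.numeric(arima.sim(list(ar = 0.8, ma = 0.7), n = T))  # toy ARMA(1,1) series

    ## Step 1: long AR(n) fit, with n chosen by minimizing AIC (Yule-Walker estimation)
    fit.ar <- ar.yw(y, order.max = floor(10 * log10(T)))
    n      <- fit.ar$order
    e.hat  <- as.numeric(fit.ar$resid)        # residuals = proxies for the innovations, eq. (2)

    ## Step 2: for candidate orders (p, q), regress y_t on its own lags and lagged residuals
    p <- 1; q <- 1
    m   <- n + max(p, q) + 1
    idx <- m:T
    Ylag <- sapply(1:p, function(j) y[idx - j])
    Elag <- sapply(1:q, function(j) e.hat[idx - j])
    fit.lm <- lm(y[idx] ~ cbind(Ylag, Elag) - 1)

    ## Approximate BIC of the candidate ARMA(p, q) model
    sigma2.tilde <- mean(residuals(fit.lm)^2)
    bic.pq <- log(sigma2.tilde) + (p + q) * log(T) / T

Minimizing bic.pq over a grid of (p, q) values then reproduces the order-selection step described above.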
p∗ q∗ Order determination is related to the more general prob-
(1) αj∗ yt−j = βj∗ t−j , lem of identifying the nonzero components in a subset
j=0 j=0 ARMA model. A subset ARMA model is an ARMA model
with a subset of its coefficients being nonzero, which is a use-
where (p∗ , q ∗ ) are the AR and MA orders, αj∗ s and βj∗ s are ful and parsimonious way for modeling high-order ARMA
the ARMA parameters with α0∗ = β0∗ = 1, and the t s are processes, e.g. seasonal time series. For ARMA process of
the innovations of zero mean, uncorrelated over time and of high orders, finding a subset ARMA model that adequately
constant variance σ 2 > 0. Here for simplicity the data are approximates the underlying process is more important from
mean corrected. a practical standpoint than simply determining the ARMA
∗ The authors gratefully thank the US National Science Foundation orders. [2] demonstrated that the method of [7] for estimat-
(NSF–0934617) for partial financial support. ing the ARMA orders can be extended to solving the prob-
† Corresponding author. lem of finding an optimal subset ARMA model, in which
maximum likelihood estimations are avoided by adopting 2. ADAPTIVE LASSO PROCEDURE FOR
the aforementioned long AR(n) approximation. Specifically, SUBSET ARMA MODEL SELECTION
their method consists of (i) fitting all subset regression mod-
Throughout this section we assume that {yt } is generated
els of yt on its own lags 1 to p and lags 1 to q of the residuals
according to model (1), and the underlying true ARMA or-
from a long autoregression, where p and q are some known ders p∗ ≤ p and q ∗ ≤ q, where p, q are known upper bounds
upper bounds of the true ARMA orders; and (ii) selecting of the true orders. Let y = (ym , . . . , yT )T ,  = (m , . . . , T )T ,
an optimal subset model from the pool of all subset regres- τ ∗ = (−α1∗ , . . . , −αp∗ , β1∗ , . . . , βq∗ )T and
sion models, according to some information criterion, e.g.
BIC. However, this method still relies on exhaustive sub- X = (x1 , . . . , xp+q )
set model selection which requires fitting a large number ⎛ ⎞
ym−1 · · · ym−p m−1 ··· m−q
of subset ARMA(p, q) models (2p+q of them!), which may ⎜ .. .. .. .. .. .. ⎟ .
=⎝ . . . . . . ⎠
be computational intensive and even impractical when (p, q)
are large. yT −1 · · · yT −p T −1 ··· T −q
In recent years, there has been extensive research on automatic variable selection methods via regularization, e.g. the Lasso [12, 16] and SCAD [5]. The main advantages of these methods include computational efficiency and the capability of conducting parameter estimation and variable selection simultaneously. The Lasso is one of the well-developed automatic model selection approaches for linear regression problems. However, the consistency of the Lasso may hold only under certain conditions; see [20]. In contrast, as shown by [21] and [10], with appropriate data-driven, parameter-specific weighted regularization, the adaptive Lasso achieves the oracle properties, i.e. asymptotic normality and model selection consistency. More recently, regularization approaches have been applied to time series analysis, mainly for autoregressive models. For example, [17] considered shrinkage estimation of regression and autoregressive coefficients, and [8] and [14] considered penalized order selection for vector autoregressive models. However, to our knowledge, model selection methods based on regularization have not been applied to the more general ARMA model selection problem, mainly owing to the difficulty that the innovations $\epsilon_t$ in the ARMA representation are unobservable.

Motivated by [3, 7] and [2], we propose to find an optimal subset ARMA model by fitting an adaptive Lasso regression of the time series $y_t$ on its own lags and on those of the residuals obtained from fitting a long autoregression to the $y_t$'s. Besides avoiding troublesome maximum likelihood estimation of ARMA models, the proposed approach dramatically reduces the computational cost of subset selection, to the same order as that of an ordinary least squares fit. We show that, under mild regularity conditions, the proposed method achieves the oracle properties: it identifies the correct subset ARMA model with probability tending to one as the sample size increases to infinity, and the estimators of the nonzero coefficients are asymptotically normal, with the same limiting distribution as when the zero coefficients are known a priori.

2. ADAPTIVE LASSO PROCEDURE FOR SUBSET ARMA MODEL SELECTION

Throughout this section we assume that $\{y_t\}$ is generated according to model (1) and that the true ARMA orders satisfy $p^* \le p$ and $q^* \le q$, where $p, q$ are known upper bounds of the true orders. Let $\mathbf{y} = (y_m, \ldots, y_T)^T$, $\boldsymbol{\epsilon} = (\epsilon_m, \ldots, \epsilon_T)^T$, $\boldsymbol{\tau}^* = (-\alpha_1^*, \ldots, -\alpha_p^*, \beta_1^*, \ldots, \beta_q^*)^T$ and

$\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_{p+q}) = \begin{pmatrix} y_{m-1} & \cdots & y_{m-p} & \epsilon_{m-1} & \cdots & \epsilon_{m-q} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ y_{T-1} & \cdots & y_{T-p} & \epsilon_{T-1} & \cdots & \epsilon_{T-q} \end{pmatrix}.$

Then model (1) can be written in matrix form as

$\mathbf{y} = \mathbf{X}\boldsymbol{\tau}^* + \boldsymbol{\epsilon}.$

It is assumed that only a subset of the (structural) parameters $\tau_j^*$ ($j = 1, \ldots, p+q$) are nonzero.
Our main goal here is to identify the correct subset of nonzero components in the above subset ARMA model. It has been shown that, in linear regression models, the adaptive Lasso method can achieve model selection consistency and produce asymptotically unbiased estimators of the nonzero coefficients. However, the adaptive Lasso method does not directly apply here, owing to the difficulty that the design matrix $\mathbf{X}$ involves the latent innovation terms $\epsilon_t$ ($t = m-1, \ldots, T-1$). Motivated by [3, 7] and [2], a long AR(n) process is first fitted to the data to obtain the residuals $\hat{\epsilon}_t$, whose expression is given in (2). Let $\hat{\mathbf{X}}$ denote the approximate design matrix obtained by replacing the entries $\epsilon_t$ ($t = m-1, \ldots, T-1$) by $\hat{\epsilon}_t$. We then propose to select the optimal subset ARMA model by an adaptive Lasso regression of $\mathbf{y}$ on $\hat{\mathbf{X}}$. The adaptive Lasso estimator of $\boldsymbol{\tau}^*$ is given by

(3)   $\hat{\boldsymbol{\tau}}^{(T)} = \arg\min_{\boldsymbol{\tau}} \Big\{ \|\mathbf{y} - \hat{\mathbf{X}}\boldsymbol{\tau}\|^2 + \lambda_T \sum_{j=1}^{p+q} \hat{w}_j |\tau_j| \Big\},$

where $\lambda_T$ is the tuning parameter controlling the degree of penalization, and $\hat{\mathbf{w}} = (\hat{w}_1, \ldots, \hat{w}_{p+q})^T$ consists of $p+q$ data-driven weights. (The Lasso corresponds to the case of equal weights, i.e. $w_j \equiv 1$.) Following [21], the weights can be chosen as

$\hat{\mathbf{w}} = |\tilde{\boldsymbol{\tau}}|^{-\eta},$

where $\tilde{\boldsymbol{\tau}} = (\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^T\mathbf{y}$ is the least squares estimator of $\boldsymbol{\tau}^*$ based on $\hat{\mathbf{X}}$, and $\eta$ is a prespecified nonnegative parameter; here, the absolute value and the power operators apply component-wise. Based on simulations and as suggested by [21], we use $\eta = 2$ in all numerical studies reported below. Note that the weights can also be constructed from a ridge regression estimator if the sample size is small and multicollinearity is a problem. Yet another alternative for deriving the weights is to use the Lasso estimator, which is consistent under some conditions.
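To illustrate (3), the following self-contained R sketch builds the approximate design matrix and computes the adaptive Lasso path. The glmnet package is used here as a convenient solver, with its penalty.factor argument supplying the parameter-specific weights; the simulated data, the upper bounds p = q = 14 and the use of glmnet (rather than a LARS-type algorithm) are our illustrative choices, not prescriptions from the paper.

    library(glmnet)

    set.seed(1)
    T <- 360
    y <- as.numeric(arima.sim(list(ar = 0.8, ma = 0.7), n = T))   # toy data

    ## Long AR(n) fit and residual proxies for the innovations, as in (2)
    fit.ar <- ar.yw(y, order.max = floor(10 * log10(T)))
    n      <- fit.ar$order
    e.hat  <- as.numeric(fit.ar$resid)

    ## Approximate design matrix Xhat: lags 1..p of y and lags 1..q of the residuals
    p <- 14; q <- 14
    m    <- n + max(p, q) + 1
    idx  <- m:T
    Xhat <- cbind(sapply(1:p, function(j) y[idx - j]),
                  sapply(1:q, function(j) e.hat[idx - j]))
    colnames(Xhat) <- c(paste0("y.lag", 1:p), paste0("e.lag", 1:q))
    yy <- y[idx]

    ## Least-squares-based weights with eta = 2
    tau.tilde <- solve(crossprod(Xhat), crossprod(Xhat, yy))
    w.hat     <- as.numeric(abs(tau.tilde))^(-2)

    ## Adaptive Lasso solution path: weighted L1 penalty via penalty.factor
    fit.alasso <- glmnet(Xhat, yy, alpha = 1, standardize = FALSE,
                         penalty.factor = w.hat)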
The adaptive Lasso is essentially a weighted $L_1$ regularization method. Its loss function is convex, and the entire solution path over $\lambda_T$ can be computed efficiently by a modified LARS algorithm [4], at the same order of computational cost as an ordinary least squares fit. Hence we omit the details of the computational algorithm. An unbiased estimator of the degrees of freedom of a Lasso model is the number of its nonzero coefficients [22], which can be used to construct information criteria for selecting $\lambda_T$. After the solution path has been found, we consider both the AIC and BIC criteria for determining the optimal $\lambda_T$.
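Continuing the glmnet-based sketch above (which produced yy, Xhat and fit.alasso), one simple way to implement this is to take the number of nonzero coefficients as the degrees of freedom and evaluate an approximate BIC at every point of the solution path; the exact form of the criterion below is one reasonable variant rather than the paper's specific implementation.

    ## BIC along the adaptive Lasso path, using df = number of nonzero coefficients
    pred <- predict(fit.alasso, newx = Xhat)          # fitted values, one column per lambda
    rss  <- colSums((yy - pred)^2)
    Tn   <- length(yy)
    bic  <- Tn * log(rss / Tn) + fit.alasso$df * log(Tn)

    lambda.opt <- fit.alasso$lambda[which.min(bic)]
    tau.hat    <- coef(fit.alasso, s = lambda.opt)    # coefficients of the selected subset model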
The complete model selection strategy proposed is as follows:

I. Fit a long AR(n) model to obtain residuals that serve as proxies for the innovations, as given in (2). The AR order n can be determined by minimizing the AIC criterion, as previously described.

II. Fit an adaptive Lasso regression of the time series $\mathbf{y}$ on $\hat{\mathbf{X}}$, as described in (3):
   (i) construct the adaptive weights $\hat{\mathbf{w}}$ by a least squares (alternatively, ridge or Lasso) regression of $\mathbf{y}$ on $\hat{\mathbf{X}}$;
   (ii) find the solution path of the adaptive Lasso regression;
   (iii) choose the optimal $\lambda_T$ as the minimizer of some criterion such as AIC or BIC.

III. (Optional) Carry out maximum likelihood estimation and model diagnostics for the selected subset ARMA model(s) chosen by some information criterion, e.g. BIC; a sketch of this refitting step is given below.
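One way to carry out Step III in R is via the arima function, whose fixed argument holds the de-selected coefficients at zero while the remaining ones are estimated by maximum likelihood. The sketch below is purely hypothetical: it pretends the adaptive Lasso retained AR lags 1 and 12 and MA lag 1 of a simulated series.

    set.seed(2)
    x <- arima.sim(list(ar = c(0.5, rep(0, 10), 0.3), ma = 0.4), n = 300)

    ## Refit the selected subset ARMA(12, 1) model by maximum likelihood:
    ## NA entries are estimated, 0 entries are held fixed at zero.
    fixed  <- c(NA, rep(0, 10), NA,   # AR lags 1..12 (only lags 1 and 12 free)
                NA)                   # MA lag 1
    fit.ml <- arima(x, order = c(12, 0, 1), include.mean = FALSE,
                    fixed = fixed, transform.pars = FALSE)
    fit.ml
    tsdiag(fit.ml)                    # residual diagnostics for the refitted model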
3. ASYMPTOTIC PROPERTIES OF THE ADAPTIVE LASSO ESTIMATOR

3.1 Assumptions and preliminary results

In this section, we introduce the main assumptions and some useful results from [7].

A1. For the true system (1), the polynomials
$\alpha^*(z) = \sum_{j=0}^{p^*} \alpha_j^* z^j, \qquad \beta^*(z) = \sum_{j=0}^{q^*} \beta_j^* z^j, \qquad \alpha_0^* = \beta_0^* = 1,$
are coprime, i.e. have no common factors, and $\alpha^*(z) \neq 0$, $\beta^*(z) \neq 0$ for $|z| \le 1$.

A2. Let $\mathcal{A}_t$ be the $\sigma$-algebra of events determined by $\epsilon_s$ ($s \le t$). We assume
$E(\epsilon_t \mid \mathcal{A}_{t-1}) = 0, \qquad E(\epsilon_t^4) < \infty, \qquad E(\epsilon_t^2 \mid \mathcal{A}_{t-1}) = \sigma^2.$

A3. Assume $n$ increases monotonically to infinity at a rate $c\log T \le n \le (\log T)^b$, where $c \ge (2\log\rho_0)^{-1}$, $\rho_0$ is the modulus of a zero of $\beta^*(z)$ nearest to $|z| = 1$, and $b$ is some constant with $1 < b < \infty$.

Assumption A1 ensures that the true model is stationary and ergodic, and that $p^*$ and $q^*$ are the true model orders. Assumption A2 implies that the innovations form a martingale-difference sequence and hence are uncorrelated over time; they also have identical variance. Furthermore, the best linear predictor of $y_t$ is then also the best predictor in the least squares sense, and A2 ensures that $\frac{1}{T}\sum_{t=1}^{T}\epsilon_t^2 - \sigma^2$ converges to zero at a sufficiently rapid rate [7]. Assumption A3 imposes that the order of the long autoregression increases at a rate not slower than $c\log T$.

Let $\epsilon_t = \sum_{j=0}^{\infty} a_j y_{t-j}$ be the AR($\infty$) representation of model (1). Note that $a_j$ decreases at a geometric rate and hence

(4)   $\sum_{j=n}^{\infty} |a_j| = o(T^{-1}).$

The following lemmas are given either explicitly or implicitly by [7] and [6].

Lemma 3.1. Under Assumptions A1-A3, almost surely,

(5)   $\max_{1\le j\le n} \Big| \frac{1}{T}\sum_{t=1}^{T} \epsilon_t y_{t-j} \Big| = O(Q(T)),$

(6)   $\max_{1\le j\le n} |\hat{a}_j - a_j| = O(Q(T)),$

where $Q(T) = (\log\log T/T)^{1/2}$. Moreover, letting $c_t = \frac{1}{T}\sum_{s=1}^{T-t} y_s y_{s+t}$, then

(7)   $\max_{0\le t\le n} |c_t - \gamma_t| = O(Q(T)),$

where $\gamma_t = E(y_s y_{s+t})$.

Lemma 3.2. Under Assumptions A1-A3, almost surely,

(8)   $\frac{1}{T}\sum_{t=m}^{T} y_{t-j}\hat{\epsilon}_{t-k} = E(y_{t-j}\epsilon_{t-k}) + O(Q(T)),$

uniformly for $j = 1, \ldots, p$ and $k = 1, \ldots, q$.

3.2 Oracle properties

We first prove some results related to the adopted AR(n) approximation. We then prove our main results, which show that the adaptive Lasso estimator enjoys the oracle properties, i.e. asymptotic normality and model selection consistency. In our proofs, we restrict attention to the case in which the weights are derived from the least squares regression. The proofs can be readily extended to ridge-regression-based weights, with appropriate conditions on the tuning parameter of the ridge regression. However, the case of Lasso-based weights requires further study.

Lemma 3.3. Under Assumptions A1-A3, almost surely,


(9)   $\frac{1}{\sqrt{T}}\sum_{t=m}^{T} \epsilon_t(\hat{\epsilon}_{t-j} - \epsilon_{t-j}) = O\Big(\frac{n\log\log T}{\sqrt{T}}\Big),$

(10)   $\frac{1}{\sqrt{T}}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t)(\hat{\epsilon}_{t-j} - \epsilon_{t-j}) = O\Big(\frac{n^2\log\log T}{\sqrt{T}}\Big).$

Proof.

$\frac{1}{\sqrt{T}}\sum_{t=m}^{T} \epsilon_t(\hat{\epsilon}_{t-j} - \epsilon_{t-j}) = \frac{1}{\sqrt{T}}\sum_{t=m}^{T} \epsilon_t \sum_{u=0}^{n} (\hat{a}_u - a_u) y_{t-j-u} - \frac{1}{\sqrt{T}}\sum_{t=m}^{T} \epsilon_t \sum_{u=n+1}^{\infty} a_u y_{t-j-u}$

$= \sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\Big(\frac{1}{T}\sum_{t=m}^{T} \epsilon_t y_{t-j-u}\Big) - \sqrt{T}\sum_{u=n+1}^{\infty} a_u\Big(\frac{1}{T}\sum_{t=m}^{T} \epsilon_t y_{t-j-u}\Big)$

$= \sqrt{T}\cdot n\cdot O(Q(T))\cdot O(Q(T)) + \sqrt{T}\cdot o(T^{-1})\cdot O(Q(T)) = O\Big(\frac{n\log\log T}{\sqrt{T}}\Big).$

Here we have used (4) and the uniform convergence results (5) and (6) of Lemma 3.1. Now consider (10):

$\frac{1}{\sqrt{T}}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t)(\hat{\epsilon}_{t-j} - \epsilon_{t-j}) = \sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\Big(\frac{1}{T}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t) y_{t-j-u}\Big) - \sqrt{T}\sum_{u=n+1}^{\infty} a_u\Big(\frac{1}{T}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t) y_{t-j-u}\Big)$

$= \sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\sum_{v=0}^{n} (\hat{a}_v - a_v)\Big(\frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}\Big) - \sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\sum_{v=n+1}^{\infty} a_v\Big(\frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}\Big) - \sqrt{T}\sum_{u=n+1}^{\infty} a_u\Big(\frac{1}{T}\sum_{t=m}^{T} (\hat{\epsilon}_t - \epsilon_t) y_{t-j-u}\Big).$

By (4) and (6), it then suffices to show that

$\sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\sum_{v=0}^{n} (\hat{a}_v - a_v)\Big(\frac{1}{T}\sum_{t=m}^{T} y_{t-v} y_{t-j-u}\Big) = \sqrt{T}\sum_{u=0}^{n} (\hat{a}_u - a_u)\sum_{v=0}^{n} (\hat{a}_v - a_v)\big(\gamma_{v-j-u} + O(Q(T))\big)$

$= \sqrt{T}\cdot n\cdot O(Q(T))\cdot n\cdot O(Q(T)) = O\Big(\frac{n^2\log\log T}{\sqrt{T}}\Big).$

Here we have used the uniform convergence results in Lemmas 3.1 and 3.2. This completes the proof.

Lemma 3.4. Under Assumptions A1-A3, $\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}} \to \mathbf{C}$ almost surely, where $\mathbf{C}$ is a nonsingular constant matrix.

Proof. Write $\hat{\mathbf{X}} = (\mathbf{Y}, \hat{\mathbf{E}})$, where

$\mathbf{Y} = \begin{pmatrix} y_{m-1} & \cdots & y_{m-p} \\ \vdots & \ddots & \vdots \\ y_{T-1} & \cdots & y_{T-p} \end{pmatrix}, \qquad \hat{\mathbf{E}} = \begin{pmatrix} \hat{\epsilon}_{m-1} & \cdots & \hat{\epsilon}_{m-q} \\ \vdots & \ddots & \vdots \\ \hat{\epsilon}_{T-1} & \cdots & \hat{\epsilon}_{T-q} \end{pmatrix}.$

Then

$\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}} = \begin{pmatrix} \frac{1}{T}\mathbf{Y}^T\mathbf{Y} & \frac{1}{T}\mathbf{Y}^T\hat{\mathbf{E}} \\ \frac{1}{T}\hat{\mathbf{E}}^T\mathbf{Y} & \frac{1}{T}\hat{\mathbf{E}}^T\hat{\mathbf{E}} \end{pmatrix} = \begin{pmatrix} \mathbf{X}_{11} & \hat{\mathbf{X}}_{12} \\ \hat{\mathbf{X}}_{21} & \hat{\mathbf{X}}_{22} \end{pmatrix}.$

Consider $\hat{\mathbf{X}}_{12}$ and $\hat{\mathbf{X}}_{21}$, a typical entry of which is $\frac{1}{T}\sum_{t=m}^{T} y_{t-j}\hat{\epsilon}_{t-k} = E(y_{t-j}\epsilon_{t-k}) + O(Q(T))$ by Lemma 3.2. Now consider $\hat{\mathbf{X}}_{22}$, a typical entry of which is, for some $j, k = 1, \ldots, q$, equal to

$\frac{1}{T}\sum_{t=m}^{T} \hat{\epsilon}_{t-j}\hat{\epsilon}_{t-k} = \sum_{u=0}^{n} \hat{a}_u\Big(\frac{1}{T}\sum_{t=m}^{T} \hat{\epsilon}_{t-j} y_{t-k-u}\Big) = \sum_{u=0}^{n} \big(a_u + O(Q(T))\big)\big\{E(y_{t-k-u}\epsilon_{t-j}) + O(Q(T))\big\}$

$= E\Big(\sum_{u=0}^{n} a_u y_{t-k-u}\,\epsilon_{t-j}\Big) + O(nQ(T)) = E(\epsilon_{t-k}\epsilon_{t-j}) + O\Big(\frac{(\log\log T)^{1/2}(\log T)^{b}}{\sqrt{T}}\Big).$

Here we have used (4) and the uniform convergence results in Lemmas 3.1 and 3.2. Therefore, $\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}}$ has the same limit as $\frac{1}{T}\mathbf{X}^T\mathbf{X}$, which converges almost surely to a nonsingular constant matrix by ergodicity and the fact that the innovation variance is positive. This completes the proof.

Lemma 3.5. Let $\tilde{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{X}}\boldsymbol{\tau}^*$. Under Assumptions A1-A3, $\hat{\mathbf{X}}^T\tilde{\boldsymbol{\epsilon}}/\sqrt{T}$ has the same limiting distribution as $\mathbf{X}^T\boldsymbol{\epsilon}/\sqrt{T}$, i.e. $\hat{\mathbf{X}}^T\tilde{\boldsymbol{\epsilon}}/\sqrt{T} \to_d \mathbf{W}$ and $\mathbf{X}^T\boldsymbol{\epsilon}/\sqrt{T} \to_d \mathbf{W}$, where $\mathbf{W} \sim N(\mathbf{0}, \sigma^2\mathbf{C})$.

Proof. We decompose $\hat{\mathbf{X}}^T\tilde{\boldsymbol{\epsilon}}/\sqrt{T}$ into four parts:


(11)   $\frac{\hat{\mathbf{X}}^T\tilde{\boldsymbol{\epsilon}}}{\sqrt{T}} = \frac{\mathbf{X}^T\boldsymbol{\epsilon}}{\sqrt{T}} + \frac{(\hat{\mathbf{X}} - \mathbf{X})^T\boldsymbol{\epsilon}}{\sqrt{T}} - \frac{\mathbf{X}^T(\hat{\mathbf{X}} - \mathbf{X})\boldsymbol{\tau}^*}{\sqrt{T}} - \frac{(\hat{\mathbf{X}} - \mathbf{X})^T(\hat{\mathbf{X}} - \mathbf{X})\boldsymbol{\tau}^*}{\sqrt{T}}.$

It suffices to show that the second, third and fourth terms on the right side of (11) are $o(1)$, which follows easily from Lemma 3.3. This completes the proof.

Lemma 3.6. Recall that $\tilde{\boldsymbol{\tau}} = (\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^T\mathbf{y}$ is the least squares estimator of $\boldsymbol{\tau}^*$ based on $\hat{\mathbf{X}}$. Under Assumptions A1-A3, $\sqrt{T}(\tilde{\boldsymbol{\tau}} - \boldsymbol{\tau}^*) \to_d N(\mathbf{0}, \sigma^2\mathbf{C}^{-1})$.

Proof. We decompose $\sqrt{T}(\tilde{\boldsymbol{\tau}} - \boldsymbol{\tau}^*)$ as follows:

$\sqrt{T}(\tilde{\boldsymbol{\tau}} - \boldsymbol{\tau}^*) = \Big(\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}}\Big)^{-1}\Big[\frac{1}{\sqrt{T}}\mathbf{X}^T\boldsymbol{\epsilon} + \frac{1}{\sqrt{T}}(\hat{\mathbf{X}} - \mathbf{X})^T\boldsymbol{\epsilon} - \frac{1}{\sqrt{T}}\mathbf{X}^T(\hat{\mathbf{X}} - \mathbf{X})\boldsymbol{\tau}^* - \frac{1}{\sqrt{T}}(\hat{\mathbf{X}} - \mathbf{X})^T(\hat{\mathbf{X}} - \mathbf{X})\boldsymbol{\tau}^*\Big].$

By Lemma 3.3 and following the same argument as in the proof of Lemma 3.5, $\sqrt{T}(\tilde{\boldsymbol{\tau}} - \boldsymbol{\tau}^*) = (\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}})^{-1}\frac{1}{\sqrt{T}}\mathbf{X}^T\boldsymbol{\epsilon} + o(1)$. The claimed limiting distribution then follows from Lemmas 3.4 and 3.5.

We first define some notation. Let $A = \{j : \tau_j^* \neq 0\}$ and $A^c = \{j : \tau_j^* = 0\}$. Similarly, let $\hat{A}_T = \{j : \hat{\tau}_j^{(T)} \neq 0\}$ and $\hat{A}_T^c = \{j : \hat{\tau}_j^{(T)} = 0\}$. Suppose $\mathbf{Z}$ is an $m \times n$ matrix, and $A$ and $B$ are subsets of the collection of row and column indices of $\mathbf{Z}$, respectively. We let $\mathbf{Z}_{AB}$ denote the sub-matrix of $\mathbf{Z}$ whose rows and columns are chosen from $\mathbf{Z}$ according to the index sets $A$ and $B$, respectively. For simplicity, we may write $\mathbf{Z}_{AA} = \mathbf{Z}_A$ when $\mathbf{Z}$ is a square matrix, $\mathbf{Z}_{AB} = \mathbf{Z}_{\cdot B}$ ($\mathbf{Z}_{A\cdot}$) when $A$ ($B$) consists of all the row (column) indices, and $\mathbf{Z}_{A\cdot} = \mathbf{Z}_A$ when $\mathbf{Z}$ is a vector.

Theorem 3.7 (Oracle properties). Suppose A1-A3 hold, and assume $\frac{\lambda_T}{\sqrt{T}}T^{\eta/2} \to \infty$ and $\lambda_T/\sqrt{T} \to 0$. Then

(i) Asymptotic normality: $\sqrt{T}(\hat{\boldsymbol{\tau}}_A^{(T)} - \boldsymbol{\tau}_A^*) \to_d N(\mathbf{0}, \sigma^2\mathbf{C}_A^{-1})$ as $T \to \infty$.

(ii) Selection consistency: $\lim_{T\to\infty} P(\hat{A}_T = A) = 1$.

Proof. The proof is similar in structure to the proof of the main result in [21]. Let $\boldsymbol{\tau} = \boldsymbol{\tau}^* + \frac{\mathbf{u}}{\sqrt{T}}$, $\Psi_T(\mathbf{u}) = \|\mathbf{y} - \hat{\mathbf{X}}(\boldsymbol{\tau}^* + \frac{\mathbf{u}}{\sqrt{T}})\|^2 + \lambda_T\sum_{j=1}^{p+q}\hat{w}_j|\tau_j^* + \frac{u_j}{\sqrt{T}}|$, and $\hat{\mathbf{u}}^{(T)} = \arg\min\Psi_T(\mathbf{u})$. Then $\hat{\mathbf{u}}^{(T)} = \sqrt{T}(\hat{\boldsymbol{\tau}}^{(T)} - \boldsymbol{\tau}^*)$. Let $V_T(\mathbf{u}) = \Psi_T(\mathbf{u}) - \Psi_T(\mathbf{0})$. Then we have

$V_T(\mathbf{u}) = \mathbf{u}^T\Big(\frac{1}{T}\hat{\mathbf{X}}^T\hat{\mathbf{X}}\Big)\mathbf{u} - 2\frac{\tilde{\boldsymbol{\epsilon}}^T\hat{\mathbf{X}}}{\sqrt{T}}\mathbf{u} + \frac{\lambda_T}{\sqrt{T}}\sum_{j=1}^{p+q}\hat{w}_j\sqrt{T}\Big(\big|\tau_j^* + \frac{u_j}{\sqrt{T}}\big| - |\tau_j^*|\Big).$

By Lemmas 3.4-3.6 and following [21], we have $V_T(\mathbf{u}) \to_d V(\mathbf{u})$ for every $\mathbf{u}$, where

$V(\mathbf{u}) = \begin{cases} \mathbf{u}_A^T\mathbf{C}_A\mathbf{u}_A - 2\mathbf{u}_A^T\mathbf{W}_A & \text{if } u_j = 0 \; \forall j \notin A, \\ \infty & \text{otherwise.} \end{cases}$

$V(\mathbf{u})$ is convex and has a unique minimum. Following [12], we have

(12)   $\hat{\mathbf{u}}_A^{(T)} \to_d \mathbf{C}_A^{-1}\mathbf{W}_A \quad \text{and} \quad \hat{\mathbf{u}}_{A^c}^{(T)} \to_d \mathbf{0}.$

Finally, upon recalling that $\mathbf{W}_A \sim N(\mathbf{0}, \sigma^2\mathbf{C}_A)$, the asymptotic normality result follows.

Next, we show the consistency part. For every $j \in A$, the asymptotic normality result implies that $\hat{\tau}_j \to_p \tau_j^*$; it follows that $P(j \in \hat{A}_T) \to 1$. It thus suffices to show that, for every $j \notin A$, $P(j \in \hat{A}_T) \to 0$. Consider the event $\{j \notin A \text{ and } j \in \hat{A}_T\}$. By the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

(13)   $\frac{2\hat{\mathbf{x}}_j^T(\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\tau}}^{(T)})}{\sqrt{T}} = \frac{\lambda_T\hat{w}_j}{\sqrt{T}}.$

Note that $\frac{\lambda_T\hat{w}_j}{\sqrt{T}} = \frac{\lambda_T}{\sqrt{T}}T^{\eta/2}|\sqrt{T}\tilde{\tau}_j|^{-\eta} \to \infty$. Consider the left side of (13),

$\frac{2\hat{\mathbf{x}}_j^T(\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\tau}}^{(T)})}{\sqrt{T}} = \frac{2\hat{\mathbf{x}}_j^T\tilde{\boldsymbol{\epsilon}}}{\sqrt{T}} + \frac{2\hat{\mathbf{x}}_j^T\hat{\mathbf{X}}}{T}\sqrt{T}(\boldsymbol{\tau}^* - \hat{\boldsymbol{\tau}}^{(T)}).$

By Lemma 3.5, $\frac{2\hat{\mathbf{x}}_j^T\tilde{\boldsymbol{\epsilon}}}{\sqrt{T}} = O_p(1)$. By Lemma 3.4 and (12), $\frac{2\hat{\mathbf{x}}_j^T\hat{\mathbf{X}}}{T}\sqrt{T}(\boldsymbol{\tau}^* - \hat{\boldsymbol{\tau}}^{(T)}) = O_p(1)$. Thus

$P(j \in \hat{A}_T) \le P\Big(\frac{2\hat{\mathbf{x}}_j^T(\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\tau}}^{(T)})}{\sqrt{T}} = \frac{\lambda_T\hat{w}_j}{\sqrt{T}}\Big) \to 0.$

This completes the proof.

4. EMPIRICAL PERFORMANCE

We study the empirical performance of the proposed subset model selection method by simulations. Four Gaussian ARMA models are considered:

Model I: $(1 - 0.8B)(1 - 0.7B^6)y_t = \epsilon_t$;

Model II: $(1 - 0.8B)(1 - 0.7B^6)y_t = (1 + 0.8B)(1 + 0.7B^6)\epsilon_t$;


Model III: $y_t = (1 + 0.8B)(1 + 0.7B^6)\epsilon_t$;

Model IV: $y_t = (1 - 0.6B - 0.8B^{12})\epsilon_t$,

where $B$ is the backshift operator, so that $B^k y_t = y_{t-k}$, and the $\epsilon_t$'s are independent standard normal random variables.
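For readers who wish to reproduce such experiments, these models can be simulated in R by expanding the seasonal polynomial products into ordinary AR/MA coefficient vectors for arima.sim; the sketch below does this for Models II and IV. Recall that arima.sim writes the AR polynomial as 1 - phi_1 B - ... and the MA polynomial as 1 + theta_1 B + ..., which determines the signs used here.

    ## Model II: (1 - 0.8B)(1 - 0.7B^6) y_t = (1 + 0.8B)(1 + 0.7B^6) e_t
    ##   AR polynomial: 1 - 0.8B - 0.7B^6 + 0.56B^7
    ##   MA polynomial: 1 + 0.8B + 0.7B^6 + 0.56B^7
    ar2 <- c(0.8, rep(0, 4), 0.7, -0.56)
    ma2 <- c(0.8, rep(0, 4), 0.7,  0.56)
    set.seed(3)
    y2 <- arima.sim(model = list(ar = ar2, ma = ma2), n = 240)

    ## Model IV: y_t = (1 - 0.6B - 0.8B^12) e_t, a non-multiplicative seasonal MA
    ma4 <- c(-0.6, rep(0, 10), -0.8)
    y4  <- arima.sim(model = list(ma = ma4), n = 360)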
The first three models are multiplicative seasonal models with seasonal period 6, whereas the last model is a non-multiplicative seasonal model with seasonal period 12. For the long autoregressive fits, the AR order was chosen by AIC, with the maximum order set to $10\log_{10}(T)$. (We have also experimented with fixing the long AR order at this maximum order, but obtained similar results.) As mentioned earlier, there are several ways of determining the weights for the proposed adaptive Lasso method. (Following [21], the power $\eta$ in the weights was set to 2 in all experiments.) The simplest method is ordinary least squares regression (LS), but LS suffers from large variability when the signal-to-noise ratio is low. In many applications involving, say, monthly data, the sample size may range from 120 (10 years) to 360 (30 years). In our simulation experiments for Models I to III, we set the sample size to 120, 240 or 360, with the maximum AR and MA lags both equal to 14. Consequently, the number of observations per parameter is at most slightly higher than 10 for these sample sizes. In such cases LS is very variable, and the simulation results reported below show that the adaptive Lasso with weights determined by LS performed poorly, except for Model I; see Table 1.

The poor performance of the LS-weighted adaptive Lasso may be partly attributed to multicollinearity, which may be somewhat alleviated by using ridge regression, i.e. by minimizing the penalized sum of squares with the penalty equal to the product of a (non-negative) tuning parameter and the $L_2$-norm of the regression coefficients. The tuning parameter may be determined by minimizing the generalized cross-validation criterion (GCV); see [18]. In our simulations, we implemented the ridge regression via the magic function of the mgcv library [18] in the R platform [13]. Yet another method is to use the Lasso to derive the initial weights; the Lasso is known to be consistent under some regularity conditions, see [20]. We have experimented with both AIC and BIC for determining the Lasso tuning parameter. For the initial weights, AIC and BIC yielded similar results (unreported); hence, we only report results with the initial weights determined by (i) the Lasso with the tuning parameter obtained by minimizing BIC, (ii) ridge regression with the tuning parameter obtained by minimizing GCV, and (iii) LS.
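As a hedged sketch of the ridge-based weighting scheme, the fragment below computes the ridge estimator directly and picks its tuning parameter by minimizing GCV over a grid, instead of calling mgcv::magic as done in the paper; it reuses the Xhat and yy objects from the earlier glmnet sketch, and the grid of candidate tuning values is an arbitrary illustrative choice.

    ## Ridge-based initial weights: minimize ||yy - Xhat b||^2 + k ||b||^2,
    ## with k chosen by generalized cross-validation (GCV) over a grid.
    ridge.gcv <- function(X, y, k.grid = 10^seq(-4, 4, length.out = 50)) {
      XtX <- crossprod(X); Xty <- crossprod(X, y); n <- length(y)
      gcv <- sapply(k.grid, function(k) {
        H   <- X %*% solve(XtX + k * diag(ncol(X)), t(X))   # ridge hat matrix
        rss <- sum((y - H %*% y)^2)
        n * rss / (n - sum(diag(H)))^2                      # GCV score
      })
      k.opt <- k.grid[which.min(gcv)]
      drop(solve(XtX + k.opt * diag(ncol(X)), Xty))         # ridge coefficient estimate
    }

    tau.ridge <- ridge.gcv(Xhat, yy)
    w.ridge   <- abs(tau.ridge)^(-2)    # eta = 2, as in the simulations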


Table 1. Summary statistics of a Monte Carlo study of the empirical performance of the adaptive Lasso subset model selection method for Models I-III. Numbers in the columns with the heading "A" report the relative frequencies of picking all significant variables; those under the heading "T" are the relative frequencies of picking the correct model; those under the heading "-" report the false negative rates; those under the heading "+" report the false positive rates. Numbers under the heading "N" are the sample sizes. All experiments were replicated 1,000 times.

    Model I
                  AIC                                         BIC
                 Lasso                   Lasso                   Ridge                    LS
      N     A    T    -    +        A    T    -    +        A    T    -    +        A    T    -    +
     120   0.76 0.19 0.09 0.07     0.75 0.44 0.09 0.04     0.24 0.00 0.26 0.15     0.75 0.19 0.13 0.22
     240   0.98 0.30 0.01 0.06     0.98 0.80 0.01 0.01     0.44 0.01 0.19 0.20     0.85 0.40 0.07 0.18
     360   1.00 0.32 0.00 0.06     1.00 0.87 0.00 0.01     0.54 0.01 0.15 0.20     0.87 0.49 0.07 0.17

    Model II
                  AIC                                         BIC
                 Lasso                   Lasso                   Ridge                    LS
      N     A    T    -    +        A    T    -    +        A    T    -    +        A    T    -    +
     120   0.40 0.01 0.12 0.20     0.32 0.04 0.15 0.14     0.26 0.00 0.18 0.22     0.02 0.00 0.38 0.45
     240   0.81 0.03 0.04 0.21     0.76 0.15 0.05 0.13     0.36 0.01 0.18 0.30     0.05 0.00 0.35 0.47
     360   0.92 0.04 0.02 0.21     0.92 0.24 0.02 0.10     0.36 0.01 0.20 0.36     0.04 0.00 0.36 0.48

    Model III
                  AIC                                         BIC
                 Lasso                   Lasso                   Ridge                    LS
      N     A    T    -    +        A    T    -    +        A    T    -    +        A    T    -    +
     120   0.03 0.00 0.54 0.18     0.03 0.00 0.58 0.13     0.04 0.00 0.47 0.16     0.05 0.00 0.75 0.29
     240   0.17 0.00 0.36 0.21     0.14 0.01 0.43 0.14     0.10 0.00 0.37 0.20     0.07 0.00 0.61 0.36
     360   0.38 0.00 0.25 0.23     0.34 0.02 0.31 0.15     0.16 0.00 0.33 0.23     0.06 0.00 0.62 0.36

Table 1 shows the model selection results of the adaptive Lasso method under the three weighting schemes and different sample sizes. Each experiment was replicated 1,000 times. For each experiment, Table 1 provides four statistics: (i) the relative frequency of including all significant variables, (ii) the relative frequency of picking the true model, (iii) the false negative rate and (iv) the false positive rate. For Models I to III, the maximum AR lag is 14 and so is the maximum MA lag. Recall that the tuning parameter of the adaptive Lasso may be determined by either AIC or BIC. Columns 2-5 of Table 1 report the simulation results for the adaptive Lasso with the weights determined by the Lasso and the tuning parameter of the adaptive Lasso determined by AIC, whereas columns 6-9 report those when the tuning parameter of the adaptive Lasso was determined by BIC. Comparison between the two schemes shows that BIC and AIC performed similarly, with BIC having a higher chance of picking the true model, lower false positive rates and slightly higher false negative rates; AIC tends to have a higher chance of including all significant variables, at the expense of selecting more complex models than BIC. The proposed method performed very well for the pure AR model, less so for the mixed ARMA model, and somewhat poorly for the pure MA model specified by Model III. Columns 10-13 and columns 14-17 of Table 1 display the results for the proposed adaptive Lasso method with the weighting scheme given by ridge regression and LS, respectively. These results show that the adaptive Lasso with an LS-based weighting scheme generally performed quite poorly except for the AR example; using weights based on ridge regression alleviated the problem somewhat, but it was still outperformed by the adaptive Lasso with Lasso-based weights.

Given that it is difficult to identify an MA model, we also experimented with searching among subset MA models. In Table 2, we report results for another three sets of experiments, with models selected by the adaptive Lasso with Lasso-based weights. In the first set of experiments, we repeated the experiments with Model III, but with the model selection confined to (subset) MA models, with the maximum MA lag equal to 14. We show the results with the tuning parameter of the adaptive Lasso determined by either AIC or BIC. Again, AIC and BIC performed similarly, and the proposed method now works extremely well when the search is restricted to MA models. This result suggests that, for data analysis, it may be prudent to apply the proposed method with the search among pure AR models, then among mixed ARMA models and finally among pure MA models. One can then explore further with the optimal models from these three model selection exercises, in order to arrive at an "optimal" model for the data at hand.

Table 2. Summary statistics of a Monte Carlo study of the empirical performance of the adaptive Lasso subset model selection method for Models III-IV. Numbers in the columns with the heading "A" report the relative frequencies of picking all significant variables; those under the heading "T" are the relative frequencies of picking the correct model; those under the heading "-" report the false negative rates; those under the heading "+" report the false positive rates. Numbers under the heading "N" are the sample sizes. All experiments were replicated 1,000 times.

    Model III (selection confined to MA models)
                  AIC                      BIC
      N     A    T    -    +        A    T    -    +
     120   0.90 0.25 0.03 0.17     0.87 0.38 0.05 0.10
     240   1.00 0.28 0.00 0.18     1.00 0.51 0.00 0.09
     360   1.00 0.30 0.00 0.17     1.00 0.51 0.00 0.08

    Model III
                  AIC                      BIC
      N     A    T    -    +        A    T    -    +
     480   0.54 0.00 0.18 0.24     0.48 0.04 0.24 0.15
     600   0.64 0.00 0.14 0.25     0.59 0.06 0.18 0.15
     720   0.69 0.01 0.12 0.26     0.66 0.10 0.15 0.14

    Model IV
                  AIC                      BIC
      N     A    T    -    +        A    T    -    +
     240   0.55 0.04 0.26 0.06     0.49 0.08 0.29 0.04
     360   0.79 0.03 0.10 0.07     0.74 0.12 0.13 0.05

Table 2 also reports the results for Model III with larger sample sizes, namely 480, 600 and 720. The results show that the rates of false negatives and false positives decline steadily with increasing sample size, and that the rate of including all significant variables increases steadily with sample size as well.

Finally, Table 2 reports the results for Model IV, with the maximum AR and MA lags equal to 26 and the sample size equal to 240 or 360. While this is also a pure MA model, the performance of the proposed Lasso-weighted adaptive Lasso method is comparable to that for Model III with sample sizes 600-720; the better performance may be due to the larger gap between the two MA coefficients in Model IV, which induces weaker autocorrelations in the data. The limited Monte Carlo experiments reported here suggest that the proposed method is a promising new tool for identifying sparse stationary ARMA models, especially if the sparsity pattern contains wide gaps in the ARMA coefficients. In practice, real data may be non-stationary, and they must be transformed to stationarity before applying the proposed adaptive Lasso subset model selection method.

5. A REAL APPLICATION

As an example, we consider a time series of monthly CO2 levels from a monitoring site at Alert, Northwest Territories, Canada. (The dataset is contained in the TSA library in R.) The dataset was earlier analyzed by [2, Chapter 12], who fitted a period-12 seasonal ARIMA$(0,1,1)\times(0,1,1)_{12}$ model; in particular, the series is nonstationary. Here, we apply regular differencing and period-12 seasonal differencing to the data before carrying out the Lasso-weighted adaptive Lasso model selection. In practice, it is better to report a few of the optimal models selected by the adaptive Lasso.


Figure 1. Best five subset models selected for the regularly and seasonally differenced CO2 series.

Figure 1 shows the best five models selected by the adaptive Lasso with its tuning parameter determined by BIC; each row corresponds to one selected model, with the selected variables shaded in dark, and the models are ordered from top to bottom according to their BIC values. According to Figure 1, the optimal subset model contains lags 1 and 12 of the response variable and lags 9, 11 and 12 of the errors (the residuals from the long autoregression, which serve as proxies for the latent innovations). The presence of the error lags 11 and 12 suggests a multiplicative model. Indeed, lag 1 of the error process appears in some of the selected models shown in Figure 1. The presence of the error lag 9 is harder to interpret. Nevertheless, we fitted a subset ARIMA$(1,1,9)\times(1,1,1)_{12}$ model with the coefficients of error lags 2 to 8 fixed at zero. The model fit (unreported) suggested that the regular AR(1) and seasonal AR(1) coefficients are non-significant, so these were dropped from a second fitted model:

$(1 - B)(1 - B^{12})y_t = (1 - 0.64_{(0.09)}B - 0.26_{(0.08)}B^9)(1 - 0.81_{(0.1)}B^{12})\hat{\epsilon}_t,$

where the numbers in parentheses are the standard errors. This fitted model appears to fit the data well, as its residuals were found to be approximately white.
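The final fit can be reproduced, up to sign conventions, with a sketch along the following lines; it assumes the co2 series from the TSA package (as referenced above) and uses stats::arima with the fixed argument to hold MA lags 2 to 8 at zero. Note that arima parameterizes the MA polynomial with plus signs, so the reported coefficients appear with signs opposite to those displayed above.

    library(TSA)       # provides the monthly Alert CO2 series used in this section
    data(co2)

    ## Subset ARIMA model suggested by the adaptive Lasso selection:
    ## (1 - B)(1 - B^12) y_t = (1 - theta_1 B - theta_9 B^9)(1 - Theta_1 B^12) e_t
    ## Only MA lags 1 and 9 and the seasonal MA lag are estimated; MA lags 2-8 are fixed at 0.
    fit <- stats::arima(co2, order = c(0, 1, 9),
                        seasonal = list(order = c(0, 1, 1), period = 12),
                        fixed = c(NA, rep(0, 7), NA, NA))
    fit
    tsdiag(fit)        # residual diagnostics; residuals should resemble white noise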
6. CONCLUSION

The numerical studies reported in the preceding two sections illustrate the efficacy of the proposed subset ARMA selection method. There are other time series modeling tasks that require ARMA order selection, e.g. VARMA models and transfer-function (dynamic regression) models, and it would be interesting to extend the proposed method to these settings. While we have derived the oracle properties of the proposed method using LS-based weights, it is worthwhile to investigate the asymptotics for the case of Lasso-based weights, especially in view of the much better empirical performance of the adaptive Lasso selection method with Lasso-based weights. Given that the coefficients of an ARMA model naturally form an AR group and an MA group, it would also be interesting to explore bi-level selection penalties such as the group bridge penalty [11].

While we focus on subset ARMA models, a similar problem occurs in nonparametric stochastic regression models [19]. A challenging problem consists of lifting some of the automatic model selection methods to the nonparametric setting; see [9] for some recent work in the additive framework.

Received 9 October 2010


REFERENCES

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 716-723. MR0423716
[2] Cryer, J. D. and Chan, K.-S. (2008). Time Series Analysis: With Applications in R, 2nd ed. Springer, New York.
[3] Durbin, J. (1960). The fitting of time series models. International Statistical Review 28 233-244.
[4] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32(2) 407-499. MR2060166
[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348-1360. MR1946581
[6] Hannan, E. J. and Kavalieris, L. (1984). A method for autoregressive-moving average estimation. Biometrika 71 273-280. MR0767155
[7] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving average order. Biometrika 69 81-94. MR0655673
[8] Hsu, N. J., Hung, H. L. and Chang, Y. M. (2008). Subset selection for vector autoregressive processes using Lasso. Computational Statistics & Data Analysis 52 3645-3657. MR2427370
[9] Huang, J., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics 38 2283-2313. MR2676890
[10] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica 18 1603-1618. MR2469326
[11] Huang, J., Ma, S., Xie, H. and Zhang, C.-H. (2009). A group bridge approach for variable selection. Biometrika 96 339-355. MR2507147
[12] Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics 28 1356-1378. MR1805787
[13] R Development Core Team (2008). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
[14] Ren, Y. and Zhang, X. (2010). Subset selection for vector autoregressive processes via adaptive Lasso. Statistics & Probability Letters 80 1705-1712.
[15] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 461-464. MR0468014
[16] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58 267-288. MR1379242
[17] Wang, H., Li, G. and Tsai, C. (2007). Regression coefficient and autoregressive order shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 69 63-78. MR2301500
[18] Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall, London. MR2206355
[19] Yao, Q. and Tong, H. (1994). On subset selection in non-parametric stochastic regression. Statistica Sinica 4 51-70. MR1282865
[20] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7 2541-2563. MR2274449
[21] Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 1418-1429. MR2279469
[22] Zou, H., Hastie, T. and Tibshirani, R. (2007). On the degrees of freedom of the Lasso. The Annals of Statistics 35 2173-2192. MR2363967

Kun Chen
Dept. of Statistics and Actuarial Science
University of Iowa
Iowa City, Iowa 52242
USA
E-mail address: kun-chen@uiowa.edu

Kung-Sik Chan
Dept. of Statistics and Actuarial Science
University of Iowa
Iowa City, Iowa 52242
USA
E-mail address: kung-sik-chan@uiowa.edu
